Tariq Abusheikh B.S., Bioinformatics

tanktherapist, Biotechnology

23 Oct 2013



Research Goals


Background information:


Protein Function


Phylogenetic Profiles


BLAST, mpiBLAST


Jaccard Dissimilarity


Hierarchical Clustering


Process


Results & Future Work


Build phylogenetic profile-analysis pipeline based on the method of Shawn Cokus et al. 2007


Develop using Java and R


Validate code using published data



Apply method to profiles of the C. reinhardtii proteome


Proteins are essential to all cellular processes


Determining protein function leads to better understanding of cellular processes








Functionally related proteins:

Include proteins that combine to form a complex or function together in a pathway



Record the occurrence of protein homologs across genomes







Proteins with similar profiles:


Members of a protein complex


Participate in same metabolic pathways


Each protein sequence in a genome is compared in a pairwise manner to each protein from a chosen set of reference species


Each phylogenetic profile is a vector of 0s and 1s


0 = protein absent in reference genome


1 = protein present in reference genome
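
The 0/1 encoding above can be sketched in a few lines. This is a minimal illustration, not the pipeline from the slides (which was developed in Java and R): the E-value cutoff of 1e-5, the input dictionary layout, and the names `build_profile`, `best_hits`, and `genomes` are all assumptions made for the example.

```python
# Assumed cutoff: a best hit counts as "present" if its E-value is at or below this.
# The slides do not state the threshold that was actually used.
E_VALUE_CUTOFF = 1e-5


def build_profile(best_hits, genomes, cutoff=E_VALUE_CUTOFF):
    """Return a 0/1 vector with one entry per reference genome.

    best_hits maps genome name -> best (lowest) E-value found for the protein.
    A genome missing from best_hits is treated as "no hit", i.e. absent.
    """
    return [1 if best_hits.get(g, float("inf")) <= cutoff else 0 for g in genomes]


# Hypothetical reference genomes and best-hit E-values, for illustration only.
genomes = ["species_1", "species_2", "species_3"]
best_hits_by_protein = {
    "protein_A": {"species_1": 1e-40, "species_3": 2e-8},
    "protein_B": {"species_1": 0.5, "species_2": 1e-12},
}

profiles = {
    protein: build_profile(hits, genomes)
    for protein, hits in best_hits_by_protein.items()
}
```

Here `protein_A` has significant hits in the first and third genomes only, so its profile marks those two positions with 1 and the second with 0.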

[Figure: each Chlamy protein (A, B, C) is compared with the proteins of Species 1 (X, Y, Z); the best match generates the profile score]


BLAST (Basic Local Alignment Search Tool)

Compares amino acid sequences against a sequence database

Generates an E-value, which scores the significance of the match


mpiBLAST



parallel implementation of BLAST


Database fragmentation



Query segmentation



Increased per-query throughput



Improved query response time



Instead of taking days, comparison takes just a few hours
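
The idea behind query segmentation can be shown with a small sketch. This is not mpiBLAST's API or its actual scheduling logic, which mpiBLAST handles internally; the function name `segment_queries` and the round-robin assignment are assumptions chosen only to illustrate how a query set might be split across workers.

```python
def segment_queries(queries, n_workers):
    """Deal queries round-robin into n_workers batches of near-equal size."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    batches = [[] for _ in range(n_workers)]
    for index, query in enumerate(queries):
        batches[index % n_workers].append(query)
    return batches


# Hypothetical query identifiers split across four workers (n1 to n4).
queries = [f"protein_{i}" for i in range(10)]
batches = segment_queries(queries, 4)
batch_sizes = [len(batch) for batch in batches]
```

Each worker then searches only its own batch, which is where the reduction in wall-clock time comes from.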

[Figure: queries and the distributed database (reference species) are spread across nodes n1 to n4, which produce the final report]



      A    B
p1    1    0
p2    1    1
p3    0    1
p4    0    0
p5    1    1
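
As a worked example, the Jaccard dissimilarity between columns A and B over rows p1 to p5 can be computed as one minus the ratio of shared presences to total presences. The function name and the convention of returning 0.0 for two all-zero vectors are assumptions of this sketch.

```python
def jaccard_dissimilarity(u, v):
    """1 - |positions where both are 1| / |positions where either is 1|."""
    both = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    if either == 0:
        return 0.0  # assumed convention when neither vector has any 1s
    return 1.0 - both / either


# Columns A and B from the example, in row order p1..p5.
A = [1, 1, 0, 0, 1]
B = [0, 1, 1, 0, 1]

# Shared presences: p2, p5 (2). Presences in either: p1, p2, p3, p5 (4).
d_ab = jaccard_dissimilarity(A, B)  # 1 - 2/4 = 0.5
```

Note that p4, which is absent from both, does not contribute to either count; this is what distinguishes Jaccard from a simple matching coefficient.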


Assigning objects to groups based on ‘similarity’ or ‘distance’

Types:

Hierarchical (bottom up): Agglomerative, Concept

Partitional (top down): K-means, Fuzzy c-means, QT, Locality Sensitivity

Spectral
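
To make the bottom-up (agglomerative) approach concrete, here is a minimal sketch that repeatedly merges the two closest clusters. The slides performed the clustering in R and do not state the linkage method; average linkage, the function name `average_linkage`, and the toy distance matrix are all assumptions of this example.

```python
from itertools import combinations


def average_linkage(dist, labels):
    """Agglomerative clustering with average linkage.

    dist is a symmetric matrix of pairwise distances; labels names each row.
    Returns the merge history as a list of (left, right, height) tuples.
    """
    clusters = {i: [i] for i in range(len(labels))}  # cluster id -> member rows
    names = {i: labels[i] for i in range(len(labels))}
    merges = []
    next_id = len(labels)

    def mean_dist(a, b):
        # Average of all pairwise distances between members of clusters a and b.
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest average distance.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: mean_dist(*p))
        merges.append((names[a], names[b], mean_dist(a, b)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        names[next_id] = f"({names[a]},{names[b]})"
        next_id += 1
    return merges


# Toy dissimilarities between four hypothetical genomes.
labels = ["g1", "g2", "g3", "g4"]
dist = [
    [0.0, 0.2, 0.8, 0.9],
    [0.2, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.1],
    [0.9, 0.8, 0.1, 0.0],
]
merges = average_linkage(dist, labels)
```

The merge history is the information a dendrogram draws: the closest pair joins first, and the height of each join is the distance at which it happened.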



R


Language/environment for statistical computing

mpiBLAST → Homologs → Profiles → Jaccard → Cluster → Tree → Optimize → Evaluation → G.O.


Testing dataset drawn from published data


Slonim et al. 2006



Dataset includes:


215 genomes


4200 proteins


Step 1: Read profiles, Jaccard matrix

Step 2: Dendrogram, Cluster

Step 3: Pruning, Swiveling
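
Step 1 can be sketched as follows, assuming the profiles arrive as a tab-delimited table with protein IDs in the first column and one 0/1 column per genome. That file layout, the sample data, and the function names are assumptions for illustration; the slides do not describe the actual input format.

```python
import csv
import io

# Hypothetical profile table: rows are proteins, columns are genomes.
PROFILE_TEXT = (
    "protein\tgenome_1\tgenome_2\tgenome_3\n"
    "p1\t1\t0\t1\n"
    "p2\t1\t1\t0\n"
    "p3\t0\t1\t0\n"
    "p4\t1\t1\t1\n"
)


def read_profiles(handle):
    """Parse the table into (genome_ids, protein_ids, rows of 0/1 ints)."""
    reader = csv.reader(handle, delimiter="\t")
    genome_ids = next(reader)[1:]
    protein_ids, rows = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        protein_ids.append(row[0])
        rows.append([int(value) for value in row[1:]])
    return genome_ids, protein_ids, rows


def jaccard_matrix(rows, n_genomes):
    """Pairwise Jaccard dissimilarity between genome columns."""
    columns = [[row[j] for row in rows] for j in range(n_genomes)]

    def dissimilarity(u, v):
        both = sum(a & b for a, b in zip(u, v))
        either = sum(a | b for a, b in zip(u, v))
        return 0.0 if either == 0 else 1.0 - both / either

    return [[dissimilarity(u, v) for v in columns] for u in columns]


genome_ids, protein_ids, rows = read_profiles(io.StringIO(PROFILE_TEXT))
matrix = jaccard_matrix(rows, len(genome_ids))
```

The resulting square matrix is symmetric with zeros on the diagonal and is the kind of input a hierarchical clustering routine expects in Step 2.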

[Figure: profile matrix of Protein IDs by Genome IDs; profile scores 1 = Present, 0 = Absent]

[Figure: matrix of Jaccard Dissimilarity Coefficients between Genome IDs]

Hierarchical Clustering of 215 genomes using R


Optimize swivels, prune tree

Apply hypergeometric function to test significance of ‘runs’

Validate code using published E. coli K12 data

Optimize code

Apply validated method to C. reinhardtii
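
The hypergeometric test mentioned above could look roughly like this. The slides do not define how a ‘run’ is scored, so this sketch only computes the generic upper-tail probability: the chance of seeing at least k successes when drawing n items without replacement from a population of N that contains K successes. The function name and the example numbers are assumptions.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) for a hypergeometric draw.

    N: population size, K: successes in the population,
    n: number of items drawn, k: observed successes in the draw.
    """
    lower = max(k, 0)
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(lower, upper + 1))
    return favorable / comb(N, n)


# Example: 10 items, 4 of them "successes", draw 3, observe at least 2.
# (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = (36 + 4) / 120 = 1/3
p_value = hypergeom_tail(10, 4, 3, 2)
```

A small tail probability would suggest that the observed grouping is unlikely to arise by chance under this model.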


Unicellular photosynthetic eukaryote



Model Organism


Chloroplast biology


Flagella structure/function


Alternative fuels (H2, biodiesel)



Sequenced genomes


Nuclear


Chloroplast


Mitochondria



Well developed genetic system:


Sexual reproduction


Haploid


The W.M. Keck Foundation



Dr. Laura Baker



Dr. Charles Hauser



Dr. Sharon Weber


Bar-Joseph, Z. et al. (2003) Bioinformatics 19:1070-1078

Cokus, S., Mizutani, S. and Pellegrini, M. (2007) BMC Bioinformatics 8:S7

Date, S.V. and Marcotte, E.M. (2003) Nature Biotechnology 21, 1055-1062

Jothi, R., Przytycka, T.M. and Aravind, L. (2007) BMC Bioinformatics 8:173

Merchant, S.S. et al. (2007) Science 318, 245-251

R Development Core Team (2008). URL: http://www.R-project.org

Rivers, Cameron (2006) Journal of Computing Sciences in Colleges 21, 190-195

Slonim, N., Elemento, O. and Tavazoie, S. (2006) Molecular Systems Biology 2: 2006.0005