BTI bioinformatics intro


22 Feb 2013




Introduction to Bioinformatics
Lukas Mueller
Boyce Thompson Institute


What is bioinformatics?

Bioinformatics /baɪ.oʊˌɪnfərˈmætɪks/ is the
application of computer science and information
technology to the field of biology and medicine.


Bioinformatics deals with

algorithms, databases and information systems, web
technologies, artificial intelligence and soft computing,
information and computation theory, software engineering,
data mining, image processing, modeling and simulation,
signal processing, discrete mathematics, control and system
theory, circuit theory, and statistics,

for generating new knowledge of biology and medicine, and
improving & discovering new models of computation (e.g.
DNA computing, neural computing, evolutionary computing,
immuno-computing, swarm-computing, cellular-computing).


Bioinformatics can...

Identify similar sequences

Provide a putative function for a sequence

Assemble sequences (genomes, transcriptomes)

Annotate genomes

Build networks of genes or metabolites

Determine phylogenetic relationships

Mine literature for biological information

Uncover differences between two genomes

Predict how a protein folds


What can bioinformatics do for me?

Speed up your research

Enable you to ask new questions

Majority of projects involve large datasets

Basic knowledge of bioinformatics needed

Extract information

Transform information

Run analyses

Build hypotheses, etc.


The digital revolution

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

200 GB / run


Increase in sequence data
L. Stein, Genome Biology, 2010


Web-based bioinformatics








The next step: Running “locally”

Perform analyses on large datasets

Analyses run faster

Output easier to handle

Chain analyses

More flexible

Better control of parameters

Needs more knowledge about your computer
and tools!


Highly cited bioinfo tools

1. BLAST (Altschul SF. et al. 1990; 30,202 citations)

Sequence search by homology/similarity.

2. CLUSTALW (Thompson JD. et al. 1994; 32,681 citations).

Multiple sequence alignment.

3. PAML (Yang ZH, 1997; 2,642 citations)


Maximum likelihood phylogenetic analysis.

4. GBROWSE (Stein LD, et al. 2002; 428 citations),

Genome visualization.

5. BLAST2GO (Conesa A. et al. 2005; 363 citations),

Sequence functional multi-annotation.


(continued)

6. VELVET (Zerbino DR, et al. 2008; 323 citations),

Sequence assembly using de Bruijn graphs.

7. SAMTOOLS (Li H. et al. 2009; 172 citations),

Processing of sequence alignments (SAM/BAM files) for NGS.

8. SOAP2 (Li RQ et al. 2009; 76 citations).

Short-read alignment.

9. MAKER (Cantarel BL, et al. 2008; 23 citations),

Genome annotation pipeline.

10. GALAXY (Goecks J. et al. 2010; 20 citations),

Genomic analysis platform that integrates several scripts and tools.
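
The de Bruijn graph idea mentioned for Velvet can be illustrated with a toy sketch. This is not Velvet's implementation; the function name de_bruijn_edges and the example read are made up for illustration, and real assemblers also handle sequencing errors, reverse complements, and graph simplification.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a toy de Bruijn graph from reads.

    Each k-mer contributes one edge from its (k-1)-mer prefix
    to its (k-1)-mer suffix.
    """
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical read, k = 3
graph = de_bruijn_edges(["ACGTAC"], 3)
for node, successors in sorted(graph.items()):
    print(node, "->", ", ".join(successors))
```

Walking a path through such a graph that uses every edge once spells out a candidate assembled sequence.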


Running “pipelines”


Linux

UNIX-based, free and open source operating
system

Very stable

Adopted for most bioinformatics work

Installed on laptops, clusters, supercomputers

Can run on your computer!

Virtualized or native


C, UNIX and Linux

Ken Thompson and Dennis
Ritchie, inventors of UNIX at Bell
Labs, in front of a PDP-11 in the
early 1970s.

Linus Torvalds implemented
an open source UNIX-like
system (Linux) while a student
in Finland in the early 1990s.


Linux


UNIX – the terminal

Runs the “shell”

Built-in scripting


shell commands

Powerful, but text based (CLI)

Automate tasks, combine commands

Look like gobbledegook:
grep Niben /var/log/ftp | grep -i sca | sort -u | wc -l
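
To make the pipeline above less cryptic, here is a rough Python equivalent of its four steps: keep lines containing "Niben", keep those that also contain "sca" in any letter case, drop duplicates, and count what is left. The function name count_unique_matches and the sample log lines are hypothetical; a real FTP log will look different.

```python
import re


def count_unique_matches(lines, first="Niben", second="sca"):
    """Mimic: grep first | grep -i second | sort -u | wc -l"""
    second_re = re.compile(re.escape(second), re.IGNORECASE)
    unique = set()
    for line in lines:
        # grep Niben: case-sensitive match
        if first not in line:
            continue
        # grep -i sca: case-insensitive match
        if not second_re.search(line):
            continue
        # sort -u: keep each distinct line only once
        unique.add(line.rstrip("\n"))
    # wc -l: number of remaining lines
    return len(unique)


# Hypothetical log lines standing in for /var/log/ftp
sample_log = [
    "get Niben_scaffolds.fasta",
    "get Niben_scaffolds.fasta",
    "get Niben_SCA_v2.fasta",
    "get Niben_genes.gff",
    "get tomato_scaffolds.fasta",
]
print(count_unique_matches(sample_log))
```

To run it on a real file, pass an open file handle instead of the list, for example `count_unique_matches(open("/var/log/ftp"))`.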


Scripting

Scripts: Small programs written by the end-user
that control the execution of other programs or
perform a simple algorithm

Extremely flexible

Written in Shell, Perl, Python

You can write them yourself!!!
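
As an illustration of how small such a script can be, the sketch below reads sequences in FASTA format and reports the GC content of each. The function names read_fasta and gc_content and the two example records are invented for this example; libraries such as BioPerl or Biopython provide more robust parsers.

```python
import io


def read_fasta(handle):
    """Return a dict mapping sequence name to sequence string."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the name is the first word after '>'
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}


def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical FASTA data; replace with open("my_sequences.fasta")
example = io.StringIO(">seq1 test\nATGC\nGGCC\n>seq2\nATAT\n")
for name, seq in read_fasta(example).items():
    print(name, len(seq), round(gc_content(seq), 2))
```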


Perl

Versatile language

Developed since the 1980s by Larry Wall

Useful for bioinformatics and web development

Support for objects

Excellent integration of regular expressions (text handling language)

Vast open source code library (http://cpan.org/)

BioPerl

Easy to learn

http://www.perl.org/
Example
.....



R

Language designed for statistics

Support for matrix calculations, graphics

Expression analysis, Next-Gen sequence
analysis, Graphics, genome annotation
statistics, phylogeny

Interactive

Bioconductor package


Databases

Biological data is highly structured

Relational database systems (postgres, mysql)

Database schemas - normalization

SQL
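
A minimal sketch of what a normalized schema and an SQL query look like. It uses SQLite (bundled with Python) instead of postgres or mysql so that it runs without a database server; the table names, column names, and gene names are all hypothetical.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")

# Normalized design: each species is stored once in 'organism',
# and each gene points to its organism by id instead of repeating the name.
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    species     TEXT NOT NULL UNIQUE
);
CREATE TABLE gene (
    gene_id     INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
);
""")

conn.executemany(
    "INSERT INTO organism (organism_id, species) VALUES (?, ?)",
    [(1, "Solanum lycopersicum"), (2, "Nicotiana benthamiana")],
)
conn.executemany(
    "INSERT INTO gene (name, organism_id) VALUES (?, ?)",
    [("gene_a", 1), ("gene_b", 1), ("gene_c", 2)],  # made-up gene names
)

# SQL query: number of genes per species, joining the two tables
rows = conn.execute("""
    SELECT o.species, COUNT(*)
    FROM gene g
    JOIN organism o ON o.organism_id = g.organism_id
    GROUP BY o.species
    ORDER BY o.species
""").fetchall()

for species, n_genes in rows:
    print(species, n_genes)
```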


Transcriptomics and sequence assembly

RNA-Seq technology and genome sequencing using next-generation
sequencing

Experimental design, multiplexing

Special tools developed

Sequence preprocessing

Aligners such as bwa, novoalign

Assemblers such as newbler, mira, velvet

Viewers

File conversions

Evaluation of assemblies

Structural and functional annotation
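
One common number used when evaluating assemblies is the N50: the contig length such that contigs of that length or longer contain at least half of the assembled bases. The sketch below computes it from a list of contig lengths; the function name n50 and the example lengths are made up, and N50 is only one of several statistics worth checking.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Add contigs from longest to shortest until half the assembly is covered
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (in bp)
contigs = [2, 3, 4, 5, 6]
print(n50(contigs))
```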


Phylogenetics and comparative genomics

How do sequences/genomes relate to each other?

Align sequences

ClustalW

Muscle

Build phylogenetic trees

Parsimony

Neighbor joining

Maximum likelihood

Analyses

Orthology

Modes of selection

Identification of SNP patterns

Genome duplications
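
Distance-based tree methods such as neighbor joining start from a matrix of pairwise distances between aligned sequences. As a rough sketch, the code below computes the simplest such distance (the p-distance, the fraction of differing sites) from an existing alignment. The function names and the three toy sequences are invented for illustration; tools like ClustalW or Muscle would produce the alignment, and real analyses usually apply a substitution model on top.

```python
def p_distance(a, b):
    """Fraction of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable (gap-free) columns")
    return differences / compared


def distance_matrix(alignment):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    names = sorted(alignment)
    return {(m, n): p_distance(alignment[m], alignment[n])
            for m in names for n in names}


# Hypothetical aligned sequences
alignment = {"seqA": "ACGTACGT", "seqB": "ACGTACGA", "seqC": "AC-TTCGA"}
matrix = distance_matrix(alignment)
for (m, n), d in sorted(matrix.items()):
    if m < n:
        print(m, n, round(d, 3))
```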


Beyond this course

BTI Perl Club

If you have a bioinformatics question, please let
us know!




http://btiplantbioinfocourse.wordpress.com/