BTI bioinformatics intro


22 Feb 2013


Introduction to Bioinformatics
Lukas Mueller
Boyce Thompson Institute

What is bioinformatics?

Bioinformatics /baɪ.oʊˌɪnfərˈmætɪks/ is the
application of computer science and information
technology to the field of biology and medicine.

Bioinformatics deals with

algorithms, databases and information systems, web
technologies, artificial intelligence and soft computing,
information and computation theory, software engineering,
data mining, image processing, modeling and simulation,
signal processing, discrete mathematics, control and system
theory, circuit theory, and statistics,

for generating new knowledge of biology and medicine, and
improving & discovering new models of computation (e.g.
DNA computing, neural computing, evolutionary computing,
immuno-computing, swarm-computing, cellular-computing).

Bioinformatics can...

Identify similar sequences

Provide a putative function for a sequence

Assemble sequences (genomes, transcriptomes)

Annotate genomes

Build networks of genes or metabolites

Determine phylogenetic relationships

Mine literature for biological information

Uncover differences between two genomes

Calculate how a protein folds

What can bioinformatics do for me?

Speed up your research

Enable you to ask new questions

Majority of projects involve large datasets

Basic knowledge of bioinformatics needed

Extract information

Transform information

Run analyses

Build hypotheses, etc.
200 GB / run
The digital revolution

Increase in seq data
L. Stein, Genome Biology, 2010

Web-based bioinformatics

The next step: Running “locally”

Perform analyses on large datasets

Analyses run faster

Output easier to handle

Chain analyses

More flexible

Better control of parameters

Needs more knowledge about your computer
and tools!

Highly cited bioinfo tools

1. BLAST (Altschul SF. et al. 1990; 30,202 citations)

Sequence search by homology/similarity.

2. CLUSTALW (Thompson JD. et al. 1994; 32,681 citations).

Multiple sequence alignment.

3. PAML (Yang ZH, 1997; 2,642 citations)

“Maximum Likelihood” phylogenetic analysis.

4. GBROWSE (Stein LD, et al. 2002; 428 citations),

Genome visualization.

5. BLAST2GO (Conesa A. et al. 2005; 363 citations),

Sequence functional multi-annotation.


6. VELVET (Zerbino DR, et al. 2008; 323 citations),

Sequence assembly using de Bruijn graphs.

7. SAMTOOLS (Li H. et al. 2009; 172 citations),

Processing of sequence alignments (SAM/BAM files) for NGS.

8. SOAP2 (Li RQ et al. 2009; 76 citations).

Short-read alignment.

9. MAKER (Cantarel BL, et al. 2008; 23 citations),

Genome annotation pipeline.

10. GALAXY (Goecks J. et al. 2010; 20 citations),

Genomic analysis platform that integrates several scripts and tools.
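
Item 6 above names de Bruijn graphs as the idea behind Velvet's assembly. As a rough illustration only (this is not Velvet's implementation), the sketch below builds such a graph from a few reads: each k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function name, the reads, and k=3 are made up for the example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical overlapping reads, chosen only to illustrate the idea.
reads = ["ATGGC", "GGCGT", "CGTGC"]
graph = build_de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

An assembler would then look for paths through this graph; nodes with more than one outgoing edge (here "TG") are where the reads alone do not determine a unique sequence.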

Running “pipelines”


Linux

UNIX-based, free and open source operating system

Very stable

Adopted for most bioinformatics work

Installed on laptops, clusters, supercomputers

Can run on your computer!

Virtualized or native

C, UNIX and Linux

Ken Thompson and Dennis
Ritchie, inventors of UNIX at Bell
Labs, in front of a PDP-11, early

Linus Torvalds implemented
an open source version of
UNIX (Linux) while a student
in Finland in the 1990s


UNIX – the terminal

Runs the “shell”

Built-in scripting

shell commands

Powerful, but text based (CLI)

Automate tasks, combine commands

Commands can look like gobbledegook:
grep Niben /var/log/ftp | grep -i sca | sort -u | wc -l
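
The one-liner above is less cryptic once read left to right: keep lines containing "Niben", keep those that also contain "sca" in any letter case, drop duplicate lines, and count what is left. The sketch below spells out the same steps in Python. The function name and the sample log lines are invented, and plain substring tests stand in for grep's regular expressions, which is equivalent here only because both patterns are literal strings.

```python
def count_unique_matches(lines, first="Niben", second="sca"):
    """Count distinct lines containing `first` and, ignoring case, `second`."""
    matches = set()  # plays the role of `sort -u`: each line is kept once
    for line in lines:
        line = line.rstrip("\n")
        # grep Niben | grep -i sca
        if first in line and second.lower() in line.lower():
            matches.add(line)
    return len(matches)  # plays the role of `wc -l`


# Invented log lines, for illustration only.
sample_log = [
    "GET Niben_scaffolds.fasta",
    "GET Niben_scaffolds.fasta",
    "GET Niben_Scaffold_annotation.gff",
    "GET Niben_contigs.fasta",
    "GET tomato_scaffolds.fasta",
]
print(count_unique_matches(sample_log))
```

Passing an open file object instead of the list works the same way, since the function only iterates over lines.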


Scripts: Small programs written by the end-user
that control the execution of other programs or
perform a simple algorithm

Extremely flexible

Written in Shell, Perl, Python

You can write them yourself!!!
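
As an example of a script that performs a simple algorithm, the sketch below reads FASTA-formatted text and reports the length and GC content of each sequence. The record names and sequences are invented, and a real script would read from a file rather than from a string.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # The name is the first word of the header line.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Invented records, for illustration only.
fasta_text = ">seq1 example\nATGC\nGGCC\n>seq2\nATAT\n"
records = list(read_fasta(io.StringIO(fasta_text)))
for name, seq in records:
    print(f"{name}\t{len(seq)}\t{gc_content(seq):.2f}")
```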


Perl

Versatile language

Developed since the 1980s by Larry Wall

Useful for bioinformatics and web development

Support for

Excellent integration of
regular expressions
(text handling)

Vast open source code library (http:/
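
Regular expressions are what make this kind of text handling compact. The sketch below uses Python's re module rather than Perl, purely to keep the examples in one language; the pattern itself would look the same in Perl. It finds simple open reading frames: a start codon, then whole codons, then the first in-frame stop codon. The DNA string and function name are invented, and the pattern ignores the reverse strand and overlapping frames.

```python
import re

# Start codon, then whole codons (matched lazily), then the first in-frame stop.
ORF = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")


def find_orfs(seq):
    """Return (start, orf) for each non-overlapping match, scanning left to right."""
    return [(m.start(), m.group()) for m in ORF.finditer(seq.upper())]


# Invented sequence, for illustration only.
dna = "ccATGAAATAGggatgccctgaTT"
for start, orf in find_orfs(dna):
    print(start, orf)
```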


R

Easy to learn

Language designed for statistics

Support for matrix calculations, graphics

Expression analysis, Next-Gen sequence
analysis, Graphics, genome annotation
statistics, phylogeny


Bioconductor package


Biological data is highly structured

Relational database systems (postgres, mysql)

Database schemas - normalization
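
To make "normalization" concrete: instead of repeating a species name on every gene row, a normalized schema stores each species once and lets genes refer to it by id. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for postgres or mysql. The table names, column names, and rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE organism (
        organism_id INTEGER PRIMARY KEY,
        species     TEXT NOT NULL UNIQUE
    );
    CREATE TABLE gene (
        gene_id     INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
    );
""")

# Each species name is stored once; genes point to it by id.
conn.execute("INSERT INTO organism VALUES (1, 'Solanum lycopersicum')")
conn.execute("INSERT INTO organism VALUES (2, 'Nicotiana benthamiana')")
conn.executemany(
    "INSERT INTO gene (name, organism_id) VALUES (?, ?)",
    [("geneA", 1), ("geneB", 1), ("geneC", 2)],
)

# A join reassembles the flat view when it is needed.
rows = conn.execute("""
    SELECT gene.name, organism.species
    FROM gene JOIN organism USING (organism_id)
    ORDER BY gene.name
""").fetchall()
for name, species in rows:
    print(name, species)
conn.close()
```

Correcting a misspelled species name then means updating one row in organism, not every gene that belongs to it.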


Transcriptomics and sequence

RNASeq technology and genome sequencing using next generation

Experimental design, multiplexing

Special tools developed

Sequence preprocessing

Aligners such as bwa, novoalign

Assemblers such as newbler, mira, velvet


File conversions

Evaluation of assemblies

Structural and functional annotation
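
One widely used number for evaluating an assembly is N50: the contig length such that contigs of that length or longer contain at least half of all assembled bases. The slides do not say which metrics are meant, so using N50 here is an assumption, and the contig lengths are invented.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    lengths = list(lengths)
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Invented contig lengths, for illustration only.
contig_lengths = [100, 80, 60, 40, 20]
print(n50(contig_lengths))
```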

Phylogenetics and comparative

How do sequences/genomes relate to each other?

Align sequences



Build phylogenetic trees


Neighbor joining

Maximum likelihood



Modes of selection

Identification of SNP patterns

Genome duplications
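
Distance-based tree methods such as neighbor joining, mentioned above, start from a matrix of pairwise distances between aligned sequences. The sketch below computes the simplest such distance, the proportion of differing sites (p-distance). It does not build the tree. The sequence names and sequences are invented, and the code assumes the sequences are already aligned to equal length.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(alignment):
    """Return {(name1, name2): distance} for every unordered pair of names."""
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }


# Invented alignment, for illustration only.
alignment = {
    "seqA": "ATGCATGC",
    "seqB": "ATGCATGA",
    "seqC": "TTGCAAGA",
}
distances = distance_matrix(alignment)
for pair, d in distances.items():
    print(pair, d)
```

Real analyses usually correct these raw proportions with a substitution model before building a tree; that step is left out here.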

Beyond this course

BTI Perl Club

If you have a bioinformatics question, please let
us know!