


The following material is the result of a curriculum development effort to provide a set of courses to
support bioinformatics efforts involving students from the biological sciences, computer science, and
mathematics departments. They have been developed as a part of the NIH funded project “Assisting
Bioinformatics Efforts at Minority Schools” (2T36 GM008789). The people involved with the curriculum
development effort include:



Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander Ropelewski and Dr. David Deerfield II, National
Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center, Carnegie Mellon
University.


Dr. Ricardo Gonzalez-Mendez, University of Puerto Rico Medical Sciences Campus.


Dr. Alade Tokuta, North Carolina Central University.


Dr. Jaime Seguel and Dr. Bienvenido Velez, University of Puerto Rico at Mayaguez.


Dr. Satish Bhalla, Johnson C. Smith University.




Unless otherwise specified, all the information contained within is Copyrighted © by Carnegie Mellon
University. Permission is granted to use, modify, and reproduce these materials for teaching purposes.









These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh
Supercomputing Center





This material is targeted towards students with a general background in Biology. It
was developed to introduce biology students to the computational, mathematical,
and biological issues surrounding bioinformatics. This specific lesson deals with the
following fundamental topics:


Essential computing for bioinformatics


Computer Science track



This material has been developed by:


Dr. Hugh B. Nicholas, Jr.


National Resource for Biomedical Supercomputing


Pittsburgh Supercomputing Center


Carnegie Mellon University






Essential Computing for Bioinformatics




Bienvenido Vélez

UPR Mayaguez


July, 2008

Outline


Course Description


Educational Objectives


Major Course Modules


Module Descriptions


Accomplishments Year 1


Plan for Year 2


Course Description

This course provides a broad introductory discussion of essential computer science concepts
that have wide applicability in the natural sciences. Particular emphasis will be placed on
applications to Bioinformatics. The concepts will be motivated by practical problems arising
from the use of bioinformatics research tools such as genetic sequence databases. Concepts
will be discussed in a weekly lecture and will be practiced via simple programming exercises
using Python, an easy-to-learn and widely available scripting language.


Educational Objectives


Awareness of the mathematical models of computation and their
fundamental limits


Basic understanding of the inner workings of a computer system


Ability to extract useful information from various bioinformatics data
sources


Ability to design computer programs in a modern high level language
to analyze bioinformatics data.


Ability to transfer information among relational databases,
spreadsheets and other data analysis tools


Experience with commonly used software development environments
and operating systems


Major Course Modules

Module                                        Hours
First Steps in Computing                          3
Using Bioinformatics Data Sources                 6
Mathematical Computing Models                     3
High-level Programming (Python)                  12
Extracting Information from Database Files        6
Relational Databases and SQL                      6
Other Data Analysis Tools                         3
TOTAL                                            39


First Steps in Computing


Need a mechanism for expressing computation


Need to understand computing in order to
understand the mechanism


Solution: Write your first bioinformatics program in a
very high-level language such as Python

Solves the Chicken and Egg Problem!


Main Advantages of Python


Familiar to C/C++/C#/Java Programmers


Very High Level


Interpreted and Multi-platform


Dynamic


Object-Oriented


Modular


Strong string manipulation


Lots of libraries available


Runs everywhere


Free and Open Source


Track record in bioinformatics (BioPython)


A Codon -> AminoAcid Dictionary

>>> code = { 'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
...          'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
...          'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
...          'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
...          'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
...          'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
...          'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
...          'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
...          'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
...          'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
...          'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
...          'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',
...          'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
...          'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
...          'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
...          'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
... }
>>>
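The same table can also be generated rather than typed by hand. The sketch below assumes the standard genetic code with codons enumerated in the conventional t, c, a, g base order; the names BASES, AMINO_ACIDS and build_code are illustrative and do not come from the original slides.

```python
from itertools import product

# Assumption: standard genetic code, codons enumerated in t, c, a, g order
# (first base changes slowest, third base changes fastest).
BASES = "tcag"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base t
    "LLLLPPPPHHQQRRRR"   # first base c
    "IIIMTTTTNNKKSSRR"   # first base a
    "VVVVAAAADDEEGGGG"   # first base g
)

def build_code():
    """Return a dict mapping each of the 64 lowercase codons to a one-letter amino acid."""
    codons = ("".join(triplet) for triplet in product(BASES, repeat=3))
    return dict(zip(codons, AMINO_ACIDS))

generated = build_code()
print(len(generated))
print(generated["atg"], generated["tgg"], generated["taa"])
```

Generating the table trades readability for brevity: a hand-typed dictionary is easier for a beginner to inspect, while the generated one is harder to mistype.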


A DNA Sequence

>>> cds = ("atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaaa"
...        "tttcgggtgccgacctgacgcgcccgttaagcgataatcagtttgaacagctttaccatgcggtg"
...        "ctgcgccatcaggtggtgtttctacgcgatcaagctattacgccgcagcagcaacgcgcgctggc"
...        "ccagcgttttggcgaattgcatattcaccctgtttacccgcatgccgaaggggttgacgagatca"
...        "tcgtgctggatacccataacgataatccgccagataacgacaactggcataccgatgtgacattt"
...        "attgaaacgccacccgcaggggcgattctggcagctaaagagttaccttcgaccggcggtgatac"
...        "gctctggaccagcggtattgcggcctatgaggcgctctctgttcccttccgccagctgctgagtg"
...        "ggctgcgtgcggagcatgatttccgtaaatcgttcccggaatacaaataccgcaaaaccgaggag"
...        "gaacatcaacgctggcgcgaggcggtcgcgaaaaacccgccgttgctacatccggtggtgcgaac"
...        "gcatccggtgagcggtaaacaggcgctgtttgtgaatgaaggctttactacgcgaattgttgatg"
...        "tgagcgagaaagagagcgaagccttgttaagttttttgtttgcccatatcaccaaaccggagttt"
...        "caggtgcgctggcgctggcaaccaaatgatattgcgatttgggataaccgcgtgacccagcacta"
...        "tgccaatgccgattacctgccacagcgacggataatgcatcgggcgacgatccttggggataaac"
...        "cgttttatcgggcggggtaa")
>>>


CDS Sequence -> Protein Sequence

>>> def translate(cds, code):
...     prot = ""
...     for i in range(0, len(cds), 3):
...         codon = cds[i:i+3]
...         prot = prot + code[codon]
...     return prot
...



>>> translate(cds, code)
'MSERLSITPLGPYIGAQ...*'
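The translate function above raises a KeyError if the sequence contains uppercase letters, ambiguous bases, or a trailing partial codon. A more forgiving variant might look like the sketch below. The name translate_safe, the "X" placeholder for unknown codons, and the four-entry demo_code table are assumptions made for illustration only.

```python
def translate_safe(cds, code, unknown="X"):
    """Translate cds codon by codon, tolerating case, unknown codons and a partial last codon."""
    cds = cds.lower()
    usable = len(cds) - len(cds) % 3          # ignore 1 or 2 leftover bases at the end
    residues = []
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        residues.append(code.get(codon, unknown))  # unknown codon -> placeholder letter
    return "".join(residues)

# Hypothetical, deliberately tiny table; a real run would pass the full 64-codon dictionary.
demo_code = {"atg": "M", "agt": "S", "gaa": "E", "taa": "*"}

print(translate_safe("ATGAGTGAATAAg", demo_code))
print(translate_safe("atgnnngaa", demo_code))
```

Collecting residues in a list and joining once also avoids rebuilding the protein string on every iteration.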



Using Bioinformatics Data Sources

Goal: Basic Experience (6 hrs)


Searching Nucleotide Sequence Databases


Searching Amino Acid Sequence Databases


Performing BLAST Searches


Using Other Data Sources


Reference: Bioinformatics for Dummies (Ch 1-4)

IDEA: How can we expedite data collection and analysis?


... writing programs to automate parts of the process.
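As one possible illustration of automating data collection, the sketch below only builds a request URL for NCBI's E-utilities EFetch service; it does not contact the network. The slides do not name this service, so its use here is an assumption, as are the function name efetch_fasta_url and the example accession.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_fasta_url(accession, db="nuccore"):
    """Build (but do not send) a URL requesting one record in FASTA text format."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)

# Example accession chosen only for illustration.
url = efetch_fasta_url("NM_000546")
print(url)
```

A script could loop over a list of accessions and build one URL per record, replacing a series of manual searches in a web browser.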


Mathematical Computing

Goal: General Awareness (6 hrs)


What is Computing?


Mathematical Models of Computing


Finite Automata


Turing Machines


Church/Turing Thesis


What is an Algorithm?


Big O Notation


Complexity Classes
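To make the finite automaton idea concrete, here is a minimal sketch of a deterministic finite automaton that accepts any DNA string containing the start codon "atg". The state numbering and the names TRANSITIONS, ACCEPT and contains_start_codon are illustrative choices, not taken from the course materials.

```python
# State k (0..3) means: the most recent k characters read equal the first k of "atg".
# Any (state, base) pair not listed falls back to state 0.
TRANSITIONS = {
    0: {"a": 1},
    1: {"a": 1, "t": 2},
    2: {"a": 1, "g": 3},
}
ACCEPT = 3  # reached once "atg" has been seen; the automaton stays accepting

def contains_start_codon(seq):
    """Run the automaton over seq and report whether it ends in the accepting state."""
    state = 0
    for base in seq.lower():
        if state == ACCEPT:
            break                      # accepting state is a sink, no need to read on
        state = TRANSITIONS[state].get(base, 0)
    return state == ACCEPT

print(contains_start_codon("ccatgc"))
print(contains_start_codon("ccatc"))
```

Each character is examined at most once, so the running time grows linearly with the sequence length, which is O(n) in Big O notation.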


High-Level Programming (Python)

Goal: Knowledge and Experience (12 hrs)


Downloading and Installing the Interpreter


Command-line versus Batch Mode


Values, Expressions and Naming


Designing your own Functional Building Blocks


Controlling the Flow of your Program


String Manipulation (Sequence Processing)


File Manipulation


Container Data Structures


Reference: How to Think Like a Computer Scientist: Learning with Python
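A small sketch of the kind of string manipulation used in sequence processing: computing a reverse complement and a GC fraction. The function names are hypothetical, and the code assumes sequences made only of the letters a, c, g and t.

```python
# Translation table mapping each base to its complement (assumes only a, c, g, t).
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string, in lowercase."""
    return seq.lower().translate(COMPLEMENT)[::-1]

def gc_content(seq):
    """Return the fraction of bases that are g or c (0.0 for an empty string)."""
    seq = seq.lower()
    if not seq:
        return 0.0
    return (seq.count("g") + seq.count("c")) / len(seq)

print(reverse_complement("atgc"))
print(gc_content("atgc"))
```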


Extracting Information from Bioinformatics Databases (6 hrs)


Manipulating sequences


Bioinformatics database file formats


Parsing Bioinformatics database files to focus on
information of interest


Exporting data into relational databases and other
tools
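As an example of parsing a database file format, the sketch below reads FASTA-formatted text and yields one (header, sequence) pair per record. FASTA is chosen here as an assumption, since the slides do not list specific formats; the function name parse_fasta and the two sample records are invented for illustration.

```python
import io

def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                              # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)     # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)                   # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)             # emit the final record

# Hypothetical two-record input held in memory instead of a file on disk.
sample = io.StringIO(">seq1 hypothetical record\natgagt\ngaacgt\n\n>seq2\nctgagc\n")
records = list(parse_fasta(sample))
print(records)
```

Because the function accepts any iterable of lines, the same code works on an open file object, for example the result of open("sequences.fasta") for a file of that hypothetical name.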


Relational Databases and SQL (6 hrs)


The Relational Model


The SQL Relational Query Language


Downloading and Installing MySQL


Creating databases in MySQL


Importing data into MySQL


Creating Analysis Reports with iReports


Exporting Data into Spreadsheets
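The create, import, query and export steps can be tried without installing a server. The sketch below substitutes SQLite, through Python's built-in sqlite3 module, for the MySQL named in this module; the table layout and the three rows are invented for illustration.

```python
import csv
import io
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences (name TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)

# Hypothetical rows, as might be produced by parsing sequence files.
rows = [("seqA", "E. coli", 900), ("seqB", "E. coli", 300), ("seqC", "H. sapiens", 1200)]
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", rows)

# A simple analysis query: number of sequences and mean length per organism.
result = conn.execute(
    "SELECT organism, COUNT(*), AVG(length) FROM sequences "
    "GROUP BY organism ORDER BY organism"
).fetchall()
conn.close()

# Export the query result as CSV text that a spreadsheet program can open.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["organism", "n_sequences", "mean_length"])
writer.writerows(result)
csv_text = buffer.getvalue()
print(csv_text)
```

The SQL statements here are basic enough that they should carry over to MySQL with little or no change, although the connection code would differ.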


CS Fundamentals will be Interleaved Throughout the
Course


Information Representation and Encoding


Computer Architecture


Programming Language Translation Methods


The Software Development Cycle


Fundamental Principles of Software Engineering


Basic Data Structures for Bioinformatics


Design and Analysis of Bioinformatics Algorithms


Accomplishments for Year 1


Studied alternative high-level languages (HLLs)


Gained experience with Python


Worked on slides for Mathematical Computing


Reviewed existing beginning courses in computing for
Bioinformatics using Python


Revised course syllabus and curricular structure


Offered the course for the first time in Spring 2007


Plan for Year 2


Offer the course a second time in Fall 2007


Develop slides and lecture notes for remaining
modules


Develop Problem Sets and Solutions


Study BioPython and integrate throughout the course


Round 1 of Assessment




Questions?