CRN: 84250 ---- FW4500 Bioinformatics Programming & Skills

13 Dec 2013








Time: MWF 1:05 pm - 1:55 pm

Where: School of Forest Resource, Rm 143

Instructor: Hairong Wei, Assistant Professor of Plant Bioinformatics, Molecular Biology, and Genetics








Why do I need to take this course?

1. To learn the most efficient language for manipulating text and data files;
2. To learn the most efficient language for extracting information from various data resources and outputs from any tools & software;
3. To add an expertise to your CV;
4. A must-have skill for doing bioinformatics and biology research;
5. To open an avenue to your career;
6. A pipeline-developing language. It is easy to call other languages from Perl, and the intermediate results can be processed, stored, and retrieved easily;
7. Rarely taught in most universities.












The most useful skills for a bioinformatician



(1). working knowledge of biology and its applications;

(2). proficiency in computer languages;

(3). skills in data mining;

(4). skills in data visualization;

(5). experience with systems biology tools;

(6). experience in using bioinformatics resources.



In this course, we aim to give students the opportunity to learn skills (2), (6), and (1).


Upon completion of this course, students will be able to:

  - Prepare large-scale expression and sequence data for bioinformatics analyses
  - Manipulate files and directories
  - Use arrays and array functions to solve a wide variety of problems
  - Use hashes to solve commonly encountered problems
  - Use the powerful regular expression capabilities of Perl
  - Extract useful information from various outputs
  - Manipulate gene annotations
  - Take advantage of Perl's powerful system interface
  - Use modules from the standard Perl distribution
  - Use Perl references
  - Write pipelines to perform bioinformatics tasks


Requirements:

Homework ------ 5 homework assignments    25%
Projects ---------- 2 projects            30%
Mid-term exam                             15%
Final exam                                25%
Participation                              5%





Late Assignments

One-day delay: 10% off
Two-day delay: 30% off
Three-day delay: 60% off
More than three days: 100% off





Unix/Linux essentials (http://www.computerhope.com/unix.htm)

1. Text editors ---- emacs, vi, vim, pico
2. Open a file in emacs: emacs my_perl.pl
     Exit: ctrl-x ctrl-c
     Search for a word: ctrl-s, then type the word to search for
     Move to the end of the file: press Esc, then hold Shift and press ">" (use "<" for the beginning of the file)
3. Look at the first ten rows of a file: head -10 data.txt
4. Look at the last ten rows of a file: tail -10 data.txt
5. Search a file for a word: grep "word4search" file.txt
6. Count how many rows you have in a file: cat file.txt | wc
7. How many unique rows do you have in a file? cat file.txt | sort | uniq | wc
     wc --- count lines, words, and characters (wc -l prints the line count only)
8. Look at memory: top (type q to quit)
9. Content of the current directory: ls -l or ls -la
10. Run a program without disruption when you log out or close your terminal ---- screen, or nohup command &
     nohup --- continue a job after logout
11. wget http://…./filename.gz
12. tar cvf folder.tar folder (create an archive) or tar tvzf foo.tar.gz (list the contents of a gzipped archive)
13. gzip folder.tar
14. gunzip folder.tar.gz
15. tar xvf folder.tar or tar xvzf foo.tar.gz (extract an archive)
16. pwd --- print the path of the current directory
17. sort --- sort a file
     How do you sort a file according to field x? For example, by field 3:
     % cat file.txt | sort -k 3,3
18. quota -v --- find out your available disk space
19. scp file.txt hairong@pandora.ffr.mtu.edu:
20. df -k --- summarize free disk space
21. du --- summarize disk space used
22. env



Bioinformatics Resources:

1. NCBI ---- Blast, EST, mRNA sequences, genome sequences.
2. UCSC --- Blat
3. Ensembl --- Biomart, Sahha
4. EBI ---- protein domain analysis ProSCAN
5. TIGR --- fungi and microbe genomes
6. TAIR, NSGA ----- Arabidopsis
7. Maize Genome Project
8. Rice
9. Poplar


What is "Perl"?

--- Practical Extraction and Report Language --- Larry Wall

Check if perl is installed on your machine:

perl -V
perl -V:startperl


Features of Perl

1. Flexible syntax
2. Hard to read --- partly because of modules
3. Clever
4. Slower than C
5. Many modules ----- CPAN




1. A simple Perl program

#!/usr/bin/perl -w
# the -w switch on the first line enables warnings
use warnings;
use strict;      # load the strict module for strict syntax checks
print "I love bioinformatics\n";



How to make it executable?

$ chmod +x simple_perl.pl
$ ./simple_perl.pl
$ mv simple_perl.pl simple
$ ./simple



2. Global variables ---- $

$a, $b, $var, $A1, $signal, $exp, $tmp_1, $_ etc. are all legal variable names.

Names are alphanumeric (letters, digits, and underscores), up to a total of 251 characters in length.

Illegal variables: $5dollars, $big-var (a name cannot start with a digit or contain a hyphen)

Local variables are not subject to these rules.


3. Array ---- @

@first_array = (1, 2, 3, 4, 5);
@sec_arr = ('Tom', 89, 'little-foot', 95);
@third_arr = qw(bar jar car ear far var mar);    # qw: quote words
$array[$index] = $element_2_store;


4. Hash --- %

%fun_figures = (Mouse => 'Jerry', Cat => 'Tom', Dog => 'Spike');
%mid_term = ('Jim Carr', '85', 'Tim Hall', '98', 'Simplson', '71');
$hash{$a_key} = $element_4_store;


What happens if you store two different items with the same key?
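A minimal sketch that answers the question above: a hash keeps exactly one value per key, so a second assignment to the same key replaces the first. The scores and the %all_scores workaround are illustrative assumptions, not part of the course material.

```perl
use strict;
use warnings;

my %mid_term = ('Jim Carr' => 85, 'Tim Hall' => 98);

# Assigning to an existing key replaces the old value.
$mid_term{'Jim Carr'} = 90;
print "Jim Carr: $mid_term{'Jim Carr'}\n";

# The number of keys is unchanged by the second assignment.
my $n_keys = keys %mid_term;
print "Number of keys: $n_keys\n";

# To keep several values under one key, store them in an array reference.
my %all_scores;
push @{ $all_scores{'Jim Carr'} }, 85;
push @{ $all_scores{'Jim Carr'} }, 90;
print "All scores: @{ $all_scores{'Jim Carr'} }\n";
```

Storing an array reference per key is one common way to keep every value; whether it suits a given data set depends on how the values are used later.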




5. Subroutine --- & (We will discuss this later)

6. How to use variables, arrays, hashes?

Some examples of Perl scripts ---- shown in class


7. References

my $scalar_ref = \$variable;
my $array_ref = \@array;
my $hash_ref = \%hash;
my $subroutine_ref = \&a_subroutine;

Dereferencing

$variable = $$scalar_ref;
@array = @$array_ref;
%hash = %$hash_ref;
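The following sketch shows why references matter: a reference points at the original data, so a change made through the reference is visible in the original variable. The sample values and the body of a_subroutine are made up for illustration.

```perl
use strict;
use warnings;

my $variable = 'ATGC';
my @array    = (1, 2, 3);
my %hash     = (Cat => 'Tom');
sub a_subroutine { return "called with @_" }

my $scalar_ref     = \$variable;
my $array_ref      = \@array;
my $hash_ref       = \%hash;
my $subroutine_ref = \&a_subroutine;

# Changes made through a reference modify the original data.
$$scalar_ref .= 'T';
push @$array_ref, 4;
$$hash_ref{Mouse} = 'Jerry';

print "$variable\n";                        # ATGCT
print "@array\n";                           # 1 2 3 4
print join(',', sort keys %hash), "\n";     # Cat,Mouse
print &$subroutine_ref('x'), "\n";          # called with x
```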








Week 1: Lecture 2: Operators

1. String concatenation:

$str = "fred" . "\t" . "barney";

Now $str is the string "fred", a TAB character, then "barney"

$str = "fred" . "|" . "barney";

Now $str contains the string "fred|barney"








2. Comparison:

Numeric    String    Return
-------------------------------------------------------------------------------
!=         ne        not equal
>          gt        greater than
==         eq        equal
>=         ge        greater than or equal
<          lt        less than
<=         le        less than or equal
<=>        cmp       compare (returns -1, 0, or 1)
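A short sketch of why Perl has two sets of comparison operators: the same pair of values can compare differently as numbers and as strings. The values are chosen only for illustration.

```perl
use strict;
use warnings;

# Numeric comparison: 9 is smaller than 10.
my $numeric_less = (9 < 10) ? 'true' : 'false';

# String comparison: the character "9" sorts after "1", so '9' is not less than '10'.
my $string_less = ('9' lt '10') ? 'true' : 'false';

# <=> and cmp return -1, 0, or 1 rather than true/false.
my $numeric_order = 9 <=> 10;          # -1
my $string_order  = '9' cmp '10';      #  1
my $same          = 'ATG' cmp 'ATG';   #  0

print "9 < 10: $numeric_less\n";
print "'9' lt '10': $string_less\n";
print "9 <=> 10: $numeric_order, '9' cmp '10': $string_order, 'ATG' cmp 'ATG': $same\n";
```

Using a numeric operator on text (or a string operator on numbers) is a frequent source of silent errors, which is one reason to keep warnings enabled.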







3. Logical and Bitwise

Logical operators    Bitwise operators    Meaning
-----------------------------------------------------------------
&&                   &                    AND
||                   |                    OR
xor                  ^                    Exclusive OR
!                    ~                    NOT

For example,

if ($file1 && $file2) is TRUE only if both $file1 and $file2 are true (for example, both hold a non-empty file name)

3 = 011, 6 = 110; 3 & 6 = 010 = 2; 3 | 6 = 111 = 7
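The arithmetic above can be checked with a few lines of Perl. This is a sketch only; the file names used for the logical test are hypothetical.

```perl
use strict;
use warnings;

my $and = 3 & 6;    # 011 & 110 = 010
my $or  = 3 | 6;    # 011 | 110 = 111
my $xor = 3 ^ 6;    # 011 ^ 110 = 101

# %03b prints a number in binary, padded to three digits.
printf "3 & 6 = %03b = %d\n", $and, $and;
printf "3 | 6 = %03b = %d\n", $or,  $or;
printf "3 ^ 6 = %03b = %d\n", $xor, $xor;

# Logical operators work on truth values, not on individual bits.
my ($file1, $file2) = ('a.txt', '');    # the empty string is false
my $both   = ($file1 && $file2) ? 'yes' : 'no';
my $either = ($file1 || $file2) ? 'yes' : 'no';
print "both set: $both, either set: $either\n";
```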




4. Arithmetic

$x = 4 ** 0.5;    # 2, square root
$x = 4 ** 2;      # 16, power
$x = 9 % 2;       # 1, modulus: remainder upon dividing




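5. How to use the arrow?

A. Look up a value in a hash through a hash reference

$value = $hash_ref->{key};

B. Take a slice of an array through an array reference (the arrow returns a single element, so a slice dereferences with @{...} instead)

@slice = @{$array_ref}[5..10];

C. Get the first element from a subroutine that returns an array reference

$result = returned_array_ref()->[0];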
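A runnable sketch of the three cases, with made-up data. Note that the arrow reaches one element at a time; to take several elements at once, the reference is dereferenced with @{...} and then sliced.

```perl
use strict;
use warnings;

my $hash_ref  = { key => 'value' };
my $array_ref = [ 0 .. 12 ];

# Hypothetical subroutine that returns an array reference.
sub returned_array_ref { return [ 'first', 'second' ] }

my $value  = $hash_ref->{key};             # one hash value
my $sixth  = $array_ref->[5];              # one array element
my @slice  = @{$array_ref}[5 .. 10];       # several elements: dereference, then slice
my $result = returned_array_ref()->[0];    # first element of the returned array

print "$value $sixth $result\n";
print "slice: @slice\n";
```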



6. String manipulation

a. length $str

Returns the number of characters in $str.

b. substr($str, offset, len);

Returns the characters of the string starting at the designated offset (counted from the start of the string, beginning at 0), up to the number of characters designated by LEN.

c. substr($str1, offset, len, $str2)

Replaces the part of the string beginning at OFFSET, of length LEN, with the replacement string $str2.



#!/usr/bin/perl -w

$temp = substr("okay", 2);
print "Substring value is $temp\n";

Substring value is ay

$temp = substr("okay", 1, 2);
print "Substring value is $temp\n";

Substring value is ka

$sentence = "The quick brown fox jumps over the lazy dog.";
$chunk = substr($sentence, 4, 5);    # quick


d. Find substrings with index and rindex

$_ = "It's a Perl Perl Perl Perl World";
$left  = index $_, 'Perl';     # 7
$right = rindex $_, 'Perl';    # 22

$str = "It's a Perl word";
$substr = substr($str, index($str, 'Perl'), 4);    # Perl

e. uc $str and lc $str

Return the upper-case or lower-case version of the string.
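The string functions in this section can be combined as in the sketch below. The replacement text 'BioPerl' is an arbitrary choice for illustration.

```perl
use strict;
use warnings;

my $str = "It's a Perl word";

my $len   = length $str;                # number of characters
my $start = index($str, 'Perl');        # position of the first 'Perl'
my $word  = substr($str, $start, 4);    # extract 4 characters from that position
my $upper = uc $word;
my $lower = lc $word;

print "length=$len start=$start word=$word upper=$upper lower=$lower\n";

# Four-argument substr replaces the selected part of the string in place.
substr($str, $start, 4, 'BioPerl');
print "$str\n";
```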



f. Split

Split a row or line into multiple fields according to the delimiter you specify, usually a TAB or a space.

In the rows below, fields are separated by TABs; words inside the description are separated by spaces.

Ptp.3328.1.S1_s_at          bZIP family transcription factor;       221.3727    168.3524    96.88159
PtpAffx.1578.1.S1_at        similar to zinc finger protein (PMZ)    100.5228    123.7725    85.54334
PtpAffx.200456.1.S1_at      DRE binding protein (DREB1A);           121.7867    142.2339    16.21638
PtpAffx.202271.1.S1_s_at    bHLH protein;                           118.736     146.9658    48.37343
PtpAffx.215817.1.S1_at      putative protein;                       46.99999    77.77928    19.88638
PtpAffx.22673.2.A1_at       bHLH protein;                           147.6356    163.8369    73.30122


@field = split(/\t/, $_);    # split the current line $_ by TAB

$field[0] contains Ptp.3328.1.S1_s_at
$field[1] contains bZIP family transcription factor;
$field[2] contains 221.3727
….

@field = split;              # split $_ on whitespace (TABs and spaces)

$field[0] contains Ptp.3328.1.S1_s_at
$field[1] contains "bZIP"
$field[2] contains "family"
$field[3] contains "transcription"
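A sketch of both split styles applied to the first data row, assuming (as the example output suggests) that the columns are TAB-separated. Averaging the three expression values is an added illustration, not part of the original slides.

```perl
use strict;
use warnings;

my $line = "Ptp.3328.1.S1_s_at\tbZIP family transcription factor;\t221.3727\t168.3524\t96.88159\n";
chomp $line;

my @by_tab   = split /\t/, $line;    # one field per column
my @by_space = split ' ', $line;     # splits on any run of whitespace

print scalar(@by_tab),   " fields by TAB; description: $by_tab[1]\n";
print scalar(@by_space), " fields by whitespace; second field: $by_space[1]\n";

# Average the three expression values (fields 2..4 of the TAB split).
my @expr = @by_tab[2 .. 4];
my $sum  = 0;
$sum += $_ for @expr;
my $mean = $sum / @expr;
printf "mean expression: %.4f\n", $mean;
```

Splitting on TAB keeps a multi-word description in one field, which is usually what is wanted for annotation tables.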

7. Array manipulation

push and pop work on the right end of an array (list); unshift and shift work on the left end.

push(@array, $new_element);
$element_at_rightmost = pop(@array);

unshift(@array, $new_element);
$element_at_leftmost = shift(@array);

@array = (1, 2, 3);
@reverse_array = reverse(@array);
@sorted_array = sort(@array);
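The sketch below traces the array after each call. It also shows one point the list above leaves implicit: sort compares elements as strings unless told otherwise, so numeric data needs an explicit <=> comparison. The @numbers values are illustrative.

```perl
use strict;
use warnings;

my @array = (1, 2, 3);

push @array, 4;                  # (1, 2, 3, 4)
my $rightmost = pop @array;      # returns 4; array is (1, 2, 3)
unshift @array, 0;               # (0, 1, 2, 3)
my $leftmost = shift @array;     # returns 0; array is (1, 2, 3)

my @reverse_array = reverse @array;    # (3, 2, 1)

# Default sort is by string order; use <=> for numeric order.
my @numbers      = (10, 9, 2);
my @default_sort = sort @numbers;                 # (10, 2, 9)
my @numeric_sort = sort { $a <=> $b } @numbers;   # (2, 9, 10)

print "array: @array\n";
print "reversed: @reverse_array\n";
print "default sort: @default_sort\n";
print "numeric sort: @numeric_sort\n";
```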


Special array @ARGV

Running your code from the command line

$ perl my_perl.pl input_file1.txt input_file2.txt



Inside your program, you can get the input file names from @ARGV



$ARGV[0] = input_file1.txt

$ARGV[1] = input_file2.txt
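A small sketch of reading @ARGV. So that it runs without real command-line arguments, the script fills @ARGV itself; in normal use that assignment would be removed.

```perl
use strict;
use warnings;

# Simulate the command line: perl my_perl.pl input_file1.txt input_file2.txt
@ARGV = ('input_file1.txt', 'input_file2.txt');

my $n_args = scalar @ARGV;
print "Number of arguments: $n_args\n";
for my $i (0 .. $#ARGV) {
    print "\$ARGV[$i] = $ARGV[$i]\n";
}

# Outside a subroutine, shift with no argument removes the first element of @ARGV.
my $infile = shift;
print "First input file: $infile\n";
print "Remaining arguments: @ARGV\n";
```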








Lecture 3: Input and output with filehandles

1. How to open a file?

open(MHD, "myfile") or die "Cannot open the file: $!";     # explicit filename

Or

open(MHD, $filename) or die "Cannot open the file: $!";    # filename in a variable

while (<MHD>) {
    chomp;             # remove the trailing newline from the current line
    print "$_\n";      # $_ holds the current line
}

When the open fails, the reason is stored in the special variable $!, so printing $! helps you learn why it failed.
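As an alternative to the bareword filehandle shown above, the sketch below uses the three-argument form of open with a lexical filehandle, which keeps the mode separate from the file name. The temporary file (created with the core File::Temp module) and its contents are assumptions made only so the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small throwaway input file.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
print {$tmp_fh} "gene1\t10\n", "gene2\t20\n";
close $tmp_fh or die "Cannot close $filename: $!";

# Three-argument open: filehandle, mode, file name.
open(my $in, '<', $filename) or die "Cannot open $filename: $!";
my $n_lines = 0;
while (my $line = <$in>) {
    chomp $line;
    print "$line\n";
    $n_lines++;
}
close $in;

# When open fails, $! holds the reason.
my $missing = "$filename.does_not_exist";
my $error   = '';
open(my $bad, '<', $missing) or $error = "$!";
print "open failed: $error\n" if $error;
```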


Some examples


2. Open file using shift

#!/usr/bin/perl -w

use warnings;
use strict;

my $infile = shift;     # first command-line argument (taken from @ARGV)
my @fields;

open(IN, "$infile") || die "Can not open input file -- $infile\n";

while (<IN>) {
    chomp;
    @fields = split(/\t/, $_);
    # @columns = split;
    print "@fields\n";
}

3. Open file with Getopt module

#!/usr/bin/perl -w

use Getopt::Std;

%opt = ();
getopts("hm:n:o:", \%opt);     # -h is a flag; -m, -n, and -o each take a value

open(FH4M, "$opt{m}") or die "Cannot open the input file: $!";

while (<FH4M>) {
    chomp();
    print "$_\n";    # $_ current row
}
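To see what getopts stores, the sketch below parses a simulated command line. The option values and the file names are hypothetical; in a real script @ARGV would come from the shell.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulate: perl script.pl -h -m expr.txt -n 5 extra.txt
@ARGV = ('-h', '-m', 'expr.txt', '-n', '5', 'extra.txt');

my %opt;
getopts('hm:n:o:', \%opt)
    or die "Usage: $0 [-h] [-m file] [-n value] [-o file]\n";

print "-h given\n" if $opt{h};                    # flags are stored as 1
print "-m = $opt{m}\n";                           # options with ':' store their value
print "-n = $opt{n}\n";
print "-o not given\n" unless defined $opt{o};    # absent options are undefined
print "remaining arguments: @ARGV\n";             # getopts removes what it parsed
```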

4. Open file using IO::File module

#!/usr/bin/perl -w

use warnings;
use strict;
use IO::File;

my $FH = IO::File->new;     # create a filehandle object
$FH->open(">myfile") or die "Unable to open: $!";

Or

my $FH = IO::File->new(">myfile") or die "Unable to open: $!";

$FH->close();

Or

$FH->open($anotherfile, ">");



Open mode:

Perl    fopen    Meaning
<       r        read only
>       w        write only (create and truncate)
>>      a        write, append, and create
+<      r+       read and write (file must exist)
+>      w+       read, write, create, and truncate
+>>     a+       read, write, append, and create

open(FH, ">$file");    # open file for writing

">": Open for write access. Creates the file if it does not exist; otherwise destroys the current contents.

">>": Open for append access. Creates the file if it does not exist; otherwise writes are added at the end.

"<": Open for read access.

"+<": Open for read and write access. If the file does not exist, the open fails. If it exists, the current contents are preserved and both reading and writing start from the beginning of the file. Use only when you want to open a file and write over its existing contents.

"+>": Open for read and write access. If the file does not exist, it is created. If it exists, the current contents are truncated and lost. Use when you want to create a new file that will first be written to and later read from.

"+>>": Open for read and append access. Creates the file if it does not exist. If it exists, both reading and writing start from the end of the file; on some platforms, reads may start anywhere in the file.
How to judge the type of a file?

chomp($filename = <STDIN>);    # read a file name and remove the trailing newline

if (-e $filename) {
    print "The file or directory exists\n";
} else {
    print "The file or directory does NOT exist\n";
}

-e   file or directory exists
-d   is a directory
-f   is a plain file
-B   file is binary
-x   file or directory is executable
-r   file or directory is readable
-w   file or directory is writable
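A sketch that applies several of these tests to paths created on the fly. The describe subroutine and the file names are hypothetical helpers for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Set up a temporary directory containing one plain file.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/data.txt";
open(my $fh, '>', $file) or die "Cannot create $file: $!";
print {$fh} "ATGC\n";
close $fh;

# Classify a path using the file test operators.
sub describe {
    my ($path) = @_;
    return 'does not exist' unless -e $path;
    return 'directory'      if -d $path;
    return 'plain file'     if -f $path;
    return 'something else';
}

for my $path ($dir, $file, "$dir/missing.txt") {
    print "$path: ", describe($path), "\n";
}

print "$file is readable\n" if -r $file;
```

Checking -e first avoids reporting a missing path as "something else", since -d and -f are both false for a path that does not exist.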