Block3_BasicLinux_part2-Solutions


28 Nov 2012


Block 1: Basic Linux, Part 2.

Installing Programs and Setting Environment Variables:

(Note: Commands to execute are in bold blue, answers in bold black.)

1. Installing programs using Synaptic Package Manager:

1.1 Open the program at System => Administration => Synaptic Package Manager.

1.2 Use quick search to look for the package “postgres”. How many results does the program return?

355

1.3 Using the search button, repeat the search, limiting it to package names. How many results does the program return?



87

1.4 Install the packages named postgresql-8.4, postgresql-doc-8.4 and postgresql-server-dev-8.4. How many files did the program download to install these packages?


22

1.5 Type psql --help to check that the program has been installed and its executables are accessible.

1.6 Use quick search to look for the package “openoffice.org-base” and install it. How many new packages will be installed and how many will be updated?



2. Installing programs using apt-cache and apt-get.

2.1 Using the terminal and the 'apt-cache search' command, search for a program with the name 'blast'. How many results does the program return?

bioinfo@ubuntu:~/BioinfoCourse/data$ apt-cache search blast

56

2.2 Use the argument --names-only. How many results does the program return?

bioinfo@ubuntu:~/BioinfoCourse/data$ apt-cache search blast --names-only

23
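The result counts above do not have to be counted by hand, because apt-cache search prints one package per line. The following is a minimal sketch, assuming that piping the output to wc -l is acceptable for the exercise. The helper name count_results and the sample package list are made up for illustration, so that the example runs even on a machine without apt.

```shell
#!/bin/bash
# count_results (hypothetical helper): count the lines arriving on stdin.
# apt-cache search prints one "package - description" line per hit.
count_results() {
    wc -l | tr -d ' '
}

# Made-up sample output standing in for 'apt-cache search blast'.
sample_hits='blast2 - Basic Local Alignment Search Tool
ncbi-tools-bin - NCBI libraries for biology applications
t-coffee - Multiple Sequence Alignment'

hit_count=$(printf '%s\n' "$sample_hits" | count_results)
echo "results: $hit_count"
```

On the course machine the same helper could be fed directly, for example: apt-cache search blast --names-only | count_results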

2.3 Install the program “Basic Local Alignment Search Tool” using 'apt-get install <program>'.

bioinfo@ubuntu:~/BioinfoCourse/data$ sudo apt-get install blast2

2.4 Type blastall -help to check that the program has been installed and its executables are accessible.


3. Installing programs by downloading the executable and moving it to one of the PATH dirs.

3.1 Create two folders with the names programs and programs/bin inside the BioinfoCourse folder.

bioinfo@ubuntu:~/BioinfoCourse$ mkdir programs

bioinfo@ubuntu:~/BioinfoCourse$ mkdir programs/bin

3.2 Download the blat executable file into programs/bin from
http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.zip using wget. Unzip the archive and remove the
blatSuite.zip file.

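bioinfo@ubuntu:~/BioinfoCourse$ cd programs/bin

bioinfo@ubuntu:~/BioinfoCourse/programs/bin$ wget http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.zip

bioinfo@ubuntu:~/BioinfoCourse/programs/bin$ unzip blatSuite.zip

bioinfo@ubuntu:~/BioinfoCourse/programs/bin$ rm blatSuite.zip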


3.3 Type blat -help to check that the program has been installed and its executables are accessible. Is
the program accessible? Type ./blat. What is the problem?

No, blat -help does not work.

./blat works.

The problem: the executable is not in a directory listed in the PATH.


3.4 Print the PATH for your system. Add to your bash profile (.bashrc) a new path that
includes the dir /home/bioinfo/BioinfoCourse/programs/bin. Load the new configuration
and try blat -help.

bioinfo@ubuntu:~/BioinfoCourse/programs/bin$ echo $PATH

Add to the ~/.bashrc file a last line with:

export PATH=/home/bioinfo/BioinfoCourse/programs/bin:$PATH

bioinfo@ubuntu:~/BioinfoCourse/programs/bin$ source ~/.bashrc
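Sourcing ~/.bashrc more than once with the export line above prepends the same directory again each time. This is harmless but clutters PATH. The following is a minimal sketch of a guarded variant. The function name add_to_path is hypothetical and not part of the course material, and the directory is the one from the exercise.

```shell
#!/bin/bash
# Directory from the exercise; adjust if your course folder lives elsewhere.
new_dir="/home/bioinfo/BioinfoCourse/programs/bin"

# add_to_path (hypothetical helper): prepend a directory to PATH
# only if it is not already listed there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
}

add_to_path "$new_dir"
add_to_path "$new_dir"             # the second call changes nothing
export PATH

# Count how many PATH entries are exactly the new directory.
occurrences=$(printf '%s\n' "$PATH" | tr ':' '\n' | grep -c -x -F "$new_dir")
echo "occurrences in PATH: $occurrences"

# command -v prints where the shell would find blat, if it is installed.
command -v blat || echo "blat not found in PATH"
```

If blat was unzipped into programs/bin, the last line should print its full path. Otherwise it prints the fallback message.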







Piping and Use of Common Commands to Manipulate Sequences:

(Note: Commands to execute are in bold blue, answers in bold black.)



Previous preparation:



Take a look at the following commands using man <command name>: ls, cd, mkdir, grep, cat,
cut, wc, sed, wget, awk. Pay special attention to the arguments: -l and -h for ls, -c and -v for
grep, -l and -m for wc, -d and -f for cut.



Download the sequences from sections 2.1 and 2.2 and the files from section 3.


1. Create a new directory inside BioinfoCourse with the name 'Data'. What is the
complete path to this dir?

bioinfo@ubuntu:~$ cd BioinfoCourse

bioinfo@ubuntu:~/BioinfoCourse$ mkdir Data

/home/bioinfo/BioinfoCourse/Data


2. Download using wget the following files:

2.1) All the A. thaliana chromosomes from the TAIR10 dataset
(ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/)


bioinfo@ubuntu:~$ wget ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/TAIR10_chr*.fas


2.2) All the A. thaliana genes (cdna) from the TAIR10 dataset
(ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR10_blastsets/TAIR10_cdna_20101214)

bioinfo@ubuntu:~$ wget ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR10_blastsets/TAIR10_cdna_20101214




2.3) What size are these files?


bioinfo@ubuntu:~/BioinfoCourse/data$ ls -lh


total 184M
-rw-r--r-- 1 bioinfo bioinfo  69M 2012-03-16 05:45 TAIR10_cdna_20101214
-rw-rwxr-- 1 bioinfo bioinfo  30M 2011-02-22 00:00 TAIR10_chr1.fas
-rw-rwxr-- 1 bioinfo bioinfo  20M 2011-02-22 00:00 TAIR10_chr2.fas
-rw-rwxr-- 1 bioinfo bioinfo  23M 2011-02-22 00:00 TAIR10_chr3.fas
-rw-rwxr-- 1 bioinfo bioinfo  18M 2011-02-22 00:00 TAIR10_chr4.fas
-rw-rwxr-- 1 bioinfo bioinfo  27M 2011-02-22 00:00 TAIR10_chr5.fas
-rw-rwxr-- 1 bioinfo bioinfo 153K 2011-02-22 00:00 TAIR10_chrC.fas
-rw-rwxr-- 1 bioinfo bioinfo 363K 2011-02-22 00:00 TAIR10_chrM.fas



2.4) How many sequences do these files contain?



bioinfo@ubuntu:~/BioinfoCourse/data$ grep -c '>' TAIR10_c*



TAIR10_cdna_20101214:41671
TAIR10_chr1.fas:1
TAIR10_chr2.fas:1
TAIR10_chr3.fas:1
TAIR10_chr4.fas:1
TAIR10_chr5.fas:1
TAIR10_chrC.fas:1
TAIR10_chrM.fas:1
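Counting header lines works because every FASTA record starts with exactly one line beginning with '>'. The following is a small sketch on a made-up toy file (the file name and sequences are invented for illustration). It anchors the pattern with '^' so that a '>' appearing later in a description line would not be miscounted.

```shell
#!/bin/bash
# Build a tiny FASTA file (made-up sequences) in a temporary directory.
tmp_dir=$(mktemp -d)
cat > "$tmp_dir/toy.fasta" <<'EOF'
>seq1 first toy record
ACGTACGT
ACGT
>seq2 second toy record
GGGTTT
EOF

# One header line per record, so counting lines that start with '>'
# counts the sequences.
seq_count=$(grep -c '^>' "$tmp_dir/toy.fasta")
echo "sequences: $seq_count"

rm -r "$tmp_dir"
```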




2.5) Using piping, concatenate the Arabidopsis chromosomes into one file with the
name 'TAIR10_nuclear_genome.fasta'. What is the size of this file? How many
sequences does this file have?

bioinfo@ubuntu:~/BioinfoCourse/data$ cat TAIR10_chr1.fas TAIR10_chr2.fas TAIR10_chr3.fas TAIR10_chr4.fas TAIR10_chr5.fas > TAIR10_nuclear_genome.fasta

bioinfo@ubuntu:~/BioinfoCourse/data$ ls -lh TAIR10_nuclear_genome.fasta

-rw-r--r-- 1 bioinfo bioinfo 116M 2012-03-16 05:52 TAIR10_nuclear_genome.fasta

bioinfo@ubuntu:~/BioinfoCourse/data$ grep -c '>' TAIR10_nuclear_genome.fasta

5


2.6) Using grep, can you extract the nuclear genome sequence size for
Arabidopsis thaliana?

bioinfo@ubuntu:~/BioinfoCourse/data$ grep -v '>' TAIR10_nuclear_genome.fasta | wc -m

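120654532 (characters, about 120 Mb; wc -m also counts the newline at the end of each sequence line, so this slightly overestimates the number of bases)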
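Because wc -m counts every character, the figure above includes one newline per sequence line. The following sketch, on made-up toy data, shows one way to count only the bases, assuming GNU or POSIX tr is available: delete the newlines before counting.

```shell
#!/bin/bash
# Toy FASTA data (made up) so the sketch runs without the TAIR10 files.
toy_fasta='>seq1
ACGTACGT
ACGT
>seq2
GGGTTT'

# As in the exercise: every character on the sequence lines,
# including the newline that ends each line.
with_newlines=$(printf '%s\n' "$toy_fasta" | grep -v '^>' | wc -m | tr -d ' ')

# Deleting the newlines first leaves only the sequence characters.
bases_only=$(printf '%s\n' "$toy_fasta" | grep -v '^>' | tr -d '\n' | wc -m | tr -d ' ')

echo "with newlines: $with_newlines"
echo "bases only:    $bases_only"
```

For the toy data the two counts differ by three, one for each sequence line.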


2.7) What percentage of the genome does the cdna represent?

bioinfo@ubuntu:~/BioinfoCourse/data$ grep -v '>' TAIR10_cdna_20101214 | wc -m

65698389 (65 Mb, around 54% of the genome)
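The percentage can be computed in the terminal from the two character counts reported above. Bash arithmetic is integer-only, so this sketch hands the division to awk. The variable names are chosen here for illustration.

```shell
#!/bin/bash
# Character counts reported by the two 'wc -m' commands in this handout.
genome_chars=120654532
cdna_chars=65698389

# awk performs the floating-point division; LC_ALL=C keeps '.' as the
# decimal separator regardless of the system locale.
percent=$(LC_ALL=C awk -v c="$cdna_chars" -v g="$genome_chars" \
    'BEGIN { printf "%.1f", 100 * c / g }')
echo "cdna / genome = ${percent}%"
```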



2.8) How many 'kinases' does Arabidopsis have in version TAIR10? Create a file
(name: TAIR10_kinases.txt) with 2 columns: all the Arabidopsis thaliana
sequence accessions (IDs) for these kinases in the 1st column and their descriptions in
the 2nd one.

bioinfo@ubuntu:~/BioinfoCourse/data$ grep -c 'kinase' TAIR10_cdna_20101214

1762

bioinfo@ubuntu:~/BioinfoCourse/data$ grep 'kinase' TAIR10_cdna_20101214 | sed -r 's/^>//' | sed -r 's/\s+\|\s+/\|/g' | cut -d '|' -f1,3 | sed -r 's/\|/\t/' > TAIR10_kinases.txt
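The pipeline above is easier to follow on a couple of lines than on the full file. The sketch below runs the same steps on two made-up header lines. It assumes the headers have the layout ">ID | Symbols: ... | description | location", which is what the choice of fields 1 and 3 implies, and that GNU sed is in use (for -r, \s and \t). The IDs and descriptions are invented.

```shell
#!/bin/bash
# Two made-up header lines in the assumed TAIR10 cdna layout.
headers='>AT0G00001.1 | Symbols: TOY1 | toy protein kinase | chr1:1-100 FORWARD LENGTH=100
>AT0G00002.1 | Symbols: TOY2 | toy transporter | chr1:200-300 FORWARD LENGTH=101'

# Steps, in order:
#   1. keep only lines mentioning 'kinase'
#   2. drop the leading '>'
#   3. squeeze ' | ' separators down to a bare '|'
#   4. keep field 1 (ID) and field 3 (description)
#   5. turn the remaining '|' into a tab
kinase_table=$(printf '%s\n' "$headers" \
    | grep 'kinase' \
    | sed -r 's/^>//' \
    | sed -r 's/\s+\|\s+/\|/g' \
    | cut -d '|' -f1,3 \
    | sed -r 's/\|/\t/')

printf '%s\n' "$kinase_table"
```

Only the first toy header mentions 'kinase', so the result is a single line with the ID and the description separated by a tab.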



3. Download using any web browser the microarray data for the following
experiments. You can find them at http://www.ebi.ac.uk/arrayexpress/browse.html.
Download the data for:



Species: Arabidopsis thaliana,

Platform: Affymetrix GeneChip Arabidopsis Genome [ATH1-121501],

Title: Analysis of the effect of auxin biosynthesis inhibitor L-AOPP on the gene expression profile (whole plant).

File1: E-GEOD-30093.processed.1.zip

File2: E-GEOD-30093.sdrf.txt

File3 (under Array design A-AFFY-2 link): A-AFFY-2.adf.txt


3.1) Move all the files from the Downloads folder (in your home dir) to the
BioinfoCourse/data/ dir.

bioinfo@ubuntu:~/BioinfoCourse/data$ cd /home/bioinfo/Downloads/

bioinfo@ubuntu:~/Downloads$ mv E-GEOD-30093.* /home/bioinfo/BioinfoCourse/data/

bioinfo@ubuntu:~/Downloads$ mv A-AFFY-2.adf.txt /home/bioinfo/BioinfoCourse/data/


3.2) Unzip the file.

bioinfo@ubuntu:~/Downloads$ cd /home/bioinfo/BioinfoCourse/data/

bioinfo@ubuntu:~/BioinfoCourse/data$ unzip E-GEOD-30093.processed.1.zip

Archive: E-GEOD-30093.processed.1.zip

inflating: GSM744725_sample_table.txt

inflating: GSM744724_sample_table.txt

inflating: GSM744723_sample_table.txt

inflating: GSM744722_sample_table.txt

inflating: GSM744721_sample_table.txt

inflating: GSM744720_sample_table.txt

inflating: GSM744719_sample_table.txt

inflating: GSM744718_sample_table.txt

inflating: GSM744717_sample_table.txt

inflating: GSM744716_sample_table.txt

inflating: GSM744715_sample_table.txt

inflating: GSM744714_sample_table.txt



Write a bash script (name clean_probes.sh) to clean the probes, keeping only the
probes with a P-VALUE < 0.05.

Open a file with the name clean_probes.sh and write:

#!/bin/bash

echo "Running clean probes";

date;

grep -v 'ID_REF' "$1" | awk '{ if ($4 < 0.05) print $0 }' > "$1.clean"

date;

echo "Done";

Change the permissions to make it executable:

bioinfo@ubuntu:~/BioinfoCourse/data$ chmod 755 clean_probes.sh

Execute it as:

bioinfo@ubuntu:~/BioinfoCourse/data$ ./clean_probes.sh GSM744714_sample_table.txt
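To check what the filter inside clean_probes.sh keeps, it can be tried on a tiny table first. The sketch below assumes, as the script's use of $4 implies, that the P-value is the fourth whitespace-separated column and that the header line contains ID_REF. The probe names, values and header labels are made up for illustration.

```shell
#!/bin/bash
# Made-up sample table with the assumed four-column layout.
tmp_dir=$(mktemp -d)
sample="$tmp_dir/toy_sample_table.txt"
printf 'ID_REF\tVALUE\tABS_CALL\tP-VALUE\n'  > "$sample"
printf 'probe_a\t120.5\tP\t0.001\n'         >> "$sample"
printf 'probe_b\t15.2\tA\t0.40\n'           >> "$sample"
printf 'probe_c\t88.0\tP\t0.049\n'          >> "$sample"

# Same filter as in clean_probes.sh: drop the header line,
# then keep rows whose fourth column is below 0.05.
grep -v 'ID_REF' "$sample" | awk '{ if ($4 < 0.05) print $0 }' > "$sample.clean"

kept=$(wc -l < "$sample.clean" | tr -d ' ')
kept_ids=$(cut -f1 "$sample.clean" | tr '\n' ' ')
echo "kept $kept probes: $kept_ids"

rm -r "$tmp_dir"
```

In this toy table probe_b has a P-value of 0.40 and is dropped, while the other two rows are kept.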