BIT815 Notes on R analysis of RNA-seq data

blareweyrSoftware and s/w Development

Dec 13, 2013 (3 years and 10 months ago)

81 views

BIT815




Notes on R analysis of RNA
-
seq data


Note regarding installation of R packages in the Linux environment:

The RSQLite package and the Bioconductor packages to be used in the Exercises are all installed
by the installBioC.R

script, available from the course webpage. The following is provided as
background information for those interested in learning more about using R in the Linux
environment.

If the packages installed in R are to be available to all users on a Linux system,

then R must be
run with root privileges so the packages can be installed in the system library.
This is not critical
for the typical EC2 session, because generally anyone who uses an EC2 instance is logged in as
the ‘ubuntu’ user, so packages installed in

the ubuntu user’s home directory are available for use.

To run R with root privileges, start it using the command:


sudo R

and then install the desired packages, either from the R repository using (for example):


install.packages(“RSQLite”,dependenc
ies=TRUE)

or from the Bioconductor repository using (for example):


source("http://www.bioconductor.org/biocLite.R")


biocLite("GenomicFeatures")

If R is started without root privileges (ie no sudo command before R), then packages must be

installed i
n the user's home directory (where the user has write permission), and those packages
are not available for other users on the system.


Analysis of RNA
-
Seq data using R:

1.

The analysis of RNA
-
seq data begins with aligning the sequence reads to some reference

sequence, either a reference genome sequence or a reference transcriptome previously
assembled from sequence reads using the Trinity software package or another appropriate
assembly method.

Aligning RNA
-
seq reads to a genomic reference is best done with a
n
alignment program specifically designed for RNA
-
seq analysis, because such programs are
designed to deal with the presence of introns in genomic DNA sequences that are not present
in the RNA
-
seq reads. See
STAR: ultrafast universal RNA
-
seq aligner.

Dobin

et al,
Bioinformatics
29
:15
-
21, 2013

for an example of one such specialized alignment program.

2.

The next step is to obtain counts of the numbers of reads that map to specific features


genes, tra
nscripts, or exons. The choice of which type of feature to use in counting reads is an
important part of the experimental design, and depends on the biological question of interest.

a.

Counting reads that map to annotated genes means alternative splicing eve
nts will not be
counted


all the reads that map to a specific region of genomic DNA will be assigned to
the gene annotated
at that site.

b.

Counting reads that map to annotated transcripts allows detection of previously
-
known
and annotated alternative splic
ing events, but does not allow detection of novel events.

c.

Counting reads that map to annotated exons allows detection of alternative splicing
events, regardless of previous annotation, provided they involve the presence or absence
of annotated exons.

3.

The f
inal step is to use the read counts as input to a statistical analysis that can model the
variation in read counts as a function of both technical and biological variation, and provide
insight into which genes, transcripts, or exons are likely to be differ
entially
-
represented in the
read count data.


We will use the GenomicFeatures package of Bioconductor to create a database containing
information on Arabidopsis transcripts

(a “TranscriptDb”)
, for use in analysis of the reads
sampled from the experiment co
mparing bacteria
-
inoculated versus mock
-
inoculated
Arabidopsis plants (see
http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP004047

for a
description of the experiment
, and the
link

C
reati
ng
RNAseq_
Sample
Dataset
.pdf

on the BIT815
course website
).

To be able to save a TranscriptDb

once it is created, it is necessary to have SQLite installed

and linked to R through the R
SQL
ite package. On Ubuntu, installing SQLite uses:


sudo apt
-
get install sqlite

Installing the RSQL
ite package
can be

done from an R prompt, using:


install.packages("RSQLite",dep=T)

As noted above, this should be done in an R session that is running with root privileges if the
package is to be available to all users on a Linux system.


Exercises

1.

Download the installBioC.R script from the course websi
te to an EC2 instance using a
browser, or by executing the command

wget http://www4.ncsu.edu/~rosswhet/BIT815/Spring2013/scripts/installBioC.R

2.


Execute the R script as an R batch job with root privileges to
install the Bioconductor
packages

RSQLite,
biomaR
t,
ShortRead
, GenomicFeatures, and DESeq
.

Start R, and l
oad the
packages for use.

sudo R

CMD BATCH installBioC.R

#It will ask if you want to updat
e installed packages; answer No

library(RSQLite); library(biomaRt); library(GenomicFeatures); library(DESeq)

3.

U
se the ‘listMart
s
()’ command to see a list of databases available through Biomart.

listMarts()


Read the help file on the ‘makeTranscriptDbFromBiomart()’ command to learn which Biomart
databases can be used to build a TranscriptDb


this information is at
the very bottom of the
help file.

?makeTranscriptDbFromBiomart


4.

Use the makeTranscriptDbFromBiomart to create a transcript database from the
Arabidopsis TAIR10 gene annotation dataset
“athaliana_eg_gene”
in the ‘plants_mart_1
6

database.

transcript
Db

<
-

makeTranscriptDbFromBiomart(
biomart="plants_mart_1
6
",
dataset="athaliana_eg_gene")


5.

Look at the properties of the transcript database you created


how many transcripts are
represented? How many exons?

metadata(transcriptDb)


6.

Read the help file on the ‘tra
nscripts()’ command

7.

Use the ‘transcripts()’ command to extract
the set

of
all
transcripts from
chromosome 5 (as a
GRanges object) out of

the
database.

chr5t
xpts <
-

transcripts(transcriptDb
,vals=list(tx_chrom=
"
5
"
)
)


8.

Save the database as an

SQLite file for future use.

saveFeatures(transcriptDb,file=
"
AtDB.sqlite
"
)


9.

Look at the GRanges object to see how the chromosome name is represented


what is the
format?

You can see an overview of the object by typing its name at a command prompt:

chr5txpts


10.

T
he
name of the reference sequence in the
SAM files
used for the RNA
-
seq analysis is based
on the Genbank record for the
Arabidopsis
chromosome 5 DNA sequence.

You can
download the SAM files from the S3 bucket using the wget commands given in t
he
RNAseq.sh script available on the course website


save the script file to an EC2 instance,
open it with a text editor, and copy the
first 15 lines, including the commands to create the
/mnt/data directory, change permissions to allow read
-
write access
to all users, and the six
lines of
wget commands

to a terminal window to download the SAM files.


11.

KEY POINT
:

The reference sequence names in the SAM alignment files MUST
MATCH
the names in the R transcript database object to be used for analysis of those a
lignment
results. The reason is obvious if you image aligning reads to a transcriptome assembly with
50,000 different sequences


each sequence must have a unique name in both the alignment
file and the transcript database so R can make the connection betw
een the two datasets. We
will use the stream editor command
sed

to change the name of the reference sequence in the
SAM files to match that found in the transcript database in R.

The output of this command
can be piped directly into the SAMtools commands t
o avoid the need to write out
intermediate files (which can occupy large amounts of disk space).


12.

The SAM files must also be converted to BAM files, sorted, and indexed before they can be
imported into R. The SAMtools program provides separate functions fo
r each of these steps,
but the functions can be linked together on the command l
ine using the pipe operator | as
noted above.

Go to the SAMtools manual (
http://samtools.sourceforge.net/samtools
.shtml
)
and use the information there to understand the meaning of the following command chain:

sed
-
e 's/gi|240256493|ref|NC_003076.8|/5/' cont1.sam | samtools view
-
bhSu
-
F 0x0004
-

| samtools sort
-

cont1


The RNAseq.sh shell script

run
s

this command ch
ain for all six SAM files in the /
mnt/data

directory.


13.

The modified files should now be ready for importing into R and analysis. Start another R
session (or return to the terminal window where R is running), and look up the readAligned
command.

14.

Load the BA
M files into R using readAligned().

15.

Convert the AlignedRead objects produced by the readAligned() import process into
GRanges objects using as(x, “GRanges”)
, and set the ‘strand’ variable for each aligned read
to “*”, because the library preparation method

used for these reads was not strand
-
specific,
so there is no useful biological information in that variable.

16.

Look up the countOverlaps() function in R to see what it does.

17.

Run countOverlaps() on each of the GRanges objects of aligned reads, using the chr5
txpts
object as a reference.

18.

Combine the six vectors of read counts into a dataframe.

19.

Create a vector of factors that define the experimental treatment of each column in the
dataframe.

20.

Look up the newCountDataSet() command from the DESeq package.

21.

Use newCo
untDataSet() to convert the dataframe and vector of factors into a CountDataSet.

22.

Use estimateSizeFactors() to
estimate the size of each of the six samples of reads

23.

Use estimate
Dispers
ions() to estimate the variance among replicate samples

24.

Use nbinomTest()
to run test the significance of differential expression for all transcripts.

25.

To add gene
identifier
s to the table of results, use the transcriptsBy() command to recover
information from the AtDb database into a list. What type of output does this produce?

26.

Recover the names of those identifiers from that list in a vector

27.

Convert the vector to a list for use in the transcripts() command

28.

Recover the transcript_name field from the AtDb database using transcripts().

Code examples

1. library(RSQLite
); library(
biomaRt
); library(ShortRead)
;
library(
GenomicFeatures
)
; library(DESeq)

2.

listMart()

3.

?makeTranscriptDbFromBiomart

4.

At
Db <
-

makeTranscriptDbFromBiomart(biomart="plants_mart_10",
dataset="athaliana_eg_gene")

5. At
Db

6.

?transcripts()

7.

chr5txpts <
-

tr
anscripts(At
Db,vals <
-

list(tx_chrom = "5"))

8.

sa
veFeatures(AtGeneDb,file="At
Db.sqlite")

9. head chr5txpts

10.
head /usr/local/data/RNAseq/cn1ln7.sam

11.
sed
-
e 's/gi|240256493|ref|NC_003076.8|/5/'

cn1ln7.sam >

/dev/null

12.
mkdir data


cd data

sed

-
e 's/gi|240256493|ref|NC_003076.8|/5/'
/usr/local/data/RNAseq/cn1ln7.sam | /usr/local/bin/samtools/samtools
view
-
bhSu
-
F 0x0004
-

| /usr/local/bin/samtools/samtools sort
-

c1l7


/usr/local/bin/samtools/samtools index c1l7.bam



sed
-
e 's/gi|240256493
|ref|NC_003076.8|/5/'
/usr/local/data/RNAseq/cn2ln8.sam | /usr/local/bin/samtools/samtools
view
-
bhSu
-
F 0x0004
-

| /usr/local/bin/samtools/samtools sort
-

c2l8


/usr/local/bin/samtools/samtools index c2l8.bam



sed
-
e 's/gi|240256493|ref|NC_003076.8|/5
/'
/usr/local/data/RNAseq/cn3ln1.sam | /usr/local/bin/samtools/samtools
view
-
bhSu
-
F 0x0004
-

| /usr/local/bin/samtools/samtools sort
-

c3l1


/usr/local/bin/samtools/samtools index c3l1.bam




sed
-
e 's/gi|240256493|ref|NC_003076.8|/5/'
/usr
/local/data/RNAseq/ts1ln4.sam | /usr/local/bin/samtools/samtools
view
-
bhSu
-
F 0x0004
-

| /usr/local/bi
n/samtools/samtools sort
-

t1l4



/usr/local/bin/samtools/samtools index t1l4.bam



sed
-
e 's/gi|240256493|ref|NC_003076.8|/5/'
/usr/local/data/RNAseq/ts
2ln6.sam | /usr/local/bin/samtools/samtools
view
-
bhSu
-
F 0x0004
-

| /usr/local/bin/samtools/samtools sort
-

t2l6


/usr/local/bin/samtools/samtools index t2l6.bam


sed
-
e 's/gi|240256493|ref|NC_003076.8|/5/'
/usr/local/data/RNAseq/ts3ln2.sam | /usr/loca
l/bin/samtools/samtools
view
-
bhSu
-
F 0x0004
-

| /usr/local/bin/samtools/samtools sort
-

t3l2


/usr/local/bin/samtools/samtools index t3l2.bam


13. ?readAligned

14.

c1 <
-

rea
dAligned("c
ont
1.bam",type="BAM")

c2 <
-

rea
dAligned("c
ont2
.bam",type="BAM")

c3 <
-

rea
dAligned("c
ont3
.bam",type="BAM")

t1 <
-

readAligned("t
est1
.bam",type="BAM")

t2 <
-

readAligned("t
est2
.bam",type="BAM")

t3 <
-

readAligned("t
est3
.bam",type="BAM")


15.

c1gr <
-

as(c1, "GRanges")

strand(c1gr) <
-

"*"

c2gr <
-

as(c2, "GRanges")

strand(c2gr) <
-

"*"

c3gr <
-

as(c3, "GRanges")

strand(c3gr) <
-

"*"

t1gr <
-

as(t1, "GRanges")

strand(t1gr) <
-

"*"

t2gr <
-

as(t2, "GRanges")

strand(t2gr) <
-

"*"

t3gr <
-

as(t3, "GRanges")

strand(t3gr) <
-

"*"


16. countOverlaps()


17.

c1.counts=countOverlaps(
chr5txpts,c1gr)

c2.counts=countOverlaps(chr5txpts,c2gr)

c3.counts=countOverlaps(chr5txpts,c3gr)

t1.counts=countOverlaps(chr5txpts,t1gr)

t2.counts=countOverlaps(chr5txpts,t2gr)

t3.counts=countOverlaps(chr5txpts,t3gr)


18.
all.counts <
-

data.frame(c1=c1.coun
ts,

c2=c2.counts,

c3=c3.counts,

t1=t1.counts,

t2=t2.counts,

t3=t3.counts)


19.
trtmnts <
-

c(
rep(
"cntl",
3),rep(
"test",
3
)
)

20. newCountDataSet()

21. CData <
-

newCountDataSet(all.
counts,
trtmnts
)

22.

CData

<
-

estimateSizeFactors(
CData
)

23.

CData

<
-

estimate
Dispersion
s(
CData
)

24.

res
ult

<
-

nbinomTest(
CData
, "
test
", "
cntl
")

25.
GRList <
-

transcriptsBy(AtDb, by = "gene")

26.
tx_ids <
-

names(GRList)

27.
vals <
-

list(tx_id=tx_ids)

28.
txs <
-

transcripts(AtDb, vals, columns = c("tx_id", "tx_name"))