04 QC of Illumina data - ARK Genomics

Oct 4, 2013

Quality Control of Illumina Data

Mick Watson

Director of ARK-Genomics

The Roslin Institute

QUALITY SCORES

Quality scores


The sequencer outputs base calls at each position of a read


It also outputs a quality value at each position


This relates to the probability that that base call is incorrect


The most common quality value is the Sanger Q score, or Phred score


Q_sanger = -10 * log10(p)


Where p is the probability that the call is incorrect


If p = 0.05, there is a 5% chance, or 1 in 20 chance, it is incorrect


If p = 0.01, there is a 1% chance, or 1 in 100 chance, it is incorrect


If p = 0.001, there is a 0.1% chance, or 1 in 1000 chance, it is incorrect


Using the equation:


p = 0.05, Q_sanger = 13


p = 0.01, Q_sanger = 20


p = 0.001, Q_sanger = 30

For the geeks….


In R, you can investigate this:


# Convert an error probability p into a Sanger (Phred) Q score
sangerq <- function(x) {return(-10 * log10(x))}

sangerq(0.05)

sangerq(0.01)

sangerq(0.001)


# Plot Q score against error probability
plot(seq(0,1,by=0.00001), sangerq(seq(0,1,by=0.00001)), type="l")


The plot

For the geeks….


And the other way round….


# Convert a Q score back into an error probability
qtop <- function(x) {return(10^(x/-10))}

qtop(30)

qtop(20)

qtop(13)


# Plot error probability against Q score, from Q40 down to Q1
plot(seq(40,1,by=-1), qtop(seq(40,1,by=-1)), type="l")


The important stuff


Q30: 1 in 1000 chance the base is incorrect


Q20: 1 in 100 chance the base is incorrect
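
As a quick check on these two thresholds, the following minimal R sketch tabulates a few Q scores against their error rates. Q20 and Q30 come from the slide; Q10 and Q40 are added here only for context.

```r
# Q scores to tabulate (Q10 and Q40 are extra values chosen for illustration)
q <- c(10, 20, 30, 40)

# Invert Q = -10 * log10(p) to recover the error probability
p <- 10^(-q / 10)

# Express each probability as "1 in N"
q_table <- data.frame(Q = q, p = p, one_in = round(1 / p))
print(q_table)
```

Each step of 10 in Q corresponds to a ten-fold drop in the chance that the base call is wrong.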

QUALITY ENCODING

Quality Encoding


Bioinformaticians do not like to make your life easy!


Q scores of 20, 30 etc. take two digits



Bioinformaticians would prefer they only took one



In computers, letters have a corresponding ASCII code:




Therefore, to save space, we convert the Q score (two digits) to a single letter using this scheme


The process in full


p (probability base is wrong) : 0.001


Q (-10 * log10(p)) : 30


Add 33 : 63


Encode as character : ?


p        Q     Code

0.05     13    .

0.01     20    5

0.001    30    ?
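
The encoding steps above can be sketched in R as a single function. The name p2code is a hypothetical helper, not part of the original slides, and it assumes the offset of 33 used in the worked example.

```r
# Hypothetical helper: error probability -> single quality character (offset 33)
p2code <- function(p) {
  q <- round(-10 * log10(p))  # Q score, rounded to a whole number
  intToUtf8(q + 33)           # add 33, then convert the ASCII code to a character
}

p2code(0.05)
p2code(0.01)
p2code(0.001)
```

The three calls should reproduce the Code column of the table.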

For the geeks….


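# Convert a quality character into a Q score (offset 33)
code2Q <- function(x) { return(utf8ToInt(x) - 33) }

code2Q(".")

code2Q("5")

code2Q("?")


# Convert a quality character into an error probability
code2P <- function(x) { return(10^((utf8ToInt(x) - 33) / -10)) }

code2P(".")

code2P("5")

code2P("?")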
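
In a real read, the quality values arrive as one string with one character per base. The sketch below decodes a whole string in one go; the string itself is made up for illustration, and the offset of 33 is assumed as above.

```r
# Hypothetical quality string, one character per base of the read
qual <- "??55.."

# utf8ToInt() returns one ASCII code per character, so the arithmetic is vectorised
q <- utf8ToInt(qual) - 33
p <- 10^(q / -10)

q
p
```

Because utf8ToInt() accepts multi-character strings, the same approach decodes a full-length read without a loop.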

QC OF ILLUMINA DATA

FastQC


FastQC is a free piece of software


Written by the Babraham Bioinformatics group


http://www.bioinformatics.babraham.ac.uk/projects/fastqc/


Available on Linux, Windows etc.


Command-line or GUI
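
For the command-line route, one way to launch FastQC from R is sketched below. The file name and output directory are hypothetical placeholders, and the sketch assumes the fastqc executable is on your PATH; it only prints the command if either is missing.

```r
# Hypothetical input file and output directory; replace with your own
fastq_file <- "sample_R1.fastq.gz"
out_dir <- "fastqc_reports"

# Argument vector equivalent to: fastqc --outdir fastqc_reports sample_R1.fastq.gz
fastqc_args <- c("--outdir", out_dir, fastq_file)

# Only attempt the run if fastqc is installed and the input file exists
if (nzchar(Sys.which("fastqc")) && file.exists(fastq_file)) {
  dir.create(out_dir, showWarnings = FALSE)  # FastQC expects the output directory to exist
  system2("fastqc", args = fastqc_args)
} else {
  message("Not run; command would be: fastqc ", paste(fastqc_args, collapse = " "))
}
```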

Read the documentation

Follow the course notes

Per base sequence quality


One of the most important plots from FastQC


Plots a box at each position


The box shows the distribution of quality values at that position across all reads
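
To see what such a plot summarises, here is a minimal R sketch that draws one box per read position. The three quality strings are invented for illustration, and the offset of 33 is assumed; this is not how FastQC itself is implemented.

```r
# Hypothetical quality strings from three reads of equal length
quals <- c("IIIIIHHGF5", "IIIIHHGFD.", "IIIHHGFDB+")

# Decode each string to Q scores; rows are reads, columns are positions
qmat <- t(sapply(quals, function(s) utf8ToInt(s) - 33, USE.NAMES = FALSE))

# boxplot() on a matrix draws one box per column, i.e. one per read position
boxplot(qmat, xlab = "Position in read", ylab = "Q score")
```

In this toy data the boxes drop towards the end of the read, the pattern that usually signals declining quality in later cycles.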

Obvious problems

Less obvious problems

Really bad problems

Other useful plots


Per base N content


May identify cycles that are unreliable



Over-represented sequences


May identify Illumina adapters and primers
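
Both ideas can be illustrated with a few lines of R. The reads below are made up, and the sketch is a simplification for teaching purposes rather than a description of FastQC's own method.

```r
# Hypothetical reads of equal length (not real data)
reads <- c("ACGTNACGT", "ACGTNACGA", "ACGTAACGT", "ACGTNACGT")

# One row per read, one column per position
bases <- do.call(rbind, strsplit(reads, ""))

# Fraction of reads with an N at each position; a spike points to an unreliable cycle
n_frac <- colMeans(bases == "N")
n_frac

# Count identical sequences, most frequent first
seq_counts <- sort(table(reads), decreasing = TRUE)
seq_counts
```

Here position 5 stands out with N in three of four reads, and one sequence appears twice. On real data, the most frequent sequences are the ones worth comparing against known adapter and primer sequences.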