# 04 QC of Illumina data - ARK Genomics


Oct 4, 2013


Quality Control of Illumina Data

Mick Watson

Director of ARK-Genomics

The Roslin Institute

QUALITY SCORES

Quality scores

The sequencer outputs base calls at each position of a read

It also outputs a quality value at each position

This relates to the probability that that base call is incorrect

The most common quality value is the Sanger Q score, or Phred score

$$Q_{\text{sanger}} = -10 \log_{10}(p)$$

Where $p$ is the probability that the call is incorrect

If $p = 0.05$, there is a 5% chance, or 1 in 20 chance, it is incorrect

If $p = 0.01$, there is a 1% chance, or 1 in 100 chance, it is incorrect

If $p = 0.001$, there is a 0.1% chance, or 1 in 1000 chance, it is incorrect

Using the equation:

$p = 0.05$, $Q_{\text{sanger}} \approx 13$

$p = 0.01$, $Q_{\text{sanger}} = 20$

$p = 0.001$, $Q_{\text{sanger}} = 30$

For the geeks….

In R, you can investigate this:

    # Convert an error probability to a Sanger (Phred) Q score
    sangerq <- function(x) {return(-10 * log10(x))}

    sangerq(0.05)
    sangerq(0.01)
    sangerq(0.001)

    # Q score as a function of p over the range 0 to 1
    plot(seq(0,1,by=0.00001), sangerq(seq(0,1,by=0.00001)), type="l")

The plot

For the geeks….

And the other way round….

    # Convert a Q score back to an error probability
    qtop <- function(x) {return(10^(x/-10))}

    qtop(30)
    qtop(20)
    qtop(13)

    # Error probability for Q scores from 40 down to 1
    plot(seq(40,1,by=-1), qtop(seq(40,1,by=-1)), type="l")

The important stuff

Q30: 1 in 1000 chance base is incorrect

Q20: 1 in 100 chance base is incorrect
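
To make these two thresholds concrete, the sketch below sums the per-base error probabilities of a read to get the expected number of wrong calls. The function name `expected_errors` and the 100-base reads are assumptions made for illustration; they are not from the slides.

```r
# Hypothetical helper: expected number of incorrect calls in one read,
# given its per-base Q scores (p = 10^(-Q/10), summed over positions).
expected_errors <- function(q) {
  sum(10^(-q / 10))
}

q20_read <- rep(20, 100)  # illustrative 100-base read, every base at Q20
q30_read <- rep(30, 100)  # same length, every base at Q30

expected_errors(q20_read)  # about 1 expected error
expected_errors(q30_read)  # about 0.1 expected errors
```

On this toy example, a 100-base read at Q20 throughout carries roughly one expected error, while the same read at Q30 carries roughly a tenth of one.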

QUALITY ENCODING

Quality Encoding

Bioinformaticians do not like to make your life easy!

Q scores of 20, 30 etc. take two digits

Bioinformaticians would prefer they only took one

In computers, letters have a corresponding ASCII code:

Therefore, to save space, we convert the Q score (two digits) to a single character using this scheme

The process in full

p (probability base is wrong): 0.001

Q (-10 * log10(p)): 30

Encode as character: ?

| p | Q | Code |
|---|---|---|
| 0.05 | 13 | . |
| 0.01 | 20 | 5 |
| 0.001 | 30 | ? |
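
The encoding direction (p to Q to character) can be sketched in R as below. The helper names `p2Q` and `Q2code` are hypothetical, and the sketch assumes the offset of 33 that the slides' own `code2Q` function subtracts when decoding.

```r
# Hypothetical helpers (names are not from the slides).
# p2Q: error probability -> Q score, rounded to a whole number.
p2Q <- function(p) {
  round(-10 * log10(p))
}

# Q2code: Q score -> single character, assuming an ASCII offset of 33.
Q2code <- function(q) {
  intToUtf8(q + 33, multiple = TRUE)
}

p <- c(0.05, 0.01, 0.001)
q <- p2Q(p)
data.frame(p = p, Q = q, Code = Q2code(q))
```

For these three probabilities the sketch should give the same Q scores and characters as the slide.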

For the geeks….

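    # Decode a quality character to its Q score (ASCII code minus 33)
    code2Q <- function(x) { return(utf8ToInt(x) - 33) }

    code2Q(".")
    code2Q("5")
    code2Q("?")

    # Decode a quality character straight to an error probability
    code2P <- function(x) { return(10^((utf8ToInt(x) - 33)/-10)) }

    code2P(".")
    code2P("5")
    code2P("?")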
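
In practice a read comes with a whole string of quality characters, one per base. As a minimal sketch, the same subtraction works on a full string, because `utf8ToInt()` returns one integer per character. The helper name `qual2Q` and the quality string are made up for illustration.

```r
# Hypothetical helper: decode a whole quality string in one call.
# utf8ToInt() returns one integer per character, so subtracting 33
# yields a vector of Q scores, one per base.
qual2Q <- function(qual) {
  utf8ToInt(qual) - 33
}

qual <- "II?5."   # made-up quality string for a 5-base read
q <- qual2Q(qual)
q                 # Q score at each position
10^(q / -10)      # matching error probabilities
```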

QC OF ILLUMINA DATA

FastQC

FastQC is a free piece of software

Written by Babraham Bioinformatics group

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Available on Linux, Windows etc.

Command-line or GUI
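
For the command-line route, one way to launch FastQC from an R session is sketched below. It assumes FastQC is installed and on the PATH; the file name `reads.fastq.gz` and the directory `fastqc_out` are hypothetical placeholders.

```r
# Hypothetical file and directory names; adjust to your own data.
fastq_file <- "reads.fastq.gz"
out_dir <- "fastqc_out"

# Arguments for the FastQC command line: output directory, then input file.
fastqc_args <- c("--outdir", out_dir, fastq_file)

# Only attempt the run if fastqc is on the PATH and the input exists.
if (nzchar(Sys.which("fastqc")) && file.exists(fastq_file)) {
  dir.create(out_dir, showWarnings = FALSE)  # output directory must exist first
  system2("fastqc", fastqc_args)
} else {
  message("fastqc or input file not found; command would be: fastqc ",
          paste(fastqc_args, collapse = " "))
}
```

The guard means the sketch only prints the command it would have run when FastQC or the input file is missing.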

Per base sequence quality

One of the most important plots from FastQC

Plots a box at each position

The box shows the distribution of quality values at that position across all reads
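
The idea behind this plot can be sketched in a few lines of R. This is not FastQC's own code, only an illustration on four made-up quality strings: decode each read, then draw one box per position.

```r
# Made-up quality strings for four 6-base reads (illustration only).
quals <- c("IIII?5", "IIIF?.", "IIH??5", "IIII5.")

# Decode each string to Q scores; rows = reads, columns = positions.
qmat <- t(sapply(quals, function(s) utf8ToInt(s) - 33, USE.NAMES = FALSE))

# One box per position, summarising quality across all reads.
boxplot(qmat, xlab = "Position in read", ylab = "Q score")

# The numbers behind the boxes: median quality at each position.
apply(qmat, 2, median)
```

In this toy data the medians fall towards the end of the read, which is the kind of pattern the plot is meant to expose.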

Obvious problems

Less obvious problems

Other useful plots

Per base N content

May identify cycles that are unreliable
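
As a rough sketch of what such a plot measures, the fraction of N calls at each position can be computed directly. The reads below are made up, and the approach assumes all reads have the same length.

```r
# Made-up reads of equal length (illustration only).
reads <- c("ACGTNA", "ACGANA", "TCGTNC", "ACGTGA")

# Split into a matrix of single bases; rows = reads, columns = positions.
bases <- do.call(rbind, strsplit(reads, ""))

# Fraction of reads with an N at each position (cycle).
n_frac <- colMeans(bases == "N")
n_frac

plot(n_frac, type = "h", xlab = "Position in read", ylab = "Fraction N")
```

A spike at a single position, as at position 5 in this toy data, points to a cycle where many base calls failed.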

Over-represented sequences

May identify
Illumina