AFFYMETRIX CNV ANALY
SIS
INTRODUCTION
Affymetrix SNP arrays produce .CEL files that store the intensity values for SNP or CNV markers. SNP markers are polymorphic
and
contain intensity values for both allele A and allele B. CNV markers are not polymorphic.
In order to provide a high level analysis on
these intensities, the following algorithms have been developed by Omicsoft:
1. A genotyping calling algorithm to make genotype calls from polymorphic SNP markers
2. A signal extraction algorithm to summarize p
robe intensity values into log ratios and B

Allele frequency
3. A segmentation algorithm based on log ratios, B

Allele frequency and SNP genotypes
This white paper describes the details of these algorithms.
SNP GENOTYPE CALLING
There are two strategies to
make genotype calls based on SNP intensity values. The first strategy uses only data from the working
experiment. For each SNP, the intensity values from allele A and allele B make a two dimensional cluster for the samples. The
normalized intensity values
should reveal three clusters representing three distinct genotypes. Missing calls can be made if the
clusters do not have a clear boundary. This method can produce accurate and reliable calls. However, there are two major
drawbacks of this approach: (1) th
e clustering algorithm does not work on small numbers of subjects (2) the computation is very
intensive for large number of subjects and usually this will force the users to do batches of smaller number of subjects. Ear
ly
algorithms provided by Affymetrix
used this approach.
The second strategy is based on previously constructed models. The genotyping model is applied to each Affymetrix CHIP and ca
lls
are made for each SNP. This method can accurately call both a single CHIP and thousands of CHIPs in reasona
ble amount of time. To
construct the SNP model, a training set is required along with matched genotype calls (these calls do not need to be 100% acc
urate
–
they just serve as a starting point). Both the BirdSeed algorithm and the Omicsoft algorithm use thi
s strategy.
For Affymetrix 500K and SNP6 platforms, Hapmap data (270 samples) have been made available and CEL files can be downloaded
from the Affymetrix web site. Initial genotype calls were made by Genotype Console 3.1.
SNP CEL PROCESSING
SNP CEL files
are loaded and probe level intensities are summarized. Geometric means (arithmetic means at the log level) are used
to calculate the mean intensity for each probe set. For polymorphic markers, the summary value for allele A and allele B are
calculated sepa
rately. Probeset level intensity values are converted to log2 scale.
INTENSITY MODEL CONS
TRUCTION
For each CEL file, the empirical distribution of the probeset intensities for allele A and allele B are calculated for polymo
rphic
markers and the empirical d
istribution of the probeset intensities for allele a is calculated for non

polymorphic markers. The overall
empirical distributions across all training samples are calculated and saved in the model. These three empirical distribution
(polymorphic allele A,
polymorphic allele B, nonpolymorphic allele A) are saved for normalization purpose.
The intensity level for each marker also needs to be stored in the model. For non

polymorphic markers, it is sufficient to calculate
and store the mean intensity value (mo
st non

polymorphic markers only contain one probeset). For polymorphic markers, the
intensity difference between the two alleles needs to be calculated before the pooled intensity level can be calculated. The
intensity
difference between the two alleles is
calculated based on the median intensity difference of heterozygous samples. The pooled
average intensity is calculated based on the adjusted signals (for allele A signals, the adjusted signals are just the origin
al signal; for
allele B signals, the adjus
ted signals are equal to the original signal subtracted by the intensity difference caused by alleles).
GENOTY
P
ING MODEL CONSTRUCTI
ON
A classification model needs to be constructed for each SNP marker. We use a bivariate normal mixture model to train and c
onstruct
the model. We assume that the two allele signals follow a bivariate normal distribution for each genotype and the variance is
the
same across all three genotypes. We also assume that the two alleles are independent which implies the covariance is
0 (the
variance for allele A and allele B are not assumed to be the same). Based on the training data, the following three steps are
used to
construct the model:
Step 1: based on the current genotype calling, calculate the means of allele A and allele B fo
r each genotype. The pooled variance for
allele A and allele B are also calculated. A total of 8 parameters are estimated.
Step 2: based on the parameter calculated in step 1, predict the genotype calls for the training data based on the mixture mo
del.
The
details of classification are described in the next section.
Step 3: repeat step 1 until the genotype calls are converged.
A small proportion of SNP markers (~0.01%) are not called with three genotypes based on the training set by genotype console
3.1.
Th
ese SNP markers require special handling. The details are not included in the white paper.
GENOTYPE CALLING
To make the genotype calls for new datasets, the CEL files are first summarized to probeset level intensity values. The quant
ile
normalization is pe
rformed on the probeset intensities for each category (polymorphic allele A, polymorphic allele B and non

polymorphic allele A).
For each marker, the probabilities for each model are evaluated and the most likely model is chosen to determine the genotype
.
Let s1 = normalized signal for allele A, s2 = normalized signal for allele B. First, calculate the following probabilities ba
sed on the SNP
model:
Pr(AAs1)=(Pr(s1AA)*Pr(AA))/(Pr(s1AA)*Pr(AA)+Pr(s1AB)*Pr(AB)+Pr(s1BB)*Pr(BB))
Pr(ABs1)=(Pr(s1AB)*Pr(AB))/(Pr(s1AA)*Pr(AA)+Pr(s1AB)*Pr(AB)+Pr(s1BB)*Pr(BB))
Pr(BBs1)=(Pr(s1BB)*Pr(BB))/(Pr(s1AA)*Pr(AA)+Pr(s1AB)*Pr(AB)+Pr(s1BB)*Pr(BB))
Pr(AAs2)=(Pr(s2AA)*Pr(AA))/(Pr(s2AA)*Pr(AA)+Pr(s2AB)*Pr(AB)+Pr(s2BB)*Pr(BB))
Pr(ABs2)=
(Pr(s2AB)*Pr(AB))/(Pr(s2AA)*Pr(AA)+Pr(s2AB)*Pr(AB)+Pr(s2BB)*Pr(BB))
Pr(BBs2)=(Pr(s2BB)*Pr(BB))/(Pr(s2AA)*Pr(AA)+Pr(s2AB)*Pr(AB)+Pr(s2BB)*Pr(BB))
Pr(AAs1,s2) = (Pr(s1,s2AA)*Pr(AA))/( Pr(s1,s2AA)*Pr(AA)+ Pr(s1,s2AB)*Pr(AB)+ Pr(s1,s2BB)*Pr(BB))
Pr(ABs1,s2) = (Pr(s1,s2AB)*Pr(AB))/( Pr(s1,s2AA)*Pr(AA)+ Pr(s1,s2AB)*Pr(AB)+ Pr(s1,s2BB)*Pr(BB))
Pr(BBs1,s2) = (Pr(s1,s2BB)*Pr(BB))/( Pr(s1,s2AA)*Pr(AA)+ Pr(s1,s2AB)*Pr(AB)+ Pr(s1,s2BB)*Pr(BB))
Define Pr(As1) = Pr(AA or ABs1), Pr(Bs2) = Pr(BB
or ABs2).
We also assume that the prior probabilities are uniform, i.e.,
Pr(AA)=Pr(AB)=Pr(BB). Finally we set a cutoff probability C (default = 0.60).
The following rules are applied to make the genotype calls:
1. If Pr(As1) >= 0.99 and Pr(Bs2) >= 0.99,
make an AB call. This rule is required to make heterozygous calls on markers with more
than two copies present such as AAB and AABB.
2. If Pr(ABs1,s2) >= C, make an AB call
3. If Pr(AAs1,s2) >= C, make an AA call
4. If Pr(BBs1,s2) >= C, make a BB call
5. Otherwise, make a missing call.
The genotype concordance rate between Omicsoft and BirdSeed is more than 99.99% for the SNP6 platform.
SIGNAL EXTRACTION
To make a powerful segmentation analysis on the Affymetrix SNP arrays, we extracted three types of i
nformation from the raw
probe intensities.
First, we need to calculate a Log2Ratio matrix, which contains the relative log2 intensity value compared to the training set
.
Second, we need to capture the intensity difference of two alleles in the B

Allele F
requency matrix. Illumina provides B
Allele Frequency as part of the genotyping procedure, and this quantity has provided useful information for CNV
segmentation. For Affymetrix, this quantity needs to be calculated as well.
Third, we also need the genoty
pe calls, in order to determine loss of heterozygosity regions.
For each CEL file, the probeset intensities are calculated first. Quantile normalization was performed to make the intensitie
s
comparable with the training set. If a marker has both allele A s
ignals and allele B
signals, allele B signals are adjusted at first by the
intensity model. The log2 level intensities are first transformed back to original scale, the sum of the two intensities are
calculated,
and then transformed back to log2 scale. The
difference of the log2 transformed signal and the median log2 intensities of the
training set are reported as log2ratio values.
For Illumina data, BAlleleFreq was automatically calculated by BeadStudio. For Affymetrix data, BAlleleFreq was calculated by
A
djustedAlleleB / (AdjustedAlleleA + AdjustedAlleleB), where AdjustedAlleleB is the normalized and adjusted allele B signal at
the
original scale, and AdjustedAlleleA is the normalized and adjusted allele A signal at the original scale. In order to generat
e
an
accurate B

Allele frequency, normalized signals are subject to background subtraction. The background signal for allele A was
estimated by the mean intensity of genotype BB, while the background signal for allele B was estimated by the mean intensity
o
f
genotype AA.
SEGMENTATION
Omicsoft has developed a sequential segmentation algorithm based on one dimensional ordered data. The sequential segmentation
algorithm consists of three steps.
The first step is called “Raw Partition”. Based on a relaxed crite
ria (e.g. p

value < 0.05 based on the t

test of two bounding
regions), the algorithm first find all possible breakpoints. Then the contiguous regions are merged from right to left, if th
e
test between contiguous regions does not generate a more significant
p

value than a given (more stringent) criteria (e.g. p

value < 0.0001 based on t

test of two bounding regions). At the end of step 1, all the break points will show significant
difference between the bounding regions.
The second step is called “Optimization”. Each break point was shifted from one position right of the starting position on
the left, to one position left of the ending position on the right. Tests are performed for each possible shift, and the new
breakpoi
nt position was set to the position with the best p

value. This will guarantee the break points are locally optimized.
The third step is called “Merging”. Smaller segments are merged into larger segments, if the length of a segment is less than
a cutoff,
or other quantities of the segments do not exceed the given criteria. This step will guarantee all the final segments
will satisfy the given conditions.
For CNV data, the following steps are performed to generate the final CNV segments:
(1) Sequential segm
entation was performed on the log2 ratio data only.
(2) For segments generated by (1), find possible LOH regions based on the genotype calls; make LOH segments if the homogzygos
ity
rate is larger than a given cutoff (e.g. 96%) and the marker number is bigg
er than a given cutoff (e.g. 30).
(3) For each segment, a variety of statistics are calculated as part of the segmentation report. Log2 ratio means, B

Allele Frequency
deviation means (defined as B

Allele frequency
–
0.5 for heterozygous and NaN for homozy
gous) and homozygosity rates, are used
to determine the copy number state of each segment.
The determination of the copy number state is difficult based on the current technology and Array Studio provides a variety o
f
graphic tools to allow users to manual
ly refine the process. Currently the following simple rules are used:
1. If the homozygosity rate is larger than the cutoff (96% by default) and the marker number is >= 5, call the segment 0 if t
he log2
ratio <=

3, or 1 if the log2 ration < =

1, otherwis
e call it LOH.
2. If the heterzygosity number of the segments is <= 5, do not make any calls. This is because the CN state calling is mainly
based on
the B

Allele frequency data, and insufficient heterozygous will not able to make an accurate call.
3. If t
he B

Allele frequency deviation mean is >= 0.21, make a call of 4 copies. Otherwise if the B

Allele frequency mean is >= 0.10,
make a call of 3 copies. If the B

Allele frequency is < 0.10, a call of “2” is made. Otherwise, no call is made.
Σχόλια 0
Συνδεθείτε για να κοινοποιήσετε σχόλιο