Oct 4, 2013

Lessons learnt from the 1000 Genomes Project about sequencing in populations

Gil McVean

Wellcome Trust Centre for Human Genetics and Department of Statistics, University of Oxford

Some questions


What has the 1000 Genomes Project told us about how to sequence (in) populations?

What has the 1000 Genomes Project told us about populations?

Samples for the 1000 Genomes Project

Major population groups comprised of subpopulations of c. 100 each

GBR, FIN, TSI, IBS, CEU, JPT, CHB, CHS, CDX, KHV, GWB, GHN, YRI, MAB, LWK, MXL, CLM, ASW, AJM, ACB, PEL, PUR

Samples from S. Asia

The role of the 1000G Project in medical genetics


A catalogue of variants


95% of variants at 1% frequency in populations of interest



A representation of ‘normal’ variation



A set of haplotypes for imputation into GWAS



A training ground for sequencing/statistical/computational technologies

Samples for the 1000 Genomes Project: Pilot

TSI*, CEU, JPT, CHB, CHS*, YRI, LWK*

* Exon pilot only

Population-scale genome sequencing

Haplotypes

2x

10x

What has the project generated?

>15 million SNPs, >50% of them novel

dbSNP entries increased by 70%

A huge increase in the set of structural variants

A robust and modular pipeline for analysis of population-scale sequence data

An efficient format for storing aligned reads and a set of tools to manipulate and view the files


SAM/BAM format for storing (aligned) reads

Bioinformatics (2009)
http://samtools.sourceforge.net
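To make the SAM text format concrete, here is a minimal sketch in Python that splits one alignment line into its 11 mandatory tab-separated fields and checks two FLAG bits. The `SamRecord` type, the helper names and the example read are illustrative assumptions and are not part of samtools itself; real pipelines would normally use samtools or a dedicated library rather than hand-written parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory fields of a SAM alignment line (hypothetical helper type)."""
    qname: str
    flag: int
    rname: str
    pos: int    # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse one non-header SAM line; optional tags after field 11 are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


def is_unmapped(rec: SamRecord) -> bool:
    """FLAG bit 0x4 marks a read that is unmapped."""
    return bool(rec.flag & 0x4)


def is_reverse_strand(rec: SamRecord) -> bool:
    """FLAG bit 0x10 marks a read aligned to the reverse strand."""
    return bool(rec.flag & 0x10)


# Made-up example read, for illustration only.
example = "read1\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII"
rec = parse_sam_line(example)
```

BAM stores the same records in a compressed binary form, so a sketch like this only applies to the text (SAM) representation.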

An information-rich format for storing generic haplotype/genotype data and tools for manipulating the files

http://vcftools.sourceforge.net
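As an illustration of what such a genotype format holds, the following Python sketch reads one VCF data line and computes the non-reference allele frequency from the GT field. The function names and the example record are assumptions made for this sketch, not part of vcftools; multi-allelic sites are simply pooled into "non-reference" here.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its fixed columns and per-sample GT strings."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 fixed columns")
    chrom, pos, var_id, ref, alt = fields[:5]
    genotypes = []
    if len(fields) > 9:
        # Column 9 (FORMAT) names the colon-separated per-sample keys.
        keys = fields[8].split(":")
        gt_index = keys.index("GT")
        genotypes = [sample.split(":")[gt_index] for sample in fields[9:]]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "genotypes": genotypes,
    }


def alt_allele_frequency(genotypes):
    """Fraction of called alleles that are non-reference; None if nothing is called."""
    alleles = []
    for gt in genotypes:
        # '|' separates phased alleles, '/' unphased ones; '.' is a missing call.
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                alleles.append(allele)
    if not alleles:
        return None
    return sum(allele != "0" for allele in alleles) / len(alleles)


# Made-up record with four samples, one of them uncalled.
example = "1\t12345\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1\t./."
record = parse_vcf_line(example)
freq = alt_allele_frequency(record["genotypes"])
```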

An understanding of the ‘rare functional variant load’ carried by individuals

c. 250 LOF / person

c. 75 HGMD DM

USH2A


Mutations cause Usher syndrome

66 missense variants in dbSNP

2/3 detected in 1000 Genomes Pilot

One HGMD ‘disease-causing’ variant homozygous in 3 YRI

Other reports indicate this is not a real disease-causing variant

Samples for the 1000 Genomes Project: Phase 1

GBR, FIN, TSI, CEU, JPT, CHB, CHS, YRI, LWK, MXL, CLM, ASW, PUR

Lessons learnt about sequencing in populations

Lesson 1. The low-coverage model works for variant discovery

A near complete record of common variants

CEU
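One way to see why pooling many low-coverage genomes can still find common variants is a toy calculation like the one below. It is a sketch under strong assumptions that are not taken from the slides: read depth per sample is Poisson, each read comes from either haplotype with equal probability, sequencing and mapping errors are ignored, and a variant counts as discovered once at least `min_reads` reads carry it anywhere in the sample set. The function name and the numbers used are hypothetical.

```python
import math


def detection_probability(allele_count, depth, min_reads=2):
    """Chance of seeing a variant in at least `min_reads` reads across all samples.

    Toy model: each variant-carrying haplotype contributes Poisson(depth / 2)
    reads, so the total number of variant reads is Poisson(allele_count * depth / 2).
    """
    lam = allele_count * depth / 2.0
    # Probability of observing fewer than `min_reads` variant reads.
    p_fewer = sum(
        math.exp(-lam) * lam ** i / math.factorial(i) for i in range(min_reads)
    )
    return 1.0 - p_fewer


# Hypothetical setting: 4x coverage per sample, variant present on 1, 2, 5 or 10
# of the sampled chromosomes.
for copies in (1, 2, 5, 10):
    print(copies, round(detection_probability(copies, depth=4), 3))
```

Under this model the detection probability depends only on the number of copies of the allele in the sample and the per-sample depth, which is why variants present in several individuals are found reliably even when each genome is covered thinly, while singletons are often missed.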

Lesson 2. The low-coverage model works for SNP genotyping

A set of accurate genotypes/haplotypes

CEU

Lesson 3. The genome has a large grey area where variant calling is hard

Lesson 4. Joint calling of different variant types substantially improves the quality of calls

Lesson 5. Managing uncertainty is important
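A common way to carry uncertainty through an analysis is to keep genotype likelihoods and posteriors rather than hard calls. The Python sketch below is a minimal illustration, assuming a simple binomial read model with a fixed per-read error rate and a Hardy-Weinberg prior; it is not the project's actual calling method, and the function names and default error rate are hypothetical.

```python
def genotype_likelihoods(n_ref, n_alt, error=0.01):
    """P(reads | genotype) for genotypes carrying 0, 1 or 2 alternate alleles.

    Assumes independent reads and a symmetric per-read error rate. The
    binomial coefficient is omitted because it is the same for all genotypes.
    """
    likelihoods = []
    for alt_copies in (0, 1, 2):
        frac = alt_copies / 2
        # Probability that a single read shows the alternate allele.
        p_alt = frac * (1 - error) + (1 - frac) * error
        likelihoods.append(p_alt ** n_alt * (1 - p_alt) ** n_ref)
    return likelihoods


def genotype_posteriors(n_ref, n_alt, alt_freq, error=0.01):
    """Posterior over the three genotypes under a Hardy-Weinberg prior."""
    priors = [(1 - alt_freq) ** 2, 2 * alt_freq * (1 - alt_freq), alt_freq ** 2]
    weights = [
        lik * prior
        for lik, prior in zip(genotype_likelihoods(n_ref, n_alt, error), priors)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Two alternate reads and no reference reads, at a rare and at a common site.
rare = genotype_posteriors(n_ref=0, n_alt=2, alt_freq=0.01)
common = genotype_posteriors(n_ref=0, n_alt=2, alt_freq=0.5)
```

With the same two reads, the most probable genotype differs between the rare and the common site, which is the kind of ambiguity that a single hard call would hide from downstream analyses.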

Lesson 6. Data visualisation is key

Lessons learnt about populations

Closely related populations can have substantially different rare variants

Spatial heterogeneity in non-genetic risk can differentially confound association studies for rare and common variants

Iain Mathieson

Thanks to the many...


Steering committee
Co-chairs: Richard Durbin and David Altshuler

Samples and ELSI Committee
Co-chairs: Aravinda Chakravarti and Leena Peltonen

Data Production Group
Co-chairs: Elaine Mardis and Stacey Gabriel

Analysis Group
Co-chairs: Gil McVean and Goncalo Abecasis
Subgroups in gene-targeted sequencing (Richard Gibbs) and population genetics (Molly Przeworski)

Structural Variation Group
Co-chairs: Matt Hurles, Charles Lee and Evan Eichler

DCC
Co-chairs: Paul Flicek and Steve Sherry