National Center for Biotechnology Information - Genetic Alliance

gooseliver, Biotechnology

22 Oct 2013

110101

NCBI

National Center for Biotechnology Information


Created by Public Law 100-607 in 1988 as part of the National Library of Medicine at NIH to:


Create automated systems for knowledge about molecular biology,
biochemistry, and genetics.


Perform research into advanced methods of analyzing and
interpreting molecular biology data.


Enable biotechnology researchers and medical care personnel to
use the systems and methods developed.


Builders and providers of GenBank, Entrez, BLAST, and PubMed. Online systems host about 1.8 million users per
day at peak rates of 3,200 web hits a second.


Center for basic research and training in computational
biology.


NCBI is the most heavily used site in
biomedicine. Why?

[Figure: NCBI Web Traffic, 1997-2006. Vertical axis: 100,000 to 700,000; horizontal axis: January 1998 to January 2006.]

722,000 Unique IPs a Day

91 Million Web Hits a Day

3200 Peak Web Hits a Second

1.5 Terabytes FTP a Day

1.8 Million Unique Users a Day


Data, the Next Intel Inside


Comparative Analysis of Genes Enables “Innovation in Assembly”

Human   638  RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC  697
Yeast   657  RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC  716
E.coli  584  RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA  642

Colon cancer gene sequence
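
The comparison on this slide can be checked numerically. The following is a minimal Python sketch, not NCBI code: it takes the three aligned segments as transcribed from the slide, marks the columns that are identical in all three organisms, and reports pairwise identity. The names `conserved_columns` and `pairwise_identity` are illustrative, and gaps are simply skipped when computing identity, which is one of several possible conventions.

```python
# Aligned segments as transcribed from the slide ('-' marks an alignment gap).
ALIGNMENT = {
    "Human":  "RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC",
    "Yeast":  "RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC",
    "E.coli": "RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA",
}


def conserved_columns(seqs):
    """Return the residue where all sequences agree and '.' everywhere else."""
    out = []
    for column in zip(*seqs):
        first = column[0]
        same = first != "-" and all(residue == first for residue in column)
        out.append(first if same else ".")
    return "".join(out)


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x == y for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    names = list(ALIGNMENT)
    print(conserved_columns(ALIGNMENT.values()))
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            pct = 100 * pairwise_identity(ALIGNMENT[first], ALIGNMENT[second])
            print(f"{first} vs {second}: {pct:.1f}% identical")
```

The conserved-column string makes the shared block IITGPNMGGKSTY stand out, which is the point the slide makes visually.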

[Figure: time scale marked at 3000 Myr, 1000 Myr, and 500 Myr; organisms shown: Human, Mouse, Fly, Worm, Yeast, Bacteria]


Entrez: Pathway to Discovery

[Diagram labels: MEDLINE abstracts; Nucleotide sequences; Protein sequences; Term frequency statistics; Nucleotide sequence similarity; Amino acid sequence similarity; Coding region features; Literature citations in sequence databases (shown twice)]


Entrez Increases Discovery Space

[Diagram labels: Nucleotide sequences; Protein sequences; Taxon; Phylogeny; 3-D Structure; MMDB; PubMed abstracts; Complete Genomes; PubMed; Entrez Genomes; Publishers; Genome Centers]


Elements of WGA (Whole Genome Association)


Phenotype Model


Genotype


Association between Phenotype Model and
Genotype
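
To make the third element concrete: one common way to test association between a case/control phenotype and a genotype is a chi-square test on allele counts. The sketch below is an illustration only. The slides do not say which test NCBI or GAIN applies, the function name `chi_square_2x2` is hypothetical, and the counts are invented.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom) for the 2x2 table

                 allele A   allele a
        cases       a          b
        controls    c          d
    """
    n = a + b + c + d
    row_cases, row_controls = a + b, c + d
    col_upper, col_lower = a + c, b + d
    if 0 in (row_cases, row_controls, col_upper, col_lower):
        raise ValueError("every row and column needs a nonzero total")
    return n * (a * d - b * c) ** 2 / (row_cases * row_controls * col_upper * col_lower)


# Hypothetical allele counts, invented for illustration only.
cases_A, cases_a = 60, 40
controls_A, controls_a = 40, 60

stat = chi_square_2x2(cases_A, cases_a, controls_A, controls_a)
# 3.841 is the standard 5% critical value of chi-square with 1 degree of freedom.
print(f"chi-square = {stat:.2f}, exceeds 3.841: {stat > 3.841}")
```

In a whole-genome scan the same test is repeated for every marker, so a much stricter threshold than the single-test value is needed.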


Elements of Phenotype


Protocols, Questionnaires, Documentation

Best explanation of measured attributes

Basically text, often on paper or scanned PDF

Often unavailable even to sponsoring IC


Data

Measured attributes in a square table

Row is individual

Column is measure

Column names often obscure (e.g. “HO112”)

In many formats (Excel, SAS, RDB, text)


Data Dictionary

Links Protocol element to Data column


Not generally available or widely usable
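
As a rough illustration of what a data dictionary does, the Python sketch below links column names in a square data table to the protocol element they come from. It is a sketch under stated assumptions, not NCBI's format: the `DictionaryEntry` class and the `describe` helper are hypothetical, and apart from the column name “HO112”, which appears on the slide, all forms, questions, and values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    column: str      # column name as it appears in the data table
    form: str        # protocol or questionnaire the measure comes from
    question: str    # human-readable wording of the measure


# "HO112" is the example column name from the slide; everything else is invented.
DATA_DICTIONARY = {
    entry.column: entry
    for entry in [
        DictionaryEntry("HO112", "Exam 1 questionnaire (hypothetical)",
                        "Hypothetical question text for HO112"),
        DictionaryEntry("AGE", "Enrollment form (hypothetical)",
                        "Age at enrollment, in years"),
    ]
}

# Square table: one row per individual, one column per measure.
rows = [
    {"HO112": "1", "AGE": "54"},
    {"HO112": "0", "AGE": "61"},
]


def describe(column):
    """Return a readable label for a column, or flag it as undocumented."""
    entry = DATA_DICTIONARY.get(column)
    if entry is None:
        return f"{column}: (no data dictionary entry)"
    return f"{column}: {entry.question} [{entry.form}]"


for column in rows[0]:
    print(describe(column))
```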




Organize Data First


Take Data in whatever format available


Automatically load columns and rows in Generic
Database


Automatically analyze content of columns


Review report with submitter
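
The loading and analysis steps above can be sketched as follows. This is a minimal illustration using Python's standard `csv` and `sqlite3` modules, not a description of NCBI's actual pipeline: the single `cell` table, the `load_generic` and `profile` functions, and the sample data are all assumptions made for the example.

```python
import csv
import io
import sqlite3
from collections import Counter

# Invented sample submission; real files arrive as Excel, SAS, RDB, or text.
SAMPLE = """ID,HO112,ETHNIC
1,54.5,NOT HISPANIC
2,61,CHINESE
3,,NOT HISPANIC
"""


def load_generic(conn, text):
    """Store every cell as (row number, column name, value) in one generic table."""
    conn.execute("CREATE TABLE cell (row_id INTEGER, col TEXT, value TEXT)")
    for row_id, record in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        conn.executemany(
            "INSERT INTO cell VALUES (?, ?, ?)",
            [(row_id, col, value) for col, value in record.items()],
        )


def is_number(value):
    """True if the text parses as a floating-point number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def profile(conn, col):
    """Summarize one column: missing count, guessed type, and value counts."""
    values = [v for (v,) in conn.execute("SELECT value FROM cell WHERE col = ?", (col,))]
    present = [v for v in values if v != ""]
    kind = "numeric" if present and all(is_number(v) for v in present) else "enumerated"
    return {
        "column": col,
        "missing": len(values) - len(present),
        "type": kind,
        "counts": Counter(present).most_common(),
    }


conn = sqlite3.connect(":memory:")
load_generic(conn, SAMPLE)
for col in ("ID", "HO112", "ETHNIC"):
    print(profile(conn, col))
```

Storing cells in one generic table means no schema has to be designed per submission; the per-column report is what would be reviewed with the submitter.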


Organize Documents Next


Manually link specific questions/measures to Data
column names


Send out for tagging into XForms XML



View as HTML document


Process for indexing


Generate views of specific questions across forms
and studies



Identify Conflicts
Between Data and
Metadata
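
One way to picture such conflicts is to compare what a data dictionary declares about each column with what the submitted rows actually contain. The Python sketch below is hypothetical throughout: the `DECLARED` layout, the `find_conflicts` function, and every value are invented for illustration and do not reflect NCBI's checks.

```python
# Metadata: what the data dictionary claims about each column (invented example).
DECLARED = {
    "SEX": {"type": "enumerated", "allowed": {"M", "F"}},
    "AGE": {"type": "numeric", "min": 0, "max": 120},
}

# Data: what was actually submitted (invented example).
ROWS = [
    {"SEX": "M", "AGE": "54"},
    {"SEX": "2", "AGE": "61"},                  # code not listed in the dictionary
    {"SEX": "F", "AGE": "unknown"},             # text in a numeric column
    {"SEX": "F", "AGE": "430", "BMI": "22.1"},  # out of range, undocumented column
]


def find_conflicts(rows, declared):
    """Return (row number, column, message) for each disagreement found."""
    conflicts = []
    for number, row in enumerate(rows, start=1):
        for column, value in row.items():
            spec = declared.get(column)
            if spec is None:
                conflicts.append((number, column, "column not in data dictionary"))
            elif spec["type"] == "enumerated":
                if value not in spec["allowed"]:
                    conflicts.append((number, column, f"unexpected value {value!r}"))
            else:
                try:
                    number_value = float(value)
                except ValueError:
                    conflicts.append((number, column, f"not numeric: {value!r}"))
                    continue
                if not spec["min"] <= number_value <= spec["max"]:
                    conflicts.append((number, column, f"out of range: {value}"))
    return conflicts


for conflict in find_conflicts(ROWS, DECLARED):
    print(conflict)
```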


Enumerated

NOT HISPANIC              514
CHINESE                   359
HISPANIC                   27
TURKISH                    20
SCOTTISH/IRISH              2
ITALIAN                     2
AUSTRIAN/HUNGARIAN          1
SOUTH WEST NETHERLANDS      1
IRISH/CANADIAN INDIAN       1
PUERTO RICAN                1
SPANISH                     1
HUNGARIAN                   1
..(more)


NCBI WGA


What Can I Do?


[Unrestricted Public Use]


Browse/Search Projects and Studies


Browse/Search Protocols, Questionnaires, Supporting
Documents


View Phenotype Measures Summary Data


View Genotype Measures Summary Data


Identify Studies of Interest and Authorization Authorities


View Pre-computed Associations [GAIN]



[Authorized Users Only]


Download Genotype/Phenotype Data for Individuals


Authorizing Use of Individual Data


Authorization for any study is done by the
sponsoring IC



Legacy studies may be governed by existing
consents but NIH may still move toward a trans-NIH policy on new studies



NCBI does not directly authorize anyone but only
acts on the IC’s behalf


Authentication is Done Centrally


Authentication (user account and login) is done
centrally through CIT.



Once authorized by different ICs, a user has a single
login and a single place to request and download
individual data from all studies to which they have
authorized access
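
The split between central authentication and per-study authorization can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the CIT or NCBI implementation: the `User` and `Study` classes, the `grant` and `can_download_individual_data` functions, and the studies and user are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    login: str                                   # single central login
    authorized_studies: set = field(default_factory=set)


@dataclass(frozen=True)
class Study:
    name: str
    sponsoring_ic: str                           # the IC that authorizes access


def grant(user, study, granting_ic):
    """Record an authorization; only the study's sponsoring IC may grant it."""
    if granting_ic != study.sponsoring_ic:
        raise PermissionError(f"{granting_ic} does not sponsor {study.name}")
    user.authorized_studies.add(study.name)


def can_download_individual_data(user, study, authenticated):
    """Individual-level data needs a valid login and an IC authorization."""
    return authenticated and study.name in user.authorized_studies


# Hypothetical studies and user, invented for illustration.
study_a = Study("Study A (hypothetical)", "NHLBI")
study_b = Study("Study B (hypothetical)", "NINDS")
alice = User("alice")

grant(alice, study_a, "NHLBI")
print(can_download_individual_data(alice, study_a, authenticated=True))
print(can_download_individual_data(alice, study_b, authenticated=True))
```

The design point is that the login is checked once, centrally, while each grant is recorded on behalf of the sponsoring IC.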


Genotype Data


NCBI is working with ICs on Genotype Data where
possible


Both genotype calls and underlying intensity data
go directly from vendors to NCBI


NCBI is working with major vendors on appropriate
data content and formats


Genotype summaries will be available to public as
permitted


Individual genotypes available through the same
authorization process as individual phenotypes


Automated Links from Genotype Probe

Linkage Disequilibrium
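
The slide only names linkage disequilibrium, so as a reminder of the quantity involved, the Python sketch below computes the standard measures D and r² for two biallelic loci. The function name `linkage_disequilibrium` and the frequencies are invented for illustration; the slides do not state how NCBI computes or displays these values.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    """
    for p in (p_a, p_b):
        if not 0 < p < 1:
            raise ValueError("allele frequencies must be strictly between 0 and 1")
    d = p_ab - p_a * p_b
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared


# Invented frequencies for illustration.
d, r2 = linkage_disequilibrium(p_ab=0.4, p_a=0.5, p_b=0.5)
print(f"D = {d:.3f}, r^2 = {r2:.3f}")
```

When the two loci are independent, the haplotype frequency equals the product of the allele frequencies and both D and r² are zero.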


Current Activities


Genetic Association Information Network (GAIN)


Genetics and Environment Initiative (GEI)


Framingham Genetic Study


National Institute of Neurological Disorders and
Stroke (NINDS)


NHGRI Medical Resequencing


NEI Macular Degeneration


GAIN


Run by Foundation for NIH


Currently accepting applications for clinical data
(due May 9)


Pfizer and Affymetrix paying for genotyping up to
seven studies (~1000 cases and controls) this year


Public metadata


Public “pre-computes”


Restricted access to phenotype/genotype data


Estimated public release of all data: late Summer


Future rounds of applications



Framingham Heart Study


Collaboration with NHLBI


Three generations of individuals from Framingham,
MA


15,000 people


Tens of thousands of variables


Extensive protocols and questionnaires


Already in the process of converting to XML


Data for one cohort just arriving


Access model


Public metadata


Private genotype/phenotype data


No “pre-computes”



NINDS Parkinsonism/Stroke


Four studies (more to follow)


Parkinsonism


Epilepsy


ALS


Stroke


Mostly categorical data


Access to Coriell cell lines for follow-up


Completely public access model


STR/pedigree data


Affy 100K data coming (but not public yet)


Closing the Loop