110101
NCBI
National Center for Biotechnology Information
• Created by Public Law 100-607 in 1988 as part of the National Library of Medicine at NIH to:
  • Create automated systems for knowledge about molecular biology, biochemistry, and genetics.
  • Perform research into advanced methods of analyzing and interpreting molecular biology data.
  • Enable biotechnology researchers and medical care personnel to use the systems and methods developed.
• Builders and providers of GenBank, Entrez, BLAST, PubMed. Online systems host about 1.8 million users per day at peak rates of 3,200 web hits a second.
• Center for basic research and training in computational biology.
NCBI is the most heavily used site in biomedicine. Why?
NCBI Web Traffic, 1997-2006 (chart: vertical axis 100,000 to 700,000; horizontal axis January 1998 to January 2006)
722,000 Unique IPs a Day
91 Million Web Hits a Day
3200 Peak Web Hits a Second
1.5 Terabytes FTP a Day
1.8 Million Unique Users a Day
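As a quick sanity check on the traffic figures above, the daily totals can be converted into average rates. This is only back-of-the-envelope arithmetic on the numbers quoted on the slide; the variable names are illustrative and not part of any NCBI system.

```python
# Figures quoted on the slide.
hits_per_day = 91_000_000
peak_hits_per_second = 3_200
unique_users_per_day = 1_800_000

SECONDS_PER_DAY = 24 * 60 * 60

# Average load if the daily hits were spread evenly over the day.
avg_hits_per_second = hits_per_day / SECONDS_PER_DAY
# How much busier the peak second is than an average second.
peak_to_average = peak_hits_per_second / avg_hits_per_second
# Average number of web hits generated by each unique user.
hits_per_user = hits_per_day / unique_users_per_day

print(f"average hits/second:   {avg_hits_per_second:.0f}")
print(f"peak-to-average ratio: {peak_to_average:.1f}")
print(f"hits per user per day: {hits_per_user:.1f}")
```

The peak rate is roughly three times the average rate, which is the kind of headroom a service sized for peak load has to carry.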
Data, the Next Intel Inside
Comparative Analysis of Genes Enables “Innovation in Assembly”
Human   638 RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC 697
Yeast   657 RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC 716
E.coli  584 RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA 642
Colon cancer gene sequence
Evolutionary tree: Human, Mouse, Fly, Worm, Yeast, Bacteria (time scale: 3000 Myr, 1000 Myr, 500 Myr)
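The comparison behind the alignment above can be sketched in a few lines: walk the aligned columns and report where all three organisms carry the same residue. This is a minimal illustration that takes the three aligned rows from the slide as given; the function names are hypothetical and this is not the method NCBI tools use to build alignments.

```python
# Aligned rows as shown on the slide ("-" marks a gap).
human = "RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC"
yeast = "RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC"
ecoli = "RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA"


def is_conserved(column):
    """True when every sequence has the same residue and none has a gap."""
    return len(set(column)) == 1 and column[0] != "-"


def conservation_line(seqs):
    """Mark conserved columns with '*' and all other columns with a space."""
    return "".join("*" if is_conserved(col) else " " for col in zip(*seqs))


def longest_conserved_run(seqs):
    """Return the longest stretch of consecutive conserved columns."""
    best = current = ""
    for col in zip(*seqs):
        current = current + col[0] if is_conserved(col) else ""
        if len(current) > len(best):
            best = current
    return best


def identity(a, b):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x == y for x, y in pairs) / len(pairs)


seqs = [human, yeast, ecoli]
for name, seq in zip(("Human", "Yeast", "E.coli"), seqs):
    print(f"{name:<7}{seq}")
print(f"{'':<7}{conservation_line(seqs)}")
print("longest conserved run:", longest_conserved_run(seqs))
print(f"human/yeast identity: {identity(human, yeast):.2f}")
```

The longest fully conserved stretch is the block shared by all three rows in the middle of the alignment, which is what makes the gene recognizable across species separated by such long divergence times.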
Entrez: Pathway to Discovery
Amino acid sequence similarity
Coding region features
Nucleotide sequence similarity
Term frequency statistics
Literature citations in sequence databases
MEDLINE abstracts
Nucleotide sequences
Protein sequences
Entrez Increases Discovery Space
Nucleotide sequences
Protein sequences
Taxon
Phylogeny
3-D Structure
MMDB
PubMed abstracts
Complete Genomes
PubMed
Entrez
Genomes
Publishers
Genome Centers
Elements of Whole Genome Association (WGA)
Phenotype Model
Genotype
Association between Phenotype Model and Genotype
Elements of Phenotype
Protocols, Questionnaires, Documentation
• Best explanation of measured attributes
• Basically text, often on paper or scanned PDF
• Often unavailable even to sponsoring IC
Data
Data Dictionary
Elements of Phenotype
Protocols, Questionnaires, Documentation
Data
• Measured attributes in a square table
• Row is individual
• Column is measure
• Column names often obscure (e.g. “HO112”)
• In many formats (Excel, SAS, RDB, text)
Data Dictionary
Elements of Phenotype
Protocols, Questionnaires, Documentation
Data
Data Dictionary
• Links Protocol element to Data column
• Not generally available or widely usable
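The role of the data dictionary can be illustrated with a small lookup: it maps an obscure column name in the data table to the protocol question that explains it. Everything below is hypothetical (the question text, the second column name "XQ7", and the helper names); only the column name "HO112" comes from the slides.

```python
# Hypothetical data dictionary: data column name -> protocol/questionnaire item.
data_dictionary = {
    "HO112": "Placeholder text of the questionnaire item behind column HO112",
}

# Hypothetical data table: one row per individual, one column per measure.
rows = [
    {"ID": "1", "HO112": "2", "XQ7": "0"},
    {"ID": "2", "HO112": "1", "XQ7": "1"},
]


def column_names(rows):
    """Collect column names in first-seen order."""
    names = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    return names


def undocumented_columns(rows, dictionary, ignore=("ID",)):
    """Columns present in the data but missing from the data dictionary."""
    return [n for n in column_names(rows) if n not in dictionary and n not in ignore]


for name in column_names(rows):
    print(f"{name}: {data_dictionary.get(name, '(no dictionary entry)')}")
print("undocumented:", undocumented_columns(rows, data_dictionary))
```

Columns that come back as undocumented are exactly the ones a reader cannot interpret without going back to the paper protocols.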
Organize Data First
Take Data in whatever format is available
Automatically load columns and rows into a generic database
Automatically analyze content of columns
Review report with submitter
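The load-then-analyze steps above might look roughly like the following sketch, which reads a delimited text table and produces a per-column report for review. It assumes delimited text input only and uses a plain list of dictionaries in place of the generic database; the sample data, the `max_enumerated` threshold, and the function names are all illustrative assumptions.

```python
import csv
import io
from collections import Counter


def load_table(text, delimiter=","):
    """Load delimited text into a list of {column: value} rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def profile_column(rows, name, max_enumerated=20):
    """Summarize one column: kind, missing count, distinct values, top values."""
    values = [(row[name] or "").strip() for row in rows]
    present = [v for v in values if v]
    counts = Counter(present)
    if present and all(is_number(v) for v in present):
        kind = "numeric"
    elif len(counts) <= max_enumerated:
        kind = "enumerated"
    else:
        kind = "free text"
    return {
        "column": name,
        "kind": kind,
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "top_values": counts.most_common(5),
    }


# Hypothetical submission: three individuals, three measures.
sample = "ID,HO112,ETHNIC\n1,2.5,HISPANIC\n2,,NOT HISPANIC\n3,4,NOT HISPANIC\n"
rows = load_table(sample)
report = [profile_column(rows, name) for name in rows[0]]
for entry in report:
    print(entry)
```

A report of this shape (kind, missing values, most frequent values per column) is the sort of summary that can be reviewed with the submitter before any documents are linked.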
Organize Documents Next
Manually link specific questions/measures to Data column names
Send out for tagging into XForms XML
View as HTML document
Process for indexing
Generate views of specific questions across forms and studies
Identify Conflicts Between Data and Metadata
Enumerated
NOT HISPANIC              514
CHINESE                   359
HISPANIC                   27
TURKISH                    20
SCOTTISH/IRISH              2
ITALIAN                     2
AUSTRIAN/HUNGARIAN          1
SOUTH WEST NETHERLANDS      1
IRISH/CANADIAN INDIAN       1
PUERTO RICAN                1
SPANISH                     1
HUNGARIAN                   1
..(more)
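A conflict of the kind shown in the enumerated values above can be detected by comparing the values observed in a column with the values its documentation permits. The counts below are the ones listed on the slide (the list is truncated there, so they are partial); the permitted set of two values is an assumption made for illustration, as is the function name.

```python
# Value counts copied from the slide (the slide's list continues past these).
observed_counts = {
    "NOT HISPANIC": 514,
    "CHINESE": 359,
    "HISPANIC": 27,
    "TURKISH": 20,
    "SCOTTISH/IRISH": 2,
    "ITALIAN": 2,
    "AUSTRIAN/HUNGARIAN": 1,
    "SOUTH WEST NETHERLANDS": 1,
    "IRISH/CANADIAN INDIAN": 1,
    "PUERTO RICAN": 1,
    "SPANISH": 1,
    "HUNGARIAN": 1,
}

# Assumption: the questionnaire documents only these two answers.
declared_values = {"HISPANIC", "NOT HISPANIC"}


def find_conflicts(counts, declared):
    """Return (values in the data but not declared, declared values never seen)."""
    unexpected = {value: n for value, n in counts.items() if value not in declared}
    unused = sorted(declared - set(counts))
    return unexpected, unused


unexpected, unused = find_conflicts(observed_counts, declared_values)
print(f"{len(unexpected)} undeclared values covering {sum(unexpected.values())} rows")
for value, n in sorted(unexpected.items(), key=lambda item: -item[1]):
    print(f"  {value}: {n}")
print("declared but never observed:", unused)
```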
NCBI WGA: What Can I Do?
• [Unrestricted Public Use]
  • Browse/Search Projects and Studies
  • Browse/Search Protocols, Questionnaires, Supporting Documents
  • View Phenotype Measures Summary Data
  • View Genotype Measures Summary Data
  • Identify Studies of Interest and Authorization Authorities
  • View Pre-computed Associations [GAIN]
• [Authorized Users Only]
  • Download Genotype/Phenotype Data for Individuals
Authorizing Use of Individual Data
Authorization for any study is done by the sponsoring IC
Legacy studies may be governed by existing consents but NIH may still move toward a trans-NIH policy on new studies
NCBI does not directly authorize anyone but only acts on the IC’s behalf
Authentication is Done Centrally
Authentication (user account and login) is done centrally through CIT.
Once authorized by different ICs, a user has a single login and a single place to request and download individual data from all studies to which they have authorized access
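The split between central authentication and per-study authorization described above can be modeled as follows. This is a conceptual sketch only, not a description of the actual NCBI or CIT systems: the class names, the study names, and the IC labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    name: str
    sponsoring_ic: str  # only this IC may authorize access to the study


@dataclass
class User:
    login: str  # one central login shared across all studies
    approvals: set = field(default_factory=set)  # (IC, study name) pairs


def grant(user, study, granting_ic):
    """Record an approval; only the sponsoring IC may grant it."""
    if granting_ic != study.sponsoring_ic:
        raise PermissionError(f"{granting_ic} does not sponsor {study.name}")
    user.approvals.add((study.sponsoring_ic, study.name))


def downloadable(user, studies):
    """Studies whose individual-level data this login may download."""
    return [s.name for s in studies if (s.sponsoring_ic, s.name) in user.approvals]


# Hypothetical studies sponsored by two different ICs.
studies = [Study("Study A", "IC-1"), Study("Study B", "IC-2"), Study("Study C", "IC-2")]
user = User(login="researcher1")
grant(user, studies[0], "IC-1")
grant(user, studies[1], "IC-2")
print(downloadable(user, studies))
```

The repository in this model never decides who gets access; it only records and enforces approvals made by the sponsoring IC, while the user keeps one login across all of them.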
Genotype Data
NCBI is working with ICs on Genotype Data where possible
Both genotype calls and underlying intensity data go directly from vendors to NCBI
NCBI is working with major vendors on appropriate data content and formats
Genotype summaries will be available to the public as permitted
Individual genotypes are available through the same authorization process as individual phenotypes
Automated Links from Genotype Probe
Linkage Disequilibrium
Current Activities
Genetic Association Information Network (GAIN)
Genetics and Environment Initiative (GEI)
Framingham Genetic Study
National Institute of Neurological Disorders and Stroke (NINDS)
NHGRI Medical Resequencing
NEI Macular Degeneration
GAIN
Run by Foundation for NIH
Currently accepting applications for clinical data (due May 9)
Pfizer and Affymetrix paying for genotyping of up to seven studies (~1000 cases and controls) this year
Public metadata
Public “pre-computes”
Restricted access to phenotype/genotype data
Estimated public release of all data: late summer
Future rounds of applications
Framingham Heart Study
Collaboration with NHLBI
Three generations of individuals from Framingham, MA: 15,000 people
Tens of thousands of variables
Extensive protocols and questionnaires
• Already in the process of converting to XML
• Data for one cohort just arriving
Access model
• Public metadata
• Private genotype/phenotype data
• No “pre-computes”
NINDS Parkinsonism/Stroke
Four studies (more to follow)
• Parkinsonism
• Epilepsy
• ALS
• Stroke
Mostly categorical data
Access to Coriell cell lines for follow-up
Completely public access model
STR/pedigree data
Affy 100K data coming (but not public yet)
Closing the Loop