Data Validation and Annotation: PRIDEViewer and PIKE



Bioinformatics analysis from proteomics data

ProteoRed Bioinformatics Workshop

Salamanca


Alberto Medina-Aunon

March 15th, 2010

Main Topics

Mass spectrometry and protein and peptide validation.

PRIDEViewer: Description.

Examples: Use cases.

Experiment context: Linking functional information to our proteins.

PIKE: Description.

Examples: Use cases.



Starting from:


Mass spectrum/spectra


Tentative identification/Sequence


Search Engine



MS Validation.

The easiest Way

Candidate: AFLLAMAARTGFRTR

How to do it

By hand:

Just for a few sequences/spectra.

We cannot read every file format (for instance, binary files).

Semi-automatically:

Using PRIDE files as input: PRIDEViewer
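Validating a candidate by hand usually means comparing the peaks of the spectrum with the fragment masses expected for the sequence. The sketch below computes singly charged b and y ions for the candidate AFLLAMAARTGFRTR and reports which of them have a peak nearby. The peak list, the 0.5 tolerance and the function names are assumptions made for illustration; they do not come from the slides or from PRIDEViewer.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O carried by the C-terminal (y) fragments
PROTON = 1.00728   # charge carrier for singly charged ions


def fragment_ions(sequence):
    """Return (b_ions, y_ions) as singly charged m/z values, b1..b(n-1) and y1..y(n-1)."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions, y_ions


def matched(theoretical, observed, tolerance=0.5):
    """Keep the theoretical m/z values that have an observed peak within the tolerance."""
    return [t for t in theoretical if any(abs(t - o) <= tolerance for o in observed)]


candidate = "AFLLAMAARTGFRTR"
b_ions, y_ions = fragment_ions(candidate)

# Hypothetical peak list, invented only to exercise the code.
observed_peaks = [175.12, 219.11, 500.00]
print("b ions matched:", [round(m, 2) for m in matched(b_ions, observed_peaks)])
print("y ions matched:", [round(m, 2) for m in matched(y_ions, observed_peaks)])
```

Doing this for a handful of spectra is feasible; for hundreds of spectra a viewer that reads the repository files directly is the practical option.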

PRIDEViewer: Experiment info

PRIDEViewer: Sample and instrument info

PRIDEViewer: Spectra and identifications

PRIDEViewer: Gel separation

PRIDEViewer: Mascot interface

One Example:

Identification using 5 peptides

Example

Mascot output

Another example:

350 input spectra

Validation study


Starting from one public proteomics repository (EBI PRIDE):

Retrieve a set of available experiments.

Check the level of completeness of the experiments.

Repeat the protein and peptide identification.

VALIDATE THE EXPERIMENT...

Validation using PRIDE

http://www.ebi.ac.uk/pride/

PRIDE: Searching experiments: Biomart

Validation. First Round.

Biomart

Validation. First Round: PRIDE Accession 1642

First View: Mascot Results

Protein Id    Database    Peptide Count   Identified
IPI00295598   IPI         2               No
Q15843        SwissProt   6               Yes
P62491        SwissProt   1               No

Why? If we explore the data, we’ll find …..

Protein Id    First Peptide         PRIDE mass   Calculated mass
IPI00295598   VISEPGEAEVFMTPEDFVR   2184.0375    2152.0267
Q15843        EIGPPQQQR             1052.5697    1052.5483
P62491        DHADSNIVIMLVGNK       1657.8186    1625.8316

Delta mass: around 32 Da
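The comparison in the table can be repeated with a few lines of code. The sketch below sums monoisotopic residue masses and adds one water and one proton. This reproduces the calculated column to within about 0.001 Da, which suggests the values are singly protonated monoisotopic masses; that reading is an inference from the numbers, not something the slides state. The function name mh_plus is an assumption made for illustration.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def mh_plus(sequence):
    """Monoisotopic mass of the unmodified, singly protonated peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + PROTON


# (protein id, first peptide, mass reported in PRIDE), copied from the table.
ROWS = [
    ("IPI00295598", "VISEPGEAEVFMTPEDFVR", 2184.0375),
    ("Q15843", "EIGPPQQQR", 1052.5697),
    ("P62491", "DHADSNIVIMLVGNK", 1657.8186),
]

deltas = {}
for protein_id, peptide, pride_mass in ROWS:
    calculated = mh_plus(peptide)
    deltas[protein_id] = pride_mass - calculated
    print(f"{protein_id:12s} calculated={calculated:.4f} delta={deltas[protein_id]:+.4f}")
```

Only the two proteins that were not identified show the shift of about 32 Da; the identified one differs by a few hundredths of a dalton.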

Validation. First Round: PRIDE Accession 1642

Hypothesis...

The first and third sequences show a mass variation of around 32 Da.

Is there a modification at the C or N terminus? If so, the second sequence would show it as well.

Is one residue, or more than one, modified?

We'll extract the amino acids common to both sequences: D, A, S, I, V, M and G.

Compare them with the described modifications that have a mass variation of 32 Da.
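The two steps of the hypothesis can be sketched in code: take the residues shared by the first and third peptides, then keep only the modifications whose mass shift is close to 32 Da and whose target residue is in that shared set. The modification list below is a small hand-picked subset with commonly cited monoisotopic shifts and simplified target residues, assumed here for illustration only; a real check would use a full modification database. The slides do not name the modification that was finally selected, and this sketch does not claim to recover it.

```python
first = "VISEPGEAEVFMTPEDFVR"   # IPI00295598, shifted by about 32 Da
second = "EIGPPQQQR"            # Q15843, no shift
third = "DHADSNIVIMLVGNK"       # P62491, shifted by about 32 Da

# Residues present in both shifted peptides.
common = set(first) & set(third)
# Of those, the ones that are absent from the peptide without a shift.
only_in_shifted = common - set(second)

# Illustrative subset: (name, monoisotopic shift in Da, simplified target residues).
MODIFICATIONS = [
    ("Methylation", 14.01565, "KR"),
    ("Oxidation", 15.99491, "M"),
    ("Dioxidation", 31.98983, "M"),
    ("Acetylation", 42.01057, "K"),
    ("Phosphorylation", 79.96633, "STY"),
]


def candidates(target_shift, residues, tolerance=0.1):
    """Modifications near the target shift that can sit on one of the given residues."""
    hits = []
    for name, shift, targets in MODIFICATIONS:
        sites = [aa for aa in targets if aa in residues]
        if abs(shift - target_shift) <= tolerance and sites:
            hits.append((name, sites))
    return hits


print("common residues:", sorted(common))
print("absent from the unshifted peptide:", sorted(only_in_shifted))
print("candidates near 32 Da:", candidates(32.0, common))
```

Restricting the search to residues that are absent from the unshifted peptide is an extra filter that the slides do not mention; it is included because it narrows the list without extra data.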


Validation. First Round: PRIDE Accession 1642

Only this modification could explain a common property between both sequences.

So, we'll select it in the next round.

Validation. Second Round: Latest experiments, retrieved by hand

PRIDE accession ids 10470 to 11257 (787 experiments).

None of them is suitable to check.

No information regarding the identification is available.

PRIDE accession ids 10000 to 10074 (74 experiments).

One dataset could be checked: 10042 to 10060 (dataset title: Low abundance proteome of human red blood cells captured by combinatorial peptide libraries).

Validation. Second Round: Latest experiments

PRIDE Accession 10053: Mascot output

PRIDE Accession 10060: Mascot output: No identification

Validation. Third Round: Recent experiments, retrieved by hand

Experiment ids 9900 to 9999.

Two datasets are suitable to check:

9900 to 9942: LC-MALDI experiments (Tannerella forsythia).

9944 to 9949: Rattus norvegicus.

9984: Zebrafish. No spectra.

9985 to 9992: Homo sapiens (no identifications).

44 not available.

Validation. Third Round: Experiment 9900

Validation. Third Round: Experiment 9900. Summary

Protein Id   Peptide Count   Identified   1st Peptide Mass   Theoretical Mass
TF2239       1               No           1228.5463          1228.6433
TF26612      13              Yes          --                 --
TF1259       1               No           1271.6478          1271.6783
TF2116       4               No           1139.5835          1139.6208
TF1741       16              No           1044.5144          1044.5473
TF0447       2               No           1092.4619          1092.5432
TF2663       7               Yes          --                 --
TF2592       2               No           1022.5306          1022.5782
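For the proteins that were not identified, the gap between the first peptide mass and the theoretical mass is easier to judge when it is also expressed in parts per million. The sketch below does that for the rows of the summary table. The slides do not say which mass tolerance was used in the repeated search, so the code only reports the errors and does not decide whether any of them is acceptable.

```python
# (protein id, first peptide mass, theoretical mass) for the rows marked "No".
ROWS = [
    ("TF2239", 1228.5463, 1228.6433),
    ("TF1259", 1271.6478, 1271.6783),
    ("TF2116", 1139.5835, 1139.6208),
    ("TF1741", 1044.5144, 1044.5473),
    ("TF0447", 1092.4619, 1092.5432),
    ("TF2592", 1022.5306, 1022.5782),
]


def mass_error(observed, theoretical):
    """Return the error in Da and in ppm relative to the theoretical mass."""
    delta = observed - theoretical
    return delta, delta / theoretical * 1e6


errors = {pid: mass_error(obs, theo) for pid, obs, theo in ROWS}
for pid, (delta_da, delta_ppm) in errors.items():
    print(f"{pid:8s} {delta_da:+.4f} Da  {delta_ppm:+.1f} ppm")
```

In every row the reported mass is lower than the theoretical one, by a few hundredths of a dalton up to about a tenth of a dalton.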

Study summary


Around 1000 PRIDE experiments were downloaded from the PRIDE central repository.

Around 100 of them were suitable to test.

Less than 50% of them were successfully validated.

In summary


There is a lot of data within the repositories (PRIDE).

There is a lot of missing information.

It is not possible to check the data automatically.

PRIDEViewer could help us save a lot of time.

Protein Set



At other times, a mistake in the identification will not be so significant if, in the end, we can still reach the goal of the experiment.

For instance, finding the proteins involved in a particular function or biological process.

DB id         Protein Name
gi|12857455   Heat shock protein
gi|14017768   FKB9_HUMAN
gi|12836587   Tubulin alpha homo sapiens
gi|15010550   Ubiquitin specific protease
gi|15489190   vinculin isoform VCL Homo sapiens
gi|9963904    selenium binding protein 1 Homo sapiens





PIKE
http://proteo.cnb.csic.es/

PIKE: Protein Information and Knowledge Extractor

Information asked by user

PIKE output. CSV

[Chart: Series1 and Series2 plotted for Feature 1 to Feature 5; value axis from 0 to 18]
PIKE output

First example

Medium-complexity protein list (containing 57 proteins)

J Proteome Res. 2005 Nov-Dec;4(6):2435-41.

#   Entry name                                                            Entry ID (UniProt ID)  Manual searching  PIKE output (only keywords)
6   Integrin alpha-5 precursor                                            P08648                 1 TM              Keyword: Transmembrane
7   Sodium/potassium-transporting ATPase alpha-1 chain precursor          P05023                 10 TM             Keyword: Transmembrane
8   Short transient receptor potential channel 4                          Q9UBN4                 8 TM              Keyword: Transmembrane
10  Band 3 anion transport protein                                        P02730                 11 TM             Keyword: Transmembrane
11  Transferrin receptor protein 1                                        P02786                 1 TM              Keyword: Transmembrane
17  calnexin precursor                                                    P27824                 1 TM              Keyword: Transmembrane
19  5'-nucleotidase precursor                                             P21589                 1 TM; GPI         Keyword: GPI-anchor
21  Alkaline phosphatase, placental type precursor                        P05187                 GPI               Keywords: Transmembrane; GPI-anchor
22  4F2 cell-surface antigen heavy chain                                  P08195                 1 TM              Keyword: Transmembrane
24  Solute carrier family 2, facilitated glucose transporter, member 1    P11166                 12 TM             Keyword: Transmembrane
29  chloride intracellular channel protein 5                              Q9NZA1                 --                Keyword: Transmembrane
30  3beta-hydroxy-Delta5-steroid dehydrogenase multifunctional protein I  P14060                 1 TM              Keyword: Transmembrane
41  myristoylated alanine-rich C-kinase substrate                         P29966                 Myristoylation    Keyword: Myristate
42  Basigin precursor                                                     P35613                 1 TM              Keyword: Transmembrane
47  Brain acid soluble protein 1                                          P80723                 Myristoylation    Keywords: Transmembrane; Myristate
51  ADP-ribosylation factor 1                                             P84077                 --                Keywords: Transmembrane; Myristate
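The agreement between the manual annotation and the PIKE keywords in this table can be counted with a short script. The sketch below maps each manual label to the keyword it would be expected to produce (TM to Transmembrane, GPI to GPI-anchor, Myristoylation to Myristate) and compares the resulting sets row by row. That mapping, the exact-match criterion and the spelling "Myristate" for every row are assumptions made for illustration; they are not part of PIKE.

```python
# (UniProt ID, manual annotation, PIKE keywords), copied from the table.
# An empty manual annotation is written as "".
ROWS = [
    ("P08648", "1 TM", {"Transmembrane"}),
    ("P05023", "10 TM", {"Transmembrane"}),
    ("Q9UBN4", "8 TM", {"Transmembrane"}),
    ("P02730", "11 TM", {"Transmembrane"}),
    ("P02786", "1 TM", {"Transmembrane"}),
    ("P27824", "1 TM", {"Transmembrane"}),
    ("P21589", "1 TM; GPI", {"GPI-anchor"}),
    ("P05187", "GPI", {"Transmembrane", "GPI-anchor"}),
    ("P08195", "1 TM", {"Transmembrane"}),
    ("P11166", "12 TM", {"Transmembrane"}),
    ("Q9NZA1", "", {"Transmembrane"}),
    ("P14060", "1 TM", {"Transmembrane"}),
    ("P29966", "Myristoylation", {"Myristate"}),
    ("P35613", "1 TM", {"Transmembrane"}),
    ("P80723", "Myristoylation", {"Transmembrane", "Myristate"}),
    ("P84077", "", {"Transmembrane", "Myristate"}),
]


def expected_keywords(manual):
    """Translate a manual annotation such as '1 TM; GPI' into the expected keyword set."""
    keywords = set()
    for part in manual.split(";"):
        part = part.strip()
        if part.endswith("TM"):
            keywords.add("Transmembrane")
        elif part == "GPI":
            keywords.add("GPI-anchor")
        elif part == "Myristoylation":
            keywords.add("Myristate")
    return keywords


agree = [uid for uid, manual, kws in ROWS if expected_keywords(manual) == kws]
differ = [uid for uid, manual, kws in ROWS if expected_keywords(manual) != kws]
print(f"{len(agree)} of {len(ROWS)} rows agree")
print("rows that differ:", differ)
```

The rows that differ are the ones worth inspecting by hand, since either the manual search or the retrieved keywords may be incomplete.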

Second example

Human Plasma Proteins from PRIDE (HPPP). PRIDE Accession 65


25 MOST FREQUENT PROTEINS

Serum albumin [Precursor] - Serum albumin - ALB: 356
Complement C3 [Precursor]: 273
IGHA1 protein: 225
Calcium/calmodulin-dependent protein kinase kinase 2: 100
Inter-alpha-trypsin inhibitor heavy chain H1-H4 [Precursor]: 99
Putative uncharacterized protein: 97
IGL@ protein: 96
ARF GTPase-activating protein GIT2: 90
Complement factor B [Precursor]: 90
PRO2275: 90
IGHM protein: 78
IGKC protein: 64
Alpha-1B-glycoprotein [Precursor]: 62
cDNA FLJ14473 fis, clone MAMMA1001080.: 58
CDNA FLJ25298 fis, clone STM07683.: 58
Fibronectin [Precursor]: 58
IGHD protein: 56
Trypsin: 55
Apolipoprotein-L1 [Precursor]: 54
HP protein: 53
Alpha-2-macroglobulin [Precursor]: 52
SNC66 protein: 52
Ig kappa chain V-III region HAH [Precursor]: 50

PROTEIN COUNT: 2226

REDUNDANCY RATIO (Protein count/non redundant entries): 89.04%

Third example

The Human Plasma Proteome: A non redundant list:

Mol Cell Proteomics. 2004 Apr;3(4):311-26. Epub 2004 Jan 12.

>> We have merged four different views of the human plasma proteome, based on different methodologies, into a single nonredundant list of 1175 distinct gene products ….


Conclusion

PIKE represents a suitable and useful bioinformatics tool for small- or large-scale proteomics projects.

PIKE's main characteristic is its ability to systematically access and automatically retrieve comprehensive biological information contained in common databases.

The resulting information is output in a wide range of standard formats that can be directly viewed, exported, or downloaded for additional analysis.

Questions?