Scientific publications and archives: media, content and access
Lesk, Ch 3
(Lesk, 2008)
Scientific literature
•
Scientific publications began as interpersonal oral communication: lectures, seminars and discussions.
•
Formal written articles and books make up the scientific literature.
•
Today: journals, presentations at meetings, books, book chapters,
Web material, films, radio and television programs, podcasts.
•
Formal academic publications must pass the test of ‘peer review’, a form of quality control.
•
Before the Internet, scientific literature appeared on paper
(journals). Today, journals appear electronically as well as on paper
(some readers rarely visit a library to read journals).
•
Delocalized literature delivery and computational methods of
information retrieval.
Economic factors governing access to scholarly publications
•
Traditional economic model of scientific journals: a scientific
organization or publisher produces and distributes, at regular
intervals, a paper-bound ‘issue’ of articles.
–
Cost: editorial office; preparation of manuscripts; printing/distribution.
–
Support (income): sales (subscriptions), page charges to authors,
donations, subsidies, advertisements, etc.
•
Recent changes:
–
More papers are published, driving up costs.
–
Larger volume of publication puts libraries under financial pressure.
–
Electronic facilities reduce costs.
–
Electronic distribution extends the potential format of journal articles.
–
User community supports open access.
Open access / traditional and digital libraries
•
Redefinition of the author/publisher/reader relationship.
–
The peer-review process is retained.
–
Accepted articles are placed on the Web, with free access.
–
Authors retain copyright (instead of publisher).
–
Costs of publication are transferred from readers to authors.
•
Traditional libraries: you know what they are.
•
Digital libraries.
–
Electronic form, online.
–
Raise economic questions.
–
Large-scale digital libraries by scanning?
The information explosion / Databases
•
Efficient delivery can be a mixed blessing.
•
Impossible for anyone to read all the literature in a given field.
•
The Web adds a further dimension: reading is no longer linear, and there
are new media, new ways of searching, bibliography management, and
ways of organizing and sharing the harvest.
•
Databases: contents, ontology, logical structure, format of the
data, routes for retrieval of data, links to other resources.
•
Literature as a database: e.g. MEDLINE (Medical Literature
Analysis and Retrieval System Online), a bibliographic database
that is now part of PubMed.
Databases
•
Database organization / design: e.g. design of a relational database
of amino acids.
•
Annotation: a typical entry in a molecular biology database might
contain information beyond the primary data (e.g. gene sequences):
–
Reference information (citations of publications).
–
Interpretative information.
–
Links to other information.
•
Database quality control (errors?)
–
“Get it right the first time”: database curation and annotation, a new profession.
–
Identifying errors: external curators / users.
–
Tracking database changes.
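The relational design of an amino acid database mentioned on this slide could be sketched roughly as follows. This is only an illustration: the table names, column names, and the `codons_for` helper are assumptions made for the example, not a schema taken from Lesk, and only three amino acids and five codons are loaded.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE amino_acid (
    one_letter   TEXT PRIMARY KEY,
    three_letter TEXT NOT NULL UNIQUE,
    name         TEXT NOT NULL,
    side_chain   TEXT NOT NULL      -- coarse class, e.g. 'polar'
);
CREATE TABLE codon (
    triplet    TEXT PRIMARY KEY,
    one_letter TEXT NOT NULL REFERENCES amino_acid(one_letter)
);
""")

# A small subset of the 20 standard amino acids and their RNA codons.
conn.executemany("INSERT INTO amino_acid VALUES (?, ?, ?, ?)", [
    ("G", "Gly", "glycine", "nonpolar"),
    ("K", "Lys", "lysine", "charged"),
    ("S", "Ser", "serine", "polar"),
])
conn.executemany("INSERT INTO codon VALUES (?, ?)", [
    ("GGU", "G"), ("GGC", "G"), ("AAA", "K"), ("AAG", "K"), ("UCU", "S"),
])

def codons_for(name):
    """Return the stored codons for an amino acid, joining the two tables."""
    rows = conn.execute(
        "SELECT c.triplet FROM codon AS c "
        "JOIN amino_acid AS a ON a.one_letter = c.one_letter "
        "WHERE a.name = ? ORDER BY c.triplet",
        (name,),
    )
    return [row[0] for row in rows]

print(codons_for("lysine"))
```

Keeping codons in their own table, linked by the one-letter code, is one way to show the "logical structure" and "routes for retrieval" of a database: each fact is stored once and recovered through a join.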
Databases
•
Database access: an issue to consider.
•
Links (utility of a database): internal links and external links.
•
Database interoperability: what about questions that require appeal to
multiple databases at once?
–
Merge several databases?
–
Methods for intercommunication between databases?
•
Data mining.
–
Knowledge discovery: description/explanation.
–
Successful forecasting / predictive modeling.
–
Statistical techniques.
–
Artificial neural networks.
–
Support vector machines.
Programming languages and tools
•
Traditional programming languages: FORTRAN, C, C++
•
Scripting languages: Perl, Python, Ruby…
•
Program libraries specialized for molecular biology: standard
libraries (numerical analysis and text processing), libraries for
molecular biology (e.g. bioperl.org).
•
Java: the Java Virtual Machine; computing over the Web?
•
Markup languages: implement data structures, e.g. XML.
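As a small illustration of how a markup language can carry a data structure, the sketch below parses an XML record with Python's standard library. The element and attribute names (`entry`, `name`, `sequence`, `length`) and the record contents are invented for this example and do not follow any real database schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout: tag and attribute names are made up for illustration.
record = """
<entry id="demo1">
  <name>example protein</name>
  <sequence length="10">MKTAYIAKQR</sequence>
  <reference>placeholder citation</reference>
</entry>
""".strip()

root = ET.fromstring(record)
seq = root.find("sequence")

# Turn the nested markup into an ordinary Python dictionary.
entry = {
    "id": root.get("id"),
    "name": root.findtext("name"),
    "sequence": seq.text,
    "declared_length": int(seq.get("length")),
}

# A simple consistency check of the kind a curator might automate.
print(entry["declared_length"] == len(entry["sequence"]))
```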
Natural language processing
•
Natural language: oral and/or textual forms of human-to-human
communication.
•
Natural language processing has long been a goal of computing.
•
Difficulty: ambiguity of words and phrases.
•
Identifying keywords and combinations of keywords: e.g. names of
genes and names of diseases.
•
Knowledge extraction: protein-protein interactions (automatic
text-mining software).
•
Text mining:
–
Identification of references to genes and proteins.
–
Identification of interactions.
–
Interaction networks and diseases.
–
Hypothesis generation (unsuspected relationships between genes and
diseases).
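The keyword-combination idea above (gene names together with disease names) can be sketched as a sentence-level co-occurrence count. This is a deliberately naive illustration: the tiny `GENES` and `DISEASES` lexicons, the sample text, and the `cooccurrences` function are assumptions for the example, and real text-mining software must also handle synonyms, ambiguity, and negation.

```python
import re
from collections import Counter
from itertools import product

# Tiny hand-made lexicons, for illustration only.
GENES = {"BRCA1", "TP53", "CFTR"}
DISEASES = {"breast cancer", "cystic fibrosis"}

def cooccurrences(text):
    """Count (gene, disease) pairs mentioned within the same sentence."""
    counts = Counter()
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        genes = {g for g in GENES
                 if re.search(rf"\b{re.escape(g)}\b", sentence)}
        diseases = {d for d in DISEASES if d in lowered}
        counts.update(product(sorted(genes), sorted(diseases)))
    return counts

sample = ("Mutations in BRCA1 are associated with breast cancer. "
          "TP53 is also altered in breast cancer. "
          "CFTR mutations cause cystic fibrosis.")
pairs = cooccurrences(sample)
for (gene, disease), n in sorted(pairs.items()):
    print(gene, disease, n)
```

Pairs that co-occur often across a large corpus become candidate edges in an interaction or gene-disease network; pairs linked only indirectly are where hypothesis generation starts.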
Archives and information retrieval
Lesk, Ch 4
(Lesk, 2008)
Database indexing and specification of search terms
•
An index: set of pointers to information in a database.
•
Information retrieval programs accept multiple query terms
and keywords.
•
Possible to ask for logical combinations of indexing terms.
•
Many database search engines allow complex logical
expressions.
•
Follow-up questions: modifying the query, cumulative searches, links
between entries in different databases.
•
Analysis and processing of retrieved data: using results
retrieved in one search as input for another one (some
information retrieval systems provide such facilities).
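The points above (an index as a set of pointers, logical combinations of terms, and feeding one result into the next search) can be sketched with a toy inverted index. The records, their IDs, and the helper names `build_index` and `lookup` are invented for illustration; real retrieval systems add fielded search, stemming, and ranking.

```python
from collections import defaultdict

# Toy database: record IDs and descriptions are made up.
records = {
    1: "human hemoglobin alpha chain",
    2: "horse hemoglobin beta chain",
    3: "human insulin precursor",
    4: "human myoglobin",
}

def build_index(recs):
    """Map each term to the set of record IDs that contain it."""
    index = defaultdict(set)
    for rec_id, text in recs.items():
        for term in text.lower().split():
            index[term].add(rec_id)
    return index

index = build_index(records)

def lookup(term):
    """Return the IDs of records containing the term (empty set if none)."""
    return index.get(term.lower(), set())

# Logical combinations map directly onto set operations.
both = lookup("human") & lookup("hemoglobin")          # human AND hemoglobin
either = lookup("hemoglobin") | lookup("myoglobin")    # hemoglobin OR myoglobin
human_not_hb = lookup("human") - lookup("hemoglobin")  # human AND NOT hemoglobin

# Cumulative search: refine an earlier result with a further term.
refined = either & lookup("chain")

print(sorted(both), sorted(either), sorted(human_not_hb), sorted(refined))
```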
Nucleic acid sequence databases
•
Archiving of bioinformatics data was originally carried out by
individual research groups.
•
As requirements grew, projects became very large-scale.
•
Primary data collections related to biological macromolecules:
–
Nucleic acid sequences, including whole-genome projects.
–
Amino acid sequences of proteins.
–
Protein and nucleic acid structures.
–
Small-molecule crystal structures.
–
Protein functions.
–
Expression patterns of genes.
–
Networks: of metabolic pathways, of gene and protein interactions, and of
control cascades.
–
Publications.
Nucleic acid sequence databases
•
Triple partnership of the National Center for Biotechnology
Information (USA), EMBL-Bank (European Bioinformatics
Institute, UK) and the DNA Data Bank of Japan (National Institute of
Genetics, Japan).
•
Curate, archive and distribute DNA and RNA sequences.
•
Entries have a life history:
–
Unannotated -> Preliminary -> Unreviewed -> Standard
•
A sample entry includes properties of specific regions (e.g.
coding sequences; regions that perform or affect function, interact
with other molecules, affect replication, etc.).
Genome databases and genome browsers
•
Genome browsers (full-genome sequences): databases
bringing together all molecular information available about a
particular species.
•
E.g. ensembl.org: intended to be the universal information
source for the human and other genomes.
Protein sequence databases
•
In 2002, three protein sequence databases, the Protein
Information Resource (PIR), USA; SWISS-PROT, Switzerland; and
TrEMBL, Europe, formed the UniProt consortium.
•
They share the database but continue to offer separate
information-retrieval tools for access.
•
Databases associated with SWISS-PROT:
–
ENZYME DB and PROSITE
•
PIR and associated databases:
–
PIRSF: protein family classification system.
–
iProClass: protein knowledge, access to over 90 biological databases.
–
iProLINK: gateway to protein literature.
Databases of protein families
•
Evolutionary relationships / homology detection.
•
Two full-length protein sequences (>=100 residues) that have
>=25% identical residues in an optimal alignment are likely to
be related.
•
Need sequence alignment algorithms.
•
Refer to a group of related proteins as a family.
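The rule of thumb above (at least 100 residues and at least 25% identical residues in an optimal alignment) can be written out as a small check. This sketch assumes the two sequences have already been aligned by some alignment algorithm and are supplied as equal-length strings with `-` for gaps. The function names are hypothetical, and computing identity over the columns where both sequences have a residue is one convention among several.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)

def likely_related(aligned_a, aligned_b,
                   min_length=100, min_identity=25.0, gap="-"):
    """Apply the slide's rule of thumb; thresholds are parameters, not constants of nature."""
    len_a = len(aligned_a.replace(gap, ""))
    len_b = len(aligned_b.replace(gap, ""))
    long_enough = min(len_a, len_b) >= min_length
    return long_enough and percent_identity(aligned_a, aligned_b, gap) >= min_identity

# Short toy alignment: 5 identical residues out of 6 gap-free columns.
print(percent_identity("ACDE-FG", "ACDQWFG"))
# Too short for the rule to apply, despite the high identity.
print(likely_related("ACDE-FG", "ACDQWFG"))
```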
Databases of structures
•
Structure databases archive, annotate and distribute sets of
atomic coordinates.
•
Worldwide Protein Data Bank (wwPDB.org).
–
Joint effort of the Research Collaboratory for Structural Bioinformatics
(RCSB) and the Protein Data Bank Japan.
–
Contains the structures of proteins.
–
It overlaps several other databases.
•
Several websites offer hierarchical classifications of all proteins
of known structure: SCOP, CATH, DALI, CE.
Other databases
•
Classification and assignment of protein function.
–
The Enzyme Commission.
–
The Gene Ontology Consortium protein function classification.
•
Specialized, or ‘boutique’ databases.
•
Expression (mRNA levels) and proteomics databases
(interpretation in terms of protein patterns).
•
Databases of metabolic pathways (flow of molecules and
energy through pathways of chemical reactions).
•
Bibliographic databases.
•
Only a few of the many databases…