
24 Oct 2013


Scientific publications and
archives: media, content
and access

Lesk, Ch 3

(Lesk, 2008)

Scientific literature

Scientific publications began as interpersonal communications: lectures, seminars and discussions (oral communication).

Formal written articles or books: the scientific literature.

Today: journals, presentations at meetings, books, book chapters,
Web material, films, radio, television programs, podcasts.

Formal academic publications must pass the test of ‘peer review’ (quality control).

Before the Internet, scientific literature appeared on paper
(journals). Today, journals appear electronically as well as on paper
(some rarely visit a library to read journals).

Delocalized literature delivery and computational methods of
information retrieval.


Economic factors governing access to
scholarly publications


Traditional economic model of scientific journals: a scientific
organization or publisher produces and distributes, at regular
intervals, a paper-bound ‘issue’ of articles.

Costs: editorial office; preparation of manuscripts; …

Income: sales (subscriptions), page charges to authors,
donations, subsidies, advertisements, etc.

Recently, changes:

More papers are published, driving up costs.

Larger volume of publication puts libraries under financial pressure.

Electronic facilities reduce costs.

Electronic distribution extends the potential format of journal articles.

User community supports open access.

Open access / traditional and digital

Redefinition of the author/publisher/reader relationship.

Retains the peer-review process.

Accepted articles are placed on the Web, with free access.

Authors retain copyright (instead of publisher).

Costs of publication are transferred from readers to authors.

Traditional libraries: you know what they are.

Digital libraries.

Electronic form, on-line.

Raise economic questions.

Large-scale digital libraries by scanning?


The information explosion / Databases

Efficient delivery can be a mixed blessing.

Impossible for anyone to read all the literature in a given field.

The Web gives a higher dimension: no longer linear, new
media, new ways of searching, bibliography management,
organizing and sharing the harvest.

Databases: contents, ontology, logical structure, format of the
data, routes for retrieval of data, links to other resources.

Literature as a database: e.g. Medline (Medical Literature
Analysis and Retrieval System Online), now part of PubMed,
a bibliographic database.



Database organization: e.g. design of a relational database
of amino acids.
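
A relational database of amino acids could be sketched as follows, using Python's built-in sqlite3 module. The table names, column names and the small set of rows are hypothetical choices made for illustration; the slides do not specify a schema.

```python
import sqlite3


def build_amino_acid_db():
    """Create an in-memory database with two related tables (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE amino_acid (
            one_letter   TEXT PRIMARY KEY,
            three_letter TEXT NOT NULL,
            name         TEXT NOT NULL
        );
        CREATE TABLE side_chain (
            one_letter TEXT REFERENCES amino_acid(one_letter),
            category   TEXT NOT NULL
        );
        """
    )
    # A few rows only; a real table would hold all twenty standard residues.
    conn.executemany(
        "INSERT INTO amino_acid VALUES (?, ?, ?)",
        [
            ("G", "Gly", "glycine"),
            ("D", "Asp", "aspartic acid"),
            ("K", "Lys", "lysine"),
            ("L", "Leu", "leucine"),
        ],
    )
    conn.executemany(
        "INSERT INTO side_chain VALUES (?, ?)",
        [("D", "acidic"), ("K", "basic"), ("L", "hydrophobic")],
    )
    return conn


def names_in_category(conn, category):
    """Join the two tables to list amino acid names in one side-chain category."""
    rows = conn.execute(
        "SELECT a.name FROM amino_acid AS a "
        "JOIN side_chain AS s ON s.one_letter = a.one_letter "
        "WHERE s.category = ? ORDER BY a.name",
        (category,),
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    db = build_amino_acid_db()
    print(names_in_category(db, "basic"))
```

Keeping the side-chain category in its own table, keyed by the one-letter code, is one way to show the "logical structure" point: facts about a residue live in separate tables and are recombined by a join at query time.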

Annotation: a typical entry in a molecular biology database might
contain other information (other than say gene sequences).

Reference information (citations of publications).

Interpretative information.

Links to other information.

Database quality control (errors?)

“Get it right the first time”: database curation and annotation

a new

Identify errors: external curators/users.

Tracking database changes.



Database access: an issue to consider.

Links (utility of a database): internal links and external links.

Database interoperability: questions that require appeal to multiple
databases at once?

Merge several databases?

Methods for intercommunication between databases?

Data mining.

Knowledge discovery: description/explanation.

Successful forecasting / predictive modeling.

Statistical techniques.

Artificial neural networks.

Support vector machines.


Programming languages and tools

Traditional programming languages: FORTRAN, C, C++

Scripting languages: Perl, Python, Ruby…

Program libraries specialized for molecular biology: standard
libraries (numerical analysis and text processing), libraries for
molecular biology (e.g.


Java Virtual Machine: computing over the Web?

Markup languages: implement data structures, e.g. XML.
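
As a rough illustration of a markup language carrying a data structure, the sketch below parses a small XML record with Python's standard xml.etree.ElementTree module. The element names, attribute names, identifiers and sequence are all invented for the example and do not follow any real database's XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical database entry; tag and attribute names are assumptions.
ENTRY_XML = """
<entry id="EX0001">
  <name>example protein</name>
  <sequence>MKTAYIAK</sequence>
  <link database="OTHER_DB" accession="X0001"/>
  <link database="LIT_DB" accession="L0042"/>
</entry>
"""


def parse_entry(xml_text):
    """Turn the XML record into a plain Python dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        "name": root.findtext("name"),
        "sequence": root.findtext("sequence"),
        # Each <link> element becomes a (database, accession) pair.
        "links": [
            (link.get("database"), link.get("accession"))
            for link in root.findall("link")
        ],
    }


if __name__ == "__main__":
    print(parse_entry(ENTRY_XML))
```

The nesting of elements mirrors the nesting of the data structure, which is why the same record can be read by programs written in different languages.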


Natural language processing

Natural language: verbal (oral and/or textual) forms of
human communication.

Natural language processing has been a goal of computing.

Difficulty: ambiguity of words and phrases.

Identifying keywords and combinations of keywords: e.g. names of
genes and names of diseases.
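
Keyword identification of this kind can be sketched, under strong simplifying assumptions, as dictionary lookup with regular expressions. The two small term lists below are illustrative stand-ins for curated vocabularies, and the function names are hypothetical; real text-mining software has to deal with synonyms and ambiguous names, which this sketch ignores.

```python
import re

# Tiny illustrative vocabularies; real systems use curated term lists.
GENE_NAMES = {"BRCA1", "TP53", "CFTR"}
DISEASE_NAMES = {"breast cancer", "cystic fibrosis"}


def _compile(terms):
    """Build one case-insensitive pattern matching any whole term."""
    # Longest terms first, so a longer name is preferred over a shorter one
    # that happens to be its prefix.
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


GENE_RE = _compile(GENE_NAMES)
DISEASE_RE = _compile(DISEASE_NAMES)


def find_mentions(sentence):
    """Return (genes, diseases) named in one sentence, each as a sorted list."""
    genes = sorted({m.group(0).upper() for m in GENE_RE.finditer(sentence)})
    diseases = sorted({m.group(0).lower() for m in DISEASE_RE.finditer(sentence)})
    return genes, diseases


def co_mentions(sentences):
    """Collect gene/disease pairs that appear together in the same sentence."""
    pairs = set()
    for sentence in sentences:
        genes, diseases = find_mentions(sentence)
        pairs.update((g, d) for g in genes for d in diseases)
    return sorted(pairs)


if __name__ == "__main__":
    text = [
        "Mutations in BRCA1 are associated with breast cancer.",
        "CFTR encodes a chloride channel.",
    ]
    print(co_mentions(text))
```

Co-occurrence in a sentence is only weak evidence of a relationship; it is used here because it is the simplest combination of keywords to demonstrate.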

Knowledge extraction: protein-protein interactions (automatic
text-mining software).

Text mining:

Identification of references to genes and proteins.

Identification of interactions.

Interaction networks and diseases.

Hypothesis generation (unsuspected relationships between genes and


Archives and information retrieval

Lesk, Ch 4

(Lesk, 2008)

Database indexing and specification of
search terms

An index: set of pointers to information in a database.

Information retrieval programs accept multiple query terms
and keywords.

Possible to ask for logical combinations of indexing terms.

Many database search engines allow complex logical expressions.

Follow-up questions: modify query, cumulative searches, links
between entries in different databases.

Analysis and processing of retrieved data: using results
retrieved in one search as input for another one (some
information retrieval systems provide such facilities).
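
The ideas in this list (an index as a set of pointers, logical combinations of terms, and feeding one search's results into the next) can be sketched with a small inverted index. The entry identifiers, entry texts and the function names build_index and search are hypothetical; this is a toy model, not how any particular retrieval system is implemented.

```python
from collections import defaultdict


def build_index(entries):
    """Map each lowercase word to the set of entry IDs that contain it."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for word in text.lower().split():
            index[word].add(entry_id)
    return index


def search(index, all_of=(), any_of=(), none_of=(), within=None):
    """Combine terms logically: AND (all_of), OR (any_of), NOT (none_of).

    `within` restricts the search to the results of an earlier search,
    which models a cumulative or follow-up query.
    """
    universe = set().union(*index.values())
    hits = set(universe) if within is None else set(within) & universe
    for term in all_of:
        hits &= index.get(term.lower(), set())
    if any_of:
        hits &= set().union(*(index.get(t.lower(), set()) for t in any_of))
    for term in none_of:
        hits -= index.get(term.lower(), set())
    return hits


# Hypothetical entries standing in for database records.
ENTRIES = {
    "E1": "human hemoglobin alpha chain",
    "E2": "mouse hemoglobin beta chain",
    "E3": "human insulin precursor",
}

if __name__ == "__main__":
    idx = build_index(ENTRIES)
    first = search(idx, all_of=["human"])
    print(sorted(first))
    print(sorted(search(idx, any_of=["insulin"], within=first)))
```

Because each term points to a set of entry IDs, logical combinations reduce to set intersection, union and difference, and a follow-up search is simply the same operation applied to a smaller starting set.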


Nucleic acid sequence databases

Archiving of bioinformatics data was originally carried out by
individual research groups.

As requirements grew, projects became very large.

Primary data collections related to biological macromolecules:

Nucleic acid sequences, including whole-genome projects.

Amino acid sequence of proteins.

Protein and nucleic acid structures.

Small-molecule crystal structures.

Protein functions.

Expression patterns of genes.

Networks: of metabolic pathways, of gene and protein interactions, and of
control cascades.



Nucleic acid sequence databases

Triple partnership of the National Center for Biotechnology
Information (USA), EMBL-Bank (European Bioinformatics
Institute, UK) and the DNA Data Bank of Japan (National Institute of
Genetics, Japan).

Curate, archive and distribute DNA and RNA sequences.

Entries have life history:

Preliminary → Unreviewed → Standard

Sample entry includes: properties of specific regions (e.g.
coding sequences, regions that perform or affect function, interact
with other molecules, affect replication, etc.)


Genome databases and genome browsers

Genome browsers (full-genome sequences): databases
bringing together all molecular information available about a
particular species.

E.g. intended to be the universal information
source for the human and other genomes.


Protein sequence databases

In 2002, three protein sequence databases, the Protein
Information Resource (PIR, USA), SWISS-PROT (Switzerland) and
TrEMBL (Europe), formed the UniProt consortium.

Share the database but continue to offer separate
retrieval tools for access.

Databases associated with SWISS-PROT


PIR and associated databases:

PIRSF: protein family classification system.

iProClass: protein knowledge, access to over 90 biological databases.

iProLINK: gateway to protein literature.


Databases of protein families

Evolutionary relationships / homology detection.

Two full-length protein sequences (>=100 residues) that have
>=25% identical residues in an optimal alignment are likely to
be related.

Need sequence alignment algorithms.

Refer to a group of related proteins as a family.
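
The rule of thumb above can be sketched as a check on two sequences that have already been aligned. The sketch assumes the optimal alignment was computed elsewhere, that gaps are written as "-", and that percent identity is taken over columns where neither sequence has a gap; other definitions of the denominator exist, so this is one convention rather than the definitive one. The function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns without a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def likely_related(aligned_a, aligned_b, min_length=100, min_identity=25.0, gap="-"):
    """Apply the >=100 residues and >=25% identity rule of thumb."""
    length_a = len(aligned_a.replace(gap, ""))
    length_b = len(aligned_b.replace(gap, ""))
    if min(length_a, length_b) < min_length:
        return False  # the rule is stated for full-length sequences only
    return percent_identity(aligned_a, aligned_b, gap) >= min_identity


if __name__ == "__main__":
    print(percent_identity("ACDE-F", "ACDQGF"))
```

The length check matters because short sequences can reach 25% identity by chance, which is why the rule is stated for sequences of at least 100 residues.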


Databases of structures

Structure databases archive, annotate and distribute sets of
atomic coordinates.

World-wide Protein Data Bank (wwPDB)

Joint effort of the Research Collaboratory for Structural Bioinformatics
(RCSB) and the Protein Data Bank Japan.

Contains the structures of proteins.

It overlaps several other databases.

Several websites offer hierarchical classifications of all proteins
of known structure.



Other databases

Classification and assignment of protein function.

The Enzyme Commission.

The Gene Ontology Consortium protein function classification.

Specialized, or ‘boutique’ databases.

Expression (mRNA levels) and proteomics databases
(interpretation in terms of protein patterns).

Databases of metabolic pathways (flow of molecules and
energy through pathways of chemical reactions).

Bibliographic databases.

Only a few of the many databases…