Oct 24, 2013

Scientific publications and
archives: media, content
and access

Lesk, Ch 3

(Lesk, 2008)



Scientific literature


Scientific publications began as interpersonal communications (lectures, seminars and discussions): oral communication.


Formal written articles or books: the scientific literature.


Today: journals, presentations at meetings, books, book chapters,
Web material, films, radio, television programs, podcasts.


Formal academic publications must pass the test of ‘peer review’: quality control.


Before the Internet, scientific literature appeared on paper
(journals). Today, journals appear electronically as well as on paper
(some readers rarely visit a library to read journals).


Delocalized literature delivery and computational methods of
information retrieval.



Economic factors governing access to
scholarly publications



Traditional economic model of scientific journals: a scientific
organization or publisher produces and distributes, at regular
intervals, a paper-bound ‘issue’ of articles.


Cost: editorial office; preparation of manuscripts;
printing/distribution.


Support (income): sales (subscriptions), page charges to authors,
donations, subsidies, advertisements, etc.


Recently, changes:


More papers are published, driving up costs.


Larger volume of publication puts libraries under financial pressure.


Electronic facilities reduce costs.


Electronic distribution extends the potential format of journal articles.


User community supports open access.



Open access / traditional and digital
libraries


Redefinition of the author/publisher/reader relationship.


Retains the peer-review process.


Accepted articles are placed on the Web, with free access.


Authors retain copyright (instead of publisher).


Costs of publication are transferred from readers to authors.


Traditional libraries: you know what they are.


Digital libraries.


Electronic form, on-line.


Raise economic questions.


Large-scale digital libraries by scanning?



The information explosion / Databases


Efficient delivery can be a mixed blessing.


Impossible for anyone to read all the literature in a given field.


The Web gives a higher dimension: no longer linear, new
media, new ways of searching, bibliography management,
organizing and sharing the harvest.



Databases: contents, ontology, logical structure, format of the
data, routes for retrieval of data, links to other resources.


Literature as a database: e.g. Medline (Medical Literature
Analysis and Retrieval System Online), now part of PubMed,
a bibliographic database.





Databases


Database organization / design: e.g. design of a relational database
of amino acids.


Annotation: a typical entry in a molecular biology database might
contain other information (other than say gene sequences).


Reference information (citations of publications).


Interpretative information.


Links to other information.


Database quality control (errors?)


“Get it right the first time”: database curation and annotation,
a new profession.


Identify errors: external curators / users.


Tracking database changes.
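
The relational design of an amino acid database mentioned on this slide can be sketched with Python's built-in sqlite3 module. This is only an illustration, not the design used in the course text: the table names, column names and the choice of a separate side-chain class table are assumptions made for the example, and only four amino acids are loaded.

```python
import sqlite3

def build_amino_acid_db():
    """Create a small in-memory relational database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE side_chain_class (
            class_id INTEGER PRIMARY KEY,
            label    TEXT NOT NULL UNIQUE
        );
        CREATE TABLE amino_acid (
            one_letter   TEXT PRIMARY KEY,
            three_letter TEXT NOT NULL UNIQUE,
            name         TEXT NOT NULL,
            class_id     INTEGER NOT NULL
                         REFERENCES side_chain_class(class_id)
        );
    """)
    conn.executemany(
        "INSERT INTO side_chain_class VALUES (?, ?)",
        [(1, "nonpolar"), (2, "positively charged"), (3, "negatively charged")],
    )
    conn.executemany(
        "INSERT INTO amino_acid VALUES (?, ?, ?, ?)",
        [
            ("G", "Gly", "glycine", 1),
            ("A", "Ala", "alanine", 1),
            ("K", "Lys", "lysine", 2),
            ("D", "Asp", "aspartic acid", 3),
        ],
    )
    conn.commit()
    return conn

def names_in_class(conn, label):
    """Return the names of amino acids in a side-chain class, via a join."""
    rows = conn.execute(
        """
        SELECT a.name
        FROM amino_acid AS a
        JOIN side_chain_class AS c ON a.class_id = c.class_id
        WHERE c.label = ?
        ORDER BY a.name
        """,
        (label,),
    )
    return [name for (name,) in rows]

conn = build_amino_acid_db()
print(names_in_class(conn, "nonpolar"))
```

Keeping the class labels in their own table means each label is stored once and referenced by key, which is the usual reason for splitting data across related tables.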





Databases


Database access: an issue to consider.


Links (utility of a database): internal links and external links.


Database interoperability: questions that require appeal to multiple
databases at once?


Merge several databases?


Methods for intercommunication between databases?


Data mining.


Knowledge discovery: description/explanation.


Successful forecasting / predictive modeling.


Statistical techniques.


Artificial neural networks.


Support vector machines.





Programming languages and tools


Traditional programming languages: FORTRAN, C, C++


Scripting languages: Perl, Python, Ruby…


Program libraries specialized for molecular biology: standard
libraries (numerical analysis and text processing), libraries for
molecular biology (e.g. bioperl.org).


Java: Java Virtual Machine; computing over the Web?


Markup languages: implement data structures, e.g. XML.
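
As a rough illustration of how a markup language can carry a data structure, the sketch below parses a small XML record with Python's standard xml.etree.ElementTree module. The element and attribute names, the accession "EXAMPLE1" and the sequence are invented for the example and do not follow any real database schema.

```python
import xml.etree.ElementTree as ET

# A made-up record; tag names and values are illustrative only.
RECORD = """
<entry accession="EXAMPLE1">
  <name>example protein</name>
  <sequence length="5">MKTAY</sequence>
  <keyword>toy</keyword>
  <keyword>illustration</keyword>
</entry>
"""

def parse_entry(xml_text):
    """Turn the XML record into a plain Python dictionary."""
    root = ET.fromstring(xml_text)
    seq = root.find("sequence")
    return {
        "accession": root.get("accession"),
        "name": root.findtext("name"),
        "sequence": seq.text.strip(),
        "declared_length": int(seq.get("length")),
        "keywords": [k.text for k in root.findall("keyword")],
    }

entry = parse_entry(RECORD)
# A simple consistency check of the kind a curator might automate.
print(entry["accession"], len(entry["sequence"]) == entry["declared_length"])
```

The nesting of tags mirrors the nesting of the data structure, which is why the same record can be read by programs written in different languages.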





Natural language processing


Natural language: verbal-oral and/or textual forms of human-human
communication.


Natural language processing has been a goal of computing.


Difficulty: ambiguity of words and phrases.


Identifying keywords and combinations of keywords: e.g. names of
genes and names of diseases.


Knowledge extraction: protein-protein interactions (automatic
text-mining software).


Text mining:


Identification of references to genes and proteins.


Identification of interactions.


Interaction networks and diseases.


Hypothesis generation (unsuspected relationships between genes and
diseases).
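
The keyword identification and hypothesis generation steps above can be sketched, in a very simplified form, as dictionary lookup plus sentence-level co-occurrence counting. This is a toy, not how production text-mining software works: the two vocabularies and the three sentences are assumptions chosen for the example, and real systems must deal with synonyms and the ambiguity noted earlier.

```python
import re
from collections import Counter
from itertools import product

# Tiny hand-made vocabularies (assumptions for this sketch).
GENES = {"BRCA1", "TP53", "CFTR"}
DISEASES = {"breast cancer", "cystic fibrosis"}

# Toy text standing in for a set of abstracts.
TEXT = (
    "Mutations in BRCA1 increase the risk of breast cancer. "
    "CFTR mutations cause cystic fibrosis. "
    "TP53 and BRCA1 are both studied in breast cancer."
)

def find_terms(sentence, vocabulary):
    """Return the vocabulary terms that occur as whole words in a sentence."""
    found = set()
    for term in vocabulary:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.add(term)
    return found

def cooccurrences(text):
    """Count gene/disease pairs mentioned in the same sentence."""
    counts = Counter()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = find_terms(sentence, GENES)
        diseases = find_terms(sentence, DISEASES)
        counts.update(product(sorted(genes), sorted(diseases)))
    return counts

for (gene, disease), n in sorted(cooccurrences(TEXT).items()):
    print(gene, disease, n)
```

Pairs that co-occur often are candidates for a relationship; a pair that never co-occurs directly but shares intermediates would be the starting point for hypothesis generation.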




Archives and information
retrieval

Lesk, Ch 4

(Lesk, 2008)



Database indexing and specification of
search terms


An index: set of pointers to information in a database.


Information retrieval programs accept multiple query terms
and keywords.


Possible to ask for logical combinations of indexing terms.


Many database search engines allow complex logical
expressions.


Follow-up questions: modify query, cumulative searches, links
between entries in different databases.


Analysis and processing of retrieved data: using results
retrieved in one search as input for another one (some
information retrieval systems provide such facilities).
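
A minimal sketch of the ideas on this slide: an index as a set of pointers from terms to entries, logical combinations of terms, and the result of one search reused in another. The three entries and the helper names are assumptions made for illustration; real retrieval systems index structured fields rather than raw words.

```python
from collections import defaultdict

# Toy "database": entry id -> free-text description (invented entries).
ENTRIES = {
    1: "human hemoglobin alpha chain",
    2: "horse hemoglobin beta chain",
    3: "human insulin",
}

def build_index(entries):
    """Map each word to the set of entry ids that contain it."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for word in text.lower().split():
            index[word.strip(".,;:()")].add(entry_id)
    return index

def search_and(index, *terms):
    """Entries containing all of the terms."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Entries containing at least one of the terms."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))

def search_not(index, all_ids, term):
    """Entries that do not contain the term."""
    return set(all_ids) - index.get(term.lower(), set())

index = build_index(ENTRIES)
print(search_and(index, "human", "hemoglobin"))
print(search_or(index, "insulin", "horse"))
# Reuse one result as input to another: hemoglobin AND NOT human.
first = search_and(index, "hemoglobin")
print(first & search_not(index, ENTRIES, "human"))
```

Because every query returns a set of entry ids, results compose naturally: a follow-up search simply intersects or unites the earlier result with a new one.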





Nucleic acid sequence databases


Archiving of bioinformatics data was originally carried out by
individual research groups.


As requirements grew, projects became very large-scale.


Primary data collections related to biological macromolecules:


Nucleic acid sequences, including whole-genome projects.


Amino acid sequences of proteins.


Protein and nucleic acid structures.


Small-molecule crystal structures.


Protein functions.


Expression patterns of genes.


Networks: of metabolic pathways, of gene and protein interactions, and of
control cascades.


Publications.




Nucleic acid sequence databases


Triple partnership of the National Center for Biotechnology
Information (USA), EMBL-Bank (European Bioinformatics
Institute, UK) and the DNA Data Bank of Japan (National Institute of
Genetics, Japan).


Curate, archive and distribute DNA and RNA sequences.


Entries have a life history:


Unannotated -> Preliminary -> Unreviewed -> Standard


A sample entry includes properties of specific regions (e.g.
coding sequences, regions that perform or affect function, interact
with other molecules, affect replication, etc.).





Genome databases and genome browsers


Genome browsers (full-genome sequences): databases
bringing together all molecular information available about a
particular species.


E.g. ensembl.org: intended to be the universal information
source for the human and other genomes.






Protein sequence databases


In 2002, three protein sequence databases, the Protein
Information Resource (PIR), USA, SWISS-PROT, Switzerland, and
TrEMBL, Europe, formed the UniProt consortium.


Share the database but continue to offer separate
information-retrieval tools for access.


Databases associated with SWISS-PROT:


ENZYME DB and PROSITE


PIR and associated databases:


PIRSF: protein family classification system.


iProClass: protein knowledge, access to over 90 biological databases.


iProLINK: gateway to protein literature.






Databases of protein families


Evolutionary relationships / homology detection.


Two full-length protein sequences (>=100 residues) that have
>=25% identical residues in an optimal alignment are likely to
be related.


Need sequence alignment algorithms.


Refer to a group of related proteins as a family.
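
The rule of thumb above (at least 100 residues and at least 25% identity in an optimal alignment) can be sketched with a basic global alignment. The code uses the Needleman-Wunsch dynamic programming algorithm with a flat match/mismatch/gap scoring scheme; those scores, and the choice of alignment length as the denominator for percent identity, are assumptions for this sketch. Real homology searches use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return one optimal global alignment of a and b as two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def percent_identity(aln_a, aln_b):
    """Identical aligned residues as a percentage of alignment length."""
    same = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return 100.0 * same / len(aln_a)

def likely_related(a, b, min_len=100, min_identity=25.0):
    """Apply the slide's rule of thumb to two full-length sequences."""
    if len(a) < min_len or len(b) < min_len:
        return False
    return percent_identity(*needleman_wunsch(a, b)) >= min_identity

aln = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(aln[0])
print(aln[1])
print(round(percent_identity(*aln), 1))
```

The table has (n+1) by (m+1) cells, so this simple version is suitable for pairs of sequences but not for scanning a whole database.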






Databases of structures


Structure databases archive, annotate and distribute sets of
atomic coordinates.


World-wide Protein Data Bank (wwPDB.org).


Joint effort of the Research Collaboratory for Structural Bioinformatics
(RCSB) and the Protein Data Bank Japan.


Contains the structures of proteins.


It overlaps several other databases.


Several websites offer hierarchical classifications of all proteins
of known structure: SCOP, CATH, DALI, CE.






Other databases


Classification and assignment of protein function.


The Enzyme Commission.


The Gene Ontology Consortium protein function classification.


Specialized, or ‘boutique’ databases.


Expression (mRNA levels) and proteomics databases
(interpretation in terms of protein patterns).


Databases of metabolic pathways (flow of molecules and
energy through pathways of chemical reactions).


Bibliographic databases.


Only a few of the many databases…



