BioPerl at 15: New Features, New Directions


Christopher J. Fields, University of Illinois, cjfields@uiuc.edu

*Mark A. Jensen, Fortinbras Research and SRA International, mark_jensen@sra.com

Jason E. Stajich, University of California at Riverside, jason.stajich@ucr.edu

The BioPerl Project, an open-source Perl toolkit for bioinformatics, was initiated in 1995 and became instrumental in the automated organization and analysis of original Human Genome Project data. Since then, BioPerl has become a complete object-oriented Perl environment for bioinformatics development, with modules to perform a wide range of bioinformatics functions, including multi-format parsing and translation, object-relational model databasing, EMBL and NCBI web service access, and external program execution.

The BioPerl developer community is actively responding to the far-reaching changes in the field that have taken place over the last several years. Major goals are: (1) to provide new functionality useful to researchers at the cutting edge of bioinformatics, (2) to reorganize BioPerl into smaller application-oriented packages, (3) to deprecate older modules whose utility has declined substantially, and (4) to continue to expand and improve documentation, so that BioPerl remains useful and relevant in the years ahead.

Year  Sponsoring Institution  Student       Project                          Example Module
2008  NESCent                 Mira Han      PhyloXML parsing                 Bio::TreeIO::phyloxml
2009  NESCent                 Chase Miller  NeXML parsing                    Bio::Nexml
2010  OBF                     Jun Yin       Alignment subsystem refactoring  in progress

source: http://www.ohloh.net/p/bioperl

Google Summer of Code


BioPerl has provided mentorship for GSoC projects for the past three years. These have resulted in material
additions to the codebase, and have been focused on expanding BioPerl's capabilities in format parsing and large
file processing.

The BioPerl wiki
(http://bioperl.org)


The wiki is now the central location for all BioPerl documentation: installation, module POD, HOWTO articles, code snippets, and personnel descriptions. It has played an important role as the new face of BioPerl and as a landing place for the developer discussions that are taking BioPerl forward.

BioPerl on GitHub
(http://github.com/bioperl)


BioPerl recently migrated all active repositories to GitHub from OBF-hosted Subversion. With the move to git comes decentralization and more fluid, independent development. We expect this to improve BioPerl's response time both to bugs and to new developments in the field, as well as to increase new developer recruitment and community participation.

Community participation and development

New features

New directions

Next-gen sequencing support


Bringing BioPerl up to speed for next-gen sequence data handling has led to efforts along three lines: file format standardization, common command-line tool wrapping, and BioPerl object system I/O integration tailored to next-gen data.

Formats


BioPerl and other Bio* projects recently published a
collaborative effort to standardize FASTQ formats,
including variants for Illumina and Solexa platforms.
These formats are now in use across BioPerl and the
Bio* projects.
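
The FASTQ variants differ in how quality scores are encoded as characters. The following is a minimal sketch of that convention in plain Perl, not BioPerl's own implementation: it assumes the commonly documented ASCII offsets (33 for Sanger, 64 for Solexa and early Illumina) and the standard Solexa-to-Phred conversion. The function names `decode_quality`, `solexa_to_phred`, and `encode_sanger` are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# ASCII offsets assumed for the three FASTQ variants.
my %OFFSET = (sanger => 33, solexa => 64, illumina => 64);

# Convert a Solexa-scaled score to the equivalent Phred-scaled score.
sub solexa_to_phred {
    my ($q) = @_;
    return 10 * log(10 ** ($q / 10) + 1) / log(10);
}

# Decode a quality string into a list of integer Phred scores.
sub decode_quality {
    my ($qual, $variant) = @_;
    my $offset = $OFFSET{$variant} // die "unknown FASTQ variant '$variant'";
    my @scores = map { ord($_) - $offset } split //, $qual;
    # Solexa scores are on a different scale; convert and round them.
    if ($variant eq 'solexa') {
        @scores = map { sprintf('%.0f', solexa_to_phred($_)) + 0 } @scores;
    }
    return @scores;
}

# Re-encode Phred scores as a Sanger-style quality string.
sub encode_sanger {
    return join '', map { chr($_ + 33) } @_;
}

# Usage: convert an Illumina-encoded quality string to Sanger encoding.
my $sanger_qual = encode_sanger(decode_quality('hT@', 'illumina'));
```

Within BioPerl itself, variant handling is done by the FASTQ parser rather than by user code like this; the sketch only shows why the variant must be known before the scores can be interpreted.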


Support for important binary formats (BAM, BigWig) is provided by wrappers for command-line tools, and by the integration of fast XS-based Perl modules such as Lincoln Stein's Bio-SamTools and Bio-BigFile CPAN packages.

Wrappers


Enhancements to the Bio::Tools::Run::WrapperBase system have made it easier to add BioPerl wrapper modules for external programs, and to integrate these into other modules that implement pipelines using BioPerl sequence and alignment objects as I/O.

Tracking NCBI developments


In the past year, NCBI has released a fully updated BLAST toolkit, blast+ (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST), and has been encouraging a move from their EUtilities RESTful interface to a newer SOAP interface (http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/v2.0/DOC/esoap_help.html).

BioPerl has responded with Bio::Tools::Run::StandAloneBlastPlus and Bio::DB::SoapEUtilities. These were designed not only to update the API, but also to add I/O layers that accept and parse messages into familiar BioPerl objects, and to build in straightforward methods for creating pipelines of blast+ program analyses or EUtilities fetches.

bedtools, bowtie, bwa, minimo, newbler, samtools

BioPerl object support: Bio::Assembly


The Bio::Assembly system has been extensively updated to include reading and/or writing assemblies in MAQ, BAM, SAM, BWA, and other formats. Assembly object support is integrated into run wrappers for bwa, bedtools, maq, and samtools. Future work will incorporate new sequence objects that are optimized for large files (through the work of GSoC student Jun Yin).

    use Bio::Tools::Run::Maq;

    # Create the wrapper, then run the assembly on the read and reference files
    my $maq = Bio::Tools::Run::Maq->new();
    my $assy_obj = $maq->run('read1.fastq', 'refseq.fas', 'read2.fastq');

[Figure: maq assembly pipeline. Steps: convert plain-text sequence; map reads to reference sequence; assemble map into consensus; extract info from consensus. maq commands shown: fasta2bfa, fastq2bfq, map, mapmerge, assemble, mapview, cns2fq.]

Timeline


BioPerl has grown in its user and
developer base since those early days.
New developers and collaborations have
contributed not only key modules, but
also important design methodologies and
refactoring over the years that have
helped BioPerl to maintain its usefulness
and relevance. Discontinuities followed by
increases in lines of code over time reflect
a high level of community flexibility and
dedication in pursuit of DTWT.

General wrapper facility


A set of modules (Bio::Tools::WrapperMaker) is under development that will increase the responsiveness of BioPerl development by providing an XML-based way for users themselves to specify the interface for their favorite command-line programs, at the same time creating a common, consistent API for executing those programs and accessing output.

Intermediate layers for large file handling and generic parsing


BioPerl parsers generally take raw data to Perl objects with no intermediate layer. This induces prohibitive overhead when parsing large files, and can also limit user flexibility: parsing may be desired, but not the BioPerl objects. The first problem is being tackled by attaching backend handlers onto container class constructors that are able to persist records of large files efficiently, creating BioPerl objects only as needed or desired. The second problem has led to experiments in generic parsing: data file records are parsed into a simple stream of hashes, which can then be directed where the user desires: into the creation of BioPerl objects as usual, or elsewhere.
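
The "stream of hashes" idea can be illustrated with a small plain-Perl sketch. This is not the experimental BioPerl code; it assumes FASTA input, and the function name `fasta_record_stream` and the hash keys `id`, `desc`, and `seq` are hypothetical choices made for the example. Each call to the returned iterator reads just one record, so no objects are built unless the caller chooses to build them.

```perl
use strict;
use warnings;

# Return an iterator (closure) that yields one hash reference per FASTA record.
sub fasta_record_stream {
    my ($fh) = @_;
    my $pending;    # header line read ahead while finishing the previous record
    return sub {
        my $header = $pending;
        undef $pending;
        my $seq = '';
        while (defined(my $line = <$fh>)) {
            chomp $line;
            $line =~ s/\r$//;
            next if $line eq '';
            if ($line =~ /^>/) {
                # A new header ends the current record; save it for the next call.
                if (defined $header) { $pending = $line; last; }
                $header = $line;
                next;
            }
            $seq .= $line if defined $header;
        }
        return unless defined $header;    # no more records
        my ($id, $desc) = $header =~ /^>\s*(\S*)\s*(.*)$/;
        return { id => $id, desc => $desc, seq => $seq };
    };
}

# Usage: read from an in-memory filehandle and route each hash where desired.
my $data = ">seq1 first record\nACGT\nTTGA\n>seq2\nGGCC\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $next = fasta_record_stream($fh);
my @records;
while (my $rec = $next->()) {
    push @records, $rec;    # or pass $rec to an object constructor, a database, etc.
}
```

Because the consumer only sees plain hashes, the same stream could feed object construction, a filter, or a persistence layer without changing the parser.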

Biome and BioPerl 6


BioPerl has been object-oriented from the beginning, but suffers the weaknesses of Perl 5 objects: very high overhead, loose encapsulation, limited object introspection, and the lack of built-in interfaces and roles, among other things.

These issues are being addressed in two ways: in Perl 5 through the Moose classes and dependencies, and in the creation of Perl 6. BioPerl is exploring both paths to true objects with the experimental Biome (BioPerl with Metaobject Extensions) and BioPerl 6 projects.

[Figure: Biome role as interface. Diagram labels: Class, Role, main::]

Shattering the Monolith


BioPerl continues to be distributed as just a handful of packages. The core package in particular has grown to 341 files, comprising 874 classes with 23,146 tests. Maintenance and installation issues are barriers to developers and users alike. We are in the process of splitting the core into reasonable, application-related chunks. This plus the git migration should significantly improve BioPerl management.

The BioPerl Core Development Team is Sendu Bala, Rob Buels, Christopher Fields, Mark Jensen, Hilmar Lapp, Heikki Lehväslaiho, Aaron Mackey, Dave Messina, Brian Osborne, Jason Stajich, and Lincoln Stein. Key support is provided by Chris Dagdigian and Mauricio Herrera Cuadra. Florent Angly and Dan Kortschak are lead developers of projects discussed here.