Support On-the-Fly Bioinformatics Data Integration

Oct 2, 2013

Supporting High-Performance Data Processing on Flat Files

Xuan Zhang

Gagan Agrawal

Ohio State University

Motivation


Challenges of bioinformatics integration


Data volume: overwhelming


DNA sequence: 100 gigabases (August, 2005)


Data growth: exponential


Figure provided by PDB

Existing Solutions



(Relational) Databases


Support for indexing and high-level queries


Not suitable for biological data


Flat Files with Scripts


Compact, Perl scripts available


Lack indexing and high-level query processing


Web services


Significant overhead




Enhance information integration systems on


Functionality


On-the-fly data incorporation


Flat file data processing


Usability


Declarative interface


Low programming requirement


Performance


Incorporate indexing support

Our Approach

Approach Summary


Metadata


Declarative description of data


Data mining algorithms for semi-automatic writing of descriptors


Reusable by different requests on same data


Code generation


Request analysis and execution separated


General modules with plug-in data module


System Overview

[Figure: system overview diagram. Labels: Understand Data (Layout Miner, Schema Miner); Process Data (Code Generation, Request Processor); Data File; User Request; Answer; Metadata Description (one Layout Descriptor and Schema Descriptor pair per dataset); Information Integration System.]

Advantages


Simple interface


At metadata level, declarative


General data model


Semi-structured data


Flat file data


Low human involvement


Semi-automatic data incorporation


Low maintenance cost


OK Performance


Linear scaling guaranteed


Can improve by using indexing

System Components


Understand data


Layout mining


Schema mining


Process data


Wrapper generation


Query processing


Query processing with indices

Data Processing Overview


Automatic code generation approach


Input


Metadata about datasets involved


Optional:


Implicit data transformation task


Request by users


Indexing functions


Output


Executable programs


General modules


Task-specific data module

Metadata Description


Two aspects of data in flat files


Logical view of the data


Physical data organization


Two components of every data descriptor


Schema description


Layout description


Design goals


Powerful


Easy to write and interpret

Schema Descriptors


Follow XML DTD standard for semi-structured data




Simple attribute list for relational data

<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT SEQ (#PCDATA)>

[FASTA]                 //Schema name
ID = string             //Data type definitions
DESCRIPTION = string
SEQ = string
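As a rough illustration of how little machinery the attribute-list form needs, the following Python sketch reads such a descriptor into a schema name and an attribute-to-type mapping. The function name `parse_schema_descriptor` and the handling of `//` comments are assumptions made for this example; the slides do not show the authors' parser.

```python
def parse_schema_descriptor(text):
    """Parse a '[NAME]' header followed by 'attribute = type' lines.

    Hypothetical helper: the grammar is inferred from the FASTA example only.
    """
    name = None
    attributes = {}
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing // comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()          # schema name
        elif "=" in line:
            attr, type_name = (part.strip() for part in line.split("=", 1))
            attributes[attr] = type_name       # attribute and its data type
    return name, attributes


FASTA_SCHEMA = """
[FASTA]                 //Schema name
ID = string             //Data type definitions
DESCRIPTION = string
SEQ = string
"""

schema_name, schema_attrs = parse_schema_descriptor(FASTA_SCHEMA)
print(schema_name, schema_attrs)
```

The resulting mapping could be reused by every request against the same dataset, which is the reuse property the approach summary mentions.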

Layout Descriptors


Overall structure (FASTA example)

DATASET "FASTAData" {            //Dataset name
  DATATYPE {FASTA}               //Schema name
  DATASPACE LINESIZE=80 {
    // ---- File layout details go here ----
  }
  DATA {osu/fasta}               //File location
}
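The layout details are elided on the slide, so the sketch below does not reproduce them. It only illustrates, in Python, what a reader for a FASTA dataset might produce once the layout is known: one (ID, DESCRIPTION, SEQ) entry per record, together with the byte offset where the record starts. It assumes the usual FASTA convention of a `>` header line followed by sequence lines; `read_fasta_records` and the sample data are hypothetical.

```python
import io


def read_fasta_records(stream):
    """Yield (byte_offset, ID, DESCRIPTION, SEQ) for each record in a binary stream.

    Hypothetical reader: assumes '>' starts a header whose first token is the ID.
    """
    offset = 0
    current = None  # [record_offset, id, description, sequence_parts]
    for raw in stream:
        line = raw.decode("ascii").rstrip("\r\n")
        if line.startswith(">"):
            if current is not None:
                yield current[0], current[1], current[2], "".join(current[3])
            header = line[1:].split(None, 1)
            rec_id = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            current = [offset, rec_id, description, []]
        elif current is not None and line:
            current[3].append(line)  # sequence may span several lines
        offset += len(raw)           # track position in bytes, not characters
    if current is not None:
        yield current[0], current[1], current[2], "".join(current[3])


# Made-up sample data, held in memory so the example is self-contained.
SAMPLE = b">seq1 first toy record\nACGT\nACGT\n>seq2 second toy record\nGGCC\n"
records = list(read_fasta_records(io.BytesIO(SAMPLE)))
for record in records:
    print(record)
```

Keeping the byte offset alongside each parsed entry is what later allows an index to point straight back into the flat file.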


Wrapper Generation

System Overview

[Figure: wrapper generation system diagram. Labels: Layout Descriptor, Layout Parser, Data Entry Representation, Schema Descriptors, Mapping Generator, Schema Mapping, Mapping File, Mapping Parser, Application Analyzer, WRAPINFO, Wrapper generation system, wrapper, DataReader, Synchronizer, DataWriter, Source Dataset, Target Dataset.]

Query With Indices

Motivation


Goal


Improve the performance of the query-processing program


Index


Maintain the advantages


Flat file based


Low requirement on programming

Challenges & Approaches


Various indexing algorithms for various biological data


User-defined indexing functions


Standard function interfaces


Flat file data


Values parsed implicitly and ready to be indexed


Byte offset as pointer


Metadata about indices


Layout descriptor
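A minimal Python sketch of the idea of user-defined indexing functions behind a standard interface, with byte offsets serving as pointers into the flat file. The two function names echo the `index_gen_fun` and `index_retr_fun` slots of the layout descriptor, but their signatures here are assumptions, and a plain dictionary stands in for whatever index structure a user would actually supply.

```python
import io


def index_gen_fun(stream, key_of):
    """Hypothetical generator: map each record's key to the byte offsets of its lines."""
    index = {}
    offset = 0
    for raw in stream:
        index.setdefault(key_of(raw), []).append(offset)
        offset += len(raw)
    return index


def index_retr_fun(index, stream, key):
    """Hypothetical retriever: seek to each stored offset and read the record there."""
    results = []
    for offset in index.get(key, []):
        stream.seek(offset)
        results.append(stream.readline())
    return results


def first_field(raw):
    """Key extractor for tab-separated lines: the first field, as text."""
    return raw.split(b"\t", 1)[0].decode("ascii")


# Made-up tab-separated records standing in for a flat file.
DATA = b"G1\talpha\nG2\tbeta\nG3\tgamma\n"
data_file = io.BytesIO(DATA)
gene_index = index_gen_fun(data_file, first_field)
hits = index_retr_fun(gene_index, data_file, "G2")
print(gene_index, hits)
```

Because the index stores only offsets, the data stays in its original flat file and a different indexing algorithm can be swapped in without touching the reader.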

System Revisited

[Figure: revised system diagram. Labels: Query analysis (query, Query parser, Metadata collection, Dataset descriptors, Descriptor parser, Application analyzer, QUERYINFOR: Source/target names, Schema & Layout information, mappings); Query execution (DataReader, Synchronizer, DataWriter, Source data files, Target data file, Index file, Index functions).]

Language Enhancement


Describe indices


Indexing is a property of dataset


Extend layout descriptors





Maintain query format


DATASET "name" {

  INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
        [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}

}
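To make the five-field INDEX entry concrete, here is a small Python sketch that splits such a clause into named fields. The parser, the `IndexSpec` type, and the example values (file paths and function names) are all hypothetical; the sketch also assumes that no field contains a colon or a comma.

```python
from typing import NamedTuple


class IndexSpec(NamedTuple):
    """One INDEX entry; field names follow the syntax shown on the slide."""
    attribute: str
    index_file_loc: str
    index_gen_fun: str
    index_retr_fun: str
    fun_loc: str


def parse_index_clause(clause):
    """Hypothetical parser for 'INDEX {a:b:c:d:e [, a:b:c:d:e]}'."""
    text = clause.strip()
    if not text.startswith("INDEX"):
        raise ValueError("expected an INDEX clause")
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("INDEX clause needs a {...} body")
    specs = []
    for entry in text[start + 1:end].split(","):
        fields = [field.strip() for field in entry.split(":")]
        if len(fields) != 5:
            raise ValueError(f"expected 5 colon-separated fields, got {entry!r}")
        specs.append(IndexSpec(*fields))
    return specs


# Example values are invented for illustration only.
clause = "INDEX {ID:idx/id.idx:gen_hash:retr_hash:lib/hashidx, SEQ:idx/seq.idx:gen_seq:retr_seq:lib/seqidx}"
specs = parse_index_clause(clause)
for spec in specs:
    print(spec)
```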

AUTOWRAP GNAMES
FROM CHIPDATA, YEASTGENOME
BY CHIPDATA.GENE = YEASTGENOME.ID
WHERE


New meaning of "=":

If index available, use index retrieving function

Else, compare values directly
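The two-way meaning of "=" can be sketched in Python as a join that takes an optional retrieval callable: when one is supplied it is used for the look-up, otherwise values are compared directly. This is an illustration of the dispatch only, not the generated code; `join_equal`, the `retrieve` parameter, and the toy rows are assumptions.

```python
def join_equal(left_rows, left_key, right_rows, right_key, retrieve=None):
    """Yield (left, right) pairs whose key fields are equal.

    retrieve is a hypothetical index-retrieval callable: value -> matching rows.
    When it is None, fall back to comparing values directly.
    """
    for left in left_rows:
        value = left[left_key]
        if retrieve is not None:
            matches = retrieve(value)  # index available: look up
        else:
            matches = [r for r in right_rows if r[right_key] == value]  # scan
        for right in matches:
            yield left, right


# Toy rows standing in for CHIPDATA and YEASTGENOME (values are made up).
chipdata = [{"GENE": "g2", "LEVEL": 1.5}, {"GENE": "g9", "LEVEL": 0.2}]
yeastgenome = [{"ID": "g1", "NAME": "alpha"}, {"ID": "g2", "NAME": "beta"}]

# A dictionary plays the role of the index on YEASTGENOME.ID.
id_index = {}
for row in yeastgenome:
    id_index.setdefault(row["ID"], []).append(row)

with_index = list(join_equal(chipdata, "GENE", yeastgenome, "ID",
                             retrieve=lambda v: id_index.get(v, [])))
without_index = list(join_equal(chipdata, "GENE", yeastgenome, "ID"))
print(with_index == without_index)
```

Both paths return the same pairs; only the cost differs, since the scan touches every right-hand row for each left-hand row. That is why the query format can stay unchanged when an index is added.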

System Enhancement


Metadata Descriptor Parser

+ parse index information


Application Analyzer

+ index information: index look-up table

+ test condition: compare_field_indexing

Microarray Gene Information Look-up


Goal: gather information about genes (120)


Query: microarray output join genome database


Index: gene names in genome

BLAST-ENHANCE Query


Goal: Add extra information to BLAST output


Query: BLAST output join Swiss-Prot database


Index: protein ID in Swiss-Prot

OMIM-PLUS Query


Goal: add Swiss-Prot link to OMIM


Query: OMIM join Swiss-Prot


Index: protein ID in Swiss-Prot

Homology Search Query


Goal: find similar sequences


Query: query sequence list * sequence database


Indexing algorithm


Sequence-based


Transformation of sub-string composition


Indexing n-D numerical values
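As a toy illustration of turning sub-string composition into n-D numerical values, the Python sketch below maps each sliding window of a DNA sequence to a 4-D vector of base counts and then computes the bounding box of those vectors, one common way to summarize a group of points for a spatial index. This is not the algorithm used in the talk: the window width, the plain base counts, and all function names are assumptions, and no wavelet transform is applied.

```python
ALPHABET = "ACGT"


def composition_vector(window):
    """Count each base in the window, giving one 4-D numerical point."""
    return tuple(window.count(base) for base in ALPHABET)


def window_vectors(sequence, width):
    """Composition vectors of every sliding window of the given width."""
    return [composition_vector(sequence[i:i + width])
            for i in range(len(sequence) - width + 1)]


def bounding_rectangle(vectors):
    """Smallest axis-aligned box (lows, highs) containing all vectors."""
    lows = tuple(min(dim) for dim in zip(*vectors))
    highs = tuple(max(dim) for dim in zip(*vectors))
    return lows, highs


vectors = window_vectors("AACGT", 4)   # windows: "AACG", "ACGT"
box = bounding_rectangle(vectors)
print(vectors, box)
```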

Homology Search (1)


Index (Singh’s algorithm)


Data: yeast genome


Wavelet coefficients


Minimum bounding rectangles

Homology Search (2)


Index (Ferhatosmanoglu’s algorithm)


Data: GenBank


Wavelet coefficients


Scalar quantization


R-tree

Conclusions


A framework and a set of tools for on-the-fly flat file data integration


New data sources understood semi-automatically by data mining tools


New data processed automatically by generated programs


Support for indexing incorporated flexibly