Bioinformatics Functions: Categorical List

Oct 2, 2013

COMP 5115 Programming Tools in Bioinformatics

Week 3


Bioinformatics Functions: Categorical List*

Data Formats and Databases: Get data into MATLAB from Web databases. Read and write to files using specific sequence data formats.

Trace Tools: Read data from an SCF file and draw nucleotide trace plots.

Sequence Conversion: Convert nucleotide and amino acid sequences between character and integer formats, reverse and complement the order of nucleotide bases, and translate nucleotide codons to amino acids.

Sequence Utilities: Calculate a consensus sequence from a set of multiply aligned sequences, run a BLAST search from MATLAB, and search sequences using regular expressions.

*See http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/ref for further details.

Sequence Statistics: Determine base counts, nucleotide density, codon bias, and CpG islands. Search for words and identify open reading frames (ORFs).

Pairwise Sequence Alignment: Compare nucleotide or amino acid sequences using pairwise sequence alignment functions.

Multiple Sequence Alignment: Compare sets of nucleotide or amino acid sequences. Progressively align sequences using a phylogenetic tree for guidance.

Scoring Matrices: Standard scoring matrices such as the PAM and BLOSUM families of matrices that alignment functions use.

Phylogenetic Tree Tools: Read phylogenetic tree files, calculate pairwise distances between sequences, and build a phylogenetic tree.

Phylogenetic Tree Methods: Select, modify, and plot phylogenetic trees using phytree object methods.

Graph Visualization Methods: View relationships between data visually with interactive maps, hierarchy plots, and pathways.

Protein Analysis: Determine protein characteristics and simulate enzyme cleavage reactions.

Profile Hidden Markov Models: Get profile hidden Markov model data from the PFAM database or create your own profiles from a set of sequences.

Microarray File Formats: Read data from common microarray file formats including Affymetrix GeneChip, ImaGene results, and SPOT files. Read GenePix GPR and GAL files.

Microarray Utility Functions: Using Affymetrix GeneChip data sets, get library information for a probe, gene information from a probe set, and probe set values from CEL and CDF information. Show probe set information from NetAffx and plot probe set values.

Microarray Visualization: Visualize microarray data with spatial plots, box plots, loglog plots, and intensity-ratio plots.

Microarray Normalization and Filtering: Normalize microarray data with lowess and mean normalization functions. Filter raw data for cleanup before analysis.

Statistical Learning: Classify and identify features in data sets, set up cross-validation experiments, and compare different classification methods.

Mass Spectrometry Preprocessing and Visualization: Preprocess raw mass spectrometry data from instruments, and analyze spectra to identify patterns and compounds.
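
As a small illustration of the Sequence Conversion and Sequence Statistics categories listed above, the following sketch calls four toolbox functions on a short sequence. It assumes the Bioinformatics Toolbox is installed; the sequence itself is made up for this example and is not taken from any database.

```matlab
% Assumption: Bioinformatics Toolbox is on the path.
% The sequence below is an invented example, not a real gene.
seq = 'ATGGCCATTGTAATGGGCCGCTGA';

codes  = nt2int(seq);          % characters -> integer codes
rc     = seqrcomplement(seq);  % reverse complement of the sequence
aa     = nt2aa(seq);           % translate codons to amino acids
counts = basecount(seq);       % structure with fields A, C, G, T

disp(rc);
disp(aa);
disp(counts);
```

Each call takes the character sequence directly, so the same variable can be passed from one category of functions to the next.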

Detailed Investigation of the Function Code (e.g., getgenbank.m)

getgenbank (getgenbank.m) retrieves nucleotide and amino acid sequence information from the GenBank database. This database is maintained by the National Center for Biotechnology Information (NCBI). For more details about the GenBank database, see http://www.ncbi.nlm.nih.gov/Genbank/


Syntax

Data = getgenbank('AccessionNumber', 'PropertyName', PropertyValue ...)
getgenbank(..., 'ToFile', ToFileValue)
getgenbank(..., 'FileFormat', FileFormatValue)
getgenbank(..., 'SequenceOnly', SequenceOnlyValue)


Arguments

AccessionNumber: Unique identifier for a sequence record. Enter a unique combination of letters and numbers.

ToFile: Property to specify the location and filename for saving data. Enter either a filename or a path and filename supported by your system (ASCII text file).

FileFormat: Property to select the format for the file specified with the ToFile property. Enter either 'GenBank' or 'FASTA'.

SequenceOnly: Property to control getting the sequence only. Enter either true or false.


Related Bioinformatics Toolbox functions: genbankread, getembl, getgenpept, getpdb, getpir, seqtool


Let’s use the function to retrieve the sequence from chromosome 19 that
codes for the human insulin receptor and store it in structure S.


S = getgenbank('M10051')

S =

                LocusName: 'HUMINSR'
      LocusSequenceLength: '4723'
     LocusNumberofStrands: ''
            LocusTopology: 'linear'
        LocusMoleculeType: 'mRNA'
     LocusGenBankDivision: 'PRI'
    LocusModificationDate: '06-JAN-1995'
               Definition: 'Human insulin receptor mRNA, complete cds.'
                Accession: 'M10051'
                  Version: 'M10051.1'
                       GI: '186439'
                 Keywords: 'insulin receptor; tyrosine kinase.'
                  Segment: []
                   Source: 'Homo sapiens (human)'
           SourceOrganism: [3x65 char]
                Reference: {[1x1 struct]}
                  Comment: [14x67 char]
                 Features: [51x74 char]
                      CDS: [139 4287]
                 Sequence: [1x4723 char]
                SearchURL: [1x105 char]
              RetrieveURL: [1x95 char]

The function code will be studied using the M-file of the function, the MATLAB Command Window, and the Bioinformatics Toolbox documentation.


function gbout=getgenbank(accessnum,varargin)
%

if ~usejava('jvm')
    error('Bioinfo:NeedJVM','%s requires Java.',mfilename);
end

num_argin = length(varargin);
onlySequence = false;

for n = 1:2:num_argin
    arg = lower(varargin{n});
    if strmatch(arg,'sequenceonly')%#ok
        pval = varargin{n+1};
        onlySequence = opttf(pval);
        if isempty(onlySequence)
            error('Bioinfo:InputOptionNotLogical','SequenceOnly must be a logical value, true or false.');
        end
        continue;
    end
end

try

length is also a MATLAB function (see length.m). LENGTH(X) returns the length of a vector X.

opttf decides whether an input option is true or false.
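
The property/value loop above can be sketched in isolation. This is a minimal sketch under stated assumptions, not the toolbox code: parseSequenceOnly and toLogicalOrEmpty are hypothetical names, toLogicalOrEmpty is only a guess at the kind of check the private opttf helper performs, and strncmpi is used here in place of strmatch to accept a prefix of the property name.

```matlab
function tf = parseSequenceOnly(varargin)
% Sketch: scan property/value pairs for 'SequenceOnly' (or a prefix of it).
% Save as parseSequenceOnly.m. All names here are hypothetical.
tf = false;                              % default, as in getgenbank
for n = 1:2:numel(varargin)
    name = varargin{n};
    if ischar(name) && ~isempty(name) && ...
            strncmpi(name, 'sequenceonly', numel(name))
        if n + 1 > numel(varargin)
            error('Sketch:MissingValue', 'SequenceOnly needs a value.');
        end
        tf = toLogicalOrEmpty(varargin{n+1});
        if isempty(tf)
            error('Sketch:NotLogical', ...
                'SequenceOnly must be a logical value, true or false.');
        end
    end
end
end

function tf = toLogicalOrEmpty(pval)
% Hypothetical stand-in for opttf: returns true, false, or [] if the
% value cannot be interpreted as a logical option.
tf = [];
if islogical(pval) && isscalar(pval)
    tf = pval;
elseif isnumeric(pval) && isscalar(pval)
    tf = (pval ~= 0);
elseif ischar(pval)
    if any(strcmpi(pval, {'true', 'yes', 'on', 't'}))
        tf = true;
    elseif any(strcmpi(pval, {'false', 'no', 'off', 'f'}))
        tf = false;
    end
end
end
```

For example, parseSequenceOnly('SequenceOnly', true) and parseSequenceOnly('seq', 'yes') both return true, while parseSequenceOnly('ToFile', 'x.txt') leaves the default of false.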


    if onlySequence
        % if only the sequence is desired, always use FASTA
        gb = getncbidata(accessnum,varargin{:},'database','nucleotide','fileformat','FASTA');
    else
        % otherwise, default to GenBank format. if the format is specified as
        % an input, then it will be used. the nucleotide database will always be
        % used.
        gb = getncbidata(accessnum,'fileformat','GenBank','database','nucleotide',varargin{:});
    end
catch
    le = lasterror;
    warning('Bioinfo:GenBankDataNotFound','Unable to get GenBank information for access number %s. Trying FASTA...',accessnum);
    try
        gb = getncbidata(accessnum,varargin{:},'database','nucleotide','fileformat','FASTA');
    catch
        % throw original error
        rethrow(le);
    end
end

getncbidata retrieves sequence information from the NCBI databases.
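
The try/catch fallback used above (try the preferred request, fall back to a second one, and rethrow the original error if both fail) can be sketched on its own. This is a minimal sketch, not the toolbox code: fetchWithFallback and its two function-handle arguments are hypothetical, and the sketch captures the error with the `catch originalErr` form rather than lasterror.

```matlab
function out = fetchWithFallback(primaryFcn, fallbackFcn)
% Sketch: call primaryFcn; if it errors, warn and call fallbackFcn.
% If the fallback also errors, report the ORIGINAL error.
% Save as fetchWithFallback.m. Both arguments are function handles.
try
    out = primaryFcn();
catch originalErr
    warning('Sketch:PrimaryFailed', ...
        'Primary request failed. Trying fallback...');
    try
        out = fallbackFcn();
    catch
        % throw original error, as getgenbank does
        rethrow(originalErr);
    end
end
end
```

Called as fetchWithFallback(@() 1, @() 2) it returns 1. If the first handle raises an error, it returns the result of the second handle after issuing the warning.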


if nargout || onlySequence || ~usejava('desktop') || any(strcmpi(varargin,'FASTA'))
    gbout = gb;
else
    for n = 1:numel(gb)
        tmp = gb(n);
        if isstruct(tmp) && isfield(tmp,'SearchURL')
            searchurl = tmp.SearchURL;
            retrieveurl = strrep(tmp.RetrieveURL,'Text','Retrieve');
            tmp = rmfield(tmp,{'SearchURL';'RetrieveURL'});
            disp(tmp);
            disp([char(9) 'SearchURL: <a href="' searchurl '">' tmp.Accession '</a>']);
            disp([char(9) 'RetrieveURL: <a href="' retrieveurl '">' tmp.GI '</a>']);
        else
            % should never get here...
            warning('Bioinfo:NoSearchURL','Expected a structure with field SearchURL.');
            disp(tmp);
        end
    end
end

A list of the functions used to code getgenbank.m:

getncbidata, isempty, opttf, varargin, strmatch, lower, length, usejava, isstruct, isfield, strrep, rmfield

Detailed Investigation of the Function Code (getncbidata.m)

getncbidata retrieves sequence information from the NCBI databases.

gbout = getncbidata(accessnum) searches for the Accession number in the GenBank or GenPept database, and returns a structure containing information for the sequence.

gbout = getncbidata(...,'TOFILE',FILENAME) saves the data returned from the database in the file FILENAME.

gbout = getncbidata(...,'FILEFORMAT',FORMAT) retrieves the data in the specified format. The accepted values are 'GenBank', 'GenPept' and 'FASTA'.

gbout = getncbidata(...,'SEQUENCEONLY',TF) returns just the sequence as a character array.

Example: The following commands retrieve the nucleotide sequence from chromosome 19 that codes for the human insulin receptor (stored in nt) and the insulin receptor protein sequence (stored in aa).

nt = getncbidata('M10051','database','nucleotide')
aa = getncbidata('AAA59174','database','protein')

Below is a very small part of the code of the getncbidata function:

….
….
% create the url that is used
% see
% http://www.ncbi.nlm.nih.gov/entrez/query/static/linking.html
% for more information
searchurl = ['http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=' db '&term=' accessnum '&dopt=' dbfrm '&mode=file'];

% get the html file that is returned as a string
s=urlread(searchurl);
….
….

See http://www.ncbi.nlm.nih.gov/entrez/query/static/linking.html for details on creating a Web link to the Entrez databases.

The full code can be seen in getncbidata.m
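
To see what the concatenation in the excerpt produces, the following sketch builds the search URL from example values without contacting any server. The values assigned to db, accessnum, and dbfrm are assumptions chosen for illustration (the database name and accession number appear earlier in these notes; the value 'GenBank' for dbfrm is a guess at what the display-format variable holds).

```matlab
% Example values: assumptions for illustration only.
db        = 'nucleotide';   % database name
accessnum = 'M10051';       % accession number used in the examples
dbfrm     = 'GenBank';      % assumed display format

% Same concatenation as in the excerpt, split with ... for readability.
searchurl = ['http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=' ...
             db '&term=' accessnum '&dopt=' dbfrm '&mode=file'];

disp(searchurl)
```

Square brackets concatenate the character arrays in order, so the variable parts are simply spliced between the fixed pieces of the query string.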

S = URLREAD('URL') reads the content at a URL and returns it as a string S. If the server returns binary data, the string will contain garbage.

S = URLREAD('URL', 'method', PARAMS) passes information to the server as part of the request. The 'method' can be 'get' or 'post'. PARAMS is a cell array of param/value pairs.

[S, STATUS] = URLREAD(...) catches any errors and returns 1 if the file downloaded successfully and 0 otherwise.

Examples

s=urlread('http://google.com')
s=urlread('http://www.mathworks.com')
s=urlread('ftp://ftp.mathworks.com/pub/pentium/Moler_1.txt')
s=urlread('file:///C:\winnt\matlab.init')
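
The two-output form described above can be used to handle a failed download without an error being thrown. A minimal sketch, assuming a MATLAB release that still provides urlread and using one of the example URLs from above:

```matlab
url = 'http://www.mathworks.com';

% With two outputs, urlread reports failure through status
% instead of raising an error.
[s, status] = urlread(url);

if status
    fprintf('Downloaded %d characters from %s\n', numel(s), url);
else
    fprintf('Could not read %s\n', url);
end
```

Whether the download succeeds depends on the network connection, so the sketch only branches on status and makes no assumption about the content of s.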