Bioinformatics Functions: Categorical List


COMP 5115 Programming Tools in Bioinformatics

Week 4


Detailed Investigation of the Bioinformatics Functions*: getpdb


The Protein Data Bank (PDB) (http://www.pdb.org) is an archive of experimentally determined three-dimensional structures of biological macromolecules such as proteins. (Note that the new beta site of the PDB will replace the current RCSB PDB portal on January 1, 2006.)


getpdb retrieves sequence information from the PDB.


Syntax

Data = getpdb('PDBid', 'PropertyName', PropertyValue...)
getpdb(..., 'ToFile', ToFileValue)
getpdb(..., 'MirrorSite', MirrorSiteValue)

*See http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/ref for further details.

Detailed Investigation of the function codes: getpdb


Arguments


PDBid: Unique identifier for a protein structure record. Each structure in the PDB is represented by a 4-character alphanumeric identifier. For example, 4hhb is the identification code for hemoglobin.

ToFile: Property to specify the location and filename for saving data. Enter either a filename or a path and filename supported by your system (ASCII text file).

MirrorSite: Property to select the Web site. Enter either http://rutgers.rcsb.org/pdb to use the Rutgers University Web site, or enter http://nist.rcsb.org/pdb for the National Institute of Standards and Technology site.

Detailed Investigation of the function codes: getpdb

Data = getpdb('PDBid', 'PropertyName', PropertyValue...) searches for the ID in the PDB database and returns a MATLAB structure containing the following fields:

Header, Title, Compound, Source, Keywords, ExperimentData, Authors, Journal, Remark1, Remark2, Remark3, Sequence, HeterogenName, HeterogenSynonym, Formula, Site, Atom, RevisionDate, Superseded, Remark4, Remark5, Heterogen, Helix, Turn, Cryst1, OriginX, Scale, Terminal, HeterogenAtom, Connectivity

getpdb(..., 'ToFile', ToFileValue) saves the data returned from the database to a file. Read a PDB formatted file back into MATLAB using the function pdbread.

getpdb(..., 'MirrorSite', MirrorSiteValue) allows you to choose a mirror site for the PDB database.

The default site is the San Diego Supercomputer Center, http://www.rcsb.org/pdb.

See http://www.rcsb.org/pdb/mirrors.html for a full list of PDB mirror sites (e.g., www.pdb.org).

Detailed Investigation of the function codes: getpdb


Related Bioinformatics Toolbox functions

getembl

getgenbank

getgenpept

getpir

pdbdistplot

pdbplot

pdbread


Examples


Retrieve the structure information for Nitrate/Nitrite Response Regulator Protein NarL with PDB ID 1A04.


pdbstruct = getpdb('1A04')

Detailed Investigation of the function codes: (getpdb.m)

function pdbstruct=getpdb(pdbID,varargin)
%
if ~usejava('jvm')
    error('Bioinfo:getpdb:NeedJVM','%s requires Java.',mfilename);
end

tofile = false;
seqonly = false;
mirrorsite = 'http://www.rcsb.org/pdb';
if nargin > 1
    if rem(nargin,2) == 0
        error('Bioinfo:getpdb:IncorrectNumberOfArguments',...
            'Incorrect number of arguments to %s.',mfilename);
    end
    okargs = {'tofile','mirror','sequenceonly'};
    for j=1:2:nargin-2
        pname = varargin{j};
        pval = varargin{j+1};
        k = strmatch(lower(pname), okargs);%#ok
        if isempty(k)
            error('Bioinfo:getpdb:UnknownParameterName',...
                'Unknown parameter name: %s.',pname);
        elseif length(k)>1
            error('Bioinfo:getpdb:AmbiguousParameterName',...
                'Ambiguous parameter name: %s.',pname);
        else
            switch(k)
                case 1 % tofile
                    if ischar(pval)
                        tofile = true;
                        filename = pval;
                    end
                case 2 % mirrorsite
                    if ischar(pval)
                        mirrorsite = pval;
                        if isempty(strfind(mirrorsite,'/pdb'))
                            error('Bioinfo:getpdb:BadMirrorSite',...
                                'MIRROR string does not appear to be a PDB mirror site.');
                        end
                    end
                case 3 % sequenceonly
                    seqonly = opttf(pval);
                    if isempty(seqonly)
                        error('Bioinfo:getpdb:InputOptionNotLogical','%s must be a logical value, true or false.',...
                            upper(char(okargs(k))));
                    end
            end
        end
    end
end
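
The name/value matching in the loop above can be tried in isolation. The following is a minimal sketch, not the toolbox code: it assumes prefix matching with strncmpi in place of strmatch, and the option list, the sample arguments, the opts structure, and the 'demo:' error identifiers are all hypothetical names chosen for illustration.

```matlab
% Sketch of name/value option matching (assumption: case-insensitive
% prefix matching via strncmpi stands in for strmatch).
okargs = {'tofile','mirrorsite','sequenceonly'};   % assumed option names
args   = {'ToFile','1a04.txt','Seq',true};         % hypothetical user input

% Defaults, stored in a structure keyed by option name
opts = struct('tofile','', 'mirrorsite','http://www.rcsb.org/pdb', ...
              'sequenceonly',false);

for j = 1:2:numel(args)
    pname = args{j};
    pval  = args{j+1};
    % Indices of option names that begin with pname, ignoring case
    k = find(strncmpi(pname, okargs, numel(pname)));
    if isempty(k)
        error('demo:UnknownParameterName', 'Unknown parameter name: %s.', pname);
    elseif numel(k) > 1
        error('demo:AmbiguousParameterName', 'Ambiguous parameter name: %s.', pname);
    end
    opts.(okargs{k}) = pval;   % dynamic field name selects the option
end
disp(opts)
```

With these sample arguments, the abbreviation 'Seq' resolves to sequenceonly because it is a prefix of exactly one option name.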

Detailed Investigation of the function codes: (getpdb.m)

% error if ID isn't a string
if ~ischar(pdbID)
    error('Bioinfo:getpdb:NotString','Access Number is not a string.')
end

% get sequence from pdb.fasta if SEQUENCEONLY is true, otherwise full pdb
if seqonly == true
    searchurl = [mirrorsite '/cgi/getSequence.cgi/' pdbID '.fasta?chId=' pdbID ...
        '&format=fasta'];
    [header, pdb] = fastaread(searchurl);%#ok
else
    searchurl = [mirrorsite '/cgi/explore.cgi?job=download&pdbId=' pdbID ...
        '&opt=show&format=PDB&pre=1'];

    % get the html file that is returned as a string
    s=urlread(searchurl);

    % replace the html version of &
    s=strrep(s,'&amp;','&');

    % Find first line of the actual data
    start = regexp(s,'\nHEADER');

fastaread: reads a FASTA format file

urlread: returns the contents of a URL as a string

strrep: replaces a string with another

regexp: matches a regular expression
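
To see what the strrep and regexp calls do without contacting the PDB server, they can be applied to a small in-memory string. This is only a sketch: the page text below is made up and stands in for whatever urlread would return, and the variable name record is an assumption.

```matlab
% Hypothetical page text standing in for the string returned by urlread
s = sprintf('<pre>\nHEADER    DEMO &amp; TEST\nATOM      1\nEND   \n</pre>');

% Undo the HTML escaping of the ampersand
s = strrep(s, '&amp;', '&');

% Index of the newline that precedes the HEADER record
start = regexp(s, '\nHEADER', 'once');

% Index of the newline that terminates the END record
[~, endOfFile] = regexp(s, '\nEND.*?\n', 'once');

% Keep only the PDB records, from HEADER through the END line
record = s(start+1:endOfFile);
disp(record)
```

The start+1 offset skips the newline matched in front of HEADER, so the extracted text begins with the record name itself.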

Detailed Investigation of the function codes: (e.g., getpdb.m)

    if isempty(start)
        % search for text indicating that there weren't any files found
        notfound=regexp(s,'Your query found .*NO.* structures');

        % string was found, meaning no results were found
        if ~isempty(notfound),
            error('Bioinfo:getpdb:PDBIDNotFound',...
                'The ID you were searching for, %s, was not found in the PDB database.',pdbID) ;
        end
        error('Bioinfo:getpdb:PDBIDAccessProblem',...
            'Unknown problem accessing entry %s in the PDB database.',pdbID);
    end

    [dummy, endOfFile] = regexp(s,'\nEND.*?\n');%#ok

    % shorten string, to search for uid info
    s=s(start+1:endOfFile);

    %make each line a separate row in string array
    pdbdata = char(strread(s,'%s','delimiter','\n','whitespace',''));

    %pass to PDBREAD to create structure
    pdb=pdbread(pdbdata);

end

char: creates a character array (string)

strread: reads formatted data from a string

pdbread: reads a Protein Data Bank file into a structure
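
The step that turns the downloaded text into one row per line can be illustrated on a short string. The sketch below is an alternative, not the toolbox code: it splits with regexp(..., 'split') instead of strread, which newer MATLAB releases discourage, and the sample text is made up.

```matlab
% Hypothetical three-line text standing in for the downloaded PDB data
s = sprintf('HEADER    DEMO\nATOM      1\nEND');

% Split on newlines into a cell array of lines
% (assumes the text has no trailing newline)
lines = regexp(s, '\n', 'split');

% Convert to a character array; shorter rows are padded with spaces
pdbdata = char(lines);

disp(size(pdbdata))
disp(pdbdata)
```

Because char pads every row to the length of the longest line, each row of pdbdata has the same width, which is what allows pdbread to accept it as a character array.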

Detailed Investigation of the function codes: (e.g., getpdb.m)

if nargout
    pdbstruct = pdb;
    if ~seqonly
        % add URL
        pdbstruct.SearchURL = searchurl;
    end
else
    if seqonly || ~usejava('desktop')
        disp(pdb);
    else
        disp(pdb);
        disp([char(9) 'SearchURL: <a href="' searchurl '"> ' pdbID ' </a>']);
    end
end

% write out file
if tofile == true
    writefile = 'Yes';
    % check to see if file already exists
    if exist(filename,'file')
        % use dialog box to display options
        writefile=questdlg(sprintf('The file %s already exists. Do you want to overwrite it?',filename), ...
            '', ...
            'Yes','No','Yes');
    end

    switch writefile,
        case 'Yes',
            if exist(filename,'file')
                disp(['File ' filename ' overwritten.']);
            end
            savedata(filename,pdbdata);
        case 'No',
            disp(['File ' filename ' not written.']);
    end
end

questdlg(Question): creates a modal dialog box that automatically wraps the cell array or string (vector or matrix) Question to fit an appropriately sized window

Detailed Investigation of the function codes: (e.g., getpdb.m)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function savedata(filename,pdbtext)

fid=fopen(filename,'wb');

rows = size(pdbtext,1);

for rcount=1:rows-1,
    fprintf(fid,'%s\n',pdbtext(rcount,:));
end
fprintf(fid,'%s',pdbtext(rows,:));
fclose(fid);

fopen: opens a file (here with 'wb', for binary write access)

fprintf: writes formatted data to a file

fclose: closes a file opened with fopen
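
The write loop in savedata can be checked with a round trip through a temporary file. This is a sketch under stated assumptions: the three-row character array is made up, the temporary file name comes from tempname, and the check on fid plus the 'demo:CannotOpen' identifier are additions that the function above does not contain.

```matlab
% Hypothetical 3-row character array; char pads rows to equal width
pdbtext  = char('HEADER    DEMO', 'ATOM      1', 'END');
filename = [tempname '.txt'];        % temporary file for the demo

fid = fopen(filename, 'wb');
if fid == -1                         % added check, not in the original code
    error('demo:CannotOpen', 'Could not open %s for writing.', filename);
end

rows = size(pdbtext, 1);
for r = 1:rows-1
    fprintf(fid, '%s\n', pdbtext(r,:));   % every row but the last ends in a newline
end
fprintf(fid, '%s', pdbtext(rows,:));      % last row is written without a newline
fclose(fid);

txt = fileread(filename);            % read the file back as one string
delete(filename);                    % clean up the temporary file
fprintf('%d characters, %d newlines\n', numel(txt), sum(txt == sprintf('\n')));
```

Writing the last row separately is what keeps the file from ending in an extra blank line.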

Detailed Investigation of the functions related to getpdb.m code:

pdbread reads data from a PDB formatted file into MATLAB.

Syntax

PDBData = pdbread('File')

Arguments

File: Protein Data Bank (PDB) formatted file (ASCII text file). Enter a filename, a path and filename, or a URL pointing to a file. File can also be a MATLAB character array that contains the text for a PDB file.

The data stored in each record of the PDB file is converted, where appropriate, to a MATLAB structure. For example, the ATOM records in a PDB file are converted to an array of structures with the following fields: AtomSerNo, AtomName, altLoc, resName, chainID, resSeq, iCode, X, Y, Z, occupancy, tempFactor, segID, element, and charge.

The sequence information from the PDB file is stored in the Sequence field of PDBData. The sequence information is itself a structure with the fields NumOfResidues, ChainID, ResidueNames, and Sequence. The field ResidueNames contains the three-letter codes for the sequence residues. The field Sequence contains the single-letter codes for the sequence. If the sequence has modified residues, then the ResidueNames might not correspond to the standard three-letter amino acid codes, in which case the field Sequence will contain a ? in the position corresponding to the modified residue.
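
The relationship between ResidueNames and Sequence described above can be mimicked with a small lookup table. This sketch only illustrates the documented behavior and is not how pdbread is implemented: the table is deliberately partial, and the residue list, including the modified residue MSE, is a made-up example.

```matlab
% Partial three-letter to one-letter table (a full table would list
% all 20 standard amino acids)
codes = containers.Map({'ALA','GLY','SER','MET','LYS'}, ...
                       {'A','G','S','M','K'});

% Hypothetical residue names; MSE (selenomethionine) is a modified residue
residueNames = {'MET','ALA','MSE','GLY'};

% Start with '?' everywhere, then fill in the residues the table knows
sequence = repmat('?', 1, numel(residueNames));
for i = 1:numel(residueNames)
    if isKey(codes, residueNames{i})
        sequence(i) = codes(residueNames{i});
    end
end
disp(sequence)
```

Any name missing from the table keeps its '?' placeholder, which matches the description of how modified residues appear in the Sequence field.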

Detailed Investigation of the function codes: (pdbread)

Examples:

Get information for Nitrate/Nitrite Response Regulator Protein NarL with PDB ID 1A04 from the Protein Data Bank, and store the information in the file 1a04.txt:

getpdb('1A04', 'ToFile', '1a04.txt')

See the content of the file 1a04.txt (5310-line text).

Now read the file back into MATLAB:

pdbdata = pdbread('1a04.txt')

Let's try this with PDB ID 1a14.

Now we will see a 5680-line text.

Detailed Investigation of the functions related to getpdb.m code:

fastaread reads data from a FASTA formatted file into a MATLAB structure with Header and Sequence fields.

Syntax

FASTAData = fastaread('File')
[Header, Sequence] = fastaread('File')
fastaread(..., 'PropertyName', PropertyValue,...)
fastaread(..., 'IgnoreGaps', IgnoreGapsValue)

Arguments

File: FASTA formatted file (ASCII text file). Enter a filename, a path and filename, or a URL pointing to a file. File can also be a MATLAB character array that contains the text for a filename.

IgnoreGapsValue: Property to control removing gap symbols.

FASTAData: MATLAB structure with the fields Header and Sequence.

Example


Reading the human mitochondrion genome in FASTA format



entrezSite = 'http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?'
textOptions = '&txt=on &view=fasta'


genbankID = '&list_uids=NC_001807'


mitochondrion = fastaread([entrezSite textOptions genbankID])