Data Mining in Ensembl with BioMart


Nov 20, 2013


Simple Text-based Search Engine


‘Mouse Gene’ Gives Us Results

A More Complex Query is Not as Useful

BioMart - Data Mining

BioMart is a search engine that can find multiple terms and put them into a table format.

Such as: human gene (IDs), chromosome and base pair position

No programming required!


General or Specific Data Tables

All the genes for one species

Or… only genes on one specific region of a chromosome

Or… genes on one region of a chromosome associated with a disease



BioMart Data Sets


Ensembl genes


Vega genes


SNPs



Markers


Phenotypes


Gene expression information


Gene ontology


Homology predictions


Protein annotation

Web Interface

With BioMart, quickly extract gene-associated information from the Ensembl databases.

Information Flow

Choose the species of interest (Dataset)

Decide what you would like to know about the genes (Attributes)
(sequences, IDs, description…)

Decide on a smaller gene set using Filters.
(enter IDs, choose a region…)
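
The three choices above (Dataset, Attributes, Filters) can be pictured as one small record. The sketch below is only an illustration of that idea: the class name `MartQuery`, its fields and the `summary` method are hypothetical and are not part of BioMart, which needs no programming at all.

```python
from dataclasses import dataclass, field


@dataclass
class MartQuery:
    """Hypothetical container for the three choices made in BioMart."""
    dataset: str                                     # species of interest
    attributes: list = field(default_factory=list)   # what we want to know
    filters: dict = field(default_factory=dict)      # what we already know

    def summary(self) -> str:
        """Return a one-line, human-readable description of the query."""
        wanted = ", ".join(self.attributes) or "(no attributes chosen)"
        if self.filters:
            limits = " AND ".join(f"{k} = {v}" for k, v in self.filters.items())
        else:
            limits = "all genes"
        return f"{self.dataset}: show {wanted} for {limits}"


query = MartQuery(
    dataset="Homo sapiens genes",
    attributes=["gene ID", "description"],
    filters={"chromosome": "10"},
)
print(query.summary())
```

With no filters set, the summary reads "all genes", which mirrors the idea that Filters only narrow an otherwise complete gene set.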

Web Interface

Three main stages: Dataset, Attributes and Filters.

Choose the species of interest

Choose what information to view.

Choose the gene set using what we know.

The First Step: Choose the Dataset

Homo sapiens genes are the default.

The Second Step: Attributes

Attributes are what we want to know about the genes.

Four output pages.

The SNP Attribute Page

Output variation information such as SNP reference ID and alleles.

Filters Allow Gene Selection

Choose the gene set by region, gene ID(s), protein/domain type.

Export Sequence or Tables

Genes and attributes are exported as sequence (FASTA format) or tables.
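
To make the two export styles concrete, here is a minimal sketch that writes the same records once as a tab-separated table and once as FASTA. The records, the gene IDs (`GENE_A`, `GENE_B`) and the sequences are invented placeholders, and the helper names `to_tsv` and `to_fasta` are hypothetical; BioMart produces these files itself.

```python
import csv
import io


def to_tsv(rows, columns):
    """Return the chosen columns of each row as a tab-separated table."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)                      # header line
    for row in rows:
        writer.writerow([row[c] for c in columns])
    return buf.getvalue()


def to_fasta(rows, id_key, seq_key, width=60):
    """Return the rows as FASTA: a '>' header line, then wrapped sequence."""
    lines = []
    for row in rows:
        lines.append(f">{row[id_key]}")
        seq = row[seq_key]
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Placeholder records: IDs and sequences are made up for illustration.
records = [
    {"gene_id": "GENE_A", "chromosome": "10", "sequence": "ATGGCCTTAGGA"},
    {"gene_id": "GENE_B", "chromosome": "10", "sequence": "ATGCCCGGGTAA"},
]

print(to_tsv(records, ["gene_id", "chromosome"]))
print(to_fasta(records, "gene_id", "sequence", width=6))
```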

Query:

For all mouse genes on chromosome 10 that are protein coding, I would like to know the IDs in both Ensembl and MGI.

In the query:

Attributes: what we want to know.

Filters: what we know

Query:

For all mouse genes on chromosome 10 that are protein coding, I would like to know the IDs in both Ensembl and MGI.

In the query:

Attributes: what we want to know (here, the IDs in both Ensembl and MGI).

Filters: what we know


Query:

For all mouse genes on chromosome 10 that are protein coding, I would like to know the IDs in both Ensembl and MGI.

In the query:

Attributes: what we want to know.

Filters: what we know (here, mouse genes, chromosome 10, protein coding).
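
For readers who later want to script the same query, the sketch below writes it out as a BioMart-style XML document using only the Python standard library. Everything named here is an assumption rather than something stated in the slides: the internal dataset name `mmusculus_gene_ensembl`, the filter names `chromosome_name` and `biotype`, the attribute names `ensembl_gene_id`, `mgi_symbol` and `mgi_id`, and the `Query` element's settings. Internal names can differ between Ensembl releases, so check them in the web interface before relying on them.

```python
import xml.etree.ElementTree as ET


def build_query_xml(dataset, filters, attributes):
    """Build a BioMart-style XML query string (all names are assumptions)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",      # ask for a tab-separated table
        "header": "1",           # include column names
        "uniqueRows": "1",       # drop duplicate rows
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filters: what we know.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # Attributes: what we want to know.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


xml_query = build_query_xml(
    dataset="mmusculus_gene_ensembl",                                # assumed name
    filters={"chromosome_name": "10", "biotype": "protein_coding"},  # assumed names
    attributes=["ensembl_gene_id", "mgi_symbol", "mgi_id"],          # assumed names
)
print(xml_query)
```

The code only builds the text of the query; it does not contact any server.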


A Brief Example

Change dataset to mouse (Mus musculus)


A Brief Example

Dataset has changed.

Attributes (Output Options)

Click Attributes.

Attributes allow us to choose what we wish to know.

IDs are found in the ‘Features’ page.

Click on ‘GENE’.

Attributes (Output Options)

Default options selected: Ensembl Gene ID and Transcript ID

Ensembl Gene ID is selected

Attributes (Output Options)

Scroll down to select MGI symbol.

Also select the accession number.

‘Markersymbol ID’ will give us the MGI ID

The Results Table

‘Results’ gives us gene IDs for all mouse genes in the Ensembl database.

Select a Smaller Gene Set

Select ‘Filters’

Expand the REGION panel

Instead of all mouse genes, select protein coding genes on chromosome 10.

Select Genes on Chromosome 10

Select chromosome 10

Instead of all mouse genes, select protein coding genes on chromosome 10.

Select Protein Coding Genes

Filters are set to chromosome 10 and protein-coding genes. Genes must meet BOTH criteria to be in the result table.

Gene type: protein coding
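
The "BOTH criteria" rule is a logical AND across filters. A minimal sketch of that behaviour on a toy gene list follows; the records and IDs (`GENE_A` to `GENE_D`) are invented, and `apply_filters` is a hypothetical helper, not a BioMart function.

```python
# Toy records: IDs and values are placeholders, not real Ensembl data.
genes = [
    {"gene_id": "GENE_A", "chromosome": "10", "gene_type": "protein_coding"},
    {"gene_id": "GENE_B", "chromosome": "10", "gene_type": "pseudogene"},
    {"gene_id": "GENE_C", "chromosome": "2", "gene_type": "protein_coding"},
    {"gene_id": "GENE_D", "chromosome": "10", "gene_type": "protein_coding"},
]


def apply_filters(records, filters):
    """Keep only records that satisfy every filter (logical AND)."""
    return [
        rec for rec in records
        if all(rec.get(key) == value for key, value in filters.items())
    ]


selected = apply_filters(
    genes, {"chromosome": "10", "gene_type": "protein_coding"}
)
print([g["gene_id"] for g in selected])  # ['GENE_A', 'GENE_D']
```

GENE_B is on chromosome 10 but is not protein coding, and GENE_C is protein coding but on another chromosome, so each fails one of the two filters and is dropped.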

Results (Preview)

This is a preview. If you are happy with the table, click ‘Go’.

For the full result table: Go

Full Result Table

Ensembl Gene ID

Transcript ID

MGI symbol

MGI Accession Number
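
Once the table is saved, it can be read back with a few lines of code. The sketch below assumes the export is tab-separated with a header row that uses the four column names above; the rows themselves are invented placeholders, and `read_result_table` and `transcripts_per_gene` are hypothetical helper names.

```python
import csv
import io
from collections import Counter

# Placeholder rows only: these IDs are invented for illustration.
result_text = (
    "Ensembl Gene ID\tTranscript ID\tMGI symbol\tMGI Accession Number\n"
    "GENE_A\tTRANSCRIPT_A1\tSymA\tACC_A\n"
    "GENE_A\tTRANSCRIPT_A2\tSymA\tACC_A\n"
    "GENE_B\tTRANSCRIPT_B1\tSymB\tACC_B\n"
)


def read_result_table(text):
    """Parse a tab-separated result table into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def transcripts_per_gene(rows):
    """Count how many rows (transcripts) each gene ID appears in."""
    return Counter(row["Ensembl Gene ID"] for row in rows)


rows = read_result_table(result_text)
counts = transcripts_per_gene(rows)
print(counts["GENE_A"], counts["GENE_B"])
```

Because Transcript ID is one of the columns, a gene with several transcripts appears on several rows, which is why the example counts rows per gene.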

Original Query:

For all mouse genes on chromosome 10 that are protein coding, I would like to know the IDs in both Ensembl and MGI.

In the query:

Attributes: columns in the Result Table

Filters: what we know


Other Export Options (Attributes)

Sequences: UTRs, flanking sequences, cDNA and peptides, etc.

Gene IDs from Ensembl and external sources (MGI, Entrez, etc.)

Microarray data

Protein functions/descriptions (InterPro, GO)

Orthologous gene sets

SNP/Variation data


Central Server

www.biomart.org


WormBase

HapMap

Population frequencies

Inter-population comparisons

Gene annotation


DictyBase



Uniprot, MSD


GRAMENE


Rice, Maize, Arabidopsis genomes…

How to Get There

Either www.biomart.org/biomart/martview

Or click on ‘BioMart’ from Ensembl

Thanks

Arek Kasprzyk

Benoît Ballester

Syed Haider

Richard Holland

Damian Smedley