Data Mining in Ensembl with BioMart

Data Mining with BioMart
www.ensembl.org/biomart/martview
www.biomart.org/biomart/martview
1 / 30
What is BioMart?
• A data export tool
• A quick table generator
• A web interface to mine Ensembl data
2 / 30
BioMart: Data Mining
• BioMart is a search engine that can find multiple terms and put them into a table format.
• For example: mouse gene IDs, chromosome, and base-pair position
• No programming required!
3 / 30
General or Specific Data-Tables
• All the genes for one species
• Or… only the genes in one specific region of a chromosome
• Or… let BioMart select the genes (e.g. all transcripts that match a microarray probe set, GO term, or InterPro domain).
4 / 30
Results
Tables or sequences
5 / 30
The First Step: Choose the Dataset
Dataset: Current Ensembl, Human genes
6 / 30
The Second Step: Filters
Filters: Define a gene set
7 / 30
Attributes attach information
Attributes: Determine output columns
8 / 30
Query
For the human CFTR gene, export
the Entrez Gene ID(s) and matching
Affy HG U133-PLUS-2 probeset(s)
9 / 30
Query:
For the human CFTR gene, export
the Entrez Gene ID(s) and matching
Affy HG U133-PLUS-2 probeset(s)
• In the query:
Filters: what we know
Attributes: what we want to know.
10 / 30
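The filter/attribute split above maps directly onto the XML query that BioMart can display for a search built in the web interface. The following is a minimal sketch of that structure, not the tool's own code. The dataset name `hsapiens_gene_ensembl`, the filter name `hgnc_symbol`, and the attribute names are assumptions that vary between Ensembl releases, and `build_biomart_query` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart-style XML query string (hypothetical helper).

    filters:    dict of filter name -> value   (what we know)
    attributes: list of attribute names        (what we want to know)
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated output
        "header": "1",        # include column headers
        "uniqueRows": "1",    # drop duplicate rows
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# The CFTR query from the slides. All internal names below are assumed;
# check them against the attribute list of the release you are using.
cftr_query = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"hgnc_symbol": "CFTR"},
    attributes=[
        "ensembl_gene_id",
        "ensembl_transcript_id",
        "entrezgene_id",
        "affy_hg_u133_plus_2",
    ],
)
print(cftr_query)
```

Changing the question only means changing the two arguments: a different filter dictionary selects a different gene set, and a different attribute list changes the output columns.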
A Brief Example
Use the current Ensembl (archives are also available).
Select Homo sapiens genes.
13 / 30
Select the Genes with Filters
Click ‘Filters’.
Expand the ‘GENE’ panel to enter the gene ID(s).
14 / 30
Filters (and Count)
Change this to ‘HGNC curated name’. Enter “CFTR” in the box.
Click “Count” to see how many genes pass your filters.
15 / 30
Attributes (Output Options)
Click on ‘Attributes’
‘Attributes’ allows you to output information.
16 / 30
Attributes (Output Options)
Select ‘EntrezGene ID’
17 / 30
Attributes (Output Options)
Select the Affy platform ‘HG U133-PLUS-2’ in the ‘Microarray’ section
18 / 30
The Results Table - Preview
For the full result table, click “Go” or view “ALL” rows.
19 / 30
Full Result Table
Columns: Ensembl Gene ID for CFTR, Ensembl Transcript IDs, EntrezGene ID, Affy HG probeset
20 / 30
Other Export Options (Attributes)
• Sequences: UTRs, flanking sequences, cDNA and peptides, etc.
• Gene IDs from Ensembl and external sources (MGI, Entrez, etc.)
• Microarray data
• Protein functions/descriptions (InterPro, GO)
• Orthologous gene sets
• SNP/variation data
21 / 30
BioMart around the world…
BioMart started at Ensembl…
To where has it travelled?
22 / 30
Central Portal
www.biomart.org
23 / 30
WormBase
24 / 30
HapMap
Population frequencies
Interpopulation comparisons
Gene annotation
25 / 30
DictyBase
26 / 30
GRAMENE
www.gramene.org
27 / 30
The Potato Center
28 / 30
How to Get There
http://www.biomart.org/biomart/martview
http://www.ensembl.org/biomart/martview
• Or click on ‘BioMart’ from Ensembl
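For readers who later want to script a query rather than click through martview, here is a small sketch of how a query URL could be assembled. It assumes a `martservice` endpoint sits alongside `martview` on the same host and accepts the XML in a `query` parameter. Both the endpoint path and the `martservice_url` helper are assumptions for illustration, and nothing is sent over the network here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint: the slides only give the martview address.
MARTSERVICE = "http://www.ensembl.org/biomart/martservice"


def martservice_url(query_xml, base=MARTSERVICE):
    """Return a GET URL carrying the XML query, percent-encoded (hypothetical helper)."""
    return base + "?" + urlencode({"query": query_xml})


# A deliberately tiny query document; the dataset name is an assumption.
example_xml = (
    '<Query virtualSchemaName="default" formatter="TSV">'
    '<Dataset name="hsapiens_gene_ensembl"/>'
    "</Query>"
)
url = martservice_url(example_xml)
print(url)

# Decoding the URL again recovers the original XML unchanged.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The resulting URL could be opened in a browser or fetched with any HTTP client. The web interface remains the route the slides describe.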
29 / 30
Worked Example
• Follow the worked example on pg 26
• Then, do the exercises on pg 34 (answers on pg 37)
This module should do the following:
• Show you how to export multiple data types from Ensembl for gene IDs or chromosomal regions.
30 / 30
Ensembl Core Databases
Relational Database
• Normalised
• Each data point stored only once
Therefore:
• Quick updates
• Minimal storage requirements
But:
• Many tables
• Many joins for complicated queries
• Slow for data mining applications
31 / 30
Normalised Schema
gene_id   stable_id
9970      ENSG00000170365
1712      ENSG00000175387
8240      ENSG00000166949
1967      ENSG00000141646
…         …

gene_id   gene.symbol
9970      SMAD1
1712      SMAD2
8240      SMAD3
1967      SMAD4
…         …

gene_id   transcript
9970      ENST00000302085
1712      ENST00000262160
1712      ENST00000356825
8240      ENST00000327367
1967      ENST00000342988
…         …
32 / 30
BioMart Database
Data warehouse
• De-normalised
• Query-optimised
Therefore:
• Fast and flexible
• Ideal for data mining
But:
• Tables with apparent “redundancy”
• Needs rebuilding from scratch for every release from normalised core databases
33 / 30
De-Normalised Schema
gene_id           transcript_id     gene.symbol
ENSG00000170365   ENST00000302085   SMAD1
ENSG00000175387   ENST00000262160   SMAD2
ENSG00000175387   ENST00000356825   SMAD2
ENSG00000166949   ENST00000327367   SMAD3
ENSG00000141646   ENST00000342988   SMAD4
…                 …                 …
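The trade-off between the two schemas can be tried out with an in-memory SQLite database. This is only an illustrative sketch using the SMAD rows from the slides. The table names (`gene`, `gene_symbol`, `transcript`, `mart`) are hypothetical and are not the real Ensembl or BioMart table names.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Normalised layout: three narrow tables linked by an internal gene_id.
cur.executescript("""
CREATE TABLE gene        (gene_id INTEGER PRIMARY KEY, stable_id TEXT);
CREATE TABLE gene_symbol (gene_id INTEGER, symbol TEXT);
CREATE TABLE transcript  (gene_id INTEGER, transcript TEXT);
""")
cur.executemany("INSERT INTO gene VALUES (?, ?)", [
    (9970, "ENSG00000170365"), (1712, "ENSG00000175387"),
    (8240, "ENSG00000166949"), (1967, "ENSG00000141646"),
])
cur.executemany("INSERT INTO gene_symbol VALUES (?, ?)", [
    (9970, "SMAD1"), (1712, "SMAD2"), (8240, "SMAD3"), (1967, "SMAD4"),
])
cur.executemany("INSERT INTO transcript VALUES (?, ?)", [
    (9970, "ENST00000302085"), (1712, "ENST00000262160"),
    (1712, "ENST00000356825"), (8240, "ENST00000327367"),
    (1967, "ENST00000342988"),
])

JOINED = """
SELECT g.stable_id, t.transcript, s.symbol
FROM gene g
JOIN transcript  t ON t.gene_id = g.gene_id
JOIN gene_symbol s ON s.gene_id = g.gene_id
"""

# Normalised query: two joins are needed to answer one simple question.
normalised = cur.execute(
    JOINED + " WHERE s.symbol = ? ORDER BY t.transcript", ("SMAD2",)
).fetchall()

# De-normalised layout: one wide table, rebuilt from the normalised tables.
cur.execute("CREATE TABLE mart (gene_id TEXT, transcript_id TEXT, symbol TEXT)")
cur.execute("INSERT INTO mart " + JOINED)

# De-normalised query: a single-table lookup, no joins.
denormalised = cur.execute(
    "SELECT gene_id, transcript_id, symbol FROM mart "
    "WHERE symbol = ? ORDER BY transcript_id", ("SMAD2",)
).fetchall()

print(normalised)
print(denormalised)
```

Both queries return the same two SMAD2 rows. The wide table stores the gene ID and symbol once per transcript, which is the apparent “redundancy” mentioned above, and in exchange the lookup needs no joins.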
34 / 30
Information Flow
[Diagram: DATASET (SPECIES, FOCUS) → FILTER → ATTRIBUTES]
FILTER: REGION, GENE, EXPRESSION, HOMOLOGY, PROTEIN, SNP
ATTRIBUTES: REGION, GENE, EXPRESSION, HOMOLOGY, PROTEIN, SNP
SWISSPROT, EMBL, REFSEQ, GO, INTERPRO, AFFYMETRIX
FASTA, GTF, HTML, TEXT, EXCEL, FILE
35 / 30
