Data Mining in Ensembl with BioMart

Simple Text-Based Search Engine

‘Mouse Gene’ Gives Us Results

A More Complex Query Is Not as Useful

BioMart: Data Mining
• BioMart is a search engine that can find multiple terms and return them in a table format.
• For example: human gene IDs, chromosome, and base-pair position.
• No programming required!

General or Specific Data Tables
• All the genes for one species
• Or… only genes in one specific region of a chromosome
• Or… genes in one region of a chromosome that are associated with a disease

BioMart Data Sets
• Ensembl genes
• Vega genes
• SNPs
• Markers
• Phenotypes
• Gene expression information
• Gene ontology
• Homology predictions
• Protein annotation

Web Interface
With BioMart, quickly extract gene-associated information from the Ensembl databases.

Information Flow
• Choose the species of interest (Dataset).
• Decide what you would like to know about the genes (Attributes): sequences, IDs, descriptions…
• Narrow down to a smaller gene set using Filters: enter IDs, choose a region…

Web Interface
Three main stages: Dataset, Attributes and Filters.
• Dataset: choose the species of interest.
• Attributes: choose what information to view.
• Filters: choose the gene set using what we know.

The First Step: Choose the Dataset
Homo sapiens genes are the default.

The Second Step: Attributes
Attributes are what we want to know about the genes.
Attributes are organized into four output pages.

The SNP Attribute Page
Output variation information such as SNP reference ID and alleles.

Filters Allow Gene Selection
Choose the gene set by region, gene ID(s), or protein/domain type.

Export Sequence or Tables
Genes and attributes are exported as sequences (FASTA format) or tables.
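To make the sequence export shape concrete, here is a minimal Python sketch that lays out records in FASTA format. The function name to_fasta, the record identifiers, and the sequences are hypothetical placeholders chosen for illustration; they are not real BioMart output.

```python
def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text.

    Each record gets a '>' header line followed by the sequence,
    wrapped at `width` characters per line.
    """
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical placeholder records (not real Ensembl data).
example_records = [
    ("GENE_A", "ACGT" * 20),  # 80 bases, so the sequence wraps onto two lines
    ("GENE_B", "TTGACA"),
]
fasta_text = to_fasta(example_records)
print(fasta_text)
```

A table export, by contrast, would hold one row per gene or transcript and one column per selected attribute.
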
Query:
• For all mouse genes on chromosome 10 that are protein coding, I would like to know the IDs in both Ensembl and MGI.
• In the query:
Attributes: what we want to know.
Filters: what we know.
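The web interface needs no programming, but the same dataset/attributes/filters split can also be written down as a BioMart-style XML query. The sketch below is illustrative only: the helper name build_query is hypothetical, and the dataset, attribute and filter names (mmusculus_gene_ensembl, ensembl_gene_id, ensembl_transcript_id, mgi_symbol, mgi_id, chromosome_name, biotype) are assumptions based on recent Ensembl marts and may differ between releases.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, attributes, filters):
    """Build a BioMart-style XML query string (illustrative sketch)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # ask for a tab-separated table
        "header": "1",        # include column names in the output
        "uniqueRows": "1",    # drop duplicate rows
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filters: what we know (they restrict the gene set).
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    # Attributes: what we want to know (they become table columns).
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed internal names; check them against the mart you are using.
xml_query = build_query(
    dataset="mmusculus_gene_ensembl",
    attributes=["ensembl_gene_id", "ensembl_transcript_id", "mgi_symbol", "mgi_id"],
    filters={"chromosome_name": "10", "biotype": "protein_coding"},
)
print(xml_query)
```

Keeping filters and attributes as separate arguments mirrors the slide: filters narrow the rows, attributes define the columns.
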
A Brief Example
Change the dataset to mouse (Mus musculus).
The dataset has changed.

Attributes (Output Options)
Attributes allow us to choose what we wish to know.
• Click ‘Attributes’, then click on ‘GENE’.
• IDs are found on the ‘Features’ page.
• Default options selected: Ensembl Gene ID and Transcript ID.
• Scroll down to select the MGI symbol; ‘Markersymbol ID’ will give us the MGI ID.
• Also select the accession number.

The Results Table
‘Results’ gives us gene IDs for all mouse genes in the Ensembl database.

Select a Smaller Gene Set
Instead of all mouse genes, select protein-coding genes on chromosome 10.
• Select ‘Filters’.
• Expand the REGION panel.

Select Genes on Chromosome 10
• Select chromosome 10.

Select Protein-Coding Genes
Gene type: protein coding
Filters are set to chromosome 10 and protein-coding genes. Genes must meet BOTH criteria to appear in the result table.
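The "BOTH criteria" rule is a logical AND across filters. A minimal Python sketch of that behaviour on a toy gene list follows; the records, field names and the helper apply_filters are hypothetical and only illustrate the logic, not how BioMart is implemented.

```python
# Toy gene records with made-up identifiers (not real Ensembl data).
genes = [
    {"gene_id": "GENE_A", "chromosome": "10", "biotype": "protein_coding"},
    {"gene_id": "GENE_B", "chromosome": "10", "biotype": "pseudogene"},
    {"gene_id": "GENE_C", "chromosome": "2", "biotype": "protein_coding"},
    {"gene_id": "GENE_D", "chromosome": "10", "biotype": "protein_coding"},
]

# Two filters, as on the slide: chromosome 10 AND protein coding.
filters = {"chromosome": "10", "biotype": "protein_coding"}


def apply_filters(records, filters):
    """Keep only records that satisfy every filter (logical AND)."""
    return [
        record for record in records
        if all(record.get(field) == value for field, value in filters.items())
    ]


selected = apply_filters(genes, filters)
print([gene["gene_id"] for gene in selected])  # ['GENE_A', 'GENE_D']
```

GENE_B is on chromosome 10 but is not protein coding, and GENE_C is protein coding but on another chromosome, so both are excluded.
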
Results (Preview)
This is a preview. If you are happy with the table, click ‘Go’ for the full result table.

Full Result Table
Columns: Ensembl Gene ID, Transcript ID, MGI symbol, MGI Accession Number
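Once a result table with these four columns is exported as tab-separated text, it can be read with a few lines of Python. This is a sketch under stated assumptions: the column headers follow the slide, while the rows are made-up placeholders rather than real identifiers.

```python
import csv
import io

# Placeholder table in tab-separated form; the values are invented for illustration.
tsv_text = (
    "Ensembl Gene ID\tTranscript ID\tMGI symbol\tMGI Accession Number\n"
    "GENE_A\tTRANSCRIPT_A1\tSymA\tMGI:0000001\n"
    "GENE_A\tTRANSCRIPT_A2\tSymA\tMGI:0000001\n"
    "GENE_B\tTRANSCRIPT_B1\tSymB\tMGI:0000002\n"
)

# Each row becomes a dict keyed by the column header.
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# A gene with several transcripts appears on several rows, so group by gene.
transcripts_per_gene = {}
for row in rows:
    transcripts_per_gene.setdefault(row["Ensembl Gene ID"], []).append(row["Transcript ID"])

for gene_id, transcripts in transcripts_per_gene.items():
    print(gene_id, transcripts)
```

Because Transcript ID is one of the selected attributes, the table has one row per transcript, which is why the same gene ID can repeat.
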
Original Query:
• For all mouse genes on chromosome 10 that are protein coding, I would like to know the IDs in both Ensembl and MGI.
• In the query:
Attributes: columns in the Result Table.
Filters: what we know.

Other Export Options (Attributes)
• Sequences: UTRs, flanking sequences, cDNA, peptides, etc.
• Gene IDs from Ensembl and external sources (MGI, Entrez, etc.)
• Microarray data
• Protein functions/descriptions (InterPro, GO)
• Orthologous gene sets
• SNP/variation data

Central Server
www.biomart.org

WormBase

HapMap
• Population frequencies
• Interpopulation comparisons
• Gene annotation

DictyBase

UniProt, MSD

GRAMENE
Rice, maize, Arabidopsis genomes…

How to Get There
• Either go to www.biomart.org/biomart/martview
• Or click on ‘BioMart’ from Ensembl

Thanks
Arek Kasprzyk
Benoît Ballester
Syed Haider
Richard Holland
Damian Smedley