Ensembl Compara Perl API
Stephen Fitzgerald
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/
EBI - Wellcome Trust Genome Campus, UK
What is Ensembl Compara?
A single database which contains precalculated
comparative genomics data
Access via the Perl API and MySQL
A production system for generating that database
(not in this presentation)
Compara data
Raw genomic sequence
Whole genome alignments
(tBLAT, BlastZ-net, PECAN)
Syntenic regions (based on BlastZ-net)
Protein Sequences
Raw Protein Alignments
Protein Family clusters
Protein trees
Gene orthology / paralogy predictions
46 species in Ensembl release-52
Compara database & the Ensembl core databases
Compara holds minimal primary data, so full access to the data
requires re-establishing the external links to the core DBs
Example: compara_52 must be linked with the
Ensembl core_52 databases
Proper Registry configuration is critical;
load_registry_from_db is probably the best choice here
The Compara Perl API
Written in Object-Oriented Perl
Used to retrieve data from and store data into
ensembl-compara database
Generalized to extend to non-ensembl genomic data
(Uniprot)
Follows same ‘Data Object’ & ‘Object Adaptor’
DBAdaptor design as the other Ensembl APIs
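The split between data objects and object adaptors can be illustrated with a toy sketch. The classes below (Toy::GenomeDB, Toy::GenomeDBAdaptor) and the fetch_by_name method are hypothetical and are not part of the Ensembl API; an in-memory list stands in for the database. The point is only the division of labour: the data object holds values, the adaptor knows how to fetch them.

```perl
use strict;
use warnings;

# Hypothetical toy classes, NOT part of the Ensembl API.

package Toy::GenomeDB;          # "data object": holds values, knows no storage
sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name}, assembly => $args{assembly} }, $class;
}
sub name     { return $_[0]->{name} }
sub assembly { return $_[0]->{assembly} }

package Toy::GenomeDBAdaptor;   # "object adaptor": turns stored rows into objects
sub new {
    my ($class, $rows) = @_;
    return bless { rows => $rows }, $class;
}
sub fetch_all {
    my ($self) = @_;
    return [ map { Toy::GenomeDB->new(%$_) } @{ $self->{rows} } ];
}
sub fetch_by_name {             # hypothetical lookup, returns undef if not found
    my ($self, $name) = @_;
    my ($hit) = grep { $_->name eq $name } @{ $self->fetch_all };
    return $hit;
}

package main;

# An in-memory list stands in for the database tables.
my $adaptor = Toy::GenomeDBAdaptor->new([
    { name => "Homo sapiens", assembly => "NCBI36" },
    { name => "Mus musculus", assembly => "NCBIM37" },
]);

for my $genome_db (@{ $adaptor->fetch_all }) {
    print $genome_db->name, " (", $genome_db->assembly, ")\n";
}
```

In the real API the adaptor is obtained from the Registry rather than constructed by hand, and the rows come from MySQL.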
Compara object model overview
[Diagram: object classes grouped by role]
PRIMARY DATA: NCBITaxon, GenomeDB, DnaFrag, Member
ANALYSIS: MethodLinkSpeciesSet
RESULTS: GenomicAlignBlock, GenomicAlign, SyntenyRegion, DnaFragRegion,
ProteinTree, AlignedMember, Homology, Family, Attribute
Primary data
GenomeDB: relates to a particular Ensembl core DB
name(), assembly(), genebuild(), taxon()
fetch_by_name_assembly(), fetch_by_registry_name(),
fetch_by_Slice(), fetch_all()
DnaFrag: represents a “top level” SeqRegion
name(), length(), genome_db(), slice(), coord_system_name()
fetch_by_Slice(), fetch_by_GenomeDB_and_name()
Member: list all Ensembl genes + SwissProt + SPTrEMBL
source_name(), stable_id(), genome_db(), taxon(), sequence(),
get_all_peptide_Members(), get_longest_peptide_Member(),
gene_member()
fetch_by_source_stable_id()
Analysis
MethodLinkSpeciesSet provides a handle to isolate
specific data from the shared tables (homology,
genomic_align_block)
MethodLink: Each individual analysis in compara is tagged
with a unique name called a method_link_type
BLASTZ_NET, TRANSLATED_BLAT, PECAN, SYNTENY, FAMILY,
ENSEMBL_ORTHOLOGUES, ENSEMBL_PARALOGUES, PROTEIN_TREES
SpeciesSet: the sets of species as (a ref. to) an array of
GenomeDBs
fetch_by_method_link_type_GenomeDBs(),
fetch_by_method_link_type_registry_aliases()
name(), method_link_type(), species_set(), source()
Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
GenomeDB
1. Find out the versions of human and mouse genomes in the database
2. Print the name of all the GenomeDBs in the database
DnaFrag
1. Get the DnaFrag for the chromosome 1 of the macaque genome
(using a genome_db object as an argument)
2. Get the DnaFrag for the chromosome X of the mouse genome
(using a core slice object as an argument)
MethodLinkSpeciesSet
1. Find out how many analyses are stored in the database
2. Get the name of the MethodLinkSpeciesSet corresponding to the
BlastZ-net analysis for human and mouse
3. Get the names of all the species using the mlss corresponding to
the Pecan analyses
GenomeDB example code
use strict;
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(
-host=>"ensembldb.ensembl.org",
-user => "anonymous");
my $genome_db_adaptor = $reg->get_adaptor(
"Multi", "compara", "GenomeDB");
my $genome_db = $genome_db_adaptor->
fetch_by_registry_name("human");
print "Name      :", $genome_db->name, "\n";
print "Assembly  :", $genome_db->assembly, "\n";
print "GeneBuild :", $genome_db->genebuild, "\n";
GenomeDB example code
$> perl genome_db1.pl
Homo sapiens NCBI36 2006-08-Ensembl
Mus musculus NCBIM36 2006-04-Ensembl
DnaFrag example code
use strict;
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(
-host=>"ensembldb.ensembl.org",
-user => "anonymous");
my $genome_db_adaptor = $reg->get_adaptor(
"Multi", "compara", "GenomeDB");
my $genome_db = $genome_db_adaptor->
fetch_by_registry_name("human");
my $dnafrag_adaptor = $reg->get_adaptor(
"Multi", "compara", "DnaFrag");
my $dnafrag = $dnafrag_adaptor->
fetch_by_GenomeDB_and_name($genome_db, "13");
print "Name        :", $dnafrag->name, "\n";
print "Length      :", $dnafrag->length, "\n";
print "CoordSystem :", $dnafrag->coord_system_name, "\n";
DnaFrag example code
$> perl test1.pl
Name        :13
Length      :114142980
CoordSystem :chromosome
MethodLinkSpeciesSet example code
use strict;
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(
-host=>"ensembldb.ensembl.org",
-user => "anonymous");
my $mlssa = $reg->get_adaptor("Multi", "compara",
"MethodLinkSpeciesSet");
my $mlss = $mlssa->
fetch_by_method_link_type_registry_aliases(
"BLASTZ_NET", ["human", "mouse"]);
print $mlss->name, "\n";
print "type: ", $mlss->method_link_type, "\n";
my $species_set = $mlss->species_set();
foreach my $this_genome_db (@$species_set) {
print $this_genome_db->name(), "\n";
}
MethodLinkSpeciesSet example code
$> perl method_link_species_set.pl
H.sap-M.mus blastz-net (on H.sap)
Genomic Alignments
BlastZ-Net
used to compare closely related pairs of species
BlastZ-raw -> BlastZ-chain -> BlastZ-net
Translated BLAT
used to compare more distant pairs of species
Pecan
multiple global alignments
all-vs-all coding exons (wublastp) -> Mercator ->
Pecan on each syntenic block
GenomicAlignBlock
GenomicAlignBlock
represents a genomic alignment
contains 1 GenomicAlign per sequence
fetch_all_by_MethodLinkSpeciesSet_Slice($mlss,$slice)
Methods:
method_link_species_set(), score(), length(), perc_id(),
get_all_GenomicAligns(), get_SimpleAlign()
GenomicAlign
dnafrag(), genome_db(), get_Slice(), dnafrag_start(),
dnafrag_end(), dnafrag_strand(), aligned_sequence()
GenomicAlignBlock
$all_GAlign = $GABlock->get_all_GenomicAligns()   ($arrayref)
$Simplealign = $GABlock->get_SimpleAlign()   ($object)
$Simplealign: a BioPerl object which contains the whole
alignment; it can be printed in various formats using BioPerl
modules
$Galign: an object which represents only one of the sequences
in the alignment
Hsap.X.1223-1230: ACCTTC-A   <- $ga
Cfam.X.1390-1395: ACC--CGA   <- $ga
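The relationship between an aligned sequence (with gaps) and the original sequence can be sketched in plain Perl, using the two aligned rows shown on this slide as input. The helper names ungapped and identical_columns are made up for this illustration and are not Compara API methods; the strings stand in for what aligned_sequence() would return for each GenomicAlign.

```perl
use strict;
use warnings;

# Remove gap characters to recover the ungapped (original) sequence.
sub ungapped {
    my ($aligned) = @_;
    (my $seq = $aligned) =~ tr/-//d;
    return $seq;
}

# Count alignment columns where both rows carry the same base.
# A column containing a gap never counts as identical.
sub identical_columns {
    my ($row_a, $row_b) = @_;
    die "rows differ in length\n" unless length($row_a) == length($row_b);
    my $same = 0;
    for my $i (0 .. length($row_a) - 1) {
        my $x = substr($row_a, $i, 1);
        my $y = substr($row_b, $i, 1);
        $same++ if $x ne '-' && $x eq $y;
    }
    return $same;
}

# The two rows from the slide.
my ($hsap, $cfam) = ("ACCTTC-A", "ACC--CGA");

print "Hsap ungapped: ", ungapped($hsap), "\n";
print "Cfam ungapped: ", ungapped($cfam), "\n";
print "Identical columns: ", identical_columns($hsap, $cfam),
      " of ", length($hsap), "\n";
```

Every row of one alignment block has the same length once gaps are included, which is why the column-by-column comparison is valid.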
Synteny
Based on BlastZ-net alignments
SyntenyRegionAdaptor
fetch_all_by_MethodLinkSpeciesSet_Slice(),
fetch_all_by_MethodLinkSpeciesSet_DnaFrag()
Methods:
get_all_DnaFragRegions(), method_link_species_set(),
DnaFragRegion
slice(), dnafrag(), dnafrag_start(), dnafrag_end(),
dnafrag_strand()
Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
GenomicAlignBlock
1. Fetch all the BLASTZ_NET alignments between the first 130K
nucleotides of the human chromosome X and the mouse genome.
2. Print the exact location of the alignment blocks.
3. Compare the original and the aligned sequences.
4. Find the BLASTZ_NET alignments between human gene BRCA2
and the mouse genome.
5. Print the BLASTZ_NET alignments between the rat gene ECSIT and
the mouse genome.
6. Print the PECAN multiple alignments between the rat gene ECSIT
and 11 other amniote vertebrates.
7. Print the constrained-element alignments within the rat ECSIT locus
(use the constrained elements generated from the 12-way alignments).
Synteny
1. Get the human-mouse syntenic map for human chromosome X.
GenomicAlignBlock example code
[...]
my $slice_adaptor = $reg->get_adaptor(
"human", "core", "Slice");
my $slice = $slice_adaptor->
fetch_by_region("chromosome", "12", 1e4, 2e4);
my $gaba = $reg->get_adaptor("Multi", "compara",
"GenomicAlignBlock");
my $genomic_align_blocks = $gaba->
fetch_all_by_MethodLinkSpeciesSet_Slice(
$method_link_species_set, $slice);
foreach my $this_gab (@$genomic_align_blocks) {
  # Each block holds one GenomicAlign per aligned sequence
  my $all_gas = $this_gab->get_all_GenomicAligns();
  foreach my $this_ga (@$all_gas) {
    print
      $this_ga->genome_db->name(),
      ":", $this_ga->get_Slice()->name(), "\n";
    print
      $this_ga->aligned_sequence(), "\n";
  }
  print "\n";
}
GenomicAlignBlock example code
$>perl gab.pl
Mus musculus:chromosome:NCBIM37:6:121449987:121450302:-1
CCTCTTAATAAACATTATTGTCAA[…]
Homo sapiens:chromosome:NCBI36:12:19128:19507:1
CCTCTTAATAAGCACACATATCCT[..]
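The location strings in this output are slice names, which pack coordinate system, assembly version, sequence region, start, end and strand into colon-separated fields. As a small sketch of reading them back (parse_slice_name is a made-up helper for this illustration, not an API method), the two names printed above can be split like this:

```perl
use strict;
use warnings;

# Split a slice name of the form
#   coord_system:version:seq_region:start:end:strand
# into its fields and add the span length.
sub parse_slice_name {
    my ($name) = @_;
    my ($coord_system, $version, $region, $start, $end, $strand)
        = split /:/, $name;
    return {
        coord_system => $coord_system,
        version      => $version,
        region       => $region,
        start        => $start,
        end          => $end,
        strand       => $strand,
        length       => $end - $start + 1,   # coordinates are inclusive
    };
}

# The two slice names from the example output.
for my $name ("chromosome:NCBIM37:6:121449987:121450302:-1",
              "chromosome:NCBI36:12:19128:19507:1") {
    my $loc = parse_slice_name($name);
    printf "%s %s:%d-%d (%s strand), %d bp\n",
        $loc->{version}, $loc->{region}, $loc->{start}, $loc->{end},
        ($loc->{strand} < 0 ? "reverse" : "forward"), $loc->{length};
}
```

The two spans differ in length (316 bp against 380 bp), which is expected: the aligned sequences are padded with gaps to a common length.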
Synteny example code
[...]
my $synteny_region_adaptor = $reg->get_adaptor(
"Multi", "compara", "SyntenyRegion");
my $synteny_regions = $synteny_region_adaptor->
fetch_all_by_MethodLinkSpeciesSet_Slice(
$human_mouse_synteny_method_link_species_set,
$human_slice);
foreach my $this_synteny_region (@$synteny_regions) {
my $these_dnafrag_regions =
$this_synteny_region->get_all_DnaFragRegions();
foreach my $this_dnafrag_region
(@$these_dnafrag_regions) {
print $this_dnafrag_region->dnafrag->
genome_db->name, ": ",
$this_dnafrag_region->slice->name, "\n";
}
}
print "\n";
Homology
(e! 38):
Orthologue predictions based on ‘best reciprocal
blast hits’
Paralogues for a selected set of species
No global view of the evolutionary history of the
gene considered
e! 39+:
Orthologues and paralogues are inferred from
protein trees
Phylogeny: Orthology/Paralogy in one go
BSR: Blast Score Ratio. When two proteins P1 and P2 are compared,
BSR = score(P1,P2) / max(self-score(P1), self-score(P2)). The default threshold used in the
initial clustering step is 0.33.
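A minimal sketch of the BSR calculation in plain Perl follows. The function name blast_score_ratio and the scores are invented for illustration, and treating a pair as kept when its BSR is at or above the threshold is an assumption; the slide states only the formula and the default value of 0.33.

```perl
use strict;
use warnings;

# BSR = score(P1,P2) / max(self-score(P1), self-score(P2))
sub blast_score_ratio {
    my ($score_p1p2, $self_p1, $self_p2) = @_;
    my $max_self = $self_p1 > $self_p2 ? $self_p1 : $self_p2;
    die "self-scores must be positive\n" unless $max_self > 0;
    return $score_p1p2 / $max_self;
}

my $THRESHOLD = 0.33;   # default from the slide

# Made-up scores, for illustration only: 120 / max(300, 400) = 0.30
my $bsr = blast_score_ratio(120, 300, 400);

# Assumption: pairs at or above the threshold are kept for clustering.
printf "BSR = %.2f, pair %s\n", $bsr,
    ($bsr >= $THRESHOLD ? "kept" : "dropped");
```

Dividing by the larger self-score means a short protein that matches only part of a long one gets a low ratio, even if the raw score looks respectable.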
Homology types
Homology
Homology object
contains 1 pair of Member/Attribute per gene/protein
fetch_all_by_Member(),
fetch_all_by_MethodLinkSpeciesSet(),
fetch_all_by_Member_MethodLinkSpeciesSet()
Methods:
method_link_species_set(), description(),
subtype(), perc_id(), get_all_Member_Attribute(),
get_SimpleAlign()
Family
Compara computes gene family clusters
Runs on all Ensembl transcripts plus all Uniprot/SWISSPROT
and Uniprot/SPTREMBL metazoan proteins
The algorithm is based on:
All vs all blastp
MCL clustering
Muscle multiple aligner
Results stored in family, family_member tables
Family
Family object
contains 1 pair of Member/Attribute per gene/protein
fetch_all_by_Member()
Methods:
method_link_species_set(), description(),
description_score(), get_all_Member_Attribute(),
get_SimpleAlign()
Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
Members
1. Find the Member corresponding to SwissProt protein O93279
2. Find the Member for the human gene BRCA2
3. Find all the peptide Members corresponding to the human gene
CTDP1
Homology
1. Get all the predicted homologues for the human gene BRCA2
2. Get all the mouse orthologues predicted for the human gene CTDP1
Family
1. Get the family predicted for the human gene BRCA2
2. Get the alignments corresponding to the family of the human gene
HBEGF
Member example code
use strict;
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(
-host=>"ensembldb.ensembl.org",
-user => "anonymous");
my $member_adaptor = $reg->get_adaptor(
"Multi", "compara", "Member");
my $member = $member_adaptor->
fetch_by_source_stable_id(
"ENSEMBLGENE", "ENSG00000000971");
print "All proteins:\n";
my $all_peptide_members = $member->
get_all_peptide_Members();
foreach my $this_peptide (@$all_peptide_members) {
print $this_peptide->stable_id(), "\n";
}
Member example code
$> perl test2.pl
All proteins:
ENSP00000356399
ENSP00000356398
ENSP00000352658
Homology example code
[...]
my $ma = $reg->get_adaptor(
"Multi", "compara", "Member");
my $member = $ma->fetch_by_source_stable_id(
"ENSEMBLGENE", "ENSG00000000971");
my $homology_adaptor = $reg->get_adaptor(
"Multi", "compara", "Homology");
my $homologies = $homology_adaptor->
fetch_all_by_Member($member);
foreach my $this_homology (@$homologies) {
print $this_homology->description, "\n";
my $member_attributes = $this_homology->
get_all_Member_Attribute();
foreach my $this_mem_attr (@$member_attributes) {
my ($this_member, $this_attribute) =
@$this_mem_attr;
print $this_member->genome_db->name, " ",
$this_member->source_name, " ",
$this_member->stable_id, "\n";
}
print "\n";
}
Family example code
[...]
my $ma = $reg->get_adaptor(
"Multi", "compara", "Member");
my $member = $ma->fetch_by_source_stable_id(
"ENSEMBLGENE", "ENSG00000000971");
my $family_adaptor = $reg->get_adaptor(
"Multi", "compara", "Family");
my $families = $family_adaptor->
fetch_all_by_Member($member);
foreach my $this_family (@$families) {
print $this_family->description, "\n";
my $member_attributes = $this_family->
get_all_Member_Attribute();
foreach my $this_mem_attr (@$member_attributes) {
my ($this_member, $this_attribute) =
@$this_mem_attr;
print $this_member->taxon->binomial, " ",
$this_member->source_name, " ",
$this_member->stable_id, "\n";
}
print "\n";
}
Getting More Information
perldoc: viewer for inline API documentation.
shell> perldoc Bio::EnsEMBL::Compara::GenomeDB
shell> perldoc Bio::EnsEMBL::Compara::DBSQL::MemberAdaptor
Tutorial document:
cvs: ensembl-compara/docs/ComparaTutorial.pdf
online at: http://www.ensembl.org/
ensembl-dev mailing list:
[email protected]
Exercise solutions:
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/solutions.html
Ensembl-dev mailing list and HelpDesk
The ensembl-dev mailing list is great for questions about
the API and the DB
HelpDesk is very helpful
Give detailed info on what you are trying to do
Check that you have the modules installed
($PERL5LIB pointing to them)
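One quick way to check the installation before asking for help is to test whether Perl can load the modules from its current search path. The sketch below uses only core Perl; module_available is a helper name made up for this illustration, and the two module names are the ones mentioned elsewhere in these slides.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded from the current @INC, else 0.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(Bio::EnsEMBL::Registry Bio::EnsEMBL::Compara::GenomeDB)) {
    printf "%-35s %s\n", $module,
        module_available($module) ? "found" : "NOT found";
}

# Directories from $PERL5LIB appear at the front of @INC.
print "Perl module search path:\n";
print "  $_\n" for @INC;
```

If a module is reported as not found, the printed search path shows whether the directories set in $PERL5LIB were picked up at all.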
Ensembl Team
Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
Database Schema and Core API: Glenn Proctor, Ian Longden, Patrick Meidl, Andreas Kähäri
BioMart: Arek Kasprzyk, Damian Smedley, Richard Holland, Syed Haldar
Distributed Annotation System (DAS): Eugene Kulesha
Outreach: Xosé M Fernández, Bert Overduin, Giulietta Spudich, Michael Schuster
Web Team: James Smith, Fiona Cunningham, Anne Parker, Steve Trevanion (VEGA)
Comparative Genomics: Javier Herrero, Kathryn Beal, Benoît Ballester, Stephen Fitzgerald, Albert Vilella, Leo Gordon
Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Browen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Jan-Hinnerck Vogel, Kevin Howe, Felix Kokocinski, Stephen Rice, Simon White
Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
Zebrafish Annotation: Kerstin Jekosch, Mario Caccamo, Ian Sealy
VectorBase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard
Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Benedict Paten, Daniel Zerbino