Tri-I Bioinformatics Workshop:
Public data and tool repositories
Alex Lash & John Major
Bioinformatics Core
Memorial Sloan-Kettering Cancer Center
Workshop sections
1. Retrieving data from public resources (Lash)
   • public databases at NCBI, EBI, Ensembl and UCSC
   • locate and utilize some of the myriad of publicly available bioinformatics tools
2. Survey of analysis tools and tutorials (Lash)
   • broad survey of analysis tools and tutorials available on the Web for use directly and after download
3. Genome Browsers (Major)
   • genome build process, ongoing and complete genome projects
   • genome browsers of Ensembl, UCSC and NCBI Mapviewer
4. Bulk downloads (Major)
   • how bulk bioinformatics data might be useful
   • common data formats
   • retrieving data
Section 1
Retrieving data from public resources
Goals
A. Understand the scope and organization of
the major public databases: NCBI, EBI,
Ensembl and UCSC.
B. Understand the importance of unique
identifiers, database fields, logical operators
and wildcards.
C. Be able to query, retrieve and display
publications and sequences.
Amyloid Precursor Protein (APP)
[Figure: diagram of APP. Function labels: G-protein coupled receptor that binds heparin and laminin; controls nerve cell growth; interacts with protein-synthesis machinery. Processing labels: β-secretase, -secretase, amyloid fibril, amyloid plaque.]
NCBI
Strengths are data storage, annotation and BLAST:
1. PubMed: Biomedical publications
2. Heritable diseases and syndromes
3. GenBank: Nucleotide and protein sequences
4. BLAST: Pairwise sequence comparison
5. Curated gene-centric data, including reference sequences
6. Genome builds
7. Nucleotide sequence traces
Ex: Finding Entrez Gene record for APP
Indexing and logical operators
Query: app[Gene Name] AND homo sapiens[Organism]
[Figure: an index that maps each term (aardvark … app … homo sapiens … mus musculus) to a bit map with one bit per record UID (1 2 3 4 5 6 7 8…). The bit maps of the two query terms are combined with AND:]
    0 0 0 0 0 1 0 0…
AND 1 0 0 0 1 1 0 0…
  = 0 0 0 0 0 1 0 0…
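The bit map AND on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not NCBI's implementation: the two bit lists are the first eight bits shown on the slide, and the assumption that the first bit corresponds to UID 1 is ours. The helper names are hypothetical.

```python
# First eight bits of the two operand bit maps shown on the slide.
# Assumption: position 0 corresponds to record UID 1.
term_a = [0, 0, 0, 0, 0, 1, 0, 0]
term_b = [1, 0, 0, 0, 1, 1, 0, 0]


def bitmap_and(a, b):
    """Records that contain both terms."""
    return [x & y for x, y in zip(a, b)]


def bitmap_or(a, b):
    """Records that contain either term."""
    return [x | y for x, y in zip(a, b)]


def bitmap_not(a, b):
    """Records that contain the first term but not the second."""
    return [x & (1 - y) for x, y in zip(a, b)]


def bitmap_to_uids(bits):
    """Convert set bits back to 1-based record UIDs."""
    return [position + 1 for position, bit in enumerate(bits) if bit]


result = bitmap_and(term_a, term_b)
print(result)                  # [0, 0, 0, 0, 0, 1, 0, 0]
print(bitmap_to_uids(result))  # [6]
```

Because every term has a precomputed bit map, a Boolean query reduces to cheap bitwise operations, with no need to scan the records themselves.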
An Entrez Query
1. Query parsed: terms, fields and operators organized in a tree (if syntax incorrect generate error or warning)
2. Unfielded terms matched to synonyms, and extra terms, fields and operators added as needed
3. For each database:
   a) According to order of operations:
      i. Term found in appropriate index (if term not found, then generate warning)
      ii. Bit map pulled and uncompressed
      iii. Pairwise operations performed with previous result (if zero result, then stop)
   b) Number of results generated
4. If Global Query, display results summary and stop
5. List of UIDs generated from final result
6. UIDs sorted by user preference
7. Records pulled and displayed by user preference
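The core of these steps (parse, look up each term, combine pairwise, list UIDs) can be sketched as a toy evaluator. Everything here is an assumption made for illustration: the `INDEX` contents are invented, Python sets stand in for compressed bit maps, operators are applied strictly left to right, and synonym expansion and Global Query handling are left out.

```python
import re

# Hypothetical toy index: (field, term) -> set of record UIDs.
INDEX = {
    ("Gene Name", "app"): {6, 9},
    ("Organism", "homo sapiens"): {1, 5, 6},
    ("Organism", "mus musculus"): {2, 9},
}


def run_query(query, index):
    """Evaluate 'term[Field] OP term[Field] ...' left to right."""
    # Step 1: parse into alternating terms and operators.
    parts = re.split(r"\s+(AND|OR|NOT)\s+", query.strip())
    warnings = []
    result = None
    operator = None
    for position, part in enumerate(parts):
        if position % 2 == 1:
            operator = part
            continue
        match = re.fullmatch(r"(.+?)\[(.+)\]", part)
        if match is None:
            raise ValueError(f"cannot parse term: {part!r}")
        term = match.group(1).strip().lower()
        field = match.group(2).strip()
        # Step 3a-i: find the term in the index, warn if it is absent.
        if (field, term) not in index:
            warnings.append(f"term not found: {part}")
        uids = index.get((field, term), set())
        # Step 3a-iii: pairwise operation with the previous result.
        if result is None:
            result = set(uids)
        elif operator == "AND":
            result &= uids
        elif operator == "OR":
            result |= uids
        else:  # NOT
            result -= uids
        # Stop on an empty result, unless a later OR could add records.
        if not result and "OR" not in parts[position + 1:]:
            break
    # Steps 5-6: list the UIDs; ascending order stands in for user preference.
    return sorted(result), warnings


print(run_query("app[Gene Name] AND homo sapiens[Organism]", INDEX))
# ([6], [])
```

The early stop mirrors the "if zero result, then stop" rule: once an intersection is empty, further AND or NOT terms cannot bring any records back.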
Gene-centric questions
1. Where is a gene located?
2. What’s its genomic sequence?
3. What variations are associated with it?
4. What’s its exon-intron structure?
5. What are the mRNA sequences of its alternate transcripts?
6. What are the protein sequences of its isoforms?
7. What post-translational modification is possible?
8. What regulates its transcription?
9. What are its co-regulated partners?
10. What’s its normal function?
11. What’s its function in disease?
12. How does it fit into the larger cellular context?
May depend upon cellular “state”
Ex: Looking over the Entrez Gene record for APP
KEGG pathway: Alzheimer’s disease
Common id and record formats
1. Ids
   a) GenBank accession
      i. Nucleotide: BI559391, Y00264
      ii. Protein: AAB23646
   b) RefSeq
   c) Ensembl
   d) UniGene
      i. Hs.651215
   e) PDB Structures
      i. 1iyt
   f) HUGO Gene Names
      i. APP
2. Formats
   a) Flat
      i. GenBank and GenPept
      ii. FASTA
      iii. Multiple FASTA
      iv. Alignment
      v. Multiple alignment
      vi. Tab-delimited
   b) Hierarchical
      i. ASN.1
      ii. XML
      iii. HTML
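Because each id family has a recognizable shape, a rough classifier can be written with regular expressions. The patterns below are simplified assumptions that cover the example ids on this slide (real accession formats have more variants), the RefSeq pattern reflects the usual two-letter prefix plus underscore, and the function name is hypothetical.

```python
import re

# Simplified, assumed patterns; checked in order, first match wins.
ID_PATTERNS = [
    ("Ensembl", re.compile(r"ENS[A-Z]{0,3}[GTEP]\d{11}")),
    ("RefSeq", re.compile(r"[A-Z]{2}_\d+(\.\d+)?")),
    ("UniGene", re.compile(r"[A-Z][a-z]{1,2}\.\d+")),
    ("GenBank nucleotide", re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")),
    ("GenBank protein", re.compile(r"[A-Z]{3}\d{5}")),
    ("PDB structure", re.compile(r"\d[A-Za-z0-9]{3}")),
]


def classify_id(identifier):
    """Return the first id family whose pattern matches the whole string."""
    for family, pattern in ID_PATTERNS:
        if pattern.fullmatch(identifier):
            return family
    # Gene symbols such as APP have no fixed shape, so they fall through.
    return "unrecognized (possibly a gene symbol)"


for example in ["BI559391", "Y00264", "AAB23646", "Hs.651215", "1iyt", "APP"]:
    print(example, "->", classify_id(example))
```

A classifier like this is only a first guess: a gene symbol cannot be told apart from arbitrary text by shape alone, which is one reason fielded queries and unique identifiers matter.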
NCBI’s RefSeq project
1. Is a project to create curated sequence records for the biopolymers of the Central Dogma: DNA, mRNA and protein
2. First release 2003
3. 4,079 organisms, 3,234,358 proteins
4. Goals
   1. non-redundancy
   2. explicitly linked nucleotide and protein sequences
   3. updates to reflect current knowledge of sequence data and biology
   4. data validation and format consistency, distinct accession series
   5. ongoing curation by NCBI staff and collaborators, with reviewed records indicated
5. What’s its relationship to BLAST database called “nr”?
UniGene versus Entrez Gene
1. UniGene
   1. Automated process that compares and clusters transcript-source sequences (no assembly)
   2. Gene discovery tool: predates Entrez Gene, genome assemblies
   3. Based primarily on EST sequences
   4. ID turn-over and retirement is common
   5. Currently 76 taxa and 1,299,304 clusters
2. Entrez Gene
   1. Curated clearinghouse of gene-centric information
   2. Grew out of LocusLink (eukaryote model organisms) and Entrez Genome (bacteria, viruses, organelles)
   3. ID turn-over and retirement happens, but is less common since it is based primarily on sequenced genomes
   4. Currently 3,882 taxa and 2,479,759 genes
3. Hs: 85,793 UniGene clusters compared to 38,604 Entrez Gene records
UCSC Genome Browser
Strength is genome position-based data aggregation:
1. Data positioned on “best” genome build and organised into “tracks”
2. Outside data tracks
   1. Genome builds
   2. Genes, known and predicted
   3. mRNA
   4. Expression and regulation
   5. Variations and repeats
3. Inside data tracks
   1. Known Genes
   2. Comparative genomics
4. Custom tracks
Ex: Looking at the APP gene in the UCSC Genome Browser
EBI/Ensembl
Strengths are data storage and analysis software:
1. Biomedical publications
2. Nucleotide and protein sequences
3. Protein domains/signatures
4. Sequence comparison
5. Sequence analysis
6. Structure analysis
7. Protein function analysis
8. Ensembl genome browser
Ex: Looking at the APP gene in the EBI/Ensembl resources
Ensembl ids
1. Human
   1. ENSG: gene
   2. ENST: transcript
   3. ENSE: exon
   4. ENSP: protein
2. Other organisms
   1. ENS{species 3-letter code}{G|T|P}{11 digits}
   2. RNO=rat
   3. MUS=mouse
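The naming scheme above is regular enough to decode mechanically. The following is a minimal sketch, assuming the pattern exactly as stated on the slide plus the exon letter E from the human list. Only the two species codes given above are mapped, and the function name and example ids are illustrative.

```python
import re

# ENS, optional 3-letter species code, one type letter, 11 digits.
ENSEMBL_ID = re.compile(
    r"ENS(?P<species>[A-Z]{3})?(?P<kind>[GTEP])(?P<number>\d{11})"
)

# Only the codes listed on the slide; no code means human.
SPECIES = {None: "human", "RNO": "rat", "MUS": "mouse"}
KINDS = {"G": "gene", "T": "transcript", "E": "exon", "P": "protein"}


def parse_ensembl_id(identifier):
    """Split an Ensembl id into species, record type and number."""
    match = ENSEMBL_ID.fullmatch(identifier)
    if match is None:
        return None
    code = match.group("species")
    return {
        "species": SPECIES.get(code, f"unknown ({code})"),
        "type": KINDS[match.group("kind")],
        "number": int(match.group("number")),
    }


print(parse_ensembl_id("ENSG00000142192"))
# {'species': 'human', 'type': 'gene', 'number': 142192}
print(parse_ensembl_id("ENSMUSG00000022892"))
# {'species': 'mouse', 'type': 'gene', 'number': 22892}
```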
Problems
1. Query Entrez Gene with the following two queries separately
and then explain the differences between the two results using a
logical NOT operation:
a) tyrosine kinase[Gene Ontology] AND human[Organism]
b) cd00192[Domain] AND human[Organism]
2. Retrieve the APP gene record from NCBI and use the Display
dropdown menu to display Conserved Domain Links. Use the
ids of the listed domains to query Entrez Gene for records with
the same domains.
3. Use the SNP GeneView link at NCBI to identify coding SNPs in
the APP gene. Which SNP that was present in the Ensembl APP
protein record is missing from this display?
4. Use the HomoloGene link at NCBI to identify possible functional
orthologs for human APP. How does this list compare to the
Ensembl list of orthologs that we reviewed previously?
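For Problem 1, the two queries can be combined with NOT and sent to Entrez programmatically rather than through the web form. The sketch below only builds the query string and a URL for NCBI's E-utilities `esearch` endpoint, and it makes no network request. The field names are those used in the problem, while the helper names and the `retmax` default are our assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_not(query_a, query_b):
    """Records matching query_a but not query_b."""
    return f"({query_a}) NOT ({query_b})"


def esearch_url(term, db="gene", retmax=20):
    """Build an esearch URL; the caller decides whether to fetch it."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


query_a = "tyrosine kinase[Gene Ontology] AND human[Organism]"
query_b = "cd00192[Domain] AND human[Organism]"

only_a = entrez_not(query_a, query_b)  # annotated as kinase, domain absent
only_b = entrez_not(query_b, query_a)  # domain present, not annotated

print(only_a)
print(esearch_url(only_a))
```

Running the NOT in both directions separates the two kinds of disagreement: records with the annotation but not the domain, and records with the domain but not the annotation.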