center - University of California, Santa Cruz

UCSC Genome Tools and
Databases
Jim Kent - Genome Bioinformatics Group
University of California Santa Cruz
Behind the Genome Browser
• ‘Genome’ database, one for each assembly of
each genome.
– hg17 (human genome assembly 17)
– mm6 (Mus musculus 6)
– canFam1 (Canis familiaris 1)
• hg17 has 1616 tables, but not really
– Some tables split across chromosomes for speed
– 228 logical tables
– Only ~30 different types of tables
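The gap between physical and logical table counts can be sketched in a few lines. This is only an illustration: the table names are a made-up sample, and the assumption that split tables carry a per-chromosome prefix such as chr1_ is ours, not something stated on the slide. The helper names logical_name and group_by_logical are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical sample of physical table names. The "chrN_" prefix for
# split tables is an assumption made for this illustration.
physical_tables = [
    "chr1_gap", "chr2_gap", "chrX_gap",
    "chr1_mrna", "chr2_mrna",
    "cpgIsland", "ensGene",
]

# Matches "chr<something>_<rest>" and captures <rest>.
SPLIT_PREFIX = re.compile(r"^chr[^_]+_(.+)$")

def logical_name(table):
    """Strip a per-chromosome prefix, if present, to get the logical name."""
    match = SPLIT_PREFIX.match(table)
    return match.group(1) if match else table

def group_by_logical(tables):
    """Map each logical table name to the physical tables behind it."""
    groups = defaultdict(list)
    for table in tables:
        groups[logical_name(table)].append(table)
    return dict(groups)

groups = group_by_logical(physical_tables)
print(len(physical_tables), "physical tables,", len(groups), "logical tables")
```

In this sample, seven physical tables collapse to four logical ones (gap, mrna, cpgIsland, ensGene), which is the same kind of reduction as 1616 to 228.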
Selected fields from related tables results: Ensembl Gene
(ensGene) and Superfamily Description (sfDescription).
Custom Track Output
• Useful for visualizing results of queries in
genome browser
• The way to produce more complex queries.
681/3329 (20%) of Ensembl genes that are not known genes are also not conserved
1728/33,666 (5%) of Ensembl genes in general are not conserved
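As a quick check of the arithmetic behind these two figures, the percentages can be recomputed directly from the counts on the slide. The helper name percent is ours.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole

not_known = percent(681, 3329)     # subset not among the known genes
in_general = percent(1728, 33666)  # the set as a whole

print(round(not_known), round(in_general))
print(round(not_known / in_general, 1))  # how much higher the first rate is
```

The first rate comes out roughly four times the second, which is the point of the comparison.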
Meta-data behind Table Browser
• The trackDb table describes each track.
• Table and field descriptions in AutoSql .as
files, from which autoSql also generates SQL code
and C code to load/save from the database and
tab-separated files.
• Descriptions of how tables are connected in the
all.joiner file, which the joinerCheck program
uses to check database integrity.
.as Files - table and field docs
table cpgIsland
"Describes the CpG Islands"
    (
    string chrom;      "Human chromosome or FPC contig"
    uint chromStart;   "Start position in chromosome"
    uint chromEnd;     "End position in chromosome"
    string name;       "CpG Island"
    uint length;       "Island Length"
    uint cpgNum;       "Number of CpGs in island"
    uint gcNum;        "Number of C and G in island"
    float perCpg;      "Percentage of island that is CpG"
    float perGc;       "Percentage of island that is C or G"
    )
autoSql generates code from these files. They also serve as documentation.
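To give a feel for what load/save code for this table does, here is a minimal Python sketch that mirrors the cpgIsland fields and converts between an object and a tab-separated line. It is not autoSql output (autoSql emits C and SQL); the class name, method names, and the sample row values are all made up for illustration.

```python
from dataclasses import dataclass, astuple

@dataclass
class CpgIsland:
    """One row of cpgIsland; field names and order follow the .as file."""
    chrom: str
    chromStart: int
    chromEnd: int
    name: str
    length: int
    cpgNum: int
    gcNum: int
    perCpg: float
    perGc: float

    @classmethod
    def from_tab_line(cls, line):
        """Parse one tab-separated line into a CpgIsland."""
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            raise ValueError(f"expected 9 fields, got {len(fields)}")
        return cls(fields[0], int(fields[1]), int(fields[2]), fields[3],
                   int(fields[4]), int(fields[5]), int(fields[6]),
                   float(fields[7]), float(fields[8]))

    def to_tab_line(self):
        """Write the row back out as a tab-separated line (no newline)."""
        return "\t".join(str(value) for value in astuple(self))

# Made-up example row, not real data.
sample = "chr1\t18598\t19673\tCpG: 116\t1075\t116\t787\t21.6\t73.2\n"
row = CpgIsland.from_tab_line(sample)
print(row.chrom, row.chromEnd - row.chromStart, row.perGc)
```

Keeping the field order identical to the .as file is what lets the same description drive the SQL schema, the loader, and the documentation.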
all.joiner - basic example
identifier softberryGeneName
"Link together Fgenesh++ gene structure, peptide, and homolog"
$gbd.softberryGene.name
$gbd.softberryPep.name
$gbd.softberryHom.name
• The central concept is an identifier that appears in fields in multiple
tables, sometimes even in multiple databases.
• $gbd is a variable that contains a comma-separated list of databases.
• An identifier record ends with a blank line.
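The structure of an identifier record can be sketched as a small parser. This is a simplified reading of the format as described on the slide, not the joinerCheck implementation: the function names are ours, the database list assigned to $gbd below is a hypothetical value, and attributes that may follow a field are ignored.

```python
def parse_identifier_record(text):
    """Split one identifier record into (name, comment, field specs)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    header = lines[0].split()
    if header[0] != "identifier":
        raise ValueError("record must start with 'identifier'")
    name = header[1]
    comment = lines[1].strip('"')
    # Keep only the db.table.field part of each field line.
    fields = [line.split()[0] for line in lines[2:] if line]
    return name, comment, fields

def expand_field(field_spec, variables):
    """Expand '$var.table.field' or 'db1,db2.table.field' into triples."""
    db_part, table, field = field_spec.split(".")
    if db_part.startswith("$"):
        databases = variables[db_part[1:]]
    else:
        databases = db_part.split(",")
    return [(db, table, field) for db in databases]

record = '''
identifier softberryGeneName
"Link together gene structure, peptide, and homolog"
    $gbd.softberryGene.name
    $gbd.softberryPep.name
    $gbd.softberryHom.name
'''

# Hypothetical value for $gbd; the real list lives in all.joiner.
variables = {"gbd": ["hg17", "mm6", "canFam1"]}

name, comment, fields = parse_identifier_record(record)
for spec in fields:
    print(spec, "->", expand_field(spec, variables))
```

Each field spec expands to one (database, table, field) triple per database, and those triples are the columns that share the identifier and can be joined on.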
# Genbank/trEMBL Accessions and meaningful subsets thereof
identifier genbankAccession external=genbank
"Generic Genbank Accession. More specific Genbank accessions follow"
$gbd.seq.acc
identifier bacEndAccession typeOf=genbankAccession
"Genbank accession of a BAC end read."
$gbd.all_bacends.qName dupeOk
$gbd.bacEndPairs.lfNames comma
$hg.fishClones.beNames comma minCheck=0.70
typeOf - allows joins between parent and child, but not
between siblings.
dupeOk - allows more than one row with the same identifier in
the primary table
comma - indicates the field is a comma-separated list of identifiers
minCheck - indicates that only a portion of the identifiers in the
field are in the primary table
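How comma and minCheck interact can be sketched as follows. This is a hedged illustration of the idea described above, not joinerCheck's actual logic; the function names and the accession values are made up.

```python
def fraction_found(field_values, primary_ids, comma=False):
    """Fraction of identifiers in a field that exist in the primary table."""
    identifiers = []
    for value in field_values:
        if comma:
            # A comma field holds several identifiers; skip empty pieces
            # left by a trailing comma.
            identifiers.extend(part for part in value.split(",") if part)
        else:
            identifiers.append(value)
    if not identifiers:
        return 1.0
    hits = sum(1 for ident in identifiers if ident in primary_ids)
    return hits / len(identifiers)

def passes_check(field_values, primary_ids, comma=False, min_check=1.0):
    """True if at least min_check of the identifiers are in the primary table."""
    return fraction_found(field_values, primary_ids, comma) >= min_check

# Made-up accessions for illustration only.
primary = {"AQ000001", "AQ000002", "AQ000003"}
be_names = ["AQ000001,AQ000002,", "AQ000003,AQ999999,"]

print(fraction_found(be_names, primary, comma=True))
print(passes_check(be_names, primary, comma=True, min_check=0.70))
print(passes_check(be_names, primary, comma=True))
```

Here three of the four identifiers are found, so the field passes with a threshold of 0.70 but would fail the default requirement that every identifier be present.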
identifier hugoName external=HUGO fuzzy
"International Human Gene Identifier"
$hg.refLink.name
$hg.atlasOncoGene.locusSymbol
$hg.kgAlias.alias
$hg.kgXref.geneSymbol
$hg.refFlat.geneName
$hg.jaxOrtholog.humanSymbol
hg13,hg15.geneBands.name
“Biological” names for human genes are so messy, no
validation is done (note ‘fuzzy’ keyword).
Other Databases
• Genome databases - one for each assembly of each
organism: hg17, mm6, canFam1, etc.
• hgCentral - home to dbDb and user settings info.
One database shared by all web servers.
• hgFixed - mostly microarray data.
• uniProt - Relationalized SwissProt/trEMBL
database.
• go - Gene ontology terms and term/gene
associations.
• genePix - gene image database
Gene Pix
• Image browser for in-situ and other gene-oriented pictures
• Hopefully in the long run will have a
million images covering almost all
vertebrate genes.
• (Needs new name, Gene Pix is a microarray
analysis program. VisiGene?)
Data Sets
• Paul Gray - ~1000 mouse transcription factor
genes - whole embryo & sections. These are in the
database now.
• Other potential sources:
– German AxelDB frog in situs
– Japanese NIBB frog in situs (have nice browser)
– Genepaint.org - mouse stuff
– EMAGE and Jackson Lab mouse images
• From development and other journals, copyright issues.
– Nathaniel Heintz BAC expression constructs
– Eddy Rubin lab mouse embryos
– UCSF cell-localization stuff?
Types of images
• Whole animal vs.
sectioned tissues, vs.
single cell.
• Single vs. multiple
probes within same
image.
• Single image vs. image
series (movies even).
• RNA, Antibody, Fusion
protein.
Mitotic cell 3 stains
Gene Pix Programs
• genePixLoad - loads SQL database from a well
defined format involving a .ra file and a tab
separated file. See genePixLoad.doc
• loadMahoney - converts Paul Gray (Mahoney
center) spreadsheet and image directory into
genePixLoad format
• hg/lib/genePix.c - interface with SQL database.
• hgGenePix - cgi script to display images
• knownToGenePix - makes table in mm5 (or other)
genome database to connect known genes to
genePix Ids.
Gene Pix Database
• Just a single database for all assemblies of
all organisms.
• A knownToGenePix table in the assembly
database.
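The two-database arrangement can be sketched with SQLite standing in for the real server. Everything about the schema here is an assumption for illustration: the column names (name, value, id, fileName) and the sample rows are hypothetical, and only the table name knownToGenePix and the idea of one shared image database come from the slide.

```python
import sqlite3

# The main connection stands in for an assembly database (e.g. mm5).
conn = sqlite3.connect(":memory:")
# A second in-memory database stands in for the shared genePix database.
conn.execute("ATTACH DATABASE ':memory:' AS genePix")

# Hypothetical columns: name = known gene id, value = genePix image id.
conn.execute("CREATE TABLE knownToGenePix (name TEXT, value INTEGER)")
conn.execute("CREATE TABLE genePix.image (id INTEGER PRIMARY KEY, fileName TEXT)")

conn.executemany("INSERT INTO knownToGenePix VALUES (?, ?)",
                 [("geneA", 1), ("geneB", 2)])
conn.executemany("INSERT INTO genePix.image VALUES (?, ?)",
                 [(1, "a.jpg"), (2, "b.jpg")])

def images_for_gene(connection, gene):
    """Follow knownToGenePix from a gene id to image file names."""
    rows = connection.execute(
        "SELECT i.fileName FROM knownToGenePix AS k "
        "JOIN genePix.image AS i ON i.id = k.value "
        "WHERE k.name = ? ORDER BY i.id",
        (gene,),
    ).fetchall()
    return [row[0] for row in rows]

print(images_for_gene(conn, "geneA"))
```

Only the small linking table lives with each assembly, so image data is stored once no matter how many assemblies refer to it.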
GenePix tables
• fileLocation - directory
• bodyPart - whole, brain etc.
• sliceType - transverse, sagittal
• treatment - tech details
• contributor - who done it
• Journal - scientific journal
• submissionSet - info about a
whole set of images from one
author
• sectionSet - links together
separate sections of same
specimen.
• Gene - gene info
• geneSynonym
• Antibody - info on an
antibody
• probeType - antibody, RNA,
fusion protein
• Probe - links gene, primers,
sequence Ab.
• probeColor - color probe is
• imageFile - file containing
image
• Image - a single image.
• imageProbe links image and
probe
Some Anatomy Required
Especially with slices
Edinburgh mouse atlas
Theiler Stages
Later Stages
NIBB Japanese Frog Site
Earlier Stages
Who you gonna call?
Angie Hinrichs - developer of 2nd and
4th versions of Table Browser. Genome
browser hacker extraordinaire.
Hiram Clawson - main mouse man at the
moment. Developed ‘wiggle’ tracks.
Kate Rosenbloom - ENCODE project
and multiple alignment display.
Bob Kuhn - Software and database
quality assurance.
David Haussler - Ideas. Money.
Comparative genomics.
More Acknowledgements
• UCSC - Robert Baertsch, Gill Bejerano, Galt
Barber, Ron Chao, Mark Diekhans, Jorge Garcia,
Patrick Gavin, Rachel Harte, Fan Hsu, Yontao Lu,
Crystal Lynch, Donna Karolchik, Jennifer
Jackson, Ann Pace, Jacob Pedersen, Andy Pohl,
Katie Pollard, Ali Sultan-Qurraie, Brian Raney,
Krishna Roskin, Adam Siepel, Chuck Sugnet, Paul
Tatarsky, Daryl Thomas, Heather Trumbower
• Penn State - Scott Schwartz, Laura Elnitski,
Belinda Giardine, Ross Hardison, Minmei Hou,
Webb Miller, Anton Nekrutenko
• Funding - NHGRI, HHMI, NCI, UCSC