allenBrain2005 - University of California, Santa Cruz

Visualizing Genes and Evolution
Jim Kent
Genome Bioinformatics Group
University of California Santa Cruz
VisiGene
• Image browser for in-situ and other gene-oriented
pictures
• Hopefully in the long run will have a million
images covering almost all vertebrate genes.
• Currently has 12000 images
– Mouse transcription factor in-situs from Paul Gray
– Imagery from the literature collected by Jackson Labs
• Features:
– Interactive zooming and scrolling
– Extensive database for captions
– Integration into UCSC Genome Browser web site
Current VisiGene: http://hgwdev-kent.cse.ucsc.edu/cgi-bin/hgVisiGene
Clicking on area of interest zooms in on it by 2x
Zooming & scrolling fast - only visible area is transmitted to users.
Caption under image gives specimen details, credits
and links to data providers.
For well-studied genes, literature images from Jackson Labs abound.
Integration with Genome Browser is via
Known Genes Track
A new link in known gene details page
Gene sorter also links to VisiGene. Here using sorter
to search for genes expressed in substantia nigra
(target of Parkinson’s disease)
VisiGene Challenges
• Virtual microscope zooming and scrolling
– Precomputed pyramid scheme for storing pictures at
various scales
– Javascript to interactively fetch just parts of images
needed. (Javascript portability issues!)
• Defining orthologous genes between species.
• Image storage - will terabytes suffice?
• Database for captions and other annotations of
images.
• Collecting imagery and data from a wide variety
of projects… including Allen Institute?
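A minimal sketch of the pyramid idea above, assuming each level halves the previous one (matching the 2x zoom) and that images are cut into square tiles. The 256-pixel tile size and the function names are illustrative assumptions, not details of the actual VisiGene code.

```python
import math

def pyramid_levels(width, height, tile=256):
    """List (width, height) per level, halving until the image fits in one tile."""
    levels = [(width, height)]
    while width > tile or height > tile:
        width = max(1, math.ceil(width / 2))
        height = max(1, math.ceil(height / 2))
        levels.append((width, height))
    return levels

def tiles_for_viewport(x, y, view_w, view_h, tile=256):
    """Return (col, row) indices of the tiles that overlap the visible area."""
    first_col, first_row = x // tile, y // tile
    last_col = (x + view_w - 1) // tile
    last_row = (y + view_h - 1) // tile
    return [(col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]
```

Only the tiles returned by `tiles_for_viewport` would need to be sent to the browser, which is why zooming and scrolling can stay fast even for very large images.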
The Spectrum of Cells
A single fertilized egg develops into a
human body, brain and all. During this
process perhaps 300 different types of cells
arise. Understanding and controlling this
process of development is critical for stem
cell based medicine.
For some animals, such as C. elegans, the complete lineage and
type of every cell is worked out.
Cell Lineage Tree of C. elegans, focusing on the gut.
Full cell lineage for C. elegans, worked out by John Sulston and
colleagues using microscope, eye, sketchbook and patience.
Vertebrate development is too
complex to work out by hand
Mouse embryos days 7 - 10
Some cell lineages, such
as those leading from
the hematopoietic stem
cell to the various types
of cells that make up the
blood stream and much
of the immune system
are well worked out in
vertebrates.
A good deal of stem cell
research involves finding
marker genes that
distinguish between
different cell types at
various branches in the
differentiation tree. In
many cases the discovery
of new markers has
resulted in the definition
of new cell types.
Blood cell lineages
were worked out
with surface markers
and a cell sorter.
Many types of cells
don’t sort so easily,
but usually cell
nuclei can be
resolved even in
fairly complex
embryos by
microscopy,
especially confocal
microscopy.
Multiple markers can be used to classify cells into different
types with microscopy as well as with sorters. Using quantum
dots it is now possible to label simultaneously with a dozen
markers.
[Table: per-cell marker intensities in nine emission channels, 450 nm to 650 nm in 25 nm steps]
Note: last two might look same to eye but not to sensors, which have more than three
channels.
Ideal Markers
• Unlike cell sorting experiments, would try to pick
markers that are each present in 1/3 to 2/3 of cell
types rather than markers present in only one cell
type.
• The markers would be chosen so that their
expression patterns were relatively independent of
each other, using resources such as Allen Brain
Atlas
• In ideal world, 8 perfect markers could distinguish
256 cell types, in real world we’d hope 12 or 15
well chosen markers would be enough.
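The arithmetic behind the 8-markers-to-256-types claim can be sketched as follows, assuming each marker is simply read as on or off. The marker profiles in the usage note are made-up illustrations, not real expression data.

```python
def max_types(n_markers):
    """Upper bound on cell types separable by n on/off markers."""
    return 2 ** n_markers

def min_markers(n_types):
    """Fewest on/off markers that could, in the ideal case, separate n_types."""
    return (n_types - 1).bit_length()

def marker_fractions(profiles):
    """Fraction of cell types in which each marker is on."""
    rows = list(profiles.values())
    return [sum(row[i] for row in rows) / len(rows) for i in range(len(rows[0]))]

def distinct_signatures(profiles):
    """Number of cell types that end up with a unique on/off signature."""
    return len({tuple(row) for row in profiles.values()})
```

For example, with the hypothetical profiles `{"A": (1, 0, 0), "B": (1, 1, 0), "C": (0, 1, 1), "D": (0, 0, 1)}`, every marker is on in half of the types and all four types have distinct signatures. A marker present in only one type would contribute far less to separating the rest.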
Clustering and Beyond
• Once have measured marker levels on cells, can
use clustering software as is used for microarrays,
to define cell types.
• Since markers don’t change instantaneously we
should observe in embryos ‘trails’ in color space
between clusters linking together two cell types.
• We can also have information about what cells are
near each other, and potentially information about
cell shape for more sophisticated algorithms.
Cell Spectrum Summary
1) Use brain and gene atlas data to find 10 to 20
nuclear markers with distinct but overlapping
expression patterns.
2) Label antibodies with quantum dots.
3) Stain thick slices with labeled antibodies.
4) Capture images with multichannel confocal
microscope.
5) Identify nuclei and assign colors to them.
6) Cluster based on color to define cell types.
7) Construct tree of cell types by looking at spatial
and temporal data, and looking for intermediate
forms.
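Steps 5 and 6 could look roughly like the sketch below, which treats each nucleus as a vector of channel intensities and groups the vectors with plain k-means. K-means is swapped in here only because it is short; the slides do not say which clustering method would be used, and the starting centers are supplied by the caller as an assumption.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two color vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(group):
    """Component-wise mean of a non-empty list of color vectors."""
    n = len(group)
    return tuple(sum(p[d] for p in group) / n for d in range(len(group[0])))

def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def kmeans(points, centers, iterations=10):
    """Cluster color vectors; returns (final centers, label per point)."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center when a cluster ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels
```

Each resulting cluster would be a candidate cell type; nuclei lying between two cluster centers are the kind of intermediate forms step 7 looks for.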
Comparative Genomics
Webb Miller
Comparative Genomics at BMP10
Conservation of Gene Features
[Figure: percent aligning and percent identity (50% to 100%) across gene features]
Conservation pattern across 3165 mappings of human RefSeq mRNAs to the
genome. A program sampled 200 evenly spaced bases across 500 bases
upstream of transcription, the 5’ UTR, the first coding exon, introns, middle
coding exons, introns, the 3’ UTR and 500 bases after polyadenylation. There
are peaks of conservation at the transition from one region to another.
Chaining Alignments
• Chaining bridges the gulf between syntenic blocks and
base-by-base alignments.
• Local alignments tend to break at transposon insertions,
inversions, duplications, etc.
• Global alignments tend to force non-homologous bases to
align.
• Chaining is a rigorous way of joining together local
alignments into larger structures.
Chains join together related local alignments
Protease Regulatory Subunit 3
Affine penalties are too harsh for long gaps
Log count of gaps vs. size of gaps in mouse/human
alignment correlated with sizes of transposon relics. Affine
gap scores model red/blue plots as straight lines.
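To make the point about long gaps concrete, here is a small comparison of an affine gap cost with a cost that grows only logarithmically with gap length. All constants are hypothetical and the logarithmic form is just one possible alternative; neither is taken from the actual alignment scoring scheme.

```python
import math

def affine_gap(length, open_cost=400, extend_cost=30):
    """Affine cost: fixed opening cost plus a constant cost per gap base."""
    return open_cost + extend_cost * length

def log_gap(length, open_cost=400, scale=600):
    """Concave cost: grows with the logarithm of gap length (length >= 1)."""
    return open_cost + scale * math.log(length)
```

With these made-up numbers the two costs are close for a one-base gap, but for a 10,000-base gap (the size of a large transposon insertion) the affine cost is roughly fifty times larger, which would break an otherwise good chain.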
Before and After Chaining
Chaining Algorithm
• Input - blocks of gapless alignments from blastz
• Dynamic program based on the recurrence relationship:
score(B_i) = max_{j<i} [ score(B_j) + match(B_i) - gap(B_i, B_j) ]
• Uses Miller’s KD-tree algorithm to minimize which parts
of dynamic programming graph to traverse. Timing is
O(N log N), where N is number of blocks (which is in hundreds
of thousands)
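The recurrence can be sketched directly as a naive dynamic program. This version checks every earlier block, so it runs in O(N^2) rather than the O(N log N) of the KD-tree approach, and the `Block` layout and gap cost constants are illustrative assumptions rather than the real chaining parameters.

```python
from typing import NamedTuple

class Block(NamedTuple):
    t_start: int   # start in the target (e.g. human) sequence
    q_start: int   # start in the query (e.g. mouse) sequence
    length: int    # gapless, so the same length in both sequences
    match: float   # alignment score of the block itself

def gap_cost(prev, cur):
    """Hypothetical gap cost: fixed opening cost plus a per-base cost."""
    dt = cur.t_start - (prev.t_start + prev.length)
    dq = cur.q_start - (prev.q_start + prev.length)
    return 1.0 + 0.5 * (dt + dq)

def best_chain(blocks):
    """Return (score, blocks) of the highest scoring chain."""
    blocks = sorted(blocks, key=lambda b: b.t_start)
    if not blocks:
        return 0.0, []
    score = [b.match for b in blocks]   # base case: a chain of one block
    back = [None] * len(blocks)
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            # A predecessor must end before cur starts in both sequences.
            if (prev.t_start + prev.length <= cur.t_start
                    and prev.q_start + prev.length <= cur.q_start):
                candidate = score[j] + cur.match - gap_cost(prev, cur)
                if candidate > score[i]:
                    score[i], back[i] = candidate, j
    i = max(range(len(blocks)), key=score.__getitem__)
    best, chain = score[i], []
    while i is not None:
        chain.append(blocks[i])
        i = back[i]
    return best, chain[::-1]
```

Blocks that lie far off the diagonal pay a large gap cost and are left out of the chain, which is how chaining separates related local alignments from unrelated ones.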
Netting Alignments
• Commonly multiple mouse alignments can
be found for a particular human region,
particularly for coding regions.
• Net finds the best mouse match for each
human region.
• Highest scoring chains are used first.
• Lower scoring chains fill in gaps within
chains inducing a natural hierarchy.
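A flattened sketch of the netting rule, assuming each chain is given as a name, a score and a list of aligned human intervals. Chains are taken in descending score order and each one keeps only the human bases not already claimed; the nesting levels of the real net are not modeled here, and the data layout is an assumption for illustration.

```python
def subtract(interval, covered):
    """Return the parts of (start, end) not overlapped by any covered interval."""
    pieces = [interval]
    for c_start, c_end in covered:
        remaining = []
        for start, end in pieces:
            if c_end <= start or c_start >= end:
                remaining.append((start, end))        # no overlap
            else:
                if start < c_start:
                    remaining.append((start, c_start))  # part left of overlap
                if c_end < end:
                    remaining.append((c_end, end))      # part right of overlap
        pieces = remaining
    return pieces

def net(chains):
    """Assign human intervals to chains, highest scoring chains first."""
    covered, result = [], []
    for name, _score, blocks in sorted(chains, key=lambda c: c[1], reverse=True):
        claimed = []
        for block in blocks:
            claimed.extend(subtract(block, covered))
        covered.extend(claimed)
        if claimed:
            result.append((name, claimed))
    return result
```

In the usage below, a lower scoring chain fills the gap left between two blocks of the top chain, while a chain that only duplicates already covered bases drops out.

```python
chains = [
    ("ortholog", 1000, [(0, 100), (200, 300)]),
    ("inversion", 400, [(90, 210)]),
    ("duplicate", 50, [(0, 50)]),
]
print(net(chains))
```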
Net Focuses on Ortholog
Net highlights rearrangements
A large gap in the top level of the net is filled by an
inversion containing two genes. Numerous smaller
gaps are filled in by local duplications and processed
pseudo-genes.
Useful in finding pseudogenes
Ensembl and Fgenesh++ automatic gene predictions
confounded by numerous processed pseudogenes.
Domain structure of resulting predicted protein must
be interesting!
Mouse/Human
Rearrangement Statistics
Number of rearrangements of given type per megabase
excluding known transposons.
A Rearrangement Hot Spot
Rearrangements are not evenly distributed. Roughly 5%
of the genome is in hot spots of rearrangements such as
this one. This 350,000 base region is between two very
long chains on chromosome 7.
Conservation Track at UCSC
• Based on Multiz alignments using tools from Miller lab.
• Conservation score is calculated by Adam Siepel’s
PhyloHMM.
• Highlights exons, promoters, enhancers, etc.
• We should have a 10-species vertebrate multiple
alignment up soon at UCSC, currently have 8-species.
Reconstructed ancestral
(boreutherian) genome for one
chromosome
Acknowledgements
• UCSC Bioinformatics: Galt Barber, Robert Baertsch, Gill
Bejerano, Mathieu Blanchette, Ron Chao, Hiram Clawson, Mark
Diekhans, Jorge Garcia, Patrick Gavin, Rachel Harte, Angie Hinrichs,
Fan Hsu, Jennifer Jackson, Donna Karolchik, Robert Kuhn, Yontao Lu,
Crystal Lynch, Webb Miller, Jakob Pedersen, Andy Pohl, Katie
Pollard, Brian Raney, Kate Rosenbloom, Krishna Roskin, Adam
Siepel, Chuck Sugnet, Ali Sultan-Qurraie, Paul Tatarsky, Daryl
Thomas, Heather Trumbower, David Haussler
• Penn State Comparative Genomics: Webb Miller, Ross
Hardison, Belinda Giardine, Scott Schwartz, Cathy Riemer, Minmei
Hou, LouXin Zhang, Jian Ma
• Sequence Data: Sanger Institute, Washington University, Broad
Institute, Baylor College of Medicine, Joint Genome Institute,
International Human Genome Sequencing Consortium.
• Images: Jackson Labs, Mahoney Lab, Company of Biologists.
• Funding: NHGRI, HHMI, NCI, QB3, UC Santa Cruz
The End