Transcript grant2006

Planning next five years
Highlights from the 2007-2012 NHGRI Grant
Major Projects
• 60 new vertebrate genomes
• Known Genes III and IV
• Hardware upgrades
• Associating mutations and diseases
• Representing human differences
– Germ line and cancer
• Faster data mining
• Database federation
• Modest staff growth (1 staff/year)
Many more genomes
• Want to be able to support 60 new
vertebrate genomes (120 new assemblies)
over next 5 years.
– Automation will be crucial if we are to do this
and anything else.
• Want to be able to do at least a 50-way
vertebrate genome multiple alignment.
– This will require upgrades to computer cluster.
Automation Goals - 7 Scripts
• Start - checks inputs. Produces database, assembly,
gap, gc percent tracks.
• Masking - generates de-novo repeat masker library
if needed, masks sequence.
• GenBank - sets up automatic build of GenBank
alignments.
• Pairs - blastz/chain/net + tracks.
• Multiple - multiz/phastCons + tracks.
• Genes - Genscan, Exoniphy, Augustus, human
proteins track, maybe ExonWalk.
• Finish - runs automatic checks, builds downloads
region.
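A minimal sketch of how the seven scripts above could be chained by a single driver, assuming each stage is exposed as a function that takes an assembly name. The stage names follow the slide; the `run_pipeline` function, the `resume_from` option, and the callables are hypothetical and are not the actual build scripts.

```python
from typing import Callable, Dict, List

# Stage order taken from the slide; the functions behind the names are placeholders.
STAGES = ["start", "masking", "genbank", "pairs", "multiple", "genes", "finish"]


def run_pipeline(steps: Dict[str, Callable[[str], None]],
                 assembly: str,
                 resume_from: str = "start") -> List[str]:
    """Run the stages in order, beginning at resume_from; return the names run."""
    if resume_from not in STAGES:
        raise ValueError("unknown stage: " + resume_from)
    done = []
    for name in STAGES[STAGES.index(resume_from):]:
        steps[name](assembly)  # an exception here stops the build at this stage
        done.append(name)
    return done


if __name__ == "__main__":
    # Hypothetical no-op stages, just to show the call shape.
    noop = {name: (lambda asm: None) for name in STAGES}
    print(run_pipeline(noop, "testAsm1", resume_from="pairs"))
```

Being able to resume from a named stage is one way to avoid redoing expensive steps after a failure; whether the real scripts work this way is an assumption.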
Known Genes Upgrades
• Plan to upgrade process twice during next five
years.
• First upgrade in next year - KG III.
– Covers noncoding genes
– Broken into 3 tiers of reliability.
– Set up to be updated 1x/month
• Will work with NCBI, Sanger, Ensembl towards a
common set.
– Havana will be genome-wide by then, may be our best
bet.
– Alternatively can work to extend Consensus CDS to be
broader. Since ENSEMBL will be out of loop (it’s just used to
fill in areas where Havana doesn’t cover currently) this will
be easier. Perhaps UCSC can make a bid to create the
lowest tier with automated predictions.
Known Genes III
• Align GenBank RNAs, filter, cluster.
• Run something like ExonWalk to define all splice sites
inside of a cluster.
• Pick representative RNA for each splice form.
• Correct RNA by weighted consensus with genome,
other RNAs, maybe ESTs.
• Add genes like Olfactory Receptors with no RNA.
• Filter out pseudogenes.
• Map UniProt and RefSeq to this set.
• Arrange into tiers
– Gold = consensus CDS
– Silver = associated protein, RNA corrections unneeded
– Bronze = noncoding, RNA with corrections, etc.
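The tier rules above can be sketched as a small classifier. This is only an illustration of the three rules as stated on the slide; the `Transcript` fields and the `tier` function are hypothetical names, not part of the Known Genes build.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    in_consensus_cds: bool   # present in the consensus CDS set
    has_protein: bool        # has an associated protein
    rna_corrected: bool      # RNA needed correction against genome or other RNAs


def tier(t: Transcript) -> str:
    """Assign gold, silver, or bronze following the slide's three rules."""
    if t.in_consensus_cds:
        return "gold"
    if t.has_protein and not t.rna_corrected:
        return "silver"
    # Noncoding genes, RNAs that needed corrections, and everything else.
    return "bronze"
```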
Hardware Upgrades
• New cluster every 2.5 years. Next in 1 year.
Each budgeted at ~$750,000
• New web servers every 3 years.
• Disk storage to double every 18 months.
• Upgrade to disk infrastructure (beyond the
spinning platters) in 2 years.
• Workstations upgraded every 5 years.
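As a rough check on what "double every 18 months" implies over the grant period, the growth factor after m months is 2^(m/18). A small sketch (the function name is illustrative only):

```python
def storage_growth(months: float, doubling_months: float = 18.0) -> float:
    """Capacity multiple after the given number of months of steady doubling."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    for years in range(1, 6):
        print(f"year {years}: {storage_growth(12 * years):.1f}x current capacity")
```

Under this assumption disk capacity is about 4x after three years and roughly 10x after five.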
Genotype meets Phenotype
• Genome scans for candidate regions:
– Linkage studies ~20Mb regions
– SNP association studies ~100 kb regions
– Homozygosity mapping ~20 Mb regions
• Medical sequencing inside regions:
– PCR based sequencing of exons, conserved regions
– Sequencing of whole SNP association study candidate
regions.
– New technology will make sequencing vastly cheaper.
• Figuring out which mutations are causative
– 95% of the variation will have no effect.
– Helping scientists figure out the 5% that has effect, and the
even smaller percent that is truly causative is likely the most
important thing for us to do over the next 5 years.
Support for Genome Scans
Candidate Regions
Candidate Genes
Medical Sequencing
• Sol Katzman has some local experience with
medical sequencing.
• It’s possible that we’d want to get into the business
of assembling traces against the genome, and
defining differences, but:
– Tools exist that do assembly and differencing already.
– Not clear a web program is the best place for this. Data
volumes are high.
• Initially at least would expect users to upload
differences to us. Would support PolyPhred,
PolyBayes, difference formats, and define perhaps
a simpler format ourselves.
Making sense of differences
• Here’s a mockup of two differences custom
tracks, one for cases, one for controls.
• Green - likely harmless, yellow possible
candidate, red likely candidate, gray
unknown.
Representing diploids on screen
Genetic diff exchange format
• Worth defining an exchange format for
human genetic differences before XML
bloaters do it for us.
– In 10 years, likely to be 100,000 human genome
sequences around.
– In 20 years, likely to be 1 billion!
– Want representation that is efficient
• Text version - human readable, relatively
dense.
• Binary version - very dense. Efficient to
convert into memory data structures.
Text diff format:
• Header
– Specifies reference genome, other info.
• Region records
– Begin with region covered (chr1:100000-200000)
– Followed by simple diffs, rearrangements, simple diffs.
• Simple diffs:
– Substitutions, deletions, insertions (mostly small)
– One line per diff: <start position> <type> <type-specific-data>
• 100387 sub A
• 100867 del 3
• 101989 ins TT
• Rearrangements
– <start> <size> <type> <type-specific-data>
• 120012 5022 invert
• 198912 0 dup chr17 1909839 5000 +
• May need more thought for translocations…
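A minimal parser sketch for the text format described above. Several details are assumptions, since the slide does not fix them: header lines are taken to start with "#", a region record is a line of the form chrom:start-end, and a line is treated as a rearrangement when its second word is a number (the size) and as a simple diff otherwise. Translocations are not handled.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

REGION_RE = re.compile(r"^(\S+):(\d+)-(\d+)$")


@dataclass
class Region:
    chrom: str
    start: int
    end: int
    simple: List[Tuple[int, str, str]] = field(default_factory=list)
    rearrangements: List[Tuple[int, int, str, str]] = field(default_factory=list)


def parse_diff_text(text: str) -> List[Region]:
    """Parse region records followed by simple diffs and rearrangements."""
    regions: List[Region] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed header/comment syntax
            continue
        m = REGION_RE.match(line)
        if m:
            regions.append(Region(m.group(1), int(m.group(2)), int(m.group(3))))
            continue
        if not regions:
            raise ValueError("diff line before any region record: " + line)
        words = line.split()
        if len(words) < 2:
            raise ValueError("malformed diff line: " + line)
        start = int(words[0])
        if words[1].isdigit():
            # <start> <size> <type> <type-specific-data>
            if len(words) < 3:
                raise ValueError("malformed rearrangement: " + line)
            regions[-1].rearrangements.append(
                (start, int(words[1]), words[2], " ".join(words[3:])))
        else:
            # <start position> <type> <type-specific-data>
            regions[-1].simple.append((start, words[1], " ".join(words[2:])))
    return regions


EXAMPLE = """\
chr1:100000-200000
100387 sub A
100867 del 3
101989 ins TT
120012 5022 invert
198912 0 dup chr17 1909839 5000 +
"""

if __name__ == "__main__":
    for region in parse_diff_text(EXAMPLE):
        print(region)
```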
Binary diff format
• Bulk of data in “simple diffs” most of which would
be just 16 bits long:
– 13 bits - number of bases up to 8k from previous diff.
– 3 bits to define 8 possibilities:
1) Single nucleotide change to A
2) Single nucleotide change to C
3) Single nucleotide change to G
4) Single nucleotide change to T
5) Single base deletion
6) Insert another copy of previous base
7) A skip of more than 8 kb.
• The 13 bits will be combined into offset of next diff
• For very long skips, could have 2 skips, for a total of 39 bits of
offset.
8) A more complex change starts at this base, details to follow.
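A sketch of how the 16-bit simple-diff records might be packed and unpacked. The slide gives only the field widths, so the bit layout (offset in the high 13 bits, code in the low 3), the little-endian byte order, the code numbering 0-7 for possibilities 1-8, and the convention that skip records carry the high-order offset bits are all assumptions. The complex change type is not encoded because its details are unspecified.

```python
import struct
from typing import Iterable, List, Tuple

# Codes 0..7 stand for possibilities 1..8 on the slide (assumed numbering).
KINDS = ["toA", "toC", "toG", "toT", "del1", "dupPrev", "skip", "complex"]
SKIP, COMPLEX = 6, 7
OFFSET_BITS = 13
OFFSET_MASK = (1 << OFFSET_BITS) - 1  # 8191


def encode(diffs: Iterable[Tuple[int, str]], start: int = 0) -> bytes:
    """Pack (position, kind) pairs, sorted by position, into 16-bit records."""
    records = []
    prev = start
    for pos, kind in diffs:
        code = KINDS.index(kind)
        if code >= SKIP:
            raise ValueError("only the six simple kinds are encoded in this sketch")
        offset = pos - prev
        if offset < 0:
            raise ValueError("diffs must be sorted by position")
        # Split the offset into 13-bit chunks, most significant first.
        chunks = [offset & OFFSET_MASK]
        offset >>= OFFSET_BITS
        while offset:
            chunks.insert(0, offset & OFFSET_MASK)
            offset >>= OFFSET_BITS
        if len(chunks) > 3:
            raise ValueError("offset needs more than 39 bits")
        for chunk in chunks[:-1]:
            records.append((chunk << 3) | SKIP)   # skip records hold the high bits
        records.append((chunks[-1] << 3) | code)  # final record holds the change
        prev = pos
    return struct.pack("<%dH" % len(records), *records)


def decode(data: bytes, start: int = 0) -> List[Tuple[int, str]]:
    """Inverse of encode for the simple kinds."""
    records = struct.unpack("<%dH" % (len(data) // 2), data)
    pos, acc, out = start, 0, []
    for rec in records:
        acc = (acc << OFFSET_BITS) | (rec >> 3)
        code = rec & 7
        if code == SKIP:
            continue  # keep accumulating offset bits
        if code == COMPLEX:
            raise NotImplementedError("complex change details are not specified")
        pos += acc
        acc = 0
        out.append((pos, KINDS[code]))
    return out
```

With this layout a diff within 8191 bases of the previous one costs 2 bytes, and one skip record extends the reachable offset to 26 bits.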
Somatic Variation
• Somatic variation in cancer cells and aging tissues is
also medically important.
• Many of the structures for finding significant
differences in germ line variants will apply, but:
– Copy number polymorphisms especially important.
– Likely will want to highlight changes in known oncogenes.
– Chromosome count radically different, will need to be able
to define “new” chromosomes.
• Need to be aware of CaBIG, Cancer Genome Atlas
and other efforts.
Faster data mining
• Make streaming through whole tables faster
– MySQL 5 does some of this
– May want to access MyISAM tables directly
• As part of ENCODE grant have asked for
small cluster to do parallel queries.
• Consider judicious precomputation or
caching of useful intermediates:
– Store bitmap or region list of bases covered by a
track.
– “Materialized views” for certain common joins?
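One way the "region list of bases covered by a track" could be precomputed is by merging the track's items into sorted, non-overlapping intervals. The sketch below assumes half-open (start, end) coordinates on a single chromosome; the function names are illustrative.

```python
from typing import Iterable, List, Tuple


def merge_regions(items: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or abutting half-open intervals into a region list."""
    merged: List[List[int]] = []
    for start, end in sorted(items):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current region
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def bases_covered(regions: Iterable[Tuple[int, int]]) -> int:
    """Total bases in an already merged region list."""
    return sum(end - start for start, end in regions)
```

Storing the merged list once lets coverage and intersection questions be answered without streaming the whole table again.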
Database Federation
• As soon as we integrate a database, data starts
going stale.
• NCBI, ENSEMBL, UniProt, JAX, FlyBase all are based
on relational databases.
• Try and make a federation:
– All agree to have a public SQL server.
– Develop enhancements to all.joiner and .as files that
describe tables across federation.
– Query other database rather than integrating it into our
database.
• Challenges:
– Coping with data format changes, “federated” queries
must fail gracefully, and notify us of failure.
– Preventing bad queries from monopolizing things.
– Getting everyone to agree! NCBI may be hardest nut, for
they have the most work, and least to gain. Francis has
leverage though….
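A small sketch of the "fail gracefully, and notify us" idea. Here `sqlite3` stands in for a remote public SQL server purely so the example is self-contained; the `federated_query` wrapper and the logger name are hypothetical.

```python
import logging
import sqlite3
from typing import List, Optional, Sequence, Tuple

log = logging.getLogger("federation")


def federated_query(conn: sqlite3.Connection,
                    sql: str,
                    params: Sequence = ()) -> Optional[List[Tuple]]:
    """Run a query against another database; return None and log on failure."""
    try:
        return conn.execute(sql, params).fetchall()
    except sqlite3.Error as err:
        # A renamed column or dropped table on the remote side lands here.
        log.warning("federated query failed: %s (%s)", sql, err)
        return None


if __name__ == "__main__":
    remote = sqlite3.connect(":memory:")  # stand-in for a partner's server
    remote.execute("CREATE TABLE gene (name TEXT, chrom TEXT)")
    remote.execute("INSERT INTO gene VALUES ('geneA', 'chr1')")
    print(federated_query(remote, "SELECT name FROM gene WHERE chrom = ?", ("chr1",)))
    print(federated_query(remote, "SELECT symbol FROM gene"))  # schema changed
```

A caller that gets None can fall back to a cached copy or simply omit that data, rather than failing the whole page.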
Modest Growth
• Anticipate modest growth - 1 person/year
on this grant (plus replacements)
• Trying to spread funding across sources
other than NHGRI in the long term. This
may involve a little more growth too.
Conclusion
• Grant was longest thing I’ve ever written, maybe
good preparation for writing book on our
software…
• Now it’s hurry up and wait … won’t know how well
grant received for six months or so.
• Very unlikely they will do worse to us than last year
(one year of flat funding).
• External feedback to grant has been that it is a
major improvement over last year’s.
• Time to cross our fingers, and get back to actual
research and development.