Orthology, Paralogy, Chains, and Nets - CS273a

Download Report

Transcript Orthology, Paralogy, Chains, and Nets - CS273a

This Friday 10am Beckman B-302
Introduction to the UCSC Browser.
HW1 Due This Fri 10/15 at noon.
http://cs273a.stanford.edu
[Bejerano Fall10/11]
1
Lecture 7
Genome Evolution
Chromosomal Mutations
Paralogy & Orthology
Chains & Nets
http://cs273a.stanford.edu
[Bejerano Fall10/11]
2
One Cell, One Genome, One Replication
Every cell holds a copy of all its DNA = its genome.
The human body is made of ~1013 cells.
All originate from a single cell through repeated cell divisions.
DNA strings =
Chromosomes
egg
egg
cell
genome =
all DNA
cell
division
chicken
chicken ≈ 1013 copies
(DNA) of egg (DNA)
http://cs273a.stanford.edu
[Bejerano Fall10/11]
egg
3
Mutation Rate per bp
• 10-9
•
•
chicken
egg
chicken
•
•
per base pair per cell division
This refers to mutations that are not
repaired
Thus, there are at least six new
mutations in each kid that were not
present in either parent
Mutations range from the smallest
possible (single base pair change) to
the largest – whole genome duplication.
Selection does not tolerate all of
these mutation, but it sure does
tolerate some.
4
Example: Human-Chimp Genomic Differences
Number of events
1%
3%
Open question..
5
Chromosomal (ie Big) Mutations
• May Involve:
– Changing the
structure of a
chromosome
– The loss or
gain of part of
a chromosome
Chromosome Mutations
• Five types exist:
– Deletion
– Inversion
– Translocation
– Nondisjunction
– Duplication
Deletion
• Due to breakage
• A piece of a
chromosome is lost
Inversion
• Chromosome segment
breaks off
• Segment flips around
backwards
• Segment reattaches
Duplication
• Occurs when a
genomic region is
repeated
Whole Genome Duplication at the Base of the Vertebrate Tree
Xen.Laevis WGD
http://cs273a.stanford.edu
[Bejerano Fall10/11]
11
Translocation
• Involves two
chromosomes that
aren’t homologous
• Part of one
chromosome is
transferred to
another chromosomes
Nondisjunction
• Failure of chromosomes to separate
during meiosis
• Causes gamete to have too many or
too few chromosomes
• Disorders:
– Down Syndrome – three 21st chromosomes
– Turner Syndrome – single X chromosome
– Klinefelter’s Syndrome – XXY chromosomes
Chromosome Mutation
Animation
The Species Tree
S
S
Sampled Genomes
S
Speciation
15
Time
A Gene tree evolves with respect to
a Species tree
Gene tree
Species tree
Speciation
Duplication
Loss
16
Terminology
Orthologs : Genes related via speciation (e.g. C,M,H3)
Paralogs: Genes related through duplication (e.g. H1,H2,H3)
Homologs: Genes that share a common origin
(e.g. C,M,H1,H2,H3)
Gene tree
single
ancestral
gene
Species tree
Speciation
Duplication
Loss
http://cs273a.stanford.edu
[Bejerano Fall10/11]
17
Gene trees and even species trees are
figments of our (scientific) imagination
Species trees and gene trees can be wrong.
All we really have are extant observations, and fossils.
Observed
Inferred
Gene tree
single
ancestral
gene
Species tree
Speciation
Duplication
Loss
http://cs273a.stanford.edu
[Bejerano Fall10/11]
18
Gene Families
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/orthologs3.gif
19
Gu et al. Age distribution of human gene families shows significant roles of both large-scale
and small-scale duplication in vertebrate evolution (2002) Nature Genetics 31; 205-208
20
Chaining Alignments
Chaining highlights homologous regions between genomes (it bridges
the gulf between syntenic blocks and base-by-base alignments.
Local alignments tend to break at transposon insertions, inversions,
duplications, etc.
Global alignments tend to force non-homologous bases to align.
Chaining is a rigorous way of joining together local alignments into
larger structures.
http://cs273a.stanford.edu
[Bejerano Fall10/11]
21
“Raw” Blastz track (no longer displayed)
Alignment = homologous regions
Protease Regulatory Subunit 3
22
Chains & Nets: How they’re built
• 1: Blastz one genome to another
– Local alignment algorithm
– Finds short blocks of similarity
Hg18:
Mm8:
AAAAAACCCCCAAAAA
AAAAAAGGGGG
Hg18.1-6 + AAAAAA
Mm8.1-6 + AAAAAA
Hg18.7-11 + CCCCC
Mm8.1-5 - CCCCC
Hg18.12-16 + AAAAA
Mm8.1-5 + AAAAA
23
Chains & Nets: How they’re built
• 2: “Chain” alignment blocks together
– Links blocks that preserve order and orientation
– Not single coverage in either species
Hg18:
Mm8:
AAAAAACCCCCAAAAA
AAAAAAGGGGGAAAAA
Hg18: AAAAAACCCCCAAAAA
Mm8.1-6 +
Mm8.12-16 +
Mm8
Mm8.7-11 chains Mm8.12-15 +
Mm8.1-5 +
24
Another Chain Example
Ancestral Sequence
A
B
C
D
E
Human Sequence
A
B
C
D
E
Mouse Sequence
A
B
C
B’
D
E
In Human Browser
Implicit
Human
sequence
Mouse
chains
B’
…
D
…
D
In Mouse Browser
E
E
Implicit
Mouse
sequence
Human
chains
…
…
D
E
25
Chains join together related local alignments
likely ortholog
likely paralogs
shared domain?
Protease Regulatory Subunit 3
http://cs273a.stanford.edu
[Bejerano Fall10/11]
26
Chains
• a chain is a sequence of gapless aligned blocks, where there must be
no overlaps of blocks' target or query coords within the chain.
• Within a chain, target and query coords are monotonically nondecreasing. (i.e. always increasing or flat)
• double-sided gaps are a new capability (blastz can't do that) that
allow extremely long chains to be constructed.
• not just orthologs, but paralogs too, can result in good chains. but
that's useful!
• chains should be symmetrical -- e.g. swap human-mouse -> mousehuman chains, and you should get approx. the same chains as if you
chain swapped mouse-human blastz alignments.
• chained blastz alignments are not single-coverage in either target or
query unless some subsequent filtering (like netting) is done.
• chain tracks can contain massive pileups when a piece of the target
aligns well to many places in the query. Common causes of this
include insufficient masking of repeats and high-copy-number genes
(or paralogs).
[Angie Hinrichs, UCSC wiki]
http://cs273a.stanford.edu
[Bejerano Fall10/11]
27
Before and After Chaining
http://cs273a.stanford.edu
[Bejerano Fall10/11]
28
Chaining Algorithm
Input - blocks of gapless alignments from blastz
Dynamic program based on the recurrence relationship:
score(Bi) = max(score(Bj) + match(Bi) - gap(Bi, Bj))
j<i
Uses Miller’s KD-tree algorithm to minimize which parts of dynamic
programming graph to traverse. Timing is O(N logN), where N is
number of blocks (which is in hundreds of thousands)
http://cs273a.stanford.edu
[Bejerano Fall10/11]
29
Netting Alignments
Commonly multiple mouse alignments can be found for a particular
human region, particularly for coding regions.
Net finds best match mouse match for each human region.
Highest scoring chains are used first.
Lower scoring chains fill in gaps within chains inducing a natural
hierarchy.
http://cs273a.stanford.edu
[Bejerano Fall10/11]
30
Net Focuses on Ortholog
http://cs273a.stanford.edu
[Bejerano Fall10/11]
31
Nets
• a net is a hierarchical collection of chains, with the highest-scoring
non-overlapping chains on top, and their gaps filled in where possible
by lower-scoring chains, for several levels.
• a net is single-coverage for target but not for query.
• because it's single-coverage in the target, it's no longer symmetrical.
• the netter has two outputs, one of which we usually ignore: the targetcentric net in query coordinates. The reciprocal best process uses
that output: the query-referenced (but target-centric / target singlecov) net is turned back into component chains, and then those are
netted to get single coverage in the query too; the two outputs of that
netting are reciprocal-best in query and target coords. Reciprocalbest nets are symmetrical again.
• nets do a good job of filtering out massive pileups by collapsing them
down to (usually) a single level.
[Angie Hinrichs, UCSC wiki]
http://cs273a.stanford.edu
[Bejerano Fall10/11]
32
Before and After Netting
http://cs273a.stanford.edu
[Bejerano Fall10/11]
33