Identification of Copy Number Variants using genome graphs.

Identification of Copy Number Variants
using Genome Graphs
Dhawal Verma
Advisor: Dr. Hesham Ali
Introduction


The genome of an organism offers great insight into its
• phylogenetic history
• interaction with the environment
• internal functions
Even within the same species, the genomes of two individuals
differ. Although the genomic variations are relatively small,
they account for the observed variations in:
• Phenotypes (Heterozygosity)
• Susceptibility to various diseases.
Motivation



Heterozygosity is of major interest to researchers of
genetic variation in natural populations.
It refers to the state of having different alleles at one
or more corresponding chromosomal loci.
It is often one of the first "parameters" that one
presents in a data set. It can tell us a great deal
about the structure and even history of a population.
Motivation

Role in diseases
• SVs and CNVs have been associated with susceptibility
or resistance to disease.
• Gene copy number can be elevated in cancer cells.
• Copy number variation has also been associated
with autism, schizophrenia and idiopathic learning
disability.
Visualization of Genome
Genome = A Book
Written in 4 letters of
nucleotides – A T G C
23 Chromosomes = 23
Chapters
Genes = Stories in each
chapter
Genomic Structural Variation


Every genome differs from every other. However, just as
different books draw their words from a known dictionary,
different genomes draw on a shared set of genes.
Just as the position of a word in a sentence changes its
meaning, the position of the same gene in a genome can
produce a distinct feature, and this causes variation
between genomes.
Genomic Structural Variation



Until fairly recently, single nucleotide polymorphisms
(SNPs) were thought to be the main source of variation in
the human genome.
SNPs are variations that involve a change in just one
nucleotide.
THE RAT CAN RUN FAST
THE CAT CAN RUN FAST
High-throughput genome scanning technologies revealed
that there are other forms of genomic variation beyond
single base-pair substitutions.
Structural Variants
Structural variant is an umbrella term
for a group of genomic
alterations involving segments of DNA
typically larger than 1 kb.
The structural variation may be
•Quantitative (CNVs – indels and duplications)
•Positional (translocations)
•Orientational (inversions).
Copy Number Variants (CNVs)


CNVs are defined as chromosomal segments, at
least 1000 bases (1 kb) in length that vary in
number of copies from human to human.
CNVs are large chunks of DNA that are deleted,
copied, flipped or otherwise rearranged in
combinations that can be unique for each individual.
YOU CAN RUN FAST
YOU CAN RUN RUN RUN FAST
SNP v CNV


SNPs typically occur as two alleles, while
approximately 5% of the human genome is
defined as structurally variant in the normal
population, involving more than 800 independent
genes.
Of the total amount of variation between two
human individuals
CNVs + SVs >>> SNPs
Primitive methods for detection of CNVs
1. Whole-genome array comparative genomic
hybridization (aCGH), which tests the relative
frequencies of probe DNA segments between two
genomes
2. SNP arrays, which measure the intensity of probe
signals at known SNP loci.
Limitations of the methods

The size and breakpoint resolution of any prediction
is correlated with the density of the probes on the
array, which is limited by
• the density of the array itself (for aCGH)
• the density of known SNP loci (for SNP arrays).

The limited resolution of arrays for high copy count
segments and the lack of unique probes make it
difficult to identify CNVs in repetitive regions.
Research Proposal


An effective computational method for the identification
of Copy Number Variants in genomes.
Model


Next generation sequencing data can be modeled in a graph
that we call a Genome Graph
Algorithm

By effectively mapping the reference genome graph against the
donor graph and combining two existing methods, Depth of
Coverage (DOC) and Paired End Mapping (PEM), we can
overcome their individual limitations and detect
CNVs with higher sensitivity and specificity.
Research Proposal




Our literature survey indicates that the PEM method is used
mainly for detecting SVs and the DOC method for CNVs.
CNVs are generally considered a subset of SVs.
By integrating the two methods we can use PEM signatures
at a higher magnification level.
The complexity can also be reduced by using bi-directional genome graphs.
Genome Graphs

With the advent of Next Generation Sequencing data that
provides as much as 40x coverage for a human genome, a
special class of graphs known as Genome graphs emerged.

The vertices represent either the reads or their substrings
(k-mers, strings of length k over the letters A, T, G and C).

The edges represent overlaps between them (the prefix of one
read is the suffix of the other).
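
The vertex and edge definitions above can be sketched in a few lines of Python. This is a minimal de Bruijn-style illustration, not the construction used in this work: the function name build_kmer_graph, the choice of k and the toy reads are all assumptions made for the example.

```python
def build_kmer_graph(reads, k):
    """Return {kmer: set of successor kmers}.

    Vertices are the k-mers found in the reads. An edge u -> v is added when
    v follows u in some read, so the last k-1 letters of u (its suffix) are
    the first k-1 letters of v (its prefix).
    """
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # register every k-mer as a vertex
        for u, v in zip(kmers, kmers[1:]):
            graph[u].add(v)  # consecutive k-mers overlap by k-1 letters
    return graph


# Two toy reads that overlap on "GCA" (hypothetical data).
graph = build_kmer_graph(["ATGCA", "GCATT"], k=3)
```

Because both reads contain the k-mer GCA, they are joined through that shared vertex, which is how overlapping reads end up connected in the graph.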
Genome Graphs
•A genome graph can be unidirectional or bi-directional.
•A bi-directional genome graph models the double-strandedness of DNA.
•Bi-directional graphs help reduce the complexity of the algorithm:
in a unidirectional graph two “complementary” walks must be
searched, while in a bi-directional graph a single walk yields
both the sequence and its complement.
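
One common way to capture double-strandedness is to treat a k-mer and its reverse complement as the same vertex, so that one walk spells both strands. The sketch below assumes that representation; the helper names (reverse_complement, canonical, walk_both_strands) are hypothetical and are not taken from the slides.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement,
    so both strands map to the same vertex."""
    return min(kmer, reverse_complement(kmer))


def walk_both_strands(path_kmers):
    """Spell a walk of overlapping k-mers and derive the other strand
    from the same walk, instead of searching a second walk."""
    forward = path_kmers[0] + "".join(kmer[-1] for kmer in path_kmers[1:])
    return forward, reverse_complement(forward)


forward, reverse = walk_both_strands(["ATG", "TGC", "GCA"])
```

Here a single walk over three vertices produces ATGCA and its complement strand TGCAT, where a unidirectional graph would need a second, separate walk for the complement.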
Depth of Coverage method

Depth of Coverage
• Depth of coverage (DOC) is the density of reads mapping to a region.
• Several recent studies have shown that by comparing the
DOC within a sliding window of the genome to what is
expected in the reference genome, it is possible to detect
changes in copy number.
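
The sliding-window comparison can be illustrated as follows. This is a simplified sketch under stated assumptions: reads are given only as start positions with a fixed length, and the gain and loss ratio thresholds (1.5 and 0.5) are arbitrary illustrative values, not parameters from any published DOC method.

```python
def window_depths(read_starts, read_length, genome_length, window, step):
    """Mean read depth in each window of the genome."""
    depth = [0] * genome_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_length)):
            depth[pos] += 1  # every base covered by the read counts once
    return [sum(depth[i:i + window]) / window
            for i in range(0, genome_length - window + 1, step)]


def call_windows(depths, expected, gain=1.5, loss=0.5):
    """Label each window by the ratio of observed to expected depth.
    The thresholds are assumptions chosen for illustration."""
    calls = []
    for observed in depths:
        ratio = observed / expected
        if ratio >= gain:
            calls.append("gain")
        elif ratio <= loss:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls


depths = window_depths([0, 0, 2], read_length=2, genome_length=4, window=2, step=2)
calls = call_windows([10, 20, 4, 11], expected=10)
```

A window with twice the expected depth is flagged as a candidate gain and one with less than half as a candidate loss. Real data would also need correction for sampling bias before such ratios are trusted.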


Limitations
• Very complicated
• Difficult to separate true changes in copy number from
segments that are over- or under-sampled by the sequencing
technology.

Depth of Coverage
In a genome graph, an increase/decrease in number of vertices
between two known vertices in the reference genome gives an
indication of CNV.
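
A toy version of this vertex-count comparison is sketched below. It assumes, purely for illustration, that the reference and donor are each available as a walk (an ordered list of vertex labels) and that the two anchor vertices appear in both. The function names are hypothetical.

```python
def vertices_between(walk, start, end):
    """Count the vertices strictly between two anchor vertices on a walk."""
    i = walk.index(start)
    j = walk.index(end, i + 1)  # first occurrence of `end` after `start`
    return j - i - 1


def copy_number_hint(ref_walk, donor_walk, start, end):
    """Compare the number of vertices between the same two anchors
    in the donor walk and in the reference walk."""
    diff = (vertices_between(donor_walk, start, end)
            - vertices_between(ref_walk, start, end))
    if diff > 0:
        return "gain"
    if diff < 0:
        return "loss"
    return "neutral"


reference = ["A", "B", "C", "D"]
duplicated = ["A", "B", "C", "B", "C", "D"]  # B-C segment traversed twice
deleted = ["A", "D"]                          # B-C segment missing

hint_dup = copy_number_hint(reference, duplicated, "A", "D")
hint_del = copy_number_hint(reference, deleted, "A", "D")
```

More vertices between the anchors in the donor than in the reference hints at a gain, fewer at a loss. The count is only an indication and would need supporting evidence such as read depth.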
Paired End Mapping method

PEM method:



Two paired reads (called mate pairs) are generated at an
approximately known distance in the donor genome.
The reads are mapped to a reference genome, and mate pairs
mapping at a distance significantly different from the expected
length (termed discordant) suggest structural variants.
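
The discordance test can be sketched as a simple distance check. The tolerance parameter and the function name classify_pair are assumptions for this example; in practice the cutoff would be derived from the observed insert-size distribution.

```python
def classify_pair(mapped_distance, expected, tolerance):
    """Classify a mate pair by the distance between its mapped reads.

    A pair mapping farther apart on the reference than expected suggests
    sequence present in the reference but missing in the donor (candidate
    deletion). A pair mapping closer together suggests extra sequence in
    the donor (candidate insertion).
    """
    if mapped_distance > expected + tolerance:
        return "candidate deletion"
    if mapped_distance < expected - tolerance:
        return "candidate insertion"
    return "concordant"


# Hypothetical library: expected mate distance 300 bases, tolerance 50.
labels = [classify_pair(d, expected=300, tolerance=50) for d in (500, 120, 320)]
```

Only the first two pairs are discordant. A single discordant pair is weak evidence, so callers usually require several pairs supporting the same event.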
Limitations

Difficulty in detecting larger insertions and variation within areas
of segmental duplications
PEM signatures in Genome Graphs
PEM signatures v DOC signatures



In contrast to most PEM signatures, DOC signatures
can be used to detect very large events.
The larger the event, the stronger the signature.
However, they are not able to accurately identify
smaller events that PEM signatures, even with low
coverage, are able to detect.
Next Steps:



While inversions do not cause any changes in copy number, an area
that is deleted (SV) will correspond to a loss (CNV). Similarly, a
region containing a tandem duplication will be annotated as both
having an insertion (SV) and as exhibiting a gain (CNV). In this way,
any PEM method for SV detection can be viewed as a method for
detecting a subset of CNVs.
The Depth of Coverage method is used extensively for detecting CNVs,
while the PEM technique is mainly used for detecting SVs.
Our hypothesis is that PEM techniques can be used to improve both
the sensitivity and specificity of depth of coverage based methods
using a probabilistic graph-theoretic framework.
