Identification of the power-law component in human transcriptome

IDENTIFICATION
OF THE POWER-LAW COMPONENT
IN HUMAN TRANSCRIPTOME
Vasily V. Grinev
Associate Professor
Department of Genetics
Faculty of Biology
Belarusian State University
Minsk, Republic of Belarus
DIVERSITY OF SPLICE SITES
IN HUMAN GENOME/TRANSCRIPTOME
A graphical representation of the traditional (linear) transcriptional model (A)
and of the splice site (B) and exon (C) splicing graph models
of human RCAN3 gene organisation
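As a rough illustration of the exon splicing graph model named in the caption, the sketch below builds a directed graph in which exons are nodes and each observed exon-exon junction is an edge. The exon labels and transcripts are hypothetical (they are not the real RCAN3 annotation), and counting a node's degree as its number of distinct incoming plus outgoing junctions is an assumption, since the slides do not define the term.

```python
def exon_splicing_graph(transcripts):
    """Map each exon to the set of exons it is spliced to (directed edges)."""
    graph = {}
    for exons in transcripts:
        for exon in exons:
            graph.setdefault(exon, set())
        # Consecutive exons in a transcript are joined by one splicing event.
        for upstream, downstream in zip(exons, exons[1:]):
            graph[upstream].add(downstream)
    return graph


def node_degrees(graph):
    """Degree of a node = distinct outgoing + distinct incoming edges (assumed definition)."""
    degree = {node: len(targets) for node, targets in graph.items()}
    for targets in graph.values():
        for node in targets:
            degree[node] += 1
    return degree


# Hypothetical transcripts given as ordered lists of exon labels.
transcripts = [["E1", "E2", "E3"], ["E1", "E3"], ["E1", "E2", "E4"]]
graph = exon_splicing_graph(transcripts)
degrees = node_degrees(graph)
```

A splice site graph could be built the same way by using donor and acceptor sites as nodes instead of whole exons.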
DISCRETE POWER-LAW MODEL
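Important equations
The probability mass function
$p(x) = C x^{-\alpha}$
Normalization constant
$C = \frac{1}{\zeta(\alpha, x_{\min})}$
Hurwitz zeta function
$\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}$
The cumulative distribution function
$P(x) = 1 - \frac{\zeta(\alpha, x)}{\zeta(\alpha, x_{\min})}$
The complementary cumulative distribution function
$1 - P(x) = \frac{\zeta(\alpha, x)}{\zeta(\alpha, x_{\min})}$
Determination of parameters
Determination of the value of the scaling parameter $\alpha$ by the maximum likelihood estimator for $x_{\min} \geq 6$
$\alpha \cong 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min} - \frac{1}{2}} \right]^{-1}$
Determination of the value of the scaling parameter $\alpha$ by direct numerical maximization of the likelihood function itself for $x_{\min} < 6$
$\mathcal{L}(\alpha) = -n \ln \zeta(\alpha, x_{\min}) - \alpha \sum_{i=1}^{n} \ln x_i$
Estimation of the lower bound $x_{\min}$ by the Kolmogorov-Smirnov statistic
$D = \sup_{x} \left| P(x) - P_{emp}(x) \right|$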
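The two ways of estimating the scaling parameter on this slide can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the Hurwitz zeta function is approximated by a truncated sum with an Euler-Maclaurin tail correction, the likelihood is maximized by a golden-section search, and the function names, the number of summed terms and the search interval [1.01, 6.0] are all hypothetical choices, not the authors' implementation.

```python
import math


def hurwitz_zeta(alpha, q, terms=1000):
    """Approximate zeta(alpha, q) = sum_{n>=0} (n + q)^(-alpha) for alpha > 1."""
    head = sum((n + q) ** -alpha for n in range(terms))
    a = terms + q
    # Euler-Maclaurin tail: integral of the remainder plus half the first omitted term.
    return head + a ** (1.0 - alpha) / (alpha - 1.0) + 0.5 * a ** -alpha


def pmf(x, alpha, xmin):
    """p(x) = C x^(-alpha) with C = 1 / zeta(alpha, xmin)."""
    return x ** -alpha / hurwitz_zeta(alpha, xmin)


def alpha_approx(data, xmin):
    """Closed-form approximate estimator (the slide uses it for xmin >= 6)."""
    tail = [x for x in data if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / (xmin - 0.5)) for x in tail)


def log_likelihood(alpha, n, log_sum, xmin):
    """L(alpha) = -n ln zeta(alpha, xmin) - alpha * sum(ln x_i)."""
    return -n * math.log(hurwitz_zeta(alpha, xmin)) - alpha * log_sum


def alpha_numeric(data, xmin, lo=1.01, hi=6.0, tol=1e-6):
    """Maximize L(alpha) on [lo, hi] by golden-section search (assumed bounds)."""
    tail = [x for x in data if x >= xmin]
    n, log_sum = len(tail), sum(math.log(x) for x in tail)
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if log_likelihood(c, n, log_sum, xmin) > log_likelihood(d, n, log_sum, xmin):
            hi = d  # the maximum lies in [lo, d]
        else:
            lo = c  # the maximum lies in [c, hi]
    return 0.5 * (lo + hi)
```

The golden-section search relies on the log-likelihood having a single maximum in the search interval, so the interval should be widened if the estimate lands on one of its ends.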
Clauset,A., Shalizi,C.R., Newman,M.E.J. (2009) Power-law distributions in empirical data. SIAM Rev.,
51, 661-703.
Newman,M.E.J. (2005) Power laws, Pareto distributions and Zipf’s law. Contemp. Phys., 46, 323-351.
Goldstein,M.L., Morris,S.A., Yen,G.G. (2004) Problems with fitting to the power-law distribution. Eur.
Phys. J. B, 41, 255-258.
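The Kolmogorov-Smirnov selection of the lower bound on this slide might be sketched as below: for each candidate $x_{\min}$, fit $\alpha$ on the tail and keep the candidate with the smallest distance $D$. The distance is computed here on the complementary form Pr(X >= x), which gives the same $D$ as comparing the cumulative functions. The helper names, the `min_tail` cut-off and the search bounds are assumptions made for the example.

```python
import math
from bisect import bisect_left


def hurwitz_zeta(alpha, q, terms=200):
    """Truncated sum for zeta(alpha, q) with an Euler-Maclaurin tail correction."""
    head = sum((n + q) ** -alpha for n in range(terms))
    a = terms + q
    return head + a ** (1.0 - alpha) / (alpha - 1.0) + 0.5 * a ** -alpha


def fit_alpha(tail, xmin, lo=1.01, hi=6.0, tol=1e-5):
    """Golden-section maximization of L(alpha) = -n ln zeta - alpha * sum(ln x)."""
    n, log_sum = len(tail), sum(math.log(x) for x in tail)

    def ll(alpha):
        return -n * math.log(hurwitz_zeta(alpha, xmin)) - alpha * log_sum

    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c, d = hi - ratio * (hi - lo), lo + ratio * (hi - lo)
        if ll(c) > ll(d):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)


def ks_distance(tail, alpha, xmin):
    """D = sup_x |P(x) - P_emp(x)|, evaluated on Pr(X >= x) for integer x."""
    ordered = sorted(tail)
    n = len(ordered)
    z_min = hurwitz_zeta(alpha, xmin)
    d = 0.0
    for x in set(ordered):
        # Between observed values the gap peaks at x or x + 1, so check both.
        for point in (x, x + 1):
            model = hurwitz_zeta(alpha, point) / z_min
            empirical = (n - bisect_left(ordered, point)) / n
            d = max(d, abs(model - empirical))
    return d


def select_xmin(data, candidates=None, min_tail=50):
    """Return (xmin, alpha, D) for the candidate lower bound with the smallest D."""
    if candidates is None:
        candidates = sorted(set(data))
    best = None
    for xmin in candidates:
        tail = [x for x in data if x >= xmin]
        if len(tail) < min_tail:
            continue  # too few observations for a meaningful fit
        alpha = fit_alpha(tail, xmin)
        d = ks_distance(tail, alpha, xmin)
        if best is None or d < best[2]:
            best = (xmin, alpha, d)
    return best
```

Scanning every distinct value as a candidate can be slow for large data sets, which is why the sketch accepts an explicit `candidates` argument.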
COMPETITIVE STATISTICAL MODELS
The probability mass functions of competitive statistical models
1) Power-law
$p(x) = C x^{-\alpha}$
2) Truncated power-law
$p(x) = C x^{-\alpha} e^{-\lambda x}$
3) Yule-Simon
$p(x) = C \frac{\Gamma(x)}{\Gamma(x + \alpha)}$
4) Exponential
$p(x) = C e^{-\lambda x}$
5) Stretched exponential
$p(x) = C x^{\beta - 1} e^{-\lambda x^{\beta}}$
6) Log-normal
$p(x) = C \frac{1}{x} e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}$
7) Poisson
$p(x) = C \frac{\mu^x}{x!}$
Comparison of alternative statistical models
1) Log-likelihood ratio test
$R = \frac{L_1}{L_2} = \prod_{i=1}^{n} \frac{p_1(x_i)}{p_2(x_i)}$
Vuong,Q.H. (1989) Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57, 307-333.
2) Akaike information criterion
$AIC = 2k - 2 \ln L$
Akaike,H. (1974) A new look at the statistical model identification. IEEE Transact. Automat. Control, 19, 716-723.
3) Bayesian information criterion
$BIC = -2 \ln L + k \ln(n)$
Schwarz,G.E. (1978) Estimating the dimension of a model. Ann. Stat., 6, 461-464.
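The three comparison tools listed on this slide can be illustrated with a short Python sketch. It works with the logarithm of the ratio $R$ and attaches a two-sided p-value using the normal approximation of the Vuong test. The power-law versus discrete exponential pairing, the closed-form exponential fit and all function names are assumptions chosen for the example, and the p-value formula presumes the pointwise log-ratios are not all identical.

```python
import math


def aic(log_l, k):
    """AIC = 2k - 2 ln L."""
    return 2 * k - 2 * log_l


def bic(log_l, k, n):
    """BIC = -2 ln L + k ln(n)."""
    return -2 * log_l + k * math.log(n)


def vuong_test(log_p1, log_p2):
    """Return ln R = sum(ln p1 - ln p2) and a two-sided p-value."""
    n = len(log_p1)
    diffs = [a - b for a, b in zip(log_p1, log_p2)]
    log_r = sum(diffs)
    mean = log_r / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    # Normal approximation: a small p-value means the sign of ln R is reliable.
    p_value = math.erfc(abs(log_r) / (math.sqrt(2 * n) * sigma))
    return log_r, p_value


def hurwitz_zeta(alpha, q, terms=1000):
    """Truncated sum for zeta(alpha, q) with an Euler-Maclaurin tail correction."""
    head = sum((n + q) ** -alpha for n in range(terms))
    a = terms + q
    return head + a ** (1.0 - alpha) / (alpha - 1.0) + 0.5 * a ** -alpha


def power_law_logs(data, alpha, xmin):
    """Pointwise ln p(x) for p(x) = x^(-alpha) / zeta(alpha, xmin)."""
    log_c = -math.log(hurwitz_zeta(alpha, xmin))
    return [log_c - alpha * math.log(x) for x in data]


def exponential_logs(data, xmin):
    """Pointwise ln p(x) for p(x) = C e^(-lambda x), x >= xmin, lambda at its MLE."""
    mean_excess = sum(x - xmin for x in data) / len(data)  # must be > 0
    lam = math.log(1.0 + 1.0 / mean_excess)
    log_c = math.log(1.0 - math.exp(-lam)) + lam * xmin
    return [log_c - lam * x for x in data]
```

A positive ln R favours the first model, and lower AIC or BIC values favour a model after penalizing its number of parameters `k`.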
STATISTICAL ANALYSIS CONFIRMS THE PRESENCE OF A POWER-LAW
COMPONENT IN THE TRANSCRIPTOME OF KASUMI-1 CELLS
USAGE OF EXONS IN ALTERNATIVE SPLICING
FOLLOWS A POWER-LAW IN HUMAN TRANSCRIPTOME
Maximum values of splicing degrees
from different models of human genes
ARE THERE ANY SPECIFIC FEATURES ASSOCIATED
WITH DIFFERENT CLASSES OF SPLICE SITES?
Every splice site was annotated with sequence, sequence-related,
functional and structural features extracted from four
types of genomic/RNA elements
RANDOM FOREST BASED DATA MINING
A small set of features is sufficient to distinguish between
two classes of splice sites in Kasumi-1 cells
Iterative removal of misclassified splice sites
leads to high classification accuracy
About half of the misclassified splice sites
can be explained in several different ways
MANY THANKS TO THE MEMBERS OF OUR TEAM:
Ilia M. Ilyushonak
Dr. Petr V. Nazarov
Dr. Laurent Vallar
Northern Institute for
Cancer Research
Prof. Olaf Heidenreich
THANK YOU FOR YOUR ATTENTION!