CSCE590/822 Data Mining Principles and Applications

CSCE555 Bioinformatics
Lecture 3 Gene Finding
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu.
Roadmap

Transcription and Translation

Structure and Organization of Genes

Gene Finding in genomes of Prokaryotic organisms

Introduction to Sequence Alignment

Summary
7/16/2015
How to Do Great Bioinformatics?
• You need to understand biology
• You need to understand the NEEDS of biologists
• You need to know how to identify the key problems in biology that have become addressable today

Transcription & Translation
Prokaryotic Cells
Eukaryotic Cells
Transcription Process: RNA
Polymerase
Translation: How Ribosome
Synthesizes Proteins
Genetic Code
Ribosomes manufacture proteins based on mRNA
instructions. Each ribosome reads mRNA, recruits tRNA
molecules to fetch amino acids, and assembles the amino
acids in the proper order.
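
As a rough illustration of the reading process described above, the sketch below walks along an mRNA string three bases at a time and looks up each codon in the standard genetic code. The names CODON_TABLE and translate are hypothetical, chosen for this example only, and the code ignores the biological details (tRNA recruitment, start-site selection) that the ribosome actually handles.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order.
# "*" marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(mrna):
    """Translate an mRNA string codon by codon until a stop codon is read."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: release the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUAA"))  # AUG -> M, GCC -> A, UAA -> stop
```

The table is built from a 64-letter string rather than typed out codon by codon, which keeps the example short; a dictionary literal would work just as well.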
Genetic Code
Gene Structure of Prokaryotic Cells
Stop codons: TAA, TGA, TAG
Genes in Eukaryotic Cells
Pre-mRNA Splicing Process
1M Alternative Splicing
Gene Info:
1) A DNA sequence coding for the pre-mRNA
2) An additional DNA code or other regulating process,
which regulates the alternative splicing.
Core Promoter Structure
Roadmap

Transcription and Translation

Structure and Organization of Genes

Gene Finding in genomes of Prokaryotic organisms

Introduction to Sequence Alignment

Summary
How to Find Genes
Start codon: ATG
Stop codons: TAA, TGA, TAG
Gene-Finding Algorithm
• Input: DNA sequences, a threshold gene length K
• Output: All possible ORF sequences of length at least K
• Procedure:
◦ Scan each of the 3 reading frames, and find subsequences that start with ATG and end with one of (TAA, TAG, TGA)
◦ Repeat the above for the complementary strand as well
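
The procedure above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the course's reference implementation: the function names are hypothetical, the threshold is counted in codons (start and stop codons included), and each ORF begins at the first ATG seen after the previous stop in that frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, read in its own 5'->3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_strand(dna, min_codons):
    """Scan the 3 reading frames of one strand for ATG ... stop stretches."""
    found = []
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    found.append(dna[start:i + 3])
                start = None
    return found


def find_orfs(dna, min_codons):
    """ORFs of at least min_codons codons on both strands."""
    dna = dna.upper()
    return (orfs_in_strand(dna, min_codons)
            + orfs_in_strand(reverse_complement(dna), min_codons))


print(find_orfs("CCATGAAATAGCC", 3))  # one ORF in frame 2 of the forward strand
```

An ATG that is never followed by an in-frame stop codon is silently dropped here; whether such open-ended stretches should be reported is a design choice the slide leaves open.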

Risk of the Simple Gene Finding Algorithm
• The identified ORFs may arise purely by chance.
• How likely is it that an ORF is the result of a random sequence?
• Significance of an ORF being a gene:

◦ We require the probability that the ORF arose from a random sequence to be less than p.
Calculating p
• 3 out of 64 codons are stop codons
• P(run of k non-stop codons) = (61/64)^k
• (61/64)^62 = 0.051

• Setting K = 64 (62 + 1 ATG + 1 stop codon) makes it unlikely (p ≈ 0.05) that an identified ORF is merely the result of a random sequence.
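
The arithmetic above is easy to check numerically. The short sketch below assumes, as the slide does, that all 64 codons are equally likely; the function names are hypothetical.

```python
import math

P_NONSTOP = 61 / 64  # probability that one random codon is not a stop codon


def prob_nonstop_run(k):
    """Probability that k consecutive random codons contain no stop codon."""
    return P_NONSTOP ** k


def min_run_length(alpha):
    """Smallest k for which a run of k non-stop codons has probability <= alpha."""
    return math.ceil(math.log(alpha) / math.log(P_NONSTOP))


print(round(prob_nonstop_run(62), 3))  # 0.051, the value quoted on the slide
print(min_run_length(0.05))            # 63
```

Note that a run of 62 codons sits just above the 5% level (0.051), and 63 is the first run length that falls strictly below it, so the slide's choice of 62 is a rounding of the 5% threshold.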

Permutation Test/Randomization Test
• A generic method to estimate the significance level (p-value)
• Example: how likely is it that a 10-codon ORF is the result of random permutation?
• Method:

◦ Randomly generate 10,000 sequences (or permute the given sequences)
◦ Draw a histogram of the ORF lengths in those sequences that contain a stop codon (the null distribution)
◦ Calculate the percentage of random ORFs that have lengths >= 10.
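
The method above might look roughly like the following in Python. Several things here are assumptions made for the sketch: the "length" of a random ORF is taken to be the number of codons read from the start of the sequence up to and including the first in-frame stop codon, random sequences use equal base frequencies unless a template sequence is supplied for shuffling, and all function names are hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_until_stop(dna):
    """Codons read from position 0 through the first in-frame stop, or None."""
    for i in range(0, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            return i // 3 + 1
    return None


def null_lengths(n_sequences=10_000, n_codons=300, template=None, seed=0):
    """Null distribution of ORF lengths from random or shuffled sequences."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_sequences):
        if template is None:
            dna = "".join(rng.choices("ACGT", k=3 * n_codons))
        else:
            bases = list(template)  # permute the given sequence
            rng.shuffle(bases)
            dna = "".join(bases)
        n = codons_until_stop(dna)
        if n is not None:  # keep only sequences that contain a stop codon
            lengths.append(n)
    return lengths


def empirical_p_value(lengths, observed):
    """Fraction of null lengths that are at least as long as the observed ORF."""
    return sum(n >= observed for n in lengths) / len(lengths)


p = empirical_p_value(null_lengths(), 10)
print(p)  # should land near (61/64)**9 under equal base frequencies
```

Passing a real genomic sequence as template preserves its base composition, which is the main reason to permute rather than generate from scratch.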
Estimating cut-off K for gene finding algorithm
• Exact theoretical calculation: sensitive to its assumptions (equal probability of all codons, etc.)
• Randomized test: do a permutation test and find a length k such that <5% of random ORFs have lengths greater than k.
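
To illustrate why the randomized estimate can differ from the theoretical one, the sketch below derives the cut-off from a simulated null distribution and lets the base composition vary. The weights parameter, the function names, and the AT-rich example composition are all assumptions introduced for illustration; with real data one would shuffle the actual genome sequence instead.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def null_orf_lengths(n_sequences=10_000, n_codons=200,
                     weights=(1, 1, 1, 1), seed=0):
    """Lengths (in codons, stop included) up to the first stop in random DNA.

    weights gives the relative frequencies of A, C, G, T (an assumption).
    """
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_sequences):
        dna = "".join(rng.choices("ACGT", weights=weights, k=3 * n_codons))
        for i in range(0, len(dna), 3):
            if dna[i:i + 3] in STOP_CODONS:
                lengths.append(i // 3 + 1)
                break
    return lengths


def length_cutoff(lengths, alpha=0.05):
    """Smallest k such that fewer than alpha of the null lengths exceed k."""
    k = 0
    while sum(n > k for n in lengths) / len(lengths) >= alpha:
        k += 1
    return k


uniform_k = length_cutoff(null_orf_lengths())
at_rich_k = length_cutoff(null_orf_lengths(weights=(3, 1, 1, 3)))
print(uniform_k, at_rich_k)  # the AT-rich cut-off comes out shorter
```

Because all three stop codons are rich in A and T, an AT-rich genome produces stop codons more often by chance, so a shorter ORF is already significant there; the equal-probability calculation would miss this.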

Sequence Alignment: the Problem
Given two sequences, measure their similarity:
  ATAACTTTAATTAA
  ATCCTTTTACTAAA
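
One common way to turn "similarity" into a number is a global alignment score computed by dynamic programming (the Needleman-Wunsch approach). The slides do not specify a scoring scheme, so the match, mismatch, and gap values below are assumptions chosen for simplicity, and the function name is hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b (score only, no traceback)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,            # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


print(global_alignment_score("ATAACTTTAATTAA", "ATCCTTTTACTAAA"))
```

Only two rows of the score table are kept in memory, which is enough when the alignment itself is not needed; recovering the aligned strings would require storing the full table and tracing back through it.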

Web Tool to Align Two Sequences

http://www.ebi.ac.uk/emboss/align
Applications of Sequence Alignment
• Prediction of function (genes, proteins, promoters) based on homology
• Database search

◦ Find sequences that are similar to our query sequence (e.g. a new gene)
• Gene finding by genome comparison
• Sequence divergence/phylogeny
• Sequence assembly
Summary
• Transcription, Translation
• Gene structures of Prokaryotic and Eukaryotic cells
• Finding genes (ORFs) for prokaryotic cells
• Sequence alignment applications

