Final

Transcript Final

Advanced Bioinformatics
Medicago
Basic project:
Study gene expression under a single condition
Team members

Jente

Lifei

Yuebang

Nick
Our chosen eukaryotic organism:
Yeast
Input data



Fastq files as sequence data (the reads)
Genome.fa file as the reference genome
Genes.gtf file as the gene annotation
TopHat, Cufflinks and Cuffmerge

TopHat uses Genes.gtf, genome.* and the fastq files
to generate .bam files

Cufflinks uses the accepted_hits.bam to
generate a file called transcripts.gtf

Because the experiment was done in triplicate, we get 3
transcripts.gtf files. These are merged
with Cuffmerge.
gtf_to_fasta

With the program gtf_to_fasta we create a fasta
file that contains the sequences of all the
transcripts.

So now we have a fasta file and a gtf file from which
we extract data with the help of programs and scripts.
The Big Hash Table

From the FASTA we use/determine:
Gene_id
Sequence length
GC content
Codon usage

From the GTF we use/determine:
Gene_id
Expression level
Inter-transcript size
Intron length
Reading the gtf file:
Sort top 100 expressed genes
From the GTF we use/determine:
Gene_id
Expression level
Inter-transcript size
Intron length

Key point:

First sort by expression level, then take the top 100 genes.

Build a hash table: gene_id (key) to FPKM, intron
length and inter-transcript size (values).
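As a rough illustration of "order first, then take the top N", here is a minimal Perl sketch. It assumes the transcript lines carry gene_id and FPKM in the attribute column, in the style Cufflinks writes them. The sample lines, the gene names and the layout of the stored array (FPKM, start, end) are made up for the example and are not the project's actual script.

```perl
use strict;
use warnings;

# Hypothetical transcript lines in the tab-separated GTF layout written by Cufflinks.
my @lines = (
    qq{chrI\tCufflinks\ttranscript\t100\t900\t1000\t+\t.\tgene_id "G1"; transcript_id "G1.1"; FPKM "12.5";},
    qq{chrI\tCufflinks\ttranscript\t1500\t2400\t1000\t-\t.\tgene_id "G2"; transcript_id "G2.1"; FPKM "95.0";},
    qq{chrI\tCufflinks\ttranscript\t3000\t3600\t1000\t+\t.\tgene_id "G3"; transcript_id "G3.1"; FPKM "40.2";},
);

my %gtf;    # gene_id => [ FPKM, start, end ] (assumed layout)
for my $line (@lines) {
    my @seq = split /\t/, $line;
    next unless @seq >= 9 && $seq[2] eq 'transcript';

    # Column 9 (index 8) holds the attributes, including gene_id and FPKM.
    my ($gene_id) = $seq[8] =~ /gene_id "([^"]+)"/;
    my ($fpkm)    = $seq[8] =~ /FPKM "([^"]+)"/;
    next unless defined $gene_id && defined $fpkm;

    $gtf{$gene_id} = [ $fpkm, $seq[3], $seq[4] ];
}

# Order first, then take the top N (N = 100 in the project, 2 here).
my $top_n  = 2;
my @sorted = sort { $gtf{$b}[0] <=> $gtf{$a}[0] } keys %gtf;
my @top    = @sorted > $top_n ? @sorted[0 .. $top_n - 1] : @sorted;

print join("\t", $_, @{ $gtf{$_} }), "\n" for @top;
```

Sorting the keys by the stored FPKM and slicing afterwards keeps the hash itself untouched, so the same table can later be sorted on another column.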



Using an array: Gene_ID and FPKM are in seq[8] (the GTF attribute column)
Inter-transcript: use defined($seq2[1])
Intron length: divide into different conditions
(subroutines)
After reading the next transcript line, calculate the last intron
length
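The two distance measures can be sketched as below. This is only one plausible way to compute them, assuming 1-based inclusive GTF coordinates and exons already sorted by start position. The subroutine names and the sample coordinates are hypothetical.

```perl
use strict;
use warnings;

# Intron lengths of one transcript, given its exons as [start, end] pairs
# sorted by start (1-based, inclusive, as in GTF).
sub intron_lengths {
    my @exons = @_;
    my @introns;
    for my $i (1 .. $#exons) {
        # An intron covers the bases strictly between two consecutive exons.
        push @introns, $exons[$i][0] - $exons[ $i - 1 ][1] - 1;
    }
    return @introns;
}

# Gap between the end of one transcript and the start of the next one
# on the same chromosome; overlapping transcripts give 0.
sub inter_transcript_gap {
    my ($prev_end, $next_start) = @_;
    my $gap = $next_start - $prev_end - 1;
    return $gap > 0 ? $gap : 0;
}

my @exons   = ([100, 200], [301, 400], [451, 600]);    # hypothetical exons
my @introns = intron_lengths(@exons);
print "Intron lengths: @introns\n";
print "Inter-transcript gap: ", inter_transcript_gap(900, 1500), "\n";
```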

Important: hash table matching!

Why do we need to analyse FPKM, intron
length and inter-transcript size (correlation)?

FPKM: gene expression level
Intron length: positively related to gene expression level

Inter-transcript: gene density

Reading the fasta file

The important information is the sequence.

From this GC content, codon usage etc. can be
determined.

To couple this info to the gtf output, we analyse
the ID as well.
Reading the fasta file

The analysis was performed by reading the file
line by line, just like in the exercises.

Then the ID was extracted from the header line and
saved in a hash table.

Normally a hash table holds only one value per key,
but we managed to store arrays as the
values.
Reading the fasta file
>xxxxx 1:783285 gene_id etcetcetcetcetcetc.
AGCTGCTAGGCTGCGCATCGTGAGCTGCCTTG
%hesh
ID; seqLength, GC_content, codonUsage
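A minimal sketch of such a hash of arrays is shown below, assuming the ID is the first word after ">" in the header line. The two records and the order of the stored values (length, GC content) are invented for the example; codon usage is left out to keep it short.

```perl
use strict;
use warnings;

# Hypothetical FASTA input; real input would be read from the gtf_to_fasta output.
my $fasta_text = <<'END';
>tx1 1:100-111 hypothetical_header
ATGGCC
GCTTAA
>tx2 1:200-205 hypothetical_header
ATATAT
END

# First pass: collect the full sequence per ID, line by line.
my (%seq_of, $current);
for my $line (split /\n/, $fasta_text) {
    if ($line =~ /^>(\S+)/) {
        $current = $1;
        $seq_of{$current} = '';
    }
    elsif (defined $current) {
        $line =~ s/\s+//g;
        $seq_of{$current} .= uc $line;
    }
}

# Second pass: one array reference per ID, holding [ length, GC content ].
my %fasta;
for my $id (sort keys %seq_of) {
    my $seq    = $seq_of{$id};
    my $length = length $seq;
    my $gc     = ($seq =~ tr/GC//);    # tr counts matches without changing $seq
    $fasta{$id} = [ $length, $length ? $gc / $length : 0 ];
    printf "%s\t%d\t%.3f\n", $id, $length, $fasta{$id}[1];
}
```

Storing an array reference as the hash value is what allows several measurements per ID under a single key.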
Combine the best of both!

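The array values from the %gtf hash table are
pushed into the %fasta hash table.
For example:
# take the first two values stored for gene $i in %gtf
my $newval = $gtf{$i}[0];
my $newval2 = $gtf{$i}[1];
# append them, tab-terminated, to the array stored for $ID in %fasta
push @{ $fasta{$ID} }, "$newval\t", "$newval2\t";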

# Hash table #

In this way we obtained a table that contained:

ID; length, CUP, GC, TSp, TEp, ITL, Intron size(s)

We give options to show a variable number of
genes and to sort on specific parameters.
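The merge and the "show N genes sorted on a parameter" option might look roughly like this. The subroutine name top_genes, the column order and the numbers are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical tables: %fasta holds [ length, GC ], %gtf holds [ FPKM, inter-transcript length ].
my %fasta = ( G1 => [ 1200, 0.41 ], G2 => [ 800, 0.38 ], G3 => [ 1500, 0.45 ] );
my %gtf   = ( G1 => [ 12.5, 300 ],  G2 => [ 95.0, 120 ] );

# Matching: only IDs present in both tables receive the GTF values.
for my $id (keys %fasta) {
    next unless exists $gtf{$id};
    push @{ $fasta{$id} }, @{ $gtf{$id} };
}

# Return up to $n IDs, sorted in descending order on column $col.
sub top_genes {
    my ($table, $col, $n) = @_;
    my @ids = sort { $table->{$b}[$col] <=> $table->{$a}[$col] }
              grep { defined $table->{$_}[$col] } keys %$table;
    splice @ids, $n if @ids > $n;
    return @ids;
}

my @top = top_genes(\%fasta, 2, 2);    # column 2 is FPKM after the merge
print join("\t", $_, @{ $fasta{$_} }), "\n" for @top;
```

IDs without a match (G3 here) keep only their FASTA values and are skipped when sorting on a GTF column.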

Now Jente will unleash his package…
Package: Jente
My Package
Codon Usage Bias
R: correlations
Codon Usage Bias

Relative Synonymous Codon Usage (RSCU)

Effective Number of Codons (NC)
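For reference, RSCU is commonly defined as the observed count of a codon divided by the count expected if all synonymous codons of that amino acid were used equally. The symbols below are the usual textbook ones and are not taken from the slides: x_ij is the count of the j-th codon for amino acid i, and n_i is the number of synonymous codons for that amino acid.

```latex
\mathrm{RSCU}_{ij} = \frac{x_{ij}}{\dfrac{1}{n_i}\sum_{k=1}^{n_i} x_{ik}}
```

A value of 1 means the codon is used as often as expected under equal usage; values above 1 indicate a preferred codon.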
Codon Usage Bias
RSCU
Not in pipeline
Optional subroutine
Codon Usage Bias
$$NC = 2 + \frac{9}{F_2} + \frac{1}{F_3} + \frac{5}{F_4} + \frac{3}{F_6}$$
Only possible for sequences that use all amino acids
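As background on the terms, in the standard formulation of the effective number of codons each F_k is the average codon homozygosity over the amino acids with k synonymous codons. The notation below is an assumption based on that standard formulation, not on the slides: n is the number of codons observed for one amino acid and p_i is the frequency of its i-th synonymous codon.

```latex
F = \frac{n \sum_{i=1}^{k} p_i^{2} - 1}{n - 1}
```

The coefficients 9, 1, 5 and 3 are the numbers of amino acids with 2, 3, 4 and 6 synonymous codons in the standard genetic code, and the constant 2 accounts for Met and Trp, which have one codon each. An amino acid that is absent, or seen only once, leaves its F undefined. This is consistent with the restriction to sequences that use all amino acids.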
Codon Usage Proportion (CUP)
$$\mathrm{CUP} = \frac{\text{Number of used codons}}{\text{Number of possible codons}}$$
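A small sketch of how CUP could be computed for one sequence follows. It assumes the sequence is read in frame from its first base and that the number of possible codons is 64. The slides do not say whether stop codons are counted, so the denominator is a parameter here. The subroutine name is hypothetical.

```perl
use strict;
use warnings;

# Codon Usage Proportion: distinct codons used / number of possible codons.
# ASSUMPTIONS: reading frame starts at the first base; 64 possible codons by default.
sub codon_usage_proportion {
    my ($seq, $possible) = @_;
    $possible //= 64;
    $seq = uc $seq;

    my %seen;
    for (my $i = 0; $i + 3 <= length($seq); $i += 3) {
        my $codon = substr($seq, $i, 3);
        # Skip codons with ambiguous bases such as N.
        $seen{$codon}++ if $codon =~ /^[ACGT]{3}$/;
    }
    return scalar(keys %seen) / $possible;
}

my $cup = codon_usage_proportion('ATGGCTGCTTAA');    # ATG, GCT, GCT, TAA
printf "CUP = %.4f\n", $cup;
```

Because a longer sequence has more chances to hit each codon, CUP defined this way tends to increase with sequence length.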
R: Correlations
FPKM vs. GC
R = -0.1205
R: Correlations
R = -0.1220
Highly expressed genes
have a more extreme
codon bias
tRNAs?
R: Correlations
R = -0.1282
Highly expressed genes
are smaller
More efficient?
R: Correlations
R = 0.9588
Longer genes use more
codons...
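The correlations on these slides were computed in R. To show what the reported R values measure, here is a minimal Pearson correlation in Perl, kept in Perl only for consistency with the rest of the pipeline. It assumes R denotes the Pearson coefficient, which the slides do not state. The subroutine name and the sample vectors are made up.

```perl
use strict;
use warnings;

# Pearson correlation coefficient of two equal-length numeric vectors.
sub pearson_r {
    my ($x, $y) = @_;
    my $n = @$x;
    die "need two equal-length vectors with at least 2 points\n"
        if $n < 2 || $n != @$y;

    my ($mean_x, $mean_y) = (0, 0);
    $mean_x += $_ for @$x;
    $mean_y += $_ for @$y;
    $mean_x /= $n;
    $mean_y /= $n;

    my ($sxy, $sxx, $syy) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        my $dx = $x->[$i] - $mean_x;
        my $dy = $y->[$i] - $mean_y;
        $sxy += $dx * $dy;
        $sxx += $dx**2;
        $syy += $dy**2;
    }
    die "a vector with zero variance has no defined correlation\n"
        if $sxx == 0 || $syy == 0;

    return $sxy / sqrt($sxx * $syy);
}

# Hypothetical values, e.g. transcript length against number of used codons.
my $r = pearson_r([ 1, 2, 3, 4 ], [ 2, 4, 6, 8 ]);
printf "R = %.4f\n", $r;
```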
Visualize highly expressed genes in the interaction network
What are networks?
  A map of interactions or relationships
  A collection of nodes and links (edges)
Why networks?
  Predict protein function through identification of
  partners
  A protein's relative position in a network
  Mechanistic understanding of the gene-function
  & phenotype association
Visualize highly expressed genes in the interaction network
Interaction network (1)
Download Yeast Interactome:
http://interactome.dfci.harvard.edu/S_cerevisiae/index.php?page=download
http://www.yeastnet.org/data/
Interaction network (2)
Run Cytoscape and import the yeast interactome
Interaction network (3)
Visualize analysis of the interaction network
Interaction network (4)
Visualize the highly expressed genes in interaction network
Interaction network (5)
Interaction network (6)
Top 100 genes
interactome data
Interaction network (7)
Interaction network (8)
Interaction network (9)
Interaction network (10)
Visualize the highly expressed genes in interaction network
Interaction network (11)
Interaction network of the top 100 genes in the interactome data
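Before loading anything into Cytoscape, the overlap between the top genes and the interactome can be checked with a few lines of Perl. This sketch assumes the interactome is an edge list with two tab-separated gene identifiers per line. The gene names and edges are placeholders, not real data.

```perl
use strict;
use warnings;

# Hypothetical top-expressed genes and interactome edges (tab-separated pairs).
my %is_top = map { $_ => 1 } qw(GENE_A GENE_B GENE_C);
my @edges  = ("GENE_A\tGENE_X", "GENE_B\tGENE_C", "GENE_Y\tGENE_Z");

my (%found, @direct);
for my $edge (@edges) {
    my ($p, $q) = split /\t/, $edge;
    next unless defined $p && defined $q;

    # Record every top gene that appears in the interactome at all.
    $found{$p} = 1 if $is_top{$p};
    $found{$q} = 1 if $is_top{$q};

    # A direct interaction among top genes has both partners in the list.
    push @direct, join('-', $p, $q) if $is_top{$p} && $is_top{$q};
}

printf "%d top genes found in the interactome\n", scalar(keys %found);
print "Direct interactions among top genes: @direct\n";
```

The same two numbers (genes found, direct interactions among them) are the ones reported in the conclusion.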
Interaction network (12)
GO graph (1)
Install BiNGO
GO graph (2)
Import the top 100 expressed genes and start BiNGO
GO graph (3)
Conclusion
In the CCSB-YI1 file, 8 of the top 100 highly
expressed genes are found, and there are no direct
interactions among them in the interaction network.
The GO terms indicate that the highly expressed genes are
related to protein production.
