Presenter 18 - Florida International University

Proteus, a Grid based Problem Solving
Environment (PSE) for Bioinformatics:
Architecture and Experiments
Presenter: Michael Robinson
Agnostic: Javier Munoz
Advanced Topics in Software Engineering CIS 6612
Florida International University
July 31, 2006
Authors: Mario Cannataro (1), Carmela Comito (2), Filippo Lo Schiavo (1), and Pierangelo Veltri (1) (February 2004)
(1) University of Magna Graecia of Catanzaro, Italy
(2) University of Calabria, Italy
Organization






Abstract
~60% is about Bioinformatics
Proteus Architecture
First Test Implementation
Results of First Test
Conclusion and Future Work
2
Abstract

Life sciences
Bioinformatics
Computer Science

Data file sizes

Computer power
3
The Partners

What are the Life Sciences

What is Bioinformatics


Other Sciences used in Bioinformatics
What is Computer Science
4
Human Genome


The sum total of DNA in an organism is its genome.
The Human Genome Project (HGP), an international
effort, began in October 1990; its completion has been
variously dated to 1999, 2003, and 2004.
(http://www.pbs.org/wgbh/nova/genome/program.html)

Project goals were to:

Determine the complete sequence of the 3 billion
DNA bases

Identify all human genes

And make them accessible for further biological
study
5
Human Genome


The bacterium E. coli and others were used to
help develop the technology and interpret
human gene function.
The Human Genome Project was sponsored by:
The U.S. Department of Energy and
The U.S. National Institutes of Health
http://www.preventiongenetics.com/edu/genetics_nutshell.htm
6
DNA (ACGT)

Humans have from 10 to 100 trillion cells.

Each human cell has about 3 billion nucleotides.

We have approximately 30,000 genes.

Of the three billion letters of DNA that we have,
only 1 to 1.5 percent is genes; the rest is "stuff".

The functions are unknown for over 50% of known genes.
7
DNA (ACGT)
Human Genome




~3,000,000,000 DNA bases
~30,000,000 bases in genes
~2,970,000,000 bases of "stuff"
Adenine (A) forms a base pair with thymine (T).
Guanine (G) forms a base pair with cytosine (C).
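The two pairing rules above are enough to derive the opposite strand of any DNA sequence. The following is a minimal Python sketch; the names `PAIR`, `complement`, and `reverse_complement` are illustrative assumptions, not taken from the presentation.

```python
# Watson-Crick base pairing as stated on the slide: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]


example = "ATGAAACGT"
paired = complement(example)
opposite = reverse_complement(example)
```

The reverse complement is the form usually needed in practice, because the two strands of the double helix are read in opposite directions.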
8
Similarities to Human DNA
Another human?
99.9% - All humans have the same genes, but some of these genes
contain sequence differences that make each person unique.
A chimpanzee?
98.5% - Chimpanzees are the closest living species to humans.
A mouse?
92.0% - All mammals are quite similar genetically.
A fruit fly?
44.0% - Studies of fruit flies have shown how shared genes govern the
growth and structure of both insects and mammals.
Yeast?
26.0% - Yeasts are single-celled organisms, but they have many
housekeeping genes that are the same as the genes in humans,
such as those that enable energy to be derived from the
breakdown of sugars.
A weed (thale cress)?
18.0% - Plants have many metabolic differences from humans. For
example, they use sunlight to convert carbon dioxide gas to
sugars. But they also have similarities in their housekeeping
genes.
9
The gene sizes


The largest known human gene is dystrophin, at 2.4 million bases.
Chromosome 21 is the smallest human chromosome.
Three copies of this autosome cause Down syndrome, the most
frequent genetic disorder associated with significant mental
retardation.
Academic groups from Germany and Japan mapped and
sequenced it; it has 33,546,361 bp of DNA.
Analysis of the chromosome revealed:
127 known genes,
98 predicted genes,
and 59 pseudogenes.

Smallest bacterial genome: Mycoplasma genitalium, at 580 kbp.
10
Bioinformatics

DNA
RNA
PROTEINS
MUTATIONS, ILLNESSES
MEDICATIONS
CLONING
11
DNA (ACGT)


Pseudomonas aeruginosa PAO1
6,264,403 bases, 5565 genes
complement(6264226..6264360)
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
12
RNA

In RNA, thymine is replaced by uracil (U).
DNA
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
RNA
6264181 gcuugucccg gucgaagucc cgacucacca cccguaccgg auaaaucaga cggucagacg
6264241 cuuacggccu uuggcgcgac gacgcgacag aaccugacgg ccguucuugg uggccauacg
6264301 ggcgcggaaa ccguggacgc gagcgcgcuu gagggugcug gguuggaaag uacguuucau
6264361 gauucgguac cuggguugac gacuugaggu cgcagugacc ccg
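The DNA-to-RNA rewriting shown on this slide is a plain character substitution, sketched here in Python. The function name `transcribe` is an illustrative assumption, and the sketch mirrors the slide's simplification (same strand, T replaced by U) rather than modelling transcription from a template strand.

```python
def transcribe(dna):
    """Rewrite a DNA string in the RNA alphabet by replacing thymine with uracil."""
    return dna.replace("t", "u").replace("T", "U")


# First two blocks of the sequence shown on the slide.
dna_blocks = ["gcttgtcccg", "gtcgaagtcc"]
rna_blocks = [transcribe(block) for block in dna_blocks]
```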
13
Amino Acids
Second base (columns, left to right): U | C | A | G
First base U (third base U, C, A, G from top to bottom):
UUU F phe Phenylalanine | UCU S ser Serine | UAU Y tyr Tyrosine | UGU C cys Cysteine
UUC F phe Phenylalanine | UCC S ser Serine | UAC Y tyr Tyrosine | UGC C cys Cysteine
UUA L leu Leucine | UCA S ser Serine | UAA Stop | UGA Stop
UUG L leu Leucine | UCG S ser Serine | UAG Stop | UGG W trp Tryptophan
First base C:
CUU L leu Leucine | CCU P pro Proline | CAU H his Histidine | CGU R arg Arginine
CUC L leu Leucine | CCC P pro Proline | CAC H his Histidine | CGC R arg Arginine
CUA L leu Leucine | CCA P pro Proline | CAA Q gln Glutamine | CGA R arg Arginine
CUG L leu Leucine | CCG P pro Proline | CAG Q gln Glutamine | CGG R arg Arginine
First base A:
AUU I ile Isoleucine | ACU T thr Threonine | AAU N asn Asparagine | AGU S ser Serine
AUC I ile Isoleucine | ACC T thr Threonine | AAC N asn Asparagine | AGC S ser Serine
AUA I ile Isoleucine | ACA T thr Threonine | AAA K lys Lysine | AGA R arg Arginine
AUG M met Methionine (Start) | ACG T thr Threonine | AAG K lys Lysine | AGG R arg Arginine
First base G:
GUU V val Valine | GCU A ala Alanine | GAU D asp Aspartic acid | GGU G gly Glycine
GUC V val Valine | GCC A ala Alanine | GAC D asp Aspartic acid | GGC G gly Glycine
GUA V val Valine | GCA A ala Alanine | GAA E glu Glutamic acid | GGA G gly Glycine
GUG V val Valine | GCG A ala Alanine | GAG E glu Glutamic acid | GGG G gly Glycine
14
Proteins (sequences)
DNA
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
RNA
6264181 gcuugucccg gucgaagucc cgacucacca cccguaccgg auaaaucaga cggucagacg
6264241 cuuacggccu uuggcgcgac gacgcgacag aaccugacgg ccguucuugg uggccauacg
6264301 ggcgcggaaa ccguggacgc gagcgcgcuu gagggugcug gguuggaaag uacguuucau
6264361 gauucgguac cuggguugac gacuugaggu cgcagugacc ccg
PROTEIN
MKRTFQPSTLKRARVHGFRARMATKNGRQVLSRRRAKGRKRLTV
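The path from the DNA window to the protein can be reproduced in a few lines of Python. This is a sketch under stated assumptions: it uses the `complement(6264226..6264360)` annotation given earlier for this gene, the standard genetic code written in the DNA alphabet, and illustrative names (`WINDOW`, `extract`, `reverse_complement`, `translate`) that do not come from the presentation or from Proteus.

```python
# Forward strand exactly as shown on the slide; the first base is position 6264181.
WINDOW_START = 6264181
WINDOW = (
    "gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg"
    "cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg"
    "ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat"
    "gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg"
).replace(" ", "")

# Standard genetic code in compact form: codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def extract(start, end):
    """Slice the window using 1-based, inclusive genome coordinates."""
    return WINDOW[start - WINDOW_START:end - WINDOW_START + 1]


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.lower().translate(str.maketrans("acgt", "tgca"))[::-1]


def translate(dna):
    """Translate codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    residues = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return "".join(residues)


gene = reverse_complement(extract(6264226, 6264360))
protein = translate(gene)
print(protein)
```

The gene lies on the complementary strand, so the slice is reverse-complemented before translation; the 135 bases give 44 amino acids plus a stop codon, matching the protein on the slide.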
15
Proteins: Pattern Matching
GHEGVGKVVKLGAGA
GHEKKGYF-DRGPSA
GHEGYGGRSRGGGYS
GHEFEGPK-CGALYI
GHELRGTTFMPALEC
G-H-E-X(2)-G-X(4,5)-[GA]
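The pattern on the last line can be checked against the five aligned sequences by converting it to a regular expression. The sketch below is an assumption-laden illustration: `prosite_to_regex` is a hypothetical helper that handles only the elements used here (single residues, `X` wildcards, bracketed alternatives, braces for exclusions, and repeat counts), and alignment gaps (`-`) are removed before matching.

```python
import re

ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regular expression."""
    pieces = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue in ("x", "X"):
            residue = "."                          # any amino acid
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # any amino acid except these
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        pieces.append(residue + repeat)
    return "".join(pieces)


ALIGNED = [
    "GHEGVGKVVKLGAGA",
    "GHEKKGYF-DRGPSA",
    "GHEGYGGRSRGGGYS",
    "GHEFEGPK-CGALYI",
    "GHELRGTTFMPALEC",
]

regex_text = prosite_to_regex("G-H-E-X(2)-G-X(4,5)-[GA]")
regex = re.compile(regex_text)
hits = [bool(regex.match(seq.replace("-", ""))) for seq in ALIGNED]
```

The variable-length element `X(4,5)` is what lets one pattern cover both the gapped and the ungapped rows of the alignment.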
16
Proteins: Structures

Chemical properties that distinguish the 20 different amino
acids cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in
the cell.
17
Reality


Somewhere in this dense chemical forest are
genes involved in deafness, Alzheimer's disease, cancer,
cataracts, etc.
But where?
This is such a maze that scientists need a map.
Out of three billion base pairs in our DNA,
just one single letter can make a difference.
18
Data Locations

GenBank in the US, 1974
http://www.ncbi.nlm.nih.gov/

1997 = 1.26 gigabases
2004 = 39 gigabases
2005 = 100 gigabases
EMBL in England, 1980
http://www.ebi.ac.uk/embl/

DDBJ in Japan, 1984
http://www.ddbj.nig.ac.jp/
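The GenBank sizes quoted above imply a doubling time that is easy to work out. This Python sketch assumes steady exponential growth between the quoted years, which is a simplification; the names `GENBANK_GIGABASES` and `doubling_time_years` are illustrative.

```python
import math

# GenBank sizes in gigabases, as quoted on the slide.
GENBANK_GIGABASES = {1997: 1.26, 2004: 39.0, 2005: 100.0}


def doubling_time_years(first_year, last_year):
    """Average doubling time, assuming exponential growth between the two years."""
    growth = GENBANK_GIGABASES[last_year] / GENBANK_GIGABASES[first_year]
    return (last_year - first_year) / math.log2(growth)


overall = doubling_time_years(1997, 2005)
print(f"Average doubling time 1997-2005: {overall:.2f} years")
```

Under that assumption the archive doubled roughly every 15 months over the period, which is the data-size pressure that motivates Grid-based processing.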
19
Some Databases

The Swiss Institute of Bioinformatics maintains the following
databases:
Ashbya Genome Database
Cancer Immunome Database
Eukaryotic Promoter Database (EPD)
GermOnline
MyHits
PROSITE
Swiss-Prot and TrEMBL
SWISS-2DPAGE
SWISS-MODEL Repository
20
Specialization

PlasmoDB
http://www.plasmodb.org/plasmo/home.jsp
Covers the parasitic eukaryote Plasmodium, the
causative agent of malaria.
21
Proteus General Architecture
22
Proteus’ Software Modules
23
Some Taxonomies of the Bioinformatics Ontology
24
Snapshot of the Ontology Browser
25
Human Protein Clustering Workflow
26
Snapshot of VEGA: Workspace 1 of the Data Selection Phase
27
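Software Installed in the Example Grid
Software Components: seqret, splitfasta, blastall, cat, Tribe-parse, Tribe-matrix, mcl, Tribe-families
Grid Nodes: Minos, k3, k4
(Installation matrix: 14 component/node cells are marked with *; which cells they mark is not recoverable from this transcript.)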
28
Snapshot of the Ontology Browser
29
Snapshot of the Ontology Browser
30
Snapshot of the Ontology Browser
31
Snapshot of VEGA: Workspace 1 of the Pre-processing Phase
32
Conclusions and Future Work
Execution Times of the Application
TribeMCL Application | 30 Proteins | All Proteins
Data Selection | 1'44" | 1'41"
Pre-Processing | 2'50" | 8h50'13"
Clustering | 1'40" | 2h50'28"
Results Visualization | 1'14" | 1'42"
Total Execution Time | 7'28" | 11h50'53"
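The phase times above can be summed to see where the run time goes. The following Python sketch transcribes the table by hand as (hours, minutes, seconds) tuples; the names `PHASES`, `seconds`, and `column_total` are illustrative assumptions.

```python
# Phase durations from the table, as (hours, minutes, seconds):
# first entry = 30 proteins, second entry = all proteins.
PHASES = {
    "Data Selection": ((0, 1, 44), (0, 1, 41)),
    "Pre-Processing": ((0, 2, 50), (8, 50, 13)),
    "Clustering": ((0, 1, 40), (2, 50, 28)),
    "Results Visualization": ((0, 1, 14), (0, 1, 42)),
}


def seconds(duration):
    """Convert an (hours, minutes, seconds) tuple to seconds."""
    hours, minutes, secs = duration
    return 3600 * hours + 60 * minutes + secs


def column_total(column):
    """Sum one column of the table (0 = 30 proteins, 1 = all proteins)."""
    return sum(seconds(times[column]) for times in PHASES.values())


small_total = column_total(0)
full_total = column_total(1)
preprocessing_share = seconds(PHASES["Pre-Processing"][1]) / full_total
print(small_total, full_total, round(preprocessing_share, 2))
```

For 30 proteins the phases sum to 448 seconds, which is the listed 7'28". For all proteins they sum to 11h44'04", a little under the listed total of 11h50'53"; the transcript does not say what accounts for the difference. Pre-processing is about three quarters of the summed time on the full data set.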
33
References
The paper cites 27 references.
34
Questions
Thank you
35