Bioinformatics Computing

Department of Bioinformatics
Government Post Graduate College, Mandian, Abbottabad
Sajid Khan
BI-302 Bioinformatics Computing-I
3+1
Prerequisite: Programming Fundamentals
Specific objectives of the course:
This course aims to introduce the concepts of data representation, searching, security and ownership, and to develop techniques for pattern matching and recognition and their applications in bioinformatics.
Course Outline:
Databases: data management, networks, geographical scope, communications models, transmission technology, protocols, bandwidth, topology, hardware, contents, security, ownership, implementation. Search engines: search process, search engine technology, searching and information theory, computational methods, knowledge management, data, sequence and structure visualization, data mining methods and technology, pattern recognition and discovery, pattern matching, dot matrix analysis, substitution matrices, dynamic programming, scripting.
Lab Outline:
Simulation of various bioinformatics entities, application of various bioinformatics methods, and the scripting languages Python, Perl and PHP and their applications in bioinformatics.
Recommended Books:
Latest editions of the following books:
1. “Bioinformatics Computing”, Bryan Bergeron, Pearson Education.
2. “Methods in Biotechnology and Bioengineering”, Vyas S.P. and Kohli D.V.

Introduction to Bioinformatics
Bioinformatics Computing
The Killer Application
Issues of Personalized Drugs
Parallel Universes
Turing Machine/Model
Information Theory
Central Dogma
Process of RNA Synthesis
RNA-Protein Codon Transcription Wheel
From Data to Knowledge
Bioinformatics is a science that involves collecting,
manipulating, analyzing, and transmitting huge quantities
of data, and uses computers whenever appropriate.
Bioinformatics Computing combines practical insight for
assessing bioinformatics technologies, practical guidance
for using them effectively, and intelligent context for
understanding their rapidly evolving roles.
The Killer Application
The most commonly cited "killer app" of biotech is personalized medicine: the just-in-time delivery of medications (popularly called "designer drugs") tailored to the patient's condition.
Issues of Personalized Drugs
High-throughput screening
The use of affordable, computer-enabled microarray technology to determine the patient's genetic profile. The issue here is affordability, in that microarrays cost tens of thousands of dollars.
Medically relevant information gathering
Databases on gene expression, medical relevance of signs and symptoms,
optimum therapy for given diseases, and references for the patient and
clinician must be readily available.
Custom drug synthesis
The just-in-time synthesis of patient-specific drugs, based on the patient's
medical condition and genetic profile, presents major technical as well as
political, social, and legal hurdles.
Parallel Universes
One of the major challenges faced by bioinformaticists
is keeping up with the latest techniques and discoveries
in both molecular biology and computing.
Two key events in the late 1920s were Alexander
Fleming's discovery of penicillin and Vannevar Bush's
Product Integraph, a mechanical analog computer that
could solve simple equations.
In the 1930s, Alan Turing, the British mathematician,
devised his Turing model, upon which all modern
discrete computing is based.
Turing Machine/Model
The Turing model defines the fundamental properties of a
computing system: a finite program, a large database, and
a deterministic, step-by-step mode of computation.
The Turing Machine, which can simulate any computing
system, consists of three basic elements: a control unit, a
tape, and a read-write head. The read-write head moves
along the tape and transmits information to and from the
control unit.
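The three elements named above (control unit, tape, read-write head) can be sketched in a few lines of Python. This is only an illustration: the transition table, the state names `scan` and `halt`, and the bit-flipping task are assumptions chosen for the example, not part of the course text.

```python
from collections import defaultdict


def run_turing_machine(program, tape_input, start="scan", halt="halt",
                       blank="_", max_steps=1000):
    """Simulate a one-tape Turing machine.

    program maps (state, symbol) -> (new_state, symbol_to_write, move),
    where move is -1 (left), 0 (stay) or +1 (right).
    """
    # The tape is unbounded: cells that were never written read as blank.
    tape = defaultdict(lambda: blank, enumerate(tape_input))
    state, head, steps = start, 0, 0
    while state != halt:
        if steps >= max_steps:
            raise RuntimeError("step limit reached without halting")
        symbol = tape[head]                            # head reads the tape
        state, write, move = program[(state, symbol)]  # control unit decides
        tape[head] = write                             # head writes the tape
        head += move                                   # head moves along the tape
        steps += 1
    if not tape:
        return ""
    cells = [tape[i] for i in range(min(tape), max(tape) + 1)]
    return "".join(cells).strip(blank)


# Hypothetical finite program: flip every bit, halt at the first blank cell.
FLIP_BITS = {
    ("scan", "0"): ("scan", "1", +1),
    ("scan", "1"): ("scan", "0", +1),
    ("scan", "_"): ("halt", "_", 0),
}

print(run_turing_machine(FLIP_BITS, "0110"))  # 1001
```

The dictionary plays the role of the finite program, and the loop is the deterministic, step-by-step mode of computation.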
Information Theory
Shannon's model of a communications system includes five components: an information source, a transmitter, the medium, a receiver, and a destination. The amount of information that can be transferred from information source to destination is a function of the strength of the signal relative to that of the noise generated by the noise source.
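One standard way to make this dependence concrete is the Shannon-Hartley theorem, C = B log2(1 + S/N), which gives the capacity C in bits per second of a channel with bandwidth B in hertz and signal-to-noise power ratio S/N. The slide does not state this formula; it is brought in here as an illustration, and the numbers in the example are arbitrary.

```python
import math


def channel_capacity(bandwidth_hz, signal_power, noise_power):
    """Shannon-Hartley capacity in bits per second.

    signal_power and noise_power must be in the same (linear) units.
    """
    if bandwidth_hz < 0 or signal_power < 0 or noise_power <= 0:
        raise ValueError("bandwidth and signal must be >= 0, noise must be > 0")
    return bandwidth_hz * math.log2(1 + signal_power / noise_power)


# With equal signal and noise power, each hertz of bandwidth carries one bit/s.
print(channel_capacity(3000, 1, 1))  # 3000.0

# A stronger signal relative to the noise raises the capacity.
print(channel_capacity(3000, 3, 1) > channel_capacity(3000, 1, 1))  # True
```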
Central Dogma
The Central Dogma of Molecular Biology. DNA is transcribed to messenger RNA in the cell nucleus, which is in turn translated to protein in the cytoplasm. The Central Dogma, shown here from a structural perspective, can also be depicted from an information flow perspective.
Process of RNA Synthesis
Messenger RNA (mRNA) Synthesis. DNA is transcribed to nuclear RNA (nRNA), which is in turn processed to mature mRNA in the nucleus. Maturation involves discarding the junk nucleotide sequences (introns) that interrupt the sequences that will eventually be involved in translation (exons).
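The maturation step can be sketched in Python as follows. The toy sequence and the exon coordinates are invented for illustration; in a cell, splice sites are recognized by the splicing machinery rather than supplied as a list. As elsewhere in these slides, the sequence is written in the DNA alphabet (T rather than U).

```python
def splice(pre_mrna, exons):
    """Join the exon segments of a transcript and drop everything else.

    exons is a list of (start, end) pairs, 0-based with end excluded.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Toy transcript: exon 1, one intron, exon 2.
exon_1 = "ATGGCC"
intron = "GTAAGTTTTCAG"
exon_2 = "TTTTAA"
pre_mrna = exon_1 + intron + exon_2

# Hypothetical exon coordinates for the toy transcript above.
exons = [(0, 6), (18, 24)]

print(splice(pre_mrna, exons))  # ATGGCCTTTTAA
```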
RNA-Protein Codon Transcription Wheel.
The 64 possible codons represent the 20 common amino acids, as well as one start (ATG) and three stop (TAG, TAA and TGA) markers. Redundancies normally occur in the last nucleotide of the three-letter alphabet.
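The counts quoted above can be checked with a short Python sketch that builds the standard genetic code as a lookup table and uses it to translate a sequence. The compact 64-letter string is a common way of writing the standard table in T, C, A, G order; the function name `translate` and the example sequence are assumptions made for this illustration.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acid codes for TTT, TTC, TTA, TTG, TCT, ... ("*" marks stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


stops = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
print(len(CODON_TABLE))                    # 64
print(stops)                               # ['TAA', 'TAG', 'TGA']
print(len(set(CODON_TABLE.values())) - 1)  # 20 amino acids (stop excluded)

# Redundancy in the last nucleotide: all four GC? codons give alanine (A).
print({CODON_TABLE["GC" + b] for b in BASES})  # {'A'}

print(translate("ATGGCCTTTTAA"))           # MAF
```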
From Data to Knowledge
In viewing the Central Dogma as an information flow process, it's useful to distinguish between data, information, and metadata. For our purposes, the following definitions and concepts apply:
Data are numbers or other identifiers derived from observation, experiment, or calculation.
Information is data in context: a collection of data and associated explanations, interpretations, and other material concerning a particular object, event, or process.
Metadata is data about the context in which information is used, such as descriptive summaries and high-level categorization of data and information.
From Data to Knowledge
Data (from direct observation)
Patient Age: 5. Physical Exam Findings: Freckling in the
armpits and masses on and just below the surface of
the skin.
Information (from a molecular disease database)
Neurofibromatosis is a genetic disorder causing tumors
to form on nerve tissue anywhere in the body. The
pattern of inheritance is autosomal dominant.
Metadata (from an online publications database)
The incidence of neurofibromatosis is about one out of
every 2,500 people worldwide. It is associated with
difficulty seeing, hearing, and in some cases (NF2), with
paralysis and early death. In contrast, type 1 (NF1) is
more of a cosmetic disorder.
From Data to Knowledge
Information (from an online molecular disease database)
The NF2 gene has been mapped to chromosome 22 and is thought to be a so-called "tumor suppressor gene."
Metadata (from an online publication database)
The pattern of inheritance is autosomal dominant, caused by a spontaneous mutation in the egg or sperm before fertilization.
Visualization of NF2 Gene on Chromosome 22. As it would appear through the National Center for Biotechnology Information's Web-based Map Viewer program, the NF2 gene appears at position 22q12 on chromosome 22.
Convergence
In automated gene sequencing, purified genomic or complementary DNA is first fragmented by restriction enzymes, and these fragments are separated by size on a gel.
This is followed by isolating single fragments and using the Sanger chain-termination method to sequence each fragment individually with chain-terminating ddNTPs (dideoxynucleoside triphosphates) labeled with fluorochromes according to the base present.
For example, a green fluorochrome is typically used for Adenine (A), red for Thymine (T), blue for Cytosine (C), and yellow for Guanine (G).
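The colour-to-base assignment can be sketched as a simple lookup in Python. The list of detected colours below is made-up input standing in for the fluorescence peaks read off the gel; real base-calling software also has to deal with noisy and overlapping peaks, which this sketch ignores.

```python
# Dye colours as listed on the slide.
DYE_TO_BASE = {"green": "A", "red": "T", "blue": "C", "yellow": "G"}


def call_bases(peak_colours):
    """Turn detected dye colours, ordered from shortest fragment to longest,
    into a base sequence. Unrecognized colours are reported as N."""
    return "".join(DYE_TO_BASE.get(colour, "N") for colour in peak_colours)


# Hypothetical read-out, including one colour the lookup does not know.
peaks = ["green", "red", "yellow", "blue", "blue", "violet"]
print(call_bases(peaks))  # ATGCCN
```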
Boolean Operators
The AND and OR logic operators form the basis of digital computer operations.
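As a small illustration, the truth tables of the two operators can be generated in Python. The helper names `truth_table`, `and_gate` and `or_gate` are chosen for this sketch only.

```python
from itertools import product


def and_gate(a, b):
    return a & b


def or_gate(a, b):
    return a | b


def truth_table(gate):
    """Return (a, b, gate(a, b)) for every combination of two input bits."""
    return [(a, b, gate(a, b)) for a, b in product((0, 1), repeat=2)]


for name, gate in (("AND", and_gate), ("OR", or_gate)):
    print(name)
    for a, b, out in truth_table(gate):
        print(f"  {a} {b} -> {out}")
```

AND outputs 1 only when both inputs are 1, while OR outputs 1 when at least one input is 1.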
Digital Computer Physical Architecture (left) versus Abstraction (right). Process control applications emphasize input/output hardware (left), which corresponds to the peripherals abstraction (right).
Archiving
Just as the transfer of data from DNA to RNA to protein relies on an
information infrastructure, data archives rely on an information technology
(IT) infrastructure. This IT infrastructure includes network and database
technologies as well as standard vocabularies to store and access
information.
Numerical Processing
Computers are recognized foremost for their computational or
numerical-processing capabilities. In bioinformatics, applications for
numerical-processing techniques range from sequence analysis,
microarray data analysis, and site prediction to gene finding, protein
structure prediction, and phylogenetic analysis.
Rules-based systems
IF First Nucleotide = "T" AND Second Nucleotide = "A"
AND (Third Nucleotide = "A" OR Third Nucleotide = "G")
THEN Codon = "Stop"
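The rule above translates almost directly into Python. As written it recognizes TAA and TAG only; the second rule in this sketch, for TGA, is an addition so that all three stop codons listed earlier are covered. The function name `is_stop_codon` is an assumption of the example.

```python
def is_stop_codon(codon):
    """Rules-based test for a stop codon, checked one nucleotide at a time."""
    if len(codon) != 3:
        raise ValueError("a codon has exactly three nucleotides")
    first, second, third = codon.upper()
    # Rule from the slide: T, then A, then A or G (TAA, TAG).
    if first == "T" and second == "A" and (third == "A" or third == "G"):
        return True
    # Added rule, not on the slide: T, then G, then A (TGA).
    if first == "T" and second == "G" and third == "A":
        return True
    return False


for codon in ("TAA", "TAG", "TGA", "ATG", "TGG"):
    print(codon, is_stop_codon(codon))
```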
Artificial Neural Network
This machine-learning technology relies on tightly interconnected input, hidden, and output layers to map input patterns to output patterns. One of many possible truth tables (right) illustrates the mapping of input to output patterns. Learning is signified by the thickness of lines joining nodes, and node values are indicated by color (white = 0 and black = 1). Hidden nodes can take on values between 0 and 1.
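A minimal Python sketch of such a network follows, with two input nodes, two hidden nodes and one output node. The weights are set by hand so that the network reproduces the XOR truth table; no learning takes place here, and both the choice of XOR and the weight values are assumptions made for the illustration rather than the network shown on the slide.

```python
import math


def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per node, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hand-picked weights (in the figure, these are the line thicknesses).
HIDDEN_W = [[20.0, 20.0], [-20.0, -20.0]]   # hidden node 1 ~ OR, node 2 ~ NAND
HIDDEN_B = [-10.0, 30.0]
OUTPUT_W = [[20.0, 20.0]]                   # output node ~ AND of the hidden nodes
OUTPUT_B = [-30.0]


def forward(x1, x2):
    """Map an input pattern to (hidden node values, output node value)."""
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)
    (output,) = layer(hidden, OUTPUT_W, OUTPUT_B)
    return hidden, output


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    hidden, out = forward(a, b)
    print(a, b, "->", round(out))
```

Rounded to the nearest integer, the outputs for the four input patterns are 0, 1, 1, 0, while the hidden node values stay strictly between 0 and 1.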
Communications
As a communications system, the computer is like an asynchronous communications medium, typified by server-based e-mail.
From a data-management perspective, computers are increasingly used as asynchronous communications devices.
The difference between e-mail and a typical biological database is that contributions to biological databases such as GenBank are meant for others, but the identity of the recipients is unknown and largely unknowable by the sender.
The communication of sequencing and protein structure information is hindered by the lack of a standard format for creating and storing gene data, even within companies. Contenders for the standard include Gene Expression Markup Language (GEML), based on the eXtensible Markup Language (XML), and Microarray Markup Language (MAML). The latter is based on collaboration between the National Center for Biotechnology Information, Stanford University, and the European Bioinformatics Institute.
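To show why an agreed XML-based format helps, here is a small Python sketch that reads expression values from an XML record using the standard library. The element and attribute names (`experiment`, `gene`, `expression`, `symbol`, `sample`) and the values are invented for this example; they do not follow the actual GEML or MAML schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical record with made-up values; not real GEML or MAML.
RECORD = """\
<experiment id="demo-1">
  <gene symbol="NF2" chromosome="22">
    <expression sample="control">1.00</expression>
    <expression sample="tumor">0.35</expression>
  </gene>
</experiment>
"""


def read_expression(xml_text):
    """Return {gene symbol: {sample name: expression level}}."""
    root = ET.fromstring(xml_text)
    levels = {}
    for gene in root.iter("gene"):
        levels[gene.get("symbol")] = {
            e.get("sample"): float(e.text) for e in gene.iter("expression")
        }
    return levels


print(read_expression(RECORD))
```

Because the tags describe the data, any recipient who knows the format can extract the values without knowing who produced the file, which suits the sender-does-not-know-the-recipient situation described above.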
Thank You