Computer Science and Bioinformatics

Download Report

Transcript Computer Science and Bioinformatics

Computer Science and
Bioinformatics
http://www.csee.umbc.edu/~smer1/bioinformatics.gif
James Edwards and Rajinder
Singh Bhatti.
Biology and Computer Science?

Initially Biology depended on Chemistry to
make major strides


Biology then needed to work at atomic level
explaining phenomena


Biochemistry
Biophysics
The modern era of Biology needs to interpret
a wealth of data, tools that only computer
Science is able to provide

Hence Bioinformatics
What is Bioinformatics?



The study of computational methods to
expand the use of biological data (Data
Orientated).
Often (incorrectly) used instead of the
term ‘Computational Biology’. However
this is a slightly different discipline.
Computational Biology is the use of
computational and mathematical methods
to study or simulate biological systems
(Hypothesis Orientated).
[source National Institutes of health]
Overlaps Between the two Disciplines
1
3
2
4




1
2
3
4
–
–
–
–
Bioinformatics problems
Computational Biology problems
Problems in both categories
Problems in neither category
Motivation for Bioinformatics



Quote from Donald Knuth- 1974 Turing
Award winner:
“…I can’t be as confident about computer
science as I can about biology. Biology
easily has 500 years of exciting problems to
work on. It’s at that level.” [source –
Wikiquotes]
Can Biological life be equated with
Computing?
 Results so far would suggest the answer
is yes!
Common Bioinformatics Problems





Finding and assessing Similarities
between Strings (next slides).
Detecting patterns in strings.
Constructing trees of the evolution of
organisms.
Classifying new data by clustering
existing data.
Also applications of Machine Vision to
detect interactions between proteins
Data Structures Used in Biology


Strings for
representing
sequences (e.g.
DNA, RNA, Amino
Acid Sequences).
Trees for
representing the
evolution of
organisms and
other purposes.
“ATACGGCGCGCAAGGCT”
“TATGCCGCGCGTTCCGA”
Prokaryotes
Eukaryotes
............
Reptiles
……
……
Birds
……
……
Data Structures (Cont..)


Graphs can represent
signalling pathways
(often found in Neural
networks).
3d Points and their
Linkages can
represent protein
structures.
1
1
0
1
1
2
1
First Instance of a Problem – DNA
Shotgun Sequencing

In order to derive a DNA sequence, the DNA must
first be duplicated many times.
ATCACCGTAAGAGGA

ATCACCGTAAGAGGA
ATCACCGTAAGAGGA
ATCACCGTAAGAGGA
It must then be processed by Gel Electrophoresis,
which ‘chops’ the DNA into smaller pieces named
‘fragments’.
ATCACCGTAAGAGGA
ATCACCGTAAGAGGA
ATCACCGTAAGAGGA
ATC
CCGT
AGGA
AAG
CCGT
AAGA
TCA
TAA
GTA
This is a very simplified Instance of the problem typically each fragment can
be between 250 and 1000 Bases long.
Alignments – the Smith Waterman
Method.





How do we identify fragments which link
together?
Can use dynamic programming to compute
optimal alignment scores between fragments.
Align with either match (1) gap -(1/3 x length of
gap) or mismatch (-1).
The score in each cell is the best total score from
an already chosen cell/row + the cost of the
alignment. If a score is < 0 it is said to be 0.
The first row is always filled with 0’s
A
T
A
A
0
0
0
T
0
1
0
C
0
0
0
The Smith Waterman algorithm
(Continued)




Following this trace back a path through the
optimum alignment starting at the highest
number in the matrix to the first 0.
In this case it is: ‘AT’
Algorithm extremely expensive O(NM) run time
and O(NM) storage complexity.
Always finds optimum solution.
A
T
A
A
0
0
0
T
0
1
0
C
0
0
0
Alternative to SW Algorithm

Sequences are usually at the very least
tens of thousands of characters long




Makes O(NM) runtime (and storage
complexity) unacceptable.
Alternative – use BLAST (Basic Local
Alignment Search Tool) Algorithm.
Gives a much more reasonable run time
of O(N+M).
However does not always compute best
solution.
BLAST Algorithm


Computing an entire matrix of values will
always require N x M space.
Iterating over values will always require N x
M Space.



Solution: Ignore parts of the alignment which are
unlikely to improve the score.
This improves the Storage Complexity as
only a singular alignment must be stored.
It also improves the Runtime Complexity as
at each stage of the algorithm only the
optimum so far is processed.
BLAST Illustrated

Consider forming an alignment between
two sequences:
CTCTCTCTCATTGATTGCGGGGGG
GGGGGGGGGATTGATTGCCCCCCC

The strings at the beginning and end
are very unlikely to improve the score
of the alignment.

Therefore no gap and mismatches are
computed in the matrix
---------ATTGATTGC--------------ATTGATTGC------
Alignments Relation to Shotgun
Sequencing.



So now there is a way to measure which
fragments are likely to align we still need
a way to find the correct order efficiently.
In depth Algorithm beyond scope of
presentation
However the best current techniques are:



Greedy Methods (align every element – then
use only best solutions).
Evolutionary Algorithms (start with initial set of
solutions, computing sum of alignment scores
then ‘evolve’ set of solutions in each iteration).
Problem is NP- Hard – Techniques give
Approximations.
Relating Computer Science to
Biology

What have us Computer Science students
studied so far in this MSc course that can have
some use to Bioinformatics?

Data Mining

Artificial Intelligence


Heuristic approaches (e.g. Knowledge Representation –
Logics)
Algorithm Techniques
Data Mining and Bioinformatics
How and why?


Some of you do COMP 527 Data Mining
with Rob
Why Data Mining is essential in
Bioinformatics.



KDD (Knowledge Discovery DB) is the process of
finding useful information and patterns in
data.
Data Mining is the use of algorithms to
extract information and patterns derived by
the KDD process.
Graphical Techniques such as Brush, Data
smoothing etc.
Data Mining and Bioinformatics
Algorithm implementation examples

Data Mining algorithm use for tackling
problems in Bioinformatics

In conjunction with microarray Technology


Predict a patients outcome, such as

survival time

disease recurrence

health risk assessments etc...
How does Data Mining help?

Accurate predictions could help provide better
treatment!
AI and Bioinformatics
Artificial Intelligence?




Research in genetics, molecular biology
etc. generate enormous amounts of data
Use AI to extract useful information from
the wealth of available data
Build good probabilistic models (gene
models)
AI provides several powerful algorithms
and techniques solving these problems
using the stored data
AI and Bioinformatics
AI techniques used

Neural networks (Biological and
Artificial)
Hidden Markov models (Probabilistic
Statistical models)
Bayesian networks (Models logic)

and many others....


Logic and Bioinformatics

Biology works by applying prior knowledge
“what is known” to unknown entities.
 Therefore Biology said to be knowledgebased (rather than axiom based)
 Use pre-existing knowledge to make
inferences about the item under
investigation.
 Description Logic?
Description Logic and
Bioinformatics

Why description Logic?

decidable logic with good systems

impossible for a single biologist to deal with all
of a domains knowledge!

similar to programmers writing extremely
complex programs without an IDE to help
with libraries

medical diagnosis systems make good use of
ABOX and TBOX assertions

for example, determine if a patients problem
is an element of a particular known disease
Description Logic Example
TBOX
sick person
non_sick person
isInfected.Cancer
isInfected.Cold
ABOX
Tim : person
Steven : person
Cancer : Problem Cold : Problem
(Tim, Cancer) : isInfected
(Steven, Cold) : isInfected
Improvements
How far has Bioinformatics come?


“One is struck both by how far the field has come
in a relatively short period of time, and also by
how far it has yet to go.” - Jessica D.
Tenenbaum
The discipline of Bioinformatics has vastly
improved over recent years due to




Fast technological development of the computer
industry
Demand for Computer Scientists - more computer
scientists than ever before!
Biological “unknown” discoveries – things that are
discovered with no previous knowledge base
Growing of sub-Biology interests, such as
molecular Biology
Improvements
How far will Bioinformatics go?

Thoroughly depends if the gap between Biology
and Computer Science increases or decreases

The gap increases if educational institutions
decide ignore Bioinformatics

Put emphasis on prospective students

Computer Scientists choose to ignore Biology

Biologists choose to ignore Computer Science
Closing the gap I


Biologists cannot build their own
analytical tools
Computer Scientists don't know what
to build!
Closing the gap II

Putting a Computer Scientist (Data
Mining expert) into a room with a
Biologist investigator wont solve the
problem

Boundaries such as methodologies and
discipline language are a problem.
Closing the gap III



Computer Science is the “science of the
artificial”
Biology is the “science of discovery”
The only way to bridge the gap is for
both parties to learn the basic
fundamentals of each science
Breakthroughs of Bioinformatics

Spatial patterns of structures for understanding
protein folding, evolution, and biological
functions


To predict protein functions, we develop a method by
rapidly matching local surfaces and by incorporating
evolutionary information specific to individual binding
region via a Bayesian Monte Carlo approach.
These kinds of breakthroughs encourage the computer
industry to get involved and work with Biology.
Related Problems?
Are there any other disciplines which involve the similar
integration of Computer Science with Biology?

Cheminformatics/Chemoinformatics



the application of informatics tools to solve
discovery chemistry problems
an integral component of hit and lead generation
development of new computational methods or
efficient algorithms for chemical software, and
pharmaceutical chemistry including analyses of
biological activity and other issues related to drug
discovery
Related Problems?
Are there any other disciplines which involve the similar
integration of Computer Science with Biology?

Other similar interests

Ecoinformatics

Geoinformatics

Quantum informatics

Astroinformatics

Business informatics

And many others...
Follow ups of Jacques Cohen

Bioinformatics—an introduction for computer
scientists is a previous publication from Jacques
Cohen


aims to encourage Computer Scientists to get
involved with Biology
Updating Computer Science Education released
after Bioinformatics and Computer Science

Talks about encouraging the next generation of
Computer Scientists that Computer Science is
more than just programming.
Who is Jacques Cohen?




Currently serving Brandies University since 1968.
Docter in the field of analysis of algorithms, parsing
and compiling, memory management, logic and
constraint logic programming, and parallelism
Recently started researching his interest of
Bioinformatics
His most recent publication is about methods used in
microarray Data Interpretation

See
http://www.cs.brandeis.edu/~jc/publications.html
References
and
related
material
(All web links last accessed 4 February 2008)
th

Shotgun sequencing
G. Luque, E. Alba Torres and S. Khuri, Assembling DNA Fragments with a Distributed Genetic
Algorithm, Parallel Computing for Bioinformatics and Computational Biology, WileyInterscience, New Jersey, 2006, Chapter 12, pp. 285-302.

L.D. Paulson, Bioinformatics Experiences Important Breakthroughs, 2005, pp. 26-27

J. Cohen, Bioinformatics: An Introduction for Computer Scientists, ACM Computing Surveys,
36(2), 122-158, 2004.

B. Tjaden, J. Cohen, A Survey of Computational Methods used in Microarray Data
Interpretation, Applied Mycology and Biotechnology, Bioinformatics 6, 2006.

J. Cohen, Updating Computer Science Education, Communications of the ACM, 48(6), 29-31,
2005.

J. Cohen, Computational Molecular Biology: A Promising Application Using Logic Programming
and Constraint Logic Programming, Lecture Notes in Artificial Intelligence, 1999.

R. Stevens, C.A. Goble and S. Bechhofer, Ontology-based Knowledge Representation for
Bioinformatics, 2000.

Jinyan Li, Limsoon Wong and Qiang Yang, Data Mining in Bioinformatics, 2005.

Various material about Bioinformatics, http://www.aaai.org/AITopics/html/bioinf.html

Data Mining in Bioinformatics, http://www.dbs.informatik.unimuenchen.de/Forschung/Bioinformatics/