MBG305_LS_01


Applied Bioinformatics
Dr. Jens Allmer
Week 1 (Introduction)
Your Instructor
• Education
– BSc: University of Münster 1996
– MSc: University of Münster 2002
– PhD: University of Münster 2006
• Worked at
– Izmir Institute of Technology (since 2008)
– Izmir University of Economics, Turkey (Feb 2007 – Aug 2008)
– University of Muenster, Germany (Jan 2006 – Feb 2007)
– University of Pennsylvania, USA (Jan 2004 – Dec 2005)
– University of Jena, Germany (Nov 2002 – Dec 2003)
Areas of Interest
• Bioinformatics
– Sequences
– Alignments
• Mass Spectrometry
– De novo sequencing
– Pattern matching
• Annotation
– Integration
– Automatic assessments
• General Automation and Productivity
Course Rules
• Attendance
– Is essential and will be monitored strictly
– if(absence > 12h) Then NA;
• Make-up Work
– None
Course Rules
• Lecture starts on time
– if late enter QUIETLY
– if more than 5 min late DO NOT ENTER; wait for the break
• Breaks are 10 min max
– if late after break enter QUIETLY
– if more than 5 min late DO NOT ENTER; wait for the next break
• Early leave
– Announce before course and leave if granted
Course Rules
• Project
– Parts to be performed will be published on the website and/or as slides
– Deadline 6pm on the day before the next class
(you may submit early of course)
– No extension
– No make-up
– No extra work
• Must be electronically submitted to:
[email protected]
– Must be named ????_first_last.eee or will not be accepted
– Formats include: doc, ppt, odx, txt, html, ...
– Not allowed are formats that I cannot edit, such as pdf,
and similar formats that are not widespread
– Must be significantly different from your classmates
– Otherwise everyone involved will obtain zero for that assignment
Grading
• All information available on class website
• Grading individualized
– Quizzes 15%
– Mind Maps 10%
– Midterm 1 25%
– Midterm 2 25%
– Project 25%
Project
• Group Formation 0% (08.10. 18:00)
– Group Size: 4
• First Draft 25% (22.10. 18:00)
• Results 15% (19.11. 18:00)
• Second Draft 20% (03.12. 18:00)
• Presentation 10% (25.12. 18:00)
• Final Version 25% (31.12. 18:00)
Grading
• I am responsible for evaluating you
– I am not responsible for passing everyone or giving great grades
• Make it easy for me
1. Show up and participate
2. Do homework and pre-course preparations
3. Midterm and Final will be easy for you if you adhere to 1. and 2.
Course Structure
– Start
– 10 min quiz
– 35 min lecture
– 5 min mind mapping
– 10 min break
– 50 min practice
– 10 min break
– 40-50 min lecture
– 10 min break
– 30 min practice
Textbooks
Primary audience
Junior bio majors
Course home page:
http://www.biolnk.com/habf
ISBN:
978-605-133-297-0
http://www.idefix.com/kitap/biyoenformatik-1-dizi-kiyaslamalarijens-allmer/tanim.asp?sid=GUFFOI44R7FJ9CIR6STU
Textbooks
Everything you currently
need to know about Applied
Bioinformatics in regard to
practical problems you will
encounter during everyday
research.
Bioinformatics
[Diagram: Bioinformatics connected to Chemistry, Biology, Molecular biology, Mathematics, Statistics, Computer Science, Informatics, Medicine, and Physics]
Bioinformatics is Multidisciplinary
[Diagram: BIOINFORMATICS surrounded by Genomics, Drug Design, Computer Science, Molecular Life Sciences, Phylogenetics, Structural Biology, Math, and Statistics]
The Pyramid of Life (2000)
Metabolomics: 1400 Chemicals
Proteomics: 3,000 Enzymes
Genomics: 30,000 Genes
The Pyramid of Life
Protein Interactions?
100,000 Proteins
30,000 Genes
1400
Chemicals
Bioinformatics
(or Computational Biology)
• Not just the study of DNA or protein sequence data
• Inclusive definition – concerns the storage, display,
reduction, management, analysis, extraction, simulation,
modeling, fitting or prediction of biological, medical or
pharmaceutical data
Basis of molecular life sciences
• Hierarchy of relationships (some exceptions):
Genome
Gene 1 → Protein 1 → Function 1
Gene 2 → Protein 2 → Function 2
Gene 3 → Protein 3 → Function 3
Gene X → Protein X → Function X
How can one use bioinformatics to link
diseases to genes?
• Disease → Map → Gene → Function
Positional cloning of genes
1. Find genetic markers associated with disease
2. Sequence DNA next to the markers
3. Compare DNA from afflicted individuals to DNA of normal individuals (database)
4. Find abnormalities
5. Predict gene function from sequence information
Bioinformatics in the old days
• Close to Molecular Biology:
– (Statistical) analysis of protein and nucleotide structure
– Protein folding problem
– Protein-protein and protein-nucleotide interaction
• Many essential methods were created early on
– Protein sequence analysis (pairwise and multiple alignment)
– Protein structure prediction (secondary, tertiary structure)
• Evolution was studied and methods created
– Phylogenetic reconstruction (clustering – e.g., Neighbor
Joining (NJ) method)
– Nowadays also part of Datamining
But then the big bang….
The Human Genome - 26 June 2000
Dr. Craig Venter, Celera Genomics -- Shotgun method
Francis Collins (USA) / Sir John Sulston (UK), Human Genome Project
Human DNA
• There are at least 3bn (3 × 10^9) nucleotides in the nucleus of
almost all of the trillions (3.2 × 10^12) of cells of a human
body (an exception is, for example, red blood cells, which
have no nucleus and therefore no DNA) – a total of ~10^22
nucleotides!
• Many DNA regions code for proteins, and are called genes (1
gene codes for 1 protein as a base rule, but the reality is a lot
more complicated)
– Name examples
• Human DNA may contain ~27,000 expressed genes
– Problems?
• Deoxyribonucleic acid (DNA) comprises 4 different types of
nucleotides: adenine (A), thymine (T), cytosine (C) and
guanine (G). These nucleotides are sometimes also called
bases
– Ambiguities?
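The estimate of about 10^22 nucleotides in a human body can be checked with a short calculation. The following is only a sketch using the two figures given on this slide (3 billion nucleotides per nucleus, 3.2 trillion cells); the variable names are illustrative, and cells without a nucleus are ignored.

```python
import math

# Figures taken from the slide; names are illustrative assumptions.
nucleotides_per_nucleus = 3e9   # at least 3 billion nucleotides
cells_in_body = 3.2e12          # trillions of cells

# Multiply to get the total, then read off the order of magnitude.
total_nucleotides = nucleotides_per_nucleus * cells_in_body
order_of_magnitude = round(math.log10(total_nucleotides))

print(f"{total_nucleotides:.1e} nucleotides, roughly 10^{order_of_magnitude}")
```

The product is slightly below 10^22, which is why the slide gives the total only as an approximate value.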
Y-Chromosome
• 50% of the sequence consists of
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
• Not very meaningful
– Explanation .... Same as in x chromosome
– What about the N’s in chr 1?
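To judge how much of a sequence is undetermined, one can count the share of N characters. The sketch below does this for a made-up toy string, not for real chromosome data; `fraction_unknown` is a hypothetical helper name chosen for this example.

```python
def fraction_unknown(sequence: str) -> float:
    """Return the fraction of positions that are the placeholder 'N'."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


# Toy sequence (invented for illustration): 20 known bases, 20 unknown.
toy_sequence = "ACGT" * 5 + "N" * 20
print(f"{fraction_unknown(toy_sequence):.0%} of the toy sequence is N")
```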
Human DNA (Cont.)
• All people are different, but the DNA of different
people varies by only 0.2% or less
• So, only up to 2 letters in 1000 are expected to be
different
• Evidence from current genomics studies (Single
Nucleotide Polymorphisms or SNPs) implies that on
average only 1 letter out of 1400 differs between
individuals
• Over the whole genome, this means that 2 to 3
million letters would differ between individuals
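The numbers on this slide follow from simple arithmetic on the genome size. As a rough sketch, assuming a genome of 3 × 10^9 letters (the figure used on the earlier Human DNA slide), with illustrative variable names:

```python
genome_length = 3_000_000_000  # assumed genome size in letters

# Upper bound: 0.2% variation, i.e. 2 letters in 1000.
differences_upper_bound = genome_length * 2 // 1000

# SNP-based estimate: on average 1 letter in 1400.
differences_snp_estimate = genome_length / 1400

print(f"upper bound:  {differences_upper_bound:,} letters")
print(f"SNP estimate: {differences_snp_estimate:,.0f} letters")
```

The SNP-based estimate lands between 2 and 3 million, in line with the last bullet, while the 0.2% bound would allow for about twice as many differences.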
Modern bioinformatics is closely
associated with genomics
• The aim is to solve the genomics information
problem
• Ultimately, this should lead to a biological
understanding of how all parts fit together (DNA, RNA,
proteins, metabolites) and how they interact
(gene regulation, gene expression, protein
interaction, metabolic pathways, protein
signaling, etc.)
Functional Genomics
From gene to function
Genome
Expressome
Proteome
Interactome?
TERTIARY STRUCTURE (fold)
Metabolome
How much of the genome is defined?
Unknown Function
What is bioinformatics?
[Diagram: Bioinformatics at the overlap of Math, Physics, English, Bio, Comp sci, Chem, and Stats]
• Machine learning
• Database systems
• Data mining
• Image processing
• Modeling
• Graph theory
• Statistical analysis
• Sequence
• Structure
• Interactions
• Regulation
• Genomes
• Evolution
• E.g. Process the spots on a microarray, determine
which genes are differentially expressed, link spots to
sequence via a database, analyze the sequence using
predictive tools, link the genes to related genes to form
a network
What is a bioinformatician?
• Somebody who knows everything
What is a bioinformatician?
• A facilitator
– Typically has background in biology or CS, but is comfortable
with concepts from other disciplines
– Bring together ideas (or researchers) from different domains to
solve a biological problem
• Conceptualize the problem
– Use language appropriate to the domain
• Identify potential solutions
– Understanding of different fields helps to identify possible
approaches at a broad level
• Guide the development process
– Create in-house or find potential collaborators to work on
approaches in-depth
• Integrate results into overall solution
– Software/method, results of biological analysis
How is Bioinformatics Used?
Bioinformatics is used to help “focus”
the scientist on the bench top experiments
Bioinformatics isn’t going to replace
lab work anytime soon
Experimental proof is still the
“Gold Standard”.
Bioinformatics
• Is the application of computational tools in biology
bioinformatics?
• Not really!
• In this course, however, we will only rarely go into
algorithmic details (like today ;)
Mind Mapping
• Have you ever studied a subject or brainstormed an
idea, only to find yourself with pages of information, but
no clear view of how pieces fit together?
• Mind mapping
– Learn more effectively
– Improves memorization
– Enhances creativity
– Speeds up analyses
– Gives structure to complex ideas
– Records information for future use
Source: http://www.mindtools.com/pages/article/newISS_01.htm
An Example Mind Map for MicroRNAs
How to Mind Map
1. Identify the central topic and write it in the center
2. Write major parts of the topic on lines in all directions
3. Repeat 2. with ever finer levels of detail until satisfied
Source: http://www.mindtools.com/pages/article/newISS_01.htm
Note Taking with Mind Maps
• Capture ideas organized into topics
– What if the central topic which I chose is not the central topic?
– Make a new mind map which captures the topic correctly
• Use Cases
– Note taking in class
– Recapitulation after lecture
– Analysis of a new topic
– Structuring of any intended writing
• When
– During acquisition of new knowledge (faster than writing)
– For review 5 min, 1 h, 6 h, 1 day, 7 days, 1 month after note taking
Mind Mapping Tips
1. Use single words or very short phrases
2. Write clearly and legibly
3. Use color!
4. Separate ideas (color, lines, shading)
5. Draw symbols and images
6. Draw links among elements
A More Elaborate Mind Map
Source: http://www.mindtools.com/pages/article/newISS_01.htm
At the Heart of Bioinformatics
Genomic
>scaffold_1152
GGTGCGGCCGTCCTCCAGCTGCTTGCCGGCGAAGATCAGGCGCTGCTGGT
CCGGGGGGATGCCTGCATCCGGTGAGGAAACGCTCGTGTCAGACAAAGTG
GGTGGGCGCAGGAAGCAGCAATCAACACAGCCCAGTGCAGCTGCAAAGCG
CCCGCCTTACCACTGACCCGCCTGGCCACCCACCCCTACCCCCCGTAAGG
AAAGAGCCCCGACTCACCCTCCTTGTCCTGAATCTTGGCCTTCACGTTCT
CAATGGTGTCCGAAGACTCCACCTCGAGCGTGATGGTCTTGCCCGTCAGG
GTCTTGACGAAGATCTGCATGCCACCGCGCAGGCGCAGCACCAGGTGCAG
…
Translated
>RF1_scaffold_1152
GAAVLQLLAGEDQALLVRGDACIR$GNARVRQSGWAQEAAINTAQCSC
KAPALPLTRLATHPYPP$GKSPDSPSLS$ILARDVAHDFAKSSPR$YA
PLIPQNLRC$SIEMKQPASLLSPIGEGACASHLQCLEKCLLP$GAIVY
MIS$GSGRR$TSWVGIGGCNDGTEKRSEVDSRRGGKGNIHD
>RF2_scaffold_1152
VRPSSSCLPAKIRRCWSGGMPASGEETLVS AATAAKPQTWSPTAWEF
KVGGRRKQQSTQPSAAAKRPPYH$PAWPPTPTPRKERAPTHPPCPESW
SRSQWCPKTPPRA$WSCPSGS$RRSACHRAGAAPGAGSTPSGCCSQPG
CGRPPAACRRRSGAAGPGGCLCVGGGGEGACASHLQCLEGE
…
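The step from the genomic sequence to the translated reading frames can be sketched in a few lines. This is a minimal illustration, not the tool used to produce the slide: it assumes the standard genetic code, marks stop codons with `$` as the slide appears to do, and handles only forward frames. `translate` and `FRAGMENT` are names chosen for this example; the fragment is the first 75 nucleotides of scaffold_1152 shown above.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering; '$' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY$$CC$W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 1) -> str:
    """Translate a DNA string in forward reading frame 1, 2 or 3."""
    start = frame - 1
    codons = (dna[i:i + 3] for i in range(start, len(dna) - 2, 3))
    # Codons with unexpected letters (e.g. N) become 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# First 75 nucleotides of scaffold_1152 from the slide.
FRAGMENT = (
    "GGTGCGGCCGTCCTCCAGCTGCTTGCCGGCGAAGATCAGGCGCTGCTGGT"
    "CCGGGGGGATGCCTGCATCCGGTGA"
)

for frame in (1, 2, 3):
    print(f"RF{frame}: {translate(FRAGMENT, frame)}")
```

Reading frame 2 simply starts one nucleotide later than frame 1, which is why the same DNA yields a completely different protein sequence.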
Try it for yourself
Sequence
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
Pattern
TGATGT
Your Task
• You may only compare 1 character at a time
• You may create helpful structures
• You should find the location of the Pattern in the Sequence with a minimal number of comparisons
Brute Force Approach
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons: 1
Brute Force Approach
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons: 2
Brute Force Approach
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons: 3
Brute Force Approach
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons: 4
Brute Force Approach
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons: 6
Brute Force Approach
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons: 7-16
Brute Force Approach
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons: 17-22
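The brute force approach stepped through above can be written down as a short function that also counts character comparisons. This is a sketch under the rules of the task (one character compared at a time); the function name is illustrative, positions are 0-based, and the exact count depends on counting conventions, so it may differ slightly from the tally on the slides.

```python
def brute_force_search(sequence: str, pattern: str) -> tuple[int, int]:
    """Return (position, comparisons); position is -1 if there is no match."""
    comparisons = 0
    # Try every possible alignment of the pattern, left to right.
    for start in range(len(sequence) - len(pattern) + 1):
        matched = 0
        while matched < len(pattern):
            comparisons += 1  # one character pair compared per step
            if sequence[start + matched] != pattern[matched]:
                break  # mismatch: slide the pattern one position to the right
            matched += 1
        if matched == len(pattern):
            return start, comparisons
    return -1, comparisons


SEQUENCE = "ACGGTAGTATGTGATGTATGATCGCGAAAGAGG"
PATTERN = "TGATGT"

position, comparisons = brute_force_search(SEQUENCE, PATTERN)
print(f"match at index {position} after {comparisons} comparisons")
```

After every mismatch the pattern moves by exactly one position, so characters of the sequence can be compared several times.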
Boyer-Moore Algorithm
• Preprocessing
• Good suffix matrix (m+1)
• Bad character matrix (m+1)
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons: 1
Boyer-Moore Algorithm
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons: 2
Boyer-Moore Algorithm
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons: 3-7
Boyer-Moore Algorithm
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons: 8
Boyer-Moore Algorithm
ACGGTAGTATGTGATGTATGATCGCGAAAGAGG
TGATGT
Comparisons: 9-15
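The idea of comparing from the right end of the pattern and skipping ahead can be illustrated with a simplified variant. The sketch below is the Boyer-Moore-Horspool simplification, which uses only a bad character shift table and omits the good suffix rule, so it is not the full Boyer-Moore algorithm named on the slides. The function name is illustrative, positions are 0-based, and the comparison count may differ slightly from the slide tally.

```python
def horspool_search(text: str, pattern: str) -> tuple[int, int]:
    """Return (position, comparisons); position is -1 if there is no match."""
    m, n = len(pattern), len(text)
    # Preprocessing: how far to shift, keyed by the text character that is
    # aligned with the last pattern position (bad character rule only).
    shift = {char: m - 1 - index for index, char in enumerate(pattern[:-1])}
    comparisons = 0
    start = 0
    while start <= n - m:
        j = m - 1
        while j >= 0:  # compare right to left
            comparisons += 1
            if text[start + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return start, comparisons
        # Characters that do not occur in the pattern allow a full shift of m.
        start += shift.get(text[start + m - 1], m)
    return -1, comparisons


SEQUENCE = "ACGGTAGTATGTGATGTATGATCGCGAAAGAGG"
PATTERN = "TGATGT"

position, comparisons = horspool_search(SEQUENCE, PATTERN)
print(f"match at index {position} after {comparisons} comparisons")
```

On this example the preprocessing pays off: the pattern jumps several positions at a time and is found with fewer comparisons than a left-to-right scan that shifts by one.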
Questions
Define
Algorithm
Website
• http://mbg305.allmer.de
• Slides
• Homework
• Additional materials and challenges
• Grades
Website
• To see your grades you need to login
• Some material may need login as well
• Currently
– UserID = StudentID
– Password = StudentID
• Change now
– UserID = working email address
– Password = whatever you will remember
Login to mbg305.allmer.de
• We will now assist you to log in and to add your email
address and change your password.
Assignments
– Research about Mind Maps
• E.g.: http://en.wikipedia.org/wiki/Mind-map
• IYTE library
– Make sure to read the lecture notes for next week (Available
online on Wednesday)
– Read Chapters 1 and 2 from our textbook