T - Webcourse

Projects 2015-16
Instructions for the final project
Introduction to Bioinformatics 2013-14
Key dates
9.12 lists of suggested projects published *
*You are highly encouraged to choose a project yourself or find
a relevant project which can help in your research
3.1 Final date to choose a project
10.1 Submission project overview (one page)
-Title
-Main question
-Major Tools you are planning to use to answer the questions
11.1 / 18.1 Meetings on projects
9.3 Poster submission
16.3 Poster presentation
2. Planning your research
After you have described the main question or questions of your
project, you should carefully plan your next steps
A. Make sure you understand the problem and read the necessary
background to proceed
B. Formulate your working plan, step by step.
C. After you have a plan, start by extracting the necessary data and
decide on the relevant tools to use at the first step.
When running a tool, make sure to summarize the results and extract
the relevant information you need to answer your question. It is
recommended to save the raw data for your records, but don't present
raw data in your final project.
Your initial results should guide you towards your next steps.
D. When you feel you have explored all the tools you can apply to answer
your question, summarize and draw conclusions. Remember that NO
is also an answer, as long as you are sure it is NO. Also remember that this
is a course project, not only a HW exercise.
3. Summarizing final project in a poster (in pairs)
Prepare the poster in PPT, poster size 90-120 cm
Title of the project
Names and affiliation of the students presenting
The poster should include 5 sections :
Background should include description of your question (can add
figure)
Goal and Research Plan:
Describe the main objective and the research plan
Results (main section) : Present your results in 3-4 figures, describe
each figure (figure legends) and give a title to each result
Conclusions: Summarize the conclusions of your project in points
References: List the papers/databases/tools used for
your project
Examples of posters will be presented in class
Leftovers from last lesson
Gene expression analysis
Clustering the data according to expression profiles
[Figure: genes against expression in different conditions, on a scale from highly expressed to lowly expressed]
Different clustering approaches
• Unsupervised
- Hierarchical Clustering
- K-means
• Supervised Methods (supervised learning)
-Support Vector Machine (SVM)
Unsupervised
Hierarchical Clustering
Generate a tree based on the distances between genes
(similar to a phylogenetic tree)
Each gene is a leaf on the tree
Distances reflect the similarity of their expression pattern
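The tree-building idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the distance is taken as 1 minus the Pearson correlation of two expression profiles, clusters are joined by average linkage, and the gene names and expression values are invented toy data. The slides do not specify which distance or linkage is used.

```python
from math import sqrt

def pearson_distance(x, y):
    """Distance between two expression profiles: 1 - Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def average_linkage(members1, members2, profiles):
    """Mean pairwise distance between the genes of two clusters."""
    d = [pearson_distance(profiles[a], profiles[b])
         for a in members1 for b in members2]
    return sum(d) / len(d)

def hierarchical_clustering(profiles):
    """Repeatedly merge the two closest clusters; return (tree, merge order)."""
    clusters = [(name, [name]) for name in profiles]  # (subtree, member genes)
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i][1], clusters[j][1], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        tree = (clusters[i][0], clusters[j][0])
        members = clusters[i][1] + clusters[j][1]
        merges.append(sorted(members))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((tree, members))
    return clusters[0][0], merges

# Hypothetical expression values of four genes in four conditions.
profiles = {
    "geneA": [1, 2, 3, 5 - 1],
    "geneB": [1, 2, 3, 5],
    "geneC": [4, 3, 2, 1],
    "geneD": [5, 3, 2, 2],
}
tree, merges = hierarchical_clustering(profiles)
print(tree)
```

Each gene ends up as a leaf of the nested tuple, and genes with similar profiles (here geneA with geneB, geneC with geneD) are joined first.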
[Figure: genes against expression in different conditions, with a gene cluster highlighted]
Analyzing the clusters of genes
[Figure: Cluster 2, Cluster 3, Cluster 4]
What can we learn from clusters
with similar gene expression?
Similar expression between genes
-The genes have similar function
-The genes work together in the same pathway/complex
-All the genes are controlled by a common regulatory gene
Example: Genes work together in the same complex
[Figure: read counts (0-1400) across tissues for a transcription factor (TF) and a long non-coding RNA]
Supervised approaches
for diagnostics based on expression data
Support Vector Machine (SVM)
How can gene-expression help in diagnostics ?
Different patients (BRCA1 or BRCA2)
DATA
Microarray expression of all genes from two types of breast cancer
patients (BRCA1 and BRCA2)
Question:
Can we distinguish BRCA1 from BRCA2 cancers based solely on their
gene expression profiles?
• SVM would begin with a set of samples from
patients which have been diagnosed as either
BRCA1 (red dots) or BRCA2 (blue dots).
Each dot represents a vector of the expression pattern
taken from the microarray experiment of a patient.
How do SVMs work with expression data?
The SVM is trained on data which was classified based on
histology.
After training the SVM to separate the BRCA1 from the BRCA2 tumors
given the expression data, we can then apply it to diagnose an
unknown tumor for which we have the equivalent expression data.
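The train-then-diagnose workflow can be illustrated with a very small linear SVM. This is a sketch under assumptions, not the method used in the course: it minimises the hinge loss with an L2 penalty by plain subgradient descent, the two-gene expression vectors are invented, and the learning rate, penalty and number of epochs are arbitrary choices. BRCA1 is encoded as +1 and BRCA2 as -1.

```python
def train_linear_svm(samples, labels, lr=0.01, lam=0.01, epochs=200):
    """Linear soft-margin SVM via subgradient descent; labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Sample is misclassified or inside the margin: move the boundary.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Sample is safely classified: only apply the L2 shrinkage.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Return +1 (BRCA1) or -1 (BRCA2) for one expression vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical training set: expression of two genes per patient.
training_samples = [(5.0, 1.0), (6.0, 2.0), (5.5, 1.5),   # diagnosed BRCA1
                    (1.0, 5.0), (2.0, 6.0), (1.5, 5.5)]   # diagnosed BRCA2
training_labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(training_samples, training_labels)
unknown_tumor = (6.0, 1.0)
print("BRCA1" if predict(w, b, unknown_tumor) == 1 else "BRCA2")
```

With real microarray data each vector would hold thousands of genes rather than two, and a tested library implementation would normally be preferred over hand-written training code.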
Motif Search
What are Motifs
• Motif (dictionary) A recurrent thematic
element, a common theme
Find a common motif in the text
Find a short common motif in the text
Motifs in biological sequences
Sequence motifs represent a short common
sequence (length 4-20) which is highly
represented in the data
Motifs in biological sequences
What can we learn from these motifs?
– Regulatory motifs on DNA or RNA
– Functional sites in proteins
Regulatory Motifs on DNA
• Transcription Factors (TF) are regulatory
proteins that bind to regulatory motifs near
the gene and act as a switch button (on/off)
[Figure: Gene X with its transcription start site; TF1 and TF2 bound to the TF1 motif and the TF2 motif]
– TF binding motifs are usually 6-20
nucleotides long
– They are located near the target gene, mostly upstream of the
transcription start site
What can we learn from these
motifs?
About half of all cancer patients have a mutation in a gene
called p53, which codes for a key transcription factor.
The mutations are in the DNA binding region and allow
tumors to survive and continue growing even after chemotherapy
severely damages their DNA
[Figure: the p53 transcription factor, its binding sites (motifs) and a target gene]
Why is P53 involved in so many
cancer types?
p53 regulates over
100 different genes
(hub)
We are interested in identifying the genes
regulated by p53
Can we find TF targets using a
bioinformatics approach?
Finding TF targets using a
bioinformatics approach
Scenario 1 : Binding motif is known (easier case)
Scenario 2 : Binding motif is unknown (hard case)
Scenario 1 :
Binding motif is known
• Given a motif find the binding sites in an
input sequence
Challenges in biological sequences
Motifs are usually not exact words
…….
How to represent non-exact motifs?
• Position Specific Scoring Matrix (PSSM)
Probability for each base in each position

Aligned sequences:
Seq 1   AAAGCCC
Seq 2   CTATCCA
Seq 3   CTATCCC
Seq 4   CTATCCC
Seq 5   GTATCCC
Seq 6   CTATCCC
Seq 7   CTATCCC
Seq 8   CTATCCC
Seq 9   TTATCTG

PSSM:
     1    2    3    4    5    6
A   0.1  0.7  0.2  0.6  0.5  0.1
T   0.7  0.1  0.5  0.2  0.2  0.8
G   0.1  0.1  0.1  0.1  0.1  0.0
C   0.1  0.1  0.2  0.1  0.1  0.1
The PSSM can be also represented as
a sequence logo
-A letter’s height indicates the information it contains
Presenting a sequence motif as a logo
PWM / PSSM

Aligned sites:
TTCACG
TACATG
TACAGG
TACAAG

Frequency of each base at each position:
     1     2     3     4     5     6
A    0     0.75  0     1     0.25  0
G    0     0     0     0     0.25  1
C    0     0     1     0     0.25  0
T    1     0.25  0     0     0.25  0

Divide each score by the background probability 0.25:
     1  2  3  4  5  6
A    0  3  0  4  1  0
G    0  0  0  0  1  4
C    0  0  4  0  1  0
T    4  1  0  0  1  0

Letter height = $\log_2 S$
T at position 1: $\log_2 4 = 2$
T at position 5: $\log_2 1 = 0$
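The arithmetic on this slide can be reproduced with a short script. It is a sketch that follows the slide's simplified rule (letter height = log2 of frequency divided by a uniform background of 0.25) and uses the four aligned sites shown; the function names are illustrative, and letters that never occur at a position are given no height.

```python
from math import log2

BASES = "ACGT"

def frequency_matrix(sites):
    """Fraction of each base at each position of the aligned sites."""
    length = len(sites[0])
    return {base: [sum(site[k] == base for site in sites) / len(sites)
                   for k in range(length)]
            for base in BASES}

def letter_heights(freqs, background=0.25):
    """Height log2(frequency / background); None where the base never occurs."""
    return {base: [log2(f / background) if f > 0 else None for f in column]
            for base, column in freqs.items()}

sites = ["TTCACG", "TACATG", "TACAGG", "TACAAG"]
freqs = frequency_matrix(sites)
heights = letter_heights(freqs)

print(freqs["A"])    # frequency of A at positions 1-6
print(heights["T"])  # height of T at positions 1-6
```

A fully conserved position (T at position 1) reaches the height 2, while a position where all four bases are equally frequent (position 5) contributes a height of 0.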
Riddle
• What is the maximum height we can get in a logo that describes
a motif obtained from protein sequences?
How to search for a motif in a sequence
given a PSSM:
• Given a string s of length l
• s = s1 s2 … sl
• $\Pr(s \mid W) = \prod_{k=1}^{l} W_{s_k, k}$
  where $W_{b,k}$ = probability of base b in column k

Counts of each base in each column:
     1  2  3  4  5  6  7  8
A    1  1  9  9  0  0  0  1
C    6  0  0  0  0  9  8  7
G    1  0  0  0  1  0  0  1
T    1  8  0  0  8  0  1  0

Probability of each base in each column (W):
     1    2    3    4    5    6    7    8
A   .11  .11   1    1    0    0    0   .11
C   .67   0    0    0    0    1   .89  .78
G   .11   0    0    0   .11   0    0   .11
T   .11  .89   0    0   .89   0   .11   0

• Example:
Pr(CTAATCCG) = 0.67 x 0.89 x 1 x 1 x 0.89 x 1 x 0.89 x 0.11
How to search for a motif in a sequence
given a PSSM:
• Given sequence S (e.g., 1000 base-pairs long)
• For each substring s of S,
– Compute Pr(s|W)
– Report s as a match if Pr(s|W) > threshold
The threshold is calculated based on the probability of finding the motif
in random sequence, and it can be different for each motif.
Open question: What do we do when searching motifs in DNA?
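The scanning procedure can be sketched as a sliding window over the sequence. The example is an illustration under assumptions: the motif matrix is built from the four aligned sites TTCACG, TACATG, TACAGG and TACAAG used earlier, the input sequence is invented, the threshold of 0.01 is an arbitrary placeholder rather than one derived from random sequences, and only the given strand is scanned.

```python
BASES = "ACGT"

def frequency_matrix(sites):
    """Probability of each base at each motif position."""
    length = len(sites[0])
    return {base: [sum(site[k] == base for site in sites) / len(sites)
                   for k in range(length)]
            for base in BASES}

def motif_probability(s, W):
    """Pr(s | W) as the product of per-position probabilities."""
    p = 1.0
    for k, base in enumerate(s):
        p *= W[base][k]
    return p

def scan(sequence, W, threshold):
    """Return (start, substring, probability) for every window above threshold."""
    width = len(W["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        p = motif_probability(window, W)
        if p > threshold:
            hits.append((start, window, p))
    return hits

W = frequency_matrix(["TTCACG", "TACATG", "TACAGG", "TACAAG"])
hits = scan("GGTACATGCC", W, threshold=0.01)
print(hits)
```

Positions are reported 0-based. In practice a small pseudocount is often added to the matrix so that a single unseen base does not force the probability of a window to zero.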
Scenario 2 :
Binding motif is unknown
“Ab initio motif finding”
Why is it hard???
Are common motifs the right thing to search for?
Solutions:
-Searching for motifs which are enriched
in one set but not in a random set
- Use experimental information to rank
the sequences according to their binding
affinity and search for enriched motifs at
the top of the list
ChIP-Seq
Sequencing the regions in the genome to
which a protein (e.g. a transcription factor)
binds.
Finding the p53 binding motif in a set of p53 target sequences
which are ranked according to binding affinity
[Figure: ChIP-Seq target sequences ranked from best binders to weak binders]
DRIMust: a word search approach to search
for enriched motifs in a ranked list
[Figure: a ranked sequences list in which CTGTGA recurs, next to candidate k-mers such as CTGTGA, CTGTGC, CTATGC, CTACGC, ACTTGA, CTGTAC, ACGTGA, ATGTGC, ACGTGC, ATGTGA]
DRIMust (http://drimust.technion.ac.il/)
uses the minimal hypergeometric (mHG)
statistic to find enriched motifs
The mHG statistic uses four quantities:
- The total number of input sequences
- The number of sequences containing the motif
- The number of sequences at the top of the list
- The number of sequences containing the motif among the top sequences
[Figure: ranked sequences list with the occurrences of CTGTGA marked]
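The four quantities above can be combined into a hypergeometric tail probability, and the minimal hypergeometric score takes the smallest such probability over all possible cut-offs of the ranked list. The sketch below is a simplified illustration and not the DRIMust implementation: the ranked sequences are invented, the names N, B, n and b are labels chosen here, and the returned minimum is a raw score that would still need correction for having tried many cut-offs.

```python
from math import comb

def hypergeom_tail(N, B, n, b):
    """P(at least b motif-containing sequences among the top n),
    given N input sequences of which B contain the motif."""
    favourable = sum(comb(B, i) * comb(N - B, n - i)
                     for i in range(b, min(n, B) + 1))
    return favourable / comb(N, n)

def min_hypergeometric(ranked_sequences, kmer):
    """Smallest tail probability over all cut-offs n, and the cut-off reaching it."""
    contains = [kmer in seq for seq in ranked_sequences]
    N, B = len(contains), sum(contains)
    best_p, best_n = 1.0, 0
    b = 0
    for n in range(1, N + 1):
        b += contains[n - 1]          # motif occurrences among the top n
        p = hypergeom_tail(N, B, n, b)
        if p < best_p:
            best_p, best_n = p, n
    return best_p, best_n

# Hypothetical sequences, ordered from best to weakest binder.
ranked = ["AACTGTGAAA", "CTGTGATT", "GGCTGTGA", "ACGTACGT",
          "TTTTAAAA", "GGGGCCCC", "ACACACAC", "TGTGTGTG"]
score, cutoff = min_hypergeometric(ranked, "CTGTGA")
print(score, cutoff)
```

Here the k-mer occurs only in the three top-ranked sequences, so the best cut-off is after the third sequence.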
The enriched motifs are combined to get a
PSSM which represents the binding motif