CSC598BIL675-2016-L1x - Department of Computer Science

Biomedical Data Science
Sawsan Khuri, Ph.D.
Director of Engagement, Center for Computational Science
Assistant Research Professor, Dept of Computer Science
Stefan Wuchty, Ph.D.
Associate Professor, Dept of Computer Science
Biology
• An information science
– what, when, where
• Diverse data types
– molecular, cellular, system, individual, population
– Local, regional, global
• Cross disciplinary
– Great explorers
– Current explorers
Got Data, Want Info
1 – Creation and Curation of Databases
what goes into it, how should it be searched
2 – Data Analysis
statistics, inference, prediction
what is your question
3 – Tool Development
bought / freely available / in-house
What Data
Biodiversity – whole populations and species
species diversity and distribution, conservation,
systematics, ecology
Molecular – molecular sequences and structures
DNA: genes, genomes, regulation
Protein: sequence, structure, function
Medical (Health informatics) – patient data
Name, age, gender, ethnicity
Disease symptoms, treatment, progress
Ancient history…
Protein and nucleic acid sequence
Chromatographic and labeling methods began 1960s …
Protein Structure
X-ray crystallography methods improved in 1960s
protein structures began to be resolved
DNA sequencing
Gene cloning came in, manual sequencing methods began in late 1970s
In mid-late 1980s : Polymerase Chain Reaction and its automation
1985 saw the first proposals for the Human Genome Project (formally launched in 1990)
Soon we had automation of DNA sequencing
… fast forward
Humanity Communicates
Telecommunications went from Morse code in 1837
00010011000
To phones about 40 years later
It took another 100 years for computers to come in
1971: Birth of Email
1973: Internet becomes international (UCL)
1976: UNIX becomes routine in scientific community
1980: Cellphones, fax machines
WWW was released by CERN in 1991, was immediately put to good use by
the academic community, and enabled the big data world as we now
know it…
Number Crunching – Part 1
Abacus
Punch cards
WWII
Colossus
WITCH
Mainframe
Supercomputer
Number Crunching – Part IIa
Mainframe
Supercomputer
PC / Unix
Number Crunching – Part IIb
Mainframe
Supercomputers
PC / Unix
Clusters
Number Crunching – Part III
Mainframe
Supercomputers
PC / Unix
Clusters
WWW
Parallel Systems
Number Crunching – Part IV
http://www.galaxyzoo.org
http://www.iau.org
Mainframe
Supercomputers
PC / Unix
Clusters
WWW
Parallel Systems
Grid computing & Citizen Science
Number Crunching – Part V
Mainframe
Supercomputers
PC / Unix
Clusters
WWW
Parallel Systems
Grid computing & Citizen Science
Cloud Computing
Source: http://cyberpingui.free.fr/humour/evolution-white.jpg
Molecular Data
DNA: single genes, chromosomes, genomes
A, C, T, G (N)
gaagtatcataaacactcatcatatatatcatccaaataattgcagaaagaaaaagaaaa
tggtgatgatgagaatcttcttcttcctattcctcttggcctttccggtcttcactgcaa
atgcatcagtgaatgacttctgcgtggccaacggccctggagcccgcgacaccccgtcag
gcttcgtgtgcaagaataccgccaaagtcacagccgccgacttcgtctactccggcctgg
caaaacccggcaacaccaccaacatcatcaacgccgccgttactccggcgttcgtgggtc
RNA: DNA is transcribed to mRNA (regulated by snoRNA and miRNA)
Protein: mRNA is translated to polypeptide chains, and these
chains get folded into protein structures
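The two steps above can be sketched in a few lines of Python. This is a toy illustration, not something from the slides: the function names `transcribe` and `translate` are hypothetical, the input is treated as the coding strand (so transcription is just T to U), the standard genetic code is assumed, and unknown codons (for example those containing N) are reported as X.

```python
# Toy sketch of transcription and translation (names are illustrative only).
BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def transcribe(dna):
    """Return the mRNA for a DNA coding strand (assumption: T is replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("atggccaaataa")
print(mrna)             # AUGGCCAAAUAA
print(translate(mrna))  # MAK
```

Folding the polypeptide into a structure is a far harder problem and is not attempted here.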
Orders of Magnitude
Human average figures:
Gene: 10-15 Kb, huge variability
Chromosome: 50 x 10^6 - 250 x 10^6 nucleotides
Genome: ~3 billion nucleotides
Pattern recognition:
Genes within genomes
Repeat regions
Regulatory elements
©SawsanKhuri
Levels of Complexity
A Gene: A gene is made up of exons, which (sometimes) code for
protein, and introns, which (usually) do not.
Within “a gene”, there are also
• UTRs
• alternative splicing leading to transcript variants,
• alternative promoters,
• genes within the introns of other genes
• regulatory elements everywhere and anywhere
and there are intergenic regions, centromeres, telomeres…
Data Deluge
Then it became silly to continue counting…
Human Sequence Data With Added Value
Human Genome Project
HapMap project
SNP consortium
Individual genomes
+ non-sequence data that is relevant
+ every single major lab in the world
+ every single medium lab
+ every single small lab
+ non-human data that is relevant
"Here's my
sequence...”
New Yorker
Data Science
It’s about handling, manipulating, analysing, visualizing, and interpreting data.
So first you have to learn how to handle data files, and this is
where we will start.
Data Analysis
• The process of manipulating data in order to extract useful
information.
A good experiment can be ruined through bad analysis.
A good analysis can sometimes save a bad experiment.
Good tools are important.
Good people are crucial.
Algorithms
• An algorithm is a formula, a recipe
You need something done
compare two sequences
Devise a method of doing it
compare the bases one by one along both sequences
Create the algorithm
For two sequences of length n and m, compare the base at position 1 of the
first with the base at position 1 of the second, repeat for each position,
record the matches and the mismatches, add them up and divide by a fudge factor q.
Implement by writing the code that will execute your algorithm
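As a minimal sketch of that last step, the position-by-position comparison might look like the Python below. The function name `compare_sequences` is hypothetical, and since the slide leaves the fudge factor q unspecified, this sketch assumes q is simply the number of positions compared, which turns the result into a fraction of identical bases. Positions beyond the end of the shorter sequence are ignored.

```python
def compare_sequences(seq_a, seq_b):
    """Compare two sequences base by base, ignoring case.

    Returns (same, different, identity). Only positions present in both
    sequences are compared; any extra bases in the longer one are ignored.
    """
    same = 0
    different = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a == base_b:
            same += 1
        else:
            different += 1
    # Assumption: the "fudge factor" q is the number of positions compared.
    q = same + different
    identity = same / q if q else 0.0
    return same, different, identity


print(compare_sequences("ACGT", "acga"))  # (3, 1, 0.75)
```

A different choice of q (for example the length of the longer sequence) would penalize length differences, which is one reason the choice has to be made with the biological question in mind.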
Biologist or Computer Scientist
• When an algorithm doesn’t work
and you’ve checked that it isn’t the programmer’s fault
it could be the method
need gaps, sliding window, seqs of same length …
or the type of sequences that are being compared
or the question you are asking needs to be modified
Who created the algorithm? Did they have enough biological
knowledge? Did they understand how algorithms work?
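One of the method changes mentioned above, a sliding window for sequences of different length, can be sketched as follows. This is an illustrative assumption about what such a fix could look like, not the method used in the course: the hypothetical function `best_ungapped_offset` slides the shorter sequence along the longer one without gaps and reports the offset with the most matching bases.

```python
def best_ungapped_offset(query, target):
    """Slide query along target (no gaps) and return (offset, matches)
    for the placement with the most identical bases. Ties keep the
    leftmost offset."""
    if len(query) > len(target):
        raise ValueError("query must not be longer than target")
    query = query.upper()
    target = target.upper()
    best_offset, best_matches = 0, -1
    for offset in range(len(target) - len(query) + 1):
        window = target[offset:offset + len(query)]
        matches = sum(q == t for q, t in zip(query, window))
        if matches > best_matches:
            best_offset, best_matches = offset, matches
    return best_offset, best_matches


print(best_ungapped_offset("GTA", "ACGTAC"))  # (2, 3)
```

This still cannot handle insertions or deletions inside the sequences; that requires gaps, which is where proper alignment algorithms come in.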
Biomedical Algorithms
• May be applicable to different types of data
• Usually involve some type of approximation
exhaustive approach vs heuristic approach, i.e. one better suited to the
available resources of time and computational power
Existing algorithms work “well enough”,
or “strangely enough, they work!”
• Current research on making existing algorithms more
efficient, scalable, and more (or just as) accurate
Tool Development
• Enabling others to use your algorithm
Too many good methods lie dormant in journal articles
Some for good reason
Others because no one has developed a tool that packages them
into something a biologist can use
I want to click on a button and get the answer immediately.
Interpretation
Course Objectives
To provide students with the computational skills needed
for analysis of genomic data sets.
1. Manipulate data files in Unix/Linux
2. Work in an HPC environment
3. Write scripts in Python that are relevant to genomic data
analysis
4. Gain a deeper knowledge of biomedical data types and the
commonly used bioinformatics algorithms
5. Apply above skill set in a genomics data analysis project
6. Interpret and present the results of your project
In this course you will also
• Have fun
• Gain 3 Credits
• Learn, Achieve, Evolve
… but you will not
• Create a new algorithm
• Develop a new software tool
• Leave disappointed
#gdbk, not required
Another #gdbk
What you have to do
• Come to class
• Be an engaged student
• Read assigned papers
• Submit assignments on time
First Homework
• Download the Emacs editor
• Play with it a little
• We will go through it together on Thursday
If you already use an editor, let us know which one. We may still ask you to
learn this one, since the class has to be graded in a comparable manner.