
EMBL-EBI
Visualisation & Data mining
Visualisation
 The process of representing abstract data to aid in understanding the meaning of the data.
 Not to be confused with rendering data (drawing pictures)
 Typically, though, we render data in such a way as to visualise the information within that data.
Introduction
 Biological data comes from & is of interest to:
 Chemists : reaction mechanism, drug design
 Biologists : sequence, expression, homology, function
 Structural biologists : atomic structure, fold, classification, function
 Medicine : clinical effect
 Education :
 Media :
 Presentation of diverse information to a diverse audience.
 Each has their own point of view (context).
 Expert = scientist working within their own field of expertise
 Non-expert = scientist using data/information outside their field
 Novice = Non-scientist
Not just presentation of results
 Web pages
 These are notoriously badly designed, often resulting in the information on that site being unusable.
 The front page should load quickly
 The main point should appear on the first full screen
 Clutter – not logically laid out
 Too busy – cannot find the salient point
 8% of men & 0.5% of women are colour blind
 Bad text/fonts
 Too often it doesn’t work
 User will go somewhere else
Google is a good design
 The latest wiz-bang stuff only works on the latest browsers
 Only works in one browser – they only tested on one.
– Does not conform to standard HTML
Asking questions
 Asking questions
 Biological data is very complex
 Chemistry, Biology, Physics, Statistics, Medicine..
 Most users will be from a different field
 Asking the right question is difficult.
 The user cannot use the correct terminology
 Too many things to query (2000 attributes in MSD)
 SQL : not suitable for most users
 Interface too complex
 Too many check boxes, widgets etc
 Trying to be too clever
 The “Go” button is buried somewhere
Result presentation
 Results
 Biological data is complex
 Chemistry, physics, biology, statistics, medicine…
 Expert users want all the detail
 i.e. they want to use a specific method
 They want all the details
 They want (I hope) the statistical validity of the results
 The non-expert wants the best-practice answer returned within their own context.
 They want comparative analysis with other fields
 They want to know the results are valid
Query design
 The simple text box design is very common
 Suitable for text queries
 Only one logic: AND or OR
 Predefined
 Easy to use
 Limited scope
 2000 attributes -> 2000 check-boxes !
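A single-logic text query of the kind described above can be sketched in a few lines. This is only an illustration, not the search behind any EBI service: the function name `text_query` and the sample entry titles are invented for the example, and matching is assumed to be a plain case-insensitive substring test.

```python
def text_query(records, terms, logic="AND"):
    """Return the records whose text matches the terms under ONE logic.

    The whole query uses a single operator (AND or OR); mixing them
    is exactly what the simple text box cannot express.
    """
    if logic not in ("AND", "OR"):
        raise ValueError("logic must be 'AND' or 'OR'")
    combine = all if logic == "AND" else any
    wanted = [t.lower() for t in terms]
    return [r for r in records if combine(t in r.lower() for t in wanted)]


# Hypothetical entry titles, for illustration only.
entries = [
    "lysozyme from hen egg white",
    "serine protease with bound inhibitor",
    "lysozyme mutant with bound inhibitor",
]
and_hits = text_query(entries, ["lysozyme", "inhibitor"], logic="AND")
or_hits = text_query(entries, ["lysozyme", "inhibitor"], logic="OR")
```

With AND only the last entry matches; with OR all three do. Easy to use, but the scope is limited to whatever one operator can say.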
Query design
 Graphical interface
 Multiple logic: AND/OR/NOT
 Under the user's control
 Slower
 Steep learning curve
 Some users just cannot get it
 Intuitive once mastered
 Pretty
Query design
 Figurative 2D sketch for 3D query (Active sites)
 Informative – presents meaning for the question
 Slower
 Less error prone
HIS|SER:S/H>C2.0
HIS.ne2:S/S>C2.0
HIS.[n]/T>C2.0
YAMGP (yet another molecular graphics program)
 Many different programs are available
 AstexViewer@MSD-EBI
 LigPlot
 VMD
 Quanta
 InsightII
 Bobscript
 WebMol
 Frodo
 iMol
 Chime
 Grasp
 Pymol
 POVRay
 Spock
 Rasmol
 Mage
 Raster3D
 Yasara
 Molscript
 Chimera
 O
 MolMol
 Whatif
 XtalView
 WebLab-viewer
 Swiss-PDBviewer
Result visualisation
 Multiple types of biological data
 Textual data
 3D structure
 2D chemical sketches
 1D sequence
 Node linked
 General/derived data
 Web pages
 Time
 Errors/Variance
Patented !
Visualisation : AstexViewer@MSD-EBI
 Visualisation
 Structure/sequence/data
 Lensing
 Linked views
 Brushing
 Picking
 Flying views
 Hyperbolic distortion
 Animation
 Solid rendering
 Depth cues
 Colour, lighting
 Highlighting
 Etc…
Visualisation : comparative analysis
 Similarity/Difference
 Data superposition
 Attribute display
 Colour, size…
 Correlation
 Attribute mapping
 Sequence colour by structure alignment
Analysis Example
Animation
 Animation
 Time dependent display
 Reaction chemistry
 Visual clues.
 Expression data
 Shown as…
 Rotation
 Flash
 On/off
 Object Synchronization
 Size, Colour….
 Sound
 NO : incredibly annoying
Animation Example
Multidimensional analysis
 Comparative analysis on multiple data
 E.g. Phi, Psi, B-value, Omega
 1D & 2D easy
 3D graphs are difficult to see.
 4D requires 3D + iso-surfaces
 Higher – too busy
 Use 2D + multiple properties
 SPOTFIRE is the most well known
 Use : X/Y/Colour/size/shape…
 Interactive bracketing
Example
Visualisation : Summary
 Rendering data is not visualisation
 Not just the display of results
 Huge array of non-specific techniques – an entire scientific field !
Data mining
 “Analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.” (Hyperdictionary)
 “True data mining software does not just change the presentation, but discovers previously unknown relationships among the data.” (IBM)
Data mining & Data analysis
 Traditional analysis is via “verification-driven analysis”
 Requires hypothesis of the desired information (target)
 Requires correct interpretation of proposed query
 Discovery-driven data mining
 Finds data with common characteristics
 Results are ideal solutions to discovery
 Finds results without previous hypothesis
 Results have unbiased mean and variance
So what is hypothesis-driven data analysis ?
 Define a target = hypothesis
 Search for target
 There are/are-not “hits”
 Verify/negate hypothesis
 Distribution is centred on target
“catalytic triad” : text string matching
Atomic coordinates : coordinate superposition
Mathematical graph : graph matching
HIS,ASP,SER : data hierarchy knowledge
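The define / search / verify loop above can be sketched as a simple set-containment search. This only illustrates hypothesis-driven analysis at the residue-name level; the site names and residue lists are hypothetical, the helper `find_hits` is invented for the example, and a real search would also have to check geometry (superposition or graph matching).

```python
# The hypothesis (target): a site containing HIS, ASP and SER.
TARGET = {"HIS", "ASP", "SER"}


def find_hits(sites, target=TARGET):
    """Return the names of sites whose residues include every target residue."""
    return [name for name, residues in sites.items() if target <= set(residues)]


# Hypothetical site annotations, invented for illustration.
sites = {
    "site_A": ["HIS", "ASP", "SER", "GLY"],
    "site_B": ["HIS", "CYS", "ASN"],
    "site_C": ["SER", "HIS", "ASP"],
}

hits = find_hits(sites)
# There are / are not "hits": this is all the search can tell us.
hypothesis_supported = len(hits) > 0
```

Note that the search can only ever return things that look like the target, which is why the resulting distribution is centred on it.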
Four types of data mining
 Creation of predictive models : future data expectation
 Link analysis : connections between data objects
 Database segmentation : classification
 Deviation detection : finding outliers.
IBM : white papers
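As a rough sketch of the fourth type, deviation detection, the snippet below flags values that lie far from the mean. The z-score rule and the threshold of 2 standard deviations are assumptions chosen for illustration (the slide does not name a method), and the values are made up.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing can deviate.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Invented measurements: six similar values and one that is not.
values = [1.9, 2.0, 2.1, 2.0, 1.8, 2.2, 5.0]
outliers = find_outliers(values)
```

A single extreme value also inflates the mean and standard deviation it is judged against, so on real data a median-based rule may be a safer choice.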
So what is this data mining ?
 Given multiple sets of primary data (dependent variables)
 Characters, numbers, Function(numbers),….
 Find anomalies
 Too many : numerical occurrence
 Data variation : Derivatives
 Singularities
 …..
 Correlations and clusters
 Within primary data
 with other data (independent variables)
Finds new things ! But not what it means !
Example
 The retail and financial industries are heavily into data mining (DM).
 A well known US food supermarket chain found a correlation :
 Babies' nappies
 Beer
 5pm on Friday
 Wife rings husband, “get some nappies for the weekend”
 Husband takes opportunity to buy some beer !
You won’t get grant funding to test this hypothesis !
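The nappies-and-beer story is a link analysis result. Here is a minimal sketch of how such a link might be scored, using the usual support / confidence / lift measures from association-rule mining. The baskets are invented, and the function name `association` is hypothetical.

```python
def association(baskets, a, b):
    """Score the link 'a -> b' over a list of baskets (sets of items).

    support    : fraction of baskets containing both a and b
    confidence : fraction of baskets with a that also contain b
    lift       : confidence divided by the overall frequency of b
                 (> 1 means a and b occur together more than chance)
    """
    n = len(baskets)
    n_a = sum(1 for basket in baskets if a in basket)
    n_b = sum(1 for basket in baskets if b in basket)
    n_ab = sum(1 for basket in baskets if a in basket and b in basket)
    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0
    lift = confidence / (n_b / n) if n_b else 0.0
    return support, confidence, lift


# Invented Friday-evening baskets, for illustration only.
baskets = [
    {"nappies", "beer"},
    {"nappies", "beer", "bread"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"bread", "milk"},
    {"nappies", "beer", "milk"},
]
support, confidence, lift = association(baskets, "nappies", "beer")
```

The numbers say that the two items are linked; they say nothing about the wife, the husband or the weekend. That explanation is a hypothesis added afterwards.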
Self/Cross data mining
 Most mining software looks for correlations between dependent variables.
 Rainfall, temperature, cloud-cover
 It rains when it is cloudy
 Free : http://www.cs.waikato.ac.nz/~ml/
 Bioinformatics usually involves anomalies within data objects
 Sequence clusters (sequence fingerprints)
 Local coordinate clusters (active sites)
 Global coordinate cluster (folds)
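Looking for a correlation between variables such as cloud cover and rainfall can be sketched with a Pearson correlation coefficient. The choice of Pearson's r is an assumption (the slide does not say which measure the software uses), and the weather readings are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Invented daily readings, for illustration only.
cloud_cover = [10, 30, 50, 70, 90]      # percent
rainfall = [0.0, 0.5, 2.0, 4.0, 7.5]    # mm
r_cloud_rain = pearson(cloud_cover, rainfall)
```

A value of r close to 1 is the numerical form of "it rains when it is cloudy".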
Data mining – not idiot proof
 Date of birth and age will give 100% correlation
 Authors of a structure submission will be correlated with the authors of the primary citation.
 “Lysozyme” is the most common fold pattern
 36 spellings of E. coli will mask results.
 Requires representative sets
 Statistically valid ones too !
 Signal/Noise ratio is a problem
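Spelling variants can be collapsed before mining so that they count as one value. This is a minimal sketch, assuming the variants differ only in case, punctuation and spacing; the helper `normalise_name` and the list of names are invented for the example.

```python
import re
from collections import Counter


def normalise_name(name):
    """Collapse case, punctuation and spacing so spelling variants compare equal."""
    return re.sub(r"[^a-z]", "", name.lower())


# Hypothetical free-text source-organism fields.
raw = ["E.Coli", "E. coli", "e coli", "E.COLI", "Homo sapiens", "homo sapiens"]

before = Counter(raw)                              # every variant is its own value
after = Counter(normalise_name(n) for n in raw)    # variants merged
```

Six apparent values become two. This simple rule would not merge "Escherichia coli" with "E. coli"; that needs a synonym table.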
Discovery driven data mining of the PDB
 Analysis of 3-dimensional coordinates
 Defined common patterns of atomic interactions locally
 DB segmentation - active sites & common packing features
 Link analysis - Similarity between different functional groups
 Defined globally
 DB segmentation - common patterns of super-secondary structure
 Link analysis - common folds in diverse protein families
 Outlier detection - unique folds
Issues
 Systematic “error” propagates as solution
 300 lysozyme structures return as a strong solution
 Results cannot be found below the noise level
 Need to characterise the noise level
 Need to improve signal/noise ratio (S/N) to see information
 Target is not biologically defined
 It does not give you the biological answer
 Results should reproduce known biology
 Can give you new results not previously observed
Data selection
 Cannot leave in 300 lysozyme structures !
 Select by sequence similarity at 70% exact alignment
 Different “phase space” to select data
 Remove structures with resolution poorer than 2.5 Å
 Remove NMR (different statistics)
 Remove pre-1982 etc.
 Geometrical analysis criteria to check for outliers, using properties that are NOT target parameters of structure solution
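The selection rules above could be expressed as a simple filter. This is a sketch only: the `Entry` fields, identifiers and values are hypothetical, and the resolution rule is read as "discard structures poorer than 2.5 Å". Sequence-similarity clustering and the geometric outlier checks are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    entry_id: str                 # hypothetical identifiers, not real PDB codes
    method: str                   # "X-RAY" or "NMR"
    resolution: Optional[float]   # in Angstrom; None when not applicable (NMR)
    year: int                     # deposition year


def select(entries, max_resolution=2.5, min_year=1982):
    """Keep X-ray entries at max_resolution or better, deposited in min_year or later."""
    kept = []
    for e in entries:
        if e.method != "X-RAY":
            continue              # remove NMR (different statistics)
        if e.resolution is None or e.resolution > max_resolution:
            continue              # remove structures poorer than the cut-off
        if e.year < min_year:
            continue              # remove pre-1982 entries
        kept.append(e)
    return kept


entries = [
    Entry("ent1", "X-RAY", 1.8, 1995),
    Entry("ent2", "X-RAY", 3.2, 1998),
    Entry("ent3", "NMR", None, 1999),
    Entry("ent4", "X-RAY", 2.0, 1979),
    Entry("ent5", "X-RAY", 2.5, 1982),
]
selected_ids = [e.entry_id for e in select(entries)]
```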
Local atomic interactions
 Data
 Function(3D coordinates) = distance
 Atom names (independent variable)
 Residue names (independent variable)
 Create 3D hash table of triplets of distances(*) between “points”
 This is the dependent variable
 Order = 3
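One possible reading of the "3D hash table of triplets of distances" is a dictionary keyed on the three binned pairwise distances of every triplet of points. The sketch below follows that reading; it is not the original implementation. The bin width, the function name `triplet_table`, the point labels and the coordinates are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def triplet_table(points, bin_width=0.5):
    """Index every triplet of labelled points by its three binned distances.

    points : list of (label, (x, y, z)) pairs
    key    : the three pairwise distances, sorted and rounded to the
             nearest multiple of bin_width (stored as bin indices)
    value  : list of label triplets that share that geometry
    """
    table = defaultdict(list)
    for (la, a), (lb, b), (lc, c) in combinations(points, 3):
        distances = sorted((dist(a, b), dist(b, c), dist(a, c)))
        key = tuple(int(round(d / bin_width)) for d in distances)
        table[key].append((la, lb, lc))
    return table


# Two invented 3-4-5 triangles placed far apart.
points = [
    ("A", (0.0, 0.0, 0.0)), ("B", (3.0, 0.0, 0.0)), ("C", (0.0, 4.0, 0.0)),
    ("D", (10.0, 10.0, 10.0)), ("E", (13.0, 10.0, 10.0)), ("F", (10.0, 14.0, 10.0)),
]
table = triplet_table(points)
```

Triplets with the same shape land in the same bucket wherever they sit in space, so recurring local geometry shows up as a well-populated key. The labels (atom or residue names) ride along as the independent variables.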
Local atomic interactions
 Merge triplets
 Any pair of N-fold interactions forms an (N+1)-fold interaction if they share (N-1) equivalent members.
 Order = N
 Just keep going until no more (N+1)-fold interactions are found.
 Time = 8 seconds to find ~2000 interactions (Digital Alpha ES40)
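The merging rule can be sketched by treating each interaction as a set of labels and repeatedly joining any two N-member sets that share N-1 members. This applies the rule literally as stated on the slide; it is not the original code. The residue labels are hypothetical and the function name `grow` is invented.

```python
from itertools import combinations


def grow(interactions):
    """Merge N-fold sets sharing N-1 members into (N+1)-fold sets until none remain.

    All input sets are assumed to have the same size.
    Returns one set of interactions per order, smallest order first.
    """
    current = {frozenset(i) for i in interactions}
    levels = []
    while current:
        levels.append(current)
        n = len(next(iter(current)))
        merged = set()
        for a, b in combinations(current, 2):
            if len(a & b) == n - 1:
                merged.add(a | b)    # the union has exactly n + 1 members
        current = merged
    return levels


# Hypothetical triplets (order 3), for illustration only.
triplets = [
    {"HIS", "ASP", "SER"},
    {"HIS", "ASP", "GLY"},
    {"HIS", "SER", "GLY"},
    {"ASP", "SER", "GLY"},
    {"LYS", "GLU", "TYR"},
]
levels = grow(triplets)
largest = levels[-1]
```

Here the four overlapping triplets merge into a single quartet and the unrelated triplet is left behind. A stricter version would require every N-member subset of a candidate to be present before accepting it.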
Catalytic quartet
Electrostatic interaction
Ligands are found close by rather than associated with the residues
Iron binding site
Double disulphide
N-linked glycosylation binding site +
 Spot the non-sugar
 This glycosylation site is the same as the active site found in “1a53” – indole-3-glycerol phosphate synthase
Summary
 Nearly all bioinformatics is based on hypothesis-driven data analysis
 Data mining has lost its meaning within Bioinformatics.
 Discovery driven data-analysis (true data mining) :
 Can find unknown dependencies, clusters, outliers
 Is based on statistical probability
 Returns distributions unbiased by previous ideas
 Information theory may be better for genomes (1D)
 “A numerical measure of the uncertainty of an outcome”
 Information content of gene sequences can be defined by the normalized probability of finding “words” within that sequence
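One way to turn "the normalized probability of finding words" into a number is the Shannon entropy of the word frequencies, H = -sum(p * log2(p)). The sketch below assumes that measure and assumes overlapping fixed-length words (k-mers); neither choice is stated on the slide, and the function name `word_entropy` is invented.

```python
from collections import Counter
from math import log2


def word_entropy(sequence, k=3):
    """Shannon entropy (bits) of the overlapping k-letter words in a sequence."""
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not words:
        raise ValueError("sequence is shorter than the word length k")
    counts = Counter(words)
    total = len(words)
    # Each p is the normalized probability of finding that word.
    return -sum((c / total) * log2(c / total) for c in counts.values())


uniform = word_entropy("ACGT", k=1)           # four equally likely letters
repetitive = word_entropy("AAAAAAAA", k=3)    # a single word repeated
```

Four equally likely letters give 2 bits per letter, the maximum for a four-letter alphabet; a sequence made of one repeated word gives 0, since its outcome is never uncertain.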