Genome Wide Visualization and Integration

Database Technologies
Plotting Omics Associations
Jake Lin
GEBI 2014
Database: Motivation
• Bioinformatics is a data-driven science
• Systems Biology
– *Omics
• Human Genome Sequence
• ENCODE
• Model Organisms
– Web + Cloud Abstraction
• Persistence + Organization
• Provenance + Discovery
• Integration
DBMS Timeline
• A Database Management System Primer
– Early 1960s Hierarchical
– Mid 1960s Network
– 1969 Relational*
– 1980s Object Oriented
• CAD, complex nested objects
• No separation between language/db
– 1990s – Internet
• HTTP
– 2009 NoSQL*
Hierarchical
• Tree – Pyramid
– Manufacturing/Suppliers
– Root
– Leaf Nodes
– Fast access/Update
• Limitation
– Difficult to relate branches
• ChildX to ChildY
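The tree structure and its limitation can be sketched in a few lines of Python. This is only an illustration of the idea, not any particular hierarchical DBMS; the class, the function, and the supplier and part names are all hypothetical. Walking from root to leaf is direct, while relating ChildX to ChildY means climbing both branches back to a shared ancestor.

```python
class Node:
    """One record in a tree-shaped store; each node has exactly one parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path_from_root(self):
        """Return the node names on the path from the root down to this node."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))


def common_ancestor(a, b):
    """Relate two nodes by finding their deepest shared ancestor.

    Assumes node names are unique within the tree.
    """
    shared = None
    for x, y in zip(a.path_from_root(), b.path_from_root()):
        if x != y:
            break
        shared = x
    return shared


# Hypothetical manufacturing hierarchy.
root = Node("manufacturer")
supplier_x = Node("supplier_x", root)
supplier_y = Node("supplier_y", root)
child_x = Node("part_x", supplier_x)
child_y = Node("part_y", supplier_y)

print(child_x.path_from_root())
print(common_ancestor(child_x, child_y))
```

The only link between the two parts is the root, which is why cross-branch questions are awkward in this model.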
Network
• Flexible Hierarchical
– Children with X Parents
• Members -> Owners
– Loose Graph
• Not Declarative
Relational SQL: Principles
• Relational Algebra/Set Theory
– Same Math properties
– Union, Intersection …
• Schema
– Relational model
• Everything (entity + relationship) is defined as a column in a table
– Tuple
» {'key','name','role',…}
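The set-theory basis can be tried directly with Python sets of tuples. This is a rough sketch: the two relations and their values are made up, and a real DBMS evaluates these operators over tables rather than in-memory sets.

```python
# Two relations with the same columns: (key, name, role). Values are hypothetical.
staff = {(1, "ada", "curator"), (2, "bo", "analyst")}
authors = {(2, "bo", "analyst"), (3, "cy", "developer")}

# Set operators behave exactly as in set theory.
union = staff | authors
intersection = staff & authors
difference = staff - authors

# Projection keeps a subset of the columns.
names = {name for (_key, name, _role) in union}

# Selection keeps the tuples that satisfy a predicate.
analysts = {row for row in union if row[2] == "analyst"}

print(sorted(union))
print(sorted(names))
```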
SQL Vocabulary
• Table
– Columns
• Key/Types
– Primitives/VARCHAR/blobs
– Relationships/Constraints
• Update
– Procedure
– Trigger
• Select
– Joins + Views
ACID: Relational SQL DBMS
• ACID – http://en.wikipedia.org/wiki/ACID
– Atomicity
– Consistency
– Isolation
– Durability
• Guarantee that transactions are processed
fully/reliably
– Table A/Table B/Table C
• Cascade Rollback
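Atomicity across several tables can be demonstrated with Python's built-in sqlite3 module. This is a minimal sketch under assumed names: table_a, table_b, and table_c are placeholders for the Table A/B/C on the slide, and the NOT NULL violation is simply a convenient way to make the third step fail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER PRIMARY KEY, note TEXT)")
conn.execute("CREATE TABLE table_b (id INTEGER PRIMARY KEY, note TEXT)")
conn.execute("CREATE TABLE table_c (id INTEGER PRIMARY KEY, note TEXT NOT NULL)")
conn.commit()

failed = False
try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute("INSERT INTO table_a (id, note) VALUES (1, 'step one')")
        conn.execute("INSERT INTO table_b (id, note) VALUES (1, 'step two')")
        # This insert violates NOT NULL, so the whole transaction is undone.
        conn.execute("INSERT INTO table_c (id, note) VALUES (1, NULL)")
except sqlite3.IntegrityError:
    failed = True

counts = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("table_a", "table_b", "table_c")
}
print(failed, counts)
```

The two inserts that succeeded are undone along with the one that failed, so no table is left half-updated.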
Relational SQL: Normalization
• Normalization*
– Data organization rules to prevent insert, update, and delete anomalies
– Eight “Forms” - Theory
– http://michaelmclaughlin.info/db1/lesson-3-modeling-data/normalization/
– http://researcher.watson.ibm.com/researcher/files/us-fagin/tods81.pdf
Relational SQL: Market leaders
• Commercial
– Oracle – ~$24 Billion
– IBM – DB2
– Microsoft – SQL Server
– Sybase
– Apple – FileMaker
• Open source
– MySQL – most popular – 65K downloads per day
– PostgreSQL – most advanced
– SQLite – most widely deployed
• Gadgets - Smart Phones
• Light Apps
Relational SQL: Example Apps
• MySQL
– Most web companies + startups
• SQLite
– Gadgets
• iPhone
• PostgreSQL
– BASF
– Affymetrix
– Governments
Relational SQL: Schema
• Schema*
– Relational model Example
• Tuple
– {'gene_key','gene_alias','gene_chr','gene_start','gene_end',…}
• Math : Set theory
• Columns {Types}:{byte, (var)char, int, blob}
• Querying
– Joins – something in common
– Inner/Outer …
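The gene tuple above could be declared as a typed table along the following lines. This is a sketch in SQLite through Python's sqlite3 module: the column types, the second annotation table, and all inserted rows are assumptions made for illustration. It also shows the difference between an inner and an outer join on a common key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    gene_key   INTEGER PRIMARY KEY,
    gene_alias VARCHAR(32) NOT NULL,
    gene_chr   VARCHAR(8)  NOT NULL,
    gene_start INTEGER     NOT NULL,
    gene_end   INTEGER     NOT NULL
);
-- Hypothetical second table that shares gene_key with gene.
CREATE TABLE annotation (
    annotation_key INTEGER PRIMARY KEY,
    gene_key       INTEGER REFERENCES gene(gene_key),
    label          VARCHAR(64)
);
INSERT INTO gene VALUES (1, 'geneA', 'chr1', 100, 900);
INSERT INTO gene VALUES (2, 'geneB', 'chr2', 500, 1500);
INSERT INTO annotation VALUES (10, 1, 'placeholder label');
""")

# Inner join: only genes that have a matching annotation.
inner = conn.execute("""
    SELECT g.gene_alias, a.label
    FROM gene AS g
    INNER JOIN annotation AS a ON a.gene_key = g.gene_key
    ORDER BY g.gene_key
""").fetchall()

# Left outer join: every gene, with NULL where no annotation exists.
outer = conn.execute("""
    SELECT g.gene_alias, a.label
    FROM gene AS g
    LEFT OUTER JOIN annotation AS a ON a.gene_key = g.gene_key
    ORDER BY g.gene_key
""").fetchall()

print(inner)
print(outer)
```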
Relational SQL: API
• Drivers – ODBC/JDBC
– Interface between programming language and DB
– Python packages
• MySQLdb
– Java JDBC jar
– Perl/C++…
• Model View Controller – App Interface
– Separation/(HTTP/AJAX)
• Create (Put)
• Read (Get)
• Update (Post)
• Delete (Post)
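The four operations can be sketched through a Python database driver. The example below uses the built-in sqlite3 module as a stand-in, since it runs without a server; MySQLdb follows the same DB-API 2.0 pattern, although its placeholder is %s rather than ?. The gene table and the function names are assumptions for illustration, and the HTTP verbs in the comments follow the mapping on the slide.

```python
import sqlite3


def create(conn, alias, chrom):
    """Create (Put): insert a row and return its new key."""
    cur = conn.execute(
        "INSERT INTO gene (gene_alias, gene_chr) VALUES (?, ?)", (alias, chrom)
    )
    conn.commit()
    return cur.lastrowid


def read(conn, key):
    """Read (Get): fetch one row by key, or None if it is missing."""
    return conn.execute(
        "SELECT gene_key, gene_alias, gene_chr FROM gene WHERE gene_key = ?",
        (key,),
    ).fetchone()


def update(conn, key, alias):
    """Update (Post): change a column and return the number of rows touched."""
    cur = conn.execute(
        "UPDATE gene SET gene_alias = ? WHERE gene_key = ?", (alias, key)
    )
    conn.commit()
    return cur.rowcount


def delete(conn, key):
    """Delete (Post): remove a row and return the number of rows removed."""
    cur = conn.execute("DELETE FROM gene WHERE gene_key = ?", (key,))
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_key INTEGER PRIMARY KEY, gene_alias TEXT, gene_chr TEXT)"
)

key = create(conn, "geneA", "chr1")
before = read(conn, key)
updated = update(conn, key, "geneA_renamed")
after = read(conn, key)
deleted = delete(conn, key)
gone = read(conn, key)
print(before, after, gone)
```

Passing values as parameters instead of formatting them into the SQL string lets the driver handle quoting.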
Relational SQL: Querying
• Selects – set theory
– Indexing
– Keys
• Syntax*
– SELECT * FROM TableA AS A WHERE A.cA1 = 'someValue'
• Joins – Foreign Keys
– Inner and Outer Joins
– SELECT A.cA1, B.cB1 FROM TableA AS A, TableB AS B WHERE A.key = B.key
– Unions, Intersects
• Common Attributes
– Traversals
• 2 joins
• 3 joins
• 4 joins …
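The query forms above can be run against a throwaway SQLite database. TableA, TableB, cA1, cB1, and key are the names used on the slide; the index name and the rows are invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (key INTEGER PRIMARY KEY, cA1 TEXT);
CREATE TABLE TableB (key INTEGER REFERENCES TableA(key), cB1 TEXT);
-- An index on the filtered column lets the select avoid a full table scan.
CREATE INDEX idx_tablea_ca1 ON TableA (cA1);
INSERT INTO TableA VALUES (1, 'someValue'), (2, 'otherValue');
INSERT INTO TableB VALUES (1, 'b-one'), (3, 'b-three');
""")

# Simple select with a parameterized filter.
selected = conn.execute(
    "SELECT * FROM TableA AS A WHERE A.cA1 = ?", ("someValue",)
).fetchall()

# Inner join on the shared key.
joined = conn.execute("""
    SELECT A.cA1, B.cB1
    FROM TableA AS A
    INNER JOIN TableB AS B ON A.key = B.key
""").fetchall()

# Set operators over a common attribute.
shared_keys = conn.execute(
    "SELECT key FROM TableA INTERSECT SELECT key FROM TableB"
).fetchall()
all_keys = conn.execute(
    "SELECT key FROM TableA UNION SELECT key FROM TableB ORDER BY 1"
).fetchall()

print(selected, joined, shared_keys, all_keys)
```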
Relational SQL: Limitations
• Not all data structures map well onto the relational model
– Many joins for one query
• To get a few tuples
• Self joins (n) …
– Relationship between rows stored in the same table
– Complex Objects
• Stored Procedures/Triggers
• Blobs
– Data Stream/File
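The self-join cost can be seen with a small parent/child table. This is a sketch with a hypothetical term table; each extra level of the relationship needs one more join of the table against itself, which is the pattern that grows awkward for deep or graph-like data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Rows relate to other rows of the same table through parent_key.
CREATE TABLE term (
    term_key   INTEGER PRIMARY KEY,
    name       TEXT,
    parent_key INTEGER REFERENCES term(term_key)
);
INSERT INTO term VALUES (1, 'root', NULL), (2, 'branch', 1), (3, 'leaf', 2);
""")

# Two levels up (grandparent) already takes two self joins.
lineage = conn.execute("""
    SELECT c.name, p.name, g.name
    FROM term AS c
    JOIN term AS p ON c.parent_key = p.term_key
    JOIN term AS g ON p.parent_key = g.term_key
""").fetchall()

print(lineage)
```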
NoSQL Foundation + Motivation
• 1998/2009 NoSQL
• Not Only SQL
• Big Data
– Web Scale + Simplicity
• Insert + Retrieve – Distributed across Cloud
– Complex relationships
• Graphs
NoSQL Technologies
• Key-Value Store
– {p,v} distributed across machines
• Column Family Store
– Key-Value where keys are in families
• Document-Oriented Databases
– Collection of Key-Values in Documents
• Graph DB
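The first three data shapes can be compared side by side using plain Python structures. This is only a picture of how the data is organized, not how any particular product stores it; every key, field, and value below is hypothetical.

```python
# Key-value store: one opaque value per key.
key_value = {"gene:1": b"serialized record"}

# Column-family store: row key -> column family -> column -> value.
column_family = {
    "gene:1": {
        "location": {"chr": "chr1", "start": 100, "end": 900},
        "names": {"alias": "geneA"},
    }
}

# Document store: a collection of self-describing, possibly nested documents.
# Documents in one collection do not need identical fields.
documents = [
    {"_id": 1, "alias": "geneA", "location": {"chr": "chr1", "start": 100}},
    {"_id": 2, "alias": "geneB", "tags": ["placeholder"]},
]


def find(collection, **criteria):
    """Return the documents whose top-level fields match all criteria."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


print(key_value["gene:1"])
print(column_family["gene:1"]["location"]["chr"])
print(find(documents, alias="geneB"))
```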
Graph DB Foundation
– Graph DB
• Nodes (vertices)
• Edges (degrees)
• Properties (key-value)
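A property graph made of these three pieces can be modeled in a few lines. The class names, the relationship kind, and the property values are assumptions for the sketch and do not correspond to any graph database's API.

```python
from dataclasses import dataclass, field


@dataclass
class GraphNode:
    """A vertex carrying arbitrary key-value properties."""
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A typed relationship between two nodes, with its own properties."""
    source: int
    target: int
    kind: str
    properties: dict = field(default_factory=dict)


# Hypothetical data.
nodes = {
    1: GraphNode(1, {"label": "gene", "alias": "geneA"}),
    2: GraphNode(2, {"label": "gene", "alias": "geneB"}),
    3: GraphNode(3, {"label": "gene", "alias": "geneC"}),
}
edges = [
    Edge(1, 2, "ASSOCIATED_WITH", {"score": 0.5}),
    Edge(1, 3, "ASSOCIATED_WITH", {"score": 0.25}),
]


def degree(node_id):
    """Number of edges that touch the node, in either direction."""
    return sum(1 for e in edges if node_id in (e.source, e.target))


print(degree(1), degree(2))
```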
Neo4j Graph DB
• Manages
• Records data
– Nodes
– Relationships
• Properties
– Belong
» Nodes
» Relationships
– INDEX
» Look Up
• Traversal
• PATH
– Look Up
Watch + Learn
http://player.vimeo.com/video/50787208
Relationships and Degrees
Kevin Bacon
Paul Erdős
Natalie Portman
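Degrees of separation, as in the Bacon and Erdős numbers, amount to a shortest-path count over relationships. A breadth-first search sketch follows; the graph is placeholder data with made-up node names, not real film or co-authorship records.

```python
from collections import deque


def degrees_of_separation(graph, start, goal):
    """Return the fewest relationships linking start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for neighbor in graph.get(person, ()):
            if neighbor == goal:
                return distance + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


# Hypothetical undirected collaboration graph (adjacency lists both ways).
graph = {
    "person_a": ["person_b"],
    "person_b": ["person_a", "person_c"],
    "person_c": ["person_b", "person_d"],
    "person_d": ["person_c"],
    "person_e": [],
}

print(degrees_of_separation(graph, "person_a", "person_d"))
print(degrees_of_separation(graph, "person_a", "person_e"))
```

Breadth-first search visits nodes in order of distance, so the first time the goal appears, the count is the minimum.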
POMO: Plotting Genomic Associations
– http://pomo.cs.tut.fi
– Web Viz Tool
• Circos light (http://circos.ca)
• No installs or dependencies
– Associations + Annotations
– Human
– Mouse
– Yeast
– Views
• Network/Grid
• Annotations
– Bar/Histogram/Heatmaps
• Filter
– Unmapped
Sketches
– http://pomo.cs.tut.fi
– Custom
– Human-Mouse
– New
POMO Architecture
• HTML5
• JavaScript
– jQuery/ExtJS/protovis/circvis
– Cytoscape Web
• Apache Linux
– Python CGI
• SQLite
– Ensembl Biomart – latest reference builds
– Ensembl Plants
– SGD
• https://code.google.com/p/pomo
NoSQL Players
• Google*
– Bigtable: Column Based
• Amazon*
– Dynamo – Key:Value
• Facebook
– Cassandra – Column Based
• Redis
– Open source Key:Value store
• MongoDB – Document
• CouchDB - Document
• Neo4j – GraphDB
– Many other Graph Databases
– http://en.wikipedia.org/wiki/Graph_database
NoSQL: Vocabulary
• Not ACID compliant
• Focus: Speed + Scale
• Distributed
– Replication + Propagation
• Sharding
– Single Logical DB system
» Ranges of Documents
• Cluster of machines
• Latency
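Sharding by key range, with a replica per shard, can be sketched as follows. This is a toy in-memory model under stated assumptions: the class name, the split points, and the synchronous copy to a replica are all simplifications, and real systems add failure handling, rebalancing, and network latency.

```python
import bisect


class RangeShardedStore:
    """One logical store whose keys are split into ranges across shards."""

    def __init__(self, boundaries):
        # Sorted split points; ["h", "p"] yields three shards:
        # keys < "h", keys from "h" up to (not including) "p", keys >= "p".
        self.boundaries = sorted(boundaries)
        shard_count = len(self.boundaries) + 1
        self.primaries = [dict() for _ in range(shard_count)]
        self.replicas = [dict() for _ in range(shard_count)]

    def shard_for(self, key):
        """Return the index of the shard whose range contains key."""
        return bisect.bisect_right(self.boundaries, key)

    def put(self, key, value):
        index = self.shard_for(key)
        self.primaries[index][key] = value
        # Replication: propagate the write to the shard's replica.
        self.replicas[index][key] = value

    def get(self, key, from_replica=False):
        index = self.shard_for(key)
        source = self.replicas if from_replica else self.primaries
        return source[index].get(key)


store = RangeShardedStore(["h", "p"])
for name in ("apple", "kiwi", "zebra"):
    store.put(name, name.upper())

print([store.shard_for(k) for k in ("apple", "kiwi", "zebra")])
print(store.get("kiwi"), store.get("kiwi", from_replica=True))
```

Callers see a single put and get interface while the data is spread over several dictionaries, which is the "single logical DB system" idea in miniature.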
Acknowledgements!
Bioinformatics Core - Luxembourg Centre for Systems Biomedicine
Shmulevich Lab – ISB Seattle
Nykter Group – BioMediTech
Reija Autio – TUT/UTA
References
• C. J. Date, An Introduction to Database Systems, 8th Edition, Addison-Wesley (2003)
• http://en.wikipedia.org/wiki/Nosql
• Gentle Introduction to Object and Relational DB: http://www.cl.cam.ac.uk/~fms27/db/tr-98-2.pdf
• Google Bigtable: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/sv//archive/bigtable-osdi06.pdf
• http://newtech.about.com/od/databasemanagement/a/Nosql.htm
• http://en.wikipedia.org/wiki/Create,_read,_update_and_delete
• http://lucene.apache.org/
• Dynamo Paper – Amazon: http://www.read.seas.harvard.edu/~kohler/class/cs239w08/decandia07dynamo.pdf
• Free Graph DB Book: http://graphdatabases.com/?_ga=1.245224581.1016191170.1409035486
• POMO – http://pomo.cs.tut.fi
• http://www.biomedcentral.com/1471-2164/14/918