Introduction to Databases

download report

Transcript Introduction to Databases

Introduction to Databases
Vetle I. Torvik
DNA was the 20th century Databases are the 21st century
 Quantum leaps in the evolution of human
brain power
– Way-back-when: information in books - phone
books, dictionaries, lab notebooks, journals
– Recently: information at your fingertips
– Now: scientific discovery at your fingertips
• data mining bio-informatics databases
• data mining text data bases
How do you find a good movie?
 New releases only?
 Browsing shelves by category (comedy,
action, drama, foreign, etc.)?
 Browsing through a book at blockbuster
–
–
–
–
by titles alphabetically?
by actors alphabetically?
by category?
by year?
A step up...
querying a database
Now imagine this…
 Visualizing the entire movie database in
ONE figure across ALL dimensions
– year, category, actor, director, popularity, rating,
length, language, country, awards, etc.
 and drilling down to find your movie(s)
PS: You don’t have to imagine...
Why not do the same
in the scientific literature?
Benefits of DBs
 Over paper books… a quantum leap
– Speed, space, less drudgery
 Over spreadsheets … another quantum leap
– Maintenance (less redundancy, etc)
– Currency (accuracy, up-to-date, on-demand)
– Access (across time and space, sharing)
– Security (recovery, restrict others’ access)
– Facilitates data mining: encode meaning,
inferences, pooling/sharing, visualization
A Database
– an electronic repository for persistent data
A Database system
 to store, retrieve, and manipulate data
 consists of 4 parts
– Data - collection of linked data files
– Hardware - for storage and execution
– Software - DB management system (e.g.,
Access, MySQL, Filemaker, Oracle)
– Users - DB administrator, data administrator,
application programmers, end users
Relational DBMSs
 Dominates market
 Data is perceived by users as tables only
• representing, manipulating, and enforcing integrity
of data so that operations function correctly
• no duplicate records, rows and columns are
unordered, each entry has a single value
 SQL = “structured query language”
• a standard language for querying databases
• independent of how the data is stored/accessed
Database design - a subjective exercise
 Entity/Relationship diagramming
– identify entities or
“things that can be distinctly identified”
• e.g. movie, category, individual(director, actor)
– identify relationships
• e.g. a movie has one director, zero or more actors,
belongs to one category
– draw the diagram
 Then “normalize” the database
Ontologies - the basis upon which
the truth of the world is viewed
 E.g. a movie has one director, zero or more actors,
belongs to one category
 makes databases a bit more intelligent
 allows for making inferences
– “the artist formerly known as Prince” - without an artist
name, nobody can make any name related inferences
about him…
Metadata - data about the data
 It would be nice if SQL knew that actors and
directors are both individuals so that (e.g.)
querying movies by actor = director makes sense
(and this type of query could be optimized)
Data mining
 Searching for novel patterns, rules or
relationships in data, e.g.:
–
–
–
–
correlations
classification
clustering
visualization
 Versus traditional statistics: hypothesis
testing
Data mining - correlations
 Searching through many possible pairs of
associations to find novel ones, e.g.:
– phenotypes versus genotypes
Data mining - classification
 find rules that discriminate between
predefined categories
– e.g., breast cancer diagnosis
–
–
–
RULE #1: IF the following conditions hold ALL true at the SAME TIME,
THEN the case is: "intra-ductal carcinoma”
CONDITIONS:
• The volume of the calcifications is more than 0.03 cm^3.
• AND The total number of calcifications is greater than 10.
• AND The variation in shape is moderate or marked.
• AND The irregularity in size of calcifications is marked.
• AND The variation of the density of calcifications is moderate or marked.
• AND There is no ductal orientation.
• AND The number of calcifications per cm^3 is less than 20.
• AND A comparison with previous exams shows a change in the number or
character of calcifications or it is newly developed.
RULE #2: ...
Data mining - clustering
 organizing information by naturally
occurring groups, e.g.:
– cluster languages by similarity of words to
assess their evolution
– organizing webpages into themes by word
usage (e.g., www.vivisimo.com)
– grouping genes by expression level in DNA
microarrays to find a subset of differentially
expressed genes
Data mining - clustering
Data mining - visualization
 Looking for patterns across multiple
dimensions, and levels of resolution e.g.:
– scientific collaboration behavior across time
and subjects
– map of power outage over time (what was the
chain of events causing a major outage?)
Data mining begins at home
 Your lab notebook is a database.
 Can you data mine your lab notebook?