Chapter 2 PowerPoint Slides

Download Report

Transcript Chapter 2 PowerPoint Slides

Assignment: Due Friday:
READ:
Ensembl API Tutorial:
http://www.ensembl.org/info/software/core/core_tutorial.html
READ:
Encode Project paper (on Icon)
1
Databases
• Reference: Bioinformatics Computer Skills
– Gibas and Jambeck
2
Database Intro
(for Ensembl Modules)
Database Design for Mere
Mortals,
M Hernandez, Addison-Wesley
3
Databases
• What is a database?
4
Databases
• Vital part of genomics, bioinformatics, genetics,
and biology
• Genomic data
– sequence, genes, annotation, markers, literature,
diseases, structures, SNPs, mutations, etc, etc, etc.
– Ensembl, UCSC, NCBI -- the big "three"
• Knowing how to find and download from central
biological repositories is an important skill
• However, this is a tool building class, and
databases on the web becomes more integral to
sharing information
– build our own DBs
5
Databases
• This portion of the course will not make you an
expert in databases (there are other courses for
that), nor will you be able to design and
implement a major database
• However, you will have a basic understanding of
the steps involved, and you will be able to
instantiate rudimentary databases from which you
may expand with efforts beyond this class
• Additionally, you will be better equipped to query
existing databases and develop
applications/interfaces to interact with DBs
6
Databases
• Steps
–
–
–
–
design a data model
select a database management system
implement data model
interface
7
DBMS
• types
– flat file index
– relational
– object-oriented
8
Database Introduction: terms
• Field, attribute, tuple, record
– Individual data items within a database
– Often “field” describes the “place holder” in a database table, and
“record” describes that actual data value
• Table
– One or more attributes grouped together
• Relation, link, pointer
– A connection between two fields (usually between tables)
• Other terms
– meta data -- data that describes attributes (like the column titles in a
spread sheet)
– data elements -- a "class" and attribute
• ex) subject, name
• subject, address
• subject, affection_status
9
Databases
• Operational
– Used whenever there is need to collect, maintain, and modify data
– Dynamic data
• Changes constantly
• Reflects up-to-the minute info
– Examples (Biology/Genetics)
• Ensembl*, NCBI (GenBank, PubMed, dBest, UniGene*, LocusLink*), UCSC*
• Analytical
– Store and track historical and time-dependent data
– Static
• Many of the biological databases (Ensembl, UniGene, UCSC) are
the result of substantial analyses and are relatively static – with a
release period of between 1 week and 6 months.
• These do not change between releases – something between
operational and analytical
10
Flat File Databases
• an ordered collection of similar files
• usually conforming to a standard format (but not
always)
• a flat file database means that rules can be applied
to find data within the files (rather than
remembering the locations of all individual files)
• made searchable by indexing
• an index is a pairing of an "attribute" from within
a file and the file name/location
11
Flat File Example
>gnl|UG|Hs#S593306 zr74d01.r1 Soares_NhHMPu_S1 Homo sapiens cDNA
clone IMAGE:669121 5', mRNA sequence /clone=IMAGE:669121
/clone_end=5' /gb=AA234466 /gi=1858985 /ug=Hs.68879 /len=276
ATTAAGATCTGAAAACTGTGATGCGTCCTTTCTGCAGAGACGCCTCTTTCTGAATCTGCC
CGGAGCTTCGAGCCCCGGCGTCTGTCCCTCAGCCTGGCATGGCTTCTTCGGGGGTCTGCT
TTGCATGGGGAGAGGGGCCACGCAGCGGACGGACTAGGTTTGGGGATTCTCGGTAATGGA
CCCGGAGCAATGACTAACAGCCGCTCCCTCTCACTTTCCCACAGCGATCACCCTCTAACA
CCCTCCCTCCCATTCCCGGCCCCGCGCGTGACAAGG
Index:
/HomoSapiens/EST/5-prime/zr74d01.r1 Hs.68879
12
LOCUS
DEFINITION
AA234466
276 bp
mRNA
linear
EST 06-AUG-1997
zr74d01.r1 Soares_NhHMPu_S1 Homo sapiens cDNA clone IMAGE:669121
5', mRNA sequence.
ACCESSION
AA234466
VERSION
AA234466.1 GI:1858985
KEYWORDS
EST.
SOURCE
Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
REFERENCE
1 (bases 1 to 276)
AUTHORS
Hillier,L., Allen,M., Bowles,L., Dubuque,T., Geisel,G., Jost,S.,
Kucaba,T., Lacy,M., Le,N., Lennon,G., Marra,M., Martin,J., Moore,B.
, Schellenberg,K., Steptoe,M., Tan,F., Theising,B., White,Y., Wylie
,T., Waterston,R. and Wilson,R.
TITLE
WashU-Merck EST Project 1997
JOURNAL
Unpublished (1997)
COMMENT
Contact: Wilson RK
Washington University School of Medicine
4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108
Tel: 314 286 1800
Fax: 314 286 1810
Email: [email protected]
This clone is available royalty-free through LLNL ; contact the
IMAGE Consortium ([email protected]) for further information.
Insert Length: 871
Std Error: 0.00
Seq primer: -28m13 rev2 ET from Amersham
High quality sequence stop: 266.
FEATURES
Location/Qualifiers
source
1..276
/organism="Homo sapiens"
/mol_type="mRNA"
13
Flat File Databases
• as a flat file database grows larger, the onedimensional aspect of indices makes it difficult to
make connections between attributes
• many biological databases began as flat file
databases, and this legacy has influenced file
formats and applications (PDB, ESTs, GeneBank)
• because of the ease of browsing a file system,
many systems adopt a hybrid approach -incorporating both a relational database and flat
file system
14
Relational Databases
• Just like flat file systems, relational databases are a
way of assembling, storing, and retrieving
information/data
• in a relational database, data is stored in "tables"
• a flat file database that describes protein structure
is like a book
– chapters about the origin of the sample, how the data
was collected, the sequence, secondary structure,
positions of the atoms
• in a relational database, this information is spread
across tables
– a table for experimental conditions, sequence,
secondary structure, etc.
15
Relational Databases
• rows in the tables may refer to unique proteins
• connections between the proteins can be made ( they aren't
bound together like a book)
• the form of the tables follows rules that are uniform so that
you can access different attributes from different tables
• you can freely access all entries for atomic position at
once, or all attributes for a particular protein
• for one protein, not inconvenient to "look it up" and read
the entry in the "book" (I.e., flat file DB)
• protein name, author who deposited it, can be put in an
index (like a card catalogue) to help find the book
• but how do you extract the secondary structure out of
every "book" in the library -- you can't!
16
Relational Databases
• a relational database management system
(RDMS) allows you to "look" at the
database from many different "views"
• a RDMS essentially allows you to
dynamically "create" the contents of your
"book" by querying for the information that
you want
17
Table organization
• data in a table are organized in "rows"
• each row represents one "record"
• a row may contain separate pieces of
information -- "fields" or "attributes"
• tables are not glorified flat files -- although
they may look that way if you examine
them in their entirety
18
Example
• We want to keep track of genes for the purpose of
mutation screening using SSCP or sequencing
• We want to keep track of:
–
–
–
–
lists of genes
sequence of genes (to design primers)
exons and introns
multiple transcripts?
• Since we are putting this together for Pfizer (who
grossed $52 billion in 2006), we'll be cute and
refer to genes as "targets" -- since they are looking
for druggable "targets"
19
project (table name)
A first attempt
id
project_name
gene_name
date
description
4
glaucoma
BBS4
12/12/04
Glaucoma is…
5
glaucoma
ABCA6
Glaucoma is…
target (table name)
id
name
sequence
primerF
primerR (fields)
5
BBS4
ATGCGG…
GCTAGT
ATGACCCT
6
BMP4 ATGCCC…
GGTATG
TGAATGAG
…
exon
id
gene_id sequence
1
5
ATGC…
2
5
CCGG…
20
Observations
• values in the "target" table are related to values in the
"project" table
• however, it doesn't make sense to put all of the data into
one big table (we could -- but it would be very redundant)
• why are there multiple entries in project for the same
project???
• does it really make sense to put primer information in the
"target" table???
• the two data types ("project" and "target") are related but
"orthogonal"
• anywhere in a grouping of data where it is not sensible to
include fields in a table, a new table should be created
– normalization -- the process of separating complex data
into a collection of mutually orthogonal related tables
21
What's wrong with this?
• No inherent flaws, but may not be the most
efficient
– duplication (exon sequences in "exon" table are subsets
of the full gene sequence in "target' table)
– gene names are duplicated between "project" and
"target" table
– this isn't a major issue -- but for the sake of argument -what happens if the gene name for BMP4 is changed to
BMO4?
– Description of "project"s is duplicated
22
exon
transcript_id
exon_num
sequence_start
sequence_stop
intron
primer_pair
transcript_id
id
intron_num
transcript_id
sequence_start
left_primer_id
sequence_stop
right_primer_id
project
id
name
description
transcript
sequence
id
id
sequence_id
target_id
source
type
source_id
sequence
chr_name
target
date
id
date
set_table
set_target_join
id
set_id
project_id
target_id
name
rank
date
cas_rank
description
cas_options
gene_name
description
accession
strand
genomic_start
genomic_stop
source
source_id
refresh
status
23
Sample Data
exon
transcript_id
exon_num = 3
sequence_start
sequence_stop
intron
primer_pair
transcript_id
id
intron_num = 3
transcript_id
sequence_start
left_primer_id
sequence_stop
right_primer_id
project
id
name = pro1
description
transcript
sequence
id
id
sequence_id
target_id
source = Ensembl
type = nucleotide
source_id
sequence = ATG…
chr_name = 15
target
date
id
date
set_table
set_target_join
id
set_id
project_id
target_id
name = testset
rank = 5
date
cas_rank
description
cas_options
gene_name = BBS4
description
accession
strand = 1
genomic_start = 15,123,120
genomic_stop = 16,378,131
source
source_id
refresh
status
24
Database Schema
• network of tables and relationships defines
the "database schema"
• generally you would carefully develop a
schema so that it keeps utility over time
– test data
– example queries
25
Relational Database Model
(basis for modern databases)
• Codd, E. “A Relational Model of Data for Large
Shared Databanks.” Communications of the
ACM, 1970, 377-387.
– Based on set theory, and predicate logic
– “Relation” term is derived from set theory (not from
notion that tables are “related” to each other).
– Physical order of records is immaterial
– Each record in a table is identified by a field that
contains a unique value (sound like anything else we
know???)
– User does not need to know the physical location of a
record to retrieve data (unlike other models,
hierarchical, network).
26
End
27
Database Intro: An early model –
Hierarchical Database Model
Agents
Entertainers
Schedule
Clients
Engagements
Features:
- inverted tree with Root node
- relationships are parent/child
- each child has 1 parent, but parents may
have multiple children
- tables linked by a “pointer”
- access to data requires familiarity with the
structure
Payments
28
Database Intro: An early model –
Hierarchical Database Model
Agents
Entertainers
Schedule
Clients
Engagements
Payments
Advantages:
- quick retrieval
- referential integrity is built in and automatically enforced
- example: record in child table must be linked to existing record in
parent table
- if record deleted in parent table, all associated records in
child tables are deleted as well
29
Database Intro: An early model –
Hierarchical Database Model
Agents
Entertainers
Schedule
Clients
Engagements
Payments
Disadvantages:
- difficult to handle child entries that are initially unrelated to parent
- example:
- Want to add a new entertainer, but cannot add entertainer until it
is associated with an Agent
- Have to insert a “dummy” record
- does not support complex relationships
- redundant data: many to many relationship between Clients and
Entertainers, so Schedule (date, time) will have same entries as
Engagements (date, time)
30
Relational Database Model
Agents
Clients
Entertainers
Engagements
Music
Style
Advantages:
-built in multilevel data integrity:
table: records are not duplicated,
and detect missing primary keys
relationship: ensure connection between tables is valid
-logical and physical data independence from database application:
changes in physical implementation of database would
not (in theory) adversely affect applications built to use DB
-easy data retrieval: data may be retrieved from any table or from
any number of related tables
31
Relational Database Model
Agents
Clients
Entertainers
Engagements
Music
Styles
-Relationships
-may be: one-to-one, one-to-many, and many-to-many
-Established through matching values of a shared field
-Example: Agents are associated to Clients by an AgentsID field in
Clients table
-Data may be accessed in an almost unlimited combination of ways
32