Transcript Document

Introduction to Databases
Michael Schroeder
BioTechnological Center
TU Dresden
Biotec
Structure
 Motivation
 Introduction to MySQL
 Example Queries
 Using SQL to query SCOP
By Michael Schroeder, Biotec
2
Motivation
 In the last term,
 we accessed most information online via the web
 we interacted directly and manually with databases and tools
 we had to manually submit queries, interpret results. select interesting
results, cut&paste them, and submit queries again,…
 Pro:
 Reasonably easy to get hold of information
 Con:
 Not possible to ask many queries
 Queries limited by interface provided by web page
 Difficult/impossible to integrate information from different sites
 In this term, we will look at the databases underlying the online
front ends
 How is the data internally stored?
 How can we - and more important computer programs - directly interact
with the underlying data, so that we can ask more powerful queries, large
queries, and integrate different systems
By Michael Schroeder, Biotec
3
What actually happens when you
retrieve data online?
LLNE
YLEEVE
EYEEDE
LLNE
YLEEVE
EYEEDE
Client
Message
Web Server
1

Get home page

2

Send home page

Get it and send it
3
Display home page, enter
query, and press submit

Send query

Start programme that
evaluates query by
accessing database…
4
Display result

Send result

… Compose result web
page and send it
By Michael Schroeder, Biotec
4
What actually happens
You are limited by what web server
allows you to ask:
Example CATH:
•PDB ID,
•CATH code, or
•General text
But you cannot ask:
•In how many different PDB
structures is there a P-loop domain?
•Is there a PDB entry with a P-loop
and a DNA-binding domain
•How many different superfamilies
does the largest structure in PDB
have?
•With direct access to the underlying
database you could answer all these
questions (and many more)
By Michael Schroeder, Biotec
5
Querying over the Web
 Problem is always the same:
 The web interface limits access to
the underlying database
 How can we interact directly with
the database
By Michael Schroeder, Biotec
6
What databases are about
 Logical organization of data
 data models, schema design, dictionaries
 Physical organization of data
 Fast retrieval, indexing, compact storage of data
 Other requirements:
 Logging (important to know who did what to the data)
 Security and access control (important to know who can
do what)
 Transactions and concurrency control (important when
more than one person is working on database)
 Integrity (important to ensure that only valid entries in the
database)
 Recovery (important as hardware and software can
sometimes
fail
By Michael
Schroeder, Biotec
7
Different types of databases
 Flat files
 XML
 Relational database
 (Object databases)
 (Object relational databases)
By Michael Schroeder, Biotec
8
Flat files
 We can store any data in a flat
file, e.g. EMBL
 But is this a database?
 Logical data organisation:
None, unless we define one
(as done for EMBL) and adhere
to it, which is not enforced
 Physical data organisation:
None, we cannot optimise
retrieval for common queries
 Logging: No
 Access control: Implicit
through Unix
 Transaction and concurrency
control: None
 Integrity: None
 Recovery: If files are backedup they can be recovered.
However, not on the fly
By Michael Schroeder, Biotec
ID
XX
AC
XX
SV
XX
DT
DT
XX
DE
XX
KW
XX
OS
OC
OC
OC
XX
RN
RP
RX
RX
RA
RT
RT
RL
XX
RN
RX
RX
RA
RT
RT
BTBPTIG
standard; genomic DNA; MAM; 3998 BP.
X03365; K00966;
X03365.1
18-NOV-1986 (Rel. 10, Created)
20-MAY-1992 (Rel. 31, Last updated, Version 3)
Bovine pancreatic trypsin inhibitor (BPTI) gene
Alu-like repetitive sequence; protease inhibitor;
trypsin inhibitor.
Bos taurus (cow)
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
Euteleostomi; Mammalia;
Eutheria; Cetartiodactyla; Ruminantia; Pecora;
Bovoidea; Bovidae; Bovinae;
Bos.
[1]
1-3998
MEDLINE; 86158754.
PUBMED; 2420326.
Kingston I.B., Anderson S.;
"Sequences encoding two trypsin inhibitors occur in
strikingly similar
genomic environments";
Biochem. J. 233(2):443-450(1986).
[2]
MEDLINE; 84070725.
PUBMED; 6580617.
Anderson S., Kingston I.B.;
"Isolation of a genomic clone for bovine pancreatic
trypsin inhibitor by
using a unique-sequence synthetic DNA probe.";
9
XML files
 We can store any data in XML,
the eXtentable Mark-up
Language, e.g. Medline
 But is this a database?
 Logical data organisation:
yes, XML schema, which is
enforced
 Physical data organisation:
None, we cannot optimise
retrieval for common queries
 Logging: No
 Access control: Implicit
through Unix
 Transaction and concurrency
control: None
 Integrity: None
 Recovery: If files are backedup they can be recovered.
However, not on the fly
By Michael Schroeder, Biotec
<Article>
<Journal>
<ISSN>0270-7306</ISSN>
<JournalIssue>
<Volume>19</Volume>
<Issue>11</Issue>
<PubDate>
<Year>1999</Year>
<Month>Nov</Month>
</PubDate>
</JournalIssue>
</Journal>
<ArticleTitle>Differential regulation of the cell wall
integrity mitogen-activated protein kinase pathway in
budding yeast by the protein tyrosine phosphatases
Ptp2 and Ptp3.
</ArticleTitle>
<Pagination>
<MedlinePgn>7651-60</MedlinePgn>
</Pagination>
<Abstract>
<AbstractText>Mitogen-activated protein kinases
(MAPKs) are inactivated by dual-specificity and
protein tyrosine phosphatases (PTPs) in yeasts. In
Saccharomyces cerevisiae, two PTPs, Ptp2 and
Ptp3, inactivate the MAPKs, Hog1 and Fus3, with
different specificities... </AbstractText>
</Abstract>
<Affiliation>Department of Chemistry, University of
Colorado, Boulder, Colorado 80309-0215, USA.
</Affiliation>…
10
Relational Database
 Central Idea: Data as relations in a table
 E.g. SCOP, Structural Classification of Proteins
+-------+------+---------+---------+--------------------------------------+
| id
| type | sccs
| sid
| description
|
+-------+------+---------+---------+--------------------------------------+
| 46457 | cf
| a.1
| | Globin-like
|
| 46458 | sf
| a.1.1
| | Globin-like
|
| 46459 | fa
| a.1.1.1 | | Truncated hemoglobin
|
| 46460 | dm
| a.1.1.1 | | Truncated hemoglobin
|
| 46461 | sp
| a.1.1.1 | | Ciliate (Paramecium caudatum)
|
| 14982 | px
| a.1.1.1 | d1dlwa_ | 1dlw A:
|
| 46462 | sp
| a.1.1.1 | | Green alga (Chlamydomonas eugametos) |
| 14983 | px
| a.1.1.1 | d1dlya_ | 1dly A:
|
| 63437 | sp
| a.1.1.1 | | Mycobacterium tuberculosis
|
| 62301 | px
| a.1.1.1 | d1idra_ | 1idr A:
|
+-------+------+---------+---------+--------------------------------------+
By Michael Schroeder, Biotec
11
Relational Database
 Central Idea: Data as relations in a table
 E.g. Employee
+-------+------+---------+---------+
| id
| name | salary | role
|
+-------+------+---------+---------+
| 46457 | pete | 50.000 | director|
| 46458 | jane | 60.000 | nurse
|
| 46459 | asif | 70.000 | driver |
+-------+------+---------+---------+
By Michael Schroeder, Biotec
12
Relational Database
 Central Idea: Data as relations in a table
 E.g. pets
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
13
Relational Database
 Central Idea: Data as relations in a table
 E.g. school
+-------+------+---------+
| id
| name | subject |
+-------+------+---------+
| 46458 | rick | bio
|
| 46459 | gerd | bio
|
| 46460 | mary | bio
|
| 46461 | ella | math
|
| 14982 | anne | math
|
| 46462 | paul | math
|
+-------+------+---------+
By Michael Schroeder, Biotec
+-------+------+---------+
| id
| prof | subject |
+-------+------+---------+
| 51221 | bert | bio
|
| 55435 | anne | math
|
+-------+------+---------+
+---------+------+-----+------+
| subject | room | day | time |
+---------+------+-----+------+
| bio
| A
| mo | 3pm |
| math
| B
| tue | 1pm |
+---------+------+-----+------+
14
Relational Database
 A cell in the table stores a single number or string,
but not a list
+------+-------------+
| prof | subjects
|
+------+-------------+
| bert | {bio,sport} |
| anne | {math,arts} |
+------+-------------+
 Lists, sets need to be flattened
+------+-------------+
| prof | subject
|
+------+-------------+
| bert | bio
|
| bert | sport
|
| anne | arts
|
| anne | math
|
+------+-------------+
By Michael Schroeder, Biotec
15
Bioinformatics: 10 years of
resistance to flattening!
 Why the resistance?
 Bioinformatics data is naturally nested
 Extensive Use of sets and lists
 E.g. Swissprot: Features, keywords, References
 Such data can be flattened, but the resulting relational schema
is hard to understand hence it is hard to formulate queries.
 For example, storing the SWISSPROT entry in a relational
database would split it over 15-20 tables.
By Michael Schroeder, Biotec
16
Relational Databases
 RDB introduced in 1970 by Codd
 Took off in the 80s
 In the business world, relational databases are the rule
(Oracle, Sybase, mySQL, DB2, Microsoft Access).
 Large biomedical databases typically use a relational
technology; but there are also a lot of homegrown
systems (ACeDB, SRS indexed files). Data is almost
always viewed and exported in a variety of flat file
formats (EMBL, GenBank among others)
By Michael Schroeder, Biotec
17
The flood of biomedical data…
 Since 1980, the number and size of biomedical
databases has been growing exponentially.
 How can you find sources of information you are
seeking?
 Nucleic Acids Research Database Issue in January of
every year (http://nar.oupjournals.org/)
 Dbcat (http://www.infobiogen.fr/ ): a flat file database
of 500 biological databases.
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
18
Relational Schema
The schema of a database is a set of relation names,
their field names and types.
Example:
Entry(ID: int, Length: int, Seq: string, Mod: date)
Feature(ID: int, Type: string, From: int,To: int)
Entry and Feature are relation names,
ID, Seq, Mod, etc are attribute names, and
int, string, date are domains
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
19
Relation Instance
An instance of a relation is a set of tuples of the type
of the relation.
A tuple of Entry could be:
( ID: 82814, Length: 597, Seq:“ccagctaaccg”, Mod: 1-7-95)
A tuple of Feature could be:
(ID: 82814, Type:“source”, From:1,To:8959)
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
20
Tabular representation
Typically, relations are displayed as tables
attributes
Sequence:
ID
82814
98608
Length
Seq
Mod
597
18976
“ccagctaa...”
“accgcct...”
1-07-95
2-14-98
tuples
Feature:
ID
Type
From
82814 “Source” 1
82814 “Gene”
23
By Michael Schroeder, Biotec
To
184
65
Susan B. Davidson, Biol537/CIS636, Fall 2003
21
Entities and Relationships
 There is a one-many relationship from Entry to
Feature; each entry can have many features, but a
feature can be on at most one entry.
 Put another way, the existence of a feature depends
on the existence of the owning entry  referential
integrity
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
22
Integrity Constraints
 ID is the key of Entry, indicated by underlining:
 No two tuples of any instance of Entry can have the
same ID.
 In Feature, there is a referential integrity constraint
on ID:
 Every ID in Feature must appear in some tuple in
Entry.
 This is specified in the data definition language
(DDL), and enforced by the system as updates are
made to the instance.
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
23
DDL for this relational schema
CREATE TABLE Entry
(Id INTEGER,
Length INTEGER,
Sequence LONGCHAR,
Mod DATE,
PRIMARY KEY (Id) )
By Michael Schroeder, Biotec
CREATE TABLE Feature
(Id INTEGER,
Type CHAR(15),
From INTEGER,
To INTEGER,
PRIMARY KEY (Id, Type,
From, To)
FOREIGN KEY (Id)
REFERENCES Entry
ON DELETE CASCADE
ON UPDATE CASCADE)
Susan B. Davidson, Biol537/CIS636, Fall 2003
24
Querying relational databases
 The language SQL has become a standard for
querying relational databases. Based on a curious
mixture of the relational algebra and relational
calculus (formal languages), it allows new relations of
information to be computed from a set of relations.
 Unlike the relational algebra, it allows other useful
stuff: count, sum, min, max, etc.
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
25
Basic Query
SELECT
FROM
WHERE
[DISTINCT] target-list
relation-list
qualification
 relation-list A list of relation names (possibly with a
range-variable after each name).
 target-list A list of attributes of relations in relation-list.
* can be used to denote all atts.
 qualification Comparisons (Attr op const or Attr1 op
Attr2, where op is one of <, <=, >, >=, =, <>
combined using AND, OR and NOT.
 DISTINCT (optional) keyword indicates that the
answer should not contain duplicates. Default is that
duplicates are not eliminated!
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
26
Conceptual Evaluation
Strategy




Compute the product of relation-list
Discard tuples that fail qualification
Project over attributes in target-list
If DISTINCT then eliminate duplicates
This is probably a very bad way of executing the query, and
a good query optimizer will use all sorts of tricks to find
efficient strategies to compute the same answer.
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
27
Sample tables
Sequence:
ID
Length
Seq
Mod
82814
98608
1
76582
597
18976
16665
9976
“ccagctaa...”
“accgcct...”
“gtgtaa….”
“actgga…”
1-07-95
2-14-98
1-19-97
2-29-00
Feature:
ID
82814
82814
1
13428
13428
By Michael Schroeder, Biotec
Type
“Source”
“Gene”
“Gene”
“Gene”
“Source”
From
1
23
3
11000
1
To
597
65
9999
16665
16665
Susan B. Davidson, Biol537/CIS636, Fall 2003
28
Simple queries
Print all sequences
SELECT *
with length less
FROM Sequence
WHERE Length < 10000; than 10000.
ID
82814
76582
Length
Seq
Mod
597
9976
“ccagctaa...”
“actgga…”
1-07-95
2-29-00
SELECT Type
FROM Feature;
Print the type of all
features.
By Michael Schroeder, Biotec
Type
“Source”
“Gene”
“Gene”
“Gene”
“Source”
Susan B. Davidson, Biol537/CIS636, Fall 2003
29
Distinct
 Note that SQL did not eliminate duplicates. We
need to request this explicitly.
SELECT DISTINCT Type
FROM Feature;
Print the type of all
features
(no duplicates).
By Michael Schroeder, Biotec
Type
“Source”
“Gene”
Susan B. Davidson, Biol537/CIS636, Fall 2003
30
Pattern Matching
 Can be used in where clause. “_” denotes any
character, “%” 0 or more characters.
SELECT *
FROM Sequence
WHERE Seq LIKE ‘a_%g'
ID
98608
76582
Length
Seq
Mod
18976
9976
“accgcct...”
“actgga…”
2-14-98
2-29-00
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
31
Arithmetic
 “as” can be used to label columns in the output;
arithmetic can be used to compute results
SELECT DISTINCT ID, To-From+1 as Length
FROM Feature;
ID
82814
82814
13428
13428
13428
By Michael Schroeder, Biotec
Length
597
43
9997
5666
16665
Susan B. Davidson, Biol537/CIS636, Fall 2003
32
Set operations -- union
SELECT ID
FROM Sequence
WHERE Length<10000
UNION
SELECT ID
FROM Feature
WHERE Type=“Source”;
ID
76582
82814
13428
• Duplicates do not occur in the union.
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
33
The UNION ALL operator
preserves duplicates
SELECT ID
FROM Sequence
WHERE Length<10000
UNION ALL
SELECT ID
FROM Feature
WHERE Type=“Source”;
By Michael Schroeder, Biotec
ID
76582
82814
82814
13428
Susan B. Davidson, Biol537/CIS636, Fall 2003
34
Intersection and difference
SELECT Id
FROM Sequence
INTERSECT
SELECT Id
FROM Feature;
SELECT Id
FROM Sequence
MINUS
SELECT Id
FROM Feature;
ID
ID
82814
13428
98608
76582
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
35
Products
SELECT *
FROM Sequence,Feature;
ID
Length
82814
597
98608
18976
1
16665
76582
9976
……. (lots more!)
Seq
Mod
ID
“ccagctaa...”
“accgcct...”
“gtgtaa….”
“actgga…”
1-07-95
2-14-98
1-19-97
2-29-00
82814
82814
82814
82814
Type
From To
“Source”
“Source”
“Source”
“Source”
1
1
1
1
597
597
597
597
Note that the ID column name is duplicated in the output.
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
36
Conditional join
SELECT *
FROM Sequence, Feature
WHERE Sequence.Id = Feature.Id;
ID
Length
82814
82814
13428
1
13428
597
597
16665
16665
16665
Seq
Mod
“ccagctaa...”
“ccagctaa...”
“gtgtaa….”
“gtgtaa….”
“gtgtaa….”
By Michael Schroeder, Biotec
ID
1-07-95 82814
1-07-95 82814
1-19-97 13428
1-19-97 13428
1-19-97 13428
Susan B. Davidson, Biol537/CIS636, Fall 2003
Type
From To
“Source” 1
“Gene” 23
“Gene”
3
“Gene” 11000
“Source” 1
597
65
9999
16665
16665
37
Counting
Print the number
of feature entries.
SELECT COUNT(*)
FROM Feature;
Print the number
of types of features.
SELECT COUNT(Type)
FROM Feature;
 Surprisingly, the answer to both of these is the
following:
COUNT(TYPE)
5
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
38
Counting, cont.
 To fix this, we use the keyword “DISTINCT”:
SELECT COUNT(DISTINCT Type)
FROM Feature;
COUNT(DISTINCT Type)
3
 Can also use SUM, AVG, MIN and MAX.
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
39
Group by
 So far, these aggregate operators have been
applied to all qualifying tuples. Sometimes we
want to apply them to each of several groups of
tuples.
 For example: “Print the type and number of
features of each type.”
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
40
Group by
SELECT Type, COUNT(*)
FROM Feature
GROUP BY Type;
Type COUNT(*)
“Source”
2
“Gene”
3
 Note that only the columns that appear in the
GROUP BY statement and “aggregated” columns can
appear in the output. So the following would generate
an error.
SELECT Type, From, To, COUNT(*)
FROM Feature
GROUP BY Type;
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
41
Group by … having
 HAVING is to GROUP BY as WHERE is to FROM
SELECT Type, COUNT(*)
FROM Feature
WHERE From-To > 50
GROUP BY Type
HAVING AVG(From-To)> 8500
TYPE
COUNT(*)
“Source”
2
 “HAVING” is used to restrict the groups that appear in the
result.
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
42
Summary
 SQL is “relationally complete”: allows you to perform
operators in an algebra of relations (the relational
algebra).
 Additional features: string comparisons, set
membership, arithmetic and grouping.
 In contrast, Entrez is a much more limited language.
By Michael Schroeder, Biotec
Susan B. Davidson, Biol537/CIS636, Fall 2003
43
A Little Exercise
By Michael Schroeder, Biotec
44
A Little Exercise
 Given the table pet below let us formulate some
queries…
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
45
A Little Exercise
 Get all pet names…
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
46
A Little Exercise
SELECT name
FROM pet;
 Get all owners and list them only once
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
47
A Little Exercise
SELECT DISTINCT owner
FROM pet;
 Select the names of all birds…
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
48
A Little Exercise
SELECT name
FROM pet
WHERE species=“bird”;
 Select the names of all female birds…
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
49
A Little Exercise
SELECT name
FROM pet
WHERE species=“bird” AND sex=“f”;
 Select names and owners
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
50
A Little Exercise
SELECT name,owner
FROM pet;
 Select owners of birds and dogs…
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
51
A Little Exercise
SELECT owner
FROM pet
WHERE species=“bird” OR species=“dog”;
 Select all owners starting with Dia…
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
52
A Little Exercise
SELECT owner
FROM pet
WHERE owner LIKE “Dia%”;
 How many pets has Gwen?...
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
53
A Little Exercise
SELECT count(name)
FROM pet
WHERE owner=“Gwen”;
 Select owners of male pets in sorted order…
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
54
A Little Exercise
SELECT owner
FROM pet
WHERE sex=“m”
ORDER BY owner;
 List owners and the
number of pets they
have…
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
55
A Little Exercise
SELECT owner, COUNT(name)
FROM pet
GROUP BY owner;
 List owners and
the number of
pets they have
in descending
order…
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
56
A Little Exercise
SELECT owner, COUNT(name) AS
num
FROM pet
GROUP BY owner
ORDER BY num DESC;
 List owners and the
number of pets they
have in descending
order, but only if
they have more
than 1 pet…
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
57
A Little Exercise
SELECT owner, COUNT(name) AS num
FROM pet
GROUP BY owner HAVING num > 1
ORDER BY num DESC;
 List all pairs of cats and dogs…
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
58
A Little Exercise
SELECT p1.name, p1.species, p2.name, p2.species
FROM pet AS p1, pet AS p2
WHERE p1.species=“dog” AND p2.species=“cat”;
 Select all male/female pairs of the same species…
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
59
A Little Exercise
SELECT p1.name, p1.species, p1.sex, p2.name, p2.species,
p2.sex
FROM pet AS p1, pet AS p2
WHERE p1.species=p2.species AND p1.sex=“m” AND p2.sex=“f”;
 Can we write p1.sex != p2.sex instead? …
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
60
A Little Exercise
 We would get the pair Whistler and Chirpy as
well
+----------+--------+---------+------+------------+------------+
| name
| owner | species | sex | birth
| death
|
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen
| bird
|
| 0000-00-00 | NULL
|
| Chirpy
| Gwen
| bird
| f
| 1998-09-11 | 0000-00-00 |
| Bowser
| Diane | dog
| m
| 1979-08-31 | 1995-07-29 |
| Fang
| Benny | dog
| m
| 1990-08-27 | 0000-00-00 |
| Buffy
| Harold | dog
| f
| 1989-05-13 | 0000-00-00 |
| Claws
| Gwen
| cat
| m
| 1994-03-17 | 0000-00-00 |
| Fluffy
| Harold | cat
| f
| 1993-02-04 | 0000-00-00 |
| Slim
| Benny | snake
| m
| 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec
61
A Little Science
 When working with SCOP through the web interface
we are limited in what we can ask
 What can we get out of SCOP when it is available as
a relational table?
 A reminder
 Classes: all alpha, all beta, alpha/beta, alpha+beta
 SCOP family: >30% sequence similarity
 SCOP superfamily: good structural similiary (possibly
<30%)
By Michael Schroeder, Biotec
62
A Little Science
 At low sequence identity, good structural
alignments possible
Family
Same
Superfamily,
But not family
30%
By Michael Schroeder, Biotec
Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt
63
A Little Science
 Three tables:
 cla, PDB entry and reference to its class, fold,
superfamily, family, domain
 des, description of each node in the SCOP hierarchy
 subchain, chain and possibly beginning and end on
chain for a domain instance
 astral, sequence for a domain
By Michael Schroeder, Biotec
64
A Little Science
mysql> SELECT * FROM cla LIMIT 1;
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| sid
| pdb_id | sccs
| cl
| cf
| sf
| fa
| dm
| sp
| px
|
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| d1dlwa_ | 1dlw
| a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
mysql> SELECT * FROM des LIMIT 1;
+-------+------+------+------+--------------------+
| id
| type | sccs | sid | description
|
+-------+------+------+------+--------------------+
| 46456 | cl
| a
| | All alpha proteins |
+-------+------+------+------+--------------------+
mysql> SELECT * FROM astral LIMIT 1;
+---------+---------+-----------------------------------------------------------+
| sid
| sccs
| seq
|
+---------+---------+-----------------------------------------------------------+
| d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...|
+---------+---------+-----------------------------------------------------------+
mysql> SELECT * FROM subchain LIMIT 1;
+----+-------+----------+-------+------+
| id | px
| chain_id | begin | end |
+----+-------+----------+-------+------+
| 1 | 14982 | A
|
|
|
+----+-------+----------+-------+------+
By Michael Schroeder, Biotec
65
Entity relationship diagram for SCOP
Thanks to Boris VassilevT
A Little Science
 How many nodes are there in the hierarchy of type
class, fold, superfamily, family?
des
+-------+------+------+------+--------------------+
| id
| type | sccs | sid | description
|
+-------+------+------+------+--------------------+
| 46456 | cl
| a
| | All alpha proteins |
+-------+------+------+------+--------------------+
By Michael Schroeder, Biotec
67
A Little Science
 How many nodes are there in the hierarchy of type
class, fold, superfamily, family?
 Let us first find out how these types are called:
SELECT DISTINCT type
FROM des;
 Now let’s list them with the numbers
des
+-------+------+------+------+--------------------+
| id
| type | sccs | sid | description
|
+-------+------+------+------+--------------------+
| 46456 | cl
| a
| | All alpha proteins |
+-------+------+------+------+--------------------+
By Michael Schroeder, Biotec
68
A Little Science
 How many nodes are there in the hierarchy of type
class, fold, superfamily, family?
 SELECT type, COUNT(*) AS num
FROM des
GROUP BY type
ORDER BY num;
 There are not that many more
families than superfamilies.
 Which superfamily has the most
families
+------+-------+
| type | num
|
+------+-------+
| cl
|
11 |
| cf
|
854 |
| sf
| 1305 |
| fa
| 2156 |
| dm
| 4567 |
| sp
| 7111 |
| px
| 44327 |
+------+-------+
des
+-------+------+------+------+--------------------+
| id
| type | sccs | sid | description
|
+-------+------+------+------+--------------------+
| 46456 | cl
| a
| | All alpha proteins |
+-------+------+------+------+--------------------+
By Michael Schroeder, Biotec
69
A Little Science
Which superfamily has the most families?
SELECT des.sccs, des.description,
COUNT(DISTINCT cla.fa) AS num
FROM des, cla
WHERE des.id=cla.sf
GROUP BY cla.sf
ORDER BY num DESC;
+---------+---------------------------------------------------------+-----+
| sccs
| description
| num |
+---------+---------------------------------------------------------+-----+
| a.4.5
| "Winged helix" DNA-binding domain
| 35 |
| c.69.1 | alpha/beta-Hydrolases
| 23 |
| c.66.1 | S-adenosyl-L-methionine-dependent methyltransferases
| 20 |
| c.52.1 | Restriction endonuclease-like
| 19 |
| c.37.1 | P-loop containing nucleotide triphosphate hydrolases
| 18 |
| b.18.1 | Galactose-binding domain-like
| 15 |
| d.92.1 | Metalloproteases ("zincins"), catalytic domain
| 14 |
| b.29.1 | Concanavalin A-like lectins/glucanases
| 14 |
| f.2.1
| Membrane all-alpha
| 13 |
| c.47.1 | Thioredoxin-like
| 12 |
| c.68.1 | Nucleotide-diphospho-sugar transferases
| 12 |
| c.2.1
| NAD(P)-binding Rossmann-fold domains
| 11 |
| a.4.1
| Homeodomain-like
| 10 |
| b.40.4 | Nucleic acid-binding proteins
| 10 |
| a.118.1
| Schroeder,
ARM repeat
| 1070|
By Michael
Biotec
A Little Science
 Which families does the DNA binding-domain superfamily
have?
mysql> SELECT * FROM cla LIMIT 1;
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| sid
| pdb_id | sccs
| cl
| cf
| sf
| fa
| dm
| sp
| px
|
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| d1dlwa_ | 1dlw
| a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
mysql> SELECT * FROM des LIMIT 1;
+-------+------+------+------+--------------------+
| id
| type | sccs | sid | description
|
+-------+------+------+------+--------------------+
| 46456 | cl
| a
| | All alpha proteins |
+-------+------+------+------+--------------------+
 The sccs of the superfamily is a.4.5. Its families have
sccs a.4.5.1, a.4.5.2,…, so how can we list them?
By Michael Schroeder, Biotec
71
A Little Science
Which families does the DNA binding-domain superfamily
have?
SELECT DISTINCT sccs, description
FROM des
WHERE sccs LIKE “a.4.5%” AND type=“fa”
ORDER BY sccs;
| sccs
| description
+----------+-----------------------------------------------| a.4.5.1 | Biotin repressor-like
| a.4.5.10 | Replication initiation protein
| a.4.5.11 | Helicase DNA-binding domain
| a.4.5.12 | Restriction endonuclease FokI, N-terminal (recognition) domain
| a.4.5.13 | Histone H1/H5
| a.4.5.14 | Forkhead DNA-binding domain
| a.4.5.15 | DNA-binding domain from rap30
| a.4.5.16 | C-terminal domain of RPA32
| a.4.5.17 | Cell cycle transcription factor e2f-dp
| a.4.5.18 | The central core domain of TFIIE beta
| a.4.5.19 | Z-DNA binding domain
| a.4.5.2 | LexA repressor, N-terminal DNA-binding domain
| a.4.5.20 | P4 origin-binding domain-like
| a.4.5.21 | ets domain
| a.4.5.22 | Heat-shock transcription factor
| a.4.5.23 | Interferon regulatory factor
…
By Michael Schroeder, Biotec
72
A Little Science
Which families does the DNA binding-domain superfamily have?
...
| a.4.5.24
| a.4.5.25
| a.4.5.26
| a.4.5.27
| a.4.5.28
| a.4.5.29
| a.4.5.3
| a.4.5.30
| a.4.5.31
| a.4.5.32
| a.4.5.33
| a.4.5.34
| a.4.5.35
| a.4.5.4
| a.4.5.5
| a.4.5.6
| a.4.5.7
| a.4.5.8
| a.4.5.9
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Iron-dependent represor protein
Methionine aminopeptidase, insert domain
mu transposase, DNA-binding domain
TnsA endonuclease, C-terminal domain
MarR-like transcriptional regulators
Plant O-methyltransferase, N-terminal domain
Arginine repressor (ArgR), N-terminal DNA-binding domain
C-terminal domain of the rap74 subunit of TFIIF
DEP domain
Lrp/AsnC-like transcriptional regulator N-terminal domain
Thanscriptional regulator IclR, N-terminal domain
SCF ubiquitin ligase complex WHB domain
C-terminal fragment of elongation factor SelB
CAP C-terminal domain-like
ArsR-like transcriptional regulators
GntR-like transcriptional regulators
Replication terminator protein (RTP)
N-terminal domain of molybdate-dependent transcriptional regulator ModE
Transcription factor MotA, activation domain
By Michael Schroeder, Biotec
73
A Little Science
 Which families does the DNA binding-domain superfamily
have?
 Let’s find example pdb’s
mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.1";
+--------+
| pdb_id |
+--------+
| 1bia
|
| 1hxd
|
| 1bib
|
| 1j5y
|
+--------+
mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.2";
+--------+
| pdb_id |
+--------+
| 1jhf
|
| 1jhh
|
| 1lea
|
| 1leb
|
+--------+
By Michael Schroeder, Biotec
74
A Little Science
1jhf
By Michael Schroeder, Biotec
1bia
75
A Little Science
1cgp
a.4.5.4
By Michael Schroeder, Biotec
76
A Little Science
1b9n, a.4.5.8
1bm9, a.4.5.7
By Michael Schroeder, Biotec
 Some more…
1f4k, which family?
1f4k, a.4.5.7
1smt, a.4.5.5
1hw1, a.4.5.6
77
A Little Science
 How many percent of superfamilies have only 1
family, how many 2,… ?
By Michael Schroeder, Biotec
78
A Little Science
How many percent of superfamilies have only 1,2,3,… families?
First let’s deposit the result of the query that found the number of
families for each superfamily in a table
CREATE TABLE fa_freq AS
SELECT des.sccs, des.description,
COUNT(DISTINCT cla.fa) AS num
FROM des, cla
WHERE des.id=cla.sf
GROUP BY cla.sf
ORDER BY num desc;
Now we count how many superfamilies have 1,2,3,… families
SELECT num AS fa_per_sf, COUNT(*) AS freq
FROM fa_freq
GROUP BY num;
By Michael Schroeder, Biotec
79
A Little Science
many percent of superfamilies have only 1,2,3,… families?
+-----------+------+
SELECT COUNT(*)
| fa_per_sf | freq |
FROM fa_freq ;
+-----------+------+
+----------+
|
1 | 981 | | count(*) |
|
2 | 164 | +----------+
|
3 |
65 | |
1305 |
|
4 |
29 | +----------+
|
5 |
25 |
|
6 |
14 |
|
7 |
6 |
|
8 |
5 |
SELECT num AS fa_per_sf,
|
9 |
1 |
(COUNT(*)/1305) AS perc
|
10 |
3 |
FROM fa_freq
|
11 |
1 |
GROUP BY num;
|
12 |
2 |
|
13 |
1 |
|
14 |
2 |
|
15 |
1 |
|
18 |
1 |
|
19 |
1 |
|
20 |
1 |
|
23 |
1 |
|
35 |
1 |
+-----------+------+
By Michael Schroeder, Biotec
+-----------+------+
| fa_per_sf | perc |
+-----------+------+
|
1 | 0.75 |
|
2 | 0.13 |
|
3 | 0.05 |
|
4 | 0.02 |
|
5 | 0.02 |
|
6 | 0.01 |
|
7 | 0.00 |
|
8 | 0.00 |
|
9 | 0.00 |
|
10 | 0.00 |
|
11 | 0.00 |
|
12 | 0.00 |
|
13 | 0.00 |
|
14 | 0.00 |
|
15 | 0.00 |
|
18 | 0.00 |
|
19 | 0.00 |
|
20 | 0.00 |
|
23 | 0.00 |
|
35 | 0.00 |
+-----------+------+
80
A Little Science
How many percent of superfamilies have only 1,2,3,…
families?
This is interesting! For the majority of superfamilies
there is only one family!
What is the PDB structure with the largest number of
(distinct) superfamilies?
By Michael Schroeder, Biotec
81
A Little Science
at is the PDB structure with the largest number of (distinct) superfamilies?
SELECT pdb_id,
COUNT(sf) AS sf_num
FROM cla
GROUP BY pdb_id
ORDER BY sf_num DESC
LIMIT 10;
+--------+--------+
| pdb_id | sf_num |
+--------+--------+
| 1aon
|
49 |
| 1hto
|
48 |
| 1ir2
|
48 |
| 1htq
|
48 |
| 1der
|
42 |
| 1f49
|
40 |
| 1jyy
|
40 |
| 1gho
|
40 |
| 1jyz
|
40 |
| 1jz1
|
40 |
+--------+--------+
By Michael Schroeder, Biotec
SELECT pdb_id,
COUNT(DISTINCT sf)
AS distinct_sf_num
FROM cla
GROUP BY pdb_id
ORDER BY distinct_sf_num DESC
LIMIT 10;
+--------+-----------------+
| pdb_id | distinct_sf_num |
+--------+-----------------+
| 1m1k
|
23 |
| 1k9m
|
23 |
| 1kd1
|
23 |
| 1kqs
|
23 |
| 1jj2
|
23 |
| 1k8a
|
23 |
| 1ffk
|
22 |
| 1i96
|
21 |
| 1hnz
|
20 |
| 1hr0
|
20 |
+--------+-----------------+82
A structure
with
23 different
superfamilies
 1k9m Co
Crystal
Structure Of
Tylosin
Bound To
The 50S
Ribosomal
Subunit Of
Haloarcula
Marismortui
Ribosome
By Michael Schroeder, Biotec
83
A Little Science
 Now let’s plot how many PDBs have 1, 2, 3,… distinct
superfamilies
 First of all let us put the result of the previous slide in a table
(note if the table already exists we have to erase it first: DROP
TABLE pdb_sf_num. But be careful using DROP
CREATE TABLE pdb_sf_num AS
SELECT pdb_id,
COUNT(DISTINCT sf) AS
distinct_sf_num
FROM cla
GROUP BY pdb_id
ORDER BY distinct_sf_num DESC;
 Now let us count how many PDBs with 1,2,3,… distinct
superfamilies…
By Michael Schroeder, Biotec
84
A Little Science
How many PDBs have 1, 2, 3,… distinct superfamilies
SELECT distinct_sf_num, COUNT(pdb_id) AS num
FROM pdb_sf_num
GROUP BY distinct_sf_num
ORDER BY distinct_sf_num;
+-----------------+-------+
| distinct_sf_num | num
|
+-----------------+-------+
|
1 | 13960 |
|
2 | 2721 |
|
3 |
495 |
|
4 |
178 |
|
5 |
33 |
|
6 |
25 |
|
7 |
1 |
|
9 |
4 |
|
20 |
9 |
|
21 |
1 |
|
22 |
1 |
|
23 |
6 |
By Michael Schroeder, Biotec
+-----------------+-------+
85
A Little Science
 Let’s do the same in percent
SELECT COUNT(DISTINCT pdb_id)
FROM cla;
+----------+
| count(*) |
+----------+
|
17434 |
+----------+
 There are 17434 PDB IDs
By Michael Schroeder, Biotec
86
A Little Science
How many PDBs have 1, 2, 3,… distinct superfamilies
SELECT distinct_sf_num, COUNT(pdb_id)/17434 AS perc
FROM pdb_sf_num
GROUP BY distinct_sf_num
ORDER BY distinct_sf_num;
+-----------------+------+
| distinct_sf_num | perc |
+-----------------+------+
|
1 | 0.80 |
|
2 | 0.16 |
|
3 | 0.03 |
|
4 | 0.01 |
|
5 | 0.00 |
|
6 | 0.00 |
|
7 | 0.00 |
|
9 | 0.00 |
|
20 | 0.00 |
|
21 | 0.00 |
|
22 | 0.00 |
|
23 | 0.00 |
+-----------------+------+
 80% of PDB entries consist only of one type of superfamily!
By Michael Schroeder, Biotec
87
A Little Science
What are the most popular superfamilies?
I.e. for which are there the most PDB entries
By Michael Schroeder, Biotec
88
A Little Science
What are the most popular superfamilies?
SELECT des.sccs, des.description,
COUNT(DISTINCT cla.pdb_id) AS num_of_pdb_ids
FROM cla,des
WHERE des.id=cla.sf
GROUP BY cla.sf
ORDER BY num_of_pdb_ids DESC
LIMIT 10;
+--------+------------------------------------------------------+----------------+
| sccs
| description
| num_of_pdb_ids |
+--------+------------------------------------------------------+----------------+
| b.1.1 | Immunoglobulin
|
823 |
| d.2.1 | Lysozyme-like
|
777 |
| b.47.1 | Trypsin-like serine proteases
|
649 |
| c.37.1 | P-loop containing nucleotide triphosphate hydrolases |
521 |
| c.2.1 | NAD(P)-binding Rossmann-fold domains
|
384 |
| a.1.1 | Globin-like
|
384 |
| c.1.8 | (Trans)glycosidases
|
332 |
| b.50.1 | Acid proteases
|
288 |
| b.29.1 | Concanavalin A-like lectins/glucanases
|
230 |
| c.47.1 | Thioredoxin-like
|
217 |
+--------+------------------------------------------------------+----------------+
By Michael Schroeder, Biotec
89
A Little Science
 Are all superfamilies equally likely to co-occur?
 Let us generate a co-occurrence map as an answer
 Which superfamilies co-occur most frequently
 Which superfamilies have the most co-occurrence partners
 The co-occurrence map should consist of two tables
 A table with PDB ID, superfamily 1, superfamily 2 (to avoid
repetition we will require that sf1 is alphabetically before sf2)
 A table with superfamily 1 and 2 and the number of PDBs
containing this co-occurrence
By Michael Schroeder, Biotec
90
A Little Science
Co-occurrence map:
SELECT DISTINCT c1.pdb_id, c1.sf, c2.sf
FROM cla AS c1, cla AS c2
WHERE c1.pdb_id=c2.pdb_id AND c1.sf<c2.sf
LIMIT 10;
+--------+-------+-------+
| pdb_id | sf
| sf
|
+--------+-------+-------+
| 1cqx
| 46458 | 63380 |
| 1cqx
| 46458 | 52343 |
| 1gvh
| 46458 | 63380 |
| 1gvh
| 46458 | 52343 |
| 1b33
| 46458 | 54580 |
| 1qgw
| 46458 | 56568 |
| 1kf6
| 46548 | 46977 |
| 1kf6
| 46548 | 51905 |
| 1kf6
| 46548 | 54292 |
| 1kf6
| 46548 | 56425 |
+--------+-------+-------+
 We are still missing the sf names, which we can get from the des
table
By Michael Schroeder, Biotec
91
A Little Science
Co-occurrence Map:
CREATE TABLE cooc AS
SELECT DISTINCT c1.pdb_id,
c1.sf AS sf1, d1.description AS sf1name,
c2.sf AS sf2, d2.description AS sf2name
FROM cla AS c1, cla AS c2, des AS d1, des AS d2
WHERE c1.pdb_id=c2.pdb_id AND c1.sf<c2.sf AND
c1.sf=d1.id AND c2.sf=d2.id;
+--------+-------+--------------------------+-------+--------------------------------------------------------------+
| pdb_id | sf1
| sf1name
| sf2
| sf2name
|
+--------+-------+--------------------------+-------+--------------------------------------------------------------+
| 1cqx
| 46458 | Globin-like
| 63380 | Riboflavin synthase domain-like
|
| 1cqx
| 46458 | Globin-like
| 52343 | Ferredoxin reductase-like, C-terminal NADP-linked domain
|
| 1gvh
| 46458 | Globin-like
| 63380 | Riboflavin synthase domain-like
|
| 1gvh
| 46458 | Globin-like
| 52343 | Ferredoxin reductase-like, C-terminal NADP-linked domain
|
| 1b33
| 46458 | Globin-like
| 54580 | Allophycocyanin linker chain (domain)
|
| 1qgw
| 46458 | Globin-like
| 56568 | Non-globular alpha+beta subunits of globular proteins
|
| 1kf6
| 46548 | alpha-helical ferredoxin | 46977 | Succinate dehydrogenase/fumarate reductase C-terminal domain |
| 1kf6
| 46548 | alpha-helical ferredoxin | 51905 | FAD/NAD(P)-binding domain
|
| 1kf6
| 46548 | alpha-helical ferredoxin | 54292 | 2Fe-2S ferredoxin-like
|
| 1kf6
| 46548 | alpha-helical ferredoxin | 56425 | Succinate dehydrogenase/fumarate reductase catalytic domain |
+--------+-------+--------------------------+-------+--------------------------------------------------------------+
 Now let us count the distinct PDB IDs for each co-occurrence
By Michael Schroeder, Biotec
92
A Little Science
Number of instances in co-occurrence map
SELECT COUNT(DISTINCT pdb_id) AS num,
sf1, sf1name, sf2, sf2name
FROM cooc
GROUP BY sf1,sf2
ORDER BY num DESC
LIMIT 10;
+-----+-------+-----------------------------------------------+-------+-----------------------------------------------| num | sf1
| sf1name
| sf2
| sf2name
+-----+-------+-----------------------------------------------+-------+-----------------------------------------------| 137 | 48726 | Immunoglobulin
| 54452 | MHC antigen-recognition domain
| 125 | 51011 | alpha-Amylases, C-terminal beta-sheet domain | 51445 | (Trans)glycosidases
| 117 | 47616 | Glutathione S-transferases, C-terminal domain | 52833 | Thioredoxin-like
| 99 | 47802 | DNA polymerase beta, N-terminal domain-like
| 56699 | Nucleotidyltransferases
| 97 | 53098 | Ribonuclease H-like
| 56672 | DNA/RNApolymerases
| 74 | 51735 | NAD(P)-binding Rossmann-fold domains
| 55347 | Glyceraldehyde-3-phosphate dehydrogenase-like,
| 64 | 48726 | Immunoglobulin
| 51445 | (Trans)glycosidases
| 63 | 51905 | FAD/NAD(P)-binding domain
| 54373 | FAD-linked reductases, C-terminal domain
| 58 | 50203 | Bacterial enterotoxins
| 54334 | Superantigen toxins, C-terminal domain
| 55 | 48726 | Immunoglobulin
| 51011 | alpha-Amylases, C-terminal beta-sheet domain
+-----+-------+-----------------------------------------------+-------+------------------------------------------------
Is it valid to draw any conclusions from the above table of superfamily
co-occurrences with their frequencies?
We should be careful, as the number of co-occurrences may be biased
by the abundance of each superfamily,
e.g. immunoglobulin is very frequent in the PDB
By Michael Schroeder, Biotec
93
A Little Science
 This is quite similar a problem to the generation of
substitution matrices (last term)
 A little reminder…
By Michael Schroeder, Biotec
94
 BLOSUM
A Little Science
 BLOcks SUbstitution Matrix (based on BLOCKS database)
 Generation of BLOSUM x
 Group highly similar sequences and replace them by a
representative sequences.
 Only consider sequences with no more than x % similarity
 Align sequences (no gaps)
 For any pair of amino acids a,b and for all columns c of the
alignment, let q(a,b) be the number of co-occurrences of a,b in
all columns c.
 Let p(a) be the overall probability of a occurring
 BLOSUM entry for a,b is log2 ( q(a,b) / ( p(a)*p(b) ) )
By Michael Schroeder, Biotec
95
A Little Science
 “Normalised” number of instances in cooccurrence map
 Ok, so to avoid bias we will compute the following
 Logarithm of probability of sf1 and sf2 co-occurring /
(probability of sf1 * probability of sf2)
 To compute
 the co-occurrence probabilities we will count the frequency of
sf1 and sf2 co-occurring and divide this by the overall number
of co-occurrences and
 the probability of a superfamily we will divide the frequency of
the superfamily by the overall number of superfamilies
 To compute this let us put “freq of sf1 and sf2 co-occurring” into a
table cooc_pdb and “freq of sf” into a table sf_pdb
By Michael Schroeder, Biotec
96
A Little Science
Count PDBs for each co-occurrence
CREATE TABLE cooc_pdb AS
SELECT COUNT(DISTINCT pdb_id) AS num,
sf1, sf1name, sf2, sf2name
FROM cooc
GROUP BY sf1,sf2
ORDER BY num DESC;
Count PDBs for each superfamily
CREATE TABLE sf_pdb AS
SELECT des.id AS sf, des.description,
COUNT(DISTINCT cla.pdb_id) AS num_of_pdb_ids
FROM cla,des
WHERE des.id=cla.sf
GROUP BY cla.sf
ORDER BY num_of_pdb_ids DESC;
By Michael Schroeder, Biotec
97
A Little Science
Count overall number of PDBs for each co-occurrence
SELECT SUM(num) AS totalCoocFreq
FROM cooc_pdb;
+---------------+
| totalCoocFreq |
+---------------+
|
9813 |
+---------------+
Count PDBs for each superfamily
SELECT SUM(num_of_pdb_ids) AS total_num_of_pdb_ids
FROM sf_pdb;
+----------------------+
| total_num_of_pdb_ids |
+----------------------+
|
22318 |
+----------------------+
By Michael Schroeder, Biotec
98
A Little Science
Normalised co-occurrence map
CREATE TABLE normalised_cooc AS
SELECT LOG((cooc_pdb.num/9813)/
((sf_pdb1.num_of_pdb_ids/22318)*(sf_pdb2.num_of_pdb_ids/22318))) AS
val, cooc_pdb.sf1, cooc_pdb.sf1name, cooc_pdb.sf2, cooc_pdb.sf2name
FROM cooc_pdb, sf_pdb AS sf_pdb1, sf_pdb AS sf_pdb2
WHERE cooc_pdb.sf1=sf_pdb1.sf AND cooc_pdb.sf2=sf_pdb2.sf
ORDER BY val desc;
+-----------+----------------------------------------------+-------------------------------+
| val
| sf1name
| sf2name
|
+-----------+----------------------------------------------+-------------------------------+
| 10.834834 | Arp2/3 complex 21 kDa subunit ARPC3
| Arp2/3 complex 16 kDa subunit |
| 10.834834 | tRNA splicing endonuclease, C-terminal domain| tRNA splicing endonuclease Edn|
| 10.834834 | Cell-division inhibitor MinC, C-terminal doma| Cell-division inhibitor MinC, |
| 10.834834 | Tricorn protease N-terminal domain
| Tricorn protease N-terminal do|
| 10.834834 | L-fucose isomerase, C-terminal domain
| L-fucose isomerase, N-terminal|
| 10.834834 | Catalytic domain of malonyl-CoA ACP transacyl| Probable ACP-binding domain of|
| 10.834834 | Transcription factor IIA (TFIIA), N-terminal | Transcription factor IIA (TFII|
| 10.834834 | Glutamyl tRNA-reductase dimerization domain | Glutamyl tRNA-reductase cataly|
| 10.834834 | N-terminal domain of phosphatidylinositol tra| C-terminal domain of phosphati|
| 10.834834 | Rotavirus NSP2 fragment, C-terminal domain
| Rotavirus NSP2 fragment, N-ter|
| 10.834834 | Lipovitellin-phosvitin complex, superhelical | Lipovitellin-phosvitin complex|
| 10.834834 | Aminoimidazole ribonucleotide synthetase (Pur| Aminoimidazole ribonucleotide |
| 10.834834 | Colicin E3 translocation domain
| Colicin E3 receptor domain
|
| 10.834834 | Arp2/3 complex 21 kDa subunit ARPC3
| Arp2/3 complex subunits
|
| 10.834834 | Arp2/3 complex 16 kDa subunit ARPC5
| Arp2/3 complex subunits
|
| 10.834834 | Head domain of nucleotide exchange factor Grp| Coiled-coil domain of nucleoti|
| 10.834834 | YhbC-like, C-terminal domain
| YhbC-like, N-terminal domain |
| 10.141687 | TolB, C-terminal domain
| TolB, N-terminal domain
|
By Michael Schroeder, Biotec
99
+-----------+----------------------------------------------+-------------------------------+
A Little Science
 Which superfamily has the most co-occurrence
partners?
By Michael Schroeder, Biotec
100
A Little Science
Which superfamily has the most co-occurrence partners?
CREATE TABLE cooc_partner AS
SELECT COUNT(DISTINCT c2.sf) AS distinctnum,
COUNT(c2.sf) AS num, des.sccs, des.description
FROM cla AS c1, cla AS c2, des
WHERE c1.pdb_id=c2.pdb_id AND des.id=c1.sf
GROUP BY c1.sf;
SELECT *
FROM cooc_partner
ORDER BY distinctnum DESC
LIMIT 10;
| distinctnum | num
| sccs
| description
|
+-------------+-------+--------+------------------+
|
74 | 19177 | b.1.1 | Immunoglobulin
|
68 | 5228 | c.37.1 | P-loop
|
67 | 1602 | b.40.4 | Nucleic acid-binding proteins
|
42 |
435 | c.55.4 | Translational machinery components
|
42 |
580 | g.39.1 | Glucocorticoid receptor-like
|
31 |
561 | b.43.3 | Translation proteins
|
28 | 1575 | b.47.1 | Trypsin-like serine proteases
|
28 |
725 | d.14.1 | Ribosomal protein S5 domain 2-like
|
25 |
982 | a.4.5 | "Winged helix" DNA-binding domain
|
25 |
283 | d.52.3 | Prokaryotic type KH domain (pKH-…
+-------------+-------+--------+------------------+
By Michael Schroeder, Biotec
101
A Little Science
Which superfamily has the most co-occurrence partners?
SELECT *
FROM cooc_partner
ORDER BY num DESC
LIMIT 10;
+-------------+-------+---------+--------------+
| distinctnum | num
| sccs
| description
+-------------+-------+---------+--------------+
|
74 | 19177 | b.1.1
| Immunoglobulin
|
4 | 7532 | b.1.4
| beta-Galactosidase/glucuronidase
|
17 | 6355 | c.2.1
| NAD(P)-binding Rossmann-fold domains
|
24 | 6233 | f.2.1
| Membrane all-alpha
|
7 | 6007 | d.153.1 | N-terminal nucleophile aminohydrolases
|
1 | 5687 | i.1.1
| Ribosome and ribosomal fragments
|
68 | 5228 | c.37.1 | P-loop
|
18 | 5154 | c.1.8
| (Trans)glycosidases
|
5 | 3915 | a.1.1
| Globin-like
|
11 | 3880 | b.18.1 | Galactose-binding domain-like
+-------------+-------+---------+---------------+
By Michael Schroeder, Biotec
102
Scale-free Networks
 Small-world property
 “everybody on earth is related with a degree of six
intermediaries”
 Reason: Network structure
A few highly connected nodes
Many nodes with few connections
Consequence: very short average distance between any
two nodes
Formally: Number of interaction partners follows powerlaw, i.e. distribution of number of interaction partners is
an exponential function
 Protein interactions are small-world networks!
By Michael Schroeder, Biotec
103
Scale-free Networks
 Is our co-occurrence map a scale-free network?
By Michael Schroeder, Biotec
104
Scale-free Networks
 Is our co-occurrence map a scalefree network?
 Let’s get the distribution of the
number of distinct co-occurrence
partners
SELECT distinctnum, COUNT(*)
FROM cooc_partner
GROUP BY distinctnum
ORDER BY distinctnum;
 Let’s plot it
By Michael Schroeder, Biotec
+-------------+----------+
| distinctnum | count(*) |
+-------------+----------+
|
1 |
586 |
|
2 |
291 |
|
3 |
152 |
|
4 |
72 |
|
5 |
47 |
|
6 |
27 |
|
7 |
26 |
|
8 |
9 |
|
9 |
16 |
|
10 |
8 |
|
11 |
9 |
|
12 |
3 |
|
13 |
3 |
|
14 |
1 |
|
15 |
1 |
|
16 |
2 |
|
17 |
1 |
|
18 |
3 |
|
20 |
1 |
|
21 |
16 |
|
23 |
18 |
|
24 |
1 |
|
25 |
4 |
|
28 |
2 |
|
31 |
1 |
|
42 |
2 |
|
67 |
1 |
|
68 |
1 |
105
|
74 |
1 |
Scale-free Networks
Distribution of number of co-occurrence partners
(x-axis = number of partners, y-axis=frequency
By Michael Schroeder, Biotec
106
Scale-free Networks
 But maybe the number of co-occurrence partners
is simply correlated to the number of PDB entries
we have for that superfamily, in which case the
power-law relationship would be an artefact
 Let’s test this by extending the previous table with a
column for the average number of PDBs given as the
num column in cooc_partner
By Michael Schroeder, Biotec
107
Scale-free
Networks
 Distinct number of cooccurrence partners and
average number of cooccurrences
SELECT distinctnum,
COUNT(*),AVG(num)
FROM cooc_partner
GROUP BY distinctnum
ORDER BY distinctnum;
By Michael Schroeder, Biotec
+-------------+----------+------------+
| distinctnum | count(*) | avg(num)
|
+-------------+----------+------------+
|
1 |
586 |
66.8328 |
|
2 |
291 |
124.0481 |
|
3 |
152 |
251.1316 |
|
4 |
72 |
324.1389 |
|
5 |
47 |
344.2340 |
|
6 |
27 |
289.7778 |
|
7 |
26 |
883.1923 |
|
8 |
9 |
663.8889 |
|
9 |
16 |
375.0625 |
|
10 |
8 |
772.3750 |
|
11 |
9 | 1454.3333 |
|
12 |
3 |
726.6667 |
|
13 |
3 |
642.3333 |
|
14 |
1 |
489.0000 |
|
15 |
1 | 2530.0000 |
|
16 |
2 |
645.0000 |
|
17 |
1 | 6355.0000 |
|
18 |
3 | 2254.3333 |
|
20 |
1 |
979.0000 |
|
21 |
16 |
254.1875 |
|
23 |
18 |
290.2778 |
|
24 |
1 | 6233.0000 |
|
25 |
4 |
433.0000 |
|
28 |
2 | 1150.0000 |
|
31 |
1 |
561.0000 |
|
42 |
2 |
507.5000 |
|
67 |
1 | 1602.0000 |
|
68 |
1 | 5228.0000 |
|
74 |
1 | 19177.0000
|
108
Scale-free Networks
 No good correlation between frequency of number of distinct
co-occurrence partners and number of underlying instances.
Hence: scale-free property appears to be no artefact
By Michael Schroeder, Biotec
109
Limits
 List all superfamily pairs which can interact directly or
indirectly (“Transitive Closure”).
 This query cannot be expressed in SQL
 SQL does not have the same “power” as a
programming language like Python
By Michael Schroeder, Biotec
110
Introduction to MySQL
By Michael Schroeder, Biotec
111
SQL: “Structured Query Language”—the most common
standardized language used to access databases.
SQL has several parts:
DDL – Data Definition Language
{Defining, Deleting, Modifying relation schemas}
DML – Data Manipulation Language
{Inserting, Deleting, Modifying tuples in database}
Embedded SQL – defines how SQL statements can be used
with general-purposed programming
By Michael Schroeder, Biotec
112
 MySQL, the most popular Open Source SQL database, is
developed, distributed and supported by MySQL AB.
 MySQL is a relational database management system.
 MySQL software is Open Source.
• Written in C and C++. Tested with a broad range of different
compilers.
 Works on many different platforms.
 APIs for C, C++, Eiffel, Java, Perl, PHP, Python, Ruby,
and Tcl.
• You can find MySQl manual and documentation at:
www .mysql.com/documentation/
• You can download and install MySQL on your own computer
(both under Windows and Linux)
By Michael Schroeder, Biotec
113
MySQL
To see a list of options provided by mysql, invoke it with the --help option:
shell> mysql --help
Using SQL:
On any linux you have to use this to log on to MySQL:
shell> /usr/local/mysql/bin/mysql -h hostname -D loginname -p
shell> mysql -h host -u user -p
Enter password: ********
The ******** represents your password; enter it when
mysql displays the Enter password: prompt.
By Michael Schroeder, Biotec
114
 Basic Query:
select A1, A2,…,An
from r1, r2, …,rm
where P;
A1, A2,…,An represent attributes
r1, r2, …rm represent relations
P represents predicate (guard condition)
Keywords may be entered in any letter case:
mysql> SELECT VERSION(), CURRENT_DATE;
mysql> select version(), current_date;
mysql> SeLeCt vErSiOn(), current_DATE;
By Michael Schroeder, Biotec
115
Prompt
Meaning
mysql>
->
‘>
Ready for new command.
Waiting for next line of multiple-line command.
Waiting for next line, collecting a string that
begins with a single quote (` ’ ’).
“>
Waiting for next line, collecting a string that
begins with a double quote (` ” ’).
mysql>
->
->
SELECT *
FROM my_table
WHERE name = “Smith” AND age < 30;
mysql> SELECT * FROM my_table WHERE name = "Smith AND
age < 30;
"> "\c
mysql>
\c to cancel the execution of a command
By Michael Schroeder, Biotec
116
 Basic Database Operation





Create a database
Create a table
Load data into the table
Retrieve data from the table in various ways
Use multiple tables
Suppose you have several pets in your home (your menagerie) and you'd
like to keep track of various types of information about them. You can do so
by creating tables to hold your data and loading them with the desired
information. Then you can answer different sorts of questions about your
animals by retrieving data from the tables.
By Michael Schroeder, Biotec
117
Creating and Using a Database
mysql> SHOW DATABASES;
SHOW statement can be used to find out the databases currently
existing on the server
mysql> USE testdb
testdb is a database name.
USE command does not need a semi colon and must be given in a
single line.
Database needs to be invoked in order to use it.
mysql> CREATE DATABASE example;
Database names are case-sensitive unlike keywords; Same applies for
table names
So example != Example != EXAMPLE or some other variant
By Michael Schroeder, Biotec
118
Creating a Table
mysql> SHOW TABLES;
Displays the current list of tables
mysql> CREATE TABLE pet (name VARCHAR(20), owner VARCHAR(20),
-> species VARCHAR(20), sex CHAR(1), birth DATE, death DATE);
mysql> SHOW TABLES;
Will display the table with the table name pet
Verification of the table can be done with DESCRIBE command
mysql> DESCRIBE pet;
+---------+-------------+---------------+---------+---------+---------+
| Field
| Type
| Null
| Key
| Default | Extra
|
+---------+-------------+---------------+---------+---------+---------+
| name
| varchar(20) | YES |
| NULL
|
|
| owner
| varchar(20) | YES |
| NULL
|
|
| species | varchar(20) | YES |
| NULL
|
|
| sex
| char(1)
| YES |
| NULL
|
|
| birth
| date
| YES |
| NULL
|
|
| death
| date
| YES |
| NULL
|
|
By Michael Schroeder, Biotec
119
Loading Data into a Table
LOAD DATA uses a text file with single record in a line that match the attributes in the table.
Useful for inserting when multiple records are involved.
Example: pet.txt is a text file with a single record
Name
Whistler
owner
Gwen
species
bird
sex
\N
birth
1997-12-09
death
\N
mysql> LOAD DATA LOCAL INFILE "pet.txt" INTO TABLE pet;
INSERT command can be used when records needs to be inserted one at a time.
NULL can be directly inserted in the field column
Example:
mysql> INSERT INTO pet
-> VALUES ('Puffball','Diane','hamster','f','1999-03-30',NULL);
By Michael Schroeder, Biotec
120
Retrieving Information from a Table
The SELECT statement is used to pull information from a table. The general form of the
statement is:
SELECT what_to_select
FROM which_table
WHERE conditions_to_satisfy
The simplest form of SELECT retrieves everything from a table:
mysql> SELECT * FROM pet;
You can select only particular rows from your table.
mysql> SELECT * FROM pet WHERE name = "Bowser";
You can specify conditions on any column, not just name. For example, if you want to
know which animals were born after 1998, test the birth column:
mysql> SELECT * FROM pet WHERE birth >= "1998-1-1";
You can combine conditions, for example, to locate female dogs:
mysql> SELECT * FROM pet WHERE species = "dog" AND sex = "f";
By Michael Schroeder, Biotec
121
Selecting Particular Columns
If you don't want to see entire rows from your table, just name the columns in
which you're interested, separated by commas.
For example, if you want to know when your animals were born, select
the name and birth columns:
mysql> SELECT name, birth FROM pet;
To find out who owns pets, use this query:
mysql> SELECT owner FROM pet;
mysql> SELECT DISTINCT owner FROM pet;
You can use a WHERE clause to combine row selection with column selection.
For example, to get birth dates for dogs and cats only, use this query:
mysql> SELECT name, species, birth FROM pet
-> WHERE species = "dog" OR species = "cat";
By Michael Schroeder, Biotec
122
Sorting Rows
To sort a result, use an ORDER BY clause.
Here are animal birthdays, sorted by date:
mysql> SELECT name, birth FROM pet ORDER BY birth;
To sort in reverse order, add the DESC (descending) keyword to the name of
the column you are sorting by:
mysql> SELECT name, birth FROM pet ORDER BY birth DESC;
You can sort on multiple columns. For example, to sort by type of animal, then
by birth date within animal type with youngest animals first, use the
following query:
mysql> SELECT name, species, birth FROM pet ORDER BY species, birth
DESC;
By Michael Schroeder, Biotec
123
Pattern Matching
MySQL provides standard SQL pattern matching as well as a form of pattern
matching based on extended regular expressions similar to those used by Unix
utilities such as grep.
SQL pattern matching allows you to use `_' to match any single character and `%' to
match an arbitrary number of characters (including zero characters). In MySQL, SQL
patterns are case-insensitive by default. Some examples are shown here. Note that
you do not use = or <> when you use SQL patterns; use the LIKE or NOT LIKE
comparison operators instead.
To find names beginning with `b':
mysql> SELECT * FROM pet WHERE name LIKE "b%";
To find names containing exactly five characters, use the `_' pattern character:
mysql> SELECT * FROM pet WHERE name LIKE “_____”;
By Michael Schroeder, Biotec
124
Counting Rows
For example, you might want to know how many pets each owner has,
Counting the total number of animals you have is the same question as “How many rows
are in the pet table?”
The COUNT() function counts the number of non-NULL results, so the query to count your
animals looks like this:
mysql> SELECT COUNT(*) FROM pet;
You can use COUNT() if you want to find out how many pets each owner has:
mysql> SELECT owner, COUNT(*) FROM pet GROUP BY owner;
+---------+---------------+
| owner
| COUNT(*)
|
+---------+---------------+
| Benny
|
2
|
| Diane
|
2
|
| Gwen
|
3
|
| Harold |
2
|
+---------+---------------+
By Michael Schroeder, Biotec
125
Examples of some common queries
CREATE TABLE shop (
article
INT(4) UNSIGNED ZEROFILL DEFAULT
dealer
CHAR(20)
DEFAULT
price
DOUBLE(16,2)
DEFAULT
PRIMARY KEY(article, dealer));
‘0000'
‘’
'0.00’
NOT NULL,
NOT NULL,
NOT NULL,
INSERT INTO shop VALUES (1,'A',3.45),(1,'B',3.99),(2,'A',10.99),(3,'B',1.45),(3,'C',1.69),
(3,'D',1.25),(4,'D',19.95);
mysql> SELECT * FROM shop;
+--------------+---------+--------+
| article
| dealer | price |
+--------------+---------+--------+
|
0001 |
A | 3.45 |
|
0001 |
B | 3.99 |
|
0002 |
A | 10.99 |
|
0003 |
B | 1.45 |
|
0003 |
C | 1.69 |
|
0003 |
D | 1.25 |
|
0004 |
D | 19.95 |
+--------------+---------+--------+
The maximum value for a column
The row holding the maximum of a certain column
Maximum of column per group
The rows
holding
the group-wise
maximum of a certain field
By Michael
Schroeder,
Biotec
126
“What's the highest price?”
SELECT MAX(price) AS price FROM shop;
+---------+
| price |
+---------+
| 19.95 |
+---------+
“Find number, dealer, and price of the most expensive article.”
In ANSI SQL (and MySQL Version 4.1) this is easily done with a subquery:
SELECT article, dealer, price FROM shop
WHERE price = (SELECT MAX(price) FROM shop)
In MySQL versions prior to 4.1, you have to do it in two steps:
1. Get the maximum price value from the table with a SELECT statement.
2. Using this value compile the actual query:
SELECT article, dealer, price FROM shop WHERE price=19.95
Or, Using User Variables ( @variable-name, @temp := 5 )
Select @max_price := max(price) from shop;
Select article, dealer, price from shop where price = @max_price;
By Michael Schroeder, Biotec
127
Maximum of Column per Group
“What's the highest price per article?”
SELECT article, MAX(price) AS price
FROM shop
GROUP BY article;
+---------+---------+
| article | price |
+---------+---------+
| 0001 | 3.99 |
| 0002 | 10.99 |
| 0003 | 1.69 |
| 0004 | 19.95 |
+---------+---------+
By Michael Schroeder, Biotec
128
The Rows Holding the Group-wise Maximum of a Certain Field
“For each article, find the dealer(s) with the most expensive price.”
In ANSI SQL (MySQL Version 4.1 or greater), do it with a subquery
SELECT article, dealer, price
FROM shop s1
WHERE price=(SELECT MAX(s2.price)
FROM shop s2
WHERE s1.article = s2.article);
By Michael Schroeder, Biotec
129
But, In MySQL versions prior to 4.1, it has to be done in several steps, with a
temporary table (It doesn’t support nested-query \subquery).
CREATE TEMPORARY TABLE tmp
( article INT(4) UNSIGNED ZEROFILL DEFAULT '0000' NOT NULL,
price DOUBLE(16,2) DEFAULT '0.00' NOT NULL);
LOCK TABLES shop read;
INSERT INTO tmp
SELECT article, MAX(price) FROM shop GROUP BY article;
SELECT shop.article, dealer, shop.price
FROM shop, tmp
WHERE shop.article=tmp.article AND shop.price=tmp.price;
UNLOCK TABLES; DROP TABLE tmp;
By Michael Schroeder, Biotec
130