Databases - TeachLine

Databases
Storage
Retrieval
Different kinds of DBs dealing with biological information retrieved by various means
DNA
• DNA sequences (individual genes or complete genomes)
RNA
• cDNA
• ESTs
• Non-coding RNA
Protein
• Protein sequences
• Translated nucleotide sequences
• Protein domains
• Protein structure
Phenotype
• Diseases
• Polymorphism
• Gene expression
• Protein-protein interactions
Common to all databases
• A database is a structured collection of
information.
• A database is composed of basic objects
called records or entries.
• Each record is composed of fields,
which hold defined data that is related to
that record.
Let’s consider the following database of
students studying bioinformatics at HUJI
Databases
A database can be thought of as a large table, where the
rows represent records and the columns represent
fields.
ID        | First Name | Last Name | Gender  | Comments
0775523/7 | Sharon     | Asulin    | female  | Likes scuba diving
020304/4  | Nurit      | Niv       | female… | Comes from Cuba
03321/3   | Nurit      | Sharon    | female… | -
88924/5   | Yossi      | Yarkon    | male…   | Father of sharon – must go home earlier
ID (Accession Numbers): Unique identifiers of the database records.
What can we learn about fields?
• Some fields are well defined (gender: male or female); others are
less defined (comments)
• A better database will try to store information in well-defined
fields.
• Some records contain similar data in some of
the fields
• For some records there is only partial
information: some fields contain no data
(this reflects the quality of the DB)
• Each record needs a unique identifier
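The points above can be illustrated with a minimal Python sketch. It assumes each record is stored as a dictionary whose keys are the fields; the key names and the two helper functions are hypothetical, and the trailing "…" shown in the Gender column of the slides is dropped.

```python
# Toy student table: one dictionary per record, one key per field.
# None marks a field that holds no data (shown as "-" in the table).
records = [
    {"id": "0775523/7", "first_name": "Sharon", "last_name": "Asulin",
     "gender": "female", "comments": "Likes scuba diving"},
    {"id": "020304/4", "first_name": "Nurit", "last_name": "Niv",
     "gender": "female", "comments": "Comes from Cuba"},
    {"id": "03321/3", "first_name": "Nurit", "last_name": "Sharon",
     "gender": "female", "comments": None},
    {"id": "88924/5", "first_name": "Yossi", "last_name": "Yarkon",
     "gender": "male", "comments": "Father of sharon - must go home earlier"},
]


def ids_are_unique(recs):
    """Every record needs a unique identifier (accession number)."""
    ids = [r["id"] for r in recs]
    return len(ids) == len(set(ids))


def ids_with_missing_data(recs):
    """IDs of records in which at least one field holds no data."""
    return [r["id"] for r in recs if any(v is None for v in r.values())]


print(ids_are_unique(records))         # True
print(ids_with_missing_data(records))  # ['03321/3']
```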
Data Retrieval
• The purpose of databases is
not merely to collect and
organize data, but mainly to
allow advanced data retrieval.
• A query is a method
to retrieve information from
the database.
• The organization of each
record into predetermined
fields allows us to use queries
on fields.
The best search strategy…
1. Think – phrase your scientific question.
2. Choose appropriate database
3. Phrase your query (today's topics: fields, syntax, keywords,
Boolean operators)
4. Access additional entries discussing same or similar
entities by links to additional databases (DBXref)
5. Think, evaluate. The computer is just a machine.
You are (hopefully) a thinking organism.
The secretary wants to locate the record of the
student Sharon Asulin but does not remember her last
name, so she searches for Sharon
ID        | First Name | Last Name | Gender  | Comments
0775523/7 | Sharon     | Asulin    | female  | Likes scuba diving
020304/4  | Nurit      | Niv       | female… | Comes from Cuba
03321/3   | Nurit      | Sharon    | female… | Receives scholarship
88924/5   | Yossi      | Yarkon    | male…   | Proud father of sharon
The search was not limited to a certain field: Sharon[all fields]
Syntax (NCBI): keyword[field definition]
OOPS !!
Retrieved too many records that don’t
match the required data - too much noise.
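A minimal sketch of why the unrestricted search is noisy. The function name search_all_fields and the dictionary keys are hypothetical, matching is assumed to be case-insensitive on whole words, and the trailing "…" in the Gender column is dropped; this is a toy illustration, not how NCBI implements its search.

```python
records = [
    {"id": "0775523/7", "first name": "Sharon", "last name": "Asulin",
     "gender": "female", "comments": "Likes scuba diving"},
    {"id": "020304/4", "first name": "Nurit", "last name": "Niv",
     "gender": "female", "comments": "Comes from Cuba"},
    {"id": "03321/3", "first name": "Nurit", "last name": "Sharon",
     "gender": "female", "comments": "Receives scholarship"},
    {"id": "88924/5", "first name": "Yossi", "last name": "Yarkon",
     "gender": "male", "comments": "Proud father of sharon"},
]


def search_all_fields(recs, keyword):
    """Return IDs of records in which any field contains the keyword as a word."""
    kw = keyword.lower()
    hits = []
    for r in recs:
        # Look at every field of the record, not just one.
        if any(kw in value.lower().split() for value in r.values()):
            hits.append(r["id"])
    return hits


# Toy equivalent of Sharon[all fields]: three records for one wanted student.
print(search_all_fields(records, "Sharon"))
# ['0775523/7', '03321/3', '88924/5']
```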
Evaluating Search Results
Search results compared with the “scientific truth”:

          | Found (+)      | Not found (-)
Related   | True positive  | False negative
Unrelated | False positive | True negative
ID        | First Name | Last Name | Gender  | Comments               | Result
0775523/7 | Sharon     | Asulin    | female  | Likes scuba diving     | True positive
020304/4  | Nurit      | Niv       | female… | Comes from Cuba        | -
03321/3   | Nurit      | Sharon    | female… | Receives scholarship   | False positive
88924/5   | Yossi      | Yarkon    | male…   | Proud father of sharon | False positive
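The labelling above can be sketched in Python. This assumes we already know which record IDs the search retrieved and which ones are truly related to the question; the function name classify is hypothetical.

```python
def classify(all_ids, retrieved, related):
    """Label each record by comparing the search result with the truth."""
    labels = {}
    for rid in all_ids:
        found = rid in retrieved
        relevant = rid in related
        if found and relevant:
            labels[rid] = "true positive"
        elif found:
            labels[rid] = "false positive"
        elif relevant:
            labels[rid] = "false negative"
        else:
            labels[rid] = "true negative"
    return labels


all_ids = ["0775523/7", "020304/4", "03321/3", "88924/5"]
retrieved = {"0775523/7", "03321/3", "88924/5"}  # hits for Sharon[all fields]
related = {"0775523/7"}                          # the student we wanted

for rid, label in classify(all_ids, retrieved, related).items():
    print(rid, label)
```

The record that was not retrieved and is not related (020304/4) comes out as a true negative.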
What can we do to reduce/eliminate false positives
without reducing true positives?
Let’s refine our search
Find all students whose first name is Sharon
Sharon[first name]
Syntax (NCBI): keyword[field definition]
ID        | First Name | Last Name | Gender  | Comments
0775523/7 | Sharon     | Asulin    | female  | Likes scuba diving
020304/4  | Nurit      | Niv       | female… | Comes from Cuba
03321/3   | Nurit      | Sharon    | female… | Receives scholarship
88924/5   | Yossi      | Yarkon    | male…   | Father of sharon – must go home earlier
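A minimal sketch of a field-restricted query in the keyword[field] style. The parser, the function names and the dictionary keys are hypothetical and only mimic the look of the NCBI syntax; they are not NCBI's implementation. Matching is assumed to be case-insensitive on whole words.

```python
import re

records = [
    {"id": "0775523/7", "first name": "Sharon", "last name": "Asulin",
     "gender": "female", "comments": "Likes scuba diving"},
    {"id": "020304/4", "first name": "Nurit", "last name": "Niv",
     "gender": "female", "comments": "Comes from Cuba"},
    {"id": "03321/3", "first name": "Nurit", "last name": "Sharon",
     "gender": "female", "comments": "Receives scholarship"},
    {"id": "88924/5", "first name": "Yossi", "last name": "Yarkon",
     "gender": "male", "comments": "Father of sharon – must go home earlier"},
]

# keyword[field definition], e.g. Sharon[first name]
QUERY = re.compile(r"^\s*(?P<term>[^\[\]]+?)\s*\[(?P<field>[^\[\]]+)\]\s*$")


def parse_query(query):
    """Split 'keyword[field]' into a lower-cased (keyword, field) pair."""
    m = QUERY.match(query)
    if m is None:
        raise ValueError(f"not a keyword[field] query: {query!r}")
    return m.group("term").lower(), m.group("field").lower()


def search(recs, query):
    """Return IDs of records whose chosen field contains the keyword as a word."""
    term, field = parse_query(query)
    hits = []
    for r in recs:
        values = r.values() if field == "all fields" else [r.get(field, "")]
        if any(term in v.lower().split() for v in values):
            hits.append(r["id"])
    return hits


print(search(records, "Sharon[first name]"))  # ['0775523/7']
print(search(records, "Sharon[all fields]"))  # ['0775523/7', '03321/3', '88924/5']
```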
Now suppose the first name in the record was mistyped as Sharom:
ID        | First Name | Last Name | Gender  | Comments
0775523/7 | Sharom     | Asulin    | female  | Likes scuba diving
020304/4  | Nurit      | Niv       | female… | Comes from Cuba
03321/3   | Nurit      | Sharon    | female… | Receives scholarship
88924/5   | Yossi      | Yarkon    | male…   | Father of sharon – must go home earlier
Now the query retrieves no records (a false negative), but we
are still not distracted by noise.
The original search phrase Sharon[all fields] would have
retrieved all the noise but not the required record.
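The same situation as a short sketch, assuming case-insensitive whole-word matching; the helper name ids_matching and the dictionary keys are hypothetical.

```python
# Same toy table, but the wanted record has the first name mistyped as "Sharom".
records = [
    {"id": "0775523/7", "first name": "Sharom", "last name": "Asulin",
     "gender": "female", "comments": "Likes scuba diving"},
    {"id": "020304/4", "first name": "Nurit", "last name": "Niv",
     "gender": "female", "comments": "Comes from Cuba"},
    {"id": "03321/3", "first name": "Nurit", "last name": "Sharon",
     "gender": "female", "comments": "Receives scholarship"},
    {"id": "88924/5", "first name": "Yossi", "last name": "Yarkon",
     "gender": "male", "comments": "Father of sharon – must go home earlier"},
]


def ids_matching(recs, term, field=None):
    """IDs of records where the term is a word in one field, or in any field if field is None."""
    term = term.lower()
    hits = []
    for r in recs:
        values = r.values() if field is None else [r[field]]
        if any(term in v.lower().split() for v in values):
            hits.append(r["id"])
    return hits


print(ids_matching(records, "Sharon", field="first name"))  # []  (false negative)
print(ids_matching(records, "Sharon"))  # ['03321/3', '88924/5']  (only noise)
```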
Boolean Operators
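• 1 AND 2 (records matching both terms): cell AND cycle
• 1 OR 2 (records matching either term): cell OR cycle
• 1 NOT 2 (records matching the first term but not the second): cell NOT cycle
• Phrase search: “cell cycle”
• Wildcard: cell* (cell, cells, cellular, etc.)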
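The Boolean operators correspond to set operations on the records that match each term. A minimal sketch, assuming a hypothetical four-entry text collection made up for illustration and whole-word, case-insensitive matching; the wildcard is emulated with Python's fnmatch module.

```python
from fnmatch import fnmatchcase

# Hypothetical mini-collection: entry number -> text.
docs = {
    1: "the cell cycle is tightly regulated",
    2: "cell membrane transport",
    3: "life cycle of a parasite",
    4: "cellular respiration in cells",
}


def ids_matching(pattern):
    """Entries in which at least one word matches the (wildcard) pattern."""
    pattern = pattern.lower()
    return {
        doc_id
        for doc_id, text in docs.items()
        if any(fnmatchcase(word, pattern) for word in text.lower().split())
    }


cell = ids_matching("cell")    # {1, 2}
cycle = ids_matching("cycle")  # {1, 3}

print(cell & cycle)           # cell AND cycle -> {1}
print(cell | cycle)           # cell OR cycle  -> {1, 2, 3}
print(cell - cycle)           # cell NOT cycle -> {2}
print(ids_matching("cell*"))  # cell*          -> {1, 2, 4}

# Phrase search: the two words must appear next to each other, in order.
phrase = {doc_id for doc_id, text in docs.items() if "cell cycle" in text.lower()}
print(phrase)                 # "cell cycle"   -> {1}
```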
The secretary wants to locate the record of the female
student who comes from Cuba but does not remember
her name.
Search female[gender] AND *cuba*[comments]
Syntax (NCBI): keyword[field definition] Boolean operator keyword[field definition]
ID        | First Name | Last Name | Gender  | Comments               | Result
0775523/7 | Sharon     | Asulin    | female  | Likes scuba diving     | False positive
020304/4  | Nurit      | Niv       | female… | Comes from Cuba        | True positive
03321/3   | Nurit      | Sharon    | female… | Receives scholarship   | -
88924/5   | Yossi      | Yarkon    | male…   | Proud father of sharon | -
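A minimal sketch of this combined query, assuming the wildcard pattern is matched case-insensitively against the whole field value (emulated with Python's fnmatch module). The helper name and dictionary keys are hypothetical, and the trailing "…" in the Gender column is dropped. It shows why *cuba* also catches "scuba". The whole-word variant at the end is one possible refinement and is not taken from the slides.

```python
from fnmatch import fnmatchcase

records = [
    {"id": "0775523/7", "first name": "Sharon", "last name": "Asulin",
     "gender": "female", "comments": "Likes scuba diving"},
    {"id": "020304/4", "first name": "Nurit", "last name": "Niv",
     "gender": "female", "comments": "Comes from Cuba"},
    {"id": "03321/3", "first name": "Nurit", "last name": "Sharon",
     "gender": "female", "comments": "Receives scholarship"},
    {"id": "88924/5", "first name": "Yossi", "last name": "Yarkon",
     "gender": "male", "comments": "Proud father of sharon"},
]


def field_matches(record, pattern, field):
    """True if the whole field value matches the wildcard pattern."""
    return fnmatchcase(record[field].lower(), pattern.lower())


# Toy equivalent of female[gender] AND *cuba*[comments]
hits = [
    r["id"] for r in records
    if field_matches(r, "female", "gender") and field_matches(r, "*cuba*", "comments")
]
print(hits)  # ['0775523/7', '020304/4']  ("scuba" contains "cuba")

# Possible refinement: require "cuba" as a whole word in the comments.
whole_word_hits = [
    r["id"] for r in records
    if field_matches(r, "female", "gender") and "cuba" in r["comments"].lower().split()
]
print(whole_word_hits)  # ['020304/4']
```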
And the main thing, the main thing:
do not be afraid at all