Searching DBs using keywords

Databases
Storage
Retrieval
Different kinds of DBs dealing with biological information retrieved by various means
DNA
• DNA sequences (individual genes or complete genomes)
RNA
• cDNA
• ESTs
• Non-coding RNA
Protein
• Protein sequences
• Translated nucleotide sequences
• Protein domains
• Protein structure
Phenotype
• Diseases
• Polymorphism
• Gene expression
• Protein-protein interactions
Common to all databases
• A database is a structured collection of information.
• A database is composed of basic objects called records or entries.
• Each record is composed of fields, which hold defined data related to that record.
Let’s consider the following database of students learning bioinformatics at HUJI.
Databases
A database can be thought of as a large table, where the rows represent records and the columns represent fields.
ID | First Name | Last Name | Gender | Comments
0775523/7 | Sharon | Asulin | female | Likes scuba diving
020304/4 | Nurit | Niv | female | Comes from Cuba
03321/3 | Nurit | Sharon | female | -
88924/5 | Yossi | Yarkon | male | Father of sharon, must go home earlier
• Each record has a unique identifier.
• Some records contain similar data in some of the fields.
• For some records there is only partial information: some fields contain no data (quality of DB).
ID (Accession Numbers): Unique identifiers of the database records.
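The records/fields/ID idea can be sketched in a few lines of Python. This is only an illustration, not how any real biological database is implemented: the field names (`id`, `first_name`, ...) and the helper `missing_fields` are hypothetical names chosen for this sketch.

```python
# Hypothetical in-memory version of the student table.
# Each record is a dict, each key is a field, and "id" is the unique identifier.
FIELDS = ("id", "first_name", "last_name", "gender", "comments")

records = [
    {"id": "0775523/7", "first_name": "Sharon", "last_name": "Asulin",
     "gender": "female", "comments": "Likes scuba diving"},
    {"id": "020304/4", "first_name": "Nurit", "last_name": "Niv",
     "gender": "female", "comments": "Comes from Cuba"},
    {"id": "03321/3", "first_name": "Nurit", "last_name": "Sharon",
     "gender": "female", "comments": ""},  # partial information: empty field
    {"id": "88924/5", "first_name": "Yossi", "last_name": "Yarkon",
     "gender": "male", "comments": "Father of sharon, must go home earlier"},
]

# Index the records by ID. This only works because every ID is unique.
by_id = {rec["id"]: rec for rec in records}
assert len(by_id) == len(records), "IDs must be unique"


def missing_fields(rec):
    """Return the names of the fields that hold no data in this record."""
    return [name for name in FIELDS if not rec.get(name)]


print(by_id["020304/4"]["last_name"])
print(missing_fields(by_id["03321/3"]))
```

Looking a record up by its ID is unambiguous, whereas first names such as Nurit appear in more than one record.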
Data Retrieval
• The purpose of databases is
not merely to collect and
organize data, but mainly to
allow advanced data retrieval.
• A query is a method to retrieve information from the database.
• The organization of each record into predetermined fields allows us to run queries on specific fields.
The best search strategy…
1. Think: phrase your scientific question.
2. Choose the appropriate database.
3. Phrase your query:
   • Fields
   • Syntax
   • Keywords
   • Boolean operators
4. Access additional entries discussing the same or similar entities by links to additional databases.
5. Think, evaluate. The computer is just a machine. You are (hopefully) a thinking organism.
Phrasing a query…
Terms/words for search [field] + (BOOLEAN OPERATOR) + terms/words for search [field]
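Boolean Operators
• 1 AND 2: records matching both terms, e.g. cell AND cycle
• Phrase search: “cell cycle”
• Wildcard: cell* matches cell, cells, cellular, etc.
• 1 OR 2: records matching either term, e.g. cell OR cycle
• 1 NOT 2: records matching the first term but not the second, e.g. cell NOT cycle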
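One way to picture the Boolean operators is as set operations on the records each term retrieves. The sketch below assumes a tiny invented corpus (`docs`) and naive word matching; real search engines index and tokenize text quite differently, so treat the helper functions as hypothetical.

```python
# Toy corpus, invented for illustration: document number -> text.
docs = {
    1: "cell cycle regulation in yeast",
    2: "the cell membrane",
    3: "life cycle of a parasite",
    4: "cellular respiration",
}


def matching(term):
    """IDs of documents containing `term` as a whole word (case-insensitive)."""
    return {i for i, text in docs.items() if term.lower() in text.lower().split()}


def matching_prefix(prefix):
    """IDs of documents with any word starting with `prefix` (like cell*)."""
    return {i for i, text in docs.items()
            if any(word.startswith(prefix.lower()) for word in text.lower().split())}


def matching_phrase(phrase):
    """IDs of documents containing the exact phrase (naive substring test)."""
    return {i for i, text in docs.items() if phrase.lower() in text.lower()}


cell, cycle = matching("cell"), matching("cycle")

print(cell & cycle)                   # cell AND cycle: intersection
print(cell | cycle)                   # cell OR cycle: union
print(cell - cycle)                   # cell NOT cycle: difference
print(matching_phrase("cell cycle"))  # "cell cycle" as a phrase
print(matching_prefix("cell"))        # cell*
```

AND narrows the result, OR widens it, and NOT removes records; the wildcard widens the result by accepting any word that starts with the given prefix.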
The secretary wants to locate the record of the student Sharon Asulin but does not remember the last name, so she searches for Sharon.
ID | First Name | Last Name | Gender | Comments
0775523/7 | Sharon | Asulin | female | Likes scuba diving
020304/4 | Nurit | Niv | female | Comes from Cuba
03321/3 | Nurit | Sharon | female | Receives scholarship
88924/5 | Yossi | Yarkon | male | Proud father of sharon
The search was not limited to a certain field: Sharon[all fields]
OOPS!!
Retrieved too many records that don’t match the required data: too much noise.
Evaluating Search Results
Search results are compared against the “scientific truth”:
“Scientific truth” | Found (+) | Not found (-)
Related | True positive | False negative
Unrelated | False positive | True negative
ID | First Name | Last Name | Gender | Comments | Result
0775523/7 | Sharon | Asulin | female | Likes scuba diving | True positive
020304/4 | Nurit | Niv | female | Comes from Cuba | -
03321/3 | Nurit | Sharon | female | Receives scholarship | False positive
88924/5 | Yossi | Yarkon | male | Proud father of sharon | False positive
What can we do to reduce/eliminate false positives
without reducing true positives?
Sensitivity
Ability of a method to detect positives, irrespective of how many false positives are reported.
Selectivity
Ability of a method to reject negatives, irrespective of how many related records are wrongly rejected (false negatives).
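As a worked example, the two measures can be computed for the Sharon[all fields] search on the four-record table. The sketch assumes the usual formulas, sensitivity = TP / (TP + FN) and selectivity = TN / (TN + FP) (the latter is often called specificity elsewhere); the function names are chosen for this illustration.

```python
def sensitivity(tp, fn):
    """Fraction of related records that were found."""
    return tp / (tp + fn)


def selectivity(tn, fp):
    """Fraction of unrelated records that were correctly left out."""
    return tn / (tn + fp)


# Sharon[all fields] on the student table: only Sharon Asulin is the record we want.
all_ids = {"0775523/7", "020304/4", "03321/3", "88924/5"}
relevant = {"0775523/7"}
retrieved = {"0775523/7", "03321/3", "88924/5"}

tp = len(retrieved & relevant)            # found and related
fp = len(retrieved - relevant)            # found but unrelated
fn = len(relevant - retrieved)            # related but missed
tn = len(all_ids - retrieved - relevant)  # unrelated and not found

print(tp, fp, fn, tn)
print(sensitivity(tp, fn))
print(round(selectivity(tn, fp), 2))
```

The search finds the wanted record, so sensitivity is perfect, but two of the three unrelated records are also returned, so selectivity is low.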
Let’s refine our search
Find all students whose first name is Sharon:
Sharon[first name]
NCBI keyword syntax: keyword[field definition]
ID | First Name | Last Name | Gender | Comments
0775523/7 | Sharon | Asulin | female | Likes scuba diving
020304/4 | Nurit | Niv | female | Comes from Cuba
03321/3 | Nurit | Sharon | female | Receives scholarship
88924/5 | Yossi | Yarkon | male | Father of sharon, must go home earlier
Now suppose the first name in the record was mistyped as “Sharom”:
ID | First Name | Last Name | Gender | Comments
0775523/7 | Sharom | Asulin | female | Likes scuba diving
020304/4 | Nurit | Niv | female | Comes from Cuba
03321/3 | Nurit | Sharon | female | Receives scholarship
88924/5 | Yossi | Yarkon | male | Father of sharon, must go home earlier
Now we don’t retrieve any answer (false negative?) and we
are still not distracted by the noise.
The original search phrase sharon[all fields] would have
retrieved all the noise but not the required info.
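The effect of restricting the search to one field, and of a typo in the data, can be sketched as follows. The `search` function, its `field` parameter and the field names are hypothetical, and the case-insensitive substring match is an assumption made for simplicity.

```python
def search(records, term, field=None):
    """Case-insensitive substring search; field=None means 'all fields'."""
    needle = term.lower()
    hits = []
    for rec in records:
        values = rec.values() if field is None else [rec[field]]
        if any(needle in str(value).lower() for value in values):
            hits.append(rec["id"])
    return hits


records = [
    {"id": "0775523/7", "first_name": "Sharon", "last_name": "Asulin",
     "gender": "female", "comments": "Likes scuba diving"},
    {"id": "020304/4", "first_name": "Nurit", "last_name": "Niv",
     "gender": "female", "comments": "Comes from Cuba"},
    {"id": "03321/3", "first_name": "Nurit", "last_name": "Sharon",
     "gender": "female", "comments": "Receives scholarship"},
    {"id": "88924/5", "first_name": "Yossi", "last_name": "Yarkon",
     "gender": "male", "comments": "Father of sharon, must go home earlier"},
]

# Same table, but with the first name of the wanted record mistyped.
misspelled = [dict(rec) for rec in records]
misspelled[0]["first_name"] = "Sharom"

print(search(records, "Sharon", field="first_name"))     # the wanted record only
print(search(misspelled, "Sharon", field="first_name"))  # nothing: a false negative
print(search(misspelled, "Sharon"))                      # only noise, wanted record missing
```

The field-restricted query removes the noise, but neither query can recover a record whose data was entered incorrectly.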
The secretary wants to locate the record of the female student who comes from Cuba but does not remember her name.
Search: female[gender] AND *cuba*[comments]
NCBI keyword syntax: keyword[field definition], combined with a Boolean operator
ID | First Name | Last Name | Gender | Comments | Result
0775523/7 | Sharon | Asulin | female | Likes scuba diving | False positive
020304/4 | Nurit | Niv | female | Comes from Cuba | True positive
03321/3 | Nurit | Sharon | female | Receives scholarship | -
88924/5 | Yossi | Yarkon | male | Proud father of sharon | -
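The query female[gender] AND *cuba*[comments] can be imitated with Python's `fnmatch` wildcards. This is a rough sketch and not NCBI's actual matching logic; the helper `field_matches` and the field names are hypothetical. The second query shows one possible refinement, matching cuba as a whole word, which is an assumption about what the secretary really wants.

```python
import re
from fnmatch import fnmatchcase

records = [
    {"id": "0775523/7", "first_name": "Sharon", "last_name": "Asulin",
     "gender": "female", "comments": "Likes scuba diving"},
    {"id": "020304/4", "first_name": "Nurit", "last_name": "Niv",
     "gender": "female", "comments": "Comes from Cuba"},
    {"id": "03321/3", "first_name": "Nurit", "last_name": "Sharon",
     "gender": "female", "comments": "Receives scholarship"},
    {"id": "88924/5", "first_name": "Yossi", "last_name": "Yarkon",
     "gender": "male", "comments": "Proud father of sharon"},
]


def field_matches(rec, field, pattern):
    """Case-insensitive wildcard match of `pattern` against a single field."""
    return fnmatchcase(rec[field].lower(), pattern.lower())


# female[gender] AND *cuba*[comments]
hits = [rec["id"] for rec in records
        if field_matches(rec, "gender", "female")
        and field_matches(rec, "comments", "*cuba*")]
print(hits)  # includes the "scuba" record: a false positive

# Possible refinement: require "cuba" as a whole word in the comments.
strict = [rec["id"] for rec in records
          if field_matches(rec, "gender", "female")
          and re.search(r"\bcuba\b", rec["comments"], re.IGNORECASE)]
print(strict)
```

The leading wildcard in *cuba* is what lets "scuba" through; dropping it, or matching whole words, removes the false positive without losing the true positive.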
And the main thing, the main thing:
have no fear at all.