Text-Based Searching
Lesson 3
Bioinformatics Laboratory
1
Text-Based Searching
17-Jul-15
EMBnet
European Molecular Biology Network
• In 1988 a network was established to link European
laboratories that used bio-computing and
bioinformatics in molecular biology research.
• In each country a national node provides local biocomputing services
• INN serves as Israel’s National Node
Israel National Node
• INN serves as Israel’s National Node
• Authorized by the Ministry of Science in 1990.
• INN is located at the Biological Computing Unit,
Weizmann Institute of Science.
Bioinformatics Units at Universities
• In Israel, Bioinformatics Units arose at universities
in the mid-1990s to serve local needs
• At TAU – http://www.tau.ac.il/lifesci/bioinfo
Database Interrogation
• Two ways to search databases
– Database interrogation – searches textual information
contained in header sections of database entries
– Database searching – searches sequence information
with sequence queries
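The difference between the two approaches can be illustrated with a toy sketch. The entries, accession codes and function names below are invented for illustration, and exact substring matching stands in for a real sequence similarity search.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    accession: str  # made-up identifier, for illustration only
    header: str     # free-text annotation (description, organism, ...)
    sequence: str   # the residues themselves


ENTRIES = [
    Entry("X0001", "human beta globin mRNA", "ATGGTGCACCTGACTCCTGAG"),
    Entry("X0002", "mouse alpha globin gene", "ATGGTGCTCTCTGGGGAAGAC"),
    Entry("X0003", "yeast actin gene", "ATGGATTCTGAGGTTGCTGCT"),
]


def interrogate(entries, keyword):
    """Database interrogation: match a text query against header text."""
    keyword = keyword.lower()
    return [e.accession for e in entries if keyword in e.header.lower()]


def search_sequence(entries, query):
    """Database searching: match a sequence query against sequence data.

    Exact substring matching is a simplification of similarity search.
    """
    query = query.upper()
    return [e.accession for e in entries if query in e.sequence]


print(interrogate(ENTRIES, "globin"))      # text query on headers
print(search_sequence(ENTRIES, "CACCTG"))  # sequence query on sequences
```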
Database Interrogation
Problem of EMBnet
• No effective way of interrogating all the resources
at a particular site together, since their formats differ
• A research project was undertaken with EMBnet
to address the problems inherent in interfacing
complex environments. The result was SRS
(Sequence Retrieval System), a network browser for
databases in molecular biology
SRS
• SRS allows any flat-file database to be indexed to any other.
• A powerful tool that allows users to formulate queries across a
range of different database types via a single interface, without
having to worry about underlying data structures, query
languages, etc.
Sequence Retrieval System
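The idea of indexing flat files so that one query spans several databases can be sketched as follows. This is not how SRS itself is implemented; the two miniature databases, their entry IDs and the function names are assumptions made for illustration. The layout imitates EMBL-style flat files, where each line starts with a two-letter code and "//" ends an entry.

```python
from collections import defaultdict

# Two tiny made-up "databases" in an EMBL-like flat-file layout.
SEQ_DB = """\
ID   HSHBB
DE   human beta globin
OS   Homo sapiens
//
ID   MMHBA
DE   mouse alpha globin
OS   Mus musculus
//
"""

REF_DB = """\
ID   REF1
DE   globin gene structure in Homo sapiens
//
"""


def parse_flat_file(text):
    """Yield one {line_code: text} dict per entry."""
    entry = {}
    for line in text.splitlines():
        if line.startswith("//"):
            if entry:
                yield entry
            entry = {}
        elif line.strip():
            code, _, value = line.partition(" ")
            entry[code] = (entry.get(code, "") + " " + value.strip()).strip()


def build_index(databases):
    """Map each lowercased word to the set of (database, entry ID) pairs."""
    index = defaultdict(set)
    for name, text in databases.items():
        for entry in parse_flat_file(text):
            for code, value in entry.items():
                if code == "ID":
                    continue  # the ID is the lookup result, not a search word
                for word in value.lower().split():
                    index[word].add((name, entry["ID"]))
    return index


index = build_index({"seq": SEQ_DB, "ref": REF_DB})
print(sorted(index["globin"]))   # hits from both databases
print(sorted(index["sapiens"]))
```

A single word lookup returns entries from both databases, which is the behaviour the slide describes: one interface, with the underlying file layouts hidden from the user.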
SRS – List of Public SRS Servers
Searching SRS
SRS Tutorial
Search SRS Databases
SRS Standard Query Form
SRS Extended Query Form
NCBI
• National Center for Biotechnology Information
• Established in 1988 and located on the NIH campus as a
subdivision of NLM (National Library of Medicine)
• Since 1992 one of NCBI’s major tasks has been the
maintenance of GenBank
Entrez
• Entrez allows retrieval of molecular biology data and
bibliographic citations from NCBI’s integrated
databases
• Entrez, unlike SRS, does not allow customization
with an institute’s preferred databases
Entrez
• Most records are linked to other records, both within a
given database and between databases
• Sequence databases are linked to the MEDLINE
database, so one can move seamlessly from a paper to
its sequences and vice versa
• “Neighboring” groups related records together:
MEDLINE papers with similar subjects, and sequence
entries found to be similar through BLAST searches
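Entrez can also be queried programmatically through NCBI's E-utilities, which the slides do not cover. The sketch below only builds request URLs for a text search (esearch) and for following links between databases (elink); it sends nothing over the network. The search term and the record ID are placeholders, and the helper function names are assumptions.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """URL for a text query against one Entrez database."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax}
    )


def elink_url(dbfrom, db, uid):
    """URL asking for records in `db` linked to record `uid` in `dbfrom`."""
    return f"{EUTILS}/elink.fcgi?" + urlencode(
        {"dbfrom": dbfrom, "db": db, "id": uid}
    )


# Text search in the literature database.
print(esearch_url("pubmed", "beta globin"))

# From a paper to linked nucleotide records ("12345" is a placeholder ID).
print(elink_url("pubmed", "nuccore", "12345"))
```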
Entrez at NCBI
Entrez Pros and Cons
• Pros
– Integrates reference database with sequence
database seamlessly
• Cons
– Very dependent on the network link, as the databases
being searched are in the US
GCG Software Package
• Similar syntax to Unix commands
• Type GCG in every new window to start
the program
• Same principles for all programs:
1. Write the command and its arguments
2. Choose parameters (or accept the defaults)
3. Receive the output (on screen and in a file)
Searching with GCG
• Stringsearch: a simple text search through local
databases.
• It searches either through definitions or through full
annotations.
• The definitions contain a minimal amount of
information for each entry: accession, organism
name, gene name, sequence length, date.
Searching with GCG
• The annotations contain the complete
documentation for each entry in the sequence
database, including journal and author names,
sequence features, comments, etc.
• Annotations take much longer to search through.
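The trade-off between the two search scopes can be sketched with a toy example. This is not GCG's implementation; the entries, field names and the string_search function are invented for illustration. The function reports how many characters it had to scan, as a rough stand-in for search time.

```python
# Made-up entries: a short definition line plus a longer full annotation.
ENTRIES = {
    "A00001": {
        "definition": "A00001 Homo sapiens HBB 626 bp 01-JAN-1995",
        "annotation": "Beta globin gene. Authors: Smith J. Journal: J Mol Biol. "
                      "Features: exon 1..92. Comment: sickle cell variant.",
    },
    "A00002": {
        "definition": "A00002 Mus musculus Hba 580 bp 02-FEB-1995",
        "annotation": "Alpha globin gene. Authors: Jones K. Journal: Cell. "
                      "Features: exon 1..95.",
    },
}


def string_search(entries, text, scope="definition"):
    """Return (matching accessions, number of characters scanned)."""
    text = text.lower()
    hits, scanned = [], 0
    for accession, fields in entries.items():
        haystack = fields["definition"]
        if scope == "annotation":
            # Full documentation: definition plus everything else.
            haystack += " " + fields["annotation"]
        scanned += len(haystack)
        if text in haystack.lower():
            hits.append(accession)
    return hits, scanned


print(string_search(ENTRIES, "smith", scope="definition"))
print(string_search(ENTRIES, "smith", scope="annotation"))
```

An author name is found only in the annotation scope, at the cost of scanning considerably more text per entry.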
Getting a sequence
• Fetch: copies a sequence file to your account,
using the accession number or the ID code.
Example: fetch hum_hbb
• Fetches all the files with the given accession
number. Can be limited to a certain
database using a database code.
Example: fetch embl:u01613
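The database:name convention used in the example above can be sketched as a small lookup. This only imitates the behaviour described on the slide; the DATABASES dictionary, its contents and both function names are invented for illustration and say nothing about what the real databases contain.

```python
# Toy stand-in for locally installed databases (contents are made up).
DATABASES = {
    "embl": {"u01613": "sequence text A"},
    "genbank": {"u01613": "sequence text B", "hum_hbb": "sequence text C"},
}


def parse_fetch_spec(spec):
    """Split 'database:name' into (database, name); database may be None."""
    database, sep, name = spec.lower().partition(":")
    if not sep:
        return None, database  # no prefix: the whole spec is the name
    return database, name


def fetch(spec, databases=DATABASES):
    """Return (database, name) for every database holding the entry."""
    database, name = parse_fetch_spec(spec)
    candidates = [database] if database else list(databases)
    return [(db, name) for db in candidates if name in databases.get(db, {})]


print(fetch("u01613"))       # every database with that accession
print(fetch("embl:u01613"))  # limited to one database
```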
Sequence formats
• Different applications use different
sequence formats, for example:
– GCG
– FASTA/Pearson
Changing file formats
• Two GCG commands are used to convert
between the GCG and FASTA file formats:
– tofasta
– fromfasta
• Similar commands exist for other formats (fromembl, topir, etc.)
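What a conversion to FASTA involves can be sketched in a few lines. This is not the GCG program itself; it assumes the usual GCG single-sequence layout, in which free-text annotation is followed by a divider line ending in ".." and then by numbered sequence lines. The sample entry, its checksum value and the function name are made up for illustration.

```python
import textwrap

# Made-up entry in a GCG-like layout (the checksum is a dummy value).
GCG_TEXT = """\
Example entry, made up for illustration.

  example.seq  Length: 24  Check: 0000  ..

       1  ATGGTGCACC TGACTCCTGA GGAG
"""


def gcg_to_fasta(gcg_text, name, width=60):
    """Convert GCG-style text to a FASTA record with lines of `width`."""
    lines = gcg_text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break  # everything after this divider is sequence
    else:
        raise ValueError("no divider line ending in '..' found")
    # Keep letters only: drops position numbers and spacing
    # (gap and stop symbols are ignored in this sketch).
    residues = "".join(ch for ln in lines[i + 1:] for ch in ln if ch.isalpha())
    return f">{name}\n" + "\n".join(textwrap.wrap(residues, width)) + "\n"


print(gcg_to_fasta(GCG_TEXT, "example"), end="")
```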