
Ollie Bridle BSc. Hons., MA., MPhil.
May 2008
Outline
1. Introduction.
2. Information sources in biology and associated problems.
3. What is bioinformatics?
4. DNA databases.
5. Entrez. (+ exercise)
6. Summary.
Aims
Convince you that these bioinformatics
resources are valuable for research.
 Give you some important searching
strategies.
 Show you how to find what you want.
 Suggest other resources and further help.

What I won’t Cover
All the resources available.
 Commercial software.
 Huge amounts of scientific detail.
Bibliographic and abstract databases.
Check out some of the other WISER sessions.
About Me…




Trainee librarian.
Formerly a biologist - degrees in Microbiology
(BSc) and Microbial Genetics (MPhil).
Much less familiar with animal and population
genetics…but…
As far as searching databases goes, similar
principles apply.
Information Sources for
Research - Key Questions
What is available?
 Where do I find it?
 How do I search it?

Information Sources for Research
Journals, books, theses, abstracts.
Technical literature (e.g. protocols,
equipment handbooks).
Conferences, seminars, meetings and
exhibitions.
Molecular biology databases.
Problems with Biological Data





Data collection.
The base of information is large, expanding and
diverse.
Organisation and accessibility.
Requirement for special search techniques. You
can’t Google a DNA sequence…yet!
A student/researcher wants the right information
quickly!!!
The Good News
Large projects working to organise this
information.
 Much is freely available over the internet.
 University subscribes to many e-journals
and bibliographic databases available
through Oxlip.

A Definition of Bioinformatics
‘…information technology applied to the
management and analysis of biological
data’ (Attwood, T. K.)

A multidisciplinary subject.
Bioinformatics aims to…
 Collect,
 Organise,
 Store,
 Retrieve,
 Analyse,
….biological data with the use
of computers.
Scope of Bioinformatics
Bioinformatics covers:
E-journals and bibliographic databases.
Protein structural modelling.
Gene expression studies.
Taxonomy and phylogenetics.
Protein interaction.
DNA/Protein sequence databases.
What is a DNA Sequence?
The DNA double helix is made up of a
series of chemical bases strung along a
sugar backbone.
 There are 4 bases usually represented by
the letters A, T, C and G.
 The linear sequence in which these bases
occur determines all the instructions for
building an organism.

What is a Protein Sequence?






Proteins are complex molecules which
control most aspects of cell biology.
Constructed of small subunits called amino
acids.
There are 20 types of amino acid.
Assembled by ‘reading’ (or translating) the
DNA sequence.
Every set of 3 bases (a codon, e.g. ATG) corresponds
to an amino acid or to a stop signal.
So a protein is built up one amino acid at a
time according to the DNA blueprint.
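As a rough illustration of 'reading' DNA three bases at a time, here is a minimal Python sketch. It assumes the standard genetic code and a sequence that is already in the correct reading frame. The names `CODON_TABLE` and `translate` are made up for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA into one-letter amino acid codes, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    # Step through the sequence one codon (3 bases) at a time.
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # ATG -> M, GCC -> A, TAA -> stop
```

Real sequences need more care (choice of reading frame, ambiguous bases, alternative genetic codes), which is what the dedicated tools handle.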
In Summary…
DNA Sequence
DNA Molecule
Proteins
Complete Organism
Looking at DNA sequences I

Analysis of DNA or protein sequences is a
frequent requirement of research.
 Locating genes within a sequence.
 Comparing two sequences for similarity.
 Searching for similar genes (orthologues) in
other organisms.
Looking at DNA sequences II

DNA sequences are easily stored, retrieved,
compared and manipulated on computers.
 Just represent each base as a letter!
Computers can compare two or more
sequences and find similar regions.
Much analysis of genetic information now takes
place in silico.
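Because each base is just a letter, a comparison can be as simple as counting matching positions. The sketch below is a toy example, assuming two sequences of equal length that are already lined up. The function name `percent_identity` is made up for illustration.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    if not seq_a:
        return 0.0
    # Compare the sequences position by position, ignoring letter case.
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


print(percent_identity("ATCG", "ATCC"))  # 3 of 4 positions match
```

Real tools also have to work out how the sequences line up in the first place, which is the hard part.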
Looking at DNA Sequences III
DNA sequences can be determined
experimentally.
 Software allows biologists to construct and
view maps of DNA sequence.
 The DNA code of ATCG gets transformed
into something much more human friendly.
 Artemis is one available map viewer.

Artemis Map Viewer
Using a DNA Sequence
Forensics
Identifying genes of similar function
Medical diagnostics
Determining protein composition
Classification
Identification
DNA Databases
Free access to vast numbers of
sequences deposited by researchers all
over the world.
 Used alongside scientific papers.
 Can be searched or ‘mined’ in a variety of
ways.

Global Bioinformatics Agencies
DNA Data Bank of Japan
International Nucleotide Sequence Database Collaboration
European Molecular Biology Laboratory
National Center for Biotechnology Information
NCBI and GenBank
GenBank is NCBI’s DNA database.
 Extensive search and deposit capabilities.

606 sequences
A Practical Example


A researcher might start with a piece of
DNA rather than a literature citation.
Here we will:
1. Search a DNA database using a piece of DNA sequence.
2. Use the results of the search to identify relevant literature.
The Experiment
1) Grow some bugs.
2) Extract the DNA.
3) Amplify up the desired section of DNA.
4) Generate sequence.
A DNA Sequence

The following sequence is in FASTA
format.
>G08_CHEV11Fed.seq
GTCGACGCGCAAATGGTTCTATATCCATACCAATAGCAGTATCGTTGCCA
TTATCACGAATGGAATTAAGTAAAGTTTTCATTCTATCAATAGACTCTAA
AACCACATCCATGATATCTGGAGTTATTTTTAACTCGCCATGTCTTGCTT
TGTTTAAAACATCCTCCATGTGGTGAGTTAACTTTGTTAAAACATCAAAA
TTTAAGAAGCTTGATGATCCTTTAACCGTATGTGCAACACGGAAAATTCT
ATTTAATAATTCTAAATCTTCTGGATTTGATTCAAGCTCTACTAAATCAT
GGTCGATTTGCTCAACAAGCTCAAAAGCTTCAACCAAAAAGTCTTCAAGT
ATTTCTTGCATATCTTCCATATTTTACCCCTGTTCTTGAGATTGATGTTT
TTTAATAACCTTTGCAATTTCATTGAAGAAATCGCTAGCGTTAAATTTGA
CAAGATAGCCTTCTCCACCAGCTTCTTGAACACCTTTCTCATTCATAAAT
TCATTTGATAAAGATGAGTTAAAGACTATAGGAATATCTTTAAATCCGGG
ATCTTCTTTAATGCGTGCAGCGGATCCCGGGTACCTGCAGAATTCAGCTG
CGCCCTTTAGTTCCTAAAGGGTTTTTATCAGTGCGACAAACTGGGATTTT
ATTTATTCAGCAAGTCTTGTAATTCATCCAAAAAACGGCAAACATGAAAG
CCGTCACAAACGGCATGATGCACTTGAATCGATAAGGGAATATAGTATTT
TCCGCCCTCCTCATAATACTTCCCAAACGTAAATATCGGCAGTAGATAGT
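For anyone who wants to handle a FASTA file in a script, here is a minimal Python sketch of the format: a header line starting with '>' followed by one or more lines of sequence. The function name `parse_fasta` is made up for this example, which uses only the first two sequence lines from the slide.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A '>' line starts a new record.
            name = line[1:]
            records[name] = []
        elif name is not None:
            # Sequence lines are joined back together later.
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# First two sequence lines of the example above.
example = """>G08_CHEV11Fed.seq
GTCGACGCGCAAATGGTTCTATATCCATACCAATAGCAGTATCGTTGCCA
TTATCACGAATGGAATTAAGTAAAGTTTTCATTCTATCAATAGACTCTAA
"""

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence), "bases")
```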
A BLAST Search
Basic Local Alignment Search Tool
 Aimed at finding highly similar sequences
in the database.
 Let’s see how to submit a sequence query
to the GenBank database.
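To give a feel for what 'local alignment' means, here is a small Python sketch of the Smith-Waterman method, which scores the best-matching region shared by two sequences. This is not how BLAST itself works: BLAST uses a faster heuristic to search whole databases. The scoring values (match, mismatch, gap) are arbitrary assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences (Smith-Waterman)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below 0, so a poor region does not
            # penalise a good region elsewhere in the sequences.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# The two sequences share only the central region GTCGAC.
print(local_alignment_score("AAAGTCGACTTT", "CCCGTCGACGGG"))
```

A high score means the two sequences contain a strongly similar stretch, even if the rest of each sequence is unrelated.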

BLAST Search Screen
Enter sequence.
Select database.
Select BLAST type.
BLAST Results I
The Statistics

Guidelines for evaluating stats (data from
‘Introduction to Bioinformatics’, Lesk, A, OUP (2005))
 E ≤ 0.02: sequences probably homologous
(i.e. derived from a common ancestor).
 E between 0.02 and 1: homology unproven
but can’t be ruled out.
 E > 1: a match this good would be expected by chance.
Putting the amino acid sequence
NELLYTHEELEPHANT into a BLAST
protein search produces results!
 Best match E value = 9
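The rules of thumb above can be written as a small Python function. This is only a sketch of the guideline thresholds quoted on this slide. The function name `interpret_e_value` and the wording of the returned messages are made up for illustration.

```python
def interpret_e_value(e_value):
    """Apply the rule-of-thumb E value thresholds quoted above."""
    if e_value < 0:
        raise ValueError("E values cannot be negative")
    if e_value <= 0.02:
        return "probably homologous"
    if e_value <= 1:
        return "homology unproven but cannot be ruled out"
    return "a match this good is expected by chance"


# The NELLYTHEELEPHANT search gave a best match with E = 9.
print(interpret_e_value(9))
```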

BLAST Results II
Two possible
matches.
BLAST Results III
Literature references allow us to go straight to citations in PubMed
relevant to the sequence we have found.
Here is the name of the gene!
Evaluating the Data

There are errors in these databases!
Is a BLAST search appropriate?
What is the source of this sequence?
Should I cross reference?
What are the statistics telling me?
Using Accession Numbers
Papers often contain accession numbers.
 No database submission = No publication.
 Using HTML versions of papers you can
link directly to the gene or protein
sequence.
 Here’s one I made earlier….

Exploring Further




Start with a completely unknown sequence.
Searching for ‘CheV’ in WOS will not bring up all
the relevant papers.
Starting from a DNA sequence you have a new
way to search.
‘Having a BLAST with bioinformatics (and
avoiding BLASTphemy)’, A. Pertsemlidis and J.
W. Fondon III. Genome Biology (2001), 2(10),
pp. 1-10
Structure of Entrez



Powerful resource for research.
Entrez is a cross-database search engine.
Records are cross referenced and linked.
Simple ‘one box’ search across:
DNA databases
Literature database
Protein databases
Genome projects
Taxonomy databases
Entrez Main Screen
Single Keyword Search

Type a keyword into the search box and
click ‘GO’.
The number of hits for the search term is
shown by each database.
 Single keyword searches are limited.
 Advanced search techniques refine results
and produce fewer irrelevant hits.

Using Boolean Operators
Boolean operators and phrases build
complex searches.
 Use AND, OR and NOT to join terms.
Chemotaxis AND “Campylobacter jejuni”
 Use UPPERCASE for the operators.
 A phrase is enclosed in quotation marks.
“Protein glycosylation”
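If you build searches in a script rather than typing them, the same rules apply. The Python sketch below joins terms with an uppercase operator and puts quotation marks around phrases. The helper names are made up for illustration. The final lines build a search URL for NCBI's E-utilities `esearch` service, which is an assumption beyond these slides (the session itself uses the Entrez web page). Nothing is sent over the network.

```python
from urllib.parse import urlencode


def quote_phrase(term):
    """Put quotation marks around multi-word phrases."""
    return f'"{term}"' if " " in term else term


def build_query(terms, operator="AND"):
    """Join search terms with an uppercase Boolean operator."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(quote_phrase(term) for term in terms)


query = build_query(["Chemotaxis", "Campylobacter jejuni"])
print(query)  # Chemotaxis AND "Campylobacter jejuni"

# Assumption: the same query passed to the E-utilities esearch service.
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + urlencode(
    {"db": "pubmed", "term": query}
)
print(url)
```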

Your Turn!
A little practice using
Entrez.
 Follow the instructions on
the handout.
 Shout if you have
problems.

10 Minutes
Notes on the Exercise
Using brackets with Boolean operators
refines search results.
 Care with placing brackets is essential!
 The clipboard is helpful for recording
results of searches.
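To see why bracket placement matters, think of each search term as the set of records it matches. The Python sketch below uses made-up record numbers and term names purely for illustration: AND behaves like set intersection and OR like set union.

```python
# Hypothetical record numbers matched by each search term.
chemotaxis = {1, 2, 3}
motility = {3, 4}
campylobacter = {2, 4, 5}

# (chemotaxis OR motility) AND campylobacter
brackets_first = (chemotaxis | motility) & campylobacter

# chemotaxis OR (motility AND campylobacter)
brackets_last = chemotaxis | (motility & campylobacter)

print(sorted(brackets_first))  # [2, 4]
print(sorted(brackets_last))   # [1, 2, 3, 4]
```

The same three terms give different result sets depending only on where the brackets go.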

Refining Searches and Setting
Limits.
Within an individual database results may
be further refined by setting limits.
 The number and type of limits will depend
on the database.
 Click the ‘limits’ tab from within one of the
databases.

Steps in Setting a Limit
1. Select a field to limit the search by.
2. Type in the limiting term in the search box.
3. Select other limiting options, e.g. publication date or database.
4. Hit ‘GO’ to retrieve the results.
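Limits set through the web form can also be written directly into a query using field tags in square brackets. The Python sketch below assumes PubMed-style tags such as `[Title]` and `[pdat]` (publication date). Tags differ between databases, so check the Entrez help for the one you are searching. The helper names are made up for illustration.

```python
def limit_to_field(term, field):
    """Attach a field tag so the term is searched in that field only."""
    return f"{term}[{field}]"


def date_range(start_year, end_year, field="pdat"):
    """Build a year-range limit; 'pdat' is assumed to mean publication date."""
    return f"{start_year}:{end_year}[{field}]"


query = " AND ".join(
    [limit_to_field("chemotaxis", "Title"), date_range(2000, 2008)]
)
print(query)  # chemotaxis[Title] AND 2000:2008[pdat]
```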
Using the History
The history keeps track of previous
searches.
 You can combine searches and limits
quickly and easily.
 You can isolate records matching very
specific criteria.
 A demonstration....

Jumping Between Databases


Records in Entrez are extensively cross linked.
The ‘links’ hyperlink next to each record lets you
jump between databases.
Entrez in Summary

We’ve looked at:
 Simple and advanced searching.
 Accessing and moving between records.
 Using the clipboard.
 Setting limits.
 Using the history.
 Sorting results.
Evaluating Entrez I

Advantages
 Quickly cross reference many databases.
 Elaborate searches can be constructed within
each database.
 Tools to save and modify searches.
 Pools many resources.
Evaluating Entrez II

Disadvantages
 Can return many irrelevant results.
 Syntax for advanced searching is complicated
(many databases = many fields).
 Doesn't cover everything!
Summary





Bioinformatics resources help collect, organise
and analyse biological data.
Essential resources for biology research.
Bioinformatics databases can be searched in
unique ways.
Entrez provides a powerful cross-database
searching tool.
Many more resources out there!
And Finally…
Thanks for listening!
Any Questions?
Resources
Search Engines and Software
 NCBI BLAST – www.ncbi.nlm.nih.gov/blast/Blast.cgi
 Entrez – www.ncbi.nlm.nih.gov/sites/gquery
 SRS – Another cross database search engine for
bioinformatics data similar in principle to Entrez.
http://srs.ebi.ac.uk/
 EMBOSS Bioinformatics software – A whole suite of
free applications for processing many kinds of biological
data. http://emboss.sourceforge.net/
 ARTEMIS – A free sequence viewer and editor.
www.sanger.ac.uk/Software/Artemis/
Sources of Help I

EMBL, DDBJ and NCBI all provide reliable introductory information on
bioinformatics. They also have extensive documentation for the
databases and bioinformatics tools they support.
Tutorials
 Try out the 2can tutorials provided by EMBL
www.ebi.ac.uk/2can/home.html
Entrez Help
 The Entrez manual can be viewed on-line or downloaded as a PDF
document.
www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.chapter.EntrezHelp
Sources of Help II
Subject Guides
 Subject librarians have prepared a number of guides to research resources
available in a range of scientific fields. www.ouls.ox.ac.uk/rsl/e-resources
Books
 A number of books are available through OULS. I’d particularly recommend
the following. Search the OLIS catalogue at www.lib.ox.ac.uk/olis/


‘Essential Bioinformatics’ by Jin Xiong (2006), Cambridge University Press.
‘Bioinformatics. Sequence and Genome Analysis, 2nd Edition’ by D. W. Mount.
(2004), Cold Spring Harbor Laboratory Press.
Courses
 Oxford University School of Continuing Education has a bioinformatics
programme offering short courses, diplomas and Masters qualifications.
 http://bioinfomsc.stats.ox.ac.uk/