Transcript Programming

Bioinformatics at Promega Corporation
Intro to Bioinformatics Biotec
November 28, 2006
Ethan Strauss
Sr. Scientist R&D Bioinformatics,
Promega,
http://q7.com/~ethan/molbio
My Background
•Bachelor’s degree in biology
•PhD and work experience in Molecular Biology
•Eight years in Promega Technical Services
•Almost two years in Bioinformatics (officially)
No formal computer training
No formal bioinformatics training
Bioinformatics at Promega Corporation
•Bioinformatics did not exist as a separate function until 2001
•One person, 2001-2005
•Two people, 2005-?
•Bioinformatics supports primarily R&D (~100 scientists)
•Mentor and train R&D scientists
•Provide expertise for projects (~120 requests per year)
•Propose and evaluate new acquisitions
•Liaison to IT department
•Manage bioinformatics infrastructure (~15 tools)
•Develop new tools and adapt existing tools in house
Bioinformatics Projects
Programming
•Tools for internal and external Promega customers
•Plexor™ Primer Design System
(https://www.promega.com/techserv/tools/plexor/logon.aspx)
•Biomath
(http://www.promega.com/biomath/)
•siRNA Designer
(http://www.promega.com/siRNADesigner/)
•Sequence analysis for Excel and Microsoft Word
(http://www.promega.com/enotes/features/fe0025.htm)
•Analysis of BLAST results
•Automated data retrieval (Web services)
•Database for tracking vector construction
•Database for keeping track of plasmid features
Bioinformatics Projects
Biocomputing (use of computers in biological research)
•Database searches
•Data mining
•Discovery research
•Primer design
•BLAST analysis and interpretation
•Etc.
NCBI
•I recently took the Powerscripting course from NCBI
•NCBI has a lot of very powerful tools and databases.
•They are not as well documented as they might be.
•Check them out periodically.
•NCBI databases I was not aware of before the course:
•PubMed Central
Free full-text articles
•3D Domains, Structure
3D structural information
•GEO (Gene Expression Omnibus)
Microarray expression data
•There are many more on the drop-down list that I
don’t really know anything about
NCBI – FTP site
•Most NCBI data is available by FTP from
http://www.ncbi.nlm.nih.gov/Ftp/
•I have used it for a number of projects, including an analysis of amino
acid residue distribution at the first 11 positions of human and E. coli proteins
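The residue-distribution analysis mentioned above can be sketched in a few lines. This is a minimal Python sketch, not the code used at Promega: the helper name position_counts and the toy sequences are assumptions made up for illustration. It tallies which amino acid occurs at each of the first 11 positions across a set of protein sequences.

```python
from collections import Counter


def position_counts(sequences, n_positions=11):
    """Tally residues at each of the first n_positions of every sequence.

    Hypothetical helper for illustration. Sequences shorter than
    n_positions only contribute to the positions they actually have.
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in sequences:
        for i, residue in enumerate(seq[:n_positions].upper()):
            counts[i][residue] += 1
    return counts


# Toy protein sequences, invented for the example (not real records).
toy_proteins = ["MKTAYIAKQRQISFV", "MSTNPKPQRKTKRNT", "MAL"]

for pos, counter in enumerate(position_counts(toy_proteins), start=1):
    print(pos, dict(counter))
```

For real data, the sequences would come from protein FASTA files downloaded from the FTP site; parsing those files is left out of the sketch.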
NCBI - Entrez Programming Utilities
Programmatic access to Entrez
http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
Allows incorporation of Entrez functionality into third-party tools
http://www.promega.com/techserv/tools/plexor/NewQpcrProject.aspx
Allows automation of Entrez searches
Analysis of large datasets
Automation of searches and queries
Accessible using HTTP or SOAP
NCBI - Entrez Programming Utilities
Programs available:
• ESearch: Searches and retrieves primary IDs and term translations
and optionally retains results for future use in the user's environment.
• ESummary: Retrieves document summaries from a list of primary IDs
or from the user's environment.
• EFetch: Retrieves records in the requested format from a list of one or
more primary IDs or from the user's environment.
• ELink: Checks for links from the query ID numbers to other Entrez
databases
• EInfo: Provides field index term counts, last update, and available
links for each database.
• EPost: Posts a file containing a list of primary IDs for future use in the
user's environment to use with subsequent search strategies.
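Each utility listed above is called through an ordinary HTTP URL, so chaining ESearch into EFetch mostly comes down to building query strings. The following is a minimal Python sketch under stated assumptions: the helper names esearch_url and efetch_url are hypothetical, the base URL is the one used in the Perl demonstration program in this talk (NCBI may have moved the service since), and the parameter names are the ones that program uses. Nothing here contacts NCBI; it only builds the URLs.

```python
from urllib.parse import urlencode

# Base URL taken from the Perl demonstration program in this talk.
# Assumption: it may no longer be the current address of the service.
EUTILS_BASE = "http://www.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=1):
    """Build an ESearch URL that keeps its results in the user's environment."""
    params = {"db": db, "term": term, "retmax": retmax, "usehistory": "y"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, query_key, web_env, rettype="abstract",
               retstart=0, retmax=3):
    """Build an EFetch URL that reads records from a stored ESearch result."""
    params = {
        "db": db,
        "query_key": query_key,   # QueryKey value returned by ESearch
        "WebEnv": web_env,        # WebEnv value returned by ESearch
        "rettype": rettype,
        "retmode": "text",
        "retstart": retstart,
        "retmax": retmax,
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


print(esearch_url("pubmed", "zanzibar"))
# Page through a hypothetical 7-record result, 3 records at a time.
# "EXAMPLE_WEBENV" is a placeholder, not a real WebEnv string.
for retstart in range(0, 7, 3):
    print(efetch_url("pubmed", 1, "EXAMPLE_WEBENV", retstart=retstart))
```

Using urlencode rather than pasting the term into the URL means queries containing spaces or brackets, such as Human[Organism] apoptosis, are escaped correctly.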
NCBI - Entrez Programming Utilities
Let's try it! Go to
http://www.ncbi.nlm.nih.gov/Class/wheeler/eutils/eu.html
and play
Now try
http://www.ncbi.nlm.nih.gov/Class/wheeler/eutils/epipe.html
NCBI - Entrez Programming Utilities
These sorts of utilities can be accessed programmatically
using Perl.
See “Demonstration Programs” at
http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
NCBI - Entrez Programming Utilities
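use strict;
use warnings;
use LWP::Simple;    # provides get()

my $utils = "http://www.ncbi.nlm.nih.gov/entrez/eutils";
my $db = ask_user("Database", "Pubmed");
my $query = ask_user("Query", "zanzibar");
my $report = ask_user("Report", "abstract");

# ESearch: run the query and keep the results on the history server
my $esearch = "$utils/esearch.fcgi?" .
    "db=$db&retmax=1&usehistory=y&term=";
my $esearch_result = get($esearch . $query);
defined $esearch_result or die "ESearch request failed\n";
print "\nESEARCH RESULT: $esearch_result\n";
$esearch_result =~ m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s
    or die "Could not parse ESearch result\n";
my $Count = $1;
my $QueryKey = $2;
my $WebEnv = $3;
print "Count = $Count; QueryKey = $QueryKey; WebEnv = $WebEnv\n";

# EFetch: retrieve the stored results $retmax records at a time
my $retstart;
my $retmax = 3;
for ($retstart = 0; $retstart < $Count; $retstart += $retmax) {
    my $efetch = "$utils/efetch.fcgi?rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&" .
        "db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
    print "\nEF_QUERY=$efetch\n";
    my $efetch_result = get($efetch);
    print "---------\nEFETCH RESULT(" .
        ($retstart + 1) . ".." . ($retstart + $retmax) . "): " .
        "[$efetch_result]\n-----PRESS ENTER!!!-------\n";
    <STDIN>;    # wait for Enter before fetching the next batch
}

# Prompt with a default value; return the default if the user just presses Enter
sub ask_user {
    my ($prompt, $default) = @_;
    print "$prompt [$default]: ";
    my $answer = <STDIN>;
    chomp $answer;
    return $answer eq "" ? $default : $answer;
}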
Bioinformatics Advice
• Be aware of bias in databases!
– Search GenBank (nucleotide) for
Human[Organism] apoptosis.
How many hits?
– Now try Orcinus[Organism] apoptosis
How many hits?
– Can you conclude that Orcinus does not
have apoptosis?
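One way to make the bias visible is to compare the topic hits with the total number of records held for each organism, rather than looking at the raw hit count alone. The following is a minimal Python sketch, assuming the ESearch reply carries the hit count in a Count element (as the Perl demonstration program in this talk expects). The helper names are hypothetical and the replies are invented placeholders, not real GenBank numbers.

```python
import re


def parse_count(esearch_xml):
    """Return the first <Count> value in an ESearch XML reply, or None."""
    match = re.search(r"<Count>(\d+)</Count>", esearch_xml)
    return int(match.group(1)) if match else None


def hit_fraction(topic_hits, organism_total):
    """Share of an organism's records that match the topic.

    Returns None when the organism has no records at all, because
    absence of data says nothing about the biology.
    """
    if not organism_total:
        return None
    return topic_hits / organism_total


# Placeholder replies with invented numbers (NOT real hit counts):
# (reply for "<organism> apoptosis", reply for "<organism>" alone)
replies = {
    "well-studied organism": ("<Count>500</Count>", "<Count>1000000</Count>"),
    "rarely sequenced organism": ("<Count>0</Count>", "<Count>40</Count>"),
}

for name, (topic_xml, total_xml) in replies.items():
    hits = parse_count(topic_xml)
    total = parse_count(total_xml)
    print(name, hits, total, hit_fraction(hits, total))
```

Zero hits out of a few dozen records is weak evidence at best, which is the point of the Orcinus exercise.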
Bioinformatics Advice
• Bioinformatics is changing and advancing very
rapidly.
– Don’t forget to notice what is new.
• NCBI now has ~20 different databases. They had only two
3-5 years ago.
– If you want to do something that you know can’t be done,
check again in two weeks!
• My standard computer can process the entire human genome
for restriction sites, ORFs, etc. in a few hours. Not long ago, the
best computers couldn’t even hold that much data!
– If old tools work, don’t feel you need to use the newest tools.
• I still do much of my analysis with Microsoft Word…
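Scanning a sequence for restriction sites, as in the genome-processing point above, is at heart a simple string search. The following is a minimal Python sketch, not the tool used at Promega: the function name find_sites and the toy sequence are assumptions for illustration. EcoRI (GAATTC) is used because its recognition sequence is palindromic, so searching one strand is enough.

```python
def find_sites(sequence, site):
    """Return the 0-based start of every match, including overlapping ones."""
    sequence, site = sequence.upper(), site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        # Step one base forward so overlapping matches are not missed.
        start = sequence.find(site, start + 1)
    return positions


ECORI = "GAATTC"  # EcoRI recognition sequence

# Toy sequence invented for the example.
toy_sequence = "ttGAATTCaaGAATTCgg"
print(find_sites(toy_sequence, ECORI))  # prints [2, 10]
```

A genome-scale run would read each chromosome from a file and would also search the reverse complement for non-palindromic sites; both are omitted here.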