Computational Resources for Teaching Bioinformatics

Jodi Schwarz, Department of Biology, Vassar College
Marc Smith, Department of Computer Science, Vassar College
Cristian Opazo, Academic Computing Services, CIS, Vassar College
Teaching Big Science at Small Colleges: a Genomics Collaboration, Workshop 2007
Biology as an information science
• Biology is increasingly high-throughput and interdisciplinary
• Why? The concurrent developments of
– technologies for high-throughput studies
– computational approaches/power
– concepts of systems and informatics
• Result: The structure of biological research is changing
1. Collaboration is essential
2. Computation is essential
• How do we biology faculty train ourselves and our students?
– work with people trained in different fields
– develop a quantitative and high-throughput perspective
– remain focused on the biology
Genomics/Bioinformatics in Vassar Biology
Biol 106: Introductory Biology:
Students are given two mutant strains of C. elegans and are told that each contains a mutation in
a different gene (unc-54 or unc-119)
1. Microscopy to characterize phenotype in mutant and wt
2. Fluorescence microscopy to localize gene expression in wt
3. NCBI Mapviewer to find gene sequence and literature
4. BLAST to identify potential homologs
5. SequenceExtractor tool to predict length of wt PCR product
6. PCR amplification of the genes from mutant and wild type
7. Interpretation of data:
– identify which gene contains the mutation in each strain
– discuss how the mutation might confer the observed phenotype
GFP and gel images: Kate Susman
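The product-length prediction step can be illustrated with a minimal Python sketch. This is not how the SequenceExtractor tool works internally; it only assumes exact primer matches on a linear template, and the function names and sequences are made up for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def predicted_product_length(template: str, forward: str, reverse: str):
    """Length of the amplicon bounded by the two primers.

    Returns None if either primer has no exact match on the template.
    """
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so look for its
    # reverse complement downstream of the forward primer site.
    rev_site = reverse_complement(reverse)
    rev_start = template.find(rev_site, start + len(forward))
    if rev_start == -1:
        return None
    return rev_start + len(rev_site) - start


# Hypothetical primers and template, built so both primer sites are present.
forward = "ATGGCTAGC"
reverse = "AAGCTTGCTAA"
template = "GGATCC" + forward + "TAGGATTACCG" + reverse_complement(reverse) + "GG"

print(predicted_product_length(template, forward, reverse))
```

Comparing the predicted wild-type length with the band observed on the gel is what lets students spot a strain whose product differs in size.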
Development of upper division courses
• HHMI-supported bioinformatics faculty position
• Hired a biologist who
– uses bioinformatic tools
– has worked with computational biologists
– is not a programmer
• My goals for students
1. Learn, explore, and be excited by biology
2. Get beyond the point-and-click mentality of bioinformatic tools
– Understanding computer science approaches: what is an algorithm? what are scoring matrices?
– Assessing the quality of the output
– Finding and evaluating bioinformatic tools
3. Inquiry-based learning: conduct original research
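To make the "what are scoring matrices?" question concrete, here is a minimal Python sketch of a toy nucleotide scoring matrix and how it scores an ungapped alignment. The score values are arbitrary assumptions chosen for illustration; they are not taken from any published matrix such as BLOSUM or PAM.

```python
# Toy scoring scheme (illustrative values only):
# identical bases +2, transitions (A<->G, C<->T) -1, transversions -2.
PURINES = {"A", "G"}


def substitution_score(a: str, b: str) -> int:
    """Score one aligned pair of bases under the toy scheme."""
    if a == b:
        return 2
    same_class = (a in PURINES) == (b in PURINES)
    return -1 if same_class else -2


# The "matrix" is a lookup table from a pair of bases to a score.
SCORING_MATRIX = {(a, b): substitution_score(a, b) for a in "ACGT" for b in "ACGT"}


def score_ungapped(seq1: str, seq2: str) -> int:
    """Sum the matrix scores over each aligned column (no gaps allowed)."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(SCORING_MATRIX[(a, b)] for a, b in zip(seq1.upper(), seq2.upper()))


print(score_ungapped("GATTACA", "GACTATA"))
```

Changing the values in the table changes which alignments look "good", which is why the choice of matrix matters when judging tool output.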
Example 1: 300-level Molecular Biology
Goal: learn molecular biology using inquiry-based genomics approach
In-class time: two hours per week (26 hours total)
System: Aiptasia pallida, a developing model system for studying coral symbiosis
– Animal host and microbial symbiont
– Genomic resources: hot-off-the-press ESTs, assembled into about 1000 contigs
Specific Goals:
Learn and explore biology
– Become familiar with the significance of the symbiosis
– Study the biology of symbiosis from a molecular and genomics perspective
Cultivate molecular biology skills
– Bench skills
– Interpretation and assessment of sequencing reads (trimming vector, etc.)
– Identification of functional regions of mRNAs from EST sequence
Cultivate bioinformatic skills
– Genomic level: large-scale annotation of the EST dataset
– Annotation of genes: symbiosis or metazoan evolution
Concurrent project at UC-Merced JGI Genomics course
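Vector trimming of a sequencing read can be sketched in a few lines of Python. This is a simplified illustration that assumes exact matches to short vector flanks; the flank sequences and function name are hypothetical, and dedicated trimming tools also tolerate sequencing errors, which this sketch does not.

```python
def trim_vector(read: str, left_flank: str, right_flank: str) -> str:
    """Return the insert lying between exact matches of two vector flanks.

    If a flank is not found, that end of the read is left untrimmed.
    """
    read = read.upper()
    start = 0
    left_pos = read.find(left_flank.upper())
    if left_pos != -1:
        start = left_pos + len(left_flank)
    end = len(read)
    right_pos = read.find(right_flank.upper(), start)
    if right_pos != -1:
        end = right_pos
    return read[start:end]


# Hypothetical read: vector sequence on both sides of a short insert.
read = "NNGAATTC" + "ATGGCCAAGT" + "CTCGAG" + "TT"
print(trim_vector(read, "GAATTC", "CTCGAG"))
```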
Molecular Biology research project
1. Microscopy
2. QC of the cDNA library
• plate, pick clones, plasmid isolation, restriction digest
• determine average insert size
• determine the redundancy of the library
3. Large scale bioinformatic analysis
blastall against Swiss-Prot
• fraction with blast hits
• most highly expressed genes
KEGG annotations
• larger order biological processes
4. Characterize a target gene
• which gene to choose and why?
• which analyses to perform?
– Pfam to identify conserved domains
– conservation: ClustalW alignments
– evolutionary: phylogenetic trees
– structure prediction
• how to interpret the results?
• how to prepare figures for manuscript?
[Figure: Number of Contigs (y-axis, 0 to 14) plotted against Number of Reads (x-axis, 5 to 27+)]
[Figure categories: Unclassified; Metabolism; Human Disease; Nervous/Sensory System; Growth/Development; Structure/Motility; Defense; Signalling and Communication; Genetic Information Processing]
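The library-redundancy check and the contigs-versus-reads plot can be approximated with a short Python sketch. It assumes a hypothetical mapping from read ID to contig ID (for example, parsed from an assembler's output), and it defines redundancy as the fraction of reads that did not add a new contig; that definition is an assumption for illustration, not necessarily the one used in the course.

```python
from collections import Counter


def contig_size_histogram(read_to_contig: dict) -> dict:
    """Map contig size (number of reads) to the number of contigs of that size."""
    reads_per_contig = Counter(read_to_contig.values())
    return dict(Counter(reads_per_contig.values()))


def redundancy(read_to_contig: dict) -> float:
    """Fraction of reads that fall into an already-seen contig: 1 - contigs/reads."""
    n_reads = len(read_to_contig)
    if n_reads == 0:
        return 0.0
    n_contigs = len(set(read_to_contig.values()))
    return 1 - n_contigs / n_reads


# Made-up assignments: six reads assembled into three contigs.
assignments = {
    "read1": "contigA", "read2": "contigA", "read3": "contigA",
    "read4": "contigB",
    "read5": "contigC", "read6": "contigC",
}
print(contig_size_histogram(assignments))
print(redundancy(assignments))
```

Contigs built from many reads are also a rough proxy for the most highly expressed genes in an EST library.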
Problems with lab component:
• not enough time for both wet lab and bioinformatics components
– students shortchanged on the depth of knowledge
• too many diverse platforms
– JGI computers
– via remote server (web access)
– PC-only and Mac-only applications
• no central location for storage of databases and results
Successes:
• students learned molecular biology from multiple perspectives
• students learned how to apply pre-existing bioinformatic tools to study biological questions
• students could identify an interesting question and pursue it
– some focused on symbiosis
– some focused on evolution
– some focused on structure
• students grappled with research
• set the stage for the next group of students to do functional studies
Bring into 200-level for more lab time to do functional work
Example 2: Bioinformatics
No wet lab component
Greater level of student-driven questions
1. Explore genome-level biological systems and questions
– structural genomics (genome sequencing, annotation, architecture)
– evolutionary genomics (how do genomes evolve?)
– environmental genomics (metagenomics – what lives out there?)
– biomedical genomics (use of microarrays and SNP analysis in studying disease)
2. Learn approaches and tools in more detail through a series of workshops and small assignments
3. Given the questions/approaches, design and conduct an original research project
– individual meetings to design projects
Bioinformatics research project
Stages and student challenges
1. Develop a question
– evolution of drug resistance in malaria
– early animal evolution
– uncover “novel” homologs of a particular gene in diverse organisms
Student challenge: know the biology
2. Identify a source of sequence data
– ApiDB protein sequences
– assembled EST reads from a genome center
– nr, Wormbase, Flybase, and other organism-specific databases
Student challenge: what sequences are appropriate?
3. Tools for analysis
Student challenges:
– how to manipulate sequence data?
– how to find and understand tools?
– what does this output mean?
– what other analyses should I do?
4. Write/present manuscript
Student challenges:
– what do my results suggest?
– how do I present the results?
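As one answer to "how to manipulate sequence data?", a minimal Python sketch for reading FASTA-formatted text and computing a simple per-sequence statistic follows. The function names and the example records are hypothetical; libraries such as Biopython offer more complete parsers.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into {sequence_id: sequence}.

    The ID is the first whitespace-delimited word after '>'.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Made-up records for illustration.
example = ">seq1 hypothetical EST\nACGT\nAC\n>seq2\nGGCC\n"
for name, seq in parse_fasta(example).items():
    print(name, len(seq), gc_content(seq))
```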
Assessment and Direction
What were the limitations?
1. Logistical: No single location/computer system for:
– diversity of tools and operating systems
– storage of datasets/results
– flexibility: point and click to command line
2. Expertise: my CS limitations as a biologist
Where do we need to go next?
1. Computational platform for simple to sophisticated use
2. Instructional collaboration between biologists and computer
scientists
Spring 2008: BIOL/CMPU 353: Bioinformatics
Bioinformatics Course
Co-taught by Bio/CS Depts.
• Bio majors register under BIOL prefix
• CS majors register under CMPU prefix
• Different pre-reqs for Bio/CS students
• Students work in pairs:
– one from each major; must work together
– learn to speak each other’s language
• First half of course builds fundamentals
• Second half of course: students engage in a bioinformatics research project
First half of course
• Introduce CS topics to Biology students
• Introduce Bio topics to CS students
• Computational labs (teams of two)
– CS students become more familiar with biology problem
domains (e.g. sequence alignment, …)
– Bio students become familiar with information modeling,
algorithm design
– CS students explain problem-solving process;
Bio students explain biological processes
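A computational lab on sequence alignment might start from something like the following Python sketch of a global alignment score computed by dynamic programming (the Needleman-Wunsch recurrence). The match, mismatch, and gap values are arbitrary assumptions, and the function returns only the optimal score, not the alignment itself.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

Pairs can use a small example like this to trace the table by hand: the CS student explains the recurrence, and the Bio student explains what a gap or mismatch means biologically.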
Biology Majors
• Computational fundamentals
– Data abstraction
– Control structures
– Algorithm design and problem-solving
• Goals
– Participate in algorithm design
– Read/understand code
– Use/compose existing computational tools
– Not to turn biologists into programmers
Computer Science Majors
• Biological fundamentals
– evolution
– molecular biology: structures and processes
– informatics: e.g., sequence alignment
• Goals
– understand statement of biological problems (e.g., “predict
the open reading frame of this sequence”)
– translate biological structures into data structures
– work with biologists to design algorithms
– not to turn computer scientists into biologists
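The example problem statement, "predict the open reading frame of this sequence", can be turned into a small Python sketch. It makes several simplifying assumptions that a real predictor would not: only the forward strand is searched, an ORF must begin with ATG and end with a stop codon, and the standard stop codons are used. The function name and test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str):
    """Return (start, end) of the longest ATG...stop ORF on the forward strand.

    Coordinates are 0-based with an exclusive end, so seq[start:end] is the
    ORF including its stop codon. Returns None if no complete ORF exists.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None                   # look for the next ORF
    return best


# Made-up sequence containing one short ORF.
sequence = "CCATGGCTGCATAAGG"
orf = longest_orf(sequence)
print(orf, sequence[orf[0]:orf[1]])
```

Here the biological structure (a reading frame made of codons) becomes a data structure (a pair of string indices), which is the kind of translation the goals above describe. EST-derived sequences are often partial, so a course version would likely need to handle ORFs that lack a start or stop codon.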
Both Bio/CS majors
• Experimental computer science
– Research subject to the scientific method
– Designing, implementing, and conducting computational
experiments (nondeterminism)
– Devising heuristics for computationally infeasible problems
• Bioinformatics is interdisciplinary research
– Biologists are not computer scientists
– Computer scientists are not biologists
– Both need each other
Second half of course
• work in teams of two
• select a research problem
• literature search on related work
– questions posed / open
– techniques applied to finding answers
• devise computational experiments
– obtain datasets, implement algorithms, …
– employ scientific method
– present results
bioinf.cs.vassar.edu
• Resource for different levels of students
– via browser (BioTeam’s iNquiry tool suite)
– via secure remote login (ssh / command line tools)
• Turnkey Linux-based computing cluster by
Rocketcalc, LLC
– Delivered May, set up over the summer
– Specs: 1 chassis, 4 nodes, 16 CPUs, 1GB switch, 16GB RAM,
3TB disk capacity
– Cost (HW+SW): ~$35K
• Support infrastructure is essential!
Supporting a cross-constituency
academic endeavor
• Cross-discipline, inter-departmental endeavors require a
higher, wider level of technical support and organization
• Needs range from the strictly technical (acquisition,
deployment, testing and maintenance of computing
hardware and software) to the more academic (curricular
development, training of faculty and students,
assessment)
• Consequence: a dedicated, knowledgeable support team
of full-time professionals has to be considered from the
early stages of any bioinformatics project in higher
education
An ideal model
PROJECT MANAGEMENT
FUNDING & SPONSORSHIP
RESEARCH AND TEACHING FACULTY
SYSTEMS ADMINISTRATION
FACULTY AND STUDENT TRAINING
INTER-INSTITUTIONAL COLLABORATION
Current support structure at Vassar
• An Academic Computing unit (ACS) exists within the
college’s main IT division (Computing and Information Services,
CIS); its main goal is to provide support and expertise on
faculty projects (curricular and research) that include an
important technology component
• The ACS sciences consultant provides computing expertise,
conducting workshops and training sessions on the specific
software tools, and acts as liaison between the various
departments (academic and administrative) involved in the
overall effort
• Back-end hardware support is provided (in this preliminary
phase) by a systems administrator in the Computer Science
department. In the future, greater institutional involvement is
expected (at the CIS level)
A main goal: establishing collaborations
“Contemporary life sciences is increasingly an interdisciplinary
effort, as evidenced by the emergence of academic research and
educational centers in which faculty teams from across the
natural and physical sciences are brought together to create
synergistic investigative and scholarly groups (…)”
“(…) A critical consideration in the design and management of
these centers, from their architecture to their participating
faculty and staff, is identification of means to foster
collaboration and information exchange”
"Bioinformatics: New Technology Models for Research, Education, and Service"
Gary Allen, Executive Director, University of Missouri Bioinformatics Consortium
The EDUCAUSE Center for Applied Research (ECAR) Research Bulletin, Vol 2004, Issue 8, April 13, 2004
http://connect.educause.edu/library/abstract/BioinformaticsNewTec/40090
An example:
Biomedical Informatics Research Network (BIRN)
http://www.nbirn.net/
• An initiative sponsored by the National Institutes of Health and
the National Center for Research Resources that fosters large-scale
biomedical science collaborations
• The BIRN is a geographically distributed virtual community of
shared resources (hardware, software applications and
databases) within which biomedical scientists and clinical
researchers make discoveries by enhancing communication and
collaboration across research disciplines
• The BIRN uses emerging cyberinfrastructure (high-speed
networks, distributed high-performance computing, and data
integration capabilities) to support a consortial effort among 12
universities and 16 research groups engaged in investigation of
human neurological disease and associated animal models
BIRN: a wide collaboration
Data derived from individual subgroups are being used to drive the definition,
construction and daily use of a federated data system, collected and stored
across geographically separated sites but presented as a unified data archive
that can be securely accessed across institutional boundaries
Final Considerations
• What life sciences initiatives currently exist on campus?
• To what degree does the institution consider academic
endeavors in genomics / bioinformatics a strategic
priority?
• Which elements of the existing IT infrastructure on
campus are well positioned to support our initiatives?
Which elements must be improved?