Genomics
for Librarians
Stuart M. Brown, Ph.D.
Director, Research Computing, NYU School of
Medicine
A Genome Revolution in Biology
and Medicine
We are in the midst of a "Golden Era" of
biology
The Human Genome Project has produced a
huge storehouse of data that will be used to
change every aspect of biological research
and medicine
The revolution is about treating biology as an
information science, not about specific
biochemical technologies.
The Human Genome Project
The job of the biologist is changing
As more biological information
becomes available and
laboratory equipment becomes
more automated ...
– The biologist will spend more time using computers
& on experimental design and data analysis
(and less time doing tedious lab biochemistry)
– Biology will become a more quantitative science
(think how the periodic table affected chemistry)
A review of some basic genetics
DNA
4 bases (G, C, T, A)
base pairs
G--C
T--A
genes
non-coding regions
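The pairing rules above (G with C, T with A) are enough to derive one DNA strand from the other. A minimal Python sketch follows; the function name and the example sequence are illustrative assumptions and do not come from the slides:

```python
# Pairing rules from the slide: G pairs with C, T pairs with A.
PAIRS = {"G": "C", "C": "G", "T": "A", "A": "T"}

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return "".join(PAIRS[base] for base in reversed(seq.upper()))

print(reverse_complement("GATTACA"))  # prints TGTAATC
```

Applying the function twice returns the original sequence, which is a quick sanity check on the pairing table.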
Decoding Genes
What is Bioinformatics?
• The use of information technology to collect,
analyze, and interpret biological data.
• An ad hoc collection of computing tools that are
used by molecular biologists to manage research
data.
– Computational algorithms
– Database schema
– Statistical methods
– Data visualization tools
Genomics
What is Genomics?
– An operational definition:
• The application of high throughput automated
technologies to molecular biology.
– A philosophical definition:
• A holistic or systems approach to the study of
information flow within a cell.
Genomics makes LOTS of data!
Investigators need complex databases just to
manage their own experiments
Biologists need to know how to do data mining
to answer even simple questions in these huge
data sets
Librarians understand the challenges of storage
and searching of large amounts of data
New Biology => New Librarians?
How do Genomics and Bioinformatics overlap or
interact with Library Science?
1. The NCBI (Natl. Center for Biotechnology
Information), the home of GenBank, is part of
the National Library of Medicine
2. We store and organize genes like journal articles: accession number, annotation, etc.
3. A big part of bioinformatics involves keyword
searches and SQL queries in relational databases
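A keyword search over gene records in a relational database might look like the sketch below. It uses Python's built-in sqlite3 module; the table layout, accession numbers, and annotations are all made up for illustration and do not reflect the GenBank schema:

```python
import sqlite3

# In-memory database; the table layout and every record are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE genes (accession TEXT PRIMARY KEY, organism TEXT, annotation TEXT)"
)
conn.executemany(
    "INSERT INTO genes VALUES (?, ?, ?)",
    [
        ("DEMO0001", "Homo sapiens", "hemoglobin beta chain"),
        ("DEMO0002", "Bos taurus", "hemoglobin alpha chain"),
        ("DEMO0003", "Homo sapiens", "insulin precursor"),
    ],
)

def keyword_search(keyword: str) -> list[str]:
    """Return accession numbers whose annotation contains the keyword."""
    rows = conn.execute(
        "SELECT accession FROM genes WHERE annotation LIKE ? ORDER BY accession",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]

print(keyword_search("hemoglobin"))  # prints ['DEMO0001', 'DEMO0002']
```

The accession number plays the same role as a call number: a stable key for retrieving one record, while the annotation is the searchable description.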
Bioinformatics is Not
Library Science
We are NOT cataloging a set of known
information
Programming and complex algorithms - pattern
matching, string matching, biostatistics
Data mining and multi-dimensional
visualization tools
Uncertainty of the data and constant revision of
the “known”
– Genes are guesses based on complex algorithms,
not books on the shelf
Raw Genome Data:
BLAST Similarity Search
>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369
Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus
Query: 17 aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
|||||||||||||||| | ||| | ||| || ||| | |||| ||||| |||||||||
Sbjct: 1 aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59
Query: 77 agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
|||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60 agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119
Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
|||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179
Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239
Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
|| || ||||| || ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
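The "Identities" figure in a report like the one above is a count of alignment columns where the two sequences agree. The sketch below recomputes that count for one 60-base block of the alignment (Query 197-256 against Sbjct 180-239). The function name is an assumption for illustration, and this is a simple column count, not the BLAST program itself:

```python
def count_identities(query: str, subject: str) -> tuple[int, int]:
    """Count matching columns in two aligned sequences of equal length.

    A gap is written as '-' and never counts as a match.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(q == s and q != "-" for q, s in zip(query, subject))
    return matches, len(query)

# Query 197-256 and Sbjct 180-239, copied from the alignment above.
query = "atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct"
sbjct = "atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct"

matches, length = count_identities(query, sbjct)
print(f"Identities = {matches}/{length} ({100 * matches // length}%)")
# prints Identities = 56/60 (93%)
```

The four mismatches correspond to the four blank positions in the line of `|` characters between the two sequences.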
Multiple Alignment
Protein domains
(Pattern analysis)
Clustering (Phylogenetics)
UCSC
The Challenge of New Data
Types (Genomics)
• Gene expression microarrays
– thousands of genes, imprecise measurements
– huge images, private file formats
• Proteomics
– high-throughput Mass Spec
– protein chips: protein-protein interactions
• Genotyping
– thousands of alleles, thousands of individuals
• Regulatory Networks
Biological Information
Microarray Technology
Spot your own Chip
(plans available for free from Pat Brown’s website)
Robot spotter
Ordinary glass
microscope slide
cDNA spotted microarrays
Goal of Microarray experiments
Microarrays are a very good way of
identifying a bunch of genes involved in a
disease process
– Differences between cancer and normal tissue
– Tuberculosis infected vs resistant lung cells
Mapping out a pathway
– Co-regulated genes
Finding function for unknown genes
– Involved in these processes
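One common first pass at finding genes that differ between two samples is to compare spot intensities as a log2 ratio and keep the genes that change by at least a factor of two. The sketch below is a toy illustration only: the gene names, intensity values, and threshold are hypothetical, and a real analysis would add normalization, replicates, and statistical testing because the measurements are imprecise.

```python
import math

# Hypothetical spot intensities (arbitrary units); gene names are placeholders.
intensities = {
    "geneA": (200.0, 1600.0),   # (normal tissue, tumor tissue)
    "geneB": (900.0, 1000.0),
    "geneC": (1200.0, 150.0),
}

def changed_genes(data, threshold=1.0):
    """Return {gene: log2 ratio} where |log2(tumor / normal)| >= threshold."""
    result = {}
    for gene, (normal, tumor) in data.items():
        ratio = math.log2(tumor / normal)
        if abs(ratio) >= threshold:
            result[gene] = ratio
    return result

for gene, ratio in changed_genes(intensities).items():
    print(f"{gene}: log2 ratio {ratio:+.1f}")
# prints:
# geneA: log2 ratio +3.0
# geneC: log2 ratio -3.0
```

A threshold of 1.0 on the log2 scale means a two-fold change in either direction; geneB changes by about 11% and is filtered out.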
Proteomics
Identify all of the proteins in an organism
– Potentially many more than genes due to
alternative splicing and post-translational
modifications
Quantitate in different cell types and in
response to metabolic/environmental factors
Protein-protein interactions
Yeast Proteome
Jeong H, Mason SP, Barabási A-L, Oltvai ZN
Nature 411 (2001) 41-42
Human Genetic Variation
Every human has essentially the same set of genes
But there are different forms of each gene -- known as alleles
– blue vs. brown eyes
– genetic diseases such as cystic fibrosis or Huntington’s disease are
caused by dysfunctional alleles
Alleles are created by mutations in the
DNA sequence of one person - which
are passed on to their descendants
High-Throughput Genotyping
Relate genes to Organisms
Diseases
– OMIM: Human Genetic Disease
Metabolic and regulatory pathways
– KEGG
– Cancer Genome Project
Human Alleles
The OMIM (Online Mendelian Inheritance
in Man) database at the NCBI tracks all human
mutations with known phenotypes.
It contains a total of about 2,000 genetic
diseases [and another ~11,000 genetic loci with
known phenotypes - but not necessarily known gene
sequences]
It is designed for use by physicians:
– can search by disease name
– contains summaries from clinical studies
Training "computer savvy"
scientists
Know the right tool for the job
Get the job done with tools available
Network connection is the lifeline of the
scientist
Jobs change, computers change, projects
change, scientists need to be adaptable
Why teach genomics in undergraduate
(or Medical) education?
Demand for trained graduates from the biomedical
industry
Bioinformatics is essential to understand current
developments in all fields of biology
We need to educate an entire new generation of
scientists, health care workers, etc.
Use bioinformatics to enhance the teaching of other
subjects: genetics, evolution, biochemistry
Genomics in Medical Education
“The explosion of information about the new
genetics will create a huge problem in
health education. Most physicians in
practice have had not a single hour of
education in genetics and are going to be
severely challenged to pick up this new
technology and run with it."
Francis Collins
Long Term Implications
A "periodic table for biology" will lead to an
explosion of research and discoveries - we will
finally have the tools to start making systematic
analyses of biological processes (quantitative
biology).
Understanding the genome will lead to the
ability to change it - to modify the characteristics
of organisms and people in a wide variety of
ways
Bioinformatics: A Biologist's Guide to Biocomputing and the Internet
Essentials of Medical Genomics
Stuart M. Brown, Ph.D.
www.med.nyu/rcr
www.GenomicsHelp.com