
BIOLOGY 3020 Fall 2008
Gene Hunting
(DNA database searching)
DNA and p53 Transcription Factor
How many transcription
factors (TFs) in Corn?
Lecture Outline
1. What is a gene?
2. DNA Databases today –
GenBank
3. How to find a new gene in
the GenBank
4. How to know that you
have a full length
(complete) gene
5. Storing your work
1: What is a gene?
A gene is a unit of genetic information
Genes are made of DNA (found in cell
nucleus)
One gene encodes one protein
(polypeptide) (made in cell cytoplasm)
A messenger RNA (mRNA) mediates the
expression of a gene (via ribosome)
An organism is encoded by numerous
genes (about 26,000 for humans)
Central Dogma of
Molecular Genetics
DNA – all genes are
present in every cell
Only some genes are
expressed in a given cell
mRNA population
represents those genes
expressed in a given cell
(tissue specific gene
expression)
How a gene is expressed
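[Diagram: the DNA of a gene contains Exon 1, Intron I, Exon 2, Intron II and Exon 3 and is transcribed into mRNA.
Intron splicing and polyA tailing give the mature mRNA, which runs from Start to Stop and ends in a polyA tail (AAAAAAA).
The stretch from Start to Stop is the Open Reading Frame (ORF).
Translation on the ribosome produces the protein.]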
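The expression steps on this slide can be sketched in a few lines of Python. This is only a toy illustration: the exon and intron sequences are invented, the mRNA is written with T instead of U for simplicity, and the helper names (splice, add_polya_tail, find_orf) are hypothetical, not part of any course software.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Invented toy gene: exons in upper case, introns in lower case.
parts = [
    ("exon", "GCACCATGGCT"),
    ("intron", "gtaagtttcag"),
    ("exon", "TGGAAA"),
    ("intron", "gtatgaacag"),
    ("exon", "GATTAAGTC"),
]

def splice(parts):
    """Join the exons in order and drop the introns."""
    return "".join(seq for kind, seq in parts if kind == "exon")

def add_polya_tail(transcript, length=7):
    """Append a polyA tail to the spliced transcript."""
    return transcript + "A" * length

def find_orf(mrna):
    """Return the sequence from the first ATG to the first in-frame stop codon."""
    start = mrna.find("ATG")
    if start == -1:
        return ""
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[start:i + 3]
    return ""

mature_mrna = add_polya_tail(splice(parts))
print(mature_mrna)
print(find_orf(mature_mrna))
```

Taking the first ATG as the start is a simplification made for this sketch.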
Where can you find a gene ?
Book collections can be stored in a library
Collections of genes can be made and
stored in gene libraries !
There are 2 main kinds of gene libraries
Genomic libraries are made from DNA and
contain entire genes (exons and introns).
cDNA libraries are made from mRNAs
that are converted into DNA (only exons)
cDNA libraries are very useful
A library of genes expressed in a given
tissue type is a cDNA library
To study a tissue (e.g. liver or brain),
use a cDNA library: it contains the
genes used to make that tissue
cDNA libraries are made from mRNA
which is converted into DNA.
One cDNA clone from a cDNA library
contains the coding information for that
gene (with introns removed)
cDNA is made from mRNA
[Diagram: the mature mRNA runs from Start to Stop and ends in a polyA tail (AAAAAAA).
Add polyT primer (TTTTTTT), nucleotides, and Reverse Transcriptase to make a DNA/RNA hybrid.
The RNA is removed (by NaOH) and the second strand is synthesized, giving Complementary DNA (cDNA).]
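As a rough sketch of the base pairing behind these steps, the first strand is the reverse complement of the mRNA (so it begins with the polyT primer), and the second strand is the reverse complement of the first. The mRNA below is invented and the function names are hypothetical.

```python
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def first_strand(mrna):
    """First-strand cDNA, written 5' to 3' (reverse complement of the mRNA)."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna))

def second_strand(first):
    """Second-strand cDNA, written 5' to 3' (reverse complement of the first strand)."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(first))

mrna = "AUGGCUUGGUAA" + "A" * 7   # toy ORF plus polyA tail
first = first_strand(mrna)        # begins with TTTTTTT, the polyT primer region
second = second_strand(first)     # same sequence as the mRNA, with T in place of U

print(first)
print(second)
```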
A full length cDNA is hard to find
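[Diagram: an mRNA with Start, Open Reading Frame (ORF), Stop and polyA tail (AAAAAAA).
mRNA is degraded from the 5’ end, so the shortened molecules still carry the polyA tail but lose sequence at the Start end.]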
Most cDNAs are not
full length (flcDNA)
and the ORF is
incomplete (partial)
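A small sketch of why 5’ degradation gives partial cDNAs: the polyA end survives, but once more bases are lost than the length of the 5’ untranslated region, the start codon is gone. The transcript and the function names are invented for illustration.

```python
UTR5 = "GGCACC"          # toy 5' untranslated region
ORF = "ATGGCTTGGTAA"     # toy ORF
UTR3 = "GCC"             # toy 3' untranslated region
transcript = UTR5 + ORF + UTR3 + "A" * 7
ORF_START = len(UTR5)    # 0-based position of the A in ATG

def degrade_5prime(transcript, bases_lost):
    """Remove bases from the 5' end, as happens before cDNA synthesis."""
    return transcript[bases_lost:]

def keeps_start_codon(bases_lost, orf_start=ORF_START):
    """The ORF is still complete only if the loss stops at or before the ATG."""
    return bases_lost <= orf_start

for lost in (0, 4, 9):
    cdna = degrade_5prime(transcript, lost)
    status = "full length ORF" if keeps_start_codon(lost) else "partial ORF"
    print(lost, cdna, status)
```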
cDNA (EST) libraries have few flcDNAs
Open Reading Frame (ORF)
cDNA libraries are
made and individual
clones sequenced at
random
A sequenced cDNA is called an
Expressed Sequence Tag (EST)
Millions of ESTs from different tissues of
different organisms are stored in GenBank,
but only a few are flcDNAs!
How do we find the longest ones? Where?
2. DNA Databases today – GenBank
GenBank is housed at NCBI www.ncbi.nlm.nih.gov
The Entrez
Nucleotide database
is a collection of
sequences from
several sources,
including GenBank,
RefSeq, and PDB.
The number of bases
in these databases
continues to grow at
an exponential rate.
As of April 2006,
there are over 130
billion bases in
GenBank and RefSeq
alone !
The main information access point is in
Entrez (click on All databases)
A virtual “Jungle” of information…….
3. How to find a new gene in this jungle?
Class project to clone novel
transcription factor (TF)
genes from Corn
A good starting point is the
set of predicted TFs from
rice (whose genome has been
completed)
Visit the GRASSIUS website
GRASSIUS Website
New NSF
supported
database
www.grassius.org
GRASSIUS Outreach section
GRASSIUS Helpful Links
(On Links menu)
Maize
MAGI Maize Assembled Genomic Island [MAGI]
MaizeGDB MaizeGDB is the community database for information on Zea mays
The Maize Full Length Project
This project uses genomics tools to understand a fundamental biological process
through identification of genes expressed during maize reproduction and in somatic
tissues responding to abiotic perturbations such as heat, cold, salt, UV-B, drought,
and lack of light.
Rice
TIGR Rice Genome
The TIGR Rice Genome Annotation Database and Resource is
a National Science Foundation project and provides sequence and annotation data
for the rice genome
RiceTFDB
RiceTFDB (2.1) is a public database arising from efforts to
identify and catalogue all Oryza sativa genes involved in transcriptional control
Comparative Genomic Resources
CGGC
Comparative Grass Genomics Center (CGGC)
AGRIS
The Arabidopsis Gene Regulatory Information Server (AGRIS) is an
information resource for Arabidopsis promoter sequences, transcription factors
and their target genes
Grass Transcription Factor Database
(GRASSTFDB)
GRASSTFDB provides a comprehensive collection of
transcription factors from maize, sugarcane, sorghum
and rice.
Transcription factors, defined here specifically as
proteins containing domains that suggest sequence-specific DNA-binding activities, are classified based on
the presence of 50+ conserved domains.
Links to resources that provide information on mutants
available, map positions or putative functions for these
transcription factors are provided.
The genes that you clone and study
will be added to this database
Use Known Rice TF Gene to find
related TF in maize EST database
2516 TFs in 66 families
Example: the G2-like TF family
These TFs are known to be
important in the growth of
plants and are found in
several other species but
have not yet been studied in corn
Each TF gene has a unique Locus number
(like a bar code)
Clicking on a locus
gives more
information on that
particular gene
You want to retrieve
the sequence (at the
bottom of the page;
see next slide)
Domain architecture is info on the protein product
These links give you the
actual ORF (Coding
Sequence CDS), entire gene
or protein sequence
The actual ORF
of a gene (Coding
Sequence CDS).
The first codon
is always the start
codon ATG and the
last is one of the
three stop codons
TAA, TAG or
TGA
The length of the ORF
is a multiple of 3
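These three rules are easy to check by program. The following is a minimal sketch, assuming the sequence is a plain DNA string; the function name check_orf and the test sequences are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_orf(seq):
    """Return a list of problems found; an empty list means the ORF looks complete."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not seq.startswith("ATG"):
        problems.append("does not start with ATG")
    if seq[-3:] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    # Every codon except the last one should code for an amino acid.
    internal = [seq[i:i + 3] for i in range(0, len(seq) - 3, 3)]
    if any(codon in STOP_CODONS for codon in internal):
        problems.append("stop codon inside the reading frame")
    return problems

print(check_orf("ATGGCTTGGTAA"))   # complete toy ORF
print(check_orf("GCTTGGTAA"))      # 5' truncated: start codon missing
```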
The ORF is translated into
the protein sequence
The start codon ATG always encodes the amino
acid methionine M. The * indicates the stop
codon (no amino acid in the protein)
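Translation of an ORF can be sketched with the standard genetic code. The short ORF below is invented, and the sketch assumes the sequence contains only A, C, G and T (an N or other ambiguity code would raise a KeyError here).

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(orf):
    """Translate codon by codon; a stop codon is written as '*'."""
    orf = orf.upper()
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf) - 2, 3))

protein = translate("ATGGCTTGGAAAGATTAA")
print(protein)  # MAWKD*
```

The protein starts with M (from ATG) and the final * marks the stop codon, which adds no amino acid.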
Copy and paste this sequence into a new
protein molecule in VectorNTI
In the Protein Molecules Local Database – make
a new subbase for your protein files
Click on “New Protein
Molecule” and type in
the locus name of the
rice TF locus e.g.
Os01g08160.1
Click on the sequence and Features menu
Click on
“Edit sequence”
Paste in the
sequence
and click
“OK” twice
Using the Rice TF as a starting point……
..Let the
hunt for
the corn
TF begin!
Highlight
the
protein
sequence
and click
on Tools
…
Do a BLAST search
(like a google search) to
search the GenBank
We will use the NCBI
BLAST server
There are 5 different BLAST programs to
choose from
BLAST stands for Basic Local Alignment
Search Tool. (Like doing a Google
search)
Select the tblastn program and the est_others database,
and then submit. When the search shows “Finished”, click on the file
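tblastn compares a protein query against nucleotide sequences translated in all six reading frames. The sketch below shows only that idea: it uses an exact substring match as a stand-in for BLAST's scored local alignment, and the EST sequence, query and function names are invented.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate from the first base; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Frames 1..3 on the given strand, -1..-3 on the reverse complement."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames

def frames_containing(query_protein, dna):
    """List the frames whose translation contains the query exactly."""
    return [frame for frame, protein in six_frames(dna).items()
            if query_protein in protein]

toy_est = "C" + "ATGGCTTGGAAAGATTAA"   # one extra base shifts the ORF into frame 2
print(frames_containing("AWKD", toy_est))
```

Real BLAST tolerates mismatches and gaps and scores each alignment, so this sketch only shows why a protein query can find a nucleotide EST.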
BLAST Report has graphic and list windowpanes
(The windowpanes are labeled A, B, C, D and E)
In windowpane A is info on each “Hit” against the
database (here there are 500 hits)
The first is with a corn (Zea mays) mRNA (EST)
In windowpane C the arrows show how the query
sequence (Q:1) lines up with the highlighted hit
(H:1) (Top blue line in windowpane D)
The actual alignment of the sequence Q1 with
the 1st hit is shown in windowpane E
Note that amino acid 91 of Q1 aligns with
nucleotide 64 of H1 (= amino acid 21), so hit 1 is a
partial cDNA (NOT full length). However….
Scrolling down we find that another blue line does
overlap with the beginning of the query
Now amino acid 22 overlaps with bp 347 of the
corn EST with the GenBank accession EE188556
This one looks like it is a flcDNA
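The reasoning for the two hits can be written as simple arithmetic. This is only a rough screening heuristic, not something BLAST reports: it assumes the alignment is on the given strand and that the corn protein has roughly as many residues before the aligned region as the rice query. The function names are hypothetical.

```python
def upstream_codons(hit_start_nt):
    """Whole codons available in the EST before the aligned position (1-based)."""
    return (hit_start_nt - 1) // 3

def could_be_full_length(query_start_aa, hit_start_nt):
    """True if the EST has room upstream for the query residues not yet aligned."""
    missing_residues = query_start_aa - 1
    return upstream_codons(hit_start_nt) >= missing_residues

# First hit: query amino acid 91 aligns with nucleotide 64 of the EST.
print(upstream_codons(64), could_be_full_length(91, 64))    # 21 codons, 90 needed

# EE188556: query amino acid 22 aligns with bp 347 of the EST.
print(upstream_codons(347), could_be_full_length(22, 347))  # 115 codons, 21 needed
```

The first hit has room for only 21 amino acids upstream where about 90 are needed, so it is partial; EE188556 has more than enough upstream sequence, so it is a candidate flcDNA that still has to be checked for a start codon.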
Click on this and the GenBank file will open….
Now the new gene is in your sights!
GenBank file EE188556 seems to be a flcDNA
By highlighting the sequence and translating it in
different frames, and then comparing with the
BLAST result, it can be seen that the correct
ORF is in frame 2
Extending back from the shared region about 45
amino acids we find a Met (ATG start codon)
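Extending back to a start codon can be sketched as walking upstream one codon at a time, in the frame of the aligned region, until a stop codon or the end of the sequence is reached. The toy sequence and the function name are invented, and choosing the most upstream ATG is an assumption of this sketch; the real start codon should be confirmed against the alignment.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_upstream_start(dna, anchor):
    """anchor: 0-based index of the first base of a codon inside the aligned region.
    Returns the index of the most upstream in-frame ATG reached before a stop
    codon or the 5' end, or None if no ATG is found."""
    start = None
    i = anchor
    while i >= 0:
        codon = dna[i:i + 3]
        if codon in STOP_CODONS:
            break
        if codon == "ATG":
            start = i
        i -= 3
    return start

# Toy EST: one leading base, an in-frame stop, then the ORF.
toy_est = "C" + "TGA" + "GCC" + "ATG" + "GCT" + "TGG" + "AAA" + "GAT" + "TAA"
start = find_upstream_start(toy_est, anchor=16)   # codon AAA is in the aligned region
print(start, "frame", start % 3 + 1)
```

In this toy case the ATG sits at index 7, which corresponds to reading frame 2.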
Record the
GenBank number
EE188556
In the comments
file make sure that
the clone is
available from the
Arizona Genomics
Institute
The plate location is needed
to request the clone
Save the GenBank file into VectorNTI; you
will use it in the second part of the course
Export the file as an archive and email it to
[email protected] with your group
number and GenBank file in the subject line
e.g. Group5 EST-EE188556
In your lab report include the following in the
Results Section for this lab
1: The Rice Locus number that you started with
2: The protein sequence of the Rice gene and a
brief description of the TF family to which it
belongs
3: The GenBank Accession number of the Maize
EST that appears to be a flcDNA similar to the
rice TF
4: The Arizona Genomics Institute Plate number
Congratulations! Now you have hunted down a
new gene and you will clone this in the 2nd and
3rd part of the course