Day1_5_Bioinformatics_primer

Stem Cell Biology and Bioinformatic Tools, DBRM, Karolinska Institutet,
18-24 September 2008.
Bioinformatics Primer
Goal: Introductory skills for bioinformatics analysis.
Format: Complete the exercises, ask anything.
Alistair Chalk, Elisabet Andersson
Basic Skills – InterPro
• InterPro (www.ebi.ac.uk/interpro/)
• Exercise
– for 3 proteins important to your research area (choose 2 well
defined, 1 not well defined)
– download their protein sequence from
www.ncbi.nlm.nih.gov
– analyse them using InterPro
• what domains do they contain?
• what are the functions of these domains?
• what families do the proteins belong to?
– how would you do this on 100 proteins, or 20,000 proteins?
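For the last question, one possible starting point (an assumption here, not something stated on the slides) is to collect all sequences in a single multi-FASTA file and hand that file to a batch tool, instead of pasting sequences into the web form one at a time. The sketch below only shows how to check such a file from the command line. The file name and the three sequences are made-up placeholders.

```shell
# Hypothetical input: a multi-FASTA file with placeholder sequences.
workdir=$(mktemp -d)
cat > "$workdir/proteins.fasta" <<'EOF'
>protein_A placeholder sequence
MKTAYIAKQR
>protein_B placeholder sequence
MSDNELLQKA
>protein_C placeholder sequence
MAAGLTPEQR
EOF

# Every FASTA record starts with a '>' header line, so counting
# those lines counts the sequences in the file.
n_seqs=$(grep -c '^>' "$workdir/proteins.fasta")
echo "sequences: $n_seqs"

# List only the identifiers: first word of each header, without '>'.
ids=$(grep '^>' "$workdir/proteins.fasta" | cut -d' ' -f1 | tr -d '>')
echo "$ids"
```

The same two commands work unchanged on a file with 100 or 20,000 records, which is the point of moving from the web form to files.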
Basic Skills – gene ontologies
• Gene Ontology database
– www.geneontology.org
• Exercise
– Keep this information saved, as you will use it in the following days
– 1) Define
• Molecular function
• Biological process
• Cellular component (subcellular location)
– 2) Find GO identifiers that describe functions, processes or locations that
are relevant to your research
• List the identifier, type and description.
• Should you use identifiers further up or down the hierarchy?
Basic Skills – gene ontologies
• Exercise continued
– 3) For 3 proteins relevant to your research
• What GO terms are assigned to the protein?
• What evidence is there for the assignments?
– 4) Describe the difference between the evidence codes.
– 5) How would you find all proteins with a specific molecular
function?
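For question 5, one option is to filter a GO annotation file on the command line. The sketch below assumes a tab-separated file in the GAF column layout (column 3 = gene symbol, column 5 = GO identifier, column 7 = evidence code). The rows are made-up placeholders, not real annotations, and the file name is hypothetical.

```shell
workdir=$(mktemp -d)
gaf="$workdir/annotations.tsv"

# Placeholder rows in GAF-like column order:
# 1 DB, 2 ID, 3 symbol, 4 qualifier, 5 GO ID, 6 reference,
# 7 evidence code, 8 with/from, 9 aspect (F, P or C)
printf 'UniProtKB\tP_A\tGENE1\t\tGO:0003700\tREF:1\tIDA\t\tF\n'  > "$gaf"
printf 'UniProtKB\tP_B\tGENE2\t\tGO:0005634\tREF:2\tIEA\t\tC\n' >> "$gaf"
printf 'UniProtKB\tP_C\tGENE3\t\tGO:0003700\tREF:3\tIEA\t\tF\n' >> "$gaf"
printf 'UniProtKB\tP_A\tGENE1\t\tGO:0006355\tREF:4\tTAS\t\tP\n' >> "$gaf"

# All gene symbols annotated directly with one GO identifier.
hits=$(awk -F'\t' -v go="GO:0003700" '$5 == go { print $3 }' "$gaf" | sort -u)
echo "$hits"

# The same query, but skipping electronically inferred (IEA) annotations.
curated=$(awk -F'\t' -v go="GO:0003700" '$5 == go && $7 != "IEA" { print $3 }' "$gaf" | sort -u)
echo "$curated"
```

Note that this matches only direct annotations to the chosen identifier. Proteins annotated to more specific child terms are missed unless the child identifiers are included as well, which ties back to the hierarchy question in part 2.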
Basic Skills – ArrayExpress/GEO
• GEO/ArrayExpress
– Public repositories of published microarray data
– Note differences in ease of use and completeness!
• Exercise
– Compare GEO and ArrayExpress.
– Search for human stem cell microarray studies
– What are the GEO/ArrayExpress identifiers for some recent stem cell microarray studies?
– What data is available? Raw data? Processed data?
– Download a CEL file (or set of CEL files) from a stem cell
microarray study.
– Go to ArrayExpress Atlas
• Look up at least two genes of interest (in stem cell biology)
• What does the database tell you?
Basic Skills – Ensembl
• Exercise
– Go to Ensembl. Describe it.
– Look up a (human) gene. How many transcript variants does it
have?
– Explore!
– Use BioMart to gather the Ensembl identifiers and Entrez Gene IDs for all human and mouse genes, then export this data into Excel (you will need this later).
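Before opening a large export in a spreadsheet, it can help to check it from the command line. The sketch below assumes the BioMart export was saved as a tab-separated file with a header row, Ensembl gene IDs in column 1 and Entrez Gene IDs in column 2. The file name, column headers and all identifiers are placeholders.

```shell
workdir=$(mktemp -d)
export_file="$workdir/mart_export.txt"

# Placeholder export: header row, then one row per Ensembl/Entrez pair.
# The second gene has no Entrez ID; the third maps to two Entrez IDs.
printf 'Ensembl Gene ID\tEntrez Gene ID\n'   > "$export_file"
printf 'ENSG_PLACEHOLDER_1\t1001\n'         >> "$export_file"
printf 'ENSG_PLACEHOLDER_2\t\n'             >> "$export_file"
printf 'ENSG_PLACEHOLDER_3\t1003\n'         >> "$export_file"
printf 'ENSG_PLACEHOLDER_3\t1004\n'         >> "$export_file"

# Number of distinct Ensembl genes (skip the header with tail -n +2).
n_genes=$(tail -n +2 "$export_file" | cut -f1 | sort -u | wc -l)

# Number of distinct Ensembl genes that have at least one Entrez ID.
n_mapped=$(tail -n +2 "$export_file" | awk -F'\t' '$2 != ""' | cut -f1 | sort -u | wc -l)

echo "genes: $n_genes, with Entrez ID: $n_mapped"
```

Rows are not the same as genes: one Ensembl gene can appear on several rows, so counting unique identifiers gives a different number than counting lines.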
Basic Skills – UCSC
• Exercise
– Go to genome.ucsc.edu.
– Look up a (human) gene. Display several different gene model tracks. How many transcript variants are found for your gene in UCSC Known Genes, AceView and RefSeq?
– Use the Table Browser to download all human genes (RefSeq) into Excel.
– What else of interest can you download?
Basic Skills – R
• See accompanying worksheet
Basic Skills – command line login
• Try this on your own laptop
• Windows command line
– Windows key + R, then type “cmd”
• Cygwin (Unix in Windows)
– open Cygwin
• PuTTY (log into a Unix server)
– IP address, username, password
• VMware (virtual machine within Windows)
– choose a Unix virtual machine (e.g. tinyunix)
– open a terminal
• Apple Mac
– OS X: open a terminal
Basic Skills – command line
• Basic command line operations
– Directories
• cd <directory> : Change the current directory
• pwd : get current working directory
– Viewing files and directories
• ls <path> : list the contents of a directory (Windows equivalent: dir)
• more <file> : see contents of file on screen, stop after every page
• less <file> : see contents of file (with better ability to move in the file)
• cat <file> : see contents of file, don't stop at new page
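The commands above can be tried safely in a scratch directory. This is a minimal sketch. The directory and file names are arbitrary examples, and mkdir and mktemp are used only to set the scene.

```shell
# Work in a fresh temporary directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create a subdirectory with one small text file in it.
mkdir data
printf 'line one\nline two\n' > data/notes.txt

cd data                     # change the current directory
here=$(pwd)                 # where am I now?
listing=$(ls)               # what is in this directory?
contents=$(cat notes.txt)   # print the whole file at once

echo "$here"
echo "$listing"
echo "$contents"
```

For a file this short, more, less and cat show the same thing. The difference only matters when the file is longer than one screen.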
– Editing files
• emacs <file> : open file for editing in emacs
– other programs: nedit, vi
– Copying and moving files
• cp <file> <destination> : copy file to destination (Windows equivalent: copy)
• mv <file> <destination> : move (or rename) file to destination (Windows equivalent: move)
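A short sketch of the difference between copying and moving, using arbitrary example file names in a scratch directory:

```shell
workdir=$(mktemp -d)
cd "$workdir"

echo "sample text" > original.txt
mkdir backup

cp original.txt backup/        # copy: original.txt still exists here
mv original.txt renamed.txt    # rename: same directory, new name
mv renamed.txt backup/         # move: the file now lives in backup/

ls backup
```

After these steps the working directory holds only the backup directory, which contains both the copy and the moved file.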
– login and copying
• ssh <user>@<server> : log in to a server
• scp <file> <user>@<server>:<path> : copy a file to a server
– viewing parts of files
• head -n <N> <file> : look at the first N lines (e.g. head -n 5 <file>)
• tail -n <N> <file> : look at the last N lines
– pattern matching
• grep -e "pattern" <file> : find lines in file with "pattern"
• grep -v "pattern" <file> : find lines in file without "pattern"
– counting
• wc <file> : count lines, words and characters in file
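A small sketch of the viewing, matching and counting commands on a made-up six-line file (the file name and its contents are placeholders):

```shell
workdir=$(mktemp -d)
sample="$workdir/genes.txt"
printf 'gene1 stem\ngene2 neuron\ngene3 stem\ngene4 muscle\ngene5 stem\ngene6 neuron\n' > "$sample"

first_two=$(head -n 2 "$sample")         # first 2 lines
last_one=$(tail -n 1 "$sample")          # last line
with_stem=$(grep -e "stem" "$sample")    # lines containing "stem"
without_stem=$(grep -v "stem" "$sample") # lines NOT containing "stem"
n_lines=$(wc -l < "$sample")             # number of lines in the file

echo "$first_two"
echo "$last_one"
echo "$with_stem"
echo "$without_stem"
echo "$n_lines"
```

Use straight quotes (") around the pattern. Curly quotes pasted from slides or a word processor are treated as part of the pattern.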
– > <file> : send results to a file
• more filename > filename2 (send all of filename to filename2)
• ls > directory_contents.txt
– pipe "|" : send the results forward to another program
• ls | more (page through a long directory listing)
• grep -e "pattern" filename | head -5 > filename_pattern.txt (save the first 5 matching lines)
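The difference between ">" and "|" is easiest to see side by side: ">" ends a command by writing its output to a file, while "|" feeds the output into another program. A minimal sketch with a made-up word list (file names are arbitrary):

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'gamma\nalpha\nbeta\nalpha beta\n' > words.txt

# Redirect: the matching lines go into a file instead of the screen.
grep -e "alpha" words.txt > alpha_lines.txt

# Pipe: the output of sort becomes the input of head.
first_sorted=$(sort words.txt | head -n 1)

# Both together: pipe two programs, then redirect the final result.
grep -e "beta" words.txt | sort > beta_sorted.txt

cat alpha_lines.txt
echo "$first_sorted"
cat beta_sorted.txt
```

Note that ">" overwrites the target file without asking, so check the file name before pressing Enter.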
command line exercises
• Create a directory, name it after yourself
• What is the current working directory?
• Copy “exercise.txt” into your directory
• Change the working directory to that directory
• Look at the file with “more”
• Read the man page for wc with “man wc”
• What are the first 5 lines?
• What are the last 3 lines?
• How many lines contain the word “fish”? (Hint: you need to use a pipe)
command line exercises
• command line in windows
– Test the following in windows command line (open with
windows-key + R, then “cmd”)
• more
• | (pipe)
• grep
• wc
• sed
– Which work, which do not?
– How do you find help for a program?
– What is “sed” for?
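As a pointer for the last question, here is a minimal sketch of one common use of sed on a Unix command line: replacing text in every line of a file without opening an editor. The file names and contents are made-up examples.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'mouse gene\nmouse protein\n' > in.txt

# s/old/new/ replaces the first "mouse" on each line with "human".
# The input file is left unchanged; the result is redirected to out.txt.
sed 's/mouse/human/' in.txt > out.txt

first_line=$(head -n 1 out.txt)
cat out.txt
```

sed can do much more than substitution; "man sed" lists the other commands.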
Additional resources
• Plenty of tutorials are available online for R and Unix
– Unix tutorial for beginners
• http://www.ee.surrey.ac.uk/Teaching/Unix/
– R
• http://cran.r-project.org/other-docs.html
– Note some are very large (100+ pages)