Presentation Bioinformatics seminar 181011 R.Kiekens

Login:
BITseminar
Pass:
BITseminar2011
BIOINFORMATICS
Bioinformatics
• Combination of:
– Theory and methods (algorithms, statistical
methods, machine learning, …)
– Applications (sequence analysis, genome
assemblies, databases, ... )
– Different kinds of datasets (sequence data,
microarray, next-gen data, …)
Biology Core Concepts
• Molecular biology
• Systems biology
• Evolutionary theory
• Common lab techniques
• Sequence comparison
• Phylogenetic analysis
Computer science
• Programming
• Database querying
• Data mining
• Visualization
• Machine learning
• Modeling
• …
Data exceeds analysis
[Figure labels: data, Bioinformatician]
How to survive?
• Knowledge of Linux/Unix
• Scripting: Perl/Python
• Network-based data storage
• Knowledge of biology, genomics
• Database structures
• Try to keep up with all new tools!
Benefit of using (Bio)Perl, example
You have 1000 sequences to
BLAST and analyse…
You can do this manually
Or… use a Perl script to do this
for you and present you with the
final results!
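The slide names Perl; as an illustration only, here is the same idea sketched in Python. It assumes the BLAST run has already produced tabular output with the default 12 columns (BLAST+ `-outfmt 6`), and it keeps the best hit per query. The function name `best_hits`, the E-value cutoff and the example rows are made up for this sketch and are not from the seminar.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hits(handle, max_evalue=1e-5):
    """Return {query_id: row} keeping the highest bit score per query.

    Hypothetical helper: rows with an E-value above max_evalue are skipped.
    """
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, fields))
        row["pident"] = float(row["pident"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        if row["evalue"] > max_evalue:
            continue
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best

# Made-up rows standing in for a real results file.
EXAMPLE = (
    "seq1\tsubjA\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t360\n"
    "seq1\tsubjB\t90.00\t180\t18\t0\t1\t180\t1\t180\t1e-60\t250\n"
    "seq2\tsubjC\t75.00\t50\t12\t1\t1\t50\t10\t59\t0.5\t30.1\n"
)

hits = best_hits(io.StringIO(EXAMPLE))
for query, hit in sorted(hits.items()):
    print(query, hit["sseqid"], hit["pident"], hit["evalue"])
```

With 1000 queries the loop is the same; only the input file grows. Replace `io.StringIO(EXAMPLE)` with an opened results file.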
Good journals to keep up the pace
• Bioinformatics ( http://bioinformatics.oxfordjournals.org/ )
• BMC Bioinformatics ( http://www.biomedcentral.com/bmcbioinformatics/ )
• PLoS Computational Biology ( http://www.ploscompbiol.org/ )
• ...
DATABASES
Types of databases
• DNA databases
• Protein databases
• Genome databases
• Microarray databases
• Next-gen sequencing databases
What to find in databases?
• Sequences
• Motifs
• Mutations, SNPs
• Gene interaction profiles
• Interactions (protein-protein interactions)
• Transcription factor binding sites
• Etc…
Databases? Good Reference
• http://nar.oxfordjournals.org annual edition
NCBI: lots of options… feed the need
Amino acid databases
• Uniprot
– SWISS-PROT
– TrEMBL
– PIR
UniProt
• http://www.uniprot.org
• Good quality, curated
• Minimal redundancy
• Extensive cross-linking to useful databases
Structural databases
• Structure leads to function!
– Protein Data Bank – PDB http://www.pdb.org
– SCOP & CATH databases (structural classification) http://scop.mrc-lmb.cam.ac.uk/scop/ ; http://www.cathdb.info/
Structure prediction (modeling)
• SWISS-MODEL & Repository ( http://swissmodel.expasy.org/ )
• MODELLER & MODBASE ( http://salilab.org )
• Study of interactions (docking) & drug design
SNPs and pharma
• To collect, encode, and disseminate
knowledge about the impact of human
genetic variations on drug response.
http://www.pharmgkb.org/
DNA Microarray Databases
• Standard: MIAME = Minimum Information About a
Microarray Experiment
• Databases:
– ArrayExpress (EBI)
http://www.ebi.ac.uk/arrayexpress/
– GEO (NCBI)
http://www.ncbi.nlm.nih.gov/geo/
Check the database before planning an experiment!
Next-gen data databases
• http://www.ncbi.nlm.nih.gov/Traces/sra
• http://www.ebi.ac.uk/ena
• http://www.ddbj.nig.ac.jp/sub/trace_srae.html
GENOME BROWSERS
Human reference sequences
• Celera
• HuRef
• GRCh37
Three reference genomes. Keep this in mind when browsing databases!
Useful Genome Browsers
• Ensembl: http://www.ensembl.org/
• NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?
• UCSC: http://genome.ucsc.edu/
Genome browser: Ensembl
EMBL Problems
• Lots of redundancy
• Wrong or old annotations
• Vector contamination
• Errors in sequences
RefSeq
• Better option, NCBI reference
• Curated
• Annotations are controlled
• No redundancy
NCBI: GenBank vs RefSeq
http://www.ncbi.nlm.nih.gov/RefSeq/
• Sequence records are created by scientists who submit
sequence data to GenBank. As an archival database,
GenBank may contain hundreds of records for the same
gene. In addition, because there is no independent review
system, the types of information may vary from record to
record, and GenBank sequence data may contain errors
and contaminant vector DNA.
• To address some of the problems associated with GenBank
sequence records, NCBI developed its RefSeq database.
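To make the redundancy point concrete, the sketch below groups FASTA records that carry exactly the same sequence. It is a minimal illustration in Python, not how NCBI builds RefSeq (RefSeq is curated); the helper names `read_fasta` and `group_identical` and the three records are invented for this example.

```python
import io
from collections import defaultdict

def read_fasta(handle):
    """Yield (record_id, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def group_identical(records):
    """Map each distinct sequence (case-insensitive) to the IDs that share it."""
    groups = defaultdict(list)
    for name, seq in records:
        groups[seq.upper()].append(name)
    return dict(groups)

# Made-up records: rec1 and rec2 are the same sequence submitted twice.
EXAMPLE = ">rec1 first submission\nACGT\nACGT\n>rec2\nacgtacgt\n>rec3\nTTTT\n"

groups = group_identical(read_fasta(io.StringIO(EXAMPLE)))
for seq, names in groups.items():
    print(len(names), "record(s):", ", ".join(names))
```

Exact matching only catches verbatim duplicates; near-identical records for the same gene would need a sequence comparison tool such as BLAST.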
RefSeq accession numbers
• NM_ mRNA (provisional, predicted, reviewed)
• NP_ protein (provisional, predicted, reviewed)
• NR_ non-coding RNA (provisional, reviewed)
• NG_ human genes (provisional, reviewed)
• NC_ chromosomes, complete genomes (provisional, reviewed)
RefSeq accession numbers (2)
• XM_ predicted mRNA (model)
• XP_ predicted protein (model)
• XR_ predicted non-coding RNA (model)
• NT_ human and mouse genomic contigs (model)
• NW_ mouse supercontigs (model)
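The prefix lists on these two slides can be turned into a small lookup, sketched here in Python. The descriptions are copied from the slides; the function name `classify_accession` and the example accessions are illustrative assumptions. Prefixes not listed on the slides return `None`.

```python
# Prefix -> description, as listed on the two RefSeq slides.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein",
    "NR_": "non-coding RNA",
    "NG_": "human genes",
    "NC_": "chromosomes, complete genomes",
    "XM_": "predicted mRNA (model)",
    "XP_": "predicted protein (model)",
    "XR_": "predicted non-coding RNA (model)",
    "NT_": "human and mouse genomic contigs (model)",
    "NW_": "mouse supercontigs (model)",
}

def classify_accession(accession):
    """Return the slide's description for a RefSeq-style accession, or None."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix)

# Example accessions chosen for illustration only.
for acc in ["NM_000518.5", "XP_123456.1", "AB123456"]:
    print(acc, "->", classify_accession(acc))
```

A quick check like this helps when a hit list mixes curated (N*_) and model (X*_) records.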
Genome browser: NCBI
Genome browser: UCSC
• Example: UCSC
• Good tutorial:
– http://www.openhelix.com/downloads/ucsc/ucsc_home.shtml
SNPS AND DISEASE RESEARCH
SNPs and disease research
• Association analysis, disease related (?),
mapping genome variation…
• Reference = dbSNP database
Example NCBI SNP database,
SNP rs33957964
Other useful SNP databases
• Genome Variation Server http://gvs.gs.washington.edu/GVS/
• HapMap (Ensembl) http://hapmap.org/
• List of all: http://www.hgvs.org/dblist/ccent.html
Clinical Bioinformatics
• Microarrays, omics data (genomics, proteomics,
interactomics, metabolomics, …)
• Combination of bioinformatics and medical
informatics
ALGORITHMS AND TOOLS
Algorithms
• Foundations of bioinformatics tools
– Implemented in ‘front end tools’ (website, Java applications)
• Can be slow
• Good for smaller analyses, quick mining
– Scripts, programs: used on the command line (e.g. local BLAST)
• Usually local install on server
• Faster
• Large queries, long analysis time required
• Knowledge of Linux/Unix essential
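As a sketch of the command-line route, the Python snippet below assembles a local BLAST call without running it. It assumes the BLAST+ `blastn` program is installed on the server; the file names, the database name `my_local_db` and the helper `blastn_command` are placeholders for this example, not part of the seminar material.

```python
import shlex

def blastn_command(query_fasta, database, output_tsv, evalue=1e-5, threads=4):
    """Build the argument list for a local BLAST+ blastn run (tabular output)."""
    return [
        "blastn",
        "-query", query_fasta,          # FASTA file with the query sequences
        "-db", database,                # name of a local BLAST database
        "-out", output_tsv,             # where the results are written
        "-outfmt", "6",                 # tab-separated, one hit per line
        "-evalue", str(evalue),         # E-value threshold for reported hits
        "-num_threads", str(threads),   # CPU threads to use
    ]

# Placeholder file and database names.
cmd = blastn_command("queries.fasta", "my_local_db", "results.tsv")
print(shlex.join(cmd))
```

To actually launch the search on a machine with BLAST+ installed, pass the list to `subprocess.run(cmd, check=True)`.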
Hall of Fame
• Linux operating system, MySQL database
• (Bio)Perl: programming language → making your life easier!
• BLAST/BLAT: comparing sequences
• PHYLIP: phylogenetic analysis, tree building
• ClustalW: multiple alignment
• MEGA5: multiple alignment and editing sequences
• HMMER: comparative genomics
• EMBOSS: combining several tools for sequence analysis
Open source → Free to use and develop
Tools? Good Reference
• http://nar.oxfordjournals.org/ - annual edition
Analysing next gen sequencing data
• Different tools for different formats
– Roche
– Applied Biosystems
– Illumina
Next gen tools
• FastQC: quality assessment of FASTQ files
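To show the kind of information a quality check works from, here is a minimal Python sketch that reads FASTQ records and reports the mean Phred quality per read. It is not FastQC and covers only one simple statistic. It assumes four lines per record and Phred+33 quality encoding; the reads and the helper names `read_fastq` and `mean_quality` are made up.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from 4-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return              # end of file
        if not header.strip():
            continue            # tolerate blank lines between records
        seq = handle.readline().strip()
        handle.readline()       # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header.strip()[1:], seq, qual

def mean_quality(qual, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    if not qual:
        return 0.0
    return sum(ord(char) - offset for char in qual) / len(qual)

# Made-up reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n"

results = [(name, mean_quality(qual))
           for name, _seq, qual in read_fastq(io.StringIO(EXAMPLE))]
for name, score in results:
    print(name, score)
```

Older Illumina pipelines used a different offset (Phred+64), so the `offset` argument has to match the data.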
Assembly tools next gen
• A number of specialized tools exist:
ABySS, gap4, Geneious, Mira, Newbler,
SSAKE, SOAPdenovo, Velvet, …
Galaxy! http://galaxy.psu.edu/
• Galaxy provides a web-based application for the
analysis of sequence data
• Includes many tools, including tools for NGS data
• Makes your life easier, less Linux knowledge needed
On the cloud
Structure Galaxy
So this is why you need a bioinformatician in the lab!!