24. Pre-processing of metagenomic datasets


Metagenomic dataset preprocessing – data reduction
Konstantinos Mavrommatis
Complexity
The total metagenome is the result of a cell community. Cells belong to different organisms ranging from strains to domains.
[Figure: example communities on a species complexity axis (1, 10, 100, 1000, 10000): Acid Mine Drainage, Sargasso Sea, Termite Hindgut, Cow rumen, Soil]
Who is there? (phylogenetic content)
What does it do? (functional content)
Why is it there? (comparative study)
Dataset processing
Sample preparation
High throughput sequencing
Assemble reads
Analysis
Feature prediction
QC
Functional annotation and comparative analysis
Binning
Dataset processing (v 3.0a)
Submitted files: assembled contigs, 454 reads, Illumina reads (fasta/fastq)
File QC. Check character set and contig name. Remove trailing Ns.
Trimming (two trimming steps shown). Q=20, Q=13
Low complexity. Size of 80 bp.
Fasta
Dereplication. Prefix = 5, identity 95%.
Clustering. 100% identity.
File for gene calling (fasta)
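The File QC step (check the character set and contig name, remove trailing Ns) could be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the allowed alphabet (IUPAC nucleotide codes), the rule for a valid name, and the function name qc_record are all assumptions.

```python
import re

# Assumption: IUPAC nucleotide codes form the allowed character set.
ALLOWED = set("ACGTURYSWKMBDHVN")
# Assumption: a valid name has no whitespace and uses a conservative character set.
NAME_RE = re.compile(r"^[A-Za-z0-9_.:|-]+$")


def qc_record(name, seq):
    """Return a cleaned (name, seq) pair or raise ValueError."""
    if not NAME_RE.match(name):
        raise ValueError(f"invalid sequence name: {name!r}")
    seq = seq.upper()
    bad = set(seq) - ALLOWED
    if bad:
        raise ValueError(f"{name}: unexpected characters {sorted(bad)}")
    # Remove trailing Ns, as listed on the slide.
    return name, seq.rstrip("N")
```

A record such as ("contig_1", "acgtnn") would come back upper-cased with the trailing Ns removed, while a name containing a space would be rejected.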
Dataset processing
Feature prediction pipeline (v 3.0a)
File for gene calling (fasta): unassembled reads + assembled contigs
CRISPR detection. crt / pilercr
RNA detection. tRNAscan / hmmer / Blast / (isolates: Rfam)
CDS detection. Isolates: prodigal. Metagenomes: varies.
Conflict resolution.
Concatenation of all results.
Creation of final output file.
File for IMG
IMG
Dataset processing
Quality trimming
Courtesy Alex Copeland
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
Remove low-quality sequence from the ends of the reads.
454 datasets: lucy.
Illumina datasets: keep the longest high-quality string.
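The "longest high quality string" rule for Illumina reads could look roughly like this. It is a sketch under assumptions: quality scores are already decoded to integers, and the default threshold of 13 is one of the two values (Q=20, Q=13) shown on the pipeline slide; which one applies to Illumina reads is not stated. The function names are hypothetical.

```python
def longest_high_quality_span(quals, min_q=13):
    """Return (start, end) of the longest run with every score >= min_q.

    end is exclusive; (0, 0) is returned when no base passes.
    """
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= min_q:
            if run_start is None:
                run_start = i
            # Keep the first run of maximal length.
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None
    return best_start, best_end


def trim_read(seq, quals, min_q=13):
    """Cut a read and its scores down to the longest high-quality run."""
    start, end = longest_high_quality_span(quals, min_q)
    return seq[start:end], quals[start:end]
```

Unlike trimming only from the 3' end, this keeps a single contiguous stretch, so a low-quality base in the middle of a read splits it and only the longer side survives.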
Dataset processing
Low complexity filter
Examples of low-complexity sequence:
tatatatatatatatatat
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Using dust (NCBI):
- Remove sequences with fewer than 80 informative bases.
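The 80 informative bases rule can be illustrated with a short filter. The masking itself is not shown; the sketch assumes low-complexity regions have already been replaced by N, and both function names are hypothetical.

```python
def informative_bases(masked_seq):
    """Count bases left unmasked; assumes masked positions are written as N."""
    return sum(1 for base in masked_seq.upper() if base in "ACGT")


def passes_low_complexity_filter(masked_seq, min_informative=80):
    """Keep a sequence only if enough unmasked bases remain."""
    return informative_bases(masked_seq) >= min_informative
```

A 100 bp read with 30 masked positions would be dropped under this rule, since only 70 informative bases remain.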
Dataset processing
Dereplication
Dataset processing
Sequence dereplication
[Figure: two groups of aligned reads]
Group 1: atcccat, atc-cat, atcccat, atcccat, atcccat
Group 2: gctacat, gctncat, gctacat, gctacat
[Figure label: Not dereplicated]
Using uclust:
- 95% identity (global alignment).
- Identical prefix (5 nt).
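The two criteria (identical 5 nt prefix, 95% identity over a global alignment) can be illustrated with a naive greedy sketch. This is not uclust and would be far too slow for real datasets; identity is approximated here as 1 minus edit distance over the longer length, which is an assumption rather than uclust's definition, and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (global, unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or mismatch
        prev = cur
    return prev[-1]


def identity(a, b):
    """Approximate global identity: 1 - edits / longer length (an assumption)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def dereplicate(seqs, prefix_len=5, min_identity=0.95):
    """Greedy dereplication: keep the first member of each replicate group."""
    kept_by_prefix = {}
    kept = []
    for seq in seqs:
        reps = kept_by_prefix.setdefault(seq[:prefix_len], [])
        if any(identity(seq, rep) >= min_identity for rep in reps):
            continue  # replicate of an already kept sequence
        reps.append(seq)
        kept.append(seq)
    return kept
```

Bucketing by prefix first means a read is only ever compared against kept reads that start with the same 5 nt, so two otherwise near-identical reads with different prefixes are both kept.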
Dataset processing
Evaluation of processing tools
Because of their small size, quality problems, and large number, unassembled sequences need to be processed with efficient pipelines.
Simulated datasets:
a. Using sequences extracted from finished genomes (perfect sequences).
b. Using reads that have been used to assemble finished genomes (real errors).
Evaluation and development of new tools/wrappers.
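Option (a), perfect sequences extracted from a finished genome, could be simulated along these lines. The uniform sampling of start positions, the fixed read length, and the read naming scheme are assumptions made for illustration; the slides do not describe how the simulated datasets were actually built.

```python
import random


def simulate_perfect_reads(genome, read_len, n_reads, seed=0):
    """Sample error-free reads uniformly from a finished genome sequence."""
    if read_len > len(genome):
        raise ValueError("read length exceeds genome length")
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    reads = []
    for i in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        # Hypothetical naming scheme: record the origin so predictions can be checked later.
        reads.append((f"read_{i}_pos_{start}", genome[start:start + read_len]))
    return reads
```

Recording the source position in the read name makes it possible to compare features predicted on a read with the annotation of the genome it came from.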
Dataset processing
Feature prediction
Available methods:
Ab initio: Metagene, MetaGeneMark, FragGeneScan, Prodigal.
Similarity based: Blastx, USEARCH.
[Figure: isolate vs. metagenome gene predictions, labelled CORRECT, MISSED, NEW, WRONG]
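One way to arrive at the four labels is to compare predicted genes with the isolate annotation. The slides do not define the categories, so the definitions below are assumptions: CORRECT means same strand and same stop position as a reference gene, WRONG means overlapping a reference gene without matching it, NEW means no overlap with any reference gene, and MISSED is a reference gene with no correct prediction. Genes are hypothetical (start, end, strand) tuples.

```python
def stop_site(gene):
    """3' end of a gene plus its strand (assumed matching criterion)."""
    start, end, strand = gene
    return (end if strand == "+" else start, strand)


def overlaps(a, b):
    """True when two genes share at least one position (strand ignored)."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify(predicted, reference):
    """Sort predictions into CORRECT / WRONG / NEW and list MISSED reference genes."""
    ref_stops = {stop_site(g) for g in reference}
    labels = {"CORRECT": [], "WRONG": [], "NEW": [], "MISSED": []}
    found = set()
    for gene in predicted:
        if stop_site(gene) in ref_stops:
            labels["CORRECT"].append(gene)
            found.add(stop_site(gene))
        elif any(overlaps(gene, ref) for ref in reference):
            labels["WRONG"].append(gene)
        else:
            labels["NEW"].append(gene)
    labels["MISSED"] = [g for g in reference if stop_site(g) not in found]
    return labels
```

Matching on the stop position rather than on both ends tolerates truncated genes, which are common on short reads where the start codon often lies outside the fragment.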
[Figure: Trimming]
[Figure: 454 Ti (no errors)]
[Figure: 454 Ti (with errors)]
[Figure: Illumina 115 bp]
[Figure: Illumina 74 bp]
[Figure: Contigs; labels: frameshift, wrong prediction]
Why annotate unassembled reads?
Sample total size: 102,722,384 reads (2x150)
Assembled contigs: 1,375,950 contigs
Assembled reads, mapped (by bwa): 11,778,925 reads
Genes called on unassembled reads: 64,737,444 genes
Similar to genes on contigs: 8,373,641 genes (12%)
Genes with similarity to isolate genomes: 40,778,854 genes
Pfam counts shown: 5060 different pfams, 7481 different pfams
Assembled only
More accurate statistics based on unassembled + assembled
Unassembled + assembled + real metagenome
Additional information about functions and phylogeny
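The proportions behind these counts can be worked out directly. The numbers below are copied from the slide; treating every read that bwa did not map to a contig as unassembled is an assumption, and the variable names are made up for the example.

```python
total_reads = 102_722_384            # sample size, 2x150 reads
mapped_reads = 11_778_925            # reads mapped to contigs by bwa
genes_on_unassembled = 64_737_444    # genes called on unassembled reads
genes_similar_to_contigs = 8_373_641 # of those, similar to genes on contigs

# Assumption: every read not mapped to a contig counts as unassembled.
unassembled_reads = total_reads - mapped_reads
fraction_mapped = mapped_reads / total_reads
fraction_similar = genes_similar_to_contigs / genes_on_unassembled

print(f"unassembled reads: {unassembled_reads:,}")
print(f"reads mapped to contigs: {fraction_mapped:.1%}")
print(f"unassembled-read genes similar to contig genes: {fraction_similar:.1%}")
```

Under that assumption only a small fraction of the reads end up in contigs, and most genes called on the remaining reads have no counterpart on a contig, which is the case the slide makes for annotating unassembled reads.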
Processing time (metagenomes)
Total submissions: 336
Processing time: 2.45 days (annotation), 24 days (integration)
Data size (bp): 174,719,855 (average), 58,006,992,092 (total)
Processing time (isolates)
Total submissions: 3630
Processing time: 10 hours (annotation), 12 days (integration)
Data size (bp): 1,658,242 (average), 4,114,099,773 (total)
Thank you for your attention