Getting started
Installation
You can install dbOTU3 with pip:
pip install dbotu
dbOTU3 should be compatible with Python 2 and 3. Installing with pip will add dbotu.py to your path.
Getting your data in shape
To run this software, you will need:
- A table of sequence counts. This table has a very specific format: the file is tab-separated; the first column is sequence IDs; the rest of the column headers are sample names. Each cell is the number of times that sequence appears in that sample. If you convert a BIOM file to a TSV using biom convert --to-tsv, you will need to remove the first line (i.e., # Constructed from biom file).
- A fasta file containing the sequences to be processed into OTUs. The sequences should not be aligned. Single-end reads should be trimmed to the same length. Paired-end reads should be merged but not trimmed.
The table of sequence counts will be read into memory, but the fasta file will be indexed. As per the algorithm, the sequences will be processed in order of decreasing abundance, but neither the table nor the fasta file needs to be in any particular order. The software will throw an error if there are sequence IDs in the table that are not in the fasta.
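Before running dbOTU3, it can help to check the two inputs yourself. The following is a minimal sketch, not part of the dbOTU3 package: the function names are hypothetical, and it assumes that a fasta record's ID is the first word after the > character. It drops the leading comment line left by biom convert --to-tsv and lists any sequence IDs in the table that have no record in the fasta file.

```python
def strip_biom_comment(lines):
    """Drop a leading '# Constructed from biom file' line, if present."""
    lines = list(lines)
    if lines and lines[0].startswith("# Constructed from biom file"):
        return lines[1:]
    return lines


def table_ids(lines):
    """Return the first tab-separated field of every row after the header row."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    return [row.split("\t")[0] for row in rows[1:]]


def fasta_ids(lines):
    """Return the ID of every fasta record (assumed: first word after '>')."""
    return [
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    ]


def missing_from_fasta(table_lines, fasta_lines):
    """List table sequence IDs that have no matching fasta record."""
    known = set(fasta_ids(fasta_lines))
    cleaned = strip_biom_comment(table_lines)
    return [seq_id for seq_id in table_ids(cleaned) if seq_id not in known]


# Toy data for illustration only.
example_table = [
    "# Constructed from biom file\n",
    "#OTU ID\tsampleA\tsampleB\n",
    "seq1\t10\t0\n",
    "seq2\t3\t7\n",
]
example_fasta = [">seq1\n", "ACGT\n"]
example_missing = missing_from_fasta(example_table, example_fasta)
```

With real files, pass open file handles instead of lists, for example missing_from_fasta(open("my-sequence-table.txt"), open("my-fasta.fasta")). An empty result means every ID in the table was found in the fasta file.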
Deciding on parameters
You’ll need to pick values for a few parameters:
- The abundance cutoff. The original paper suggests using \(10.0\) to create OTUs that account just for sequencing error and \(0.0\) to create OTUs that merge ecological populations. The default is \(10.0\).
- The genetic cutoff. The original paper suggests using \(0.10\) as a cutoff of genetic dissimilarity. This means that sequences that are \(10\%\) different will definitely not be put into the same OTU. The default is \(0.10\).
- The distribution cutoff. This is the \(p\)-value from the statistical test of distribution described in The distribution criterion. The default value is \(0.0005\) (as suggested in the original publication), although some testing has suggested that smaller \(p\)-values might be more sensible.
Engage!
Push the button:
dbotu.py my-sequence-table.txt my-fasta.fasta -o my-otu-table.txt
If you want to change some of the default parameters, you can find the relevant options using:
dbotu.py --help
The --output option specifies where the resulting OTU table should go. The --membership option specifies that a QIIME-style membership file should be created (one line for each OTU; the representative sequence ID is the first field, all member sequence IDs are tab-separated after that).
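The membership file layout described above is simple to read back in. The following is a minimal sketch, not part of dbOTU3: read_membership is a hypothetical helper that maps each representative sequence ID (first field) to the list of IDs that follow it on the line. It makes no assumption about whether the representative is repeated among the members.

```python
def read_membership(lines):
    """Map each OTU's representative ID to the member IDs listed after it."""
    otus = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        otus[fields[0]] = fields[1:]
    return otus


# Toy data for illustration only.
example_membership = read_membership(["seq1\tseq1\tseq3\n", "seq2\tseq2\n"])
```

With a real file, pass an open file handle in place of the list.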
If you are interested in only the sequences that were selected as OTUs, you can use the script dbotu_rep_seqs.py, which uses the OTU table output by dbOTU and the input fasta file to produce a fasta file of representative sequences.
Monitor
The --log option produces a (mostly) tab-separated file with two parts. The first part, set off by three hyphens, is a header with information about the program run: the algorithm parameters and the input/output filenames you specified when invoking dbOTU3.
The second part (after the dashes) is a history of the algorithm’s progress. Each line has one or two (tab-separated) IDs. A single ID means that that sequence was assigned as its own OTU. Two IDs means that the first sequence was merged into the second. You can use this file, which is written on the fly, to see what you asked for and how far through your data dbOTU3 has gotten.
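To check on a run in progress, you could summarize the history part of the log. The following is a minimal sketch under one stated assumption: the header and the history are separated by a line consisting of exactly ---, and the history is everything after the last such line. The function names are hypothetical, and the header line in the toy data is a placeholder, not real dbOTU3 output.

```python
def split_log(lines):
    """Split a log into (header lines, history rows as lists of IDs)."""
    rows = [line.rstrip("\n") for line in lines]
    separators = [i for i, row in enumerate(rows) if row.strip() == "---"]
    if not separators:
        raise ValueError("no '---' separator line found")
    last = separators[-1]
    header = [r for r in rows[:last] if r.strip() and r.strip() != "---"]
    history = [r.split("\t") for r in rows[last + 1:] if r.strip()]
    return header, history


def summarize(history):
    """Return (IDs that became their own OTU, (merged ID, target OTU) pairs)."""
    new_otus = [ids[0] for ids in history if len(ids) == 1]
    merges = [(ids[0], ids[1]) for ids in history if len(ids) == 2]
    return new_otus, merges


# Toy data for illustration only.
example_log = [
    "placeholder header line\n",
    "---\n",
    "seq1\n",
    "seq2\tseq1\n",
    "seq3\n",
]
example_header, example_history = split_log(example_log)
example_new, example_merges = summarize(example_history)
```

Comparing len(example_history) with the number of rows in your sequence table gives a rough sense of how far the run has progressed.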
The log file includes enough information that a dbOTU run can be restarted. (This might be handy if, say, you’re running dbOTU on a computing cluster and your job is killed after hitting a time limit.) The script dbotu_restart.py, included in the package, will restart a run using this log file.
If you want to get into the specifics of what the algorithm did, you can read the debug log file (produced by using the --debug option). The debug log file has 5 types of lines, all of which are tab-separated:
- Lines like A abundance_check B C show that the abundance criterion was applied to sequence A and all existing OTUs, of which OTUs B and C passed. Thus, the genetic similarity of A will be tested against B and C. If no fields follow abundance_check, no OTUs passed the abundance criterion.
- Lines with genetic_check are like abundance_check: the sequence in the first field was sufficiently genetically similar to the OTUs after the genetic_check field to qualify for a distribution test.
- Lines like A distribution_check B 0.001 mean that the distribution of sequence A was compared with that of OTU B and the distribution criterion returned a \(p\)-value of 0.001.
- Lines like A new_otu show that sequence A was made into a new OTU.
- Lines like A new_otu B show that sequence A was merged into OTU B.
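If you want a quick overview of a debug log, one option is to tally the lines by their second field and collect the \(p\)-values from the distribution checks. The following is a minimal sketch based only on the line layouts listed above; summarize_debug_log is a hypothetical helper, not something shipped with dbOTU3.

```python
from collections import Counter


def summarize_debug_log(lines):
    """Count lines by type (second field) and collect distribution p-values."""
    counts = Counter()
    p_values = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not a recognizable debug line
        counts[fields[1]] += 1
        if fields[1] == "distribution_check" and len(fields) >= 4:
            # (sequence ID, OTU ID, p-value)
            p_values.append((fields[0], fields[2], float(fields[3])))
    return counts, p_values


# Toy data for illustration only.
example_debug = [
    "A\tabundance_check\tB\tC\n",
    "A\tgenetic_check\tB\n",
    "A\tdistribution_check\tB\t0.001\n",
    "A\tnew_otu\n",
]
example_counts, example_p_values = summarize_debug_log(example_debug)
```

Looking at how the collected \(p\)-values cluster around your chosen distribution cutoff may help when deciding whether a different cutoff is worth trying.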
Evaluate
If you can validate your choices for parameters, do so! There is no guarantee that the default parameters are the ones that will give the most biologically meaningful results.
You might also want to chimera-check the OTUs, possibly with a script like my UCHIME chimera checker.