# Getting started

## Installation

You can install dbOTU3 with pip install dbotu. dbOTU3 should be compatible with Python 2 and 3. Installing with pip will add dbotu.py to your path.

## Getting your data in shape

To run this software, you will need:

• A table of sequence counts. This table has a very specific format: the file is tab-separated; the first column is sequence IDs; the rest of the column headers are sample names. Each cell is the number of times that sequence appears in that sample. If you convert a BIOM file to a TSV using biom convert --to-tsv, you will need to remove the first line (i.e., # Constructed from biom file).
• A fasta file containing the sequences to be processed into OTUs. The sequences should not be aligned. Single-end reads should be trimmed to the same length. Paired-end reads should be merged but not trimmed.
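If your table came from biom convert --to-tsv, the extra first line can be dropped with a few lines of Python. This is a minimal sketch, not part of dbOTU3: the helper name strip_biom_comment and the example rows are made up for illustration, and it only removes a first line that starts with the exact comment text quoted above.

```python
import io


def strip_biom_comment(lines):
    """Yield lines unchanged, skipping a leading BIOM comment line if present."""
    for i, line in enumerate(lines):
        # Only the very first line is ever dropped, and only if it is the
        # comment that `biom convert --to-tsv` writes.
        if i == 0 and line.startswith("# Constructed from biom file"):
            continue
        yield line


# Made-up example table, for illustration only.
raw = io.StringIO(
    "# Constructed from biom file\n"
    "#OTU ID\tsample1\tsample2\n"
    "seq1\t10\t0\n"
)
cleaned = list(strip_biom_comment(raw))
print("".join(cleaned), end="")
```

To clean a real file, open it for reading, pass the file object to strip_biom_comment, and write the yielded lines to a new file.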

The table of sequence counts will be read into memory, but the fasta file will be indexed. As per the algorithm, the sequences will be processed in order of decreasing abundance, but neither the table nor the fasta file needs to be in any particular order. The software will throw an error if there are sequence IDs in the table that are not in the fasta.
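Before a long run, it can be worth checking your inputs yourself. The sketch below is not dbOTU3's own code: it uses only the standard library, all function names are hypothetical, it assumes the fasta ID is the first whitespace-delimited token after ">", and it assumes "abundance" means the total count across samples. It reports table IDs missing from the fasta and lists the sequence IDs in order of decreasing total count.

```python
import csv
import io


def read_count_table(handle):
    """Read a tab-separated count table into (sample_names, {seq_id: counts})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column holds the sequence IDs
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # float() accepts both "10" and "10.0"
        counts[row[0]] = [float(x) for x in row[1:]]
    return samples, counts


def fasta_ids(handle):
    """Collect IDs from '>' lines (assumed: first token after '>')."""
    ids = set()
    for line in handle:
        if line.startswith(">") and len(line.strip()) > 1:
            ids.add(line[1:].split()[0])
    return ids


def missing_from_fasta(counts, ids):
    """Sequence IDs that appear in the table but not in the fasta."""
    return sorted(set(counts) - ids)


def by_decreasing_abundance(counts):
    """Sequence IDs sorted by total count across samples, largest first."""
    return sorted(counts, key=lambda seq_id: sum(counts[seq_id]), reverse=True)


# Made-up inputs, for illustration only.
table = io.StringIO("seq_id\ts1\ts2\nseqA\t1\t2\nseqB\t10\t5\nseqC\t0\t4\n")
fasta = io.StringIO(">seqA\nACGT\n>seqB\nACGA\n")

samples, counts = read_count_table(table)
print(missing_from_fasta(counts, fasta_ids(fasta)))
print(by_decreasing_abundance(counts))
```

In this made-up example, seqC is in the table but not in the fasta, so it is the kind of ID that would trigger the error described above.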

## Deciding on parameters

You’ll need to pick values for a few parameters:

• The abundance cutoff. The original paper suggests using $$10.0$$ to create OTUs that account just for sequencing error and $$0.0$$ to create OTUs that merge ecological populations. The default is $$10.0$$.
• The genetic cutoff. The original paper suggests using $$0.10$$ as a cutoff of genetic dissimilarity. This means that sequences that are $$10\%$$ different will definitely not be put into the same OTU. The default is $$0.10$$.
• The distribution cutoff. This is the $$p$$-value from the statistical test of distribution described in The distribution criterion. The default value is $$0.0005$$ (as suggested in the original publication), although some testing has suggested that smaller $$p$$-values might be more sensible.

## Engage!

Push the button:

dbotu.py my-sequence-table.txt my-fasta.fasta -o my-otu-table.txt


If you want to change some of the default parameters, you can find the relevant options using:

dbotu.py --help


The --output option specifies where the resulting OTU table should go. The --membership option specifies that a QIIME-style membership file should be created (one line for each OTU; the representative sequence ID is the first field, all member sequence IDs are tab-separated after that).
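The membership file is easy to read back in for downstream work. The following is a minimal sketch based only on the layout described above (representative ID first, member IDs tab-separated after it). The function names and example lines are made up, and the sketch does not assume whether the representative is repeated among the members.

```python
def parse_membership(lines):
    """Map each OTU's representative ID to the list of IDs that follow it."""
    otus = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        otus[fields[0]] = fields[1:]
    return otus


def sequence_to_otu(otus):
    """Invert the mapping: member sequence ID -> representative ID."""
    return {member: rep for rep, members in otus.items() for member in members}


# Made-up membership lines, for illustration only.
example = ["seqB\tseqB\tseqA\n", "seqC\tseqC\n"]
otus = parse_membership(example)
print(otus)
print(sequence_to_otu(otus))
```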

If you are interested in only the sequences that were selected as OTUs, you can use the script dbotu_rep_seqs.py, which uses the OTU table output by dbOTU and the input fasta file to produce a fasta file of representative sequences.

## Monitor

The --log option produces a (mostly) tab-separated file with two parts. The first part, set off by three hyphens, is a header with information about the program run: it has the algorithm parameters and the input/output filenames you specified when invoking dbOTU3.

The second part (after the dashes) is a history of the algorithm’s progress. Each line has one or two (tab-separated) IDs. A single ID means that that sequence was assigned as its own OTU. Two IDs means that the first sequence was merged into the second. You can use this file, which is written on the fly, to see what you asked for and how far through your data dbOTU3 has gotten.
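To check on a run in progress, you could summarize the log with a short script. This is a sketch under assumptions, not a description of the exact file layout: it treats the last line consisting of exactly three hyphens as the boundary between header and history, keeps the header as raw lines without interpreting them, and uses a made-up header line and made-up function names.

```python
def split_log(lines):
    """Split a progress log into (header_lines, history_rows)."""
    stripped = [line.rstrip("\n") for line in lines]
    separators = [i for i, line in enumerate(stripped) if line == "---"]
    if not separators:
        raise ValueError("no '---' separator found")
    last = separators[-1]
    # Header: everything before the last separator, minus separators/blanks.
    header = [l for l in stripped[:last] if l and l != "---"]
    # History: one or two tab-separated IDs per line.
    history = [l.split("\t") for l in stripped[last + 1:] if l]
    return header, history


def summarize(history):
    """Return (IDs that became their own OTU, (merged, target) pairs)."""
    new_otus = [row[0] for row in history if len(row) == 1]
    merges = [(row[0], row[1]) for row in history if len(row) == 2]
    return new_otus, merges


# Made-up log; the header line is hypothetical.
log_text = "---\nexample_parameter\t10.0\n---\nseqB\nseqA\tseqB\nseqC\n"
header, history = split_log(log_text.splitlines())
new_otus, merges = summarize(history)
print(len(new_otus), "OTUs so far;", len(merges), "sequences merged")
```

Because the log is written on the fly, rerunning a script like this while dbOTU3 is working gives a rough sense of progress: the number of history lines is the number of sequences processed so far.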

The log file includes enough information that a dbOTU run can be restarted. (This might be handy if, say, you’re running dbOTU on a computing cluster and your job is killed after hitting a time limit.) The script dbotu_restart.py included in the package will restart a run using this log file.

If you want to get into the specifics of what the algorithm did, you can read the debug log file (produced by using the --debug option). The debug log file has 5 types of lines, all of which are tab-separated:

• Lines like A abundance_check B C show that the abundance criterion was applied to sequence A and all existing OTUs, of which OTUs B and C passed. Thus, the genetic similarity of A will be tested against B and C. If no fields follow abundance_check, no OTUs passed the abundance criterion.
• Lines with genetic_check are like abundance_check: the sequence in the first field was sufficiently genetically similar to the OTUs after the genetic_check field to qualify for a distribution test.
• Lines like A distribution_check B 0.001 mean that the distribution of sequence A was compared with that of OTU B and the distribution criterion returned a $$p$$-value of 0.001.
• Lines like A new_otu show that sequence A was made into a new OTU.
• Lines like A merged_into B show that sequence A was merged into OTU B.
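Since every debug line carries its type as a keyword in the second tab-separated field, a quick tally shows how often each step ran. This is an illustrative sketch with made-up function names and made-up example lines. It makes no assumptions about the keywords beyond their position, except that distribution_check lines have exactly four fields, as in the example above.

```python
from collections import Counter


def tally_debug_lines(lines):
    """Count debug-log lines by the keyword in their second field."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


def distribution_pvalues(lines):
    """Extract (sequence, otu, p_value) from distribution_check lines."""
    results = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 4 and fields[1] == "distribution_check":
            results.append((fields[0], fields[2], float(fields[3])))
    return results


# Made-up debug lines, for illustration only.
debug_lines = [
    "seqA\tabundance_check\tseqB\tseqC\n",
    "seqA\tgenetic_check\tseqB\n",
    "seqA\tdistribution_check\tseqB\t0.001\n",
    "seqA\tnew_otu\n",
]
print(tally_debug_lines(debug_lines))
print(distribution_pvalues(debug_lines))
```

Collecting the p-values this way makes it easier to see how many comparisons fall near your chosen distribution cutoff.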

## Evaluate

If you can validate your choices for parameters, do so! There is no guarantee that the default parameters are the ones that will give the most biologically meaningful results.

You might also want to chimera-check the OTUs, possibly with a script like my UCHIME chimera checker.