Usage

1 Input / Output

dupsifter accepts input from stdin or from an input SAM/BAM file, and writes output either to stdout or to an output SAM/BAM file. Input and output options can be mixed as needed (e.g., a BAM file as input with the output streamed to stdout).
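
To make the input/output combinations concrete, here is a minimal Python sketch that assembles an argument list following the usage line `dupsifter [options] <ref.fa> [in.bam]`. The helper name `dupsifter_argv` and its parameters are hypothetical and only illustrate how the positional input and the -o option combine. It does not run dupsifter.

```python
from typing import List, Optional


def dupsifter_argv(
    ref_fa: str,
    in_bam: Optional[str] = None,
    out_bam: Optional[str] = None,
) -> List[str]:
    """Build an argument list of the form: dupsifter [options] <ref.fa> [in.bam].

    Hypothetical helper for illustration only.
    in_bam=None  -> no input file is passed, so dupsifter reads from stdin.
    out_bam=None -> no -o option is passed, so dupsifter writes to stdout.
    """
    argv = ["dupsifter"]
    if out_bam is not None:
        argv += ["-o", out_bam]
    argv.append(ref_fa)
    if in_bam is not None:
        argv.append(in_bam)
    return argv


# BAM file in, streamed output (stdout)
print(dupsifter_argv("ref.fa", in_bam="input.bam"))
# Streamed input (stdin), BAM file out
print(dupsifter_argv("ref.fa", out_bam="output.bam"))
```

The resulting list could be handed to `subprocess.run`, with stdin or stdout connected to other tools when streaming.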

Note

The input to dupsifter is expected to be grouped by read name (i.e., reads with the same name are next to one another in the BAM). If you supply a position-sorted BAM, dupsifter produces an error message along the lines of:

[dupsifter] ERROR: Can't find read 1 and/or read 2 in 1 reads with read ID: <name of read>.
Are these reads coordinate sorted?
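
The grouping requirement can be checked before running dupsifter. The Python sketch below is not part of dupsifter. The function name `first_ungrouped_name` is hypothetical, and it assumes you have already extracted the read names in file order (the read name is the first column of a SAM record). It reports the first read name that reappears after a different name has been seen, which is what happens in a position-sorted file.

```python
from typing import Iterable, Optional


def first_ungrouped_name(read_names: Iterable[str]) -> Optional[str]:
    """Return the first read name that appears in more than one run, else None.

    Hypothetical pre-check, not dupsifter's own logic. A name-grouped file
    has every read name in exactly one contiguous run.
    """
    seen = set()       # names whose run has already started
    previous = None    # name of the record just before the current one
    for name in read_names:
        if name != previous:
            if name in seen:
                # The name was seen earlier, then interrupted by another name.
                return name
            seen.add(name)
            previous = name
    return None


grouped = ["readA", "readA", "readB", "readB"]
scattered = ["readA", "readB", "readA", "readB"]
print(first_ungrouped_name(grouped))    # None
print(first_ungrouped_name(scattered))  # readA
```

This sketch keeps every distinct read name in memory, so it is better suited to spot checks on a subset of records than to full-size files.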

2 Run Statistics

Each run of dupsifter calculates statistics on the number of duplicates and the types of reads processed. By default, the statistics file is named dupsifter.stat if the output is streamed, or basename.dupsifter.stat if an output file is given (-o basename.bam). The statistics file name can also be set with the -O option (e.g., dupsifter -O output.dupsifter.stat ref.fa input.bam). When using the -O option, the file name should end with .dupsifter.stat. If both -o and -O are provided, the -O file name is used.
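
The naming rules above can be summarized in a short Python sketch. This is an illustration of the documented precedence, not dupsifter's actual code. The function name `stats_file_name` is hypothetical, and deriving the basename by stripping the final extension with `os.path.splitext` (while keeping any directory part) is an assumption.

```python
import os
from typing import Optional


def stats_file_name(
    output: Optional[str] = None,
    stats_output: Optional[str] = None,
) -> str:
    """Resolve the statistics file name from the -o and -O values.

    Hypothetical helper mirroring the documented rules:
      1. -O given           -> use it as-is (it takes priority over -o)
      2. only -o given      -> <basename>.dupsifter.stat
      3. neither (streamed) -> dupsifter.stat
    """
    if stats_output is not None:
        return stats_output
    if output is not None:
        # Assumption: the basename is the -o value minus its last extension.
        base, _extension = os.path.splitext(output)
        return base + ".dupsifter.stat"
    return "dupsifter.stat"


print(stats_file_name())                                      # dupsifter.stat
print(stats_file_name(output="basename.bam"))                 # basename.dupsifter.stat
print(stats_file_name(output="basename.bam",
                      stats_output="output.dupsifter.stat"))  # output.dupsifter.stat
```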

3 Help

Program: dupsifter
Version: 1.3.0
Contact: Jacob Morrison <jacob.morrison@vai.org>

dupsifter [options] <ref.fa> [in.bam]

Output options:
    -o, --output STR             name of output file [stdout]
    -O, --stats-output STR       name of file to write statistics to (see Note 3 for details)
Input options:
    -s, --single-end             run for single-end data
    -m, --add-mate-tags          add MC and MQ mate tags to mate reads
    -W, --wgs-only               process WGS reads instead of WGBS
    -l, --max-read-length INT    maximum read length for paired end duplicate-marking [10000]
    -b, --min-base-qual INT      minimum base quality [0]
    -B, --has-barcode            reads in file have barcodes (see Note 4 for details)
    -r, --remove-dups            toggle to remove marked duplicate
    -v, --verbose                print extra messages
    -h, --help                   this help
        --version                print version info and exit

Note 1, [in.bam] must be name sorted. If not provided, assume the input is stdin.
Note 2, assumes either ALL reads are paired-end (default) or single-end.
    If a singleton read is found in paired-end mode, the code will break nicely.
Note 3, defaults to dupsifter.stat if streaming or (-o basename).dupsifter.stat
    if the -o option is provided. If -o and -O are provided, then -O will be used.
Note 4, dupsifter first looks for a barcode in the CB SAM tag, then in the CR SAM tag, then
    tries to parse the read name. If the barcode is in the read name, it must be the last element
    and be separated by a ':' (i.e., @12345:678:9101112:1234_1:N:0:ACGTACGT). Any separators
    found in the barcode (e.g., '+' or '-') are treated as 'N's and the additional parts of the
    barcode are included up to a maximum length of 16 bases/characters. Barcodes can be read from
    single-end or paired-end (pulled from read 1 only) sequencing.
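
The barcode lookup described in Note 4 can be sketched in Python as follows. This is a simplified illustration rather than dupsifter's implementation. The function names are hypothetical, only '+' and '-' are treated as separators, tag values are returned unchanged, and a read name without a ':' is used whole. All of these are assumptions made for the example.

```python
from typing import Mapping

MAX_BARCODE_LENGTH = 16  # maximum barcode length stated in Note 4
SEPARATORS = "+-"        # assumption: only these two separators are handled


def barcode_from_read_name(read_name: str) -> str:
    """Take the last ':'-separated element of the read name as the barcode.

    Separators inside the barcode become 'N', and the result is cut to at
    most MAX_BARCODE_LENGTH characters.
    """
    raw = read_name.rsplit(":", 1)[-1]
    cleaned = "".join("N" if ch in SEPARATORS else ch for ch in raw)
    return cleaned[:MAX_BARCODE_LENGTH]


def pick_barcode(tags: Mapping[str, str], read_name: str) -> str:
    """Look for a barcode in the CB tag, then the CR tag, then the read name."""
    for tag in ("CB", "CR"):
        if tag in tags:
            return tags[tag]  # assumption: tag values are used as-is
    return barcode_from_read_name(read_name)


print(pick_barcode({}, "@12345:678:9101112:1234_1:N:0:ACGTACGT"))
# ACGTACGT
print(pick_barcode({}, "@12345:678:9101112:1234_1:N:0:ACGTACGT+TGCATGCA"))
# ACGTACGTNTGCATGC  (the '+' becomes 'N', then the barcode is cut to 16 characters)
```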

4 Option Descriptions

Short Option  Long Option        Argument Type  Description
-o            --output           string         Name of output file (either .sam or .bam)
-O            --stats-output     string         Name of file to write statistics to (end with .dupsifter.stat)
-s            --single-end       none           Run for single-end data (only do this if you know the data is SE)
-m            --add-mate-tags    none           Add MC (mate CIGAR) and MQ (mate MAPQ) tags to mated reads
-W            --wgs-only         none           Process WGS data instead of WGBS (see Methodology for differences in processing)
-l            --max-read-length  integer        Maximum read length (handles padding for reference genome windows)
-b            --min-base-qual    integer        Minimum base quality (used in determining bisulfite strand if tags not provided)
-B            --has-barcode      none           Use when reads have cell barcodes and you want to mark duplicates accordingly
-r            --remove-dups      none           Remove reads that are flagged as duplicates
-v            --verbose          none           Print extra messages when running
-h            --help             none           Print usage help message and exit
              --version          none           Print dupsifter version and exit