Usage
1 Input / Output
dupsifter
is able to accept input from stdin
or from an input SAM/BAM file. Output can be directed either to stdout
or to an output SAM/BAM. Input and output options can be mixed as needed (i.e., input BAM to streamed output).
The input to dupsifter
is expected to be read-name grouped (i.e., reads with same name next to one another in the BAM). If you supply a position sorted BAM, it will produce an error message along the lines of:
[dupsifter] ERROR: Can't find read 1 and/or read 2 in 1 reads with read ID: <name of read>. Are these reads coordinate sorted?
2 Run Statistics
Each run of dupsifter
calculates statistics related to the number of duplicates and the types of reads processed. By default, the output file is named dupsifter.stat
if the output is streamed or basename.dupsifter.stat
if the output file is defined (-o basename.bam
). The statistics file name can also be defined with the -O
option (i.e., dupsifter -O output.dupsifter.stat ref.fa input.bam
). If using the -O
option, the file should end with .dupsifter.stat
. If both -o
and -O
are provided, then the -O
file name will be used.
3 Help
Program: dupsifter
Version: 1.3.0
Contact: Jacob Morrison <jacob.morrison@vai.org>
dupsifter [options] <ref.fa> [in.bam]
Output options:
-o, --output STR name of output file [stdout]
-O, --stats-output STR name of file to write statistics to (see Note 3 for details)
Input options:
-s, --single-end run for single-end data
-m, --add-mate-tags add MC and MQ mate tags to mate reads
-W, --wgs-only process WGS reads instead of WGBS
-l, --max-read-length INT maximum read length for paired end duplicate-marking [10000]
-b, --min-base-qual INT minimum base quality [0]
-B, --has-barcode reads in file have barcodes (see Note 4 for details)
-r, --remove-dups toggle to remove marked duplicate
-v, --verbose print extra messages
-h, --help this help
--version print version info and exit
Note 1, [in.bam] must be name sorted. If not provided, assume the input is stdin.
Note 2, assumes either ALL reads are paired-end (default) or single-end.
If a singleton read is found in paired-end mode, the code will break nicely.
Note 3, defaults to dupsifter.stat if streaming or (-o basename).dupsifter.stat
if the -o option is provided. If -o and -O are provided, then -O will be used.
Note 4, dupsifter first looks for a barcode in the CB SAM tag, then in the CR SAM tag, then
tries to parse the read name. If the barcode is in the read name, it must be the last element
and be separated by a ':' (i.e., @12345:678:9101112:1234_1:N:0:ACGTACGT). Any separators
found in the barcode (e.g., '+' or '-') are treated as 'N's and the additional parts of the
barcode are included up to a maximum length of 16 bases/characters. Barcodes can be read from single-end or paired-end (pulled from read 1 only) sequencing.
4 Option Descriptions
Short Option | Long Option | Argument Type | Description |
---|---|---|---|
-o |
--output |
string | Name of output file (either .sam or .bam) |
-O |
--stats-output |
string | Name of file to write statistics to (end with .dupsifter.stat ) |
-s |
--single-end |
none | Run for single-end data (only do this if you know the data is SE) |
-m |
--add-mate-tags |
none | Add MC (mate CIGAR) and MQ (mate MAPQ) tags to mated reads |
-W |
--wgs-only |
none | Process WGS data instead of WGBS (see Methodology for differences in processing) |
-l |
--max-read-length |
integer | Maximum read length (handles padding for reference genome windows) |
-b |
--min-base-qual |
integer | Minimum base quality (used in determining bisulfite strand if tags not provided) |
-B |
--has-barcode |
none | Use when reads have cell barcodes and you want to mark duplicates accordingly |
-r |
--remove-dups |
none | Remove reads that are flagged as duplicates |
-v |
--verbose |
none | Print extra messages when running |
-h |
--help |
none | Print usage help message and exit |
--version |
none | Print dupsifter version and exit |