BAM Splitting

The split-bam command splits a coordinate-sorted BAM file into separate BAM files for each cell barcode. This is useful for per-cell analysis workflows that require individual BAM files.

Two-Stage Splitting Strategy

Reads for any one cell barcode are scattered throughout a coordinate-sorted BAM, so splitting runs in two stages to avoid loading the whole file into memory:

Stage 1 (Bucketing): Reads are distributed into hash buckets based on their cell barcode. Each contig is processed independently and can run in parallel.
Stage 2 (Merging): Each bucket is streamed in coordinate order, and reads are written to their final per-cell BAM files. An LRU cache limits the number of simultaneously open output files.
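The bucketing idea in Stage 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not tranquillyzer's implementation: the hash function (CRC32), the helper names `bucket_of` and `bucket_reads`, and the toy `(contig, position, barcode)` records are all hypothetical, and the sketch runs serially rather than per contig in parallel.

```python
import zlib
from collections import defaultdict


def bucket_of(barcode, nbuckets=256):
    """Map a cell barcode to a bucket index with a stable hash.

    CRC32 is an assumption for this sketch. Python's built-in hash() is
    avoided because string hashes are randomized per process.
    """
    return zlib.crc32(barcode.encode("ascii")) % nbuckets


def bucket_reads(reads, nbuckets=256):
    """Distribute (contig, pos, barcode) records into buckets, keeping input order."""
    buckets = defaultdict(list)
    for read in reads:
        barcode = read[2]
        if barcode is None:  # reads without a barcode tag are skipped
            continue
        buckets[bucket_of(barcode, nbuckets)].append(read)
    return buckets


# Toy coordinate-sorted input: barcodes are interleaved along the genome.
reads = [
    ("chr1", 100, "AAAC"),
    ("chr1", 150, "TTTG"),
    ("chr1", 200, None),
    ("chr1", 250, "AAAC"),
    ("chr2", 10, "TTTG"),
    ("chr2", 90, "AAAC"),
]
buckets = bucket_reads(reads, nbuckets=4)
```

Because the bucket index depends only on the barcode, every read for a given cell lands in the same bucket, and appending in input order keeps each bucket coordinate-sorted.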

The result is one coordinate-sorted BAM per cell, maintaining the sort order from the input.
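The LRU cache used in Stage 2 can be pictured with the following sketch. It is a simplified model, not the tool's code: the class name `LRUWriterCache` is hypothetical, plain text files stand in for BAM writers, and reopening an evicted file in append mode is an assumption about how order could be preserved.

```python
import os
import tempfile
from collections import OrderedDict


class LRUWriterCache:
    """Keep at most `max_open` output files open, closing the least recently used."""

    def __init__(self, out_dir, max_open=128):
        self.out_dir = out_dir
        self.max_open = max_open
        self._open = OrderedDict()  # barcode -> file handle, oldest first
        self.evictions = 0

    def write(self, barcode, line):
        handle = self._open.get(barcode)
        if handle is None:
            if len(self._open) >= self.max_open:
                # Evict the least recently used writer to stay under the limit.
                _, oldest = self._open.popitem(last=False)
                oldest.close()
                self.evictions += 1
            # Append mode, so a reopened file continues where it left off.
            path = os.path.join(self.out_dir, f"{barcode}.txt")
            handle = open(path, "a")
            self._open[barcode] = handle
        else:
            self._open.move_to_end(barcode)  # mark as most recently used
        handle.write(line + "\n")

    def close(self):
        for handle in self._open.values():
            handle.close()
        self._open.clear()


# Three barcodes, but only two files may be open at once.
out_dir = tempfile.mkdtemp()
cache = LRUWriterCache(out_dir, max_open=2)
stream = [("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5), ("C", 6)]
for barcode, pos in stream:
    cache.write(barcode, str(pos))
cache.close()

with open(os.path.join(out_dir, "A.txt")) as fh:
    positions_a = [int(x) for x in fh]
```

In this toy stream every write after the second forces an eviction, yet each file still receives its positions in input order. A larger cache trades open file descriptors for fewer close and reopen cycles.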

Usage

tranquillyzer split-bam \
    --bucket-threads 4 \
    --merge-threads 4 \
    INPUT_BAM
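
When the command is launched from a pipeline script, the argument list can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of tranquillyzer. It uses only value-taking options from the table below and assumes `tranquillyzer` is on the `PATH`.

```python
import shlex


def build_split_bam_cmd(input_bam, out_dir=None, tag=None,
                        bucket_threads=None, merge_threads=None):
    """Assemble a `tranquillyzer split-bam` argument list.

    Options left as None are omitted, so the tool's own defaults apply.
    """
    cmd = ["tranquillyzer", "split-bam"]
    options = [
        ("--out-dir", out_dir),
        ("--tag", tag),
        ("--bucket-threads", bucket_threads),
        ("--merge-threads", merge_threads),
    ]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    cmd.append(input_bam)  # positional INPUT_BAM goes last, as in the usage above
    return cmd


cmd = build_split_bam_cmd("sample.bam", bucket_threads=4, merge_threads=4)
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```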

Command Line Options

Option	Default	Description	When to change
`INPUT_BAM`	required	Coordinate-sorted BAM (must be indexed)
`--out-dir`	`<bam_dir>/split_bams`	Output directory for per-cell BAMs
`--tag`	`CB`	BAM tag holding the cell barcode	Change if your BAM uses a different tag
`--bucket-threads`	1	Parallel workers for Stage 1 (bucketing)	Increase for faster bucketing
`--merge-threads`	1	Parallel workers for Stage 2 (merging)	Increase for faster merging (max 8)
`--nbuckets`	256	Number of hash buckets	Increase for very large cell counts
`--max-open-cb-writers`	128	Max simultaneous open output files	Increase if your system allows more open files
`--filter-secondary`	off	Drop secondary alignments	Enable to reduce output size
`--filter-supplementary`	off	Drop supplementary alignments	Enable to reduce output size
`--filter-unmapped`	on	Drop unmapped reads
`--filter-duplicates`	on	Drop PCR/optical duplicates	Disable if you want to keep duplicates
`--min-mapq`	0	Minimum MAPQ for retention	Increase for stricter quality filtering
`--index-outputs`	off	Create .bai index for each per-cell BAM	Enable if downstream tools need indexes
`--prefer-csi-index`	off	Use CSI instead of BAI indexing	Enable for very large reference genomes
`--keep-tmp`	off	Keep temporary bucket files	Enable for debugging
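
The filter options correspond to standard SAM flag bits and the MAPQ field. As a rough sketch of how they could combine, the hypothetical function `keep_read` below applies them with defaults that mirror the table. The flag values come from the SAM specification; the function itself, and the order in which the checks run, are assumptions.

```python
# SAM flag bits, as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800


def keep_read(flag, mapq, *, filter_secondary=False, filter_supplementary=False,
              filter_unmapped=True, filter_duplicates=True, min_mapq=0):
    """Return True if a read passes the filters (defaults mirror the table)."""
    if filter_unmapped and flag & FLAG_UNMAPPED:
        return False
    if filter_duplicates and flag & FLAG_DUPLICATE:
        return False
    if filter_secondary and flag & FLAG_SECONDARY:
        return False
    if filter_supplementary and flag & FLAG_SUPPLEMENTARY:
        return False
    # Reads at or above the MAPQ threshold are retained.
    return mapq >= min_mapq


# With the defaults, a duplicate is dropped but a secondary alignment is kept.
dropped_duplicate = not keep_read(FLAG_DUPLICATE, 60)
kept_secondary = keep_read(FLAG_SECONDARY, 60)
```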

Output

<out_dir>/<cell_barcode>.bam: one coordinate-sorted BAM per cell
<out_dir>/<cell_barcode>.bam.bai: BAM index (if --index-outputs enabled)

Reads that lack the cell barcode tag (set by `--tag`, default `CB`) are skipped.
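
Downstream scripts often need a mapping from barcode to output file. The sketch below relies only on the `<cell_barcode>.bam` naming shown above. The helper `list_cell_bams` and the demo file names are hypothetical, and empty placeholder files stand in for real BAMs.

```python
import tempfile
from pathlib import Path


def list_cell_bams(out_dir):
    """Map each cell barcode to (bam_path, has_bai) for files named <barcode>.bam."""
    result = {}
    for bam in sorted(Path(out_dir).glob("*.bam")):
        bai = bam.with_name(bam.name + ".bai")
        result[bam.stem] = (bam, bai.exists())
    return result


# Demo with empty placeholder files in a temporary directory.
demo = Path(tempfile.mkdtemp())
(demo / "AAAC-1.bam").touch()
(demo / "AAAC-1.bam.bai").touch()
(demo / "TTTG-1.bam").touch()
cells = list_cell_bams(demo)
```

The `*.bam` pattern does not match `.bam.bai` files, so index files never appear as barcodes.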