Quality Control
Generate QC Files for MultiQC
BISCUIT provides two methods for generating QC files that can be picked up by MultiQC. First, biscuit qc
and biscuit qc_coverage
(added in v1.7.0) are two subcommands that create alignment-related, bisulfite strand, and some conversion-related QC metrics (qc
) and coverage statistics (qc_coverage
) of an input BAM. Second, the QC.sh
BASH script runs these two subcommands, while also generating base-averaged conversion metrics (if provided an input VCF from biscuit pileup
).
Examples for biscuit qc
and biscuit qc_coverage
:
biscuit qc [-s] /path/to/my_reference.fa input.bam sample_name
biscuit qc_coverage \
-P output_prefix \
-T top_gc_bins.bed.gz \
-B bottom_gc_bins.bed.gz \
/path/to/my_reference.fa \
cpgs.bed.gz \
input.bam
Additional help can be found by running biscuit qc
or biscuit qc_coverage
on the command line or visiting the qc help page or the qc_coverage help page.
Usage for QC.sh
:
QC.sh \
[--n-threads N] \
[--single-end] \
[--vcf input.vcf] \
[--outdir output_dir] \
[--keep-tmp-files] \
[--no-cov-qc] \
/path/to/reference_assets \
/path/to/my_reference.fa \
sample_name \
input.bam
Additional help for QC.sh
can be found by running QC.sh --help
.
The bash script is available in the scripts/
directory if the source code has been downloaded, otherwise it can be downloaded from the latest release page on GitHub. Additionally, zip files with supplementary QC asset files (/path/to/reference_assets
) on the release page. These asset files are provided for hg19, hg38, and mm10. Asset files for other genome builds will need to be generated by the user. The build_biscuit_QC_assets.pl
perl script can create the files from a provided reference genome, while step-by-step instructions can be found on the FAQ page. Note, QC.sh
and assets in v1.7.0 and later are not compatible with BISCUIT v1.6.1 and earlier.
Filter Reads by Bisulfite Conversion
For some library preparations, incomplete conversions are enriched in a subset of reads which need to be filtered. The bsconv
subcommand transforms an input BAM into one that contains a ZN
tag (like ZN:Z:CA_R0C11,CC_R1C14,CG_R0C2,CT_R1C5
). This tag summarizes counts of retention and conversion for four different cytosine contexts CpA
, CpC
, CpG
and CpT
. The -p
flag outputs the counts in a table, instead of as a tag in the BAM file. Additionally, biscuit bsconv
can be used to filter reads with user-defined levels of cytosine retention in CpH, CpY, CpA, CpC, or CpT contexts.
biscuit bsconv /path/to/my_reference.fa input.bam
For help on available flags, run biscuit bsconv
on the command line or visiting the bsconv help page.
Validate Bisulfite Conversion Label
Sometimes, the bisulfite conversion labels in a given alignment are inaccurate, conflicting, or ambiguous. The bsstrand
subcommand summarizes these labels, given the number of C→T and G→A substitutions. As an option, it can also correct inaccurate labels.
biscuit bsstrand /path/to/my_reference.fa input.bam
returns something similar to:
Mapped reads: 2865
Unmapped reads: 44
Corrected reads: 0 (0.00%)
Strand Distribution:
strand\BS BSW (f) BSC (r)
R1 (f): 67 528
R1 (r): 692 47
R2 (f): 765 92
R2 (r): 107 567
R1 mapped to OT/OB: 114
R1 mapped to CTOT/CTOB: 1220
R2 mapped to OT/OB: 1332
R2 mapped to CTOT/CTOB: 199
Confusion counts:
orig\infer BSW (f) BSC (r) Conflict (c) Unknown (u)
BSW (f): 1483 48 11 89
BSC (r): 27 1084 6 117
Conflict (c): 0 0 0 0
Unknown (u): 0 0 0 0
If unfamiliar with the OT/OB/CTOT/CTOB definitions, see the Glossary
The inferred YD
tag gives the following
- f: OT/CTOT (BSW) strand
- r: OB/CTOB (BSC) strand
- c: conflicting strand information
- u: unintelligible strand source (unknown)
YD
is inferred based on the number of C→T (nCT
) and G→A (nGA
) observations in each read according to the following rules:
- If both
nCT
andnGA
are zero,YD = u
- If
nCT > nGA
and (nGA == 0
ormin(nGA,nCT) / max(nGA,nCT) <= 0.5
), thenYD = f
- If
nCT <= nGA
and (nCT == 0
ormin(nGA,nCT) / max(nGA,nCT) <= 0.5
), thenYD = r
- All other scenarios give
YD = c
.
The flag -y
appends nCT
(YC
tag) and nGA
(YG
tag) in the output BAM file.
For more help with bsstrand
, run biscuit bsstrand
in the terminal or visit the bsstrand help page.
When -b 1
Was Specified in Mapping
-b 1
forces the mapping to be conducted in a stranded manner. Using bsstrand
, the user can validate whether this enforcement was successful. For example,
biscuit bsstrand /path/to/my_reference.fa stranded.bam
returns
Mapped reads: 2918
Unmapped reads: 93
Corrected reads: 0 (0.00%)
Strand Distribution:
strand\BS BSW (f) BSC (r)
R1 (f): 840 0
R1 (r): 0 551
R2 (f): 0 565
R2 (r): 962 0
R1 mapped to OT/OB: 1391
R1 mapped to CTOT/CTOB: 0
R2 mapped to OT/OB: 0
R2 mapped to CTOT/CTOB: 1527
Confusion counts:
orig\infer BSW (f) BSC (r) Conflict (c) Unknown (u)
BSW (f): 715 916 9 162
BSC (r): 499 500 12 105
Conflict (c): 0 0 0 0
Unknown (u): 0 0 0 0
In this case, read 1 always mapped to the OT or OB strand and read 2 always mapped to the CTOT or CTOB strand.
Investigate Cytosine-Read Pair in Tabular Form
The cinread
subcommand can print information about cytosines in different contexts in a tab-separated format. By default the read name, whether the read comes from read 1 or read 2, the bisulfite strand, reference base, and base found in the read are printed. The printed information can be chosen using the -p
flag.
biscuit cinread /path/to/my_reference.fa input.bam
For more information about the available options, run biscuit cinread
in the terminal or visit the cinread help page.