Note: The initial version of BISCUIT that included the bc
subcommand (version 1.3.0) used a different scheme for storing and extracting cell barcodes which means using umi-tools
is incompatible with version 1.3.0. All versions from 1.3.1-dev onward use the same scheme as umi-tools
. This also means that barcodes extracted with version 1.3.0 must be aligned with version 1.3.0 to ensure proper inclusion of barcodes in the SAM entry. While the mechanics of using biscuit bc
described on this page will also work for version 1.3.0, please note there will be some minor differences in the BAM produced (e.g., CR tag used instead of CB) and biscuit align
will not work with umi-tools
-corrected FASTQ files.
Extracting Cell Barcodes from scWGBS Protocols
Cell barcodes are commonly used in scRNA-seq protocols to overcome DNA input issues and increase throughput by merging cDNA from separate wells into a single well. In a similar manner, the single-cell WGBS (scWGBS) protocol, snmc-seq2, uses the same methodology for increasing throughput. While there aren’t many scWGBS protocols that use cell barcoding, it seems likely barcodes (and unique molecular indexes (UMIs) down the road) are used more frequently in protocols.
BISCUIT provides the biscuit bc
subcommand as a way to remove uncorrected cell barcodes from FASTQ files, which can either be written to new FASTQ files or fed directly into biscuit align
. Further, the output from biscuit bc
matches the output from the commonly used tool, umi-tools, which allows for the user to perform barcode correction with umi-tools
and then use the resulting FASTQs as input to biscuit align
.
Barcode Extraction with BISCUIT
As an overview, biscuit bc
takes FASTQs as input and outputs the processed reads to standard out:
# Single-end data
biscuit bc read1.fq.gz
# Paired-end data
biscuit bc read1.fq.gz read2.fq.gz
For paired-end data, the output is printed in an interleaved form (i.e., read 1 followed by read 2 for each read pair). It is important to include both FASTQs as input for paired-end data, as the barcode and an artificial UMI (needed for compatibility with umi-tools
) are written to the read name of both read 1 and read 2.
You can set the geometry of where the cell barcode is with the -m/--mate
, -s/--bc-start
, and -l/--bc-length
options.
biscuit bc -m 1 -s 1 -l 8 read1.fq.gz read2.fq.gz
-m/--mate
specifies the mate the barcode is in (1 or 2), -s/--bc-start
is the start position of the barcode (given as a 1-based number), and l/--bc-length
is the length of the barcode.
Rather than writing the processed reads to standard out, you can instead write them to a FASTQ file with -o/--output-prefix
, which is the basename of your output FASTQ files. For single-end data, .fq.gz
will be appended to your output prefix. Paired-end data will have _R1.fq.gz
or _R2.fq.gz
appended to the prefix for reads 1 and 2, respectively.
biscuit bc -o my_processed_reads read1.fq.gz read2.fq.gz
For more help on available flags, run biscuit bc
in the terminal or visit the bc help page.
Passing Ouput to biscuit align
You can pass the output of biscuit bc
straight to biscuit align
:
# Single-end data
biscuit bc read1.fq.gz | \
biscuit align -9 -@ NTHREADS /path/to/my_reference.fa -
# Paired-end data
biscuit bc read1.fq.gz read2.fq.gz | \
biscuit align -p -9 -@ NTHREADS /path/to/my_reference.fa -
The -9
flag tells biscuit align
to extract barcodes from the read name, while the -p
flag for paired-end data is needed to handle the interleaved input from biscuit bc
.
You can also align FASTQs that were created by biscuit bc
:
# Single-end data
biscuit bc -o reads my_se_data.fq.gz
biscuit align -9 -@ NTHREADS /path/to/my_reference.fa reads.fq.gz
# Paired-end data
biscuit bc -o reads my_pe_data_R1.fq.gz my_pe_data_R2.fq.gz
biscuit align -9 -@ NTHREADS /path/to/my_reference.fa read_R1.fq.gz read_R2.fq.gz
In either case, the cell barcode is placed in the CB
SAM tag (“the optionally corrected cellular barcode sequence”), while the UMI is placed in the RX
SAM tag (the “[s]equence bases from the unique molecular identifier … [where] the value may be non-unique in the file.”)
Barcode Correction with umi-tools
Barcode correction and extraction can be performed using umi-tools
with the help of synthbar (here shown for paired-end sequencing).
Step 1: Add synthetic UMI with synthbar
umi-tools
expects both a cell barcode and a UMI. At this time, there are no scWGBS protocols that include UMIs; therefore, a synthetic barcode must be added to the read with the cell barcode in it. synthbar
is a tool that can add synthetic barcodes to cells, but for our purposes can also be used to add a synthetic UMI:
synthbar -b AAAAAAAA barcoded_reads_R1.fastq.gz | \
gzip > barcoded_with_umi_R1.fastq.gz
This will prepend AAAAAAAA
to the start of each read in the input FASTQ file, which can be treated as the UMI for each read.
Step 2: Create Whitelist
Using umi-tools
requires to steps. The first step is to process your FASTQ with the cell barcodes in it and try to find the most likely true cell barcode:
umi_tools whitelist \
--log2stderr \
--bc-pattern=NNNNNNNNCCCCCCC \
--stdin barcoded_with_umi_R1.fastq.gz \
> whitelist.txt
Here, NNNNNNNNCCCCCCC
specifies a an 8 basepair UMI followed by a 7 basepair cell barcode.
Step 3: Correct and Extract Barcodes
The results from whitelist.txt
are then fed into the code that extracts the UMI and cell barcode into the readname:
umi_tools extract \
--bc-pattern=NNNNNNNNCCCCCCC \
--whitelist=whitelist.txt \
--error-correct-cell \
--stdin barcoded_with_umi_R1.fastq.gz \
--stdout barcoded_with_umi_R1_corrected.fastq.gz \
--read2-in barcoded_reads_R2.fastq.gz \
--read2-out barcoded_reads_R2_corrected.fastq.gz
Passing to biscuit align
The resulting FASTQs can then be passed into biscuit align
in a similar manner to extracting barcodes with biscuit bc
:
biscuit align -9 -@ NTHREADS /path/to/my_reference.fa \
barcoded_with_umi_R1_corrected.fastq.gz \
barcoded_reads_R2_corrected.fastq.gz