The Old BISCUIT Epiread Format
BISCUIT previously used an epiread format that extends the original epiread format and stays closer to it than the epiBED format does. This format can still be produced by running:
biscuit epiread \
-@ NTHREADS \
-o my_output.epiread \
-O \
[-B snps.bed] \
/path/to/my_reference.fa \
my_output.bam
An example of the epiread format is:
chr19 read_456 1 + 3040315 CCCCTCCC . .
chr19 read_789 1 + 3078472 CC 3078510 T
where the columns are
- Chromosome name
- Read name
- Read number within the pair in paired-end sequencing (1 or 2)
- Bisulfite strand (OT/CTOT (+) or OB/CTOB (-))
- Position of the cytosine in the first CpG (0-based)
- Retention pattern (“C” for retention or “T” for conversion) for all CpGs covered
- Position of the first SNP, if a SNP location file is provided (“.” if no SNPs)
- Base call of all SNPs covered (“.” if no SNPs)
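As a rough illustration of how these columns can be consumed, the sketch below tallies retained and converted CpGs per read from column 6. It assumes tab-separated fields and that only "C" and "T" need counting. The file name example.epiread and the function name count_cpg_states are made up for this example and are not part of BISCUIT.

```shell
# Two records copied from the example above, written with explicit tabs.
printf 'chr19\tread_456\t1\t+\t3040315\tCCCCTCCC\t.\t.\n'  > example.epiread
printf 'chr19\tread_789\t1\t+\t3078472\tCC\t3078510\tT\n' >> example.epiread

# Hypothetical helper: print read name, first CpG position,
# number of retained CpGs, and number of converted CpGs.
count_cpg_states() {
    awk -F'\t' 'BEGIN{ OFS="\t" }
    {
        pattern   = $6
        # gsub() returns the number of substitutions, so replacing a
        # letter with itself counts it without changing the pattern.
        retained  = gsub(/C/, "C", pattern)
        converted = gsub(/T/, "T", pattern)
        print $2, $5, retained, converted
    }' "$1"
}

count_cpg_states example.epiread
```

For the two example records this reports 7 retained and 1 converted CpG for read_456, and 2 retained and 0 converted for read_789.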
The original epiread format can be retrieved by running cut -f 1,5,6 on the output epiread file.
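To see which fields survive, the first example record can be piped through the same cut call. This is only a sketch on a single hand-typed line, assuming tab-separated input, which is what cut -f splits on by default.

```shell
# Keep chromosome (1), first CpG position (5), and retention pattern (6).
printf 'chr19\tread_456\t1\t+\t3040315\tCCCCTCCC\t.\t.\n' | cut -f 1,5,6
```

The result is a three-column line holding chr19, 3040315, and CCCCTCCC.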
Single Fragment Epireads
Because both mates of a read pair come from the same DNA molecule, their DNA methylation information can be treated as a physically phased molecular event, so a single fragment representation of the epiread data is often useful. By default, the epiread subcommand reports each read separately. The read name and the read's position in the pair, both present in this single-end output, can be used to collate the mates of a read pair. The following awk command produces a compact single fragment epiread file:
sort -k2,2 -k3,3n single_end.epiread |
awk 'BEGIN{ qname="" ; rec="" }
qname == $2 { print rec"\t"$5"\t"$6"\t"$7"\t"$8 ; qname="" }
qname != $2 { qname=$2 ; rec=$1"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8 ; pair=$3}'
The columns of the output, shown after this list, are:
- Chromosome name
- Bisulfite strand (OT/CTOT (+) or OB/CTOB (-))
- Location of the cytosine in the first CpG covered (0-based) in read 1
- Retention pattern of all CpGs covered in read 1
- Location of the first SNP covered in read 1
- Base call of all SNPs covered in read 1
- Location of the cytosine in the first CpG covered (0-based) in read 2
- Retention pattern of all CpGs covered in read 2
- Location of the first SNP covered in read 2
- Base call of all SNPs covered in read 2
chr19 - 3083513 CCCCCCC 3083495 ATT 3083513 CCCCCCC 3083495 ATT
chr19 - 3083545 CCTCCCCCCT . . 3083527 CCCCTCCCCCC 3083523 AG
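The collation can be tried on a small hand-made input. The sketch below wraps the awk command from above in a function (collate_mates, a name invented here) and feeds it three records: a complete pair whose values mirror the second output row above, and one read with no mate. The read names frag_A and frag_B and the file name are hypothetical.

```shell
# Toy single-end epiread input: frag_A has both mates, frag_B only read 1.
{
    printf 'chr19\tfrag_A\t1\t-\t3083545\tCCTCCCCCCT\t.\t.\n'
    printf 'chr19\tfrag_A\t2\t-\t3083527\tCCCCTCCCCCC\t3083523\tAG\n'
    printf 'chr19\tfrag_B\t1\t+\t3040315\tCCCCTCCC\t.\t.\n'
} > toy_single_end.epiread

# Same sort and awk logic as the command above, taking the file as $1.
collate_mates() {
    sort -k2,2 -k3,3n "$1" |
    awk 'BEGIN{ qname="" ; rec="" }
    qname == $2 { print rec"\t"$5"\t"$6"\t"$7"\t"$8 ; qname="" }
    qname != $2 { qname=$2 ; rec=$1"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8 ; pair=$3}'
}

collate_mates toy_single_end.epiread
```

Only frag_A appears in the output. A record is printed only when a second line with the same read name follows, so frag_B, which has no mate in the file, is dropped without a message.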
NOMe-seq Mode in the Old BISCUIT Epiread Format
As with the epiBED format, epiread can generate NOMe-seq mode epireads in the old BISCUIT epiread format using the -N option. An example of producing single fragment NOMe-seq epireads is:
# Create SNP BED file
biscuit vcf2bed -t snp my_pileup.vcf.gz > snps.bed
# Generate individual read NOMe-seq epiread file
biscuit epiread -B snps.bed -O -N /path/to/my_reference.fa my_output.bam |
gzip -c > single_end.epiread.gz
# Collating paired epireads
zcat single_end.epiread.gz |
sort -k1,1 -k2,2 -k3,3n |
awk 'BEGIN{ qname="" ; rec="" }
qname == $2 { print rec"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10 ; qname="" }
qname != $2 { qname=$2 ; rec=$1"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10 ; pair=$3}' |
sort -k1,1 -k3,3n |
gzip -c > paired.epiread.gz
The paired output looks like:
chr18 + 10689 CCC 10663 TTCT 10696 T 10689 CCC 10663 TTCT 10696 T
chr18 + 10689 CCC 10694 TT 10696 T 10689 CCC 10694 TT 10696 T
The output columns are:
- Chromosome name
- Bisulfite strand (OT/CTOT (+) or OB/CTOB (-))
- Location of the cytosine in the first CpG covered (0-based) in read 1
- Retention pattern of all CpGs covered in read 1
- Location of the cytosine in the first GpCpH covered (0-based) in read 1
- Retention pattern of all GpCpHs covered in read 1
- Location of the first SNP covered in read 1
- Base call of all SNPs covered in read 1
- Location of the cytosine in the first CpG covered (0-based) in read 2
- Retention pattern of all CpGs covered in read 2
- Location of the cytosine in the first GpCpH covered (0-based) in read 2
- Retention pattern of all GpCpHs covered in read 2
- Location of the first SNP covered in read 2
- Base call of all SNPs covered in read 2
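One way to use this 14-column layout is to summarize the GpCpH retention pattern of each mate. The sketch below is an illustration only: it assumes tab-separated paired output as produced above, counts only "C" and "T" calls, and reports each mate separately because positions covered by both mates would otherwise be counted twice. The names toy_paired.epiread and summarize_gpch are invented for the example.

```shell
# The two paired NOMe-seq rows shown above, written with explicit tabs.
{
    printf 'chr18\t+\t10689\tCCC\t10663\tTTCT\t10696\tT\t10689\tCCC\t10663\tTTCT\t10696\tT\n'
    printf 'chr18\t+\t10689\tCCC\t10694\tTT\t10696\tT\t10689\tCCC\t10694\tTT\t10696\tT\n'
} > toy_paired.epiread

# Hypothetical helper: for each fragment print the chromosome, then for
# read 1 and read 2 the first GpCpH position and "retained/total" calls.
summarize_gpch() {
    awk -F'\t' 'BEGIN{ OFS="\t" }
    # s is passed by value, so the field itself is not modified.
    function tally(s,    c, t) {
        c = gsub(/C/, "C", s)
        t = gsub(/T/, "T", s)
        return c "/" (c + t)
    }
    { print $1, $5, tally($6), $11, tally($12) }' "$1"
}

summarize_gpch toy_paired.epiread
```

For the first row both mates report 1/4 retained GpCpHs starting at 10663, and for the second row both report 0/2 starting at 10694.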
The Pairwise Epiread Format
This format is required when running biscuit asm and can be generated by running:
biscuit epiread -@ NTHREADS -P -B snps.bed /path/to/my_reference.fa my_output.bam | \
sort -k1,1 -k2,2n -k3,3n > my_output.pairwise.epiread
Generate Rectangular Forms
With the creation of the epiBED format and readEpibed in biscuiteer, we suggest avoiding the old epiread format and biscuit rectangle for creating a matrix in which each column represents a CpG and each row represents a read. However, if you still want to use rectangle to generate this matrix, run:
biscuit rectangle /path/to/my_reference.fa my_output.epiread
Note that my_output.epiread must contain only a single chromosome for rectangle to work.
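If an epiread file covers several chromosomes, one possible way to meet this requirement is to split it by its first column before calling rectangle. The sketch below is not a BISCUIT feature: split_by_chrom, the by_chrom directory, and the toy file are names invented here, and read_123 on chr18 is a made-up record.

```shell
# Toy epiread file spanning two chromosomes.
{
    printf 'chr18\tread_123\t1\t+\t10689\tCCC\t.\t.\n'
    printf 'chr19\tread_456\t1\t+\t3040315\tCCCCTCCC\t.\t.\n'
    printf 'chr19\tread_789\t1\t+\t3078472\tCC\t3078510\tT\n'
} > toy_mixed.epiread

# Hypothetical helper: write one <chromosome>.epiread file per chromosome.
# $1: input epiread file, $2: output directory (created if missing).
split_by_chrom() {
    mkdir -p "$2"
    awk -F'\t' -v dir="$2" '{ out = dir "/" $1 ".epiread"; print > out }' "$1"
}

split_by_chrom toy_mixed.epiread by_chrom
ls by_chrom

# Each file can then be passed on its own, for example:
# biscuit rectangle /path/to/my_reference.fa by_chrom/chr19.epiread
```

Each resulting file holds the records of exactly one chromosome, in their original order.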
For more details on the available flags, run biscuit rectangle in the terminal or visit the rectangle help page.