The Old BISCUIT Epiread Format

Earlier versions of BISCUIT used an epiread format that extends the original epiread format while staying closer to it than the epiBED format does. This format can still be produced by running:

biscuit epiread \
    -@ NTHREADS \
    -o my_output.epiread \
    -O \
    [-B snps.bed] \
    /path/to/my_reference.fa \
    my_output.bam

An example of the epiread format is:

chr19    read_456    1    +    3040315    CCCCTCCC    .    .
chr19    read_789    1    +    3078472    CC    3078510    T

where the columns are:

  1. Chromosome name
  2. Read name
  3. Read position in paired-end sequencing (1 for read 1, 2 for read 2)
  4. Bisulfite strand (OT/CTOT (+) or OB/CTOB (-))
  5. Position of the cytosine in the first CpG (0-based)
  6. Retention pattern (“C” for retention or “T” for conversion) for all CpGs covered
  7. Position of the first SNP, if a SNP location file is provided (“.” if no SNPs)
  8. Base call of all SNPs covered (“.” if no SNPs)

The original epiread format can be retrieved by running cut -f 1,5,6 on the output epiread file.
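As a quick illustration of working with these columns, the sketch below reduces the two example records above to the original three-column format and then counts the retained CpGs in each read from column 6. The file name example.epiread and the variable names are placeholders chosen for this example, not part of BISCUIT.

```shell
# The two example records from above, written to a placeholder file (tab-delimited)
printf 'chr19\tread_456\t1\t+\t3040315\tCCCCTCCC\t.\t.\n'  > example.epiread
printf 'chr19\tread_789\t1\t+\t3078472\tCC\t3078510\tT\n' >> example.epiread

# Original epiread columns: chromosome, first CpG position, retention pattern
original=$(cut -f 1,5,6 example.epiread)
echo "$original"

# Per read: read name, retained CpGs, covered CpGs, retained fraction
summary=$(awk 'BEGIN{ FS=OFS="\t" }
     { pattern=$6
       retained=gsub(/C/, "C", pattern)   # gsub returns the number of "C" matches
       print $2, retained, length($6), retained/length($6) }' example.epiread)
echo "$summary"
```

For read_456, the pattern CCCCTCCC covers eight CpGs, seven of which are retained.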

Single Fragment Epireads

Because both mate reads come from the same DNA molecule, their DNA methylation information can be considered a physically phased molecular event, so it can be useful to look at a single-fragment representation of the epiread data. By default, the epiread subcommand reports each read individually. However, the read name and read position in the single-end epiread output can be used to collate the mate reads of a read pair. The following awk command produces a compact single-fragment epiread file:

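sort -k2,2 -k3,3n single_end.epiread |
awk 'BEGIN{ qname="" ; rec="" }
     # Second record with the same read name: print read 1 fields followed by read 2 fields
     qname == $2 { print rec"\t"$5"\t"$6"\t"$7"\t"$8 ; qname="" ; next }
     # Otherwise, hold this record as read 1 of a potential pair
     { qname=$2 ; rec=$1"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8 }'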

In the following output, the columns represent:

  1. Chromosome name
  2. Bisulfite strand (OT/CTOT (+) or OB/CTOB (-))
  3. Location of the cytosine in the first CpG covered (0-based) in read 1
  4. Retention pattern of all CpGs covered in read 1
  5. Location of the first SNP covered in read 1
  6. Base call of all SNPs covered in read 1
  7. Location of the cytosine in the first CpG covered (0-based) in read 2
  8. Retention pattern of all CpGs covered in read 2
  9. Location of the first SNP covered in read 2
  10. Base call of all SNPs covered in read 2

An example of the collated output is:

chr19    -    3083513    CCCCCCC    3083495    ATT    3083513    CCCCCCC    3083495    ATT
chr19    -    3083545    CCTCCCCCCT    .    .    3083527    CCCCTCCCCCC    3083523    AG
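One consequence of the collation logic is that a read whose mate is absent from the file never gets printed. The sketch below runs the collation on a small made-up input (the read names frag_A and frag_B, the frag_B record, and the file name toy_single_end.epiread are all hypothetical) and first counts the read names that do not occur exactly twice, which may be a useful check before collating.

```shell
# Hypothetical single-end epiread records (tab-delimited): one complete pair
# (frag_A) and one read whose mate is missing (frag_B)
printf 'chr19\tfrag_A\t2\t-\t3083527\tCCCCTCCCCCC\t3083523\tAG\n' >  toy_single_end.epiread
printf 'chr19\tfrag_A\t1\t-\t3083545\tCCTCCCCCCT\t.\t.\n'         >> toy_single_end.epiread
printf 'chr19\tfrag_B\t1\t-\t3090000\tCTC\t.\t.\n'                >> toy_single_end.epiread

# Count read names that do not appear exactly twice; these will not be collated
unpaired=$(cut -f 2 toy_single_end.epiread | sort | uniq -c |
           awk '$1 != 2 { n++ } END { print n+0 }')
echo "read names without exactly two records: $unpaired"

# Collate mates: sort by read name, then by read position within the pair
collated=$(sort -k2,2 -k3,3n toy_single_end.epiread |
awk 'BEGIN{ qname="" ; rec="" }
     # Same read name as the held record: emit read 1 fields, then read 2 fields
     qname == $2 { print rec"\t"$5"\t"$6"\t"$7"\t"$8 ; qname="" ; next }
     # Otherwise, hold this record as read 1 of a potential pair
     { qname=$2 ; rec=$1"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8 }')
echo "$collated"
```

Only frag_A appears in the collated output; frag_B is dropped because no second record shares its read name.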

NOMe-seq Mode in the Old BISCUIT Epiread Format

As with the epiBED format, epiread can generate NOMe-seq mode epireads in the old BISCUIT epiread format using the -N option. An example of producing single fragment NOMe-seq epireads is:

# Create SNP BED file
biscuit vcf2bed -t snp my_pileup.vcf.gz > snps.bed

# Generate individual read NOMe-seq epiread file
biscuit epiread -B snps.bed -O -N /path/to/my_reference.fa my_output.bam |
    gzip -c > single_end.epiread.gz

# Collating paired epireads
zcat single_end.epiread.gz |
sort -k1,1 -k2,2 -k3,3n |
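awk 'BEGIN{ qname="" ; rec="" }
     # Second record with the same read name: print read 1 fields followed by read 2 fields
     qname == $2 { print rec"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10 ; qname="" ; next }
     # Otherwise, hold this record as read 1 of a potential pair
     { qname=$2 ; rec=$1"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10 }' |
sort -k1,1 -k3,3n |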
gzip -c > paired.epiread.gz

The paired output looks like:

chr18    +    10689    CCC    10663    TTCT    10696    T    10689    CCC    10663    TTCT    10696    T
chr18    +    10689    CCC    10694    TT    10696    T    10689    CCC    10694    TT    10696    T

The output columns are:

  1. Chromosome name
  2. Bisulfite strand (OT/CTOT (+) or OB/CTOB (-))
  3. Location of the cytosine in the first CpG covered (0-based) in read 1
  4. Retention pattern of all CpGs covered in read 1
  5. Location of the cytosine in the first GpCpH covered (0-based) in read 1
  6. Retention pattern of all GpCpHs covered in read 1
  7. Location of the first SNP covered in read 1
  8. Base call of all SNPs covered in read 1
  9. Location of the cytosine in the first CpG covered (0-based) in read 2
  10. Retention pattern of all CpGs covered in read 2
  11. Location of the cytosine in the first GpCpH covered (0-based) in read 2
  12. Retention pattern of all GpCpHs covered in read 2
  13. Location of the first SNP covered in read 2
  14. Base call of all SNPs covered in read 2
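To make the column layout concrete, the sketch below reads the two example records above, checks that each has 14 columns, and reports the retained fraction of CpGs (columns 4 and 10) and GpCpHs (columns 6 and 12) for each mate. The file name example_paired.epiread and the helper function frac are hypothetical, and treating "." as an empty retention pattern is an assumption made here by analogy with the SNP columns.

```shell
# The two example paired records from above, written to a placeholder file
printf 'chr18\t+\t10689\tCCC\t10663\tTTCT\t10696\tT\t10689\tCCC\t10663\tTTCT\t10696\tT\n' >  example_paired.epiread
printf 'chr18\t+\t10689\tCCC\t10694\tTT\t10696\tT\t10689\tCCC\t10694\tTT\t10696\tT\n'     >> example_paired.epiread

nome_summary=$(awk '
     # Fraction of "C" (retained) calls in a retention pattern.
     # Assumption: "." or an empty string means no sites were covered.
     function frac(p,    n) {
         if (p == "" || p == ".") return "."
         n = gsub(/C/, "C", p)
         return n / length(p)
     }
     BEGIN { FS = OFS = "\t" }
     # Skip and report records that do not have the 14 columns listed above
     NF != 14 { print "line " NR ": expected 14 columns, found " NF > "/dev/stderr" ; next }
     # Chromosome, read 1 CpG position, then CpG and GpCpH fractions for read 1 and read 2
     { print $1, $3, frac($4), frac($6), frac($10), frac($12) }' example_paired.epiread)
echo "$nome_summary"
```

The fractions are reported per mate rather than per fragment because the two mates can cover the same positions, and summing their counts would count those positions twice.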

The Pairwise Epiread Format

This format is necessary when running biscuit asm and can be generated by running:

biscuit epiread -@ NTHREADS -P -B snps.bed /path/to/my_reference.fa my_output.bam | \
sort -k1,1 -k2,2n -k3,3n > my_output.pairwise.epiread

Generate Rectangular Forms

With the creation of the epiBED format and readEpibed in biscuiteer, we suggest avoiding the old epiread format and biscuit rectangle for creating a matrix in which each column represents a CpG and each row represents a read. However, if you still want to generate this matrix with rectangle, run:

biscuit rectangle /path/to/my_reference.fa my_output.epiread

Note that my_output.epiread must contain reads from only a single chromosome for rectangle to work.
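If an epiread file covers several chromosomes, one way to meet this requirement is to split it into one file per chromosome first. The sketch below is one possible approach, not a BISCUIT feature: the input name multi_chrom.epiread, the split.<chromosome>.epiread naming scheme, and the made-up chr18 record are assumptions for illustration. The rectangle commands are printed rather than executed.

```shell
# Placeholder input with records from two chromosomes (tab-delimited);
# the chr18 record is made up for illustration
printf 'chr19\tread_456\t1\t+\t3040315\tCCCCTCCC\t.\t.\n'  > multi_chrom.epiread
printf 'chr18\tread_123\t1\t+\t10689\tCCC\t.\t.\n'        >> multi_chrom.epiread
printf 'chr19\tread_789\t1\t+\t3078472\tCC\t3078510\tT\n' >> multi_chrom.epiread

# Sort so each chromosome is contiguous, then write one file per chromosome.
# Closing the previous file keeps the number of open files small.
sort -k1,1 -k5,5n multi_chrom.epiread |
awk -F'\t' '$1 != prev { if (out != "") close(out)
                         prev = $1 ; out = "split." $1 ".epiread" }
            { print > out }'

# Print the rectangle command for each per-chromosome file
# (remove the echo to run the commands)
for f in split.*.epiread; do
    echo "biscuit rectangle /path/to/my_reference.fa $f"
done
```

Chromosome names containing characters that are awkward in file names would need to be sanitized before being used this way.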

For more details on the available flags, run biscuit rectangle in the terminal or visit the rectangle help page.