Setup
Input BED files
100 human cell WGBS data from the snmc-seq2 dataset:
bedfiles <- list.files(
"../data/snmcseq2",
pattern = "*.bed.gz$",
full.names = TRUE
)[1:100]
Regions
Since iscream is a region-based querying tool, we need to load some regions. Here we’ll be using 5000 gene body regions as the input:
library(data.table)
regions <- fread(
"../data/genes.bed",
col.names = c("chr", "start", "end", "gene")
)[1:5000]
head(regions)
#> chr start end gene
#> <char> <int> <int> <char>
#> 1: chr1 1471764 1497848 ATAD3B
#> 2: chr1 3069167 3438621 PRDM16
#> 3: chr1 2403963 2413797 PEX10
#> 4: chr1 10472287 10630758 PEX14
#> 5: chr1 2425979 2505532 PLCH2
#> 6: chr1 9292893 9369532 SPSB1
Running iscream
Make tabix queries
The tabix()
function can be used to query regions from
BED files much like the tabix shell command. It returns the
queried lines from the BED files, parsed into columns, as a
data.table
. tabix()
is a generic BED files
query function and is not restricted to methylation BED files. It has
built-in support for BISCUIT, Bismark, and BSbolt BED file column names,
set with the aligner
argument, but can take other column
names with the colnames
argument.
If multiple input files are provided, they are queried in parallel.
If raw = TRUE
, tabix()
will return the same
data as Rsamtools::scanTabix()
does - a named list of
strings.
system.time(tbx_query <- tabix(bedfiles, regions))
#> user system elapsed
#> 35.363 4.769 10.853
tbx_query
#> chr start end beta coverage sample
#> <char> <int> <int> <num> <int> <char>
#> 1: chr1 923949 923950 0.000 1 bisc_SRR6911624
#> 2: chr1 923953 923954 0.000 1 bisc_SRR6911624
#> 3: chr1 923959 923960 0.000 1 bisc_SRR6911624
#> 4: chr1 923971 923972 0.000 1 bisc_SRR6911624
#> 5: chr1 923973 923974 0.000 1 bisc_SRR6911624
#> ---
#> 45729764: chr4 190179369 190179370 0.000 2 bisc_SRR6911723
#> 45729765: chr4 190179686 190179687 1.000 1 bisc_SRR6911723
#> 45729766: chr4 190179687 190179688 0.500 2 bisc_SRR6911723
#> 45729767: chr4 190179753 190179754 1.000 1 bisc_SRR6911723
#> 45729768: chr4 190179754 190179755 0.333 3 bisc_SRR6911723
Get summary data
To get a summary of the methylation information of the gene bodies
use summarize_regions
:
system.time(summary_query <- summarize_regions(bedfiles, regions))
#> [12:13:24.310639] [iscream::summarize_regions] [info] Summarizing 5000 regions from 100 bedfiles
#> [12:13:24.310748] [iscream::summarize_regions] [info] using sum, mean, median, stddev, variance, min, max, range, cpg_count
#> user system elapsed
#> 27.350 0.609 5.563
head(summary_query)
#> Feature Sample coverage.sum M.sum coverage.mean
#> 1 chr1:1471764-1497848 bisc_SRR6911624 109 95 1.101010
#> 2 chr1:3069167-3438621 bisc_SRR6911624 1470 789 1.074561
#> 3 chr1:2403963-2413797 bisc_SRR6911624 62 15 1.000000
#> 4 chr1:10472287-10630758 bisc_SRR6911624 258 200 1.011765
#> 5 chr1:2425979-2505532 bisc_SRR6911624 272 195 1.066667
#> 6 chr1:9292893-9369532 bisc_SRR6911624 74 64 1.000000
#> M.mean coverage.median M.median coverage.stddev M.stddev
#> 1 0.9595960 1 1 0.3028757 0.4932026
#> 2 0.5767544 1 1 0.2892802 0.5888135
#> 3 0.2419355 1 0 0.0000000 0.4317514
#> 4 0.7843137 1 1 0.1080374 0.4307897
#> 5 0.7647059 1 1 0.2796698 0.5169655
#> 6 0.8648649 1 1 0.0000000 0.3442015
#> coverage.variance M.variance coverage.min M.min coverage.max M.max
#> 1 0.09173366 0.2432488 1 0 2 2
#> 2 0.08368306 0.3467014 1 0 3 3
#> 3 0.00000000 0.1864093 1 0 1 1
#> 4 0.01167207 0.1855797 1 0 2 2
#> 5 0.07821522 0.2672534 1 0 3 3
#> 6 0.00000000 0.1184746 1 0 1 1
#> coverage.range M.range cpg_count
#> 1 1 2 99
#> 2 2 3 1368
#> 3 0 1 62
#> 4 1 2 255
#> 5 2 3 255
#> 6 0 1 74
Build matrices
The query_all()
function queries and stores every CpG in
the input regions. Unlike summarize_regions()
the output
matrix dimensions are unknown at runtime. Although usually quite fast,
if the CpG count is very large and there are few overlaps in the CpGs
between files, this can take a long time. Here, gene bodies are large
and the final matrix can contain millions of CpGs. Further, with
single-cell data, chances are new CpGs are found in every file.
Preallocating the number of rows, however, can drastically reduce runtime. Since we got 45 million CpGs from all the BED files, we can approximately expect between 5 and 10 million unique CpGs as single-cell data has lower coverage than bulk. We already have the tabix query so we can get the unique CpG count here and use it to preallocate the matrix, reducing the number of matrix resizes. Making tabix queries can be a relatively quick way to approximate the CpG count of a dataset. If you haven’t done a tabix query of the full dataset, you can approximate how many CpGs to expect based on CpG counts in one file and the coverage of your WGBS method.
cpg.count <- tbx_query$start |> unique() |> length()
system.time(meth_mat <- query_all(
bedfiles,
regions,
sparse = TRUE,
prealloc = cpg.count
))
#> [12:13:31.961976] [iscream::query_all] [info] Querying 5000 regions from 100 bedfiles
#>
#> [12:14:22.904618] [iscream::query_all] [info] Creating metadata vectors
#> [12:14:23.381735] [iscream::query_all] [info] nrows 7281989 - 5882 extra rows allocated with 3 resizes
#> [12:14:27.962580] [iscream::query_all] [info] Creating sparse matrix
#> user system elapsed
#> 208.880 3.086 56.854
str(meth_mat)
#> List of 5
#> $ M :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#> .. ..@ i : int [1:32627134] 9 11 12 13 14 15 16 17 18 19 ...
#> .. ..@ p : int [1:101] 0 157256 362760 626888 904511 1158005 1437240 1729046 2014168 2231683 ...
#> .. ..@ Dim : int [1:2] 7276107 100
#> .. ..@ Dimnames:List of 2
#> .. .. ..$ : NULL
#> .. .. ..$ : chr [1:100] "bisc_SRR6911624" "bisc_SRR6911625" "bisc_SRR6911626" "bisc_SRR6911627" ...
#> .. ..@ x : num [1:32627134] 1 1 1 1 1 1 1 1 1 1 ...
#> .. ..@ factors : list()
#> $ Cov :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#> .. ..@ i : int [1:43227794] 0 1 2 3 4 5 6 7 8 9 ...
#> .. ..@ p : int [1:101] 0 217582 488919 850016 1220035 1545163 1920047 2322291 2713873 3020522 ...
#> .. ..@ Dim : int [1:2] 7276107 100
#> .. ..@ Dimnames:List of 2
#> .. .. ..$ : NULL
#> .. .. ..$ : chr [1:100] "bisc_SRR6911624" "bisc_SRR6911625" "bisc_SRR6911626" "bisc_SRR6911627" ...
#> .. ..@ x : num [1:43227794] 1 1 1 1 1 1 1 1 1 1 ...
#> .. ..@ factors : list()
#> $ pos : int [1:7276107] 1472309 1472386 1472390 1472394 1472407 1472414 1472439 1472489 1472503 1473922 ...
#> $ chr : chr [1:7276107] "chr1" "chr1" "chr1" "chr1" ...
#> $ sampleNames: chr [1:100] "bisc_SRR6911624" "bisc_SRR6911625" "bisc_SRR6911626" "bisc_SRR6911627" ...
The output of query_all()
is a named list containing
matrices of coverage values and M values and vectors of the sample
names, chromosome names and positions of the loci.
This list can be used to produce a BSseq object. However since BSseq cannot work with sparse matrices, the two matrices would need to be converted to dense matrices first.
bs <- do.call(BSseq, meth_mat)
Session info
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: AlmaLinux 9.5 (Teal Serval)
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.utf8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.utf8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.utf8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.16.4 iscream_0.0.0.9000
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.4.2 parallelly_1.42.0 Matrix_1.7-2
#> [4] parallel_4.4.2 tools_4.4.2 Rcpp_1.0.14
#> [7] grid_4.4.2 knitr_1.49 xfun_0.50
#> [10] RcppParallel_5.1.10 stringfish_0.16.0 lattice_0.22-6
#> [13] evaluate_1.0.3