Abstract
A guide to improving iscream’s runtime and memory efficiency
Setup
Input BED files
Running this vignette requires downloading 2GB of single-cell whole genome bisulfite sequencing (WGBS) BED files and tabix indices from this Zenodo record: https://zenodo.org/records/14733834.
snmc_zip_path <- tempfile("snmcseq2")
snmc_dir <- tempdir()
download.file(
"https://zenodo.org/records/14733834/files/sc_beds.zip",
destfile = snmc_zip_path
)
unzip(snmc_zip_path, exdir = snmcseq2_dir)
genes_file <- tempfile("genes")
download.file(
"https://zenodo.org/records/14733834/files/genes.bed",
destfile = genes_file
)
Select 100 human cell WGBS data from the snmC-seq2 (Luo et al. 2018) dataset:
bedfiles <- list.files(
snmc_dir,
pattern = "*.bed.gz$",
full.names = TRUE
)[1:100]
Regions
Here we’ll be using 5000 gene body regions as the input:
library(data.table)
regions <- fread(
genes_file,
col.names = c("chr", "start", "end", "gene")
)[1:5000]
head(regions)
#> chr start end gene
#> <char> <int> <int> <char>
#> 1: chr1 1471764 1497848 ATAD3B
#> 2: chr1 3069167 3438621 PRDM16
#> 3: chr1 2403963 2413797 PEX10
#> 4: chr1 10472287 10630758 PEX14
#> 5: chr1 2425979 2505532 PLCH2
#> 6: chr1 9292893 9369532 SPSB1
Multithreading
iscream uses multithreading to reduce querying runtime. The number of
threads can be set before or after loading the library. iscream will
read the option while loading and so can be set in
.Rprofile
options("iscream.threads" = 8)
library(iscream)
#> iscream using 8 threads based on 'options(iscream.threads)' but parallelly::availableCores() detects 16 possibly available threads. See `?set_threads` for information on multithreading before trying to use more.
Queries are parallelized across the input BED files. While this
reduces runtime, it increases memory usage as data from multiple files
are loaded into memory at the same time. For more information on
multithreading see ?set_threads()
.
tabix()
tabix()
queries regions from BED files much like the
tabix shell command, but supports multiple files and can query
them in parallel. Although fast, tabix()
may seem
unresponsive on large queries as parallel::mclapply()
doesn’t show progression, but a progress bar is in development.
iscream has two tabix()
implementations, one using the
command line tabix
executable and the other using the
htslib API. The command line tool is faster as it can stream to a file
which can be read from R while the htslib API stores the strings in
memory and is slower. iscream will look for the tabix executable on
startup and will only fall back to the htslip API if the executable is
not found. See ?tabix
details for more information.
qt <- system.time(
tbx_query <- tabix(bedfiles, regions, col.names = c("beta", "coverage"))
)
tbx_query
#> chr start end beta coverage file
#> <char> <int> <int> <num> <int> <char>
#> 1: chr1 923949 923950 0.000 1 bisc_SRR6911624
#> 2: chr1 923953 923954 0.000 1 bisc_SRR6911624
#> 3: chr1 923959 923960 0.000 1 bisc_SRR6911624
#> 4: chr1 923971 923972 0.000 1 bisc_SRR6911624
#> 5: chr1 923973 923974 0.000 1 bisc_SRR6911624
#> ---
#> 45733375: chr4 190179369 190179370 0.000 2 bisc_SRR6911723
#> 45733376: chr4 190179686 190179687 1.000 1 bisc_SRR6911723
#> 45733377: chr4 190179687 190179688 0.500 2 bisc_SRR6911723
#> 45733378: chr4 190179753 190179754 1.000 1 bisc_SRR6911723
#> 45733379: chr4 190179754 190179755 0.333 3 bisc_SRR6911723
On 8 threads, retrieving 45,733,379 records across 100 files took 11.814 seconds.
summarize_regions
To get a summary of the information of the gene bodies use
summarize_regions
, providing the gene name column as the
feature column:
qt <- system.time(
summary_query <- summarize_regions(
bedfiles,
regions,
columns = 4,
col_names = "beta",
feature_col = "gene"
)
)
#> [10:51:04.214132] [iscream::summarize_regions] [info] Summarizing 5000 regions from 100 bedfiles
#> [10:51:04.214257] [iscream::summarize_regions] [info] using sum, mean, median, stddev, variance, min, max, range, count
#> [10:51:04.214269] [iscream::summarize_regions] [info] with columns 4 as beta
head(summary_query)
#> feature file beta.sum beta.mean beta.median beta.stddev
#> 1 ATAD3B bisc_SRR6911624 85.000 0.8585859 1 0.3502215
#> 2 PRDM16 bisc_SRR6911624 723.500 0.5288743 1 0.4976973
#> 3 PEX10 bisc_SRR6911624 15.000 0.2419355 0 0.4317514
#> 4 PEX14 bisc_SRR6911624 198.000 0.7764706 1 0.4174294
#> 5 PLCH2 bisc_SRR6911624 184.333 0.7228745 1 0.4474832
#> 6 SPSB1 bisc_SRR6911624 64.000 0.8648649 1 0.3442015
#> beta.variance beta.min beta.max beta.range count
#> 1 0.1226551 0 1 1 99
#> 2 0.2477026 0 1 1 1368
#> 3 0.1864093 0 1 1 62
#> 4 0.1742473 0 1 1 255
#> 5 0.2002412 0 1 1 255
#> 6 0.1184746 0 1 1 74
On 8 threads, summarizing the 45,733,379 records across all 100 files
took 4.948 seconds. As with tabix()
, runtime reduces and
memory usage increases with increasing thread counts.
make_mat
The make_mat()
function queries and stores every genomic
location within the input regions across input files. Unlike
summarize_regions()
the output matrix dimensions are
unknown at runtime. Although usually quite fast, if the number of
records falling into the input regions are large and there are few
overlaps files, this can take a long time. Here, gene bodies are large
and the final matrix can contain millions of CpG loci. Further, with
sparse data, chances are new loci are found in every file.
Preallocating the number of rows, however, can drastically reduce runtime. Since we got 45 million loci/CpGs from all the BED files with the tabix query above, we can expect approximately between 5 and 10 million unique loci as single-cell WGBS data has lower coverage than bulk. We already have the tabix query so we can get the unique CpG count here and use it to preallocate the matrix, reducing the number of matrix resizes. We’ll add 100,000 extra rows on top of the existing count to be safe since every avoided resize cuts off at least a couple seconds from the runtime. Making tabix queries can be a relatively quick way to approximate the CpG count of a dataset. If you haven’t done a tabix query of the full dataset, you can approximate how many CpGs to expect based on CpG counts in one file and the coverage of your WGBS method. Here we make a matrix of the beta-values in the 4th column:
cpg.count <- tbx_query$start |> unique() |> length()
qt <- system.time(meth_mat <- make_mat_se(
bedfiles,
regions,
column = 4,
sparse = TRUE,
prealloc = cpg.count + 1e5
))
#> [10:51:11.215841] [iscream::query_all] [info] Querying 5000 regions from 100 bedfiles
#>
#> [10:51:51.998902] [iscream::query_all] [info] Creating metadata vectors
#> [10:51:52.457885] [iscream::query_all] [info] 7276107 loci found - 16250 extra rows allocated with 0 resizes
#> [10:51:57.284487] [iscream::query_all] [info] Creating sparse matrix
meth_mat
#> class: RangedSummarizedExperiment
#> dim: 7276107 100
#> metadata(0):
#> assays(1): value
#> rownames: NULL
#> rowData names(0):
#> colnames(100): bisc_SRR6911624 bisc_SRR6911625 ... bisc_SRR6911722
#> bisc_SRR6911723
#> colData names(0):
Making this 7,276,107 x 100 matrix took 50.429 seconds.
Session info
sessionInfo()
#> R version 4.5.0 (2025-04-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: AlmaLinux 9.6 (Sage Margay)
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.utf8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.utf8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.utf8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] iscream_0.99.0 data.table_1.17.6
#>
#> loaded via a namespace (and not attached):
#> [1] crayon_1.5.3 httr_1.4.7
#> [3] knitr_1.50 xfun_0.52
#> [5] UCSC.utils_1.4.0 generics_0.1.4
#> [7] DelayedArray_0.34.1 jsonlite_2.0.0
#> [9] RcppParallel_5.1.10 SummarizedExperiment_1.38.1
#> [11] S4Vectors_0.46.0 stringfish_0.16.0
#> [13] stats4_4.5.0 MatrixGenerics_1.20.0
#> [15] Biobase_2.68.0 grid_4.5.0
#> [17] abind_1.4-8 evaluate_1.0.4
#> [19] IRanges_2.42.0 GenomeInfoDb_1.44.0
#> [21] compiler_4.5.0 Rcpp_1.1.0
#> [23] XVector_0.48.0 lattice_0.22-7
#> [25] R6_2.6.1 SparseArray_1.8.0
#> [27] parallelly_1.45.0 parallel_4.5.0
#> [29] GenomeInfoDbData_1.2.14 GenomicRanges_1.60.0
#> [31] Matrix_1.7-3 tools_4.5.0
#> [33] matrixStats_1.5.0 S4Arrays_1.8.1
#> [35] BiocGenerics_0.54.0