Skip to contents

Setup

Loading iscream

The number of threads can be set before or after loading the library:

options("iscream.threads" = 8)
library(iscream)
#> iscream using 8 threads based on 'options(iscream.threads)' 
#> but parallelly::availableCores() detects 16 possibly available threads.
#> See `?set_threads` for information on multithreading before trying to use more.

Input BED files

Running this vignette requires downloading 2GB of BED files and tabix indices from this Zenodo record: https://zenodo.org/records/14733834.

snmc_zip_path <- tempfile("snmcseq2")
snmc_dir <- tempdir()
download.file(
    "https://zenodo.org/records/14733834/files/sc_beds.zip",
    destfile = snmc_zip_path
)
unzip(snmc_zip_path, exdir = snmcseq2_dir)

genes_file <- tempfile("genes")
download.file(
    "https://zenodo.org/records/14733834/files/genes.bed",
    destfile = genes_file
)

100 human cell WGBS data from the snmc-seq2 dataset:

bedfiles <- list.files(
  snmc_dir,
  pattern = "*.bed.gz$",
  full.names = TRUE
)[1:100]

Regions

Since iscream is a region-based querying tool, we need to load some regions. Here we’ll be using 5000 gene body regions as the input:

library(data.table)
regions <- fread(
  genes_file,
  col.names = c("chr", "start", "end", "gene")
)[1:5000]
head(regions)
#>       chr    start      end   gene
#>    <char>    <int>    <int> <char>
#> 1:   chr1  1471764  1497848 ATAD3B
#> 2:   chr1  3069167  3438621 PRDM16
#> 3:   chr1  2403963  2413797  PEX10
#> 4:   chr1 10472287 10630758  PEX14
#> 5:   chr1  2425979  2505532  PLCH2
#> 6:   chr1  9292893  9369532  SPSB1

Running iscream

Make tabix queries

The tabix() function can be used to query regions from BED files much like the tabix shell command. It returns the queried lines from the BED files, parsed into columns, as a data.table. tabix() is a generic BED files query function and is not restricted to methylation BED files. It has built-in support for BISCUIT, Bismark, and BSBolt BED file column names, set with the aligner argument, but can take other column names with the colnames argument.

If multiple input files are provided, they are queried in parallel. If raw = TRUE, tabix() will return the same data as Rsamtools::scanTabix() does - a named list of strings.

system.time(tbx_query <- tabix(bedfiles, regions))
#>    user  system elapsed 
#>  32.184   5.368  11.000
tbx_query
#>              chr     start       end  beta coverage          sample
#>           <char>     <int>     <int> <num>    <int>          <char>
#>        1:   chr1    923949    923950 0.000        1 bisc_SRR6911624
#>        2:   chr1    923953    923954 0.000        1 bisc_SRR6911624
#>        3:   chr1    923959    923960 0.000        1 bisc_SRR6911624
#>        4:   chr1    923971    923972 0.000        1 bisc_SRR6911624
#>        5:   chr1    923973    923974 0.000        1 bisc_SRR6911624
#>       ---                                                          
#> 45733375:   chr4 190179369 190179370 0.000        2 bisc_SRR6911723
#> 45733376:   chr4 190179686 190179687 1.000        1 bisc_SRR6911723
#> 45733377:   chr4 190179687 190179688 0.500        2 bisc_SRR6911723
#> 45733378:   chr4 190179753 190179754 1.000        1 bisc_SRR6911723
#> 45733379:   chr4 190179754 190179755 0.333        3 bisc_SRR6911723

Get summary data

To get a summary of the methylation information of the gene bodies use summarize_regions, providing the gene name column as the feature column:

system.time(summary_query <- summarize_meth_regions(
  bedfiles,
  regions,
  feature_col = "gene")
)
#> [11:20:22.814673] [iscream::summarize_regions] [info] Summarizing 5000 regions from 100 bedfiles
#> [11:20:22.814996] [iscream::summarize_regions] [info] using sum, mean, median, stddev, variance, min, max, range, count
#> [11:20:22.815005] [iscream::summarize_regions] [info] with columns 4, 5 as coverage, M
#>    user  system elapsed 
#>  24.616   0.618   4.913
head(summary_query)
#>   Feature          Sample coverage.sum M.sum coverage.mean    M.mean
#> 1  ATAD3B bisc_SRR6911624          109    95      1.101010 0.9595960
#> 2  PRDM16 bisc_SRR6911624         1470   789      1.074561 0.5767544
#> 3   PEX10 bisc_SRR6911624           62    15      1.000000 0.2419355
#> 4   PEX14 bisc_SRR6911624          258   200      1.011765 0.7843137
#> 5   PLCH2 bisc_SRR6911624          272   195      1.066667 0.7647059
#> 6   SPSB1 bisc_SRR6911624           74    64      1.000000 0.8648649
#>   coverage.median M.median coverage.stddev  M.stddev coverage.variance
#> 1               1        1       0.3028757 0.4932026        0.09173366
#> 2               1        1       0.2892802 0.5888135        0.08368306
#> 3               1        0       0.0000000 0.4317514        0.00000000
#> 4               1        1       0.1080374 0.4307897        0.01167207
#> 5               1        1       0.2796698 0.5169655        0.07821522
#> 6               1        1       0.0000000 0.3442015        0.00000000
#>   M.variance coverage.min M.min coverage.max M.max coverage.range M.range
#> 1  0.2432488            1     0            2     2              1       2
#> 2  0.3467014            1     0            3     3              2       3
#> 3  0.1864093            1     0            1     1              0       1
#> 4  0.1855797            1     0            2     2              1       2
#> 5  0.2672534            1     0            3     3              2       3
#> 6  0.1184746            1     0            1     1              0       1
#>   cpg_count
#> 1        99
#> 2      1368
#> 3        62
#> 4       255
#> 5       255
#> 6        74

Alternatively to get just the beta column use the general summarize_regions() function, specifying the columns to summarize:

system.time(summary_query <- summarize_regions(
  bedfiles,
  regions,
  columns = 4,
  col_names = "beta",
  feature_col = "gene")
)

Build matrices

The make_bsseq_mat() function queries and stores every CpG in the input regions. Unlike summarize_regions() the output matrix dimensions are unknown at runtime. Although usually quite fast, if the CpG count is very large and there are few overlaps in the CpGs between files, this can take a long time. Here, gene bodies are large and the final matrix can contain millions of CpGs. Further, with single-cell data, chances are new CpGs are found in every file.

Preallocating the number of rows, however, can drastically reduce runtime. Since we got 45 million CpGs from all the BED files with the tabix query above, we can expect approximately between 5 and 10 million unique CpGs as single-cell data has lower coverage than bulk. We already have the tabix query so we can get the unique CpG count here and use it to preallocate the matrix, reducing the number of matrix resizes. We’ll add 100,000 extra on top of the existing count to be safe since every avoided resize cuts off at least a couple seconds from the runtime. Making tabix queries can be a relatively quick way to approximate the CpG count of a dataset. If you haven’t done a tabix query of the full dataset, you can approximate how many CpGs to expect based on CpG counts in one file and the coverage of your WGBS method.

cpg.count <- tbx_query$start |> unique() |> length()
system.time(meth_mat <- make_bsseq_mat(
  bedfiles,
  regions,
  sparse = TRUE,
  prealloc = cpg.count + 1e5
))
#> [11:20:29.883091] [iscream::query_all] [info] Querying 5000 regions from 100 bedfiles
#> 
#> [11:21:13.888292] [iscream::query_all] [info] Creating metadata vectors
#> [11:21:14.365486] [iscream::query_all] [info] 7276107 loci found - 16250 extra rows allocated with 0 resizes
#> [11:21:22.518141] [iscream::query_all] [info] Creating sparse matrix
#>    user  system elapsed 
#> 275.560   2.978  53.696
str(meth_mat)
#> List of 5
#>  $ M          :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>   .. ..@ i       : int [1:32627134] 9 11 12 13 14 15 16 17 18 19 ...
#>   .. ..@ p       : int [1:101] 0 157256 362760 626888 904511 1158005 1437240 1729046 2014168 2231683 ...
#>   .. ..@ Dim     : int [1:2] 7276107 100
#>   .. ..@ Dimnames:List of 2
#>   .. .. ..$ : NULL
#>   .. .. ..$ : chr [1:100] "bisc_SRR6911624" "bisc_SRR6911625" "bisc_SRR6911626" "bisc_SRR6911627" ...
#>   .. ..@ x       : num [1:32627134] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..@ factors : list()
#>  $ Cov        :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>   .. ..@ i       : int [1:43227794] 0 1 2 3 4 5 6 7 8 9 ...
#>   .. ..@ p       : int [1:101] 0 217582 488919 850016 1220035 1545163 1920047 2322291 2713873 3020522 ...
#>   .. ..@ Dim     : int [1:2] 7276107 100
#>   .. ..@ Dimnames:List of 2
#>   .. .. ..$ : NULL
#>   .. .. ..$ : chr [1:100] "bisc_SRR6911624" "bisc_SRR6911625" "bisc_SRR6911626" "bisc_SRR6911627" ...
#>   .. ..@ x       : num [1:43227794] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..@ factors : list()
#>  $ pos        : int [1:7276107] 1472309 1472386 1472390 1472394 1472407 1472414 1472439 1472489 1472503 1473922 ...
#>  $ chr        : chr [1:7276107] "chr1" "chr1" "chr1" "chr1" ...
#>  $ sampleNames: chr [1:100] "bisc_SRR6911624" "bisc_SRR6911625" "bisc_SRR6911626" "bisc_SRR6911627" ...

The output of make_bsseq_mat() is a named list containing matrices of coverage values and M values and vectors of the sample names, chromosome names and positions of the loci.

This list can be used to produce a BSseq object. However since BSseq cannot work with sparse matrices, the two matrices would need to be converted to dense matrices first.

bs <- do.call(BSseq, meth_mat)

If you want just one column from the BED file as a matrix instead of the two for BSseq, use make_mat(). For example, to get a matrix of beta values (column 4 in the BISCUIT BED file) run

make_mat(bedfiles, regions, col = 4, mat_name = "beta")`

Session info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: AlmaLinux 9.5 (Teal Serval)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.utf8        LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.utf8      LC_MONETARY=C.UTF-8    LC_MESSAGES=C.utf8    
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.17.0  iscream_0.0.0.9000
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_4.4.2      parallelly_1.43.0   Matrix_1.7-3       
#>  [4] parallel_4.4.2      tools_4.4.2         Rcpp_1.0.14        
#>  [7] grid_4.4.2          knitr_1.50          xfun_0.51          
#> [10] RcppParallel_5.1.10 stringfish_0.16.0   lattice_0.22-6     
#> [13] evaluate_1.0.3