Skip to contents

Setup

Loading iscream

The number of threads can be set before or after loading the library:

options("iscream.threads" = 8)
library(iscream)
#> iscream using 8 threads of detected 8 threads. See `?set_threads` for information on multithreading.

Input BED files

100 human cell WGBS data from the snmc-seq2 dataset:

bedfiles <- list.files(
  "../data/snmcseq2",
  pattern = "*.bed.gz$",
  full.names = TRUE
)[1:100]

Regions

Since iscream is a region-based querying tool, we need to load some regions. Here we’ll be using 5000 gene body regions as the input:

library(data.table)
regions <- fread(
  "../data/genes.bed",
  col.names = c("chr", "start", "end", "gene")
)[1:5000]
head(regions)
#>       chr    start      end   gene
#>    <char>    <int>    <int> <char>
#> 1:   chr1  1471764  1497848 ATAD3B
#> 2:   chr1  3069167  3438621 PRDM16
#> 3:   chr1  2403963  2413797  PEX10
#> 4:   chr1 10472287 10630758  PEX14
#> 5:   chr1  2425979  2505532  PLCH2
#> 6:   chr1  9292893  9369532  SPSB1

Running iscream

Make tabix queries

The tabix() function can be used to query regions from BED files much like the tabix shell command. It returns the queried lines from the BED files, parsed into columns, as a data.table. tabix() is a generic BED files query function and is not restricted to methylation BED files. It has built-in support for BISCUIT, Bismark, and BSbolt BED file column names, set with the aligner argument, but can take other column names with the colnames argument.

If multiple input files are provided, they are queried in parallel. If raw = TRUE, tabix() will return the same data as Rsamtools::scanTabix() does - a named list of strings.

system.time(tbx_query <- tabix(bedfiles, regions))
#>    user  system elapsed 
#>  35.363   4.769  10.853
tbx_query
#>              chr     start       end  beta coverage          sample
#>           <char>     <int>     <int> <num>    <int>          <char>
#>        1:   chr1    923949    923950 0.000        1 bisc_SRR6911624
#>        2:   chr1    923953    923954 0.000        1 bisc_SRR6911624
#>        3:   chr1    923959    923960 0.000        1 bisc_SRR6911624
#>        4:   chr1    923971    923972 0.000        1 bisc_SRR6911624
#>        5:   chr1    923973    923974 0.000        1 bisc_SRR6911624
#>       ---                                                          
#> 45729764:   chr4 190179369 190179370 0.000        2 bisc_SRR6911723
#> 45729765:   chr4 190179686 190179687 1.000        1 bisc_SRR6911723
#> 45729766:   chr4 190179687 190179688 0.500        2 bisc_SRR6911723
#> 45729767:   chr4 190179753 190179754 1.000        1 bisc_SRR6911723
#> 45729768:   chr4 190179754 190179755 0.333        3 bisc_SRR6911723

Get summary data

To get a summary of the methylation information of the gene bodies use summarize_regions:

system.time(summary_query <- summarize_regions(bedfiles, regions))
#> [12:13:24.310639] [iscream::summarize_regions] [info] Summarizing 5000 regions from 100 bedfiles
#> [12:13:24.310748] [iscream::summarize_regions] [info] using sum, mean, median, stddev, variance, min, max, range, cpg_count
#>    user  system elapsed 
#>  27.350   0.609   5.563
head(summary_query)
#>                  Feature          Sample coverage.sum M.sum coverage.mean
#> 1   chr1:1471764-1497848 bisc_SRR6911624          109    95      1.101010
#> 2   chr1:3069167-3438621 bisc_SRR6911624         1470   789      1.074561
#> 3   chr1:2403963-2413797 bisc_SRR6911624           62    15      1.000000
#> 4 chr1:10472287-10630758 bisc_SRR6911624          258   200      1.011765
#> 5   chr1:2425979-2505532 bisc_SRR6911624          272   195      1.066667
#> 6   chr1:9292893-9369532 bisc_SRR6911624           74    64      1.000000
#>      M.mean coverage.median M.median coverage.stddev  M.stddev
#> 1 0.9595960               1        1       0.3028757 0.4932026
#> 2 0.5767544               1        1       0.2892802 0.5888135
#> 3 0.2419355               1        0       0.0000000 0.4317514
#> 4 0.7843137               1        1       0.1080374 0.4307897
#> 5 0.7647059               1        1       0.2796698 0.5169655
#> 6 0.8648649               1        1       0.0000000 0.3442015
#>   coverage.variance M.variance coverage.min M.min coverage.max M.max
#> 1        0.09173366  0.2432488            1     0            2     2
#> 2        0.08368306  0.3467014            1     0            3     3
#> 3        0.00000000  0.1864093            1     0            1     1
#> 4        0.01167207  0.1855797            1     0            2     2
#> 5        0.07821522  0.2672534            1     0            3     3
#> 6        0.00000000  0.1184746            1     0            1     1
#>   coverage.range M.range cpg_count
#> 1              1       2        99
#> 2              2       3      1368
#> 3              0       1        62
#> 4              1       2       255
#> 5              2       3       255
#> 6              0       1        74

Build matrices

The query_all() function queries and stores every CpG in the input regions. Unlike summarize_regions() the output matrix dimensions are unknown at runtime. Although usually quite fast, if the CpG count is very large and there are few overlaps in the CpGs between files, this can take a long time. Here, gene bodies are large and the final matrix can contain millions of CpGs. Further, with single-cell data, chances are new CpGs are found in every file.

Preallocating the number of rows, however, can drastically reduce runtime. Since we got 45 million CpGs from all the BED files, we can approximately expect between 5 and 10 million unique CpGs as single-cell data has lower coverage than bulk. We already have the tabix query so we can get the unique CpG count here and use it to preallocate the matrix, reducing the number of matrix resizes. Making tabix queries can be a relatively quick way to approximate the CpG count of a dataset. If you haven’t done a tabix query of the full dataset, you can approximate how many CpGs to expect based on CpG counts in one file and the coverage of your WGBS method.

cpg.count <- tbx_query$start |> unique() |> length()
system.time(meth_mat <- query_all(
  bedfiles,
  regions,
  sparse = TRUE,
  prealloc = cpg.count
))
#> [12:13:31.961976] [iscream::query_all] [info] Querying 5000 regions from 100 bedfiles
#> 
#> [12:14:22.904618] [iscream::query_all] [info] Creating metadata vectors
#> [12:14:23.381735] [iscream::query_all] [info] nrows 7281989 - 5882 extra rows allocated with 3 resizes
#> [12:14:27.962580] [iscream::query_all] [info] Creating sparse matrix
#>    user  system elapsed 
#> 208.880   3.086  56.854
str(meth_mat)
#> List of 5
#>  $ M          :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>   .. ..@ i       : int [1:32627134] 9 11 12 13 14 15 16 17 18 19 ...
#>   .. ..@ p       : int [1:101] 0 157256 362760 626888 904511 1158005 1437240 1729046 2014168 2231683 ...
#>   .. ..@ Dim     : int [1:2] 7276107 100
#>   .. ..@ Dimnames:List of 2
#>   .. .. ..$ : NULL
#>   .. .. ..$ : chr [1:100] "bisc_SRR6911624" "bisc_SRR6911625" "bisc_SRR6911626" "bisc_SRR6911627" ...
#>   .. ..@ x       : num [1:32627134] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..@ factors : list()
#>  $ Cov        :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>   .. ..@ i       : int [1:43227794] 0 1 2 3 4 5 6 7 8 9 ...
#>   .. ..@ p       : int [1:101] 0 217582 488919 850016 1220035 1545163 1920047 2322291 2713873 3020522 ...
#>   .. ..@ Dim     : int [1:2] 7276107 100
#>   .. ..@ Dimnames:List of 2
#>   .. .. ..$ : NULL
#>   .. .. ..$ : chr [1:100] "bisc_SRR6911624" "bisc_SRR6911625" "bisc_SRR6911626" "bisc_SRR6911627" ...
#>   .. ..@ x       : num [1:43227794] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..@ factors : list()
#>  $ pos        : int [1:7276107] 1472309 1472386 1472390 1472394 1472407 1472414 1472439 1472489 1472503 1473922 ...
#>  $ chr        : chr [1:7276107] "chr1" "chr1" "chr1" "chr1" ...
#>  $ sampleNames: chr [1:100] "bisc_SRR6911624" "bisc_SRR6911625" "bisc_SRR6911626" "bisc_SRR6911627" ...

The output of query_all() is a named list containing matrices of coverage values and M values and vectors of the sample names, chromosome names and positions of the loci.

This list can be used to produce a BSseq object. However since BSseq cannot work with sparse matrices, the two matrices would need to be converted to dense matrices first.

bs <- do.call(BSseq, meth_mat)

Session info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: AlmaLinux 9.5 (Teal Serval)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.utf8        LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.utf8      LC_MONETARY=C.UTF-8    LC_MESSAGES=C.utf8    
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.16.4  iscream_0.0.0.9000
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_4.4.2      parallelly_1.42.0   Matrix_1.7-2       
#>  [4] parallel_4.4.2      tools_4.4.2         Rcpp_1.0.14        
#>  [7] grid_4.4.2          knitr_1.49          xfun_0.50          
#> [10] RcppParallel_5.1.10 stringfish_0.16.0   lattice_0.22-6     
#> [13] evaluate_1.0.3