
Getting other data structures from iscream
Source:vignettes/data_structures.Rmd
data_structures.Rmd
The examples here show how iscream output can be converted into other data structures for further analysis.
## iscream using 1 thread by default but parallelly::availableCores() detects 4 possibly available threads. See `?set_threads` for information on multithreading before trying to use more.
## 'tabix' executable not found in $PATH. tabix() will use htslib to make queries instead which can be slower. See ?tabix for details.
data_dir <- system.file("extdata", package = "iscream")
bedfiles <- list.files(data_dir, pattern = "[a|b|c|d].bed.gz$", full.names = TRUE)
regions <- c(A = "chr1:1-6", B = "chr1:7-10", C = "chr1:11-14")
GRanges
GRanges can be used as the input to all of iscream’s querying
functions. The output of tabix()
and
make_mat()
, make_bsseq_mat()
can also be
turned into GRanges.
From tabix
queries
## GRanges object with 7 ranges and 2 metadata columns:
## seqnames ranges strand | V1 V2
## <Rle> <IRanges> <Rle> | <numeric> <integer>
## [1] chr1 0-2 * | 1.0 1
## [2] chr1 2-4 * | 1.0 1
## [3] chr1 4-6 * | 0.0 2
## [4] chr1 6-8 * | 0.0 1
## [5] chr1 8-10 * | 0.5 2
## [6] chr1 10-12 * | 1.0 2
## [7] chr1 12-14 * | 1.0 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
If the input is already a GRanges
object,
tabix()
will also return a GRanges
object
along with any attached metadata. tabix()
uses the input
findOverlaps()
to join input metadata with the queried
data.
gr <- GRanges(regions)
values(gr) <- DataFrame(
gene = c("gene1", "gene2", "gene3"),
some_metadata = c("s1", "s2", "s3")
)
gr
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | gene some_metadata
## <Rle> <IRanges> <Rle> | <character> <character>
## A chr1 1-6 * | gene1 s1
## B chr1 7-10 * | gene2 s2
## C chr1 11-14 * | gene3 s3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
tabix(bedfiles, gr)
## GRanges object with 20 ranges and 5 metadata columns:
## seqnames ranges strand | V1 V2 file gene
## <Rle> <IRanges> <Rle> | <numeric> <integer> <character> <character>
## [1] chr1 1-2 * | 1.0 1 a gene1
## [2] chr1 3-4 * | 1.0 1 a gene1
## [3] chr1 5-6 * | 0.0 2 a gene1
## [4] chr1 7-8 * | 0.0 1 a gene2
## [5] chr1 9-10 * | 0.5 2 a gene2
## ... ... ... ... . ... ... ... ...
## [16] chr1 1-2 * | 1.0 1 d gene1
## [17] chr1 3-4 * | 1.0 2 d gene1
## [18] chr1 7-8 * | 0.0 1 d gene2
## [19] chr1 9-10 * | 0.5 2 d gene2
## [20] chr1 13-14 * | 1.0 1 d gene3
## some_metadata
## <character>
## [1] s1
## [2] s1
## [3] s1
## [4] s2
## [5] s2
## ... ...
## [16] s1
## [17] s1
## [18] s2
## [19] s2
## [20] s3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
If the input BED file is not zero-based (e.g. Bismark coverage
files), set zero_based = FALSE
in the tabix()
call to get the correct conversion from data frame to GenomicRanges.
From summarize_regions
With a named set of regions, you’ll need to
set_region_rownames
as the input to create the GRanges
object:
(summary <- summarize_meth_regions(
bedfiles,
regions,
set_region_rownames = TRUE,
fun = c("sum", "mean"))
)
## [02:07:11.393664] [iscream::summarize_regions] [info] Summarizing 3 regions from 4 bedfiles
## [02:07:11.393712] [iscream::summarize_regions] [info] using sum, mean
## [02:07:11.393717] [iscream::summarize_regions] [info] with columns 4, 5 as coverage, M
## feature file coverage.sum M.sum coverage.mean M.mean
## chr1:1-6 A a 4 2 1.333333 0.6666667
## chr1:7-10 B a 3 1 1.500000 0.5000000
## chr1:11-14 C a 5 5 2.500000 2.5000000
## chr1:1-6 A b 4 2 2.000000 1.0000000
## chr1:7-10 B b 1 1 1.000000 1.0000000
## chr1:11-14 C b 3 1 1.500000 0.5000000
## chr1:1-6 A c 2 2 2.000000 2.0000000
## chr1:7-10 B c 3 1 1.500000 0.5000000
## chr1:11-14 C c NA NA NA NA
## chr1:1-6 A d 3 3 1.500000 1.5000000
## chr1:7-10 B d 3 1 1.500000 0.5000000
## chr1:11-14 C d 1 1 1.000000 1.0000000
## GRanges object with 12 ranges and 6 metadata columns:
## seqnames ranges strand | summary.feature summary.file
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 1-6 * | A a
## [2] chr1 7-10 * | B a
## [3] chr1 11-14 * | C a
## [4] chr1 1-6 * | A b
## [5] chr1 7-10 * | B b
## ... ... ... ... . ... ...
## [8] chr1 7-10 * | B c
## [9] chr1 11-14 * | C c
## [10] chr1 1-6 * | A d
## [11] chr1 7-10 * | B d
## [12] chr1 11-14 * | C d
## summary.coverage.sum summary.M.sum summary.coverage.mean summary.M.mean
## <numeric> <numeric> <numeric> <numeric>
## [1] 4 2 1.33333 0.666667
## [2] 3 1 1.50000 0.500000
## [3] 5 5 2.50000 2.500000
## [4] 4 2 2.00000 1.000000
## [5] 1 1 1.00000 1.000000
## ... ... ... ... ...
## [8] 3 1 1.5 0.5
## [9] NA NA NA NA
## [10] 3 3 1.5 1.5
## [11] 3 1 1.5 0.5
## [12] 1 1 1.0 1.0
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
If the regions vector is not named, pass the feature
column as the GRanges input regions:
(summary <- summarize_meth_regions(
bedfiles,
unname(regions),
fun = c("sum", "mean"))
)
## [02:07:11.509947] [iscream::summarize_regions] [info] Summarizing 3 regions from 4 bedfiles
## [02:07:11.509971] [iscream::summarize_regions] [info] using sum, mean
## [02:07:11.509975] [iscream::summarize_regions] [info] with columns 4, 5 as coverage, M
## feature file coverage.sum M.sum coverage.mean M.mean
## 1 chr1:1-6 a 4 2 1.333333 0.6666667
## 2 chr1:7-10 a 3 1 1.500000 0.5000000
## 3 chr1:11-14 a 5 5 2.500000 2.5000000
## 4 chr1:1-6 b 4 2 2.000000 1.0000000
## 5 chr1:7-10 b 1 1 1.000000 1.0000000
## 6 chr1:11-14 b 3 1 1.500000 0.5000000
## 7 chr1:1-6 c 2 2 2.000000 2.0000000
## 8 chr1:7-10 c 3 1 1.500000 0.5000000
## 9 chr1:11-14 c NA NA NA NA
## 10 chr1:1-6 d 3 3 1.500000 1.5000000
## 11 chr1:7-10 d 3 1 1.500000 0.5000000
## 12 chr1:11-14 d 1 1 1.000000 1.0000000
GRanges(summary$feature, summary = summary[, -1])
## GRanges object with 12 ranges and 5 metadata columns:
## seqnames ranges strand | summary.file summary.coverage.sum
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 1-6 * | a 4
## [2] chr1 7-10 * | a 3
## [3] chr1 11-14 * | a 5
## [4] chr1 1-6 * | b 4
## [5] chr1 7-10 * | b 1
## ... ... ... ... . ... ...
## [8] chr1 7-10 * | c 3
## [9] chr1 11-14 * | c NA
## [10] chr1 1-6 * | d 3
## [11] chr1 7-10 * | d 3
## [12] chr1 11-14 * | d 1
## summary.M.sum summary.coverage.mean summary.M.mean
## <numeric> <numeric> <numeric>
## [1] 2 1.33333 0.666667
## [2] 1 1.50000 0.500000
## [3] 5 2.50000 2.500000
## [4] 2 2.00000 1.000000
## [5] 1 1.00000 1.000000
## ... ... ... ...
## [8] 1 1.5 0.5
## [9] NA NA NA
## [10] 3 1.5 1.5
## [11] 1 1.5 0.5
## [12] 1 1.0 1.0
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Making SummarizedExperiment objects
## Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
## 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
From make_mat()
or make_bsseq_mat()
matrices
(mats <- make_bsseq_mat(bedfiles, regions))
## [02:07:14.507880] [iscream::query_all] [info] Querying 3 regions from 4 bedfiles
##
## [02:07:14.508242] [iscream::query_all] [info] Creating metadata vectors
## [02:07:14.508282] [iscream::query_all] [info] 7 loci found - 9993 extra rows allocated with 0 resizes
## [02:07:14.508285] [iscream::query_all] [info] Creating dense matrix
## $M
## a b c d
## [1,] 1 0 0 1
## [2,] 1 0 2 2
## [3,] 0 2 0 0
## [4,] 0 1 0 0
## [5,] 1 0 1 1
## [6,] 2 0 0 0
## [7,] 3 1 0 1
##
## $Cov
## a b c d
## [1,] 1 2 0 1
## [2,] 1 0 2 2
## [3,] 2 2 0 0
## [4,] 1 1 2 1
## [5,] 2 0 1 2
## [6,] 2 2 0 0
## [7,] 3 1 0 1
##
## $pos
## [1] 0 2 4 6 8 10 12
##
## $chr
## [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
##
## $sampleNames
## [1] "a" "b" "c" "d"
(mats.sparse <- make_bsseq_mat(bedfiles, regions, sparse = TRUE))
## [02:07:14.509290] [iscream::query_all] [info] Querying 3 regions from 4 bedfiles
##
## [02:07:14.509583] [iscream::query_all] [info] Creating metadata vectors
## [02:07:14.509598] [iscream::query_all] [info] 7 loci found - 9993 extra rows allocated with 0 resizes
## [02:07:14.509606] [iscream::query_all] [info] Creating sparse matrix
## $M
## 7 x 4 sparse Matrix of class "dgCMatrix"
## a b c d
## [1,] 1 . . 1
## [2,] 1 . 2 2
## [3,] . 2 . .
## [4,] . 1 . .
## [5,] 1 . 1 1
## [6,] 2 . . .
## [7,] 3 1 . 1
##
## $Cov
## 7 x 4 sparse Matrix of class "dgCMatrix"
## a b c d
## [1,] 1 2 . 1
## [2,] 1 . 2 2
## [3,] 2 2 . .
## [4,] 1 1 2 1
## [5,] 2 . 1 2
## [6,] 2 2 . .
## [7,] 3 1 . 1
##
## $pos
## [1] 0 2 4 6 8 10 12
##
## $chr
## [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
##
## $sampleNames
## [1] "a" "b" "c" "d"
gr <- GRanges(mats$chr, mats$pos)
# dense
SummarizedExperiment(assays = list(M = mats$M, Cov = mats$Cov), rowRanges = gr)
## class: RangedSummarizedExperiment
## dim: 7 4
## metadata(0):
## assays(2): M Cov
## rownames: NULL
## rowData names(0):
## colnames(4): a b c d
## colData names(0):
# sparse
SummarizedExperiment(
assays = list(M = mats.sparse$M, Cov = mats.sparse$Cov),
rowRanges = gr
)
## class: RangedSummarizedExperiment
## dim: 7 4
## metadata(0):
## assays(2): M Cov
## rownames: NULL
## rowData names(0):
## colnames(4): a b c d
## colData names(0):
Making BSseq
objects
A BSseq object is a type of SummarizedExperiment, but it cannot handle sparse matrices:
Session info
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.38.1 Biobase_2.68.0
## [3] MatrixGenerics_1.20.0 matrixStats_1.5.0
## [5] GenomicRanges_1.60.0 GenomeInfoDb_1.44.0
## [7] IRanges_2.42.0 S4Vectors_0.46.0
## [9] BiocGenerics_0.54.0 generics_0.1.4
## [11] iscream_0.1.0.9000
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 SparseArray_1.8.0 lattice_0.22-7
## [4] digest_0.6.37 evaluate_1.0.4 grid_4.5.1
## [7] fastmap_1.2.0 jsonlite_2.0.0 Matrix_1.7-3
## [10] httr_1.4.7 UCSC.utils_1.4.0 textshaping_1.0.1
## [13] jquerylib_0.1.4 abind_1.4-8 cli_3.6.5
## [16] rlang_1.1.6 crayon_1.5.3 XVector_0.48.0
## [19] parallelly_1.45.0 cachem_1.1.0 DelayedArray_0.34.1
## [22] yaml_2.3.10 S4Arrays_1.8.1 tools_4.5.1
## [25] parallel_4.5.1 GenomeInfoDbData_1.2.14 R6_2.6.1
## [28] lifecycle_1.0.4 fs_1.6.6 stringfish_0.16.0
## [31] ragg_1.4.0 desc_1.4.3 pkgdown_2.1.3
## [34] RcppParallel_5.1.10 bslib_0.9.0 data.table_1.17.6
## [37] Rcpp_1.0.14 systemfonts_1.2.3 xfun_0.52
## [40] knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.29
## [43] compiler_4.5.1