
Getting other data structures from iscream
Source:vignettes/data_structures.Rmd
data_structures.Rmd
The examples here show how iscream output can be converted into other data structures for further analysis.
## iscream using 1 thread by default but parallelly::availableCores() detects 4 possibly available threads. See `?set_threads` for information on multithreading before trying to use more.
data_dir <- system.file("extdata", package = "iscream")
bedfiles <- list.files(data_dir, pattern = "[a|b|c|d].bed.gz$", full.names = TRUE)
regions <- c(A = "chr1:1-6", B = "chr1:7-10", C = "chr1:11-14")
GRanges
GRanges can be used as the input to all of iscream’s querying
functions. The output of tabix()
and
query_all()
can also be turned into GRanges.
From tabix
queries
## GRanges object with 7 ranges and 2 metadata columns:
## seqnames ranges strand | beta coverage
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 1-2 * | 1.0 1
## [2] chr1 3-4 * | 1.0 1
## [3] chr1 5-6 * | 0.0 2
## [4] chr1 7-8 * | 0.0 1
## [5] chr1 9-10 * | 0.5 2
## [6] chr1 11-12 * | 1.0 2
## [7] chr1 13-14 * | 1.0 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
If the input is already a GRanges
object,
tabix()
will also return a GRanges
object
along with any attached metadata. tabix()
uses the input
findOverlaps()
to join input metadata with the queried
data.
gr <- GRanges(regions)
values(gr) <- DataFrame(
gene = c("gene1", "gene2", "gene3"),
some_metadata = c("s1", "s2", "s3")
)
gr
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | gene some_metadata
## <Rle> <IRanges> <Rle> | <character> <character>
## A chr1 1-6 * | gene1 s1
## B chr1 7-10 * | gene2 s2
## C chr1 11-14 * | gene3 s3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
tabix(bedfiles, gr)
## GRanges object with 20 ranges and 5 metadata columns:
## seqnames ranges strand | beta coverage sample gene
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <character> <character>
## [1] chr1 1-2 * | 1.0 1 a gene1
## [2] chr1 3-4 * | 1.0 1 a gene1
## [3] chr1 5-6 * | 0.0 2 a gene1
## [4] chr1 7-8 * | 0.0 1 a gene2
## [5] chr1 9-10 * | 0.5 2 a gene2
## ... ... ... ... . ... ... ... ...
## [16] chr1 1-2 * | 1.0 1 d gene1
## [17] chr1 3-4 * | 1.0 2 d gene1
## [18] chr1 7-8 * | 0.0 1 d gene2
## [19] chr1 9-10 * | 0.5 2 d gene2
## [20] chr1 13-14 * | 1.0 1 d gene3
## some_metadata
## <character>
## [1] s1
## [2] s1
## [3] s1
## [4] s2
## [5] s2
## ... ...
## [16] s1
## [17] s1
## [18] s2
## [19] s2
## [20] s3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
From summarize_regions
With a named set of regions, you’ll need to
set_region_rownames
as the input to create the GRanges
object:
(summary <- summarize_regions(
bedfiles,
regions,
set_region_rownames = TRUE,
fun = c("sum", "mean"))
)
## [18:39:33.637460] [iscream::summarize_regions] [info] Summarizing 3 regions from 4 bedfiles
## [18:39:33.637535] [iscream::summarize_regions] [info] using sum, mean
## Feature Sample coverage.sum M.sum coverage.mean M.mean
## chr1:1-6 A a 4 2 1.333333 0.6666667
## chr1:7-10 B a 3 1 1.500000 0.5000000
## chr1:11-14 C a 5 5 2.500000 2.5000000
## chr1:1-6 A b 4 2 2.000000 1.0000000
## chr1:7-10 B b 1 1 1.000000 1.0000000
## chr1:11-14 C b 3 1 1.500000 0.5000000
## chr1:1-6 A c 2 2 2.000000 2.0000000
## chr1:7-10 B c 3 1 1.500000 0.5000000
## chr1:11-14 C c NA NA NA NA
## chr1:1-6 A d 3 3 1.500000 1.5000000
## chr1:7-10 B d 3 1 1.500000 0.5000000
## chr1:11-14 C d 1 1 1.000000 1.0000000
## GRanges object with 12 ranges and 6 metadata columns:
## seqnames ranges strand | summary.Feature summary.Sample
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 1-6 * | A a
## [2] chr1 7-10 * | B a
## [3] chr1 11-14 * | C a
## [4] chr1 1-6 * | A b
## [5] chr1 7-10 * | B b
## ... ... ... ... . ... ...
## [8] chr1 7-10 * | B c
## [9] chr1 11-14 * | C c
## [10] chr1 1-6 * | A d
## [11] chr1 7-10 * | B d
## [12] chr1 11-14 * | C d
## summary.coverage.sum summary.M.sum summary.coverage.mean summary.M.mean
## <numeric> <numeric> <numeric> <numeric>
## [1] 4 2 1.33333 0.666667
## [2] 3 1 1.50000 0.500000
## [3] 5 5 2.50000 2.500000
## [4] 4 2 2.00000 1.000000
## [5] 1 1 1.00000 1.000000
## ... ... ... ... ...
## [8] 3 1 1.5 0.5
## [9] NA NA NA NA
## [10] 3 3 1.5 1.5
## [11] 3 1 1.5 0.5
## [12] 1 1 1.0 1.0
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
If the regions vector is not named, pass the Feature
column as the GRanges input regions:
(summary <- summarize_regions(
bedfiles,
unname(regions),
fun = c("sum", "mean"))
)
## [18:39:33.758934] [iscream::summarize_regions] [info] Summarizing 3 regions from 4 bedfiles
## [18:39:33.758964] [iscream::summarize_regions] [info] using sum, mean
## Feature Sample coverage.sum M.sum coverage.mean M.mean
## 1 chr1:1-6 a 4 2 1.333333 0.6666667
## 2 chr1:7-10 a 3 1 1.500000 0.5000000
## 3 chr1:11-14 a 5 5 2.500000 2.5000000
## 4 chr1:1-6 b 4 2 2.000000 1.0000000
## 5 chr1:7-10 b 1 1 1.000000 1.0000000
## 6 chr1:11-14 b 3 1 1.500000 0.5000000
## 7 chr1:1-6 c 2 2 2.000000 2.0000000
## 8 chr1:7-10 c 3 1 1.500000 0.5000000
## 9 chr1:11-14 c NA NA NA NA
## 10 chr1:1-6 d 3 3 1.500000 1.5000000
## 11 chr1:7-10 d 3 1 1.500000 0.5000000
## 12 chr1:11-14 d 1 1 1.000000 1.0000000
GRanges(summary$Feature, summary = summary[, -1])
## GRanges object with 12 ranges and 5 metadata columns:
## seqnames ranges strand | summary.Sample summary.coverage.sum
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 1-6 * | a 4
## [2] chr1 7-10 * | a 3
## [3] chr1 11-14 * | a 5
## [4] chr1 1-6 * | b 4
## [5] chr1 7-10 * | b 1
## ... ... ... ... . ... ...
## [8] chr1 7-10 * | c 3
## [9] chr1 11-14 * | c NA
## [10] chr1 1-6 * | d 3
## [11] chr1 7-10 * | d 3
## [12] chr1 11-14 * | d 1
## summary.M.sum summary.coverage.mean summary.M.mean
## <numeric> <numeric> <numeric>
## [1] 2 1.33333 0.666667
## [2] 1 1.50000 0.500000
## [3] 5 2.50000 2.500000
## [4] 2 2.00000 1.000000
## [5] 1 1.00000 1.000000
## ... ... ... ...
## [8] 1 1.5 0.5
## [9] NA NA NA
## [10] 3 1.5 1.5
## [11] 1 1.5 0.5
## [12] 1 1.0 1.0
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Making SummarizedExperiment objects
From query_all
matrices
(mats <- query_all(bedfiles, regions))
## [18:39:36.173591] [iscream::query_all] [info] Querying 3 regions from 4 bedfiles
##
## [18:39:36.174088] [iscream::query_all] [info] Creating metadata vectors
## [18:39:36.174137] [iscream::query_all] [info] nrows 10000 - 9993 extra rows allocated with 0 resizes
## [18:39:36.174143] [iscream::query_all] [info] Creating dense matrix
## $M
## a b c d
## [1,] 1 0 0 1
## [2,] 1 0 2 2
## [3,] 0 2 0 0
## [4,] 0 1 0 0
## [5,] 1 0 1 1
## [6,] 2 0 0 0
## [7,] 3 1 0 1
##
## $Cov
## a b c d
## [1,] 1 2 0 1
## [2,] 1 0 2 2
## [3,] 2 2 0 0
## [4,] 1 1 2 1
## [5,] 2 0 1 2
## [6,] 2 2 0 0
## [7,] 3 1 0 1
##
## $pos
## [1] 1 3 5 7 9 11 13
##
## $chr
## [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
##
## $sampleNames
## [1] "a" "b" "c" "d"
(mats.sparse <- query_all(bedfiles, regions, sparse = TRUE))
## [18:39:36.175598] [iscream::query_all] [info] Querying 3 regions from 4 bedfiles
##
## [18:39:36.176029] [iscream::query_all] [info] Creating metadata vectors
## [18:39:36.176053] [iscream::query_all] [info] nrows 10000 - 9993 extra rows allocated with 0 resizes
## [18:39:36.176064] [iscream::query_all] [info] Creating sparse matrix
## $M
## 7 x 4 sparse Matrix of class "dgCMatrix"
## a b c d
## [1,] 1 . . 1
## [2,] 1 . 2 2
## [3,] . 2 . .
## [4,] . 1 . .
## [5,] 1 . 1 1
## [6,] 2 . . .
## [7,] 3 1 . 1
##
## $Cov
## 7 x 4 sparse Matrix of class "dgCMatrix"
## a b c d
## [1,] 1 2 . 1
## [2,] 1 . 2 2
## [3,] 2 2 . .
## [4,] 1 1 2 1
## [5,] 2 . 1 2
## [6,] 2 2 . .
## [7,] 3 1 . 1
##
## $pos
## [1] 1 3 5 7 9 11 13
##
## $chr
## [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
##
## $sampleNames
## [1] "a" "b" "c" "d"
gr <- GRanges(mats$chr, mats$pos)
# dense
SummarizedExperiment(assays = list(M = mats$M, Cov = mats$Cov), rowRanges = gr)
## class: RangedSummarizedExperiment
## dim: 7 4
## metadata(0):
## assays(2): M Cov
## rownames: NULL
## rowData names(0):
## colnames(4): a b c d
## colData names(0):
# sparse
SummarizedExperiment(
assays = list(M = mats.sparse$M, Cov = mats.sparse$Cov),
rowRanges = gr
)
## class: RangedSummarizedExperiment
## dim: 7 4
## metadata(0):
## assays(2): M Cov
## rownames: NULL
## rowData names(0):
## colnames(4): a b c d
## colData names(0):
Making BSseq
objects
A BSseq object is a type of SummarizedExperiment, but it cannot handle sparse matrices:
Session info
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.36.0 Biobase_2.66.0
## [3] MatrixGenerics_1.18.1 matrixStats_1.5.0
## [5] GenomicRanges_1.58.0 GenomeInfoDb_1.42.3
## [7] IRanges_2.40.1 S4Vectors_0.44.0
## [9] BiocGenerics_0.52.0 iscream_0.0.0.9000
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 SparseArray_1.6.1 lattice_0.22-6
## [4] digest_0.6.37 evaluate_1.0.3 grid_4.4.2
## [7] fastmap_1.2.0 jsonlite_1.9.0 Matrix_1.7-1
## [10] httr_1.4.7 UCSC.utils_1.2.0 textshaping_1.0.0
## [13] jquerylib_0.1.4 abind_1.4-8 cli_3.6.4
## [16] rlang_1.1.5 crayon_1.5.3 XVector_0.46.0
## [19] parallelly_1.42.0 cachem_1.1.0 DelayedArray_0.32.0
## [22] yaml_2.3.10 S4Arrays_1.6.0 tools_4.4.2
## [25] parallel_4.4.2 GenomeInfoDbData_1.2.13 R6_2.6.1
## [28] lifecycle_1.0.4 zlibbioc_1.52.0 fs_1.6.5
## [31] stringfish_0.16.0 ragg_1.3.3 desc_1.4.3
## [34] pkgdown_2.1.1 RcppParallel_5.1.10 bslib_0.9.0
## [37] data.table_1.16.4 Rcpp_1.0.14 systemfonts_1.2.1
## [40] xfun_0.51 knitr_1.49 htmltools_0.5.8.1
## [43] rmarkdown_2.29 compiler_4.4.2