
An introduction to iscream
James Eapen
Department of Epigenetics, Van Andel Institute, Grand Rapids, MISource:
vignettes/iscream.Rmd
iscream.Rmd
Introduction
iscream is a BED file querying package that can can retrieve records within regions of interest from multiple tabixed BED files. It can make queries like tabix, summarize data within regions and make matrices across input files and regions. All operations can be done in parallel and return R/Bioconductor data structures.
iscream may be installed from https://bioconductor.org with
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("iscream")
iscream was designed to be memory efficient and fast on large
datasets. This vignette demonstrates iscream’s functions on a small
dataset. For more information on performance considerations on large
datasets see vignette("performance")
.
Loading iscream
## iscream using 1 thread by default but parallelly::availableCores() detects 4 possibly available threads. See `?set_threads` for information on multithreading before trying to use more.
set_threads(2)
## iscream now using 2 of 4 available threads.
On load, iscream will inform the user about the number if threads it
is set to use. This is configurable with either
set_threads()
or by setting
option("iscream.threads")
, which can be done in your
.Rprofile
to set it on startup. Once loaded you use the
following querying functions.
tabix()
tabix is a command-line tool that queries records from compressed BED files. As input it receives the path to a tabixed BED file and one or more genomic regions of interest. It queries and prints the records from the tabixed BED file that fall within the regions of interest. This output may be redirected to a file and read into R.
iscream’s tabix()
function works in a similar way but
supports multiple files and multiple input formats all in one function.
It receives a vector of paths to tabixed BED files and regions of
interest. The input regions may be a vector of
"chr:start-end"
strings, a data frame with
"chr"
, "start"
, and "end"
columns
or a GRanges
object. It queries the BED files and returns
the records as a parsed data.table.
To get a GRanges
object instead, use
tabix_gr
.
Using iscream eliminates context switching between the shell and R
for tabix queries since the queries are made in R and the result is an R
data structure. Rsamtools
is another R package that can query data from BED files, but it only
supports one file at a time, only accepts GRanges
as the
input regions and does not parse the tab-delimited strings. See
vignette("tabix")
for a comparison of iscream and
Rsamtools.
Setup
Using list.files()
, you can get the paths to all
compressed bed files in a directory. These files contain small regions
from chromosome 1 and Y from the snmC-seq2 methylation data (Luo et al.
2018).
data_dir <- system.file("extdata", package = "iscream")
(bedfiles <- list.files(
data_dir,
pattern = "cell[1-5].bed.gz$",
full.names = TRUE))
## [1] "/home/runner/work/_temp/Library/iscream/extdata/cell1.bed.gz"
## [2] "/home/runner/work/_temp/Library/iscream/extdata/cell2.bed.gz"
## [3] "/home/runner/work/_temp/Library/iscream/extdata/cell3.bed.gz"
## [4] "/home/runner/work/_temp/Library/iscream/extdata/cell4.bed.gz"
For demonstration, I’m using two regions.
regions <- c("chr1:184577-680065", "chrY:56877780-56882524")
Note: in this dataset the chromosome identifier is
in a “chr[number]” format. However, in some datasets, the chromosome ID
is just a number or a letter like ‘1’, or ‘Y’. To see the chromosome ID
format across your files you can use query_chroms()
to get
all unique chromosomes across all input files:
query_chroms(bedfiles)
## [1] "chr1" "chrY"
Making queries
tabix()
one file:
tabix(bedfiles[1], regions)
## chr start end V1 V2
## <char> <int> <int> <num> <int>
## 1: chr1 184577 184578 0 1
## 2: chr1 191223 191224 0 1
## 3: chr1 191264 191265 0 1
## 4: chr1 191307 191308 0 1
## 5: chr1 191491 191492 0 1
## 6: chr1 191507 191508 0 1
## 7: chr1 191526 191527 0 1
## 8: chr1 679950 679951 1 1
## 9: chr1 680064 680065 1 1
## 10: chrY 56877780 56877781 0 1
## 11: chrY 56878350 56878351 0 1
## 12: chrY 56878351 56878352 0 1
## 13: chrY 56878431 56878432 0 1
## 14: chrY 56879483 56879484 0 2
## 15: chrY 56879535 56879536 0 2
## 16: chrY 56879539 56879540 1 1
## 17: chrY 56879724 56879725 0 1
## 18: chrY 56879834 56879835 0 1
## 19: chrY 56879863 56879864 1 1
## 20: chrY 56879915 56879916 1 1
## 21: chrY 56879945 56879946 1 1
## 22: chrY 56881130 56881131 1 1
## 23: chrY 56881926 56881927 1 1
## 24: chrY 56882523 56882524 0 1
## chr start end V1 V2
With multiple files, the output contains a column for the file name.
tabix(bedfiles, regions)
## chr start end V1 V2 file
## <char> <int> <int> <num> <int> <char>
## 1: chr1 184577 184578 0 1 cell1
## 2: chr1 191223 191224 0 1 cell1
## 3: chr1 191264 191265 0 1 cell1
## 4: chr1 191307 191308 0 1 cell1
## 5: chr1 191491 191492 0 1 cell1
## ---
## 70: chrY 56880518 56880519 0 1 cell4
## 71: chrY 56880715 56880716 0 2 cell4
## 72: chrY 56880914 56880915 0 2 cell4
## 73: chrY 56881927 56881928 1 1 cell4
## 74: chrY 56882345 56882346 1 1 cell4
Using GenomicRanges
Both tabix()
and tabix_gr()
accept
GRanges
objects as input regions but
tabix_gr()
returns GRanges
objects instead of
data frames. tabix_gr
will also preserve any input
GRanges
metadata.
library(GenomicRanges)
gr <- GRanges(regions)
tabix_gr(bedfiles, gr)
## GRangesList object of length 4:
## $cell1
## GRanges object with 24 ranges and 3 metadata columns:
## seqnames ranges strand | V1 V2 file
## <Rle> <IRanges> <Rle> | <numeric> <integer> <character>
## [1] chr1 184578 * | 0 1 cell1
## [2] chr1 191224 * | 0 1 cell1
## [3] chr1 191265 * | 0 1 cell1
## [4] chr1 191308 * | 0 1 cell1
## [5] chr1 191492 * | 0 1 cell1
## ... ... ... ... . ... ... ...
## [20] chrY 56879916 * | 1 1 cell1
## [21] chrY 56879946 * | 1 1 cell1
## [22] chrY 56881131 * | 1 1 cell1
## [23] chrY 56881927 * | 1 1 cell1
## [24] chrY 56882524 * | 0 1 cell1
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
##
## ...
## <3 more elements>
Setting column names
You can set the result data frame’s column names or the
GRanges
mcols
with col.names
:
## GRangesList object of length 4:
## $cell1
## GRanges object with 24 ranges and 3 metadata columns:
## seqnames ranges strand | beta coverage file
## <Rle> <IRanges> <Rle> | <numeric> <integer> <character>
## [1] chr1 184578 * | 0 1 cell1
## [2] chr1 191224 * | 0 1 cell1
## [3] chr1 191265 * | 0 1 cell1
## [4] chr1 191308 * | 0 1 cell1
## [5] chr1 191492 * | 0 1 cell1
## ... ... ... ... . ... ... ...
## [20] chrY 56879916 * | 1 1 cell1
## [21] chrY 56879946 * | 1 1 cell1
## [22] chrY 56881131 * | 1 1 cell1
## [23] chrY 56881927 * | 1 1 cell1
## [24] chrY 56882524 * | 0 1 cell1
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
##
## ...
## <3 more elements>
Rsamtools-style output
tabix_raw()
will return an unparsed named list like
Rsamtools,
but with multi-file support:
tabix_raw(bedfiles, regions)
## $cell1
## $cell1$`chr1:184577-680065`
## [1] "chr1\t184577\t184578\t0.000\t1" "chr1\t191223\t191224\t0.000\t1"
## [3] "chr1\t191264\t191265\t0.000\t1" "chr1\t191307\t191308\t0.000\t1"
## [5] "chr1\t191491\t191492\t0.000\t1" "chr1\t191507\t191508\t0.000\t1"
## [7] "chr1\t191526\t191527\t0.000\t1" "chr1\t679950\t679951\t1.000\t1"
## [9] "chr1\t680064\t680065\t1.000\t1"
##
## $cell1$`chrY:56877780-56882524`
## [1] "chrY\t56877780\t56877781\t0.000\t1" "chrY\t56878350\t56878351\t0.000\t1"
## [3] "chrY\t56878351\t56878352\t0.000\t1" "chrY\t56878431\t56878432\t0.000\t1"
## [5] "chrY\t56879483\t56879484\t0.000\t2" "chrY\t56879535\t56879536\t0.000\t2"
## [7] "chrY\t56879539\t56879540\t1.000\t1" "chrY\t56879724\t56879725\t0.000\t1"
## [9] "chrY\t56879834\t56879835\t0.000\t1" "chrY\t56879863\t56879864\t1.000\t1"
## [11] "chrY\t56879915\t56879916\t1.000\t1" "chrY\t56879945\t56879946\t1.000\t1"
## [13] "chrY\t56881130\t56881131\t1.000\t1" "chrY\t56881926\t56881927\t1.000\t1"
## [15] "chrY\t56882523\t56882524\t0.000\t1"
##
##
## $cell2
## $cell2$`chr1:184577-680065`
## character(0)
##
## $cell2$`chrY:56877780-56882524`
## [1] "chrY\t56878351\t56878352\t0.000\t1" "chrY\t56879482\t56879483\t1.000\t1"
## [3] "chrY\t56879483\t56879484\t1.000\t1" "chrY\t56879534\t56879535\t1.000\t1"
## [5] "chrY\t56879538\t56879539\t1.000\t1" "chrY\t56880322\t56880323\t0.000\t1"
## [7] "chrY\t56880404\t56880405\t1.000\t1" "chrY\t56881129\t56881130\t0.000\t1"
## [9] "chrY\t56881285\t56881286\t0.000\t1" "chrY\t56881364\t56881365\t1.000\t1"
## [11] "chrY\t56882344\t56882345\t0.000\t1" "chrY\t56882523\t56882524\t0.000\t1"
##
##
## $cell3
## $cell3$`chr1:184577-680065`
## character(0)
##
## $cell3$`chrY:56877780-56882524`
## [1] "chrY\t56879862\t56879863\t1.000\t1" "chrY\t56879914\t56879915\t0.500\t2"
## [3] "chrY\t56879944\t56879945\t1.000\t1" "chrY\t56879969\t56879970\t1.000\t1"
## [5] "chrY\t56879976\t56879977\t1.000\t1" "chrY\t56880322\t56880323\t0.000\t1"
## [7] "chrY\t56880359\t56880360\t0.000\t1" "chrY\t56881129\t56881130\t0.500\t2"
##
##
## $cell4
## $cell4$`chr1:184577-680065`
## [1] "chr1\t191003\t191004\t1.000\t1" "chr1\t191307\t191308\t1.000\t1"
## [3] "chr1\t191438\t191439\t1.000\t1" "chr1\t191491\t191492\t1.000\t1"
## [5] "chr1\t191722\t191723\t1.000\t1" "chr1\t191747\t191748\t1.000\t1"
## [7] "chr1\t191776\t191777\t1.000\t1" "chr1\t628458\t628459\t1.000\t1"
## [9] "chr1\t631433\t631434\t1.000\t1" "chr1\t631436\t631437\t1.000\t1"
##
## $cell4$`chrY:56877780-56882524`
## [1] "chrY\t56877780\t56877781\t0.000\t1" "chrY\t56877810\t56877811\t0.000\t1"
## [3] "chrY\t56879834\t56879835\t0.000\t1" "chrY\t56879863\t56879864\t1.000\t1"
## [5] "chrY\t56879915\t56879916\t1.000\t1" "chrY\t56879945\t56879946\t0.000\t1"
## [7] "chrY\t56880045\t56880046\t0.500\t2" "chrY\t56880081\t56880082\t0.000\t1"
## [9] "chrY\t56880092\t56880093\t0.500\t2" "chrY\t56880112\t56880113\t1.000\t1"
## [11] "chrY\t56880137\t56880138\t1.000\t2" "chrY\t56880142\t56880143\t0.500\t2"
## [13] "chrY\t56880323\t56880324\t0.000\t1" "chrY\t56880360\t56880361\t1.000\t1"
## [15] "chrY\t56880405\t56880406\t1.000\t1" "chrY\t56880518\t56880519\t0.000\t1"
## [17] "chrY\t56880715\t56880716\t0.000\t2" "chrY\t56880914\t56880915\t0.000\t2"
## [19] "chrY\t56881927\t56881928\t1.000\t1" "chrY\t56882345\t56882346\t1.000\t1"
WGBS BED files
Since iscream was originally developed to read Whole Genome Bisulfite
Sequencing data, it has the aligner
argument for
automatically setting column names for the BISCUIT (Zhou et al. 2024),
Bismark (Krueger and
Andrews 2011), and BSBolt (Farrell et al. 2021).
tabix_gr(bedfiles, regions, aligner = "biscuit")
## GRangesList object of length 4:
## $cell1
## GRanges object with 24 ranges and 3 metadata columns:
## seqnames ranges strand | beta coverage file
## <Rle> <IRanges> <Rle> | <numeric> <integer> <character>
## [1] chr1 184578 * | 0 1 cell1
## [2] chr1 191224 * | 0 1 cell1
## [3] chr1 191265 * | 0 1 cell1
## [4] chr1 191308 * | 0 1 cell1
## [5] chr1 191492 * | 0 1 cell1
## ... ... ... ... . ... ... ...
## [20] chrY 56879916 * | 1 1 cell1
## [21] chrY 56879946 * | 1 1 cell1
## [22] chrY 56881131 * | 1 1 cell1
## [23] chrY 56881927 * | 1 1 cell1
## [24] chrY 56882524 * | 0 1 cell1
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
##
## ...
## <3 more elements>
summarize_regions()
Using the same BED files and regions, iscream can also summarize the
data within regions. This requires providing the index of the data
columns you want summarized and the column names for the output. All
summarizing functions are run on each input column by default - see
?summarize_regions
for supported functions.
summarize_regions(
bedfiles,
regions,
columns = c(4, 5),
col_names = c("beta", "coverage")
)
## [15:52:15.034322] [iscream::summarize_regions] [info] Summarizing 2 regions from 4 bedfiles
## [15:52:15.034368] [iscream::summarize_regions] [info] using sum, mean, median, stddev, variance, min, max, range, count
## [15:52:15.034372] [iscream::summarize_regions] [info] with columns 4, 5 as beta, coverage
## feature file beta.sum coverage.sum beta.mean coverage.mean
## 1 chr1:184577-680065 cell1 2.0 9 0.2222222 1.000000
## 2 chrY:56877780-56882524 cell1 6.0 17 0.4000000 1.133333
## 3 chr1:184577-680065 cell2 NA NA NA NA
## 4 chrY:56877780-56882524 cell2 6.0 12 0.5000000 1.000000
## 5 chr1:184577-680065 cell3 NA NA NA NA
## 6 chrY:56877780-56882524 cell3 5.0 10 0.6250000 1.250000
## 7 chr1:184577-680065 cell4 10.0 10 1.0000000 1.000000
## 8 chrY:56877780-56882524 cell4 9.5 26 0.4750000 1.300000
## beta.median coverage.median beta.stddev coverage.stddev beta.variance
## 1 0.00 1 0.4409586 0.0000000 0.1944444
## 2 0.00 1 0.5070926 0.3518658 0.2571429
## 3 NA NA NA NA NA
## 4 0.50 1 0.5222330 0.0000000 0.2727273
## 5 NA NA NA NA NA
## 6 0.75 1 0.4432026 0.4629100 0.1964286
## 7 1.00 1 0.0000000 0.0000000 0.0000000
## 8 0.50 1 0.4722566 0.4701623 0.2230263
## coverage.variance beta.min coverage.min beta.max coverage.max beta.range
## 1 0.0000000 0 1 1 1 1
## 2 0.1238095 0 1 1 2 1
## 3 NA NA NA NA NA NA
## 4 0.0000000 0 1 1 1 1
## 5 NA NA NA NA NA NA
## 6 0.2142857 0 1 1 2 1
## 7 0.0000000 1 1 1 1 0
## 8 0.2210526 0 1 1 2 1
## coverage.range count
## 1 0 9
## 2 1 15
## 3 NA NA
## 4 0 12
## 5 NA NA
## 6 1 8
## 7 0 10
## 8 1 20
The feature
column here contains the genomic region
coordinates, but can be set to something more informational if you have
names for the regions. You can also select the functions applied with
fun
:
names(regions) <- c("R1", "R2")
summarize_regions(
bedfiles,
regions,
fun = c("mean", "sum"),
columns = 5,
col_names = "coverage"
)
## [15:52:15.121806] [iscream::summarize_regions] [info] Summarizing 2 regions from 4 bedfiles
## [15:52:15.121829] [iscream::summarize_regions] [info] using mean, sum
## [15:52:15.121833] [iscream::summarize_regions] [info] with columns 5 as coverage
## feature file coverage.mean coverage.sum
## 1 R1 cell1 1.000000 9
## 2 R2 cell1 1.133333 17
## 3 R1 cell2 NA NA
## 4 R2 cell2 1.000000 12
## 5 R1 cell3 NA NA
## 6 R2 cell3 1.250000 10
## 7 R1 cell4 1.000000 10
## 8 R2 cell4 1.300000 26
WGBS BED files
For WGBS data specifically, you can use
summarize_meth_regions()
, which takes the
aligner
argument to correctly parse the columns.
summarize_meth_regions(
bedfiles,
regions,
aligner = "biscuit",
fun = c("mean", "sum")
)
## [15:52:15.192048] [iscream::summarize_regions] [info] Summarizing 2 regions from 4 bedfiles
## [15:52:15.192070] [iscream::summarize_regions] [info] using mean, sum
## [15:52:15.192074] [iscream::summarize_regions] [info] with columns 4, 5 as coverage, M
## feature file coverage.mean M.mean coverage.sum M.sum
## 1 R1 cell1 1.000000 0.2222222 9 2
## 2 R2 cell1 1.133333 0.4000000 17 6
## 3 R1 cell2 NA NA NA NA
## 4 R2 cell2 1.000000 0.5000000 12 6
## 5 R1 cell3 NA NA NA NA
## 6 R2 cell3 1.250000 0.7500000 10 6
## 7 R1 cell4 1.000000 1.0000000 10 10
## 8 R2 cell4 1.300000 0.6000000 26 12
make_mat()
make_mat()
makes a matrix where every row is a locus
found in at least one input file and the columns are the input files. It
returns a named list of the matrix, coordinates, and file names you can
use for analysis or other data structures. make_mat_se()
returns a RangedSummarizedExperiment
, and
make_mat_gr()
returns a GRanges
object.
## Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
## 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
(mat <- make_mat_se(bedfiles, regions, column = 4, mat_name = "beta"))
## [15:52:18.091263] [iscream::query_all] [info] Querying 2 regions from 4 bedfiles
##
## [15:52:18.091793] [iscream::query_all] [info] Creating metadata vectors
## [15:52:18.091855] [iscream::query_all] [info] 62 loci found - 9938 extra rows allocated with 0 resizes
## [15:52:18.091860] [iscream::query_all] [info] Creating dense matrix
## class: RangedSummarizedExperiment
## dim: 62 4
## metadata(0):
## assays(1): beta
## rownames: NULL
## rowData names(0):
## colnames(4): cell1 cell2 cell3 cell4
## colData names(0):
## cell1 cell2 cell3 cell4
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 0 0 0 0
## [4,] 0 0 0 1
## [5,] 0 0 0 1
## [6,] 0 0 0 0
## [7,] 0 0 0 0
## [8,] 1 0 0 0
## [9,] 1 0 0 0
## [10,] 0 0 1 0
If you have sparse data, you can save memory with
sparse = TRUE
, but only for make_mat()
and
make_mat_se()
.
mat <- make_mat(bedfiles, regions, column = 4, mat_name = "beta", sparse = TRUE)
## [15:52:18.217240] [iscream::query_all] [info] Querying 2 regions from 4 bedfiles
##
## [15:52:18.217735] [iscream::query_all] [info] Creating metadata vectors
## [15:52:18.217773] [iscream::query_all] [info] 62 loci found - 9938 extra rows allocated with 0 resizes
## [15:52:18.217785] [iscream::query_all] [info] Creating sparse matrix
head(mat$beta, 10)
## 10 x 4 sparse Matrix of class "dgCMatrix"
## cell1 cell2 cell3 cell4
## [1,] . . 1.0 .
## [2,] . . 0.5 .
## [3,] . . 1.0 .
## [4,] . . 1.0 .
## [5,] . . 1.0 .
## [6,] . . . .
## [7,] . . . .
## [8,] . . 0.5 .
## [9,] . . . .
## [10,] . . . .
WGBS BED files
For a BSseq
compatible object, use make_meth_mat
which returns a list
of BSseq inputs. To make the BSseq
object run
For more information on the supported Bioconductor data structures
see vignette("data_structures")
. An example workflow using
iscream to examine transcription start site methylation is at
vignette("TSS")
.
Session info
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.38.1 Biobase_2.68.0
## [3] MatrixGenerics_1.20.0 matrixStats_1.5.0
## [5] GenomicRanges_1.60.0 GenomeInfoDb_1.44.0
## [7] IRanges_2.42.0 S4Vectors_0.46.0
## [9] BiocGenerics_0.54.0 generics_0.1.4
## [11] iscream_0.99.0 BiocStyle_2.36.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 SparseArray_1.8.0 lattice_0.22-7
## [4] digest_0.6.37 evaluate_1.0.4 grid_4.5.1
## [7] bookdown_0.43 fastmap_1.2.0 jsonlite_2.0.0
## [10] Matrix_1.7-3 BiocManager_1.30.26 httr_1.4.7
## [13] UCSC.utils_1.4.0 textshaping_1.0.1 jquerylib_0.1.4
## [16] abind_1.4-8 cli_3.6.5 rlang_1.1.6
## [19] crayon_1.5.3 XVector_0.48.0 parallelly_1.45.0
## [22] DelayedArray_0.34.1 cachem_1.1.0 yaml_2.3.10
## [25] S4Arrays_1.8.1 tools_4.5.1 parallel_4.5.1
## [28] GenomeInfoDbData_1.2.14 R6_2.6.1 lifecycle_1.0.4
## [31] fs_1.6.6 stringfish_0.17.0 ragg_1.4.0
## [34] desc_1.4.3 pkgdown_2.1.3 RcppParallel_5.1.10
## [37] bslib_0.9.0 data.table_1.17.8 Rcpp_1.1.0
## [40] systemfonts_1.2.3 xfun_0.52 knitr_1.50
## [43] htmltools_0.5.8.1 rmarkdown_2.29 compiler_4.5.1