Summarize CpGs methylation information over genomic regions
Source:R/summarize_regions.R
summarize_regions.Rd
Run summarizing functions on the CpGs in bedfiles across genomic regions.
Parallelized across files using threads from the "iscream.threads"
option.
Usage
summarize_regions(
bedfiles,
regions,
fun = "all",
aligner = "biscuit",
mval = TRUE,
set_region_rownames = FALSE,
nthreads = NULL
)
Arguments
- bedfiles
A vector of bedfile paths
- regions
A vector of genomic regions strings. If a named vector is provided, the names will be used in the feature column instead of the genomic regions string
- fun
Function to apply over the region. See details.
- aligner
The aligner used to produce the BED files - one of "biscuit", "bismark", "bsbolt".
- mval
Whether to calculate the M value (coverage \(\times \beta\)) or use the beta value when applying the function.
- set_region_rownames
Use the region strings as the returned data frame's rownames. Can be useful if you have a named vector of regions and want both the rownames and the feature names.
- nthreads
Set number of threads to use overriding the
"iscream.threads"
option. See?set_threads
for more information.
Details
Supported fun
arguments are given below. For each of these functions,
setting mval = FALSE
will use the beta values instead of the M value:
Sum:
"sum"
Mean:
"mean"
Median:
"median"
Standard deviation:
"stddev"
Variance:
"variance"
Range:
"range"
No. of CpGs in the region:
"cpg_count"
The summarizing computations are backed by the Armadillo library. See https://arma.sourceforge.net/docs.html#stats_fns for futher details on the supported functions
Examples
bedfiles <- system.file("extdata", package = "iscream") |>
list.files(pattern = "[a|b|c|d].bed.gz$", full.names = TRUE)
# examine the bedfiles
colnames <- c("chr", "start", "end", "beta", "coverage")
lapply(bedfiles, function(i) knitr::kable(read.table(i, col.names = colnames)))
#> [[1]]
#>
#>
#> |chr | start| end| beta| coverage|
#> |:----|-----:|---:|----:|--------:|
#> |chr1 | 1| 2| 1.0| 1|
#> |chr1 | 3| 4| 1.0| 1|
#> |chr1 | 5| 6| 0.0| 2|
#> |chr1 | 7| 8| 0.0| 1|
#> |chr1 | 9| 10| 0.5| 2|
#> |chr1 | 11| 12| 1.0| 2|
#> |chr1 | 13| 14| 1.0| 3|
#>
#> [[2]]
#>
#>
#> |chr | start| end| beta| coverage|
#> |:----|-----:|---:|----:|--------:|
#> |chr1 | 1| 2| 0| 2|
#> |chr1 | 5| 6| 1| 2|
#> |chr1 | 7| 8| 1| 1|
#> |chr1 | 11| 12| 0| 2|
#> |chr1 | 13| 14| 1| 1|
#>
#> [[3]]
#>
#>
#> |chr | start| end| beta| coverage|
#> |:----|-----:|---:|----:|--------:|
#> |chr1 | 3| 4| 1| 2|
#> |chr1 | 7| 8| 0| 2|
#> |chr1 | 9| 10| 1| 1|
#>
#> [[4]]
#>
#>
#> |chr | start| end| beta| coverage|
#> |:----|-----:|---:|----:|--------:|
#> |chr1 | 1| 2| 1.0| 1|
#> |chr1 | 3| 4| 1.0| 2|
#> |chr1 | 7| 8| 0.0| 1|
#> |chr1 | 9| 10| 0.5| 2|
#> |chr1 | 13| 14| 1.0| 1|
#>
# make a vector of regions
regions <- c("chr1:1-6", "chr1:7-10", "chr1:11-14")
summarize_regions(bedfiles, regions)
#> [20:33:34.652406] [iscream::summarize_regions] [info] Summarizing 3 regions from 4 bedfiles
#> [20:33:34.652421] [iscream::summarize_regions] [info] using sum, mean, median, stddev, variance, range, cpg_count
#> Feature Sample coverage.sum M.sum coverage.mean M.mean coverage.median
#> 1 chr1:1-6 a 4 2 1.333333 0.6666667 1.0
#> 2 chr1:7-10 a 3 1 1.500000 0.5000000 1.5
#> 3 chr1:11-14 a 5 5 2.500000 2.5000000 2.5
#> 4 chr1:1-6 b 4 2 2.000000 1.0000000 2.0
#> 5 chr1:7-10 b 1 1 1.000000 1.0000000 1.0
#> 6 chr1:11-14 b 3 1 1.500000 0.5000000 1.5
#> 7 chr1:1-6 c 2 2 2.000000 2.0000000 2.0
#> 8 chr1:7-10 c 3 1 1.500000 0.5000000 1.5
#> 9 chr1:11-14 c NA NA NA NA NA
#> 10 chr1:1-6 d 3 3 1.500000 1.5000000 1.5
#> 11 chr1:7-10 d 3 1 1.500000 0.5000000 1.5
#> 12 chr1:11-14 d 1 1 1.000000 1.0000000 1.0
#> M.median coverage.stddev M.stddev coverage.variance M.variance
#> 1 1.0 0.5773503 0.5773503 0.3333333 0.3333333
#> 2 0.5 0.7071068 0.7071068 0.5000000 0.5000000
#> 3 2.5 0.7071068 0.7071068 0.5000000 0.5000000
#> 4 1.0 0.0000000 1.4142136 0.0000000 2.0000000
#> 5 1.0 0.0000000 0.0000000 0.0000000 0.0000000
#> 6 0.5 0.7071068 0.7071068 0.5000000 0.5000000
#> 7 2.0 0.0000000 0.0000000 0.0000000 0.0000000
#> 8 0.5 0.7071068 0.7071068 0.5000000 0.5000000
#> 9 NA NA NA NA NA
#> 10 1.5 0.7071068 0.7071068 0.5000000 0.5000000
#> 11 0.5 0.7071068 0.7071068 0.5000000 0.5000000
#> 12 1.0 0.0000000 0.0000000 0.0000000 0.0000000
#> coverage.range M.range cpg_count
#> 1 1 1 3
#> 2 1 1 2
#> 3 1 1 2
#> 4 0 2 2
#> 5 0 0 1
#> 6 1 1 2
#> 7 0 0 1
#> 8 1 1 2
#> 9 NA NA NA
#> 10 1 1 2
#> 11 1 1 2
#> 12 0 0 1
summarize_regions(bedfiles, regions, mval = FALSE)
#> [20:33:34.665899] [iscream::summarize_regions] [info] Summarizing 3 regions from 4 bedfiles
#> [20:33:34.665914] [iscream::summarize_regions] [info] using sum, mean, median, stddev, variance, range, cpg_count
#> Feature Sample coverage.sum beta.sum coverage.mean beta.mean
#> 1 chr1:1-6 a 4 2.0 1.333333 0.6666667
#> 2 chr1:7-10 a 3 0.5 1.500000 0.2500000
#> 3 chr1:11-14 a 5 2.0 2.500000 1.0000000
#> 4 chr1:1-6 b 4 1.0 2.000000 0.5000000
#> 5 chr1:7-10 b 1 1.0 1.000000 1.0000000
#> 6 chr1:11-14 b 3 1.0 1.500000 0.5000000
#> 7 chr1:1-6 c 2 1.0 2.000000 1.0000000
#> 8 chr1:7-10 c 3 1.0 1.500000 0.5000000
#> 9 chr1:11-14 c NA NA NA NA
#> 10 chr1:1-6 d 3 2.0 1.500000 1.0000000
#> 11 chr1:7-10 d 3 0.5 1.500000 0.2500000
#> 12 chr1:11-14 d 1 1.0 1.000000 1.0000000
#> coverage.median beta.median coverage.stddev beta.stddev coverage.variance
#> 1 1.0 1.00 0.5773503 0.5773503 0.3333333
#> 2 1.5 0.25 0.7071068 0.3535534 0.5000000
#> 3 2.5 1.00 0.7071068 0.0000000 0.5000000
#> 4 2.0 0.50 0.0000000 0.7071068 0.0000000
#> 5 1.0 1.00 0.0000000 0.0000000 0.0000000
#> 6 1.5 0.50 0.7071068 0.7071068 0.5000000
#> 7 2.0 1.00 0.0000000 0.0000000 0.0000000
#> 8 1.5 0.50 0.7071068 0.7071068 0.5000000
#> 9 NA NA NA NA NA
#> 10 1.5 1.00 0.7071068 0.0000000 0.5000000
#> 11 1.5 0.25 0.7071068 0.3535534 0.5000000
#> 12 1.0 1.00 0.0000000 0.0000000 0.0000000
#> beta.variance coverage.range beta.range cpg_count
#> 1 0.3333333 1 1.0 3
#> 2 0.1250000 1 0.5 2
#> 3 0.0000000 1 0.0 2
#> 4 0.5000000 0 1.0 2
#> 5 0.0000000 0 0.0 1
#> 6 0.5000000 1 1.0 2
#> 7 0.0000000 0 0.0 1
#> 8 0.5000000 1 1.0 2
#> 9 NA NA NA NA
#> 10 0.0000000 1 0.0 2
#> 11 0.1250000 1 0.5 2
#> 12 0.0000000 0 0.0 1
summarize_regions(bedfiles, regions, fun = "sum")
#> [20:33:34.678399] [iscream::summarize_regions] [info] Summarizing 3 regions from 4 bedfiles
#> [20:33:34.678412] [iscream::summarize_regions] [info] using sum
#> Feature Sample coverage.sum M.sum
#> 1 chr1:1-6 a 4 2
#> 2 chr1:7-10 a 3 1
#> 3 chr1:11-14 a 5 5
#> 4 chr1:1-6 b 4 2
#> 5 chr1:7-10 b 1 1
#> 6 chr1:11-14 b 3 1
#> 7 chr1:1-6 c 2 2
#> 8 chr1:7-10 c 3 1
#> 9 chr1:11-14 c NA NA
#> 10 chr1:1-6 d 3 3
#> 11 chr1:7-10 d 3 1
#> 12 chr1:11-14 d 1 1