Summarize CpGs methylation information over genomic regions
Source:R/summarize_regions.R
summarize_regions.Rd
Run summarizing functions on the CpGs in bedfiles across genomic regions.
Parallelized across files using threads from the "iscream.threads"
option.
Usage
summarize_regions(
bedfiles,
regions,
fun = "all",
aligner = "biscuit",
mval = TRUE,
set_region_rownames = FALSE,
nthreads = NULL
)
Arguments
- bedfiles
A vector of bedfile paths
- regions
A vector, data frame or GenomicRanges of genomic regions. See details.
- fun
Function(s) to apply over the region. See details.
- aligner
The aligner used to produce the BED files - one of "biscuit", "bismark", "bsbolt".
- mval
Whether to calculate the M value (coverage \(\times \beta\)) or use the beta value when applying the function.
- set_region_rownames
Use the region strings as the returned data frame's rownames. Can be useful if you have a named regions and want both the regions strings rownames and the feature names.
- nthreads
Set number of threads to use overriding the
"iscream.threads"
option. See?set_threads
for more information.
Details
The input regions may be string vector in the form "chr:start-end" or a GRanges object. If a data frame is provided, they must have "chr", "start", and "end" columns. If the string vector and GenomicRanges inputs are named, the names will be used to describe each feature in the output dataframe. If input dataframes have a 'name' column, it will be used to populate the output's feature column.
Supported fun
arguments are given below. For each of these functions,
setting mval = FALSE
will use the beta values instead of the M value:
Sum:
"sum"
Mean:
"mean"
Median:
"median"
Standard deviation:
"stddev"
Variance:
"variance"
Minimum:
"min"
Maximum:
"max"
Range:
"range"
No. of CpGs in the region:
"cpg_count"
The summarizing computations are backed by the Armadillo library. See https://arma.sourceforge.net/docs.html#stats_fns for futher details on the supported functions
Examples
bedfiles <- system.file("extdata", package = "iscream") |>
list.files(pattern = "[a|b|c|d].bed.gz$", full.names = TRUE)
# examine the bedfiles
colnames <- c("chr", "start", "end", "beta", "coverage")
lapply(bedfiles, function(i) knitr::kable(read.table(i, col.names = colnames)))
#> [[1]]
#>
#>
#> |chr | start| end| beta| coverage|
#> |:----|-----:|---:|----:|--------:|
#> |chr1 | 1| 2| 1.0| 1|
#> |chr1 | 3| 4| 1.0| 1|
#> |chr1 | 5| 6| 0.0| 2|
#> |chr1 | 7| 8| 0.0| 1|
#> |chr1 | 9| 10| 0.5| 2|
#> |chr1 | 11| 12| 1.0| 2|
#> |chr1 | 13| 14| 1.0| 3|
#>
#> [[2]]
#>
#>
#> |chr | start| end| beta| coverage|
#> |:----|-----:|---:|----:|--------:|
#> |chr1 | 1| 2| 0| 2|
#> |chr1 | 5| 6| 1| 2|
#> |chr1 | 7| 8| 1| 1|
#> |chr1 | 11| 12| 0| 2|
#> |chr1 | 13| 14| 1| 1|
#>
#> [[3]]
#>
#>
#> |chr | start| end| beta| coverage|
#> |:----|-----:|---:|----:|--------:|
#> |chr1 | 3| 4| 1| 2|
#> |chr1 | 7| 8| 0| 2|
#> |chr1 | 9| 10| 1| 1|
#>
#> [[4]]
#>
#>
#> |chr | start| end| beta| coverage|
#> |:----|-----:|---:|----:|--------:|
#> |chr1 | 1| 2| 1.0| 1|
#> |chr1 | 3| 4| 1.0| 2|
#> |chr1 | 7| 8| 0.0| 1|
#> |chr1 | 9| 10| 0.5| 2|
#> |chr1 | 13| 14| 1.0| 1|
#>
# make a vector of regions
regions <- c("chr1:1-6", "chr1:7-10", "chr1:11-14")
summarize_regions(bedfiles, regions)
#> [17:18:30.227743] [iscream::summarize_regions] [info] Summarizing 3 regions from 4 bedfiles
#> [17:18:30.227758] [iscream::summarize_regions] [info] using sum, mean, median, stddev, variance, min, max, range, cpg_count
#> Feature Sample coverage.sum M.sum coverage.mean M.mean coverage.median
#> 1 chr1:1-6 a 4 2 1.333333 0.6666667 1.0
#> 2 chr1:7-10 a 3 1 1.500000 0.5000000 1.5
#> 3 chr1:11-14 a 5 5 2.500000 2.5000000 2.5
#> 4 chr1:1-6 b 4 2 2.000000 1.0000000 2.0
#> 5 chr1:7-10 b 1 1 1.000000 1.0000000 1.0
#> 6 chr1:11-14 b 3 1 1.500000 0.5000000 1.5
#> 7 chr1:1-6 c 2 2 2.000000 2.0000000 2.0
#> 8 chr1:7-10 c 3 1 1.500000 0.5000000 1.5
#> 9 chr1:11-14 c NA NA NA NA NA
#> 10 chr1:1-6 d 3 3 1.500000 1.5000000 1.5
#> 11 chr1:7-10 d 3 1 1.500000 0.5000000 1.5
#> 12 chr1:11-14 d 1 1 1.000000 1.0000000 1.0
#> M.median coverage.stddev M.stddev coverage.variance M.variance coverage.min
#> 1 1.0 0.5773503 0.5773503 0.3333333 0.3333333 1
#> 2 0.5 0.7071068 0.7071068 0.5000000 0.5000000 1
#> 3 2.5 0.7071068 0.7071068 0.5000000 0.5000000 2
#> 4 1.0 0.0000000 1.4142136 0.0000000 2.0000000 2
#> 5 1.0 0.0000000 0.0000000 0.0000000 0.0000000 1
#> 6 0.5 0.7071068 0.7071068 0.5000000 0.5000000 1
#> 7 2.0 0.0000000 0.0000000 0.0000000 0.0000000 2
#> 8 0.5 0.7071068 0.7071068 0.5000000 0.5000000 1
#> 9 NA NA NA NA NA NA
#> 10 1.5 0.7071068 0.7071068 0.5000000 0.5000000 1
#> 11 0.5 0.7071068 0.7071068 0.5000000 0.5000000 1
#> 12 1.0 0.0000000 0.0000000 0.0000000 0.0000000 1
#> M.min coverage.max M.max coverage.range M.range cpg_count
#> 1 0 2 1 1 1 3
#> 2 0 2 1 1 1 2
#> 3 2 3 3 1 1 2
#> 4 0 2 2 0 2 2
#> 5 1 1 1 0 0 1
#> 6 0 2 1 1 1 2
#> 7 2 2 2 0 0 1
#> 8 0 2 1 1 1 2
#> 9 NA NA NA NA NA NA
#> 10 1 2 2 1 1 2
#> 11 0 2 1 1 1 2
#> 12 1 1 1 0 0 1
summarize_regions(bedfiles, regions, fun = c("mean", "stddev"), mval = FALSE)
#> [17:18:30.245909] [iscream::summarize_regions] [info] Summarizing 3 regions from 4 bedfiles
#> [17:18:30.245921] [iscream::summarize_regions] [info] using mean, stddev
#> Feature Sample coverage.mean beta.mean coverage.stddev beta.stddev
#> 1 chr1:1-6 a 1.333333 0.6666667 0.5773503 0.5773503
#> 2 chr1:7-10 a 1.500000 0.2500000 0.7071068 0.3535534
#> 3 chr1:11-14 a 2.500000 1.0000000 0.7071068 0.0000000
#> 4 chr1:1-6 b 2.000000 0.5000000 0.0000000 0.7071068
#> 5 chr1:7-10 b 1.000000 1.0000000 0.0000000 0.0000000
#> 6 chr1:11-14 b 1.500000 0.5000000 0.7071068 0.7071068
#> 7 chr1:1-6 c 2.000000 1.0000000 0.0000000 0.0000000
#> 8 chr1:7-10 c 1.500000 0.5000000 0.7071068 0.7071068
#> 9 chr1:11-14 c NA NA NA NA
#> 10 chr1:1-6 d 1.500000 1.0000000 0.7071068 0.0000000
#> 11 chr1:7-10 d 1.500000 0.2500000 0.7071068 0.3535534
#> 12 chr1:11-14 d 1.000000 1.0000000 0.0000000 0.0000000
summarize_regions(bedfiles, regions, fun = "sum")
#> [17:18:30.249952] [iscream::summarize_regions] [info] Summarizing 3 regions from 4 bedfiles
#> [17:18:30.249961] [iscream::summarize_regions] [info] using sum
#> Feature Sample coverage.sum M.sum
#> 1 chr1:1-6 a 4 2
#> 2 chr1:7-10 a 3 1
#> 3 chr1:11-14 a 5 5
#> 4 chr1:1-6 b 4 2
#> 5 chr1:7-10 b 1 1
#> 6 chr1:11-14 b 3 1
#> 7 chr1:1-6 c 2 2
#> 8 chr1:7-10 c 3 1
#> 9 chr1:11-14 c NA NA
#> 10 chr1:1-6 d 3 3
#> 11 chr1:7-10 d 3 1
#> 12 chr1:11-14 d 1 1