Skip to contents

Run summarizing functions on the CpGs in bedfiles across genomic regions. Parallelized across files using threads from the "iscream.threads" option.

Usage

summarize_regions(
  bedfiles,
  regions,
  fun = "all",
  aligner = "biscuit",
  mval = TRUE,
  set_region_rownames = FALSE,
  nthreads = NULL
)

Arguments

bedfiles

A vector of bedfile paths

regions

A vector of genomic regions strings. If a named vector is provided, the names will be used in the feature column instead of the genomic regions string

fun

Function to apply over the region. See details.

aligner

The aligner used to produce the BED files - one of "biscuit", "bismark", "bsbolt".

mval

Whether to calculate the M value (coverage \(\times \beta\)) or use the beta value when applying the function.

set_region_rownames

Use the region strings as the returned data frame's rownames. Can be useful if you have a named vector of regions and want both the rownames and the feature names.

nthreads

Set number of threads to use overriding the "iscream.threads" option. See ?set_threads for more information.

Value

A data.frame

Details

Supported fun arguments are given below. For each of these functions, setting mval = FALSE will use the beta values instead of the M value:

  • Sum: "sum"

  • Mean: "mean"

  • Median: "median"

  • Standard deviation: "stddev"

  • Variance: "variance"

  • Range: "range"

  • No. of CpGs in the region: "cpg_count"

The summarizing computations are backed by the Armadillo library. See https://arma.sourceforge.net/docs.html#stats_fns for futher details on the supported functions

Examples

bedfiles <- system.file("extdata", package = "iscream") |>
  list.files(pattern = "[a|b|c|d].bed.gz$", full.names = TRUE)
# examine the bedfiles
colnames <- c("chr", "start", "end", "beta", "coverage")
lapply(bedfiles, function(i) knitr::kable(read.table(i, col.names = colnames)))
#> [[1]]
#> 
#> 
#> |chr  | start| end| beta| coverage|
#> |:----|-----:|---:|----:|--------:|
#> |chr1 |     1|   2|  1.0|        1|
#> |chr1 |     3|   4|  1.0|        1|
#> |chr1 |     5|   6|  0.0|        2|
#> |chr1 |     7|   8|  0.0|        1|
#> |chr1 |     9|  10|  0.5|        2|
#> |chr1 |    11|  12|  1.0|        2|
#> |chr1 |    13|  14|  1.0|        3|
#> 
#> [[2]]
#> 
#> 
#> |chr  | start| end| beta| coverage|
#> |:----|-----:|---:|----:|--------:|
#> |chr1 |     1|   2|    0|        2|
#> |chr1 |     5|   6|    1|        2|
#> |chr1 |     7|   8|    1|        1|
#> |chr1 |    11|  12|    0|        2|
#> |chr1 |    13|  14|    1|        1|
#> 
#> [[3]]
#> 
#> 
#> |chr  | start| end| beta| coverage|
#> |:----|-----:|---:|----:|--------:|
#> |chr1 |     3|   4|    1|        2|
#> |chr1 |     7|   8|    0|        2|
#> |chr1 |     9|  10|    1|        1|
#> 
#> [[4]]
#> 
#> 
#> |chr  | start| end| beta| coverage|
#> |:----|-----:|---:|----:|--------:|
#> |chr1 |     1|   2|  1.0|        1|
#> |chr1 |     3|   4|  1.0|        2|
#> |chr1 |     7|   8|  0.0|        1|
#> |chr1 |     9|  10|  0.5|        2|
#> |chr1 |    13|  14|  1.0|        1|
#> 

# make a vector of regions
regions <- c("chr1:1-6", "chr1:7-10", "chr1:11-14")
summarize_regions(bedfiles, regions)
#> [20:33:34.652406] [iscream::summarize_regions] [info] Summarizing 3 regions from 4 bedfiles
#> [20:33:34.652421] [iscream::summarize_regions] [info] using sum, mean, median, stddev, variance, range, cpg_count
#>       Feature Sample coverage.sum M.sum coverage.mean    M.mean coverage.median
#> 1    chr1:1-6      a            4     2      1.333333 0.6666667             1.0
#> 2   chr1:7-10      a            3     1      1.500000 0.5000000             1.5
#> 3  chr1:11-14      a            5     5      2.500000 2.5000000             2.5
#> 4    chr1:1-6      b            4     2      2.000000 1.0000000             2.0
#> 5   chr1:7-10      b            1     1      1.000000 1.0000000             1.0
#> 6  chr1:11-14      b            3     1      1.500000 0.5000000             1.5
#> 7    chr1:1-6      c            2     2      2.000000 2.0000000             2.0
#> 8   chr1:7-10      c            3     1      1.500000 0.5000000             1.5
#> 9  chr1:11-14      c           NA    NA            NA        NA              NA
#> 10   chr1:1-6      d            3     3      1.500000 1.5000000             1.5
#> 11  chr1:7-10      d            3     1      1.500000 0.5000000             1.5
#> 12 chr1:11-14      d            1     1      1.000000 1.0000000             1.0
#>    M.median coverage.stddev  M.stddev coverage.variance M.variance
#> 1       1.0       0.5773503 0.5773503         0.3333333  0.3333333
#> 2       0.5       0.7071068 0.7071068         0.5000000  0.5000000
#> 3       2.5       0.7071068 0.7071068         0.5000000  0.5000000
#> 4       1.0       0.0000000 1.4142136         0.0000000  2.0000000
#> 5       1.0       0.0000000 0.0000000         0.0000000  0.0000000
#> 6       0.5       0.7071068 0.7071068         0.5000000  0.5000000
#> 7       2.0       0.0000000 0.0000000         0.0000000  0.0000000
#> 8       0.5       0.7071068 0.7071068         0.5000000  0.5000000
#> 9        NA              NA        NA                NA         NA
#> 10      1.5       0.7071068 0.7071068         0.5000000  0.5000000
#> 11      0.5       0.7071068 0.7071068         0.5000000  0.5000000
#> 12      1.0       0.0000000 0.0000000         0.0000000  0.0000000
#>    coverage.range M.range cpg_count
#> 1               1       1         3
#> 2               1       1         2
#> 3               1       1         2
#> 4               0       2         2
#> 5               0       0         1
#> 6               1       1         2
#> 7               0       0         1
#> 8               1       1         2
#> 9              NA      NA        NA
#> 10              1       1         2
#> 11              1       1         2
#> 12              0       0         1
summarize_regions(bedfiles, regions, mval = FALSE)
#> [20:33:34.665899] [iscream::summarize_regions] [info] Summarizing 3 regions from 4 bedfiles
#> [20:33:34.665914] [iscream::summarize_regions] [info] using sum, mean, median, stddev, variance, range, cpg_count
#>       Feature Sample coverage.sum beta.sum coverage.mean beta.mean
#> 1    chr1:1-6      a            4      2.0      1.333333 0.6666667
#> 2   chr1:7-10      a            3      0.5      1.500000 0.2500000
#> 3  chr1:11-14      a            5      2.0      2.500000 1.0000000
#> 4    chr1:1-6      b            4      1.0      2.000000 0.5000000
#> 5   chr1:7-10      b            1      1.0      1.000000 1.0000000
#> 6  chr1:11-14      b            3      1.0      1.500000 0.5000000
#> 7    chr1:1-6      c            2      1.0      2.000000 1.0000000
#> 8   chr1:7-10      c            3      1.0      1.500000 0.5000000
#> 9  chr1:11-14      c           NA       NA            NA        NA
#> 10   chr1:1-6      d            3      2.0      1.500000 1.0000000
#> 11  chr1:7-10      d            3      0.5      1.500000 0.2500000
#> 12 chr1:11-14      d            1      1.0      1.000000 1.0000000
#>    coverage.median beta.median coverage.stddev beta.stddev coverage.variance
#> 1              1.0        1.00       0.5773503   0.5773503         0.3333333
#> 2              1.5        0.25       0.7071068   0.3535534         0.5000000
#> 3              2.5        1.00       0.7071068   0.0000000         0.5000000
#> 4              2.0        0.50       0.0000000   0.7071068         0.0000000
#> 5              1.0        1.00       0.0000000   0.0000000         0.0000000
#> 6              1.5        0.50       0.7071068   0.7071068         0.5000000
#> 7              2.0        1.00       0.0000000   0.0000000         0.0000000
#> 8              1.5        0.50       0.7071068   0.7071068         0.5000000
#> 9               NA          NA              NA          NA                NA
#> 10             1.5        1.00       0.7071068   0.0000000         0.5000000
#> 11             1.5        0.25       0.7071068   0.3535534         0.5000000
#> 12             1.0        1.00       0.0000000   0.0000000         0.0000000
#>    beta.variance coverage.range beta.range cpg_count
#> 1      0.3333333              1        1.0         3
#> 2      0.1250000              1        0.5         2
#> 3      0.0000000              1        0.0         2
#> 4      0.5000000              0        1.0         2
#> 5      0.0000000              0        0.0         1
#> 6      0.5000000              1        1.0         2
#> 7      0.0000000              0        0.0         1
#> 8      0.5000000              1        1.0         2
#> 9             NA             NA         NA        NA
#> 10     0.0000000              1        0.0         2
#> 11     0.1250000              1        0.5         2
#> 12     0.0000000              0        0.0         1
summarize_regions(bedfiles, regions, fun = "sum")
#> [20:33:34.678399] [iscream::summarize_regions] [info] Summarizing 3 regions from 4 bedfiles
#> [20:33:34.678412] [iscream::summarize_regions] [info] using sum
#>       Feature Sample coverage.sum M.sum
#> 1    chr1:1-6      a            4     2
#> 2   chr1:7-10      a            3     1
#> 3  chr1:11-14      a            5     5
#> 4    chr1:1-6      b            4     2
#> 5   chr1:7-10      b            1     1
#> 6  chr1:11-14      b            3     1
#> 7    chr1:1-6      c            2     2
#> 8   chr1:7-10      c            3     1
#> 9  chr1:11-14      c           NA    NA
#> 10   chr1:1-6      d            3     3
#> 11  chr1:7-10      d            3     1
#> 12 chr1:11-14      d            1     1