Skip to contents

Query records from tabixed BED files

Usage

tabix(bedfiles, regions, aligner = NULL, col.names = NULL, nthreads = NULL)

tabix_gr(
  bedfiles,
  regions,
  aligner = NULL,
  col.names = NULL,
  zero_based = TRUE,
  nthreads = NULL
)

tabix_raw(bedfiles, regions, nthreads = NULL)

Arguments

bedfiles

The BED files to be queried

regions

A vector, data frame or GenomicRanges of genomic regions. See details.

aligner

The aligner used to produce the BED files - one of "biscuit", "bismark", "bsbolt". Will set the result data.table's column names based on this argument.

col.names

A vector of column names for the data columns of the result.table, not including "chr", "start", and "end". Set if your BED file is not from the supported aligners or is a general BED file.

nthreads

Set number of threads to use overriding the "iscream.threads" option. See ?set_threads for more information.

zero_based

Whether the input BED file has a zero-based start column - used when coverting the result data frame to GenomicRanges.

Value

  • tabix(): A data frame

  • tabix_gr(): A GRanges object for single files and GRangesList for multiple files. When making GRanges, the 0-based records from BED-files will be converted to 1-based with GenomicRanges::makeGRangesFromDataFrame(). Bismark's coverage files will not be converted as they are already 1-based and the ranges slot will be only one position.

  • tabix_raw(): A named list of raw strings from the regions in the style of Rsamtools::scanTabix

Details

Query method

'iscream has two methods to query records from BED files:

  • the tabix shell executable: fast since its output can be redirected to a file (which data.table::fread() can then read) instead of having to allocate memory and store it during the query

  • iscream's tabix implementation, based on the tabix executable using htslib, but slower on large queries since it stores the records as they are found instead of writing to a file. However it's able to store each region's records independently instead of in a single file and is used in make_mat(), make_bsseq_mat(), and summarize_regions().

When iscream is attached, it checks that the tabix executable is available with Sys.which() and, if available, sets options("tabix.method" = "shell"). tabix() then uses the tabix executable to make queries, except for tabix_raw(). If tabix is not found, iscream uses its tabix implementation. To use only iscream's tabix implementation, set options("tabix.method" = "htslib").

Input region formats

The input regions formats may be string vector in the form "chr:start-end", a dataframe with "chr", "start" and "end" columns or a GRanges object. Input regions must be 1-based.

Examples

bedfiles <- system.file("extdata", package = "iscream") |>
  list.files(pattern = "[a|b|c|d].bed.gz$", full.names = TRUE)
regions <- c("chr1:1-6", "chr1:7-10", "chr1:11-14")
tabix(bedfiles, regions, col.names = c("beta", "coverage"))
#>        chr start   end  beta coverage   file
#>     <char> <int> <int> <num>    <int> <char>
#>  1:   chr1     0     2   1.0        1      a
#>  2:   chr1     2     4   1.0        1      a
#>  3:   chr1     4     6   0.0        2      a
#>  4:   chr1     6     8   0.0        1      a
#>  5:   chr1     8    10   0.5        2      a
#>  6:   chr1    10    12   1.0        2      a
#>  7:   chr1    12    14   1.0        3      a
#>  8:   chr1     0     2   0.0        2      b
#>  9:   chr1     4     6   1.0        2      b
#> 10:   chr1     6     8   1.0        1      b
#> 11:   chr1    10    12   0.0        2      b
#> 12:   chr1    12    14   1.0        1      b
#> 13:   chr1     2     4   1.0        2      c
#> 14:   chr1     6     8   0.0        2      c
#> 15:   chr1     8    10   1.0        1      c
#> 16:   chr1     0     2   1.0        1      d
#> 17:   chr1     2     4   1.0        2      d
#> 18:   chr1     6     8   0.0        1      d
#> 19:   chr1     8    10   0.5        2      d
#> 20:   chr1    12    14   1.0        1      d
#>        chr start   end  beta coverage   file
tabix_gr(bedfiles, regions, col.names = c("beta", "coverage"))
#> GRangesList object of length 4:
#> $a
#> GRanges object with 7 ranges and 3 metadata columns:
#>       seqnames    ranges strand |      beta  coverage        file
#>          <Rle> <IRanges>  <Rle> | <numeric> <integer> <character>
#>   [1]     chr1       1-2      * |       1.0         1           a
#>   [2]     chr1       3-4      * |       1.0         1           a
#>   [3]     chr1       5-6      * |       0.0         2           a
#>   [4]     chr1       7-8      * |       0.0         1           a
#>   [5]     chr1      9-10      * |       0.5         2           a
#>   [6]     chr1     11-12      * |       1.0         2           a
#>   [7]     chr1     13-14      * |       1.0         3           a
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $b
#> GRanges object with 5 ranges and 3 metadata columns:
#>       seqnames    ranges strand |      beta  coverage        file
#>          <Rle> <IRanges>  <Rle> | <numeric> <integer> <character>
#>   [1]     chr1       1-2      * |         0         2           b
#>   [2]     chr1       5-6      * |         1         2           b
#>   [3]     chr1       7-8      * |         1         1           b
#>   [4]     chr1     11-12      * |         0         2           b
#>   [5]     chr1     13-14      * |         1         1           b
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $c
#> GRanges object with 3 ranges and 3 metadata columns:
#>       seqnames    ranges strand |      beta  coverage        file
#>          <Rle> <IRanges>  <Rle> | <numeric> <integer> <character>
#>   [1]     chr1       3-4      * |         1         2           c
#>   [2]     chr1       7-8      * |         0         2           c
#>   [3]     chr1      9-10      * |         1         1           c
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $d
#> GRanges object with 5 ranges and 3 metadata columns:
#>       seqnames    ranges strand |      beta  coverage        file
#>          <Rle> <IRanges>  <Rle> | <numeric> <integer> <character>
#>   [1]     chr1       1-2      * |       1.0         1           d
#>   [2]     chr1       3-4      * |       1.0         2           d
#>   [3]     chr1       7-8      * |       0.0         1           d
#>   [4]     chr1      9-10      * |       0.5         2           d
#>   [5]     chr1     13-14      * |       1.0         1           d
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
tabix_raw(bedfiles, regions)
#> $a
#> $a$`chr1:1-6`
#> [1] "chr1\t0\t2\t1.000\t1" "chr1\t2\t4\t1.000\t1" "chr1\t4\t6\t0.000\t2"
#> 
#> $a$`chr1:7-10`
#> [1] "chr1\t6\t8\t0.000\t1"  "chr1\t8\t10\t0.500\t2"
#> 
#> $a$`chr1:11-14`
#> [1] "chr1\t10\t12\t1.000\t2" "chr1\t12\t14\t1.000\t3"
#> 
#> 
#> $b
#> $b$`chr1:1-6`
#> [1] "chr1\t0\t2\t0.000\t2" "chr1\t4\t6\t1.000\t2"
#> 
#> $b$`chr1:7-10`
#> [1] "chr1\t6\t8\t1.000\t1"
#> 
#> $b$`chr1:11-14`
#> [1] "chr1\t10\t12\t0.000\t2" "chr1\t12\t14\t1.000\t1"
#> 
#> 
#> $c
#> $c$`chr1:1-6`
#> [1] "chr1\t2\t4\t1.000\t2"
#> 
#> $c$`chr1:7-10`
#> [1] "chr1\t6\t8\t0.000\t2"  "chr1\t8\t10\t1.000\t1"
#> 
#> $c$`chr1:11-14`
#> character(0)
#> 
#> 
#> $d
#> $d$`chr1:1-6`
#> [1] "chr1\t0\t2\t1.000\t1" "chr1\t2\t4\t1.000\t2"
#> 
#> $d$`chr1:7-10`
#> [1] "chr1\t6\t8\t0.000\t1"  "chr1\t8\t10\t0.500\t2"
#> 
#> $d$`chr1:11-14`
#> [1] "chr1\t12\t14\t1.000\t1"
#> 
#>