Query records from tabixed BED files
Usage
tabix(bedfiles, regions, aligner = NULL, col.names = NULL, nthreads = NULL)
tabix_gr(
bedfiles,
regions,
aligner = NULL,
col.names = NULL,
zero_based = TRUE,
nthreads = NULL
)
tabix_raw(bedfiles, regions, nthreads = NULL)
Arguments
- bedfiles
The BED files to be queried
- regions
A vector, data frame or GenomicRanges of genomic regions. See details.
- aligner
The aligner used to produce the BED files - one of "biscuit", "bismark", "bsbolt". Will set the result data.table's column names based on this argument.
- col.names
A vector of column names for the data columns of the result.table, not including "chr", "start", and "end". Set if your BED file is not from the supported aligners or is a general BED file.
- nthreads
Set number of threads to use overriding the
"iscream.threads"
option. See?set_threads
for more information.- zero_based
Whether the input BED file has a zero-based start column - used when coverting the result data frame to GenomicRanges.
Value
tabix()
: A data frametabix_gr()
: AGRanges
object for single files andGRangesList
for multiple files. When makingGRanges
, the 0-based records from BED-files will be converted to 1-based withGenomicRanges::makeGRangesFromDataFrame()
. Bismark's coverage files will not be converted as they are already 1-based and theranges
slot will be only one position.tabix_raw()
: A named list of raw strings from the regions in the style ofRsamtools::scanTabix
Details
Query method
'iscream has two methods to query records from BED files:
the tabix shell executable: fast since its output can be redirected to a file (which
data.table::fread()
can then read) instead of having to allocate memory and store it during the queryiscream's tabix implementation, based on the tabix executable using htslib, but slower on large queries since it stores the records as they are found instead of writing to a file. However it's able to store each region's records independently instead of in a single file and is used in
make_mat()
,make_bsseq_mat()
, andsummarize_regions()
.
When iscream is attached, it checks that the tabix executable is
available with Sys.which()
and, if available, sets options("tabix.method" = "shell")
. tabix()
then uses the tabix executable to make queries,
except for tabix_raw()
. If tabix is not found, iscream uses its tabix
implementation. To use only iscream's tabix implementation, set
options("tabix.method" = "htslib")
.
Examples
bedfiles <- system.file("extdata", package = "iscream") |>
list.files(pattern = "[a|b|c|d].bed.gz$", full.names = TRUE)
regions <- c("chr1:1-6", "chr1:7-10", "chr1:11-14")
tabix(bedfiles, regions, col.names = c("beta", "coverage"))
#> chr start end beta coverage file
#> <char> <int> <int> <num> <int> <char>
#> 1: chr1 0 2 1.0 1 a
#> 2: chr1 2 4 1.0 1 a
#> 3: chr1 4 6 0.0 2 a
#> 4: chr1 6 8 0.0 1 a
#> 5: chr1 8 10 0.5 2 a
#> 6: chr1 10 12 1.0 2 a
#> 7: chr1 12 14 1.0 3 a
#> 8: chr1 0 2 0.0 2 b
#> 9: chr1 4 6 1.0 2 b
#> 10: chr1 6 8 1.0 1 b
#> 11: chr1 10 12 0.0 2 b
#> 12: chr1 12 14 1.0 1 b
#> 13: chr1 2 4 1.0 2 c
#> 14: chr1 6 8 0.0 2 c
#> 15: chr1 8 10 1.0 1 c
#> 16: chr1 0 2 1.0 1 d
#> 17: chr1 2 4 1.0 2 d
#> 18: chr1 6 8 0.0 1 d
#> 19: chr1 8 10 0.5 2 d
#> 20: chr1 12 14 1.0 1 d
#> chr start end beta coverage file
tabix_gr(bedfiles, regions, col.names = c("beta", "coverage"))
#> GRangesList object of length 4:
#> $a
#> GRanges object with 7 ranges and 3 metadata columns:
#> seqnames ranges strand | beta coverage file
#> <Rle> <IRanges> <Rle> | <numeric> <integer> <character>
#> [1] chr1 1-2 * | 1.0 1 a
#> [2] chr1 3-4 * | 1.0 1 a
#> [3] chr1 5-6 * | 0.0 2 a
#> [4] chr1 7-8 * | 0.0 1 a
#> [5] chr1 9-10 * | 0.5 2 a
#> [6] chr1 11-12 * | 1.0 2 a
#> [7] chr1 13-14 * | 1.0 3 a
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#>
#> $b
#> GRanges object with 5 ranges and 3 metadata columns:
#> seqnames ranges strand | beta coverage file
#> <Rle> <IRanges> <Rle> | <numeric> <integer> <character>
#> [1] chr1 1-2 * | 0 2 b
#> [2] chr1 5-6 * | 1 2 b
#> [3] chr1 7-8 * | 1 1 b
#> [4] chr1 11-12 * | 0 2 b
#> [5] chr1 13-14 * | 1 1 b
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#>
#> $c
#> GRanges object with 3 ranges and 3 metadata columns:
#> seqnames ranges strand | beta coverage file
#> <Rle> <IRanges> <Rle> | <numeric> <integer> <character>
#> [1] chr1 3-4 * | 1 2 c
#> [2] chr1 7-8 * | 0 2 c
#> [3] chr1 9-10 * | 1 1 c
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#>
#> $d
#> GRanges object with 5 ranges and 3 metadata columns:
#> seqnames ranges strand | beta coverage file
#> <Rle> <IRanges> <Rle> | <numeric> <integer> <character>
#> [1] chr1 1-2 * | 1.0 1 d
#> [2] chr1 3-4 * | 1.0 2 d
#> [3] chr1 7-8 * | 0.0 1 d
#> [4] chr1 9-10 * | 0.5 2 d
#> [5] chr1 13-14 * | 1.0 1 d
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#>
tabix_raw(bedfiles, regions)
#> $a
#> $a$`chr1:1-6`
#> [1] "chr1\t0\t2\t1.000\t1" "chr1\t2\t4\t1.000\t1" "chr1\t4\t6\t0.000\t2"
#>
#> $a$`chr1:7-10`
#> [1] "chr1\t6\t8\t0.000\t1" "chr1\t8\t10\t0.500\t2"
#>
#> $a$`chr1:11-14`
#> [1] "chr1\t10\t12\t1.000\t2" "chr1\t12\t14\t1.000\t3"
#>
#>
#> $b
#> $b$`chr1:1-6`
#> [1] "chr1\t0\t2\t0.000\t2" "chr1\t4\t6\t1.000\t2"
#>
#> $b$`chr1:7-10`
#> [1] "chr1\t6\t8\t1.000\t1"
#>
#> $b$`chr1:11-14`
#> [1] "chr1\t10\t12\t0.000\t2" "chr1\t12\t14\t1.000\t1"
#>
#>
#> $c
#> $c$`chr1:1-6`
#> [1] "chr1\t2\t4\t1.000\t2"
#>
#> $c$`chr1:7-10`
#> [1] "chr1\t6\t8\t0.000\t2" "chr1\t8\t10\t1.000\t1"
#>
#> $c$`chr1:11-14`
#> character(0)
#>
#>
#> $d
#> $d$`chr1:1-6`
#> [1] "chr1\t0\t2\t1.000\t1" "chr1\t2\t4\t1.000\t2"
#>
#> $d$`chr1:7-10`
#> [1] "chr1\t6\t8\t0.000\t1" "chr1\t8\t10\t0.500\t2"
#>
#> $d$`chr1:11-14`
#> [1] "chr1\t12\t14\t1.000\t1"
#>
#>