Resource Requirements

Tranquillyzer uses a GPU to annotate reads. This page helps you figure out whether your GPU has enough memory for your data, explains how Tranquillyzer automatically manages GPU resources, and gives practical advice for tuning performance.

How Much GPU Memory Do I Need?

The main factor is read length: longer reads require more GPU memory. The table below shows the longest read each GPU can handle. If the longest reads in your dataset are shorter than the limit listed for your GPU, you are good to go.

These limits assume XLA is enabled, --vram-headroom 0.35 (the default), and the default model architecture (10x3p_sc_ont_012). “Empirically tested” means the model was directly observed to run at that read length; “Theoretical” values are extrapolated from the empirical measurement (see How Feasibility Limits Are Computed).
| GPU         | VRAM (GB) | Max Read Length (bp) | Status             |
|-------------|-----------|----------------------|--------------------|
| NVIDIA T4   | 16        | ~142,000             | Theoretical        |
| NVIDIA L4   | 24        | ~212,000             | Theoretical        |
| NVIDIA A10  | 24        | ~212,000             | Theoretical        |
| RTX 3090    | 24        | ~212,000             | Theoretical        |
| RTX 4090    | 24        | ~212,000             | Theoretical        |
| NVIDIA A100 | 40        | ~354,000             | Theoretical        |
| NVIDIA L40S | 48        | ~425,000             | Empirically tested |
| NVIDIA A100 | 80        | ~708,000             | Theoretical        |
| NVIDIA H100 | 80        | ~708,000             | Theoretical        |

A few things to keep in mind:

  • These limits tell you whether a read can be processed, not how fast it will be. Reads near the limit will work but may be slow.
  • Adding more GPUs makes processing faster (more reads in parallel), but does not increase the maximum read length each GPU can handle. That is determined by the memory on each individual GPU.
  • If you train a custom model with a larger architecture (more LSTM layers, attention heads, etc.), these limits will decrease.

What if my GPU is not listed?

You can estimate the maximum read length for any GPU using this rule of thumb:

\[ L_{\max} \approx \frac{\text{VRAM (GB)} \times 0.65}{73.4~\text{KB/bp}} \]

For example, a GPU with 32 GB of VRAM: \(L_{\max} \approx \frac{32 \times 0.65}{0.0000734} \approx 283{,}000~\text{bp}\).

The 0.65 factor comes from the default --vram-headroom of 0.35 (only ~65% of memory is used, the rest is kept as a safety buffer). The 73.4 KB/bp constant is derived from empirical testing on NVIDIA L40S GPUs — see Technical Details for the full derivation.
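The rule of thumb can be written as a small helper. This is a hypothetical function (not part of the Tranquillyzer CLI); the 73.4 KB/bp constant and the 0.35 default headroom are the values quoted above:

```python
def estimate_max_read_length(vram_gb: float, headroom: float = 0.35,
                             kb_per_bp: float = 73.4) -> int:
    """Estimate the longest read (in bp) a GPU can process."""
    usable_kb = vram_gb * 1e6 * (1 - headroom)  # GB -> KB, minus headroom
    return int(usable_kb / kb_per_bp)
```

For the 32 GB example above, `estimate_max_read_length(32)` gives roughly 283,000 bp.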

How Tranquillyzer Uses the GPU

You generally do not need to configure GPU settings manually. Tranquillyzer automatically:

  1. Groups reads by length, so reads in each batch are a similar size
  2. Picks a batch size that fits in your GPU memory
  3. Recovers gracefully if it guesses too high (automatic retry with smaller batches)

The sections below explain each step in more detail.

Grouping Reads by Length

Length bins

During preprocessing, reads are sorted into length bins (e.g., 0-499 bp, 500-999 bp, …). Bin widths are not uniform across the full length range: below --adaptive-bin-threshold (default 10,000 bp), the user-specified --bin-size determines bin width; above it, fixed coarse tiers apply (5,000 / 10,000 / 25,000 bp for the 10k–50k, 50k–100k, and 100k+ ranges respectively). Additionally, when --min-reads-per-bin is enabled during preprocessing, adjacent sparsely populated bins are merged, so actual bin boundaries at annotation time may be wider than the nominal widths. Within each bin, all reads are padded to the same length so they can be processed together as a batch. The padded length is:

\[ L = B + 1 + P_{\text{conv}} \]

where \(B\) is the upper bound of the bin and \(P_{\text{conv}}\) is extra padding required by the model’s convolutional layers (specifically, the one-sided receptive field of the largest dilated convolution):

\[ P_{\text{conv}} = \left\lfloor \frac{\max_i\bigl((k_i - 1) \times d_i\bigr)}{2} \right\rfloor \]

For the default model (conv_kernel_sizes = [25, 25, 25], dilation_rates = [1, 3, 5]), this works out to \(P_{\text{conv}} = 60\). This padding ensures the model can see enough context to correctly label bases near the end of each read.
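The padded-length rule above can be sketched in a few lines. Function names are illustrative, not Tranquillyzer's internal API:

```python
def conv_padding(kernel_sizes, dilation_rates):
    """One-sided receptive field of the largest dilated convolution."""
    return max((k - 1) * d for k, d in zip(kernel_sizes, dilation_rates)) // 2

def padded_length(bin_upper_bound,
                  kernel_sizes=(25, 25, 25), dilation_rates=(1, 3, 5)):
    """Padded length L = B + 1 + P_conv for a length bin."""
    return bin_upper_bound + 1 + conv_padding(kernel_sizes, dilation_rates)
```

With the default model, `conv_padding((25, 25, 25), (1, 3, 5))` is 60, so a 500-999 bp bin is padded to 999 + 1 + 60 = 1,060 positions.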

Chunks

Each length bin may contain thousands or millions of reads, so Tranquillyzer processes them in chunks — manageable groups that are loaded into memory one at a time. The chunk size adapts to read length:

\[ \text{chunk size} = \min\!\left(\left\lfloor \text{base chunk size} \times \frac{500}{\text{avg read length in bin}} \right\rfloor,\; 500{,}000\right) \]

The idea is simple: short reads use less memory per read, so more of them can fit in a chunk. Long reads use more memory, so fewer are loaded at once. The base chunk size is controlled by --chunk-size (default: 100,000). The hard cap of 500,000 prevents any single chunk from using too much memory, even for very short reads.
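As a sketch (hypothetical function mirroring the formula above, with the defaults quoted on this page):

```python
def chunk_size(avg_read_length, base_chunk_size=100_000, cap=500_000):
    """Adaptive chunk size: shorter reads -> more reads per chunk."""
    return min(int(base_chunk_size * 500 / avg_read_length), cap)
```

A bin averaging 5,000 bp gets chunks of 10,000 reads, while very short reads (e.g., 50 bp) hit the 500,000-read cap.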

Choosing How Many Reads to Process at Once

Within each chunk, Tranquillyzer decides how many reads to send to the GPU in a single batch. It runs two checks and picks the smaller (safer) value:

\[ B_{\text{per GPU}} = \min(B_{\text{tokens}}, B_{\text{mem}}) \]

Check 1: Token budget

The first check limits the total amount of “work” per batch. A token is one position in a padded read (roughly one DNA base). The token budget controls how many total tokens are processed at once:

\[ B_{\text{tokens}} = \left\lfloor \frac{T}{L} \right\rfloor \]

where \(T\) is the token budget (--target-tokens, default 1,200,000) and \(L\) is the padded read length. For example, with 1,200,000 tokens and a padded length of 1,000 bp, the batch size would be 1,200 reads.

When to adjust: If you have a high-memory GPU and want larger batches for faster processing, increase --target-tokens. If you are hitting memory errors, decrease it.
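In code, the token-budget check is a single integer division (sketch; the function name is ours):

```python
def token_budget_batch(padded_length, target_tokens=1_200_000):
    """Check 1: cap the batch so batch_size * padded_length <= target_tokens."""
    return target_tokens // padded_length
```

`token_budget_batch(1000)` returns 1,200, matching the example above.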

Check 2: Memory estimate

The second check estimates whether the batch will actually fit in GPU memory, based on the model architecture. Each token (each position in a read) consumes a certain amount of GPU memory as it passes through the model’s layers. Tranquillyzer estimates this bytes per token value by adding up the memory used at each layer:

\[ \text{bpt} = \bigl(\underbrace{4}_{\text{input}} + \underbrace{d_e \cdot 4}_{\text{embedding}} + \underbrace{C_f \cdot 4}_{\text{conv}} + \underbrace{l_u \cdot d_{\text{bi}} \cdot 4}_{\text{LSTM out}} + \underbrace{l_u \cdot 4 \cdot d_{\text{bi}} \cdot 4}_{\text{LSTM gates}} + \underbrace{n_l \cdot 4}_{\text{CRF/output}}\bigr) \times \alpha \]

Each term corresponds to a layer of the model:

  • Input (4 bytes): the encoded DNA base (stored as an integer).
  • Embedding (\(d_e \times 4\)): the learned vector representation of each base (\(d_e\) = 128 by default).
  • Conv activations (\(C_f \times 4\)): outputs from the convolutional layers. \(C_f\) is the total filter count across all conv layers (128 filters \(\times\) 3 layers = 384 for the default model).
  • LSTM outputs (\(l_u \times d_{\text{bi}} \times 4\)): outputs from the recurrent layer (\(l_u\) = 96 units, \(d_{\text{bi}}\) = 2 for bidirectional).
  • LSTM gates (\(l_u \times 4 \times d_{\text{bi}} \times 4\)): internal memory used by the four LSTM gates (input, forget, cell, output).
  • CRF / output (\(n_l \times 4\)): the final prediction scores for each label (\(n_l \approx\) 10 labels).
  • Overhead (\(\alpha\) = 1.12): a 12% multiplier to account for TensorFlow’s internal bookkeeping.

For the default model, this works out to approximately 6,644 bytes per token (see worked example).
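The estimate can be reproduced with a few lines of Python. Defaults mirror the architecture values quoted on this page; the function name is ours:

```python
def bytes_per_token(embedding_dim=128, conv_filters=128, conv_layers=3,
                    lstm_units=96, bidirectional=True, num_labels=10,
                    overhead=1.12):
    """Per-token activation footprint in bytes (float32 = 4 bytes)."""
    d_bi = 2 if bidirectional else 1
    conv_total = conv_filters * conv_layers       # C_f = 384 by default
    raw = (4                              # input: one int32 per base
           + embedding_dim * 4            # embedding vector
           + conv_total * 4               # conv activations
           + lstm_units * d_bi * 4        # LSTM outputs
           + lstm_units * 4 * d_bi * 4    # four LSTM gates, both directions
           + num_labels * 4)              # CRF / output scores
    return raw * overhead
```

With the defaults, the subtotal is 5,932 bytes and the overhead-adjusted total is ≈6,644 bytes per token.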

The batch size from the memory check is then:

\[ B_{\text{mem}} = \left\lfloor \frac{U}{\text{bpt} \times L} \right\rfloor \]

where \(U\) is the usable GPU memory and \(L\) is the padded read length. If you have multiple GPUs with different amounts of memory, Tranquillyzer uses the limit from the smallest GPU (since every GPU must be able to handle its share of the batch).
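Putting the two checks together (a sketch under the defaults quoted on this page; `usable_bytes` is \(U\) and `bpt` is the bytes-per-token estimate):

```python
def memory_batch(usable_bytes, bpt, padded_length):
    """Check 2: how many reads of this padded length fit in usable memory."""
    return int(usable_bytes // (bpt * padded_length))

def per_gpu_batch(padded_length, usable_bytes, bpt=6644,
                  target_tokens=1_200_000, min_batch=1, max_batch=8_192):
    """B_per_GPU = min(B_tokens, B_mem), clamped to the allowed range."""
    b = min(target_tokens // padded_length,
            memory_batch(usable_bytes, bpt, padded_length))
    return max(min_batch, min(b, max_batch))
```

For a 1,000 bp padded length on a 12 GB GPU (≈7.8 GB usable), the memory check (≈1,173 reads) is slightly tighter than the token check (1,200 reads), so it wins.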

How usable memory is calculated

Tranquillyzer does not assume it can use all of your GPU memory. It reserves a fraction as a safety buffer (controlled by --vram-headroom, default 0.35) and accounts for memory that TensorFlow has already allocated:

\[ U = \max\!\bigl(0,\; (\text{total GPU budget} - \text{already in use}) \times (1 - \text{headroom})\bigr) \]

With the default headroom of 0.35, about 65% of the remaining GPU memory is actually used for inference. You can tell Tranquillyzer how much VRAM your GPU has via --gpu-mem (default: 12 GB if not specified).
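The usable-memory rule as a one-line sketch (illustrative name, defaults from this page):

```python
def usable_memory(total_bytes, in_use_bytes=0, headroom=0.35):
    """Memory available for batches after the safety buffer is reserved."""
    return max(0.0, (total_bytes - in_use_bytes) * (1 - headroom))
```

`usable_memory(12e9)` gives ≈7.8 GB, i.e., 65% of the default 12 GB budget.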

Scaling Across Multiple GPUs

When Tranquillyzer detects more than one GPU, it automatically distributes work across all of them using TensorFlow’s MirroredStrategy. Each GPU processes its own portion of the batch in parallel:

\[ B_{\text{global}} = B_{\text{per GPU}} \times \text{number of GPUs} \]

For example, with 4 GPUs and a per-GPU batch size of 300, the global batch size is 1,200 — meaning 1,200 reads are processed simultaneously (300 on each GPU).

If your GPUs have different amounts of memory, you can specify each one’s VRAM budget individually with a comma-separated list: --gpu-mem 48,48,24. If you pass a single number (e.g., --gpu-mem 48), it is applied to all GPUs. If you don’t specify --gpu-mem at all, 12 GB per GPU is assumed.

What Happens If the GPU Runs Out of Memory

If a batch turns out to be too large and the GPU runs out of memory (OOM), Tranquillyzer automatically recovers:

  1. It clears the GPU memory and triggers garbage collection.
  2. It cuts the batch size in half.
  3. It rebuilds the model if needed (to restore multi-GPU state).
  4. It retries with the smaller batch.

This process repeats until it finds a batch size that works (down to --min-batch-size, default 1). Your run will not crash from an OOM error — but repeated recovery cycles are slow, so it is better to tune your settings proactively (see Tuning Performance).
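The recovery loop amounts to halve-and-retry. A minimal sketch, using Python's built-in MemoryError as a stand-in for TensorFlow's OOM exception (the real implementation also clears GPU memory and rebuilds the model between attempts):

```python
def infer_with_oom_recovery(run_batch, batch_size, min_batch_size=1):
    """Retry run_batch with progressively halved batch sizes on OOM."""
    while batch_size >= min_batch_size:
        try:
            return run_batch(batch_size)
        except MemoryError:
            # Real implementation: free GPU memory, garbage-collect,
            # and rebuild the model here before retrying.
            batch_size //= 2
    raise MemoryError("no feasible batch size >= min_batch_size")
```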

Running Without a GPU

Tranquillyzer can run without a GPU, but this is not recommended for production datasets — CPU-only inference is orders of magnitude slower and impractical for large sequencing runs. CPU mode exists primarily for testing, development, and small exploratory analyses. When no GPU is detected:

  • Batch sizes are determined by the token budget only (no memory estimation is needed since there is no GPU memory to manage).
  • The batch size is capped at 32 reads to keep memory usage predictable on CPU.
  • No multi-device parallelism is used.

Tuning Performance

Quick Reference: CLI Flags

These flags are available for annotate-reads. A subset is also available for visualize and train-model.
| Flag | Default | What it does | When to change it |
|------|---------|--------------|-------------------|
| --gpu-mem | 12 GB | Tells Tranquillyzer how much VRAM your GPU has | Always set this to your actual VRAM for the best results |
| --target-tokens | 1,200,000 | Controls batch size via a token budget | Increase for faster processing on high-VRAM GPUs; decrease if hitting OOM |
| --vram-headroom | 0.35 | Fraction of GPU memory kept as a safety buffer | Increase (e.g., 0.50) if you get OOM errors; decrease (e.g., 0.20) for more speed |
| --token-cap-above | 0 (off) | Enables aggressive batching for short reads | Set to a length threshold (e.g., 5000) to speed up short-read bins |
| --min-batch-size | 1 | Smallest allowed batch size per GPU | Rarely needs changing |
| --max-batch-size | 8,192 | Largest allowed batch size per GPU | Increase on high-VRAM GPUs to allow larger batches for short-read bins (default is 2,000 for visualize and train-model) |
| --chunk-size | 100,000 | Base number of reads loaded into memory at once | Rarely needs changing; adapts automatically to read length |
| --threads | 12 | CPU threads for barcode correction, demux, and I/O | Increase if you have many CPU cores; decrease on shared machines |

Common Scenarios

Resolving Out-of-Memory Errors

Try these steps in order:

  1. Set --gpu-mem to your actual VRAM. For example, if you have an NVIDIA T4 (16 GB), use --gpu-mem 16. The default of 12 GB may cause Tranquillyzer to overestimate what fits in memory on smaller GPUs.
  2. Increase --vram-headroom from 0.35 to 0.50. This reserves more memory as a safety buffer, leaving less for batches but reducing OOM risk.
  3. Decrease --target-tokens from 1,200,000 to 800,000 or lower. This directly reduces batch sizes.
  4. If none of the above help, note that the automatic OOM recovery (see above) will eventually find a working batch size — but it is slow. Proactive tuning avoids this overhead.

Maximizing Throughput

  • Set --gpu-mem to your actual VRAM — this is the single most important setting. If you have a 48 GB GPU but leave the default at 12, Tranquillyzer will use only a fraction of your GPU’s capacity.
  • Lower --vram-headroom from 0.35 to 0.20 if you are not seeing any OOM errors. This lets Tranquillyzer use more of your GPU memory for larger batches.
  • Set --token-cap-above to a threshold like 5,000. This tells Tranquillyzer to use full GPU capacity (without the token budget constraint) for all bins with reads shorter than 5,000 bp — where OOM is unlikely. Longer bins still use the safer, dual-constraint approach.
  • Increase --target-tokens on high-VRAM GPUs (e.g., 1,500,000 or higher on A100/H100) to allow larger batches for longer reads.

Heterogeneous Multi-GPU Setups

Use --gpu-mem with a comma-separated list matching your GPUs. For example, if you have two 48 GB GPUs and one 24 GB GPU:

--gpu-mem 48,48,24

Tranquillyzer will compute the per-GPU batch size based on the smallest GPU (24 GB in this case), so all GPUs can handle their share of the work.

Technical Details

This section provides the mathematical derivations behind the estimates and algorithms described above. It is optional reading — you do not need to understand these details to use Tranquillyzer effectively.

Model Configuration Used for Estimates

All estimates on this page correspond to the default Tranquillyzer CNN-BiLSTM-CRF annotation model (10x3p_sc_ont_012) with the following architecture:

| Parameter | Value |
|-----------|-------|
| embedding_dim | 128 |
| conv_layers | 3 |
| conv_filters | 128 |
| conv_kernel_sizes | [25, 25, 25] |
| dilation_rates | [1, 3, 5] |
| lstm_layers | 1 |
| lstm_units | 96 |
| bidirectional | TRUE |
| CRF | TRUE |
| attention_heads | 0 |

How Feasibility Limits Are Computed

The GPU feasibility limits in the table above are derived from a single empirical measurement and then scaled to other GPUs.

Empirical anchor

On a system with 4x NVIDIA L40S GPUs (48 GB VRAM each) and XLA enabled, Tranquillyzer (model 10x3p_sc_ont_012) successfully processed reads of approximately 425 kb at a per-GPU batch size of 1.

Scaling to other GPUs

With the default headroom of 0.35, the usable memory on the L40S is:

\[ U \approx 48~\text{GB} \times (1 - 0.35) \approx 31.2~\text{GB} \]

Dividing by the maximum read length gives an effective memory cost per base:

\[ k \approx \frac{31.2~\text{GB}}{425{,}000~\text{bp}} \approx 73.4~\text{KB/bp} \]

For any GPU, the maximum read length is then approximately:

\[ L_{\max} \approx \frac{\text{VRAM} \times (1 - H)}{k} \]

where \(H\) is the headroom fraction (default 0.35).
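The anchor-and-scale derivation can be checked numerically. The constants below are the ones quoted above (L40S anchor: 48 GB VRAM, 425 kb maximum read, headroom 0.35):

```python
HEADROOM = 0.35
usable_gb = 48 * (1 - HEADROOM)        # ~31.2 GB usable on an L40S
kb_per_bp = usable_gb * 1e6 / 425_000  # ~73.4 KB per base

def max_read_length(vram_gb, headroom=HEADROOM):
    """Feasibility limit for any GPU, scaled from the L40S anchor."""
    return vram_gb * (1 - headroom) * 1e6 / kb_per_bp
```

`max_read_length(80)` gives ≈708,000 bp, matching the 80 GB rows in the feasibility table.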

Why empirical scaling rather than analytical calculation?

GPU memory usage during inference is not just the model activations. It also includes TensorFlow/XLA compilation buffers, memory allocator fragmentation, LSTM internal state, CRF decoding workspace, and other overhead that is difficult to predict analytically. Anchoring to a real measurement captures all of these factors, giving more reliable estimates — especially for ultra-long reads where these overheads become significant.

Bytes-Per-Token Estimation Formula

The per-token memory estimate used in batch sizing sums the activation footprint across each layer of the model:

\[ \text{bpt} = \bigl(\underbrace{4}_{\text{input}} + \underbrace{d_e \cdot 4}_{\text{embedding}} + \underbrace{C_f \cdot 4}_{\text{conv}} + \underbrace{l_u \cdot d_{\text{bi}} \cdot 4}_{\text{LSTM out}} + \underbrace{l_u \cdot 4 \cdot d_{\text{bi}} \cdot 4}_{\text{LSTM gates}} + \underbrace{n_l \cdot 4}_{\text{CRF/output}}\bigr) \times \alpha \]

| Term | Formula | Default value | Description |
|------|---------|---------------|-------------|
| Input | 4 | 4 | One int32 per position |
| Embedding | \(d_e \times 4\) | 512 | Float32 embedding vector |
| Conv | \(C_f \times 4\) | 1,536 | Conv filter outputs (\(C_f\) = filters \(\times\) layers = 128 \(\times\) 3) |
| LSTM out | \(l_u \times d_{\text{bi}} \times 4\) | 768 | Recurrent layer outputs (96 units \(\times\) 2 directions) |
| LSTM gates | \(l_u \times 4 \times d_{\text{bi}} \times 4\) | 3,072 | Four LSTM gates, both directions |
| CRF/output | \(n_l \times 4\) | 40 | Prediction scores for ~10 labels |
| Subtotal | | 5,932 | |
| Overhead | \(\alpha = 1.12\) | +12% | TensorFlow bookkeeping |
| Total bpt | | ~6,644 | Bytes per token |

Usable Memory at Runtime

When Tranquillyzer is running, it queries TensorFlow to find out how much GPU memory is already in use and calculates the usable portion:

\[ U = \max\!\bigl(0,\; (V_{\text{total}} - V_{\text{in use}}) \times (1 - H)\bigr) \]

  • \(V_{\text{total}}\) = the GPU memory budget you specified via --gpu-mem (default 12 GB)
  • \(V_{\text{in use}}\) = memory already allocated by TensorFlow on that GPU
  • \(H\) = headroom fraction (--vram-headroom, default 0.35)

Before a model is loaded, \(V_{\text{in use}} = 0\), which gives the simpler formula \(U \approx V_{\text{total}} \times (1 - H)\) used for the feasibility estimates.

Two-Tier Batching with --token-cap-above

By default, Tranquillyzer uses the conservative \(\min(B_{\text{tokens}}, B_{\text{mem}})\) for every length bin. The --token-cap-above flag enables a two-tier strategy:

  • Short-read bins (padded length < threshold): batch size is set by GPU memory capacity only (\(B_{\text{mem}}\)), ignoring the token budget. This allows larger batches because short reads rarely approach OOM limits.
  • Long-read bins (padded length \(\geq\) threshold): batch size uses the safer \(\min(B_{\text{tokens}}, B_{\text{mem}})\) to protect against OOM.

For example, --token-cap-above 5000 uses aggressive GPU-capacity-only batching for bins up to 5,000 bp, while longer bins use the dual-constraint approach.
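The decision can be sketched as follows (hypothetical function; `b_tokens` and `b_mem` are the two batch-size checks defined earlier on this page):

```python
def two_tier_batch(padded_length, b_tokens, b_mem, token_cap_above=0):
    """Pick the per-GPU batch size under the --token-cap-above strategy.

    token_cap_above == 0 (the default) disables the two-tier behavior.
    """
    if token_cap_above and padded_length < token_cap_above:
        return b_mem              # short reads: memory capacity only
    return min(b_tokens, b_mem)   # long reads: conservative dual check
```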