Resource Requirements

Tranquillyzer uses a GPU to annotate reads. This page helps you figure out whether your GPU has enough memory for your data, explains how Tranquillyzer automatically manages GPU resources, and gives practical advice for tuning performance.

GPU Memory Requirement?

The main factor is read length: longer reads require more GPU memory. The table below shows the longest read each GPU can handle. If the longest reads in your dataset are shorter than the limit listed for your GPU, you are good to go.

Tentative figures — pending per-GPU benchmarking

The numbers below are based on a single empirical run on 4× NVIDIA L40S using 10x3p_sc_ont_016. The L40S entry is directly measured; all other entries are theoretical scalings using the model’s per-token memory cost. Per-GPU empirical recommendations will be added as we benchmark each architecture; if you have results from your own runs, please open an issue.

These limits assume `--vram-headroom 0.35` (the default) and the architecture of `10x3p_sc_ont_016` (the current default model). **Empirically tested** corresponds to the largest read length directly observed in a production run on that GPU; the actual ceiling is likely higher. **Theoretical** values are extrapolated from the empirical anchor using the model’s per-token memory cost (see How feasibility limits are computed).
GPU	VRAM (GB)	Max Read Length (bp)	Status
NVIDIA T4	16	~323,000	Theoretical
NVIDIA L4	24	~484,000	Theoretical
NVIDIA A10	24	~484,000	Theoretical
RTX 3090	24	~484,000	Theoretical
RTX 4090	24	~484,000	Theoretical
NVIDIA A100	40	~807,000	Theoretical
NVIDIA L40S	48	~525,000	Empirically tested
NVIDIA A100	80	~1,615,000	Theoretical
NVIDIA H100	80	~1,615,000	Theoretical

A few things to keep in mind:

These limits tell you whether a read can be processed, not how fast it will be. Reads near the limit will work but may be slow.
Adding more GPUs makes processing faster (more reads in parallel), but does not increase the maximum read length each GPU can handle. That is determined by the memory on each individual GPU.
If you train a custom model with a larger architecture (more LSTM layers, attention heads, etc.), these limits will decrease.

What if my GPU is not listed?

You can estimate the maximum read length for any GPU using this rule of thumb:

\[ L_{\max} \approx \frac{\text{VRAM (GB)} \times 0.65}{32.2~\text{KB}} \]

For example, a GPU with 32 GB of VRAM: \(L_{\max} \approx \frac{32 \times 0.65}{0.0000322} \approx 646{,}000~\text{bp}\).

The 0.65 factor comes from the default --vram-headroom of 0.35 (only ~65% of memory is used, the rest is kept as a safety buffer). The 32.2 KB/bp constant is derived by scaling an L40S empirical measurement by the per-token memory cost of 10x3p_sc_ont_016 — see Technical Details for the full derivation.

How Tranquillyzer Uses GPU

You generally do not need to configure GPU settings manually. Tranquillyzer automatically:

Groups reads by length, so reads in each batch are a similar size
Picks a batch size that fits in your GPU memory
Recovers gracefully if it guesses too high (automatic retry with smaller batches)

The sections below explain each step in more detail.

Grouping Reads by Length

Length bins

During preprocessing, reads are sorted into length bins (e.g., 0-499 bp, 500-999 bp, …). Bin widths are not uniform across the full length range: below --adaptive-bin-threshold (default 10,000 bp), the user-specified --bin-size determines bin width; above it, fixed coarse tiers apply (5,000 / 10,000 / 25,000 bp for the 10k–50k, 50k–100k, and 100k+ ranges respectively). Additionally, when --min-reads-per-bin is enabled during preprocessing, adjacent sparsely populated bins are merged, so actual bin boundaries at annotation time may be wider than the nominal widths. Within each bin, all reads are padded to the same length so they can be processed together as a batch. The padded length is:

\[ L = B + 1 + P_{\text{conv}} \]

where \(B\) is the upper bound of the bin and \(P_{\text{conv}}\) is extra padding required by the model’s convolutional layers (specifically, the one-sided receptive field of the largest dilated convolution):

\[ P_{\text{conv}} = \left\lfloor \frac{\max_i\bigl((k_i - 1) \times d_i\bigr)}{2} \right\rfloor \]

For the current default model 10x3p_sc_ont_016 (conv_kernel_sizes = [25, 25, 25], dilation_rates = [1, 1, 1]), this works out to \(P_{\text{conv}} = 12\). This padding ensures the model can see enough context to correctly label bases near the end of each read.

Chunks

Each length bin may contain thousands or millions of reads, so Tranquillyzer processes them in chunks — manageable groups that are loaded into memory one at a time. The chunk size adapts to read length:

\[ \text{chunk size} = \min\!\left(\left\lfloor \text{base chunk size} \times \frac{500}{\text{avg read length in bin}} \right\rfloor,\; 500{,}000\right) \]

The idea is simple: short reads use less memory per read, so more of them can fit in a chunk. Long reads use more memory, so fewer are loaded at once. The base chunk size is controlled by --chunk-size (default: 100,000). The hard cap of 500,000 prevents any single chunk from using too much memory, even for very short reads.

Choosing How Many Reads to Process at Once

Within each chunk, Tranquillyzer decides how many reads to send to the GPU in a single batch. It runs two checks and picks the smaller (safer) value:

\[ B_{\text{per GPU}} = \min(B_{\text{tokens}}, B_{\text{mem}}) \]

Check 1: Token budget

The first check limits the total amount of “work” per batch. A token is one position in a padded read (roughly one DNA base). The token budget controls how many total tokens are processed at once:

\[ B_{\text{tokens}} = \left\lfloor \frac{T}{L} \right\rfloor \]

where \(T\) is the token budget (--target-tokens, default 1,200,000) and \(L\) is the padded read length. For example, with 1,200,000 tokens and a padded length of 1,000 bp, the batch size would be 1,200 reads.

When to adjust: If you have a high-memory GPU and want larger batches for faster processing, increase --target-tokens. If you are hitting memory errors, decrease it.

Check 2: Memory estimate

The second check estimates whether the batch will actually fit in GPU memory, based on the model architecture. Each token (each position in a read) consumes a certain amount of GPU memory as it passes through the model’s layers. Tranquillyzer estimates this bytes per token value by adding up the memory used at each layer:

\[ \text{bpt} = \bigl(\underbrace{4}_{\text{input}} + \underbrace{d_e \cdot 4}_{\text{embedding}} + \underbrace{C_f \cdot 4}_{\text{conv}} + \underbrace{l_u \cdot d_{\text{bi}} \cdot 4}_{\text{LSTM out}} + \underbrace{l_u \cdot 4 \cdot d_{\text{bi}} \cdot 4}_{\text{LSTM gates}} + \underbrace{n_l \cdot 4}_{\text{CRF/output}}\bigr) \times \alpha \]

Each term corresponds to a layer of the model:

Input (4 bytes): the encoded DNA base (stored as an integer).
Embedding (\(d_e \times 4\)): the learned vector representation of each base (\(d_e\) = 128 by default).
Conv activations (\(C_f \times 4\)): outputs from the convolutional layers. \(C_f\) is the total filter count across all conv layers (64 filters \(\times\) 3 layers = 192 for the default model).
LSTM outputs (\(l_u \times d_{\text{bi}} \times 4\)): outputs from the recurrent layer (\(l_u\) = 32 units, \(d_{\text{bi}}\) = 2 for bidirectional).
LSTM gates (\(l_u \times 4 \times d_{\text{bi}} \times 4\)): internal memory used by the four LSTM gates (input, forget, cell, output).
CRF / output (\(n_l \times 4\)): the final prediction scores for each label (\(n_l \approx\) 10 labels).
Overhead (\(\alpha\) = 1.12): a 12% multiplier to account for TensorFlow’s internal bookkeeping.

For the default model, this works out to approximately 2,916 bytes per token (see worked example).

The batch size from the memory check is then:

\[ B_{\text{mem}} = \left\lfloor \frac{U}{\text{bpt} \times L} \right\rfloor \]

where \(U\) is the usable GPU memory and \(L\) is the padded read length. If you have multiple GPUs with different amounts of memory, Tranquillyzer uses the limit from the smallest GPU (since every GPU must be able to handle its share of the batch).

How usable memory is calculated

Tranquillyzer does not assume it can use all of your GPU memory. It reserves a fraction as a safety buffer (controlled by --vram-headroom, default 0.35) and accounts for memory that TensorFlow has already allocated:

\[ U = \max\!\bigl(0,\; (\text{total GPU budget} - \text{already in use}) \times (1 - \text{headroom})\bigr) \]

With the default headroom of 0.35, about 65% of the remaining GPU memory is actually used for inference. You can tell Tranquillyzer how much VRAM your GPU has via --gpu-mem (default: 12 GB if not specified).

Scaling Across Multiple GPUs

When Tranquillyzer detects more than one GPU, it automatically distributes work across all of them using TensorFlow’s MirroredStrategy. Each GPU processes its own portion of the batch in parallel:

\[ B_{\text{global}} = B_{\text{per GPU}} \times \text{number of GPUs} \]

For example, with 4 GPUs and a per-GPU batch size of 300, the global batch size is 1,200 — meaning 1,200 reads are processed simultaneously (300 on each GPU).

If your GPUs have different amounts of memory, you can specify each one’s VRAM budget individually with a comma-separated list: --gpu-mem 48,48,24. If you pass a single number (e.g., --gpu-mem 48), it is applied to all GPUs. If you don’t specify --gpu-mem at all, 12 GB per GPU is assumed.

What Happens If the GPU Runs Out of Memory

If a batch turns out to be too large and the GPU runs out of memory (OOM), Tranquillyzer automatically recovers:

It clears the GPU memory and triggers garbage collection.
It cuts the batch size in half.
It rebuilds the model if needed (to restore multi-GPU state).
It retries with the smaller batch.

This process repeats until it finds a batch size that works (down to --min-batch-size, default 1). Your run will not crash from an OOM error — but repeated recovery cycles are slow, so it is better to tune your settings proactively (see Tuning Performance).

Running Without a GPU

Tranquillyzer can run without a GPU, but this is not recommended for production datasets — CPU-only inference is orders of magnitude slower and impractical for large sequencing runs. CPU mode exists primarily for testing, development, and small exploratory analyses. When no GPU is detected:

Batch sizes are determined by the token budget only (no memory estimation is needed since there is no GPU memory to manage).
The batch size is capped at 32 reads to keep memory usage predictable on CPU.
No multi-device parallelism is used.

Tuning Performance

Quick Reference: CLI Flags

These flags are available for `annotate-reads`. A subset is also available for `visualize` and `train-model`.
Flag	Default	What it does	When to change it
`--gpu-mem`	12 GB	Tells Tranquillyzer how much VRAM your GPU has	Always set this to your actual VRAM for the best results
`--target-tokens`	1,200,000	Controls batch size via a token budget	Increase for faster processing on high-VRAM GPUs; decrease if hitting OOM
`--vram-headroom`	0.35	Fraction of GPU memory kept as a safety buffer	Increase (e.g., 0.50) if you get OOM errors; decrease (e.g., 0.20) for more speed
`--token-cap-above`	0 (off)	Enables aggressive batching for short reads	Set to a length threshold (e.g., 5000) to speed up short-read bins
`--min-batch-size`	1	Smallest allowed batch size per GPU	Rarely needs changing
`--max-batch-size`	8,192	Largest allowed batch size per GPU	Increase on high-VRAM GPUs to allow larger batches for short-read bins (default is 2,000 for `visualize` and `train-model`)
`--chunk-size`	100,000	Base number of reads loaded into memory at once	Rarely needs changing; adapts automatically to read length
`--threads`	12	CPU threads for barcode correction, demux, and I/O	Increase if you have many CPU cores; decrease on shared machines

Common Scenarios

Resolving Out-of-Memory Errors

Try these steps in order:

Set --gpu-mem to your actual VRAM. For example, if you have an NVIDIA T4 (16 GB), use --gpu-mem 16. The default of 12 GB may cause Tranquillyzer to overestimate what fits in memory on smaller GPUs.
Increase --vram-headroom from 0.35 to 0.50. This reserves more memory as a safety buffer, leaving less for batches but reducing OOM risk.
Decrease --target-tokens from 1,200,000 to 800,000 or lower. This directly reduces batch sizes.
If none of the above help, note that the automatic OOM recovery (see above) will eventually find a working batch size — but it is slow. Proactive tuning avoids this overhead.

Maximizing Throughput

Set --gpu-mem to your actual VRAM — this is the single most important setting. If you have a 48 GB GPU but leave the default at 12, Tranquillyzer will use only a fraction of your GPU’s capacity.
Lower --vram-headroom from 0.35 to 0.20 if you are not seeing any OOM errors. This lets Tranquillyzer use more of your GPU memory for larger batches.
Set --token-cap-above to a threshold like 5,000. This tells Tranquillyzer to use full GPU capacity (without the token budget constraint) for all bins with reads shorter than 5,000 bp — where OOM is unlikely. Longer bins still use the safer, dual-constraint approach.
Increase --target-tokens on high-VRAM GPUs (e.g., 1,500,000 or higher on A100/H100) to allow larger batches for longer reads.
Raise --max-batch-size above the default 8,192 on high-VRAM GPUs (e.g., 20,000 on 48 GB, up to ~32,000 on 80 GB) to unlock larger batches on short-read bins. See Tuning --max-batch-size for guidelines.

Tuning `--max-batch-size`

--max-batch-size is the per-GPU ceiling on batch size after the token-budget and memory checks have run. On short-read bins, GPU memory rarely binds and batch size is determined by either the token budget (default behavior) or --max-batch-size (when --token-cap-above is set). With the default --max-batch-size = 8,192, short bins on high-VRAM GPUs can leave a large portion of GPU compute idle.

Raising --max-batch-size lets short-read bins use more of the GPU at once. The safe pattern is to combine a higher cap with --token-cap-above, so the cap only takes effect on bins below the threshold:

--max-batch-size 20000 --token-cap-above 10000

Tentative starting points (subject to per-GPU benchmarking):

VRAM	Suggested `--max-batch-size` (with `--token-cap-above 10000`)
16 GB	4,000 – 8,000
24 GB	8,000 – 12,000
40–48 GB	16,000 – 20,000
80 GB	24,000 – 32,000

Warning

These are starting points derived from a single empirical run on 4× L40S (48 GB) — not exhaustive testing. If a chunk contains an outlier long read, an aggressive --max-batch-size can trigger an OOM recovery cycle that is slow. Per-GPU recommendations will be refined as we benchmark each architecture.

Heterogeneous Multi-GPU Setups

Use --gpu-mem with a comma-separated list matching your GPUs. For example, if you have two 48 GB GPUs and one 24 GB GPU:

--gpu-mem 48,48,24

Tranquillyzer will compute the per-GPU batch size based on the smallest GPU (24 GB in this case), so all GPUs can handle their share of the work.

Technical Details

This section provides the mathematical derivations behind the estimates and algorithms described above. It is optional reading — you do not need to understand these details to use Tranquillyzer effectively.

Model Configuration Used for Estimates

All estimates on this page correspond to the CNN-BiLSTM-CRF annotation model 10x3p_sc_ont_016, the current default. Its architecture is:

Parameter	Value
`embedding_dim`	128
`conv_layers`	3
`conv_filters`	64
`conv_kernel_sizes`	[25, 25, 25]
`dilation_rates`	[1, 1, 1]
`lstm_layers`	1
`lstm_units`	32
`bidirectional`	TRUE
`CRF`	TRUE
`attention_heads`	0

How Feasibility Limits Are Computed

The GPU feasibility limits in the table above combine an empirical anchor on the L40S with a theoretical scaling to other GPUs.

Empirical anchor (`10x3p_sc_ont_016`)

On a system with 4× NVIDIA L40S GPUs (48 GB VRAM each), Tranquillyzer (model 10x3p_sc_ont_016) successfully processed reads up to ~525 kb in a real annotation pipeline run. This is the largest read length directly observed in that run; the actual ceiling on the L40S is likely higher.

The run used --vram-headroom 0.45, --target-tokens 1,100,000, --max-batch-size 20000, and --token-cap-above 10000. With more aggressive settings (lower headroom, larger token budget), capacity can be pushed further.

Scaling to other GPUs

10x3p_sc_ont_016 has roughly 44% of the per-token memory cost of its predecessor 10x3p_sc_ont_012 (see bpt table). The 73.4 KB/bp empirical constant derived from a batch-size-1 measurement of 10x3p_sc_ont_012 on the L40S scales to approximately 32.2 KB/bp for the current default model:

\[ L_{\max} \approx \frac{\text{VRAM} \times (1 - H)}{32.2~\text{KB}} \]

where \(H\) is the headroom fraction (default 0.35). This constant assumes overhead scales proportionally with theoretical per-token memory; direct per-GPU empirical confirmation is pending.

Why empirical scaling rather than analytical calculation?

GPU memory usage during inference is not just the model activations. It also includes TensorFlow’s cuDNN workspace allocations, memory allocator fragmentation, LSTM internal state, CRF decoding workspace, and other overhead that is difficult to predict analytically. Anchoring to a real measurement captures all of these factors, giving more reliable estimates — especially for ultra-long reads where these overheads become significant.

Bytes-Per-Token Estimation Formula

The per-token memory estimate used in batch sizing sums the activation footprint across each layer of the model:

Term	Formula	Default value	Description
Input	4	4	One int32 per position
Embedding	\(d_e \times 4\)	512	Float32 embedding vector
Conv	\(C_f \times 4\)	768	Conv filter outputs (\(C_f\) = filters \(\times\) layers = 64 \(\times\) 3)
LSTM out	\(l_u \times d_{\text{bi}} \times 4\)	256	Recurrent layer outputs (32 units \(\times\) 2 directions)
LSTM gates	\(l_u \times 4 \times d_{\text{bi}} \times 4\)	1,024	Four LSTM gates, both directions
CRF/output	\(n_l \times 4\)	40	Prediction scores for ~10 labels
Subtotal		2,604
Overhead (\(\alpha\) = 1.12)			12% TensorFlow bookkeeping
Total bpt		~2,916	bytes per token

Usable Memory at Runtime

When Tranquillyzer is running, it queries TensorFlow to find out how much GPU memory is already in use and calculates the usable portion:

\[ U = \max\!\bigl(0,\; (V_{\text{total}} - V_{\text{in use}}) \times (1 - H)\bigr) \]

\(V_{\text{total}}\) = the GPU memory budget you specified via --gpu-mem (default 12 GB)
\(V_{\text{in use}}\) = memory already allocated by TensorFlow on that GPU
\(H\) = headroom fraction (--vram-headroom, default 0.35)

Before a model is loaded, \(V_{\text{in use}} = 0\), which gives the simpler formula \(U \approx V \times (1 - H)\) used for the feasibility estimates.

Two-Tier Batching with `--token-cap-above`

By default, Tranquillyzer uses the conservative \(\min(B_{\text{tokens}}, B_{\text{mem}})\) for every length bin. The --token-cap-above flag enables a two-tier strategy:

Short-read bins (padded length < threshold): batch size is set by GPU memory capacity only (\(B_{\text{mem}}\)), ignoring the token budget. This allows larger batches because short reads rarely approach OOM limits.
Long-read bins (padded length \(\geq\) threshold): batch size uses the safer \(\min(B_{\text{tokens}}, B_{\text{mem}})\) to protect against OOM.

For example, --token-cap-above 5000 uses aggressive GPU-capacity-only batching for bins up to 5,000 bp, while longer bins use the dual-constraint approach.