Stage 3/4: Model Training Intuitions
The train-model command trains a base-level sequence labeling model to assign a segment identity (e.g., adapter, barcode, UMI, cDNA, polyA/T) to every nucleotide position in a read. Training operates on synthetic reads produced by simulate-data (stored as reads.pkl and labels.pkl) and evaluates each trained variant using an independently simulated validation set with known ground-truth labels.
For a given model_name, train-model reads a parameter table (training_params.tsv), enumerates the specified hyperparameter combinations, trains each model variant, and exports a standardized set of artifacts (model weights, label binarizer, training history, and a validation visualization PDF).
Model Architecture
Input Representation and Tokenization
Reads are represented as sequences of nucleotide tokens, converted to integers prior to training:
- A → 1
- C → 2
- G → 3
- T → 4
- N → 5
- 0 reserved for padding
This is governed by the vocabulary size parameter (vocab_size). In the current implementation, vocab_size is 5 to correspond to the five nucleotide symbols {A, C, G, T, N}, while the padding token, 0, is introduced only for batching and does not represent a nucleotide.
Increasing vocab_size is not typically meaningful for DNA unless additional symbols are introduced (e.g., ambiguity codes beyond N). The critical representation choice is the embedding dimension (embedding_dim), which controls how much information the model can store about each token.
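The integer encoding above can be sketched in a few lines; the helper function below is illustrative, not tranquillyzer's API:

```python
# Illustrative sketch of the nucleotide-to-integer encoding described above.
TOKEN_MAP = {"A": 1, "C": 2, "G": 3, "T": 4, "N": 5}
PAD = 0  # padding token, used only for batching; not a nucleotide

def encode_read(seq: str, max_len: int) -> list[int]:
    """Map nucleotides to integers and right-pad to a fixed length."""
    tokens = [TOKEN_MAP[base] for base in seq.upper()]
    return tokens + [PAD] * (max_len - len(tokens))

print(encode_read("ACGTN", 8))  # → [1, 2, 3, 4, 5, 0, 0, 0]
```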
Embedding Layer
The embedding layer converts each integer token into a learned numeric vector of length embedding_dim. This step is conceptually similar to assigning each nucleotide a trainable “feature profile” that downstream layers can combine to detect patterns.
- A larger `embedding_dim` increases representational capacity and may help in complex settings (e.g., more labels or noisier reads), but requires additional computational resources and can overfit if training data are limited.
- For a small vocabulary like DNA, moderate values (e.g., 64-128) are typically sufficient.
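The lookup itself can be pictured with plain arrays; the sizes below are illustrative, and random values stand in for trained weights:

```python
import numpy as np

# A toy embedding lookup: each token id indexes a row of a trainable matrix.
vocab_size, embedding_dim = 6, 4   # 5 nucleotide tokens + padding row 0
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embedding_dim))

tokens = np.array([1, 2, 3, 4, 5, 0])  # encoded read "ACGTN" plus one pad
vectors = embedding[tokens]            # shape: (read_length, embedding_dim)
print(vectors.shape)                   # → (6, 4)
```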
Convolutional Layers (“Conv1D” Blocks)
Following embedding, the model applies a stack of 1D convolution layers. A convolution layer slides a small window (the kernel) along the sequence and learns to recognize local motifs. In this application, convolutions primarily learn short-range patterns such as adapter subsequences, barcode/UMI neighborhood signatures and transition cues near segment boundaries.
Each convolution block consists of:
- `Conv1D(filters=conv_filters, kernel_size=conv_kernel_size, padding="same")`
- `BatchNormalization`
- `Dropout(dropout_rate)`
The user can control the following inputs to the convolution blocks:
- `conv_kernel_size`: Controls how wide a motif the layer can “see” at once. Larger kernels (e.g., 25) help detect longer protocol motifs and smooth local noise. However, too large a value may blur sharp boundary cues, whereas too small a value may miss multi-base motifs.
- `conv_filters`: Number of motif detectors per layer. More filters mean more pattern capacity at the cost of higher memory and time requirements. If boundaries are missed or adapters vary, increasing the number of filters is often more beneficial than increasing the depth.
- `conv_layers`: How many convolution stages are stacked. More layers allow hierarchical motif composition (from simple to complex). Too many layers can over-smooth or overfit, especially with limited training reads.
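What a single motif detector computes can be shown with a toy 1D correlation; the signal and kernel below are hand-set stand-ins for learned weights:

```python
import numpy as np

# Sketch of one Conv1D "motif detector": the dot product of a small kernel
# with each window of the sequence, using "same" padding so the output
# length matches the input length.
signal = np.array([0, 0, 1, 1, 1, 0, 0, 0], dtype=float)  # a 3-base "motif"
kernel = np.array([1.0, 1.0, 1.0])                        # kernel_size = 3

padded = np.pad(signal, 1)  # "same" padding: 1 zero on each side
response = np.array([padded[i:i + 3] @ kernel for i in range(len(signal))])
print(response)                # peaks where the motif aligns with the kernel
print(int(response.argmax()))  # → 3 (center of the motif)
```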
Regularization within Convolution Layers
- `regularization` applies an L2 penalty to convolution weights. Higher values discourage overly complex filters and reduce overfitting, but too strong a penalty can prevent the model from learning adapter/barcode patterns.
- `dropout_rate` randomly drops activations during training. Higher dropout improves generalization but can reduce boundary precision if excessive.
Bidirectional LSTM (Long-Range Context)
Adapters and barcodes can be recognized locally, but accurate labeling often requires understanding ordering and context. For example, a barcode is not defined only by its sequence composition but also by where it appears relative to other segments. This is addressed using long short-term memory (LSTM) recurrent layers, which process the sequence in order and can retain information over long distances.
tranquillyzer uses an LSTM configured to bidirectionally read the sequence and output a prediction at every position (return_sequences=True). Bidirectionality means the model reads the sequence both left-to-right and right-to-left, allowing each position to be interpreted using context on both sides.
The following parameters govern the LSTM component of tranquillyzer:
- `lstm_units`: Size of the “memory” used to store contextual information. Increasing the number of units can improve modeling of long-range structure but increases compute and can overfit.
- `lstm_layers`: Stacking multiple LSTMs increases modeling power but may be unnecessary for many protocols.
- `bidirectional`: `True` is typically beneficial for boundary labeling because both upstream and downstream motifs inform segment identity. `False` may be preferred only if strict causal directionality is desired (uncommon for this task).
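At the shape level, a bidirectional layer with `return_sequences=True` emits one vector per base, concatenating forward- and backward-pass features so the feature size doubles. A sketch with dummy arrays standing in for real LSTM activations:

```python
import numpy as np

# Shape-level sketch of a bidirectional recurrent layer: per-position
# outputs from a left-to-right pass and a right-to-left pass are
# concatenated along the feature axis.
read_length, lstm_units = 10, 8
forward_out = np.zeros((read_length, lstm_units))   # left-to-right pass
backward_out = np.zeros((read_length, lstm_units))  # right-to-left pass

bidir_out = np.concatenate([forward_out, backward_out], axis=-1)
print(bidir_out.shape)  # → (10, 16): one 2*lstm_units vector per base
```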
[Optional] Multi-head Self-attention (Global Context Refinement)
If enabled (`attention_heads` > 0), the model applies self-attention after the LSTM. Attention allows each position to “look at” other positions and re-weight information based on relevance, which can help when distant cues are informative (e.g., ambiguous low-complexity regions that require recognizing other motif anchors elsewhere in the read).
- More heads allow multiple “views” of relevance patterns, but increase memory and training time.
- Attention is typically most useful when reads are long and contain repeated and/or ambiguous patterns.
Note: multi-head attention is disabled by default (`attention_heads = 0`).
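The re-weighting idea can be shown with a minimal single-head self-attention in plain numpy; random features and identity projections below stand in for learned weights:

```python
import numpy as np

# Minimal single-head self-attention over per-base features: each position
# mixes information from every other position, weighted by relevance.
rng = np.random.default_rng(0)
L, d = 6, 4                     # read length, feature size (illustrative)
x = rng.normal(size=(L, d))     # per-base features from earlier layers

q, k, v = x, x, x               # identity projections for simplicity
scores = q @ k.T / np.sqrt(d)   # (L, L) relevance of every pair of positions
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
attended = weights @ v          # each row: relevance-weighted mix of all bases

print(weights.shape, attended.shape)  # → (6, 6) (6, 4)
```

Each row of `weights` is a probability distribution over positions; multiple heads simply run several such mixtures in parallel.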
Dense Layer (Per-Base Classifier)
After feature extraction, a dense layer produces a vector of scores for each label at every position. This is the per-position classification step.
- `num_labels`: Derived from the library segment set; equal to the number of segments defined in `training_seq_orders.tsv`.
Conditional Random Field (Structured Decoding)
The model uses a Conditional Random Field (CRF) on top of per-base scores. The CRF learns which label transitions are plausible (e.g., adapter → barcode is likely, but barcode → adapter may be unlikely) and encourages globally consistent label sequences.
- CRF is particularly beneficial when the model exhibits “label fragmentation” or rapid oscillation across boundaries, especially in repetitive contexts.
- CRF usually increases training and inference time but often improves boundary consistency.
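The effect of transition scores can be illustrated with a toy Viterbi decode. The labels, emission scores, and transition values below are invented for illustration, not tranquillyzer's learned parameters:

```python
import numpy as np

labels = ["adapter", "barcode", "cDNA"]
# Per-base emission scores (rows: positions, cols: labels); position 2
# weakly prefers "adapter" on its own.
emissions = np.array([
    [2.0, 0.1, 0.1],
    [0.1, 2.0, 0.1],
    [1.1, 1.0, 0.1],   # ambiguous: "adapter" barely wins pointwise
    [0.1, 0.1, 2.0],
])
# Transition scores penalize barcode -> adapter heavily.
transitions = np.array([
    [1.0, 1.0, 0.0],    # from adapter
    [-5.0, 1.0, 1.0],   # from barcode: barcode -> adapter is implausible
    [0.0, 0.0, 1.0],    # from cDNA
])

# Standard Viterbi dynamic program over emissions + transitions.
n, k = emissions.shape
score = emissions[0].copy()
backptr = np.zeros((n, k), dtype=int)
for t in range(1, n):
    cand = score[:, None] + transitions + emissions[t]  # (prev, cur)
    backptr[t] = cand.argmax(axis=0)
    score = cand.max(axis=0)

path = [int(score.argmax())]
for t in range(n - 1, 0, -1):
    path.append(int(backptr[t][path[-1]]))
path.reverse()
print([labels[i] for i in path])  # → ['adapter', 'barcode', 'barcode', 'cDNA']
```

A greedy per-base argmax would flip back to "adapter" at position 2 and fragment the barcode; the transition penalty keeps the decoded path consistent.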
Training Configuration and Hyperparameters
tranquillyzer uses a tabular parameter grid specified in training_params.tsv, where each column corresponds to a model_name and each row provides one hyperparameter. For each model, values may be single values or comma-delimited lists. If a comma-delimited list is given, train-model will enumerate all hyperparameter combinations.
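The enumeration of comma-delimited values can be sketched with `itertools.product`; the parameter names mirror the table, but the parsing code is a hypothetical illustration, not tranquillyzer's implementation:

```python
from itertools import product

# Hypothetical expansion of comma-delimited values from training_params.tsv
# into a full hyperparameter grid.
raw_params = {
    "conv_filters": "64,96",
    "conv_kernel_size": "15",
    "dropout_rate": "0.2,0.3",
}

parsed = {k: v.split(",") for k, v in raw_params.items()}
grid = [dict(zip(parsed, combo)) for combo in product(*parsed.values())]

print(len(grid))   # → 4 variants (2 x 1 x 2)
print(grid[0])     # → {'conv_filters': '64', 'conv_kernel_size': '15', 'dropout_rate': '0.2'}
```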
Core Optimization Parameters
- `batch_size`: Number of reads per gradient update. Larger batches improve throughput and stabilize gradients but require more memory and may reduce generalization.
- `train_fraction`: Fraction between 0.0 and 1.0 (inclusive) of simulated reads used for training versus internal validation. Higher values increase training signal but reduce internal validation monitoring.
- `epochs`: Maximum training epochs (with early stopping enabled). More epochs can improve fit but increase overfitting risk.
Model Capacity and Generalization Controls
- `embedding_dim`, `conv_layers`, `conv_filters`, `conv_kernel_size`, `lstm_layers`, `lstm_units`, `attention_heads`: Determine representational power.
- `dropout_rate`, `regularization`: Reduce overfitting and stabilize training.
- `learning_rate`: Step size for optimization. Too high can destabilize training; too low slows convergence.
Example Hyperparameter Effects (Intuitive Scenarios)
Example 1: Adapters Recognized but Boundaries are Fuzzy
Symptom: adapter/barcode regions are detected, but transition points drift.
- Increase `conv_kernel_size` slightly (e.g., 15 → 20) to provide a wider local context for boundary cues.
- If overfitting is suspected, increase `dropout_rate` modestly (e.g., 0.20 → 0.30).
Example 2: Model Struggles on Noisy Reads (Indels)
Symptom: predictions degrade sharply as indel rate increases.
- Increase `num_reads` during simulation to expose more noisy examples (handled in `simulate-data`).
- Increase `conv_filters` (e.g., 64 → 96) to learn more robust motif detectors.
- Consider a slightly lower `learning_rate` to stabilize optimization under noise.
Example 3: Model is slow and memory-limited on GPU
- Reduce `batch_size`.
- Reduce `conv_filters` or `embedding_dim` (reduces activation memory).
- Keep `attention_heads=0` unless needed; attention increases memory substantially.