Background on Model Training
Overview
In long-read RNA-seq data, base-level annotations such as adapters, barcodes, UMIs, cDNA, and polyA/T tails can be inferred either through a set of heuristics or through a trained deep learning model. tranquillyzer takes the latter, fully supervised approach, and therefore requires labeled data as input for training.
Labeled data can come either from annotations generated by an upstream annotation tool or from simulated data, which inherently contains labels. Because upstream annotation tools typically rely on heuristics with their own embedded assumptions and biases, training a deep learning model on their output is not ideal: the model tends to learn to reproduce the tool's annotations rather than the true structure of the reads.
tranquillyzer addresses this limitation by training exclusively on synthetically generated reads with known ground-truth base-level labels. Reads are simulated according to explicit library specifications and sequencing error models, ensuring the true identity of every base is known at generation time. This enables (i) fully supervised learning with exact labels and (ii) objective evaluation without reliance on external annotations. As a result, model training is decoupled from upstream heuristics, which enables easy adaptation to new protocols and read structures.
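Conceptually, a simulated read is a concatenation of segments whose per-base labels are recorded at generation time, so the ground truth is exact by construction. A minimal sketch of this idea follows; the segment names, lengths, and adapter sequence are illustrative and do not reflect tranquillyzer's actual specification format:

```python
import random

random.seed(0)

# Hypothetical library structure: each entry is a segment name plus either a
# fixed sequence or a length to sample randomly. Purely illustrative.
STRUCTURE = [
    ("adapter", "AGATCGGAAGAGC"),  # fixed adapter sequence
    ("barcode", 16),               # random 16-mer cell barcode
    ("UMI", 12),                   # random 12-mer UMI
    ("cDNA", 300),                 # random insert standing in for a transcript
    ("polyA", 30),                 # homopolymer tail
]

def simulate_read(structure):
    """Generate one read plus a parallel list of per-base ground-truth labels."""
    bases, labels = [], []
    for name, spec in structure:
        if isinstance(spec, str):
            seg = spec                                   # fixed sequence
        elif name == "polyA":
            seg = "A" * spec                             # homopolymer
        else:
            seg = "".join(random.choice("ACGT") for _ in range(spec))
        bases.append(seg)
        labels.extend([name] * len(seg))                 # one label per base
    return "".join(bases), labels

read, labels = simulate_read(STRUCTURE)
```

Because every base is emitted together with its label, no external annotation step is needed; the label sequence is the training target.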
Training Workflow
The training procedure consists of four conceptual stages:
1. Simulation of reads based on a provided structure
2. Introduction of sequencing errors into the simulated reads
3. Supervised neural network training for base-level annotation (train-model)
4. Validation on an independently simulated dataset (train-model)
Stages (1) and (2) are performed by the simulate-data command, while stages (3) and (4) are handled by train-model. The following tutorial lays out the intuition behind each of these stages and how they may impact the efficacy of the final model.
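The intuition behind stage (2) is that errors are injected into a simulated read while every surviving or inserted base retains a ground-truth label, so supervision stays exact even on noisy reads. A minimal sketch, with error rates and the label-inheritance rule chosen purely for illustration (they do not describe simulate-data's actual error model):

```python
import random

random.seed(0)

def add_errors(read, labels, sub=0.05, ins=0.02, dele=0.03):
    """Apply substitutions, insertions, and deletions to a labeled read.

    Deleted bases are dropped along with their labels; inserted bases inherit
    the label of the preceding base, so the output read and label list stay
    the same length as each other.
    """
    out_bases, out_labels = [], []
    for base, label in zip(read, labels):
        r = random.random()
        if r < dele:
            continue                                     # deletion: drop base
        if r < dele + sub:
            base = random.choice([b for b in "ACGT" if b != base])  # substitution
        out_bases.append(base)
        out_labels.append(label)
        if random.random() < ins:                        # insertion after base
            out_bases.append(random.choice("ACGT"))
            out_labels.append(label)
    return "".join(out_bases), out_labels

noisy, noisy_labels = add_errors("A" * 100, ["polyA"] * 100)
```

The error rates chosen at this stage matter for model efficacy: they should resemble the error profile of the target sequencing platform, since the model will only be robust to artifacts it has seen during training.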