train-model CLI

The `train-model` command trains model variants from pre-generated simulated data (produced by `simulate-data`). It reads the simulated reads and labels from `{output_dir}/simulated_data/`, trains each hyperparameter combination defined in `training_params.yaml`, and validates each variant on an independently generated set of reads.

For background on the model architecture and training process, see Model Training.

Note: The error model options below (`--mismatch-rate`, `--polyt-error-rate`, etc.) and cDNA length options (`--min-cdna`, `--max-cdna`) control the post-training validation reads only. These validation reads are generated independently to provide a quick sanity check — they do not affect the training data, which was produced earlier by `simulate-data`.

Command Line Options

| CLI Option | Type | Default | What it controls | When to change it |
| --- | --- | --- | --- | --- |
| `model_name` | TEXT | required | Library model key defining segment order and motifs | Set to match your protocol |
| `output_dir` | TEXT | required | Directory containing simulated training data | Same directory used for `simulate-data` |
| `--param-file` | TEXT | `utils/training_params.yaml` | Hyperparameter grid file | Only if using a custom file |
| `--training-seq-orders-file` | TEXT | `utils/seq_orders.yaml` | Library definition file | Only if using a custom file |
| `--num-val-reads` | INT | 20 | Number of validation reads to synthesize after training | Increase for a more thorough validation check |
| `--mismatch-rate` | FLOAT | 0.05 | Validation read substitution rate | Match your sequencing platform |
| `--insertion-rate` | FLOAT | 0.05 | Validation read insertion rate | Match your sequencing platform |
| `--deletion-rate` | FLOAT | 0.06 | Validation read deletion rate | Match your sequencing platform |
| `--min-cdna` | INT | 100 | Minimum cDNA length in validation reads | Adjust to match expected read lengths |
| `--max-cdna` | INT | 500 | Maximum cDNA length in validation reads | Adjust to match expected read lengths |
| `--polyt-error-rate` | FLOAT | 0.02 | Validation polyA/polyT error rate | Increase if your data has noisy homopolymers |
| `--max-insertions` | FLOAT | 1 | Validation max insertions per position | Rarely needs changing |
| `--threads` | INT | 2 | CPU threads for validation read synthesis | Increase on multi-core machines |
| `--rc / --no-rc` | FLAG | `--rc` | Include reverse complements in the validation set | Disable only for fixed-orientation protocols |
| `--transcriptome` | TEXT | None | Transcriptome FASTA for validation cDNA | Recommended for realistic validation |
| `--invalid-fraction` | FLOAT | 0.3 | Fraction of validation reads generated as invalid | Rarely needs changing |
| `--gpu-mem` | TEXT | None (12 GB) | GPU memory budget for validation inference | Set to your actual VRAM |
| `--target-tokens` | INT | 1,200,000 | Token budget per GPU for validation inference batching | See Resource Requirements |
| `--vram-headroom` | FLOAT | 0.35 | VRAM headroom fraction for validation inference | Increase if OOM during validation |
| `--min-batch-size` | INT | 1 | Minimum batch size for validation inference | Rarely needs changing |
| `--max-batch-size` | INT | 2,000 | Maximum batch size for validation inference | Rarely needs changing |
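To build intuition for how `--target-tokens`, `--min-batch-size`, and `--max-batch-size` interact, here is one plausible way a per-GPU token budget could translate into a batch size. This is an illustrative sketch only — the function name, formula, and behavior are assumptions, not Tranquillyzer's actual implementation:

```python
def pick_batch_size(target_tokens: int, read_len: int,
                    min_batch: int = 1, max_batch: int = 2000) -> int:
    """Fit as many reads as the token budget allows, then clamp the
    result to the configured minimum and maximum batch sizes.
    (Hypothetical helper for illustration, not Tranquillyzer code.)"""
    fit = target_tokens // max(read_len, 1)
    return max(min_batch, min(fit, max_batch))

# Short reads: the budget would allow more, but --max-batch-size caps it.
print(pick_batch_size(1_200_000, 600))   # 2000
# Long reads: fewer fit within the same token budget.
print(pick_batch_size(1_200_000, 5000))  # 240
```

Under a model like this, raising `--target-tokens` increases batch sizes (and VRAM use) for a given read length, while the min/max options bound the result.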

Defining Hyperparameters in training_params.yaml

The hyperparameter grid is defined in a YAML file with one top-level key per model. Each key maps to a dictionary of parameter values. If you want to search over multiple values for a parameter, provide them as a comma-separated string — `train-model` will enumerate all combinations automatically.
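The combination count is the Cartesian product of the value lists, so grids grow multiplicatively. The sketch below mirrors that enumeration in Python to show how many variants a grid produces; the parameter values are illustrative and this is not Tranquillyzer's actual code:

```python
from itertools import product

# Hypothetical grid: two comma-separated values for two parameters,
# a single fixed value for the third.
grid = {
    "learning_rate": "0.001, 0.0005",
    "dropout_rate": "0.35, 0.5",
    "epochs": "5",
}

# Split each comma-separated string into its candidate values.
choices = {k: [v.strip() for v in s.split(",")] for k, s in grid.items()}

# Enumerate the Cartesian product: 2 x 2 x 1 = 4 hyperparameter combinations,
# i.e. four model variants would be trained.
variants = [dict(zip(choices, combo)) for combo in product(*choices.values())]
print(len(variants))  # 4
```

Because every extra searched value multiplies the number of variants trained, keep grids small at first and expand only the parameters that matter.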

Note: If you copy and paste the following template, make sure the pasted text retains proper YAML formatting.

Single-Model Example

training_params.yaml
```yaml
10x3p_sc_ont:
  batch_size: 128
  train_fraction: 0.8
  vocab_size: 5
  embedding_dim: 128
  conv_layers: 3
  conv_filters: [128, 128, 128]
  conv_kernel_sizes: [25, 25, 25]
  dilation_rates: [1, 3, 5]
  lstm_layers: 1
  lstm_units: [96]
  bidirectional: true
  crf_layer: true
  attention_heads: 0
  dropout_rate: 0.35
  regularization: 0.01
  learning_rate: 0.001
  epochs: 5
```

Multi-Model Example

Additional models can be added as separate top-level keys. The model name in the YAML file is the `model_name` used in other Tranquillyzer commands.

training_params_multi.yaml
```yaml
10x5p_sc_ont:
  batch_size: 128
  train_fraction: 0.8
  vocab_size: 5
  embedding_dim: 128
  conv_layers: 4
  conv_filters: [128, 128, 128, 128]
  conv_kernel_sizes: [25, 25, 25, 25]
  dilation_rates: [1, 1, 1, 1]
  lstm_layers: 1
  lstm_units: [96]
  bidirectional: true
  crf_layer: true
  attention_heads: 0
  dropout_rate: 0.35
  regularization: 0.01
  learning_rate: 0.001
  epochs: 5

10x3p_sc_ont:
  batch_size: 128
  train_fraction: 0.8
  vocab_size: 5
  embedding_dim: 128
  conv_layers: 3
  conv_filters: [128, 128, 128]
  conv_kernel_sizes: [25, 25, 25]
  dilation_rates: [1, 3, 5]
  lstm_layers: 1
  lstm_units: [96]
  bidirectional: true
  crf_layer: true
  attention_heads: 0
  dropout_rate: 0.35
  regularization: 0.01
  learning_rate: 0.001
  epochs: 5
```

Example Usage

```bash
tranquillyzer train-model \
    10x3p_sc_ont \
    training_out \
    --num-val-reads 20 \
    --mismatch-rate 0.05 \
    --insertion-rate 0.05 \
    --deletion-rate 0.06 \
    --min-cdna 100 \
    --max-cdna 500 \
    --polyt-error-rate 0.02 \
    --max-insertions 2 \
    --invalid-fraction 0.3 \
    --rc \
    --threads 2 \
    --transcriptome gencode.v44.transcripts.fa
```