train-model CLI

The `train-model` command trains model variants from pre-generated simulated data (produced by `simulate-data`). It reads the simulated reads and labels from `{output_dir}/simulated_data/`, trains each hyperparameter combination defined in `training_params.yaml`, and validates each variant on an independently generated set of reads.

For background on the model architecture and training process, see Model Training.

Note: The error model options below (`--mismatch-rate`, `--polyt-error-rate`, etc.) and cDNA length options (`--min-cdna`, `--max-cdna`) control the post-training validation reads only. These validation reads are generated independently to provide a quick sanity check — they do not affect the training data, which was produced earlier by `simulate-data`.

Command Line Options

| CLI Option | Type | Default | What it controls | When to change it |
| --- | --- | --- | --- | --- |
| `model_name` | TEXT | required | Library model key defining segment order and motifs | Set to match your protocol |
| `output_dir` | TEXT | required | Directory containing simulated training data | Same directory used for `simulate-data` |
| `--param-file` | TEXT | `utils/training_params.yaml` | Hyperparameter grid file | Only if using a custom file |
| `--training-seq-orders-file` | TEXT | `utils/seq_orders.yaml` | Library definition file | Only if using a custom file |
| `--num-val-reads` | INT | 20 | Number of validation reads to synthesize after training | Increase for a more thorough validation check |
| `--mismatch-rate` | FLOAT | 0.05 | Validation read substitution rate | Match your sequencing platform |
| `--insertion-rate` | FLOAT | 0.05 | Validation read insertion rate | Match your sequencing platform |
| `--deletion-rate` | FLOAT | 0.06 | Validation read deletion rate | Match your sequencing platform |
| `--min-cdna` | INT | 100 | Minimum cDNA length in validation reads | Adjust to match expected read lengths |
| `--max-cdna` | INT | 500 | Maximum cDNA length in validation reads | Adjust to match expected read lengths |
| `--polyt-error-rate` | FLOAT | 0.02 | Validation polyA/polyT error rate | Increase if your data has noisy homopolymers |
| `--max-insertions` | FLOAT | 1 | Validation max insertions per position | Rarely needs changing |
| `--threads` | INT | 2 | CPU threads for validation read synthesis | Increase on multi-core machines |
| `--rc / --no-rc` | FLAG | `--rc` | Include reverse complements in the validation set | Disable only for fixed-orientation protocols |
| `--transcriptome` | TEXT | None | Transcriptome FASTA for validation cDNA | Recommended for realistic validation |
| `--invalid-fraction` | FLOAT | 0.3 | Fraction of validation reads generated as invalid | Rarely needs changing |
| `--gpu-mem` | TEXT | None (12 GB) | GPU memory budget for validation inference | Set to your actual VRAM |
| `--target-tokens` | INT | 1,200,000 | Token budget per GPU for validation inference batching | See Resource Requirements |
| `--vram-headroom` | FLOAT | 0.35 | VRAM headroom fraction for validation inference | Increase if OOM during validation |
| `--min-batch-size` | INT | 1 | Minimum batch size for validation inference | Rarely needs changing |
| `--max-batch-size` | INT | 2,000 | Maximum batch size for validation inference | Rarely needs changing |
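To build intuition for how `--target-tokens`, `--min-batch-size`, and `--max-batch-size` interact, here is one plausible way a per-GPU token budget could translate into a batch size. This is an illustrative sketch only — the function name, formula, and behavior are assumptions, not Tranquillyzer's actual implementation:

```python
def pick_batch_size(target_tokens: int, read_len: int,
                    min_batch: int = 1, max_batch: int = 2000) -> int:
    """Fit as many reads as the token budget allows, then clamp the
    result to the configured minimum and maximum batch sizes.
    (Hypothetical helper for illustration, not Tranquillyzer code.)"""
    fit = target_tokens // max(read_len, 1)
    return max(min_batch, min(fit, max_batch))

# Short reads: the budget would allow more, but --max-batch-size caps it.
print(pick_batch_size(1_200_000, 600))   # 2000
# Long reads: fewer fit within the same token budget.
print(pick_batch_size(1_200_000, 5000))  # 240
```

Under a model like this, raising `--target-tokens` increases batch sizes (and VRAM use) for a given read length, while the min/max options bound the result.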

Defining Hyperparameters in training_params.yaml

The hyperparameter grid is defined in a YAML file with one top-level key per model. Each key maps to a dictionary of parameter values. If you want to search over multiple values for a parameter, provide them as a comma-separated string — `train-model` will enumerate all combinations automatically.
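The combination count is the Cartesian product of the value lists, so grids grow multiplicatively. The sketch below mirrors that enumeration in Python to show how many variants a grid produces; the parameter values are illustrative and this is not Tranquillyzer's actual code:

```python
from itertools import product

# Hypothetical grid: two comma-separated values for two parameters,
# a single fixed value for the third.
grid = {
    "learning_rate": "0.001, 0.0005",
    "dropout_rate": "0.35, 0.5",
    "epochs": "5",
}

# Split each comma-separated string into its candidate values.
choices = {k: [v.strip() for v in s.split(",")] for k, s in grid.items()}

# Enumerate the Cartesian product: 2 x 2 x 1 = 4 hyperparameter combinations,
# i.e. four model variants would be trained.
variants = [dict(zip(choices, combo)) for combo in product(*choices.values())]
print(len(variants))  # 4
```

Because every extra searched value multiplies the number of variants trained, keep grids small at first and expand only the parameters that matter.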

Note: If you copy and paste the following template, make sure the pasted text retains proper YAML formatting.

Single-Model Example

training_params.yaml
```yaml
10x3p_sc_ont:
  batch_size: 128
  train_fraction: 0.8
  vocab_size: 5
  embedding_dim: 128
  conv_layers: 3
  conv_filters: [128, 128, 128]
  conv_kernel_sizes: [25, 25, 25]
  dilation_rates: [1, 3, 5]
  lstm_layers: 1
  lstm_units: [96]
  bidirectional: true
  crf_layer: true
  attention_heads: 0
  dropout_rate: 0.35
  regularization: 0.01
  learning_rate: 0.001
  epochs: 5
```

Multi-Model Example

Additional models can be added as separate top-level keys. The model name in the YAML file is the `model_name` used in other Tranquillyzer commands.

training_params_multi.yaml
```yaml
10x5p_sc_ont:
  batch_size: 128
  train_fraction: 0.8
  vocab_size: 5
  embedding_dim: 128
  conv_layers: 4
  conv_filters: [128, 128, 128, 128]
  conv_kernel_sizes: [25, 25, 25, 25]
  dilation_rates: [1, 1, 1, 1]
  lstm_layers: 1
  lstm_units: [96]
  bidirectional: true
  crf_layer: true
  attention_heads: 0
  dropout_rate: 0.35
  regularization: 0.01
  learning_rate: 0.001
  epochs: 5

10x3p_sc_ont:
  batch_size: 128
  train_fraction: 0.8
  vocab_size: 5
  embedding_dim: 128
  conv_layers: 3
  conv_filters: [128, 128, 128]
  conv_kernel_sizes: [25, 25, 25]
  dilation_rates: [1, 3, 5]
  lstm_layers: 1
  lstm_units: [96]
  bidirectional: true
  crf_layer: true
  attention_heads: 0
  dropout_rate: 0.35
  regularization: 0.01
  learning_rate: 0.001
  epochs: 5
```

Example Usage

```bash
tranquillyzer train-model \
    10x3p_sc_ont \
    training_out \
    --num-val-reads 20 \
    --mismatch-rate 0.05 \
    --insertion-rate 0.05 \
    --deletion-rate 0.06 \
    --min-cdna 100 \
    --max-cdna 500 \
    --polyt-error-rate 0.02 \
    --max-insertions 2 \
    --invalid-fraction 0.3 \
    --rc \
    --threads 2 \
    --transcriptome gencode.v44.transcripts.fa
```