train-model CLI
The train-model command trains model variants from pre-generated simulated data (produced by simulate-data). It reads the simulated reads and labels from {output_dir}/simulated_data/, trains each hyperparameter combination defined in training_params.yaml, and validates each variant on an independently generated set of reads.
For background on the model architecture and training process, see Model Training.
Note: The error model options below (`--mismatch-rate`, `--polyt-error-rate`, etc.) and the cDNA length options (`--min-cdna`, `--max-cdna`) control the post-training validation reads only. These validation reads are generated independently to provide a quick sanity check; they do not affect the training data, which was produced earlier by `simulate-data`.
Command Line Options
| CLI Option | Type | Default | What it controls | When to change it |
|---|---|---|---|---|
| `model_name` | TEXT | required | Library model key defining segment order and motifs | Set to match your protocol |
| `output_dir` | TEXT | required | Directory containing simulated training data | Same directory used for `simulate-data` |
| `--param-file` | TEXT | `utils/training_params.yaml` | Hyperparameter grid file | Only if using a custom file |
| `--training-seq-orders-file` | TEXT | `utils/seq_orders.yaml` | Library definition file | Only if using a custom file |
| `--num-val-reads` | INT | 20 | Number of validation reads to synthesize after training | Increase for a more thorough validation check |
| `--mismatch-rate` | FLOAT | 0.05 | Validation read substitution rate | Match your sequencing platform |
| `--insertion-rate` | FLOAT | 0.05 | Validation read insertion rate | Match your sequencing platform |
| `--deletion-rate` | FLOAT | 0.06 | Validation read deletion rate | Match your sequencing platform |
| `--min-cdna` | INT | 100 | Minimum cDNA length in validation reads | Adjust to match expected read lengths |
| `--max-cdna` | INT | 500 | Maximum cDNA length in validation reads | Adjust to match expected read lengths |
| `--polyt-error-rate` | FLOAT | 0.02 | Validation polyA/polyT error rate | Increase if your data has noisy homopolymers |
| `--max-insertions` | FLOAT | 1 | Validation max insertions per position | Rarely needs changing |
| `--threads` | INT | 2 | CPU threads for validation read synthesis | Increase on multi-core machines |
| `--rc / --no-rc` | FLAG | `--rc` | Include reverse complements in the validation set | Disable only for fixed-orientation protocols |
| `--transcriptome` | TEXT | None | Transcriptome FASTA for validation cDNA | Recommended for realistic validation |
| `--invalid-fraction` | FLOAT | 0.3 | Fraction of validation reads generated as invalid | Rarely needs changing |
| `--gpu-mem` | TEXT | None (12 GB) | GPU memory budget for validation inference | Set to your actual VRAM |
| `--target-tokens` | INT | 1,200,000 | Token budget per GPU for validation inference batching | See Resource Requirements |
| `--vram-headroom` | FLOAT | 0.35 | VRAM headroom fraction for validation inference | Increase if OOM during validation |
| `--min-batch-size` | INT | 1 | Minimum batch size for validation inference | Rarely needs changing |
| `--max-batch-size` | INT | 2,000 | Maximum batch size for validation inference | Rarely needs changing |
Defining Hyperparameters in training_params.yaml
The hyperparameter grid is defined in a YAML file with one top-level key per model. Each key maps to a dictionary of parameter values. If you want to search over multiple values for a parameter, provide them as a comma-separated string — train-model will enumerate all combinations automatically.
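As an illustration of how a comma-separated value expands the grid, the sketch below mimics the Cartesian-product enumeration in plain Python. This is a hypothetical model of the behavior, not train-model's actual code; the parameter names mirror those used in training_params.yaml.

```python
from itertools import product

# A grid where learning_rate is searched over two values and the
# remaining parameters are fixed (single values).
params = {
    "learning_rate": "0.001,0.0005",  # two candidates -> searched
    "dropout_rate": "0.35",           # one candidate -> fixed
    "epochs": "5",
}

# Split each comma-separated string into its list of candidate values.
grid = {k: v.split(",") for k, v in params.items()}

# The Cartesian product over all value lists yields every model
# variant that would be trained.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # -> 2 variants (2 x 1 x 1)
```

Every multi-valued parameter multiplies the number of variants, so large grids grow quickly.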
Note: If you copy and paste the following template, make sure the pasted text retains proper YAML formatting.
Single-Model Example
training_params.yaml

```yaml
10x3p_sc_ont:
  batch_size: 128
  train_fraction: 0.8
  vocab_size: 5
  embedding_dim: 128
  conv_layers: 3
  conv_filters: [128, 128, 128]
  conv_kernel_sizes: [25, 25, 25]
  dilation_rates: [1, 3, 5]
  lstm_layers: 1
  lstm_units: [96]
  bidirectional: true
  crf_layer: true
  attention_heads: 0
  dropout_rate: 0.35
  regularization: 0.01
  learning_rate: 0.001
  epochs: 5
```

Multi-Model Example
Additional models can be added as separate top-level keys. The model name in the YAML file is the model_name used in other Tranquillyzer commands.
training_params_multi.yaml

```yaml
10x5p_sc_ont:
  batch_size: 128
  train_fraction: 0.8
  vocab_size: 5
  embedding_dim: 128
  conv_layers: 4
  conv_filters: [128, 128, 128, 128]
  conv_kernel_sizes: [25, 25, 25, 25]
  dilation_rates: [1, 1, 1, 1]
  lstm_layers: 1
  lstm_units: [96]
  bidirectional: true
  crf_layer: true
  attention_heads: 0
  dropout_rate: 0.35
  regularization: 0.01
  learning_rate: 0.001
  epochs: 5

10x3p_sc_ont:
  batch_size: 128
  train_fraction: 0.8
  vocab_size: 5
  embedding_dim: 128
  conv_layers: 3
  conv_filters: [128, 128, 128]
  conv_kernel_sizes: [25, 25, 25]
  dilation_rates: [1, 3, 5]
  lstm_layers: 1
  lstm_units: [96]
  bidirectional: true
  crf_layer: true
  attention_heads: 0
  dropout_rate: 0.35
  regularization: 0.01
  learning_rate: 0.001
  epochs: 5
```

Example Usage
```bash
tranquillyzer train-model \
    10x3p_sc_ont \
    training_out \
    --num-val-reads 20 \
    --mismatch-rate 0.05 \
    --insertion-rate 0.05 \
    --deletion-rate 0.06 \
    --min-cdna 100 \
    --max-cdna 500 \
    --polyt-error-rate 0.02 \
    --max-insertions 2 \
    --invalid-fraction 0.3 \
    --rc \
    --threads 2 \
    --transcriptome gencode.v44.transcripts.fa
```