train-model CLI
Input Parameters to train-model
Important Note: The error model options (e.g., --mismatch-rate and --polyt-error-rate) and the cDNA length options (--min-cdna and --max-cdna) below are used to synthesize a small post-training validation set of --num-val-reads reads. tranquillyzer annotates these reads with the newly trained model and exports a PDF (*_val_viz.pdf) as an initial sanity check. These options do not affect the training reads, which were produced by simulate-data and are stored in reads.pkl and labels.pkl inside the <output_dir>/simulated_data/ directory.
| CLI Option | Type | What It Controls | Default |
|---|---|---|---|
| model_name | TEXT | Library model key defining segment order and motifs | required |
| output_dir | TEXT | Directory where simulated training data lives | required |
| --param-file | TEXT | Path to training_params.tsv used to define the hyperparameter grid for model_name | utils/training_params.tsv |
| --training-seq-orders-file | TEXT | Path to training_seq_orders.tsv defining segment order/motifs for the model | utils/training_seq_orders.tsv |
| --num-val-reads | INT | Number of post-training validation reads to synthesize and visualize | 20 |
| --mismatch-rate | FLOAT | Validation synthesis mismatch rate | 0.05 |
| --insertion-rate | FLOAT | Validation synthesis insertion rate | 0.05 |
| --deletion-rate | FLOAT | Validation synthesis deletion rate | 0.06 |
| --min-cdna | INT | Validation synthesis minimum cDNA length | 100 |
| --max-cdna | INT | Validation synthesis maximum cDNA length | 500 |
| --polyt-error-rate | FLOAT | Validation synthesis error rate inside polyA/polyT segments | 0.02 |
| --max-insertions | FLOAT | Validation synthesis max insertions per position | 1 |
| --threads | INT | CPU threads used for validation read synthesis | 2 |
| --rc / --no-rc | FLAG | Include reverse complements in validation set (doubles --num-val-reads) | --rc |
| --transcriptome | TEXT | Optional transcriptome FASTA used for validation synthesis (otherwise random transcripts) | None (i.e., random transcripts) |
| --invalid-fraction | FLOAT | Fraction of validation reads synthesized as invalid/artifactual | 0.3 |
| --gpu-mem | TEXT | GPU memory (GB) used to guide inference batch sizing for validation annotation/plotting | None (12 GB if GPU present) |
| --target-tokens | INT | Token budget per GPU replica for choosing safe inference batch size (≅ batch × padded_len) | 1200000 |
| --vram-headroom | FLOAT | Fraction of VRAM reserved as headroom when sizing inference batches | 0.35 |
| --min-batch-size | INT | Minimum inference batch size for validation annotation | 1 |
| --max-batch-size | INT | Maximum inference batch size for validation annotation | 2000 |
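
The relationship between --target-tokens, --min-batch-size, and --max-batch-size can be illustrated with a small sketch. This is not tranquillyzer's actual sizing code: the function name pick_batch_size is hypothetical, and the sketch only assumes the relation stated in the table (token budget ≅ batch × padded_len) plus clamping to the minimum and maximum. It deliberately ignores --gpu-mem and --vram-headroom, whose exact effect is not described here.

```python
def pick_batch_size(padded_len, target_tokens=1_200_000,
                    min_batch=1, max_batch=2000):
    """Hypothetical helper, for illustration only.

    Returns the largest batch size for which batch * padded_len stays
    within target_tokens, clamped to [min_batch, max_batch].
    """
    if padded_len <= 0:
        raise ValueError("padded_len must be positive")
    batch = target_tokens // padded_len          # token budget / padded read length
    return max(min_batch, min(max_batch, batch)) # clamp to the allowed range


# Defaults taken from the option table above.
for padded_len in (500, 2_000, 50_000):
    print(padded_len, pick_batch_size(padded_len))
```

Under these assumptions, short padded lengths hit the --max-batch-size ceiling, while very long reads fall back toward --min-batch-size.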
Example training_params.tsv Template
If you copy and paste the following template, make sure the pasted text retains the tab (\t) characters that serve as the delimiter between the columns.
training_params_example.tsv
parameter 10x5p_sc_ont
batch_size 128
train_fraction 0.8
vocab_size 5
embedding_dim 128
conv_layers 4
conv_filters 128
conv_kernel_size 25
lstm_layers 1
lstm_units 96
bidirectional TRUE
crf_layer TRUE
attention_heads 0
dropout_rate 0.35
regularization 0.01
learning_rate 0.001
epochs 5
Additional models can be added to training_params.tsv by appending one column per model. The name in the first row of each model column (for example, 10x5p_sc_ont) is the model_name used by other tranquillyzer commands for that model. A realistic example that includes parameters for both the 10x Genomics 5’ and 3’ protocols is:
training_params_example_extended.tsv
parameter 10x5p_sc_ont 10x3p_sc_ont
batch_size 128 128
train_fraction 0.8 0.8
vocab_size 5 5
embedding_dim 128 128
conv_layers 4 3
conv_filters 128 128
conv_kernel_size 25 25
lstm_layers 1 1
lstm_units 96 96
bidirectional TRUE TRUE
crf_layer TRUE TRUE
attention_heads 0 0
dropout_rate 0.35 0.35
regularization 0.01 0.01
learning_rate 0.001 0.001
epochs 5 1
Example Use Case
tranquillyzer train-model \
10x3p_sc_ont \
training_out \
--num-val-reads 20 \
--mismatch-rate 0.05 \
--insertion-rate 0.05 \
--deletion-rate 0.06 \
--min-cdna 100 \
--max-cdna 500 \
--polyt-error-rate 0.02 \
--max-insertions 2 \
--invalid-fraction 0.3 \
--rc \
--threads 2 \
--transcriptome gencode.v44.transcripts.fa
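
Before launching a run like the one above, it can help to confirm that the parameter file is really tab-delimited and contains a column for the chosen model_name, since tabs are easily lost when pasting a template. The following is a minimal sketch using only the Python standard library. The function load_model_params is hypothetical and not part of tranquillyzer; it assumes the layout shown in the templates (a header row starting with parameter, followed by one row per hyperparameter).

```python
import csv
import io


def load_model_params(tsv_text, model_name):
    """Hypothetical checker: return {parameter: value} for one model column."""
    rows = [r for r in csv.reader(io.StringIO(tsv_text), delimiter="\t") if r]
    if not rows:
        raise ValueError("parameter table is empty")
    header = rows[0]
    if len(header) < 2:
        # A single field usually means the tabs were replaced by spaces.
        raise ValueError("header has one field; tabs were probably lost")
    if model_name not in header[1:]:
        raise KeyError(f"no column named {model_name!r} in {header[1:]}")
    col = header.index(model_name, 1)  # skip the 'parameter' label column
    params = {}
    for row in rows[1:]:
        if len(row) != len(header):
            raise ValueError(f"row {row!r} has {len(row)} fields, "
                             f"expected {len(header)}")
        params[row[0]] = row[col]
    return params


# Abbreviated version of the extended template, with explicit tabs.
example = (
    "parameter\t10x5p_sc_ont\t10x3p_sc_ont\n"
    "batch_size\t128\t128\n"
    "conv_layers\t4\t3\n"
    "epochs\t5\t1\n"
)
params = load_model_params(example, "10x3p_sc_ont")
print(params)
```

To check a file on disk, the same function can be applied to the text read from the path passed to --param-file.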