train-model CLI

Input Parameters to train-model

Important Note: The error-model options (e.g., --mismatch-rate, --polyt-error-rate) and the cDNA length options (--min-cdna and --max-cdna) below are used only to synthesize a small post-training validation set of --num-val-reads reads. tranquillyzer annotates these reads with the newly trained model and exports a PDF (*_val_viz.pdf) as an initial sanity check. These options do not affect the training reads, which were produced by simulate-data and are stored in reads.pkl and labels.pkl inside the <output_dir>/simulated_data/ directory.
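Because train-model reads its training data from the simulate-data output, it can help to confirm that those files exist before starting a run. The following is a minimal sketch, not part of tranquillyzer: the helper name missing_training_files is hypothetical, and it only checks for the two file names mentioned in the note above without inspecting their contents.

```python
from pathlib import Path

# File names and the simulated_data/ subdirectory come from the note above.
EXPECTED_FILES = ("reads.pkl", "labels.pkl")


def missing_training_files(output_dir):
    """Return the expected simulate-data outputs that are absent under output_dir.

    Hypothetical helper for illustration; tranquillyzer does not ship it.
    """
    data_dir = Path(output_dir) / "simulated_data"
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]


# "training_out" is the output_dir used in the example at the end of this page.
missing = missing_training_files("training_out")
if missing:
    print("Missing training inputs:", ", ".join(missing))
else:
    print("Training inputs found; train-model can be run.")
```

If either file is reported missing, rerun simulate-data with the same output_dir before calling train-model.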

| CLI Option | Type | What It Controls | Default |
|---|---|---|---|
| model_name | TEXT | Library model key defining segment order and motifs | required |
| output_dir | TEXT | Directory where simulated training data lives | required |
| --param-file | TEXT | Path to training_params.tsv used to define the hyperparameter grid for model_name | utils/training_params.tsv |
| --training-seq-orders-file | TEXT | Path to training_seq_orders.tsv defining segment order/motifs for the model | utils/training_seq_orders.tsv |
| --num-val-reads | INT | Number of post-training validation reads to synthesize and visualize | 20 |
| --mismatch-rate | FLOAT | Validation synthesis mismatch rate | 0.05 |
| --insertion-rate | FLOAT | Validation synthesis insertion rate | 0.05 |
| --deletion-rate | FLOAT | Validation synthesis deletion rate | 0.06 |
| --min-cdna | INT | Validation synthesis minimum cDNA length | 100 |
| --max-cdna | INT | Validation synthesis maximum cDNA length | 500 |
| --polyt-error-rate | FLOAT | Validation synthesis error rate inside polyA/polyT segments | 0.02 |
| --max-insertions | FLOAT | Validation synthesis max insertions per position | 1 |
| --threads | INT | CPU threads used for validation read synthesis | 2 |
| --rc / --no-rc | FLAG | Include reverse complements in validation set (doubles --num-val-reads) | --rc |
| --transcriptome | TEXT | Optional transcriptome FASTA used for validation synthesis (otherwise random transcripts) | None (i.e., random transcripts) |
| --invalid-fraction | FLOAT | Fraction of validation reads synthesized as invalid/artifactual | 0.3 |
| --gpu-mem | TEXT | GPU memory (GB) used to guide inference batch sizing for validation annotation/plotting | None (12 GB if GPU present) |
| --target-tokens | INT | Token budget per GPU replica for choosing safe inference batch size (≈ batch × padded_len) | 1200000 |
| --vram-headroom | FLOAT | Fraction of VRAM reserved as headroom when sizing inference batches | 0.35 |
| --min-batch-size | INT | Minimum inference batch size for validation annotation | 1 |
| --max-batch-size | INT | Maximum inference batch size for validation annotation | 2000 |
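The relationship between --target-tokens, --min-batch-size, and --max-batch-size can be illustrated with a little arithmetic. The sketch below assumes the batch size is simply the token budget divided by the padded read length, clamped to the configured bounds. The function name pick_batch_size is hypothetical, and the sketch deliberately ignores --gpu-mem and --vram-headroom, so tranquillyzer's actual sizing logic may produce different numbers.

```python
def pick_batch_size(padded_len, target_tokens=1_200_000,
                    min_batch_size=1, max_batch_size=2000):
    """Largest batch with batch * padded_len <= target_tokens, clamped to the bounds.

    Assumption: this only models the token budget; GPU memory and VRAM
    headroom are not taken into account here.
    """
    if padded_len <= 0:
        raise ValueError("padded_len must be positive")
    batch = target_tokens // padded_len
    return max(min_batch_size, min(max_batch_size, batch))


# Short reads hit the upper bound, very long reads fall back to the lower bound.
for padded_len in (500, 4096, 2_000_000):
    print(padded_len, pick_batch_size(padded_len))
```

With the defaults from the table, a padded length of 500 is capped at 2000 reads per batch, 4096 gives 292, and a 2,000,000-token read falls back to the minimum of 1.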

Example training_params.tsv Template

Note

If you copy and paste the following template, make sure the pasted text retains the tab (\t) characters that serve as the delimiter between the columns.

training_params_example.tsv
parameter   10x5p_sc_ont
batch_size  128
train_fraction  0.8
vocab_size  5
embedding_dim   128
conv_layers 4
conv_filters    128
conv_kernel_size    25
lstm_layers 1
lstm_units  96
bidirectional   TRUE
crf_layer   TRUE
attention_heads 0
dropout_rate    0.35
regularization  0.01
learning_rate   0.001
epochs  5
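Since copy and paste can silently turn tabs into spaces, one way to avoid the problem is to generate the file programmatically. The following is a minimal sketch using Python's standard csv module; the helper name render_training_params is hypothetical, and the parameter values are copied from the template above.

```python
import csv
import io
from pathlib import Path

# Parameter names and values copied from the template above.
PARAMS = [
    ("batch_size", "128"),
    ("train_fraction", "0.8"),
    ("vocab_size", "5"),
    ("embedding_dim", "128"),
    ("conv_layers", "4"),
    ("conv_filters", "128"),
    ("conv_kernel_size", "25"),
    ("lstm_layers", "1"),
    ("lstm_units", "96"),
    ("bidirectional", "TRUE"),
    ("crf_layer", "TRUE"),
    ("attention_heads", "0"),
    ("dropout_rate", "0.35"),
    ("regularization", "0.01"),
    ("learning_rate", "0.001"),
    ("epochs", "5"),
]


def render_training_params(model_name, params):
    """Return the TSV text with a real tab between the two columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["parameter", model_name])
    writer.writerows(params)
    return buffer.getvalue()


# Write the single-model template to the current directory.
Path("training_params.tsv").write_text(render_training_params("10x5p_sc_ont", PARAMS))
```

The resulting file can then be passed to train-model with --param-file training_params.tsv.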

Additional models can be added to training_params.tsv by appending one column per model to the existing file. The name in the header (first) row of each column is the model_name used by other tranquillyzer commands for that model. A realistic example that includes parameters for both the 10x Genomics 5’ and 3’ protocols is:

training_params_example_extended.tsv
parameter   10x5p_sc_ont    10x3p_sc_ont
batch_size  128 128
train_fraction  0.8 0.8
vocab_size  5   5
embedding_dim   128 128
conv_layers 4   3
conv_filters    128 128
conv_kernel_size    25  25
lstm_layers 1   1
lstm_units  96  96
bidirectional   TRUE    TRUE
crf_layer   TRUE    TRUE
attention_heads 0   0
dropout_rate    0.35    0.35
regularization  0.01    0.01
learning_rate   0.001   0.001
epochs  5   1
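To make the one-column-per-model layout concrete, the sketch below reads a multi-model file and pulls out the parameters for a single model_name. It only illustrates the file layout and is not how tranquillyzer itself parses the file. The helper name load_model_params is hypothetical, and the embedded text is an abbreviated copy of the example above with tab delimiters.

```python
import csv
import io

# Abbreviated copy of the extended example above, with real tab delimiters.
EXAMPLE = (
    "parameter\t10x5p_sc_ont\t10x3p_sc_ont\n"
    "conv_layers\t4\t3\n"
    "lstm_units\t96\t96\n"
    "epochs\t5\t1\n"
)


def load_model_params(tsv_text, model_name):
    """Return {parameter: value} for the column whose header equals model_name."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    if model_name not in header[1:]:
        raise KeyError(f"model {model_name!r} not found in header {header[1:]}")
    column = header.index(model_name, 1)  # skip the 'parameter' label column
    return {row[0]: row[column] for row in reader if row}


print(load_model_params(EXAMPLE, "10x3p_sc_ont"))
```

Values are returned as strings, exactly as they appear in the file; any type conversion is left to the caller.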

Example Use Case

tranquillyzer train-model \
    10x3p_sc_ont \
    training_out \
    --num-val-reads 20 \
    --mismatch-rate 0.05 \
    --insertion-rate 0.05 \
    --deletion-rate 0.06 \
    --min-cdna 100 \
    --max-cdna 500 \
    --polyt-error-rate 0.02 \
    --max-insertions 2 \
    --invalid-fraction 0.3 \
    --rc \
    --threads 2 \
    --transcriptome gencode.v44.transcripts.fa
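When train-model is launched from a Python pipeline rather than an interactive shell, the same command can be assembled as an argument list. The following is a minimal sketch that uses only the options documented in the table above; the helper name build_train_command is hypothetical, and the actual call to subprocess.run is left commented out so that the snippet only prints the command.

```python
import shlex


def build_train_command(model_name, output_dir, **options):
    """Assemble a tranquillyzer train-model argument list from keyword options.

    Keyword names use underscores (num_val_reads) and are converted to the
    documented dashed flags (--num-val-reads). Boolean values map to the
    --flag / --no-flag form, as documented for --rc / --no-rc.
    """
    cmd = ["tranquillyzer", "train-model", model_name, output_dir]
    for key, value in options.items():
        name = key.replace("_", "-")
        if isinstance(value, bool):
            cmd.append(f"--{name}" if value else f"--no-{name}")
        else:
            cmd.extend([f"--{name}", str(value)])
    return cmd


cmd = build_train_command(
    "10x3p_sc_ont",
    "training_out",
    num_val_reads=20,
    max_insertions=2,
    rc=True,
    threads=2,
    transcriptome="gencode.v44.transcripts.fa",
)
print(shlex.join(cmd))

# To actually run it (requires tranquillyzer on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Options that are omitted fall back to the defaults listed in the table.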