simulate-data CLI

The simulate-data command generates synthetic labeled reads for training. It builds reads according to the library structure defined in seq_orders.yaml, introduces sequencing errors, and outputs the reads and their per-base labels as pickle files in {output_dir}/simulated_data/.

For background on how reads are simulated and why, see Read Structure Simulation and Sequencing Error Modeling.
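To check what a run produced, the pickle files in {output_dir}/simulated_data/ can be loaded and summarized. The sketch below is a minimal example under stated assumptions: this page does not specify the file names or the structure of the pickled objects, so the code loads every file in the directory that unpickles cleanly and reports only its type and length. The helper name `summarize_pickles` is hypothetical and not part of tranquillyzer. Only unpickle files you generated yourself, since loading a pickle can execute arbitrary code.

```python
import pickle
from pathlib import Path


def summarize_pickles(sim_dir):
    """Return {file name: (type name, length or None)} for each loadable pickle.

    Assumption: file names and object layout are unknown, so every regular
    file in sim_dir is tried and anything that fails to unpickle is skipped.
    """
    directory = Path(sim_dir)
    if not directory.is_dir():
        return {}
    summary = {}
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        try:
            with path.open("rb") as handle:
                obj = pickle.load(handle)
        except Exception:
            # Not a pickle, or not loadable in this environment; skip it.
            continue
        size = len(obj) if hasattr(obj, "__len__") else None
        summary[path.name] = (type(obj).__name__, size)
    return summary


if __name__ == "__main__":
    # "training_out" is the output_dir used in the example on this page.
    for name, (kind, size) in summarize_pickles("training_out/simulated_data").items():
        print(f"{name}: {kind}, length={size}")
```

If the reads and labels are stored as parallel sequences, their reported lengths should match, which makes this a quick sanity check before training.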

Command Line Options

| CLI Option | Type | Default | What it controls | When to change it |
|---|---|---|---|---|
| model_name | TEXT | required | Library model key defining segment order and motifs | Set to match your protocol |
| output_dir | TEXT | required | Output directory (writes to simulated_data/ subdirectory) | |
| --training-seq-orders-file | TEXT | utils/seq_orders.yaml | Library definition file specifying segment order and sequences | Only if using a custom file |
| --num-reads | INT | 50,000 | Number of reads to simulate (before reverse-complement doubling) | Increase for more training data; decrease for quick tests |
| --mismatch-rate | FLOAT | 0.05 | Base substitution probability | Match your sequencing platform's error profile |
| --insertion-rate | FLOAT | 0.05 | Base insertion probability | Match your sequencing platform's error profile |
| --deletion-rate | FLOAT | 0.06 | Base deletion probability | Match your sequencing platform's error profile |
| --min-cdna | INT | 100 | Minimum cDNA length in simulated reads | Adjust to match expected read lengths |
| --max-cdna | INT | 500 | Maximum cDNA length in simulated reads | Adjust to match expected read lengths |
| --polyt-error-rate | FLOAT | 0.02 | Error rate within polyA/polyT segments | Increase if your data has noisy homopolymers |
| --max-insertions | INT | 1 | Maximum insertions allowed after a single base | Rarely needs changing |
| --threads | INT | 2 | CPU threads for read simulation | Increase for faster generation on multi-core machines |
| --rc / --no-rc | FLAG | --rc | Include reverse complements (doubles the training set) | Disable only if your protocol has a fixed known orientation |
| --transcriptome | TEXT | None | Transcriptome FASTA for realistic cDNA generation | Recommended; improves model accuracy on real data |
| --invalid-fraction | FLOAT | 0.3 | Fraction of reads generated as structurally invalid artifacts | Increase if your data has many malformed reads |
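To build intuition for how the three error rates and --max-insertions interact, the following is an illustrative sketch of a simple independent per-base error model. It is not tranquillyzer's implementation: the function name `add_errors`, the order in which errors are applied, and the choice of a uniform random replacement base are all assumptions made for the example. It also omits the per-base labels that the real command tracks alongside each read.

```python
import random

BASES = "ACGT"


def add_errors(seq, mismatch_rate=0.05, insertion_rate=0.05,
               deletion_rate=0.06, max_insertions=1, rng=None):
    """Return a noisy copy of seq under an assumed independent per-base model.

    Defaults mirror the CLI defaults in the table above; the model itself is
    a hypothetical illustration, not the tool's actual algorithm.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for base in seq:
        roll = rng.random()
        if roll < deletion_rate:
            # Deletion: drop this base entirely.
            pass
        elif roll < deletion_rate + mismatch_rate:
            # Substitution: replace with one of the three other bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            # No deletion or substitution: keep the base unchanged.
            out.append(base)
        # Insertion: add up to max_insertions random bases after this position.
        inserted = 0
        while inserted < max_insertions and rng.random() < insertion_rate:
            out.append(rng.choice(BASES))
            inserted += 1
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same read
    print(add_errors("ACGTACGTACGTACGTACGT", rng=rng))
```

Under this model, the default rates leave roughly 89% of bases untouched by deletion or substitution (1 - 0.06 - 0.05), and the cap of one insertion per base keeps long runs of inserted bases from appearing.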

Example Usage

tranquillyzer simulate-data \
    10x3p_sc_ont \
    training_out \
    --num-reads 50000 \
    --mismatch-rate 0.05 \
    --insertion-rate 0.05 \
    --deletion-rate 0.06 \
    --min-cdna 100 \
    --max-cdna 500 \
    --polyt-error-rate 0.02 \
    --max-insertions 1 \
    --invalid-fraction 0.3 \
    --rc \
    --threads 2 \
    --transcriptome gencode.v44.transcripts.fa
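When scripting several runs (for example, a small smoke test before a full simulation), it can help to assemble the command programmatically. The sketch below builds the argument list from keyword options using only the flags documented on this page. The helper names `build_simulate_command` and `expected_read_count` and the output directory `quick_test` are hypothetical. The read-count helper reflects only the reverse-complement doubling described in the options table; it does not model how --invalid-fraction affects the total, which this page does not specify.

```python
import shlex


def build_simulate_command(model_name, output_dir, **options):
    """Build the argv list for `tranquillyzer simulate-data`.

    Keyword names map to flags by replacing underscores with hyphens,
    e.g. num_reads=1000 becomes `--num-reads 1000`.
    """
    cmd = ["tranquillyzer", "simulate-data", model_name, output_dir]
    for key, value in options.items():
        name = key.replace("_", "-")
        if isinstance(value, bool):
            # Boolean options use the --flag / --no-flag form (--rc / --no-rc).
            cmd.append(f"--{name}" if value else f"--no-{name}")
        elif value is not None:
            cmd.extend([f"--{name}", str(value)])
    return cmd


def expected_read_count(num_reads, rc=True):
    """Read count after reverse-complement doubling, per the options table."""
    return num_reads * 2 if rc else num_reads


# A small, fast configuration for a smoke test.
cmd = build_simulate_command(
    "10x3p_sc_ont", "quick_test", num_reads=1000, rc=False, threads=2
)
print(shlex.join(cmd))
print(expected_read_count(50000, rc=True))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the run. With the example settings above (50,000 reads and --rc), the doubling yields 100,000 reads.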