simulate-data CLI
The simulate-data command generates synthetic labeled reads for training. It builds reads according to the library structure defined in seq_orders.yaml, introduces sequencing errors, and outputs the reads and their per-base labels as pickle files in {output_dir}/simulated_data/.
For background on how reads are simulated and why, see Read Structure Simulation and Sequencing Error Modeling.
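To make "reads and their per-base labels" concrete, here is a minimal sketch of how a read could be assembled from an ordered list of segments while recording one label per base. It is an illustration only, not the tool's implementation: the function name `build_labeled_read`, the segment names, the fixed sequences, and the lengths are all hypothetical, and real segment orders come from seq_orders.yaml.

```python
import random


def build_labeled_read(segments, rng):
    """Concatenate segments and return (read, labels) with one label per base.

    `segments` is a list of (label, spec) pairs. A `spec` is either a fixed
    sequence string or an int, in which case that many random bases are drawn.
    """
    pieces, labels = [], []
    for label, spec in segments:
        if isinstance(spec, int):
            seq = "".join(rng.choice("ACGT") for _ in range(spec))
        else:
            seq = spec
        pieces.append(seq)
        # Every base in this segment carries the segment's label.
        labels.extend([label] * len(seq))
    return "".join(pieces), labels


# Hypothetical segment order, for illustration only.
segments = [
    ("adapter", "ACGTACGTAC"),  # made-up fixed sequence
    ("barcode", 16),            # random bases
    ("umi", 12),                # random bases
    ("polyT", "T" * 30),
    ("cdna", 200),              # a length between --min-cdna and --max-cdna
]

rng = random.Random(0)
read, labels = build_labeled_read(segments, rng)
print(len(read), labels[0], labels[-1])
```

Keeping the labels as a list aligned base-for-base with the read is what lets a downstream model learn segment boundaries directly from the sequence.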
Command Line Options
| CLI Option | Type | Default | What it controls | When to change it |
|---|---|---|---|---|
| model_name | TEXT | required | Library model key defining segment order and motifs | Set to match your protocol |
| output_dir | TEXT | required | Output directory (writes to simulated_data/ subdirectory) | N/A |
| --training-seq-orders-file | TEXT | utils/seq_orders.yaml | Library definition file specifying segment order and sequences | Only if using a custom file |
| --num-reads | INT | 50,000 | Number of reads to simulate (before reverse-complement doubling) | Increase for more training data; decrease for quick tests |
| --mismatch-rate | FLOAT | 0.05 | Base substitution probability | Match your sequencing platform’s error profile |
| --insertion-rate | FLOAT | 0.05 | Base insertion probability | Match your sequencing platform’s error profile |
| --deletion-rate | FLOAT | 0.06 | Base deletion probability | Match your sequencing platform’s error profile |
| --min-cdna | INT | 100 | Minimum cDNA length in simulated reads | Adjust to match expected read lengths |
| --max-cdna | INT | 500 | Maximum cDNA length in simulated reads | Adjust to match expected read lengths |
| --polyt-error-rate | FLOAT | 0.02 | Error rate within polyA/polyT segments | Increase if your data has noisy homopolymers |
| --max-insertions | INT | 1 | Maximum insertions allowed after a single base | Rarely needs changing |
| --threads | INT | 2 | CPU threads for read simulation | Increase for faster generation on multi-core machines |
| --rc / --no-rc | FLAG | --rc | Include reverse complements (doubles the training set) | Disable only if your protocol has a fixed known orientation |
| --transcriptome | TEXT | None | Transcriptome FASTA for realistic cDNA generation | Recommended; improves model accuracy on real data |
| --invalid-fraction | FLOAT | 0.3 | Fraction of reads generated as structurally invalid artifacts | Increase if your data has many malformed reads |
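The three error rates and the insertion cap interact per base. The sketch below shows one plausible way such parameters could be applied while keeping the label list aligned with the mutated read. It is an assumption-laden illustration, not the tool's actual algorithm: the function `add_errors` is hypothetical, and treating deletion and mismatch as mutually exclusive outcomes of a single draw is a simplification chosen here.

```python
import random


def add_errors(seq, labels, mismatch_rate, insertion_rate, deletion_rate,
               max_insertions, rng):
    """Return (noisy_seq, noisy_labels) after per-base error injection.

    Assumed model (illustrative only): each base is deleted, substituted, or
    kept based on one uniform draw, then up to `max_insertions` random bases
    may be inserted after it. Inserted bases inherit the label of the base
    they follow, so the labels stay the same length as the read.
    """
    out_seq, out_labels = [], []
    for base, label in zip(seq, labels):
        r = rng.random()
        if r < deletion_rate:
            continue  # drop the base together with its label
        if r < deletion_rate + mismatch_rate:
            # Substitute with a different base so it is a real mismatch.
            base = rng.choice([b for b in "ACGT" if b != base])
        out_seq.append(base)
        out_labels.append(label)
        for _ in range(max_insertions):
            if rng.random() >= insertion_rate:
                break
            out_seq.append(rng.choice("ACGT"))
            out_labels.append(label)
    return "".join(out_seq), out_labels


rng = random.Random(42)
seq = "ACGT" * 25
labels = ["cdna"] * len(seq)
noisy_seq, noisy_labels = add_errors(
    seq, labels,
    mismatch_rate=0.05, insertion_rate=0.05, deletion_rate=0.06,
    max_insertions=1, rng=rng,
)
print(len(seq), len(noisy_seq))
```

With the defaults from the table, the deletion rate is slightly higher than the insertion rate, so under this simplified model the noisy read tends to come out a little shorter than the clean one.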
Example Usage
tranquillyzer simulate-data \
10x3p_sc_ont \
training_out \
--num-reads 50000 \
--mismatch-rate 0.05 \
--insertion-rate 0.05 \
--deletion-rate 0.06 \
--min-cdna 100 \
--max-cdna 500 \
--polyt-error-rate 0.02 \
--max-insertions 1 \
--invalid-fraction 0.3 \
--rc \
--threads 2 \
--transcriptome gencode.v44.transcripts.fa
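After a run like the one above, the outputs can be inspected from Python. The following is a hedged sketch: the helper `load_simulated_data` is hypothetical, and the `.pkl` extension is an assumption, since the exact file names written to simulated_data/ are not listed here. Adjust the glob pattern to whatever the directory actually contains, and only unpickle files you generated yourself, because unpickling can execute arbitrary code.

```python
import pickle
from pathlib import Path


def load_simulated_data(output_dir, pattern="*.pkl"):
    """Load every pickle in {output_dir}/simulated_data/ into a dict.

    Keys are file names and values are the unpickled objects. The default
    pattern assumes a .pkl extension, which may not match the real outputs.
    """
    data_dir = Path(output_dir) / "simulated_data"
    if not data_dir.is_dir():
        return {}
    loaded = {}
    for path in sorted(data_dir.glob(pattern)):
        with path.open("rb") as fh:
            loaded[path.name] = pickle.load(fh)
    return loaded


if __name__ == "__main__":
    # "training_out" matches the output_dir used in the example command.
    for name, obj in load_simulated_data("training_out").items():
        size = len(obj) if hasattr(obj, "__len__") else "n/a"
        print(f"{name}: {type(obj).__name__}, length {size}")
```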