simulate-data CLI

The simulate-data command generates synthetic labeled reads for model training. It builds each read from the library structure that model_name selects in seq_orders.yaml, introduces sequencing errors, and writes the reads and their per-base labels as pickle files to {output_dir}/simulated_data/.

For background on how reads are simulated and why, see Read Structure Simulation and Sequencing Error Modeling.
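
To check what a run produced, the pickle files can be loaded and summarized with a few lines of Python. This is a minimal sketch, not part of the tool: the helper name `summarize_simulated_data` is hypothetical, and because the file names and object layout inside `simulated_data/` are not specified here, it tries every file in the directory and reports only each object's type and length.

```python
import pickle
from pathlib import Path


def summarize_simulated_data(output_dir):
    """Return {file name: (type name, length or None)} for each file that unpickles.

    Hypothetical helper: the file names and object types written by
    simulate-data are not assumed, so every file in the directory is tried.
    """
    data_dir = Path(output_dir) / "simulated_data"
    summary = {}
    for path in sorted(data_dir.iterdir()):
        if not path.is_file():
            continue
        try:
            with path.open("rb") as handle:
                obj = pickle.load(handle)
        except (pickle.UnpicklingError, EOFError, AttributeError,
                ImportError, IndexError, ValueError):
            continue  # not a loadable pickle; skip it
        size = len(obj) if hasattr(obj, "__len__") else None
        summary[path.name] = (type(obj).__name__, size)
    return summary


if __name__ == "__main__":
    # "training_out" matches the output_dir used in Example Usage.
    if (Path("training_out") / "simulated_data").is_dir():
        for name, (kind, size) in summarize_simulated_data("training_out").items():
            print(f"{name}: {kind}, length={size}")
```

Only unpickle files you generated yourself, since loading a pickle can execute arbitrary code. If the stored objects use third-party types, those packages must be importable in the environment where the sketch runs.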

Command Line Options

CLI Option	Type	Default	What it controls	When to change it
`model_name`	TEXT	required	Library model key defining segment order and motifs	Set to match your protocol
`output_dir`	TEXT	required	Output directory (writes to `simulated_data/` subdirectory)	—
`--training-seq-orders-file`	TEXT	`utils/seq_orders.yaml`	Library definition file specifying segment order and sequences	Only if using a custom file
`--num-reads`	INT	50000	Number of reads to simulate (before reverse-complement doubling)	Increase for more training data; decrease for quick tests
`--mismatch-rate`	FLOAT	0.05	Base substitution probability	Match your sequencing platform’s error profile
`--insertion-rate`	FLOAT	0.05	Base insertion probability	Match your sequencing platform’s error profile
`--deletion-rate`	FLOAT	0.06	Base deletion probability	Match your sequencing platform’s error profile
`--min-cdna`	INT	100	Minimum cDNA length in simulated reads	Adjust to match expected read lengths
`--max-cdna`	INT	500	Maximum cDNA length in simulated reads	Adjust to match expected read lengths
`--polyt-error-rate`	FLOAT	0.02	Error rate within polyA/polyT segments	Increase if your data has noisy homopolymers
`--max-insertions`	INT	1	Maximum insertions allowed after a single base	Rarely needs changing
`--threads`	INT	2	CPU threads for read simulation	Increase for faster generation on multi-core machines
`--rc` / `--no-rc`	FLAG	`--rc`	Include reverse complements (doubles the training set)	Disable only if your protocol has a fixed known orientation
`--transcriptome`	TEXT	None	Transcriptome FASTA for realistic cDNA generation	Recommended — improves model accuracy on real data
`--invalid-fraction`	FLOAT	0.3	Fraction of reads generated as structurally invalid artifacts	Increase if your data has many malformed reads
`--min-spacer`	INT	0	Minimum length of random cDNA spacer between concatenated fragments	Rarely changed
`--max-spacer`	INT	50	Maximum length of random cDNA spacer between concatenated fragments	Raise to cover longer real chimeric junctions
`--min-flank`	INT	0	Minimum length of random cDNA flank at each terminal end of a read	Rarely changed
`--max-flank`	INT	50	Maximum length of random cDNA flank at each terminal end of a read	Raise (e.g. 200–300) to cover real ONT adapter-flank lengths and reduce edge false positives
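
To make the error-rate options concrete, the sketch below applies per-base substitution, deletion, and insertion probabilities to one labeled read while keeping the labels aligned with the bases. It is an illustration under stated assumptions, not the tool's actual implementation: the function name `add_errors`, the choice to treat deletion and substitution as mutually exclusive per base, and the rule that an inserted base inherits the label of the base it follows are all assumptions. The default values mirror the table above.

```python
import random

BASES = "ACGT"


def add_errors(seq, labels, mismatch_rate=0.05, insertion_rate=0.05,
               deletion_rate=0.06, max_insertions=1, rng=None):
    """Return (noisy_seq, noisy_labels) for one read; an illustrative sketch only."""
    rng = rng or random.Random()
    out_seq, out_labels = [], []
    for base, label in zip(seq, labels):
        roll = rng.random()
        if roll < deletion_rate:
            continue  # deletion: drop the base together with its label
        if roll < deletion_rate + mismatch_rate:
            # substitution: replace with a different base, label unchanged
            base = rng.choice([b for b in BASES if b != base])
        out_seq.append(base)
        out_labels.append(label)
        if max_insertions > 0 and rng.random() < insertion_rate:
            # insertion: 1..max_insertions random bases after this one,
            # labeled like the base they follow (an assumption)
            for _ in range(rng.randint(1, max_insertions)):
                out_seq.append(rng.choice(BASES))
                out_labels.append(label)
    return "".join(out_seq), out_labels


if __name__ == "__main__":
    read = "ACGTACGTAC"
    labels = ["barcode"] * 4 + ["cDNA"] * 6  # hypothetical label names
    noisy, noisy_labels = add_errors(read, labels, rng=random.Random(0))
    print(noisy, len(noisy) == len(noisy_labels))
```

Whatever the real error model looks like, the invariant shown here is the one that matters for training: every emitted base has exactly one label, so insertions and deletions change the read length without shifting the labels of the surrounding segments.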

Example Usage

tranquillyzer simulate-data \
    10x3p_sc_ont \
    training_out \
    --num-reads 50000 \
    --mismatch-rate 0.05 \
    --insertion-rate 0.05 \
    --deletion-rate 0.06 \
    --min-cdna 100 \
    --max-cdna 500 \
    --polyt-error-rate 0.02 \
    --max-insertions 1 \
    --invalid-fraction 0.3 \
    --rc \
    --threads 2 \
    --transcriptome gencode.v44.transcripts.fa
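
With `--num-reads 50000` and `--rc`, the option table implies the command above yields a training set of 100,000 reads, because each simulated read is paired with its reverse complement. The sketch below shows one way such doubling can work. It is an assumption-laden illustration rather than the tool's code: the helper names are hypothetical, and reversing the per-base label list for the reverse-complemented read is assumed here, not confirmed by this page.

```python
# Translation table for complementing DNA bases; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def with_reverse_complements(reads, labels):
    """Return reads and labels extended with reverse-complemented copies.

    Assumption: the label list of a reverse-complemented read is the
    original label list reversed, so labels stay aligned base by base.
    """
    rc_reads = [reverse_complement(read) for read in reads]
    rc_labels = [list(reversed(read_labels)) for read_labels in labels]
    return reads + rc_reads, labels + rc_labels


if __name__ == "__main__":
    reads = ["AACG", "TTTTGC"]
    labels = [["bc", "bc", "cDNA", "cDNA"],
              ["polyT"] * 4 + ["cDNA"] * 2]  # hypothetical label names
    all_reads, all_labels = with_reverse_complements(reads, labels)
    print(len(reads), "->", len(all_reads))
    print(all_reads[2], all_labels[2])
```

Passing `--no-rc` corresponds to skipping this step, which halves the training set and is appropriate only when the read orientation is fixed and known.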