Read Simulation
The simulate-data command generates synthetic labeled reads for model training. This process has two stages: first, reads are assembled from a defined library structure with known ground-truth labels; then, realistic sequencing errors are introduced so the model learns to handle noisy data.
Stage 1: Read Structure Simulation
Segment-Based Read Construction
Synthetic reads are assembled by concatenating library-specific segments in a defined order — adapters, barcodes, UMIs, cDNA, polyA/T tails, and any other structural elements specified by the protocol. Think of it as building a read from a template: each segment is generated according to its type, and the segments are joined together to form a complete read.
The segment order and any fixed sequence motifs (e.g., adapter sequences) are specified in the library definition file (seq_orders.yaml). During assembly, the simulator records both the nucleotide sequence and a per-base label that identifies which segment each base belongs to. These labels are the ground-truth annotations used for training.
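As a rough illustration of how a sequence and its per-base labels can be built together, here is a minimal Python sketch. The function name assemble_read and the example barcode and UMI sequences are hypothetical and are not taken from the simulator's code; only the adapter sequence and segment names come from the examples in this document.

```python
from typing import List, Tuple


def assemble_read(segments: List[Tuple[str, str]]) -> Tuple[str, List[str]]:
    """Concatenate (name, sequence) pairs and return the read plus one label per base."""
    pieces: List[str] = []
    labels: List[str] = []
    for name, seq in segments:
        pieces.append(seq)
        # Every base of this segment carries the segment's name as its label.
        labels.extend([name] * len(seq))
    return "".join(pieces), labels


# Hypothetical example: a fixed adapter followed by a made-up barcode and UMI.
read, labels = assemble_read([
    ("adapter_5p", "AGATCGGAAGAGC"),
    ("CBC", "ACGTACGTACGTACGT"),
    ("UMI", "TTGCAATGCCGA"),
])
```

The sequence and the label list always have the same length, which is the property that later steps (error injection, reverse complementing) need to preserve.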
Defining Your Library Structure
Each library protocol is defined in seq_orders.yaml with the following information:
- Segment names and order: the structural elements in the order they appear in a read (e.g., p5,i5,CBC,UMI,cDNA,polyA,p7)
- Segment patterns: the sequence or pattern for each segment:
  - A literal DNA string (e.g., CTACACGACGCTCTTCCGATCT) for fixed adapter/primer sequences
  - NX (e.g., N16) for a fixed-length unknown sequence (barcodes, UMIs)
  - NN for a variable-length unknown sequence (cDNA)
  - A or T for polyA or polyT tails of random length
- Barcodes and UMIs: which segments define a cell and which define a molecule
- Valid structures: the acceptable segment orderings that the model should recognize during inference
- Training structures: the segment orderings used for training, including proportions and optional artifact modes
Writing Your Own seq_orders.yaml
The seq_orders.yaml file has a two-level structure: libraries define the sequencing chemistry, and models reference a library (optionally overriding specific fields). This separation means you define your protocol once as a library and then create multiple model variants from it without duplicating configuration.
Step 1: Define a Library
A library captures everything about your sequencing protocol’s read structure. Here is a minimal example for a hypothetical single-cell protocol with a 5’ adapter, a 16 bp cell barcode, a 12 bp UMI, cDNA, a polyA tail, and a 3’ adapter:
    libraries:
      my_protocol:
        strand: fwd
        barcodes: [CBC]
        umis: [UMI]
        segments:
          - name: adapter_5p
            pattern: AGATCGGAAGAGC
          - name: CBC
            pattern: N16
          - name: UMI
            pattern: N12
          - name: cDNA
            pattern: NN
          - name: polyA
            pattern: A
          - name: adapter_3p
            pattern: GTACTCTGCGTTG

Key points:
- strand: set to fwd or rev, depending on which orientation the cDNA is expected to be in relative to the reference transcriptome.
- barcodes and umis: list the segment names that define a cell and a molecule, respectively. These names must match entries in the segments list.
- segments: an ordered list defining each structural element. The name field is the label the model will learn to predict. The pattern field specifies how to generate that segment:
| Pattern | Meaning | Example use |
|---|---|---|
| Literal string | Fixed known sequence | Adapter/primer sequences |
| NX | Random sequence of exactly X bases | Barcodes (N16), UMIs (N12) |
| NN | Random sequence of variable length | cDNA |
| A | PolyA tail (random length) | 3’ polyA |
| T | PolyT tail (random length) | 5’ polyT |
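The pattern rules in the table can be read as a small dispatch on the pattern string. The sketch below is one way to express that in Python; generate_segment, variable_range, and tail_range are hypothetical names, and the length ranges are placeholder values chosen for illustration, not the simulator's defaults.

```python
import random
import re


def generate_segment(pattern: str, rng: random.Random,
                     variable_range=(50, 200), tail_range=(10, 30)) -> str:
    """Generate one segment from a pattern string (illustrative only)."""
    if pattern == "NN":
        # Variable-length random sequence; the range here is an assumption.
        length = rng.randint(*variable_range)
        return "".join(rng.choice("ACGT") for _ in range(length))
    fixed = re.fullmatch(r"N(\d+)", pattern)
    if fixed:
        # NX: random sequence of exactly X bases.
        return "".join(rng.choice("ACGT") for _ in range(int(fixed.group(1))))
    if pattern in ("A", "T"):
        # Homopolymer tail of random length; the range here is an assumption.
        return pattern * rng.randint(*tail_range)
    # Anything else is treated as a literal fixed sequence.
    return pattern


rng = random.Random(0)
barcode = generate_segment("N16", rng)
adapter = generate_segment("AGATCGGAAGAGC", rng)
tail = generate_segment("A", rng)
```

Passing an explicit random.Random instance keeps the generated reads reproducible from a seed.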
Step 2: Define Valid Structures
Valid structures tell Tranquillyzer which segment orderings to accept as correctly annotated during inference. A read matching any of the listed structures is considered valid. Typically, you define one structure for the full-length read and optionally others for known truncation variants:
    valid_structures:
      - [adapter_5p, CBC, UMI, cDNA, polyA, adapter_3p]  # full-length read
      - [adapter_5p, CBC, UMI, cDNA, adapter_3p]         # missing polyA (still valid)

Step 3: Define Training Structures
Training structures control which read layouts are generated during simulation and in what proportions. This is where you teach the model to handle both well-formed reads and common artifacts.
Each training structure has:
- order: the sequence of segment names
- proportion: the fraction of training reads generated with this layout (all proportions should sum to 1.0)
- repeat (optional): number of times to concatenate the read (simulates read concatenation artifacts)
- rc_pattern (optional): a list of orientations for each repeated fragment: fwd (forward), rev (reverse complement), or reverse (reversed without complementing)
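To make the three orientation modes concrete, the following Python sketch applies repeat and rc_pattern to an already-built fragment. The helper names orient and apply_repeat are hypothetical. The sketch reuses the same fragment for every repeat, which is a simplification; the simulator may well generate a fresh fragment for each copy.

```python
from typing import List, Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def orient(seq: str, labels: List[str], mode: str) -> Tuple[str, List[str]]:
    """Return the fragment in the requested orientation, keeping labels aligned."""
    if mode == "fwd":
        return seq, list(labels)
    if mode == "rev":
        # Reverse complement: complement each base, then reverse bases and labels.
        return seq.translate(_COMPLEMENT)[::-1], labels[::-1]
    if mode == "reverse":
        # Reversed without complementing.
        return seq[::-1], labels[::-1]
    raise ValueError(f"unknown orientation: {mode}")


def apply_repeat(seq: str, labels: List[str], repeat: int = 1,
                 rc_pattern: Optional[List[str]] = None) -> Tuple[str, List[str]]:
    """Concatenate `repeat` copies of a fragment, one orientation per copy."""
    modes = rc_pattern if rc_pattern is not None else ["fwd"] * repeat
    if len(modes) != repeat:
        raise ValueError("rc_pattern needs one entry per repeated fragment")
    out_seq: List[str] = []
    out_labels: List[str] = []
    for mode in modes:
        s, l = orient(seq, labels, mode)
        out_seq.append(s)
        out_labels.extend(l)
    return "".join(out_seq), out_labels


concat_seq, concat_labels = apply_repeat(
    "AACG", ["x", "x", "y", "y"], repeat=2, rc_pattern=["fwd", "rev"]
)
```

In every mode the labels are reordered together with the bases, so each base keeps the label of the segment it came from.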
    training_structures:
      # 50% full-length, well-formed reads
      full_read:
        order: [adapter_5p, CBC, UMI, cDNA, polyA, adapter_3p]
        proportion: 0.50
      # 10% concatenated reads (fwd + rev-comp), a common nanopore artifact
      concat_fwd_rev:
        order: [adapter_5p, CBC, UMI, cDNA, polyA, adapter_3p]
        repeat: 2
        rc_pattern: [fwd, rev]
        proportion: 0.10
      # 10% truncated at 5' end (missing adapter)
      truncated_no_5p:
        order: [CBC, UMI, cDNA, polyA, adapter_3p]
        proportion: 0.10
      # 10% truncated at 3' end (missing adapter)
      truncated_no_3p:
        order: [adapter_5p, CBC, UMI, cDNA, polyA]
        proportion: 0.10
      # 10% reads with stuttered/repeated barcode blocks
      stutter_cbc:
        order: [adapter_5p, CBC, CBC, CBC, UMI, cDNA, polyA, adapter_3p]
        proportion: 0.10
      # 10% reads with repeated 3' adapter
      stutter_3p:
        order: [adapter_5p, CBC, UMI, cDNA, polyA, adapter_3p, adapter_3p, adapter_3p]
        proportion: 0.10

You can also override the orientation of individual elements within a structure using :fwd, :rev, or :reverse suffixes. For example, a hairpin-like structure where the second half is reversed:
    hairpin:
      order: [adapter_5p, CBC, UMI, polyT, cDNA, polyT:rev, UMI, CBC, adapter_5p:rev]
      proportion: 0.05

Step 4: Create a Model Entry
Once your library is defined, create a model entry that references it. The simplest case inherits everything from the library without overrides:
    models:
      my_protocol_001:
        library: my_protocol

If you want to adjust specific fields for a particular model variant (e.g., change training structure proportions or add a new structure), you can override them in the model entry. Overrides use shallow merging for training structures (new keys are added, existing keys are updated, and keys set to null are removed):
    models:
      my_protocol_002:
        library: my_protocol
        training_structures:
          truncated_no_5p: null      # remove this structure
          full_read:
            proportion: 0.70         # increase proportion of full reads
          new_artifact:              # add a new structure
            order: [adapter_5p, CBC, UMI, cDNA, polyA, adapter_3p]
            repeat: 3
            rc_pattern: [fwd, rev, fwd]
            proportion: 0.10

cDNA Generation
The cDNA segment can be generated in two ways:
- From a transcriptome FASTA (--transcriptome): fragments are sampled from real transcript sequences. This is the recommended approach because it exposes the model to realistic cDNA properties, including internal polyA/T runs and low-complexity regions. These internal homopolymers can be confused with the protocol-introduced polyA/T tails, so training with real cDNA teaches the model to resolve polyA/T boundaries using surrounding context rather than homopolymer presence alone.
- Random sequences (default): if no transcriptome is provided, cDNA is generated as random DNA. This works for initial testing but may produce a model that struggles with internal homopolymers in real data.
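The two cDNA sources can be pictured with a short Python sketch. The function sample_cdna is hypothetical, transcripts stands in for sequences already loaded from a FASTA file, and falling back to random DNA when no transcript is long enough is an assumption made here for simplicity rather than documented simulator behavior.

```python
import random
from typing import Optional, Sequence


def sample_cdna(length: int, rng: random.Random,
                transcripts: Optional[Sequence[str]] = None) -> str:
    """Return a cDNA fragment of `length` bases from transcripts, else random DNA."""
    if transcripts:
        # Only transcripts at least as long as the requested fragment qualify.
        candidates = [t for t in transcripts if len(t) >= length]
        if candidates:
            transcript = rng.choice(candidates)
            start = rng.randint(0, len(transcript) - length)
            return transcript[start:start + length]
    # No usable transcript: fall back to uniformly random bases (assumption).
    return "".join(rng.choice("ACGT") for _ in range(length))


rng = random.Random(0)
toy_transcripts = ["ACGTACGTAAAAAAAACCGT"]  # made-up sequence with an internal polyA run
from_transcript = sample_cdna(8, rng, toy_transcripts)
from_random = sample_cdna(8, rng)
```

A fragment drawn from the toy transcript can contain the internal A run, which is the kind of sequence that random DNA rarely produces.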
cDNA Length Considerations
cDNA length is sampled uniformly within the range [--min-cdna, --max-cdna]. This range affects how well the model learns segment boundaries:
- Shorter cDNA means the fixed-length elements (adapters, barcodes, UMIs) make up a larger fraction of each read. This gives the model more boundary examples per read, making it easier to learn transitions — but the model sees less long-range sequence context.
- Longer cDNA increases the variety of read lengths and improves generalization. However, if reads become very long relative to the fixed segments, the model spends most of its capacity on cDNA interior and may become less precise at segment boundaries.
As a general guideline, choose a cDNA length range that reflects the expected read lengths in your real data, while keeping it short enough that boundary regions remain well-represented in training.
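A quick calculation shows how fast the fixed elements shrink as a share of the read. The sketch below uses the example library from this document (two 13 bp adapters, a 16 bp barcode, a 12 bp UMI, so 54 fixed bases) and a hypothetical 20 bp polyA tail; fixed_fraction is an illustrative helper, not part of the tool.

```python
def fixed_fraction(fixed_bases: int, cdna_len: int, tail_len: int = 0) -> float:
    """Fraction of a read occupied by fixed-length elements."""
    return fixed_bases / (fixed_bases + cdna_len + tail_len)


# Example library: adapter_5p (13) + CBC (16) + UMI (12) + adapter_3p (13).
FIXED_BASES = 13 + 16 + 12 + 13

for cdna_len in (100, 500, 2000):
    share = fixed_fraction(FIXED_BASES, cdna_len, tail_len=20)
    print(f"cDNA {cdna_len:>4} bp: fixed elements are {share:.1%} of the read")
```

Under these assumptions the fixed elements make up about 31% of the read at 100 bp of cDNA, about 9% at 500 bp, and under 3% at 2000 bp.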
Stage 2: Sequencing Error Modeling
Simulating Sequencing Errors
Real sequencing data contains errors — bases are substituted, inserted, or deleted during the sequencing process. If the model were trained only on clean, error-free reads, it would struggle with real data. The simulator addresses this by introducing realistic errors into the synthetic reads before training, so the model learns to annotate reads correctly despite noise.
Error Types
Three types of base-level errors are applied:
- Mismatches (--mismatch-rate): a base is replaced with a different base. Default: 0.05 (5%).
- Insertions (--insertion-rate): extra bases are added after a position, up to a maximum of --max-insertions per position. Default: 0.05 (5%), max 1 insertion per position.
- Deletions (--deletion-rate): a base is removed from the sequence. Default: 0.06 (6%).
These rates are applied per base position. When an insertion or deletion occurs, the per-base label vector is updated accordingly (labels are inserted or removed to stay aligned with the modified sequence).
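The label bookkeeping described above can be sketched in Python as follows. The default rates mirror the documented flag defaults, but the function add_errors, the order in which the three error types are tested, and the choice to give inserted bases the label of the preceding base are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

BASES = "ACGT"


def add_errors(seq: str, labels: List[str], rng: random.Random,
               mismatch_rate: float = 0.05, insertion_rate: float = 0.05,
               deletion_rate: float = 0.06,
               max_insertions: int = 1) -> Tuple[str, List[str]]:
    """Apply per-base errors while keeping the label list aligned with the sequence."""
    out_seq: List[str] = []
    out_labels: List[str] = []
    for base, label in zip(seq, labels):
        if rng.random() < deletion_rate:
            continue  # deletion: drop the base together with its label
        if rng.random() < mismatch_rate:
            # mismatch: substitute a different base, label unchanged
            base = rng.choice([b for b in BASES if b != base])
        out_seq.append(base)
        out_labels.append(label)
        if max_insertions > 0 and rng.random() < insertion_rate:
            # insertion: extra bases inherit the current segment's label (assumption)
            for _ in range(rng.randint(1, max_insertions)):
                out_seq.append(rng.choice(BASES))
                out_labels.append(label)
    return "".join(out_seq), out_labels


rng = random.Random(42)
clean = "ACGT" * 25
noisy_seq, noisy_labels = add_errors(clean, ["cDNA"] * len(clean), rng)
```

Whatever the rates, the returned sequence and label list have equal length, so the noisy read can be used directly as a training example.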
Segment-Aware Error Rates
Not all segments receive the same error treatment. PolyA/T tails use a separate, typically lower error rate (--polyt-error-rate, default 0.02) that reflects the distinct noise characteristics of homopolymer regions in nanopore sequencing. Other segments use the global rates listed above.
When to adjust error rates: Match the error rates to your sequencing platform and chemistry. For example, if you are using a newer ONT chemistry with lower error rates, you may want to reduce --mismatch-rate, --insertion-rate, and --deletion-rate accordingly. If your data has particularly noisy homopolymer regions, increase --polyt-error-rate.
Strand Orientation Augmentation
Reads can come from either DNA strand. By default (--rc), the simulator adds the reverse complement of each read (with its labels reversed to match) alongside the original, doubling the training set size. This teaches the model to handle both orientations and is particularly useful for datasets where strand assignment is uncertain or mixed.
If your protocol always produces reads in a known orientation, you can disable this with --no-rc, though keeping it enabled generally improves robustness.
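The augmentation amounts to reverse complementing the bases and reversing the label list in step. A minimal Python sketch, with hypothetical helper names reverse_complement and augment_with_rc:

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

LabeledRead = Tuple[str, List[str]]


def reverse_complement(seq: str, labels: List[str]) -> LabeledRead:
    """Reverse complement the sequence and reverse the labels to match."""
    return seq.translate(_COMPLEMENT)[::-1], labels[::-1]


def augment_with_rc(reads: List[LabeledRead]) -> List[LabeledRead]:
    """Return each read followed by its reverse complement (doubles the set)."""
    augmented: List[LabeledRead] = []
    for seq, labels in reads:
        augmented.append((seq, labels))
        augmented.append(reverse_complement(seq, labels))
    return augmented


training_set = augment_with_rc([("AACG", ["CBC", "CBC", "UMI", "UMI"])])
```

After augmentation, the base that was first in the forward read is last in the reverse-complemented copy and still carries its original segment label.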
Artifact Simulation
Real sequencing datasets inevitably contain malformed or artifactual reads — concatenated reads, truncated reads, reads with repeated adapter blocks, and other structural anomalies. If the model only sees well-formed reads during training, it may confidently mislabel these artifacts instead of flagging them as invalid.
To address this, the simulator generates a configurable fraction of structurally invalid reads (--invalid-fraction, default 0.30). Supported artifact modes include:
- Read concatenation (two reads joined together)
- Repeated adapter blocks at the 5’ or 3’ end
- Truncations from either end
Training on a mix of valid and invalid reads improves the model’s ability to recognize and flag malformed inputs during inference.