Background on Model Training
Overview
In long-read RNA-seq data, base-level annotations such as adapters, barcodes, UMIs, cDNA, and polyA/T tails can be inferred either through a set of heuristics or through a trained deep learning model. tranquillyzer takes the latter, fully supervised approach, and therefore requires labeled data as input for training.
Labeled data can come either from annotations generated by an upstream annotation tool or from simulated data, which inherently contains labels. Because upstream annotation tools typically rely on heuristics with their own embedded assumptions and biases, training a deep learning model on their output is not ideal: the model tends to learn to reproduce the tool's annotations rather than the true structure of the reads.
tranquillyzer addresses this limitation by training exclusively on synthetically generated reads with known ground-truth base-level labels. Reads are simulated according to explicit library specifications and sequencing error models, ensuring the true identity of every base is known at generation time. This enables (i) fully supervised learning with exact labels and (ii) objective evaluation without reliance on external annotations. As a result, model training is decoupled from upstream heuristics, which enables easy adaptation to new protocols and read structures.
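Conceptually, a simulated read is a concatenation of segments whose per-base labels are recorded at generation time, so the ground truth is exact by construction. A minimal sketch of this idea follows; the segment names, lengths, and adapter sequence are illustrative and do not reflect tranquillyzer's actual specification format:

```python
import random

random.seed(0)

# Hypothetical library structure: each entry is a segment name plus either a
# fixed sequence or a length to sample randomly. Purely illustrative.
STRUCTURE = [
    ("adapter", "AGATCGGAAGAGC"),  # fixed adapter sequence
    ("barcode", 16),               # random 16-mer cell barcode
    ("UMI", 12),                   # random 12-mer UMI
    ("cDNA", 300),                 # random insert standing in for a transcript
    ("polyA", 30),                 # homopolymer tail
]

def simulate_read(structure):
    """Generate one read plus a parallel list of per-base ground-truth labels."""
    bases, labels = [], []
    for name, spec in structure:
        if isinstance(spec, str):
            seg = spec                                   # fixed sequence
        elif name == "polyA":
            seg = "A" * spec                             # homopolymer
        else:
            seg = "".join(random.choice("ACGT") for _ in range(spec))
        bases.append(seg)
        labels.extend([name] * len(seg))                 # one label per base
    return "".join(bases), labels

read, labels = simulate_read(STRUCTURE)
```

Because every base is emitted together with its label, no external annotation step is needed; the label sequence is the training target.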
Training Workflow
The training procedure consists of four conceptual stages:
1. Simulation of reads based on a provided structure
2. Introduction of sequencing errors into the simulated reads
3. Supervised neural network training for base-level annotation (train-model)
4. Validation on an independently simulated dataset (train-model)
Stages (1) and (2) are performed by the simulate-data command, while stages (3) and (4) are handled by train-model. The following tutorial lays out the intuition behind each of these stages and how they may impact the efficacy of the final model.
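The intuition behind stage (2) is that errors are injected into a simulated read while every surviving or inserted base retains a ground-truth label, so supervision stays exact even on noisy reads. A minimal sketch, with error rates and the label-inheritance rule chosen purely for illustration (they do not describe simulate-data's actual error model):

```python
import random

random.seed(0)

def add_errors(read, labels, sub=0.05, ins=0.02, dele=0.03):
    """Apply substitutions, insertions, and deletions to a labeled read.

    Deleted bases are dropped along with their labels; inserted bases inherit
    the label of the preceding base, so the output read and label list stay
    the same length as each other.
    """
    out_bases, out_labels = [], []
    for base, label in zip(read, labels):
        r = random.random()
        if r < dele:
            continue                                     # deletion: drop base
        if r < dele + sub:
            base = random.choice([b for b in "ACGT" if b != base])  # substitution
        out_bases.append(base)
        out_labels.append(label)
        if random.random() < ins:                        # insertion after base
            out_bases.append(random.choice("ACGT"))
            out_labels.append(label)
    return "".join(out_bases), out_labels

noisy, noisy_labels = add_errors("A" * 100, ["polyA"] * 100)
```

The error rates chosen at this stage matter for model efficacy: they should resemble the error profile of the target sequencing platform, since the model will only be robust to artifacts it has seen during training.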