TY - JOUR
T1 - Reducing noise and stutter in short tandem repeat loci with unique molecular identifiers
AU - Woerner, August E.
AU - Mandape, Sammed
AU - King, Jonathan L.
AU - Muenzler, Melissa
AU - Crysup, Benjamin
AU - Budowle, Bruce
N1 - Funding Information:
This work was supported in part by award no. 2018-DU-BX-0177, awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice. The opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect those of the U.S. Department of Justice.
Publisher Copyright:
© 2020 Elsevier B.V.
PY - 2021/3
Y1 - 2021/3
N2 - Unique molecular identifiers (UMIs) are a promising approach to contend with errors generated during PCR and massively parallel sequencing (MPS). With UMI technology, random molecular barcodes are ligated to template DNA molecules prior to PCR, allowing PCR and sequencing error to be tracked and corrected bioinformatically. UMIs have the potential to be particularly informative for the interpretation of short tandem repeats (STRs). Traditional MPS approaches may simply lead to the observation of alleles that are consistent with the hypotheses of stutter, while with UMIs stutter products bioinformatically may be re-associated with their parental alleles and subsequently removed. Herein, a bioinformatics pipeline named strumi is described that is designed for the analysis of STRs that are tagged with UMIs. Unlike other tools, strumi is an alignment-free machine learning driven algorithm that clusters individual MPS reads into UMI families, infers consensus super-reads that represent each family and provides an estimate the resulting haplotype's accuracy. Super-reads, in turn, approximate independent measurements not of the PCR products, but of the original template molecules, both in terms of quantity and sequence identity. Provisional assessments show that naïve threshold-based approaches generate super-reads that are accurate (∼97 % haplotype accuracy, compared to ∼78 % when UMIs are not used), and the application of a more nuanced machine learning approach increases the accuracy to ∼99.5 % depending on the level of certainty desired. With these features, UMIs may greatly simplify probabilistic genotyping systems and reduce uncertainty. However, the ability to interpret alleles at trace levels also permits the interpretation, characterization and quantification of contamination as well as somatic variation (including somatic stutter), which may present newfound challenges.
AB - Unique molecular identifiers (UMIs) are a promising approach to contend with errors generated during PCR and massively parallel sequencing (MPS). With UMI technology, random molecular barcodes are ligated to template DNA molecules prior to PCR, allowing PCR and sequencing error to be tracked and corrected bioinformatically. UMIs have the potential to be particularly informative for the interpretation of short tandem repeats (STRs). Traditional MPS approaches may simply lead to the observation of alleles that are consistent with the hypotheses of stutter, while with UMIs stutter products bioinformatically may be re-associated with their parental alleles and subsequently removed. Herein, a bioinformatics pipeline named strumi is described that is designed for the analysis of STRs that are tagged with UMIs. Unlike other tools, strumi is an alignment-free machine learning driven algorithm that clusters individual MPS reads into UMI families, infers consensus super-reads that represent each family and provides an estimate the resulting haplotype's accuracy. Super-reads, in turn, approximate independent measurements not of the PCR products, but of the original template molecules, both in terms of quantity and sequence identity. Provisional assessments show that naïve threshold-based approaches generate super-reads that are accurate (∼97 % haplotype accuracy, compared to ∼78 % when UMIs are not used), and the application of a more nuanced machine learning approach increases the accuracy to ∼99.5 % depending on the level of certainty desired. With these features, UMIs may greatly simplify probabilistic genotyping systems and reduce uncertainty. However, the ability to interpret alleles at trace levels also permits the interpretation, characterization and quantification of contamination as well as somatic variation (including somatic stutter), which may present newfound challenges.
KW - Molecular barcodes
KW - Probabilistic genotyping
KW - Stutter
KW - Unique molecular identifiers
UR - http://www.scopus.com/inward/record.url?scp=85098960233&partnerID=8YFLogxK
U2 - 10.1016/j.fsigen.2020.102459
DO - 10.1016/j.fsigen.2020.102459
M3 - Article
C2 - 33429137
AN - SCOPUS:85098960233
SN - 1872-4973
VL - 51
JO - Forensic Science International: Genetics
JF - Forensic Science International: Genetics
M1 - 102459
ER -