TY - JOUR
T1 - Techniques for estimating genetically variable peptides and semi-continuous likelihoods from massively parallel sequencing data
AU - Woerner, August E.
AU - Crysup, Benjamin
AU - Hewitt, F. Curtis
AU - Gardner, Myles W.
AU - Freitas, Michael A.
AU - Budowle, Bruce
N1 - Funding Information:
This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA) (https://www.iarpa.gov), via contract number 2018–18041000003. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. The funders reviewed and approved the manuscript for publication but had no role in study design, data collection and analysis, or preparation of the manuscript. Award received by Signature Science, LLC (co-PIs FCH and MWG). Sub-award received by University of North Texas Health Science Center (AEW).
Publisher Copyright:
© 2022 Elsevier B.V.
PY - 2022/7
Y1 - 2022/7
N2 - Forensic genetic investigations typically rely on analysis of DNA for attribution purposes. There are times, however, when the amount and/or the quality of the DNA is limited, and thus little or no information can be obtained regarding the source of the sample. An alternative biochemical target that also contains genetic signatures is protein. One class of genetic signatures is protein polymorphisms that are a direct consequence of simple/single/short nucleotide polymorphisms (SNPs) in DNA. However, to interpret protein polymorphisms in a forensic context, certain complexities must be understood and addressed. These complexities include: 1) SNPs can generate 0, 1, or arbitrarily many polymorphisms in a polypeptide; and 2) as an object of expression that is modulated by alleles, genes and interactions with the environment, proteins may be present or absent in a given sample. To address these issues, a novel approach was taken to generate the expected protein alleles in a reference sample based on whole genome (or exome) sequence data and assess the significance of the evidence using a haplotype-based semi-continuous likelihood algorithm that leverages whole proteome data. Converting the genomic information into the proteomic information allows for the zero-to-many relationship between SNPs and genetically variable peptides (GVPs) to be abstracted away. When viewed as a haplotype, many GVPs that correspond to the same SNP are equivalent to many SNPs in perfect linkage disequilibrium (LD). As long as the likelihood formulation correctly accounts for LD, the correspondence between the SNP and the proteome can be safely neglected. Tests were performed on simulated samples, including single-source and two-person mixtures, and the power of using a classical semi-continuous likelihood versus one that has been adapted to neglect drop-out was compared. Additionally, summary statistics and a rudimentary set of decision guidelines were introduced to help identify mixtures from protein data.
AB - Forensic genetic investigations typically rely on analysis of DNA for attribution purposes. There are times, however, when the amount and/or the quality of the DNA is limited, and thus little or no information can be obtained regarding the source of the sample. An alternative biochemical target that also contains genetic signatures is protein. One class of genetic signatures is protein polymorphisms that are a direct consequence of simple/single/short nucleotide polymorphisms (SNPs) in DNA. However, to interpret protein polymorphisms in a forensic context, certain complexities must be understood and addressed. These complexities include: 1) SNPs can generate 0, 1, or arbitrarily many polymorphisms in a polypeptide; and 2) as an object of expression that is modulated by alleles, genes and interactions with the environment, proteins may be present or absent in a given sample. To address these issues, a novel approach was taken to generate the expected protein alleles in a reference sample based on whole genome (or exome) sequence data and assess the significance of the evidence using a haplotype-based semi-continuous likelihood algorithm that leverages whole proteome data. Converting the genomic information into the proteomic information allows for the zero-to-many relationship between SNPs and genetically variable peptides (GVPs) to be abstracted away. When viewed as a haplotype, many GVPs that correspond to the same SNP are equivalent to many SNPs in perfect linkage disequilibrium (LD). As long as the likelihood formulation correctly accounts for LD, the correspondence between the SNP and the proteome can be safely neglected. Tests were performed on simulated samples, including single-source and two-person mixtures, and the power of using a classical semi-continuous likelihood versus one that has been adapted to neglect drop-out was compared. Additionally, summary statistics and a rudimentary set of decision guidelines were introduced to help identify mixtures from protein data.
KW - Genetically variable peptides
KW - Massively parallel sequencing
KW - Mixtures
KW - Probabilistic genotyping
KW - Proteomics
UR - http://www.scopus.com/inward/record.url?scp=85129590394&partnerID=8YFLogxK
U2 - 10.1016/j.fsigen.2022.102719
DO - 10.1016/j.fsigen.2022.102719
M3 - Article
C2 - 35526505
AN - SCOPUS:85129590394
SN - 1872-4973
VL - 59
JO - Forensic Science International: Genetics
JF - Forensic Science International: Genetics
M1 - 102719
ER -