A machine learning model to determine the accuracy of variant calls in capture-based next generation sequencing

BMC Genomics

Table 1 Features used in the logistic regression model

Feature	Description	Value range: 5th–95th percentile (median)
DP	NGS read depth at the variant position.	78–433 (222)
AD	Number of reads that support the variant call.	25–393 (110)
AF	Fraction of reads that support the variant call, i.e. AD / DP.	0.13–0.56 (0.49)
GC @ 5, 20, 50	Fraction of GC content in the 5, 20, and 50 bases around the variant position.	0.18–0.73 (0.45); 0.29–0.69 (0.44); 0.30–0.68 (0.42)
MQ	Root Mean Square of the mapping quality of the call.	59.3–60 (60)
GQ	Genotype Quality of the call.	50–99 (99)
WHR	Weighted Homopolymer Rate in a window of 20 bases around the variant position: the sum of squares of the homopolymer lengths, divided by the number of homopolymers.	1.6–4.3 (2.4)
HPL-D	Distance to the longest homopolymer within 20 bases from the call position.	0–15 (5)
HPL-L	Length of the longest homopolymer within 20 bases from the call position.	2–6 (4)
QUAL	Quality score assigned by the GATK HaplotypeCaller to the call.	142–5448 (2564)
QD	QUAL, normalized by DP.	1.6–16.9 (11.3)
FS	Phred-scaled p-value using Fisher’s exact test, to detect strand bias.	0–9.2 (1.7)

Names and descriptions of features incorporated into the model. For each feature, the range (5th–95th percentile) and the median of values within the dataset are reported.
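The sequence-context features in Table 1 (GC, WHR, HPL-L, HPL-D) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions the table does not settle: the window is taken as `flank` bases on each side of a 0-based position, homopolymers are maximal runs of identical bases with runs of length 1 counted, runs are truncated at the window edge, HPL-D is 0 when the position lies inside the longest run, and ties for the longest run go to the run nearest the position. All function and parameter names are hypothetical.

```python
from itertools import groupby


def window(seq, pos, flank):
    """Return the bases within `flank` of 0-based index `pos`, plus the slice start."""
    start = max(0, pos - flank)
    end = min(len(seq), pos + flank + 1)
    return seq[start:end], start


def gc_fraction(seq, pos, flank):
    """Fraction of G/C bases in the window around `pos` (assumed window definition)."""
    sub, _ = window(seq.upper(), pos, flank)
    return sum(base in "GC" for base in sub) / len(sub)


def homopolymer_runs(seq):
    """Maximal runs of identical bases as (start, length); length-1 runs included."""
    runs = []
    start = 0
    for _, group in groupby(seq):
        length = len(list(group))
        runs.append((start, length))
        start += length
    return runs


def homopolymer_features(seq, pos, flank=20):
    """Compute WHR, HPL-L and HPL-D for the window around `pos`."""
    sub, offset = window(seq.upper(), pos, flank)
    runs = homopolymer_runs(sub)

    # WHR: sum of squared run lengths divided by the number of runs.
    whr = sum(length * length for _, length in runs) / len(runs)

    def distance(run):
        # Distance from `pos` to the nearest base of the run; 0 if inside it.
        first = offset + run[0]
        last = first + run[1] - 1
        if first <= pos <= last:
            return 0
        return min(abs(pos - first), abs(pos - last))

    # Longest run; ties go to the run nearest `pos` (an assumption).
    longest = min(runs, key=lambda run: (-run[1], distance(run)))
    return {"WHR": whr, "HPL-L": longest[1], "HPL-D": distance(longest)}
```

For example, `homopolymer_features("ACGTAAAAACGT", 2)` sees seven runs of length 1 and one run of length 5, so WHR is (7 + 25) / 8 = 4.0, HPL-L is 5, and HPL-D is 2. The read-level features AF and QD need no sequence context: per the table, they are AD / DP and QUAL / DP respectively.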

ISSN: 1471-2164