Genome-enabled prediction using probabilistic neural network classifiers

González-Camacho, Juan Manuel; Crossa, José; Pérez-Rodríguez, Paulino; Ornella, Leonardo; Gianola, Daniel

doi:10.1186/s12864-016-2553-1

Research article
Open access
Published: 09 March 2016

Genome-enabled prediction using probabilistic neural network classifiers

Juan Manuel González-Camacho¹,
José Crossa²,
Paulino Pérez-Rodríguez¹,
Leonardo Ornella³ &
…
Daniel Gianola⁴

BMC Genomics volume 17, Article number: 208 (2016) Cite this article

3639 Accesses
45 Citations
2 Altmetric
Metrics details

Abstract

Background

Multi-layer perceptron (MLP) and radial basis function neural networks (RBFNN) have been shown to be effective in genome-enabled prediction. Here, we evaluated and compared the classification performance of an MLP classifier versus that of a probabilistic neural network (PNN), to predict the probability of membership of one individual in a phenotypic class of interest, using genomic and phenotypic data as input variables. We used 16 maize and 17 wheat genomic and phenotypic datasets with different trait-environment combinations (sample sizes ranged from 290 to 300 individuals) with 1.4 k and 55 k SNP chips. Classifiers were tested using continuous traits that were categorized into three classes (upper, middle and lower) based on the empirical distribution of each trait, constructed on the basis of two percentiles (15–85 % and 30–70 %). We focused on the 15 and 30 % percentiles for the upper and lower classes for selecting the best individuals, as commonly done in genomic selection. Wheat datasets were also used with two classes. The criteria for assessing the predictive accuracy of the two classifiers were the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUCpr). Parameters of both classifiers were estimated by optimizing the AUC for a specific class of interest.

Results

The AUC and AUCpr criteria provided enough evidence to conclude that PNN was more accurate than MLP for assigning maize and wheat lines to the correct upper, middle or lower class for the complex traits analyzed. Results for the wheat datasets with continuous traits split into two and three classes showed that the performance of PNN with three classes was higher than with two classes when classifying individuals into the upper and lower (15 or 30 %) categories.

Conclusions

The PNN classifier outperformed the MLP classifier in all 33 (maize and wheat) datasets when using AUC and AUCpr for selecting individuals of a specific class. Use of PNN with Gaussian radial basis functions seems promising in genomic selection for identifying the best individuals. Categorizing continuous traits into three classes generally provided better classification than when using two classes, because classification accuracy improved when classes were balanced.

Background

Complex traits of economic importance in animal and plant breeding seem to be affected by many quantitative trait loci (QTL), each having a small effect, and are greatly influenced by the environment. Predicting these complex traits using information from dense molecular markers exploits linkage disequilibrium (LD) between molecular markers and QTL. Basically, genomic selection works by capturing realized relationships between individuals and, to an extent, by capturing the effects of QTL via their linkage or LD with markers. Genomic selection (GS) regression models use all available molecular marker and phenotypic data from an observed base (training population) to predict the genetic values of yet unphenotyped candidates for selection (testing population) whose marker genotypes are known.

There is a vast literature describing statistical methods that use different functional forms on markers for predicting genetic values, e.g., [1, 2], starting with the seminal work of [3], which proposed regressing phenotypes on all available markers using a Gaussian linear model with different prior distributions on marker effects. Several parametric and semi-parametric methods have been described and used thereafter for genome-enabled prediction in animals and plants [4–11].

The basic quantitative genetic model y _i = g _i + γ _i (i = 1, … n individuals) describes the i ^th response or phenotype (y _i) expressed as a deviation from some general mean (μ) as the sum of an unknown genetic value (g _i) plus a model residual γ_i. The unknown genetic value can be represented as a complex function of genotypes with a large number of genes. However, since the genes affecting a trait are unknown, this complex function can be approximated by a regression of phenotype on marker genotypes where a large number of markers {x _i1, …, x _ip} (x _ij is the number of copies of one of the two alleles observed in the i ^th individual at the j ^th marker) may be used as regressors for predicting the genetic value of the i ^th individual. Thus, for u(x _i) = u(x _i1, … x _ip), the basic model becomes y _i = u _i + γ _i, where γ_i includes errors due to unspecified environmental effects, imperfect linkage disequilibrium between markers and the QTL affecting the trait, and unaccounted gene × gene and gene × environment interactions.

In several applications, u(x _i) is a parametric linear regression with form u(x _i1, … x _ip) = ∑ ^p_j = 1 x _ij β _j,, where β_j is the substitution effect of the allele coded as ‘one’ at the j ^th marker. The linear regression function becomes y _i = ∑ ^p_j = 1 x _ij β _j + γ _i. The regression function u(x _i) can also be represented by semi-parametric models, such as reproducing kernel Hilbert space (RKHS) regressions or by different types of neural networks (NN) such as the multilayer perceptron or radial basis functions [5, 8, 11–14]. Several penalized linear regression models and Bayesian shrinkage estimation methods have been applied to genome-enabled prediction [1]. Similarly, regularized machine learning has been used for predicting complex traits [15]. Recently, two-layer feed-forward NN with backpropagation were implemented in various forms using German Fleckvieh and Holstein-Friesian bull data and high prediction accuracies were achieved [16]. Likewise, a multi-layer NN classifier was applied to study genetic diversity in simulated experiments [17].

Nonparametric classification models are a branch of supervised machine learning that has been successfully applied in several fields of knowledge, e.g., text mining, bioinformatics and genomics [18, 19]. Particularly in applied genomic breeding programs and depending on the trait under consideration, the objective of classification is to focus on candidates for selection contained in the upper or lower classes of the prediction space. A common classification problem arises when an input marker vector x _i ∈ ℝ ^p is to be assigned to one of S classes by a classifier. The classifier is trained using a set of training pairs (x _i, c _i), (i = 1, … n individuals), where c _i describes the class label (k) to which x _i belongs, (k = 1 … S), where S represents the number of classes. Usually, c _i is transformed into a vector c _i of dimension S × 1, with 1 in class k and 0 otherwise.

The multi-layer perceptron (MLP) classifier is a typical architecture of feed-forward NN with at least a hidden layer and an output layer, where both layers have nonlinear and differentiable transfer functions. The nonlinear transfer function in the hidden layer enables an NN to act as a universal approximation method. The training process of an MLP for each individual i, with input vector x _i and target class c _i, typically uses the error backpropagation learning algorithm [20]. This process requires a lot of computational time when the number of input variables is large.

The probabilistic neural network (PNN) was proposed by [21] and is widely used in pattern recognition and classification. PNN classifies an input vector x _i into a specific k class such that the specific class has the maximum probability of being a correct assignment. PNN provides an optimum pattern classifier that minimizes the expected risk of wrongly classifying an object, and is a very efficient (in terms of computational time) classification method. The PNN training algorithm is simpler and faster than that of the MLP approach because PNN parameters are estimated directly from the input data and an iterative procedure is not required. Further, PNN guarantees convergence to a Bayes classifier if enough training examples are provided [22]. Several classification methods such as support vector machines and random forests have been applied in GS [23–25]. However, despite the apparent advantages of PNN, no PNN classifiers have been applied in GS so far.

The objective of this research was to assess the performance of two NN classifiers, MLP and PNN (based on Gaussian kernels), to select individuals belonging to a specific class of interest (target class). In an applied GS context, the problem should be formulated according to whether the focus is on selecting individuals into the upper, middle or lower classes, depending on the trait under selection. Then the question is how many of the predicted individuals classified in the target class are actually observed in that class. The problem is posed as follows: given an input vector x _i of p markers for the i ^th individual, each individual i in the testing set must be classified in a class of interest of the phenotypic response. Classes were defined considering different percentiles of the target trait, specifically, 15 and 30 % for the upper and lower classes were analyzed.

Methods

This section has four parts: first the two datasets are described; second, the strategy for categorizing the datasets is explained; third, the multilayer perceptron neural network (MLP) and probabilistic neural network (PNN) are described, and finally, the criteria used to assess model accuracy for classifying the best individuals based on genomic information are described.

Maize datasets

The maize datasets include 16 trait-environment combinations measured on 300 tropical lines genotyped with 55,000 SNPs each; these datasets were previously used by [8]. Four datasets contain information on the complex trait grain yield (GY) evaluated under severe drought stress (GY-SS) and well-watered conditions (GY-WW), and in high yielding (GY-HI) and low yielding (GY-LO) environments. Another six datasets include information on days to anthesis or male flowering (MFL), on days to silking or female flowering (FFL), and the MFL to FFL interval (ASI) evaluated under severe drought stress (SS) and well-watered (WW) environments. The remaining six datasets contain information on gray leaf spot (GLS) resistance evaluated in six CIMMYT international trials (GLS-1 to GLS-6). The number of individuals and the type and number of markers are presented in Table 1; for further details, see [8].

Table 1 Maize datasets – three classes

Full size table

Wheat datasets

These datasets include 306 wheat lines from the CIMMYT Global Wheat Program (GWP) that were genotyped with 1717 Diversity Array Technology (DArT) markers generated by Triticarte Pty. Ltd. (Canberra, Australia; http://www.diversityarrays.com), which is a whole-genome profiling service laboratory. Two traits were analyzed, grain yield (GY) and days to heading (DTH), which were evaluated in different environments (year-drought stress-agronomic treatments). GY was measured in seven environments and DTH in ten environments. The number of individuals and the type and number of markers are presented in Table 2; for further details, see [11].

Table 2 Wheat datasets – three classes

Full size table

Transforming phenotypic responses into three or two classes

The continuous phenotypic responses y _i for each stratified random partition in the datasets were grouped into three classes (upper, middle and lower), based on 15–85 % and 30–70 % percentiles of the response of each trait analyzed. For example, for 15–85 % percentiles, the quantiles q _0.15 and q _0.85 were used to split y _i into three classes: y _i ∈ upper class, if y _i > q _0.85; y _i ∈ middle class, if q _0.15 < y _i ≤ q _0.85; and, y _i ∈ lower class; if y _i ≤ q _0.15 A similar rule was applied to split y_i into three classes with 30-70 % percentiles.

For the two species, the target classes were the upper 15 and 30 % classes (GY for maize and wheat); the middle 40 and 70 % classes (ASI for maize), and the lower 15 and 30 % classes (FFL, MFL, and GLS for maize and DTH for wheat).

Comparison of prediction accuracy of PNN based on two or three classes was performed only on the wheat datasets to simplify computations. Firstly, the phenotypic responses y _i for each stratified random partition of the wheat datasets were grouped into two classes from the datasets previously grouped into three classes. The upper 15 % of the binary class was defined by using the upper 15 % of the trichotomous classes, and the lower class was the sum of the middle and lower classes of the trichotomous classes; a similar strategy was applied for the lower 15 % of the binary class. The same random partitions (training, testing sets) were used when comparing PNN with two classes versus PNN with three classes. Partitions of the wheat datasets into two classes for GY and DTH are shown in Table 3.

Table 3 Wheat datasets – two classes

Full size table

Multilayer perceptron neural network (MLP) classifier

An MLP can be trained to classify items into S different disjoint classes. Each target class c _i is transformed into a target vector c _i of zeroes except for a 1 in element k, (k = 1, …, S) the class to be represented. We arranged a set of n input vectors x _i into a matrix X of dimension n × p. Then we arranged the n target vectors c _i into a matrix C of dimension S × n. The rows of X correspond to columns of C, individual-by-individual. Statistical learning is inferred from the data only, with no assumption about the joint distribution of inputs and outcomes. This gives MLP great flexibility for capturing complex patterns frequently found in plant breeding [26].

We begin by describing a standard MLP for a categorical response (PNN is introduced subsequently). MLP is an NN that can be thought of as a two-stage regression (e.g., [18]). In the first stage (hidden layer), M data-derived basis functions, {z _m} ^m = M_m = 1 are inferred; in the second stage (the output layer has S neurons, S classes), each neuron’s output is computed on the basis functions inferred in the hidden layer using a nonlinear transfer function (Fig. 1).

In the hidden layer, one data-derived predictor is inferred at each of M neurons. These data-derived predictors are computed by first inferring a score (u _mi), which is a linear combination of the input weights and the input markers plus a bias (intercept) term. Subsequently, this score is transformed using a nonlinear transfer function, φ(⋅), that is, z _mi = φ(w _mo + ∑ ^p_j = 1 w _mj x _ij), where w _mo is the bias term, and W _m = {W _mj} ^{m = M; j = p}_{m = 1; j = 1} is an input weight matrix. The transfer function maps from a score defined in the real line onto the interval [−1, 1] (e.g., a hyperbolic tangent sigmoid transfer function is $ tansig(u)=\frac{2}{\left(\left(1+ exp\left(-2\right)\right)-1\right)} $. Subsequently, in the output layer, phenotypes are regressed on the data-derived features, {z _mi} ^{i = n; m = M}_{i = 1; m = 1} , according to the model E(y _ki|parameters) = v _ki = w _ko + ∑ ^M_m = 1 w _mi, where φ _k(v _ki) and φ _k(.) is the tansig transfer function, k = 1, …, S. Finally, the predicted score vector ĉ _i = {y _ki} ^k = S_k = 1 , and the predicted class ĉ _i is ĉ(x _i) = arg max _{1 ≤ k ≤ S}(ĉ _k) are obtained.

Training of an MLP (given a fixed number of transfer functions in the hidden layer) involves estimating all of the classifier’s parameters by means of an iterative backpropagation error algorithm, based on the scaled conjugate gradient algorithm described by [27]. To improve the generalization capacity of MLP, an early stopping ensemble strategy can be applied [28]; early stopping effects non-Bayesian shrinkage of coefficients. In this approach, we divided the available data into three subsets. The first subset is the training set, used for computing the gradient and updating network weights and biases. The second subset is the validation set, where the error in the set is monitored during the training process. The validation error normally decreases during the initial training phase, as does the training set error. However, when the network begins to over-fit the data, the error in the validation set typically begins to rise. When the validation error increases at some point in the iteration, the training is stopped, and the weights and biases at the minimum validation error are returned. The third subset is used as testing set.

The performance function to optimize an MLP is usually the mean squared error (mse), which is the average squared error between the predicted classes Ĉ and the target classes C. Ĉ is also a matrix of dimension S × n, where each column contains values in the [0,1] range. The index of the largest element in the column indicates which of the S classes that vector represents.

Probabilistic neural network (PNN) classifier

The architecture of a PNN is similar to that of a radial basis function NN [8]; a PNN has two layers, the pattern layer and the summation-output layer, as illustrated in Fig. 2. The pattern layer computes distances (using a Gaussian radial basis function (RBF)) between the input vector x _i and the training (centers) input vectors c _m ∈ ℝ ^p; m = 1, …, M neurons (M = n individuals of the input data set) and returns an output vector u _i ∈ ℝ ^M whose elements u _mi = b _m‖x _i − c _m‖, where $ {b}_m=\frac{\sqrt{\left(-\mathrm{In}(0.5)\right)}}{h} $ is a weight and h is the width of the Gaussian RBF, indicating how close the input vector x _i is to c _m [22]. Then each u _mi is transformed into a vector z _i ∈ ℝ ^M, whose elements are defined by the Gaussian operation z _mi = exp(−u ²_mi ). The summation-output layer sums these contributions for each class k, that is, v _ki = ∑ ^M_m = 1 w _km z _mi, where w _km are weights obtained from the target classes C matrix, to generate a vector of probabilities ĉ _i = softmax(v _i) of dimension S × 1 as its net output, where the softmax transfer function σ(.) is given by

$$ \sigma \left({\mathbf{v}}_i\right)=\frac{exp\left({v}_k\right)}{{\displaystyle {\sum}_{j=1}^S exp\left({v}_j\right)}};\kern0.5em for\kern0.5em k=1,\dots, S\kern0.5em \mathrm{classes} $$

where v _i is a target vector of dimension S × 1 with elements v _k. The softmax transfer function on the summation-output layer transforms the outputs of processing units for each k class in the interval [0,1].

The pattern layer of a PNN is a neural representation of a Bayes classifier, where the class density functions are approximated using a windows Parzen estimator [29]. The standard training method for a PNN (given a value of h for the Gaussian RBFs) requires a single pass over all the x _i markers of the training set. For this reason, PNN requires short training time and produce as output (ĉ _i), posterior probabilities of class membership.

Criteria for assessing classifier prediction accuracy

The prediction accuracy of MLP and PNN was evaluated using a cross-validation procedure. For each data set, 50 random partitions stratified by classes were generated. Each partition randomly assigned 90 % of the data to the training set and the remaining 10 % to the testing set. We used stratified sampling by class to make sure there were no empty classes in the training and testing sets. For each data set, partition index matrices PINDX(n, 50) were generated, where n is the number of individuals in each data set analyzed; PINDX(i,j) has a value equal to 1 (training) or 2 (testing) for the i ^th individual in the j ^th partition. Each model was trained and evaluated with the same pair of training and testing sets of each partition. For MLP the training sets defined in PINDX(n, 50) were subdivided by stratified random sampling by class into two disjoint sets, one for training (88 %) and another for validation (12 %); this was done with the objective of applying the training early stopping ensemble strategy[28]. For each random partition, ten replications (random seeds) were used to evaluate the performance of MLP.

Two performance measures for assessing prediction accuracy of the two classifiers (averaged across 50 random partitions) were used: (1) the area under the receiver operating characteristic curve (AUC), and (2) the area under the precision-recall curve (AUCpr), or average precision.

For GY in both species, models were trained to maximize the AUC of the upper class; for FL, GLS, and DTH, models were trained to maximize the AUC of the lower class; for ASI, the target value is zero (perfect synchrony between anthesis and silking interval), models were trained to maximize the AUC of the middle class.

The area under the receiver operating characteristic curve (AUC)

Rather than computing the recall (R) [also called sensitivity or true positive rate (tpr)] and the false positive rate (fpr) for a fixed threshold τ, a set of thresholds was defined and then tpr vs fpr(R vs f pr) was plotted as an implicit function of τ; this is called an ROC curve.

The recall or sensitivity is $ R=\frac{tp}{tp+fn}, $ where tp is the number of positives predicted as positives and fn is the number of positives predicted as negatives. This measure evaluates the number of individuals that are correctly classified as a proportion of all the observed individuals in the target class. $ fpr=\frac{fp}{fp+tn}, $ where fp is the number of negatives predicted as positives and tn is the number of negatives predicted as negatives (Table 4).

Table 4 Description of a confusion matrix for binary classes with observed values and classifier predicted values

Full size table

To compare the performance of classifiers, the receiver operating characteristic curve (ROC) has to be reduced to a single scalar value representing the expected performance. A common method is to compute the area under the ROC curve (AUC), which produces a value between 0 and 1. If AUC(a) > AUC(b), then classifier a has a better average performance than classifier b. AUC can be interpreted as the probability that a randomly chosen individual is ranked as more likely to be of the target class than a randomly chosen individual of another class. The ROC graphs are a useful tool for visualizing the performance of the classifiers because they provide a richer measure of classification performance than other scalar measures [30].

The area under the precision-recall curve (AUCpr)

A precision-recall curve is a plot of precision (P) vs R for a set of thresholds τ. $ P=\frac{tp}{tp+fp} $ is defined as the fraction of positives predicted as positives with respect to all predicted positives (Table 4). Thus P measures the fraction of the predicted positives that is really positive, while R measures the fraction of the predictive positives that was in fact detected. This curve is summarized as a single number using the average precision (AUCpr), which approximates the area under the precision-recall curve [31]. This measure is recommended for classes of different sizes; upper or lower classes of 15 % had a lower number of individuals than the corresponding upper or lower classes of 85 %. AUC is commonly used to present results of binary decision problems in machine learning algorithms. However, when dealing with unbalanced classes, AUCpr curves give a more informative idea of a machine learning algorithm than AUC [32, 33].

Software

Scripts for fitting models and performing cross-validations were written in MATLAB r2010b. All the analyses were performed in a Linux Workstation.

Results and discussion

Results of the value of AUC for classifiers MLP and PNN in each trait-environment combination are depicted in histograms in Fig. 3a–d (maize datasets) and Fig. 4a–b (wheat datasets) for the traits selected in the upper and lower (15 and 30 %) and middle (40 and 70 %) classes, respectively.

The first clear trend using the AUC criterion is that PNN outperformed MLP for most of the individuals in the upper, middle and lower classes. Depending on the trait-environment combination, the PNN30% or PNN15% upper and lower and the PNN40% and PNN70% middle were usually larger than those of MLP; the only exception was PNN15% for GY-SS (Fig. 3a), which was lower than MLP15% (Additional file 1: Table S1).

We also describe AUC and AUCpr results of comparing the performance of PNN for wheat trait-environment combinations using two or three classes.

Comparing classifiers to select individuals in the upper, middle and lower classes in the maize datasets

Upper classes (15 and 30 %)

Results of the prediction accuracy criterion AUC of the two classifiers MLP and PNN for traits selected in the 15 and 30 % upper classes for GY under the different environmental conditions are reported in Fig. 3a. PNN was more accurate than MLP in the upper 30 % class, for assigning individuals based on GY under stress conditions. Additional file 1: Table S1 shows the results based on the AUC criterion for the upper, middle and lower classes.

When using the AUCpr criterion, which relates P and R for the upper class, PNN outperformed MLP, which is clearly shown in Table 5 (as shown for the AUC criterion in Fig. 3a). Also, AUCpr for PNN30% was always better than PNN15% for all the traits in the upper class. These results lead to the conclusion that PNN was more accurate than MLP for assigning maize lines to the correct upper class for GY under WW and SS conditions. Also under the AUC criterion, PNN30% was similar to PNN15% for GY-HI and GY-WW, but better than PNN15% for GY-LO and GY-SS. Under the criterion AUCpr, PNN30% was always better than PNN15% for all GY.

Table 5 Maize datasets

Full size table

Middle classes (40 and 70 %)

Concerning the AUC criterion for the middle class based on ASI-SS and ASI-WW, Fig. 3b shows a slight superiority of PNN over MLP for both 40 and 70 %; however, PNN40% was, on average, slightly better than PNN70%. On the other hand, results using the AUCpr criterion also show a slight superiority of PNN over MLP for MLP40% for ASI-SS and MLP70% for both ASI-SS and ASI-WW (Table 5). For this middle class, the AUCpr results favored PNN as a better predictor than MLP for assigning maize lines to the correct middle class.

Lower classes (15 and 30 %)

For the lower class, Fig. 3c for FL and Fig. 3d for GLS (both traits in different environments) show a clear superiority in terms of the AUC criterion of PNN over MLP for both lower classes. The better prediction accuracy of classifier PNN is reflected in AUCpr prediction accuracy, where PNN outperformed MLP for both lower classes, and PNN30% was higher than PNN15% for all 10 traits (Table 5).

Comparing classifiers for selecting individuals in the upper and lower classes in the wheat datasets

Upper classes (15 and 30 %)

Results of AUC for GY that were selected in the upper 15 and 30 % classes are presented in Fig. 4a and in Additional file 2: Table S2. PNN outperformed MLP for both upper classes for all GY. PNN30% gave better prediction accuracy than PNN15% in most traits, with the exception of GY-3 and GY-6, where PNN15% had better prediction than PNN30%.

Criterion AUCpr showed that PNN was better than MLP for both upper classes; PNN appeared as the best class predictive models in all GY traits. Furthermore, under the AUCpr criterion, PNN30% was higher than PNN15% in all wheat GY traits (Table 6). In summary, results of the upper 15 and 30 % classes show that PNN was a more accurate predictor than MLP when using the AUC and AUCpr criteria.

Table 6 Wheat datasets

Full size table

Lower classes (15 and 30 %)

For the lower classes involving wheat DTH, AUC of PNN was higher than MLP for both 15 and 30 % percentiles and all traits (Fig. 4b). In five instances (DTH-2, DTH-3, DTH-5, DTH-6 and DTH-9), the PNN15% model was slightly more accurate than PNN30% when classifying individuals in this lower class.

The best performance of PNN was reflected in the prediction accuracy given by the AUCpr criterion, where PNN was better than MLP in both lower classes for all DTH traits. Likewise, PNN30% was always higher than PNN15% (Table 6).

Prediction accuracy of PNN classifier with two and three classes in the wheat datasets

This section compares the performance of PNN in the upper and lower (15 and 30 %) classes for wheat GY and DTH traits, when two and three classes are formed and evaluated using the AUC (Table 7) and AUCpr (Table 8) criteria. For the AUC criterion, PNN with three classes was slightly better than PNN with two classes for most traits in the upper and lower 15 and 30 % classes (Table 7). For the AUCpr criterion, results were not as clear as for AUC; however, PNN with three classes was globally better than PNN with two classes (Table 8).

Table 7 Wheat datasets

Full size table

Table 8 Wheat datasets

Full size table

In summary, results for the wheat datasets comparing the performance of PNN for selecting individuals in the lower and upper 15 and 30 % classes, based on the splitting of continuous traits into two or three classes, showed that for the lower 15 %, the performance of PNN with three classes was better than PNN with two classes (in seven of ten traits). However, PNN with two classes gave better predictions than PNN with three classes in the upper 15 % (four over seven traits). This is not the case when predicting individuals in the upper and lower 30 %, where PNN with three classes was a better predictor than PNN with two classes for most traits.

ROC and precision-recall curves for the maize and wheat datasets

Some results of the ROC and precision-recall curves for various maize and wheat datasets for upper and lower 15 and 30 %, with the middle class in maize for 40 and 70 %, are displayed in a series of figures (for the maize datasets, Fig. 5a–f; for the wheat datasets, Fig. 6a–d). For the maize and wheat datasets, it is clear that the ROC curves of PNN for the upper and lower 15 and 30 % and the middle 40 and 70 % dominated the corresponding curves of MLP. Also, AUC values for PNN were always greater than those for MLP.

Furthermore, the P vs R graphs show that for all the maize and wheat datasets, PNN was better than MLP, indicating that the precision of PNN remains better than that of the MLP for all recall values. The precision of PNN started declining at higher values of R than the values of R for MLP.

Discussion

Accuracy of the MLP and PNN classifiers for selecting the best individuals

Genomic selection aims to accurately predict genetic values with genome-wide marker data using a three-step process: 1) model training and validation, 2) predicting genetic values, and 3) selecting based on these predictions [34].

We evaluated the performance of classifiers MLP and PNN for selecting the best individuals in maize and wheat datasets (Tables 1 and 2). Results indicated that, overall, PNN was more precise in identifying individuals in the correct class than MLP. Previous studies using RBFNN and Bayesian regularized NN on the same wheat datasets [8, 11] used in this study showed their prediction advantage over the linear parametric models for complex traits such as GY because these models can capture cryptic epistatic effects in gene × gene networks such as those usually present in wheat (e.g., additive × additive interactions). The good performance of PNN for selecting individuals in the correct classes may also be due to its ability for capturing small and complex interactions, while MLP may fail to do so.

The fact that these classifiers are trained to maximize the probability of membership of an individual to the target class, rather than searching for an overall performance, makes it attractive for applying these tools in GS. Results from MLP and PNN indicated that PNN was much more efficient in maximizing the probability of membership for the upper, middle, and lower classes than MLP.

From a practical genome-assisted plant breeding perspective, this study attempts to mimic the breeder’s decision, for example when selecting the upper 15 or 30 % class candidates for GY, or when selecting the lower 15 or 30 % class candidates for DTH, GLS or FL. In maize breeding, ASI synchrony close to zero is a crucial “middle class trait” under SS conditions because it will ensure selecting plants that will simultaneously produce pollen and silk; thus grains can be harvested. Therefore, PNN should help genomic-assisted breeding select appropriate candidates in each class of interest.

Breeding values have two main components, parental average (accounting for between family variation) and Mendelian sampling (accounting for within family variation). Genomic prediction should account for these two main components and try to control potential population structures that could modify prediction accuracy between the selected training and testing populations. An important practical question is how well PNN and MLP predict the breeding value of individuals between families and within families that were not phenotyped. Although the elite maize and wheat lines used in this study are not ideal as training sets, the cross-validation scheme used in this study (where 50 random partitions stratified by classes were generated for each data set) attempts to mimic the prediction of non phenotyped individuals belonging to different families (crosses) or to the same family. Although this cross-validation design may not have chosen individuals between and within families as precisely as they are in reality, it is likely that the 50 random partitions searched for all possible relationships between individuals in the training and testing sets such that some cross-validation partitions selected subsets of training data that had high correlations with the observed data, indicating a family relationship among individuals belonging to those training–testing subsets [11], whereas other random partitions chose subsets of training individuals that had no family relationship with those in the testing set, thus producing low correlations with the observed values. When applied to both classifiers, PNN consistently gave better average prediction accuracy (across the 50 random partitions) of the genetic values of the unobserved individuals than MLP in all 33 maize and wheat data sets.

AUC and AUCpr

For both datasets, the results of the AUCpr criterion showed that the values of the upper and lower PNN30% were higher than those for the upper and lower PNN15%. Also, the values of the middle PNN70% were higher than those for the PNN40% (Tables 5 and 6). These results were similar but not equal to those found by AUC (which does not account for imbalances in the number of individuals comprising the upper, middle and lower classes) in several instances. PNN15% was superior to PNN30% in the maize data (e.g., ASI-SS, ASI-WW, FFL-SS, MFL-SS, GL-1, GLS-4) and the wheat data (e.g., DTH-2, DTH-3, DTH5, DTH-6, DTH-9). Prediction accuracy of individuals was clearly hampered under biotic stress in the maize data, which was also found by [6, 8, 11, 35].

Figures 5a–f and 6a–d showing the ROC curve clearly indicated the advantage of PNN over MLP. The R vs fpr graph indicates that, for most of the traits, the probability of correctly classifying an individual in the upper, lower or middle classes was very often 0.80 or higher, even with a small fpr. In most cases, at a value of fpr = 0, the probability of classifying an individual in the correct class was 0.80 or greater for PNN. For all traits, the AUC of PNN15% was always better than the AUC of MLP15% and the AUC of PNN30% was better than the AUC of MLP30%.

For the AUCpr curve, Figs. 5a–f and 6a–d indicate that, in most cases, PNN had higher precision than MLP at higher sensitivity values. This criterion also indicates the superior performance of PNN over MLP.

Prediction accuracy for 30 vs 15 % classes with binary and trichotomous classes

Based on the AUC criterion, it is clear that PNN gave better prediction accuracy than MLP when assigning maize and wheat individuals to the classes of interest. Using the AUCpr criterion, the results were equally clear for the wheat and the maize datasets.

For the wheat datasets, the AUC criterion showed the superiority of PNN30% with three classes over PNN30% with two classes, as well as the superiority of PNN15% with three classes over PNN15% with two classes (Table 7). However, the differences given by the AUC criterion were not as marked as those shown by the AUCpr criterion. The AUCpr criterion applied with PNN shows that for the upper 15 % classes (GY traits), partitioning the data into two classes assigned more wheat lines to the correct observed classes than partitioning the data into three classes. However, for the lower 15 % classes (DTH traits) and for PNN 30 % upper and lower classes, results indicate that three classes gave better prediction than two classes (Table 8).

Conclusions

We compared the performance of the multilayer perceptron (MLP) and the probabilistic neural network (PNN) classifiers for selecting the best individuals belonging to a class of interest (target class) in maize and wheat datasets using high-throughput molecular marker information (55 k and 1.4 k). PNN outperformed MLP in most of the datasets. The performance criteria used to judge the predictive accuracy of MLP and PNN for assigning individuals to the right observed class were the area under ROC curve, AUC, and the area under the precision-recall curve, AUCpr, PNN had better accuracy than MLP. In genomic selection, where p markers > > n individuals is the norm, PNN seems promising because of its better generalization capacity than MLP, and is faster than MLP in obtaining optimal solutions, thus presenting appealing computational advantages.

Availability of supporting data

The 33 datasets (16 maize and 17 wheat trials) and the MATLAB scripts used in this work are available at http://hdl.handle.net/11529/10576.

Abbreviations

ASI:: anthesis-silking interval
AUC :: area under the receiver operating characteristic curve
AUCpr :: area under the precision-recall curve
DTH:: days to heading
FFL:: female flowering or days to silking
GLS:: gray leaf spot
GS:: genomic selection
GY:: grain yield
HI:: high yielding
LO:: low yielding
MFL:: male flowering or days to anthesis
MLP:: multi-layer perceptron
NN:: neural networks
P :: Precision
PNN:: probabilistic neural network
QTL:: quantitative trait loci
R :: recall or sensitivity
RBF:: radial basis function
RBFNN:: radial basis function neural network
RKHS:: reproducing kernel Hilbert spaces
ROC:: receiver operating characteristic curve
SS:: severe drought stress
WW:: well-watered

References

de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL. Whole genome regression and prediction methods applied to plant and animal breeding. Genetics. 2013;193(2):327–45.
Article PubMed PubMed Central Google Scholar
Gianola D. Priors in whole genome regression: the Bayesian alphabet returns. Genetics. 2013;194:573–96.
Article CAS PubMed PubMed Central Google Scholar
Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic values using genome-wide dense marker maps. Genetics. 2001;157:1819–29.
CAS PubMed PubMed Central Google Scholar
de los Campos G, Naya H, Gianola D, Crossa J, Legarra A, Manfredi E, Weigel K, Cotes JM. Predicting quantitative traits with regression models for dense molecular markers and pedigrees. Genetics. 2009;182:375–85.
de los Campos G, Gianola D, Rosa GJM, Weigel KA, Crossa J. Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet Res. 2010;92(4):295–308.
Article Google Scholar
Crossa J, de los Campos G, Pérez-Rodríguez P, Gianola D, Burgueño J, Araus JL, Makumbi D, Dreisigacker S, Yan J, Arief V, Banziger M, Braun, HJ. Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics. 2010;186:713–24.
Crossa J, Pérez-Rodríguez P, de los Campos G, Mahuku G, Dreisigacker S, Magorokosho C. Genomic selection and prediction in plant breeding. J of Crop Improvement. 2011;25(3):239–61.
Article Google Scholar
González-Camacho JM, de los Campos G, Pérez-Rodríguez P, Gianola D, Cairns JE, Mahuku G, et al. Genome-enabled prediction of genetic values using radial basis function neural networks. Theor Appl Genet. 2012;125(4):759–71.
Article PubMed PubMed Central Google Scholar
Heslot N, Yang HP, Sorrells ME, Jannink JL. Genomic selection in plant breeding: A comparison of models. Crop Sci. 2012;52:146–60.
Article Google Scholar
Pérez-Rodríguez P, de los Campos G, Crossa J, Gianola D. Genomic-enabled prediction based on molecular markers and pedigree using the BLR package in R. The Plant Genome. 2010;3(2):106–16.
Article Google Scholar
Pérez-Rodríguez P, Gianola D, González-Camacho JM, Crossa J, Manès Y, Dreisigacker S. Comparison between linear and non-parametric regression models for genome-enabled prediction in wheat. G3: Genes|Genomes|Genetics. 2012;2(12):1595–605.
Article PubMed PubMed Central Google Scholar
Gianola D, Fernando RL, Stella A. Genomic-assisted prediction of genetic values with semiparametric procedures. Genetics. 2006;173:1761–76.
Article CAS PubMed PubMed Central Google Scholar
Gianola D, Okut H, Weigel KA, Rosa GJM. Predicting complex quantitative traits with neural networks: a case study with Jersey cows and wheat. BMC Genetics 2011; doi:101186/1471-2156-12-87.
Gianola D, van Kaam JBCHM. Reproducing kernel Hilbert space regression methods for genomic-assisted prediction of quantitative traits. Genetics. 2008;178:2289–303.
Article PubMed PubMed Central Google Scholar
Okser S, Pahikkala T, Airola A, Salakoski T, Ripatti S, Aittokallio, T. Regularized machine learning in the genetic prediction of complex traits. PLoS Genetics 2014; doi:10.1371/journal.pgen.1004754.
Ehret A, Hochstuhl D, Gianola D, Thaller G. Application of neural networks with back-propagation to genome-enabled prediction of complex traits in Holstein-Friesian and German Fleckvieh cattle. Genetics Selection Evolution 2015; doi:10.1186/s12711-015-0097-5.
Sant’Anna IC, Tomaz RS, Silva GN, Nascimento M, Bhering LL, Cruz CD. Superiority of artificial neural networks for a genetic classification procedure. Genet Mol Res. 2015;14(3):9898–906.
Article PubMed Google Scholar
Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2009.
Book Google Scholar
Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nat Rev Genet. 2015;16(6):321–32.
Article CAS PubMed Google Scholar
Kecman V. Learning and soft computing: support vector machines, neural networks and fuzzy logic models. Massachusetts, and London, England: MIT Press Cambridge; 2001.
Google Scholar
Specht DF. Probabilistic neural networks. Neural Netw. 1990;1(3):109–18.
Article Google Scholar
Wasserman PD. Advanced methods in neural networks. New York: Van Nostrand Reinhold; 1993.
Google Scholar
Long N, Gianola D, Rosa GJ, Weigel KA, Avendaño S. Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers. J of Animal Breeding and Genetics. 2007;124:377–89.
Article CAS Google Scholar
González-Recio O, Rosa GJ, Gianola D. Machine learning methods and predictive ability metrics for genome-wide prediction of complex traits. Livest Sci. 2014;166:217–31.
Article Google Scholar
Ornella L, Pérez-Rodríguez P, Tapia E, González-Camacho JM, Burgueño J, et al. Genomic-enabled prediction with classification algorithms. Heredity. 2014;112(6):616–26.
Article CAS PubMed PubMed Central Google Scholar
Brachi B, Geoffrey P, Morris GP, Borevitz JO. Genome-wide association studies in plants: the missing heritability is in the field. Genome Biology 2011; doi:10.1186/gb-2011-12-10-232.
Moller MF. Scaled conjugate gradient algorithm for fast supervised learning. Neural Netw. 1993;6:525–33.
Article Google Scholar
Naftalyy U, Intratorz N, Hornx D. Optimal ensemble averaging of neural networks. Network Comput Neural Syst. 1997;8:283–96.
Article Google Scholar
Parzen E. On estimation of a probability density function and mode. Ann Math Statist. 1962;33:1065–76.
Article Google Scholar
Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27(8):861–74.
Article Google Scholar
Murphy KP. Machine learning: a probabilistic perspective. 1st ed. Cambridge, Massachusetts, London, England: The MIT Press; 2012.
Davis J, Goadrich M. The relationship between precision-recall and ROC curves. In: ICML ‘06: Proceedings of the 23rd international conference on machine learning. New York, NY, USA: ACM; 2006. doi:10.1145/1143844.1143874.
Google Scholar
Keilwagen J, Grosse I, Grau J. Area under precision-recall curves for weighted and unweighted Data. PLoS ONE 2014; doi:10.1371/journal.pone.0092209.
Heffner EL, Lorenz AJ, Jannink J-L, Sorrells ME. Plant breeding with genomic selection: Gain per unit time and cost. Crop Sci. 2010;50:1681–90.
Article Google Scholar
Zhang X, Pérez-Rodríguez P, Semagn K, Beyene Y, Babu R, López-Cruz MA, San Vicente F, Olsen M, Buckler E, Jannink J-L, Prasanna BM and Crossa J. Genomic prediction in biparental tropical maize populations in water-stressed and well-watered environments using low-density and GBS SNPs. Heredity. 2015;114:291–9.

Download references

Acknowledgments

We thank CIMMYT’s global wheat and maize programs and the national program researchers who performed the experiments and collected the valuable phenotypic data analyzed in this study. We acknowledge the financial support provided to Cornell-CIMMYT by the Bill & Melinda Gates Foundation.

Author information

Authors and Affiliations

Colegio de Postgraduados, Campus Montecillo, Texcoco, México, 056230, México
Juan Manuel González-Camacho & Paulino Pérez-Rodríguez
Biometrics and Statistics Unit (BSU), International Maize and Wheat Improvement Center (CIMMYT), Apdo Postal 6-641, México DF, 06600 24105, México
José Crossa
NIDERA SEMILLAS S.A., Ruta 8 Km. 376, 2600, Venado Tuerto, Argentina
Leonardo Ornella
Department of Animal Sciences, University of Wisconsin, Madison, 53706, USA
Daniel Gianola

Authors

Juan Manuel González-Camacho
View author publications
You can also search for this author in PubMed Google Scholar
José Crossa
View author publications
You can also search for this author in PubMed Google Scholar
Paulino Pérez-Rodríguez
View author publications
You can also search for this author in PubMed Google Scholar
Leonardo Ornella
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Gianola
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to José Crossa.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JMGC conceived, drafted the manuscript, carried out the study, performed computations and wrote a part of the manuscript; JC conceived and wrote part of the manuscript; PPR, LO and DG helped to provide critical insights and revised the manuscript. All authors read and approved the final manuscript.

Additional files

Additional file 1: Table S1.

Maize datasets. Mean values of the area under the ROC curve AUC (standard deviation in parentheses) of 50 random partitions for upper 15 and 30 % classes for grain yield (GY) in four environments (HI, LO, SS, and SS), for middle 40 and 70 % classes for anthesis-silking interval (ASI) in two environments (SS) and (WW), and for lower 15 and 30 % classes for four traits, female flowering (FFL) and male flowering (MFL) in two environments (WW and SS); for gray leaf spot resistance (GLS) in six environments and for both MLP and PNN classifiers. Numbers in bold are the highest AUC values between MLP and PNN for 15 and 30 %. (DOC 62 kb)

Additional file 2 :Table S2.

Wheat datasets. Mean values of the area under the ROC curve AUC (standard deviation in parentheses) of 50 random partitions for the upper 15 and 30 % classes for grain yield (GY) in seven environments (1-7) and for the lower 15 and 30 % classes for days to heading (DTH) in ten environments (1-10) for both MLP and PNN classifiers. Numbers in bold are the highest AUC values between MLP and PNN for 15 and 30 %. (DOC 59 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

González-Camacho, J.M., Crossa, J., Pérez-Rodríguez, P. et al. Genome-enabled prediction using probabilistic neural network classifiers. BMC Genomics 17, 208 (2016). https://0-doi-org.brum.beds.ac.uk/10.1186/s12864-016-2553-1

Download citation

Received: 03 October 2015
Accepted: 29 February 2016
Published: 09 March 2016
DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12864-016-2553-1

Genome-enabled prediction using probabilistic neural network classifiers

Abstract

Background

Results

Conclusions

Background

Methods

Maize datasets

Wheat datasets

Transforming phenotypic responses into three or two classes

Multilayer perceptron neural network (MLP) classifier

Probabilistic neural network (PNN) classifier

Criteria for assessing classifier prediction accuracy

The area under the receiver operating characteristic curve (AUC)

The area under the precision-recall curve (AUCpr)

Software

Results and discussion

Comparing classifiers to select individuals in the upper, middle and lower classes in the maize datasets

Upper classes (15 and 30 %)

Middle classes (40 and 70 %)

Lower classes (15 and 30 %)

Comparing classifiers for selecting individuals in the upper and lower classes in the wheat datasets

Upper classes (15 and 30 %)

Lower classes (15 and 30 %)

Prediction accuracy of PNN classifier with two and three classes in the wheat datasets

ROC and precision-recall curves for the maize and wheat datasets

Discussion

Accuracy of the MLP and PNN classifiers for selecting the best individuals

AUC and AUCpr

Prediction accuracy for 30 vs 15 % classes with binary and trichotomous classes

Conclusions

Availability of supporting data

Abbreviations

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Additional files

Additional file 1: Table S1.

Additional file 2 :Table S2.

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Genomics

Contact us