 Research
 Open Access
 Published:
Scalable optimal Bayesian classification of singlecell trajectories under regulatory model uncertainty
BMC Genomics volume 20, Article number: 435 (2019)
Abstract
Background
Singlecell gene expression measurements offer opportunities in deriving mechanistic understanding of complex diseases, including cancer. However, due to the complex regulatory machinery of the cell, gene regulatory network (GRN) model inference based on such data still manifests significant uncertainty.
Results
The goal of this paper is to develop optimal classification of singlecell trajectories accounting for potential model uncertainty. Partiallyobserved Boolean dynamical systems (POBDS) are used for modeling gene regulatory networks observed through noisy geneexpression data. We derive the exact optimal Bayesian classifier (OBC) for binary classification of singlecell trajectories. The application of the OBC becomes impractical for large GRNs, due to computational and memory requirements. To address this, we introduce a particlebased singlecell classification method that is highly scalable for large GRNs with much lower complexity than the optimal solution.
Conclusion
The performance of the proposed particlebased method is demonstrated through numerical experiments using a POBDS model of the wellknown Tcell large granular lymphocyte (TLGL) leukemia network with noisy timeseries geneexpression data.
Background
A key issue in genomic signal processing is to classify normal versus cancerous cells, different stages of tumor development, or different prospective drug response. Previous geneexpression technologies, such as microarray and RNASeq, typically measure the average behavior of tens of thousands of cells [1, 2]. By contrast, the recent advances in nextgeneration sequencing technologies have allowed indepth investigation of the transcriptome at a singlecell resolution [3, 4].
Gene regulatory networks (GRNs) govern the functioning of key cellular processes, such as stress response, DNA repair, and other mechanisms involved in complex diseases such as cancer. Often, the relationship among genes can be described by logical rules updated at discrete time intervals with each gene have Boolean states: 0 (OFF) or 1 (ON) [5]. The PartiallyObserved Boolean dynamical system (POBDS) model [6–8] is a rich framework for modeling the behavior of GRNs observed through contemporary geneexpression technologies, as it allows indirect and incomplete observation of gene states. Several tools for the POBDS model have been developed in recent years, such as the optimal filter and smoother based on the Minimum Mean Square Error (MMSE) criterion, termed as the Boolean Kalman filter (BKF) and Boolean Kalman smoother (BKS) [6], respectively.
In [9] and [10], the maximumlikelihood (ML) based classification of singlecell trajectories has been developed. The method uses the MLadaptive filter proposed in [6] for estimation of the unknown parameters, followed by the Bayes classifier tuned to the ML parameter estimates. The drawback of this method is its inability to use prior knowledge in deriving the classifier. In [11], the intrinsically Bayesian robust (IBR) classifier for the trajectories is developed. This IBR classifier is optimal relative to the prior distribution of unknown parameters.
In this paper, assuming that there are two classes, healthy (c=0) and cancerous (c=1), we derive the optimal Bayesian classifier (OBC) [12, 13] for classification of singlecell trajectories. The difference between the OBC and IBR classifiers is that in the OBC the expectation of the classconditional densities is taken over the posterior distribution of the unknown parameters [14, 15], whereas in the IBR classifier the expectation is taken over the priors.
Despite the optimality of the developed OBC for singlecell trajectories, its exact computation for large GRNs becomes intractable, due to the large size of the matrices involved. In this paper, we develop a particlebased OBC to scale up the classification of singlecell trajectories. The proposed method contains a bank of Auxiliary ParticleFilter implementations of the Boolean Kalman Filter (APFBKF) proposed in [16], for both training and test processes.
Our contributions are twofold: 1) Optimal Bayesian Classification: we derive the optimal Bayesian classifiers (OBC) for both singlecell gene expression trajectories and multiplecell averaged gene expression with uncertain regulatory network prior; and 2) Scalability: the parallel particle filters together with the MonteCarlo inference have been efficiently used to estimate the likelihood and stationary distributions, making the derived OBC scalable to larger gene regulatory networks.
We apply the APFBKFbased OBC to classify trajectories of the blood cancer Tcell large granular lymphocyte (TLGL) leukemia. TLGL leukemia is a chronic disease characterized by a clonal proliferation of cytotoxic T cells [17]. A Boolean network model of T cell survival signaling in the context of TLGL leukemia has been constructed by [18] through performing extensive literature search. Then the TLGL network has been simplified by [17], which constructs the minimum network that preserves the attractor structure of that system. The reduced network contains 18 genes, which has an optimal solution with a transition matrix with 2^{2×18}=68,719,476,736 elements. By contrast, as we will show in our numerical experiments, the proposed APFBKFbased method captures Tcell dynamics with only 1000 particles.
Methods
Gene regulatory network model
Gene regulatory networks are modeled as partiallyobserved Boolean dynamical systems (POBDS). The two components of the POBDS model are a state space model that describes the evolution of the dynamics of the GRN, and an observation model for the measurements. These two components are described below.
GRN state space model
The state process {X_{k};k=0,1,…}, where X_{k}∈{0,1}^{n}, represents the activation (ON)/inactivation (OFF) state for the corresponding gene across time. The state at each discrete time is assumed to be updated through the nonlinear signal model
for k=1,2,…, where f:{0,1}^{n}→{0,1}^{n} is a Boolean function called the network function, “ ⊕” indicates componentwise modulo2 addition, and n_{k}∈{0,1}^{n} is Boolean transition noise at time k. The modulo2 addition means that if a bit in the noise n_{k} is 1, the value of the corresponding gene in the Boolean state X_{k} is flipped. In this paper, the noise process n_{k} is assumed to have independent components distributed as Bernoulli (p), where the parameter p>0 models the noise “intensity” — the closer p is to 0.5, the more chaotic the system will be, while a value of p close to zero means that the state trajectories are nearly deterministic, being governed tightly by the network function and perturbation process. The network possesses a steadystate distribution π^{∞} describing its longrun behavior as:
where \(\left \{\mathbf {x}^{1},\ldots,\mathbf {x}^{2^{n}}\right \}\) denotes the set of the corresponding network states in the Boolean vector representation.
Observation model
The data available to the experimenter is described by the observation process {Y_{k};k=1,2,…}, where Y_{k}=(Y_{k}(1),…,Y_{k}(n)) is a vector containing the transcript abundance measurements at time k, for k=1,2,…. We consider a Gaussian linear model
where \(\mathbf {v}_{k} \sim \mathcal {N}(0,\sigma ^{2}I_{n})\) is an uncorrelated zeromean Gaussian noise vector, λ= [λ_{1},…,λ_{n}]^{T} is a vector of baseline gene expressions corresponding to the “zero” state for each gene, and D=Diag(δ_{1},…,δ_{n}) is a diagonal matrix containing differential expression values for each gene along the diagonal (these indicate by how much the “one” state of each gene is overexpressed over the “zero” state). Such a Gaussian linear model is an appropriate model for singlecell geneexpression data [19, 20].
Optimal Bayesian classifier (OBC) for singlecell trajectories
Assume there are two POBDSs corresponding to the healthy and cancerous (mutated) classes, each having n genes. The difference between the healthy and mutated classes could be the overexpression or disruption of a value of single or multiple genes in the mutated case. Let \(\mathbb {Y}^{c} = \left \{\mathcal {Y}_{c}^{(1)},\mathcal {Y}_{c}^{(2)},\ldots,\mathcal {Y}_{c}^{(D_{c})}\right \}\) be the set of D_{c} observed trajectories from class c, c=0,1. Let Θ=(θ_{1},…,θ_{M}) be the uncertainty set of M network functions containing the unknown true network functions in (1), indicating M possible Boolean functions as \(\left \{\mathbf {f}^{c}_{\theta _{1}},\ldots,\mathbf {f}^{c}_{\theta _{M}}\right \}\) considering the regulatory model uncertainty for the class c. The prior probability of the model θ for the class c is represented by π(θ∣c), where \({\sum \nolimits }_{i=1}^{M} \pi (\theta _{i}\mid c)=1\), for c=0,1. This uncertainty could arise due to some unknown regulations (i.e. interactions) between some genes in the pathway diagram (more information in “Results and discussion” section). We wish to derive the optimal Bayesian classifier (OBC) under uncertainty using all available data and prior knowledge.
If the featurelabel distribution is unknown but belongs to an uncertainty class Θ of featurelabel distributions, then we desire a classifier to minimize the expected error over the uncertainty class. This expected error is equivalent to the Bayesian minimum meansquareerror estimate [21] given by \(\hat {\epsilon }(\psi) = E_{\theta \mathbb {Y}^{c},c} [\epsilon (\psi,\theta)]\), where ε(ψ,θ) is the error of ψ on the featurelabel distribution parameterized by θ and the expectation is taken over the posterior distribution of parameters \(\pi (\theta \mid \mathbb {Y}^{c}, c), c=0, 1\).
For a given test trajectory \(\mathcal {Y}\), the OBC, minimizing the Bayesian minimum meansquare error estimate, can be obtained as
where p_{θ}(.) denotes a probability density function corresponding to parameter θ,p^{0} is the prior probability of class 0 and
for c=0,1.
Derivation of (5) requires computing the posterior distribution
where π(θ∣c) denotes the prior probability of the corresponding network model θ for class c. For an arbitrary trajectory \(\tilde {\mathcal {Y}}\), we define the loglikelihood function associated to model θ and class c by
Now, using the above definition in (4)(6) leads to the following exact OBC solution:
where
The expectation in (9) is taken with respect to the posterior distribution of θ, i.e. \(\pi (\theta \mathbb {Y}^{c},c)\), as opposed to IBR classifier [11] that considers the prior distribution π(θc). Furthermore, \(\log p_{\theta }(\mathbb {Y}^{c}\mid c)={\sum \nolimits }_{d=1}^{D_{c}} \log p\left (\mathcal {Y}_{c}^{(d)}\mid \theta, c\right)\) is used in (9) due to the independency of the training trajectories.
As shown in Eq. (8), the OBC requires computing the loglikelihood functions of all training trajectories and test trajectory for both classes and θ∈Θ. Let \(\left \{\mathbf {x}^{1},\ldots,\mathbf {x}^{2^{n}}\right \}\) denote the corresponding network states in the Boolean vector representation and Y_{1:T}=(Y_{1},…,Y_{T}) be a single trajectory of length T. The loglikelihood function can be computed as
where based on the POBDS model,
Define the conditional probability of any network state at time step k for the model θ and class c as
Using (11) and (12), the loglikelihood function in (10) can be written in a compact form as [22]
where ._{1} denotes L_{1} norm, indicating the summation in (10). \(M^{\theta,c}_{k}\) is the transition matrix of the Markov state process corresponding to model θ∈Θ, with entries given by:
with p denoting the Bernoulli noise parameter. \(T^{\theta,c}_{k}(\mathbf {Y}_{k})\) is a diagonal matrix, called the update matrix, with the ith diagonal element given by
for i=1,…,2^{n}, where δ_{j} and λ_{j} are geneexpression parameters associated to the jth gene in class c as defined in (3). Notice that the initial distribution \(\Pi ^{\theta,c}_{00}=\pi ^{\infty }_{\theta,c}\) is the steadystate distribution associated to model θ and the class c defined as
This vector can be either computed exactly as introduced in [9] or approximated by creating multiple MonteCarlo trajectories with relatively long horizons.
The posterior distribution can also be recursively computed according to the transition and update matrices as described in [6]:
The complexity of computing the loglikelihood function for a single trajectory of length T is of order O(2^{2n}×T) due to the transition matrix involved in its computation. The whole process of the proposed OBC is presented in Algorithm 1.
Scalable classification of singlecell trajectories
In the previous section, the exact solution for the optimal Bayesian classifier is introduced. However, for large systems with a large number of state variables, the exact computation of Algorithm 1 becomes impractical. This is due to the large transition matrix with 2^{2n} elements required to compute the loglikelihoods, leading to exponential computational and memory complexity. Thus, the key here is to scale up the OBC for singlecell trajectories by reducing both computational and memory complexity when computing (10).
We adopt the Sequential MonteCarlo (SMC) techniques [23–27] for estimating nonlinear statespace models, such as our POBDS here. These techniques approximate the target distribution using sample points (“particles”) drawn from a proposal distribution, taking advantage of the fact that sampling from the proposal distribution is easier than from the target. This helps alleviate the high computation of the exact filter by using a finite set of MonteCarlo samples. In this paper, we use the Auxiliary Particle Filter implementation of the Boolean Kalman Filter (APFBKF) proposed in [16] to deal with large GRNs.
Let Y_{1:T} be a given trajectory and the goal is to approximate the likelihood function in Eq. (10) (i.e., \(L_{c}^{\theta }(\mathbf {Y}_{1:T})\) for class c∈{0,1} and θ∈Θ). Let there be N total particles \(\{\mathbf {x}_{k1,i}\}_{i=1}^{N}\), with their associated weights \(\{w_{k1,i}\}_{i=1}^{N}\). The particle filter allows us to produce an approximation for the elements in \(\boldsymbol {\Pi }^{\theta,c}_{kk}\) of (12), simply by using the discrete support of the particles
Usually only a few particles have significant weights after a few iterations of the algorithm and most particles have negligible weights. APFBKF is a lookahead method that predicts the location of particles with a high probability at time k based on the observations at time step k−1, with the purpose of making the subsequent resampling step more efficient. Without the lookahead, the basic algorithm blindly propagates all particles, even those in low probability regimes.
The APFBKF algorithm defines:
where ζ_{k}=1,…,N, is an index of the mixture in (18). If we draw from the joint density and then discard the index, then we will have a sample from (18) as required. This procedure carries out the prediction and update steps of the optimal filter in (17). We can approximate (19) by
where μ_{k,i} is the mode associated with the density of P_{θ}(X_{k}∣x_{k−1,i},c), given by [16]:
for i=1,…,N, where we have used (1) and the fact that the noise is zeromode (i.e., the Bernoulli noise intensity p is smaller than 0.5).
By simulating the index with the probability v_{k,i}=p_{θ}(Y_{k}∣μ_{k,i},c) w_{k−1,i}, we can sample from P_{θ}(X_{k},ζ_{k} ∣ Y_{1:k},c) and then sample from the transition density given the mixture, P_{θ}(X_{k}∣x_{k−1,i},c).
Actually, we simulate only from particles associated with large predictive likelihoods. We first sample N times from the joint density of P_{θ}(X_{k},ζ_{k} ∣ Y_{1:k},c), and then obtain the new particles \(\{\mathbf {x}_{k,i}\}_{i=1}^{N}\) and their associated weights \(\{w_{k,i}\}_{i=1}^{N}\) by
The auxiliary variables \(\{\zeta _{k,i}\}_{i=1}^{N}\) are obtained by sampling from a discrete distribution:
where Cat (a_{1},…,a_{N}) represents a categorical distribution with the probability mass function \(f(\zeta = i) = a_{i}/{\sum \nolimits }_{j=1}^{N} a_{j}\).
It is shown in [16, 28] that the loglikelihood function in (10) can be approximated by:
where \(\hat {L}_{c}^{\theta }(\mathbf {Y}_{1:T})\) denotes the approximation of \(L_{c}^{\theta }(\mathbf {Y}_{1:T})\). Note that the computational complexity of this algorithm is of order O(NT) which can be much smaller than O(2^{2n}T) that is the complexity of computing the exact loglikelihood function in (10).
The whole process and the schematic diagram of the proposed classifier are presented in Algorithm 2 and Fig. 1, respectively. During the training process, 2MD_{0}D_{1} particle filters need to be run for computing the loglikelihood functions of trajectories from all network models in the uncertainty class for two classes. The output values of the particle filters can be used for efficient approximation of the posterior distribution in (9). Then, during the test process, for a given test trajectory, 2M particle filters need to be performed for loglikelihood approximation of all network models (θ∈Θ) and classes (c=0,1). These loglikelihood values and posterior probability approximated during the training process can be used to derive the approximate OBC in (8).
The training process of the proposed method has the computational complexity O(2NMTD_{0}D_{1}), whereas the exact solution has the complexity O(2^{2n+1}MTD_{0}D_{1}). The exponential growth of the complexity with the size of network (i.e., number of genes) for the exact solution precludes its application to large GRNs. However, the number of particles, N, by the proposed method can be chosen relatively small according to the attractor structure of the system (i.e., N<<2^{2n}) [29], allowing the classification of largescale singlecell trajectories (see “Results and discussion” section). The complexity of the test process for the proposed method is O(2NMT), as opposed to O(2^{2n+1}MT) for the optimal solution.
Optimal Bayesian classifier for multiplecell scenarios
In the previous sections, the classification of singlecell trajectories is discussed. Here, we consider common scenarios in molecular biology research where geneexpression data are often based on the average expression from multiplecells at different time with different states. Since the trajectories are assumed to be independent and drawn based on the dynamics of the true network, its steadystate distribution \(\pi ^{\infty }_{\theta ^{*}}\) characterizes the probability of the system being at different states. It can be shown that the multiplecell data are independent samples from the following measurement model [10]:
where \(\phantom {\dot {i}\!}\{\mathbf {x}^{1},\ldots,\mathbf {x}^{2^{n}}\}\) denotes all the network states in the Boolean vector representation. However, we assume that the true network model θ^{∗}, is unknown, and is represented by a finite set of M possible network models {θ_{1},…,θ_{M}} with prior probability π(θ∣c). Let \(\mathbf {Y}_{c}^{(1)},\ldots,\mathbf {Y}_{c}^{(N_{c})}\) be the multiplecell training measurements available for class c. The optimal Bayesian classifier for a given test sample Y can be represented by
The posterior probability of the parameter θ can be computed as
Computation of (26) requires computing the conditional probability of training and test samples given all θ∈Θ and c∈{0,1}. Let A be an n×2^{n} matrix containing all Boolean states of the system (i.e., \(\mathbf {A}=\left [\mathbf {x}^{1},\ldots,\mathbf {x}^{2^{n}}\right ]\)). According to (25), the conditional distribution of an arbitrary sample \(\tilde {\mathbf {Y}}\) given class c and network model θ can be written as
where “ ∘” is the Hadamard product. Thus, the Boolean nature of the state vector suggests that each element of the multiplecell measurement is distributed as a mixture of two Gaussian distributions. Replacing (27) and (28) into (26) leads to the OBC for multiplecell scenarios. Comprehensive comparison results between the OBC in singlecell trajectories and multiplecells are provided in the next section.
Results and discussion
We evaluate the proposed singlecell trajectory classifier and compare its performance with the OBC based on multiplecell average expression on the TLGL leukemia Boolean network, whose GRN [17] is shown in Fig. 2. This GRN has 18 genes. The regulating functions are defined in Table 1 [17, 18]. The main node is “Apoptosis”, which denotes programmed cell death. According to [17], in the mutated case, the node Apoptosis is stuck at OFF state and cannot be activated. As a result, we derive the healthy Boolean network from Table 1 and for the mutated Boolean network we put the value of Apoptosis in Table 1 to zero. This means that in the cancerous scenario, the value of Apoptosis does not obey the regulating functions and is always zero.
As we may not know the true network function, we consider four candidate network functions for each of the healthy and mutated networks as the uncertainty class of possible GRN models. In addition to the true network, we remove the operation ¬ of Apoptosis for the genes sFas and GPCR, which are intermediate nodes. Therefore, this network is very close to the true network. For the third network, we remove ¬ of Apoptosis from two other nodes IAP and P2. In the fourth network of this uncertainty class, we change the operation AND to OR for the gene BID. In this study, we use the observation models described in Eqs. (3) and (25) for the singlecell trajectory and multiplecell averaging, respectively.
Figure 3a and b show the classification errors for the simulated data based on the TLGL leukemia Boolean network versus the number of time points T for two values of state perturbation probability p=0.05 and 0.1, respectively. For the sake of simplicity, we assume the geneexpression parameters in Eqs. (3) and (25) to be the same for all genes, that are λ=[λ,…,λ]^{T} and D=Diag(δ,…,δ), and set λ=10, and δ=30. Two different values are considered for the observation noise level: σ=20 (low noise), and σ=25 (high noise). While σ corresponds to both within subject and between subject variations in the singlecell, it shows between subject variation in the multiplecell because multiplecells would allow to average out the within subject variance. We set the number of training trajectories D=D^{c=0}+D^{c=1}=4. In both figures, the error curves are monotonically decreasing in terms of the trajectory length T for the singlecell classifier. There is a special case in which the error gets fixed after some T. This may be explained by the effect from the steadystate distributions depending on the lengths of attractor cycles of the networks under study. When the perturbation noise p and the observation noise σ are small, the sufficient T to achieve the least possible error is L+1, where L is the minimum attractor length in the two networks. More precisely, the BNps tend to the deterministic BNs when the perturbation noise is small, meaning that the observations occur only in the attractor states and circulate inside the attractor cycles. In such a case, L+1 is the maximum length of a trajectory that can help distinguish the two networks. But there is a nonnegligible probability of jumping states in the considerable perturbation scenarios, so that longer trajectories can be helpful. In all figures for every value of T and p, the error increases with increasing observation noise. While the proposed classifier works in both low and highnoise scenarios, the classifier based on the multiplecell expression data only works well in the lownoise scenarios and is very sensitive to σ. In low observation noise scenarios and when p=0.05, multiplecell classifier can easily classify, while the trajectorybased with two timepoints will have a little bit higher error due to the common short segments of the trajectories between two classes. When there are at least four timepoints, which is the attractor cycle size in the uncertainty class of the networks, the performance of the classifiers based on singlecell trajectories is better compared to the ones using multiplecell even in the lownoise scenarios. Increasing the number of timepoints help better decipher the difference in singlecell trajectories between two classes (healthy vs. cancerous) with improved classification accuracy. Compared to the multiplecell classifier based on averaged gene expression over cells at different states, the trajectory based classifier can clearly improve the classification performance. Even using four time points, the classification accuracy can be improved up to 8%. Using longer trajectories, the improvement can be up to 18% as indicated in Fig. 3.
In addition to the effect of trajectory length, we would like to investigate how the number of training trajectories affect the performance of the proposed method, especially with a low number of training samples. Figure 4 shows the effect of the number of training trajectories D on the classification performance. In the particle filter point of view, increasing D by 1 means increasing the available data as T. Therefore, we set T=2, that is smallest T, to better see the trend of classification error. Both the average error and its standard deviation decrease with more training trajectories and the classification error converges to a fixed value when D becomes large enough. The value of D required for a converged error rate depends on the parameters T, p, and M. In realworld scenarios that may have significant uncertainty and the perturbation probability is high, we need more training data to improve the performance.
Figure 5 illustrates how the model uncertainty may affect the classification performance. In our setup, the uncertainty is manifested as the size of the network uncertainty class – the number of uncertain networks for each class. In both perturbation probabilities, the error increases with increasing M. To demonstrate how the proposed method can reduce the computational cost of the Boolean network classification, we check the change of the average classification error with respect to the number of particles. Figure 6 shows that increasing the number of particles monotonically decreases the classification error and the average error converges with only 1000 particles. This can be compared with the traditional method [9], which needs computation based on the 2^{18}×2^{18} transition probability matrix. Such a dimension is too large for the direct application of the OBC.
To more comprehensively evaluate the proposed method, we have compared it with two other classification methods, i.e. IRB [11] and PlugIn [10]. The performance comparisons under four different scenarios are provided in Table 2. The OBC stands out as the best performing method in different scenarios. Using OBC improves the accuracy by 8.5% and 1.5% with T=3 for p=0.1 and σ=25 compared to PlugIn and IRB, respectively.
The proposed method does not have any restriction on the noise distribution assumptions due to the generalizability of particle filters. To show this, we also test our method with different noise distributions. While the noise of GRNs is usually Gaussian, sometimes due to up regulation, noise can be Poisson or Negative Binomial (NB). We have simulated Gaussian and NB noise distributions with the same mean and variance while for Poisson noise, the variance is equal to its mean (Details can be found in Additional file 1, the additional experimental results). Additional file 1: Figure S1 shows that our method performs consistently well for all observation noises. The Poisson results are superior to the other models as its variance is smaller.
We also compare the performance of APF with the plain Sequential Importance Resampling (SIR) in high process noise. As Additional file 1: Figure S2 shows, the performance is similar especially when there is enough time points. When the number of time points is low, the SIRbased particle filter performs worse. The superior performance of APF in realworld GRNs is due to the fact that the size of attractors is usually small and the predicted mode values by APF can be a good approximation for the next prediction. Moreover, the initial distribution is assumed to be from the stationary distribution. This makes APF a more desirable approximation solutions due to the lower diversity in the particles.
Conclusions
In this paper, we have developed the optimal Bayesian classifier for binary classification of singlecell trajectories under regulatory model uncertainty. The partiallyobserved Boolean dynamical system is used for modeling the dynamical behavior of gene regulatory networks. Due to the intractability of the OBC for large GRNs, we have proposed a particle filtering technique for approximating the OBC. This particlebased solution reduces the computational and memory complexity of the optimal solution significantly. The performance of the proposed particlebased method is demonstrated through numerical experiments using a POBDS model of the wellknown Tcell large granular lymphocyte (TLGL) leukemia network based on noisy timeseries geneexpression data.
Abbreviations
 APFBKF:

Auxiliary particlefilter implementations of the Boolean Kalman filter
 BKF:

Boolean Kalman filter
 BKS:

Boolean Kalman smoother
 BNp:

Boolean network with perturbation
 GRN:

Gene regulatory network
 IRB:

Intrinsically Bayesian robust
 MMSE:

Minimum mean square error
 OBC:

Optimal Bayesian classifier
 POBDS:

Partially observed Boolean dynamical system
 SMC:

Sequential MonteCarlo
 TLGL:

Tcell large granular lymphocyte
References
 1
Hajiramezanali E., Dadaneh S. Z., de Figueiredo P., Sze S. H., Zhou M., Qian X.Differential expression analysis of dynamical sequencing count data with a gamma markov chain. 2018. arXiv preprint arXiv:1803.02527.
 2
Hajiramezanali E., Dadaneh S. Z., Karbalayghareh A., Zhou M., Qian X.Bayesian multidomain learning for cancer subtype discovery from nextgeneration sequencing count data. In Advances in Neural Information Processing Systems 31. 2018:9132–41.
 3
Lake B. B., Chen S., Sos B. C., Fan J., Kaeser G. E., Yung Y. C., Duong T. E., Gao D., Chun J., Kharchenko P. V., et al.Integrative singlecell analysis of transcriptional and epigenetic states in the human adult brain. Nat Biotechnol. 2018; 36(1):70.
 4
Fiers M. W., Minnoye L., Aibar S., Bravo GonzálezBlas C., Kalender Atak Z., Aerts S.Mapping gene regulatory networks from singlecell omics data. Brief Funct Genom. 2018.
 5
Shmulevich I., Dougherty E. R.Genomic Signal Processing vol. 50. New Jersey: Princeton University Press; 2014.
 6
Imani M., BragaNeto U.Maximumlikelihood adaptive filter for partiallyobserved Boolean dynamical systems. IEEE Trans Sig Process. 2017; 65(2):359–71.
 7
Imani M., BragaNeto U. M.Pointbased methodology to monitor and control gene regulatory networks via noisy measurements. IEEE Trans Control Syst Technol. 2018. https://0doiorg.brum.beds.ac.uk/10.1109/TCST.2017.2789191.
 8
Imani M., BragaNeto U.Finitehorizon LQR controller for partiallyobserved Boolean dynamical systems. Automatica. 2018; 95:172–9.
 9
Karbalayghareh A., BragaNeto U., Hua J., Dougherty E. R.Classification of state trajectories in gene regulatory networks. IEEE/ACM IEEE Trans Control Syst Technol Biol Bioinforma. 2016; 15:68–82.
 10
Karbalayghareh A., BragaNeto U., Dougherty E. R.Classification of singlecell gene expression trajectories from incomplete and noisy data. IEEE/ACM IEEE Trans Control Syst Technol Biol Bioinforma. 2017; 16:193–207.
 11
Karbalayghareh A., BragaNeto U., Dougherty E. R.Intrinsically Bayesian robust classifier for singlecell gene expression trajectories in gene regulatory networks. BMC Syst Biol. 2018; 12(3):23.
 12
Dalton L. A., Dougherty E. R.Optimal classifiers with minimum expected error within a Bayesian framework—Part I: Discrete and Gaussian models. Pattern Recog. 2013; 46(5):1301–14.
 13
Dalton L. A., Dougherty E. R.Intrinsically optimal Bayesian robust filtering. IEEE Trans Sig Process. 2014; 62(3):657–70.
 14
Boluki S., Esfahani M. S., Qian X., Dougherty E. R.Incorporating biological prior knowledge for Bayesian learning via maximal knowledgedriven information priors. BMC Bioinformatics. 2017; 18(14):552.
 15
Boluki S., Esfahani M. S., Qian X., Dougherty E. R.Constructing pathwaybased priors within a gaussian mixture model for Bayesian regression and classification. IEEE/ACM Trans Comput Biol Bioinforma. 2017; 16:524–537.
 16
Imani M., BragaNeto U.Particle filters for partiallyobserved Boolean dynamical systems. Automatica. 2018; 87:238–50.
 17
Saadatpour A., Wang R. S., Liao A., Liu X., Loughran T. P., Albert I., Albert R.Dynamical and structural analysis of a T cell survival network identifies novel candidate therapeutic targets for large granular lymphocyte leukemia. PLoS Comput Biol. 2011; 7(11):1002267.
 18
Zhang R., Shah M. V., Yang J., Nyland S. B., Liu X., Yun J. K., Albert R., Loughran T. P.Network model of survival signaling in large granular lymphocyte leukemia. Proc Natl Acad Sci. 2008; 105(42):16308–13.
 19
Wu A. R., Neff N. F., Kalisky T., Dalerba P., Treutlein B., Rothenberg M. E., Mburu F. M., Mantalas G. L., Sim S., Clarke M. F., et al.Quantitative assessment of singlecell RNAsequencing methods. Nat Methods. 2014; 11(1):41.
 20
Ståhlberg A., Kubista M.The workflow of singlecell expression profiling using quantitative realtime PCR. Expert Rev Mol Diagn. 2014; 14(3):323–31.
 21
Dalton L. A., Dougherty E. R.Bayesian minimum meansquare error estimation for classification error—Part I: Definition and the Bayesian MMSE error estimator for discrete classification. IEEE Trans Sig Process. 2011; 59(1):115–29.
 22
Imani M., BragaNeto U.Multiple model adaptive controller for partiallyobserved Boolean dynamical systems. In: 2017 American Control Conference (ACC). IEEE: 2017. p. 1103–8. https://0doiorg.brum.beds.ac.uk/10.23919/ACC.2017.7963100.
 23
Kantas N., Doucet A., Singh S. S., Maciejowski J., Chopin N., et al.On particle methods for parameter estimation in statespace models. Stat Sci. 2015; 30(3):328–51.
 24
Hajiramezanali M., Amindavar H.Maneuvering target tracking based on SDE driven by GARCH volatility. In: 2012 IEEE Statistical Signal Processing Workshop (SSP). IEEE: 2012. p. 764–7. https://0doiorg.brum.beds.ac.uk/10.1109/SSP.2012.6319816.
 25
Imani M., Ghoreishi S. F., Allaire D., BragaNeto U.MFBOSSM: Multifidelity Bayesian optimization for fast inference in statespace models. In: AAAI: 2019.
 26
Hajiramezanali M., Amindavar H.Maneuvering target tracking based on combined stochastic differential equations and GARCH process. In: 2012 11th International Conference on Information Science, Signal Processing and their Applications (ISSPA). IEEE: 2012. p. 1293–7. https://0doiorg.brum.beds.ac.uk/10.1109/ISSPA.2012.6310492.
 27
Hajiramezanali M., Fouladi S. H., Ritcey J. A., Amindavar H.Stochastic differential equations for modeling of high maneuvering target tracking. ETRI J. 2013; 35(5):849–58.
 28
Pitt M. K., Shephard N.Filtering via simulation: Auxiliary particle filters. J American Stat Assoc. 1999; 94(446):590–9.
 29
Qian X., Ghaffari N., Ivanov I., Dougherty E. R.State reduction for network intervention in probabilistic Boolean networks. Bioinformatics. 2010; 26(24):3098–104.
Acknowledgment
We thank Texas A&M High Performance Research Computing and Texas Advanced Computing Center for providing computational resources to perform experiments in this work.
Funding
This research has been supported in part by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Mathematical Multifaceted Integrated Capability Centers (MMICCS) program, under award number DESC0019393, NSF Awards CCF1553281, CCF1718513 and CCF1718924, and a grant from the Brookhaven National Laboratory (BNL). Publication costs are funded by DESC0019393.
Availability of data and materials
Not applicable.
About this supplement
This article has been published as part of BMC Genomics Volume 20 Supplement 6, 2019: Selected original research articles from the Fifth International Workshop on Computational Network Biology: Modeling, Analysis and Control (CNBMAC 2018): Genomics. The full contents of the supplement are available online at https://0bmcgenomicsbiomedcentralcom.brum.beds.ac.uk/articles/supplements/volume20supplement6.
Author information
Affiliations
Contributions
E. H. developed OBC for the singlecell trajectory classification, developed the scalable OBC, performed the experiments, and wrote the manuscript. M. I. wrote part of the code and the first draft, and in conjunction with U. B. N. structured the APFBKF by integrating their previous partiallyobserved Boolean dynamical system into this new framework. E. R. D. and X.Q. oversaw the project, proposed the new scalable OBC, and wrote the manuscript. All authors have read and approved the final manuscript.
Corresponding author
Correspondence to Xiaoning Qian.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
This additional file contains the additional experiment results. (PDF 229 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Hajiramezanali, E., Imani, M., BragaNeto, U. et al. Scalable optimal Bayesian classification of singlecell trajectories under regulatory model uncertainty. BMC Genomics 20, 435 (2019) doi:10.1186/s1286401957203
Published:
DOI: https://0doiorg.brum.beds.ac.uk/10.1186/s1286401957203
Keywords
 Optimal Bayesian classification
 Singlecell trajectory classification
 Particle filter
 Probabilistic Boolean networks