 PROCEEDINGS
 Open Access
 Published:
NMFP: a nonnegative matrix factorization based preselection method to increase accuracy of identifying mRNA isoforms from RNAseq data
BMC Genomics volume 17, Article number: 11 (2016)
The Erratum to this article has been published in BMC Genomics 2016 17:296
Abstract
Background
The advent of nextgeneration RNA sequencing (RNAseq) has greatly advanced transcriptomic studies, including systemwide identification and quantification of mRNA isoforms under various biological conditions. A number of computational methods have been developed to systematically identify mRNA isoforms in a highthroughput manner from RNAseq data. However, a common drawback of these methods is that their identified mRNA isoforms contain a high percentage of false positives, especially for genes with complex splicing structures, e.g., many exons and exon junctions.
Results
We have developed a preselection method called “Nonnegative Matrix Factorization Preselection” (NMFP) which is designed to improve the accuracy of computational methods in identifying mRNA isoforms from RNAseq data. We demonstrated through simulation and real data studies that NMFP can effectively shrink the search space of isoform candidates and increase the accuracy of two mainstream computational methods, Cufflinks and SLIDE, in their identification of mRNA isoforms.
Conclusion
NMFP is a useful tool to preselect mRNA isoform candidates for downstream isoform discovery methods. It can greatly reduce the number of isoform candidates while maintaining a good coverage of unknown true isoforms. Adding NMFP as an upstream step, computational methods are expected to achieve better accuracy in identifying mRNA isoforms from RNAseq data.
Background
In molecular biology and transcriptomics, an important question is to understand the alternative splicing process, a key step of gene transcription in diverse species, from invertebrates to mammals. Alternative splicing generates at least two mRNA isoforms (i.e., transcripts) for one gene, and these mRNA isoforms as well as their protein products can play different roles in various biological phenomena. For example, studies have found that aberrant structures or abundance of mRNA isoforms can cause various human diseases [1, 2]. Hence, systematic identification and quantification of mRNA isoforms can greatly help understand gene regulation mechanisms for both basic biology and translational medicine. For decades, mRNA isoforms of individual genes have been accurately identified and quantified by lowthroughput technologies such as cDNA cloning and qPCR. However, it was impossible to conduct this task on a systemwide scale for thousands of genes simultaneously until the invention of highthroughput technologies.
More than a decade ago, microarray technologies established a highthroughput platform for identifying and quantifying mRNA isoforms of genes with known sequences. SPACE [3, 4] is a method using nonnegative matrix factorization (NMF) to predict mRNA isoforms and estimate their abundance from microarray data. NMF is a popular pattern recognition method and is distinguished from other matrix factorization methods by its use of nonnegativity constraints [5]. A main advantage of NMF is its interpretable factorization results, which are positive and sparse combinations of features. In the context of mRNA isoform discovery, SPACE uses NMF to find isoforms as combinations of exons and exon junctions. SPACE was demonstrated to have good power in detecting mRNA isoforms under ideal simulation settings. However, many open questions remain about how to use NMF to accurately find mRNA isoforms from real microarray data. Reasons for these questions include the following issues inherent with NMF and microarray. First, in many scenarios NMF outputs nonunique factorization results [6], making SPACE identify ambiguous mRNA isoforms. Second, NMF requires a prespecified matrix rank for factorization results, and how to determine this rank remains a difficult question [7–9]. Third, microarray can only detect exons and exon junctions with at least partially known sequences, and thus it cannot provide information on novel exons or exon junctions. This hinders SPACE from discovering novel isoforms that contain novel exons or exon junctions.
Compared to microarray, the more recent nextgeneration RNA sequencing (RNAseq) technologies yield deeper and wider insights into transcriptomes at base level resolution [10]. RNAseq measures the expression of genomewide exons and exon junctions, regardless of prior knowledge on their existence. Hence, unlike microarray, RNAseq enables the identification of novel mRNA isoforms containing previously unknown exons or exon junctions. Despite its many advantages, a major drawback of Illumina RNAseq, the most popular RNAseq platform, is that it can only generate RNAseq reads with short lengths (≤2×250 bp for pairedend reads). Such short RNAseq reads cannot cover a fulllength mRNA isoform, making isoform identification a difficult computational problem. Since the advent of RNAseq, various computational methods have been developed to discover mRNA isoforms from RNAseq data. For the species with goodquality reference genomes, including human and model organisms, mainstream isoform discovery methods (e.g., Cufflinks [11], Scripture [12], IsoLasso [13] and SLIDE [14]) use RNAseq reads first mapped to a reference genome. In this paper, our discussion focuses on these socalled “refernecebased methods”, which attempt to identify fulllength mRNA isoforms from mapped RNAseq reads using various approaches. For example, Cufflinks, Scripture and IsoLasso first build graphs of exonic regions containing mapped reads, and then find mRNA isoforms as the maximal paths in the graph based on a parsimony assumption. These three methods are also often referred to as “de novo methods,” because their isoform identification approach can completely rely on RNAseq reads without using existing annotations. Another referencebased method SLIDE uses a different approach. It constructs a search space of possible isoforms by using known boundaries of genes and exons from existing annotations, with the possible addition of novel genes and exons inferred from RNAseq data. SLIDE then identifies isoforms from the search space based on a regularized linear model, which accounts for the RNAseq read generating mechanism and read distributions. Because of its use of annotations, SLIDE is also called an “annotation aided method”, in contrast to the previous “de novo methods.” Note that the recent versions of Cufflinks also added an annotation aided option, which can simulate RNAseq reads from annotations and use the simulated reads together with RNAseq data for isoform discovery [15]. Both de novo and annotation aided methods have advantages and drawbacks. The graphbased de novo methods are more robust to RNAseq data noise thanks to their use of the parsimony assumption, which, however, also prevents them from finding overlapping isoforms. SLIDE, on the other hand, is capable of finding overlapping isoforms but is more sensitive to RNAseq data noise. Despite their differences, a common obstacle for both types of methods is the large search space of possible mRNA isoforms for genes with complex splicing structures (i.e., many exons). For the graphbased de novo methods, a gene with many exons will lead to a graph with large numbers of nodes and edges, making it difficult to search for isoforms as maximal paths. For SLIDE, a gene with n exons will give rise to a total of 2^{n}−1 possible isoforms, placing great difficulty on the regularized linear model to find the correct isoforms. Although SLIDE is coupled with a naïve preselection procedure to filter out the isoform candidates with unsupported exon junctions by RNAseq reads, this procedure is not ideal for lowly expressed isoforms, whose exon junctions may not be supported by reads due to the low read coverage.
Given the wide existence of RNAseq noise and the large search space of possible isoforms, existing computational methods for isoform discovery have unsatisfactory performance in many cases. In an evaluation conducted by the RGASP consortium [16], Cufflinks only achieved precision rates of 23 %, 50 % and 47 % and recall rates of 18 %, 40 % and 39 % for discovering fulllength mRNA isoforms from H. sapiens, D. melanogaster, and C. elegans RNAseq data respectively. For SLIDE, the corresponding precision rates were 37 %, 80 % and 63 %, and the corresponding recall rates were 7 %, 49 % and 59 %. This evaluation showed that both Cufflinks and SLIDE, as well as other evaluated methods, have low accuracy for discovering human isoforms, mainly because many human genes have large numbers of exons and thus a huge search space of possible isoforms. In statistical literature, for highdimensional linear models with far more features than observations, it was found that adding a screening procedure to filter out excessive features in the search space can improve the feature identification performance of Lasso [17]. Since Lasso was used in SLIDE to find isoforms as features, we are motivated to reduce the search space of possible isoforms by designing a screening (or preselection) procedure, so as to improve the performance of SLIDE and hopefully other isoform identification methods.
Also motivated by SPACE and the good properties of NMF, we propose a preselection method “Nonnegative Matrix Factorization Preselection” (NMFP) to improve the accuracy of referencebased isoform discovery methods. Our goal is to use NMFP to greatly reduce the search space of isoforms, and have the selected isoform candidates cover most of the true isoforms. NMFP employs NMF to select isoform candidates for a given gene by decomposing its RNAseq read count matrix. We demonstrate the performance of NMFP by simulation and real data studies. In the simulation, we generate RNAseq reads from D. melanogaster (fly) and M. musculus (mouse) using Flux Simulator [18] and show that NMFP can greatly improve the accuracy of Cufflinks and SLIDE in isoform discovery. We also show the efficacy of NMFP on discovering D. melanogaster isoforms from real RNAseq data. Moreover, we evaluate the robustness of NMFP to the choice of NMF ranks, the number of input samples in the read count matrix, and gene expression levels. In addition, we show that NMFP can increase the robustness of SLIDE to its regularization parameter choice.
Methods
Our proposed method NMFP shares the same root as SPACE, because they both use NMF to identify mRNA isoforms. However, unlike SPACE that aims to directly identify isoforms via NMF, NMFP has a less ambitious goal, which is to identify a set of good isoform candidates for downstream isoform discovery methods. To achieve this goal, in NMFP we use highfrequency filtering to alleviate the nonuniqueness issue of NMF results, and design a model selection approach to choose the NMF rank. Moreover, we develop a new NMF algorithm to penalize the isoforms that have conflicting exons or exon junctions.
Data processing
Suppose there are m RNAseq data sets (samples) and a gene has n subexons, which were defined as the exonic regions between two adjacent splicing sites in SLIDE [14]. For this gene, we first categorize RNAseq reads in every sample into bins, which are defined as twodimensional vectors (k, l), representing the mapped reads (singleend reads or two ends of pairedend reads) with starting position in the kth subexon and ending position in the lth subexon, 1≤k≤l≤n. Hence, this gene has a total number of \(p = n + {n \choose 2}\) bins. We count the numbers of reads in every bin in all samples and generate a read count matrix U with dimensions p×m, with rows corresponding to bins and columns corresponding to samples
Since samples may have different sequencing depths and bins may have different lengths (due to different lengths of exons), to prevent the NMF results from being dominated by samples with more reads or bins with greater lengths, we normalize U into a matrix V by removing the sequencing depth and bin length effects. We conduct the normalization based on the following model
where U _{ ij } is the (i,j)th entry of U, representing the read count of the ith bin in the jth sample, R _{ j } is the total number of reads mapped to this gene in the jth sample, and P _{ i } is the probability that a read falls into the ith bin (0≤P _{ i }≤1 and \(\sum _{i=1}^{p} P_{i}=1\)). We assume that P _{ i } is proportional to the length of the ith bin. Although this assumption can be violated by the nonuniformity of read distribution due to alternative splicing and sequencing noise, it provides an approximate and simple solution to removing bin length effects from the read count matrix.
We describe the normalization process as follows.

1.
Scale the read counts of each sample inverse proportionally with its total number of reads
$$U'_{ij}=U_{ij} \cdot \frac{\bar{R}}{R_{j}}~,~i=1,2,\ldots,p~;~j=1,2,\ldots,m~, $$where \(\bar {R}=\frac {1}{m}\sum _{j=1}^{m} R_{j}~\).

2.
Further scale the read counts of each bin inverse proportionally with its bin length
$$V_{ij}=U'_{ij} \cdot \frac{1}{p P_{i}}~,~i=1,2,\ldots,p~;~j=1,2,\ldots,m~, $$and denote V=(V _{ ij }), where V _{ ij } represents the normalized read count of the ith bin in the jth sample.
Matrix decomposition model
Li and Wong [19] proposed a model to describe the relationship between the probe intensity v and the transcript abundance t for microarray data
where a is the affinity of the probe, and e is an error term. SPACE revised the Model (2) into the following matrix model for p probes and s transcripts
where V is a p×m matrix representing the intensities of p probes in m samples; A is a p×p diagonal matrix representing the affinity of p probes; G is a p×s indicator matrix with binary elements indicating the existence of the p probes in s transcripts; T is an s×m matrix representing the abundance of s transcripts in m samples; ε is a p×m error matrix.
Inspired by the Models (2) and (3), we write the following model for the relationship between normalized reads counts and unknown transcript (i.e., mRNA isoform) abundance
where V is the p×m normalized reads count matrix (from last section “Data processing”), and W is a p×s isoform composition matrix, whose rows and columns are bins and mRNA isoforms respectively. Each column of W is a pdimensional vector with elements ∈ [ 0,1], indicating the existence of bins in an isoform, with the values subject to RNAseq noise and biases. Theoretically, all the existing bins in an isoform should have the same read density assuming that RNAseq reads follow a uniform distribution in that isoform. However, due to RNAseq noise and biases, the uniformity assumption is often violated, and nonexisting bins may contain erroneously mapped reads. To account for these issues, we model the existence of bins in an isoform as values between 0 and 1, where 1 indicates a bin having the maximal read intensity among all the bins, and 0 indicates a bin containing no mapped reads. H is a s×m isoform abundance matrix, whose rows and columns are mRNA isoforms and samples respectively. Each row of H is an mdimensional vector representing the isoform’s abundance levels in the m samples. Based on our definition of W, h _{ ij } (the abundance value of the ith isoform in the jth sample) should be equal to the maximum bin abundance of the ith isoform in the jth sample. ε is an error matrix.
We can further decompose W as
where A is a p×p diagonal matrix with elements ∈ [ 0,1], representing the biases of read counts in different bins. Such biases are attributable to the features of the bins, such as GC contents and the bins’ positions in transcripts. G is a p×s indicator matrix with binary values to describe the existence of bins in isoforms. Each column of G represents an isoform.
Modified NMF approach
With the normalized read count matrix V, our goal is to estimate G, whose columns will be regarded as isoform candidates. To achieve this goal, we have to first estimate W and A. We adopt the SPACE approach to estimate G. In the SPACE method [4], given an estimate \(\hat {\textbf {W}}=(\hat {W}_{\textit {ik}})\), A was estimated as
In other words, \(\hat {A}_{\textit {ii}}\), the bias factor of read counts in the ith bin, is considered as \(\max _{k} (\hat {W}_{\textit {ik}})\), the largest “adjusted existence” of the ith bin in all isoforms.
With \(\hat {\mathbf A}\) and \(\hat {\mathbf W}\), G is estimated as
where I(·) is an indicator function, and c∈IR is a threshold to make \(\hat {\mathbf {G}}\) a binary matrix. We used c=0.4 in this paper.
Prior to that, we estimate W by decomposing V using NMF. There exist many available NMF algorithms [5, 20–22], which have different preferences over the factorization results. Since our goal is to find isoform candidates by decomposing V, one constraint we need is the compatibility of bins in every column of \(\hat {\textbf {W}}\), i.e., every isoform found by NMF. For example, Bin (1,3) and Bin (2,2) are conflicting and should not coexist in an isoform, because the former indicates that Subexon 2 is skipped, while the latter implies the existence of Subexon 2.
To realize this constraint in NMF, we add a penalty term to the standard NMF objective function as follows
where p is the number of bins, m is the number of samples, and α is a regularization parameter controlling the weight of the penalty term (e.g., α=0.1). i⊥j denotes that Bin i and Bin j are conflicting (See Section 1 in the Additional file 1 for more detail). We estimate W by minimizing the objective function D(V, WH) subject to a prespecified s=rank(W), which is equivalent to the number of columns of W. How to determine the rank remains an open question. In NMFP, we choose the rank by gap statistic, which was originally designed to select the number of clusters in kmeans clustering [23] (See Section 2 in the Additional file 1 for more detail). After obtaining \(\hat {\textbf {W}}\), we use Eqs. (6) and (7) to find \(\hat {\textbf {G}}\), which indicates a set of isoform candidates.
The penalty term in the objective function (8) cannot fully prevent the coexistence of conflicting bins in every isoform (i.e, column in \(\hat {\textbf {G}}\)). To resolve this issue and recover potentially true isoforms, we consider both the existing and skipped statuses of the ambiguous subexons that lead to conflicting bins. For example, if both Bin (1,3) and Bin (2,2) exist, Subexon 2 is ambiguous. We will consider the cases of Subexon 2 being existing and skipped as two different isoforms. If there are n ambiguous subexons in an isoform candidate, 2^{n} isforms will be considered, but only those whose subexons and subexon junctions are supported by RNAseq reads will be kept as isoform candidates.
Aggregation of multiple NMF runs
To address the issue of nonunique NMF results, we use a highfrequency filtering approach to find isoform candidates identified in a number of NMF runs. Specifically, we implement the NMF estimation procedure described in the previous two subsections for N times (e.g. N=100) and aggregated the identified isoform candidates into a candidate pool. In the filtering step, we first remove the isoform candidates, which contain subexon junctions not supported by reads, from the pool. Then among the rest of candidates in the pool, we only retain the highfrequency ones, which are selected in at least r NMF runs (e.g., r=20), as final isoform candidates.
The NMFP method is illustrated in Fig. 1.
Results
We conduct simulations and realdata analyses in many scenarios to demonstrate the effectiveness of NMFP in improving the performance of downstream isoform discovery methods. The following results show that NMFP can help Cufflinks and SLIDE achieve better isoform discovery accuracy. We also study the robustness of NMFP to the choice of NMF rank, the number of input samples in the read count matrix, and the gene expression levels.
Simulation results in D. melanogaster
We use Flux Simulator [18] to simulate RNAseq reads of D. melanogaster (with reference genome dm6 and Ensembl annotation BDGP6 of release 80). We focus on 4125 genes on chr3R, in accordance with the SLIDE paper [14], which showed good performance of SLIDE on genes with 3 to 10 subexons. We simulate 50 samples, each with 10,000,000 RNA molecules and 50,000,000 pairedend reads with length 2×76 bp, from these genes’ isoforms in the annotation. The isoform abundance is randomly assigned by Flux Simulator. Based on the way we define subexons (Additional file 1: Figure S1), 51.7 % (2132) genes contain 3 to 10 subexons (Table 1). Among these genes, 44.6 % (951) have more than one isoforms in the annotation. We apply NMFP to these 951 genes.
To check the effects of NMFP on improving the isoform discovery performance of SLIDE, we define a new procedure called “NMFP + SLIDE”, which combines NMFP and SLIDE by inputting the isoform candidates found by NMFP into SLIDE. In the original SLIDE method, 2^{n}−1 possible isoforms are enumerated for an nsubexon gene, and SLIDE uses a simple preselection to filter out the isoforms with subexons or subexon junctions not supported by reads. NMFP + SLIDE replaces this preselection step by NMFP.
Similarly, we define “NMFP + Cufflinks”, which inputs the isoform candidates found by NMFP as a GTF file into Cufflinks and uses Cufflinks with the “–GTF” option to estimate the abundance of these candidates. The candidates with nonzero estimated abundance are kept as the isoforms identified by NMFP + Cufflinks. This is different from the original Cufflinks, because it does not rely on a connectivity graph or the parsimony assumption as Cufflinks does.
To compare SLIDE with NMFP + SLIDE, and Cufflinks with NMFP + Cufflinks, we evaluate their isoform discovery performance at three different levels: the nucleotide, exon and transcript levels, similar to the RGASP evaluation [16]. In this comparison, we also evaluate the performance of NMFP alone, because it is important to check whether the isoform candidates found by NMFP have a good coverage of the true isoforms. Since SLIDE has two default values for its regularization parameter λ in the Lasso regression (λ=0.2 for fewer isoforms and λ=0.01 for more isoforms), we include two versions of SLIDE in this comparison: “SLIDE(fewer)” for λ=0.2 and “SLIDE(more)” for λ=0.01. This results in a comparison of seven methods in total:

Cufflinks

NMFP + Cufflinks

SLIDE(fewer)

NMFP + SLIDE(fewer)

SLIDE(more)

NMFP + SLIDE(more)

NMFP
Evaluation at nucleotide level
We first compare the seven methods at the nucleotide level, that is, how many nucleotides are shared by the identified isoforms of each method and the true isoforms in the annotation. For each identified isoform, we match it to the annotated isoform that shares the most nucleotides with it. We rank the identified isoforms by their percentages of overlapping nucleotides with their matched annotated isoforms, from the highest percentage to the lowest. If there are k identified isoforms and r annotated isoforms and k>r, only the top r identified isoforms would be paired with an annotated isoform, and the rest k−r identified isoforms would be counted as false positives with zero precision rates. For each identified isoform paired with an annotated isoform, we calculate its precision rate as \(\frac {\#\text {overlapping nucleotides with the paired annotated isoform}}{\# \text {nucleotides in the identified isoform}}\). The precision rate of a gene was defined as the average of the individual precision rates of its identified isoforms.
Similarly, we match each annotated isoform with an identified isoform. We calculate the recall rate of each annotated isoform with a paired identified isoform as \(\frac {\#\, \text {overlapping nucleotides with the paired identified isoform}}{\#\, \text {nucleotides in the annotated isoform}}\). If there are more annotated isoforms than identified ones, the unpaired annotated isoforms would have zero recall rates. For a given gene, its recall rate is defined as the average of the individual recall rates of its annotated isoforms. Given the precision and recall rates of a gene, we calculate the F score as the their harmonic mean. That is, F score \(= \frac {2}{1/\text {precision}+1/\text {recall}}\).
We calculate the precision and recall rates of every method in each of the 50 samples by taking the average of 951 genes. In the results, all methods except NMFP have precision rates above 0.76 in all samples. Although NMFP has low precision rates, its recall rates are as high as 0.96, implying that its identified isoform candidates has a good coverage of annotated isoforms (Fig. 2). NMFP + Cufflinks has lower precision rates but much higher recall rates than Cufflinks, because the latter identifies fewer isoforms given its parsimony assumption. SLIDE has opposite results, which show that NMFP + SLIDE have higher precision rates but lower recall rates than SLIDE, for both SLIDE(fewer) and SLIDE(more). The reason is that the original versions of SLIDE have a larger search space than NMFP + SLIDE, and thus they are likely to find more isoforms and thus cover more nucleotides given the same Lasso regularization parameter.
In terms of F scores, which combine precision and recall rates, NMFP is shown to have greatly improved the performance of Cufflinks, by increasing the F scores of approximately 0.6 for Cufflinks to more than 0.8 for NMFP + Cufflinks. Strikingly, with the addition of NMFP, Cufflinks outperforms SLIDE, which originally had better F scores than Cufflinks without NMFP [14]. NMFP + Cufflinks becomes the top performer in F scores. On the other hand, NMFP does not much improve the performance SLIDE(more) and even slightly deproves the performance SLIDE(fewer) at the nucleotide level. The reason is that the original SLIDE has reasonably good performance at the nucleotide level, though its identified isoforms are often similar to but not exactly the annotated isoforms (see the subsection “Evaluation at transcript level”).
Evaluation at exon level
We next evaluate the performance of these seven methods at the exon level. If an exon in an identified isoform overlaps at least 50 % of an annotated exon, we call the two exons “overlapping”. Similar to our matching scheme at the nucleotide level, we match identified isoforms with annotated isoforms at the exon level. Then for each identified isoform, its precision rate at the exon level is defined as \(\frac {\#\,\text {overlapping exons with the paired annotated isoform}}{\#\,\text {exons in the identified isoform}}\). Similarly for each annotated isoform, its recall rate is defined as \(\frac {\#\,\text {overlapping exons with the paired identified isoform}}{\#\,\text {exons in the annotated isoform}}\). Then for every gene, its precision and recall rates are calculated as the average precision rate of its identified isoforms and the average recall rate of its annotated isoforms. Its F score is the harmonic mean of its precision and recall rates. For every method in every sample, the precision, recall and F score are calculated by taking the average of 951 genes.
Similar to the performance at nucleotide level, NMFP significantly improves the F scores of Cufflinks from approximately 0.6 to over 0.8, yet it does not improve the performance SLIDE (Fig. 3). The reason is that the original SLIDE has reasonably good performance at the exon level, though its identified isoforms are often similar to but not exactly the annotated isoforms (see the subsection “Evaluation at transcript level”).
Evaluation at transcript level
Compared to the previous evaluation at nucleotide and exon levels, the evaluation at the transcript level is more important, as the ultimate goal of isoform discovery methods is to correctly identify fulllength mRNA isoforms. At the transcript level, if an identified isoform and an annotated isoform have every exon matched by the criterion we defined for matching at the exon level, we called these two isoforms “matched”. Then for every gene, its precision and recall rates at the transcript level are defined as \(\frac {\#\,\text {identified isoforms matched with annotated isoforms}}{\#\,\text {identified isoform}}\) and \(\frac {\#\,\text {annotated isoforms matched with identified isoforms}}{\#\,\text {annotated isoform}}\) respectively. Its Fscore is the harmonic mean of its precision and recall rates. For every method in every sample, the precision, recall and F score are calculated by taking the average of 951 genes.
In the results, NMFP has high recall rates greater than 0.85, indicating that the isoform candidates found by NMFP have a good coverage of annotated transcripts (Fig. 4). Meanwhile, NMFP has precision rates above 0.4, implying that the number of isoform candidates is less than 2.5 times the number of annotated isoform. This demonstrates the effectiveness of NMFP in reducing the search space of possible isoforms. In terms of F scores at the transcript level, NMFP greatly improves the performance of both Cufflinks and SLIDE. Specifically, NMFP + Cufflinks has F scores above 0.6, while Cufflinks has F scores only below 0.5. The reason is that NMFP greatly improves the recall rates of Cufflinks. NMFP also increases the F scores of SLIDE(more) from 0.5 to approximately 0.6, because NMFP helps SLIDE(more) achieve better precision rates. These results illustrate the effectiveness of NMFP in reducing the search space for SLIDE (thus increasing the precision of SLIDE) and in removing the parsimony assumption for Cufflinks (thus increasing the recall of Cufflinks). To explain why NMFP can improve SLIDE’s performance at the transcript level but not at the nucleotide or exon level, we find the reason as the unidentifiability issue of SLIDE with a large search space of possible isoforms. Due to the short RNAseq read lengths, there may exist multiple sets of isoforms (as combinations of exons) that can well explain the observed RNAseq reads. Hence with a large search space of isoforms, SLIDE are likely to miss the true isoforms, but it still can detect nucleotides and exons reasonably well. Since NMFP can shrink the search space to have a higher concentration of true isoforms (i.e., annotated isoforms in the simulation), SLIDE would become more likely to identify the true isoforms and thus achieve better isoform discovery accuracy at the transcript level.
Robustness of NMFP to the choice of NMF rank
How to determine the matrix rank of factorization results is a difficult and open question in NMF [3, 4, 7–9]. In NMFP, the NMF rank determines the number of isoform candidates to be found after each NMF run. Our aggregation of multiple NMF runs and the subsequent highfrequency filtering can increase the robustness of NMFP to the choices of NMF rank. To illustrate this point, we use D. melanogaster gene FBgn0039955 as an example and apply NMFP with different NMF ranks to the simulated RNAseq data. This gene has 4 annotated isoforms (in Ensembl BDGP6 of release 80) and 10 subexons. In the result (Fig. 5), NMFP has a good recall rate of 0.9 for ranks ranging from 3 to 9. The precision rate and F score are also stable in this range. This result and other examples in the Additional file 1 show the robustness of NMFP to the choices of NMF rank. In practice, NMFP selects the rank based on gap statistic (See Section 2 in the Additional file 1 for more detail). Nevertheless, we note that the NMF rank cannot be too much smaller or larger than the number of true isoforms. If too small, the rank will lead to too few isoform candidates to cover the true isoforms. If too large, the identified isoform candidates will be mostly individual bins but not fulllength isoforms.
Simulation results in M. musculus
We also evaluate the performance of NMFP in M. musculus (mouse) (reference genome mm10 and annotation GRCm38 of release 81), a mammalian model organism with more complex gene structures than D. melanogaster. We generate 100 RNAseq samples from mouse chr1 using Flux Simulator, with pairedend reads of length 2×76 bp. The 100 samples have 10 different numbers of RNA molecules and 10 different numbers of reads. The numbers of RNA molecules range from 4,200,000 to 6,000,000, increasing in steps of 200,000. The number of reads range from 11,000,000 to 20,000,000, increasing in steps of 1,000,000. More detail on the 100 samples is summarized in Section 3.2 and Table S1 of the Additional file 1. On chr1, there are 3432 genes, among which we are interested in 852 genes with 310 subexons and 217 annotated isoforms (more detail in Additional file 1: Table S2). In this simulation, we calculate precision and recall rates and F scores at the transcript level for every method, same as what we did in the simulation in D. melanogaster.
Increased robustness of NMFP + SLIDE to the choices of parameter λ
By shrinking the searching space of possible isoforms, NMFP can effectively increase the robustness of SLIDE to its regularization parameter λ in the Lasso estimation. We conduct case studies on three mouse genes, for which we apply NMFP + SLIDE and SLIDE with 150 λ values, ranging from 0.001 to 0.15 by 0.001. The results show that NMFP + SLIDE has better isoform discovery accuracy than SLIDE for smaller λ values (Fig. 6 and Additional file 1: Figure S3). This finding is expected because a smaller search space does not need a large regularization parameter in the Lasso estimation. It indicates that NMFP can reduce the difficulty of choosing λ in the use of SLIDE, as users can simply choose a small λ value to obtain good isoform discovery results.
Robustness of NMFP to the number of samples
In practice, the number of RNAseq data sets under the same biological condition can be quite small, though this issue has become less of a concern given the rapid accumulation of RNAseq data sets in public repositories. To make NMFP a practically useful tool, it is important to evaluate its robustness to m, the number of input samples in the read count matrix V. We apply NMFP to different numbers of simulated mouse samples, from 20 to 100 samples, following the sample order with increasing sequencing depths (for the sample order, please see Section 3.2 in the Additional file 1). As shown in Fig. 7, the NMFP results are robust to the number of samples and differences in sequencing depths.
Simulation study on the effects of expression levels
We use another simulation study to demonstrate that NMFP is capable of increasing the isoform discovery accuracy for a lowly expressed gene in a sample by leveraging other samples (Fig. 8). In this simulation, we construct a gene with 8 exons and 3 true isoforms (10111111), (10011111) and (01111111), which indicate the skipping of Exon 2, Exons 2 and 3, and Exon 1 respectively. The exon lengths are randomly sampled from {100,150,200,…,500}. We simulate 10 samples. In each sample, the proportions of the three isoforms are randomly simulated from U [ 0,1], and the three proportions are normalized to sum up to 1. Given an isoform, singleend reads with length 76 bp are assumed to follow a uniform distribution on this isoform. We fix the gene expression level in Sample 1 as 3 RPKM (Reads Per Kilobase of transcript per Million mapped reads) and vary the gene expression levels in the other nine samples from 0.5, 1, 2 to 10 RPKM in 11 different settings. In each setting, we calculate a read count matrix U from the gene expression levels, the isoform proportions, and the uniformity assumption. Then we add white Gaussian noise N(0,1) to U, and normalize the resulting matrix as V, for which we apply NMFP. Figure 8 depicts the relationship between the gene expression level in Samples 2–10 and the minimum number of isoform candidates found by NMFP to cover the three true isoforms. When the gene expression level is below 3 RPKM, the number of isoform candidates needed to cover the true isoforms is more than 60 with a large variance.
Once the gene expression level goes above 3 RPKM, the number of needed candidates decreases to approximately 30 and has a reduced variance. This result demonstrates that for a gene with low expression in a sample, NMFP can take advantage of other samples to help reduce the number of isoform candidates of this gene. This property makes it possible for downstream isoform discovery methods to explore isoform structures of lowly expressed genes in a sample given the availability of other samples where the genes are more highly expressed.
Real data case study
We further demonstrate the performance of NMFP and compare the seven methods on 74 real RNAseq data sets of D. melanogaster (reference genome dm5 and annotation BDGP5 of release 66). Description of the 74 data sets is in Section 5 of the Additional file 1. Since we have no knowledge on the true isoforms existing in real RNAseq data sets, we interpret the results by studying a few genes in Integrative Genomics Viewer (IGV) [24, 25], where we plot the read distribution in an RNAseq data set (D. melanogaster pupae 3 days) and the isoforms identified by the seven methods.
For D. melanogaster gene FBgn0037643, which has 11 subexons and 6 annotated isoforms, NMFP finds 20 isoform candidates, which contain 4 annotated isoforms and miss the other two by a very short subexon (of only 3 bp). Cufflinks only identifies one annotated isoform, which cannot explain the observed read distribution in the last subexon. In contrast, NMFP + Cufflinks successfully identifies four annotated isoforms, which successfully capture the last subexon. NMFP also helps SLIDE discover fulllength annotated isoforms instead of isofrom fragments (Fig. 9). More real data case studies are in Section 3.4 of the Additional file 1.
Conclusion and discussions
NMF is inherently capable of providing interpretable results for the problem of mRNA isoform discovery. From an input matrix of RNAseq read counts, NMF can find a sparse structure to indicate which bins are likely to form one isoform. However, NMF alone is not competent for recovering isoforms from RNAseq data because of several reasons: (i) inaccurate read coverages and missing junction reads due to RNAseq biases and mapping difficulties, (ii) nonuniqueness of NMF solutions leading to ambiguous isoform discovered, (iii) conflicting bins coexisting in discovered isoforms making the results biologically invalid, and (iv) difficulty in determining the NMF rank. Those issues can be largely alleviated or resolved when NMF is used as a method to preselect isoform candidates rather than determining final isoforms. For (i), (ii) and (iv), since each nonunique NMF solution provides isoform candidates that may include some true isoforms among many false positives, aggregating multiple NMF solutions to find the highfrequency isoforms will increase the chance of discovering the true isoforms. We demonstrate in the simulation study of M. musculus, that the NMFP results are not sensitive to the choice of NMF ranks. We address (iii) by adding a penalty term to NMF and decomposing the invalid isoform candidates with conflicting bins into biologically valid isoforms.
We propose NMFP as a preselection method for improving the accuracy of downstream isoform discovery methods. NMFP takes input from annotated genes and exon boundaries (e.g. UCSC and Ensemble annotations) and/or de novo assemblies (e.g. Cufflinks output). It outputs a small pool of isoform candidates, which are expected to well cover true isoforms and can form a reduced search space for downstream methods. From a general perspective, combination of methods with opposing advantages and drawbacks may improve over individual methods. NMFP and SLIDE are both built on the idea of selecting isoforms from all possibilities, while Cufflinks assembles transcripts directly via an overlap graph. A possible reason for NMFP + Cufflinks to become the top performer is that it combines the strengths of two different approaches: selection and assembly. This may also explain why NMFP + SLIDE, which combines two selection based methods, is outperformed by NMFP + Cufflinks, though SLIDE alone outperforms Cufflinks without NMFP. However compared to SLIDE itself, NMFP + SLIDE still has improved performance at the transcript level, and even becomes more robust to the regularization parameter choices. Overall, we demonstrate by simulation and real data studies that NMFP is a useful tool for improving isoform discovery accuracy and robustness in many aspects.
Nevertheless, NMFP still encounters several problems and there remains substantial room for improvement. The noise and bias generated during RNAseq still has large influence on NMFP results. Some methods for denosing and debiasing may help the performance of NMFP [26, 27]. In addition, NMFP requires enough samples to implement NMF, for which normalization is important but difficult for samples generated by different platforms or from different sources. Normalization methods for RNAseq data (e.g., [28]) may improve the performance of NMFP. Moreover, the current NMFP algorithm has high computational complexity, and optimizing the algorithm is one of our top priority. We also intend to build a database of isoform candidates found by NMFP for different species, so that researchers can conveniently use them for downstream analysis without aggregating multiple samples and going through the whole NMFP pipeline.
Abbreviations
 NMF:

nonnegative matrix factorization
 NMFP:

nonnegative matrix factorization preselection
 mRNA:

messenger RNA
 cDNA:

complementary DNA
 qPCR:

quantitative polymerase chain reaction
References
 1
Faustino NA, Cooper TA. PremRNA splicing and human disease. Genes Dev. 2003; 17(4):419–37.
 2
Wang GS, Cooper TA. Splicing in disease: disruption of the splicing code and the decoding machinery. Nat Rev Genet. 2007; 8(10):749–61.
 3
Anton MA, Gorostiaga D, Guruceaga E, Segura V, Carmonasaez P, Pascualmontano A. SPACE : an algorithm to predict and quantify alternatively spliced isoforms using microarrays. Genome Biol. 2008; 9(2):46. doi:10.1186/gb200892r46.
 4
Anton Ma, Aramburu A, Rubio A. Improvements to previous algorithms to predict gene structure and isoform concentrations using Affymetrix Exon arrays. BMC Bioinf. 2010; 11(1):578. doi:10.1186/1471210511578.
 5
Lee DD, Seung HS. Learning the parts of objects by nonnegative matrix factorization. Nature. 1999; 401(6755):788–91. doi:10.1038/44565.
 6
Donoho D, Stodden V. When Does NonNegative Matrix Factorization Give a Correct Decomposition into Parts? In: Advances in Neural Information Processing Systems: 2003.
 7
Brunet JP, Tamayo P, Golub TR, Mesirov JP. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA. 2004; 101(12):4164–9. doi:10.1073/pnas.0308531101.
 8
Pascualmontano A, Carmonasaez P, Chagoyen M, Tirado F, Carazo JM, Pascualmarqui RD. bioNMF: a versatile tool for nonnegative matrix factorization in biology. BMC Bioinf. 2006; 9:1–9. doi:10.1186/147121057366.
 9
Fogel P, Young SS, Hawkins DM, Ledirac N, Hall F, Ne CS, et al.Inferential, robust nonnegative matrix factorization analysis of microarray data. Bioinformatics (Oxford, England). 2007; 23(1):44–9. doi:10.1093/bioinformatics/btl550.
 10
Wang Z, Gerstein M, Snyder M. RNAseq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009; 10(1):57–63.
 11
Trapnell C, Williams Ba, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al.Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010; 28(5):511–5. doi:10.1038/nbt.1621.171.
 12
Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, et al.Ab initio reconstruction of cell type – specific transcriptomes in mouse reveals the conserved multiexonic structure of lincRNAs. Nat Biotechnol. 2010; 28(5):503–10. doi:10.1038/nbt.1633.
 13
Li W, Feng J, Jiang T. IsoLasso: a LASSO regression approach to RNASeq based transcriptome assembly (Extended abstract). Lect Notes Comput Sci (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 2011; 6577 LNBI:168–88. doi:10.1007/9783642200366_18.
 14
Jessica J, Jiang Cr, Brown JB, Huang H, Bickel PJ. Sparse linear modeling of nextgeneration mRNA sequencing (RNASeq) data for isoform discovery and abundance estimation. PNAS. 2011. doi:10.1073/pnas.1113972108.
 15
Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using rnaseq. Bioinformatics. 2011; 27(17):2325–29.
 16
Steijger T, Abril JF, Engström PG, Kokocinski F, The RGASP Consortium, Hubbard TJ, et al.Assessment of transcript reconstruction methods for RNAseq. Nat Methods. 2013; 10(12):1177–84. doi:10.1038/nmeth.2714.
 17
Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B (Stat Methodol). 2008; 70(5):849–911.
 18
Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigó R, et al.Modelling and simulating generic RNASeq experiments with the flux simulator. Nucleic Acids Res. 2012; 40(20):10073–83. doi:10.1093/nar/gks666.
 19
Li C, Wong WH. Modelbased analysis of oligonucleotide arrays: expression index computation and outlier detection. PNAS. 2001; 98:31–6. doi:10.1073/pnas.98.1.31.
 20
Li SZ, Hou X, Zhang H, Cheng Q. Learning Spatially Localized, PartsBased Representation. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference, vol. 1. IEEE: 2001. p. 207–212. http://0ieeexplore.ieee.org.brum.beds.ac.uk/xpl/login.jsp?tp=&arnumber=990477&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D990477.
 21
Feng T, Li SZ, Shum Hy, Zhang H. Local NonNegative Matrix Factorization as a Visual Representation. In: Proceedings of the 2nd International Conference on Development and Learning(ICDL’02): 2002. p. 178–183. http://0ieeexplore.ieee.org.brum.beds.ac.uk/xpl/login.jsp?tp=&arnumber=1011835&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1011835.
 22
Pascualmontano A, Carazo JM, Member S, Kochi K, Lehmann D, Pascualmarqui RD, et al.Nonsmooth Nonnegative Matrix Factorization (ns NMF). IEEE Trans Pattern Anal Mach Intell. 2006; 28(3):1–13.
 23
Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistics. J R Stat Soc Ser B (Stat Methodol). 2001; 63(2):411–23.
 24
Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al.Integrative genomics viewer. Nat Biotechnol. 2011; 29(1):24–6.
 25
Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative genomics viewer (igv): highperformance genomics data visualization and exploration. Briefings in bioinformatics. 2013; 14:178–192. http://0bib.oxfordjournals.org.brum.beds.ac.uk/content/14/2/178.long.
 26
Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010; 38(12):1–7. doi:10.1093/nar/gkq224.
 27
Li J, Jiang H, Wong WH. Modeling nonuniformity in shortread rates in RNASeq data. Genome Biol. 2010; 11(5):50. doi:10.1186/gb2010115r50.
 28
Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNAseq data using factor analysis of control genes or samples. Nat Biotechnol. 2014; 32(9):1–10. doi: 10.1038/nbt.2931.
Acknowledgements
This work was supported by the startup fund of Department of Statistics at University of California, Los Angeles, and the Hellman Fellowship from the Hellman Foundation. The publication costs for this article were funded by the Hellman Foundation. We thank YuCheng T. Yang in School of Life Sciences, Tsinghua University and Zhiyi Qin and Dr. Xuegong Zhang in Department of Automation, Tsinghua University for their comments and help on this work. We also thank the three anonymous reviewers for their insightful comments that helped us improve our manuscript.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
YY and JJL conducted simulation, processed real data„ and performed statistical analysis together. YY drafted the manuscript. JJL supervised this study. Both authors read and approved the final manuscript.
Declarations
This article has been published as part of BMC Genomics Volume 17 Supplement 1, 2016: Selected articles from the Fourteenth Asia Pacific Bioinformatics Conference (APBC 2016): Genomics. The full contents of the supplement are available online at http://0www.biomedcentral.com.brum.beds.ac.uk/bmcgenomics/supplements/17/S1.
Source codes
The NMFP source codes and examples are available at http://www.stat.ucla.edu/~jingyi.li/packages/NMFP.zip.
From The Fourteenth Asia Pacific Bioinformatics Conference(APBC 2016) San Francisco, CA, USA. 11  13 January 2016
Additional file
Additional file 1
More detail on the methods and results; more simulation and real data results. (PDF 606 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Ye, Y., Li, J.J. NMFP: a nonnegative matrix factorization based preselection method to increase accuracy of identifying mRNA isoforms from RNAseq data. BMC Genomics 17, 11 (2016) doi:10.1186/s1286401523048
Published
DOI
Keywords
 mRNA isoform discovery
 Nextgeneration RNA sequencing (RNAseq)
 Nonnegative matrix factorization (NMF)
 Cufflinks
 SLIDE