  • Methodology article
  • Open access

A Support Vector Machine based method to distinguish long non-coding RNAs from protein coding transcripts

Abstract

Background

In recent years, a rapidly increasing number of RNA transcripts has been generated by thousands of sequencing projects around the world, creating enormous volumes of transcript data to be analyzed. An important problem to be addressed when analyzing this data is distinguishing between long non-coding RNAs (lncRNAs) and protein coding transcripts (PCTs). Thus, we present a Support Vector Machine (SVM) based method to distinguish lncRNAs from PCTs, using features based on the frequencies of nucleotide patterns and on ORF lengths in transcripts.

Methods

The proposed method is based on SVM and uses the first ORF relative length and frequencies of nucleotide patterns selected by PCA as features. FASTA files were used as input to calculate all possible features. These features were divided into two sets: (i) 336 frequencies of nucleotide patterns; and (ii) 4 features derived from ORFs. PCA was applied to the first set to identify 6 groups of frequencies that could most contribute to the distinction. Twenty-four experiments, combining the 6 groups from the first set with the features from the second set, were built to find the best model to distinguish lncRNAs from PCTs.

Results

This method was trained and tested with human (Homo sapiens), mouse (Mus musculus) and zebrafish (Danio rerio) data, achieving 98.21%, 98.03% and 96.09% accuracy, respectively. Our method was compared to other tools available in the literature (CPAT, CPC, iSeeRNA, lncRNApred, lncRScan-SVM and FEELnc), and showed an improvement in accuracy of ≈3.00%. In addition, to validate our model, the mouse data was classified with the human model, and vice-versa, achieving ≈97.80% accuracy in both cases, showing that the model is not overfit. The SVM models were validated with data from rat (Rattus norvegicus), pig (Sus scrofa) and fruit fly (Drosophila melanogaster), and obtained more than 84.00% accuracy in all these organisms. Our results also showed that 81.2% of human pseudogenes and 91.7% of mouse pseudogenes were classified as non-coding. Moreover, our method was capable of re-annotating two uncharacterized sequences of the Swiss-Prot database as having a high probability of being lncRNAs. Finally, in order to use the method to annotate transcripts derived from RNA-seq, previously identified lncRNAs of human, gorilla (Gorilla gorilla) and rhesus macaque (Macaca mulatta) were analyzed, and 98.62%, 80.8% and 91.9% of them, respectively, were successfully classified.

Conclusions

The SVM method proposed in this work presents high performance in distinguishing lncRNAs from PCTs, as shown in the results. To build the model, besides using features known in the literature regarding ORFs, we used PCA to identify the features among nucleotide pattern frequencies that contribute the most to distinguishing lncRNAs from PCTs in reference data sets. Interestingly, models created with two evolutionarily distant species could distinguish lncRNAs of even more distant species.

Background

In recent years, thousands of sequencing projects around the world have been creating enormous volumes of RNA data, which has led to the discovery and description of a rapidly increasing number of non-coding RNAs (ncRNAs) in eukaryotic genomes [1–4]. NcRNAs are a highly heterogeneous group, ranging in length from about 20 bases in microRNAs and siRNAs [5] to “macroRNAs” spanning hundreds of kilobases [6, 7], known as long non-coding RNAs (lncRNAs). While the majority of ncRNAs seem to be spliced and processed similarly to coding mRNAs, there is also a large body of unspliced transcripts [8, 9] and a vast number of small processing products [10]. The functions of ncRNAs are analogously diverse. In fact, they appear to be involved in virtually all the regulatory processes in the cell.

Although they are often pragmatically defined as transcripts of more than 200 nucleotides in length, and without any apparent coding capacity, lncRNAs are still rather poorly understood [11–13]. Nevertheless, some classes, such as chromatin-associated long intergenic ncRNAs (lincRNAs) [14], as well as subgroups that are directly involved in transcriptional and post-transcriptional regulation [15–17], have been identified in high throughput analyses. An extensive literature links lncRNAs with a wide array of diseases [18–21], although the molecular mechanisms underlying lncRNA action are still largely unknown.

Distinguishing between protein coding transcripts (PCTs) and long non-coding transcripts (lncRNAs) is a surprisingly difficult task in practice, and there is still an ongoing controversy whether some or even the majority of the transcripts currently classified as “non-coding” can in fact be translated.

From a computational point of view, distinguishing PCTs from lncRNAs is a paradigmatic machine learning task, and several tools have become available for this purpose. Among these tools, CPC (Coding Potential Calculator) [22] and CPAT [23] have been developed to discriminate PCTs from ncRNAs. While CPC works well with known PCTs, it may tend to classify novel PCTs as ncRNAs, if they have not been recorded in protein databases [22]. The CPAT tool is based on logistic regression, and it uses four features based on ORFs.

Tools such as LncRNApred [24], lncRScan-SVM [25] and DeepLNC [26] can predict lncRNAs. iSeeRNA [27] was specially designed to predict lincRNAs. LncRScan-SVM and iSeeRNA are methods based on Support Vector Machines (SVM), trained with data from humans and mice, both presenting very good results. These two methods use GTF files as input, along with conservation data and some nucleotide patterns extracted from the sequences, to predict lncRNAs. LncRNApred is a method that was constructed using Random Forest and features extracted from the sequence nucleotides. DeepLNC was built using deep neural networks, and reported high accuracy in predicting lncRNAs. Unfortunately, it is not clear which features were used, and the DeepLNC site raises an exception when any FASTA file is submitted.

Recently, Wucher et al. [28] proposed FEELnc (FlExible Extraction of LncRNAs), a program to annotate lncRNAs based on a Random Forest model, trained with nucleotide pattern frequencies and relaxed ORFs. They used FEELnc on a data set of canine RNA-seq samples, improving the canine genome annotation with 10,374 novel lncRNAs.

Comprehensive reviews of these tools have been provided by Han et al. [25] and Guo et al. [29]. Similarly, Ventola et al. [30] studied features extracted from sequence data, those presented in the literature and some newly proposed features, in order to find signatures (groups of features) that can distinguish lncRNA transcripts from other classes, such as PCTs.

In general, the basic idea of the methods that use information of transcript nucleotides is to create a model to predict ncRNAs from known samples already stored in databases. Despite working well with the species for which they have been trained, these methods do not usually generalize for other organisms. In other words, these approaches are not capable of reliably predicting lncRNAs in a variety of species.

In addition, there are various databases containing lncRNAs (see Guo et al. [29] and Fritah et al. [31] for detailed reviews). Among them, Ensembl [32], NONCODE v. 4.0 [33], lncRNAdb [34], PLncDB [35], NRED [36] provide information on general and specific lncRNAs, while DIANA-LncBase [37] and lncRNADisease [38] present interactions among lncRNAs and other ncRNAs or proteins.

Moreover, in recent years, experimental and computational models have been developed to predict secondary and tertiary structures of lncRNAs, as explored in Yan et al. [39]. While in vitro prediction of the secondary structures of lncRNAs has high experimental costs, in silico methods are low cost, but they exhibit high false-positive rates [29].

Although lncRNAs have very heterogeneous characteristics [11–13], the previously described methods indicate that there are sets of features that allow researchers to distinguish lncRNAs from PCTs.

In this study, we present an SVM based method to distinguish lncRNAs from PCTs, using features extracted from transcript sequences: frequencies of nucleotide patterns selected by Principal Component Analysis (PCA) [40]; open reading frame (ORF) length; and ORF relative length. In addition, in order to analyze the performance of our method, we developed case studies with human, mouse and zebrafish data. We also compared results of our method to other tools found in the literature. To validate our model, we applied it to three different species (human, gorilla and rhesus macaque), as well as to human and mouse pseudogenes. Finally, we re-annotated data from Swiss-Prot, and annotated transcripts derived from RNA-seq data, reported in Necsulea et al. [41].

Methods

Data

Four data sets for training the models were obtained from Ensembl [32]: human (Homo sapiens) assemblies GRCh37 patch 13 (hg19, GENCODE 19) and GRCh38 patch 10 (hg38, GENCODE 26), mouse (Mus musculus) assembly GRCm38 patch 5 (mm10, GENCODE M13), zebrafish (Danio rerio) assembly GRCz10. These transcript FASTA files contain PCT and lncRNA sequences, while the classification was extracted from the transcript biotype, provided by Ensembl.

The SVM based method

We propose a method based on SVM to distinguish lncRNAs from PCTs (see Fig. 1), using PCA to reduce the number of features calculated from the nucleotides of the transcripts.

Fig. 1

The method to distinguish lncRNAs from PCTs. The method is based on SVM and uses as attributes nucleotide pattern frequencies, chosen with the support of PCA, together with the first ORF relative length, a characteristic that informs the coding potential of a transcript

First, a standard data set was created, removing all the sequences shorter than 200 bases from the original FASTA files. This standard data set contained, besides the transcripts (description and sequence), some calculated features (nucleotide pattern frequencies and ORF lengths) for each transcript, as follows. These features were divided into two sets: the first one contained the average frequency of the di-, tri- and tetra-nucleotide patterns in all the possible frames; and the second set contained the length and relative length of the first and the longest predicted ORFs. The relative length of an ORF is defined as its length divided by the length of its corresponding transcript.
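
As a rough illustration of the second feature set, the Python sketch below computes the length and the relative length of the first and the longest ORF of a transcript. The ORF definition is an assumption, since the text does not spell it out: here an ORF runs from an ATG to the first in-frame stop codon (TAA, TAG or TGA), stop included, and the "first" ORF is the one with the leftmost start codon. The function names `find_orfs` and `orf_features` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq):
    """Return (start, end) pairs for every ATG that reaches an in-frame stop.

    Assumption: an ORF starts at ATG and ends at the first in-frame stop
    codon (inclusive). ORFs are returned ordered by start position.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this start codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                orfs.append((start, pos + 3))
                break
    return orfs

def orf_features(seq):
    """Length and relative length of the first and the longest ORF."""
    orfs = find_orfs(seq)
    if not orfs:
        return {"first_len": 0, "first_rel": 0.0,
                "longest_len": 0, "longest_rel": 0.0}
    first = orfs[0]
    longest = max(orfs, key=lambda o: o[1] - o[0])
    n = len(seq)
    return {
        "first_len": first[1] - first[0],
        "first_rel": (first[1] - first[0]) / n,      # ORF length / transcript length
        "longest_len": longest[1] - longest[0],
        "longest_rel": (longest[1] - longest[0]) / n,
    }
```

For a toy sequence such as `CCATGAAATAGATGCCCGGGTTTTAACC`, the first ORF under these assumptions spans 9 bases and the longest one 15 bases, so the two relative lengths differ.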

The standard data set generated two other sets - training and testing, each composed of a positive set (containing lncRNAs) and a negative set (containing PCTs), of equal sizes. The training and the testing data sets were randomly generated, 75% for training and 25% for testing.
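
The construction of the balanced training and testing sets can be sketched as follows. This is a minimal sketch under stated assumptions: the larger class is reduced by random undersampling to the size of the smaller one, the split is taken per class, and the exact rounding used for the 75%/25% split is not known. The name `balanced_split` and the `seed` parameter are hypothetical.

```python
import random

def balanced_split(lncrnas, pcts, train_fraction=0.75, seed=0):
    """Undersample the larger class, then split each class 75/25.

    Returns (train, test), each a list of (item, label) pairs with
    label 1 for lncRNA (positive) and 0 for PCT (negative).
    """
    rng = random.Random(seed)
    n = min(len(lncrnas), len(pcts))           # equal class sizes
    pos = rng.sample(list(lncrnas), n)         # random order, no repeats
    neg = rng.sample(list(pcts), n)
    n_train = int(round(train_fraction * n))   # rounding rule is an assumption
    train = [(x, 1) for x in pos[:n_train]] + [(x, 0) for x in neg[:n_train]]
    test = [(x, 1) for x in pos[n_train:]] + [(x, 0) for x in neg[n_train:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```

Fixing the seed keeps the split reproducible, which matters when several models are compared on the same testing set.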

First set of features built with PCA

In the standard data set, there was a total of 336 different frequencies of nucleotide patterns in the first set: 16 di-nucleotide pattern frequencies; 64 tri-nucleotide pattern frequencies; and, 256 tetra-nucleotide pattern frequencies. We reduced the number of these possible features, having identified their relative importance, with the PCA method [40]. Thus, PCA was applied to all the nucleotide pattern frequencies of the training data set, to find how many, and which ones, would effectively help to distinguish between lncRNAs and PCTs.
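
The count of 336 follows from 4² + 4³ + 4⁴ = 16 + 64 + 256. The sketch below enumerates these patterns and computes their frequencies for one transcript. It assumes that "all the possible frames" can be covered by a sliding window with step 1, and that each count is normalized by the number of windows of that size; the original normalization may differ. `PATTERNS` and `pattern_frequencies` are hypothetical names.

```python
from itertools import product

ALPHABET = "acgt"
# 16 di- + 64 tri- + 256 tetra-nucleotide patterns, in a fixed order.
PATTERNS = ["".join(p) for k in (2, 3, 4) for p in product(ALPHABET, repeat=k)]

def pattern_frequencies(seq):
    """Frequency of every di-, tri- and tetra-nucleotide pattern.

    Assumption: a sliding window with step 1 covers all frames, and each
    count is divided by the number of windows of that size.
    """
    seq = seq.lower()
    freqs = dict.fromkeys(PATTERNS, 0.0)
    for k in (2, 3, 4):
        windows = len(seq) - k + 1
        if windows <= 0:
            continue
        for i in range(windows):
            kmer = seq[i:i + k]
            if kmer in freqs:   # windows with n or other symbols are skipped
                freqs[kmer] += 1.0 / windows
    return freqs
```

Keeping `PATTERNS` in a fixed order makes it straightforward to turn the dictionary into a feature vector with the same column layout for every transcript.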

The orthogonal transformation produced by PCA was used to calculate the “contribution” of each nucleotide pattern frequency. This orthogonal transformation is an n×n matrix with eigenvectors in its columns and features in its rows, where n = 336 frequencies of nucleotide patterns. We removed the m least significant columns from this matrix, obtaining a new n×(n−m) matrix, and calculated the Euclidean norm of the new vectors, also called loadings, represented by its columns. These norms are the contributions of the frequencies after the dimension reduction. This allowed us to select sets of nucleotide pattern frequencies in the training data set.
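
One possible reading of this contribution measure is sketched below with NumPy. It assumes that the contribution of a pattern is the Euclidean norm of its row in the truncated eigenvector matrix, that is, taken over the retained components, and that the eigenvectors are used unscaled. Both points are assumptions, and the names `pattern_contributions` and `top_patterns` are hypothetical.

```python
import numpy as np

def pattern_contributions(X, n_components):
    """Norm of each feature's loadings over the retained components.

    X: (samples, features) matrix of pattern frequencies.
    Returns one non-negative contribution per feature.
    """
    Xc = X - X.mean(axis=0)                    # centre each feature
    cov = np.cov(Xc, rowvar=False)             # features x features
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # most significant first
    kept = eigvecs[:, order[:n_components]]    # drop the least significant columns
    return np.linalg.norm(kept, axis=1)        # one norm per feature (row)

def top_patterns(X, names, n_components, k):
    """Names of the k features with the largest contribution."""
    contrib = pattern_contributions(X, n_components)
    best = np.argsort(contrib)[::-1][:k]
    return [names[i] for i in best]
```

When all components are kept, every row of the orthogonal matrix has norm 1, so the ranking only becomes informative once the least significant columns are removed.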

The PCA indicated that a set of 10 features could explain ≈65.0% of the data, while a set of 60 features could explain ≈95.0%. From this information, we created 6 groups of nucleotide pattern frequencies with sizes 10, 20, 30, 40, 50 and 60. The frequencies of nucleotide patterns that most contributed to the orthogonal transformation were selected to create each group. Each of these groups formed one of the first potential sets of features. The PCA results can be seen in Additional file 1.

Second set of features regarding ORFs

In addition, four sets of features were constructed, in order to find the best set of features regarding ORFs: the first ORF length and its relative length; the first ORF relative length; the longest ORF length and its relative length; and the longest ORF relative length.

Implementation

To implement the SVM method [42], the libSVM package [43] was used.

In order to find the best set of features, we combined the 6 sets of features found with the PCA (10, 20, 30, 40, 50 and 60 frequencies of nucleotide patterns) and the 4 sets of features described above (the first and the longest ORF lengths and their relative lengths), thereby creating 24 experiments.
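
The 6 × 4 = 24 combinations can be enumerated as in the sketch below. The column names and the `build_experiments` helper are hypothetical, and `ranked_patterns` is assumed to be the list of pattern names ordered by decreasing PCA contribution.

```python
from itertools import product

PATTERN_GROUP_SIZES = (10, 20, 30, 40, 50, 60)
ORF_FEATURE_SETS = {
    "first_len+first_rel": ("first_orf_length", "first_orf_relative_length"),
    "first_rel": ("first_orf_relative_length",),
    "longest_len+longest_rel": ("longest_orf_length", "longest_orf_relative_length"),
    "longest_rel": ("longest_orf_relative_length",),
}

def build_experiments(ranked_patterns):
    """Return one feature list per (pattern group, ORF feature set) pair."""
    experiments = []
    for size, (orf_name, orf_cols) in product(PATTERN_GROUP_SIZES,
                                              ORF_FEATURE_SETS.items()):
        experiments.append({
            "name": f"{size}_patterns+{orf_name}",
            "features": list(ranked_patterns[:size]) + list(orf_cols),
        })
    return experiments
```

Each experiment then uses between 11 and 62 features, depending on the pattern group and the ORF feature set it combines.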

In these 24 experiments, the grid search tool (see Endnote 1) with 10-fold cross validation was applied to the training data set, to define which experiment performed best. In each experiment, the best C and γ parameters were selected. The grid search results can be seen in Additional file 2.
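
The parameter search itself can be pictured with the following sketch, which does not call libSVM. It builds an exponential grid of (C, γ) pairs and keeps the pair with the best score. The exponent ranges are the defaults documented for libSVM's grid.py (log2 C from -5 to 15 in steps of 2, log2 γ from 3 to -15 in steps of -2) and are an assumption here, since the text does not state which ranges were used. The callable `cv_accuracy` is a hypothetical stand-in for a function that runs 10-fold cross validation and returns the mean accuracy.

```python
from itertools import product

def libsvm_style_grid(log2c=(-5, 15, 2), log2g=(3, -15, -2)):
    """Return (C, gamma) pairs on an exponential grid."""
    def span(begin, end, step):
        values = []
        x = begin
        while (step > 0 and x <= end) or (step < 0 and x >= end):
            values.append(2.0 ** x)
            x += step
        return values
    return list(product(span(*log2c), span(*log2g)))

def grid_search(cv_accuracy, grid):
    """Pick the (C, gamma) pair with the highest cross-validated accuracy."""
    return max(grid, key=lambda pair: cv_accuracy(*pair))
```

With these ranges the grid holds 11 × 10 = 110 candidate pairs, each of which would cost one full cross-validation run per experiment.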

Case studies

Four case studies were performed to evaluate the SVM method. We validated all the models with species different from those used in the training phase, according to the following data sets: rat (Rattus norvegicus) assembly Rnor6.0, pig (Sus scrofa) assembly Sscrofa10.2, and fruit fly (Drosophila melanogaster) assembly BDGP6. We also applied the models to human and mouse pseudogenes. In addition to this, we re-annotated two sequences from the Swiss-Prot database [44], and annotated contigs derived from RNA-seq transcripts of human, gorilla and rhesus macaque, reported in Necsulea et al. [41].

Results and discussion

Human

In the first case study, only human data from the assemblies GRCh37 (hg19) and GRCh38 (hg38) were used for training and testing. Our databases included 104,763 PCTs and 24,513 lncRNAs from GRCh37, and 102,915 PCTs and 28,321 lncRNAs from GRCh38. We filtered out all the sequences shorter than 200 bases, obtaining 94,830 and 92,716 PCTs from the GRCh37 and GRCh38 assemblies, respectively, and 24,266 and 28,024 lncRNAs from the GRCh37 and GRCh38 assemblies, respectively.
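
The length filter can be sketched in a few lines of Python. This is a minimal FASTA reader written for illustration only, and the names `read_fasta` and `filter_min_length` are hypothetical. Sequences of exactly 200 bases are kept, since only those shorter than 200 are removed.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)   # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)

def filter_min_length(records, min_length=200):
    """Keep transcripts with at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]
```

A file can be passed directly, for example `filter_min_length(read_fasta(open(path)))`, where `path` points to a transcript FASTA file.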

To train the models, 18,200 PCTs and 18,200 lncRNAs were used from GRCh37, and 21,018 PCTs and 21,018 lncRNAs from GRCh38. GRCh37 testing data set included 6,066 PCTs and 6,066 lncRNAs, while the GRCh38 testing data set contained 7,006 PCTs and 7,006 lncRNAs.

The 6 sets of nucleotide pattern frequencies selected with PCA were used to identify which one produced the best results. To do this, we used two ROC curves (see Figs. 2 and 3). These figures show the results of the models trained with the first ORF relative length and the 6 nucleotide pattern sets. The curve for the model trained with 50 nucleotide frequencies performed slightly better for both assemblies, GRCh37 and GRCh38.

Fig. 2

GRCh37 ROC curve used to select the set with the best nucleotide pattern frequencies. The model trained with a set composed of 50 nucleotide pattern frequencies performed slightly better than the other models

Fig. 3

GRCh38 ROC curve used to select the set with the best nucleotide pattern frequencies. The model trained with a set composed of 50 nucleotide pattern frequencies performed slightly better than the other models

The nucleotide pattern frequencies that achieved the best results for the human data are shown in Table 1. The nucleotide pattern frequencies for the GRCh37 and GRCh38 data sets were almost equal, the only difference being “acg” and “gta”. We noted that both patterns are among the lowest PCA loadings, compared to all the other patterns.

Table 1 Selected nucleotide pattern frequencies for the human data

Using these patterns, together with the first and longest ORF relative lengths as features, we trained 8 models with two kernels, radial and quadratic, having tested them with both data sets, GRCh37 and GRCh38. The results are shown in Table 2 and Figs. 4 and 5. The quadratic kernel achieved substantial accuracy in almost all the tests, while the radial kernel achieved very high accuracy in all of them.

Fig. 4

GRCh37 ROC curve used to select the best ORF relative length, and the kernel. The model trained with the first ORF relative length and the radial kernel obtained better results

Fig. 5

GRCh38 ROC curve used to select the best ORF relative length, and the kernel. The model trained with the first ORF relative length and the radial kernel obtained the best results

Table 2 Results of the human case study

The difference in results between using the first and the longest ORF relative lengths was very small. Although the two models achieved very close accuracy values, the first ORF relative length model presented higher sensitivity than the longest one. In addition, finding the first ORF (O(n)) has a lower time complexity than finding the longest ORF (O(n²)). From a biological point of view, the canonical model for translation initiation is the ribosome scanning model, in which the ribosome searches for the initial “atg” codon [45]. It is worth noting that, in our data sets, in ≈94% of the lncRNAs, the first ORF was different from the longest one, while in ≈93% of the PCTs, the first and the longest ORFs were the same. Using only this characteristic, we built a deterministic classifier to distinguish lncRNAs from PCTs, and compared it with the best SVM model (Fig. 6). This classifier achieved ≈93.5% accuracy. Thus, we decided to use the first ORF relative length as a feature in our models.
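
The deterministic rule described above can be sketched as follows. The details are assumptions rather than the original implementation: an ORF is taken to run from an ATG to the first in-frame stop codon, a transcript without any ORF is labelled as lncRNA, and a tie for the longest ORF is resolved in favour of the earliest one. The names `first_and_longest_orf` and `rule_based_label` are hypothetical.

```python
import re

# The lookahead lets overlapping ORFs all be reported; the lazy codon
# repeat stops at the first in-frame stop codon after each ATG.
ORF_RE = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")

def first_and_longest_orf(seq):
    """Return ((start, end) of first ORF, (start, end) of longest ORF)."""
    orfs = [(m.start(1), m.end(1)) for m in ORF_RE.finditer(seq.upper())]
    if not orfs:
        return None, None
    # max() keeps the earliest ORF when several share the maximum length.
    return orfs[0], max(orfs, key=lambda o: o[1] - o[0])

def rule_based_label(seq):
    """'PCT' when the first ORF is also the longest one, else 'lncRNA'."""
    first, longest = first_and_longest_orf(seq)
    if first is None:
        return "lncRNA"   # assumption: no ORF at all means non-coding
    return "PCT" if first == longest else "lncRNA"
```

Such a rule has no tunable parameters, which makes it a convenient baseline against which the SVM models can be compared.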

Fig. 6

GRCh37 ROC curve showing the performance of the deterministic classifier compared to the best SVM model

The contribution of each feature set was also investigated (Figs. 7 and 8). Each feature set can also distinguish lncRNAs from PCTs with high confidence. The model using only the first ORF achieved 92.90% accuracy in GRCh37 data set and 92.95% in GRCh38 data set, while the model using only the 50 frequencies of nucleotide pattern achieved an accuracy of 90.86% and 91.54%, respectively. These results confirm that ORF content is a key characteristic, as reported in the literature, and, also show that other features, such as sets of nucleotide pattern frequencies, can achieve similar performance in distinguishing lncRNAs from PCTs. However, we found that combining all the features in one model presented better results.

Fig. 7

GRCh37 ROC curve showing the performance of a model trained with the first ORF relative length only, another model trained with the 50 selected nucleotide patterns frequencies, and a third model using all these features

Fig. 8

GRCh38 ROC curve showing the performance of a model trained with the first ORF relative length only, another model trained with the 50 selected nucleotide patterns frequencies, and a model using all these features

We compared our results with the methods and results presented by Sun et al. [27, 46], Han et al. [25], Pian et al. [24] and Wucher et al. [28], as shown in Table 3. Note that, in these comparisons, the same human assemblies were used. These results show that, in all the chosen metrics, our method presented better results.

Table 3 Results for models trained with human data

DeepLNC of Tripathi et al. [26] presented almost the same results, when compared to our method. In contrast to the other methods, we did not execute any experiment directly, since DeepLNC uses the lncipedia database [47], and does not clearly indicate the negative data set. We also attempted to use their method with our data set, but the web application (http://bioserver.iiita.ac.in/deeplnc/) presented an exception when submitting a fasta file, and failed to report any results. Notably, 98.21% of all the lncRNAs of the lncipedia database were correctly classified by our method.

Moreover, in order to verify the performance of our method in a highly curated set of lncRNAs and PCTs, we selected the best trained model to classify human data, the one trained with data from GRCh37, with 50 PCA selected nucleotide pattern frequencies and the first ORF relative length. This model was used to classify the highly curated data set of 5,322 lncRNAs reported by Nitsche et al. [48] and 5,322 PCTs randomly chosen from the Swiss-Prot reviewed database [44], excluding those annotated as putative, hypothetical, unknown or predicted. The model classified this data set with 96.15% accuracy, 99.72% sensitivity (5,307/5,322) and 92.58% specificity (4,927/5,322).
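
The three reported figures follow directly from the counts given above, with lncRNA taken as the positive class. The short sketch below recomputes them; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity with lncRNA as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # lncRNAs correctly classified
        "specificity": tn / (tn + fp),   # PCTs correctly classified
    }

# Counts from the curated data set: 5,307 of 5,322 lncRNAs and
# 4,927 of 5,322 PCTs were classified correctly.
m = binary_metrics(tp=5307, fn=5322 - 5307, tn=4927, fp=5322 - 4927)
print({k: round(100 * v, 2) for k, v in m.items()})
# {'accuracy': 96.15, 'sensitivity': 99.72, 'specificity': 92.58}
```

Because the two classes have the same size here, the accuracy equals the mean of sensitivity and specificity.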

Mouse

For the second case study, we used mouse transcript data, from the GRCm38 assembly, with 61,440 PCTs and 11,511 lncRNAs. Again, we removed all the sequences shorter than 200 nucleotides, which resulted in 57,191 PCTs and 11,347 lncRNAs. This data was randomly split in two data sets, a training data set with 8510 PCTs and 8510 lncRNAs, and a testing data set with 2837 PCTs and 2837 lncRNAs.

Models with the 6 nucleotide pattern sets together with the first ORF relative length were also used to find which set would perform better. The ROC curve in Fig. 9 shows that the model trained with 50 nucleotide frequencies performed better than the other models. The nucleotide pattern frequencies that achieved the best results for the mouse data are shown in Table 4.

Fig. 9

GRCm38 ROC curve used to select the best set of nucleotide pattern frequencies. The model trained with the set composed of 50 nucleotide pattern frequencies performed better than the other models

Table 4 Selected nucleotide pattern frequencies for mouse data

Similar to the human case, using these nucleotide pattern frequencies, we also analyzed models trained with radial and quadratic kernels, using the first and the longest ORFs, as well as absolute and relative lengths. Analyzing the results, shown in Table 5 and in Fig. 10, we found that the best model was trained using the radial kernel, with features of the set of 50 frequencies of nucleotide patterns and the first ORF relative length.

Fig. 10

GRCm38 ROC curve used to select the best ORF relative length and the kernel. The model trained with the first ORF relative length and the radial kernel obtained the best results

Table 5 Results for models trained with mouse data

Again, the contribution of each feature category was investigated (Fig. 11). The model using only the first ORF achieved 93.52% accuracy, while the model using only the 50 frequencies of nucleotide patterns achieved an accuracy of 90.68%. Once more, ORF content is confirmed as a determinant characteristic, as well as a set of nucleotide pattern frequencies that achieved similar performance, to distinguish lncRNAs from PCTs. However, we found that combining all the features in one model improved performance.

Fig. 11

GRCm38 ROC curve showing the performance of a model trained with first ORF relative length only, another model trained with the 50 selected nucleotide patterns frequencies, and a third model using all these features

The comparison of our results with those obtained by Sun et al. [27, 46], Han et al. [25], Pian et al. [24] and Wucher et al. [28] (see Table 6), shows that our method achieved better sensitivity and accuracy than the other methods, although the specificity was 1.41% lower than CPC, despite a 23.24% higher sensitivity in this case. Therefore, our method presented better performance in distinguishing lncRNAs from PCTs in mouse transcript data, when compared to the other tools.

Table 6 Results for models trained and tested with mouse data

Human and Mouse

This case study was analyzed to verify if a cross species model would better distinguish lncRNAs from PCTs than the previously tested single species models.

In this case study, we used the same training and testing data from the previous case studies to build the training and testing data sets. We combined data from GRCh37 with GRCm38 and from GRCh38 with GRCm38.

First, we selected the 50 nucleotide pattern frequencies to build the models (see Table 7). The least significant patterns (lowest PCA loading), “cca” and “gac”, were the only differences in these sets.

Table 7 Selected nucleotide pattern frequencies for human and mouse data

Using these patterns, we trained models with the first and the longest ORF relative lengths. The results are shown in Table 8.

Table 8 Results of the human and mouse case study

We noticed a small improvement in accuracy using the bi-species model, when compared to the single species model. These results suggest that a multi-species model can slightly improve distinguishing lncRNAs from PCTs, when compared to a single species model.

Mouse and Zebrafish

The last case study was performed to evaluate our method when creating a multi-species model with data from two evolutionarily distant species, and with a smaller number of annotated lncRNAs. To do this, we used mouse (GRCm38) and zebrafish (GRCz10).

The same training and testing data sets from the mouse case study were used, together with data from GRCz10, 2775 PCTs and 2775 lncRNAs for training, and 926 PCTs and 926 lncRNAs for testing.

The 50 nucleotide pattern frequencies selected by the PCA are shown in Table 9. These 50 patterns and the first ORF relative length were used to create the SVM model, which obtained the results presented in Table 10.

Table 9 Selected nucleotide pattern frequencies from mouse and zebrafish
Table 10 Results for the mouse and zebrafish case study. We trained one model with two data sets, GRCm38 and GRCz10

Once again, the results show that we can use the same method on different sets of species, creating a multi-species model to distinguish lncRNAs from PCTs, with high accuracy.

Model validation

To validate our method, we used the best model of each case study to distinguish lncRNAs from PCTs in data sets of species that were not used in the SVM training. The objective was to analyze under- and overfitting, and also whether the models could distinguish lncRNAs from PCTs in data sets of evolutionarily close and distant species.

Besides the data sets used in each case study, we used data from pig (Sscrofa10.2) - 205 lncRNAs and 205 PCTs, rat (Rnor6.0) - 3537 lncRNAs and 3537 PCTs, and fruit fly (BDGP6) - 2776 lncRNAs and 2776 PCTs. All the results are shown in Table 11.

Table 11 Comparison of all the results for each species, together with their corresponding performances

From these results, we can see that none of the models is overfitted, since they could be applied to different species with high accuracy. The models that used GRCh38 data led to worse performance for evolutionarily distant species, especially when compared to models that used data from GRCh37. The 3808 newly annotated lncRNAs probably contribute to a model more fitted to evolutionarily close species.

The pig data set obtained the worst classification. These results could be explained by the small number of sequences in the data set, and also by the fact that this is not a model organism, so possibly this data is not curated enough. Nonetheless, our method can be used to improve the quality of lncRNA annotation in this species.

On the other hand, it is interesting to note that a multi-species model can improve the accuracy when compared to a single species model, as can be seen in Table 11. The accuracy was slightly improved when the GRCh37/GRCm38 model was used to distinguish lncRNAs from PCTs in the human GRCh37 data set. Interestingly, a model created with two evolutionarily distant species - mouse and zebrafish - was able to distinguish lncRNAs of the fruit fly, which is an even more distant species.

Finally, we applied the models to human and mouse pseudogenes (in GTF files), classifying as lncRNAs 81.2% (12,033 from a total of 15,494) of the pseudogenes of the human genome, and 91.7% (6832 from a total of 7453) of the pseudogenes of the mouse genome. It is remarkable that such a large number of pseudogenes is predicted as lncRNA, since pseudogenes are derived from ancient PCTs, and diverge slowly after their generation, losing coding capacity and potential regulatory signal [49]. Nevertheless, our method distinguishes pseudogenes from bona fide PCTs.

PCTs re-annotation and RNA-seq annotation

The GRCh38 model was used to search for lncRNAs among putative, hypothetical, unknown and predicted human PCTs in the Swiss-Prot reviewed database [44]. We found 1245 sequences longer than 67 amino acids (201 bases). To find the corresponding nucleotide sequences, we used the EMBL reference of each entry of the Swiss-Prot database. All these sequences were trimmed, in order to begin with a start codon, because we found sequences with long 5′ UTRs. This avoids introducing bias through the first ORF relative length in the discrimination between lncRNAs and PCTs. Our method found 231 candidates. From these, we focused on 21 candidates: those that had more than a 90% probability of being lncRNA and were shorter than 2000 bases. After analyzing the EMBL and Swiss-Prot databases and the sequences themselves, we found 2 putative PCTs with multiple “atg” at the 5′ UTR, and also with annotation warnings about dubious prediction. Thus, both sequences, listed in Additional file 3, could be re-annotated as lncRNAs with high probability.

In addition, we also used transcripts derived from RNA-seq data to validate our model against annotated lncRNAs, as reported by Necsulea et al. [41]. They presented 11,890, 912 and 12,056 lncRNAs from human, gorilla (Gorilla gorilla) and rhesus macaque (Macaca mulatta), respectively. Our GRCh37 model correctly classified 11,726 (98.62%), 737 (80.81%) and 11,086 (91.95%) lncRNAs from human, gorilla and rhesus macaque, respectively.

Conclusion

In this article, we presented an SVM based method to distinguish long non-coding RNAs (lncRNAs) from protein coding transcripts (PCTs), using features from the nucleotide patterns (frequencies of di-, tri- and tetra-nucleotides) of transcripts, chosen with the support of Principal Component Analysis (PCA), together with ORF length and ORF relative length.

We trained and tested our method with data of human, mouse and zebrafish, obtaining high performance. The best results were an accuracy of 98.18% with human transcripts, 97.83% with mouse transcripts and 96.09% with zebrafish transcripts. We compared our results with other methods in the literature (CPAT, CPC, iSeeRNA, lncRNApred, lncRScan-SVM and FEELnc) and found we had obtained better results.

To validate our model, we first classified the mouse data with the human model, and vice-versa, obtaining accuracy of ≈97.8% in both cases, showing that our model is not overfitted, and can be used with evolutionarily close species. We also validated the multi-species models human/mouse and mouse/zebrafish, which also produced excellent results. Next, we tested our models with data from rat, pig and fruit fly, obtaining accuracies from 84% to ≈99% in all these organisms. Our method classified 81.2% of human pseudogenes and 91.7% of mouse pseudogenes as non-coding, and also found 2 uncharacterized sequences, among 1245, in the Swiss-Prot reviewed database, with a high probability of being lncRNAs. Furthermore, the method successfully annotated the majority of the assembled transcripts derived from RNA-seq data from human (98.62%), gorilla (80.81%) and rhesus macaque (91.95%).

We intend to investigate whether a semi-supervised learning method could reduce the size of the training data sets while maintaining high accuracy in the testing phase. This could be very useful to train models for organisms with a small number of known lncRNA transcripts. Lastly, novel features (see Ventola et al. [30]) could be used in machine learning methods, also indicating potential biological characteristics of lncRNAs.

Endnote

1 A Python script, part of the libSVM package, that searches for the C and γ parameters yielding the model with the best accuracy.
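The idea behind such a parameter search can be sketched as follows. This is not the libSVM script itself: `grid_search`, the `score` callback and the exponent ranges are hypothetical choices made for illustration, and `toy_score` merely stands in for a real cross-validated accuracy.

```python
from itertools import product
from math import log2


def grid_search(score, log2c=range(-5, 16, 2), log2g=range(3, -16, -2)):
    """Return (best_score, C, gamma) over a log2-spaced parameter grid.

    `score(C, gamma)` is a hypothetical callback expected to return the
    cross-validated accuracy of an SVM trained with those parameters.
    """
    best = None
    for exp_c, exp_g in product(log2c, log2g):
        c, g = 2.0 ** exp_c, 2.0 ** exp_g
        acc = score(c, g)
        if best is None or acc > best[0]:
            best = (acc, c, g)
    return best


def toy_score(c: float, g: float) -> float:
    """Toy stand-in for cross-validation accuracy, peaking at C=8, gamma=0.5."""
    return 1.0 - 0.01 * (abs(log2(c) - 3) + abs(log2(g) + 1))


best_acc, best_c, best_g = grid_search(toy_score)
print(best_acc, best_c, best_g)
```

In practice the callback would train and cross-validate an SVM for each (C, γ) pair, which is why a coarse log-spaced grid is commonly preferred over a fine linear one.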

Abbreviations

lncRNA:

Long non-coding RNA

lincRNA:

Long intergenic non-coding RNA

ncRNA:

Non-coding RNA

ORF:

Open reading frame

PCT:

Protein coding transcript

PCA:

Principal component analysis

ROC:

Receiver operating characteristic

SVM:

Support vector machine

References

  1. The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007; 447:799–816.

  2. Maeda N, Kasukawa T, Oyama R, Gough J, Frith M, Engström PG, Lenhard B, Aturaliya RN, Batalov S, Beisel KW, Bult CJ, Fletcher CF, Forrest AR, Furuno M, Hill D, Itoh M, Kanamori-Katayama M, Katayama S, Katoh M, Kawashima T, Quackenbush J, Ravasi T, Ring BZ, Shibata K, Sugiura K, Takenaka Y, Teasdale RD, Wells CA, Zhu Y, Kai C, Kawai J, Hume DA, Carninci P, Hayashizaki Y. Transcript annotation in FANTOM3: Mouse Gene Catalog based on physical cDNAs. PLoS Genet. 2006; 2:62.

  3. Clark MB, Amaral PP, Schlesinger FJ, Dinger ME, Taft RJ, Rinn JL, Ponting CP, Stadler PF, Morris KJ, Morillon A, Rozowsky JS, Gerstein M, Wahlestedt C, Hayashizaki Y, Carninci P, Gingeras TR, Mattick JS. The reality of pervasive transcription. PLoS Biol. 2011; 9:1000625.

  4. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489:57–74.

  5. Carthew RW, Sontheimer EJ. Origins and mechanisms of miRNAs and siRNAs. Cell. 2009; 136:642–55.

  6. Kapranov P, St Laurent G, Raz T, Ozsolak F, Reynolds CP, Sorensen PH, Reaman G, Milos P, Arceci RJ, Thompson JF, Triche TJ. The majority of total nuclear-encoded non-ribosomal RNA in a human cell is ’dark matter’ un-annotated RNA. BMC Biol. 2010; 8:149.

  7. Hackermüller J, Reiche K, Otto C, Hösler N, Blumert C, Brocke-Heidrich K, Böhlig L, Nitsche A, Kasack K, Ahnert P, Krupp W, Engeland K, Stadler PF, Horn F. Cell cycle, oncogenic and tumor suppressor pathways regulate numerous long and macro non-protein coding RNAs. Genome Biol. 2014; 15:48.

  8. Nakaya HI, Amaral PP, Louro R, Lopes A, Fachel AA, Moreira YB, El-Jundi TA, da Silva AM, Reis EM, Verjovski-Almeida S. Genome mapping and expression analyses of human intronic noncoding RNAs reveal tissue-specific patterns and enrichment in genes related to regulation of transcription. Genome Biol. 2007; 8(3):43.

  9. Engelhardt J, Stadler PF. Evolution of the unspliced transcriptome. BMC Evol Biol. 2015; 15:166. doi:10.1186/s12862-015-0437-7.

  10. Kapranov P, Cheng J, Dike S, Nix D, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, Bell I, Cheung E, Drenkow J, Dumais E, Patel S, Helt G, Madhavan G, Piccolboni A, Sementchenko V, Tammana H, Gingeras TR. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007; 316:1484–8.

  11. Ponting CP, Oliver PL, Reik W. Evolution and functions of long noncoding RNAs. Cell. 2009; 136(4):629–41. doi:10.1016/j.cell.2009.02.006.

  12. Mercer TR, Dinger ME, Mattick JS. Long non-coding RNAs: insights into functions. Nat Rev Genet. 2009; 10(3):155–9. doi:10.1038/nrg2521.

  13. Orom UA, Shiekhattar R. Noncoding RNAs and enhancers: complications of a long-distance relationship. Trends Genet. 2011; 27(10):433–9. doi:10.1016/j.tig.2011.06.009.

  14. Rinn JL, Kertesz M, Wang JK, Squazzo SL, Xu X, Brugmann SA, Goodnough LH, Helms JA, Farnham PJ, Segal E, Chang HY. Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs. Cell. 2007; 129(7):1311–23. doi:10.1016/j.cell.2007.05.022.

  15. He Y, Vogelstein B, Velculescu VE, Papadopoulos N, Kinzler KW. The antisense transcriptomes of human cells. Science. 2008; 322(5909):1855–7. doi:10.1126/science.1163853.

  16. Guenther MG, Levine SS, Boyer LA, Jaenisch R, Young RA. A chromatin landmark and transcription initiation at most promoters in human cells. Cell. 2007; 130(1):77–88. doi:10.1016/j.cell.2007.05.042.

  17. Ashe HL, Monks J, Wijgerde M, Fraser P, Proudfoot NJ. Intergenic transcription and transinduction of the human beta-globin locus. Genes Dev. 1997; 11(19):2494–509.

  18. Weirick T, John D, Dimmeler S, Uchida S. C-It-Loci: a knowledge database for tissue-enriched loci. Bioinformatics. 2015; 31(21):3537–43. doi:10.1093/bioinformatics/btv410.

  19. Hall JR, Messenger ZJ, Tam HW, Phillips SL, Recio L, Smart RC. Long noncoding RNA lincRNA-p21 is the major mediator of UVB-induced and p53-dependent apoptosis in keratinocytes. Cell Death Dis. 2015; 6:1700.

  20. Tang SS, Zheng BY, Xiong XD. LincRNA-p21: Implications in Human Diseases. Int J Mol Sci. 2015; 16:18732–40.

  21. Kumar V, Westra HJ, Karjalainen J, Zhernakova DV, Esko T, Hrdlickova B, Almeida R, Zhernakova A, Reinmaa E, Vosa U, Hofker MH, Fehrmann RSN, Fu J, Withoff S, Metspalu A, Franke L, Wijmenga C. Human Disease-Associated Genetic Variation Impacts Large Intergenic Non-Coding RNA Expression. PLoS Genet. 2013; 9(1):1003201.

  22. Kong L, Zhang Y, Ye ZQ, Liu XQ, Zhao SQ, Wei L, Gao G. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res. 2007; 35(Web Server issue):345–9. doi:10.1093/nar/gkm391.

  23. Wang L, Park HJ, Dasari S, co-authors. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res. 2013; 41(6):74.

  24. Pian C, Zhang G, Chen Z, Chen Y, Zhang J, Yang T, Zhang L. LncRNApred: Classification of Long Non-Coding RNAs and Protein-Coding Transcripts by the Ensemble Algorithm with a New Hybrid Feature. PLoS ONE. 2016; 11(5):0154567. doi:10.1371/journal.pone.0154567.

  25. Han S, Liang Y, Li Y, Du W. Long Noncoding RNA Identification: Comparing Machine Learning Based Tools for Long Noncoding Transcripts Discrimination. Biomed Res Int. 2016; 2016:8496165. doi:10.1155/2016/8496165.

  26. Tripathi R, Patel S, Kumari V, Chakraborty P, Varadwaj PK. DeepLNC, a long non-coding RNA prediction tool using deep neural network. Netw Model Anal Health Inform Bioinformatics. 2016; 5(1):21. doi:10.1007/s13721-016-0129-2.

  27. Sun K, Chen X, Jiang P, Song X, Wang H, Sun H. iSeeRNA: identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data. BMC Genomics. 2013; 14 Suppl 2:7. doi:10.1186/1471-2164-14-S2-S7.

  28. Wucher V, Legeai F, Hédan B, Rizk G, Lagoutte L, Leeb T, Jagannathan V, Cadieu E, David A, Lohi H, Cirera S, Fredholm M, Botherel N, Leegwater PAJ, Le Béguec C, Fieten H, Johnson J, Alföldi J, André C, Lindblad-Toh K, Hitte C, Derrien T. FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome. Nucleic Acids Res. 2016;1–12. doi:10.1093/nar/gkw1306.

  29. Guo X, Gao L, Wang Y, Chiu DKY, Wang T, Deng Y. Advances in long noncoding RNAs: identification, structure prediction and function annotation. Brief Funct Genomics. 2016; 15(1):38–46. doi:10.1093/bfgp/elv022.

  30. Ventola GM, Noviello TMR, D'Aniello S, Spagnuolo A, Ceccarelli M, Cerulo L. Identification of long non-coding transcripts with feature selection: a comparative study. BMC Bioinformatics. 2017; 18(187):1–16. doi:10.1186/s12859-017-1594-z.

  31. Fritah S, Niclou SP, Azuaje F. Databases for lncRNAs: a comparative evaluation of emerging tools. RNA. 2014; 20(11):1655–65.

  32. Cunningham F, co-authors. Ensembl 2015. Nucleic Acids Res. 2015; 43(Database issue):662–9. doi:10.1093/nar/gku1010.

  33. Xie C, Yuan J, Li H, Li M, Zhao G, Bu D, Zhu W, Wu W, Chen R, Zhao Y. NONCODEv4: exploring the world of long non-coding RNA genes. Nucleic Acids Res. 2014; 42(D1):98–103.

  34. Quek XC, Thomson DW, Maag JL, Bartonicek N, Signal B, Clark MB, Gloss BS, Dinger ME. lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs. Nucleic Acids Res. 2015; 43(Database issue):168–73.

  35. Jin J, Liu J, Wang H, Wong L, Chua NH. PLncDB: plant long non-coding RNA database. Bioinformatics. 2013; 29(8):1068–71.

  36. Dinger ME, Pang KC, Mercer TR, Crowe ML, Grimmond SM, Mattick JS. NRED: a database of long noncoding RNA expression. Nucleic Acids Res. 2009; 37(suppl 1):122–6.

  37. Paraskevopoulou MD, Georgakilas G, Kostoulas N, Reczko M, Maragkakis M, Dalamagas TM, Hatzigeorgiou AG. DIANA-LncBase: experimentally verified and computationally predicted microRNA targets on long non-coding RNAs. Nucleic Acids Res. 2013; 41(Database issue):239–45.

  38. Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, Zhang Q, Yan G, Cui Q. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 2013; 41(Database issue):983–6.

  39. Yan K, Arfat Y, Li D, co-authors. Structure prediction: New insights into decrypting long noncoding RNAs. Int J Mol Sci. 2016; 17(1):132.

  40. Jolliffe IT. Principal Component Analysis, 2nd edn. New York: Springer; 2002.

  41. Necsulea A, Soumillon M, Warnefors M, Liechti A, Daish T, Zeller U, Baker JC, Grützner F, Kaessmann H. The evolution of lncRNA repertoires and expression patterns in tetrapods. Nature. 2014; 505(7485):635–40. doi:10.1038/nature12943.

  42. Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. COLT ’92. New York: ACM; 1992. p. 144–52. doi:10.1145/130385.130401.

  43. Chang CC, Lin CJ. LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol. 2011; 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

  44. Boutet E, Lieberherr D, Tognolli M, Schneider M, Bansal P, Bridge AJ, Poux S, Bougueleret L, Xenarios I. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. Methods Mol Biol. 2016; 1374:23–54.

  45. Agarwal P, Bafna V. The ribosome scanning model for translation initiation: implications for gene prediction and full-length cDNA detection. Proc Int Conf Intell Syst Mol Biol. 1998; 6:2–7.

  46. Sun L, Liu H, Zhang L, Meng J. lncRScan-SVM: A Tool for Predicting Long Non-Coding RNAs Using Support Vector Machine. PLoS ONE. 2015; 10(10):0139654. doi:10.1371/journal.pone.0139654.

  47. Volders PJ, Verheggen K, Menschaert G, Vandepoele K, Martens L, Vandesompele J, Mestdagh P. An update on LNCipedia: a database for annotated human lncRNA sequences. Nucleic Acids Res. 2015; 43(8):4363–4. doi:10.1093/nar/gkv295.

  48. Nitsche A, Rose D, Fasold M, Reiche K, Stadler PF. Comparison of splice sites reveals that long non-coding RNAs are evolutionarily well conserved. RNA. 2015; 21:801–12. doi:10.1261/rna.046342.114.

  49. Frankish A, Harrow J. Gencode pseudogenes. Methods Mol Biol. 2014; 1167:129–55.

Acknowledgements

The authors kindly thank the reviewers for their helpful comments and suggestions. M. E. M. T. Walter thanks CNPq (process 308524/2015-2).

Funding

No funding was obtained for this study.

Availability of data and materials

All data sets analysed during the current study are available in the Ensembl Database [32]: http://www.ensembl.org.

The trained models and the program to predict long non-coding RNA are available at: https://github.com/hugowschneider/longdist.py.

Author information

Contributions

HWS proposed and implemented both the PCA and the SVM methods, and executed the experiments. TR and MMB proposed and discussed the biological aspects to be considered for predicting lncRNAs. PFS proposed the problem and the experiments. MEMTW proposed the SVM method and discussed the implementation. All the authors contributed to the discussion of the results, and also to the text. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Hugo W. Schneider.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1

This file shows the results of the PCA analysis. The sheets are organized by data sets and PCA results. For example, "GRCh37 PCA Explained" contains the explained data for each component obtained from the GRCh37 data. The PCA results are "PCA Explained", "Loadings" and "Attributes", containing the screeplot of the explained data, the loadings of the nucleotide pattern frequencies, and the selected attributes, respectively. These three sheets are available for each case study: GRCh37, GRCh38, GRCm38, GRCh37/GRCm38, GRCh38/GRCm38 and GRCm38/GRCz10. (XLSX 389 kb)

Additional file 2

Grid search results to find good C and γ parameters for the SVM method, for each case study. Each image in this file represents the best selection of parameters for each case study. The attributes used in this grid search were the set of 50 nucleotide pattern frequencies, the first and the longest ORF relative lengths. (PDF 130 kb)

Additional file 3

Sequences of the Swiss-Prot reviewed database not characterized, identified by our method with a high probability of being lncRNAs. (Fa 3.62 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Schneider, H., Raiol, T., Brigido, M. et al. A Support Vector Machine based method to distinguish long non-coding RNAs from protein coding transcripts. BMC Genomics 18, 804 (2017). https://doi.org/10.1186/s12864-017-4178-4

  • DOI: https://doi.org/10.1186/s12864-017-4178-4

Keywords