Skip to main content
  • Methodology Article
  • Open access
  • Published:

Predicting RNA hyper-editing with a novel tool when unambiguous alignment is impossible

Abstract

Background

Repetitive elements are now known to have relevant cellular functions, including self-complementary sequences that form double stranded (ds) RNA. There are numerous pathways that determine the fate of endogenous dsRNA, and misregulation of endogenous dsRNA is a driver of autoimmune disease, particularly in the brain. Unfortunately, the alignment of high-throughput, short-read sequences to repeat elements poses a dilemma: Such sequences may align equally well to multiple genomic locations. In order to differentiate repeat elements, current alignment methods depend on sequence variation in the reference genome. Reads are discarded when no such variations are present. However, RNA hyper-editing, a possible fate for dsRNA, introduces enough variation to distinguish between repeats that are otherwise identical.

Results

To take advantage of this variation, we developed a new algorithm, RepProfile, that simultaneously aligns reads and predicts novel variations. RepProfile accurately aligns hyper-edited reads that other methods discard. In particular we predict hyper-editing of Drosophila melanogaster repeat elements in vivo at levels previously described only in vitro, and provide validation by Sanger sequencing sixty-two individual cloned sequences. We find that hyper-editing is concentrated in genes involved in cell-cell communication at the synapse, including some that are associated with neurodegeneration. We also find that hyper-editing tends to occur in short runs.

Conclusions

Previous studies of RNA hyper-editing discarded ambiguously aligned reads, ignoring hyper-editing in long, perfect dsRNA – the perfect substrate for hyper-editing. We provide a method that simulation and Sanger validation show accurately predicts such RNA editing, yielding a superior picture of hyper-editing.

Background

The advent of deep sequencing methodologies has opened up new opportunities to study non-coding RNA. Of particular interest are repetitive elements that form double-stranded (ds) RNA when transcribed. Long, perfect dsRNA stimulates innate immunity, regulates gene transcription, and has been implicated in a variety of neurological and autoimmune disorders [1, 2]. The fate of dsRNA depends on its interaction with RNA binding proteins. Possible fates include cleavage by dicer leading to gene silencing [3], suppression by TDP-43 related proteins [4], which have been been implicated in neurodegenerative disease [5], and hyper-editing by adenosine deaminase acting on RNA (ADAR) enzymes.

Recent evidence suggests that ADAR inhibits RNA interference [6] and the induction of innate immunity [7]. Both of these interactions seem to occur when ADAR competes with other enzymes for dsRNA substrates. The large number of I-U mispairs introduced by hyper-editing likely destabilizes those substrates, making them unavailable to other dsRNA enzymes [8]. Finding the genomic sources of hyper-edited dsRNA and describing the pattern of editing therein will improve our understanding of how ADAR functions within dsRNA pathways. It will also reveal highly expressed long, perfect dsRNA that may be important for other cellular pathways.

RNA editing by ADAR enzymes was first recognized for its exquisite specificity in modifying particular adenosine (A) residues to inosine (I) in structured double-stranded regions of pre-mRNAs. Because inosine is recognized as guanosine (G) by all cellular machines, including the ribosome, specific editing has the potential to change the amino acids encoded by mRNA. However ADAR enzymes have another activity on (nearly) perfect dsRNA: Hyper-editing can convert up to 50% of adenosines to inosine within the double-stranded region [9]. Hyper-editing of endogenous RNAs was first reported in human ALU elements, a class of transposable elements comprising about 10% of the human genome sequence and numbering over one million copies [10]. Analyses of hyper-editing revealed far more editing sites in repetitive elements than the known examples of specific editing in protein-encoding RNAs. As next-generation sequencing has become cheaper, new studies, using new sequencing methods and analyses, have increased the number of known editing sites and expanded our understanding of ADAR activity [1115].

However, the ability of these studies to find hyper-editing in long, perfect dsRNA and to accurately estimate the level of editing at a given hyper-edited position is limited: They must discard reads that have no best alignment to a reference genome or risk widespread false positive predictions. By its nature, long dsRNA consists of a sequence followed by its reverse complement. This self-complementarity ensures that a read originating from the interior of such a molecule will align equally well on both the forward and reverse strand. To make matters worse, dsRNA is most likely to appear when repetitive elements are present, forming when two copies occur nearby but in opposite orientation or within certain self-complementary sequences. Thus, reads originating from within a long, perfect duplex are unlikely to have a single best alignment. Therefore, methods that discard reads with ambiguous alignment cannot provide a full picture of hyper-editing. Long read sequencing technologies do present a possible solution. However methods that can align short hyper-edited reads to dsRNA are needed, because short read data sets are cheap and ubiquitous.

To confront this challenge, we employed a probabilistic model that iteratively aligns reads and finds novel sequence changes, including hyper-editing and SNPs. While hyper-editing is the focus of this application, we also address other sequence modifications that may be confused with hyper-edits. For this purpose we used a three-component Dirichlet mixture model [16] to separate SNPs, hyper-edits, and positions that only differ from the reference by read error. We also estimate expression levels to further refine our alignment.

Here we present RepProfile, an algorithm that employs the expectation maximization (EM) algorithm [17] to find the read alignments, SNPs, hyper-editing patterns, and expression levels that are most likely under our model. This EM algorithm alternates between averaging over hidden variables (in this case, the alignment) in its E-step and estimating the hyper-editing, SNPs and expression (henceforth called the genome profile [18]) that maximize the likelihood of those averages in its M-step. While the alignment of a read to the reference genome may be ambiguous, as the algorithm refines its estimate of the genome profile, the probability of the correct alignment can grow to a point of near certainty if enough informative positions (nucleotides that distinguish between repeat copies) are identified, even when the genomic sequences of repetitive elements are identical. Because the expected alignment must be recalculated at each E-step, RepProfile is potentially computationally intensive. Thus, RepProfile is built to consider one repeat family at a time. A widespread analysis can be done by running RepProfile on many repeats in parallel.

Several methods have been proposed that consider read alignment and inference jointly, but none make use of novel sequence variation to improve alignment. TEtranscripts [19] uses the EM algorithm to learn expression levels in repetitive sequence. The algorithm of Wang et al. [20] is similar, but uses Gibbs sampling instead of EM and is designed for application to Chip-seq. The algorithm of Parks et al. [21] considers how genomic rearrangement affects read alignment. We are, as far as the authors know, the first apply such methods to position variation, including SNPs and hyper-editing.

Results

RepProfile was used to predict hyper-editing in transposable elements from 2x100bp Illumina sequence reads from whole head Drosophila melanogaster RNA. RepProfile was run on each transposable element (TE) family in parallel. This included all repeats in the UCSC genome browser (genome.ucsc.edu) repeatmasker track except simple repeats, low complexity repeats, rRNA and satellites, a total of 29 megabases. Hyper-editing is not limited to TEs, but they are a common source of dsRNA, and RepProfile was designed to find hyper-editing in TEs. Repeat families that are a prefix of other families were merged. Thus, for example, PROTOP, PROTOP_A and PROTOP_B were considered together. Similarly the LTR and interior portions of RNA transposons were merged. RepProfile aligned 8.3 millions reads (totaling 1.66 gigabases) and predicted a total of 30,185 edit sites.

In this section we focus on predictions in FB4_DM, PROTOP and DNAREP1_DM. RepProfile predicts the most widespread hyper-editing in FB4_DM. Hyper-editing of PROTOP repeats was already described in [6], and DNAREP1_DM shows how imperfect helices can be hyper-edited. A note about each family with at least 1000 predictions can be found in the Additional file 1. A full list of all predictions can be found in Additional file 2: Table S3. We use simulation and clone validation to show the accuracy of RepProfile.

Simulations support accuracy of RepProfile

RepProfile and competing methods were tested against three different simulations of hyper-editing. In the first reads are simulated from a hypothetical repeat family consisting of 24 identical copies of a random 1kb sequence: 20 isolated copies (10 in each orientation) and 2 oppositely oriented pairs that are simulated to be hyper-edited (one on each strand). In the second, we simulate reads from the FB4_DM repeat, including editing only at the sites observed in our clone data (see below). In the third simulation, FB4_DM repeats are chosen at random to be hyper-edited. In both FB4_DM simulations, reads are simulated in proportion to observed expression levels.

Reads drawn from the hypothetical repeat family show that RepProfile is able to provide accurate alignment to highly repetitive sequence. Because the genome sequence of these repeats are identical, no read that falls entirely within a repeat has a unique alignment to the hypothetical repeat reference. Nevertheless, RepProfile is able to align 90% of reads that fall entirely within one of the hyper-edited duplexes to the learned profile with mapping quality 30+ (estimated probability of misalignment ≤0.1%). 99.9% of these reads are aligned correctly, allowing RepProfile to predict editing sites with high sensitivity and PPV (Table 1).

Table 1 Sensitivity and PPV for RepProfile and competing methods in the three different simulations

In addition to RepProfile, we predicted edit sites using the method of Porath et al. [12] and by finding editing enriched regions (EERS) [15], either using all reads (+rep) or only using reads for which at least one end aligns uniquely (uniq). Neither of these methods uses a similar strategy to RepProfile. In particular, the Porath et al. method considers only reads with a large number of A to G changes. However there are few published methods showing success predicting RNA-editing in repeats, and the comparison shows that hyper-editing of long, perfect dsRNA cannot be found using a simpler method. Table 1 shows sensitivity and positive predictive value (PPV) for each method applied to each simulation. RepProfile provides the highest sensitivity in all three simulations, while maintaining a PPV of 99% or above. RepProfile also provides accurate estimation of the editing level across all simulations (see Fig. 1).

Fig. 1
figure 1

Agreement between simulated and predicted RNA-editing levels. Level of RNA-editing in simulation vs prediction for each position that is either simulated or predicted to be edited. a Edits simulated from hypothetical exact repeat copies. b Simulation using edits predicted by clone sequence. c Simulation of hyper-editing at random FB4_DM. Because this simulation reflects the varied coverage of an actual RNA-seq experiment, many simulated edit sites are covered by few of no reads (points along x axis)

In the clone simulation, editing occurs at A/G informative positions. Thus hyper-edited reads can align uniquely but incorrectly, explaining the diminished PPV when finding EERs with unique reads. RepProfile’s diminished sensitivity in the random hyper-editing simulation is due to the fact that many simulated edit sites occur in low coverage regions. As repeats accumulate mutations, unique alignments become possible, but the repeats also tend to lose their dsRNA structure. Because edit sites are not limited to dsRNA in the random hyper-editing simulation, more hyper-edited reads align uniquely, allowing the competing methods to perform better.

RepProfile predictions are validated by Sanger clones

A hyper-edited FB4_DM in the gene retinal degeneration A (rdgA) was chosen for validation. Not only is this repeat not unique, it is also internally repetitive, making alignment particularly challenging. rdgA is expressed almost exclusively in the nervous system [22], as is dADAR protein. 62 sequences were generated by cloning RT-PCR amplicons from the sequence spanning chrX:8,928,544-8,929,835 in the dm6 genome assembly (available from the UCSC genome browser: genome.ucsc.edu). The cloned sequences showed a small deletion spanning chrX:8,928,786-8,929,278 and so the FB4_DM reference was updated to include this deletion.

The Sanger sequences confirm the pattern of hyper-editing predicted at this locus, with each clone displaying a distinct pattern of edited sites. Of 322 adenosines in the clone region, editing is observed at 280 positions (87%) in at least 1 clone. Each clone is edited at an average of 76.7 adenosines (24%). RepProfile predicts editing at 269 of the 280 positions edited in the clone sequences (sensitivity = 96%). RepProfile predicts editing at an additional 11 sites, yielding a PPV of 96%. As Fig. 2 a shows, RepProfile accurately predicts editing levels. Among positions that show evidence for editing both in RepProfile and in the clones, RepProfile overestimates editing by 6%, with a standard deviation of 11%. Figure 2 b provides a site-by-site comparison of predicted and validated editing.

Fig. 2
figure 2

Agreement between RepProfile and clone validation. a Fraction of editing predicted by clones vs fraction predicted by RepProfile. RepProfile and clones show good agreement in the rdgA FB4_DM element. b RNA-editing predictions aligned to RNA secondary structure as predicted by RNAstructure [39, 40]. Positions are colored so that aligned pairs are the same color. Unpaired positions are not colored. Both methods predict hyper-editing only in the helical portion of this repeat, consistent with the fact that ADAR is a dsRNA binding protein. c Explanation of RNA structure coloring. A gradient from red to green to blue is stretched across the length of the sequence. Paired positions are colored according to the minimum of their position and the position they are paired to. Unpaired position are not colored

Using reads for which at least one end aligns uniquely, the method of EERs [15] predicts only 31% of the cloned edit sites. All predictions made by the EER method are supported by the clone validation. Applying the method of Porath et al. [12] to our data yields 17 editing sites in this FB4_DM element, but fails to find any editing in the cloned region. These sensitivity estimates are lower than those estimated in simulation, indicating that there are additional alignment challenges not included in our simulation. Similarly our Helicos single-molecule sequencing (SMS) results [13] find 2 tier 1 and 10 tier 2 edit sites in this element, but none in the cloned region. Rodriguez et al. [11] and Ramaswami et al. [14] fail to predict any editing in this FB4_DM. The lack of FB4_DM hyper-editing in these published lists is unsurprising as they all rely on unambiguous alignment of short reads.

FB4_DM repeats are highly hyper-edited

The FB4_DM sequence is almost entirely a perfect inverted repeat, which has the capacity, if transcribed, to fold back and form long, (nearly) perfect dsRNA (see Additional file 3). Thus, transcripts containing FB4_DM in pre-mRNA are potentially excellent ADAR hyper-editing substrates. Indeed, RepProfile predicts frequent hyper-editing of FB4_DM elements.

In addition to the element in rdgA described above, there are four other FB4_DM that are predicted to be hyper-edited by RepProfile with highest confidence (see discussion). These predictions appear in the genes no-long-nerve-cord (nolo), Pur-alpha, Maf1 and rolled (rl). Across these five repeats (including rdgA), 1681 editing sites are predicted. Interestingly, like rdgA, these genes are known to be involved in proper cell-cell communication in the nervous system, particularly in the correct function of synapses [2326]. Two (Pur-alpha [24] and rdgA [22]) are associated with neurodegneration. Seven more FB4_DM (see Table 2) are predicted to be edited by RepProfile at slightly lower confidence (see Discussion). This brings the total number of predicted edit sites to 4384. Five of these seven are also in genes that have been shown (or are predicted) to play roles in proper neuronal maintenance and function [2731]. Two examples of FB4_DM hyper-editing are shown in Fig. 3.

Fig. 3
figure 3

RepProfile predictions for two FB4_DM. Alignment of RepProfile predictions to RNAstructure [39, 40] secondary structure predictions. Secondary structure is colored as in Fig. 2b/c. a An FB4_DM element in the gene nolo. b An FB4_DM element in the gene rolled (rl)

Table 2 FB4_DM that are predicted to be hyper-edited. Predictions with a yes in the last column are RepProfile’s most confident predictions

There are seven additional genes containing FB4_DM elements that are predicted to form dsRNA, but are not predicted to be edited: CG11873, CG17600, CG42238, kek5, kirre, Pka-R1, vtd. Only two (Pka-R1 and vtd) of these genes are annotated with neuron-related GO terms in flybase [32] (as of January 1, 2017). It is possible that while these genes are edited in neurons, the edited reads are overwhelmed by transcription in cells that do not express ADAR. Alternatively, these RNA duplexes may be targeted by another dsRNA binding protein, making them unavailable to ADAR.

DNAREP1_DM repeats form imperfect helices that are partially edited

While shorter (up to 500 nt) the 5,802 DNAREP1_DM elements in the fly genome play an analogous role to that of ALU repeat elements in the human genome. As with ALU repeats, it is not uncommon for two DNAREP1_DM elements to be oriented in opposite directions in the same gene, or even to be opposite and adjacent. However DNAREP1_DM instances tend to be quite divergent, and so these DNAREP1_DM form imperfect helices that are edited to a lesser extent than FB4_DM. Figure 4 shows the hyper-editing and structural predictions for two pairs of DNAREP1_DM that are adjacent and opposite in orientation. Table 3 lists all the DNAREP1_DM that are predicted to be hyper-edited. RepProfile predicts 685 edited positions in DNAREP1_DM. Many of these genes are also relevant to the nervous system [3336].

Fig. 4
figure 4

RepProfile predictions for two DNAREP1_DM. Alignment of RepProfile predictions to RNAstructure [39, 40] secondary structure predictions. Secondary structure is colored as in Fig. 2b/c. a A complementary pair of DNAREP1_DM elements in the gene CG17684. b A complementary pair of DNAREP1_DM elements in the gene Myo-81f

Table 3 DNAREP1_DM that are predicted to be hyper-edited

Alignments to PROTOP show hyper-editing of the previously described Hoppel killer element

The most probable solution found by RepProfile includes 38 hyper-edited PROTOP, PROTOP_A and PROTOP_B elements containing a total of 4326 edit sites. Here we focus on the five most confident predictions (see Discussion). All five are pairs of PROTOP(A/B) that are adjacent but opposite in orientation (see Table 4). These repeats contain a total of 973 predicted edit sites, 697 of which are in the Hoppel killer (Hok) element [6].

Table 4 PROTOP(A/B) that are predicted to be hyper-edited at the highest confidence

Our previous work [6] demonstrated that dADAR proteins, as well as other dsRNA-binding proteins, localize to the Hok element in vivo, but using SMS we only found a small number of editing sites in Hok [13]. However with RepProfile we are able to predict drastically more editing – a result that is more in line with the strong evidence of ADAR activity at this locus. Figure 5 shows predicted editing for Hok. Hok contains three highly similar PROTOP_A elements, so there may be structural conformations other than the one illustrated in Fig. 5.

Fig. 5
figure 5

RepProfile predictions for the PROTOP_A elements that form Hok. Alignment of RepProfile predictions for Hok to the RNAstructure [39, 40] secondary structure prediction. Secondary structure is colored as in Fig. 2 b/c

ADAR edits in short runs

FB4_DM sequences contain long strings of consecutive adenosines, sometimes more than ten adenosines long. We analyzed the hyper-editing of consecutive adenosines on a read-by-read basis to understand how ADAR edits long, perfect dsRNA. When analyzing runs of edited adenosines that are followed by another base, we do not not know whether the run would have continued had there been more adenosines to edit. Thus we have (a discrete version of) the lifespan estimation from censored data problem analyzed by Kaplan and Meyer [37]. Runs of edited adenosines that are followed by a base that is not adenosine are considered to be censored, as the run may have continued were there more adenosines to edit. The hazard function,

$${} \begin{aligned} P(\text{run}\,\,\text{length}&=n|\text{run}\,\,\text{length} \geq n)\\ &\approx \frac{\#n\,\mathrm{long,}\,\text{uncensored}}{(\#n\,\mathrm{long,}\,\text{uncensored})+(\#>n\,\text{long})} \end{aligned} $$

is shown in Fig. 6. Short runs are less likely to end than would be predicted from context alone. However as the run gets longer the probability that the run will end increases. This is consistent with the theory that as ADAR edits, it disrupts the dsRNA structure, introducing I-U mispairs and making future editing less likely.

Fig. 6
figure 6

Estimated hazard function for length of editing run. The estimated probability that a run of FB4_DM editing will end after a given number of edited A’s, with 95% binomial confidence intervals. The red line is the marginal probability that an A following an A at helix depth at least 25 will not be edited in a particular read. The estimates indicate that as a run of edits gets longer, it is more likely to end

Short helices are rarely edited; the longest helices are edited most

To measure the affect of dsRNA structure on editing, we measure the length of helices in the five most confident FB4_DM hyper-editing predictions (Table 2) and the five adjacent +/- DNAREP1_DM repeats that are predicted to be hyper-edited (Table 3). Helices are allowed to include bulges of up to two bases on one or both sides of the helix, as small bulges have been shown not to interrupt hyper-editing [38]. Structure predictions are by RNAstructure [39, 40].

We find that editing is rare in short helices and that it is most frequent in long helices. RepProfile predicts editing at only 13% of adenosines in helices that are fewer than 18 basepairs long, but at 80% of adenosines in helices that are longer than 64 basepairs. This is consistent with evidence that ADAR does not bind to helices shorter than 15–20 basepairs and is most efficient when editing helices longer than 100 basepairs [9]. Figure 7 shows editing binned by helix size.

Fig. 7
figure 7

Probability that editing will be predicted as a function of helix size. a The fraction of A positions at which any amount of editing is predicted, binned according the helix size, allowing bulges of of up to two bases for the five most confident FB4_DM hyper-editing predictions (Table 2) and the five adjacent +/- DNAREP1_DM repeats that are predicted to be hyper-edited (Table 3). Error bars are 95% binomial confidence intervals. b Mean fraction of editing, binned according the helix size as in part A. Error bars are 95% normal approximation confidence intervals. Note that in both A and B, larger bins are supported by many positions but only a few helices, so confidence intervals may be overly tight. Structure predictions by RNAstructure [39, 40]

In the 819 basepair rdgA FB4_DM helix (the longest in this analysis), 84.1% of adenosine positions are predicted to be edited, but in individual transcripts only 27.7% of adenosines are edited on average. This editing of 27.7% of adenosines is less than the 50−60% seen in vitro [9]. This difference could be because the editing reaction is not allowed to complete in vivo, because transcripts are sequenced before they are fully edited, or because some copies are sequenced from cells with low levels of ADAR.

The nucleotide context of our predictions reflects known ADAR preferences

We investigated the effect of preceding and following bases on the fraction of editing at an adenosine (A) position. We consider only positions that are in the five most confident FB4_DM predictions and are also at least 25 bases away from the nearest unpaired position – a total of 865 adenosines. Figure 8 shows the fraction of editing for sites in each three-base context.

Fig. 8
figure 8

Fraction of editing at predicted edit sites as a function of three-letter context. a The three-letter context centered at the edited base. The 5’ base is much more predictive of editing level than the 3’ base. b The three-letter context ending at the edited base. The base two back from an edited position has only a small effect

The preceding base has a strong effect on the fraction of editing. Predicted edit sites following T are edited in the highest fraction of reads (mean = 0.35). Sites following A are slightly less likely to be edited (mean = 0.29). Sites following C are much less likely to be edited (mean = 0.08) and sites following G are rarely edited (mean = 0.03.) Each pairwise comparison has t-test BHY [41] FDR less than 0.003. Consistent with evidence that ADAR edits in runs, this 5’ preference affects the following base: Adenosines preceded by AA or TA are edited more often than adenosines preceded by CA or GA (pairwise FDRs all less than 0.015.)

The following base has a smaller effect on the fraction of editing. Adenosines followed by G are edited most often (mean=0.35), followed by A (mean=0.26), C (mean = 0.24) and T (mean = 0.21.) However the only statistically significant result is that adenosines followed by G are more likely to be edited (p value = 0.0018.) Our results for 3’ and 5’ base preferences agree with those found in previous studies [4245].

Run time per RepProfile step scales with the number of candidate alignments; the number of steps depends on the amount of editing

For most repeats, calculating the probability of each candidate alignment in each E step forms a bottleneck. Thus the time to complete a single EM step scales with the number of candidate alignments (Fig. 9a). This makes run time difficult to predict a priori as the number of candidate alignments depends not only on the number of reads, but also on the number of candidate alignments per read. For example, there are five times as many reads that align to DNAREP1_DM repeats as there are reads that align to FW_DM repeats. However predicting hyper-editing in DNAREP1_DM takes less than half as long per step, because FW_DM repeats are much more similar to one another than are DNAREP1_DM repeats. Alignment probabilities are calculated independently, making the E step highly parallelizable. For FB4_DM, a single EM step takes 1353, 664, 365, 215, and 131 seconds on 1, 2, 4, 8, and 16 cores, respectively.

Fig. 9
figure 9

EM step run time. EM run time on 8 cores. a The RepProfile bottleneck occurs during the E step where about 16,608 candidate alignments are processed per second. b The red line indicate the minimum number of steps run by RepProfile. More steps are run if many local maxima are found. c RepProfile runs time vary widely

While EM usually converges in a small number of steps, as it runs, RepProfile suggests new initial conditions to explore alternate hyper-editing solutions. At minimum one initial condition is tried for each hyper-edited repeat (Fig. 9 b). However RepProfile will continue trying new initial conditions if a more likely solution is found with a different set of hyper-edited repeats. Most TEs (about 80%) run in 30 minutes or less on 8 cores, but six TEs required run times of a day or more (Additional file 2: Table S2, Fig. 9 c). The most computationally intensive repeat, PROTOP(A/B), required about 4 days of run time with 815 steps. Running RepProfile on all TEs required a total of 16 node-days at 8 cores per node.

Summary

RepProfile provides accurate RNA hyper-editing predictions that are validated both by simulated data and by individual sequence clones. Our analysis reveals hyper-editing that is not – indeed we argue cannot be – found by other methods. In particular, we highlight the hyper-editing of long, perfect dsRNA formed by FB4_DM elements – a repeat whose relevance to hyper-editing was not previously known – in the introns of genes with synaptic function. We also estimate the level of editing – something that other methods do not do. In addition to finding many hyper-editing events in long, perfect dsRNA, our results show that ADAR often edits a run of adjacent adenosines; that editing is rare in helices less than 20 base pairs long but becomes more frequent as helix length increases; and that, consistent with previous findings, adenosines are more likely to be edited if they follow A or T than if they follow C or G.

Discussion

The successes of RepProfile, both in simulation and validation, show that short reads can predict RNA editing even when standard alignment techniques cannot produce confident alignments. Even if repeats are locally identical, they are likely to form different RNA secondary structures in the context of different transcripts, leading to unique editing patterns. Additionally there may be cell-specific factors that further differentiate hyper-editing patterns. Thus, when endogenous dsRNAs are “marked" by ADAR modification with a unique editing pattern, RepProfile can distinguish between identical repeats.

As far as the authors know, RepProfile is the only tool capable of using RNAseq data to accurately find RNA hyper-editing (or position variation in general) within sequences that form long, perfect dsRNA. RepProfile reveals RNA duplexes with hundreds of edited positions, where other methods, reliant on unambiguous alignment to single reference genome, find few or No sites. Because almost all RNAseq analysis methods rely on unambiguous alignment to a reference genome, it is likely that many studies have missed valuable insights regarding dsRNA. This is especially important for RNA hyper-editing as hyper-editing only occurs in dsRNA. While previous studies have been able to describe hyper-editing events, their descriptions are limited to dsRNA molecules that contain sufficient imperfections (bulges) for unambiguous alignment.

The major challenge for EM applications, such as RepProfile, is that EM is only guaranteed to find a local maximum, which may or may not be the global maximum. Thus, EM must be run with a variety of initial conditions in order to be confident that the global max has indeed been found. When applying RepProfile to real RNAseq reads, we often find several maxima, leading to the distinction between highly confident and regular hyper-editing predictions. The highly confident predictions are repeats that are predicted to be hyper-edited in all maxima. Highly confident predictions tend to occur when RepProfile can align paired reads such that one end is aligned to the hyper-edited repeat and the other is aligned outside the repeat. For FB4_DM, coverage of the flanking sequence is 13 times higher for highly confident predictions (vs 4 times higher for the repeat itself). As the flanking sequence tends to be more unique than the repeat itself, it is difficult to be sure which repeat is hyper-edited without these flanking reads. The failure to align reads outside the repeat could be due to repetitiveness in the flanking sequence, inaccuracies in the reference, or simply because of low coverage.

While repeatedly realigning reads allows for accurate predictions of hyper-editing in repetitive elements, it is time consuming. Thus RepProfile only considers one repeat family at a time. As each repeat can be considered in parallel, it is possible to use RepProfile to predict Drosophila melanogaster hyper-editing genome-wide. Even so, for prevalent, highly repetitive repeats such as PROTOP(A/B), RepProfile can take days to run. In a larger genome, such as the human genome, there may be repeats that are even more computationally challenging. Thus improvements may be necessary when applying RepProfile to a large genome. Such improvements could be made by creating new c extensions or by selectively updating alignment probabilities. RepProfile is written entirely in Python (with heavy computation done by numpy) and does not check whether the profile has changed significantly before recalculating alignment probability at each step.

Our analysis provides insight into how ADAR edits at the molecular level. We find that ADAR is more likely to edit adjacent adenosines, but is less likely to extend long runs of editing. This indicates that ADAR edits processively, but that as it edits it destabilizes the helix, causing the enzyme to detach. The finding that short bulges do not interrupt RNA-editing [38] explains why ADAR activity does not slow until the editing run lengthens. Our data confirms the results of in vitro experiments showing that ADAR does not readily bind to short helices and that long dsRNA is required for maximum editing efficiency [9]. We also confirm the highly replicated result that ADAR has a strong 5’ preference for A or U over G or C, but weak 3’ preferences [4245]. Good agreement with these results provides further evidence that RepProfile gives an accurate picture of RNA hyper-editing.

The predictions made by RepProfile point to the possibility that hyper-edited TEs play a functional rule – a question that deserves further investigation. Most of our predictions, indeed all of our most confident FB4_DM predictions, are in highly-conserved genes with synaptic functions in both invertebrates and vertebrates. This similarity of function is not likely to arise by random transposable element (TE) insertion, providing evidence for the domesticated use of TEs to regulate neuronal gene expression. Of course this is merely an observed correlation and it possible that some property of neuronal genes coincidentally facilitates hyper-editing of TEs. Our results provide a baseline picture of hyper-editing in these genes. These sites can now serve as targets for future studies investigating how hyper-editing is controlled and how it affects gene regulation.

While we have used RepProfile to predict hyper-editing, it also uses SNPs and expression levels to differentiate between repeats. In addition to the hyper-editing application described here, we envision that RepProfile is capable of finding unreported SNPs in repetitive sequence and of reconstructing the sequences of novel transposable element insertions. Indeed any sequencing experiment relies on an accurate alignment, and our results show that RepProfile can provide high quality alignment to repeats. The need for higher quality alignments may be especially great in differential expression experiments where failing to account for variation can lead to biased results [46]. Thus, RepProfile has the potential to improve a wide range of RNAseq experiments.

Conclusion

It is often not possible to unambiguously align a single read, considered in isolation, to a repetitive reference genome. As a result, most analysis pipelines only consider unique regions of the genome, failing to provide any results about long, perfect dsRNA. Not only is such dsRNA the prime target for ADAR, proper regulation of dsRNA, in which ADAR plays a crucial role, is necessary for normal neuronal function. RepProfile provides accurate hyper-editing predictions in dsRNA, showing that, in the case of RNA editing at least, unambiguous alignment (to a reference genome) is not necessary for accurate inference. By building a complete probabilistic model that not only considers the information that aligned reads provide about hyper-editing, but also the information that hyper-editing provides about those read alignments, we are able to provide a more complete and more accurate picture of how ADAR edits endogenous dsRNA. We find that ADAR edits in short runs, and we observe the most hyper-editing in FB4_DM repeats that are in the introns of genes with synaptic functions, two of which are associated with neurogeneration, hinting at a regulatory role for hyper-editing. Previous studies of RNA editing in Drosophila melanogaster have failed to identify hyper-editing in this repeat, showing that a method, such as RepProfile, that accurately aligns short reads to dsRNA is necessary to begin teasing apart dsRNA pathways, and to understand the regulatory role of ADAR.

Methods

RepProfile Algorithm

Glossary of Random Variables:

  • R=R 1,…,R m is the set of m Read sequences.

  • A=A 1,…,A m is the Alignment of each read.

  • X=X 1,…,X r is the relative eXpression of each repeat, where r is the total number of repeats.

  • G=G 1,…,G n is the Genome profile (probability of A/C/G/T sequenced) at each position in each repeat, where n is the total number of positions.

  • H=H 1,…,H r are the Hyper parameters, representing underlying sources of variation in G (hyper-editing in our applictation).

  • T=T 1,…,T n are the variation Types of each position such as SNPs or edited positions.

  • U(A,R)=U 1,…,U n are the number of A, C, G, T aligned at each position.

  • V(A,R)=V 1,…,V r are the number of reads aligned to each repeat.

The hierarchy in Fig. 10 generates a probability distribution across read sequences. Repeat expression levels, modeled on the left side, combine with nucleotide variations, modeled on the right, to generate read sequences. We can use EM to maximize the joint probability, which as the two are proportional, also maximizes the posterior distribution conditioned on the observed read sequences. To streamline the computation, only potential alignments suggested by a standard aligner are considered.

Fig. 10
figure 10

Bayesian network for RepProfile probability distribution. Directed graph describing the probability distribution used by RepProfile as a Bayesian network. Reads, R, depend on the profile, G, and the alignment, A. The profile depends on the position types, T (e.g. edited position), which depend on the repeat types H (e.g. hyper-edited repeat.) The alignment A depends on expression levels, X

The model of nucleotide changes begins with a repeat genome consisting of all copies (repeat elements) of a particular repeat in an organism’s reference genome. Each repeat element, k, is in one of several states, H k . In our analysis of hyper-editing, the states are: hyper-edited on the forward strand, hyper-edited on the reverse strand and not hyper-edited (Fig. 11 a). Similarly, each genomic position, i, within each repeat is in one of several states, T i . In our hyper-editing model, the states are: edited, SNP and neither (Fig. 11 b). The probability of a particular position being in a particular state depends on the repeat state, H k(i). For instance an adenosine can only be in the edit state, if it is in a repeat that is hyper-edited on the forward strand. Conditioned on the state, T i , we model G i (x), the probability of nucleotide x, at position i, using a Dirichlet distribution – a distribution of distributions over the four-letter nucleotide alphabet {A,C,G,T } (Fig. 11 c). Thus, G i (x) is sampled from a mixture of Dirichlets, with T i being the mixture component. See the Additional file 4 for the exact parameter values that define P(H,T,G).

Fig. 11
figure 11

Cartoon description of RepProfile probability distribution. a There are many repeat copies in the genome. Each has some probability of being hyper-edited. H tells us which are hyper-edited and which are not. b Within each repeat, each position has a chance of being a SNP and each A in a hyper-edited repeat has a chance of being edited. T indicates which positions are SNPs and which are edited. c Any base can be sequenced at any position. G gives the distribution over A/C/G/T at each position. d Each repeat has a probability of generating the next read (defined by X), and A is the specific aligned position for the read. The read bases, R, are drawn from the corresponding distributions in G

In parallel, a distribution across genomic positions, X, defines the probability of sequencing a read that starts at a particular position. X is assumed to be constant across positions in a single repeat element. Thus, X can be thought of as defining the relative expression level of each repeat element. Alignments, A 1,A 2,…,A m , are drawn from X (Fig. 11 d). Insertions and deletions are inserted according to an affine gap probability. Given a set of aligned positions, A j , and distributions over nucleotides at those positions, \(G_{A_{j}}\), the probability of a read sequence, R j , is the product across read letters of the probability of those read letters: \(\prod G_{A_{j}}(R_{j})\). The assumption here is that, conditioned on the profile, the read sequence at each position is independent. This assumption is contradicted by our own observation that ADAR edits in short runs. However assuming this independence is necessary for efficient computation, and our simulation and validation shows that this assumption does not prevent accurate estimation. Distinguishing repeats that are hyper-edited from those that are not (H) allows us to preserve the key dependence: that edit sites tend to localize.

After conducting an RNAseq experiment, we observe a read set R. If we assume that R is drawn from the distribution described above then we can use EM to estimate X,H,T and G by maximizing P(X,H,T,G,R)P(X,H,T,G|R), and treating the alignment, A, as a hidden variable.

If we reparameterize A and R to new random variables: U that counts the number of A/C/G/T aligned each position and V that counts the number of reads aligned to each repeat, P(A,R|G,X) becomes an exponential family distribution:

$$\begin{array}{*{20}l}{} P(A,R|G,X) &\propto h(A) \prod_{k} X_{k}^{V_{k}(A)} \prod_{i=1}^{n}\prod_{x \in \{a,c,g,t\}}G_{i}(x)^{U_{i}^{x}(A,R) }\\ &= h(A)\exp [(U,V)\cdot(\log G, \log X)] \end{array} $$

where \(U_{i}^{x}(A,R)\) is the number of nucleotide x aligned to position i, V k (A) is the number of reads aligned to repeat k and h(A) are indel probabilities. To perform an EM update, we need to calculate the following quantity:

$${} \begin{aligned} &G^{(t+1)},T^{(t+1)},H^{(t+1)},X^{(t+1)}\\ &\quad\quad=\underset{G,T,H,X}{\arg\max}{\mathbb{E}}\,[\!\log P(A,X,H,T,G,R)|R,G^{(t)},H^{(t)},X^{(t)} ] \end{aligned} $$

As Dirichlet distributions are conjugate to the exponential family above, the maximization can be completed as follows. First we consider terms depending on X:

$$\underset{X:\sum X=1}{\arg\max}(\alpha_{X}-1+{\mathbb{E}}_{A} [\!V])\cdot \log X=\frac{\alpha_{X}-1+{\mathbb{E}}_{A} [\!V]}{\sum \alpha_{X}-1+{\mathbb{E}}_{A} [\!V]} $$

where α X are the Dirichlet parameters and \({\mathbb {E}}_{A}\) is expectation over A conditioned on (R,G (t),H (t),X (t)).

Next, for fixed T we can maximize over G:

$$\begin{aligned} \underset{G:\sum G_{i} =1}{\arg\max}(&\alpha_{T}-1+{\mathbb{E}}_{A} [\!U])\cdot \log G\\ &=\frac{\alpha_{T}-1+{\mathbb{E}}_{A} [\!U]}{\sum\alpha_{T}-1+{\mathbb{E}}_{A} [\!U]}=\hat{G}_{T} \end{aligned} $$

Then we can maximize T for a given H:

$$\underset{T}{\arg\max} \log P(T|H)- \log Z_{T} +(\alpha_{T}-1+{\mathbb{E}}_{A} [\!U])\cdot \log \hat{G}_{T} $$

where Z T is the Dirichlet normalization constant for f(G|T). Finally, we can find the optimal value of H for each repeat and then work backwards to get T and G. More details can be found in the Additional file 4.

It remains to calculate \({\mathbb {E}}_{A} [\!U]\) and \({\mathbb {E}}_{A} [\!V]\), which is done by summing the counts for each individual alignment:

$$\begin{aligned} {\mathbb{E}}_{A} [\!U]&=\sum_{j} {\mathbb{E}}_{A_{j}} [\!U(A_{j})]\\ &=\sum_{j}\sum_{A_{j}}P\left(A_{j} |R_{j},G^{(t)},X^{(t)}\right)U(A_{j}) \end{aligned} $$
$$\begin{aligned} {\mathbb{E}}_{A} [\!V]&=\sum_{j} {\mathbb{E}}_{A_{j}} [\!V(A_{j})]\\&=\sum_{j}\sum_{A_{j}}P\left(A_{j} |R_{j},G^{(t)},X^{(t)}\right)V(A_{j}) \end{aligned} $$

However it is not computationally feasible to sum over all possible alignments A j for all reads j by brute force. In Hidden Markov Models, the forward and backward sum algorithm is usually used to achieve computational feasibility. However it is still O(mnq) for a repeat genome of length n and a dataset of m reads of length q. Thus it is still not feasible when reads number in the hundreds of millions. Fortunately for most bases at most positions, the probability of that base at that position will be small. Thus most of the possible alignments will have probability near 0. We can approximate the sum over alignments by considering only a small number of candidate alignments. In the case of hyper-editing, we allow candidate reads to have any number of A to G mismatches but only four other mismatches.

While each step of EM is guaranteed to produce a larger value of P(G (t),H (t),T (t),X (t)|R), the process is not guaranteed to converge to the global maximum. In some cases, EM gets stuck at a local maximum. In many applications, EM is run many times with many different initial conditions. The solution that gives the largest value of P(R,G,H,T,X) is taken. As RepProfile runs, it creates new initial conditions by removing hyper-editing from each repeat one at a time. EM is run again for each new initial condition and the solution with the best likelihood is chosen. Trying initial conditions with fewer hyper-edited repeats balances the fact that expected counts tend to spread variation across repeats in early EM steps and allows us to settle on a set of highly confident predictions.

Drosophila stocks

Drosophila strains were raised at a constant 25 °C, on standard molasses food, and under 12 h day/night cycles.

Cloning and RNA editing analysis

To examine RNA-editing, total RNA was extracted from heads and thoraxes (20 per sample) of 1- to 2-day-old male Drosophila. RNA extractions were performed using TRIzol reagent (Invitrogen). Total RNA was transcribed into cDNA using M-MLV Reverse Transcriptase from Promega using an rdgA specific primer: RDGD-RT3 5 -GATTAATAGCATCGCACTCGAAGTAATCCC-3 . Edited cDNAs were amplified via PCR using target-specific primers: RDGINT-F2 5 -GTATGTATGTTTATCAACACCCTCC-3 and RDGD-R3 5 -GACTTCATTCCAACGCTGTCGTTCTG-3 . The PCR product was purified using the Wizard®;SV Gel and PCR Clean-Up System from Promega (catalog number: A9282) from 1.5% agarose gel electrophoresis. Subsequently, 4 μL of PCR product was cloned into One Shot®;TOP10 Chemically Competent E. coli cells using Zero Blunt®;TOPO®;PCR Cloning Kit from Invitrogen (catalog number: K2800J10), according to manufacturer’s guidelines. A total of 50 μL solution containing the transformed cells were plated on kanamycin+agar plates. Plates were incubated overnight at 37 °C. Colonies picked from the plate were grown in kanamycin+ LB media overnight at 37 °C shaker at 200 RPM. DNA was isolated from 600 μL of culture media using PureYield™Plasmid Miniprep System, according to manufacturer’s guidelines. Finally, 2 μL of isolated DNA was used for the sequencing reaction using BigDye®;to obtain chromatograms for analysis.

LoxP RNA 100BP paired-end sequencing

Total RNA, extracted by the above procedure, was sent to Genewiz for the preparation and deep sequencing of 100bp paired-end libraries. No polyA selection was preformed, but otherwise library prep and sequencing was done according to standard Genewiz methods.

Generation of candidate alignments

The script used to process reads and generate candidate alignments can be found at https://github.com/wmckerrow/RepProfile/blob/master/utilities/make_candidate_alignments_genomic.sh. First, T’s in antisense reads are replaced with C. A’s in sense reads are replaced with G. Similarly, two masked genomic references are created by replacing A with G in one and T with C in the other. Two bam alignments are generated by using bwa aln (version 0.7.12) [47] to align masked reads to each of the masked references and subsequently merged into a single alignment. Reads for which some part of at least one of the read ends overlaps sequence labeled as the repeat of interest (FB4_DM, DNAREP1_DM or PROTOP/PROTOP_A/PROTOP_B) in the repeatmasker database (as downloaded from the UCSC table browser: genome.ucsc.edu, dm6 version) are extracted. Reads with mean base quality less than 30 were excluded. A repeat genome is generated from positions that are in or within 1kb of sequence labeled as the target repeat.

The repeat reads are aligned to the repeat genome, using the same masking procedure, this time retaining up to 10,000 secondary alignments with at most 4 mismatches (after masking.) The resulting combined bam file is sorted by read name and parsed by RepProfile.

Simulation

The hypothetical repeat family was generated as follows: A random 1kb sequence was generated and copied 24 times – 12 in each orientation. For 10 copies in each direction, a 1kb of random flanking sequence was added to each end. The other four copies were paired to form two RNA duplexes. 1 kb of random sequence was added to each end of each duplex. For the other two simulations, the FB4_DM repeat reference was used.

In the first simulation, using the hypothetical repeat, both duplexes but none of the isolated repeats are simulated to be hyper-edited. One duplex is edited on the plus strand, and one on the minus strand. In the second simulation, using FB4_DM, only the cloned region was simulated to be hyper-edited. In the third simulation, again using FB4_DM, each of 13 editable FB4_DM had a 0.3 chance of being hyper-edited. This simulation was repeated 20 times and results were pooled. All the editable FB4_DM are in highly-expressed genes and greater than 1500 bases long. Hyper-editing is simulated in the direction of gene transcription.

In the first and third simulations (excluding the clone simulation), the simulated profile was generated as follows: Within each hyper-edited repeat, each editable position has a 0.5 chance of being edited. For each edited position, p is chosen uniformly between 0.001 and 0.997. To generate the profile at an edited position, the edited base G(C) is given probability p, the reference base A(T) is given probability 0.998−p, and the other two bases are given probability 0.001. Each position in any repeat has a 0.01 chance of being a SNP. For each SNP position, p is chosen uniformly between 0.001 and 0.997 and a non-reference base, x, is chosen uniformly at random. To generate the profile at SNP positions, base x is given probability p, the reference base is given probability 0.998−p, and the other two bases are given probability 0.001. For other positions, including all flanking sequence, the reference base has probability 0.997 and the other three bases have probability 0.001.

For the clone simulation, outside of the cloned region, the profile matched the reference genome, with each other base having probability 0.001 of appearing by simulated read error. Inside the cloned region, the profile was estimated from the clones.

For the hypothetical repeat family, half of the isolated repeats are transcribed in each direction. For the FB4_DM simulations, reads are drawn in proportion to exon coverage. FB4_DM not in genes are sampled at a low level.

For the FB4_DM simulation, 200,000 reads are drawn. For the hypothetical repeat simulation, 50,000 reads are drawn.

Abbreviations

ADAR:

Adenosine deaminase acting on RNA

ds:

Double stranded

EER:

Editing enriched region

EM:

Expectation maximization

Hok:

Hoppel killer

PPV:

Positive predictive value

rdgA:

retinal degeneration A

SMS:

single-molecule sequencing (Helicos platform)

TE:

Transposable element

References

  1. O’Connell MA, Mannion NM, Keegan LP. The epitranscriptome and innate immunity. PLOS Genet. 2015; 11(12):1005687.

    Article  Google Scholar 

  2. Savva YA, Rezaei A, StLaurent G, Reenan RA. Reprogramming, circular reasoning and self versus non-self: One-stop shopping with rna editing. Front Genet. 2016; 7:100.

    Article  PubMed  PubMed Central  Google Scholar 

  3. Bernstein E, Caudy AA, Hammond SM, Hannon GJ. Role for a bidentate ribonuclease in the initiation step of rna interference. Nature. 2001; 409(6818):363–6.

    Article  CAS  PubMed  Google Scholar 

  4. Saldi TK, Ash PE, Wilson G, Gonzales P, Garrido-Lecca A, Roberts CM, Dostal V, Gendron TF, Stein LD, Blumenthal T, Petrucelli L, Link CD. Tdp-1, the caenorhabditis elegans ortholog of tdp-43, limits the accumulation of double-stranded rna. EMBO J. 2014; 33(24):2947–66. doi:10.15252/embj.201488740.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Chen-Plotkin AS, Lee VM-Y, Trojanowski JQ. Tar dna-binding protein 43 in neurodegenerative disease. Nat Rev Neurol. 2010; 6(4):211–20. doi:10.1038/nrneurol.2010.18.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Savva YA, Jepson JEC, Chang YJ, Whitaker R, Jones BC, StLaurent G, Tackett MR, Kapranov P, Jiang N, Du G, Helfand SL, Reenan RA. Rna editing regulates transposon-mediated heterochromatic gene silencing. Nat Commun. 2013; 4:2745.

    Article  PubMed  PubMed Central  Google Scholar 

  7. Mannion NM, Greenwood SM, Young R, Cox S, Brindle J, Read D, Nellåker C, Vesely C, Ponting CP, McLaughlin PJ, Jantsch MF, Dorin J, Adams IR, Scadden ADJ, Öhman M, Keegan LP, O’Connell MA. The rna-editing enzyme adar1 controls innate immune responses to rna. Cell Rep. 2014; 9:1482–94.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Nishikura K. Editor meets silencer: crosstalk between rna editing and rna interference. Nat Rev Mol Cell Biol. 2006; 7(12):919–31.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Nishikura K, Yoo C, Kim U, Murray JM, Estes PA, Cash FE, Liebhaber SA. Substrate specificity of the dsrna unwinding/modifying activity. EMBO J. 1991; 10(11):3523–32.

    CAS  PubMed  PubMed Central  Google Scholar 

  10. Lasda E, Parker R. Circular rnas: diversity of form and function. RNA. 2014; 20(12):1829–184213558382.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Rodriguez J, Menet JS, Rosbash M. Nascent-seq indicates widespread cotranscriptional rna editing in drosophila. Mol Cell. 2012; 47:27–37.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Porath HT, Carmi S, Levanon EY. A genome-wide map of hyper-edited rna reveals numerous new sites. Nat Commun. 2014; 5:4726. Article number 5726.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. StLaurent G, Tackett MR, Nechkin S, Shtokalo D, Antonets D, Savva YA, Maloney R, Kapranov P, Lawrence CE, Reenan RA. Genome-wide analysis of a-to-i rna editing by single-molecule sequencing in drosophila. Nat Struct Mol Biol. 2013; 20(11):1333–9.

    Article  CAS  Google Scholar 

  14. Ramaswami G, Zhang R, Piskol R, Keegan LP, Deng P, O’Connell MA, Li JB. Identifying rna editing sites using rna sequencing data alone. Nat Methods. 2013; 10(2):128–31.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Whipple JM, Youssef OA, Aruscavage PJ, Nix DA, Hong C, Johnson WE, Bass BL. Genome-wide profiling of the c. elegans dsrnaome. RNA. 2015; 21(5):786–800. doi:10.1261/rna.048801.114.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Sjölander K, Karplus K, Brown M, Hughey R, Krogh A, Mian S, Haussle D. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Bioinformatics. 1996; 12(6):327–45.

    Article  Google Scholar 

  17. Dempster AP, Laird NM, Rudin DB. Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc. Series B (Methodological). 1977; 39(1):1–38.

    Google Scholar 

  18. Gribskov M, Mclachlan AD, Eisenberg D. Profile analysis: Detection of distantly related proteins. Proc Nat Acad Sci USA. 1987; 84:4355–358.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Jin Y, Tam OH, Paniagua E, Hammell M. Tetranscripts: a package for including transposable elements in differential expression analysis of rna-seq datasets. Bioinformatics. 2015; 31(22):3593–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Wang J, Huda A, Lunyak VV, Jordan IK. A gibbs sampling strategy applied to the mapping of ambiguous short-sequence tags. Bioinformatics. 2010; 26(20):2501–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Parks MM, Lawrence CE, Raphael BJ. Detecting non-allelic homologous recombination from high-throughput sequencing data. Genome Biol. 2015; 16:72.

    Article  PubMed  PubMed Central  Google Scholar 

  22. Harris WA, Stark WS. Hereditary retinal degeneration in drosophila melanogaster. a mutant defect associated with the phototransduction process. J Gen Physiol. 1977; 69(3):261–91.

    Article  CAS  PubMed  Google Scholar 

  23. Meyer S, Schmidt I, Klämbt C. Glia ecm interactions are required to shape the drosophila nervous system. Mech Dev. 2014; 133:105–16.

    Article  CAS  PubMed  Google Scholar 

  24. Jin P, Duan R, Qurashi A, Qin Y, Tian D, Rosser TC, Liu H, Feng Y, Warren ST. Pur a binds to rcgg repeats and modulates repeat-mediated neurodegeneration in a drosophila model of fragile x tremor/ataxia syndrome. Neuron. 2007; 55(4):556–64.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Smith KR, Oliver PL, Lumb MJ, Arancibia-Carcamo IL, Revilla-Sanchez R, Brandon NJ, Moss SJ, Kittler JT. Identification and characterisation of a maf1/macoco protein complex that interacts with gabaa receptors in neurons. Mol Cell Neurosci. 2010; 44(4):330–41. doi:10.1016/j.mcn.2010.04.004.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Kumar JP, Hsiung F, Powers MA, Moses K. Nuclear translocation of activated map kinase is developmentally regulated in the developing drosophila eye. Development. 2003; 130(16):3703–14.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Wang J, Lee CHJ, Lin S, Lee T. Steroid hormone-dependent transformation of polyhomeotic mutant neurons in the drosophila brain. Development. 2006; 133(7):1231–40.

    Article  CAS  PubMed  Google Scholar 

  28. Song J, Wu L, Chen Z, Kohanski RA, Pick L. Axons guided by insulin receptor in drosophila visual system. Science. 2003; 300(5618):502–5.

    Article  CAS  PubMed  Google Scholar 

  29. Lee HK, Cording A, Vielmetter J, Zinn K. Interactions between a receptor tyrosine phosphatase and a cell surface ligand regulate axon guidance and glial-neuronal communication. Neuron. 2013; 78(5):813–26.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Nahm M, Kim S, Paik SK, Lee M, Lee S, Lee ZH, Kim J, Lee D, Bae YC, Lee S. dcip4 (drosophila cdc42-interacting protein 4) restrains synaptic growth by inhibiting the secretion of the retrograde glass bottom boat signal. J Neurosci. 2010; 30(24):8138–50.

    Article  CAS  PubMed  Google Scholar 

  31. Merino C, Penney J, González M, Tsurudome K, Moujahidine M, O’Connor MB, Verheyen EM, Haghighi P. Nemo kinase interacts with mad to coordinate synaptic growth at the drosophila neuromuscular junction. J Cell Biol. 2009; 185(4):713–25.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Attrill H, Falls K, Goodman JL, Millburn GH, Antonazzo G, Rey AJ, Marygold SJ, the FlyBase consortium. Flybase: establishing a gene group resource for drosophila melanogaster. Nucleic Acids Res. 2015; 44(D1):D786–92.

    Article  PubMed  PubMed Central  Google Scholar 

  33. Carrillo RA, Özkan E, Menon KP, Nagarkar-Jaiswal S, Lee PT, Jeon M, Birnbaum ME, Bellen HJ, Garcia KC, Zinn K. Control of synaptic connectivity by a network of drosophila igsf cell surface proteins. Cell. 2015; 163(7):1770–82.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Kurusu M, Cording A, Taniguchi M, Menon K, Suzuki E, Zinn K. A screen of cell-surface molecules identifies leucine-rich repeat proteins as key mediators of synaptic target selection. Neuron. 2008; 59(6):972–85.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Yu W, Kawasaki F, Ordway RW. Activity-dependent interactions of nsf and snap at living synapses. Mol Cell Neurosci. 2011; 47(1):19–27.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Kim M, Semple I, Kim B, Kiers A, Nam S, Park HW, Park H, Ro SH, Kim JS, Juhász G, Lee JH. Drosophila gyf/grb10 interacting gyf protein is an autophagy regulator that controls neuron and muscle homeostasis. Autophagy. 2015; 11(8):1358–72.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  37. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958; 53(282):457–81.

    Article  Google Scholar 

  38. Lehmann KA, Bass BL. The importance of internal loops within rna substrates of adar1. J Mol Biol. 1999; 291:1–13.

    Article  CAS  PubMed  Google Scholar 

  39. Mathews DH. Using an rna secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA. 2004; 10(8):1178–90.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Lu ZJ, Gloor JW, Mathews DH. Improved rna secondary structure prediction by maximizing expected pair accuracy. RNA. 2009; 15(10):1805–13.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Stat. 2001; 29(4):1165–88.

    Article  Google Scholar 

  42. Polson AG, Bass BL. Preferential selection of adenosines for modification by double-stranded rna adenosine deaminase. Embo J. 1994; 12(23):5701–11.

    Google Scholar 

  43. Lehmann KA, Bass BL. Double-stranded rna adenosine deaminases adar1 and adar2 have overlapping specificities. Biochemistry. 2000; 39(42):12875–84.

    Article  CAS  PubMed  Google Scholar 

  44. Kim DDY, Kim TTY, Walsh T, Kobayashi Y, Matise TC, Buyske S, Gabriel A. Widespread rna editing of embedded alu elements in the human transcriptome. Genome Res. 2004; 14:1719–25.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Eggington JM, Greene T, Bass BL. Predicting sites of adar editing in double-stranded rna. Nat Commun. 2011; 2:319.

    Article  PubMed  PubMed Central  Google Scholar 

  46. Stevenson KR, Coolon JD, Wittkopp PJ. Sources of bias in measures of allele-specific expression derived from rna-seq data aligned to a single reference genome. BMC Genomics. 2013; 14:536.

    Article  PubMed  PubMed Central  Google Scholar 

  47. Li H, Durbin R. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics. 2009; 25(14):1754–60. doi:10.1093/bioinformatics/btp324.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

This research was conducted using computational resources and services at the Center for Computation and Visualization, Brown University. We would also like to acknowledge William Thompson for his assistance with Python coding.

Funding

Not applicable.

Availability of data and materials

RepProfile code is available at https://github.com/wmckerrow/RepProfile. Raw read data can be accessed through NCBI project PRJNA358243: https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/bioproject/?term=PRJNA358243. Edited clone sequences can be found in Additional file 2: Table S1.

Author information

Authors and Affiliations

Authors

Contributions

RR and YS identified the challenge and importance of aligning hyper-edited reads to repeats. CL and WM designed the probabilistic model based RR’s and YS’s understanding of hyper-editing. YS and AR performed wet lab work including nucleotide extraction and sequencing. WM wrote and executed RepProfile code. RR provided interpretation of the RepProfile results. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Wilson H. McKerrow.

Ethics declarations

Competing interests

The authors declare that they have No competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Data was collected only from unregulated invertebrates.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1

Supplementary results. Descriptions of TE families with 1000 or more edit sites. (PDF 32 kb)

Additional file 2

Supplementary tables. List of all editing predictions in FB4_DM, DNAREP1_DM and PROTOP along with sequences for the rdgA clones. (XLSX 1,065 kb)

Additional file 3

rdgA FD4_DM. RNAstructure prediction for rdgA FD4_DM. The FD4_DM element forms a dsRNA structure that is too long to fit on a figure. (PDF 137 kb)

Additional file 4

Supplementary methods. Detailed description of RepProfile prior and simplifications for EM. Also included: rationale for choosing to focus on FB4_DM, DNAREP1_DM and PROTOP. (PDF 133 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

McKerrow, W., Savva, Y., Rezaei, A. et al. Predicting RNA hyper-editing with a novel tool when unambiguous alignment is impossible. BMC Genomics 18, 522 (2017). https://0-doi-org.brum.beds.ac.uk/10.1186/s12864-017-3898-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12864-017-3898-9

Keywords