Single molecule sequencing-guided scaffolding and correction of draft assemblies
BMC Genomics volume 18, Article number: 879 (2017)
Abstract
Background
Although single molecule sequencing is still improving, the lengths of the generated sequences are inevitably an advantage in genome assembly. Prior work that utilizes long reads to conduct genome assembly has mostly focused on correcting sequencing errors and improving contiguity of de novo assemblies.
Results
We propose a disassembling-reassembling approach for both correcting structural errors in the draft assembly and scaffolding a target assembly based on error-corrected single molecule sequences. To achieve this goal, we formulate a maximum alternating path cover problem. We prove that this problem is NP-hard, and solve it by a 2-approximation algorithm.
Conclusions
Our experimental results show that our approach can improve the structural correctness of target assemblies at the cost of some contiguity, even with smaller amounts of long reads. In addition, our reassembling process can also serve as a competitive scaffolder relative to well-established assembly benchmarks.
Background
Paired end sequencing, where shorter, high quality sequences are generated from both ends of a longer DNA fragment, has been used to improve genome assemblies (see [1] for a review). This method was a significant advance since it allowed scaffolding [2] and improved assembly quality assessment [3]. It also helped resolve many repetitive sequences that are smaller than the DNA fragments provided to sequencing [1].
Although the total cost of sequencing has decreased substantially, the DNA fragment sizes suitable for inexpensive sequencing have remained similar. As a result, repetitive sequences larger than a few thousand bases in extant genomes largely remain either incorrect or absent. Prior work has focused on combining different assemblers or picking the best assembly (see [4] for a review), but these approaches cannot fix errors (or omissions) produced by all assemblers [2].
Newer sequencing methods, such as Illumina’s long reads (e.g., [5]) or the more recent 10X Genomics Chromium platform, will indirectly increase fragment sizes through localized assembly. If repeats are spread throughout the genome, instances of repeats should be unique in each subproblem (reconstructed fragment) and therefore can be assembled well. If they are not distributed throughout, as in the case of complex genomes like maize [6], such approaches will suffer from the same pitfalls of traditional assembly.
In this paper, we present an alternative framework that uses single molecule sequencing (SMS) data [7–10]. In contrast to paired end sequencing, SMS platforms like Oxford Nanopore generate low quality sequences that currently can be as large as one hundred thousand nucleotides. Because the sequences are generated from actual DNA molecules, they are ideal to resolve local tandem repeats, transposon structure(s), and small-scale inversions and translocations. Most importantly, this new method can in theory be applied to any extant genome to improve quality. We demonstrate this on both bacterial and yeast genomes using currently available SMS data.
To the best of our knowledge, the framework of BIGMAC [11] is the most similar to ours. BIGMAC, however, is designed to improve genome accuracy for metagenomic assembly by extensively relying on the assembly itself. As a result, it can hardly correct the structural errors depicted in Fig. 1, and this was validated in a comparison. Individual components of this method are also related to prior work in assembly improvement. For example, our scaffolding step is similar to the likelihood-based gap closing approach of GMcloser [12], similar to [13], as well as the recently reported ScaffMatch [14]. In contrast, we compute overlaps between SMS reads in order to more accurately weight the connections between genome regions. To further validate and compare with prior methods, we also test our approach on a well-established genome assembly benchmark (GAGE [15]) at varied coverages of SMS scaffolding data.
Results
We evaluate the performance of our program, denoted as SMSC, on both synthetic data and real data as target assemblies. In the case that the experimental results of different program settings are similar, we present only one of them.
Synthetic data
Escherichia coli is a well studied organism with a highly curated genome assembly. For this reason, it is a common benchmark for assembly software. The input to our program is a target assembly with induced structural errors (see below) and Nanopore long reads that have already been corrected by Nanocorr [16].
To randomly mutate the complete genome of E. coli K12 (accession NC_000913), we wrote a Python script to cut the genome randomly N−1 times (N=50 in Fig. 2 a). Specifically, if the total length of the remaining genome is L and there are m cuts to go, we choose a cutting point uniformly at random between length l and L/(m+1), where l=10000 when N≤400 and l=5000 when N=500 in Table 1. We chose different values of l to more accurately represent the expected fragmentation post-assembly given the length of the E. coli sequence. After this cut, there are (m−1) cuts to go. After all the cuts, we shuffle the pieces and randomly reverse complement some of them to eliminate the known order and orientation. Further, we introduced 2% each of insertions, deletions, and mismatches into the new target assembly for correction.
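The cutting procedure described above can be sketched as follows. This is a minimal illustration, not the script used in the experiments: the function names, the use of an inclusive integer range for the cut point, the 0.5 probability of reverse complementing a piece, and the toy genome are all assumptions, and the 2% induced errors are not modelled.

```python
import random

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def fragment_genome(genome, n_pieces, min_len, rng):
    """Cut `genome` n_pieces - 1 times, then shuffle and randomly flip pieces."""
    pieces = []
    remaining = genome
    # cuts_left plays the role of m in the text; len(remaining) is L.
    for cuts_left in range(n_pieces - 1, 0, -1):
        upper = len(remaining) // (cuts_left + 1)  # L / (m + 1)
        if upper < min_len:
            raise ValueError("genome too short for the requested cuts")
        cut = rng.randint(min_len, upper)  # uniform on [l, L/(m+1)], inclusive
        pieces.append(remaining[:cut])
        remaining = remaining[cut:]
    pieces.append(remaining)
    rng.shuffle(pieces)  # eliminate the known order
    # Assumption: each piece is reverse complemented with probability 0.5.
    return [reverse_complement(p) if rng.random() < 0.5 else p for p in pieces]


# Toy example (hypothetical sizes, far smaller than E. coli K12).
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(100_000))
pieces = fragment_genome(genome, n_pieces=50, min_len=500, rng=rng)
```

Because each cut removes at most L/(m+1) bases, the upper bound for the next cut never shrinks, so the pieces stay at least `min_len` long whenever the first cut is feasible.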
The coverage of these Nanopore long reads is 50x, the average length is 4070 base pairs (bp), and the upper quartile is 6231 bp. Figure 2 shows a before-and-after comparison on these data using our method, visualized with mummerplot [17]. Bacterial genomes are circular, so the two genome alignment segments actually represent a single genome. This figure shows that our method disassembled the erroneous genome and then correctly reassembled the consistent segments.
To test the performance of our method with higher fragmentation, we gradually increased the number of mutations in the mutated genome (see Table 1). The numbers of relocations and inversions post correction were assessed by dnadiff, which is another tool in the MUMmer toolkit. Based on the definition of comparisons in [17], translocations are not reported given that E. coli K12 is a single sequence. Because dnadiff does not natively distinguish whether a genome is circular, there will be a relocation if the starting location of our output is different from that of the ground truth. We also noticed that many inferred relocations and inversions are small and localized. These appear to be insertions present in our assembly resulting from either imperfect read alignments or uncorrected errors in the long reads.
To see how the coverage of the long reads affects our method, we downsampled the long reads to 25x and 12.5x, where the 12.5x sampling is a subset of the 25x sample (see Table 1). Because the results of using Nucmer and BLASR as alignment tools are similar for this genome, the results in Table 1 were all produced using Nucmer as the alignment tool.
The results of our experiments show that, as expected, our results depend somewhat on both the coverage of the long read data and the fragmentation/errors present in the target assembly. Higher fragmentation produces shorter fragments, which will lead to shorter validated segments because our conservative method requires 99% of each long read to align. We expect this could be alleviated somewhat by changing the parameters to be 99% of the smaller of the two sequences or relaxing this threshold.
Real data
S. cerevisiae W303
We downloaded the yeast (S. cerevisiae W303) draft assembly from [18], and error-corrected PacBio long reads (10x, upper quartile length of 932 bp, already corrected by the Celera Assembler PBcR pipeline) from [19]. After running our program on this draft assembly with the error-corrected long reads, the number of contigs increases from 30 in the draft assembly to 294 (using BLASR as alignment tool) and 278 (using Nucmer as alignment tool). Given the short read lengths and overall low coverage after long read correction, however, this added fragmentation is consistent with our earlier simulations.
To the best of our knowledge, there are only draft assemblies available for S. cerevisiae W303 [16, 19, 20]. We therefore used its closely related strain S. cerevisiae S288 (available at [16]) as the reference, as in [16]. We then used Quast [21] as the quality assessment tool to determine the overall quality. As for the required independent Illumina paired-end reads for Quast evaluation, we used the S288 data downloaded from NCBI (SRA accession SRR507778).
In this experiment, we also compared our results with another scaffolder, ScaffMatch [14], which requires traditional paired-end reads for scaffolding. The paired-end reads were downloaded from [16]. Their coverage is 105.5x. Since ScaffMatch can only do scaffolding, we use our disassembler output, i.e., the validated segments, as the common input for both methods. Because our edge weighting is based on alignments between error-corrected PacBio long reads, our method can also fill gaps. ScaffMatch, on the other hand, can only estimate the size of the gap and insert an appropriate number of padding characters (usually an ‘N’).
Table 2 shows that even with suboptimal long read data (10X coverage, <1 kb length), the post-correction result is better than the original target assembly. When using BLASR as the alignment tool, our method performs the best, followed by ScaffMatch, except that SMSC with BLASR produces an inversion. Note that the higher fragmentation of our method is somewhat expected given that we are using less data (10X vs. >105X).
S. aureus USA300
Since we did not use the exact strain in the comparison above, we decided to also test a well-characterized GAGE benchmark [15]: S. aureus USA300.
Its complete genome, a draft assembly, and paired-end reads were downloaded from [15]. Although there are several assemblies on GAGE, we present only the results from the ALLPATHS-LG assembly since there were no detected structural errors and our method worked best with this assembler. The accession number for the raw PacBio long read data is SRR2063240. After error correction by Canu [22], the coverage reduces to 28.63X (from over 293X originally).
Here, we compare with ScaffMatch as before. Since we used Canu to correct the PacBio long reads, we also produced a de novo assembly for S. aureus USA300 based only on long reads. The experiments on the full dataset (the one with coverage of 293.6x of long reads in Table 3) show that SMSC has a better N50 value but more pieces. This is because there are 8 contigs of length ≤300 and 2 of length about 1500.
We also downsampled the raw PacBio long reads to observe the effects of coverage for this genome. Because we downsampled only the raw PacBio long reads and not the PE reads, the output of ScaffMatch will appear to be very good. We therefore also ran ScaffMatch with the error-corrected PacBio long reads as input. Since this is not a fair usage of ScaffMatch, we denote it by ScaffMatch*. Table 3 includes our experimental results from 6 to 10% of the full coverage, and from this table one can see that our approach and ScaffMatch have higher contiguity and identity than de novo assemblies. ScaffMatch almost always has a constant contiguity because the PE reads were not downsampled. Taking the error-corrected downsampled PacBio long reads as input, ScaffMatch* produces improved contigs as the downsample size increases, as anticipated.
Discussion
Our results indicate that SMSC is competitive with Canu in terms of contiguity and identity when using lower coverage SMS data. Interestingly, and maybe as expected, scaffolding using both SMSC and ScaffMatch is worse than high-depth SMS de novo assembly using Canu.
The higher relocations and misassemblies in ScaffMatch seem to be caused by gaps induced by regions missing in mate-pair data. Since ScaffMatch uses the output of the first stage of SMSC as input, and SMSC implements a simpler scaffolding algorithm than ScaffMatch, we expect that our results could improve if we implemented a more complete algorithm, with the structural consistency competitive with Canu. We leave this implementation as future work.
Conclusion
Here we presented a new framework that can correct structural assembly errors using single molecule sequencing (SMS) data. Further, we show that long reads when applied to an extant genome can fill scaffold gaps and produce higher overall structural consistency. We expect this framework to be more useful for researchers who have invested heavily in a draft genome and do not want to completely resequence this genome.
Methods
In general, our method consists of disassembling and reassembling genomic regions (Fig. 1). In the first step, we locate structurally consistent segments in an assembly (Fig. 1 a) and extract these regions (Fig. 1 b), thus disassembling the input assembly into smaller, validated portions (or contigs). In the second step, we reassemble the consistent regions (Fig. 1 c) by applying an independently derived path cover model first proposed as a scaffold problem in [14] and fill the gaps. The details of our algorithms are presented below.
Disassembling
Longread alignment
We first align error-corrected long reads to a target assembly (see Fig. 1 a), using either Nucmer [17] or BLASR [23] (any long-read alignment tool could in theory be used here). The output of Nucmer/BLASR can be viewed as a sequence of aligned blocks \((DL_i, DR_i, RL_i, RR_i)\), such that the nucleotides in the range \([DL_i, DR_i]\) of the target assembly are aligned to the long read in the range \([RL_i, RR_i]\). If there are overlaps between consecutive aligned blocks, either in the draft sequence or in the long read, we cut off the left ends of the blocks on the right. We then attempt to close any small gaps between these aligned blocks via the Needleman-Wunsch algorithm, and call the resulting alignment region an alignment segment.
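As a rough illustration of the overlap trimming step, the sketch below represents each aligned block as a tuple of inclusive coordinates and trims the left end of any block that overlaps its left neighbour in either the assembly or the read. The `Block` type and `trim_overlaps` function are hypothetical names. The sketch also assumes forward-strand alignments and shifts both coordinates by the same amount, which treats the trimmed prefix as gap-free; a real implementation would have to walk the alignment to account for insertions and deletions.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    dl: int  # DL_i: start in the target assembly (inclusive)
    dr: int  # DR_i: end in the target assembly (inclusive)
    rl: int  # RL_i: start in the long read (inclusive)
    rr: int  # RR_i: end in the long read (inclusive)


def trim_overlaps(blocks: List[Block]) -> List[Block]:
    """Trim the left end of each block that overlaps its left neighbour."""
    trimmed: List[Block] = []
    for block in sorted(blocks):  # tuples sort by dl first
        if trimmed:
            prev = trimmed[-1]
            # Overlap length in the assembly or in the read, whichever is larger.
            overlap = max(prev.dr - block.dl + 1, prev.rr - block.rl + 1, 0)
            # Simplification: shift both coordinates by the same amount.
            block = Block(block.dl + overlap, block.dr,
                          block.rl + overlap, block.rr)
        if block.dl <= block.dr and block.rl <= block.rr:
            trimmed.append(block)  # blocks trimmed away entirely are dropped
    return trimmed


blocks = [Block(1, 100, 1, 100), Block(91, 200, 95, 204), Block(230, 300, 232, 302)]
trimmed = trim_overlaps(blocks)
```

In this toy input the second block overlaps the first by 10 bases in the assembly and 6 bases in the read, so 10 positions are cut from its left end; the third block does not overlap and is left unchanged.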
An alignment segment is kept and used if for any sequence in the target assembly and a long read, the following hold: i) For any gap of length \(g_1\) between two aligned blocks \([DL_i, DR_i]\) and \([DL_{i+1}, DR_{i+1}]\) in the draft assembly, and for the corresponding gap between \([RL_i, RR_i]\) and \([RL_{i+1}, RR_{i+1}]\) of length \(g_2\) in the long read, the value \(g_1 - g_2\) is within a chosen threshold (30 in our experiments); ii) the sum of all alignment blocks for a long read divided by the length of the long read is larger than a chosen threshold (0.99 in our experiments). After picking out these good alignment segments, we claim that if there are two aligned segments overlapping by at least δ (also a threshold, 50 in our experiments) in the target assembly, then these two aligned segments in the assembly should be merged.
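The two filtering criteria and the merging rule can be sketched as below. The function names are hypothetical, and two points are assumptions rather than statements from the text: the gap criterion is checked on the absolute difference \(|g_1 - g_2|\), and the aligned fraction is measured in read coordinates. Blocks are (DL, DR, RL, RR) tuples with inclusive coordinates, already sorted and trimmed.

```python
def passes_filters(blocks, read_len, max_gap_diff=30, min_aligned_frac=0.99):
    """Check criteria (i) and (ii) for the blocks of one read on one sequence."""
    for (_, dr, _, rr), (dl_next, _, rl_next, _) in zip(blocks, blocks[1:]):
        g1 = dl_next - dr - 1  # gap length in the draft assembly
        g2 = rl_next - rr - 1  # corresponding gap length in the long read
        if abs(g1 - g2) > max_gap_diff:  # assumption: absolute difference
            return False
    aligned = sum(rr - rl + 1 for _, _, rl, rr in blocks)
    return aligned / read_len > min_aligned_frac


def merge_segments(segments, min_overlap=50):
    """Merge (start, end) segments that overlap by at least `min_overlap` bases."""
    merged = []
    for start, end in sorted(segments):
        if merged and merged[-1][1] - start + 1 >= min_overlap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


ok = passes_filters([(1, 1000, 1, 1000)], read_len=1005)
bad_gap = passes_filters([(1, 100, 1, 100), (201, 300, 101, 200)], read_len=200)
merged = merge_segments([(1, 1000), (900, 2000), (1990, 3000), (5000, 6000)])
```

In the example, the first two segments overlap by 101 bases and are merged, while the 11-base overlap between the merged segment and (1990, 3000) is below δ, so those two stay separate.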
Since we do not have a quantitative measure for block quality, the choices of these thresholds are based on observations derived from custom visualizations of the alignment results. In general, the first threshold makes sure that if the target assembly is of high quality, then there are not too many pieces after disassembly. The second threshold guarantees that the long reads are well-contained in the target assembly. The third threshold eliminates overlaps that are not trustworthy while trying to retain contiguity. Using error-corrected long reads maximizes the number of obtained alignments but is not a requirement for this step.
The time complexity of the alignment step depends on the tool used (e.g., BLASR or Nucmer). Sorting the aligned blocks takes \(O(a \log a)\) time, where a is the total number of aligned blocks. Cutting off the overlaps takes \(O(k)\) time, where k is the number of insertions in all the aligned blocks \((DL_i, DR_i, RL_i, RR_i)\). Closing the gaps takes \(O(l_1 l_2)\) time for each gap, where \(l_1 = DL_{i+1} - DR_i - 1\) and \(l_2 = RL_{i+1} - RR_i - 1\). Theoretically, in the worst case, the time bound could be \(O(MN)\), where M is the total length of the target assembly and N is the total length of all the long reads; but this is unlikely to happen in practice because the Needleman-Wunsch algorithm will not be applied if a contig and a long read are not already aligned by BLASR or Nucmer. Another task is to determine whether an alignment segment should be kept, which takes linear time to handle. In general, this alignment step takes \(O(MN)\) time in the worst case, and may run faster in practice.
Segment extraction
Once we obtain validated segments from the target assembly, we extract these segments and discard the remaining untrustworthy regions (see Fig. 1 b). We support two options for extracting structurally validated segments. The first one is simple extraction. This option is preferred if the quality of the target draft assembly is fairly high. The second option is to generate a new consensus from the long read data aligned to each segment. This option is preferred if the draft assembly quality is relatively low (e.g., a low coverage sequencing scheme).
The first option takes O(K) time, where K is the number of validated segments. The second option can take much longer, because it requires calling BLASR or Nucmer for another sequence alignment. This new alignment is then used to build the consensus of the validated segments.
Reassembling
In this step, our goal is to reassemble the extracted segments (e.g., Fig. 1 c is a possible reassembly). The information needed to close the induced gaps is hopefully contained in the remaining unaligned long reads. Therefore, we next utilize these remaining long reads using a graph-based theoretical model, which also appeared as the scaffolding problem proposed by Mandric and Zelikovsky [14].
Prior work on this scaffolding problem reduced it to the vertex disjoint problem, which is NP-hard, but gave no complexity analysis for the scaffolding problem itself. Here, we show that this scaffolding problem is indeed NP-hard. We also prove that the algorithm presented in ScaffMatch [14] is a 2-approximation of an optimal scaffolding using our independent formulation of this model.
Graph model construction
Our graph model (see Fig. 3 for an example) is derived from the concept of breakpoint graphs [24–26]. In a breakpoint graph, each gene is represented by two vertices, indicating the 5′ end and 3′ end of the gene. If two genes are consecutive in a scaffold, a colored edge will be added to connect the two corresponding vertices. Following the same idea, we view each validated segment as two vertices, and a dashed edge is added between these two vertices. A solid edge can be added between two vertices in the graph if there is a long read bridging them. Similar edges can be merged in this step. If there are still multiple edges between two vertices, we consider using the one with the largest support (the number of long reads) (see Fig. 3 a). This construction of the model has the advantage that in the successive steps of the algorithm, we do not have to consider the directions of the vertices (as shown in Fig. 3 b and c). Prior approaches of MAIA [13] and Medusa [27] both require the directions of vertices and edges to be determined. The directions are usually determined in two separate phases, which may result in accumulated assembly errors, given that both phases are NP-hard problems (and therefore approximated). This two-phase difficulty was addressed by both our model and the similar graph model in ScaffMatch [14].
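A minimal sketch of this graph construction is given below. The vertex encoding `(segment_index, "L" or "R")` for the two ends of a validated segment, the function name, and the input format (candidate edges that have already been merged and carry a read-support count) are illustrative assumptions, not the data structures of the actual implementation.

```python
def build_scaffold_graph(n_segments, candidate_edges):
    """Return (dashed, solid) for a list of (end_a, end_b, support) candidates."""
    # One dashed edge per validated segment, joining its two ends.
    dashed = [((i, "L"), (i, "R")) for i in range(n_segments)]
    solid = {}
    for a, b, support in candidate_edges:
        key = frozenset((a, b))  # undirected edge, so (a, b) equals (b, a)
        # Keep only the candidate with the largest support for each vertex pair.
        if len(key) == 2 and support > solid.get(key, 0):
            solid[key] = support
    return dashed, solid


candidate_edges = [
    ((0, "R"), (1, "L"), 5),
    ((1, "L"), (0, "R"), 2),  # same vertex pair, weaker support: ignored
    ((1, "R"), (2, "R"), 3),  # joins two "R" ends: segment 2 would be reversed
]
dashed, solid = build_scaffold_graph(3, candidate_edges)
```

Because an edge only records which two segment ends it joins, the relative orientation of segments is implied by the edge itself, which is why no separate orientation phase is needed in this model.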
NP-hardness of the scaffolding problem
In this graph, an alternating path that starts at a dashed edge, followed by a solid edge, …, and ends at a dashed edge, is a possible scaffold (as shown in Fig. 3 b). Since each validated segment should ideally appear in exactly one scaffold in the final output assembly (i.e., no repeats, see Fig. 3 c), the objective of the problem is to find a set of (vertex-disjoint) alternating paths that covers all the dashed edges exactly once. In addition, we can weight the solid edges between vertices by the number of long reads that support that connection. Thus, the objective is to find a maximum weighted alternating path cover.
Formally, the scaffolding problem is defined as follows.
Definition 1
We are given a weighted undirected graph \(G=(U,U';E)\), where \(U=\{u_1,u_2,\ldots,u_n\}\), \(U'=\{u_1',u_2',\ldots,u_n'\}\), and E is a set of dashed edges and solid edges. The dashed edges are exactly \(\{(u_1,u_1'), (u_2,u_2'),\ldots,(u_n,u_n')\}\). The solid edges may connect any two vertices in the graph. Each solid edge is assigned a positive edge weight, and the edge weight for each dashed edge is 0. An alternating path is defined as a simple path that starts at a dashed edge, followed by a solid edge, then a dashed edge, then a solid edge, …, and ends at a dashed edge. The objective is to find a set of alternating paths such that each vertex and each dashed edge appear in the paths exactly once (i.e., an alternating path cover), and the sum of the edge weights in all paths is maximized.
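To make the definition concrete, the following sketch checks whether a set of vertex sequences is an alternating path cover and, if so, returns its total weight. The encoding is an assumption for illustration: vertex \(u_i\) is written `(i, "L")`, \(u_i'\) is written `(i, "R")`, and solid edges are stored in a dictionary keyed by unordered vertex pairs.

```python
def cover_weight(n_segments, solid, paths):
    """Return the total solid weight of an alternating path cover, else None."""
    seen = set()
    total = 0
    for path in paths:
        if len(path) < 2 or len(path) % 2:
            return None  # a path starts and ends with a dashed edge
        for k in range(len(path) - 1):
            a, b = path[k], path[k + 1]
            if k % 2 == 0:
                # Even positions must be dashed edges (two ends of one segment).
                if a[0] != b[0] or a[1] == b[1]:
                    return None
            else:
                # Odd positions must be solid edges present in the graph.
                weight = solid.get(frozenset((a, b)))
                if weight is None:
                    return None
                total += weight
        for v in path:
            if v in seen:
                return None  # every vertex may be used only once
            seen.add(v)
    expected = {(i, side) for i in range(n_segments) for side in "LR"}
    return total if seen == expected else None


solid = {frozenset(((0, "R"), (1, "L"))): 5, frozenset(((1, "R"), (2, "R"))): 3}
single = [[(0, "L"), (0, "R"), (1, "L"), (1, "R"), (2, "R"), (2, "L")]]
trivial = [[(0, "L"), (0, "R")], [(1, "L"), (1, "R")], [(2, "L"), (2, "R")]]
```

Here `cover_weight(3, solid, single)` evaluates to 8 and `cover_weight(3, solid, trivial)` to 0; both are feasible covers, and the objective prefers the first. The check visits each vertex and edge of the candidate solution once.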
To prove that this problem is NP-hard, we prove that the decision version of the problem is NP-complete.
Theorem 1
Given a weighted undirected graph \(G=(U,U';E)\) defined as above, and a parameter k, the problem of determining whether there exists an alternating path cover whose edge weight sum is at least k is NP-complete.
Proof
Given any set of paths, we can easily verify whether they form an alternating path cover and the weight sum is at least k in polynomial time. Hence, the decision version of the scaffolding problem is in NP.
To prove that the decision version is NP-complete, we reduce the Hamiltonian path problem (the undirected graph version) to this problem. Given an instance I of the Hamiltonian path problem \(G=(V,E)\) with \(n=|V|\) vertices, we make a copy of the vertex set V and the edge set E as \(V'\) and \(E'\). Let \(E \cup E'\) be the solid edges in the instance \(I'\) of the decision version of the scaffolding problem. The dashed edges in \(I'\) are constructed by connecting the corresponding vertices in V and \(V'\). The edge weights of the solid edges are all 1 and \(k=n-1\). In this way, we construct an instance of the decision version of the scaffolding problem in polynomial time. Figure 4 shows an example of the construction.
(⇒) If there is a Hamiltonian path in I, we can denote the path as \(u_1 \rightarrow u_2 \rightarrow \cdots \rightarrow u_n\). Then we can construct a single alternating path \(u_1 \rightarrow u_1' \rightarrow u_2' \rightarrow u_2 \rightarrow u_3 \rightarrow u_3' \rightarrow \cdots\). The path ends at \(u_n\) if n is even, and at \(u_n'\) if n is odd. The edge weight sum of the alternating path is \(n-1\). Each dashed edge and each vertex in \(I'\) are covered exactly once, and the edges \((u_1',u_2'), (u_2,u_3), \ldots\) must be in the graph \(I'\) because the corresponding edges are in the Hamiltonian path, and thus in the original graph I.
(⇐) If there is an alternating path cover of edge weight sum at least \(n-1\) in the graph \(I'\), then the path cover must be a single alternating path, because a cover consisting of p paths contains exactly \(n-p\) solid edges, each of weight 1. Without loss of generality, we can assume that the path starts at a vertex in U. Then the alternating path would be \(u_{i_{1}} \rightarrow u_{i_{1}}' \rightarrow u_{i_{2}}' \rightarrow u_{i_{2}} \rightarrow u_{i_{3}} \rightarrow u_{i_{3}}' \cdots \). We can construct a Hamiltonian path \(u_{i_{1}} \rightarrow u_{i_{2}} \rightarrow \cdots \rightarrow u_{i_{n}}\) since these edges must exist in the graph I.
Hence, the decision version of the scaffolding problem is NP-complete. □
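The reduction and the forward direction of the proof can be illustrated with a small sketch. The encoding is an assumption: vertices of V are numbered 0 to n−1, `(v, 0)` denotes v in V, and `(v, 1)` denotes its copy in V′. The function names and the toy graph are hypothetical.

```python
def hamiltonian_to_scaffolding(n, edges):
    """Build the instance I' (dashed edges, solid edges, k) from I = (V, E)."""
    dashed = [((v, 0), (v, 1)) for v in range(n)]
    solid = {}
    for u, v in edges:
        solid[frozenset(((u, 0), (v, 0)))] = 1  # edge of E
        solid[frozenset(((u, 1), (v, 1)))] = 1  # copied edge of E'
    return dashed, solid, n - 1


def path_to_alternating(order):
    """Map a Hamiltonian path (vertex order) to the alternating path of the proof."""
    alternating = []
    side = 0
    for v in order:
        # Cross the dashed edge of v, then continue on the side just reached.
        alternating += [(v, side), (v, 1 - side)]
        side = 1 - side
    return alternating


# Toy instance: a triangle on 3 vertices, with Hamiltonian path 0 -> 1 -> 2.
dashed, solid, k = hamiltonian_to_scaffolding(3, [(0, 1), (1, 2), (0, 2)])
alt = path_to_alternating([0, 1, 2])
# Solid edges sit at odd positions of the alternating path.
weight = sum(solid[frozenset((alt[i], alt[i + 1]))] for i in range(1, len(alt) - 1, 2))
```

For the toy instance the alternating path uses the copied edge between 0′ and 1′ and the original edge between 1 and 2, so its weight equals k = n − 1 = 2, matching the (⇒) direction.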
The 2-approximation algorithm
For completeness, we also describe the algorithm presented in ScaffMatch [14]:

1. Find a maximum weighted matching by considering only the solid edges.
2. Add the dashed edges to the matching.
3. For each alternating cycle, change the smallest weight of its solid edges to 1, and run steps (1) and (2) until there is no cycle.
To prove that this is a 2-approximation algorithm, for simplicity, we relax step (3) such that we remove the smallest weight solid edge from each cycle. This relaxation can perform worse than the iterative algorithm above, but its worst case is the same. In fact, the current implementation of our algorithm follows this relaxed version. Our proof is similar to that for the ordinary path cover problem in [28].
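The relaxed version of the algorithm can be sketched as follows. This is an illustration under stated assumptions, not the SMSC implementation: vertices are encoded as `(segment_index, "L" or "R")`, solid edges are a dictionary keyed by unordered vertex pairs, and the maximum weighted matching is found by exhaustive search (exponential time, usable only on tiny examples) as a stand-in for a polynomial-time matching routine.

```python
def max_weight_matching(solid):
    """Exhaustive maximum weight matching; exponential, for tiny examples only."""
    edges = sorted(solid.items(), key=lambda kv: sorted(kv[0]))
    best = (0, [])

    def search(i, used, weight, chosen):
        nonlocal best
        if weight > best[0]:
            best = (weight, list(chosen))
        for j in range(i, len(edges)):
            pair, w = edges[j]
            if not (pair & used):  # both endpoints are still free
                chosen.append(pair)
                search(j + 1, used | pair, weight + w, chosen)
                chosen.pop()

    search(0, frozenset(), 0, [])
    return best[1]


def scaffold(n_segments, solid):
    """Relaxed algorithm: matching, add dashed edges, break each cycle once."""
    def mate(v):  # other end of the dashed edge (step 2)
        return (v[0], "R" if v[1] == "L" else "L")

    partner = {}
    for pair in max_weight_matching(solid):  # step 1
        a, b = tuple(pair)
        partner[a], partner[b] = b, a

    visited = set()
    for start in list(partner):  # relaxed step 3
        if start in visited:
            continue
        cycle, v = [], start
        while v in partner and v not in visited:
            visited.update((v, partner[v]))
            cycle.append(frozenset((v, partner[v])))
            v = mate(partner[v])
        if v == start:  # walked back to the start: an alternating cycle
            a, b = tuple(min(cycle, key=solid.get))
            del partner[a], partner[b]  # remove its lightest solid edge

    paths, used = [], set()
    for i in range(n_segments):
        for side in "LR":
            v = (i, side)
            if v in partner or v in used:
                continue  # paths begin at ends without a solid edge
            path = [v, mate(v)]
            while path[-1] in partner:
                nxt = partner[path[-1]]
                path += [nxt, mate(nxt)]
            used.update(path)
            paths.append(path)
    return paths


solid = {
    frozenset(((0, "R"), (1, "L"))): 5,
    frozenset(((1, "R"), (0, "L"))): 2,
    frozenset(((1, "R"), (2, "L"))): 1,
    frozenset(((2, "R"), (3, "L"))): 4,
}
paths = scaffold(4, solid)
```

In this toy graph the maximum matching has weight 11 and closes a cycle through segments 0 and 1; removing the weight-2 edge leaves two scaffolds of total weight 9, whereas the optimal cover (a single path using the edges of weight 5, 1, and 4) has weight 10. The example shows the relaxation losing some weight while staying well within the factor of 2.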
Theorem 2
There is a 2-approximation algorithm for the scaffolding problem.
Proof
Let \(M^{*}\) be the weight sum of the maximum matching from step (1), ALG be the output value of the algorithm presented above, and OPT be the optimal value of the scaffolding problem. For each feasible solution, after removing the dashed edges, the solution must be a matching. Hence, we have \(OPT \le M^{*}\). On the other hand, for a cycle \(c_i\), let \(k_i\) be the number of solid edges in \(c_i\), and let \(k_{\min} = \min_i \{k_i\}\). Since we remove only the smallest weighted solid edge from each cycle, and that edge carries at most a \(1/k_i\) fraction of the solid edge weight of \(c_i\), it follows that \(ALG \ge M^{*} \cdot (k_{\min}-1)/k_{\min}\). Thus \(OPT/ALG \le k_{\min}/(k_{\min}-1)\). In the worst case, \(k_{\min}=2\). Hence, \(OPT/ALG \le 2\). □
Remark 1
If the output of the maximum weighted matching is improved (i.e., \(k_{\min}\) is larger), then the approximation ratio of the algorithm will be better.
The time complexity of our 2-approximation algorithm is dominated by the maximum weighted matching. We used the maximum weighted matching implementation in the LEMON library [29], whose time complexity is \(O(mn \log n)\), where n is the number of vertices and m is the number of edges in the graph for matching. Hence, the total time complexity for the 2-approximation algorithm is \(O(Km \log K)\), where m is the number of edges, which in the worst case is the number of long reads, and K is the number of validated segments, which is proportional to the number of vertices in the graph.
Abbreviations
SMS: Single molecule sequencing
References
 1
Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet. 2013; 14(3):157–167.
 2
Bradnam K, Fass J, Alexandrov A, Baranay P, Bechner M, Birol I, et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience. 2013; 2(1):10. Available from: http://0dx.doi.org.brum.beds.ac.uk/10.1186/2047217X210. Accessed 8 Nov 2017.
 3
Phillippy AM, Schatz MC, Pop M. Genome assembly forensics: finding the elusive misassembly. Genome Biol. 2008; 9(3):R55–R55.
 4
Koren S, Treangen T, Hill C, Pop M, Phillippy A. Automated ensemble assembly and validation of microbial genomes. BMC Bioinformatics. 2014; 15(1):126. Available from: http://0www.biomedcentral.com.brum.beds.ac.uk/14712105/15/126. Accessed 8 Nov 2017.
 5
McCoy RC, Taylor RW, Blauwkamp TA, Kelley JL, Kertesz M, Pushkarev D, et al. Illumina TruSeq Synthetic Long-Reads Empower de novo Assembly and Resolve Complex, Highly-Repetitive Transposable Elements. PLoS ONE. 2014; 9(9):e106689.
 6
Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, et al. The B73 maize genome: complexity, diversity, and dynamics. Science. 2009; 326(5956):1112–1115.
 7
Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012; 30:693–700.
 8
English AC, Richards S, Han Y, Wang M, Vee V, Qu J, et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One. 2012; 7(11):e47768.
 9
Chin CS, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods. 2013; 10:563–569.
 10
Boetzer M, Pirovano W. SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information. BMC Bioinformatics. 2014; 15(1):211. Available from: http://0www.biomedcentral.com.brum.beds.ac.uk/14712105/15/211. Accessed 8 Nov 2017.
 11
Lam KK, Hall R, Clum A, Rao S. BIGMAC: Breaking Inaccurate Genomes and Merging Assembled Contigs for long read metagenomic assembly. BMC Bioinformatics. 2016; 17(1):435.
 12
Kosugi S, Hirakawa H, Tabata S. GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments. Bioinformatics. 2015; 31(23):3733–3741.
 13
Nijkamp J, Winterbach W, Van den Broek M, Daran JM, Reinders M, De Ridder D. Integrating genome assemblies with MAIA. Bioinformatics. 2010; 26(18):i433–i439.
 14
Mandric I, Zelikovsky A. ScaffMatch: scaffolding algorithm based on maximum weight matching. Bioinformatics. 2015; 31(16):2632–2638.
 15
Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012; 22(3):557–567. Available from: http://gage.cbcb.umd.edu/data/index.html. Accessed 8 Nov 2017.
 16
Goodwin S, Gurtowski J, Ethe-Sayers S, Deshpande P, Schatz MC, McCombie WR. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res. 2015; 25(11):1750–1756. Available from: http://schatzlab.cshl.edu/data/nanocorr/. Accessed 8 Nov 2017.
 17
Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004; 5(2):R12.
 18
DevNet, PacificBiosciences, (eds). PacificBiosciences/DevNet. Pacific Biosciences of California, Inc.; 2013. Available from: http://datasets.pacb.com.s3.amazonaws.com/2013/Yeast/HGAP_Assembly/polished_assembly.fasta. Accessed 8 Nov 2017.
 19
Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012; 30(7):693–700. Available from: ftp://ftp.cbcb.umd.edu/pub/data/PBcR//corrected/yeast.corrected.fasta.bz2. Accessed 8 Nov 2017.
 20
Ralser M, Kuhl H, Ralser M, Werber M, Lehrach H, Breitenbach M, et al. The Saccharomyces cerevisiae W303-K6001 cross-platform genome sequence: insights into ancestry and physiology of a laboratory mutt. Open Biol. 2012; 2(8):120093.
 21
Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD. REAPR: a universal tool for genome assembly evaluation. Genome Biol. 2013; 14(5):R47.
 22
Berlin K, Koren S, Chin CS, Drake JP, Landolin JM, Phillippy AM. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol. 2015; 33(6):623–630.
 23
Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics. 2012; 13(1):238.
 24
Sankoff D, Blanchette M. Multiple genome rearrangement and breakpoint phylogeny. J Comput Biol. 1998; 5(3):555–570.
 25
Aganezov S, Alekseyev MA. Multi-genome Scaffold Co-assembly Based on the Analysis of Gene Orders and Genomic Repeats. In: Bourgeois A, Skums P, Wan X, Zelikovsky A, editors. Bioinformatics Research and Applications: 12th International Symposium, ISBRA 2016, Minsk, Belarus, June 5-8, 2016, Proceedings. Cham: Springer International Publishing: 2016. p. 237–249. doi:10.1007/9783319387826_20. https://doi.org/10.1007/9783319387826_20. Accessed 8 Nov 2017.
 26
Alekseyev MA, Pevzner PA. Breakpoint graphs and ancestral genome reconstructions. Genome Res. 2009; 19(5):943–957.
 27
Bosi E, Donati B, Galardini M, Brunetti S, Sagot MF, Lió P, Crescenzi P, Fani R, Fondi M. MeDuSa: a multi-draft based scaffolder. Bioinformatics. 2015; 31(15):2443. doi:10.1093/bioinformatics/btv171. http://0dx.doi.org.brum.beds.ac.uk/10.1093/bioinformatics/btv171
 28
Moran S, Wolfstahl V. Approximation Algorithms for Covering a Graph by Vertex-Disjoint Paths of Maximum Total Weight. NETWORKS. 1990; 20(5):4.
 29
Dezső B, Jüttner A, Kovács P. LEMON - an Open Source C++ Graph Template Library. Electron Notes Theor Comput Sci. 2011; 264(5):23–45. Available from: http://0dx.doi.org.brum.beds.ac.uk/10.1016/j.entcs.2011.06.003. Accessed 8 Nov 2017.
Funding
This work was supported in part by NIH grant R21AI112734 and NIH contract HHSN272200900039C (SJE), NIH grant R21AI123967 (SJE and DZC), and NSF grants CCF1217906 and CCF1617735 (DZC). The publication cost will come from NIH grant R21AI123967.
Availability of data and materials
The software is available at https://bitbucket.org/NDBL/smsc.
About this supplement
This article has been published as part of BMC Genomics Volume 18 Supplement 10, 2017: Selected articles from the 6th IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS): genomics. The full contents of the supplement are available online at https://0bmcgenomicsbiomedcentralcom.brum.beds.ac.uk/articles/supplements/volume18supplement10.
Authors’ contributions
SZ implemented the program, designed and performed experiments, analyzed the data, and prepared the manuscript. SJE provided the experimental data. DC and SJE supervised this project and edited the manuscript. All authors have read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhu, S., Chen, D. & Emrich, S. Single molecule sequencing-guided scaffolding and correction of draft assemblies. BMC Genomics 18, 879 (2017) doi:10.1186/s1286401742718
Keywords
 Genome improvement
 Single molecule sequencing
 Sequence assembly
 Genome scaffolding