- Open Access
A linear time algorithm for detecting long genomic regions enriched with a specific combination of epigenetic states
© Ichikawa and Morishita; licensee BioMed Central Ltd. 2015
- Published: 21 January 2015
Epigenetic modifications are essential for controlling gene expression. Recent studies have shown that not only single epigenetic modifications but also combinations of multiple epigenetic modifications play vital roles in gene regulation. A striking example is the long hypomethylated regions enriched with modified H3K27me3 (called, "K27HMD" regions), which are exposed to suppress the expression of key developmental genes relevant to cellular development and differentiation during embryonic stages in vertebrates. It is thus a biologically important issue to develop an effective optimization algorithm for detecting long DNA regions (e.g., >4 kbp in size) that harbor a specific combination of epigenetic modifications (e.g., K27HMD regions). However, to date, optimization algorithms for these purposes have received little attention, and available methods are still heuristic and ad hoc.
In this paper, we propose a linear time algorithm for calculating a set of non-overlapping regions that maximizes the sum of similarities between the vector of focal epigenetic states and the vectors of raw epigenetic states at DNA positions in the set of regions. The average elapsed time to process the epigenetic data of any of human chromosomes was less than 2 seconds on an Intel Xeon CPU. To demonstrate the effectiveness of the algorithm, we estimated large K27HMD regions in the medaka and human genomes using our method, ChromHMM, and a heuristic method.
We confirmed that the advantages of our method over those of the two other methods. Our method is flexible enough to handle other types of epigenetic combinations. The program that implements the method is called "CSMinfinder" and is made available at: http://mlab.cb.k.u-tokyo.ac.jp/~ichikawa/Segmentation/
- Similarity Score
- Epigenetic Modification
- Developmental Gene
- Epigenetic State
- Optimal Series
Epigenetic modifications have been shown to play a vital role in regulating gene expression. Recent genome-wide studies have revealed that in vertebrates, although most CpG sites in DNA sequences are highly methylated, hypomethylated CpG islands proximal to genes are involved in regulating gene expression . Specifically, hypermethylated CpG islands in promoter regions are relevant to gene silencing, while hypomethylated CpG islands are in an active or permissive state for transcription . In addition to cytosine methylation of CpG sites, some histone modifications around promoter regions also are known to affect the regulation of gene expression [3, 4].
Thus, there has been considerable interest in long K27HMD regions with biologically important characteristics. However, computational methods for detecting long K27HMD regions are still heuristic and ad hoc, emphasizing the need to develop an effective algorithm from a profound background in computation theory. For example, to identify K27HMD, Nakamura et al. proposed a heuristic method that used certain ad hoc parameter settings to define hypomethylated regions and H3K27me3 peak detection . The method is not guaranteed to output K27HMD regions longer than a given threshold, and it often generates regions of differing lengths. ChromHMM  is a statistical method that classifies epigenetic modifications into classes of combinations and divides a DNA sequence into sub-regions such that each sub-region has a uniform combination of epigenetic states while neighboring sub-regions have distinct characteristics. ChromHMM has been used successfully to partition regions surrounding genes into active/inactive promoters, exons, and introns by analyzing epigenetic codes. Although ChromHMM can be used for K27HMD detection by setting its parameters to find regions that are hypomethylated and marked by H3K27me3, ChromHMM often generates many short regions and thus is not suitable for detecting large K27HMD regions. Overall, these previous methods have simply not been designed to output regions of lengths greater than or equal to a given minimum threshold.
To address this problem, we propose a linear time algorithm for calculating a set of non-overlapping regions such that the set maximizes the score of focal combinations of epigenetic modifications (e.g., K27HMD) and the length of each region is greater than or equal to a given minimum threshold (e.g., 4 kb). We define the score of a focal combination of epigenetic modifications at each DNA position as the similarity between the vector of focal epigenetic states and the vector of raw epigenetic states at the position. We then define the similarity score of a set of regions as the sum of similarity scores of all positions in the set. This method solves several issues in previous heuristic methods because it allows us to set a minimum region length for detecting 'long' regions of biological importance and guarantees the output of an optimal set of long regions that maximizes an objective function.
We implemented the algorithm. We call the program CSMinfinder (Chromatin State with minimum length finder). With CSMinfinder, we identified large K27HMD regions in the medaka and human genomes [8, 16, 17] that overlapped many developmental genes. CSMinfinder can be applied to epigenetic data from other vertebrates for understanding cell development and differentiation.
CSMinfinder runs in time proportional to the size of the genome, and it can process vertebrate genomes in feasible amounts of time. Although we applied CSMinfinder specifically to K27HMD, it can be used for the detection of regions with other types of epigenetic combinations by defining the vector of focal epigenetic states appropriately.
To detect long regions of focal epigenetic states, we formulated this as a problem of finding an optimal set of disjoint (non-overlapping) regions in a sequence that maximizes the sum of similarity scores in all regions. Our method calculates a similarity score between a vector of epigenetic modifications at each position and the feature vector of a focal epigenetic state, such as K27HMD, and outputs the set of regions with the highest sum of similarity scores.
Calculating a similarity vector
We need to generate a modification vector at each position from epigenomic signal data. For example, to create benchmark datasets in this study, we binarized the modification signal level at each position using BinarizeBed in ChromHMM , which classified the signal at each position into 0 or 1 according to a Poisson background model. Subsequently, we defined a modification vector as the vector with binary scores of modifications at each position.
Definition 1. Let wl,w2...,w n be non-overlapping windows of the same length (e.g., 200 bp in this study) in a DNA sequence. Let be binary or real-valued signals of k modifications in window i. The modification vector of w i is defined as . Let F denote the feature vector of a focal modification pattern with k elements. The similarity score of M i and F is defined as their inner product minus a given threshold τ.
Example. Suppose that k = 3, τ = 1.3, F = (1,1,0), M1 = (1,1,0), M2 = (1,0,1) and M3 = (0,0,1). Similarity scores of F and M i are 0.7, -0.3, and -1.3 for i = 1,2,3
When the inner product of M i and F is positive for all i = 1,...,n, the optimal set of regions that maximizes the sum of similarity scores in the regions becomes the entire region, [1,n], which may not be informative. If we want to select a set of regions whose modification vectors are closer to the feature vector F, we can set the threshold τ to an appropriate positive value to yield a negative similarity score for the inner product that is lower than τ. Positions with negative similarity scores are less likely to be included in the optimal set of regions. A higher threshold is likely to divide the entire genome into smaller regions with a higher precision, while a lower threshold yields an opposite trend. In this manner, for a series of windows wl,w2...,w n in a DNA sequence, we generate a series of similarity scores.
Detecting an optimal set of disjoint regions
To detect regions of focal epigenetic states such as K27HMD, we present an algorithm for calculating an optimal set of disjoint regions in a sequence that maximizes the sum of similarity scores for all regions. In addition, to identify sufficiently long regions, we define a minimum length threshold of regions such that each region is longer than or equal to the minimum length. The problem can be defined as follows.
a t + m1 - 1 ≤ b t for t = 1,...,k (the minimum length constraint on regions),
bt-1+m0 <a t for t = 2,...,k (the minimum length constraint on intervals between regions), and
a1 = 1 or a1 >m0 (the first region start at position 1 or at position larger than m0).
Readers may find the last condition strange because it appears to disallow the situation that the first region starts at position a1 >m0. We used the condition to simplify the presentation of our linear-time algorithm, which is described later. To obtain such an optimal series of regions that the first region starts at a1 >m0, for example, you can temporarily add m0 negative weights in front of L, calculate the optimal series, and restore the coordinate.
To calculate a C that maximizes the sum of weights in C, , we used a dynamic programming algorithm developed by Csuros . Here, we outline the algorithm.
Definition 3. We assume that all series meet the conditions given in Definition 2. Let w(C) denote the sum of weights in C, . We consider two cases: that in which the last region of C ends at I and that in which it does not. When the last region does not end at i, let denote a series of regions that maximizes w(C) among all series, such that the last region ends at position b k ≤ i - m, where m ≥ 1. When the last region ends at i, let denote a series of regions that maximizes w(C) among all series, such that the last region is of length ≥ m (≥1); specifically, a k + m - 1 ≤ i (= b k ).
According to this definition, C maximizing w(C) is either or . For calculating these two series, we define here and recursively for i = 1,...,n and m ≥ 1.
Recall that C maximizing w(C) is either or . From , we can build the series of regions, , by tracing back the process of calculating . Similarly, from , we can obtain .
We implemented the above idea. We call the program CSMinfinder.
To compare CSMinfinder with other available methods for detecting large K27HMD, we used real biological datasets from the medaka-fish and human genomes, each of which was a set of vectors of DNA methylation levels at CpG sites, determined by bisulfite sequencing, and H3K4me3 and H3K27me3 histone modification Chip-seq data . We set the window size to 200 bp, normalized the data using a Poisson distribution model, and set the binarized score of a window to 1 if its probability was < 0.0001 and to 0 otherwise.
Detecting large K27HMD in medaka epigenomic data
• Using ChromHMM, we estimated six chromatin states and divided the given DNA sequence into these six states. Specifically, ChromHMM asks users to input the number of epigenetic states beforehand. Thus, we started with inputting a small number into ChromHMM, increased the number gradually one by one until we found a state similar to K27HMD, hypo-methylated DNA modification and H3K27me3 histone modification, and called the number sufficient. Inputting a value larger than the sufficient number into ChromHMM did not make much sense because it just output a state similar to K27HMD. The sufficient number was six. Among the six states, one represented hypomethylated DNA modifications and the H3K27me3 histone modification. We therefore treated the state as K27HMD.
• Nakamura's method detects a hypomethylated domain on a DNA sequence that has more than nine contiguous CpG sites with low methylation (methylation level <0.4) and no more than four contiguous highly methylated CpG sites. Parameters are selected heuristically. A hypomethylated domain is treated as a K27HMD if it contains H3K27me3 peaks detected by QuEST , such that each peak is more than three times larger than the average.
• In CSMinfinder, we used two types of minimum length thresholds, 4 kbp and 8 kbp, to evaluate the effect of this constraint. We set the minimum length of any interval between regions to 600 bp.
Comparing the performance in detection of large K27HMD around genes in the medaka genome
Large K27HMD regions of length >4 kbp suppress the expression of many developmental genes . Thus, we verified the effectiveness of CSMinfinder for detecting large K27HMD regions surrounding genes in the medaka genome. Nakamura's method could detect 246 large K27HMD regions containing the promoter regions of developmental genes (e.g., hox clusters) that were relevant to transcriptional regulation and the developmental process. CSMinfinder detected 911 K27HMD regions, and of these, 386 regions contained promoter regions of >4 kbp in size and contained 242 of the 246 regions identified using Nakamura's method. Indeed, our regions covered 91% of bases in the entire regions detected by Nakamura's method. Specifically, although the exact boundaries of individual regions estimated by the two methods were unlikely to be consistent, these regions largely overlapped each other. These results demonstrate the high concordance between CSMinfinder and Nakamura's methods as well as the ability of CSMinfinder to identify more K27HMD regions than did Nakamura's method.
Analysis of large K27HMD regions in human epigenomic data
We also compared CSMinfinder with the other two for processing human epigenomic data. For ChromHMM, we calculated the sufficient number for the human data according to the procedure described before, and we classified epigenetic modification data into seven states rather than six so as to identify a state similar to K27HMD. The sufficient numbers of epigenetic states in the human and medaka data differed due to the difference in data quality. The sufficient number in the medaka data was smaller than that in the human data presumably because epigenetic state signals in the medaka data were clearer.
In CSMinfinder, we set the minimum length threshold to 8 kbp and the interval between regions to 600 bp. We also searched an ideal value of threshold τ by repeated trials to detect large continual regions, and we set τ to 1.3 and 1.6 in the respective medaka and human data.
Computational performance and software availability
In this work, we proposed a method that estimates large K27HMD regions [5–8, 12] by calculating the similarity between the vector of focal epigenetic states and that of raw epigenetic states at each DNA position. The advantage of this algorithm (CSMinfinder) is the output of an optimal series of regions while allowing us to set the minimum length threshold on individual regions. We estimated large K27HMD in the medaka and human genomes and verified that CSMinfinder was comparable to Nakamura's heuristic method  designed to detect K27HMD and was advantageous over ChromHMM in terms of the lengths of K27HMD regions.
For the medaka genomic data, ChromHMM performed well and could detect as many long regions as CSMinfinder did; however, for the human genomic data, ChoromHMM found a smaller number of large K27HMD regions of length > 8 kbp than the other two methods did. This was likely due to the differences in characteristics between the medaka and human genomic data. In the medaka genome, the data were collected from an inbred stain in which the genomic differences between the two alleles were quite small. Thus, methylation levels were bimodal and were clearly divided into two states, hypomethylated and hypermethylated, making it relatively easy to identify blocks of hypomethylated domains. In the human genome, however, the majority of methylation levels were poised because the human genome is diploid intrinsically and allele-specific methylation is prevalent, making it more difficult to detect clear boundaries between hypermethylated and hypomethylated domains. Although many DNA methylation levels are ambiguous in the human genome, ChromHMM attempts to assign one state to each position. Positions with vague DNA methylation levels are assigned only a single state by ChromHMM. Thus, ChromHMM is likely to output too many short regions.
One advantage of CSMinfinder is that we can set the minimum region length for specific purposes. For example, in the medaka genome, using an 8-kbp minimum length threshold merged some of the shorter regions that were generated using a 4-kbp minimum threshold into a longer continuous region. Thus, we could obtain longer regions using a higher minimum length threshold. Similarly, we can also adjust the minimum threshold for defining similarity scores between modification vectors and the feature vector for a variety of purposes. Setting the minimum threshold to a lower value generates more regions that are less similar to the feature vector of interest. Having more than one series of regions that may overlap can be informative. We can therefore tune CSMinfinder easily to meet various demands.
In this paper, we demonstrated the advantages of our algorithm by detecting large K27HMD regions that have attracted much interest because of their importance in characterizing the behavior of developmental genes and confirmed the performance of our algorithm. CSMinfinder is not limited to the identification of large K27HMD regions but can be used for the detection of other large DNA regions that have different types of epigenetic state combinations associated with regulating gene functions.
We thank Professor Hiroyuki Takeda, Dr. Tatsuya Tsukahara, and Dr. Ryohei Nakamura for stimulating discussions. This work was supported in part by CREST, JST (Japan Science and Technology Agency).
This article has been published as part of BMC Genomics Volume 16 Supplement 2, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Genomics. The full contents of the supplement are available online at http://0-www.biomedcentral.com.brum.beds.ac.uk/bmcgenomics/supplements/16/S2
- Hendrich B, Tweedie S: The methyl-CpG binding domain and the evolving role of DNA methylation in animals. Trends in Genetics. 2003, 19 (5): 269-277. 10.1016/S0168-9525(03)00080-5.View ArticlePubMedGoogle Scholar
- Bird A: DNA methylation patterns and epigenetic memory. Genes Dev. 2002, 16 (1): 6-21. 10.1101/gad.947102.View ArticlePubMedGoogle Scholar
- Vastenhouw NL, Schier AF: Bivalent histone modifications in early embryogenesis. Curr Opin Cell Biol. 2012, 24 (3): 374-386. 10.1016/j.ceb.2012.03.009.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhou VW, Goren A, Bernstein BE: Charting histone modifications and the functional organization of mammalian genomes. Nat Rev Genet. 2011, 12 (1): 7-18.View ArticlePubMedGoogle Scholar
- Xie W, Schultz MD, Lister R, Hou Z, Rajagopal N, Ray P, Whitaker JW, Tian S, Hawkins RD, Leung D, et al: Epigenomic analysis of multilineage differentiation of human embryonic stem cells. Cell. 2013, 153 (5): 1134-1148. 10.1016/j.cell.2013.04.022.PubMed CentralView ArticlePubMedGoogle Scholar
- Jeong M, Sun D, Luo M, Huang Y, Challen GA, Rodriguez B, Zhang X, Chavez L, Wang H, Hannah R, et al: Large conserved domains of low DNA methylation maintained by Dnmt3a. Nature genetics. 2014, 46 (1): 17-23.PubMed CentralView ArticlePubMedGoogle Scholar
- Bogdanovic O, Long SW, van Heeringen SJ, Brinkman AB, Gomez-Skarmeta JL, Stunnenberg HG, Jones PL, Veenstra GJ: Temporal uncoupling of the DNA methylome and transcriptional repression during embryogenesis. Genome research. 2011, 21 (8): 1313-1327. 10.1101/gr.114843.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Nakamura R, Tsukahara T, Qu W, Ichikawa K, Otsuka T, Ogoshi K, Saito TL, Matsushima K, Sugano S, Hashimoto S, et al: Large hypomethylated domains serve as strong repressive machinery for key developmental genes in vertebrates. Development. 2014, 141 (13): 2568-2580. 10.1242/dev.108548.View ArticlePubMedGoogle Scholar
- Hu JL, Zhou BO, Zhang RR, Zhang KL, Zhou JQ, Xu GL: The N-terminus of histone H3 is required for de novo DNA methylation in chromatin. Proc Natl Acad Sci USA. 2009, 106 (52): 22187-22192. 10.1073/pnas.0905767106.PubMed CentralView ArticlePubMedGoogle Scholar
- Cedar H, Bergman Y: Linking DNA methylation and histone modification: patterns and paradigms. Nat Rev Genet. 2009, 10 (5): 295-304. 10.1038/nrg2540.View ArticlePubMedGoogle Scholar
- Ooi SK, Qiu C, Bernstein E, Li K, Jia D, Yang Z, Erdjument-Bromage H, Tempst P, Lin SP, Allis CD, et al: DNMT3L connects unmethylated lysine 4 of histone H3 to de novo methylation of DNA. Nature. 2007, 448 (7154): 714-717. 10.1038/nature05987.PubMed CentralView ArticlePubMedGoogle Scholar
- Long HK, Sims D, Heger A, Blackledge NP, Kutter C, Wright ML, Grutzner F, Odom DT, Patient R, Ponting CP, et al: Epigenetic conservation at gene regulatory elements revealed by non-methylated DNA profiling in seven vertebrates. eLife. 2013, 2: e00348-PubMed CentralView ArticlePubMedGoogle Scholar
- Zhao XD, Han X, Chew JL, Liu J, Chiu KP, Choo A, Orlov YL, Sung WK, Shahab A, Kuznetsov VA, et al: Whole-genome mapping of histone H3 Lys4 and 27 trimethylations reveals distinct genomic compartments in human embryonic stem cells. Cell Stem Cell. 2007, 1 (3): 286-298. 10.1016/j.stem.2007.08.004.View ArticlePubMedGoogle Scholar
- Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, Fry B, Meissner A, Wernig M, Plath K, et al: A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell. 2006, 125 (2): 315-326. 10.1016/j.cell.2006.02.041.View ArticlePubMedGoogle Scholar
- Ernst J, Kellis M: ChromHMM: automating chromatin-state discovery and characterization. Nat Methods. 2012, 9 (3): 215-216. 10.1038/nmeth.1906.PubMed CentralView ArticlePubMedGoogle Scholar
- Qu W, Hashimoto S, Shimada A, Nakatani Y, Ichikawa K, Saito TL, Ogoshi K, Matsushima K, Suzuki Y, Sugano S, et al: Genome-wide genetic variations are highly correlated with proximal DNA methylation patterns. Genome research. 2012, 22 (8): 1419-1425. 10.1101/gr.140236.112.PubMed CentralView ArticlePubMedGoogle Scholar
- Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo QM, et al: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009, 462 (7271): 315-322. 10.1038/nature08514.PubMed CentralView ArticlePubMedGoogle Scholar
- Csuros M: Maximum-Scoring Segment Sets. Transactions on Computational Biology and Bioinformatics. 2004, 1 (4): 139-150. 10.1109/TCBB.2004.43.View ArticlePubMedGoogle Scholar
- Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods. 2008, 5 (9): 829-834. 10.1038/nmeth.1246.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.