Skip to main content

Peptimapper: proteogenomics workflow for the expert annotation of eukaryotic genomes

Abstract

Background

Accurate structural annotation of genomes is still a challenge, despite the progress made over the past decade. The prediction of gene structure remains difficult, especially for eukaryotic species, and is often erroneous and incomplete. We used a proteogenomics strategy, taking advantage of the combination of proteomics datasets and bioinformatics tools, to identify novel protein coding-genes and splice isoforms, assign correct start sites, and validate predicted exons and genes.

Results

Our proteogenomics workflow, Peptimapper, was applied to the genome annotation of Ectocarpus sp., a key reference genome for both the brown algal lineage and stramenopiles. We generated proteomics data from various life cycle stages of Ectocarpus sp. strains and sub-cellular fractions using a shotgun approach. First, we directly generated peptide sequence tags (PSTs) from the proteomics data. Second, we mapped PSTs onto the translated genomic sequence. Closely located hits (i.e., PSTs locations on the genome) were then clustered to detect potential coding regions based on parameters optimized for the organism. Third, we evaluated each cluster and compared it to gene predictions from existing conventional genome annotation approaches. Finally, we integrated cluster locations into GFF files to use a genome viewer. We identified two potential novel genes, a ribosomal protein L22 and an aryl sulfotransferase and corrected the gene structure of a dihydrolipoamide acetyltransferase. We experimentally validated the results by RT-PCR and using transcriptomics data.

Conclusions

Peptimapper is a complementary tool for the expert annotation of genomes. It is suitable for any organism and is distributed through a Docker image available on two public bioinformatics docker repositories: Docker Hub and BioShaDock. This workflow is also accessible through the Galaxy framework and for use by non-computer scientists at https://galaxy.protim.eu.

Data are available via ProteomeXchange under identifier PXD010618.

Background

Proteomics and genomics data combined with bioinformatics tools, known as proteogenomics [1,2,3], is a valuable strategy to improve genome annotation [4,5,6]. Proteomics methods and applications have been reviewed by Nesvizhskii [7] and more recently by Menschaert & Fenyö and by Ruggles and collaborators [8, 9]. Proteomics data provides direct access to amino-acid sequences that can be mapped onto translated genomic sequences [10, 11]. The combined use of experimental proteomics data and genomic sequences is a powerful way to: i) confirm gene-model predictions, ii) correct possible intron/exon boundary errors or wrong start/stop codons, and iii) find new CDSs that have not been computationally predicted by machine learning-based approaches or homology searches. Many studies have demonstrated the use proteomics datasets to provide protein-level evidence of gene expression and refine gene models [3, 12]. This approach has been successfully applied to many organisms, such as Anopheles gambiae [13], Rattus norvegicus [14, 15], and Homo sapiens [16], as well as plants [17,18,19]. Many microbial genomes, usually lacking high quality annotation, can also benefit from proteogenomics strategies to improve gene prediction [20,21,22,23,24]. Finally, proteogenomics can also significantly influence the study of non-model organisms [25].

Here we developed an easy to use, suitable, and efficient proteogenomics workflow, Peptimapper (Fig. 1a), to complete eukaryotic genome annotation. It automatically generates de novo short amino-acid sequences (i.e., peptide sequence tags, PSTs) from experimental proteomics data, maps these to the six-frame translation of genomics DNA sequences, and highlights potentially translated regions, which could be exons or genes. Our workflow makes it possible to not only improve genome annotation by confirming or correcting gene models or finding new CDSs, but also to complete classical database-driven proteomics identification, by generating a list of gene-matched translated proteins using these short de novo amino-acid sequences.

Fig. 1
figure 1

a Project workflow: Samples corresponding to various stages of the life cycle (sporophyte, gametophyte, and gametes) and sub-cellular compartments of Ectocarpus sp. were prepared for MS analysis. PSTs were generated from the MS/MS data, mapped against the genome, and clustered. We classified each cluster according to their genomic annotation (whether one hit overlapped with or included at least one CDS of an identified protein) and the number of typical spectra and verified whether all hits were included in a CDS or not. The location of clusters of interest were written into GFF files to be integrated into a Genome viewer. The results of cluster qualification and visualization along the contig allowed us to select clusters for experimental characterization. b Illustration of PSTs obtained from an MS/MS spectrum. MS/MS spectra are composed of ions resulting from the fragmentation of peptides at their peptide bonds during tandem mass spectrometry analysis. The generated fragment ions differ in mass corresponding to their adjacent amino-acid masses within the peptide sequence. A PST is a partial element of information deduced from an MS/MS spectrum, defined as a small sequence of several probable adjacent amino acids from the original peptide and the masses of the flanking N- and C-terminal fragment ions of this small sequence

Design and implementation

Our proteogenomics workflow (Fig. 1a), Peptimapper, is composed of a series of scripts we developed for a project called Ectoline. The scripts are partially based on the PepLine software [18], which were tested and some modified. The workflow consists of modular components and can therefore be used for any eukaryotic genome, following modification to accommodate its properties. We developed Peptimapper using the genome of Ectocarpus sp., a key reference genome for both the brown algal lineage and stramenopiles. For each biological sample, we generated the MS/MS spectra file in Mascot Generic File (MGF) format using conventional proteomics software (i.e., Mascot Distiller, Proteome Discoverer™, etc.). We used the “sequence tagging” approach [10], in which a PST is defined by a small sequence tag (usually three or four amino acids) and the two flanking (N- and C-terminal) masses (Fig. 1b).

The bioinformatics steps are shown in Fig. 2. PSTs were generated de novo from the MS/MS spectra information of the MGF. After testing several PSTs generation tools (see Step 1: From MGF files to PSTs), we decided to adapt an existing tool: PepNovo + 3.1 beta [26] (LXRunPepNovo). PSTs were then mapped on the six-frame translations of the genome sequence, resulting in a list of hits. A hit is defined as the location of a PST on the genome sequence. Finally, closely located hits were clustered to identify regions potentially associated with genes or, at least exons. This was achieved by testing (see Step 2: PST mapping and clustering below) and bundling three modules of the PepLine software (PMTrans, PMMatch, PMClust) into one script: LXPepMatch.

Fig. 2
figure 2

Steps of the bioinformatics workflow

By cross-checking results using a classical database-driven proteomics approach, we tested and optimized step 1 and step 2 modules by varying settings according to known Ectocarpus sp. genome features (downloaded from ORCAE, a public database: http://bioinformatics.psb.ugent.be/orcae/overview/Ectsi) and using as input a subset of reference spectra (see MS/MS reference datasets building). The workflow construction is detailed below (see section Workflow construction: step-by-step).

The last step consisted of classifying clusters generated in step 2, according to their annotation and confidence level using the script we developed for this purpose: LXQualify. Clusters on the genome were visualized by implementing another specific script, LXClust2Gff. It wrote cluster locations generated by LXPepMatch into GFF files for further integration into a genome viewer (i.e., Artemis (Sanger Institute, England) [27]). Results were validated by manually comparing various clusters of interest with EST and RNA-Seq data (see workflow building step by step, step 3 section, below).

From biological samples to MS/MS spectra files (MGF files)

We extracted various subcellular samples (cell walls, cytoplasm, nuclei, membranes) from various life cycle stages of Ectocarpus sp. strains (gametophyte, gamete) using specific protocols (see Additional file 1) to provide deep coverage of the proteome. Enriched extracts were then separated by SDS-PAGE onto 12% precast GeBaGels (Gene Bio-Application Ltd., Kfar Hanagide, Israel) and stained with EZBlue gel staining reagent (Sigma-Aldrich, Saint-Quentin Fallavier, France), according to the manufacturer’s instructions. Gel lanes were cut into 20 bands which were subjected to trypsin digestion, as previously described [28]. Tryptic peptides were analyzed using a nanoflow high-performance liquid chromatography (HPLC) system (LC Packings Ultimate 3000, Thermo Fisher Scientific, Courtaboeuf, France) connected to a hybrid LTQ-Orbitrap XL™ spectrometer (Thermo Fisher Scientific) equipped with a nanoelectrospray ion source (New Objective, Woburn, Massachusetts, USA), as previously described [29]. The mass spectrometer was operated in the data-dependent mode by automatic switching between full-survey scan MS and consecutive MS/MS acquisition. Survey full scan MS spectra (mass range 400–2000) were acquired in the Orbitrap section of the instrument with a resolution of r = 60,000 at 400 m/z; ion injection times were calculated for each spectrum to allow the accumulation of 106 ions in the Orbitrap. The seven most intense peptide ions in each survey scan with an intensity above 2000 were sequentially isolated and fragmented in the linear ion trap by collision-induced dissociation. For Orbitrap measurements, an external calibration was used before each injection series to ensure an overall error mass accuracy below 5 ppm for the detected peptides. MS data were saved in RAW file format (Thermo Fisher Scientific) using XCalibur 2.0.7 with tune 2.4. For each sample, MS/MS spectra, grouped into an MGF file, were generated by Proteome Discoverer™ 1.2 software. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE [30] partner repository with the dataset identifier PXD010618.

Building the MS/MS reference dataset

For each step of Peptimapper, we developed specific tools or used or adapted existing tools. We built three reference datasets for their assessment. These were obtained by a classical database-driven proteomics approach. Peptides were identified using Proteome Discoverer™ 1.2 software supported by the Mascot search engine (Mascot server v2.2.07; http://www.matrixscience.com), using its decoy strategy. This software matches each MS/MS experimental spectrum (RAW file) against a database comprising all theoretical MS/MS spectra calculated for every possible peptide from an in silico digestion of Ectocarpus sp. gene model proteins (downloaded from ORCAE, https://bioinformatics.psb.ugent.be/gdb/ectocarpus/, Ectsi_prot, 2010, 16,533 sequences). Mass tolerance was set to 10 ppm and 0.5 Da for MS and MS/MS, respectively. Enzyme selectivity was set to full trypsin, with one missed cleavage allowed. The allowed protein modifications were set to carbamidomethylation of cysteines and variable oxidation of methionine. Proteome Discoverer™ identification results allowed us to manually create reference MGF files composed of MS/MS spectra, selected according to the False Discovery Rate (FDR) calculated in Proteome Discoverer™ and the reliability of the protein identification. MGF files were separated into three reference datasets according to the confidence level of the identifications: the “green” reference dataset (high quality spectra) containing all MS/MS spectra corresponding to identified proteins with at least three peptides and a FDR (computed as described above by the Mascot search engine) < 1%; the “orange” reference dataset (medium quality spectra) containing all MS/MS spectra with proteins identified by one peptide or more, and a FDR > 1% and < 5% and; the “red” reference dataset (low quality spectra) containing all MS/MS spectra without any protein identification and a FDR > 5%.

Workflow construction: step-by-step

Step 1: From MGF files to PSTs

The first step (Fig. 2) consisted of PST generation. We considered three bioinformatics tools for this step: 1) PepNovo + 3.1 [26], 2) Peaks [31], and 3) Taggor, which is a module of PepLine [18]. Taggor, initially developed for QTOF mass spectrometer data treatment, was adapted to account for mass tolerance parameters when using an ion trap spectrometer. We tested tool performance using a subset of the green dataset: 10 high quality MS/MS spectra provided by proteome Discoverer™, exporting the spectra of the two best peptides from each of the five top scoring proteins (see Additional file 2) into a MGF file. This file was used as input for each of the three bioinformatics tools. We then selected the one that generated the most PSTs identical to peptide sequences identified by Proteome Discoverer™ for the same spectra.

Step 2: PSTs mapping and clustering

The second step consisted of successive genome translation, PST mapping, and hits clustering. We separately tested PMMatch and PMClust scripts, which were then grouped together into LXPepMatch (Fig. 2). We first generated PSTs from the three previously defined reference datasets (see Additional file 2). For PST mapping, we adapted PMMatch, which is a module of PepLine [18] designed to locate PSTs on complete genome sequences, by adding an option to specify an absolute mass tolerance. It was set to 0.5 Da for our test case. We defined the PST length (i.e., optimal number of amino acids) and compared the results with those for which one or no amino acid modifications were allowed, by mapping PSTs of each reference spectra dataset against the Ectocarpus sp. gene model proteins (downloaded from ORCAE, Ectsi_prot, 2010) (Fig. 3a). Proteins matched by PSTs were compared to the expected proteins identified by conventional database spectral identification, using the same spectra and same protein sequence database (Ectsi_prot, 2010) with Proteome Discoverer™ 1.2 software. A hit (the location of a PST on the genome sequence) was considered to be valid if it matched the expected protein. We considered proteins matched by at least two valid hits and corresponding to expected proteins to be true-positives (“true_pos”). We defined “nref” as the number of expected proteins and “nfound” as the number of proteins matched by PMMatch. Sensitivity was defined as the percent of expected proteins matched (true_pos/nref) and selectivity the percent of expected proteins among all proteins matched using our workflow (true_pos/nfound). We computed these metrics for each reference dataset by varying tag lengths from 3 to 5.

Fig. 3
figure 3

a Mapping step testing method. Predicted proteins matched by PSTs were compared to the expected proteins, identified by conventional database spectral identifications, using the same spectra and protein sequences database (ORCAE). b Definition of the cluster exon-mapping categories according to their mRNA locations (Ectocarpus sp. GFF3 files, ORCAE): IN when clusters were located inside an mRNA feature, OUT when clusters were located outside an mRNA feature, and CROSS when clusters were located across an mRNA feature. c Clustering step testing method. We only used the green dataset. The number of predicted proteins (matched by IN and CROSS clusters) was compared to that of expected proteins identified with Proteome Discoverer™ using the Ectocarpus sp. database (ORCAE)

For hit clustering, we only used the green reference dataset to test the program PMClust, another module of PepLine. We optimized this step by selecting the maximal distance between two consecutive hits to be grouped in a cluster, taking into account the mean length of CDSs, exons, introns, genes, and intergenic regions of the Ectocarpus sp. genome. We defined the minimum number of hits (MINHIT) and minimum number of peptides (MINPEP) a cluster could contain to improve the results. Clusters were mapped to the 1591 annotated contigs (GFF3 files downloaded from ORCAE, Ectsi_gff3, 2011) and classified into three categories, according to their mRNA locations: IN when clusters were located inside an mRNA feature, OUT when clusters were located outside of an mRNA feature, and CROSS when clusters were located across an mRNA feature (Fig. 3b).

We performed the first test by analyzing their distribution into each category for various values of MINHIT and MINPEP, with the aim of obtaining the maximum number of IN clusters. In the second test, we compared proteins that matched IN and CROSS clusters to the expected proteins identified by conventional database spectral identification, using the same spectra and protein sequence database (Ectsi_prot, 2010) with Proteome Discoverer™ 1.2 software (Fig. 3c). We considered proteins matched by an IN or CROSS cluster and corresponding to an expected protein to be true-positives (“true_pos”). All IN and CROSS clusters corresponded to found proteins (“nfound”), whereas all expected proteins corresponded to proteins identified by the database-driven approach (“nref”). We calculated the sensitivity (“true_pos”/“nref”), which is the percent of expected proteins matched by an IN or CROSS cluster among all expected proteins, and the selectivity, which is the percent of expected proteins among all proteins found by IN or CROSS clusters (Fig. 3c). We computed these metrics using the green reference dataset for various values of MINHIT and MINPEP.

Step 3: Annotation and visualization of cluster results

In the third step (Fig. 2), we developed a specific script, LXQualify, allowing the annotation of clusters with three labels. The first was “UNANNOTATED” or “ANNOTATED”, if no hits or at least one hit was included in a CDS of a protein-coding gene, respectively. The second was “DUBIOUS”, “POSSIBLE”, or “SURE”, to assign a confidence degree to each cluster according to the number of typical spectra (i.e., specific to the cluster) and the number of hits. A cluster was “DUBIOUS” if it contained zero or one typical spectrum; “POSSIBLE” if it contained two or more typical spectra, but less than three different peptides, and “SURE” if it contained two or more typical spectra and three or more different peptides. The third was “OK” or “CHECK”, if all the hits were included in an annotated CDS or at least one hit did not match with the annotated CDS, respectively. Labeled clusters were listed in a tabular output file.

We developed a specific script, LXClust2Gff, to convert PMClust output to GFF files for visualization. Artemis (Sanger Institute, England) [27] was used as the genome sequence viewer to conveniently assess our results.

Some clusters of interest were validated by comparison with transcriptomics data and RT-PCR experiments. Transcriptomic data (ESTs and RNA-Seq coverage) were downloaded from the ORCAE public database of the Ectocarpus sp. genome. These data were previously published with research articles [32, 33]. For RT-PCR experiments, approximately 100 mg wet weight of frozen samples of Ectocarpus sp.. Ec 32 were quickly ground in liquid nitrogen for RNA extraction. Total RNA was prepared as described previously [34]. RNA quantity and quality were verified using a NanoDrop ND-1000 spectrophotometer (NanoDrop products, Thermo Fisher Scientific) and by electrophoresis on agarose gels. cDNAs were produced from 1 μg total RNA using the ImProm-II™ Reverse Transcription System (Promega, Charbonnières-les-Bains, France). PCR experiments were carried out with a thermocycler machine using a standard GoTaq® DNA Polymerase Protocol (Promega). The annealing temperature of the specific primers for cluster validation were between 58 and 60 °C. PCR products were separated and purified on agarose gels. The DNA was subsequently sequenced using the BigDye® Terminator v3.1 Cycle Sequencing Kit and a 3130 Genetic Analyzer (Applied Biosystems, Foster City, California, USA).

Results

We designed and built our proteogenomics tool using the brown alga E. sp. as our test case. The Ectocarpus sp. (formerly included in E. sp., [35]) has become a model organism for brown algal biology because of its amenable features for morpho-genetic, life-cycle, and genetic studies. Publication of the Ectocarpus sp. genome in 2010 propelled brown algal research into the genomic era and several post-genomic tools have been subsequently developed using this species to explore diverse aspects of brown algal biology, including its life cycle, development, metabolic processes, and interactions with the environment [32, 36]. Resources for Ectocarpus sp. now include two genetic maps [37, 38], gene mapping techniques, microarrays [39, 40], transcriptomic data [41, 42], proteomic techniques [43, 44], and bioinformatics tools for the prediction of peptide addressing [45] and metabolic reconstruction [46].

Workflow: settings and test results

We optimized our proteogenomics approach (Fig. 1a) using the three reference datasets. It required fine-tuning for the type and quality of the MS data and adaptation to the characteristics of the Ectocarpus sp. genome. The initial genome V1 annotation retrieved 16,256 protein-coding sequences, among which 6655 had no EST support and 5819 concerned specific brown algal genes encoding proteins with no known function [32]. Moreover, the high number of introns per gene (an average of seven), the extended 3’UTR regions, with an average length of 845 bp, and short intergenic regions often hampered accurate gene prediction.

Step 1: From MS/MS spectra to PSTs

We selected the best from among three programs (i.e., Taggor, Peaks, PepNovo+) to generate PSTs from MS/MS spectra. Taggor had difficulties distinguishing doubly charged ions from singly charged ions, leading to sequence errors. This tool also required a preliminary deconvolution step. We thus discarded it and focused on Peaks and PepNovo+. Ten high quality experimental spectra identified by Proteome Discoverer™ 1.2 software (see Additional file 2) were manually selected for use as reference sequences (Table 1) to cross-reference PST sequences generated by each program we tested.

Table 1 Comparison of tag quality between Peaks and PepNovo+, using Proteome Discoverer ™ peptide identifications as the expected sequences

Peaks generated PSTs of variable, generally long sequence-tag length (at least six amino acids) that could potentially lead to errors. Indeed, Peaks generated only four correct sequences (Table 1). The errors generated by Peaks are also explained by the mass tolerance accuracy parameter. For example, the amino-acid mass of “DD” is 230.05 Da and that of “ET” is 230.09 Da. The reference sequence tag to generate was “GVSEET”, whereas Peaks wrongly proposed “GVSEDD”. PepNovo+, raised two concerns. First, it did not take into account the H2O molecule plus the single charge acquisitions during the fragmentation process, resulting in errors in the mass of Mn NTer. Second, it did not take the peptide charge into consideration. Nevertheless, after correcting for these problems, PepNovo + appeared to be the best choice. Indeed, it returned 8 of the 10 reference sequences (Table 1). PepNovo + was set to two allowed amino-acid modifications, cysteine carbamidomethylation and methionine oxidation (C + 57 and M + 16, respectively). The maximal tag number generated per spectra was set to 10.

Step 2. From PSTs to hit clusters

PST mapping was performed using the PMMatch program. The results of three PST files generated by PepNovo + from the green, orange, and red reference spectra datasets (see Additional file 2) were used as input. We then compared protein encoding genes matched by PSTs to the expected proteins identified by Proteome Discoverer™ from the same spectra dataset, using the same Ectocarpus sp. protein database (ORCAE, Ectsi_prot, 2010) as that used by PMMatch (Fig. 3a). Sensitivity and selectivity were calculated for each dataset. Sensitivity measures the proportion of positive IDs that were correctly identified among all expected proteins and selectivity the proportion of positive IDs that were correctly identified among all matched proteins. The results of each reference dataset are reported in Fig. 4 for three different sequence tag lengths (i.e., 3, 4, or 5 amino acids) for a minimum of one or two hits per protein (MINHIT) with one or no amino acid modifications allowed.

Fig. 4
figure 4

Sensitivity and selectivity were calculated for each dataset. Sensitivity measures the proportion of positives that are correctly identified by clusters and selectivity is the proportion of positives that are correctly identified among all proteins matched by clusters. a Selectivity and sensitivity for each reference dataset with MINHIT = 1. b Selectivity and sensitivity for each reference dataset with MINHIT = 2. c Selectivity and sensitivity for each reference dataset with MINHIT = 2 and one amino acid modification allowed

We observed poorer selectivity and slightly better sensitivity for the high and medium quality spectra with MINHIT = 1 (Fig. 4a) than with MINHIT = 2 (Fig. 4b). MINHIT = 2 resulted in poorer sensitivity for noisy spectra. Thus, optimal parameters depend on spectra quality. Selectivity increased with PST length, regardless of the quality of the spectra, with a slight unfavorable influence on sensitivity, particularly for spectra of medium and poor quality. Thresholds of MINHIT > 1 and a tag sequence length = 5 amino acids allowed us to focus on spectra of relatively good quality. We thus reduced the number of results to be validated from our preliminary study. Sensitivity and selectivity were higher when MINHIT = 2 and one amino acid modification was allowed, regardless of the quality of the spectra (Fig. 4c). We thus set allowed amino-acid modifications to one for PST mapping.

We performed further tests on the cluster results to set the optimal minimum hits per protein parameter (MINHIT). We compared proteins matched by clusters to the expected proteins identified by conventional database spectral identification using Proteome Discoverer™ 1.2 software and the same spectra and protein database as above (ORCAE, Ectsi_prot, 2010) (Fig. 3c). We generated PSTs from only the green reference dataset, running PMMatch on the Ectocarpus sp. genome sequence (ORCAE, Ectsi_genome_V2_cleaned.tfa) translated by PMTrans. We clustered hits with the aim of uncovering regions potentially associated with genes or, at least, exons. The maximal distance between two consecutive hits in a cluster was correlated with E. sp. genome features to improve cluster results. The statistical distribution was established for each, starting from the E. sp. GFF3 files (ORCAE, Ectsi_gff3, 2011; Ectsi_genome_V2_cleaned.tfa): CDS (median of 137 nt), exons (median of 143 nt), introns (median of 531 nt), gene (median of 4772 nt), and intergenic regions (median of 2529 nt). We observed relatively short CDS and introns and short intergenic regions, with a median that was only four times larger than that of introns. Thus, there was a risk of confusing introns and intergenic regions. We thus fixed the maximal distance between hits to 5000 nt, thus minimizing the risk to merge two proteins while covering 99.6% of introns. Each cluster was annotated to fall into one of three categories, “IN”, “OUT”, or “CROSS”, according to Ectocarpus sp. mRNA locations (Fig. 3b), to further set the minimal number of hits (MINHIT), and validated peptides (MINPEP) required to form a cluster. First, we assessed the proportion of clusters falling into each category and the best result, i.e., 90% of the clusters obtained were “IN” with a minimum of three hits and two peptides (Fig. 5a). Second, we compared the number of proteins predicted by the “IN” and “CROSS” clusters to the expected number of proteins (i.e., identified with Proteome Discoverer™ using the same reference spectra) for different values of MINHIT and MINPEP (Fig. 5b). We obtained satisfactory sensitivity and selectivity scores of 83 and 82%, respectively, with a minimum of three hits and two peptides. Last, we ran PMClust with the following parameters: MINHIT at three hits, MINPEP at two peptides, and the maximal distance between two hits to form a cluster of 5000 nucleotides.

Fig. 5
figure 5

a Percentage of clusters qualified as IN, OUT, or CROSS with respect to predicted gene locations from Ectocarpus sp. GFF3 files (ORCAE). b The number of proteins (matched by IN and CROSS clusters) relative to the number of expected proteins (identified with Proteome Discoverer™) allowed the calculation of sensitivity and selectivity. Sensitivity measures the proportion of positives that are correctly identified by IN or CROSS clusters. Selectivity corresponds to the proportion of positives that are correctly identified among all proteins matched by IN or CROSS clusters

Final workflow: validation on all samples of Ectocarpus sp.

We produced a PST file from each biological sample using LX_RunPepNovo. PSTs were mapped on the six-frame translations of the Ectocarpus sp. genome (ORCAE, Ectsi_genome_V2_cleaned.tfa) using the LXPepMatch program with the optimized parameters described above, thus generating 20-hit lists that were pooled to obtain only one file per sample. We used various strains to isolate biological samples of Ectocarpus sp.. We thus needed to avoid mistakes linked to small genetic differences due to polymorphisms. Before clustering, we merged hits files of each strain: Ec32 (soluble, membrane, and cell wall fractions), Ec594 (gametophyte and nuclei fractions), and Ec410 (gamete fraction). We generated clusters from each hits file, i.e., Ec32, Ec594, and Ec410, with the optimized parameters described above, using GFF3 files (ORCAE, Ectsi_gff3_Jun2013).

The resulting list contained 2107 unique clusters (see Additional file 3), combining all strains, that included 272 unannotated and 1832 annotated clusters. We further analyzed a subset of these clusters. We annotated clusters to fall into one of three grades to assign a degree of confidence to each, based on the number of typical spectra (i.e., specific to the cluster) and the number of hits, as described above. Thirteen percent of clusters were unannotated and 87% annotated (Fig. 6a). We focused on SURE or POSSIBLE and OK or CHECK clusters (see implementation step 3).

Fig. 6
figure 6

a Distribution of all ANNOTATED/UNANNOTATED Ectocarpus sp. clusters. b Distribution of ANNOTATED clusters according to category. c Distribution of UNANNOTATED clusters according to the category

We validated our workflow by focusing on distinct case-results. i) We studied clusters annotated as SURE and CHECK to correct mispredicted genes or correct the ATG start codon. We found 472 clusters annotated as SURE and CHECK (Fig. 6b). Among these, two were retained for further experimental validation. ii) We only focused on the 45 unannotated clusters labeled SURE (Fig. 6c) to find new CDSs. Among these, two were further analyzed.

Correction of mispredicted genes or ATG start codons

The first cluster, named “cluster A”, was present on sctg_326 between 105,788 and 106,978 (Fig. 7a). It was “ANNOTATED_SURE_CHECK” and only identified in Ec32 samples. This cluster was positioned downstream of the predicted gene model Esi0326_0032, annotated as a dihydrolipoamide acetyltransferase. A blastX search (on the ORCAE website, http://bioinformatics.psb.ugent.be/blast/moderated/?project=orcae_Ectsi, Ectsi_genome11x) against all portions of the Esi0326_0032 gene followed by the DNA sequence of cluster “A” identified a full-length protein for the dihydrolipoamide acetyltransferase component of the pyruvate/2-oxoglutarate dehydrogenase complex. This was also corroborated by two ESTs: AAA12YO13, AAB11YA05 (ORCAE, Ectsi_ESTs_cleaned) matching this region (Fig. 7b). Our results showed that the prediction of the last exon of the Esi0326_0032 gene model appeared to be false. Consequently, we selected this cluster as a candidate for correction of a mispredicted gene. We designed primer pairs to amplify portions based on the sequences of the PSTs for validation (Fig. 7c), as public databases suggested that the “A” cluster was expressed in vivo by E. sp.. PCR products of the expected size of 244 bp were obtained (Fig. 7d) and sequencing confirmed the presence of the expected nucleotide sequences (data not shown).

Fig. 7
figure 7

a Artemis view of cluster “A” for validation. Ectocarpus sp. genes close to these clusters are represented by linear exons colored in yellow. Another representation of the same exons is shown in light blue along the six reading frames (+ 1, + 2, + 3 above and − 1, − 2, − 3 below). The cluster is indicated by black bold rectangles around the small rectangles of the PSTs. b Data for cluster “A”. c Experimental validation conditions. d RT-PCR validation: agarose gel electrophoresis of RT-PCR products (lane 1), RNA-PCR negative controls (lane 2), and DNA size marker (lane 3) for PST cluster “A”

The second cluster, named “cluster B” was present on sctg_39 between 759,682 and 762,422 (Fig. 8a). It was “ANNOTATED_SURE_CHECK” and only identified in Ec32 samples. Most (191 “hit in”) of the total hits (257 “tot hit”) were present in the predicted sequence of an Ectocarpus sp. gene corresponding to Esi0039_0135 (Fig. 8b). These 191 hits corresponded to five peptides located in the region of the predicted Esi0039_0135 protein-coding gene (“cds in”). Another peptide, “TYIMIKPDGVQR”, partially covered the Esi0039_0135 sequence (“cds cross”). The predicted start codon of the Esi0039_0135 protein-coded gene is included in this peptide sequence and further analysis of the ESTs showed that the true start codon of this gene may likely be upstream of the predicted one (Fig. 8c). Indeed, these two potential start codons are very close (separated by five amino acids) emphasizing the benefit of a proteomic approach for true start codon assignment. The identification of the PST sequence TYIMIKPDGVQR in the MS/MS data led us to propose a new position for the ATG start codon of the Esi0039_0135 gene. Nevertheless, we cannot exclude that the two ATG codons may be alternatively used in vivo to produce different translation products of this nucleoside diphosphate kinase.

Fig. 8
figure 8

a Artemis view of cluster “B” for validation. Ectocarpus sp. genes close to these clusters are represented by linear exons colored in yellow. Another representation of the same exons is shown in light blue along the six reading frames (+ 1, + 2, + 3 above and − 1, − 2, − 3 below). The cluster is indicated by black bold rectangles around the small rectangles of the PSTs. b Data for cluster “B”. c Translation view of the 5’ cDNA sequence corresponding to the region defined by the cluster

New CDS discovery

“Cluster C”, was mapped against sctg_8 from position 419,782 to position 422,806 (Fig. 9a) with four peptides (Fig. 9b). It was “UNANNOTATED_SURE_CHECK” and observed in the three datasets. An MS-Blast search [47] (http://genetics.bwh.harvard.edu/msblast/), using PST sequences included in this cluster, revealed similarity with ribosomal protein L22 (RPL22) from other species. In addition, cross-analysis with Ectocarpus sp. transcriptomic data available from the ORCAE website (http://bioinformatics.psb.ugent.be/blast/moderated/?project=orcae_Ectsi) showed that a complete cDNA sequence (ACA17YE21) was present in cluster “C”. This sequence appears to be a good candidate for a new protein coding gene as no coding gene has yet been predicted in this region (Fig. 9a). We designed primer pairs to amplify portions based on the PST sequences (Fig. 9c) for validation, as transcriptomic data in public databases suggested that the “C” cluster is likely to be expressed in vivo in E. sp.. PCR products of the expected size of 248 bp were obtained (Fig. 9d) and sequencing confirmed the presence of the expected nucleotide sequences (data not shown).

Fig. 9
figure 9

a Artemis view of cluster “C” for validation. Ectocarpus sp. genes close to these clusters are represented by linear exons colored in yellow. Another representation of the same exons is shown in light blue along the six reading frames (+ 1, + 2, + 3 above and − 1, − 2, − 3 below). The cluster is indicated by black bold rectangles around the small rectangles of the PSTs. b Data for cluster “C”. c Experimental validation conditions. d RT-PCR validation: agarose gel electrophoresis of RT-PCR products (lane 1), RNA-PCR negative controls (lane 2), and DNA size marker (lane 3) for PST cluster “C”

The last cluster, named “cluster D”, was present on sctg_203 between 20,569 and 25,604 (Fig. 10a). It was observed in two datasets (Ec32 and Ec594) and is present on LG02 of the genetic map [38] (Fig. 10b). We also selected it for validation by RT-PCR analysis (Fig. 10d). We designed PCR primers based on the coding sequences of the most distant PSTs identified for this cluster (VVLPTWELR and IADFVGIETTPEIIEK). In addition, four ESTs were found to cover the entire length of the cluster. Amplification of Ectocarpus sp. cDNAs led to an expected product of 692 bp absent from the RNA amplification negative control, confirming the translation of this new gene product (Fig. 10d).

Fig. 10
figure 10

a ARTEMIS view of cluster “D” for validation. Ectocarpus sp. genes close to these clusters are represented by linear exons colored in yellow. Another representation of the same exons is shown in light blue along the six reading frames (+ 1, + 2, + 3 above and − 1, − 2, − 3 below). The cluster is indicated by black bold rectangles around the small rectangles of the PSTs. b Cluster “D” data. c Experimental validation conditions. d RT-PCR and EST validation: agarose gel electrophoresis of RT-PCR products (lane 1), RNA-PCR negative controls (lane 2), and DNA size marker (lane 3) for PST cluster “D”

Peptimapper workflow distribution

Our proteogenomics workflow, Peptimapper, is composed of four scripts from the Ectoline github project: LXRunPepNovo, LXPepMatch, LXQualify, and LXClust2Gff (Fig. 2; see command line arguments and output file descriptions into Additional file 4). PMMatch, used by LXPepMatch, was adapted from the Pepline suite (version 2.0.1) to fit our workflow. LXRunPepNovo is a new version of PepNovo + (version 3.1 beta), adapted for our data. This version contains sources and pre-compiled binaries for Linux and MacOS platforms. Ectoline project is distributed under the GPL or CECILL license. The text of both licenses is attached (and should remain attached) to this distribution and is available at https://github.com/laeticlo/Ectoline.

We built a Docker image, called peptimapper (see dockerfile into Additional file 5), to allow easier distribution and interoperability of Ectoline scripts. Anyone can retrieve this image from a public repository and run it as a package without any specific configuration or installation requirements. This package contains all workflow components (Fig. 2), along with dependencies and running environment. We use two different cloud-based public registry services for storing and distributing our Docker image: the Docker Hub (https://hub.docker.com/r/dockerprotim/peptimapper/) and BioShaDock [48], a public curated and bioinformatics-focused repository (https://docker-ui.genouest.org/app/#/container/dockerprotim/peptimapper). In the context of this project, every computational tool for each step of the overall workflow was integrated and deployed on our own Galaxy server https://galaxy.protim.eu, using this Docker image (see Additional file 5). Our workflow is therefore functionally reproducible with Galaxy [49, 50]. It ran on a virtual machine with 8 CPU and 70 Go RAM.

Discussion

Most sequence-centric proteogenomics available pipelines are based on the generation of customized protein databases from genome, exome, or RNA sequencing to, e.g. reannotate genes, predict splice isoforms or discover novel proteins, using classical database-driven methods [7,8,9]. These methods are based on a direct comparison between experimental MS/MS spectra and theoretical MS/MS spectra generated from in silico digestion of these customized protein databases. A major advantage of such approaches is the specificity of the databases, including variations such as single amino acid variants (SAAVs) and alternative splice junctions. However, one of their weaknesses is the size of these databases, larger than those used in conventional proteomic searches and containing only known proteins. Consequently, it requires iterative search strategies and a specific FDR calculation to be sensitive enough to avoid false positive identifications [7]. Peptimapper overcomes this statistical drawback by first partially interpreting MS experimental spectra before mapping them onto the translated genome. Other similar pipelines currently exist that map MS-based proteomics data onto genomic coordinates as the Proteogenomic Mapping Tool [51], proteoAnnotator [52], PGMiner [53], Protk (https://github.com/iracooke/protk), IggyPep [54], or PepLine [18] (Table 2). However, for most of these pipelines, peptides are derived from database-driven methods, except for IggyPep and PepLine that also use de novo Peptide Sequence Tags (PSTs) obtained from partial interpretation of mass spectrometry data. Unfortunately, PepLine and IggyPep are neither maintained nor available anymore.

Table 2 Eukaryotic proteogenomics pipelines and Galaxy workflows

Another crucial step mentioned into the review by A. Nesvizhskii [7] is the confidence degree for results. Validation and cura\tion steps are not always integrated into existing proteogenomics pipelines. Peptimapper provides results annotated with quality criteria (e.g. minimal number of typical spectra by cluster) and visualizable through a genome browser for manual inspection purposes.

The high number of data processing steps that compose a proteogenomics analysis do not make the strategy easily workable for biologists. Especially since the most of available pipelines are only accessible through a command line interface or sometimes as a stand-alone software. Flexible and accessible Galaxy-based workflows presented Table 2, are implemented for proteogenomics analysis and well used for many projects [55,56,57,58,59]. Interestingly, through a Galaxy framework, Peptimapper is the only pipeline today that uses a complementary de novo approach that has been also proved to be efficient in finding new genes and in the discovery of refinement of intron/exon boundaries.

According to the important criteria we mentioned above a comparison of available pipelines is presented Table 2 based on these functionalities, i.e., database-driven for peptide identification or de novo peptide interpretation, then mapping onto the translated genomic sequence; pipeline interface; user-friendly for biologists; results curation; results visualization. By re-using and improving PepLine former modules, our pipeline extends the process by providing the users with new functionalities, thus meeting the important criteria and being as complete as possible: i) It is compatible with ion trap mass spectrometry data; ii) it allows quality annotation of results and their visualization through a genome browser; and iii) it makes the workflow easily accessible through the Galaxy framework [49, 50].

Annotation of the Ectocarpus sp. genome has become remarkably more accurate through the application of extensive RNA sequencing approaches and new informatics tools [60]. Similarly, the EctoGEM metabolic network has been considered to complement annotations within the Ectocarpus sp. genome database to support the understanding of metabolic networks in this organism [46]. The problems caused by many features of the Ectocarpus sp. genome (high number of introns per gene, extended 3’UTR, short intergenic regions) can be alleviated by accurate annotation through the use of the proteogenomics approach developed in this study.

RT-PCR experiments combined with transcriptomic data (available on ORCAE website) allowed us to confirm the predictions, validate two new genes (RPL22, AST), and correct one gene model (Dihydrolipoamide acetyltransferase), all corresponding to clusters obtained by our combined approach of proteomics and bioinformatics. Crossing the data generated by our bioinformatics workflow for another cluster (cluster B) with transcriptomic data allowed us to identify an alternative ATG start codon of a gene encoding a nucleoside diphosphate kinase. This finding suggests that there may be two alternative ATGs for this gene. Such a result shows that direct mapping of MS/MS data to genomic information provides a valuable approach to complement automatic annotation.

The methodological development focused on: i) workflow development and the optimization of parameters to apply it to all our ‘sub-proteome’ MS/MS datasets and ii) the search for the best qualifying criteria to sort clusters according to specific aims (e.g., re-annotation, identification of small ORFs in the 3’UTR, etc.).

Parameter adjustment is based both on MS/MS spectra and genomic features. Fine-tuning appears to be an important step and configuration workflow settings are now available for organisms with gene characteristics similar to those of our test case. Here, we only focused on a few results. Indeed, many other identified clusters should be of potential biological interest. Sixteen additional clusters are currently under investigation in our laboratory by combining proteomics with new developments in transcriptomics [60]: nine potential new protein-coding genes are yet to be confirmed, and seven exonic models and one ATG model may need correcting (Table 3; Additional file 6).

Table 3 Additional clusters currently under investigation

Future studies

Recently, extensive RNA-seq data were used to improve 11,108 existing gene models and identify 2030 new Ectocarpus sp. protein-coding genes [60]. New data available in public databases has advanced functional annotation associated with protein-coding genes. To date, 61% of genes now have a functional assignment, compared to 34% in the V1 annotation [60] we used in our workflow. We are now applying our workflow, tailored for this organism, using the most recent Ectocarpus sp. genome annotation (http://bioinformatics.psb.ugent.be/orcae/overview/EctsiV2). In the future, we will analyze short ORFs, focusing on small clusters corresponding to gene models ≤ 150 nucleotides. In such a study, the proteogenomics approach is a clear asset to confirm whether some small transcripts are translated.

We also successfully tested Peptimapper using MS data produced by a mass spectrometer of the latest generation (i.e., Q Exactive™ HF, ThermoFisher Scientific), currently used to analyze another organism (i.e., Homo sapiens) and other MS data analysis software (i.e., Mascot Distiller V2.6; software supported by a more recent version of the Mascot server v2.5.1; http://www.matrixscience.com) to create MGF. This shows that Peptimapper is fully adaptable to the most recent MS instruments and MS analysis software and is relevant to study other eukaryotic organisms.

Conclusions

In addition to improving annotation of the Ectocarpus sp. genome and gaining new knowledge about its proteome, our objective was to provide an accessible, efficient, and flexible tool to the annotation community that is easily configurable according to the species of interest. Thus, genome sequence and GFF3 files must be available for the organism of interest to display genome features. The workflow is available as a Docker image or interfaced with our Galaxy platform (see Additional file 5), enabling web access to users with non-programming experience to easily run it in a transparent and reproducible way.

Availability and requirements

Project name: Ectoline

Project home page: https://github.com/laeticlo/Ectoline

Operating system(s): this distribution contains sources and pre-compiled binaries for Linux, and MacOSX platform

Licence: GPL license or under the CECILL licence

Ectoline Docker image name: peptimapper

Docker hub repository: http://hub.docker.com/r/dockerprotim/peptimapper/

Docker bioshadock repository: https://docker-ui.genouest.org/app/#/container/dockerprotim/peptimapper

Galaxy platform: https://galaxy.protim.eu/

Abbreviations

CDS:

Coding DNA sequence

CF:

Cytoplasmic-proteome fractions

CHAPS:

3-[(3-cholamidopropyl)dimethylammonio]-1-propanesulfonate

CPU:

Central processing unit

CW:

Cell wall

CWF:

Cell-wall-enriched proteome fractions

DTT:

Dithiothreitol

EDTA:

Ethylene-diamine-tetraacetic acid

EST:

Expression sequence tag

FACS:

Fluorescence-activated cell sorting

FDR:

False Discovery Rate

GF:

Gametes proteome fractions

GFF/GFF3:

General feature format

HEPES:

4-(2-hydoxyethyl)-1-piperazine ethane sulfonic acid

HPLC:

High-performance liquid chromatography

LC-MS/MS:

Liquid chromatography tamdem mass spectrometry

MF:

Membrane-enriched proteome fractions

MGF:

Mascot Generic File

MS:

Mass spectrometry

NES:

Nuclear export signal

NF:

Nuclear proteome fraction

NLS:

Nuclear localization signal

PCR:

Polymerase chain reaction

PST:

Peptide sequence tag

RAM:

Random access memory

SAAV:

Single amino acid variant

SP:

Secretory pathway

References

  1. Pandey A, Pevzner PA. Proteogenomics. Proteomics. 2014;14(23–24):2631–2.

    PubMed  Google Scholar 

  2. Krug K, Nahnsen S, Macek B. Mass spectrometry at the interface of proteomics and genomics. Mol BioSyst. 2011;7(2):284–91.

    Article  CAS  PubMed  Google Scholar 

  3. Jaffe JD, Berg HC, Church GM. Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics. 2004;4(1):59–77.

    Article  CAS  PubMed  Google Scholar 

  4. Armengaud J. Reannotation of genomes by means of proteomics data. Methods Enzymol. 2017;585:201–16.

    Article  CAS  PubMed  Google Scholar 

  5. Datta KK, Madugundu AK, Gowda H. Proteogenomic methods to improve genome annotation. Methods Mol Biol. 2016;1410:77–89.

    Article  CAS  PubMed  Google Scholar 

  6. Kuster B, Mortensen P, Andersen JS, Mann M. Mass spectrometry allows direct identification of proteins in large genomes. Proteomics. 2001;1(5):641–50.

    Article  CAS  PubMed  Google Scholar 

  7. Nesvizhskii AI. Proteogenomics: concepts, applications and computational strategies. Nat Methods. 2014;11(11):1114–25.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Menschaert G, Fenyo D. Proteogenomics from a bioinformatics angle: a growing field. Mass Spectrom Rev. 2017;36(5):584–99.

    Article  CAS  PubMed  Google Scholar 

  9. Ruggles KV, Krug K, Wang X, Clauser KR, Wang J, Payne SH, Fenyo D, Zhang B, Mani DR. Methods, tools and current perspectives in proteogenomics. Mol Cell Proteomics. 2017;16(6):959–81.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Mann M, Wilm M. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem. 1994;66(24):4390–9.

    Article  CAS  PubMed  Google Scholar 

  11. Yates JR 3rd, Eng JK, McCormack AL. Mining genomes: correlating tandem mass spectra of modified and unmodified peptides to sequences in nucleotide databases. Anal Chem. 1995;67(18):3202–10.

    Article  CAS  PubMed  Google Scholar 

  12. Nanduri B, Wang N, Lawrence ML, Bridges SM, Burgess SC. Gene model detection using mass spectrometry. Methods Mol Biol. 2010;604:137–44.

    Article  CAS  PubMed  Google Scholar 

  13. Kalume DE, Peri S, Reddy R, Zhong J, Okulate M, Kumar N, Pandey A. Genome annotation of Anopheles gambiae using mass spectrometry-derived data. BMC Genomics. 2005;6:128.

    Article  PubMed  PubMed Central  Google Scholar 

  14. Kumar D, Yadav AK, Jia X, Mulvenna J, Dash D. Integrated transcriptomic-proteomic analysis using a proteogenomic workflow refines rat genome annotation. Mol Cell Proteomics. 2016;15(1):329–39.

    Article  CAS  PubMed  Google Scholar 

  15. Chocu S, Evrard B, Lavigne R, Rolland AD, Aubry F, Jegou B, Chalmel F, Pineau C. Forty-four novel protein-coding loci discovered using a proteomics informed by transcriptomics (PIT) approach in rat male germ cells. Biol Reprod. 2014;91(5):123.

    Article  PubMed  Google Scholar 

  16. Wright JC, Mudge J, Weisser H, Barzine MP, Gonzalez JM, Brazma A, Choudhary JS, Harrow J. Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow. Nat Commun. 2016;7:11778.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Chapman B, Castellana N, Apffel A, Ghan R, Cramer GR, Bellgard M, Haynes PA, Van Sluyter SC. Plant proteogenomics: from protein extraction to improved gene predictions. Methods Mol Biol. 2013;1002:267–94.

    Article  CAS  PubMed  Google Scholar 

  18. Ferro M, Tardif M, Reguer E, Cahuzac R, Bruley C, Vermat T, Nugues E, Vigouroux M, Vandenbrouck Y, Garin J, et al. PepLine: a software pipeline for high-throughput direct mapping of tandem mass spectrometry data on genomic sequences. J Proteome Res. 2008;7(5):1873–83.

    Article  CAS  PubMed  Google Scholar 

  19. Castellana NE, Payne SH, Shen Z, Stanke M, Bafna V, Briggs SP. Discovery and revision of Arabidopsis genes by proteogenomics. Proc Natl Acad Sci U S A. 2008;105(52):21034–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Potgieter MG, Nakedi KC, Ambler JM, Nel AJ, Garnett S, Soares NC, Mulder N, Blackburn JM. Proteogenomic analysis of mycobacterium smegmatis using high resolution mass spectrometry. Front Microbiol. 2016;7:427.

    Article  PubMed  PubMed Central  Google Scholar 

  21. Armengaud J, Hartmann EM, Bland C. Proteogenomics for environmental microbiology. Proteomics. 2013;13(18–19):2731–42.

    CAS  PubMed  Google Scholar 

  22. de Groot A, Dulermo R, Ortet P, Blanchard L, Guerin P, Fernandez B, Vacherie B, Dossat C, Jolivet E, Siguier P, et al. Alliance of proteomics and genomics to unravel the specificities of Sahara bacterium Deinococcus deserti. PLoS Genet. 2009;5(3):e1000434.

    Article  PubMed  PubMed Central  Google Scholar 

  23. Muller SA, Findeiss S, Pernitzsch SR, Wissenbach DK, Stadler PF, Hofacker IL, von Bergen M, Kalkhof S. Identification of new protein coding sequences and signal peptidase cleavage sites of helicobacter pylori strain 26695 by proteogenomics. J Proteome. 2013;86:27–42.

    Article  Google Scholar 

  24. Venter E, Smith RD, Payne SH. Proteogenomic analysis of bacteria and archaea: a 46 organism case study. PLoS One. 2011;6(11):e27587.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Armengaud J, Trapp J, Pible O, Geffard O, Chaumot A, Hartmann EM. Non-model organisms, a species endangered by proteogenomics. J Proteome. 2014;105:5–18.

    Article  CAS  Google Scholar 

  26. Frank A, Pevzner P. PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal Chem. 2005;77(4):964–73.

    Article  CAS  PubMed  Google Scholar 

  27. Carver T, Harris SR, Berriman M, Parkhill J, McQuillan JA. Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics. 2012;28(4):464–9.

    Article  CAS  PubMed  Google Scholar 

  28. Com E, Clavreul A, Lagarrigue M, Michalak S, Menei P, Pineau C. Quantitative proteomic isotope-coded protein label (ICPL) analysis reveals alteration of several functional processes in the glioblastoma. J Proteome. 2012;75(13):3898–913.

    Article  CAS  Google Scholar 

  29. Lavigne R, Becker E, Liu Y, Evrard B, Lardenois A, Primig M, Pineau C. Direct iterative protein profiling (DIPP) - an innovative method for large-scale protein detection applied to budding yeast mitosis. Mol Cell Proteomics. 2012;11(2):M111 012682.

    Article  PubMed  Google Scholar 

  30. Vizcaino JA, Csordas A, del-Toro N, Dianes JA, Griss J, Lavidas I, Mayer G, Perez-Riverol Y, Reisinger F, Ternent T, et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 2016;44(D1):D447–56.

    Article  CAS  PubMed  Google Scholar 

  31. Bern M, Cai Y, Goldberg D. Lookup peaks: a hybrid of de novo sequencing and database search for protein identification by tandem mass spectrometry. Anal Chem. 2007;79(4):1393–400.

    Article  CAS  PubMed  Google Scholar 

  32. Cock JM, Sterck L, Rouze P, Scornet D, Allen AE, Amoutzias G, Anthouard V, Artiguenave F, Aury JM, Badger JH, et al. The Ectocarpus genome and the independent evolution of multicellularity in brown algae. Nature. 2010;465(7298):617–21.

    Article  CAS  PubMed  Google Scholar 

  33. Lipinska AP, D'Hondt S, Van Damme EJ, De Clerck O. Uncovering the genetic basis for early isogamete differentiation: a case study of Ectocarpus siliculosus. BMC Genomics. 2013;14:909.

    Article  PubMed  PubMed Central  Google Scholar 

  34. Dittami SM, Gravot A, Goulitquer S, Rousvoal S, Peters AF, Bouchereau A, Boyen C, Tonon T. Towards deciphering dynamic changes and evolutionary mechanisms involved in the adaptation to low salinities in Ectocarpus (brown algae). Plant J. 2012;71(3):366–77.

    CAS  PubMed  Google Scholar 

  35. Peters AF, Marie D, Scornet D, Kloareg B, Cock JM. Proposal of Ectocarpus siliculosus (Ectocarpales, Phaeophyceae) as a model organism for brown algal genetics and genomics. J Phycol. 2004;40:1079–88.

    Article  Google Scholar 

  36. Cock JM, Coelho SM, Brownlee C, Taylor AR. The Ectocarpus genome sequence: insights into brown algal biology and the evolutionary diversity of the eukaryotes. New Phytol. 2010;188(1):1–4.

    Article  CAS  PubMed  Google Scholar 

  37. Avia K, Coelho SM, Montecinos GJ, Cormier A, Lerck F, Mauger S, Faugeron S, Valero M, Cock JM, Boudry P. High-density genetic map and identification of QTLs for responses to temperature and salinity stresses in the model brown alga Ectocarpus. Sci Rep. 2017;7:43241.

    Article  PubMed  PubMed Central  Google Scholar 

  38. Heesch S, Cho GY, Peters AF, Le Corguille G, Falentin C, Boutet G, Coedel S, Jubin C, Samson G, Corre E, et al. A sequence-tagged genetic map for the brown alga Ectocarpus siliculosus provides large-scale assembly of the genome sequence. New Phytol. 2010;188(1):42–51.

    Article  CAS  PubMed  Google Scholar 

  39. Coelho SM, Godfroy O, Arun A, Le Corguille G, Peters AF, Cock JM. OUROBOROS is a master regulator of the gametophyte to sporophyte life cycle transition in the brown alga Ectocarpus. Proc Natl Acad Sci U S A. 2011;108(28):11518–23.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Dittami SM, Scornet D, Petit JL, Segurens B, Da Silva C, Corre E, Dondrup M, Glatting KH, Konig R, Sterck L, et al. Global expression analysis of the brown alga Ectocarpus siliculosus (Phaeophyceae) reveals large-scale reprogramming of the transcriptome in response to abiotic stress. Genome Biol. 2009;10(6):R66.

    Article  PubMed  PubMed Central  Google Scholar 

  41. Ahmed S, Cock JM, Pessia E, Luthringer R, Cormier A, Robuchon M, Sterck L, Peters AF, Dittami SM, Corre E, et al. A haploid system of sex determination in the brown alga Ectocarpus sp. Curr Biol. 2014;24(17):1945–57.

    Article  CAS  PubMed  Google Scholar 

  42. Lipinska AP, Ahmed S, Peters AF, Faugeron S, Cock JM, Coelho SM. Development of PCR-based markers to determine the sex of kelps. PLoS One. 2015;10(10):e0140535.

    Article  PubMed  PubMed Central  Google Scholar 

  43. Contreras L, Ritter A, Dennett G, Boehmwald F, Guitton N, Pineau C, Moenne A, Potin P, Correa JA. Two-dimensional gel electrophoresis analysis of brown algal protein extracts(1). J Phycol. 2008;44(5):1315–21.

    Article  CAS  PubMed  Google Scholar 

  44. Ritter A, Ubertini M, Romac S, Gaillard F, Delage L, Mann A, Cock JM, Tonon T, Correa JA, Potin P. Copper stress proteomics highlights local adaptation of two strains of the model brown alga Ectocarpus siliculosus. Proteomics. 2010;10(11):2074–88.

    Article  CAS  PubMed  Google Scholar 

  45. Gschloessl B, Guermeur Y, Cock JM. HECTAR: a method to predict subcellular targeting in heterokonts. BMC Bioinformatics. 2008;9:393.

    Article  PubMed  PubMed Central  Google Scholar 

  46. Prigent S, Collet G, Dittami SM, Delage L, Ethis de Corny F, Dameron O, Eveillard D, Thiele S, Cambefort J, Boyen C, et al. The genome-scale metabolic network of Ectocarpus siliculosus (EctoGEM): a resource to study brown algal physiology and beyond. Plant J. 2014;80(2):367–81.

    Article  CAS  PubMed  Google Scholar 

  47. Shevchenko A, Sunyaev S, Loboda A, Shevchenko A, Bork P, Ens W, Standing KG. Charting the proteomes of organisms with unsequenced genomes by MALDI-quadrupole time-of-flight mass spectrometry and BLAST homology searching. Anal Chem. 2001;73(9):1917–26.

    Article  CAS  PubMed  Google Scholar 

  48. Moreews F, Sallou O, Menager H, Le Bras Y, Monjeaud C, Blanchet C, Collin O. BioShaDock: a community driven bioinformatics shared Docker-based tools registry. F1000Res. 2015;4:1443.

    Article  PubMed  PubMed Central  Google Scholar 

  49. Goecks J, Nekrutenko A, Taylor J, Galaxy T. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010;11(8):R86.

    Article  PubMed  PubMed Central  Google Scholar 

  50. Afgan E, Baker D, Batut B, van den Beek M, Bouvier D, Cech M, Chilton J, Clements D, Coraor N, Gruning BA, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018;46(W1):W537–44.

    Article  PubMed  PubMed Central  Google Scholar 

  51. Sanders WS, Wang N, Bridges SM, Malone BM, Dandass YS, McCarthy FM, Nanduri B, Lawrence ML, Burgess SC. The proteogenomic mapping tool. BMC Bioinformatics. 2011;12:115.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  52. Ghali F, Krishna R, Perkins S, Collins A, Xia D, Wastling J, Jones AR. ProteoAnnotator--open source proteogenomics annotation software supporting PSI standards. Proteomics. 2014;14(23–24):2731–41.

    Article  CAS  PubMed  Google Scholar 

  53. Has C, Lashin SA, Kochetov AV, Allmer J. PGMiner reloaded, fully automated proteogenomic annotation tool linking genomes to proteomes. J Integr Bioinform. 2016;13(4):293.

    Article  PubMed  Google Scholar 

  54. Menschaert G, Vandekerckhove TT, Baggerman G, Landuyt B, Sweedler JV, Schoofs L, Luyten W, Van Criekinge W. A hybrid, de novo based, genome-wide database search approach applied to the sea urchin neuropeptidome. J Proteome Res. 2010;9(2):990–6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  55. Jagtap PD, Johnson JE, Onsongo G, Sadler FW, Murray K, Wang Y, Shenykman GM, Bandhakavi S, Smith LM, Griffin TJ. Flexible and accessible workflows for improved proteogenomic analysis using the Galaxy framework. J Proteome Res. 2014;13(12):5898–908.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  56. Sheynkman GM, Johnson JE, Jagtap PD, Shortreed MR, Onsongo G, Frey BL, Griffin TJ, Smith LM. Using Galaxy-P to leverage RNA-Seq for the discovery of novel protein variations. BMC Genomics. 2014;15:703.

    Article  PubMed  PubMed Central  Google Scholar 

  57. Fan J, Saha S, Barker G, Heesom KJ, Ghali F, Jones AR, Matthews DA, Bessant C. Galaxy integrated omics: web-based standards-compliant workflows for proteomics informed by transcriptomics. Mol Cell Proteomics. 2015;14(11):3087–93.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  58. Sajulga R, Mehta S, Kumar P, Johnson JE, Guerrero CR, Ryan MC, Karchin R, Jagtap PD, Griffin TJ. Bridging the chromosome-centric and biology/disease-driven human proteome projects: accessible and automated tools for interpreting the biological and pathological impact of protein sequence variants detected via proteogenomics. J Proteome Res. 2018. https://0-doi-org.brum.beds.ac.uk/10.1021/acs.jproteome.8b00404

  59. Chambers MC, Jagtap PD, Johnson JE, McGowan T, Kumar P, Onsongo G, Guerrero CR, Barsnes H, Vaudel M, Martens L, et al. An accessible proteogenomics informatics resource for cancer researchers. Cancer Res. 2017;77(21):e43–6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  60. Cormier A, Avia K, Sterck L, Derrien T, Wucher V, Andres G, Monsoor M, Godfroy O, Lipinska A, Perrineau MM, et al. Re-annotation, improved large-scale assembly and establishment of a catalogue of noncoding loci for the genome of the model brown alga Ectocarpus. New Phytol. 2017;214(1):219–32.

    Article  CAS  PubMed  Google Scholar 

  61. Zhu Y, Orre LM, Johansson HJ, Huss M, Boekel J, Vesterlund M, Fernandez-Woodbridge A, Branca RMM, Lehtio J. Discovery of coding regions in the human genome by integrated proteogenomics analysis workflow. Nat Commun. 2018;9(1):903.

    Article  PubMed  PubMed Central  Google Scholar 

  62. Li Y, Wang X, Cho JH, Shaw TI, Wu Z, Bai B, Wang H, Zhou S, Beach TG, Wu G, et al. JUMPg: an integrative proteogenomics pipeline identifying unannotated proteins in human brain and Cancer cells. J Proteome Res. 2016;15(7):2309–20.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  63. Has C, Lashin SA, Kochetov A, Allmer J. PGMiner reloaded, fully automated proteogenomic annotation tool linking genomes to proteomes. J Integr Bioinform. 2016;13(4):16–23.

    Article  Google Scholar 

  64. Crappe J, Ndah E, Koch A, Steyaert S, Gawron D, De Keulenaer S, De Meester E, De Meyer T, Van Criekinge W, Van Damme P, et al. PROTEOFORMER: deep proteome coverage through ribosome profiling and MS integration. Nucleic Acids Res. 2015;43(5):e29.

    Article  PubMed  Google Scholar 

  65. Nagaraj SH, Waddell N, Madugundu AK, Wood S, Jones A, Mandyam RA, Nones K, Pearson JV, Grimmond SM. PGTools: a software suite for proteogenomic data analysis and visualization. J Proteome Res. 2015;14(5):2255–66.

    Article  CAS  PubMed  Google Scholar 

  66. Kim H, Park H, Paek E. NextSearch: a search engine for mass spectrometry data against a compact nucleotide exon graph. J Proteome Res. 2015;14(7):2784–91.

    Article  CAS  PubMed  Google Scholar 

  67. Risk BA, Spitzer WJ, Giddings MC. Peppy: proteogenomic search software. J Proteome Res. 2013;12(6):3019–25.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

We are grateful to Laurence Dartevelle and Susana Coelho for the production of unialgal Ec 32, Ec 410 and mutant oro of Ectocarpus. Rachel Lefevre who participates to the elaboration of the protocols to obtain soluble, membrane and cell wall proteins. We thank the GenOuest bioinformatics facility, in particular Anthony Bretaudeau and Cyril Monjeaud for helping with Peptimapper Galaxy integration.

Funding

This work was supported by the French National Research Agency via the investment expenditure programme IDEALG (ANR-10-BTBR-04-02). This work was conducted at the PROTIM core facility (https://www.protim.eu) and supported by grants from Biogenouest, Infrastructures en Biologie Santé et Agronomie (IBiSA) and Conseil Régional de Bretagne awarded to C.P.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Author information

Authors and Affiliations

Authors

Contributions

AR and DM: prepared E. sp. protein sub-proteome fractions. LD: performed RT-PCR experiments, bioinformatic cluster validation and predictions of protein localization and wrote the manuscript. PPotin: participated in the research design, data analysis and manuscript editing. LG: designed and performed bioinformatics workflow optimization and adaptation, analysis, integration into Galaxy and Docker developpement and wrote the manuscript. AV and YV: performed workflow pepLine optimization and adaptation and manuscript editing. PPeterlongo: performed PepNovo+ adaptation. EC and RL: performed proteomics MS analyses and manuscript editing. CP designed and coordinated the research and wrote the manuscript. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Charles Pineau.

Ethics declarations

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1:

Sample preparation protocols. (PDF 93 kb)

Additional file 2:

All reference datasets Excel file. (XLS 21262 kb)

Additional file 3:

All cluster results Excel file. (XLSX 282 kb)

Additional file 4:

Scripts detailed descriptions: command line arguments, output file descriptions and availability. (PDF 161 kb)

Additional file 5:

Bioinformatic tools distribution. A. Peptimapper dockerfile. B. Workflow labeled “Peptimapper” available on Protim Galaxy platform. (PDF 773 kb)

Additional file 6:

Study of additional clusters under investigation listed Table 3. RNA-sequencing and EST data have been incorporated in the browser. (PDF 2631 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Guillot, L., Delage, L., Viari, A. et al. Peptimapper: proteogenomics workflow for the expert annotation of eukaryotic genomes. BMC Genomics 20, 56 (2019). https://0-doi-org.brum.beds.ac.uk/10.1186/s12864-019-5431-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12864-019-5431-9

Keywords