Skip to main content

Low impact of different SNP panels from two building-loci pipelines on RAD-Seq population genomic metrics: case study on five diverse aquatic species

Abstract

Background

The irruption of Next-generation sequencing (NGS) and restriction site-associated DNA sequencing (RAD-seq) in the last decade has led to the identification of thousands of molecular markers and their genotyping for refined genomic screening. This approach has been especially useful for non-model organisms with limited genomic resources. Many building-loci pipelines have been developed to obtain robust single nucleotide polymorphism (SNPs) genotyping datasets using a de novo RAD-seq approach, i.e. without reference genomes. Here, the performances of two building-loci pipelines, STACKS 2 and Meyer’s 2b-RAD v2.1 pipeline, were compared using a diverse set of aquatic species representing different genomic and/or population structure scenarios. Two bivalve species (Manila clam and common edible cockle) and three fish species (brown trout, silver catfish and small-spotted catshark) were studied. Four SNP panels were evaluated in each species to test both different building-loci pipelines and criteria for SNP selection. Furthermore, for Manila clam and brown trout, a reference genome approach was used as control.

Results

Despite different outcomes were observed between pipelines and species with the diverse SNP calling and filtering steps tested, no remarkable differences were found on genetic diversity and differentiation within species with the SNP panels obtained with a de novo approach. The main differences were found in brown trout between the de novo and reference genome approaches. Genotyped vs missing data mismatches were the main genotyping difference detected between the two building-loci pipelines or between the de novo and reference genome comparisons.

Conclusions

Tested building-loci pipelines for selection of SNP panels seem to have low influence on population genetics inference across the diverse case-study scenarios here studied. However, preliminary trials with different bioinformatic pipelines are suggested to evaluate their influence on population parameters according with the specific goals of each study.

Background

Next-generation sequencing (NGS) technologies have represented a breakthrough for genomic studies [1] due to the huge reduction of sequencing cost (less than 0.02$ per Mb [2]) and the development of a broad and versatile range of techniques for different genomic approaches [3]. By harnessing the possibilities of NGS, diverse reduced-representation genome sequencing approaches, useful to identify and genotype thousands of markers for genomic screening, were suggested and quickly became popular [4, 5]. One of these approaches is the restriction site-associated DNA sequencing (RAD-seq), currently in a more mature phase, which includes different methods (e.g. ddRAD-seq, ezRAD-seq, 2b-RAD-seq) whose performances have been compared using simulations and real data [6]. RAD-seq methods require specific library preparation protocols, which exploit the ability of Restriction Enzymes (REs) to cut at specific genomic targets rendering a collection of fragments representative of a genome fraction to be compared among samples. These collections can be screened to identify and genotype a variable number of single nucleotide polymorphisms (SNPs) depending on the goals of the study for population genomics, linkage mapping or genome wide association studies, among others. The 2b-RAD method here used exploits the properties of IIB REs which produce a collection of short DNA fragments (between 33 and 36 bp) by cutting at both sides of the recognition site [7]. This method has the advantages of simple library preparation, short-reads to be sequenced (single-end 50 bp) and, as other methods, the number of loci can be adjusted both using REs with different recognition site frequency or by fixing nucleotides in the adaptors during library construction (i.e. selective-base ligation) [7, 8].

Genomic laboratory protocols have been set up and optimized through years by introducing modifications on the original RAD-seq methodology to get better results using different laboratory protocols for different scenarios (e.g. samples with low DNA quality, genome size, etc.; see Fig. 5 in [8]). Similarly, the bioinformatic pipelines starting from raw data, a critical issue in RAD-seq methodologies, have undergone an important refinement and diversification. Nevertheless, there is not a consensus about what is the best strategy for each scenario, despite the increasing number of studies addressed to evaluate the impact of technical and/or bioinformatic protocols [9, 10]. In a typical 2b-RAD library, hundreds of millions of reads are generated, and they need to be allocated to each multiplexed individual (dozens to hundreds in the same lane) and to each genomic position or locus in the reference genome (or RAD-tag catalogue). The rationale behind this is stacking raw reads belonging to the same locus, while discerning and separating at the same time the reads belonging to different loci. Results could be improved if a reference genome, belonging to the species itself or to other congeneric species, is available. This would enable to avoid mixing of reads pertaining to paralogous loci. In November 2020, there were reference genomes for 25 bivalve species and subspecies (22 genera) and 583 fish species (338 genera) with different assembly confidence at the NCBI database (https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/datasets/). Nevertheless, there are about 9200 species within the 1260 bivalve genera [11] and 35,672 recognized species within the 5212 documented fish genera [12]. All in all, less than 0.2% of the genomes of the known eukaryotic species have been sequenced to date [13]. Although full genome sequencing assembly is becoming progressively more robust thanks to the long-read sequencing methods and assembling strategies, most of the species will have to wait for long before their genomes are assembled. Therefore, de novo approaches (i.e. stacking reads without a reference genome) will be the only option for many studies, although some initiatives are trying to change this perspective (e.g. Earth Biogenome Project; https://www.earthbiogenome.org/). For this reason, one of the strengths of a RAD-based method is its applicability without a reference genome [14].

There are different bioinformatic pipelines to identify a high number of SNPs and achieve confident genotypes using a RAD-seq approach. The most popular one is STACKS [15, 16] (around 3000 citations, at Google scholar in Nov. 2020), but several other alternatives, including Meyer’s 2b-RAD pipeline [7], which was the original building-loci pipeline for 2b-RAD data, have been recently published. Some of these pipelines are able to perform a de novo approach (dDocent [17]), whereas others need a reference genome for alignment (Fast-GBS [18], TASSEL-GBS v2 [19]) or can address both approaches (STACKS, Meyer’s 2b-RAD v2.1 pipeline, ipyrad [20]). Several of these alternative pipelines merge and concatenate pre-existing applications, making their design flexible and customized according to the data managed and the goals of the study, but also providing upgrading and reliable bug-fix (e.g. dDocent, Fast-GBS, Meyer’s 2b-RAD v2.1 pipeline). Several factors should be considered for the selection of the bioinformatic pipeline to be used, among which sampling variance (number of samples and reads across them), population structure, genome architecture of the species studied and budget (e.g. read coverage) are the most relevant. The genome of each species has its particular size, history (e.g. duplication events), polymorphism, complexity and inter-individual variability, which can hinder the identification of stacks of reads (putative RAD loci) and their variants, circumstances that should be considered when choosing the appropriate building-loci pipeline and its parameters.

Studies comparing bioinformatic pipelines and strategies already exist. Some comparisons between de novo and reference-based approaches are available [21, 22], and one of them tested the performance of the different strategies used to obtain accurate population genetics inferences [22]. Noticeable differences were observed among bioinformatic pipelines in the number of detected SNPs, sometimes resulting in distinct values for population descriptors and inference [22]. Other studies have evaluated the same software with different species to optimise the selection of bioinformatic parameters (STACKS 1.42, [23]; STACKS 1.44, [10]), making a common advice of doing preliminary trials to optimize the building-loci pipeline selected parameters. Published step-by-step protocols with a single species also exist [14]. A number of SNP calling comparison between STACKS 1.08 and dDocent 1.0 has been carried out using three fish species [17], while Sovic et al. [24] tested a novel pipeline (i.e. AftrRAD) vs STACKS and PYRAD using simulated and species datasets to assess computational efficiency and SNP calling. It is not uncommon to find large differences in the number of SNPs (e.g. of one order of magnitude) in some building-loci pipelines comparisons [17]. Recently, Wright et al. [25] compared population parameters (e.g. FST, PCoA) with SNPs obtained from three pipelines (i.e. GATK, SAMtools, STACKS) using two species with reference genome. Results showed remarkable differences in some population parameters (e.g. Hardy Weinberg Equilibrium) across bioinformatic approaches. Considering this information, a main issue that should be clarified on RAD-seq methodologies is the impact of building-loci pipelines on population genetics parameter estimations and derived conclusions using a de novo approach on different biological scenarios and to some extent, to be compared to a reference genome approach.

In this work, two building-loci pipelines for SNP calling and genotyping: i) STACKS 2.0 (http://catchenlab.life.illinois.edu/stacks/) and ii) Meyer’s 2b-RAD v2.1 pipeline were tested using a de novo approach on different genomics and population genetics scenarios by using five aquatic species: (i) Manila clam (Ruditapes philippinarum), (ii) common edible cockle (Cerastoderma edule), (iii) brown trout (Salmo trutta), (iv) silver catfish (Rhamdia quelen) and (v) small-spotted catshark (Scyliorhinus canicula). A range of population parameters were compared in the five species applying similar parameter settings for each pipeline (description of pipelines in Methods). The two marine bivalve species from the Order Veneroida show high polymorphism and low population structure [26, 27]; the brown trout belongs to the order Salmoniformes, which suffered a specific genome duplication event [28], and shows one of the highest population structuring among vertebrates [29]; isolated populations from different ecosystems were analysed in the silver catfish, a freshwater species from the order Siluriformes living in fluvial and costal lagoon environments [30]; finally, the small-spotted catshark (order Carcharhiniformes) is a benthic species which populations here used show low genetic differentiation [31]. Despite these species represent different population genetics and evolutionary scenarios, this does not mean that they necessarily comprehend all the models for the manifold scenarios used to check the performance of building-loci pipelines. To date, two of the species used in this study have a reference genome available: Manila clam (assembly size: 1.123 Gb; 19 chromosomes [32]) and brown trout (2.370 Gb; 40 chromosomes [33]). There are different statistics to evaluate the quality of genome assemblies (see Table 2 in [34] such as scaffold N50 and N90 (i.e. the length of the scaffold at which 50 and 90% of the assembly length is covered, respectively). The scaffold N50 is much higher in brown trout than in Manila clam assembly (52,209 Kb and 345 Kb, respectively), but a very high accuracy and completeness (see Supplementary Material [32]). Anyway, for short read data such as for 2bRAD-seq, the contiguity of the genome should not have a major influence in the results when conservative short-read aligner parameters and strong SNP filtering steps are applied, minimizing the admixture of reads from paralogous loci or losing of reads due to genome fragmentation.

Wang and Guo [35] hypothesized that bivalves with 19 chromosomes could have a tetraploid origin, due to its ability to tolerate chromosomal aneuploidies. Furthermore, gene/gene family expansions would be a rather common process in this group, likely more frequent than in other molluscs [36]. Both genomic features could pose a challenge regarding paralogous genes for stacking reads, a common problem when no reference genome is available. In addition, molluscs present the highest genetic polymorphism in animal kingdom [37], which could represent genotyping drawbacks related to the presence of null alleles. All salmonids, including brown trout, have a tetraploid origin in process of diploidization since their origin around 90 Mya. This specific duplication should be added to the three Whole Genome Duplications (WGDs) events in the line of teleosts from the vertebrate ancestor [38, 39], which represent a major issue regarding paralogy. This issue would not be so important in small-spotted catshark and silver catfish, with two and three older WGD events in their evolutive lines, respectively [40, 41]. The most recent, the teleost-specific 3rd WGD, dated around 300 Mya. We followed a de novo approach for comparison between pipelines in all species, but reference genomes in these two species were also taken as a useful reference to elucidate which building-loci pipeline provides better results with a de novo approach.

For the five species, population parameters, structure pattern and outlier loci detection were estimated as an essential outcome to evaluate the performance of four SNP panels after different filtering steps. From two building-loci pipelines, STACKS (STA panel onwards) and Meyer’s 2b-RAD v2.1 pipeline (ALT panel onwards), and two criteria for SNP selection, common SNPs (i.e. shared between building-loci pipelines, COM panel onwards) and merged SNP (a combination of shared and exclusive SNPs from both building-loci pipelines, MER panel onwards) panels. When available, the results from reference genome approach were used to compare the results of population parameters evaluated with the de novo approach. In this case three additional SNP panels were obtained: the former with the reference genome approach using STACKS and the two remaining with the shared SNPs (i.e. STACKS de novo and 2b-RAD v2.1). Our main goal was to assess the influence of the genomic architecture and population structure on the biological conclusions obtained with the different bioinformatic pipelines, and accordingly, to propose methodological recommendations for future studies using a de novo approach.

Results

The number of filtered reads loaded into building-loci pipelines using the reference genome approach was lower than with the de novo approach. A percentage of 22.1 and 25.4% were finally used in Manila clam and in brown trout, respectively. This reduction mostly due to those reads aligning to more than one place (59.6 and 72.6%, respectively) that were filtered out (i.e. -m 1 in Bowtie 1.1.2), the remaining reads failed to align with the mismatch criteria applied (18.3 and 2.0%, respectively).

The number of initial SNPs, after the building-loci step with the de novo approach, ranged from 56,074 in brown trout (STA) and 125,823 in silver catfish (ALT) to 356,389 (STA) and 426,317 (ALT) in common cockle (Tables S1-S5). These figures dropped throughout the successive filtering steps (Fig. 1) up to finally being retained from 0.2% in Manila clam STA panel to 20.5% in silver catfish STA panel (Figs. 2 and 3). There was a remarkable difference at the initial number of SNPs obtained between STA and ALT pipelines in brown trout and small-spotted catshark, although the outcome after filtering was rather similar (Table 1). No comparison was made at BIAL filter (i.e. SNPs with more than two alleles excluded; Fig. 1), since triallelic SNPs are removed with STACKS by default. The proportion of missing genotypes after applying the minimum coverage filtering step was higher with ALT panel than with STA in almost all species (Fig. 4). Filtering patterns varied among species due to the different weight of each filtering step. For instance, in bivalves, where genetic polymorphism is higher, the SNP retention after the third filtering step (i.e. RAD loci with ≤3 SNPs per RAD-locus) was much lower than in fish species (Figs. 2 and 3), while in silver catfish and small-spotted catshark, was due to the minimum allele count (MAC). This was related to the smaller sampling size of those fish species (N = 21 and N = 28, respectively) and the higher frequency of missing data, especially in small-spotted catshark (83% after depth filter in STA and ALT panels). When comparing pipelines, no clear differences on the filtering pattern were observed through the different filtering steps (Figs. 2 and 3), except for MAC in brown trout, where more SNPs were pruned in the ALT panel. This could be related to the increment on the missing data after the minimum coverage filter, with higher frequency in ALT (59.3% vs 43.7% for ALT and STA, respectively (Fig. 4). After the third filtering step, in brown trout (N = 52) there were significantly more missing genotypes per SNP (P < 0.01) at ALT panel on average (28.74 ± 15.24) as compared to STA panel (21.07 ± 17.54). The two species with the lowest median coverage at this step were small-spotted catshark (median = 10x for ALT and STA panels), brown trout (median = 11x and 14x for ALT and STA panels, respectively) and Manila clam (21x for the STA panel).

Fig. 1
figure1

Scheme of filtering steps to obtain the different SNP panels: STA, ALT, COM and MER, representing STACKS, Alternative, Shared and Merged panels, respectively

Fig. 2
figure2

Number of SNPs from the initial building-loci pipelines (blue bars) to the final panels (green bars) through the different SNPs filtering steps

Fig. 3
figure3

SNP reduction (in percentage) according to population genetics filtering steps from the initial number of SNPs to the panel finally retained (purple bar)

Table 1 Mean (bold values) and standard deviation of population metrics for the final SNP panels using a de novo approach
Fig. 4
figure4

Percentage of missing genotypes after the Min coverage filter (8x). Boxplots were obtained with the percentage of missing genotypes through the different samples. Up and down triangles represent the percentage of missing genotypes at different SNP panels after and before 8x coverage filter, respectively

The final number of SNPs ranged from 479 (STA) and 956 (ALT) in Manila clam to 21,468 (STA) and 22,481 (ALT) in silver catfish. These figures were always higher with the ALT pipeline for all species. The number of SNPs in COM panels, those from STA panel shared with ALT panel, ranged from 206 in Manila clam to 17,459 SNPs in silver catfish, while the percentage of SNPs called in STA that were found in ALT ranged from 23.9% in small-spotted catshark to 81.3% in silver catfish. The main source of variation when considering the whole COM panels genotype dataset (for instance in silver catfish Ntotal COM genotypes (366,639) = Nsamples (21) x NCOM SNP (17,459)) was missing data, ranging from 2.6% in common cockle to 14.8% at small-spotted catshark (Figs. S1-S4). Roughly speaking, there would be three types of missing calling data with a de novo approach: (i) SNPs not genotyped by a building-loci pipeline due to too low coverage (< 3x with used configuration); ii) SNPs genotyped by a building loci-pipeline but with a coverage lower than 8x (Min coverage 8x filtering step; Fig. 1); and (iii) SNPs with enough coverage but ambiguous alternative nucleotide depth (see http://eli-meyer.github.io/2bRAD_utilities/#genotype). The main source of missing genotype differences between pipelines was related to COM SNPs from STA pipeline that passed the minimum coverage filter (Min coverage 8x), but not were genotyped by ALT pipeline due to having a coverage lower than 3x. Both pipelines never genotyped with a coverage lower than 3x with the used configuration. This situation was found in most species, causing from 46.8% of the total missing genotype differences between ALT and STA genotypes in shared SNPs (COM panel) in brown trout to 63.2% in small-spotted catshark. Nevertheless, the main source of missing data differences in Manila clam was genotypes removed by coverage filter (Min coverage 8x) in ALT but not in STA (53.5% of the missing genotype differences). The percentage of the genotyping differences caused by homozygous-heterozygous differences between pipelines at the same SNP and individual ranged from 0.5% in brown trout to 2.6% in small-spotted catshark with respect to the whole COM genotype panel (Table 2). The frequency of genotyping differences caused by different homozygotes at the same SNP and individual (e.g. AA for one pipeline and GG for the other) was negligible in almost all cases (from 0 to 0.098%). The number of SNPs obtained with reference genome (RG) approach was always lower than with both de novo building-loci pipelines (see Tables S1 and S3), however, the number of SNPs shared between RG and each of the de novo pipelines was similar (i.e. RG-STA and RG-ALT). The percentage of SNPs obtained with RG approach detected as well in both de novo pipelines was 47.7% (RG-STA/RG) and 37.7% (RG-ALT/RG) in Manila clam and 73.0% (RG-STA/RG) and 80.9% (RG-ALT/RG) in brown trout, unlike the reverse way where the percentages of shared SNPs were lower due to the higher number of SNPs obtained with ALT: 28.8% (RG-STA/STA) and 11.4% (RG-ALT/ALT) for Manila clam with (Table S1) and 46.5% (RG-STA/STA) and 32.9% (RG-ALT/ALT) for brown trout (Table S3). Both de novo building-loci pipelines performed similarly when compared with reference genome (RG) approach genotyping (RG-STA and RG-ALT genotype comparisons; Table 2).

Table 2 Genotypic differences between shared SNPs from the different pipelines

The population parameters evaluated (e.g. diversity levels, global FST) were roughly similar when using the different SNP panels in each species, showing higher differences in FIS values (Table 1). The most notable differences would be found in the FIS values and when comparing de novo and reference genome approaches in brown trout, especially regarding Hardy-Weinberg tests and related population parameters (i.e. FIS, Ho vs He; Table S4). Here, unlike Manila clam, the proportion of SNPs with extremely low FIS (≤ − 0.5) greatly differed between both approaches. The structure patterns obtained using STRUCTURE and DAPC analyses (see “Methods” section) were similar between both approaches (Figs. S5-S9). The highest global FST values among populations were found in brown trout and silver catfish, as expected, and no different interpretations among panels could be extracted. Some minor discrepancies in the number of suggestive outliers among panels were detected. All suggestive outliers detected showed a positive α-value suggesting diversifying selection. The complete set of population parameters is provided in Supplementary Information (Tables S1-S5).

Discussion

In the last decade, the binomial NGS / RAD-seq has been the choice for genomic screening in many studies due to the vast number of genetic markers identified and genotyped in a single step. In this context, species with low genomic information have been targeted for population genomics and evolutionary studies broadening the opportunities for more refined approaches regarding conservation genetics and breeding programs. Nevertheless, the effect of genomic architecture, genetic diversity, and population structure into the outcomes of these techniques (number of SNPs, genotyping confidence) in the target species are essential issues to be addressed using both with simulation and real data approaches. These issues are not only important for the wet-lab protocols, but also for the bioinformatic pipelines to be used to analyse the huge amount of data produced. Technical decisions on the reduced-representation method and restriction enzymes selection to be applied when constructing libraries are critical. When a reference genome is available the number of potential loci obtained with different restriction enzymes (e.g. using ExtractSites.pl https://github.com/Eli-Meyer/2bRAD_utilities) and the uniqueness of RAD loci should be tested (e.g. using EvalFrags.pl from 2bRAD_utilities). For instance, in Manila clam was predicted that the percentage of the genome constituted by repetitive elements and combined transposable elements could exceed 70% [32]. This genomic information should be considered to make the best technical decisions. Without reference genome, different REs can be tested if the budget allows it (see Box 1 in [8]), to improve the percentage of reads to build up confident loci. In the same way, the different performances of software and bioinfomatic pipelines might depend on the species and its genomics context. These can affect not only the number of markers found, but more importantly, the biological conclusions drawn. A repertoire of bioinformatic publications to manage the large amount of genomic data at different stages (e.g. building-loci pipelines, SNP filtering steps) has been published to serve as guidelines for researchers with limited experience in the field and advices for bioinformatic “Gordian Knots”.

The panels used in this study came from species that differ in their genomic architecture, polymorphism and population structure, and these factors could influence the results obtained also depending on the different population parameters used (e.g. global FST, allelic richness). Nevertheless, the results obtained in our study within species using the four de novo panels (i.e. STA, ALT, COM, MER) were roughly similar for all the population parameters evaluated. Accordingly, biological inferences would hardly change. Minor differences were found in the number of suggestive outliers detected in common cockle. In this case, the number of suggestive outliers detected could be related to the total number of SNPs of each panel. But beyond this observation, our results suggest that whatever the pipeline chosen similar results are obtained with a de novo approach. The number of initial and final SNPs obtained with reference genome was lower than obtained with a de novo approach. This would be related to the high number of input reads removed by duplicate or multiple alignment to the reference genomes due to the short length of 2b-RAD reads before going to the building-loci pipeline. Some population parameters between both approaches showed relevant differences (see FIS values in brown trout; Table S3).

A practical approach to decide between pipelines with different building-loci strategies for handling Genotyping-by-sequencing (GBS) data is to assay trials with a small subset of data and check for their results using a meaningful set of population parameters, previously selected according to the objectives of the study. Indeed, using the number of SNPs obtained as the main criterion [17] to decide the best building-loci pipeline to be used is not advisable, since a higher number of SNPs does not necessarily indicate a better stacking and confident RAD-seq data [10], and consequently, it might have a negative impact on the confidence of results and biological inferences. The initial number of SNPs obtained with STA and ALT pipelines across the different species tested was rather similar except for the brown trout and the small-spotted catshark. These species showed the lowest median coverage, near to the selected threshold coverage filter (8x) hence the differences on the number of putative loci from input data. While STACKS 2, with a de novo approach, starts with individual data demanding a number of identical reads to build a locus (see Methods), Meyer’s 2b-RAD v2.1 works with a combined subset of confident reads from all samples to build a global reference panel to which align every read. In cases with low coverage, a less demanding criterion to build loci can produce large differences in the initial number of SNPs. Nevertheless, through the filtering steps, the SNP number from building-loci pipelines converged in both species and importantly, the population parameters between SNPs panels were similar. Once chosen the pipeline, it would be recommendable to run several trials with different parameters to properly adjust them to the dataset. For instance, −M in STACKS (which defines the maximum nucleotide differences allowed between intraindividual putative loci) depends on the levels of polymorphism of the species and -m in STACKS on the existing coverage [23]. In the same way as for the choice of the building-loci pipeline, it would be advisable to choose parameters taking into account the results from population outcomes, since there is not a unique pipeline suited to every situation, as already indicated [21].

After the building-loci pipeline, it is important to adjust filtering (criteria and order [9]) according to the particular scenario of each species (e.g. sequencing and genotyping errors, duplicated loci [42]). Since the filtering parameters are dataset dependent [43], the filtering criteria should be adjusted accordingly (e.g. the stringency of MAC filtering step is sample size dependent). For instance, the number of SNPs was markedly reduced through filtering steps and the highest difference in the percentage of retained SNPs was found among species. In the study by O’Leary et al. [9] the percentage of retained SNPs ranged from 0 to 63% using the same filtering pipeline with four marine fish species. In our study, the three SNP/RAD-locus filter used to avoid inconsistent RAD-loci could not work well for highly polymorphic species or taxa (e.g. bivalves). Furthermore, the POP filter (i.e. 60% call rate per population) could be applied not so stringently since in previous studies qualitative interpretations of population parameters were maintained in most cases [22, 25], and sometimes even improved [44]. Notwithstanding, the drawback could be using larger SNP panels for similar information. If well, for some studies it is fundamental to achieve the highest density of SNPs possible (e.g. linkage disequilibrium, outlier detection and gene mining, Genome-wide association study; GWAS). The biggest difference between pipelines was found with the MAC filtering step in brown trout which could be explained by the higher average of missing genotypes per SNP (MAC is sample size dependant) and the lower coverage per RAD-locus (misclassification of heterozygotes) in the ALT pipeline. Finally, more filtering steps might be necessary, especially when working without a reference genome (e.g. FIS SNP filtering step when paralogs or null alleles can be a problem to avoid misinterpretations); this is the case of HWE deviations in brown trout caused by potential paralogs whose impact can be reduced using a reference genome.

Attention should be paid to the order of the different filtering steps because this can alter the final SNP panel. When adjusting the filtering parameters, it would be advisable to consider not exclusively the number of removed SNPs at each step separately, since they could result from the interaction among filtering steps. For instance, the coverage filter determines the increase of missing data which influences the percentage of SNPs eliminated by MAC and population representation filters, according to the stringency of the coverage threshold used. Furthermore, missing data may be due to a lower coverage than the selected threshold or for not being genotyped by the building-loci software with the genotyping options selected (e.g. previously selected nucleotide frequencies range to genotype in ALT pipeline). We found that the last could be the main source of COM SNPs genotyping differences between both building-loci pipelines excluding Manila clam. This means that the ALT pipeline genotyping parameter should be improved by choosing appropriate ranges for each species. The objective of any filtering strategy is removing SNPs that are not reliable without losing informative SNPs. Different factors can influence the filtering criteria, e.g. to achieve the number of SNPs required to meet the research goals. In this sense, a panel made up with markers found by two different pipelines should ensure reliability. It was found that 67% of SNPs from STACKS panel were common with UNEAK panel, using a de novo approach in soybean (Glycine max L.) data [21]. With reference genome the overlap percentages among STACKS and other building-loci pipelines ranged from 76 to 96% [21]. Using a reference genome approach the percentages of shared SNPs between STACKS with SAMtools and GATK ranged from 7.3 to 71.4% [25]. The lowest values could be partially explained because STACKS panel recruited many more SNPs than the other building-loci pipeline. The lowest percentage of COM SNPs taking STA panel as genotyping reference in our study (i.e. 23.9% in small-spotted catshark and 43.0% in Manila clam) were in panels with less than 1000 SNPs. These low values may be explained by a strong filtering effect, on shared SNPs between pipelines. The highest number of COM SNPs were detected when STA panels included the highest number of SNPs, around 74% in brown trout and 81% in silver catfish. Despite including a lower number of SNPs, the COM panels provided roughly similar results to the larger ones. This suggests that most informative markers are retained downstream, with the advantage of working with a reduced panel that can simplify and speed-up analyses. In the study by Díaz-Arce et al. [10] the possible effect of SNP number on FST estimation using reduced SNPs subsets was tested and similar values regarding the full panel were obtained. Moreover, estimated genotyping accuracy may be higher with SNPs shared by more than one building-loci pipeline according to Torkamaneh et al. [21]. The impact of genotypic differences between shared SNP panels was low, such as those obtained by Wright et al. [25].

Summarizing, the results obtained suggest that both building-loci pipelines are adequate and provide more confident results adjusting parameters and SNP filtering steps to the research context. Despite the differences observed in the number of SNPs among de novo approach panels, this seems not to affect dramatically the conclusions, at least in the biological scenarios managed in this study. When there is no reference genome, a COM panel could be interesting in terms of SNP panel consistency with species with high genomic complexity. In a general way for some population parameters, to have less SNPs do not imply loss of biological information and a COM panel could increase data reliability in these cases. In the case of choosing this option differences in genotyping between pipelines should be checked, although in this study genotyping differences between pipelines were infrequent. The main source of genotyping differences in COM SNP panels was missing data and had different sources. On one hand, different genotypes can be obtained due to the different building-loci pipeline parameters to call genotypes (e.g. --alpha at STACKS) and the different alignment strategies (e.g. reads with alternative allele can be stacked into another putative loci). For missing genotype differences, it should be considered that even RAD loci showing high coverage, might have missing data if building-loci pipeline parameters involved in genotyping are not properly set up. Furthermore, small differences in the building-loci pipeline could have more influence in the number of missing genotypes when working with low coverage loci. Anyway, it would be advisable to use a few intra-library and inter-library sample-replicates to estimate genotyping errors [45] to increase the confidence in our data, especially if the RAD-seq libraries are designed with low estimated coverage per locus (e.g. around 10x), due to the impact of coverage in genotyping error rates [46].

Conclusion

The results here obtained on the diverse case studies analysed show that selected building-loci pipeline did not have a substantial influence on the estimation of population parameters and derived biological interpretations. The slight differences in the results here obtained between some de novo and reference genome derived panels could be solved by improving SNP filtering steps. Anyway, our results cannot be generalized and users should contextualise building-loci pipelines to their particular species and population genomics scenarios. One recommendation would be to test building-loci parameters and filtering SNP steps with subset of samples to save hardware resources and computation time. The best parameter set would be those leading to consistent results obtained across different replicates and should be taken using population parameters consistent with the research goals of the study. Despite being time-consuming, this preliminary testing could enhance the robustness of results through improving the bioinformatic tools applied.

Methods

Samples studied

All datasets used in the present study are own resources obtained from previous research carried out by the authors. Four Manila clam (R. philippinarum) localities, three from the Adriatic Sea (Italy: Chioggia, N = 30; Porto Marghera, N = 30 [47]; and Po river mouth N = 25) and one from the Atlantic Ocean (Spain: Galicia, N = 25), were studied. Four common edible cockle (C. edule) localities from the European Atlantic area (Somme Bay, France, N = 30; Campelo, Spain, N = 30; Miño, Spain, N = 30; and Ría Formosa, Portugal, N = 30) were used from regions with extractive cockle activities [48]. Three localities of brown trout (Salmo trutta) from Duero River basin in the Iberian Peninsula (Águeda, N = 15; Omaña, N = 20; and Pisuerga, N = 17), two of them representing different mitochondrial pure native lineages (Atlantic and Duero, [49, 50]) and one from the hybrid zone (Omaña [51];), were evaluated. Two localities of silver catfish (R. quelen), a Neotropical freshwater species distributed from the Northeast of Los Andes to the centre of Argentina and living in fluvial and coastal lagoon environments [52], were analysed. These samples came from Sauce Lagoon (N = 10) and Uruguay River Basin (N = 11) belonging to two divergent lineages [30]. Finally, two nearby localities without genetic differentiation of small-spotted catshark (S. canicula) from North Sea (N = 13) and Irish Sea (N = 15) were analysed [31]. All information from samples analysed is summarized in supplementary material (Table S6).

Library preparation

DNA extraction and 2b-RAD libraries preparation using AlfI IIB RE followed the same protocol except for small-spotted catshark [31] where CspCI IIB RE was used instead. The libraries were sequenced using Illumina sequencing platforms (i.e. HiSeq 1500 for small-spotted catshark, HiSeq 2500 for Manila clam and NextSeq500 for the remaining species) following a 50 bp single-end chemistry. For details see [30, 31].

Bioinformatic analysis

Building-loci pipelines: background

STACKS 2.0 and Meyer’s 2b-RAD v2.1 were the building-loci pipelines chosen for comparing their performances using a de novo approach within the broad genome and population genetics species scenarios selected. Meyer’s 2b-RAD v2.1 pipeline and STACKS building-loci pipelines have some similarities on their strategy; roughly, both are based on stacking reads into putative loci by sequence similarity, assuming that each locus corresponds to a single place in the species genome. Nevertheless, there are many differences on how loci are built and how the user can control the existing options and genotyping strategies. STACKS, with a de novo approach, works firstly at the individual level demanding a number of identical reads to build a locus (i.e –m parameter in ustacks), while the 2b-RAD v2.1 pipeline works with a combined subset of samples to build a global reference panel to which align every read. For genotyping, STACKS uses a chi square test to call a heterozygote or a homozygote (i.e. –alpha and --gt-alpha), whereas nucleotide frequencies based on allele read depth at each position and sample are used for genotyping in the 2b-RAD v2.1 pipeline. Accordingly, huge differences in the raw SNP number were obtained in preliminary analysis in our study. Anyway, we tried to apply the highest number of common parameters in both pipelines to be consistent among comparisons.

STACKS 2.0 pipeline can be summarized as follows: (1) Raw sequence reads were demultiplexed and filtered according to different criteria such as quality, uncalled bases and read length (process_radtags); (2) reads from each individual were clustered into putative loci, and polymorphic nucleotide sites identified (ustacks); (3) putative loci were grouped across individuals and catalogues of RAD-loci, SNPs and alleles were constructed (cstacks); (4) putative loci from each individual were matched against the catalogue (sstacks); (5) the data were transposed to be oriented by locus, instead of by sample (tsv2bam); (6) all individuals were genotyped at each called SNP (gstacks); and (7) SNPs were finally subjected to population genetics filters (populations) and results written in different output files (e.g. GENEPOP [53], STRUCTURE [54] file formats).

The de novo approach used for the 2b-RAD v2.1 pipeline can be summarized as follows: (1) from a subset of high quality reads (Phred quality scores ≥30 at all positions) from all samples, a global de novo reference catalogue was built by clustering these reads with the BuildRef.pl script and cd-hit 4.6.8 [55, 56], (2) every read was aligned against the reference catalogue using a mapping program (in our case Bowtie 1.1.2 [57]); (3) allele frequencies were counted at each position (SAMBasecaller.pl) and genotypes determined from that information (NFGenotyper.pl); and (4) genotypes called across samples were combined into a single genotype matrix with samples as columns and loci as rows (CombineGenotypes.pl). Perl scripts belonging to the last version are publicly available (https://github.com/Eli-Meyer/2brad_utilities/) and earlier versions on request.

A reference genome-based pipeline was used for Manila clam and brown trout, the species with chromosome assembly level reference genomes. In this case, the differences in the pipelines of STACKS 2 and Meyer’s 2b-RAD v2.1 are lower. For instance, STACKS 2 reduces the number of modules necessary from six to two when using reference genome and the building of putative loci is conditioned by the step using a short-read aligner. Our goal was not to compare this option between pipelines, but to take as reference the genome-based approach to be compared with the de novo approach as a gold standard within each pipeline.

Building-loci pipelines: analysis

After demultiplexing raw data, several filtering criteria were applied: (1) all reads were trimmed and filtered by the RE recognition site to retain only those sequences of 36 bp (or 32 bp from CII RE catshark) centred on the RE recognition site using our own Perl scripts and Trimmomatic 0.38 [58]; and (2) process_radtags (module belonging to STACKS) was used to remove reads with uncalled bases (−c option). Other parameters were species-specific (e.g. window sliding size, −w; score limit, −see Table S7A). To check raw and filtered reads quality, FastQC 0.11.7 [59] was used. STACKS input sequences were oriented in the same orientation using our own Perl script to avoid oversplitting. To say, the set of input reads for both building-loci pipelines in each species was always the same.

At the building-loci pipeline step, the parameters considered were: (1) minimum number of identical reads to create a stack (default values were used); (2) maximum number of mismatches between RAD loci within and between individuals (−M 2/−n 2 for fish or -M 3/−n 3 for bivalves and their analogous parameters with ALT pipeline); (3) indels were discarded (i.e. --disable-gapped at different modules); (4) SNP calling model and alpha cut-off: their default values were used in the STA pipeline (STACKS 2.0), while in the ALT pipeline (Meyer’s 2b-RAD v2.1) we considered a range of 0.1–0.2 to determine the genotype at each position (default values are 0.01–0.25); to say, when the frequency of the less frequent allele was lower than 0.1 the genotype was called as homozygous while frequencies higher than 0.2 were called as heterozygous; intermediate allele frequencies for the less frequent allele were called as uncertain (see Tables S7B and S7C).

When a reference genome was available, Bowtie 1.1.2 was used as short read aligner. The number of mismatches allowed between reads and the reference genome using the -v alignment mode was 2 mismatches for brown trout and 3 mismatches for Manila clam, the same as mentioned above. Only reads which aligned to a single site in the reference genome were considered (−m 1). The same parameter values were used in STACKS modules shared between reference-based genome and de novo approaches (i.e. gstacks and populations modules).

SNP filtering steps and creation of datasets

The raw SNP panels of the STA and ALT pipelines were filtered using the same parameters for consistency, retaining only a set of markers and alleles represented across the individuals genotyped. Filters were applied in the same order for each dataset (Fig. 1), as recommended [9]: i) SNPs with more than two alleles were excluded (BIAL filter); ii) a minimum locus coverage of eight reads was chosen (Min coverage filter); iii) RAD-loci with more than three SNPs were excluded for further analyses; iv) SNPs were retained only if the less frequent allele was represented at least three times at the whole species sample (MAC, minimum allele count, filter); v) less than 40% of missing data in each population for a SNP to be retained (POP filter); vi) SNPs were excluded when they did not adjust to Hardy-Weinberg (HW) expectations (P < 0.01) in more than half of the populations analysed (HW filter); and vii) only the first SNP per RAD-locus was retained when several SNPs were called in the same RAD-locus to avoid redundant information.

According to the aforementioned criteria, four SNP panels were tested and compared: (1) STA and (2) ALT SNP panels were further used to obtain (3) common (COM) and (4) merged (MER) panels. When reference genome was available three additional SNP panels were obtained (i.e. RG, RG-STA, RG-ALT). RAD-loci of these panels from both pipelines were compared to identify shared and private RAD-loci. In order to do this, cd-hit-est was used to cluster similar RAD-loci taken from the two pipeline catalogues with the same threshold of similarity (−c) used in the clustering steps of building-loci pipelines (i.e. two mismatches maximum for fish species and three mismatches maximum for bivalve species). Furthermore, we used a -g value of 1 when clustering sequences to meet the established similarity threshold and a band alignment width (−b) of 1 to avoid previously separated sequences due to indels at specific clusters. This procedure rendered COM and MER SNP panels created using a customized Perl script (see Supplementary material). SNPs at shared RAD-loci between STA and ALT panels were selected after the sixth filtering step, since the first SNP at shared RAD-loci could be different after the full filtering pipeline (Fig. 1). MER panel was finally created by taking the SNPs from “private” ALT and STA pipelines RAD-loci (i.e. those from cd-hit-est clusters with only STA or ALT pipelines RAD-tags) plus the COM SNP panel previously obtained. Again, only the first SNP per RAD locus was retained to avoid redundant information. The GENEPOP files of all shared SNP panels were compared to quantify their genotyping differences using own Perl script (see Additional file 3).

Comparison of outputs and population genetics analyses

The results of the aforementioned pipelines were compared using both different quantitative (i.e. number of SNPs) and population genomics metrics. Filtered GENEPOP files were obtained using customized Perl scripts. These files were transformed for subsequent analyses using the PGDSPIDER 2.1.1.5 software [60]. Firstly, the consequences of filtering over the number of RAD-loci and SNPs were evaluated for each combination of species-pipeline; secondly, common, and private RAD-loci/SNPs between the two pipelines were obtained for each species. Finally, biological interpretations from each pipeline/species were compared, including basic population genetics results (i.e. genetic diversity levels and population structure).

Observed and expected heterozygosity (Ho and He, respectively), inbreeding coefficient (FIS, using 1000 bootstrap iterations to estimate their 95% confidence intervals) and allelic richness were calculated per population using R’s package diveRsity [61]. Global FST calculation and HW tests were performed with GENEPOP R package [53]. STRUCTURE software [54], using R package ParallelStructure [62], was used to define the most likely number of population units (K) present with LOCPRIOR model with correlated allele frequencies model, testing K values from 1 to the number of sampling localities in the species dataset + 1 with 10 replicates composed by 100,000 Markov chain Monte Carlo (MCMC) replicates and a burn-in period of 10,000 steps. STRUCTURE results were parsed with STRUCTURE HARVESTER [63], which implements the Evanno’s method [64] to detect the most likely number of clusters according to the data. CLUMPAK [65] was used to merge runs with the same K that suggested similar patterns of structuring and to obtain cluster membership plots. As a second approach to detect population structure, a Discriminant Analysis of Principal Components (DAPC) analysis was performed based on genetic data, as implemented in R package ADEGENET [66, 67]. The optimal number of principal components to be used was estimated with the cross-validation method implemented in the package and from one to three discriminant components were retained according to the amount of population structure variation they explained. Finally, outlier loci potentially under selection (OL), i.e. those showing higher or lower differentiation values (i.e. FST) across populations than the neutral background, were detected using the Bayesian approach implemented in BAYESCAN 2.1 [68] with default parameters. Loci with a Log10 posterior odds (PO) higher than 1.5 were retained as potential outliers for later comparison among the four datasets resulting from the two pipelines.

Availability of data and materials

Biological and genomic information (including access to NCBI sequence read archive (SRA) database) for the different species used in the current study is available in the NCBI Bioproject database (ID: PRJNA701896; https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/bioproject/701896). Reference genome sequences were downloaded from the NCBI genome assembly website for brown trout (https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/assembly/GCF_901001165.1/) and Manila clam (https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/assembly/GCA_009026015.1/). Custom Perl scripts and example files corresponding with the Additional file 3 can be downloaded from the GitHub website (https://github.com/abhortas/USC-RAD-seq-scripts).

Abbreviations

ALT:

Alternative pipeline de novo panel

ALT pipeline:

Meyer’s 2b-RAD v2.1 pipeline with the selected parameters at this research

COM:

Common SNP panel

DAPC:

Discriminant analysis of principal components

ddRAD:

Double digest restriction-site associated DNA

Gbp:

Giga base pairs; GBS: Genotyping-by-sequencing

GWAS:

Genome-wide association study

He :

Expected heterozygosity

Ho :

Observed heterozygosity

HW:

Hardy-Weinberg

MAC:

Minimum allele count

Mya:

Million years ago

MER:

Merged SNP panel

NGS:

Next-generation sequencing

OL:

Outlier Loci

PCoA:

Principal Coordinate Analysis

PO:

Posterior odds

RAD-seq:

Restriction site-associated DNA sequencing

REs:

Restriction enzymes

SNPs:

Single nucleotide polymorphisms

RG:

Reference genome SNP panel

RG-STA:

Shared SNP panel between RG and STA panels

RG-ALT:

Shared SNP panel between RG and ALT panels

STA:

STACKS de novo panel

STA pipeline:

STACKS 2.0 pipeline with the parameters selected at this research

WGD:

Whole Genome Duplications.

References

  1. 1.

    Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, et al. A tale of three next generation sequencing platforms: comparison of ion torrent, pacific biosciences and illumina MiSeq sequencers. BMC Genomics. 2012;13:341. https://0-doi-org.brum.beds.ac.uk/10.1186/1471-2164-13-341.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  2. 2.

    Wetterstrand KA. DNA Sequencing Costs: Data | NHGRI. 2020. https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data. Accessed 1 July 2020.

  3. 3.

    Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17:333–51. https://0-doi-org.brum.beds.ac.uk/10.1038/nrg.2016.49.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  4. 4.

    Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, et al. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One. 2008;3(10):e3376. https://0-doi-org.brum.beds.ac.uk/10.1371/journal.pone.0003376.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  5. 5.

    Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML. Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat Rev Genet. 2011;12:499–510. https://0-doi-org.brum.beds.ac.uk/10.1038/nrg3012.

    CAS  Article  PubMed  Google Scholar 

  6. 6.

    Andrews KR, Good JM, Miller MR, Luikart G, Hohenlohe PA. Harnessing the power of RADseq for ecological and evolutionary genomics. Nat Rev Genet. 2016;17:81–92. https://0-doi-org.brum.beds.ac.uk/10.1038/nrg.2015.28.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  7. 7.

    Wang S, Meyer E, McKay JK, Matz MV. 2b-RAD: a simple and flexible method for genome-wide genotyping. Nat Methods. 2012;9:808–10. https://0-doi-org.brum.beds.ac.uk/10.1038/nmeth.2023.

    CAS  Article  PubMed  Google Scholar 

  8. 8.

    Barbanti A, Torrado H, Macpherson E, Bargelloni L, Franch R, Carreras C, et al. Helping decision making for reliable and cost-effective 2b-RAD sequencing and genotyping analyses in non-model species. Mol Ecol Resour. 2020;20:795–806. https://0-doi-org.brum.beds.ac.uk/10.1111/1755-0998.13144.

    CAS  Article  Google Scholar 

  9. 9.

    O’Leary SJ, Puritz JB, Willis SC, Hollenbeck CM, Portnoy DS. These aren’t the loci you’re looking for: principles of effective SNP filtering for molecular ecologists. Mol Ecol. 2018;27:3193–206. https://0-doi-org.brum.beds.ac.uk/10.1111/mec.14792.

    Article  Google Scholar 

  10. 10.

    Díaz-Arce N, Rodríguez-Ezpeleta N. Selecting RAD-Seq data analysis parameters for population genetics: the more the better? Front Genet. 2019;10:533. https://0-doi-org.brum.beds.ac.uk/10.3389/fgene.2019.00533.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  11. 11.

    Huber M. Compendium of bivalves. A full-color guide to 3,300 of the World’s marine bivalves. A status on Bivalvia after 250 years of research. Hackenheim: ConchBooks; 2010.

    Google Scholar 

  12. 12.

    Fricke R, Eschmeyer W, Fong JD. CAS - Eschmeyer’s catalog of fishes - species by family. 2020.http://researcharchive.calacademy.org/research/ichthyology/catalog/SpeciesByFamily.asp. Accessed 22 Nov 2020.

    Google Scholar 

  13. 13.

    Lewin HA, Robinson GE, Kress WJ, Baker WJ, Coddington J, Crandall KA, et al. Earth BioGenome project: sequencing life for the future of life. Proc Natl Acad Sci. 2018;115:4325–33. https://0-doi-org.brum.beds.ac.uk/10.1073/pnas.1720115115.

    CAS  Article  PubMed  Google Scholar 

  14. 14.

    Rochette NC, Catchen JM. Deriving genotypes from RAD-seq short-read data using stacks. Nat Protoc. 2017;12:2640–59. https://0-doi-org.brum.beds.ac.uk/10.1038/nprot.2017.123.

    CAS  Article  PubMed  Google Scholar 

  15. 15.

    Catchen J, Hohenlohe PA, Bassham S, Amores A, Cresko WA. Stacks: an analysis tool set for population genomics. Mol Ecol. 2013;22(11):3124–40. https://0-doi-org.brum.beds.ac.uk/10.1111/mec.12354.

    Article  PubMed  PubMed Central  Google Scholar 

  16. 16.

    Catchen JM, Amores A, Hohenlohe P, Cresko W, Postlethwait JH. Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3. 2011;1(3):171–82. https://0-doi-org.brum.beds.ac.uk/10.1534/g3.111.000240.

    CAS  Article  PubMed  Google Scholar 

  17. 17.

    Puritz JB, Hollenbeck CM, Gold JR. dDocent: a RADseq, variant-calling pipeline designed for population genomics of non-model organisms. PeerJ. 2014;2:e431. https://0-doi-org.brum.beds.ac.uk/10.7717/peerj.431.

    Article  PubMed  PubMed Central  Google Scholar 

  18. 18.

    Torkamaneh D, Laroche J, Bastien M, Abed A, Belzile F. Fast-GBS: a new pipeline for the efficient and highly accurate calling of SNPs from genotyping-by-sequencing data. BMC Bioinformatics. 2017;18:1–7. https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-016-1431-9.

    CAS  Article  Google Scholar 

  19. 19.

    Glaubitz JC, Casstevens TM, Lu F, Harriman J, Elshire RJ, Sun Q, et al. TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline. PLoS One. 2014;9(2):e90346. https://0-doi-org.brum.beds.ac.uk/10.1371/journal.pone.0090346.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  20. 20.

    Eaton DAR, Overcast I. ipyrad: Interactive assembly and analysis of RADseq datasets. Bioinformatics. 2020. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btz966.

  21. 21.

    Torkamaneh D, Laroche J, Belzile F. Genome-wide SNP calling from genotyping by sequencing (GBS) data: a comparison of seven pipelines and two sequencing technologies. PLoS One. 2016;11(8):e0161333. https://0-doi-org.brum.beds.ac.uk/10.1371/journal.pone.0161333.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  22. 22.

    Shafer ABA, Peart CR, Tusso S, Maayan I, Brelsford A, Wheat CW, et al. Bioinformatic processing of RAD-seq data dramatically impacts downstream population genetic inference. Methods Ecol Evol. 2017;8:907–17. https://0-doi-org.brum.beds.ac.uk/10.1111/2041-210X.12700.

    Article  Google Scholar 

  23. 23.

    Paris JR, Stevens JR, Catchen JM. Lost in parameter space: a road map for stacks. Methods Ecol Evol. 2017;8:1360–73. https://0-doi-org.brum.beds.ac.uk/10.1111/2041-210X.12775.

    Article  Google Scholar 

  24. 24.

    Sovic MG, Fries AC, Gibbs HL. AftrRAD: a pipeline for accurate and efficient de novo assembly of RADseq data. Mol Ecol Resour. 2015;15:1163–71. https://0-doi-org.brum.beds.ac.uk/10.1111/1755-0998.12378.

    CAS  Article  PubMed  Google Scholar 

  25. 25.

    Wright B, Farquharson KA, McLennan EA, Belov K, Hogg CJ, Grueber CE. From reference genomes to population genomics: comparing three reference-aligned reduced-representation sequencing pipelines in two wildlife species. BMC Genomics. 2019;20:453. https://0-doi-org.brum.beds.ac.uk/10.1186/s12864-019-5806-y.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  26. 26.

    Martínez L, Freire R, Arias-Pérez A, Méndez J, Insua A. Patterns of genetic variation across the distribution range of the cockle Cerastoderma edule inferred from microsatellites and mitochondrial DNA. Mar Biol. 2015;162:1393–406. https://0-doi-org.brum.beds.ac.uk/10.1007/s00227-015-2676-y.

    CAS  Article  Google Scholar 

  27. 27.

    Vera M, Carlsson J, El Carlsson J, Cross T, Lynch S, Kamermans P, et al. Current genetic status, temporal stability and structure of the remnant wild European flat oyster populations: conservation and restoring implications. Mar Biol. 2016;163:239. https://0-doi-org.brum.beds.ac.uk/10.1007/s00227-016-3012-x.

    Article  Google Scholar 

  28. 28.

    Leitwein M, Guinand B, Pouzadoux J, Desmarais E, Berrebi P, Gagnaire PA. A Dense Brown Trout (Salmo trutta) Linkage Map Reveals Recent Chromosomal Rearrangements in the Salmo Genus and the Impact of Selection on Linked Neutral Diversity. G3. 2017;7:1365–76. https://0-doi-org.brum.beds.ac.uk/10.1534/g3.116.038497.

    Article  PubMed  Google Scholar 

  29. 29.

    Ferguson A. Genetic differences among brown trout, Salmo trutta, stocks and their importance for the conservation and management of the species. Freshw Biol. 1989;21:35–46.

    Article  Google Scholar 

  30. 30.

    Ríos N, Casanova A, Hermida M, Pardo BG, Martínez P, Bouza C, et al. Population genomics in Rhamdia quelen (Heptapteridae, siluriformes) reveals deep divergence and adaptation in the neotropical region. Genes. 2020;11:109. https://0-doi-org.brum.beds.ac.uk/10.3390/genes11010109.

    CAS  Article  PubMed Central  Google Scholar 

  31. 31.

    Manuzzi A, Zane L, Muñoz-Merida A, Griffiths AM, Veríssimo A. Population genomics and phylogeography of a benthic coastal shark (Scyliorhinus canicula) using 2b-RAD single nucleotide polymorphisms. Biol J Linn Soc. 2018;126:289–303. https://0-doi-org.brum.beds.ac.uk/10.1093/biolinnean/bly185.

    Article  Google Scholar 

  32. 32.

    Yan X, Nie H, Huo Z, Ding J, Li Z, Yan L, et al. Clam Genome Sequence Clarifies the Molecular Basis of Its Benthic Adaptation and Extraordinary Shell Color Diversity. iScience. 2019;19:1225–37. https://0-doi-org.brum.beds.ac.uk/10.1016/j.isci.2019.08.049.

    Article  PubMed  PubMed Central  Google Scholar 

  33. 33.

    Salmo trutta assembly (NCBI). https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/assembly/GCF_901001165.1. Accessed on date 26 July 2020.

  34. 34.

    Wajid B, Serpedin E. Do it yourself guide to genome assembly. Brief Funct Genomics. 2016;15:1–9. https://0-doi-org.brum.beds.ac.uk/10.1093/bfgp/elu042.

    CAS  Article  PubMed  Google Scholar 

  35. 35.

    Wang Y, Guo X. Chromosomal rearrangement in pectinidae revealed by rRNA loci and implications for bivalve evolution. Biol Bull. 2004;207(3):247–56. https://0-doi-org.brum.beds.ac.uk/10.2307/1543213.

    CAS  Article  PubMed  Google Scholar 

  36. 36.

    Takeuchi T, Koyanagi R, Gyoja F, Kanda M, Hisata K, Fujie M, et al. Bivalve-specific gene expansion in the pearl oyster genome: implications of adaptation to a sessile lifestyle. Zool Lett. 2016;2:3. https://0-doi-org.brum.beds.ac.uk/10.1186/s40851-016-0039-2.

    Article  Google Scholar 

  37. 37.

    Curole JP, Hedgecock D. Bivalve Genomics: Complications, Challenges, and Future Perspectives. In: Liu Z(J), editor. Aquaculture Genome Technologies. Oxford: Blackwell Publishing Ltd; 2007. p. 525–43.

    Chapter  Google Scholar 

  38. 38.

    Pasquier J, Cabau C, Nguyen T, Jouanno E, Severac D, Braasch I, et al. Gene evolution and gene expression after whole genome duplication in fish: the PhyloFish database. BMC Genomics. 2016;17:368. https://0-doi-org.brum.beds.ac.uk/10.1186/s12864-016-2709-z.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  39. 39.

    Macqueen DJ, Johnston IA. A well-constrained estimate for the timing of the salmonid whole genome duplication reveals major decoupling from species diversification. Proc R Soc B Biol Sci. 2014;281:1778. https://0-doi-org.brum.beds.ac.uk/10.1098/rspb.2013.2881.

    Article  Google Scholar 

  40. 40.

    Berthelot C, Brunet F, Chalopin D, Juanchich A, Bernard M, Noël B, et al. The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nat Commun. 2014;5:2. https://0-doi-org.brum.beds.ac.uk/10.1038/ncomms4657.

    Article  Google Scholar 

  41. 41.

    Donoghue PCJ, Purnell MA. Genome duplication, extinction and vertebrate evolution. Trends Ecol Evol. 2005;20(6):312–9. https://0-doi-org.brum.beds.ac.uk/10.1016/j.tree.2005.04.008.

    Article  PubMed  Google Scholar 

  42. 42.

    Benestan LM, Ferchaud AL, Hohenlohe PA, Garner BA, Naylor GJP, Baums IB, et al. Conservation genomics of natural and managed populations: building a conceptual and practical framework. Mol Ecol. 2016;25:2967–77. https://0-doi-org.brum.beds.ac.uk/10.1111/mec.13647.

    Article  PubMed  Google Scholar 

  43. 43.

    Hendricks S, Anderson EC, Antao T, Bernatchez L, Forester BR, Garner B, et al. Recent advances in conservation and population genomics data analysis. Evol Appl. 2018;11:1197–211. https://0-doi-org.brum.beds.ac.uk/10.1111/eva.12659.

    Article  PubMed Central  Google Scholar 

  44. 44.

    Hodel RGJ, Chen S, Payton AC, McDaniel SF, Soltis P, Soltis DE. Adding loci improves phylogeographic resolution in red mangroves despite increased missing data: comparing microsatellites and RAD-Seq and investigating loci filtering. Sci Rep. 2017;7:17598. https://0-doi-org.brum.beds.ac.uk/10.1038/s41598-017-16810-7.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  45. 45.

    Mastretta-Yanes A, Arrigo N, Alvarez N, Jorgensen TH, Piñero D, Emerson BC. Restriction site-associated DNA sequencing, genotyping error estimation and de novo assembly optimization for population genetic inference. Mol Ecol Resour. 2015;15:28–41. https://0-doi-org.brum.beds.ac.uk/10.1111/1755-0998.12291.

    CAS  Article  PubMed  Google Scholar 

  46. 46.

    Fountain ED, Pauli JN, Reid BN, Palsbøll PJ, Peery MZ. Finding the right coverage: the impact of coverage and sequence quality on single nucleotide polymorphism genotyping error rates. Mol Ecol Resour. 2016;16:966–78. https://0-doi-org.brum.beds.ac.uk/10.1111/1755-0998.12519.

    CAS  Article  PubMed  Google Scholar 

  47. 47.

    Milan M, Maroso F, Dalla Rovere G, Carraro L, Ferraresso S, Patarnello T, et al. Tracing seafood at high spatial resolution using NGS-generated data and machine learning: comparing microbiome versus SNPs. Food Chem. 2019;286:413–20. https://0-doi-org.brum.beds.ac.uk/10.1016/j.foodchem.2019.02.037.

    CAS  Article  PubMed  Google Scholar 

  48. 48.

    Maroso F, De Gracia CP, Iglesias D, Cao A, Díaz S, Villalba A, et al. A useful SNP panel to distinguish two cockle species, Cerastoderma edule and C. glaucum, co-occurring in some European beds, and their putative hybrids. Genes. 2019;10:760. https://0-doi-org.brum.beds.ac.uk/10.3390/genes10100760.

    CAS  Article  PubMed Central  Google Scholar 

  49. 49.

    Bouza C, Castro J, Sánchez L, Martínez P. Allozymic evidence of parapatric differentiation of brown trout (Salmo trutta L.) within an Atlantic river basin of the Iberian Peninsula. Mol Ecol. 2001;10:1455–69. https://0-doi-org.brum.beds.ac.uk/10.1046/j.1365-294X.2001.01272.x.

    CAS  Article  PubMed  Google Scholar 

  50. 50.

    Vera M, Cortey M, Sanz N, García-Marín JL. Maintenance of an endemic lineage of brown trout (Salmo trutta) within the Duero river basin. J Zool Syst Evol Res. 2010;48:181–7. https://0-doi-org.brum.beds.ac.uk/10.1111/j.1439-0469.2009.00547.x.

    Article  Google Scholar 

  51. 51.

    Martínez P, Bouza C, Castro J, Hermida M, Pardo BG, Sánchez L. Analysis of a secondary contact between divergent lineages of brown trout Salmo trutta L. from Duero basin using microsatellites and mtDNA RFLPs. J Fish Biol. 2007;71:195–213. https://0-doi-org.brum.beds.ac.uk/10.1111/j.1095-8649.2007.01551.x.

    CAS  Article  Google Scholar 

  52. 52.

    Perdices A, Bermingham E, Montilla A, Doadrio I. Evolutionary history of the genus Rhamdia (Teleostei: Pimelodidae) in Central America. Mol Phylogenet Evol. 2002;25:172–89. https://0-doi-org.brum.beds.ac.uk/10.1016/S1055-7903(02)00224-5.

    CAS  Article  PubMed  Google Scholar 

  53. 53.

    Rousset F. GENEPOP’007: a complete re-implementation of the GENEPOP software for windows and Linux. Mol Ecol Resour. 2008;8:103–6. https://0-doi-org.brum.beds.ac.uk/10.1111/j.1471-8286.2007.01931.x.

    Article  PubMed  Google Scholar 

  54. 54.

    Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–59.

    CAS  PubMed  PubMed Central  Google Scholar 

  55. 55.

    Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–2. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/bts565.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  56. 56.

    Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–9. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btl158.

    CAS  Article  Google Scholar 

  57. 57.

    Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.

    Article  Google Scholar 

  58. 58.

    Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–20. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btu170.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  59. 59.

    Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. Available online at: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

    Google Scholar 

  60. 60.

    Lischer HE, Excoffier L. PGDSpider: an automated data conversion tool for connecting population genetics and genomics programs. Bioinformatics. 2012;28:298–9. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btr642.

    CAS  Article  PubMed  Google Scholar 

  61. 61.

    Keenan K, Mcginnity P, Cross TF, Crozier WW, Prodöhl PA. DiveRsity: an R package for the estimation and exploration of population genetics parameters and their associated errors. Methods Ecol Evol. 2013;4:782–8. https://0-doi-org.brum.beds.ac.uk/10.1111/2041-210X.12067.

    Article  Google Scholar 

  62. 62.

    Besnier F, Glover KA. ParallelStructure: a R package to distribute parallel runs of the population genetics program STRUCTURE on multi-Core computers. PLoS One. 2013;8(7):e70651. https://0-doi-org.brum.beds.ac.uk/10.1371/journal.pone.0070651.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  63. 63.

    Earl DA. vonHoldt BM. STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conserv Genet Resour. 2012;4:359–61. https://0-doi-org.brum.beds.ac.uk/10.1007/s12686-011-9548-7.

    Article  Google Scholar 

  64. 64.

    Evanno G, Regnaut S, Goudet J. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol. 2005;14:2611–20. https://0-doi-org.brum.beds.ac.uk/10.1111/j.1365-294X.2005.02553.x.

    CAS  Article  Google Scholar 

  65. 65.

    Kopelman NM, Mayzel J, Jakobsson M, Rosenberg NA, Mayrose I. Clumpak: a program for identifying clustering modes and packaging population structure inferences across K. Mol Ecol Resour. 2015;15:1179–91. https://0-doi-org.brum.beds.ac.uk/10.1111/1755-0998.12387.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  66. 66.

    Jombart T. Adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics. 2008;24(11):1403–5. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btn129.

    CAS  Article  PubMed  Google Scholar 

  67. 67.

    Jombart T, Ahmed I. Adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics. 2011;27(21):3070–1. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btr521.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  68. 68.

    Foll M, Gaggiotti O. A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics. 2008;180:977–93. https://0-doi-org.brum.beds.ac.uk/10.1534/genetics.108.092221.

    Article  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

We acknowledge the bioinformatic support of the Centro de Supercomputación de Galicia (CESGA). We wish to thank L. Insua, M. Portela, S. Sánchez-Darriba and S. Gómez for their technical support. We thank Carlos Saavedra for Manila clam samples from Galicia.

Funding

The work undertaken in this project was funded by Xunta de Galicia Autonomous Government (GRC2014/010), Interreg Atlantic Area (Cockles project, EAPA_458/2016) and Girona University (MPCUdG2016/060) projects. Adrián Casanova was a Xunta de Galicia fellowship (ED481A-2017/091).

Author information

Affiliations

Authors

Contributions

MV, PM, AC, FM, AB and MH designed the study. AC, FM, AB, MH ran the different building-loci pipelines. AC and FM performed the after building-loci analysis. AB wrote Perl scripts. MV, PM, NR, GG, AM, LZ, AV, JLGM and CB acquired and managed genomic data here used that were provided by the different laboratories involved to gather a diverse case-study scenarios for the work. MV and PM supervised the study. AC wrote the initial version of the manuscript and the rest of authors (FM, AB, MH, NR, GG, AM, LZ, AV, JLGM, CB, MV and PM) made a substantive intellectual contribution to the analysis, discussion and interpretation of results, further aiding to writing, and editing of the final version of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Manuel Vera.

Ethics declarations

Ethics approval and consent to participate

All samples were previously used in other published articles or in the framework of European project COCKLES (EAPA_458/2016). All specimens were collected in accordance with all bioethics standards and legislation of the different country governments where sampling was performed. The samples were shared with us for analysis.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1

Supplementary tables and genotype differences at shared SNPs panel figures: Table S1. Number of SNPs and population metrics for Manila clam (Ruditapes philippinarum) samples (N = 110, four localities). Table S2. Number of SNPs and population metrics for common edible cockle (Cerastoderma edule) samples (N = 120, four localities). Table S3. Number of SNPs and population metrics for brown trout (Salmo trutta) samples (N = 52, three localities). Table S4. Number of SNPs and population metrics for silver catfish (Rhamdia quelen) samples (N = 21, two localities). Table S5. Number of SNPs and population metrics for small-spotted catshark (Scyliorhinus canicula) samples (N = 28, two localities). Table S6. Geographical coordinates of sampling localities for the five species used in this study. Table S7. Building-loci pipelines options selected for process_radtags (A), STACKS 2 (B), and Meyer’s 2b-RAD v2.1 (C). Supplementary figures: Figure S1. Differences in genotyping between common SNPs (COM panel) from both building-loci pipelines in each species. Figure S2. Differences in genotyping between common SNPs from reference genome and de novo approach comparisons (i.e. RG-STA and RG-ALT) in Manila clam and brown trout. Figure S3. Type of genotype differences between common SNPs (COM panel) from both building-loci pipelines in each species. Figure S4. Type of genotype differences between common SNPs from reference genome and de novo approach comparisons (i.e. RG-STA and RG-ALT) in Manila clam and brown trout.

Additional file 2

CLUMPAK and DAPC output comparisons for all species: Figure S5. Comparison between CLUMPAK and DAPC outputs for Manila clam (Ruditapes philippinarum) samples (N = 110). Figure S6. Comparison between CLUMPAK and DAPC outputs for common edible cockle (Cerastoderma edule) samples (N = 120). Figure S7. Comparison between CLUMPAK and DAPC outputs for brown trout (Salmo trutta) samples (N = 52). Figure S8. . Comparison between CLUMPAK and DAPC outputs for silver catfish (Rhamdia quelen) samples (N = 21). Figure S9. Comparison between CLUMPAK and DAPC outputs for small-spotted catshark (Scyliorhinus canicula) samples (N = 28).

Additional file 3.

Description of the information included within GitHub website (https://github.com/abhortas/USC-RAD-seq-scripts) where custom Perl scripts and file examples to obtain shared SNPs and compare genotypes from the two building-loci pipelines used in the present study are available.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Casanova, A., Maroso, F., Blanco, A. et al. Low impact of different SNP panels from two building-loci pipelines on RAD-Seq population genomic metrics: case study on five diverse aquatic species. BMC Genomics 22, 150 (2021). https://0-doi-org.brum.beds.ac.uk/10.1186/s12864-021-07465-w

Download citation

Keywords

  • STACKS 2
  • 2b-RAD v2.1 pipeline
  • de novo approach
  • Bowtie 1
  • Reference genome approach
  • Bivalves
  • Fish
  • Population genomics