Field sampling is biased against small-ranged species of high conservation value: a case study on the sphingid moths of East Africa

The range size of species co-occurring in local assemblages is a pivotal variable in assessments of a site’s conservation value. Assemblages featuring many small-ranged species are given more priority than assemblages consisting mainly of wide-ranging species. However, the assembly of relevant information can be challenging and local range size distributions of tropical invertebrates are rarely available for conservation planning. We present such data for sphingid moths in East Africa, a highly diverse region of high conservation value. We compare geographic range size distributions based on field samples with predictions from modelled range map data. Using this system as a case study, we provide evidence for a systematic sampling bias when inferring average local range sizes from field data. Unseen species (i.e., species present but missed in local sampling) are often those with small ranges (hence, of high conservation value). Using an elevational gradient, we illustrate how this bias can lead to false, counterintuitive assessments of environmental effects on local range size distributions. Furthermore, with particular reference to sphingid moths in the study region, we show that current protected areas appear unrelated to the spatial distribution of species richness or average geographic range sizes at a local scale. We discuss the need to treat field sampled data with caution and in concert with other data sources such as probabilistic models.


Introduction
Assessing a landscape's conservation value is a relevant step in prioritizing resources to those regions where conservation efforts are most useful (Mace et al. 2007). While species richness, the simplest measure of biodiversity, is often used for this purpose, there are a variety of approaches that aim to incorporate uniqueness, endemism, risk of extinction, or related features of assemblages (e.g., biodiversity hotspots, Mittermeier et al. 2011; IUCN red lists, Rodrigues et al. 2006; see also Crisp et al. 2001; Wilson et al. 2006; Bottrill et al. 2009; Beck et al. 2011). The geographic range size of the species involved is a pivotal variable in such assessments. Small-ranged species (often loosely termed 'endemics') make a unique contribution to local assemblages because they are found and can be protected at only a few other sites. Furthermore, range size is a species property that is strongly and inversely linked to extinction risk (Thomas et al. 2004; Harris and Pimm 2008). This may be particularly relevant in tropical conservation, where species' ranges are often smaller than at northern latitudes (McCain 2009; Grünig et al. 2017). With current aims of standardizing global biodiversity data availability for conservation assessments (e.g., 'essential biodiversity variables'; Geijzendorffer et al. 2016), it is timely to investigate potential problems in distilling reliable local range size distribution data for conservation from field sampling.
Undersampling (i.e., incomplete observations of the species occurring at a site) is a notorious issue in biodiversity field studies, particularly in tropical invertebrates or other species-rich assemblages (Coddington et al. 2009). Undersampling of species richness has been broadly recognized as a problem, and manifold attempts for control or mitigation have been devised (Colwell and Coddington 1994; Beck and Schwanghart 2010; Iknayan et al. 2014). However, undersampling may introduce an even stronger bias to the observed occurrences of small-ranged species in particular.
The reasoning for this hypothesis is as follows: A positive range-abundance relationship has been empirically well-supported (i.e., species' local abundance correlates with their geographic range size; Brown 1984; Gaston et al. 1997; in tropical moths: Beck et al. 2006; but see Novosolov et al. 2017 for conflicting analyses) even though it is mechanistically poorly understood.
Assuming the generality of the pattern, this implies that small-ranged species are typically also locally rare, and locally rare species are more likely to be overlooked in local inventories than locally abundant, hence widely ranging species. This will lead to a systematic overestimation of the average range sizes calculated for an assemblage, hence an underestimation of the conservation value of the assemblage and site. The central aim of our study is to test this prediction.
Sphingid moths are large, mostly nocturnal Lepidoptera that have become a model taxon for the study of the geographical ecology of invertebrates (Ballesteros-Mejia et al. 2017; Grünig et al. 2017), being the only insect taxon for which detailed range estimates exist for many tropical regions of the world. Several studies suggested that the East African fauna harbors not only high sphingid richness and turnover, but also particularly high proportions of small-ranged species (Grünig et al. 2017; see also Burgess et al. 2007).
Here we investigate the local range size distribution and richness of sphingid moths occurring across Tanzania and Zambia. We mapped richness and range sizes across these two countries and related these to the location of protected areas. Range map data were compared with field collections to investigate and illustrate the predicted effects of sampling on range size assessments.
As an example of a steep environmental gradient with potential effects on range sizes we used the elevational variation in our study region. Many empirical studies have shown that, in tropical and subtropical regions, high elevation communities feature smaller average geographic range sizes compared with lowland communities (e.g., birds, Orme et al. 2006; amphibians, Whitton et al. 2012; sphingid moths, Grünig et al. 2017). This is consistent with the theoretical reasoning that aseasonal, tropical highlands feature unique climates and habitats that are restricted to small areas, which limits the range sizes of tropical-montane organisms (Janzen 1967; Hawkins et al. 2006). We therefore expect a decline in median geographic range sizes with increasing elevation of sites in our tropical samples. A review by  proposed the opposite pattern, but their references for terrestrial organisms are either exclusively from temperate zones or they refer to the elevational extent of species as 'range'; note that we refer here and throughout the manuscript to geographical (or horizontal) range size, not elevational range size.
To test our hypotheses, we first show that field samples are undersampled by testing the prediction that field sampled richness is lower than range-map derived richness, and that this is related to sampling effort (number of nights, sampled individuals). We then test predictions of our hypothesis that small-range species are selectively undersampled, specifically: (a) local median ranges based on field samples are larger than those based on range map data; (b) species that were not found in field samples but expected from range maps have smaller ranges than the species found in field samples; and (c) sample completeness (the ratio of found/expected species) is related to the overestimation of median range size assessments from field data, compared to range map data. We use the geographic range size distribution along an elevational gradient to illustrate the differences in conclusions on environment-range size relationships drawn from different data sources (i.e., field vs. range map data).

Methods
Range estimates for sphingid species were available for all African species at 5 km grain size. These estimates were based on a large compilation of specimen records, which were input for climate- and vegetation-based species distribution models (SDMs) that were subsequently expert-edited for dispersal limitation (specimen records and range maps can be browsed on the Map of Life, www.mol.org). We calculated geographic range sizes for each species (in km²) from these maps. These range sizes were used for all further analyses of the average range sizes of local assemblages, either based on field sampled data or on range map predictions.
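As a rough illustration of how a range size in km² can be obtained from a gridded range map, the sketch below counts occupied cells and multiplies by the cell area. It assumes an equal-area grid of 5 x 5 km cells, which is our simplification for illustration; the function name, the toy grid, and the cell-counting approach are hypothetical and are not taken from the original workflow.

```python
# Minimal sketch: range size from a presence/absence grid.
# Assumption: every cell covers 5 x 5 km (equal-area projection).
CELL_AREA_KM2 = 5 * 5


def range_size_km2(presence_grid):
    """Return the range size in km^2 as (number of occupied cells) x (cell area)."""
    occupied = sum(1 for row in presence_grid for cell in row if cell)
    return occupied * CELL_AREA_KM2


# Hypothetical toy range map (1 = species predicted present, 0 = absent).
toy_grid = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]

toy_range = range_size_km2(toy_grid)  # 6 occupied cells x 25 km^2
print(toy_range)
```

On a real raster the same count would typically be done with a GIS or array library; the pure-Python version is only meant to make the arithmetic explicit.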
Field sampling was carried out at 56 sites (Fig. 1) between 2010 and 2014 by combined light and bait trapping. Moths were hand-sampled from a tent-like trap lit by a 125 W mercury-vapor bulb.
Traps were active for 11 hours per night (1800 to 0500 hours). Light trapping is the standard field method for nocturnal Lepidoptera (Truxa and Fiedler 2012). Specimens arriving at the lighted trap were captured and collected except if they were of the most widespread and common species, of which only one or two vouchers were collected per site (i.e., there was no quantitative sampling). Furthermore, carrion-baited butterfly traps (Holloway et al. 2013) were operated at night, which sampled some of the rarest sphingid species. Trapping effort (number of nights) differed between sites depending on logistics (Online Resource 1), with most sites being sampled both at the beginning and the end of the rainy season. Collected specimens were prepared according to standard entomological procedures, and identified according to latest taxonomic understanding (http://sphingidae.myspecies.info/). Notably, these field samples were not part of the data compilation used for the SDMs, hence each dataset was fully independent (indeed, two species were found outside their SDM-predicted ranges). A checklist of the species collected at the various sites is in preparation for publication.
As taxonomic understanding is continually developing, there were some discrepancies in the taxonomic definitions of the range map data (based on 2011 nomenclature) and the newer definitions for field sample identifications. To make the datasets comparable we generally disregarded subspecies assignments, and we reassigned the currently valid taxa Temnora fuscata and T. neodentata to the older T. plagiata (sensu lato), Theretra dominika to T. jugurtha s.l., and Lophostethus morettoi to L. dumolinii s.l. designations. These changes did not influence richness measures, but allowed comparisons of species-specific range sizes. Point locality data from these field samples were integrated into the latest version of our global specimen record database and can be viewed at Map of Life (www.mol.org).
We measured species richness as the number of species observed at light and bait traps (for field samples; Sfield) or predicted to occur from SDMs in the corresponding 5 x 5 km pixel (for range map data; SSDM). Thus, there is a potential scale effect (i.e., local trap vs. 5 km-pixel), but given the very high mobility of flying sphingids we assume this to be no source of error. We calculated the median of range sizes for all species co-occurring at a site or in a 5 km pixel, respectively. Low median range values indicate the presence of many small-ranged species, hence a high conservation value of an assemblage. We compare mapped species richness and median range sizes across Tanzania and Zambia to each other and to GIS layers of protected areas for these countries (source: https://www.protectedplanet.net/; using all IUCN categories of protected areas).
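The two assemblage-level measures described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the species names and range sizes are invented, and the helper `assemblage_summary` is a hypothetical name. It shows how missing the small-ranged species in a field sample raises the median range size relative to the range-map expectation.

```python
from statistics import median

# Hypothetical range sizes in km^2 (illustrative values only).
range_size_km2 = {
    "species_a": 8.2e6,
    "species_b": 4.9e6,
    "species_c": 0.6e6,
    "species_d": 0.2e6,
}


def assemblage_summary(species, range_sizes):
    """Return (richness, median range size) for a set of co-occurring species."""
    unique = set(species)  # each species counts once, regardless of abundance
    richness = len(unique)
    median_range = median(range_sizes[s] for s in unique)
    return richness, median_range


# Field sample misses the two small-ranged species expected from range maps.
field_species = ["species_a", "species_b"]
sdm_species = ["species_a", "species_b", "species_c", "species_d"]

s_field, range_field = assemblage_summary(field_species, range_size_km2)
s_sdm, range_sdm = assemblage_summary(sdm_species, range_size_km2)

print(s_field, range_field)
print(s_sdm, range_sdm)
```

In this toy case the field-based median is larger than the map-based median, which is the direction of bias the study sets out to test.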
We used the number of sampling nights as a measure of sampling effort. We measured sample completeness as Sfield/SSDM, where Sfield is the species richness of field samples and SSDM is the species richness computed from SDM-based range maps. Incomplete field inventories are indicated by Sfield/SSDM <1.
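A minimal sketch of the completeness measure, assuming per-site richness values are already available. The site names and numbers are hypothetical, and `sample_completeness` is an illustrative helper rather than part of the original analysis.

```python
def sample_completeness(s_field, s_sdm):
    """Return Sfield / SSDM; values below 1 indicate an incomplete inventory."""
    if s_sdm <= 0:
        raise ValueError("SDM-derived richness must be positive")
    return s_field / s_sdm


# Hypothetical sites: (observed richness, richness expected from range maps).
sites = {
    "site_1": (12, 48),
    "site_2": (54, 60),
    "site_3": (40, 40),
}

completeness = {name: sample_completeness(f, m) for name, (f, m) in sites.items()}

# Sites whose field inventory falls short of the range-map expectation.
incomplete = sorted(name for name, c in completeness.items() if c < 1)

print(completeness)
print(incomplete)
```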
We compared median local range sizes based on the species found in field samples (Rangefield) with median local range sizes in 5 km-pixels from SDM-predictions (RangeSDM).
Furthermore, as a more specific test of our hypothesis, we compared the median range size in field samples (Rangefield) with the median range of the (SDM-expected) species that were not found at a site. We expected the latter to have a smaller median range value than the former. We used paired t-tests on log-transformed data (nonparametric Wilcoxon tests on untransformed data led to identical conclusions). We assessed the similarity of median range sizes of assemblages by calculating Rangefield/RangeSDM, where Rangefield/RangeSDM >1 indicates that some small-ranged species were missed in the field samples. We expected an inverse relationship of Rangefield/RangeSDM and Sfield/SSDM; with increasingly incomplete species inventories, median range sizes of assemblages should be overestimated.
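The paired comparison on log-transformed medians can be sketched with the standard library alone. The example below computes only the t statistic and the degrees of freedom (a p-value would additionally require the t distribution, e.g. from a statistics package). All values are invented for illustration, and `paired_t_log10` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev


def paired_t_log10(x, y):
    """Paired t statistic and degrees of freedom on log10-transformed values.

    x and y are per-site values in the same site order. Raises ZeroDivisionError
    if all paired differences are identical (zero variance).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [math.log10(a) - math.log10(b) for a, b in zip(x, y)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical per-site median range sizes in km^2.
range_found = [5.2e6, 4.8e6, 6.1e6, 5.0e6, 4.4e6]   # species caught in the field
range_missed = [4.1e6, 4.5e6, 4.0e6, 4.6e6, 4.3e6]  # expected but not caught

t_stat, df = paired_t_log10(range_found, range_missed)

# Per-site ratio analogous to Rangefield/RangeSDM; values > 1 mean the
# first median exceeds the second at that site.
ratios = [a / b for a, b in zip(range_found, range_missed)]

print(t_stat, df)
print(ratios)
```

A positive t statistic here means that the species found in the field have, on average, larger ranges than those that were expected but missed.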
We illustrate the potential effect of undersampling issues on geographic range size data by relating median range size data to a major environmental gradient, elevation. Specifically, we plotted median geographic range sizes from field samples and from range map predictions against the elevation of sites and tested whether they recover the same relationship. To account for effects of spatial non-independence of these geographical data (Dormann et al. 2007, Bini et al. 2009), we adjusted degrees of freedom in statistical tests (spatial correlation with corrected degrees of freedom, dfcorr; Dutilleul 1993; software: SAM, Rangel et al. 2010).
We provide details on field site richness and median geographic range sizes (Online Resource 1) as well as comprehensive maps of richness and median range size for the two countries (GIS-compatible format; Online Resources 2, 3).

Results
Across Tanzania and Zambia, light and bait trapping field data for 56 sites (Fig. 1A) yielded 2206 individuals and 122 species of sphingid moth (after nomenclature adjustments, see Methods).
However, 204 species were expected to occur in those countries according to SDM-derived range maps. From GIS-derived range maps we have data for 65736 raster cells of 5 x 5 km extent for the two countries.
Range map-derived species richness and median geographic range sizes follow quite different geographic patterns (Figs. 1B, C). While richness is distinctively higher in the northern part of the research area (e.g., Tanzania median richness per cell = 63; Zambia = 41), regions of small median range size (i.e., many geographically restricted species) also stretch south along the mountain ranges into Zambia. In contrast, the western plains along the Zambesi river feature mainly large range sizes (i.e., mostly widespread species are predicted to occur there). Species richness and median range sizes are inversely but only weakly correlated (range map data; Pearson's r = -0.290).
Values for sphingid moth range sizes and richness based on SDM maps are almost equal with regard to the location of protected areas (i.e., no notable differences within and outside protected areas). Median assemblage range size inside protected areas (median[lower, upper quartile], in 10⁶ km²: 4.73[4.12,5.52]) is even slightly higher than outside protected areas (4.71[4.14,5.52]), and per cell species richness inside protected areas (54[40,63]) is slightly lower than outside protected areas (56[43,67]). Given the broad overlap in quartiles and even a slight trend towards the opposite of expected patterns we refrained from more detailed statistical testing.
Local sampling effort per site ranged between 1 and 27 nights, the number of captured moths between 2 and 231 individuals, and observed richness (Sfield) between 2 and 54 species; all three measures are significantly correlated to each other (Online Resource 1). Sampling effort, in particular, explained over half of the variation of observed species richness and of sample completeness (Sfield/SSDM; Fig. 2A; N = 56, r² = 0.544, p <0.001), which indicates that observed richness is heavily affected by undersampling.
We had predicted that undersampling specifically excludes small-ranged species, which was supported by the data. First, local median range sizes based on SDM data (median = 4.3 x 10⁶ km²) are smaller than those based on field data (median = 5.0 x 10⁶ km²; paired t-test of log10-transformed data: t = -2.6, df = 55, p = 0.012). Second, locally collected species had larger geographic ranges than those species that were expected at the sites (according to range maps) but which were not caught in the field (presumably due to undersampling; Fig. 2B). A paired t-test on log10-transformed median ranges supports that this difference is not due to random variation (t = 2.93, df = 55, p = 0.005). Species expected but not found in the field have a ca. 20 percent smaller median range size than the observed species. However, we did not find a correlation of sample completeness (Sfield/SSDM) and the similarity of local range assessments (Rangefield/RangeSDM; Fig. 2C), our third prediction (correlation of log10-transformed ratios: r² <0.01, p = 0.941). Thus, incorrect assessments in local range distributions are not only due to undersampling but may have further causes (see Discussion).

Discussion
In our East African sphingid moth case study, we illustrated the existence of an undersampling bias in field-sampled data with regard to geographic range size assessments, which are crucial aspects of conservation evaluations. Our results confirmed the hypothesized effect of undersampling on range size assessments for sphingid moth assemblages. Missed species (i.e., those not found in the field but expected from estimated range maps) featured significantly smaller ranges than those species that were common enough to be actually found. Field sampling might therefore underestimate the conservation value of local assemblages if species inventories are incomplete. We illustrated how the discrepancy of field-sampled and range map-derived range data leads to significant yet opposite assessments of the elevational pattern of geographic range size, with range map-derived data following the theoretical expectation of a decline. There may be disagreement on this expectation (see references in Introduction), but in any case, conclusions from the two data sources differ significantly from one another. The discrepancy between datasets exemplifies the potential for misjudgment due to undersampled data. These findings are consistent with deductions from assuming a general range-abundance relationship in species communities: species with smaller geographic range sizes are also likely to be less abundant (Brown 1984; Gaston et al. 1997). There is no reason to believe that this is a taxon-specific or regional effect, so it raises concerns about the general reliability of field data for assemblage-wide range size assessments. The interpretation and relevance of our results rest on the assumption that SDM-derived range map data are sufficiently close to the "truth" that they can be used to judge the accuracy of field samples.
However, range maps themselves are estimates that may contain error (this is true of all range maps for all taxa, modelled or expert-drawn), which may be viewed as a caveat to this study.
For example, field sampling provided confirmed records for two species (Hypaedalea neglecta, Pseudenyo benitensis) outside their estimated ranges. Despite this, we judge range map data as much more accurate (regarding inventory completeness) than field samples for the following reasons. (1) Range maps have been tested and vetted on different levels, from individual, numeric SDM quality metrics through species-specific expert opinions to tests of emergent data such as species richness.
(2) Field data clearly suffer from undersampling (Fig. 2A). Observed species richness variation is heavily affected by sampling effort and by the number of individuals collected (Online Resource 1; similar figures were found in other field data of this type, e.g. Beck and Chey 2008). We therefore know that many field sites were incompletely sampled. (3) The empirical results found in our study match sound deductions from previously well-established knowledge (i.e., a positive range-abundance relationship). In light of this, we feel confident that it is valid to interpret range maps as "true" and field data as "biased" in this study. However, we acknowledge that our prediction of higher deviation between field- and map-derived range data with heavier undersampling (low Sfield/SSDM; Fig. 2C) was not met significantly, and that the observation of highest deviations occurring at the most heavily undersampled sites (when plotting non-transformed ratios, not shown) is probably rather an effect of higher data variance with small field samples.
Despite our overall assessment of reliable range map data, it is important to note that range maps may generally overestimate species' presence to varying degrees depending on grain size (Rahbek 2005; Jetz et al. 2008; but note the relatively fine grain, 5 km, used in our analyses).
Additionally, but not addressed by our study, geographic range data are notoriously difficult to assess and verify due to temporal instability at range edges in particular (Gaston 1996) or due to artefacts of unsuitable sampling methods (which may make some species appear rarer than they are).

Field sampling vs. probabilistic modeling
Our study has shown that incomplete field sampling of biodiversity can lead to entirely wrong assessments of a highly relevant aspect of conservation value of the sampled assemblages, i.e., their geographic range size distribution. Incomplete sampling is a major issue in tropical invertebrate studies in particular (Coddington et al. 2009). While its effects on richness assessments have been thoroughly appreciated and various approaches of correction have been proposed and are increasingly applied (Colwell and Coddington 1994; Beck and Schwanghart 2010; Iknayan et al. 2014), there are currently no simple numerical solutions to more subtle undersampling effects such as the one treated here. Among field ecologists and conservation biologists there is substantial skepticism towards probabilistic modelling of species ranges (Elith and Leathwick 2009), which seems supported by the huge amount of literature discussing potential error, biases, mis-implementation or misinterpretations of such approaches (e.g., Qiao et al. 2015). Equally, however, the inherent error in supposedly 'solid' field data must be challenged, not only with regard to species richness but also to more refined aspects of biodiversity, such as range size distributions (as shown here). Increased awareness of this issue is necessary among field ecologists and conservation biologists. In the absence of any easy fix for these problems, we advocate multiple approaches (e.g., consensus among field data and range estimates) and an acknowledgement of the tentative nature of empirical findings unless re-tested and confirmed repeatedly.

Range size distributions and conservation in East Africa
Species richness and median range sizes of sphingid moth assemblages, as measured from range map data, did not follow the same pattern across Tanzania and Zambia, hence they capture different aspects of 'conservation value'. For maximum effectiveness, protected areas should ideally contain assemblages of lower median range size (i.e., higher proportion of range-restricted species) and of higher species richness compared to unprotected areas. However, reserve placement across our research region appears to be unrelated to such data for sphingid moths: richness and median range sizes were almost equal within and outside protected areas. Conservation policy is obviously a difficult process affected not only by ecological reasoning, but also by economic, societal and political factors. It remains to be tested in a similar manner for other taxa, such as plants and vertebrates, whether protected area locations in the research region fulfill conservation expectations (e.g., Brooks et al. 2001). This will help refine our assessment and, more importantly, help improve future reserve design. To this end, we make our GIS data for this region fully available (Online Resources 2, 3). Different taxonomic groups, as well as different aspects of biodiversity or other metrics of 'conservation value', are likely to indicate different priority regions (e.g., Schulze et al. 2004; Wolters et al. 2006; Grenyer et al. 2006; Beck et al. 2013). Consensus methods, or clear criteria for what is to be protected in a particular region, will then be required for objective decision making.
In conclusion, we confirmed a disproportionate undersampling bias against small-ranged species in field data, which may distort assessments of conservation value. We showed that it can lead to different, probably false, ecological inferences when analyzing geographic range size distributions along an elevational gradient, in comparison to model-based data. Such effects may be widespread across taxa and regions, and they are highly probable in all field study systems known to be vulnerable to undersampling, such as tropical invertebrates. We argue that field data should not be treated as 'solid' but met with appropriate skepticism, and that confirmation of relevant patterns should be sought by additional, alternative methods, such as probabilistic modelling.

Online Resources
Online Resource 1: (a) Field data, (b) correlations of sampling effort, individuals and species
Online Resource 2: GIS raster for sphingid species richness (ASCII, Mollweide World, 5 x 5 km)
Online Resource 3: GIS raster for sphingid median range sizes (ASCII, Mollweide World, 5 x 5 km)