Data set details

Data set description: Species occurrence data in different countries
Source: Global Biodiversity Information Facility (GBIF), Biodiversity Information Serving Our Nation (BISON), iNaturalist (iNat), eBird, and VertNet
Details on the retrieved data: Panthera onca (Jaguar) occurrence in South America throughout 2020.
Spatial and temporal resolution: Species occurrence observed worldwide (different start/end dates can be defined).

Introduction

In this tutorial, we will see how to extract and work with species occurrence data from different sources, namely the Global Biodiversity Information Facility (GBIF), Biodiversity Information Serving Our Nation (BISON), iNaturalist (iNat), eBird, and VertNet. We could do this manually or use specific R packages (such as rgbif, rbison, rebird, or rvertnet) that retrieve data from each of these databases separately. However, we will use the spocc package to work with all databases at once.

Installing the spocc package

The package can be installed from CRAN as follows:

if (!require(spocc)) {
  install.packages('spocc', dependencies = TRUE)
  library(spocc)
}

Alternatively, the development version can be installed via

remotes::install_github('ropensci/spocc')

In addition to the spocc package, we will also use the tidyverse, sf, and rnaturalearth packages. So, assuming they are already installed, we can load them via

library(tidyverse)
library(sf)
library('rnaturalearth')
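
Since the step above assumes that these packages are already installed, a quick check can save a failed `library()` call. The following is a minimal sketch; the object names `pkgs`, `installed`, and `missing_pkgs` are our own and not part of any package.

```r
# Packages used in this tutorial
pkgs <- c('spocc', 'tidyverse', 'sf', 'rnaturalearth')

# TRUE/FALSE for each package, without attaching any of them
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed (possibly none)
missing_pkgs <- names(installed)[!installed]
missing_pkgs
```

Any name listed in `missing_pkgs` can then be passed to `install.packages()`.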

Retrieving data

The main function of the spocc package for retrieving data from different sources is occ(). It has some 19 different parameters (which are documented in ?occ), but we will start by setting only the query and from parameters.

data <- occ(query = c('Panthera leo', 'Giraffa'), from = c('gbif'))
data
## Searched: gbif
## Occurrences - Found: 18,369, Returned: 1,000
## Search type: Scientific
##   gbif: Panthera leo (500), Giraffa (500)

From the above code, notice that we have queried the species Panthera leo (lion) and the genus Giraffa (giraffes) in the GBIF data set. Also notice that, although 18,369 occurrences were found, the function returned only 1,000 observations (500 per query term). You can change this behavior by setting the limit parameter accordingly.
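
As a minimal sketch of how the limit could be changed, the helper below wraps occ() with a smaller cap and also sets has_coords = TRUE so that only georeferenced records are requested. The function name `fetch_limited` and its argument names are hypothetical (they are not part of spocc); only `query`, `from`, `limit`, and `has_coords` are actual occ() parameters.

```r
# Hypothetical helper (not part of spocc): query a single source with a
# custom cap on the number of returned records per query term.
fetch_limited <- function(species, source = 'gbif', n_max = 50) {
  spocc::occ(
    query = species,
    from = source,
    limit = n_max,      # maximum number of records per query term and source
    has_coords = TRUE   # ask only for records that have coordinates
  )
}
```

For example, `fetch_limited('Panthera leo', n_max = 50)` should return at most 50 GBIF records. Running it requires an internet connection, and the counts will change as the providers receive new observations.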

If we want to check the retrieved data (e.g., the first three observations of the Panthera leo data set), we can access them via

head(data$gbif$data$Panthera_leo, 3) 
## # A tibble: 3 x 86
##   name      longitude latitude issues  prov  key   scientificName   datasetKey  
##   <chr>         <dbl>    <dbl> <chr>   <chr> <chr> <chr>            <chr>       
## 1 Panthera~      24.1   -23.6  cdround gbif  3031~ Panthera leo me~ 50c9509d-22~
## 2 Panthera~      24.1   -23.5  cdround gbif  3031~ Panthera leo me~ 50c9509d-22~
## 3 Panthera~      37.7     2.10 cdround gbif  3031~ Panthera leo me~ 50c9509d-22~
## # ... with 78 more variables: publishingOrgKey <chr>, installationKey <chr>,
## #   publishingCountry <chr>, protocol <chr>, lastCrawled <chr>,
## #   lastParsed <chr>, crawlId <int>, hostingOrganizationKey <chr>,
## #   basisOfRecord <chr>, occurrenceStatus <chr>, taxonKey <int>,
## #   kingdomKey <int>, phylumKey <int>, classKey <int>, orderKey <int>,
## #   familyKey <int>, genusKey <int>, speciesKey <int>, acceptedTaxonKey <int>,
## #   acceptedScientificName <chr>, kingdom <chr>, phylum <chr>, order <chr>, ...

However, since different sources provide data in various formats, spocc also has a function that converts and formats data appropriately, namely occ2df().

occ2df(obj = data)
## # A tibble: 1,000 x 6
##    name                             longitude latitude prov  date       key     
##    <chr>                                <dbl>    <dbl> <chr> <date>     <chr>   
##  1 Panthera leo melanochaita (C.E.~      24.1  -23.6   gbif  2021-01-07 3031738~
##  2 Panthera leo melanochaita (C.E.~      24.1  -23.5   gbif  2021-01-08 3031818~
##  3 Panthera leo melanochaita (C.E.~      37.7    2.10  gbif  2021-01-04 3031987~
##  4 Panthera leo melanochaita (C.E.~      25.6  -33.5   gbif  2021-01-25 3032110~
##  5 Panthera leo melanochaita (C.E.~      37.3   -2.76  gbif  2021-01-01 3039441~
##  6 Panthera leo melanochaita (C.E.~      31.7  -24.5   gbif  2021-01-27 3039460~
##  7 Panthera leo melanochaita (C.E.~      35.0   -1.31  gbif  2021-01-03 3044600~
##  8 Panthera leo melanochaita (C.E.~      34.8   -1.23  gbif  2021-01-03 3044642~
##  9 Panthera leo leo                      30.1   -0.145 gbif  2021-01-31 3044945~
## 10 Panthera leo leo                      23.6    6.51  gbif  2021-01-30 3097200~
## # ... with 990 more rows
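
Notice that the name column holds the scientific name as reported by the provider, which may include a subspecies or the authorship. If we want to count records per species, one option is to keep only the first two words of the name. The sketch below illustrates this on a few made-up rows that mimic the columns returned by occ2df(); the objects `records` and `species_counts` and the keys are invented for illustration.

```r
library(dplyr)
library(stringr)

# Made-up rows with the same columns as a typical occ2df() result
records <- tibble(
  name = c('Panthera leo leo', 'Panthera leo leo',
           'Panthera onca (Linnaeus, 1758)'),
  longitude = c(30.1, 23.6, -56.4),
  latitude = c(-0.145, 6.51, -17.0),
  prov = 'gbif',
  date = as.Date(c('2021-01-31', '2021-01-30', '2020-01-14')),
  key = c('k1', 'k2', 'k3')
)

# Keep the genus and species epithet only, then count per species and provider
species_counts <- records %>%
  mutate(species = word(name, start = 1, end = 2)) %>%
  count(species, prov)

species_counts
```

This simple rule assumes that the first two words are always the genus and the species epithet, which does not hold for genus-level queries such as Giraffa if a record is identified only to genus.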

Working with the downloaded data

Now, suppose that we want to map the jaguar (Panthera onca) occurrences recorded in 2020 in the countries (not territories, for this tutorial) of South America, based on the GBIF and iNat databases. There are different ways to do this, but the first thing we will do is load geographical data about the countries in South America. We will do this using the ne_countries() function from the rnaturalearth package.

south_america <- c('Argentina', 'Bolivia', 'Brazil', 'Chile', 'Colombia', 'Ecuador', 'Guyana', 'Suriname', 'Paraguay', 'Peru', 'Uruguay', 'Venezuela')

shape <- ne_countries(scale = 50, country = south_america, returnclass = 'sf')
shape <- shape %>% select(admin, geometry)

ggplot() + geom_sf(data = shape)

The next step is to retrieve the species occurrence data as we have just learnt. To do this, we will use the occ() function.

jaguar <- spocc::occ(query = c('Panthera onca'), 
                     from = c('gbif', 'inat'), 
                     limit = 500,
                     date = c('2020-01-01', '2020-12-31'))
jaguar
## Searched: gbif, inat
## Occurrences - Found: 684, Returned: 684
## Search type: Scientific
##   gbif: Panthera onca (358)
##   inat: Panthera onca (326)

As one can see, we have retrieved 358 observations from GBIF and 326 from iNat (684 in total). Now, we can convert and format our data using the occ2df() function.

jaguar <- occ2df(jaguar)

# Convert 'longitude' and 'latitude' columns into numbers
jaguar <- jaguar %>% mutate(across(c(longitude, latitude), as.numeric))

# Remove rows with NA for 'longitude' or 'latitude', if any
jaguar <- jaguar %>% filter(!is.na(longitude), !is.na(latitude))

jaguar
## # A tibble: 656 x 6
##    name                           longitude latitude prov  date       key       
##    <chr>                              <dbl>    <dbl> <chr> <date>     <chr>     
##  1 Panthera onca (Linnaeus, 1758)     -105.     21.7 gbif  2020-01-10 2557813727
##  2 Panthera onca (Linnaeus, 1758)     -106.     21.8 gbif  2020-01-10 2563487815
##  3 Panthera onca (Linnaeus, 1758)     -105.     21.8 gbif  2020-01-10 2563488017
##  4 Panthera onca (Linnaeus, 1758)     -106.     21.7 gbif  2020-01-11 2563488309
##  5 Panthera onca (Linnaeus, 1758)     -105.     21.7 gbif  2020-01-11 2563488355
##  6 Panthera onca (Linnaeus, 1758)     -105.     21.8 gbif  2020-01-10 2563488417
##  7 Panthera onca (Linnaeus, 1758)     -105.     21.6 gbif  2020-01-11 2563488541
##  8 Panthera onca (Linnaeus, 1758)     -105.     21.7 gbif  2020-01-11 2563488786
##  9 Panthera onca (Linnaeus, 1758)     -106.     21.7 gbif  2020-01-10 2563488961
## 10 Panthera onca (Linnaeus, 1758)     -105.     21.7 gbif  2020-01-11 2563489363
## # ... with 646 more rows

However, the jaguar records retrieved above are not restricted to South America, since the query had no geographical constraint. Recall that we want data just from South America. We can achieve this by keeping only the data points that intersect our shape object. This can be done as follows:

# Convert longitude/latitude to POINT
jaguar <- st_as_sf(x = jaguar, coords = c('longitude', 'latitude'), crs = st_crs(shape))
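# Keep only the locations that fall inside one of the South American countries.
# left = FALSE gives an inner join (points matching no country are dropped);
# left = TRUE would keep every point, with NA in 'admin' when there is no match.
jaguar <- st_join(x = jaguar, y = shape, left = FALSE)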

jaguar
## Simple feature collection with 284 features and 5 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -78.41903 ymin: -29.54858 xmax: -40.63385 ymax: 11.03458
## CRS:           +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## # A tibble: 284 x 6
##    name                   prov  date       key                   geometry admin 
##  * <chr>                  <chr> <date>     <chr>              <POINT [°]> <chr> 
##  1 Panthera onca (Linnae~ gbif  2020-01-05 2563542~ (-77.48873 -1.035258) Ecuad~
##  2 Panthera onca (Linnae~ gbif  2020-01-07 2563550~ (-75.51861 -0.652688) Ecuad~
##  3 Panthera onca (Linnae~ gbif  2020-01-07 2563580~ (-76.07225 -0.758781) Ecuad~
##  4 Panthera onca (Linnae~ gbif  2020-01-14 2576304~ (-56.41679 -17.03398) Brazil
##  5 Panthera onca (Linnae~ gbif  2020-01-18 3067760~ (-71.02769 -12.07986) Peru  
##  6 Panthera onca (Linnae~ gbif  2020-01-30 3118585~   (-67.68931 6.12016) Colom~
##  7 Panthera onca (Linnae~ gbif  2020-01-31 3118585~   (-67.75761 6.13851) Colom~
##  8 Panthera onca (Linnae~ gbif  2020-01-24 3118586~   (-67.74546 6.08136) Colom~
##  9 Panthera onca (Linnae~ gbif  2020-01-26 3395110~   (-76.77283 0.87822) Colom~
## 10 Panthera onca (Linnae~ gbif  2020-01-26 3395110~   (-76.77283 0.87822) Colom~
## # ... with 274 more rows

Notice that we went from a data set with 656 rows to one with 284 rows, and that the join added the admin column with the country of each record.
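
To make the effect of the spatial join easier to see, here is a small self-contained sketch with toy data. The two square "regions" and the four points are made up, and the names `square`, `regions`, `pts`, `inner`, `left_joined`, and `per_region` are ours; only st_join(), st_as_sf(), and the other sf and dplyr functions are real.

```r
library(sf)
library(dplyr)

# Build a square polygon from its lower-left corner (toy helper)
square <- function(xmin, ymin, size = 10) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmin + size, ymin),
    c(xmin + size, ymin + size), c(xmin, ymin + size),
    c(xmin, ymin)
  )))
}

# Two toy regions in longitude/latitude (WGS84)
regions <- st_sf(
  admin = c('A', 'B'),
  geometry = st_sfc(square(0, 0), square(20, 0), crs = 4326)
)

# Four toy points: two inside A, one inside B, one outside both regions
pts <- data.frame(
  key = c('p1', 'p2', 'p3', 'p4'),
  longitude = c(5, 7, 25, 50),
  latitude = c(5, 3, 5, 5)
)
pts <- st_as_sf(pts, coords = c('longitude', 'latitude'), crs = st_crs(regions))

# Inner join: the point outside both regions is dropped
inner <- st_join(x = pts, y = regions, left = FALSE)

# Left join: every point is kept, with NA in 'admin' when there is no match
left_joined <- st_join(x = pts, y = regions, left = TRUE)

# Number of points per region, computed on the attribute table only
per_region <- inner %>% st_drop_geometry() %>% count(admin)
per_region
```

The same count(admin) call, applied to the joined jaguar object after st_drop_geometry(), would give the number of retained records per country.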

Now we can plot the country boundaries along with the jaguar occurrences observed in 2020.

# Plot 'Panthera onca' occurrence 
ggplot() +
  geom_sf(data = shape) +
  geom_sf(data = jaguar, aes(color = prov), size = 3) + 
  scale_color_manual(name = 'Provider',
                     values = c(alpha(colour =  'red', alpha = 0.35),
                                alpha(colour = 'blue', alpha = 0.35)),
                     labels = c('GBIF', 'INAT')) +
  labs(x = 'Longitude', y = 'Latitude', title = 'Panthera onca occurrence in countries in South America in 2020') + 
  theme_bw()

From the above image, notice that, since many points (almost) overlap, plotting the locations with semi-transparent colors plays an important role in distinguishing regions with low and high densities of jaguar records.
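
As an alternative sketch, the same semi-transparency can be obtained by passing alpha directly to geom_sf() instead of building transparent colors with alpha() inside scale_color_manual(). The example below uses made-up points (the object `toy` and its coordinates are invented) so that it runs without downloading any data.

```r
library(sf)
library(ggplot2)

# Made-up point cloud: 100 points per provider around an arbitrary location
set.seed(1)
toy <- data.frame(
  longitude = -56 + rnorm(200, sd = 0.5),
  latitude = -17 + rnorm(200, sd = 0.5),
  prov = rep(c('gbif', 'inat'), each = 100)
)
toy <- st_as_sf(toy, coords = c('longitude', 'latitude'), crs = 4326)

# alpha is set once for the whole layer; overlapping points appear darker
p <- ggplot() +
  geom_sf(data = toy, aes(color = prov), size = 3, alpha = 0.35) +
  labs(x = 'Longitude', y = 'Latitude', color = 'Provider') +
  theme_bw()
p
```

With this approach the legend keys are drawn with the same transparency as the points, which may or may not be desirable.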

Last updated: 2021-11-30
Source code: https://github.com/rspatialdata/rspatialdata.github.io/blob/main/species_occurrence.Rmd

Tutorial was compiled using:
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rnaturalearth_0.1.0 sf_1.0-3            forcats_0.5.1      
##  [4] stringr_1.4.0       dplyr_1.0.7         purrr_0.3.4        
##  [7] readr_2.0.2         tidyr_1.1.4         tibble_3.1.5       
## [10] ggplot2_3.3.5       tidyverse_1.3.1     spocc_1.2.0        
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.0                lubridate_1.8.0         oai_0.3.2              
##  [4] httr_1.4.2              rgbif_3.6.0             tools_4.1.1            
##  [7] backports_1.3.0         bslib_0.3.1             utf8_1.2.2             
## [10] R6_2.5.1                KernSmooth_2.23-20      rgeos_0.5-8            
## [13] DBI_1.1.1               lazyeval_0.2.2          colorspace_2.0-2       
## [16] withr_2.4.2             sp_1.4-5                rnaturalearthdata_0.1.0
## [19] tidyselect_1.1.1        curl_4.3.2              compiler_4.1.1         
## [22] cli_3.1.0               rvest_1.0.2             xml2_1.3.2             
## [25] triebeard_0.3.0         sass_0.4.0              scales_1.1.1           
## [28] classInt_0.4-3          proxy_0.4-26            digest_0.6.28          
## [31] rmarkdown_2.11          pkgconfig_2.0.3         htmltools_0.5.2        
## [34] highr_0.9               dbplyr_2.1.1            fastmap_1.1.0          
## [37] maps_3.4.0              rlang_0.4.11            readxl_1.3.1           
## [40] httpcode_0.3.0          rstudioapi_0.13         farver_2.1.0           
## [43] jquerylib_0.1.4         generics_0.1.1          jsonlite_1.7.2         
## [46] rbison_1.0.0            magrittr_2.0.1          s2_1.0.7               
## [49] Rcpp_1.0.7              munsell_0.5.0           fansi_0.5.0            
## [52] lifecycle_1.0.1         stringi_1.7.5           whisker_0.4            
## [55] yaml_2.2.1              rvertnet_0.8.2          plyr_1.8.6             
## [58] grid_4.1.1              crayon_1.4.2            lattice_0.20-44        
## [61] conditionz_0.1.0        haven_2.4.3             mapproj_1.2.7          
## [64] hms_1.1.1               knitr_1.36              pillar_1.6.4           
## [67] wellknown_0.7.4         uuid_0.1-4              crul_1.1.0             
## [70] wk_0.5.0                reprex_2.0.1            glue_1.5.0             
## [73] rebird_1.3.0            evaluate_0.14           ridigbio_0.3.5         
## [76] data.table_1.14.2       modelr_0.1.8            urltools_1.7.3         
## [79] vctrs_0.3.8             tzdb_0.1.2              cellranger_1.1.0       
## [82] gtable_0.3.0            assertthat_0.2.1        xfun_0.26              
## [85] broom_0.7.9             e1071_1.7-9             class_7.3-19           
## [88] units_0.7-2             ellipsis_0.3.2

Corrections: If you see mistakes or want to suggest changes, please create an issue on the source repository or submit a pull request
Contributions: If you want to contribute or collaborate on the project, please see the guidelines for collaborating
Reuse: Text and figures are licensed under Creative Commons Attribution CC BY 4.0.