--- title: "Formatting your data" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Formatting your data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", out.width = "100%" ) ``` The `funbiogeo`package requires that information is structured in three different datasets: - the **species x traits** `data.frame` (`species_traits` in `funbiogeo`), which contains trait values for several traits (in columns) for several species (in rows). - the **site x species** `data.frame` (`site_species` in `funbiogeo`), which contains the presence/absence, abundance, or cover information for species (in columns) by sites (in rows). - the **site x locations** object (`site_locations` in `funbiogeo`), which contains the physical locations of the sites of interest Optionally, an additional dataset can be provided: - a **species x categories** `data.frame` (`species_categories` in `funbiogeo`), which contains two-columns: one for species, one for potential categorization of species (whether it's taxonomic classes, specific diets, or any arbitrary classification) ```{r setup} library(funbiogeo) ``` ## Wide vs long format In `funbiogeo` these datasets **must be** in a wide format (where one row hosts several variables across columns), but sometimes information is structured in a long format (one observation per row, also called [**tidy format**](https://r4ds.had.co.nz/tidy-data.html)). For instance, the following dataset illustrates the wider format (the presence/absence of all species is spread across columns). ```{r wide-format, echo=FALSE} wide_data <- data.frame("site" = LETTERS[1:3], "species_1" = c(1, 0, 1), "species_2" = c(0, 0, 1), "species_3" = c(1, 1, 1), "species_4" = c(1, 1, 0)) knitr::kable(wide_data, caption = "Wide format dataset (used in `funbiogeo`)", align = rep("c", ncol(wide_data))) ``` The following dataset illustrates the long format (the column `species` contains the name of the species and the column `occurrence` contains the presence/absence of species). ```{r long-format, echo=FALSE} long_data <- data.frame("site" = rep(LETTERS[1:3], 4), "species" = c(rep("species_1", 3), rep("species_2", 3), rep("species_3", 3), rep("species_4", 3)), "occurrence" = c(1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0)) knitr::kable(long_data, caption = "Long format dataset", align = rep("c", ncol(long_data))) ``` ## The `fb_format_*()` functions If your data are not split into these wider datasets, you can use the functions `fb_format_*()` to create these specific objects from a long format dataset. - `fb_format_site_locations()` allows to extract the **site x locations** information from the long format data - `fb_format_site_species()` allows to extract the **site x species** information from the long format data - `fb_format_species_traits()` allows to extract the **species x traits** information from the long format data - `fb_format_species_categories()` allows to extract the **species x categories** information from the long format data All these functions take a long dataset as input (argument `data`), where one row corresponds to the occurrence/abundance/coverage of one species at one site and output a wider object. ## Usage `funbiogeo` provides a small excerpt of long format data to show how to use the functions. This data sits at `system.file("extdata", "raw_mammals_data.csv", package = "funbiogeo")`. Let's import the long format dataset provided by `funbiogeo`: ```{r 'load-raw-dataset'} # Define the path to long format dataset ---- file_name <- system.file("extdata", "raw_mammals_data.csv", package = "funbiogeo") # Read the file ---- all_data <- read.csv(file_name) ``` ```{r preview-raw-dataset, echo=FALSE} knitr::kable(head(all_data, 10), caption = "Long table example", align = c("c", "r", "r", "l", "c", "r", "r", "r", "r", "r", "r") ) ``` ### Extracting species x traits data The function `fb_format_species_traits()` extracts species traits values from this long table to create the species x traits dataset. Note that one species must have one unique trait value (no trait variation across sites is allowed). ```{r 'format-species-traits'} # Extract species x traits data ---- species_traits <- fb_format_species_traits( data = all_data, species = "species", traits = c("adult_body_mass", "gestation_length", "litter_size", "max_longevity", "sexual_maturity_age", "diet_breadth") ) # Preview ---- head(species_traits, 10) ``` ### Extracting site x species data The function `fb_format_site_species()` extracts species occurrence/abundance/coverage from this long table to create the site x species dataset. Note that one species must have been observed one time at one site (the package `funbiogeo` does not yet consider temporal survey). ```{r 'format-sites-species'} # Format site x species data ---- site_species <- fb_format_site_species(data = all_data, site = "site", species = "species", value = "count", na_to_zero = TRUE ) # Preview ---- head(site_species[ , 1:8], 10) ``` ### Extracting site x locations data The function `fb_format_site_locations()` extracts sites coordinates from this long table to create the site x locations dataset. Note that one site must have one unique longitude x latitude value. ```{r 'format-sites-locs'} # Format site x locations data ---- site_locations <- fb_format_site_locations(data = all_data, site = "site", longitude = "longitude", latitude = "latitude", na_rm = FALSE) # Preview ---- head(site_locations) ``` ### Extracting species x categories data The function `fb_format_species_categories()` extracts species values for one supra-category (optional) from this long table to create the species x categories dataset. This category (e.g. order, family, endemism status, conservation status, etc.) can be later by several functions in `funbiogeo` to aggregate metrics at this level. ```{r 'format-species-categories'} # Extract species x categories data ---- species_categories <- fb_format_species_categories(data = all_data, species = "species", category = "order" ) # Preview ---- head(species_categories, 10) ``` Once your data are in the good format, you can [get started](https://frbcesab.github.io/funbiogeo/articles/funbiogeo.html) with `funbiogeo`.