eurostat_tutorial.Rmd
This rOpenGov R package provides tools to access the Eurostat database, which you can also browse online for data sets and documentation. For contact information and source code, see the package website.
Release version (CRAN):
install.packages("eurostat")
Development version (GitHub):
library(devtools)
install_github("ropengov/eurostat")
Overall, the eurostat package includes the following functions:
add_nuts_level            Add the statistical aggregation level to data frame
check_access_to_data      Check access to ec.europa.eu
clean_eurostat_cache      Clean Eurostat Cache
cut_to_classes            Cuts the Values Column into Classes and Polishes the Labels
dic_order                 Order of Variable Levels from Eurostat Dictionary.
eu_countries              Countries and Country Codes
eurostat_geodata_60_2016  Geospatial data of Europe from Gisco in 1:60 million scale from year 2016
eurostat-package          R Tools for Eurostat open data
eurotime2date             Date Conversion from Eurostat Time Format
eurotime2num              Conversion of Eurostat Time Format to Numeric
get_bibentry              Create A Data Bibliography
get_eurostat              Read Eurostat Data
get_eurostat_dic          Download Eurostat Dictionary
get_eurostat_geospatial   Download Geospatial Data from GISCO
get_eurostat_json         Get Data from Eurostat API in JSON
get_eurostat_raw          Download Data from Eurostat Database
get_eurostat_toc          Download Table of Contents of Eurostat Data Sets
harmonize_country_code    Harmonize Country Code
harmonize_geo_code        Harmonize NUTS region codes that changed with the 'NUTS2016' definition
label_eurostat            Get Eurostat Codes
nuts_correspondence       Correspondence Table NUTS2013-NUTS2016
recode_to_nuts_2013       Recode geo labels and rename regions from NUTS2016 to NUTS2013
recode_to_nuts_2016       Recode geo labels and rename regions from NUTS2013 to NUTS2016
regional_changes_2016     Changes in regional boundaries NUTS2013-NUTS2016
search_eurostat           Grep Datasets Titles from Eurostat
tgs00026                  Auxiliary Data
evaluate <- curl::has_internet()
The function get_eurostat_toc() downloads a table of contents of Eurostat datasets. The values in the ‘code’ column should be used to download a selected dataset.
# Load the package
library(eurostat)
library(rvest)
# Get Eurostat data listing
toc <- get_eurostat_toc()
# Check the first items
library(knitr)
kable(head(toc))
title | code | type | last update of data | last table structure change | data start | data end | values |
---|---|---|---|---|---|---|---|
Database by themes | data | folder | NA | NA | NA | NA | NA |
General and regional statistics | general | folder | NA | NA | NA | NA | NA |
European and national indicators for short-term analysis | euroind | folder | NA | NA | NA | NA | NA |
Business and consumer surveys (source: DG ECFIN) | ei_bcs | folder | NA | NA | NA | NA | NA |
Consumer surveys (source: DG ECFIN) | ei_bcs_cs | folder | NA | NA | NA | NA | NA |
Consumers - monthly data | ei_bsco_m | dataset | 28.01.2021 | 28.01.2021 | 1980M01 | 2021M01 | NA |
Some of the data sets (e.g. in the ‘comext’ type) are not accessible through the standard interface. See the get_eurostat function documentation for more details.
With search_eurostat() you can search the table of contents for particular patterns, e.g. all datasets related to passenger transport. The kable function produces nice markdown output. Note that with the type argument of this function you can restrict the search to, for instance, datasets or tables.
# info about passengers
kable(head(search_eurostat("passenger transport")))
title | code | type | last update of data | last table structure change | data start | data end | values |
---|---|---|---|---|---|---|---|
Volume of passenger transport relative to GDP | tran_hv_pstra | dataset | 01.09.2020 | 31.08.2020 | 1990 | 2018 | NA |
Modal split of passenger transport | tran_hv_psmod | dataset | 01.09.2020 | 31.08.2020 | 1990 | 2018 | NA |
Air passenger transport by reporting country | avia_paoc | dataset | 26.01.2021 | 26.01.2021 | 1993 | 2020Q4 | NA |
Air passenger transport by main airports in each reporting country | avia_paoa | dataset | 26.01.2021 | 26.01.2021 | 1993 | 2020Q4 | NA |
Air passenger transport between reporting countries | avia_paocc | dataset | 26.01.2021 | 26.01.2021 | 1993 | 2020Q4 | NA |
Air passenger transport between main airports in each reporting country and partner reporting countries | avia_paoac | dataset | 26.01.2021 | 26.01.2021 | 1993 | 2020Q4 | NA |
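To make the pattern matching concrete, here is a minimal offline sketch that mimics a title search on a hand-typed excerpt of the table of contents. The toc_demo data frame and the use of grepl() are illustrative assumptions; the internals of search_eurostat() may differ.

```r
# Toy excerpt of the table of contents (titles and codes copied from the
# tables above); toc_demo is a hypothetical stand-in for get_eurostat_toc().
toc_demo <- data.frame(
  title = c("Consumer surveys (source: DG ECFIN)",
            "Consumers - monthly data",
            "Modal split of passenger transport",
            "Air passenger transport by reporting country"),
  code = c("ei_bcs_cs", "ei_bsco_m", "tran_hv_psmod", "avia_paoc"),
  type = c("folder", "dataset", "dataset", "dataset"),
  stringsAsFactors = FALSE
)

# Keep rows whose title contains the pattern and whose type is "dataset".
hits <- subset(toc_demo,
               grepl("passenger transport", title, fixed = TRUE) &
                 type == "dataset")
hits$code
```

The same two conditions, a pattern on the title and a restriction on the type column, are what the pattern and type arguments express in the real search.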
Dataset codes can also be looked up in the Eurostat database, which shows the code in parentheses after each dataset in the Data Navigation Tree.
The package supports two of Eurostat's download methods: the bulk download facility and the Web Services’ JSON API. The bulk download facility is the fastest way to download whole datasets. It is also often the only way, as the JSON API is limited to a maximum of 50 sub-indicators at a time and whole datasets usually exceed that. To download only a small section of a dataset, the JSON API is faster, as it allows you to select the data before downloading.
A user does not usually have to choose between the methods, as both are used via the main function get_eurostat(). If only the table id is given, the whole table is downloaded from the bulk download facility. If filters are also defined, the JSON API is used.
Here is an example using the indicator ‘Modal split of passenger transport’. This is the percentage share of each mode of transport in total inland transport, expressed in passenger-kilometres (pkm) based on transport by passenger cars, buses and coaches, and trains. All data should be based on movements on national territory, regardless of the nationality of the vehicle. However, the data collection is not harmonized at the EU level.
Pick and print the id of the data set to download:
# For the original data, see
# http://ec.europa.eu/eurostat/tgm/table.do?tab=table&init=1&plugin=1&language=en&pcode=tsdtr210
id <- search_eurostat("Modal split of passenger transport",
type = "table")$code[1]
print(id)
## [1] "t2020_rk310"
Get the whole corresponding table. As the table contains annual data, it is more convenient to use a numeric time variable than the default date format:
dat <- get_eurostat(id, time_format = "num")
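The difference between the two time representations can be illustrated offline. This is a minimal sketch with hand-typed vectors, assuming that the date format maps an annual period to the first day of that year; time_num and time_date are hypothetical names, not package objects.

```r
# Annual periods as plain numbers, as returned with time_format = "num".
time_num <- c(1990, 2000, 2018)

# The same periods as Date objects (assumption: first day of the year).
time_date <- as.Date(paste0(time_num, "-01-01"))

# Numeric years can be compared with ranges directly ...
time_num[time_num %in% 2000:2018]

# ... whereas Dates have to be converted back to years first.
as.numeric(format(time_date, "%Y"))
```

For annual tables the numeric form keeps later subsetting, such as time %in% 2000:2012, short.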
Investigate the structure of the downloaded data set:
str(dat)
## tibble [2,798 × 5] (S3: tbl_df/tbl/data.frame)
## $ unit : chr [1:2798] "PC" "PC" "PC" "PC" ...
## $ vehicle: chr [1:2798] "BUS_TOT" "BUS_TOT" "BUS_TOT" "BUS_TOT" ...
## $ geo : chr [1:2798] "AT" "BE" "CH" "DE" ...
## $ time : num [1:2798] 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ values : num [1:2798] 8.2 10.6 3.7 9.1 11.3 32.4 14.9 13.5 6 24.8 ...
kable(head(dat))
unit | vehicle | geo | time | values |
---|---|---|---|---|
PC | BUS_TOT | AT | 1990 | 8.2 |
PC | BUS_TOT | BE | 1990 | 10.6 |
PC | BUS_TOT | CH | 1990 | 3.7 |
PC | BUS_TOT | DE | 1990 | 9.1 |
PC | BUS_TOT | DK | 1990 | 11.3 |
PC | BUS_TOT | EL | 1990 | 32.4 |
Alternatively, you can get only a part of the dataset by defining the filters argument. It should be a named list, where the names correspond to variable names (lower case) and the values are vectors of codes corresponding to the desired series (upper case). For the time variable, in addition to time, sinceTimePeriod and lastTimePeriod can also be used.
dat2 <- get_eurostat(id, filters = list(geo = c("EU28", "FI"), lastTimePeriod=1), time_format = "num")
kable(dat2)
unit | vehicle | geo | time | values |
---|---|---|---|---|
PC | BUS_TOT | EU28 | 2018 | 8.7 |
PC | BUS_TOT | FI | 2018 | 10.1 |
PC | CAR | EU28 | 2018 | 83.3 |
PC | CAR | FI | 2018 | 84.2 |
PC | TRN | EU28 | 2018 | 8.0 |
PC | TRN | FI | 2018 | 5.7 |
By default, variables are returned as Eurostat codes. To get human-readable labels instead, use the type = "label" argument.
datl2 <- get_eurostat(id, filters = list(geo = c("EU28", "FI"),
lastTimePeriod = 1),
type = "label", time_format = "num")
kable(head(datl2))
unit | vehicle | geo | time | values |
---|---|---|---|---|
Percentage | Motor coaches, buses and trolley buses | European Union - 28 countries (2013-2020) | 2018 | 8.7 |
Percentage | Motor coaches, buses and trolley buses | Finland | 2018 | 10.1 |
Percentage | Passenger cars | European Union - 28 countries (2013-2020) | 2018 | 83.3 |
Percentage | Passenger cars | Finland | 2018 | 84.2 |
Percentage | Trains | European Union - 28 countries (2013-2020) | 2018 | 8.0 |
Percentage | Trains | Finland | 2018 | 5.7 |
Eurostat codes in the downloaded data set can be replaced with human-readable labels from the Eurostat dictionaries with the label_eurostat() function.
datl <- label_eurostat(dat)
kable(head(datl))
unit | vehicle | geo | time | values |
---|---|---|---|---|
Percentage | Motor coaches, buses and trolley buses | Austria | 1990 | 8.2 |
Percentage | Motor coaches, buses and trolley buses | Belgium | 1990 | 10.6 |
Percentage | Motor coaches, buses and trolley buses | Switzerland | 1990 | 3.7 |
Percentage | Motor coaches, buses and trolley buses | Germany (until 1990 former territory of the FRG) | 1990 | 9.1 |
Percentage | Motor coaches, buses and trolley buses | Denmark | 1990 | 11.3 |
Percentage | Motor coaches, buses and trolley buses | Greece | 1990 | 32.4 |
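Conceptually, labelling is a lookup from code to label. The following minimal sketch shows that idea with a hand-typed dictionary; vehicle_dic is a hypothetical stand-in for a downloaded Eurostat dictionary, and this is not a description of how label_eurostat() is implemented.

```r
# Hypothetical dictionary: Eurostat codes as names, labels as values
# (pairs taken from the tables above).
vehicle_dic <- c(BUS_TOT = "Motor coaches, buses and trolley buses",
                 CAR = "Passenger cars",
                 TRN = "Trains")

codes <- c("BUS_TOT", "CAR", "TRN", "CAR")

# Look up each code by name; unname() drops the code names from the result.
labels <- unname(vehicle_dic[codes])
labels
```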
label_eurostat() can also convert individual variable vectors, and label_eurostat_vars() converts variable names.
label_eurostat_vars(names(datl))
## [1] "Unit of measure" "Vehicles"
## [3] "Geopolitical entity (reporting)" "Period of time"
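The vehicle variable has three distinct values. Note that the column is stored as a character vector rather than a factor, so unique(datl$vehicle) is needed to list them, while levels() returns NULL: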
levels(datl$vehicle)
## NULL
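The NULL result arises because levels() is only defined for factors. A minimal offline sketch of the difference, using a hand-typed vector (vehicle_demo is a hypothetical stand-in for datl$vehicle) with the three labels from the tables above:

```r
vehicle_demo <- c("Trains", "Passenger cars",
                  "Motor coaches, buses and trolley buses", "Trains")

levels(vehicle_demo)   # NULL: a character vector has no levels
unique(vehicle_demo)   # distinct values in order of appearance

# Converting to a factor makes levels() and nlevels() meaningful.
vehicle_fct <- factor(vehicle_demo)
levels(vehicle_fct)    # distinct values, sorted
nlevels(vehicle_fct)
```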
To facilitate smooth visualization of standard European geographic areas, the package provides ready-made lists of the country codes used in the Eurostat database for EFTA (efta_countries), the euro area (ea_countries), the EU (eu_countries) and EU candidate countries (eu_candidate_countries). These can be used to select specific groups of countries for closer investigation. For conversions to and from other standard country coding systems, see the countrycode R package. To retrieve the country code list for EFTA, for instance, use:
data(efta_countries)
kable(efta_countries)
code | name | label |
---|---|---|
IS | Iceland | Iceland |
LI | Liechtenstein | Liechtenstein |
NO | Norway | Norway |
CH | Switzerland | Switzerland |
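Such a code list can then be used to subset downloaded data. A minimal offline sketch, with the EFTA codes and a few data rows hand-copied from the tables above (efta_codes and dat_demo are hypothetical names used only for this illustration):

```r
# EFTA country codes, hand-copied from the table above.
efta_codes <- c("IS", "LI", "NO", "CH")

# First rows of the modal split data shown earlier (bus share, 1990).
dat_demo <- data.frame(
  geo = c("AT", "BE", "CH", "DE", "DK", "EL"),
  time = 1990,
  values = c(8.2, 10.6, 3.7, 9.1, 11.3, 32.4),
  stringsAsFactors = FALSE
)

# Keep only the rows whose geo code belongs to the country group.
dat_efta <- subset(dat_demo, geo %in% efta_codes)
dat_efta
```

With the package loaded, the code column of efta_countries could presumably replace the hand-typed vector.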
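# The label must match the dictionary text exactly, including the period suffix
dat_eu12 <- subset(datl, geo == "European Union - 28 countries (2013-2020)" & time == 2012)
kable(dat_eu12, row.names = FALSE)
unit | vehicle | geo | time | values |
---|---|---|---|---|
Percentage | Motor coaches, buses and trolley buses | European Union - 28 countries (2013-2020) | 2012 | 9.4 |
Percentage | Passenger cars | European Union - 28 countries (2013-2020) | 2012 | 82.9 |
Percentage | Trains | European Union - 28 countries (2013-2020) | 2012 | 7.7 |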
Reshaping the data is best done with spread() in tidyr.
library("tidyr")
dat_eu_0012 <- subset(dat, geo == "EU28" & time %in% 2000:2012)
dat_eu_0012_wide <- spread(dat_eu_0012, vehicle, values)
kable(subset(dat_eu_0012_wide, select = -geo), row.names = FALSE)
unit | time | BUS_TOT | CAR | TRN |
---|---|---|---|---|
PC | 2000 | 10.4 | 82.5 | 7.1 |
PC | 2001 | 10.2 | 82.8 | 7.0 |
PC | 2002 | 9.8 | 83.4 | 6.8 |
PC | 2003 | 9.8 | 83.6 | 6.6 |
PC | 2004 | 9.7 | 83.5 | 6.7 |
PC | 2005 | 9.8 | 83.4 | 6.9 |
PC | 2006 | 9.6 | 83.4 | 7.0 |
PC | 2007 | 9.8 | 83.1 | 7.1 |
PC | 2008 | 9.8 | 82.9 | 7.4 |
PC | 2009 | 9.2 | 83.7 | 7.1 |
PC | 2010 | 9.3 | 83.5 | 7.2 |
PC | 2011 | 9.4 | 83.2 | 7.4 |
PC | 2012 | 9.4 | 82.9 | 7.7 |
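In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A minimal offline sketch of the same reshaping, using a hand-typed excerpt of the EU28 values above (dat_demo_long is a hypothetical stand-in for dat_eu_0012):

```r
library(tidyr)

# Long format: one row per vehicle and year (values copied from the table above).
dat_demo_long <- data.frame(
  unit = "PC",
  vehicle = rep(c("BUS_TOT", "CAR", "TRN"), times = 2),
  time = rep(c(2011, 2012), each = 3),
  values = c(9.4, 83.2, 7.4, 9.4, 82.9, 7.7),
  stringsAsFactors = FALSE
)

# Wide format: one column per vehicle, one row per year.
dat_demo_wide <- pivot_wider(dat_demo_long,
                             names_from = vehicle,
                             values_from = values)
dat_demo_wide
```

Both functions give the same layout here; pivot_wider() just names its arguments explicitly.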
dat_trains <- subset(datl, geo %in% c("Austria", "Belgium", "Finland", "Sweden")
& time %in% 2000:2012
& vehicle == "Trains")
dat_trains_wide <- spread(dat_trains, geo, values)
kable(subset(dat_trains_wide, select = -vehicle), row.names = FALSE)
unit | time | Austria | Belgium | Finland | Sweden |
---|---|---|---|---|---|
Percentage | 2000 | 9.8 | 6.3 | 5.1 | 6.9 |
Percentage | 2001 | 9.8 | 6.4 | 4.8 | 7.3 |
Percentage | 2002 | 9.7 | 6.5 | 4.8 | 7.2 |
Percentage | 2003 | 9.6 | 6.5 | 4.7 | 7.1 |
Percentage | 2004 | 9.5 | 7.1 | 4.7 | 6.9 |
Percentage | 2005 | 9.8 | 6.6 | 4.8 | 7.1 |
Percentage | 2006 | 10.0 | 6.9 | 4.8 | 7.7 |
Percentage | 2007 | 10.1 | 7.1 | 5.0 | 8.0 |
Percentage | 2008 | 11.1 | 7.5 | 5.4 | 8.7 |
Percentage | 2009 | 11.2 | 7.5 | 5.1 | 8.8 |
Percentage | 2010 | 11.1 | 7.7 | 5.2 | 8.7 |
Percentage | 2011 | 11.5 | 7.7 | 5.0 | 8.7 |
Percentage | 2012 | 11.9 | 7.8 | 5.3 | 9.1 |
Eurostat data is also available through the Statistical Data and Metadata eXchange (SDMX) Web Services. The eurostat R package does not provide custom tools for this, but generic R packages for SDMX provide access to the Eurostat SDMX version.
For further examples, see the package homepage.
This tutorial was created with
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.10
##
## Matrix products: default
## BLAS: /home/lemila/bin/R-4.0.3/lib/libRblas.so
## LAPACK: /home/lemila/bin/R-4.0.3/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_1.1.2 rvest_0.3.6 xml2_1.3.2 eurostat_3.7.1 knitr_1.31
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.20 purrr_0.3.4 sf_0.9-7
## [5] lattice_0.20-41 vctrs_0.3.6 generics_0.1.0 htmltools_0.5.1.1
## [9] yaml_2.2.1 rlang_0.4.10 e1071_1.7-4 pkgdown_1.6.1
## [13] pillar_1.4.7 glue_1.4.2 DBI_1.1.1 sp_1.4-5
## [17] RColorBrewer_1.1-2 lifecycle_0.2.0 plyr_1.8.6 stringr_1.4.0
## [21] ragg_0.4.1 memoise_2.0.0 evaluate_0.14 fastmap_1.1.0
## [25] curl_4.3 class_7.3-18 fansi_0.4.2 highr_0.8
## [29] broom_0.7.4 Rcpp_1.0.6 KernSmooth_2.23-18 readr_1.4.0
## [33] backports_1.2.1 classInt_0.4-3 cachem_1.0.1 desc_1.2.0
## [37] jsonlite_1.7.2 countrycode_1.2.0 systemfonts_0.3.2 fs_1.5.0
## [41] textshaping_0.2.1 hms_1.0.0 digest_0.6.27 stringi_1.5.3
## [45] dplyr_1.0.3 rprojroot_2.0.2 grid_4.0.3 cli_2.2.0
## [49] tools_4.0.3 magrittr_2.0.1 tibble_3.0.6 RefManageR_1.3.0
## [53] crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1 lubridate_1.7.9.2
## [57] assertthat_0.2.1 rmarkdown_2.6.4 httr_1.4.2 R6_2.5.0
## [61] units_0.6-7 compiler_4.0.3