By

Searching, downloading and manipulating Eurostat data with R

Introduction

We have published a new package eurostat in CRAN. To install the package in R, use:

install.packages("eurostat")

The eurostat package is based on the SmarterPoland package, which was revised and expanded with new functionality. The new eurostat package includes the following functions:

library(eurostat)
kable(as.data.frame(ls("package:eurostat")))
ls(“package:eurostat”)
candidate_countries
clean_eurostat_cache
dic_order
ea_countries
efta_countries
eu_countries
eurotime2date
eurotime2num
get_eurostat
get_eurostat_dic
get_eurostat_json
get_eurostat_toc
getEurostatTOC
grepEurostatTOC
label_eurostat
label_eurostat_tables
label_eurostat_vars
search_eurostat

This blog post will walk you through the basic functionalities of the package. To reproduce the examples, run the following code to install the required dependencies.

PACKAGES <- c("eurostat","ggplot2","countrycode","tidyr","dplyr","knitr")
#  Install packages
inst <- match(PACKAGES, .packages(all=TRUE))
need <- which(is.na(inst))
if (length(need) > 0) install.packages(PACKAGES[need])
# Load packages
lapply(PACKAGES, require, character.only=T)

In this exercise we use indicator tgs00026, (Disposable income of private households by NUTS 2 regions) from Eurostat. Most indicators in Eurostat database are of country-year -type, however, some indicators have data also at lower level of regional breakdown, as tgs00026 at NUTS2-level. (See more information on regional classification here).

Searching and downloading data from Eurostat

Though we have already picked the data for following demonstrations you can search the available variables using search_eurostat-function. Data for tables resides in Datasets that are in folders. You can specify in the function whether you want to look for table, dataset or a folder.

results <- search_eurostat("disposable income", type = "dataset")
# print the first 6 rows
kable(head(results))
  title code type last.update.of.data last.table.structure.change data.start data.end values
1995 Share of housing cost in disposable income by level of activity limitation, sex and age hlth_dhc050 dataset 17.07.2015 17.07.2015 2005 2013 NA
3804 Gini coefficient of equivalised disposable income (source: SILC) ilc_di12 dataset 21.04.2016 23.02.2016 1995 2015 NA
3805 Gini coefficient of equivalised disposable income before social transfers (pensions included in social transfers) ilc_di12b dataset 21.04.2016 23.02.2016 2003 2015 NA
3806 Gini coefficient of equivalised disposable income before social transfers (pensions excluded from social transfers) ilc_di12c dataset 21.04.2016 23.02.2016 2003 2015 NA

Or you can also download the whole table of contents of the database with get_eurostat_toc-function. In both cases the values in column code should be used to download a selected dataset.

toc <- get_eurostat_toc()
# print the first 6 rows
kable(head(toc))
title code type last.update.of.data last.table.structure.change data.start data.end values
Database by themes data folder         NA
General and regional statistics general folder         NA
European and national indicators for short-term analysis euroind folder         NA
Business and consumer surveys (source: DG ECFIN) ei_bcs folder         NA
Consumer surveys (source: DG ECFIN) ei_bcs_cs folder         NA
Consumers - monthly data ei_bsco_m dataset 30.03.2016 30.03.2016 1985M01 2016M03 NA

Downloading and plotting time-series data at the NUTS2 regional level

However, we are interested in the disposable household income and first we download the data and convert the time column into numeric format.

library(eurostat)
# Clean the cache first
clean_eurostat_cache()
# Download the income variable
dat <- get_eurostat("tgs00026", time_format = "raw")
# Extract the information on country in question. (First two character in region string mark the country)
dat$cntry <- substr(dat$geo, 1,2)
# Create the variable for proper countrynames using countryname-package
library(countrycode)
dat$countryname <- countrycode(dat$cntry, "iso2c", "country.name")
# convert time column from Date to numeric
dat$time <- eurotime2num(dat$time)
kable(head(dat))
unit na_item geo time values cntry countryname
PPCS_HAB B6N AT11 2003 15500 AT Austria
PPCS_HAB B6N AT12 2003 16600 AT Austria
PPCS_HAB B6N AT13 2003 17500 AT Austria
PPCS_HAB B6N AT21 2003 15500 AT Austria
PPCS_HAB B6N AT22 2003 15600 AT Austria
PPCS_HAB B6N AT31 2003 15900 AT Austria

Then we plot the data at the regional level and color the lines using country names derived from countrycode-package.

library(ggplot2)
p <- ggplot(dat, aes(x=time,y=values,group=geo,color=countryname))
p <- p + geom_point() + geom_line()
# Add the region names at the end of each line
p <- p + geom_text(data=merge(dat, aggregate(time ~ geo, dat, max), 
                              by=c("time","geo")), 
                   aes(x=time, y = values, label=geo), hjust=-0.5,vjust=-1,size=3)
p <- p + labs(x="Year", y="Disposable income of private households (€ per year)")
# Place the legend with country names on top in three rows
p <- p + theme(legend.direction = "horizontal", legend.position = "top")
p <- p + guides(color = guide_legend(title = "Country",
                                     title.position = "top", 
                                     title.hjust=.5,
                                     nrow=3,
                                     byrow=TRUE))
p

center

Labelling the data

Function label_eurostat provides a straighforward way to convert the codes in data into more meaningful labels. The label_eurostat function requires that the data is in the “default format” with no added columns. First, check unlabeled example data:

# First, remove the cntry and countrycode columns that were added manually
cntry_cols <- dat[ , names(dat) %in% c("cntry","countryname")]
dat <- dat[ , !names(dat) %in% c("cntry","countryname")]
# Unlabeled data
kable(head(dat))
unit na_item geo time values
PPCS_HAB B6N AT11 2003 15500
PPCS_HAB B6N AT12 2003 16600
PPCS_HAB B6N AT13 2003 17500
PPCS_HAB B6N AT21 2003 15500
PPCS_HAB B6N AT22 2003 15600
PPCS_HAB B6N AT31 2003 15900

Labeling the data based on definitions from dictionary:

# apply the definitions from dictionary
datl <- label_eurostat(dat)
# Labeled data
kable(head(datl))
unit na_item geo time values
Purchasing power standard based on final consumption per inhabitant Disposable income, net Burgenland (AT) 2003 15500
Purchasing power standard based on final consumption per inhabitant Disposable income, net Niederösterreich 2003 16600
Purchasing power standard based on final consumption per inhabitant Disposable income, net Wien 2003 17500
Purchasing power standard based on final consumption per inhabitant Disposable income, net Kärnten 2003 15500
Purchasing power standard based on final consumption per inhabitant Disposable income, net Steiermark 2003 15600
Purchasing power standard based on final consumption per inhabitant Disposable income, net Oberösterreich 2003 15900
# Bind the country name columns back to the labelled data 
datl2 <- cbind(datl,cntry_cols)

For clarity, we plot first four countries in alphabetical order in their own facets with region names.

# subset the first 4 countries in albhabetical order
datl3 <- datl2[datl2$countryname %in% sort(unique(datl2$countryname))[1:4],]
# plot the data using country facets
library(ggplot2)
p <- ggplot(datl3,aes(x=time,y=values,group=geo))
p <- p + geom_point(color="grey") + geom_line(color="grey")
p <- p + geom_text(data=merge(datl3, aggregate(time ~ geo, datl3, max), 
                              by=c("time","geo")), 
                   aes(x=time, y = values, label=geo), vjust=1,size=3)
p <- p + coord_cartesian(xlim=c(min(datl2$time):max(datl2$time)+3))
p <- p + facet_wrap(~countryname, scales = "free", ncol = 1)
p <- p + labs(title=paste0(unique(datl3$indic_na),"\n(",unique(datl3$unit),")"),
                x="Year",y="Disposable income of private households (€ per year)")
p

center

Mapping the household incomes at NUTS2 level

In the following exercise we are plotting household income data from Eurostat on map from three different years. In addition to downloading and manipulating data from EUROSTAT, we will demonstrate how to access and use shapefiles of Europe published by EUROSTAT at Administrative units / Statistical units.

For this exercise you need few more dependencies that can be installed running the following script.

PACKAGES <- c("rgdal","maptools","rgeos","stringr","scales","grid")
#  Install packages
inst <- match(PACKAGES, .packages(all=TRUE))
need <- which(is.na(inst))
if (length(need) > 0) install.packages(PACKAGES[need])
# Load packages
lapply(PACKAGES, require, character.only=T)

Downloading and manipulating the tabular data

First, we shall retrieve the nuts2-level figures of variable tgs00026 (Disposable income of private households by NUTS 2 regions) and manipulate the extract the information for creating unit and geo variables.

library(eurostat)
df <- get_eurostat("tgs00026", time_format = "raw")
# convert time column from Date to numeric
df$time <- eurotime2num(df$time)
# subset time to have data for 2000, 2005 and 2011
df <- df[df$time %in% c(2000,2005,2011),]
# spread the data into wide format
library(tidyr)
dw <- spread(df, time, values)

Downloading and manipulating the spatial data

Second, we will download the zipped shapefile in 1:60 million scale from year 2010 and subset it at the level of NUTS2.

# Load the GISCO shapefile
download.file("http://ec.europa.eu/eurostat/cache/GISCO/geodatafiles/NUTS_2010_60M_SH.zip", destfile="NUTS_2010_60M_SH.zip")
# unzip
unzip("NUTS_2010_60M_SH.zip")
library(rgdal)
# read into SpatialPolygonsDataFrame
map <- readOGR(dsn = "./NUTS_2010_60M_SH/NUTS_2010_60M_SH/data", layer = "NUTS_RG_60M_2010")
## Error in ogrInfo(dsn = dsn, layer = layer, encoding = encoding, use_iconv = use_iconv, : Cannot open file
# subset the spatialpolygondataframe at NUTS2-level
map_nuts2 <- subset(map, STAT_LEVL_ == 2)
## Error in subset(map, STAT_LEVL_ == 2): object 'map' not found

Joining tabular data with spatial data

Third, we will make the both datas of same lenght, give then identical rownames and then merge the tabular data with the spatial data.

# dim show how many regions are in the spatialpolygondataframe
dim(map_nuts2)
## Error in eval(expr, envir, enclos): object 'map_nuts2' not found
# dim show how many regions are in the data.frame
dim(dw)
## [1] 298   5
# Spatial dataframe has 467 rows and attribute data 275.
# We need to make attribute data to have similar number of rows
NUTS_ID <- as.character(map_nuts2$NUTS_ID)
## Error in eval(expr, envir, enclos): object 'map_nuts2' not found
VarX <- rep(NA, 316)
dat <- data.frame(NUTS_ID,VarX)
## Error in data.frame(NUTS_ID, VarX): object 'NUTS_ID' not found
# then we shall merge this with Eurostat data.frame
dat2 <- merge(dat,dw,by.x="NUTS_ID",by.y="geo", all.x=TRUE)
## Error in fix.by(by.x, x): 'by' must specify a uniquely valid column
## merge this manipulated attribute data with the spatialpolygondataframe
## rownames
row.names(dat2) <- dat2$NUTS_ID
## Error in eval(expr, envir, enclos): object 'dat2' not found
row.names(map_nuts2) <- as.character(map_nuts2$NUTS_ID)
## Error in eval(expr, envir, enclos): object 'map_nuts2' not found
## order data
dat2 <- dat2[order(row.names(dat2)), ]
## Error in eval(expr, envir, enclos): object 'dat2' not found
map_nuts2 <- map_nuts2[order(row.names(map_nuts2)), ]
## Error in eval(expr, envir, enclos): object 'map_nuts2' not found
## join
library(maptools)
dat2$NUTS_ID <- NULL
## Error in dat2$NUTS_ID <- NULL: object 'dat2' not found
shape <- spCbind(map_nuts2, dat2)
## Error in spCbind(map_nuts2, dat2): error in evaluating the argument 'obj' in selecting a method for function 'spCbind': Error: object 'map_nuts2' not found

Preparing the data for ggplot2 visualization

As we are using ggplot2-package for plotting, we have to fortify the SpatialPolygonDataFrame into regular data.frame-class. As we have income data from several years, we have to also gather the data into long format for plotting.

## fortify spatialpolygondataframe into data.frame
library(ggplot2)
library(rgeos)
shape$id <- rownames(shape@data)
## Error in rownames(shape@data): object 'shape' not found
map.points <- fortify(shape, region = "id")
## Error in fortify(shape, region = "id"): object 'shape' not found
map.df <- merge(map.points, shape, by = "id")
## Error in merge(map.points, shape, by = "id"): error in evaluating the argument 'x' in selecting a method for function 'merge': Error: object 'map.points' not found
# As we want to plot map faceted by years from 2003 to 2011
# we have to melt it into long format

# (variable with numerical names got X-prefix during the spCbind-merge,
# therefore the X-prefix in variable names)
library(tidyr)
# lets convert unit variable (that is a list) into character
map.df$unit <- as.character(map.df$unit)
## Error in eval(expr, envir, enclos): object 'map.df' not found
map.df.l <- gather(map.df, "year", "value", 15:17)
## Error in is.data.frame(x): object 'map.df' not found
# year variable (variable) is class string and type X20xx.
# Lets remove the X and convert it to numerical
library(stringr)
map.df.l$year <- str_replace_all(map.df.l$year, "X","")
## Error in stri_replace_all_regex(string, pattern, replacement, vectorize_all = vec, : object 'map.df.l' not found
map.df.l$year <- factor(map.df.l$year)
## Error in factor(map.df.l$year): object 'map.df.l' not found
map.df.l$year <- as.numeric(levels(map.df.l$year))[map.df.l$year]
## Error in levels(map.df.l$year): object 'map.df.l' not found

Plotting the maps using ggplot2

Breaks in the map are set at equal intervals between yearly regional maximun and minimun values.

library(ggplot2)
library(scales)
library(grid)

# Creating a custom function for creating the breaks and makeing them look neat
categories <- function(x, cat=5) {
  
  library(stringr)
  levs <- as.data.frame(as.character(levels(cut_interval(x, cat))))
  names(levs) <- "orig"
  levs$mod <- str_replace_all(levs$orig, "\\[", "")
  levs$mod <- str_replace_all(levs$mod, "\\]", "")
  levs$mod <- str_replace_all(levs$mod, "\\(", "")
  levs$lower <- gsub(",.*$","", levs$mod)
  levs$upper <- gsub(".*,","", levs$mod)
  
  levs$lower <- factor(levs$lower)
  levs$lower <- round(as.numeric(levels(levs$lower))[levs$lower],0)
  
  levs$upper <- factor(levs$upper)
  levs$upper <- round(as.numeric(levels(levs$upper))[levs$upper],0)
  
  levs$labs <- paste(levs$lower,levs$upper, sep=" - ")
  
  labs <- as.character(c(levs$labs))
  y <- cut_interval(x, cat, right = FALSE, labels = labs)
  y <- as.character(y)
  y[is.na(y)] <- "No Data"
  y <- factor(y, levels=c("No Data",labs[1:cat]))
}

# years for for loop
years <- unique(map.df.l$year)
## Error in unique(map.df.l$year): object 'map.df.l' not found
# Loop over the three years
for (year in years) {
  
  # subset data
  plot_map <- map.df.l[map.df.l$year == year,]
  # set the breaks
  plot_map$value_cat <- categories(plot_map$value)
  
  p <- ggplot(data=plot_map, aes(long,lat,group=group))
  p <- p + geom_polygon(data = map.df.l, aes(long,lat),fill=NA,colour="white",size = 1)
  p <- p + geom_polygon(aes(fill = value_cat),colour="white",size=.2)
  p <- p + scale_fill_manual(values=c("Dim Grey","#d7191c","#fdae61","#ffffbf","#a6d96a","#1a9641")) 
  p <- p + coord_map(project="orthographic", xlim=c(-22,34), ylim=c(35,70))
  p <- p + labs(title = paste0("Disposable household incomes in  ",year))
  p <- p +  theme(legend.position = c(0.03,0.40), 
                          legend.justification=c(0,0),
                          legend.key.size=unit(6,'mm'),
                          legend.direction = "vertical",
                          legend.background=element_rect(colour=NA, fill=alpha("white", 2/3)),
                          legend.text=element_text(size=12), 
                          legend.title=element_text(size=12), 
                          title=element_text(size=16), 
                          panel.background = element_blank(), 
                          plot.background = element_blank(),
                          panel.grid.minor = element_line(colour = 'Grey80', size = .5, linetype = 'solid'),
                          panel.grid.major = element_line(colour = 'Grey80', size = .5, linetype = 'solid'),
                          axis.text = element_blank(), 
                          axis.title = element_blank(), 
                          axis.ticks = element_blank(), 
                          plot.margin = unit(c(-3,-1.5, -3, -1.5), "cm"))
  p <- p + guides(fill = guide_legend(title = "€ per Year",
                                     title.position = "top", 
                                     title.hjust=0))
  print(p)
}
## Error in eval(expr, envir, enclos): object 'years' not found

We are done! That was a small exercise on how to use the main functions in the eurostat-package. We hope you find the package useful! All suggestions, bug reports, and contributions are warmly welcome at: https://github.com/ropengov/eurostat. When using the packages, please cite accordingly:

citation("eurostat")
## 
## Kindly cite the eurostat R package as follows:
## 
##   (C) Leo Lahti, Janne Huovari, Markus Kainu, Przemyslaw Biecek
##   2014-2016. eurostat R package URL:
##   https://github.com/rOpenGov/eurostat
## 
## A BibTeX entry for LaTeX users is
## 
##   @Misc{,
##     title = {eurostat R package},
##     author = {Leo Lahti and Janne Huovari and Markus Kainu and Przemyslaw Biecek},
##     year = {2014},
##     url = {https://github.com/rOpenGov/eurostat},
##   }
sessionInfo() 
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scales_0.4.0         stringr_1.0.0        rgeos_0.3-17        
##  [4] maptools_0.8-39      rgdal_1.1-3          sp_1.2-2            
##  [7] tidyr_0.4.1          ggplot2_2.1.0        countrycode_0.18    
## [10] eurostat_1.2.21.9003 knitr_1.12.3        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.4      magrittr_1.5     munsell_0.4.3    colorspace_1.2-6
##  [5] lattice_0.20-33  R6_2.1.2         highr_0.5.1      plyr_1.8.3      
##  [9] dplyr_0.4.3      tools_3.2.3      parallel_3.2.3   gtable_0.2.0    
## [13] DBI_0.3.1        lazyeval_0.1.10  assertthat_0.1   digest_0.6.9    
## [17] formatR_1.3      evaluate_0.8.3   labeling_0.3     stringi_1.0-1   
## [21] foreign_0.8-66