Case Study: Working with Eurobarometer surveys
Source:vignettes/eurobarometer.Rmd
The goal in this case study is to analyze trust in the national and European parliaments, and in the European Commission, in Europe, with data from the Eurobarometer.
The Eurobarometer is a biannual survey conducted by the European Commission with the goal of monitoring the public opinion of populations of EU member states and – occasionally – also in candidate countries. Each EB wave is devoted to a particular topic, but most waves ask some “trend questions”, i.e. questions that are repeated frequently in the same form. Trust in institutions is among such trend questions.
The Eurobarometer data
Eurobarometer raw data and related documentation (questionnaires, codebooks, etc.) are made available by GESIS, ICPSR, and through the Social Science Data Archive networks. You should cite your source; in our examples, we rely on the GESIS data files. In this case study we use nine waves of the Eurobarometer between 1996 and 2019: 44.2bis (January-March 1996), 51.0 (March-April 1999), 57.1 (March-May 2002), 64.2 (October-November 2005), 69.2 (March-May 2008), 75.3 (May 2011), 81.2 (March 2014), 87.3 (May 2017), and 91.2 (March 2019).
In the Afrobarometer Case Study we have shown how to merge two waves of a survey with a limited number of variables. That workflow is not feasible with the full Eurobarometer files on a PC or laptop, because there are too many large files to handle. Instead, we read in small, subsetted versions of the files that were saved in rds format.
library(retroharmonize)
library(dplyr)
eurobarometer_waves <- file.path("working", dir("working"))
eb_waves <- read_surveys(eurobarometer_waves, .f = 'read_rds')
We can check whether the main descriptive metadata is present with document_surveys().
documented_eb_waves <- document_surveys(eb_waves)
Metadata map
We start by extracting metadata from the survey data files and storing it in a tidy table, where each row contains information about one variable from a survey data file. To keep the size manageable, we keep only a few variables: the row ID, the weighting variable, the country code, and variables that contain “parliament” or “commission” in their names.
eb_trust_metadata <- metadata_create(eb_waves)
#let's keep the example manageable:
eb_trust_metadata <- eb_trust_metadata %>%
  filter(grepl("parliament|commission|rowid|weight_poststrat|country_id", var_name_orig))
head(eb_trust_metadata)
#> filename id var_name_orig
#> 1 ZA2828_trust.rds ZA2828_trust rowid
#> 2 ZA2828_trust.rds ZA2828_trust country_id
#> 3 ZA2828_trust.rds ZA2828_trust weight_poststrat
#> 4 ZA2828_trust.rds ZA2828_trust trust_european_commission
#> 5 ZA2828_trust.rds ZA2828_trust trust_european_parliament
#> 6 ZA2828_trust.rds ZA2828_trust trust_national_parliament
#> class_orig
#> 1 character
#> 2 character
#> 3 numeric
#> 4 retroharmonize_labelled_spss_survey
#> 5 retroharmonize_labelled_spss_survey
#> 6 retroharmonize_labelled_spss_survey
#> label_orig labels valid_labels
#> 1 unique identifier in za2828_trust NA NA
#> 2 nation all samples iso 3166 crosstabulation variable NA NA
#> 3 weight result from target NA NA
#> 4 rely on european commission 1, 2, 3 1, 2
#> 5 rely on european parliament 1, 2, 3 1, 2
#> 6 rely on national parliament 1, 2, 3 1, 2
#> na_labels na_range n_labels n_valid_labels n_na_labels
#> 1 NA NA 0 0 0
#> 2 NA NA 0 0 0
#> 3 NA NA 0 0 0
#> 4 3 NA 3 2 1
#> 5 3 NA 3 2 1
#> 6 3 NA 3 2 1
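The filtering step above is ordinary regular-expression matching on the variable names. A minimal sketch with a made-up metadata table (the rows and the toy_metadata name are illustrative assumptions, not the contents of the real files) shows which rows such a pattern keeps:

```r
# Toy metadata table; the file and variable names are invented for illustration.
toy_metadata <- data.frame(
  filename = "toy_wave.rds",
  var_name_orig = c("rowid", "country_id", "weight_poststrat",
                    "trust_european_commission", "age_exact",
                    "trust_national_parliament"),
  stringsAsFactors = FALSE
)

keep_pattern <- "parliament|commission|rowid|weight_poststrat|country_id"

# grepl() is TRUE for every name that matches at least one alternative
kept <- toy_metadata[grepl(keep_pattern, toy_metadata$var_name_orig), ]
kept$var_name_orig
```

Only age_exact is dropped here, because it matches none of the alternatives in the pattern.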
There are only a few distinct value labels in this example. The only ones that stand out are the Can rely on and Cannot rely on labels, which take the place of the usual trust wording.
collect_val_labels(eb_trust_metadata)
#> [1] "CAN RELY ON IT" "CANNOT RELY ON IT" "Tend to trust"
#> [4] "Tend not to trust" "DK"
The following labels were marked by GESIS as missing values:
collect_na_labels(eb_trust_metadata)
#> [1] "DK" "NA"
#> [3] "Inap. (33 in V6)" "Inap. (CY-TCC in isocntry)"
We have created a helper function, subset_save_surveys(), that programmatically reads in SPSS files, makes the necessary type conversion to labelled_spss_survey() without harmonization, and saves a small, subsetted rds file. Because this is a native R file format, it is far more efficient to handle in the actual workflow.
## You will likely use your own local working directory, or
## tempdir() that will create a temporary directory for your
## session only.
working_directory <- tempdir()
# This code is for illustration only; it is not evaluated.
# To replicate the workflow, you need to have the SPSS file names
# as a list, and you have to set up your own import and export paths.
selected_eb_metadata <- readRDS(
  system.file("eurob", "selected_eb_waves.rds", package = "retroharmonize")
) %>%
  mutate(id = substr(filename, 1, 6)) %>%
  rename(var_label = var_label_std) %>%
  mutate(var_name = var_label)
## This code is not evaluated; it is only an example. You are likely
## to have a directory where you have already downloaded the data
## from GESIS after accepting their terms of use.
subset_save_surveys(
  var_harmonization = selected_eb_metadata,
  selection_name = "trust",
  import_path = gesis_dir,
  export_path = working_directory
)
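The idea behind the subsetting step, keeping only the needed columns and storing them as a compact rds file, can be sketched in base R. This is not the implementation of subset_save_surveys(); the toy data frame and the toy_trust.rds file name are assumptions made for illustration only.

```r
# A stand-in for a large imported survey file (invented data).
full_survey <- data.frame(
  rowid = c("toy_1", "toy_2", "toy_3"),
  country_id = c("NL", "PL", "HU"),
  unrelated_item = c(5, 3, 4),
  trust_national_parliament = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Keep only the variables needed for the trust analysis
subset_vars <- c("rowid", "country_id", "trust_national_parliament")
small_survey <- full_survey[, subset_vars]

# Save the subset as an rds file in a session-only temporary directory
export_file <- file.path(tempdir(), "toy_trust.rds")  # hypothetical file name
saveRDS(small_survey, export_file)

# Reading it back restores the same data frame
reloaded <- readRDS(export_file)
```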
Harmonize the labels
For easier looping we adapt the harmonize_values() function with new default settings. It would be tempting to preserve the rely labels as distinct from the trust labels, but if we use the same numeric coding, it will lead to confusion. If you want to keep the difference between the two types of category labels, then the harmonization should be done in a two-step process.
harmonize_eb_trust <- function(x) {
  label_list <- list(
    from = c("^tend\\snot", "^cannot", "^tend\\sto", "^can\\srely",
             "^dk", "^inap", "na"),
    to = c("not_trust", "not_trust", "trust", "trust",
           "do_not_know", "inap", "inap"),
    numeric_values = c(0, 0, 1, 1, 99997, 99999, 99999)
  )
  harmonize_values(x,
                   harmonize_labels = label_list,
                   na_values = c("do_not_know" = 99997,
                                 "declined" = 99998,
                                 "inap" = 99999)
  )
}
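To see what the from/to/numeric_values triplet expresses, here is a base R sketch that applies the same patterns to the labels collected earlier. It assumes case-insensitive matching and that the first matching pattern wins; these are assumptions of the sketch, and the matching rules inside harmonize_values() may differ. The match_label() helper is hypothetical.

```r
from <- c("^tend\\snot", "^cannot", "^tend\\sto", "^can\\srely",
          "^dk", "^inap", "na")
to <- c("not_trust", "not_trust", "trust", "trust",
        "do_not_know", "inap", "inap")
numeric_values <- c(0, 0, 1, 1, 99997, 99999, 99999)

# Hypothetical helper: index of the first pattern that matches a label
match_label <- function(label) {
  hits <- which(vapply(from, function(p) grepl(p, tolower(label)),
                       logical(1), USE.NAMES = FALSE))
  if (length(hits) == 0) NA_integer_ else hits[1]
}

original_labels <- c("Tend to trust", "Tend not to trust", "DK",
                     "CAN RELY ON IT", "CANNOT RELY ON IT",
                     "NA", "Inap. (33 in V6)")
idx <- vapply(original_labels, match_label, integer(1), USE.NAMES = FALSE)

label_map <- data.frame(
  original = original_labels,
  harmonized = to[idx],
  code = numeric_values[idx],
  stringsAsFactors = FALSE
)
label_map
```

Both wordings of the question end up in the same two valid categories, which is the reason for giving them identical numeric codes.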
Let’s check that the survey files were read in as expected:
document_surveys(eb_waves)
#> # A tibble: 9 × 5
#> id filename ncol nrow object_size
#> <chr> <chr> <int> <int> <dbl>
#> 1 ZA2828_trust ZA2828_trust.rds 7 65178 8881288
#> 2 ZA3171_trust ZA3171_trust.rds 14 16179 3150696
#> 3 ZA3639_trust ZA3639_trust.rds 14 16012 3116544
#> 4 ZA4414_trust ZA4414_trust.rds 14 29430 5693504
#> 5 ZA4744_trust ZA4744_trust.rds 14 30170 5833760
#> 6 ZA5481_trust ZA5481_trust.rds 10 31769 5109472
#> 7 ZA5913_trust ZA5913_trust.rds 10 27932 4494592
#> 8 ZA6863_trust ZA6863_trust.rds 14 33180 6411712
#> 9 ZA7562_trust ZA7562_trust.rds 8 27524 3980296
To review the harmonization on a single survey, use pull_survey().
test_trust <- pull_survey(eb_waves, filename = "ZA4414_trust.rds")
Before running our adapted harmonization function, we have this:
test_trust$trust_european_commission[1:16]
#> [1] 3 1 2 3 1 2 2 2 1 3 1 1 1 1 1 1
#> attr(,"labels")
#> Tend to trust Tend not to trust DK
#> 1 2 3
#> attr(,"label")
#> [1] "QA27 EUROPEAN COMMISSION - TRUST"
#> attr(,"na_values")
#> [1] 3
#> attr(,"ZA4414_name")
#> [1] "v213"
#> attr(,"ZA4414_values")
#> 1 2 3
#> 1 2 3
#> attr(,"ZA4414_label")
#> [1] "QA27 EUROPEAN COMMISSION - TRUST"
#> attr(,"ZA4414_labels")
#> Tend to trust Tend not to trust DK
#> 1 2 3
#> attr(,"ZA4414_na_values")
#> [1] 3
#> attr(,"id")
#> [1] "ZA4414"
#> attr(,"class")
#> [1] "retroharmonize_labelled_spss_survey" "haven_labelled_spss"
#> [3] "haven_labelled"
After performing harmonization, it would look like this:
harmonize_eb_trust(x=test_trust$trust_european_commission[1:16])
#> [1] 99997 1 0 99997 1 0 0 0 1 99997 1 1
#> [13] 1 1 1 1
#> attr(,"labels")
#> not_trust trust do_not_know declined inap
#> 0 1 99997 99998 99999
#> attr(,"label")
#> [1] "QA27 EUROPEAN COMMISSION - TRUST"
#> attr(,"na_values")
#> [1] 99997 99998 99999
#> attr(,"class")
#> [1] "retroharmonize_labelled_spss_survey" "haven_labelled_spss"
#> [3] "haven_labelled"
#> attr(,"survey_id_name")
#> [1] "x"
#> attr(,"survey_id_values")
#> 2 1 3
#> 0 1 99997
#> attr(,"survey_id_label")
#> [1] "QA27 EUROPEAN COMMISSION - TRUST"
#> attr(,"survey_id_labels")
#> Tend to trust Tend not to trust DK
#> 1 2 3
#> attr(,"survey_id_na_values")
#> [1] 3
#> attr(,"id")
#> [1] "survey_id"
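The recoding visible in the two printouts above can be reproduced with a plain lookup table. This is a base R sketch of the idea, assuming the 1/2/3 coding shown for ZA4414; it does not recreate the labels or the other attributes that harmonize_values() attaches.

```r
# The first 16 original codes from the printout above
original <- c(3, 1, 2, 3, 1, 2, 2, 2, 1, 3, 1, 1, 1, 1, 1, 1)

# 1 = Tend to trust -> 1, 2 = Tend not to trust -> 0, 3 = DK -> 99997
recode_map <- c("1" = 1, "2" = 0, "3" = 99997)

# Look up each original code by name and drop the names again
harmonized <- unname(recode_map[as.character(original)])
harmonized
```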
If you are satisfied with the results, run harmonize_eb_trust() through the nine survey waves. Whenever a variable is missing from a wave, it is filled with inap (inapplicable) missing values.
Harmonize waves
We select a subset of countries (Belgium, Hungary, Italy, Malta, the Netherlands, Poland, and Slovakia) and the variables of interest.
eb_waves_selected <- lapply(
  eb_waves, function(x) {
    x %>%
      select(any_of(c("rowid", "country_id", "weight_poststrat",
                      "trust_national_parliament", "trust_european_commission",
                      "trust_european_parliament"))) %>%
      filter(country_id %in% c("NL", "PL", "HU", "SK", "BE", "MT", "IT"))
  }
)
harmonized_eb_waves <- harmonize_surveys(
  survey_list = eb_waves_selected,
  .f = harmonize_eb_trust
)
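The selection step tolerates waves that lack some of the requested variables. A base R sketch of the same idea on two invented waves (all names and values below are made up) uses intersect() to keep only the variables that are present:

```r
# Two toy waves with different sets of variables (invented data).
waves <- list(
  wave_a = data.frame(rowid = c("a1", "a2", "a3"),
                      country_id = c("NL", "DE", "HU"),
                      trust_national_parliament = c(1, 0, 1),
                      stringsAsFactors = FALSE),
  wave_b = data.frame(rowid = c("b1", "b2"),
                      country_id = c("FR", "PL"),
                      trust_european_parliament = c(0, 1),
                      stringsAsFactors = FALSE)
)

keep_vars <- c("rowid", "country_id", "weight_poststrat",
               "trust_national_parliament", "trust_european_commission",
               "trust_european_parliament")
keep_countries <- c("NL", "PL", "HU")

selected <- lapply(waves, function(x) {
  # Keep the requested countries and only those variables that exist in x
  x[x$country_id %in% keep_countries,
    intersect(keep_vars, names(x)), drop = FALSE]
})

lapply(selected, names)
```

No error is raised for weight_poststrat or trust_european_commission, which neither toy wave contains.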
We cannot rely on document_surveys() anymore, because the result is a single data frame. Let’s have a look at the descriptive metadata stored in its attributes.
wave_attributes <- attributes(harmonized_eb_waves)
wave_attributes$id
#> [1] "Waves: ZA2828_trust; ZA3171_trust; ZA3639_trust; ZA4414_trust; ZA4744_trust; ZA5481_trust; ZA5913_trust; ZA6863_trust; ZA7562_trust"
wave_attributes$filename
#> [1] "Original files: ZA2828_trust.rds; ZA3171_trust.rds; ZA3639_trust.rds; ZA4414_trust.rds; ZA4744_trust.rds; ZA5481_trust.rds; ZA5913_trust.rds; ZA6863_trust.rds; ZA7562_trust.rds"
wave_attributes$names
#> [1] "rowid" "country_id"
#> [3] "weight_poststrat" "trust_national_parliament"
#> [5] "trust_european_commission" "trust_european_parliament"
Analyze the data
The harmonized data can be analyzed in R. The labelled survey data is stored in labelled_spss_survey() vectors, a complex class that retains much metadata for reproducibility. Most statistical R packages do not recognize this class. For them, the data should be presented either as numeric data with as_numeric() or as categorical data with as_factor(). (The labelled_spss_survey class vignette explains why you should not fall back on the more generic as.factor() or as.numeric() methods.)
First, let’s treat the trust variables as factors. A summary of the resulting data allows us to screen for values that are outside of the expected range. In the trust variables, any value other than “trust” and “not trust” that is not defined as missing is unacceptable. In our example, there are no such values.
We also see some basic information about the weighting factors, which in the selected Eurobarometer subset range from below 0.01 to almost 7. The range of these values is pretty large, which needs to be taken into account when analyzing the data.
harmonized_eb_waves %>%
mutate_at ( vars(contains("trust")), as_factor ) %>%
summary()
#> rowid country_id weight_poststrat
#> Length:58917 Length:58917 Min. :0.0095
#> Class :character Class :character 1st Qu.:0.7290
#> Mode :character Mode :character Median :0.9315
#> Mean :1.0000
#> 3rd Qu.:1.1908
#> Max. :6.9678
#> trust_national_parliament trust_european_commission trust_european_parliament
#> not_trust :31432 not_trust :15696 not_trust :16109
#> trust :22210 trust :26451 trust :27701
#> do_not_know: 5269 do_not_know:10132 do_not_know: 8468
#> declined : 0 declined : 0 declined : 0
#> inap : 6 inap : 6638 inap : 6639
#>
Now we convert the trust variables to numeric format and look at the summary. Following the conversion, we lose information about the type of the missing values: they are all lumped together as NA. What we gain is the proportion of positive responses (which ranges between 0.41 for trust in the national parliament and 0.63 for trust in the European Parliament), and the ability to, e.g., construct scales from the binary variables.
numeric_harmonization <- harmonized_eb_waves %>%
mutate_at ( vars(contains("trust")), as_numeric )
summary(numeric_harmonization)
#> rowid country_id weight_poststrat
#> Length:58917 Length:58917 Min. :0.0095
#> Class :character Class :character 1st Qu.:0.7290
#> Mode :character Mode :character Median :0.9315
#> Mean :1.0000
#> 3rd Qu.:1.1908
#> Max. :6.9678
#>
#> trust_national_parliament trust_european_commission trust_european_parliament
#> Min. :0.000 Min. :0.000 Min. :0.000
#> 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
#> Median :0.000 Median :1.000 Median :1.000
#> Mean :0.414 Mean :0.628 Mean :0.632
#> 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
#> Max. :1.000 Max. :1.000 Max. :1.000
#> NA's :5275 NA's :16770 NA's :15107
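Why the mean of the numeric variable is a proportion can be seen in a small base R sketch. It assumes that the declared missing codes become NA, as in the summary above, and uses the 16 harmonized values printed earlier; it does not call as_numeric() itself.

```r
# Harmonized codes: 0 = not_trust, 1 = trust, 99997 = do_not_know
codes <- c(99997, 1, 0, 99997, 1, 0, 0, 0, 1, 99997, 1, 1, 1, 1, 1, 1)
na_values <- c(do_not_know = 99997, declined = 99998, inap = 99999)

# All declared missing codes collapse into a single NA
numeric_trust <- ifelse(codes %in% na_values, NA_real_, codes)

# With 0/1 coding, the mean of the valid answers is the share of "trust"
share_trust <- mean(numeric_trust, na.rm = TRUE)
share_trust
```

Here 9 of the 13 valid answers are "trust", while the three do_not_know answers no longer count toward the denominator.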
Finally, let’s calculate weighted means of trust in the national parliament, the European Parliament, and the European Commission, for the selected countries, across all EB waves.
numeric_harmonization %>%
group_by(country_id) %>%
summarize_at ( vars(contains("trust")),
list(~mean(.*weight_poststrat, na.rm=TRUE)))
#> # A tibble: 7 × 4
#> country_id trust_national_parliament trust_european_comm… trust_european_parl…
#> <chr> <dbl> <dbl> <dbl>
#> 1 BE 0.453 0.610 0.623
#> 2 HU 0.328 0.639 0.642
#> 3 IT 0.317 0.603 0.630
#> 4 MT 0.568 0.753 0.754
#> 5 NL 0.657 0.664 0.635
#> 6 PL 0.225 0.658 0.642
#> 7 SK 0.294 0.613 0.641
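Note that averaging the product of a variable and its weight equals the weighted mean only when the weights of the non-missing cases average exactly 1. A minimal sketch with made-up numbers compares that expression with base R's weighted.mean(), which divides by the sum of the weights of the non-missing cases. It illustrates the arithmetic only and is not a re-analysis of the Eurobarometer data.

```r
# Invented answers (1 = trust, 0 = not_trust, NA = missing) and weights
trust <- c(1, 0, 1, NA)
weight <- c(2, 1, 1, 4)

# Mean of the weighted values, as in the chunk above: (2 + 0 + 1) / 3
product_mean <- mean(trust * weight, na.rm = TRUE)

# Weighted mean normalized by the weights of valid cases: (2 + 0 + 1) / (2 + 1 + 1)
normalized_mean <- weighted.mean(trust, weight, na.rm = TRUE)

c(product_mean = product_mean, normalized_mean = normalized_mean)
```

In this toy case the first expression returns 1 and the second 0.75, so the two can diverge noticeably when weights vary as much as they do in the selected subset.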