The goal of retroharmonize is to facilitate retrospective (ex-post) harmonization of survey data in a reproducible manner. The package provides tools for organizing the metadata, standardizing the coding of variables, variable names and value labels, including missing values, and for documenting all transformations, with the help of comprehensive S3 classes.
The package is currently being generalized from problems solved in the not yet released eurobarometer package (doi.)
Installation
The package is available on CRAN:
install.packages("retroharmonize")
The development version can be installed from GitHub with:
# install.packages("devtools")
devtools::install_github("rOpenGov/retroharmonize")
You can download the manual in PDF for the 0.2.4 release.
Survey harmonization
Survey data harmonization refers to procedures that improve the data comparability or the inferential capacity of multiple surveys. The retroharmonize package supports the data processing, documentation, and file/type conversion aspects of various retrospective survey harmonization workflows (i.e., harmonization tasks related to surveys that have already been conducted and recorded into a coded file).
From a technical perspective, the aim of survey harmonization is to create a single, tidy, joined, harmonized dataset in the form of a data frame that contains a row identifier, which is truly unique across all observations, and which also contains the concatenated and harmonized variables. We do this in a way that provides an unambiguous mapping of numerically coded and labelled data, including special and missing data. This way we avoid coercion that may lead to logical errors due to syntactically correct, but logically inconsistent variable labelling across differently coded source files. Taking the harmonization to the level of type harmonization to numeric and factor classes allows the use of R’s powerful statistical packages that require numeric or factor type input, and a wide range of survey output harmonization (harmonized statistics and indicators).
For an extended overview of these problems with illustrations please refer to the vignette Survey Harmonization.
1. Importing
Survey data, i.e., data derived from questionnaires or from systematic data collection such as inspecting objects in nature or recording prices at shops, are usually stored in databases and converted to complex files that retain at least the coding and labelling metadata together with the data. These files must be imported to R so that the appropriate harmonization tasks can be carried out with the appropriate R types.
2. Harmonization of concepts
After importing data with some descriptive metadata such as numerical coding and labelling, we need to create a map of the information that is in our R session to prepare a harmonization plan. We must find information related to sufficiently similar concepts that can be harmonized to be successfully joined into a single variable, and eventually a table of similar variables must be joined.
We create a map of the measured concepts that need to be harmonized, for example, a binary sex variable with missing cases and a four-level categorical variable on gender identification that has other and declined options. See the vignette Working With Survey Metadata to learn how mapping the metadata of the surveys can help you get started with this first step.
We use a crosswalk table or a crosswalk scheme for all the variable name, value label and type conversion tasks that we plan to do.
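As a rough illustration of what such a crosswalk table can hold, the following base R sketch writes one by hand for the two hypothetical surveys used in the steps below. The column names and the male codes are assumptions made for this example; they are not necessarily the schema that the package itself uses.

```r
# A hand-written crosswalk table for two hypothetical surveys.
# Column names are illustrative assumptions, not the package's own schema.
crosswalk <- data.frame(
  survey       = c("survey_1", "survey_1", "survey_2", "survey_2"),
  var_orig     = c("sex", "sex", "gender", "gender"),
  var_target   = "gender",                 # recycled to all four rows
  code_orig    = c(0, 1, 2, 1),            # male codes are assumed
  code_target  = c(0, 1, 0, 1),
  label_target = c("female", "male", "female", "male"),
  stringsAsFactors = FALSE
)

# All source variables are planned to end up under one target name
stopifnot(length(unique(crosswalk$var_target)) == 1)

print(crosswalk)
```

Each row documents one planned conversion, so the table doubles as a record of the harmonization plan.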
3. Harmonization of variable names
Make sure that survey_1$sex and survey_2$gender can be concatenated to a gender vector or survey_joined$gender. See more in the Working With A Crosswalk Table.
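A minimal base R sketch of this step, assuming two toy data frames and a hypothetical helper called rename_vars() (neither comes from the package), could look like this:

```r
# Toy surveys; the values are invented for illustration only
survey_1 <- data.frame(rowid = 1:3, sex = c(0, 1, 0))
survey_2 <- data.frame(rowid = 1:2, gender = c(2, 1))

# Hypothetical name map: original variable name -> harmonized name
name_map <- c(sex = "gender", gender = "gender")

# Hypothetical helper: rename only the columns that appear in the map
rename_vars <- function(df, map) {
  hit <- names(df) %in% names(map)
  names(df)[hit] <- unname(map[names(df)[hit]])
  df
}

survey_1 <- rename_vars(survey_1, name_map)
survey_2 <- rename_vars(survey_2, name_map)

# Identical column names make row-binding possible
survey_joined <- rbind(survey_1, survey_2)
names(survey_joined)
```

At this point only the names agree; the numerical codes in survey_joined$gender are still inconsistent and need the value harmonization described in the next step.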
4. Harmonization of variable numerical codes and labels
For example, Female=0 in survey_1$sex and female=2 in survey_2$gender becomes consistently female=0. Missing and declined values are consistently handled.
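The recoding can be sketched in base R as follows. Only Female=0 and female=2 come from the text above; the male codes, the source codes 9 and 99 for declined and missing answers, the target codes 99998 and 99999, and the recode() helper are assumptions made for this example.

```r
# Toy inputs: 0 = Female, 1 = Male, 9 = declined (assumed codes)
survey_1 <- data.frame(sex = c(0, 1, 0, 9))
# Toy inputs: 2 = female, 1 = male, 99 = missing (assumed codes)
survey_2 <- data.frame(gender = c(2, 1, 2, 99))

# Hypothetical helper: map each original code to its target code and
# fail loudly if a code has no mapping
recode <- function(x, from, to) {
  out <- to[match(x, from)]
  if (anyNA(out)) stop("unmapped code found")
  out
}

# Assumed target coding: 0 = female, 1 = male,
# 99998 = declined, 99999 = missing
gender_1 <- recode(survey_1$sex,    from = c(0, 1, 9),  to = c(0, 1, 99998))
gender_2 <- recode(survey_2$gender, from = c(2, 1, 99), to = c(0, 1, 99999))

gender <- c(gender_1, gender_2)
gender
```

Stopping on unmapped codes is a deliberate choice in this sketch: a silent NA would hide exactly the kind of inconsistent coding that harmonization is meant to expose.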
5. Consistent types
To use R’s statistical functions with the concatenated version of survey_1$sex and survey_2$gender, they must have the same R type. In the vast majority of cases this is either numeric or factor, and in data visualization applications sometimes character. See more in the Harmonize Value Labels vignette.
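A base R sketch of the three target types, assuming an already harmonized code vector and the hypothetical special codes 99998 (declined) and 99999 (missing), might look like this:

```r
# Harmonized codes (assumed): 0 = female, 1 = male,
# 99998 = declined, 99999 = missing
gender_code <- c(0, 1, 0, 99998, 0, 1, 0, 99999)
labels <- c(female = 0, male = 1, declined = 99998, missing = 99999)

# Factor: every code, including the special ones, becomes a level
gender_factor <- factor(gender_code, levels = labels, labels = names(labels))

# Numeric: special codes turn into NA so that they do not distort statistics
gender_numeric <- ifelse(gender_code %in% c(99998, 99999), NA_real_, gender_code)

# Character: convenient for plot labels
gender_character <- as.character(gender_factor)

table(gender_factor)
mean(gender_numeric, na.rm = TRUE)
```

Keeping the special codes as factor levels but as NA in the numeric version is one possible convention; which one is appropriate depends on the analysis.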
6. Reproducibility & Documentation
To review statistical results and model results derived from the concatenated variable (or the joined data frame), they must remain comparable with survey_1$sex and survey_2$gender. It is also necessary to have a new, unique row ID for each observation. If you want to make your work available outside R, in different software, the joined, longitudinal data frame must be exported in a consistent manner.
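One simple way to obtain such an ID, sketched here in base R with a hypothetical add_survey_id() helper and toy data, is to prefix each original row ID with a survey identifier before joining, and then export the result to a plain CSV file:

```r
# Toy surveys whose original row IDs overlap
survey_1 <- data.frame(rowid = 1:3, gender = c("female", "male", "female"))
survey_2 <- data.frame(rowid = 1:2, gender = c("female", "male"))

# Hypothetical helper: keep the source survey and make the row ID unique
add_survey_id <- function(df, survey_id) {
  df$survey_id <- survey_id
  df$rowid <- paste(survey_id, df$rowid, sep = "_")
  df
}

survey_joined <- rbind(
  add_survey_id(survey_1, "survey_1"),
  add_survey_id(survey_2, "survey_2")
)

# The new ID must be unique across all observations
stopifnot(anyDuplicated(survey_joined$rowid) == 0)

# Export for use outside R; a temporary file is used in this sketch
out_file <- tempfile(fileext = ".csv")
write.csv(survey_joined, out_file, row.names = FALSE)
```

Because the survey identifier travels with every row, each observation in the exported file can still be traced back to its source survey.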
Use Cases
We also provide three extensive case studies illustrating how the retroharmonize package can be used for ex-post harmonization of data from cross-national surveys.
The creators of retroharmonize are not affiliated with Afrobarometer, Arab Barometer, Eurobarometer, or the organizations that design, produce or archive their surveys.
We create a large, harmonized dataset for extensive testing of our package’s capabilities. The replication data of this special use case can be found on
You can find this harmonized dataset on Zenodo in the Digital Music Observatory and the Cultural Creative Sectors Industries Data Observatory repositories.
We are building experimental data APIs in the form of automated observatories, which run retroharmonize regularly and improve known statistical data sources. See also the Green Deal Data Observatory and the Economy Data Observatory.
Working with SPSS files
Survey data is often available in SPSS’s custom labelled format. Unfortunately, joining data with different labelling is not possible. When you do not need to preserve the history of complex harmonization problems, codebooks, etc., you do not necessarily need to look under the hood of our S3 classes. The new labelled_spss_survey() class is an inherited extension of haven’s labelled_spss class. It not only preserves variable and value labels and the user-defined missing range, but also gives an identifier, for example, the filename or the wave number, to the vector. Additionally, it preserves, as metadata attributes, the original variable names, labels, and value codes and labels from the source data. This way, the harmonized data also contain the pre-harmonization record. The vignette Working With The labelled_spss_survey Class provides more information about the labelled_spss_survey() class.
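The idea of carrying the pre-harmonization record along with the data can be sketched in base R with plain attributes. This is only a conceptual illustration and does not use the labelled_spss_survey() class or its real interface; the attribute names, the codes, and the harmonize_sketch() helper are all assumptions made for this example.

```r
# Base R sketch only: plain attributes stand in for the S3 class
sex <- structure(
  c(0, 1, 0, 9),
  labels    = c(Female = 0, Male = 1, Declined = 9),
  na_values = 9,
  id        = "survey_1",   # hypothetical survey identifier
  name_orig = "sex"         # original variable name
)

# Hypothetical helper: recode the values and keep the original
# labels and name as metadata on the harmonized vector
harmonize_sketch <- function(x, from, to, labels) {
  out <- to[match(as.vector(x), from)]
  attributes(out) <- list(
    labels      = labels,
    id          = attr(x, "id"),
    name_orig   = attr(x, "name_orig"),
    labels_orig = attr(x, "labels")
  )
  out
}

gender <- harmonize_sketch(
  sex,
  from   = c(0, 1, 9),
  to     = c(0, 1, 99998),
  labels = c(female = 0, male = 1, declined = 99998)
)

# The pre-harmonization record is still attached to the harmonized vector
attr(gender, "name_orig")
attr(gender, "labels_orig")
```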
In Harmonize Value Labels we discuss the characteristics of the labelled_spss_survey() class and demonstrate the problems that using this class solves.
Citations and related work
Citing the data sources
Our package has been tested on microdata from three harmonized surveys. Because retroharmonize is not affiliated with any of these data sources, to replicate our tutorials or work with the data, you have to download the data files from these sources, and you have to cite those sources in your work.
Afrobarometer data: cite Afrobarometer.
Arab Barometer data: cite Arab Barometer.
Eurobarometer data: Eurobarometer raw data and related documentation (questionnaires, codebooks, etc.) are made available by GESIS, ICPSR and through the Social Science Data Archive networks. You should cite your source; in our examples, we rely on the GESIS data files.
Citing the retroharmonize R package
For the main developer and contributors, see the package homepage.
This work can be freely used, modified and distributed under the GPL-3 license:
citation("retroharmonize")
#>
#> To cite package 'retroharmonize' in publications use:
#>
#> Daniel Antal (2021). retroharmonize: Ex Post Survey Data
#> Harmonization. https://retroharmonize.dataobservatory.eu/,
#> https://github.com/rOpenGov/retroharmonize.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {retroharmonize: Ex Post Survey Data Harmonization},
#> author = {Daniel Antal},
#> year = {2021},
#> note = {https://retroharmonize.dataobservatory.eu/,
#> https://github.com/rOpenGov/retroharmonize},
#> }
Contact
For contact information, see the package homepage.
Code of Conduct
Please note that the retroharmonize project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.