The eurobarometer package relies on the survey class system of retroharmonize. You do not have to load the entire retroharmonize package - whatever is needed to make eurobarometer work is imported and modified as needed.
ZA6863 <- read_rds(
  system.file("examples", "ZA6863.rds", package = "eurobarometer")
)
#> Survey read:
#> id: ZA6863
#> filename: ZA6863.rds
#> doi: doi:10.4232/1.12847

ZA7576 <- read_rds(
  system.file("examples", "ZA7576.rds", package = "eurobarometer")
)
#> Survey read:
#> id: ZA7576
#> filename: ZA7576.rds
#> doi: doi:10.4232/1.13393
The metadata analysis is a first step to help both variable name and value label harmonization.
ZA6863_metadata <- gesis_metadata_create(ZA6863)
ZA7576_metadata <- gesis_metadata_create(ZA7576)
Variables of base types numeric and character can be safely concatenated. The labelled, mainly categorical variables require special attention: their valid range and missing range must be harmonized before binding the two tables together.
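To see why labelled variables need this attention, here is a minimal base R sketch. The two waves and their code tables are made up for illustration (they are not taken from the Eurobarometer files): both waves ask the same question, but they use a different code for the "do not know" answer.

```r
# Hypothetical code tables: the valid range agrees, the missing code does not.
labels_a <- c(trust = 1, not_trust = 2, do_not_know = 3)
labels_b <- c(trust = 1, not_trust = 2, do_not_know = 9)

# Hypothetical answers recorded in each wave.
wave_a <- c(1, 2, 3)
wave_b <- c(2, 9, 1)

# Naive concatenation: codes 3 and 9 now both stand for "do not know".
naive <- c(wave_a, wave_b)

# Translate each wave through its own code table before combining.
decode <- function(x, labels) names(labels)[match(x, labels)]
harmonized <- c(decode(wave_a, labels_a), decode(wave_b, labels_b))

table(harmonized)
```

After decoding, both "do not know" answers fall into one category, which is what the harmonization of the valid and missing ranges is meant to achieve before the tables are bound.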
ZA6863_items <- ZA6863_metadata %>%
  filter (
    class_orig %in% c("character", "numeric") |
      str_sub(var_name_suggested, 1, 5) == 'trust'
  ) %>%
  filter ( var_name_suggested != 'not_given' ) %>%
  pull (var_name_suggested)

ZA7576_items <- ZA7576_metadata %>%
  filter (
    class_orig %in% c("character", "numeric") |
      str_sub(var_name_suggested, 1, 5) == 'trust'
  ) %>%
  filter ( var_name_suggested != 'not_given' ) %>%
  pull (var_name_suggested)
In this case, the var_label_suggest() function worked perfectly, so we can approve the suggestions of gesis_metadata_create().
Let’s select the variables with identical names from the two surveys:
hZA6863 <- ZA6863 %>%
  stats::setNames ( nm = ZA6863_metadata$var_name_suggested ) %>%
  select ( all_of(intersect(ZA6863_items, ZA7576_items)))

hZA7576 <- ZA7576 %>%
  stats::setNames ( nm = ZA7576_metadata$var_name_suggested ) %>%
  select ( all_of(intersect(ZA6863_items, ZA7576_items)))
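The selection hinges on intersect(), which keeps only the names present in both vectors, in the order of its first argument. A minimal base R sketch, with made-up variable names that only stand in for the suggested names in the metadata tables:

```r
# Hypothetical suggested variable names from two survey waves.
items_a <- c("uniqid", "trust_army", "trust_police", "age_exact")
items_b <- c("trust_police", "uniqid", "trust_army", "trust_media")

# Names present in both waves, in the order of the first vector.
common <- intersect(items_a, items_b)
common

# Names that exist only in one wave and would be dropped by the selection.
setdiff(items_a, items_b)
setdiff(items_b, items_a)
```

Because both surveys are subset with the same vector of names, the resulting tables have identical columns in identical order.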
Let’s have a look at their value labels: [it is not clear why these are not identical.]
The retroharmonize::harmonize_values() function is a prototype of the harmonization function. It should be adjusted to survey-specific and question-block-specific idiosyncrasies. In the long run this should be handled by vocabulary tables, but the prototype can be made to work by supplying the harmonization regex either as a list or as a data frame.
Because we would like to apply the same harmonization to the whole question block, in this case we adapt the prototype with a regex. The retroharmonize::harmonize_values() function normalizes the labels, so you do not have to handle differences in capitalization. To better understand the harmonization procedure, please refer to the Harmonize Value Labels vignette of the retroharmonize package.
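The regex-based matching can be sketched in base R. This is only an illustration of the principle, not the internals of retroharmonize::harmonize_values(): the helper first_match() is hypothetical, the example labels are partly made up, and the patterns and target codes are the ones used for the trust block in this article.

```r
# Example labels in mixed capitalization (partly hypothetical variants).
orig_labels <- c("Tend to trust", "TEND NOT TO TRUST", "DK", "Don't know",
                 "Inap. (CY-TCC in isocntry)")

# Patterns, harmonized labels and harmonized codes for the trust block.
from <- c("^tend\\sto|^trust", "^tend\\snot|not\\strust", "^dk|^don", "^inap")
to <- c("trust", "not_trust", "do_not_know", "inap")
numeric_values <- c(1, 0, 99997, 99999)

# Normalize the labels so that the patterns need not care about case.
normalized <- tolower(trimws(orig_labels))

# Hypothetical helper: index of the first pattern matching a label (NA if none).
first_match <- function(lab) {
  hits <- which(vapply(from, function(p) grepl(p, lab), logical(1)))
  unname(hits)[1]
}
rule <- vapply(normalized, first_match, integer(1), USE.NAMES = FALSE)

data.frame(
  orig_label = orig_labels,
  label = to[rule],
  value = numeric_values[rule]
)
```

Both "DK" and "Don't know" end up as do_not_know with the code 99997, which is the kind of label variation the regex is meant to absorb.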
With a better imputing system, this could be automated to a large extent, probably harmonizing all trend variables at the same time. This should be the job of harmonize_eurobaromter.
harmonize_trust <- function(x) {
  retroharmonize::harmonize_values(
    x = x,
    harmonize_label = NULL,
    harmonize_labels = list(
      from = c("^tend\\sto|^trust", "^tend\\snot|not\\strust", "^dk|^don", "^inap"),
      to = c("trust", "not_trust", "do_not_know", "inap"),
      numeric_values = c(1, 0, 99997, 99999)
    ),
    na_values = c(do_not_know = 99997, declined = 99998, inap = 99999),
    na_range = NULL,
    id = "survey_id",
    name_orig = NULL
  )
}
Choosing the first trust vector, we can see that the harmonization records all metadata for reproducibility.
harmonize_trust (hZA6863$trust_army)
#>  [1]     1     0 99997     0 99997     0     1     1     1     1 99997     1
#> [13]     1     0     0     0     0     1     0     1     1     1     0     1
#> [25]     1     0     1     1     1     1     0     1     0     0     1 99999
#> [37] 99999 99999 99999 99999     1     0     1     0     0     1 99997 99997
#> [49]     0     1
#> attr(,"labels")
#>   not_trust       trust do_not_know        inap
#>           0           1       99997       99999
#> attr(,"label")
#> [1] "TRUST IN INSTITUTIONS: ARMY"
#> attr(,"na_values")
#> [1] 99997 99999
#> attr(,"class")
#> [1] "retroharmonize_labelled_spss_survey" "haven_labelled_spss"
#> [3] "haven_labelled"
#> attr(,"id")
#> [1] "survey_id"
#> attr(,"survey_id_name")
#> [1] "x"
#> attr(,"survey_id_values")
#>     2     1     3     9
#>     0     1 99997 99999
#> attr(,"survey_id_label")
#> [1] "TRUST IN INSTITUTIONS: ARMY"
#> attr(,"survey_id_labels")
#>              Tend to trust          Tend not to trust
#>                          1                          2
#>                         DK Inap. (CY-TCC in isocntry)
#>                          3                          9
#> attr(,"survey_id_na_values")
#> [1] 9
The coding appears very similar, so we use the same helper function for the same question in the other survey:
harmonize_trust (hZA7576$trust_army)
#>  [1]     1     1     1     1     1     1     1     0     0     1     1 99997
#> [13]     0     1     1     1     0     0 99997     1 99997     1     0     0
#> [25]     0     1     0     1     0     0     1     0     0     0     1     0
#> [37]     0     1     0     1     0 99999 99999 99999 99999
#> attr(,"labels")
#>   not_trust       trust do_not_know        inap
#>           0           1       99997       99999
#> attr(,"label")
#> [1] "TRUST IN INSTITUTIONS: ARMY"
#> attr(,"na_values")
#> [1] 99997 99999
#> attr(,"class")
#> [1] "retroharmonize_labelled_spss_survey" "haven_labelled_spss"
#> [3] "haven_labelled"
#> attr(,"id")
#> [1] "survey_id"
#> attr(,"survey_id_name")
#> [1] "x"
#> attr(,"survey_id_values")
#>     2     1     3     9
#>     0     1 99997 99999
#> attr(,"survey_id_label")
#> [1] "TRUST IN INSTITUTIONS: ARMY"
#> attr(,"survey_id_labels")
#>              Tend to trust          Tend not to trust
#>                          1                          2
#>                         DK Inap. (CY-TCC in isocntry)
#>                          3                          9
#> attr(,"survey_id_na_values")
#> [1] 9
trust_in_army <- retroharmonize::concatenate(
  x = harmonize_trust ( hZA6863$trust_army),
  y = harmonize_trust ( hZA7576$trust_army)
)
trust_in_army
#>  [1]     1     0 99997     0 99997     0     1     1     1     1 99997     1
#> [13]     1     0     0     0     0     1     0     1     1     1     0     1
#> [25]     1     0     1     1     1     1     0     1     0     0     1 99999
#> [37] 99999 99999 99999 99999     1     0     1     0     0     1 99997 99997
#> [49]     0     1     1     1     1     1     1     1     1     0     0     1
#> [61]     1 99997     0     1     1     1     0     0 99997     1 99997     1
#> [73]     0     0     0     1     0     1     0     0     1     0     0     0
#> [85]     1     0     0     1     0     1     0 99999 99999 99999 99999
#> attr(,"id")
#> [1] "survey_id"
#> attr(,"labels")
#>   not_trust       trust do_not_know        inap
#>           0           1       99997       99999
#> attr(,"label")
#> [1] "TRUST IN INSTITUTIONS: ARMY"
#> attr(,"na_values")
#> [1] 99997 99999
#> attr(,"class")
#> [1] "retroharmonize_labelled_spss_survey" "haven_labelled_spss"
#> [3] "haven_labelled"
#> attr(,"survey_id_name")
#> [1] "x"
#> attr(,"survey_id_values")
#>     2     1     3     9
#>     0     1 99997 99999
#> attr(,"survey_id_label")
#> [1] "TRUST IN INSTITUTIONS: ARMY"
#> attr(,"survey_id_labels")
#>              Tend to trust          Tend not to trust
#>                          1                          2
#>                         DK Inap. (CY-TCC in isocntry)
#>                          3                          9
#> attr(,"survey_id_na_values")
#> [1] 9
The attributes are complex because they make it possible to revert to the historical coding, and to choose between a categorical and a numeric representation in R.
summary ( as_factor(trust_in_army))
#>   not_trust       trust do_not_know        inap
#>          35          43           8           9
summary ( as_numeric(trust_in_army))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#>  0.0000  0.0000  1.0000  0.5513  1.0000  1.0000      17
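The idea of keeping the original coding next to the harmonized values can be sketched in base R. This is only an illustration of the principle, not the way retroharmonize stores or restores its attributes: the code_map attribute name and the example answers are hypothetical, while the code pairs are the ones recorded in the survey_id_values attribute of the trust variable.

```r
# Recoding table: original survey code -> harmonized code.
code_map <- data.frame(
  original = c(1, 2, 3, 9),
  harmonized = c(1, 0, 99997, 99999)
)

# Hypothetical answers in the original coding.
original_x <- c(1, 2, 3, 2, 9)

# Harmonize, and keep the recoding table attached to the vector.
x <- code_map$harmonized[match(original_x, code_map$original)]
attr(x, "code_map") <- code_map

# Revert to the historical coding using only the stored attribute.
stored_map <- attr(x, "code_map")
reverted <- stored_map$original[match(x, stored_map$harmonized)]

identical(reverted, original_x)
```

Carrying the recoding table with the data is what makes the harmonization reproducible: the historical codes can be recovered without going back to the source file.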
Let’s repeat the same harmonization for all trust variables.
hZA7576 <- hZA7576 %>%
  mutate_at (vars (starts_with("trust")), harmonize_trust )

hZA6863 <- hZA6863 %>%
  mutate_at (vars (starts_with("trust")), harmonize_trust )

hZA6863 %>%
  select ( all_of(c("trust_army", "trust_european_union")))
#> # A tibble: 50 x 2
#>    trust_army                trust_european_union
#>    <retroh_dbl>              <retroh_dbl>
#>  1     1 [trust]                 1 [trust]
#>  2     0 [not_trust]             1 [trust]
#>  3 99997 (NA) [do_not_know]  99997 (NA) [do_not_know]
#>  4     0 [not_trust]             1 [trust]
#>  5 99997 (NA) [do_not_know]      1 [trust]
#>  6     0 [not_trust]             0 [not_trust]
#>  7     1 [trust]             99997 (NA) [do_not_know]
#>  8     1 [trust]                 1 [trust]
#>  9     1 [trust]                 0 [not_trust]
#> 10     1 [trust]                 1 [trust]
#> # ... with 40 more rows
Given that the other selected variables have identical (harmonized) names and are of base type numeric or character, after harmonizing the trust labels and na_values we can bind the two panels with vctrs::vec_rbind() or dplyr::bind_rows(). Unfortunately, the generic c() method cannot be implemented to work with this type.
panel <- vctrs::vec_rbind ( hZA6863, hZA6863 )
#> Warning in x_attr_names == paste0(x_id, "_name"): longer object length is not a
#> multiple of shorter object length
The panel is now created, and it is ready for export to other statistical software or for further analysis in R. While some basic arithmetic methods are implemented for the labelled_spss_survey class of the retroharmonize package, to use all R statistical packages the analyst has to choose a base R type that is compatible with them. Since the trust variables are categorical, they can be recast with the as_factor() or as_numeric() methods. Again, the base R as.factor() or as.numeric() functions will give a legible but incorrect representation.
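Why the declared missing codes matter for numeric work can be shown with a small base R sketch. A plain numeric vector stands in here for the labelled class, and the values are made up; the sketch only illustrates the pitfall of treating the codes 99997 and 99999 as ordinary numbers, not the behaviour of any particular method.

```r
# Hypothetical harmonized answers: 1 = trust, 0 = not_trust,
# 99997 = do_not_know and 99999 = inap are declared missing codes.
x <- c(1, 0, 99997, 0, 1, 99999)
na_values <- c(99997, 99999)

# Treating the missing codes as ordinary numbers distorts the statistics.
mean(x)

# Converting the declared missing codes to NA first gives the share of trust.
x_valid <- ifelse(x %in% na_values, NA_real_, x)
mean(x_valid, na.rm = TRUE)
sum(is.na(x_valid))
```

The second mean is computed over the four valid answers only, which is the interpretation an analyst expects from a 0/1 trust indicator.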
The factor representation presents the user-defined missing values as categories:
panel %>%
  mutate_at (vars (starts_with("trust")), as_factor ) %>%
  summary()
#>      doi            gesis_archive_version_and_date     uniqid
#>  Length:100         Length:100                     Min.   :1.10e+08
#>  Class :character   Class :character               1st Qu.:3.20e+08
#>  Mode  :character   Mode  :character               Median :6.30e+08
#>                                                    Mean   :6.12e+08
#>                                                    3rd Qu.:1.00e+09
#>                                                    Max.   :1.00e+09
#>  country_code_iso_3166       trust_army   trust_european_union
#>  Length:100            not_trust  :34   not_trust  :38
#>  Class :character      trust      :46   trust      :40
#>  Mode  :character      do_not_know:10   do_not_know:12
#>                        inap       :10   inap       :10
#>
#>
#>  trust_european_union_tcc  trust_justice_system  trust_national_government
#>  not_trust: 4              not_trust  :42        not_trust  :44
#>  trust    : 6              trust      :46        trust      :34
#>  inap     :90              do_not_know: 2        do_not_know:12
#>                            inap       :10        inap       :10
#>
#>
#>  trust_national_parliament      trust_police   trust_political_parties
#>  not_trust  :40            not_trust  :30    not_trust  :56
#>  trust      :36            trust      :56    trust      :24
#>  do_not_know:14            do_not_know: 4    do_not_know:10
#>  inap       :10            inap       :10    inap       :10
#>
#>
#>  trust_political_parties_tcc  trust_public_administration
#>  not_trust:10                 not_trust  :38
#>  inap     :90                 trust      :44
#>                               do_not_know: 8
#>                               inap       :10
#>
#>
#>  trust_regional_local_authorities  trust_united_nations
#>  not_trust  :38                    not_trust  :38
#>  trust      :40                    trust      :38
#>  do_not_know:12                    do_not_know:14
#>  inap       :10                    inap       :10
#>
#>
#>  trust_united_nations_tcc  weight_result_from_target_redressment
#>  not_trust: 4              Min.   :0.4376
#>  trust    : 6              1st Qu.:0.7858
#>  inap     :90              Median :1.0876
#>                            Mean   :1.1537
#>                            3rd Qu.:1.4168
#>                            Max.   :2.5878
#>  weight_germany    weight_extrapolated_population_aged_gt_15
#>  Min.   :0.0000    Min.   :  203.8
#>  1st Qu.:0.0000    1st Qu.: 1506.2
#>  Median :0.0000    Median : 5816.5
#>  Mean   :0.3344    Mean   :18774.5
#>  3rd Qu.:0.5392    3rd Qu.:24301.3
#>  Max.   :2.1909    Max.   :95773.7
And let’s compare this with the numeric representation, where the user-defined missing values are treated as missing:
panel %>%
  mutate_at (vars (starts_with("trust")), as_numeric ) %>%
  summary()
#>      doi            gesis_archive_version_and_date     uniqid
#>  Length:100         Length:100                     Min.   :1.10e+08
#>  Class :character   Class :character               1st Qu.:3.20e+08
#>  Mode  :character   Mode  :character               Median :6.30e+08
#>                                                    Mean   :6.12e+08
#>                                                    3rd Qu.:1.00e+09
#>                                                    Max.   :1.00e+09
#>
#>  country_code_iso_3166   trust_army     trust_european_union
#>  Length:100            Min.   :0.000   Min.   :0.0000
#>  Class :character      1st Qu.:0.000   1st Qu.:0.0000
#>  Mode  :character      Median :1.000   Median :1.0000
#>                        Mean   :0.575   Mean   :0.5128
#>                        3rd Qu.:1.000   3rd Qu.:1.0000
#>                        Max.   :1.000   Max.   :1.0000
#>                        NA's   :20      NA's   :22
#>  trust_european_union_tcc  trust_justice_system  trust_national_government
#>  Min.   :0.0               Min.   :0.0000        Min.   :0.0000
#>  1st Qu.:0.0               1st Qu.:0.0000        1st Qu.:0.0000
#>  Median :1.0               Median :1.0000        Median :0.0000
#>  Mean   :0.6               Mean   :0.5227        Mean   :0.4359
#>  3rd Qu.:1.0               3rd Qu.:1.0000        3rd Qu.:1.0000
#>  Max.   :1.0               Max.   :1.0000        Max.   :1.0000
#>  NA's   :90                NA's   :12            NA's   :22
#>  trust_national_parliament   trust_police     trust_political_parties
#>  Min.   :0.0000            Min.   :0.0000   Min.   :0.0
#>  1st Qu.:0.0000            1st Qu.:0.0000   1st Qu.:0.0
#>  Median :0.0000            Median :1.0000   Median :0.0
#>  Mean   :0.4737            Mean   :0.6512   Mean   :0.3
#>  3rd Qu.:1.0000            3rd Qu.:1.0000   3rd Qu.:1.0
#>  Max.   :1.0000            Max.   :1.0000   Max.   :1.0
#>  NA's   :24                NA's   :14       NA's   :20
#>  trust_political_parties_tcc  trust_public_administration
#>  Min.   :0                    Min.   :0.0000
#>  1st Qu.:0                    1st Qu.:0.0000
#>  Median :0                    Median :1.0000
#>  Mean   :0                    Mean   :0.5366
#>  3rd Qu.:0                    3rd Qu.:1.0000
#>  Max.   :0                    Max.   :1.0000
#>  NA's   :90                   NA's   :18
#>  trust_regional_local_authorities  trust_united_nations  trust_united_nations_tcc
#>  Min.   :0.0000                    Min.   :0.0           Min.   :0.0
#>  1st Qu.:0.0000                    1st Qu.:0.0           1st Qu.:0.0
#>  Median :1.0000                    Median :0.5           Median :1.0
#>  Mean   :0.5128                    Mean   :0.5           Mean   :0.6
#>  3rd Qu.:1.0000                    3rd Qu.:1.0           3rd Qu.:1.0
#>  Max.   :1.0000                    Max.   :1.0           Max.   :1.0
#>  NA's   :22                        NA's   :24            NA's   :90
#>  weight_result_from_target_redressment  weight_germany
#>  Min.   :0.4376                         Min.   :0.0000
#>  1st Qu.:0.7858                         1st Qu.:0.0000
#>  Median :1.0876                         Median :0.0000
#>  Mean   :1.1537                         Mean   :0.3344
#>  3rd Qu.:1.4168                         3rd Qu.:0.5392
#>  Max.   :2.5878                         Max.   :2.1909
#>
#>  weight_extrapolated_population_aged_gt_15
#>  Min.   :  203.8
#>  1st Qu.: 1506.2
#>  Median : 5816.5
#>  Mean   :18774.5
#>  3rd Qu.:24301.3
#>  Max.   :95773.7
#>
trust_in_army_doc <- retroharmonize::document_survey_item( trust_in_army)
trust_in_army_doc$code_table %>% kable ()
| values | survey_id_values | labels | survey_id_labels | missing |
|---|---|---|---|---|
| 0 | 2 | not_trust | Tend not to trust | FALSE |
| 1 | 1 | trust | Tend to trust | FALSE |
| 99997 | 3 | do_not_know | DK | TRUE |
| 99999 | 9 | inap | Inap. (CY-TCC in isocntry) | TRUE |
trust_in_army_doc$history_var_label
#>                         label               survey_id_label
#> "TRUST IN INSTITUTIONS: ARMY" "TRUST IN INSTITUTIONS: ARMY"