4 Transfers Replication: Labor Market Statistics

4.1 Labor Market Statistics

The labor market statistics in our sample and theirs are, generally speaking, quite similar. We start by looking at differences in the input data-sets. We then look at differences across the output data-sets and find no major differences in the statistics themselves. What proves to be binding, however, is missing labor market information for some relevant labor markets in their data. This is developed further in Chapter 5.

4.1.1 Input data-sets

Code

ours_input <- "our_replication_data.dta" %>% 
  paste0(dir_transfers_replication, "data/", .) %>% 
  haven::read_dta() %>% 
  as.data.table()   %>% 
    .[, .(year, cod6, prin, priw, prit, popibge)] 

theirs_input <- "main_exact_data_checkpointpre1.dta" %>% 
  paste0(dir_transfers_replication, "data/processing/", .) %>% 
  haven::read_dta() %>% 
  as.data.table()   %>% 
    .[, .(year, cod6, prin,prin_b10_1, priw, prit, popibge)] 


# theirs_input %>% names() %>% .[stri_detect_fixed(., "prin")] %>% return_in_vector_format()
# these <- c('prin_b10', 'prin_a10', 'prin', 'prin_1', 'prin_2', 'prin_3', 'prin_b10_1', 'prin_a10_1', 'prin_b10_2', 'prin_a10_2', 'prin_b10_3', 'prin_a10_3')
# summary(theirs_input[, ..these]) 
 # prin_b10         prin_a10            prin               prin_1            prin_2           prin_3          prin_b10_1       prin_a10_1     
 # Min.   :     0   Min.   :    -10   Min.   :0.000e+00   Min.   :    0.0   Min.   :     0   Min.   :      0   Min.   :   0.0   Min.   :    0.0  
 # 1st Qu.:    70   1st Qu.:     42   1st Qu.:1.240e+02   1st Qu.:   13.0   1st Qu.:    15   1st Qu.:     47   1st Qu.:   7.0   1st Qu.:    0.0  
 # Median :   235   Median :    273   Median :5.390e+02   Median :   74.0   Median :   119   Median :    214   Median :  36.0   Median :   13.0  
 # Mean   :  1522   Mean   :   5130   Mean   :4.021e+05   Mean   :  283.9   Mean   :  1954   Mean   :   4358   Mean   : 101.5   Mean   :  144.2  
 # 3rd Qu.:   762   3rd Qu.:   1527   3rd Qu.:2.307e+03   3rd Qu.:  268.0   3rd Qu.:   764   3rd Qu.:    948   3rd Qu.: 113.0   3rd Qu.:   90.0  
 # Max.   :742497   Max.   :3703141   Max.   :1.000e+09   Max.   :14186.0   Max.   :855568   Max.   :3624841   Max.   :3135.0   Max.   :12250.0  
 # NA's   :34467    NA's   :34483     NA's   :34162       NA's   :34188     NA's   :34188    NA's   :34188     NA's   :12244    NA's   :12244    
 #   prin_b10_2        prin_a10_2       prin_b10_3         prin_a10_3     
 # Min.   :    0.0   Min.   :     0   Min.   :     0.0   Min.   :      0  
 # 1st Qu.:    4.0   1st Qu.:     0   1st Qu.:    28.0   1st Qu.:      0  
 # Median :   20.0   Median :    54   Median :    98.0   Median :     45  
 # Mean   :  163.7   Mean   :  1256   Mean   :   919.1   Mean   :   2384  
 # 3rd Qu.:   77.0   3rd Qu.:   424   3rd Qu.:   363.0   3rd Qu.:    304  
 # Max.   :91411.0   Max.   :685213   Max.   :649329.0   Max.   :2917088  
 # NA's   :12244     NA's   :12244    NA's   :12244      NA's   :12244  

# checking for duplicates: none
# ours_input %>% 
#   .[, GRP := .GRP,.(year, cod6)] %>% 
#   .[duplicated(GRP)]
# 
# theirs_input %>% 
#   .[, GRP := .GRP,.(year, cod6)] %>% 
#   .[duplicated(GRP)]

4.1.1.1 Missing Private Employment over time

We perform the following analysis to determine whether the differences between our and their estimation samples stem from missing private employment data in the original input data-sets.

Their code drops observations (municipality-years) with missing private employment and observations with missing public employment. Our code only drops observations with missing public employment. In Chapter 5, we see that dropping observations with missing public employment pubn barely alters the composition of their output data-set; thus the binding constraint is the line of code which drops observations with missing private employment prin.
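To make the two dropping rules concrete, the sketch below applies them to a small made-up table. The toy data and the names `toy`, `keep_theirs`, `keep_ours`, and `dropped_by_prin_only` are illustrative assumptions, and we assume that a row is dropped under their rule when either variable is missing; the authors' actual code may differ.

```r
# Toy municipality-year table (made-up values, for illustration only)
toy <- data.frame(
  cod6 = c(1, 1, 2, 2, 3),
  year = c(2002, 2003, 2002, 2003, 2002),
  prin = c(120, NA, 45, NA, 300),   # private employment
  pubn = c(30, 28, NA, 12, 50)      # public employment
)

# Assumed version of their rule: keep rows where both variables are present
keep_theirs <- !is.na(toy$prin) & !is.na(toy$pubn)

# Our rule: keep rows where public employment is present
keep_ours <- !is.na(toy$pubn)

# Rows that survive our rule but not theirs are lost only because of prin
dropped_by_prin_only <- toy[keep_ours & !keep_theirs, c("cod6", "year")]
dropped_by_prin_only
```

In this toy table, the rows lost only under the assumed version of their rule are exactly those with a missing prin and a non-missing pubn.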

This line in the code is what drives the differences between our and their samples. Here, we first look at the distribution of missing private employment prin over time and across municipality size to see if there are any patterns, comparing their data to ours. We then select a different private employment variable, prin_b10_1, in their data and compare it to ours. We conclude that the observations missing from their sample are indeed lost by dropping rows with missing prin.

Figure 4.1 (tab Missing prin over time) shows the number of missing municipality-years in each input data-set. It illustrates that their data-set has a considerable number of municipality-years with missing private employment prin information. For a better comparison, we keep only the municipality-years common to both input data-sets and present the result in Figure 4.2 (tab Missing prin over time - same sample). Again, we see that their data-set has many missing observations.

Furthermore, we noticed that while some labor market variables associated with private employment, like prin, have many missing observations in their data, others like prin_b10_1, which measures private employment for agricultural firms below 10 workers, have a similar number of missing observations to our data (see Figure 4.3). When comparing missing prin (in our data) with missing prin_b10_1 (in their data), we also notice that roughly half of the observations missing in one data-set are also missing in the other. In Figure 4.4 (tab Missing prin_b10_1 over time - same sample - no common missings), we remove these common missing municipality-years and see that the two data-sets have a similarly small average number of missing municipalities. This figure also highlights that while our missing data tapers off to zero over time, theirs remains between 20 and 30 missing municipality-years throughout when using the alternate prin_b10_1 variable.
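The common-sample comparison behind Figure 4.4 can be sketched as follows. This is a minimal base R illustration on made-up data, not the code that produced the figures; the objects `ours_toy`, `theirs_toy`, `common`, and `no_common_missing` are hypothetical names, and base R is used so that the example runs without additional packages.

```r
# Toy inputs (made-up values): one row per municipality-year
ours_toy <- data.frame(
  cod6 = c(1, 1, 2, 2, 3),
  year = c(2002, 2003, 2002, 2003, 2002),
  prin = c(10, NA, NA, 8, NA)
)
theirs_toy <- data.frame(
  cod6 = c(1, 1, 2, 2, 4),
  year = c(2002, 2003, 2002, 2003, 2002),
  prin_b10_1 = c(NA, NA, 3, NA, 5)
)

# Keep only municipality-years present in both inputs (inner join)
common <- merge(ours_toy, theirs_toy, by = c("cod6", "year"))

# Flag missingness on each side
common$miss_ours   <- is.na(common$prin)
common$miss_theirs <- is.na(common$prin_b10_1)

# Remove municipality-years that are missing in both data-sets
no_common_missing <- common[!(common$miss_ours & common$miss_theirs), ]

# Count the remaining missing municipality-years per year, by data-set
missing_by_year <- aggregate(
  no_common_missing[, c("miss_ours", "miss_theirs")],
  by = list(year = no_common_missing$year),
  FUN = sum
)
missing_by_year
```

Municipalities 3 and 4 appear in only one toy input, so the inner join drops them before any missingness is counted, which mirrors the "same sample" restriction described above.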

Code

ours_input_missing_by_year <- ours_input %>% copy() %>% 
  .[is.na(prin)] %>% 
  .[, .N, year] %>% 
  rename(., Ours=N)

theirs_input_missing_by_year <- theirs_input %>% copy() %>% 
  .[is.na(prin)] %>% 
  .[, .N, year] %>% 
  rename(., Theirs=N)

input_missing_by_year <- merge(ours_input_missing_by_year, theirs_input_missing_by_year, by = "year", all=T) %>% 
  melt.data.table(id.vars=c("year"))

(a) This figure presents the number of observations in each input data-set that have missing private employment for the variable `prin`. Both data-sets are indexed at the year-municipality level.

4.1.1.2 Missing Private Employment across population sizes

A question that now emerges is why there are so many missing observations in their data for the prin variable.

One hypothesis is that the authors did not compute labor market statistics for municipality-years beyond their proposed sample of municipalities with populations between 6,793 and 47,537. Though there are indeed many observations with missing prin that fall outside the sample, the hypothesis does not fully explain the missing data, because there is an equal number of municipality-years outside the sample that have a non-missing prin. Moreover, when we look only at in-sample municipality-years, we again see a considerable number of missing prin.

Figure 4.5 shows us that a considerable portion of the missing observations would be excluded from the sample because they go beyond the prescribed municipality sizes. It also shows us that there is still a large number of missing observations that fall within these boundaries. Figure 4.6 zooms in and shows us that there is still a considerable number of missing observations.

Figure 4.7 shows us that 55% of the observations in their data are municipalities that fall between the 6,793 and 47,537 population cut-offs. This is a promising lead, as it greatly reduces the complexity of the issue. The majority of the in-sample data is not missing.

Figure 4.8 shows us that roughly half of the municipality-years that fall below the lower population cut-off have missing prin. This highlights the fact that the population cut-off is not the full reason for the large number of missing observations.
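A breakdown of this kind can be sketched as a two-way table of shares. The example below is a minimal illustration on made-up data, not the code behind Figures 4.7 and 4.8; only the bounds 6,793 and 47,537 come from the text, and the names `toy`, `in_sample`, `prin_missing`, and `shares` are hypothetical.

```r
# Toy municipality-years (made-up values, for illustration only)
toy <- data.frame(
  popibge_ours = c(3000, 5200, 8000, 12000, 30000, 46000, 49000, 60000),
  prin_theirs  = c(NA,   15,   NA,   40,    210,   300,   500,   NA)
)

# Sample bounds quoted in the text
lower <- 6793
upper <- 47537

toy$in_sample    <- toy$popibge_ours > lower & toy$popibge_ours < upper
toy$prin_missing <- is.na(toy$prin_theirs)

# Share of all rows in each (in-sample, missing) cell, in percent
shares <- round(100 * prop.table(table(
  in_sample    = toy$in_sample,
  prin_missing = toy$prin_missing
)), 1)
shares
```

Each cell is a share of all rows, so the four cells sum to 100; conditioning on one row of the table instead gives the within-group missing rate.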

Figure 4.9 shows us that when we further sub-divide the data, considering only municipality-years with populations between the 6,793 and 47,537 bounds and within ±1500 people of the FPM cut-offs, roughly 9% (\(\frac{2.79}{2.79+28.41}\)) of the sample has missing prin. Figure 4.10 presents the same breakdown for our data. In contrast to their 9%, roughly 0.2% (\(\frac{0.05}{0.05 + 31.15}\)) of our observations are missing.
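Both shares are conditional on being inside the population bounds and the ±1500 margin: each is the missing cell divided by the sum of the missing and non-missing cells. A quick check of the arithmetic, using only the cell percentages quoted above (the variable names are ours, for illustration):

```r
# Cell percentages quoted in the text (Figures 4.9 and 4.10)
theirs_missing <- 2.79
theirs_present <- 28.41
ours_missing   <- 0.05
ours_present   <- 31.15

# Share of within-margin, in-sample observations with missing prin
share_theirs <- theirs_missing / (theirs_missing + theirs_present)
share_ours   <- ours_missing / (ours_missing + ours_present)

# Express as percentages, rounded to two decimals
round(100 * c(Theirs = share_theirs, Ours = share_ours), 2)
```

Because the quoted cell values are themselves rounded to two decimals, the resulting shares should be read as approximate.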

Code

# load in population cut-offs
population_thresholds <- 
  paste0(dir_fpm, "fpm_variables/fpm_lambda_coeff_min_popest.csv") %>%
  fread() %>% 
  .[, cutoff_upper := min_popest + 1500] %>% 
  .[, cutoff_lower := min_popest - 1500]  

ours_input_pop <-  ours_input %>% 
  .[, .(year, cod6, popibge, prin)] %>% 
  rename_columns(
    current_names = c("popibge","prin"),
    new_names = c("popibge_ours","prin_ours"))

theirs_input_pop <- theirs_input %>% 
  .[, .(year, cod6, popibge, prin)] %>% 
  rename_columns(
    current_names = c("popibge","prin"),
    new_names = c("popibge_theirs","prin_theirs"))

joint_pop <- merge(ours_input_pop, theirs_input_pop, by = c("year", "cod6"), all=T)

(a) This figure plots the population distribution of municipality-years with missing private employment data `prin` in our data. We take all of the municipality-years with populations less than 50,000 (according to our population estimate) between 2002-2014 and plot the distribution of their population estimates. The dark lines highlight the upper and lower sample cut-offs; any municipality beyond these would be excluded.

Code

joint_pop_with_samples <- joint_pop  %>% 
  .[year >2001 & year < 2015] %>% 
  .[, in_sample_ours := (popibge_ours>6793 &  popibge_ours<47537)] %>% 
  .[, below_50k_ours := (popibge_ours<50000)] %>% 
  .[, below_6793_ours := (popibge_ours<6793)]

(a) This visualization breaks down the data-set by whether prin is missing in their data and whether the population falls between the upper and lower cut-offs of 6,793 and 47,537.

(a) This visualization breaks down the data-set by whether prin is missing in their data and whether the population falls below the lower 6,793 cut-off.

Code

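# add the population cut-offs to the data
# (assumes population_thresholds is sorted by min_popest in ascending order,
#  so later iterations overwrite earlier ones with the highest applicable cut-off)
for(i in seq_len(nrow(population_thresholds))){

      POP_CUTOFF <- population_thresholds[i, min_popest]
      # next cut-off up; NA on the last iteration, where no higher cut-off exists
      POP_CUTOFF_max <- population_thresholds[i+1, min_popest]

      joint_pop_with_samples[popibge_ours>=POP_CUTOFF, population_cutoff_min := POP_CUTOFF]
      joint_pop_with_samples[popibge_ours>=POP_CUTOFF, population_cutoff_max := POP_CUTOFF_max]

}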


# determine whether the observation falls within the 1500 margin 
joint_pop_with_samples %<>% 
  .[, within_margin_lower := (abs(popibge_ours-population_cutoff_min)<=1500)] %>% 
  .[, within_margin_upper := (abs(population_cutoff_max-popibge_ours)<=1500)] %>% 
  .[, within_margin := within_margin_upper==TRUE | within_margin_lower==TRUE ]  %>% 
  .[, within_margin := within_margin ==TRUE & in_sample_ours==TRUE ]

(a) This visualization breaks down the data-set by whether prin is missing in their data and whether the municipality-year population falls between the upper and lower cut-offs of 6,793 and 47,537 and within the ±1500 margin from population cut-offs.
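The loop above tags each municipality-year with the largest cut-off at or below its population and with the next cut-off up. A loop-free sketch of the same bracket assignment uses base R's `findInterval()`. The threshold values below are made up for illustration and do not come from `fpm_lambda_coeff_min_popest.csv`, the sketch assumes the thresholds are sorted in ascending order, and it omits the additional in-sample restriction.

```r
# Hypothetical thresholds, sorted ascending (not the actual FPM values)
min_popest <- c(10000, 13000, 17000, 24000)

# Toy populations, including one below the first threshold and one NA
popibge <- c(9000, 10000, 15000, 16500, 30000, NA)

# Index of the bracket each population falls in (0 = below first threshold)
idx <- findInterval(popibge, min_popest)
idx[which(idx == 0)] <- NA

# Lower and upper thresholds of the bracket; NA where none applies
population_cutoff_min <- min_popest[idx]
population_cutoff_max <- min_popest[idx + 1]

# Within 1500 people of either neighbouring threshold
within_margin <- (popibge - population_cutoff_min <= 1500) |
  (population_cutoff_max - popibge <= 1500)

data.frame(popibge, population_cutoff_min, population_cutoff_max, within_margin)
```

As in the loop, populations below the first threshold get no bracket, and the top bracket has no upper cut-off, so `within_margin` is NA there unless the lower-side condition is already TRUE.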

4.1.2 Final data-set

Here we restrict both data-sets to the municipality-years common to both. We do so to allow direct comparisons of our and their labor market statistics and FPM transfers.

Code

ours <- "main_exact_data_our_corrected_LM.dta" %>% 
  paste0(dir_transfers_replication, "data/processing/", .) %>% 
  haven::read_dta() %>% 
  as.data.table()   

theirs <- "main_exact_data_our_deflator.dta" %>% 
  paste0(dir_transfers_replication, "data/processing/", .) %>% 
  haven::read_dta() %>% 
  as.data.table()

# get the unique municipality-years in each data-set
ours_munyear <- ours[, .N, .(codigo, year)][, N:=NULL][, Ours := 1]
theirs_munyear <- theirs[, .N, .(codigo, year)][, N:=NULL][, Theirs := 1]

# merge both in order to understand what the issue is
both_all <- merge(ours_munyear, theirs_munyear, by=c("codigo", "year"), all=T)

# restrict them to the correct rows though 
ours_restricted <- both_all %>% copy() %>%  
  .[Ours==1&Theirs==1] %>% 
  merge(x=ours, y=., by=c("codigo", "year"), all=F ) %>% 
  # restrict by years as well
  .[year>=2002 & year <=2014]

theirs_restricted <- both_all %>% copy() %>%  
  .[Ours==1&Theirs==1] %>%  
  merge(x=theirs, y=., by=c("codigo", "year"), all=F )   %>% 
  # restrict by years as well
  .[year>=2002 & year <=2014]

# save data-sets -----------------

"main_exact_data_our_corrected_LM_common_sample.dta" %>% 
  paste0(dir_transfers_replication, "data/processing/", .) %>% 
  haven::write_dta(ours_restricted, .)

"main_exact_data_our_deflator_common_sample.dta" %>% 
  paste0(dir_transfers_replication, "data/processing/", .) %>% 
  haven::write_dta(theirs_restricted, .)


# -----------------

ours_restricted_sumstats <- data.table(
  "Rows" = nrow(ours_restricted),
  "Total IBGE Population" = sum(ours_restricted$popibge, na.rm=T),
  "Avg. Avg. Wage" = mean(ours_restricted$priw, na.rm=T),
  "Avg. Log Avg. Wage" = mean(ours_restricted$lpriw, na.rm=T),
  "Avg. Total Earnings" = mean(ours_restricted$prit, na.rm=T),
  "Avg. Log  Total Earnings" = mean(ours_restricted$lprit, na.rm=T),
  "Sum Total Private Workers" = sum(ours_restricted$prin, na.rm=T),
  "Avg. Log Total Private Workers" = mean(ours_restricted$lprin, na.rm=T)
) %>% t() %>% 
  as.data.table(., keep.rownames=T) %>% 
  .[, V1 := round(V1, digits = 3)] %>% 
  rename_columns(
    current_names = c("rn", "V1"), 
    new_names = c("Statistics", "Ours"))

theirs_restricted_sumstats <- data.table(
  "Rows" = nrow(theirs_restricted),
  "Total IBGE Population" = sum(theirs_restricted$popibge, na.rm=T),
  "Avg. Avg. Wage" = mean(theirs_restricted$priw, na.rm=T),
  "Avg. Log Avg. Wage" = mean(theirs_restricted$lpriw, na.rm=T),
  "Avg. Total Earnings" = mean(theirs_restricted$prit, na.rm=T),
  "Avg. Log  Total Earnings" = mean(theirs_restricted$lprit, na.rm=T),
  "Sum Total Private Workers" = sum(theirs_restricted$prin, na.rm=T),
  "Avg. Log Total Private Workers" = mean(theirs_restricted$lprin, na.rm=T)
) %>% t() %>% 
  as.data.table(., keep.rownames=T) %>% 
  .[, V1 := round(V1, digits = 3)] %>% 
  rename_columns(
    current_names = c("rn", "V1"), 
    new_names = c("Statistics", "Theirs")) 


comparisons <- merge(ours_restricted_sumstats,
theirs_restricted_sumstats, by = "Statistics", all=T) %>% 
  .[, `Equal?` := Ours == Theirs] %>% 
  .[order(`Equal?`)] %>% 
  .[, `Ours over Theirs` := round(Ours/Theirs, digits = 2)] %>% 
  .[order(`Ours over Theirs`)] 

options(scipen = 15)

comparisons[order(-`Equal?`)]

Comparing the most restricted versions of our and their data

Code

ours_restricted_earn <- ours_restricted %>% copy() %>% 
  .[, sum(prit, na.rm=T), .(year)] %>%
  .[, data_set:="Ours"]

theirs_restricted_earn <-theirs_restricted %>% copy() %>% 
  .[, sum(prit, na.rm=T), .(year)] %>%
  .[, data_set:="Theirs"]

ratio_earn <- merge(ours_restricted_earn, theirs_restricted_earn, by = "year", all=T) %>% 
  .[, ratio := round(V1.x/V1.y, digits = 2)] %>% 
  .[, .(ratio, year)]


merge(ours_restricted_earn, ratio_earn, by = "year", all=T) %>% 
rbind(theirs_restricted_earn, ., fill=T) %>% 
  .[is.na(ratio), ratio := 1] %>% 
  ggplot(aes(x=year, y = V1, color=data_set, label= scales::percent(ratio,trim = T)))  + 
    geom_text_repel(size=3, nudge_y = TRUE, nudge_x = T) +
    geom_point(size=2) +
    geom_line(size=1) +
    ylab("Total Private Earnings (2000BRL)") +
    xlab("Year") +
    scale_y_continuous(labels = scales::label_comma()) + 
    labs(color = "Data-set") +
    theme_minimal() +
    theme(
      axis.title = element_text(size=15),
      legend.title = element_text(size=20),
      legend.position = "bottom",
      text = element_text(size=20)
    )

We plot the time series of total municipal private sector earnings for Our and Their data. These are not directly comparable as they use different deflators, but the sheer difference is alarming. We use the exact same municipality-years across both samples. The y-axis represents the total amount of wages in each sample for each year. The percentage is included for the reader to understand the difference in magnitude between one sample and the other; it is constructed as the ratio of our yearly total over their yearly total.

Total yearly Private Earnings for each data-set (not log)

Code

ours_earn <- ours %>% copy() %>% 
  .[, sum(prit, na.rm=T), .(year)] %>%
  .[, data_set:="Ours"]

theirs_earn <-theirs %>% copy() %>% 
  .[, sum(prit, na.rm=T), .(year)] %>%
  .[, data_set:="Theirs"]

ratio_earn <- merge(ours_earn, theirs_earn, by = "year", all=T) %>% 
  .[, ratio := round(V1.x/V1.y, digits = 2)] %>% 
  .[, .(ratio, year)]


merge(ours_earn, ratio_earn, by = "year", all=T) %>% 
rbind(theirs_earn, ., fill=T) %>% 
  .[is.na(ratio), ratio := 1] %>% 
  .[ratio!=0] %>% 
  ggplot(aes(x=year, y = V1, color=data_set, label= scales::percent(ratio,trim = T)))  + 
    geom_text_repel(size=3, nudge_y = TRUE, nudge_x = T) +
    geom_point(size=2) +
    geom_line(size=1) +
    ylab("Total Private Earnings (2000BRL)") +
    xlab("Year") +
    scale_y_continuous(labels = scales::label_comma()) + 
    labs(color = "Data-set") +
    theme_minimal() +
    theme(
      axis.title = element_text(size=15),
      legend.title = element_text(size=20),
      legend.position = "bottom",
      text = element_text(size=20)
    )

Total yearly Private Earnings for each data-set (not log)

4.1.2.2 Yearly Average Mean Private Earnings

Restricted
Unrestricted

Code

ours_restricted_avgwage <- ours_restricted %>% copy() %>% 
  .[, mean(priw, na.rm=T), .(year)] %>%
  .[, data_set:="Ours"]

theirs_restricted_avgwage <-theirs_restricted %>% copy() %>% 
  .[, mean(priw, na.rm=T), .(year)] %>%
  .[, data_set:="Theirs"]

ratio_restricted_avgwage <- merge(ours_restricted_avgwage, theirs_restricted_avgwage, by = "year", all=T) %>% 
  .[, ratio := round(V1.x/V1.y, digits = 2)] %>% 
  .[, .(ratio, year)]

merge(ours_restricted_avgwage, ratio_restricted_avgwage, by = "year", all=T) %>% 
rbind(theirs_restricted_avgwage, ., fill=T) %>% 
  .[is.na(ratio), ratio := 1] %>% 
  ggplot(aes(x=year, y = V1, color=data_set, label= scales::percent(ratio,trim = T)))  + 
    geom_label_repel(size=4) +
    geom_point() +
    geom_line() +
    ylab("Mean Avg. Wages (2000BRL / Workers)") +
    xlab("Year") +
    scale_y_continuous(labels = scales::label_comma()) + 
    labs(color = "Data-set") +
    theme_minimal() +
    theme(
      axis.title = element_text(size=12),
      legend.title = element_text(size=20),
      legend.position = "bottom",
      text = element_text(size=20)
    )

We plot the time series of average municipal private sector earnings for our and their data. The y-axis is constructed by taking the average of the municipality-year mean wages in each year. These are not directly comparable as they use different deflators, but the sheer difference is alarming. We use the exact same municipality-years across both samples. The percentage is included for the reader to understand the difference in magnitude between one sample and the other; it is constructed as the ratio of our yearly average over their yearly average.

Average Yearly Mean Private Earnings for each data-set (not log)
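The plotted statistic is an unweighted mean: every municipality-year mean wage counts equally, regardless of how many workers it covers. The sketch below, on made-up numbers, contrasts it with a worker-weighted average to show what the series does and does not measure. The toy values and the weighted comparison are an illustration only and are not part of the replication code.

```r
# Toy municipality-years (made-up values, for illustration only)
toy <- data.frame(
  year = c(2002, 2002, 2003, 2003),
  priw = c(500, 1500, 600, 1800),   # municipality-year mean wage
  prin = c(100, 900, 100, 900)      # private sector workers
)

# Statistic used in the figure: simple mean of municipal mean wages, by year
unweighted <- tapply(toy$priw, toy$year, mean, na.rm = TRUE)

# For contrast: mean wage weighted by the number of workers, by year
weighted <- sapply(split(toy, toy$year), function(d) {
  weighted.mean(d$priw, d$prin, na.rm = TRUE)
})

rbind(unweighted, weighted)
```

When large municipalities pay higher wages, as in this toy table, the unweighted mean sits below the worker-weighted one, so the two should not be read interchangeably.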

Code

ours_avgwage <- ours %>% copy() %>% 
  .[, mean(priw, na.rm=T), .(year)] %>%
  .[, data_set:="Ours"]

theirs_avgwage <-theirs %>% copy() %>% 
  .[, mean(priw, na.rm=T), .(year)] %>%
  .[, data_set:="Theirs"]

ratio_avgwage <- merge(ours_avgwage, theirs_avgwage, by = "year", all=T) %>% 
  .[, ratio := round(V1.x/V1.y, digits = 2)] %>% 
  .[, .(ratio, year)]

merge(ours_avgwage, ratio_avgwage, by = "year", all=T) %>% 
rbind(theirs_avgwage, ., fill=T) %>% 
  .[is.na(ratio), ratio := 1] %>% 
  ggplot(aes(x=year, y = V1, color=data_set, label= scales::percent(ratio,trim = T)))  + 
    geom_label_repel(size=4) +
    geom_point() +
    geom_line() +
    ylab("Mean Avg. Wages (2000BRL / Workers)") +
    xlab("Year") +
    scale_y_continuous(labels = scales::label_comma()) + 
    labs(color = "Data-set") +
    theme_minimal() +
    theme(
      axis.title = element_text(size=12),
      legend.title = element_text(size=20),
      legend.position = "bottom",
      text = element_text(size=20)
    )

Average Yearly Mean Private Earnings for each data-set (not log)

4.2 Yearly Total Private Sector Employment

Restricted
Unrestricted

Code

# yearly totals: sum private sector workers across municipality-years
ours_restricted_pop <- ours_restricted %>% copy() %>% 
  .[, sum(prin, na.rm=T), .(year)] %>%
  .[, data_set:="Ours"]

theirs_restricted_pop <-theirs_restricted %>% copy() %>% 
  .[, sum(prin, na.rm=T), .(year)] %>%
  .[, data_set:="Theirs"]

ratio_pop <- merge(ours_restricted_pop, theirs_restricted_pop, by = "year", all=T) %>% 
  .[, ratio := round(V1.x/V1.y, digits = 2)] %>% 
  .[, .(ratio, year)]

merge(ours_restricted_pop, ratio_pop, by = "year", all=T) %>% 
rbind(theirs_restricted_pop, ., fill=T) %>% 
  .[is.na(ratio), ratio := 1] %>% 
  ggplot(aes(x=year, y = V1, color=data_set, label= scales::percent(ratio,trim = T)))  + 
    geom_label_repel(size=4) +
    geom_point() +
    geom_line() +
    ylab("Total Private Sector Employment") +
    xlab("Year") +
    scale_y_continuous(labels = scales::label_comma()) + 
    theme_minimal() +
    theme(
      axis.title = element_text(size=15),
      legend.title = element_blank(),
      legend.position = "bottom",
      text = element_text(size=20)
    )

We plot the time series of total municipal private sector employment for our and their data. The y-axis is constructed by summing the number of private sector workers across the municipality-years in each year. These are directly comparable. We use the exact same municipality-years across both samples. The labels present the ratio (in percent) of our yearly private sector workforce estimate over theirs.

Total Yearly Private Sector Employment (not log)

Code

# yearly totals: sum private sector workers across municipality-years
ours_pop <- ours %>% copy() %>% 
  .[, sum(prin, na.rm=T), .(year)] %>%
  .[, data_set:="Ours"]

theirs_pop <-theirs %>% copy() %>% 
  .[, sum(prin, na.rm=T), .(year)] %>%
  .[, data_set:="Theirs"]

ratio_pop <- merge(ours_pop, theirs_pop, by = "year", all=T) %>% 
  .[, ratio := round(V1.x/V1.y, digits = 2)] %>% 
  .[, .(ratio, year)]

merge(ours_pop, ratio_pop, by = "year", all=T) %>% 
rbind(theirs_pop, ., fill=T) %>% 
  .[is.na(ratio), ratio := 1] %>% 
  ggplot(aes(x=year, y = V1, color=data_set, label= scales::percent(ratio,trim = T)))  + 
    geom_label_repel(size=4) +
    geom_point() +
    geom_line() +
    ylab("Total Private Sector Employment") +
    xlab("Year") +
    scale_y_continuous(labels = scales::label_comma()) + 
    theme_minimal() +
    theme(
      axis.title = element_text(size=15),
      legend.title = element_blank(),
      legend.position = "bottom",
      text = element_text(size=20)
    )

Total Yearly Private Sector Employment (not log)