6  Comparing Inputs

6.1 Overview

In this page, we look at differences in the inputs which create theirs vs our figures. In Section 6.2 we investigate differences across population estimates. In Section 6.3 we investigate differences in the FPM transfers data-set; starting with the nominal FPM transfers (Section 6.3.1) then moving onto special cases which require 1997 data (Section 6.3.2). Lastly, we dive into differences in the labor market statistics.

6.2 Population

Code
pop_theirs <- paste0(dir_transfers_replication, "data/pop_1997_2014.dta") %>% haven::read_dta() %>% 
  as.data.table() %>% 
  .[, .(cod6, year, popibge)] %>%
  rename(popibge_theirs = popibge)

pop_ours <- paste0(dir_transfers_replication, "data/pop_2001_2018.dta") %>% haven::read_dta() %>% 
  as.data.table() %>% 
    .[, .(cod6, year, popibge)] %>%
  rename(popibge_ours = popibge)

# our cleaned population data with imputed values for 2010 
ibge <- paste0(dir_eemt, "data/input/population_data/population_mun_year.csv") %>% 
  fread() %>% 
  .[, cod6 := municipio] %>% 
  .[, .(cod6, year, pop_est)]

ibge_info <- paste0(dir_transfers_replication, "data/pop_2001_2018.dta") %>% haven::read_dta() %>% 
  as.data.table() %>% 
  .[, .(cod6, codigo, sigla, mesocod, microcod)] %>% 
  .[!duplicated(cod6)]

ibge %>% rename(., popibge=pop_est) %>% 
  merge(., ibge_info, by = "cod6", all=T) %>% 
  haven::write_dta(., paste0(dir_transfers_replication, "data/pop_ibge_2000_2009_2011_2021.dta"))

6.2.1 How many municipalities in common and missing?

Code
pop <- merge(pop_theirs, pop_ours, by = c("cod6", "year"), all=T)  %>% 
  .[year > 2000] %>% 
  .[year <= 2014]  
  
pop_common <- pop %>% copy() %>% .[popibge_theirs==popibge_ours]
pop_not <- pop %>% copy() %>% .[popibge_theirs!=popibge_ours]

pop_both_missing <- pop %>% .[is.na(popibge_theirs) & is.na(popibge_ours)]
pop_our_missing_theirs_present <- pop %>% .[!is.na(popibge_theirs) & is.na(popibge_ours)]
pop_our_present_theirs_missing <- pop %>% .[is.na(popibge_theirs) & !is.na(popibge_ours)]
Code
data.table(
  Info = c(
    "Total 2001-2014",
    "Same Population", 
    "Different Population, not missing", 
    "Both missing", 
    "Ours missing, theirs present", 
    "Theirs missing, ours present"),
  Values = c(
    nrow(pop),
    nrow(pop_common),
    nrow(pop_not),
    nrow(pop_both_missing),
    nrow(pop_our_missing_theirs_present),
    nrow(pop_our_present_theirs_missing)
  )
  )

Table 6.1: IBGE Population across samples

(a)

The following table presents comparisons between our and their population data-sets.

Of the 78,344 municipalities-years after 2000 available; both our and their data-sets share the same values for 70,566 municipality-years and differ (without counting missing data) across 6,943. While both data-sets share 437 missing municipalities, we have 392 additional missing municipality-years. Their data, on the other hand, has 6 missing municipality years.

6.2.2 Which municipalities are they missing?

They are missing municipalities which were admittedly tricky to deal with when I cleaned the data but which we dealt with correctly and they didn’t. These 6 municipalities appear in the data-set in 2009 or 2013. Since we use the FPM transfer data with the original IBGE estimates available to the TCU in the year prior, rather than the IBGE population data, we were able to catch these.

Code
pop_our_present_theirs_missing %>% 
  .[popibge_ours>6790 & popibge_ours<47600] %>% 
  ggplot(., aes(x=popibge_ours)) + 
  geom_histogram(bins = 100) + 
  geom_vline(xintercept = 47544, linetype="dashed") + 
  geom_vline(xintercept = 6793) + 
  ylab("Count") + 
  xlab("Population Ours") + 
  theme_bw() 

Figure 6.1: Population of missing municipality-years
Code
pop_our_present_theirs_missing %>% 
  ggplot(., aes(x=popibge_ours)) + 
  geom_histogram(bins = 100) + 
  geom_vline(xintercept = 47544, linetype="dashed") + 
  geom_vline(xintercept = 6793)  + 
  ylab("Count") + 
  xlab("Population Ours") + 
  theme_bw() 

Figure 6.2: Population of missing municipality-years
Code
pop_our_present_theirs_missing %>% 
  merge(ibge_info, . , by = c("cod6"), all.x=F, all.y=T)
Figure 6.3: Population of missing municipality-years

6.2.3 Which municipalities are we missing?

Figures Figure 6.4 and Figure 6.5 highlight that there are no municipalities within the analysis sample that have missing population data. Additionally, in our do file, we do not exclude observations that do not get merged, so premature dropping (which would impact total FPM computations) is not an issue for us.

Code
pop_our_missing_theirs_present %>% 
  .[popibge_theirs>6790 & popibge_theirs<47600] %>% 
  ggplot(., aes(x=popibge_theirs)) + 
  geom_histogram(bins = 100) + 
  geom_vline(xintercept = 47544, linetype="dashed") + 
  geom_vline(xintercept = 6793) + 
  ylab("Count") + 
  xlab("Population Theirs") + 
  theme_bw() 

Figure 6.4: Population of missing municipality-years
Code
pop_our_missing_theirs_present %>% 
  ggplot(., aes(x=popibge_theirs)) + 
  geom_histogram(bins = 100) + 
  geom_vline(xintercept = 47544, linetype="dashed") + 
  geom_vline(xintercept = 6793)  + 
  ylab("Count") + 
  xlab("Population Theirs") + 
  theme_bw() 

Figure 6.5: Population of missing municipality-years

6.2.4 Which municipalities are missing in both?

Municipalities missing in both data-sets is partially an artifact of the data wrangling done below and has no impact in the final analysis. This occurs because there are some municipalities which appear and dissapear in the full panel. Below (Table 6.2) highlights these municipalities.

Additionally, there are some odd municipalities in their data-set which seemingly do not have any real world equivalents. (e.g. 159991) If this isn’t an error, then I would guess it is an old municipality code which does not exist.

Code
# pop_both_missing %>%
#   merge(ibge_info, . , by = c("cod6"), all.x=F, all.y=T) %>% 
#   .[, unique(cod6)] 


pop %>%
  merge(ibge_info, . , by = c("cod6"), all.x=F, all.y=T) %>% 
  .[cod6%in%c(150475, 159991, 159993, 220095, 220672, 231029, 239991, 239992, 239993, 269991, 269992, 269993, 299991, 299992, 299993, 299994, 329991, 329992, 339991, 339992, 359991, 419991, 419992, 419993, 419994, 421265, 422000, 429991, 431454, 439991, 439992, 500390, 500627, 510452, 510454, 519991)]
Table 6.2:

Missing Population data in both

Code
# 
# pop_ours[cod6%in%c(150475, 159991, 159993, 220095, 220672, 231029, 239991, 239992, 239993, 269991, 269992, 269993, 299991, 299992, 299993, 299994, 329991, 329992, 339991, 339992, 359991, 419991, 419992, 419993, 419994, 421265, 422000, 429991, 431454, 439991, 439992, 500390, 500627, 510452, 510454, 519991)]
Code
# pop_ours[cod6==159991]
pop_theirs[cod6==159991]

6.2.5 Which municipalities are different but not missing?

There are roughly 7,000 municipalities with mis-matching data. The issue occurs in all states but is mostly limited to 2010 and 2011. Most differences are relatively small, but they can extend up til 20,000 people.

Code
pop_not %>% 
  ggplot(., aes(x=year)) + geom_histogram(bins = 100) + 
  ylab("Count") + 
  xlab("Year") + 
  theme_bw() 

Figure 6.6: Distribution of differing populations by year.
Code
pop_not %>% copy() %>% 
  .[, uf := floor(cod6/10000)] %>% 
  ggplot(., aes(x=uf)) + geom_histogram(bins = 100) + 
  ylab("Count") + 
  xlab("State Code") + 
  theme_bw() 

Figure 6.7: Distribution of differing populations by year.
Code
pop_not %>% copy() %>% 
  .[, diff := popibge_theirs - popibge_ours] %>% 
  ggplot(., aes(x=popibge_theirs, y = diff, colour=year)) + 
  geom_point() + 
    geom_abline(slope = 0) + 
  ylab("Difference Theirs-Ours") + 
  xlab("Population Theirs") + 
  labs(colour="Year") + 
  theme_bw() + 
  scale_color_binned(breaks=seq(2001, 2014, 2)) + 
  theme(legend.position = "bottom", legend.text = element_text(angle = 90, size = 9)) + 
  geom_vline(xintercept = 47544, linetype="dashed") + 
  geom_vline(xintercept = 6793) 

Figure 6.8: Comparing Population Differences
Code
pop_not %>% copy() %>% 
    .[popibge_theirs>6790 & popibge_theirs<47600] %>% 
  .[, diff := popibge_theirs - popibge_ours] %>% 
  ggplot(., aes(x=popibge_theirs, y = popibge_ours, colour= diff)) + 
  geom_point() + 
    geom_abline(slope = 1) +
  ylab("Population Ours") + 
  xlab("Population Theirs") + 
  labs(colour="Difference Theirs-Ours") + 
  xlim(c(6000, 70000)) + 
  ylim(c(6000, 70000)) + 
  theme_bw() + 
  scale_color_binned() + 
  theme(legend.position = "bottom",
        legend.text = element_text(angle = 90)) + 
  geom_vline(xintercept = 47544, linetype="dashed") + 
  geom_vline(xintercept = 6793) 

Figure 6.9: Comparing Population Differences
Code
pop_not %>% copy() %>% 
  .[year==2011] %>% 
    .[popibge_theirs>6790 & popibge_theirs<47600] %>% 
  .[, diff := popibge_theirs - popibge_ours] %>% 
  ggplot(., aes(x=popibge_theirs, y = popibge_ours, colour= diff)) + 
  geom_point() + 
    geom_abline(slope = 1) +
  ylab("Population Ours") + 
  xlab("Population Theirs") + 
  labs(colour="Difference Theirs-Ours") + 
  xlim(c(6000, 70000)) + 
  ylim(c(6000, 70000)) + 
  theme_bw() + 
  scale_color_binned() + 
  theme(legend.position = "bottom",
        legend.text = element_text(angle = 90)) + 
  geom_vline(xintercept = 47544, linetype="dashed") + 
  geom_vline(xintercept = 6793) 

Figure 6.10: Comparing Population Differences in 2011

6.2.6 What could explain these inconsistent municipalities?

First, as shown above, the deviation is both positive and negative and appears relatively symmetrical.

6.2.6.1 2010 has no official estimates online.

The reason for inconsistent estimates in 2010 is that there is no official estimates/documentation on the IBGE page for that year. Moreover, the authors do not state where exactly they source their data from, apart from stating it comes from IBGE. We checked the paper, the online appendix and the readme files and found no reference to this.

Our main data comes straight from the documentation used to compute the FPM coefficients, so it is bound to be the best source, at least in theory. When we check online, in say Wikipedia, which also cites IBGE without a clear link to the data, their estimates differ from ours and their as well: e.g. for cod6==110001 & year==2010, here are the following estimates for our, theirs and the Wikipedia figures: 24,422; 24,392 and 24,577.

We also clean data from the online web-page, but, as stated above, are missing estimates for 2010.

6.2.6.2 2011 illustrates their data comes from the same IBGE source we found

There is a perfect correlation between the IBGE data we found and cleaned online in the IBGE page and the author’s statistics. This implies that, beyond the primary sources data from the FPM documentation, our other IBGE data-set comes from the same source of data as theirs.

Code
# mere it in
pop_not_ibge <- pop_not %>% 
  merge(x=., y = ibge, by = c("cod6", "year"), all.x=T, all.y=F)

# 2010 -----

# no data for 2010
# pop_not_ibge[year==2010][cod6==110001]

# 2011 -----

# pop_not_ibge %>%
#   .[year==2011] %>%
#   .[, cor(pop_est, popibge_theirs)]

# pop_not_ibge %>%
#   .[year==2011] %>%
#   .[, cor(pop_est, popibge_ours)]

6.2.7 Is the alternate IBGE data we collected exactly like their data?

Yes, except we do not have 2010 data and 2007 has different estimates. Coincidentally, 2007 is the census year. We check to see if this is the source of the problems and discover that indeed it is. We went to the 2007 census webpage and downloaded the municipality level population data. We compare municipio==110001 & year==2007. Their estimate is consistent with the 2007 census estimate of 11,520; while our IBGE estimate is 12,241.

Code
pop2 <- merge(pop_theirs, ibge, by = c("cod6", "year"), all=T)  %>% 
  .[year <= 2014]  %>% 
  .[year>=2000]
  
pop_common2 <- pop2 %>% copy() %>% .[popibge_theirs==pop_est]
pop_not2 <- pop2 %>% copy() %>% .[popibge_theirs!=pop_est]
pop_both_missing2 <- pop2 %>% .[is.na(popibge_theirs) & is.na(pop_est)]
pop_our_missing_theirs_present2 <- pop2 %>% .[!is.na(popibge_theirs) & is.na(pop_est)]
pop_our_present_theirs_missing2 <- pop2 %>% .[is.na(popibge_theirs) & !is.na(pop_est)]

# acrelandia: 1200013 ==> census ==   11 520
# pop_not2[cod6==120001]
Code
data.table(
  Info = c(
    "Total 2000-2014",
    "Same Population", 
    "Different Population, not missing", 
    "Both missing", 
    "Ours missing, theirs present", 
    "Theirs missing, ours present"),
  Values = c(
    nrow(pop2),
    nrow(pop_common2),
    nrow(pop_not2),
    nrow(pop_both_missing2),
    nrow(pop_our_missing_theirs_present2),
    nrow(pop_our_present_theirs_missing2)
  )
  )

Table 6.3: IBGE Population across samples

(a)

The following table presents comparisons between our and their population data-sets.

6.2.7.1 Which municipalities are they missing?

Code
pop_our_present_theirs_missing2

Municipalities missing in their data but present in ours*

6.2.7.2 Which municipalities are we missing?

We are missing all of the 2010 municipalities because we were unable to find data for that year in the IBGE web-site.

Code
# pop_our_missing_theirs_present2

6.2.8 Which municipalities are missing in both?

Municipalities missing in both data-sets is an artifact of the data wrangling done below and has no impact in the final analysis. This occurs because there are some municipalities which appear and dissapear in the full panel. Below (Table 6.4) highlights these municipalities.

Code
# pop_both_missing2 %>%
#   merge(ibge_info, . , by = c("cod6"), all.x=F, all.y=T) %>%
#   .[, unique(cod6)] %>% 
#   return_in_vector_format()

pop2 %>%
  merge(ibge_info, . , by = c("cod6"), all.x=F, all.y=T) %>% 
  .[cod6%in%c(150475, 159991, 159993, 220095, 220672, 220779, 231029, 239991, 239992, 239993, 240615, 269991, 269992, 269993, 270375, 290327, 291955, 299991, 299992, 299993, 299994, 320225, 329991, 329992, 330285, 339991, 339992, 359991, 419991, 419992, 419993, 419994, 421265, 422000, 429991, 430003, 430047, 430107, 430222, 430223, 430258, 430461, 430462, 430465, 430583, 430593, 430613, 430843, 431065, 431087, 431123, 431217, 431346, 431413, 431417, 431446, 431454, 431531, 431595, 431673, 431697, 431861, 431936, 432146, 432377, 439991, 439992, 500390, 500627, 510185, 510325, 510336, 510343, 510452, 510454, 510617, 510619, 510631, 510757, 510774, 510776, 510779, 510788, 510835, 519991, 520485, 520815, 521015, 521225)]
Table 6.4:

Missing Population data in both

6.2.8.1 Which municipalities are different but not missing?

Code
pop_not2 %>% 
  ggplot(., aes(x=year)) + geom_histogram(bins = 100) + 
  scale_x_continuous(breaks = c(2000:2014)) +
  ylab("Count") + 
  xlab("Year") + 
  theme_bw() 

Figure 6.11: Distribution of differing populations by year.
Code
pop_not2 %>% copy() %>% 
  .[, uf := floor(cod6/10000)] %>% 
  ggplot(., aes(x=uf)) + geom_histogram(bins = 100) + 
  ylab("Count") + 
  xlab("State Code") + 
  theme_bw() 

Figure 6.12: Distribution of differing populations by year.
Code
pop_not2 %>% copy() %>% 
  .[, diff := popibge_theirs - pop_est] %>% 
  ggplot(., aes(x=popibge_theirs, y = diff, colour=year)) + 
    geom_abline(slope = 0) + 
  geom_point() + 
  ylab("Difference Theirs-Ours*") + 
  xlab("Population Theirs") + 
  labs(colour="Year") + 
  theme_bw() + 
  scale_color_binned(breaks=c(2007:2014)) + 
  theme(legend.position = "bottom", legend.text = element_text(angle = 90)) + 
  geom_vline(xintercept = 47544, linetype="dashed") + 
  geom_vline(xintercept = 6793) 

Figure 6.13: Comparing Population Differences
Code
pop_not2 %>% copy() %>% 
    .[popibge_theirs>6790 & popibge_theirs<47600] %>% 
  .[, diff := popibge_theirs - pop_est] %>% 
  ggplot(., aes(x=popibge_theirs, y = pop_est, colour= diff)) + 
  geom_point() + 
  geom_abline(slope = 1) + 
  ylab("Population Ours*") + 
  xlab("Population Theirs") + 
  labs(colour="Difference Theirs-Ours*") + 
  xlim(c(6000, 70000)) + 
  ylim(c(6000, 70000)) + 
  theme_bw() + 
  scale_color_binned(breaks=c(-2000, -1000, 0, 1000)) + 
  theme(legend.position = "bottom",
        legend.text = element_text(angle = 90)) + 
  geom_vline(xintercept = 47544, linetype="dashed") + 
  geom_vline(xintercept = 6793) 

Figure 6.14: Comparing Population Differences

6.3 FPM transfers

In this section, we compare their and our FPM data.

In terms of nominal yearly transfers, both data-sets appear to be consistent with one another. Minor differences in rounding lead to small differences in transfer estimates for 22% of all municipality-years. These differences are too small to matter. Additionally, differences in cleaning lead to two municipalities having a month’s worth of FPM transfers more in their data; because while we include negative FPM transfers in our computation, the authors omit them. All in all, the nominal municipality-year FPM transfers should not be a cause for concern and would not lead to the differences seen thus far in the replication.

Code
fpm_theirs_dt <- paste0(dir_transfers_replication, "data/fpm.dta") %>% haven::read_dta() %>% 
  as.data.table() 

fpm_ours_dt <- paste0(dir_transfers_replication, "data/fpm_ours.dta") %>% haven::read_dta() %>% 
  as.data.table() 

6.3.1 Data on Actual FPM transfer

First, we check both data-sets on the actual nominal fpm variable to determine whether there are any structural inconsistencies with the FPM data. In theory, we are aiming to have ALL municipalities the country for each year, because we require all available FPM transfers to compute the total yearly budget.

Code
fpm_actual_ours <- fpm_ours_dt %>% 
  copy() %>% 
  .[, .(cod6, year, fpm)] %>% 
  .[, fpm := as.integer(fpm)] %>% 
  rename(., fpm_ours = fpm)

fpm_actual_theirs <- fpm_theirs_dt %>% 
  copy() %>% 
  .[, .(cod6, year, fpm)] %>% 
    .[, fpm := as.integer(fpm)] %>% 
  rename(., fpm_theirs = fpm)

fpm_both <- merge(fpm_actual_ours,
      fpm_actual_theirs,
      by = c("cod6", "year"),
      all = T)

6.3.1.1 Differences in data-sets

Our data-set spans 2001-2019 while theirs span 1996-2014. For the 2001-2019 period, there is a total of 77,895 unique municipality-years across both data-sets.

Shockingly, a large number of municipality-year fpm transfers are not equal across data-sets. As it turns out, these differences are a result of minor rounding erros and (in two cases) differences in how each research team handles negative transfers; we include them as they are while they replace negatives with NA. All in all though, there are no significant differences across the actual FPM transfers.

Code
fpm_common <- fpm_both %>% copy() %>% .[fpm_theirs==fpm_ours]
fpm_not <- fpm_both %>% copy() %>% .[fpm_theirs!=fpm_ours]
fpm_both_missing <- fpm_both %>% .[is.na(fpm_theirs) & is.na(fpm_ours)]
fpm_our_missing_theirs_present <- fpm_both %>% .[!is.na(fpm_theirs) & is.na(fpm_ours)]
fpm_our_present_theirs_missing <- fpm_both %>% .[is.na(fpm_theirs) & !is.na(fpm_ours)]
Code
data.table(
  Info = c(
    "Total 2001-2014",
    "Same FPM", 
    "Different FPM, not missing", 
    "Both missing", 
    "Ours missing, theirs present", 
    "Theirs missing, ours present", 
    "Mean FPM Ours", 
    "Mean FPM Theirs"
    ),
  Values = c(
    nrow(fpm_both[year>=2001 & year<=2014]),
    nrow(fpm_common[year>=2001 & year<=2014]),
    nrow(fpm_not[year>=2001 & year<=2014]),
    nrow(fpm_both_missing[year>=2001 & year<=2014]),
    nrow(fpm_our_missing_theirs_present[year>=2001 & year<=2014]),
    nrow(fpm_our_present_theirs_missing[year>=2001 & year<=2014]), 
    fpm_both[year>=2001 & year<=2014][, mean(fpm_ours, na.rm=T)], 
    fpm_both[year>=2001 & year<=2014][, mean(fpm_theirs, na.rm=T)] 
  )
  )
Table 6.5:

FPM transfers across samples - 2001-2014

Code
data.table(
  Info = c(
    "Total 2000-2014",
    "Same FPM", 
    "Different FPM, not missing", 
    "Both missing", 
    "Ours missing, theirs present", 
    "Theirs missing, ours present"),
  Values = c(
    nrow(fpm_both),
    nrow(fpm_common),
    nrow(fpm_not),
    nrow(fpm_both_missing),
    nrow(fpm_our_missing_theirs_present),
    nrow(fpm_our_present_theirs_missing)
  )
  )
Table 6.6:

FPM transfers across samples

6.3.1.2 Summary statistics for mismatched data

All of the mismatched data is a result of decisions taken when cleaning the data-set. The vast majority (99%) of mismatched data stems from rounding errors while the remaining few are a result of how each team cleans negative transfers. While they remove negative transfers from the yearly total, we include them. In ?tbl-sumstats_different_fpm, we present summary statistics for the differing data. In @

Below, we show that the two outlieRs are an artifact of the cleaning methodologies. The large differences for (Adamantina, SP) cod6==350010 & year==2001 and (Feira da Mata, BA) cod6==291077 & year=2004 are due to how each research team handles negative values in FPM data-set for specific months. It is unclear whether these should be positive values or whether they are analogous to refunds, but they are two minor edge cases and should not have a large impact on the final outcomes.

Code
fpm_both %>% 
  copy() %>%
  .[, diff := fpm_theirs - fpm_ours] %>% 
  .[diff!=0] %>% 
  .[, quantile(diff, probs = c(0, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1))]

?(caption)

     0%      1%      5%     25%     50%     75%     95%     99%    100% 
-252626      -1       1       1       1       1       1       2      27 
Code
fpm_both %>% 
  copy() %>%
  .[, diff := fpm_theirs - fpm_ours] %>% 
  .[diff!=0] %>% 
  .[abs(diff)>1]  %>% 
  .[order(-abs(diff))]
Table 6.7:

Different FPMs by Difference

Code
knitr::include_graphics("images/issues_with_theirs/FPM_monthly_tranfers.png")

Code
knitr::include_graphics("images/issues_with_theirs/FPM_monthly_tranfers2.png")

6.3.1.3 Rounding down to the nearest 1000 BRL resolves most differences

Code
fpm_both2 <- fpm_both %>%
  copy() %>%
  .[, `:=`(fpm_theirs = (floor(fpm_theirs / 1000) * 1000),
           fpm_ours = (floor(fpm_ours / 1000) * 1000))]


fpm_common2 <- fpm_both2 %>% copy() %>% .[fpm_theirs==fpm_ours]
fpm_not2 <- fpm_both2 %>% copy() %>% .[fpm_theirs!=fpm_ours]
fpm_both_missing2 <- fpm_both2 %>% .[is.na(fpm_theirs) & is.na(fpm_ours)]
fpm_our_missing_theirs_present2 <- fpm_both2 %>% .[!is.na(fpm_theirs) & is.na(fpm_ours)]
fpm_our_present_theirs_missing2 <- fpm_both2 %>% .[is.na(fpm_theirs) & !is.na(fpm_ours)]
Code
data.table(
  Info = c(
    "Total 2001-2014",
    "Same FPM", 
    "Different FPM, not missing", 
    "Both missing", 
    "Ours missing, theirs present", 
    "Theirs missing, ours present"),
  Values = c(
    nrow(fpm_both2[year>=2001 & year<=2014]),
    nrow(fpm_common2[year>=2001 & year<=2014]),
    nrow(fpm_not2[year>=2001 & year<=2014]),
    nrow(fpm_both_missing2[year>=2001 & year<=2014]),
    nrow(fpm_our_missing_theirs_present2[year>=2001 & year<=2014]),
    nrow(fpm_our_present_theirs_missing2[year>=2001 & year<=2014])
  )
  )
Table 6.8:

FPM transfers across samples - 2001-2014 - Rounded

6.3.1.4 Observations per year

There is very similar number of observations per year.

Code
fpm_year_comp <- merge(fpm_actual_ours[, .N, year], fpm_actual_theirs[, .N, year], by = "year", all=T) %>% 
  rename(., Ours = N.x) %>% 
  rename(., Their = N.y) %>% 
  melt.data.table(id.vars = "year")
Code
fpm_year_comp %>% 
  .[year>=2001] %>% 
  .[year<=2014] %>% 
  ggplot(aes(x=year, y = value, color=variable)) + 
  geom_point() + 
  geom_line() + 
  scale_y_continuous(breaks = seq(5555,5800,1))

Code
fpm_year_comp %>% 
  ggplot(aes(x=year, y = value, color=variable)) + 
  geom_point() + 
  geom_line()
Warning: Removed 10 rows containing missing values (`geom_point()`).
Warning: Removed 10 rows containing missing values (`geom_line()`).

6.3.1.5 Missing data in their fpm data-set

We detect inconsistent FPM transfers in their data-set (see Table 6.9). Namely, there are 10 matched observations which contain fpm transfer data but are not associated with a municipality cod6. Five of these correspond to actual municipality-years. Given how few there are, we do not investigate much further beyond explaining how to back track the municipality codes for these observations.

Despite not having associated municipality codes, we are able to associate the FPM transfers from their data to the state-year it belongs to and could even try to piece together what municipality has incorrect missing data. Each state-year’s FPM lump-sum is a function of federal tax revenues and each state’s fixed-share of the total FPM amount is unique. Additionally, all municipalities in a state-year within the same population bounds, will be assigned the same interior FPM amount. This logic breaks down only if the municipality in question also recieves reserve transfers. Therefore, we can assume that every state-year’s FPM lumpsum is unique, all municipalities with the same interior coefficient receive the same transfer and can thus infer which state-year-population bucket their missing data comes from.

For instance, below we show how we were able to determine that the missing observation for 2001 with an FPM transfer (their estimate) of 1,194,582.2 BRL was actually municipality cod6==431454. This is the only municipality with this same amount of transfers in 2001 missing in their data but present in ours.

Code
fpm_both[is.na(cod6)]
Table 6.9:

Inconsistent missing data in their data-set

Code
fpm_both %>% 
  copy() %>% 
  .[ , likely_candidate := fpm_ours==1194582 & is.na(fpm_theirs)] %>% 
  .[fpm_ours==1194582] %>% 
  .[order(-likely_candidate)]

6.3.2 Differences in 1997 FPM coefficients

In this section, we look at differences in the 1997 FPM coefficients.

Code
fpm_special_ours <- fpm_ours_dt %>% 
  copy() %>% 
  .[, .(cod6, year,coeff97, state_capital, special_case)] %>% 
    .[, coeff97 := round(coeff97, digits = 2)] %>% 
  rename(., coeff97_ours = coeff97)

fpm_special_theirs <- fpm_theirs_dt %>% 
  copy() %>% 
  .[, .(cod6, year,coeff97)] %>% 
  .[, coeff97 := round(coeff97, digits = 2)] %>% 
  rename(., coeff97_theirs = coeff97)

fpm_special_both <- merge(
  fpm_special_ours,
  fpm_special_theirs,
  by = c("cod6", "year"),
  all = T
)

fpm_special_common <- fpm_special_both %>% copy() %>% .[coeff97_theirs==coeff97_ours]
fpm_special_not <- fpm_special_both %>% copy() %>% .[coeff97_theirs!=coeff97_ours]
fpm_special_both_missing <- fpm_special_both %>% .[is.na(coeff97_theirs) & is.na(coeff97_ours)]
fpm_special_our_missing_theirs_present <- fpm_special_both %>% .[!is.na(coeff97_theirs) & is.na(coeff97_ours)]
fpm_special_our_present_theirs_missing <- fpm_special_both %>% .[is.na(coeff97_theirs) & !is.na(coeff97_ours)]

6.3.2.1 Differences in data-sets

Table 6.10 highlights that there is a large number of observations whose 1997 coefficients do not match up. That said, this could simply be the result of not having 1997 coefficients for municipality-years that are not special cases. We test this in Table 6.12 and Table 6.13. Table 6.12 looks at observations between 2001-2007 which we categorize as being special cases. Of the 8908 municipality-years that are special cases, 578 (6%) have missing data. Conversely, Table 6.13 shows 2,062 of 24,297 municipality-years which are not special cases have the 1997 coefficient.

Code
data.table(
  Info = c(
    "Total 2001-2014",
    "Same FPM", 
    "Different FPM, not missing", 
    "Both missing", 
    "Ours missing, theirs present", 
    "Theirs missing, ours present", 
    "Mean FPM Ours", 
    "Mean FPM Theirs"
    ),
  Values = c(
    nrow(fpm_special_both[year>=2001 & year<=2014]),
    nrow(fpm_special_common[year>=2001 & year<=2014]),
    nrow(fpm_special_not[year>=2001 & year<=2014]),
    nrow(fpm_special_both_missing[year>=2001 & year<=2014]),
    nrow(fpm_special_our_missing_theirs_present[year>=2001 & year<=2014]),
    nrow(fpm_special_our_present_theirs_missing[year>=2001 & year<=2014]), 
    fpm_special_both[year>=2001 & year<=2014][, mean(coeff97_ours, na.rm=T)], 
    fpm_special_both[year>=2001 & year<=2014][, mean(coeff97_theirs, na.rm=T)] 
  )
  )
Table 6.10:

FPM 1997 coefficients across samples - 2001-2014

Code
data.table(
  Info = c(
    "Total 2001-2007",
    "Same FPM", 
    "Different FPM, not missing", 
    "Both missing", 
    "Ours missing, theirs present", 
    "Theirs missing, ours present", 
    "Mean FPM Ours", 
    "Mean FPM Theirs"
    ),
  Values = c(
    nrow(fpm_special_both[year>=2001 & year<=2007]),
    nrow(fpm_special_common[year>=2001 & year<=2007]),
    nrow(fpm_special_not[year>=2001 & year<=2007]),
    nrow(fpm_special_both_missing[year>=2001 & year<=2007]),
    nrow(fpm_special_our_missing_theirs_present[year>=2001 & year<=2007]),
    nrow(fpm_special_our_present_theirs_missing[year>=2001 & year<=2007]), 
    fpm_special_both[year>=2001 & year<=2007][, mean(coeff97_ours, na.rm=T)], 
    fpm_special_both[year>=2001 & year<=2007][, mean(coeff97_theirs, na.rm=T)] 
  )
  )
Table 6.11:

FPM 1997 coefficients across samples - 2001-2007

Code
data.table(
  Info = c(
    "Total 2001-2007",
    "Same FPM", 
    "Different FPM, not missing", 
    "Both missing", 
    "Ours missing, theirs present", 
    "Theirs missing, ours present", 
    "Mean FPM Ours", 
    "Mean FPM Theirs"
    ),
  Values = c(
    nrow(fpm_special_both[year>=2001 & year<=2007 & special_case==TRUE]),
    nrow(fpm_special_common[year>=2001 & year<=2007 & special_case==TRUE]),
    nrow(fpm_special_not[year>=2001 & year<=2007 & special_case==TRUE]),
    nrow(fpm_special_both_missing[year>=2001 & year<=2007 & special_case==TRUE]),
    nrow(fpm_special_our_missing_theirs_present[year>=2001 & year<=2007 & special_case==TRUE]),
    nrow(fpm_special_our_present_theirs_missing[year>=2001 & year<=2007 & special_case==TRUE]), 
    fpm_special_both[year>=2001 & year<=2007 & special_case==TRUE][, mean(coeff97_ours, na.rm=T)], 
    fpm_special_both[year>=2001 & year<=2007 & special_case==TRUE][, mean(coeff97_theirs, na.rm=T)] 
  )
  )
Table 6.12:

FPM 1997 coefficients across samples - 2001-2007 - Special case

Code
data.table(
  Info = c(
    "Total 2001-2007",
    "Same FPM", 
    "Different FPM, not missing", 
    "Both missing", 
    "Ours missing, theirs present", 
    "Theirs missing, ours present", 
    "Mean FPM Ours", 
    "Mean FPM Theirs"
    ),
  Values = c(
    nrow(fpm_special_both[year>=2001 & year<=2007 & special_case==FALSE]),
    nrow(fpm_special_common[year>=2001 & year<=2007 & special_case==FALSE]),
    nrow(fpm_special_not[year>=2001 & year<=2007 & special_case==FALSE]),
    nrow(fpm_special_both_missing[year>=2001 & year<=2007 & special_case==FALSE]),
    nrow(fpm_special_our_missing_theirs_present[year>=2001 & year<=2007 & special_case==FALSE]),
    nrow(fpm_special_our_present_theirs_missing[year>=2001 & year<=2007 & special_case==FALSE]), 
    fpm_special_both[year>=2001 & year<=2007 & special_case==FALSE][, mean(coeff97_ours, na.rm=T)], 
    fpm_special_both[year>=2001 & year<=2007 & special_case==FALSE][, mean(coeff97_theirs, na.rm=T)] 
  )
  )
Table 6.13:

FPM 1997 coefficients across samples - 2001-2007 - Special case

Code
data.table(
  Info = c(
    "Total 2000-2014",
    "Same FPM", 
    "Different FPM, not missing", 
    "Both missing", 
    "Ours missing, theirs present", 
    "Theirs missing, ours present"),
  Values = c(
    nrow(fpm_special_both),
    nrow(fpm_special_common),
    nrow(fpm_special_not),
    nrow(fpm_special_both_missing),
    nrow(fpm_special_our_missing_theirs_present),
    nrow(fpm_special_our_present_theirs_missing)
  )
  )
Table 6.14:

FPM transfers across samples

6.4 Labor Market statistics

Code
RAIS_ours <- paste0(dir_transfers_replication, "data/RAIS_ours.dta") %>% 
  haven::read_dta() %>% 
  as.data.table() %>% 
    .[, .(cod6, year, prit, prin, priw)] %>%
  # rename_columns(c("prit", "prin", "priw"), paste0(c("prit", "prin", "priw"), "_ours")) 
  melt.data.table(id.vars = c("year", "cod6")) %>% 
  # rename_columns(c("variable", "value"), paste0(c("variable", "value"), "_ours")) 
  rename_columns(c("value"), paste0(c("value"), "_ours")) 
  
RAIS_theirs <-
  paste0(dir_transfers_replication, "data/processing/main_exact_data_checkpointpre3.dta") %>%
  haven::read_dta() %>%
  as.data.table() %>%
  .[, .(cod6, year, prit, prin, priw)] %>%
  # rename_columns(c("prit", "prin", "priw"), paste0(c("prit", "prin", "priw"), "_theirs"))
  melt.data.table(id.vars = c("year", "cod6")) %>% 
  # rename_columns(c("variable", "value"), paste0(c("variable", "value"), "_theirs")) 
  rename_columns(c("value"), paste0(c("value"), "_theirs")) 
  
RAIS_joint <- merge(RAIS_ours, RAIS_theirs, by = c("cod6", "year", "variable"), all=F) %>% 
  .[, diff := value_theirs - value_ours]
Code
RAIS_joint %>% 
  .[variable=="prit"] %>% 
  ggplot(aes(x=value_ours, y = value_theirs, color=year)) + 
  geom_point() + 
  geom_abline()
Warning: Removed 3450 rows containing missing values (`geom_point()`).

Code
RAIS_joint_prit <- RAIS_joint %>% 
  .[variable=="prit"] %>% 
  .[!(is.na(value_ours)|is.na(value_theirs))]


# estimate null model prit theirs to remove year + mun effects
lm_prit_theirs <- summary(feols( value_theirs ~ 1 |  year + cod6,  data = RAIS_joint_prit))
lm_prit_ours <- summary(feols( value_ours ~ 1 |  year + cod6,  data = RAIS_joint_prit))

# get residulas
RAIS_joint_prit$value_theirs_resid <- lm_prit_theirs$residuals
RAIS_joint_prit$value_ours_resid <- lm_prit_ours$residuals

RAIS_joint_prit %>% 
  ggplot(aes(x=value_ours_resid, y = value_theirs_resid, color=year)) + 
  geom_point() + 
  geom_abline()

Code
# estimate residualized relationship
prit_resid_lm <- lm(value_theirs_resid ~ value_ours_resid,  RAIS_joint_prit)

# get leverage
RAIS_joint_prit[, lm_leverage := hatvalues(prit_resid_lm)]

plot(prit_resid_lm)

Code
high_leverage_prit <- RAIS_joint_prit %>% 
  .[quantile(lm_leverage, .99) < lm_leverage] %>% 
  .[order(-lm_leverage)] %>% 
  .[1:3] %>% 
  .[, .(year, cod6)]

merge(RAIS_joint_prit, high_leverage_prit, by = c("cod6", "year"), all=F) %>% 
  .[order(-lm_leverage)]

High leverage Total Earnings

Code
RAIS_joint_prin <- RAIS_joint %>% copy() %>% 
  # already exclude the mega outlier
  .[variable=="prin"] %>% 
  .[!(is.na(value_ours)|is.na(value_theirs))]

# identify mega outliers
mega_outlier <- RAIS_joint_prin %>% 
  .[value_theirs>1000000] %>% 
  .[, .(cod6, year)]
  
RAIS_joint_prin <- RAIS_joint_prin[value_theirs<1000000]

# estimate null model prin theirs to remove year + mun effects
lm_prin_theirs <- summary(feols( value_theirs ~ 1 |  year + cod6,  data = RAIS_joint_prin))
lm_prin_ours <- summary(feols( value_ours ~ 1 |  year + cod6,  data = RAIS_joint_prin))

# get residulas
RAIS_joint_prin$value_theirs_resid <- lm_prin_theirs$residuals
RAIS_joint_prin$value_ours_resid <- lm_prin_ours$residuals

RAIS_joint_prin %>% 
  ggplot(aes(x=value_ours_resid, y = value_theirs_resid, color=year)) + 
  geom_point() + 
  geom_abline()

Code
# estimate residualized relationship
prin_resid_lm <- lm(value_theirs_resid ~ value_ours_resid,  RAIS_joint_prin)

# get leverage
RAIS_joint_prin[, lm_leverage := hatvalues(prin_resid_lm)]

plot(prin_resid_lm)

Code
# the biggest issue are the top two

high_leverage_prin <- RAIS_joint_prin %>% 
  .[quantile(lm_leverage, .99) < lm_leverage] %>% 
  .[order(-lm_leverage)] %>% 
  .[1:2] %>% 
  .[, .(year, cod6)]

merge(RAIS_joint_prin, high_leverage_prin, by = c("cod6", "year"), all=F) %>% 
  .[order(-lm_leverage)]

High leverage Total Earnings

Code
RAIS_joint_priw <- RAIS_joint %>% copy() %>% 
  # already exclude the mega outlier
  .[variable=="priw"] %>% 
  .[!(is.na(value_ours)|is.na(value_theirs))]

# estimate null model priw theirs to remove year + mun effects
lm_priw_theirs <- summary(feols( value_theirs ~ 1 |  year + cod6,  data = RAIS_joint_priw))
lm_priw_ours <- summary(feols( value_ours ~ 1 |  year + cod6,  data = RAIS_joint_priw))

# get residulas
RAIS_joint_priw$value_theirs_resid <- lm_priw_theirs$residuals
RAIS_joint_priw$value_ours_resid <- lm_priw_ours$residuals

RAIS_joint_priw %>% 
  ggplot(aes(x=value_ours_resid, y = value_theirs_resid, color=year)) + 
  geom_point() + 
  geom_abline()

Code
# estimate residualized relationship
priw_resid_lm <- lm(value_theirs_resid ~ value_ours_resid,  RAIS_joint_priw)

# get leverage
RAIS_joint_priw[, lm_leverage := hatvalues(priw_resid_lm)]

plot(priw_resid_lm)

Code
# the biggest issue are the top four
high_leverage_priw <- RAIS_joint_priw %>% 
  .[quantile(lm_leverage, .99) < lm_leverage] %>% 
  .[order(-lm_leverage)] %>% 
  .[1:4] %>% 
  .[, .(year, cod6)]

merge(RAIS_joint_priw, high_leverage_priw, by = c("cod6", "year"), all=F) %>% 
  .[order(-lm_leverage)]

High leverage Total Earnings

Code
# compile list of observations to remove

remove_these <- rbind(high_leverage_prin, high_leverage_prit) %>% rbind(., high_leverage_priw)  

# create stata code to remove
remove_these %>% 
  .[, paste("drop if year == ", year, " & cod6 == ", cod6, "
            ")]
[1] "drop if year ==  2014  & cod6 ==  150215 \n            "
[2] "drop if year ==  2010  & cod6 ==  210405 \n            "
[3] "drop if year ==  2009  & cod6 ==  411580 \n            "
[4] "drop if year ==  2009  & cod6 ==  412350 \n            "
[5] "drop if year ==  2014  & cod6 ==  150215 \n            "
[6] "drop if year ==  2002  & cod6 ==  291180 \n            "
[7] "drop if year ==  2003  & cod6 ==  291180 \n            "
[8] "drop if year ==  2007  & cod6 ==  293305 \n            "
[9] "drop if year ==  2004  & cod6 ==  150570 \n            "
Code
RAIS_joint %>% 
  .[variable=="prin"] %>% 
  ggplot(aes(x=value_ours, y = value_theirs, color=year)) + 
  geom_point() + 
  geom_abline()
Warning: Removed 3449 rows containing missing values (`geom_point()`).

6.4.0.1 What is this outlier?

Code
RAIS_joint %>% 
  .[variable=="prin"] %>% 
  .[value_theirs>100000000]
Code
RAIS_joint[cod6==292650&year==2002&variable=="prin", value_theirs]
[1] 1e+09
attr(,"format.stata")
[1] "%9.0g"
Code
RAIS_joint %>% 
  .[variable=="prin"] %>% 
  .[value_theirs<100000000] %>% 
  ggplot(aes(x=value_ours, y = value_theirs, color=year)) + 
  geom_point() + 
  geom_abline()
Warning: Removed 6 rows containing missing values (`geom_point()`).

Code
RAIS_joint %>% 
  .[variable=="priw"] %>% 
  .[value_theirs<100000000] %>% 
  ggplot(aes(x=value_ours, y = value_theirs, color=year)) + 
  geom_point() + 
  geom_abline(slope = 1)
Warning: Removed 6 rows containing missing values (`geom_point()`).