Balancing Panel Data and Removing Duplicate Unique IDs
Published:
Properly balancing panel data and removing duplicate unique identifiers is a recurring challenge for students in my Research Practicum course. Accordingly, I thought it would be helpful to provide an example for everyone.
Some may ask: What are properly balanced panel data? Generally, a properly balanced panel means that all possible ID variables and have data for the same (or appropriate) time period. In the example that follows, we will be working with a dataset of United States (US) foreign aid disbursements to all African countries for the 2001-2017 period. The data come from the US Foreign Aid Explorer (FAE), and we have deflated them to constant 2010 dollars to account for potential inflationary pressures. For these data, we want to ensure that all existent African countries have a foreign aid disbursement value for each year between 2001 and 2017.
To start, let’s import our data frame and take a quick look at it:
rm(list=ls(all=TRUE)) # clear the environment
library(rio) # load package
fae_all <- import("african_aid.csv") # import the data
summary(fae_all) # summarize the data
## country_name year aid_constant
## Length:911 Min. :2001 Min. : -38845
## Class :character 1st Qu.:2005 1st Qu.: 9568481
## Mode :character Median :2009 Median : 58087880
## Mean :2009 Mean : 165454902
## 3rd Qu.:2013 3rd Qu.: 167795723
## Max. :2017 Max. :3173867164
## NA's : 1
head(fae_all) # take a quick peak at the data
## country_name year aid_constant
## 1 Algeria 2001 563682
## 2 Angola 2001 88977561
## 3 Benin 2001 41040486
## 4 Botswana 2001 3786002
## 5 Burkina Faso 2001 19766559
## 6 Burundi 2001 28287903
Let’s check how many countries are in the dataset:
length(unique(fae_all$country_name))
## [1] 54
Let’s also check how many years are in the dataset:
length(unique(fae_all$year))
## [1] 17
Now, let’s check whether the data are balanced:
suppressMessages(library(plm)) # to remove note about dependencies
is.pbalanced(fae_all) # are the data balanced?
## Warning in pdata.frame(x, index): duplicate couples (id-time) in resulting pdata.frame
## to find out which, use e.g. table(index(your_pdataframe), useNA = "ifany")
## Warning in is.pbalanced.default(id, time): duplicate couples (id-time)
## [1] FALSE
Let’s first find the duplicate couple(s):
fae_all$unique_id <- paste(fae_all$country_name,fae_all$year) # concatenate to make unique ID
fae_all$duplicate = duplicated(fae_all$unique_id) # generate the duplicate variable
subset(fae_all, duplicate=="TRUE") # find the duplicate
## country_name year aid_constant unique_id duplicate
## 17 Egypt 2001 0 Egypt 2001 TRUE
The only duplicate value is for Egypt in 2001. It has an aid disbursement value of 0, so we can simply summarize the data. Stata users would say “collapse” the data. In this case, summarizing or collapsing the data means that we add together the two values for Egypt in 2001 and delete the duplicate row in the process. The process is generally the same if there were more duplicates in our data frame:
suppressMessages(library(tidyverse))
fae_all <-
fae_all %>% # refer to the fae_all data frame
group_by(country_name,year) %>% # tell R the unique ID
select(-c(unique_id,duplicate)) %>% # drop (now) useless variables
summarize(aid_constant = sum(aid_constant, na.rm=TRUE)) # summarize
Did that get rid of our duplicates couples problem?
library(plm)
is.pbalanced(fae_all) # duplicate couple issue now gone
## [1] FALSE
We are no longer getting the duplicates couples warning, so we can now concentrate on balancing the panel. Before proceeding, let’s first create a numeric country ID variable:
fae_all$country_id <- as.numeric(factor(fae_all$country_name)) # numeric country ID
We need to create a numeric country ID variable because the plm
package needs two numeric variables to constitute its ID-time index. With that step out of the way, let’s now balance the panel:
fae_balanced <- make.pbalanced(fae_all,
balance.type = c("fill"),
index = c("country_id","year"))
Since we used the numeric country_id
variable to balance the panel, it means that we are going to have some missing country_name
observations. Let’s first figure out which countries do not have a country name attached to them:
subset(fae_balanced, is.na(country_name))
## country_name year aid_constant country_id
## 273 <NA> 2001 <NA> 17
## 477 <NA> 2001 <NA> 29
## 478 <NA> 2002 <NA> 29
## 479 <NA> 2003 <NA> 29
## 783 <NA> 2001 <NA> 47
## 784 <NA> 2002 <NA> 47
## 786 <NA> 2004 <NA> 47
## 787 <NA> 2005 <NA> 47
Note that we have country_id
observations with missing country names, so we can use those IDs to identify the corresponding missing countries:
subset(fae_balanced, country_id==17|country_id==29|country_id==47)
## country_name year aid_constant country_id
## 273 <NA> 2001 <NA> 17
## 274 Equatorial Guinea 2002 81590 17
## 275 Equatorial Guinea 2003 300441 17
## 276 Equatorial Guinea 2004 42583 17
## 277 Equatorial Guinea 2005 247406 17
## 278 Equatorial Guinea 2006 76640 17
## 279 Equatorial Guinea 2007 29844 17
## 280 Equatorial Guinea 2008 15953 17
## 281 Equatorial Guinea 2009 191463 17
## 282 Equatorial Guinea 2010 143462 17
## 283 Equatorial Guinea 2011 55248 17
## 284 Equatorial Guinea 2012 414390 17
## 285 Equatorial Guinea 2013 207469 17
## 286 Equatorial Guinea 2014 204699 17
## 287 Equatorial Guinea 2015 244107 17
## 288 Equatorial Guinea 2016 948035 17
## 289 Equatorial Guinea 2017 3321453 17
## 477 <NA> 2001 <NA> 29
## 478 <NA> 2002 <NA> 29
## 479 <NA> 2003 <NA> 29
## 480 Libya 2004 52165 29
## 481 Libya 2005 161604 29
## 482 Libya 2006 30792152 29
## 483 Libya 2007 4806971 29
## 484 Libya 2008 16698323 29
## 485 Libya 2009 6528419 29
## 486 Libya 2010 10736737 29
## 487 Libya 2011 236478945 29
## 488 Libya 2012 162328736 29
## 489 Libya 2013 54600235 29
## 490 Libya 2014 21013668 29
## 491 Libya 2015 25773448 29
## 492 Libya 2016 17908764 29
## 493 Libya 2017 40775906 29
## 783 <NA> 2001 <NA> 47
## 784 <NA> 2002 <NA> 47
## 785 South Sudan 2003 0 47
## 786 <NA> 2004 <NA> 47
## 787 <NA> 2005 <NA> 47
## 788 South Sudan 2006 262143 47
## 789 South Sudan 2007 4463780 47
## 790 South Sudan 2008 4803235 47
## 791 South Sudan 2009 3111907 47
## 792 South Sudan 2010 36194887 47
## 793 South Sudan 2011 107385654 47
## 794 South Sudan 2012 406072707 47
## 795 South Sudan 2013 511596816 47
## 796 South Sudan 2014 650811758 47
## 797 South Sudan 2015 750664797 47
## 798 South Sudan 2016 638801967 47
## 799 South Sudan 2017 942560453 47
Now that we know which country IDs correspond to the missing country names, let’s make the appropriate fixes and check to ensure that there are no longer any missing country names:
fae_balanced$country_name[fae_balanced$country_id==17] = "Equatorial Guinea"
fae_balanced$country_name[fae_balanced$country_id==29] = "Libya"
fae_balanced$country_name[fae_balanced$country_id==47] = "South Sudan"
subset(fae_balanced, is.na(country_name)) # nothing now: all good
## [1] country_name year aid_constant country_id
## <0 rows> (or 0-length row.names)
Returning to the actual data balancing, let’s now ensure that everything went through as planned:
is.pbalanced(fae_balanced) # are data balanced?
## [1] TRUE
Although the above command indicates that the data are now balanced, let’s table the country and year variables to verify. Recall from above that the original (unbalanced) dataset contains 54 distinct countries and 17 distinct years. Accordingly, the new (balanced) data frame should contain 54 countries for our 17 years of data (2001-2017). To verify, let’s proceed one step at a time, starting with the years:
table(fae_balanced$year) # years OK?
##
## 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
## 54 54 54 54 54 54 54 54 54 54 54 54 54 54 54
## 2016 2017
## 54 54
The years look right, and we can say the same for the countries:
table(fae_balanced$country_name) # countries OK?
##
## Algeria Angola Benin
## 17 17 17
## Botswana Burkina Faso Burundi
## 17 17 17
## Cabo Verde Cameroon Central African Republic
## 17 17 17
## Chad Comoros Congo (Brazzaville)
## 17 17 17
## Congo (Kinshasa) Cote d'Ivoire Djibouti
## 17 17 17
## Egypt Equatorial Guinea Eritrea
## 17 17 17
## Eswatini Ethiopia Gabon
## 17 17 17
## Gambia Ghana Guinea
## 17 17 17
## Guinea-Bissau Kenya Lesotho
## 17 17 17
## Liberia Libya Madagascar
## 17 17 17
## Malawi Mali Mauritania
## 17 17 17
## Mauritius Morocco Mozambique
## 17 17 17
## Namibia Niger Nigeria
## 17 17 17
## Rwanda Sao Tome and Principe Senegal
## 17 17 17
## Seychelles Sierra Leone Somalia
## 17 17 17
## South Africa South Sudan Sudan (former)
## 17 17 17
## Tanzania Togo Tunisia
## 17 17 17
## Uganda Zambia Zimbabwe
## 17 17 17
Lastly, let’s remedy our missing data problem for our foreign aid disbursement variable:
subset(fae_balanced, is.na(aid_constant))
## country_name year aid_constant country_id
## 273 Equatorial Guinea 2001 <NA> 17
## 477 Libya 2001 <NA> 29
## 478 Libya 2002 <NA> 29
## 479 Libya 2003 <NA> 29
## 783 South Sudan 2001 <NA> 47
## 784 South Sudan 2002 <NA> 47
## 786 South Sudan 2004 <NA> 47
## 787 South Sudan 2005 <NA> 47
As shown above, there are missing (NA) values. The Foreign Aid Explorer (FAE) is a very comprehensive data source, and the US government provides very transparent data about how its spends taxpayer money. It is thus impossible that there are missing data. For this reason, let’s initially replace the missing values with 0, indicating that the country did not receive any foreign aid for the given years.
fae_balanced$aid_constant[is.na(fae_balanced$aid_constant)] <- 0
Was this the right decision? Unfortunately, it was a mistake: South Sudan only became an internationally-recognized country in 2011, and we don’t want to have data for the country when it did not exist. Accordingly, let’s check the foreign aid disbursement values for South Sudan:
subset(fae_balanced, country_name=="South Sudan")
## country_name year aid_constant country_id
## 783 South Sudan 2001 0 47
## 784 South Sudan 2002 0 47
## 785 South Sudan 2003 0 47
## 786 South Sudan 2004 0 47
## 787 South Sudan 2005 0 47
## 788 South Sudan 2006 262143 47
## 789 South Sudan 2007 4463780 47
## 790 South Sudan 2008 4803235 47
## 791 South Sudan 2009 3111907 47
## 792 South Sudan 2010 36194887 47
## 793 South Sudan 2011 107385654 47
## 794 South Sudan 2012 406072707 47
## 795 South Sudan 2013 511596816 47
## 796 South Sudan 2014 650811758 47
## 797 South Sudan 2015 750664797 47
## 798 South Sudan 2016 638801967 47
## 799 South Sudan 2017 942560453 47
Interestingly, South Sudan has aid values from 2006-2017, indicating that the US government figured out which funds supported South Sudan even before it became a recognized country in 2011. We should thus not automatically delete anything prior to 2006. By the same token, we need to delete the zero values from 2001-2005, since South Sudan could not receive any foreign aid for that period:
library(tidyverse)
fae_final <-
fae_balanced %>%
filter(!(country_name=="South Sudan" & year<=2005))
We now have properly balanced panel data. In other words, (i) we have a panel dataset that has the appropriate number of countries and years; and (ii) we have carefully examined the extent to which missing/NA values should be part of our dataset. I hope this helps!