Removing Accents in R

R is a fantastic open-source language that can do just about anything, but seemingly simple tasks sometimes require a few tricks. Removing accents is a case in point, so I’d like to offer some guidance on the thornier accent removal issues.

Typical One- and Two-Character Substitutions

Before getting into accent removal, the first order of business is to make sure that RStudio is using UTF-8 file encoding. The setting appears under Tools > Global Options > Code > Saving, as the default text encoding.

In my case, everything was already in UTF-8, but it was good to check just in case.
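The same check can also be done from the console. The lines below are a minimal sketch using base R only; example_city is a made-up test value and not part of the data used later.

```r
# report the character set of the current session's locale
Sys.getlocale("LC_CTYPE")

# TRUE when the session is running in a UTF-8 locale
is_utf8_locale <- l10n_info()[["UTF-8"]]
is_utf8_locale

# example_city is a hypothetical test string containing an accented "i"
example_city <- "San Mart\u00edn"

# declared encoding of the string, and whether its bytes are valid UTF-8
Encoding(example_city)
validUTF8(example_city)
```

If the locale is not UTF-8, accented characters typed into a script may not survive a save and reload, so it is worth sorting this out before doing any substitutions.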

Now, let’s create a data frame that will allow us to remove different types of accents that we will encounter:

# clear environment
rm(list=ls(all=TRUE)) 

# create data frame
df <- data.frame(
  country = c("Argentina", "Honduras","Germany"),
  city = c("San Martín", "San José","Aßlar"),
  stringsAsFactors = FALSE) # keep text as character in R versions before 4.0

head(df)

##     country       city
## 1 Argentina San Martín
## 2  Honduras   San José
## 3   Germany      Aßlar

I specifically chose those cities because they embody the different types of accent fixes that you may need to perform. Removing the “é” from “San José” and the “í” from “San Martín” are typical cases that won’t cause many problems, regardless of how you choose to remove them. The “ß” in “Aßlar”, however, requires a two-character substitution: the “ß” needs to be replaced with “ss” when it is transcribed into English.

So that we can take care of the one- and two-character substitutions all at once, let’s write a function called remove.accents. The old1 string contains the accented letters for the one-character substitutions, and the new1 string contains their unaccented replacements. chartr maps the first character of old1 to the first character of new1, the second to the second, and so on, so the two strings must list their characters in the same order. The logic is similar for old2 and new2, except that each element of old2 is swapped for the corresponding element of new2 with gsub, which allows one character to be replaced by two.

# define the function
remove.accents <- function(s) {
  
  # 1 character substitutions
  old1 <- "éí"
  new1 <- "ei"
  s1 <- chartr(old1, new1, s)
  
  # 2 character substitutions (one character becomes two)
  old2 <- c("ß")
  new2 <- c("ss")
  s2 <- s1
  
  # apply each old2/new2 pair in turn, matching literally rather than as a regex
  for (i in seq_along(old2)) s2 <- gsub(old2[i], new2[i], s2, fixed = TRUE)
  
  s2
}

Given that the function is now defined, let’s execute it and see whether it works:

# apply the accent fix
df$city = remove.accents(df$city)

# examine the data frame
head(df)

##     country       city
## 1 Argentina San Martin
## 2  Honduras   San Jose
## 3   Germany     Asslar
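
As the list of characters grows, keeping two pairs of vectors in sync can get tedious. One possible variation, sketched below, stores every substitution in a single named vector and loops over it with gsub. The names replace_chars and accent_map are hypothetical and are not used anywhere else in this post.

```r
# hypothetical helper: replace every name of `map` with its value
replace_chars <- function(s, map) {
  for (old in names(map)) {
    # fixed = TRUE treats `old` as a literal string, not a regular expression
    s <- gsub(old, map[[old]], s, fixed = TRUE)
  }
  s
}

# names are the accented characters, values are their replacements
accent_map <- setNames(c("e", "i", "ss"), c("é", "í", "ß"))

cities <- c("San Martín", "San José", "Aßlar")
cleaned <- replace_chars(cities, accent_map)
cleaned
```

This should give the same three city names as remove.accents. Because gsub handles both the one- and two-character cases, a new substitution only needs one more entry in accent_map.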

When the Above Doesn’t Work

Sometimes, the above tricks won’t work. I recently ran into such an instance while cleaning Moldova data in Romanian for my project on natural resources and subnational public goods provision. To show how I overcame this challenge, let me load the shapefile with the sf package and only keep the admin1 column with the accented values:

# load libraries
library(sf)
library(dplyr)

# load shapefile
moldova = st_read("Moldova_SHP/MDA_adm1.shp")

# keep only the admin1 column with the accented characters
moldova <-  
  moldova %>% 
  dplyr::select(NAME_1) %>% 
  st_set_geometry(NULL) %>% # drop the geometry column, leaving a plain data frame
  dplyr::rename(admin1 = NAME_1)

Let’s see what these accented characters look like. Because R Markdown has a tough time reading them, I’ll use a screenshot to show the results here:

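In such instances, the make_clean_names function from the janitor package will remove the accents. Keep in mind that it was written for cleaning column names, so it also converts the values to lowercase, replaces spaces with underscores, and appends a numeric suffix to any duplicated value: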

# load library
library(janitor)

# perform fixes
moldova$admin1 = make_clean_names(moldova$admin1)

# make sure it goes through
table(moldova$admin1)

## 
##   anenii_noi        balti basarabeasca       bender      briceni        cahul 
##            1            1            1            1            1            1 
##     calarasi     cantemir      causeni     chisinau     cimislia     criuleni 
##            1            1            1            1            1            1 
##    donduseni      drochia     dubasari       edinet      falesti     floresti 
##            1            1            1            1            1            1 
##     gagauzia      glodeni     hincesti     ialoveni        leova    nisporeni 
##            1            1            1            1            1            1 
##       ocnita        orhei       rezina      riscani     singerei   soldanesti 
##            1            1            1            1            1            1 
##       soroca  stefan_voda     straseni     taraclia    telenesti transnistria 
##            1            1            1            1            1            1 
##      ungheni 
##            1

To conclude, let’s just change the underscore back to a space:

# perform replacement
moldova$admin1 = gsub("_", " ", moldova$admin1)

# make sure it goes through
table(moldova$admin1)

## 
##   anenii noi        balti basarabeasca       bender      briceni        cahul 
##            1            1            1            1            1            1 
##     calarasi     cantemir      causeni     chisinau     cimislia     criuleni 
##            1            1            1            1            1            1 
##    donduseni      drochia     dubasari       edinet      falesti     floresti 
##            1            1            1            1            1            1 
##     gagauzia      glodeni     hincesti     ialoveni        leova    nisporeni 
##            1            1            1            1            1            1 
##       ocnita        orhei       rezina      riscani     singerei   soldanesti 
##            1            1            1            1            1            1 
##       soroca  stefan voda     straseni     taraclia    telenesti transnistria 
##            1            1            1            1            1            1 
##      ungheni 
##            1
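
If you would rather keep the original capitalization and spacing, one alternative that is not part of the workflow above is to transliterate the strings directly with the stringi package. The sketch below assumes stringi is installed and reuses the three city names from the first section rather than the Moldova data.

```r
# assumption: the stringi package is installed; it is not used elsewhere in this post
library(stringi)

# illustrative values from the first section, not from the Moldova shapefile
cities <- c("San Martín", "San José", "Aßlar")

# "Latin-ASCII" maps accented Latin letters to their closest ASCII equivalents
cities_ascii <- stri_trans_general(cities, "Latin-ASCII")

cities_ascii
```

Here the result should be San Martin, San Jose, and Asslar, with the capital letters and spaces left as they were. I have not run this against the Romanian names, so check the output before relying on it.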