# Removing Accents in R


R is a fantastic open-source program that allows users to do just about anything, but sometimes the program requires some tricks to accomplish seemingly simple tasks. Removing accents is a case in point, so I’d like to provide everyone with some guidance to overcome some of the thornier accent removal issues.

## Typical One- and Two-Character Substitutions

Before getting into the accent removal itself, the first order of business is to make sure that RStudio is using UTF-8 file encoding. In my case, everything was already in UTF-8, but it was worth checking.
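
The same check can be approximated from the console. The sketch below uses only base R and reports on the encoding of the current R session and of a test string, not on RStudio's file-saving preference, so treat it as a complement to the settings check rather than a replacement. The test string is my own illustration.

```r
# Locale information: the "UTF-8" element is TRUE when the session uses a UTF-8 locale
info <- l10n_info()
print(info[["UTF-8"]])

# The character-type locale that R is currently using
print(Sys.getlocale("LC_CTYPE"))

# Declared encoding and UTF-8 validity of a test string
# (\u00e9 is e-acute; the escape keeps this script ASCII-only)
x <- "San Jos\u00e9"
print(Encoding(x))
print(validUTF8(x))
```

If the session is not in a UTF-8 locale, accented characters typed directly into a script may not survive a save and reload, which is why it is worth checking before cleaning any data.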

Now, let’s create a data frame that will allow us to remove different types of accents that we will encounter:

# clear environment
rm(list=ls(all=TRUE))

# create data frame
df <- data.frame(
  country = c("Argentina", "Honduras", "Germany"),
  city = c("San Martín", "San José", "Aßlar"))

# examine the data frame
df


##     country       city
## 1 Argentina San Martín
## 2  Honduras   San José
## 3   Germany      Aßlar


I specifically chose these cities because they embody different types of accent fixes that you may need to perform. Removing the “é” from “San José” and the “í” from “San Martín” are typical cases that won’t cause many problems, regardless of how you choose to remove them. The “ß” in “Aßlar”, however, requires a two-character substitution: the “ß” needs to be replaced with “ss” when it is transcribed into English.
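
To see why the “ß” needs separate handling, here is a minimal sketch in base R. It relies on the documented behavior of chartr, which maps characters one-to-one and ignores any extra characters at the end of the replacement string. The variable names are my own, and the accented letters are written as \u escapes so that the snippet does not depend on the file encoding.

```r
# \u00ed = i-acute, \u00e9 = e-acute, \u00df = eszett
cities <- c("San Mart\u00edn", "San Jos\u00e9", "A\u00dflar")

# chartr() pairs characters by position: e-acute -> e, i-acute -> i
one_to_one <- chartr("\u00e9\u00ed", "ei", cities)
print(one_to_one)

# When the replacement is longer than the original, chartr() ignores the
# extra characters, so the eszett becomes a single "s"
truncated <- chartr("\u00df", "ss", "A\u00dflar")
print(truncated)

# gsub() with fixed = TRUE can replace one character with a longer string
expanded <- gsub("\u00df", "ss", "A\u00dflar", fixed = TRUE)
print(expanded)
```

This is the reason for handling the two kinds of substitution in separate steps: chartr for the one-to-one cases and gsub for anything that changes the length of the string.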

So that we can take care of the one- and two-character substitutions all at once, let’s write a function called remove.accents. The old1 string contains the accented letters in the original language for the one-character substitutions, and the new1 string contains the respective replacements without the accents. Note that old1 and new1 must list their characters in the same order, because chartr pairs them by position. The logic is similar for old2 and new2, which are vectors whose elements are matched up one by one.

# define the function
remove.accents <- function(s) {

  # 1 character substitutions
  old1 <- "éí"
  new1 <- "ei"
  s1 <- chartr(old1, new1, s)

  # 2 character substitutions
  old2 <- c("ß")
  new2 <- c("ss")
  s2 <- s1
  for(i in seq_along(old2)) s2 <- gsub(old2[i], new2[i], s2, fixed = TRUE)

  # return the cleaned strings
  s2
}



Given that the function is now defined, let’s execute it and see whether it works:

# apply the accent fix
df$city = remove.accents(df$city)

# examine the data frame
df

##     country       city
## 1 Argentina San Martin
## 2  Honduras   San Jose
## 3   Germany     Asslar
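
With a larger data set it is hard to eyeball whether every accent is gone. The following is a rough sketch of an automated check in base R; has_non_ascii is a hypothetical helper of my own, not part of any package, and it simply flags strings that still contain a character outside the ASCII range.

```r
# Hypothetical helper: TRUE for each string that still has a non-ASCII character
has_non_ascii <- function(x) {
  vapply(
    enc2utf8(as.character(x)),
    function(s) any(utf8ToInt(s) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# The cleaned cities from above should all come back FALSE
cleaned <- c("San Martin", "San Jose", "Asslar")
print(has_non_ascii(cleaned))

# A Romanian name with a-breve (\u0103) and t-comma (\u021b) would be flagged,
# because a function that only handles e-acute, i-acute, and eszett leaves it as is
leftover <- "B\u0103l\u021bi"
print(has_non_ascii(leftover))
```

Any value flagged here points to a character that still needs an entry in the substitution lists, or to a case where a different approach is easier.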


## When the Above Doesn’t Work

Sometimes, the above tricks won’t work. I recently ran into such an instance while cleaning Moldova data in Romanian for my project on natural resources and subnational public goods provision. To show how I overcame this challenge, let me load the shapefile with the sf package and only keep the admin1 column with the accented values:

# load libraries
library(sf)
library(dplyr)

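# keep only the admin1 column with the accented characters
# (assumes the shapefile has already been read into moldova, e.g. with sf::st_read)
moldova <-
  moldova %>%
  dplyr::select(NAME_1) %>%
  st_set_geometry(NULL) %>% # drop the geometry column
  dplyr::rename(admin1 = NAME_1) # assumed step: the code below refers to admin1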


Let’s see what these accented characters look like. Because R Markdown has a tough time reading them, I’ll use a screenshot to show the results here:

In such instances, simply use the make_clean_names function from the janitor package to remove the accents:

# load library
library(janitor)

# perform fixes
moldova$admin1 = make_clean_names(moldova$admin1)

# make sure it goes through
table(moldova$admin1)

##   anenii_noi        balti basarabeasca       bender      briceni        cahul
##            1            1            1            1            1            1
##     calarasi     cantemir      causeni     chisinau     cimislia     criuleni
##            1            1            1            1            1            1
##    donduseni      drochia     dubasari       edinet      falesti     floresti
##            1            1            1            1            1            1
##     gagauzia      glodeni     hincesti     ialoveni        leova    nisporeni
##            1            1            1            1            1            1
##       ocnita        orhei       rezina      riscani     singerei   soldanesti
##            1            1            1            1            1            1
##       soroca  stefan_voda     straseni     taraclia    telenesti transnistria
##            1            1            1            1            1            1
##      ungheni
##            1

To conclude, let’s just change the underscore back to a space:

# perform replacement
moldova$admin1 = gsub("_", " ", moldova$admin1)

# make sure it goes through
table(moldova$admin1)

##   anenii noi        balti basarabeasca       bender      briceni        cahul
##            1            1            1            1            1            1
##     calarasi     cantemir      causeni     chisinau     cimislia     criuleni
##            1            1            1            1            1            1
##    donduseni      drochia     dubasari       edinet      falesti     floresti
##            1            1            1            1            1            1
##     gagauzia      glodeni     hincesti     ialoveni        leova    nisporeni
##            1            1            1            1            1            1
##       ocnita        orhei       rezina      riscani     singerei   soldanesti
##            1            1            1            1            1            1
##       soroca  stefan voda     straseni     taraclia    telenesti transnistria
##            1            1            1            1            1            1
##      ungheni
##            1
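
One caveat is worth keeping in mind: make_clean_names is designed for column names, so it also lower-cases the text and appends a counter to repeated values to keep every result unique. That was harmless here because each admin1 name appears once, but it would change data with repeated city names. The sketch below illustrates the behavior and, as an alternative that I am swapping in rather than something used above, the Latin-ASCII transliteration from the stringi package, which keeps case, spaces, and duplicates. It assumes both packages are installed and skips the relevant part otherwise; the variable names are my own.

```r
clean_dup <- NULL
plain_ascii <- NULL

if (requireNamespace("janitor", quietly = TRUE)) {
  # Repeated inputs are lower-cased, joined with underscores, and numbered
  clean_dup <- janitor::make_clean_names(c("San Jose", "San Jose"))
  print(clean_dup)
}

if (requireNamespace("stringi", quietly = TRUE)) {
  # Latin-ASCII transliteration only strips the accents
  # (\u00ed = i-acute, \u00df = eszett)
  plain_ascii <- stringi::stri_trans_general(
    c("San Mart\u00edn", "A\u00dflar", "A\u00dflar"),
    "Latin-ASCII"
  )
  print(plain_ascii)
}
```

Which of the two fits better depends on whether the cleaned values are meant to serve as identifiers or as readable labels.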

