The case_when() function

R
tidyverse
data manipulation
R-SIG
tutorial
R-SIG 18.12.2023
Published

December 18, 2023

1

The case_when() function from the dplyr package of the tidyverse replaces several nested ifelse() statements with a single flat list of conditions and the values they return.

How to use it

Let’s look at a small example: a very simple data frame containing only one column with different countries:

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
df <- data.frame(country = c(rep("Deu", 4), "Mexico", "Peru", "Ghana", "China", "Spanien"))

Now suppose we want to add a second column containing the continent of each country. One option is nested ifelse() statements, but they make the code quite hard to read:

df$continent <- ifelse(df$country %in% c("Deu", "Spanien"), 
                       yes = "Europe", 
                       no = ifelse(
                         df$country == "Mexico" | df$country == "Peru", 
                         yes = "America",
                         no = ifelse(
                           df$country == "Ghana", 
                           yes = "Africa",
                           no = "Asia"
                       )
                       ))

df
  country continent
1     Deu    Europe
2     Deu    Europe
3     Deu    Europe
4     Deu    Europe
5  Mexico   America
6    Peru   America
7   Ghana    Africa
8   China      Asia
9 Spanien    Europe

case_when() has a slightly different syntax, but it is not nested, which makes it easier to read. Each condition and its output are separated by ~: if the condition on the left side is met in a row, the function returns the value on the right side of ~ for that row:

df_2 <- df %>%
  mutate(continent = case_when(country %in% c("Deu", "Spanien") ~ "Europe", 
                               country %in% c("Mexico", "Peru") ~ "America",
                               country == "Ghana" ~ "Africa", 
                               TRUE ~ "Another continent"
                                 )
         )
df_2
  country         continent
1     Deu            Europe
2     Deu            Europe
3     Deu            Europe
4     Deu            Europe
5  Mexico           America
6    Peru           America
7   Ghana            Africa
8   China Another continent
9 Spanien            Europe

We wrap this statement in a mutate() call to create the new column continent directly from the output of case_when(). The TRUE in the last line catches all cases we haven’t dealt with further above, so every row that hasn’t met any of the earlier conditions gets the label “Another continent”.
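To see why the fallback line matters, here is a minimal sketch of what happens without it. The vector `countries` and the object names are made up for illustration. The second call uses the `.default` argument, which assumes dplyr 1.1.0 or newer; on older versions, the `TRUE ~ ...` line shown above is the way to go.

```r
library(dplyr)

# Hypothetical example vector, not part of the data frame above
countries <- c("Deu", "Mexico", "Ghana", "China")

# Without a fallback, values that match no condition become NA
no_fallback <- case_when(
  countries == "Deu"    ~ "Europe",
  countries == "Mexico" ~ "America",
  countries == "Ghana"  ~ "Africa"
)
no_fallback

# Assumes dplyr >= 1.1.0: .default plays the same role as TRUE ~ ...
with_default <- case_when(
  countries == "Deu"    ~ "Europe",
  countries == "Mexico" ~ "America",
  countries == "Ghana"  ~ "Africa",
  .default = "Another continent"
)
with_default
```

In the first result, "China" ends up as NA because no condition covers it. In the second, it receives the fallback label.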

Evaluation order

case_when() works through its conditions from top to bottom. The first condition a row meets determines its value, and conditions further down no longer affect that row. That’s why it makes sense to go from the most specific conditions to the less specific ones. Otherwise a general condition at the top claims the rows before the more specific ones get a chance:

df_3 <- df %>%
  mutate(continent = case_when(country %in% c(df$country) ~ "Other country", 
                               country %in% c("Mexico", "Peru") ~ "America",
                               country == "Ghana" ~ "Africa", 
                               TRUE ~ "Another continent"
                                 )
         )

df_3
  country     continent
1     Deu Other country
2     Deu Other country
3     Deu Other country
4     Deu Other country
5  Mexico Other country
6    Peru Other country
7   Ghana Other country
8   China Other country
9 Spanien Other country

Because our first condition already covers all rows, the remaining conditions never take effect. This top-down evaluation is also what makes the TRUE condition in our last line work: only rows that haven’t matched anything yet get this far, and all of them are caught (because TRUE is always true).
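The effect of ordering is perhaps easiest to see with overlapping numeric conditions. The following is a small sketch with invented population values and invented labels; only the order of the first two conditions differs between the two calls.

```r
library(dplyr)

# Hypothetical values, chosen so that the conditions overlap
population <- c(500, 50000, 5000000)

# Specific condition first: 500 is caught by the "village" condition
specific_first <- case_when(
  population < 1000   ~ "village",
  population < 100000 ~ "town",
  TRUE                ~ "city"
)
specific_first

# General condition first: 500 also satisfies population < 100000,
# so the "village" condition never gets to label any value
general_first <- case_when(
  population < 100000 ~ "town",
  population < 1000   ~ "village",
  TRUE                ~ "city"
)
general_first
```

With the specific condition on top, the three values are labelled "village", "town" and "city". With the general condition on top, the first value becomes "town" as well.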

Footnotes

  1. Image by Sky Replacement Pack on Unsplash.