Column-wise operations in the tidyverse

R
tidyverse
loops
R-SIG
R-SIG 08.04.2024
Published

April 8, 2024

1

library(tidyverse)

athletes <- readRDS(file = here::here(  "raw_data", "athletes.rds"))

Column-wise operations with across()

Instead of looping over columns with a for-loop, we can also use across() in combination with other tidyverse functions.

mutate_...()

The functions mutate_all(), mutate_at(), and mutate_if() do the same, but are superseded. This means they still work, but the tidyverse team recommends to use across() instead.

Across can take column names and a function that should be applied to the selected columns:

## Here we transform the Height and Weight columns to the type character:
athletes %>%
  mutate(across(c(Height, Weight), as.character)) %>%
  str
'data.frame':   270767 obs. of  16 variables:
 $ NOC   : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ ID    : int  132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
 $ Name  : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
 $ Sex   : chr  "M" "M" "M" "M" ...
 $ Age   : int  NA NA 28 NA 24 28 28 NA NA NA ...
 $ Height: chr  NA NA "163" NA ...
 $ Weight: chr  NA NA "57" NA ...
 $ Team  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Games : chr  "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
 $ Year  : int  1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
 $ Season: chr  "Summer" "Summer" "Summer" "Summer" ...
 $ City  : chr  "Melbourne" "London" "Moskva" "Melbourne" ...
 $ Sport : chr  "Hockey" "Hockey" "Wrestling" "Hockey" ...
 $ Event : chr  "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
 $ Medal : chr  NA NA NA NA ...
 $ Region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...

But it is also possible to select columns based on a selection function:

## And here we transform all numeric columns into character:
athletes %>%
  mutate(across(where(is.numeric), as.character)) %>%
  str()
'data.frame':   270767 obs. of  16 variables:
 $ NOC   : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ ID    : chr  "132181" "87371" "44977" "502" ...
 $ Name  : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
 $ Sex   : chr  "M" "M" "M" "M" ...
 $ Age   : chr  NA NA "28" NA ...
 $ Height: chr  NA NA "163" NA ...
 $ Weight: chr  NA NA "57" NA ...
 $ Team  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Games : chr  "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
 $ Year  : chr  "1956" "1948" "1980" "1956" ...
 $ Season: chr  "Summer" "Summer" "Summer" "Summer" ...
 $ City  : chr  "Melbourne" "London" "Moskva" "Melbourne" ...
 $ Sport : chr  "Hockey" "Hockey" "Wrestling" "Hockey" ...
 $ Event : chr  "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
 $ Medal : chr  NA NA NA NA ...
 $ Region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...

We can also specify our own transformation function.

Using tidyverse functions in loops

When using tidyverse syntax within a loop, we might run into the problem that the tidyverse function can’t deal with our iteration counter like we are used to:

## Trying to transform the Height and Weight column to character using a for-loop.
## Note: Normally across() would be a better option in this case (and most of the time anyways).
##       But sometimes a good old fashioned for-loop might be easier to programm to get the job done, 
##       in which case one should keep some specifics in mind: 

## This throws an error: 
athletes_2 <- athletes

for(i in c("Height", "Weight")){
  athletes_2 <- athletes_2 %>%
    mutate(i = as.character(.$i))
}

This doesn’t work, because we are trying to program a function using data masking. Data masking just means that we don’t have to type athletes$Height in a tidyverse function, but simply Height, because the function knows this refers to a column in the current data.frame.

Loops

  • In the case of mutate(), we have to use dynamic dots, which need to used if we want to create names programmatically: := instead of =.
  • We need to embrace the changeable variable (either in a function or a loop) like this: {var}.
  • In mutate(), we can also simply write "{var}_..." to paste together a new column name.
  • If we just want to extract data, we can use the .data pronoun with [[ (see here). .data helps to clear up ambiguity, it makes clear you want to extract a column from the current data.frame. This is something different than the ., which can be read like “data up to this point” and references the data that gets put into the function where the . is used. The . actually stands for a data.frame, while .data is used for symbol evaluation.
athletes_2 <- athletes

for(i in c("Height", "Weight")){
  athletes_2 <- athletes_2 %>%
    mutate({{i}} := as.character(.[[i]]))
}
str(athletes_2)
'data.frame':   270767 obs. of  16 variables:
 $ NOC   : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ ID    : int  132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
 $ Name  : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
 $ Sex   : chr  "M" "M" "M" "M" ...
 $ Age   : int  NA NA 28 NA 24 28 28 NA NA NA ...
 $ Height: chr  NA NA "163" NA ...
 $ Weight: chr  NA NA "57" NA ...
 $ Team  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Games : chr  "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
 $ Year  : int  1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
 $ Season: chr  "Summer" "Summer" "Summer" "Summer" ...
 $ City  : chr  "Melbourne" "London" "Moskva" "Melbourne" ...
 $ Sport : chr  "Hockey" "Hockey" "Wrestling" "Hockey" ...
 $ Event : chr  "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
 $ Medal : chr  NA NA NA NA ...
 $ Region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
for(i in c("Height", "Weight")){
  athletes_2 <- athletes_2 %>%
    mutate({{i}} := as.character({{i}}))
}
str(athletes_2)
'data.frame':   270767 obs. of  16 variables:
 $ NOC   : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ ID    : int  132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
 $ Name  : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
 $ Sex   : chr  "M" "M" "M" "M" ...
 $ Age   : int  NA NA 28 NA 24 28 28 NA NA NA ...
 $ Height: chr  "Height" "Height" "Height" "Height" ...
 $ Weight: chr  "Weight" "Weight" "Weight" "Weight" ...
 $ Team  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Games : chr  "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
 $ Year  : int  1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
 $ Season: chr  "Summer" "Summer" "Summer" "Summer" ...
 $ City  : chr  "Melbourne" "London" "Moskva" "Melbourne" ...
 $ Sport : chr  "Hockey" "Hockey" "Wrestling" "Hockey" ...
 $ Event : chr  "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
 $ Medal : chr  NA NA NA NA ...
 $ Region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## With programmatically built new columns: 
athletes_3 <- athletes
for(i in c("Height", "Weight")){
  athletes_3 <- athletes_3 %>%
    mutate("{i}_char" := as.character(.[[i]]))
}

str(athletes_3)
'data.frame':   270767 obs. of  18 variables:
 $ NOC        : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ ID         : int  132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
 $ Name       : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
 $ Sex        : chr  "M" "M" "M" "M" ...
 $ Age        : int  NA NA 28 NA 24 28 28 NA NA NA ...
 $ Height     : int  NA NA 163 NA NA 168 NA NA NA NA ...
 $ Weight     : num  NA NA 57 NA 74 73 NA NA 57 NA ...
 $ Team       : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Games      : chr  "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
 $ Year       : int  1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
 $ Season     : chr  "Summer" "Summer" "Summer" "Summer" ...
 $ City       : chr  "Melbourne" "London" "Moskva" "Melbourne" ...
 $ Sport      : chr  "Hockey" "Hockey" "Wrestling" "Hockey" ...
 $ Event      : chr  "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
 $ Medal      : chr  NA NA NA NA ...
 $ Region     : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Height_char: chr  NA NA "163" NA ...
 $ Weight_char: chr  NA NA "57" NA ...

Beware of overwriting the same data.frame you put into mutate()! Otherwise the resulting data.frame will always be overwritten by the old one that always gets put into the function.

Functions

This also applies if we want to define a function with column names as arguments, using tidyverse inside. Here we need to embrace our variable as well to make it data masking friendly:

## This doesn't work:
print_col <- function(dat, var){
  print(dat %>% pull(var))
}

athletes %>%
  print_col(Region)
## This works:
print_col <- function(dat, var){
  print(dat %>% pull({{var}}))
}

athletes %>%
  print_col(Region)
    [1] "Afghanistan"                      "Afghanistan"                     
    [3] "Afghanistan"                      "Afghanistan"                     
    [5] "Afghanistan"                      "Afghanistan"                     
    [7] "Afghanistan"                      "Afghanistan"                     
...
!!sym()

In previous SIG-Sessions we have used !!sym() for this, which also works, but is more to remember:

athletes_2 <- athletes
for(i in c("Height", "Weight")){
  athletes_2 <- athletes_2 %>%
    mutate(!!sym(i) := as.character(!!sym(i)))
}

str(athletes_2)
'data.frame':   270767 obs. of  16 variables:
 $ NOC   : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ ID    : int  132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
 $ Name  : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
 $ Sex   : chr  "M" "M" "M" "M" ...
 $ Age   : int  NA NA 28 NA 24 28 28 NA NA NA ...
 $ Height: chr  NA NA "163" NA ...
 $ Weight: chr  NA NA "57" NA ...
 $ Team  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Games : chr  "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
 $ Year  : int  1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
 $ Season: chr  "Summer" "Summer" "Summer" "Summer" ...
 $ City  : chr  "Melbourne" "London" "Moskva" "Melbourne" ...
 $ Sport : chr  "Hockey" "Hockey" "Wrestling" "Hockey" ...
 $ Event : chr  "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
 $ Medal : chr  NA NA NA NA ...
 $ Region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...

Exercise

  1. Read the characters.rds and the psych_stats.csv into R (download here).
characters <- readRDS(here::here(  "raw_data", "characters.rds"))
psych_stats <- read.csv(here::here(  "raw_data", "psych_stats.csv"), sep = ";")
  1. Write a function that can standardize a numeric column. It should take a data.frame and a column as arguments. To standardize, the function should subtract the mean of the column from each of its raw values and then divide each difference by the standard deviation of the original variable.
    Finally, it should return a numeric vector containing the standardized values. Make the function work with data masking!

To extract a single column as a vector to do some operations on it you can use pull().

standardize_var <- function(dat, var){
  column <- dat %>%
    pull({{var}}) 
  
if(!is.numeric(column)){stop(paste0("Please provide a numeric column! Yours has the type '", typeof(column), "'."))}

  column_std <- (column - mean(column, na.rm = TRUE)) / sd(column, na.rm = TRUE) # automatically remove NAs
  return(column_std)
}

## test the function:
# standardize_var(athletes, Region)

standardized_col <- psych_stats %>% 
  standardize_var(messy_neat)

mean(standardized_col) %>% round(5)
[1] 0
sd(standardized_col)
[1] 1
  1. Apply this function within a for-loop on 10 columns of your choice. Add the standardized columns as new columns to your data frame.

Always overwrite the same data.frame you put into the function.

for(i in c("col_1", "col_2"){dat_new <- dat %>% ...} won’t work because we always overwrite the dat_new object, where our new column was added, with our old data.frame.

for(i in colnames(psych_stats)[20:29]){
 psych_stats <- psych_stats %>%
    mutate("{i}_std" := standardize_var(., .data[[i]])) # The point means we use the object called before the last pipe, in this case psych_stats. 
}

mean(psych_stats$flexible_rigid_std)
[1] -1.582513e-16
sd(psych_stats$flexible_rigid_std)
[1] 1
  1. Now build the same standardization function, but it should only take a numeric vector as input, not a whole data.frame.
standardize_var_2 <- function(vec){
  if(!is.numeric(vec)){stop(paste0("Please provide a numeric vector! Yours has the type '", typeof(vec), "'."))}
  (vec- mean(vec))/sd(vec)
}

## Testing the function
standardize_var_2(c(1,2,3))
[1] -1  0  1
# standardize_var_2(c("a",2,3))
  1. Apply one of your two standardization functions onto the psych_stats data.frame using across(). The standardized columns should be added to the data frame. Think carefully about which one should be used in this context and why.
    This exercise is a little bit harder, so use google, StackOverflow and the help function ? within R.
  • Look at the documentation of across() to see how you can give new column names into mutate() from within across().
  • It’s probably easier to use a function that only takes a vector, and not the whole data.frame within across().
psych_stats_std <- psych_stats %>%
  mutate(
    across(
    .cols = where(is.numeric), 
    .fns = standardize_var_2, 
    .names = "{.col}_standardized"
    )
  )

mean(psych_stats_std$innocent_jaded_standardized)
[1] 4.127042e-18

I used the function that only takes a vector: using it within across() made it possible to just provide the function, and not having to deal with the extra data argument. The only argument in my function is the vector, and it is more naturally for across to handle, because it already applies a function to a column, so the extraction is not necessary.

Footnotes

  1. Image by Darryl Low on Unsplash.↩︎