library(tidyverse)
<- readRDS(file = here::here( "raw_data", "athletes.rds")) athletes
Column-wise operations in the tidyverse
Column-wise operations with across()
Instead of looping over columns with a for-loop, we can also use across()
in combination with other tidyverse
functions.
mutate_...()
The functions mutate_all()
, mutate_at()
, and mutate_if()
do the same, but are superseded. This means they still work, but the tidyverse
team recommends to use across()
instead.
Across can take column names and a function that should be applied to the selected columns:
## Here we transform the Height and Weight columns to the type character:
%>%
athletes mutate(across(c(Height, Weight), as.character)) %>%
str
'data.frame': 270767 obs. of 16 variables:
$ NOC : chr "AFG" "AFG" "AFG" "AFG" ...
$ ID : int 132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
$ Name : chr "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
$ Sex : chr "M" "M" "M" "M" ...
$ Age : int NA NA 28 NA 24 28 28 NA NA NA ...
$ Height: chr NA NA "163" NA ...
$ Weight: chr NA NA "57" NA ...
$ Team : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Games : chr "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
$ Year : int 1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
$ Season: chr "Summer" "Summer" "Summer" "Summer" ...
$ City : chr "Melbourne" "London" "Moskva" "Melbourne" ...
$ Sport : chr "Hockey" "Hockey" "Wrestling" "Hockey" ...
$ Event : chr "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
$ Medal : chr NA NA NA NA ...
$ Region: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
But it is also possible to select columns based on a selection function:
## And here we transform all numeric columns into character:
%>%
athletes mutate(across(where(is.numeric), as.character)) %>%
str()
'data.frame': 270767 obs. of 16 variables:
$ NOC : chr "AFG" "AFG" "AFG" "AFG" ...
$ ID : chr "132181" "87371" "44977" "502" ...
$ Name : chr "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
$ Sex : chr "M" "M" "M" "M" ...
$ Age : chr NA NA "28" NA ...
$ Height: chr NA NA "163" NA ...
$ Weight: chr NA NA "57" NA ...
$ Team : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Games : chr "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
$ Year : chr "1956" "1948" "1980" "1956" ...
$ Season: chr "Summer" "Summer" "Summer" "Summer" ...
$ City : chr "Melbourne" "London" "Moskva" "Melbourne" ...
$ Sport : chr "Hockey" "Hockey" "Wrestling" "Hockey" ...
$ Event : chr "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
$ Medal : chr NA NA NA NA ...
$ Region: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
We can also specify our own transformation function.
Using tidyverse functions in loops
When using tidyverse syntax within a loop, we might run into the problem that the tidyverse function can’t deal with our iteration counter like we are used to:
## Trying to transform the Height and Weight column to character using a for-loop.
## Note: Normally across() would be a better option in this case (and most of the time anyways).
## But sometimes a good old fashioned for-loop might be easier to programm to get the job done,
## in which case one should keep some specifics in mind:
## This throws an error:
<- athletes
athletes_2
for(i in c("Height", "Weight")){
<- athletes_2 %>%
athletes_2 mutate(i = as.character(.$i))
}
This doesn’t work, because we are trying to program a function using data masking. Data masking just means that we don’t have to type athletes$Height
in a tidyverse
function, but simply Height
, because the function knows this refers to a column in the current data.frame.
Loops
- In the case of
mutate()
, we have to use dynamic dots, which need to used if we want to create names programmatically::=
instead of=
. - We need to embrace the changeable variable (either in a function or a loop) like this:
{var}
. - In
mutate()
, we can also simply write"{var}_..."
to paste together a new column name. - If we just want to extract data, we can use the
.data
pronoun with[[
(see here)..data
helps to clear up ambiguity, it makes clear you want to extract a column from the current data.frame. This is something different than the.
, which can be read like “data up to this point” and references the data that gets put into the function where the.
is used. The.
actually stands for a data.frame, while.data
is used for symbol evaluation.
<- athletes
athletes_2
for(i in c("Height", "Weight")){
<- athletes_2 %>%
athletes_2 mutate({{i}} := as.character(.[[i]]))
}str(athletes_2)
'data.frame': 270767 obs. of 16 variables:
$ NOC : chr "AFG" "AFG" "AFG" "AFG" ...
$ ID : int 132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
$ Name : chr "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
$ Sex : chr "M" "M" "M" "M" ...
$ Age : int NA NA 28 NA 24 28 28 NA NA NA ...
$ Height: chr NA NA "163" NA ...
$ Weight: chr NA NA "57" NA ...
$ Team : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Games : chr "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
$ Year : int 1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
$ Season: chr "Summer" "Summer" "Summer" "Summer" ...
$ City : chr "Melbourne" "London" "Moskva" "Melbourne" ...
$ Sport : chr "Hockey" "Hockey" "Wrestling" "Hockey" ...
$ Event : chr "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
$ Medal : chr NA NA NA NA ...
$ Region: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
for(i in c("Height", "Weight")){
<- athletes_2 %>%
athletes_2 mutate({{i}} := as.character({{i}}))
}str(athletes_2)
'data.frame': 270767 obs. of 16 variables:
$ NOC : chr "AFG" "AFG" "AFG" "AFG" ...
$ ID : int 132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
$ Name : chr "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
$ Sex : chr "M" "M" "M" "M" ...
$ Age : int NA NA 28 NA 24 28 28 NA NA NA ...
$ Height: chr "Height" "Height" "Height" "Height" ...
$ Weight: chr "Weight" "Weight" "Weight" "Weight" ...
$ Team : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Games : chr "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
$ Year : int 1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
$ Season: chr "Summer" "Summer" "Summer" "Summer" ...
$ City : chr "Melbourne" "London" "Moskva" "Melbourne" ...
$ Sport : chr "Hockey" "Hockey" "Wrestling" "Hockey" ...
$ Event : chr "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
$ Medal : chr NA NA NA NA ...
$ Region: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## With programmatically built new columns:
<- athletes
athletes_3 for(i in c("Height", "Weight")){
<- athletes_3 %>%
athletes_3 mutate("{i}_char" := as.character(.[[i]]))
}
str(athletes_3)
'data.frame': 270767 obs. of 18 variables:
$ NOC : chr "AFG" "AFG" "AFG" "AFG" ...
$ ID : int 132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
$ Name : chr "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
$ Sex : chr "M" "M" "M" "M" ...
$ Age : int NA NA 28 NA 24 28 28 NA NA NA ...
$ Height : int NA NA 163 NA NA 168 NA NA NA NA ...
$ Weight : num NA NA 57 NA 74 73 NA NA 57 NA ...
$ Team : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Games : chr "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
$ Year : int 1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
$ Season : chr "Summer" "Summer" "Summer" "Summer" ...
$ City : chr "Melbourne" "London" "Moskva" "Melbourne" ...
$ Sport : chr "Hockey" "Hockey" "Wrestling" "Hockey" ...
$ Event : chr "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
$ Medal : chr NA NA NA NA ...
$ Region : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Height_char: chr NA NA "163" NA ...
$ Weight_char: chr NA NA "57" NA ...
Beware of overwriting the same data.frame you put into mutate()
! Otherwise the resulting data.frame will always be overwritten by the old one that always gets put into the function.
Functions
This also applies if we want to define a function with column names as arguments, using tidyverse
inside. Here we need to embrace our variable as well to make it data masking friendly:
## This doesn't work:
<- function(dat, var){
print_col print(dat %>% pull(var))
}
%>%
athletes print_col(Region)
## This works:
<- function(dat, var){
print_col print(dat %>% pull({{var}}))
}
%>%
athletes print_col(Region)
[1] "Afghanistan" "Afghanistan"
[3] "Afghanistan" "Afghanistan"
[5] "Afghanistan" "Afghanistan"
[7] "Afghanistan" "Afghanistan"
...
In previous SIG-Sessions we have used !!sym()
for this, which also works, but is more to remember:
<- athletes
athletes_2 for(i in c("Height", "Weight")){
<- athletes_2 %>%
athletes_2 mutate(!!sym(i) := as.character(!!sym(i)))
}
str(athletes_2)
'data.frame': 270767 obs. of 16 variables:
$ NOC : chr "AFG" "AFG" "AFG" "AFG" ...
$ ID : int 132181 87371 44977 502 109153 29626 1076 121376 80210 87374 ...
$ Name : chr "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
$ Sex : chr "M" "M" "M" "M" ...
$ Age : int NA NA 28 NA 24 28 28 NA NA NA ...
$ Height: chr NA NA "163" NA ...
$ Weight: chr NA NA "57" NA ...
$ Team : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Games : chr "1956 Summer" "1948 Summer" "1980 Summer" "1956 Summer" ...
$ Year : int 1956 1948 1980 1956 1964 1960 1936 1956 1972 1956 ...
$ Season: chr "Summer" "Summer" "Summer" "Summer" ...
$ City : chr "Melbourne" "London" "Moskva" "Melbourne" ...
$ Sport : chr "Hockey" "Hockey" "Wrestling" "Hockey" ...
$ Event : chr "Hockey Men's Hockey" "Hockey Men's Hockey" "Wrestling Men's Bantamweight, Freestyle" "Hockey Men's Hockey" ...
$ Medal : chr NA NA NA NA ...
$ Region: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
Exercise
- Read the
characters.rds
and thepsych_stats.csv
into R (download here).
<- readRDS(here::here( "raw_data", "characters.rds"))
characters <- read.csv(here::here( "raw_data", "psych_stats.csv"), sep = ";") psych_stats
- Write a function that can standardize a numeric column. It should take a
data.frame
and a column as arguments. To standardize, the function should subtract the mean of the column from each of its raw values and then divide each difference by the standard deviation of the original variable.
Finally, it should return a numeric vector containing the standardized values. Make the function work with data masking!
To extract a single column as a vector to do some operations on it you can use pull()
.
<- function(dat, var){
standardize_var <- dat %>%
column pull({{var}})
if(!is.numeric(column)){stop(paste0("Please provide a numeric column! Yours has the type '", typeof(column), "'."))}
<- (column - mean(column, na.rm = TRUE)) / sd(column, na.rm = TRUE) # automatically remove NAs
column_std return(column_std)
}
## test the function:
# standardize_var(athletes, Region)
<- psych_stats %>%
standardized_col standardize_var(messy_neat)
mean(standardized_col) %>% round(5)
[1] 0
sd(standardized_col)
[1] 1
- Apply this function within a for-loop on 10 columns of your choice. Add the standardized columns as new columns to your data frame.
Always overwrite the same data.frame you put into the function.
for(i in c("col_1", "col_2"){dat_new <- dat %>% ...}
won’t work because we always overwrite the dat_new object, where our new column was added, with our old data.frame.
for(i in colnames(psych_stats)[20:29]){
<- psych_stats %>%
psych_stats mutate("{i}_std" := standardize_var(., .data[[i]])) # The point means we use the object called before the last pipe, in this case psych_stats.
}
mean(psych_stats$flexible_rigid_std)
[1] -1.582513e-16
sd(psych_stats$flexible_rigid_std)
[1] 1
- Now build the same standardization function, but it should only take a numeric vector as input, not a whole
data.frame
.
<- function(vec){
standardize_var_2 if(!is.numeric(vec)){stop(paste0("Please provide a numeric vector! Yours has the type '", typeof(vec), "'."))}
- mean(vec))/sd(vec)
(vec
}
## Testing the function
standardize_var_2(c(1,2,3))
[1] -1 0 1
# standardize_var_2(c("a",2,3))
- Apply one of your two standardization functions onto the psych_stats
data.frame
usingacross()
. The standardized columns should be added to the data frame. Think carefully about which one should be used in this context and why.
This exercise is a little bit harder, so use google, StackOverflow and the help function?
within R.
- Look at the documentation of
across()
to see how you can give new column names intomutate()
from withinacross()
. - It’s probably easier to use a function that only takes a vector, and not the whole data.frame within
across()
.
<- psych_stats %>%
psych_stats_std mutate(
across(
.cols = where(is.numeric),
.fns = standardize_var_2,
.names = "{.col}_standardized"
)
)
mean(psych_stats_std$innocent_jaded_standardized)
[1] 4.127042e-18
I used the function that only takes a vector: using it within across()
made it possible to just provide the function, and not having to deal with the extra data argument. The only argument in my function is the vector, and it is more naturally for across to handle, because it already applies a function to a column, so the extraction is not necessary.
Footnotes
Image by Darryl Low on Unsplash.↩︎