Data wrangling in the tidyverse

tidyverse

R-SIG

tutorial

data manipulation

R-SIG 25.03.24

Published

March 25, 2024

library(tidyverse)

The tidyverse provides many tools for wrangling data, from selecting, sorting or renaming columns over filtering specific rows according to complex conditions to building new columns according to values in other columns. Let’s take a look at the most important ones. We will use the (athletes)[] dataset in the examples:

Load the data

# install.packages("tidyverse")
# install.packages("here")

library(tidyverse)
library(here)

athletes <- readRDS(file = here::here(  "raw_data", "athletes.rds"))

1 `select()` columns

Selecting columns from a data.frame is pretty straight forward:

athletes %>%
  select(Year, Sport)

      Year                     Sport
1     1956                    Hockey
2     1948                    Hockey
3     1980                 Wrestling
...

Note how we don’t have to put the columns in "", and how we can simply seperate them by ,.
select() becomes especially useful when combined with selection helpers:

## Select all columns starting with a Se
athletes %>%
  select(starts_with("Se"))

      Sex Season
1       M Summer
2       M Summer
3       M Summer
...

## Select all columns containing the letters "ea"
athletes %>%
  select(contains("ea"))

                                         Team Year Season
1                                 Afghanistan 1956 Summer
2                                 Afghanistan 1948 Summer
3                                 Afghanistan 1980 Summer
...

## Or, we can combine them:
athletes %>% 
  select(ends_with("t") & contains("igh"))

      Height Weight
1         NA     NA
2         NA     NA
3        163   57.0
...

2 `filter()` rows

We can use filter to subset rows according to their values in specific columns:

## All Volleyballers
athletes %>%
  filter(Sport == "Volleyball") %>%
  str

'data.frame':   3404 obs. of  16 variables:
 $ NOC   : chr  "ALG" "ALG" "ALG" "ALG" ...
 $ ID    : int  122168 73155 47642 74593 74593 117675 249 249 90117 90100 ...
 $ Name  : chr  "Faza Tsabet" "Narimne Madani" "Sehryne Hennaoui" "Nawal Mansouri" ...
...

## All Judoka between 50 and 100 kg
athletes %>%
  filter(Sport == "Judo", between(Weight, 50, 100)) %>%
  str()

'data.frame':   2916 obs. of  16 variables:
 $ NOC   : chr  "AFG" "AFG" "AFG" "ALB" ...
 $ ID    : int  99303 33817 7050 58601 121096 9593 78883 5689 78882 9593 ...
 $ Name  : chr  "Friba Razayee" "Ajmal Faizzada" "Mohammad Tawfiq Bakhshi" "Majlinda Kelmendi" ...
...

## All athletes with missing height
athletes %>%
  filter(is.na(Height)) %>%
  str()

'data.frame':   60083 obs. of  16 variables:
 $ NOC   : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ ID    : int  132181 87371 502 109153 1076 121376 80210 87374 6323 59344 ...
 $ Name  : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Ahmad Shah Abouwi" "Shakar Khan Shakar" ...
...

Note how we can just write our conditions without connecting them with & (filter() does that automatically for us). Also, we don’t have to put the column names into ““, because filter() knows that this are column names of the athletes data frame, which makes coding a bit more pleasant. Also, missing rows are automatically removed, which makes sense in many cases!

3 `pivot_...()` long/wide format

To reshape data.frames from long to wide or wide to long format we can use pivot_wider() and pivot_longer():

Let’s define a simpler data.frame first:

inhabitants_wide <- data.frame(
  country = c("China", "India", "USA"),
  inhabitants_2021 = c(1425893465, 1407563842, NA),
  inhabitants_2022 = c(1425857720, 1420939232, 338903174)
)
inhabitants_wide

  country inhabitants_2021 inhabitants_2022
1   China       1425893465       1425857720
2   India       1407563842       1420939232
3     USA               NA        338903174

inhabitants_long <- inhabitants_wide %>%
  pivot_longer(
    ## Select the columns we want to reshape:
    cols = starts_with("inhabitants"),
    names_prefix = "inhabitants_",
    ## Define a new column where the column names will go to:
    names_to = "year",
    ## Define a new column where the values will go to:
    values_to = "inhabitants"
  )

head(inhabitants_long)

# A tibble: 6 × 3
  country year  inhabitants
  <chr>   <chr>       <dbl>
1 China   2021   1425893465
2 China   2022   1425857720
3 India   2021   1407563842
4 India   2022   1420939232
5 USA     2021           NA
6 USA     2022    338903174

In other cases, it might happen that multiple variables are put into the same column, together with an identifier column:

inhabitants_long_2

  country         variable      value
1   China             area    9597000
2   China inhabitants_2022 1425857720
3   India             area    3287000
4   India inhabitants_2022 1420939232
5     USA             area    9834000
6     USA inhabitants_2022  338903174

In that case it can make sense to spread the the distinct variables into two columns:

inhabitants_wide_2 <- inhabitants_long_2 %>%
  pivot_wider(
    id_cols = "country",
    names_from = "variable",
    values_from = "value"
  )

inhabitants_wide_2

# A tibble: 3 × 3
  country    area inhabitants_2022
  <chr>     <dbl>            <dbl>
1 China   9597000       1425857720
2 India   3287000       1420939232
3 USA     9834000        338903174

4 `mutate()`

With mutate() we can add new columns to a data.frame or edit existing ones:

athletes %>%
  mutate(new_column = NA) %>%
  mutate(ID = as.character(ID)) %>%
  str()

'data.frame':   270767 obs. of  17 variables:
 $ NOC       : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ ID        : chr  "132181" "87371" "44977" "502" ...
 $ Name      : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
...

4.1 Useful helpers

Like select(), mutate() really starts to shine when helper functions are added. For example we can fill a new column according to values in other columns:

## Build a new column indicating if this is a contact sport athlete
athletes %>%
  mutate(contact_sport = ifelse(Sport %in% c("Wrestling", "Boxing", "Judo", "Rugby", "Taekwondo", "Rugby Sevens"), 
                                yes = TRUE, 
                                no = FALSE)
         ) %>%
  select(Name, Sport, contact_sport) %>%
  str

'data.frame':   270767 obs. of  3 variables:
 $ Name         : chr  "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
 $ Sport        : chr  "Hockey" "Hockey" "Wrestling" "Hockey" ...
 $ contact_sport: logi  FALSE FALSE TRUE FALSE TRUE TRUE ...

If we have to connect multiple ifelse() functions, it’s better to use dplyrs case_when():

## This gets complicated pretty quickly:
athletes %>%
  mutate(judo_weightclass = if_else(str_detect(Event, "Middleweight"), 
                                  true = "Middleweight", 
                                  false = if_else(str_detect(Event, "Half-Lightweight"), 
                                        true = "Half-Lightweight", 
                                        false = if_else(str_detect(Event, "Lightweight"), 
                                                    true = "Lightweight", 
                                                    false = NA)
                                        ) 
                                        )
                                  ) %>% 
  filter(Sport == "Judo") %>%
  select(Name, Sport, Event, judo_weightclass) %>%
  str

'data.frame':   3799 obs. of  4 variables:
 $ Name            : chr  "Friba Razayee" "Ajmal Faizzada" "Mohammad Tawfiq Bakhshi" "Sergio Murray" ...
 $ Sport           : chr  "Judo" "Judo" "Judo" "Judo" ...
 $ Event           : chr  "Judo Women's Middleweight" "Judo Men's Half-Lightweight" "Judo Men's Half-Heavyweight" "Judo Men's Middleweight" ...
 $ judo_weightclass: chr  "Middleweight" "Half-Lightweight" NA "Middleweight" ...

## so do this instead:
athletes %>%
  mutate(judo_weightclass = case_when(str_detect(Event, "Middleweight") ~ "Middleweight", 
                                      str_detect(Event, "Half-Lightweight") ~ "Half-Lightweight", 
                                      str_detect(Event, "Lightweight") ~ "Lightweight",
                                      TRUE ~ "other Weightclass" )
         ) %>%
  filter(Sport == "Judo") %>%
  select(Name, Sport, Event, judo_weightclass) %>%
  str()

'data.frame':   3799 obs. of  4 variables:
 $ Name            : chr  "Friba Razayee" "Ajmal Faizzada" "Mohammad Tawfiq Bakhshi" "Sergio Murray" ...
 $ Sport           : chr  "Judo" "Judo" "Judo" "Judo" ...
 $ Event           : chr  "Judo Women's Middleweight" "Judo Men's Half-Lightweight" "Judo Men's Half-Heavyweight" "Judo Men's Middleweight" ...
 $ judo_weightclass: chr  "Middleweight" "Half-Lightweight" "other Weightclass" "Middleweight" ...

4.2 Programmatically using `mutate()`

If you want to use mutate() programmatically within a loop or a function, take a look at Column-wise operations in the tidyverse

5 `..._replace_...()`

We can easily replace values in a column using str_replace() or replace_na():

athletes %>%
  mutate(Sex = str_replace(Sex, "M", "Male")) %>%
  mutate(Sex = str_replace(Sex, "F", "Female")) %>%
  mutate(Height = replace_na(Height, 0)) %>%
  select(Sex, Height) %>%
  str()

'data.frame':   270767 obs. of  2 variables:
 $ Sex   : chr  "Male" "Male" "Male" "Male" ...
 $ Height: int  0 0 163 0 0 168 0 0 0 0 ...

6 `group_by()`

We can group our data by values in specific columns and perform some sort of operation on the groups. If we wanted to know the number of medals each region has won, we can for example group by region and medal type, and then count() (another tidyverse function) the number of cases in each group:

medal_counts <- athletes %>%
  group_by(Region, Medal) %>%
  count(Medal) 

medal_counts

# A tibble: 533 × 3
# Groups:   Region, Medal [533]
   Region         Medal      n
   <chr>          <chr>  <int>
 1 Afghanistan    Bronze     2
 2 Afghanistan    <NA>     124
 3 Albania        <NA>      70
 4 Algeria        Bronze     8
 5 Algeria        Gold       5
 6 Algeria        Silver     4
...

We can also summarize() data:

## Let's see what the mean, min and max age of athletes was in each Region:
athletes %>%
  group_by(Region) %>%
      summarize(mean_age = mean(Age, na.rm = TRUE), 
                min_age = min(Age, na.rm = TRUE), 
                max_age = max(Age, na.rm = TRUE)
                )

# A tibble: 206 × 4
   Region         mean_age min_age max_age
   <chr>             <dbl>   <int>   <int>
 1 Afghanistan        23.5      17      35
 2 Albania            25.3      16      46
 3 Algeria            24.4      14      38
 4 American Samoa     27.2      16      43
 5 Andorra            23.1      15      61
 6 Angola             24.9      13      51
 7 Antigua            23.2      14      38
...

7 `..._join_..()` data.frames

We have multiple options for merging data.frames in the tidyverse. left_join() can be used if we want to keep all rows of the first data.frame and only adds the rows of the second data.frame that have an identifier in the first data.frame, right_join() keeps all rows of the second data frame, and full_join() keeps all rows of both data frames.

Let’s merge a world coordinate data set onto our medal counts. This can be helpful if we want to plot the number of won medals in each country later on:

world_coordinates <- readRDS(here::here("raw_data", "world_coordinates.rds"))

Only take gold medals into account:

medal_counts <- medal_counts %>% filter(Medal == "Gold")

To merge two data frames that include information that belongs together, we need a common column, on which we can combine them. In our case, this is the column containing the country. They are both named region, but one with an upper case R. This doesn’t pose a problem, as we can define which columns should be taken from which data frame for merging with join_by(). Let’s take a quick look before merging to check if there are any countries named differently in both data sets:

medal_counts$Region[!(medal_counts$Region %in% world_coordinates$region)]

[1] "Individual Olympic Athletes"

Looks like all of the countries in our medal_countries data frame can also be found in our world_coordinates frame. Only athletes without a country will be lost when merging, but that’s ok for now, as we are interested in the country specific gold medal counts. So let’s merge:

medal_countries <- world_coordinates %>%
  left_join(medal_counts, join_by(region == Region))

head(medal_countries)

       long      lat group order region subregion Medal  n
1 -69.89912 12.45200     1     1  Aruba      <NA>  <NA> NA
2 -69.89571 12.42300     1     2  Aruba      <NA>  <NA> NA
3 -69.94219 12.43853     1     3  Aruba      <NA>  <NA> NA
4 -70.00415 12.50049     1     4  Aruba      <NA>  <NA> NA
5 -70.06612 12.54697     1     5  Aruba      <NA>  <NA> NA
6 -70.05088 12.59707     1     6  Aruba      <NA>  <NA> NA

8 Exercise

Read the characters.rds and the psych_stats.csv into R (download here).

Solution

characters <- readRDS(here::here(  "raw_data", "characters.rds"))
psych_stats <- read.csv(here::here(  "raw_data", "psych_stats.csv"), sep = ";")

Reshape the psych_stats data frame so there are only three columns in the data set: char_id, question and rating.

Hint

You can select multiple columns like this: column_1:column_10.

Solution

psych_stats_long <- psych_stats %>%
  pivot_longer(cols = messy_neat:innocent_jaded, 
               names_to = "question", 
               values_to = "rating")

head(psych_stats_long)

# A tibble: 6 × 3
  char_id question                      rating
  <chr>   <chr>                          <dbl>
1 F2      messy_neat                     95.7 
2 F2      disorganized_self.disciplined  95.2 
3 F2      diligent_lazy                   6.10
4 F2      on.time_tardy                   6.2 
5 F2      competitive_cooperative         6.40
6 F2      scheduled_spontaneous           6.60

Now we have multiple rows for every character, but all question ratings are nicely aligned in one column.

Merge the characters data frame and the psych_stats_long data frame on a common column.

Hint

Identify the common columns. Are they named the same in both data frames? Look at the documentation of ?join_by() to see, how you can merge data frames that don’t have identically named columns.

Solution

First, let’s take a look at both data sets again:

str(characters)

'data.frame':   889 obs. of  7 variables:
 $ id        : chr  "F2" "F1" "F5" "F4" ...
 $ name      : chr  "Monica Geller" "Rachel Green" "Chandler Bing" "Joey Tribbiani" ...
 $ uni_id    : chr  "F" "F" "F" "F" ...
 $ uni_name  : chr  "Friends" "Friends" "Friends" "Friends" ...
 $ notability: num  79.7 76.7 74.4 74.3 72.6 51.6 86.5 84.2 82.6 65.6 ...
 $ link      : chr  "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/1" "https://openpsychometrics.org/tests/characters/stats/F/5" "https://openpsychometrics.org/tests/characters/stats/F/4" ...
 $ image_link: chr  "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/1.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/5.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/4.jpg" ...

str(psych_stats)

'data.frame':   889 obs. of  365 variables:
 $ char_id                                     : chr  "F2" "F1" "F5" "F4" ...
 $ messy_neat                                  : num  95.7 30.2 45.3 13 20.9 ...
 $ disorganized_self.disciplined               : num  95.2 25.9 42.4 11 20.9 75.6 10.4 31.9 39.6 31.1 ...
 $ diligent_lazy                               : num  6.1 51.8 52.2 78.1 45.2 ...
 $ on.time_tardy                               : num  6.2 77.9 57.1 84.1 74 20.6 85.7 68.3 73.6 58.2 ...
 $ competitive_cooperative                     : num  6.4 28.9 42.8 44.2 55.3 ...
 $ scheduled_spontaneous                       : num  6.6 72.3 54.9 91.3 94.9 ...
 $ ADHD_OCD                                    : num  92.9 31.8 26.7 10.4 12.8 70.1 35.5 30.1 51.8 39.2 ...
 $ chaotic_orderly                             : num  92.2 27 38.2 12.6 11.2 68.8 6.8 20.6 23.4 28.8 ...
 $ motivated_unmotivated                       : num  7.8 31.8 52.3 45.6 24.7 31.5 80.9 30.5 40.8 50.7 ...
 $ bossy_meek                                  : num  7.9 30.6 64.8 60.8 40.1 ...
 $ persistent_quitter                          : num  7.9 35.8 43.9 33.8 21.3 ...
 $ overachiever_underachiever                  : num  8.2 43.8 55.8 68.8 51.3 23.2 67.7 36.7 44.1 44.4 ...
 $ muddy_washed                                : num  91 80.2 58.7 42.7 48.1 64.6 27.6 62.4 70.1 69.2 ...
 $ beautiful_ugly                              : num  9.2 5.3 26.2 11 11.4 ...
 $ slacker_workaholic                          : num  90.8 45.9 53.4 17.6 32 81.5 23.8 30.1 33.2 34.6 ...
 $ driven_unambitious                          : num  9.5 30.3 49.8 49.4 43.4 22.7 58.5 34.1 32 47.4 ...
 $ outlaw_sheriff                              : num  90.3 39.3 46.7 23.8 16.1 85.4 21.4 22.7 27.3 30.1 ...
 $ precise_vague                               : num  9.9 64.7 53.2 78 78.1 25.4 68.4 60.1 47.3 61.7 ...
 $ bad.cook_good.cook                          : num  90 11.1 28.2 31.2 29.4 35.9 27.3 46.2 43.8 52.8 ...
 $ manicured_scruffy                           : num  10.6 7.7 45.6 47.6 62.5 20.5 81.3 37.3 20.3 20.9 ...
 $ lenient_strict                              : num  89.3 34.2 28.8 11 15.4 76.7 15.2 24.2 38.9 21.5 ...
 $ relaxed_tense                               : num  89 58.8 66.4 10.4 16.9 88.9 69.9 64.2 54.5 64.8 ...
 $ demanding_unchallenging                     : num  11 23.9 58.3 66.3 57.1 28.5 35.9 37.8 16.8 60.3 ...
 $ drop.out_valedictorian                      : num  88.9 32.5 47 14.9 22.1 87.7 12.5 29.6 36.5 51.2 ...
 $ go.getter_slugabed                          : num  11.7 31.3 52.6 48.1 27.6 41.8 62.6 33.9 27.3 51.1 ...
 $ competent_incompetent                       : num  11.9 47.1 37.1 77.2 53.6 37.8 51.9 41.1 35.2 56.1 ...
 $ aloof_obsessed                              : num  88.1 62.3 52.3 35.1 33.2 80.8 75.1 54.9 70.7 61.9 ...
 $ flexible_rigid                              : num  87.8 41.8 45.9 17.3 13.9 83.7 45.9 27.4 55 32.1 ...
 $ active_slothful                             : num  12.2 33.1 61.1 56.7 31.4 48.8 73.9 19.8 29.2 35.5 ...
 $ loose_tight                                 : num  87.4 43.2 44.3 14 15.3 82.5 28.1 26 44.8 43.6 ...
 $ pointed_random                              : num  12.8 49.9 67.1 86.2 87.4 36.7 65.4 53.1 36.9 56.2 ...
 $ fresh_stinky                                : num  12.9 14.6 31.9 44.3 39.2 44.3 64.4 30.2 18.2 24.6 ...
 $ dominant_submissive                         : num  13.6 41.6 73.7 40.2 30.9 69.5 43.5 52.6 36.9 77.9 ...
 $ anxious_calm                                : num  13.7 28.8 20 66.1 58 11.9 12 32.1 37.1 29.8 ...
 $ clean_perverted                             : num  13.7 42.5 56.8 77.5 59.4 44 53.1 51.2 61.9 50.6 ...
 $ neutral_opinionated                         : num  86.3 74.6 67.2 43.4 76.6 84.2 67.3 77.9 82.5 43.9 ...
 $ always.down_picky                           : num  85.9 72.6 49.8 27.1 35.2 71.9 23.6 36.2 71.8 36.2 ...
 $ hurried_leisurely                           : num  14.6 55.1 55.9 85.9 81 22.1 48.6 45.6 49 39.3 ...
 $ attractive_repulsive                        : num  14.7 9.4 28.5 15.7 18.2 ...
 $ devoted_unfaithful                          : num  14.8 29.1 22.6 41.5 19.6 47.5 34.1 55.7 42.7 48.2 ...
 $ helpless_resourceful                        : num  85 41.4 56.6 37.9 70.6 52.4 41.4 51.5 36.2 29.8 ...
 $ deliberate_spontaneous                      : num  15.1 71.7 56.5 89.1 92.9 20.9 78.6 88.3 64 60.9 ...
 $ plays.hard_works.hard                       : num  84.7 41.3 46.5 13.7 26 81.2 28.2 30 19.9 26.4 ...
 $ imaginative_practical                       : num  84.7 37.9 54.6 17 5.4 ...
 $ frenzied_sleepy                             : num  15.5 29.9 34.7 55.6 30 31.1 59.4 25.2 19 46 ...
 $ queer_straight                              : num  84.3 84.1 65.4 84.3 45.4 77.9 10.2 4.8 73.4 64.1 ...
 $ assertive_passive                           : num  15.8 40.4 66.3 44.3 39.3 60.9 45.1 45.8 23.4 63.3 ...
 $ fast.talking_slow.talking                   : num  15.9 20.8 18.3 42.2 21.7 49.6 69.5 34.3 32.5 44.5 ...
 $ astonishing_methodical                      : num  83.8 28 49.9 19.2 17.4 83 31.2 27.4 36 32.7 ...
 $ hoarder_unprepared                          : num  16.2 70 63.5 82 54.9 35.5 60.3 64.5 48.3 67.8 ...
 $ consistent_variable                         : num  16.6 60.2 46.3 63.1 79.3 39.5 72 65.3 69.7 62.3 ...
 $ involved_remote                             : num  16.7 26.3 42.7 30.2 36.7 36.6 62.2 39.3 26.4 38.7 ...
 $ backdoor_official                           : num  83.3 51.9 47.4 24.4 20.4 76.4 29.1 29.3 53.5 36.7 ...
 $ captain_first.mate                          : num  16.7 52.7 73.5 74.2 57.9 68.4 55.9 51 19 73.6 ...
 $ refined_rugged                              : num  17.3 18.9 48.4 74.4 69.9 24.4 81.6 48 31.4 40.7 ...
 $ accommodating_stubborn                      : num  82.7 77.2 48.2 43.9 48.3 78.5 78.1 69 85.9 41.5 ...
 $ barbaric_civilized                          : num  82.6 76.5 66.6 32.9 39.9 77 33.4 44.4 36.7 55.5 ...
 $ alpha_beta                                  : num  17.7 37.9 73.9 33.6 41.9 78.2 44.3 37.4 17.5 66.6 ...
 $ loyal_traitorous                            : num  17.8 32.3 20 15.3 14.5 40.3 29.2 43.1 47.2 33.2 ...
 $ trash_treasure                              : num  82 80.1 82.2 78.4 83.2 47.8 64.5 62.2 68.2 78.4 ...
 $ fast_slow                                   : num  18.1 43.7 38.1 69 55.3 60.4 57.8 29.4 30 54.5 ...
 $ perceptive_unobservant                      : num  18.3 59.5 41.5 80 41.1 48.6 21.6 33.3 28 49 ...
 $ goof.off_studious                           : num  81.4 33.2 20.7 7.4 16.6 ...
 $ feminist_sexist                             : num  18.6 23.3 43.9 62 10.5 ...
 $ desperate_high.standards                    : num  81.1 69.2 30.7 36.8 56.7 29.2 33.7 32.5 61.7 25.8 ...
 $ impatient_patient                           : num  18.9 21.9 34 25.7 39.1 25.8 23.8 35.1 18 57.2 ...
 $ preppy_punk.rock                            : num  18.9 16.4 41.5 49.5 73.2 14.4 87.7 74.4 26.4 18.2 ...
 $ naive_paranoid                              : num  80.7 35.5 66.6 22 39.7 71.6 69.6 45.6 50.7 32.1 ...
 $ important_irrelevant                        : num  19.3 22.3 24.6 24.7 26.4 47.4 12.5 14.8 16.4 33.4 ...
 $ apprentice_master                           : num  80.6 42.3 44.9 36.3 61.5 60.8 48 48 73 31.5 ...
 $ healthy_sickly                              : num  19.6 17.8 39.1 26.9 22.6 37 88.9 65.7 56.7 45.5 ...
 $ morning.lark_night.owl                      : num  19.6 69.9 58.3 80.4 61.9 23.2 90.6 81.9 90.1 78.3 ...
 $ alert_oblivious                             : num  19.6 70.7 55.5 87.6 78.9 57.1 54.7 48.9 38.3 67.4 ...
 $ f....the.police_tattle.tale                 : num  80 57.5 56.7 34.4 13.7 ...
 $ experimental_reliable                       : num  79.7 37.8 62 35 22.2 61.7 28 26.5 30.4 39.8 ...
 $ loud_quiet                                  : num  20.4 20.8 25 10.6 15.3 39.5 71.9 42.7 13.2 55.2 ...
 $ high.IQ_low.IQ                              : num  20.5 56.7 28.8 82.6 50.6 19.3 30.9 26.1 47.7 55.6 ...
 $ oppressed_privileged                        : num  79.2 85.4 67.2 66.5 42.1 84.3 22.4 19.6 59.9 63.4 ...
 $ animalistic_human                           : num  79.2 75.6 73.7 43.8 42.1 69.3 70.4 55.9 64.4 73.2 ...
 $ still_twitchy                               : num  79.2 68.6 79.9 76.9 83.6 81.9 77.9 67.4 60.1 58.4 ...
 $ thick_thin                                  : num  78.8 79.6 52.8 35.2 69.2 60.6 73.3 81.4 66.1 48.8 ...
 $ repetitive_varied                           : num  21.3 44.5 40.9 43.4 74.1 18.4 40.1 68.4 47.3 42.1 ...
 $ rational_whimsical                          : num  21.7 72.3 54.4 86.8 93 27.4 67 78.7 69.6 70.9 ...
 $ egalitarian_racist                          : num  21.7 27.8 24.7 24.3 10.7 ...
 $ disreputable_prestigious                    : num  78.2 66.2 47 32.5 36.7 68.2 21.2 42.5 65.2 45.8 ...
 $ ignorant_knowledgeable                      : num  78.2 37.7 66.9 22.2 59.9 68.5 60.8 68.1 44.2 42.6 ...
 $ hard.work_natural.talent                    : num  21.9 47.5 41.8 69.8 71.2 29.2 55.8 67.5 65.8 57.3 ...
 $ androgynous_gendered                        : num  78.1 89.4 68.5 82.5 60.1 78.1 32.6 43.4 88.3 87.9 ...
 $ dispassionate_romantic                      : num  77.9 80.5 64.7 69.6 74.9 67.2 61.5 64.8 59.1 82.3 ...
 $ eloquent_unpolished                         : num  22.1 32.1 56.1 79.8 69 33.7 76.3 45.2 35 42.9 ...
 $ permanent_transient                         : num  22.2 56.1 39 59.6 71.1 31.9 68.5 79.7 57.2 70.6 ...
 $ intense_lighthearted                        : num  22.2 50.8 73.8 79.8 64.2 28.2 22.4 34.7 18.2 44.3 ...
 $ mischievous_well.behaved                    : num  77.8 34.2 30.6 15.8 20.3 71.4 13.3 19.4 17.6 38.2 ...
 $ adventurous_stick.in.the.mud                : num  77.7 37.4 59.7 14.4 8 ...
 $ obedient_rebellious                         : num  22.3 69.2 42.9 72.9 86.2 16.5 92.3 87.1 84.2 38.1 ...
 $ authoritarian_democratic                    : num  22.4 55.2 70 72.1 75.4 41.6 68 67.4 21.8 68.9 ...
 $ city.slicker_country.bumpkin                : num  22.7 9 22.4 18.4 42.6 18.8 26.5 20.2 16.8 24 ...
 $ traditional_unorthodox                      : num  22.8 52.8 54.9 67.2 90 23.1 85.7 90.2 74.5 62.6 ...
  [list output truncated]

It seems like both data frames have a column containing an ID for the character. We can use that column for merging:

characters_stats <- characters %>%
  right_join(psych_stats_long, join_by(id == char_id))

str(characters_stats)

'data.frame':   323596 obs. of  9 variables:
 $ id        : chr  "F2" "F2" "F2" "F2" ...
 $ name      : chr  "Monica Geller" "Monica Geller" "Monica Geller" "Monica Geller" ...
 $ uni_id    : chr  "F" "F" "F" "F" ...
 $ uni_name  : chr  "Friends" "Friends" "Friends" "Friends" ...
 $ notability: num  79.7 79.7 79.7 79.7 79.7 79.7 79.7 79.7 79.7 79.7 ...
 $ link      : chr  "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/2" ...
 $ image_link: chr  "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" ...
 $ question  : chr  "messy_neat" "disorganized_self.disciplined" "diligent_lazy" "on.time_tardy" ...
 $ rating    : num  95.7 95.2 6.1 6.2 6.4 ...

Remove all columns from your merged data frame that start with "uni". Don’t overwrite your old data, this is just for exercise and won’t be worked with further on.

Hint

Take a look at the examples in ?select to see how you can select all columns but those fulfilling a certain condition.

Caution

characters_stats %>%
  select(!starts_with("uni")) %>%
  str()

'data.frame':   323596 obs. of  7 variables:
 $ id        : chr  "F2" "F2" "F2" "F2" ...
 $ name      : chr  "Monica Geller" "Monica Geller" "Monica Geller" "Monica Geller" ...
 $ notability: num  79.7 79.7 79.7 79.7 79.7 79.7 79.7 79.7 79.7 79.7 ...
 $ link      : chr  "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/2" ...
 $ image_link: chr  "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" ...
 $ question  : chr  "messy_neat" "disorganized_self.disciplined" "diligent_lazy" "on.time_tardy" ...
 $ rating    : num  95.7 95.2 6.1 6.2 6.4 ...

Calculate the mean rating of all characters by show and question, so you get the mean rating of all characters in a show on each item.

Hint

Use group_by() and summarise().

Caution

characters_summary <- characters_stats %>%
  group_by(uni_name, question) %>%
  summarise(mean_rating = mean(rating, na.rm = TRUE))

`summarise()` has grouped output by 'uni_name'. You can override using the
`.groups` argument.

Choose two of your favorite shows. Build a data frame that has two mean_rating columns, one for each show.

Hint

You can get an overview of the used shows with unique(characters_stats$uni_name) First, filter your two shows from the characters_stats data.frame.
Second, reshape this data.frame into long format.

Caution

compare_shows <- characters_summary %>%
  filter(str_detect(uni_name, c("Friends|How I Met Your Mother" ))) %>%
  pivot_wider(
    names_from = "uni_name",
    values_from = "mean_rating")

compare_shows

# A tibble: 364 × 3
   question               Friends `How I Met Your Mother`
   <chr>                    <dbl>                   <dbl>
 1 ADHD_OCD                  40.8                    46.6
 2 Coke_Pepsi                49.2                    48.1
 3 English_German            26.6                    31.3
 4 French_Russian            30.7                    38.7
 5 Greek_Roman               44.0                    47.9
 6 Italian_Swedish           39.2                    44.5
 7 abstract_concrete         46.0                    43.6
 8 accepting_judgemental     52.0                    53.4
 9 accommodating_stubborn    63.1                    63.6
10 active_slothful           40.6                    37.0
# ℹ 354 more rows

Now we could look at specific questions. For example the How I Met Your Mother characters seem to be rated a bit more slothful than the Friends characters.

Now, add a column containing the difference in rating between both shows for each question to your new comparison data.frame. Then, sort the rows by descending size of difference between the ratings. So the row with the highest difference in mean rating between your two shows should be on top.

Hint

To work with columns within mutate() you could for example use .$column_name.

Caution

compare_shows %>%
  mutate(diff_shows = .$Friends - .$"How I Met Your Mother") %>%
  arrange(desc(abs(diff_shows)))

# A tibble: 364 × 4
   question                            Friends How I Met Your Mothe…¹ diff_shows
   <chr>                                 <dbl>                  <dbl>      <dbl>
 1 cat.person_dog.person                  46.4                   65.1      -18.6
 2 musical_off.key                        61.5                   44.8       16.7
 3 nonpolitical_political                 40.2                   54.7      -14.5
 4 gamer_non.gamer                        58.6                   44.8       13.8
 5 focused.on.the.future_focused.on.t…    55.0                   41.8       13.2
 6 gatherer_hunter                        38.8                   51.2      -12.5
 7 captain_first.mate                     57.2                   44.9       12.3
 8 flower.child_goth                      21.5                   33.6      -12.1
 9 perceptive_unobservant                 48.2                   36.4       11.7
10 dunce_genius                           46.4                   57.9      -11.4
# ℹ 354 more rows
# ℹ abbreviated name: ¹`How I Met Your Mother`

desc() means descending, so we go from the largest value to the smallest.
abs() means absolute, so we get the absolute value instead of negative values in some cases.

Footnotes

Image by Bing Copilot ↩︎

1 select() columns

2 filter() rows

3 pivot_...() long/wide format

4 mutate()

4.1 Useful helpers

4.2 Programmatically using mutate()

5 ..._replace_...()

6 group_by()

7 ..._join_..() data.frames

8 Exercise

Footnotes

1 `select()` columns

2 `filter()` rows

3 `pivot_...()` long/wide format

4 `mutate()`

4.2 Programmatically using `mutate()`

5 `..._replace_...()`

6 `group_by()`

7 `..._join_..()` data.frames