library(tidyverse)
Data wrangling in the tidyverse
The tidyverse
provides many tools for wrangling data, from selecting, sorting or renaming columns over filtering specific rows according to complex conditions to building new columns according to values in other columns. Let’s take a look at the most important ones. We will use the (athletes
)[] dataset in the examples:
1 select()
columns
Selecting columns from a data.frame is pretty straight forward:
%>%
athletes select(Year, Sport)
Year Sport
1 1956 Hockey
2 1948 Hockey
3 1980 Wrestling
...
Note how we don’t have to put the columns in ""
, and how we can simply seperate them by ,
.
select()
becomes especially useful when combined with selection helpers:
## Select all columns starting with a Se
%>%
athletes select(starts_with("Se"))
Sex Season
1 M Summer
2 M Summer
3 M Summer
...
## Select all columns containing the letters "ea"
%>%
athletes select(contains("ea"))
Team Year Season
1 Afghanistan 1956 Summer
2 Afghanistan 1948 Summer
3 Afghanistan 1980 Summer
...
## Or, we can combine them:
%>%
athletes select(ends_with("t") & contains("igh"))
Height Weight
1 NA NA
2 NA NA
3 163 57.0
...
2 filter()
rows
We can use filter
to subset rows according to their values in specific columns:
## All Volleyballers
%>%
athletes filter(Sport == "Volleyball") %>%
str
'data.frame': 3404 obs. of 16 variables:
$ NOC : chr "ALG" "ALG" "ALG" "ALG" ...
$ ID : int 122168 73155 47642 74593 74593 117675 249 249 90117 90100 ...
$ Name : chr "Faza Tsabet" "Narimne Madani" "Sehryne Hennaoui" "Nawal Mansouri" ...
...
## All Judoka between 50 and 100 kg
%>%
athletes filter(Sport == "Judo", between(Weight, 50, 100)) %>%
str()
'data.frame': 2916 obs. of 16 variables:
$ NOC : chr "AFG" "AFG" "AFG" "ALB" ...
$ ID : int 99303 33817 7050 58601 121096 9593 78883 5689 78882 9593 ...
$ Name : chr "Friba Razayee" "Ajmal Faizzada" "Mohammad Tawfiq Bakhshi" "Majlinda Kelmendi" ...
...
## All athletes with missing height
%>%
athletes filter(is.na(Height)) %>%
str()
'data.frame': 60083 obs. of 16 variables:
$ NOC : chr "AFG" "AFG" "AFG" "AFG" ...
$ ID : int 132181 87371 502 109153 1076 121376 80210 87374 6323 59344 ...
$ Name : chr "Najam Yahya" "Ahmad Jahan Nuristani" "Ahmad Shah Abouwi" "Shakar Khan Shakar" ...
...
Note how we can just write our conditions without connecting them with &
(filter()
does that automatically for us). Also, we don’t have to put the column names into ““, because filter()
knows that this are column names of the athletes data frame, which makes coding a bit more pleasant. Also, missing rows are automatically removed, which makes sense in many cases!
3 pivot_...()
long/wide format
To reshape data.frames from long to wide or wide to long format we can use pivot_wider()
and pivot_longer()
:
Let’s define a simpler data.frame first:
<- data.frame(
inhabitants_wide country = c("China", "India", "USA"),
inhabitants_2021 = c(1425893465, 1407563842, NA),
inhabitants_2022 = c(1425857720, 1420939232, 338903174)
) inhabitants_wide
country inhabitants_2021 inhabitants_2022
1 China 1425893465 1425857720
2 India 1407563842 1420939232
3 USA NA 338903174
<- inhabitants_wide %>%
inhabitants_long pivot_longer(
## Select the columns we want to reshape:
cols = starts_with("inhabitants"),
names_prefix = "inhabitants_",
## Define a new column where the column names will go to:
names_to = "year",
## Define a new column where the values will go to:
values_to = "inhabitants"
)
head(inhabitants_long)
# A tibble: 6 × 3
country year inhabitants
<chr> <chr> <dbl>
1 China 2021 1425893465
2 China 2022 1425857720
3 India 2021 1407563842
4 India 2022 1420939232
5 USA 2021 NA
6 USA 2022 338903174
In other cases, it might happen that multiple variables are put into the same column, together with an identifier column:
inhabitants_long_2
country variable value
1 China area 9597000
2 China inhabitants_2022 1425857720
3 India area 3287000
4 India inhabitants_2022 1420939232
5 USA area 9834000
6 USA inhabitants_2022 338903174
In that case it can make sense to spread the the distinct variables into two columns:
<- inhabitants_long_2 %>%
inhabitants_wide_2 pivot_wider(
id_cols = "country",
names_from = "variable",
values_from = "value"
)
inhabitants_wide_2
# A tibble: 3 × 3
country area inhabitants_2022
<chr> <dbl> <dbl>
1 China 9597000 1425857720
2 India 3287000 1420939232
3 USA 9834000 338903174
4 mutate()
With mutate()
we can add new columns to a data.frame or edit existing ones:
%>%
athletes mutate(new_column = NA) %>%
mutate(ID = as.character(ID)) %>%
str()
'data.frame': 270767 obs. of 17 variables:
$ NOC : chr "AFG" "AFG" "AFG" "AFG" ...
$ ID : chr "132181" "87371" "44977" "502" ...
$ Name : chr "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
...
4.1 Useful helpers
Like select()
, mutate()
really starts to shine when helper functions are added. For example we can fill a new column according to values in other columns:
## Build a new column indicating if this is a contact sport athlete
%>%
athletes mutate(contact_sport = ifelse(Sport %in% c("Wrestling", "Boxing", "Judo", "Rugby", "Taekwondo", "Rugby Sevens"),
yes = TRUE,
no = FALSE)
%>%
) select(Name, Sport, contact_sport) %>%
str
'data.frame': 270767 obs. of 3 variables:
$ Name : chr "Najam Yahya" "Ahmad Jahan Nuristani" "Mohammad Halilula" "Ahmad Shah Abouwi" ...
$ Sport : chr "Hockey" "Hockey" "Wrestling" "Hockey" ...
$ contact_sport: logi FALSE FALSE TRUE FALSE TRUE TRUE ...
If we have to connect multiple ifelse()
functions, it’s better to use dplyrs
case_when()
:
## This gets complicated pretty quickly:
%>%
athletes mutate(judo_weightclass = if_else(str_detect(Event, "Middleweight"),
true = "Middleweight",
false = if_else(str_detect(Event, "Half-Lightweight"),
true = "Half-Lightweight",
false = if_else(str_detect(Event, "Lightweight"),
true = "Lightweight",
false = NA)
)
)%>%
) filter(Sport == "Judo") %>%
select(Name, Sport, Event, judo_weightclass) %>%
str
'data.frame': 3799 obs. of 4 variables:
$ Name : chr "Friba Razayee" "Ajmal Faizzada" "Mohammad Tawfiq Bakhshi" "Sergio Murray" ...
$ Sport : chr "Judo" "Judo" "Judo" "Judo" ...
$ Event : chr "Judo Women's Middleweight" "Judo Men's Half-Lightweight" "Judo Men's Half-Heavyweight" "Judo Men's Middleweight" ...
$ judo_weightclass: chr "Middleweight" "Half-Lightweight" NA "Middleweight" ...
## so do this instead:
%>%
athletes mutate(judo_weightclass = case_when(str_detect(Event, "Middleweight") ~ "Middleweight",
str_detect(Event, "Half-Lightweight") ~ "Half-Lightweight",
str_detect(Event, "Lightweight") ~ "Lightweight",
TRUE ~ "other Weightclass" )
%>%
) filter(Sport == "Judo") %>%
select(Name, Sport, Event, judo_weightclass) %>%
str()
'data.frame': 3799 obs. of 4 variables:
$ Name : chr "Friba Razayee" "Ajmal Faizzada" "Mohammad Tawfiq Bakhshi" "Sergio Murray" ...
$ Sport : chr "Judo" "Judo" "Judo" "Judo" ...
$ Event : chr "Judo Women's Middleweight" "Judo Men's Half-Lightweight" "Judo Men's Half-Heavyweight" "Judo Men's Middleweight" ...
$ judo_weightclass: chr "Middleweight" "Half-Lightweight" "other Weightclass" "Middleweight" ...
4.2 Programmatically using mutate()
If you want to use mutate()
programmatically within a loop or a function, take a look at Column-wise operations in the tidyverse
5 ..._replace_...()
We can easily replace values in a column using str_replace()
or replace_na()
:
%>%
athletes mutate(Sex = str_replace(Sex, "M", "Male")) %>%
mutate(Sex = str_replace(Sex, "F", "Female")) %>%
mutate(Height = replace_na(Height, 0)) %>%
select(Sex, Height) %>%
str()
'data.frame': 270767 obs. of 2 variables:
$ Sex : chr "Male" "Male" "Male" "Male" ...
$ Height: int 0 0 163 0 0 168 0 0 0 0 ...
6 group_by()
We can group our data by values in specific columns and perform some sort of operation on the groups. If we wanted to know the number of medals each region has won, we can for example group by region and medal type, and then count()
(another tidyverse
function) the number of cases in each group:
<- athletes %>%
medal_counts group_by(Region, Medal) %>%
count(Medal)
medal_counts
# A tibble: 533 × 3
# Groups: Region, Medal [533]
Region Medal n
<chr> <chr> <int>
1 Afghanistan Bronze 2
2 Afghanistan <NA> 124
3 Albania <NA> 70
4 Algeria Bronze 8
5 Algeria Gold 5
6 Algeria Silver 4
...
We can also summarize()
data:
## Let's see what the mean, min and max age of athletes was in each Region:
%>%
athletes group_by(Region) %>%
summarize(mean_age = mean(Age, na.rm = TRUE),
min_age = min(Age, na.rm = TRUE),
max_age = max(Age, na.rm = TRUE)
)
# A tibble: 206 × 4
Region mean_age min_age max_age
<chr> <dbl> <int> <int>
1 Afghanistan 23.5 17 35
2 Albania 25.3 16 46
3 Algeria 24.4 14 38
4 American Samoa 27.2 16 43
5 Andorra 23.1 15 61
6 Angola 24.9 13 51
7 Antigua 23.2 14 38
...
7 ..._join_..()
data.frames
We have multiple options for merging data.frames in the tidyverse
. left_join()
can be used if we want to keep all rows of the first data.frame and only adds the rows of the second data.frame that have an identifier in the first data.frame, right_join()
keeps all rows of the second data frame, and full_join()
keeps all rows of both data frames.
Let’s merge a world coordinate data set onto our medal counts. This can be helpful if we want to plot the number of won medals in each country later on:
<- readRDS(here::here("raw_data", "world_coordinates.rds")) world_coordinates
Only take gold medals into account:
<- medal_counts %>% filter(Medal == "Gold") medal_counts
To merge two data frames that include information that belongs together, we need a common column, on which we can combine them. In our case, this is the column containing the country. They are both named region
, but one with an upper case R
. This doesn’t pose a problem, as we can define which columns should be taken from which data frame for merging with join_by()
. Let’s take a quick look before merging to check if there are any countries named differently in both data sets:
$Region[!(medal_counts$Region %in% world_coordinates$region)] medal_counts
[1] "Individual Olympic Athletes"
Looks like all of the countries in our medal_countries
data frame can also be found in our world_coordinates
frame. Only athletes without a country will be lost when merging, but that’s ok for now, as we are interested in the country specific gold medal counts. So let’s merge:
<- world_coordinates %>%
medal_countries left_join(medal_counts, join_by(region == Region))
head(medal_countries)
long lat group order region subregion Medal n
1 -69.89912 12.45200 1 1 Aruba <NA> <NA> NA
2 -69.89571 12.42300 1 2 Aruba <NA> <NA> NA
3 -69.94219 12.43853 1 3 Aruba <NA> <NA> NA
4 -70.00415 12.50049 1 4 Aruba <NA> <NA> NA
5 -70.06612 12.54697 1 5 Aruba <NA> <NA> NA
6 -70.05088 12.59707 1 6 Aruba <NA> <NA> NA
8 Exercise
- Read the
characters.rds
and thepsych_stats.csv
into R (download here).
<- readRDS(here::here( "raw_data", "characters.rds"))
characters <- read.csv(here::here( "raw_data", "psych_stats.csv"), sep = ";") psych_stats
- Reshape the
psych_stats
data frame so there are only three columns in the data set:char_id
,question
andrating
.
You can select multiple columns like this: column_1:column_10
.
<- psych_stats %>%
psych_stats_long pivot_longer(cols = messy_neat:innocent_jaded,
names_to = "question",
values_to = "rating")
head(psych_stats_long)
# A tibble: 6 × 3
char_id question rating
<chr> <chr> <dbl>
1 F2 messy_neat 95.7
2 F2 disorganized_self.disciplined 95.2
3 F2 diligent_lazy 6.10
4 F2 on.time_tardy 6.2
5 F2 competitive_cooperative 6.40
6 F2 scheduled_spontaneous 6.60
Now we have multiple rows for every character, but all question ratings are nicely aligned in one column.
- Merge the
characters
data frame and thepsych_stats_long
data frame on a common column.
Identify the common columns. Are they named the same in both data frames? Look at the documentation of ?join_by()
to see, how you can merge data frames that don’t have identically named columns.
First, let’s take a look at both data sets again:
str(characters)
'data.frame': 889 obs. of 7 variables:
$ id : chr "F2" "F1" "F5" "F4" ...
$ name : chr "Monica Geller" "Rachel Green" "Chandler Bing" "Joey Tribbiani" ...
$ uni_id : chr "F" "F" "F" "F" ...
$ uni_name : chr "Friends" "Friends" "Friends" "Friends" ...
$ notability: num 79.7 76.7 74.4 74.3 72.6 51.6 86.5 84.2 82.6 65.6 ...
$ link : chr "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/1" "https://openpsychometrics.org/tests/characters/stats/F/5" "https://openpsychometrics.org/tests/characters/stats/F/4" ...
$ image_link: chr "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/1.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/5.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/4.jpg" ...
str(psych_stats)
'data.frame': 889 obs. of 365 variables:
$ char_id : chr "F2" "F1" "F5" "F4" ...
$ messy_neat : num 95.7 30.2 45.3 13 20.9 ...
$ disorganized_self.disciplined : num 95.2 25.9 42.4 11 20.9 75.6 10.4 31.9 39.6 31.1 ...
$ diligent_lazy : num 6.1 51.8 52.2 78.1 45.2 ...
$ on.time_tardy : num 6.2 77.9 57.1 84.1 74 20.6 85.7 68.3 73.6 58.2 ...
$ competitive_cooperative : num 6.4 28.9 42.8 44.2 55.3 ...
$ scheduled_spontaneous : num 6.6 72.3 54.9 91.3 94.9 ...
$ ADHD_OCD : num 92.9 31.8 26.7 10.4 12.8 70.1 35.5 30.1 51.8 39.2 ...
$ chaotic_orderly : num 92.2 27 38.2 12.6 11.2 68.8 6.8 20.6 23.4 28.8 ...
$ motivated_unmotivated : num 7.8 31.8 52.3 45.6 24.7 31.5 80.9 30.5 40.8 50.7 ...
$ bossy_meek : num 7.9 30.6 64.8 60.8 40.1 ...
$ persistent_quitter : num 7.9 35.8 43.9 33.8 21.3 ...
$ overachiever_underachiever : num 8.2 43.8 55.8 68.8 51.3 23.2 67.7 36.7 44.1 44.4 ...
$ muddy_washed : num 91 80.2 58.7 42.7 48.1 64.6 27.6 62.4 70.1 69.2 ...
$ beautiful_ugly : num 9.2 5.3 26.2 11 11.4 ...
$ slacker_workaholic : num 90.8 45.9 53.4 17.6 32 81.5 23.8 30.1 33.2 34.6 ...
$ driven_unambitious : num 9.5 30.3 49.8 49.4 43.4 22.7 58.5 34.1 32 47.4 ...
$ outlaw_sheriff : num 90.3 39.3 46.7 23.8 16.1 85.4 21.4 22.7 27.3 30.1 ...
$ precise_vague : num 9.9 64.7 53.2 78 78.1 25.4 68.4 60.1 47.3 61.7 ...
$ bad.cook_good.cook : num 90 11.1 28.2 31.2 29.4 35.9 27.3 46.2 43.8 52.8 ...
$ manicured_scruffy : num 10.6 7.7 45.6 47.6 62.5 20.5 81.3 37.3 20.3 20.9 ...
$ lenient_strict : num 89.3 34.2 28.8 11 15.4 76.7 15.2 24.2 38.9 21.5 ...
$ relaxed_tense : num 89 58.8 66.4 10.4 16.9 88.9 69.9 64.2 54.5 64.8 ...
$ demanding_unchallenging : num 11 23.9 58.3 66.3 57.1 28.5 35.9 37.8 16.8 60.3 ...
$ drop.out_valedictorian : num 88.9 32.5 47 14.9 22.1 87.7 12.5 29.6 36.5 51.2 ...
$ go.getter_slugabed : num 11.7 31.3 52.6 48.1 27.6 41.8 62.6 33.9 27.3 51.1 ...
$ competent_incompetent : num 11.9 47.1 37.1 77.2 53.6 37.8 51.9 41.1 35.2 56.1 ...
$ aloof_obsessed : num 88.1 62.3 52.3 35.1 33.2 80.8 75.1 54.9 70.7 61.9 ...
$ flexible_rigid : num 87.8 41.8 45.9 17.3 13.9 83.7 45.9 27.4 55 32.1 ...
$ active_slothful : num 12.2 33.1 61.1 56.7 31.4 48.8 73.9 19.8 29.2 35.5 ...
$ loose_tight : num 87.4 43.2 44.3 14 15.3 82.5 28.1 26 44.8 43.6 ...
$ pointed_random : num 12.8 49.9 67.1 86.2 87.4 36.7 65.4 53.1 36.9 56.2 ...
$ fresh_stinky : num 12.9 14.6 31.9 44.3 39.2 44.3 64.4 30.2 18.2 24.6 ...
$ dominant_submissive : num 13.6 41.6 73.7 40.2 30.9 69.5 43.5 52.6 36.9 77.9 ...
$ anxious_calm : num 13.7 28.8 20 66.1 58 11.9 12 32.1 37.1 29.8 ...
$ clean_perverted : num 13.7 42.5 56.8 77.5 59.4 44 53.1 51.2 61.9 50.6 ...
$ neutral_opinionated : num 86.3 74.6 67.2 43.4 76.6 84.2 67.3 77.9 82.5 43.9 ...
$ always.down_picky : num 85.9 72.6 49.8 27.1 35.2 71.9 23.6 36.2 71.8 36.2 ...
$ hurried_leisurely : num 14.6 55.1 55.9 85.9 81 22.1 48.6 45.6 49 39.3 ...
$ attractive_repulsive : num 14.7 9.4 28.5 15.7 18.2 ...
$ devoted_unfaithful : num 14.8 29.1 22.6 41.5 19.6 47.5 34.1 55.7 42.7 48.2 ...
$ helpless_resourceful : num 85 41.4 56.6 37.9 70.6 52.4 41.4 51.5 36.2 29.8 ...
$ deliberate_spontaneous : num 15.1 71.7 56.5 89.1 92.9 20.9 78.6 88.3 64 60.9 ...
$ plays.hard_works.hard : num 84.7 41.3 46.5 13.7 26 81.2 28.2 30 19.9 26.4 ...
$ imaginative_practical : num 84.7 37.9 54.6 17 5.4 ...
$ frenzied_sleepy : num 15.5 29.9 34.7 55.6 30 31.1 59.4 25.2 19 46 ...
$ queer_straight : num 84.3 84.1 65.4 84.3 45.4 77.9 10.2 4.8 73.4 64.1 ...
$ assertive_passive : num 15.8 40.4 66.3 44.3 39.3 60.9 45.1 45.8 23.4 63.3 ...
$ fast.talking_slow.talking : num 15.9 20.8 18.3 42.2 21.7 49.6 69.5 34.3 32.5 44.5 ...
$ astonishing_methodical : num 83.8 28 49.9 19.2 17.4 83 31.2 27.4 36 32.7 ...
$ hoarder_unprepared : num 16.2 70 63.5 82 54.9 35.5 60.3 64.5 48.3 67.8 ...
$ consistent_variable : num 16.6 60.2 46.3 63.1 79.3 39.5 72 65.3 69.7 62.3 ...
$ involved_remote : num 16.7 26.3 42.7 30.2 36.7 36.6 62.2 39.3 26.4 38.7 ...
$ backdoor_official : num 83.3 51.9 47.4 24.4 20.4 76.4 29.1 29.3 53.5 36.7 ...
$ captain_first.mate : num 16.7 52.7 73.5 74.2 57.9 68.4 55.9 51 19 73.6 ...
$ refined_rugged : num 17.3 18.9 48.4 74.4 69.9 24.4 81.6 48 31.4 40.7 ...
$ accommodating_stubborn : num 82.7 77.2 48.2 43.9 48.3 78.5 78.1 69 85.9 41.5 ...
$ barbaric_civilized : num 82.6 76.5 66.6 32.9 39.9 77 33.4 44.4 36.7 55.5 ...
$ alpha_beta : num 17.7 37.9 73.9 33.6 41.9 78.2 44.3 37.4 17.5 66.6 ...
$ loyal_traitorous : num 17.8 32.3 20 15.3 14.5 40.3 29.2 43.1 47.2 33.2 ...
$ trash_treasure : num 82 80.1 82.2 78.4 83.2 47.8 64.5 62.2 68.2 78.4 ...
$ fast_slow : num 18.1 43.7 38.1 69 55.3 60.4 57.8 29.4 30 54.5 ...
$ perceptive_unobservant : num 18.3 59.5 41.5 80 41.1 48.6 21.6 33.3 28 49 ...
$ goof.off_studious : num 81.4 33.2 20.7 7.4 16.6 ...
$ feminist_sexist : num 18.6 23.3 43.9 62 10.5 ...
$ desperate_high.standards : num 81.1 69.2 30.7 36.8 56.7 29.2 33.7 32.5 61.7 25.8 ...
$ impatient_patient : num 18.9 21.9 34 25.7 39.1 25.8 23.8 35.1 18 57.2 ...
$ preppy_punk.rock : num 18.9 16.4 41.5 49.5 73.2 14.4 87.7 74.4 26.4 18.2 ...
$ naive_paranoid : num 80.7 35.5 66.6 22 39.7 71.6 69.6 45.6 50.7 32.1 ...
$ important_irrelevant : num 19.3 22.3 24.6 24.7 26.4 47.4 12.5 14.8 16.4 33.4 ...
$ apprentice_master : num 80.6 42.3 44.9 36.3 61.5 60.8 48 48 73 31.5 ...
$ healthy_sickly : num 19.6 17.8 39.1 26.9 22.6 37 88.9 65.7 56.7 45.5 ...
$ morning.lark_night.owl : num 19.6 69.9 58.3 80.4 61.9 23.2 90.6 81.9 90.1 78.3 ...
$ alert_oblivious : num 19.6 70.7 55.5 87.6 78.9 57.1 54.7 48.9 38.3 67.4 ...
$ f....the.police_tattle.tale : num 80 57.5 56.7 34.4 13.7 ...
$ experimental_reliable : num 79.7 37.8 62 35 22.2 61.7 28 26.5 30.4 39.8 ...
$ loud_quiet : num 20.4 20.8 25 10.6 15.3 39.5 71.9 42.7 13.2 55.2 ...
$ high.IQ_low.IQ : num 20.5 56.7 28.8 82.6 50.6 19.3 30.9 26.1 47.7 55.6 ...
$ oppressed_privileged : num 79.2 85.4 67.2 66.5 42.1 84.3 22.4 19.6 59.9 63.4 ...
$ animalistic_human : num 79.2 75.6 73.7 43.8 42.1 69.3 70.4 55.9 64.4 73.2 ...
$ still_twitchy : num 79.2 68.6 79.9 76.9 83.6 81.9 77.9 67.4 60.1 58.4 ...
$ thick_thin : num 78.8 79.6 52.8 35.2 69.2 60.6 73.3 81.4 66.1 48.8 ...
$ repetitive_varied : num 21.3 44.5 40.9 43.4 74.1 18.4 40.1 68.4 47.3 42.1 ...
$ rational_whimsical : num 21.7 72.3 54.4 86.8 93 27.4 67 78.7 69.6 70.9 ...
$ egalitarian_racist : num 21.7 27.8 24.7 24.3 10.7 ...
$ disreputable_prestigious : num 78.2 66.2 47 32.5 36.7 68.2 21.2 42.5 65.2 45.8 ...
$ ignorant_knowledgeable : num 78.2 37.7 66.9 22.2 59.9 68.5 60.8 68.1 44.2 42.6 ...
$ hard.work_natural.talent : num 21.9 47.5 41.8 69.8 71.2 29.2 55.8 67.5 65.8 57.3 ...
$ androgynous_gendered : num 78.1 89.4 68.5 82.5 60.1 78.1 32.6 43.4 88.3 87.9 ...
$ dispassionate_romantic : num 77.9 80.5 64.7 69.6 74.9 67.2 61.5 64.8 59.1 82.3 ...
$ eloquent_unpolished : num 22.1 32.1 56.1 79.8 69 33.7 76.3 45.2 35 42.9 ...
$ permanent_transient : num 22.2 56.1 39 59.6 71.1 31.9 68.5 79.7 57.2 70.6 ...
$ intense_lighthearted : num 22.2 50.8 73.8 79.8 64.2 28.2 22.4 34.7 18.2 44.3 ...
$ mischievous_well.behaved : num 77.8 34.2 30.6 15.8 20.3 71.4 13.3 19.4 17.6 38.2 ...
$ adventurous_stick.in.the.mud : num 77.7 37.4 59.7 14.4 8 ...
$ obedient_rebellious : num 22.3 69.2 42.9 72.9 86.2 16.5 92.3 87.1 84.2 38.1 ...
$ authoritarian_democratic : num 22.4 55.2 70 72.1 75.4 41.6 68 67.4 21.8 68.9 ...
$ city.slicker_country.bumpkin : num 22.7 9 22.4 18.4 42.6 18.8 26.5 20.2 16.8 24 ...
$ traditional_unorthodox : num 22.8 52.8 54.9 67.2 90 23.1 85.7 90.2 74.5 62.6 ...
[list output truncated]
It seems like both data frames have a column containing an ID for the character. We can use that column for merging:
<- characters %>%
characters_stats right_join(psych_stats_long, join_by(id == char_id))
str(characters_stats)
'data.frame': 323596 obs. of 9 variables:
$ id : chr "F2" "F2" "F2" "F2" ...
$ name : chr "Monica Geller" "Monica Geller" "Monica Geller" "Monica Geller" ...
$ uni_id : chr "F" "F" "F" "F" ...
$ uni_name : chr "Friends" "Friends" "Friends" "Friends" ...
$ notability: num 79.7 79.7 79.7 79.7 79.7 79.7 79.7 79.7 79.7 79.7 ...
$ link : chr "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/2" ...
$ image_link: chr "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" ...
$ question : chr "messy_neat" "disorganized_self.disciplined" "diligent_lazy" "on.time_tardy" ...
$ rating : num 95.7 95.2 6.1 6.2 6.4 ...
- Remove all columns from your merged data frame that start with
"uni"
. Don’t overwrite your old data, this is just for exercise and won’t be worked with further on.
Take a look at the examples in ?select
to see how you can select all columns but those fulfilling a certain condition.
%>%
characters_stats select(!starts_with("uni")) %>%
str()
'data.frame': 323596 obs. of 7 variables:
$ id : chr "F2" "F2" "F2" "F2" ...
$ name : chr "Monica Geller" "Monica Geller" "Monica Geller" "Monica Geller" ...
$ notability: num 79.7 79.7 79.7 79.7 79.7 79.7 79.7 79.7 79.7 79.7 ...
$ link : chr "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/2" "https://openpsychometrics.org/tests/characters/stats/F/2" ...
$ image_link: chr "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" "https://openpsychometrics.org/tests/characters/test-resources/pics/F/2.jpg" ...
$ question : chr "messy_neat" "disorganized_self.disciplined" "diligent_lazy" "on.time_tardy" ...
$ rating : num 95.7 95.2 6.1 6.2 6.4 ...
- Calculate the mean rating of all characters by show and question, so you get the mean rating of all characters in a show on each item.
Use group_by()
and summarise()
.
<- characters_stats %>%
characters_summary group_by(uni_name, question) %>%
summarise(mean_rating = mean(rating, na.rm = TRUE))
`summarise()` has grouped output by 'uni_name'. You can override using the
`.groups` argument.
- Choose two of your favorite shows. Build a data frame that has two mean_rating columns, one for each show.
You can get an overview of the used shows with unique(characters_stats$uni_name)
First, filter your two shows from the characters_stats
data.frame.
Second, reshape this data.frame into long format.
<- characters_summary %>%
compare_shows filter(str_detect(uni_name, c("Friends|How I Met Your Mother" ))) %>%
pivot_wider(
names_from = "uni_name",
values_from = "mean_rating")
compare_shows
# A tibble: 364 × 3
question Friends `How I Met Your Mother`
<chr> <dbl> <dbl>
1 ADHD_OCD 40.8 46.6
2 Coke_Pepsi 49.2 48.1
3 English_German 26.6 31.3
4 French_Russian 30.7 38.7
5 Greek_Roman 44.0 47.9
6 Italian_Swedish 39.2 44.5
7 abstract_concrete 46.0 43.6
8 accepting_judgemental 52.0 53.4
9 accommodating_stubborn 63.1 63.6
10 active_slothful 40.6 37.0
# ℹ 354 more rows
Now we could look at specific questions. For example the How I Met Your Mother characters seem to be rated a bit more slothful than the Friends characters.
- Now, add a column containing the difference in rating between both shows for each question to your new comparison data.frame. Then, sort the rows by descending size of difference between the ratings. So the row with the highest difference in mean rating between your two shows should be on top.
To work with columns within mutate()
you could for example use .$column_name
.
%>%
compare_shows mutate(diff_shows = .$Friends - .$"How I Met Your Mother") %>%
arrange(desc(abs(diff_shows)))
# A tibble: 364 × 4
question Friends How I Met Your Mothe…¹ diff_shows
<chr> <dbl> <dbl> <dbl>
1 cat.person_dog.person 46.4 65.1 -18.6
2 musical_off.key 61.5 44.8 16.7
3 nonpolitical_political 40.2 54.7 -14.5
4 gamer_non.gamer 58.6 44.8 13.8
5 focused.on.the.future_focused.on.t… 55.0 41.8 13.2
6 gatherer_hunter 38.8 51.2 -12.5
7 captain_first.mate 57.2 44.9 12.3
8 flower.child_goth 21.5 33.6 -12.1
9 perceptive_unobservant 48.2 36.4 11.7
10 dunce_genius 46.4 57.9 -11.4
# ℹ 354 more rows
# ℹ abbreviated name: ¹`How I Met Your Mother`
desc()
means descending, so we go from the largest value to the smallest.
abs()
means absolute, so we get the absolute value instead of negative values in some cases.
Footnotes
Image by Bing Copilot↩︎