Dynamic column/variable names with dplyr using Standard Evaluation functions

Data manipulation works like a charm in R when using a library like dplyr. An often overlooked feature of this library is Standard Evaluation (SE), which is described in the vignette about the related Non-standard Evaluation. It allows you to use dynamic arguments in many dplyr functions (“verbs”).

When is this useful?

In dplyr you specify the columns you want to work with directly, without quoting them (i.e. without turning them into character strings):

# works:
mtcars %>% select(mpg, cyl)

# does not work in the dplyr version used for this post:
mtcars %>% select('mpg', 'cyl')
# -> Error: All select() inputs must resolve to integer column positions.

This is called Non-standard Evaluation (NSE). It’s good because it saves typing, but at the same time you can’t easily use dynamic arguments as you could with strings. Dynamic arguments are often necessary when you write loops that perform the same type of data manipulation one by one for different columns/variables. More generally, you need dynamic arguments when you’re writing functions that do not just solve a problem for a specific data set or a specific column in a data set, but should work with several kinds of data sets or columns (see also the Don’t Repeat Yourself (DRY) principle).

How to use the SE-versions of dplyr verbs

This is where Standard Evaluation (SE) comes into play. The SE-versions of dplyr verbs always end with an underscore, for example select_() or group_by_():

# using the SE-version select_()
# now this works:
mtcars %>% select_('mpg', 'cyl')

To pass a dynamically specified set of arguments to an SE-enabled dplyr function, we need to use the special .dots argument and pass it a list of strings:

# this is the same as above:
mtcars %>% select_(.dots = list('mpg', 'cyl'))

Of course this doesn’t make much sense so far, because it is not really “dynamic”. As an easy example, let’s say we want to select different sets of columns and print the first rows of each. We define a list of lists vars and loop through it. Each v in vars is a list of arguments passed to select_().

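# each element of vars is one set of column names
vars <- list(list('cyl', 'mpg'), list('vs', 'disp'))
for (v in vars) {
  # select the columns named in v and print the first rows
  print(mtcars %>% select_(.dots = v) %>% head())
}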

                  cyl  mpg
Mazda RX4           6 21.0
Mazda RX4 Wag       6 21.0
Datsun 710          4 22.8
Hornet 4 Drive      6 21.4
Hornet Sportabout   8 18.7
Valiant             6 18.1
                  vs disp
Mazda RX4          0  160
Mazda RX4 Wag      0  160
Datsun 710         1  108
Hornet 4 Drive     1  258
Hornet Sportabout  0  360
Valiant            1  225
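
The underscore verbs were deprecated in later dplyr releases (from 0.7.0 on). As a minimal sketch of the same loop, assuming a recent dplyr (1.0 or later), the column names can be passed to the ordinary select() through all_of(). Storing each column set as a character vector instead of a list is a choice made for this sketch, not something the original code requires.

```r
library(dplyr)

# Column sets chosen at run time, stored as character vectors
vars <- list(c("cyl", "mpg"), c("vs", "disp"))

for (v in vars) {
  # all_of() tells select() to treat v as a vector of column names
  print(mtcars %>% select(all_of(v)) %>% head())
}
```

all_of() raises an error if one of the names is missing from the data frame, which is usually what you want when the names come from a variable.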

Let’s do something more practical. For each list of variable arguments, we want to group by the first variable and then summarise the grouped data frame by calculating the mean of the second variable. Here, dynamic argument construction really comes into play, because we programmatically construct the arguments of summarise_(), e.g. mean_mpg = mean(mpg), using string concatenation and setNames():

summarise_vars <- list(list('cyl', 'mpg'), list('vs', 'disp'))

for (v in summarise_vars) {
  group_var <- v[[1]]   # group by this variable
  summ <- paste0('mean(', v[[2]], ')')  # construct summary method, e.g. mean(mpg)
  summ_name <- paste0('mean_', v[[2]])  # construct summary variable name, e.g. mean_mpg

  print(paste('grouping by', group_var, 'and summarising', summ))

  df_summ <- mtcars %>%
    group_by_(.dots = group_var) %>%
    summarise_(.dots = setNames(summ, summ_name))

  print(df_summ)  # show the summarised data frame
}

# output
[1] "grouping by cyl and summarising mean(mpg)"
# A tibble: 3 × 2
    cyl mean_mpg
  <dbl>    <dbl>
1     4 26.66364
2     6 19.74286
3     8 15.10000
[1] "grouping by vs and summarising mean(disp)"
# A tibble: 2 × 2
     vs mean_disp
  <dbl>     <dbl>
1     0  307.1500
2     1  132.4571
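
Since the underscore verbs were deprecated in later dplyr releases, here is a hedged sketch of the same grouping and summarising without them. It assumes dplyr 1.0 or later. mean_by() is a hypothetical helper introduced only for this illustration; it is not part of dplyr. Instead of pasting strings together, it uses across(all_of()) for the grouping column, the .data pronoun for the summarised column, and := to set the result name dynamically.

```r
library(dplyr)

# mean_by() is a hypothetical helper for this sketch, not a dplyr function
mean_by <- function(df, group_var, summ_var) {
  summ_name <- paste0("mean_", summ_var)  # e.g. "mean_mpg"
  df %>%
    # group by the column whose name is stored in group_var
    group_by(across(all_of(group_var))) %>%
    # !! with := sets the output column name from a string;
    # .data[[...]] looks up the input column by name
    summarise(!!summ_name := mean(.data[[summ_var]]))
}

summarise_vars <- list(c("cyl", "mpg"), c("vs", "disp"))

for (v in summarise_vars) {
  print(mean_by(mtcars, v[1], v[2]))
}
```

Wrapping the logic in a function means the same code works for any data frame and any pair of column names.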
