Lesser known dplyr tricks

In this blog post I share some lesser-known (at least I believe they are) tricks that use mainly functions from dplyr.

Removing unneeded columns

Did you know that you can use - in front of a column name to remove it from a data frame?

mtcars %>%

    select(-disp) %>%

    head()

##                    mpg cyl  hp drat    wt  qsec vs am gear carb

## Mazda RX4         21.0   6 110 3.90 2.620 16.46  0  1    4    4

## Mazda RX4 Wag     21.0   6 110 3.90 2.875 17.02  0  1    4    4

## Datsun 710        22.8   4  93 3.85 2.320 18.61  1  1    4    1

## Hornet 4 Drive    21.4   6 110 3.08 3.215 19.44  1  0    3    1

## Hornet Sportabout 18.7   8 175 3.15 3.440 17.02  0  0    3    2

## Valiant           18.1   6 105 2.76 3.460 20.22  1  0    3    1
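The same minus sign works on several columns at once. Here is a minimal sketch, assuming a recent dplyr and the untouched built-in mtcars; the object name `smaller` is only for illustration:

```r
library(dplyr)

# Drop several columns at once by negating a vector of column names
smaller <- mtcars %>%
    select(-c(disp, drat, qsec))

# The three columns are gone; the remaining eight keep their order
names(smaller)
```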

Re-ordering columns

Still using select(), it is easy to re-order columns in your data frame:

mtcars %>%

    select(cyl, disp, hp, everything()) %>%

    head()

##                   cyl disp  hp  mpg drat    wt  qsec vs am gear carb

## Mazda RX4           6  160 110 21.0 3.90 2.620 16.46  0  1    4    4

## Mazda RX4 Wag       6  160 110 21.0 3.90 2.875 17.02  0  1    4    4

## Datsun 710          4  108  93 22.8 3.85 2.320 18.61  1  1    4    1

## Hornet 4 Drive      6  258 110 21.4 3.08 3.215 19.44  1  0    3    1

## Hornet Sportabout   8  360 175 18.7 3.15 3.440 17.02  0  0    3    2

## Valiant             6  225 105 18.1 2.76 3.460 20.22  1  0    3    1

As its name implies, everything() simply means all the other columns.
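If your dplyr is version 1.0.0 or later, relocate() is another way to do this without having to mention the remaining columns at all. A small sketch on the untouched built-in mtcars (the object names `reordered` and `moved` are just for illustration):

```r
library(dplyr)

# By default relocate() moves the named columns to the front
reordered <- mtcars %>%
    relocate(cyl, disp, hp)

# .after (or .before) places a column next to another one
moved <- mtcars %>%
    relocate(carb, .after = mpg)

names(reordered)
names(moved)
```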

Renaming columns with `rename()`

mtcars <- rename(mtcars, spam_mpg = mpg)

mtcars <- rename(mtcars, spam_disp = disp)

mtcars <- rename(mtcars, spam_hp = hp)

head(mtcars)

##                   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am

## Mazda RX4             21.0   6       160     110 3.90 2.620 16.46  0  1

## Mazda RX4 Wag         21.0   6       160     110 3.90 2.875 17.02  0  1

## Datsun 710            22.8   4       108      93 3.85 2.320 18.61  1  1

## Hornet 4 Drive        21.4   6       258     110 3.08 3.215 19.44  1  0

## Hornet Sportabout     18.7   8       360     175 3.15 3.440 17.02  0  0

## Valiant               18.1   6       225     105 2.76 3.460 20.22  1  0

##                   gear carb

## Mazda RX4            4    4

## Mazda RX4 Wag        4    4

## Datsun 710           4    1

## Hornet 4 Drive       3    1

## Hornet Sportabout    3    2

## Valiant              3    1
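The three calls above can also be collapsed into one. The sketch below starts from `datasets::mtcars` so that it does not depend on the renaming already done; rename_with() assumes dplyr 1.0.0 or later, and the object names are only illustrative:

```r
library(dplyr)

# Several new_name = old_name pairs fit in a single rename() call
renamed <- datasets::mtcars %>%
    rename(spam_mpg = mpg, spam_disp = disp, spam_hp = hp)

# rename_with() applies a function to the names of the chosen columns
renamed2 <- datasets::mtcars %>%
    rename_with(~ paste0("spam_", .x), .cols = c(mpg, disp, hp))

# Both approaches give the same column names
identical(names(renamed), names(renamed2))
```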

Selecting columns with a regexp

It is easy to select the columns whose names contain “spam” with some helper functions:

mtcars %>%

    select(contains("spam")) %>%

    head()

##                   spam_mpg spam_disp spam_hp

## Mazda RX4             21.0       160     110

## Mazda RX4 Wag         21.0       160     110

## Datsun 710            22.8       108      93

## Hornet 4 Drive        21.4       258     110

## Hornet Sportabout     18.7       360     175

## Valiant               18.1       225     105

Also take a look at starts_with(), ends_with(), contains(), matches() (the one that accepts a regular expression), num_range(), one_of() and everything().
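To make the difference between these helpers concrete, here is a short sketch. It rebuilds the renamed columns from `datasets::mtcars` so that it runs on its own; `cars`, `prefix_cols` and `regex_cols` are names I made up for the example:

```r
library(dplyr)

# Rebuild the renamed columns from the untouched copy in the datasets package
cars <- datasets::mtcars %>%
    rename(spam_mpg = mpg, spam_disp = disp, spam_hp = hp)

# starts_with() only looks at the beginning of the column name
prefix_cols <- cars %>% select(starts_with("spam")) %>% names()

# matches() takes a regular expression: here, names ending in "p"
regex_cols <- cars %>% select(matches("p$")) %>% names()

prefix_cols
regex_cols
```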

Create new columns with `mutate()` and `if_else()`

mtcars %>%

    mutate(vs_new = if_else(

        vs == 1,

        "one",

        "zero",

        NA_character_)) %>%

    head()

##   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am gear carb vs_new

## 1     21.0   6       160     110 3.90 2.620 16.46  0  1    4    4   zero

## 2     21.0   6       160     110 3.90 2.875 17.02  0  1    4    4   zero

## 3     22.8   4       108      93 3.85 2.320 18.61  1  1    4    1    one

## 4     21.4   6       258     110 3.08 3.215 19.44  1  0    3    1    one

## 5     18.7   8       360     175 3.15 3.440 17.02  0  0    3    2   zero

## 6     18.1   6       225     105 2.76 3.460 20.22  1  0    3    1    one

You might want to create a new variable conditionally on several values of another column:

mtcars %>%

    mutate(carb_new = case_when(.$carb == 1 ~ "one",

                                .$carb == 2 ~ "two",

                                .$carb == 4 ~ "four",

                                 TRUE ~ "other")) %>%

    head(15)

##    spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am gear carb

## 1      21.0   6     160.0     110 3.90 2.620 16.46  0  1    4    4

## 2      21.0   6     160.0     110 3.90 2.875 17.02  0  1    4    4

## 3      22.8   4     108.0      93 3.85 2.320 18.61  1  1    4    1

## 4      21.4   6     258.0     110 3.08 3.215 19.44  1  0    3    1

## 5      18.7   8     360.0     175 3.15 3.440 17.02  0  0    3    2

## 6      18.1   6     225.0     105 2.76 3.460 20.22  1  0    3    1

## 7      14.3   8     360.0     245 3.21 3.570 15.84  0  0    3    4

## 8      24.4   4     146.7      62 3.69 3.190 20.00  1  0    4    2

## 9      22.8   4     140.8      95 3.92 3.150 22.90  1  0    4    2

## 10     19.2   6     167.6     123 3.92 3.440 18.30  1  0    4    4

## 11     17.8   6     167.6     123 3.92 3.440 18.90  1  0    4    4

## 12     16.4   8     275.8     180 3.07 4.070 17.40  0  0    3    3

## 13     17.3   8     275.8     180 3.07 3.730 17.60  0  0    3    3

## 14     15.2   8     275.8     180 3.07 3.780 18.00  0  0    3    3

## 15     10.4   8     472.0     205 2.93 5.250 17.98  0  0    3    4

##    carb_new

## 1      four

## 2      four

## 3       one

## 4       one

## 5       two

## 6       one

## 7      four

## 8       two

## 9       two

## 10     four

## 11     four

## 12    other

## 13    other

## 14    other

## 15     four

Mind the .$ before the variable carb. There is a GitHub issue about this, and it is already fixed in the development version of dplyr, which means that in the next version of dplyr, case_when() will work like any other specialized dplyr function inside mutate().
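With a dplyr release that includes that fix, the same call can be written with bare column names. A sketch, assuming such a release is installed and using `datasets::mtcars` so it runs on its own (`carb_labels` is an illustrative name):

```r
library(dplyr)

# Same recoding as above, without the .$ prefix
carb_labels <- datasets::mtcars %>%
    mutate(carb_new = case_when(carb == 1 ~ "one",
                                carb == 2 ~ "two",
                                carb == 4 ~ "four",
                                TRUE ~ "other"))

head(carb_labels$carb_new)
```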

Apply a function to certain columns only, by rows

mtcars %>%

    select(am, gear, carb) %>%

    purrr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2

head(mtcars2)

## # A tibble: 6 × 4

##      am  gear  carb sum_am_gear_carb

##   <dbl> <dbl> <dbl>            <dbl>

## 1     1     4     4                9

## 2     1     4     4                9

## 3     1     4     1                6

## 4     0     3     1                4

## 5     0     3     2                5

## 6     0     3     1                4

For this, I had to use purrr’s by_row() function (in later purrr releases, by_row() was moved to the purrrlyr package). You can then add this column to your original data frame:

mtcars <- cbind(mtcars, "sum_am_gear_carb" = mtcars2$sum_am_gear_carb)

head(mtcars)

##                   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am

## Mazda RX4             21.0   6       160     110 3.90 2.620 16.46  0  1

## Mazda RX4 Wag         21.0   6       160     110 3.90 2.875 17.02  0  1

## Datsun 710            22.8   4       108      93 3.85 2.320 18.61  1  1

## Hornet 4 Drive        21.4   6       258     110 3.08 3.215 19.44  1  0

## Hornet Sportabout     18.7   8       360     175 3.15 3.440 17.02  0  0

## Valiant               18.1   6       225     105 2.76 3.460 20.22  1  0

##                   gear carb sum_am_gear_carb

## Mazda RX4            4    4                9

## Mazda RX4 Wag        4    4                9

## Datsun 710           4    1                6

## Hornet 4 Drive       3    1                4

## Hornet Sportabout    3    2                5

## Valiant              3    1                4
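For a plain sum, the same column can also be built directly inside mutate(), without the intermediate data frame. This is only a sketch of an alternative, not a drop-in replacement for by_row(): rowwise() with c_across() assumes dplyr 1.0.0 or later, and the object names are made up for the example.

```r
library(dplyr)

# For a simple sum, ordinary vectorised arithmetic is enough
with_sum <- datasets::mtcars %>%
    mutate(sum_am_gear_carb = am + gear + carb)

# For an arbitrary function applied row by row: rowwise() plus c_across()
with_sum2 <- datasets::mtcars %>%
    rowwise() %>%
    mutate(sum_am_gear_carb = sum(c_across(c(am, gear, carb)))) %>%
    ungroup()

head(with_sum$sum_am_gear_carb)
```

Note that rowwise() returns a tibble, so the row names (the car models) are dropped in `with_sum2`.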

Use `do()` to do any arbitrary operation

mtcars %>%

    group_by(cyl) %>%

    do(models = lm(spam_mpg ~ drat + wt, data = .)) %>%

    broom::tidy(models)

## Source: local data frame [9 x 6]

## Groups: cyl [3]

##

##     cyl        term   estimate  std.error  statistic     p.value

##   <dbl>       <chr>      <dbl>      <dbl>      <dbl>       <dbl>

## 1     4 (Intercept) 33.2493403 17.0987286  1.9445504 0.087727622

## 2     4        drat  1.3244329  3.4519717  0.3836743 0.711215433

## 3     4          wt -5.2400608  2.2150213 -2.3656932 0.045551615

## 4     6 (Intercept) 30.6544931  7.5141648  4.0795609 0.015103868

## 5     6        drat -0.4435744  1.1740862 -0.3778039 0.724768945

## 6     6          wt -2.9902720  1.5685053 -1.9064468 0.129274249

## 7     8 (Intercept) 29.6519180  7.0878976  4.1834574 0.001527613

## 8     8        drat -1.4698722  1.6285054 -0.9025897 0.386081744

## 9     8          wt -2.4518017  0.7985112 -3.0704664 0.010651044

do() is useful when you want to use any R function (user-defined functions work too!) with dplyr functions. First I grouped the observations by cyl and then ran a linear model for each group. Then I converted the output to a tidy data frame using broom::tidy().
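In more recent dplyr versions (0.8.1 or later, as far as I know), group_modify() covers the same pattern. The sketch below fits the same model per group but extracts the coefficients by hand, so it only needs dplyr; it uses `datasets::mtcars` with the original column name mpg, and `coefs` is an illustrative name:

```r
library(dplyr)

coefs <- datasets::mtcars %>%
    group_by(cyl) %>%
    group_modify(~ {
        # .x holds the rows of the current group, without the cyl column
        fit <- lm(mpg ~ drat + wt, data = .x)
        # Return one row per coefficient
        data.frame(term = names(coef(fit)),
                   estimate = unname(coef(fit)))
    })

coefs
```

The result has one row per group and term, with the grouping column cyl added back in front.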

Using `dplyr` functions inside your own functions

extract_vars <- function(data, some_string){

  data %>%

    select_(lazyeval::interp(~contains(some_string))) -> data

  return(data)

}

extract_vars(mtcars, "spam")

##                     spam_mpg spam_disp spam_hp

## Mazda RX4               21.0     160.0     110

## Mazda RX4 Wag           21.0     160.0     110

## Datsun 710              22.8     108.0      93

## Hornet 4 Drive          21.4     258.0     110

## Hornet Sportabout       18.7     360.0     175

## Valiant                 18.1     225.0     105

## Duster 360              14.3     360.0     245

## Merc 240D               24.4     146.7      62

## Merc 230                22.8     140.8      95

## Merc 280                19.2     167.6     123

## Merc 280C               17.8     167.6     123

## Merc 450SE              16.4     275.8     180

## Merc 450SL              17.3     275.8     180

## Merc 450SLC             15.2     275.8     180

## Cadillac Fleetwood      10.4     472.0     205

## Lincoln Continental     10.4     460.0     215

## Chrysler Imperial       14.7     440.0     230

## Fiat 128                32.4      78.7      66

## Honda Civic             30.4      75.7      52

## Toyota Corolla          33.9      71.1      65

## Toyota Corona           21.5     120.1      97

## Dodge Challenger        15.5     318.0     150

## AMC Javelin             15.2     304.0     150

## Camaro Z28              13.3     350.0     245

## Pontiac Firebird        19.2     400.0     175

## Fiat X1-9               27.3      79.0      66

## Porsche 914-2           26.0     120.3      91

## Lotus Europa            30.4      95.1     113

## Ford Pantera L          15.8     351.0     264

## Ferrari Dino            19.7     145.0     175

## Maserati Bora           15.0     301.0     335

## Volvo 142E              21.4     121.0     109
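In current dplyr, the underscore verbs and lazyeval are no longer needed for this. Below is a sketch of how the same idea might look today, assuming a dplyr recent enough to support the `{{ }}` operator (rlang 0.4.0 or later); `extract_vars2` and `mean_by` are hypothetical helper names, not part of any package:

```r
library(dplyr)

# Hypothetical helper: selection helpers accept an ordinary character variable
extract_vars2 <- function(data, some_string) {
    data %>% select(contains(some_string))
}

# Hypothetical helper: bare column names passed by the caller are embraced with {{ }}
mean_by <- function(data, group_col, value_col) {
    data %>%
        group_by({{ group_col }}) %>%
        summarise(mean_value = mean({{ value_col }}))
}

selected <- extract_vars2(datasets::mtcars, "ar")
means <- mean_by(datasets::mtcars, cyl, mpg)

names(selected)
means
```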

You can read more about this last point here.

Hope you liked this small list of tricks!

Reposted from: http://www.brodrigues.co/blog/2017-02-17-lesser_known_tricks/
