Lesser known dplyr tricks

时间:2023-03-09 15:12:30
Lesser known dplyr tricks

In this blog post I share some lesser-known (at least I believe they are) tricks that use mainly functions from dplyr.

Removing unneeded columns

Did you know that you can use - in front of a column name to remove it from a data frame?

mtcars %>%
select(-disp) %>%
head()
##                    mpg cyl  hp drat    wt  qsec vs am gear carb
## Mazda RX4 21.0 6 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 105 2.76 3.460 20.22 1 0 3 1

Re-ordering columns

Still using select(), it is easy te re-order columns in your data frame:

mtcars %>%
select(cyl, disp, hp, everything()) %>%
head()
##                   cyl disp  hp  mpg drat    wt  qsec vs am gear carb
## Mazda RX4 6 160 110 21.0 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 6 160 110 21.0 3.90 2.875 17.02 0 1 4 4
## Datsun 710 4 108 93 22.8 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 6 258 110 21.4 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 8 360 175 18.7 3.15 3.440 17.02 0 0 3 2
## Valiant 6 225 105 18.1 2.76 3.460 20.22 1 0 3 1

As its name implies everything() simply means all the other columns.

Renaming columns with rename()

mtcars <- rename(mtcars, spam_mpg = mpg)
mtcars <- rename(mtcars, spam_disp = disp)
mtcars <- rename(mtcars, spam_hp = hp) head(mtcars)
##                   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0
## gear carb
## Mazda RX4 4 4
## Mazda RX4 Wag 4 4
## Datsun 710 4 1
## Hornet 4 Drive 3 1
## Hornet Sportabout 3 2
## Valiant 3 1

Selecting columns with a regexp

It is easy to select the columns that start with “spam” with some helper functions:

mtcars %>%
select(contains("spam")) %>%
head()
##                   spam_mpg spam_disp spam_hp
## Mazda RX4 21.0 160 110
## Mazda RX4 Wag 21.0 160 110
## Datsun 710 22.8 108 93
## Hornet 4 Drive 21.4 258 110
## Hornet Sportabout 18.7 360 175
## Valiant 18.1 225 105

take also a look at starts_with()ends_with()contains()matches()num_range()one_of() and everything().

Create new columns with mutate() and if_else()

mtcars %>%
mutate(vs_new = if_else(
vs == 1,
"one",
"zero",
NA_character_)) %>%
head()
##   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am gear carb vs_new
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 zero
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 zero
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 one
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 one
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 zero
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 one

You might want to create a new variable conditionally on several values of another column:

mtcars %>%
mutate(carb_new = case_when(.$carb == 1 ~ "one",
.$carb == 2 ~ "two",
.$carb == 4 ~ "four",
TRUE ~ "other")) %>%
head(15)
##    spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am gear carb
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## carb_new
## 1 four
## 2 four
## 3 one
## 4 one
## 5 two
## 6 one
## 7 four
## 8 two
## 9 two
## 10 four
## 11 four
## 12 other
## 13 other
## 14 other
## 15 four

Mind the .$ before the variable carb. There is a github issue about this, and it is already fixed in the development version of dplyr, which means that in the next version of dplyrcase_when() will work as any other specialized dplyr function inside mutate().

Apply a function to certain columns only, by rows

mtcars %>%
select(am, gear, carb) %>%
purrr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2
head(mtcars2)
## # A tibble: 6 × 4
## am gear carb sum_am_gear_carb
## <dbl> <dbl> <dbl> <dbl>
## 1 1 4 4 9
## 2 1 4 4 9
## 3 1 4 1 6
## 4 0 3 1 4
## 5 0 3 2 5
## 6 0 3 1 4

For this, I had to use purrr’s by_row() function. You can then add this column to your original data frame:

mtcars <- cbind(mtcars, "sum_am_gear_carb" = mtcars2$sum_am_gear_carb)
head(mtcars)
##                   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0
## gear carb sum_am_gear_carb
## Mazda RX4 4 4 9
## Mazda RX4 Wag 4 4 9
## Datsun 710 4 1 6
## Hornet 4 Drive 3 1 4
## Hornet Sportabout 3 2 5
## Valiant 3 1 4

Use do() to do any arbitrary operation

mtcars %>%
group_by(cyl) %>%
do(models = lm(spam_mpg ~ drat + wt, data = .)) %>%
broom::tidy(models)
## Source: local data frame [9 x 6]
## Groups: cyl [3]
##
## cyl term estimate std.error statistic p.value
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 4 (Intercept) 33.2493403 17.0987286 1.9445504 0.087727622
## 2 4 drat 1.3244329 3.4519717 0.3836743 0.711215433
## 3 4 wt -5.2400608 2.2150213 -2.3656932 0.045551615
## 4 6 (Intercept) 30.6544931 7.5141648 4.0795609 0.015103868
## 5 6 drat -0.4435744 1.1740862 -0.3778039 0.724768945
## 6 6 wt -2.9902720 1.5685053 -1.9064468 0.129274249
## 7 8 (Intercept) 29.6519180 7.0878976 4.1834574 0.001527613
## 8 8 drat -1.4698722 1.6285054 -0.9025897 0.386081744
## 9 8 wt -2.4518017 0.7985112 -3.0704664 0.010651044

do() is useful when you want to use any R function (user defined functions work too!) with dplyr functions. First I grouped the observations by cyl and then ran a linear model for each group. Then I converted the output to a tidy data frame usingbroom::tidy().

Using dplyr() functions inside your own functions

extract_vars <- function(data, some_string){

  data %>%
select_(lazyeval::interp(~contains(some_string))) -> data return(data)
} extract_vars(mtcars, "spam")
##                     spam_mpg spam_disp spam_hp
## Mazda RX4 21.0 160.0 110
## Mazda RX4 Wag 21.0 160.0 110
## Datsun 710 22.8 108.0 93
## Hornet 4 Drive 21.4 258.0 110
## Hornet Sportabout 18.7 360.0 175
## Valiant 18.1 225.0 105
## Duster 360 14.3 360.0 245
## Merc 240D 24.4 146.7 62
## Merc 230 22.8 140.8 95
## Merc 280 19.2 167.6 123
## Merc 280C 17.8 167.6 123
## Merc 450SE 16.4 275.8 180
## Merc 450SL 17.3 275.8 180
## Merc 450SLC 15.2 275.8 180
## Cadillac Fleetwood 10.4 472.0 205
## Lincoln Continental 10.4 460.0 215
## Chrysler Imperial 14.7 440.0 230
## Fiat 128 32.4 78.7 66
## Honda Civic 30.4 75.7 52
## Toyota Corolla 33.9 71.1 65
## Toyota Corona 21.5 120.1 97
## Dodge Challenger 15.5 318.0 150
## AMC Javelin 15.2 304.0 150
## Camaro Z28 13.3 350.0 245
## Pontiac Firebird 19.2 400.0 175
## Fiat X1-9 27.3 79.0 66
## Porsche 914-2 26.0 120.3 91
## Lotus Europa 30.4 95.1 113
## Ford Pantera L 15.8 351.0 264
## Ferrari Dino 19.7 145.0 175
## Maserati Bora 15.0 301.0 335
## Volvo 142E 21.4 121.0 109

About this last point, you can read more about it here.

Hope you liked this small list of tricks!

转自:http://www.brodrigues.co/blog/2017-02-17-lesser_known_tricks/