dplyr 0.8.0
Breaking changes
- The error
      could not find function "n"
  or the warning
      Calling `n()` without importing or prefixing it is deprecated, use `dplyr::n()`
  indicates that functions like `n()`, `row_number()`, ... are not imported or prefixed.
  The easiest fix is to import dplyr with `import(dplyr)` in your `NAMESPACE` or
  `#' @import dplyr` in a roxygen comment. Alternatively, such functions can be
  imported selectively like any other function, with `importFrom(dplyr, n)` in the
  `NAMESPACE` or `#' @importFrom dplyr n` in a roxygen comment. The third option is
  to prefix them, i.e. use `dplyr::n()`.
- If you see `checking S3 generic/method consistency` in R CMD check for your
  package, note that:
  - `sample_n()` and `sample_frac()` have gained `...`
  - `filter()` and `slice()` have gained `.preserve`
  - `group_by()` has gained `.drop`
- The message
      Error: `.data` is a corrupt grouped_df, ...
  signals code that makes wrong assumptions about the internals of a grouped data frame.
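For the first item above, the prefixing option might look like the following minimal sketch. The helper name `count_rows()` is hypothetical and not part of dplyr; it only illustrates calling `n()` through the `dplyr::` prefix, so that nothing needs to be imported.

```r
# Hypothetical helper, for illustration only: every dplyr function is
# prefixed, so the code does not rely on anything being imported.
count_rows <- function(grouped_df) {
  dplyr::summarise(grouped_df, count = dplyr::n())
}

# One row per group, with the number of rows in each group.
by_cyl <- dplyr::group_by(mtcars, cyl)
res <- count_rows(by_cyl)
```

Inside a package, the same function would work unchanged with `#' @importFrom dplyr n` and a bare `n()` call instead of the prefix.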
New functions
- New selection helper `group_cols()`. It can be called in selection contexts
  such as `select()` and matches the grouping variables of grouped tibbles.
- `last_col()` is re-exported from tidyselect (#3584).
- `group_trim()` drops unused levels of factors that are used as grouping variables.
- `nest_join()` creates a list column of the matching rows. `nest_join()` +
  `tidyr::unnest()` is equivalent to `inner_join` (#3570).
      band_members %>% nest_join(band_instruments)
- `group_nest()` is similar to `tidyr::nest()` but focuses on the variables to nest by
  instead of the nested columns.
      starwars %>% group_by(species, homeworld) %>% group_nest()
      starwars %>% group_nest(species, homeworld)
- `group_split()` is similar to `base::split()` but operates on existing groups when
  applied to a grouped data frame, or is subject to the data mask on ungrouped data frames.
      starwars %>% group_by(species, homeworld) %>% group_split()
      starwars %>% group_split(species, homeworld)
- `group_map()` and `group_walk()` are purrr-like functions to iterate on groups
  of a grouped data frame, jointly identified by the data subset (exposed as `.x`) and the
  data key (a one row tibble, exposed as `.y`). `group_map()` returns a grouped data frame that
  combines the results of the function; `group_walk()` is only used for side effects and returns
  its input invisibly.
      mtcars %>% group_by(cyl) %>% group_map(~ head(.x, 2L))
- `distinct_prepare()`, previously known as `distinct_vars()`, is exported. This is mostly useful for
  alternative backends (e.g. `dbplyr`).
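A few of the helpers listed above can be tried together on datasets that ship with R and dplyr (`mtcars`, `band_members`, `band_instruments`). This is only a sketch, assuming dplyr 0.8.0 or later is installed; the explicit `by = "name"` is added here to avoid the join message and is not required.

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

# group_cols() selects the grouping variables of a grouped tibble.
grouping_only <- by_cyl %>% select(group_cols())
names(grouping_only)

# group_split() returns one data frame per existing group.
pieces <- by_cyl %>% group_split()
length(pieces)

# nest_join() keeps every row of the left table and adds a list column
# (named after the right table) holding the matching rows.
nested <- band_members %>% nest_join(band_instruments, by = "name")
names(nested)
```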
Major changes
- `group_by()` gains the `.drop` argument. When set to `FALSE` the groups are generated
  based on factor levels, hence some groups may be empty (#341).
      # 3 groups
      tibble(
        x = 1:2,
        f = factor(c("a", "b"), levels = c("a", "b", "c"))
      ) %>%
        group_by(f, .drop = FALSE)
      # the order of the grouping variables matters
      df <- tibble(
        x = c(1,2,1,2),
        f = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))
      )
      df %>% group_by(f, x, .drop = FALSE)
      df %>% group_by(x, f, .drop = FALSE)
  The default behaviour drops the empty groups as in the previous versions.
      tibble(
        x = 1:2,
        f = factor(c("a", "b"), levels = c("a", "b", "c"))
      ) %>%
        group_by(f)
- `filter()` and `slice()` gain a `.preserve` argument to control which groups they should keep.
  The default `filter(.preserve = FALSE)` recalculates the grouping structure based on the
  resulting data; otherwise it is kept as is.
      df <- tibble(
        x = c(1,2,1,2),
        f = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))
      ) %>%
        group_by(x, f, .drop = FALSE)
      df %>% filter(x == 1)
      df %>% filter(x == 1, .preserve = TRUE)
- The notion of lazily grouped data frames has disappeared. All dplyr verbs now immediately
  recalculate the grouping structure, and respect the levels of factors.
- Subsets of columns now properly dispatch to the `[` or `[[` method when the column
  is an object (a vector with a class) instead of making assumptions on how the
  column should be handled. The `[` method must handle integer indices, including
  `NA_integer_`, i.e. `x[NA_integer_]` should produce a vector of the same class
  as `x` with whatever represents a missing value.
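The requirement on `[` in the last item can be illustrated with a small S3 class in base R. The class name `percent` and the constructor `new_percent()` are hypothetical and exist only for this sketch; the point is that the method restores the class, so indexing with `NA_integer_` yields a missing value of the same class.

```r
# Hypothetical S3 class "percent", defined here only for illustration.
new_percent <- function(x = double()) {
  structure(x, class = "percent")
}

# A `[` method that subsets the underlying data and restores the class.
`[.percent` <- function(x, i, ...) {
  new_percent(unclass(x)[i])
}

x <- new_percent(c(0.25, 0.5))

# Indexing with a missing integer index gives a length-one "percent"
# vector holding NA.
missing_value <- x[NA_integer_]
class(missing_value)
```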
Minor changes
- `tally()` works correctly on non-data frame table sources such as `tbl_sql` (#3075).
- `sample_n()` and `sample_frac()` can use `n()` (#3527).
- `distinct()` respects the order of the variables provided (#3195, @foo-bar-baz-qux)
  and handles the 0 rows and 0 columns special case (#2954).
- `combine()` uses tidy dots (#3407).
- `group_indices()` can be used without argument in expressions in verbs (#1185).
- Using `mutate_all()`, `transmute_all()`, `mutate_if()` and `transmute_if()`
  with grouped tibbles now informs you that the grouping variables are
  ignored. In the case of the `_all()` verbs, the message invites you to use
  `mutate_at(df, vars(-group_cols()))` (or the equivalent `transmute_at()` call)
  instead if you'd like to make it explicit in your code that the operation is
  not applied on the grouping variables.
- Scoped variants of `arrange()` respect the `.by_group` argument (#3504).
- `first()` and `last()` hybrid functions fall back to R evaluation when given no arguments (#3589).
- `mutate()` removes a column when the expression evaluates to `NULL` for all groups (#2945).
- Grouped data frames support `[, drop = TRUE]` (#3714).
- New low-level constructor `new_grouped_df()` and validator `validate_grouped_df` (#3837).
- `glimpse()` prints group information on grouped tibbles (#3384).
- `sample_n()` and `sample_frac()` gain `...` (#2888).
- Scoped filter variants now support functions and purrr-like lambdas:
      mtcars %>% filter_at(vars(hp, vs), ~ . %% 2 == 0)
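The `mutate_at(df, vars(-group_cols()))` form suggested by the message above can be sketched on a small made-up tibble. The data and the doubling lambda are invented for illustration.

```r
library(dplyr)

# Made-up data: one grouping variable and two numeric columns.
df <- tibble(
  g = c("a", "a", "b"),
  x = c(1, 2, 3),
  y = c(10, 20, 30)
) %>%
  group_by(g)

# Explicitly exclude the grouping variable from the selection, so the
# lambda is applied to `x` and `y` only.
doubled <- df %>% mutate_at(vars(-group_cols()), ~ . * 2)
```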
Lifecycle
- `do()`, `rowwise()` and `combine()` are questioning (#3494).
- `funs()` is soft-deprecated and will start issuing warnings in a future version.
Changes to column wise functions
- Scoped variants for `distinct()`: `distinct_at()`, `distinct_if()`, `distinct_all()` (#2948).
- `summarise_at()` excludes the grouping variables (#3613).
- `mutate_all()`, `mutate_at()`, `summarise_all()` and `summarise_at()` handle utf-8 names (#2967).
Performance
- R expressions that cannot be handled with native code are now evaluated with
  unwind-protection when available (on R 3.5 and later). This improves the
  performance of dplyr on data frames with many groups (and hence many
  expressions to evaluate). We benchmarked that computing a grouped average is
  consistently twice as fast with unwind-protection enabled.
  Unwind-protection also makes dplyr more robust in corner cases because it
  ensures the C++ destructors are correctly called in all circumstances
  (debugger exit, captured condition, restart invocation).
- `sample_n()` and `sample_frac()` gain `...` (#2888).
- Improved performance for wide tibbles (#3335).
- Faster hybrid `sum()`, `mean()`, `var()` and `sd()` for logical vectors (#3189).
- Hybrid version of `sum(na.rm = FALSE)` exits early when there are missing values.
  This considerably improves performance when there are missing values early in the vector (#3288).
- `group_by()` does not trigger the additional `mutate()` on simple uses of the `.data`
  pronoun (#3533).
Internal
- The grouping metadata of grouped data frames has been reorganized in a single tidy tibble, which can be accessed
  with the new `group_data()` function. The grouping tibble consists of one column per grouping variable,
  followed by a list column of the (1-based) indices of the groups. The new `group_rows()` function retrieves
  that list of indices (#3489).
      # the grouping metadata, as a tibble
      group_by(starwars, homeworld) %>% group_data()
      # the indices
      group_by(starwars, homeworld) %>% group_data() %>% pull(.rows)
      group_by(starwars, homeworld) %>% group_rows()
- Hybrid evaluation has been completely redesigned for better performance and stability.
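To make the layout of the grouping metadata concrete, here is a small sketch on a made-up three-row tibble, assuming dplyr 0.8.0 or later. The data is invented for illustration; `as.integer()` is used only to compare the indices independently of the list column's exact class.

```r
library(dplyr)

# Made-up data: group "a" holds rows 1 and 3, group "b" holds row 2.
df <- tibble(g = c("a", "b", "a"), x = 1:3) %>% group_by(g)

# One column per grouping variable, followed by the `.rows` list column.
meta <- group_data(df)
names(meta)

# The same indices, retrieved directly.
rows <- group_rows(df)
as.integer(rows[[1]])
as.integer(rows[[2]])
```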
Documentation
- Add documentation example for moving variable to back in `?select` (#3051).
- Column wise functions are better documented, in particular explaining when
  grouping variables are included as part of the selection.