New features
- add_rownames() turns row names into an explicit variable (#639).
- as_data_frame() efficiently coerces a list into a data frame (#749).
- bind_rows() and bind_cols() efficiently bind a list of data frames by
  row or column. combine() applies the same coercion rules to vectors
  (it works like c() or unlist() but is consistent with the bind_rows()
  rules).
- right_join() (include all rows in y, and matching rows in x) and
  full_join() (include all rows in x and y) complete the family of
  mutating joins (#96).
- group_indices() computes a unique integer id for each group (#771). It
  can be called on a grouped_df without any arguments or on a data frame
  with the same arguments as group_by().
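The joins and binding helpers listed above can be tried on two small tables. This is a minimal sketch, not an excerpt from the dplyr documentation: the tables x and y, their columns, and their values are made up for illustration, and it assumes dplyr is installed. Only row counts are inspected, because row order after a join may differ between versions.

```r
library(dplyr)

# Hypothetical example tables (names and values are assumptions).
x <- data.frame(id = c(1, 2, 3), a = c("x1", "x2", "x3"),
                stringsAsFactors = FALSE)
y <- data.frame(id = c(2, 3, 4), b = c("y2", "y3", "y4"),
                stringsAsFactors = FALSE)

# right_join(): every row of y, plus matching columns from x.
right <- right_join(x, y, by = "id")

# full_join(): every row of x and every row of y.
full <- full_join(x, y, by = "id")

# bind_rows() / bind_cols(): bind a list of data frames by row or column.
stacked <- bind_rows(list(x, x))
side_by_side <- bind_cols(list(x, y["b"]))

# group_indices() on a grouped_df: one integer id per row, shared
# by rows that belong to the same group.
grouped <- group_by(data.frame(g = c("b", "a", "b"),
                               stringsAsFactors = FALSE), g)
ids <- group_indices(grouped)

nrow(right)  # 3 rows, one per row of y
nrow(full)   # 4 rows, ids 1 to 4
```

Rows without a match get NA in the columns that come from the other table, so id 4 has a missing a in both results.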
New vignettes
- vignette("data_frame") describes dplyr functions that make it easier
  and faster to create and coerce data frames. It subsumes the old memory
  vignette.
- vignette("two-table") describes how two-table verbs work in dplyr.
Minor improvements
- data_frame() (and as_data_frame() & tbl_df()) now explicitly
  forbid columns that are data frames or matrices (#775). All columns
  must be either a 1d atomic vector or a 1d list.
- do() uses lazyeval to evaluate its arguments in the correct
  environment (#744), and the new do_() is the SE equivalent of do()
  (#718). You can modify grouped data in place: this is probably a bad
  idea but it's sometimes convenient (#737). do() on grouped data tables
  now passes in all columns (not all columns except grouping vars)
  (#735, thanks to @kismsu). do() with database tables no longer
  potentially includes grouping variables twice (#673). Finally, do()
  gives more consistent outputs when there are no rows or no groups
  (#625).
- first() and last() preserve factors, dates and times (#509).
- Overhaul of single table verbs for the data.table backend. They now all
  use a consistent (and simpler) code base. This ensures that (e.g.) n()
  now works in all verbs (#579).
- In *_join(), you can now name only those variables that are different
  between the two tables, e.g. inner_join(x, y, c("a", "b", "c" = "d"))
  (#682). If non-join columns are the same, dplyr will add .x and .y
  suffixes to distinguish the source (#655).
- mutate() handles complex vectors (#436) and forbids POSIXlt results
  (instead of crashing) (#670).
- select() now implements a more sophisticated algorithm so if you're
  doing multiple includes and excludes with and without names, you're more
  likely to get what you expect (#644). You'll also get a better error
  message if you supply an input that doesn't resolve to an integer
  column position (#643).
- Printing has received a number of small tweaks. All print() methods
  invisibly return their input so you can interleave print() statements
  into a pipeline to see interim results. print() will print column names
  of 0 row data frames (#652), and will never print more than 20 rows
  (i.e. options(dplyr.print_max) is now 20), not 100 (#710). Row names are
  now never printed since no dplyr method is guaranteed to preserve them
  (#669). glimpse() prints the number of observations (#692). type_sum()
  gains a data frame method.
- summarise() handles list output columns (#832).
- slice() works for data tables (#717). Documentation clarifies that
  slice() can't work with relational databases, and the examples show
  how to achieve the same results using filter() (#720).
- dplyr now requires RSQLite >= 1.0. This shouldn't affect your code
  in any way (except that RSQLite now doesn't need to be attached) but does
  simplify the internals (#622).
- Functions that need to combine multiple results into a single column
  (e.g. join(), bind_rows() and summarise()) are more careful about
  coercion. Joining factors with the same levels in the same order
  preserves the original levels (#675). Joining factors with non-identical
  levels generates a warning and coerces to character (#684). Joining a
  character to a factor (or vice versa) generates a warning and coerces to
  character. Avoid these warnings by ensuring your data is compatible
  before joining. rbind_list() will throw an error if you attempt to
  combine an integer and factor (#751). rbind()ing a column full of NAs is
  allowed and just collects the appropriate missing value for the column
  type being collected (#493). summarise() is more careful about NA, e.g.
  the decision on the result type will be delayed until the first non-NA
  value is returned (#599). It will complain about loss of precision
  coercions, which can happen for expressions that return integers for
  some groups and doubles for others (#599).
- A number of functions gained new or improved hybrid handlers: first(),
  last(), nth() (#626), lead() & lag() (#683), %in% (#126). That means
  when you use these functions in a dplyr verb, we handle them in C++,
  rather than calling back to R, which improves performance. Hybrid
  min_rank() correctly handles NaN values (#726). The hybrid
  implementation of nth() falls back to R evaluation when n is not
  a length one integer or numeric, e.g. when it's an expression (#734).
  Hybrid dense_rank(), min_rank(), cume_dist(), ntile(), row_number()
  and percent_rank() now preserve NAs (#774).
- filter() returns its input when it has no rows or no columns (#782).
- Join functions keep attributes (e.g. time zone information) from the
  left argument for POSIXct and Date objects (#819), and only warn once
  about each incompatibility (#798).
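The *_join() item above, on naming only the differing join variables and on the .x/.y suffixes, can be sketched as follows. The tables and their columns (a, b, c, d, val) are hypothetical and chosen only to mirror the inline example; this assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical tables: join keys a and b share a name in both tables,
# while column c in x corresponds to column d in y.
x <- data.frame(a = c(1, 2), b = c("u", "v"), c = c(10, 20),
                val = c("x1", "x2"), stringsAsFactors = FALSE)
y <- data.frame(a = c(1, 2), b = c("u", "v"), d = c(10, 20),
                val = c("y1", "y2"), stringsAsFactors = FALSE)

# Unnamed elements match columns with the same name in both tables;
# the named element "c" = "d" matches x$c against y$d.
joined <- inner_join(x, y, by = c("a", "b", "c" = "d"))

# val is not a join column but exists in both tables, so the result
# carries it twice, as val.x and val.y.
names(joined)
```

Naming only the pair that differs keeps the by argument short when most key columns already agree between the two tables.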
Bug fixes
- [.tbl_df correctly computes row names for 0-column data frames, avoiding
  problems with xtable (#656).
- [.grouped_df will silently drop grouping if you don't include the
  grouping columns (#733).
- data_frame() now acts correctly if the first argument is a vector to be
  recycled (#680, thanks @jimhester).
- filter.data.table() works if the table has a variable called "V1" (#615).
- *_join() keeps columns in original order (#684). Joining a factor to a
  character vector doesn't segfault (#688). *_join functions can now deal
  with multiple encodings (#769), and correctly name results (#855).
- *_join.data.table() works when data.table isn't attached (#786).
- group_by() on a data table preserves original order of the rows (#623).
  group_by() supports variables with more than 39 characters thanks to
  a fix in lazyeval (#705). It gives a meaningful error message when a
  variable is not found in the data frame (#716).
- grouped_df() requires vars to be a list of symbols (#665).
- min(., na.rm = TRUE) works with Dates built on numeric vectors (#755).
- rename_() generic gets missing .dots argument (#708).
- row_number(), min_rank(), percent_rank(), dense_rank(), ntile() and
  cume_dist() handle data frames with 0 rows (#762). They all preserve
  missing values (#774). row_number() doesn't segfault when given an
  external variable with the wrong number of variables (#781).
- group_indices() handles the edge case when there are no variables (#867).
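To illustrate what "preserve missing values" means for the ranking functions above, here is a small sketch. The data frame scores and its column value are made up for illustration; this assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical input with one missing value in the middle.
scores <- data.frame(value = c(3, NA, 1))

# The missing input stays missing in the output instead of being
# assigned a rank; the remaining values are ranked among themselves.
ranked <- mutate(scores,
                 min_rank = min_rank(value),
                 row_num  = row_number(value))

ranked$min_rank  # 2 NA 1
ranked$row_num   # 2 NA 1
```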