github tidyverse/dplyr v0.4.0
dplyr 0.4.0

latest releases: v1.1.4, v1.1.3, v1.1.2...
9 years ago

New features

  • add_rownames() turns row names into an explicit variable (#639).
  • as_data_frame() efficiently coerces a list into a data frame (#749).
  • bind_rows() and bind_cols() efficiently bind a list of data frames by
    row or column. combine() applies the same coercion rules to vectors
    (it works like c() or unlist() but is consistent with the bind_rows()
  • right_join() (include all rows in y, and matching rows in x) and
    full_join() (include all rows in x and y) complete the family of
    mutating joins (#96).
  • group_indices() computes a unique integer id for each group (#771). It
    can be called on a grouped_df without any arguments or on a data frame
    with same arguments as group_by().

New vignettes

  • vignette("data_frame") describes dplyr functions that make it easier
    and faster to create and coerce data frames. It subsumes the old memory
  • vignette("two-table") describes how two-table verbs work in dplyr.

Minor improvements

  • data_frame() (and as_data_frame() & tbl_df()) now explicitly
    forbid columns that are data frames or matrices (#775). All columns
    must be either a 1d atomic vector or a 1d list.

  • do() uses lazyeval to correctly evaluate its arguments in the correct
    environment (#744), and new do_() is the SE equivalent of do() (#718).
    You can modify grouped data in place: this is probably a bad idea but it's
    sometimes convenient (#737). do() on grouped data tables now passes in all
    columns (not all columns except grouping vars) (#735, thanks to @kismsu).
    do() with database tables no longer potentially includes grouping
    variables twice (#673). Finally, do() gives more consistent outputs when
    there are no rows or no groups (#625).

  • first() and last() preserve factors, dates and times (#509).

  • Overhaul of single table verbs for data.table backend. They now all use
    a consistent (and simpler) code base. This ensures that (e.g.) n()
    now works in all verbs (#579).

  • In *_join(), you can now name only those variables that are different between
    the two tables, e.g. inner_join(x, y, c("a", "b", "c" = "d")) (#682).
    If non-join colums are the same, dplyr will add .x and .y
    suffixes to distinguish the source (#655).

  • mutate() handles complex vectors (#436) and forbids POSIXlt results
    (instead of crashing) (#670).

  • select() now implements a more sophisticated algorithm so if you're
    doing multiples includes and excludes with and without names, you're more
    likely to get what you expect (#644). You'll also get a better error
    message if you supply an input that doesn't resolve to an integer
    column position (#643).

  • Printing has recieved a number of small tweaks. All print() method methods
    invisibly return their input so you can interleave print() statements into a
    pipeline to see interim results. print() will column names of 0 row data
    frames (#652), and will never print more 20 rows (i.e.
    options(dplyr.print_max) is now 20), not 100 (#710). Row names are no
    never printed since no dplyr method is guaranteed to preserve them (#669).

    glimpse() prints the number of observations (#692)

    type_sum() gains a data frame method.

  • summarise() handles list output columns (#832)

  • slice() works for data tables (#717). Documentation clarifies that
    slice can't work with relational databases, and the examples show
    how to achieve the same results using filter() (#720).

  • dplyr now requires RSQLite >= 1.0. This shouldn't affect your code
    in any way (except that RSQLite now doesn't need to be attached) but does
    simplify the internals (#622).

  • Functions that need to combine multiple results into a single column
    (e.g. join(), bind_rows() and summarise()) are more careful about

    Joining factors with the same levels in the same order preserves the
    original levels (#675). Joining factors with non-identical levels
    generates a warning and coerces to character (#684). Joining a character
    to a factor (or vice versa) generates a warning and coerces to character.
    Avoid these warnings by ensuring your data is compatible before joining.

    rbind_list() will throw an error if you attempt to combine an integer and
    factor (#751). rbind()ing a column full of NAs is allowed and just
    collects the appropriate missing value for the column type being collected

    summarise() is more careful about NA, e.g. the decision on the result
    type will be delayed until the first non NA value is returned (#599).
    It will complain about loss of precision coercions, which can happen for
    expressions that return integers for some groups and a doubles for others

  • A number of functions gained new or improved hybrid handlers: first(),
    last(), nth() (#626), lead() & lag() (#683), %in% (#126). That means
    when you use these functions in a dplyr verb, we handle them in C++, rather
    than calling back to R, and hence improving performance.

    Hybrid min_rank() correctly handles NaN values (#726). Hybrid
    implementation of nth() falls back to R evaluation when n is not
    a length one integer or numeric, e.g. when it's an expression (#734).

    Hybrid dense_rank(), min_rank(), cume_dist(), ntile(), row_number()
    and percent_rank() now preserve NAs (#774)

  • filter returns its input when it has no rows or no columns (#782).

  • Join functions keep attributes (e.g. time zone information) from the
    left argument for POSIXct and Date objects (#819), and only
    only warn once about each incompatibility (#798).

Bug fixes

  • [.tbl_df correctly computes row names for 0-column data frames, avoiding
    problems with xtable (#656). [.grouped_df will silently drop grouping
    if you don't include the grouping columns (#733).
  • data_frame() now acts correctly if the first argument is a vector to be
    recycled. (#680 thanks @jimhester)
  • works if the table has a variable called "V1" (#615).
  • *_join() keeps columns in original order (#684).
    Joining a factor to a character vector doesn't segfault (#688).
    *_join functions can now deal with multiple encodings (#769),
    and correctly name results (#855).
  • * works when data.table isn't attached (#786).
  • group_by() on a data table preserves original order of the rows (#623).
    group_by() supports variables with more than 39 characters thanks to
    a fix in lazyeval (#705). It gives meaninful error message when a variable
    is not found in the data frame (#716).
  • grouped_df() requires vars to be a list of symbols (#665).
  • min(.,na.rm = TRUE) works with Dates built on numeric vectors (#755)
  • rename_() generic gets missing .dots argument (#708).
  • row_number(), min_rank(), percent_rank(), dense_rank(), ntile() and
    cume_dist() handle data frames with 0 rows (#762). They all preserve
    missing values (#774). row_number() doesn't segfault when giving an external
    variable with the wrong number of variables (#781)
  • group_indices handles the edge case when there are no variables (#867)

Don't miss a new dplyr release

NewReleases is sending notifications on new releases.