github tidyverse/dplyr v0.3
dplyr 0.3

9 years ago

New functions

  • between() vector function efficiently determines if numeric values fall
    in a range, and is translated to a special form for SQL (#503).
  • count() makes it even easier to do (weighted) counts (#358).
  • data_frame() by @kevinushey is a nicer way of creating data frames.
    It never coerces column types (no more stringsAsFactors = FALSE!),
    never munges column names, and never adds row names. You can use previously
    defined columns to compute new columns (#376).
  • distinct() returns distinct (unique) rows of a tbl (#97). Supply
    additional variables to return the first row for each unique combination
    of variables.
  • Set operations, intersect(), union() and setdiff() now have methods
    for data frames, data tables and SQL database tables (#93). They pass their
    arguments down to the base functions, which will ensure they raise errors if
    you pass in too many arguments.
  • Joins (e.g. left_join(), inner_join(), semi_join(), anti_join())
    now allow you to join on different variables in x and y tables by
    supplying a named vector to by. For example, by = c("a" = "b") joins
    x.a to y.b.
  • n_groups() function tells you how many groups are in a tbl. It returns
    1 for ungrouped data. (#477)
  • transmute() works like mutate() but drops all variables that you didn't
    explicitly refer to (#302).
  • rename() makes it easy to rename variables - it works similarly to
    select() but it preserves columns that you didn't otherwise touch.
  • slice() allows you to select rows by position (#226). It includes
    positive integers, drops negative integers, and you can use expressions
    like n().
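
A minimal sketch of several of the functions listed above, assuming dplyr is
installed. The flights and airlines data frames and their column names are
hypothetical toy data, not taken from the release notes. Base data.frame() is
used instead of data_frame() so the sketch does not depend on which dplyr
version is installed.

```r
library(dplyr)

# Hypothetical toy data (not from the release notes).
flights <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "UA"),
  delay   = c(5, 20, -3, 12, 40),
  seats   = c(100, 150, 120, 120, 200),
  stringsAsFactors = FALSE
)
airlines <- data.frame(
  code = c("AA", "UA"),
  name = c("American", "United"),
  stringsAsFactors = FALSE
)

# between(): inclusive range test, here delays from 0 to 15.
on_time <- filter(flights, between(delay, 0, 15))

# count(): plain row counts, then counts weighted by seats.
n_by_carrier     <- count(flights, carrier)
seats_by_carrier <- count(flights, carrier, wt = seats)

# distinct(): unique rows of a table (here a one-column table).
carriers <- distinct(select(flights, carrier))

# Join on differently named keys: flights$carrier matches airlines$code.
named <- left_join(flights, airlines, by = c("carrier" = "code"))

# n_groups(): 1 for ungrouped data, one per carrier once grouped.
groups_before <- n_groups(flights)
groups_after  <- n_groups(group_by(flights, carrier))

# transmute() keeps only the columns it creates; rename() keeps the rest.
hours_only <- transmute(flights, delay_hours = delay / 60)
renamed    <- rename(flights, airline = carrier)

# slice(): positive positions keep rows, negative ones drop them,
# and n() refers to the last row.
first_two  <- slice(flights, 1:2)
drop_first <- slice(flights, -1)
last_row   <- slice(flights, n())
```

Each result is an ordinary data frame, so it can be inspected with print() or
passed on to further verbs.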

Programming with dplyr (non-standard evaluation)

  • You can now program with dplyr - every function that does non-standard
    evaluation (NSE) has a standard evaluation (SE) version ending in _.
    This is powered by the new lazyeval package which provides all the tools
    needed to implement NSE consistently and correctly.
  • See vignette("nse") for full details.
  • regroup() is deprecated. Please use the more flexible group_by_()
    instead.
  • summarise_each_q() and mutate_each_q() are deprecated. Please use
    summarise_each_() and mutate_each_() instead.
  • funs_q has been replaced with funs_.

Removed and deprecated features

  • %.% has been deprecated: please use %>% instead. chain() is
    defunct. (#518)
  • filter.numeric() removed. Need to figure out how to reimplement with
    new lazy eval system.
  • The Progress refclass is no longer exported to avoid conflicts with shiny.
    Instead use progress_estimated() (#535).
  • src_monetdb() is now implemented in MonetDB.R, not dplyr.
  • show_sql() and explain_sql() and matching global options dplyr.show_sql
    and dplyr.explain_sql have been removed. Instead use show_query() and
    explain().

Minor improvements and bug fixes

  • Main verbs now have individual documentation pages (#519).
  • %>% is simply re-exported from magrittr, instead of creating a local copy
    (#496, thanks to @jimhester)
  • Examples now use nycflights13 instead of hflights because its variables
    have better names and there are a few interlinked tables (#562). Lahman and
    nycflights13 are (once again) suggested packages. This means many examples
    will not work unless you explicitly install them with
    install.packages(c("Lahman", "nycflights13")) (#508). dplyr now depends on
    Lahman 3.0.1. A number of examples have been updated to reflect modified
    field names (#586).
  • do() now displays the progress bar only when used in interactive prompts
    and not when knitting (#428, @jimhester).
  • glimpse() now prints a trailing new line (#590).
  • group_by() has more consistent behaviour when grouping by constants:
    it creates a new column with that value (#410). It renames grouping
    variables (#410). The first argument is now .data so you can create
    new groups with name x (#534).
  • Now instead of overriding lag(), dplyr overrides lag.default(),
    which should avoid clobbering lag methods added by other packages.
    (#277).
  • mutate(data, a = NULL) removes the variable a from the returned
    dataset (#462).
  • trunc_mat() and hence print.tbl_df() and friends get a width argument
    to control the default output width. Set options(dplyr.width = Inf) to
    always show all columns (#589).
  • select() gains one_of() selector: this allows you to select variables
    provided by a character vector (#396). select() now fails immediately if
    you give an empty pattern to starts_with(), ends_with(), contains() or
    matches() (#481, @leondutoit). Fixed buglet in select() so that you can
    now create variables called val (#564).
  • Switched from RC to R6.
  • tally() and top_n() work consistently: neither accidentally
    evaluates the wt param. (#426, @mnel)
  • rename() handles grouped data (#640).
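
Two of the changes above, removing a column with mutate() and selecting with
one_of(), can be sketched as follows. The data frame df and the vector wanted
are hypothetical names introduced only for illustration.

```r
library(dplyr)

# Hypothetical toy data.
df <- data.frame(
  a = 1:3,
  b = c(2.5, 3.5, 4.5),
  c = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Setting a column to NULL inside mutate() removes it from the result.
without_a <- mutate(df, a = NULL)

# one_of() selects the columns named in a character vector.
wanted <- c("a", "c")
picked <- select(df, one_of(wanted))
```

This is useful when the column names to keep are computed at run time rather
than typed into the call.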

Minor improvements and bug fixes by backend

Databases

  • The db backend system has been completely overhauled in order to make
    it possible to add backends in other packages, and to support a much
    wider range of databases. See vignette("new-sql-backend") for instructions
    on how to create your own (#568).
  • src_mysql() gains a method for explain().
  • When mutate() creates a new variable that uses a window function,
    automatically wrap the result in a subquery (#484).
  • Correct SQL generation for first() and last() (#531).
  • order_by() now works in conjunction with window functions in databases
    that support them.

Data frames/tbl_df

  • All verbs now understand how to work with difftime() (#390) and
    AsIs (#453) objects. They all check that colnames are unique (#483), and
    are more robust when columns are not present (#348, #569, #600).
  • Hybrid evaluation bugs fixed:
    • Call substitution stopped too early when a sub expression contained a
      $ (#502).
    • Handle :: and ::: (#412).
    • cumany() and cumall() properly handle NA (#408).
    • nth() now correctly preserves the class when using dates, times and
      factors (#509).
    • Hybrid evaluation no longer substitutes within order_by() because
      order_by() needs to do its own NSE (#169).
  • [.tbl_df always returns a tbl_df (i.e. drop = FALSE is the default)
    (#587, #610). [.grouped_df preserves important output attributes (#398).
  • arrange() keeps the grouping structure of grouped data (#491, #605),
    and preserves input classes (#563).
  • contains() accidentally matched regular expressions; it now passes
    fixed = TRUE to grep() (#608).
  • filter() asserts all variables are white listed (#566).
  • mutate() makes a rowwise_df when given a rowwise_df (#463).
  • rbind_all() creates tbl_df objects instead of raw data.frames.
  • If select() doesn't match any variables, it returns a 0-column data frame,
    instead of the original (#498). It no longer fails when some columns
    are not named (#492).
  • sample_n() and sample_frac() methods for data.frames exported.
    (#405, @alyst)
  • A grouped data frame may have 0 groups (#486). Grouped df objects
    gain some basic validity checking, which should prevent some crashes
    related to corrupt grouped_df objects made by rbind() (#606).
  • More coherence when joining columns of compatible but different types,
    e.g. when joining a character vector and a factor (#455),
    or a numeric and integer (#450)
  • mutate() works on zero-row grouped data frames, and
    with list columns (#555).
  • LazySubset was confused about input data size (#452).
  • Internal n_distinct() is stricter about its inputs: it requires one symbol
    which must be from the data frame (#567).
  • rbind_*() handle data frames with 0 rows (#597). They fill character
    vector columns with NA instead of blanks (#595). They work with
    list columns (#463).
  • Improved handling of encoding for column names (#636).
  • Improved handling of hybrid evaluation re $ and @ (#645).
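
A small sketch of three of the data frame fixes above: literal matching in
contains(), arrange() keeping groups, and select() returning zero columns when
nothing matches. The data frame df and its column names are hypothetical and
chosen so that a regular-expression match and a literal match would differ.

```r
library(dplyr)

# Hypothetical toy data: "a.b" contains a literal dot, "axb" does not.
df <- data.frame(
  a.b = 1:4,
  axb = 5:8,
  g   = c("u", "u", "v", "v"),
  x   = c(3, 1, 4, 2),
  stringsAsFactors = FALSE
)

# contains() matches the text literally, so the "." is not a wildcard
# and "axb" is not selected.
literal <- select(df, contains("a.b"))

# arrange() on grouped data keeps the grouping structure.
sorted <- arrange(group_by(df, g), x)

# A selection that matches nothing gives a 0-column data frame.
none <- select(df, starts_with("zzz"))
```

Calling n_groups(sorted) and ncol(none) is a quick way to confirm the last two
behaviours on your installed version.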

Data tables

  • Fix major omission in tbl_dt() and grouped_dt() methods - I was
    accidentally doing a deep copy on every result :(
  • summarise() and group_by() now retain over-allocation when working with
    data.tables (#475, @arunsrinivasan).
  • Joining two data.tables now correctly dispatches to data table methods,
    and the result is a data table (#470).

Cubes

  • summarise.tbl_cube() works with a single grouping variable (#480).
