OpenRefine 4.0-alpha1

Pre-release, 2 years ago

This is the first alpha release of the 4.0 series. Expect many bugs; help with testing this new architecture is welcome.

Main changes

  • This new version uses a different workspace: your projects from OpenRefine 3.x will not appear in this version. They will not be deleted though: you can always open them again by running OpenRefine 3.x. Project archives exported from OpenRefine 3.x can be read in OpenRefine 4.x, but the operation history will be discarded.
  • Project data no longer needs to fit in the working memory (RAM) of your machine. This makes it easier to work on large datasets. (#242)
  • OpenRefine operations can be executed in Apache Spark (#1433). The execution engine is currently selected at startup with the -r (Unix) or /r (Windows) parameter. This is expected to change before a stable release, because Spark support will be moved to an extension (see #4396).
  • Facet statistics are computed on a sample of rows by default. The size of the sample can be configured.
  • The CSV/TSV importer supports a new option which controls whether rows are allowed to span multiple lines of the source file.
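
To picture what computing facet statistics on a sample means in practice, here is a minimal Java sketch. It is not OpenRefine's implementation: the class name SampledFacet, the method estimateCounts, and the choice to sample the first rows of a column are all assumptions made for illustration, since these notes do not say how the sample is drawn.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only, not part of the OpenRefine code base.
class SampledFacet {

    /**
     * Counts each distinct value among the first sampleSize rows of a column,
     * then scales the counts by (total rows / sampled rows) to estimate the
     * counts for the whole column. Sampling a prefix is an assumption.
     */
    static Map<String, Long> estimateCounts(List<String> column, int sampleSize) {
        int sampled = Math.min(sampleSize, column.size());
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String value : column.subList(0, sampled)) {
            counts.merge(value, 1L, Long::sum);
        }
        if (sampled == 0) {
            return counts; // nothing sampled, nothing to scale
        }
        double scale = (double) column.size() / sampled;
        counts.replaceAll((value, count) -> Math.round(count * scale));
        return counts;
    }
}
```

For a column holding a, a, b, a, b, b, b, b and a sample size of 4, this sketch estimates 6 rows for "a" and 2 for "b", while the true counts are 3 and 5. Such a gap is why a configurable sample size matters: a larger sample costs more time but gives statistics closer to the full dataset.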

Documentation for these new features will be published soon.

For developers

Most extensions will be incompatible with this new version, as many breaking changes have been introduced.

  • OpenRefine now uses the org.openrefine namespace instead of com.google.refine.
  • The code base was split into more granular Maven modules. Those modules are published to Maven Central to ease the development of extensions (currently in the snapshot repository as their structure is not final yet). Feedback about the module structure is welcome.
  • The architecture of the data processing engine changed to make it extensible. Workflows can be executed fully in memory, off disk, in Apache Spark, or in other execution engines if the corresponding runners are implemented. Feedback about the data model API is welcome.
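
For extension authors, the namespace change listed above can be approached with a first mechanical pass over the source files. The following Java sketch is only an illustration: the class NamespaceMigration and its migrate method are hypothetical names, and the sketch assumes that rewriting the package prefix is a useful starting point. Because the code base was also split into new modules, individual classes may have moved or changed, so the result will likely still need manual fixes.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not shipped with OpenRefine.
class NamespaceMigration {

    // Matches the old prefix only at a package boundary, so that unrelated
    // names such as "com.google.refinery" are left untouched.
    private static final Pattern OLD_PREFIX =
            Pattern.compile("(?<![\\w.])com\\.google\\.refine(?=[.;\\s]|$)");

    /** Returns the given Java source with the old package prefix replaced. */
    static String migrate(String javaSource) {
        return OLD_PREFIX.matcher(javaSource).replaceAll("org.openrefine");
    }
}
```

With this sketch, a 3.x line such as `import com.google.refine.model.Project;` is rewritten to `import org.openrefine.model.Project;`. Whether a given class still lives at the same relative path in 4.0 is not stated in these notes, so each rewritten import should be checked against the published modules.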

Documentation of the new architecture will be published soon.
