
Dataverse Software 5.14

(If this note appears truncated on the GitHub Releases page, you can view it in full in the source tree: https://github.com/IQSS/dataverse/blob/master/doc/release-notes/5.14-release-notes.md)

This release brings new features, enhancements, and bug fixes to the Dataverse software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Please note that, as an experiment, the sections of this release note are organized in a different order. The Upgrade and Installation sections are at the top, with the detailed sections highlighting new features and fixes further down.

Installation

If this is a new installation, please see our Installation Guide. Please don't be shy about asking for help if you need it!

After your installation has gone into production, you are welcome to add it to our map of installations by opening an issue in the dataverse-installations repo.

Upgrade Instructions

0. These instructions assume that you are upgrading from 5.13. If you are running an earlier version, the only safe way to upgrade is to progress through the upgrades to all the releases in between before attempting the upgrade to 5.14.

If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo to change to that user first. For example, sudo -i -u dataverse if dataverse is your dedicated application user.

In the following commands we assume that Payara 5 is installed in /usr/local/payara5. If not, adjust as needed.

export PAYARA=/usr/local/payara5

(or setenv PAYARA /usr/local/payara5 if you are using a csh-like shell)

1. Undeploy the previous version.

  • $PAYARA/bin/asadmin undeploy dataverse-5.13

2. Stop Payara and remove the generated directory

  • service payara stop
  • rm -rf $PAYARA/glassfish/domains/domain1/generated

3. Start Payara

  • service payara start

4. Deploy this version.

  • $PAYARA/bin/asadmin deploy dataverse-5.14.war

5. Restart Payara

  • service payara stop
  • service payara start

6. Update the Citation metadata block (this update makes the Series field repeatable):

  • wget https://github.com/IQSS/dataverse/releases/download/v5.14/citation.tsv
  • curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @citation.tsv -H "Content-type: text/tab-separated-values"

If you are running an English-only installation, you are finished with the citation block. Otherwise, download the updated citation.properties file and place it in the directory configured via dataverse.lang.directory; /home/dataverse/langBundles is used in the example below.

  • wget https://github.com/IQSS/dataverse/releases/download/v5.14/citation.properties
  • cp citation.properties /home/dataverse/langBundles

7. Update the Solr schema.xml to allow multiple Series values. See the specific instructions below for installations without custom metadata blocks (7a) and for installations with custom metadata blocks (7b).

7a. For installations without custom or experimental metadata blocks:

  • Stop Solr instance (usually service solr stop, depending on Solr installation/OS, see the Installation Guide)

  • Replace schema.xml

    • cp /tmp/dvinstall/schema.xml /usr/local/solr/solr-8.11.1/server/solr/collection1/conf
  • Start Solr instance (usually service solr start, depending on Solr/OS)

7b. For installations with custom or experimental metadata blocks:

  • Stop Solr instance (usually service solr stop, depending on Solr installation/OS, see the Installation Guide)

  • There are two ways to regenerate the schema. Either collect the output of the Dataverse schema API and feed it to the update-fields.sh script that we supply, as in the example below (modify the command lines as needed):

	wget https://raw.githubusercontent.com/IQSS/dataverse/master/conf/solr/8.11.1/update-fields.sh
	chmod +x update-fields.sh
	curl "http://localhost:8080/api/admin/index/solr/schema" | ./update-fields.sh /usr/local/solr/solr-8.8.1/server/solr/collection1/conf/schema.xml

OR, alternatively, edit the following lines in your schema.xml by hand, to indicate that series and its components are now multiValued="true":

     <field name="series" type="string" stored="true" indexed="true" multiValued="true"/>
     <field name="seriesInformation" type="text_en" multiValued="true" stored="true" indexed="true"/>
     <field name="seriesName" type="text_en" multiValued="true" stored="true" indexed="true"/>
  • Restart Solr instance (usually service solr restart, depending on Solr/OS)

8. Run ReExportAll to update dataset metadata exports. Follow the directions in the Admin Guide.
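
For reference, the Admin Guide's approach uses the admin API; something like the following (adjust host and port as needed for your installation):

	curl http://localhost:8080/api/admin/metadata/reExportAll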

9. If your installation did not have :FilePIDsEnabled set, you will need to set it to true to keep file PIDs enabled:

  curl -X PUT -d 'true' http://localhost:8080/api/admin/settings/:FilePIDsEnabled

10. If your installation uses Handles as persistent identifiers (instead of DOIs): remember to upgrade your Handles service installation to a currently supported version.

Generally, Handles is known to work reliably even when running older versions that haven't been officially supported in years. We still recommend checking on your service and making sure to upgrade to a supported version (the latest version is 9.3.1, https://www.handle.net/hnr-source/handle-9.3.1-distribution.tar.gz, as of this writing). An older version may seem to be running just fine, but keep in mind that it may stop working unexpectedly at any moment, because of some incompatibility introduced in a Java rpm upgrade, or something similarly unpredictable.

Handles is also very good about backward compatibility. Meaning, in most cases you can simply stop the old version, unpack the new version from the distribution, and start it on the existing config and database files, and it'll just keep working. However, it is a good idea to keep up with the recommended format upgrades, for the sake of efficiency and to avoid unexpected surprises, should they finally decide to drop the old database format, for example. We recommend two specific things: 1) Make sure your service is using a json version of the siteinfo bundle (i.e., if you are still using siteinfo.bin, convert it to siteinfo.json and remove the binary file from the service directory) and 2) Make sure you are using the newer bdbje database format for your handles catalog (i.e., if you still have the files handles.jdb and nas.jdb in your server directory, convert them to the new format). Follow the simple conversion instructions in the file README.txt in the Handles software distribution. Make sure to stop the service before converting the files and make sure to have a full backup of the existing server directory, just in case. Do not hesitate to contact Handles support with any questions you may have, as they are very responsive and helpful.
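
As a minimal sketch of the safe sequence (the service unit name and server directory below are assumptions and will differ per installation; the actual conversion commands are described in README.txt of the Handles distribution):

	# stop the Handles service first (unit name is an assumption)
	service handle-server stop
	# take a full backup of the existing server directory (path is an assumption)
	cp -a /hs/svr_1 /hs/svr_1.backup-$(date +%F)
	# perform the siteinfo and bdbje conversions here, following the
	# instructions in README.txt of the Handles software distribution
	service handle-server start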

New JVM Options and MicroProfile Config Options

The following PID provider options are now available. See the section "Changes to PID Provider JVM Settings" below for more information.

  • dataverse.pid.datacite.mds-api-url
  • dataverse.pid.datacite.rest-api-url
  • dataverse.pid.datacite.username
  • dataverse.pid.datacite.password
  • dataverse.pid.handlenet.key.path
  • dataverse.pid.handlenet.key.passphrase
  • dataverse.pid.handlenet.index
  • dataverse.pid.permalink.base-url
  • dataverse.pid.ezid.api-url
  • dataverse.pid.ezid.username
  • dataverse.pid.ezid.password

The following MicroProfile Config options have been added as part of Signposting support. See the section "Signposting for Dataverse" below for details.

  • dataverse.signposting.level1-author-limit
  • dataverse.signposting.level1-item-limit

The following JVM options are described in the "Creating datasets with incomplete metadata through API" section below.

  • dataverse.api.allow-incomplete-metadata
  • dataverse.ui.show-validity-filter
  • dataverse.ui.allow-review-for-incomplete

The following JVM/MicroProfile setting is for External Exporters. See "Mechanism Added for Adding External Exporters" below.

  • dataverse.spi.export.directory

The following JVM/MicroProfile settings are for handling of support emails. See "Contact Email Improvements" below.

  • dataverse.mail.support-email
  • dataverse.mail.cc-support-on-contact-emails

The following JVM/MicroProfile setting is for extracting a geospatial bounding box even if S3 direct upload is enabled.

  • dataverse.netcdf.geo-extract-s3-direct-upload

Backward Incompatibilities

The following list of potential backward incompatibilities references the sections of the "Detailed Release Highlights..." portion of the document further below where the corresponding changes are explained in detail.

Using the new External Exporters framework

Care should be taken when replacing Dataverse's internal metadata export formats, as third party code, including other third party Exporters, may depend on the contents of those export formats. When replacing an existing format, remember to delete the cached metadata export files or run reExportAll so that the metadata exports of existing datasets are updated.

See "Mechanism Added for Adding External Exporters".

Publishing via API

When publishing a dataset via API, Dataverse now mirrors the UI behavior by requiring that the dataset has either a standard license configured or valid Custom Terms of Use (if allowed by the instance). Attempting to publish a dataset without either will fail with an error message.

See "Handling of license information fixed in the API" for guidance on how to ensure that datasets created or updated via native API have a license configured.

Detailed Release Highlights, New Features and Use Case Scenarios

For Dataverse developers, support for running Dataverse in Docker (experimental)

Developers can experiment with running Dataverse in Docker: (PR #9439)

This is an image developers build locally (or can pull from Docker Hub). It is not meant for production use!

To provide a complete container-based local development environment, developers can deploy a Dataverse container from
the new image in addition to other containers for necessary dependencies:
https://guides.dataverse.org/en/5.14/container/dev-usage.html

Please note that with this emerging solution we will sunset older tooling like docker-aio and docker-dcm.
We envision more testing possibilities in the future, to be discussed as part of the
Dataverse Containerization Working Group. There is no sunsetting roadmap yet, but you have been warned.
If there is some specific feature of these tools you would like to be kept, please reach out.

Indexing performance improved

Noticeable improvements in performance, especially for large datasets containing thousands of files.
Uploading files one by one to a dataset is much faster now, making it feasible to upload thousands of files in an acceptable timeframe. Not only file uploads, but all edit operations on datasets containing many files, are faster.
Performance tweaks include indexing of the datasets in the background and a reduction in the number of indexing operations needed. Furthermore, updates to the dataset no longer wait for ingest to finish. Ingest was already running in the background, but it took a lock that prevented updating the dataset and degraded performance for datasets containing many files. (PR #9558)

For installations using MDC (Make Data Count), it is now possible to display both the MDC metrics and the legacy access counts, generated before MDC was enabled.

This is enabled via the new setting :MDCStartDate that specifies the cutoff date. If a dataset has any legacy access counts collected prior to that date, those numbers will be displayed in addition to any MDC numbers recorded since then. (PR #6543)
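
As a sketch, this is a regular database setting set via the admin settings API, as elsewhere in these notes; the cutoff date below is illustrative:

	curl -X PUT -d '2019-10-01' http://localhost:8080/api/admin/settings/:MDCStartDate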

Changes to PID Provider JVM Settings

In preparation for a future feature to use multiple PID providers at the same time, all JVM settings for PID providers
have been enabled to be configured using MicroProfile Config. In the same go, they were renamed to match the name
of the provider to be configured.

Please watch your log files for deprecation warnings. Your old settings will be picked up, but you should migrate
to the new names to avoid unnecessary log clutter and get prepared for more future changes. An example message
looks like this:

[#|2023-03-31T16:55:27.992+0000|WARNING|Payara 5.2022.5|edu.harvard.iq.dataverse.settings.source.AliasConfigSource|_ThreadID=30;_ThreadName=RunLevelControllerThread-1680281704925;_TimeMillis=1680281727992;_LevelValue=900;|
   Detected deprecated config option doi.username in use. Please update your config to use dataverse.pid.datacite.username.|#]

Here is a list of the new settings:

  • dataverse.pid.datacite.mds-api-url
  • dataverse.pid.datacite.rest-api-url
  • dataverse.pid.datacite.username
  • dataverse.pid.datacite.password
  • dataverse.pid.handlenet.key.path
  • dataverse.pid.handlenet.key.passphrase
  • dataverse.pid.handlenet.index
  • dataverse.pid.permalink.base-url
  • dataverse.pid.ezid.api-url
  • dataverse.pid.ezid.username
  • dataverse.pid.ezid.password

See also https://guides.dataverse.org/en/5.14/installation/config.html#persistent-identifiers-and-publishing-datasets (multiple PRs: #8823 #8828)
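
For example, migrating the DataCite username from the deprecation warning above might look like the following (a sketch; the value is illustrative, and asadmin delete-jvm-options must match the exact current value):

	$PAYARA/bin/asadmin delete-jvm-options '-Ddoi.username=olduser'
	$PAYARA/bin/asadmin create-jvm-options '-Ddataverse.pid.datacite.username=olduser'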

Signposting for Dataverse

This release adds Signposting support to Dataverse to improve machine discoverability of datasets and files. (PR #8424)

The following MicroProfile Config options are now available (these can be treated as JVM options):

  • dataverse.signposting.level1-author-limit
  • dataverse.signposting.level1-item-limit

Signposting is described in more detail in a new page in the Admin Guide on discoverability: https://guides.dataverse.org/en/5.14/admin/discoverability.html
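
To see it in action, you can inspect the Link header now emitted on dataset landing pages; a sketch, where the host and DOI are placeholders:

	curl -s -D - -o /dev/null "http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/EXAMPLE" | grep -i '^link'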

Permalinks support

Dataverse now optionally supports PermaLinks, a type of persistent identifier that does not involve a global registry service. PermaLinks are appropriate for Intranet deployment and catalog use cases. (PR #8674)

Creating datasets with incomplete metadata through API

It is now possible to create a dataset with some nominally mandatory metadata fields left unpopulated. For details on the use case that led to this feature, see issue #8822 and PR #8940.

The create dataset API call (POST to /api/dataverses/#dataverseId/datasets) has been extended with a "doNotValidate" parameter. However, in order to create a dataset with incomplete metadata, the Solr configuration must first be updated with the new "schema.xml" file (do not forget to run the metadata fields update script if you use custom metadata). Reindexing is optional but recommended; even when this feature is not used, it is recommended to update the Solr configuration and reindex the metadata. Finally, the feature itself is activated with the "dataverse.api.allow-incomplete-metadata" JVM option.
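
A minimal sketch of such a call (the API token, collection alias, and JSON file below are placeholders):

	export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
	curl -H "X-Dataverse-key:$API_TOKEN" -X POST \
	  "http://localhost:8080/api/dataverses/root/datasets?doNotValidate=true" \
	  -H 'Content-type:application/json' --upload-file dataset-incomplete.json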

You can also enable a valid/incomplete metadata filter in the "My Data" page using the "dataverse.ui.show-validity-filter" JVM option. By default, this filter is not shown. When you wish to use this filter, you must reindex the datasets first, otherwise datasets with valid metadata will not be shown in the results.

It is not possible to publish datasets with incomplete metadata. By default, you also cannot send such datasets for review. If you wish to enable sending datasets with incomplete metadata for review, turn on the "dataverse.ui.allow-review-for-incomplete" JVM option.
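
These are regular JVM options (also settable as MicroProfile Config env vars); a sketch of enabling them with asadmin:

	$PAYARA/bin/asadmin create-jvm-options '-Ddataverse.api.allow-incomplete-metadata=true'
	$PAYARA/bin/asadmin create-jvm-options '-Ddataverse.ui.show-validity-filter=true'
	$PAYARA/bin/asadmin create-jvm-options '-Ddataverse.ui.allow-review-for-incomplete=true'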

In order to customize the wording and add translations to the UI sections extended by this feature, you can edit the "Bundle.properties" file and the localized versions of that file. The property keys used by this feature are:

  • incomplete
  • valid
  • dataset.message.incomplete.warning
  • mydataFragment.validity
  • dataverses.api.create.dataset.error.mustIncludeAuthorName

Registering PIDs (DOIs or Handles) for files in select collections

It is now possible to configure registering PIDs for files in individual collections.

For example, registration of PIDs for files can be enabled in a specific collection when it is disabled instance-wide. Or it can be disabled in specific collections where it is enabled by default. See the :FilePIDsEnabled section of the Configuration guide for details. (PR #9614)

Mechanism Added for Adding External Exporters

It is now possible for third parties to develop and share code to provide new metadata export formats for Dataverse. Export formats can be made available via the Dataverse UI and API or configured for use in Harvesting. Dataverse now provides developers with a separate dataverse-spi JAR file that contains the Java interfaces and classes required to create a new metadata Exporter. Once a new Exporter has been created and packaged as a JAR file, administrators can use it by specifying a local directory for third party Exporters, dropping the Exporter JAR there, and restarting Payara. This mechanism also allows new Exporters to replace any of Dataverse's existing metadata export formats. (PR #9175) See also https://guides.dataverse.org/en/5.14/developers/metadataexport.html

Backward Incompatibilities

Care should be taken when replacing Dataverse's internal metadata export formats, as third party code, including other third party Exporters, may depend on the contents of those export formats. When replacing an existing format, remember to delete the cached metadata export files or run reExportAll so that the metadata exports of existing datasets are updated.

New JVM/MicroProfile Settings

dataverse.spi.export.directory - specifies a directory, readable by the Dataverse server. Any Exporter JAR files placed in this directory will be read by Dataverse and used to add/replace the specified metadata format.
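
A sketch of the deployment steps (the directory path and JAR name below are illustrative):

	mkdir -p /srv/dataverse/exporters
	cp MyExporter-1.0.jar /srv/dataverse/exporters/
	$PAYARA/bin/asadmin create-jvm-options '-Ddataverse.spi.export.directory=/srv/dataverse/exporters'
	# restart Payara so the new Exporter is picked up
	service payara stop && service payara start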

Contact Email Improvements

Email sent from the contact forms to the contact(s) for a collection, dataset, or datafile can now optionally be cc'd to a support email address. The support email address can be changed from the default :SystemEmail address to a separate :SupportEmail address. When multiple contacts are listed, the system will now send one email to all contacts (with the optional cc if configured) instead of separate emails to each contact. Contact names with a comma that refer to Organizations will no longer have the name parts reversed in the email greeting. A new protected/admin feedback API has been added. (PR #9186) See https://guides.dataverse.org/en/5.14/api/native-api.html#send-feedback-to-contact-s

New JVM/MicroProfile Settings

dataverse.mail.support-email - allows a separate email address, distinct from the :SystemEmail, to be used as the To: address in emails from the contact form / feedback API.
dataverse.mail.cc-support-on-contact-emails - include the support email address as a CC: entry when contact/feedback emails are sent to the contacts for a collection, dataset, or datafile.
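
For example (a sketch; the address below is a placeholder):

	$PAYARA/bin/asadmin create-jvm-options '-Ddataverse.mail.support-email=support@example.edu'
	$PAYARA/bin/asadmin create-jvm-options '-Ddataverse.mail.cc-support-on-contact-emails=true'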

Support for Grouping Dataset Files by Folder and Category Tag

Dataverse now supports grouping dataset files by folder and/or optionally by Tag/Category. The default for whether to order by folder can be changed via :OrderByFolder. Ordering by category must be enabled by an administrator via the :CategoryOrder parameter, which is used to specify which tags appear first (e.g. to put Documentation files before Data or Code files, etc.). These group-by options work with the existing sort options, i.e. sorting alphabetically means that files within each folder or tag group will be sorted alphabetically. :AllowUserManagementOfOrder can be set to true to allow users to turn folder ordering and category ordering (if enabled) on or off in the current dataset view. (PR #9204)

New Settings

:CategoryOrder - a comma separated list of Category/Tag names defining the order in which files with those tags should be displayed. The setting can include custom tag names along with the pre-defined defaults (Documentation, Data, and Code, which can be overridden by the :FileCategories setting).
:OrderByFolder - defaults to true - whether to group files in the same folder together
:AllowUserManagementOfOrder - defaults to false - whether to allow users to toggle ordering on/off in the dataset display
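
Like other database settings, these are set via the admin settings API; a sketch with illustrative values:

	curl -X PUT -d 'Documentation,Data,Code' http://localhost:8080/api/admin/settings/:CategoryOrder
	curl -X PUT -d 'true' http://localhost:8080/api/admin/settings/:OrderByFolder
	curl -X PUT -d 'true' http://localhost:8080/api/admin/settings/:AllowUserManagementOfOrder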

Metadata field Series now repeatable

This enhancement allows depositors to define multiple instances of the metadata field Series in the Citation Metadata block.

Data contained in a dataset may belong to multiple series. Making the field repeatable makes it possible to reflect this fact in the dataset metadata. (PR #9256)

Guides in PDF Format

An experimental version of the guides in PDF format is available at http://preview.guides.gdcc.io/_/downloads/en/develop/pdf/ (PR #9474)

Advice for anyone who wants to help improve the PDF is available at https://guides.dataverse.org/en/5.14/developers/documentation.html#pdf-version-of-the-guides

Datasets API extended

The following APIs have been added: (PR #9592)

  • /api/datasets/summaryFieldNames
  • /api/datasets/privateUrlDatasetVersion/{privateUrlToken}
  • /api/datasets/privateUrlDatasetVersion/{privateUrlToken}/citation
  • /api/datasets/{datasetId}/versions/{version}/citation
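
For example, to retrieve the citation string for a specific published dataset version (the dataset id and version number below are placeholders):

	curl "http://localhost:8080/api/datasets/24/versions/1.0/citation"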

Extra fields included in the JSON metadata

The following fields are now available in the native JSON output:

  • alternativePersistentId
  • publicationDate
  • citationDate

(PR #9657)

Files downloaded from Binder are now in their original format.

For example, data.dta (a Stata file) will be downloaded instead of data.tab (the archival version Dataverse creates as part of a successful ingest). (PR #9483)

This should make it easier to write code to reproduce results, as the dataset authors and subsequent researchers are likely operating on the original file format rather than the format that Dataverse creates.

For details, see #9374, jupyterhub/repo2docker#1242, and jupyterhub/repo2docker#1253.

Handling of license information fixed in the API

(PR #9568)

When publishing a dataset via API, Dataverse now requires the dataset to either have a standard license configured, or have valid Custom Terms of Use (if allowed by the instance). Attempting to publish a dataset without either will fail with an error message. This introduces a backward incompatibility: if you have scripts that automatically create, update, and publish datasets, this last step may start failing. Unfortunately, there were some problems with the datasets APIs that made it difficult to manage licenses, so an API user was likely to end up with a dataset missing both of the above. In this release we have addressed this by making the following fixes:

We fixed the incompatibility between the format in which license information was exported in JSON and the format the create and update APIs expected for import (#9155). This means that the following JSON format can now be imported:

"license": {
   "name": "CC0 1.0",
   "uri": "http://creativecommons.org/publicdomain/zero/1.0"
}

However, for the sake of backward compatibility the old format

"license" : "CC0 1.0"

will be accepted as well.

We have added the default license (CC0) to the model JSON file that we provide and recommend as a starting point in the Native API Guide (#9364).

And we have corrected the misleading language in the same guide, where we used to recommend that users select, edit, and re-import only the .metadataBlocks fragment of the JSON metadata representing the latest version. There are in fact other useful pieces of information that need to be preserved in the update (such as the "license" section above). The recommended way of creating a base JSON for updates via the API is therefore to select everything but the "files" section, with (for example) the following jq command:

jq '.data | del(.files)'

Please see the Update Metadata For a Dataset section of our Native Api guide for more information.
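
Putting it together, a sketch of the update workflow against the documented native API (the API token and PID below are placeholders):

	export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
	export PID=doi:10.5072/FK2/EXAMPLE
	# export the latest version, strip the files section, keep everything else
	curl -H "X-Dataverse-key:$API_TOKEN" \
	  "http://localhost:8080/api/datasets/:persistentId/versions/:latest?persistentId=$PID" \
	  | jq '.data | del(.files)' > dataset-update.json
	# edit dataset-update.json, then re-import it as the draft version
	curl -H "X-Dataverse-key:$API_TOKEN" -X PUT \
	  "http://localhost:8080/api/datasets/:persistentId/versions/:draft?persistentId=$PID" \
	  --upload-file dataset-update.json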

New External Tool Type and Implementation

With this release a new experimental external tool type has been added to the Dataverse Software. The tool type is "query" and its first implementation is an experimental tool named Ask the Data which allows users to ask natural language queries of tabular files in Dataverse. More information is available in the External Tools section of the guides. (PR #9737) See https://guides.dataverse.org/en/5.14/admin/external-tools.html#file-level-query-tools

Default Value for File PIDs registration has changed

The default for whether PIDs are registered for files or not is now false.

Installations where file PIDs were enabled by default will have to add the :FilePIDsEnabled = true setting to maintain the existing functionality.

See Step 9 of the upgrade instructions:

If your installation did not have :FilePIDsEnabled set, you will need to set it to true to keep file PIDs enabled:

curl -X PUT -d 'true' http://localhost:8080/api/admin/settings/:FilePIDsEnabled

It is now possible to allow File PIDs to be enabled/disabled per collection. See the :AllowEnablingFilePIDsPerCollection section of the Configuration guide for details.

For example, registration of PIDs for files can now be enabled in a specific collection when it is disabled instance-wide. Or it can be disabled in specific collections where it is enabled by default.

Changes and fixes in this release not already mentioned above include:

  • An endpoint for deleting a file has been added to the native API: https://guides.dataverse.org/en/5.14/api/native-api.html#deleting-files (PR #9383)
  • A date column has been added to the restricted file access request overview, indicating when the earliest request by that user was made. An issue was fixed where the request list was not updated when a request was approved or rejected. (PR #9257)
  • Changes made in v5.13 and v5.14 in multiple PRs to improve the embedded Schema.org metadata in dataset pages will only be propagated to the Schema.org JSON-LD metadata export if a reExportAll is done. (PR #9102)
  • It is now possible to write external vocabulary scripts that target a single child field in a metadata block. Example scripts are now available at https://github.com/gdcc/dataverse-external-vocab-support that can be configured to support lookup from the Research Organization Registry (ROR) for the Author Affiliation field and from the CrossRef Funder Registry (FundRef) in the Funding Information/Agency field, both in the standard Citation metadata block. Application of these scripts to other fields, and the development of other scripts targeting child fields, are now possible. (PR #9402)
  • Dataverse now supports requiring a secret key to add or edit metadata in specified "system" metadata blocks. Changing the metadata in such system metadata blocks is not allowed without the key and is currently only allowed via API. (PR #9388)
  • An attempt will be made to extract a geospatial bounding box (west, south, east, north) from NetCDF and HDF5 files and then insert these values into the geospatial metadata block, if enabled. (#9541) See https://guides.dataverse.org/en/5.14/user/dataset-management.html#geospatial-bounding-box
  • A file previewer called H5Web is now available for exploring and visualizing NetCDF and HDF5 files. (PR #9600) See https://guides.dataverse.org/en/5.14/user/dataset-management.html#h5web-previewer
  • Two file previewers for GeoTIFF and Shapefiles are now available for visualizing GeoTIFF image files and zipped Shapefiles on a map. See https://github.com/gdcc/dataverse-previewers
  • A new alternative to set up the Dataverse dependencies for the development environment through Docker Compose. (PR #9417)
  • A new alternative, explained in the documentation, to build the Sphinx guides through a Docker container. (PR #9417)
  • A container has been added called "configbaker" that configures Dataverse while running in containers. This allows developers to spin up Dataverse with a single command. (PR #9574)
  • Direct upload via the Dataverse UI will now support any algorithm configured via the :FileFixityChecksumAlgorithm setting. External apps using the direct upload API can now query Dataverse to discover which algorithm should be used. Sites that have been using an algorithm other than MD5 with direct upload and/or dvwebloader may want to use the /api/admin/updateHashValues call (see https://guides.dataverse.org/en/5.14/installation/config.html?highlight=updatehashvalues#filefixitychecksumalgorithm) to replace any MD5 hashes on existing files. (PR #9482)
  • The OAI_ORE metadata export (and hence the archival Bag for a dataset) now includes information about file embargoes. (PR #9698)
  • The DatasetFieldType attribute "displayFormat" is now returned by the API. (PR #9668)
  • An API named "MyData" has been available for years but is newly documented. It is used to get a list of the objects (datasets, collections or datafiles) that an authenticated user can modify. (PR #9596)
  • A Go client library for Dataverse APIs is now available. See https://guides.dataverse.org/en/5.14/api/client-libraries.html
  • A feature flag called "api-session-auth" has been added temporarily to aid in the development of the new frontend (#9063) but will be removed once bearer tokens (#9229) have been implemented. There is a security risk (CSRF) in enabling this flag! Do not use it in production! For more information, see https://guides.dataverse.org/en/5.14/installation/config.html#feature-flags
  • A feature flag called "api-bearer-auth" has been added. This allows OIDC useraccounts to send authenticated API requests using Bearer Tokens. Note: This feature is limited to OIDC! For more information, see https://guides.dataverse.org/en/5.14/installation/config.html#feature-flags (PR #9591)

Complete List of Changes

For the complete list of code changes in this release, see the 5.14 milestone on GitHub.
