Dataverse Software 5.12

This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Release Highlights

Support for Globus

Globus can be used to transfer large files. Part of "Harvard Data Commons Additions" below.

Support for Remote File Storage

Dataset files can be stored at remote URLs. Part of "Harvard Data Commons Additions" below.

New Computational Workflow Metadata Block

The new Computational Workflow metadata block will allow depositors to effectively tag datasets as computational workflows.

To add the new metadata block, follow the instructions in the Admin Guide: https://guides.dataverse.org/en/5.12/admin/metadatacustomization.html

The location of the new metadata block tsv file is scripts/api/data/metadatablocks/computational_workflow.tsv. Part of "Harvard Data Commons Additions" below.

Support for Linked Data Notifications (LDN)

Linked Data Notifications (LDN) is a standard from the W3C. Part of "Harvard Data Commons Additions" below.

Harvard Data Commons Additions

As reported at the 2022 Dataverse Community Meeting, the Harvard Data Commons project has supported a wide range of additions to the Dataverse software that improve support for Big Data, Workflows, Archiving, and interaction with other repositories. In many cases, these additions build upon features developed within the Dataverse community by Borealis, DANS, QDR, TDL, and others. Highlights from this work include:

Initial support for Globus file transfer to upload to and download from a Dataverse managed S3 store. The current implementation disables file restriction and embargo on Globus-enabled stores.
Initial support for Remote File Storage. This capability, enabled via a new RemoteOverlay store type, allows a file stored in a remote system to be added to a dataset (currently only via API) with download requests redirected to the remote system. Use cases include referencing public files hosted on external web servers as well as support for controlled access managed by Dataverse (e.g. via restricted and embargoed status) and/or by the remote store.
Initial support for computational workflows, including a new metadata block and detected filetypes.
Support for archiving to any S3 store using Dataverse's RDA-conformant BagIT file format (a BagPack).
Improved error handling and performance in archival bag creation and new options such as only supporting archiving of one dataset version.
Additions/corrections to the OAI-ORE metadata format (which is included in archival bags) such as referencing the name/mimetype/size/checksum/download URL of the original file for ingested files, the inclusion of metadata about the parent collection(s) of an archived dataset version, and use of the URL form of PIDs.
Display of archival status within the dataset page versions table, richer status options including success, pending, and failure states, with a complete API for managing archival status.
Support for batch archiving via API as an alternative to the current options of configuring archiving upon publication or archiving each dataset version manually.
Initial support for sending and receiving Linked Data Notification messages indicating relationships between a dataset and external resources (e.g. papers or other dataset) that can be used to trigger additional actions, such as the creation of a back-link to provide, for example, bi-directional linking between a published paper and a Dataverse dataset.
A new capability to provide custom per field instructions in dataset templates
The following file extensions are now detected:
- wdl=text/x-workflow-description-language
- cwl=text/x-computational-workflow-language
- nf=text/x-nextflow
- Rmd=text/x-r-notebook
- rb=text/x-ruby-script
- dag=text/x-dagman

Improvements to Fields that Appear in the Citation Metadata Block

Grammar, style and consistency improvements have been made to the titles, tooltip description text, and watermarks of metadata fields that appear in the Citation metadata block.

This includes fields that dataset depositors can edit in the Citation Metadata accordion (i.e. fields controlled by the citation.tsv and citation.properties files) and fields whose values are system-generated, such as the Dataset Persistent ID, Previous Dataset Persistent ID, and Publication Date fields whose titles and tooltips are configured in the bundles.properties file.

The changes should provide clearer information to curators, depositors, and people looking for data about what the fields are for.

A new page in the Style Guides called "Text" has also been added. The new page includes a section called "Metadata Text Guidelines" with a link to a Google Doc where the guidelines are being maintained for now since we expect them to be revised frequently.

New Static Search Facet: Metadata Types

A new static search facet has been added to the search side panel. This new facet is called "Metadata Types" and is driven from metadata blocks. When a metadata field value is inserted into a dataset, an entry for the metadata block it belongs to is added to this new facet.

This new facet needs to be configured for it to appear on the search side panel. The configuration assigns to a dataverse what metadata blocks to show. The configuration is inherited by child dataverses.

To configure the new facet, use the Metadata Block Facet API: https://guides.dataverse.org/en/5.12/api/native-api.html#set-metadata-block-facet-for-a-dataverse-collection

Broader MicroProfile Config Support for Developers

As of this release, many JVM options
can be set using any MicroProfile Config Source.

Currently this change is only relevant to developers but as settings are migrated to the new "lookup" pattern documented in the Consuming Configuration section of the Developer Guide, anyone installing the Dataverse software will have much greater flexibility when configuring those settings, especially within containers. These changes will be announced in future releases.

Please note that an upgrade to Payara 5.2021.8 or higher is required to make use of this. Payara 5.2021.5 threw exceptions, as explained in PR #8823.

HTTP Range Requests: New HTTP Status Codes and Headers for Datafile Access API

The Basic File Access resource for datafiles (/api/access/datafile/$id) was slightly modified in order to comply better with the HTTP specification for range requests.

If the request contains a "Range" header:

The returned HTTP status is now 206 (Partial Content) instead of 200
A "Content-Range" header is returned containing information about the returned bytes
An "Accept-Ranges" header with value "bytes" is returned

CORS rules/headers were modified accordingly:

The "Range" header is added to "Access-Control-Allow-Headers"
The "Content-Range" and "Accept-Ranges" header are added to "Access-Control-Expose-Headers"

This new functionality has enabled a Zip Previewer and file extractor for zip files, an external tool.

File Type Detection When File Has No Extension

File types are now detected based on the filename when the file has no extension.

The following filenames are now detected:

Makefile=text/x-makefile
Snakemake=text/x-snakemake
Dockerfile=application/x-docker-file
Vagrantfile=application/x-vagrant-file

These are defined in MimeTypeDetectionByFileName.properties.

Upgrade to Payara 5.2022.3 Highly Recommended

With lots of bug and security fixes included, we encourage everyone to upgrade to Payara 5.2022.3 as soon as possible. See below for details.

Major Use Cases and Infrastructure Enhancements

Changes and fixes in this release include:

Administrators can configure an S3 store used in Dataverse to support users uploading/downloading files via Globus File Transfer. (PR #8891)
Administrators can configure a RemoteOverlay store to allow files that remain hosted by a remote system to be added to a dataset. (PR #7325)
Administrators can configure the Dataverse software to send archival Bag copies of published dataset versions to any S3-compatible service. (PR #8751)
Users can see information about a dataset's parent collection(s) in the OAI-ORE metadata export. (PR #8770)
Users and administrators can now use the OAI-ORE metadata export to retrieve and assess the fixity of the original file (for ingested tabular files) via the included checksum. (PR #8901)
Archiving via RDA-conformant Bags is more robust and is more configurable. (PR #8773, #8747, #8699, #8609, #8606, #8610)
Users and administrators can see the archival status of the versions of the datasets they manage in the dataset page version table. (PR #8748, #8696)
Administrators can configure messaging between their Dataverse installation and other repositories that may hold related resources or services interested in activity within that installation. (PR #8775)
Collection managers can create templates that include custom instructions on how to fill out specific metadata fields.
Dataset update API users are given more information when the dataset they are updating is out of compliance with Terms of Access requirements (Issue #8859)
Adds a new setting (:ControlledVocabularyCustomJavaScript) that allows a JavaScript file to be loaded into the dataset page for the purpose of showing controlled vocabulary as a list (Issue #8722)
Fixes an issue with the Redetect File Type API (Issue #7527)
Terms of Use is now imported when using DDI format through harvesting or the native API. (Issue #8715, PR #8743)
Optimizes some code to improve application memory usage (Issue #8871)
Fixes sample data to reflect custom licenses.
Fixes the Archival Status Input API (available to superusers) (Issue #8924)
Small bugs have been fixed in the dataset export in the JSON and DDI formats; eliminating the export of "undefined" as a metadata language in the former, and a duplicate keyword tag in the latter. (Issue #8868)

New DB Settings

The following DB settings have been added:

:ShibAffiliationOrder - Select the first or last entry in an Affiliation array
:ShibAffiliationSeparator (default: ";") - Set the separator for the Affiliation array
:LDNMessageHosts
:GlobusBasicToken
:GlobusEndpoint
:GlobusStores
:GlobusAppUrl
:GlobusPollingInterval
:GlobusSingleFileTransfer
:S3ArchiverConfig
:S3ArchiverProfile
:DRSArchiverConfig
:ControlledVocabularyCustomJavaScript

See the Database Settings section of the Guides for more information.

Notes for Dataverse Installation Administrators

Enabling Experimental Capabilities

Several of the capabilities introduced in v5.12 are "experimental" in the sense that further changes and enhancements to these capabilities should be expected and that these changes may involve additional work, for those who use the initial implementations, when upgrading to newer versions of the Dataverse software. Administrators wishing to use them are encouraged to stay in touch, e.g. via the Dataverse Community Slack space, to understand the limits of current capabilities and to plan for future upgrades.

Notes for Developers and Integrators

See the "Backward Incompatibilities" section below.

Backward Incompatibilities

OAI-ORE and Archiving Changes

The Admin API call to manually submit a dataset version for archiving has changed to require POST instead of GET and to have a name making it clearer that archiving is being done for a given dataset version: /api/admin/submitDatasetVersionToArchive.

Earlier versions of the archival bags included the ingested (tab-separated-value) version of tabular files while providing the checksum of the original file (Issue #8449). This release fixes that by including the original file and its metadata in the archival bag. This means that archival bags created prior to this version do not include a way to validate ingested files. Further, it is likely that capabilities in development (i.e. as part of the Dataverse Uploader to allow re-creation of a dataset version from an archival bag will only be fully compatible with archival bags generated by a Dataverse instance at a release > v5.12. (Specifically, at a minimum, since only the ingested file is included in earlier archival bags, an upload via DVUploader would not result in the same original file/ingested version as in the original dataset.) Administrators should be aware that re-creating archival bags, i.e. via the new batch archiving API, may be advisable now and will be recommended at some point in the future (i.e. there will be a point where we will start versioning archival bags and will start maintaining backward compatibility for older versions as part of transitioning this from being an experimental capability).

Installation

If this is a new installation, please see our Installation Guide. Please also contact us to get added to the Dataverse Project Map if you have not done so already.

Upgrade Instructions

0. These instructions assume that you've already successfully upgraded from Dataverse Software 4.x to Dataverse Software 5 following the instructions in the Dataverse Software 5 Release Notes. After upgrading from the 4.x series to 5.0, you should progress through the other 5.x releases before attempting the upgrade to 5.10.

If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo to change to that user first. For example, sudo -i -u dataverse if dataverse is your dedicated application user.

In the following commands we assume that Payara 5 is installed in /usr/local/payara5. If not, adjust as needed.

Instructions for Upgrading to Payara 5.2022.3

Note: with the approaching EOL for the Payara 5 Community release train it's likely we will switch to a
yet-to-be-released Payara 6 in the not-so-far-away future.

We recommend you ensure you followed all update instructions from the past releases regarding Payara.
(latest Payara update was for v5.6)

Upgrading requires a maintenance window and downtime. Please plan ahead, create backups of your database, etc.

The steps below are a simple matter of reusing your existing domain directory with the new distribution.
But we also recommend that you review the Payara upgrade instructions as it could be helpful during any troubleshooting:
Payara Release Notes

Please note that the deletion of the lib/databases directory below is only required once, for this upgrade (see Issue #8230 for details).

export PAYARA=/usr/local/payara5

(or setenv PAYARA /usr/local/payara5 if you are using a csh-like shell)

1. Undeploy the previous version

    $PAYARA/bin/asadmin list-applications
    $PAYARA/bin/asadmin undeploy dataverse<-version>

2. Stop Payara

    service payara stop
    rm -rf $PAYARA/glassfish/domains/domain1/generated
    rm -rf $PAYARA/glassfish/domains/domain1/osgi-cache
    rm -rf $PAYARA/glassfish/domains/domain1/lib/databases

3. Move the current Payara directory out of the way

    mv $PAYARA $PAYARA.MOVED

4. Download the new Payara version (5.2022.3), and unzip it in its place

5. Replace the brand new payara/glassfish/domains/domain1 with your old, preserved domain1

6. Start Payara

    service payara start

7. Deploy this version.

    $PAYARA/bin/asadmin deploy dataverse-5.12.war

8. Restart payara

    service payara stop
    service payara start

Additional Upgrade Steps

Update the Citation metadata block:

wget https://github.com/IQSS/dataverse/releases/download/v5.12/citation.tsv
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @citation.tsv -H "Content-type: text/tab-separated-values"
Run ReExportAll to update metadata files (OAI_ORE, JSON and DDI formats are affected by the changes and bug fixes in this release; PRs #8770 and #8868). Optionally, for those using the Dataverse software's BagIt-based archiving, re-archive dataset versions archived using prior versions of the Dataverse software. This will be recommended/required in a future release.

IQSS/dataverse v5.12 on GitHub