Dataverse 6.4

Please note: To read these instructions in full, please go to https://github.com/IQSS/dataverse/releases/tag/v6.4 rather than the list of releases, which will cut them off.

This release brings new features, enhancements, and bug fixes to Dataverse. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Release Highlights

New features in Dataverse 6.4:

Enhanced DataCite Metadata, including "Relation Type"
All ISO 639-3 languages are now supported
There is now a button for "Unlink Dataset"
Users will have DOIs/PIDs reserved for their files as part of file upload instead of at publication time
Datasets can now have types such as "software" or "workflow"
Croissant support
RO-Crate support
and more! Please see below.

New client library:

Rust

This release also fixes two important bugs described below and in a post on the mailing list:

"Update Current Version" can cause metadata loss
Publishing breaks designated dataset thumbnail, messes up collection page

Additional details on the above as well as many more features and bug fixes included in the release are described below. Read on!

Features Added

Enhanced DataCite Metadata, Including "Relation Type"

Within the "Related Publication" field, a new subfield called "Relation Type" has been added. It allows for the most common values recommended by DataCite: IsCitedBy, Cites, IsSupplementTo, IsSupplementedBy, IsReferencedBy, and References. For existing datasets where no "Relation Type" has been specified, "IsSupplementTo" is assumed.

Dataverse now supports the DataCite v4.5 schema. Additional metadata is now sent to DataCite, including metadata about related publications and files in the dataset. The representation of PIDs (ORCID, ROR, DOIs, etc.), license/terms, geospatial, and other metadata has also been improved. The enhanced metadata is sent to DataCite automatically when datasets are created and published. After publication, you can inspect what was sent by looking at the DataCite XML export.

The additions are in rough alignment with the OpenAIRE XML export, but there are some minor differences beyond the new Relation Type, including an update to the DataCite 4.5 schema. For details see #10632, #10615 and the design document referenced there.

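Multiple backward incompatible changes and bug fixes have been made to API calls (three of four of which were not documented) related to updating PID target URLs and metadata at the provider service. See "Pushing updated metadata to DataCite" in the upgrade instructions below for examples of these calls.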

Full List of ISO 639-3 Languages Now Supported

The controlled vocabulary values list for the metadata field "Language" in the citation block has now been extended to include roughly 7920 ISO 639-3 values.

Some of the language entries in the pre-6.4 list correspond to "macrolanguages" in ISO 639-3, and admins/users may wish to update to the corresponding individual language entries from ISO 639-3. As these cases are expected to be rare (they do not involve major world languages), finding them is not covered in the release notes. Anyone who desires help in this area is encouraged to reach out to the Dataverse community via any of the standard communication channels.

ISO 639-3 codes were downloaded from sil.org and the file used for merging with the existing citation.tsv was "iso-639-3.tab". See also #8578 and #10762.

Unlink Dataset Button

A new "Unlink Dataset" button has been added to the dataset page to allow a user to unlink a dataset from a collection. To unlink a dataset the user must have permission to link the dataset. Additionally, the existing API for unlinking datasets has been updated to no longer require superuser access as the "Publish Dataset" permission is now enough. See also #10583 and #10689.

Pre-Publish File DOI Reservation

Dataverse installations using DataCite as a persistent identifier (PID) provider (or other providers that support reserving PIDs) will be able to reserve PIDs for files when they are uploaded (rather than at publication time). Note that reserving file DOIs can slow uploads with large numbers of files so administrators may need to adjust timeouts (specifically any Apache "ProxyPass / ajp://localhost:8009/ timeout=" setting in the recommended Dataverse configuration).

Initial Support for Dataset Types

Out of the box, all datasets now have the type "dataset" but superusers can add additional types. At this time the type of a dataset can only be set at creation time via API. The types "dataset", "software", and "workflow" (just those three, for now) will be sent to DataCite (as resourceTypeGeneral) when the dataset is published.

For details see the guides, #10517 and #10694. Please note that this feature is highly experimental and is expected to evolve.

Croissant Support (Metadata Export)

A new metadata export format called Croissant is now available as an external metadata exporter. It is oriented toward making datasets consumable by machine learning.

For more about the Croissant exporter, including installation instructions, see https://github.com/gdcc/exporter-croissant. See also #10341, #10533, and discussion on the mailing list.

Please note: the Croissant exporter works best with Dataverse 6.2 and higher (where it updates the content of <head> as described in the guides) but can be used with 6.0 and higher (to get the export functionality).

RO-Crate Support (Metadata Export)

Dataverse now supports RO-Crate as a metadata export format. This functionality is not available out of the box, but you can enable one or more RO-Crate exporters from the list of external exporters. See also #10744 and #10796.

Rust API Client Library

A Dataverse API client library for the Rust programming language is now available at https://github.com/gdcc/rust-dataverse and has been added to the list of client libraries in the API Guide. See also #10758.

Collection Thumbnail Logo for Featured Collections

Collections can now have a thumbnail logo that is displayed when the collection is configured as a featured collection. If present, this thumbnail logo is shown. Otherwise, the collection logo is shown. Configuration is done under the "Theme" for a collection as explained in the guides. See also #10291 and #10433.

Saved Searches Can Be Deleted

Saved searches can now be deleted via API. See the Saved Search section of the API Guide, #9317 and #10198.

Notification Email Improvement

When notification emails are sent, the part of the closing that says "contact us for support at" will now show the support email address (dataverse.mail.support-email), when configured, instead of the default system email address. Using the system email address here was particularly problematic when it was a "noreply" address. See also #10287 and #10504.

Ability to Disable Automatic Thumbnail Selection

It is now possible to turn off the feature that automatically selects one of the image datafiles to serve as the thumbnail of the parent dataset. An admin can turn it off by enabling the feature flag dataverse.feature.disable-dataset-thumbnail-autoselect. When the feature is disabled, a user can still manually pick a thumbnail image, or upload a dedicated thumbnail image. See also #10820.

More Flexible PermaLinks

The configuration setting dataverse.pid.*.permalink.base-url, which is used for PermaLinks, has been updated to support greater flexibility. Previously, the string /citation?persistentId= was automatically appended to the configured base URL. With this update, the base URL will now be used exactly as configured, without any automatic additions. See also #10775.

Globus Async Framework

A new alternative implementation of Globus polling during upload data transfers has been added in this release. This experimental framework does not rely on the instance staying up continuously for the duration of the transfer and saves the state information about Globus upload requests in the database. See globus-use-experimental-async-framework under Feature Flags and dataverse.files.globus-monitoring-server in the Installation Guide. See also #10623 and #10781.

CVoc (Controlled Vocabulary): Allow ORCID and ROR to Be Used Together in Author Field

Changes in Dataverse and updates to the ORCID and ROR external vocabulary scripts support deploying these for the citation block author field (and others). See also #10711, #10712, and gdcc/dataverse-external-vocab-support#22.

Development on Windows

New instructions have been added for developers on Windows trying to run a Dataverse development environment using Windows Subsystem for Linux (WSL). See the guides, #10606, and #10608.

Experimental Crossref PID (DOI) Provider

Crossref can now be used as a PID (DOI) provider, but this feature is experimental. Please provide feedback through the usual channels. See also the guides, #8581, and #10806.

Improved JSON Schema Validation for Datasets

JSON Schema validation has been enhanced with checks for required and allowed child objects as well as type checking for field types including primitive, compound and controlledVocabulary. More user-friendly error messages help pinpoint the issues in the dataset JSON. See Retrieve a Dataset JSON Schema for a Collection in the API Guide, #10169, and #10543.

Counter Processor 1.05 Support (Make Data Count)

Counter Processor 1.05 is now supported for use with Make Data Count. If you are running Counter Processor, you should reinstall/reconfigure it as described in the latest guides. Note that Counter Processor 1.05 requires Python 3, so you will need to follow the full Counter Processor install. Also note that if you configure the new version the same way, it will reprocess the days in the current month when it is first run. This is normal and will not affect the metrics in Dataverse. See also #10479.

Version Tags for Container Base Images

With this release we introduce a detailed maintenance workflow for our container images. As an output of the Containerization Working Group, the community takes another step towards production-ready containers available directly from the core project.

The maintenance workflow regularly updates the Container Base Image, which contains the operating system, Java, Payara, and tools and libraries required by the Dataverse application. Shipping these rolling releases as well as immutable revisions is the foundation for secure and reliable Dataverse Application Container images. See also #10478 and #10827.

Bugs Fixed

Update Current Version

A significant bug in the superuser-only "Update Current Version" publication option was fixed. If the option was used when changes had been made to the dataset terms (rather than to dataset metadata), or if the PID provider service was down or returned an error, the update would fail, render the dataset unusable, and require restoration from a backup. The fix in this release allows the update to succeed in both of these cases and redesigns the functionality so that any unknown issues should not make the dataset unusable: the error would be reported and the dataset would remain in its current state, with the last-published version as it was and the changes still in the draft version.

If you do not plan to upgrade to Dataverse 6.4 right away, you are encouraged to alert your superusers to this issue (see this post). Here are some workarounds for pre-6.4 versions:

Change the "dataset.updateRelease" entry in the Bundle.properties file (or local language version) to "Do Not Use" or similar (this doesn't disable the button but alerts superusers to the issue), or
Edit the dataset.xhtml file to remove the lines below, delete the contents of the generated and osgi-cache directories in the Dataverse Payara domain, and restart the Payara server. This will remove the "Update Current Version" from the UI.

<c:if test="#{dataverseSession.user.isSuperuser()}">
  <f:selectItem rendered="#" itemLabel="#{bundle['dataset.updateRelease']}" itemValue="3" />
</c:if>

Again, the workarounds above are only for pre-6.4 versions. The bug has been fixed in Dataverse 6.4. See also #10797.

Broken Thumbnails

Dataverse 6.3 introduced a bug where publishing would break the dataset thumbnail, which in turn broke the rendering of the parent collection (dataverse) page.

This bug has been fixed but any existing broken thumbnails must be fixed manually. See "clearThumbnailFailureFlag" in the upgrade instructions below.

Additionally, it is now possible to turn off the feature that automatically selects one of the image datafiles to serve as the thumbnail of the parent dataset. An admin can turn it off by enabling the feature flag <jvm-options>-Ddataverse.feature.disable-dataset-thumbnail-autoselect=true</jvm-options>. When the feature is disabled, a user can still manually pick a thumbnail image, or upload a dedicated thumbnail image.

See also #10819, #10820, and the post on the mailing list.
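
As a rough sketch of how an admin might set this flag on a classic (non-container) installation, the function below passes the JVM option quoted above to Payara's asadmin create-jvm-options command. It assumes PAYARA points at the Payara installation, as in the upgrade instructions below. The function name is hypothetical, and a Payara restart is typically needed before a new JVM option takes effect.

```shell
# Assumed install location; adjust to your environment.
PAYARA=${PAYARA:-/usr/local/payara6}

# disable_thumbnail_autoselect (hypothetical helper name)
# Adds the JVM option that turns off automatic dataset thumbnail selection.
disable_thumbnail_autoselect() {
    "$PAYARA/bin/asadmin" create-jvm-options \
        "-Ddataverse.feature.disable-dataset-thumbnail-autoselect=true"
}

# Usage: run disable_thumbnail_autoselect as the Payara user, then restart Payara.
```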

No License, No Terms of Use

When datasets have neither a license nor custom terms of use, the dataset page will now indicate this. Also, these datasets will no longer be indexed as having custom terms. See also #8796, #10513, and #10614.

CC0 License Bug Fix

At a high level, some datasets have been mislabeled as "Custom License" when they should have been "CC0 1.0". This has been corrected.

In Dataverse 5.10, datasets with only "CC0 Waiver" in the "termsofuse" field were converted to "Custom License" (instead of the CC0 1.0 license) through a SQL migration script (see #10634). On deployment of Dataverse 6.4, a new SQL migration script will be run automatically to correct this, changing these datasets to CC0. You can review the script in #10634. It only affects datasets that meet all of the following conditions:

The existing "Terms of Use" must be equal to "This dataset is made available under a Creative Commons CC0 license with the following additional/modified terms and conditions: CC0 Waiver" (this was set in #10634).
The following terms fields must be empty: Confidentiality Declaration, Special Permissions, Restrictions, Citation Requirements, Depositor Requirements, Conditions, and Disclaimer.
The license ID must not be assigned.

The script will set the license ID to that of the CC0 1.0 license and remove the contents of the "termsofuse" field. See also #9081 and #10634.

Remap oai_dc Export and Harvesting Format Fields: dc:type and dc:date

The oai_dc export and harvesting format has had the following fields remapped:

dc:type was mapped to the field "Kind of Data". Now it is hard-coded to the word "Dataset".
dc:date was mapped to the field "Production Date" when available and otherwise to "Publication Date". Now it is mapped to the field "Publication Date" or the field used for the citation date, if set (see Set Citation Date Field Type for a Dataset).

For these changes to be reflected in existing datasets, reExportAll should be run (see the upgrade instructions below). See #8129 and #10737.

Zip File No Longer Misdetected as Shapefile (Hidden Directories)

When detecting file types, Dataverse would previously detect a zip file as a shapefile if it contained markers of a shapefile in hidden directories. These hidden directories are now ignored when deciding if a zip file is a shapefile or not. See also #8945 and #10627.

External Controlled Vocabulary

This release fixes a bug (introduced in v6.3) in the external controlled vocabulary mechanism that could cause indexing to fail (with a NullPointerException) when a script is configured for one child field and no other child fields were managed. See also #10869 and #10870.

Valid JSON in Error Response

When any ApiBlockingFilter policy applies to a request, the JSON in the body of the error response is now valid JSON. See also #10085.

Docker Container Base Image Security and Compatibility

Switch "wait-for" to "wait4x", aligned with the Configbaker Image
Update "jattach" to v2.2
Install AMD64 / ARM64 versions of tools as necessary
Run base image as unprivileged user by default instead of root - this was an oversight from OpenShift changes
Linux User, Payara Admin and Domain Master passwords:
- Print hints about default, public knowledge passwords in place for
- Enable replacing these passwords at container boot time
Enable building with updated Temurin JRE image based on Ubuntu 24.04 LTS
Fix entrypoint script troubles with pre- and postboot script files
Unify location of files at CONFIG_DIR=/opt/payara/config, avoid writing to other places

Cleanup of Temp Directories

In this release we addressed an issue where copies of files uploaded via the UI were left in one specific temp directory (.../domain1/uploads by default). We would like to remind all installation admins that it is strongly recommended to have automated (and aggressive) cleanup mechanisms in place for all the temp directories used by Dataverse. For example, at Harvard/IQSS we have the following configuration for the PrimeFaces uploads directory above. (Note that, even with this fix in place, PrimeFaces will leave a large number of small log files in that location.)

Instead of the default location (.../domain1/uploads) we use a directory on a dedicated partition, outside of the filesystem where Dataverse is installed, via the following JVM option:

<jvm-options>-Ddataverse.files.uploads=/uploads/web</jvm-options>

and we have a dedicated cronjob that runs every 30 minutes and deletes everything older than 2 hours in that directory:

15,45 * * * * /bin/find /uploads/web/ -mmin +119 -type f -name "upload*" -exec rm -f {} \; > /dev/null 2>&1
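
If you prefer to keep the cleanup logic in a script that cron calls, the find invocation above could be wrapped roughly as follows. This is a sketch, not what Harvard/IQSS runs: the function name and the optional age argument are hypothetical, and it relies on the -mmin test of GNU or BSD find, as the cron entry does.

```shell
# purge_old_uploads DIR [MINUTES]  (hypothetical helper name)
# Deletes regular files named upload* under DIR that were last modified
# more than MINUTES minutes ago (default 119, as in the cron entry above).
purge_old_uploads() {
    dir=$1
    minutes=${2:-119}
    # Refuse to run if the directory argument is empty or does not exist.
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    find "$dir" -type f -name 'upload*' -mmin "+$minutes" -exec rm -f {} +
}

# Usage: purge_old_uploads /uploads/web
```

Files that do not match upload* (such as the small PrimeFaces log files mentioned above) are left in place, so a separate rule may be needed for those.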

Trailing Commas in Author Name Now Permitted

When an author name ended in a comma (e.g. Smith, or Smith, ), the dataset page was broken after publishing (a "500" error page was presented to the user). The underlying issue causing the JSON-LD Schema.org output on the page to break was fixed. See #10343 and #10776.

API Updates

Search API: affiliation, parentDataverseName, image_url, etc.

The Search API (/api/search) response now includes additional fields, depending on the type.

For collections (dataverses):

"affiliation"
"parentDataverseName"
"parentDataverseIdentifier"
"image_url" (optional)

"items": [
    {
        "name": "Darwin's Finches",
        ...
        "affiliation": "Dataverse.org",
        "parentDataverseName": "Root",
        "parentDataverseIdentifier": "root",
        "image_url":"/api/access/dvCardImage/{identifier}"
(etc, etc)

For datasets:

"image_url" (optional)

"items": [
    {
        ...
        "image_url": "http://localhost:8080/api/datasets/2/logo"
        ...
(etc, etc)

For files:

"releaseOrCreateDate"
"image_url" (optional)

"items": [
    {
        "name": "test.png",
        ...
        "releaseOrCreateDate": "2016-05-10T12:53:39Z",
        "image_url":"/api/access/datafile/42?imageThumb=true"
(etc, etc)

These examples are also shown in the Search API section of the API Guide.
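
To try the new fields against your own installation, a call along these lines may help. It is a minimal sketch: the function name is hypothetical, SERVER_URL is assumed to be the base URL of your installation, and the q and type parameters are the standard Search API query parameters.

```shell
# Assumed base URL of the installation; override via the environment.
SERVER_URL=${SERVER_URL:-http://localhost:8080}

# search_items QUERY TYPE  (hypothetical helper name)
# TYPE is one of: dataverse, dataset, file.
# Prints the raw JSON response; the new fields appear on each entry of "items".
search_items() {
    curl -s -G "$SERVER_URL/api/search" \
        --data-urlencode "q=$1" \
        --data-urlencode "type=$2"
}

# Usage: search_items 'finches' dataverse
```

Using -G with --data-urlencode lets curl handle the URL encoding of the query string, so queries with spaces or quotes do not need to be escaped by hand.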

The image_url field was already part of the SolrSearchResult JSON (and incorrectly appeared in Search API documentation), but it wasn't returned by the API because it was appended only after the Solr query was executed in the SearchIncludeFragment of JSF (the old/current UI framework). Now, the field is set in SearchServiceBean, ensuring it is always returned by the API when an image is available.

The Solr schema.xml file has been updated to include a new field called "dvParentAlias" for supporting the new response field "parentDataverseIdentifier". See upgrade instructions below.

Search API: publicationStatuses

The Search API (/api/search) response will now include publicationStatuses in the JSON response as long as the list is not empty.

Example:

"items": [
    {
        "name": "Darwin's Finches",
        ...
        "publicationStatuses": [
            "Unpublished",
            "Draft"
        ],
(etc, etc)

Search Facet Information Exposed

A new endpoint /api/datasetfields/facetables lists all facetable dataset fields defined in the installation, as described in the guides.

A new optional query parameter "returnDetails" has been added to the /api/dataverses/{identifier}/facets/ endpoint to include detailed information about each DataverseFacet, as described in the guides. See also #10726 and #10727.
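
The two calls might be exercised as sketched below. The function names are hypothetical, SERVER_URL is assumed to be the base URL of your installation, and the sketch assumes returnDetails takes the value true. Unpublished collections may additionally require an API token in the X-Dataverse-key header.

```shell
# Assumed base URL of the installation; override via the environment.
SERVER_URL=${SERVER_URL:-http://localhost:8080}

# list_facetable_fields  (hypothetical helper name)
# Lists all facetable dataset fields defined in the installation.
list_facetable_fields() {
    curl -s "$SERVER_URL/api/datasetfields/facetables"
}

# collection_facets_detailed ALIAS  (hypothetical helper name)
# Lists the facets of one collection, including details for each facet.
collection_facets_detailed() {
    curl -s "$SERVER_URL/api/dataverses/$1/facets?returnDetails=true"
}

# Usage: list_facetable_fields; collection_facets_detailed root
```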

User Permissions on Collections

A new endpoint at /api/dataverses/{identifier}/userPermissions for obtaining the user permissions on a collection (dataverse) has been added. See also the guides, #10749 and #10751.
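
A call to this endpoint could look like the sketch below. The function name is hypothetical; SERVER_URL and API_TOKEN are assumed to hold the base URL of your installation and the API token of the user whose permissions you want to check, passed in the same X-Dataverse-key header used elsewhere in these notes.

```shell
# Assumed base URL of the installation; override via the environment.
SERVER_URL=${SERVER_URL:-http://localhost:8080}

# collection_user_permissions ALIAS  (hypothetical helper name)
# Prints the calling user's permissions on the given collection as JSON.
# Aborts with a message if API_TOKEN is not set.
collection_user_permissions() {
    curl -s -H "X-Dataverse-key:${API_TOKEN:?set API_TOKEN to your API token}" \
        "$SERVER_URL/api/dataverses/$1/userPermissions"
}

# Usage: API_TOKEN=xxxxxxxx collection_user_permissions root
```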

addDataverse Extended

The addDataverse (/api/dataverses/{identifier}) API endpoint has been extended to allow adding metadata blocks, input levels and facet IDs at creation time, as the Dataverse page in create mode does in JSF. See also the guides, #10633 and #10644.

Metadata Blocks and Display on Create

The /api/dataverses/{identifier}/metadatablocks endpoint has been fixed to not return fields marked as displayOnCreate=true if there is an input level with include=false, when query parameters returnDatasetFieldTypes=true and onlyDisplayedOnCreate=true are set. See also #10741 and #10767.

The fields "depositor" and "dateOfDeposit" in the citation.tsv metadata block file have been updated to have the property "displayOnCreate" set to TRUE. In practice, only the API is affected because the UI has special logic that already shows these fields when datasets are created. See also #10850 and #10884.

Feature Flags Can Be Listed

It is now possible to list all feature flags and see if they are enabled or not. See also the guides and #10732.

Settings Added

The following settings have been added:

dataverse.feature.disable-dataset-thumbnail-autoselect
dataverse.feature.globus-use-experimental-async-framework
dataverse.files.globus-monitoring-server
dataverse.pid.*.crossref.url
dataverse.pid.*.crossref.rest-api-url
dataverse.pid.*.crossref.username
dataverse.pid.*.crossref.password
dataverse.pid.*.crossref.depositor
dataverse.pid.*.crossref.depositor-email

Backward Incompatible Changes

The oai_dc export format has changed. See the "Remap oai_dc" section above.
Several APIs related to DataCite have changed. See the "Enhanced DataCite Metadata" section above.

Complete List of Changes

For the complete list of code changes in this release, see the 6.4 milestone in GitHub.

Getting Help

For help with upgrading, installing, or general questions please post to the Dataverse Community Google Group or email support@dataverse.org.

Installation

If this is a new installation, please follow our Installation Guide. Please don't be shy about asking for help if you need it!

Once you are in production, we would be delighted to update our map of Dataverse installations around the world to include yours! Please create an issue or email us at support@dataverse.org to join the club!

You are also very welcome to join the Global Dataverse Community Consortium (GDCC).

Upgrade Instructions

Upgrading requires a maintenance window and downtime. Please plan accordingly, create backups of your database, etc.

These instructions assume that you've already upgraded through all the 5.x releases and are now running Dataverse 6.3.

0. These instructions assume that you are upgrading from the immediate previous version. If you are running an earlier version, the only supported way to upgrade is to progress through the upgrades to all the releases in between before attempting the upgrade to this version.

If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo to change to that user first. For example, sudo -i -u dataverse if dataverse is your dedicated application user.

In the following commands, we assume that Payara 6 is installed in /usr/local/payara6. If not, adjust as needed.

export PAYARA=/usr/local/payara6

(or setenv PAYARA /usr/local/payara6 if you are using a csh-like shell)

1. Undeploy the previous version

$PAYARA/bin/asadmin undeploy dataverse-6.3

2. Stop and start Payara

service payara stop
service payara start

3. Deploy this version

$PAYARA/bin/asadmin deploy dataverse-6.4.war

Note: if you have any trouble deploying, stop Payara, remove the following directories, start Payara, and try to deploy again.

service payara stop
rm -rf $PAYARA/glassfish/domains/domain1/generated
rm -rf $PAYARA/glassfish/domains/domain1/osgi-cache
rm -rf $PAYARA/glassfish/domains/domain1/lib/databases
service payara start
$PAYARA/bin/asadmin deploy dataverse-6.4.war

4. For installations with internationalization:

Please remember to update translations via Dataverse language packs.

5. Restart Payara

service payara stop
service payara start

6. Update metadata blocks

These changes reflect incremental improvements made to the handling of core metadata fields.

wget https://raw.githubusercontent.com/IQSS/dataverse/v6.4/scripts/api/data/metadatablocks/citation.tsv

curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file citation.tsv

7. Update Solr schema.xml file. Start with the standard v6.4 schema.xml, then, if your installation uses any custom or experimental metadata blocks, update it to include the extra fields (step 7a).

Stop Solr (usually service solr stop, depending on Solr installation/OS, see the Installation Guide).

service solr stop

Replace schema.xml

wget https://raw.githubusercontent.com/IQSS/dataverse/v6.4/conf/solr/schema.xml
cp schema.xml /usr/local/solr/solr-9.4.1/server/solr/collection1/conf

Start Solr (but if you use any custom metadata blocks, perform the next step, 7a, first).

service solr start

7a. For installations with custom or experimental metadata blocks:

Before starting Solr, update the schema to include all the extra metadata fields that your installation uses. We do this by collecting the output of the Dataverse schema API and feeding it to the update-fields.sh script that we supply, as in the example below (modify the command lines as needed to reflect the names of the directories, if different):

wget https://raw.githubusercontent.com/IQSS/dataverse/v6.4/conf/solr/update-fields.sh
chmod +x update-fields.sh
curl "http://localhost:8080/api/admin/index/solr/schema" | ./update-fields.sh /usr/local/solr/solr-9.4.1/server/solr/collection1/conf/schema.xml

Now start Solr.

8. Reindex Solr

Below is the simplest way to reindex Solr:

curl http://localhost:8080/api/admin/index

The API above rebuilds the existing index "in place". If you want to be absolutely sure that your index is up-to-date and consistent, you may consider wiping it clean and reindexing everything from scratch (see the guides). Just note that, depending on the size of your database, a full reindex may take a while and the users will be seeing incomplete search results during that window.

9. Run reExportAll to update dataset metadata exports

This step is necessary because of changes described above for the DataCite and oai_dc export formats.

Below is the simple way to reexport all dataset metadata. For more advanced usage, please see the guides.

curl http://localhost:8080/api/admin/metadata/reExportAll

10. Pushing updated metadata to DataCite

(If you don't use DataCite, you can skip this.)

Above you updated the citation metadata block and Solr with the new "relationType" field. With these two changes, the "Relation Type" fields will be available and creation/publication of datasets will result in the expanded XML being sent to DataCite. You've also already run "reExportAll" to update the DataCite metadata export format.

Entries at DataCite for published datasets can be updated by a superuser using an API call (newly documented):

curl -X POST -H 'X-Dataverse-key:<key>' http://localhost:8080/api/datasets/modifyRegistrationPIDMetadataAll

This will loop through all published datasets (and released files with PIDs). As long as the loop completes, the call will return a 200/OK response. Any PIDs for which the update fails can be found using the following command:

grep 'Failure for id' server.log

Failures may occur if PIDs were never registered, or if they were never made findable. Any such cases can be fixed manually in DataCite Fabrica or using the Reserve a PID API call and the newly documented /api/datasets/<id>/modifyRegistration call respectively. See https://guides.dataverse.org/en/6.4/admin/dataverses-datasets.html#send-dataset-metadata-to-pid-provider. Please reach out with any questions.
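
For a quick count of failed updates after the bulk call, the grep above can be wrapped as sketched here. The function name is hypothetical, and because the layout of the log line beyond the 'Failure for id' marker is not described in these notes, the sketch only counts matching lines rather than trying to extract identifiers.

```shell
# count_pid_update_failures LOGFILE  (hypothetical helper name)
# Prints the number of lines in LOGFILE that contain 'Failure for id'.
count_pid_update_failures() {
    if [ ! -r "$1" ]; then
        echo "cannot read: $1" >&2
        return 2
    fi
    # grep exits with status 1 when nothing matches, but still prints 0.
    grep -c 'Failure for id' "$1" || true
}

# Usage (the log path is an assumption based on a default Payara layout):
#   count_pid_update_failures "$PAYARA/glassfish/domains/domain1/logs/server.log"
```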

PIDs can also be updated by a superuser on a per-dataset basis using:

curl -X POST -H 'X-Dataverse-key:<key>' http://localhost:8080/api/datasets/<id>/modifyRegistrationMetadata

Additional Upgrade Steps

11. If there are broken thumbnails

To restore any broken thumbnails caused by the bug described above, you can call the http://localhost:8080/api/admin/clearThumbnailFailureFlag API, which will attempt to clear the flag on all files (regardless of whether it was caused by this bug or some other problem with the file). Alternatively, call http://localhost:8080/api/admin/clearThumbnailFailureFlag/$FILE_ID to clear the flag for an individual file. Calling the former, batch API is recommended.
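
The two calls could be issued as in the following sketch. The function name is hypothetical, SERVER_URL is assumed to be the base URL of your installation, and the sketch assumes the endpoint is invoked with the HTTP DELETE method; please confirm the method against the Admin section of the API Guide for your version.

```shell
# Assumed base URL of the installation; override via the environment.
SERVER_URL=${SERVER_URL:-http://localhost:8080}

# clear_thumbnail_failure_flags [FILE_ID]  (hypothetical helper name)
# Without an argument, clears the flag on all files (the recommended batch call).
# With a FILE_ID, clears the flag for that file only.
clear_thumbnail_failure_flags() {
    url="$SERVER_URL/api/admin/clearThumbnailFailureFlag"
    if [ -n "${1:-}" ]; then
        url="$url/$1"
    fi
    # Assumption: the endpoint expects DELETE.
    curl -s -X DELETE "$url"
}

# Usage: clear_thumbnail_failure_flags        (all files)
#        clear_thumbnail_failure_flags 42     (a single file)
```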

12. PermaLinks with custom base-url

If you currently use PermaLinks with a custom base-url: You must manually append /citation?persistentId= to the base URL to maintain functionality.

If you use PermaLinks without a configured base-url, no changes are required.
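
To work out the value to configure, the small helper below appends the suffix to an existing base URL and leaves the value alone if the suffix is already there. It is only a sketch: the function name and the example URL are hypothetical, and it does not change any setting itself.

```shell
# Suffix that Dataverse versions before 6.4 appended automatically.
SUFFIX='/citation?persistentId='

# new_permalink_base_url OLD_BASE_URL  (hypothetical helper name)
# Prints the base-url value to configure in 6.4 so that resolved links
# keep the same form as before the upgrade.
new_permalink_base_url() {
    case "$1" in
        *"$SUFFIX") printf '%s\n' "$1" ;;             # already migrated
        *)          printf '%s%s\n' "$1" "$SUFFIX" ;; # append the old suffix
    esac
}

# Usage: new_permalink_base_url 'https://example.org'
```

The printed value is what you would then set for dataverse.pid.*.permalink.base-url, using whichever configuration mechanism your installation already uses for that setting.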
