github IQSS/dataverse v6.3

5 days ago

Dataverse 6.3

Summary

  • Search performance improvements, especially for large installations
  • Data files can have retention periods
  • Various feature improvements and bug fixes

Please note: To read these instructions in full, please go to https://github.com/IQSS/dataverse/releases/tag/v6.3 rather than the list of releases, which will cut them off.

This release brings new features, enhancements, and bug fixes to Dataverse. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Table of Contents

  • Release Highlights
  • Features
  • Bug Fixes
  • API
  • Settings
  • Complete List of Changes
  • Getting Help
  • Upgrade instructions

Release Highlights

Solr Search and Indexing Improvements

Multiple improvements have been made to the way Solr indexing and searching is done. Response times should be significantly improved.

  • Two experimental features flag called "add-publicobject-solr-field" and "avoid-expensive-solr-join" have been added to change how Solr documents are indexed for public objects and how Solr queries are constructed to accommodate access to restricted content (drafts, etc.). It is hoped that it will help with performance, especially on large instances and under load.

  • Before the search feature flag ("avoid-expensive...") can be turned on, the indexing flag must be enabled, and a full reindex performed. Otherwise publicly available objects are NOT going to be shown in search results.

  • A feature flag called "reduce-solr-deletes" has been added to improve how datafiles are indexed. When the flag is enabled, Dataverse will avoid pre-emptively deleting existing Solr documents for the files prior to sending updated information. This
    should improve performance and will allow additional optimizations going forward.

  • The /api/admin/index/status and /api/admin/index/clear-orphans calls
    (see https://guides.dataverse.org/en/latest/admin/solr-search-index.html#index-and-database-consistency)
    will now find and remove (respectively) additional permissions related Solr documents that were not being detected before.
    Reducing the overall number of documents will improve Solr performance and large sites may wish to periodically call the "clear-orphans" API.

  • Dataverse now relies on the autoCommit and autoSoftCommit settings in the Solr configuration instead of explicitly committing documents to the Solr index. This improves indexing speed.

See also #10554, #10654, and #10579.

File Retention Period

Dataverse now supports file-level retention periods. The ability to set retention periods, with a minimum duration (in months), can be configured by a Dataverse installation administrator. For more information, see the Retention Periods section of the User Guide.

  • Users can configure a specific retention period, defined by an end date and a short reason, on a set of selected files or an individual file, by selecting the "Retention Period" menu item and entering information in a popup dialog. Retention periods can only be set, changed, or removed before a file has been published. After publication, only Dataverse installation administrators can make changes, using an API.

  • After the retention period expires, files can not be previewed or downloaded (as if restricted, with no option to allow access requests). The file (landing) page and all the metadata remains available.

↑ Table of Contents

Features

Large Datasets Improvements

For scenarios involving API calls related to large datasets (numerous files, for example: ~10k) the following have been been optimized:

  • The Search API endpoint.
  • The permission checking logic present in PermissionServiceBean.

See also #10415.

Improved Controlled Vocabulary for Citation Block

The Controlled Vocabuary Values list for the "Language" metadata field in the citation block has been improved, with some missing two- and three-letter ISO 639 codes added, as well as more alternative names for some of the languages, making all these extra language identifiers importable. See also #8243.

Updates on Support for External Vocabulary Services

Multiple extensions of the external vocabulary mechanism have been added. These extensions allow interaction with services based on the Ontoportal software and are expected to be generally useful for other service types.

These changes include:

  • Improved Indexing with Compound Fields: When using an external vocabulary service with compound fields, you can now specify which field(s) will include additional indexed information, such as translations of an entry into other languages. This is done by adding the indexIn in retrieval-filtering. See also #10505 and GDCC/dataverse-external-vocab-support documentation.

  • Broader Support for Indexing Service Responses: Indexing of the results from retrieval-filtering responses can now handle additional formats including JSON arrays of strings and values from arbitrary keys within a JSON Object. See #10505.

  • HTTP Headers: You are now able to add HTTP request headers required by the service you are implementing. See #10331.

  • Flexible params in retrievalUri: You can now use managed-fields field names as well as the term-uri-field field name as parameters in the retrieval-uri when configuring an external vocabulary service. {0} as an alternative to using the term-uri-field name is still supported for backward compatibility. Also you can specify if the value must be url encoded with encodeUrl:. See #10404.

    For example : "retrieval-uri": "https://data.agroportal.lirmm.fr/ontologies/{keywordVocabulary}/classes/{encodeUrl:keywordermURL}"

  • Hidden HTML Fields External controlled vocabulary scripts, configured via :CVocConf, can now access the values of managed fields as well as the term-uri-field for use in constructing the metadata view for a dataset. These values are now added as hidden elements in the HTML and can be found with the HTML attribute data-cvoc-metadata-name. See also #10503.

A Contributor Guide is now available

A new Contributor Guide has been added by the UX Working Group (#10531 and #10532).

URL Validation Is More Permissive

URL validation now allows two slashes in the path component of the URL.
Among other things, this allows metadata fields of url type to be filled with more complex url such as https://archive.softwareheritage.org/browse/directory/561bfe6698ca9e58b552b4eb4e56132cac41c6f9/?origin_url=https://github.com/gem-pasteur/macsyfinder&revision=868637fce184865d8e0436338af66a2648e8f6e1&snapshot=1bde3cb370766b10132c4e004c7cb377979928d1

See also #9750 and #9739

Improved Detection of RO-Crate Files

Detection of mime-types based on a filename with extension and detection of the RO-Crate metadata files.

From now on, filenames with extensions can be added into MimeTypeDetectionByFileName.properties file. Filenames added there will take precedence over simply recognizing files by extensions. For example, two new filenames are added into that file:

ro-crate-metadata.json=application/ld+json; profile="http://www.w3.org/ns/json-ld#flattened http://www.w3.org/ns/json-ld#compacted https://w3id.org/ro/crate"
ro-crate-metadata.jsonld=application/ld+json; profile="http://www.w3.org/ns/json-ld#flattened http://www.w3.org/ns/json-ld#compacted https://w3id.org/ro/crate"

Therefore, files named ro-crate-metadata.json will be then detected as RO-Crated metadata files from now on, instead as generic JSON files.
For more information on the RO-Crate specifications, see https://www.researchobject.org/ro-crate

See also #10015.

New S3 Tagging Configuration Option

If your S3 store does not support tagging and gives an error if you configure direct upload, you can disable the tagging by using the dataverse.files.<id>.disable-tagging JVM option. For more details, see the section on S3 tags in the guides, #10022 and #10029.

Feature Flag To Remove the Required "Reason" Field in the "Return to Author" Dialog

A reason field, that is required to not be empty, was added to the "Return to Author" dialog in v6.2. Installations that handle author communications through email or another system may prefer to not be required to use this new field. v6.3 includes a new
disable-return-to-author-reason feature flag that can be enabled to drop the reason field from the dialog and make sending a reason optional in the api/datasets/{id}/returnToAuthor call. See also #10655.

Improved Use of Dataverse Thumbnail

Dataverse will use the dataset thumbnail, if one is defined, rather than the generic Dataverse logo in the Open Graph metadata header. This means the image will be seen when, for example, the dataset is referenced on Facebook. See also #5621.

Improved Email Notifications When Guestbook is Used for File Access Requests

Multiple improvements to guestbook response emails making it easier to organize and process them. The subject line of the notification email now includes the name and user identifier of the requestor. Additionally, the body of the email now includes the user id of the requestor. Finally the guestbook responses have been sorted and spaced to improve readability. See also #10581.

New keywordTermURI Metadata Field in the Citation Metadata Block

A new metadata field - keywordTermURI, has been added in the citation metadata block (as a fourth child field under the keyword parent field). This has been done to improve usability and to facilitate the integration of controlled vocabulary services, adding the possibility of saving the "term" and/or its associated URI. For more information, see #10288 and PR #10371.

Updated Computational Workflow Metadata Block

The computational workflow metadata block has been updated to present a clickable link for the External Code Repository URL field. See also #10339.

Metadata Source Facet Added

An option has been added to index the name of the harvesting client as the "Metadata Source" of harvested datasets and files; if enabled, the Metadata Source facet will show separate entries for the content harvested from different sources, instead of the current, default behavior where there is one "Harvested" facet for all such content.

Tho enable this feature, set the optional feature flage (jvm option) dataverse.feature.index-harvested-metadata-source=true before reindexing.

See also #10611 and #10651.

Additional Facet Settings

Extra settings have been added giving an instance admin more choices in selectively limiting the availability of search facets on the collection and dataset pages.

See Disable Solr Facets under the configuration section of the Installation Guide for more info as well as #10570.

Sitemap Now Supports More Than 50k Items

Dataverse can now handle more than 50,000 items when generating sitemap files, splitting the content across multiple files to comply with the Sitemap protocol. For details, see the sitemap section of the Installation Guide. See also #8936 and #10321.

MIT and Apache 2.0 Licenses Added

New files have been added to import the MIT and Apache 2.0 Licenses to Dataverse:

  • licenseMIT.json
  • licenseApache-2.0.json

Guidance has been added to the guides to explain the procedure for adding new licenses to Dataverse.

See also #10425.

3D Viewer by Open Forest Data

3DViewer by openforestdata.pl has been added to the list of external tools. See also #10561.

Datalad Integration With Dataverse

DataLad has been integrated with Dataverse. For more information, see the integrations section of the guides. See also #10468.

Rsync Support Has Been Deprecated

Support for rsync has been deprecated. Information has been removed from the guides for rsync and related software such as Data Capture Module (DCM) and Repository Storage Abstraction Layer (RSAL). You can still find this information in older versions of the guides. See Settings, below, for deprecated settings. See also #8985.

↑ Table of Contents

Bug Fixes

OpenAPI Re-Enabled

In Dataverse 6.0 when Payara was updated it caused the url /openapi to stop working:

In addition to fixing the /openapi URL, we are also making some changes on how we provide the OpenAPI document:

When it worked in Dataverse 5.x, the /openapi output was generated automatically by Payara, but in this release we have switched to OpenAPI output produced by the SmallRye OpenAPI plugin. This gives us finer control over the output.

For more information, see the section on OpenAPI in the API Guide and #10328.

Re-Addition of "Cell Counting" to Life Sciences Block

In the Life Sciences metadata block under the "Measurement Type" field the value cell counting was accidentally removed in v5.1. It has been restored. See also #8655 and #9735.

Math Challenge Fixed on 403 Error Page

On the "forbidden" (403) error page, the math challenge now correctly displays so that the contact form can be submitted. See also #10466.

Ingest Option Bug Fixed

A bug that prevented the "Ingest" option in the file page "Edit File" menu from working has been fixed. See also #10568.

Incomplete Metadata Bug Fix

A bug was fixed where the incomplete metadata label was being shown for published dataset with incomplete metadata in certain scenarios. This label will now be shown for draft versions of such datasets and published datasets that the user can edit. This label can also be made invisible for published datasets (regardless of edit rights) with the new option dataverse.ui.show-validity-label-when-published set to false. See also #10116.

Identical Role Error Message

An error is now correctly reported when an attempt is made to assign an identical role to the same collection, dataset, or file. See also #9729 and #10465.

↑ Table of Contents

API

Superuser Endpoint

The existing API endpoint for toggling the superuser status of a user has been deprecated in favor of a new API endpoint that allows you to explicitly and idempotently set the status as true or false. For details, see the API Guide, #9887 and #10440.

New Featured Collections Endpoints

New API endpoints have been added to allow you to add or remove featured collections from a collection.

See also the sections on listing, setting, and removing featured collections in the API Guide, #10242 and #10459.

Dataset Version Endpoint Extended

The API endpoint for getting the Dataset version has been extended to include latestVersionPublishingStatus. See also #10330.

New Optional Query Parameters for Metadatablocks Endpoints

New optional query parameters have been added to api/metadatablocks and api/dataverses/{id}/metadatablocks endpoints:

  • returnDatasetFieldTypes: Whether or not to return the dataset field types present in each metadata block. If not set, the default value is false.
  • Setting the query parameter onlyDisplayedOnCreate=true also returns metadata blocks with dataset field type input levels configured as required on the General Information page of the collection, in addition to the metadata blocks and their fields with the property displayOnCreate=true.

See also #10389

Dataverse Payload Includes Release Status

The Dataverse object returned by /api/dataverses has been extended to include "isReleased": {boolean}. See also #10491.

New Field Type Input Level Endpoint

A new endpoint api/dataverses/{id}/inputLevels has been created for updating the dataset field type input levels of a collection via API. See also #10477.

Banner Message Endpoint Extended

The endpoint api/admin/bannerMessage has been extended so the ID is returned when created. See also #10565.

↑ Table of Contents

Settings

Database Settings:

New:

  • :DisableSolrFacets

Deprecated (used with rsync):

  • :DataCaptureModuleUrl
  • :DownloadMethods
  • :LocalDataAccessPath
  • :RepositoryStorageAbstractionLayerUrl

New Configuration Options

  • dataverse.files.<id>.disable-tagging
  • dataverse.feature.add-publicobject-solr-field
  • dataverse.feature.avoid-expensive-solr-join
  • dataverse.feature.reduce-solr-deletes
  • dataverse.feature.disable-return-to-author-reason
  • dataverse.feature.index-harvested-metadata-source
  • dataverse.ui.show-validity-label-when-published

↑ Table of Contents

Complete List of Changes

For the complete list of code changes in this release, see the 6.3 Milestone in GitHub.

↑ Table of Contents

Getting Help

For help with upgrading, installing, or general questions please post to the Dataverse Community Google Group or email support@dataverse.org.

↑ Table of Contents

Upgrade Instructions

Upgrading requires a maintenance window and downtime. Please plan accordingly, create backups of your database, etc.

These instructions assume that you've already upgraded through all the 5.x releases and are now running Dataverse 6.2.

0. These instructions assume that you are upgrading from the immediate previous version. If you are running an earlier version, the only supported way to upgrade is to progress through the upgrades to all the releases in between before attempting the upgrade to this version.

If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo to change to that user first. For example, sudo -i -u dataverse if dataverse is your dedicated application user.

In the following commands, we assume that Payara 6 is installed in /usr/local/payara6. If not, adjust as needed.

export PAYARA=/usr/local/payara6

(or setenv PAYARA /usr/local/payara6 if you are using a csh-like shell)

1. Undeploy the previous version.

  • $PAYARA/bin/asadmin undeploy dataverse-6.2

2. Stop Payara and remove the following directories:

service payara stop
rm -rf $PAYARA/glassfish/domains/domain1/generated
rm -rf $PAYARA/glassfish/domains/domain1/osgi-cache
rm -rf $PAYARA/glassfish/domains/domain1/lib/databases

3. Upgrade Payara to v6.2024.6

With this version of Dataverse, we encourage you to upgrade to version 6.2024.6.
This will address security issues accumulated since the release of 6.2023.8.

Note that if you are using GDCC containers, this upgrade is included when pulling new release images.
No manual intervention is necessary.

The steps below are a simple matter of reusing your existing domain directory with the new distribution.
But we recommend that you review the Payara upgrade instructions as it could be helpful during any troubleshooting:
Payara Release Notes.
We also recommend you ensure you followed all update instructions from the past releases regarding Payara.
(The latest Payara update was for v6.0.)

Move the current Payara directory out of the way:

mv $PAYARA $PAYARA.6.2023.8

Download the new Payara version 6.2024.6 (from https://www.payara.fish/downloads/payara-platform-community-edition/), and unzip it in its place:

cd /usr/local
unzip payara-6.2024.6.zip

Replace the brand new payara/glassfish/domains/domain1 with your old, preserved domain1:

mv payara6/glassfish/domains/domain1 payara6/glassfish/domains/domain1_DIST
mv payara6-2023.8/glassfish/domains/domain1 payara6/glassfish/domains/

Make sure that you have the following --add-opens options in your payara6/glassfish/domains/domain1/config/domain.xml. If not present, add them:

<jvm-options>--add-opens=java.management/javax.management=ALL-UNNAMED</jvm-options>
<jvm-options>--add-opens=java.management/javax.management.openmbean=ALL-UNNAMED</jvm-options>
<jvm-options>[17|]--add-opens=java.base/java.io=ALL-UNNAMED</jvm-options>
<jvm-options>[21|]--add-opens=java.base/jdk.internal.misc=ALL-UNNAMED</jvm-options>

(Note that you likely already have the java.base/java.io option there, but without the [17|] prefix. Make sure to replace it with the version above)

Start Payara:

sudo service payara start

4. Deploy this version.

$PAYARA/bin/asadmin deploy dataverse-6.3.war

5. For installations with internationalization:

6. Restart Payara

service payara stop
service payara start

7. Update the following metadata blocks to reflect the incremental improvements made to the handling of core metadata fields:

wget https://raw.githubusercontent.com/IQSS/dataverse/v6.3/scripts/api/data/metadatablocks/citation.tsv

curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file citation.tsv

wget https://raw.githubusercontent.com/IQSS/dataverse/v6.3/scripts/api/data/metadatablocks/biomedical.tsv

curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file biomedical.tsv

wget https://raw.githubusercontent.com/IQSS/dataverse/v6.3/scripts/api/data/metadatablocks/computational_workflow.tsv

curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file computational_workflow.tsv

8. Upgrade Solr

Solr 9.4.1 is now the version recommended in our Installation Guide and used with automated testing. There is a known security issue in the previously recommended version 9.3.0: https://nvd.nist.gov/vuln/detail/CVE-2023-36478. While the risk of an exploit should not be significant unless the Solr instance is accessible from outside networks (which we have always recommended against), we recommend to upgrade.

Install Solr 9.4.1 following the instructions from the Installation Guide.

The instructions in the guide suggest to use the config files from the installer zip bundle. Upgrading an existing instance, it may be easier to download them from the source tree:

wget https://raw.githubusercontent.com/IQSS/dataverse/v6.3/conf/solr/solrconfig.xml
wget https://raw.githubusercontent.com/IQSS/dataverse/v6.3/conf/solr/schema.xml
cp solrconfig.xml schema.xml /usr/local/solr/solr-9.4.1/server/solr/collection1/conf

8a. For installations with custom or experimental metadata blocks:

  • Stop Solr instance (usually service solr stop, depending on Solr installation/OS, see the Installation Guide).

  • Run the update-fields.sh script that we supply, as in the example below (modify the command lines as needed to reflect the correct path of your Solr installation):

wget https://raw.githubusercontent.com/IQSS/dataverse/v6.3/conf/solr/update-fields.sh
chmod +x update-fields.sh
curl "http://localhost:8080/api/admin/index/solr/schema" | ./update-fields.sh /usr/local/solr/solr-9.4.1/server/solr/collection1/conf/schema.xml
  • Start Solr instance (usually service solr start depending on Solr/OS).

9. Enable the Metadata Source facet for harvested content (Optional):

If you choose to enable this new feature, set the optional feature flag (jvm option) dataverse.feature.index-harvested-metadata-source=true before reindexing.

10. Reindex Solr, if you upgraded Solr (recommended), or chose to enable any options that require a reindex:

curl http://localhost:8080/api/admin/index

Note: if you choose to perform a migration of your keywordValue metadata fields (section below), that will require a reindex as well, so do that first.

Notes for Dataverse Installation Administrators

Data migration to the new keywordTermURI field

You can migrate your keywordValue data containing URIs to the new keywordTermURI field.
In case of data migration, view the affected data with the following database query:

SELECT value FROM datasetfieldvalue dfv
INNER JOIN datasetfield df ON df.id = dfv.datasetfield_id
WHERE df.datasetfieldtype_id = (SELECT id FROM datasetfieldtype WHERE name = 'keywordValue')
AND value ILIKE 'http%';

If you wish to migrate your data, a database update is then necessary:

UPDATE datasetfield df
SET datasetfieldtype_id  = (SELECT id FROM datasetfieldtype WHERE name = 'keywordTermURI')
FROM datasetfieldvalue dfv
WHERE dfv.datasetfield_id  = df.id
AND df.datasetfieldtype_id = (SELECT id FROM datasetfieldtype WHERE name = 'keywordValue')
AND dfv.value ILIKE 'http%';

A reindex in place will be required. ReExportAll will need to be run to update the metadata exports of the dataset. Follow the directions in the Admin Guide.

↑ Table of Contents

Don't miss a new dataverse release

NewReleases is sending notifications on new releases.