Dataverse Software 5.9
This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.
Release Highlights
Dataverse Collection Page Optimizations
The Dataverse Collection page, which also serves as the search page and the homepage in most Dataverse installations, has been optimized, with a specific focus on reducing the number of queries for each page load. These optimizations will be more noticable on Dataverse installations with higher traffic.
Support for HTTP "Range" Header for Partial File Downloads
Dataverse now supports the HTTP "Range" header, which allows users to download parts of a file. Here are some examples:
bytes=0-9
gets the first 10 bytes.bytes=10-19
gets 10 bytes from the middle.bytes=-10
gets the last 10 bytes.bytes=9-
gets all bytes except the first 10.
Only a single range is supported. For more information, see the Data Access API section of the API Guide.
Support for Optional External Metadata Validation Scripts
The Dataverse software now allows an installation administrator to provide custom scripts for additional metadata validation when datasets are being published and/or when Dataverse collections are being published or modified. The Harvard Dataverse Repository has been using this mechanism to combat content that violates our Terms of Use, specifically spam content. All the validation or verification logic is defined in these external scripts, thus making it possible for an installation to add checks custom-tailored to their needs.
Please note that only the metadata are subject to these validation checks. This does not check the content of any uploaded files.
For more information, see the Database Settings section of the Guide. The new settings are listed below, in the "New JVM Options and DB Settings" section of these release notes.
Displaying Author's Identifier as Link
In the dataset page's metadata tab the author's identifier is now displayed as a clickable link, which points to the profile page in the external service (ORCID, VIAF etc.) in cases where the identifier scheme provides a resolvable landing page. If the identifier does not match the expected scheme, a link is not shown.
Auxiliary File API Enhancements
This release includes updates to the Auxiliary File API. These updates include:
- Auxiliary files can now also be associated with non-tabular files
- Auxiliary files can now be deleted
- Duplicate Auxiliary files can no longer be created
- A new API has been added to list Auxiliary files by their origin
- Some auxiliary were being saved with the wrong content type (MIME type) but now the user can supply the content type on upload, overriding the type that would otherwise be assigned
- Improved error reporting
- A bugfix involving checksums for Auxiliary files
Please note that the Auxiliary files feature is experimental and is designed to support integration with tools from the OpenDP Project. If the API endpoints are not needed they can be blocked.
Major Use Cases and Infrastructure Enhancements
Newly-supported major use cases in this release include:
- The Dataverse collection page has been optimized, resulting in quicker load times on one of the most common pages in the application (Issue #7804, PR #8143)
- Users will now be able to specify a certain byte range in their downloads via API, allowing for downloads of file parts. (Issue #6397, PR #8087)
- A Dataverse installation administrator can now set up metadata validation for datasets and Dataverse collections, allowing for publish-time and create-time checks for all content. (Issue #8155, PR #8245)
- Users will be provided with clickable links to authors' ORCIDs and other IDs in the dataset metadata (Issue #7978, PR #7979)
- Users will now be able to associate Auxiliary files with non-tabular files (Issue #8235, PR #8237)
- Users will no longer be able to create duplicate Auxiliary files (Issue #8235, PR #8237)
- Users will be able to delete Auxiliary files (Issue #8235, PR #8237)
- Users can retrieve a list of Auxiliary files based on their origin (Issue #8235, PR #8237)
- Users will be able to supply the content type of Auxiliary files on upload (Issue #8241, PR #8282)
- The indexing process has been updated so that datasets with fewer files and indexed first, resulting in fewer failures and making it easier to identify problematically-large datasets. (Issue #8097, PR #8152)
- Users will no longer be able to create metadata records with problematic special characters, which would later require Dataverse installation administrator intervention and a database change (Issue #8018, PR #8242)
- The Dataverse software will now appropriately recognize files with the .geojson extension as GeoJSON files rather than "unknown" (Issue #8261, PR #8262)
- A Dataverse installation administrator can now retrieve more information about role deletion from the ActionLogRecord (Issue #2912, PR #8211)
- Users will be able to use a new role to allow a user to respond to file download requests without also giving them the power to manage the dataset (Issue #8109, PR #8174)
- Users will no longer be forced to update their passwords when moving from Dataverse 3.x to Dataverse 4.x (PR #7916)
- Improved accessibility of buttons on the Dataset and File pages (Issue #8247, PR #8257)
Notes for Dataverse Installation Administrators
Indexing Performance on Datasets with Large Numbers of Files
We discovered that whenever a full reindexing needs to be performed, datasets with large numbers of files take an exceptionally long time to index. For example, in the Harvard Dataverse Repository, it takes several hours for a dataset that has 25,000 files. In situations where the Solr index needs to be erased and rebuilt from scratch (such as a Solr version upgrade, or a corrupt index, etc.) this can significantly delay the repopulation of the search catalog.
We are still investigating the reasons behind this performance issue. For now, even though some improvements have been made, a dataset with thousands of files is still going to take a long time to index. In this release, we've made a simple change to the reindexing process, to index any such datasets at the very end of the batch, after all the datasets with fewer files have been reindexed. This does not improve the overall reindexing time, but will repopulate the bulk of the search index much faster for the users of the installation.
Custom Analytics Code Changes
You should update your custom analytics code to capture a bug fix related to tracking within the dataset files table. This release restores that tracking.
For more information, see the documentation and sample analytics code snippet provided in Installation Guide. This update can be used on any version 5.4+.
New ManageFilePermissions Permission
Dataverse can now support a use case in which a Admin or Curator would like to delegate the ability to grant access to restricted files to other users. This can be implemented by creating a custom role (e.g. DownloadApprover) that has the new ManageFilePermissions permission. This release introduces the new permission, and a Flyway script adjusts the existing Admin and Curator roles so they continue to have the ability to grant file download requrests.
Thumbnail Defaults
New default values have been added for the JVM settings dataverse.dataAccess.thumbnail.image.limit
and dataverse.dataAccess.thumbnail.pdf.limit
, of 3MB and 1MB respectively. This means that, unless specified otherwise by the JVM settings already in your domain configuration, the application will skip attempting to generate thumbnails for image files and PDFs that are above these size limits.
In previous versions, if these limits were not explicitly set, the application would try to create thumbnails for files of unlimited size. Which would occasionally cause problems with very large images.
New JVM Options and DB Settings
The following DB settings allow configuration of the external metadata validator:
- :DataverseMetadataValidatorScript
- :DataverseMetadataPublishValidationFailureMsg
- :DataverseMetadataUpdateValidationFailureMsg
- :DatasetMetadataValidatorScript
- :DatasetMetadataValidationFailureMsg
- :ExternalValidationAdminOverride
See the Database Settings section of the Guides for more information.
Notes for Developers and Integrators
Two sections of the Developer Guide have been updated:
- Instructions on how to sync a PR in progress with develop have been added in the version control section
- Guidance on avoiding ineffeciencies in JSF render logic has been added to the "Tips" section
Complete List of Changes
For the complete list of code changes in this release, see the 5.9 Milestone in Github.
For help with upgrading, installing, or general questions please post to the Dataverse Community Google Group or email support@dataverse.org.
Installation
If this is a new installation, please see our Installation Guide. Please also contact us to get added to the Dataverse Project Map if you have not done so already.
Upgrade Instructions
0. These instructions assume that you've already successfully upgraded from Dataverse Software 4.x to Dataverse Software 5 following the instructions in the Dataverse Software 5 Release Notes. After upgrading from the 4.x series to 5.0, you should progress through the other 5.x releases before attempting the upgrade to 5.9.
If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo
to change to that user first. For example, sudo -i -u dataverse
if dataverse
is your dedicated application user.
In the following commands we assume that Payara 5 is installed in /usr/local/payara5
. If not, adjust as needed.
export PAYARA=/usr/local/payara5
(or setenv PAYARA /usr/local/payara5
if you are using a csh
-like shell)
1. Undeploy the previous version.
$PAYARA/bin/asadmin list-applications
$PAYARA/bin/asadmin undeploy dataverse<-version>
2. Stop Payara and remove the generated directory
service payara stop
rm -rf $PAYARA/glassfish/domains/domain1/generated
3. Start Payara
service payara start
4. Deploy this version.
$PAYARA/bin/asadmin deploy dataverse-5.9.war
5. Restart payara
service payara stop
service payara start
6. Run ReExportall to update JSON Exports
Following the directions in the Admin Guide
Additional Release Steps
(for installations collecting web analytics)
1. Update custom analytics code per the Installation Guide.
(for installations with GeoJSON files)
1. Redetect GeoJSON files to update the type from "Unknown" to GeoJSON, following the directions in the API Guide
2. Kick off full reindex following the directions in the Admin Guide