github paradedb/paradedb v0.10.0

latest releases: v0.14.1, v0.14.0, v0.13.2...
4 months ago

This is pg_search v0.10.0. It represents a significant number of improvements over 0.9.4, including a few breaking changes.

Upgrading

Upgrading to v0.10.0, from any prior version, requires dropping the pg_search extension using DROP EXTENSION pg_search CASCADE; and re-creating it with CREATE EXTENSION pg_search;. ALTER EXTENSION pg_search UPDATE; is not supported for this release.

Note that dropping the extension will also drop everything in your schema that has a dependency on the pg_search extension or the paradedb schema, including existing BM25 indexes.

As a result, all BM25 indexes need to be re-created using CALL paradedb.create_bm25(...).

The primary reason for this is that we've changed where we physically store the underlying indexes on disk. This change resolves a situation where an index from database A could overwrite an index from database B that also happen to share the same internal OID (#1651).

Additionally, we've improved the representation of our internal document id field ("ctid"), which leads to smaller indexes (#1584).

Unfortunately, neither of these changes are backwards compatible. Our intent is that this will be the last release causing on-disk breaking changes.

๐Ÿšจ Breaking Changes

New File Paths

As mentioned above, a big breaking change is that we now store tantivy index files in a different location on disk. The location is still relative to the $PGDATA directory, but is now organized by database id, index id, and the index's assigned "file number".

Primarily, this ensures databases won't inadvertently overwrite each other's indexes. Secondarily, it allows pg_search to better interoperate with transaction handling around CALL paradedb.create_bm25() and CALL paradedb.drop_bm25().

Compact Internal IDs

We now more compactly store "ctid" values in the index. This is a breaking change as v0.10.0's understanding of this
field is not compatible with prior versions.

On large indexes (>10G) we've seen about a 2% savings in disk storage. The savings will be even greater for larger
indexes.

Cleanup after DROP INDEX

Prior to v0.10.0, dropping a BM25 index would not remove the old index files from disk. As of v0.10.0, the physical index files are deleted during CALL paradedb.drop_bm25(). This also extends to the DROP INDEX statement, and other statements that may lead to dropping an index, such as DROP SCHEMA and DROP TABLE.

Object Dependencies

  • fix: add pg_depend entry for drop schema + index by @neilyio in #1627

pg_search now creates internal dependencies between the SCHEMA and INDEX pg_search creates (in both directions). The result is that if one object is dropped, the other will be too. Doing so ensures database objects are cleanly removed during schema modifications.

One Index per Table

Until now, pg_search has allowed creating multiple BM25 indexes on a table.

However, when searching it wasn't always guaranteed the specified index would actually be used. As of v0.10.0, only one BM25 index can be created per table.

paradedb.fuzzy_term() Argument Defaults

While minor, it's worth noting that the default prefix argument to the paradedb.fuzzy_term() function has changed to false. This change could impact search results of queries using this function. The new behavior, by default, matches more documents.

๐Ÿ’ช Stability Improvements

json Fields don't Crash

Tables with a column of type json would cause a crash during indexing if they were used in the set of json_fields => during CALL paradedb.create_bm25(). This has been resolved.

Background Worker doesn't Terminate Unexpectedly

  • fix: guard against bgworker exiting and client backend crashing by @eeeebbbbrrrr in #1656

The background worker responsible for performing index write operations is now more resilient to unexpected errors. It was possible for it to exit early in some situations, which could then lead to a client backend crashing.

Improved COMMIT/ABORT Handling

v0.10.0 improves the code around COMMIT and ABORT, resolving issues where a COMMIT wouldn't always happen after an ABORT.

Additionally, v0.9.3, introduced a bug where VACUUM wouldn't remove dead rows. This has also been resolved.

Improved Locking Primitives

Similar to above, our internal locking structures are now more resilient to unexpected errors. This is an improvement for the code along with user-facing error propagation.

๐ŸŽ‰ New Features

Improved Query Planner Integration

  • feat: work on overall IAM support, including Bitmap scans and improved cost estimation by @eeeebbbbrrrr in #1639

pg_search now supports more Postgres query plan types and provides greatly improved query cost estimations.

With this, our pseudo-secret @@@ operator is now ready for direct use.

The documentation covers its usage, but it's worth mentioning that using the @@@ operator in a WHERE clause, as opposed to SELECTing directly from idxname.search(...), can lead to drastically improved query execution times in situations where scores (or the effects of scoring on ordering) are not necessary..

The @@@ operator also allows combining standard SQL WHERE clause predicates with advanced text-search queries. Additionally, it's our recommended way of working with JOINed tables where one (or both) sides of the JOIN require some kind of text-search filtering.

Tokenization Configuration

Combined, these changes enable more control over per-field tokenization rules.

Queries and Configuration

PostGIS support

ParadeDB's Dockerfile now includes support for PostGIS.

๐Ÿ“š Documentation

Our CTO, @rebasedming, has done a tremendous amount of work refactoring our documentation. Please be sure to check it out, not only to appreciate its awesomeness, but for details of the new features in this release.

Documentation can be found at https://docs.paradedb.com/.

๐Ÿงช Testing/CI

๐Ÿก Housekeeping

New Contributors

Full Changelog: v0.9.4...v0.10.0

Don't miss a new paradedb release

NewReleases is sending notifications on new releases.