This is pg_search v0.10.0. It includes a significant number of improvements over v0.9.4, as well as a few breaking changes.
Upgrading
Upgrading to v0.10.0 from any prior version requires dropping the pg_search extension using DROP EXTENSION pg_search CASCADE; and then re-creating it with CREATE EXTENSION pg_search; afterwards. Upgrading in place with ALTER EXTENSION pg_search UPDATE; is not supported for this release.
Note that dropping the extension will also drop everything in your schema that depends on the pg_search extension or the paradedb schema, including existing BM25 indexes.
As a result, all BM25 indexes need to be re-created using CALL paradedb.create_bm25(...).
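The upgrade sequence described above can be sketched roughly as follows. The catalog query is an optional, hypothetical pre-step for recording which indexes will need to be re-created; it assumes BM25 indexes are registered under the bm25 access method. The arguments to paradedb.create_bm25 are left elided because they depend on your own index definitions.

```sql
-- Optional pre-step (an assumption, not part of the official procedure):
-- list existing BM25 indexes so they can be re-created afterwards.
SELECT i.indrelid::regclass   AS table_name,
       i.indexrelid::regclass AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_am am   ON am.oid = c.relam
WHERE am.amname = 'bm25';

-- Drop the extension and everything that depends on it,
-- including existing BM25 indexes.
DROP EXTENSION pg_search CASCADE;

-- Re-create the extension from the newly installed version.
CREATE EXTENSION pg_search;

-- Re-create each BM25 index with its original definition:
-- CALL paradedb.create_bm25(...);
```

Because CASCADE removes every dependent object, it is worth reviewing the output of the first query before running the DROP.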
The primary reason for this is that we've changed where we physically store the underlying indexes on disk. This change resolves a situation where an index from database A could overwrite an index from database B that happened to share the same internal OID (#1651).
Additionally, we've improved the representation of our internal document id field ("ctid"), which leads to smaller indexes (#1584).
Unfortunately, neither of these changes is backwards compatible. Our intent is that this will be the last release causing on-disk breaking changes.
Breaking Changes
New File Paths
As mentioned above, a big breaking change is that we now store tantivy index files in a different location on disk. The location is still relative to the $PGDATA directory, but is now organized by database id, index id, and the index's assigned "file number".
Primarily, this ensures databases won't inadvertently overwrite each other's indexes. Secondarily, it allows pg_search to better interoperate with transaction handling around CALL paradedb.create_bm25() and CALL paradedb.drop_bm25().
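As a rough illustration, the identifiers that the new layout is organized by can be looked up in the Postgres catalogs. This is only a sketch: it assumes the "file number" corresponds to pg_class.relfilenode and that BM25 indexes use the bm25 access method. The exact directory structure under $PGDATA is not spelled out here, so the query does not attempt to build a path.

```sql
-- Sketch: list identifiers that plausibly correspond to the path components.
-- Assumption: "file number" maps to pg_class.relfilenode.
SELECT d.oid         AS database_oid,
       c.oid         AS index_oid,
       c.relfilenode AS file_number,
       c.relname     AS index_name
FROM pg_class c
JOIN pg_am am      ON am.oid = c.relam
JOIN pg_database d ON d.datname = current_database()
WHERE am.amname = 'bm25';
```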
Compact Internal IDs
- feat: better represent ctids as u64 by @eeeebbbbrrrr in #1584
We now more compactly store "ctid" values in the index. This is a breaking change as v0.10.0's understanding of this
field is not compatible with prior versions.
On large indexes (>10G) we've seen about a 2% savings in disk storage. The savings will be even greater for larger
indexes.
Cleanup after DROP INDEX
Prior to v0.10.0, dropping a BM25 index would not remove the old index files from disk. As of v0.10.0, the physical index files are deleted during CALL paradedb.drop_bm25(). This also extends to the DROP INDEX statement, and other statements that may lead to dropping an index, such as DROP SCHEMA and DROP TABLE.
Object Dependencies
pg_search now creates internal dependencies, in both directions, between the SCHEMA and the INDEX that it creates. The result is that if one object is dropped, the other will be too. Doing so ensures database objects are cleanly removed during schema modifications.
One Index per Table
- feat: only allow one USING bm25 index per relation by @eeeebbbbrrrr in #1637
Until now, pg_search has allowed creating multiple BM25 indexes on a table.
However, when searching, it wasn't always guaranteed that the specified index would actually be used. As of v0.10.0, only one BM25 index can be created per table.
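To see whether any tables are affected by this restriction, a catalog query along these lines could be run on the existing installation before upgrading. It is a sketch that assumes BM25 indexes are registered under the bm25 access method; the column aliases are illustrative.

```sql
-- Find tables that currently have more than one BM25 index.
SELECT i.indrelid::regclass              AS table_name,
       count(*)                          AS bm25_index_count,
       array_agg(i.indexrelid::regclass) AS bm25_indexes
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_am am   ON am.oid = c.relam
WHERE am.amname = 'bm25'
GROUP BY i.indrelid
HAVING count(*) > 1;
```

Any table returned here will need its indexes consolidated into a single BM25 index when they are re-created.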
paradedb.fuzzy_term() Argument Defaults
- chore: change fuzzy term default to prefix = false by @rebasedming in #1642
While minor, it's worth noting that the default prefix argument to the paradedb.fuzzy_term() function has changed to false. This change could impact search results of queries using this function. The new behavior, by default, matches more documents.
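Queries that depended on the previous default can pass the argument explicitly rather than relying on it. The sketch below uses a hypothetical index named search_idx and a hypothetical field named description, and it assumes that paradedb.fuzzy_term() accepts the named arguments field, value, and prefix and that the index exposes a search(query => ...) function; check the documentation for the exact signatures.

```sql
-- Hypothetical index "search_idx" and field "description".
-- Assumption: fuzzy_term accepts named arguments field, value, and prefix.
SELECT *
FROM search_idx.search(
  query => paradedb.fuzzy_term(
    field  => 'description',
    value  => 'keybaord',
    prefix => true   -- set explicitly instead of relying on the default
  )
);
```

Setting prefix explicitly also keeps results stable if the default changes again.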
Stability Improvements
json Fields Don't Crash
- fix: segfault with fields of type '::json' by @eeeebbbbrrrr in #1654
A column of type json would cause a crash during indexing if it was included in the json_fields => argument of CALL paradedb.create_bm25(). This has been resolved.
Background Worker doesn't Terminate Unexpectedly
- fix: guard against bgworker exiting and client backend crashing by @eeeebbbbrrrr in #1656
The background worker responsible for performing index write operations is now more resilient to unexpected errors. It was possible for it to exit early in some situations, which could then lead to a client backend crashing.
Improved COMMIT/ABORT Handling
- fix: change COMMIT/ABORT strategy & fix VACUUM by @eeeebbbbrrrr in #1659
v0.10.0 improves the code around COMMIT and ABORT, resolving issues where a COMMIT wouldn't always happen after an ABORT.
Additionally, v0.9.3 introduced a bug where VACUUM wouldn't remove dead rows. This has also been resolved.
Improved Locking Primitives
- chore: migrate to parking_lot mutexes by @eeeebbbbrrrr in #1658
Similar to the above, our internal locking structures are now more resilient to unexpected errors. This improves both the code itself and user-facing error propagation.
New Features
Improved Query Planner Integration
- feat: work on overall IAM support, including Bitmap scans and improved cost estimation by @eeeebbbbrrrr in #1639
pg_search now supports more Postgres query plan types and provides greatly improved query cost estimations.
With this, our pseudo-secret @@@ operator is now ready for direct use.
The documentation covers its usage, but it's worth mentioning that using the @@@ operator in a WHERE clause, as opposed to SELECTing directly from idxname.search(...), can lead to drastically improved query execution times in situations where scores (or the effects of scoring on ordering) are not necessary.
The @@@ operator also allows combining standard SQL WHERE clause predicates with advanced text-search queries. Additionally, it's our recommended way of working with JOINed tables where one (or both) sides of the JOIN require some kind of text-search filtering.
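As a minimal sketch of combining a text-search predicate with an ordinary SQL predicate, consider the query below. The table mock_items, its columns, and the query string are hypothetical. The left operand of @@@ is written here as the table's key column, which is an assumption; consult the documentation for the exact form supported by your version.

```sql
-- Hypothetical table "mock_items" with a BM25 index.
-- Assumption: the left operand of @@@ is the table's key column.
SELECT id, description, rating
FROM mock_items
WHERE id @@@ 'description:keyboard'   -- text-search predicate
  AND rating > 3;                     -- ordinary SQL predicate
```

Because both conditions live in the same WHERE clause, the planner can weigh the text-search filter alongside the other predicates when choosing a plan.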
Tokenization Configuration
- feat: make tokenizer filters configurable by @aalexandrov in #1583
- feat: make stemmer a filter by @rebasedming in #1635
- fix: Add missing stemmer and lowercase filters to raw and language tokenizers by @rebasedming in #1643
Combined, these changes enable more control over per-field tokenization rules.
Queries and Configuration
- feat: Introduce fuzzy_phrase query by @rebasedming in #1653
- feat: adding lenient and conjuction configs by @Weijun-H in #1634
PostGIS Support
ParadeDB's Dockerfile now includes support for PostGIS.
Documentation
- docs: Introduce tutorials and concepts by @rebasedming in #1663
- docs: Major refactor by @rebasedming in #1649
- docs: Update replication.mdx (replciated -> replicated) by @wendyzhan05 in #1617
Our CTO, @rebasedming, has done a tremendous amount of work refactoring our documentation. Please be sure to check it out, not only to appreciate its awesomeness, but for details of the new features in this release.
Documentation can be found at https://docs.paradedb.com/.
Testing/CI
- chore: teach test framework to report the underlying sqlx error by @eeeebbbbrrrr in #1640
- fix: Fix CI with log printing by @philippemnoel in #1644
- ci: Add a test for the Helm chart by @philippemnoel in #1559
- chore: Update to new Helm Chart repo by @philippemnoel in #1648
- chore: Add Koala API Key by @philippemnoel in #1650
- chore: Remove the need on token for workflow dispatch by @philippemnoel in #1652
- feat: Dynamically determine pgrx version in CI / Docker by @Weijun-H in #1655
- fix: Dynamically determine pgrx version failure when pushing to dev by @Weijun-H in #1660
- fix: compilation problem in dev by @eeeebbbbrrrr in #1664
Housekeeping
- chore: upgrade all dependencies by @eeeebbbbrrrr in #1641
- chore: upgrade dependencies, specifically pgrx to v0.12.4 by @eeeebbbbrrrr in #1645
- chore: Rebase tantivy to support more aggregation functions by @Weijun-H in #1618
New Contributors
- @aalexandrov made their first contribution in #1583
- @wendyzhan05 made their first contribution in #1617
Full Changelog: v0.9.4...v0.10.0