Docker image/tag: semitechnologies/weaviate:0.20.0
See also: example docker compose file.
Changes since 0.19.x
New Features
-
Major rewrite: Esvector-only, no more Janusgraph (#932)
Prior to 0.20.0, the default stack used Janusgraph, which in turn depended on Elasticsearch and Cassandra. With 0.20.0, Weaviate stores all graph data only in a vector-optimized instance of Elasticsearch ("esvector").
Major benefits
This change was driven by the following motivating factors:
- Faster response time on listing queries
Previously, queries listing more than a single object (regardless of graph traversal depth) were problematic. With the new stack they are native to the backend and considerably faster.
- Leaner stack
Because fewer backends are required, the overall resource requirements for running the entire Weaviate stack are far lower, enabling Weaviate to run on smaller clusters and at reduced infrastructure cost.
- Native Vector Searching
Starting with 0.17.0, Weaviate has embraced vector-based searching and querying. This has quickly become one of Weaviate's major strengths and the main reason for users to adopt Weaviate. However, in versions prior to 0.20.0 the vector searching infrastructure was used alongside the traditional graph infrastructure. This led to inefficient queries, as a vector search would be performed first, then each result would query additional information from the traditional backends. With 0.20.0 the vector backend is also the main storage backend, eliminating a vast number of queries previously necessary to display complete information.
- Setup for more things to come
Weaviate's roadmap shows plenty more features built around storing data in a vector space, such as single-command classifications or entity merging. The vector-first storage setup introduced in 0.20.0 will act as the foundation for those.
How it works under the hood
Essentially, Weaviate indexes data objects as Elasticsearch documents, indexing both their structured and their vector representation. Cross-references are saved as "beacons" (i.e. links to other items) together with a cached copy of the resolved reference, which is maintained in the background without the user noticing. See Usage Guidelines for more info on caching and denormalization. This allows us to keep all features from before 0.20.0, most of which were previously provided by a graph database.
Breaking Changes
-
GraphQL Meta API merged into Aggregate API
There was plenty of overlap between Meta and Aggregate prior to 0.20.0. The major difference was that Aggregate would always group the aggregations by a specific property, whereas Meta was always ungrouped. With 0.20.0 both APIs have been merged and grouping is simply an optional parameter now. For more details, see #949.
-
Base unit for geoCoordinates search distance is now meter
It used to be kilometer, so simply multiply your existing query values by 1000.
-
Stricter separation of text and string property types
In previous versions those two types were almost identical. From 0.20.0 on, text is mapped in Elasticsearch as text, i.e. intended for full-text fields such as paragraphs, descriptions, etc., while string is mapped in Elasticsearch as keyword and should be used for exact values such as emails, ids, codes, etc. As a consequence, aggregations (such as "what are the 5 most used values for foo") will only be possible on string props from 0.20.0 on, no longer on text props. See #930 for more details.
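For the geoCoordinates unit change above, migrating existing query values amounts to a multiplication by 1000. The following is a minimal Python sketch of that conversion; the filter dictionary and its maxDistance key are hypothetical placeholders for wherever your client code stores the distance, not Weaviate's actual query format.

```python
def km_to_m(distance_km: float) -> float:
    """Convert a pre-0.20.0 distance (kilometers) to the 0.20.0 base unit (meters)."""
    return distance_km * 1000


def migrate_geo_filter(old_filter: dict) -> dict:
    """Return a copy of a (hypothetical) geo filter with its distance in meters.

    The "maxDistance" key is an assumption made for this sketch only.
    """
    migrated = dict(old_filter)  # shallow copy, the input stays untouched
    migrated["maxDistance"] = km_to_m(old_filter["maxDistance"])
    return migrated


old = {"latitude": 52.37, "longitude": 4.9, "maxDistance": 2.5}  # 2.5 km before 0.20.0
new = migrate_geo_filter(old)  # same search radius, now expressed in meters
```

Converting once at the point where queries are built keeps the rest of the client code independent of the unit change.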
Usage Guidelines
-
Eventual Consistency
Weaviate never offered immediate consistency before (both Elasticsearch and Cassandra are known for their eventual consistency). As most Weaviate use cases are in the analytics domain, where consistency is less critical, 0.20.0 embraces eventual consistency. Notably, two things are eventual:
- Changes to the Elasticsearch index are refreshed continuously; by default this happens once per second.
- The cross-reference denormalization cache (more about that below) is built asynchronously. In most cases caching is as fast as importing, so the caching period is not noticeable. However, if a data object (A) is updated and this object is referenced by another object (B), a traversal such as B -> A might retrieve outdated information while the cache is stale. In most scenarios, the cache is refreshed in less than a second. Note that if no cache is present, cross-refs are resolved at runtime.
-
Avoid Supernodes in your schema design
A supernode is a node which has a very large number of outgoing references. There is no exact definition of when a node becomes a supernode, but keep in mind while designing your schema that supernode behavior might become problematic. By default Weaviate creates a cached copy of referenced objects. So assume the following schema: City ->inCountry-> Country ->hasInhabitants-> Person. In this schema, Country might have a lot of outgoing references (to Person). With a denormalization depth (see below) of two or higher, every city would store a cache that goes two levels deep (Level 1: Country, Level 2: Person). This essentially means that every city would cache a copy of all inhabitants of its linked Country. This leads to a lot of duplication, possibly using up a lot of space. If the same schema was designed as Person ->livesIn-> City ->inCountry-> Country, the country would have a lot of incoming references (which aren't problematic for caching), but no entity would have a very large number of outgoing references.
Note that behavior around supernodes will probably be improved upon in the future. There are several measures we could take, such as no longer caching once an object has more than x references, or identifying that a referenced object is a supernode and deciding not to cache it.
-
Caching and Denormalization
To allow for efficient searches by cross-references and efficient traversal, cross-references are automatically cached. If a referenced object is changed, the cache of all incoming references is marked as stale and recreated. This process happens constantly in the background.
Since a graph isn't necessarily linear, one might easily end up with a circular structure, such as Person ->knows-> Person. To avoid a never-ending caching scenario, a denormalization depth can be configured (vector_index.denormalizationDepth in config.yaml). If this depth is reached, caching stops before it would cross the caching boundary.
Imagine the following schema: Person -> City -> Country -> Continent -> Planet. This schema shows a reference level depth of four (note that every reference, represented by an arrow, counts as one level, not every class; that's why the above example is four levels and not five). With denormalizationDepth: 2, every class object will cache two levels deep, so the classes would have a cache structure like this:
Person -> City -> Country
City -> Country -> Continent
Country -> Continent -> Planet
Continent -> Planet
If a user now wants to retrieve all four levels in a single query, Weaviate will do the following:
- Retrieve Person and notice that its cache goes two levels deep, up to Country.
- Since Country -> Continent crosses the cache boundary, Weaviate has to send a second request to the backend.
- The result is a Continent which itself has a cache that is up to two levels deep, reaching all the way to the end of the query.
Although the denormalization depth limit was set to only two, Weaviate was able to query four levels deep with only two database queries. The default denormalization limit is set to 3. This should work well in most use cases. You can adjust that value up if your schema is very deep but narrow, or adjust it down if your schema is shallow but very wide.
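The request counting in the example above can be sketched in a few lines of Python. This is a simplified model of a linear reference chain under the stated caching rule, not Weaviate's actual implementation; the function names cached_path and backend_requests are made up for this illustration.

```python
from typing import List


def cached_path(chain: List[str], start: int, depth: int) -> List[str]:
    """Classes reachable from chain[start] without leaving its cache."""
    return chain[start : start + depth + 1]


def backend_requests(chain: List[str], depth: int) -> int:
    """Count backend requests needed to resolve the whole chain.

    Each request returns one object together with its cache, which covers
    up to `depth` reference levels. The next request starts at the first
    class beyond that cache boundary.
    """
    requests = 0
    position = 0
    while position < len(chain):
        requests += 1          # fetch the object at `position` plus its cache
        position += depth + 1  # skip everything the cache already covered
    return requests


chain = ["Person", "City", "Country", "Continent", "Planet"]
person_cache = cached_path(chain, 0, 2)     # Person -> City -> Country
continent_cache = cached_path(chain, 3, 2)  # Continent -> Planet
queries_needed = backend_requests(chain, 2) # matches the two queries described above
```

In this model a deeper cache reduces the number of requests for long chains, at the price of more duplicated data per object, which is the trade-off behind adjusting the depth for deep-but-narrow versus shallow-but-wide schemas.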
Known Limitations
-
Filter by cross-ref only within cache boundary
At the moment, a filter-by-reference query, such as "Give me all the People living in a City located in a Country where the official language is English", will only work up to the configured denormalization depth. The above query would look like People->City->Country->Language, make use of three levels and therefore work with the default settings. If an additional level is required, you need to set vector_index.denormalizationDepth to 4 or higher. Note that this limitation is only in place for filtering by cross-ref, not for simply displaying resolved cross-refs. The issue to overcome this limitation is #967.
-
Delete Schema Property
A schema property cannot be deleted at the moment. The workaround is to delete the entire class and recreate it without the property. The issue to overcome this limitation is #973.
-
Aggregations don't allow for counting string props
Aggregate queries currently cannot count the number of string props. Counting non-string properties or counting overall results works fine. This was scoped out of 0.20.0 as the cost for this was considered higher than its potential benefit. The issue to overcome this limitation is #974.
-
No batch-references imports possible yet
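For the first limitation in this list (filtering by cross-ref only within the cache boundary), a quick way to check whether a filter path fits the configured depth is to count its reference arrows. The Python sketch below does just that; the helper names are hypothetical, the extra Family level is invented for the example, and the default of 3 is the default denormalization depth stated under Usage Guidelines.

```python
def reference_levels(path: str) -> int:
    """Count reference levels in a path written with '->' arrows."""
    return path.count("->")


def filter_supported(path: str, denormalization_depth: int = 3) -> bool:
    """True if a filter along `path` stays within the cache boundary."""
    return reference_levels(path) <= denormalization_depth


path = "People->City->Country->Language"
levels = reference_levels(path)                 # three arrows, three levels
works_by_default = filter_supported(path)       # fits the default depth of 3

deeper = path + "->Family"                      # hypothetical fourth level
needs_more = not filter_supported(deeper)       # exceeds the default depth
works_with_four = filter_supported(deeper, 4)   # fits once the depth is 4
```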
Changes since 0.20.0-rc0
New Features
- #980 Force Index refresh if referenced thing/action is not found
When adding an object and immediately adding a second object with a cross-reference to the first, you previously had to wait for the first object to be present in the index. This took up to 1 second. Requests sent before the refresh period had finished would fail, resulting in the need for client-side retry logic. With this feature, Weaviate forces an index refresh if a referenced object could not be found and then immediately retries. If the object is still not found, an error is returned. If the object was found the first time around, no (expensive) refresh is triggered.
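The lookup behavior described above (try, refresh on a miss, retry once, then fail) can be sketched as follows. FakeIndex is an in-memory stand-in written for this illustration, and resolve_reference is a hypothetical name; neither reflects Weaviate's or Elasticsearch's actual code.

```python
class FakeIndex:
    """In-memory stand-in for a search index with delayed visibility."""

    def __init__(self):
        self._pending = {}   # written, but not yet searchable
        self._visible = {}   # searchable after a refresh
        self.refresh_count = 0

    def add(self, object_id, obj):
        self._pending[object_id] = obj

    def refresh(self):
        """Make all pending writes searchable (the expensive operation)."""
        self.refresh_count += 1
        self._visible.update(self._pending)
        self._pending.clear()

    def get(self, object_id):
        return self._visible.get(object_id)


def resolve_reference(index, object_id):
    """Look up a referenced object, forcing one refresh only on a miss."""
    obj = index.get(object_id)
    if obj is None:
        index.refresh()              # forced refresh, only after a failed lookup
        obj = index.get(object_id)   # immediate retry
    if obj is None:
        raise LookupError(f"referenced object {object_id!r} not found")
    return obj


index = FakeIndex()
index.add("a", {"name": "A"})
first = resolve_reference(index, "a")   # miss, forced refresh, then found
second = resolve_reference(index, "a")  # found directly, no further refresh
```

Because the refresh is only triggered after a failed lookup, the common case where the referenced object is already visible pays no extra cost.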