github weaviate/weaviate 0.20.0
0.20.0 - Leaner & faster stack with Esvector

Docker image/tag: semitechnologies/weaviate:0.20.0
See also: example docker compose file.

Changes since 0.19.x

New Features

  • Major rewrite: Esvector-only, no more Janusgraph (#932)
    Prior to 0.20.0, the default stack used Janusgraph, which in turn depended on Elasticsearch and Cassandra. With 0.20.0, Weaviate stores all graph data only in a vector-optimized instance of Elasticsearch ("esvector").

    Major benefits
    This change was driven by the following motivating factors:

    • Faster response time on listing queries
      Previously, queries listing more than a single object (regardless of graph traversal depth) were problematic. With the new stack they are native to the backend and considerably faster.
    • Leaner stack
      Because fewer backends are required, the overall resource requirements for running the entire Weaviate stack are far lower, enabling Weaviate to run on smaller clusters and at reduced infrastructure cost.
    • Native Vector Searching
      Starting with 0.17.0, Weaviate has embraced vector-based searching and querying. This has quickly become one of Weaviate's major strengths and the main reason for users to adopt Weaviate. However, in versions prior to 0.20.0 the vector searching infrastructure was used alongside the traditional graph infrastructure. This led to inefficient queries, as a vector search would be performed first, then each result would query additional information from the traditional backends. With 0.20.0 the vector backend is also the main storage backend, eliminating a vast number of queries previously necessary to display complete information.
    • Setup for more things to come
      Weaviate's roadmap shows plenty more features built around storing data in a vector space, such as single command classifications or entity merging. The vector-first storage setup introduced in 0.20.0 will act as the foundation for those.

    How it works under the hood
    Essentially, Weaviate indexes data objects as Elasticsearch documents, indexing both their structured and vector representation. Cross-references are saved as "beacons" (i.e. links to other items), together with a cached copy of the resolved reference. The cache is built in the background without the user noticing. See Usage Guidelines for more info on caching and denormalization. This allows us to keep all features from prior to 0.20.0, where a graph database would provide most of those.

Breaking Changes

  • GraphQL Meta API merged into Aggregate API
    There was plenty of overlap between Meta and Aggregate prior to 0.20.0. The major difference was that Aggregate would always group the aggregations by a specific property, whereas Meta was always ungrouped. With 0.20.0 both APIs have been merged and grouping is simply an optional parameter now. For more details, see #949.

  • Base unit for geoCoordinates search distance is now meter
    It used to be kilometer, so simply multiply your existing query values by 1000.
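
    As a small illustration of the migration, a helper along these lines could convert the distances used in existing queries. The function name is hypothetical and not part of Weaviate.

```python
def km_to_m(distance_km: float) -> float:
    """Convert a pre-0.20.0 search distance (kilometers) to meters."""
    return distance_km * 1000


# A query that previously used a maximum distance of 2.5 now needs 2500.
print(km_to_m(2.5))
```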

  • Stricter separation of text and string property types
    In previous versions those two types were almost identical. From 0.20.0 on, text is mapped in Elasticsearch as text, i.e. intended for full-text fields such as paragraphs, descriptions, etc. string is mapped in Elasticsearch as keyword and should be used for exact values such as emails, ids, codes, etc. As a consequence, aggregations (such as "what are the 5 most used values for foo") will only be possible on string props from 0.20.0 on, no longer on text props. See #930 for more details.
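
    The rule above can be summarized in a few lines. This is only a sketch of the mapping as described in these notes; the dictionary and function names are hypothetical and do not mirror Weaviate's internal code.

```python
# Mapping as described above: Weaviate property type -> Elasticsearch field type.
ES_FIELD_TYPE = {
    "text": "text",       # analyzed full-text field (paragraphs, descriptions)
    "string": "keyword",  # exact-value field (emails, ids, codes)
}


def supports_value_aggregation(weaviate_type: str) -> bool:
    """Return True if "most used values" style aggregations are possible."""
    return ES_FIELD_TYPE.get(weaviate_type) == "keyword"


print(supports_value_aggregation("string"))  # True
print(supports_value_aggregation("text"))    # False
```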

Usage Guidelines

  • Eventual Consistency
    Weaviate never offered immediate consistency before (due to both Elasticsearch and Cassandra being known for their eventual consistency). As most Weaviate use cases are in the analytics domain, where consistency is less critical, 0.20.0 embraces eventual consistency. Notably, two things are eventual:

    • Changes to the Elasticsearch index are refreshed continuously. By default this happens once per second.
    • The cross-reference denormalization cache (more about that below) is built asynchronously. In most cases caching is as fast as importing, so the caching period is not noticeable. However, if a data object (A) is updated and this object is referenced by another object (B), a traversal such as B -> A might retrieve outdated information while the cache is stale. In most scenarios, the cache is refreshed in less than a second. Note that if no cache is present, cross-refs are resolved at runtime.
  • Avoid Supernodes in your schema design
    A supernode is a node which has a very large number of outgoing references. There is no exact definition of when a node becomes a supernode, but keep in mind while designing your schema that supernode behavior might become problematic. By default Weaviate creates a cached copy of referenced objects. Assume the following schema: City ->inCountry-> Country ->hasInhabitants-> Person. In this schema, Country might have a lot of outgoing references (to Person). With a denormalization depth (see below) of two or higher, every city would store a cache that goes two levels deep (Level 1: Country, Level 2: Person). This essentially means that every city would cache a copy of all inhabitants of its linked Country. This leads to a lot of duplication, possibly using up a lot of space. If the same schema was designed as Person ->livesIn-> City ->inCountry-> Country, the country would have a lot of incoming references (which aren't problematic for caching), but no entity would have a very large number of outgoing references.

    Note that behavior around supernodes will probably be improved upon in the future. There are several measures we could take, such as no longer caching once an object has more than x references, or identifying that a referenced object is a supernode and deciding not to cache it.
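
    To get a feeling for the duplication, the following sketch counts cached object copies for both schema designs. It uses a deliberately simplified model (every object caches a copy of each object reachable within the denormalization depth) and made-up sizes of 1000 people and 10 cities; it does not reproduce Weaviate's actual cache layout.

```python
from typing import Dict, List


def cached_copies(refs: Dict[str, List[str]], start: str, depth: int) -> int:
    """Count object copies cached for `start` when following outgoing refs."""
    total = 0
    frontier = [start]
    for _ in range(depth):
        # Follow every outgoing reference of the current level.
        frontier = [target for node in frontier for target in refs.get(node, [])]
        total += len(frontier)
    return total


people = [f"person{i}" for i in range(1000)]
cities = [f"city{i}" for i in range(10)]

# Design A: City ->inCountry-> Country ->hasInhabitants-> Person (supernode).
design_a = {city: ["country"] for city in cities}
design_a["country"] = people

# Design B: Person ->livesIn-> City ->inCountry-> Country (no large fan-out).
design_b = {person: [cities[i % 10]] for i, person in enumerate(people)}
design_b.update({city: ["country"] for city in cities})

total_a = sum(cached_copies(design_a, obj, depth=2) for obj in design_a)
total_b = sum(cached_copies(design_b, obj, depth=2) for obj in design_b)
print(total_a, total_b)
```

    In design A every additional city adds another full copy of all inhabitants, whereas in design B the cache grows only linearly with the number of people.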

  • Caching and Denormalization
    To allow for efficient searches by cross-references and efficient traversal, cross-references are automatically cached. If a referenced object is changed, the cache of all incoming references is marked as stale, and recreated. This process happens constantly in the background.

    Since a graph isn't necessarily linear, one might easily end up with a circular structure, such as Person ->knows-> Person. To avoid a never ending caching scenario, a denormalization depth can be configured (vector_index.denormalizationDepth in config.yaml). If this depth is reached, caching will stop before it would cross the caching boundary.

    Imagine the following schema: Person -> City -> Country -> Continent -> Planet. This schema has a reference depth of four. (Note that every reference, represented by an arrow, counts as one level, not every class. That's why the above example has four levels and not five.) With denormalizationDepth: 2, every class object will cache two levels deep, so the following classes would have a cache structure like this:

    • Person -> City -> Country
    • City -> Country -> Continent
    • Country -> Continent -> Planet
    • Continent -> Planet
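
    The cache chains can also be derived programmatically. The sketch below assumes a purely linear schema in which references only point in the direction of the arrows in the example; the helper name is hypothetical.

```python
from typing import List

# Linear example schema from the text; each arrow is one reference level.
CHAIN = ["Person", "City", "Country", "Continent", "Planet"]


def cache_chain(chain: List[str], start: str, depth: int) -> List[str]:
    """Return the classes covered by the cache of `start`, including itself."""
    i = chain.index(start)
    # The slice is cut off automatically at the end of the chain.
    return chain[i : i + depth + 1]


for cls in CHAIN:
    print(" -> ".join(cache_chain(CHAIN, cls, depth=2)))
```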

    If a user now wants to retrieve all four levels in a single query, Weaviate will do the following:

    1. Retrieve Person, notice that cache goes two levels, up to Country
    2. Since Country->Continent crosses the cache boundary, Weaviate has to send a second request to the backend
    3. The result is a Continent which itself has a cache which is two levels deep, reaching all the way to the end of the query.

    Although the denormalization depth limit was set to only two, Weaviate was able to query four levels deep with only two database queries. The default denormalization limit is set to 3. This should work well in most use cases. You can adjust that value up if your schema is very deep but narrow, or adjust it down if your schema is shallow but very wide.
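
    The number of backend requests for a given traversal can be estimated with a simplified model: every request returns one object plus its cache, and the cache covers up to the configured depth of further levels. The function below is an illustration under that assumption, not Weaviate's actual resolver.

```python
def backend_queries(levels: int, depth: int) -> int:
    """Count backend requests needed to resolve `levels` reference levels."""
    if levels < 0 or depth < 0:
        raise ValueError("levels and depth must be non-negative")
    queries = 1       # the first request fetches the root object and its cache
    resolved = depth  # levels covered by the root's cache
    while resolved < levels:
        queries += 1           # crossing the cache boundary costs one request
        resolved += 1 + depth  # it yields the next object plus its cache
    return queries


print(backend_queries(levels=4, depth=2))  # 2, matching the example above
```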

Known Limitations

  • Filter by cross-ref only within cache boundary
    At the moment a filter-by-reference query, such as "Give me all the People living in a City located in a Country where the official language is English", will only work up to the configured denormalization depth. The above query follows the path People->City->Country->Language, which uses three levels and therefore works with the default settings. If an additional level is required, you need to set vector_index.denormalizationDepth to 4 or higher. Note that this limitation only applies to filtering by cross-ref, not to simply displaying resolved cross-refs. The issue to overcome this limitation is #967.
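
    A quick way to check whether a planned filter stays inside the cache boundary is to count the arrows in its path. The helpers and the extra "Family" level below are hypothetical and only serve as an illustration; the default of 3 is taken from the text above.

```python
def filter_levels(path: str) -> int:
    """Count reference levels (arrows) in a path like 'People->City'."""
    return path.count("->")


def filter_within_cache(path: str, denormalization_depth: int = 3) -> bool:
    """Return True if a cross-ref filter over `path` stays inside the cache."""
    return filter_levels(path) <= denormalization_depth


print(filter_within_cache("People->City->Country->Language"))          # True
print(filter_within_cache("People->City->Country->Language->Family"))  # False
```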

  • Delete Schema Property
    A schema property cannot be deleted at the moment. The workaround would be to delete the entire class and recreate it without the property. The issue to overcome this limitation is #973.

  • Aggregations don't allow for counting string props
    Aggregate queries currently cannot count the number of string props. Counting non-string properties or counting overall results works fine. This was scoped out of 0.20.0 as the cost for this was considered higher than its potential benefit. The issue to overcome this limitation is #974.

  • No batch-references imports possible yet

Changes since 0.20.0-rc0

New Features

  • #980 Force Index refresh if referenced thing/action is not found
    When adding an object and immediately adding a second object with a cross-reference to the first, you previously had to wait for the first object to be present on the index. This took up to 1 second. Requests sent before the refresh period had finished would fail, making client-side retry logic necessary. With this feature, Weaviate forces an index refresh if a referenced object could not be found and then immediately retries. If the object is still not found, an error is returned. If the object was found the first time around, no (expensive) refresh is triggered.
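
    The refresh-and-retry behavior can be sketched with an in-memory stand-in for the index. All class and function names are hypothetical; this only models the control flow described above, not the actual implementation.

```python
class FakeIndex:
    """In-memory stand-in for a search index with delayed visibility."""

    def __init__(self) -> None:
        self._visible = set()
        self._pending = set()
        self.refresh_count = 0

    def add(self, object_id: str) -> None:
        self._pending.add(object_id)  # written, but not searchable yet

    def refresh(self) -> None:
        self.refresh_count += 1
        self._visible |= self._pending
        self._pending.clear()

    def exists(self, object_id: str) -> bool:
        return object_id in self._visible


def add_with_reference(index: FakeIndex, object_id: str, ref_id: str) -> None:
    """Add `object_id` referencing `ref_id`, refreshing once if needed."""
    if not index.exists(ref_id):
        index.refresh()  # expensive, so only done after a failed lookup
        if not index.exists(ref_id):
            raise LookupError(f"referenced object {ref_id!r} not found")
    index.add(object_id)


index = FakeIndex()
index.add("a")
add_with_reference(index, "b", ref_id="a")  # succeeds after one forced refresh
print(index.refresh_count)
```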

Fixes

  • #983 Filtering for UUID is not returning the object even though it exists
  • #978 Bug: Flakyness in Aggregate in 0.20.0-rc0
  • #981 Empty data type results in panic
  • #986 Address Flakyness in 0.20.x integration tests
