github weaviate/weaviate 0.21.10
0.21.10 - More control over what is vectorized


Docker image/tag: semitechnologies/weaviate:0.21.10
See also: example Docker Compose files in English and Dutch.

Breaking Changes

New Features

  • More user control over what is used in vectorization (#1062)

    Motivation

    Imagine a Fruit class with a property name of type string. Let's add an object with name: Pineapple. Prior to this release, the vector representing this object in the vector space was always formed from the class name, the property names and the property values. Thus, the vector representation of the above object would be formed from the text corpus fruit name pineapple. This helps in search, as a lot of context is added, but in some cases much of that information is redundant. Let's add another fruit with name: tomato. If both objects were part of a classification (or a deduplication process with the upcoming "Entity Merging" feature), their vector positions would be quite similar, as two out of the three words used to form the vector are identical: fruit name pineapple vs. fruit name tomato. Additionally, one can argue that the property name name does not add any semantic value here. We have thus decided to give the user more control over vectorization.

    How to use

    Two new fields were added at the schema/{things,actions} level:

    class: Fruit
    vectorizeClassName: true             # <-- newly introduced in this release
    properties:
    - name: name
      dataType: ["string"]
      vectorizePropertyName: false      # <-- newly introduced in this release

    Default values

    If not explicitly set, vectorizeClassName defaults to true, whereas vectorizePropertyName defaults to false. To understand the motivation behind these defaults, see the next section.
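
To make the effect of the two flags and their defaults concrete, here is a minimal Python sketch of how the text corpus for an object could be assembled. It is an illustration under stated assumptions, not Weaviate's actual implementation: build_corpus is a hypothetical helper, and the lowercasing, the whitespace joining, and the skipping of non-string values are simplifications.

```python
def build_corpus(class_def, obj_props):
    """Assemble the words that would feed the vector (simplified sketch).

    Hypothetical helper, not part of Weaviate. The defaults mirror the
    release notes: vectorizeClassName -> True, vectorizePropertyName -> False.
    """
    words = []
    if class_def.get("vectorizeClassName", True):
        words.append(class_def["class"].lower())
    for prop in class_def.get("properties", []):
        name = prop["name"]
        if name not in obj_props:
            continue
        if prop.get("vectorizePropertyName", False):
            words.append(name.lower())
        value = obj_props[name]
        # Assumption: only string/text values contribute words.
        if isinstance(value, str):
            words.append(value.lower())
    return " ".join(words)


# Behaviour before this release: class name and property names always included.
fruit_old_behaviour = {
    "class": "Fruit",
    "vectorizeClassName": True,
    "properties": [
        {"name": "name", "dataType": ["string"], "vectorizePropertyName": True}
    ],
}

# Neither flag set, so the new defaults apply.
fruit_defaults = {
    "class": "Fruit",
    "properties": [{"name": "name", "dataType": ["string"]}],
}

print(build_corpus(fruit_old_behaviour, {"name": "Pineapple"}))  # fruit name pineapple
print(build_corpus(fruit_defaults, {"name": "Pineapple"}))       # fruit pineapple
```

With the defaults, the redundant word name drops out of the corpus, while the class name still provides context.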

    New/Updated Validation Requirements

    • If you choose to vectorize a class name (default: true), the class name must be contextionary-valid. (Prior to this change the class name was always vectorized, so it was already required to be contextionary-valid. Therefore this is not a breaking change.)
    • If you choose not to vectorize a class name, it does not have to be contextionary-valid.
    • If you choose to vectorize a property name (default: false), the property name must be contextionary-valid. (Prior to this change the property name was always vectorized, so it was already required to be contextionary-valid. Therefore this is not a breaking change.)
    • If you choose not to vectorize a property name, it does not have to be contextionary-valid.
    • If you choose to vectorize neither the class name nor any property names, there is a chance that some imports will fail: Weaviate needs to be able to extract at least one contextionary-valid word from every object, so that it can build a vector position for this object. If you have a class which only has numerical props, or you only use non-contextionary-valid values in your string/text props, Weaviate must rely on the class/property names to extract at least one word. However, if you additionally choose to disable vectorization for both class and property names, Weaviate might end up not being able to extract any words. In this case importing will fail.
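
The schema-level rules above can be sketched as a small Python check. This is a hedged illustration, not Weaviate's validation code: validate_class and the tiny CONTEXTIONARY set are hypothetical stand-ins, and the property-name error text is invented for the example (only the class-name message appears in these notes).

```python
# Hypothetical stand-in for the contextionary; the real vocabulary is far larger.
CONTEXTIONARY = {"fruit", "name", "apple", "pineapple", "tomato"}


def validate_class(class_def, vocabulary=CONTEXTIONARY):
    """Return a list of schema-level validation errors (empty if valid)."""
    errors = []
    # A class name only has to be contextionary-valid if it is vectorized.
    if class_def.get("vectorizeClassName", True):
        if class_def["class"].lower() not in vocabulary:
            errors.append("class name is not contained in contextionary")
    # The same applies to each property name, which is not vectorized by default.
    for prop in class_def.get("properties", []):
        if prop.get("vectorizePropertyName", False):
            name = prop["name"]
            if name.lower() not in vocabulary:
                # Invented message, for illustration only.
                errors.append(f"property name {name!r} is not contained in contextionary")
    return errors


invalid = {"class": "aqfaowiefj", "properties": [{"name": "name"}]}
fixed = {"class": "aqfaowiefj", "vectorizeClassName": False,
         "properties": [{"name": "name"}]}

print(validate_class(invalid))  # one error about the class name
print(validate_class(fixed))    # no errors
```

Note that this check covers only the schema. A class that passes it can still reject individual objects at import time if no contextionary-valid word can be extracted from them.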

    Some example error cases and how to fix them

    Example 1

    class:

    class: aqfaowiefj  # random word, not contextionary-valid
    properties:
       - name: name
         dataType: ["string"]

    Error: class name is not contained in contextionary
    Solutions:

    • Either use a contextionary-valid class name
    • Or set vectorizeClassName: false

    Example 2

    class:

    class: Fruit # contextionary-valid word
    vectorizeClassName: false # explicitly disabled
    properties:
       - name: name
         dataType: ["string"]

    Note: The class above is valid, but whether importing fails or succeeds depends on the object:

    object:

    class: Fruit
    schema:
      name: aoiapwueog # random, non-contextionary-valid word

    Error: no-contextionary-valid words extracted from object corpus
    Solutions:

    • Since you are already using a contextionary-valid class name, you could set vectorizeClassName: true
    • Alternatively, make sure that the name field contains at least one contextionary-valid word, e.g. name: aoiapwueog apple
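
The import-time behaviour in Example 2 can be reproduced with a short Python sketch. As before, this is an assumption-laden illustration rather than Weaviate's code: extract_valid_words, can_import and the small CONTEXTIONARY set are hypothetical, and splitting values on whitespace is a simplification.

```python
# Hypothetical stand-in for the contextionary.
CONTEXTIONARY = {"fruit", "name", "apple"}


def extract_valid_words(class_def, obj_props, vocabulary=CONTEXTIONARY):
    """Collect the contextionary-valid words available for one object."""
    candidates = []
    if class_def.get("vectorizeClassName", True):
        candidates.append(class_def["class"])
    for prop in class_def.get("properties", []):
        if prop.get("vectorizePropertyName", False):
            candidates.append(prop["name"])
        value = obj_props.get(prop["name"])
        # Assumption: only string/text values contribute words.
        if isinstance(value, str):
            candidates.extend(value.split())
    return [w.lower() for w in candidates if w.lower() in vocabulary]


def can_import(class_def, obj_props):
    """An object is importable only if at least one valid word remains."""
    return len(extract_valid_words(class_def, obj_props)) > 0


fruit = {"class": "Fruit", "vectorizeClassName": False,
         "properties": [{"name": "name", "dataType": ["string"]}]}

print(can_import(fruit, {"name": "aoiapwueog"}))        # False: nothing to vectorize
print(can_import(fruit, {"name": "aoiapwueog apple"}))  # True: "apple" is valid

# First solution: re-enable class name vectorization.
fruit_with_class_name = dict(fruit, vectorizeClassName=True)
print(can_import(fruit_with_class_name, {"name": "aoiapwueog"}))  # True
```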

Dependencies

  • Make sure that you are using at least contextionary version ...v0.4.5 to benefit from more precise error messages when vectorization fails due to incorrect input.
    The docker-compose files linked at the top of the release notes have already been adapted to point to the correct version.
