Getting started with Lexy

First steps

Let's get started by instantiating the Lexy client. By default, this will connect to the Lexy server running at http://localhost:9900. You can visit the OpenAPI documentation for the server at http://localhost:9900/docs.

from lexy_py import LexyClient

lx = LexyClient()

We can get more information about the Lexy server by calling the info method. Here we see existing Collections, Indexes, Transformers, and Bindings.

lx.info()

Lexy server <http://localhost:9900/api>

2 Collections
    - <Collection('default', description='Default collection')>
    - <Collection('code', description='Github code repos')>
1 Indexes
    - <Index('default_text_embeddings', description='Text embeddings for default collection')>
6 Transformers
    - <Transformer('image.embeddings.clip', description='Embed images using 'openai/clip-vit-base-patch32'.')>,
    - <Transformer('text.embeddings.clip', description='Embed text using 'openai/clip-vit-base-patch32'.')>,
    - <Transformer('text.embeddings.minilm', description='Text embeddings using "sentence-transformers/all-MiniLM-L6-v2"')>,
    - <...additional transformers...>
1 Bindings
    - <Binding(id=1, status=ON, collection='default', transformer='text.embeddings.minilm', index='default_text_embeddings')>

Let's add some documents to our "default" collection. We can add them using the LexyClient.add_documents method.

lx.add_documents([
    {"content": "This is my first document! It's great!"},
    {"content": "Starlink is a satellite internet constellation operated by American aerospace company SpaceX, providing coverage to over 60 countries."},
    {"content": "A latent space is an embedding of a set of items within a manifold in which items resembling each other are positioned closer to one another."}
])

[<Document("This is my first document! It's great!")>,
 <Document("Starlink is a satellite internet constellation operated by American aerospace company SpaceX,...")>,
 <Document("A latent space is an embedding of a set of items within a manifold in which items resembling each...")>]

Documents that are added to the "default" collection are automatically embedded, and the embeddings are stored in the index "default_text_embeddings".

flowchart LR
    collection["Collection

    &quot;default&quot;"]
    -->
    transformer["Transformer

    &quot;text.embeddings.minilm&quot;"]
    -->
    index["Index

    &quot;default_text_embeddings&quot;"];

We can query the default index for "what is deep learning" and see our documents ranked by cosine similarity.

lx.query_index('what is deep learning')

[{'document_id': '28dee402-31aa-4622-89e9-ce9ac06a4cf8',
  'custom_id': None,
  'meta': {},
  'index_record_id': '56c2e116-13af-4db6-ac27-68cb44f3077c',
  'content': 'A latent space is an embedding of a set of items within a manifold in which items resembling each other are positioned closer to one another.',
  'distance': 1.2856687307357788},
 {'document_id': '01e1f7ff-a890-4f6a-aeac-a0f99bdf9e77',
  'custom_id': None,
  'meta': {},
  'index_record_id': '5544c784-5e7c-4b34-bd73-dc4cbd139af6',
  'content': "This is my first document! It's great!",
  'distance': 1.3882379531860352},
 {'document_id': '449cc363-dc2b-45cf-b47e-6465af505401',
  'custom_id': None,
  'meta': {},
  'index_record_id': '9b57f1e0-de4c-4b2b-95b9-0d77af31fe2c',
  'content': 'Starlink is a satellite internet constellation operated by American aerospace company SpaceX, providing coverage to over 60 countries.',
  'distance': 1.3894593715667725,}]

Example: Famous biographies

Let's go through a longer example to see how Collections, Documents, Indexes, Bindings, and Transformers interact with one another. We'll use Lexy to create and query embeddings for a new collection of documents.

Collections

We can see that there are currently two collections, "default" and "code".

lx.collections

[<Collection('default', description='Default collection')>,
 <Collection('code', description='Github code repos')>]

Let's create a new "bios" collection for famous biographies.

bios = lx.create_collection('bios', description='Famous biographies')
bios

<Collection('bios', description='Famous biographies')>

Documents

We can use the Collection.list_documents method to see that our new collection is empty.

bios.list_documents()

[]

Let's add a few documents to our collection.

bios.add_documents([
    {"content": "Stephen Curry is an American professional basketball player for the Golden State Warriors."},
    {"content": "Dwayne 'The Rock' Johnson is a well-known actor, former professional wrestler, and businessman."},
    {"content": "Taylor Swift is a singer known for her songwriting, musical versatility, and artistic reinventions."}
])

[<Document("Stephen Curry is an American professional basketball player for the Golden State Warriors.")>,
 <Document("Dwayne 'The Rock' Johnson is a well-known actor, former professional wrestler, and businessman.")>,
 <Document("Taylor Swift is a singer known for her songwriting, musical versatility, and artistic reinventions.")>]

Transformers

Now we want to create embeddings for the documents in our new collection. We'll use a Transformer to generate embeddings for our documents. We can use the LexyClient.transformers property to see a list of available transformers.

# list of available transformers
lx.transformers

[<Transformer('image.embeddings.clip', description='Embed images using 'openai/clip-vit-base-patch32'.')>,
 <Transformer('text.embeddings.clip', description='Embed text using 'openai/clip-vit-base-patch32'.')>,
 <Transformer('text.embeddings.minilm', description='Text embeddings using "sentence-transformers/all-MiniLM-L6-v2"')>,
 <Transformer('text.embeddings.openai-3-large', description='Text embeddings using OpenAI's "text-embedding-3-large" model')>,
 <Transformer('text.embeddings.openai-3-small', description='Text embeddings using OpenAI's "text-embedding-3-small" model')>,
 <Transformer('text.embeddings.openai-ada-002', description='OpenAI text embeddings using model text-embedding-ada-002')>]

For our example, we'll use the "text.embeddings.minilm" transformer, which uses the MiniLM sentence transformer model to generate embeddings for text.

Indexes

Before we can bind this transformer to our collection, we need to create an Index for storing the resulting embeddings.

Let's create a new index called "bios_index" with embeddings for our new collection. Our index will have a single field called bio_embedding that will store the embeddings output from the MiniLM sentence transformer.

# define index fields
index_fields = {
    "bio_embedding": {
        "type": "embedding", "extras": {"dims": 384, "model": "text.embeddings.minilm"}
    }
}

# create index
index = lx.create_index(
    index_id='bios_index',
    description='Biography embeddings',
    index_fields=index_fields
)
index

<Index('bios_index', description='Biography embeddings')>

Bindings

Now let's create a Binding. Our binding will: 1) Feed the documents in our "bios" collection into the "text.embeddings.minilm" transformer, then 2) Insert the resulting output in our newly created index, "bios_index".

flowchart LR
    collection["Collection

    &quot;bios&quot;"]
    -->
    transformer["Transformer

    &quot;text.embeddings.minilm&quot;"]
    -->
    index["Index

    &quot;bios_index&quot;"];

binding = lx.create_binding(
    collection_name='bios',
    transformer_id='text.embeddings.minilm',
    index_id='bios_index'
)
binding

<Binding(id=2, status=ON, collection='bios', transformer='text.embeddings.minilm', index='bios_index')>

We can now query our index for "famous artists" and see the results ranked by cosine similarity.

index.query(query_text='famous artists', query_field='bio_embedding', k=3)

[{'document_id': 'c1090c0f-5fb4-4e8b-af47-d3fecb1d3e11',
  'custom_id': None,
  'meta': {},
  'index_record_id': '17ea9419-790f-4143-94c9-1c43a25a13e2',
  'content': 'Taylor Swift is a singer known for her songwriting, musical versatility, and artistic reinventions.',
  'distance': 1.1270135641098022},
 {'document_id': 'b0a5e701-749d-4361-a941-b8d6ffcbfe92',
  'custom_id': None,
  'meta': {},
  'index_record_id': 'dc840614-43fc-48bc-b4a9-fedca1f5950d',
  'content': "Dwayne 'The Rock' Johnson is a well-known actor, former professional wrestler, and businessman.",
  'distance': 1.2409147024154663},
 {'document_id': '68b0bb68-fee4-4631-a5db-874a7af32b46',
  'custom_id': None,
  'meta': {},
  'index_record_id': '5b44fa50-0b08-4797-84ca-af47c2f05dff',
  'content': 'Stephen Curry is an American professional basketball player for the Golden State Warriors.',
  'distance': 1.3264542818069458}]

Because our binding has status set to "ON", any new documents added to our collection will automatically be processed by our transformer and inserted into our index as embeddings.

Let's add another document.

bios.add_documents([
    {"content": "Beyoncé is a singer and songwriter recognized for her boundary-pushing artistry, vocals, and performances."}
])

[<Document("Beyoncé is a singer and songwriter recognized for her boundary-pushing artistry, vocals, and...")>]

Now let's run the same query again for "famous artists". We can see the results have been updated and include our new document.

index.query(query_text='famous artists', query_field='bio_embedding', k=3)

[{'document_id': 'c1090c0f-5fb4-4e8b-af47-d3fecb1d3e11',
  'custom_id': None,
  'meta': {},
  'index_record_id': '17ea9419-790f-4143-94c9-1c43a25a13e2',
  'content': 'Taylor Swift is a singer known for her songwriting, musical versatility, and artistic reinventions.',
  'distance': 1.1270135641098022},
 {'document_id': 'c7f51ec0-9f88-4fd7-a176-846cc6db1736',
  'custom_id': None,
  'meta': {},
  'index_record_id': 'd55753d8-33cb-4a11-9a6f-c3c4a5fe2751',
  'content': 'Beyoncé is a singer and songwriter recognized for her boundary-pushing artistry, vocals, and performances.',
  'distance': 1.1645721197128296},
 {'document_id': 'b0a5e701-749d-4361-a941-b8d6ffcbfe92',
  'custom_id': None,
  'meta': {},
  'index_record_id': 'dc840614-43fc-48bc-b4a9-fedca1f5950d',
  'content': "Dwayne 'The Rock' Johnson is a well-known actor, former professional wrestler, and businessman.",
  'distance': 1.2409147024154663}]

Next steps

Check out the following tutorials to learn more about Lexy.

Retrieval Augmented Generation (RAG) for a basic implementation of RAG using Lexy.
Multimodal image search for how to use Lexy to build a multimodal search engine.
Custom transformers for using your own functions as transformers.
Document filters shows how to build pipelines for different types of documents.
Structured data extraction for how to extract structured data from documents.