As your organization explores the potential of large language models in the enterprise, you’re bound to encounter buzzwords like ‘embeddings’ and ‘vector databases’. It can be a challenge to sift through the jargon and understand what these technologies mean for your business. So, let’s break it down in plain English.
Computers aren’t naturally adept at understanding human language — the words and sentences we use every day. Instead, they’re much more comfortable working with numbers. This is where ‘embeddings’ come into play. They enable us to convert sizable text chunks, such as emails, PDF documents, webpages, or customer chats, into a series of numbers. This way, we can quickly analyze these texts with computers, based on their underlying meaning.
Let’s illustrate this with a simple example. Suppose we have two sentences: “Mary had a little lamb” and “The cow jumped over the moon”. We can break these sentences into four parts:
1. “Mary had”,
2. “a little lamb”,
3. “the cow”, and
4. “jumped over the moon”.
Based on this limited information, “a little lamb” and “the cow” would appear to be the most closely related because they’re both farm animals. If we convert each chunk into a set of numbers (an ‘embedding’), we could plot them on a graph. The chunks with similar meaning would be close together on the graph.
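To make "close together on the graph" concrete, here is a minimal sketch. The two-number embeddings are hand-picked for illustration and do not come from a real embedding model, which would typically produce hundreds or thousands of numbers per chunk. The helper names `cosine_similarity` and `closest_pair` are likewise our own. Cosine similarity is one common way to measure how closely two embeddings point in the same direction.

```python
import math
from itertools import combinations

# Hypothetical 2-D embeddings, hand-picked for illustration only.
# A real embedding model would choose these numbers for us.
embeddings = {
    "Mary had": (0.9, 0.1),
    "a little lamb": (0.2, 0.9),
    "the cow": (0.3, 0.8),
    "jumped over the moon": (-0.7, 0.2),
}


def cosine_similarity(a, b):
    """Return a score near 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_pair(vectors):
    """Return the two chunk names whose embeddings are most similar."""
    return max(
        combinations(vectors, 2),
        key=lambda pair: cosine_similarity(vectors[pair[0]], vectors[pair[1]]),
    )


print(closest_pair(embeddings))
```

With these made-up numbers, the pair that comes out on top is "a little lamb" and "the cow", matching the intuition above.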
Let’s change the context and see what happens. Suppose these sentences are part of a story where Mary is an intelligent cow. Suddenly, “Mary had” and “the cow” become more closely related due to the new context. This illustrates how the embedding of a chunk of text can change depending on the context that surrounds it.
Now, imagine applying this technique to thousands of pages of text, where you break the text into paragraphs. Suddenly, you have a tool for finding paragraphs that share similar meanings or finding other texts that closely align with specific paragraphs.
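The search described above can be sketched in a few lines. One important assumption: `toy_embed` is a stand-in we invented that simply counts words, so it only detects shared wording, not shared meaning. In practice you would swap in a real embedding model at that step. The sample paragraphs and the `most_similar` helper are also illustrative.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Compare two word-count vectors; 0.0 means no words in common."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, paragraphs, top_k=2):
    """Return the top_k (score, paragraph) pairs, best match first."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(p)), p) for p in paragraphs]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


paragraphs = [
    "Mary had a little lamb whose fleece was white as snow.",
    "The cow jumped over the moon.",
    "Quarterly revenue grew in the last fiscal year.",
]

results = most_similar("a little lamb with white fleece", paragraphs, top_k=1)
print(results[0][1])
```

The structure stays the same at scale: embed every paragraph once, embed the query, and rank by similarity. Only the embedding step and the storage (often a vector database) change.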
Consider this real-world use case from some of our client work: Suppose you’re a bank with a vast collection of compliance documents. The SEC introduces a new rule, and you want to identify the sections of your compliance documents that align closely with this rule. By using embeddings, you can convert the language of the SEC’s rule and your compliance documents into sets of numbers. Then, you can find the closest matches and use a large language model to analyze them and provide recommendations on how to update your compliance documentation accordingly.
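A rough sketch of that workflow follows. Everything specific in it is hypothetical: the section titles, the three-number vectors (which stand in for embeddings computed ahead of time and stored, for example, in a vector database), and the helper names `rank_sections` and `build_prompt`. The sketch stops at assembling the prompt. Sending it to a large language model depends on the provider you choose.

```python
import math

# Hypothetical section titles with hand-picked 3-D vectors, standing in for
# embeddings a real model would have produced for each section in advance.
section_vectors = {
    "Section 4.2: Customer data retention": (0.1, 0.9, 0.2),
    "Section 7.1: Trade reporting deadlines": (0.9, 0.1, 0.3),
    "Section 9.3: Employee travel expenses": (0.0, 0.2, 0.9),
}

# Hypothetical embedding of the new rule's text.
rule_vector = (0.8, 0.2, 0.2)


def cosine_similarity(a, b):
    """Return a score near 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_sections(rule_vec, sections):
    """Return section titles ordered from most to least similar to the rule."""
    return sorted(
        sections,
        key=lambda name: cosine_similarity(rule_vec, sections[name]),
        reverse=True,
    )


def build_prompt(rule_text, section_names):
    """Assemble the text you would hand to a large language model."""
    listed = "\n".join(f"- {name}" for name in section_names)
    return (
        f"New rule:\n{rule_text}\n\n"
        f"Most closely related sections:\n{listed}\n\n"
        "Recommend updates to these sections."
    )


top_sections = rank_sections(rule_vector, section_vectors)[:2]
prompt = build_prompt("(text of the new rule goes here)", top_sections)
print(prompt)
```

Narrowing thousands of pages down to a handful of sections first keeps the language model focused on the text most likely to need changes.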
So, that’s a brief introduction to embeddings. It’s a powerful technique that can play a crucial role in your generative AI and large language model strategies.
ABOUT PROLEGO
Prolego is an elite consulting team of AI engineers, strategists, and creative professionals guiding the world’s largest companies through the AI transformation. Founded in 2017 by technology veterans Kevin Dewalt and Russ Rands, Prolego has helped dozens of Fortune 1000 companies develop AI strategies, transform their workforce, and build state-of-the-art AI solutions.