At a loss for <literal tokens> — when are vectors enough?
Search technology is a challenging domain. With the current market buzz and growing excitement, many organizations are exploring the development of AI-driven systems. One notable advancement is the rise of embedding models out of the generative AI wave. These models make it easier than ever to build semantic search experiences. However, it’s a common misconception that vector search is a panacea for all search needs. In this article, I aim to provide straightforward, condensed guidance on when vectors are sufficient and when they’re not.
With all the buzz surrounding embeddings, and considering the affordability and accessibility of embedding models, it’s tempting to view vector search as the go-to solution for text search in applications. You might find yourself eagerly diving into the rabbit hole of content preprocessing, generating embeddings, and experimenting with numerous tweaks during search time. However, ask any seasoned expert and they’ll emphasize that traditional BM25 (aka ‘keyword search’) isn’t exiting the stage anytime soon. But why is this the case?
Before we dive deeper, here’s a hot take to kick things off: a completely oversimplified decision tree for those wondering where to start. Let’s see if this sparks some thoughts:
Let me explain…
When do you need keyword search?
Let’s start with a few example queries, then think about how BM25 (statistical keyword search) would perform vs. vector search using high-quality embeddings. Which one would perform better, and why?
- Quentin Tarantino Uma Thurman movie
- hyundai ioniq 5 dimensions
- Stephen King novels
- Movies like Inception with complex plots
- Ideas for family vacations in nature
- Improving mental health without medication
- Recipes for healthy and fast breakfasts
- Managing work-life balance as a software engineer
- Historical landmarks to visit in Paris
- Sci-fi books similar to Dune
Consider the semantic content of these queries. Which ones do you think are more suited to the precision of keyword search, and which ones would benefit from the broader context understanding of vector search?
When choosing technologies, rather than focusing on the pros and cons, start by eliminating options based on exclusion criteria. There are some nuanced technical ones, but the very straightforward and obvious criterion that requires keyword search is the ability to match — as the name implies — keywords.
There’s a fair number of use cases where filtering based on a specific term or phrase in the text is crucial. When a researcher at a pharma company looks for “PI3K cancer studies”, they don’t care about several million documents vaguely discussing <protein in the context of cancer and study>; they care about documents specifically mentioning Phosphatidylinositol 3-Kinase. Being constrained to vague similar-to-something searches, rather than filtering on exact terms, can degrade precision to the point where search becomes unusable for end users.
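To make the distinction concrete, here is a minimal sketch of literal term filtering, the guarantee that keyword search provides. The documents and the `contains_term` helper are invented for illustration:

```python
# Sketch: literal term filtering, the capability keyword search guarantees.
# The documents below are made up for illustration.
docs = [
    "PI3K inhibitors in breast cancer: a phase II study",
    "Kinase signalling pathways and tumour growth",
    "A survey of proteins implicated in cancer studies",
]

def contains_term(doc: str, term: str) -> bool:
    # Case-insensitive whole-token match, as an inverted index would provide.
    return term.lower() in doc.lower().split()

hits = [d for d in docs if contains_term(d, "PI3K")]
print(hits)  # only the document literally mentioning PI3K
```

A vector search over the same corpus would happily return all three documents with varying similarity scores; only the literal filter draws a hard line.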
Precision and recall
Understanding precision and recall isn’t just a technical exercise; it’s central to the user experience and, dare I say, the business model of your application. These metrics are vital in determining how effectively your search function meets user needs.
- Recall is about the completeness of search results, indicating the percentage of all relevant items in the database that are successfully retrieved by a query.
- Precision, on the other hand, measures the accuracy of the search results, showing the proportion of retrieved items that are truly relevant to the query.
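As a toy illustration of the two metrics (all document IDs are invented):

```python
# Sketch: precision and recall for a single query, using toy document IDs.
relevant = {1, 2, 3, 4}   # all relevant docs in the corpus
retrieved = {2, 3, 5}     # what the search actually returned

true_positives = relevant & retrieved
precision = len(true_positives) / len(retrieved)  # 2/3: how clean the results are
recall = len(true_positives) / len(relevant)      # 2/4: how complete they are
```

Note the tension: returning more documents can only raise recall, but it usually drags precision down, and vice versa.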
Consider this: if users can’t find a product in your online store, they’re likely to purchase it elsewhere. If they’re frustrated with your search experience, they might switch to a competitor’s app. In essence, the effectiveness of your search can directly influence customer retention and sales.
Keyword searches boost precision by ensuring every search hit contains specific information relevant to the user’s query. However, they often do so at the expense of recall. A search too focused on specific terminology can miss relevant results due to paraphrasing and synonyms, leading to user frustration and missed opportunities.
Search technologies have evolved to address these challenges. From simple hacks like synonym usage to advanced NLP and machine learning techniques for query expansion, search systems are constantly improving to balance precision and recall.
Moreover, the use of inverted index structures not only enables quick location of keyword-specific documents but also facilitates efficient intersection with structured data. By integrating techniques like entity extraction and query intent classification, search systems can offer rich and nuanced experiences, meeting both the technical and commercial needs of businesses.
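A minimal sketch of the idea, assuming a toy corpus and a hypothetical `category` field standing in for the structured data:

```python
from collections import defaultdict

# Sketch: a toy inverted index, intersected with a structured filter.
# Documents and the "category" field are invented for illustration.
docs = {
    1: {"text": "hyundai ioniq 5 dimensions", "category": "cars"},
    2: {"text": "hyundai warranty terms", "category": "cars"},
    3: {"text": "ioniq charging guide", "category": "energy"},
}

index = defaultdict(set)
for doc_id, doc in docs.items():
    for token in doc["text"].split():
        index[token].add(doc_id)

# Keyword intersection: docs containing both query terms...
candidates = index["hyundai"] & index["ioniq"]
# ...then intersected with a structured filter on category.
results = {d for d in candidates if docs[d]["category"] == "cars"}
print(results)  # {1}
```

Because the index maps each term to a set of document IDs, both the term intersection and the structured filter are cheap set operations.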
In summary, a deep understanding of precision and recall is not optional but a necessity, especially if search is a critical component of your user experience and business model. We’ll get back to this one after talking about vectors and embeddings…
Vectors to the rescue?
When users express dissatisfaction with search engines, particularly those embedded in applications, their complaints usually revolve around a few common themes, such as:
- “My search returns no results.”
- “I know the document exists, why can’t I find it?”
- “The search engine doesn’t understand what I mean.”
Surely you can remember the last time you tried to locate a piece of information you were vaguely familiar with, but struggled to find the right keywords to surface it. Like trying to locate a ticket in Jira and cycling through dozens of customer names or error codes until you finally get it right. If only there were a way to vaguely describe something instead…
Language is inherently arbitrary; there are countless ways to express the same idea. The concept of using vectors for proximity search on a continuous spectrum, as opposed to relying on keyword intersections and boolean filtering logic, has been around since the 70s. However, the challenge was that using literal terms (or approximations like lemmas or synonyms) didn’t significantly improve recall. It was still a game of precision over recall.
Word embeddings have been in play since the 2000s, but it was Google’s BERT and ultimately models like GPT that elevated them to a level where a vector could capture what feels like the semantic meaning. The catch with embeddings is that they’re tricky to perfect. Since vector search operates on a “nearest neighbor” approach, returning records based on similarity rather than specific filtering criteria, recall can be very high, independent of literal keywords.
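A minimal sketch of that nearest-neighbour scoring using cosine similarity (the vectors below are made up; a real system would use model-generated embeddings):

```python
import numpy as np

# Sketch: nearest-neighbour retrieval by cosine similarity.
# Vectors are invented for illustration, not real embeddings.
doc_vectors = np.array([
    [0.9, 0.1, 0.0],   # doc A
    [0.8, 0.2, 0.1],   # doc B, close to A
    [0.0, 0.1, 0.9],   # doc C, far from both
])
query = np.array([0.85, 0.15, 0.05])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query, v) for v in doc_vectors]
ranked = np.argsort(scores)[::-1]
print(ranked)  # docs A and B rank above C
```

Notice that every document gets a similarity score, even the unrelated one; unlike a boolean keyword filter, "recall" here is just a question of where you cut the ranked list off.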
However, the aim of text search isn’t to achieve 100% recall. It’s about finding the right balance: high enough recall to fetch your desired content and high precision to rank this content much higher than anything else. The current generation of large language models is finally excelling at capturing the semantic meaning of longer text passages, enabling the retrieval of semantically similar neighbors, even if they’re worded differently.
Let’s explore a practical example using embeddings to measure story similarity. We’ll use a handful of short stories and observe how they fare in terms of semantic similarity. Here’s a brief overview of what the following code does:
- We input five different stories into the model.
- The model computes embeddings for each story, capturing their semantic essence.
- We then use PCA (Principal Component Analysis) to reduce the dimensions of these embeddings for visualization, since we want to view the output on our 2d screens...
- Finally, we plot these reduced embeddings to visually inspect their similarity.
# Requires: pip install transformers matplotlib scikit-learn
from transformers import AutoModel
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
stories = ["""In a bustling city, a young boy named Alex
eagerly headed to his first day at kindergarten.
With a bright red backpack and a wide grin,
he waved goodbye to his parents. During the day,
he made a new friend, Sam, and they played with blocks together.
As the sun set, Alex returned home, excitedly recounting his adventures.""",
"""In a lively town, a small lad was excited for
his initial day at daycare. Carrying a vivid scarlet rucksack and beaming,
he bid farewell to his guardians. Throughout the day, he befriended a new mate,
Samuel, engaging in games with building bricks. With the evening twilight,
Alexander came back, eagerly sharing tales of his exploits.""",
"""In a sprawling city, a young artist, eager with inspiration,
headed to her studio. Clad in a bright red backpack filled with paints
and brushes, she wore a wide grin, anticipating the day's creations.
She met a new friend, an old sculptor, in the shared space,
who taught her to carve blocks of marble. As the sun set, casting a
warm glow over her finished masterpiece, she headed home, excited to
display her new art.""",
"""In a vast metropolis, a budding muralist, filled with enthusiasm,
journeyed to her atelier. Donning a vibrant crimson knapsack loaded with
canvases and colors, her expression beamed with anticipation for the day's
artistic endeavor. She encountered a novel acquaintance, a seasoned stone carver,
within their communal artistic haven, who instructed her in the art of shaping
slabs of stone. As dusk descended, bathing her new creation in a gentle amber
light, she ventured back to her abode, thrilled to exhibit her latest work.""",
"""Underneath a starlit sky, an elderly fisherman set sail on the tranquil ocean.
His vessel, an old but sturdy craft, glided softly over the gentle waves. Tonight,
he was in search of the elusive moonfish, a creature of legend in these waters.
Guided by the lighthouse's distant beam, he cast his net, hoping for a catch that
would become a tale for generations."""]
# Calculate embeddings
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
embeddings = model.encode(stories)
# Reduce dimensions
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings)
# Plot
plt.figure(figsize=(10, 8))
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1])
# Annotate points
for i, story in enumerate(stories):
    plt.annotate(f"Story {i+1}", (reduced_embeddings[i, 0], reduced_embeddings[i, 1]))
plt.title('2D Visualization of Story Embeddings')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.grid(True)
plt.show()
The output illustrates that stories 1 and 2, as well as 3 and 4, are closely related semantically, despite being written with different vocabulary (the paraphrased versions were generated with ChatGPT). Interestingly, while stories 1 and 3 share many keywords, their semantic similarity is not as close as one might expect from keyword overlap alone. This is quite impressive already, since models in the past struggled to find “just the right amount” of semantic abstraction.
So, is vector search the be-all and end-all? Let’s add the embeddings for a few queries to our plot and see what happens.
Intuitively, “Alex meets his new friend Sam” should align closely with stories 1 and 2. But what we often see is that literal terms may be undervalued in vector search. In contrast, inverted index search might over-prioritize them. The crux of the issue is finding a balance between the two approaches.
Out-of-the-box embedding models excel at clustering related content and at search-by-example scenarios. However, they are less adept when the input diverges significantly in form from the indexed content, as is typical when short text queries are matched against longer documents.
How do you overcome this? There are numerous strategies, but none offer a “set-it-and-forget-it” solution. Realistically, achieving an optimal search experience may involve some training, tuning, and ongoing adjustments. It’s about crafting a search solution that can discern the subtleties of human language and intent, bridging the gap between “too literal” and “not literal enough”.
So… Do We Need Both?
When it comes to search functionality, a ‘best of both worlds’ approach often proves to be most effective. Combining the precision of keyword search with the contextual understanding of vector search can lead to a significant improvement in search results. Leading enterprise search companies and tech giants like Microsoft are already harnessing this hybrid approach. They run keyword and vector searches in parallel and then apply a ranking model that scores question-result pairs. This model re-ranks the top N results from both searches to produce the final list of search results.
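If you don’t have a trained re-ranking model like the one described above, a simpler and widely used way to merge the two result lists is reciprocal rank fusion (RRF); the ranked lists below are invented for illustration:

```python
# Sketch: merging keyword and vector result lists with reciprocal rank
# fusion (RRF), a simpler alternative to a learned re-ranker.
def rrf(rankings, k=60):
    # Each doc scores 1/(k + rank) per list; docs ranked well in
    # both lists accumulate the highest fused score.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_top = ["doc3", "doc1", "doc7"]  # e.g. from BM25
vector_top = ["doc1", "doc9", "doc3"]   # e.g. from ANN search
print(rrf([keyword_top, vector_top]))   # doc1 and doc3 surface first
```

The constant `k` damps the influence of top ranks; 60 is a conventional default, but like everything here it deserves tuning on real relevance data.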
This hybrid approach has shown promising results in standardized benchmarks, indicating its potential effectiveness. An important note on benchmarks: many standardized search benchmarks, such as BEIR and MIRACL, tend to be biased towards keyword-based queries. This bias means that while these benchmarks are effective for measuring the success of keyword-centric search approaches, they may not accurately represent the efficacy of semantic search queries. Furthermore, benchmarks don’t always reflect the intricacies of domain-specific use cases and datasets.
Beyond this general-purpose strategy, there are many methods to fuse the two search paradigms, such as:
- Keyword Boosting: Begin with an Approximate Nearest Neighbors (ANN) search in vector space and then apply a boost to results that have keyword matches to improve relevance.
- Two-Stage Retrieval: Start with keyword search to quickly filter down a large corpus to a manageable subset, then apply vector search for semantic ranking; or, conversely, use vector search first and then refine results with keyword search. This method can be fine-tuned to be highly selective about filters and can be combined with traditional search techniques like query expansion.
- Semantic Expansion: Utilize vector search to expand the query with semantically related terms, or refine the query by identifying its most critical components to improve recall without sacrificing precision.
- Vector Manipulations: Integrate additional dimensions into the embeddings to account for entities, user profiles, and preferences, modifying the similarity calculations to be more user-specific and context-aware.
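As one illustration, the keyword-boosting strategy above can be sketched as follows (all scores and the boost weight are invented; in practice the weight would be tuned against real relevance judgments):

```python
# Sketch: keyword boosting on top of ANN results.
# Scores, doc IDs and the boost weight are invented for illustration.
ann_results = {"doc1": 0.82, "doc2": 0.80, "doc3": 0.75}  # cosine scores
keyword_hits = {"doc2", "doc3"}  # docs that match the literal query terms
BOOST = 0.10  # hypothetical weight; tune on real data

boosted = {d: s + (BOOST if d in keyword_hits else 0.0)
           for d, s in ann_results.items()}
ranked = sorted(boosted, key=boosted.get, reverse=True)
print(ranked)  # doc2 overtakes doc1 thanks to its keyword match
```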
Each of these approaches can be tailored to the specific needs and challenges of different search scenarios, ensuring that the search tool not only understands the words being used but also the user’s intent and context, leading to a more intuitive and effective search experience.