@Ovid
Created August 30, 2024 06:58
GenAI provides real value. Vector databases.
I'm getting more pushback on #Mastodon that #GenAI doesn't provide any real value that traditional software can't provide cheaper or more reliably. <voice type="Samuel Jackson">Allow me to retort.</voice>
I'll try not to be too technical. Further, this is based on real work I did for a client and not something I just read about in a blog post.
Consider search engines. With a "traditional" database (an RDBMS or NoSQL store), you might build a search engine for videos. A user types in "videos about cats." With stemming and stop words, that becomes "cat."
What the heck are "stemming" and "stop words"? Stemming is the process of taking a word and reducing it to a common form. For example, consider the words run, runs, running, ran, runner, runners. All of those might be stemmed down to the word "run". This way, if someone searches for "videos about running", everything containing any of those words is a candidate for being returned in search results.
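To make stemming concrete, here is a minimal sketch in Python. It is a toy, not what a production search engine does (those typically use an established algorithm such as Porter or Snowball). The suffix list, the `IRREGULAR` lookup table, and the `stem` function name are all assumptions made up for this illustration.

```python
# Toy stemmer: strip a few common suffixes, then collapse a doubled
# final consonant ("runn" -> "run"). Illustration only.
IRREGULAR = {"ran": "run"}             # hypothetical table for irregular forms
SUFFIXES = ("ers", "ing", "er", "s")   # longer suffixes are tried first


def stem(word: str) -> str:
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Only strip if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant left behind by the strip.
    if len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


words = ["run", "runs", "running", "ran", "runner", "runners"]
print({stem(w) for w in words})  # {'run'}
```

Note that "ran" only works because of the lookup table. Simple suffix stripping can't handle irregular forms, which is one reason real stemmers are more involved than this.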
Stop words are different. Stop words are common words like "the," "is," "and," or "of" that search engines often ignore during a search. These words are used so frequently in everyday language that they don't add much meaning to a search query. By ignoring stop words, search engines can focus on the more important keywords in a query, making the search faster and more relevant to what you're actually looking for. However, stop words vary from search engine to search engine. A video search engine won't have "hotel" as a stop word, but a large hotel room rental site will.
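Here is a rough sketch of stop-word filtering, showing how the list can differ by domain. The word lists and the `keywords` function are invented for this example; I'm assuming that a video site would drop "videos" and "about" and that a hotel site would drop "hotel".

```python
import re

# Assumed stop-word lists; real engines ship their own and let you tune them.
COMMON_STOP_WORDS = {"the", "is", "and", "of", "about", "a", "in"}
VIDEO_STOP_WORDS = COMMON_STOP_WORDS | {"video", "videos"}
HOTEL_STOP_WORDS = COMMON_STOP_WORDS | {"hotel", "hotels"}


def keywords(query: str, stop_words: set[str]) -> list[str]:
    """Lowercase the query, split it into words, and drop the stop words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in stop_words]


print(keywords("videos about cats", VIDEO_STOP_WORDS))           # ['cats']
print(keywords("videos about the hotel industry", VIDEO_STOP_WORDS))  # ['hotel', 'industry']
print(keywords("a hotel in London", HOTEL_STOP_WORDS))           # ['london']
```

The same word, "hotel", survives on the video site and disappears on the hotel site.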
Worse, stop words can make a search engine useless. I once stayed in a lovely London hotel named "Hotel 55" and a hotel search engine wouldn't find it, despite it being in the database, because "hotel" was a stop word and the search engine wouldn't index any words under three characters (e.g., "55"). When I tried to fix that by indexing two-character words, it broke the search engine because it started returning too much "junk."
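The "Hotel 55" failure is easy to reproduce in a few lines. This is a sketch of the general mechanism, not the actual engine's code: the `index_terms` function, its `min_length` parameter, and the stop-word list are assumptions for illustration.

```python
import re

STOP_WORDS = {"hotel", "the", "of", "and"}  # assumed list for a hotel site


def index_terms(name: str, min_length: int = 3) -> list[str]:
    """Return the terms that would be indexed for a hotel name."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= min_length]


print(index_terms("Hotel 55"))                # [] -> nothing to search on
print(index_terms("The Savoy"))               # ['savoy']
print(index_terms("Hotel 55", min_length=2))  # ['55']
```

With the default settings, "Hotel 55" produces no indexable terms at all, so no query can ever match it. Lowering `min_length` rescues "55", but it also lets every other two-character token into the index, which is where the "junk" comes from.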
So when our user types "videos about cats" and the search engine searches for "cat," they might see a list of tons of videos about house cats, the Unix "cat" command, or "CAT" (various types of construction machinery). What they *won't* get is videos about cheetahs, tigers, lions, and so on, unless someone has laboriously gone through and hand-labeled all of those videos with this metadata. This is a slow, expensive, and error-prone process (been there, done that).
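A small sketch of what keyword matching does with that query. The video titles and the `keyword_search` function are made up for this example, and real engines use an inverted index rather than scanning titles with a regular expression, but the matching behavior is the same in spirit.

```python
import re

# Hypothetical video titles.
TITLES = [
    "My house cat learns to fetch",
    "Using the Unix cat command",
    "CAT excavator walkaround",
    "Cheetahs hunting at dawn",
    "Tiger cubs playing",
]


def keyword_search(term: str, titles: list[str]) -> list[str]:
    """Return titles containing the term (or its plural) as a whole word."""
    pattern = re.compile(rf"\b{re.escape(term)}s?\b", re.IGNORECASE)
    return [t for t in titles if pattern.search(t)]


for title in keyword_search("cat", TITLES):
    print(title)
# My house cat learns to fetch
# Using the Unix cat command
# CAT excavator walkaround
```

The cheetah and tiger videos never show up, because the string "cat" does not appear in their titles.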
So at their core, traditional databases let you search for strings of characters.
A vector database, however, lets you search for the *meaning* of your search. A vector database can take the meaning of "videos about cats" and return videos about house cats, cheetahs, tigers, lions, and so on, because it knows that those things are cats. It might return a video about the Unix cat command, but it will be at such a low relevance that you probably won't display it.
However, if someone types in "unix cat" or "linux cat", the video about the cat command will probably be the very first response!
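Here is a toy version of that idea. A real vector database gets its vectors from an embedding model and they have hundreds of dimensions; the three-dimensional vectors below are numbers I invented by hand so the arithmetic is easy to follow. The titles, the `QUERIES` table, the `search` function, and the 0.5 relevance cutoff are all assumptions for illustration.

```python
import math

# Hand-made "embeddings" with dimensions (feline, computing, pet).
# Invented for illustration; a real embedding model would produce these.
VIDEOS = {
    "House cat compilation": (0.9, 0.0, 0.6),
    "Cheetahs hunting at dawn": (0.9, 0.0, 0.1),
    "Using the Unix cat command": (0.1, 0.9, 0.0),
}
QUERIES = {
    "videos about cats": (1.0, 0.05, 0.5),
    "unix cat": (0.2, 1.0, 0.0),
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def search(query, min_score=0.5):
    """Rank videos by similarity to the query, hiding low-relevance hits."""
    q = QUERIES[query]
    scored = [(title, cosine(q, vec)) for title, vec in VIDEOS.items()]
    hits = [(t, s) for t, s in scored if s >= min_score]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


for query in QUERIES:
    print(query, "->", [title for title, _ in search(query)])
```

For "videos about cats", the cheetah video is returned even though its title never contains the word "cat", and the Unix video scores too low to display. For "unix cat", the Unix video comes out on top.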
Vector databases also allow you to dump unstructured data into your database and make it magically searchable. Much of the work of "normalizing" (designing) a database to conform to 3rd or 5th normal form just goes away. There's still a lot of work needed for vector databases to set them up, manage them, and add optional metadata (for example, for "filtering"), so I don't want to pretend that they're simple, but they serve use cases traditional databases cannot.
Traditional databases are still incredibly valuable, but vector databases are a fundamentally different beast which gives you a power that is very hard to get otherwise.
As a case in point, when I was evaluating using a vector database as a search engine for a client, I searched for "Armageddon", a word I knew was not in the data corpus. It returned a bunch of videos about war, despite me doing *nothing* to try to accommodate that word. A traditional search engine probably can't do that out of the box.
Vector databases are being used to help medical professionals find documents related to diseases they're investigating. They're being used by lawyers to find case law relevant to their current case. They're being used by businesses to find internal documentation that is hard to index.
So please stop telling me that AI can't provide real value. It absolutely does.
If you want to critique AI, start by talking about environmental impact, bias, and harm. These are real, powerful, negative consequences.
If you're curious about vector databases, the client was using Chroma, a free, open-source database under the Apache 2.0 License. https://www.trychroma.com/