Basics of Vector Databases

January 26, 2025•278 words

In a traditional database with structured tables, querying data is easy because each table has a key column that uniquely identifies its rows. The same task is harder for unstructured data such as text documents, audio files and images, which have no such key or fixed schema.

A vector database is one solution for storing and querying unstructured data. Given the popularity of large language models, let's use text documents as an example:

In a vector database, documents (or parts of them) are stored as high-dimensional vectors.
These vectors are computed by an embedding method, e.g. an encoder that converts tokens into numerical vectors.
One needs to choose an embedding method, such as a language model, but it can be any model (e.g. a neural network) that extracts key information from the raw unstructured data.
The embedding is computed once for each document when it is added to the database; the vectors are then stored and reused for every query. Querying wouldn't be fast if all the documents had to be embedded again at query time.
When we query the vector database, the query itself is embedded with the same method, and we look for the stored vectors with the highest similarity to it (measured by, for example, cosine similarity).
Optionally, metadata can be attached to the vectors to describe the data further.
To some degree, we can think of the vectors as rows in a traditional dataframe, but in a slightly fancier way. The most common uses of vector databases are similarity search (and hence retrieval-augmented generation), anomaly detection, recommendation systems and feature engineering.
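
As a rough illustration of the points above, here is a minimal sketch in Python of embedding once at insertion, storing metadata next to each vector, and ranking by similarity at query time. Everything in it is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model (it captures shared words, not meaning), and `ToyVectorStore` is a hypothetical class, not the API of any actual vector database. It also compares the query against every stored vector, whereas real systems typically rely on approximate nearest-neighbour indexes.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: sparse word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class ToyVectorStore:
    # Each record holds (vector, original text, metadata).
    records: list = field(default_factory=list)

    def add(self, text: str, metadata: Optional[dict] = None) -> None:
        # The document is embedded once, when it is added.
        self.records.append((toy_embed(text), text, metadata or {}))

    def query(self, text: str, k: int = 1) -> list:
        # Only the query is embedded here; stored vectors are reused.
        q = toy_embed(text)
        scored = [
            (cosine_similarity(q, vec), doc, meta)
            for vec, doc, meta in self.records
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


store = ToyVectorStore()
store.add("the cat sat on the mat", {"source": "notes.txt"})
store.add("stock prices fell sharply today", {"source": "news.txt"})
store.add("dogs love long walks outside", {"source": "diary.txt"})

# Each result is a (similarity, text, metadata) tuple, best match first.
top = store.query("cat on a mat", k=1)
```

Here `top[0]` is the cat-and-mat note together with its metadata, since it is the only stored document that shares any words with the query.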

The above are my reading notes based on the following resources: Wikipedia, Pinecone and DataCamp.
