Basics of Vector Databases
January 26, 2025•278 words
In a traditional database with structured tables, querying data is easy because each table has a key column that uniquely identifies the rows. In contrast, the same task may be harder for unstructured data such as text documents, audio files and images.
A vector database is one solution for storing and querying unstructured data. Given the popularity of large language models, let's use text documents as an example:
- In a vector database, documents (or parts of them) are stored as high-dimensional vectors.
- These vectors are calculated based on some embedding method, e.g. an encoder that converts tokens into numerical vectors.
- The embedding method has to be chosen up front. For text, a language model is a natural choice, but it can be any model (e.g. a neural network) that extracts key information from the raw unstructured data.
- The embedding process is done once for each document when it is added to the database, and then the vectors are stored and can be reused for querying purposes. Querying wouldn't be fast if every document had to be embedded again at query time.
- When we make a query to the vector database, the query is embedded with the same method, and we look for the stored vectors that are most similar to the query vector (e.g. by cosine similarity).
- Optionally, metadata can be attached to the vectors to describe the data further.
- To some degree, we can think of the vectors as rows in a traditional dataframe, just with model-generated numerical features as the columns. As such, the most common uses of vector databases are similarity search (and hence retrieval-augmented generation), anomaly detection, recommendation systems and feature engineering.
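To make the bullets above concrete, here is a minimal sketch in plain Python. It is a toy and not how Pinecone or any real vector database is implemented. `toy_embed` is a hypothetical stand-in for a real embedding model (it just hashes words into a fixed-length vector), `ToyVectorStore` is a made-up name, and the search is a brute-force cosine-similarity scan, whereas real systems typically rely on approximate nearest-neighbor indexes.

```python
import hashlib
import math

DIM = 64  # toy dimensionality, chosen arbitrarily for this sketch


def toy_embed(text):
    """Hypothetical stand-in for an embedding model.

    Hashes each word into one of DIM buckets (a hashed bag of words)
    and L2-normalizes the result, so a dot product of two embeddings
    equals their cosine similarity.
    """
    vec = [0.0] * DIM
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class ToyVectorStore:
    """Made-up in-memory store: one (vector, text, metadata) row per document."""

    def __init__(self):
        self.rows = []

    def add(self, text, metadata=None):
        # The document is embedded once, at insert time, and the vector is kept.
        self.rows.append((toy_embed(text), text, metadata or {}))

    def query(self, text, top_k=1):
        # Only the query is embedded at query time; stored vectors are reused.
        q = toy_embed(text)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), doc, meta)
            for vec, doc, meta in self.rows
        ]
        # Highest cosine similarity first.
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_k]


store = ToyVectorStore()
store.add("the cat sat on the mat", {"source": "a.txt"})
store.add("stock prices fell sharply today", {"source": "b.txt"})

for score, doc, meta in store.query("cat on the mat", top_k=2):
    print(round(score, 3), doc, meta)
```

Swapping `toy_embed` for a real model would change the quality of the matches but not the overall shape: embed on insert, embed the query, rank by similarity, and return the metadata alongside.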
The above are my reading notes based on the following resources: Wikipedia, Pinecone and DataCamp.