MariaDB 向量

Sergei Golubchik
MariaDB

关于我

什么是“向量搜索”

如何?

什么是“向量”?

什么是“MHNSW”?

示例

表格

CREATE TABLE embeddings (
        doc_id BIGINT UNSIGNED PRIMARY KEY,
        embedding VECTOR(1536) NOT NULL,
        VECTOR INDEX (embedding) M=8 DISTANCE=COSINE
);

Python

from openai import OpenAI

client = OpenAI()
model = "text-embedding-3-small"

def get_embedding(text):
   return client.embeddings.create(input = [text],
                     model=model).data[0].embedding
  

索引

import mariadb v = get_embedding(document[i]) cur.execute("INSERT embeddings VALUES (%d, Vec_FromText(%s))", (i, str(v))) 
import mariadb import array v = get_embedding(document[i]) cur.execute("INSERT embeddings VALUES (%d, %s)", (i, array.array("f", v).tobytes())) 

搜索

import mariadb
import array

q = get_embedding(user_question)

cur.execute("""
    SELECT doc_id FROM embeddings
        ORDER BY VEC_DISTANCE(%s, embedding)
        LIMIT 5
""", array.array("f", q).tobytes()))
  

LlamaIndex

from llama_index.vector_stores.mariadb import MariaDBVector

vector_store = MariaDBVector.from_params(
    host="localhost",
    user="llamaindex",
    password="********",
    database="llamaindexdb",
    table_name="vectorstore",
    embed_dim=1536
)

LangChain

from langchain_mariadb import MariaDBStore, DistanceStrategy

vector_store = MariaDBStore(
    embeddings=OpenAIEmbeddings(),
    embedding_length=1536,
    distance_strategy=DistanceStrategy.COSINE,
    pool=mariadb.ConnectionPool(
        host="localhost",
        user="langchain",
        password="*********",
        database="langchaindb"
    )
)

调优

调优很难

服务器变量

SET @@mhnsw_max_cache_size = 16777216;

SET @@mhnsw_ef_search = 20;

SET @@mhnsw_default_distance = EUCLIDEAN;

SET @@mhnsw_default_m = 6;

mhnsw_max_cache_size

DISTANCE (mhnsw_default_distance)


CREATE TABLE embeddings (
        doc_id BIGINT UNSIGNED PRIMARY KEY,
        embedding VECTOR(100) NOT NULL,
        VECTOR INDEX (embedding) DISTANCE=COSINE
);

mhnsw_ef_search 和 M (mhnsw_default_m)

qps vs 召回率

0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 1000 2000 3000 4000 5000 M=4 ← ef_search=10 20 → 30 → 40 → 800 → M=5 M=6 M=8 M=12

构建时间 vs 召回率

0.6 0.7 0.8 0.9 1 0 500 1000 1500 2000 M=3 M=4 M=5 M=6 M=8 M=12

性能比较

小型:6 万向量,784 维度,欧几里得

构建时间 vs 召回率qps vs 召回率
0.996 0.997 0.998 0.999 1 0 10 20 30 40 50 60 mariadb pgvector qdrant redisearch weaviate 0.96 0.97 0.98 0.99 1 0 2000 4000 6000 8000 10000

大型:100 万向量,960 维度,欧几里得

构建时间 vs 召回率qps vs 召回率
0.9 0.92 0.94 0.96 0.98 1 0 2000 4000 6000 8000 mariadb pgvector qdrant redisearch weaviate 0.9 0.92 0.94 0.96 0.98 1 0 200 400 600 800 1000

复杂:29 万向量,256 维度,余弦

构建时间 vs 召回率qps vs 召回率
0.8 0.85 0.9 0.95 1 0 1000 2000 3000 4000 mariadb pgvector qdrant redisearch weaviate 0.8 0.85 0.9 0.95 1 0 1000 2000 3000 4000 5000 6000

特性总结

当前已实现

当前已实现(更多)

由以下支持

限制

未来计划

可用版本