Vector Search in Oracle 23ai

Ryota Yamanaka
Oracle Developers
Published in
6 min readMay 5, 2024

After months of waiting, the Oracle Database update 23.4.0, which features vector search capabilities, has finally been released. It’s now curiously named 23ai instead of 23c..!

The vector search documentation (Oracle AI Vector Search User’s Guide) explains that SQL (and partially PL/SQL) enables the whole RAG process: storing LLMs in the database, querying LLMs for vector representations, and sending prompts via SQL.

However, for many engineers who are implementing RAG flows for their applications, what they need for databases may be scalable vector search (combined with established SQL capability and database maintenance).

In this article, I will focus on vector search and introduce steps to quickly get started. By the end of this read, you will see this mysterious SQL syntax:

SELECT g2.embedding <=> g1.embedding AS distance

Prepare a Docker Environment

This time, I used an AMD Oracle Linux 8 instance created on Oracle Cloud. If you have a Docker environment available, you may skip this step.

sudo yum install podman podman-docker

The database uses port 1521, so allow it through the firewall. In a cloud environment, also open port 1521 in the virtual network.

sudo firewall-cmd --list-all
sudo firewall-cmd --zone=public --add-port=1521/tcp --permanent
sudo firewall-cmd --reload

Launch Database 23ai

Download and start the container using Docker or Podman. Note that this container, as described here, is for amd64 architecture and approximately 10 GB in size.

docker run -p 1521:1521 --name database-23ai \
container-registry.oracle.com/database/free:23.4.0.0

When you see this message, the database has successfully started.


#########################
DATABASE IS READY TO USE!
#########################

If you stop the container with Ctrl + C or similar, restart it.

docker start database-23ai

Set the administrator user’s password.

docker exec -it database-23ai ./setPassword.sh password123

Log in to the database as the administrator user (= sys).

docker exec -it database-23ai sqlplus sys/password123@freepdb1 as sysdba

Create a User

Create a regular database user myuser, and grant it the CONNECT and RESOURCE roles.

CREATE USER myuser
IDENTIFIED BY password123
DEFAULT TABLESPACE users
TEMPORARY TABLESPACE temp
QUOTA UNLIMITED ON users;

GRANT CONNECT, RESOURCE TO myuser;

Log out.

exit

Verify you can connect with the new user.

docker exec -it database-23ai sqlplus myuser/password123@freepdb1

Store VECTOR Type Data

Create a table with a VECTOR type embedding column. This table stores the names of various galaxies, a short description, and the vector representation of the description retrieved through a language model.

CREATE TABLE galaxies (
id NUMBER,
name VARCHAR2(50),
doc VARCHAR2(500),
embedding VECTOR
);

Let’s store entries for 9 galaxies. The table definition and original data were obtained from the official documentation, but I replaced the vector data with more realistic vectors generated using a language model.

Original vectors:

M31: [0,2,2,0,0]
M33: [0,0,1,0,0]
M58: [1,1,1,0,0]

Updated vectors (5 dimensions but can be higher dimensions):


M31: [0.26833928, 0.012467232, -0.48890606, 0.61341953, 0.5590402]
M33: [0.43291375, -0.06379729, -0.40366283, 0.77222455, 0.2219036]
M58: [-0.07266286, 0.0802545, -0.34327424, 0.8917586, 0.27424216]

The vectors are passed as strings to the INSERT statement as follows.

INSERT INTO galaxies VALUES (1, 'M31', 'Messier 31 is a barred spiral galaxy in the Andromeda constellation which has a lot of barred spiral galaxies.', '[0.26833928, 0.012467232, -0.48890606, 0.61341953, 0.5590402]');
INSERT INTO galaxies VALUES (2, 'M33', 'Messier 33 is a spiral galaxy in the Triangulum constellation.', '[0.43291375, -0.06379729, -0.40366283, 0.77222455, 0.2219036]');
INSERT INTO galaxies VALUES (3, 'M58', 'Messier 58 is an intermediate barred spiral galaxy in the Virgo constellation.', '[-0.07266286, 0.0802545, -0.34327424, 0.8917586, 0.27424216]');
INSERT INTO galaxies VALUES (4, 'M63', 'Messier 63 is a spiral galaxy in the Canes Venatici constellation.', '[0.16099164, -0.28416803, -0.32071748, 0.7423811, 0.4892247]');
INSERT INTO galaxies VALUES (5, 'M77', 'Messier 77 is a barred spiral galaxy in the Cetus constellation.', '[0.43336692, -0.20248333, -0.38971466, 0.6521541, 0.44046697]');
INSERT INTO galaxies VALUES (6, 'M91', 'Messier 91 is a barred spiral galaxy in the Coma Berenices constellation.', '[0.41154075, 0.14537011, -0.42996794, 0.33680823, 0.7149753]');
INSERT INTO galaxies VALUES (7, 'M49', 'Messier 49 is a giant elliptical galaxy in the Virgo constellation.', '[0.45378348, 0.37351337, -0.33156702, 0.7252909, 0.13632572]');
INSERT INTO galaxies VALUES (8, 'M60', 'Messier 60 is an elliptical galaxy in the Virgo constellation.', '[0.124404885, 0.1522709, -0.27538168, 0.9206997, 0.19445813]');
INSERT INTO galaxies VALUES (9, 'NGC1073', 'NGC 1073 is a barred spiral galaxy in Cetus constellation.', '[-0.11507724, -0.3145153, -0.42180404, 0.50861514, 0.67173606]');
COMMIT;

Nearest Neighbor Vector Search

Based on the descriptions, calculate the distance between galaxy ID = 1 and other galaxies, ordering them by closeness (i.e., similarity in description).

These three lines are for formatting the results if you’re using sqlplus. Please skip this if you are using a different SQL client.

col galaxy_1 for a10
col galaxy_2 for a10
col distance for 0.00000000000000000

Here’s the new VECTOR_DISTANCE function added in 23ai.

SELECT
g1.name AS galaxy_1,
g2.name AS galaxy_2,
VECTOR_DISTANCE(g2.embedding, g1.embedding) AS distance
FROM galaxies g1, galaxies g2
WHERE g1.id = 1
ORDER BY distance ASC;

This function returns the distance between two vectors as a number:

GALAXY_1   GALAXY_2   DISTANCE
---------- ---------- --------------------
M31 M31 -0.00000011920928955
M31 M77 0.04941833019256592
M31 M91 0.07123643159866333
M31 M63 0.07465422153472900
M31 M33 0.08952367305755615
M31 NGC1073 0.14105635881423950
M31 M58 0.15033429861068726
M31 M60 0.15659791231155396
M31 M49 0.19035106897354126

Specify Distance Calculation Method

In the previous examples, since the distance calculation method (= metric) was not specified, the default cosine distance was calculated. You can also specify it with COSINE:

SELECT
g1.name AS galaxy_1,
g2.name AS galaxy_2,
VECTOR_DISTANCE(g2.embedding, g1.embedding, COSINE) AS distance
FROM galaxies g1, galaxies g2
WHERE g1.id = 1
ORDER BY distance ASC;

According to the SQL reference, besides cosine distance, other metrics are available: DOT, EUCLIDEAN, EUCLIDEAN_SQUARED, HAMMING, and MANHATTAN. Here, let me show the results for DOT and EUCLIDEAN only. The order of results was the same as with cosine distance with this data.

SELECT
g1.name AS galaxy_1,
g2.name AS galaxy_2,
VECTOR_DISTANCE(g2.embedding, g1.embedding, DOT) AS distance
FROM galaxies g1, galaxies g2
WHERE g1.id = 1
ORDER BY distance ASC;
GALAXY_1   GALAXY_2   DISTANCE
---------- ---------- --------------------
M31 M31 -0.99999994039535522
M31 M77 -0.95058161020278931
M31 M91 -0.92876350879669189
M31 M63 -0.92534565925598145
M31 M33 -0.91047626733779907
M31 NGC1073 -0.85894358158111572
M31 M58 -0.84966564178466797
M31 M60 -0.84340202808380127
M31 M49 -0.80964887142181396

The distance between two M31 entries is zero using Euclidean distance:

SELECT
g1.name AS galaxy_1,
g2.name AS galaxy_2,
VECTOR_DISTANCE(g2.embedding, g1.embedding, EUCLIDEAN) AS distance
FROM galaxies g1, galaxies g2
WHERE g1.id = 1
ORDER BY distance ASC;
GALAXY_1   GALAXY_2   DISTANCE
---------- ---------- --------------------
M31 M31 0.00000000000000000
M31 M77 0.31438317894935608
M31 M91 0.37745609879493713
M31 M63 0.38640478253364563
M31 M33 0.42314010858535767
M31 NGC1073 0.53114295005798340
M31 M58 0.54833269119262695
M31 M60 0.55963915586471558
M31 M49 0.61701065301895142

Mysterious Vector Distance Notation!

While looking through the SQL reference, I stumbled upon a curious notation: <=>. This notation is an alternative to the VECTOR_DISTANCE function for calculating cosine distance. I tried the following query and found the results returned were indeed the same:

SELECT
g1.name AS galaxy_1,
g2.name AS galaxy_2,
g2.embedding <=> g1.embedding AS distance
FROM galaxies g1, galaxies g2
WHERE g1.id = 1
ORDER BY distance ASC;
GALAXY_1   GALAXY_2   DISTANCE
---------- ---------- --------------------
M31 M31 -0.00000011920928955
M31 M77 0.04941833019256592
M31 M91 0.07123643159866333
M31 M63 0.07465422153472900
M31 M33 0.08952367305755615
M31 NGC1073 0.14105635881423950
M31 M58 0.15033429861068726
M31 M60 0.15659791231155396
M31 M49 0.19035106897354126

I must raise a question: Is it (legally) allowed to create such a new SQL syntax in the 2020s..?

For Euclidean distance, the notation is <->, and for dot product, <#>..!

SELECT
g1.name AS galaxy_1,
g2.name AS galaxy_2,
g2.embedding <-> g1.embedding AS distance
FROM galaxies g1, galaxies g2
WHERE g1.id = 1
ORDER BY distance ASC;
SELECT
g1.name AS galaxy_1,
g2.name AS galaxy_2,
g2.embedding <#> g1.embedding AS distance
FROM galaxies g1, galaxies g2
WHERE g1.id = 1
ORDER BY distance ASC;

These notations are definitely simple and fun, but somewhat shocking!

Appendix: Obtain Vectors

The vector representations used in this article were obtained using the OpenAI API. Although the INSERT statements are printed as text in the script below, you could alternatively execute queries from a Python script connecting to the Oracle Database:

from openai import OpenAI
client = OpenAI()
galaxies = [
{"id":1, "name":'M31', "doc":'Messier 31 is a barred spiral galaxy in the Andromeda constellation which has a lot of barred spiral galaxies.'},
{"id":2, "name":'M33', "doc":'Messier 33 is a spiral galaxy in the Triangulum constellation.'},
{"id":3, "name":'M58', "doc":'Messier 58 is an intermediate barred spiral galaxy in the Virgo constellation.'},
{"id":4, "name":'M63', "doc":'Messier 63 is a spiral galaxy in the Canes Venatici constellation.'},
{"id":5, "name":'M77', "doc":'Messier 77 is a barred spiral galaxy in the Cetus constellation.'},
{"id":6, "name":'M91', "doc":'Messier 91 is a barred spiral galaxy in the Coma Berenices constellation.'},
{"id":7, "name":'M49', "doc":'Messier 49 is a giant elliptical galaxy in the Virgo constellation.'},
{"id":8, "name":'M60', "doc":'Messier 60 is an elliptical galaxy in the Virgo constellation.'},
{"id":9, "name":'NGC1073', "doc":'NGC 1073 is a barred spiral galaxy in Cetus constellation.'}
]
for galaxy in galaxies:
embedding_response = client.embeddings.create(
model = "text-embedding-3-large",
input = galaxy["doc"],
dimensions = 5
)
str_embedding = str(embedding_response.data[0].embedding)
str_insert = "INSERT INTO galaxies VALUES (" + str(galaxy["id"]) + ", '" + galaxy["name"] + "', '" + galaxy["doc"] + "', '" + str_embedding + "');"
print(str_insert)

Next Step: Performance

After trying out the new SQL syntax, it’s crucial to consider the search performance when storing large amounts of vectors in a database. In the next story, I will explore the newly introduced indexes for vector search.

Sign up to discover human stories that deepen your understanding of the world.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

Oracle Developers
Oracle Developers

Published in Oracle Developers

Aggregation of articles from Oracle engineers, Groundbreaker Ambassadors, Oracle ACEs, and Java Champions on all things Oracle technology. The views expressed are those of the authors and not necessarily of Oracle.

Ryota Yamanaka
Ryota Yamanaka

Written by Ryota Yamanaka

Ex-Oracle Database Consultant. tw @oraryotas