Vector Databases Explained: AI Tech Fact Check and Analysis

/100

Generally Credible

10 verified, 0 misleading, 0 false, 0 unverifiable out of 10 claims analyzed

The video comprehensively covers vector databases, detailing technologies like clustering-based indexing, HNSW, ANNOY, and LSH, supported by real-world applications such as semantic search in ElasticSearch and use in large-scale AI systems. The presenters accurately describe theoretical concepts and practical implementations, including challenges like computational expense and optimization strategies. The discussion of LLMs, fine-tuning, and retrieval augmented generation is factually sound, with clear linkage to vector database roles. Minor informal language and lightly speculative remarks do not impair overall factual integrity. The video is highly credible for audiences seeking an insightful introduction to vector database technology and its AI applications, earning an overall credibility score of 88.

Claims Analysis

Verified

Vector databases represent data as mathematical vectors in n-dimensional space to enable similarity search.

Vector databases store data items as vectors in high-dimensional space, allowing geometric proximity to define similarity, as illustrated by clustering of related items in vector space.

Introduction to Vector Databases

Verified

Nearest neighbor algorithms find the closest vectors (data points) to a query vector within this vector space.

Nearest neighbor search is a fundamental algorithm in vector databases to identify vectors closest to a query, based on distance metrics like Euclidean or cosine similarity.

Approximate Nearest Neighbor Search

Verified

Clustering-based indexing (e.g. product quantization used in Facebook AI similarity search "FAISS") improves search efficiency by creating memory-efficient representations.

FAISS by Facebook AI uses product quantization techniques to compress vectors and cluster them to speed up similarity search efficiently.

FAISS: A Library for Efficient Similarity Search

Verified

Hierarchical Navigable Small World (HNSW) is a proximity graph index algorithm enabling efficient vector search via navigable layered graphs.

HNSW creates multi-layer graph structures for efficient approximate nearest neighbor search by traversing from upper layers down to closest nodes, ensuring speed and accuracy.

Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs

Verified

Approximate Nearest Neighbors (ANNOY) is a tree-based indexing method for vector search, widely used in production.

ANNOY uses random projection trees to approximate nearest neighbors for fast vector search, developed originally at Spotify and open-sourced.

Annoy: Approximate Nearest Neighbors

Verified

Locality Sensitive Hashing (LSH) uses hash buckets to quickly approximate nearest neighbors in high dimensional vector data with less computation.

LSH hashes vectors so that similar items map to the same buckets with high probability, enabling sublinear time approximate nearest neighbor queries.

Locality-Sensitive Hashing (LSH) and its applications

Verified

ElasticSearch, originally open-source document database, is used for text search and recommendation and has features enabling vector search.

ElasticSearch builds on Lucene and supports vector search capabilities, widely employed for semantic search and recommendation systems, with commercial extensions after Amazon's involvement.

ElasticSearch vector search

Verified

Large Language Models (LLMs) require months of training on thousands of GPUs and consume energy comparable to mid-sized cities.

Training large models like GPT-4 is known to require significant computational resources and time, with energy consumption estimates comparable to medium-sized cities.

Energy and Policy Considerations for Deep Learning in NLP

Verified

Retrieval Augmented Generation (RAG) systems use vector databases to enable LLMs to recall and generate more accurate, grounded responses.

RAG models retrieve relevant documents from vector stores to augment language generation and mitigate hallucinations, combining search with generation effectively.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Verified

Fine-tuning large models involves training parts of the neural network (e.g., last layers or adapters) to adapt model behavior for specific tasks or styles.

Fine-tuning adjusts selected network parameters or adds adapter layers to customize pretrained models, a standard method to specialize large models efficiently.

Parameter-Efficient Transfer Learning for NLP

[Music] yeah [Music]

just this is [Music]

this [Music] yeah

[Music] n [Music]

hey trick lost [Music] camera oh yeah about that I comfortable

like was going to give just a short intro about and you could totally start from

your first SL just do you know whatever you would feel most comfortable with you

know abely yeah I I [Music]

I [Music] I I'm even going to give an intro where

we mentioned that that this is kind of and then absolutely we have a Tempo worked out it's going to be

good yeah I mean I think it's a certain numbness that comes

I mean you set your your gost jump right I know some people years ago they were doing with

teain so big anymore but 10 years ago or so they were everything you want to do

cool from one all right we're good I think we're good to go you ready we're starting on us ready yeah we're starting

this way today all right well happy Sunday everyone welcome to hack Sunday thanks for joining us I

know it's a beautiful day out outside and you know I'm glad that other people share my enthusiasm for being inside and

learning so here we are today welcome to hack Sunday for those of you who don't know uh we are a live Production Studio

we actually do Focus here on live production we do live streams here all the time my music's still on let me turn

that off um and today we're here for heck Sunday we're here to learn how AI is taken over

the world and the this today's talk started so for anyone who missed uh last week's talk or didn't watch on YouTube

we started a conversation about Vector databases which began with gav teaching people in the telegram chat me being

lost and saying what the hell are you guys talking about can you please explain this to me so we did part one

last last week where Gro tried to explain Vector databases to me he did a amazing job I'm still confused as hell

uh but we did learn a lot and that's the goal here at machine learning together so today we got two people to try and

teach me we got grav and Stanley we think between the two of them maybe they can educate me here um so with that I'll

I'll turn it over to you guys um if you want to give a quick little intro and tell us what we're talking about and um

and I'm actually so excited we get a second day talking about Vector databases and um um I I I have to say

though it's it's mostly my fault I was asking too many questions last time so just going to cut through the middlemen

and I'm going to be up here um you know sort of contributing a little bit to gv's flow but yeah I think we're

actually going to start fresh kind of go through the slides cover the main topics and even for y'all non-technical people

um the systems were talking about underly all of this large language model stuff and and so these um these tools

and these Technologies are going to be a big part of the rest of of our lives so sure um since we are doing like a recap

I'll I'll do it quickly um so first of all we'll start with what is database so database is a is a technology that

stores and retrives information or data and then we went through types of databases we have different types of

database such as document database most most of you are familiar with relational database um um during the last talk we

also talked about like analytical database which was like used for business intelligence and uh we we had a

BRI um long talk about it then there is favorite uh Stanley's favorite graph databases and then uh you have like all

sort of um no SQL or not a SQL database uh which is like Dynamo DB or um and most of these databases are either

column based or they are key value pair or they are document stores so we went with them uh through them a little bit

and then we went with different use cases of uh databases that uh um that people use uh

and here here is the slide that that explains it quite well with the examples and we went through that one and finally

we introduced what is a vector database so Vector database uh is is a is a representation of the data in n space um

uh sort of representation right so uh as you can see in this picture like all those points which are animals they are

clustered together and then uh fruits which are uh which are data points are clustered together so uh basically um

what data Vector database stores is the mathematic itical representation of that data and uh we we did introduce that uh

a little bit do you want to add something to that Stanley yeah um um it just U the idea here is that um the

relationships words have linguistically within a piece of text become geometric relationships in the space it's a very

beautiful kind of transformation and then one thing I would love to say is that I feel like um and it's uh starting

to become clear that our brain actually uses this to store information right and so I always think like when you say

you're reaching for a word that's actually true in the vector database in your

brain all right um so then we went through how do actually Vector databases work so as you can see in this um in

this slide that there are images documents and audio these are like different formats of data we trans

transform them into embedding embedding is nothing but a vector U mathematical representation of that data which we uh

depict over here U with the vector representation and then we use some sort of uh nearest neighbor finding

algorithms and then uh those algorithms help us uh do the like do any kind of query on those um yeah and and last time

we also went through some uh uh some distance metrics uh uh in detail like what's the mathematics behind it um

gr up I'm I'm so sorry so quick too but I just wonder is there any chance you'd say a little something about what a

nearest neighbor algorithm is sure um so nearest na algorithm as as you as you know that the the whole Vector space uh

you can you can think it it's sort of like n dimensional space and now you want to find

uh you want to find say suppose uh let's let's go to the example over here like I have um say suppose I have a query which

is which falls into um into like animal uh let's say uh uh I I can't think of animal say say

elephant right so where do elephant closely matches with with this category and that's that's what the nearest

algorithm uh neighbor algorithm would do it will try to find the uh the centroid of of that particular cluster and it

will try to put that particular query or the sample into um into that cluster do do you want to U it's so well described

um and then just to emphasize a little bit um so for example wolf might be close to dog because they're both

animals it might also be close close to dog cuz they're both mammals right could also be because they're four-legged so

we think that there's all of these different dimensions of similarity and then the K in K nearest neighbors means

it's looking at a number of different dimensions right right um anyway it's just so cool I I love this picture man

this is perfect yeah this uh I I also love this picture because every human being had has this inherit ability to

think in 3D right we we understand 3D space we don't understand like four dimensionals or n dimensional spaces so

um I I really love this uh uh this sort of representation of vector database and uh that's why I chose this particular

slide I have one friend who can think in four or five dimensions and I ha her such it's such a

superpower that's that's awesome you have a question I'm supposed to limit the questions but let's see oh sorry no

I'm just Kidd apologize if you're going to get to this later I can hold off on it I just wondering if you could speak

to the challenges of choosing vectors oh 100 million per and then actually the main reason I'm here is we're actually

going to look at a small case study at the end that relates to um generating embeddings constructively and then just

for example has anyone had the pleasure of giving an image to chat GPT and then chat GPT can answer questions about the

image um that is what's called a multimodal learning system right and the way that works is you take the text and

you create a set of Vector embeddings in the same space as you embed the images so it's kind of like you're creating a

common language for text and images and and that's a big reason why everyone's so excited about these Technologies

right now but anyway for forgive my digression no that's uh that's super cool and then we uh we dealt with little

bit on um uh what's what's like the traditional search and semantic search and uh um I

also presented like two uh real case uh demos that that um um uh that we went through uh one was a very simple Vector

uh store search where it was using like traditional uh sorry it was using the semantic search so um what this demo was

um I took like a PDF um and and I gave that PDF to a vector store and then creating uh created embeddings from it

and then we tried to do a semantic query on it uh the second demo was uh um was for images same thing we did for for a

large uh large data store of images but here we tried to use traditional SQL queries um with uh with a with a single

store database um I will uh I will go through these demos uh again um if needed but uh without further Ado um uh

I would like to jump into how can we optimize these uh these algorithms right because right now it feels like um you

are trying to you have data stored somewhere you have these mathematical representation

of data which you are storing in Vector databases and overall like how can you um how can you optimize how can you

improve the search quality and the search efficiency so um researchers have come up with different sort of

algorithms we'll go through some of them um uh today um so first one I would like to think about is like clustering since

since I was talking about clusters creating clusters of data points and stuff like that so the first algorithm

that I would like to go through is clustering based indexing which is um f as an example so f is Facebook AI

similarity search um it's it's like most popular um in in the vector databases and uh what what F does it

optimizes uh the query by using uh different me methods most popular they use is called Product quantization and

what product quantization is um is basically each data set is converted into more memory efficient

representation uh which is called PQ code and um and as you can see uh in the image like you have these original

vectors which are nothing but the embeddings but then you slice them you try to Cluster those vectors and then

you finally get like the common point which is uh as as Stanley said like this common point would be for example

four-legged mammals right so four-legged mammals have like one common centroid point and they have all the animals that

are like really close to each other would you like to oh no so I'm so well described and um and forgive me for

being Mr asteris on this talk I mean just gorov is is saying all the main stuff so all I can do is H put a cherry

on top um but one thing I I would like to say is like man think what you will of Facebook as a social media platform

but has their engineering team ever contributed a lot to this space yeah they they have because they they see

these um these sort of problems that that other world or other businesses do not see they have

like millions of people getting onto their website side and you you need to have those optimized searches you need

to find your friends like within seconds right so so they came up with uh um with these with these ideas with these

techniques so that you could you could find the um uh the people closest to you and you can build that social network

and again they wanted to also try to do targeted advertising which which was like their source of Revenue so they

wanted that really fast right you need to see see an advertise which which belongs to you and like uh I cannot show

um uh I cannot show an advertise of uh uh of bike racing to to a mother right like she she won't be interested in

looking at bike racing so it's um um one one funny thing too about these these Technologies is how powerful they are

right um so there's actually been some controversy um related to Target MH where where Target was serving ads for

uh maternity related Goods to to teens because their algorithm said that they should and then it turned out the

teens were pregnant yeah um and it it is just really amazing because we are such social animals so

the connections that we have to our communities and our culture re really kind of gives you a fingerprint of our

cultural DNA and and you truly can extrapolate between people in a way that is almost Magic right oh Ed go ahead

brother I'm sorry just so I think what you're trying to say and I think understand this now sliced sub vectors

might be the original vectors some representation of a person on Facebook their profile and all that but you might

slice it up for advertisers differently exact slice it up for maternity people exactly exactly so uh these sliced

vectors are like based on common feature as um um as you correctly pointed out now quick question and again we'll get

on to the next slide stat people but is the motivation for the slice sub vectors completely that it provides like a kind

of specialized perspective on classification or is there computational

motives as well there is a computation motive as well because U as you slice them you have less number of

mathematical representations to go through less number of computations so so definitely it helps and again um as

you said like the motivation was also coming from having like a certain feature right because if you have a

certain feature it will help you in that particular business and and as um as you brought up uh like multimodal similarly

um it's it's like multifeature at the same time so yeah it's um it's so cool and then I always also think about this

a little bit as like um you know because I think these things are a reflection of processes that happen in our mind right

or at least I always like to look at them that way and and for some of these slic sub vectors you could almost think

of like like anyone anyone into sneakers any any sneaker heads here big sneaker head myself there's a whole

cultural classification layer where it's like you know you see someone with panda dunks and you're

like you know um and and so it's just kind of interesting it's like these you might almost think of in the human

context as little subcultures right so let's move to the next algorithm which is like proximity base

graph index and this one um as it says hnsw hierarchical navigable Small World um this is this is just a just an

acronym of like imagining each Vector data base a as a as a world and then you have subclusters as um uh as Stanley

said like subgroups or um subcultures right so um so it's that and uh if if anyone is uh interested uh in knowing

but the data structure that this uses is called skip list it's it's like just another list but instead of um storing

the next pointer uh to to to the next uh list node you store its um uh like each each node is basically a

list so you just store its head and then you try to Traverse through it to it so skip list is a is a fun uh like

wonderful structure and uh as you can see like there are these layers right and and you can imagine how each layer

is is a different list altoe and uh this is like another another optimization algorithm that researchers have come up

with to to do efficient searches on Vector databases do you want to add oh I I shouldn't but it's just so beautiful

um isn't there something kind of interesting happening in this picture where like each layer is a network right

but then it's a graph a graph or a network but then aren't the layers themselves a

network yep and and there's there's actually some real depth that that comes from that idea right right anyway this

is so beautiful man what a what a good illustration awesome um a network of networks

yeah and then then same thing uh again like since since most of these algorithms are computer science based so

people you try to use the same tools that they are familiar with so on one hand we are looking at skip list which

is like a data structure of a list now the next one is a tree so we have seen so many um so many applications of uh uh

tree searches so they said okay now let's try to do some tree based indexing if if it's possible right so ano is

again as I said these are all nearest neighbor algorithm they are they are algorithms to find like who is the

nearest to the centroid of that cluster where can I put this particular sample um uh like in which cluster so this one

is called anoy and uh it's kind of funny with the name so it's approximate nearest neighbors oh

yeah so I don't know uh why they had to add U uh oh yeah at the end of it but uh it's it's so cool um and uh there are

multiple uh well-known Vector databases that uh that use anoi as their technique um uh to their uh do their queries do

you want to add to that um I I I do uh I I have always wondered about the the what was behind the naming

of this um a lot of these Tech techniques are actually used for type of algorithm called an annoyance algorithm

right and so I wonder if that might have been the origin um so so YouTube actually has a model built for each of

you guys that tries to predict when you're reaching being too annoyed by ads like they try to predict when you're

going to rage quit YouTube because there's too many ads and then they'll show you 10 seconds less than that

number and so I've always wondered if maybe annoy came out of annoyance algorithms um kind of a a silly thought

but um but that's true I know I was uh uh was done by Google so it might be true it it it could be I I'm I'm going

to look into it but this is um this is so cool yeah awesome um then this is the oh LS this is your your favorite I want

you to get uh go go for it no no no um um forgive me I just actually past couple weeks have been working um to use

lsh in a non-linguistic situation and so getting a lot of time um to work with it um one thing that's kind of fun too is

is like um there is this uh story in Science and Mathematics where we keep coming up with weirder and weirder GE

geometric spaces right so we're we're like oh three dimensions we live in three dimensions we can think of that

right what about 100 Dimensions right what what about 100 curved Dimensions right um and so it's kind of fun because

it's almost like there's this story of trying to find more and more powerful ways to use nearest neighbor

relationships and um and sort of being like Oh we have a phenomena that's very complicated is there a way to imagine a

space that's so complicated that all questions reduce to nearness in that crazy space um and then lsh is is also

just very beautiful maybe you should do the intro to lsh and then I can share why it gets me sort of overe excited

sure so uh as you can see on the slides like each key is is the mathematical representation the embedding that we

were talking about now we give it to this fuzzy hashing technique which uh which uses this function and tries to

put them into hash buckets so now when you're searching for your query you don't have to go to the original

embedding but you can just look it in in a specific hash bucket um do you want to yeah absolutely so um our uh uh Kish

neighbor algorithm for this community here that would be Jerry and I'm kind of kidding but I'm kind of

not kidding right because Jerry um helps us get organized he helps us figure out who to connect to that's why we make

y'all sign up on the hack sunday.com invite your friends um right and and it's so that Jerry can have the

information of like oh this guy's interested in that I'm going to connect him to this person this person is

interested in that so I'm going to connect him to this person but then you think about what happens as the

community grows and there's more and more dimensions of variation and more and more people that our Jerry would

have to keep in mind in order to make the right connections and this is a problem that Google has right because

you imagine the scale of their user base and then you know maybe they have Ed who's um a new user and they're going to

try to search in their Universe of billions of other users who is closest to Ed who is most like Ed um and to

actually go and compare Ed onet to one with every one of the other billion users and to do that for every single

user it's computationally really expensive yeah extremely expensive scales very poorly this is the solution

which is they created a function that's kind of well the way that it's built is a little complex but they've built a

function so that you apply it to a single person and it outputs almost what you want almost that classification

without needing to query the entire network and and that's why it's locally sensitive right awesome and and then uh

yeah as as you explained with a um uh with that like for for example uh Jerry's exam um taking take Taking Jerry

as the centroid of it now he has created these buckets which is like one is this science talk another would be regarding

web3 another one you do is uh on on artists right so so now you know you have this hashing function in your mind

which puts us into separate cluster right and now when you when you want to uh search for like science you will just

just go to your science bucket right and and then you can pick those values you can put say suppose um um some Mr XYZ

who is who is like into Finance so so now you don't know where to put him so you will try

to uh like do some query with him you'll you'll try to ask hey you are in finance are you interested in blockchain then he

goes to that crypto bucket right so so that's that's the hashing function algorithm those questions that you will

be asking this person are those hashing functions um um and and yeah just just to mention

one thing um lsh is seriously like maybe one of the core Technologies behind Google right um they they run lsh on

just extremely um large unprecedentedly large data sets right um but but anyway here's just a fun and interesting thing

um when lsh was developed to run on social data for Google it was the only data sets in the world that were that

big and so in many ways like the edge of data science in application to social media drives a bunch of the rest of the

field but we're starting to have similarly large and complex data sets in other areas so a project I've been

working on for the past month actually is um applying lsh to genomic data and so actually mapping genomes to these

kind of hashes and using them to understand what a genome does like what kind of chemistry it performs so it's

just very interesting and it's kind of I think part of the story of these technologies that they kind of start in

big Tech and then find other applications yeah that's that's really awesome so uh next one here uh it's like

self-explanatory and uh it it does um this sort of Animation where you are putting all these um all this training

regarding uh regarding shakes Shakespeare's uh literature into the database Tower and then you say okay I

want to find uh a a story which is done by Shakespeare and is related to tragedy now you haven't put any uh any metadata

saying like this is the story about uh about tragedy or this is the story about love and romance but as you can see uh

when the query happens it falls like closer to King Lear and uh Rome and Juliet which is which is a romance and a

tragedy uh whereas King leer is also a tragedy based uh novel so so the this is the uh this is the overall um use uh

that that scan is doing uh basically it is using CNN uh convoluted neural networks to um uh to find the nearest

neighbor uh as we we are talking like every time it is the nearest neighbor right and these are just the

optimization uh algorithms for it um as you can see it is it is is uh it is doing some sort of compression based on

neural network so that you you get uh uh you get like efficient and uh closest nearest neighbor uh for uh for your

query um yeah and uh do you want to add something to that um oh absolutely I

mean it's just so cool that it's sort of like neural networks or how we create and interact with these spaces

Ito kind by put the out in the high dimensional um and then uh convolutional neural networks are very interesting

because it's actually like very explicitly their um deficiencies that led to the invention of large language

models yeah um the convolutional part of a convolutional neural network means it collapses um a number of things down to

one thing and then the idea would be that it's recognizing that a couple different things are really the same

thing and so it's okay to collapse them together and and sort of compress the information that way but then it's also

very possible to lose important information through right uh compression and and for that reason convolutional

neural networks sucked at producing language um actually if you remember the days of Google

translate where it was almost a comedy show you know you put in something you get out something so broken that you're

like whoa um that was all convolutional neural networks and then it was all the um desire to have a little bit of a

better tool for interacting with these spaces that gave us llms right the Gans that's

right awesome uh again someone wants to know this uh scan algorithm uses the ukian distance um as we had um um as we

had discussed earlier um with that so here's like the the conclusion side from from my end and I I I did this sort of

research to come up with the Victor database comparison so as you can see I have uh listed down some of the uh um

some of the Technologies which are uh uh which are out there for data uh Vector databases so pine cone is is uh um uh

one of the one of the most used and it's it's like free so uh that uses the a Ann which is the approximate anoy as we all

know um uh as the approximate nearest name uh algorithm the use case is realtime

search and recommendations um uh similarly vate and also one of them uh that I used was

single store they all use uh hnsw so uh this was the the layered based approach um that and again U as it uses graph it

is used for knowledge graphs right um and then uh elastic search is one which I used uh um

uh pretty much in my uh in my implementations so uh it's like really good for Tex search and recommendations

so one of the applications that I did uh with elastic search was um um I created uh this uh this auxiliary

database to um uh to to our postgress uh datab base where we where doctors could search for their appointments um and uh

uh we had this uh what do we call um like a call center where patients could call uh and and change their

appointments or change their doctors and um that's where uh it was taking like almost 3 minutes for um for our postgis

database which had like millions of users to to search uh the appointment even we we tried like different indexing

mechanisms but elastic search was like really good uh tool for us to uh to find that and it

just reduced it to 300 milliseconds um I was very closely working with the managers of the skull Center and they

said like we really had to make train our um uh our call center employees to talk to the patient about like weather

and how their family is doing and stuff like that and people you were getting annoyed um so yeah uh so so this elastic

search really uh really helped uh it's a it's it's a noral tune it is based on uh Lucine

Lucine is uh is aach open source based um uh sort of like precursor to Vector

databases um it's um uh it's more of a document database but um but people have found even using that document database

could be uh could be a good use case for Vector databases as well um myself a big elasa search fan

used on many key projects right um do do you have an opine on the Amazon hostile takeover of elastic

search um of course like uh when I used elastic search uh back in 2017 it was open source and we could we could

recommend U um changes and uh I was I was happy to take their source code and make certain uh subtle changes that that

were required specifically for my use case but now since um Amazon has taken uh taken them over we we do not have

that uh that Liberty and again Amazon uh has taken this open source technology and made them commercial so and it's

kind of like you know I shouldn't I I don't have a too much of a leg to stand on to talk crap about Amazon so grateful

for the S the cheap servers guys thank you um but at the same time Amazon has this big Vibe of like hey that's a nice

open source s project you got over there it' be a shame if anything happened to it

right so it's it's actually kind of one of the funny stories that actually and and and push back on on this if I'm not

correct in in terms of the kind of role in the space but um Vector database has been along for a while elasa search was

the leading database for most of the history in this area and then right before this llm stuff happened Amazon

did this like hostile takeover of the open source group and and the usage from my perspective really fractured it's

really kind of like a tragic story right on some level uh otherwise uh if elasa search was um um was open source still

uh most famous Vector database would be elastic search um but uh really really created the space for Pine con and I I

have to say love Pine con yeah Pine con is really cool so yeah this was uh uh this was my conclusion and uh if anyone

is uh interested uh I think I I have a small link up there uh I'll share the slides uh once

it's done and uh Here Comes Your Part uh Stanley um so this is the this is the use case uh you wanted to share with

everyone regarding graph databases and Vector databases in general so absolutely and it'll just be like a

little uh um uh addendum to the incredible presentation from gorov those slides were incredible so much fun to

talk about it um and forgive me I meant to actually create a few more slides with this but

um I had some kind of help with slides lined up for yesterday but uh family emergency happened um but anyway this is

like what a production data system looks like um it's kind of painful blurry

actually um and actually let's see if we could zoom in maybe on the um the top left of it

each of these nodes is a different type of data um and then we have a relationship that's not the right right

part sorry which one do you want Oh bottom left uh bottom left part um the the the

not that says sample if we could zoom that in a little bit um we we do have a notebook that's

kind of related to um looking at embeddings that gav had and I I I think maybe we can get that shared and then

right after I'm I'm done flopping my gums we can jump into some collabing together if you guys want but kind of

just wanted to express sort of like one use case of this so right here we're actually looking at a Marine Science

data system so in the middle you have a sample and that just means someone went out in a boat collected some water

sequenced the DNA in the water and then the sample has a number of different

connections um on the right the sample is connected to a node that's says mag a and and that is the actual genomic

data that was found in the sample of water and then over here on the left in the you see it goes over to ecological

event and then remote sensing so remote sensing just means every time we sequence the DNA in the water we also

take a picture of it so what we're attempting to do is understand what can we learn about the

genomics of a system by looking at it visually and vice versa right in order to

understand those connections though we need to embed both pieces of data in the same Vector database right so on the

left side we have um images on the right side we have sequences of uh acts and G's like we

have genomes and then we need to figure out how to put all of them in a space together with each other in the same way

that um you know gorov showed you and again it isn't always clear how to do that but um this is what uh comes out of

these systems when we structure them properly with the help of large language models um we're actually quite confident

that what we're going to be able to do is actually translate back and forth fluidly from a picture of a microbial

ecosystem and the exact genomic data that um that it uh that it contains um and anyway you know maybe

even at a future talk we could look at some embeddings together sure um with that data and um and on that note maybe

let's get the collab notebook back up and see if we want to mess around with it a little bit

sure want drop downam let's go to demo one

it's here I can minimize this today so here's the demo one as I was

talking about post this in the telegram channel right please

thank you all right uh let's see if we have

anything okay so do you guys want me to go over this notebook again um uh we had gone

through it uh last time okay yeah me give people a second to open it up but maybe while they're

doing that talking about it high level would be would be amazing sure and so this this is so like kind of like you

were saying this creates a specif specific set of embeddings that can let you compare yeah two images well uh so

this demo is basically as you can see you can give like any PDF URL it uses uh it uses Lama 2 um because uh Lama 2 has

like really uh good good textual embedding um uh already inbuilt in it so we uh I tried to use that and then uh we

we just use those embedding and try to create our own Vector data store on which we can do some semantic queries so

um and as uh um as shown over here we are trying to use the cosine similarity um if we go back to

our um our presentation over here as you can see it's using this

particular formula trying to find uh the closest distance between the two points using um this cosine distance

formula I uh I love just distance metrics so much and then

um we create a vector data store and over here I have said can you tell me the concepts of safety

fine-tuning and as this um uh this document has some um some information about uh how to how

to uh create Vector databases and stuff like that it was able to come up with uh with certain answer for me so it it

found out like um these are um these are some nodes which uh which match with uh safety and fine-tuning and then uh based

on that um it it comes up with actual um actual answer so this has been generated by um

uh by the Lama 2 llm itself so and it's based on the vector data store that we have created out of that PDF so this is

this was the whole demo um that uh that this particular notebook does any questions yeah so so we uh uh we did

um yeah Ju Just to um just to consume the PDF the content in the PDF we had to uh separate it out in different

paragraphs but then since Lama 2 has its own embedding and it knows how to how to Cluster the text it has done it so I I'm

not creating my own embeddings I'm just using yeah yeah it chunks it itself yeah the

um the selection of a an embedding to to use in a system like this is kind of interesting and um open a I actually

just released a new set of optimized embeddings awesome um question how would something like this be used in like a

rag system oh yeah so um So based on that uh we could create our own rack system um for example now

uh I want to create uh like a store just for um just for this particular this particular PDF so

imagine like uh this this is not a PDF but it's it's like um uh it's it's like cluster of PDF which are related to

medical or uh related to law immigration law for example so one could create a rack system uh that will that will help

you uh just get specific answers not not something um related to uh whatever is there available on the Internet or

whatever again could could hallucinate about right uh it would be like specific to that particular domain and that's

where rack systems or slms as we were talking about it some time back are are really powerful um and who who know who

knows what I mean when I say a rag system does everyone know what R RG is said

last absolutely and it's an acronym and so I should have said what it was it's bad bad uh impolite to just use an a

naked acronym um but yeah so rag is retrieval augmented generation M oh it's right there yeah oh my gosh I should

have pointed to that um and and yeah it's interesting like has anyone had an llm hallucinate on

them yeah it's it's a rag is one of the main solutions to hallucination about other things and it's it's basically I I

like to describe it as a long-term memory for the model but essentially just through these mechanisms that gav

has explained to us a large language model that's asked a question can actually go find the bulk

of the answer and then then kind of inject that into the response right it's very interesting and um you know I

actually um uh we'll need to think a little bit about what a good uh next talk is I I think we should go a little

bit more in the art direction for the the next talk but um down the road maybe we could present on fine-tuning together

yeah sure and and because that's actually one of the really interesting things is um not just um fine-tuning a

model but fine-tuning and ragging a model so sort of not just tuning a model so that it says things you like but so

that it takes things from Rag and then rephrases them in a powerful way so so just for uh uh just for audience can you

elaborate a little bit on fine tuning and like what is the difference between um between fine tuning and rag

especially oh I would love to but actually I would wonder maybe if if you could go back to your slide and go back

one or two slides to the those beautiful animations you had of the actual networks Rippling as the data passes

through them because right there is what a a neural network looks like it's all these little sorry oh please um dots

that are connected by lines and then I like to think of it as like um the intelligence is stored in the Dynamics

of the Rippling like you see there's the bottom of the neural network and it sort of shakes and then the next layer shakes

and then the next layer shakes and that's what a neural network is is it's a a data in data out machine and then as

the data ripples through the network it's changed right um so that's kind of what a neural network is how it works

really obviously handwavy high level um and then how does it like get the intelligence though it's this training

process that we've talked about right and this the training process for gp4 for example it's estimated that it was 6

to n months of 25,000 large computers working together like we're talking the energy bill of a

mid to large size city just to train one of these models um crazy right um and that is sort of

like um happening because there's so many of those little dots for example um gb4 is estimated to have a trillion of

those those dots that're all interconnected so every time they feed new data through it the data changes the

values of every single one of those little dots so it's it's sort of you imagine a trillion different numbers

that are constantly being updated and tested and their inter relations are being messed with and understood um so

on the higher level what you're are trying to say fine-tuning is basically changing those connections that are

between those dots so fine-tuning yes is changing some of the dots but not all the dots so it might be something like

we Chang just the last layer or the last two layers right or in the case of the most used uh technology which is called

Laura which stands for low rank adaption you actually add on a little extra layer of neurons at the end that are almost

like the translator for the model and then the translator learns to make the model sound a different way by taking

its output and then restructuring the output okay but overall it's kind of like you're restarting the training but

you're only picking a certain part of the network focusing your energy on getting that part of the network to

sound like you want and it turns out you can completely change the behavior of these models with just that little bit

of tuning and then again I like to call this um giving your model a master's degree nice a single ler it it depends

there's a number of different approaches to find tuning and different ways to do it um it actually usually wouldn't be

just a single layer uh it could be um but yeah as you add more neurons the computational cost it increases very

quickly um yes yes um yeah off off hand I'm I'm not

100% sure um with some paper in a little time I could work it out so maybe after the talk I'll I'll do my best to give an

estimate um the trillion parameters is somewhat misleading though because chat GPT is an eight-way mixture model so

they actually trained um about a 120 billion parameter model and then there's eight copies of it working together in

gbd4 Al although they do fine-tune those independently so um um

nice so that was uh demo one let's move to demo two the demo two one uh I hope uh it has

those uh sorry about that but what this um this would do it would take uh like almost

7,000 images um that that we have uh uh that I found um on the internet and these are like 7,000 images of different

celebrities and uh what we did is um and this was this was also um done by single store DB which is uh which is just

another database like pine cone um so what single store uh DB people did they they basically came up with uh uh with a

mathematical function uh called dot product um if we if we um go back uh to the slides uh here is

the here is the mathematical function for DOT product so it's uh it's taking uh

it's taking a DOT product between two two Vector quantities um as uh as known in physics so this is the formula to do

it but doing this on data was uh was something that um uh that single store DB guys did and they created like a

traditional SQL sort of query that you can do on any sort of vector databases and this is the same um uh same sort of

technology that I used and as you can see um let me go down the dot product is so beautiful yep

and one one of the most important discoveries in human history true um it solves many physics

equations still it's maybe one of the most useful tools in all of physics and then what the dot product does is you

give two vectors and the dot product answers the question to what degree are these vectors going in the same

direction basically um but we using the same thing in um in Vector databases so uh as you can see in I don't have the

line numbers but uh as you can see in the query it's it's doing like a traditional um uh traditional SQL like

query you can select a file name where doc product of vector is um uh is found out and basically what this does it

tries to find out the images that are similar in in features um I think uh last time we had uh whose picture did we

have Jerry uh yeah some celebrating um so uh his

picture was was selected and um it could find like similar five pictures um of them yeah Alec balwi yeah that's what it

was too and it was very funny because it produced two that were wrong but they were very Alec Baldwin looking dudes

right I I don't have that example today but uh yeah um some people might uh might take

it and uh run it today um and show us like um different different celebrities if they can come up with that and uh I

would love someone to to see if they could query uh William Rohan Hamilton any any Hamilton fans here the

inventor of the dot product Oh really uh I don't know if he would be there in the celebrity database no he

might not be but he's he's a big hero of mine um he was out for a walk with his wife and they they were just uh walking

over a bridge and he had a Eureka moment so he uh ran over and actually carved the equation for the dot product into

the bridge and it's h still there it's considered a pilgrimage site for for mathematicians wow and Engineers wow but

then um I always think about that anecdote and feel bad for his wife yeah any questions before we we jump

into the collab notebooks and then also maybe we should do another round of applause for thank

you I have to say I I I do this stuff every day it's been my career for 10 or 15 years and I I don't know if I've ever

seen a crisper sharper presentation of the good ideas that was so beautiful my friend thank you thank you

Stanley I question vector

come on Christina so I love that we've been um studying a lot of different

functionality today or at least you've been introducing it to us in terms of similarities in terms of vector

databases and all the different ways that that can be applied but are they also working on models to study

polarities because that could also be useful in many different contexts so could you elaborate on what

sort of polarities like dark light you know like but alth some of them are not are a little more similar than they

think which is kind of funny you know well I mean I would I would say sort of in the same way that these things learn

to understand patterns of similarity they necessarily also learn patterns of difference so right for example when we

were looking at examples last week um we we looked at an example of gender as a dimension that is

often part of these models um because our language is so polarized by gender it shows up very clearly in the

embeddings what language is male and what language is Fe is female and so I would actually say that's one of the

very exciting things about these Technologies is they allow us to quantitatively see um these kind of

polarizations or biases that exist in our language day-to-day um so we can actually see that there is this big

difference in in uh uh language related to gender which you know I think most of us are very familiar with as a a core

Injustice of our society yeah and uh just like these are two different uh uh School of thoughts right

one scientist might say okay we are finding similarity so we don't have to find dissimilarities another scientist

would say oh we need to find the similarities so that we can cluster the similarities together so like doing one

will automatically find the other so you don't have to do it explicitly as such thank you oh such a good question too

yeah and that shows everyone was listening to us no I have to say I thought it was so good and thanks

everyone for for being here and if there's you know no more questions we can wrap it up and I'm here if anyone

wants to goof around with the collab or get some help with it and then we can also talk more in some little breakout

groups for for at least a moment or two I'm sure sure thank you thank

you awesome awesome uh trick are we out we are

Heads up!

This fact check was automatically generated using AI with the Free YouTube Video Fact Checker by LunaNotes. Sources are AI-generated and should be independently verified.

Fact check a video for free

Related Fact Checks

Vector Databases & AI: Fact Check and Technical Overview

This fact check reviews a detailed technical discussion on various types of databases, focusing on vector databases and their role in AI and machine learning. Most technical claims regarding database types, vector embeddings, semantic search, and related AI concepts are accurate and well explained with no significant misinformation found.

Fact Check: Enhancing Vocabulary Retention Challenges and Solutions

This fact check analyzes claims about vocabulary retention challenges, recognition lexicon concepts, and memory techniques for learning new words. While many linguistic concepts are accurately described, some statistical claims and interpretations require context or verification.

Fact Check: The Amazing Digital Circus Series Review Analysis

This fact-check analyzes claims made in a detailed personal review of The Amazing Digital Circus animated series, focusing on plot points, character details, and production information. Most claims about show content and characters are accurate reflections of the series, though interpretive opinions and speculative theories are present but marked as subjective.

Fact Check: Dhurander Film and Realities of Terrorism & Intelligence

This fact check analyzes the extensive claims from the discussion about the film Dhurander, its portrayal of terrorism, intelligence operations, and geopolitical realities involving India and Pakistan. It verifies the authenticity of historical events, terrorist profiles, intelligence insights, and socio-political contexts presented in the video.

Fact Check: 2016 Cultural and Workplace Stories Analysis

This video presents a conversational recount of events and cultural moments from 2016, personal workplace experiences, and social observations. We fact-check claims related to notable 2016 events, workplace practices, and other historical references, clarifying their accuracy amid anecdotal storytelling.

Most Viewed Fact Checks

Fact Check: April 2026 Regulus-Sphinx Alignment and Biblical Prophecy

This fact-check examines the claim that the star Regulus will align with the Sphinx's gaze at Easter 2026, signalling a significant spiritual or prophetic event as proposed by Chris Bledso. We evaluate the astronomical accuracy of the claimed alignment, the biblical connections, and warnings about deception in prophecy.

Fact Check: April 2026 Rapture Predictions and Related Claims

This video makes multiple prophetic and biblical claims prophesying an imminent rapture event around April 4th to 5th, 2026, linking various visions, interpretations, and speculative timelines. Our fact-check finds that these claims are unsupported by credible evidence or mainstream religious scholarship and involve unverifiable personal revelations and misinterpretations of historical and biblical texts.

Fact Check: Prophetic Claims and the Essene Calendar Explained

This video presents claims about the prophetic significance of the Essene calendar, its connection to biblical prophecy, and recent historical events. While some historical facts about the Dead Sea Scrolls and Jewish history are accurate, the prophetic interpretations and calendar correlations remain speculative and unverified by mainstream scholarship.

Fact Check: Claims About Noah's Ark Discovery on Turkey's Highest Peak

This fact-check examines the sensational claims of an alleged Noah's Ark discovery on a Turkish mountain peak, analyzing the archaeological, scientific, and biblical assertions made. Our investigation finds no credible evidence supporting the extraordinary details presented, many of which contradict established science and historical knowledge.

Height Growth Fact Check: Nutrition, Exercise, and Sleep Truths

This fact check analyzes claims about human height determination, focusing on genetics, nutrition, exercise, and sleep. While many claims align with scientific evidence, some statements are oversimplified or lack nuance. We provide a detailed verification of each assertion with supporting sources.

If you found this fact check useful, consider buying us a coffee. It would help us a lot!

Vector Databases Explained: AI Tech Fact Check and Analysis

Generally Credible

Claims Analysis

Related Fact Checks

Vector Databases & AI: Fact Check and Technical Overview

Fact Check: Enhancing Vocabulary Retention Challenges and Solutions

Fact Check: The Amazing Digital Circus Series Review Analysis

Fact Check: Dhurander Film and Realities of Terrorism & Intelligence

Fact Check: 2016 Cultural and Workplace Stories Analysis

Most Viewed Fact Checks

Fact Check: April 2026 Regulus-Sphinx Alignment and Biblical Prophecy

Fact Check: April 2026 Rapture Predictions and Related Claims

Fact Check: Prophetic Claims and the Essene Calendar Explained

Fact Check: Claims About Noah's Ark Discovery on Turkey's Highest Peak

Height Growth Fact Check: Nutrition, Exercise, and Sleep Truths

Start Taking Better Notes Today with LunaNotes!