Vector Databases Explained: AI Tech Fact Check and Analysis
Generally Credible
10 verified, 0 misleading, 0 false, 0 unverifiable out of 10 claims analyzed
The video comprehensively covers vector databases, detailing technologies like clustering-based indexing, HNSW, ANNOY, and LSH, supported by real-world applications such as semantic search in ElasticSearch and use in large-scale AI systems. The presenters accurately describe theoretical concepts and practical implementations, including challenges like computational expense and optimization strategies. The discussion of LLMs, fine-tuning, and retrieval augmented generation is factually sound, with clear linkage to vector database roles. Minor informal language and lightly speculative remarks do not impair overall factual integrity. The video is highly credible for audiences seeking an insightful introduction to vector database technology and its AI applications, earning an overall credibility score of 88.
Claims Analysis
Vector databases represent data as mathematical vectors in n-dimensional space to enable similarity search.
Vector databases store data items as vectors in high-dimensional space, allowing geometric proximity to define similarity, as illustrated by clustering of related items in vector space.
Nearest neighbor algorithms find the closest vectors (data points) to a query vector within this vector space.
Nearest neighbor search is a fundamental algorithm in vector databases to identify vectors closest to a query, based on distance metrics like Euclidean or cosine similarity.
Clustering-based indexing (e.g. product quantization used in Facebook AI similarity search "FAISS") improves search efficiency by creating memory-efficient representations.
FAISS by Facebook AI uses product quantization techniques to compress vectors and cluster them to speed up similarity search efficiently.
Hierarchical Navigable Small World (HNSW) is a proximity graph index algorithm enabling efficient vector search via navigable layered graphs.
HNSW creates multi-layer graph structures for efficient approximate nearest neighbor search by traversing from upper layers down to closest nodes, ensuring speed and accuracy.
Approximate Nearest Neighbors (ANNOY) is a tree-based indexing method for vector search, widely used in production.
ANNOY uses random projection trees to approximate nearest neighbors for fast vector search, developed originally at Spotify and open-sourced.
Locality Sensitive Hashing (LSH) uses hash buckets to quickly approximate nearest neighbors in high dimensional vector data with less computation.
LSH hashes vectors so that similar items map to the same buckets with high probability, enabling sublinear time approximate nearest neighbor queries.
ElasticSearch, originally open-source document database, is used for text search and recommendation and has features enabling vector search.
ElasticSearch builds on Lucene and supports vector search capabilities, widely employed for semantic search and recommendation systems, with commercial extensions after Amazon's involvement.
Large Language Models (LLMs) require months of training on thousands of GPUs and consume energy comparable to mid-sized cities.
Training large models like GPT-4 is known to require significant computational resources and time, with energy consumption estimates comparable to medium-sized cities.
Retrieval Augmented Generation (RAG) systems use vector databases to enable LLMs to recall and generate more accurate, grounded responses.
RAG models retrieve relevant documents from vector stores to augment language generation and mitigate hallucinations, combining search with generation effectively.
Fine-tuning large models involves training parts of the neural network (e.g., last layers or adapters) to adapt model behavior for specific tasks or styles.
Fine-tuning adjusts selected network parameters or adds adapter layers to customize pretrained models, a standard method to specialize large models efficiently.
[Music] yeah [Music]
just this is [Music]
this [Music] yeah
[Music] n [Music]
hey trick lost [Music] camera oh yeah about that I comfortable
like was going to give just a short intro about and you could totally start from
your first SL just do you know whatever you would feel most comfortable with you
know abely yeah I I [Music]
I [Music] I I'm even going to give an intro where
we mentioned that that this is kind of and then absolutely we have a Tempo worked out it's going to be
good yeah I mean I think it's a certain numbness that comes
I mean you set your your gost jump right I know some people years ago they were doing with
teain so big anymore but 10 years ago or so they were everything you want to do
cool from one all right we're good I think we're good to go you ready we're starting on us ready yeah we're starting
this way today all right well happy Sunday everyone welcome to hack Sunday thanks for joining us I
know it's a beautiful day out outside and you know I'm glad that other people share my enthusiasm for being inside and
learning so here we are today welcome to hack Sunday for those of you who don't know uh we are a live Production Studio
we actually do Focus here on live production we do live streams here all the time my music's still on let me turn
that off um and today we're here for heck Sunday we're here to learn how AI is taken over
the world and the this today's talk started so for anyone who missed uh last week's talk or didn't watch on YouTube
we started a conversation about Vector databases which began with gav teaching people in the telegram chat me being
lost and saying what the hell are you guys talking about can you please explain this to me so we did part one
last last week where Gro tried to explain Vector databases to me he did a amazing job I'm still confused as hell
uh but we did learn a lot and that's the goal here at machine learning together so today we got two people to try and
teach me we got grav and Stanley we think between the two of them maybe they can educate me here um so with that I'll
I'll turn it over to you guys um if you want to give a quick little intro and tell us what we're talking about and um
and I'm actually so excited we get a second day talking about Vector databases and um um I I I have to say
though it's it's mostly my fault I was asking too many questions last time so just going to cut through the middlemen
and I'm going to be up here um you know sort of contributing a little bit to gv's flow but yeah I think we're
actually going to start fresh kind of go through the slides cover the main topics and even for y'all non-technical people
um the systems were talking about underly all of this large language model stuff and and so these um these tools
and these Technologies are going to be a big part of the rest of of our lives so sure um since we are doing like a recap
I'll I'll do it quickly um so first of all we'll start with what is database so database is a is a technology that
stores and retrives information or data and then we went through types of databases we have different types of
database such as document database most most of you are familiar with relational database um um during the last talk we
also talked about like analytical database which was like used for business intelligence and uh we we had a
BRI um long talk about it then there is favorite uh Stanley's favorite graph databases and then uh you have like all
sort of um no SQL or not a SQL database uh which is like Dynamo DB or um and most of these databases are either
column based or they are key value pair or they are document stores so we went with them uh through them a little bit
and then we went with different use cases of uh databases that uh um that people use uh
and here here is the slide that that explains it quite well with the examples and we went through that one and finally
we introduced what is a vector database so Vector database uh is is a is a representation of the data in n space um
uh sort of representation right so uh as you can see in this picture like all those points which are animals they are
clustered together and then uh fruits which are uh which are data points are clustered together so uh basically um
what data Vector database stores is the mathematic itical representation of that data and uh we we did introduce that uh
a little bit do you want to add something to that Stanley yeah um um it just U the idea here is that um the
relationships words have linguistically within a piece of text become geometric relationships in the space it's a very
beautiful kind of transformation and then one thing I would love to say is that I feel like um and it's uh starting
to become clear that our brain actually uses this to store information right and so I always think like when you say
you're reaching for a word that's actually true in the vector database in your
brain all right um so then we went through how do actually Vector databases work so as you can see in this um in
this slide that there are images documents and audio these are like different formats of data we trans
transform them into embedding embedding is nothing but a vector U mathematical representation of that data which we uh
depict over here U with the vector representation and then we use some sort of uh nearest neighbor finding
algorithms and then uh those algorithms help us uh do the like do any kind of query on those um yeah and and last time
we also went through some uh uh some distance metrics uh uh in detail like what's the mathematics behind it um
gr up I'm I'm so sorry so quick too but I just wonder is there any chance you'd say a little something about what a
nearest neighbor algorithm is sure um so nearest na algorithm as as you as you know that the the whole Vector space uh
you can you can think it it's sort of like n dimensional space and now you want to find
uh you want to find say suppose uh let's let's go to the example over here like I have um say suppose I have a query which
is which falls into um into like animal uh let's say uh uh I I can't think of animal say say
elephant right so where do elephant closely matches with with this category and that's that's what the nearest
algorithm uh neighbor algorithm would do it will try to find the uh the centroid of of that particular cluster and it
will try to put that particular query or the sample into um into that cluster do do you want to U it's so well described
um and then just to emphasize a little bit um so for example wolf might be close to dog because they're both
animals it might also be close close to dog cuz they're both mammals right could also be because they're four-legged so
we think that there's all of these different dimensions of similarity and then the K in K nearest neighbors means
it's looking at a number of different dimensions right right um anyway it's just so cool I I love this picture man
this is perfect yeah this uh I I also love this picture because every human being had has this inherit ability to
think in 3D right we we understand 3D space we don't understand like four dimensionals or n dimensional spaces so
um I I really love this uh uh this sort of representation of vector database and uh that's why I chose this particular
slide I have one friend who can think in four or five dimensions and I ha her such it's such a
superpower that's that's awesome you have a question I'm supposed to limit the questions but let's see oh sorry no
I'm just Kidd apologize if you're going to get to this later I can hold off on it I just wondering if you could speak
to the challenges of choosing vectors oh 100 million per and then actually the main reason I'm here is we're actually
going to look at a small case study at the end that relates to um generating embeddings constructively and then just
for example has anyone had the pleasure of giving an image to chat GPT and then chat GPT can answer questions about the
image um that is what's called a multimodal learning system right and the way that works is you take the text and
you create a set of Vector embeddings in the same space as you embed the images so it's kind of like you're creating a
common language for text and images and and that's a big reason why everyone's so excited about these Technologies
right now but anyway for forgive my digression no that's uh that's super cool and then we uh we dealt with little
bit on um uh what's what's like the traditional search and semantic search and uh um I
also presented like two uh real case uh demos that that um um uh that we went through uh one was a very simple Vector
uh store search where it was using like traditional uh sorry it was using the semantic search so um what this demo was
um I took like a PDF um and and I gave that PDF to a vector store and then creating uh created embeddings from it
and then we tried to do a semantic query on it uh the second demo was uh um was for images same thing we did for for a
large uh large data store of images but here we tried to use traditional SQL queries um with uh with a with a single
store database um I will uh I will go through these demos uh again um if needed but uh without further Ado um uh
I would like to jump into how can we optimize these uh these algorithms right because right now it feels like um you
are trying to you have data stored somewhere you have these mathematical representation
of data which you are storing in Vector databases and overall like how can you um how can you optimize how can you
improve the search quality and the search efficiency so um researchers have come up with different sort of
algorithms we'll go through some of them um uh today um so first one I would like to think about is like clustering since
since I was talking about clusters creating clusters of data points and stuff like that so the first algorithm
that I would like to go through is clustering based indexing which is um f as an example so f is Facebook AI
similarity search um it's it's like most popular um in in the vector databases and uh what what F does it
optimizes uh the query by using uh different me methods most popular they use is called Product quantization and
what product quantization is um is basically each data set is converted into more memory efficient
representation uh which is called PQ code and um and as you can see uh in the image like you have these original
vectors which are nothing but the embeddings but then you slice them you try to Cluster those vectors and then
you finally get like the common point which is uh as as Stanley said like this common point would be for example
four-legged mammals right so four-legged mammals have like one common centroid point and they have all the animals that
are like really close to each other would you like to oh no so I'm so well described and um and forgive me for
being Mr asteris on this talk I mean just gorov is is saying all the main stuff so all I can do is H put a cherry
on top um but one thing I I would like to say is like man think what you will of Facebook as a social media platform
but has their engineering team ever contributed a lot to this space yeah they they have because they they see
these um these sort of problems that that other world or other businesses do not see they have
like millions of people getting onto their website side and you you need to have those optimized searches you need
to find your friends like within seconds right so so they came up with uh um with these with these ideas with these
techniques so that you could you could find the um uh the people closest to you and you can build that social network
and again they wanted to also try to do targeted advertising which which was like their source of Revenue so they
wanted that really fast right you need to see see an advertise which which belongs to you and like uh I cannot show
um uh I cannot show an advertise of uh uh of bike racing to to a mother right like she she won't be interested in
looking at bike racing so it's um um one one funny thing too about these these Technologies is how powerful they are
right um so there's actually been some controversy um related to Target MH where where Target was serving ads for
uh maternity related Goods to to teens because their algorithm said that they should and then it turned out the
teens were pregnant yeah um and it it is just really amazing because we are such social animals so
the connections that we have to our communities and our culture re really kind of gives you a fingerprint of our
cultural DNA and and you truly can extrapolate between people in a way that is almost Magic right oh Ed go ahead
brother I'm sorry just so I think what you're trying to say and I think understand this now sliced sub vectors
might be the original vectors some representation of a person on Facebook their profile and all that but you might
slice it up for advertisers differently exact slice it up for maternity people exactly exactly so uh these sliced
vectors are like based on common feature as um um as you correctly pointed out now quick question and again we'll get
on to the next slide stat people but is the motivation for the slice sub vectors completely that it provides like a kind
of specialized perspective on classification or is there computational
motives as well there is a computation motive as well because U as you slice them you have less number of
mathematical representations to go through less number of computations so so definitely it helps and again um as
you said like the motivation was also coming from having like a certain feature right because if you have a
certain feature it will help you in that particular business and and as um as you brought up uh like multimodal similarly
um it's it's like multifeature at the same time so yeah it's um it's so cool and then I always also think about this
a little bit as like um you know because I think these things are a reflection of processes that happen in our mind right
or at least I always like to look at them that way and and for some of these slic sub vectors you could almost think
of like like anyone anyone into sneakers any any sneaker heads here big sneaker head myself there's a whole
cultural classification layer where it's like you know you see someone with panda dunks and you're
like you know um and and so it's just kind of interesting it's like these you might almost think of in the human
context as little subcultures right so let's move to the next algorithm which is like proximity base
graph index and this one um as it says hnsw hierarchical navigable Small World um this is this is just a just an
acronym of like imagining each Vector data base a as a as a world and then you have subclusters as um uh as Stanley
said like subgroups or um subcultures right so um so it's that and uh if if anyone is uh interested uh in knowing
but the data structure that this uses is called skip list it's it's like just another list but instead of um storing
the next pointer uh to to to the next uh list node you store its um uh like each each node is basically a
list so you just store its head and then you try to Traverse through it to it so skip list is a is a fun uh like
wonderful structure and uh as you can see like there are these layers right and and you can imagine how each layer
is is a different list altoe and uh this is like another another optimization algorithm that researchers have come up
with to to do efficient searches on Vector databases do you want to add oh I I shouldn't but it's just so beautiful
um isn't there something kind of interesting happening in this picture where like each layer is a network right
but then it's a graph a graph or a network but then aren't the layers themselves a
network yep and and there's there's actually some real depth that that comes from that idea right right anyway this
is so beautiful man what a what a good illustration awesome um a network of networks
yeah and then then same thing uh again like since since most of these algorithms are computer science based so
people you try to use the same tools that they are familiar with so on one hand we are looking at skip list which
is like a data structure of a list now the next one is a tree so we have seen so many um so many applications of uh uh
tree searches so they said okay now let's try to do some tree based indexing if if it's possible right so ano is
again as I said these are all nearest neighbor algorithm they are they are algorithms to find like who is the
nearest to the centroid of that cluster where can I put this particular sample um uh like in which cluster so this one
is called anoy and uh it's kind of funny with the name so it's approximate nearest neighbors oh
yeah so I don't know uh why they had to add U uh oh yeah at the end of it but uh it's it's so cool um and uh there are
multiple uh well-known Vector databases that uh that use anoi as their technique um uh to their uh do their queries do
you want to add to that um I I I do uh I I have always wondered about the the what was behind the naming
of this um a lot of these Tech techniques are actually used for type of algorithm called an annoyance algorithm
right and so I wonder if that might have been the origin um so so YouTube actually has a model built for each of
you guys that tries to predict when you're reaching being too annoyed by ads like they try to predict when you're
going to rage quit YouTube because there's too many ads and then they'll show you 10 seconds less than that
number and so I've always wondered if maybe annoy came out of annoyance algorithms um kind of a a silly thought
but um but that's true I know I was uh uh was done by Google so it might be true it it it could be I I'm I'm going
to look into it but this is um this is so cool yeah awesome um then this is the oh LS this is your your favorite I want
you to get uh go go for it no no no um um forgive me I just actually past couple weeks have been working um to use
lsh in a non-linguistic situation and so getting a lot of time um to work with it um one thing that's kind of fun too is
is like um there is this uh story in Science and Mathematics where we keep coming up with weirder and weirder GE
geometric spaces right so we're we're like oh three dimensions we live in three dimensions we can think of that
right what about 100 Dimensions right what what about 100 curved Dimensions right um and so it's kind of fun because
it's almost like there's this story of trying to find more and more powerful ways to use nearest neighbor
relationships and um and sort of being like Oh we have a phenomena that's very complicated is there a way to imagine a
space that's so complicated that all questions reduce to nearness in that crazy space um and then lsh is is also
just very beautiful maybe you should do the intro to lsh and then I can share why it gets me sort of overe excited
sure so uh as you can see on the slides like each key is is the mathematical representation the embedding that we
were talking about now we give it to this fuzzy hashing technique which uh which uses this function and tries to
put them into hash buckets so now when you're searching for your query you don't have to go to the original
embedding but you can just look it in in a specific hash bucket um do you want to yeah absolutely so um our uh uh Kish
neighbor algorithm for this community here that would be Jerry and I'm kind of kidding but I'm kind of
not kidding right because Jerry um helps us get organized he helps us figure out who to connect to that's why we make
y'all sign up on the hack sunday.com invite your friends um right and and it's so that Jerry can have the
information of like oh this guy's interested in that I'm going to connect him to this person this person is
interested in that so I'm going to connect him to this person but then you think about what happens as the
community grows and there's more and more dimensions of variation and more and more people that our Jerry would
have to keep in mind in order to make the right connections and this is a problem that Google has right because
you imagine the scale of their user base and then you know maybe they have Ed who's um a new user and they're going to
try to search in their Universe of billions of other users who is closest to Ed who is most like Ed um and to
actually go and compare Ed onet to one with every one of the other billion users and to do that for every single
user it's computationally really expensive yeah extremely expensive scales very poorly this is the solution
which is they created a function that's kind of well the way that it's built is a little complex but they've built a
function so that you apply it to a single person and it outputs almost what you want almost that classification
without needing to query the entire network and and that's why it's locally sensitive right awesome and and then uh
yeah as as you explained with a um uh with that like for for example uh Jerry's exam um taking take Taking Jerry
as the centroid of it now he has created these buckets which is like one is this science talk another would be regarding
web3 another one you do is uh on on artists right so so now you know you have this hashing function in your mind
which puts us into separate cluster right and now when you when you want to uh search for like science you will just
just go to your science bucket right and and then you can pick those values you can put say suppose um um some Mr XYZ
who is who is like into Finance so so now you don't know where to put him so you will try
to uh like do some query with him you'll you'll try to ask hey you are in finance are you interested in blockchain then he
goes to that crypto bucket right so so that's that's the hashing function algorithm those questions that you will
be asking this person are those hashing functions um um and and yeah just just to mention
one thing um lsh is seriously like maybe one of the core Technologies behind Google right um they they run lsh on
just extremely um large unprecedentedly large data sets right um but but anyway here's just a fun and interesting thing
um when lsh was developed to run on social data for Google it was the only data sets in the world that were that
big and so in many ways like the edge of data science in application to social media drives a bunch of the rest of the
field but we're starting to have similarly large and complex data sets in other areas so a project I've been
working on for the past month actually is um applying lsh to genomic data and so actually mapping genomes to these
kind of hashes and using them to understand what a genome does like what kind of chemistry it performs so it's
just very interesting and it's kind of I think part of the story of these technologies that they kind of start in
big Tech and then find other applications yeah that's that's really awesome so uh next one here uh it's like
self-explanatory and uh it it does um this sort of Animation where you are putting all these um all this training
regarding uh regarding shakes Shakespeare's uh literature into the database Tower and then you say okay I
want to find uh a a story which is done by Shakespeare and is related to tragedy now you haven't put any uh any metadata
saying like this is the story about uh about tragedy or this is the story about love and romance but as you can see uh
when the query happens it falls like closer to King Lear and uh Rome and Juliet which is which is a romance and a
tragedy uh whereas King leer is also a tragedy based uh novel so so the this is the uh this is the overall um use uh
that that scan is doing uh basically it is using CNN uh convoluted neural networks to um uh to find the nearest
neighbor uh as we we are talking like every time it is the nearest neighbor right and these are just the
optimization uh algorithms for it um as you can see it is it is is uh it is doing some sort of compression based on
neural network so that you you get uh uh you get like efficient and uh closest nearest neighbor uh for uh for your
query um yeah and uh do you want to add something to that um oh absolutely I
mean it's just so cool that it's sort of like neural networks or how we create and interact with these spaces
Ito kind by put the out in the high dimensional um and then uh convolutional neural networks are very interesting
because it's actually like very explicitly their um deficiencies that led to the invention of large language
models yeah um the convolutional part of a convolutional neural network means it collapses um a number of things down to
one thing and then the idea would be that it's recognizing that a couple different things are really the same
thing and so it's okay to collapse them together and and sort of compress the information that way but then it's also
very possible to lose important information through right uh compression and and for that reason convolutional
neural networks sucked at producing language um actually if you remember the days of Google
translate where it was almost a comedy show you know you put in something you get out something so broken that you're
like whoa um that was all convolutional neural networks and then it was all the um desire to have a little bit of a
better tool for interacting with these spaces that gave us llms right the Gans that's
right awesome uh again someone wants to know this uh scan algorithm uses the ukian distance um as we had um um as we
had discussed earlier um with that so here's like the the conclusion side from from my end and I I I did this sort of
research to come up with the Victor database comparison so as you can see I have uh listed down some of the uh um
some of the Technologies which are uh uh which are out there for data uh Vector databases so pine cone is is uh um uh
one of the one of the most used and it's it's like free so uh that uses the a Ann which is the approximate anoy as we all
know um uh as the approximate nearest name uh algorithm the use case is realtime
search and recommendations um uh similarly vate and also one of them uh that I used was
single store they all use uh hnsw so uh this was the the layered based approach um that and again U as it uses graph it
is used for knowledge graphs right um and then uh elastic search is one which I used uh um
uh pretty much in my uh in my implementations so uh it's like really good for Tex search and recommendations
so one of the applications that I did uh with elastic search was um um I created uh this uh this auxiliary
database to um uh to to our postgress uh datab base where we where doctors could search for their appointments um and uh
uh we had this uh what do we call um like a call center where patients could call uh and and change their
appointments or change their doctors and um that's where uh it was taking like almost 3 minutes for um for our postgis
database which had like millions of users to to search uh the appointment even we we tried like different indexing
mechanisms but elastic search was like really good uh tool for us to uh to find that and it
just reduced it to 300 milliseconds um I was very closely working with the managers of the skull Center and they
said like we really had to make train our um uh our call center employees to talk to the patient about like weather
and how their family is doing and stuff like that and people you were getting annoyed um so yeah uh so so this elastic
search really uh really helped uh it's a it's it's a noral tune it is based on uh Lucine
Lucine is uh is aach open source based um uh sort of like precursor to Vector
databases um it's um uh it's more of a document database but um but people have found even using that document database
could be uh could be a good use case for Vector databases as well um myself a big elasa search fan
used on many key projects right um do do you have an opine on the Amazon hostile takeover of elastic
search um of course like uh when I used elastic search uh back in 2017 it was open source and we could we could
recommend U um changes and uh I was I was happy to take their source code and make certain uh subtle changes that that
were required specifically for my use case but now since um Amazon has taken uh taken them over we we do not have
that uh that Liberty and again Amazon uh has taken this open source technology and made them commercial so and it's
kind of like you know I shouldn't I I don't have a too much of a leg to stand on to talk crap about Amazon so grateful
for the S the cheap servers guys thank you um but at the same time Amazon has this big Vibe of like hey that's a nice
open source s project you got over there it' be a shame if anything happened to it
right so it's it's actually kind of one of the funny stories that actually and and and push back on on this if I'm not
correct in in terms of the kind of role in the space but um Vector database has been along for a while elasa search was
the leading database for most of the history in this area and then right before this llm stuff happened Amazon
did this like hostile takeover of the open source group and and the usage from my perspective really fractured it's
really kind of like a tragic story right on some level uh otherwise uh if elasa search was um um was open source still
uh most famous Vector database would be elastic search um but uh really really created the space for Pine con and I I
have to say love Pine con yeah Pine con is really cool so yeah this was uh uh this was my conclusion and uh if anyone
is uh interested uh I think I I have a small link up there uh I'll share the slides uh once
it's done and uh Here Comes Your Part uh Stanley um so this is the this is the use case uh you wanted to share with
everyone regarding graph databases and Vector databases in general so absolutely and it'll just be like a
little uh um uh addendum to the incredible presentation from gorov those slides were incredible so much fun to
talk about it um and forgive me I meant to actually create a few more slides with this but
um I had some kind of help with slides lined up for yesterday but uh family emergency happened um but anyway this is
like what a production data system looks like um it's kind of painful blurry
actually um and actually let's see if we could zoom in maybe on the um the top left of it
each of these nodes is a different type of data um and then we have a relationship that's not the right right
part sorry which one do you want Oh bottom left uh bottom left part um the the the
not that says sample if we could zoom that in a little bit um we we do have a notebook that's
kind of related to um looking at embeddings that gav had and I I I think maybe we can get that shared and then
right after I'm I'm done flopping my gums we can jump into some collabing together if you guys want but kind of
just wanted to express sort of like one use case of this so right here we're actually looking at a Marine Science
data system so in the middle you have a sample and that just means someone went out in a boat collected some water
sequenced the DNA in the water and then the sample has a number of different
connections um on the right the sample is connected to a node that's says mag a and and that is the actual genomic
data that was found in the sample of water and then over here on the left in the you see it goes over to ecological
event and then remote sensing so remote sensing just means every time we sequence the DNA in the water we also
take a picture of it so what we're attempting to do is understand what can we learn about the
genomics of a system by looking at it visually and vice versa right in order to
understand those connections though we need to embed both pieces of data in the same Vector database right so on the
left side we have um images on the right side we have sequences of uh acts and G's like we
have genomes and then we need to figure out how to put all of them in a space together with each other in the same way
that um you know gorov showed you and again it isn't always clear how to do that but um this is what uh comes out of
these systems when we structure them properly with the help of large language models um we're actually quite confident
that what we're going to be able to do is actually translate back and forth fluidly from a picture of a microbial
ecosystem and the exact genomic data that um that it uh that it contains um and anyway you know maybe
even at a future talk we could look at some embeddings together sure um with that data and um and on that note maybe
let's get the collab notebook back up and see if we want to mess around with it a little bit
sure want drop downam let's go to demo one
it's here I can minimize this today so here's the demo one as I was
talking about post this in the telegram channel right please
thank you all right uh let's see if we have
anything okay so do you guys want me to go over this notebook again um uh we had gone
through it uh last time okay yeah me give people a second to open it up but maybe while they're
doing that talking about it high level would be would be amazing sure and so this this is so like kind of like you
were saying this creates a specif specific set of embeddings that can let you compare yeah two images well uh so
this demo is basically as you can see you can give like any PDF URL it uses uh it uses Lama 2 um because uh Lama 2 has
like really uh good good textual embedding um uh already inbuilt in it so we uh I tried to use that and then uh we
we just use those embedding and try to create our own Vector data store on which we can do some semantic queries so
um and as uh um as shown over here we are trying to use the cosine similarity um if we go back to
our um our presentation over here as you can see it's using this
particular formula trying to find uh the closest distance between the two points using um this cosine distance
formula I uh I love just distance metrics so much and then
um we create a vector data store and over here I have said can you tell me the concepts of safety
fine-tuning and as this um uh this document has some um some information about uh how to how
to uh create Vector databases and stuff like that it was able to come up with uh with certain answer for me so it it
found out like um these are um these are some nodes which uh which match with uh safety and fine-tuning and then uh based
on that um it it comes up with actual um actual answer so this has been generated by um
uh by the Lama 2 llm itself so and it's based on the vector data store that we have created out of that PDF so this is
this was the whole demo um that uh that this particular notebook does any questions yeah so so we uh uh we did
um yeah Ju Just to um just to consume the PDF the content in the PDF we had to uh separate it out in different
paragraphs but then since Lama 2 has its own embedding and it knows how to how to Cluster the text it has done it so I I'm
not creating my own embeddings I'm just using yeah yeah it chunks it itself yeah the
um the selection of a an embedding to to use in a system like this is kind of interesting and um open a I actually
just released a new set of optimized embeddings awesome um question how would something like this be used in like a
rag system oh yeah so um So based on that uh we could create our own rack system um for example now
uh I want to create uh like a store just for um just for this particular this particular PDF so
imagine like uh this this is not a PDF but it's it's like um uh it's it's like cluster of PDF which are related to
medical or uh related to law immigration law for example so one could create a rack system uh that will that will help
you uh just get specific answers not not something um related to uh whatever is there available on the Internet or
whatever again could could hallucinate about right uh it would be like specific to that particular domain and that's
where rack systems or slms as we were talking about it some time back are are really powerful um and who who know who
knows what I mean when I say a rag system does everyone know what R RG is said
last absolutely and it's an acronym and so I should have said what it was it's bad bad uh impolite to just use an a
naked acronym um but yeah so rag is retrieval augmented generation M oh it's right there yeah oh my gosh I should
have pointed to that um and and yeah it's interesting like has anyone had an llm hallucinate on
them yeah it's it's a rag is one of the main solutions to hallucination about other things and it's it's basically I I
like to describe it as a long-term memory for the model but essentially just through these mechanisms that gav
has explained to us a large language model that's asked a question can actually go find the bulk
of the answer and then then kind of inject that into the response right it's very interesting and um you know I
actually um uh we'll need to think a little bit about what a good uh next talk is I I think we should go a little
bit more in the art direction for the the next talk but um down the road maybe we could present on fine-tuning together
yeah sure and and because that's actually one of the really interesting things is um not just um fine-tuning a
model but fine-tuning and ragging a model so sort of not just tuning a model so that it says things you like but so
that it takes things from Rag and then rephrases them in a powerful way so so just for uh uh just for audience can you
elaborate a little bit on fine tuning and like what is the difference between um between fine tuning and rag
especially oh I would love to but actually I would wonder maybe if if you could go back to your slide and go back
one or two slides to the those beautiful animations you had of the actual networks Rippling as the data passes
through them because right there is what a a neural network looks like it's all these little sorry oh please um dots
that are connected by lines and then I like to think of it as like um the intelligence is stored in the Dynamics
of the Rippling like you see there's the bottom of the neural network and it sort of shakes and then the next layer shakes
and then the next layer shakes and that's what a neural network is is it's a a data in data out machine and then as
the data ripples through the network it's changed right um so that's kind of what a neural network is how it works
really obviously handwavy high level um and then how does it like get the intelligence though it's this training
process that we've talked about right and this the training process for gp4 for example it's estimated that it was 6
to n months of 25,000 large computers working together like we're talking the energy bill of a
mid to large size city just to train one of these models um crazy right um and that is sort of
like um happening because there's so many of those little dots for example um gb4 is estimated to have a trillion of
those those dots that're all interconnected so every time they feed new data through it the data changes the
values of every single one of those little dots so it's it's sort of you imagine a trillion different numbers
that are constantly being updated and tested and their inter relations are being messed with and understood um so
on the higher level what you're are trying to say fine-tuning is basically changing those connections that are
between those dots so fine-tuning yes is changing some of the dots but not all the dots so it might be something like
we Chang just the last layer or the last two layers right or in the case of the most used uh technology which is called
Laura which stands for low rank adaption you actually add on a little extra layer of neurons at the end that are almost
like the translator for the model and then the translator learns to make the model sound a different way by taking
its output and then restructuring the output okay but overall it's kind of like you're restarting the training but
you're only picking a certain part of the network focusing your energy on getting that part of the network to
sound like you want and it turns out you can completely change the behavior of these models with just that little bit
of tuning and then again I like to call this um giving your model a master's degree nice a single ler it it depends
there's a number of different approaches to find tuning and different ways to do it um it actually usually wouldn't be
just a single layer uh it could be um but yeah as you add more neurons the computational cost it increases very
quickly um yes yes um yeah off off hand I'm I'm not
100% sure um with some paper in a little time I could work it out so maybe after the talk I'll I'll do my best to give an
estimate um the trillion parameters is somewhat misleading though because chat GPT is an eight-way mixture model so
they actually trained um about a 120 billion parameter model and then there's eight copies of it working together in
gbd4 Al although they do fine-tune those independently so um um
nice so that was uh demo one let's move to demo two the demo two one uh I hope uh it has
those uh sorry about that but what this um this would do it would take uh like almost
7,000 images um that that we have uh uh that I found um on the internet and these are like 7,000 images of different
celebrities and uh what we did is um and this was this was also um done by single store DB which is uh which is just
another database like pine cone um so what single store uh DB people did they they basically came up with uh uh with a
mathematical function uh called dot product um if we if we um go back uh to the slides uh here is
the here is the mathematical function for DOT product so it's uh it's taking uh
it's taking a DOT product between two two Vector quantities um as uh as known in physics so this is the formula to do
it but doing this on data was uh was something that um uh that single store DB guys did and they created like a
traditional SQL sort of query that you can do on any sort of vector databases and this is the same um uh same sort of
technology that I used and as you can see um let me go down the dot product is so beautiful yep
and one one of the most important discoveries in human history true um it solves many physics
equations still it's maybe one of the most useful tools in all of physics and then what the dot product does is you
give two vectors and the dot product answers the question to what degree are these vectors going in the same
direction basically um but we using the same thing in um in Vector databases so uh as you can see in I don't have the
line numbers but uh as you can see in the query it's it's doing like a traditional um uh traditional SQL like
query you can select a file name where doc product of vector is um uh is found out and basically what this does it
tries to find out the images that are similar in in features um I think uh last time we had uh whose picture did we
have Jerry uh yeah some celebrating um so uh his
picture was was selected and um it could find like similar five pictures um of them yeah Alec balwi yeah that's what it
was too and it was very funny because it produced two that were wrong but they were very Alec Baldwin looking dudes
right I I don't have that example today but uh yeah um some people might uh might take
it and uh run it today um and show us like um different different celebrities if they can come up with that and uh I
would love someone to to see if they could query uh William Rohan Hamilton any any Hamilton fans here the
inventor of the dot product Oh really uh I don't know if he would be there in the celebrity database no he
might not be but he's he's a big hero of mine um he was out for a walk with his wife and they they were just uh walking
over a bridge and he had a Eureka moment so he uh ran over and actually carved the equation for the dot product into
the bridge and it's h still there it's considered a pilgrimage site for for mathematicians wow and Engineers wow but
then um I always think about that anecdote and feel bad for his wife yeah any questions before we we jump
into the collab notebooks and then also maybe we should do another round of applause for thank
you I have to say I I I do this stuff every day it's been my career for 10 or 15 years and I I don't know if I've ever
seen a crisper sharper presentation of the good ideas that was so beautiful my friend thank you thank you
Stanley I question vector
come on Christina so I love that we've been um studying a lot of different
functionality today or at least you've been introducing it to us in terms of similarities in terms of vector
databases and all the different ways that that can be applied but are they also working on models to study
polarities because that could also be useful in many different contexts so could you elaborate on what
sort of polarities like dark light you know like but alth some of them are not are a little more similar than they
think which is kind of funny you know well I mean I would I would say sort of in the same way that these things learn
to understand patterns of similarity they necessarily also learn patterns of difference so right for example when we
were looking at examples last week um we we looked at an example of gender as a dimension that is
often part of these models um because our language is so polarized by gender it shows up very clearly in the
embeddings what language is male and what language is Fe is female and so I would actually say that's one of the
very exciting things about these Technologies is they allow us to quantitatively see um these kind of
polarizations or biases that exist in our language day-to-day um so we can actually see that there is this big
difference in in uh uh language related to gender which you know I think most of us are very familiar with as a a core
Injustice of our society yeah and uh just like these are two different uh uh School of thoughts right
one scientist might say okay we are finding similarity so we don't have to find dissimilarities another scientist
would say oh we need to find the similarities so that we can cluster the similarities together so like doing one
will automatically find the other so you don't have to do it explicitly as such thank you oh such a good question too
yeah and that shows everyone was listening to us no I have to say I thought it was so good and thanks
everyone for for being here and if there's you know no more questions we can wrap it up and I'm here if anyone
wants to goof around with the collab or get some help with it and then we can also talk more in some little breakout
groups for for at least a moment or two I'm sure sure thank you thank
you awesome awesome uh trick are we out we are
A credibility score of 88 indicates that the video is highly reliable, with accurate information and well-supported facts. It suggests that the content thoroughly covers the topic with only minor issues that do not affect the overall trustworthiness.
The technical concepts were cross-checked against established sources and current research in AI and vector databases. This included validating explanations of clustering-based indexing, HNSW, ANNOY, and LSH with peer-reviewed papers and authoritative documentation.
While the video includes some informal language and light speculation, these do not misrepresent facts or distort key information. Such elements are common in educational content to maintain engagement without compromising factual integrity.
The video avoids oversimplifying complex algorithms, presenting unverified claims, or conflating unrelated AI concepts. It maintains clear distinctions between theoretical ideas and practical applications, reducing the risk of misleading viewers.
Vector databases are crucial for managing and searching high-dimensional data, which underpins semantic search and retrieval in AI systems. Grasping these databases enhances understanding of how AI models access and process information effectively.
Retrieval augmented generation refers to combining AI language models with external databases to retrieve relevant information during response generation. This process boosts accuracy and relevance in AI-generated content by integrating real-world data.
Viewers can check for clear citations of sources, alignment with established research, balanced presentations including challenges, and the presence of expert reviews or fact-checks. High credibility scores and transparent methodologies also signal trustworthy content.
Heads up!
This fact check was automatically generated using AI with the Free YouTube Video Fact Checker by LunaNotes. Sources are AI-generated and should be independently verified.
Fact check a video for freeRelated Fact Checks
Vector Databases & AI: Fact Check and Technical Overview
This fact check reviews a detailed technical discussion on various types of databases, focusing on vector databases and their role in AI and machine learning. Most technical claims regarding database types, vector embeddings, semantic search, and related AI concepts are accurate and well explained with no significant misinformation found.
Fact Check: 2016 Cultural and Workplace Stories Analysis
This video presents a conversational recount of events and cultural moments from 2016, personal workplace experiences, and social observations. We fact-check claims related to notable 2016 events, workplace practices, and other historical references, clarifying their accuracy amid anecdotal storytelling.
Fact Check zu Googles Quantencomputer Willow: Fakten und Fiktion
Dieses Fact-Checking untersucht die Behauptungen zum Quantencomputer Willow von Google, der angeblich Probleme in Minuten löst, für die Supercomputer Jahrmilliarden bräuchten, und angeblich mit Parallelwelten interagiert. Viele technische Details sind plausibel, aber sensationelle Aussagen entbehren wissenschaftlicher Grundlage.
Fact Check: Europe's Euro Stack Digital Sovereignty Initiative
This video examines Europe's move to create Euro Stack, aiming to reduce dependence on American tech giants for critical digital infrastructure. While many claims about dependency and strategic vulnerabilities align with available data, some specific figures and events are either exaggerated or lack independent verification. Overall, the video's core message about Europe's push for digital sovereignty is accurate.
Fact Check: Simple Weekly Organization and Productivity Tips Reviewed
This video offers practical advice on organizing your week using simple tools and routines to boost productivity. The fact check found the suggestions to be generally sound, emphasizing consistency, planning, and self-care without making unverifiable or exaggerated claims.
Most Viewed Fact Checks
Height Growth Fact Check: Nutrition, Exercise, and Sleep Truths
This fact check analyzes claims about human height determination, focusing on genetics, nutrition, exercise, and sleep. While many claims align with scientific evidence, some statements are oversimplified or lack nuance. We provide a detailed verification of each assertion with supporting sources.
Shopify Dropshipping Store $54K Revenue in January 2026 Fact Check
This fact check evaluates claims made in a detailed Shopify dropshipping case study, focusing on revenue figures, product research methods, marketing strategies, and supplier usage. While many claims about tools, strategies, and product selection reflect common industry practices, certain financial and operational claims lack independent verification.
Fact Check: Evaluating Claims on The New York Times and Media Coverage
This video transcript presents various claims about The New York Times' coverage of the Israel-Gaza conflict and other media commentary. While some claims regarding subscription routines and print media experience are subjective, the critique of the newspaper's coverage on the Gaza conflict includes factual assertions that are verified as partially accurate with some exaggerations. The overall video mixes opinion and fact, with some misleading framing of media behavior.
Fact Check: Understanding Narcissism - Causes, Types, and Effects
This analysis reviews a comprehensive podcast discussion on narcissism, evaluating the accuracy of claims about narcissistic personality disorder, its causes, types, and impacts on relationships. While the discussion conveys personal experiences and general psychological concepts, factual verification reveals mostly accurate information interspersed with informal language and anecdotal examples.
Fact Check: 2016 Cultural and Workplace Stories Analysis
This video presents a conversational recount of events and cultural moments from 2016, personal workplace experiences, and social observations. We fact-check claims related to notable 2016 events, workplace practices, and other historical references, clarifying their accuracy amid anecdotal storytelling.

