Understanding Linear Classifiers in Image Classification
Introduction
In today's lecture, we dive into the world of linear classifiers, a fundamental component in the realm of image classification. In previous discussions, we reviewed the challenges faced in classifying images, including variations in viewpoint, illumination changes, and image deformation. With an emphasis on building robust classifiers, we will explore the principles of linear classifiers and how they fit into machine learning workflows. Through various viewpoints, we will gain insights into their behavior and effectiveness in recognizing visual patterns.
Recap of Previous Concepts
Before delving into linear classifiers, let's recap what we learned in the previous session:
- We dealt with image classification problems, aiming to predict category labels based on input images.
- We introduced the K nearest neighbor (KNN) classifier as a naive approach for image classification, noting its inefficiencies, such as rapid training but slow evaluation.
- We discussed the importance of robust classifiers that can handle diverse variations in visual data.
The Need for Linear Classifiers
Despite the appeal of KNN methods, we sought a more efficient approach that enables practical application in real-world scenarios. Here, linear classifiers become essential. They serve as fundamental building blocks for more complex neural networks, helping us understand the underlying mechanisms of these systems.
What Are Linear Classifiers?
Linear classifiers utilize linear combinations of input features to predict class labels. The primary equation for a linear classifier can be represented as:
$$ score(class) = W \times X + B $$
Where:
- W is the weight matrix targeting specific patterns in the input data.
- X is the input image represented as a long vector comprising pixel values.
- B is the bias term incorporated to adjust the output scores.
For instance, in the CIFAR-10 dataset, each input image measures 32x32 pixels with 3 color channels, so it flattens to 3072 pixel values (32 × 32 × 3).
How Linear Classifiers Work
The output from the linear classifier provides a score for each class category. The classifier assigns higher scores to categories that best match the features of the input image. Essentially, linear classifiers simplify the representation of images by using matrix-vector multiplication, producing a score for each category.
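This score computation can be sketched in a few lines of numpy. This is a minimal illustration, not a trained model: the weights here are random placeholders, and the shapes follow the CIFAR-10 setup described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# CIFAR-10-style shapes: 10 classes, 32*32*3 = 3072 input features.
num_classes, num_features = 10, 3072

W = rng.normal(scale=0.01, size=(num_classes, num_features))  # weight matrix
b = np.zeros(num_classes)                                     # bias vector
x = rng.random(num_features)                                  # flattened image

scores = W @ x + b           # one score per class
predicted = scores.argmax()  # highest-scoring class wins
```

With trained weights, `predicted` would be the classifier's label for the image; here it is just whatever the random weights happen to prefer.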
Different Viewpoints on Linear Classifiers
To understand linear classifiers more intuitively, we can approach them from three distinct viewpoints:
1. Algebraic Viewpoint
In the algebraic sense, linear classifiers function as matrix-vector multiplications. The key points are:
- The bias can be incorporated into the weight matrix using a technique known as the bias trick, allowing it to be treated as part of the weight matrix itself.
- Predictions from linear classifiers are always linear, which can be a limitation in distinguishing certain types of input images.
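The bias trick can be verified numerically. The sketch below uses small hypothetical shapes (3 classes, 4 pixels) and checks that the augmented formulation reproduces Wx + b exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))   # 3 classes, 4 pixels
b = rng.normal(size=3)
x = rng.random(4)

# Bias trick: append a constant 1 to x and fold b in as an extra column of W.
x_aug = np.append(x, 1.0)            # shape (5,)
W_aug = np.hstack([W, b[:, None]])   # shape (3, 5)

# The augmented product reproduces W x + b exactly.
assert np.allclose(W_aug @ x_aug, W @ x + b)
```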
2. Visual Viewpoint
Alternatively, we can visualize weights as templates that the classifier learns to recognize. Each category corresponds to a specific template, and the classifier operates similarly to template matching, scrutinizing images through the lens of these learned templates. The model thus aggregates both the object and its context when interpreting input.
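The template view is just a reshape of the weight matrix. The sketch below (random stand-in weights, CIFAR-10 shapes) shows that each class score is exactly the elementwise match between the image and that class's template:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(10, 3072))  # one row per CIFAR-10 class
x = rng.random(3072)             # flattened image

# Reshape each row back into image shape to view it as a template.
templates = W.reshape(10, 32, 32, 3)
img = x.reshape(32, 32, 3)

# The score for each class is the inner product of image and template.
assert np.allclose(W @ x, (templates * img).sum(axis=(1, 2, 3)))
```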
3. Geometric Viewpoint
The geometric representation allows us to conceptualize classification as splitting high-dimensional space using hyperplanes, with each hyperplane representing a different category. However, it is critical to note that some configurations in the data such as XOR cannot be resolved by linear classifiers alone due to their inability to create complex boundaries.
Classification Challenges with Linear Classifiers
Linear classifiers have inherent limitations, including:
- Mode Splitting: Inability to learn multiple representations of the same class.
- Context Dependence: Contextual features can falsely influence predictions, as the classifier might rely on surrounding data rather than the object itself.
Loss Functions in Linear Classification
Loss functions are pivotal in assessing the performance of classifiers:
1. Multi-Class SVM Loss
The multi-class SVM loss penalizes any incorrect class whose score comes within a fixed margin of the correct class's score. For one example with scores $s$ and correct class $y_i$, it is defined as $$ L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1) $$ so the loss is zero once the correct class outscores every other class by at least the margin.
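A minimal implementation of this hinge loss for one example might look like the following (a sketch; the margin of 1 matches the standard formulation):

```python
import numpy as np

def svm_loss(scores, y, margin=1.0):
    """Multi-class SVM (hinge) loss for one example.

    scores: (C,) class scores; y: index of the correct class.
    """
    margins = np.maximum(0, scores - scores[y] + margin)
    margins[y] = 0  # the correct class contributes no loss
    return margins.sum()

# Correct class well ahead of the rest -> zero loss.
assert svm_loss(np.array([10.0, 2.0, 3.0]), y=0) == 0.0
# Correct class inside the margin -> positive loss.
assert svm_loss(np.array([3.0, 2.9, 1.0]), y=0) > 0.0
```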
2. Cross-Entropy Loss
Cross-entropy loss provides a probabilistic interpretation of scores from the classifier. This loss function computes the log probability of the true category compared to the predicted probabilities derived from softmax. Its main advantage lies in translating scores into interpretable probability distributions, thus allowing for more nuanced assessments of model performances.
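A sketch of the softmax cross-entropy computation, using the standard max-subtraction trick for numerical stability:

```python
import numpy as np

def cross_entropy_loss(scores, y):
    """Softmax cross-entropy loss for one example (numerically stabilized)."""
    shifted = scores - scores.max()              # subtract max to avoid overflow
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[y]                         # negative log prob of the true class

scores = np.array([2.0, 1.0, 0.1])
loss = cross_entropy_loss(scores, y=0)

# Softmax turns the raw scores into a probability distribution.
probs = np.exp(scores) / np.exp(scores).sum()
```

The loss is simply the negative log of the probability that softmax assigns to the true class, which is what makes the scores interpretable as a distribution.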
Regularization in Linear Models
To achieve better generalization and avoid overfitting, regularization techniques are employed. These techniques penalize model complexity by adding terms, such as L1 or L2 norms, to the loss function. This preference for simpler models helps enhance performance on unseen data.
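An L2 penalty can be sketched as a single extra term added to the data loss (the regularization strength `reg` is a hypothetical hyperparameter value; in practice the bias is usually excluded from the penalty):

```python
import numpy as np

def total_loss(W, data_loss, reg=1e-3):
    """Data loss plus an L2 penalty on the weights."""
    return data_loss + reg * np.sum(W * W)

W_big = np.ones((10, 3072))
W_small = 0.1 * W_big
# For the same data loss, smaller weights give a smaller total loss,
# so the optimizer is nudged toward simpler solutions.
```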
Conclusion
As we explored linear classifiers and their viability in image classification, we recognized the simple yet powerful mechanics driving their predictions. Understanding these classifiers is crucial as we transition to more complex networks. To fully leverage linear classifiers, we must grasp the mathematical models guiding them and understand how the choices of loss function and regularization, together with their inherent limitations, shape their outcomes.
So welcome back to lecture three. Today we're going to talk about linear classifiers. A quick recap: in the last lecture we talked about the image classification problem, and you'll recall that this is a foundational problem in computer vision, where we take an input image and our system has to predict a category label from one of a fixed set of categories. Remember that last time we talked about the various challenges of this image classification problem: we somehow need to build classifiers that are robust to all the different sorts of variation that can appear in our input data, things like viewpoint changes, illumination changes, deformation, and so on. The challenge in building high-performance recognition systems is building systems that are robust to all these different changes in the visual input they need to process.

You'll also remember that last time we talked about the data-driven approach to overcoming some of these challenges: rather than trying to write down an explicit function that deals with all of those hairy bits of visual recognition, our approach is to collect a big dataset that hopefully covers all of the types of visual things we want to recognize, and then use some kind of learning algorithm to learn from the data how to recognize them. As a concrete example of this pipeline, in the last lecture we talked about the k-nearest neighbor classifier, which was fairly simple: it memorized the training data and then, at test time, output the label of the training image most similar to the test image. We saw how this led to the ideas of hyperparameters and cross-validation, and we went through this entire pipeline of an image classification system. But remember, when we left off we said that the k-nearest neighbor algorithm was actually not very useful in practice, for a couple of reasons. One was that it inverted the idea of what should be slow and fast: it was very fast to train but very slow to evaluate. The other problem was that it wasn't very perceptually meaningful: L2 (Euclidean) or L1 distances on raw pixel values are not very perceptually meaningful things to measure.
So today we're going to talk about a kind of classifier that is very different in flavor from the k-nearest neighbor approach: linear classifiers, which we can use to solve this image classification problem. Linear classifiers might sound kind of simple, but they're actually very important when you're studying neural networks, because when you build neural networks you stack your layers together like a set of Lego blocks, and one of the most basic blocks in your toolbox when you build these large, complicated neural networks is the linear classifier. Roughly speaking, once we move beyond linear classifiers to these big, complicated neural models, we'll see that the individual components of those neural network models look very similar to the linear classifiers we'll talk about today. Indeed, much of the intuition and many of the technical bits we cover today will carry over completely as we move to neural network systems in the next couple of lectures.

As a quick recap, remember that we've been working with the CIFAR-10 dataset. CIFAR-10 is one of the standard benchmark datasets for image classification; it contains 50,000 training images and 10,000 test images, where each image is tiny: 32 pixels by 32 pixels, with three scalar values per pixel for the red, green, and blue color channels.
The idea of a linear classifier is part of a much broader set of approaches to building machine learning models: the parametric approach. The idea of a parametric approach is that we take our input image, much as we've seen in the previous lecture, but now there's a new component in our system: the learnable weights W, shown in red at the bottom of the slide. We then write a function f that somehow takes as input the pixels of the image x as well as these learnable weights W, and spits out ten numbers giving classification scores for each of the categories we want the system to recognize. This is a fairly general framework, and this idea of a parametric classifier will carry over completely to the neural network systems we'll talk about later. But today we're going to talk about possibly the simplest instantiation of this parametric classifier pipeline, and that's the linear classifier, which has the simplest possible functional form: f(x, W) is just a matrix-vector multiply between the learnable weights W and the pixels of the image x.

To make this a little more concrete, remember that an input image for something like CIFAR-10 is 32 by 32 by 3, which means that if we count the total number of scalar values inside each image and multiply it out, we end up with 3072 individual numbers making up the pixels of that input image. We take the pixels of the image and stretch them out into a long vector. This completely destroys all of the spatial structure in the image; we just reorganize all of the data in the input image into a long vector with 3072 elements. Of course, we need to do this flattening in a consistent way: every time we take an image, we always convert it into a vector in exactly the same way. Once we've chosen some way to flatten our image data into a vector, our learnable weight matrix will be a two-dimensional matrix of shape 10 by 3072, where 10, remember, is the number of categories we want to recognize and 3072 is the number of pixel values in the image. When you perform this matrix-vector multiplication, the output will again be a vector, of size 10, giving one score for each of the ten categories we want our classifier to recognize. Sometimes you'll also see linear classifiers with a bias term: a matrix-vector multiply plus an additional bias vector b, where b has ten elements giving an offset for each of the ten categories we wish to learn.
This is a fairly straightforward way to think about what linear classifiers mean in the context of image classification. As a concrete example, to make this super concrete, suppose that our input image is a 2 by 2 grayscale image, so it has only 4 pixel values giving the full state of the image. We stretch the pixels out into a column vector with four entries; here I've written out the exact values of each of the pixels in this image. In this simple example we'll consider classifying only three categories rather than ten, maybe cat, dog, and ship, shown in three corresponding colors. Now the weight matrix W will have shape 3 by 4, where 3 is the number of categories we want to recognize and 4 is the total number of pixels in our input image, and our bias will again have shape 3, because that's the number of categories we want to recognize. We then perform this matrix-vector multiplication and output the vector of scores, getting one score for each category we want to recognize. When you look at the problem this way, you start to see some structure in how we're breaking up this image classification problem: if you remember how matrix-vector multiplication works, you take inner products between the vector and each row of the matrix, and you realize that each row of this matrix corresponds to one of the categories our classifier wants to recognize.
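A computation of this shape can be written out in a few lines of numpy. The specific pixel and weight values below are hypothetical stand-ins (the slide's exact numbers aren't reproduced in this text); the point is just the 3-by-4 times 4 plus 3 shape arithmetic:

```python
import numpy as np

# Hypothetical values for a 2x2 grayscale image, 3-class setup.
x = np.array([56.0, 231.0, 24.0, 2.0])        # flattened 2x2 image
W = np.array([[0.2, -0.5, 0.1,  2.0],         # cat row
              [1.5,  1.3, 2.1,  0.0],         # dog row
              [0.0,  0.25, 0.2, -0.3]])       # ship row
b = np.array([1.1, 3.2, -1.2])

scores = W @ x + b   # one score per class: each row dotted with x, plus its offset
```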
So I think it's useful to think about linear classifiers in a couple of different, equivalent ways. By using different viewpoints, certain properties of linear classifiers can become very obvious, or not; having different ways to think about a linear classifier can help you understand it more intuitively.

The first way I like to think of linear classifiers is what I call the algebraic viewpoint, which is exactly this idea of a linear classifier as a matrix-vector multiply plus a vector offset. When you think about the algebraic viewpoint, a couple of facts about linear classifiers immediately become obvious. One is that we can do what's sometimes referred to as the bias trick, which eliminates the bias as a separate learnable parameter and instead incorporates it directly into the weight matrix W. The way we do this is to augment the vector representation of our input image with an additional constant 1 at the end of the vector, and then augment our weight matrix with an additional column holding the bias; this now performs exactly the same computation as the Wx + b formulation we saw before. That's kind of a nice feature, and this bias trick is pretty common when your input data has a native vector form, so it's nice to be aware of as you think about building different types of machine learning systems. In computer vision, though, the bias trick is actually less common in practice, because it doesn't carry over so nicely as we move from linear classifiers to convolutions later on; furthermore, it's sometimes nice to keep the weight and the bias as separate parameters so we can treat them differently in how they're initialized or regularized. Nevertheless, the bias trick is a nice thing to be aware of for linear classifiers, and it's totally obvious when you think about linear classifiers through the lens of the algebraic viewpoint.
Another thing that's very obvious when you think about linear classifiers in this algebraic way is that the predictions are linear. As a simple example, if we ignore the bias and imagine scaling our whole input image by some constant c, then we can pull that constant out of the linear classifier, which means the predictions of the model will also be scaled by that scalar value c. For images, that means that if we have some original image with some predicted category scores from a linear classifier, and we modify the image by desaturating it, multiplying all the pixels by the constant one half, then all of the predicted category scores from the classifier will be cut in half as well. This is maybe a bug, maybe a feature, but it feels kind of weird for linear classifiers to behave this way on image data: after scaling down all the pixels by a constant, we as humans can still recognize the image as a cat just as easily, so it's a bit unintuitive that scaling the pixels changes the predicted scores. That's a kind of weird property of linear classifiers that may or may not be important depending on exactly what loss function we use to train them; we'll talk about that a bit later.
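The scaling property just described can be checked directly (a sketch with random stand-in weights; ignoring the bias, as in the lecture's example):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(10, 3072))  # stand-in weight matrix
x = rng.random(3072)             # stand-in flattened image

# Scaling every pixel by c scales every class score by c.
c = 0.5
assert np.allclose(W @ (c * x), c * (W @ x))
```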
So that's the algebraic viewpoint that I like to think about for linear classifiers. But we can reformulate this computation in an equivalent but slightly different way that gives us a slightly different picture of exactly what linear classifiers are doing on image data. Remember, from the algebraic viewpoint of a matrix-vector multiply, we saw that the classification score predicted for each category is the result of an inner product between the vector representation of the image and one of the rows of the matrix. In that viewpoint, recall, we took the pixel values of our input image and stretched them out into a column vector, and then took inner products between that column and the rows of the matrix. Rather than stretching the image out into a column vector, we can instead think about reshaping the rows of the matrix into the same shape as the input image. Then we get a system that looks like the one on the right: we've taken each row of the matrix and reshaped it to have the same 2-by-2 shape as the image we're trying to classify, and the bias vector has been broken up into its three separate elements, one per row. When we think about linear classifiers this way, it lets us interpret their behavior in a slightly different and perhaps more intuitive way.

That's what I like to call the visual viewpoint of linear classifiers, because now that we've reshaped each row of the weight matrix to have the same shape as the image, we can try to visualize each row of the matrix as an image itself. This interpretation of a linear classifier looks kind of like template matching: the classifier is learning one image template per category that we want to recognize, and to produce the category score we simply match up the template for the class with the pixels of the image by computing an inner product between them. You might remember that the inner product of two vectors of, say, unit norm achieves its maximum when the vectors are aligned, which fits with this idea of template matching. What's really interesting is that by visualizing these learned templates from the classifier as images themselves, you get a bit more intuition about exactly what the linear classifier is looking for when it tries to recognize the different categories. For example, on the bottom left you can see that for the plane category it's maybe looking for some kind of blob in the middle, and it's generally looking for blue images, so any image with a lot of blue in it is going to receive a very high score for the plane class under this particular weight matrix. Similarly, the deer class is kind of a green blobby background with a brown blob in the middle that is maybe the deer. That again gives us some more intuition about what the linear classifier is looking at.
One thing that's kind of interesting from this viewpoint is that it becomes clear that even though we told the classifier we wanted to recognize object categories like planes and dogs and deer, it's in fact using a lot more evidence from the input image than just the object itself; it's relying very strongly on context cues from the image. For example, if you imagine putting in an image that had, say, a car in a forest, that would be kind of confusing for a linear classifier, because the very green forest background might achieve a high score according to the deer classifier, while the car in the middle might match up more with the car template. So for images with objects in unusual contexts, it's very likely that a linear classifier would completely fail to properly recognize those objects, and that becomes very obvious when you think about the visual viewpoint.

An additional failure mode of linear classifiers that becomes clear from the visual viewpoint is mode splitting. Our linear classifier is only able to learn one template per category, but there's a problem: what happens if a category can appear in different ways? As a concrete example, think about horses. If you go and look at the CIFAR-10 dataset, which you might have done if you've started working on the first homework assignment, you'll see that horses in CIFAR-10 are sometimes looking to the left, sometimes looking to the right, and sometimes looking dead on. If horses look in different directions, the visual appearance of those horse images will be very different, but unfortunately the linear classifier has no way to disentangle its representation, no way to separately learn templates for horses facing different directions. In fact, if you look at the learned horse template from this one particular linear classifier, you can kind of see that it actually has two heads: the template has a brown blob in the middle and green on the bottom, which you might expect, but there's also a black blob on the left and a black blob on the right, which might correspond to heads facing either direction. This is the linear classifier trying to do the best it can to match horses looking in different directions using only the single template it's able to learn. The same thing is somewhat visible in the car example: the car template doesn't actually look anything like a car, it just looks like a red blob and a windshield, and it might have this funny shape because it's trying to use a single template to cover all possible appearances of cars in the dataset. This also suggests that maybe CIFAR-10 has a lot of red cars, since the learned car template is red; if we tried to recognize green or blue cars, the classifier might fail. All of these failure modes become very obvious when you think about the linear classifier from this visual viewpoint.
A third way we can think about linear classifiers is what I like to call the geometric viewpoint. Here we imagine picking out a single pixel in the image and drawing a plot where the x-axis is the value of that pixel and the y-axis is the value of the classifier score as that pixel changes, with all the other pixels held fixed. Because the linear classifier is a linear function, the classifier score must vary linearly as we change any individual pixel value in the image. This is not very interesting with only a single pixel, so we can broaden the viewpoint and incorporate multiple pixels simultaneously. Imagine drawing a plot where the x-axis is one pixel in the image and the y-axis is a second pixel; because I can't really draw three-dimensional plots in PowerPoint, you'll have to live with a kind of contour plot. Here we can draw a line where the car score equals one half, and you can see that this level set of the car score forms a line in pixel space. Because the classifier is linear, there is a direction in pixel space, orthogonal to this line, along which the car score increases linearly. Tying this back to the template view, the learned car template lies somewhere along this direction orthogonal to the level set of the car score. Similarly, for all the scores of all the different categories we're trying to recognize, we end up with different lines for their level sets, and the learned templates for those categories are orthogonal to their level sets in pixel space.

Of course, looking at only two-pixel images like this is not very intuitive, but you can imagine the viewpoint extending to higher dimensions as well. The idea is that we imagine the whole space of images as a very high-dimensional Euclidean space, and within that space the linear classifier gives us one hyperplane per category we want to recognize, each hyperplane cutting the high-dimensional Euclidean space into two half-spaces along its level set. So that's the third viewpoint on linear classifiers: one hyperplane per class, cutting up this high-dimensional Euclidean space of pixels. This geometric viewpoint is a very useful way to think about linear classifiers, but I would caution you that geometry gets really weird in high dimensions. We unfortunately live in a low-dimensional world, so our intuition about how geometry behaves is shaped by a very small number of dimensions, and that's unfortunate because Euclidean geometry in very high dimensions can behave in ways that are very non-intuitive to our low-dimensional experience. So while I think this geometric viewpoint is useful, it's sometimes easy to be led astray by geometric intuition, because all our intuition is built on low-dimensional spaces. Nevertheless, the geometric viewpoint does give us some other ideas about what kinds of things a linear classifier can and cannot recognize.
So based on this geometric viewpoint, we can write out different kinds of classification settings that would be difficult or impossible for a linear classifier to handle. The idea is that we've colored this two-dimensional pixel space with red and blue corresponding to different categories that we want the classifier to recognize, and these are three cases that are completely impossible for a linear classifier. On the left we have the case where the first and third quadrants belong to one category and the second and fourth quadrants belong to a different category; if you think about it, there's no way to draw a single hyperplane that divides the red from the blue here, so that's a case that's simply impossible for a linear classifier. Another case that's completely impossible for linear classifiers, on the right, is very interesting: three modes. Here the blue category occupies three distinct regions in pixel space, corresponding to possibly different visual appearances of the category we want to recognize, and if we have disjoint regions in pixel space corresponding to a single category, again there's no way for a single line to perfectly carve up the red and blue regions. This right example of three modes is, I think, similar to what we saw in the visual example of horses looking in different directions: you can imagine that in this high-dimensional pixel space there's some region corresponding to horses looking right and a completely separate region corresponding to horses looking in a different direction. With this geometric viewpoint of hyperplanes cutting up high-dimensional spaces, it again becomes clear that it's very difficult for a linear classifier to carve up classes that have completely separate modes of appearance.
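The quadrant case on the left is exactly the XOR pattern, and the impossibility can be demonstrated empirically. The sketch below runs a basic perceptron training loop (my own minimal version, not from the lecture) on the four XOR points; because no line separates the two classes, it can never reach perfect accuracy:

```python
import numpy as np

# XOR: no single line separates the two classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR labels

w, b = np.zeros(2), 0.0
for _ in range(1000):  # classic perceptron update rule
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        w += (yi - pred) * xi
        b += (yi - pred)

acc = np.mean([(1 if xi @ w + b > 0 else 0) == yi for xi, yi in zip(X, y)])
assert acc < 1.0  # a linear classifier can never get XOR fully right
```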
This also ties back to the historical context we saw in the first lecture. Last week we talked about the different types of machine learning algorithms people have built over the years, and one of the very first machine learning algorithms that got people excited was the perceptron: all of a sudden there was a machine that could learn from data, that could learn to recognize digits and characters. But if we were to look back at the exact math of the perceptron now, we would recognize it as a linear classifier, and because the perceptron was a linear classifier, there were a lot of things it was just fundamentally unable to recognize. The most famous example is the XOR function, shown here, where green is one category and blue is a different category. Because the perceptron was a linear model, there was no way it could carve up these green and blue regions with a single line, and therefore no way the perceptron could learn the XOR function. So that's a nice bit of historical context about why the geometric viewpoint was historically useful for helping people think about how machine learning algorithms can operate.

So to this point, we've talked about linear classifiers as a fairly simple model: a matrix-vector multiply. And we've seen that even though this is a fairly simple equation to write down, if you unpack it and think about it in different ways, some of the shortcomings of its representational abilities become clearer. Are there any questions about these different viewpoints of linear classifiers so far? Okay. So then, basically, where we are now is
that once we have a linear classifier, we're able to predict scores: given any value of the weight matrix W, we can perform a matrix-vector multiply on an input image to spit out a vector of scores for the classes we want to recognize. As an example, here we've got three images and the ten categories of CIFAR-10; for any particular value of the weight matrix W, we can run the classifier and get these vectors of scores. But this has told us nothing about how we actually select the weight matrix W, and we've said nothing about the learning process by which this matrix W is selected, or learned from data.
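As a quick sketch of this scoring step in code (the shapes here are assumptions for illustration: ten categories and CIFAR-style 32x32x3 images flattened to 3072 numbers; the weights are random, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes, num_features = 10, 3072  # assumed: 10 categories, 32x32x3 images flattened
W = rng.normal(scale=0.001, size=(num_classes, num_features))  # weight matrix (not yet learned)
b = np.zeros(num_classes)             # bias vector
x = rng.random(num_features)          # one flattened input image

scores = W @ x + b                    # matrix-vector multiply: one score per category
predicted_label = int(np.argmax(scores))  # predict the highest-scoring category
print(scores.shape, predicted_label)
```

Nothing here says whether the prediction is any good; that's exactly what the loss function, introduced next, is for.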
So now, in order to actually implement linear classifiers, we need to talk about two more things. One is the idea of a loss function, which quantifies how good any particular value of W is; that's what we'll talk about for the rest of this lecture. Then, in the next lecture, we'll talk about optimization, which is the process by which we use our training data to search over all possible values of W and arrive at one that works well for our data. A little bit more formally, a loss function is some way to
tell how good our classifier is doing on our data, with the interpretation that a high loss means we're doing badly and a low loss means we're doing well. The whole goal of machine learning is to write down loss functions. Okay, that's a little bit reductive, but one way we can think about machine learning systems is as writing down loss functions that try to capture intuitive ideas about what types of models are good, about when models are working well and when they are not, and then, once we have this quantitative way to evaluate models, trying to find models that do well. As a bit of terminology, a loss function will also sometimes be called an objective function or a cost function in other parts of the literature. And because people can never agree on names, sometimes people will talk about the negative of a loss function instead: a loss function is something you want to minimize, but sometimes people want to maximize something instead, and when we fit a model by maximizing a function, it will typically be called something like a reward function, profit function, utility function, or fitness function. Each subfield has its own names and bits of terminology, but they're all the same idea: a way to quantify when your model is doing well and when it is not.
A bit more formally, the way we'll usually think about this is that we have some data set of examples, where each input is a vector x and each output is a label y. In the image classification case, each x_i will be an image of fixed size, and each y_i will be an integer indexing into the categories we care to recognize. The loss for a single example we'll often write as L_i: f(x_i, W) gives the predictions of our model on the data point x_i, and the loss function L_i(f(x_i, W), y_i) assigns a score of badness to the mismatch between the prediction and the ground-truth (true) label y_i. The loss over the entire data set is then simply the average of the losses on the individual examples.
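This abstract recipe can be written directly in code. A minimal sketch, where per_example_loss is a placeholder for whatever concrete L_i we pick later (the toy 0/1 loss and tiny data here are assumptions, just to make the sketch runnable):

```python
import numpy as np

def dataset_loss(per_example_loss, f, W, xs, ys):
    """Average the per-example losses L_i(f(x_i, W), y_i) over the data set."""
    return float(np.mean([per_example_loss(f(x, W), y) for x, y in zip(xs, ys)]))

# Toy instantiation: a linear model and a 0/1 "is the prediction wrong?" loss.
f = lambda x, W: W @ x
zero_one = lambda scores, y: float(np.argmax(scores) != y)

W = np.eye(3)                                   # toy 3-class weight matrix
xs = [np.array([3.0, 1.0, 0.0]), np.array([0.0, 2.0, 5.0])]
ys = [0, 2]                                     # both examples predicted correctly here
print(dataset_loss(zero_one, f, W, xs, ys))
```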
This is the idea of a loss function in the abstract. You can imagine that as we try to tackle different tasks in machine learning, we need to write down different loss functions for each task we want to solve, and even when we're focused on a single task, we can often write down different loss functions that encapsulate different preferences about when models are good and when they are bad. As a first concrete example of a loss function, I want to talk about the multi-class SVM loss, for image classification or really for classification more generally.
The idea of the multi-class SVM loss is quite intuitive. What it basically says is that the score of the correct class should be a lot higher than the scores assigned to all of the incorrect classes. That's an intuitive statement: if we want to use this classifier to actually recognize images, then at the end of the day we don't care about the raw predicted scores; we want to assign a single label to each image we classify, and to do that, it seems reasonable to want our classifier to assign a high score to the right category and low scores to all the other categories. The multi-class SVM loss is one particular way to make that intuition concrete. What the multi-class SVM loss computes can be drawn as a plot, where the x-axis is the score of the correct class for the example we're considering, and the y-axis is the loss for that individual data point. In addition to keeping track of the score of the correct class, we also keep track of the highest score assigned to any of the other categories we care to recognize. So if we were classifying an image whose correct class is cat, the x-axis would be the cat score, and this particular dot would be the highest score assigned to any of the other categories in the classifier. The multi-class SVM loss then looks like the following: once the score of the correct class is more than some margin above the highest score among all the incorrect classes, we get zero loss, and low loss means a good classifier. Moving to the left, as the score of the correct class becomes close to, or even lower than, the highest incorrect score, the loss we assign to that example increases linearly. This type of loss function, with the general shape of a linear region and then a zero region, comes up a lot in different contexts in machine learning; it's often called a hinge loss, because it looks kind of like a door hinge that can open.
More formally, for a single data example, with image x_i and label y_i, the SVM loss has the form L_i = sum over j != y_i of max(0, s_j - s_{y_i} + 1). The sum runs over all the category labels but excludes the correct class y_i, and each term takes the max of 0 and the score of the class we're looping over, minus the score of the correct class, plus 1. If you think through the cases of what can be higher and what can be lower, this corresponds to two situations. If the correct class's score is more than one greater than the incorrect class's score, that term contributes zero loss for that class/example pair. Otherwise, we take some loss for that pair, growing linearly as the incorrect score catches up to and overtakes the correct one. We then sum these terms over all the incorrect classes.
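The formula can be sketched in a few lines. The margin of 1 and the exclusion of the correct class follow the formula above; the example scores in the last line are the cat image from the upcoming slide (the transcript states 3.2 and 5.1, while the frog score of -1.7 is an assumption, as only "low" is stated):

```python
import numpy as np

def svm_loss_single(scores, y, margin=1.0):
    """Multi-class SVM loss for one example.

    scores: vector of class scores s_j; y: index of the correct class.
    Computes sum over j != y of max(0, s_j - s_y + margin).
    """
    m = np.maximum(0.0, scores - scores[y] + margin)
    m[y] = 0.0                      # exclude the correct class from the sum
    return float(np.sum(m))

print(svm_loss_single(np.array([3.2, 5.1, -1.7]), y=0))  # cat image: loss ~2.9
```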
Because that's a little bit hard to wrap your head around, let's look at a more concrete example. Here we imagine a data set of three images; hopefully, as expert human visualizers, you can recognize them as a cat, a car, and a frog. And we imagine some particular setting of the weight matrix W that causes our classifier to spit out these scores for these images. Given these scores and these images, we can compute the SVM loss as follows. To compute the loss for the cat example, we loop over all the incorrect categories, so we skip the cat category. For the car category, we compute max(0, 5.1 - 3.2 + 1), where 5.1 is the car score, 3.2 is the cat score, and 1 is the margin, which gives 2.9 for that term. For the frog category, we see that the cat score is more than 1 greater than the frog score, so we get zero loss for the frog term. The overall loss for this cat image is therefore 2.9.
We can do something similar for the car image. Here the correct category is car, the score we're currently assigning to it is 4.9, and 4.9 is more than one greater than all of the scores assigned to the incorrect categories, so we get a loss of zero for this example. Doing the same computation for the frog example, we get a large loss, because we've assigned a very low score to the frog category. Then, to compute the loss over the full data set, we just take the average of the losses over the individual examples. So now, a couple of questions. First, think about what happens to this loss if some of the predicted scores for the car image change a little bit. Because the car image achieves zero loss overall, and the predicted car score is a lot greater than any of the scores assigned to the incorrect classes, changing the predicted scores of this example by a little bit would still give zero loss. That's one interesting property of the multi-class SVM loss: once an example is correctly classified with some margin, small changes to its predicted scores don't affect the loss at all. Another question: what are the maximum and minimum possible values of this loss on a single example? The minimum loss is zero, which we achieve when the correct category's score is much higher than all the incorrect categories' scores. The maximum loss is infinite, which happens when the correct category has a very low score, much smaller than all the other predicted scores.
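We can check the whole worked example numerically. One caveat: the transcript only states a few of the score values (3.2 and 5.1 for the cat image, 4.9 for the car image); the remaining numbers below are assumptions taken from the commonly used slide for this example, so treat them as illustrative:

```python
import numpy as np

def svm_loss(scores, y, margin=1.0):
    m = np.maximum(0.0, scores - scores[y] + margin)
    m[y] = 0.0
    return float(m.sum())

# Columns: (cat, car, frog) scores; rows: the cat, car, and frog images.
score_table = np.array([[3.2, 5.1, -1.7],   # cat image  -> loss 2.9
                        [1.3, 4.9,  2.0],   # car image  -> loss 0.0
                        [2.2, 2.5, -3.1]])  # frog image -> large loss (12.9)
labels = [0, 1, 2]

losses = [svm_loss(s, y) for s, y in zip(score_table, labels)]
print(losses, np.mean(losses))  # per-image losses, then the data set average
```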
So then, another question: suppose we had a linear classifier whose weight matrix was randomly initialized and has not been learned at all. If the values of the weight matrix are all small random values, then at initialization, when we first start the learning process, we would probably expect all of the predicted scores for each of the categories to also be small random values. In that case, approximately what loss would we expect to see from the SVM classifier? I heard "zero"; that's actually not correct. Okay, maybe that was my fault for asking an imprecise question, so let me be more precise: suppose we draw each of the scores from a Gaussian distribution with a very small standard deviation, like 0.001. Then all of the predicted scores are small random values, so the expected difference between the correct-category score and any incorrect-category score is approximately zero. If you imagine churning through the loss computation, we get small value minus small value, which is approximately zero, plus 1; so each term is max(0, approximately 1), which is about 1. We would therefore get a loss of about 1 per incorrect category, and because the sum loops over all the incorrect categories, we would expect to see a total loss of approximately C - 1, where C is the number of categories we're trying to recognize.
This might seem like kind of a stupid question to ask, but it's actually a really useful debugging technique. Whenever you're implementing a neural network or another kind of learning-based system, you should think about what loss you'd expect to see if all of the scores are approximately random; then, when you start training your system, if you see a loss very different from what you expected, you probably have a bug somewhere. So this might have seemed like a contrived question, but it's a very useful debugging exercise to go through, thinking about what loss you'd expect with small random values, whenever you implement or start training a new loss function.
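This sanity check is easy to script. A sketch, assuming scores drawn from a Gaussian with standard deviation 0.001 as in the lecture: with C = 10 categories, the average SVM loss should come out very close to C - 1 = 9.

```python
import numpy as np

def svm_loss(scores, y, margin=1.0):
    m = np.maximum(0.0, scores - scores[y] + margin)
    m[y] = 0.0
    return float(m.sum())

rng = np.random.default_rng(0)
C, N = 10, 1000                               # 10 categories, 1000 fake examples
scores = rng.normal(scale=1e-3, size=(N, C))  # small random scores, as at initialization
labels = rng.integers(0, C, size=N)

avg = float(np.mean([svm_loss(s, y) for s, y in zip(scores, labels)]))
print(avg)  # expect roughly C - 1 = 9
```

If the first reported loss of a fresh implementation is far from this value, suspect a bug before suspecting the data.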
So then, another question: we saw in this formulation of the SVM loss that we sum over only the incorrect categories. What would happen if we instead summed over all of the categories, including the correct one? Would this represent the same preference over classifiers, or some other preference over weight matrices? In this case, we would just expect all of the losses to be inflated by one, because we'd be adding one extra term to the sum: max(0, s_{y_i} - s_{y_i} + 1) = max(0, 0 + 1) = 1. So this would express the same preference over classifiers: all the losses would be inflated by a constant one, but our ordering over which weight matrix we prefer would not change.
So then, another question: what would happen if, rather than using a sum, we used a mean over categories instead? Here all of the computed losses would just be multiplied by a factor of 1/(C - 1), and because that's a monotonic transform, this would express the exact same preference over weight matrices. The loss values we see during training would change, but the preference over weight matrices would be the same. Another question: what if we used some other formulation, say putting a square over this max value? That would actually be quite different. It would change all of the losses in a nonlinear way, and this would cause the preference over weight matrices expressed by our loss function to change in a non-trivial way. You could no longer call this a multi-class SVM loss, because it would now be a different loss function.
So then, another question: what would happen if we got lucky and found some weight matrix W that caused the overall SVM loss to be zero? Would such a W be unique? It would not, because if we take our weight matrix and multiply it all by two, we still get an overall loss of zero. We can see this by working through one of these examples. If the loss was zero, that means the score of the correct category was more than one greater than all the scores of the incorrect categories. If we then multiply the weight matrix by two, all of the predicted scores also go up by a factor of two, because the classifier is linear, which means the correct categories' scores will now be more than two greater than the incorrect categories' scores. We'll still be over the margin, so we'll still get zero loss.
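A quick numeric check of this scaling argument (the toy weights and input are assumptions; all that matters is that the correct class already beats the other by more than the margin):

```python
import numpy as np

def svm_loss(scores, y, margin=1.0):
    m = np.maximum(0.0, scores - scores[y] + margin)
    m[y] = 0.0
    return float(m.sum())

W = np.array([[2.0, 0.0],    # class-0 weights
              [0.0, 0.5]])   # class-1 weights
x = np.array([2.0, 1.0])     # an input whose correct class is 0

for scale in (1.0, 2.0):     # the classifier is linear, so scores scale with W
    scores = (scale * W) @ x
    print(scale, svm_loss(scores, y=0))  # zero loss in both cases
```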
So that leads to an interesting question: now that it's possible to have two different weight matrices that achieve the exact same loss, how can we possibly express preferences over these weight matrices? In this case we found two different weight matrices that achieve the same loss on the training data, so in order to distinguish them we need some additional mechanism, beyond the training-set loss, for expressing our preferences over classifiers. This is one idea called regularization. Regularization is some piece that you add to the objective function, the overall learning objective, that fights against performing well on the training data. So far we've seen the overall loss as the average loss over the examples of the training set; this is usually called the data loss, which measures how good the model's predictions are on the training data. It's very common to add an additional term to our overall loss function that does something else and that might not depend on the data at all: this is called a regularization term, and you'll see that it does not involve the training data. It is meant to prevent the model from doing too well on the training data, basically to give the model something else to do other than just try to fit the training data. These different types of regularization will often come with some hyperparameter, usually called lambda, that controls the trade-off between how well the model is supposed to fit the data versus how well it is supposed to achieve the regularization objective.
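Putting the pieces together, the full objective is the data loss plus lambda times a regularization term. A minimal sketch with L2 regularization, on toy assumed data, with the SVM loss reimplemented inline:

```python
import numpy as np

def svm_loss(scores, y, margin=1.0):
    m = np.maximum(0.0, scores - scores[y] + margin)
    m[y] = 0.0
    return float(m.sum())

def full_loss(W, xs, ys, lam):
    data_loss = float(np.mean([svm_loss(W @ x, y) for x, y in zip(xs, ys)]))
    reg_loss = float(np.sum(W ** 2))   # L2 regularization: sum of squared weights
    return data_loss + lam * reg_loss  # lambda trades off fit vs. regularization

W = np.array([[1.0, 0.0], [0.0, 1.0]])
xs = [np.array([3.0, 0.0]), np.array([0.0, 2.0])]
ys = [0, 1]
print(full_loss(W, xs, ys, lam=0.1))  # data loss 0, reg loss 2, total 0.2
```

Note that with lam=0 this reduces to the plain data loss, so the scaled weight matrix 2W from the earlier question would no longer be tied with W once lam > 0.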
couple very common examples of regularization that are typically used for linear models are l2 regularization
which is the overall norm of the of the weight matrix W the l1 regular and we can sometimes use an l1 regularizer
which is the sum of the absolute values of all the elements in this weight matrix W sometimes you'll see what's
called an elastic net in statistics literature which is a combination of the l1 l2 regularizer regularizer z-- so all
of these types of regular risers will also be used in neural networks but as we move to neural network models we'll
also see other types of regular Reiser's such as dropout batch normalization and more recent things like cut out and mix
up stochastic gap there's a lot of interesting regularizer that people use for neural networks but the basic idea
of why we might want to use regularizers is, in my thinking, somehow threefold. One is that adding some additional term to the loss beyond the data loss allows us to express preferences over different types of models when those models are not distinguished by their training accuracy; this can be a way to inject some of our own human prior knowledge into the types of classifiers we would like to learn. A second is to avoid what we call overfitting. Overfitting is a serious problem in machine learning: it happens when you build a model that works really well on your training data but performs very poorly on unseen data. This is a point where machine learning is quite distinct from something like optimization. In optimization, we typically have an objective function and our whole goal is just to find its minimum; but in machine learning we often don't really want to do that at all, because at the end of the day we want to build a system that performs well on unseen data. Finding the model that gets the best possible performance on the training data might actually be working against us in some ways, and might result in models that do not work well on unseen data. The third is a more technical bit: if we're using gradient-based optimizers, adding this extra regularization term can add extra curvature to the overall objective landscape, and that can sometimes help the optimization process.
So, I said that one idea of regularization is that we can express preferences over the different types of classifiers we want a model to learn. Here's an example where we have an input vector x of all ones, and we consider two different weight vectors w1 and w2. Imagine we're in some kind of linear classification or linear regression setting; then the prediction of a linear model with this input x and either of these two weight vectors will be 1, because the inner product of x with either of them is 1. This means that if we were going solely by something like a data loss, the loss would have no way to distinguish these two different settings of the weights, and they would be preferred equally. But if we add an L2 regularization term to our loss function, it allows us to express an additional preference that tells the model which of the two we would prefer. If you compute the L2 penalty of the w1 vector, you get 1, whereas for w2 each entry is 1/4, and (1/4) squared is 1/16; we've got four of those, so the sum of squares is 1/4. So the weight vector w2 would be preferred if we add in this L2 regularization.
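The example on the slide can be checked directly (the concrete vectors below are the ones implied by the arithmetic just described): both weight vectors give the same prediction, but their L2 penalties differ.

```python
import numpy as np

x = np.ones(4)                            # input of all ones
w1 = np.array([1.0, 0.0, 0.0, 0.0])       # all weight on one feature
w2 = np.array([0.25, 0.25, 0.25, 0.25])   # weight spread across all features

print(w1 @ x, w2 @ x)                     # same prediction: 1.0 and 1.0
print(np.sum(w1 ** 2), np.sum(w2 ** 2))   # L2 penalties: 1.0 vs 0.25
```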
This is very interesting, because it gives one way to think about what an L2 regularizer is doing: when you have two different options that compute the same value on the input, you could either spread out your weights to use all of the available input features, or concentrate all of your weight on exactly one input feature. When you use an L2 regularizer, you're giving the model an extra hint that you would prefer it to use all the available features where possible, even if using a single feature would have achieved the same result. This could be useful if you believe that individual features might be noisy and that you have a lot of features that could all be correlated, so you want to tell the model to use all of the available features. Something like L1 regularization tends to express the opposite preference: it tells the model to prefer putting all of its weight on a single feature where possible. So it's kind of interesting that these different regularizers allow us to give the model extra hints about what types of classifiers we'd like it to learn, hints that are completely separate from its performance on the training data. So I
said, the second really interesting piece of regularization is preferring simpler models in order to avoid overfitting. Here we can imagine we're building some model that receives a scalar input x and predicts a scalar output y, and suppose we've got some noisy training data given by these blue points. We could imagine fitting two different models to this training data: maybe model f1 is this blue curve that perfectly fits all of the training points, whereas model f2 is this green curve that does not perfectly fit the training points but is somehow simpler, because it's a line and not a big wiggly polynomial. Given our human intuition about the problem, we might have reason to prefer the simpler model. And indeed, if we imagine collecting a couple more noisy data points that fall roughly along a line, you can see that the blue curve f1 might make very bad predictions on unseen data, while the simpler green curve f2 might achieve better predictions on unseen data. Of course, I need to point out that we've been talking about linear models, and people always complain that this slide shows a model that is definitely not linear; it's just a cartoon to express the idea of preferring simpler models with regularization. So the takeaway here is that regularization is really important when you're building machine learning systems, and you should basically always incorporate some form of regularization
into whatever machine learning system you're trying to build. So here now we've seen the idea of a linear classifier; we've seen the notion of a loss function, with the multi-class SVM loss as a concrete example; and we've talked about regularization as a way to prefer one type of classifier over another. Well, another way you can give the model your preferences about the types of functions you'd like it to learn is by using different types of loss functions to train it. So far we've seen the multi-class SVM loss, but another very commonly used loss, perhaps the most commonly used loss when training neural networks, is the so-called cross-entropy loss, also known as multinomial logistic regression. This loss comes with a lot of names, but they all mean the same thing. Here the intuition is as follows: remember that so far we've not really given much interpretation to the scores being spit out by our linear model. We just said we had an input x and a weight matrix W, and it was somehow spitting out some collection of scores, but the multi-class SVM loss did not really give