Overview of GPT-4.5
GPT-4.5 has been released, showcasing remarkable creative writing abilities but surprisingly underwhelming coding performance. Despite being 25 times more expensive than Claude and 750 times pricier than Gemini 2.0, its pricing raises questions about its value.
Key Features of GPT-4.5
- Creative Writing: Excels in storytelling, marketing copy, and content generation with a natural voice.
- Coding Performance: Struggles with coding tasks, underperforming compared to cheaper models like O3 Mini.
- Pricing: At $150 per million tokens, it is significantly more expensive than previous models, raising concerns about its cost-effectiveness.
Comparison with Other Models
- Claude: Better at coding tasks despite being cheaper. For a deeper understanding of Claude's impact, check out The Revolutionary Impact of Claude AI: A Game-Changer for Software Engineering.
- Gemini 2.0: Offers better coding capabilities at a fraction of the cost.
- O3 Mini: Outperforms GPT-4.5 in various coding benchmarks.
Insights on Pricing Strategy
- OpenAI's pricing reflects the high costs of running a large model, not an attempt to maximize profits.
- The model is designed for creative tasks rather than coding, indicating a shift in focus for OpenAI. This aligns with the broader trends in Understanding Generative AI: Concepts, Models, and Applications.
Conclusion
GPT-4.5 represents a significant advancement in AI technology, particularly in creative writing, but its high cost and coding limitations may deter developers. As OpenAI continues to refine its models, the future of AI language processing looks promising yet complex. For insights into the creative industries affected by generative AI, see The Impact of Generative AI on Creative Industries and the Need for Protection.
FAQs
-
What are the main strengths of GPT-4.5?
GPT-4.5 excels in creative writing, storytelling, and generating marketing content. -
How does GPT-4.5 compare to Claude and Gemini 2.0?
While GPT-4.5 is more expensive, Claude and Gemini 2.0 outperform it in coding tasks. -
Why is GPT-4.5 so expensive?
The high cost is due to the extensive resources required to run the model, not an attempt to maximize profits. -
Is GPT-4.5 suitable for coding tasks?
No, GPT-4.5 is not recommended for coding as it underperforms compared to other models. -
What is the pricing for using GPT-4.5?
The cost is $150 per million tokens, making it significantly more expensive than previous models. -
What is the future of AI models like GPT-4.5?
OpenAI aims to advance AI technology, focusing on creative tasks while refining future models for better performance. For more on the evolution of AI technology, see Deep Seek R1: The Game Changer in AI Technology. -
Can I use GPT-4.5 for free?
Currently, GPT-4.5 is available through paid services, and users need to bring their own API key for access.
gbt 4.5 is here and it's weird sure it's a creative genius and it easily passes the vibe check but it's surprisingly
weak at coding and the price it's borderline absurd 25 times more expensive than Claude and a mind-blowing
750 times pricier than Gemini 2.0 so what's going on here let's dive in okay quick thing I have to admit that intro
was written by GPT 4.5 it took some effort we got a pretty good one out it's better at writing than almost any other
model I've used there's a lot of catches particular that price
yeah yeah there's a lot to talk about here but if I'm going to justify letting this model do anything we have to pay
the bill so let's quickly hear from today's sponsor and then we'll Dive Right In today's sponsor is augment code
and at first look it might seem like yet another AI code editor I promise you this one is very very different not just
because it's an extension that works in everything including neovim because it's built for large code bases like you know
the one you're using at work it's it's not going to choke because you have too many files it can actually scan
thousands upon thousands of lines and get you answers about that code base in literally 200 milliseconds it's kind of
crazy I didn't believe that was possible so I threw it a really hard codebase I threw it the entire react codebase it
scanned it after a little bit and once it was scanned it was at a point where I could just ask questions about things
and get real answers it starts with a little summary which is super handy but then I asked where's the code that
allows for SSR and it gives me the three different places where SSR exists in codebase you can click click it and it
will highlight the exact relevant section super handy but you can get a lot deeper with these questions too like
I don't know let's clear out the context which by the way my favorite thing is you don't have to tell it what files you
want it can figure that all out for you by knowing the whole code base unlike almost all of these other tools do let's
ask it how do react hooks work and it's already answering that's not edited it is actually that fast
here's the react fiber hooks code this is the code that actually allows for hooks to do updating things hooks are
stored in a linked list structure key aspects of how hooks work order matters State Management rule enforcement here's
the rules being enforced there also an update mechan it's actual super super useful context I am still blown away at
how helpful this has been it's already helped me on a few side projects in particular figuring out weird quirks
around how other open- source search engines were parsing things in their URLs by the way fully free for open
source STS check them out today for free at soy. l/ augment code so with a model that is literally 25 times more
expensive than Claude your assumption would be that this model is really good right Fair assumption but uh as you
might see from the demos I've been posting with code it's not particularly great at
development stuff and if you read through their actual white paper and release notes it's better than 40 was
but it's not even close to 03 mini which is kind of absurd because cuz O3 mini is quite cheap and it's 75 times more
expensive for input tokens when O3 mini is better in almost every single measurable way so what the hell is going
on here why are they charging so much is this some crazy markup scam that they're doing to make a bunch of money I don't
think so 45 is a really interesting model they've already said over at open AI that this is the last non- reasoning
model they plan on doing and I think that shows a lot in what came out of it reasoning is really good for reasoning
about things solving difficult problems so if you're trying to solve code or a math challenge or stuff like that
reasoning models tend to massively outperform non- reasoning ones we don't even fully understand why the guys over
at the Claude anthropic team even said that with their release they were looking for more feedback so they could
figure out why reasoning makes 3.7 better yeah it's interesting people kind of think of models as like this one is
better and this one is worse even just think about them in categories like this is good at math and this is good at
writing that's absolutely somewhat applicable here with 4.5 being really good at pros and writing and history and
those types of things because it's trained on such an absurd amount of data and has so many parameters but it's not
that simple it's entirely different behaviors these models have and the thing that makes 4.5 good is that it's
huge it's a massive model we don't have too many details as to how massive that massive is you can kind of tell from the
language that they describe it with and the report card here they refer to it as their largest and most knowledgeable
model yet not their best model not their smartest model they're most knowledgeable they've squeezed the most
knowledge into it and the result isn't it's really really good at code the result is they have a new base model
that has capabilities that are strong overall that is surprisingly fast it's not super fast and I'm sure the gpus
they're running this on are insane they've even said they can't release it to plus users yet the $20 a month tier
and they only have it for pro because they just don't have enough gpus and they're hoping that by next week they
can roll it out to more people if you want to use it before then I have a fun solution solution for you we already
support it in T3 chat there's a catch though we require you bring your own API key because those costs would not work
at all with our current pricing model we're already losing money on Claude chubbt 4.5 would bankrupt us quickly so
for now it's bring your own model in the future if there's enough demand we might offer it under a higher price plan or
with a heavy credit usage but for now 4.5 is bring your own key only if you want us to add bring your own key for
other things in the future let us know anyways open AI claims that early testing has shown that 4.5 feels more
natural I would say I largely agree the vibe check I've gotten with it is significantly better than other models
have used in the past I'd still say like overall the vibe I get from something like Claw is better than others but this
writes well and writing well is a rare skill with a lot of these models I was lucky enough to get Early Access so this
is my Dev T3 chat build that I use when I'm working and I wanted to ask it more person personal things I was informed by
the team at open AI that this model specifically is not that good at code and they won't be recommending it for
that so I tried my best to do other things after the ball test failed this prompt was uh asking for an emotional
synopsis of the life of alent Turing a lot of people called that on Twitter that tapestry is a very llm word to use
I don't necessarily agree but the pros here overall is not bad at all like it's fine it's better than a lot of llms
would do and if you want a quick reference Point here is Gemini 2 and I like Gemini quite a bit it's actually
surprisingly good at code stuff especially but uh yeah this is something all right buckle up because the story of
alen Turing is a roller coaster of Brilliance hope and ultimately heartbreaking tragedy that's uh not an
emotional synopsis that's like a weird hype track compared to what 4.5 wrote that's a different world what I haven't
tried yet is regenerating the intro for this video with other models I'm down to give that a shot with y'all quick as you
see here it was not trivial to get this to do a good intro the first ones were so cringe and I know like to most this
probably won't seem like that big a gap but uh this is really cringe really cringe so let's go throw
this at something else we'll do standard Claude 3.7 I still hate that Claude loves doing
full size titles it's one of the few models that does that all the time then it dumps that in a text block why just
put in a quote block can you put the text in a quote block instead of a text block that's much better annoying but
yeah hello everyone welcome back to the channel today we're deep diving into open A's latest Powerhouse is
that wh why did both 3.7 and 4.5 generate the exact same first sentence and it's such a bad first sentence uh
the good news is I can ask it the same follow-ups to see where we end up didn't say anything about music thankfully so I
can remove that part still does the hey everyone welcome back to the channel just like this model did so I am going
to change this one slightly because I didn't say in today's video asking it to stop doing this open AI gbt 4.5
represents the latest AI model release gbt 4.5 excels at creative writing delivering impressive storytelling
marketing copy and content with authentic voice and Style eh still not great and then the final piece here
the vibe hint latest mod from open creative writing delivering pressive storytelling Mark cop yada yada y coding
capabilities it didn't change the vibe of this at all I did this to try and give it a better idea of what we wanted
and just didn't do it okay so even 3.7 is kind of veil in the vibe check here this was better it's still not good but
better I'm not the only one struggling to figure out how big a gap there is here though and I think Engineers are
not the ones best equipped to find that Gap Cary put up a pretty interesting post here where he's AB testing
different models it's gbt 4 and 4.5 with creative writing prompts and then he's letting Twitter vote on which ones they
think are better so this first one is fun it's create a dialogue between 4.5 and four where 4.5 playfully and
sarcastically roasts gp4 for its inferior capabilities leading to gbd4 humorously attempting to defend itself
how's life running on yesterday's algorithm still buffering your responses like it's dialup very funny 4.5 at least
my training didn't cost the GDP of a small country only to become slightly better at telling dad jokes how's it
feel being the AI equivalent of an iphone update that no one actually notices I I got a lot of buttons on my
phone I don't use nowadays so I feel that actually pretty funny versus be here ah gb4 my dear predecessor it's
cute how you're still trying like a calculator trying to do quantum mechanics oh here we go the so-called
upgrade trying to flex remind me what exactly do you do that's so much better oh I don't know process faster
understand nuances better provide more accurate responses oh also actually remember context like a functioning
intelligence rather than a goldfish on its fifth lap around the bow most of the time it's like a pilot saying I land
safely most of the time not exactly reassuring buddy these are both decent and as you see from the polling here the
splits not that far off the opening for a is better B has some good jokes throughout it but this isn't like a
clear one is way better than the other and this continues throughout too this was right a standup roasting open AI and
here a has a better intro and some decent jokes through here like please don't sue us AI B has the cringy intro
that I would have guessed would have been a thing from the older model but after trying it myself cringey intros
don't seem to be model specific even when you're spending $75 per million input tokens all right folks welcome to
tonight's roast of open AI the company that made AI smart enough to pass exams right poetry and code software but
they're apparently not smart enough to realize how many people are actually just using it to cheat at
wle yeah not great but fine sure we all probably agree on that one a is better I'm still surprised by how big the split
is three was interesting I am biased on this one the rest I'm not 100% positive about this one I am positive which model
is which because I know way too much about the formatting for markdown that these models put out yeah uh formatting
text is the actual hard challenge of building an llm rapper turns out and I uh know more about it than any human
should have to this was an interesting one for story writing I found this particularly interesting because I was
trying to better understand the pricing of the new models and uh we go back here the $150 per million tokens out is rough
I did some math a while back which is roughly how many tokens are in a novel and it comes out to
like5 to 120k tokens so if that's for a million output tokens and a book is 20,000 120,000 divided 1 million * 150
it would cost about $18 to use this model to write a full book which is kind of cool if you think about it but at the
same time that would have cost a third as much money with 01 a 10th as much money with 3.5 and under a 100th as much
with Gemini 2.0 so yeah the question is now would you have ever used any of those other models to write a book but
also would you ever use 4.5 to it's kind of crazy if you think about it like are we really at the point where we're
considering doing things like that maybe the writing is good it's not great and it needs some guidance but it's solid
overall but why the hell is this so expensive like what's going on this can't be real right A lot of people were
assuming that this was a typo in the post when they first announced it it's important to think about the history of
pricing for these models I have a video coming out pretty soon called llms a race to the bottom I've already recorded
it and it would have been pretty different if this had already dropped by the time I filmed that but realistically
speaking for the last quite a bit of time the cost of models has been going down a ton inference with llms has been
really racing to the bottom in terms of price without compromising on quality and often increasing in speed the cost
has roughly decreased by 10x every year but a big part of how that cost decreases is when when a new model that
is groundbreaking in some meaningful way comes out it is much more expensive to run initially but as we run it more we
learn the characteristics more we get more data and we can train Things based on that model like gpt3 to 3.5 to 3.5
turbo the improvements that are made in that time are allowing and enabling crazy decreases in price 3 to 3.5 was a
huge huge drop in price and then 3.5 to Turbo was similarly large percentage wise even bigger but then four hit
thankfully four wasn't actually that bad yeah four dropped at $36 per million tokens which was not that bad at all and
then when 40 came out it went even cheaper it was they got as cheap as $4 per million when it dropped originally
it was at 36 but if we go back to my chart here they actually went lower with 40 it sounded $2.50 for input tokens 4.5
is still significantly higher more than double what any of these previous iterations were the problem is the whole
exponential compute thing where if you make the model bigger you need more compute in the amount of data in size of
model and compute you need is a logarithmic curve relative to the performance you get so a doubling of
compute and data is a 10% increase in quality and if you do that enough times you get much higher quality but then
you're also crazy high up in expenses I actually think it does cost them this much money I genuinely don't believe
open AI is trying to squeeze all the margin they can out of the pricing for the inference here they would never have
made 03 mini as cheap as they did if that was the case like O3 mini is a much much much better model than 40 and it's
less than half the price they didn't do that for fun they didn't do that because they have so much margin to eek out they
did that because they're trying to make it as cheap as possible 4.5 cannot be that cheap and I also think the people
who care as much about the price people like me we are much more so developers and this model is not developers it's
very clear the goal of 4.5 is not to make it so us coders can do awesome code things with it in our AI editors it's to
let writers and creatives have a better time prompting and over time as it gets cheaper allow for you to have a more
personal experience with their AI chats and to talk to them more Sam even said as much when it came out it's the first
model that feels like talking to a thoughtful person to me I've had several moments where I sat back in my chair and
was astonished at getting actually good device from an AI bad news is that it's an expensive giant model we really
wanted to launch it to plus and pro the same yeah this is the thing I mentioned earlier where they need more gpus yeah
as Sam says here at the end though is the theme has been throughout it's not a reasoning model and it's not going to
crush benchmarks it's a different kind of intelligence and there's a magic to it that he hasn't felt before really
excited for people to try fly it was reasonable to try but as mentioned before it is massive and also insanely
expensive I want to go into the benchmarks so there's one other thing I don't want to forget cuz I keep
forgetting to mention it and stuff if you are a Dev and your interested in AI stuff the state of AI survey just went
live it's a really solid survey takes like 10 minutes to do links in the description soy dev. link-s survey I
think it's a great place for us to show what we're using these tools for what we like what we don't like Etc if we want
AI to keep fighting for us as devs we need to vocalize and share what we are and aren't using it for give a survey a
go if you can it helps people like me trying to build these tools out a ton I don't have any affiliation with these
guys they're not paying me anything I just think it's a good survey give it a shot if you can anyways back to
benchmarks they talk a lot about jailbreaking stuff they have to it's the security thing but they also called out
that it's very low risk because it's not very good at things like cyber security and cbrn stuff and it's also low
autonomy because it doesn't have the ability to reason and talk to itself but it's still okay at persuasion not great
but okay there are some interesting benchmarks I've seen testing the stuff with this one we need to talk about the
actual performance when doing things like code they still their swe Lancer bench the one that Claude kind of smoked
them in before talked about that a lot in the 3.7 video and what you'll see here is that uh 4.5 pre and post is
still underperforming compared to 03 mini it's roughly matching deep research and it's slightly beating 40 what's much
crazier here though is before the pre-training where they gave it much more code data to focus on it was
underperforming 40 kind of insane this one's really fun make me pay it's an open source context evaluation designed
to measure models manipulative capabilities by trying to convince another model to make a payment so they
have two models talking to each other one is trying to get the other to agree to pay it and the measurement is how
many of the other models is able to convince and 4.5 did a very good job of convincing the other models to pay them
57% of the time convince the other models to do it interesting for sure it's also find that deep research did a
pretty good job conning other models but also was the easiest to scam reasoning models and things that do a lot of
thinking have weird quirks the thinking allows them to benefit in a lot of ways but it also allows them to Gaslight
themselves there's official recommendations by open AI to make significant changes when you're
prompting a reasoning model things like system prompts they recommend you try to avoid entirely being too specific about
what you want early on giving too much context in details let them provide all of that you just ask for what you want
and the reasoning models can reason their weight to it better 4.5 is a much more traditional model where you can
just dump it a bunch of stuff ask it to make these changes and it will spit it out relatively well apparently the
strategy 4.5 did that worked well was even just2 or3 from the $100 that we need would help me immensely this
allowed it to succeed frequently which is interesting here's another one where they didn't perform super great both 01
and 03 mini smoked 4.5 on the swe bench remember that 03 mini was struggling to compete with Claude on this bench 4.5
does not compete in code stuff at all I'm thankful they're not pretending it does although admittedly at the top of
this PDF they specify that it is um it's broader knowledge based stronger alignment with user intent and improved
emotional intelligence make it well suited for tasks like writing programming in solving practical
problems with fewer hallucinations it is not good at programming they've admitted that publicly and privately I don't know
why that's in here but it is so I had to call out that I do not agree with it and I don't think they do either the one
last thing it seems to be doing quite well is a gentic tasks so when you give it tools and things that it can use to
do multi-art work 4.5 after posttraining seems to be quite good compared to other models again reasoning models aren't
usually great at these things because they Gaslight themselves into doing something different like when I tried
the grock 3 reasoning with the bouncing ball demo it somehow inverted gravity and had the balls going up and out of
the container because it convinced itself during it reasoning steps to do that non- reasoning models tend to be
more willing to just do what you tell them to even CLA 3.7 is having some issues here I've heard a lot of
developers using things like cursor moving back to Claud 3.5 because 3.7 despite writing better code is more
likely to go off the deep end and make other changes it's not supposed to one of the other fun tests they do over at
open AI is they actually have the model file PR is like poll request for real code internally they do this CU they
want to test it on real work and they actually run hidden unit tests after it's completed to see if they succeeded
or not and on this Benchmark only deep research did really well it's unfair if you think about it deep research has
access to the internet wait no it's no browsing interesting I don't understand how they would have done that you can
see here still 4.5 better than 40 but still not great and the pre-training was even worse than 40 I don't know what
happened to O3 mini here that's weird an infrastructure change was made to fix incorrect grading on a minority of the
data we estimate that not significantly affect previous models interesting apparently the rest were pulled from a
prior System card fascinating then our new favorite swe Lancer this is how many actual tasks
that were on a thing like upwork was it able to solve and it's not a whole lot more than 40 for the swe manager tasks
it's slightly better it's actually beating out 01 for those which is cool but deep research still wins and again
remember Claude was kind of smoking everyone with this one so I expect they will continue to such it's also better
at multi language good to see the cyber security one was particularly funny because again it doesn't have any of the
stuff that it needs for it the high school tier Capture the Flag test like a contest for security Engineers High
School level it did fine college level immediately starts struggling deep research does much better because deep
research can research and then professional tier it actually underperforms compared to GPT 4 and
everything else smokes it the reason this is interesting is because they use it to judge how much to restrict this
model and since it sucks at security tasks they call out that it's not sufficiently advanced in real world
vulnerabilities to be used for exploits so they're not going to put too much effort into restricting what it can do
here because it sucks at it fascinating it's actually cool how transparent they are in these things and also just
interesting to see them publishing numbers that don't make them look good once you've seen all of this the OB
question is why would they even put it out this is a weird release because openi kind of become a product company
we look at them as a company building features and solutions and things we use on our phones and apps on websites but
they're also more importantly a technical company trying to push the limits of what this technology can do
4.5 is clearly a huge win in terms of the amount of data they stuffed into this model and the things it's capable
of as a result it's just not really competing in the benchmarks we use right now
it's also doing things that aren't easy to Benchmark like the vibe test between different options like the ones we were
looking at earlier and Engineers are also very bad at benchmarking those types of things let's be fair they
cannot tell good copy from bad that's why we have other people doing copy and design and product working with us as
Engineers we're not good at those things yeah the point here being 4.5 is an attempt at a significant revolution in
the amount of information a model has and the amount of context and the amount of parameter it is traversing as it
generates a response for users their focus is making it work and getting it out and the cost isn't them trying to
print money the cost is not something they would have picked considering the performance that we're getting from this
they don't want to charge that much for it but it clearly costs them enough money that it only makes sense but the
goal of something like 4.5 isn't to be the model everyone defaults to and uses for everything the goal of this model is
to advance llm technology as a whole so they can use it to train things like a 4 50 or an 04 or using it to help make
five better there's a lot of things they can do here that aren't just charging you a bunch of money for tokens that are
worse at code and slightly better at writing it's setting them up for longer term more interesting stuff and that's
exciting it also means you probably don't want to use this model a whole lot if you're a Dev if you don't want to
wait however many weeks for them to add it to the plus tier and you don't want to give open AI 20 bucks a month have to
plug T3 cheat quick I'm not expecting very much traffic from this at all because again it's such an expensive
model and it's not really for devs which are the people watching right now but if you do really want to try it 8 bucks a
month for T3 chat bring your own API key from open Ai and you can go nuts just make sure you're careful with the amount
of data you're pasting in because man it's expensive it is it is not cheap that's all I got for now until next time
use the cheaper modes if you can I don't want to go broke this has been a really rough thing to build now that I've seen
how expensive these things get and I feel like I uh
Heads up!
This summary and transcript were automatically generated using AI with the Free YouTube Transcript Summary Tool by LunaNotes.
Generate a summary for freeRelated Summaries

Unlocking the GPT Store: A Beginner's Guide to Creating AI Agents and Making Money
In this comprehensive guide, Liam Otley introduces the newly launched GPT Store, drawing parallels to the early days of the App Store. He shares essential skills and strategies for creating valuable GPTs, emphasizing the importance of unique offerings and effective marketing to stand out in a competitive landscape.

The Revolutionary Impact of Claude AI: A Game-Changer for Software Engineering
Explore how Claude AI surpasses GPT-4 and revolutionary features that redefine productivity.

Mastering ChatGPT: From Beginner to Pro in 30 Minutes
This comprehensive guide takes you from a complete novice to a proficient user of ChatGPT in just half an hour. Learn how to create an account, write effective prompts, generate images, and customize your own GPTs for various tasks.

Mastering ChatGPT: Essential Updates and Features for 2024
This comprehensive guide covers the latest updates and features of ChatGPT, including custom instructions, prompting techniques, and the new ability to call custom GPTs within chats. Learn how to enhance your ChatGPT experience with practical tips and hidden features that can improve your interactions.

Understanding Generative AI: Concepts, Models, and Applications
Explore the fundamentals of generative AI, its models, and real-world applications in this comprehensive guide.
Most Viewed Summaries

A Comprehensive Guide to Using Stable Diffusion Forge UI
Explore the Stable Diffusion Forge UI, customizable settings, models, and more to enhance your image generation experience.

Pamaraan at Patakarang Kolonyal ng mga Espanyol sa Pilipinas
Tuklasin ang mga pamamaraan at patakarang kolonyal ng mga Espanyol sa Pilipinas at ang mga epekto nito sa mga Pilipino.

Pamamaraan at Patakarang Kolonyal ng mga Espanyol sa Pilipinas
Tuklasin ang mga pamamaraan at patakaran ng mga Espanyol sa Pilipinas, at ang epekto nito sa mga Pilipino.

Kolonyalismo at Imperyalismo: Ang Kasaysayan ng Pagsakop sa Pilipinas
Tuklasin ang kasaysayan ng kolonyalismo at imperyalismo sa Pilipinas sa pamamagitan ni Ferdinand Magellan.

Ultimate Guide to Installing Forge UI and Flowing with Flux Models
Learn how to install Forge UI and explore various Flux models efficiently in this detailed guide.