Exploring GPT-4.5: A Comprehensive Review of Its Strengths and Weaknesses

Overview of GPT-4.5

GPT-4.5 has been released, showcasing remarkable creative writing abilities but surprisingly underwhelming coding performance. Despite being 25 times more expensive than Claude and 750 times pricier than Gemini 2.0, its pricing raises questions about its value.

Key Features of GPT-4.5

Creative Writing: Excels in storytelling, marketing copy, and content generation with a natural voice.
Coding Performance: Struggles with coding tasks, underperforming compared to cheaper models like O3 Mini.
Pricing: At $150 per million tokens, it is significantly more expensive than previous models, raising concerns about its cost-effectiveness.

Comparison with Other Models

Claude: Better at coding tasks despite being cheaper. For a deeper understanding of Claude's impact, check out The Revolutionary Impact of Claude AI: A Game-Changer for Software Engineering.
Gemini 2.0: Offers better coding capabilities at a fraction of the cost.
O3 Mini: Outperforms GPT-4.5 in various coding benchmarks.

Insights on Pricing Strategy

OpenAI's pricing reflects the high costs of running a large model, not an attempt to maximize profits.
The model is designed for creative tasks rather than coding, indicating a shift in focus for OpenAI. This aligns with the broader trends in Understanding Generative AI: Concepts, Models, and Applications.

Conclusion

GPT-4.5 represents a significant advancement in AI technology, particularly in creative writing, but its high cost and coding limitations may deter developers. As OpenAI continues to refine its models, the future of AI language processing looks promising yet complex. For insights into the creative industries affected by generative AI, see The Impact of Generative AI on Creative Industries and the Need for Protection.

FAQs

What are the main strengths of GPT-4.5?
GPT-4.5 excels in creative writing, storytelling, and generating marketing content.
How does GPT-4.5 compare to Claude and Gemini 2.0?
While GPT-4.5 is more expensive, Claude and Gemini 2.0 outperform it in coding tasks.
Why is GPT-4.5 so expensive?
The high cost is due to the extensive resources required to run the model, not an attempt to maximize profits.
Is GPT-4.5 suitable for coding tasks?
No, GPT-4.5 is not recommended for coding as it underperforms compared to other models.
What is the pricing for using GPT-4.5?
The cost is $150 per million tokens, making it significantly more expensive than previous models.
What is the future of AI models like GPT-4.5?
OpenAI aims to advance AI technology, focusing on creative tasks while refining future models for better performance. For more on the evolution of AI technology, see Deep Seek R1: The Game Changer in AI Technology.
Can I use GPT-4.5 for free?
Currently, GPT-4.5 is available through paid services, and users need to bring their own API key for access.

gbt 4.5 is here and it's weird sure it's a creative genius and it easily passes the vibe check but it's surprisingly

weak at coding and the price it's borderline absurd 25 times more expensive than Claude and a mind-blowing

750 times pricier than Gemini 2.0 so what's going on here let's dive in okay quick thing I have to admit that intro

was written by GPT 4.5 it took some effort we got a pretty good one out it's better at writing than almost any other

model I've used there's a lot of catches particular that price

yeah yeah there's a lot to talk about here but if I'm going to justify letting this model do anything we have to pay

the bill so let's quickly hear from today's sponsor and then we'll Dive Right In today's sponsor is augment code

and at first look it might seem like yet another AI code editor I promise you this one is very very different not just

because it's an extension that works in everything including neovim because it's built for large code bases like you know

the one you're using at work it's it's not going to choke because you have too many files it can actually scan

thousands upon thousands of lines and get you answers about that code base in literally 200 milliseconds it's kind of

crazy I didn't believe that was possible so I threw it a really hard codebase I threw it the entire react codebase it

scanned it after a little bit and once it was scanned it was at a point where I could just ask questions about things

and get real answers it starts with a little summary which is super handy but then I asked where's the code that

allows for SSR and it gives me the three different places where SSR exists in codebase you can click click it and it

will highlight the exact relevant section super handy but you can get a lot deeper with these questions too like

I don't know let's clear out the context which by the way my favorite thing is you don't have to tell it what files you

want it can figure that all out for you by knowing the whole code base unlike almost all of these other tools do let's

ask it how do react hooks work and it's already answering that's not edited it is actually that fast

here's the react fiber hooks code this is the code that actually allows for hooks to do updating things hooks are

stored in a linked list structure key aspects of how hooks work order matters State Management rule enforcement here's

the rules being enforced there also an update mechan it's actual super super useful context I am still blown away at

how helpful this has been it's already helped me on a few side projects in particular figuring out weird quirks

around how other open- source search engines were parsing things in their URLs by the way fully free for open

source STS check them out today for free at soy. l/ augment code so with a model that is literally 25 times more

expensive than Claude your assumption would be that this model is really good right Fair assumption but uh as you

might see from the demos I've been posting with code it's not particularly great at

development stuff and if you read through their actual white paper and release notes it's better than 40 was

but it's not even close to 03 mini which is kind of absurd because cuz O3 mini is quite cheap and it's 75 times more

expensive for input tokens when O3 mini is better in almost every single measurable way so what the hell is going

on here why are they charging so much is this some crazy markup scam that they're doing to make a bunch of money I don't

think so 45 is a really interesting model they've already said over at open AI that this is the last non- reasoning

model they plan on doing and I think that shows a lot in what came out of it reasoning is really good for reasoning

about things solving difficult problems so if you're trying to solve code or a math challenge or stuff like that

reasoning models tend to massively outperform non- reasoning ones we don't even fully understand why the guys over

at the Claude anthropic team even said that with their release they were looking for more feedback so they could

figure out why reasoning makes 3.7 better yeah it's interesting people kind of think of models as like this one is

better and this one is worse even just think about them in categories like this is good at math and this is good at

writing that's absolutely somewhat applicable here with 4.5 being really good at pros and writing and history and

those types of things because it's trained on such an absurd amount of data and has so many parameters but it's not

that simple it's entirely different behaviors these models have and the thing that makes 4.5 good is that it's

huge it's a massive model we don't have too many details as to how massive that massive is you can kind of tell from the

language that they describe it with and the report card here they refer to it as their largest and most knowledgeable

model yet not their best model not their smartest model they're most knowledgeable they've squeezed the most

knowledge into it and the result isn't it's really really good at code the result is they have a new base model

that has capabilities that are strong overall that is surprisingly fast it's not super fast and I'm sure the gpus

they're running this on are insane they've even said they can't release it to plus users yet the $20 a month tier

and they only have it for pro because they just don't have enough gpus and they're hoping that by next week they

can roll it out to more people if you want to use it before then I have a fun solution solution for you we already

support it in T3 chat there's a catch though we require you bring your own API key because those costs would not work

at all with our current pricing model we're already losing money on Claude chubbt 4.5 would bankrupt us quickly so

for now it's bring your own model in the future if there's enough demand we might offer it under a higher price plan or

with a heavy credit usage but for now 4.5 is bring your own key only if you want us to add bring your own key for

other things in the future let us know anyways open AI claims that early testing has shown that 4.5 feels more

natural I would say I largely agree the vibe check I've gotten with it is significantly better than other models

have used in the past I'd still say like overall the vibe I get from something like Claw is better than others but this

writes well and writing well is a rare skill with a lot of these models I was lucky enough to get Early Access so this

is my Dev T3 chat build that I use when I'm working and I wanted to ask it more person personal things I was informed by

the team at open AI that this model specifically is not that good at code and they won't be recommending it for

that so I tried my best to do other things after the ball test failed this prompt was uh asking for an emotional

synopsis of the life of alent Turing a lot of people called that on Twitter that tapestry is a very llm word to use

I don't necessarily agree but the pros here overall is not bad at all like it's fine it's better than a lot of llms

would do and if you want a quick reference Point here is Gemini 2 and I like Gemini quite a bit it's actually

surprisingly good at code stuff especially but uh yeah this is something all right buckle up because the story of

alen Turing is a roller coaster of Brilliance hope and ultimately heartbreaking tragedy that's uh not an

emotional synopsis that's like a weird hype track compared to what 4.5 wrote that's a different world what I haven't

tried yet is regenerating the intro for this video with other models I'm down to give that a shot with y'all quick as you

see here it was not trivial to get this to do a good intro the first ones were so cringe and I know like to most this

probably won't seem like that big a gap but uh this is really cringe really cringe so let's go throw

this at something else we'll do standard Claude 3.7 I still hate that Claude loves doing

full size titles it's one of the few models that does that all the time then it dumps that in a text block why just

put in a quote block can you put the text in a quote block instead of a text block that's much better annoying but

yeah hello everyone welcome back to the channel today we're deep diving into open A's latest Powerhouse is

that wh why did both 3.7 and 4.5 generate the exact same first sentence and it's such a bad first sentence uh

the good news is I can ask it the same follow-ups to see where we end up didn't say anything about music thankfully so I

can remove that part still does the hey everyone welcome back to the channel just like this model did so I am going

to change this one slightly because I didn't say in today's video asking it to stop doing this open AI gbt 4.5

represents the latest AI model release gbt 4.5 excels at creative writing delivering impressive storytelling

marketing copy and content with authentic voice and Style eh still not great and then the final piece here

the vibe hint latest mod from open creative writing delivering pressive storytelling Mark cop yada yada y coding

capabilities it didn't change the vibe of this at all I did this to try and give it a better idea of what we wanted

and just didn't do it okay so even 3.7 is kind of veil in the vibe check here this was better it's still not good but

better I'm not the only one struggling to figure out how big a gap there is here though and I think Engineers are

not the ones best equipped to find that Gap Cary put up a pretty interesting post here where he's AB testing

different models it's gbt 4 and 4.5 with creative writing prompts and then he's letting Twitter vote on which ones they

think are better so this first one is fun it's create a dialogue between 4.5 and four where 4.5 playfully and

sarcastically roasts gp4 for its inferior capabilities leading to gbd4 humorously attempting to defend itself

how's life running on yesterday's algorithm still buffering your responses like it's dialup very funny 4.5 at least

my training didn't cost the GDP of a small country only to become slightly better at telling dad jokes how's it

feel being the AI equivalent of an iphone update that no one actually notices I I got a lot of buttons on my

phone I don't use nowadays so I feel that actually pretty funny versus be here ah gb4 my dear predecessor it's

cute how you're still trying like a calculator trying to do quantum mechanics oh here we go the so-called

upgrade trying to flex remind me what exactly do you do that's so much better oh I don't know process faster

understand nuances better provide more accurate responses oh also actually remember context like a functioning

intelligence rather than a goldfish on its fifth lap around the bow most of the time it's like a pilot saying I land

safely most of the time not exactly reassuring buddy these are both decent and as you see from the polling here the

splits not that far off the opening for a is better B has some good jokes throughout it but this isn't like a

clear one is way better than the other and this continues throughout too this was right a standup roasting open AI and

here a has a better intro and some decent jokes through here like please don't sue us AI B has the cringy intro

that I would have guessed would have been a thing from the older model but after trying it myself cringey intros

don't seem to be model specific even when you're spending $75 per million input tokens all right folks welcome to

tonight's roast of open AI the company that made AI smart enough to pass exams right poetry and code software but

they're apparently not smart enough to realize how many people are actually just using it to cheat at

wle yeah not great but fine sure we all probably agree on that one a is better I'm still surprised by how big the split

is three was interesting I am biased on this one the rest I'm not 100% positive about this one I am positive which model

is which because I know way too much about the formatting for markdown that these models put out yeah uh formatting

text is the actual hard challenge of building an llm rapper turns out and I uh know more about it than any human

should have to this was an interesting one for story writing I found this particularly interesting because I was

trying to better understand the pricing of the new models and uh we go back here the $150 per million tokens out is rough

I did some math a while back which is roughly how many tokens are in a novel and it comes out to

like5 to 120k tokens so if that's for a million output tokens and a book is 20,000 120,000 divided 1 million * 150

it would cost about $18 to use this model to write a full book which is kind of cool if you think about it but at the

same time that would have cost a third as much money with 01 a 10th as much money with 3.5 and under a 100th as much

with Gemini 2.0 so yeah the question is now would you have ever used any of those other models to write a book but

also would you ever use 4.5 to it's kind of crazy if you think about it like are we really at the point where we're

considering doing things like that maybe the writing is good it's not great and it needs some guidance but it's solid

overall but why the hell is this so expensive like what's going on this can't be real right A lot of people were

assuming that this was a typo in the post when they first announced it it's important to think about the history of

pricing for these models I have a video coming out pretty soon called llms a race to the bottom I've already recorded

it and it would have been pretty different if this had already dropped by the time I filmed that but realistically

speaking for the last quite a bit of time the cost of models has been going down a ton inference with llms has been

really racing to the bottom in terms of price without compromising on quality and often increasing in speed the cost

has roughly decreased by 10x every year but a big part of how that cost decreases is when when a new model that

is groundbreaking in some meaningful way comes out it is much more expensive to run initially but as we run it more we

learn the characteristics more we get more data and we can train Things based on that model like gpt3 to 3.5 to 3.5

turbo the improvements that are made in that time are allowing and enabling crazy decreases in price 3 to 3.5 was a

huge huge drop in price and then 3.5 to Turbo was similarly large percentage wise even bigger but then four hit

thankfully four wasn't actually that bad yeah four dropped at $36 per million tokens which was not that bad at all and

then when 40 came out it went even cheaper it was they got as cheap as $4 per million when it dropped originally

it was at 36 but if we go back to my chart here they actually went lower with 40 it sounded $2.50 for input tokens 4.5

is still significantly higher more than double what any of these previous iterations were the problem is the whole

exponential compute thing where if you make the model bigger you need more compute in the amount of data in size of

model and compute you need is a logarithmic curve relative to the performance you get so a doubling of

compute and data is a 10% increase in quality and if you do that enough times you get much higher quality but then

you're also crazy high up in expenses I actually think it does cost them this much money I genuinely don't believe

open AI is trying to squeeze all the margin they can out of the pricing for the inference here they would never have

made 03 mini as cheap as they did if that was the case like O3 mini is a much much much better model than 40 and it's

less than half the price they didn't do that for fun they didn't do that because they have so much margin to eek out they

did that because they're trying to make it as cheap as possible 4.5 cannot be that cheap and I also think the people

who care as much about the price people like me we are much more so developers and this model is not developers it's

very clear the goal of 4.5 is not to make it so us coders can do awesome code things with it in our AI editors it's to

let writers and creatives have a better time prompting and over time as it gets cheaper allow for you to have a more

personal experience with their AI chats and to talk to them more Sam even said as much when it came out it's the first

model that feels like talking to a thoughtful person to me I've had several moments where I sat back in my chair and

was astonished at getting actually good device from an AI bad news is that it's an expensive giant model we really

wanted to launch it to plus and pro the same yeah this is the thing I mentioned earlier where they need more gpus yeah

as Sam says here at the end though is the theme has been throughout it's not a reasoning model and it's not going to

crush benchmarks it's a different kind of intelligence and there's a magic to it that he hasn't felt before really

excited for people to try fly it was reasonable to try but as mentioned before it is massive and also insanely

expensive I want to go into the benchmarks so there's one other thing I don't want to forget cuz I keep

forgetting to mention it and stuff if you are a Dev and your interested in AI stuff the state of AI survey just went

live it's a really solid survey takes like 10 minutes to do links in the description soy dev. link-s survey I

think it's a great place for us to show what we're using these tools for what we like what we don't like Etc if we want

AI to keep fighting for us as devs we need to vocalize and share what we are and aren't using it for give a survey a

go if you can it helps people like me trying to build these tools out a ton I don't have any affiliation with these

guys they're not paying me anything I just think it's a good survey give it a shot if you can anyways back to

benchmarks they talk a lot about jailbreaking stuff they have to it's the security thing but they also called out

that it's very low risk because it's not very good at things like cyber security and cbrn stuff and it's also low

autonomy because it doesn't have the ability to reason and talk to itself but it's still okay at persuasion not great

but okay there are some interesting benchmarks I've seen testing the stuff with this one we need to talk about the

actual performance when doing things like code they still their swe Lancer bench the one that Claude kind of smoked

them in before talked about that a lot in the 3.7 video and what you'll see here is that uh 4.5 pre and post is

still underperforming compared to 03 mini it's roughly matching deep research and it's slightly beating 40 what's much

crazier here though is before the pre-training where they gave it much more code data to focus on it was

underperforming 40 kind of insane this one's really fun make me pay it's an open source context evaluation designed

to measure models manipulative capabilities by trying to convince another model to make a payment so they

have two models talking to each other one is trying to get the other to agree to pay it and the measurement is how

many of the other models is able to convince and 4.5 did a very good job of convincing the other models to pay them

57% of the time convince the other models to do it interesting for sure it's also find that deep research did a

pretty good job conning other models but also was the easiest to scam reasoning models and things that do a lot of

thinking have weird quirks the thinking allows them to benefit in a lot of ways but it also allows them to Gaslight

themselves there's official recommendations by open AI to make significant changes when you're

prompting a reasoning model things like system prompts they recommend you try to avoid entirely being too specific about

what you want early on giving too much context in details let them provide all of that you just ask for what you want

and the reasoning models can reason their weight to it better 4.5 is a much more traditional model where you can

just dump it a bunch of stuff ask it to make these changes and it will spit it out relatively well apparently the

strategy 4.5 did that worked well was even just2 or3 from the $100 that we need would help me immensely this

allowed it to succeed frequently which is interesting here's another one where they didn't perform super great both 01

and 03 mini smoked 4.5 on the swe bench remember that 03 mini was struggling to compete with Claude on this bench 4.5

does not compete in code stuff at all I'm thankful they're not pretending it does although admittedly at the top of

this PDF they specify that it is um it's broader knowledge based stronger alignment with user intent and improved

emotional intelligence make it well suited for tasks like writing programming in solving practical

problems with fewer hallucinations it is not good at programming they've admitted that publicly and privately I don't know

why that's in here but it is so I had to call out that I do not agree with it and I don't think they do either the one

last thing it seems to be doing quite well is a gentic tasks so when you give it tools and things that it can use to

do multi-art work 4.5 after posttraining seems to be quite good compared to other models again reasoning models aren't

usually great at these things because they Gaslight themselves into doing something different like when I tried

the grock 3 reasoning with the bouncing ball demo it somehow inverted gravity and had the balls going up and out of

the container because it convinced itself during it reasoning steps to do that non- reasoning models tend to be

more willing to just do what you tell them to even CLA 3.7 is having some issues here I've heard a lot of

developers using things like cursor moving back to Claud 3.5 because 3.7 despite writing better code is more

likely to go off the deep end and make other changes it's not supposed to one of the other fun tests they do over at

open AI is they actually have the model file PR is like poll request for real code internally they do this CU they

want to test it on real work and they actually run hidden unit tests after it's completed to see if they succeeded

or not and on this Benchmark only deep research did really well it's unfair if you think about it deep research has

access to the internet wait no it's no browsing interesting I don't understand how they would have done that you can

see here still 4.5 better than 40 but still not great and the pre-training was even worse than 40 I don't know what

happened to O3 mini here that's weird an infrastructure change was made to fix incorrect grading on a minority of the

data we estimate that not significantly affect previous models interesting apparently the rest were pulled from a

prior System card fascinating then our new favorite swe Lancer this is how many actual tasks

that were on a thing like upwork was it able to solve and it's not a whole lot more than 40 for the swe manager tasks

it's slightly better it's actually beating out 01 for those which is cool but deep research still wins and again

remember Claude was kind of smoking everyone with this one so I expect they will continue to such it's also better

at multi language good to see the cyber security one was particularly funny because again it doesn't have any of the

stuff that it needs for it the high school tier Capture the Flag test like a contest for security Engineers High

School level it did fine college level immediately starts struggling deep research does much better because deep

research can research and then professional tier it actually underperforms compared to GPT 4 and

everything else smokes it the reason this is interesting is because they use it to judge how much to restrict this

model and since it sucks at security tasks they call out that it's not sufficiently advanced in real world

vulnerabilities to be used for exploits so they're not going to put too much effort into restricting what it can do

here because it sucks at it fascinating it's actually cool how transparent they are in these things and also just

interesting to see them publishing numbers that don't make them look good once you've seen all of this the OB

question is why would they even put it out this is a weird release because openi kind of become a product company

we look at them as a company building features and solutions and things we use on our phones and apps on websites but

they're also more importantly a technical company trying to push the limits of what this technology can do

4.5 is clearly a huge win in terms of the amount of data they stuffed into this model and the things it's capable

of as a result it's just not really competing in the benchmarks we use right now

it's also doing things that aren't easy to Benchmark like the vibe test between different options like the ones we were

looking at earlier and Engineers are also very bad at benchmarking those types of things let's be fair they

cannot tell good copy from bad that's why we have other people doing copy and design and product working with us as

Engineers we're not good at those things yeah the point here being 4.5 is an attempt at a significant revolution in

the amount of information a model has and the amount of context and the amount of parameter it is traversing as it

generates a response for users their focus is making it work and getting it out and the cost isn't them trying to

print money the cost is not something they would have picked considering the performance that we're getting from this

they don't want to charge that much for it but it clearly costs them enough money that it only makes sense but the

goal of something like 4.5 isn't to be the model everyone defaults to and uses for everything the goal of this model is

to advance llm technology as a whole so they can use it to train things like a 4 50 or an 04 or using it to help make

five better there's a lot of things they can do here that aren't just charging you a bunch of money for tokens that are

worse at code and slightly better at writing it's setting them up for longer term more interesting stuff and that's

exciting it also means you probably don't want to use this model a whole lot if you're a Dev if you don't want to

wait however many weeks for them to add it to the plus tier and you don't want to give open AI 20 bucks a month have to

plug T3 cheat quick I'm not expecting very much traffic from this at all because again it's such an expensive

model and it's not really for devs which are the people watching right now but if you do really want to try it 8 bucks a

month for T3 chat bring your own API key from open Ai and you can go nuts just make sure you're careful with the amount

of data you're pasting in because man it's expensive it is it is not cheap that's all I got for now until next time

use the cheaper modes if you can I don't want to go broke this has been a really rough thing to build now that I've seen

how expensive these things get and I feel like I uh

Heads up!

This summary and transcript were automatically generated using AI with the Free YouTube Transcript Summary Tool by LunaNotes.

Generate a summary for free

Related Summaries

OpenAI Launches GPT-5: Expert-Level AI Revolutionizes Coding, Learning, and Healthcare

OpenAI unveils GPT-5, a groundbreaking AI model delivering expert-level intelligence, faster performance, and enhanced reliability. This major upgrade empowers users with advanced coding capabilities, personalized learning, and transformative healthcare support, available to free and paid users alike.

Unlocking the GPT Store: A Beginner's Guide to Creating AI Agents and Making Money

In this comprehensive guide, Liam Otley introduces the newly launched GPT Store, drawing parallels to the early days of the App Store. He shares essential skills and strategies for creating valuable GPTs, emphasizing the importance of unique offerings and effective marketing to stand out in a competitive landscape.

The Revolutionary Impact of Claude AI: A Game-Changer for Software Engineering

Explore how Claude AI surpasses GPT-4 and revolutionary features that redefine productivity.

GPT5: El Mejor Modelo de IA de OpenAI y sus Innovaciones Clave

Descubre por qué GPT5 es considerado el modelo de inteligencia artificial más avanzado de OpenAI, superando a competidores en programación, razonamiento y manejo de contexto. Con una ventana de contexto de 400,000 tokens y mejoras en la reducción de alucinaciones, GPT5 revoluciona el uso profesional de IA.

Nuevos Modelos GPT-4.1 de OpenAI: Comparativa y Análisis

OpenAI ha lanzado tres nuevos modelos de la serie GPT, incluyendo el GPT-4.1, GPT-4.1 Mini y GPT-4.1 Nano, diseñados para mejorar la programación y competir con otros modelos populares. En este video, se analizan sus características, rendimiento y se comparan con modelos como Cloud Sonet 3.7 y Gemini 2.5 Pro.