Overview of the Masterclass
This masterclass is designed to provide a thorough understanding of Apache Spark, focusing on key concepts and techniques essential for data engineering in 2025. The course covers:
- Spark Architecture: Understanding the master-slave architecture, driver, and executor roles.
- Transformations and Actions: Differentiating between narrow and wide transformations, lazy evaluation, and actions in Spark.
- Memory Management: Insights into driver and executor memory management, including handling out-of-memory errors.
- Dynamic Partition Pruning: Techniques to optimize data processing by reducing unnecessary data scans.
- Joins in Spark: Exploring different types of joins, including shuffle sort merge joins and broadcast joins, and their implications on performance (see the short sketch after this list).
- Caching and Persistence: Understanding how to effectively cache data for improved performance.
- Unified Memory Management: How Spark manages memory allocation between execution and storage.
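To make the joins bullet concrete, here is a small hedged sketch in PySpark; the `orders` and `countries` DataFrames are invented purely for illustration. By default Spark may pick a shuffle sort merge join for two large inputs, while the `broadcast()` hint ships the small side to every executor and avoids the shuffle.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-demo").getOrCreate()

# Hypothetical DataFrames, for illustration only
orders = spark.createDataFrame([(1, "IN"), (2, "US")], ["order_id", "country_code"])
countries = spark.createDataFrame([("IN", "India"), ("US", "United States")],
                                  ["country_code", "country_name"])

# Large-vs-large joins typically become shuffle sort merge joins
shuffle_join = orders.join(countries, "country_code")

# Hinting the small side broadcasts it to every executor instead of shuffling both sides
broadcast_join = orders.join(broadcast(countries), "country_code")

broadcast_join.explain()  # the physical plan should show a BroadcastHashJoin
```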
Key Concepts Covered
- Spark Architecture: Mastering the components and their interactions.
- Transformations: Learning about lazy evaluation and the importance of actions.
- Memory Management: Strategies to avoid out-of-memory errors and optimize resource usage.
- Dynamic Partition Pruning: Techniques to enhance query performance by reducing data scans (a small sketch follows this list).
- Joins: Understanding the mechanics of different join types and their performance implications.
- Caching: Best practices for caching data to improve processing speed.
- Unified Memory Management: How Spark allocates memory dynamically based on workload.
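As a small taste of the dynamic partition pruning item (covered in depth later in the course), here is a hedged sketch. The `sales` and `dates` tables and their columns are hypothetical, and the config key shown is the standard Spark one, already enabled by default in Spark 3.x.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpp-demo").getOrCreate()

# DPP is on by default in Spark 3.x; shown here only to make the knob visible
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Hypothetical tables: `sales` is a fact table partitioned by sale_date,
# `dates` is a small dimension table with an is_holiday flag
sales = spark.table("sales")
dates = spark.table("dates")

# The filter on the dimension side lets Spark prune sales partitions at runtime
result = sales.join(dates, "sale_date").where(dates["is_holiday"])
result.explain()  # look for "dynamicpruning" subqueries in the plan
```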
FAQs
- What is Apache Spark?
  Apache Spark is an open-source distributed computing system designed for fast processing of large datasets across clusters of computers.
- What are the main components of Spark architecture?
  The main components include the driver, executors, and cluster manager, which work together to process data efficiently.
- What is lazy evaluation in Spark?
  Lazy evaluation means that Spark does not execute transformations until an action is called, allowing for optimization of the execution plan.
- How does Spark manage memory?
  Spark uses a unified memory management model that dynamically allocates memory between execution and storage based on workload requirements.
- What is dynamic partition pruning?
  Dynamic partition pruning is a technique that allows Spark to skip reading unnecessary partitions based on filter conditions applied to joined tables.
- What types of joins are available in Spark?
  Spark supports various join types, including inner, outer, left, right, and broadcast joins, each with different performance characteristics.
- How can I optimize Spark jobs?
  Optimizing Spark jobs involves using techniques such as caching, partitioning, and understanding the execution plan to reduce resource consumption and improve performance (a short sketch follows below). For more insights on data processing techniques, check out our summary on Mastering Pandas DataFrames: A Comprehensive Guide.
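To ground that last answer, here is a minimal hedged sketch; the file path and column names are hypothetical. It also shows lazy evaluation in action: nothing runs until the `count()` at the end.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("optimize-demo").getOrCreate()

# Hypothetical input path and columns
df = spark.read.option("header", True).csv("/data/events.csv")

# Transformations are lazy: this only builds a plan, nothing executes yet
filtered = df.where(F.col("event_type") == "click")

# Repartition to control parallelism and file sizes (tune the number to your cluster)
repartitioned = filtered.repartition(8, "country")

# Cache if the result is reused by several downstream actions
repartitioned.cache()

repartitioned.explain()       # inspect the physical plan before running anything
print(repartitioned.count())  # the action finally triggers execution
```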
Additionally, if you're interested in the broader context of data analytics and its career prospects, you might find The Ultimate Guide to a Career in Data Analytics: Roles, Responsibilities, and Skills helpful.
This 6-hour-long masterclass will cover everything you need to know in 2025 from scratch, including Spark architecture, Spark DAGs and the UI, lazy evaluation and actions, narrow versus wide transformations, shuffle joins and broadcast joins, the Spark SQL engine and query plans, driver memory management, executor memory allocation, out-of-memory errors, unified memory management, the garbage collection cycle, edge nodes and deployment modes, storage levels with cache and persist, dynamic partition pruning, adaptive query execution, handling skew with salting, and much more. I spent a lot of time creating this complete guide, and I referred to so many books, documentation pages, and playlists to provide you the best condensed form of knowledge and a one-stop solution. If you found this video, that means it is made for you and will help you achieve your goals in 2025. There is one prerequisite for this video: you need dedication to complete it and commitment to yourself to become a pro. That's it. So if you can commit to this prerequisite, let's get started with the course. So, what's
up? What's up? What's up, my data fam? What's up? Happy Sunday. And there's a surprise for you today. What surprise?
This is the biggest surprise because I have wanted to create this Apache Spark ultimate guide for so long, and I think now is the right time to actually deliver it. I know, I know, I know, Spark is very much in demand. I know that, and I know that I have already created my PySpark full course video in which I told you how to write the coding part of Spark, you can say PySpark, using the Python API called PySpark. That is fine, I know that, I know that, but what is this Spark ultimate guide? Because the thing is, coding is fine. I know now you know how to code, and if you haven't watched that video you can simply click on the link in the i button and you will see that Apache PySpark full course video; that is for the coding perspective. Okay, but now in today's video we are going to master the conceptual areas as well. There are like so many advancements and
I would say if you are sitting in the interviews it's very important to have that coding knowledge as well. But in
order to you can say crack the interviews of big companies, you need to have strong grasp over spark
concepts. I can just give you some hints: let's say how you can read the DAGs, how you can perform dynamic partition pruning, how joins actually happen behind the scenes. You need to know all these things. Spark architecture, client mode, cluster mode, everything, everything. And this is one of the most important areas to crack any
big data engineering or data engineering interview regardless of any cloud platform. This is the most important
part. Yes, once you land as a data engineer, okay, once you land as a data engineer, it is fine. It is fine. um if
you just know the coding okay it is like I I wouldn't say like it is the ideal scenario but it will work for the junior
roles okay because you just need to write the code that's it but I would say even if you just want to crack a junior
role data engineer interview you cannot skip this part you cannot and obviously once you land as a data engineer it
becomes your bread and butter why because you will see on a daily basis you are actually using all these stuff
you write your code, very good; you are seeing performance issues, there you need this knowledge. You are, let's say, doing a POC, which is a proof of concept project, and you will see some performance things, you will see
monitoring stuff so for all that thing apart from coding part you need all these things you need to understand hey
I'm writing this code but what is actually happening behind the code you need to know all these things. So that
is why and I can say that this video will be very much in detail and obviously it should be because this is
one of the most important and, I would say, the most complex part. But don't worry, this is Ansh Lamba, and you know me, I will make all the concepts so easy, so simple to understand, that you will say, hey, Spark is so easy, Spark is so easy,
and that's what my goal is I want to make you understand all these complex concepts with ease you do not need to
worry at all. So I'm really excited for today's video, and let me just tell you about the prerequisites, and you will be very happy after knowing this. There are no prerequisites, nothing. This video is for those who are planning to become a
data engineer. This video is for those who are already in the data domain maybe data analyst, data scientist. This video
is for those who are like maybe software engineers or any any other um tech domain but they do not know anything
about data engineering. Yes, this video is for those. Plus, even if you are a data engineer, this video is for you.
Why? Because let me tell you, I know like there are like so many hirings for data engineers and
actually actually companies hire data engineers in bulk. Okay, in the previous years they actually hired data engineers
in bulk. Okay, now they might not know all the things. Let me be very honest with you, okay?
Because obviously in the technical rounds they were asking just the coding part, but now it's time to actually tune that code. Because even if you use AI, ChatGPT, you can only get the code: you will simply say how to apply a left join, you will get the code; how to apply a group by, you will get the code. But how to tune that? You need to know what to look at, what things to consider, how to make some tweaks with the memory optimization and everything, how to pick the right size of executor and compute, everything. So that is why it becomes even more important
because AI is here. So how you can just beat the talent and this is the way to do that because you need human
intervention to do all these things. Yes, you can replace a human with AI if you just want to write PySpark code, but
if you want to tune it, if you want to optimize it, you need experience, you need knowledge. And I I I am not saying
like AI cannot do it. If you want AI to do it, you still need human intervention because you will actually guide AI what
things to look at, what are your inputs and what needs you want to tweak. And in order to provide these parameters to AI,
you need to know all these things, right? Common sense. So I would say this video will help you to become AI proof
as well. Trust me, because we are going to go much more in depth into Apache Spark. Okay. So there are no prerequisites, and as I just said, even if you're a data engineer this video is for you. Trust me, trust me,
trust me. There are like so many things, bro. There are so many things. And trust me, even if I just try to revise the
notes of my Apache Spark, I still feel like, hey, there's so much to learn. There is so much to just revisit. Okay?
Because Spark is such a wide topic right now and we see new things in every upgrade. So do not feel like you are
already data engineer. You do not need to know anything in Spark. you still need to know a lot of things and trust
me when you will be just watching this video you will feel like oh this was new oh this was the right way to understand
it so I am really excited for this video and we are going to just like get started with this video so just hit that
subscribe button right now right now and right now and let me just welcome so many few like so many new members on my
channel because I was just waiting for this moment and let's welcome all the new members and if you also want to
become a part of this channel, you can simply click on the join button and you will be the part of this channel. Okay,
so here's the long list of our loyal data family members who genuinely support my channel and who genuinely
want me to continue creating more and more videos who genuinely contribute to my journey. So welcome welcome welcome
all and some are some are like recurring data family members as you can see some green texts uh Sai Yadiki Lakshmi Rabi
Dhanosh Louis Sai learn code Juliana so by the way if you see your name just drop a heart in the comments and then we
have AB Mani Shuraguna Tano Priya Naji Sepasu Krishna
Sepan Kelvin Nha G1 Fred each Barda, Jamuna G, Johi, macros and C30 solutions,
Anubria. So just just just drop a heart if you see your name and welcome all. And if you also want to contribute to my
journey to this channel, if you see this channel is fruitful and it's just contributing to your success journey
just reciprocate and just contribute to this channel and join this channel so that I will just make more and more
videos in future and happy learning to all of you. So now let's get started with our Apache Spark ultimate guide
course. Let's see. So first of all, Ansh Lamba, just tell us what is Apache Spark, because we don't even know what Apache Spark is, and even if we know, we want to hear it from you. Okay, by the way, and let me just tell you again, we
will try to understand all the concepts with you can say easy to digest approach. Okay. So instead of just
telling you the definitions written on Google or written on like books I will just try to make you understand like
what are the things because you can easily search the definitions right you are here to actually understand the
things, right? Okay, let's get started. So basically, Apache Spark, Apache Spark is a kind of group. Yes. So the thing is, as we know, data is growing rapidly and we are living in an era where we call data big data, because data has actually grown rapidly and obviously it will continue
to grow. So if we want to process our data, if we want to process our data, we want obviously compute
obviously. So earlier we used to rely on just one machine. Let's say you are just doing something on your personal laptop
or your personal computer. Okay, you are let's say doing anything even let's say you are just processing some data. So
earlier let's assume you were just processing only one small CSV file and you were easily doing all that stuff but
now that one CSV file is like transition into let's say thousands of files and now you have to process all that data
again okay on a daily basis obviously so obviously your laptop or personal computer cannot handle it or even if it
can handle now let's assume your files are transitioned into millions of files Now tell
me so in these kinds of scenarios your single machine is not efficient to handle that load. That is why we need
something called as a distributed approach. A distributed computing engine instead of just one computing engine.
This is your Apache Spark. So that is why Apache Spark is a kind of big data framework that we utilize on a daily
basis to process data in the distributed manner. You would have heard a term called massive parallel processing. This
is the term in which you are processing your data parallelly using distributed approach or distributed
manner. Make sense? So this is your Apache Spark where it just relies on a group of machines instead of just one
machine and that group of machines is called a cluster. Okay, a cluster of machines and
in Apache Spark language we consider machine as a node. Okay, so that is why you would have heard like nodes cluster
of nodes. That is why node is just a machine. Okay. So we will also use the same term like nodes. Okay. And cluster
for like group make sense. So basically Apache Spark is your that particular group which handles big data processing
easily. Easily that is Apache Spark. Make sense? Now let's try to understand like why do we need Apache
Spark? Let's try to see that. So why do we need Apache Spark? Why? Obviously you will say, Ansh, you just told us that we
need to process our data in distributed manner and all exactly why didn't you ask me a question that okay if I have my
laptop I can even upgrade my laptop I can even just add more RAM more CPU more you can say cores and more
computation power in short why didn't you ask me that question why so in order To answer this, I would just like to
mention something called as there are basically two different kind of approaches to handle big data. The first
one is monolithic and baby just take notes right away because we
going to consume so much of knowledge and I know that you are a super human. You will store all the information in
your super brain but trust me you will forget everything after one week. So I will highly recommend you to take notes
because I do care about you and it's really important to take notes. Okay? Whatever I'm just writing, just try to
make everything available on your notebook, on your digital notebook, anywhere but just take notes. Trust me,
it will really help you a lot. Trust me. Okay. Okay. I know I know I know you follow my guidelines. So I'm happy. So
first approach is monolithic and the second approach is distributed. Okay, monolithic and
distributed. Basically monolithic approach is the approach that I'm talking about right now like upgrading
your existing laptop, upgrading your only one single machine. So in this one what we do, we simply
upgrade the system like adding RAM. Okay. um you can say cores to just
one machine and it is also called as vertical scaling okay where we are just adding
more and more computation power to only one machine make sense okay and what is distributed this is Apache sparks area
it will simply say hey bro why you are just putting so much of load to only one machine why you are just adding so much
of RAM but hold on you need to ask this question again what's wrong with What's wrong with this? What's wrong
with this? Let me just tell you. Can you add thousands of RAM chips to your single machine? No. So, there's a limit
to scale your system vertically. There is a limit. After that, you cannot scale it further. This is one drawback. Second
drawback. Second drawback, let me just tell you. Second drawback is availability.
Availability just add L here. Okay. Availability. So let's say you are using a laptop and for any reason that laptop
is not working. Who will process the data? Bro, no one. Because you are strongly
relying on just one machine. There's so much like low availability of the processing power of the computation
power. So very low availability, very low. Make sense? Okay, very good. On the other
hand, distributed computing solves all these issues. It simply says, hey, instead of upgrading system, add
the systems. Add the system into the network. Add more machines. Okay, you
can simply say add more machines or nodes. Okay. Then second thing is this is called as horizontal
scaling. Make sense? And then high availability. Let's say you added 10 machines. You added 10 machines. One
machine is not working. Fine. We still have nine machines. We can still process some
data. Make sense? Make sense? So it is similar to let's say you got your holiday homework.
Okay. And you just need to complete it obviously within a limited time period and you will be the only one who will be
completing that homework. But luckily you have three to four siblings and luckily you have elder siblings. So they
can help you in just completing the homework. So this is similar to that one where you where you do not need to rely
on just one machine. You can simply distribute to the work and you can just do a portion of work. That's it. That's
it. This is Apache Spark. Now let's talk about is Apache Spark the only like big data processing engine that we have or
what was the scenario before Apache Spark how we were just working with big data before Apache Spark. So let me just
tell you about that as well. So let's talk about Apache Spark versus map reduce. So answer to your
question how we were working with big data before Apache spark. So basically we had something called as
Hadoop map reduce. Okay Hadoop map reduce it is basically not one word it is a combination of two words map and
reduce. You can say mappers and reducers. So we were working with map reduce instead of working with Apache
Spark, because Apache Spark was not there. Okay, because Apache Spark was introduced around 2009 in the AMPLab at UC Berkeley. Okay. And it was actually a research project to identify the drawbacks, or you can say issues, with MapReduce. And now it's a revolution in the world of big data engineering. And do you know the same team founded Databricks as well? Really? Yeah. So they, Databricks, are the original creators of Apache Spark. So let's talk about Hadoop MapReduce: we were using Hadoop Map
reduce and why we switched to Apache Spark. So basically the thing is map reduce was also doing the same stuff.
Yes it was also doing the same same stuff. It was also distributing the data to the machines and yes it was also
following the distributed computing engine or it was also like treated as a distributed computing engine. Okay. What
happened? What happened? So basically the thing is it has like two concepts map and
reduce. Okay. So map is responsible to map your data basically to distribute your data to the machine. Okay. And
obviously those machines will be processing the data and reduce was responsible to gather the information to
collect the information from those machines. So the thing was the intermediate result of this
processing intermediate results can be like your transformations. If you're filtering any data, you are applying
joins, you are applying, you can say aggregations, group by anything, all those things are intermediate results.
So it was writing all the intermediate results to the disk. Okay. So what's wrong with that?
So basically it was writing all the intermediate results to the disk. So
every time it needs to spend time to go to the disk obviously to to read the data to fetch the information and then
again deliver that data for the processing. So this to-and-fro movement to disk was time-consuming. It was saying, bro, it is really time-consuming. You cannot keep going back and forth to the disk. What are you doing, bro? What are you doing?
So then then Apache Spark came into the picture and then it said there's no need to involve disk. We will process
everything in the memory. In the memory basically RAM we will write all the intermediate
results in the memory. Boom. Hold on. Hold on. I know Apache Spark also uses disk. You do not need to
get confused. just consume the information that I'm giving right now because there is a procedure there is a
step-by-step guide how you need to consume the information right so for now just take this information okay so it
said we just need memory, we do not need to go back to the disk again and again, and we cannot afford that because it is really time-consuming; we know that memory is much faster than the disk. That is why we can say that in some cases, not every time, in some cases it is almost 100 times faster than MapReduce. Apache Spark can be up to 100 times faster if I compare it with MapReduce, not 100 times every time, but yeah, in some cases. And that doesn't mean it is otherwise almost similar to Hadoop MapReduce; no bro, it is very fast because it is much more optimized. Optimized, yes. How? Hold on, hold on, hold on, there are so many topics that we are going to cover today, so hold on. So this was the high-level overview of Apache Spark, like why did we switch to Apache Spark from Hadoop MapReduce. We don't know about the future, like which engine will be in use, but right now Apache Spark is the GOAT of the big
data engineering and yes it is like work handling all the things so so so efficiently your batch data your
streaming data your like continuous continuously flowing data everything everything everything so it's an amazing
engine and like all the organizations are leveraging spark. So that's why like there's a high demand of spark. So this
was your answer to the question like why do we need to use apache spark over map reduce and why did we introduce apache
spark. You got it very good. Very very very very very good. I know you have a lot of things in your mind. What are
RDDs? What are data frames? What are blah blah blah? Because obviously you would have heard those things right?
Even if you're just learning Apache Spark for the first time, you would have heard these terms because these are like
very popular terms. So don't worry, we will also discuss about these terms as well. So this is all about your Apache
Spark and map reduce. And now now now let's see what do we have next in our Apache Spark ultimate guide. So these
are basically some of the features that we some of them we have already discussed and obviously some of these
will be discussed later in this video. So, as we know, Apache Spark performs in-memory computation. Lazy evaluation, you will get to know what that is; this is one of the features that makes your queries and transformations optimized when you submit your code to Apache Spark. Fault tolerance: it does that with the help of RDDs. Don't worry, you will also get to know what RDDs are. Partitioning: you will also get to know what partitioning is. So these are some of the features that you should know; you have so many capabilities with Apache Spark, and on top of it you can also write streaming and batch processing together. So the thing is, Apache Spark allows you to perform batch processing and stream processing using the same API. Can you imagine? You do not need to change your code, you do not need to change your transformations; you can use exactly the same code and you will be able to do batch processing and stream processing together. That's amazing. Amazing. And you will say, is it performant enough? Yes, it is very much performant. Okay. So, this is a short glimpse of Apache Spark's features, with a small sketch below.
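Here is a hedged sketch of that "same code for batch and streaming" idea; the transformation function is invented for the demo, and the built-in `rate` source is used so it runs without any real data. The business logic is identical for both; only the read/write side changes.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

def add_double(df):
    # The same business logic works on batch and streaming DataFrames
    return df.withColumn("double_value", F.col("value") * 2)

# Batch: a static DataFrame
batch_df = add_double(spark.range(5).withColumnRenamed("id", "value"))
batch_df.show()

# Streaming: the built-in `rate` source emits (timestamp, value) rows continuously
stream_df = add_double(spark.readStream.format("rate").option("rowsPerSecond", 1).load())
query = stream_df.writeStream.format("console").start()
query.awaitTermination(10)  # let it run briefly for the demo
query.stop()
```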
Now let's talk about the backbone of Apache Spark and the backbone of all the concepts that we will be discussing further in this video. So you have to be very focused right now, because we are going to discuss the Apache Spark architecture. Okay. So before showing you the image, before mentioning everything about Apache Spark, let's cover some basic stuff first. So what is the basic stuff? The basic stuff is: the Apache Spark
architecture you can say is developed on top of a concept called master and slaves. That is why uh it's also called
as the master-slave architecture. Why do we call Apache Spark's architecture a master-and-slaves architecture? Because
in this particular architecture, in this particular framework, there will be one master. There will be one
master and other will be the slaves. Make sense? So what will be done by the master?
Obviously this master will be just sitting and passing some orders and make sure like everyone is
working fine and everyone is indulged in the task and everyone is doing their work. So that is why we consider this
architecture as master and slaves. So as you can see there's one master and another are slaves. It is just an
analogy. Okay. But it goes really well when you just trying when you are just trying to understand Apache Spark. So it
is one of those analogies which are written in the box as well. So that is fine. Master and slave architecture it
is actually built on top of this analogy. It is actually built on top of this. There are like many more analogies
that you can relate. Let's say manager and employees. Okay. But master and slave architecture is the best one and
is very much popular and you can also encounter many scenarios based on this in your interviews as well. So this is
just a glimpse. Let's go deeper. Now we know that in this architecture there will be one master and other will be the
slaves. That is fine. Let's go deeper. Let's go deeper. So basically if we just want to go deeper we need to understand
some technical terms of Apache Spark first of all. So as you can see on your screen right now, I have written
something called as resource manager. Let's discuss this first. So basically this resource
manager this resource manager is your master. Really? Yep.
It is your master. Okay. So you can treat this as your manager. Okay. Normal manager of
your organization. Okay. And we just call it as a resource manager here because it allocates the resources for
your work. Okay. It is a resource manager. H. Okay. So that means it will be managing resources obviously.
Then we have resources in the form of driver and workers. Driver and workers. These are two types of
resources that we have. H okay. And one more thing you will be glad to know that your
cluster is this one. Driver plus workers. Driver plus workers. So
basically and I think it is very much understood but still let me just tell you this is a driver. Okay, these are
the workers. Okay, makes sense. When it gets combined, okay, so these will become your cluster cluster
of nodes. Remember remember that we just distribute our work to the nodes. So
these are all your nodes. Okay. Now let's talk about driver. First of all, what is this? What is this? What is
this? Just tell us about this. What is this driver? Basically, we know that we have a
manager. We know that. But is it possible to work directly between your manager and workers directly? The answer
is no. Why? Because we need someone called as team lead. Team lead. Okay. Team lead's
responsibility is to orchestrate the work. Let's say you are an uh you are an associate, you are a developer. Your
team lead's duty is to allocate the tasks to you. Make sense? So just let's let's try to understand from the top
level. Your manager will get the work. Okay. From maybe client, from maybe any other business. Okay. Your manager will
allocate the resources and what will be the resources? Your team lead which will be called as driver and you with your
other developers will be called as workers. I know now you understood. Okay, makes sense. So your manager
allocated the resources like you and other developers plus team lead. Now team lead's responsibility is to see the
work that we that your manager got. It will just see the work and it will distribute the work among you among you
among your friend among your peer developer all the workers. So that is why we always have one team lead to
distribute the work to these workers because these workers cannot directly get the work from the manager. No, it is
not possible. Why? Because your manager may or may not understand the technical terms. Make
sense? Make sense? Similarly here as well. So all the technical terms will be understood by driver.
Those terms will be translated into better way to you by your team lead who will also call as tech lead or team lead
same thing because that person understands the technical terms. Similarly driver understands your code
that you are submitting. So it will distribute that work among these workers. Make sense? So
these are some terms that you need to understand. I know now you understood resource manager, driver and workers.
Make sense? I know now you have understood almost like the whole architecture. Trust me. Let me just tell
you the whole flow now like how it goes the whole flow. So this is the flow and let me just tell you what happens. So
let's say this is a person who is writing a spark code. Okay, make sense?
Okay, this person completed his spark application. Application is just like the code that it needs to submit. Okay,
so this person will submit the code to the who is who is this resource manager. Very good. So
this person will submit the code to the resource manager. Very good. This is a resource
manager. Make sense? Very good. There's one more thing. When the person will be submitting the code, Spark code, this
person will submit some more stuff as well. What's that? The information that it needs to provide to the cluster
manager such as I just mentioned that this resource manager will allocate the resources right h okay like one team
lead and workers. Now just tell me one thing. If your manager is receiving some work, okay, is receiving some work from
a client. Okay, obviously that manager needs to also get the workload as well like how many workers do you need? How
uh uh you can say talented workers you need because obviously obviously do you need senior developers? Do you need
junior developers? Do you need mid-level developers? So all that information will be provided by the
client. Hey I need three developers, three senior developers. Hey I need just three junior
developers. So all these things will be told by client told by client. So this person
who is writing the code will also tell that I want one driver. Driver is your team
lead. Okay. I want one driver or of let's say 10 gigs 10 GB and I want let's say three
workers and we just call call it as like executors. Don't worry I'll just tell you about executors as well. It's almost
the same thing. It's almost the same thing in the modern one world. It's almost the same thing. And uh so this
person is saying I want three executors. three executors of 10 GB each or let's say 20 GB each. Okay. So this
information will be provided by the person who is submitting the code, and this is very popularly done through spark-submit. Okay, spark-submit. So within this spark-submit command the person will write, hey, I need one driver, I need three executors, blah blah blah. What are executors? For now you can understand executors as your worker nodes. Same stuff. For now you can just understand it that way, it's the same stuff, trust me. Okay. So, one driver plus the executors: this request will go to the resource manager. What is in the request? Your code and all the resources that this person needs. Make sense? Okay, a small sketch of that request is below.
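Here is a hedged sketch of what that request can look like; the script name and the exact sizes are illustrative, and the Python part shows roughly the same settings expressed as configs rather than an exact command.

```python
# Typical spark-submit shape for the example above (illustrative names and numbers):
#
#   spark-submit \
#     --master yarn \
#     --driver-memory 10g \
#     --num-executors 3 \
#     --executor-memory 20g \
#     --executor-cores 4 \
#     my_job.py
#
# Roughly the same request expressed as configuration when building a session.
# Note: driver memory normally has to be set at launch time (spark-submit or the
# cluster config), because the driver JVM is already running when this code executes.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("resource-request-demo")
    .config("spark.executor.instances", "3")
    .config("spark.executor.memory", "20g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
print(spark.sparkContext.getConf().get("spark.executor.memory"))
```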
So what will this resource manager do? First of all, it will create one driver node. Why? Because it needs a team lead as the first thing. So it will simply create the driver node.
Okay. Okay. Like driver node. This is our driver node. Then then this driver node will be
connected to this resource manager. Okay. Okay. Let's connect this. So this will be connected to
your resource manager. Okay. So now these are these two are connected. Sorted sorted. Sorted. Okay. Now this
driver node will take the sheet from him from your resource manager and it will read the request. Hm. You need three
executors. H you need this this this. So it will read all the instructions and it will send that instruction back to the
resource manager and it will say hey you need to create or you need to provide me three executors. So this resource
manager will say okay I trust you. So it will simply create three executors on these machines 1 2 and
three. 1, two, and three. Make sense? Now, now what will happen? Now, driver node is there. Worker nodes
are there like worker one, worker two, worker
three. Okay, three workers are there. Driver node is there. We have all the resources. Resource manager is free. So
it will simply say now just take care of this these workers on yourself. So what will happen? What
will happen? This driver node will be connected to these workers now
and all because this driver node has the sheet, right? We know that this just took the sheet from him from the
resource manager. So it has this sheet here. So what it will do? It will simply read the sheet. H you have this code.
Okay. You need to apply filter. You need to apply join. You need to apply group by. Hm. Okay. Then it will simply
distribute all that work to these worker nodes and these worker nodes will be actually doing the
work. Actual work will be done by the worker nodes or you can say executors. For now we are just using these words
interchangeably. It's fine. Okay. So these work are done by worker nodes. What driver node is
doing? Orchestrating. Orchestrating. Making sure like all the
executors are working fine. All the executors are doing their work. Everything. Make
sense? Makes sense. So that's what your team lead does. Team lead actually does not write
your code. Team lead just a team lead is there just for the monitoring. It's just for the overhead. It's just for like um
whether you need any help, whether you need something, I'm there. So similarly, driver node is there for the
orchestration is for the distribution of the work. But actual work is done by the developers that are you worker one,
worker two, worker three. That's it. This is the spark architecture. This is the spark
architecture. Now you will say who is master here? So this is a controversial thing that I'm going to discuss. So
basically master slave architecture keeps on changing. What obviously just think
wisely when we had resource manager in the picture this was the master and all these machines were slaves. Make sense?
Very good. Now when we have a driver node and other workers and when these are connected who is master driver node
is a master because this is giving the instructions. Makes sense common sense bro and other are slaves. So that is why
master slave architecture also gets changed with the state of the cluster right now. Make sense? Makes
sense. That is all about your architecture. See, it was so simple. You just need right analogies. You just need
right examples and you just need someone who can explain things like this and that's all. That's all about life,
right? It's very simple. Very simple. Okay. So, this is the diagram that we have just made. You can just take the
screenshot and you can just save it in your notes. It will help you to understand. And this is the architecture
that is available in the official documentation, the official documentation of Apache Spark. Okay. And I can just show
you the documentation right now after explaining this image. Don't worry. Uh so I know that you can see lot of arrows
but trust me just give only 2 minutes on this image you will understand everything. Why? Because now you already
understood the concept. Let me just help you. So first of all what will happen? our code
our code. So this is the person who submitted the code. So this person will
submit the code to the cluster manager. Step one. Okay. What is step two? Step two is
it will simply create the driver program. Make sense? Make sense? Two. So now these two
are connected. See these two are connected. Just forget about other things. Just follow the things that I'm
just highlighting. So these two are connected. So this is two-way. Why? Because you remember it will simply go
back to cluster manager. Hey create two more executors. Three. Like in our example it was three. Just imagine there
were like two. Okay. It will simply say okay sir I know you are my team lead but still I'm just following your orders
because I think you are smarter than me. You are more technical smarter than me. So it will simply create two worker
nodes. one and two. So these are like step number
three. Okay, step number three. Now let's clear the confusion. Worker and executors. Basically worker node is just
a machine. Worker node is just a machine. On that machine we create something called as
executors. Executors as the name suggest executors which execute the work. We can host or we can install multiple
executors on one machine. But nowadays you can say in all the modern applications we just host one executor
in one one machine that is one worker node. So one worker node is actually holding one executor. So it makes sense
to say one executor or one machine, it's fine. Both can be used
interchangeably. Do not get confused. These are very small small things. You do not need to be confused in these
terms. Okay, executor worker node same thing. Okay, and as you can see here as well, we have like one executor within
one worker node. So it is fine. Makes sense? So now this is step number three. Now these worker nodes are directly
connected to this one, to the driver program. See, two-way communication. Two-way communication. So this is step number four. This is step number four. So whatever task this driver program will
provide they will just do it and they will return back the results. Wow. This is your architecture. This is your
architecture. See it was so simple. So simple. H it was simple. Yeah.
Okay. Makes sense. Makes sense. Understood. Understood. Now take notes and write everything that we have
discussed because when you will be writing you will think when you will think you will store that memory in your
long-term memory area. Okay. So this is your spark architecture and I hope now you know
this concept. This is the backbone of all the concepts trust me. Okay. Very good. I'm happy that you understood this
concept. Now let's see what do we have next. So this is the official documentation page of Apache Spark and
you can just go here. This is the link and this is the image that we used to describe because we should always use
you can say graphics which are there in the official documentation. So so that you can actually understand whenever you
will be just going there and you can actually read the documentation as well. And let's talk about some more technical
stuff from this documentation. Why? Because it is really important. We know that we just saw something called as
a resource manager, or cluster manager, same thing; that was the first thing we saw. So what are the popular cluster managers that we have in the market? We have some very popular cluster managers. Okay. And what are those? It can be Spark's own standalone cluster manager, or it can use Apache Mesos, or YARN; YARN is the most popular cluster manager. Okay. And it can also use Kubernetes. So these are the popular cluster managers that we have in the market. Make sense? Very good. This is the cluster manager, and here is a quick sketch of how you actually pick one.
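A hedged sketch of how the cluster manager is chosen in practice: it comes from the master URL you pass to spark-submit or to the session builder. The Kubernetes API server address below is a placeholder.

```python
from pyspark.sql import SparkSession

# Common --master / .master() values (one at a time, of course):
#   "local[*]"                    run everything in one JVM, using all local cores
#   "spark://host:7077"           Spark's own standalone cluster manager
#   "yarn"                        Hadoop YARN (very widely used)
#   "k8s://https://<api-server>"  Kubernetes (placeholder address)
#   "mesos://host:5050"           Apache Mesos (deprecated in newer Spark versions)

spark = SparkSession.builder.master("local[*]").appName("cluster-manager-demo").getOrCreate()
print(spark.sparkContext.master)
```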
Okay, we need to discuss one more thing. What's that? Basically, the Spark context. Okay, what is the Spark context? If you closely observe, it sits inside the driver program, and okay, we will also discuss these other pieces as well. But Ansh, first tell us about the Spark context, what is that? Basically, the Spark context is now replaced by the Spark session. Okay. So if you see something
called a Spark session: basically, the Spark session has the Spark context embedded within it. Let's say this is your Spark context. So we had a total of three contexts: the Spark context, the Hive context, and one more, the SQL context. Okay. So earlier we used to define different contexts; now we have condensed all the contexts and created something called a Spark session, in which we have all of them, the Spark context, the Hive context, and the SQL context. Ansh, you have said "context" so many times, what is the Spark context? Okay, okay. The Spark context, or the Spark session, because now it's the Spark session, is actually the starting point of Spark. Okay, we connect the driver node with the cluster manager, see, these two things, with the Spark context. So whatever we are submitting, okay, whatever code that person submitted, it went to the cluster manager, then the cluster manager created the driver program. Mhm. Then the driver program had the Spark context or Spark session, because we defined that in our code, and that session, you can say that connection, that starting point, was actually used to set up a connection between your driver program and the cluster manager. And it should be written somewhere here as well. So yeah: a Spark application runs as an independent set of processes on a cluster, coordinated by the SparkContext in your main program, called the driver program. So a Spark application gets coordinated with the help of the Spark context. It is basically the entry point, or the connection, between your driver program and the cluster manager or resource manager.
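Here is a minimal sketch of that entry point. In a Databricks notebook a ready-made `spark` session already exists, so you normally do not build one yourself; outside a notebook you would create it roughly like this.

```python
from pyspark.sql import SparkSession

# The SparkSession is the single entry point since Spark 2.x
spark = (
    SparkSession.builder
    .appName("spark-session-demo")
    .getOrCreate()
)

# The older SparkContext still exists, wrapped inside the session
sc = spark.sparkContext
print(sc.appName, spark.version)

# SQL / Hive style access also goes through the same session now
spark.sql("SELECT 1 AS answer").show()
```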
Makes sense. Makes sense. Makes sense. So you can actually read like so many things and here are like more
information about the cluster managers: standalone, Apache Mesos, Hadoop YARN, Kubernetes, okay, submitting applications and all of that; everything is written here, and don't worry, we will be learning all this stuff step by step. Okay. So I hope now
you have understood all the things related to spark architecture including spark context as well. Okay. So let's
see what do we have next. So let's talk about the driver node in detail. So as you know that when we have like so many
machines remember like we have like so many computers here listed and we decided that one machine will be treated
as the driver node. So what actually happens when we pick one machine and we say hey you are a driver node. Hey you
will be treated as or as a driver. what actually happens behind the scenes. So basically whenever we say that one
node or one machine will be treated as a driver node. So actually resource manager or cluster
manager creates something called as application master container or you can say creates a
container which is also called as application master. Simple. Let's say
bro remember we had like machines here and we decided that we will be just um picking this machine as a driver node.
So what resource manager or cluster manager will do within this machine? It will
install this thing application master container. It will simply install this AMC
application master container within this machine. Okay, make sense? Very good. So what happens when it creates this or
install this application master container? This is a container which actually is responsible for all the
orchestration, all the driver thing, all the driver program activities. What do we have within this? We have something
called as pispark main h. Okay. And then we have something called as JVM main. Okay. Then we have something called as
process. What is this? Hold on. So as you know that most of you will be writing your code in
pispark. Yes. But spark is written in Scala. Okay. Yes. So it is written in Scala. So that is why we add something
called as pispark main within your application master container. So this pispark main is
optional. Understood common sense you know the reason still let me just tell you if you
are not writing your code in pispark do you need this obviously no if you're writing your code in
Scala JVM is fine like JVM stands for Java virtual machine which is responsible to you can say interpret or
compile your code that you have written in Scala JVM can handle that pispark main will only be there if you are
writing your code in pispark make sense Okay. And let me just also tell you the flow. So basically you have
your spark core. Okay. This spark core has a wrapper for Scala or
Java Java wrapper. And on top of this we have a Python wrapper. Now you will say why why did we create the Python
wrapper? Why didn't we use Java? Because bro, there are so many Python developers. Okay. And in big data or you
can say in data domain, Python is the most widely used programming language. So in order to attract Python developers
to this engine, we added not me, the team added Python wrapper on top of this Java wrapper. Make sense? Okay. And that
is why we have something called PySpark main. So whatever PySpark code is there in your Spark application will get converted into the JVM main by a process called Py4J. Make sense? So you do not need to worry about how your code will be converted to Java or Scala; no, Spark will do it. Okay. And one more potential question: this PySpark main is called the PySpark driver. Okay. And this JVM main is called the application driver. Make sense? These are technical
terms that you need to just write down in your notes. Okay. Good. Understood? This is your driver node in detail. We
have just engineered that driver node. Okay. Let's try to understand worker node. Worker node is
very simple. Worker node is very simple. Worker node just has a JVM machine. That's it. Because obviously these are
executors. Okay. These just need to you can say execute the work. That's it. Okay. So basically you just install a
Java virtual machine on each worker node in order to just get your work done. Make sense? Okay. An Lamba why you
have left the space here? Why? Okay, let me just show you the magic.
Boom. What is this? So, basically I know that executors do not actually need anything other than JVM because all your
code will be converted into JVM by driver node and it will go to the executors. Yeah, makes sense. Why do we
need Python? Then this is a special case. this will not be there all the time. Okay. What is a special case? So
whenever you create a UDF. What is a UDF? UDF stands for user-defined function. Let's say you are not using PySpark functions but your own function written using def, let's say my_function, and you created something like this and you use it in your transformation. This is called a UDF. Okay? You use it for any reason. Okay? And you are saying, I want to use this. Maybe it can be the need of the hour in some cases when the PySpark functions are not performant enough. By the way, PySpark has a long list of functions that you can pick from. But still, sometimes you need to do something that cannot be achieved by the PySpark functions; you have to define your own function. I can understand that, no worries. So in those scenarios your executor, or your worker node, will install a Python interpreter, or Python worker, as well, not on top of the JVM but along with the JVM. Make sense? So that it can actually interpret the function that you have created.
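A hedged sketch of such a UDF; the column name and function are invented. The second version shows the same logic with built-in functions, which stay inside the JVM and avoid the extra Python worker.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# User-defined function: runs in a Python worker process on each executor
@F.udf(returnType=StringType())
def shout(name):
    return name.upper() + "!"

df.withColumn("greeting", shout("name")).show()

# The built-in equivalent stays in the JVM and is usually faster
df.withColumn("greeting", F.concat(F.upper("name"), F.lit("!"))).show()
```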
Because obviously you need an interpreter for the Python code to run at runtime, to execute the code. That is why it is always recommended not to write user-defined functions, and obviously you should avoid them if possible, because they add a kind of load on top of your executors: they need to run Python, it takes some space, and it's not ideal. So that is why you should try to avoid user-defined functions as much as you can. Okay. Okay. Don't
worry. So I know this was like a lot of knowledge for you but digest it, re-watch it, revisit it, digest it, take
notes, you will digest it. Make sense? Make sense? Make sense? Very good. By the way, just a you you can say out of
the box topic. Um, can you see the JVM here? Yeah. Now, these JVMs are slowly, slowly being replaced by C++-based executors
because those are very very very fast and they can just process your code natively instead of processing your code
in JVM. They can simply use the you can say resources like native resources. So they actually do not need to use JVM.
They can totally eliminate. For now it is not totally replaced. Um it is like slowly being replaced by C++ based
executor but that knowledge is like out of the box. Okay. And yes that's a great you can say initiative and I can see
those executors in Fabric. I can even see those executors in Databricks, called Photon. So I can see those C++
based executors but that is fine. that is just like out of the box or you can say the latest information that you
should have. Don't worry, don't worry. Don't worry, don't worry because fundamentals will remain the same. Don't
worry. Don't worry at all. Okay. Very good. So, this was all about your driver node and worker node in detail in
detail. Now, it's time to see some code implementation because obviously I know you are really excited to see um how we
can create the Spark session and all those things, right? So, now it's time to actually create our free Databricks account. Yes, because we will be looking at the code implementation as well, like how we can do that. So, we'll be leveraging the free Databricks Community Edition account, and we will actually write our code and we will see
all those stuff that I'm just trying to explain you because when you actually do it, when you actually see it with your
eyes, you understand better. Okay, let's try to create a free Databricks account. And it's very simple, it's just a few steps. But you need to take care of all the things that I'm showing here, because otherwise you will say, hey, I'm not able to create my Databricks account. So in order to create your free Databricks account, the steps are very simple. Simply go to Google and search "Databricks Community Edition". Okay, Databricks Community Edition. Then you can simply click on the Databricks Community link. Okay, makes sense. So by the way, the thing is, they have
added a new kind of Databricks account which is called, I forgot the name. So in that Databricks account it is automatically managed and you can simply use the cloud-versioned UI, but I would say just pick the Community Edition to learn all the things from scratch. Okay. So this is the link and let's try to search for that. What is it saying, bro? Remind me later. I also need to check whether it is the right one or not. I just simply clicked on this one. Uh, it is saying, would you like to log in? No, I don't want to log in. Let me just see. I think this is not the right link. Click on Databricks Community Edition login. Yeah, this is the one. So I will also try to provide the link, or you can simply copy it from here, community.cloud.databricks.com/login, and you will see something like this. Okay. Or you can simply search the URL like "Databricks Community Edition login". You simply need to click on this. Now obviously you would not have a Databricks account. Simply click on sign up. The moment you
will click on sign up it will simply say sign up for community edition. Okay. Now, simply put your Gmail account or
whatever account you want to just pick. Okay. And then simply say continue with email. The moment you just set up your
account, it will just ask for confirmation and all. Okay. Then come back to this page again because now you
have the account. Now you can simply use that email ID. Make sense? Make sense? Very good. Very good. Very good. Very
good. So this is our Databricks account. By the way, if you are not familiar with Databricks, Databricks is one of the most popular technologies right now in the world of big data. Okay. And as I just mentioned, the Databricks team are the founders of Apache Spark. So you know how well they can
just tune and you can say play with spark to just develop new and new things and that's why they are just leading the
industry right now. And I love Databricks, and I would say if you are totally new to Databricks: Databricks helps you to work with Apache Spark. It takes care of so much overhead; you do not need to worry about so many things, and don't worry, I will discuss all of these things. So it actually helps you to just start with Spark. For example, for now I had two options: set up Spark locally, or simply pick Databricks. It takes a lot of time and effort to set up your Spark cluster locally, and it's your responsibility to manage the cluster, okay, to start it, to stop it, to build the sessions and all. If you are learning, you need to save all that time. Databricks will say, you do not need to do anything, I will take care of all those things: set up your Spark cluster, set up your driver, set up your worker nodes. You simply need to come here and start learning. That is why they have created this Databricks Community Edition, just for the community to learn and grow. Just act wisely. Do not waste your time setting up your clusters locally and doing all that stuff that you will not be doing in the company, because in companies as well you will just be using Databricks or similar products to manage the clusters. That time is gone when you would be managing the clusters yourself. No, now you do not need to do that. Make sense? So just start with Databricks. So this is the Databricks workspace, or you can say the Databricks homepage, and within this we have all these tabs. You do not need to worry about that, because we are not learning Databricks. We are only focusing on Spark; we are using Databricks to run our Spark code. That's it. And one thing
that I would like to mention simply go to catalog. This is basically the area where we create folders, tables and all
the things. Okay. So in your case maybe you will not see this area, DBFS, which is the Databricks File System, or you can say the Databricks distributed file system. Okay. So in order to enable this, you simply need to click on this
button. Okay. And then you will see something called as settings. So just click on the settings button. This is
the settings button. Okay. Now just go to the developer. Maybe I also need to just check because I turned it on way
back. Just scroll down or it should be in advanced maybe. Uh yeah, I think yeah. So in
Advanced, you will just scroll down and you will see the DBFS file browser toggle. So you simply need to turn on this toggle so that you can also see that DBFS button, because we'll be uploading files, and I also want to show you a lot of stuff that you can use to learn a lot of things (a small sketch of browsing DBFS from a notebook is below). Okay. So simply turn on this button and you will actually see that DBFS button. So now let's get started with our first code. Okay. And in order to do that, I will simply go to the workspace and I will simply create a workspace. Okay. Make sense? So this is the workspace. Okay.
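Once DBFS is enabled, you can also browse it from any notebook cell using `dbutils`, which Databricks provides inside notebooks. A hedged sketch (the file name is hypothetical, and /FileStore is the usual upload location in the Community Edition):

```python
# Available automatically inside Databricks notebooks (not in plain local PySpark):
# list the DBFS root and the usual upload folder
display(dbutils.fs.ls("/"))
display(dbutils.fs.ls("/FileStore"))

# Read an uploaded CSV back with Spark (hypothetical file name)
df = spark.read.option("header", True).csv("dbfs:/FileStore/my_file.csv")
df.show()
```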
And now I will simply click on this create button and I will simply click on folder and I will simply provide the
folder name as let's say um spark or let's say Apache Spark. Okay, click on
create. So this is the folder that we have created for Apache Spark. Within this I will create um let's say oh not
this one. Just click on workspace. Yeah. So within the workspace just oh we created the workspace in the home tab.
No no no you do not need to do that. Simply go to workspace and then click on workspace for for one more time so that
you can actually create the workspace within the workspace folders. Okay. Simply click on workspace and click
on folder. So simply create a workspace. Wait wait wait wait. Create folder and Apache Spark. Apache
Spark guide just to distinguish it. Apache Spark guide create. So now you will see that your Apache Spark guide
folder is created under workspaces. That is the thing that you need to do. Okay. So this is your folder and obviously it
is empty. Obviously it is empty. So now let's create our first notebook. What is a notebook? Basically whenever you want
What is a notebook? If you were running code locally you'd use an IDE such as PyCharm or VS Code; here we use notebooks. A notebook gives developers a better user interface and experience because you can run just the piece of code you care about instead of re-running everything. Notebooks are hugely popular in the data domain — if you're a data analyst you'll know Jupyter notebooks, and this is the same idea: your whole script is divided into cells so you can run a specific cell, a specific part of the code, whenever you want. That's the best thing about notebooks. So let's create one — this is your Spark notebook, or you can say Databricks notebook, and this is its title. Let's rename it to "Spark Session", because the Spark session is what we'll talk about first. Now, this notebook needs to be attached to a cluster.
Why? Same story as before: say I've written my code and I want to submit it. Where does it go? We know it goes to the resource manager (the cluster manager), which creates the driver node; the driver then talks to the resource manager and says, "hey, give me two more worker nodes" — sorry, not clusters, two executors — and it gets its two executors. The great thing about Databricks is that we don't have to keep asking "hey, create one driver, hey, create two executors" every time. We can simply create a cluster once and reuse it. Yes — that's one of the powers of Databricks. How do we do that? Go to Compute; this is where you create compute, and compute here basically means a cluster. Click All-purpose compute, then Create compute. This is the configuration we get for free — we don't pay for any machine, because this is the Databricks Community Edition. If you click here you'll see 12.2 LTS; that's the runtime version of the cluster. The latest is, I think, around 17, but 12.2 is perfectly fine while we're learning. So this is the Spark cluster we'll be creating.
Click Create compute. What happens? It spins up the cluster — you can see the icon rotating while it's created. What does this cluster hold? In real life, when you're paying for Databricks, you can define the driver size, the number of workers and the worker size — effectively the same request you'd submit to the resource manager, with the advantage that the cluster configuration is saved and you can reuse it. In the free account you can't say "hey, give me 100 machines and 100 GB of RAM"; it just creates a basic machine so you can learn. And who is the resource manager here? Databricks manages it automatically. So it simply creates a cluster with one driver and one worker, and that cluster will process your data. You don't have to worry about driver memory or worker memory here — although in the real world we do define those things, and we'll discuss drivers and workers in a lot of detail later in the video. This is the managed layer Databricks gives us. Pretty cool — I know you're going to love Databricks too. Now I can go to the Recents tab, where I can see the notebook I just opened, click on it, and click the Connect button. You'll see your cluster there — it's still turning on, but you can select it so the notebook gets attached, because whatever code you write needs some compute, right? It needs a cluster. While the cluster is starting up, let me tell you about the Spark session.
We talked about the Spark session, right? Well, do you know the other great thing Databricks does for us? We do not need to create a Spark session. Yes, really — Databricks creates the Spark session for us. Normally, whenever we create a Spark session we write something like this: we create a variable called spark and say spark = SparkSession.builder.appName(...), give it an app name such as "tutorial", and chain the rest. In the industry, spark is by far the most popular variable name for the session — it's practically a standard. Databricks saw that as an opportunity to automate session creation for its customers, so we don't need to create one ourselves; we can use spark directly. The moment the cluster is up, you can literally write print(spark) and it will show the information for that Spark session. I'll show you what our session looks like and what information it contains. It takes around two to four minutes to start the cluster, because it has to create the machines and nodes, the executors, the driver node, and set up the application master container — and you know about all of that by now, you're a pro. Oh, it's on. So I'll run this cell. How do you run a cell? Just hit Shift+Enter — that runs the cell, and that's what you'll use every time. It gives me the Spark details: this is the Spark session that's already there for me. You can even type just spark, without print, and it returns the same result, because in a notebook cell the print command isn't required. So I got the details: my SparkSession is this object, and it exposes the Hive and Spark contexts — and the third one, the SQL context — so it effectively merges SparkContext, SQLContext and HiveContext. The version is 3.3.2, and the master is local[8]. What's the master? That's where the driver runs — here it's local, meaning on the local machine (local[8] is local mode using 8 cores). And the app name is "Databricks Shell". So Databricks has automatically written all the code I just showed you — the whole SparkSession setup — for you. You genuinely don't need to worry about it. That's the second thing I really like: you don't have to think about how to submit things or create the Spark session again and again. This auto-created session is equivalent to code you could write yourself.
Let me write it out for you — and I hope you're taking notes. It's equivalent to spark = SparkSession..., and I think we also need to import SparkSession first, if I'm not wrong. So: from pyspark.sql import SparkSession — that's the library. Once it's imported, we use SparkSession.builder.appName(), and for the app name I'll put "Databricks Shell", because that's the name Databricks itself uses — I could give it any name, but I'm showing you the code Databricks writes for us behind the scenes. Then a backslash to continue on the next line, then .master("local[8]") — local, with 8 — and finally .getOrCreate(). If you wanted to add configs, say a 12 GB driver and so on, you'd chain those in before getOrCreate(). Let me run this — it's fine — and if I then evaluate spark, I see the same information as before. Now, what happens if I rename the app — say "Databricks Shell Fun" — and run it again? You won't see any change. Why? Because the application is managed by Databricks, so it won't spin up a new application for us. If you were running this code locally you could give any app name you like and add as many configurations as you want, but here all of that is managed by Databricks and we don't need to worry about it at all. I'll delete this code — it was just a reference you can keep in your notes for how a Spark session is created. The session is already there for us, so we don't have to worry about the spark variable: whatever we want to do with the session, we do it through spark. That's it.
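For your notes, here's a small sketch of that equivalent code — roughly what gets typed (and then deleted) in the demo. In Databricks you never need to run it, since the spark variable already exists; locally this is how you would build the session yourself.

```python
from pyspark.sql import SparkSession

# Roughly the session Databricks builds for you automatically; the app name and
# master value simply mirror what the auto-created session reports.
spark = (
    SparkSession.builder
    .appName("Databricks Shell")   # any name works locally; Databricks ignores changes here
    .master("local[8]")            # local mode with 8 cores, as shown in the session info
    .getOrCreate()                 # .config(...) calls could be chained before this
)

print(spark)   # in a notebook cell, a bare `spark` shows the same session details
```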
Makes sense? Good. And what about the Spark UI? Don't worry — we'll talk about the Spark UI in an upcoming chapter, where we'll go deep into reading DAGs and more. And don't worry if this beginner section felt very Databricks-specific; we're here to learn Spark, so if you don't follow every Databricks detail, that's fine — just focus on Spark. Now let's look at the next important concept: lazy evaluation and actions.
Spark is lazily evaluated, and you're about to see why and what the benefits are. Let's take an example. Say you're writing code in a notebook: you read a file and create a DataFrame on top of it, then you perform some transformations — you filter some records, add some columns, and aggregate the records. In total you've applied three transformations: filter, add columns, aggregate. You run the cell with Shift+Enter — and nothing happens. What? Yes: you read the DataFrame, applied the filter, added columns, aggregated, you literally ran the cell, and nothing happens. Why? Because of the concept called lazy evaluation. The idea is simple, so don't overcomplicate it. Whatever transformations you apply — the filter, the added columns, the aggregations — Spark stores them in a plan. All the transformations, all that information, goes into the plan. Apply more transformations — say a fourth one — and it keeps adding them to the plan. It only executes that plan when you hit an action. An action is like a switch: flip it, and only then do your transformations actually run. And why does Spark do this? Because it doesn't want to execute your code immediately — your code isn't optimized yet. We'll see later how the driver node optimizes it, but for now just accept that the code you wrote isn't the final form. You'll say, "I've written the most optimized code on this planet." No, you haven't. Spark understands better than you how that code should be handed to the executors. So before giving your code to the executors, it collects all the transformations in the plan and keeps saying "add more, add more". Then, once you're done and you hit the action, it rearranges the transformations — maybe transformation number four should actually run earlier — purely to optimize performance, because Spark knows better. It may even merge transformations: instead of applying two separate transformations it might apply one, so instead of four transformations it ends up running three, or maybe two. It knows what to do, how to do it, and the most optimized way to do it. So it saves the plan, and the moment you hit the action it executes everything using the optimized plan rather than yours. Now you'll ask: where is this switch? Where is this "action" button?
We have several actions available in PySpark, such as .show() — when you write df.show(), it triggers the action. .count() is another one, display() is another, and then there's the very popular .collect() method — all of these are actions. When you call one of these on your DataFrame, Spark executes your plan. That's the whole concept of lazy evaluation and actions. And the moment you hit an action, Spark runs everything and creates something called a Spark job, which holds all the transformations. When is a job created? Only and only when you hit an action — there's no job created for transformations alone, and I'll show you that. Let's actually go to the notebook and see what happens behind the scenes when we write the code.
So I'm back in my Databricks notebook, and this is the code I've written — don't worry, it's very basic. This is the code to define a DataFrame: here's our demo data, here's the schema for it, and here's the call that creates the DataFrame. Don't focus on the code right now; we're trying to understand the concepts first. I'll run it with Shift+Enter, as you know. How do we create a DataFrame? We just use the Spark session: spark.createDataFrame(...) — that's the syntax; I don't know how Spark could have made it any simpler. And... "StructType is not defined". Good — I wanted you to see this. Whenever you create a DataFrame, or work with PySpark at all, there are many PySpark modules you need to import. What's the best way to handle that? Rather than importing one function, using it, then importing the next one each time, just import everything: from pyspark.sql.functions import * and from pyspark.sql.types import *. Run those imports, then run the DataFrame cell again, and this time it creates your DataFrame.
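The exact demo rows aren't shown in full here, so treat the data and column names in this sketch as illustrative stand-ins — the shape just matches what's described (three records, a city column, ages 25/30/35):

```python
from pyspark.sql.functions import *   # import everything, as done in the demo
from pyspark.sql.types import *       # brings in StructType, StructField, etc.

# Illustrative stand-in for the demo data.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age",  IntegerType(), True),
    StructField("city", StringType(), True),
])

data = [("Alice", 25, "New York"),
        ("Bob",   30, "Toronto"),
        ("Carol", 35, "New Delhi")]

df = spark.createDataFrame(data, schema)   # creates the DataFrame; no job is triggered yet
```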
The cell completes — green check, everything's fine — so the DataFrame is created, right? Click the little dropdown on the cell and you'll see that nothing happened: no job was triggered, because we didn't hit any action. Now let's apply a transformation — say I want to filter the records where city equals "New York". Let's do it, and you'll see the magic. What's the code? Again, don't focus too hard on the code right now, because we're getting the concepts first; it's very simple: df = df.filter(...). I'm writing the code from scratch, so even if you don't know PySpark you can follow what I'm typing — this is very basic PySpark. If you haven't watched my PySpark full course video, you can watch it after this one; it's not a prerequisite, because this video is strongly focused on the concepts behind the code, and once you know the concepts the coding comes easily. So, the filter: df.filter(), and inside we refer to a column with col("city"), because the column name is city, and compare it: col("city") == "New York". Copy that, paste it here, and run it. Perfect. Now — Lamba, you said that if we execute our code, nothing will happen. Did anything happen? No. Click the dropdown: nothing. Right now you might think you've successfully filtered the data, but you haven't. Why? Lazy evaluation: this is a transformation, so it's just stored in the plan, not actually executed. It will only run when I hit the action.
So here's the action. I'll add a little comment — # action — and then simply write display(df). The moment I run this cell, you'll see a job created for this execution. And there it is: "Spark Jobs" — it's creating jobs for me. This is your DataFrame output, of course, because the display command shows the result, but notice that it also created jobs. Click the dropdown and you'll see Job 0, 1, 2 — three jobs in total. Why three? Don't worry about that yet; we'll have a dedicated chapter on jobs. For now, just understand that a job is created only and only when you hit an action. No action, no job. That's the concept you need right now — and you can already see how powerful it is.
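Putting that demo together, here's a minimal sketch of the transformation-then-action pattern (display() is a Databricks notebook helper; df.show() is the portable equivalent):

```python
from pyspark.sql.functions import col

# Transformation: only recorded in the plan, so no Spark job appears yet.
df = df.filter(col("city") == "New York")

# Action: now the optimized plan runs and job(s) show up in the cell output.
display(df)      # Databricks-specific helper
# df.show()      # works in any Spark environment
```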
And which transformation did we apply? A filter. Now, what if I tell you that Spark has two different kinds of transformations? Really — two kinds. I'll cover them properly in the very next chapter: narrow and wide transformations and everything around them. But first I want to show you something special. I told you Spark builds a plan, right? So let's apply one more transformation — not just the filter — because now I'm going to show you how to read those plans, the query plans, and how to see the DAGs too. Yes, we'll look at the DAGs as well; it's really important. So: query plans and DAGs. What I'll do is recreate the DataFrame — or rather create a new one called df_new so it's easier to follow. This is my new DataFrame. Let's delete the earlier cells, because we're doing something new: I want you to see how Spark builds a plan and how that shows up as a DAG. We'll go through DAGs in detail in the narrow and wide transformations chapter, but you should have some idea now — this is just groundwork.
Let's apply the same filter: df_new = df_new.filter(col("city") == "New York"). Run it, and you know nothing will happen. Now one more transformation: df_new = df_new.select("city") — I don't want all the columns, just the city column; that's my choice, my transformation. So how many transformations do we have? Two. And no job has been created, because no action was triggered. Now I'll hit the action: display(df_new). You'll see a job created, and this is my desired output — only the New York records and only the city column. Now let me show you something really special: why lazy evaluation matters and why we should rely on it. If this code ran eagerly, how many times would transformations be applied? Two — one and two. Makes sense. Now let's see what Spark actually did.
There's a command for that called explain. By the way, this cell is a markdown cell, which lets me add headings and titles — and don't worry, I'll share this notebook with you so you can refer to it while learning and take notes. I'll title this part "Query Plan". This is just groundwork; we'll look at query plans again in the transformations chapter. To see the plan I simply use df_new.explain(). Run it, and you'll see exactly what Spark executed for your code. How do you read it? Always bottom to top. At the bottom, it scanned the data — all of it. Then it applied your transformations, and you can see the filter there. But why does it also say "isnotnull(city)"? We never asked for a not-null check. It's there because of optimization: Spark adds these small things for you for better performance. And there's something more important: it performed the filter and, within the same step, the select as well, combined with an AND — it did not apply the transformations in two separate passes. If your code ran eagerly, with no lazy evaluation, it would have executed two transformations; with lazy evaluation, Spark combined them, because they can be done in a single transformation. So Spark knows better than you: you just write the code, and Spark optimizes it for you. Then at the top there's Project, which is essentially the final column projection — the columns that end up in the output. That's the query plan Spark optimized for you.
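Here's the same check as code. The commented plan is only a rough idea of what to expect — exact operator names and numbering vary by Spark version:

```python
df_new.explain()

# Expect something along these lines (read bottom to top):
#   == Physical Plan ==
#   *(1) Project [city]
#   +- *(1) Filter (isnotnull(city) AND (city = New York))
#      +- *(1) Scan ExistingRDD[...]
# i.e. the filter and the column pruning happen in a single pass over the scan.
```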
And if you want to see the DAG for this, how do you do it? Go to the job here and click View — this opens the Spark UI. Click the middle button to expand the screen. Now go to Jobs, and you'll see all the jobs that have run for you — this is an amazing feature Spark gives us to monitor our jobs, the performance, everything. For now, click the latest one, job number 5, and let's see what it did. This is the DAG visualization for job id 5, and notice I'm on the Jobs tab — keep an eye on which tab you're in. If you want to see the query plan we just read, but in DAG form, go to the SQL / DataFrame tab, and there it is: click the "display df_new" entry, since that's the job we just ran. This is exactly what we saw as the query plan. I always prefer looking at it as a DAG because it's easier to read and it looks good. Start from the bottom, or just follow the arrows — same thing. The first step is Scan ExistingRDD; click the plus and you'll see "rows output: 3", so it read three rows. Then it filtered, and the rows output was 1, because only one record matched. It performed the filter and the select together, and you can read that in the plan itself — click it and it's written right there. At the higher level, hover over the node and you'll see both the filter and the select applied in that one step. And in the last step it simply displays the output. That's the DAG.
DAG stands for directed acyclic graph. Whenever you perform transformations and process your data, Spark creates a DAG for you. The DAG is the flow — the whole flow of the job — and every job has its own DAG. Take this job: it knows its flow. It needs to read the data, filter it, select the city column, and then display the result. "Directed acyclic" means it can never loop back in a circle; it has to move in one direction — first this step, then this, then this — and it can never go back to the entry point. So this DAG knows: read, filter, select, display. Don't overcomplicate it: a DAG is basically Spark keeping track of how your job gets to its last step, and it contains all of your transformations. That's all it is — simple, right? So that was lazy evaluation and actions, and as a bonus you also got a first look at query plans and DAGs. We'll revisit them in the transformations chapter, which is next, where we'll discuss the types of transformations — narrow and wide — because that's a very important concept you should know in detail. But before we talk about transformations, I think we should first talk about partitions, because only then will the concept really land.
detail. Let's see. So before talking about transformations I think we should first talk about the partitions because
only then you can understand the concept more precisely. So basically what are partitions? Because we always talk about
partitions. Partitions, partitions in Spark. What are partitions? It is very simple. So let's imagine you have this
data frame or you you have this data in your data lake, in your storage, in your DBFS, anywhere. This is your DF, right?
Yes, this is your DF. So this is one DF and you have already seen like data frame. We have just created the demo DF
previously. Okay, so that DF was really small of just three records. Okay, this DF is also like you can say a kind of
only showing four records but actually let's imagine we have something called as 1 million plus records in this
particular data frame. So we know that we perform distributed computing that means we distribute our data. But how we
can distribute one data data frame? How? That's where partitions come into the picture. So what it does, it basically
creates the partitions on your data frame. So this will be let's say your partition number one. This will be your
let's say partition number two. So it will simply create the partitions on top of your
data. Make sense? Very good. Let's say this is partition two. Now this partition two will be distributed among
the executors. Let's say this is exeutor number one. So one partition will be there and this let's say partition
number two. This partition will be available on this particular exeutor.
So this way our data gets partitioned and then it gets distributed among different executors.
Make sense? That is the role of partitions. But now you can ask me, hey, how we can just partition our data. How?
So you cannot partition your data. No, you cannot partition your data. You create logical partitions. Okay. Now
what are logical partitions? In order to understand logical partition, let's bring another topic. It's called RDD.
I'd say the RDD is the backbone of Apache Spark. Really — the backbone. It stands for Resilient Distributed Dataset (and I'll tell you in a moment why it's called resilient). So, as we discussed: here's our DataFrame of four records. Spark creates logical partitions of this DataFrame and stores those logical partitions in something called an RDD. If you want to understand RDDs in layman's terms: if you've worked with any programming language, you know data types — list, tuple, dictionary, set. Think of the RDD as one of the data types we have in Apache Spark; that's the easiest way to picture it. It's basically a kind of list — a list of logical partitions. But this list is different from a traditional list, because a normal list can't be distributed and this one can. That's the specialty of this data type: it can be spread across machines. And I'll repeat it: the logical partitions are applied on top of your data, and that becomes the RDD — this is the thing that actually gets distributed to the executors. Your DataFrame — or rather the data lake holding your data — doesn't go straight to the executors. It first gets partitioned, in the form of RDDs, and then goes to the executors. So the RDD is a collection of logical partitions — a kind of list that holds them. When that goes to an executor, the executor knows exactly which data it needs to read from the data lake, from the source, because that's where the real data is actually sitting. Makes sense?
Now let me tell you why it's called resilient: RDDs are fault tolerant. How? RDDs are immutable — they cannot be changed. Say you create a DataFrame and apply a transformation; that produces an RDD, call it RDD1. Now, even if you write another transformation on top of that same DataFrame — take the example from the code we wrote earlier: we applied a filter on df, and then on the same df we applied a select. Looking at the code, you'd think we're changing the existing DataFrame, because we keep reusing the same variable. The answer is no. Every time, you create a new DataFrame — a new RDD — even though the name stays the same; you never change anything in the existing RDD. So say the filter ran fine and created RDD1; then Spark goes to apply the select and create RDD2, and for whatever reason that step fails. What does Spark have? RDD1, already created. It simply takes RDD1, and since it has the DAG — it knows how to produce the next steps — it re-reads those instructions and recreates RDD2. That's how Spark tolerates faults and failures in your code, and that's what resilient means. Distributed we already covered: it can be spread across the executors. And dataset means your original data. That's the cleanest way to understand the RDD — it's really simple; you just need the right example and the right person (and Lamba, of course) to explain the concept. So that's your RDD. Now, whenever you create a DataFrame, yes, you're writing your code against the DataFrame API. We used to write code directly against RDDs too, but that isn't optimized and isn't really recommended anymore. We simply write our code with DataFrames, that code gets optimized, and in the end the resulting instructions are handed to the RDDs, and those RDDs go off to the executors.
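A tiny illustration of that immutability and lineage, using a toy RDD — the numbers are made up, and toDebugString() just prints the lineage Spark would replay after a failure:

```python
# Each transformation returns a *new* RDD; the one it came from is never modified.
rdd1 = spark.sparkContext.parallelize([1, 2, 3, 4])
rdd2 = rdd1.filter(lambda x: x > 2)      # new RDD built on top of rdd1

print(rdd1.collect())                    # [1, 2, 3, 4]  -- unchanged
print(rdd2.collect())                    # [3, 4]

# The lineage: the recipe Spark can replay to rebuild rdd2 if a partition is lost.
print(rdd2.toDebugString().decode())
```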
Even if you're not writing your code with RDDs directly, that's fine, because you're writing it with DataFrames, and DataFrame code is already being optimized for you by the Spark SQL engine — we'll look at the Catalyst optimizer and all of that later. So you're not really the one optimizing your code; the driver node optimizes it, and all of that information gets transferred into the RDDs. Spark builds the RDD with everything it needs to know: how to read the data, how to apply the transformations — that physical plan, remember — and that information travels along with the RDD to the executor. So think of the RDD as carrying a list of instructions: it passes that information to the executors, and they simply read it and do the work without much thinking, because they're consuming the knowledge produced by the driver. This might feel a little confusing right now, but it will settle in over the upcoming chapters, because all of these things are interrelated — some things click immediately, and some click later in the video when you connect the strings. For now I just want you to have the high-level picture of RDDs; if every detail hasn't landed yet, that's completely fine — it ties into a lot of things we haven't covered, and when we get to them you'll go, "oh, so that's what that was." You're not the only one who experiences that. Very good. So that's the RDD, and we already know the concept of partitions, which means we can now easily digest the types of transformations. What are the types, what's the difference, and why should you even care — after all, you just apply transformations, right? No: you should be thinking about which type of transformation you're applying. Let me show you why. Let's understand narrow and wide transformations — and trust me, it's simple.
Let's cover narrow transformations first. The classic examples of narrow transformations are filter and select. What happens when you apply one? Picture it: this is your source, and these are the two partitions Spark has created on top of your data — you know about partitions now — and say the data has three columns. Now let me give you a little job: you are the computer. I hand you two Excel sheets, and those two sheets are really one sheet that I've split into two parts, and you're doing whatever calculations I ask for. If you imagine yourself doing it, you'll understand it better. So, you apply a filter. Take sheet number one and filter the rows — say I want every row where ID equals 1. Do it. Done? Now do the same thing on sheet two. Done — very good. Here's the question: while you were filtering sheet two, did you need any reference from sheet one, or any help from the other sheet at all? Obviously not — you can filter purely from the data sitting in front of you. Exactly. Now suppose I ask you to remove some columns from sheet one; you can do it. Do it on sheet two as well; you can do that too. Same question: did you need any help from the other sheet? Obviously no. That's where narrow transformations come into the picture: you don't need anything from the other partitions. You apply the filter on this partition and produce its output as an independent partition — this partition's output is this, that partition's output is that. Both are independent, and each one produces its own output partition. (You can see that per-partition independence in the little sketch below.)
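A small sketch of the idea, using a toy RDD and glom(), which just shows the contents of each partition — the data here is hypothetical:

```python
# Two partitions of toy numbers, standing in for the two "Excel sheets".
rdd = spark.sparkContext.parallelize(range(1, 9), 2)
print(rdd.glom().collect())               # e.g. [[1, 2, 3, 4], [5, 6, 7, 8]]

# A filter is narrow: each partition is filtered on its own, no data moves between them.
evens = rdd.filter(lambda x: x % 2 == 0)
print(evens.glom().collect())             # e.g. [[2, 4], [6, 8]] -- same partition layout
```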
Now let's cover wide transformations. The most popular wide transformation is groupBy. Take the same source again, the same columns — you have city in your data — and again I've split the data across two sheets. Now I want you to aggregate the sales for each city. Do it for sheet number one: you'll come back with, for example, New York = 100,000 in sales. Very good. Question: do you need any help from sheet number two? Yes, obviously — because sheet two also has New York records. So you need sheet two in order to aggregate the values and give the final answer. That's where wide transformations come in: you need to shuffle your data. What does shuffling mean here? Think about it: if you're calculating the sum of sales per city, and instead I hand you one sheet per city — one sheet for New York, one for Toronto, one for New Delhi — can you find the totals? Of course: "I have all the New York records in one place, I can easily sum the sales." That's exactly what a wide transformation sets up. So in this case, if your source has just three cities — New York, Toronto, New Delhi — Spark will build a separate partition per city: one for New York, a second for Toronto, a third for Delhi, and hand those to the executors so the groupBy and aggregation become easy, because the aggregation needs all the matching records together. And that requires shuffling, because the data starts out scattered: some of the New York rows are in this partition, some in that one, and the same goes for Toronto. So Spark shuffles: for the New York partition, some rows come from here and some from there; same for Toronto, same for Delhi. That's your wide transformation. A narrow transformation is one-to-one between input and output partitions; a wide transformation is one-to-many. That's narrow versus wide. Let's go and see this with code, and you'll notice it creates a lot of partitions — because by default, whenever Spark performs a wide transformation, it creates 200 shuffle partitions. Why 200? Because there can be many keys; here we only have three cities, but in real data there could be many. These days you'll often end up with fewer, but that's because of AQE, which we'll discuss later in the video — for now, just take in that the default is 200. So what happens here? It creates the three partitions we need, and the remaining 197 stay empty, because we only have three cities — but imagine a dataset with many more. That's what it does. Let's see it with code, read the query plans, and look at the DAG as well — there's a small sketch of the idea below, too.
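A sketch of the wide-transformation example we're about to run — the sales numbers are made up, and the shuffle-partition check just reads Spark's default setting:

```python
from pyspark.sql.functions import max as max_

# Hypothetical city/sales rows, mirroring the two-sheets example above.
sales = spark.createDataFrame(
    [("New York", 100), ("Toronto", 200), ("New York", 300), ("New Delhi", 50)],
    ["city", "sales"],
)

# groupBy is a wide transformation: all rows for a city must end up together,
# which forces a shuffle (an Exchange step in the plan).
per_city = sales.groupBy("city").agg(max_("sales").alias("max_sales"))

# The shuffle writes this many partitions by default (200, unless AQE coalesces them).
print(spark.conf.get("spark.sql.shuffle.partitions"))

per_city.show()   # action: triggers the job, including the shuffle
```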
So let's head back to our Databricks notebook. By the way, this applies to you too: if you see that your cluster has terminated — it gets terminated automatically after about 30 minutes of inactivity — you can simply create a new one. Click here, choose "Create new resource", then "Create, attach and run", and it will spin up a new cluster for you; it takes two to four minutes to come up. Meanwhile, let's create a new notebook: go to Workspace, create a notebook, and this time call it "Transformations". Attach it to the cluster. Let me add a heading as well: first we'll look at narrow transformations. We can copy the DataFrame-creation code from the earlier notebook so we're working with the same DataFrame. While the cluster is turning on, we can prepare the code: here's the cell that creates df, and above it we need the imports — from pyspark.sql.functions import * and from pyspark.sql.types import *. Oh, the cluster is already on — that was quick. So run the imports first, then the DataFrame cell.
It's complete — run the DataFrame cell too. Very good. Now let's perform a narrow transformation — a filter: df = df.filter(col("city") == "New York"). Run it, and you already know nothing will happen: lazy evaluation. This is a narrow transformation. Now let me show you what happens when it actually runs, so let's call the action: display(df). It's running... and there's the data you wanted to see, because we filtered for New York only. Let's look at the plan with df.explain(): it simply scanned the data and applied the filter — that's it, nothing else to worry about. And we already know how to see the DAG for it: this is job 0, click it, it opens the Spark UI, expand it, go to the SQL / DataFrame tab and click the "display df" entry — that's your DAG. Now let's perform a wide transformation.
To do that, first recreate the DataFrame so we're working with a fresh one. Let me write a heading here — "Wide Transformation", heading 3, bold. Now let's perform a groupBy: df.groupBy(), and you just pass the column you want to group on — I want to group by city. Then, what aggregation do I want? I'll call .agg(); I could do a count of age, but let's instead find the maximum age. I know we only have one record per city in this tiny dataset, but that's fine — let's apply the max aggregation on the age column. And of course we save the result back into df. Done — and obviously there's no job yet. Now let's hit an action and see what we get: there's the result, maximum ages of 25, 30 and 35. Very good. Now let's look at the plan.
Now let's see what actually happened in the case of the wide transformation: df.explain(). Let's read the plan. This final plan is the one to focus on — the other part you can treat as the initial plan before optimization. Read it bottom to top: first, it scanned the data. Then it performed a HashAggregate on city — a hash aggregate, because city was our grouping key; the key is your column, and the function is max, the aggregation we asked for. Then you see Exchange hashpartitioning. That is the shuffle — whenever you see the word Exchange, it means shuffling of partitions, and you now know why it happens. After that there's a ShuffleQueryStage with some statistics (a row count of three and so on), and then AQEShuffleRead, which you can ignore for now because it belongs to adaptive query execution, something we'll discuss later in the video. Now let me show you the DAG for it: click here, expand it, close the previous tabs, go to SQL / DataFrame and open the latest display. You can see a lot happened: first it scanned the data, then it applied the hash aggregation on your grouping key, then it shuffled the data — and look at the number of partitions: 200. Then comes the AQE step, which I'll explain later in the video because you need more background first. In short: we said the shuffle creates 200 partitions, and AQE is a modern optimization that reduces that number — if you hover over the step and click the plus to see the partition count, I expect it collapses them down to a single partition here. It basically combines your partitions. That's out of scope right now, so for this chapter just pretend the AQE step isn't there. And don't worry — I'll disable AQE for the rest of our code so it doesn't confuse you, but you should know it exists, because if you leave it on you'll see this extra step and wonder what it is. So: 200 shuffle partitions, then the final hash aggregation, and that's the plan that actually executes the code. That's your DAG and your Spark query plan for the grouping — in other words, for the wide transformation. And that's narrow versus wide transformation.
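As mentioned above, AQE will be switched off for the rest of the demos so the raw 200 shuffle partitions stay visible; this is a sketch of the standard flag for that:

```python
# Turn adaptive query execution off for this session so the shuffle's 200
# partitions show up in the UI instead of being coalesced.
spark.conf.set("spark.sql.adaptive.enabled", "false")

# It is on by default in recent Spark/Databricks runtimes; re-enable it with:
# spark.conf.set("spark.sql.adaptive.enabled", "true")
```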
So what's next? The next thing is very, very important, I'd say. We've been talking about jobs, right? Whenever we see job 3 or job 4 and click on it, we see a DAG for that job and a stage ID; and if I click on Stages, I see stages. So what exactly is a job? What are stages? And there's another term too: tasks. We're going to talk about jobs, stages and tasks — how they get created, how to work with them, and how they get optimized. It's really interesting. But before that, let's talk about repartition and coalesce. This is really important, and now is the best time to discuss it, because you now know wide and narrow transformations — and let me tell you, one of these two is a narrow transformation and one is a wide transformation. And sometimes both can end up being wide. Really? Yes. So let's talk about repartition first. Say you have a table — a DataFrame.
And by default, say this DataFrame has just two partitions initially. For some reason you want to increase the number of partitions. Those two partitions were created automatically based on the partition size, which is 128 MB — if you didn't know, that's the default partition size, and it also matches the typical default block size of a distributed file system; that's a number worth noting down. Ideally Spark recommends a partition size somewhere between 100 and 200 MB, so 128 MB works nicely — and it lines up with the block size because your data is stored in blocks in the data lake / distributed file system, like HDFS. So say this data is 200 MB in size: you'd get two partitions — one of 128 MB and a second, slightly smaller one. Now, for whatever reason, we want 10 partitions instead. How do we do that? We use something called repartition.
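For reference, the 128 MB figure mentioned above is exposed as a Spark setting for file-based reads; this is just a sketch of how you could inspect (or, illustratively, change) it:

```python
# Default max partition size used when reading files: 128 MB (value is in bytes).
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Illustrative only: you could raise it, e.g. to 256 MB, if your data layout warrants it.
# spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
```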
When we perform a repartition, just picture it: here are your two partitions. If those two partitions need to become ten partitions — one, two, three... up to ten — will data get reshuffled or not? You've got two seconds; it's simple. The answer is yes, because some of the data for each new partition comes from this original partition and some comes from that one, and so on, since we're carving out ten partitions. So whenever you want to increase the number of partitions, you use repartition. You can technically use repartition to decrease the number as well, but it's not recommended, because if you want fewer partitions — say going from two partitions down to one — there's something better: coalesce. What happens with coalesce? Take your two partitions; it simply tries to make one out of them. Quick question: will that cause any shuffling? The answer is no, because it only needs to merge the partitions. That said, let me also tell you when coalesce does end up shuffling — yes, sometimes it does.
Sometimes it shuffles, sometimes it doesn't — this is the slightly advanced bit you should know. Remember, we have two partitions, P1 and P2. If we have only one executor, both partitions end up on that same executor, and it can perform the coalesce on those two partitions by itself — it doesn't need to reach anywhere else, it just merges them. But now imagine we have a second executor. One partition goes to the first executor and one to the second. How can Spark coalesce them into one without moving data? It can't. In that kind of scenario it has to reshuffle — it has to send one partition over to the other executor to perform the merge. Still, in the vast majority of cases coalesce means merging data without any shuffle, because you'll rarely have so few partitions spread out like that; it's a corner case. You should know it exists, but in the real world you'll hardly ever run into it. So that's the concept of repartition and coalesce — now let me show you how it works in code.
So I'm in my notebook — we'll look at repartition and coalesce in this same notebook, and don't worry, I'll share it so you can use it. Let me add a heading: "Repartition vs Coalesce". First of all, let me show you how many partitions we currently have and how to find that out: df.rdd.getNumPartitions(). Run it and you'll see we have just one partition, because the data is tiny — nowhere near the 128 MB partition size. Now say I want three partitions. Simple: df = df.repartition(3). Run that, then run the getNumPartitions() command again, and the output is three. Very good — and you'll notice jobs show up here too; checking the partition count this way ends up behaving like an action. You can even look at the plan.
equals or let's say DF.explain if you just see like what actually happened. So what it did it
So what did it do? It performed the repartition step, but why did it show all the previous steps again? Do you remember the concept of RDD that we discussed? It's high time to revise it. By the way, this concept will be fully understood in the caching and persisting section, but you can still build a very strong base around it here. So what happens? This is our df, and what is the last thing we did with it? We performed the group by. And what did we say? Every time Spark performs an operation in the DAG, it creates a new RDD. So it read the data, it created a new RDD with that group by, and then we performed another transformation called repartition, which created yet another RDD. That is why, when you hit explain, you see all the previous stuff for RDD1 plus RDD2. And that means if for any reason a step fails, Spark can simply take RDD1 and perform the remaining steps of the DAG. That is the power of RDDs; that's how Spark tolerates failures. See how things are getting connected now? A lot of things will be connected, don't worry. So this is your repartitioning step.
Now let's say we want to go back to only one partition; we do not want three. So what can you do? You can simply use coalesce: df = df.coalesce(1). Run this, and now check the output of the partition-count command again; you will see one. Perfect. And if you look at the explain output, what should we see? Obviously all the previous steps again. So if you observe, this is RDD1, this is RDD2, and this is RDD3 with coalesce(1). Why didn't we see an exchange with coalesce? You already know the answer: it did not perform shuffling. That's where coalesce helps you avoid shuffling, because shuffling is an expensive operation. Very good. Now I hope that you have understood everything about repartition and coalesce.
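To keep for reference, here is a minimal consolidated sketch of the notebook steps just described, assuming an existing SparkSession named spark and an already-loaded DataFrame df (names are illustrative, not from the notebook itself):

```python
# Minimal sketch of the repartition vs. coalesce demo above.
# Assumes an existing SparkSession `spark` and a DataFrame `df`.

# How many partitions does the DataFrame currently have?
print(df.rdd.getNumPartitions())   # e.g. 1 for a small file (< 128 MB)

# Increase the number of partitions -> full shuffle (Exchange in the plan)
df = df.repartition(3)
print(df.rdd.getNumPartitions())   # 3

# Decrease the number of partitions -> merge, usually without a shuffle
df = df.coalesce(1)
print(df.rdd.getNumPartitions())   # 1

# Inspect the physical plan: repartition shows an Exchange step,
# coalesce normally shows Coalesce with no Exchange.
df.explain()
```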
Now let's try to understand jobs, stages and tasks, because we know that whenever we hit an action a job is created, but what happens after that? So we know that whenever we hit an action, a job is created for us. Well, not for us, for Spark. Whenever a job is created, it contains multiple stages, and one stage can contain multiple tasks as well: task number one, task number two, task number three, and so on. This looks very simple, and it really is very simple, but you just need to keep a few things in mind. So hold on: we know that a job is created. Very good. Now this job will have multiple stages. Who decides the stages? That should be your first question, and if it is, let me answer it.
So think about your whole code, your transformations, because stages are driven by transformations. If you are applying a filter, a select, a group by — those are the three things we have applied so far — let's take this example. This is a kind of formula you can write down; it will help you in interviews as well. Basically, if your transformation is not moving data across partitions, it is a narrow transformation, as you know. All of these consecutive narrow transformations will be bucketed into one stage; there will be only one stage for a run of narrow transformations. Then we know that the group by is a wide transformation, so it will be considered in a different stage.
That will also be a single stage, but a separate one. Now, within each stage we can have multiple tasks, and the number of tasks is tied to the number of partitions. Write that down: tasks are associated with the number of partitions. So we know that the group by creates 200 shuffle partitions, right? How many tasks will there be? 200 tasks. The task count is strictly aligned with the number of partitions. That's the basic rule you can use to understand it.
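As a quick aside, the 200 comes from Spark's default shuffle-partition setting; a small sketch (assuming the notebook's SparkSession is spark) to check or change it:

```python
# The 200 tasks after a wide transformation come from this setting
# (200 is Spark's default); `spark` is an existing SparkSession.
print(spark.conf.get("spark.sql.shuffle.partitions"))  # usually "200"

# You could lower it for small demo data, e.g.:
spark.conf.set("spark.sql.shuffle.partitions", "8")
```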
Now let's see this with the help of code, because that's when you really understand what is going on. For this we are going to read one CSV file in Databricks, and I will show you how the jobs get created, every single job. To see the demonstration, let's upload the file. It is very simple: go to Catalog, and if you go to DBFS you will see the folders. Click on FileStore — I have many folders there, don't worry — create a new folder inside FileStore, call it, let's say, Apache Spark, and then click on upload into that folder. You can create a folder the same way. So I have uploaded this file, mega.csv; let me click on Done, and it is uploaded in this folder. I can also get the path of this file: right-click on it, choose copy path, and copy the File API format one. Now you will ask how you can get this file too. You can get it from my repository called Apache Spark full course; click the download button there, then download the file and upload it the same way. Perfect. So now we have the file.
Now let me show you the jobs, stages and everything. Let's create a new notebook: go to Workspace, click Create Notebook, call it jobs, stages and tasks, and attach the notebook to the existing cluster. Very good. Simply write the imports from pyspark.sql.functions and pyspark.sql.types and run them. Now we will not create a DataFrame by hand; we will read the CSV file that we uploaded. How can we create a DataFrame on top of it? It's very simple. We have spark.read, which is the API we use to read files. So: spark.read.format("csv") — we need to specify the format, which is CSV — and use a backslash to continue on the next line. Then type .option(). In an option we can define extra configuration, such as header: we know that in a CSV the first line is the header, so we say header is true. Then I will define one more thing that is very handy and you will love it. We have a CSV file, but what if we do not want to write out the schema by hand? We have something called inferSchema; just set it to true. What does it do? It predicts the best schema for the data in your CSV file. I love this feature. Then simply finish with .load() and paste the path. Just remove the dbfs part of the copied path, because we do not need it here, and make sure the path looks like the one on screen; otherwise you will see errors. Run this, and you will see the magic; I want you to observe it.
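Before we look at what happened, here is a minimal sketch of the read cell we just ran; the path and file name follow the upload steps described above, so adjust them to wherever you actually placed the file:

```python
# Read the uploaded CSV into a DataFrame.
# The path assumes the upload location described above (FileStore/Apache Spark/mega.csv);
# change it to match your own workspace.
df = (
    spark.read.format("csv")
         .option("header", True)        # first line contains column names
         .option("inferSchema", True)   # sample the data to guess column types
         .load("/FileStore/Apache Spark/mega.csv")
)
```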
So, did you catch it? If you caught it, I love you. Basically, I just read the data; I didn't hit any action. So why did it create two jobs for me? Let me tell you something. When we create a DataFrame without files, it is not an action. But when we create a DataFrame on top of files using the .read method, Spark has to look at the files, so it behaves like an action. That is one action. Let me add a comment: this is action number one. So why are we seeing two jobs, when we know one action triggers one job? Good catch — the second one is inferSchema. Why? Because, as I just mentioned, inferSchema predicts the best schema for your data. How does it predict it? It takes a sneak peek at your data: it reads some of the records of your file and then predicts the best types. Because it is reading some records, it also behaves like an action, and that is why it is action number two. This was something new, I know.
So, let me click on this drop-down: you will see two jobs, job number 10 and job number 11. One more piece of information that is very useful, so note it down. We know these are jobs, but what about the stages? Stages are aligned with transformations, and we are not applying any transformation; the same question applies to the tasks. The thing is, one job will always have at least one stage and at least one task. That is why you are seeing one stage per job. Let's click on the first one and see what it did: click on view and expand it. This is stage ID 13, the stage it performed, and it has one task. Why? Because there is just one partition and it is not really doing anything, just reading the data, so one task makes sense. Very good. Now let's see the other job, job number 11: click on it, expand it, and you see stage ID 14. Same stuff, and the task count is one. This one is for inferSchema. You can even open the DAG visualization, click on it, and you will see the DAG for the inferSchema job.
Now let's do one thing: let's perform some transformations. I will say df = df.filter(...) — but wait, do I know what the data looks like? Actually, no. So what I can do quickly is create a demo DataFrame that we will not use for counting jobs; we are creating it just to see the data, so that we can pick some columns for the transformations. You know what I mean, right? So I will copy the read code, paste it above, and call it df_demo; it is just there to look at the data. Then I will say display(df_demo) and run it. Hmm, an unexpected error — let me comment that line out. What's wrong? spark.read.format... oh, there was a stray space, and it doesn't like that; let's remove the comment too, because it was messing up the spacing. Run it again. Yeah, see? Now we can look at the data: order ID, user ID and many other columns. Let's say I want to filter only the data where the product name equals sneakers.
Let's do it. I will say df = df.filter(...) with the product name column equal to "sneakers". Make sense? Let's perform this transformation. And you know what will happen? Nothing. No job. Why? Because no action has been triggered. Very good. Now let's perform one more transformation: I just want to select the order ID and the product name, nothing else, so df = df.select(...) with those two columns. Let's run this. Perfect. Now let's perform a wide transformation: df = df.groupBy(...) on the product name. And what do I want to aggregate? Not the sum of the order IDs but the count, because I want to count the number of orders for the product we filtered. So it is a count of the order ID column. Let's run this. Perfect.
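Putting the narrow and wide transformations above together, here is a minimal sketch; the column names order_id and product_name are guesses, since the exact headers of mega.csv are not spelled out here, so rename them to match your file:

```python
from pyspark.sql.functions import col, count

# Column names `product_name` and `order_id` are illustrative;
# match them to the actual headers in your CSV.

# Narrow transformations: stay within the same partitions, same stage.
df = df.filter(col("product_name") == "sneakers")
df = df.select("order_id", "product_name")

# Wide transformation: forces a shuffle into (by default) 200 partitions,
# which starts a new stage once an action runs.
df = df.groupBy("product_name").agg(count(col("order_id")).alias("order_count"))

# No job is triggered until an action such as display(df) / df.show() is called.
```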
Now, we know this is the code we are going to run: action number one, action number two. So we have two jobs so far; that is fine. Let's talk about the third action. We have not triggered it yet, but we will. This is transformation number one and this is transformation number two, but both are narrow transformations, so how many stages will there be for them? Only one. Then we have another transformation, and this one is a wide transformation, so we get a new stage. How many tasks will there be in the first stage? Only one, because there is only one partition. And how many tasks in the second stage? 200. But AQE is there — adaptive query execution — so we have to turn it off first if we want to see the raw result; otherwise it will simply coalesce things and not show the real picture. Let me turn off AQE first. It is very simple, not a big task: I just need one statement, and if you do not remember the configuration, simply go to Google.
need to use one statement. That's it. I can also show you right now as well. Simply go to Google if you do not
remember the code for it. So you can simply say turn off AQE in Spark or you can say
turn on same thing. You simply need to use true or false. Okay. Click on performance
tuning and it will be something like um let me
see AQE. So co sends yeah here it is adaptive query execution. So this is the
thing that we need to make it false default value is true. Okay, we simply need to do this and how we can just do
that. Simply run this command at the top of this spark dot conf dot set and then just make it false instead of true. So
as you can see spark.com set spark.sql dot adaptive co partitions. Oh wait wait wait wait partitions. No no no no no. We
do not just need to apply co partition. We need to turn off the AQ on all the things. So just go back. Uh ah yeah,
here's the thing. So you simply need to copy this one. So simply replace it with this one.
Spark.sql.adaptive.enabled. So it will simply turn off the AQE for all the things. Simply run this. And now AQ is
disabled. AQ is disabled. Very very very good. And that's what we wanted. Okay. So we may need to just rerun all the
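A one-liner sketch of the setting just described (plus the narrower property that was grabbed by mistake), assuming the notebook's SparkSession is spark:

```python
# Disable Adaptive Query Execution entirely so the raw 200 shuffle
# partitions/tasks remain visible in the Spark UI.
spark.conf.set("spark.sql.adaptive.enabled", "false")

# Note: spark.sql.adaptive.coalescePartitions.enabled only controls
# AQE's partition coalescing, not AQE as a whole.
```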
We may need to rerun all the code now, but that is fine. Rerun the read cell — and remove that commented line again, since it messes with the spacing — and you will see the two jobs again. Very good. Rerun the transformations too. So, what were we saying? That the group by will create 200 partitions, which means 200 tasks. Now I am going to write display(df). In total, how many stages will there be in this job, and how many tasks? It's a very simple question and you should be able to answer it. In this job there will be two stages in total: one for the narrow transformations and one for the wide transformation. In the first stage there will be only one task, and in the second stage there will be 200 tasks. Let's verify this: run display(df).
Very good. So what is it doing here? It has figured out the number of orders for sneakers, and it has created five jobs. Why? Because it reused the two jobs from earlier, and if you look at the other jobs you will see several of them marked as skipped. Those skipped ones are not actually executed, so there is no need to worry about them; our main job is job 17. Why are the others skipped? Because we had already run the code that builds df, so Spark does not recompute that part, but it still shows up in the list of jobs. Job 17 is the main job. Now, what is it actually doing? Let me tell you what is happening. First of all, we need to see our two stages in this job. Hmm, I can only see one task here — hold on, let's look closer. So this is the job, and what is actually happening is: this is stage ID 20 and this is stage ID 21. In stage 20 it is applying your narrow transformations, and in stage 21 it is applying your wide transformation.
And you can actually see this by going to the SQL / DataFrame tab. First go to the job, click on it, then go to SQL / DataFrame, open completed queries and click on this one. Here you can see exactly what is happening: first the file scan, then the filter that is applied — that's your filter and select — then hashing, which is for the grouping key, then the shuffle. How many partitions? 200. Very good. Then we have the hash aggregation, because we want to apply the group by, and finally collect limit. That's it. Now let's go to the stages tab. If you look closely, you can see which stages are completed and which are skipped. And if you go back to jobs, this is our job, job 17. Open it — you may need to reopen it, because I just closed it — click on jobs one more time, find job 17 and open it. These are the completed stages. Here we are seeing one stage that we should not see; I know something is a bit off with this run, because I don't think we restarted the cluster, but that is fine. The whole intent right now is to explain this diagram, because it is really important; you already know you should see 200 tasks, and that is not the big deal. This is the big deal.
So what is happening right now is very important. Don't worry, we will also fix this, and it is not a bug. The thing is that we reran the whole notebook but did not restart the cluster after defining a new property for our Spark session, so it skipped recomputing the group by part, and that is why the view looks a bit odd. No need to worry; let me show you. So what is happening here? In the plan, we first performed our narrow transformations; fine. Then we know that we changed the transformation type from narrow to wide, and that is why we see an exchange. What is the name of this exchange? It is called a write exchange, because it is writing the shuffle data out for the next stage. And the other one is called a read exchange. Write exchange on one side, read exchange on the other: it created 200 partitions, read all 200 partitions back, and then performed the grouping. This step was really important and I wanted to explain it. Now let's restart our cluster and perform the whole thing again.
Let me remove the clutter; while the cluster restarts, we will rerun the whole code and you will see everything sorted out. I can even merge the code into one cell so that it is easier for you to manage and see the results; let's run all the code together. That way you can keep this notebook for your reference and easily make notes from it. Then just add display(df) at the end, neat and clean, with a small comment marking it as the action, and another comment marking the data reading. Let's delete the extra cells. Perfect. Now that the cluster has restarted, let's run all the code together, and after this I hope you will be a master of jobs, stages and tasks. It is creating a fresh df now; it is not reusing anything from the previous run, because some state was saved in memory before. Now we can see jobs being created — there are several. Click on this one, and you can see the three jobs we actually care about: job zero, job one, job two. Job zero is for reading, job one is for inferSchema, and job two is for the display that runs your transformations. Let's open that one. Perfect, now I can explain this.
So now you can see that the shuffle write is 71 bytes. And how many distinct values do we have for the group by key? Just one, because the DataFrame is filtered to sneakers only. So how many partitions actually hold data for sneakers? Only one. And how many tasks actually do work? Only one. But, as I said, there are still 200 shuffle partitions, and let me show you. Click on jobs one more time: seven jobs were created in total, out of which four were skipped. Why? Just look at the number of tasks for those skipped jobs: 75, 100, 20 and 4. Add them up and you get 199. So 199 tasks corresponded to empty partitions; 199 of the 200 partitions had nothing in them, so obviously Spark doesn't need to do anything for them. Only one partition actually got executed. Make sense? And it is not coalescing or combining your partitions, because we turned off adaptive query execution, AQE. That's how it works. And obviously, if you have bigger data, or more than one or two or three distinct values in the grouping key, it will create more and more non-empty partitions, and therefore more tasks.
Let's sum up what we have done. First of all, we turned off AQE, just to see the results as they really are. Then we read the data, and there were two actions, which is why two jobs appeared: one for the read and one for inferSchema. Then we had our transformations: one stage with a single task for the filter and select (we accidentally copied that cell twice, but that's fine), and then a group by on the product name, which is a wide transformation, so there was a new stage for it. There were 200 shuffle partitions in total, 199 of them were empty and one actually had work to do. Now let's remove the select statement, just to show you something, and rerun. Again it skipped four jobs. Click on view for the ninth job and expand it, and you will see it gives you the same picture, with a shuffle write of 71 bytes. And if you click on the jobs, you will see the same skipped work: 75 + 100 = 175, plus 20 is 195, plus 4 is 199. So 199 partitions are skipped, and only one partition was involved in a real task. So these are the two stages, with one task per stage in the end. Take notes, it is very important; otherwise you will forget why we ended up with just one task. By the way, task counts matter less day to day nowadays, because AQE merges, or coalesces, the shuffle partitions for you, but this is very important from the interview point of view and for understanding what is happening behind the scenes. So note it all down.
So, it's finally time to master the most important and most popular topic: joins in Spark. Once you understand this concept, you are almost there to understanding nearly every concept in Spark. Trust me. Just sit back and relax, grab your coffee — I would love one too, but it's late here — and I have noted down the points I have to cover, so I am really excited to talk about this topic. You will understand each and every thing about joins. Let me open my notebook. First of all, just forget about the right-hand side of the diagram for now; you only need to focus on the part that I highlight. I have not highlighted that part, so for now just ignore it. Let's start.
Let's say this is your DataFrame one, DF1, and you have just read this data. Imagine the size of this data is around 450 MB. By default Spark will create four partitions out of it: three of 128 MB, plus one smaller partition. And we have only two executors, as you can see, executor one and executor two, so these four partitions will be distributed among them — common sense, right? So this is partition one, partition two, partition three, partition four. And as we all know, these are logical partitions. Just imagine we have lots of records, but for simplicity and to help you understand the concept, we'll show only five or six. Take it easy, it's very easy. So executor one has two partitions, one and two, and executor two has two partitions, three and four: four partitions, evenly spread. By the way, each executor here has four cores. If you don't know about cores, they are a generic computing term, and cores are what give you parallel processing. Don't get confused by the new word: if an executor has four cores, it can run four tasks in parallel. That is why two cores are occupied and two are empty on each executor. It's a very basic term; half of these terms just come from standard computing and IT. So, so far so good: this is your DataFrame, DF1, with just two columns, id and name, and these are the two executors with the four partitions evenly distributed. Well done. Now we need to apply a join with a new DataFrame.
What is this new DF and what does it look like? Let me show you. This is the second DataFrame that we need to join with; it is called DF2, and it has id and salary as its two columns. This one is also around 460 MB, say, which means it will also have four partitions. To keep things simple, I have drawn this DataFrame in green and the previous one in pink. As before, two of its partitions sit on this executor and two on that one, so the data has been read and is evenly distributed across the executors. So what is the current state of our executors? Executor one and executor two, each holding partitions of DF1 and DF2. Very good. Now I want to apply a join between DF1 and DF2. As we know, a join is a wide transformation — and if you didn't know that, now you do: a join is a wide transformation. So what happens in a wide transformation? It creates 200 shuffle partitions. And now you will actually feel why those 200 partitions matter so much — not every time, but in most cases. So what will happen? The eight partitions in total, four from each DataFrame, will be reshuffled into 200 partitions.
Now, what happens across these 200 partitions is the important part, and we already know the basic rule: the same key ends up on the same partition, so that partition has all the information it needs. That is exactly why I built your base so strong, because now this will be very easy to understand. Just tell me one thing. Suppose in DF2 you have IDs 1 and 2 here, 3 and 4 here, 5 here and 6 there. And in DF1 you have IDs 5 and 6 on one executor and 1, 2, 3 and 4 on the other. Forget about the 200 partitions for a moment: with the data laid out like this, can you apply the join without reshuffling? The answer is no. Why? The joining key is ID, and ID 5 of one DataFrame is not available on the same executor as ID 5 of the other. Let's forget the names DF1 and DF2 and just say DF pink and DF green: ID 5 of DF pink has no matching ID 5 of DF green on this executor, because that row lives on the second executor. We cannot apply the join, because the matching rows are not on the same machine, not on the same executor. That is why we need to reshuffle: we create 200 partitions, and now comes the main part. By the way, in the actual example data we do not have IDs 5 and 6; we have IDs like 1, 201 and 2001, but that was just for illustration.
So now what will we do? We will simply create 200 partitions, and these are not just ordinary partitions. Consider partition number one. What will land in it? Spark reshuffles all the data, and the row with ID 1 will go to partition one. Why? Because Spark computes something like a modulo: the ID modulo the total number of partitions, so 1 mod 200, which equals 1. (If you are rusty on the maths, feel free to check it, but trust me, 1 mod 200 is 1 — I did my bachelor's in mathematics.) Just for simplicity I am keeping the partition index as 1 here. In real life, if we are not joining on an integer column but on a string column, we obviously cannot apply a modulo directly, so Spark first creates a hash of that column value — basically a long number for that value. For example, if the value is "shoes", you hash "shoes" and get a long integer, and then the same idea applies. But to keep things simple for understanding: 1 mod 200 equals 1, so the row with ID 1 goes to partition number one.
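The bucket-assignment idea can be sketched in plain Python; this mirrors the mod/hash reasoning above rather than Spark's exact internal hash function (Spark uses its own Murmur3-based hashing), so treat the outputs as illustrative:

```python
# Illustrative sketch of hash partitioning, not Spark's exact internals:
# each join key maps to one of num_partitions buckets, so matching keys
# from both DataFrames land in the same shuffle partition.
NUM_PARTITIONS = 200

def partition_for(key) -> int:
    # For strings (or any non-integer key) we hash first, then take the modulo.
    return hash(key) % NUM_PARTITIONS

print(1 % NUM_PARTITIONS)      # 1
print(201 % NUM_PARTITIONS)    # 1  -> same partition as ID 1
print(2001 % NUM_PARTITIONS)   # 1  -> same partition again
print(partition_for("shoes"))  # some bucket between 0 and 199
```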
That row is from the green DF, so we will draw it in green here as well: ID 1 from DF green sits in partition one. Now take the pink DataFrame: its ID 1 gets exactly the same value, 1 mod 200 = 1, so it also goes to partition one. So partition one now holds ID 1 from DF pink too. Then Spark goes back to the green DF and finds ID 201. What is 201 mod 200? It is 1, so where will ID 201 go? To partition one as well. And obviously the pink ID 201 will also go to partition one. This is how it populates all 200 partitions, always placing the same joining key on the same partition. Same joining key, same partition — that is really important. So now you can see that this partition has ID 1 from both sides, and ID 201 from both sides, which means the matching joining keys have been brought onto the same partition.
Now tell me: can we apply the join? Easily. But hold on — we are not applying the join yet. Up to this point Spark has created the 200 partitions and just reshuffled all the data. Up to here, we are in what you can call the shuffled state. Note that down: this is the shuffled state. Now, this step is common to the two major shuffle joins: the shuffle sort merge join and the shuffle hash join. In both of them, this shuffling happens on the partitions first, so up to here the steps are identical; from here the two strategies diverge, taking either the sort merge path or the hash path. Let's understand both, and let me take a screenshot of this diagram so that when I explain the hash join later I can reuse it and you will follow easily. So, we have performed the shuffling; shuffling is done. Now let's see what happens in the sort merge part, which turns this into a shuffle sort merge join. Let's see what happens after the shuffle.
Consider partition one, and say it has quite a lot of data. On the DF pink side we have IDs 1, 201 and 2001, and on the DF green side we also have 1, 201 and 2001. You can even have multiple rows with the same ID on the pink side — not duplicates exactly, but transactional records — while DF green has just one row per key. You can picture it with the classic fact and dimension concept: the pink DF is your fact table, which can have repeated IDs, while the dimension has a single row per key. This example will become even clearer when we get to hashing, so don't worry about it for now. So inside the partition, what happens? When the data was shuffled in, nobody told Spark to sort it; maybe it stored 201 first, then 1, then 2001. It's up to Spark. Now, Spark could just match key by key: 1 to 1, 201 to 201, 2001 to 2001. But done naively that takes a lot of time, because for each value on one side it has to search through the whole long list on the other side. What if we sort the data first and then apply the join? Very good idea. That is exactly what a sort merge join is: it sorts the data of both DataFrames in the partition. If the data happens to be sorted already, fine; if not, it sorts DF pink and DF green first. So tell me: will the join be quicker now? Obviously, because we can walk through the sorted values side by side.
So this is your sort merge join. See how simple that was? You just need the right examples and the right people — if you agree with that, drop a comment right now. And let me tell you: this is your default join. Even if you do not specify any join strategy, this is the one that will be applied in most cases: the shuffle sort merge join. Now, what happens in the case of a shuffle hash join? Instead of sort and merge, it hashes. Let's redraw the picture — and let me quickly check my notes to make sure I haven't missed anything. This is covered, this is covered; I just want to make sure you will not miss anything. We have covered everything for shuffle sort merge except the code implementation, which I will show you after this. Yes, you will also see the code implementation, don't worry. Very good.
Now let's see the hash join. This is really important, so try to follow it. I will keep my notebook here because I want to make sure I cover everything. So let's see what happens inside partition one. Say DF pink is your fact table, so it can have repeated join keys: 1, 1, 201, 201, 2001, and so on. On the other side you have the dimension table, DF green, with the values 1, 201 and 2001 — in a dimension you would not keep duplicate joining keys. Now, what happens in broadcast — sorry, ignore that word, I was thinking ahead to broadcast hash joins; this is the shuffle hash join. So, what happens in a shuffle hash join? We know this is our partition, with DF pink on one side and DF green on the other. In the shuffle sort merge join we simply sorted the data and merged it. What happens in the shuffle hash join? We know DF green here is the smaller DataFrame.
So what does it do? It builds a hash table on that smaller table — let me sketch it by hand. What is the advantage? Every time it needs to join a row, it looks at the joining key and probes the hash table, which is basically a lookup table, like a dictionary. It can apply the join very quickly by looking up each key, because hash-table lookups are faster than scanning through the data the traditional way. So it goes to key 1, looks it up, joins; 201, looks it up, joins; 2001, looks it up, joins. Just make sure that the small side really is small enough to fit comfortably in memory, because the hash table is held in the executor's memory, and executor memory is precious; we cannot waste it, so we need to be cautious before choosing this kind of join. Also, the hash table is built per partition: whatever we did for partition one happens on every partition. That is the only difference between the shuffle sort merge join and the shuffle hash join: the shuffle is the same, and then one path sorts and merges while the other builds hash tables. These are the two major shuffle joins you will see.
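If you want to nudge Spark toward one of these two strategies for a particular join, join hints are one way to do it; this is a small optional sketch (assuming two existing DataFrames df1 and df2 with an id column), not something the walkthrough above requires:

```python
# Optional: steer the join strategy with hints (assumes df1, df2 with an `id` column).
# Spark treats hints as suggestions; the planner can still fall back if needed.

# Prefer a shuffle sort merge join for this join.
merged = df1.join(df2.hint("merge"), on="id")

# Prefer a shuffle hash join (hash table built on the hinted, smaller side).
hashed = df1.join(df2.hint("shuffle_hash"), on="id")

merged.explain()   # look for SortMergeJoin in the physical plan
hashed.explain()   # look for ShuffledHashJoin in the physical plan
```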
Great — let me look at my checklist to see if I missed anything. I made notes specifically for joins because I wanted to cover everything: partition of the small table, done; corresponding portion of the big table, done; what happens in each partition, explained; load the small table into memory, done; create the hash table for that partition, explained; scan the big side, done. Everything is covered; only the code implementation is left for this one as well. Very good. Now you will say: you used the word broadcast, and we have heard of broadcast too — what is a broadcast join? Let's talk about it. Officially it is called a broadcast hash join, but to keep things simple and avoid confusion, just call it a broadcast join. Should we implement the code first and then jump to the broadcast join? No — let's first understand the broadcast join as well, and then we will see the code implementation of all three together. Let's do it.
Let's try to understand broadcast joins now. This is your DF pink — you are well versed with it by now — the same standard 450 MB fact table we were using before, so it creates four partitions, 1 to 4. Now, the second DataFrame is very special. Why? This second DF, DF green, is only about 5 MB. It is very small — just a small dimension table — so obviously there will be just one partition, and it will sit on one of the executors. Now think about it: this is the whole table that is needed to join with those four partitions. The two partitions sitting next to it are happy, because they have every key they need right there — the entire table, not just a partition of it, is on their executor. The other two partitions are sad. They say, "Hey, we want to apply the join too; start reshuffling right now, we are waiting." Then Spark comes into the picture and says, "Hold on. You do not need any reshuffling, because shuffling is an expensive operation; instead, I will simply send this whole table to your executor as well, and you can enjoy." So the whole table is replicated onto the other executor too, and now those two partitions are also happy.
Now let me show you how this broadcasting happens. This is your driver node. The driver temporarily holds the broadcasted table, the broadcasted DataFrame, and then the driver broadcasts that DataFrame to all the executors so that they can apply the join locally. That is why it is recommended that the table be small enough to fit easily in the driver's memory, because the driver needs to broadcast it, and obviously it should also fit easily in the executor's memory, because it will be stored there. And remember, it is not kept in the cache memory; it is kept in the execution memory. What are cache memory and execution memory? Don't worry, we will discuss executor resource allocation and executor memory management very soon — top-notch topics. So this is all about your joins. Now let's see the code implementation. We will simply create two demo DataFrames to show you everything, and then I will change the join strategy one by one. And we may need to turn off the automatic broadcast join, because Spark can sometimes perform a broadcast join on its own; we will simply turn that off, don't worry.
Let's see how we can perform those joins in code. This is the notebook I have created, and in it the first thing we do is turn off the automatic broadcast join. Why? Because our data size is really small, so Spark's optimized engine would happily broadcast it on its own, and we do not want that yet. Then comes code you are already familiar with: we simply create two DataFrames, DF1 and DF2.
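The exact property used on screen isn't spelled out here, but the standard way to switch the automatic broadcast off is the autoBroadcastJoinThreshold setting (which matches the "minus one" mentioned later), so the following is a sketch under that assumption, with two illustrative demo DataFrames standing in for DF1 and DF2:

```python
# Disable Spark's automatic broadcast join by setting the size threshold to -1
# (standard setting for this purpose; assumed here rather than shown on screen).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Two small demo DataFrames, analogous to DF1 (id, name) and DF2 (id, salary).
df1 = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])
df2 = spark.createDataFrame([(1, 50000), (2, 60000)], ["id", "salary"])
```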
Now let's apply the join. How do we apply a join? It's very simple: write df_join = df1.join(df2, ...). Then we need to define the joining condition, which is simple — it's on the id column, df1's id equal to df2's id. Now the main thing: we need to say what kind of join we want — left, right, inner, full. For now let's apply a left join. Run this, and what happens? Nothing, because this is not an action. Let's trigger the action with display(df_join).
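A minimal sketch of that cell, continuing with the df1/df2 demo frames assumed above:

```python
# Shuffle sort merge join (Spark's usual choice once broadcasting is disabled).
df_join = df1.join(df2, df1["id"] == df2["id"], "left")

# Only an action triggers execution; in a Databricks notebook, display(df_join)
# plays the same role as show().
df_join.show()
df_join.explain()   # expect SortMergeJoin + Exchange in the physical plan
```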
Let's run it, and you will see the join actually happen; I will also show you the DAG, because it is really important for understanding the join. So this is the result of the join. Now click on the job list and open the latest one, job number one, because job zero is just the normal data creation; job one is the one we are looking for. Expand it: this is the job, and it has eight tasks in total. Click on it and wait a moment. You could click on DAG visualization, but let's go to the SQL / DataFrame tab instead to really see what happened. So what happened? It first scanned the data on each side. And AQE was still enabled, which is why you see some adaptive steps in there. So let's disable AQE again and rerun everything: spark.conf.set with the adaptive query execution property — we can just grab the same disable code we used before. I have run it; now let me restart the cluster and then apply everything from scratch, so that I can show you exactly what is happening. The thing is, AQE is an optimized engine — we will talk about it in detail later, and it is a game changer — but it can confuse the picture right now, so we turn it off just to see how Spark builds the DAG and the query plan without adaptive query execution. It should not take long to restart.
By default Spark tries to apply a shuffle sort merge join, and that is what we should see here as well. If you want to perform a broadcast join instead, you have to change the code, and don't worry, I will show you how. So the cluster has restarted; let me run all the cells with Run All instead of running them one by one. This one is completed, this one is completed, and it is displaying the result. Perfect. Now let's look at the job. There are a few jobs here, so don't get confused: the one we want is the display of df_join, which is job zero in this run. If you open job zero, you can see all three stages — stage zero, stage one, stage two — with their shuffle writes and so on. And whenever you want to see the DAG, just go to the SQL / DataFrame tab and click on the query for display df_join; that gives you the DAG. So what is it doing? It is reading the data — scanning the existing RDDs — for DF1 and DF2 one by one; you can see that id and name is your DF1 and id and salary is your DF2. Then it applies a filter. Now you might ask: hey, we didn't apply any filter. If you remember the earlier part of the video, we said that Spark adds an optimization step with not-null checks, because you should not be joining on nulls; it is just taking precautions before applying the join. Then the exchange happens — the 200 partitions; click on the plus button and you will see 200 partitions. The data from both sides is shuffled into those 200 partitions, then sorted, and then the sort merge join is applied. Shuffle sort merge join: that is our default join.
Now let's say I want to perform a broadcast join. How do I do that? As we know, the broadcasted table should be the smaller one; let's treat DF2 as our smaller table. So I will create a new join, df_join_broad, and this time wrap the smaller DataFrame in broadcast(), right before it in the join call. Take note of that. I also need to stop blocking broadcast joins: I had set the threshold to minus one to say "do not broadcast", and now I do want to broadcast, so I will simply remove that setting and rerun. I could set an explicit size threshold instead, but to keep things simple and easy to note down, I will just remove the setting and restart my cluster. Then I will run df_join_broad, and it will apply the broadcast join.
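A minimal sketch of that broadcast version, using the broadcast() function from pyspark.sql.functions and the same assumed df1/df2 demo frames:

```python
from pyspark.sql.functions import broadcast

# Explicitly broadcast the smaller DataFrame (df2) so every executor
# receives a full copy and no shuffle of df1 is needed for the join.
df_join_broad = df1.join(broadcast(df2), df1["id"] == df2["id"], "left")

df_join_broad.show()
df_join_broad.explain()   # expect BroadcastHashJoin + BroadcastExchange, no 200-partition shuffle
```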
This time you will see a different DAG; I will not refresh this window, I will just show you the difference. Let me collapse the earlier output. So this is our normal shuffle sort merge join. Very good. Now we will be performing the broadcast join. You might ask: aren't we really performing a broadcast hash join? Yes, you can call it a broadcast hash join, but as I mentioned, you need to be cautious with this in code and only use it when you are confident the broadcasted side is small enough. You could also turn off the shuffle sort merge join entirely, but that is not recommended, because you should generally rely on the Spark engine to pick the best join strategy, which is the shuffle sort merge join in most cases. The broadcast join is something you add on top when it helps — and by the way, AQE will pick a broadcast hash join for you when it is needed, though in some cases you still need to add it explicitly. And what is AQE, and why is it so popular and changing so much? AQE really did change a lot of things for us, and for a good cause. But in order to understand AQE, and to understand Spark in the modern world, you need to understand all of these fundamentals first; then you can easily follow what AQE is doing and why. With AI everywhere right now, you need your concepts to be really strong, and you know the place where you can get your fundamentals that strong: Ansh Lamba's channel. Hit the subscribe button, share this video, and drop a comment; I feel happy whenever I see those hearts, those smiling faces and those messages saying "I am placed at XYZ company."
XYZ company, I really feel happy. I really feel happy. So, bro, let's simply restart the cluster, because it's 8:32 p.m. right now, I started at 8:00 a.m., I need to eat something, and I didn't have coffee today. Huh? No worries. Everything for you, bro — and sis; "bro" here means both, male and female. How can I explain this? Anyway. I think last time the cluster started really quickly; it's just a matter of a few seconds. Let's see. And by the way, we need to rerun this code as well, because yes, we need to disable
AQE before doing anything. Okay. So, whenever you see something called "adaptive" in a plan, like you were seeing before, you can assume it is due to AQE, adaptive query execution. Okay, that's why we just disable it every time before we want to demonstrate anything.
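For note-taking, the switch being toggled is just a session config — a minimal sketch; turning AQE off is only to make the demo DAGs predictable, not a production recommendation:

```python
# Turn Adaptive Query Execution off so the demo DAGs show the classic plan.
spark.conf.set("spark.sql.adaptive.enabled", "false")

# ... run the join experiments ...

# Turn it back on afterwards (AQE is enabled by default in recent Spark versions).
spark.conf.set("spark.sql.adaptive.enabled", "true")
```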
Okay, make sense? Makes sense, bro. What are you doing, man? What's wrong with you?
Let me just fast forward this restarting of cluster. Okay. And let me just get back
to you. So cluster is on. Let's run all the commands and let's see what we get. What do we get? Let's
see. So perfect. And your joins concept should be crystal
clear after this. Okay, crystal clear, crystal clear, crystal clear. And you can even
like do something like experiment something to see more and more things and you can even like try with your own
CSV files. It's up to you. Okay, it's up to you. So it is simply running this job now. Display DF join broad. It has
created three jobs so far. Very good. So I am interested to see maybe job three because it will be for maybe display.
Makes sense. Let's click on this and let's go to SQL data frame and display DF join broad. Very good. Very good.
Very good. Very good. Because what happened? Oh wow. So this is your DF1 which is the bigger data frame. This is
the smaller data frame for ID and salary. See when I just hover over this I can see the columns. Then what
happened? What happened? It simply created a broadcast table called broadcast exchange. So it simply
broadcasted this table and it simply you can say joined this broadcasted table with your bigger
table. See and then you can see the rows output and then at the end collect limit. Wow.
This is your broadcast join. Okay. You can even like save these things in your notes. You can even draw it. You can
even take screenshots. Anything. Just save this information like this is very crucial information in the world of
Apache Spark. If you if you just compare the DAG of this one, see there was no broadcasting. But here you are actually
performing broadcasting. What is the advantage of broadcasting? There is no 200 partitions, no shuffling. See here
we have like shuffling 200 partitions but here is no shuffling. That is the advantage of broadcast join. No
shuffling. That's why it is really, really quick. Wow. Yes. So this was all about your joins: all the types of joins, including concepts, theory, practicals, demonstration, everything. I hope you understand joins thoroughly now. Okay. Now let's try to understand something very important in the world of Apache Spark. Now we're going to talk about the Spark SQL engine, or you can say the Catalyst optimizer. And now you will
see the entire flow, because you keep saying "your code is optimized by Spark, your code is optimized by Spark" — but how? Plus, every time we run df.explain, we see something called the
physical plan. Now what is this physical plan? Just clear everything right now because now we have a lot of knowledge
in Spark. Now we can understand that. Okay. So let me just tell you the entire flow and how Spark optimizes your code
and how Spark you can say picks the best plan and then sends that plan to the
executors. Let me just show you the entire flow. Okay. So let's say this is you, and you are a very dedicated data engineer. You are writing your Spark code. And you know that whatever code you write, it will not do anything by itself; Spark will simply store all that information as a plan. That plan — by the way, take notes — is called the unresolved logical plan. Unresolved. Yes. Now what is this unresolved logical plan?
Basically whatever you wrote df equals to blah blah blah I want to pick this column blah blah blah you wrote your
code right that is unresolved. Why? Because we have something called as catalog. What is catalog? Catalog has
all the objects registered. Catalog name, database name, table name, column name, all the things. Okay. All the
things. So whenever you are reading any data, Spark will simply create the unresolved logical plan, and then it will make sure that whatever columns you are using, whatever file name you are using — all that metadata (the catalog means all the metadata) — it will verify that every column you mention in your code is exactly a column that exists in the file. So it performs a small analysis. Performs what? A small analysis, on top of your unresolved logical plan, with the help of this catalog which has all the metadata. Okay. Once it approves all your columns — let's say you are writing your code and you have a column called categories, and due to a typo you wrote category instead of categories — it will throw an error called AnalysisException. That error comes from this analysis step against the catalog. Make sense? Very good. Once that analysis is done it will simply create the resolved logical plan — only
the resolved logical plan. That means okay everything is fine. There's no typo. All the columns that are being
used are there. So we can just use this code. Okay. Very good. So your code is converted into resolved logical plan.
Then your resolved logical plan will be converted into something called as optimized logical plan. Make sense?
Optimized logical plan. And this is the plan where you see the optimizations: you say, hey, I applied a filter, I applied a select statement, and Spark just combines both transformations. Why? Because it is optimized code — you can perform both things together. Okay. Another example, and this is a very good example: let's say you apply a select statement, then you perform some group by, then some kind of filter, and then many more things. The optimizer knows that the filter should be performed earlier, just to reduce the rows so that your query runs faster. So it knows which transformation should be prioritized first. Those are all examples of the optimized logical plan. Okay. So you can write your code in any order. You do not even need to
think anything. Hey um should I apply filter first because my data frame will be like shorter. You do not need to
worry about anything because Spark is there for you. Spark SQL engine is there for you. It is optimizing your code. It
is optimizing your code. So you do not need to worry at all. Simply apply all the transformations that you want to
apply. Just apply it. Don't need to worry. Don't need to worry. Once you have something called as optimized
logical plan that is not the end then that optimized logical plan this one will be converted into physical plans.
Now what are these physical plans? Basically, let's say you are applying a join, and in the optimized logical plan it is simply mentioned that the join should be applied after the filter, or whatever. That is just the optimized logical plan. But the optimized logical plan does not say whether it should apply a sort merge join, a hash join, or a broadcast join. No, it just says that it needs to apply a join. So which type of join to use will be determined by the physical plan. How? Basically, Spark creates many physical plans from that one optimized logical plan — all the physical plans that are possible. Then it uses something called the cost model. The cost model is like a funnel that takes all the physical plans and outputs the best physical plan, the least expensive one. You can think of it in terms of money as well, because if you are consuming more resources to process data you are obviously spending more money, but here cost does not directly mean money; it means computational resources, which can be expensive. So that is the cost model, and the best physical plan it outputs will go to the
executors to be executed. This is the entire flow of the Spark SQL engine: how it takes your code, converts it into the unresolved logical plan, performs the analysis, converts it into the resolved logical plan, then converts that into the optimized logical plan, and then converts the optimized logical plan into physical plans. When you run df.explain, you see the heading "Physical Plan" because that is the final plan that goes to the executors, and the best physical plan is picked through this cost model — everything is done by Spark. Wow, this is your end-to-end flow. I hope you now understand everything related to Spark optimization: how it optimizes your code, how it creates a physical plan, and how it sends the plan to the executors.
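If you want to see these plans yourself, the explain API prints them — a minimal sketch, where df is whatever DataFrame you have built:

```python
# Prints the parsed (unresolved), analyzed (resolved), optimized logical,
# and physical plans for the DataFrame's query.
df.explain(extended=True)

# The default call prints only the final physical plan,
# which is what actually gets shipped to the executors.
df.explain()
```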
Make sense? Very good. Now let's try to understand some very important stuff in Spark. Let me assure you, now we're looking at one of the most popular and most important topics — and this one and the next one are both very important: driver memory management and executor memory management. Let's start with driver memory management. Driver memory management is pretty simple; the executor side is also simple because I am explaining it to you, but trust me, the driver side is much simpler than the executor side. So let's first look at what driver memory is. We all know that the driver is the heart of the whole Spark application, or you can say the whole Spark framework, because the driver is responsible for orchestrating the work, creating the logical plan, the optimized logical plan, and everything. So let's look at that. Basically, the driver in Spark is the master process that coordinates all the tasks. We know that it holds metadata, task scheduling info, DAGs, and more. Agree? Very good. Driver
memory equals the memory allocated to the driver process when the Spark job runs. We will talk about it, don't worry. If the driver runs out of memory, we see an error called driver out of memory, and we're going to talk about that as well. So just stay with me; everything will be crystal clear in a few minutes. Basically, the driver's memory has two major components: JVM heap memory and overhead memory. For now, just ignore overhead memory. Let's start with JVM heap memory. What is that? As the name suggests, JVM heap memory — you can also call it on-heap memory — is basically for JVM tasks, all the Java virtual machine tasks such as DAGs, metadata, and broadcast variables. Broadcast variables, yes: you all know that whenever we need to broadcast something, we first bring that data to the driver, then we broadcast it to the executors. So JVM heap memory is the area where we bring that data — broadcast variables, broadcast tables, everything. Makes sense. And obviously task scheduling: if we want to schedule anything, we will be using this memory. So you can conclude that the most important area of driver memory is the JVM heap memory, because it is used in most cases, or you can say it is responsible for most of the tasks. That doesn't mean you should ignore overhead memory. Overhead memory is responsible for JVM threads, shared libraries, off-heap buffers, things like Netty and native code. In short, it is responsible for all the non-JVM (non-Java-virtual-machine) tasks. It is very good for keeping the process
stable. And you will obviously ask me: hey, do we actually need it, can we just skip it, and how can we configure it? All the answers will be given to you on the next slide. So basically, whenever you want to request driver memory — let's say you requested a 10 GB driver — we are actually setting spark.driver.memory, which is your JVM heap memory. So we are requesting 10 gigs, basically 10 GB. Makes sense. And we will automatically get 10% extra for our overhead memory, unless we have explicitly defined the overhead memory as, say, 1 GB or 2 GB. If we do not define it — and usually we do not — we simply get 10% of this memory. 10% of 10 GB is 1 GB, so 1 GB will go to spark.driver.memoryOverhead automatically. So in total we'll be getting 11 GB, right? Exactly. You request the memory for your JVM heap process, but you also get 10% added on top: 10 GB plus 10% of it, which equals 1 GB. Now, it is not always 10%. Why? Because in some cases, if you are requesting, let's say, a 1 GB driver instead of 10 GB, you will get 384 MB. Ansh Lamba, we also know a little bit of mathematics — 10% of 1 GB is not 384 MB. I know that. That is why you will get either 10% of the total driver memory that you requested or 384 MB, whichever is higher; that is how Spark decides. So in the case of 10 GB, 10% of it is 1 GB, which is greater than 384 MB, so we get 1 GB. In the case of a 1 GB driver, 10% of it is about 100 MB, which is obviously less than 384 MB, so it simply gets 384 MB. Make sense? Very good. And overhead memory is also useful for other things — the objects it creates for us, like containers, also need some space, right? All of those things go under this driver memory overhead. See, it was so simple.
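For your notes, these are the knobs being described — a minimal sketch with the lecture's example numbers; in cluster deployments the driver memory is usually passed to spark-submit before the driver JVM starts rather than set inside the application:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("driver-memory-demo")
    # JVM heap memory for the driver process (the 10 GB from the example).
    .config("spark.driver.memory", "10g")
    # Optional explicit overhead; if omitted, Spark uses
    # max(10% of driver memory, 384 MB).
    .config("spark.driver.memoryOverhead", "1g")
    .getOrCreate()
)

# Equivalent on the command line:
#   spark-submit --driver-memory 10g \
#     --conf spark.driver.memoryOverhead=1g your_app.py
```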
Now you will say: Ansh Lamba, tell us about driver out of memory, because we have seen a lot of interview questions regarding it. So basically, let's talk about it. Driver out of memory can happen for a lot of reasons, and today we'll just talk about the major ones, or you can say the popular ones. As we know, the driver's heap memory holds the broadcast variables — if you want to broadcast variables or broadcast data, that data is first brought to the driver. So if you are broadcasting data that is bigger than your driver's heap memory, it will obviously create the error. That is one type of driver out of memory. Just note it down in your notebook. Make sense? That is one type. The second one, and the most popular one,
is whenever you run a command such as df.collect(). Why? So, let's say this is the driver and this is an executor — just to make you aware — and if your memory is sharp, this should be clear because we have already used this as our example. So this is our executor, in which we have four partitions. Okay, very good. Now, when we run df.show(), or let's say display(df), what happens? Executors send data to the driver. Make sense? And when we run the display command, we just want to see a few records, right? We do not want to see the entire data; we see, I think, a thousand records by default. So what it will do is simply pick any one of the partitions — it cannot send half a partition or one-third of a partition — it will simply send one whole partition to the driver and say, hey, just show this data to the customer. That's it. But when you run a command such as df.collect(), collect means you want to collect all the data in one place. Oh, so that means in that scenario it will collect all of these partitions and send all of them to the driver. So now all these partitions will be going to the driver — just imagine the load on the driver. If these partitions are small enough for the driver to handle, it is fine. But let's imagine the size of all these partitions together is more than the JVM heap memory that we have provided to the driver. It will simply throw the error
driver out of memory. So what is the solution in that scenario? Simply increase the driver size — or ask yourself, why are you using the collect() method at all? Who told you to use it? You should use collect() wisely, with some common sense, and only when you have a few records. I have used collect() many times, but I use it when I want to, say, convert a DataFrame into a variable, just to fetch something like the maximum date from a DataFrame. I use collect() in that scenario. Makes sense. I do not pull all my data to the driver. So use collect() wisely: only with DataFrames that will not return a lot of rows. A very popular use case for collect() is getting the maximum date from a DataFrame where we know there will be just one row; we can then convert that single-row DataFrame into a key-value pair and use it — that's a totally different thing. But you should not use collect() to bring all the data to the driver. Make sense?
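Here is a minimal sketch of that advice; df and the "order_date" column are hypothetical stand-ins:

```python
from pyspark.sql import functions as F

# Fine: the aggregation returns a single row, so collect() is cheap.
max_date = df.agg(F.max("order_date").alias("max_date")).collect()[0]["max_date"]

# Risky: this ships every partition of df to the driver and can trigger
# a driver out-of-memory error on large data.
# all_rows = df.collect()

# Safer ways to peek at data: show() / limit() only move a small sample.
df.show(20)
sample = df.limit(1000).toPandas()
```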
So this is your driver out of memory error, and you know how to mitigate it. Now let's talk about executor memory management, and obviously we'll talk about executor out of memory as well. Let's see. So basically, this is — by the way, both are important, but I would say this is a little more important than driver memory, and if you ask me from the interview perspective, it is much more important. So, without delaying, let's actually understand what this is. I know it sounds like a complex topic, but I will make you understand it easily. So
basically, executor memory is divided into four major parts. First, JVM heap memory: this is again the main memory, similar to the driver, and we are going to discuss it in great detail, so don't worry about that. Second, off-heap memory: off-heap memory in the executor is zero by default, and we rarely use it, but sometimes it becomes really handy. Another thing: off-heap memory is not managed by the JVM, which means you need to manage it yourself; it brings some overhead, and that's why we rarely use it, but sometimes it is really helpful. Don't worry, we'll talk about this as well — this is a very high-level overview just to set the ground for you. The third part is overhead memory, similar to the driver: again it is 10% of the total executor memory that you requested, and you already have some idea of why we use overhead memory, right? Very good. It is similar to the driver's, and the same concept applies here, so no change. Then we have PySpark memory. Oh, what's that? This is again zero by default; we rarely use PySpark memory, it is just for PySpark applications. So, to give you a heads-up: JVM heap memory and overhead memory are really important, because we use those two in every application, while off-heap memory and PySpark memory are zero by default and rarely used. Makes sense. Still, we will look at all the parts, but the OG is JVM heap memory, and it is divided into many parts — there's a hierarchy of memory layers. Don't worry, all of them are simple. So first of all let's master the heap, and then we will quickly cover the others, because once you understand this, all the other things are easy. Trust me, this is the major area you need to focus on. Okay, let's understand and master the JVM heap memory of
the executor. So now let's expand these memories and you will actually understand the whole flow, because we are going to go into detail. Let's say you requested some memory from the resource manager — say, 10 gigs — and let's see what happens next, because you requested it but you don't know what happens in the back end. Let me show you. So basically, let's say you requested executor memory equal to 10 gigs, and you request it using this particular configuration: spark.executor.memory. This is your JVM heap memory — the on-heap memory, or you can say the main memory — as we just talked about, and we're going to expand this as well, so don't worry. Similar to the driver, you will also get 10% of the total executor memory additionally. So, in short, when you requested 10 gigs, that doesn't mean 10 gigs covers both; 10 gigs means only this heap memory, and the overhead will be available for you additionally. So 10 + 1: the total executor memory will be 11 gigs including overhead, but you always request 10 gigs and simply get 10% on top whenever you request the memory. Make sense? Okay. And now, as I just mentioned, we are going to expand this memory. For the driver we do not have much stuff to do, so we did not go deeper inside its memory — it just needs to orchestrate, and that's it. But here in the executor you need to learn some extra things, and obviously, in order to tune your executors, you need to know all of this. So now it's time to actually expand this memory. Make sense? And again, I'm repeating: this is the memory that you get when you request from the resource manager. For now we are taking the example of 10 gigs. Make sense? Very good. So let's expand this memory now.
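As a reference, the request being described looks roughly like this — a minimal sketch with the example values; the overhead setting is honoured by cluster managers like YARN and Kubernetes:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-memory-demo")
    # JVM heap (on-heap) memory per executor: the 10 GB we "request".
    .config("spark.executor.memory", "10g")
    # Overhead comes on top: max(10% of executor memory, 384 MB) by default,
    # so each executor container ends up around 11 GB in total.
    .config("spark.executor.memoryOverhead", "1g")
    .getOrCreate()
)
```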
So this is the memory. Wow. This is your JVM heap memory — the total memory, which is 10 gigs. Okay. We have three major parts in this memory. The first is reserved memory. Let's talk about this first. This is a fixed amount for the Spark engine — no negotiations, nothing. We do not allocate it ourselves; it is automatically reserved for the Spark engine. Fixed, no negotiations: reserved memory for the Spark engine. Then we have something very special. What's that? The second part is the Spark memory pool — you can call it Spark pool memory or whatever you want; I call it the pool memory. The Spark memory pool is by default 60%. 60% of what? 60% of the total memory minus 300 MB. With some basic mathematical knowledge it is obvious: the gray area is the total memory, 10 gigs; we subtract the 300 MB of reserved memory, and this pool is 60% of the remainder, i.e. of (total minus 300 MB). Make sense? So this is the Spark memory pool, 60% by default. But if you want to tune it, you can, using spark.memory.fraction. You can make it maybe 50%, 40%, 70%, or 80%, depending on your requirements and on what your code is doing. By default it is 60% — even if you do not define it, it takes 60%, so you do not need to tell Spark every time, hey, I want 60%. The third memory that we have is user memory. On the slide I just wrote "memory", but actually it is user memory; I'll write it for you, no worries. So what is this user memory? It is the remaining memory — and just tell me, what will the remaining memory be? Obviously 40% of (total minus 300 MB). Make sense? Here is how to calculate it: this is the total memory, this is the reserved 300 MB, this part is 60%, and this part is 40%. Simple. Why do we need user memory, and why do we allocate 40% to it by default? Because in your code you will be writing a lot of UDFs (user-defined functions) and user-defined data structures, and all of those operations are performed under this user memory. Make sense? Now
let's actually see this memory allocation with numbers, so that you can really understand it. We took the example of 10 gigs, and you can take notes on this. So this is 300 MB reserved — I have written that so you can note it. Now calculate 60% of the rest. Let me show you how to calculate these things: 0.6 × (10 GB − 300 MB). 300 MB is roughly 0.3 GB, so the remainder is 9.7 GB. You can use your calculator: 0.6 × 9.7 ≈ 5.8 GB, so the Spark memory pool is roughly 5.8 gigs. And user memory, as should be understood, is 0.4 × 9.7 ≈ 3.9 GB. You can check it again with your calculator. So these are the numbers that you should get when you ask for that memory.
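If you want to double-check the arithmetic, here is a tiny sketch of the same split, assuming the default spark.memory.fraction of 0.6 and the fixed 300 MB reserve:

```python
# Rough on-heap split for a 10 GB executor heap with default settings.
heap_gb = 10.0
reserved_gb = 300 / 1024                     # ~0.3 GB reserved for the Spark engine
usable_gb = heap_gb - reserved_gb            # ~9.7 GB left to split

spark_pool_gb = 0.6 * usable_gb              # ~5.8 GB unified pool (execution + storage)
user_memory_gb = 0.4 * usable_gb             # ~3.9 GB user memory (UDFs, user objects)
overhead_gb = max(0.10 * heap_gb, 384 / 1024)  # ~1 GB off-heap overhead on top

print(round(spark_pool_gb, 2), round(user_memory_gb, 2), round(overhead_gb, 2))
# 5.82 3.88 1.0
```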
Ansh Lamba, is that everything? Are we good with the executor now? Actually, no, because now we need to expand this memory further. You can ask me: we understood that the reserved memory is for the Spark engine and the user memory is for user-defined functions — but where are our transformations happening? They happen inside this memory, the Spark memory pool. And where is our cached data located? Also under this memory. Oh, okay. So we need to expand this memory as well. Exactly. So now let's expand the Spark memory pool. This is your Spark memory pool, and don't worry, it is very simple. We already know what memory we have here — it is 0.6 × (total − 300 MB) — but let's take an example of 5 gigs just to keep it simple. You can calculate the exact number, but for simplicity we'll take 5 GB as the pool. We have two major memories within this. One is storage memory; the second is execution memory (executor memory). In the storage memory you cache your data — it is used for caching. If you are caching your DataFrame (don't worry, we will talk about caching in detail), you will actually use this storage memory. Make sense? It is also called long-term memory, because you actually store data there that you can refer to later in the code; it will not be eliminated. Whereas, on the other hand, execution memory is the memory used to perform your transformations: joins, group by, everything happens in this memory. Each is 50% of the pool, so each will be, let's say, 2.5 gigs. Make sense? Very good. Execution memory is also called short-term memory, because it will not hold onto your data — it needs to process a lot of data, and if it held your data long-term it could not process anything; you would just be getting errors like executor out of memory again and again. Make sense? Okay. So now, Ansh Lamba, you said this is 50% and this is 50% — who defined this number? No one; it is 50/50 by default. But if you want to change it, you can use spark.memory.storageFraction and define maybe 60/40 or 70/30. It depends on how much data you want to cache; maybe you do not want to cache much data, so you can reduce the storage side. Common sense. Okay, so that is the default.
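For your notes, these two ratios are set at application start — a minimal sketch, with the default values shown just as placeholders rather than recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-fraction-demo")
    # Fraction of (heap - 300 MB) given to the unified Spark memory pool (default 0.6).
    .config("spark.memory.fraction", "0.6")
    # Share of that pool initially set aside for storage/caching (default 0.5);
    # the boundary is soft, as described next.
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)
```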
Now, one very good point — let me just tell you. Can you see this line? This line is a kind of divider between both memories. The good thing is that it is not a rigid boundary; it is a flexible boundary. What? Yes. Even if you define that you want 50% — I know that is the default, but still, say you define 50% — this boundary can still move up and down. Really? Yes. That's where the concepts of allocation and borrowing come into the picture. In short, let's say you are storing data in the storage memory — you're caching your data — and this memory is full, but now you need to cache more data, and Spark knows that the execution side does not have much to do and has a lot of free space. That means the storage area can be expanded: the line will shift so that storage can use some space from the execution area. And vice versa is also true; the line can move the other way as well. So even if we define 50%, the boundary can move up and down — it is flexible. Make sense? Very, very good. I know you just got a lot of information, a whole hierarchy within executor memory management — just take notes of everything. Now let's talk about this line, how it becomes flexible, and how allocation and borrowing come into the picture. So let's talk about unified memory management. Basically, unified memory is just another name for your Spark memory pool, in which we have two memories: execution memory (some people call it executor memory) and storage memory. This is the unified memory. So let's try to understand how it works, because we know this is the separator between them, and by default it is 50/50. So this is 50% and this is 50%; we know that, and we also know that this divider is flexible. It can move towards the right.
It can move towards left. Make sense? Okay. Very good. So let's try to understand with the scenarios because
once you get all these scenarios, this will be a piece of cake for you. So now let's try to understand it. Let's say
you are using execution memory — obviously for joins, group by, all the transformations — and this 50% quota is full. Now it needs more memory to perform more transformations. So what will it do? It will see that there is some free space on the storage side. Wow. So it will simply occupy that space and come over here — let's say it needs this much space. Make sense? It can occupy this space because it was empty. Sorted. It can just take the space. Very good. Now, let's cover the next situation in a separate scenario, because this was scenario number one, and I want to include all the scenarios — if you miss any one of them it will leave confusion in your mind, trust me. So scenario one is when we have free space available. Okay, let's look at scenario number two; this was our setup, and now we are going to make some tweaks. Let's say your execution memory is being used up to here, and now it needs even more space. What will happen? This is really important. It will try to evict the cached blocks, because we know that the storage memory holds cached data. It will simply evict them and occupy that space as well — but using the LRU (least recently used) method. Basically, whichever block of cached data is least recently used will be evicted: first it evicts this one, then this one, then this one, one by one. That means execution memory has the power, the authority, to evict storage memory. Why? Because it is prioritized — it needs to perform your transformations, it needs to actually process your data, right? So it gets the priority and it will simply evict the occupied storage memory one by one based on the LRU method, or you can say least-recently-used cached
data. Make sense? Okay, very good. Let's look at another scenario. Now let's say your storage memory is full. You have
cached so much data and your execution memory is free. In that particular scenario, storage memory can borrow the space — it is allowed, it is possible. So it will come over and take the space easily. Make sense? Very good. And just a follow-up question, which should be understood now: what if execution memory then needs more space? So this is the whole storage memory right now, and execution needs more space — it will simply evict the storage memory based on the LRU method. Simple. That should be understood now; that was just a follow-up question. Make sense? That's how it goes from left to right, how it allocates memory, borrows memory, and then returns it back. Okay, let's consider one more scenario, because it is very interesting. Let's say storage has occupied the extra space. Now you will say: tell me, Ansh Lamba, if storage memory needs even more space, what will happen? This is really important. Let me tell you. Let's say storage memory is full, and the empty space it borrowed is also full. Very good. Now storage memory needs more space. What will happen? Can it evict execution memory? No way. It is not allowed, because storage memory has no right, no authority, to evict our execution memory. So what will it do? It will evict its own blocks based on the LRU approach. Let's say these two blocks are the least recently used among the others; it will simply evict them and store the new cached data in their place. Make sense? So storage is not allowed to evict execution; only execution memory has the authority to evict storage memory, and the other way around is not possible — it is one way only. Okay, I hope I have covered all these scenarios, and I know that now you are very
much confident with this concept called unified memory management, and I hope it will also help you to take notes, because I have shown it to you in a diagrammatic way — oh man, I need coffee — diagrammatic, meaning with the help of diagrams, so that you can actually understand and visualize it. Okay, very good. Let's see what we have next in this Spark course. So let's talk about executor out of memory, and I know this is a very critical topic and obviously a little bit
complex — not really, because now you will understand everything. So let's get started. Let's say this is your data frame, a simple DataFrame with ID and product category. Sorted, good. I have just taken the top six rows, but let's say this DataFrame has millions of rows — 100, 200 million rows. Okay, very good. And this is our executor — forget about the inner colours, green, pink and blue — this is the executor that is processing this data. For now, just to keep things simple, imagine we have only one executor, and this is the disk associated with that executor, because we just have one. Now, this area of the executor is the execution side — you now understand executor memory in detail — the area that processes and transforms your data; you can think of it as the unified memory. Let's say the unified memory allocation is approximately 1 gig, just to keep things simple. 1 GB, makes sense. And we have applied a group by operation on the product category column. We know that each product category will get a partition holding all the rows for that category: for example, one partition for shoes, shoes, shoes; one for food, food, food; one for dairy, dairy, dairy. We know that, right? Very good. So, if our data is evenly scattered, if it is not skewed — and what is skewed data? Skewed data means one product category has most of the data, covering, say, at least 70–80% of it; it is called skewed because the data is concentrated on only one value instead of being evenly distributed. Okay. So if it is evenly scattered, it is fine, because the group by will create one partition for food, one partition for shoes, one partition for dairy, and each partition is, let's say, 300 MB. 300 MB, 300 MB and 300 MB — the executor can process that because the total fits within 1 GB. Make sense? Okay. Now what happens if we have more data? Let's say we have one more category called plants, and the plants partition is also 300 MB. In that scenario, that partition obviously needs to fit here too — say this executor has four cores — but we know we only have 1 GB. So what is going to happen? What can the executor do? It has some leverage called spilling. It's called data
spill. So it can spill the data, spill the partitions or you can just spill the intermediate results to the disk. Let's
say it decided to spill this partition to the disk. So now it is on the disk; it is already computed, it has some intermediate results, and they are stored on disk. So here we have space now, and the new partition can fit. So far so good. Now you will ask, is that allowed? Yes, it is allowed: the executor can spill data to the disk. Whenever it needs that data again — let's say you are writing this data out — it will simply go back to the disk, grab the data, and produce the output. Simple, because this partition is an independent partition. So what is not allowed? Let me tell you — you will ask me, Ansh Lamba, come here, why? Basically, this is 300 MB, this is 300 MB, this is 300 MB: 900 MB in total. And the new partition, the one for plants, is also 300 MB. Imagine. So you will say: Ansh Lamba, we have 100 MB available here, why are we shifting the whole partition to disk? We could simply shift 200 MB of data, just part of the partition. No, that is not allowed. You either spill the whole partition or you don't spill it at all. You cannot say, hey, just break this partition and shift half of it. You have to shift the whole partition. So this whole partition is spilled to the disk, and now you have 400 MB available in your executor to process the data, and the new partition can stay there. Very good. That is the concept of spilling. So, Ansh Lamba, where is the error then? Because spilling is allowed, right? Obviously it is allowed. So here comes the question: if the executor can spill data to the disk, why does it ever throw the error called executor out of memory, given that we have so much space on the disk? The thing is — let me tell you when it will throw the error. Let me just clear everything. Let's say your data is
skewed. Skewed means, let's say, one partition alone is 800 MB when we apply the group by — earlier it was 300 MB and it was fine, but now the data is skewed. I have applied the group by on product category, makes sense. So what will happen? Let's say this green partition was 300 MB before, when the data was evenly distributed; now the data is skewed, so this partition is 800 MB. Okay. Is this fine? Yeah, up to here it is fine — but it is skewed data, and I'm just telling you what skewness is. Now, we know that if this partition is 300 MB and this one is also 300 MB, can they all stay in memory? No, some will simply be spilled to the disk, right? Very good — I just wanted to check your knowledge of skewness and of the concept. Now let's talk about extreme skewness. Let's say your data is growing rapidly and, over time, you are only receiving orders for food — your data is growing, but only the food product category is selling. Now your partition size has become 1.5 gigs. Oh yes, it is good news for you, because your store is doing well, you are selling so much — but the data is skewed. So again you apply the group by and all the food IDs need to go into the same partition, so now this partition is 1.5 gigs. Now tell me one thing: can it stay in memory? Executor memory is 1 gig — can it stay there? No. Very good. Can we spill this data to the disk? Yes. So it
will go here. Okay. Sorted. Hm. Sorted. Okay. This is fine. But we need to process the data, right? We need to
process this data. We can spill this data. It is like in disk. Okay. But what about the processing? We need to process
this data. And in order to process this data, this needs to go to the executor. And executor's memory is 1 gig.
Now is the time to throw the error called executor out of memory. The executor is saying: hey bro, just tell me what to do, because I have just 1 gig of memory and you are giving me a partition of 1.5 gigs. How can I do that? So now it throws the error executor out of memory, and this is the
scenario. Now here you have two options. You can either expand the memory. You will simply say I have a lot of money.
Okay, let's give it 2 gigs of memory. Let's give it 2 gigs of memory. Okay, very good. Clapping. Very good. Now,
let's say after a month, you got more orders. Your store is growing rapidly. Now, this data size is 3 gigs. Again,
will you expand the memory? You can, but it is a foolish approach. You cannot like expand the memory again and again.
No. Here comes the concept of eliminating the skewness. You should focus on eliminating the skewness instead of just expanding the memory. Make sense? Very good. So, how can we eliminate the skewness? Here comes the concept of salting. Salting is an approach with the help of which we eliminate the skewness. Make sense? Very good. I hope these things are getting absorbed in your head. So now you've got the concept, and I know you are really excited to see how we can fix the skewness, because this is a single partition — you said we cannot break the partition, we cannot do this, we cannot do that, so how will we fix the skewness? Let me tell you how you will fix it: you, as a data engineer, will fix the skewness, and you have to do it. Don't worry, I'll tell you how, and I will show you a lot of techniques. Let's learn one first. It's called salting. Yes, salt — S-A-L-T, real salt. Let me tell you what it is. You've got the concept, you've got the
error. Okay, let's try to fix it. So, let's finally talk about salting. And trust me, this is very very
very important topic and very easy, very easy. Trust me, very easy. So in front of your screen you are seeing a data
frame in which you can see the skewness — a lot of food values. We have already talked about this; it is fully skewed, and obviously that is good for the store owner because he or she is gaining a lot of sales, but you as a data engineer need to deal with the skewness, right? But that's fine, because now you know the right resources to learn data engineering, right? Hit the subscribe button right now and share this channel with others, if you want to support me and if you love me. Okay, nice. So basically, let's talk about this. You have this column and you are seeing so many food values. We know that whenever we apply the group by, it will create that 1.5 gig partition, and we cannot handle that — it is not practically possible. Make sense? Very good. So what will we do? What if I tell you that you can add a column called, let's say, salt? This is a salt column. In this salt column you will pick some random values — random, but they should still make sense: you should pick the right salt size. You know that your skewed key is about 1.5 gigs, and you want to divide that partition into, say, four mini-partitions. 1.5 GB divided by 4 equals 375 MB — mathematics honours, I told you, bro. So 375 MB is fine as one partition size. It totally depends on the requirement, the solution, and the architecture, but in our case we know that four partitions are fine. Very good. So we will simply pick a salt size equal to 4. What does that mean? It means
that it will simply create an array of four values — 0, 1, 2, 3 — and randomly assign a value from this list to each row. Let's say it assigns zero here, one here, two here, zero here, three here, then 3, 2, 1, 0 — randomly. Very good. So now we will apply the group by on two columns. On two columns: (food, 0), (food, 0), (food, 0) — this will be one partition; (food, 1), (food, 1) — another partition; (food, 2), (food, 2) — one partition; (food, 3), (food, 3) — one partition. So each partition is roughly 375 MB — not exactly, because the salt is randomly distributed, but still somewhere between 300 and 500 MB. Make sense? So the hot key is now spread across four partitions, and we know that our executor can process each partition independently. If it needs more space, a partition can freely go to the disk and be brought back, because it is an independent partition whose size can fit in the executor memory for computation. Make sense? This is the concept of salting: we simply add a salt, but with a wise approach. We figure out how many partitions we need and how many would be efficient, we decide the salt size, we assign random numbers from that list, and then we split our partitions — basically, we apply the group by operation on two columns instead of just one. This is salting. Make sense? Very good. All these concepts are really, really important.
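Here is a minimal salting sketch; df, the "product_category" column, and the count aggregation are hypothetical stand-ins for the lecture's example:

```python
from pyspark.sql import functions as F

salt_size = 4  # we decided 4 mini-partitions (~375 MB each) is enough

# 1. Add a random salt value between 0 and salt_size - 1 to every row.
salted = df.withColumn("salt", (F.rand() * salt_size).cast("int"))

# 2. Aggregate on (category, salt) so the hot "food" key is split
#    across salt_size smaller partitions instead of one huge one.
partial = salted.groupBy("product_category", "salt").agg(
    F.count("*").alias("partial_count")
)

# 3. Aggregate once more on the category alone to get the final result.
result = partial.groupBy("product_category").agg(
    F.sum("partial_count").alias("order_count")
)
```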
All of them are really important. Okay, so this is your salting — you should take notes. And in bigger companies you will see that some developers have already written the salting code, and you will look at it and wonder, why are they doing this? This is the reason. Make sense? Very good. So I hope it is sorted now, I hope you know the concept of salting, I hope you know everything. Very good. Because concepts are much more important than implementation — trust me. Once you know the concept, writing the code is not a big deal nowadays; ChatGPT and all those AI applications are there. But in order to understand it, you need your brain. Okay. So now this is also covered; salting is also done. Now let's see something new, which is, I would say, mostly independent of this — not totally independent, because everything is linked to Spark and executors, but yes, it is a different topic. So let's see what that is. Let's
talk about caching. Caching is one of the most used techniques. Whenever you are writing code — and don't worry, I will show you a demonstration of caching as well, because you will understand by looking at it — why do we need caching? Caching is something you should already know about: whenever you use a mobile phone and see the recent applications in the recents tab, that is also a kind of caching, because the phone keeps that data instead of eliminating it, so that you can resume from the part where you left off. That is caching. Similarly, caching is another important thing that we have in
Spark. So let's say you are writing your DF. Let's say you are creating DF1 and you simply read the data frame. Okay.
Then you created DF2 and you used DF1 to do something — let's say you applied a filter. Then you created DF3 and, let's say, you performed some aggregation on DF2. Make sense? All of these things will be done by the executors, and they will be performed in the execution memory area, as you all know, because that is the area where all the transformations happen. So what will happen? First, obviously, Spark will create the DAG of all the transformations: a DAG for DF1, then for DF2, then for DF3. Make sense? So first it will compute DF1 using that DAG, and DF1 will reside here, in execution memory. Very good. The moment we start creating DF2, it will simply eliminate DF1. Why? Because this is short-term memory. Then it will create DF2 — but wait, in order to create DF2, where is DF1? Nowhere, because DF1 was eliminated; this is short-term memory. So in order to create DF2, it will first recreate DF1 — it goes back to the first step — and then create DF2 using DF1. Oh, okay. Now let's say you want to create DF3. Obviously DF2 will be eliminated too, because it needs more space for DF3. In order to create DF3 (let me use a different colour, green), where are DF1 and DF2? Nowhere. It will simply recompute DF1 first, then create DF2 using the DAG, then create DF3. Oh, okay. But now let's say you are creating a DF4 and you are again using DF1 — you're applying another filter. Then you create DF5, again using DF1, doing something else. Then DF6, again using DF1, doing something else. So every time you use DF1, Spark needs to go back to the first step, compute DF1, and then perform your transformation, right? What if I told you I can store this DF1 in long-term memory? That is your storage memory: I can store DF1 there, and the moment I do, I do not need to recompute it. It will save me a lot of time and a lot of resources. I will simply read it from there for DF4, for DF5, for DF6, and do my calculations. This is the concept of caching: I can store the data in long-term memory. Okay, let's try to see this with the help of code, with the help of a notebook, and I will show you how caching behaves. So now let's talk about caching in the notebook.
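Before the notebook walkthrough, here is a compact sketch of the same idea; df1 and the file path are hypothetical stand-ins, and note that cache() is lazy, so an action is needed to actually materialize it:

```python
from pyspark.sql import functions as F

df1 = spark.read.csv("/path/to/data.csv", header=True)  # hypothetical path
df1.cache()          # mark df1 for storage memory (lazily)
df1.count()          # an action materializes the cache

df2 = df1.filter(F.col("id") == 1)        # reuses the cached df1
df3 = df1.groupBy("name").count()         # reuses the cached df1 again

df1.unpersist()      # free the storage memory when you are done
```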
Let's quickly create our DF. This is the code — the same code we are using in the other notebooks as well, so you will feel aligned with the code and approach. Okay, so this is our DF1. Let me just run this, and it will simply create your DF with two columns: one is ID, the second is name. And these are
simple records, five records. And this will simply populate your first data frame. Okay, let's see what we have in the first DF. So it is created. Let me just say display(df1), and it will obviously populate the data frame and run a job as well, because we are displaying the data. Okay, perfect. So this is my DF1, and Spark has actually created DF1 — that's why we are able to see it; we cannot say it is just a plan, it has created the data frame. Now what I will do is create DF2, and I will simply say
df1.filter — no, not filter, let's create another column. So in PySpark, if you want to create a new column, you use a simple function called withColumn, and in it you obviously give the column a name. Let's say I want to create a new column and I will simply name it new_column, just to keep things simple — actually, I want to apply a kind of constant transformation. What is that? I don't want to compute anything; I simply want to enter a constant value, "yes", as a kind of flag. So let's say I am creating a new column called flag and I want its value to be "yes". How can I say that? If you are assigning any kind of constant, you have to use a function called lit. Then I can simply say "yes", or whatever constant I want to assign. In order to run this we also need to import the functions: from pyspark.sql.functions import lit. Okay, make sense? Very good. Perfect, let's create that. This is my DF2 — or wait, we want to keep it in DF1, we want to create it as DF1. Okay, perfect. Let's see the
DF DF1. Perfect. Because that's how you will see that the difference between caching and normal computation. Okay, so
this is my DF1 and Spark has created it for us not just a plan real data frame. Okay. So now what we can do? I want to
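Just as a reference, here is a minimal sketch of roughly what that cell looks like; the sample data and column names are illustrative, and display() is the Databricks helper (use .show() elsewhere).

```python
from pyspark.sql.functions import lit

# Illustrative five-row DataFrame with two columns: id and name
data = [(1, "Alice"), (2, "Bob"), (3, "Cara"), (4, "Dev"), (5, "Esha")]
df1 = spark.createDataFrame(data, ["id", "name"])

# Add a constant "flag" column; lit() wraps the literal value "yes"
df1 = df1.withColumn("flag", lit("yes"))

display(df1)  # Databricks; use df1.show() in plain PySpark
```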
So now, what can we do? I want to create a DF2 with reference to DF1. That means I want to reuse DF1 in my code.
So I will simply say I want to filter DF1. And I want to apply filter on column let's say ID. Okay. And I want to
simply get the ID equals to 1. Okay. I will simply run it. What will happen? Nothing because this is just a plan. Now
I will simply run this plan: display DF2. So you know that in DF2 we are not creating any new column, we are not creating anything extra; we are simply applying a filter transformation, that's it, on top of DF1 which is already created. Very good. Let's see what will happen.
Okay, we got the result. Very good. So let me just show you the plan, like what exactly happened. I will simply say df2.explain(). Okay, very good. So it has created a plan: first of all it read the data (a scan of the existing RDD), then it applied the filter — ID is not null — because obviously the filter should be applied as an early step, and then it projected ID, name and flag. But why is it creating this flag column? You will say, where is it creating a new column? See, this is column creation — it is saying "yes AS flag". Why? In DF2 we are not creating any new column. The reason is that it is not using DF1 directly: DF1 was created and stored in the short-term memory, but the moment the job completed it was eliminated. Then we used DF2, and Spark knows that in order to create DF2 it needs DF1, and we don't have DF1 anymore because it was eliminated. So it simply uses the DAG to recompute that data and then uses it for DF2. That is why in the plan you see this "yes AS flag" step instead of just the ID, name and flag columns — it is reapplying the transformation, right? Very good. So this is the reason I was stressing that it basically recomputes the data based on the DAG; it is recalculating DF1 every time. Okay. Let me just show you the
magic. Okay. Let me just show you the magic and you will see the difference. So let's say you are displaying this DF.
Okay, that is fine. Now let's say that before creating DF2 I call df1.cache(). And let me just rerun all the things, just to start fresh. Okay, this is done, this is done. Let me cache it, and let's display it. Okay, this is done. So DF1 is now cached. Let's rerun DF2 and display it — obviously the output will be the same, because the output does not change, but the query plan will be different this time. This time you see the difference: it is not calculating the flag column anymore; it is simply performing an in-memory table scan. That means it is just fetching the details from memory, because our data frame is stored in memory this time, and we do not need to worry about flag — it is now simply using flag as a column, not saying "yes AS flag", not creating any column, just reading it via the InMemoryTableScan. This is the power of caching: you can save your data frame if you want to reuse it in your code multiple times.
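A minimal sketch of that caching step, under the same illustrative setup:

```python
from pyspark.sql.functions import col

# Mark DF1 for caching; this is lazy, so trigger an action to materialize it
df1.cache()
df1.count()

# The same filter now reads from the cached data
df2 = df1.filter(col("id") == 1)
df2.explain()   # the plan should now contain an InMemoryTableScan instead of re-deriving flag
```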
Another tip: you should not cache bigger data frames. You should only cache data frames which are small enough, otherwise executor out-of-memory trouble is waiting for you. Make sense? Well, not exactly an error — Spark may just not keep all of your data in memory and spill part of it to disk, and then it keeps going to the disk to get the result, which is not efficient, right? Or it may simply not honor your request to cache the data at all: if it doesn't have enough space in the storage memory, it will recompute it every time, and that's not efficient either. So you should only cache a data frame if it is small enough to easily fit in memory, and if you are reusing it many times. Then, just to save time, you can simply cache it. Okay. So let me show you something else. Let me just
show you the job of this. Let me just show you let's say job 15. Let's click on this and just expand it. And this
time just go to the result of this one this job
and yeah perfect. Just click on the storage. Just click on storage. What are you seeing? You should see your data
cached. This is your cache data in the storage. This is your storage memory. And you can see that eight partitions,
cached partitions, fraction cached 100%. That means it is fully cached. Size in memory it's just like 3 KB very small
data. Size on disk zero. What is this size in memory and size on disk? What is this? So basically this thing will be
much more clear when we look at persist. Now what is this persist, man? We understood cache. What is persist? It is closely related to cache — actually, everything is persist. Cache is just a special version of persist. "Ansh, don't make things complex. Just explain it to us." Okay, okay. So you understood the
concept of caching. First of all tell me this. You understood it. Very good. Now let's talk about the storage level and
this size in memory and size on disk like what is this and what's the difference between all these things.
Basically just to give you a hint whenever you want to perform caching okay we have so many options because we
have memory and we have disk we have both the options right so we can prioritize where we need to cache our
data where we need to store our data oh so it's just for like flexibility to store your data which is cached exactly
that's it. So let me just tell you in detail what that is. Let's talk about the storage levels — you can also say persistence. Basically, caching is not an independent thing; it is just a kind of wrapper, a special use case of persist. So originally everything is persist, everything. Okay. But what happens when
we use persist we get a lot of flexibility. We get lots of storage levels that we are just discussing right
now. One of those storage levels is MEMORY_AND_DISK. When we say that we want to persist our data frame with storage level MEMORY_AND_DISK, it becomes df.cache(). Why? Because Spark identified that people were using the memory-and-disk storage level a lot — obviously it is fast, it is efficient, and it gives you the flexibility to store your data on both sides, memory and disk. And as you can see, it will first try to save your data in memory. If it can save the data in memory, it will; otherwise, if memory is not enough, it will spill the rest of the data to the disk. That's it. Make sense? Because you're persisting your data in both areas.
Memory and disk, both. Very good. So it becomes cache — it is so popular that instead of writing the longer persist code we simply write df.cache(), and in most scenarios you will also be using df.cache(), trust me. Okay, very good. What are the other storage levels? The next one is MEMORY_ONLY. Now MEMORY_ONLY
means you cannot spill your data to the disk. So you will ask, what if our data does not fit in the cached area, the storage memory? It will simply recompute the rest of the data. It will try to cache as much data as it can, and the rest will be recomputed. Make sense? So data is stored in RAM as deserialized Java objects; if there is not enough memory, the partitions are recomputed when needed. Simple. Now this becomes really
controversial, because a lot of you will say, hey, MEMORY_ONLY is what caching means — when we say df.cache(), we are saying MEMORY_ONLY. But some of you will say, no, no, it means memory and disk both. So let me clear up the confusion. If you are using the DataFrame API, df.cache() means the MEMORY_AND_DISK storage level. If you are using RDD code — and I don't know who is still using RDD code in 2025, because RDDs are not recommended and not optimized, though in the rarest scenarios you can use them for low-level transformations — then rdd.cache() means MEMORY_ONLY. So do not get confused. And obviously we write our code in the DataFrame API nowadays, so it becomes MEMORY_AND_DISK, not MEMORY_ONLY. Update your information — I think we stopped using RDDs about a decade back. So, by the way, the next
one is DISK_ONLY. This is really nice because you do not need to worry about memory — just save your data on the disk — but it's really slow, okay, because it stores all the data on disk. The next storage level that we have is MEMORY_ONLY_2. What is this? It is exactly the same as MEMORY_ONLY but replicated two times, because by default we get one replica. It is just for fault tolerance. Okay, I rarely use the two-times-replicated levels, but
you should know about them. Okay, very good. Next is OFF_HEAP, which is still considered experimental. I was talking about off-heap memory earlier, right? And I told you that we rarely use it; by default it is zero. But in some cases you can utilize off-heap memory to cache your data, just to put less overhead on the GC cycle, because obviously when you cache your data you create a lot of temporary objects. But off-heap memory is not easily maintained — you need to manage it yourself, the Java virtual machine will not maintain it. Plus, first of all, you need to enable it with the relevant configuration, and yes, it is experimental, so you can experiment with it if you want; I have not. So these are all the storage levels that we have in Spark, and I hope it is now clear. Let me just show you the code as
well. Let me just show you. So we are in our notebook and let's run the code for one more time and instead of using
df.cache(), let's use df.persist(). And we will simply say StorageLevel.MEMORY_ONLY. Okay, let's run this. Oops — StorageLevel is not defined. Oh, we need to import it. What's the library for StorageLevel? Let me just look it up. Okay, this is the class: pyspark.storagelevel.StorageLevel. Let's import this class and run it again. With the from-import I don't have to write the long, fully qualified name again and again. So okay, this is now done. We have just persisted our data with MEMORY_ONLY.
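Roughly, the cell ends up looking like this sketch (the unpersist at the end is shown here for completeness; in the video it comes a bit later):

```python
from pyspark.storagelevel import StorageLevel  # `from pyspark import StorageLevel` also works

# Persist DF1 in memory only; partitions that don't fit are recomputed, not spilled to disk
df1.persist(StorageLevel.MEMORY_ONLY)
df1.count()                     # an action is needed to actually materialize the persisted data

df2 = df1.filter(df1["id"] == 1)
df2.explain()                   # should show an InMemoryTableScan step

df1.unpersist()                 # release the cached/persisted data when you are done with it
```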
Okay, let's do the same stuff and create DF2 again. Let's check the explain output — obviously we should see an in-memory table scan. Very good. And let's see the job, and how it is saving our data frame. Go to Storage, and this time you can see we have another entry: this one is "Memory Serialized 1x Replicated", while the other one was "Disk Memory Deserialized 1x Replicated", which is the default mode when you just run cache(). This time we specifically said we want fast results and our data frame is small, so we store it in memory only. So that is the difference. And now you will ask: what if we just want to unpersist it, or, let's say, uncache it? There is nothing called uncache. If you want to uncache — that is, unpersist — you simply write df1.unpersist(). That's it. Simply run this. And you can now check the Spark UI one more time if you want to. Let's say you run display(df2) one more time. Okay.
Okay, perfect. Let's run this explain method for one more time. Perfect. So, do we have the explain method? Did we
run display DF2? Yeah. So, we did it. So, let's see the job. Now, go to storage. Yeah,
perfect. So this one is removed — the one that we just persisted — and now the only entry left is the previous one that we didn't uncache, or rather unpersist; it still shows the fraction cached and the other details. So if you want to uncache something, that simply means you unpersist it. Make sense? Very good. So I hope you can now play with this persist method, try different storage levels, and see the difference in your performance. So now let's see what we have next. Let's talk about the edge node. What is this edge node? So basically, so far, whatever we have discussed, we were imagining that we as developers are directly communicating with the cluster manager. Make sense? We were directly communicating with the cluster manager, or resource manager, and it was just allocating us resources. That's
very good. But just tell me one thing: you are a senior data engineer, and a new data engineer or intern has just joined the company. Will you allow him or her to talk directly to the cluster manager? Obviously you cannot, right? That's why we have something called an edge node. And before talking about the edge node, we need to clarify that we have two different sides — the client side and the cluster side — think of it like team A and team B: the client team and the cluster team; this leads to client mode and cluster mode, which we'll cover in a moment. So what happens? This is you as a developer. Instead of talking directly to the cluster manager for all the resources, for all the stuff, what do you do? You talk to a machine which is called the edge node. It can be a virtual machine, it can be any machine, which acts as the edge node to talk to the cluster manager. You log in to this machine and you type your request, and this machine has access to the resource manager: it will talk to the cluster manager, and the cluster manager will assign the resources for your work. So simple. This is the edge node. Now this gives birth to a topic called client mode versus cluster mode. Basically, we have two different kinds of deployment modes: one is client mode and one is
cluster mode and both are so simple. Let me just explain what's the difference. So basically we know that we do not talk
directly to the cluster manager. This client machine will talk to the cluster manager. So let's say you are deploying
your application in client mode. What will happen in that scenario? We know that the cluster manager first of all creates the driver, right? So what will it do? It will simply create the driver on the client machine. Oh — that's why it is called client mode. And where are the workers? Obviously on the cluster; there is no negotiation on that. So these are the worker nodes. This is your client mode, because your driver node runs on this
client. Okay. Now you are using a cluster mode and in most of the scenarios you will be using cluster
mode. Okay. But in some cases you use client mode as well. Okay. In some cases, in most of the cases, you just
use cluster mode. Okay. So, let's say you are using cluster mode. What will happen? You know better than me. I think
so because I have just explained this so many times. So, I hope now you know. So, what will happen? It will simply send
the request to the cluster manager (resource manager), and it will simply create the driver on one of the machines in the cluster, and the rest of them will be worker nodes. Make sense? It's a simple approach.
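For reference, the deployment mode is normally chosen at submit time; here is a small sketch, with the spark-submit line shown only as a comment since the cluster details and file names are hypothetical:

```python
# Typically chosen when submitting the application, e.g.:
#   spark-submit --master yarn --deploy-mode cluster my_job.py   (or --deploy-mode client)
# From inside a running session you can usually inspect it; the key may be unset on
# some managed platforms such as Databricks.
print(spark.conf.get("spark.submit.deployMode", "not set"))
```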
What is the difference, what are the advantages and disadvantages, and when should we use which one? So the thing is, when you are using client mode, you are running your driver on the client machine. What if that client machine is turned off? Let's say you switch it off. What will happen? The whole application will break. Why, bro? This machine is running the driver program, and the driver program is the heart of the whole application. You switched off the heart — how can the application survive? No way. So that is why client mode is more error-prone. But what is the advantage? The advantage is when you are developing your application: you will be using client mode in your dev environment. You can simply write the code and actually see the output on your own machine, you can see all the logs, everything. So it is very handy when you are working in the dev environment, but in the prod environment it is not recommended. Why? Because this client machine is not inside the cluster's network — it is a totally different machine — so in client mode network latency will be higher, and we do not want that; we want our applications to run really quick. These are some of the advantages and disadvantages; you can even Google more, and now that you know the concept you can actually understand them — without the concept you could not. The major ones are these, and obviously there are more, and it is subject to your specific architecture and project which approach you want to use. It is too subjective for me to say one is bad and one is good; you just need to pick the right one for the right architecture, the right project. That's it. Okay? See, it was so simple: client mode and cluster mode.
Now let's talk about partition pruning. I would say that even before talking about partition pruning we should talk about pruning and partitioning independently, but we will cover these topics together to understand them better. So first of all, what are partitions? And don't worry, I will show you this in code as well. Let's say this is your executor, and this executor will be writing your data to a repository — any kind of data lake, any folder — because obviously whatever data you are processing, you will be writing it to a destination, right? Okay, makes sense. So this is, let's say, the destination repository or destination location. Simple. Now, we do not write all the data into one single folder. What? Yes, we do not do that. No,
instead what we do we create partitions. Oh okay we create partitions. Let's say your data frame
has a column called category or department anything. Okay. So what we will do instead of writing everything
inside this one folder we will create independent folders. Let's say you are just creating a data frame for
departments. You have IT department. So we will simply create one folder for IT. Second folder for let's say HR. Third
folder for finance and so on. Okay. So this will be independent data which is partitioned based on a column.
Okay, makes sense. And by the way, this partition is not the same thing as your in-memory partition — here I am talking about partitioning while writing the data to a destination table or destination folder. Okay. So you will say, okay, we understood it, but what's the advantage? Why do we need to
just do this? Why we cannot write all the data in a folder? You can but this is a kind of optimization technique and
what kind of optimization we get after this. So let's say this is your destination folder. Okay or let's say
destination parent folder. Destination parent folder. So you want
to now query this data. What you will write? You will simply write let's say df equals spark read blah blah blah. You
are simply reading this data. Okay. You are simply reading this data. And while reading this data you are applying a
filter simple filter like let's say dot filter um just for the HR department okay this is just a demo code pseudo
code okay do not say hey is this syntactically right I know it is not so let's say you are just creating a data
frame okay and you just want to fetch the data records for HR department only make sense very good so what spark will
do if you do not have the partitions Okay, let's say these partitions do not exist. Okay, let's say this is the only
folder. Let's say this one. This is the only folder which is available now. So this particular data frame, this spark
will simply go to this folder and will read all the data all the data. Let's say you have 10 GB of data, 10 gigs of
data. So it will read 10 GB of data and you are obviously spending resources to read that data obviously. Then after
reading this data it will apply filter. Make sense? Common sense bro. Okay. Make sense? Now if we have
partitions, if I have partitions, what I will do? What I will do? I will simply say this is for HR. Okay. This is for
HR. This is for IT or let's say finance. Okay. And let's say this is for IT. Make sense? So I know and spark also
knows that this folder is just for HR. This folder is for IT. This folder is for finance. So it will directly go to
this folder instead of going here and here. Make sense? Let's say your HR data was only and only of one gig that is 1
GB. So you saved 9 GB of processing. Obviously, bro. Obviously you saved that scanning part of 10 GB of
data. You simply scanned 1 GB of data. This makes your Spark jobs run so quick. So that's why we always perform
partition. And this concept is called partition pruning. How? Because you are pruning the partitions. You are not
fetching all the data. You are pruning the partitions. Make sense? Make sense? Do
you want to see this in code? Let me just show you. So I'm in my new notebook and let me
just connect this with the cluster. So basically I have prepared a demo data frame. So obviously simple data frame
that we are using so far. In this I have added one column which is department. Okay. Just to align you with the
original example. Very good. Okay. Very very very good. And these are the columns. And this is the spark code.
Simple. And this is the code that we use to write our data to a location — I will provide the location right now. What is this code? df.write, then format("parquet"), because we want to write our data in the Parquet format. Parquet is one of the most popular columnar file formats for big data analytics. Then partitionBy("department") — this is the statement that performs partitioning based on a column, and you will actually see the folders, I will show you, don't worry. And then the mode — it basically tells Spark what to do if data is already there; it can overwrite, append, and so on. We have several modes. Okay.
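As a rough sketch, the write and a pruned read look like this; the DBFS path is a placeholder for wherever you point the output:

```python
from pyspark.sql.functions import col

# Placeholder output location -- adjust to your own workspace
output_path = "/FileStore/apache_spark/output_data"

# Write as Parquet, one sub-folder per department value
df.write.format("parquet") \
    .partitionBy("department") \
    .mode("overwrite") \
    .save(output_path)

# Filtering on the partition column while reading lets Spark skip the other folders
df_hr = spark.read.format("parquet").load(output_path).filter(col("department") == "HR")
display(df_hr)
```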
So let's actually provide the path. So in order to provide the path I will simply go to catalog and I will go to uh
not to catalog basically um I can simply go to this catalog just to see the DBFS. So I have file store. Okay. Oh I have so
many folders. I know I know I know. So I will simply create this one. Apaches spark. This is the parent folder. Let's
say I want to create a folder inside this. Let's say this one. Make sense? Makes sense. Okay, let's create folder
within this called output data. And within that output data, you will see the partition. Let me just show you. So,
I will simply grab the path from here — this one. And you need to make some tweaks in this path, so do not just paste it directly: you need to remove the dbfs: prefix, and I think you also need to adjust the leading slash. Let's see; no worries, if it throws an error we can just correct it. I will simply append the output_data folder to it. Make sense? Very good. So I think we are good. Let's run this. If we see any errors, I can simply remove that slash, because I think we need to; if not, then it should be fine. This is basically an optimization technique, but obviously competition is high right now, so it's my duty to
show you and tell you everything. So it is done. Let me just show you what we got. It has written the data successfully. And if I go to DBFS and look at output_data — click on this — perfect. See, I have folders: department=finance, department=HR, and so on. It is so good. And within these folders you will see the data — see, these are the Parquet files. Wow. And I will show you something else as well: I will create another copy of this data, writing it to another location, but this time I will not use partitions.
So I will simply change the output path and call it something like output_data_without_partitions — the same code, just a new variable, output_path_new — and I'll change it in the write code as well, because I want to show you the difference while scanning the data. Okay, so this is also done, and we can check: go to Catalog, DBFS, FileStore, the Apache Spark folder, and compare output_data with the new folder without partitions. In the new one we should not see those department folders. Perfect — we just see the part files, numbered 0 through 7, eight of them. So, by the way, why eight files? Why? Because we have eight cores in our executor — obviously, bro, I just showed you. If you click on the cluster and then on the configuration, you will see that we have eight cores, and each core is responsible for parallelism. So by default we get eight partitions. These eight partitions are in-memory partitions, whereas the partitions we created in the form of folders are partitions for storing the data on disk. Okay. And if you click on the Spark UI and then on Executors — it may take a minute to load — you can see cores = 8. So we have eight cores. Make sense? Very good. That's why it created eight partitions by default: each partition is responsible for one output file. Okay, very good.
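If you want to check those numbers yourself, here is a quick sketch (names as used above):

```python
# Default parallelism usually matches the total executor cores on a small cluster like this
print(spark.sparkContext.defaultParallelism)

# Number of in-memory partitions of the DataFrame, which drives the number of output files
print(df.rdd.getNumPartitions())
```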
So now, are you ready to see the magic? Let me just show you the difference. Let's try to read the data, one location at a time. I will create df_with_part and simply say spark.read.format("parquet").load(...), and the location — we already know it, it is output_path — so we can provide it in the form of a variable. Let's read it and display df_with_part. Okay, makes sense. So basically now it will simply load all
the data. Why? because we are not pruning our partitions. We are just creating the partitions but we will just
prune our partitions with the predicate. So just click on this click on this job and if you just expand it you will see
the SQL DataFrame entry. Click on it and you will see the Parquet scan node. If you click on that, you can see how many files were read: six files. And the number of partitions read: three. So these are the partitions — IT, HR and one more, I think finance. And the total number of files read is six, because each partition folder has, I think, two files. I can quickly check to confirm: Apache Spark, output_data, HR — yeah, one, two; and the same for the others. So it read all the partitions, right? Very good. Now, if I perform pruning — how? I can simply
say .filter(col("department") == "HR"). Make sense, friends? Let's run this one more time and see what happens. Error — very good, because we didn't import the functions from pyspark.sql.functions. Okay, let's add the import. Perfect. Nice, nice stuff. Okay, let's run this finally, and you will see why it is so important and why it is a hot topic in interviews, because it is really
important for you as a developer. So let me just check the SQL DataFrame entry — this is our data frame. How many files has it read? Only two. Very good. Number of partitions read: only one. So it pruned the partitions
instead of reading all the six partitions. Now we have the liberty. Now we have the power to limit the file
scans, because we have partitioned our data. Make sense? Very good. Let me just show you one thing. You will say, "Ansh, this is also possible in the other scenario as well." No. No. Let me just show you. I will run this whole thing one more time, and this time I will read the copy without partitions — the output location will be output_path_new. Same code, I'm not changing anything, I'm still applying the filter. Okay, let me just show
you how many files it will read. Every time it will read all the files, all the eight
files. You don't trust me? Now you will. So let me just display df_without_partition. It has filtered the data, obviously, but it read all the data — it scanned all the files. You will say it's not a big deal; yeah, it's not a big deal here because you are working with a very small data set, but when you have terabytes of data it is not a small thing. Okay, check the SQL DataFrame entry and confirm how many files it has read: it has read all seven or eight files — I do not remember the exact number — basically all of them. Why? Because there is no partitioning applied to it, so it cannot be pruned. Make sense? Very good. And the size of files read is about 7 KB here, whereas in the partitioned case it read only about 1,720 bytes. This is the power. Just mentally replace bytes with GB and KB with TB, and see the difference in resource cost. Bro, this is the power of partition pruning. Now we have a cool
concept called dynamic partition pruning. And you will ask, what's the difference — why do we need dynamic partition pruning? It is, you can say, one step ahead of partition pruning. Let me tell you what it is. Let's try to understand dynamic partition pruning; just stay with me and you will master it, trust me. Why do we need it? Trust me, it is just like magic. What magic? Let me tell you. Basically, we know that if our data is partitioned and we apply a filter on the partition column, only the selected partitions are read. What if I tell you that even if you do not apply a filter on the partitioned data frame, you can still read only the selected partitions? That is the concept of dynamic partition pruning. Let me elaborate. Do not get confused by the picture or the arrows; just follow my cursor.
Okay. So let's say there's a scenario where you have DF1 and you have DF2 and you want to apply a join and just for
the background understanding DF1 is a big table. It is a fact table in which you have partitioned your data as you
can see based on the column let's say department. Okay, make sense? Very good. On the other hand, DF2 is your dimension
table which is not very big which is very small table and you just have like non-partition data because it doesn't
make any sense to create partitions for that data. It is very small table. Make sense? Okay. Now you are applying a
join. Okay. Let's say you are applying a join df1.join
df2. Okay. Make sense? Make sense? Okay. Very good. Now in order to apply join it will simply scan
DF2. Obviously, obviously so to simply scan DF2, right? Agree with me, right? Okay. How hard is it to scan this file
because it is very small data? Just a few seconds, right? So, and it is very quick. So, that is fine. That is fine.
Okay. Now, how long it will take to scan this DF? And I just told you that this is a very big table. It is a fact table.
Oh, okay. And just for more understanding, each department holds, let's say, at least 1 GB of data: IT 1 GB, HR 1 GB, finance 1 GB and so on. There are many departments in this table; I have just shown four, but you can imagine at least 10 to 15 departments. Perfect. Now it will scan
all the uh departments obviously because there's no filter condition on it. Make sense? Very good. Now, now we are
applying a filter. Just listen to me carefully. We are applying filter on DF2. Okay. Okay. We are applying filter
on DF2. We are getting data only for HR and on DF2, not on DF1, only DF2. Hm. Okay, this is already a small
table. It will apply filter. It will get the data. Perfect. Now, DF1 will still scan all the
partitions because we are not not applying filter on DF1. Make sense? Very good. What if I say even if I do not
apply filter on DF1, I will still get the data only for HR. Let's say this one. And it do not need to it it does
not need to read all the departments. It will only read HR department. An are you kidding me? How can it be possible? It
is possible with the help of dynamic partition pruning. Now let me just explain you the flow how it will
actually do this. So first of all it will apply scanning on DF2. Okay let's apply the scan scanning. Okay, it
scanned the data. Okay, then it applied the filter. Okay, it applied the filter and it scanned the data only for the HR
department. DF2 is ready. Perfect. Now it will go to DF1 which is partitioned and it will apply file scan. Same thing.
Okay. Now before applying file scan, we will see magic. Our broadcast exchange will take place. And what it will do? It
will simply forward this query which is the result of this DF2 with filter equals to HR and it will be broadcasted
here and this will act as a dynamic filter. That means we transferred the filter from one data frame to other data
frame. Now this DF1 has a filter of department. Now just tell me will it read all the departments or just only HR
department? Obviously HR department you have just seen that in the previous example. This is dynamic partition
pruning. This is the whole concept. One thing to keep in mind: every time you want Spark to apply dynamic partition pruning, you need to make sure that the filter condition — this one — is on the column which corresponds to the partitioning column, because only then can it bring just the HR data. Make sense? Just tell me one thing: if you are applying the filter on name, how could it bring only the data for HR? It is not possible. So you have to apply the filter on the column which is also responsible for the partitioning in the other data frame. That is one condition. The second condition is that you have to include that particular column in the join condition as well. I'm not saying it should be the only column involved, but it should be part of the joining key. Make sense? I hope so, and I know you have almost understood everything about dynamic partition pruning.
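Here is a hedged sketch of the setup we are about to run, reusing the illustrative paths from earlier (the partitioned copy on one side, the unpartitioned one on the other):

```python
from pyspark.sql.functions import col

# Dynamic partition pruning is on by default in Spark 3.0+; you can verify the flag
print(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled"))

df_with_part    = spark.read.format("parquet").load(output_path)      # partitioned by department
df_without_part = spark.read.format("parquet").load(output_path_new)  # not partitioned

# Filter only the unpartitioned side; department is part of the join key
df_join = df_with_part.join(
    df_without_part.filter(col("department") == "HR"),
    on="department",
    how="inner",
)
df_join.explain()   # the partitioned scan should show a dynamic pruning (subquery) filter
```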
Let me show you in code as well; you will actually feel what we are trying to do. So I am back in my Databricks notebook and I am applying a join. Let me first show you the code for the join, and let me add a markdown cell here: "Dynamic partition pruning". Make sense? Very good. So what I'm doing is creating a new data frame called df_join which joins df_without_part and df_with_part, and we know that df_with_part is partitioned while df_without_part is not. The join condition includes this department column — see, department = department, and how="inner". Very good. One more thing: I have applied the filter on the DF without
partition as well. As you can see here, I have just done that. I have first pruned this data and I know this is not
partitioned, so it will simply go there and read all of that data. Make sense? But I have not applied any filter on the DF with partitions — no, that filter will be applied dynamically, which is dynamic partition pruning, right? So, let's run
this and let me just show you the data for DF without partition. So we have data just for the HR and this is without
partition data. Make sense? Very good. And if you just want to confirm how our DF with partition looks like, you can
simply say display(df_with_part), just to check how it looks. Hmm — I think we filtered it earlier, and that's why it is throwing an error, so we can just recreate it, not a big deal. I can simply remove that filter because it belongs to the previous example, and let me run this again. Okay, perfect. So we have all the data, right? We have all the data. Okay, perfect. Now let me just remove this extra display command. Perfect. Now we will apply the join and we will see
dynamic partition pruning taking place. Okay. And what it should do? It should obviously read all the files from
here because it is not partitioned. It will read all the files. Okay. Then it will read
only only and only HR partition from the partition data and we have not applied the filter.
It will be applied automatically, dynamically. Okay, let's run this code and see what happens. I create df_join and then say display(df_join). Okay, perfect — the join is applied. Let's see the query and what it has done. Let me expand the SQL DataFrame entry. Okay, perfect, let me check what it has done. First of all I go to the scan of the partitioned DF — df_with_part. How many files should we read? Only two, and only one partition, right? Let's confirm: number of partitions read, only one; number of files read, only two. Dynamic partition pruning, because the filter was forwarded from the other side. And how many files should we read from the other data frame? Obviously all the files, because that one is
not partitioned and see all the files, all the seven files. Perfect, perfect, perfect, perfect, perfect, perfect.
Really happy. Let me just confirm it one more time. Let me rerun it as df_join_new — why? Because I reran the earlier command and it was not recorded properly — just to confirm that this is also working correctly. Let's click on this, then on the SQL DataFrame entry, and confirm this as well.
Let's click on this. Oh perfect it is working fine bro. Wow wow wow. Number of files read only two and number of
partitions are only one. This is dynamic partition pruning and wow this is very nice. See the power
of dynamic partition pruning. See so this is your concept of dynamic partition pruning and I hope now you
understood all the things in dynamic partition pruning. Make sense? Now let's see what do we have next in this
ultimate Apache Spark guide. Let's see. Now let's talk about AQE — it's high time, because every time I speak highly of AQE, AQE, AQE, you ask: just tell us what it is. Adaptive Query Execution, AQE. It is a game changer, trust me. It was introduced in Apache Spark 3.0, and it is a game changer. Why? You will see. It is basically an advancement in the optimization techniques. Before AQE, data engineers were doing all the optimization on their own: all the partition handling, all the broadcast joins, all of it. Then AQE said, "Bro, take a chill pill. I'm here. I will do this for you." So it has some amazing offerings for data engineers, for senior data engineers. Okay, first of all let me tell you the areas where it helps, where it is a game changer. First of all, what it does: it dynamically coalesces the partitions. What does that mean? Let's cover this first. Whenever we apply a wide transformation — let's take the example of a wide transformation, because for that we know there will be a shuffle and many shuffle partitions — what does AQE say? It says: every time, you are creating
200 shuffle partitions — why, every time? Let's say I'm just performing df.groupBy on a column; every time I will get 200 partitions, right? AQE says, "Bro, what are you doing, man? Why do you need 200 partitions?" So it will simply coalesce these partitions, maybe down to just 10. It depends on the size, but in most cases it can coalesce them down to far fewer, because why increase the overhead? The more partitions you have, the more tasks you get (every partition creates a task), the more temporary objects are created, and the more frequently your garbage collection cycle runs — it ruins the performance of Apache Spark. So AQE says: bro, you do not need to create 200 partitions. It dynamically coalesces the partitions everywhere it finds an opportunity. A wide transformation is one of those areas — a sure one — because you do not need 200 partitions every time. Just take my example: I am using a few KB of data; I hardly need more than one partition. But whenever I perform a group by, or any wide transformation, it creates 200 partitions for me: 199 are empty, one is the main one. Why create 199 empty partitions? Why add so much stress to the GC cycle by creating temporary objects? Why occupy so many cores of my executor? I know the empty ones will be skipped, but still, why do that? So AQE says you do not need to; it simply coalesces the partitions. Let me just explain with the help of an
example. So let's say let's say you have a data frame. Okay you have a data frame in which you
have um let's so let's apply a group by. Let's let's do it. Let's apply a group
by. So let's say you are applying a group by okay and you are applying a group by on department. Okay. So let's
say for HR you have 100 records. Okay. For IT you have let's say 1,000 records. For
finance finance you have let's say 10,000 records.
Wow, you have 10,000 records. Okay. And there are more departments, right? Let's say for accounts you have, say, 50 records. And one more, a last one — which other department do we commonly have in the industry? Supply chain. So let's say another one is supply chain, and they have just 100 records. What will it do? Let's draw the partitions just to help you visualize everything. Let's say this
partition is like this. Okay, this partition is like this. This partition is like
this. This partition is like this. This partition is like this. Okay, let's draw it here. This partition
is like this. And this partition is like this. Okay, make sense? Make sense? Or let's
draw it here. Okay, makes sense. So what will it do? What do you think? It will obviously create 1, 2, 3, 4, 5 partitions with data. "Only five?" Calm down, calm down. It will simply create five partitions for the main work, obviously,
because one partition for each department. Okay, five partitions are fine. Makes sense.
Other 195 partitions will be empty as we know that will be empty. Okay, it will still still create those but it will be
skipped. Okay, that makes sense. Now, now it will simply perform the grouping and I'm just talking about without AQE.
But if AQE is enabled, what will it do? It will simply coalesce this partition, this partition and this partition into one, like this. So how many partitions do we have now? One, two and three. And then it will try to coalesce this one and this one as well. Why? Just to distribute the data evenly, to have similar-sized partitions, right? So it will end up with only two partitions: one is the big finance one, and the other combines IT, HR, accounts and supply chain. Make sense? So it simply coalesces all the small partitions into one, while the big one stays more or less as it is. Now it makes more sense. Tell me, does it make more sense? Obviously, according to me, yeah, it
makes more sense. So it tries to just have the partitions of similar size instead of just having a huge difference
between the partition size. Make sense? Okay. Very good. Let me just show you with the code as well like how we can
just see it. I have already shown you. I think I can just show you again like what's the difference between AQE when
it is enabled and when it is not enabled. Let me just so let's see the code. First of all, first of all, I have
written the code just to show you whether our AQE is enabled or not, because AQE is enabled by default since Apache Spark 3.0. I can even show you: if I run spark.conf.get("spark.sql.adaptive.enabled"), I see true. Now we will set it to false, because we want to turn it off for the demo. So I simply run spark.conf.set with false, and if I read the config again I should see false. Very good. That means our AQE is disabled, and now we will see how things behaved before AQE. Okay, very good.
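A sketch of those cells, with illustrative column names matching the demo data:

```python
from pyspark.sql.functions import count

# Check and toggle AQE (it is on by default since Spark 3.0)
print(spark.conf.get("spark.sql.adaptive.enabled"))
spark.conf.set("spark.sql.adaptive.enabled", "false")   # disable it for the comparison

# A wide transformation: group by department, count names -- this triggers a shuffle
df_grouped = df.groupBy("department").agg(count("name").alias("name_count"))
display(df_grouped)

# Re-enable AQE and rerun the same cell to compare the shuffle partitions in the Spark UI
spark.conf.set("spark.sql.adaptive.enabled", "true")
```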
Let's create a data frame and apply some transformation. Let's apply a group by: df.groupBy("department"), and then, let's say, a count of name. That's it. I will simply display the result, and I will import the functions from pyspark.sql.functions. Let's run this code and see what happens, because we are simply applying a wide transformation without AQE, so I can show you the difference. Click on this job and expand it. And now
simply go to SQL data frame. Okay, SQL data frame. And this is our job. Yeah, perfect. So now if you just click on
scan read, first of all, it read six records. Very good. Very good. Then this is the step where it performs shuffling.
Right. Click on this. Expand it. How many number of partitions you are seeing here? 200.
200. By default, 200 shuffle partitions. Wow. If you check the stages, how many tasks did it create? It created tasks in batches of 75, 100, 20 and 4 — that's 199 — plus this one partition doing the real work, so 200 in total. Can you imagine? 199 of those partitions are basically empty. And look at the time it took: the batch of 100 empty partitions completed in about a second, and then 0.9 s, 0.3 s, 85 ms and 0.2 s for the others. Make sense? In short, we got 200 partitions out of nowhere — well, we know exactly where they came from, but they should not be there, right? Now, if we just enable AQE, what will happen? What do you think? Let me show you. We can also look at the DAG: go to Jobs, click on completed jobs, and this is the job we just completed — you can see skipped, skipped, skipped for all those empty stages. Not a good thing. Okay, perfect. So now let me enable AQE and run it again. I will
simply say true and I will confirm if it is true. Okay, now it is true. Now let's do it
for one more time. Run this. Run this. Run this and let's see what will happen. Click on this and expand it and simply
go to the SQL DataFrame entry, and this is the job. Now you will see the exchange read six rows as before, and the exchange still created 200 partitions, but we also have an AQEShuffleRead node — can you see it? It coalesced these 200 partitions down to how many? Let's see: only one partition. Makes sense, because I just have six rows, so obviously only one partition is required. So it optimized the performance; it coalesced the partitions. Make sense? Very good. That is the biggest advantage of it. And you can even see how much less work there is: only one task instead of 200 with all the skips. We do not have 75 tasks, 100 tasks, 20 tasks, 4 tasks — we do not have 200 tasks, we just have one, and that is the only one we need. We do not want all that extra stuff just putting stress on the executors, right? So that is why it is important — the most important feature, I would say. Make sense? Only one task here, versus 200 tasks before. Very good. This is the power of AQE. And this is just one feature of AQE; let me tell you about the other two, and for those you do not need to actually do anything — they are applied automatically in your transformations. Now let's talk about another feature of Adaptive Query Execution, and you will say, wow, this is so good. So let's say you are
applying a join. Let me first tell you what this feature is: AQE dynamically optimizes the join strategy during runtime. Wow. What does that mean? If you observe the query plan, you will see something like AdaptiveSparkPlan with isFinalPlan=false. What does it mean? It means the query plan itself is saying this is not the final plan. But you will say: Ansh, you just told us about query plan execution — we write our code,
It gets translated into optimized logical plan. Then it gets converted into physical plan and then that
physical plan goes to the executors. Yes, that's right. But that physical plan is also not the final plan after
aqe because AQE says I can change the plan during runtime. How? Because it calculates it calculates something
called as query statistics during runtime. So let's say you are applying a join. Okay, you are
applying a join. You are applying join between DF1 and DF2. Okay, DF1 and DF2 both are big data frames. Okay, both are
big data frames. DF1 is also big. DF2 is also big. But and obviously if if both are big, what join strategy we'll be
using? According to the physical plan, we'll be using a sort merge join — a shuffle sort merge join, right? But do you know what AQE will do? According to the query statistics gathered during runtime, this DF2 may have become a small data frame due to the transformations that we applied, and that was only discovered at runtime. Maybe you used some UDF, whose output is also only known at runtime — all those things. So due to
some reason your DF2 is now very, very small. So AQE will automatically update the physical plan and pick the best plan at that time. And what is the best plan if we have one small table? A broadcast join. So it will automatically apply a broadcast join instead of a shuffle sort merge join. That is the concept. Got it? Let me just show you. This is the code that I have written for the join, and you are already familiar with this kind of code.
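A sketch of the join cell and of how to see what AQE decided; df1 and df2 here are the small illustrative frames from this notebook:

```python
# Join two DataFrames on id; with AQE enabled the physical join strategy can change at runtime
df_new = df1.join(df2, df1["id"] == df2["id"], "inner")
df_new.explain()   # look for AdaptiveSparkPlan, and BroadcastHashJoin vs SortMergeJoin

# The broadcast size threshold that influences the choice (default is around 10 MB)
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
```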
Okay, now let's create these two data frames and apply a join between them: df_new = df1.join(df2, df1["id"] == df2["id"]). Make sense? Simple. And for the join type you can pick anything, like left or inner — it doesn't matter much here. Okay, let's run this code and display df_new, and we know that our AQE is enabled. Let's see what join strategy it applies. So let me just click on
this. Without AQE, it would by default apply a shuffle sort merge join. With AQE, let's see what it applies. So it applied a shuffle sort merge join, because that's what it decided is the best join strategy for now. For a broadcast join, one data frame should be small enough to be distributed, and it's up to AQE which strategy it picks; if it picked a shuffle sort merge join, that means it considered it the best one for our two small data frames. One more thing to note: it is applying AQEShuffleRead here as well. So instead of keeping 200 partitions, how many do we have? Let's see — just one on each side. That makes sense. In the exchanges you will see 200 partitions and 200 partitions, but in the shuffle read it coalesces those partitions and optimizes them, and then it simply applied sorting on both sides and the sort merge join. So it depends what kind of join it picks, according to the query statistics. It is not like if you have a
small data frame every time it will apply a broadcast join. No, no, it depends upon the query
statistics. Make sense? Broadcast joins makes more sense when you have like one big data frame and one is like very
small data frame. Okay. So this is all about AQE optimizing the joint strategy. Let's look at the third major feature it
Now let's look at the third major feature it offers. Let me tell you what that is. Do you remember this image? I hope so. Remember when we were tackling data skew? This is the third major area where AQE plays an important role: dynamically handling skewed data. Yes, it is a great alternative to salting. That doesn't mean we should never use salting; salting gives you much more flexibility in how you deal with the skew (see the sketch below), but AQE also provides built-in support.
So let's say you have skewed data. What happens? Most of it ends up in just one partition. AQE can dynamically split your partitions as well: it will break that one big partition into several smaller partitions automatically for you. But it is not quite that simple; there are rules AQE has to follow. Roughly speaking, a partition is only treated as skewed when it is about five times larger than the median partition size and also above an absolute size threshold, so there are several statistics it considers before splitting a big partition. But yes, AQE does support skew handling, so it will help you work with skewed data as well.
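For reference, these are the settings that govern that skew handling. Again a hedged sketch: the property names and defaults below come from the Spark 3.x documentation rather than from the video, so verify them against your Spark version:

```python
# Let AQE split oversized partitions of a sort merge join at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# A partition is treated as skewed only if it is larger than
# skewedPartitionFactor times the median partition size ...
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")

# ... AND larger than this absolute threshold.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```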
That is why AQE is a game changer in Apache Spark: it handles so much for you. Can you imagine how data engineers worked before AQE? They did everything manually. After AQE, most of this is optimized for you. You still need developers who understand these things, but it has become far easier to work with them than it used to be. Of course, to use these features well you need strong conceptual knowledge, and I hope you now have it. I would say this is the complete guide for you: once you master all these concepts, you are all set to work with Apache Spark, because you now have an end-to-end understanding of what happens behind the scenes. Trust me, master these areas and you will become an outlier, because these are exactly the things that not everyone knows. Are you getting my point? I hope you learned a lot from this video.
What should your next step be? As I just mentioned, you also need to master PySpark coding, and I have created a dedicated video on it. Simply search YouTube for "PySpark Ansh Lamba" and you will find that 6-plus-hour course as well. So in total: 6+ hours of this video plus 6+ hours of that one, more than 12 hours of content. And a bonus: I have also created a video on Spark optimization techniques. That is the third video you should watch, because if you want to master Spark you also want to master optimization techniques; it is another three to four hours of video. That is your complete set: this video, the PySpark video, and the optimization techniques video. Simply search YouTube for "Spark optimization course Ansh Lamba" and you will find it. I have also heard that the going price for content like this, the 15 to 20 hours I have just put out, is at least 50,000 bucks.
I only just got to know that. I hope you respect the effort, and if you do, hit the subscribe button; if you have already subscribed, share this channel with others and comment on this video, because that helps the channel grow. You can also join the channel by clicking the Join button. That is all I can say: if you like my hard work and want me to keep making more and more videos, then let me know. I hope you enjoyed and loved this video, and I want to see your love in the comments. So drop your comments, and I will see you in the next video coming up on the screen. Just click on that video, and happy learning.