Overview of the Masterclass
This masterclass is designed to provide a thorough understanding of Apache Spark, focusing on key concepts and techniques essential for data engineering in 2025. The course covers:
- Spark Architecture: Understanding the master-slave architecture, driver, and executor roles.
- Transformations and Actions: Differentiating between narrow and wide transformations, lazy evaluation, and actions in Spark.
- Memory Management: Insights into driver and executor memory management, including handling out-of-memory errors.
- Dynamic Partition Pruning: Techniques to optimize data processing by reducing unnecessary data scans.
- Joins in Spark: Exploring different types of joins, including shuffle sort merge joins and broadcast joins, and their implications on performance (see the short sketch after this list).
- Caching and Persistence: Understanding how to effectively cache data for improved performance.
- Unified Memory Management: How Spark manages memory allocation between execution and storage.
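To make the joins bullet concrete, here is a small hedged sketch in PySpark; the `orders` and `countries` DataFrames are invented purely for illustration. By default Spark may pick a shuffle sort merge join for two large inputs, while the `broadcast()` hint ships the small side to every executor and avoids the shuffle.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-demo").getOrCreate()

# Hypothetical DataFrames, for illustration only
orders = spark.createDataFrame([(1, "IN"), (2, "US")], ["order_id", "country_code"])
countries = spark.createDataFrame([("IN", "India"), ("US", "United States")],
                                  ["country_code", "country_name"])

# Large-vs-large joins typically become shuffle sort merge joins
shuffle_join = orders.join(countries, "country_code")

# Hinting the small side broadcasts it to every executor instead of shuffling both sides
broadcast_join = orders.join(broadcast(countries), "country_code")

broadcast_join.explain()  # the physical plan should show a BroadcastHashJoin
```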
Key Concepts Covered
- Spark Architecture: Mastering the components and their interactions.
- Transformations: Learning about lazy evaluation and the importance of actions.
- Memory Management: Strategies to avoid out-of-memory errors and optimize resource usage.
- Dynamic Partition Pruning: Techniques to enhance query performance by reducing data scans (a small sketch follows this list).
- Joins: Understanding the mechanics of different join types and their performance implications.
- Caching: Best practices for caching data to improve processing speed.
- Unified Memory Management: How Spark allocates memory dynamically based on workload.
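As a small taste of the dynamic partition pruning item (covered in depth later in the course), here is a hedged sketch. The `sales` and `dates` tables and their columns are hypothetical, and the config key shown is the standard Spark one, already enabled by default in Spark 3.x.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpp-demo").getOrCreate()

# DPP is on by default in Spark 3.x; shown here only to make the knob visible
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Hypothetical tables: `sales` is a fact table partitioned by sale_date,
# `dates` is a small dimension table with an is_holiday flag
sales = spark.table("sales")
dates = spark.table("dates")

# The filter on the dimension side lets Spark prune sales partitions at runtime
result = sales.join(dates, "sale_date").where(dates["is_holiday"])
result.explain()  # look for "dynamicpruning" subqueries in the plan
```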
FAQs
- What is Apache Spark?
  Apache Spark is an open-source distributed computing system designed for fast processing of large datasets across clusters of computers.
- What are the main components of Spark architecture?
  The main components include the driver, executors, and cluster manager, which work together to process data efficiently.
- What is lazy evaluation in Spark?
  Lazy evaluation means that Spark does not execute transformations until an action is called, allowing for optimization of the execution plan.
- How does Spark manage memory?
  Spark uses a unified memory management model that dynamically allocates memory between execution and storage based on workload requirements.
- What is dynamic partition pruning?
  Dynamic partition pruning is a technique that allows Spark to skip reading unnecessary partitions based on filter conditions applied to joined tables.
- What types of joins are available in Spark?
  Spark supports various join types, including inner, outer, left, right, and broadcast joins, each with different performance characteristics.
- How can I optimize Spark jobs?
  Optimizing Spark jobs involves using techniques such as caching, partitioning, and understanding the execution plan to reduce resource consumption and improve performance (a short sketch follows below). For more insights on data processing techniques, check out our summary on Mastering Pandas DataFrames: A Comprehensive Guide.
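To ground that last answer, here is a minimal hedged sketch; the file path and column names are hypothetical. It also shows lazy evaluation in action: nothing runs until the `count()` at the end.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("optimize-demo").getOrCreate()

# Hypothetical input path and columns
df = spark.read.option("header", True).csv("/data/events.csv")

# Transformations are lazy: this only builds a plan, nothing executes yet
filtered = df.where(F.col("event_type") == "click")

# Repartition to control parallelism and file sizes (tune the number to your cluster)
repartitioned = filtered.repartition(8, "country")

# Cache if the result is reused by several downstream actions
repartitioned.cache()

repartitioned.explain()       # inspect the physical plan before running anything
print(repartitioned.count())  # the action finally triggers execution
```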
Additionally, if you're interested in the broader context of data analytics and its career prospects, you might find The Ultimate Guide to a Career in Data Analytics: Roles, Responsibilities, and Skills helpful.
This 6-hour-long masterclass will cover everything you need to know in 2025 from scratch, including Spark architecture, Spark DAGs and the UI, lazy evaluation and actions, narrow versus wide transformations, shuffle joins and broadcast joins, the Spark SQL engine and query plans, driver memory management, executor memory allocation, out-of-memory errors, unified memory management, the garbage collection cycle, edge nodes and deployment modes, storage levels with cache and persist, dynamic partition pruning, adaptive query execution, handling skew with salting, and much more. I spent a lot of time creating this complete guide, and I referred to so many books, documentation pages, and playlists to provide you the best condensed form of knowledge and a one-stop solution. If you found this video, that means it is made for you and will help you achieve your goals in 2025. There is one prerequisite for this video: you need dedication to complete it and commitment to yourself to become a pro. That's it. So if you can commit to this prerequisite, let's get started with the course. So, what's
up? What's up? What's up, my data fam? What's up? Happy Sunday. And there's a surprise for you today. What surprise?
This is the biggest surprise because I have wanted to create this Apache Spark ultimate guide for so long, and I think now is the right time to actually deliver it. I know, I know, I know, Spark is very much in demand. I know that, and I know that I have already created my PySpark full course video in which I told you how to write the coding part of Spark, you can say PySpark, using the Python API called PySpark. That is fine, I know that, I know that, but what is this Spark ultimate guide? Because the thing is, coding is fine. I know now you know how to code, and if you haven't watched that video you can simply click on the link in the i button and you will see that Apache PySpark full course video; that is for the coding perspective. Okay, but now in today's video we are going to master the conceptual areas as well. There are like so many advancements and
I would say if you are sitting in the interviews it's very important to have that coding knowledge as well. But in
order to you can say crack the interviews of big companies, you need to have strong grasp over spark
concepts. I can just give you some hints: let's say how you can read the DAGs, how you can perform dynamic partition pruning, how joins actually happen behind the scenes. You need to know all these things. Spark architecture, client mode, cluster mode, everything, everything. And this is one of the most important areas to crack any
big data engineering or data engineering interview regardless of any cloud platform. This is the most important
part. Yes, once you land as a data engineer, okay, once you land as a data engineer, it is fine. It is fine. um if
you just know the coding okay it is like I I wouldn't say like it is the ideal scenario but it will work for the junior
roles okay because you just need to write the code that's it but I would say even if you just want to crack a junior
role data engineer interview you cannot skip this part you cannot and obviously once you land as a data engineer it
becomes your bread and butter why because you will see on a daily basis you are actually using all these stuff
you write your code, very good; you are seeing performance issues, there you need this knowledge. You are, let's say, doing a POC, which is a proof of concept project, and you will see some performance things, you will see
monitoring stuff so for all that thing apart from coding part you need all these things you need to understand hey
I'm writing this code but what is actually happening behind the code you need to know all these things. So that
is why and I can say that this video will be very much in detail and obviously it should be because this is
one of the most important and, I would say, the most complex part. But don't worry, this is Ansh Lamba, and you know me, I will make all the concepts so easy, so simple to understand, that you will say, hey, Spark is so easy, Spark is so easy,
and that's what my goal is I want to make you understand all these complex concepts with ease you do not need to
worry at all. So I'm really excited for today's video, and let me just tell you about the prerequisites, and you will be very happy after knowing this. There are no prerequisites, nothing. This video is for those who are planning to become a
data engineer. This video is for those who are already in the data domain maybe data analyst, data scientist. This video
is for those who are like maybe software engineers or any any other um tech domain but they do not know anything
about data engineering. Yes, this video is for those. Plus, even if you are a data engineer, this video is for you.
Why? Because let me tell you, I know like there are like so many hirings for data engineers and
actually actually companies hire data engineers in bulk. Okay, in the previous years they actually hired data engineers
in bulk. Okay, now they might not know all the things. Let me be very honest with you, okay?
Because obviously in the technical rounds they were asking just the coding part, but now it's time to actually tune that code. Because even if you use AI, ChatGPT, you can only get the code: you will simply say how to apply a left join, you will get the code; how to apply a group by, you will get the code. But how to tune that? You need to know what to look at, what things to consider, how to make some tweaks with the memory optimization and everything, how to pick the right size of executor and compute, everything. So that is why it becomes even more important
because AI is here. So how you can just beat the talent and this is the way to do that because you need human
intervention to do all these things. Yes, you can replace a human with AI if you just want to write PySpark code, but
if you want to tune it, if you want to optimize it, you need experience, you need knowledge. And I I I am not saying
like AI cannot do it. If you want AI to do it, you still need human intervention because you will actually guide AI what
things to look at, what are your inputs and what needs you want to tweak. And in order to provide these parameters to AI,
you need to know all these things, right? Common sense. So I would say this video will help you to become AI proof
as well. Trust me, because we are going to go much more in depth into Apache Spark. Okay. So there are no prerequisites, and as I just said, even if you're a data engineer this video is for you. Trust me, trust me,
trust me. There are like so many things, bro. There are so many things. And trust me, even if I just try to revise the
notes of my Apache Spark, I still feel like, hey, there's so much to learn. There is so much to just revisit. Okay?
Because Spark is such a wide topic right now and we see new things in every upgrade. So do not feel like you are
already data engineer. You do not need to know anything in Spark. you still need to know a lot of things and trust
me when you will be just watching this video you will feel like oh this was new oh this was the right way to understand
it so I am really excited for this video and we are going to just like get started with this video so just hit that
subscribe button right now right now and right now and let me just welcome so many few like so many new members on my
channel because I was just waiting for this moment and let's welcome all the new members and if you also want to
become a part of this channel, you can simply click on the join button and you will be the part of this channel. Okay,
so here's the long list of our loyal data family members who genuinely support my channel and who genuinely
want me to continue creating more and more videos who genuinely contribute to my journey. So welcome welcome welcome
all and some are some are like recurring data family members as you can see some green texts uh Sai Yadiki Lakshmi Rabi
Dhanosh Louis Sai learn code Juliana so by the way if you see your name just drop a heart in the comments and then we
have AB Mani Shuraguna Tano Priya Naji Sepasu Krishna
Sepan Kelvin Nha G1 Fred each Barda, Jamuna G, Johi, macros and C30 solutions,
Anubria. So just just just drop a heart if you see your name and welcome all. And if you also want to contribute to my
journey to this channel, if you see this channel is fruitful and it's just contributing to your success journey
just reciprocate and just contribute to this channel and join this channel so that I will just make more and more
videos in future and happy learning to all of you. So now let's get started with our Apache Spark ultimate guide
course. Let's see. So first of all, Ansh Lamba, just tell us what is Apache Spark, because we don't even know what Apache Spark is, and even if we know, we want to hear it from you. Okay, by the way, and let me just tell you again, we
will try to understand all the concepts with you can say easy to digest approach. Okay. So instead of just
telling you the definitions written on Google or written on like books I will just try to make you understand like
what are the things because you can easily search the definitions right you are here to actually understand the
things, right? Okay, let's get started. So basically, Apache Spark, Apache Spark is a kind of group. Yes. So the thing is, as we know, data is growing rapidly and we are living in an era where we call data big data, because data has actually grown rapidly and obviously it will continue
to grow. So if we want to process our data, if we want to process our data, we want obviously compute
obviously. So earlier we used to rely on just one machine. Let's say you are just doing something on your personal laptop
or your personal computer. Okay, you are let's say doing anything even let's say you are just processing some data. So
earlier let's assume you were just processing only one small CSV file and you were easily doing all that stuff but
now that one CSV file is like transition into let's say thousands of files and now you have to process all that data
again okay on a daily basis obviously so obviously your laptop or personal computer cannot handle it or even if it
can handle now let's assume your files are transitioned into millions of files Now tell
me so in these kinds of scenarios your single machine is not efficient to handle that load. That is why we need
something called as a distributed approach. A distributed computing engine instead of just one computing engine.
This is your Apache Spark. So that is why Apache Spark is a kind of big data framework that we utilize on a daily
basis to process data in the distributed manner. You would have heard a term called massive parallel processing. This
is the term in which you are processing your data parallelly using distributed approach or distributed
manner. Make sense? So this is your Apache Spark where it just relies on a group of machines instead of just one
machine and that group of machines is called a cluster. Okay, a cluster of machines and
in Apache Spark language we consider machine as a node. Okay, so that is why you would have heard like nodes cluster
of nodes. That is why node is just a machine. Okay. So we will also use the same term like nodes. Okay. And cluster
for like group make sense. So basically Apache Spark is your that particular group which handles big data processing
easily. Easily that is Apache Spark. Make sense? Now let's try to understand like why do we need Apache
Spark? Let's try to see that. So why do we need Apache Spark? Why? Obviously you will say, Ansh, you just told us that we
need to process our data in distributed manner and all exactly why didn't you ask me a question that okay if I have my
laptop I can even upgrade my laptop I can even just add more RAM more CPU more you can say cores and more
computation power in short why didn't you ask me that question why so in order To answer this, I would just like to
mention something called as there are basically two different kind of approaches to handle big data. The first
one is monolithic and baby just take notes right away because we
going to consume so much of knowledge and I know that you are a super human. You will store all the information in
your super brain but trust me you will forget everything after one week. So I will highly recommend you to take notes
because I do care about you and it's really important to take notes. Okay? Whatever I'm just writing, just try to
make everything available on your notebook, on your digital notebook, anywhere but just take notes. Trust me,
it will really help you a lot. Trust me. Okay. Okay. I know I know I know you follow my guidelines. So I'm happy. So
first approach is monolithic and the second approach is distributed. Okay, monolithic and
distributed. Basically monolithic approach is the approach that I'm talking about right now like upgrading
your existing laptop, upgrading your only one single machine. So in this one what we do, we simply
upgrade the system like adding RAM. Okay. um you can say cores to just
one machine and it is also called as vertical scaling okay where we are just adding
more and more computation power to only one machine make sense okay and what is distributed this is Apache sparks area
it will simply say hey bro why you are just putting so much of load to only one machine why you are just adding so much
of RAM but hold on you need to ask this question again what's wrong with What's wrong with this? What's wrong
with this? Let me just tell you. Can you add thousands of RAM chips to your single machine? No. So, there's a limit
to scale your system vertically. There is a limit. After that, you cannot scale it further. This is one drawback. Second
drawback. Second drawback, let me just tell you. Second drawback is availability.
Availability just add L here. Okay. Availability. So let's say you are using a laptop and for any reason that laptop
is not working. Who will process the data? Bro, no one. Because you are strongly
relying on just one machine. There's so much like low availability of the processing power of the computation
power. So very low availability, very low. Make sense? Okay, very good. On the other
hand, distributed computing solves all these issues. It simply says, hey, instead of upgrading system, add
the systems. Add the system into the network. Add more machines. Okay, you
can simply say add more machines or nodes. Okay. Then second thing is this is called as horizontal
scaling. Make sense? And then high availability. Let's say you added 10 machines. You added 10 machines. One
machine is not working. Fine. We still have nine machines. We can still process some
data. Make sense? Make sense? So it is similar to let's say you got your holiday homework.
Okay. And you just need to complete it obviously within a limited time period and you will be the only one who will be
completing that homework. But luckily you have three to four siblings and luckily you have elder siblings. So they
can help you in just completing the homework. So this is similar to that one where you where you do not need to rely
on just one machine. You can simply distribute to the work and you can just do a portion of work. That's it. That's
it. This is Apache Spark. Now let's talk about is Apache Spark the only like big data processing engine that we have or
what was the scenario before Apache Spark how we were just working with big data before Apache Spark. So let me just
tell you about that as well. So let's talk about Apache Spark versus map reduce. So answer to your
question how we were working with big data before Apache spark. So basically we had something called as
Hadoop map reduce. Okay Hadoop map reduce it is basically not one word it is a combination of two words map and
reduce. You can say mappers and reducers. So we were working with map reduce instead of working with Apache
Spark, because Apache Spark was not there. Okay, because Apache Spark was introduced around 2009 in the AMPLab at UC Berkeley. Okay. And it was actually a research project to identify the drawbacks, or you can say issues, with MapReduce. And now it's a revolution in the world of big data engineering. And do you know the same team founded Databricks as well? Really? Yeah. So they, Databricks, are the original creators of Apache Spark. So let's talk about Hadoop MapReduce: we were using Hadoop Map
reduce and why we switched to Apache Spark. So basically the thing is map reduce was also doing the same stuff.
Yes it was also doing the same same stuff. It was also distributing the data to the machines and yes it was also
following the distributed computing engine or it was also like treated as a distributed computing engine. Okay. What
happened? What happened? So basically the thing is it has like two concepts map and
reduce. Okay. So map is responsible to map your data basically to distribute your data to the machine. Okay. And
obviously those machines will be processing the data and reduce was responsible to gather the information to
collect the information from those machines. So the thing was the intermediate result of this
processing intermediate results can be like your transformations. If you're filtering any data, you are applying
joins, you are applying, you can say aggregations, group by anything, all those things are intermediate results.
So it was writing all the intermediate results to the disk. Okay. So what's wrong with that?
So basically it was writing all the intermediate results to the disk. So
every time it needs to spend time to go to the disk obviously to to read the data to fetch the information and then
again deliver that data for the processing. So this to-and-fro movement to disk was time-consuming. It was saying, bro, it is really time-consuming. You cannot keep going back and forth to the disk. What are you doing, bro? What are you doing?
So then then Apache Spark came into the picture and then it said there's no need to involve disk. We will process
everything in the memory. In the memory basically RAM we will write all the intermediate
results in the memory. Boom. Hold on. Hold on. I know Apache Spark also uses disk. You do not need to
get confused. just consume the information that I'm giving right now because there is a procedure there is a
step-by-step guide how you need to consume the information right so for now just take this information okay so it
said we just need memory, we do not need to go back to the disk again and again, and we cannot afford that because it is really time-consuming; we know that memory is much faster than the disk. That is why we can say that in some cases, not every time, in some cases it is almost 100 times faster than MapReduce. Apache Spark can be up to 100 times faster if I compare it with MapReduce, not 100 times every time, but yeah, in some cases. And that doesn't mean it is otherwise almost similar to Hadoop MapReduce; no bro, it is very fast because it is much more optimized. Optimized, yes. How? Hold on, hold on, hold on, there are so many topics that we are going to cover today, so hold on. So this was the high-level overview of Apache Spark, like why did we switch to Apache Spark from Hadoop MapReduce. We don't know about the future, like which engine will be in use, but right now Apache Spark is the GOAT of the big
data engineering and yes it is like work handling all the things so so so efficiently your batch data your
streaming data your like continuous continuously flowing data everything everything everything so it's an amazing
engine and like all the organizations are leveraging spark. So that's why like there's a high demand of spark. So this
was your answer to the question like why do we need to use apache spark over map reduce and why did we introduce apache
spark. You got it very good. Very very very very very good. I know you have a lot of things in your mind. What are
RDDs? What are data frames? What are blah blah blah? Because obviously you would have heard those things right?
Even if you're just learning Apache Spark for the first time, you would have heard these terms because these are like
very popular terms. So don't worry, we will also discuss about these terms as well. So this is all about your Apache
Spark and map reduce. And now now now let's see what do we have next in our Apache Spark ultimate guide. So these
are basically some of the features that we some of them we have already discussed and obviously some of these
will be discussed later in this video. So, as we know, Apache Spark performs in-memory computation. Lazy evaluation, you will get to know what that is; this is one of the features that makes your queries and transformations optimized when you submit your code to Apache Spark. Fault tolerance: it does that with the help of RDDs. Don't worry, you will also get to know what RDDs are. Partitioning: you will also get to know what partitioning is. So these are some of the features that you should know; you have so many capabilities with Apache Spark, and on top of it you can also write streaming and batch processing together. So the thing is, Apache Spark allows you to perform batch processing and stream processing using the same API. Can you imagine? You do not need to change your code, you do not need to change your transformations; you can use exactly the same code and you will be able to do batch processing and stream processing together. That's amazing. Amazing. And you will say, is it performant enough? Yes, it is very much performant. Okay. So, this is a short glimpse of Apache Spark's features, with a small sketch below.
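Here is a hedged sketch of that "same code for batch and streaming" idea; the transformation function is invented for the demo, and the built-in `rate` source is used so it runs without any real data. The business logic is identical for both; only the read/write side changes.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

def add_double(df):
    # The same business logic works on batch and streaming DataFrames
    return df.withColumn("double_value", F.col("value") * 2)

# Batch: a static DataFrame
batch_df = add_double(spark.range(5).withColumnRenamed("id", "value"))
batch_df.show()

# Streaming: the built-in `rate` source emits (timestamp, value) rows continuously
stream_df = add_double(spark.readStream.format("rate").option("rowsPerSecond", 1).load())
query = stream_df.writeStream.format("console").start()
query.awaitTermination(10)  # let it run briefly for the demo
query.stop()
```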
Now let's talk about the backbone of Apache Spark and the backbone of all the concepts that we will be discussing further in this video. So you have to be very focused right now, because we are going to discuss the Apache Spark architecture. Okay. So before showing you the image, before mentioning everything about Apache Spark, let's cover some basic stuff first. So what is the basic stuff? The basic stuff is: the Apache Spark
architecture you can say is developed on top of a concept called master and slaves. That is why uh it's also called
as the master-slave architecture. Why do we call Apache Spark's architecture a master-and-slaves architecture? Because
in this particular architecture, in this particular framework, there will be one master. There will be one
master and other will be the slaves. Make sense? So what will be done by the master?
Obviously this master will be just sitting and passing some orders and make sure like everyone is
working fine and everyone is indulged in the task and everyone is doing their work. So that is why we consider this
architecture as master and slaves. So as you can see there's one master and another are slaves. It is just an
analogy. Okay. But it goes really well when you just trying when you are just trying to understand Apache Spark. So it
is one of those analogies which are written in the box as well. So that is fine. Master and slave architecture it
is actually built on top of this analogy. It is actually built on top of this. There are like many more analogies
that you can relate. Let's say manager and employees. Okay. But master and slave architecture is the best one and
is very much popular and you can also encounter many scenarios based on this in your interviews as well. So this is
just a glimpse. Let's go deeper. Now we know that in this architecture there will be one master and other will be the
slaves. That is fine. Let's go deeper. Let's go deeper. So basically if we just want to go deeper we need to understand
some technical terms of Apache Spark first of all. So as you can see on your screen right now, I have written
something called as resource manager. Let's discuss this first. So basically this resource
manager this resource manager is your master. Really? Yep.
It is your master. Okay. So you can treat this as your manager. Okay. Normal manager of
your organization. Okay. And we just call it as a resource manager here because it allocates the resources for
your work. Okay. It is a resource manager. H. Okay. So that means it will be managing resources obviously.
Then we have resources in the form of driver and workers. Driver and workers. These are two types of
resources that we have. H okay. And one more thing you will be glad to know that your
cluster is this one. Driver plus workers. Driver plus workers. So
basically and I think it is very much understood but still let me just tell you this is a driver. Okay, these are
the workers. Okay, makes sense. When it gets combined, okay, so these will become your cluster cluster
of nodes. Remember remember that we just distribute our work to the nodes. So
these are all your nodes. Okay. Now let's talk about driver. First of all, what is this? What is this? What is
this? Just tell us about this. What is this driver? Basically, we know that we have a
manager. We know that. But is it possible to work directly between your manager and workers directly? The answer
is no. Why? Because we need someone called as team lead. Team lead. Okay. Team lead's
responsibility is to orchestrate the work. Let's say you are an uh you are an associate, you are a developer. Your
team lead's duty is to allocate the tasks to you. Make sense? So just let's let's try to understand from the top
level. Your manager will get the work. Okay. From maybe client, from maybe any other business. Okay. Your manager will
allocate the resources and what will be the resources? Your team lead which will be called as driver and you with your
other developers will be called as workers. I know now you understood. Okay, makes sense. So your manager
allocated the resources like you and other developers plus team lead. Now team lead's responsibility is to see the
work that we that your manager got. It will just see the work and it will distribute the work among you among you
among your friend among your peer developer all the workers. So that is why we always have one team lead to
distribute the work to these workers because these workers cannot directly get the work from the manager. No, it is
not possible. Why? Because your manager may or may not understand the technical terms. Make
sense? Make sense? Similarly here as well. So all the technical terms will be understood by driver.
Those terms will be translated into better way to you by your team lead who will also call as tech lead or team lead
same thing because that person understands the technical terms. Similarly driver understands your code
that you are submitting. So it will distribute that work among these workers. Make sense? So
these are some terms that you need to understand. I know now you understood resource manager, driver and workers.
Make sense? I know now you have understood almost like the whole architecture. Trust me. Let me just tell
you the whole flow now like how it goes the whole flow. So this is the flow and let me just tell you what happens. So
let's say this is a person who is writing a spark code. Okay, make sense?
Okay, this person completed his spark application. Application is just like the code that it needs to submit. Okay,
so this person will submit the code to the who is who is this resource manager. Very good. So
this person will submit the code to the resource manager. Very good. This is a resource
manager. Make sense? Very good. There's one more thing. When the person will be submitting the code, Spark code, this
person will submit some more stuff as well. What's that? The information that it needs to provide to the cluster
manager such as I just mentioned that this resource manager will allocate the resources right h okay like one team
lead and workers. Now just tell me one thing. If your manager is receiving some work, okay, is receiving some work from
a client. Okay, obviously that manager needs to also get the workload as well like how many workers do you need? How
uh uh you can say talented workers you need because obviously obviously do you need senior developers? Do you need
junior developers? Do you need mid-level developers? So all that information will be provided by the
client. Hey I need three developers, three senior developers. Hey I need just three junior
developers. So all these things will be told by client told by client. So this person
who is writing the code will also tell that I want one driver. Driver is your team
lead. Okay. I want one driver or of let's say 10 gigs 10 GB and I want let's say three
workers and we just call call it as like executors. Don't worry I'll just tell you about executors as well. It's almost
the same thing. It's almost the same thing in the modern one world. It's almost the same thing. And uh so this
person is saying I want three executors. three executors of 10 GB each or let's say 20 GB each. Okay. So this
information will be provided by the person who is submitting the code, and this is very popularly done through spark-submit. Okay, spark-submit. So within this spark-submit command the person will write, hey, I need one driver, I need three executors, blah blah blah. What are executors? For now you can understand executors as your worker nodes. Same stuff. For now you can just understand it that way, it's the same stuff, trust me. Okay. So, one driver plus the executors: this request will go to the resource manager. What is in the request? Your code and all the resources that this person needs. Make sense? Okay, a small sketch of that request is below.
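Here is a hedged sketch of what that request can look like; the script name and the exact sizes are illustrative, and the Python part shows roughly the same settings expressed as configs rather than an exact command.

```python
# Typical spark-submit shape for the example above (illustrative names and numbers):
#
#   spark-submit \
#     --master yarn \
#     --driver-memory 10g \
#     --num-executors 3 \
#     --executor-memory 20g \
#     --executor-cores 4 \
#     my_job.py
#
# Roughly the same request expressed as configuration when building a session.
# Note: driver memory normally has to be set at launch time (spark-submit or the
# cluster config), because the driver JVM is already running when this code executes.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("resource-request-demo")
    .config("spark.executor.instances", "3")
    .config("spark.executor.memory", "20g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
print(spark.sparkContext.getConf().get("spark.executor.memory"))
```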
So what will this resource manager do? First of all, it will create one driver node. Why? Because it needs a team lead as the first thing. So it will simply create the driver node.
Okay. Okay. Like driver node. This is our driver node. Then then this driver node will be
connected to this resource manager. Okay. Okay. Let's connect this. So this will be connected to
your resource manager. Okay. So now these are these two are connected. Sorted sorted. Sorted. Okay. Now this
driver node will take the sheet from him from your resource manager and it will read the request. Hm. You need three
executors. H you need this this this. So it will read all the instructions and it will send that instruction back to the
resource manager and it will say hey you need to create or you need to provide me three executors. So this resource
manager will say okay I trust you. So it will simply create three executors on these machines 1 2 and
three. 1, two, and three. Make sense? Now, now what will happen? Now, driver node is there. Worker nodes
are there like worker one, worker two, worker
three. Okay, three workers are there. Driver node is there. We have all the resources. Resource manager is free. So
it will simply say now just take care of this these workers on yourself. So what will happen? What
will happen? This driver node will be connected to these workers now
and all because this driver node has the sheet, right? We know that this just took the sheet from him from the
resource manager. So it has this sheet here. So what it will do? It will simply read the sheet. H you have this code.
Okay. You need to apply filter. You need to apply join. You need to apply group by. Hm. Okay. Then it will simply
distribute all that work to these worker nodes and these worker nodes will be actually doing the
work. Actual work will be done by the worker nodes or you can say executors. For now we are just using these words
interchangeably. It's fine. Okay. So these work are done by worker nodes. What driver node is
doing? Orchestrating. Orchestrating. Making sure like all the
executors are working fine. All the executors are doing their work. Everything. Make
sense? Makes sense. So that's what your team lead does. Team lead actually does not write
your code. Team lead just a team lead is there just for the monitoring. It's just for the overhead. It's just for like um
whether you need any help, whether you need something, I'm there. So similarly, driver node is there for the
orchestration is for the distribution of the work. But actual work is done by the developers that are you worker one,
worker two, worker three. That's it. This is the spark architecture. This is the spark
architecture. Now you will say who is master here? So this is a controversial thing that I'm going to discuss. So
basically master slave architecture keeps on changing. What obviously just think
wisely when we had resource manager in the picture this was the master and all these machines were slaves. Make sense?
Very good. Now when we have a driver node and other workers and when these are connected who is master driver node
is a master because this is giving the instructions. Makes sense common sense bro and other are slaves. So that is why
master slave architecture also gets changed with the state of the cluster right now. Make sense? Makes
sense. That is all about your architecture. See, it was so simple. You just need right analogies. You just need
right examples and you just need someone who can explain things like this and that's all. That's all about life,
right? It's very simple. Very simple. Okay. So, this is the diagram that we have just made. You can just take the
screenshot and you can just save it in your notes. It will help you to understand. And this is the architecture
that is available in the official documentation, the official documentation of Apache Spark. Okay. And I can just show
you the documentation right now after explaining this image. Don't worry. Uh so I know that you can see lot of arrows
but trust me just give only 2 minutes on this image you will understand everything. Why? Because now you already
understood the concept. Let me just help you. So first of all what will happen? our code
our code. So this is the person who submitted the code. So this person will
submit the code to the cluster manager. Step one. Okay. What is step two? Step two is
it will simply create the driver program. Make sense? Make sense? Two. So now these two
are connected. See these two are connected. Just forget about other things. Just follow the things that I'm
just highlighting. So these two are connected. So this is two-way. Why? Because you remember it will simply go
back to cluster manager. Hey create two more executors. Three. Like in our example it was three. Just imagine there
were like two. Okay. It will simply say okay sir I know you are my team lead but still I'm just following your orders
because I think you are smarter than me. You are more technical smarter than me. So it will simply create two worker
nodes. one and two. So these are like step number
three. Okay, step number three. Now let's clear the confusion. Worker and executors. Basically worker node is just
a machine. Worker node is just a machine. On that machine we create something called as
executors. Executors as the name suggest executors which execute the work. We can host or we can install multiple
executors on one machine. But nowadays you can say in all the modern applications we just host one executor
in one one machine that is one worker node. So one worker node is actually holding one executor. So it makes sense
to say one executor or one machine, it's fine. Both can be used
interchangeably. Do not get confused. These are very small small things. You do not need to be confused in these
terms. Okay, executor worker node same thing. Okay, and as you can see here as well, we have like one executor within
one worker node. So it is fine. Makes sense? So now this is step number three. Now these worker nodes are directly
connected to this one, to the driver program. See, two-way communication. Two-way communication. So this is step number four. This is step number four. So whatever task this driver program will
provide they will just do it and they will return back the results. Wow. This is your architecture. This is your
architecture. See it was so simple. So simple. H it was simple. Yeah.
Okay. Makes sense. Makes sense. Understood. Understood. Now take notes and write everything that we have
discussed because when you will be writing you will think when you will think you will store that memory in your
long-term memory area. Okay. So this is your spark architecture and I hope now you know
this concept. This is the backbone of all the concepts trust me. Okay. Very good. I'm happy that you understood this
concept. Now let's see what do we have next. So this is the official documentation page of Apache Spark and
you can just go here. This is the link and this is the image that we used to describe because we should always use
you can say graphics which are there in the official documentation. So so that you can actually understand whenever you
will be just going there and you can actually read the documentation as well. And let's talk about some more technical
stuff from this documentation. Why? Because it is really important. We know that we just saw something called as
a resource manager, or cluster manager, same thing; that was the first thing we saw. So what are the popular cluster managers that we have in the market? We have some very popular cluster managers. Okay. And what are those? It can be Spark's own standalone cluster manager, or it can use Apache Mesos, or YARN; YARN is the most popular cluster manager. Okay. And it can also use Kubernetes. So these are the popular cluster managers that we have in the market. Make sense? Very good. This is the cluster manager, and here is a quick sketch of how you actually pick one.
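A hedged sketch of how the cluster manager is chosen in practice: it comes from the master URL you pass to spark-submit or to the session builder. The Kubernetes API server address below is a placeholder.

```python
from pyspark.sql import SparkSession

# Common --master / .master() values (one at a time, of course):
#   "local[*]"                    run everything in one JVM, using all local cores
#   "spark://host:7077"           Spark's own standalone cluster manager
#   "yarn"                        Hadoop YARN (very widely used)
#   "k8s://https://<api-server>"  Kubernetes (placeholder address)
#   "mesos://host:5050"           Apache Mesos (deprecated in newer Spark versions)

spark = SparkSession.builder.master("local[*]").appName("cluster-manager-demo").getOrCreate()
print(spark.sparkContext.master)
```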
Okay, we need to discuss one more thing. What's that? Basically, the Spark context. Okay, what is the Spark context? If you closely observe, it sits inside the driver program, and okay, we will also discuss these other pieces as well. But Ansh, first tell us about the Spark context, what is that? Basically, the Spark context is now replaced by the Spark session. Okay. So if you see something
called a Spark session: basically, the Spark session has the Spark context embedded within it. Let's say this is your Spark context. So we had a total of three contexts: the Spark context, the Hive context, and one more, the SQL context. Okay. So earlier we used to define different contexts; now we have condensed all the contexts and created something called a Spark session, in which we have all of them, the Spark context, the Hive context, and the SQL context. Ansh, you have said "context" so many times, what is the Spark context? Okay, okay. The Spark context, or the Spark session, because now it's the Spark session, is actually the starting point of Spark. Okay, we connect the driver node with the cluster manager, see, these two things, with the Spark context. So whatever we are submitting, okay, whatever code that person submitted, it went to the cluster manager, then the cluster manager created the driver program. Mhm. Then the driver program had the Spark context or Spark session, because we defined that in our code, and that session, you can say that connection, that starting point, was actually used to set up a connection between your driver program and the cluster manager. And it should be written somewhere here as well. So yeah: a Spark application runs as an independent set of processes on a cluster, coordinated by the SparkContext in your main program, called the driver program. So a Spark application gets coordinated with the help of the Spark context. It is basically the entry point, or the connection, between your driver program and the cluster manager or resource manager.
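Here is a minimal sketch of that entry point. In a Databricks notebook a ready-made `spark` session already exists, so you normally do not build one yourself; outside a notebook you would create it roughly like this.

```python
from pyspark.sql import SparkSession

# The SparkSession is the single entry point since Spark 2.x
spark = (
    SparkSession.builder
    .appName("spark-session-demo")
    .getOrCreate()
)

# The older SparkContext still exists, wrapped inside the session
sc = spark.sparkContext
print(sc.appName, spark.version)

# SQL / Hive style access also goes through the same session now
spark.sql("SELECT 1 AS answer").show()
```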
Makes sense. Makes sense. Makes sense. So you can actually read like so many things and here are like more
information about the cluster managers: standalone, Apache Mesos, Hadoop YARN, Kubernetes, okay, submitting applications and all of that; everything is written here, and don't worry, we will be learning all this stuff step by step. Okay. So I hope now
you have understood all the things related to spark architecture including spark context as well. Okay. So let's
see what do we have next. So let's talk about the driver node in detail. So as you know that when we have like so many
machines remember like we have like so many computers here listed and we decided that one machine will be treated
as the driver node. So what actually happens when we pick one machine and we say hey you are a driver node. Hey you
will be treated as or as a driver. what actually happens behind the scenes. So basically whenever we say that one
node or one machine will be treated as a driver node. So actually resource manager or cluster
manager creates something called as application master container or you can say creates a
container which is also called as application master. Simple. Let's say
bro remember we had like machines here and we decided that we will be just um picking this machine as a driver node.
So what resource manager or cluster manager will do within this machine? It will
install this thing application master container. It will simply install this AMC
application master container within this machine. Okay, make sense? Very good. So what happens when it creates this or
install this application master container? This is a container which actually is responsible for all the
orchestration, all the driver thing, all the driver program activities. What do we have within this? We have something
called as pispark main h. Okay. And then we have something called as JVM main. Okay. Then we have something called as
process. What is this? Hold on. So as you know that most of you will be writing your code in
pispark. Yes. But spark is written in Scala. Okay. Yes. So it is written in Scala. So that is why we add something
called as pispark main within your application master container. So this pispark main is
optional. Understood common sense you know the reason still let me just tell you if you
are not writing your code in pispark do you need this obviously no if you're writing your code in
Scala JVM is fine like JVM stands for Java virtual machine which is responsible to you can say interpret or
compile your code that you have written in Scala JVM can handle that pispark main will only be there if you are
writing your code in pispark make sense Okay. And let me just also tell you the flow. So basically you have
your spark core. Okay. This spark core has a wrapper for Scala or
Java Java wrapper. And on top of this we have a Python wrapper. Now you will say why why did we create the Python
wrapper? Why didn't we use Java? Because bro, there are so many Python developers. Okay. And in big data or you
can say in data domain, Python is the most widely used programming language. So in order to attract Python developers
to this engine, we added not me, the team added Python wrapper on top of this Java wrapper. Make sense? Okay. And that
is why we have something called PySpark main. So whatever PySpark code is there in your Spark application will get converted into the JVM main by a process called Py4J. Make sense? So you do not need to worry about how your code will be converted to Java or Scala; no, Spark will do it. Okay. And one more potential question: this PySpark main is called the PySpark driver. Okay. And this JVM main is called the application driver. Make sense? These are technical
terms that you need to just write down in your notes. Okay. Good. Understood? This is your driver node in detail. We
have just engineered that driver node. Okay. Let's try to understand worker node. Worker node is
very simple. Worker node is very simple. Worker node just has a JVM machine. That's it. Because obviously these are
executors. Okay. These just need to you can say execute the work. That's it. Okay. So basically you just install a
Java virtual machine on each worker node in order to just get your work done. Make sense? Okay. An Lamba why you
have left the space here? Why? Okay, let me just show you the magic.
Boom. What is this? So, basically I know that executors do not actually need anything other than JVM because all your
code will be converted into JVM by driver node and it will go to the executors. Yeah, makes sense. Why do we
need Python? Then this is a special case. this will not be there all the time. Okay. What is a special case? So
whenever you create a UDF. What is a UDF? UDF stands for user-defined function. Let's say you are not using PySpark functions but your own function written using def, let's say my_function, and you created something like this and you use it in your transformation. This is called a UDF. Okay? You use it for any reason. Okay? And you are saying, I want to use this. Maybe it can be the need of the hour in some cases when the PySpark functions are not performant enough. By the way, PySpark has a long list of functions that you can pick from. But still, sometimes you need to do something that cannot be achieved by the PySpark functions; you have to define your own function. I can understand that, no worries. So in those scenarios your executor, or your worker node, will install a Python interpreter, or Python worker, as well, not on top of the JVM but along with the JVM. Make sense? So that it can actually interpret the function that you have created.
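A hedged sketch of such a UDF; the column name and function are invented. The second version shows the same logic with built-in functions, which stay inside the JVM and avoid the extra Python worker.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# User-defined function: runs in a Python worker process on each executor
@F.udf(returnType=StringType())
def shout(name):
    return name.upper() + "!"

df.withColumn("greeting", shout("name")).show()

# The built-in equivalent stays in the JVM and is usually faster
df.withColumn("greeting", F.concat(F.upper("name"), F.lit("!"))).show()
```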
Because obviously you need an interpreter for the Python code to run at runtime, to execute the code. That is why it is always recommended not to write user-defined functions, and obviously you should avoid them if possible, because they add a kind of load on top of your executors: they need to run Python, it takes some space, and it's not ideal. So that is why you should try to avoid user-defined functions as much as you can. Okay. Okay. Don't
worry. So I know this was like a lot of knowledge for you but digest it, re-watch it, revisit it, digest it, take
notes, you will digest it. Make sense? Make sense? Make sense? Very good. By the way, just a you you can say out of
the box topic. Um, can you see the JVM here? Yeah. Now, these JVMs are slowly, slowly being replaced by C++-based executors
because those are very very very fast and they can just process your code natively instead of processing your code
in JVM. They can simply use the you can say resources like native resources. So they actually do not need to use JVM.
They can totally eliminate. For now it is not totally replaced. Um it is like slowly being replaced by C++ based
executor but that knowledge is like out of the box. Okay. And yes that's a great you can say initiative and I can see
those executors in Fabric. I can even see those executors in Databricks, called Photon. So I can see those C++
based executors but that is fine. that is just like out of the box or you can say the latest information that you
should have. Don't worry, don't worry. Don't worry, don't worry because fundamentals will remain the same. Don't
worry. Don't worry at all. Okay. Very good. So, this was all about your driver node and worker node in detail in
detail. Now, it's time to see some code implementation because obviously I know you are really excited to see um how we
can create the Spark session and all those things, right? So, now it's time to actually create our free Databricks account. Yes, because we will be looking at the code implementation as well, like how we can do that. So, we'll be leveraging the free Databricks Community Edition account, and we will actually write our code and we will see
all those stuff that I'm just trying to explain you because when you actually do it, when you actually see it with your
eyes, you understand better. Okay, let's try to create a free Databricks account. And it's very simple, it's just a few steps. But you need to take care of all the things that I'm showing here, because otherwise you will say, hey, I'm not able to create my Databricks account. So in order to create your free Databricks account, the steps are very simple. Simply go to Google and search "Databricks Community Edition". Okay, Databricks Community Edition. Then you can simply click on the Databricks Community link. Okay, makes sense. So by the way, the thing is, they have
added a new kind of Databricks account which is called, I forgot the name. So in that Databricks account it is automatically managed and you can simply use the cloud-versioned UI, but I would say just pick the Community Edition to learn all the things from scratch. Okay. So this is the link and let's try to search for that. What is it saying, bro? Remind me later. I also need to check whether it is the right one or not. I just simply clicked on this one. Uh, it is saying, would you like to log in? No, I don't want to log in. Let me just see. I think this is not the right link. Click on Databricks Community Edition login. Yeah, this is the one. So I will also try to provide the link, or you can simply copy it from here, community.cloud.databricks.com/login, and you will see something like this. Okay. Or you can simply search the URL like "Databricks Community Edition login". You simply need to click on this. Now obviously you would not have a Databricks account. Simply click on sign up. The moment you
will click on sign up it will simply say sign up for community edition. Okay. Now, simply put your Gmail account or
whatever account you want to just pick. Okay. And then simply say continue with email. The moment you just set up your
account, it will just ask for confirmation and all. Okay. Then come back to this page again because now you
have the account. Now you can simply use that email ID. Make sense? Make sense? Very good. Very good. Very good. Very
good. So this is our Databricks account. By the way, if you are not familiar with Databricks, Databricks is one of the most popular technologies right now in the world of big data. Okay. And as I just mentioned, the Databricks team are the founders of Apache Spark. So you know how well they can
just tune and you can say play with spark to just develop new and new things and that's why they are just leading the
industry right now. And I love Databricks, and I would say if you are totally new to Databricks: Databricks helps you to work with Apache Spark. It takes care of so much overhead; you do not need to worry about so many things, and don't worry, I will discuss all of these things. So it actually helps you to just start with Spark. For example, for now I had two options: set up Spark locally, or simply pick Databricks. It takes a lot of time and effort to set up your Spark cluster locally, and it's your responsibility to manage the cluster, okay, to start it, to stop it, to build the sessions and all. If you are learning, you need to save all that time. Databricks will say, you do not need to do anything, I will take care of all those things: set up your Spark cluster, set up your driver, set up your worker nodes. You simply need to come here and start learning. That is why they have created this Databricks Community Edition, just for the community to learn and grow. Just act wisely. Do not waste your time setting up your clusters locally and doing all that stuff that you will not be doing in the company, because in companies as well you will just be using Databricks or similar products to manage the clusters. That time is gone when you would be managing the clusters yourself. No, now you do not need to do that. Make sense? So just start with Databricks. So this is the Databricks workspace, or you can say the Databricks homepage, and within this we have all these tabs. You do not need to worry about that, because we are not learning Databricks. We are only focusing on Spark; we are using Databricks to run our Spark code. That's it. And one thing
that I would like to mention simply go to catalog. This is basically the area where we create folders, tables and all
the things. Okay. So in your case maybe you will not see this area, DBFS, which is the Databricks File System, or you can say the Databricks distributed file system. Okay. So in order to enable this, you simply need to click on this
button. Okay. And then you will see something called as settings. So just click on the settings button. This is
the settings button. Okay. Now just go to the developer. Maybe I also need to just check because I turned it on way
back. Just scroll down or it should be in advanced maybe. Uh yeah, I think yeah. So in
Advanced, you will just scroll down and you will see the DBFS file browser toggle. So you simply need to turn on this toggle so that you can also see that DBFS button, because we'll be uploading files, and I also want to show you a lot of stuff that you can use to learn a lot of things (a small sketch of browsing DBFS from a notebook is below). Okay. So simply turn on this button and you will actually see that DBFS button. So now let's get started with our first code. Okay. And in order to do that, I will simply go to the workspace and I will simply create a workspace. Okay. Make sense? So this is the workspace. Okay.
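Once DBFS is enabled, you can also browse it from any notebook cell using `dbutils`, which Databricks provides inside notebooks. A hedged sketch (the file name is hypothetical, and /FileStore is the usual upload location in the Community Edition):

```python
# Available automatically inside Databricks notebooks (not in plain local PySpark):
# list the DBFS root and the usual upload folder
display(dbutils.fs.ls("/"))
display(dbutils.fs.ls("/FileStore"))

# Read an uploaded CSV back with Spark (hypothetical file name)
df = spark.read.option("header", True).csv("dbfs:/FileStore/my_file.csv")
df.show()
```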
And now I will simply click on this create button and I will simply click on folder and I will simply provide the
folder name as let's say um spark or let's say Apache Spark. Okay, click on
create. So this is the folder that we have created for Apache Spark. Within this I will create um let's say oh not
this one. Just click on workspace. Yeah. So within the workspace just oh we created the workspace in the home tab.
No no no you do not need to do that. Simply go to workspace and then click on workspace for for one more time so that
you can actually create the workspace within the workspace folders. Okay. Simply click on workspace and click
on folder. So simply create a workspace. Wait wait wait wait. Create folder and Apache Spark. Apache
Spark guide just to distinguish it. Apache Spark guide create. So now you will see that your Apache Spark guide
folder is created under workspaces. That is the thing that you need to do. Okay. So this is your folder and obviously it
is empty. Obviously it is empty. So now let's create our first notebook. What is a notebook? Basically whenever you want
What is a notebook? If you were running code locally you'd use an IDE such as PyCharm or VS Code; here we use notebooks. A notebook gives developers a better user interface and experience because you can run just the piece of code you care about instead of re-running everything. Notebooks are hugely popular in the data domain — if you're a data analyst you'll know Jupyter notebooks, and this is the same idea: your whole script is divided into cells so you can run a specific cell, a specific part of the code, whenever you want. That's the best thing about notebooks. So let's create one — this is your Spark notebook, or you can say Databricks notebook, and this is its title. Let's rename it to "Spark Session", because the Spark session is what we'll talk about first. Now, this notebook needs to be attached to a cluster.
Why? Same story as before: say I've written my code and I want to submit it. Where does it go? We know it goes to the resource manager (the cluster manager), which creates the driver node; the driver then talks to the resource manager and says, "hey, give me two more worker nodes" — sorry, not clusters, two executors — and it gets its two executors. The great thing about Databricks is that we don't have to keep asking "hey, create one driver, hey, create two executors" every time. We can simply create a cluster once and reuse it. Yes — that's one of the powers of Databricks. How do we do that? Go to Compute; this is where you create compute, and compute here basically means a cluster. Click All-purpose compute, then Create compute. This is the configuration we get for free — we don't pay for any machine, because this is the Databricks Community Edition. If you click here you'll see 12.2 LTS; that's the runtime version of the cluster. The latest is, I think, around 17, but 12.2 is perfectly fine while we're learning. So this is the Spark cluster we'll be creating.
Click Create compute. What happens? It spins up the cluster — you can see the icon rotating while it's created. What does this cluster hold? In real life, when you're paying for Databricks, you can define the driver size, the number of workers and the worker size — effectively the same request you'd submit to the resource manager, with the advantage that the cluster configuration is saved and you can reuse it. In the free account you can't say "hey, give me 100 machines and 100 GB of RAM"; it just creates a basic machine so you can learn. And who is the resource manager here? Databricks manages it automatically. So it simply creates a cluster with one driver and one worker, and that cluster will process your data. You don't have to worry about driver memory or worker memory here — although in the real world we do define those things, and we'll discuss drivers and workers in a lot of detail later in the video. This is the managed layer Databricks gives us. Pretty cool — I know you're going to love Databricks too. Now I can go to the Recents tab, where I can see the notebook I just opened, click on it, and click the Connect button. You'll see your cluster there — it's still turning on, but you can select it so the notebook gets attached, because whatever code you write needs some compute, right? It needs a cluster. While the cluster is starting up, let me tell you about the Spark session.
We talked about the Spark session, right? Well, do you know the other great thing Databricks does for us? We do not need to create a Spark session. Yes, really — Databricks creates the Spark session for us. Normally, whenever we create a Spark session we write something like this: we create a variable called spark and say spark = SparkSession.builder.appName(...), give it an app name such as "tutorial", and chain the rest. In the industry, spark is by far the most popular variable name for the session — it's practically a standard. Databricks saw that as an opportunity to automate session creation for its customers, so we don't need to create one ourselves; we can use spark directly. The moment the cluster is up, you can literally write print(spark) and it will show the information for that Spark session. I'll show you what our session looks like and what information it contains. It takes around two to four minutes to start the cluster, because it has to create the machines and nodes, the executors, the driver node, and set up the application master container — and you know about all of that by now, you're a pro. Oh, it's on. So I'll run this cell. How do you run a cell? Just hit Shift+Enter — that runs the cell, and that's what you'll use every time. It gives me the Spark details: this is the Spark session that's already there for me. You can even type just spark, without print, and it returns the same result, because in a notebook cell the print command isn't required. So I got the details: my SparkSession is this object, and it exposes the Hive and Spark contexts — and the third one, the SQL context — so it effectively merges SparkContext, SQLContext and HiveContext. The version is 3.3.2, and the master is local[8]. What's the master? That's where the driver runs — here it's local, meaning on the local machine (local[8] is local mode using 8 cores). And the app name is "Databricks Shell". So Databricks has automatically written all the code I just showed you — the whole SparkSession setup — for you. You genuinely don't need to worry about it. That's the second thing I really like: you don't have to think about how to submit things or create the Spark session again and again. This auto-created session is equivalent to code you could write yourself.
Let me write it out for you — and I hope you're taking notes. It's equivalent to spark = SparkSession..., and I think we also need to import SparkSession first, if I'm not wrong. So: from pyspark.sql import SparkSession — that's the library. Once it's imported, we use SparkSession.builder.appName(), and for the app name I'll put "Databricks Shell", because that's the name Databricks itself uses — I could give it any name, but I'm showing you the code Databricks writes for us behind the scenes. Then a backslash to continue on the next line, then .master("local[8]") — local, with 8 — and finally .getOrCreate(). If you wanted to add configs, say a 12 GB driver and so on, you'd chain those in before getOrCreate(). Let me run this — it's fine — and if I then evaluate spark, I see the same information as before. Now, what happens if I rename the app — say "Databricks Shell Fun" — and run it again? You won't see any change. Why? Because the application is managed by Databricks, so it won't spin up a new application for us. If you were running this code locally you could give any app name you like and add as many configurations as you want, but here all of that is managed by Databricks and we don't need to worry about it at all. I'll delete this code — it was just a reference you can keep in your notes for how a Spark session is created. The session is already there for us, so we don't have to worry about the spark variable: whatever we want to do with the session, we do it through spark. That's it.
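For your notes, here's a small sketch of that equivalent code — roughly what gets typed (and then deleted) in the demo. In Databricks you never need to run it, since the spark variable already exists; locally this is how you would build the session yourself.

```python
from pyspark.sql import SparkSession

# Roughly the session Databricks builds for you automatically; the app name and
# master value simply mirror what the auto-created session reports.
spark = (
    SparkSession.builder
    .appName("Databricks Shell")   # any name works locally; Databricks ignores changes here
    .master("local[8]")            # local mode with 8 cores, as shown in the session info
    .getOrCreate()                 # .config(...) calls could be chained before this
)

print(spark)   # in a notebook cell, a bare `spark` shows the same session details
```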
Makes sense? Good. And what about the Spark UI? Don't worry — we'll talk about the Spark UI in an upcoming chapter, where we'll go deep into reading DAGs and more. And don't worry if this beginner section felt very Databricks-specific; we're here to learn Spark, so if you don't follow every Databricks detail, that's fine — just focus on Spark. Now let's look at the next important concept: lazy evaluation and actions.
Spark is lazily evaluated, and you're about to see why and what the benefits are. Let's take an example. Say you're writing code in a notebook: you read a file and create a DataFrame on top of it, then you perform some transformations — you filter some records, add some columns, and aggregate the records. In total you've applied three transformations: filter, add columns, aggregate. You run the cell with Shift+Enter — and nothing happens. What? Yes: you read the DataFrame, applied the filter, added columns, aggregated, you literally ran the cell, and nothing happens. Why? Because of the concept called lazy evaluation. The idea is simple, so don't overcomplicate it. Whatever transformations you apply — the filter, the added columns, the aggregations — Spark stores them in a plan. All the transformations, all that information, goes into the plan. Apply more transformations — say a fourth one — and it keeps adding them to the plan. It only executes that plan when you hit an action. An action is like a switch: flip it, and only then do your transformations actually run. And why does Spark do this? Because it doesn't want to execute your code immediately — your code isn't optimized yet. We'll see later how the driver node optimizes it, but for now just accept that the code you wrote isn't the final form. You'll say, "I've written the most optimized code on this planet." No, you haven't. Spark understands better than you how that code should be handed to the executors. So before giving your code to the executors, it collects all the transformations in the plan and keeps saying "add more, add more". Then, once you're done and you hit the action, it rearranges the transformations — maybe transformation number four should actually run earlier — purely to optimize performance, because Spark knows better. It may even merge transformations: instead of applying two separate transformations it might apply one, so instead of four transformations it ends up running three, or maybe two. It knows what to do, how to do it, and the most optimized way to do it. So it saves the plan, and the moment you hit the action it executes everything using the optimized plan rather than yours. Now you'll ask: where is this switch? Where is this "action" button?
We have several actions available in PySpark, such as .show() — when you write df.show(), it triggers the action. .count() is another one, display() is another, and then there's the very popular .collect() method — all of these are actions. When you call one of these on your DataFrame, Spark executes your plan. That's the whole concept of lazy evaluation and actions. And the moment you hit an action, Spark runs everything and creates something called a Spark job, which holds all the transformations. When is a job created? Only and only when you hit an action — there's no job created for transformations alone, and I'll show you that. Let's actually go to the notebook and see what happens behind the scenes when we write the code.
So I'm back in my Databricks notebook, and this is the code I've written — don't worry, it's very basic. This is the code to define a DataFrame: here's our demo data, here's the schema for it, and here's the call that creates the DataFrame. Don't focus on the code right now; we're trying to understand the concepts first. I'll run it with Shift+Enter, as you know. How do we create a DataFrame? We just use the Spark session: spark.createDataFrame(...) — that's the syntax; I don't know how Spark could have made it any simpler. And... "StructType is not defined". Good — I wanted you to see this. Whenever you create a DataFrame, or work with PySpark at all, there are many PySpark modules you need to import. What's the best way to handle that? Rather than importing one function, using it, then importing the next one each time, just import everything: from pyspark.sql.functions import * and from pyspark.sql.types import *. Run those imports, then run the DataFrame cell again, and this time it creates your DataFrame.
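The exact demo rows aren't shown in full here, so treat the data and column names in this sketch as illustrative stand-ins — the shape just matches what's described (three records, a city column, ages 25/30/35):

```python
from pyspark.sql.functions import *   # import everything, as done in the demo
from pyspark.sql.types import *       # brings in StructType, StructField, etc.

# Illustrative stand-in for the demo data.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age",  IntegerType(), True),
    StructField("city", StringType(), True),
])

data = [("Alice", 25, "New York"),
        ("Bob",   30, "Toronto"),
        ("Carol", 35, "New Delhi")]

df = spark.createDataFrame(data, schema)   # creates the DataFrame; no job is triggered yet
```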
The cell completes — green check, everything's fine — so the DataFrame is created, right? Click the little dropdown on the cell and you'll see that nothing happened: no job was triggered, because we didn't hit any action. Now let's apply a transformation — say I want to filter the records where city equals "New York". Let's do it, and you'll see the magic. What's the code? Again, don't focus too hard on the code right now, because we're getting the concepts first; it's very simple: df = df.filter(...). I'm writing the code from scratch, so even if you don't know PySpark you can follow what I'm typing — this is very basic PySpark. If you haven't watched my PySpark full course video, you can watch it after this one; it's not a prerequisite, because this video is strongly focused on the concepts behind the code, and once you know the concepts the coding comes easily. So, the filter: df.filter(), and inside we refer to a column with col("city"), because the column name is city, and compare it: col("city") == "New York". Copy that, paste it here, and run it. Perfect. Now — Lamba, you said that if we execute our code, nothing will happen. Did anything happen? No. Click the dropdown: nothing. Right now you might think you've successfully filtered the data, but you haven't. Why? Lazy evaluation: this is a transformation, so it's just stored in the plan, not actually executed. It will only run when I hit the action.
So here's the action. I'll add a little comment — # action — and then simply write display(df). The moment I run this cell, you'll see a job created for this execution. And there it is: "Spark Jobs" — it's creating jobs for me. This is your DataFrame output, of course, because the display command shows the result, but notice that it also created jobs. Click the dropdown and you'll see Job 0, 1, 2 — three jobs in total. Why three? Don't worry about that yet; we'll have a dedicated chapter on jobs. For now, just understand that a job is created only and only when you hit an action. No action, no job. That's the concept you need right now — and you can already see how powerful it is.
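Putting that demo together, here's a minimal sketch of the transformation-then-action pattern (display() is a Databricks notebook helper; df.show() is the portable equivalent):

```python
from pyspark.sql.functions import col

# Transformation: only recorded in the plan, so no Spark job appears yet.
df = df.filter(col("city") == "New York")

# Action: now the optimized plan runs and job(s) show up in the cell output.
display(df)      # Databricks-specific helper
# df.show()      # works in any Spark environment
```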
And which transformation did we apply? A filter. Now, what if I tell you that Spark has two different kinds of transformations? Really — two kinds. I'll cover them properly in the very next chapter: narrow and wide transformations and everything around them. But first I want to show you something special. I told you Spark builds a plan, right? So let's apply one more transformation — not just the filter — because now I'm going to show you how to read those plans, the query plans, and how to see the DAGs too. Yes, we'll look at the DAGs as well; it's really important. So: query plans and DAGs. What I'll do is recreate the DataFrame — or rather create a new one called df_new so it's easier to follow. This is my new DataFrame. Let's delete the earlier cells, because we're doing something new: I want you to see how Spark builds a plan and how that shows up as a DAG. We'll go through DAGs in detail in the narrow and wide transformations chapter, but you should have some idea now — this is just groundwork.
Let's apply the same filter: df_new = df_new.filter(col("city") == "New York"). Run it, and you know nothing will happen. Now one more transformation: df_new = df_new.select("city") — I don't want all the columns, just the city column; that's my choice, my transformation. So how many transformations do we have? Two. And no job has been created, because no action was triggered. Now I'll hit the action: display(df_new). You'll see a job created, and this is my desired output — only the New York records and only the city column. Now let me show you something really special: why lazy evaluation matters and why we should rely on it. If this code ran eagerly, how many times would transformations be applied? Two — one and two. Makes sense. Now let's see what Spark actually did.
There's a command for that called explain. By the way, this cell is a markdown cell, which lets me add headings and titles — and don't worry, I'll share this notebook with you so you can refer to it while learning and take notes. I'll title this part "Query Plan". This is just groundwork; we'll look at query plans again in the transformations chapter. To see the plan I simply use df_new.explain(). Run it, and you'll see exactly what Spark executed for your code. How do you read it? Always bottom to top. At the bottom, it scanned the data — all of it. Then it applied your transformations, and you can see the filter there. But why does it also say "isnotnull(city)"? We never asked for a not-null check. It's there because of optimization: Spark adds these small things for you for better performance. And there's something more important: it performed the filter and, within the same step, the select as well, combined with an AND — it did not apply the transformations in two separate passes. If your code ran eagerly, with no lazy evaluation, it would have executed two transformations; with lazy evaluation, Spark combined them, because they can be done in a single transformation. So Spark knows better than you: you just write the code, and Spark optimizes it for you. Then at the top there's Project, which is essentially the final column projection — the columns that end up in the output. That's the query plan Spark optimized for you.
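Here's the same check as code. The commented plan is only a rough idea of what to expect — exact operator names and numbering vary by Spark version:

```python
df_new.explain()

# Expect something along these lines (read bottom to top):
#   == Physical Plan ==
#   *(1) Project [city]
#   +- *(1) Filter (isnotnull(city) AND (city = New York))
#      +- *(1) Scan ExistingRDD[...]
# i.e. the filter and the column pruning happen in a single pass over the scan.
```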
And if you want to see the DAG for this, how do you do it? Go to the job here and click View — this opens the Spark UI. Click the middle button to expand the screen. Now go to Jobs, and you'll see all the jobs that have run for you — this is an amazing feature Spark gives us to monitor our jobs, the performance, everything. For now, click the latest one, job number 5, and let's see what it did. This is the DAG visualization for job id 5, and notice I'm on the Jobs tab — keep an eye on which tab you're in. If you want to see the query plan we just read, but in DAG form, go to the SQL / DataFrame tab, and there it is: click the "display df_new" entry, since that's the job we just ran. This is exactly what we saw as the query plan. I always prefer looking at it as a DAG because it's easier to read and it looks good. Start from the bottom, or just follow the arrows — same thing. The first step is Scan ExistingRDD; click the plus and you'll see "rows output: 3", so it read three rows. Then it filtered, and the rows output was 1, because only one record matched. It performed the filter and the select together, and you can read that in the plan itself — click it and it's written right there. At the higher level, hover over the node and you'll see both the filter and the select applied in that one step. And in the last step it simply displays the output. That's the DAG.
DAG stands for directed acyclic graph. Whenever you perform transformations and process your data, Spark creates a DAG for you. The DAG is the flow — the whole flow of the job — and every job has its own DAG. Take this job: it knows its flow. It needs to read the data, filter it, select the city column, and then display the result. "Directed acyclic" means it can never loop back in a circle; it has to move in one direction — first this step, then this, then this — and it can never go back to the entry point. So this DAG knows: read, filter, select, display. Don't overcomplicate it: a DAG is basically Spark keeping track of how your job gets to its last step, and it contains all of your transformations. That's all it is — simple, right? So that was lazy evaluation and actions, and as a bonus you also got a first look at query plans and DAGs. We'll revisit them in the transformations chapter, which is next, where we'll discuss the types of transformations — narrow and wide — because that's a very important concept you should know in detail. But before we talk about transformations, I think we should first talk about partitions, because only then will the concept really land.
detail. Let's see. So before talking about transformations I think we should first talk about the partitions because
only then you can understand the concept more precisely. So basically what are partitions? Because we always talk about
partitions. Partitions, partitions in Spark. What are partitions? It is very simple. So let's imagine you have this
data frame or you you have this data in your data lake, in your storage, in your DBFS, anywhere. This is your DF, right?
Yes, this is your DF. So this is one DF and you have already seen like data frame. We have just created the demo DF
previously. Okay, so that DF was really small of just three records. Okay, this DF is also like you can say a kind of
only showing four records but actually let's imagine we have something called as 1 million plus records in this
particular data frame. So we know that we perform distributed computing that means we distribute our data. But how we
can distribute one data data frame? How? That's where partitions come into the picture. So what it does, it basically
creates the partitions on your data frame. So this will be let's say your partition number one. This will be your
let's say partition number two. So it will simply create the partitions on top of your
data. Make sense? Very good. Let's say this is partition two. Now this partition two will be distributed among
the executors. Let's say this is exeutor number one. So one partition will be there and this let's say partition
number two. This partition will be available on this particular exeutor.
So this way our data gets partitioned and then it gets distributed among different executors.
Make sense? That is the role of partitions. But now you can ask me, hey, how we can just partition our data. How?
So you cannot partition your data. No, you cannot partition your data. You create logical partitions. Okay. Now
what are logical partitions? In order to understand logical partition, let's bring another topic. It's called RDD.
I'd say the RDD is the backbone of Apache Spark. Really — the backbone. It stands for Resilient Distributed Dataset (and I'll tell you in a moment why it's called resilient). So, as we discussed: here's our DataFrame of four records. Spark creates logical partitions of this DataFrame and stores those logical partitions in something called an RDD. If you want to understand RDDs in layman's terms: if you've worked with any programming language, you know data types — list, tuple, dictionary, set. Think of the RDD as one of the data types we have in Apache Spark; that's the easiest way to picture it. It's basically a kind of list — a list of logical partitions. But this list is different from a traditional list, because a normal list can't be distributed and this one can. That's the specialty of this data type: it can be spread across machines. And I'll repeat it: the logical partitions are applied on top of your data, and that becomes the RDD — this is the thing that actually gets distributed to the executors. Your DataFrame — or rather the data lake holding your data — doesn't go straight to the executors. It first gets partitioned, in the form of RDDs, and then goes to the executors. So the RDD is a collection of logical partitions — a kind of list that holds them. When that goes to an executor, the executor knows exactly which data it needs to read from the data lake, from the source, because that's where the real data is actually sitting. Makes sense?
Now let me tell you why it's called resilient: RDDs are fault tolerant. How? RDDs are immutable — they cannot be changed. Say you create a DataFrame and apply a transformation; that produces an RDD, call it RDD1. Now, even if you write another transformation on top of that same DataFrame — take the example from the code we wrote earlier: we applied a filter on df, and then on the same df we applied a select. Looking at the code, you'd think we're changing the existing DataFrame, because we keep reusing the same variable. The answer is no. Every time, you create a new DataFrame — a new RDD — even though the name stays the same; you never change anything in the existing RDD. So say the filter ran fine and created RDD1; then Spark goes to apply the select and create RDD2, and for whatever reason that step fails. What does Spark have? RDD1, already created. It simply takes RDD1, and since it has the DAG — it knows how to produce the next steps — it re-reads those instructions and recreates RDD2. That's how Spark tolerates faults and failures in your code, and that's what resilient means. Distributed we already covered: it can be spread across the executors. And dataset means your original data. That's the cleanest way to understand the RDD — it's really simple; you just need the right example and the right person (and Lamba, of course) to explain the concept. So that's your RDD. Now, whenever you create a DataFrame, yes, you're writing your code against the DataFrame API. We used to write code directly against RDDs too, but that isn't optimized and isn't really recommended anymore. We simply write our code with DataFrames, that code gets optimized, and in the end the resulting instructions are handed to the RDDs, and those RDDs go off to the executors.
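A tiny illustration of that immutability and lineage, using a toy RDD — the numbers are made up, and toDebugString() just prints the lineage Spark would replay after a failure:

```python
# Each transformation returns a *new* RDD; the one it came from is never modified.
rdd1 = spark.sparkContext.parallelize([1, 2, 3, 4])
rdd2 = rdd1.filter(lambda x: x > 2)      # new RDD built on top of rdd1

print(rdd1.collect())                    # [1, 2, 3, 4]  -- unchanged
print(rdd2.collect())                    # [3, 4]

# The lineage: the recipe Spark can replay to rebuild rdd2 if a partition is lost.
print(rdd2.toDebugString().decode())
```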
Even if you're not writing your code with RDDs directly, that's fine, because you're writing it with DataFrames, and DataFrame code is already being optimized for you by the Spark SQL engine — we'll look at the Catalyst optimizer and all of that later. So you're not really the one optimizing your code; the driver node optimizes it, and all of that information gets transferred into the RDDs. Spark builds the RDD with everything it needs to know: how to read the data, how to apply the transformations — that physical plan, remember — and that information travels along with the RDD to the executor. So think of the RDD as carrying a list of instructions: it passes that information to the executors, and they simply read it and do the work without much thinking, because they're consuming the knowledge produced by the driver. This might feel a little confusing right now, but it will settle in over the upcoming chapters, because all of these things are interrelated — some things click immediately, and some click later in the video when you connect the strings. For now I just want you to have the high-level picture of RDDs; if every detail hasn't landed yet, that's completely fine — it ties into a lot of things we haven't covered, and when we get to them you'll go, "oh, so that's what that was." You're not the only one who experiences that. Very good. So that's the RDD, and we already know the concept of partitions, which means we can now easily digest the types of transformations. What are the types, what's the difference, and why should you even care — after all, you just apply transformations, right? No: you should be thinking about which type of transformation you're applying. Let me show you why. Let's understand narrow and wide transformations — and trust me, it's simple.
Let's cover narrow transformations first. The classic examples of narrow transformations are filter and select. What happens when you apply one? Picture it: this is your source, and these are the two partitions Spark has created on top of your data — you know about partitions now — and say the data has three columns. Now let me give you a little job: you are the computer. I hand you two Excel sheets, and those two sheets are really one sheet that I've split into two parts, and you're doing whatever calculations I ask for. If you imagine yourself doing it, you'll understand it better. So, you apply a filter. Take sheet number one and filter the rows — say I want every row where ID equals 1. Do it. Done? Now do the same thing on sheet two. Done — very good. Here's the question: while you were filtering sheet two, did you need any reference from sheet one, or any help from the other sheet at all? Obviously not — you can filter purely from the data sitting in front of you. Exactly. Now suppose I ask you to remove some columns from sheet one; you can do it. Do it on sheet two as well; you can do that too. Same question: did you need any help from the other sheet? Obviously no. That's where narrow transformations come into the picture: you don't need anything from the other partitions. You apply the filter on this partition and produce its output as an independent partition — this partition's output is this, that partition's output is that. Both are independent, and each one produces its own output partition. (You can see that per-partition independence in the little sketch below.)
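A small sketch of the idea, using a toy RDD and glom(), which just shows the contents of each partition — the data here is hypothetical:

```python
# Two partitions of toy numbers, standing in for the two "Excel sheets".
rdd = spark.sparkContext.parallelize(range(1, 9), 2)
print(rdd.glom().collect())               # e.g. [[1, 2, 3, 4], [5, 6, 7, 8]]

# A filter is narrow: each partition is filtered on its own, no data moves between them.
evens = rdd.filter(lambda x: x % 2 == 0)
print(evens.glom().collect())             # e.g. [[2, 4], [6, 8]] -- same partition layout
```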
Now let's cover wide transformations. The most popular wide transformation is groupBy. Take the same source again, the same columns — you have city in your data — and again I've split the data across two sheets. Now I want you to aggregate the sales for each city. Do it for sheet number one: you'll come back with, for example, New York = 100,000 in sales. Very good. Question: do you need any help from sheet number two? Yes, obviously — because sheet two also has New York records. So you need sheet two in order to aggregate the values and give the final answer. That's where wide transformations come in: you need to shuffle your data. What does shuffling mean here? Think about it: if you're calculating the sum of sales per city, and instead I hand you one sheet per city — one sheet for New York, one for Toronto, one for New Delhi — can you find the totals? Of course: "I have all the New York records in one place, I can easily sum the sales." That's exactly what a wide transformation sets up. So in this case, if your source has just three cities — New York, Toronto, New Delhi — Spark will build a separate partition per city: one for New York, a second for Toronto, a third for Delhi, and hand those to the executors so the groupBy and aggregation become easy, because the aggregation needs all the matching records together. And that requires shuffling, because the data starts out scattered: some of the New York rows are in this partition, some in that one, and the same goes for Toronto. So Spark shuffles: for the New York partition, some rows come from here and some from there; same for Toronto, same for Delhi. That's your wide transformation. A narrow transformation is one-to-one between input and output partitions; a wide transformation is one-to-many. That's narrow versus wide. Let's go and see this with code, and you'll notice it creates a lot of partitions — because by default, whenever Spark performs a wide transformation, it creates 200 shuffle partitions. Why 200? Because there can be many keys; here we only have three cities, but in real data there could be many. These days you'll often end up with fewer, but that's because of AQE, which we'll discuss later in the video — for now, just take in that the default is 200. So what happens here? It creates the three partitions we need, and the remaining 197 stay empty, because we only have three cities — but imagine a dataset with many more. That's what it does. Let's see it with code, read the query plans, and look at the DAG as well — there's a small sketch of the idea below, too.
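A sketch of the wide-transformation example we're about to run — the sales numbers are made up, and the shuffle-partition check just reads Spark's default setting:

```python
from pyspark.sql.functions import max as max_

# Hypothetical city/sales rows, mirroring the two-sheets example above.
sales = spark.createDataFrame(
    [("New York", 100), ("Toronto", 200), ("New York", 300), ("New Delhi", 50)],
    ["city", "sales"],
)

# groupBy is a wide transformation: all rows for a city must end up together,
# which forces a shuffle (an Exchange step in the plan).
per_city = sales.groupBy("city").agg(max_("sales").alias("max_sales"))

# The shuffle writes this many partitions by default (200, unless AQE coalesces them).
print(spark.conf.get("spark.sql.shuffle.partitions"))

per_city.show()   # action: triggers the job, including the shuffle
```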
So let's head back to our Databricks notebook. By the way, this applies to you too: if you see that your cluster has terminated — it gets terminated automatically after about 30 minutes of inactivity — you can simply create a new one. Click here, choose "Create new resource", then "Create, attach and run", and it will spin up a new cluster for you; it takes two to four minutes to come up. Meanwhile, let's create a new notebook: go to Workspace, create a notebook, and this time call it "Transformations". Attach it to the cluster. Let me add a heading as well: first we'll look at narrow transformations. We can copy the DataFrame-creation code from the earlier notebook so we're working with the same DataFrame. While the cluster is turning on, we can prepare the code: here's the cell that creates df, and above it we need the imports — from pyspark.sql.functions import * and from pyspark.sql.types import *. Oh, the cluster is already on — that was quick. So run the imports first, then the DataFrame cell.
It's complete — run the DataFrame cell too. Very good. Now let's perform a narrow transformation — a filter: df = df.filter(col("city") == "New York"). Run it, and you already know nothing will happen: lazy evaluation. This is a narrow transformation. Now let me show you what happens when it actually runs, so let's call the action: display(df). It's running... and there's the data you wanted to see, because we filtered for New York only. Let's look at the plan with df.explain(): it simply scanned the data and applied the filter — that's it, nothing else to worry about. And we already know how to see the DAG for it: this is job 0, click it, it opens the Spark UI, expand it, go to the SQL / DataFrame tab and click the "display df" entry — that's your DAG. Now let's perform a wide transformation.
To do that, first recreate the DataFrame so we're working with a fresh one. Let me write a heading here — "Wide Transformation", heading 3, bold. Now let's perform a groupBy: df.groupBy(), and you just pass the column you want to group on — I want to group by city. Then, what aggregation do I want? I'll call .agg(); I could do a count of age, but let's instead find the maximum age. I know we only have one record per city in this tiny dataset, but that's fine — let's apply the max aggregation on the age column. And of course we save the result back into df. Done — and obviously there's no job yet. Now let's hit an action and see what we get: there's the result, maximum ages of 25, 30 and 35. Very good. Now let's look at the plan.
Now let's see what actually happened in the case of the wide transformation: df.explain(). Let's read the plan. This final plan is the one to focus on — the other part you can treat as the initial plan before optimization. Read it bottom to top: first, it scanned the data. Then it performed a HashAggregate on city — a hash aggregate, because city was our grouping key; the key is your column, and the function is max, the aggregation we asked for. Then you see Exchange hashpartitioning. That is the shuffle — whenever you see the word Exchange, it means shuffling of partitions, and you now know why it happens. After that there's a ShuffleQueryStage with some statistics (a row count of three and so on), and then AQEShuffleRead, which you can ignore for now because it belongs to adaptive query execution, something we'll discuss later in the video. Now let me show you the DAG for it: click here, expand it, close the previous tabs, go to SQL / DataFrame and open the latest display. You can see a lot happened: first it scanned the data, then it applied the hash aggregation on your grouping key, then it shuffled the data — and look at the number of partitions: 200. Then comes the AQE step, which I'll explain later in the video because you need more background first. In short: we said the shuffle creates 200 partitions, and AQE is a modern optimization that reduces that number — if you hover over the step and click the plus to see the partition count, I expect it collapses them down to a single partition here. It basically combines your partitions. That's out of scope right now, so for this chapter just pretend the AQE step isn't there. And don't worry — I'll disable AQE for the rest of our code so it doesn't confuse you, but you should know it exists, because if you leave it on you'll see this extra step and wonder what it is. So: 200 shuffle partitions, then the final hash aggregation, and that's the plan that actually executes the code. That's your DAG and your Spark query plan for the grouping — in other words, for the wide transformation. And that's narrow versus wide transformation.
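As mentioned above, AQE will be switched off for the rest of the demos so the raw 200 shuffle partitions stay visible; this is a sketch of the standard flag for that:

```python
# Turn adaptive query execution off for this session so the shuffle's 200
# partitions show up in the UI instead of being coalesced.
spark.conf.set("spark.sql.adaptive.enabled", "false")

# It is on by default in recent Spark/Databricks runtimes; re-enable it with:
# spark.conf.set("spark.sql.adaptive.enabled", "true")
```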
So what's next? The next thing is very, very important, I'd say. We've been talking about jobs, right? Whenever we see job 3 or job 4 and click on it, we see a DAG for that job and a stage ID; and if I click on Stages, I see stages. So what exactly is a job? What are stages? And there's another term too: tasks. We're going to talk about jobs, stages and tasks — how they get created, how to work with them, and how they get optimized. It's really interesting. But before that, let's talk about repartition and coalesce. This is really important, and now is the best time to discuss it, because you now know wide and narrow transformations — and let me tell you, one of these two is a narrow transformation and one is a wide transformation. And sometimes both can end up being wide. Really? Yes. So let's talk about repartition first. Say you have a table — a DataFrame.
And by default, say this DataFrame has just two partitions initially. For some reason you want to increase the number of partitions. Those two partitions were created automatically based on the partition size, which is 128 MB — if you didn't know, that's the default partition size, and it also matches the typical default block size of a distributed file system; that's a number worth noting down. Ideally Spark recommends a partition size somewhere between 100 and 200 MB, so 128 MB works nicely — and it lines up with the block size because your data is stored in blocks in the data lake / distributed file system, like HDFS. So say this data is 200 MB in size: you'd get two partitions — one of 128 MB and a second, slightly smaller one. Now, for whatever reason, we want 10 partitions instead. How do we do that? We use something called repartition.
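For reference, the 128 MB figure mentioned above is exposed as a Spark setting for file-based reads; this is just a sketch of how you could inspect (or, illustratively, change) it:

```python
# Default max partition size used when reading files: 128 MB (value is in bytes).
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Illustrative only: you could raise it, e.g. to 256 MB, if your data layout warrants it.
# spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
```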
When we perform a repartition, just picture it: here are your two partitions. If those two partitions need to become ten partitions — one, two, three... up to ten — will data get reshuffled or not? You've got two seconds; it's simple. The answer is yes, because some of the data for each new partition comes from this original partition and some comes from that one, and so on, since we're carving out ten partitions. So whenever you want to increase the number of partitions, you use repartition. You can technically use repartition to decrease the number as well, but it's not recommended, because if you want fewer partitions — say going from two partitions down to one — there's something better: coalesce. What happens with coalesce? Take your two partitions; it simply tries to make one out of them. Quick question: will that cause any shuffling? The answer is no, because it only needs to merge the partitions. That said, let me also tell you when coalesce does end up shuffling — yes, sometimes it does.
Sometimes it shuffles, sometimes it doesn't — this is the slightly advanced bit you should know. Remember, we have two partitions, P1 and P2. If we have only one executor, both partitions end up on that same executor, and it can perform the coalesce on those two partitions by itself — it doesn't need to reach anywhere else, it just merges them. But now imagine we have a second executor. One partition goes to the first executor and one to the second. How can Spark coalesce them into one without moving data? It can't. In that kind of scenario it has to reshuffle — it has to send one partition over to the other executor to perform the merge. Still, in the vast majority of cases coalesce means merging data without any shuffle, because you'll rarely have so few partitions spread out like that; it's a corner case. You should know it exists, but in the real world you'll hardly ever run into it. So that's the concept of repartition and coalesce — now let me show you how it works in code.
So I'm in my notebook — we'll look at repartition and coalesce in this same notebook, and don't worry, I'll share it so you can use it. Let me add a heading: "Repartition vs Coalesce". First of all, let me show you how many partitions we currently have and how to find that out: df.rdd.getNumPartitions(). Run it and you'll see we have just one partition, because the data is tiny — nowhere near the 128 MB partition size. Now say I want three partitions. Simple: df = df.repartition(3). Run that, then run the getNumPartitions() command again, and the output is three. Very good — and you'll notice jobs show up here too; checking the partition count this way ends up behaving like an action. You can even look at the plan.
equals or let's say DF.explain if you just see like what actually happened. So what it did it
So what did it do? It performed the repartition step, but why did it show all the previous steps again? Do you remember the concept of RDD that we discussed? It's high time to revise it. By the way, this concept will be fully understood in the caching and persisting section, but you can still build a very strong base around it here. So what happens? This is our df, and what is the last thing we did with it? We performed the group by. And what did we say? Every time Spark performs an operation in the DAG, it creates a new RDD. So it read the data, it created a new RDD with that group by, and then we performed another transformation called repartition, which created yet another RDD. That is why, when you hit explain, you see all the previous stuff for RDD1 plus RDD2. And that means if for any reason a step fails, Spark can simply take RDD1 and perform the remaining steps of the DAG. That is the power of RDDs; that's how Spark tolerates failures. See how things are getting connected now? A lot of things will be connected, don't worry. So this is your repartitioning step.
Now let's say we want to go back to only one partition; we do not want three. So what can you do? You can simply use coalesce: df = df.coalesce(1). Run this, and now check the output of the partition-count command again; you will see one. Perfect. And if you look at the explain output, what should we see? Obviously all the previous steps again. So if you observe, this is RDD1, this is RDD2, and this is RDD3 with coalesce(1). Why didn't we see an exchange with coalesce? You already know the answer: it did not perform shuffling. That's where coalesce helps you avoid shuffling, because shuffling is an expensive operation. Very good. Now I hope that you have understood everything about repartition and coalesce.
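To keep for reference, here is a minimal consolidated sketch of the notebook steps just described, assuming an existing SparkSession named spark and an already-loaded DataFrame df (names are illustrative, not from the notebook itself):

```python
# Minimal sketch of the repartition vs. coalesce demo above.
# Assumes an existing SparkSession `spark` and a DataFrame `df`.

# How many partitions does the DataFrame currently have?
print(df.rdd.getNumPartitions())   # e.g. 1 for a small file (< 128 MB)

# Increase the number of partitions -> full shuffle (Exchange in the plan)
df = df.repartition(3)
print(df.rdd.getNumPartitions())   # 3

# Decrease the number of partitions -> merge, usually without a shuffle
df = df.coalesce(1)
print(df.rdd.getNumPartitions())   # 1

# Inspect the physical plan: repartition shows an Exchange step,
# coalesce normally shows Coalesce with no Exchange.
df.explain()
```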
Now let's try to understand jobs, stages and tasks, because we know that whenever we hit an action a job is created, but what happens after that? So we know that whenever we hit an action, a job is created for us. Well, not for us, for Spark. Whenever a job is created, it contains multiple stages, and one stage can contain multiple tasks as well: task number one, task number two, task number three, and so on. This looks very simple, and it really is very simple, but you just need to keep a few things in mind. So hold on: we know that a job is created. Very good. Now this job will have multiple stages. Who decides the stages? That should be your first question, and if it is, let me answer it.
So think about your whole code, your transformations, because stages are driven by transformations. If you are applying a filter, a select, a group by — those are the three things we have applied so far — let's take this example. This is a kind of formula you can write down; it will help you in interviews as well. Basically, if your transformation is not moving data across partitions, it is a narrow transformation, as you know. All of these consecutive narrow transformations will be bucketed into one stage; there will be only one stage for a run of narrow transformations. Then we know that the group by is a wide transformation, so it will be considered in a different stage.
That will also be a single stage, but a separate one. Now, within each stage we can have multiple tasks, and the number of tasks is tied to the number of partitions. Write that down: tasks are associated with the number of partitions. So we know that the group by creates 200 shuffle partitions, right? How many tasks will there be? 200 tasks. The task count is strictly aligned with the number of partitions. That's the basic rule you can use to understand it.
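As a quick aside, the 200 comes from Spark's default shuffle-partition setting; a small sketch (assuming the notebook's SparkSession is spark) to check or change it:

```python
# The 200 tasks after a wide transformation come from this setting
# (200 is Spark's default); `spark` is an existing SparkSession.
print(spark.conf.get("spark.sql.shuffle.partitions"))  # usually "200"

# You could lower it for small demo data, e.g.:
spark.conf.set("spark.sql.shuffle.partitions", "8")
```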
Now let's see this with the help of code, because that's when you really understand what is going on. For this we are going to read one CSV file in Databricks, and I will show you how the jobs get created, every single job. To see the demonstration, let's upload the file. It is very simple: go to Catalog, and if you go to DBFS you will see the folders. Click on FileStore — I have many folders there, don't worry — create a new folder inside FileStore, call it, let's say, Apache Spark, and then click on upload into that folder. You can create a folder the same way. So I have uploaded this file, mega.csv; let me click on Done, and it is uploaded in this folder. I can also get the path of this file: right-click on it, choose copy path, and copy the File API format one. Now you will ask how you can get this file too. You can get it from my repository called Apache Spark full course; click the download button there, then download the file and upload it the same way. Perfect. So now we have the file.
Now let me show you the jobs, stages and everything. Let's create a new notebook: go to Workspace, click Create Notebook, call it jobs, stages and tasks, and attach the notebook to the existing cluster. Very good. Simply write the imports from pyspark.sql.functions and pyspark.sql.types and run them. Now we will not create a DataFrame by hand; we will read the CSV file that we uploaded. How can we create a DataFrame on top of it? It's very simple. We have spark.read, which is the API we use to read files. So: spark.read.format("csv") — we need to specify the format, which is CSV — and use a backslash to continue on the next line. Then type .option(). In an option we can define extra configuration, such as header: we know that in a CSV the first line is the header, so we say header is true. Then I will define one more thing that is very handy and you will love it. We have a CSV file, but what if we do not want to write out the schema by hand? We have something called inferSchema; just set it to true. What does it do? It predicts the best schema for the data in your CSV file. I love this feature. Then simply finish with .load() and paste the path. Just remove the dbfs part of the copied path, because we do not need it here, and make sure the path looks like the one on screen; otherwise you will see errors. Run this, and you will see the magic; I want you to observe it.
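Before we look at what happened, here is a minimal sketch of the read cell we just ran; the path and file name follow the upload steps described above, so adjust them to wherever you actually placed the file:

```python
# Read the uploaded CSV into a DataFrame.
# The path assumes the upload location described above (FileStore/Apache Spark/mega.csv);
# change it to match your own workspace.
df = (
    spark.read.format("csv")
         .option("header", True)        # first line contains column names
         .option("inferSchema", True)   # sample the data to guess column types
         .load("/FileStore/Apache Spark/mega.csv")
)
```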
So, did you catch it? If you caught it, I love you. Basically, I just read the data; I didn't hit any action. So why did it create two jobs for me? Let me tell you something. When we create a DataFrame without files, it is not an action. But when we create a DataFrame on top of files using the .read method, Spark has to look at the files, so it behaves like an action. That is one action. Let me add a comment: this is action number one. So why are we seeing two jobs, when we know one action triggers one job? Good catch — the second one is inferSchema. Why? Because, as I just mentioned, inferSchema predicts the best schema for your data. How does it predict it? It takes a sneak peek at your data: it reads some of the records of your file and then predicts the best types. Because it is reading some records, it also behaves like an action, and that is why it is action number two. This was something new, I know.
So, let me click on this drop-down: you will see two jobs, job number 10 and job number 11. One more piece of information that is very useful, so note it down. We know these are jobs, but what about the stages? Stages are aligned with transformations, and we are not applying any transformation; the same question applies to the tasks. The thing is, one job will always have at least one stage and at least one task. That is why you are seeing one stage per job. Let's click on the first one and see what it did: click on view and expand it. This is stage ID 13, the stage it performed, and it has one task. Why? Because there is just one partition and it is not really doing anything, just reading the data, so one task makes sense. Very good. Now let's see the other job, job number 11: click on it, expand it, and you see stage ID 14. Same stuff, and the task count is one. This one is for inferSchema. You can even open the DAG visualization, click on it, and you will see the DAG for the inferSchema job.
Now let's do one thing: let's perform some transformations. I will say df = df.filter(...) — but wait, do I know what the data looks like? Actually, no. So what I can do quickly is create a demo DataFrame that we will not use for counting jobs; we are creating it just to see the data, so that we can pick some columns for the transformations. You know what I mean, right? So I will copy the read code, paste it above, and call it df_demo; it is just there to look at the data. Then I will say display(df_demo) and run it. Hmm, an unexpected error — let me comment that line out. What's wrong? spark.read.format... oh, there was a stray space, and it doesn't like that; let's remove the comment too, because it was messing up the spacing. Run it again. Yeah, see? Now we can look at the data: order ID, user ID and many other columns. Let's say I want to filter only the data where the product name equals sneakers.
Let's do it. I will say df = df.filter(...) with the product name column equal to "sneakers". Make sense? Let's perform this transformation. And you know what will happen? Nothing. No job. Why? Because no action has been triggered. Very good. Now let's perform one more transformation: I just want to select the order ID and the product name, nothing else, so df = df.select(...) with those two columns. Let's run this. Perfect. Now let's perform a wide transformation: df = df.groupBy(...) on the product name. And what do I want to aggregate? Not the sum of the order IDs but the count, because I want to count the number of orders for the product we filtered. So it is a count of the order ID column. Let's run this. Perfect.
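Putting the narrow and wide transformations above together, here is a minimal sketch; the column names order_id and product_name are guesses, since the exact headers of mega.csv are not spelled out here, so rename them to match your file:

```python
from pyspark.sql.functions import col, count

# Column names `product_name` and `order_id` are illustrative;
# match them to the actual headers in your CSV.

# Narrow transformations: stay within the same partitions, same stage.
df = df.filter(col("product_name") == "sneakers")
df = df.select("order_id", "product_name")

# Wide transformation: forces a shuffle into (by default) 200 partitions,
# which starts a new stage once an action runs.
df = df.groupBy("product_name").agg(count(col("order_id")).alias("order_count"))

# No job is triggered until an action such as display(df) / df.show() is called.
```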
Now, we know this is the code we are going to run: action number one, action number two. So we have two jobs so far; that is fine. Let's talk about the third action. We have not triggered it yet, but we will. This is transformation number one and this is transformation number two, but both are narrow transformations, so how many stages will there be for them? Only one. Then we have another transformation, and this one is a wide transformation, so we get a new stage. How many tasks will there be in the first stage? Only one, because there is only one partition. And how many tasks in the second stage? 200. But AQE is there — adaptive query execution — so we have to turn it off first if we want to see the raw result; otherwise it will simply coalesce things and not show the real picture. Let me turn off AQE first. It is very simple, not a big task: I just need one statement, and if you do not remember the configuration, simply go to Google.
need to use one statement. That's it. I can also show you right now as well. Simply go to Google if you do not
remember the code for it. So you can simply say turn off AQE in Spark or you can say
turn on same thing. You simply need to use true or false. Okay. Click on performance
tuning and it will be something like um let me
see AQE. So co sends yeah here it is adaptive query execution. So this is the
thing that we need to make it false default value is true. Okay, we simply need to do this and how we can just do
that. Simply run this command at the top of this spark dot conf dot set and then just make it false instead of true. So
as you can see spark.com set spark.sql dot adaptive co partitions. Oh wait wait wait wait partitions. No no no no no. We
do not just need to apply co partition. We need to turn off the AQ on all the things. So just go back. Uh ah yeah,
here's the thing. So you simply need to copy this one. So simply replace it with this one.
Spark.sql.adaptive.enabled. So it will simply turn off the AQE for all the things. Simply run this. And now AQ is
disabled. AQ is disabled. Very very very good. And that's what we wanted. Okay. So we may need to just rerun all the
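A one-liner sketch of the setting just described (plus the narrower property that was grabbed by mistake), assuming the notebook's SparkSession is spark:

```python
# Disable Adaptive Query Execution entirely so the raw 200 shuffle
# partitions/tasks remain visible in the Spark UI.
spark.conf.set("spark.sql.adaptive.enabled", "false")

# Note: spark.sql.adaptive.coalescePartitions.enabled only controls
# AQE's partition coalescing, not AQE as a whole.
```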
We may need to rerun all the code now, but that is fine. Rerun the read cell — and remove that commented line again, since it messes with the spacing — and you will see the two jobs again. Very good. Rerun the transformations too. So, what were we saying? That the group by will create 200 partitions, which means 200 tasks. Now I am going to write display(df). In total, how many stages will there be in this job, and how many tasks? It's a very simple question and you should be able to answer it. In this job there will be two stages in total: one for the narrow transformations and one for the wide transformation. In the first stage there will be only one task, and in the second stage there will be 200 tasks. Let's verify this: run display(df).
Very good. So what is it doing here? It has figured out the number of orders for sneakers, and it has created five jobs. Why? Because it reused the two jobs from earlier, and if you look at the other jobs you will see several of them marked as skipped. Those skipped ones are not actually executed, so there is no need to worry about them; our main job is job 17. Why are the others skipped? Because we had already run the code that builds df, so Spark does not recompute that part, but it still shows up in the list of jobs. Job 17 is the main job. Now, what is it actually doing? Let me tell you what is happening. First of all, we need to see our two stages in this job. Hmm, I can only see one task here — hold on, let's look closer. So this is the job, and what is actually happening is: this is stage ID 20 and this is stage ID 21. In stage 20 it is applying your narrow transformations, and in stage 21 it is applying your wide transformation.
And you can actually see this by going to the SQL / DataFrame tab. First go to the job, click on it, then go to SQL / DataFrame, open completed queries and click on this one. Here you can see exactly what is happening: first the file scan, then the filter that is applied — that's your filter and select — then hashing, which is for the grouping key, then the shuffle. How many partitions? 200. Very good. Then we have the hash aggregation, because we want to apply the group by, and finally collect limit. That's it. Now let's go to the stages tab. If you look closely, you can see which stages are completed and which are skipped. And if you go back to jobs, this is our job, job 17. Open it — you may need to reopen it, because I just closed it — click on jobs one more time, find job 17 and open it. These are the completed stages. Here we are seeing one stage that we should not see; I know something is a bit off with this run, because I don't think we restarted the cluster, but that is fine. The whole intent right now is to explain this diagram, because it is really important; you already know you should see 200 tasks, and that is not the big deal. This is the big deal.
So what is happening right now is very important. Don't worry, we will also fix this, and it is not a bug. The thing is that we reran the whole notebook but did not restart the cluster after defining a new property for our Spark session, so it skipped recomputing the group by part, and that is why the view looks a bit odd. No need to worry; let me show you. So what is happening here? In the plan, we first performed our narrow transformations; fine. Then we know that we changed the transformation type from narrow to wide, and that is why we see an exchange. What is the name of this exchange? It is called a write exchange, because it is writing the shuffle data out for the next stage. And the other one is called a read exchange. Write exchange on one side, read exchange on the other: it created 200 partitions, read all 200 partitions back, and then performed the grouping. This step was really important and I wanted to explain it. Now let's restart our cluster and perform the whole thing again.
Let me remove the clutter; while the cluster restarts, we will rerun the whole code and you will see everything sorted out. I can even merge the code into one cell so that it is easier for you to manage and see the results; let's run all the code together. That way you can keep this notebook for your reference and easily make notes from it. Then just add display(df) at the end, neat and clean, with a small comment marking it as the action, and another comment marking the data reading. Let's delete the extra cells. Perfect. Now that the cluster has restarted, let's run all the code together, and after this I hope you will be a master of jobs, stages and tasks. It is creating a fresh df now; it is not reusing anything from the previous run, because some state was saved in memory before. Now we can see jobs being created — there are several. Click on this one, and you can see the three jobs we actually care about: job zero, job one, job two. Job zero is for reading, job one is for inferSchema, and job two is for the display that runs your transformations. Let's open that one. Perfect, now I can explain this.
So now you can see that the shuffle write is 71 bytes. And how many distinct values do we have for the group by key? Just one, because the DataFrame is filtered to sneakers only. So how many partitions actually hold data for sneakers? Only one. And how many tasks actually do work? Only one. But, as I said, there are still 200 shuffle partitions, and let me show you. Click on jobs one more time: seven jobs were created in total, out of which four were skipped. Why? Just look at the number of tasks for those skipped jobs: 75, 100, 20 and 4. Add them up and you get 199. So 199 tasks corresponded to empty partitions; 199 of the 200 partitions had nothing in them, so obviously Spark doesn't need to do anything for them. Only one partition actually got executed. Make sense? And it is not coalescing or combining your partitions, because we turned off adaptive query execution, AQE. That's how it works. And obviously, if you have bigger data, or more than one or two or three distinct values in the grouping key, it will create more and more non-empty partitions, and therefore more tasks.
Let's sum up what we have done. First of all, we turned off AQE, just to see the results as they really are. Then we read the data, and there were two actions, which is why two jobs appeared: one for the read and one for inferSchema. Then we had our transformations: one stage with a single task for the filter and select (we accidentally copied that cell twice, but that's fine), and then a group by on the product name, which is a wide transformation, so there was a new stage for it. There were 200 shuffle partitions in total, 199 of them were empty and one actually had work to do. Now let's remove the select statement, just to show you something, and rerun. Again it skipped four jobs. Click on view for the ninth job and expand it, and you will see it gives you the same picture, with a shuffle write of 71 bytes. And if you click on the jobs, you will see the same skipped work: 75 + 100 = 175, plus 20 is 195, plus 4 is 199. So 199 partitions are skipped, and only one partition was involved in a real task. So these are the two stages, with one task per stage in the end. Take notes, it is very important; otherwise you will forget why we ended up with just one task. By the way, task counts matter less day to day nowadays, because AQE merges, or coalesces, the shuffle partitions for you, but this is very important from the interview point of view and for understanding what is happening behind the scenes. So note it all down.
So, it's finally time to master the most important and most popular topic: joins in Spark. Once you understand this concept, you are almost there to understanding nearly every concept in Spark. Trust me. Just sit back and relax, grab your coffee — I would love one too, but it's late here — and I have noted down the points I have to cover, so I am really excited to talk about this topic. You will understand each and every thing about joins. Let me open my notebook. First of all, just forget about the right-hand side of the diagram for now; you only need to focus on the part that I highlight. I have not highlighted that part, so for now just ignore it. Let's start.
Let's say this is your DataFrame one, DF1, and you have just read this data. Imagine the size of this data is around 450 MB. By default Spark will create four partitions out of it: three of 128 MB, plus one smaller partition. And we have only two executors, as you can see, executor one and executor two, so these four partitions will be distributed among them — common sense, right? So this is partition one, partition two, partition three, partition four. And as we all know, these are logical partitions. Just imagine we have lots of records, but for simplicity and to help you understand the concept, we'll show only five or six. Take it easy, it's very easy. So executor one has two partitions, one and two, and executor two has two partitions, three and four: four partitions, evenly spread. By the way, each executor here has four cores. If you don't know about cores, they are a generic computing term, and cores are what give you parallel processing. Don't get confused by the new word: if an executor has four cores, it can run four tasks in parallel. That is why two cores are occupied and two are empty on each executor. It's a very basic term; half of these terms just come from standard computing and IT. So, so far so good: this is your DataFrame, DF1, with just two columns, id and name, and these are the two executors with the four partitions evenly distributed. Well done. Now we need to apply a join with a new DataFrame.
What is this new DF and what does it look like? Let me show you. This is the second DataFrame that we need to join with; it is called DF2, and it has id and salary as its two columns. This one is also around 460 MB, say, which means it will also have four partitions. To keep things simple, I have drawn this DataFrame in green and the previous one in pink. As before, two of its partitions sit on this executor and two on that one, so the data has been read and is evenly distributed across the executors. So what is the current state of our executors? Executor one and executor two, each holding partitions of DF1 and DF2. Very good. Now I want to apply a join between DF1 and DF2. As we know, a join is a wide transformation — and if you didn't know that, now you do: a join is a wide transformation. So what happens in a wide transformation? It creates 200 shuffle partitions. And now you will actually feel why those 200 partitions matter so much — not every time, but in most cases. So what will happen? The eight partitions in total, four from each DataFrame, will be reshuffled into 200 partitions.
Now, what happens across these 200 partitions is the important part, and we already know the basic rule: the same key ends up on the same partition, so that partition has all the information it needs. That is exactly why I built your base so strong, because now this will be very easy to understand. Just tell me one thing. Suppose in DF2 you have IDs 1 and 2 here, 3 and 4 here, 5 here and 6 there. And in DF1 you have IDs 5 and 6 on one executor and 1, 2, 3 and 4 on the other. Forget about the 200 partitions for a moment: with the data laid out like this, can you apply the join without reshuffling? The answer is no. Why? The joining key is ID, and ID 5 of one DataFrame is not available on the same executor as ID 5 of the other. Let's forget the names DF1 and DF2 and just say DF pink and DF green: ID 5 of DF pink has no matching ID 5 of DF green on this executor, because that row lives on the second executor. We cannot apply the join, because the matching rows are not on the same machine, not on the same executor. That is why we need to reshuffle: we create 200 partitions, and now comes the main part. By the way, in the actual example data we do not have IDs 5 and 6; we have IDs like 1, 201 and 2001, but that was just for illustration.
So now what will we do? We will simply create 200 partitions, and these are not just ordinary partitions. Consider partition number one. What will land in it? Spark reshuffles all the data, and the row with ID 1 will go to partition one. Why? Because Spark computes something like a modulo: the ID modulo the total number of partitions, so 1 mod 200, which equals 1. (If you are rusty on the maths, feel free to check it, but trust me, 1 mod 200 is 1 — I did my bachelor's in mathematics.) Just for simplicity I am keeping the partition index as 1 here. In real life, if we are not joining on an integer column but on a string column, we obviously cannot apply a modulo directly, so Spark first creates a hash of that column value — basically a long number for that value. For example, if the value is "shoes", you hash "shoes" and get a long integer, and then the same idea applies. But to keep things simple for understanding: 1 mod 200 equals 1, so the row with ID 1 goes to partition number one.
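The bucket-assignment idea can be sketched in plain Python; this mirrors the mod/hash reasoning above rather than Spark's exact internal hash function (Spark uses its own Murmur3-based hashing), so treat the outputs as illustrative:

```python
# Illustrative sketch of hash partitioning, not Spark's exact internals:
# each join key maps to one of num_partitions buckets, so matching keys
# from both DataFrames land in the same shuffle partition.
NUM_PARTITIONS = 200

def partition_for(key) -> int:
    # For strings (or any non-integer key) we hash first, then take the modulo.
    return hash(key) % NUM_PARTITIONS

print(1 % NUM_PARTITIONS)      # 1
print(201 % NUM_PARTITIONS)    # 1  -> same partition as ID 1
print(2001 % NUM_PARTITIONS)   # 1  -> same partition again
print(partition_for("shoes"))  # some bucket between 0 and 199
```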
That row is from the green DF, so we will draw it in green here as well: ID 1 from DF green sits in partition one. Now take the pink DataFrame: its ID 1 gets exactly the same value, 1 mod 200 = 1, so it also goes to partition one. So partition one now holds ID 1 from DF pink too. Then Spark goes back to the green DF and finds ID 201. What is 201 mod 200? It is 1, so where will ID 201 go? To partition one as well. And obviously the pink ID 201 will also go to partition one. This is how it populates all 200 partitions, always placing the same joining key on the same partition. Same joining key, same partition — that is really important. So now you can see that this partition has ID 1 from both sides, and ID 201 from both sides, which means the matching joining keys have been brought onto the same partition.
Now tell me: can we apply the join? Easily. But hold on — we are not applying the join yet. Up to this point Spark has created the 200 partitions and just reshuffled all the data. Up to here, we are in what you can call the shuffled state. Note that down: this is the shuffled state. Now, this step is common to the two major shuffle joins: the shuffle sort merge join and the shuffle hash join. In both of them, this shuffling happens on the partitions first, so up to here the steps are identical; from here the two strategies diverge, taking either the sort merge path or the hash path. Let's understand both, and let me take a screenshot of this diagram so that when I explain the hash join later I can reuse it and you will follow easily. So, we have performed the shuffling; shuffling is done. Now let's see what happens in the sort merge part, which turns this into a shuffle sort merge join. Let's see what happens after the shuffle.
Consider partition one, and say it has quite a lot of data. On the DF pink side we have IDs 1, 201 and 2001, and on the DF green side we also have 1, 201 and 2001. You can even have multiple rows with the same ID on the pink side — not duplicates exactly, but transactional records — while DF green has just one row per key. You can picture it with the classic fact and dimension concept: the pink DF is your fact table, which can have repeated IDs, while the dimension has a single row per key. This example will become even clearer when we get to hashing, so don't worry about it for now. So inside the partition, what happens? When the data was shuffled in, nobody told Spark to sort it; maybe it stored 201 first, then 1, then 2001. It's up to Spark. Now, Spark could just match key by key: 1 to 1, 201 to 201, 2001 to 2001. But done naively that takes a lot of time, because for each value on one side it has to search through the whole long list on the other side. What if we sort the data first and then apply the join? Very good idea. That is exactly what a sort merge join is: it sorts the data of both DataFrames in the partition. If the data happens to be sorted already, fine; if not, it sorts DF pink and DF green first. So tell me: will the join be quicker now? Obviously, because we can walk through the sorted values side by side.
So this is your sort merge join. See how simple that was? You just need the right examples and the right people — if you agree with that, drop a comment right now. And let me tell you: this is your default join. Even if you do not specify any join strategy, this is the one that will be applied in most cases: the shuffle sort merge join. Now, what happens in the case of a shuffle hash join? Instead of sort and merge, it hashes. Let's redraw the picture — and let me quickly check my notes to make sure I haven't missed anything. This is covered, this is covered; I just want to make sure you will not miss anything. We have covered everything for shuffle sort merge except the code implementation, which I will show you after this. Yes, you will also see the code implementation, don't worry. Very good.
Now let's see the hash join. This is really important, so try to follow it. I will keep my notebook here because I want to make sure I cover everything. So let's see what happens inside partition one. Say DF pink is your fact table, so it can have repeated join keys: 1, 1, 201, 201, 2001, and so on. On the other side you have the dimension table, DF green, with the values 1, 201 and 2001 — in a dimension you would not keep duplicate joining keys. Now, what happens in broadcast — sorry, ignore that word, I was thinking ahead to broadcast hash joins; this is the shuffle hash join. So, what happens in a shuffle hash join? We know this is our partition, with DF pink on one side and DF green on the other. In the shuffle sort merge join we simply sorted the data and merged it. What happens in the shuffle hash join? We know DF green here is the smaller DataFrame.
So what does it do? It builds a hash table on that smaller table — let me sketch it by hand. What is the advantage? Every time it needs to join a row, it looks at the joining key and probes the hash table, which is basically a lookup table, like a dictionary. It can apply the join very quickly by looking up each key, because hash-table lookups are faster than scanning through the data the traditional way. So it goes to key 1, looks it up, joins; 201, looks it up, joins; 2001, looks it up, joins. Just make sure that the small side really is small enough to fit comfortably in memory, because the hash table is held in the executor's memory, and executor memory is precious; we cannot waste it, so we need to be cautious before choosing this kind of join. Also, the hash table is built per partition: whatever we did for partition one happens on every partition. That is the only difference between the shuffle sort merge join and the shuffle hash join: the shuffle is the same, and then one path sorts and merges while the other builds hash tables. These are the two major shuffle joins you will see.
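If you want to nudge Spark toward one of these two strategies for a particular join, join hints are one way to do it; this is a small optional sketch (assuming two existing DataFrames df1 and df2 with an id column), not something the walkthrough above requires:

```python
# Optional: steer the join strategy with hints (assumes df1, df2 with an `id` column).
# Spark treats hints as suggestions; the planner can still fall back if needed.

# Prefer a shuffle sort merge join for this join.
merged = df1.join(df2.hint("merge"), on="id")

# Prefer a shuffle hash join (hash table built on the hinted, smaller side).
hashed = df1.join(df2.hint("shuffle_hash"), on="id")

merged.explain()   # look for SortMergeJoin in the physical plan
hashed.explain()   # look for ShuffledHashJoin in the physical plan
```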
Great — let me look at my checklist to see if I missed anything. I made notes specifically for joins because I wanted to cover everything: partition of the small table, done; corresponding portion of the big table, done; what happens in each partition, explained; load the small table into memory, done; create the hash table for that partition, explained; scan the big side, done. Everything is covered; only the code implementation is left for this one as well. Very good. Now you will say: you used the word broadcast, and we have heard of broadcast too — what is a broadcast join? Let's talk about it. Officially it is called a broadcast hash join, but to keep things simple and avoid confusion, just call it a broadcast join. Should we implement the code first and then jump to the broadcast join? No — let's first understand the broadcast join as well, and then we will see the code implementation of all three together. Let's do it.
Let's try to understand broadcast joins now. This is your DF pink — you are well versed with it by now — the same standard 450 MB fact table we were using before, so it creates four partitions, 1 to 4. Now, the second DataFrame is very special. Why? This second DF, DF green, is only about 5 MB. It is very small — just a small dimension table — so obviously there will be just one partition, and it will sit on one of the executors. Now think about it: this is the whole table that is needed to join with those four partitions. The two partitions sitting next to it are happy, because they have every key they need right there — the entire table, not just a partition of it, is on their executor. The other two partitions are sad. They say, "Hey, we want to apply the join too; start reshuffling right now, we are waiting." Then Spark comes into the picture and says, "Hold on. You do not need any reshuffling, because shuffling is an expensive operation; instead, I will simply send this whole table to your executor as well, and you can enjoy." So the whole table is replicated onto the other executor too, and now those two partitions are also happy.
Now let me show you how this broadcasting happens. This is your driver node. The driver temporarily holds the broadcasted table, the broadcasted DataFrame, and then the driver broadcasts that DataFrame to all the executors so that they can apply the join locally. That is why it is recommended that the table be small enough to fit easily in the driver's memory, because the driver needs to broadcast it, and obviously it should also fit easily in the executor's memory, because it will be stored there. And remember, it is not kept in the cache memory; it is kept in the execution memory. What are cache memory and execution memory? Don't worry, we will discuss executor resource allocation and executor memory management very soon — top-notch topics. So this is all about your joins. Now let's see the code implementation. We will simply create two demo DataFrames to show you everything, and then I will change the join strategy one by one. And we may need to turn off the automatic broadcast join, because Spark can sometimes perform a broadcast join on its own; we will simply turn that off, don't worry.
Let's see how we can perform those joins in code. This is the notebook I have created, and in it the first thing we do is turn off the automatic broadcast join. Why? Because our data size is really small, so Spark's optimized engine would happily broadcast it on its own, and we do not want that yet. Then comes code you are already familiar with: we simply create two DataFrames, DF1 and DF2.
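The exact property used on screen isn't spelled out here, but the standard way to switch the automatic broadcast off is the autoBroadcastJoinThreshold setting (which matches the "minus one" mentioned later), so the following is a sketch under that assumption, with two illustrative demo DataFrames standing in for DF1 and DF2:

```python
# Disable Spark's automatic broadcast join by setting the size threshold to -1
# (standard setting for this purpose; assumed here rather than shown on screen).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Two small demo DataFrames, analogous to DF1 (id, name) and DF2 (id, salary).
df1 = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])
df2 = spark.createDataFrame([(1, 50000), (2, 60000)], ["id", "salary"])
```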
Now let's apply the join. How do we apply a join? It's very simple: write df_join = df1.join(df2, ...). Then we need to define the joining condition, which is simple — it's on the id column, df1's id equal to df2's id. Now the main thing: we need to say what kind of join we want — left, right, inner, full. For now let's apply a left join. Run this, and what happens? Nothing, because this is not an action. Let's trigger the action with display(df_join).
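A minimal sketch of that cell, continuing with the df1/df2 demo frames assumed above:

```python
# Shuffle sort merge join (Spark's usual choice once broadcasting is disabled).
df_join = df1.join(df2, df1["id"] == df2["id"], "left")

# Only an action triggers execution; in a Databricks notebook, display(df_join)
# plays the same role as show().
df_join.show()
df_join.explain()   # expect SortMergeJoin + Exchange in the physical plan
```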
Let's run it, and you will see the join actually happen; I will also show you the DAG, because it is really important for understanding the join. So this is the result of the join. Now click on the job list and open the latest one, job number one, because job zero is just the normal data creation; job one is the one we are looking for. Expand it: this is the job, and it has eight tasks in total. Click on it and wait a moment. You could click on DAG visualization, but let's go to the SQL / DataFrame tab instead to really see what happened. So what happened? It first scanned the data on each side. And AQE was still enabled, which is why you see some adaptive steps in there. So let's disable AQE again and rerun everything: spark.conf.set with the adaptive query execution property — we can just grab the same disable code we used before. I have run it; now let me restart the cluster and then apply everything from scratch, so that I can show you exactly what is happening. The thing is, AQE is an optimized engine — we will talk about it in detail later, and it is a game changer — but it can confuse the picture right now, so we turn it off just to see how Spark builds the DAG and the query plan without adaptive query execution. It should not take long to restart.
By default Spark tries to apply a shuffle sort merge join, and that is what we should see here as well. If you want to perform a broadcast join instead, you have to change the code, and don't worry, I will show you how. So the cluster has restarted; let me run all the cells with Run All instead of running them one by one. This one is completed, this one is completed, and it is displaying the result. Perfect. Now let's look at the job. There are a few jobs here, so don't get confused: the one we want is the display of df_join, which is job zero in this run. If you open job zero, you can see all three stages — stage zero, stage one, stage two — with their shuffle writes and so on. And whenever you want to see the DAG, just go to the SQL / DataFrame tab and click on the query for display df_join; that gives you the DAG. So what is it doing? It is reading the data — scanning the existing RDDs — for DF1 and DF2 one by one; you can see that id and name is your DF1 and id and salary is your DF2. Then it applies a filter. Now you might ask: hey, we didn't apply any filter. If you remember the earlier part of the video, we said that Spark adds an optimization step with not-null checks, because you should not be joining on nulls; it is just taking precautions before applying the join. Then the exchange happens — the 200 partitions; click on the plus button and you will see 200 partitions. The data from both sides is shuffled into those 200 partitions, then sorted, and then the sort merge join is applied. Shuffle sort merge join: that is our default join.
Now let's say I want to perform a broadcast join. How do I do that? As we know, the broadcasted table should be the smaller one; let's treat DF2 as our smaller table. So I will create a new join, df_join_broad, and this time wrap the smaller DataFrame in broadcast(), right before it in the join call. Take note of that. I also need to stop blocking broadcast joins: I had set the threshold to minus one to say "do not broadcast", and now I do want to broadcast, so I will simply remove that setting and rerun. I could set an explicit size threshold instead, but to keep things simple and easy to note down, I will just remove the setting and restart my cluster. Then I will run df_join_broad, and it will apply the broadcast join.
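A minimal sketch of that broadcast version, using the broadcast() function from pyspark.sql.functions and the same assumed df1/df2 demo frames:

```python
from pyspark.sql.functions import broadcast

# Explicitly broadcast the smaller DataFrame (df2) so every executor
# receives a full copy and no shuffle of df1 is needed for the join.
df_join_broad = df1.join(broadcast(df2), df1["id"] == df2["id"], "left")

df_join_broad.show()
df_join_broad.explain()   # expect BroadcastHashJoin + BroadcastExchange, no 200-partition shuffle
```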
This time you will see a different DAG; I will not refresh this window, I will just show you the difference. Let me collapse the earlier output. So this is our normal shuffle sort merge join. Very good. Now we will be performing the broadcast join. You might ask: aren't we really performing a broadcast hash join? Yes, you can call it a broadcast hash join, but as I mentioned, you need to be cautious with this in code and only use it when you are confident the broadcasted side is small enough. You could also turn off the shuffle sort merge join entirely, but that is not recommended, because you should generally rely on the Spark engine to pick the best join strategy, which is the shuffle sort merge join in most cases. The broadcast join is something you add on top when it helps — and by the way, AQE will pick a broadcast hash join for you when it is needed, though in some cases you still need to add it explicitly. And what is AQE, and why is it so popular and changing so much? AQE really did change a lot of things for us, and for a good cause. But in order to understand AQE, and to understand Spark in the modern world, you need to understand all of these fundamentals first; then you can easily follow what AQE is doing and why. With AI everywhere right now, you need your concepts to be really strong, and you know the place where you can get your fundamentals that strong: Ansh Lamba's channel. Hit the subscribe button, share this video, and drop a comment; I feel happy whenever I see those hearts, those smiling faces and those messages saying "I am placed at XYZ company."
XYZ company, I really feel happy. I really feel happy. So, bro, let's simply restart the cluster, because it's 8:32 p.m. right now, I started at 8:00 a.m., I need to eat something, and I didn't have coffee today. Huh? No worries. Everything for you, bro — and sis; "bro" here means both, male and female. How can I explain this? Anyway. I think last time the cluster started really quickly; it's just a matter of a few seconds. Let's see. And by the way, we need to rerun this code as well, because yes, we need to disable
AQE before doing anything. Okay. So, whenever you see something called "adaptive" in a plan, like you were seeing before, you can assume it is due to AQE, adaptive query execution. Okay, that's why we just disable it every time before we want to demonstrate anything.
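For note-taking, the switch being toggled is just a session config — a minimal sketch; turning AQE off is only to make the demo DAGs predictable, not a production recommendation:

```python
# Turn Adaptive Query Execution off so the demo DAGs show the classic plan.
spark.conf.set("spark.sql.adaptive.enabled", "false")

# ... run the join experiments ...

# Turn it back on afterwards (AQE is enabled by default in recent Spark versions).
spark.conf.set("spark.sql.adaptive.enabled", "true")
```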
Okay, make sense? Makes sense, bro. What are you doing, man? What's wrong with you?
Let me just fast forward this restarting of cluster. Okay. And let me just get back
to you. So cluster is on. Let's run all the commands and let's see what we get. What do we get? Let's
see. So perfect. And your joins concept should be crystal
clear after this. Okay, crystal clear, crystal clear, crystal clear. And you can even
like do something like experiment something to see more and more things and you can even like try with your own
CSV files. It's up to you. Okay, it's up to you. So it is simply running this job now. Display DF join broad. It has
created three jobs so far. Very good. So I am interested to see maybe job three because it will be for maybe display.
Makes sense. Let's click on this and let's go to SQL data frame and display DF join broad. Very good. Very good.
Very good. Very good. Because what happened? Oh wow. So this is your DF1 which is the bigger data frame. This is
the smaller data frame for ID and salary. See when I just hover over this I can see the columns. Then what
happened? What happened? It simply created a broadcast table called broadcast exchange. So it simply
broadcasted this table and it simply you can say joined this broadcasted table with your bigger
table. See and then you can see the rows output and then at the end collect limit. Wow.
This is your broadcast join. Okay. You can even like save these things in your notes. You can even draw it. You can
even take screenshots. Anything. Just save this information like this is very crucial information in the world of
Apache Spark. If you if you just compare the DAG of this one, see there was no broadcasting. But here you are actually
performing broadcasting. What is the advantage of broadcasting? There is no 200 partitions, no shuffling. See here
we have like shuffling 200 partitions but here is no shuffling. That is the advantage of broadcast join. No
shuffling. That's why it is really, really quick. Wow. Yes. So this was all about your joins: all the types of joins, including concepts, theory, practicals, demonstration, everything. I hope you understand joins thoroughly now. Okay. Now let's try to understand something very important in the world of Apache Spark. Now we're going to talk about the Spark SQL engine, or you can say the Catalyst optimizer. And now you will
see the entire flow, because you keep saying "your code is optimized by Spark, your code is optimized by Spark" — but how? Plus, every time we run df.explain, we see something called the
physical plan. Now what is this physical plan? Just clear everything right now because now we have a lot of knowledge
in Spark. Now we can understand that. Okay. So let me just tell you the entire flow and how Spark optimizes your code
and how Spark you can say picks the best plan and then sends that plan to the
executors. Let me just show you the entire flow. Okay. So let's say this is you, and you are a very dedicated data engineer. You are writing your Spark code. And you know that whatever code you write, it will not do anything by itself; Spark will simply store all that information as a plan. That plan — by the way, take notes — is called the unresolved logical plan. Unresolved. Yes. Now what is this unresolved logical plan?
Basically whatever you wrote df equals to blah blah blah I want to pick this column blah blah blah you wrote your
code right that is unresolved. Why? Because we have something called as catalog. What is catalog? Catalog has
all the objects registered. Catalog name, database name, table name, column name, all the things. Okay. All the
things. So whenever you are reading any data, Spark will simply create the unresolved logical plan, and then it will make sure that whatever columns you are using, whatever file name you are using — all that metadata (the catalog means all the metadata) — it will verify that every column you mention in your code is exactly a column that exists in the file. So it performs a small analysis. Performs what? A small analysis, on top of your unresolved logical plan, with the help of this catalog which has all the metadata. Okay. Once it approves all your columns — let's say you are writing your code and you have a column called categories, and due to a typo you wrote category instead of categories — it will throw an error called AnalysisException. That error comes from this analysis step against the catalog. Make sense? Very good. Once that analysis is done it will simply create the resolved logical plan — only
the resolved logical plan. That means okay everything is fine. There's no typo. All the columns that are being
used are there. So we can just use this code. Okay. Very good. So your code is converted into resolved logical plan.
Then your resolved logical plan will be converted into something called as optimized logical plan. Make sense?
Optimized logical plan. And this is the plan where you see the optimizations: you say, hey, I applied a filter, I applied a select statement, and Spark just combines both transformations. Why? Because it is optimized code — you can perform both things together. Okay. Another example, and this is a very good example: let's say you apply a select statement, then you perform some group by, then some kind of filter, and then many more things. The optimizer knows that the filter should be performed earlier, just to reduce the rows so that your query runs faster. So it knows which transformation should be prioritized first. Those are all examples of the optimized logical plan. Okay. So you can write your code in any order. You do not even need to
think anything. Hey um should I apply filter first because my data frame will be like shorter. You do not need to
worry about anything because Spark is there for you. Spark SQL engine is there for you. It is optimizing your code. It
is optimizing your code. So you do not need to worry at all. Simply apply all the transformations that you want to
apply. Just apply it. Don't need to worry. Don't need to worry. Once you have something called as optimized
logical plan that is not the end then that optimized logical plan this one will be converted into physical plans.
Now what are these physical plans? Basically, let's say you are applying a join, and in the optimized logical plan it is simply mentioned that the join should be applied after the filter, or whatever. That is just the optimized logical plan. But the optimized logical plan does not say whether it should apply a sort merge join, a hash join, or a broadcast join. No, it just says that it needs to apply a join. So which type of join to use will be determined by the physical plan. How? Basically, Spark creates many physical plans from that one optimized logical plan — all the physical plans that are possible. Then it uses something called the cost model. The cost model is like a funnel that takes all the physical plans and outputs the best physical plan, the least expensive one. You can think of it in terms of money as well, because if you are consuming more resources to process data you are obviously spending more money, but here cost does not directly mean money; it means computational resources, which can be expensive. So that is the cost model, and the best physical plan it outputs will go to the
executors to be executed. This is the entire flow of the Spark SQL engine: how it takes your code, converts it into the unresolved logical plan, performs the analysis, converts it into the resolved logical plan, then converts that into the optimized logical plan, and then converts the optimized logical plan into physical plans. When you run df.explain, you see the heading "Physical Plan" because that is the final plan that goes to the executors, and the best physical plan is picked through this cost model — everything is done by Spark. Wow, this is your end-to-end flow. I hope you now understand everything related to Spark optimization: how it optimizes your code, how it creates a physical plan, and how it sends the plan to the executors.
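If you want to see these plans yourself, the explain API prints them — a minimal sketch, where df is whatever DataFrame you have built:

```python
# Prints the parsed (unresolved), analyzed (resolved), optimized logical,
# and physical plans for the DataFrame's query.
df.explain(extended=True)

# The default call prints only the final physical plan,
# which is what actually gets shipped to the executors.
df.explain()
```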
Make sense? Very good. Now let's try to understand some very important stuff in Spark. Let me assure you, now we're looking at one of the most popular and most important topics — and this one and the next one are both very important: driver memory management and executor memory management. Let's start with driver memory management. Driver memory management is pretty simple; the executor side is also simple because I am explaining it to you, but trust me, the driver side is much simpler than the executor side. So let's first look at what driver memory is. We all know that the driver is the heart of the whole Spark application, or you can say the whole Spark framework, because the driver is responsible for orchestrating the work, creating the logical plan, the optimized logical plan, and everything. So let's look at that. Basically, the driver in Spark is the master process that coordinates all the tasks. We know that it holds metadata, task scheduling info, DAGs, and more. Agree? Very good. Driver
memory equals the memory allocated to the driver process when the Spark job runs. We will talk about it, don't worry. If the driver runs out of memory, we see an error called driver out of memory, and we're going to talk about that as well. So just stay with me; everything will be crystal clear in a few minutes. Basically, the driver's memory has two major components: JVM heap memory and overhead memory. For now, just ignore overhead memory. Let's start with JVM heap memory. What is that? As the name suggests, JVM heap memory — you can also call it on-heap memory — is basically for JVM tasks, all the Java virtual machine tasks such as DAGs, metadata, and broadcast variables. Broadcast variables, yes: you all know that whenever we need to broadcast something, we first bring that data to the driver, then we broadcast it to the executors. So JVM heap memory is the area where we bring that data — broadcast variables, broadcast tables, everything. Makes sense. And obviously task scheduling: if we want to schedule anything, we will be using this memory. So you can conclude that the most important area of driver memory is the JVM heap memory, because it is used in most cases, or you can say it is responsible for most of the tasks. That doesn't mean you should ignore overhead memory. Overhead memory is responsible for JVM threads, shared libraries, off-heap buffers, things like Netty and native code. In short, it is responsible for all the non-JVM (non-Java-virtual-machine) tasks. It is very good for keeping the process
stable. And you will obviously ask me: hey, do we actually need it, can we just skip it, and how can we configure it? All the answers will be given to you on the next slide. So basically, whenever you want to request driver memory — let's say you requested a 10 GB driver — we are actually setting spark.driver.memory, which is your JVM heap memory. So we are requesting 10 gigs, basically 10 GB. Makes sense. And we will automatically get 10% extra for our overhead memory, unless we have explicitly defined the overhead memory as, say, 1 GB or 2 GB. If we do not define it — and usually we do not — we simply get 10% of this memory. 10% of 10 GB is 1 GB, so 1 GB will go to spark.driver.memoryOverhead automatically. So in total we'll be getting 11 GB, right? Exactly. You request the memory for your JVM heap process, but you also get 10% added on top: 10 GB plus 10% of it, which equals 1 GB. Now, it is not always 10%. Why? Because in some cases, if you are requesting, let's say, a 1 GB driver instead of 10 GB, you will get 384 MB. Ansh Lamba, we also know a little bit of mathematics — 10% of 1 GB is not 384 MB. I know that. That is why you will get either 10% of the total driver memory that you requested or 384 MB, whichever is higher; that is how Spark decides. So in the case of 10 GB, 10% of it is 1 GB, which is greater than 384 MB, so we get 1 GB. In the case of a 1 GB driver, 10% of it is about 100 MB, which is obviously less than 384 MB, so it simply gets 384 MB. Make sense? Very good. And overhead memory is also useful for other things — the objects it creates for us, like containers, also need some space, right? All of those things go under this driver memory overhead. See, it was so simple.
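For your notes, these are the knobs being described — a minimal sketch with the lecture's example numbers; in cluster deployments the driver memory is usually passed to spark-submit before the driver JVM starts rather than set inside the application:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("driver-memory-demo")
    # JVM heap memory for the driver process (the 10 GB from the example).
    .config("spark.driver.memory", "10g")
    # Optional explicit overhead; if omitted, Spark uses
    # max(10% of driver memory, 384 MB).
    .config("spark.driver.memoryOverhead", "1g")
    .getOrCreate()
)

# Equivalent on the command line:
#   spark-submit --driver-memory 10g \
#     --conf spark.driver.memoryOverhead=1g your_app.py
```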
Now you will say: Ansh Lamba, tell us about driver out of memory, because we have seen a lot of interview questions regarding it. So basically, let's talk about it. Driver out of memory can happen for a lot of reasons, and today we'll just talk about the major ones, or you can say the popular ones. As we know, the driver's heap memory holds the broadcast variables — if you want to broadcast variables or broadcast data, that data is first brought to the driver. So if you are broadcasting data that is bigger than your driver's heap memory, it will obviously create the error. That is one type of driver out of memory. Just note it down in your notebook. Make sense? That is one type. The second one, and the most popular one,
is whenever you run a command such as df.collect(). Why? So, let's say this is the driver and this is an executor — just to make you aware — and if your memory is sharp, this should be clear because we have already used this as our example. So this is our executor, in which we have four partitions. Okay, very good. Now, when we run df.show(), or let's say display(df), what happens? Executors send data to the driver. Make sense? And when we run the display command, we just want to see a few records, right? We do not want to see the entire data; we see, I think, a thousand records by default. So what it will do is simply pick any one of the partitions — it cannot send half a partition or one-third of a partition — it will simply send one whole partition to the driver and say, hey, just show this data to the customer. That's it. But when you run a command such as df.collect(), collect means you want to collect all the data in one place. Oh, so that means in that scenario it will collect all of these partitions and send all of them to the driver. So now all these partitions will be going to the driver — just imagine the load on the driver. If these partitions are small enough for the driver to handle, it is fine. But let's imagine the size of all these partitions together is more than the JVM heap memory that we have provided to the driver. It will simply throw the error
driver out of memory. So what is the solution in that scenario? Simply increase the driver size — or ask yourself, why are you using the collect() method at all? Who told you to use it? You should use collect() wisely, with some common sense, and only when you have a few records. I have used collect() many times, but I use it when I want to, say, convert a DataFrame into a variable, just to fetch something like the maximum date from a DataFrame. I use collect() in that scenario. Makes sense. I do not pull all my data to the driver. So use collect() wisely: only with DataFrames that will not return a lot of rows. A very popular use case for collect() is getting the maximum date from a DataFrame where we know there will be just one row; we can then convert that single-row DataFrame into a key-value pair and use it — that's a totally different thing. But you should not use collect() to bring all the data to the driver. Make sense?
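Here is a minimal sketch of that advice; df and the "order_date" column are hypothetical stand-ins:

```python
from pyspark.sql import functions as F

# Fine: the aggregation returns a single row, so collect() is cheap.
max_date = df.agg(F.max("order_date").alias("max_date")).collect()[0]["max_date"]

# Risky: this ships every partition of df to the driver and can trigger
# a driver out-of-memory error on large data.
# all_rows = df.collect()

# Safer ways to peek at data: show() / limit() only move a small sample.
df.show(20)
sample = df.limit(1000).toPandas()
```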
So this is your driver out of memory error, and you know how to mitigate it. Now let's talk about executor memory management, and obviously we'll talk about executor out of memory as well. Let's see. So basically, this is — by the way, both are important, but I would say this is a little more important than driver memory, and if you ask me from the interview perspective, it is much more important. So, without delaying, let's actually understand what this is. I know it sounds like a complex topic, but I will make you understand it easily. So
basically, executor memory is divided into four major parts. First, JVM heap memory: this is again the main memory, similar to the driver, and we are going to discuss it in great detail, so don't worry about that. Second, off-heap memory: off-heap memory in the executor is zero by default, and we rarely use it, but sometimes it becomes really handy. Another thing: off-heap memory is not managed by the JVM, which means you need to manage it yourself; it brings some overhead, and that's why we rarely use it, but sometimes it is really helpful. Don't worry, we'll talk about this as well — this is a very high-level overview just to set the ground for you. The third part is overhead memory, similar to the driver: again it is 10% of the total executor memory that you requested, and you already have some idea of why we use overhead memory, right? Very good. It is similar to the driver's, and the same concept applies here, so no change. Then we have PySpark memory. Oh, what's that? This is again zero by default; we rarely use PySpark memory, it is just for PySpark applications. So, to give you a heads-up: JVM heap memory and overhead memory are really important, because we use those two in every application, while off-heap memory and PySpark memory are zero by default and rarely used. Makes sense. Still, we will look at all the parts, but the OG is JVM heap memory, and it is divided into many parts — there's a hierarchy of memory layers. Don't worry, all of them are simple. So first of all let's master the heap, and then we will quickly cover the others, because once you understand this, all the other things are easy. Trust me, this is the major area you need to focus on. Okay, let's understand and master the JVM heap memory of
the executor. So now let's expand these memories and you will actually understand the whole flow, because we are going to go into detail. Let's say you requested some memory from the resource manager — say, 10 gigs — and let's see what happens next, because you requested it but you don't know what happens in the back end. Let me show you. So basically, let's say you requested executor memory equal to 10 gigs, and you request it using this particular configuration: spark.executor.memory. This is your JVM heap memory — the on-heap memory, or you can say the main memory — as we just talked about, and we're going to expand this as well, so don't worry. Similar to the driver, you will also get 10% of the total executor memory additionally. So, in short, when you requested 10 gigs, that doesn't mean 10 gigs covers both; 10 gigs means only this heap memory, and the overhead will be available for you additionally. So 10 + 1: the total executor memory will be 11 gigs including overhead, but you always request 10 gigs and simply get 10% on top whenever you request the memory. Make sense? Okay. And now, as I just mentioned, we are going to expand this memory. For the driver we do not have much stuff to do, so we did not go deeper inside its memory — it just needs to orchestrate, and that's it. But here in the executor you need to learn some extra things, and obviously, in order to tune your executors, you need to know all of this. So now it's time to actually expand this memory. Make sense? And again, I'm repeating: this is the memory that you get when you request from the resource manager. For now we are taking the example of 10 gigs. Make sense? Very good. So let's expand this memory now.
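As a reference, the request being described looks roughly like this — a minimal sketch with the example values; the overhead setting is honoured by cluster managers like YARN and Kubernetes:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-memory-demo")
    # JVM heap (on-heap) memory per executor: the 10 GB we "request".
    .config("spark.executor.memory", "10g")
    # Overhead comes on top: max(10% of executor memory, 384 MB) by default,
    # so each executor container ends up around 11 GB in total.
    .config("spark.executor.memoryOverhead", "1g")
    .getOrCreate()
)
```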
So this is the memory. Wow. This is your JVM heap memory — the total memory, which is 10 gigs. Okay. We have three major parts in this memory. The first is reserved memory. Let's talk about this first. This is a fixed amount for the Spark engine — no negotiations, nothing. We do not allocate it ourselves; it is automatically reserved for the Spark engine. Fixed, no negotiations: reserved memory for the Spark engine. Then we have something very special. What's that? The second part is the Spark memory pool — you can call it Spark pool memory or whatever you want; I call it the pool memory. The Spark memory pool is by default 60%. 60% of what? 60% of the total memory minus 300 MB. With some basic mathematical knowledge it is obvious: the gray area is the total memory, 10 gigs; we subtract the 300 MB of reserved memory, and this pool is 60% of the remainder, i.e. of (total minus 300 MB). Make sense? So this is the Spark memory pool, 60% by default. But if you want to tune it, you can, using spark.memory.fraction. You can make it maybe 50%, 40%, 70%, or 80%, depending on your requirements and on what your code is doing. By default it is 60% — even if you do not define it, it takes 60%, so you do not need to tell Spark every time, hey, I want 60%. The third memory that we have is user memory. On the slide I just wrote "memory", but actually it is user memory; I'll write it for you, no worries. So what is this user memory? It is the remaining memory — and just tell me, what will the remaining memory be? Obviously 40% of (total minus 300 MB). Make sense? Here is how to calculate it: this is the total memory, this is the reserved 300 MB, this part is 60%, and this part is 40%. Simple. Why do we need user memory, and why do we allocate 40% to it by default? Because in your code you will be writing a lot of UDFs (user-defined functions) and user-defined data structures, and all of those operations are performed under this user memory. Make sense? Now
let's actually see this memory allocation with numbers, so that you can really understand it. We took the example of 10 gigs, and you can take notes on this. So this is 300 MB reserved — I have written that so you can note it. Now calculate 60% of the rest. Let me show you how to calculate these things: 0.6 × (10 GB − 300 MB). 300 MB is roughly 0.3 GB, so the remainder is 9.7 GB. You can use your calculator: 0.6 × 9.7 ≈ 5.8 GB, so the Spark memory pool is roughly 5.8 gigs. And user memory, as should be understood, is 0.4 × 9.7 ≈ 3.9 GB. You can check it again with your calculator. So these are the numbers that you should get when you ask for that memory.
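If you want to double-check the arithmetic, here is a tiny sketch of the same split, assuming the default spark.memory.fraction of 0.6 and the fixed 300 MB reserve:

```python
# Rough on-heap split for a 10 GB executor heap with default settings.
heap_gb = 10.0
reserved_gb = 300 / 1024                     # ~0.3 GB reserved for the Spark engine
usable_gb = heap_gb - reserved_gb            # ~9.7 GB left to split

spark_pool_gb = 0.6 * usable_gb              # ~5.8 GB unified pool (execution + storage)
user_memory_gb = 0.4 * usable_gb             # ~3.9 GB user memory (UDFs, user objects)
overhead_gb = max(0.10 * heap_gb, 384 / 1024)  # ~1 GB off-heap overhead on top

print(round(spark_pool_gb, 2), round(user_memory_gb, 2), round(overhead_gb, 2))
# 5.82 3.88 1.0
```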
Ansh Lamba, is that everything? Are we good with the executor now? Actually, no, because now we need to expand this memory further. You can ask me: we understood that the reserved memory is for the Spark engine and the user memory is for user-defined functions — but where are our transformations happening? They happen inside this memory, the Spark memory pool. And where is our cached data located? Also under this memory. Oh, okay. So we need to expand this memory as well. Exactly. So now let's expand the Spark memory pool. This is your Spark memory pool, and don't worry, it is very simple. We already know what memory we have here — it is 0.6 × (total − 300 MB) — but let's take an example of 5 gigs just to keep it simple. You can calculate the exact number, but for simplicity we'll take 5 GB as the pool. We have two major memories within this. One is storage memory; the second is execution memory (executor memory). In the storage memory you cache your data — it is used for caching. If you are caching your DataFrame (don't worry, we will talk about caching in detail), you will actually use this storage memory. Make sense? It is also called long-term memory, because you actually store data there that you can refer to later in the code; it will not be eliminated. Whereas, on the other hand, execution memory is the memory used to perform your transformations: joins, group by, everything happens in this memory. Each is 50% of the pool, so each will be, let's say, 2.5 gigs. Make sense? Very good. Execution memory is also called short-term memory, because it will not hold onto your data — it needs to process a lot of data, and if it held your data long-term it could not process anything; you would just be getting errors like executor out of memory again and again. Make sense? Okay. So now, Ansh Lamba, you said this is 50% and this is 50% — who defined this number? No one; it is 50/50 by default. But if you want to change it, you can use spark.memory.storageFraction and define maybe 60/40 or 70/30. It depends on how much data you want to cache; maybe you do not want to cache much data, so you can reduce the storage side. Common sense. Okay, so that is the default.
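For your notes, these two ratios are set at application start — a minimal sketch, with the default values shown just as placeholders rather than recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-fraction-demo")
    # Fraction of (heap - 300 MB) given to the unified Spark memory pool (default 0.6).
    .config("spark.memory.fraction", "0.6")
    # Share of that pool initially set aside for storage/caching (default 0.5);
    # the boundary is soft, as described next.
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)
```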
Now, one very good point — let me just tell you. Can you see this line? This line is a kind of divider between both memories. The good thing is that it is not a rigid boundary; it is a flexible boundary. What? Yes. Even if you define that you want 50% — I know that is the default, but still, say you define 50% — this boundary can still move up and down. Really? Yes. That's where the concepts of allocation and borrowing come into the picture. In short, let's say you are storing data in the storage memory — you're caching your data — and this memory is full, but now you need to cache more data, and Spark knows that the execution side does not have much to do and has a lot of free space. That means the storage area can be expanded: the line will shift so that storage can use some space from the execution area. And vice versa is also true; the line can move the other way as well. So even if we define 50%, the boundary can move up and down — it is flexible. Make sense? Very, very good. I know you just got a lot of information, a whole hierarchy within executor memory management — just take notes of everything. Now let's talk about this line, how it becomes flexible, and how allocation and borrowing come into the picture. So let's talk about unified memory management. Basically, unified memory is just another name for your Spark memory pool, in which we have two memories: execution memory (some people call it executor memory) and storage memory. This is the unified memory. So let's try to understand how it works, because we know this is the separator between them, and by default it is 50/50. So this is 50% and this is 50%; we know that, and we also know that this divider is flexible. It can move towards the right.
It can move towards left. Make sense? Okay. Very good. So let's try to understand with the scenarios because
once you get all these scenarios, this will be a piece of cake for you. So now let's try to understand it. Let's say
you are using execution memory — obviously for joins, group by, all the transformations — and this 50% quota is full. Now it needs more memory to perform more transformations. So what will it do? It will see that there is some free space on the storage side. Wow. So it will simply occupy that space and come over here — let's say it needs this much space. Make sense? It can occupy this space because it was empty. Sorted. It can just take the space. Very good. Now, let's cover the next situation in a separate scenario, because this was scenario number one, and I want to include all the scenarios — if you miss any one of them it will leave confusion in your mind, trust me. So scenario one is when we have free space available. Okay, let's look at scenario number two; this was our setup, and now we are going to make some tweaks. Let's say your execution memory is being used up to here, and now it needs even more space. What will happen? This is really important. It will try to evict the cached blocks, because we know that the storage memory holds cached data. It will simply evict them and occupy that space as well — but using the LRU (least recently used) method. Basically, whichever block of cached data is least recently used will be evicted: first it evicts this one, then this one, then this one, one by one. That means execution memory has the power, the authority, to evict storage memory. Why? Because it is prioritized — it needs to perform your transformations, it needs to actually process your data, right? So it gets the priority and it will simply evict the occupied storage memory one by one based on the LRU method, or you can say least-recently-used cached
data. Make sense? Okay, very good. Let's look at another scenario. Now let's say your storage memory is full. You have
cached so much data and your execution memory is free. In that particular scenario, storage memory can borrow the space — it is allowed, it is possible. So it will come over and take the space easily. Make sense? Very good. And just a follow-up question, which should be understood now: what if execution memory then needs more space? So this is the whole storage memory right now, and execution needs more space — it will simply evict the storage memory based on the LRU method. Simple. That should be understood now; that was just a follow-up question. Make sense? That's how it goes from left to right, how it allocates memory, borrows memory, and then returns it back. Okay, let's consider one more scenario, because it is very interesting. Let's say storage has occupied the extra space. Now you will say: tell me, Ansh Lamba, if storage memory needs even more space, what will happen? This is really important. Let me tell you. Let's say storage memory is full, and the empty space it borrowed is also full. Very good. Now storage memory needs more space. What will happen? Can it evict execution memory? No way. It is not allowed, because storage memory has no right, no authority, to evict our execution memory. So what will it do? It will evict its own blocks based on the LRU approach. Let's say these two blocks are the least recently used among the others; it will simply evict them and store the new cached data in their place. Make sense? So storage is not allowed to evict execution; only execution memory has the authority to evict storage memory, and the other way around is not possible — it is one way only. Okay, I hope I have covered all these scenarios, and I know that now you are very
much confident with this concept called unified memory management, and I hope it will also help you to take notes, because I have shown it to you in a diagrammatic way — oh man, I need coffee — diagrammatic, meaning with the help of diagrams, so that you can actually understand and visualize it. Okay, very good. Let's see what we have next in this Spark course. So let's talk about executor out of memory, and I know this is a very critical topic and obviously a little bit
complex — not really, because now you will understand everything. So let's get started. Let's say this is your data frame, a simple DataFrame with ID and product category. Sorted, good. I have just taken the top six rows, but let's say this DataFrame has millions of rows — 100, 200 million rows. Okay, very good. And this is our executor — forget about the inner colours, green, pink and blue — this is the executor that is processing this data. For now, just to keep things simple, imagine we have only one executor, and this is the disk associated with that executor, because we just have one. Now, this area of the executor is the execution side — you now understand executor memory in detail — the area that processes and transforms your data; you can think of it as the unified memory. Let's say the unified memory allocation is approximately 1 gig, just to keep things simple. 1 GB, makes sense. And we have applied a group by operation on the product category column. We know that each product category will get a partition holding all the rows for that category: for example, one partition for shoes, shoes, shoes; one for food, food, food; one for dairy, dairy, dairy. We know that, right? Very good. So, if our data is evenly scattered, if it is not skewed — and what is skewed data? Skewed data means one product category has most of the data, covering, say, at least 70–80% of it; it is called skewed because the data is concentrated on only one value instead of being evenly distributed. Okay. So if it is evenly scattered, it is fine, because the group by will create one partition for food, one partition for shoes, one partition for dairy, and each partition is, let's say, 300 MB. 300 MB, 300 MB and 300 MB — the executor can process that because the total fits within 1 GB. Make sense? Okay. Now what happens if we have more data? Let's say we have one more category called plants, and the plants partition is also 300 MB. In that scenario, that partition obviously needs to fit here too — say this executor has four cores — but we know we only have 1 GB. So what is going to happen? What can the executor do? It has some leverage called spilling. It's called data
spill. So it can spill the data, spill the partitions or you can just spill the intermediate results to the disk. Let's
say it decided to spill this partition to the disk. So now it is on the disk; it is already computed, it has some intermediate results, and they are stored on disk. So here we have space now, and the new partition can fit. So far so good. Now you will ask, is that allowed? Yes, it is allowed: the executor can spill data to the disk. Whenever it needs that data again — let's say you are writing this data out — it will simply go back to the disk, grab the data, and produce the output. Simple, because this partition is an independent partition. So what is not allowed? Let me tell you — you will ask me, Ansh Lamba, come here, why? Basically, this is 300 MB, this is 300 MB, this is 300 MB: 900 MB in total. And the new partition, the one for plants, is also 300 MB. Imagine. So you will say: Ansh Lamba, we have 100 MB available here, why are we shifting the whole partition to disk? We could simply shift 200 MB of data, just part of the partition. No, that is not allowed. You either spill the whole partition or you don't spill it at all. You cannot say, hey, just break this partition and shift half of it. You have to shift the whole partition. So this whole partition is spilled to the disk, and now you have 400 MB available in your executor to process the data, and the new partition can stay there. Very good. That is the concept of spilling. So, Ansh Lamba, where is the error then? Because spilling is allowed, right? Obviously it is allowed. So here comes the question: if the executor can spill data to the disk, why does it ever throw the error called executor out of memory, given that we have so much space on the disk? The thing is — let me tell you when it will throw the error. Let me just clear everything. Let's say your data is
skewed. Skewed means, let's say, one partition alone is 800 MB when we apply the group by — earlier it was 300 MB and it was fine, but now the data is skewed. I have applied the group by on product category, makes sense. So what will happen? Let's say this green partition was 300 MB before, when the data was evenly distributed; now the data is skewed, so this partition is 800 MB. Okay. Is this fine? Yeah, up to here it is fine — but it is skewed data, and I'm just telling you what skewness is. Now, we know that if this partition is 300 MB and this one is also 300 MB, can they all stay in memory? No, some will simply be spilled to the disk, right? Very good — I just wanted to check your knowledge of skewness and of the concept. Now let's talk about extreme skewness. Let's say your data is growing rapidly and, over time, you are only receiving orders for food — your data is growing, but only the food product category is selling. Now your partition size has become 1.5 gigs. Oh yes, it is good news for you, because your store is doing well, you are selling so much — but the data is skewed. So again you apply the group by and all the food IDs need to go into the same partition, so now this partition is 1.5 gigs. Now tell me one thing: can it stay in memory? Executor memory is 1 gig — can it stay there? No. Very good. Can we spill this data to the disk? Yes. So it
will go here. Okay. Sorted. Hm. Sorted. Okay. This is fine. But we need to process the data, right? We need to
process this data. We can spill this data. It is like in disk. Okay. But what about the processing? We need to process
this data. And in order to process this data, this needs to go to the executor. And executor's memory is 1 gig.
Now is the time to throw the error called executor out of memory. The executor is saying: hey bro, just tell me what to do, because I have just 1 gig of memory and you are giving me a partition of 1.5 gigs. How can I do that? So now it throws the error executor out of memory, and this is the
scenario. Now here you have two options. You can either expand the memory. You will simply say I have a lot of money.
Okay, let's give it 2 gigs of memory. Let's give it 2 gigs of memory. Okay, very good. Clapping. Very good. Now,
let's say after a month, you got more orders. Your store is growing rapidly. Now, this data size is 3 gigs. Again,
will you expand the memory? You can, but it is a foolish approach. You cannot like expand the memory again and again.
No. Here comes the concept of eliminating the skewness. You should focus on eliminating the skewness instead of just expanding the memory. Make sense? Very good. So, how can we eliminate the skewness? Here comes the concept of salting. Salting is an approach with the help of which we eliminate the skewness. Make sense? Very good. I hope these things are getting absorbed in your head. So now you've got the concept, and I know you are really excited to see how we can fix the skewness, because this is a single partition — you said we cannot break the partition, we cannot do this, we cannot do that, so how will we fix the skewness? Let me tell you how you will fix it: you, as a data engineer, will fix the skewness, and you have to do it. Don't worry, I'll tell you how, and I will show you a lot of techniques. Let's learn one first. It's called salting. Yes, salt — S-A-L-T, real salt. Let me tell you what it is. You've got the concept, you've got the
error. Okay, let's try to fix it. So, let's finally talk about salting. And trust me, this is very very
very important topic and very easy, very easy. Trust me, very easy. So in front of your screen you are seeing a data
frame in which you can see the skewness — a lot of food values. We have already talked about this; it is fully skewed, and obviously that is good for the store owner because he or she is gaining a lot of sales, but you as a data engineer need to deal with the skewness, right? But that's fine, because now you know the right resources to learn data engineering, right? Hit the subscribe button right now and share this channel with others, if you want to support me and if you love me. Okay, nice. So basically, let's talk about this. You have this column and you are seeing so many food values. We know that whenever we apply the group by, it will create that 1.5 gig partition, and we cannot handle that — it is not practically possible. Make sense? Very good. So what will we do? What if I tell you that you can add a column called, let's say, salt? This is a salt column. In this salt column you will pick some random values — random, but they should still make sense: you should pick the right salt size. You know that your skewed key is about 1.5 gigs, and you want to divide that partition into, say, four mini-partitions. 1.5 GB divided by 4 equals 375 MB — mathematics honours, I told you, bro. So 375 MB is fine as one partition size. It totally depends on the requirement, the solution, and the architecture, but in our case we know that four partitions are fine. Very good. So we will simply pick a salt size equal to 4. What does that mean? It means
that it will simply create an array of four values — 0, 1, 2, 3 — and randomly assign a value from this list to each row. Let's say it assigns zero here, one here, two here, zero here, three here, then 3, 2, 1, 0 — randomly. Very good. So now we will apply the group by on two columns. On two columns: (food, 0), (food, 0), (food, 0) — this will be one partition; (food, 1), (food, 1) — another partition; (food, 2), (food, 2) — one partition; (food, 3), (food, 3) — one partition. So each partition is roughly 375 MB — not exactly, because the salt is randomly distributed, but still somewhere between 300 and 500 MB. Make sense? So the hot key is now spread across four partitions, and we know that our executor can process each partition independently. If it needs more space, a partition can freely go to the disk and be brought back, because it is an independent partition whose size can fit in the executor memory for computation. Make sense? This is the concept of salting: we simply add a salt, but with a wise approach. We figure out how many partitions we need and how many would be efficient, we decide the salt size, we assign random numbers from that list, and then we split our partitions — basically, we apply the group by operation on two columns instead of just one. This is salting. Make sense? Very good. All these concepts are really, really important.
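Here is a minimal salting sketch; df, the "product_category" column, and the count aggregation are hypothetical stand-ins for the lecture's example:

```python
from pyspark.sql import functions as F

salt_size = 4  # we decided 4 mini-partitions (~375 MB each) is enough

# 1. Add a random salt value between 0 and salt_size - 1 to every row.
salted = df.withColumn("salt", (F.rand() * salt_size).cast("int"))

# 2. Aggregate on (category, salt) so the hot "food" key is split
#    across salt_size smaller partitions instead of one huge one.
partial = salted.groupBy("product_category", "salt").agg(
    F.count("*").alias("partial_count")
)

# 3. Aggregate once more on the category alone to get the final result.
result = partial.groupBy("product_category").agg(
    F.sum("partial_count").alias("order_count")
)
```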
All of them are really important. Okay, so this is your salting — you should take notes. And in bigger companies you will see that some developers have already written the salting code, and you will look at it and wonder, why are they doing this? This is the reason. Make sense? Very good. So I hope it is sorted now, I hope you know the concept of salting, I hope you know everything. Very good. Because concepts are much more important than implementation — trust me. Once you know the concept, writing the code is not a big deal nowadays; ChatGPT and all those AI applications are there. But in order to understand it, you need your brain. Okay. So now this is also covered; salting is also done. Now let's see something new, which is, I would say, mostly independent of this — not totally independent, because everything is linked to Spark and executors, but yes, it is a different topic. So let's see what that is. Let's
talk about caching. Caching is one of the most used techniques. Whenever you are writing code — and don't worry, I will show you a demonstration of caching as well, because you will understand by looking at it — why do we need caching? Caching is something you should already know about: whenever you use a mobile phone and see the recent applications in the recents tab, that is also a kind of caching, because the phone keeps that data instead of eliminating it, so that you can resume from the part where you left off. That is caching. Similarly, caching is another important thing that we have in
Spark. So let's say you are writing your DF. Let's say you are creating DF1 and you simply read the data frame. Okay.
Then you created DF2 and you used DF1 to do something — let's say you applied a filter. Then you created DF3 and, let's say, you performed some aggregation on DF2. Make sense? All of these things will be done by the executors, and they will be performed in the execution memory area, as you all know, because that is the area where all the transformations happen. So what will happen? First, obviously, Spark will create the DAG of all the transformations: a DAG for DF1, then for DF2, then for DF3. Make sense? So first it will compute DF1 using that DAG, and DF1 will reside here, in execution memory. Very good. The moment we start creating DF2, it will simply eliminate DF1. Why? Because this is short-term memory. Then it will create DF2 — but wait, in order to create DF2, where is DF1? Nowhere, because DF1 was eliminated; this is short-term memory. So in order to create DF2, it will first recreate DF1 — it goes back to the first step — and then create DF2 using DF1. Oh, okay. Now let's say you want to create DF3. Obviously DF2 will be eliminated too, because it needs more space for DF3. In order to create DF3 (let me use a different colour, green), where are DF1 and DF2? Nowhere. It will simply recompute DF1 first, then create DF2 using the DAG, then create DF3. Oh, okay. But now let's say you are creating a DF4 and you are again using DF1 — you're applying another filter. Then you create DF5, again using DF1, doing something else. Then DF6, again using DF1, doing something else. So every time you use DF1, Spark needs to go back to the first step, compute DF1, and then perform your transformation, right? What if I told you I can store this DF1 in long-term memory? That is your storage memory: I can store DF1 there, and the moment I do, I do not need to recompute it. It will save me a lot of time and a lot of resources. I will simply read it from there for DF4, for DF5, for DF6, and do my calculations. This is the concept of caching: I can store the data in long-term memory. Okay, let's try to see this with the help of code, with the help of a notebook, and I will show you how caching behaves. So now let's talk about caching in the notebook.
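Before the notebook walkthrough, here is a compact sketch of the same idea; df1 and the file path are hypothetical stand-ins, and note that cache() is lazy, so an action is needed to actually materialize it:

```python
from pyspark.sql import functions as F

df1 = spark.read.csv("/path/to/data.csv", header=True)  # hypothetical path
df1.cache()          # mark df1 for storage memory (lazily)
df1.count()          # an action materializes the cache

df2 = df1.filter(F.col("id") == 1)        # reuses the cached df1
df3 = df1.groupBy("name").count()         # reuses the cached df1 again

df1.unpersist()      # free the storage memory when you are done
```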
Let's quickly create our DF. This is the code — the same code we are using in the other notebooks as well, so you will feel aligned with the code and approach. Okay, so this is our DF1. Let me just run this, and it will simply create your DF with two columns: one is ID, the second is name. And these are
simple records, five records. And this will simply populate your first data frame. Okay, let's see what we have in the first DF. So it is created. Let me just say display(df1), and it will obviously populate the data frame and run a job as well, because we are displaying the data. Okay, perfect. So this is my DF1, and Spark has actually created DF1 — that's why we are able to see it; we cannot say it is just a plan, it has created the data frame. Now what I will do is create DF2, and I will simply say
df1.filter — no, not filter, let's create another column. So in PySpark, if you want to create a new column, you use a simple function called withColumn, and in it you obviously give the column a name. Let's say I want to create a new column and I will simply name it new_column, just to keep things simple — actually, I want to apply a kind of constant transformation. What is that? I don't want to compute anything; I simply want to enter a constant value, "yes", as a kind of flag. So let's say I am creating a new column called flag and I want its value to be "yes". How can I say that? If you are assigning any kind of constant, you have to use a function called lit. Then I can simply say "yes", or whatever constant I want to assign. In order to run this we also need to import the functions: from pyspark.sql.functions import lit. Okay, make sense? Very good. Perfect, let's create that. This is my DF2 — or wait, we want to keep it in DF1, we want to create it as DF1. Okay, perfect. Let's see the
DF DF1. Perfect. Because that's how you will see that the difference between caching and normal computation. Okay, so
this is my DF1 and Spark has created it for us not just a plan real data frame. Okay. So now what we can do? I want to
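Just as a reference, here is a minimal sketch of roughly what that cell looks like; the sample data and column names are illustrative, and display() is the Databricks helper (use .show() elsewhere).

```python
from pyspark.sql.functions import lit

# Illustrative five-row DataFrame with two columns: id and name
data = [(1, "Alice"), (2, "Bob"), (3, "Cara"), (4, "Dev"), (5, "Esha")]
df1 = spark.createDataFrame(data, ["id", "name"])

# Add a constant "flag" column; lit() wraps the literal value "yes"
df1 = df1.withColumn("flag", lit("yes"))

display(df1)  # Databricks; use df1.show() in plain PySpark
```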
So now, what can we do? I want to create a DF2 with reference to DF1. That means I want to reuse DF1 in my code.
So I will simply say I want to filter DF1. And I want to apply filter on column let's say ID. Okay. And I want to
simply get the ID equals to 1. Okay. I will simply run it. What will happen? Nothing because this is just a plan. Now
I will simply run this plan: display DF2. So you know that in DF2 we are not creating any new column, we are not creating anything extra; we are simply applying a filter transformation, that's it, on top of DF1 which is already created. Very good. Let's see what will happen.
Okay, we got the result. Very good. So let me just show you the plan, like what exactly happened. I will simply say df2.explain(). Okay, very good. So it has created a plan: first of all it read the data (a scan of the existing RDD), then it applied the filter — ID is not null — because obviously the filter should be applied as an early step, and then it projected ID, name and flag. But why is it creating this flag column? You will say, where is it creating a new column? See, this is column creation — it is saying "yes AS flag". Why? In DF2 we are not creating any new column. The reason is that it is not using DF1 directly: DF1 was created and stored in the short-term memory, but the moment the job completed it was eliminated. Then we used DF2, and Spark knows that in order to create DF2 it needs DF1, and we don't have DF1 anymore because it was eliminated. So it simply uses the DAG to recompute that data and then uses it for DF2. That is why in the plan you see this "yes AS flag" step instead of just the ID, name and flag columns — it is reapplying the transformation, right? Very good. So this is the reason I was stressing that it basically recomputes the data based on the DAG; it is recalculating DF1 every time. Okay. Let me just show you the
magic. Okay. Let me just show you the magic and you will see the difference. So let's say you are displaying this DF.
Okay, that is fine. Now let's say that before creating DF2 I call df1.cache(). And let me just rerun all the things, just to start fresh. Okay, this is done, this is done. Let me cache it, and let's display it. Okay, this is done. So DF1 is now cached. Let's rerun DF2 and display it — obviously the output will be the same, because the output does not change, but the query plan will be different this time. This time you see the difference: it is not calculating the flag column anymore; it is simply performing an in-memory table scan. That means it is just fetching the details from memory, because our data frame is stored in memory this time, and we do not need to worry about flag — it is now simply using flag as a column, not saying "yes AS flag", not creating any column, just reading it via the InMemoryTableScan. This is the power of caching: you can save your data frame if you want to reuse it in your code multiple times.
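A minimal sketch of that caching step, under the same illustrative setup:

```python
from pyspark.sql.functions import col

# Mark DF1 for caching; this is lazy, so trigger an action to materialize it
df1.cache()
df1.count()

# The same filter now reads from the cached data
df2 = df1.filter(col("id") == 1)
df2.explain()   # the plan should now contain an InMemoryTableScan instead of re-deriving flag
```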
Another tip: you should not cache bigger data frames. You should only cache data frames which are small enough, otherwise executor out-of-memory trouble is waiting for you. Make sense? Well, not exactly an error — Spark may just not keep all of your data in memory and spill part of it to disk, and then it keeps going to the disk to get the result, which is not efficient, right? Or it may simply not honor your request to cache the data at all: if it doesn't have enough space in the storage memory, it will recompute it every time, and that's not efficient either. So you should only cache a data frame if it is small enough to easily fit in memory, and if you are reusing it many times. Then, just to save time, you can simply cache it. Okay. So let me show you something else. Let me just
show you the job of this. Let me just show you let's say job 15. Let's click on this and just expand it. And this
time just go to the result of this one this job
and yeah perfect. Just click on the storage. Just click on storage. What are you seeing? You should see your data
cached. This is your cache data in the storage. This is your storage memory. And you can see that eight partitions,
cached partitions, fraction cached 100%. That means it is fully cached. Size in memory it's just like 3 KB very small
data. Size on disk zero. What is this size in memory and size on disk? What is this? So basically this thing will be
much more clear when we look at persist. Now what is this persist, man? We understood cache. What is persist? It is closely related to cache — actually, everything is persist. Cache is just a special version of persist. "Ansh, don't make things complex. Just explain it to us." Okay, okay. So you understood the
concept of caching. First of all tell me this. You understood it. Very good. Now let's talk about the storage level and
this size in memory and size on disk like what is this and what's the difference between all these things.
Basically just to give you a hint whenever you want to perform caching okay we have so many options because we
have memory and we have disk we have both the options right so we can prioritize where we need to cache our
data where we need to store our data oh so it's just for like flexibility to store your data which is cached exactly
that's it. So let me just tell you in detail what that is. Let's talk about the storage levels — you can also say persistence. Basically, caching is not an independent thing; it is just a kind of wrapper, a special use case of persist. So originally everything is persist, everything. Okay. But what happens when
we use persist we get a lot of flexibility. We get lots of storage levels that we are just discussing right
now. One of those storage levels is MEMORY_AND_DISK. When we say that we want to persist our data frame with storage level MEMORY_AND_DISK, it becomes df.cache(). Why? Because Spark identified that people were using the memory-and-disk storage level a lot — obviously it is fast, it is efficient, and it gives you the flexibility to store your data on both sides, memory and disk. And as you can see, it will first try to save your data in memory. If it can save the data in memory, it will; otherwise, if memory is not enough, it will spill the rest of the data to the disk. That's it. Make sense? Because you're persisting your data in both areas.
Memory and disk, both. Very good. So it becomes cache — it is so popular that instead of writing the longer persist code we simply write df.cache(), and in most scenarios you will also be using df.cache(), trust me. Okay, very good. What are the other storage levels? The next one is MEMORY_ONLY. Now MEMORY_ONLY
means you cannot spill your data to the disk. So you will ask, what if our data does not fit in the cached area, the storage memory? It will simply recompute the rest of the data. It will try to cache as much data as it can, and the rest will be recomputed. Make sense? So data is stored in RAM as deserialized Java objects; if there is not enough memory, the partitions are recomputed when needed. Simple. Now this becomes really
controversial, because a lot of you will say, hey, MEMORY_ONLY is what caching means — when we say df.cache(), we are saying MEMORY_ONLY. But some of you will say, no, no, it means memory and disk both. So let me clear up the confusion. If you are using the DataFrame API, df.cache() means the MEMORY_AND_DISK storage level. If you are using RDD code — and I don't know who is still using RDD code in 2025, because RDDs are not recommended and not optimized, though in the rarest scenarios you can use them for low-level transformations — then rdd.cache() means MEMORY_ONLY. So do not get confused. And obviously we write our code in the DataFrame API nowadays, so it becomes MEMORY_AND_DISK, not MEMORY_ONLY. Update your information — I think we stopped using RDDs about a decade back. So, by the way, the next
one is DISK_ONLY. This is really nice because you do not need to worry about memory — just save your data on the disk — but it's really slow, okay, because it stores all the data on disk. The next storage level that we have is MEMORY_ONLY_2. What is this? It is exactly the same as MEMORY_ONLY but replicated two times, because by default we get one replica. It is just for fault tolerance. Okay, I rarely use the two-times-replicated levels, but
you should know about them. Okay, very good. Next is OFF_HEAP, which is still considered experimental. I was talking about off-heap memory earlier, right? And I told you that we rarely use it; by default it is zero. But in some cases you can utilize off-heap memory to cache your data, just to put less overhead on the GC cycle, because obviously when you cache your data you create a lot of temporary objects. But off-heap memory is not easily maintained — you need to manage it yourself, the Java virtual machine will not maintain it. Plus, first of all, you need to enable it with the relevant configuration, and yes, it is experimental, so you can experiment with it if you want; I have not. So these are all the storage levels that we have in Spark, and I hope it is now clear. Let me just show you the code as
well. Let me just show you. So we are in our notebook and let's run the code for one more time and instead of using
df.cache(), let's use df.persist(). And we will simply say StorageLevel.MEMORY_ONLY. Okay, let's run this. Oops — StorageLevel is not defined. Oh, we need to import it. What's the library for StorageLevel? Let me just look it up. Okay, this is the class: pyspark.storagelevel.StorageLevel. Let's import this class and run it again. With the from-import I don't have to write the long, fully qualified name again and again. So okay, this is now done. We have just persisted our data with MEMORY_ONLY.
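Roughly, the cell ends up looking like this sketch (the unpersist at the end is shown here for completeness; in the video it comes a bit later):

```python
from pyspark.storagelevel import StorageLevel  # `from pyspark import StorageLevel` also works

# Persist DF1 in memory only; partitions that don't fit are recomputed, not spilled to disk
df1.persist(StorageLevel.MEMORY_ONLY)
df1.count()                     # an action is needed to actually materialize the persisted data

df2 = df1.filter(df1["id"] == 1)
df2.explain()                   # should show an InMemoryTableScan step

df1.unpersist()                 # release the cached/persisted data when you are done with it
```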
Okay, let's do the same stuff and create DF2 again. Let's check the explain output — obviously we should see an in-memory table scan. Very good. And let's see the job, and how it is saving our data frame. Go to Storage, and this time you can see we have another entry: this one is "Memory Serialized 1x Replicated", while the other one was "Disk Memory Deserialized 1x Replicated", which is the default mode when you just run cache(). This time we specifically said we want fast results and our data frame is small, so we store it in memory only. So that is the difference. And now you will ask: what if we just want to unpersist it, or, let's say, uncache it? There is nothing called uncache. If you want to uncache — that is, unpersist — you simply write df1.unpersist(). That's it. Simply run this. And you can now check the Spark UI one more time if you want to. Let's say you run display(df2) one more time. Okay.
Okay, perfect. Let's run this explain method for one more time. Perfect. So, do we have the explain method? Did we
run display DF2? Yeah. So, we did it. So, let's see the job. Now, go to storage. Yeah,
perfect. So this one is removed — the one that we just persisted — and now the only entry left is the previous one that we didn't uncache, or rather unpersist; it still shows the fraction cached and the other details. So if you want to uncache something, that simply means you unpersist it. Make sense? Very good. So I hope you can now play with this persist method, try different storage levels, and see the difference in your performance. So now let's see what we have next. Let's talk about the edge node. What is this edge node? So basically, so far, whatever we have discussed, we were imagining that we as developers are directly communicating with the cluster manager. Make sense? We were directly communicating with the cluster manager, or resource manager, and it was just allocating us resources. That's
very good. But just tell me one thing: you are a senior data engineer, and a new data engineer or intern has just joined the company. Will you allow him or her to talk directly to the cluster manager? Obviously you cannot, right? That's why we have something called an edge node. And before talking about the edge node, we need to clarify that we have two different sides — the client side and the cluster side — think of it like team A and team B: the client team and the cluster team; this leads to client mode and cluster mode, which we'll cover in a moment. So what happens? This is you as a developer. Instead of talking directly to the cluster manager for all the resources, for all the stuff, what do you do? You talk to a machine which is called the edge node. It can be a virtual machine, it can be any machine, which acts as the edge node to talk to the cluster manager. You log in to this machine and you type your request, and this machine has access to the resource manager: it will talk to the cluster manager, and the cluster manager will assign the resources for your work. So simple. This is the edge node. Now this gives birth to a topic called client mode versus cluster mode. Basically, we have two different kinds of deployment modes: one is client mode and one is
cluster mode and both are so simple. Let me just explain what's the difference. So basically we know that we do not talk
directly to the cluster manager. This client machine will talk to the cluster manager. So let's say you are deploying
your application in client mode. What will happen in that scenario? We know that the cluster manager first of all creates the driver, right? So what will it do? It will simply create the driver on the client machine. Oh — that's why it is called client mode. And where are the workers? Obviously on the cluster; there is no negotiation on that. So these are the worker nodes. This is your client mode, because your driver node runs on this
client. Okay. Now you are using a cluster mode and in most of the scenarios you will be using cluster
mode. Okay. But in some cases you use client mode as well. Okay. In some cases, in most of the cases, you just
use cluster mode. Okay. So, let's say you are using cluster mode. What will happen? You know better than me. I think
so because I have just explained this so many times. So, I hope now you know. So, what will happen? It will simply send
the request to the cluster manager (resource manager), and it will simply create the driver on one of the machines in the cluster, and the rest of them will be worker nodes. Make sense? It's a simple approach.
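For reference, the deployment mode is normally chosen at submit time; here is a small sketch, with the spark-submit line shown only as a comment since the cluster details and file names are hypothetical:

```python
# Typically chosen when submitting the application, e.g.:
#   spark-submit --master yarn --deploy-mode cluster my_job.py   (or --deploy-mode client)
# From inside a running session you can usually inspect it; the key may be unset on
# some managed platforms such as Databricks.
print(spark.conf.get("spark.submit.deployMode", "not set"))
```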
What is the difference, what are the advantages and disadvantages, and when should we use which one? So the thing is, when you are using client mode, you are running your driver on the client machine. What if that client machine is turned off? Let's say you switch it off. What will happen? The whole application will break. Why, bro? This machine is running the driver program, and the driver program is the heart of the whole application. You switched off the heart — how can the application survive? No way. So that is why client mode is more error-prone. But what is the advantage? The advantage is when you are developing your application: you will be using client mode in your dev environment. You can simply write the code and actually see the output on your own machine, you can see all the logs, everything. So it is very handy when you are working in the dev environment, but in the prod environment it is not recommended. Why? Because this client machine is not inside the cluster's network — it is a totally different machine — so in client mode network latency will be higher, and we do not want that; we want our applications to run really quick. These are some of the advantages and disadvantages; you can even Google more, and now that you know the concept you can actually understand them — without the concept you could not. The major ones are these, and obviously there are more, and it is subject to your specific architecture and project which approach you want to use. It is too subjective for me to say one is bad and one is good; you just need to pick the right one for the right architecture, the right project. That's it. Okay? See, it was so simple: client mode and cluster mode.
Now let's talk about partition pruning. I would say that even before talking about partition pruning we should talk about pruning and partitioning independently, but we will cover these topics together to understand them better. So first of all, what are partitions? And don't worry, I will show you this in code as well. Let's say this is your executor, and this executor will be writing your data to a repository — any kind of data lake, any folder — because obviously whatever data you are processing, you will be writing it to a destination, right? Okay, makes sense. So this is, let's say, the destination repository or destination location. Simple. Now, we do not write all the data into one single folder. What? Yes, we do not do that. No,
instead what we do we create partitions. Oh okay we create partitions. Let's say your data frame
has a column called category or department anything. Okay. So what we will do instead of writing everything
inside this one folder we will create independent folders. Let's say you are just creating a data frame for
departments. You have IT department. So we will simply create one folder for IT. Second folder for let's say HR. Third
folder for finance and so on. Okay. So this will be independent data which is partitioned based on a column.
Okay, makes sense. And by the way, this partition is not the same thing as your in-memory partition — here I am talking about partitioning while writing the data to a destination table or destination folder. Okay. So you will say, okay, we understood it, but what's the advantage? Why do we need to
just do this? Why we cannot write all the data in a folder? You can but this is a kind of optimization technique and
what kind of optimization we get after this. So let's say this is your destination folder. Okay or let's say
destination parent folder. Destination parent folder. So you want
to now query this data. What you will write? You will simply write let's say df equals spark read blah blah blah. You
are simply reading this data. Okay. You are simply reading this data. And while reading this data you are applying a
filter simple filter like let's say dot filter um just for the HR department okay this is just a demo code pseudo
code okay do not say hey is this syntactically right I know it is not so let's say you are just creating a data
frame okay and you just want to fetch the data records for HR department only make sense very good so what spark will
do if you do not have the partitions Okay, let's say these partitions do not exist. Okay, let's say this is the only
folder. Let's say this one. This is the only folder which is available now. So this particular data frame, this spark
will simply go to this folder and will read all the data all the data. Let's say you have 10 GB of data, 10 gigs of
data. So it will read 10 GB of data and you are obviously spending resources to read that data obviously. Then after
reading this data it will apply filter. Make sense? Common sense bro. Okay. Make sense? Now if we have
partitions, if I have partitions, what I will do? What I will do? I will simply say this is for HR. Okay. This is for
HR. This is for IT or let's say finance. Okay. And let's say this is for IT. Make sense? So I know and spark also
knows that this folder is just for HR. This folder is for IT. This folder is for finance. So it will directly go to
this folder instead of going here and here. Make sense? Let's say your HR data was only and only of one gig that is 1
GB. So you saved 9 GB of processing. Obviously, bro. Obviously you saved that scanning part of 10 GB of
data. You simply scanned 1 GB of data. This makes your Spark jobs run so quick. So that's why we always perform
partition. And this concept is called partition pruning. How? Because you are pruning the partitions. You are not
fetching all the data. You are pruning the partitions. Make sense? Make sense? Do
you want to see this in code? Let me just show you. So I'm in my new notebook and let me
just connect this with the cluster. So basically I have prepared a demo data frame. So obviously simple data frame
that we are using so far. In this I have added one column which is department. Okay. Just to align you with the
original example. Very good. Okay. Very very very good. And these are the columns. And this is the spark code.
Simple. And this is the code that we use to write our data to a location — I will provide the location right now. What is this code? df.write, then format("parquet"), because we want to write our data in the Parquet format. Parquet is one of the most popular columnar file formats for big data analytics. Then partitionBy("department") — this is the statement that performs partitioning based on a column, and you will actually see the folders, I will show you, don't worry. And then the mode — it basically tells Spark what to do if data is already there; it can overwrite, append, and so on. We have several modes. Okay.
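As a rough sketch, the write and a pruned read look like this; the DBFS path is a placeholder for wherever you point the output:

```python
from pyspark.sql.functions import col

# Placeholder output location -- adjust to your own workspace
output_path = "/FileStore/apache_spark/output_data"

# Write as Parquet, one sub-folder per department value
df.write.format("parquet") \
    .partitionBy("department") \
    .mode("overwrite") \
    .save(output_path)

# Filtering on the partition column while reading lets Spark skip the other folders
df_hr = spark.read.format("parquet").load(output_path).filter(col("department") == "HR")
display(df_hr)
```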
So let's actually provide the path. So in order to provide the path I will simply go to catalog and I will go to uh
not to catalog basically um I can simply go to this catalog just to see the DBFS. So I have file store. Okay. Oh I have so
many folders. I know I know I know. So I will simply create this one. Apaches spark. This is the parent folder. Let's
say I want to create a folder inside this. Let's say this one. Make sense? Makes sense. Okay, let's create folder
within this called output data. And within that output data, you will see the partition. Let me just show you. So,
I will simply grab the path from here — this one. And you need to make some tweaks in this path, so do not just paste it directly: you need to remove the dbfs: prefix, and I think you also need to adjust the leading slash. Let's see; no worries, if it throws an error we can just correct it. I will simply append the output_data folder to it. Make sense? Very good. So I think we are good. Let's run this. If we see any errors, I can simply remove that slash, because I think we need to; if not, then it should be fine. This is basically an optimization technique, but obviously competition is high right now, so it's my duty to
show you and tell you everything. So it is done. Let me just show you what we got. It has written the data successfully. And if I go to DBFS and look at output_data — click on this — perfect. See, I have folders: department=finance, department=HR, and so on. It is so good. And within these folders you will see the data — see, these are the Parquet files. Wow. And I will show you something else as well: I will create another copy of this data, writing it to another location, but this time I will not use partitions.
So I will simply change the output path and call it something like output_data_without_partitions — the same code, just a new variable, output_path_new — and I'll change it in the write code as well, because I want to show you the difference while scanning the data. Okay, so this is also done, and we can check: go to Catalog, DBFS, FileStore, the Apache Spark folder, and compare output_data with the new folder without partitions. In the new one we should not see those department folders. Perfect — we just see the part files, numbered 0 through 7, eight of them. So, by the way, why eight files? Why? Because we have eight cores in our executor — obviously, bro, I just showed you. If you click on the cluster and then on the configuration, you will see that we have eight cores, and each core is responsible for parallelism. So by default we get eight partitions. These eight partitions are in-memory partitions, whereas the partitions we created in the form of folders are partitions for storing the data on disk. Okay. And if you click on the Spark UI and then on Executors — it may take a minute to load — you can see cores = 8. So we have eight cores. Make sense? Very good. That's why it created eight partitions by default: each partition is responsible for one output file. Okay, very good.
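If you want to check those numbers yourself, here is a quick sketch (names as used above):

```python
# Default parallelism usually matches the total executor cores on a small cluster like this
print(spark.sparkContext.defaultParallelism)

# Number of in-memory partitions of the DataFrame, which drives the number of output files
print(df.rdd.getNumPartitions())
```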
So now, are you ready to see the magic? Let me just show you the difference. Let's try to read the data, one location at a time. I will create df_with_part and simply say spark.read.format("parquet").load(...), and the location — we already know it, it is output_path — so we can provide it in the form of a variable. Let's read it and display df_with_part. Okay, makes sense. So basically now it will simply load all
the data. Why? because we are not pruning our partitions. We are just creating the partitions but we will just
prune our partitions with the predicate. So just click on this click on this job and if you just expand it you will see
the SQL DataFrame entry. Click on it and you will see the Parquet scan node. If you click on that, you can see how many files were read: six files. And the number of partitions read: three. So these are the partitions — IT, HR and one more, I think finance. And the total number of files read is six, because each partition folder has, I think, two files. I can quickly check to confirm: Apache Spark, output_data, HR — yeah, one, two; and the same for the others. So it read all the partitions, right? Very good. Now, if I perform pruning — how? I can simply
say .filter(col("department") == "HR"). Make sense, friends? Let's run this one more time and see what happens. Error — very good, because we didn't import the functions from pyspark.sql.functions. Okay, let's add the import. Perfect. Nice, nice stuff. Okay, let's run this finally, and you will see why it is so important and why it is a hot topic in interviews, because it is really
important for you as a developer. So let me just check the SQL DataFrame entry — this is our data frame. How many files has it read? Only two. Very good. Number of partitions read: only one. So it pruned the partitions
instead of reading all the six partitions. Now we have the liberty. Now we have the power to limit the file
scans, because we have partitioned our data. Make sense? Very good. Let me just show you one thing. You will say, "Ansh, this is also possible in the other scenario as well." No. No. Let me just show you. I will run this whole thing one more time, and this time I will read the copy without partitions — the output location will be output_path_new. Same code, I'm not changing anything, I'm still applying the filter. Okay, let me just show
you how many files it will read. Every time it will read all the files, all the eight
files. You don't trust me? Now you will. So let me just display df_without_partition. It has filtered the data, obviously, but it read all the data — it scanned all the files. You will say it's not a big deal; yeah, it's not a big deal here because you are working with a very small data set, but when you have terabytes of data it is not a small thing. Okay, check the SQL DataFrame entry and confirm how many files it has read: it has read all seven or eight files — I do not remember the exact number — basically all of them. Why? Because there is no partitioning applied to it, so it cannot be pruned. Make sense? Very good. And the size of files read is about 7 KB here, whereas in the partitioned case it read only about 1,720 bytes. This is the power. Just mentally replace bytes with GB and KB with TB, and see the difference in resource cost. Bro, this is the power of partition pruning. Now we have a cool
concept called dynamic partition pruning. And you will ask, what's the difference — why do we need dynamic partition pruning? It is, you can say, one step ahead of partition pruning. Let me tell you what it is. Let's try to understand dynamic partition pruning; just stay with me and you will master it, trust me. Why do we need it? Trust me, it is just like magic. What magic? Let me tell you. Basically, we know that if our data is partitioned and we apply a filter on the partition column, only the selected partitions are read. What if I tell you that even if you do not apply a filter on the partitioned data frame, you can still read only the selected partitions? That is the concept of dynamic partition pruning. Let me elaborate. Do not get confused by the picture or the arrows; just follow my cursor.
Okay. So let's say there's a scenario where you have DF1 and you have DF2 and you want to apply a join and just for
the background understanding DF1 is a big table. It is a fact table in which you have partitioned your data as you
can see based on the column let's say department. Okay, make sense? Very good. On the other hand, DF2 is your dimension
table which is not very big which is very small table and you just have like non-partition data because it doesn't
make any sense to create partitions for that data. It is very small table. Make sense? Okay. Now you are applying a
join. Okay. Let's say you are applying a join df1.join
df2. Okay. Make sense? Make sense? Okay. Very good. Now in order to apply join it will simply scan
DF2. Obviously, obviously so to simply scan DF2, right? Agree with me, right? Okay. How hard is it to scan this file
because it is very small data? Just a few seconds, right? So, and it is very quick. So, that is fine. That is fine.
Okay. Now, how long it will take to scan this DF? And I just told you that this is a very big table. It is a fact table.
Oh, okay. And just for more understanding, each department holds, let's say, at least 1 GB of data: IT 1 GB, HR 1 GB, finance 1 GB and so on. There are many departments in this table; I have just shown four, but you can imagine at least 10 to 15 departments. Perfect. Now it will scan
all the uh departments obviously because there's no filter condition on it. Make sense? Very good. Now, now we are
applying a filter. Just listen to me carefully. We are applying filter on DF2. Okay. Okay. We are applying filter
on DF2. We are getting data only for HR and on DF2, not on DF1, only DF2. Hm. Okay, this is already a small
table. It will apply filter. It will get the data. Perfect. Now, DF1 will still scan all the
partitions because we are not not applying filter on DF1. Make sense? Very good. What if I say even if I do not
apply filter on DF1, I will still get the data only for HR. Let's say this one. And it do not need to it it does
not need to read all the departments. It will only read HR department. An are you kidding me? How can it be possible? It
is possible with the help of dynamic partition pruning. Now let me just explain you the flow how it will
actually do this. So first of all it will apply scanning on DF2. Okay let's apply the scan scanning. Okay, it
scanned the data. Okay, then it applied the filter. Okay, it applied the filter and it scanned the data only for the HR
department. DF2 is ready. Perfect. Now it will go to DF1 which is partitioned and it will apply file scan. Same thing.
Okay. Now before applying file scan, we will see magic. Our broadcast exchange will take place. And what it will do? It
will simply forward this query which is the result of this DF2 with filter equals to HR and it will be broadcasted
here and this will act as a dynamic filter. That means we transferred the filter from one data frame to other data
frame. Now this DF1 has a filter of department. Now just tell me will it read all the departments or just only HR
department? Obviously HR department you have just seen that in the previous example. This is dynamic partition
pruning. This is the whole concept. One thing to keep in mind: every time you want Spark to apply dynamic partition pruning, you need to make sure that the filter condition — this one — is on the column which corresponds to the partitioning column, because only then can it bring just the HR data. Make sense? Just tell me one thing: if you are applying the filter on name, how could it bring only the data for HR? It is not possible. So you have to apply the filter on the column which is also responsible for the partitioning in the other data frame. That is one condition. The second condition is that you have to include that particular column in the join condition as well. I'm not saying it should be the only column involved, but it should be part of the joining key. Make sense? I hope so, and I know you have almost understood everything about dynamic partition pruning.
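Here is a hedged sketch of the setup we are about to run, reusing the illustrative paths from earlier (the partitioned copy on one side, the unpartitioned one on the other):

```python
from pyspark.sql.functions import col

# Dynamic partition pruning is on by default in Spark 3.0+; you can verify the flag
print(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled"))

df_with_part    = spark.read.format("parquet").load(output_path)      # partitioned by department
df_without_part = spark.read.format("parquet").load(output_path_new)  # not partitioned

# Filter only the unpartitioned side; department is part of the join key
df_join = df_with_part.join(
    df_without_part.filter(col("department") == "HR"),
    on="department",
    how="inner",
)
df_join.explain()   # the partitioned scan should show a dynamic pruning (subquery) filter
```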
Let me show you in code as well; you will actually feel what we are trying to do. So I am back in my Databricks notebook and I am applying a join. Let me first show you the code for the join, and let me add a markdown cell here: "Dynamic partition pruning". Make sense? Very good. So what I'm doing is creating a new data frame called df_join which joins df_without_part and df_with_part, and we know that df_with_part is partitioned while df_without_part is not. The join condition includes this department column — see, department = department, and how="inner". Very good. One more thing: I have applied the filter on the DF without
partition as well. As you can see here, I have just done that. I have first pruned this data and I know this is not
partitioned, so it will simply go there and read all of that data. Make sense? But I have not applied any filter on the DF with partitions — no, that filter will be applied dynamically, which is dynamic partition pruning, right? So, let's run
this and let me just show you the data for DF without partition. So we have data just for the HR and this is without
partition data. Make sense? Very good. And if you just want to confirm how our DF with partition looks like, you can
simply say display(df_with_part), just to check how it looks. Hmm — I think we filtered it earlier, and that's why it is throwing an error, so we can just recreate it, not a big deal. I can simply remove that filter because it belongs to the previous example, and let me run this again. Okay, perfect. So we have all the data, right? We have all the data. Okay, perfect. Now let me just remove this extra display command. Perfect. Now we will apply the join and we will see
dynamic partition pruning taking place. Okay. And what it should do? It should obviously read all the files from
here because it is not partitioned. It will read all the files. Okay. Then it will read
only only and only HR partition from the partition data and we have not applied the filter.
It will be applied automatically, dynamically. Okay, let's run this code and see what happens. I create df_join and then say display(df_join). Okay, perfect — the join is applied. Let's see the query and what it has done. Let me expand the SQL DataFrame entry. Okay, perfect, let me check what it has done. First of all I go to the scan of the partitioned DF — df_with_part. How many files should we read? Only two, and only one partition, right? Let's confirm: number of partitions read, only one; number of files read, only two. Dynamic partition pruning, because the filter was forwarded from the other side. And how many files should we read from the other data frame? Obviously all the files, because that one is
not partitioned and see all the files, all the seven files. Perfect, perfect, perfect, perfect, perfect, perfect.
Really happy. Let me just confirm it one more time. Let me rerun it as df_join_new — why? Because I reran the earlier command and it was not recorded properly — just to confirm that this is also working correctly. Let's click on this, then on the SQL DataFrame entry, and confirm this as well.
Let's click on this. Oh perfect it is working fine bro. Wow wow wow. Number of files read only two and number of
partitions are only one. This is dynamic partition pruning and wow this is very nice. See the power
of dynamic partition pruning. See so this is your concept of dynamic partition pruning and I hope now you
understood all the things in dynamic partition pruning. Make sense? Now let's see what do we have next in this
ultimate Apache Spark guide. Let's see. Now let's talk about AQE — it's high time, because every time I speak highly of AQE, AQE, AQE, you ask: just tell us what it is. Adaptive Query Execution, AQE. It is a game changer, trust me. It was introduced in Apache Spark 3.0, and it is a game changer. Why? You will see. It is basically an advancement in the optimization techniques. Before AQE, data engineers were doing all the optimization on their own: all the partition handling, all the broadcast joins, all of it. Then AQE said, "Bro, take a chill pill. I'm here. I will do this for you." So it has some amazing offerings for data engineers, for senior data engineers. Okay, first of all let me tell you the areas where it helps, where it is a game changer. First of all, what it does: it dynamically coalesces the partitions. What does that mean? Let's cover this first. Whenever we apply a wide transformation — let's take the example of a wide transformation, because for that we know there will be a shuffle and many shuffle partitions — what does AQE say? It says: every time, you are creating
200 shuffle partitions — why, every time? Let's say I'm just performing df.groupBy on a column; every time I will get 200 partitions, right? AQE says, "Bro, what are you doing, man? Why do you need 200 partitions?" So it will simply coalesce these partitions, maybe down to just 10. It depends on the size, but in most cases it can coalesce them down to far fewer, because why increase the overhead? The more partitions you have, the more tasks you get (every partition creates a task), the more temporary objects are created, and the more frequently your garbage collection cycle runs — it ruins the performance of Apache Spark. So AQE says: bro, you do not need to create 200 partitions. It dynamically coalesces the partitions everywhere it finds an opportunity. A wide transformation is one of those areas — a sure one — because you do not need 200 partitions every time. Just take my example: I am using a few KB of data; I hardly need more than one partition. But whenever I perform a group by, or any wide transformation, it creates 200 partitions for me: 199 are empty, one is the main one. Why create 199 empty partitions? Why add so much stress to the GC cycle by creating temporary objects? Why occupy so many cores of my executor? I know the empty ones will be skipped, but still, why do that? So AQE says you do not need to; it simply coalesces the partitions. Let me just explain with the help of an
example. So let's say let's say you have a data frame. Okay you have a data frame in which you
have um let's so let's apply a group by. Let's let's do it. Let's apply a group
by. So let's say you are applying a group by okay and you are applying a group by on department. Okay. So let's
say for HR you have 100 records. Okay. For IT you have let's say 1,000 records. For
finance finance you have let's say 10,000 records.
Wow, you have 10,000 records. Okay. And there are more departments, right? Let's say for accounts you have, say, 50 records. And one more, a last one — which other department do we commonly have in the industry? Supply chain. So let's say another one is supply chain, and they have just 100 records. What will it do? Let's draw the partitions just to help you visualize everything. Let's say this
partition is like this. Okay, this partition is like this. This partition is like
this. This partition is like this. This partition is like this. Okay, let's draw it here. This partition
is like this. And this partition is like this. Okay, make sense? Make sense? Or let's
draw it here. Okay, makes sense. So what will it do? What do you think? It will obviously create 1, 2, 3, 4, 5 partitions with data. "Only five?" Calm down, calm down. It will simply create five partitions for the main work, obviously,
because one partition for each department. Okay, five partitions are fine. Makes sense.
Other 195 partitions will be empty as we know that will be empty. Okay, it will still still create those but it will be
skipped. Okay, that makes sense. Now, now it will simply perform the grouping and I'm just talking about without AQE.
But if AQE is enabled, what will it do? It will simply coalesce this partition, this partition and this partition into one, like this. So how many partitions do we have now? One, two and three. And then it will try to coalesce this one and this one as well. Why? Just to distribute the data evenly, to have similar-sized partitions, right? So it will end up with only two partitions: one is the big finance one, and the other combines IT, HR, accounts and supply chain. Make sense? So it simply coalesces all the small partitions into one, while the big one stays more or less as it is. Now it makes more sense. Tell me, does it make more sense? Obviously, according to me, yeah, it
makes more sense. So it tries to just have the partitions of similar size instead of just having a huge difference
between the partition size. Make sense? Okay. Very good. Let me just show you with the code as well like how we can
just see it. I have already shown you. I think I can just show you again like what's the difference between AQE when
it is enabled and when it is not enabled. Let me just so let's see the code. First of all, first of all, I have
written the code just to show you whether our AQE is enabled or not, because AQE is enabled by default since Apache Spark 3.0. I can even show you: if I run spark.conf.get("spark.sql.adaptive.enabled"), I see true. Now we will set it to false, because we want to turn it off for the demo. So I simply run spark.conf.set with false, and if I read the config again I should see false. Very good. That means our AQE is disabled, and now we will see how things behaved before AQE. Okay, very good.
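A sketch of those cells, with illustrative column names matching the demo data:

```python
from pyspark.sql.functions import count

# Check and toggle AQE (it is on by default since Spark 3.0)
print(spark.conf.get("spark.sql.adaptive.enabled"))
spark.conf.set("spark.sql.adaptive.enabled", "false")   # disable it for the comparison

# A wide transformation: group by department, count names -- this triggers a shuffle
df_grouped = df.groupBy("department").agg(count("name").alias("name_count"))
display(df_grouped)

# Re-enable AQE and rerun the same cell to compare the shuffle partitions in the Spark UI
spark.conf.set("spark.sql.adaptive.enabled", "true")
```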
Let's create a data frame and apply some transformation. Let's apply a group by: df.groupBy("department"), and then, let's say, a count of name. That's it. I will simply display the result, and I will import the functions from pyspark.sql.functions. Let's run this code and see what happens, because we are simply applying a wide transformation without AQE, so I can show you the difference. Click on this job and expand it. And now
simply go to SQL data frame. Okay, SQL data frame. And this is our job. Yeah, perfect. So now if you just click on
scan read, first of all, it read six records. Very good. Very good. Then this is the step where it performs shuffling.
Right. Click on this. Expand it. How many number of partitions you are seeing here? 200.
200. By default, 200 shuffle partitions. Wow. If you check the stages, how many tasks did it create? It created tasks in batches of 75, 100, 20 and 4 — that's 199 — plus this one partition doing the real work, so 200 in total. Can you imagine? 199 of those partitions are basically empty. And look at the time it took: the batch of 100 empty partitions completed in about a second, and then 0.9 s, 0.3 s, 85 ms and 0.2 s for the others. Make sense? In short, we got 200 partitions out of nowhere — well, we know exactly where they came from, but they should not be there, right? Now, if we just enable AQE, what will happen? What do you think? Let me show you. We can also look at the DAG: go to Jobs, click on completed jobs, and this is the job we just completed — you can see skipped, skipped, skipped for all those empty stages. Not a good thing. Okay, perfect. So now let me enable AQE and run it again. I will
simply say true and I will confirm if it is true. Okay, now it is true. Now let's do it
for one more time. Run this. Run this. Run this and let's see what will happen. Click on this and expand it and simply
go to the SQL DataFrame entry, and this is the job. Now you will see the exchange read six rows as before, and the exchange still created 200 partitions, but we also have an AQEShuffleRead node — can you see it? It coalesced these 200 partitions down to how many? Let's see: only one partition. Makes sense, because I just have six rows, so obviously only one partition is required. So it optimized the performance; it coalesced the partitions. Make sense? Very good. That is the biggest advantage of it. And you can even see how much less work there is: only one task instead of 200 with all the skips. We do not have 75 tasks, 100 tasks, 20 tasks, 4 tasks — we do not have 200 tasks, we just have one, and that is the only one we need. We do not want all that extra stuff just putting stress on the executors, right? So that is why it is important — the most important feature, I would say. Make sense? Only one task here, versus 200 tasks before. Very good. This is the power of AQE. And this is just one feature of AQE; let me tell you about the other two, and for those you do not need to actually do anything — they are applied automatically in your transformations. Now let's talk about another feature of Adaptive Query Execution, and you will say, wow, this is so good. So let's say you are
applying a join. Let me first tell you what this feature is: AQE dynamically optimizes the join strategy during runtime. Wow. What does that mean? If you observe the query plan, you will see something like AdaptiveSparkPlan with isFinalPlan=false. What does it mean? It means the query plan itself is saying this is not the final plan. But you will say: Ansh, you just told us about query plan execution — we write our code,
It gets translated into optimized logical plan. Then it gets converted into physical plan and then that
physical plan goes to the executors. Yes, that's right. But that physical plan is also not the final plan after
aqe because AQE says I can change the plan during runtime. How? Because it calculates it calculates something
called as query statistics during runtime. So let's say you are applying a join. Okay, you are
applying a join. You are applying join between DF1 and DF2. Okay, DF1 and DF2 both are big data frames. Okay, both are
big data frames. DF1 is also big. DF2 is also big. But and obviously if if both are big, what join strategy we'll be
using? According to the physical plan, we'll be using a sort merge join — a shuffle sort merge join, right? But do you know what AQE will do? According to the query statistics gathered during runtime, this DF2 may have become a small data frame due to the transformations that we applied, and that was only discovered at runtime. Maybe you used some UDF, whose output is also only known at runtime — all those things. So due to
some reason your DF2 is now very, very small. So AQE will automatically update the physical plan and pick the best plan at that time. And what is the best plan if we have one small table? A broadcast join. So it will automatically apply a broadcast join instead of a shuffle sort merge join. That is the concept. Got it? Let me just show you. This is the code that I have written for the join, and you are already familiar with this kind of code.
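A sketch of the join cell and of how to see what AQE decided; df1 and df2 here are the small illustrative frames from this notebook:

```python
# Join two DataFrames on id; with AQE enabled the physical join strategy can change at runtime
df_new = df1.join(df2, df1["id"] == df2["id"], "inner")
df_new.explain()   # look for AdaptiveSparkPlan, and BroadcastHashJoin vs SortMergeJoin

# The broadcast size threshold that influences the choice (default is around 10 MB)
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
```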
Okay, now let's create these two data frames and apply a join between them: df_new = df1.join(df2, df1["id"] == df2["id"]). Make sense? Simple. And for the join type you can pick anything, like left or inner — it doesn't matter much here. Okay, let's run this code and display df_new, and we know that our AQE is enabled. Let's see what join strategy it applies. So let me just click on
this. Without AQE, it would by default apply a shuffle sort merge join. With AQE, let's see what it applies. So it applied a shuffle sort merge join, because that's what it decided is the best join strategy for now. For a broadcast join, one data frame should be small enough to be distributed, and it's up to AQE which strategy it picks; if it picked a shuffle sort merge join, that means it considered it the best one for our two small data frames. One more thing to note: it is applying AQEShuffleRead here as well. So instead of keeping 200 partitions, how many do we have? Let's see — just one on each side. That makes sense. In the exchanges you will see 200 partitions and 200 partitions, but in the shuffle read it coalesces those partitions and optimizes them, and then it simply applied sorting on both sides and the sort merge join. So it depends what kind of join it picks, according to the query statistics. It is not like if you have a
small data frame every time it will apply a broadcast join. No, no, it depends upon the query
statistics. Make sense? Broadcast joins makes more sense when you have like one big data frame and one is like very
small data frame. Okay. So this is all about AQE optimizing the joint strategy. Let's look at the third major feature it
Now let's look at the third major feature it offers. Let me tell you what that is. Do you remember this image? I hope so. Remember when we were tackling data skew? This is the third major area where AQE plays an important role: dynamically handling skewed data. Yes, it is a great alternative to salting. That doesn't mean we should never use salting; salting gives you much more flexibility in how you deal with the skew (see the sketch below), but AQE also provides built-in support.
So let's say you have skewed data. What happens? Most of it ends up in just one partition. AQE can dynamically split your partitions as well: it will break that one big partition into several smaller partitions automatically for you. But it is not quite that simple; there are rules AQE has to follow. Roughly speaking, a partition is only treated as skewed when it is about five times larger than the median partition size and also above an absolute size threshold, so there are several statistics it considers before splitting a big partition. But yes, AQE does support skew handling, so it will help you work with skewed data as well.
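For reference, these are the settings that govern that skew handling. Again a hedged sketch: the property names and defaults below come from the Spark 3.x documentation rather than from the video, so verify them against your Spark version:

```python
# Let AQE split oversized partitions of a sort merge join at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# A partition is treated as skewed only if it is larger than
# skewedPartitionFactor times the median partition size ...
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")

# ... AND larger than this absolute threshold.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```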
That is why AQE is a game changer in Apache Spark: it handles so much for you. Can you imagine how data engineers worked before AQE? They did everything manually. After AQE, most of this is optimized for you. You still need developers who understand these things, but it has become far easier to work with them than it used to be. Of course, to use these features well you need strong conceptual knowledge, and I hope you now have it. I would say this is the complete guide for you: once you master all these concepts, you are all set to work with Apache Spark, because you now have an end-to-end understanding of what happens behind the scenes. Trust me, master these areas and you will become an outlier, because these are exactly the things that not everyone knows. Are you getting my point? I hope you learned a lot from this video.
What should your next step be? As I just mentioned, you also need to master PySpark coding, and I have created a dedicated video on it. Simply search YouTube for "PySpark Ansh Lamba" and you will find that 6-plus-hour course as well. So in total: 6+ hours of this video plus 6+ hours of that one, more than 12 hours of content. And a bonus: I have also created a video on Spark optimization techniques. That is the third video you should watch, because if you want to master Spark you also want to master optimization techniques; it is another three to four hours of video. That is your complete set: this video, the PySpark video, and the optimization techniques video. Simply search YouTube for "Spark optimization course Ansh Lamba" and you will find it. I have also heard that the going price for content like this, the 15 to 20 hours I have just put out, is at least 50,000 bucks.
I only just got to know that. I hope you respect the effort, and if you do, hit the subscribe button; if you have already subscribed, share this channel with others and comment on this video, because that helps the channel grow. You can also join the channel by clicking the Join button. That is all I can say: if you like my hard work and want me to keep making more and more videos, then let me know. I hope you enjoyed and loved this video, and I want to see your love in the comments. So drop your comments, and I will see you in the next video coming up on the screen. Just click on that video, and happy learning.