Comprehensive SQL Learning Journey
This detailed course offers a step-by-step approach to mastering SQL, from basic query writing to advanced data analytics and optimization.
Course Roadmap
- Basics: SQL syntax, queries (SELECT, WHERE, JOINs)
- Intermediate: Filtering data with operators, joins, functions (string, numeric, date/time), data manipulation
- Advanced: Complex queries, subqueries, Common Table Expressions (CTE), views, stored procedures, triggers, performance tuning, indexing, partitions
- AI Integration: Leveraging AI tools like ChatGPT and GitHub Copilot for coding assistance
- Real Projects: Data warehousing, exploratory data analysis (EDA), and advanced analytics
Core Learning Areas
Data Warehousing
- Understand data warehouse concepts
- ETL/ELT processes to extract, transform, and load data
- Architecture design using medallion approach (bronze, silver, gold layers)
- Data modeling: star schema for facts and dimensions
- Data lineage diagrams and documentation
Data Exploration and Analysis
- Distinguish between dimensions and measures
- Exploring data uniqueness using DISTINCT
- Analyzing date ranges with MIN, MAX, and DATE functions
- Aggregations by categories: SUM, AVG, COUNT grouped by dimension
- Ranking and segmentation using window functions and CASE statements
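The exploration steps above can be sketched as a small runnable example. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory SQLite database instead of SQL Server, and the `sales` table, its columns and its rows are all hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite database; table and data are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (
    order_id   INTEGER PRIMARY KEY,
    country    TEXT,   -- dimension
    order_date TEXT,   -- dimension (ISO date string)
    amount     REAL    -- measure
);
INSERT INTO sales VALUES
    (1, 'Germany', '2024-01-15', 100.0),
    (2, 'Germany', '2024-03-02',  50.0),
    (3, 'USA',     '2024-02-10', 200.0),
    (4, 'USA',     '2024-04-21',  25.0),
    (5, 'France',  '2024-02-28',  75.0);
""")

# Uniqueness of a dimension with DISTINCT
countries = [row[0] for row in conn.execute(
    "SELECT DISTINCT country FROM sales ORDER BY country")]

# Date range with MIN and MAX
date_range = conn.execute(
    "SELECT MIN(order_date), MAX(order_date) FROM sales").fetchone()

# Measures aggregated by a dimension
by_country = conn.execute("""
    SELECT country, SUM(amount) AS total, AVG(amount) AS average, COUNT(*) AS orders
    FROM sales
    GROUP BY country
    ORDER BY total DESC
""").fetchall()

# Segmentation with a CASE expression (thresholds are arbitrary)
segments = conn.execute("""
    SELECT order_id,
           CASE WHEN amount >= 100 THEN 'High'
                WHEN amount >= 50  THEN 'Medium'
                ELSE 'Low'
           END AS segment
    FROM sales
    ORDER BY order_id
""").fetchall()

print(countries, date_range, by_country, segments, sep="\n")
```

The SQL inside the strings is standard enough that it should carry over to SQL Server with little or no change, but that has not been checked here.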
Advanced Analytics
- Change over time analysis using window functions
- Cumulative and rolling totals
- Performance analysis comparing current metrics with averages and previous periods using window functions like LAG
- Data segmentation using CASE statements
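A minimal sketch of the cumulative total and the comparison with the previous period might look like this. It assumes Python's sqlite3 module with a SQLite build that supports window functions (3.25 or newer); the `monthly_sales` table and its figures are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite database with made-up monthly figures.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly_sales (month TEXT PRIMARY KEY, total REAL);
INSERT INTO monthly_sales VALUES
    ('2024-01', 100),
    ('2024-02', 150),
    ('2024-03', 120);
""")

rows = conn.execute("""
    SELECT month,
           total,
           -- cumulative (running) total up to the current month
           SUM(total) OVER (ORDER BY month) AS running_total,
           -- change versus the previous month; NULL for the first month
           total - LAG(total) OVER (ORDER BY month) AS change_vs_prev
    FROM monthly_sales
    ORDER BY month
""").fetchall()

for row in rows:
    print(row)
```

The first month has no previous period, so `LAG` returns NULL there and the difference is NULL as well (shown as `None` in Python).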
SQL Optimization and Performance
- Execution plans: estimated, actual, live query
- Indexing strategies: clustered, non-clustered, columnstore, unique, filtered indices
- Index maintenance: fragmentation, statistics updating
- SQL best practices: avoid SELECT *, minimize unnecessary DISTINCT/ORDER BY, limit rows for exploration
- Query writing tips: avoid functions on indexed columns, use IN instead of multiple ORs
- Using SQL hints to influence execution plans
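The tip about avoiding functions on indexed columns can be illustrated with a query plan. This sketch uses SQLite's `EXPLAIN QUERY PLAN` through Python's sqlite3 module as a stand-in; SQL Server has its own execution plan tooling, and the `orders` table and index name here are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite database; table and index are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL);
CREATE INDEX idx_orders_date ON orders (order_date);
""")

def plan(sql):
    """Return the plan steps SQLite reports for a query, joined into one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function wrapped around the indexed column: the index cannot be used to seek.
wrapped_plan = plan(
    "SELECT order_id FROM orders WHERE strftime('%Y', order_date) = '2024'")

# Same filter written as a range on the bare column: the index can be used to seek.
ranged_plan = plan(
    "SELECT order_id FROM orders "
    "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'")

print("wrapped:", wrapped_plan)
print("ranged: ", ranged_plan)
```

The exact plan text differs between SQLite versions, so the point is only the contrast: the first query is reported as a scan, the second as a search using `idx_orders_date`.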
Stored Procedures and Programmability
- Create and execute stored procedures
- Use parameters and variables for flexible, reusable code
- Implement control flow with IF...ELSE
- Incorporate error handling using TRY...CATCH
Triggers
- Automate actions on INSERT, UPDATE, DELETE via database triggers
- Maintain audit logs for data changes
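As a rough sketch of an audit log maintained by triggers, the example below uses SQLite trigger syntax through Python's sqlite3 module. SQL Server triggers are written differently (they work with the `inserted` and `deleted` tables), so treat this only as an illustration of the idea; the `employees` and `employee_audit` tables are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite database; tables and trigger names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL);

CREATE TABLE employee_audit (
    audit_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    employee_id INTEGER,
    action      TEXT,
    old_salary  REAL,
    new_salary  REAL
);

-- One trigger per event; each writes a row into the audit table.
CREATE TRIGGER trg_employees_insert AFTER INSERT ON employees
BEGIN
    INSERT INTO employee_audit (employee_id, action, old_salary, new_salary)
    VALUES (NEW.id, 'INSERT', NULL, NEW.salary);
END;

CREATE TRIGGER trg_employees_update AFTER UPDATE OF salary ON employees
BEGIN
    INSERT INTO employee_audit (employee_id, action, old_salary, new_salary)
    VALUES (NEW.id, 'UPDATE', OLD.salary, NEW.salary);
END;

CREATE TRIGGER trg_employees_delete AFTER DELETE ON employees
BEGIN
    INSERT INTO employee_audit (employee_id, action, old_salary, new_salary)
    VALUES (OLD.id, 'DELETE', OLD.salary, NULL);
END;
""")

# Normal data changes; the audit rows are written automatically.
conn.execute("INSERT INTO employees VALUES (1, 'Maria', 5000)")
conn.execute("UPDATE employees SET salary = 5500 WHERE id = 1")
conn.execute("DELETE FROM employees WHERE id = 1")

audit = conn.execute("""
    SELECT employee_id, action, old_salary, new_salary
    FROM employee_audit
    ORDER BY audit_id
""").fetchall()

for row in audit:
    print(row)
```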
Working with Views and Temporary Tables
- Create views to encapsulate complex logic for reuse and simplify querying
- Understand differences between views (virtual, no data stored) and tables (physical data storage)
- Create tables from query results (CTAS) for performance
- Use temporary tables for intermediate results within database sessions
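The difference between a view, a table created from a query, and a temporary table can be sketched as follows. The example assumes SQLite through Python's sqlite3 module; SQL Server expresses the same ideas with `SELECT ... INTO` and `#`-prefixed temporary tables. All object names are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite database; all object names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'Maria', 100), (2, 'John', 40), (3, 'Maria', 60);

-- View: stores only the query, no data.
CREATE VIEW v_customer_totals AS
SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;

-- Table from a query (CTAS): stores the result as it is right now.
CREATE TABLE customer_totals_snapshot AS
SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;

-- Temporary table: intermediate result that lives only in this session.
CREATE TEMP TABLE tmp_big_orders AS
SELECT * FROM orders WHERE amount >= 60;
""")

# New data arrives after the view and the snapshot were created.
conn.execute("INSERT INTO orders VALUES (4, 'John', 10)")

view_totals = dict(conn.execute("SELECT customer, total FROM v_customer_totals"))
snapshot_totals = dict(conn.execute("SELECT customer, total FROM customer_totals_snapshot"))
big_order_count = conn.execute("SELECT COUNT(*) FROM tmp_big_orders").fetchone()[0]

print("view:    ", view_totals)
print("snapshot:", snapshot_totals)
print("temp rows:", big_order_count)
```

The view reflects the new order because it runs its query each time it is read, while the snapshot table keeps the totals from the moment it was created.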
AI-Assisted SQL Development
- Use ChatGPT for idea generation, planning, learning, and code optimization
- Use GitHub Copilot for inline coding assistance, refactoring, and comment insertion
- Prepare for interviews and exams through interactive AI sessions
Practical Insights
- Avoid over-indexing to balance read/write performance
- Regularly monitor index usage and fragmentation
- Use partitioning to enhance query performance on large data
- Design modular, reusable queries with CTEs and subqueries
- Maintain clear naming conventions and project documentation
- Build complex reports incrementally, using views for business-ready data
Project Workflows
- Data Warehouse Construction: from source data ingestion (bronze), cleaning and standardization (silver), to business-ready models (gold)
- Exploratory Data Analysis: dimensions, measures, trend analysis, segmentation, and ranking (see also Master Excel for Data Analysis: From Basics to Interactive Dashboards for complementary techniques in Excel)
- Advanced Analytics: time series analysis, cumulative metrics, part-to-whole comparisons
This course prepares you to implement industry-level SQL projects confidently and efficiently, mastering both the technical and practical aspects of modern data engineering and analytics.
For a broader data processing framework complementing SQL skills, consider The Ultimate Guide to Apache Spark: Concepts, Techniques, and Best Practices for 2025.
Hello and welcome to this unique course to master SQL. My name is Baraa Salkini and I lead big data projects at
Mercedes-Benz, with over a decade of experience in SQL, data engineering, building data warehouses and data
analytics. Now, of course, the first question is what makes this course so special. Well, not only will you learn
how to write SQL code, but more important than that, you will learn how exactly SQL works behind the scenes. So
I'm going to break down complex concepts in SQL using hundreds of animated visuals. This makes it much easier to
understand SQL, and it is also more fun than me just sharing my screen and showing you code. Right? The second
reason is this course is taught by me. I have industry experience and I will be sharing with you everything that I know
about SQL and how I use it in my real projects. So I will be sharing with you hundreds of best practices, tips and
tricks and I'm going to show you my decision-making process in SQL. So by the end of this course, you will be ready to
solve any complex task like I do using SQL. So now I designed this course to cover the basics like writing your first
SQL query and then we're going to keep progressing in the course by covering advanced techniques in SQL like the
window functions, stored procedures, indexes and even at the end we're going to build a data warehouse using SQL. And
this course is suitable for anyone: data engineers, data analysts, data scientists and even students. And by the way,
the good news: everything is free from the start until the end. I will be sharing with you as well a lot of
materials, code, presentations and animations, and there are no hidden costs. So you don't have to pay for
anything. But my friends, in return I would really appreciate it if you support the channel so it can grow. All right, my
friends, I'm really excited about it. I don't know about you. If you are motivated, join me in learning SQL. This is
going to be amazing. So let's go. All right. Now I'm going to show you the road map in order to learn
everything about SQL, starting from the very basics and then advancing step by step until we reach very advanced topics. So
now at the start we have to understand a few things like what SQL is, why to learn it, what databases are and the types of
databases, and after the theory we're going to prepare your PC with the data and the software. Now once we have
everything then we can go to the next chapter. This is the basics how to query data using SQL and here we're going to
cover the basic components in each SQL query like SELECT, FROM and WHERE. Now once you understand how to
query the data, how to get the data out of the database, the next step we're going to go and learn how to define the
structure of the database: how to create a new table, add a new column, remove a column and as well how to drop a table.
So with that you are defining new stuff in the database and then in the next chapter you have to learn about data
manipulation. This time we're going to go inside the table and we're going to learn how to insert new data, how to
update the data and as well delete a few rows from our database. So with that you have the basics how to query data, how
to define the structure of your tables and how to manipulate your data. And I can say with that you cover the basics
about SQL. Now after that we start with the intermediate phase where we're going to deep dive into topics like how to
filter your data. Here we're going to learn about the comparison operators, logical operators, BETWEEN and LIKE. So
all the operators that you can use in order to build a condition in order to filter your data. Then after that it's
going to be a very interesting topic. You have to learn how to combine data. And here we have two mechanisms: either using
a join or using the set operators. And oh my god, joining data. It's going to be a very interesting topic. Here we're going
to cover a lot of stuff: we're going to start with the basic joins and then we go to the advanced ones, and then you have
to learn how to choose the right join, and after that you have to learn about the set operators, and here you have
four methods: UNION, UNION ALL, EXCEPT and INTERSECT. So with that you learn how to combine multiple tables by combining
the columns or the rows of your tables. So this is very important. Now moving on in our course. Now using SQL you can do
a lot of stuff: cleaning up the data, a lot of data preparation, and at the end you can do a lot of analytics and
aggregations. So there are two families of functions. The first one is the row-level functions and here we
have a lot of stuff: you can transform your string values, the numbers, date and time, and handle the nulls in SQL,
and at the end the amazing CASE statement. So all of those are transformations of only one single
value. We call them row-level functions. And after you learn how to do data transformations, then you have to learn
about how to do data analytics and aggregations using SQL functions. So we're going to start with the very basics
like the aggregate functions. And then we're going to deep dive into the window functions, the analytical functions. And
here we have aggregate, ranking and value functions. Those are very important tools for any data analyst or
data scientist doing analytics tasks in SQL. So I can say the row-level functions are for data engineers and the analytical
functions are for data analysts. So by chapter 8 we can say you have now covered the intermediate level, and
the last four chapters they will be the advanced stuff in SQL. So here there are a lot of techniques that you have to
learn about SQL. So the first one is the subquery, a query inside another query, and the very famous CTE, common table
expression. A lot of developers like this one, and then you will learn about how to create views in the database.
If you learn this technique, you're going to be really professional in SQL. Then we're going to learn how to create
tables using SELECT, the temporary tables, and then we're going to learn about the stored procedures, how to write a program
in SQL, and after that of course come the triggers. So those are the advanced techniques that you have to learn in SQL
in order to do advanced projects using SQL. So now once you learn all those concepts and you start writing a lot of
SQL code, you will notice that some queries are going to be really slow, and for that you have to learn how to optimize
the performance of your queries, and here there are a lot of techniques. The most famous one is to create an index in the
database or create a partition, and at the end I will be sharing with you the top 10 best practices that I have
learned in my projects on how to optimize the performance of your queries. So this is very important and
then we're going to move to a very interesting one. I will be sharing with you how I use AI like ChatGPT or Copilot
as I'm using SQL in my projects. So here you have to learn how to write correct prompts to get assistance from AI as you
are using SQL. And finally, and my favorite one, it will be about SQL projects. So my friends, here you have to
bring everything that you have learned about SQL into hands-on projects. With real projects you will get challenges and
struggles, and this is where the magic and the real learning happen, and here there are three types of projects. The
first one is a data warehousing project. This is a very data engineering focused project where you're going to learn how
to build a real data warehouse where you're going to take the data from the raw format and then process it in
different layers. Once you build it, then you jump to another project. Here you're going to start exploring the data and
start getting the first insights about the business. And the last project that you can do is the advanced data
analytics project. So this is a very important section where you do SQL projects. So my friends, this is the road
map for how to learn SQL. So as you can see, it takes you step by step from basics to intermediate, and you will end
up with advanced topics, and with that I can tell you, you will learn everything about SQL. Okay. So now let's start with
the first chapter, the introduction to SQL, and here we're going to cover a few topics. So we have to understand first
what exactly SQL is, why we have to learn it, what databases are and the different commands that we have in
SQL. So it is the basics, the theory about SQL. So what exactly is SQL? Let's go. Everything
generates data and data is everywhere. Your first name is data, your mobile and everything inside the mobile is data.
Your car is as well generating a lot of data. Your bank, your financial statements, everything is data. And now of course
the question is where do we store our data? Personally we store a lot of our data in Excel spreadsheets or in
text files. So you store a lot of your data in different files. Now how about companies? They have a lot of things
that generate a lot of data: the products that they produce, their customers as well generating a lot of
data, sales information and a lot of other things. So companies generate massive amounts of data. So now the big question
is how they handle the data, how they store it. Of course, they cannot go and use simple files. They need
something bigger, stronger and smarter. And here is where the database comes in. So think about the database. It's like a
container for storing data. But instead of just dumping files into folders, the database organizes the data, so it is
easy to access, to manage and to search. So a database is simply a container that stores data. So now you might ask
why we are using a database. Can't we just use files like I do personally? Well, let me tell you why we use databases.
Imagine that someone asks the following question. Go and find the total spending in your data. So now, in order for Mike
to find the total spending and the costs, he will be opening each of those files one by one, searching for the
costs, trying to combine the data, and it's going to be a very long and messy process. But on the other side, if
your data is in a database and you want to ask a question, it's going to be very easy. So all you have to do is to
talk to the database to ask a question and the database can answer your question with a result. And now comes of
course the question: how do we talk to a database? Well, we use SQL. SQL is the language that you use in order to talk
to the database. It stands for Structured Query Language. And here you have people that spell it out, S-Q-L,
and others that say "sequel". There is no right and wrong, but if you follow me through the course I think you will
start saying it my way. So by using SQL you can ask the database, you can ask your data, and the database is going to answer
your question by sending you a result. So this process is very easy, simple and fast, and this is way better than having
your data stored in different files. Another reason why we use databases is that they can handle really huge amounts
of data. So sometimes we have millions of records inside our database, but on the other side, if you are storing
your data inside spreadsheets and you have a massive amount of data, what can happen? Your spreadsheets are going to
just break. They simply can't handle big data. And another reason why we use databases is that they are just more secure. It
is safer to store important and critical data inside the database than just storing it in spreadsheets and files. So
the databases are secure and you can control who is accessing what. So it is just more professional to store the data
inside a database. All right my friends, so far what we have learned: most companies store their data inside a
container called a database and for you in order to ask questions and to talk to your database you have to speak the
language of SQL. Now I'm going to show you how it usually looks in companies. So we
have our data inside the database and then you will have multiple people with multiple roles that are writing
different SQL queries in order to talk to the data. But not only employees and people interact with the database. You
could build a website or an application that as well interacts with the database by sending different SQL queries. And of
course, depending on how many people are interacting with the application and the website, it might generate a really
massive amount of SQL queries that are sent to the database. And not only that, you might have as well tools in order to do data
visualizations where you have a dashboard or reports, maybe created using Power BI or Tableau, that are used by
stakeholders and managers in order to make decisions, and as well those tools will be connected to the database and
creating SQL queries. So now as you can see we have a lot of interactions with the database from people, applications, tools.
A lot of things are generating SQL queries and interacting with the database, but the database is just a container and storage,
right? So we need something, a software that manages all those requests, and that's why we have something called a
database management system, DBMS. So it is a software that is going to manage all those different requests to our database
and it is going to decide the priority, which SQL query must be executed first. This software can as well manage the security,
whether the SQL query is allowed to be executed in the first place. So my friends, the DBMS is the software that is
going to manage the database. And now we are not done yet. There is something missing. So we have our data, we have
the software. What is missing here is the hardware. So in real companies, we cannot run that on our PC because first
our PC is weak and as well it goes offline. That's why we need a server. A server is like a very powerful PC and
as well it runs 24/7 so it is always available, and here we can decide whether we're going to have a server inside the
company or we can use cloud services in order to run our database. So my friends, so far what we have learned: the database
is the container to store the data, SQL is the language in order to talk to the database, the DBMS is the
manager, it manages the database, and the server is the physical machine where the database lives. So this is how it
looks. And now my friends, there are different types of databases. So let's
see what we have. The first and the most famous one is the relational database. It is very simple. It is like
spreadsheets, we call them tables, where we have columns and rows, and then there are relationships between those tables
to describe how they relate to each other, and that's why we call it a relational database. So when people hear the word
database, they're going to think about this one. Now we have another type of database called key-value. This time
the data is organized completely differently, where you have pairs of keys and values. Think about it. It's like a
big dictionary where the word is the key and the definition of the word is the value. And now moving
on to the next one, which is as well important: column-based. So now instead of grouping the data by rows, this
type of database groups the data into columns. That's why it's called column-based. And this is a very advanced
database in order to handle huge amounts of data where the main purpose is to search for data. Moving on to another
database called graph database. The main focus here is the relationship between objects. So the main idea here is how to
connect my data points. And now finally we have the document database. The data is stored as entire documents where the
structure of the data is not that important. What is more important is to fit everything in one page in one
document. And now if you look at those five types, we can group the document, graph, column-based and key-value databases:
all those are called NoSQL databases, and the relational database is the SQL database. And in this course, we will be
focusing of course on the relational database. And I'm sure you have heard about Microsoft SQL Server,
MySQL and PostgreSQL; they are SQL relational databases. And for key-value you have
Redis and Amazon DynamoDB, and for column-based we have Cassandra and Redshift. For the
graph database we have Neo4j, and there is the very famous MongoDB as a document database. Now my friends, for
this course we're going to be focusing on the SQL relational databases because it is the most famous one and the most
used one in companies, and I will be focusing on Microsoft SQL Server. So those are the different types of
databases. Now databases are very structured and organized. They have the following
hierarchy. The starting point is the server. As we learned, it is a powerful PC and it is where the database lives, and
inside it we can have multiple databases. So maybe you have a database for the sales and another one for
HR. So the server can host multiple databases, and as we learned a database is a container of your data. Now moving
on to the next level. In each database we can have multiple schemas. A schema is like a category, or you can call it a
logical container that we can use in order to group related objects. Let's say you have hundreds of tables. So
you can put all the tables that have to do with the orders in one schema and then another group of tables in the
schema customers and so on. So it helps you to organize your tables and your objects in the database. And now if you
go inside a schema you can have multiple objects like tables. So now of course the question is what is a table? It is
like a spreadsheet. It organizes your data into columns. The column defines the data that you store inside it. So you have
one column about the customer ID, another column about the names, the scores, the birthday. So each column is
about one type of data, and sometimes we call the columns fields. Now the other thing that we have in tables is
the rows, or sometimes we call them records. It is where the data is actually stored. Now in this example each
record represents one customer, one person. So we have one record for Maria, John and Peter. Those we call rows.
Now in each table there is one very important column called the primary key. It is always very important to have
one unique identifier for each customer, for each row, and we use it for different purposes: in order to combine the table with
another table, or in order to quickly identify one customer. So it is unique. It's like a fingerprint, and no
two customers have the same ID. Now at the intersection of the columns and the rows we have a single value, a cell,
and each column stores a specific data type. A data type is what kind of data we are storing,
like an integer, 1, 2, 30, or a decimal where you have a decimal point, 3.14. Now if you want to store characters we have
different data types for that, like when you want to store the name or the description. So here we can use CHAR
or VARCHAR. So you store inside them something like the first name Maria. Now you might ask what is a CHAR or
VARCHAR. So CHAR is always a fixed length. So if you define it as five characters, it's always going to reserve five
characters of space. But if you want things more dynamic, then you go with VARCHAR. And now moving on we
have other data types called date and time. So if you want to store a date like the birth date you use the date data type, and if you want to
store the time information you can use the time data type. So we call all of those, INT, DECIMAL, CHAR, DATE, TIME,
data types. So my friends, as you can see, SQL databases are very organized and
structured. Okay. So now let's focus more on SQL itself. We have in SQL different types of commands. So let's
say that we have a database and this database is empty. So we have nothing inside it. Now, of course, the first
thing that you have to do is to write an SQL statement with the command CREATE in order to create a brand new table in the database.
So, once you execute it, the database is going to go and build one, but this table is empty. So, we have nothing inside it. So
now what you have done here is you have defined something new, right? And we call this type of commands the data
definition language, DDL. We have CREATE to create something new, ALTER in order to edit something that already
exists and DROP in order to delete something, to drop for example a table. So this is the first family of commands.
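To make the three DDL commands concrete, here is a small sketch. It is only an illustration under assumptions: it runs the SQL through Python's built-in sqlite3 module against an in-memory SQLite database rather than SQL Server, and the `customers` table and its columns are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite database; the table is made up for illustration.
conn = sqlite3.connect(":memory:")

def columns(table):
    """List the column names SQLite currently knows for a table."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

# CREATE: define a brand new, empty table with a primary key.
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        first_name  VARCHAR(50),
        birth_date  DATE
    )
""")
after_create = columns("customers")

# ALTER: change the structure of something that already exists.
conn.execute("ALTER TABLE customers ADD COLUMN score INTEGER")
after_alter = columns("customers")

# DROP: remove the table completely, structure and data.
conn.execute("DROP TABLE customers")
remaining_tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()

print(after_create)
print(after_alter)
print(remaining_tables)
```

None of these statements touches any rows; they only change what exists in the database and how it is structured.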
Now if you look at our table, it is empty. What do we need? We need data. So let's say that we have a website or an
application. Now this application is generating a lot of data. Now in order for this application to move the data
inside our new table, it must use the SQL command INSERT. So if you execute INSERT, you can add new data inside
your table. This type of command we call data manipulation language, DML. And here we have three commands: INSERT in
order to insert new data, UPDATE in order to update already existing data and DELETE in order to go and delete
data from your table and that's why we call it data manipulation language because you are manipulating your data.
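The three data manipulation commands can be sketched like this. Again this is an assumption-laden illustration: SQLite through Python's sqlite3 module instead of SQL Server, with a hypothetical `customers` table and made-up scores. A SELECT at the end is used only to look at what is left in the table.

```python
import sqlite3

# Assumption: in-memory SQLite database; table and values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        first_name  TEXT,
        score       INTEGER
    )
""")

# INSERT: add new rows to the table.
conn.executemany(
    "INSERT INTO customers (customer_id, first_name, score) VALUES (?, ?, ?)",
    [(1, "Maria", 350), (2, "John", 900), (3, "Peter", 0)],
)

# UPDATE: change data that already exists.
conn.execute("UPDATE customers SET score = 500 WHERE customer_id = 1")

# DELETE: remove rows from the table.
conn.execute("DELETE FROM customers WHERE customer_id = 3")

# Look at the result of the three commands.
rows = conn.execute(
    "SELECT customer_id, first_name, score FROM customers ORDER BY customer_id"
).fetchall()
print(rows)
```

Note the WHERE clause on UPDATE and DELETE: without it, the command would apply to every row in the table.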
So what do we have now? We have a table, we have data inside the table. Now what we can do: we can start asking questions.
So let's say that you have an analytical question about your data. Now all you have to do is to write something
called an SQL query, and inside it you use the command SELECT, but the whole thing we call a query. So you send a query
to the database, you have a question, and the database can return for you the result, the data answering your query,
your question, and we call this type of command the data query language, DQL. And here we have only one command and
it is very famous: SELECT. We can use it in order to query our data. So those are the three different families of
commands in SQL. And of course, we're going to learn all of them, but we will spend most of our time learning how to
write the correct query for the correct answer. And now you might ask me, Baraa, why do we have to learn SQL? And if
you could go back in time, would you learn SQL again? Well, for sure, of course. And here are the top three reasons that
I have. The first one, you have to learn it in order to talk to the data. You know, most companies store their
data in databases, and this is the standard way. This is how they do it. And if you want to work at a company
in the data field and you want to talk to their data, then you have to use SQL. It's like you move to another country
where they speak another language and you want to live there for a long time: you have to speak their language. The
same thing here. If you want to work with data, you have to learn the language in order to speak to the
database, SQL. So this is for me the most important reason why we have to learn SQL. The second reason: SQL is in high demand.
If you go now and check the job description of a software developer, data analyst, data engineer or data
scientist, I promise you will find that they demand SQL. So you will find they ask for
SQL skills in almost every job description. So if you check any data-related job, you will find that
they ask for SQL skills. Now another reason that I have is that it is the industry standard. So if you go and
check multiple modern data platforms and tools like Power BI, Tableau, Kafka, Spark, Synapse, you will see that
there is always a section where you can enter SQL code. So most of those vendors adopt SQL because it is the
standard. It is widely used. It is like a selling point that their tools are easy. So those are my top three reasons
why SQL is still relevant and why you have to learn it. Okay, my friends. So with that we have now a clear
understanding of what SQL is, why we need it, what databases are and their different types, why we have a DBMS and
servers, and as well now you have an understanding of how things are very organized and structured inside the
databases. So that's all, this is SQL. All right, so with that we have covered the basics about what SQL and databases are.
Now in the next step we're going to go and set up our environment. So that means we're going to prepare your PC
with the data with the databases and all the tools that you need in order to learn
SQL. Okay. So now go to the link in the description and you will land here on my newsletter website, and you can subscribe
if you want to get weekly news about my content. I make as well posts about data and many other projects. So once you do
that, what we're going to do now: we're going to go to the downloads over here and you will find here all the materials
of different courses, and the one that we want is the SQL Ultimate Course. Let's go over here. Now once you do that you will
land on this page where I have listed all the important links. So the first one and the most important one is to go
and download the course materials. Here you can find everything: the code, the slides, the presentations, the whole course. Or if
you don't want that you can go to my Git repository and there you will find exactly the same materials. So let's go
and download everything. Okay. So now go and put the downloaded folder somewhere safe and let's go inside it. And here
you can find three things. The first one is the datasets. Here if you go inside it you will find the data for the course,
the databases that we will be using in order to practice SQL. So everything is available here. Now in the second folder
you can find all the documentation. So that means all the visuals, the presentation slides, everything that I
present during the course. It is available here as documentation notes for you. Now moving on to the third one,
we have the scripts. So during the course we will be writing a lot of SQL code and all that code is available
here. So that means this is all the code that is used in the course. Okay. So with that you have now all the
course materials. All right. So now the next step is that we have to go and download SQL Server Express, and you
can find the link as well over here. So let's go there, SQL Server Express. And now we're going to land on the Microsoft
page where we can see the different offerings from Microsoft for SQL Server. So either we have it on
Azure or we can download it on-premises. But we don't want those. Just scroll down to see those two
options. So the first option on the left side is the Developer edition. You will get all the features and services
that Microsoft offers with SQL Server. It is as well free, but the installation here is a little bit
complicated. The second option on the right side is the Express edition. Installation here is going to be
really fast and very easy. You will get as well everything that you need for practicing and learning SQL. So both of
the options are free. It's just a matter of the installation. We will go now for the Express edition. So go and click
Download now, and it's a very small file. So let's go and start it. And now the installation is going to start. So we have
Basic, Custom and Download Media. So Download Media means download now and do the
installation later. Custom means we have more control over how to download and install everything. Basic is the easiest one
and the quickest one. So let's go with Basic and click on that. And let's go and accept everything. And now
let's click on Install. So now it's going to install the applications, drivers and so on. It may take a little
bit of time. Once it is done, the next thing we need is a client to work with the server. So in order to do that, let's go and click on Install SSMS. So let's
click on that, and as well we can find the link over here. So let's go to SQL Server Management Studio. So let's click
on that. You can find of course this link as well with the other links that I have collected. So now we are again on a
Microsoft page. Let's scroll down and now we will see the following link: free download for SQL Server Management
Studio (SSMS). So let's go and click on that and then it's going to go and download it. Let's go and start it. So
the first thing that we have to define is the location. I will go with the default. So let's click on
Install. Okay. Setup completed. We just installed SSMS. So let's go and close it. So now let's go and start it.
If you go to your menu over here, search for SQL Server and you will find it here. SQL Server Management Studio.
Let's go and start it. Okay, so now we're going to get this window in order to connect to our server. So again, what
is our server? It is the one we have installed at the first step, SQL Server Express. And that's why you're going to
see in the server name your PC name; of course, it's not going to be my PC name. But here we have something called
SQLEXPRESS. This is the server we just installed. So in the first option, we have Database Engine, we have Reporting
Services. Those are different products from Microsoft. We're going to leave it as Database Engine. And it should be like
this: SQLEXPRESS. Now, how do we access this database? We have the following options. We can do that using Windows
Authentication or SQL Server Authentication. I'm going to say let's stick with Windows
Authentication. And the username is going to be the PC name together with the Windows user. If for some
reason you don't have that information, you can go to your search, search for cmd, and then type whoami.
And with that you will get the PC name and the user that you are currently logged in as. And this is exactly
what I'm seeing over here. One more thing: if you're having issues connecting to your database, make sure to check the
encryption. It should be Mandatory, and make sure to tick Trust server certificate. So once you do that you
will be able to connect. Okay. So with that we have the server we have the client. And now the last step we have to
go and create the database, where we want to insert our data. So now if you look at the Object Explorer and open
Databases, you can see that we don't have any database. So now let's do something about it. Go back to the course
materials; inside the datasets you will find three folders: MySQL, PostgreSQL
and SQL Server. So if you want to follow this course using a different database like MySQL or PostgreSQL, you
can find the exact same data for the database that you are using. But in this course we are using SQL Server.
So if you follow along with me, go inside the SQL Server folder, and here you will find four files with different
extensions. So what is going on here? For this course we have two databases: one that is very simple,
called MyDatabase, and a second one that has more tables, called SalesDB. And in SQL Server there are multiple ways
to create databases. I will show you now two methods. As the first option, we want
to create the database from a script. And if you look at those files, we have here two files with the extension .sql.
Those are files with SQL code. So let's start with the first one, the SQL Server init script for
MyDatabase. Go inside it. And now here we have the SQL code. Copy everything. And now let's go back to our studio and
then go to the menu and click on New Query. And here in the middle you can paste the code. So now we have the code
for the first database. And all you have to do is go and execute it. So once we execute it, you will see we do
not get any error. And now on the left side we don't see our database yet, because we have to refresh. So right
click on Databases and click Refresh. And now you can see it: MyDatabase. So now let's see the content.
Go expand it and then expand Tables. And now you see here our two tables, customers and orders. Inside
those tables we can find our data. In order to see the data, right click for example on customers and let's go
with the option Select Top 1000 Rows. Once you do that you can see in the results that we have five customers.
This is our data inside the table customers. So here again about the interface: on the left side we have the
Object Explorer, where you can see the whole structure of the database, from server to databases to tables.
On the top we have a menu with a lot of icons, and then in the middle, this place here, we
call it the SQL editor. Here we write our SQL code, and once you execute it,
below the SQL editor you get the output. So here you can see for example the data,
the results, or different messages from the database. So the interface is very simple. Now we have to go and get our
second database. So if you go back to our files you can find a second SQL file, the SQL Server init script for SalesDB. Open
that, copy everything here, and let's go back to our studio. Same thing: you have to create a
new query, then paste the whole code; this one is for the database SalesDB. So let's go and execute it, and with that we
do not get any errors. Now we go to the left side and do the same thing, refresh, and we can see the second
database, SalesDB. Now we can go and explore it. So expand it, go to Tables, and here you can see five tables,
among them customers, employees, orders and products. So this is the intermediate database for our course. So now let's go and
check our data. For example, let's go to orders, right click on it and choose Select Top 1000 Rows. And those are the orders in
our database. Perfect. So everything is working. So those are the two main databases that we will be working
with throughout the whole course. And of course if you want to go and practice using another database, it's totally fine. For
example, from Microsoft there is a database called AdventureWorks. It is really amazing. And I'm going to show
you now how to import it. We can go over here to AdventureWorks. So let's click on this link. So now we are again on a
Microsoft page. If you scroll down you can see here three different types of databases: OLTP, data warehouse and
lightweight. So they are different databases. The OLTP one is the most complicated: a lot of tables and
transactions and so on. The data warehouse is a really nice one for doing data analysis. The
lightweight one is the simplest. So let's go for example and get the data warehouse. So click on that, and as
you can see the extension of this file is .bak, and now I'm going to show you the second way to create databases in
SQL Server. So all you have to do is go to the following path. It really depends on where you have installed
SQL Server. So for me, I have installed it in Program Files, Microsoft SQL Server, MSSQL SQL Express,
then MSSQL, Backup. You have to go there. So here you can place all the files with the extension .bak,
for example the AdventureWorks file that we just downloaded. This is a backup file for the database, and we want to go and
restore it, and with that you are creating a database. So this is the second method to create databases
in SQL Server: restoring a database. It also helps if for some reason the script didn't work for you. Now let me show you
quickly how we can do that. Let's go back to our studio. Right click on Databases, and then here we have an option
called Restore Database. Click on that. And now here we have two options under Source: Database and Device. The
default is going to be Database, but we have to switch to Device because we want to import it from files. And then we go to
these three dots. Click on that. And now we have to go to the option Add. And now it's going to take you to the place
where SQL Server creates backups. So here we can find our files, and what we want to restore is the AdventureWorks
file. Select that. Then OK, one more OK and one final OK. So now the database will be restored, and it is
successful. So now on the left side we can see our third database. If you don't see it, go and refresh of course, and here
you will find a lot of tables in AdventureWorks. And as usual we can go and explore the data by selecting the top
thousand rows. So my friends, now you have three databases, but of course our focus is only on the first two that we
created, MyDatabase and SalesDB. And with that you have learned two ways to import databases into SQL Server. So
with that, my friends, we have prepared everything. We have SQL Server Express running on your local PC. We
have the studio, the client that we're going to use in order to interact with the database, and we have created
our two databases that we will be using in order to practice SQL. So we are ready. All right my friends. So with
that we are done with the first chapter. We have our introduction to SQL and now we're going to start learning the first
thing in SQL and that is how to query our data. So let's go and start with that.
Okay, so now let's understand exactly what an SQL query is. Normally your data is inside a table, and your table
is inside a database. Now you might have a question from the business, like: what is the total sales? What is the
total number of customers? For any question that you have in mind and want to ask your data, you
have to retrieve data from the database, and in order to do that you have to talk to the database using its
language, SQL. So you're going to write a query, and inside the query you write
something called a SELECT statement, and with that you are asking the database for data. Once you execute your query,
the database is going to fetch your data and then prepare a result to be sent back to you. So with that you are
asking the database a question by writing a query, and the database is going to process your query and answer your
question by sending back data. With that we are reading our data from the database, and the query will not
modify anything: it will not change the data inside your tables or the structure of the database. So you use
a SELECT statement only in order to read something from the database. You just want to retrieve data from the database.
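As a minimal sketch of that read-only behaviour, the snippet below builds a throwaway copy of the customers table and runs one query against it. It uses Python's built-in sqlite3 module as a stand-in for SQL Server so that it can run anywhere. The column names (id, first_name, country, score) and the fifth customer's name are assumptions made for illustration, and the SELECT syntax itself is explained in the next part.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the course's SQL Server database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, first_name TEXT, country TEXT, score INTEGER)"
)
# Sample rows; the column names and the name 'Peter' are assumed for illustration.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Germany", 350),
        (2, "John", "USA", 900),
        (3, "George", "UK", 750),
        (4, "Martin", "Germany", 500),
        (5, "Peter", "USA", 0),
    ],
)

changes_before = conn.total_changes  # rows written so far (the five inserts)

# The query: ask the database for data and read the answer.
rows = conn.execute("SELECT * FROM customers").fetchall()

changes_after = conn.total_changes  # still the same, because SELECT only reads

for row in rows:
    print(row)
print("rows written by the query:", changes_after - changes_before)
```

The query returns all five rows, and the change counter stays where it was, which is the point being made: a SELECT asks a question and leaves the stored data alone.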
So this is what we mean by a query. And now my friends, each SQL query usually has different sections,
different components. We call them clauses. And this is amazing because you're going to have enough tools to
write a query that matches any question that you have about your data. So what we're going to do is cover
all those clauses step by step so that you can write any query that you need. We're going to start with the two clauses
that make up the simplest query in SQL: SELECT and FROM. So let's start with that. All right. So now it's really
important for me that you understand how SQL works with the code, with the queries. So what I'm going to do is
show you on the right side the syntax of the query in SQL, and then on the left side I'm going to show you
exactly, step by step, how SQL is going to execute your query. So now we have the table customers inside our database,
and we will start with the easiest form, where we select everything: SELECT star. The SELECT star is
going to retrieve all the columns from your table, so everything, and the FROM clause is going to tell SQL where
to find your data. So with SELECT we pick the columns that we want, and with FROM you specify the table where your
data comes from. So the syntax is going to be very simple. In each query we always start with SELECT. And since we
want all the columns, we write a star, and with that SQL is going to understand: I want to see everything. And
then after that comes the keyword FROM. Now we want to tell SQL where the data comes from, so we have to specify
the table name. And that's it. This is all you need to do. So once you execute it, what's going to happen? SQL
is going to execute the FROM clause first. So it's going to retrieve all the data of that table from the database into the
results. And then in the next step it's going to check the SELECT clause: which columns do we have to keep in the
result? Since you are saying star, SQL is going to keep everything, all the columns, and with that you will see in
the result everything, all the columns and all the rows. So that's it. This is how it works. Now let's go back to SQL
in order to select some data from our database. Okay. So back to our studio. Let's go and start a new query and let's
go and find our database just to expand it and our tables. Now it is very important to make sure that you are
connected to the correct database. So go to the top left in the menu over here and make sure to select your database.
So MyDatabase, like this. Or we have a command for that called USE, and then just write the database name, like this.
So I'm telling SQL: USE MyDatabase. And with that SQL is going to switch to your database. Now if you are
learning any new programming language, it is very important to understand comments. Comments are notes
that you add to your code in order to explain what is going on. And of course the engine, the database, will not
execute them. It's going to ignore everything inside a comment. And there are two ways to write them.
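Both styles can be sketched in one query. This is an illustration under assumptions: Python's built-in sqlite3 module stands in for SQL Server so that the snippet runs anywhere, and the table is a cut-down, hypothetical copy of customers. The comment syntax itself (two dashes, and slash-star ... star-slash) is the same in both engines.

```python
import sqlite3

# Small stand-in table; the columns and rows are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, first_name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Maria"), (2, "John")])

query = """
-- This is an inline comment: everything after the two dashes is ignored.
/* This
   is a multi-line comment.
   It ends at the closing star-slash. */
SELECT *
FROM customers
"""

# The engine skips both comments and runs only the SELECT.
rows = conn.execute(query).fetchall()
print(rows)
```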
Either you make inline comments by typing two dashes, like this, and then you write anything: "this is a comment". So now
in the editor, if you see it is green, that means it is a comment. The other type is multi-line comments, and in
order to do that you write a slash and then a star, and then you can write anything: "this", and then start a
new line, "is a comment". As you can see, all the lines after the slash star are getting green; that means it is a comment.
And now let's say that you are at the end. In order to close it you write again a star and then a slash, and with that you
are telling SQL: I'm done with my comment. So those are the two ways of writing comments in SQL. Now back to our
query. Let's say that we have the following task: retrieve all customer data. So I would like to see in
the results all the data of my customers, everything, all the rows and all the columns. Currently our data is stored
inside the table called customers, and I need to see all the data in the output. In order to do that we're going to write
a query, and every query always starts with SELECT. Since I need everything, all the columns, we write a star,
and then a new line. Let's go and specify for SQL where it's going to get the data from. So it's going to be
FROM, and then we write the name of the table. It must be exactly as it is in the database. So it's called
customers, and you have to write it here as customers. So that's it. Let's go and execute it. And now if you look at
the results, you can see we have four columns and five rows. So with that you are seeing everything inside the table
customers. You can see we have five customers, and you can see all the columns about the customers. So this is
very simple. We have asked the database a question using an SQL query, and the database answers our question by
returning our data in the results. All right. So now let's move to another task. I'm going to go and create a new
query, and this time we're going to retrieve all the order data. So that means I would like to see all the data
inside orders. So let's go and write a very simple query. We start as usual with SELECT, and since we want
everything, it is SELECT star FROM our table orders. So that's it. Let's go and execute. And with that you can see
in the output we have again four columns, but this time we have only four rows. So that means in this table we have four
orders, and we can see all the data inside this table. So with that we can understand that we have five customers inside
our database and these customers generated four orders. So as you can see we are now talking to our database, and
this is the simplest form of query in SQL. All right. So now let's move to the next step in our query, where you say: you
know what, I don't want to see all the columns from the table. I want to be more specific. I would like to select
exactly the columns that I need. So now we want to select a few columns from the table, where we select only the
columns that we need instead of everything. Now about the syntax, we're going to change one little thing.
Instead of using star, we're going to make a list of the columns that we want to see in the output. So we're
going to select column one, column two, and we separate them using a comma. So we are just writing a list of
columns right after the SELECT. And the FROM is going to stay as it is: FROM a table. Now if you execute
this, what is going to happen? As usual SQL is going to start with the FROM. So it's going to get the data from the
database, and then in the next step it's going to check the SELECT. So what is going to happen? SQL is going to
keep only two columns, for example the name and the country, and all the columns that are not mentioned in the
SELECT will be excluded. So SQL is going to remove them from the results and keep only the columns that
we mentioned in our query. So this time instead of having four columns in the output we have only two. So with
that you are filtering the columns, and you are selecting exactly what you need. So now let's go back to SQL in
order to practice this. All right. So now we have the following task, and it says: retrieve each customer's name,
country and score. So that means I don't want to see everything from the table customers. I only need to see those three
columns. So let's see how we can do that. As usual we start with SELECT, and I'm going to go with a star in order to
see the whole table first, FROM the table customers. So it's exactly like before. Let's go and execute it. And now I can
see everything inside the table customers. But the task says I need only three columns. So what we're going
to do, instead of the star, is make a list of columns. So we start a new line and then we write the name of
the first column, the first name, then a comma and a new line for the second column, the country, and then again a comma, and then
we write score. So with that we have the three columns. Now what I usually do is select them and give them
a push using Tab. This just looks nicer and is easier to read. So with that we now have, between the SELECT and the FROM,
a list of columns. Now there is a mistake that happens a lot, where we type a comma after the last column.
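Here is a sketch of both versions of this query, with and without the trailing comma. Assumptions: Python's built-in sqlite3 module stands in for SQL Server, and the column names first_name, country and score, as well as the sample rows, are illustrative. The exact error text differs between engines, but both reject a comma that is followed directly by FROM.

```python
import sqlite3

# Stand-in copy of the customers table; names and rows are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, first_name TEXT, country TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Germany", 350),
        (2, "John", "USA", 900),
        (3, "George", "UK", 750),
        (4, "Martin", "Germany", 500),
        (5, "Peter", "USA", 0),
    ],
)

# Correct: columns separated by commas, no comma after the last one.
good_query = """
SELECT
    first_name,
    country,
    score
FROM customers
"""
cursor = conn.execute(good_query)
columns = [col[0] for col in cursor.description]  # output order follows the SELECT list
rows = cursor.fetchall()
print(columns)
print(rows)

# Mistake: a comma after the last column, directly before FROM.
bad_query = """
SELECT
    first_name,
    country,
    score,
FROM customers
"""
try:
    conn.execute(bad_query)
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)
print("trailing comma ->", error_message)
```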
So if you do that and execute it, you will get an error, because SQL is going to expect a column from you after the comma,
and since there is no column and you immediately have a FROM, you will get an error. So there is no need for a
comma after the last column. Now let's remove it and execute. And now you can see in the output that we don't have four
columns, we have only three: the first name, the country and the score. And by the way, they are ordered exactly as
you selected them in your query. So first we have the first name, then the country, and then as the last one the score. So that
means if I now change the order, so let's put the country at the end and execute, you will see the country at the
end. I'm going to put it back in between to match the task exactly, and remove the last comma. So execute
again. And with that we have selected a few columns from our table. So we are more specific about what we need. Okay. So
with that we have covered the two clauses SELECT and FROM. Next we're going to talk about the WHERE clause, which you can use in order to
filter your data. So let's go. So what exactly is WHERE? We use WHERE in order to filter our data based
on a condition. Any data that fulfills the condition is going to stay in the output, in the result, and the data that
doesn't meet the condition will be filtered out of the results. The condition could be anything; for example we
say the score must be higher than 500, or you can say the country must be equal to Germany. So any condition that you have
in your question. Now let's see the syntax in SQL. As usual we start with SELECT, where we select the columns that we
need. Then we write FROM, where the data comes from, and then after the FROM we write the WHERE, and right
after that you specify your condition. So now let's see how SQL is going to execute this. First SQL starts as usual
with the FROM. So it's going to get your data from the database, and after that SQL is going to execute
the WHERE clause. So let's say that the condition is that the score should be higher than 500. Now what is going to happen? SQL is going to
check each row, whether it meets this condition or not. So for example Maria doesn't fulfill the condition,
because her score, 350, is not higher than 500. So SQL is going to remove
this row, this record, completely from the
results. Now SQL is going to go to the second record. John is fulfilling the
condition, so he is going to stay in the result. The same thing for George. Now moving on to the fourth one, Martin:
this customer is not fulfilling the condition, and SQL is going to remove him from the results. The same thing
happens for the last customer. The score is zero and not fulfilling the condition. So that means if we apply
this filter, SQL is going to return only two customers out of five. So with that we are filtering the rows based on a
condition using the WHERE clause. Now as you can see, in the result we are getting all the columns, but if you specify in
the query for example only two columns, like the name and the country, then SQL is going to remove
columns from the results as well. And this means in the output we will get only two columns and two rows. So with that you
are filtering the columns and the rows of your results. So now let's go back to SQL in order to practice this. All
right. So let's have the following task, and it says: retrieve customers with a score not equal to zero. So now if you
look at our task, you see we have a condition here. The condition says the score must not be equal to
zero. So I don't want to see all the customers. I want to see only the customers that fulfill this condition.
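The filters discussed so far can be sketched as follows. Assumptions: Python's built-in sqlite3 module stands in for SQL Server, and the column names and the fifth customer's name are illustrative. The sketch runs the two example conditions mentioned earlier (score higher than 500, country equal to Germany) and the condition from this task (score not equal to zero). Text values go in single quotes, and `!=` is used for "not equal" (`<>` is an equivalent spelling).

```python
import sqlite3

# Stand-in copy of the customers table; names and rows are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, first_name TEXT, country TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Germany", 350),
        (2, "John", "USA", 900),
        (3, "George", "UK", 750),
        (4, "Martin", "Germany", 500),
        (5, "Peter", "USA", 0),
    ],
)

# Example condition 1: keep only rows whose score is higher than 500.
high_scores = conn.execute(
    "SELECT * FROM customers WHERE score > 500"
).fetchall()

# Example condition 2: text values are written between single quotes.
from_germany = conn.execute(
    "SELECT * FROM customers WHERE country = 'Germany'"
).fetchall()

# The task: keep only rows whose score is not equal to zero.
non_zero = conn.execute(
    "SELECT * FROM customers WHERE score != 0"
).fetchall()

print("score > 500:        ", [row[1] for row in high_scores])
print("country = 'Germany':", [row[1] for row in from_germany])
print("score != 0:         ", [row[1] for row in non_zero])
```

With the sample rows above, the first filter keeps two customers, the second keeps the two customers from Germany, and the third drops only the customer whose score is zero.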
So we have to filter the data. So let's go and solve the task. Let's start as usual: SELECT star, since there is no
specification about the columns, FROM our table customers. Okay. So I'm going to start with this. Let's go and execute
it. Now if you look at the result, you can see almost all the customers are fulfilling the condition. Their
scores are not equal to zero. Only one, the last customer, has a score of zero. So this customer does not fulfill our
condition. Now let's go and build a filter for that. So we're going to say WHERE. And there will be a section that
focuses only on how to build conditions and filtering in SQL. So don't worry a lot about the syntax of the conditions.
We're going to cover that later of course, but it is very simple. Now for the condition we need a column. So on
which column is our condition based? It's going to be the score. So we write here score, and since we
are saying not equal, there is an operator in SQL for not equal, and then we have to write a value after
that. It's going to be zero. So again the condition is like this: the score must not be equal to zero. It's very
simple, right? And with that we have our condition, and we are using the WHERE in order to filter the data. So let's go
and execute it. And now as you can see SQL removed the last customer because he is not fulfilling this condition. And
we now have only the rows that fulfill our condition. So as you can see it is very simple to filter the data. All
you have to do is write a WHERE clause after the FROM and then write a condition after that. Now let's have
another task; for example it says: retrieve customers from Germany. So I don't want to see all customers from
different countries. I just want to see the customers that come from Germany. So that means we have a condition here.
The country of the customer must be equal to Germany. So let's go and remove the current condition, since it is not the one
that we need, and execute. If you look at the results, we have two customers that come from Germany, and we
are interested in showing only those two customers. So let's go and make a filter for that. We're going to write a WHERE
clause, and after that we need a column. The column is going to be the country. So we write here country, and
this time the country must be equal to Germany. So we write an equal operator, and then we write
Germany like this, exactly like the value inside our data. But now as you can see we are getting an error here. And
that's because in SQL, if you want to write a value that contains characters, you have to put it between two
single quotes. So at the start you put a single quote, and at the end as well. And now as you can see the red line is gone
and the value is now red, and that's because it is a string value. It is a value that contains characters, and with
that you will not get an error. So if your value contains only numbers, you can write it without single quotes. But
if your value contains characters, then you have to write it between two single quotes. Okay. So now back to our
condition: the country must be equal to Germany. Let's go and execute it. And it is working. So as you can see, now we are
seeing in the output only the customers that fulfill my condition, where the country is equal to Germany. So this is
exactly how we work with the WHERE clause in order to filter our data. So my friends, this is how you filter your
rows. And now let's say that I would like to filter the rows together with the columns. So I just want to keep the
first name and the country, and I'm not interested in seeing the scores and the ids. In order to do that we're going
to go to the SELECT and list the columns that we want to see. So the first name, after that a comma, then the country,
and that's it. So let's go and give it a push and execute it. So we have two rows and two columns. So guys, as you can see
SQL is very simple. All right. So with that you have learned how to filter your data using the WHERE clause. Next we're
going to talk about how to sort your data using ORDER BY. So let's go. Okay. So what exactly is ORDER BY?
You can use this type of clause in order to sort your data. And of course, in order to sort your data, you have to
decide between two mechanisms. Either you want to sort your data ascending, from the lowest value to the highest value, or
exactly the opposite way, descending, from the highest value to the lowest. And the syntax looks
like this. As usual, we start with the SELECT and then FROM, and after the FROM you can specify ORDER BY, and with
that you are telling SQL we have to sort the data. And you have to specify two things. First you have to specify for
SQL the column that should be used in order to sort the results. So for example you can say score. And after the
column name you have to specify the mechanism. So for example you say ascending, from the lowest to the
highest. And in SQL, if you don't specify the mechanism, the default is going to be ascending. So you will not get an error
if you don't specify anything after the column name. But my advice here is to always specify something after the
column, because it's just straightforward and easier to understand, and if someone reads it they can understand
immediately that it's going to be ascending, because maybe not everyone knows what the default is in SQL. So always specify a
value, even if it's easier to skip it. And if you want to sort the data from the highest to the lowest, then you
can specify descending. So as usual SQL is going to start from the FROM; it's going to grab your data from the
database. Then as the second step SQL is going to sort the result. So the ORDER BY is going to be executed, and SQL
is going to see: okay, I'm going to sort it by the score using the descending mechanism. And SQL is going to
start moving your rows around, where the first row is going to be the customer with the highest score, and in this
example John has the highest score, 900. So John is going to appear as the first row in the result, and that's because of his
score. After that, the second highest is going to be George with 750, and SQL is going to keep sorting the data,
and then we have 500, then 350, and the last row is going to be the customer with the lowest score, zero. So this is
how SQL executes your ORDER BY. Now let's go back to SQL in order to practice. All right. So now we have the
following task, and it says: retrieve all customers and sort the result by the highest score first. So now by looking
at the task, we need all the customers. So there are no conditions or anything to filter, but we have to sort
the results. So let's go and do that. We're going to start as usual by selecting all the columns from the table
customers. So now if you go and execute it, you will get all your customers, and you are now seeing the data exactly as it is
stored in the database. And you can see the result is not sorted by the scores. So we have here a low score, then a high
score, then low and so on. Now the task says we have to sort the results. So we have to go and use the ORDER BY, and now
you have to understand by which column, and we can get that from the task. So it says it should be sorted by the score.
So we're going to go and define the score here. And the final thing that you have to define is the mechanism,
descending or ascending. And you can get it as well from the task. So we have to sort the data by the highest score
first. So the highest first and then the lowest. So that means we're going to go and use descending. So that's all.
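Put together, the query for this task can be sketched as below. Assumptions: Python's built-in sqlite3 module stands in for SQL Server so that the snippet runs anywhere, and the column names and the fifth customer's name are illustrative. ORDER BY with DESC is written the same way in both engines.

```python
import sqlite3

# Stand-in copy of the customers table; names and rows are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, first_name TEXT, country TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Germany", 350),
        (2, "John", "USA", 900),
        (3, "George", "UK", 750),
        (4, "Martin", "Germany", 500),
        (5, "Peter", "USA", 0),
    ],
)

# Sort by the score column, highest value first.
query = """
SELECT *
FROM customers
ORDER BY score DESC
"""
rows = conn.execute(query).fetchall()
scores = [row[3] for row in rows]

for row in rows:
    print(row)
```

Replacing DESC with ASC in the same query reverses the order, so the lowest score comes first.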
Let's go and execute it. Now as you can see in the results, the first customer has the highest score. Then we have the
second one with the second highest, until the last one with the lowest score. That's it. This is how you sort your
data. And with that we have solved the task. Now let's do exactly the opposite. So we want to sort the results by the
lowest score first. So that means we want to see first the customers with the lowest score; here in this example
we should see the ID number five first, because he has the lowest score, zero. Now in order to do that, all
you have to do is switch the mechanism: instead of descending you use ascending. Let's go and execute
it. And that's it. As you can see, now we have the lowest score, then the second lowest score, until the last row, which is
going to be the customer with the highest score. So the lowest score comes first. So it is very simple. This is how
you sort your data using SQL. And now I'm going to show you one more thing that you can do with the
order by. You can sort your data using multiple columns. And we call it nested sorting. So now let's take this very
simple example where you want to sort your data using country. So we are saying order by the column country and
the mechanism going to be ascending. So from the lowest to the highest. Now if you do that going to go and sort the
data this time based on the country. So we're going to have like the first two customers from Germany. It is sorting it
alphabetically. Then we have the UK and the last two going to be from USA. Now if you are checking the final results
you might say you know what there is like something wrong. The data is not completely sorted correctly. So if you
are looking to the first two customers that come from country Germany. You can see the scores are sorted in ascending
way from the lowest to the highest. So first we have 350 then 500. Then UK it's fine because we have only one customer.
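As a sketch of the situation being described: when you sort by country alone, SQL gives no guarantee about the order of rows that share the same country, which is why the scores inside a country can come out in any order. Adding the score as a second sort column fixes the order within each country. The example uses Python's sqlite3 as a stand-in database; the column identifiers and the names John and Georg are assumptions.

```python
import sqlite3

# In-memory stand-in for the customers table (identifiers and two names are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, country TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Germany", 350),
        (2, "John", "USA", 900),
        (3, "Georg", "UK", 750),
        (4, "Martin", "Germany", 500),
        (5, "Peter", "USA", 0),
    ],
)

# One sort column: countries are in order, but rows within a country are not ordered by score.
by_country = conn.execute(
    "SELECT country, score FROM customers ORDER BY country ASC"
).fetchall()

# Nested sorting: country first (ascending), then score within each country (descending).
nested = conn.execute(
    "SELECT first_name, country, score FROM customers ORDER BY country ASC, score DESC"
).fetchall()

for row in nested:
    print(row)
```

Each column in the ORDER BY list carries its own ASC or DESC, and columns listed earlier take priority over columns listed later.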
Now if you look at the customers from the USA, you see that it is sorted the other way around: descending, from
the highest to the lowest. First we have the score 900, then zero. So there is no consistent order in how the data is
sorted within a country, and the result is not really clean. This issue usually happens if you sort your data by a
column that has repeated values, like the country here, where we have Germany twice and the USA twice. To refine the
sorting and make it more precise, we can include another column in the sorting, in this scenario for example the score.
We can make a list of columns in the ORDER BY, separated by commas. And of course you can have
a different mechanism for each column: for the country we say ascending, but for the score we say,
let's make it descending. It doesn't have to be the same for all columns. What happens now is that SQL
sorts the data within each section. For the two customers from Germany the sorting is going to be from the highest
to the lowest, so it switches the two customers: Martin comes first because he has a higher
score than Maria. With that we are refining the order of the scores among rows that share the same country. For the
UK nothing happens because we have only one row, and for the USA nothing happens either because it
is already sorted the correct way, from the highest to the lowest. So as you can see, by including a
second column you refine your sorting. And, my friends, the order of the columns is very important. This is how you
do nested sorting in SQL. Let's go back to our SQL and start practicing. All right, so now we have the following
task: retrieve all customers and sort the results by the country and then by the highest score. Again we
need all customers, so select everything from the customers table. The task says we have to sort the result by the
country, so we start with ORDER BY, and since it says by the country, we go with the
country and sort it alphabetically, so ascending. Let's execute it. Now
you can see the data is sorted completely differently, by the country: first Germany, then UK, then
USA. But that's not all; the task says 'then by the highest score'. So we have to include another column in the sorting.
We add a comma, then mention another column, the score, and now we have to specify the
mechanism. It says by the highest score, so the highest must come first, and for that we use descending. What
is the current situation? If you look at the results, for those two customers from Germany we have 350 and then
500. So the scores are sorted ascending, and the same goes for the USA, from the lowest to the highest. Now
if you run it like this, SQL switches them. You can see that for
Germany the highest, 500, now comes first and then the 350, and for the USA they switched as well: the highest,
then the lowest. With that we have solved the task. Again, the order of those columns is very
important. Since the score comes after the country, we will not get the highest scores first in the results;
we will not get the 900 as the first row. That's because the scores are sorted only after the country, so the country
has higher priority. Now let's flip that and sort first by the score and then by the
country. Let's execute it. SQL first has to sort the scores, so with that you get the
900 first, and then the countries. And since there are no duplicates in the scores, the second column has no
effect at all, so you can skip it. Nested sorting only makes sense if you have repetition in the first sort column,
so that a second column can help make the sorting complete. So that's it, and with that we
have solved the task. All right. So with that you have learned how to sort your data using ORDER BY. In the next
step we're going to talk about how to aggregate and group your data using GROUP BY. We put it
between the WHERE and the ORDER BY, because in the order of the query the GROUP BY comes between the WHERE and the
ORDER BY. So let's go. Okay. So what exactly is GROUP BY? It combines the rows
with the same value. It squashes your rows together to make the data aggregated.
All GROUP BY does is aggregate one column by another column. For example, if you want to find the total
score by country, you aggregate all the score values for each country. If you have this kind of task, you can
use GROUP BY. Let's see the syntax. We start as usual with the SELECT. What we want to see in
the result is two columns. We specify a category, like the country; this is the value that you want
to group the data by. And another column where you do the aggregation: for example, you say I would like
to see the total score, so we use the function SUM to add up the values of the score. After that, as usual,
we use FROM to select the data from a specific table. And now comes the magic: after the FROM we use GROUP
BY. Now SQL understands: okay, I have to combine the data, I have to group the data by something. This time we
say group the data by the country. That means each value of the country must appear in the
output only once, and for each country we want to see the aggregation, the total score. Let's see how SQL is
going to execute it. It first starts with the FROM and gets the data from the database.
Then SQL executes the GROUP BY. SQL understands: I have to group the data by the
country and aggregate the scores. So it identifies the rows that
share the same value. For example, here we have two rows for Germany, and it brings them to
the results. Now we have two rows for the same country, but since we said GROUP BY country, SQL
combines them into only one row. Each value of the country must exist at most once; we cannot
leave it like this. So what do we do with the scores? We have two scores. SQL checks the
aggregate function; it is the sum. So it adds those values, 350 + 500, and with
that we get the total score of 850. As you can see, SQL is combining those two rows into
one. In the output, Germany will exist only once, and for the scores we get the total. The same
thing happens for the next value in the country column, the USA. We have it twice, so we get
two rows, and SQL combines those two rows into one because USA must exist only once. For the scores we
get the total: 900 plus zero gives 900. And with that SQL has converted those two rows into one.
For the last value in the country column we have the UK. It stays as it is; there is no need to
combine anything because there is only one row. So, my friends, if you look at the output, you can see we
grouped the original data by the country, and that means we get one row for each value inside the
country column. In the original data you have five rows; in the output, if you use GROUP BY like
this, you get only three rows. This is exactly how GROUP BY works. Let's go back to SQL and practice.
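The grouping walkthrough above can be sketched as follows, using Python's sqlite3 as a stand-in database. The column identifiers and the names John and Georg are assumptions; the scores and countries follow the walkthrough. The sketch also covers the column alias, the COUNT aggregate and grouping by a second column, which the practice tasks that follow work through.

```python
import sqlite3

# In-memory stand-in for the customers table (identifiers and two names are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, country TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Germany", 350),
        (2, "John", "USA", 900),
        (3, "Georg", "UK", 750),
        (4, "Martin", "Germany", 500),
        (5, "Peter", "USA", 0),
    ],
)

# One output row per country; AS gives the aggregated column a name for this query only.
rows = conn.execute(
    "SELECT country, SUM(score) AS total_score FROM customers GROUP BY country"
).fetchall()
total_by_country = dict(rows)

# Two aggregations per country: total score and number of customers.
rows = conn.execute(
    "SELECT country, SUM(score) AS total_score, COUNT(id) AS total_customers "
    "FROM customers GROUP BY country"
).fetchall()
totals_and_counts = {country: (total, count) for country, total, count in rows}

# Grouping by two columns: one output row per (country, first_name) combination.
by_country_and_name = conn.execute(
    "SELECT country, first_name, SUM(score) AS total_score "
    "FROM customers GROUP BY country, first_name"
).fetchall()

print(total_by_country)
print(len(by_country_and_name), "rows when grouping by country and first_name")
```

One caveat if you try this outside SQL Server: SQLite is more lenient and does not reject a SELECT column that is neither aggregated nor listed in the GROUP BY, so the error described in the tasks below will not appear there.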
Okay. So we have the following task: find the total score for each country. From reading this you can
understand that we have to do an aggregation and combine the data by a column. Usually I start like
this: I select the columns that I need in order to solve the task. What do we need? We need the country and
the score from our table customers. So let's start like this. Now you can see we have the countries and the scores. The
task says we have to group the data by the country, so this is the column we use for the
GROUP BY, and the scores will be aggregated into totals. What do we have to do? We use GROUP BY, since the task says
'for each country': GROUP BY country. And now we have to aggregate the scores; we
cannot leave the column like this. So we say the SUM of the score. Let's go and execute it. And with that,
as you can see, we are getting the total score for each country. Instead of five customers we now have only
three rows, because the country column has three distinct values. Now if you check the result, you can see something
odd: it says 'No column name'. That's because we have changed the scores. It is not the original
score anymore; it is the total score, since we have summed those values. So SQL
doesn't know what to call it. Those values don't come directly from the database; they are a computation that
you have done here. To give the column a nice name we can add an
alias. An alias is just a name that lives inside your query. We
do it with AS, and you can specify any name you want, for example total score. Now SQL understands that
this is the name for this column, and if you execute it you will see the new name in the results. But you have to
understand that this name exists only in this query. You are not renaming anything inside your database, and you cannot use
it in any other query. It is just something that is known inside this query and only for your results. And of
course you can rename any column; for example, here you can say this is the customer country, and if you
execute it you are just renaming the column in the output. This is really nice in SQL. Okay. So now there is
one more thing about GROUP BY: the non-aggregated columns that you add in the SELECT must also be
mentioned in the GROUP BY. For example, let's say: I'm seeing the countries and the total scores, and I
would like to see the first name as well. So you go over here and say, let's get the first name:
country, first name, the total score, and execute. You will get an error, because SQL tells you that every
column in the SELECT must either be a column that you group the data by or be aggregated. The
first name is not aggregated and is not used in the GROUP BY, so SQL cannot tell which first name to show for
each group and the query will not work. If you bring in a column, either it
should be in an aggregation or it should be part of the GROUP BY. To fix this, if you really want to
see the first name, you can add it to the GROUP BY and execute. This time it
works, because all the non-aggregated columns in the SELECT are also part
of the GROUP BY. Now, as you can see, we
have the countries, the first names and the total scores, and again
we have five rows, not three.
That's because SQL is now grouping the data by two
columns, the combination of the
country and the first name, and those two columns
give five combinations, which means you get five rows. So you have to be really careful about what you
define in the GROUP BY: the number of unique value combinations that those columns produce defines the number of
rows in the output. If you remove the first name from the SELECT and from the GROUP BY as well, you are grouping by only one column,
and this column has only three values; that's why you get three
rows. With that, of course, we have
solved the task. Now let's extend the task: find the total score and
total number of customers for each
country. That means we need two aggregations: the total score and the total number of
customers. From reading this you can understand that we still want to group the data by the country, but this time we
need two types of aggregation. We have almost
everything; what is missing is the second aggregation. What you can do is add another
aggregate function, called COUNT. What we want to count is the number of customers, so we put the ID
in it and call it total customers. Now if you execute it, you get
the total customers by country as well. As you can see, SQL has no problem with the ID, and that's because
you are aggregating the ID, so SQL knows what to do with it and how to combine it. That means you don't have to
mention the ID in the GROUP BY, because you are aggregating it. So that's all; with that we have solved this
task as well. All right. So with this you have learned how to group your data using GROUP BY. Next we're going to
talk about another technique for filtering your data, this time using the HAVING clause. So let's
go. All right. So what exactly is HAVING? You can use it to filter your data, but after the
aggregation. That means HAVING comes after the GROUP BY. Let's see the syntax.
Again, like the previous example, we are finding the total score by country,
so we have our SELECT, FROM and GROUP BY. Now
you say: I would like to filter the end results. To
do that we use HAVING after the
GROUP BY, and like the WHERE clause you have to specify a condition. We have the following condition: we want to
see in the results only the countries whose total score is higher than 800. You
might notice something: with the GROUP BY we use the country, the column whose values we group the data by,
but with the HAVING we
use the aggregated column, the SUM of the score. This is how the syntax
works. Now let's see how SQL is going to execute it. As usual SQL starts with the FROM; we get our data.
The second step is to aggregate the data by the country.
Like before, it groups the rows
with the same value of the country, so we have one row for each
country. This is what happens
if you use GROUP BY, and with that we now have aggregated values. After the GROUP BY, SQL executes
the HAVING. HAVING is like a filter. We have a condition: the total score must be higher than 800.
SQL checks the new results after the aggregation. For Germany we have a total score of 850,
so it meets the condition and stays in the results. The same goes for the USA: its total of 900 is higher than 800 as well.
But the UK does not meet the
condition: 750 is not higher than 800, so SQL filters out this
row. That means after applying the HAVING we get only two countries, because they have values that
fulfil the condition. That is what happens if you use
HAVING; it simply filters the data.
But now you might be confused. You say: we have used the WHERE clause to filter the data, so why do we have
another clause in SQL to filter my data? Can't we just use the WHERE? Well, in SQL there are different ways
to filter your data depending on
the scenario. So now let's add both of the filters to the query. We are
already using the HAVING after the GROUP BY; now let's add the WHERE.
The WHERE comes between the FROM
and the GROUP BY, so directly after the
FROM. Here we say the score must be higher than 400. So now we are
filtering based on the scores twice, right? Once we say the score must be higher than 400, and with HAVING we
say the sum of the score must be higher than 800. So what is the big difference? It is when the filter happens. If
you want to filter the data before the
aggregation, that is, filter the
original data, you use
the WHERE clause. But if you want to
filter the data after the aggregation, after the GROUP BY, you
use HAVING. So it's really all about when the filter happens. Let's see how SQL is going to execute this.
As usual, first the FROM is executed to get the data. Then as the second step the WHERE is
executed. This is our first filter. SQL filters the data using WHERE before doing any aggregation, and
based on our condition the first customer is filtered out because the score is less than 400, and the same
goes for the last customer. After applying the WHERE clause we get only three rows, only three customers.
Next SQL executes the GROUP BY, so it groups the data by the country.
Now we have less data to combine; no values are actually added up, because we have only one row for each
country. After the data is aggregated by the GROUP BY, SQL activates the second filter, the
HAVING. Here SQL filters the new results based on the
total scores and checks them one by one. The USA meets the condition. The UK is filtered out
because it is not higher than 800. And this time Germany is filtered out as well, because now it does not
fulfil the condition. In the previous example, without the WHERE, we had more scores for Germany; that's why
it passed the test. But this time, since we filtered out a lot of customers using the
WHERE, Germany does not have a high enough total to pass the second filter. So
in the output we get only one row, because we are filtering
a lot of data. It is very simple: WHERE is executed before the
GROUP BY, before the aggregation; HAVING
is executed after the GROUP BY, after the aggregation. So now let's go back to SQL in order to practice.
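A minimal sketch of the two filters, using Python's sqlite3 as a stand-in database. The column identifiers and the names John and Georg are assumptions; the thresholds 800 and 400 come from the walkthrough above. The last query corresponds to the practice task that follows (average score per country, ignoring zero scores, keeping averages above 430).

```python
import sqlite3

# In-memory stand-in for the customers table (identifiers and two names are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, country TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Germany", 350),
        (2, "John", "USA", 900),
        (3, "Georg", "UK", 750),
        (4, "Martin", "Germany", 500),
        (5, "Peter", "USA", 0),
    ],
)

# HAVING only: group first, then keep the groups whose total is above 800.
having_only = dict(conn.execute(
    "SELECT country, SUM(score) AS total_score FROM customers "
    "GROUP BY country HAVING SUM(score) > 800"
).fetchall())

# WHERE + HAVING: rows with score <= 400 are dropped before grouping,
# so Germany's total shrinks to 500 and no longer passes the HAVING filter.
where_and_having = dict(conn.execute(
    "SELECT country, SUM(score) AS total_score FROM customers "
    "WHERE score > 400 GROUP BY country HAVING SUM(score) > 800"
).fetchall())

# Practice task: WHERE filters single rows, HAVING filters the per-country averages.
avg_task = dict(conn.execute(
    "SELECT country, AVG(score) AS avg_score FROM customers "
    "WHERE score != 0 GROUP BY country HAVING AVG(score) > 430"
).fetchall())

print(having_only)
print(where_and_having)
print(avg_task)
```

The HAVING condition repeats the aggregate expression rather than the alias, since not every database (SQL Server included) accepts the alias there.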
Okay. So now we have a very interesting task: find the average score for each country considering only customers with
a score not equal to zero, so that sounds like a condition, and return only those countries with an average score greater
than 430, so this is another condition. I know there is a lot going on, so let's do it step
by step. Usually I start with a very simple SELECT statement with the
columns and data that I need.
What do we need here? We need the score and we need the country,
so all we need is two columns. I'm also going to select the ID, just to see the customer ID. Then let's
get the country and the score from our table customers. Let's query that. As you can see, I start with the
basics: query the data, and then build the next step on top of it. What do we have in the task? We have to find
the average score for each country, which means we have to do an aggregation. And we have two conditions. The
first condition says we need only the customers with a score not equal to zero, and the second says we need only
the countries with an average score greater than 430. You have to decide for each condition whether to
use WHERE or HAVING. With the first one we want to filter based on the scores, so we want to filter
before the aggregation. It's not talking about the average score; it's talking about the score itself. That means we can use
a WHERE condition for this. The second one says countries with an average score greater than 430. That
means we want to filter the data after aggregating the score, so for this condition we have to use
HAVING. What I would like to do first is implement the first condition. It's very simple: after the FROM we say WHERE
the score is not equal to zero. Let's execute it.
With that we no longer have any customers
whose score is equal to zero, so we have solved this part. For the second condition we first
have to do the aggregation. We start with the average score: we go over here, apply the
average function (AVG) to the score, and call it average score. We don't want to see only the overall average score; we want
the average score for each country.
That means we have to aggregate by the country, and for that we use GROUP BY.
GROUP BY always comes after the WHERE clause. So GROUP BY, and which column? The
country. Now there is an issue here: you cannot execute it like this. We have to get rid of the ID, because it is
neither aggregated nor part of the GROUP BY, and we don't need it anyway. Let's
execute it. With that we have the average score for each country,
so the first and the second part
are completed. Now we come to
the last part: the average score must be higher than 430. For that we use HAVING, and HAVING
comes after the GROUP BY. We need to specify the condition, and it must be on the aggregated column. So we
take the average score expression from here, put it after the HAVING, and say it should be greater than 430. That's it; with
that we have the last part as well. Let's execute it. And with that, my friends, we have filtered the
data after the aggregation. This is how I decide between WHERE and
HAVING. It is very simple. All right. So
with that you have learned how to filter the aggregated data using HAVING. Next we're going to go back to
the top of the query, where we can use the keyword DISTINCT directly after the SELECT. So let's go now and learn about
DISTINCT. Okay. So what exactly is DISTINCT? If you use it in SQL, it
removes duplicates from your results. Duplicates are repeated values in your data, and DISTINCT
makes sure that each value appears only once in the results. It sounds very simple, and the syntax is easy as well.
As usual we start with a SELECT, but directly after the SELECT we use the keyword DISTINCT; there is
nothing between them. Then the normal stuff: we specify the columns and then the FROM in order to get the data from
the table. Let's say I would like to get a list of unique values of the country. The first thing SQL
does, of course, is get the data from the database using the FROM.
The second step is the SELECT:
SQL executes it and selects only one column, the country. All other columns are excluded
from the results. Now SQL goes to the third step and applies the DISTINCT to
the country values. It acts like a filter that makes sure each value occurs only once. It
starts with the first value, Germany, and looks at the results: do we have Germany? Well, we don't have
anything yet, so it includes it in the results. The
next value is USA. The same
thing: we don't have USA in the results, so it includes it. This happens for the UK as well: we
don't have UK in the results, so it is included too.
Now comes Germany again.
SQL says: wait, we have it already. So it does not add it again to the output, because it must
appear only once; we will not have Germany twice. The same goes for the last value, the USA: we already have it in the
results, so it does not appear again. With that we have removed the duplicates, the repetition inside our
data, so each value is unique. Now let's go back to SQL. Okay, the task is very simple: return a unique list of
all countries. Let's go and do that. It's going to be fun. So SELECT, and let's get the column country from
our table customers, like this. Now you can see we have a list of all countries, but the task says we need a unique list,
so I cannot have repetitions in it. For that we use the very nice
DISTINCT. If you do it like this and execute, you will see there are no duplicates in your results,
and all the values in the result are unique. With that we have solved the task. It's very simple.
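A small sketch of the task, run with Python's sqlite3 as a stand-in database. The column identifiers and the names John and Georg are assumptions; the countries follow the walkthrough.

```python
import sqlite3

# In-memory stand-in for the customers table (identifiers and two names are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, country TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Germany", 350),
        (2, "John", "USA", 900),
        (3, "Georg", "UK", 750),
        (4, "Martin", "Germany", 500),
        (5, "Peter", "USA", 0),
    ],
)

# Without DISTINCT: one value per customer, so Germany and USA each appear twice.
all_countries = [row[0] for row in conn.execute("SELECT country FROM customers")]

# With DISTINCT directly after SELECT: each country appears only once.
unique_countries = [row[0] for row in conn.execute("SELECT DISTINCT country FROM customers")]

print(len(all_countries), "rows without DISTINCT")
print(sorted(unique_countries))
```

The order in which the unique values come back is not guaranteed unless an ORDER BY is added, which is why the sketch sorts them before printing.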
Now there is one thing about DISTINCT: I see a lot of people using it in cases where it's not
really necessary. For example, let's get the ID. If you
execute it, you can see we have a
list of all IDs and there are no duplicates. But if I remove the DISTINCT and execute it, we get
the same results, because the IDs are unique anyway. So it really makes no sense to say DISTINCT here, because
the database still has to go and make sure each value occurs only
once. That is extra work for the database,
and it is usually an expensive operation. So if your data is already unique, don't apply DISTINCT.
Only if you see repetitions and duplicates and you don't want to see them, only in this scenario, apply
DISTINCT. Don't blindly apply DISTINCT to every query just in case there are duplicates. This is usually bad
practice. Okay. So that's all for DISTINCT. Okay my friends, with that you have learned how to remove
duplicates using DISTINCT. In the next step we're going to talk about another keyword that you can use
together with the SELECT: you can use TOP in order to limit your data. Let's understand what this
means. Okay. So what exactly is TOP, or as it's called in other databases, LIMIT? It is again a kind of filtering in
SQL. If you use it, it restricts the number of rows returned in the results, so you have control over
how many rows you want to see. The syntax is very simple as well. Directly after the SELECT you
use the keyword TOP and then you specify the number of rows you want to see in the results, for example
three. Only after that do you specify the columns that you want and then from which table. Let's see how
SQL is going to execute it. As usual the FROM is executed and we get our data. The second step is
to select the columns; in this case all the columns
stay. After that SQL
executes the TOP. How does it work? It's very simple. Each row in the result has a row number. It has nothing to
do with your data or with the IDs. For example, in the current result we have row numbers 1, 2, 3, 4, 5.
Those numbers are not your actual data; they are something technical from the
database. So they are not the same as the IDs;
the IDs are your actual content, your data. Here we are not filtering based on the data but based on the
row numbers. Since we have specified three, SQL counts: row number one, two, three, and that's it.
It makes a cut, and all the rows after number three are excluded from the results, so you
get only three rows.
As you can see, this type of filtering is not based on a condition;
it's just based on the row numbers. Whatever results you have, it makes a cut at
a specific row. So let's go to SQL and practice that. Okay. So now we have a very simple task: retrieve only
three customers. Let's go and do that. We select star from our table customers and execute it.
As you can see, in the output we have five customers, but the task says we want only three. There is no
specification at all about any condition, so I don't have to write a WHERE clause with a
condition based on our data. We just want three customers. We can do that very simply by adding TOP directly
after the SELECT and then specifying the number of rows you want to see in the output. So SELECT TOP 3 and then the
star. Let's execute it. And with that we are getting three customers. That's it. It's very simple. All right.
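A sketch of this row limit, with one caveat: the example runs on SQLite through Python's sqlite3 module, and SQLite spells the limit as LIMIT at the end of the query rather than TOP after the SELECT. The SQL Server form used in this lesson is shown as a comment. The column identifiers and the names John and Georg are assumptions.

```python
import sqlite3

# In-memory stand-in for the customers table (identifiers and two names are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, country TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Germany", 350),
        (2, "John", "USA", 900),
        (3, "Georg", "UK", 750),
        (4, "Martin", "Germany", 500),
        (5, "Peter", "USA", 0),
    ],
)

# SQL Server form from the lesson:  SELECT TOP 3 * FROM customers
# SQLite form of the same idea:     SELECT * FROM customers LIMIT 3
three_customers = conn.execute("SELECT * FROM customers LIMIT 3").fetchall()

for row in three_customers:
    print(row)
```

Without an ORDER BY the database decides which three rows come back, so this form is best suited to quick exploration of a table.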
Now moving on to another task: retrieve the top three customers with the highest scores. This
is a mix between ordering the data and filtering the data, right? Earlier we
sorted the data by the scores from
the highest to the lowest, and now we are doing both together.
Let's do it again step by step. I will
go back to the SELECT star from customers. What we can do is
sort the data by the score from the
highest to the lowest using ORDER BY: ORDER BY score, descending. Let's execute it. Now you
can see the first customer is the one with the highest score, then the second highest, and so on. I think you
already got it: to get the top three customers with the highest scores, all you have to do is go over
here, say TOP 3, and execute it. With that you have a really nice analysis of your data. It's like a
report where we find the top customers with the highest score. This is really amazing and very easy.
As you can see, by mixing TOP with sorting you can do top-N or bottom-N analysis.
Let's take this task: retrieve the lowest two customers based on the score. So now we want the lowest scores in our
table. That is very simple: we flip the sorting.
We sort our data by the scores
ascending, from the lowest to the highest, and since we want only the
lowest two customers, we replace the three with a two and execute it. With that we get
the lowest two customers, Peter and Maria; they have the lowest scores. Again, it's very easy. Okay, this is
fun. Let's go to the next one: get the two most recent orders. This time we are talking about another table.
Let's select everything from the table orders, like this. As you can see, we have four orders here and we
want the two most recent ones. Most recent means we have to deal with the order date, and we can do that by
sorting the data by the order date. So ORDER BY the order date, and since we want the most recent orders, from
the latest date to the earliest, that means descending, right? Let's execute it.
Now we can look at our
result: this is the latest order in our business based on the order date, and this
one is one of the earliest orders. With that we have sorted the data, and since we want the two most recent orders,
we go directly after the SELECT, say TOP 2, and
execute. With that we now have the
last two orders in our business. So as you can see, by combining TOP with ORDER BY you can do amazing analysis.
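The three tasks above can be sketched like this, again on SQLite through Python's sqlite3, where LIMIT at the end of the query plays the role of TOP. The customers' scores follow the lesson; the column identifiers, the names John and Georg, and the whole orders table (its columns and its four dates) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in customers table (identifiers and two names are assumed).
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, country TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Germany", 350),
        (2, "John", "USA", 900),
        (3, "Georg", "UK", 750),
        (4, "Martin", "Germany", 500),
        (5, "Peter", "USA", 0),
    ],
)

# Hypothetical orders table: the lesson only says there are four orders with an order date.
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1001, "2021-01-11"), (1002, "2021-04-05"), (1003, "2021-06-18"), (1004, "2021-08-31")],
)

# Top three customers by score (SQL Server: SELECT TOP 3 ... ORDER BY score DESC).
top_three = conn.execute(
    "SELECT first_name, score FROM customers ORDER BY score DESC LIMIT 3"
).fetchall()

# Lowest two customers by score: flip the sort direction and change the row count.
lowest_two = conn.execute(
    "SELECT first_name, score FROM customers ORDER BY score ASC LIMIT 2"
).fetchall()

# Two most recent orders: ISO-formatted date strings sort chronologically.
recent_two = conn.execute(
    "SELECT order_id, order_date FROM orders ORDER BY order_date DESC LIMIT 2"
).fetchall()

print(top_three)
print(lowest_two)
print(recent_two)
```

The sort is applied before the row limit, which is what turns a plain row cut into a top-N or bottom-N result.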
All right, so this is how you limit your data using TOP, and with that you have learned the basics:
all the clauses, the sections that you
can use in any query in SQL. Next
we're going to put everything together in one query in
order to learn how SQL deals with all those clauses
and how it executes them. So let's go and do
that. Okay. So now I'm going to show you the coding order of a query compared to the execution order that happens in the
database. The coding order of a query always starts with a SELECT. Directly after that you can put a
DISTINCT, and after the DISTINCT you can put a TOP. This is the order of those keywords. Then you
select a few columns, and after you specify the columns, separated by
commas, you tell SQL which table your
data comes from using the FROM clause. After that, if you want to filter the data before the aggregation, you can use
the WHERE clause, and this always comes directly after the FROM. If you want to group the data, you have to do it
after the WHERE clause using GROUP BY, and after the GROUP BY comes the HAVING if you want to filter the aggregated data.
And the last thing that you can specify in query it is always the order by. So this is the order of all those
components of the query. And if you don't follow this order you will get an error from the database. Now if you look
at this query, there are a lot of things that are going to filter your data. So let's check them one by one. The first
thing that you can do is filter the columns. If you don't want to see all the columns, only
specific columns, you use the select, and of course you must use it. The columns that you specify will be shown
in the results. So it's like filtering the columns. Now there is another type of filter, where you filter out the
duplicates if you want to see unique results, and that's using the distinct. So this is another type of filter.
Moving on, we can filter the result based on the row numbers: we can limit the result using the top. This
type of filter doesn't need any conditions. It's purely based on the row number in the results. Now moving on, if
you want to filter your data based on conditions on your data, you can filter the rows before the aggregation
using the where clause. And the last type of filtering: you can filter your rows after the aggregation using the having.
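The coding order and the five kinds of filters can be sketched in one query. This is only an illustration: the columns country and score are assumed to exist in customers, and the thresholds 0 and 500 are made up. The comments also note the logical execution step of each clause, which is the order the walkthrough turns to next.

```sql
SELECT DISTINCT TOP 2            -- select: column filter; distinct: duplicate filter; top: row-count filter
    country,
    SUM(score) AS total_score
FROM customers                   -- executed 1st: find the data
WHERE score > 0                  -- executed 2nd: filter rows before the aggregation
GROUP BY country                 -- executed 3rd: combine rows and aggregate
HAVING SUM(score) > 500          -- executed 4th: filter rows after the aggregation
ORDER BY total_score DESC;       -- executed 6th, after select distinct (5th); top runs last (7th)
```

Writing the clauses in a different order, for example putting the where after the group by, is a syntax error rather than a different query.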
So as you can see, we have five different ways to filter the results in SQL. So now let's see the
execution order. As we learned, the first thing that's going to happen is that SQL executes the from clause. So SQL
is going to find your data in the database, and all the next steps are going to be based on this data. Now the next
step is that it's going to filter the data using the where clause. This has to happen
before any aggregation, because we first have to define the scope of the data. So once SQL applies it,
maybe some of the rows are going to be removed. Once the data is filtered, in the third step SQL executes the
group by: it takes the results, combines the rows with the same values into one row, and aggregates the data
based on the aggregate function that you have specified. After the group by, after aggregating the data,
it's going to apply the second type of filter, the having. So based on the condition,
SQL is going to remove some of the aggregated rows and keep the rest. Now moving on to step number five.
Next it's going to execute the select distinct. So SQL is going to select the columns that we need
to see in the results, drop the other columns, and remove duplicate rows. And once the columns are selected, SQL executes the
order by. So SQL sorts the data based on the column that you have specified and the direction as
well, ascending or descending. And my friends, the last step that's going to happen in your query
will always be the top. So based on the final results, SQL executes the top. Here
we are saying top two, which means we want to keep only the first two rows without any conditions. So SQL counts
row number one, two, and after that it makes a cut and removes anything after that. So this is the last
filter that's going to happen and as well the last step. So now if you sit back and look at this, the coding order
is completely different from the execution order. In the coding order we have to specify the select first, but actually the
select is going to be executed almost at the end, at step number five. And once you understand how SQL executes
your query, you can understand how to build correct queries. So now the first thing that we
have learned is that we can have one query, something like this: select star from customers. Now this is
one query, and in the output we have one result. But did you know that in SQL we can have multiple queries and
multiple results in one go? For example, let's say I'm also selecting the data
from orders. That means we have two queries, and now if you go and execute, you will get two result
grids. The first result grid is for the first query and the second one is for the second query. So with that you can
run multiple queries in the same window, and the results are split into multiple grids depending on how
many queries you have. Usually in SQL you will find that at the end of each query there is a semicolon, like this. So
at the end of the first query we have a semicolon, and for the second query we have another
semicolon at the end as well. For SQL Server it is not a must, but for other databases, if you have multiple queries in
one execution, you must separate them with a semicolon, and with that the database can understand: okay, this is the
end of the first query and this is the end of the second query. So you have separations between
queries. Okay. Now moving on to another cool thing in SQL. What if we don't want to query the data inside our
tables, but would like to show a static value from us, from the one who is writing the query? This is very
practical if you are practicing and you want to check something using a value from you, not from the tables. So
how can we do that? It is very simple. We're going to write select, and after that, instead of having a column
name, you can add any value, like 123. So it is just a number, and we do not specify any table after that. We
leave it like this: select 123, and we don't need to use the from clause. So now if you go and execute it, you will get
123. So this is a static value. And of course you can rename the column, like static number. So execute it again.
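A short sketch of static values in SQL Server syntax. The aliases static_number, static_string and customer_type are illustrative names, and the column names id and first_name are assumptions about the customers table. The same idea carries over to strings and to mixing a literal with table columns.

```sql
-- A literal with no FROM clause returns a single row holding that value.
SELECT 123 AS static_number;

-- A string literal works the same way.
SELECT 'Hello' AS static_string;

-- Next to table columns, the literal is repeated on every returned row.
SELECT
    id,
    first_name,
    'New Customer' AS customer_type
FROM customers;
```

SQL Server accepts a select without a from clause; some other databases require selecting from a dummy table instead.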
So with that we have a static value. And you can add anything, like a string as well. So let's say hello as static
string, for example. Let's go and execute. Now we have two queries, and in the second one you can see our static value,
hello. So in queries we can add values from us, not only select data from the tables. And of course you can
mix things: we can have in one query data from the database and static data from us. So let me show you what I
mean. Let's go over here and say select, and let's get for example the ID and the first name from the table customers,
like this. With that we can see we are getting data from the database. But now I can add something from me,
new customer, and we can call it customer type. So what is going on here? Two columns from the database and one column
from us, the static one. If you go and execute it, you can see that for the ID and the first name the data comes
from the database, but for each record we are always getting the same static value: new customer, new customer, and so
on. So this piece of information comes from the query; it is not stored inside the database. And those two columns
come from the data stored inside the database. So this is a really cool thing: you can add some information from you
and you can get the data from the database. These are the static values. Okay. One more cool thing that I
want to show you: if you have a query like this, where you are selecting from a table and filtering the data, sometimes
you would like not to execute the whole thing. You would like to execute only a part of this query. For
example, I would like to see all the customers again in this query, without this filter. So instead of removing it,
running the query, and then adding it again, you can highlight what you want, now without the filter, and
execute. With that the database is going to execute exactly what you highlighted. And now as you can see, I'm getting all
the customers without the filter. And if you don't highlight anything and execute, what's going to happen? It's
going to execute the whole thing inside the editor. This is also really nice if you want to query another table
quickly in the same editor. Say we want to select everything from the orders just quickly: you can highlight only
this query and execute. With that SQL is ignoring everything else and only executing what I'm highlighting. And
this is really nice. It gives us speed and flexibility. And you're going to find me doing that a lot in the course.
So this is really nice. Okay, my friends. With that we have learned the basics of the SQL query, the basic
components of the select statement, and with that you can talk to the database in order to get data. Now in the next
chapter we're going to learn how to define the structure of our database. So we're going to learn the data definition
language, DDL. So let's go. Okay. So usually if you have an empty database, what you want to do is
define the structure of your data. So one of the first things that we usually do is create new
tables. Here we have a command called create, and if you use it you can create a new object inside the database, like
for example a table. So once you execute it you're going to get brand new table and usually the table going to be empty
without any data. So it is very simple. This is what the create command does. And now let's go to SQL in order to
create a new table. So my friends, we have the following task: create a new table called persons with the columns ID,
person name, birth date, and phone. Okay. So this time we will not start with select; we will start with the command
create table. So we are telling SQL to create a table, and after that we have to define the name of the table. In this
task we have to call it persons. Now we have to open a pair of parentheses like this, and in between we have to define
the columns. So what do we need? First we need an ID. This is the first column name. Next we have to define the
data type for this column. It's going to be an int, so it is a number and does not contain any characters. Next we
can define some constraints. We cannot have a person without an ID, so it should not be null. So not null.
This is the first column. So we have defined the name of the column, the data type, and the constraint. Okay. So let's
go to the second column. Here we're going to have a comma, and the next column name is going to be person name.
Now the data type for this column is going to be a varchar, because
the person name contains characters. So varchar. And now we have to define the length. I'm going to go with 50
characters. And I would say this is a must: each person should have a name. So we're going to say not null as
well. With that we have the name, the type, and the constraint. Now let's move to the third column. It's going to be
birth date. Now which type of information do we have inside the birth date? It's going to be a date, not a number, not
characters. So we're going to go with the data type date. And about the constraint, well, it depends. I would say in
our application it is optional, because this is very personal information, and maybe some persons will
not provide their birth dates. So this is optional, and I will not say it is not null. So nulls are allowed. Now
let's move on to the next one. It's going to be the phone. So what is the data type of a phone? Well, a phone number
can contain digits, characters, and special characters. So we could have anything. That's why I'm going to go
with varchar. And here you can specify the length that you think is okay. I'm going to go with 15. Now of
course it depends on the system that you are building. I would say the phones are very important in order to validate
whether this is a real person. So we're going to say not null. So we are not allowing nulls in this field. Perfect.
So with that we have covered all the columns that are required. We have defined the data types and as well the
constraints. Now the last thing: each database table should have a primary key in order to make sure the table has
integrity and can be connected to other tables. So now what we're going to do is
add the primary key constraint. We add a comma after the last column, and then we say constraint. Now we have to
give the primary key a name. This is only going to be visible to the database. So I'm going to call it PK, for primary
key, and then persons. After that we're going to say primary key, and between two parentheses we're going to
pick which column is the primary key. Of course, it's going to be the ID. So we're going to go over here and
say ID. So again, we are saying there is a new constraint. This is the name of it. It's only internal to the database.
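Putting the column definitions and the named primary key constraint together, the statement might look like the sketch below. The exact spellings id, person_name, birth_date, phone and pk_persons are assumptions; the walkthrough only gives the spoken names.

```sql
CREATE TABLE persons (
    id          INT         NOT NULL,   -- required: no person without an ID
    person_name VARCHAR(50) NOT NULL,   -- required, up to 50 characters
    birth_date  DATE,                   -- optional, so NULLs are allowed
    phone       VARCHAR(15) NOT NULL,   -- required, up to 15 characters
    CONSTRAINT pk_persons PRIMARY KEY (id)
);
```

Naming the constraint explicitly is a design choice: if the name is left out, the database generates one, which is harder to refer to later when the constraint has to be changed or dropped.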
And then we are saying this one is a primary key on the field ID. So that's it. With that we have defined a primary
key for our table. Let's go and execute it. As you can see, it is successful. Let's go and check our database for our
new table. If you don't see it already, you have to right click on the database and then refresh. So
let's go to tables, and now we have a brand new table called persons. With that we have created our new table. Now
of course for DDL commands you will not get results or data. All you're getting is a message from the database,
and the message says here that the command completed successfully, and then we have the time when it completed. That
means a DDL command does not return data. It is changing the structure of your database. It's not about retrieving
any data. So this command did change something in our database; in this scenario it created a new table. And
that's why we call this data definition language, DDL: because we are defining the database. Now of course, if you
say select star from our new table persons, highlight it, and then execute it, you will see we are
getting of course the columns, the ID, the person name, birth date, and the phone, but we don't have any rows. That
means our table is empty. Now it is very important that you save this definition in an SQL script,
because maybe later you have to redefine this table. But let's say that you have written different queries and you have
lost the script, and now you would like to see the create statement for this table again. Well, there is a trick for
that. If you go to the left side, you see the persons table right here. Right click on it, and then you have here
script table as, and now we have different options that you can run on the table. The first one says create to. Then
let's go to new query editor. So now what happened? The database did read the metadata
about the persons table and created your DDL query with a lot of extra stuff that we haven't written. But this is
the template that the database uses. So now we can see a lot of stuff, but what is interesting is this create table. So
we can see create table, the schema dbo, the default one, then persons, and then we have our columns, the data types,
and as well the constraints. So with that you got back your DDL statement, and many other things about the table
which are not interesting right now. What I really need is to see the create statement for this table. So this is
how you can get back your DDL command. But of course what I recommend is to always put your code inside a Git
repository and always keep it up to date, so that you can always check your work and extend
it. Okay. So now what else can you do with the structure of your database? If you already have a table, what you can
do is edit and change the definition of the table. For example, let's say I would like to add a new
column. In order to do that, we can use the command alter. Alter means you want to edit the definition of your table and
change it, like adding a new column or maybe changing a data type, or anything else in the definition of the
table. So you can use the alter command in order to change the definition of your table. And now let's go back to
SQL and try to change something. All right. Now the task says: add a new column called email to the persons
table. It is very simple. We can use the alter table command. So we are not creating a new table; we
want to edit an already existing table. Which table do we want to modify? It's going to be persons. So we are telling
SQL we want to change something in the table persons. And of course we have to tell SQL what we want to change. Are we
removing a column? Are we adding a column? In this scenario we want to add a new column. So let's go and add the email
information. This is the column name, and just as when you are creating a table, you have to define the column name,
the data type, and the constraint. Now for the emails we're going to have characters, numbers, special characters. So
we're going to go with varchar, and about the length, let's say 50. And I'm going to say each person has to
have an email, so it's going to be not null. With that we are adding a completely new column. So that's it.
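The statement described above can be sketched as follows in SQL Server syntax, assuming the column is spelled email. One caveat worth knowing: SQL Server only accepts a new not null column without a default while the table has no rows, which is the case for the freshly created persons table here.

```sql
-- Add a required email column to the existing table.
ALTER TABLE persons
ADD email VARCHAR(50) NOT NULL;

-- Quick check: the new column appears as the last column.
SELECT *
FROM persons;
```

If the table already contained rows, one option would be to add the column as nullable, or with a default value, so that the existing rows have something to hold.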
Let's go and execute it. Now again this is not a query. This is a DDL command, and in the output we will not get data.
We will get a message saying whether everything went correctly. It says command completed successfully and the time when
it completed. Now we can do a simple query just to check the table. And now you can see we
have our columns, and at the end we have a new column called email. This is very important: if you are adding a new
column, it's always going to be at the end of the table. But now you might say, you know what, I would like to have the
email somewhere in the middle, maybe after the person name. Well, in order to do that, you have to completely
drop the table and create it from scratch using the create command, which might be bad if you have data inside the
table. So if you are fine with adding your new column at the end, you can use alter table. But if you say I would like
it in the middle, then sadly you have to drop everything and start from scratch. Okay. So now let's have
another task, and it says: remove the column phone from the persons table. So now we're going to do exactly the
opposite. We're going to remove it completely, with its data, from the table. We're still going to say alter
table persons: we are saying we want to edit the definition of the table persons. And now instead of adding, we
will be dropping a column. After that we have to specify the column name as well. It's going to be
phone. But we don't have to mention the data type and the constraint again, and that's because the database already
knows this information. We need it only if we are creating something new. That's why we can leave
it out. We just need the column name, and the database is going to do the rest. So let's go and do that. Now you
can see it is successful. Let's go and check our table. And now as you can see, we have the ID, person name, birth date,
email, and we don't have the column phone. Be careful: if you are deleting a column, you will be losing all
the data inside this column as well. So as you can see, this is very simple. This is how we can edit the definition of
our table by adding and removing columns. Okay, now moving on to the last one in this group of commands. So far,
what we have done, we have created something new in the database. We have changed the definition of something
inside our database. And now the last one, you can go and drop something from the database. Let's say we have another
table and we don't need it anymore. So we can go and use the drop command in order to remove the table completely
from the database. And this means as well removing everything the table and the data inside it. So now let's go to
SQL and let's drop something from our database. Okay. So now our task says: delete the table persons from the
database. This is the simplest form of command in SQL, yet the most risky one. So what do we need? We have to
drop the whole table persons. We don't need it anymore. We're going to say drop table, and then all we have
to do is give the name of the table, persons. So three words. You don't have to specify anything else. Just destroy the
table persons. Let's go and execute it. It is successful. As you can see, it is very simple. Now on the left side, in
your database, refresh, go to the tables, and you will not see the table persons. So the drop command is very
simple but yet very risky. If you compare create table with drop table, you can see destroying things is
way easier than building them. Those are the commands create, alter, drop: the commands we use in order to define the
structure of our database, the DDL commands. That was very simple. All right, so that's all about the data definition
language, DDL, and with that you have learned how to define new things in your database. Now moving on to the next one,
we're going to learn about the data manipulation language and here we're going to learn how to manipulate our
data inside the database. Let's go. All right, so now what we're going to do is modify and
manipulate the data inside the database. Sometimes you have a table inside your database
and the table is empty. You don't have any rows, any data, inside the table. Now in order to add your data to the table,
you can use the command insert. Insert is going to add new rows to your table. And of course the table does not
have to be empty for you to add your data. You can add new rows to already existing data, and SQL is going to
append them at the end of the table. Now my friends, in order to insert new data into the target table there are
two methods. The first and classical way to insert new data is to use the insert command and manually
specify the values that should be inserted into the table. So you specify the values in the script, and then they are
inserted as new rows into the target table. In this process you are manually inserting new values into the
table using an SQL script. So now we're going to focus on this scenario, how to insert data. All right. Now let's
quickly check the syntax of the insert command. It starts with the keyword
insert into, and after that we have to specify the table name, so where we want to insert. Then we
specify the list of columns that we're going to insert values into. And
after that we say values. And finally we specify the data that should be inserted into the
table, and we write it as a list as well, like we have done for the columns. Now in the insert statement, specifying
those columns is optional. If you don't specify the columns of the table, then SQL is going to expect you to
insert a value into each column. Sometimes, of course, we don't want to insert a value for each column; then you list
only the columns you want to fill and skip the rest. But if you want to insert a value for each column, you can either
specify them all as a list or leave the list out. Now for the insert statement there is a very important rule: the
number of columns and values must match. So if you specify three columns here, then you must insert exactly three
values as well. This must match. And one last thing about the syntax: you can insert multiple rows in one go. For each
row you specify a list of values that must be inserted. So that's all about the syntax. Let's go back to SQL in order
to practice the insert command. Okay. So now let's go and insert new customers. It's very simple. It starts with
insert into. So we
are saying we want to insert data into a table, so we have to specify the table name, customers. After that we have
to specify the list of columns that we want to insert data into. What we can do is check which columns
we have inside our table. We can see we have ID, first name, country, score. And we can make a list of that:
ID, first name, country, and score. So we just have a list of all the columns inside our table customers. Now
what do we need? We need the values, so which data should be inserted. We open a pair of parentheses, and now we
have to specify an ID. We know the last customer was five, so we're going to go with customer six. Now we have to
give the name of the customer. Let's go for Anna. And then a country. Let's go for USA. And this customer has no
score. So what can we do? We can say null. We don't know the score of this customer; null means nothing, we don't
know. With that you can insert one row. But now let's say that I would like to insert a second row,
one more customer. We can separate the rows with a comma, and then we can repeat the whole thing again.
So the ID is seven. Next, let's call this customer Sam, and we don't know the country of this customer, so we're
going to say it's null. But the score we know already: it is 100. So as you can see, we are adding a value for each
of those columns. And if you don't know the answer, then make it null, if the database allows it to be null. Some
columns are not allowed to be null, like the primary key. So if you say null over here, the database will not
allow it. Well, actually we can test it. Let's execute. And you can see: you cannot insert the value null into
the column ID. So this is not allowed. We're going to put the seven back. But for the other columns it is allowed; you
can check the definition of the table. Now we execute. The output of a modification command is always going to
indicate what happened to the data. It says two rows affected. Affected might mean inserted, updated, or deleted, so you
get a general statement from the database, but you are told how many records are affected. We got two
because we have inserted two records. So as you can see, it's not like a query. We are not getting any data in
the output. We are just getting a message. This is a big difference between querying the data using
select and modifying the data using insert. We are now making direct modifications to the data inside our
database. Of course, if you want to see the data in customers, we can query the data, right?
So, let's go and do that. Select star from customers. I would like to see the whole table. So, mark it and execute it.
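The two-row insert and the follow-up check described above might be written like this. The values come from the walkthrough; the column spellings id, first_name, country and score are assumptions.

```sql
-- One INSERT statement, two rows: each parenthesized list is one row,
-- and each list has exactly as many values as there are listed columns.
INSERT INTO customers (id, first_name, country, score)
VALUES
    (6, 'Anna', 'USA', NULL),   -- score unknown
    (7, 'Sam',  NULL,  100);    -- country unknown

-- The INSERT only reports the number of affected rows,
-- so a separate query is needed to see the data.
SELECT *
FROM customers;
```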
Now, you can see we have seven customers. So, we just manipulated our data. We have here Anna and Sam. This is
how you can insert data into the database. Now, there are a few rules you have to be careful about as you are inserting
new data into your tables. You have to pay attention that the order of the columns that you have defined in the insert
matches the values that you are inserting over here. Let's have an example. I'm going to remove this
over here, and let's say that we are inserting a new one, number eight. Now in the first name, instead of the name of
the customer, we have inserted the country, like USA, and in the country we have inserted the name. It's just a mistake,
and we are all human, right? So let's have a name like this, Max. Now if you execute it, the database will accept
it, because it is really hard for the database to understand that you have made an error here. Both of them are varchar,
and the database doesn't care about the content of the data as long as you are following the rules of the data type. So
now if you select the data from customers, you can see we have a customer called USA from the country
Max. So SQL is going to do it blindly: it inserts the data as long as you are following the data type rules and
the constraints. But for example, if you made this error over here and you say the ID is Max and the first
name is, let's say, nine, and you execute it, here the database is smart enough to say, you know what, there is something
wrong: the ID should not be a string. So the database is going to reject your insert. Be careful about the order of your
columns. Now let's go and query our table again. Now, if in the insert command you are defining all the columns
exactly like the table, so as you can see we have a complete match here, ID, first name, country, score, all the
columns and in the correct order as well, there is a lazy way. You can remove the whole column list over here, and
with that the database understands that we are inserting values into all of the columns, in the
order in which the table defines them. So let's go and do that correctly: nine, and here let's
say we have a customer from Germany. If you execute it, it will work even
though we didn't define the columns, and that's because the values that we are inserting match the number of
columns of the table and follow the rules as well. Now moving on to the next one: you can list only two
columns in the insert. If you already know that the country and the score are going to be null, and you know only two
pieces of information, the ID and the name, then you don't have to say null, null, and so on. We can skip that. Okay. So
now let me show you what I mean. We go after the table name and define only two columns,
the ID and the first name. That means we are telling SQL we want to insert into only two columns. And now you have to be
careful: if you define two columns here, then you have to provide two values as well. So we're going to remove the
country and the score, and we add only two pieces of information: 10, and for example Sara. So
if you execute it, it will work. And now what is SQL doing with the other two columns? They are going
to be null. Let's select again from our table. You can see here that Sara has null in the country and as well
in the score, because we didn't define that information. But be careful: you cannot skip a column that is not
allowed to be null. So you always have to have in your list all the columns that are not null. For example, I cannot
insert only the first name. I will get an error, because the database will try to insert a null in the ID, and
this is not allowed. So you can skip only nullable columns. All right, my friends. So that
was the first method for inserting data into your target table: as you saw, by manually typing the values inside an
insert command using values. And now let's move to another method. We're going to insert data, but this time not
manually. We're going to insert data using another table. So imagine we have the following scenario. We have an
already existing table with data, and this is going to be the source table, the source of your data. And we have another
table. This table is empty, and we want to insert new data into this target table. Now what we can do is take
the data from the source table and insert it into the target table without manually writing the script for the
values. So we are copying the data from one table to another. In order to do that we need two steps. In the first
step we have to write an SQL query, using select, from, and so on, in order to select the data that we need from the
source table. Once you do that you will get a result. This is like doing a normal query: you write select and you
get an answer with the results. In the next step we can take this result and use an insert
command in order to insert it into the target table. And with that we have copied the data from the source
table to the target table. So first write the query on the source table, and in the second step use an insert to move
the result to the target table. So let's go back to SQL in order to do that. So now we have the following
task and it says insert data from the table customers into the table persons. So that means the source table is the
customers and the target table is persons. Now how I usually do it is that I keep my eye on the target table to
understand the structure of this table and I start writing the query from the source table. If you go to the left
side, we can see okay, we have here an ID. We have here person name, birth date and phone. And you can see only the
birth date accepts nulls and for the rest we always have to provide values. So with that I now have an understanding of
the table persons. Now next I'm going to go and start writing the query from the source. So we start like this. Select
star from our table customers just to have an overview of our table. Now the next step we're going to go and design a
perfect result from this query that is matching the target table. So in the output we need ID and we have it from
the customer from the original table. We're going to go and select ID. Okay. So now next we need a person name and
here we have from the original table something called first name. So this is a perfect match. So we're going to go
and select this column as the second column. So we have covered the first two. Then the third one is going to be
the birth date. Well, my friends, we don't have birth dates, but the database can accept it as a null. So, I'm going
to go and write a null because I don't have such information from the source table. And now the next one is going to be
the phone. Here as well, we don't have phone information. But we cannot have it as a null because it says here NOT NULL. So,
what we're going to do, we're going to go and add a static value, a default value. So, we're going to have two
single quotes and in between we're going to say unknown. Since the column is a VARCHAR, it can accept this word. So, now let's go and
just query. So we have the ID, we have the first name, the birth date is empty, and the phone is unknown. Now you might
say, but the column name is not matching with the column name of the persons. Well, the database does not care about
that. As long as the result of the data is matching the table, it can go and insert it. So the database will never
compare the column names together. But if you like and go and add here like the aliases exactly like the target table it
will not hurt but it has no effect on the results. All right. Okay. So now we have a SELECT query and we have a
result, but this is not an insert. So how are we going to insert the result of this into the table persons? Well for
that we need the INSERT INTO command. So INSERT INTO and now we have to specify the target table, which is going to be
persons. And of course you can go and list all the column names, but if you have an exact match you can skip it.
But for me, I would always like to add it just to make sure that we don't have any issue. So the ID, person name, birth
date and the phone. So that's it. Let's go and execute. So it is working now. We can
see 10 rows affected. Well that means 10 rows are inserted from the table customers into the target persons. And
now what we can do we can go and query the table persons just to check that everything is working perfectly. Select
star from persons and let's go and execute. And with that you can see our 10 persons that we have added from the
customers. So with that we have copied the data from one table and inserted it into another table. And as you can see
it was very simple. First you have to write a query from the source table in order to collect the data that you need.
and then you go and insert it into the target table. So this is really nice and easy and this is another way on how to
insert data into your database. Okay, so with that we have learned how to insert data to our
tables. Now let's say that I don't have something new. I don't have any rows to be added to my table but I have an
update. I would like to go and change the content of the already existing rows. So what can we do? We can use the
command UPDATE in order to change the content of already existing rows. So again my friends, INSERT is going to go and
insert completely new rows but UPDATE is going to go and change the data of already existing rows. Now let's have a
quick look at the syntax of UPDATE. It starts with the keyword UPDATE and then we have to specify the
table name and after that we're going to use SET in order to specify the new values for the columns. So you have
to write down a new value for each column that you want to update and you separate the columns of course using a
comma. Now after that we have to specify as well a WHERE condition. So it's like the queries: you say WHERE and then you
write a condition and if you don't do that and you don't use the WHERE clause, what is going to happen? You will end up
updating all the rows inside your table. So that's why we always need the WHERE clause. All right. So that's all about
the syntax. Let's go back to SQL in order to update our data. Okay. So let's have the following task and it says
change the score of customer 6 to zero. So that means we have to go and modify the data of the customer ID equal to
six. So now first I would like to go and have a look to our data. So select star from customers and now the task is
targeting this customer over here and we would like to replace the null to zero. Now how we can go and update this
information inside the table? We can use the update command. So what we going to do? We're going to start writing update
and after that we have to specify the table name. So what we are updating? We are updating the customers and then
we're going to tell the database to set the value of the score to a zero. So we would like to update and change the
value from null to a zero. And now here comes something very risky. Don't execute this query yet. If you do that,
what's going to happen? The database is going to go to the table customers and replace the scores of all
customers with zero. So it's going to go and update the whole table and this is of course very risky. That's why in the
UPDATE command we have to give a WHERE condition, a filter, in order to target only the specific row or rows that you
really want to modify. In this case we want to change only one row. So what we have to do is to go and specify the WHERE
condition like we have done in the SELECT query. Nothing new, right? So we're going to say where the customer ID
is equal to six. And with that SQL will not go and update everything. First it's going to filter the data and then
updates. And now before I execute just to make sure I go and check which data going to be affected. So it's very
simple you go and select star from table customers and then I go and take the exact where and put it in my query and
then I select the whole thing and execute. And now if this query gives me the data that should be modified then
I'm doing the update command correctly. And in this case we are targeting only one customer. This is the customer
number six. And with that I feel really confident with my update. So what we can do since I'm going to use this later I'm
going to put the whole thing in a comment and if I execute now only the update going to be executed. So let's go
and do that. Now it is very important to check the message. You can see one row is affected, which is really good because if
I see here 10 rows are affected that means everything is updated. Now let's go and check the data. I'm going to go
and remove the WHERE here and check the whole table. Now you can see we still have the old scores; only Anna has now
score zero instead of null. So this is how I usually update the data. You have to do it very carefully. Now let's move
to another task. It's going to say change the score of the customer number 10 to zero and update the country to UK.
So now this time we are targeting the user number 10. As you can see she doesn't have the country and score. And
the task wants us to change the score to a zero and the country to UK. So now how we going to do it? We're going to use
the exact same command but with a different condition. So the ID this time is equal to 10 and the score is set to
zero. But now we have to change the country as well. Now if you want to update multiple columns, you're going to have
here a comma after the score and a new line and let's say country equal and then we're going to add UK. So select
the whole thing and let's go and execute. So again it is affecting only one row. This is really good. And if you
go and check the table and search for Sara, you can see in one update we have updated two columns, the country and as
well the score. So with that we have solved the task. It's very simple. Now moving on to the next task. It says
update all customers with a null score by setting their score to a zero. So this time we are not speaking about one
specific customer. We are talking about updating the data for a subset of customers. So now imagine you have like
hundreds of customers and you are making one update command for each customer. It's going to be a real waste of time.
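As a sketch of the two single-customer updates walked through above: the column names id, country, and score are assumptions (the transcript only says "ID", "country", and "score"), and the table is the course's customers table.

```sql
-- Step 1: preview the row the UPDATE would touch by reusing its WHERE clause.
SELECT *
FROM customers
WHERE id = 6;

-- Step 2: change the score of customer 6 to 0.
-- The message should report exactly one affected row.
UPDATE customers
SET score = 0
WHERE id = 6;

-- Updating two columns in one statement: separate the assignments with a comma.
UPDATE customers
SET score = 0,
    country = 'UK'
WHERE id = 10;
```

Without the WHERE clause, either UPDATE would overwrite the columns for every row in the table.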
Now instead of that we can specify a condition that targets multiple customers and we're going to do the
update for those customers in one go. So now let's see how we're going to do it. We are talking only about replacing the
nulls with a zero. So we don't need the country. So set score equal to zero. But now we will not be specific for the ids.
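The statement this task is building toward can be sketched as follows, again assuming a column named score on the customers table. The filter is written with IS NULL because a comparison such as score = NULL never evaluates to true.

```sql
-- Preview the subset first: every customer whose score is missing.
SELECT *
FROM customers
WHERE score IS NULL;

-- Then update that whole subset in a single statement.
UPDATE customers
SET score = 0
WHERE score IS NULL;
```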
Now we have to make a new condition. It's going to say like this: where score is null. Now of course in the course we
have a full dedicated chapter about the nulls and here all we are doing is we are searching for scores that are
equal to null. But we cannot write an equal sign, we have to write it like this: IS NULL. Of course before we update
anything we have to go and test it in a query. So select star from customers where score is null. Let's go and
execute. Now as you can see we have two customers where the score is null. So that means this condition is targeting a
subset of customers and we're going to do now the updates for multiple rows for this subset. So that means we can run
this query. Let's go and execute it. Now you can see two rows are affected. So that means multiple rows got affected
got updated. So now if you go and query our table customers you can see we don't have any nulls inside the scores and we
have replaced all the nulls with a zero. And of course you can do the same thing. You can go and make an update command in
order to replace all the nulls in the country to maybe something unknown or any default value that you want. So this
is how you can update multiple rows in one go. All right my friends. So with that
we have learned how to insert new rows to our tables and as well how to update the content of already existing row. Now
the last thing or command that we can do to the data inside the table is that we can go and remove rows from our table and we
can do that using the command DELETE. So if you use DELETE, SQL is going to go and start removing already existing rows
inside your table. All right. Now for the syntax of the DELETE it's going to be very simple. We're going to say
DELETE FROM and then we're going to write the table name. And here comes something very important. We have to add
a WHERE condition. And it's like the update. If you don't do that, if you don't include a WHERE condition, what is
going to happen? You will end up deleting all the rows inside the table. So the syntax is very simple. Let's go
back to SQL in order to delete some data. Okay. So now we have the following task. Delete all customers with an ID
greater than five. So now we have to go and delete all the customers that we recently added. So how we going to do
it? It's very simple. We're going to say delete from. So that means I want to delete something from a table. And we
have to specify the table name. It's going to be the customers. So the syntax is very simple. Now my friends, this is
more risky than updates because if you execute it like this, don't do that yet. Wait, what's going to happen? All the
data of the customers is going to be deleted. So you will get an empty table and we will not do that. So now we're
going to do exactly like the update command. We're going to specify the WHERE clause. So it says the ID should be
greater than five. So that means ID higher than five. So with that we are defining a subset of the data that
should be deleted, not everything. And as we did with the updates, we have to do a double check here before deleting
anything. So again what we do, we select star from table customers and we're going to go and copy the WHERE condition
in order to test what is going to be deleted. So it's going to be all the customers with an ID higher than five. And
with that I'm making sure that my delete command is correct, which from what I see here it is. So those five
customers should be deleted. So now let's go and delete those customers. And now very important to read the message.
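The check-then-delete routine described above can be sketched like this, assuming the ID column of the customers table is named id:

```sql
-- Preview what would be deleted by reusing the WHERE clause in a SELECT.
SELECT *
FROM customers
WHERE id > 5;

-- Delete only that subset. Without the WHERE clause every row would be removed.
DELETE FROM customers
WHERE id > 5;
```

After the DELETE runs, the message returned by the database reports how many rows were removed.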
It says five rows affected. So that means five customers got deleted. And this is better than 10 of course. So
let's go and check what customers left. So we have 1 2 3 4 5. Those are the original customers. And everything else
got deleted. And with that we have solved the task. And this is how we can delete data from tables. Be very
careful. Always test before doing the delete command. Okay. So now we have the following task. And it says delete all
data from table persons. So that means we have to go and remove everything from the table persons. But we don't want to
delete the table. We just want to delete the data inside the table. So now what we're going to do, we're going to
write DELETE FROM. And now we have to specify the table persons. And if you execute it, what's going to happen? SQL
is going to go and remove all the data in persons. But in SQL, we have a more interesting command. If you want to
delete everything from the table persons, we have TRUNCATE. It has the same effect as DELETE FROM
persons. It's going to go and make the whole table empty. But why do I like to use TRUNCATE? Because it is way faster than
DELETE. If you have large tables, the DELETE command is going to be really slow because with the delete there are a
lot of things happening behind the scenes. There is logging for every row that gets removed. But if you are using TRUNCATE,
the database is going to skip most of that extra work and it's going to be very fast. So if you want to delete all the
data from a table, you can do it like this if it's a small table. But what I usually do, I go and write TRUNCATE and
then TABLE. We're going to get the same effect and with that I'm saying reset everything, make the table empty. So
let's go and execute it and now with that you will not get the number of deleted rows and that's why
TRUNCATE is way faster. It is not logging each deleted row one by one; it just goes and removes all the
data without those extra steps. So this is how we can delete all the data from a table but the table still exists. Okay
my friends, so with that you have learned the basics on how to manipulate your data inside the database the data
manipulation language DML and with that I can tell you we have covered the basics of SQL. So with that we have
covered the beginner level. Now in the next chapters we will be in the intermediate level and the first thing
that you're going to learn in the intermediate level you will learn how to filter your data and we're going to
cover many operators that you can use inside the WHERE clause. So let's go. All right. So now let's have an overview
of all the different operators in SQL. So for the first group of operators we have the comparison operators. They are the
easiest ones, where all we have to do is to compare two values, and we have six different variants of how to
do that. Now to the next one, we have the logical operators. We use them in order to combine multiple conditions. And moving
on to the next one we have the range operator. Here we have only one, BETWEEN. We're going to use it in order
to check whether a value falls within a specific range. Now moving on to the next one, we have the membership
operators. And here we have two things. We have the IN operator and NOT IN. Here all you have to do is to check
whether a value is in a list or not. And the last category that we have is the search operator. And here as well we
have only one operator, LIKE. We use it in order to search for a specific pattern in a text. So my friends, we're
going to go through all those operators one by one. Okay. So now let's go and deep dive into the first category, the
comparison operators, and we're going to cover all of them. So what exactly are comparison
operators? It is very simple. We want to compare two things and there
is a lot of things that we can compare in SQL. But the formula for that going to be always like this. So we have the
first expression and then operator and then we have another expression and this going to form something called
condition. So here we have a lot of variants. We can compare one column to another column. So for example, you can
go and compare the first name with the last name. So both of the expressions are columns here. Another scenario, you
want to compare a column with a value, a static value. Like for example, you say the first name must be equal to a value
like John. So now we are comparing a column with a value. It's not anymore two columns. Now we have another
scenario where we want to apply a function to a column and then compare the results to maybe a value. So for
example, we apply the upper function to the first name and then this must be equal to a value like John with all the
letters in the uppercase. And one more thing that you can compare you can write an expression in one of the sides like
for example you can say if we multiply price with the quantity it must be equal to 1,000 for example. So here we have an
expression. We have multiple columns included in one sides and the output of this expression must be equal to 1,000.
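The condition shapes described so far can be sketched as follows. The table names people and order_items and the column names are hypothetical, chosen only to mirror the spoken examples (first name, last name, price, quantity).

```sql
-- Column compared with another column.
SELECT * FROM people WHERE first_name = last_name;

-- Column compared with a static value.
SELECT * FROM people WHERE first_name = 'John';

-- Function applied to a column, then compared with a value.
SELECT * FROM people WHERE UPPER(first_name) = 'JOHN';

-- Expression over several columns compared with a value.
SELECT * FROM order_items WHERE price * quantity = 1000;
```

In each case the pattern is the same: an expression, an operator, and another expression, which together form one condition.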
And now the last one is going to be a little bit more advanced and we're going to cover that of course in other
chapter. We can include a whole query the complete query to one of the sides and we call this a subquery. So in one
of the sides you're going to write a whole query select from where whatever you want and you go and compare the
result of this query to for example a value or a column. So as you can see in SQL we can compare a lot of things
together. Either comparing the columns together or a column with a value or we use a function or an expression or even
a whole query. So this is how we build conditions in SQL. Okay my friends. So let's see how conditions work in
SQL. So we have our data, the name, the country, the score, and let's say that we have built a condition where it says the
country must be equal to the USA. So this is a very simple comparison operator and this is the condition that we are
using inside the WHERE clause. So once you apply this filter to your data, what is going to happen? SQL is going to go row by
row evaluating whether it is meeting the condition. If it's not fulfilling the condition then SQL is going to remove it
from the results. But if it is fulfilling the condition it's going to keep it. So now we are comparing the
values of a column with a static value, the USA. So we're going to compare whatever value we get from the country
with the USA. So now let's see how SQL is going to apply this filter to our data. For the first customer, Maria,
you can see the value inside the country is Germany. So SQL is now going to go and compare Germany to USA. Since it is not
equal, SQL is going to understand, okay, Maria is not fulfilling the condition. So it is false and SQL is going to go and
remove this customer from the results. So she is not fulfilling the condition. Moving on to the next one, John. Now SQL
is going to take the value inside the country, the USA, and it is equal to USA. So that means John is fulfilling the
condition and SQL is going to be happy about it. So it is true and this means SQL is going to keep John in the final
results. Now moving on to George, the value is UK, not equal to USA. He is not fulfilling the condition. SQL is going to go
and remove him from the final result. Same thing for Martin. Germany is not equal to USA. SQL is going to remove this
customer as well. And for the last one, Peter, you can see the value is USA. So USA equals USA. The condition is
fulfilled. SQL is happy about it and is going to leave the customer in the output. So now if you go and apply this
condition using the comparison operator to your data, only two customers are going to be left in the output. This is exactly
how the conditions and the comparison operators work in SQL. Okay. So now let's start with the first operator.
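Before going through the operators one by one, the row-by-row walkthrough above can be reproduced with a small sketch. The table name customers_demo and its column names are hypothetical; the five rows use the names, countries, and scores mentioned in the walkthroughs.

```sql
-- A throwaway table holding the five customers from the walkthrough.
CREATE TABLE customers_demo (
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

INSERT INTO customers_demo (first_name, country, score) VALUES
    ('Maria',  'Germany', 350),
    ('John',   'USA',     900),
    ('George', 'UK',      750),
    ('Martin', 'Germany', 500),
    ('Peter',  'USA',       0);

-- The condition is evaluated for each row; only rows where it is true are kept.
SELECT first_name, country, score
FROM customers_demo
WHERE country = 'USA';
-- Keeps John and Peter; Maria, George, and Martin are filtered out.
```

The condition here uses the equality operator, which is also the first operator covered next.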
It's very simple. We have the equal operator. It checks if the two values are equal. That's very simple. Let's
have an example. Okay. So now we have this task. It says retrieve all customers from Germany. So this is very
basic. We're going to go and select and we're going to select all the columns, since we don't have any specifications,
from the table customers. And if you go and execute it, you will get all the customers. But we don't need that, only
the customers that come from Germany. So we have to go and apply a condition using the WHERE clause: country equal to
the value Germany. So make sure you are writing it exactly like in the database otherwise it will not work. So let's go
and execute and with that we are getting only the customers from Germany. So it is very simple and this is why we use
the equal operator. Okay. So now moving on to the next one again very simple. If you want to check if two values are not
equal we can use the not equal operator. So let's have an example. Okay. So now let's have the opposite task. It says
retrieve all customers who are not from Germany. So this is very simple. We are saying here who are not, so they are not
equal to Germany. So we can use the not equal operator in order to get these customers. So with that as you can see
after executing we are getting all the customers whose country is not equal to Germany. And there's another way
to write the not equal; doing it like this we'll get the same results. All right my friends, moving on to the next
one. We can check if a value is greater than another value. So we use the greater operator. Let's have an example.
Okay. So now the next task it says retrieve all customers with a score greater than 500. Now we want to filter
the data based on the score. So we're going to say where score and now the task says greater than 500. We're going
to use the operator greater than 500. It's very simple. So with that we will get only the customers where the score
is higher than 500. So for example Maria is not fulfilling the condition. The same thing for Peter and as well for
Martin; it must be greater than 500. So if you go and execute it you will get only those two customers because their scores are
greater than 500. Okay, moving on to the next one. This time we're going to check if a value is greater than or equal to
another value. So it is like mix between the greater than and the equal. If one of them is fulfilled then the value
going to meet the condition. So let's have an example for that. Now, if the task says retrieve all customers with a
score of 500 or more, this time we're going to go and include the customers whose score is equal to
500 as well as higher. So, we're going to have a similar condition based on the score and the value 500, but this time we're
going to say greater or equal to 500. So, if you go now and execute it, this time we're going to see the customer
Martin with the score of 500. So, in this scenario, we're going to use greater or equal. All right. So
now let's keep moving. The next one is as well very simple. We're going to check this time if a value is less than
another value. So we're going to use the less operator. Let's have an example. Now moving on to another simple task.
Retrieve all customers with a score less than 500. So this time we want all the customers with a lower score. And we're
going to use exactly the opposite. It's going to be the score is less than 500. And again here it is not equal, right?
So if you go and execute, you will get all the customers with low scores. You will not get Martin because Martin's score is
equal to 500. So with that we have solved the task. We have all the customers with a score less than 500.
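The comparison filters from these tasks can be sketched together as follows, assuming the customers table has columns named country and score. The last query uses less than or equal, the one remaining comparison operator, which is introduced right after this point.

```sql
SELECT * FROM customers WHERE country = 'Germany';   -- equal
SELECT * FROM customers WHERE country <> 'Germany';  -- not equal
SELECT * FROM customers WHERE country != 'Germany';  -- not equal, alternative spelling

SELECT * FROM customers WHERE score > 500;   -- strictly greater: a score of 500 is excluded
SELECT * FROM customers WHERE score >= 500;  -- greater or equal: a score of 500 is included
SELECT * FROM customers WHERE score < 500;   -- strictly less: a score of 500 is excluded
SELECT * FROM customers WHERE score <= 500;  -- less or equal: a score of 500 is included
```

The `<>` form is the standard spelling of not equal; `!=` is accepted by most major databases and returns the same rows.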
Okay my friends, now moving on to the last one. I think you already got it. So we're going to check whether a value is
less than or equal to another value. So you can go and combine the less operator together with the equal and if one of
them is fulfilled then the value going to meet the condition. So let's have an example for that. This time we are
retrieving all customers with a score of 500 or less. So the query going to be very similar but we are saying it is
less or equal to 500. So we are including the value in our condition. And with that as you can see we still
have our two customers where we have the score less than 500 but we have now as well Martin with a score of 500. Okay my
friends. So with that we have covered the first group the comparison operators. Now we're going to move on to
the next group. We're going to speak about the logical operators and here we have three: AND, OR, NOT. So let's start
with the first one. What exactly is the AND operator? Okay. So now what is the definition of AND? It says all
conditions must be true. So all the conditions that you have in the WHERE clause must be true in order to keep the
row in the results. So let's understand what this means. Things are going to get more complicated, where you can have not
only one condition but you might have multiple conditions in your query. So here we're going to add a second
condition where we're going to say not only the country must be equal to USA but also the score must be higher than
500. So now you have two conditions and you have to put them in the WHERE clause. Now you have to combine those conditions
using a logical operator and here we have two options, two operators: the AND operator and the OR operator. In this
scenario, if you say AND then SQL is very restrictive. Both of the conditions must be true in order to keep the row in
the results. So now let's see how this going to work. Now for the first row and for the first condition you can see the
country is Germany and it is not fulfilling the first condition. So this going to be false. And as well if you
check the second condition for the first row you can see the score is 350. So that means this customer is as well not
fulfilling even the second condition. So both of the conditions are false and SQL is going to go and remove this customer from
the results. Now to the next one, John. You can see John is fulfilling the first condition because the country is equal
to USA and as well fulfilling the second condition. His score is 900 and this is higher than 500. So now SQL is going to be
very happy about it because both of them are true and this is the only way to keep the row in the output
because we are using the operator AND. So John is going to stay in the output. Now moving on to George. He is not
fulfilling the first condition. But now the second condition is fulfilled. His score is 750 and this is higher than
500. So now it's like 50/50, right? On one side it's false but on the other side it's true. But this is not enough for the
AND operator. Both of them should be true in order to keep the row in the output. That's why SQL is going to remove
this row. Now moving on to Martin. He is fulfilling neither of the conditions. So SQL is going to go and remove him from the
results. And now for the last one. Peter is fulfilling the first condition, the country is equal to USA, but the second
condition is sadly not fulfilled, so we have the score zero, not higher than 500. Again we have the same scenario, it's
50/50 and this is not enough for the AND operator. That's why SQL is going to go and remove him. So as you can see, if you use
an AND operator a lot of rows are going to be removed if one of the conditions is not met. So the AND operator is very
restrictive: both of the conditions must be fulfilled to keep the row in the results. So this is exactly how the AND
operator works. Okay. So now we have the following task. Retrieve all customers who are from USA and have a score
greater than 500. So here we are like combining multiple conditions and let's go and do it step by step. So the first
thing that we have to go and select the data from the correct table. So select star from customers and with that we are
getting all the customers from the table. Now the first condition we need the customers that come from USA. So we
need only those two customers and in order to do that as we learned we can go and use the wear clause and the
condition going to be country equal to USA. So if you go and execute we will get those two customers. Nothing is new.
We have used the comparison operator equal. But we are not done yet. We have another condition. From those two
customers, we need only the customers whose score is higher than 500. So now by looking at those two customers
you can see that Peter here does not have a score higher than 500 and we don't want to see him in the
results. So now what we have to do, we have to go and write a condition for this one over here. So this is based
this time on the scores, not on the country. So the score should be greater than 500. Now as you can see we have the
first condition for the first requirement here and the second condition for the second requirement. Now the question is how to
connect those two conditions. So here we have two options, AND or OR, and to be honest this is very simple. The task says the
customer should fulfill both of the conditions: should be from USA and as well at the same time have a score greater than 500.
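Because both requirements have to hold at the same time, the two conditions are joined with AND. A sketch, assuming columns named country and score on the customers table, with the OR form shown only for contrast:

```sql
-- Both conditions must be true: from the USA AND a score above 500.
SELECT *
FROM customers
WHERE country = 'USA'
  AND score > 500;

-- For contrast: OR keeps a row when at least one of the conditions is true.
SELECT *
FROM customers
WHERE country = 'USA'
   OR score > 500;
```

A row that meets only one of the two conditions is dropped by the first query but kept by the second.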
So it is very simple: we need AND. So with that we have connected both of those conditions and if you go and query it
you will get only one customer that is fulfilling our conditions. So from all customers we have only one customer
that fulfills this condition, that comes from USA and at the same time the score of this customer is higher than
500. So this is how we use the AND operator in order to connect two conditions. Okay my friends. So that's
all for the AND operator. Let's speak now about the OR operator. All right. Now the OR operator
it says at least one condition must be true. So it is less restrictive than the and it is enough to have one condition
true in order to keep the row in the results. Let's understand exactly what this means. Okay. So now we have the
same scenario. We have two conditions and in SQL you have to connect them either using the and operator or the or
operator. In this scenario we're going to talk about the or operator. And as we said at least one of the conditions must
be fulfilled in order to leave the record in the results. So let's see what's going to happen here. Now the
first customer Maria she is not fulfilling the first condition and as well the second condition. So both of
them are false and this is the only scenario where SQL is going to remove the record from the results because it is
not fulfilling the minimum: at least one of them should be true. Both of them are false, so SQL is going to go and remove
this row. Now moving on to the next one, John. John is from USA and has a score higher than 500. Both of the conditions
are green. So both of them are true and this is more than enough to keep the row in the output. That's why we will see
John in the output. Now moving on to the third one, George. George is not fulfilling the first condition because
UK is not equal to USA. But George this time is fulfilling the second condition. So we have here true and since we have
at least one true, this is good enough to keep the record in the output. So you will see George in the results. Now
moving on to Martin. He is not fulfilling the first condition and as well not fulfilling the second condition.
Both of them are false and this is not enough to keep the row in the output. So that's why SQL is going to go and
remove it. Now moving on to the last one, Peter. He is fulfilling the first condition but not the second condition,
but still everything is fine because he is fulfilling at least one condition. So we have the minimum and SQL is going
to leave him in the output. So as you can see the OR operator is not restrictive like the AND operator. It's enough to
have one true in order to keep the data in the output. And this is exactly how the or operator works. Now let's see the
second task. Retrieve all customers who are either from the USA or have a score greater than 500. So it is a very
similar task. We have two conditions. We need the customers that are either from the USA, so the first one is based on the country
being equal to USA, and the second condition is that the score is greater than 500. But this time we are very relaxed: either
this condition is fulfilled or the second one. So instead of AND we will be using the operator OR. It is
enough to fulfill one of those conditions. And if you go and execute now, as you can see, we are getting more
results because it is easier to fulfill the conditions. So we can see those three customers fulfilling either the
first condition or the second one. All right my friends. So that's all for the OR operator, and we're going to move to
the last one in this group, the NOT. So what do we mean with the NOT operator? Okay. So now what is this operator NOT?
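Before turning to NOT, here is a minimal sketch of the OR task just solved, under the same assumptions as the walkthrough data (the table name, column names, and the use of Python's sqlite3 module are illustrative, not taken from the course material).

```python
import sqlite3

# Illustrative sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, first_name TEXT, country TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Maria", "Germany", 300), (2, "John", "USA", 900), (3, "George", "UK", 750),
     (4, "Martin", "Germany", 500), (5, "Peter", "USA", 0)],
)

# OR keeps a row when at least one of the conditions is true.
or_rows = [
    row[0]
    for row in conn.execute(
        "SELECT first_name FROM customers WHERE country = 'USA' OR score > 500 ORDER BY id"
    )
]
print(or_rows)  # ['John', 'George', 'Peter']
```

Maria and Martin fail both conditions, which is the only case OR removes.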
It is a reverse operator. It's going to go and exclude the matching values. So what exactly does this mean? Let's have a
very simple example. All right. So now the NOT operator is not like the OR and the AND. This operator will not go and
combine two conditions, so you can use it with only one condition. And let's say that our current condition is like
this: the country must be equal to USA. So this is a comparison operator. And if you apply it to your data, as we
learned, it's going to leave only two customers, John and Peter, because they fulfill the condition, and all other
customers will be removed because they don't fulfill the condition. So nothing crazy so far. But now if you go and
apply the NOT operator to the condition, what is going to happen? You're going to reverse the whole truth. So you are
saying if this condition is fulfilled, the row must be removed from the final results. So it is switching everything.
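The reversal can be sketched by running the same condition with and without NOT. The sample table, its column names, and the use of Python's sqlite3 module are assumptions for illustration. The second pair of queries shows that the same idea applies to any comparison, for example NOT score < 500 versus score >= 500.

```python
import sqlite3

# Illustrative sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, first_name TEXT, country TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Maria", "Germany", 300), (2, "John", "USA", 900), (3, "George", "UK", 750),
     (4, "Martin", "Germany", 500), (5, "Peter", "USA", 0)],
)

def names(where_clause):
    # Helper for this sketch only: return first names matching a WHERE clause.
    sql = f"SELECT first_name FROM customers WHERE {where_clause} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]

usa = names("country = 'USA'")            # rows where the condition is true
not_usa = names("NOT country = 'USA'")    # NOT flips true and false

not_less_than_500 = names("NOT score < 500")
at_least_500 = names("score >= 500")      # equivalent way to write it

print(usa, not_usa, not_less_than_500)
```

With this data, every customer lands in exactly one of the two lists `usa` and `not_usa`.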
We want to see the customers that are not fulfilling the condition. So now let's see what happens if you apply the NOT
operator together with the condition. We can see that the first customer is not fulfilling the condition, which is a great
thing. This is exactly what we want. We want the customers that are not fulfilling the condition. That's why SQL is going to be
happy about it, make it true, and leave it in the output. So Maria is fulfilling the whole thing: she
is not meeting the condition, so SQL is going to leave her in the output. Now for the next one. This customer is
fulfilling the condition, and that is not a good thing. So this time SQL is going to go and remove John from the results
because he is fulfilling the condition. And moving on to George. George is not fulfilling the condition, which is
amazing. That's why SQL is going to keep George in the output this time. The same thing for Martin. Martin is not
fulfilling the condition, so SQL is going to keep the customer. And Peter is fulfilling the condition, so SQL is going
to go and remove this customer from the output. So as you can see, we have reversed everything, right? The NOT
operator is going to make the true false and the false true. Okay. So this is how it works. Now let's go back to SQL in
order to practice. Okay. The next task says: retrieve all customers with a score not less than 500. So this sounds
really funny. As usual we're going to go and select star from customers. And now we have to filter the data based on this
condition: the score is not less than 500. Well, you can go and say the score is greater than or equal to
500, right? And with that it is not less than 500. So if you go and execute it, we have just solved the task, right? We get
all the customers whose scores are not less than 500. Or you can go and use the NOT operator to make things funnier. So
you go over here and say NOT, and then you switch the condition. So you make it like this: the score is less than 500. But
as we use NOT here, we have twisted everything. So we are saying the score is not less than 500. And if you execute
it you will get the exact same results. NOT inverts the truth. If you remove it and execute, you will get everything that is
less than 500. But if you put the NOT back, you will invert the whole logic. So if you go and execute, you are not getting
the scores that are less than 500. So this is really nice. This is how you use the NOT operator. Okay my friends. So
with that we have covered everything about the logical operators. Now we're going to move to the third group. We're
going to talk about the range operator. And here we have only one, the BETWEEN. So what exactly is the BETWEEN
operator? Okay. So what is BETWEEN? It's going to go and check if a value falls within a specific range. So you have a
range and you are checking whether your value is inside the range or outside the range. So let's understand exactly what
this means. Okay. So now in order to build a range you need two things. You need the lower boundary for the range
and you need the upper boundary as well. Once you have two boundaries, then you have a range, and everything between
those two boundaries is going to be true and everything outside those boundaries is going to be false. So now for example
let's say that we have the lower boundary 100 and the upper boundary 500. And there is one thing that you have to
understand about BETWEEN: the boundaries are inclusive. So that means if a value is exactly 100 or exactly 500
then it's going to be considered true. So it is considered to be inside the range. Now if you apply this filter to
our data, where we say the score must be between 100 and 500, SQL is going to go and do the following. For the first customer,
Maria, it is going to go and check whether her score is inside the boundaries. As you can see, 300 is between 100 and
500. So she is in the green area, and that's why SQL is going to be happy about it and leave the customer in the
output. Now moving on to John. John has 900. As you can see, 900 is greater than 500. So this value is going to be
outside the boundaries on the right side, and this means the score of John is not in the range. That's why he is not
fulfilling the condition and SQL is going to go and remove this customer from the results. Now moving on to George, 750.
The same thing: outside the range. SQL will not accept it and removes this customer from the final results. Now
moving on to Martin. His score is 500, and this is exactly at the boundary. If it were 501, it would be outside.
But since BETWEEN is inclusive, SQL is going to accept it, and Martin is considered to be in the range and fulfilling the
condition. So SQL is going to keep him in the final result. Now speaking about Peter, he has a score of zero, and this
is less than 100. So he is on the left side, not in the range. He is not fulfilling the condition and SQL is going to go and remove
him. This is exactly how BETWEEN works in SQL. It's very simple. Okay. So now we have the following task, and it says:
retrieve all customers whose score falls in the range between 100 and 500. So let's start as usual by selecting all data
from customers and executing it. Now the task says everything. We need all customers in a range. So we have a lower
value and a higher value. In order to do that, as usual we're going to use the WHERE and then we're going to specify
the column that we want to filter on. So it's going to be the score, and since we have two boundaries we can go and
use the BETWEEN operator. We start with the first boundary, the lower boundary, so it is the 100, and then 500, the
upper boundary. So BETWEEN 100 AND 500. So now let's go and execute it. And with that we get only
those two customers because they are within this range. Now there is another way to solve this task
without using BETWEEN. We can go and use the comparison operators together with the logical operator AND. So let me show you
how we can do that. I'm going to go and copy the whole thing. And now we're going to write two conditions. First,
the score should be greater than or equal to 100, because the boundaries are inclusive, and second, the score should be less than or
equal to 500. So this is the upper boundary. With that we have the two conditions and we can go and connect
them using the AND operator. So it's very similar to the BETWEEN: we have an AND between the lower and the upper
boundaries, but we are using the comparison operators. So it is greater than or equal to 100 and less than or equal to 500.
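The two ways of writing the range filter can be compared side by side. This is a minimal sketch, assuming a small sample table (the table name, column names, and the use of Python's sqlite3 module are illustrative); Martin's score of exactly 500 shows that the boundaries are inclusive.

```python
import sqlite3

# Illustrative sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, first_name TEXT, country TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Maria", "Germany", 300), (2, "John", "USA", 900), (3, "George", "UK", 750),
     (4, "Martin", "Germany", 500), (5, "Peter", "USA", 0)],
)

# Version 1: BETWEEN with a lower and an upper boundary (both inclusive).
with_between = conn.execute(
    "SELECT first_name FROM customers WHERE score BETWEEN 100 AND 500 ORDER BY id"
).fetchall()

# Version 2: the same range spelled out with comparison operators and AND.
with_comparisons = conn.execute(
    "SELECT first_name FROM customers WHERE score >= 100 AND score <= 500 ORDER BY id"
).fetchall()

print(with_between)                      # [('Maria',), ('Martin',)]
print(with_between == with_comparisons)  # True
```

The second version makes the inclusive boundaries visible in the query text itself through `>=` and `<=`.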
If you go and run this query you will get exactly the same results. Now if you ask me which method is my favorite, I'm going
to go with this method and I will skip the BETWEEN, because to be honest each time I forget
whether the BETWEEN boundaries are inclusive or exclusive. But if I read this script, I am going to see exactly that those
boundaries are inclusive, because we have the equals here. So I really prefer using the comparison operators together
with the AND over using BETWEEN. So it's up to you: if you memorize it, then go with the BETWEEN. But for me, I'm going
to go with the comparison operators. Okay my friends. So that's all about the BETWEEN and the range operator. Now
let's move to another group. We have the membership operators. Here we have two: the IN and the NOT IN.
So let's understand what this exactly means. Okay. So what is the IN operator? It's going to go and check if a value
exists in a list. So you have a list of values and you are checking whether your value is a member of your list. Let's
have a very simple example in order to understand what this means. Okay. So now how does this work exactly? What you have to
do is to go and make a list of values. So let's say that I have a list and there I have specified two values,
Germany and USA. So those two are the members of this list. Now if you use the IN operator, it's going to go and check
the value of the country, whether it is in the list or not. So let's do it one by one. For the first customer, Maria, her
country is Germany and Germany is a member of the list. So SQL is going to be happy and going to leave Maria in the final
results. Now moving on to John. John comes from the USA. USA is a member of the list. So he is fulfilling the
condition as well and you're going to see John in the final results. Now we come to George. George comes from the UK and UK is
not a member of our list. So SQL is going to go and remove this customer from the final results, as he is not fulfilling the
condition. Now for the last two, Martin and Peter, their countries are members of the list and SQL is going to go and leave
those customers in the final results. So as you can see it's very simple. All you have to do is to define the members
of a list and use the IN operator, and if the value is a member of this list it's going to be true, otherwise it's going to
be false. Now of course the other operator is going to be exactly the opposite, where we say NOT IN the list.
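The membership check and its opposite can be sketched with the list Germany and USA. The sample table, its column names, and the use of Python's sqlite3 module are assumptions for illustration.

```python
import sqlite3

# Illustrative sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, first_name TEXT, country TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Maria", "Germany", 300), (2, "John", "USA", 900), (3, "George", "UK", 750),
     (4, "Martin", "Germany", 500), (5, "Peter", "USA", 0)],
)

# IN: true when the country is a member of the list.
in_list = [
    row[0]
    for row in conn.execute(
        "SELECT first_name FROM customers WHERE country IN ('Germany', 'USA') ORDER BY id"
    )
]

# NOT IN: true when the country is not a member of the list.
not_in_list = [
    row[0]
    for row in conn.execute(
        "SELECT first_name FROM customers WHERE country NOT IN ('Germany', 'USA') ORDER BY id"
    )
]

print(in_list)      # ['Maria', 'John', 'Martin', 'Peter']
print(not_in_list)  # ['George']
```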
So we are searching for values that are not in this list. As we are using NOT, it's going to go and completely reverse
the truth. And if you apply this, you will get only one customer in the result. You will get George in the
result because his country is UK and UK is not a member of the list. So if you use NOT together with the IN operator
you will get exactly the opposite effect. So this is how the IN and the NOT IN operators work in SQL. Let's go
back to SQL in order to practice that. Okay. So now we have this task and it says: retrieve all customers from either
Germany or USA. Okay. So let's try to solve this task. This is going to be a little bit tricky. So select star from
customers as usual and execute it. So now we need in the results only the customers that come either from Germany or USA.
So that means this customer over here should be excluded from the result because he comes from the UK. So how are we going
to write it? It's going to be like this, maybe. The first condition is going to be the country is equal to Germany, or the
country is equal to USA, right? Something like this. So if you go and execute it, you will get in the output only the
customers that are either from Germany or USA. And with that we have solved the task, right? Well, there is another way
to solve this task which is clearer and shorter, using the IN operator. So now how are we going to do it?
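One way to sketch the comparison is to run the OR version and the IN version of the task against the same data and check that they agree. The sample table, its column names, and the use of Python's sqlite3 module are assumptions for illustration.

```python
import sqlite3

# Illustrative sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, first_name TEXT, country TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Maria", "Germany", 300), (2, "John", "USA", 900), (3, "George", "UK", 750),
     (4, "Martin", "Germany", 500), (5, "Peter", "USA", 0)],
)

# Repeating the same column with OR for every value.
with_or = conn.execute(
    "SELECT first_name FROM customers "
    "WHERE country = 'Germany' OR country = 'USA' ORDER BY id"
).fetchall()

# One IN list holding all accepted values.
with_in = conn.execute(
    "SELECT first_name FROM customers "
    "WHERE country IN ('Germany', 'USA') ORDER BY id"
).fetchall()

print(with_or == with_in)  # True
```

Adding a third country means one more value in the IN list, instead of another full `OR country = ...` condition.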
Let's go and get the whole thing in another query. And now instead of having equals and ORs and so on, we're going to
use the IN operator, and then we're going to have two parentheses, and inside them we're going to have a list of
values. So it's going to be Germany, and then the second value is going to be USA, like this. So we are saying the country
should be in this list, Germany or USA, and if it is one of those values then the condition is fulfilled. So now
if you go and execute this one over here you will get the exact same results. So my friends, if you notice that you are
repeating yourself in the WHERE condition and you are just changing the value of the condition, it is based on the same
column, and you are connecting the conditions using the OR, then there is something wrong. In this scenario always think about using the
IN operator, because this can get really ugly once you have a lot of values. So imagine in our database we have a lot of
countries and your query is going to be something like this. So you keep repeating country equal or country
equal and so on. Instead of that you're going to have a really nice list of countries in one go. So as you
can see here, it is easier to extend, and it can perform better as well. So whenever you see that we are repeating the same
thing but we are just changing the value, and we are connecting all those conditions using the OR, in this scenario
go and use the IN operator. All right my friends. So that's all for the membership operators. Now we're going to
speak about the last one, the search operator. And here we have only one, the LIKE. And each time we're going to say
LIKE, I'm going to remind you to like this course. So let's go. Okay. So now what is the LIKE operator?
You can use it in order to search for a pattern in your text. So you have a text or characters and you are
searching for a specific pattern inside the text. So let's have an example in order to understand exactly what this
means. Okay. So now if you don't have a coffee yet, go grab one, because you have to focus for this one. Now what we have
to do is to define a pattern in SQL. In order to build a pattern we have two special characters. If you
use a percentage you are saying anything. So I'm going to accept anything. It could be no characters
at all, or only one character, or many characters. So I'm saying anything. Now if you use an underscore you are
expecting to have exactly one thing, like one character or one number. So it is exactly one. I know this sounds
complicated, but with an example you can understand this. And I can tell you the percentage is way more commonly used than the
underscore. I really rarely use the underscore. So now let's say that I build the pattern like this. I say the
first character must be M and then a percentage. So here I'm saying in my text the first character must be an M
and after the first character I really don't care. It could be any character, any number, whatever. So this is the
pattern, and now let's have a few values in order to say whether it's true or false. So now if you have the value Maria,
you can see the first character is an M, which is perfect. This is exactly our pattern. The first character must be
an M. And then after the M we have four characters. So whatever it is, it is totally fine. We can say Maria is
fulfilling our pattern. And this is exactly what we are searching for. This value is fulfilling the condition. Okay.
Now moving on to the next value, we have Ma. So here again the first character is an M, which is perfect. And after that
we have only one character, a. Well, we said percentage, so it could be anything: one character, multiple
characters, a number, or whatever. So that's why this value matches our pattern and we will see it in the
output. Now moving on to the next value, we have only one M, which is totally fine as well, because we are saying the
first character must be an M followed by anything, and anything includes nothing. Now moving on to the last scenario, we have Emma. Now this
is problematic because the first character is an E, and in our pattern we say it must start with M. We don't
have that in this word. The first character is an E. That's why this value is not fulfilling our pattern and SQL is
going to remove this value from the final results. So this is exactly what is going to happen if you have this pattern
and those values. Now let's have another scenario where you say, you know what, it could start with anything, but for me
the last two characters are very important: they must be an I and an N. So it could start with anything but the last
two must be an I and an N. Let's take this value, Martin. SQL is going to go and immediately check the last two characters. So
you can see we have an I and an N, and the first part, Mart, is fine. It could be anything. So this value is fulfilling
the condition because the last two characters are an I and an N. Now moving on to the next one, we have Vin. So V, I, N:
the last two characters are exactly what we are searching for as well. It is fulfilling the condition, and we have
only a V before them. We said anything with a percentage, right? Now one more, we have In. It is
fulfilling the condition as well, because before it we don't have anything. So In is fulfilling the condition too. The
percentage is always saying anything. Now moving on to the last scenario, we have Jasmine. Here I and N are not the last two
characters. The last two characters are an N and an E, and this is not matching our pattern. That is why this value is not
fulfilling our pattern and you will not see it in the results. So with that you can understand how we can search for
something in a text using the LIKE operator. Let's keep going. Now let's say that I have a percentage at the
start and a percentage at the end, and in between I have only one character, an R. If you define it like this you are
saying that if there is an R anywhere, it is good enough: whether it's at the beginning, at the end, or in between, the condition
is fulfilled. So if you have Maria, you can see we have an R in the middle. On the left side we have two characters,
on the right side we have two characters. It doesn't matter; the main thing is that we have an R somewhere. So this is going to be
fulfilling the condition. Now moving on to Peter, we have an R at the end, and that is totally fine because we say on the
right side it could be anything. We have an R somewhere, and that's why it's going to fulfill the condition. Now we
have another case, Ryan, where we have an R at the start. So we don't have anything before it and we have after it
three characters, which is totally fine. So we don't really care about the position of the R. It is totally
acceptable to have an R anywhere. And if you have only an R, that is good enough as well. You don't have anything before,
you don't have anything after, and that's okay. But if you have a word like Alice, we don't have any R inside it. So
this is the only case where we don't have an R, and SQL is going to remove this value from the results.
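The three percentage patterns walked through so far can be checked value by value. This is a rough sketch using Python's sqlite3 module (an assumption for illustration); the helper name `matches` is hypothetical. Note that SQLite's LIKE ignores case for ASCII letters by default, which is why Ryan and R match the lowercase pattern here; other databases or collations may be case-sensitive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def matches(value, pattern):
    # Hypothetical helper: SQLite evaluates "value LIKE pattern" to 1 or 0.
    return bool(conn.execute("SELECT ? LIKE ?", (value, pattern)).fetchone()[0])

# 'M%': first character M, then anything (including nothing).
starts_with_m = {v: matches(v, "M%") for v in ["Maria", "Ma", "M", "Emma"]}

# '%in': anything, then I and N as the last two characters.
ends_with_in = {v: matches(v, "%in") for v in ["Martin", "Vin", "In", "Jasmine"]}

# '%r%': an R anywhere in the text.
contains_r = {v: matches(v, "%r%") for v in ["Maria", "Peter", "Ryan", "R", "Alice"]}

for label, result in [("M%", starts_with_m), ("%in", ends_with_in), ("%r%", contains_r)]:
    print(label, result)
```

In each group only the last value fails: Emma starts with E, Jasmine ends in N and E, and Alice has no R.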
And this way of searching for something is very common. You don't care about what comes before this word and after the
word, right? So if you are searching for any word, you're going to say percentage before and percentage after. Now I know
that we want to practice with the underscore. So let's say that I have two underscores, then the character B, and
then a percentage. So here what I'm saying is there should be something in the first position. There should be
something in the second position as well. Then the third position should be the character B, which must be exactly at this
position, and after that it could be anything. So we really don't care. I know this is a little bit complicated.
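The pattern just described (two underscores, then B, then a percentage) can be tried on a few values. This is a minimal sketch using Python's sqlite3 module (an assumption for illustration); the helper name `matches` and the test value Abel are hypothetical, chosen so that the third character is not a B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def matches(value, pattern):
    # Hypothetical helper: SQLite evaluates "value LIKE pattern" to 1 or 0.
    return bool(conn.execute("SELECT ? LIKE ?", (value, pattern)).fetchone()[0])

# '__b%': exactly one character, exactly one more character, then B, then anything.
pattern = "__b%"
third_is_b = {value: matches(value, pattern) for value in ["Albert", "Rob", "Abel", "An"]}

print(third_is_b)
```

Each underscore consumes exactly one character, so a two-character value such as An can never reach the B in the third position.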
Let's have an example. So we have the value Albert. Now we can see in the first position we have something, the A. Then
in the second position we have something as well, the L. So far we are good with the pattern. And then in the third
position we have B. So we have a complete match, and the rest, the ERT, whatever. So with that Albert is matching our
pattern. Moving on to the next one, Rob. You can see for the first character we have something, which is good. We have the R.
Then for the second character we have an O. So it's not empty. We have something. And then for the third one we have exactly
B. And after that we don't have anything, which is fine. So again this value is going to fulfill the condition. So moving on
to the next one. It starts with an A, so we have something in the first position. In the second position we have
something as well, the B. But now the third character is a problem. It is not a B. We have an E. So that's why it is not
following our pattern, and SQL is going to go and remove it. Now moving on to the last example, we have an A and an N. So in the
first position we have something. The second one as well. But for the third one we don't have anything. We don't have a B.
So that's why it's going to be removed. So my friends, I know that was a lot. This is exactly how you build a pattern
for the LIKE operator using the percentage and the underscore. But the percentage is more commonly used. So this is
exactly how it works. Let's go back to SQL in order to have some examples. All right, let's start with this task.
Find all customers whose first name starts with a capital M. So let's go and start searching for that information.
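This task and the three LIKE tasks that follow in this part can be sketched in one go against a small sample table. The table name, the column name first_name, and the use of Python's sqlite3 module are assumptions for illustration. SQLite's LIKE ignores case for ASCII letters by default, so the capital M is not enforced here; a case-sensitive database or collation would be needed for that.

```python
import sqlite3

# Illustrative sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, first_name TEXT, country TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Maria", "Germany", 300), (2, "John", "USA", 900), (3, "George", "UK", 750),
     (4, "Martin", "Germany", 500), (5, "Peter", "USA", 0)],
)

patterns = {
    "M%": "starts with M",
    "%n": "ends with N",
    "%r%": "contains an R",
    "__r%": "has an R in the third position",
}

results = {}
for pattern, description in patterns.items():
    rows = conn.execute(
        "SELECT first_name FROM customers WHERE first_name LIKE ? ORDER BY id", (pattern,)
    )
    results[pattern] = [row[0] for row in rows]
    print(f"{description}: {results[pattern]}")
```

Only the position of the percentage and the number of underscores change between the four queries.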
We're going to start as usual: select star from customers. And now we have to go and build the filter logic. So we're
going to say WHERE. Now we are searching for something in the first name, so we're going to say first name. For us
it is very important that it starts with an M, and then the rest doesn't matter. So we're going to use the LIKE operator in
order to search. We're going to have our single quotes and we're going to start with the M, followed by a percentage, because it doesn't matter
what comes after that. For us it is very important that the first character is an M. Let's go and execute it. And
with that we get our two customers, Maria and Martin, and both of them start with an M. So with that we have solved the
task. It is very simple. Now we have the following task: find all customers whose first name ends with an N. So let's go
first and select all the customers here. We need all those customers that have an N at the end. So we
have John and Martin as well. So how are we going to do it? The same thing: WHERE first name LIKE, since we are searching,
but here we're going to change the expression. It must end with an N as the last character. What comes before that
doesn't matter, so we start with a percentage. It could be anything, but the last character of the word should be
an N. So that's it. Let's go and execute. And with that we get John and Martin because the last character is an
N. It is very simple, right? It is all about where we place the percentage. Okay. So now we have the
next task. Find all customers whose first name contains an R. So here we don't have any specification of whether
it is at the start or at the end. Somewhere there should be an R. So if you go and execute first without any
WHERE condition, you can see here, for example, Maria: we have an R somewhere in the middle. George as well,
Martin, and Peter at the end. So we have a lot of names with an R. So how can we search for that? We're going to stick
with the WHERE first name LIKE, and here our character is going to be an R, and we're going to put a percentage before it and after
it. So it doesn't matter what is before it or after it; somewhere there should be an R. So let's go and execute
it. And with that we get all our customers where somewhere we have an R. As you can see it is very simple. If you
put the percentage before and after, then you are open to more results. And this is used a lot in order to search
for a value inside your database. All right. Now we're going to move to a funny one. It says: find all
customers whose first name has an R in the third position, for some reason. I don't know why. So let's go and query
our customers here without any filter. For us it is very important to find the customers where in the third
position we have an R. Like here, for example, Maria: the third character is an R, which is okay. But with Peter over here
the R is not the third character, so it is not fulfilling the condition. So how are we going to write that? We are going to say it
like this: WHERE the first name LIKE, but now we have to write the pattern from the start. The first position is going to be an
underscore, the second position is going to be an underscore as well, and now in the third position we're going to have an R. So
with that we make sure the third position is an R and that before it we have two positions. And afterward it
doesn't matter what comes after that, it could be nothing or more characters, so we add a percentage. So if you go and execute it like this we will
get Maria and Martin, and we will not get Peter because the R is not in the third position. So now if you don't do it
correctly with the underscores, say we go and remove one of them and execute, you will get nothing, because we don't have
any first name where the second position is an R. So you have to be very careful with this. All right my friends. So this
is how you search inside your values. And with that we have covered all the different groups of operators that you
can use inside a WHERE clause. So you have learned how to filter your data using multiple operators, and
you can filter anything now in SQL. Now we will move to a very interesting topic. You
will learn how to combine your data from multiple tables. And here we have two main methods. The first one is SQL joins
and the second is set operators. They are really big topics, so we're going to first focus on the SQL joins. And here
we have a lot of things to cover. So now we are talking about the core of SQL. So let's
go. All right. So now we have two tables, table A and table B. And the big question here is how to combine those
two tables. What do we want exactly? Do you want to combine the rows or the columns? If you say I would like
to combine the columns, then we are talking about joining tables. So we're going to use joins in SQL. So now let's
say that we are joining table A with table B and we start from table A. SQL is going to take the columns and
the rows of table A, and SQL is going to call it the left table because we started from there. Then we join it
with table B, and SQL is going to call the second table the right table. And here is what's going to happen: SQL is
going to take the columns and the rows from the right table and put them side by side with the columns and rows of
table A. So we are combining the columns: we are putting them side by side. And now if you say, you know what, I
don't want to do that, I would like to combine the rows, both of the tables have the same columns and I just want to
stack them, then we are talking about another method. It is called the set operators. Here there is no left
and right. Since we started with table A, SQL is going to take the columns and the rows of table A and
put them in the results. Then it's going to go to the second table, table B, and it's going to take only the rows and
put them below the rows of table A. So we are putting the rows beneath each other. We are doing something like
appending. So that means as we are using the set operators, we are combining the rows. Our table is going to be longer. But
with the joins we are combining the columns side by side and we are getting a wider table. Now for each method
there are different types. For example, in order to do the joins we have four very well-known types. We can do
an inner join, full join, left join, or right join. Of course there are more than that, but those are the basics. And
for the set operators we have types as well. We have the union, union all, except, and intersect. And for each
method there are different rules. In order to join the tables we have to define the key columns between the two
tables. Don't worry, we're going to learn about that later. This is the requirement in order to join tables. And
the requirement for combining tables using the set operators is that the tables in your query should have the exact same
number of columns, but here you don't need any key in order to combine the tables. So guys, if you look at this,
in order to combine two tables, first you have to decide: do I want to combine the columns or the rows? So first you have
to decide on the method, and after that you have different types for how exactly you're going to go and combine the data,
and of course there are rules that you have to follow. Now, of course, we're going to go and cover everything in the
course, but now in this section, we're going to learn how we're going to combine the tables using the SQL joins.
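The difference between the two methods can be sketched with two tiny tables. The table names table_a and table_b, their columns, and the use of Python's sqlite3 module are assumptions for illustration. The join uses a key column and produces a wider result; the set operator needs no key, only the same number of columns, and produces a longer result.

```python
import sqlite3

# Two hypothetical tables with the same columns: a key (id) and a label.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER, label TEXT);
    CREATE TABLE table_b (id INTEGER, label TEXT);
    INSERT INTO table_a VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO table_b VALUES (2, 'b2'), (3, 'b3');
""")

# Join: columns of both tables side by side, matched on the key column.
joined = conn.execute(
    "SELECT a.id, a.label, b.id, b.label "
    "FROM table_a AS a INNER JOIN table_b AS b ON a.id = b.id"
).fetchall()

# Set operator: rows of table_b stacked below the rows of table_a.
stacked = conn.execute(
    "SELECT id, label FROM table_a "
    "UNION ALL "
    "SELECT id, label FROM table_b "
    "ORDER BY label"
).fetchall()

print(joined)   # [(2, 'a2', 2, 'b2')]
print(stacked)  # [(1, 'a1'), (2, 'a2'), (2, 'b2'), (3, 'b3')]
```

The joined row has four columns, two from each table, while the stacked result keeps two columns and has four rows.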
So, we're going to go and dive into this world. All right. So now what exactly are SQL joins? Let's say that we have
two tables. In the left table, we have the customer names, so we have four customers. And in the right table, we
have the country information about the customers. And now we would like to query both pieces of information, the names and
the countries. In order to query those two tables in one query, first we have to connect them. And in order to
connect those two tables we need a key, a column that exists on both the left and the right sides. Looking at this, the
common column here is the ID of the customer. Once we connect those IDs together, we will be able to query those
tables together and SQL is going to start matching those IDs. So for ID number one, we will get the name Maria and the
country Germany. And ID 2 is connecting John to USA. And now you can see ID 3 is not connectable, so we
cannot connect it to the right side. But for ID 4, we can use it in order to connect Martin to Germany. So this is
exactly what happens if you join two tables. You connect those two tables using a common column, a key like the
ID. And once we have a matching value, we can connect the two rows together. So this is what we mean with SQL
joins. Now you might ask, why do we actually need joins? Well, the first and very important reason is to recombine your
data. So now usually in databases the data about something like the customers could be spreaded into multiple tables.
Like we could have table called customers, another one where we have the customer addresses and a third table
where you can find the orders of the customers and maybe another one where you can find the reviews of the
customers. So as you can see the data of the customers is spread across four tables. Now what if I would like to
see all the data about the customers in one result? So I would like to see the complete big picture about our
customers. What we can do, we can go and connect those four tables using the SQL joins. And once we do that in one query,
I will be able to combine all those tables in one big result. And this is the most important reason why we use SQL
joins: to combine all the data about a specific topic in order to see the big picture. Now, another reason why we
use SQL joins is data enrichment. This is where I want to get extra data, extra information. So let's say that
you are querying the table customers and this is your main table, the master table. So you are able to see all the
data that you need, but sometimes you would like to get some extra information from another table, like for
example the zip codes of the countries. So you would like the help of another table; we call it a reference table or
sometimes a lookup table, where there is one extra piece of information that you would like to add to your master
table, to the primary source of your data. So now what we can do, we can join those two tables in order to enhance our
table. So we are getting one extra relevant piece of information for the customers, and this process we call data
enrichment. I'm getting extra data for my main table. So this is another reason why we use joins. All right. So
now so far we have used joins in order to get the data from two tables. But now there is another use case for the SQL
joins. We use them in order to check the existence of your data in another table, or maybe as well the non-existence. So
let's say that I have a table called customers and I'm working with this table and doing queries. But now I would
like to check something. I would like to check whether our customers did order something. Now in order to check that I
need the help of another table for example the table orders. So that means I'm using the table orders only for my
check. So I don't want to get any extra data from the orders in my final results. I'm just using the table orders
and we call this table a lookup. So now what we can do, we can connect those two tables together. And now based on
the existence of the customers inside the second table, the orders, either the customer is going to stay in the final
results or is going to be removed. So that means I'm filtering the data based on the join. And of course I can check as
well the non-existence. I would like to see in the final results all the customers that didn't order anything. So
it is the same scenario. So my friends, those are the main three reasons why you use SQL joins. First, if you want to
combine the data from multiple tables in one big picture. So I use join in order to get the data from different tables.
The second use case, you are working with one table but you would like to get an extra information from another table.
So you are doing something called data enrichment. And in the third scenario, we don't want to combine
the data. We want just to join it with another table in order to do a check to check the existence of your records in
another table. So this is why we need joins in SQL. Now there are a lot of
different possibilities for how to join tables, how to join the data. Now in order to make it easy to understand,
we're going to visualize the tables as two circles. So we have the table A and the table B. The table A is on the left
side. We call it the left table. And the table B going to be on the right side and we call it the right table. The side
of the tables is very important. Now if you combine those two circles, you will get three different possibilities. The
circles going to overlap. And here exactly where we can have the matching data between the two tables. So the data
is available on the left and on the right. Or another possibility you want to get all the data from one of the
tables. So you can get all the rows from one circle. And the third possibility you want to get only the unmatching data
from one table. So if something exists in one table but not in the other table then we call it unmatching data. So
those are the three scenarios that you have to ask yourself once you are combining tables and this can generate a
lot of join types. So here we have the basic SQL joins, those are the classical ones, and here it depends on the scenario
whether you want only the matching rows or all the rows from the left, the right, or both, and we have advanced SQL joins where we
focus on the unmatching data. Now we're going to go and cover all those types one by one. So we're going to start
first with the basics and the first option that you have is to get all the data without joining tables. So let's
see what this means. So what do we mean with no join? Well, we want to return the data from two
tables without combining them. So actually this is not a join type because we are not combining anything.
We just want to query the data from two tables. So that means from the table A we want to see all the rows everything
and from the table B we want to see everything as well all the rows. So that means we want to see two results and
there is no need to combine them. So let's see the syntax of that. So all what you have to do is very simple.
SELECT * FROM table A, then a semicolon, and then start another query: SELECT * FROM table B. So that's it.
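Written out, that syntax could look like the following sketch. The tables a and b and their columns are hypothetical stand-ins for the two circles, not tables from the course database.

```sql
-- Hypothetical tables a and b (names and columns are assumptions for illustration).
CREATE TABLE a (id INT, a_value VARCHAR(10));
CREATE TABLE b (id INT, b_value VARCHAR(10));
INSERT INTO a (id, a_value) VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
INSERT INTO b (id, b_value) VALUES (1, 'b1'), (2, 'b2'), (4, 'b4');

-- No join: two independent queries, each ending with a semicolon.
SELECT * FROM a;   -- first result: every row of a
SELECT * FROM b;   -- second result: every row of b
```

Each statement produces its own result set; nothing is matched between the two tables.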
And of course since we are not combining the data there will be no join in the syntax. So that's it. Let's go to SQL in
order to do that. Okay. So now we have the following task. It says retrieve all data from customers and orders in two
different results. So that sounds like we don't have to go and combine the tables together. And all we have to do
is the following. We can go and select the data from the first table like this and then we make another query for the
second table, the orders, and we don't have to go and combine them in one big query. We just use very simple SELECT
statements in order to retrieve the data. So if you go and execute it since you have two separate queries you will
get two results and with that in one result you will get all the customers and in the other result you will get all
the orders and the data is not combined at all. So this is how you query two tables without combining them. So with
that we are getting all the data without joining the tables. Now we're going to start talking about the first type of
join the inner join where we start combining the data from two tables. So let's
go. Okay. So now what exactly is an inner join? This type is going to return only the matching rows from both tables.
So that means we will see in the output only matching rows. So now what do we need from the left table? We want only
the matching data. So we will not get the whole circle of A. We will get only where we have an overlapping with the
table B. So we want to see the data from A only if it exists in the table B. And now what do we need from the table B?
Exactly the same thing only the matching data. So that means I don't want to see all the data from B. I want to see only
the data in B that has a match from the table A from the left side. And with that you will get only the matching data
from both tables. Now let's see how we can write that in SQL. So it is a usual query and always we start with a select.
So we select for example all the columns, then FROM, and here we specify the table name. So it's going to be A. So far nothing
new. But now we want to add as well the table B in the same query. In order to do that we use the keyword join and then
we say table B the name of the table. And since we have like different types of joins in SQL, you can specify the
type of the join before the keyword join. And if you don't specify anything, the default type is inner join. But my
friends, the best practice is to always mention the type. I don't like to rely on the defaults because in projects maybe
not everyone is aware of the defaults. So don't skip that. Always specify the type. So now what we're going to do,
we're going to put the keyword inner before the join. And with that SQL going to know how to deal with the rows
between two tables. But still we are not done there. We have to tell SQL how to combine the tables. And for that we use
the keyword ON. And after that you specify the join condition. And as we learned, in order to join two tables we
have to find a common column in order to match the data. Right? And usually in SQL these are the keys or
IDs. So the condition can be like this: the key from the table A must be equal to the key from the table B. So this is
the join condition, and using this condition SQL can go and start matching the data from the left table and the right table.
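As a minimal sketch of the syntax just described, assuming two hypothetical tables a and b that share a key column called id (the names and values are made up for illustration):

```sql
-- Hypothetical tables a and b sharing the key column id (assumed for illustration).
CREATE TABLE a (id INT, a_value VARCHAR(10));
CREATE TABLE b (id INT, b_value VARCHAR(10));
INSERT INTO a (id, a_value) VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
INSERT INTO b (id, b_value) VALUES (1, 'b1'), (2, 'b2'), (4, 'b4');

-- INNER is the default join type, but it is spelled out on purpose.
SELECT *
FROM a
INNER JOIN b
    ON a.id = b.id;   -- join condition: the key in a must equal the key in b
-- Only ids 1 and 2 exist in both tables, so only those two rows are returned.
```

The row with id 3 exists only in a and the row with id 4 only in b, so neither appears in the result.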
And there is one thing that is very important while you are joining the tables you have to understand about the
order of the tables in your query. Now in the inner join the order of the tables doesn't really matter. So whether
you start from A or you start from B it doesn't matter because you will get the same results. Both of the tables have the
same priority and it doesn't matter where we start whether we say from A join B or we say from B join A we will
get the exact same results. So in the inner join you don't have to worry about the order of the tables. So that's all
about the inner join. Now let's go back to SQL in order to practice. Okay. So now we have the following task and it
says: get all customers along with their orders, but only for customers who have placed an order. So my friends that
means we need the data from the customers and from the orders from two tables and we have to put everything in
one results. That means we have to join two tables. Now let's go and do it step by step. So we're going to go and say
SELECT * FROM customers, and then we have to go and join it with the orders. We're going to say JOIN orders. Now you
have to go and specify the join type. Is it inner, left, full and so on? Well, that depends on the task. It says we
want all customers but only for customers who have placed an order. So there is a condition right here. We
don't want to see everything from the customers. We just want to see only the matching data, only if the customer has
an order in the orders table. And for that we can go and use the inner join. Of course you can leave it like this, with a plain JOIN, and
you will get the same effect, but I'm going to go and specify it like this, INNER JOIN, just to make it clear that we are
speaking about the inner join. And after that we have to go and specify the join condition. So we have to go and find a
common column between the customers and the orders. So how I usually do it I go and explore both of the tables. So I'm
going to go and select everything from customers and as well everything from the orders. So let's go
and execute. Now we're going to start searching where do we have a common column between those two tables. So we
have from the first table the first name, country, and score, and you don't find any of that information in the second
table. The only one is the ID. So there is the ID of the customer, and you can also find the ID of the customer in the orders table,
the second column here. So this is the common column between those two tables. And usually in databases we create IDs
exactly for this, in order to connect tables. So it's really rare that we're going to use something like a country or score or
first name in order to join tables. We usually use the IDs. So let's go back to our query and use those two columns. So
it's going to be the ID from the customers equal to the customer ID from the orders. So that's it. With that we have the
condition we have decided on the type and we can go and execute it. Now you can see we are getting only three
customers. Right? If you don't apply the inner join we can see that we have five customers. So that means actually we
have two customers without any orders any matching data from the other table. And as well you can see very nicely we
have now not only the columns from the customers but as well all the columns from the orders side by side. So with
that we have combined the data and as well with that we have solved the task but we will not leave our query like
this because it is not really good practice. What we have to do is to go and select only the columns that really
make sense in our query, because in many cases in your tables you will have a lot of columns that are not needed. Like for
example if you check here you see we have the customer ID here and as well the customer ID over here. So it's like
repetition and it's enough to see it only once. So what you have to do is to go and pick few columns that we want.
For example, I'm going to start with the ID maybe the first name and that's all from the first table. Let's go and get
the order ID, and I don't want the customer ID again. So from the second table I'll add the sales. So let's
go and execute it. And with that you can see very nicely the customer's name and their orders with the sales. And now
comes something very important. Sometimes if you have two tables you might have columns that have the same
name. Like imagine the order ID in the table orders is just called ID. So that means we have the same name in both
tables, and this makes the column reference ambiguous for SQL. And here you will get an error that tells you: I really don't know what
you mean with the ID. Is it from the table customers or from the orders? So we have to tell SQL exactly which
table this column comes from. So in SQL, in order to do that, before the column name you write
the table name, customers, and then a dot, and now we are telling SQL that this column, the ID, comes from the
table customers, and SQL will not be confused about it and it's going to go and get the ID from the customers. And
for the second ID you can go over here and as well before it you say orders.id, so that SQL knows, okay, this ID comes from
the orders and the other one comes from the customers. And it is always good practice, especially if you are joining
tables to always assign for each column a table because after a while if you open your query and you see okay the
sales does the sales come from the customers or the orders and if you have a long list of columns it's going to be
really confusing so that's why we consider it best practices if you always assign for each column the table name
especially if you are doing joins. So it's going to be like this. But of course if you have like only one table
it's clear that all the columns in the select comes from this table. But since here we are dealing with multiple tables
it is good to show it like this. And of course here we don't have the ID. We have the order ID and the same thing for
the join condition. So the ID from here comes from the customers and the customer ID come from the orders. So now
it is clear for everyone which column comes from which table. But now you might say, you know what, each time I have to
write customers, this is a very long name, and sometimes in real projects you're going to see tables that have
really long names, and it's going to be really annoying to add it each time before each column, right? So instead of
that we can go and assign aliases to the tables, not only to the columns. So usually we go over here after the table name and say AS and
then a short alias; maybe you can go and use only one character, like the first character, C. And now instead of saying customers you
can go over here and say C. The same thing for the second column and as well over here. And you can now use the C
everywhere in your query. The same thing for the orders. You can go over here and say AS O. And now instead of orders you
say O over here. And now it is very easy to see that those two columns come from the C,
that means the customers, and those two columns come from the O, the orders. Those are the best practices when you are
joining tables together in SQL. And of course with that we have solved the task. And about the order of the tables,
it doesn't matter where you start. So for example, if you take the customers here and put it in the JOIN and put the
orders in the FROM, so you just switch the tables, and execute it, you will get the exact same results. So if you are doing an
inner join between two tables, don't worry about the order of the tables. Okay. So now let's go and understand
exactly how SQL executes the inner join. Okay. So now again here we have our query. Then we have the two tables
customers and orders. And here we have the ID where we are joining the data. So this is the ID from the table customers
and this is the customer ID that we have in the orders. Now let's see how SQL can execute this. So we are saying I would
like to see the ID and the first name. So we will get the ID, the first name from the table customers and we would
like to get the order ID and as well the sales from the table orders. So our result going to focus on those four
columns. Now the data should be joined between those two tables using the inner join and SQL going to start from the
left table from the customers because we say from customers. So it's going to start matching the ID from the left
table with the right table. So it's going to say okay is there a match from the first record from the first order?
Well yes it is the same ID and then SQL going to say okay that condition is fulfilled and we are allowed to see the
data. So the data will be presented in the output. So we're going to have the ID Maria and the order ID from Maria and
the sales of this order. So there is a match. Then SQL going to go to the second record. Well, we don't have a
match. The third we don't have match. And so on for the last one. So we have only one match for this ID. Then SQL
going to go again to the customers and pick the second one and start matching again with the first order. Do we have a
match? Well, no. Then it's going to go to the second. Well, now we have a match. So SQL going to be happy. the
condition is fulfilled and we will see the results. So we're going to see the first name and as well the order
information for this customer in the output. It's going to keep searching. So we don't have a match as well here. So
that's it. Now for the third customer as well, from the start: is there a match? No. Then to the second, to the third, and here we have
a match. So it's going to go and show this informations since there is a match. So the customer three George with
the order from this customer order ID and the sales as well in the output. Now it's going to go and keep continuing the
search. Well, we don't have any match. Then it's still going to go to the fourth customer and start matching. Do
we have here an ID? Do we have here a match? Well, no. Then the second, third, and fourth. We don't have any order for
this ID. There is no match at all. And since we are saying inner join then SQL will not allow to show the data of this
customer in the results. There is no match and SQL going to totally ignore this customer. Then we're going to go to
the last one and start as well matching this ID with the orders. Well, there is no match as well. SQL going to go and
exclude this user from the results. So this is exactly how the inner join works. It starts from the left side and
starts matching the data on the right side, and only if there is a match is the result going to be presented in the
output. And this is exactly why we are getting these results and how the inner join works. So now if you look again at
the reasons why we are joining tables we can say we can use the inner join in order to recombine the multiple tables
into one big picture. So the first use case and as well we can use the inner join in order to filter the data. So
since we are saying only the matching data that means we are filtering the data we are checking the existence of
the records in another table. So you can use inner join either to combine data from multiple tables or you can use it
as well only for filtering purposes only to check the existence of your rows. So this is usually the two use cases of
inner. All right. So that's all about the first type the inner join. Next we're going to talk about the left join.
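To recap the inner join task in one place, here is a sketch of the final query with table aliases and qualified columns. The reduced table definitions, the snake_case column names, the name of customer 2, and the order values are assumptions for illustration; the transcript only fixes that there are five customers, four orders, and three matches.

```sql
-- Assumed, reduced versions of the two tables (illustrative data only).
CREATE TABLE customers (id INT, first_name VARCHAR(50));
CREATE TABLE orders (order_id INT, customer_id INT, sales INT);
INSERT INTO customers (id, first_name)
VALUES (1, 'Maria'), (2, 'John'), (3, 'George'), (4, 'Martin'), (5, 'Peter');
INSERT INTO orders (order_id, customer_id, sales)
VALUES (1001, 1, 35), (1002, 2, 15), (1003, 3, 20), (1004, 6, 10);

-- Customers with their orders, only for customers who placed an order.
SELECT
    c.id,
    c.first_name,
    o.order_id,
    o.sales
FROM customers AS c
INNER JOIN orders AS o
    ON c.id = o.customer_id;
-- Returns one row each for customers 1, 2 and 3; customers 4 and 5 have no order.

-- Switching the table order returns the same rows for an inner join.
SELECT
    c.id,
    c.first_name,
    o.order_id,
    o.sales
FROM orders AS o
INNER JOIN customers AS c
    ON c.id = o.customer_id;
```

The AS keyword for table aliases is accepted by most engines but not all (Oracle, for example, expects the alias without AS).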
So we're going to focus on the left side. So let's go. Okay. So now what exactly is a left join?
This type is going to return all the rows from the left table and only the matching rows from the right table. So now if
you look again to our two circles A and B. What do we need from the left table? We want to see everything all the rows
all the data. So that means we will get a full circle. And now from the right table we want to get only the matching
data. So that means we don't want to see everything from the table B. We want to see only the records that has match to
the table A. So that means my friends the left table has here more priority. This is the primary source of your data.
The main source we cannot miss anything. This is very important. We want to see all the data. But from the table B, it
is a secondary source of data and we are joining it only to get additional data. So I don't want everything. I want
only the data that has a match in the left table. So this is what we mean with a left join. Now if you look at the
syntax it's going to be very similar to the inner join. So we start from the left table the A. Then we say left join
the right table B and then the same condition using keys. So here we just switch the type. Instead of inner we
have now left. But now here with the syntax we need to be very careful. The order of the tables now is very
important. You have to start from the correct table. So you have to mention the left table exactly in the from
clause and then you join it with the right table. So in the join you have to specify the right table. If you don't do
it like this then you will not get all the data from a and you will not get the results that you are expecting. So this
is what we mean with the left join. Let's go back to SQL in order to practice. All right. So now we have the
following task. It says get all customers along with their orders including those without orders. So again
here we need the data from two tables the customers and orders and we want everything in one result. So that means
we have to go and join the data. And now the task says including those without orders. So that means I want to see
everything, the matching data and the unmatching data, from the table customers. And by looking at our current inner join query,
this is not working because we are not getting everything, right? We are getting only the customers that have a match in the
table orders. And this is of course not fulfilling the task. So now if you read the task you can understand that the main
table here is the customers. The task is not about seeing all the orders and not missing any order; the orders table
here is only for additional information. So now in order to not lose any data for the customers we make
sure we start from the table customers. So that means the customers table is now on the left side. And after that, instead of
the inner join, which is not the right thing for this task, we're going to say LEFT JOIN, and with that we guarantee we will get
all the data from the customers. Now we say left join orders and of course the condition going to stay like this. This
is how we are connecting the two tables. So actually that's it. Let's go and execute it. And now by looking to the
result you can see that we have now five customers even the customers that didn't place any orders. So you can see Martin
and Peter, they don't have any order ID. So that means they didn't order anything. And as you can see, SQL is showing
us nulls when there is no match. So with that we have solved the task. Now my friends one more thing as I told you the
order of the tables is very important because the customer is now the left table because you start from it and the
second table the orders is the right table. Now if you go and switch them like this. So we start from the orders
and then join it with the customers and you go execute it you will not get all the customers and of course the task is
now not solved. So as you can see you are getting now completely different result if you go and switch the tables.
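The effect of the table order can be sketched like this. The reduced table definitions, column names, and order values are assumptions for illustration, chosen to match the counts mentioned in the walkthrough (five customers, four orders, three matches).

```sql
-- Assumed, reduced versions of the two tables (illustrative data only).
CREATE TABLE customers (id INT, first_name VARCHAR(50));
CREATE TABLE orders (order_id INT, customer_id INT, sales INT);
INSERT INTO customers (id, first_name)
VALUES (1, 'Maria'), (2, 'John'), (3, 'George'), (4, 'Martin'), (5, 'Peter');
INSERT INTO orders (order_id, customer_id, sales)
VALUES (1001, 1, 35), (1002, 2, 15), (1003, 3, 20), (1004, 6, 10);

-- Task: all customers with their orders, including customers without orders.
-- customers is the left table, so every customer is kept.
SELECT c.id, c.first_name, o.order_id, o.sales
FROM customers AS c
LEFT JOIN orders AS o
    ON c.id = o.customer_id;
-- 5 rows; Martin and Peter get NULL for order_id and sales.

-- Switched tables: orders is now the left table, so every ORDER is kept instead.
SELECT c.id, c.first_name, o.order_id, o.sales
FROM orders AS o
LEFT JOIN customers AS c
    ON c.id = o.customer_id;
-- 4 rows; Martin and Peter are gone, and order 1004 shows NULL customer columns.
```

The join type and the condition are identical in both queries; only the table named in FROM differs, and that alone changes which rows are guaranteed to appear.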
So be careful where you start and how you join the tables in order to get the effects that you want. All right. So now
I'm going to put everything back like before. Now let's go and understand how exactly this query is executed. Okay. So
now again we have the data from customers and orders, and this time we are doing the left join. So now let's
see how SQL is going to do it. So SQL is going to say, okay, we need the ID and the first name, and we will get that as well in the
results, and from the right table we need only those two pieces of information, the order ID and the sales, in the output. So those
are the columns that we need. So now SQL in the left join is going to do it a little bit differently. It's going to start as
well from the left table, from the customers. But this time it's going to go and immediately put the result in the output
without trying to match anything or to check whether the data exists or not, because it doesn't matter. It is not doing
any validation of whether the customer exists in the orders. Since it's a left join, it is still going to show all the data
from the left table. So there will be no check. But now as a next step, in order to get the order ID and the sales,
SQL will start searching. So SQL going to go over here and start searching where do we have a customer with this
ID? Well, it's going to be the first order. We're going to get the order ID and as well the sales informations and
we will see that in the output. So that's it for the first one. Now it's going to go to the second row and the
same thing going to happen immediately. The SQL going to go and put the result in the output without checking anything.
And then in order to get the order data, it will start searching for this ID. So we have it here in the second row. We
have the order ID and the sales. And it's going to put those results in the output as well. So the same for the third
one: it's immediately going to put everything in the output and then start searching for orders with this ID. We have it over
here. So this order belongs to the customer ID number three. So far we are getting the same result as the inner join. But
we are not done yet. Now comes exactly the difference: SQL is going to go and get Martin and put him immediately in the
output and start searching for an order with this ID. So do we have any order with the ID number four? Well, we don't
have anything this time. SQL of course will not go and exclude the ID number four. It's going to leave it. But in SQL,
if there is no match, we still have to have something in the output. So SQL is going to go and say the order ID in the output is going to
be null, like this. We don't know it; it is unknown. And the same thing for the sales. So in the left join, if there is
no match, you will see nulls. The same thing for the next customer, Peter. So SQL will go and put the result
immediately in the output and then start searching the orders. So do we have anything for the ID number five? We
don't have anything. That's why SQL going to go and present nulls as well in the output. And that's why you saw nulls
in the output because those customers don't have any orders. So this is exactly the effect of the left join: you
will get everything from the left table and only the matching data from the right side, and if there is something not
matching you will get nulls. So that's it. This is how SQL executes the left join. Okay, so now back to the use cases
of joins. If I think about the left join, I can use it in order to recombine data, in order to build this big picture, and as
well in the second use case where we use it in order to get an extra information from another table. So we have a main
table and secondary table. So we use it for both use cases and as well in the third use case only with a twist that
we're going to learn later. So that's all about the left join. Now we have another type that is exactly the
opposite of the left join. We have the right join. So now let's understand what this
means. Okay. So now what exactly is a right join? This is the total opposite of the left join. So this type is going to
return all the rows from the right table and only the matching rows from the left table. So here the main table, the
main focus is the right table. So SQL going to get you all the rows everything from the table B the right table but
from the left side we will get only the matching data. So that means on the left side you will get only the data that
has a match on the right side, and with that the right table is going to be the primary, the main source of your data. So
it is a very important table, but the left table is not that important. You are just joining it in order to get
additional data. So again about the syntax it's not that crazy. All what you have to do is to change the join type.
So instead of left you say right join and again here the order of the tables is very important because the side here
makes a difference. So we start from the left table A and then right join it to the table B. So it sounds very similar
to the left join. We are just switching things. Now let's go back to SQL in order to practice. Okay my friends, so
now we have the following task and it says get all customers along with their orders including orders without matching
customers. So again we have the customers and the orders and we are doing the join but here the condition is
different. We want to see all the orders even if they don't have a matching customer. So that means I would like to
see everything from the table orders and the customers table here is only like supporting and helping. So the main
table that we are focusing on is the orders. We want to see everything, and from the customers only the matching rows. And
if you look at the current results, you can see we are seeing only three orders, right? But in the original
table, if you go back over here, you can see that we have four orders. So with the current query we are not seeing
all the orders. So now how are we going to solve it? If you start from the table customers you can say, you know what,
instead of left join we're going to say right join. And with that you're going to guarantee you will get everything
from the table orders. But now the left table the customers is not that important and you will see the data of
the customers only if there is a match. So doing the right join like this guarantees that we see every order, whether
there is a match or no match. Now if you go and execute it you can see on the right side the order ID and the sales,
and we can see now all the orders, and on the left side the ID and the first name. We are seeing only the customers if they
did order something. And for the orders without a known customer, we are getting nulls. So with that, you have solved the
task using the right join. So now my friends, you have to go and solve this task to get the exact same results. But
you are allowed to use only the left join. So you are not allowed to use the right join. So now go pause the video,
solve the task, and I'll meet you soon. Now my friends, in SQL there are
always alternatives on how to solve a task. So now if you want to get all the data from B and only the matching from
A, you can do it like we have done using the right join. But if you go and switch the sides and you make the table B as a
left table and the table A as a right table, you can do that of course in SQL. But you have to switch the join type. So
instead of right, we have to use left now, since the B table is now on the left side, and as well you have to switch the
order. So you start from the B table and then you say left join the A table, and of course with the same join condition. And
if you do that, you will get the exact same result as the right join query. So if you just switch the tables and as well
switch the join type, you can get the same results. And to be honest, my friends, I don't like the right join.
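The equivalence just described can be sketched as two queries that return the same rows. The reduced table definitions, column names, and order values are assumptions for illustration; the one order without a matching customer follows the task description.

```sql
-- Assumed, reduced versions of the two tables (illustrative data only).
CREATE TABLE customers (id INT, first_name VARCHAR(50));
CREATE TABLE orders (order_id INT, customer_id INT, sales INT);
INSERT INTO customers (id, first_name)
VALUES (1, 'Maria'), (2, 'John'), (3, 'George'), (4, 'Martin'), (5, 'Peter');
INSERT INTO orders (order_id, customer_id, sales)
VALUES (1001, 1, 35), (1002, 2, 15), (1003, 3, 20), (1004, 6, 10);

-- Version 1: RIGHT JOIN keeps every row of the right table (orders).
SELECT c.id, c.first_name, o.order_id, o.sales
FROM customers AS c
RIGHT JOIN orders AS o
    ON c.id = o.customer_id;

-- Version 2: same rows with LEFT JOIN, after switching the table order.
SELECT c.id, c.first_name, o.order_id, o.sales
FROM orders AS o
LEFT JOIN customers AS c
    ON c.id = o.customer_id;
-- Both return 4 rows: all orders, with NULL customer columns for order 1004.
```

One practical note: some engines only added RIGHT JOIN late (older SQLite versions do not support it), which is another reason the LEFT JOIN form is the more portable choice.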
It's just that in the last 10 years, I have always tended to start from a table and then use a left join. And from my point of view,
the left join is way more common than the right join. And I think I have never written a query where I'm using a right join. So
my advice for you: always try to skip the right join and stick with the left join. Just get the order of the tables in the
query correct and you will get the same results. So with that you know an alternative to the right join. Now the first thing
you have to do is to go and switch the right to left. But this is not enough, because if I go and execute it, I get all the customers instead of all the orders. So now
all I have to do is to go and switch the tables like this. So we start from the table orders, because I want to
see everything from the orders, and then left join it with the customers. And of course we don't have to change anything
in the join condition. The order there doesn't matter because we have an equal operator. What is very important here is which table you
start from and which table you are joining with. So if you go and execute it, you will get the
exact same results. So now I'm seeing all the orders, I'm not missing anything, and only the matching customers. And I
prefer this way of solving the task instead of using the right join. All right. So that's all about the right
join. Next we're going to combine everything. We're going to talk about the full join. So let's
go. Okay. So now what exactly is a full join? If you use it, SQL returns everything, all the rows from both
tables. So now if you check our circles again: from the left table, we want to get everything, all the rows. So you will
get the whole circle. And as well from the right table you want to get everything, all the rows, the whole
circle. So you want to get everything, the matching and the unmatching, all the data from left and right. Now
let's check the syntax. It's going to be very simple. The join type here is going to be a full join. And in one respect the full join
is very similar to the inner join: you remember, the order of the tables is not important at all. So there is no
main table and secondary table here. Both of the tables are important and it doesn't matter in your query where you start.
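A minimal sketch of that syntax, using the customers and orders tables from the earlier tasks. The column names and the aliases `c` and `o` are assumed for illustration.

```sql
-- FULL JOIN: every row from both tables.
-- Where one side has no match, its columns come back as NULL.
SELECT
    c.id,
    c.first_name,
    o.order_id,
    o.sales
FROM customers AS c
FULL JOIN orders AS o
    ON c.id = o.customer_id;
```

In SQL Server, `FULL JOIN` and `FULL OUTER JOIN` are two spellings of the same join type.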
You can start from A full join B, or you can start from B and then full join A. You will get the exact same results. It
sounds simple. Let's go to SQL and practice the full join. All right. So now we have the following task and it
says: get all customers and all orders, even if there is no match. So now again we need the data from customers and
orders. But now of course, which type are we going to use? It says even if there is no match, but it didn't say no
match from orders or from customers. So you can understand from this task that we are not focusing only on the orders or the
customers. Both of them are equally important and we need all the data. So that means we need all the data from
left, all the data from right and we can go and use the full join. So now we have this query over here. We are starting
from customers and then joining to orders. But now instead of having left, we're going to say full join. So now
let's go and just execute it. Now if you look at the left side, you can see we are getting all the customers,
right? So we have our five customers. And if you look at the right, you can see all our orders. So with that we have
everything from left and everything from right. The matching data is just side by side in the results, and if there is
no match we are getting nulls. So actually with that we have solved the task. And again it doesn't matter how you
start. You can start from the orders and then join it to the customers and you will get the exact same results. So you
are getting exactly the same data. Now let's go and understand exactly how the full join is executed. Okay, again we
have the data of the customers and the orders and our full join. So now SQL is going to identify those columns
that we want to see in the results. So the ID and the first name, the order ID and the sales go to the
output. Now SQL is going to start from the left table, since the query starts with the customers. SQL is going to
simply take everything from the left table and present it in the output. Since it is a full join, we want to see
all the data from the left side. And now it starts searching for matches in the right table. So let's start with the
first customer. As usual, we will get the order of customer number one. And the same thing for the second
customer: we have a match here as well, so we will get it too. It's like the left join. And for the third one, we have as
well a match, and we're going to have it like this. And since we don't have orders for the last two customers, we will
get nulls in the output. So SQL is going to mark it with null. The same thing over here, as well for
the last customer. So we will get nulls for those two customers. And now of course SQL will not stop here, otherwise
we would get a left join effect. Now SQL is going to look at the right side to find any order that is not in the
output. So SQL is going to see: okay, the first order is in the output. The second one is in the output as well. The third too,
but the fourth one is not in the results. So SQL is going to take this order and put it in the output. So this
order has no match at all from the left side. And with that, if you look at the right side, you can see SQL is going
to be happy because we have all the orders from the right table. And of course SQL will not leave it like this.
Instead, SQL is going to show nulls on the left side. So there is no ID and there is no first name. So this is
exactly why we got these results. And this is how SQL executes the full join. Okay. So now if you are looking to
the use cases, I can say you can use the full join as well to recombine the data from multiple tables, if you
don't want to miss anything from all four tables: all data, the matching and the unmatching. But I don't usually use it
for data enrichment, the second use case. And we can use the full join in the last use case as
well, but with a little twist that we're going to learn later. So this is mainly where we can use the full join. All
right. So with that we have covered the basic types of joins: inner, left, right and full join. Those are the classical
joins for combining two tables. Now we're going to start talking about the advanced SQL joins. And now
we're going to cover the first one, the left anti join. So let's see what this means. Okay. So now what exactly is a
left anti join? Now in this mechanism we want to return the rows from the left side, the left table, that have no match in
the right table. So now looking at our two circles: from the left table we want to see only the unmatching rows. So
only rows that exist in table A but don't exist in table B. So if there is matching data, we don't want to
see it. And now from the right table we don't want anything. We don't want any data. So that means the only source of
your data is going to be the left table. And from the right table we don't need any data. We are just joining the tables
to do a check, to filter the data. So now for the syntax, this can be interesting. We don't have a special join type called left
anti join, at least in SQL Server, but we can still create this effect. Since we are saying left, we can use the type
left join and then as usual the join condition with the keys. But now if you leave it like this you will get the
effect of the left join. And we don't want that, because with the left join you will get the complete circle from the
left table. But now in order to remove the matching data, this overlap in the middle, what we can do is use a
filter, and in order to filter the data we use the WHERE clause. So now in order to get rid of the matching data we can
take the key from the right table and say the key must be null. So if the key is null, that means there is no match
on the right side. And if you do it like this you will get the effect of the left anti join: only the data in the left that
has no match on the right. So now let's go to SQL and create this effect. Okay. So now we have the following task
and it says: get all customers who haven't placed any order. So now looking at this task, clearly we are
focusing on the table customers, but we want to see the customers that didn't order anything. So they are in our
database, but the customers are inactive. Now there are different ways to solve this task, but we're going to
solve it using joins. Now let's go and start by just writing a very simple query where we are selecting everything
from the table customers. Now you can see these are our five customers. And now I want to check which of those customers
didn't order anything yet. Now since we are talking about the orders, we can go and join it with the table orders. So
we're going to say left join the table orders as o, and then we're going to go and connect the tables using the ID
with the customer ID. So now if you go and execute it, we are still seeing all the customers because we are using
the left join, and now we can see the order information of each customer. And you can see immediately that those two
customers didn't order anything, because we are seeing nulls here, right? So they are empty, there are no orders. Now we can
use this information in order to filter the data. I just want to see Martin and Peter. So what we can do is go and
say where, and all you have to do is to take the key that we are using in order to join the tables, this
one over here, and say this must be null, so is null. So if you say it like this, that means you want to see the data only if
the customer ID is null. So let's go and execute it. Perfect. Now you are getting the customers who haven't ordered anything,
and this is exactly the effect that we wanted, the left anti join. We are getting the data from the left side where there
is no match on the right side. So you always have to do it in two steps. First, join the data as you normally do using
the classical join, the left join. And then in the second step you go and use a filter using the WHERE clause. If you do
it like this you can check for non-existence, and with that we are getting the effect of the left anti join. So
that's it. Okay, so now if you look at this picture, I think you already know where we use the left anti join. We're
going to use it only in the last use case, where we are checking the existence. So if you use the left join together
with the WHERE, you can check for the non-existence of your data in another table. So this is exactly for this
scenario. All right. So that's all about the left anti join. Now we're going to speak about the exact opposite of that.
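The two steps of the left anti join can be sketched like this. The column names `id` and `customer_id` and the aliases `c` and `o` are assumptions, since the exact query is only shown on screen.

```sql
-- Left anti join effect: customers who haven't placed any order.
SELECT *
FROM customers AS c
-- Step 1: classical LEFT JOIN, keeps every customer
LEFT JOIN orders AS o
    ON c.id = o.customer_id
-- Step 2: keep only the rows where no order was found on the right side
WHERE o.customer_id IS NULL;
```

Because the filter keeps only unmatched rows, all columns coming from orders are null in the output, so in practice you may prefer to select only the customer columns.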
We will cover the right anti join. So it's going to be very similar, but we are just switching sides. So let's
go. Okay. So now what exactly is the right anti join? Well, it is the opposite of the left anti join. So we
want to return the rows from the right table that have no match in the left table. So again, if you look at
our two circles, what is important now is the right table. We want to see only the unmatching rows from the right
table. So only the rows that exist in B but not in A. And from the left table we don't need anything. So no data is
needed, and that means the only source of data is the right table, and you are using the left table as a filter, as
a lookup, just in order to check the existence. So now the syntax of that is going to be very similar to the left
anti join. So we don't have a special type called right anti join. We have to use the classical one, the right join.
But if you do that you will get everything from the right table. And now in order to get rid of the matching data
in the middle we use a filter. We use the WHERE clause, where we say we are interested only in the unmatching data.
So we take the key from the left table and we say the key from left is null. And if you do that you will get rid of
any matching data. Is null means there is no match. And again the same thing here: the order of the tables is very
important, since we are talking about sides, and you have to do it correctly. Okay. So now the task says
get all orders without matching customers. So now it is exactly the opposite. We want to see all the orders
that don't have a valid customer. So this is a really bad scenario: you have orders in your business without a valid
customer. So let's see how we can discover that using SQL joins. Now as you can see we are focusing completely
on the orders. It's not the customers anymore. And we want to see only the orders where there is no match with the
customers. So now again we have two steps here. In the first step we're going to go and do the normal join, so using either
the left or the right join. Now looking at this query, you can leave it like this, where you start from the
customers. But if you want to fully focus on the orders, you have to switch this from left to right. And with that
you will get all the orders and only the matching customers. And let's go and remove this WHERE clause from here first.
So I'm just turning it into a comment, and with that SQL is going to totally ignore this line of code. So let's go and execute
it. Now you can see we are getting all the orders, right, and data from customers only if there is a match. And now of
course this is not the task. We don't want to see all the orders. We want to see only the orders where we don't have
a match from the customers. So if you look at this, those three orders are okay. They are totally fine. We are
finding customers for them, so they have valid customers. But this order here is really bad. There is no valid
customer for this order, and now our task is to show only this type of order in the result. Now what we have to do is
to use the WHERE clause in order to get exactly that effect. So this time we're going to say that the ID of the customer
from the table customers must be null. So we're going to remove
the old condition here, take the join key from the customers, and say this ID must be null. So let's go and execute it.
Perfect. With that we have solved the task and we are getting the effect of the right anti join. We are now getting
those orders that don't have any customers. So we have solved the task. Now my friends, you have to go and solve
this task without using the right join, but you still have to get the same effect. You want to get exactly those
orders without customers. So pause the video and go solve the task.
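For reference, the right join version described above might look like the following sketch. As before, the column names `id` and `customer_id` and the aliases are assumptions.

```sql
-- Right anti join effect: orders without a matching customer.
SELECT *
FROM customers AS c
-- Step 1: RIGHT JOIN keeps every order
RIGHT JOIN orders AS o
    ON c.id = o.customer_id
-- Step 2: keep only the orders where no customer was found on the left side
WHERE c.id IS NULL;
```

Note that the filter uses the key from the left table (`c.id`), because that is the side that comes back as null when an order has no customer.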
Now again, as you know me, I don't like right joins. We can create the same effect if you switch the sides of the
tables. So if you put the B table on the left side and the A on the right side, then we will get the same effect if
you switch the type of join from right to left and just switch the tables. So you start from the B table,
since it's on the left side, and then join it with the A. And of course in our WHERE condition we still say that the
data from A is null, so there is no match. So if you do this you will get the exact same results as the right join
query, by using the left join and just switching the tables. So you will get the same results, and with that you know
that in SQL we always have alternatives. I hope that you are done. So it's very simple what you're going to
do. We're going to go and switch the joins, and since orders is the main table, we're going to start first from
the table orders. So we are putting it on the left side, and then the right table is going to be the customers. And of
course the condition is going to stay as it is. We want to see the orders where there is no customer. So we don't have
to switch anything here or in the join key. So let's go and execute it. With that you are getting the exact same
results. Since we are using the star here, the output always starts with the columns of the left table and then shows the data from the right
table, so only the column order is different. But still the result is valid. We are getting this type of orders without matching customers. And I prefer this
way. All right. So now with that we have the left, the right, and now of course what is next? We will get the full. So
let's speak now about the full anti join in SQL. Let's go. Okay. So now what exactly is a full
anti join? Well, this time we don't have sides. We want to return only the rows that don't match in either table.
So what does this mean? If you look at the left circle, we want only the unmatching rows. So we don't want the
whole circle. We want only the data that exists in A but doesn't exist in B, the right table. Sounds like the left
anti join, but since we are saying full, you have to do the same thing on the right side as well. So on the right
table we want only the unmatching rows. So we want to see in the result the data that is in B but doesn't have a match from
A. So it's exactly the opposite. And if you look at this, that means we want to see only the unmatching data, and this
is exactly the opposite effect of the inner join. In the inner join we were interested only in the matching data,
only where there is an overlap. But now with the full anti join it is exactly the opposite. We don't want to
see the matching data. We want to see everything else, the unmatching data. So how are we going to write this query? Again
here we don't have a special type called full anti join. We will use the help of the classical full join, so the basic
one. So you start from A full join B and then the same key. But now what is interesting is the WHERE
condition. Now we have two conditions, right? So now in order to get all data from A that has no match in B,
you have to make a filter where you say the key from the B table must be null. And now since we want the exact same
thing from the right table, we want all the data in B that has no match in A, you have to say as well that the key from the
A table must be null. So now we have two conditions here. And in SQL if you have two conditions in the WHERE
clause, you have two options: either use the AND operator or the OR operator. So now the one that we're going to use
here is the OR operator. So either the key from right is empty or the key from left is empty. If you do it like this,
you will get the effect of the full anti join. And of course since both sides are equal here, the order of
the tables is not that important either. So you can say from A full join B or from B full join A. It doesn't
matter. So now let's go back to SQL in order to create this effect. Okay. So now we have the following task and
it says: find customers without orders and orders without customers. So if you look at this, it means we want
to see only the unmatching data from customers and as well from orders. There is no main table and secondary table.
Both of them are equally important. So now since we are talking about the unmatching data and the anti join, we
have to do it in two steps. In the first step we're going to do the classical join, and then we focus on the WHERE
clause. So let me disable the WHERE clause by making it a comment. Now since we want the data from left and right, we're
going to go and use the full join. So let's go and execute it. Now you can see we are getting the effect of the full
join. We are getting all the orders and as well all the customers. But now we are interested only in the strange cases
where there are orders without customers, like this one here, and as well customers without orders. So that means
the first three rows are not really interesting for us, because they are boring. We have matching data here and this is
totally fine, but we are not focusing on that now. We are focusing only on cases where there is missing data from left or from
right. As you notice I'm saying or, and this is very important because we're going to use the OR operator. So now
let's focus on getting this scenario over here. We want to get an order without a customer. So that means the
ID from the customers table must be null. And we have it already here. So we are saying where the ID of the customer is null. So if I go
and execute it, I will get only one record, only this one over here. But I also want to get the opposite
scenario, the customers without orders. In that scenario, the customer ID from the orders table must be null. So we're going to say or the customer
ID in the orders is null. Or we can write it side by side like this: either the right side is null or the left side is
null. So if you go and execute it, you will get the effect of the full anti join. And with that we are finding
the customers without orders and orders without customers. I think this is really fun and as well really easy. So
this is how we do the full anti join. All right. So now if you look at the use cases, we use the full anti join
again exactly for the last use case, in order to check the existence. So if you combine the full join with the WHERE, you can
check the existence or the non-existence of your data in another table. So this is exactly the scenario for that.
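Put together, the full anti join effect can be sketched as follows. The column names `id` and `customer_id` and the aliases `c` and `o` are assumed, as in the earlier tasks.

```sql
-- Full anti join effect: customers without orders and orders without customers.
SELECT *
FROM customers AS c
-- Step 1: FULL JOIN keeps every row from both tables
FULL JOIN orders AS o
    ON c.id = o.customer_id
-- Step 2: keep only the rows where one of the two sides has no match
WHERE c.id IS NULL            -- orders without a customer
   OR o.customer_id IS NULL;  -- customers without an order
```

Using `AND` instead of `OR` here would ask for rows where both keys are null at the same time, which a full join on this key never produces, so the result would be empty.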
Okay, my friends, now we have a bonus section where I'm going to challenge you to solve the following task without
using an inner join. So, it says, "Get all customers along with their orders, but only for customers who have placed
an order, but without using an inner join." So, pause the video now and go and solve this
task. Okay, so now let's see how we're going to solve this. We want the
customers, the orders, blah blah blah. But we want only the customers who have placed an order. Previously, we
used the inner join in order to solve this task. But this time, we are not allowed to use it. So, let's go and
solve it. This is how I'm going to do it. Select star from the table customers, and give it an alias. So, now I'm
getting all the customers, but I am interested only in the customers who have placed an order. So, as we know from before,
there are two customers who didn't order anything, and we don't want to see them in the final results. Now how will we
get that? Well, we can use the help of the table orders in order to check the existence of our customers there. And of
course, I'm not allowed to use the inner join. So I'm going to go and use a left join with the table orders and then
combine them as usual, nothing new, with the customer ID. So now let's go and execute it. As you can see, we are doing
it step by step. You don't have to rush everything in one go. So you start simple, check the results and decide on
the next step. So now looking at these results, I want to get those three customers because they have ordered
something and we are seeing data about their orders, and I don't want to get the last two in the result. So again we
can still use the customer ID from the right table in order to decide which data is going to stay in the result and
which data should be filtered out. We're going to go and use the WHERE clause and then the key from the orders, and this
time we're going to say is not null. I know we haven't learned yet about NOT and the logical operators, but using
is not null means there should be data inside the column, it must not be null. If you do it like this and execute, you will
get the exact same effect as the inner join. So as you can see, as you are joining the tables using the left join, you can
control what you want to see using the WHERE clause, using the filter. And this is how you can solve this task without
using an inner join. Okay, so with that we have covered all those three scenarios in order to find the
unmatching data: the left, right and full anti joins. Now we can speak about one crazy join. We call it the cross join. This
one is totally different from all the other types that we have learned. So let's understand exactly what the cross
join is. Let's go. So now what exactly is a cross join? Now in some scenarios we want to combine
every row from the left with every row from the right. So that means I want to see all the possible combinations from both
tables. So we are doing something called a Cartesian join. So now if you look at our two circles, we want everything
from A and as well everything from B. So that means I want to see everything from A combined with everything from B. So in
this example, we have two rows in A and three rows in B. If you do a cross join, you will get six possible combinations,
by just multiplying the number of rows in A and B. So be careful using the cross join. If you use it, you can get
a crazy number of rows in the results and you're going to make the database really busy working out the result for
you. So now about the syntax, it's going to be the easiest. So you start as usual from one of those tables, the A for
example, and then you say cross join B. So now my friends, if you look at this, you can see it's not like the previous
joins that we have done. Before, we always talked about unmatching rows, matching rows and so on. But here we
don't care at all about whether the data is matching or not. I just want to see all the possible combinations,
everything. So since we don't care about matching the two tables, we don't have to specify any condition. So there is no
need to use the keyword ON, because we don't need any condition. So that's it. You just say cross join B and the magic
can happen. So this is a cross join. Let's go to SQL to try that. Okay. So now we have the following task. It says
generate all possible combinations of customers and orders. So that means we want everything with everything using
the cross join, and this is going to be very simple. So we're going to start with select star from whatever table. So you
can start from the customers and then you say cross join orders. That's it. Very simple. Let's go and execute it. So
now as you know we have five customers and four orders. And if you multiply them you will get 20
rows in the results. So now we are getting everything with everything, even if the data is not matching at all. So you can see for
example the orders here. So this is one order that belongs only to one customer, the customer with ID one. So it is actually an order
from Maria, but still we are seeing this same order with the other customers, since we want to combine
everything with everything. So there are no rules. The same thing for the next set. So this is the second order, which
actually belongs to John, but we are seeing this order with all customers. So that's it. This is how the cross join
works. And now you might ask me why we have this. It makes no sense, right? Well, my friends, I rarely use it. But
sometimes I want to generate test data, or maybe you have for example a table called colors and a table
called products and you would like to see all the combinations between the products and the colors. So in some
scenarios it really makes sense to see all your products together with all the colors without any matching conditions
or whatever. So there are a few scenarios for the cross join, if you are doing simulations or testing. So
this is how we do the cross join. Okay. So that's all about the cross join. And with that we have covered the four
advanced types of joins. Now if you look at this you might ask: okay, how am I going to choose between all those types? So
you might ask me: okay Baraa, how do you do it? Well, I'm going to show you now my decision tree that I usually follow in
order to choose the correct type. So now if I'm combining two tables and I want to see in the results only the
matching data between the two tables, then I go and use the inner join. We don't have any other type for that. So that's
simple. But now if I want to see everything, all the data, I don't want to miss anything after joining two tables,
then I take a different path, and here I ask myself: is there one side more important than the other? Am I interested
in all data from one table, from one side? If we have a main table or a master table like this, then I go and use the left
join. But if I want to see all the data from all tables in my query, everything, so there is no one table more important
than the other, then I go with the full join. So this is another path. And now the third path: if I'm interested to see only
the unmatching data, so I'm doing some kind of checks and so on. And here again the same thing: do I want to see
the unmatching data from only one side? If there is one table that is important, then I go and use the left
anti join. So I want to see the unmatching data from one table and I'm using the other table only for the
check. But if both of the tables in my query are important, there is no main table and secondary table, both are
important, then I go and use the full anti join. So actually that's it. This is the decision tree that I follow
usually as I'm writing a query. And you might ask me: how about the right join? Well, as you know me, I don't have it at
all in my decision tree. So I don't use it at all. Now looking at this, I can tell you that if I check most of the queries
that I write, very often I use the left join. So I can tell you this is my favorite way to join tables. So
let me show you exactly why. Usually I write queries in order to do data analysis. So in data analytics
you always have a starting point. You have a topic that you are analyzing, like the customer. So you
always have a master table. So I always start with the main table of my analysis. So in my query I start from
this table, from table A, the main table. And then what happens? The data in this table is not enough. I need some extra
data that comes from another table, like table B. So table B is only here as additional data for the master
table. So I go and use the left join in order to connect table B. And then I find other interesting information in
another table, in table C. So the same thing happens. I go and join the tables using the left join, and so on. So I keep
connecting multiple tables to this main table in the middle. And my query is going to look like this: always doing left
joins with multiple tables. Now, of course, you might say, "Yeah, but sometimes you would like to see only the
matching data and so on. So, it only makes sense to use the inner join." Well, in order to do that, I can control
everything that I want to see in the final results using the WHERE clause. So, in the WHERE clause, I define exactly
what I want to see in the final result. So, with that, I get more flexibility on whether I want to see the
matching data, the unmatching data and so on, like we did in the left anti join, right? So as I'm analyzing data I very
frequently end up with this setup, where I start from the main table and I left join all other tables, and with the WHERE
conditions I control the final results. So this is how I connect multiple tables together. So now if I want to visualize
this in circles, it's going to look like this. We have the circle A. So this is the master table, the starting point.
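The setup described above, one main table with several left joins and a WHERE clause to control the result, might be sketched like this. All table and column names here (`table_a`, `table_b`, `table_c`, `a_id`, `extra_b`, `extra_c`) are hypothetical placeholders, not names from the course databases.

```sql
-- table_a is the main table, the starting point of the analysis.
-- table_b and table_c only contribute additional columns.
SELECT
    a.id,
    b.extra_b,
    c.extra_c
FROM table_a AS a
LEFT JOIN table_b AS b
    ON a.id = b.a_id
LEFT JOIN table_c AS c
    ON a.id = c.a_id
-- Optional filter: keep only the rows that found a match in table_b,
-- which gives an inner-join-like effect for that table.
WHERE b.a_id IS NOT NULL;
```

Dropping the WHERE clause returns every row of `table_a`, while switching the filter to `IS NULL` turns the same query into an anti join check against `table_b`.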
I want to see all the data from table A, and then I left join it with another table B, and from table B I want to see
only the matching data. So it's like the left join. Now what is going to happen? I'm going to go and add another table, so
another circle, the circle C. And from the circle C, we want to see only the matching data. And of course you can
keep adding circles to this. But it's always going to be the same thing: each new circle is going to contribute only the
matching data. So now as we learned, we can use joins in order to combine multiple tables to get a complete big
picture about a topic like the customers. I would like to see everything about the customers in the final results. So
either you're going to do it like me, where you start from the main table and then go and left join all other tables,
or maybe you say: you know what, there is no main table about the customer's data, all the tables are equally important.
Then you can go and join all those tables using the inner join, if you are interested only in the matching data. So
what can happen, if you have those circles again: from the A you need only the matching data, from B you need as well
only the matching data, and as well from the third circle. So you are interested only in the overlap between all three
tables. So you will get only this section where you have an overlap between all three tables. So this is of
course another way to join multiple tables. Okay. So now my friends let's go back to SQL in order to
practice how to join multiple tables.

Okay. So now let's have a task. This is going to be a little bit challenging. We
will be doing multi-table joins using the SalesDB. Retrieve a list of all orders along with the related customer, product,
and employee details. And for each order display the following: the order ID, the customer name, the
product name, sales, price, and salesperson name. So there is a lot going on. And the first thing that
you're going to notice is that now we are using a different database. We will not be using MyDatabase, we're going
to go and use the SalesDB. So this is the first thing that we have to do. Instead of using MyDatabase, we say
USE SalesDB and then execute it. We are now connected to the SalesDB. So this is the first thing. So now if you
read this task, there are a lot of tables involved. We need the orders, we need the customers, products,
and employees. So there are four tables needed in this task, and we need different stuff from each table. So now
how do I think about it? Well, it is mainly focusing on the table orders, right? We need all the orders, we cannot miss
any order here. So to me this sounds like the main table. And then it says along with that we need other
information. So that means the other tables are not as important as the orders. So this gives me a feeling about
what the main table is, and this is going to be my starting point. So let's start from the table orders. So
SELECT star FROM, and here you have to pay attention that in this database the tables always have a schema. If you look
to the left side, it's called Sales, dot, the table name. So we have to write that now in our query. So we're going to write it
over here: Sales dot and then the table name Orders. Let's go and execute it. Now I know this is the first time that
you are querying this table. We have a lot of information here, and as well we have a lot of IDs. Those IDs are going to
help us, of course, with joining our data with the other tables. So what do we need from here? We need the order ID. So
we have it over here. We're going to get the order ID. This time the naming convention is different. We don't have
underscores and so on. We have a different type of naming, so be careful with that. So what else do we need? We
need the sales. So if you go to the right side over here, we have a column called sales, and we're going to go and
include it in the results. Now all the other information is actually not needed, but I need those IDs in order to
join with the other tables. So now what I'm going to do, I'm going to go and give the table an alias, O, and then I'm
going to go and assign it to each column. The order ID comes from the orders, and the same thing for the sales. So
that's it for now. And if I go and execute it, I will get the orders and the sales. All right, so that's all for
the first table. Let's go now and see what else we need. We need the customer's name. Well, actually we don't have this
piece of information in the orders. So all you have to do is go and explore the other tables in order to
find this column. So how I usually do it: I go and explore the tables like this. I write a simple SELECT from each
table, so the customers. Then I go and repeat this for each table inside the database. So we have the customers,
employees, the orders, the orders archive, and as well the products. So now I start exploring the tables. If I go
to the customers over here, we can see we have five customers, and we can see the names of the customers. So we
see the first name and the last name, and this is exactly what I need for my query. Now of course we have to go and
connect this table with the orders, so we need a common column. Usually it's going to be the ID. So here we have the
customer ID, and if you go and query the orders, you can find the customer ID there as well.

Now if you are working on big projects, you're going to have a lot of tables, maybe hundreds, and exploring each
one of them is going to be really hard. So
instead of that, a good project, a good database, usually has an entity relationship model, an ER model, like the one
that we have for the course. And here you can easily find the tables that you have inside your database and as well
the relationships between them, and this is very important, especially if you want to join tables. So now by just looking
quickly at this diagram I can understand, okay, there is an ID called customer ID inside the table orders, and it is a
foreign key to the primary key, the customer ID of the customers. So that means if I want to connect the orders with
the customers, I have to use that customer ID. So as you can see, this is really nice documentation, and I can quickly
understand how to join the tables. So now back to our query. Now what I'm going to do, I'm going to say LEFT JOIN.
So with that I guarantee all the orders are going to be present in the output, and I will always see 10 orders. So now
let's join it with the table customers, Sales dot Customers, and let's give it an alias like this. And now we're going to
build the join condition. So it's going to be the customer ID from the table orders equal to the customer ID
from the table customers, so that SQL understands how to match the two tables. And now the two tables are connected and
I can get the information from the customers. So let's go and get the first name and as well the last
name. So now let's go and execute it. As you can see, we have a customer for each order, which is really nice. So with
that we got the customer name and the order ID. Now the next one: we need the product name. So either you're going to
go here and start exploring, I think it is inside the table products, and here you can see we have the product, this is
the name of the product. Or if you check our ER diagram, you can see we can connect the table orders with the
products using the product ID. So we have the product ID on the left and as well on the right. And now we can go and
build this join as well over here. So again I go with a left join, I don't want to lose anything from the table
orders: Sales dot Products, and we give it the alias P. Now the condition for that, here you have to be very focused. You want to
match the product ID from the orders, so you say O dot product ID equal to the product ID from the table products. So
as you can see, in the joins we are always joining with the table orders, right? We are not trying to join, for
example, the customers with the products. We are always joining with the main table. So with that we have connected
the third table and we can get the information that we need. So we need the product, and I'm going to go and rename
it product name. So let's go and execute it. And with that, my friends, I'm getting now the product information
from the table products. So we have the sales as well, and we need the price. If you go to the products, you can see we
have price information as well. I forgot about it. So let's go and get it as well from the same table: price. So let's go
and execute it. And with that we have the prices as well. Now the last column: it says we want to get the salesperson name.
So the name of the employee, right? Now if you go and explore as well, we have here the employees table, and if we
execute it, you can see we have the first name and the last name of the employees, and we have an ID. So now we need
this ID as well in the orders. You can see we have the product ID and the customer ID, we already used those two. But
we have here one more ID called the salesperson ID. Of course, it is not called employee ID, so here you might be a
little bit skeptical about it. That's why we have to go and check our ER diagram again. And as you can see, the
employee ID from the employees is connected to the salesperson ID. So now I have a better feeling about it and I
understand, okay, I can connect the orders with the employees using the salesperson ID. So let's go and do that. I'm
going to say LEFT JOIN, so as you can see I'm only doing left joins, Sales dot Employees AS E, and the condition. Again,
very important: the first table is always included in the join condition. And here we're going to say the salesperson
ID is equal to the employee ID. So with that we have connected the employees as well, and we will get the first name
and the last name. So perfect, that's it. Let's go and execute it. And as you can see, guys, now we are getting the
name of the salesperson.

Now here comes an issue. As you are joining multiple tables and you are getting columns from different
tables, what can happen? You might encounter this scenario where you have the same column names in multiple tables. So
now as you can see, we have the first name and last name from the employees and as well the first name and last name
from the customers, and it's going to be really hard to understand from the result what we are talking about. Is
it the customer? Is it the employee? That's why in this scenario, if you have the same names, we have to go and start
giving aliases. So for the first one we're going to say customer first name, and for the last name we're
going to say customer last name. Same thing for the employee. So let's say employee first name, or we can call it
the salesperson, whatever, and employee last name. So if you go and execute it, now it's going to be clearer. Here we are
talking about the name of the customer and here we are talking about the name of the employee. And again, one more
thing: if you are not using table aliases, it's going to be an issue. So for example, if you go over here and you don't use the
table name before the column, so if I go and remove it and execute it, you will see I'm getting an error. Now SQL can't
understand what you are talking about. Is it the first name of the customer or of the employee? Because you are not
specific about it. So you have to tell SQL which table this column belongs to. It's very important to use the table name
or the alias before the column name, especially if the same column name exists in several tables. So if we put it
back, we will not get an error. And
with that we have solved the task. You really have to pay attention to the join keys. You have to get the condition
correct, because as you can see now we have a lot of tables and a lot of columns, and sometimes an issue happens
where you specify the wrong columns for the joins and the result makes no sense at all. So always double check that
you are using the correct keys in order to join the tables. So with that you have solved the task, and this is exactly how
I join tables. I always have a starting point from an important table, and everything else is going to be left joined,
and if I want to remove any scenario from my results, then I go and use the WHERE clause. So this is how I join multiple
tables.

Okay, my friends. So with that you have learned everything about how to join tables in SQL, and this
is very important to understand. Now moving on to the second method to combine your data from multiple tables.
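Before leaving joins, the multi-join task above can be sketched end to end. This is a minimal sketch, not the course's exact script: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the SalesDB on SQL Server, so the Sales schema prefix is dropped. The column names (OrderID, CustomerID, SalesPersonID and so on) are assumptions based on the walkthrough, and every row is invented sample data.

```python
import sqlite3

# In-memory stand-in for the SalesDB. Table contents are invented sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
CREATE TABLE Products  (ProductID  INTEGER PRIMARY KEY, Product TEXT, Price INTEGER);
CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
CREATE TABLE Orders (
    OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, ProductID INTEGER,
    SalesPersonID INTEGER, Sales INTEGER
);
INSERT INTO Customers VALUES (1, 'Kevin', 'Brown'), (2, 'Anna', 'Adams');
INSERT INTO Products  VALUES (101, 'Bottle', 10), (102, 'Tire', 15);
INSERT INTO Employees VALUES (1, 'Frank', 'Lee'), (2, 'Carol', 'Baker');
INSERT INTO Orders VALUES
    (1, 1, 101, 1, 10),
    (2, 2, 102, 2, 15),
    (3, 9, 101, 1, 20);  -- CustomerID 9 has no match in Customers
""")

# Orders is the main table; every other table is left joined to it,
# and every join condition goes through the Orders alias o.
query = """
SELECT
    o.OrderID,
    c.FirstName AS CustomerFirstName,
    c.LastName  AS CustomerLastName,
    p.Product   AS ProductName,
    o.Sales,
    p.Price,
    e.FirstName AS EmployeeFirstName,
    e.LastName  AS EmployeeLastName
FROM Orders AS o
LEFT JOIN Customers AS c ON o.CustomerID    = c.CustomerID
LEFT JOIN Products  AS p ON o.ProductID     = p.ProductID
LEFT JOIN Employees AS e ON o.SalesPersonID = e.EmployeeID
"""
all_orders = conn.execute(query + " ORDER BY o.OrderID").fetchall()
for row in all_orders:
    print(row)

# A WHERE clause then controls what stays in the final result,
# for example only the orders whose customer was found.
matched = conn.execute(
    query + " WHERE c.CustomerID IS NOT NULL ORDER BY o.OrderID"
).fetchall()

# Dropping the table alias from a column that exists in two joined tables fails.
try:
    conn.execute(
        "SELECT FirstName FROM Orders AS o "
        "LEFT JOIN Customers AS c ON o.CustomerID = c.CustomerID "
        "LEFT JOIN Employees AS e ON o.SalesPersonID = e.EmployeeID"
    )
    ambiguous_error = None
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)
print(ambiguous_error)
```

Because everything is left joined to Orders, the order with the unknown customer still shows up, just with empty customer columns, and the WHERE clause is what removes it when only matching rows are wanted.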
We have the set operators. So we're going to go and cover how to combine the rows from multiple tables. So let's
go.

All right, my friends. So now, as we learned before, in order to combine two tables we have two methods. If you want
to combine the columns, we use the joins. And we have learned all the different types of joins for combining data,
so we have covered this section. But now if we want to combine the rows of two tables, we can use the
set operators. And here we have four different types. We have UNION, UNION ALL, EXCEPT, and INTERSECT. So now we're
going to go and deep dive into this world of how to combine the rows of tables using the set operators. And of
course in this course we're going to cover everything. So let's go.

All right. So now let's have a look
at the syntax of the set operators. Okay. So now let's say that we have the following query: we are selecting the
data from the customers. So this is our first query, or our first SELECT statement, and we have another one which
is very similar, where we are selecting the information from the employees, and this is our second SELECT statement. So
now what we can do, we can put a set operator between those two queries, like for example the UNION. We can of course use
any other set operator, like UNION ALL, INTERSECT, EXCEPT, and so on. So as you can see the syntax is very
simple. We have two different queries and we just put the set operator between them. So this is what the syntax of
the set operators looks like.

All right, friends. So now we're going to talk about the rules of the set operators.
And we're going to start with rule number one, the SQL clauses. In each individual SELECT statement or query,
we can use almost all the SQL clauses, like WHERE, JOIN, GROUP BY, HAVING. But there is one exception,
ORDER BY. You can use ORDER BY only once and only at the end of the entire query. So that means we cannot use ORDER
BY in each SELECT statement or in each query. We can use it only once and only at the end of the entire query. All
right. So about the syntax again: here we have our two SELECT statements, and in between them we have the set operator.
So now in each query we can go and use multiple things like JOIN, WHERE, GROUP BY, HAVING. So we can make each query
as complex as we want. So everything is allowed except ORDER BY. The ORDER BY must always be placed at the end of
the entire query. So if you want to sort the result by the first name, you have to use the ORDER BY exactly at the end.
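Here is a small sketch of the syntax and of rule number one. It runs the SQL through Python's built-in sqlite3 module on an in-memory database as a stand-in for SQL Server, so there is no Sales schema prefix and the error text will differ from what SQL Server shows. The column names are assumptions and the rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER, FirstName TEXT, LastName TEXT);
CREATE TABLE Employees (EmployeeID INTEGER, FirstName TEXT, LastName TEXT);
INSERT INTO Customers VALUES (1, 'Mark', 'Stone'), (2, 'Anna', 'Adams');
INSERT INTO Employees VALUES (1, 'Frank', 'Lee'), (2, 'Carol', 'Baker');
""")

# Allowed: each SELECT may carry its own WHERE (or JOIN, GROUP BY, HAVING),
# and one ORDER BY closes the whole combined query.
rows = conn.execute("""
    SELECT FirstName, LastName FROM Customers WHERE CustomerID > 0
    UNION
    SELECT FirstName, LastName FROM Employees
    ORDER BY FirstName
""").fetchall()
print(rows)

# Not allowed: an ORDER BY inside the first SELECT, before the set operator.
try:
    conn.execute("""
        SELECT FirstName, LastName FROM Customers ORDER BY FirstName
        UNION
        SELECT FirstName, LastName FROM Employees
    """)
    order_by_error = None
except sqlite3.OperationalError as exc:
    order_by_error = str(exc)
print(order_by_error)
```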
So we are not allowed to use ORDER BY in each query.

Okay. Moving on to rule number two, the number of columns. The
number of columns in each query must be the same. Okay. So now in order to understand this rule, let's take this
very simple example. We're going to go and select the first name and the last name from the table Sales.Customers. So
this is our first query, our first SELECT statement, and let's say that I have another one, and we want to select
the first name and last name, but this time from another table, the employees. So with that we have our two queries, and I
would like now to go and combine them into one result. So we're going to go and use the set operator UNION. Let's go
and execute it. So now as you can see, in the result we get the first name and last name from two tables, the
customers and employees. And it is working because we are fulfilling the rule that says the number of columns
must be the same in both queries. So how many columns do we have in the first query? We have two, right? And as well in
the second query we have two columns. So that's why everything is working. So now let's go and break the rule by adding
another column to the first query. Let's say that I would like to have the customer ID as well in the first query,
and with that, as you can see, in the first query we have three columns but in the second we have only two. So let's go
and execute it. Now as you can see, in the result we get an error that says if you are using UNION, INTERSECT,
or any of those set operators, you must have an equal number of columns in both queries. So this is the rule: you have to
have the same number of columns. In order to repair it, I'm just going to remove the customer ID.
Okay. So here again we have two columns, and the second one as well has two columns, and everything is going to work.
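Breaking and repairing rule number two can be reproduced with a short sketch. As a stand-in for SQL Server it uses Python's built-in sqlite3 module, so the error wording differs from the message in the walkthrough; the column names are assumptions and the rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER, FirstName TEXT, LastName TEXT);
CREATE TABLE Employees (EmployeeID INTEGER, FirstName TEXT, LastName TEXT);
INSERT INTO Customers VALUES (1, 'Mark', 'Stone'), (2, 'Anna', 'Adams');
INSERT INTO Employees VALUES (1, 'Frank', 'Lee'), (2, 'Carol', 'Baker');
""")

# Broken: three columns in the first query, two in the second.
try:
    conn.execute("""
        SELECT CustomerID, FirstName, LastName FROM Customers
        UNION
        SELECT FirstName, LastName FROM Employees
    """)
    column_count_error = None
except sqlite3.OperationalError as exc:
    column_count_error = str(exc)
print(column_count_error)

# Repaired: two columns on both sides.
repaired = conn.execute("""
    SELECT FirstName, LastName FROM Customers
    UNION
    SELECT FirstName, LastName FROM Employees
""").fetchall()
print(len(repaired))
```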
Okay. Moving on to rule number three. The data types of columns in each query must match, or at least be
compatible. In order to check that, what we're going to do, we're going to go to the object explorer on the left side.
Let's go and browse the customers and the columns. And as you can see, we have here the first name and last name with
the same data type. We have varchar. And if you go to the employees, you can see as well the first name and last name
have varchar. So the first column is varchar in the first query and as well for the employees, and the
last name from the customers has the same data type as the last name from employees. So the data types are matching.
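The same check can also be done with a query instead of clicking through the object explorer. The sketch below is an analogy rather than the course's method: it uses SQLite's `PRAGMA table_info` through Python's built-in sqlite3 module, whereas on SQL Server the comparable lookup would normally go through `INFORMATION_SCHEMA.COLUMNS`. The helper name `declared_types` is hypothetical and the table definitions are assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INT, FirstName VARCHAR(50), LastName VARCHAR(50));
CREATE TABLE Employees (EmployeeID INT, FirstName VARCHAR(50), LastName VARCHAR(50));
""")

def declared_types(table, columns):
    """Return the declared data type of each listed column, in the given order."""
    # PRAGMA table_info yields one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    info = {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}
    return [info[name] for name in columns]

# Compare the two SELECT lists position by position.
first_query = declared_types("Customers", ["FirstName", "LastName"])
second_query = declared_types("Employees", ["FirstName", "LastName"])
types_match = first_query == second_query
print(first_query, second_query, types_match)

# Putting CustomerID in the first position would pair INT with VARCHAR(50).
broken_query = declared_types("Customers", ["CustomerID", "LastName"])
types_mismatch = broken_query != second_query
print(broken_query, types_mismatch)
```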
Now let's go and break this rule. Instead of having the first name, I would like to go and use the customer
ID. So now let's check the customer ID on the left side. It is an int, an integer. But the first name is
a varchar. So here we have a mismatch between data types. Let's go and try to execute it. So now we are getting an
error that says SQL is trying to convert the value Frank to an integer. So what does this mean? The first query is
always controlling everything, the names and as well the data types. So here we have an integer, and now SQL is trying
to convert the first name values to an integer as well, and of course it will not work because we have characters
inside and it cannot convert characters to an integer. So we have a mismatch between the data types of the customer
ID and the first name, and that's why we get an error. With the second column we don't have an issue because it is
varchar in the first table and as well in the second table. So now in order to repair it, either select the first name in
the first query, or we can go over here and say employee ID, and with that, if I execute it, we will not get any errors,
because the employee ID is an integer as well and we have a match in the data types. So as you can see, it's not enough
to have the same number of columns. You have to have matching data types as well between those two queries.

Okay, let's move to the next rule. Rule number four, the order of columns. The order of columns in each query must be
the same as well. Okay, so let's understand what this means. Now we have here again the same example, where we are
selecting the ID and last name from customers, and we are combining it using UNION with the employee ID and last name
from the employees. And as you can see, everything is working because we have the same number of columns and we have
matching data types. So now let's go and break it. What I'm going to do, I'm just going to switch those two columns. So
first I'm selecting the last name and then the customer ID. So again I have the same number of columns, and the ID is an
integer matching the ID of the employee, and the last names have the same data type. So let's go and execute it. So
here again SQL is going to throw an error that says SQL is trying to convert the value Goldberg to an integer. So it's
character to integer again. It will not work. So what happened here? I have here the same information. I have an ID and
last name, and an ID and last name. Well, SQL doesn't work like this. SQL is going to go and map the first column from the
first query to the first column of the second query. So it's going to go and map last name to employee ID. And
since they have different data types, SQL is going to throw an error. SQL doesn't know how to
map, let's say, the ID with the ID by name. So as
you can see, here we have the same information in customers and employees, but not in the same
order. SQL does not map the information by the names of the columns. It simply
maps the columns by position, like this: the first column from the first query with the first column from the second query.
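The positional mapping can be made visible with a sketch that deliberately uses a more forgiving engine. It runs through Python's built-in sqlite3 module; SQLite is dynamically typed, so where SQL Server stops with the conversion error described above, SQLite executes the query and shows where the values end up. Column names are assumptions and the rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER, FirstName TEXT, LastName TEXT);
CREATE TABLE Employees (EmployeeID INTEGER, FirstName TEXT, LastName TEXT);
INSERT INTO Customers VALUES (1, 'Mark', 'Stone'), (2, 'Anna', 'Adams');
INSERT INTO Employees VALUES (1, 'Frank', 'Lee'), (2, 'Carol', 'Baker');
""")

# Columns are paired by position: LastName <-> EmployeeID, CustomerID <-> LastName.
# SQLite lets this run, so the mix-up shows up in the data instead of as an error.
swapped = conn.execute("""
    SELECT LastName, CustomerID FROM Customers
    UNION
    SELECT EmployeeID, LastName FROM Employees
""").fetchall()
ids_in_first_column = sorted(row[0] for row in swapped if isinstance(row[0], int))
print(swapped)
print(ids_in_first_column)  # employee IDs that landed in the LastName position

# Same column order in both queries: IDs stay with IDs, names with names.
aligned = conn.execute("""
    SELECT CustomerID, LastName FROM Customers
    UNION
    SELECT EmployeeID, LastName FROM Employees
""").fetchall()
print(aligned)
```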
So as you can see, in this rule you must have the same order of columns. First the ID and then the last name, and
with that it's going to work again.

All right, moving on to rule number five, the column aliases. The column names that we
see in the output, in the result, are defined and determined by the column names of the first query, the first
SELECT statement. So that means the first query is responsible for naming the columns in the output. Okay. So let's
understand what this rule means. Again we have the same example: the customer ID and last name from customers, UNION,
employee ID and last name from employees. So if you check the output closely, you can see that in the output we have the
customer ID and not the employee ID, even though we have IDs coming from the employees as well. As you can see, the
first query is controlling the naming of the output. So since the first column is called customer ID, you will see it
in the output as customer ID. The naming in the following queries will be totally ignored. So that's why, if you
want to give aliases to the output, you're going to go and do it only in the first query. So for example, I go
over here and say instead of having customer ID, I would like to call it ID. So now if I go and execute it, as
you can see in the output, we get ID. So I don't have to go and give this alias in each query. I don't have
to go over here and say, "Yeah, you are the ID as well," because it's enough to define it in the first query. So
there's no need to give the same names in the following queries. Let's take another example where we would like to have an
alias for the last name. So I would like to have it like this, Last_Name, and let's go and do it in the second query. So
Last_Name. Let's go and execute it. So now as you can see in the output, we still have LastName, and there's no
underscore, because this is totally ignored by SQL. This is not the first query. The first query says you are LastName
without an underscore. So again, if you want to do that, we go over here. Let me just cut it and put it in the first
query. Let's go and execute it. So my friends, the first query is very important for giving the names in
the output. So if you want to do aliases and rename stuff, do it only in the first query. And the first query also
controls the data types.

All right. Now to the last rule, matching the correct information. If in your query you
fulfill all the other rules and you don't get an error in SQL, that doesn't mean that your result is accurate and
correct. You are the only one responsible for mapping the information between queries correctly, because SQL
doesn't understand the content and the information in your tables and your queries. And if you don't match the
information correctly between the queries, you will get inaccurate and wrong results in the output. Okay. So
now back to our example. Let's say I would like to get the first name and the last name from the customers
and the same information from the employees. Let's go and execute it. Now as you can see, it's very nice: we
are getting the first name and last name from both tables in one result, and we are fulfilling all the requirements in
SQL. Same number of columns, same data types, and so on. Now let's go and produce incorrect results. So what I'm going
to do, I'm just going to swap the first name and last name in the second query. So first the last name and then the
first name. So let's go and execute it. So now as you can see, we get results, because we are fulfilling all the other
rules: we have the same number of columns and we have matching data types. So the first one, the first
name, is character, and the last name is character as well. So SQL will just present the result as you define it. But
the result is completely wrong, because now, if you check the first column here, the first name, we can see last
names inside the first names. For example, Brown and Baker, those are last names, but we can see them inside the
first name. And the same thing in the last name. Now we can see first names inside it. Mary, Carol, they are all
first names. So as you can see, the result has really bad data quality. We are now mixing stuff and it doesn't
make any sense. But SQL will not know that, because SQL doesn't know the content of your data.
It's just mapping the data types. So first name is varchar, the last name as well varchar. Everything is fine and you
get the results. So my friends, you are responsible for having the same information mapped between the two
queries, and not getting an error from SQL doesn't mean that we have correct results. So pay attention to the
information that you are mapping between the two queries.

All right. So those are the rules of the set
operators. The first one is that ORDER BY can only be used once, at the end of the entire query. All queries
must have the same number of columns, matching data types, and the same order of columns. The first query always
controls the names and the aliases of the result set, and as well the data types. And the last rule is to make sure that
you are mapping the correct information to each other between queries. So those are the rules of the set
operators.

Okay. So what is UNION? UNION is going to go and return all distinct, unique rows from both queries. So that
means it's going to go and combine everything, and all the rows are going to be presented in the output. And since it
says all distinct unique rows, that means UNION is going to go and remove all duplicates from the combined result set.
So UNION is going to make sure that each row appears only once.

All right. So now let's take this very
simple example. We have two sets of data. We have the customers, where we have five customers with their first names,
and we have another set called employees, with the first names of the employees, and we have five
employees. And now if you take a look at the first names, you can see that we have the same persons as customers and as
employees. We have Kevin and Mary in both sets of data. So now how is SQL going to execute UNION? It's going
to go and return everyone from customers and everyone from the employees. But now, since we would have Kevin and Mary
twice in the output, we're going to have them only once. So this is how the UNION works. It is going to go and return
everyone from the two sets but without duplicates.

All right. So now we have the following task, and it says: combine the data from employees and
customers into one table. So that means in one table we want to combine all the information from employees and
customers. So which information do we need? This is the first question that I usually ask myself. So in order to do
that, first we have to explore the data. So SELECT star FROM Sales.Customers and then a semicolon. Then I'm going to write
another query, SELECT star FROM Sales.Employees, and a semicolon. So now why am I using two semicolons? Because
I'm telling SQL we have now two separate queries. They have nothing to do with each other. And if you go and execute
it like this, in the output you can see we got two result grids. The first result grid is for the first query
and the second one for the second query. So they have nothing to do with each other. I just want to explore those two
tables in order to understand how I'm going to map the information. So now if we check those two tables, you can see
that both of them have IDs. So we can map that information, right? Both of them have first name and last name as well. So
that means I can go and map the first name and last name together. Now in the customers we have country, but we don't
have this information in the employees. So we have to go and ignore it. And we have as well here a score, where we don't
have a score for the employees. That means I can go and map three pieces of information between the customers and
employees. Now of course we can go and think, do we really need the IDs? Because it doesn't really make sense to have
the IDs in the combined table. They are not unique anymore, because we have here the customer ID 1 and the employee ID 1.
So I think we can go and ignore them. So the only two pieces of information that are really useful to map are the first
name and last name. So now
let's go and add those two pieces of information. So we need the first name and last name, and the same information
from the employees as well. But now we want everything to be in one query. That's why I'm going to go and remove the
semicolons. And now we have to go and use a set operator between those two queries. And now in order to combine the
data we have two options, either UNION or UNION ALL. In this example the task doesn't mention anything about
duplicates and so on. I would like
to go with the UNION in order to remove the duplicates, if there are any. So that's it. Let's go and execute it. Now
as you can see, in the output we have only one result, because we have only one big query. And now we have the first
names and last names from the customers and employees. And now one more thing about the order of the queries. It
doesn't matter whether we start with the employees or with the customers, we will get the exact same results. But pay
attention to the naming of the columns. The first query always controls the names, but since now they have the same
naming, it should not be a problem. So if I go and switch those two tables and run it again, we will get the exact same
results.

So now let's understand how SQL combined the data using the UNION. Okay. So now we have here the
results from the first query and the second query, employees and customers, and we are combining the data using UNION.
The first step in SQL is that it's going to go and take the columns from the first query, which is from the employees.
So it's going to take the first name and last name as the column names for the result. And now the next step is that
it's going to go and start combining the rows of those two tables. So first it's going to go and take the rows from
employees, and it's going to check whether there are duplicates in the data. As you can see, we don't have any
duplicates here. So we're going to have the five employees. And now the next step is going to start adding rows from the
second query, from the customers, very carefully, without generating any duplicates. The first customer we don't have
in the output yet. That's why SQL is going to go and add it to the result, append it. And then the next customer, we
have Kevin Brown. As you can see, we have him already in the results. That's why SQL will not go and add him to the
result. Otherwise, it's going to generate duplicates. So SQL is going to ignore this customer. The same thing for
Mary. We have Mary as well in the results, so it's going to skip her. And then we're going to go to Mark. As
you can see, we don't have Mark in the results. That's why SQL is going to go and take this customer and put him in the
output. And then the last one, we have Anna. We don't have Anna in the results. That's why SQL can go and add her
to the results as well. And now with this, SQL has combined the rows of those two tables, and we have here eight persons.
So as you can see, SQL is combining the data, but very carefully, not generating any duplicates. All right. So that's it.
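The walkthrough above can be replayed with a short sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, runs the UNION query from the task, and then repeats the "append only if not already in the result" steps by hand for comparison. The manual loop only illustrates the outcome; it is not how a database engine actually removes duplicates. Some names come from the example, the rest of the rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER, FirstName TEXT, LastName TEXT);
CREATE TABLE Employees (EmployeeID INTEGER, FirstName TEXT, LastName TEXT);
INSERT INTO Employees VALUES
    (1, 'Frank', 'Lee'), (2, 'Kevin', 'Brown'), (3, 'Mary', 'Jones'),
    (4, 'Michael', 'Ray'), (5, 'Carol', 'Baker');
INSERT INTO Customers VALUES
    (1, 'Jossef', 'Goldberg'), (2, 'Kevin', 'Brown'), (3, 'Mary', 'Jones'),
    (4, 'Mark', 'Stone'), (5, 'Anna', 'Adams');
""")

employees = conn.execute("SELECT FirstName, LastName FROM Employees").fetchall()
customers = conn.execute("SELECT FirstName, LastName FROM Customers").fetchall()

# The task query: one combined result without duplicates.
union_rows = conn.execute("""
    SELECT FirstName, LastName FROM Employees
    UNION
    SELECT FirstName, LastName FROM Customers
""").fetchall()

# Starting with the customers instead returns the same set of rows.
swapped_order = conn.execute("""
    SELECT FirstName, LastName FROM Customers
    UNION
    SELECT FirstName, LastName FROM Employees
""").fetchall()

# Manual replay: take the employees, then append each customer
# only if that row is not already in the result.
manual = []
for row in employees + customers:
    if row not in manual:
        manual.append(row)

print(len(employees), len(customers), len(union_rows))
print(manual)
```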
This is how the union operator works. Okay. So now union all union union all going to go and return all rows from
both queries. So it's very similar to union. It going to go and combine all the rows and everything going to be
presented in the combined result set. But the big difference to the union all will not remove any duplicates. It is
the only set operators that doesn't remove duplicates and it going to show all the rows as it is. So if you have a
row 10 times from the query, you will find it as well in the output 10 times. Now you might ask me when to use union
and when to use union all. I'm going to say that there is one big difference between them: union all has
better performance and is faster than union. And that's because union all doesn't perform additional steps like
removing duplicates. So my friends, that means if you already know that there are no duplicates in your queries (you know
your tables, you know your queries, there are no duplicates), don't use union. Use union all, because you will get
better performance. Another scenario for the union all is that I would like to see the duplicates. I'm doing data
quality checks and I would like to see whether there are duplicates after I combine multiple queries. So in this
situation I use the union all as well. Now we have again the same example. We have the customers and employees and
we have as well the same persons, Kevin and Mary, as customers and as employees. So now if you want to combine
the data using union all, it is going to return all rows including duplicates. So that means SQL is going to execute
union all like this: it is going to return everything from customers and everything from employees, and Kevin and Mary are going
to be presented twice in the output. So as you can see, union all is returning all the rows as they are from the two
result sets, and if there are duplicates in the sets we will get duplicates in the output as well. So Kevin is going to
appear twice in the output, and Mary as well twice. So this is how the union all works. All right. So now we have
very similar SQL task and it says combine the data from employees and customers into one table including
duplicates. So it's exactly like the last task but this time in the task we are saying include duplicates. So we
cannot go and use union. We have now to go and use union all. We will have the exact same query. So we are selecting
the employees first last name and as well customers first last name. And now instead of using union, we're going to
go and use union all. So all what we have to do is that to go over here and say union all. So now pay attention to
this. As you can see in the union previously, we got eight records or eight persons from the output. So now
let's go and execute it and check the results. Now as you can see we got now 10 persons instead of eight. And that's
because we have five customers and five employees and we have duplicates inside the data. We have two duplicates. Now if
you check, we have Mary here and over here we have Mary as well, and the same goes for Kevin: we have Kevin over here and as
well here. So we have duplicates inside the data, and SQL just combines the two tables. Okay. So now we're going to
understand how SQL executes union all in order to combine data. All right. Again we have the two results from the queries. We
have the employees and customers, and SQL is going to do the same steps. First it is going to get the column names from the
first query and put them in the output. SQL is then going to take all the employees and put them in the output
without checking anything. So that means if there are duplicates in the data, they are going to be presented as well in the
output. It's very simple. Now it's going to go to the second step and take all the customers as well and append them
to the output like this. So that's it. It's very fast. It's going to just combine all the rows from the
employees and all the rows from the customers. And with that, we're going to get the 10 persons. And as you can see,
we have duplicates in the data. So we have Mary twice and Kevin as well twice. And that's why union all is the
fastest. It doesn't have any extra steps or checks. It is just taking all rows from all queries and putting them in the output. All
right. So as you can see it's very simple, right? So that's all for the union
all.

Okay. So what is except? Sometimes we call it minus in other databases, but in SQL Server we call it except. So it's
going to return the distinct rows from the first query that are not found in the second query. So from this
definition we can understand that the order of the queries can affect the final result. There is a first query and
a second query. So it is the only set operator where you have to pay attention to the order of the queries. And as well,
like union, it's going to remove the duplicates from the result set. All right. Again we have this very
simple example. We have two sets, five customers and five employees, and the same persons appear both as customers and as
employees: Kevin and Mary. So now we're going to combine those two sets using except, or as it is sometimes called,
minus. So it says it's going to return the unique rows in the first table that are not in the second table. So what is going
to happen? What is the first table? Let's say the customers on the left side. So here we have five persons:
Joseph, Mark, Anna, Kevin and Mary. So now the rule is we need the customers that are not employees. So it's fine for
Joseph, Mark and Anna because they do not exist in the second set. That's why SQL is going to return those three
values. But now for the two customers Kevin and Mary there is an issue. Kevin and Mary are members of the
second set, the second table, the employees. That's why SQL is going to exclude them from the output, because
they are not fulfilling the rule. So in the output we will get only three customers, and all the values from
employees and the common values between customers and employees will be excluded from the output. So this is how the
except works. All right. So let's have a very simple SQL task, and it says: find the employees who are not customers at
the same time. Okay. So let's see how we're going to solve that. We're going to stay with the same queries as usual.
We have the employees and the customers but instead of having union all we're going to use the set operator except. So
now since we are using except we have to make sure that the order of the queries are correct. So the first query is the
employees which is correct because we have to find the employees who are not customers at the same time. So we are
focusing on the employees. The first table is correct and the second table is customers. If the task says find the
customers who are not employees at the same time then we have to go and switch it. We have first to query the
customers. So now everything is correct. Let's go and execute it. And now in the output we see three employees who are
not customers at the same time. So we have Carol, Frank and Michael. But as we know, we have five employees. Kevin and
Mary are not here in the result because they are customers as well. So now let me show you what can happen if I
just switch those queries. So we start with customers and then with employees. Let's go and execute it. As
you can see, we're going to get completely different results. Now we are getting customer information. And now
in the output, we got three customers who are not employees at the same time. This is not what we want from this task.
So if you do it like this, it's going to be incorrect. So always pay attention here to the order of the queries. So now
let's go and correct it. So we're going to have first employees and then customers. Let's execute it. And now
let's go and understand how SQL executes the except operator. All right. So again we have the results from the two queries
or from two tables, and now we are doing except between them. So let's see how SQL is going to execute it. It's going to take
as usual first the column names from the first query, from the employees, and put them in the output. And now SQL is going to present
data only from the first query in the output. And it is going to use the customers only as a check. So SQL will
not put any data or rows from the customers in the output. It will just use the second query as a lookup in order to check the
data. So, it's going to start with the first employee, Frank. Do we have Frank in the customers? Well, no, we
don't have him. That's why it's going to accept him and put him in the output. And then in the next step, SQL is going
to go to the second employee, Kevin, and check. As you can see, we have him already in the customers. So SQL is going to
ignore him. He's not allowed to be in the output. The same thing for Mary. We have her as well in the customers. That's why
it will not be presented in the output. So Michael, we don't have a Michael in customers. That's why it can be
presented in the output. And as well for Carol, the same thing. We don't have Carol as a customer and we're going to
have it in the output. So as you can see, we will get data only from the first table, and the second table is only
going to be used as a check. So we don't have any customers in the output, only
employees. So now let's check quickly what is going to happen if we switch the tables. So now we have the customers as
the first table. SQL is going to take the columns from the first table, and it's going to start presenting the customer
information in the output, and it is going to use the employees only as a lookup. So do we have Joseph? We don't
have him in the employees. And then Kevin and Mary we have already in the employees, so they are skipped, and Mark and Anna are not part
of the employees. That's why SQL can present the results in the output like this. So now as you can see, SQL is
focusing on the table customers, and we are getting data from the customers, not from the employees. Employees is only used as
a check. So with that we understand that the order of the queries is very important for the except operator. We will get
different results if we have a different order. All right. So that's all for the except
operator.

Okay. So what is intersect? Intersect is going to return only the rows that are common in both queries.
It's something very similar to the inner join, and here as well it's going to remove duplicates. So there will be
no duplicates in the output. All right. Again we have this very simple example where we have five customers and five
employees, and now we're going to combine them using the intersect. So what intersect does is return
the common rows between two tables. So how is SQL going to execute it? It's very simple. SQL is going to search for
the common values. So what are the common values? It's Kevin and Mary, and SQL is going to return only those two
values, Kevin and Mary, and all others are going to be excluded from the results. It's very simple, right? It's going to
return only the common values, and this is how the intersect works in SQL. Okay, let's have this simple task, and it
says find the employees who are also customers. So we're going to have the same queries employees and customers but
instead of having except we're going to use intersect. Since we are finding the common information between
the employees and customers, it's very simple and straightforward. Let's go and execute it. And with that we're going to
get Kevin and Mary. These are the two persons that are at the same time employees and customers. And of course
here we don't have to pay attention to the order of the queries. It's going to be the same if we say find the customers
who are also employees. So if you just switch, for example, the customers with employees, you will see that we will
get the exact same results. So it doesn't matter which query is first. But again, pay attention to the first query,
which defines the column names. So now let's understand how SQL executes intersect behind the scenes. Okay, again
our two tables, and now we are doing intersect. So as usual SQL is going to take the columns from the first
query, and now it is going to find the common data between those two results. So it's going to do it row by
row. So we have the employee Frank. Do we have him as a customer? No. So he will not be in the output. Kevin Brown, we
have him in the employees and as well as a customer over here. So that's why we will get him in the output. The same
thing for Mary. So we have Mary as employee and as well as customer. So we're going to have her in the output.
Michael and Carol are not customers. They are only employees. That's why we will not get them in the
output. The same thing goes for the customers. Joseph, Mark and Anna we don't have in the output because they are not
employees. So with that we're going to get only the common information between the two tables or two queries, and it
doesn't matter whether we start with customers or with employees, we will get at the end the same information. All
right, so that's all. It's very simple, right? This is how the intersect works in SQL.
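
The behavior of union all, except and intersect on the employees and customers example can be compared side by side. This is only a sketch: it assumes an in-memory SQLite database driven from Python instead of SQL Server, and the table names, column names, the helper `combine`, and all last names except Brown are hypothetical.

```python
import sqlite3

# In-memory SQLite database, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (first_name TEXT, last_name TEXT);
    CREATE TABLE customers (first_name TEXT, last_name TEXT);
    -- Last names other than Brown are hypothetical.
    INSERT INTO employees VALUES
        ('Frank', 'Lee'), ('Kevin', 'Brown'), ('Mary', 'Smith'),
        ('Michael', 'Ray'), ('Carol', 'Baker');
    INSERT INTO customers VALUES
        ('Joseph', 'Gold'), ('Kevin', 'Brown'), ('Mary', 'Smith'),
        ('Mark', 'Stone'), ('Anna', 'Adams');
""")


def combine(first, operator, second):
    """Run `first <operator> second` and return the first names, sorted."""
    sql = (
        f"SELECT first_name, last_name FROM {first} "
        f"{operator} "
        f"SELECT first_name, last_name FROM {second}"
    )
    return sorted(row[0] for row in conn.execute(sql))


all_rows = combine("employees", "UNION ALL", "customers")
emp_only = combine("employees", "EXCEPT", "customers")
cust_only = combine("customers", "EXCEPT", "employees")
both = combine("employees", "INTERSECT", "customers")
both_swapped = combine("customers", "INTERSECT", "employees")

print(len(all_rows))  # 10: Kevin and Mary are kept twice
print(emp_only)       # ['Carol', 'Frank', 'Michael']
print(cust_only)      # ['Anna', 'Joseph', 'Mark']
print(both)           # ['Kevin', 'Mary']
print(both_swapped)   # ['Kevin', 'Mary']
```

Swapping the two queries changes the except result completely, while the intersect result stays the same.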
All right friends, so now we come to the part where I'm going to show you how I usually use the set operators in my
projects for data analysis or for data engineering. So here are the most important use cases for the set
operators. All right, the first use case is combining similar tables before doing data analysis. In some scenarios, we
want to generate a report and we end up writing similar queries on top of similar tables, and at the end we
join all the results from the queries in order to present the final report. And now instead of doing that, what we can do is
first combine all the similar information into one table, and then we can write on top of it one query, one
data analysis, in order to generate a report, and we can do that using the union or union all. Let's have a few
examples. So let's say that we have four tables: employees, customers, suppliers and students. So as you can see, all of
them share the same kind of information. They hold data about persons. So now let's say that you are generating a
report that requires all the individuals in the organization in the database. So what you're going to end up doing is
writing SQL query for the employees, another one for customers and as well for the suppliers and the students. And
then you're going to go and merge all the results from those queries into the final report. Now the issue with this
setup is that you are having a lot of queries, a lot of similar queries. So you have it here four times. And now
what might happen is that you go and change the logic of the first two queries and you forget later to do it
for the other two and you will get really inconsistent data in the reports. So instead of that what we can do we can
go and use the set operators in order to combine first all those tables in one big table. So what we're going to do
we're going to go and use a union in order to combine those four tables into the table persons. So we're going to
have it like this. So we will get all the rows from the employees and put it in the persons all the rows from the
customers from the suppliers and as well from the students and put everything in one big table that holds all the
information about the individuals that we have inside our database. And now the next step, after we combine the data, is that
we write an SQL query in order to analyze this new big table, and the result is going to be presented in the
reports. And now of course the advantage here is that we have only one SQL query for the data analysis on top of this
table instead of having it four times. And now if you go and change the logic of the SQL query, it is going to be applied
automatically on all the data that we have in the database. And we have done already this example where we have
combined the data between the employees and customers. Another scenario where we have to combine data before doing any
reporting is that sometimes the database developers tend to divide one big table into multiple small tables in
order to optimize the performance. For example, here they are splitting the orders by year. We have orders 2022, 2023 and so on. Now
again, if you want to generate a report in order to analyze the orders over the years, over time, either
you're going to make a query for each of those tables, or you're going to first combine all those tables into
one table called orders. So what we're going to do is use a union between all those tables in order to
generate one central table called orders. So all the rows from the first table, all the rows from the next table,
the next one and the last one. So, we're going to put everything in one big table, and once we have the orders, we're going
to write an analytical SQL query on top of the orders in order to generate the report. So, as you can see,
it's a very important step in order to prepare the data before doing data analysis. Okay. So now let's have the
following SQL task and it says the orders are stored in separate tables. We have the orders and orders archive. Now
combine all orders data into one report without duplicates. Okay. So by looking to the task we have to combine two
tables orders and orders archive. So either union or union all. But since the task says without duplicates that means
we have to go with the union. But now before we combine any data we have first to understand the content of the orders
and the orders archive in order to map the columns correctly. So first we have to go and explore the two tables. So
let's start with selecting the data from orders everything semicolon and as well from the second table sales orders
archive and as well semicolon. So let's go and execute it. So now in the output we get two results because we have two
separate queries. The first result is for the orders and the second one is for the orders archive. Let me just make it
a little bit bigger. And now as you can see we have almost identical tables. So as you can see we have the order ID,
product ID, customer ID. So everything looks like identical and of course we can go and check that using the object
explorer on the left side. So we have here the orders and those are the columns. And if you go to the orders
archive, you can see that we have the exact same columns. So that means we can go and map all columns from orders with
the all columns of orders archive. So let's go and do that. So I'm just going to remove all semicolons and then we're
going to go and use the union. So now we have everything in one query. Let's go and execute it. Now we will get in the
output one single result, one single table with all the information from orders and orders archive. So we have all
orders now in one table and everything currently is matching. So with that we have solved the task. We have one result
with all orders. We don't have any duplicates since we are using union and we have combined the data. But now we
have one issue with that. This solution, this query is quick and dirty and actually it's not following the best
practices. So now the best practices here is to list clearly all the columns in each query without using star. All
right. So now let's go and do that. Now we need a list of all columns from the table orders and the table orders
archive. And since we have a lot of columns, what we're going to do, we go to object explorer, right click on the
table name, and then let's go select the top thousand rows. So let's click on that. And now we're going to get a very
simple select statement where we have all the column names from the table orders. This is what I usually do if I
need all the columns in my select statement. So let's go and copy it and go back to our query. Then let's go
replace the first star with those columns. And we're going to do the same thing as well for the orders archive
since they have the same names. So let's go and do that as well. So let me just make this smaller in order to see the
query. So now we have a select for the table orders with all columns and as well a select with all columns for the
table orders archive. So let's go and execute it. And of course now we're going to go and get the same results.
Now you might ask why we are doing this. Why didn't we stick with the star? It's quick. It's simple. Well, for the
following reason. So currently the status is that everything is matching. We have 100% identical tables. But what
happens over time is that we do development in our solution and we might change the schema of the table
orders. So we might rename stuff, we might add new columns or maybe switch the columns. So this means the table
orders will over time no longer be identical with the archive. And this is of course a problem if you are mapping
the data blindly using the star, because set operators map columns by position, not by name. So now let me show you what I mean. Let's say that we are developing the table
orders and we just switch those two columns in the schema for some reason. So now we have the product ID first and
then the order ID. So let's go and execute it. Now if you are using star, you will not notice this change.
But if you are listing the columns in the script, you're going to see immediately that here we have first the order ID and then the product ID.
And here we have the opposite. So it's clearer to list the columns than to use the star. And now as you can see
in the output, we have a problem: here we have order IDs and then suddenly we have something like the
product ID. So we're going to have incorrect data, which leads to incorrect analysis. So here the best practice is to
not use the star and to clearly list all the columns. Now one more technique that I usually use once I'm combining data is
that I add the source of the data inside the query. So what do I mean with that? Now you can see that we have here two orders
with the order ID one. They are not duplicates, they hold completely different information, and that's because they
come from different tables. So what I usually do is add the source of each record. It's really nice information
for the analytics, for the users, to understand where these records come from. So how are we going to do that? We're
going to have, for example, as the first column the following static value, let's say orders, and we're going to give it the alias, let's
say, source table, and we're going to do the same thing as well in the second query. Right? So the source table
here is not the orders, it's the orders archive. So I'm just adding a static column to my query in order to see the
source of each row. So now we have here two different values. And let's go and execute it. And now you see we have
created a new column called source table where it has only two values. We have the orders and the orders archive. Let's
go and sort the data by the order ID. So order by order ID. So let's go and execute it. And now you can see it very
clearly. The first order order ID one comes from the table orders and the second one comes from the orders
archive. So this is really nice information that you can add to your data once you are combining multiple
tables. So that's all about this use case on how to combine data between different
tables. All right. Now we have another use case for the set operators. It's more for data engineers. We can use the
except in order to find the delta between two batches of data. For example, data engineers build data
pipelines in order to load daily new data from the source systems to a data warehouse or a data lake. Now, in those
data pipelines, we have to build a logic in order to identify what are the new data that is generated from the source
system in order to insert it in the data warehouse. One way to do it is to use the set operator except in order to
compare the current data with the previous load. Let's have a very simple example. So in the day number one we
have two customers one and two. So what going to happen in this day we're going to go and load those two customers into
the data warehouse. So in the data warehouse we will get as well one and two. So this is for the first day
nothing is crazy. We just load the data as it is. Now for the second day we will get the new data from the source system
and it's going to look like this. So now if you check the second day you can see that we have again the customer number
one we have already loaded to the data warehouse. So we have it as the previous day but we have a new customer ID number
three. So now in order to load only the new data we don't need to load again the customer number one. What we can do? We
can do an accept between the day number two with the previous load with the day number one. So now if we simply do an
accept between those two sets we're going to go and identify the new data that is existing in the source system
which is only the record number three. So now what going to happen if we do except between day two and day one we
will get one record the new record that we're going to go and insert it inside our data warehouse. So as you can see
this set operator except is very powerful in order to compare two sets and not only for data analysis we can
use it as you can see for data engineering in order to identify what is the new data that is generated from the
sources in order to insert it inside our data warehouse. Okay, one more use case for the set
operators that I personally use a lot in my projects is that if you are doing data migrations, you can use the except in
order to check the data quality and more specifically we can use it in order to check the data completeness. Okay, so we
have the following scenario where we are doing data migrations between two databases. So let's say that we would
like to move this table from database A to database B. So we're going to go and load the table to the new database. And
now what is very important after you move the data is that to check whether all the records did move from database A
to database B we are not missing anything even one record. So we want to do data completeness test and there are
many methods on how to do this test. One of them is to use that set operator except. So how we going to do it? We're
going to do an except between the table from database A and the table from database B in order to find any record
that is still in database A which is not migrated to the database B. And of course the best result is that we will
not get anything. The result should be empty. If we get an empty that means all the rows from database A exists in the
database B. And now of course we are not done yet. We want to do the comparison the other way around as well. We want to find any
rows in database B that we don't find in database A. Those two tables must be identical. So now what
we're going to do, we're going to do an except but the first table going to be from the database B. And then we're
going to compare it with the database A. And we have the same expectation. The output should be as well empty. And now
after doing the except twice for both sides and we are getting empty in the results. That means those two tables are
identical and we are not missing anything. So this is another amazing use case for the set operators in order to
improve the quality of your data migrations and in order to do data completeness
test.

Okay. So now let's have a quick summary about the set operators. So a set operator is going to combine
the rows of multiple queries, multiple tables, into one single result. And we have four different types of set
operators. The first one is the union, where it's going to combine all the rows but without including any
duplicates. The second one is the union all. It's very similar, but it keeps the duplicates. And the third one is the except. It's going
to show the distinct rows from the first query that cannot be found in the second query. And the fourth one is the
intersect, where it's going to show the common rows between two queries. And of course we have SQL rules in order to use
the set operators. Both of the queries should have the same number of columns, the same data types and the same order of
columns. And the last rule, don't forget that the first query controls the aliases, the names of the columns and the
data types of the entire result. And we have found amazing use cases for the set operators. Like for example, using union
and union all in order to combine similar information into one big table. Or we can go and use the amazing except
operator in order to compare two different results in order to find the differences between them. And I usually
use it in order to do data quality checks to test the data completeness. And another use case as a data engineer
you can go and implement the except in your logic in your data pipelines in order to identify what are the new data
that must be inserted in your system. Okay my friends. So with that we have learned all the set operators that we
have inside SQL. And with that you have learned how to combine your data from multiple tables using SQL. So we are
done with this chapter.

Now we're going to go to the right side. So now we're going to start talking about the
functions in SQL. And here we have two big families. The first one is the row-level or single-value functions. And
the second one is the aggregate and analytical functions. So let's start with the first one, the row-level functions.
And here we can group them into multiple categories. And we will start now with the string functions. But first let's
understand what exactly functions are and why we need them in SQL. So let's go. Okay. So what exactly is a function
and why do we need it? Now again we have our data inside the table. Now there is a lot of stuff that you can do with
your data. So sometimes you have to change the values of your data, like doing data manipulation, or you want to
do some aggregations and analysis. So maybe you want to analyze your data and find insights and maybe build reports,
and sometimes you might find bad data inside your tables and you want to clean that up. So you want to do data
cleansing and sometimes you have to do data transformations and data manipulation on our data in order to
solve some SQL tasks and in SQL in order to solve those tasks we have functions. So again what is exactly a function? It
is a built-in code block that accepts an input value. Then the function going to go and process this value and it going
to return a result an output value. So you give an input value do some transformations and give an output. And
we can group the functions into two big categories. The first one we call single-row functions. So you give the
function only one value and in return you will get one value as well. So the input for the function is going to
be only one single value, like Maria, and the output of the function is going to be a single value as well. So one value
in, one value out. And now the other category of functions we call multi-row functions. So for example, if
you have the function sum, this function accepts multiple rows, multiple values. Say it gets 30, 10, 20, 40. The function is
then going to summarize all those rows and return in the output only one value. The sum of all those
values is going to be 100. So the input is multiple rows and the output is one single value. So those are the two main
categories of functions in SQL. Now my friends, you have to understand something about the functions: you
can nest functions together. So you can use multiple functions together in order to manipulate one value. And
this technique is not only in SQL in any programming language. So let's have this example. We have the function left. It's
going to go and extract like few characters. Let's say two characters. So the input for this function let's say
it's Maria. This value going to enter the function. The function is going to go and extract the first two characters.
And in the output we will get only two characters m a. So this is one function. We have an input and output. Now you
might say you know what we have multiple steps on this value. So the first step we want to extract the first two
characters using the lift function. But we have a second step. So we want to transform this output into a lowercase
characters. So we have another function lower and the input for this second function will be the output of the first
function. So ma it is at the same time output and input for another function. So the lower function going to take this
value and convert it into lowerase character. So it's like inside the factory the materials going to be
processed into multiple stations and the output of one station going to be the input for the next station. And this is
exactly what we can do with the functions. So now how we going to build that? The first step is to start with
the first function. So this is simple one function. Now for the next step what you're going to do on the left side
you're going to write lower and put the whole thing in parentheses. So now the whole thing the first function going to
be inside another function and with that you have nested one function in another and of course if you need a third
function like for example the length what you're going to do you're going to put the whole thing again between two
parentheses. So now that means the output of the LEFT going to go to the LOWER and the output of the LOWER going
to go to the length. So it is very simple and the order of the execution for this will always start with the inner
function. So the LEFT function going to be executed first and then the outside function, the LOWER, and the last function
that's going to be executed is the length. This is how nested functions work in SQL or in any programming
language. Now my friends in SQL we have a lot of functions that's why we have to group them as well into subcategories.
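Before going through those subcategories, here is a minimal sketch of the nesting order described above. It assumes SQL Server syntax, where the length function is written LEN, and it uses the literal 'Maria' from the example. The column aliases are made up for illustration.

```sql
-- Each column adds one more function around the previous expression.
-- SQL evaluates from the innermost function outward.
SELECT
    LEFT('Maria', 2)             AS step_1,  -- 'Ma'
    LOWER(LEFT('Maria', 2))      AS step_2,  -- 'ma'
    LEN(LOWER(LEFT('Maria', 2))) AS step_3;  -- 2
```

Reading step_3 from the inside out gives the same order as the factory stations: LEFT first, then LOWER, then LEN.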
Like if you are talking about the single row functions, we have functions for the string values and as well for the
numeric, the date and time and as well functions in order to handle the nulls. And if you are talking about the
multirow functions, here we have basically two groups. The first one is the simple aggregate functions. Those
are the basics in order to aggregate your data. And we have another advanced one. We call it the window functions or
sometimes we call them analytical functions. And now my friends, it
is very important to understand those functions because using them you can do whatever you want with your data. And if
I'm looking at those two groups, the single row functions, those are functions in order to
manipulate and prepare the data for the second group. So if you are thinking about data engineers and data analysts
the data engineers going to go and prepare the data in SQL using the single row functions. So you're going to use
them in order to clean up, transform, manipulate your data in order to prepare it for the analysis. And if you are a data
analyst, you will be mostly using the aggregate functions in almost every task. So I really see it like this. The
single row functions for data engineers and multirow functions for data analysts. And my friends, what we're
going to do in this course, we're going to visit each of those subgroups one by one, exploring the functions,
understanding how they work and when we're going to use them. So let's start with the first group, the string
functions. And here we're going to learn how to manipulate the string values. So let's
go. Okay. So now since we have a lot of string functions, I'm going to go and divide them into categories based on the
purpose. So for example, we have a group of functions that's going to go and manipulate the string values. So we have
concatenation, upper, lower, replace, and so on. And another group where we have only one function. It is where we
can do calculations on the string values. And the last group, it is all about how to extract something from a
string value. And here we have three functions left, right, substring. So now let's go and start with the first group
about the data manipulation. And the first function we have here concat. All right. So what is exactly
concat or concatenation? It's going to go and combine multiple string values into one value. So if you have multiple
things you can put everything in one value. So let's have a very simple example. Okay. So now let's say that you
have one value called Michael. So here you have a first name and you have totally separated value for the last
name another column where you have a value like Scott. And now you say you know what it makes no sense to have the
first name separated from the last name. I would like to go and combine them in one value. So you can go and use the
concat in order to combine those two values or multiple values into one single value like Michael Scott.
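A minimal sketch of that idea, assuming SQL Server syntax and using the two static values from the example instead of a real table (the aliases are hypothetical):

```sql
-- CONCAT joins its arguments in the order they are given.
SELECT
    CONCAT('Michael', 'Scott')      AS no_separator,  -- 'MichaelScott'
    CONCAT('Michael', ' ', 'Scott') AS full_name;     -- 'Michael Scott'
```

CONCAT does not add a separator by itself, so the space is passed as its own argument between the two names.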
I think that pretty much sums it up. So it is nicer to see the full name in one value instead of having like two columns
for that. So that's it. This is why we need the concatenations. Now let's go back to SQL in order to try that out.
Okay. So now we have the following task. Show a list of customers first names together with their country in one
column. So that means we have to make a list of customers and we have to combine two columns in one. So let's start
writing the query. Select. We need the first name, the country from the table customers. So first let's go and execute
this. Now as you can see we have a list of customers but the issue here: the first name and the country, those two
pieces of information are in different columns but the task says they should be in one column. So now in order to combine those
two things we have to use the concatenate function. So concat. So I'm going to start with the first argument.
It's going to be the first name and then the country like this. And we're going to give it a name. Let's call it like
this name country. Now let's go ahead and execute it. Now in the output you can see we have a new column. It's
called name country and we have both pieces of information in one column. So we have MariaGermany, JohnUSA. But it
doesn't really look good because there's like no spacing between them. Now we can go and make some separation between them
by just adding one more thing in between like for example maybe a space. So now we are concatenating the first name
together with a space this over here and then the country. So let's go and execute it. Now as you can see we have
nice separations between the first name and the country. And of course you can go and add different separations like
maybe a minus or an underscore and you will get the same effect. So with that we have a list of customers where we
have the first name together with the country in one column. As you can see it's very simple. This is how you
combine two columns in one. It is really nice and easy transformation. Okay. So that's all about the concatenation in
SQL. Next we're going to talk about two functions: UPPER and LOWER. Okay. So what is the UPPER function?
It's going to go and convert all the characters of a string to uppercase. It's going to make everything
capitalized. And the lower function is exactly the opposite. It's going to go and convert everything to a lower case.
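As a minimal sketch of the two functions, assuming SQL Server syntax: the three sample values below ('Maria' in three different cases) are an assumption chosen for illustration, and the inline VALUES list stands in for a real table.

```sql
-- The same word in three different cases, run through both functions.
SELECT
    v.val,
    UPPER(v.val) AS upper_val,  -- 'MARIA' for all three rows
    LOWER(v.val) AS lower_val   -- 'maria' for all three rows
FROM (VALUES ('Maria'), ('maria'), ('MARIA')) AS v(val);
```

A value that is already in the target case comes back unchanged.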
So let's have very simple example for those two functions. Okay. So now we have like three values with different
cases. The first one where you have only the first character capitalized and the rest is lowercase, and then the same value
but everything is lowercase, and a third one where you have everything in uppercase. Now if you go and apply the
function UPPER to those three values, what's going to happen? For the first value, it's going to go and turn it into
uppercase. So everything going to be capitalized not only the first character. And now for the second value
it's going to turn it as well to completely capitalized. So all the characters going to change. And for the last value it is
already capitalized. So in the output you will get the same value. So actually nothing going to happen for that. So
this is simply the uppercase. Now let's see what can happen if you use the lower case. For the first value only the first
character going to be changed and then you will have everything in lower case. The second value it is already a
lowercase value. So if you apply lower case nothing going to happen. You will get the same value. But for the last one
everything here is capitalized and if you apply lower case all the characters going to convert to a lower case. So my
friends this is very simple. Let's go back to SQL in order to practice that. Okay. So we have the following
task and it says transform the customer's first name to lowercase. So now as you can see the first names here
the first character is a capital the rest is lowercase. So now in this task we have to convert the whole thing into
lower case. So let's go and do that. It's very simple. We're going to say lower first name and let's go and call
it low name. So that's it. Let's go and execute it. Now if you go and compare the lower name with the first name, you
can see all the characters now in the lower case. So that's it for the task. We have transformed the first name to
lower case. All right. The next task is exactly the opposite. Transform the customer's first name to uppercase. So
let's go and have a new column. We're going to say upper then the first name as up name. So that's it. It's
very simple. Let's go and execute. Now you can see in the output we have a new column called up name and inside it we
have the first name but now all the characters in upper case. So this is how you convert the case to lower or to
upper in SQL. Okay. So that's all about the upper and the lower. Next we're going to talk about very interesting
function. It is the trim. So the trim function going to go and remove the leading and trailing
spaces in your string values. So it's going to go and get rid of the empty spaces at the start and at the end of a
string value. Let's have very simple example. Okay. So now we're going to have different scenarios. The first one
you can have like a value John where you don't have any spaces and this is the normal case. But sometimes you might
have it like this where at the start you have a leading space. You have an empty space or sometimes we call it white
space. In another scenario the space might be at the end of the word. So here we call it trailing space and in another
scenario you might have both of them. This is really bad. where at the start you have the leading space and at the
end you have the trailing space. And of course you might not have only one space, you might have multiple spaces
depending on how long the user pressed the space bar, right? So of course my friends spaces are really evil and it
makes no sense to have them in your data. Now what you have to do is to do data cleansing. We have to clean up this mess
and you have the best function in order to clean up the data. You have the trim. So if you apply trim for the first
value, nothing going to happen because everything is clean and we don't have any spaces. Now if you apply it for the
second case where you have a leading space if you do that SQL going to go and remove this space. The same thing for
the trailing space. So if you have space at the end the trim function going to find it and clean that up. And if you
have it at the start and at the end then it's as well no problem. It's going to go and clean that up. And as well the
trim function can go and clean multiple spaces. So if you have like five spaces 10 spaces at the end or at the start the
trim function going to go and clean that up. So this is how the trim works. And now let's go back to our scale in order
to find out whether we have any spaces. Okay. So now we have a very tricky and interesting task. It says find the
customers whose first name contains leading or trailing spaces. So now by looking to those values we have to find
any spaces inside the customer's name. Now by just looking to this results you will not find any white spaces because
it's really hard to see especially if it is like trailing spaces. Now we have to write a query in order to detect any spaces
in the names. So how we can do that? Okay. So now think about it a little bit and I can give you a hint. You can use
the function TRIM in order to remove any white spaces and you have to use it inside a WHERE clause. So what we're
going to do we're going to say WHERE. So now we have to build a condition to detect any spaces. So we are saying:
the first name is not equal to itself, the first name, after applying a TRIM. So after trimming the first name, if it is
not equal to the first name, that means there were spaces. So again what is going on here? Let's go for Maria. If
Maria has no spaces, if you trim this value nothing going to happen. The value going to stay exactly like before
because there are no white spaces. But if in Maria there is any leading or trailing space, the trimmed value will not be equal to
the first name. So if the column is not equal to the same column after trimming it that
means there are spaces. So let's go and execute it. And now we can see in the output we have one customer John where
we have this situation. Now if you don't believe me or you don't follow me here we can have another easier check. So
let's go and comment this out and let's have a look to our first names. Now we can go and calculate the length of the
first name like we have done before. So length name and let's go and execute it. Now if you can see here Maria we have
five characters but John we have here four characters but the length is five and that's because we have somewhere
space and the space going to count as a character. So here there is like something wrong right and you can check
the others as well everything is matching but only John we have here an issue and now in order to see this more
clearly we're going to use two functions the trim and the length. So first let's go and trim the first
name. And after trimming the values, I'm going to calculate the length. So we are nesting together the trim and the
length. And I'm going to call it length trim name. So let's go and execute it. Now we can see the length before
trimming any value. And we can see the length after trimming the values. So you can see over here that John before
trimming is five and after trimming is four. So we have here an issue. Now we can make things more clear where we can
go and subtract from the length of the first name the length of the first name after trimming the values. So here we
can call it maybe a flag or something. So let's go and execute it. Now by looking at the flag it is really easy
now to see: if we have a zero then everything is fine. We don't have any white spaces. But if we have higher than
zero like here one then this is an indicator that we have a white space. Either you do it like this where the
first name is not equal to the first name after trimming, or you use a more complicated solution where you say WHERE,
and I'm going to remove this from here, the length of the first name is not equal to the length after trimming, so
not equal. So if you go and execute it you will get exactly John again. So this is how we detect any empty spaces inside
our data using the trim function or maybe as well using the length but I really prefer the first solution it is
way easier using one function. All right, so that's all about how to remove the empty spaces using the trim. Next,
we're going to talk about very important function called replace. Now the replace function going
to go and replace a specific character. So that means we have something old and we want to replace it with something
new. Let's have a very simple example to understand it. All right. So now imagine we have a phone number where the data is
split by dashes. Now let's say that I don't like to have the dash in my data. I would like to have a slash, like any
other special character. Now in order to replace the dash, we can use the function replace. So we have to specify
for SQL two things: the old value, the dash, and the new value, the slash. So if you do that in the output it's going to
go and remove all those dashes between the numbers and the replacement going to be the slash between them. So it's very
simple, right? All what you are doing is replacing an old value with a new value and that's why we call it replace. But
we can use this function as well in order to remove something, not only to replace, and you can do that by not
specifying anything in the new value, like just the single quotes, and with that it's going to be nothing, a blank.
So now what's going to happen is it's still going to go and replace the dash with a blank and that means I'm just removing
the dashes from the output. So if you do it you will remove the dashes and you will get only numbers. So if the replacement
going to be a blank then that means this function will be removing any value that you specify. So this is exactly how
it works and this is why we use the replace function in SQL. Now let's go back in order to practice. So let's do
the same example. This time we're going to go and select from a static value. So we're going to get 1 2 3 4 5 6 7 8 9 0.
So if you go and execute it, you can see we are getting the phone number. Now let's go and remove the dashes from this
value. So let's have a new line and we start with replace. The first thing that you have to specify for SQL the value
itself. So let's go and get the value. This is the first argument. The second argument going to be the old value. So
the old value going to be the dash. And now the third argument will be the replacement. And since we want to remove
it, we don't want to replace it with anything. We will have just single quotes and nothing between them. So
there's no space between those single quotes. Now we can go and rename stuff like this is the phone. And this is a
clean phone. Let's go and execute it. Now, as you can see in the output of the function, we don't have any dashes
between the numbers. And you can go and test stuff. Like for example, I can go and add a slash and execute it. You will
see slashes between them. So you can go and try multiple stuff. So this is one nice use case for the replace function.
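The phone number walkthrough above can be sketched like this, assuming SQL Server syntax. The exact formatting of the literal ('123-456-7890') is an assumption, since only the digits and the dashes are mentioned, and the aliases are illustrative.

```sql
-- REPLACE(value, old, new): every occurrence of old is swapped for new.
SELECT
    '123-456-7890'                    AS phone,
    REPLACE('123-456-7890', '-', '')  AS clean_phone,  -- '1234567890'
    REPLACE('123-456-7890', '-', '/') AS slash_phone;  -- '123/456/7890'
```

Passing two single quotes with nothing between them as the third argument turns the replacement into a removal.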
Now there is another use case for the replace function is that sometimes in my data file names going to be stored like
for example, let's say report.txt, and now let's say that I would like to change the file format from .txt to .csv.
Now how we're going to do that we're going to go with a new line say replace and then the first argument going to be
the value. So let's take our value from here and now what is the old value it's going to be the txt and I want to
replace it with another format with another extension. So it's going to be the CSV. So we're going to say this is
the new file name and this is the old file name. So let's go and execute it. And now as you can see in the output SQL
did replace the txt with csv. This is as well where I use the replace function in my projects. So my friends the
replace function is really fun and those are two nice use cases for the replace. All right. So that's all about the
replace function in SQL and with that we have covered the whole data manipulation group. Now in the next group we're going to talk
about the calculations. And here we have only one function the length. Now the length function it's
very simple. It's going to go and count how many characters you have in one value. So you are calculating the length
of a value. Let's have very simple example to understand it. Okay. So now let's say that we have the value Maria.
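For reference, the three cases covered in this part (a text value, a number, and a date) can be sketched together. This assumes SQL Server, where the length function is spelled LEN and non-text values are implicitly converted to text before counting. The aliases are hypothetical.

```sql
-- LEN counts characters after converting the value to text.
SELECT
    LEN('Maria')                    AS len_text,    -- 5
    LEN(350)                        AS len_number,  -- 3
    LEN(CAST('2026-01-23' AS DATE)) AS len_date;    -- 10
```

Other database systems usually call this function LENGTH and may require an explicit cast for numbers and dates.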
If you apply the length function for that what's going to happen? It's going to go and start counting how many
characters we have inside this value. So the M is 1, A is 2, then 3, 4, 5. In the output you will get the number five. So five is the
length or the total number of characters in this value. Now let's say that you have a number like 350. If you go and
apply the length function it's still going to go and count how many digits we have. The 3 is one, the 5 is two, the 0 is three. So the total
length for that going to be three. So you can apply it even for numbers and not only that you can go and apply it on
a date value. So let's say that you have the following date: 2026-01-23. So SQL going to go and count each digit, each
character, even the dashes, not only the numbers. A dash is as well a character, right? So the total length of this
date it's going to be 10. So you can apply any data type to the length function and in the output you will get
always a number. That's it. This is how you can count the number of characters in any value. Let's go back to SQL in
order to practice that. Okay. So now we have the task calculate the length of each customer's first name. So it is
very simple. We're going to go and apply the length function, LEN, to the column first name and we're going to call it
length name. So let's go and execute it. And with that as you can see we are getting in the output numbers and these
numbers are the number of characters of each name of our customers. So this is how we calculate the length and that's
it for this group. Now moving on to the next one. It's going to be very interesting. Now we're going to talk
about how to extract something from a string value. And here we're going to cover now two functions the left and the
right. Now the LEFT function going to go and extract a specific number of characters from the start of a string
value. So if you want to get a few characters at the beginning of a value, you can use LEFT. But now the RIGHT
function is exactly the opposite. It's going to go and extract a specific number of characters from the end of a string
value. So if you want few characters from the end of your value, you can use right. Now in order to apply the left or
the right function, you have to give SQL two things. The value where you want to extract a part from it and the number of
characters, how many characters you want to extract and this is the same for the left and the right. Now let's say that
we have again this value Maria. And now the task says I would like to extract the first two characters. Since we
are talking about the starting position, we're going to use the LEFT function. And since it says two characters, we're
going to go with the two. So it's going to start counting M is 1, A is two and after that it's going to stop and make a
cut and it's going to go and return the two characters M A. So we are counting from the left side going to the right
side. Right? Now if your task says extract the last two characters, here we are talking about the end position of
your value and for that we're going to use the right function since we are approaching from the right side and
since we want only two characters the number of characters going to be two. So this time going to start counting from
the right side moving to the left side. So A is one, I is two and that's it. Then SQL going to stop and extract only
those two characters. I A. So if you want to extract data at the starting position, you use the left. But if you
want to extract characters from the end position of your value, then you use the right function. Now let's go back to
SQL in order to practice. Okay. So now we have the following task. Retrieve the first two characters of each first
name. So we just need the first two characters. Since we are coming from the left side, we can go and use the
function LEFT. So it's very simple. First name and we need only two characters. So two. So we're going to
call it first two characters. Let's go ahead and execute it. And now you can see in the output we have two characters,
MA. Now with John we have only J because we have a leading space. Well, you can leave it like this or you can transform
it. And then George we have GE and so on. So with that we are getting the first two characters. Now in order to fix it
for John what we're going to do we're going to say TRIM first and then apply the LEFT. So with that we are getting
rid of all white spaces and then we apply the LEFT. So with that everything looks perfect. So for John we have JO.
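A minimal sketch of this fix, assuming SQL Server 2017 or later (for TRIM). The two sample rows, including the leading space in ' John', are assumptions that mirror the walkthrough, and the inline VALUES list stands in for the customers table. RIGHT is included as the mirror image described earlier.

```sql
-- Nest TRIM inside LEFT so a leading space is removed before cutting.
SELECT
    c.first_name,
    LEFT(c.first_name, 2)       AS first_2_raw,      -- 'Ma', ' J'
    LEFT(TRIM(c.first_name), 2) AS first_2_trimmed,  -- 'Ma', 'Jo'
    RIGHT(c.first_name, 2)      AS last_2            -- 'ia', 'hn'
FROM (VALUES ('Maria'), (' John')) AS c(first_name);
```

The untrimmed LEFT counts the leading space as the first character, which is why only one letter of John shows up.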
So this is how we can get the first two characters of a column. Now let's move to the next one. The task says retrieve
the last two characters of each first name. So this time we need the last two. So we are coming from the right side. So
we're going to do it like this. We're going to say RIGHT, first name, and then as well two.
So last two characters. Let's go and execute it. And now as you can see in the output we have a new column where we
have the last two characters from the first name. So we have here IA, ER, and for John it's working as well and that's
because we don't have any trailing spaces but if you have any trailing spaces then go and use that trim
function. All right so that's all for the left and right and now we're going to go to the last function. we have the
substring. So the substring going to go and extract a part of a string at a specified position. So this time we
don't want something from the beginning or the end. We want something like in the middle. So we want to specify the
starting position and we want to extract few characters from there. So let's have very simple example to understand it.
Now in order to use the substring you need three things. The first one is the value itself where you want to extract a
specific part from it and then you have to specify the starting position where SQL going to start extracting the
characters that you want and as well SQL needs the length, how many characters we have to extract. So now let's say that
we have the following task after the second character extract two characters. So from reading this you can see we
specified the starting position this is the second character and the length going to be the two characters. So let's
have this example. Well, if you have Maria, so now we have to specify the starting position. Now we are saying
after the second character. So the first character m is one. Then a is two. After two, we got the position number three,
right? So starting from R. So that means we have to specify for SQL three because the starting position going to be number
three. This is after the two. Now we want only two characters. So we want the R and the I. If you give this to SQL
Maria starting position three and the length two, SQL can go and extract the two characters the R I. And this is
exactly what we want. We want two characters after the second position, the second character. So with that, we
didn't extract something from the left or from the right. We extracted at specific position. And this is exactly
why we need the substring. Now let's make it a little bit more difficult where we're going to say after the
second character extract everything, all the characters. So not only RI, I would like RIA. So now nothing's changed
about the starting position. It's going to stay at three. But now if you are looking to this value and you want to
extract everything starting from R. That means you have to specify the length of three. But this is not really good
because let's have another value in the same column. So we have Martin. So the starting position going to be as well R.
And now the length going to be different. So we have here four characters. So now the length is not
anymore three. It is four. But you have to specify something at the end for SQL. You can go for four. That's fine for
Maria as well. But if you have a lot of values, it's going to be really hard to specify exactly the correct length.
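To make the problem with a static length concrete, here is a small sketch assuming SQL Server syntax. It uses the two names from the example in an inline VALUES list, and the last column shows the dynamic alternative that the text turns to next. The aliases are hypothetical.

```sql
-- SUBSTRING(value, start, length): start positions are counted from 1.
SELECT
    c.first_name,
    SUBSTRING(c.first_name, 3, 2)                 AS two_chars,     -- 'ri',  'rt'
    SUBSTRING(c.first_name, 3, 3)                 AS static_three,  -- 'ria', 'rti'
    SUBSTRING(c.first_name, 3, LEN(c.first_name)) AS to_the_end     -- 'ria', 'rtin'
FROM (VALUES ('Maria'), ('Martin')) AS c(first_name);
```

With a fixed length of three, Martin loses its last character, while a length that is at least as long as the value always reaches the end.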
That's why instead of specifying a static number like three or four, we can use another function. So now my friends,
if you use the length function, you will get the total number of characters, right? So for Maria, you will get five.
For Martin, you will get six. And those numbers are okay to use in the length because they are more than what we need.
And that's totally fine. So if you are saying okay for Maria start from the third position and cut for me five
characters SQL going to find only three but you will not get an error. So you are extracting more than you need and
you will always get all the characters after the starting position. So this is a little trick that we use in order to
make the length dynamic where we cannot find one value that we can use in all scenarios. And now let's go back to SQL
in order to practice the substring. Okay. So now we have the following task and it says retrieve a list of customers
first names after removing the first character. So now don't ask me why but for some reason we don't want to see the
first character of the first names. We want to remove it. So how we can do that? We cannot use the left or the
right. We have to go with the substring because it is little bit more complicated. So substring and let's go
and get the first argument. It's going to be the value. So it comes from the first name and then the second argument is the
starting position. So where we want to start since it is saying I want all the characters after the first character. So
that means we will be starting from the position number two. So for example Maria here the first character M
position number one and we want to start our substring from the position number two. So that was the easy
part. Now the next one, the question is how many characters we want to keep. So do we keep here like four characters
like in Maria we have four characters but in John we have only three then the next one is four and so on. So if you go
for example with four and let's call it sub name. So we make it static. What can happen? It's going to work for some
scenarios like Maria. We have here ARIA and for Peter we are getting it. But for Martin it is not working. We are not
getting the last N because it has like five characters after the first one. And by just looking to the result as you can
see we have here one issue with John and that's because the first character is a space. So this is really
annoying. So that's why we use the trim first just to get rid of all those white spaces. And now you can see it's working
fine. So we are not getting the J. We have everything after the first character. So now instead of having this
static what we're going to do we're going to make it variable. So we're going to go and use the length of the
first name. So with that we make sure we have enough length to extract. And this can work for any value inside the first
name even if the name is like 20 characters. So let's go and execute. And now you can see for Martin it is now
working. So we have here like five characters after the M. And here we have four characters after the M as well. And
here we have three characters after the G. So it is working completely and it is full dynamic. So this is the trick by
using the links together with the substring. And as you can see now we are using three functions in one go. We have
the length, we have the trim and we have the substring. And this is what happens in scale. we use multiple functions
together in order to solve like complex tasks. So this is how you can extract a substring from a string. All right. So
that's all about the substring and with that we have covered a lot of very important string functions in SQL and
now you have enough tools in order to manipulate the string values in your data. Okay my friends. So with that we
have learned how to manipulate your string values inside SQL using the string functions. Now we will move to
the second one. You will learn how to manipulate the numbers, the numeric values. So let's
go. Okay. So now let's have this example 3.516. Now let's say that you want to apply the function round and you are
using two decimal places. So what going to happen? It's going to go and keep only two digits after the decimal point.
So five and one and the third digit after the decimal six. It will decide whether the number going to round up or
stay as it is. And now since six is higher than five, that means SQL going to go and round the number up. So
instead of having 51 we will get 52. And after that the third digit going to reset to zero. So in the output you will
get 3.52. Now let's say that you have done round but only for one decimal place.
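The rounding cases walked through in this part (two, one, and zero decimal places) can be sketched together. This assumes SQL Server, where ROUND keeps the scale of the input, so the dropped digits appear as zeros. The aliases are illustrative.

```sql
-- ROUND(value, decimals): the first dropped digit decides up or down.
SELECT
    ROUND(3.516, 2) AS round_2,  -- 3.520 (6 rounds the 1 up to 2)
    ROUND(3.516, 1) AS round_1,  -- 3.500 (1 is below 5, so 5 stays)
    ROUND(3.516, 0) AS round_0;  -- 4.000 (5 rounds the 3 up to 4)
```

Other database systems may display these results without the trailing zeros.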
Now it's still going to keep only one decimal place, and that is the five. And this time the second digit is going to
decide whether we round up or not. And since one is less than five, there is no need to round up and the five
is going to stay as it is. It will not turn into six. So there is no round-up, and the digits after the five are going to reset to
zero. So we're going to get 3.5. Now let's say that you say ROUND with zero. So that means I don't want to see any
digits after the decimal point. So now SQL is going to check the first digit after the decimal point, the five.
This one is going to decide whether the three turns into four or not. And since we have five, it is enough
to round the number up, because five or anything above five rounds the number up. So that's why it's going to be
a round-up, and SQL is going to return four at the end, and all the digits after the decimal point are going to be reset to
zero. So this is exactly how the ROUND function works in SQL. So now let's see how we can do that in SQL. Okay. So now
let's go and practice the number functions. So what we're going to do is write a SQL SELECT, but this
time we will not select any data from the database. We're going to practice using a static value, for example the
value 3.516. So let's go and execute it. So with that I have this decimal number. Now let's go and start
practicing the ROUND function. So let's round this number 3.516, and this time we are rounding to two
decimals. So let's call it round two and let's go and execute it. So as you can see in the output we are
rounding to two decimal places and we have the two, because as we learned the six is going to round it up. Now let's
go and do the same thing for one. So let's call it round one and execute. And as you can see in the output we are rounding to one
decimal. So we have the five and everything after it is zero. And we don't have six here because the one is lower than
five and it will not round the number up. And let's round to zero decimals: it is rounding it to a whole number, to
the four, and all the decimal digits are zero. We have four because we have five, and five is going to round up the
number. So as you can see it is really nice and this is how we round numbers in SQL. Now there is another number
function which is really cool called ABS, the absolute value. What it does is convert any negative
number to a positive one. So let me show you what I mean. Let's say we have minus 10. So this is a negative
number. But if I say ABS, so the absolute value of minus 10, what will I get? I will get a positive number. So
it gives us the absolute value of any number, or in other words, it converts a negative to a positive.
And if the number is already positive, nothing is going to happen. So if I say the absolute value of 10, I will get 10 as well.
So this is a really nice function that is important for transforming numbers in many
scenarios. For example, if you have mistakes in your database like negative sales, it makes no sense to have sales that are
negative. So in order to correct the data we can use ABS to convert all the negative numbers to positive ones.
So this is a really nice and easy function to learn. All right, my friends. So that's all for the numeric functions.
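The rounding walkthrough and the ABS demo can be reproduced with static values. This is a minimal sketch assuming SQL Server (T-SQL), where ROUND keeps the scale of the input literal, so the results show trailing zeros. The column aliases are only illustrative and are not taken from the video.

```sql
-- Static values only: no table is queried.
SELECT
    3.516            AS original_value,
    ROUND(3.516, 2)  AS round_2,       -- 3.520: the third decimal (6) rounds the 1 up to 2
    ROUND(3.516, 1)  AS round_1,       -- 3.500: the second decimal (1) is below 5, so the 5 stays
    ROUND(3.516, 0)  AS round_0,       -- 4.000: the first decimal (5) rounds the 3 up to 4
    ABS(-10)         AS abs_negative,  -- 10
    ABS(10)          AS abs_positive;  -- 10
```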
We have covered two very simple functions and now in the next topic we have a lot of functions about how to
manipulate the date and time in SQL. So let's go. So what is a date? If you take a
look at a calendar and you pick any date, for example, August 20th, 2025, this date could represent an event
like a birth date or a project deadline at your work, and
it has three main components. The first part is a four-digit number indicating the year. The next
component is the month. Normally we represent the month with a number between 1 and 12. And the last component
is the day. This is a number between 1 and 31 depending on the month. Now in databases we call this structure of
three components a date. So this is what we mean by dates in SQL. All right. So now let's move to the next
one. What is time? Time refers to a specific point within a day. For example, we have 18 hours, 55 minutes, and
45 seconds. So this structure has three components as well. The first one we call the hours. It is a
number between 0 and 23 indicating the hour of the day. The next one is the minutes. This is a number between
0 and 59. Moving on to the last component, we have the seconds. This is again a number between 0
and 59. In databases and SQL we call this structure with those three components a time. So this is what we mean
by time. Now to the last type: if you combine the date together with the time and you put them
side by side, you get a new structure with a new name in databases. We usually call it a
timestamp. This name is used in many databases like Oracle, PostgreSQL and MySQL. But in SQL Server, we have
another name for that. We call it DATETIME. So again, it's very simple. The DATETIME or timestamp has the date
information together with the time information. So here in this example, we have six components from left to right,
and we have a hierarchy in this structure. We start with the highest, which is the year. Then we have
the month, the day, and then we continue to the hour, minutes and seconds. So those are the three different types
of date and time information in SQL. We have the date alone, the time alone, or both together in the DATETIME. All right,
let's now explore the data that we have inside our database, searching for date and time information. Let's go to
the table orders, and if you expand it, you will find two columns with the data type DATE. So
we have the order date with the type DATE, and the shipping date with the data type DATE as well. And if you check the
last column, the creation time, this one is DATETIME2. So now let's query this information in order to
understand the structure. I'm just going to select the order ID, the order date, the ship
date and the creation time from the sales orders table. So let's go and execute it. Now if you
check both order date and ship date, you can see that here we have only the information
about the date and nothing about the time. So again we have a year, month and day, and that's why they have
the data type DATE. Now let's check the creation time. Not only do we have the date information, but we also
have the time information. So it starts with the date information, year, month, day, and then we have hour, minute and
seconds, and then we have fractions of a second, milliseconds and so on. So this is what the DATETIME or timestamp
looks like in databases, and this is what the date looks like. All right, my friends, now in SQL I
can say that we have three different sources in order to query the dates. The first one is dates that are stored
inside our database, like we saw here in those columns: the order date, shipping date, creation time. All those
are columns that hold this information, and they are stored inside our database. So this is the first source of dates
that we can get inside our queries. Let me just remove the other columns and stick with the creation time. So let's
just execute it. So this is date and time information stored inside our database. The second type is a
hard-coded date string that we can use inside our queries. Let me show you an example. So now if we go to a new line,
I can define a date like this: '2025-08-20'. So in this string we have hard-coded a date that is static
for all rows. Let me just call it hardcoded and let's go and execute it. Now we can see in the output that we
get a static date for all rows. So this is going to be the same for all rows inside our table. This value is not
stored inside our database. I just added this value to our query and hard-coded it. So sometimes in queries we define
dates that are going to be used later, maybe in calculations and so on. Now the third source of dates inside our
query is the function GETDATE. GETDATE is the first and the most important function that we use in SQL.
It returns the current date and time at the moment of executing the query. So let's try that out. I'm
going to start a new line. So GETDATE(). It's very simple. It doesn't accept any values inside the function.
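The three sources of dates discussed in this part (a stored column, a hard-coded string, and GETDATE) can sit side by side in one query. This is a minimal sketch assuming SQL Server, and assuming the table and columns are spelled Sales.Orders, OrderID and CreationTime. The transcript only speaks these names, so the exact identifiers are a guess.

```sql
-- Table and column names are assumed spellings of the names spoken in the video.
SELECT
    OrderID,
    CreationTime,                -- 1) a date/time value stored in the table
    '2025-08-20'  AS HardCoded,  -- 2) a hard-coded string, identical on every row
    GETDATE()     AS Today       -- 3) the current date and time when the query runs
FROM Sales.Orders;
```

The last two columns repeat the same value on every row, and the GETDATE column changes with each execution.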
So the parentheses stay empty. So let's call it today. All right. Let's go and execute it. And of course, you're going
to get different results, because GETDATE now returns the date and the time at which I'm recording this video. So currently
it is July 18, 2024, and I'm recording this around 8 p.m. So as you can see, this is repeated as well for
each row. We always get the same value. So again, this depends on the execution of the query. During the
tutorial, you're going to learn a lot about GETDATE and we're going to use it in a lot of functions. So those
are the three different sources of date information inside your query: from a column inside our
database, hard-coded using a string, and the third one is GETDATE, which returns the current date and
time information at the moment of the query execution. Nice. Now we have a clear
understanding of what date and time are in SQL. The next question is how to manipulate this information using SQL
functions. Okay. Now we have our date August 20th, 2025. One of the things that we can do with the date is
extract different parts of it. For example, if we are interested only in the year, we can go and
extract only the year part. Or if you are interested in the month, you can go and extract the month and you will get
August. And of course, we can go and extract the day and we will get the 20. So this is the first thing that we can
do. We can extract the parts of the dates. Now another thing that we can do is we can go and change the date format.
So instead of having like a small minus between those date parts, we can go and split them using slash. We can even
start first with the month August then 20 the day and then the year but having only the short form of the year 25 or we
can go and change the format where we say we don't need any special character we just leave it as a space. So as you
can see we are changing and manipulating the format of the date. Another category or task is date
calculations. We can take our date and add to it, for example, 3 years, or we can find the difference
between two dates, like doing a subtraction, and we will get for example 30 days. So we can
add, subtract, or find differences between two dates. It's like we are doing calculations on the date.
Now the last thing that we can do with this date is test or validate whether it is a
real date that SQL understands. So we can put it to the test, and at the output we're going to get true or false, that is, one
or zero. So as you can see, we have different ways, or let's say categories, of manipulating our dates in SQL.
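The four kinds of tasks just described map onto built-in T-SQL functions roughly as follows. This is only a sketch, assuming SQL Server. The variable name and the second sample date are illustrative and not from the video.

```sql
DECLARE @d DATE = '2025-08-20';  -- illustrative sample date

SELECT
    YEAR(@d)                         AS year_part,      -- part extraction: 2025
    FORMAT(@d, 'MM/dd/yy')           AS reformatted,    -- format change: month first, short year
    DATEADD(year, 3, @d)             AS plus_3_years,   -- calculation: 2028-08-20
    DATEDIFF(day, '2025-07-21', @d)  AS days_between,   -- calculation: 30
    ISDATE('2025-08-20')             AS is_valid_date;  -- validation: 1 or 0, depending on session date settings
```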
Now we're going to group the different date and time functions under four categories. The first category, and
the most important one, is part extraction, and here we have seven different functions that we can use
to do this task. Another category is format and casting. And here we have three different functions.
Underneath this category we have FORMAT, CONVERT and CAST. Then the third category is date
calculations. We have two functions, DATEADD and DATEDIFF. And the last category is validation. Here we have only one
function, called ISDATE. So as you can see, we have a lot of SQL functions. We have 13 date and time functions that
we're going to cover in this tutorial on how to manipulate date and time information in SQL. And this is how we
can group them into four different categories. Let's start now with the biggest category, part
extraction. We're going to cover all seven functions in detail to see how to extract
parts. All right, friends, now we're going to cover three very quick and easy functions in SQL to extract the parts of
a date. They are very simple. The DAY function returns the day from a date, and in the same way MONTH
returns the month from a date, and guess what, YEAR returns the year from a date. Okay. So now in
order to understand how they work, we have a date like this one: 2025-08-20. Sometimes you are not interested
in the whole date. You would like to get only a part of this date. So you use the function DAY in order to
extract the two digits 20. In another scenario you might be interested in the month information. So you would like to
get those two digits 08. So we can use the function MONTH in order to extract the month information and get
August, so the number 8. And one more situation is where you want to have only the year information. So you are interested in
the four digits 2025. So you can use the function YEAR in order to extract it. So in the output, if you
apply it, you will get 2025. So it's very simple. This is how those three functions work. All right. Now let's
check the syntax of those three functions. It's pretty easy. We always have it like this: a keyword called
DAY, which is the function name, and it accepts only one parameter, the date. The same goes for the others. We
have a function called MONTH, and it also accepts only one parameter, the date, and the same thing for
YEAR. So the syntax is very straightforward. It accepts only one value, the date, and the function
name is the name of the part that we want to extract. All right. So now let's try out those functions. I will be
working with the column creation time. So let's try, for example, extracting the year from the creation time using the
YEAR function. It's going to be very simple: YEAR and then the creation
time, like this. And let's call it year. That's it. Let's go and execute it. Now as you can see, it's very simple. We have
only one year, 2025, from the creation time. So with that we got a new column where we have only the year
information inside it, and this information comes from the creation time. So we have only 2025. Now let's
do the same for the month. So we have the same thing: MONTH, creation time, and let's call it month.
So let's execute it. Now as you can see in the output, we got the number of the month as well. So we have here 1, 2 and 3 for January,
February and March, and this information is extracted from the creation time as well. And the same thing using the DAY
function. So let's go and use that: DAY, creation time, and we call it day. So now as you can see in the output we have the
day part from the creation time. So here we have 1, 5, 10 and so on, and all this information comes from the creation
time. So as you can see, those three functions are a very simple and quick way to extract parts from a date or
datetime. All right. So what is DATEPART?
DATEPART returns a specific part of the date as a number. All right. So now back to our example.
We have learned how to extract the day, month and year. But of course in a date we have more information that we
could extract. Besides those three, we could extract for example the week or the quarter. All this information
is stored in the date as well. We cannot see it as a value, but inside SQL you can extract the week and
quarter. We don't have a dedicated function for those parts because they are not as commonly used as the year,
month and day, but we can still extract this information using DATEPART. For example, we can say DATEPART and
specify the part as week, and with that SQL is going to return 34 for this example. And maybe in another situation
you are interested in the quarter, so you can specify it like this: DATEPART quarter. So we are interested in the
quarter part, and in the output you will get 3. So this is exactly the power of DATEPART: you can
extract many more of the parts that are available in a date. And one more thing to notice about DATEPART, YEAR, MONTH and DAY:
all of them always return an integer, a number, in the output. So we have 3 for the quarter, 34 for the week,
20 for the day, 2025 for the year and so on. All of this information is integer. So integer is the data type of the output
of these functions. Okay. So let's have a look at the syntax of DATEPART. It starts with the function name
DATEPART, and it accepts two parameters. The first one is the part that we want to extract. So we define what we
want: the month, the day, the year and so on. And the second parameter is the date itself. So let's have an
example. We can say DATEPART and extract the month from the order date. So the part is month
and the order date is the date that we want to extract from. With that we are specifying the part as month. Now
in SQL there is another way to specify the part. We can use an abbreviation of the month. So if,
instead of writing the whole word month, you write mm, you will get the same result. So it's like
an abbreviation, a shortcut for writing scripts. But I rarely see that in implementations. I always tend to
write it out completely, like month, because it's more standard if you are switching between different
databases. So as you can see, it's very simple. You have to give SQL two things: which part you want to extract and the
date that you want to extract from. Okay. So now we're going to extract different parts from the
creation time using DATEPART. Let's start, for example, by extracting the year again. So let's go and do that:
DATEPART, and then we have to specify which part we need. So we write year, like this, and the next one
is going to be the value, so the creation time. Let's call it year, let's say year DATEPART. Let's go
and execute it. So now at the output you can see we again got the year that is extracted from the creation
time. So it's identical to the YEAR function. There are no differences between them. Both of them
are integers and hold the year information. Now we can try different parts. For example, let's copy
the whole thing and extract the month. So you can go over here and change it to month, and let's
rename it and execute. So at the output you see we got the months as well, and it is identical
to the function MONTH. And the same thing for the day. So we are just changing the
part, and in the output we are getting that part. So here we have the days as well, and it is identical to the DAY
function. So far we don't have anything new from DATEPART, because we have it already from the other
functions. But now we're going to extract other parts that are not year, month and day. So for example let's
get the hours. So we have DATEPART, and here as a part you say hour, and let's call it hour as well. Let's
go and execute it. Now you can see in the output we have a new dedicated column that shows only the information
from the hour. So we have here 12, 23 and so on. This information comes from the time, and in the same way you can
define minute and so on. But now let's get something interesting like the quarter. So let's duplicate
it, and instead of hour let's get quarter. This information is not displayed in the creation time, but SQL
can extract it. So let's call it quarter and execute it. Now as you can see in the output, we have one
new field called quarter, and inside it we have a one everywhere, because all those dates are in the range of
quarter one. So as you can see, this is of course great for reporting and analysis. Let's get something
else, like the weekday. So here, instead of quarter, let's write weekday, and rename the column to weekday as well. So
let's go and execute it. All right. So now let's get something else, for example the week. So I just
duplicate it, and instead of quarter let's write week. I would like to get the week number. So let's go and execute
it. So now in the output, as you can see, we got a dedicated field that shows us the week number from the creation time.
So we can see this date comes from week number one, those two come from week number two, and so on. So that's it.
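The DATEPART calls from this demo can be collected into one query. This is a minimal sketch assuming SQL Server, with Sales.Orders, OrderID and CreationTime as assumed spellings of the spoken names. The column aliases are illustrative.

```sql
-- Every column below is returned as an integer.
SELECT
    OrderID,
    CreationTime,
    YEAR(CreationTime)               AS Year_fn,     -- same value as DATEPART(year, ...)
    DATEPART(year,    CreationTime)  AS Year_dp,
    DATEPART(month,   CreationTime)  AS Month_dp,    -- same value as MONTH(CreationTime)
    DATEPART(mm,      CreationTime)  AS Month_abbr,  -- mm is the abbreviation of month
    DATEPART(day,     CreationTime)  AS Day_dp,      -- same value as DAY(CreationTime)
    DATEPART(hour,    CreationTime)  AS Hour_dp,
    DATEPART(quarter, CreationTime)  AS Quarter_dp,
    DATEPART(weekday, CreationTime)  AS Weekday_dp,  -- numbering depends on SET DATEFIRST
    DATEPART(week,    CreationTime)  AS Week_dp      -- week numbering also depends on session settings
FROM Sales.Orders;
```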
As you can see, guys, all the information that you are getting from DATEPART is numbers. And now we
can extract much more information than only the year, month and day, even if that information is not displayed
directly in the field itself, like the quarter, week and so
on. All right. So now we have a very similar function to DATEPART: DATENAME. The only
difference here is that it returns the name of the date part. All right. So now back to our example. We have learned
that we can extract different types of parts from one date. But we learned as well that all of them are numbers. What if
we would like to extract the name of the month? So instead of 8, I would like to get the name of the month,
August. Or instead of 20, I would like to get the day name, which in this example is going to be Wednesday.
So in order to get the name of a part, we have to use the function DATENAME. For example, if you use the
function DATENAME with the part month, you will not get 8 in the output. You will get the full name of the month,
August. So as you can see, we are getting a string, a full name. And it's the same thing if you use DATENAME for the
weekday: you will not get 20 like with the DAY function, you will get the name of the day, Wednesday, and here as well the
output is a string. So as you can see, it's very simple. We are using DATENAME in order to get the name of the part,
and the data type of the output here is a string, not an integer. So as you can see, we have different types of
functions that are all doing the same job: extracting parts from one date. Okay. So now checking
the DATENAME syntax, it's identical to DATEPART. We are just switching the function name. It
needs you to define the part and the date as well. The only difference is that we are getting a different data
type at the output: a string instead of an integer. All right. So now let's check DATENAME.
It is very similar to DATEPART. So we're going to have it like this. We're going to work with the creation
time as well. So we say DATENAME, and after that we have to define the part. Let's go, for example, with
month, and our field is as usual the creation time, and let's call it month
DATENAME, like this. So that's it. Let's go and execute it. Now if you go to the output over here, you can see we have the
month, but this time we don't have numbers. We have the full name of the month. So we have January, February,
March instead of 1, 2, 3. So this is the big difference between DATENAME and DATEPART. With DATEPART you get
numbers. With DATENAME you get the name of the part. So let's do the same thing for the day. We would like to get the name
of the day. So I'm just duplicating it. But in order to get the full name of the day, we cannot go with day.
We have to go with weekday as the part. So that's it. I will call it weekday. So let's execute it. Now as you can
see in the output, we have a new column called weekday, and inside it we have the name of the day instead of a
number. So here we have Wednesday, Sunday, Friday and so on. Now what happens if we go with the part
day instead? Let's go and try that out. This is the day of the month, and of course the day of the month has no
name, so SQL is going to return numbers again. You can see 1, 5, 10, 20 and so on. But there is still a
difference between the day from DATENAME and the day from DATEPART. With DATEPART we are getting integers,
so if you store this information in a new table, it's going to be stored as an integer. But the day that you are
getting from DATENAME looks like a number and is still stored as a string value. So the data type of those
numbers is a string, and the data type of the day from DATEPART is an integer. And the same thing happens
if you extract, for example, a year. You don't have a full text for the year. So let me just do it like this.
If we say year, you will not get a name for the year. You're still getting the numbers, the digits, but the data
type here is a string. So that's it. This is the difference between DATENAME and DATEPART. For the month
and weekday, you will get the full name. For the other parts, you will get numbers but with the string data type.
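The contrast between DATENAME and DATEPART can be sketched in one query. This assumes SQL Server and the same assumed identifiers (Sales.Orders, OrderID, CreationTime). The aliases are illustrative, and the names come back in the session's language.

```sql
SELECT
    OrderID,
    CreationTime,
    DATENAME(month,   CreationTime)  AS Month_dn,    -- full month name, e.g. January
    DATENAME(weekday, CreationTime)  AS Weekday_dn,  -- full day name, e.g. Wednesday
    DATENAME(day,     CreationTime)  AS Day_dn,      -- digits, but returned as a string
    DATENAME(year,    CreationTime)  AS Year_dn,     -- digits, but returned as a string
    DATEPART(day,     CreationTime)  AS Day_dp       -- the same digits as an integer
FROM Sales.Orders;
```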
So the most important use of DATENAME is to present easy-to-read, human-readable information to the
users. Imagine you are building a report called sales by month, and you show the user the months as
numbers 1, 2, 3 up to 12. This is of course okay, but it is much nicer if you present this information as full
text. So you use DATENAME in order to show, instead of 1, 2, 3, January, February, March, the full
name of the month. And this is going to look much nicer in reporting for the users. So this is the core use case of
DATENAME. So what is DATETRUNC? DATETRUNC truncates the date to a
specific part. So let's understand what this means. Okay. Now let's check the syntax of DATETRUNC. It's
exactly the same as DATEPART and DATENAME. You have to define the part and the date that you want to
truncate. The only thing that is different is the function name. So as you can
see, all three functions have the same structure: you have to provide the part,
like month, day, week, hour, minute and so on, and the date or date and time that you want to apply it to.
And of course with DATETRUNC we are getting a date or datetime at the output. Okay. So now let's understand exactly
how DATETRUNC works. We have the following datetime, and as we learned we have a hierarchy where we start
with the highest, the year, then we move to the month, day, hours, minutes and seconds. Looking at this
information, it is very precise. We know the exact second for this information, right? So the level of detail here is very
high. We know the second of this event. Now DATETRUNC allows us to change the level of detail of this
information by specifying the level we want. Let's take for example DATETRUNC minute. So we are
saying we are interested only in the minute level. We are not interested in the seconds. So what happens?
Everything between the year and the minutes is going to be kept. That means all this information will not be changed,
and only the seconds are going to be reset. We are not interested in the seconds anymore. This is too detailed
for us. So it's going to reset the seconds to 00. We are saying the minimum level is the minutes and we are
not interested in anything below it, the seconds. Let's say now we say, you know what, the minutes are too detailed, I
would like to be at the hour level. So we specify DATETRUNC hour. Here things change: we're going to keep
the information between the year and the hours, and anything after that is going to be reset. So now minutes and
seconds are in the range of the reset, and SQL is going to reset the 55 minutes to 00 as well. So now the level of detail is
a little bit lower. We know the information only down to the hours, and we are not interested in the minutes and the
seconds. And I think you already get it: if you say DATETRUNC day, what's going to happen? It's going to keep everything
between year and day, and the whole time is going to be reset. So the hours, minutes and seconds are all going
to reset to 00. Looking at this, we no longer know anything about the time. We know only the information about the
date. And now we can go one more step and say, you know what, I'm not interested in the days, I'm doing
analysis at the month level. So what is kept here is only two pieces of information, year and month, and everything below that, the
day and the time, is going to be reset. But this time SQL will not reset the day to 00, because there is no day 00.
The days always start with the first, so it's going to reset to 01. So the parts in the date reset to
01 and the parts in the time reset to 00. So now we are at the level of the month. Now you can go
to the last step and say, you know what, I'm interested only in the years and I'm doing analysis only at this
level, the highest level. So you can say DATETRUNC year, and now what's going to happen? It's going to keep
only the year, and everything below that is going to be reset. So everything between the month and the seconds is going to be
reset. Here SQL is going to reset the month, August, to 01 as well. So the only value that is kept is the year, and
everything else is reset. So this is the 1st of January, and the time is completely reset. So now we are at the
lowest level of detail. We know only the information about the year and we don't care about any other parts. So as you
can see, DATETRUNC is not really extracting a part. DATETRUNC is resetting things. We are
navigating through the hierarchy of the date and time, and we are controlling at which level we are doing the analysis.
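The level-by-level reset described here can be sketched with a single value. This assumes SQL Server 2022 or later, where DATETRUNC is available. The variable and its value (20 August 2025, 18:55:45) are illustrative, built from the date and time used in the explanation.

```sql
-- DATETRUNC returns the same data type as its input, here DATETIME2(0).
DECLARE @dt DATETIME2(0) = '2025-08-20 18:55:45';

SELECT
    @dt                     AS original,   -- 2025-08-20 18:55:45
    DATETRUNC(minute, @dt)  AS to_minute,  -- 2025-08-20 18:55:00
    DATETRUNC(hour,   @dt)  AS to_hour,    -- 2025-08-20 18:00:00
    DATETRUNC(day,    @dt)  AS to_day,     -- 2025-08-20 00:00:00
    DATETRUNC(month,  @dt)  AS to_month,   -- 2025-08-01 00:00:00
    DATETRUNC(year,   @dt)  AS to_year;    -- 2025-01-01 00:00:00
```

Time parts reset to 00, while the month and day reset to 01 because there is no month or day 00.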
So as you can see, in the end it's not very complicated once you understand how it works, and it is very useful in
analysis. So this is how DATETRUNC works in SQL. Okay, let's have a few examples of DATETRUNC together
with the creation time. As you can see, the level of the creation time is the seconds. So we have seconds
information in the creation time. Now I would like to move it to the minutes. So let's do this: DATETRUNC, and
we truncate at the minute level for the creation time. Let's call it minute DATETRUNC. So
let's go and execute it. Now if you check the output over here and compare it to the creation time, you can
see we have zeros at the seconds. So as you can see, the seconds are completely reset compared to the
creation time. Now let's say that I'm not interested in the time information inside the creation time. I would like
to get only the date. In order to do that, we can use DATETRUNC and reset to the level of the day. So let's
duplicate it. I'm going to put it over here, and instead of minute let's say day, and let's check
the output. Now if you check the result over here, you can see all the time information is reset to zeros
and we have only information about the date. So we have year, month and day, and everything else is reset to zero.
Now of course we can go to the maximum, where we say I just need the year. I don't need anything else. So let's try
that out. We take DATETRUNC and say year, and let's call it year. So let's go and execute it. Now if you
check the output over here, you can see that everything is reset besides the year. So we have only the year
information; everything else is reset to the first of January, and the time is reset as well. So as you
can see, the output of DATETRUNC here is always a datetime, and it helps us navigate through the hierarchy
of the datetime so that we can truncate at the level that we want. All right. So now we're going to check why DATETRUNC
is amazing function for data analyszis. So let's have this example. We are saying select
creation time and we want to count the number of orders based on the creation time from our table sales orders and
we're going to use the group by in order to group the data by the creation time. So let's go and execute it. Now as you
can see we're going to get one everywhere because the level of detail, the granularity of the creation time, is
very high and that's because here we have the seconds and since our data is small we will not get two orders at
the same second. Now in data analytics you would like to quickly aggregate the data at a different granularity, like for
example at the month level. So you can do that very quickly using DATETRUNC and you say you know what let's
say at the month and let's call it creation and we're going to have the same thing for the GROUP BY. So let's
go and execute it. So now as you can see in the output we have only three rows, we don't have 10 rows, and that's
because we have three months. So that means we just rolled up to the month level instead of the seconds. And we can
see now in the month of January we have four orders, February as well four and March we have only two. So now we are
talking about a different level of detail and granularity in the output. And now you might say let's go and aggregate the
data at a different level, at the year level. So you can just change it over here to year and execute it. And with that
now we are at the highest level of aggregation. We are at the year level and in our data we have only 2025.
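A minimal sketch of this roll-up, assuming the table is called Sales.Orders and the column CreationTime (both names are assumptions, not confirmed by the transcript):

```sql
-- Count orders per month by truncating the timestamp before grouping.
-- Swapping month for year (in both places) rolls the data up one level further.
SELECT
    DATETRUNC(month, CreationTime) AS Creation,
    COUNT(*)                       AS OrderCount
FROM Sales.Orders
GROUP BY DATETRUNC(month, CreationTime);
```

The same expression has to appear in both the SELECT list and the GROUP BY, otherwise SQL Server rejects the query.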
So we will get the total number of orders inside the table and that is 10. And this is really amazing in data
analytics. You can go and quickly change the granularity and the level of aggregation or details by simply
defining the level inside DATETRUNC. So this is why DATETRUNC is amazing. It allows us to do analysis and
aggregations by zooming in and zooming out. Okay. So now we're going to talk about the last function in the part
extraction category. We have EOMONTH, the end of the month. As the name says, it's going to go and return the last day of a
month. So let's see how end of month works. This is very simple. So let's take our date 20th August 2025. If you
go now and apply this function to it, what's going to happen? It's going to go and change only the day information. So
instead of 20, it's going to go to the last day of the month. So it's going to go and change the 20 to 31. The last day
of the month, August in 2025. Let's take another example, the 1st of February 2025. If you apply the end of the month,
it's going to go and change the day from the 1st to the 28th, the last day of February. So as you can see, it's very
simple. Let's take another example where it is already the last day of the month. So we have the 31st of March. If you apply the
end of the month here, what happens? Nothing is going to happen. You're going to get the same value in return. So this is
how it works. And as you can see the output of the end of the month is always a date as well. So this is how end
of month works. It is very simple. All right. Now quickly about the syntax of EOMONTH. It's going to have
the exact same syntax as DAY, MONTH and YEAR. It needs only one parameter, the date (a second parameter for adding months is optional and we don't use it here). So we have to
pass here a date in order to find out the end of the month. So let's go and find the end of the month of our
creation time. So EOMONTH like this. And let's pass our creation time. So let's call it end of month. Let's go
and execute it. And now in the output you can see we have a new column a date column. And inside it we have values
about the end of the month. So for example here we have January, January, January and so on. So you will see
always here the end of January and the same thing for February and March. So that's it. This is a really nice function
in case you need the end of the month of each date. Maybe you're creating a report or an analysis where you need this
information. And now you might ask me how about to get the first day of the month. Is there like any function for
it? Well, no. But there is a trick in order to get the first day of the month using another function that we just
learned. Think about it. How to get the days as one everywhere. So we have to get here the 1st of January, the 1st of
February, and the 1st of March. So how can we do that? Well, using DATETRUNC. So let me show you how we're
going to do this. So DATETRUNC and we're going to reset at the level of month. So we don't need the
days, they're going to reset to the first. So our field is creation time and this is going to be the start of month. So let's
go and execute it. So now as you can see in the output we have the start of month and you can see we have everywhere here
a one since we reset it at the level of month and this is going to give us the first day of the month. And now you
might say you know what here we have a lot of zeros, how to get it exactly like the end of the month? And that's because
DATETRUNC gives us date and time here. So that means we have to change the data type and that we're going to
learn later using the CAST function but we can go and do it right now. So we can say CAST and we want to change the whole
thing to DATE. And now we changed the data type from a date and time to a date and in the output as you can see we have
only the date information. So now it's really amazing that you got two dates. The first one is the start of the month
and the second is the end of the month. And that information might be helpful if you are generating reports and you
need the start and the end of the month. So now we come to the part where
we ask the question why do we need those parts? Why do we need to extract the date parts from a date? So let's have
the following use cases. The first use case of extracting the part is doing data aggregations and reporting.
Sometimes we are building like reports based on our data and sometimes we have to aggregate our data by a specific time
unit. For example, we are building a report in order to show the sales by year. So we have different years and we
are aggregating the data based on the year. Or you want to drill down to more detail where you want to aggregate the
data by the quarter. So in this report we are showing the sales by quarter, Q1, Q2, Q3, Q4. Or you decide to go into more detail
where you show a report of sales by month and then you start aggregating your data by the month. So you have
January, February, March and so on. So as you can see we can use those different parts in order to aggregate
the data based on them, and these different parts can offer us different analyses with different levels of detail. So now we have
the following task and it says how many orders were placed each year. So that means we have to group up our data by
the year and we have to count the number of orders. Let's go and solve it. So let's go with the select. And now what
do we need? We need the order date. This is going to indicate when the order was placed. And we have to go and count
the star, COUNT(*). So this is going to be the number of orders, from our table sales orders, and we have to group by the order
date. So that's it. Let's go and execute it. So now in the output we are getting the number of orders but by the
order date. So we are still not there. We have to have it as a year. So we don't need the whole date information.
We need only the year information. So that means we have to go and extract the year part. In order to do that we can do
it like this. So we can go with YEAR and we need it as well in the GROUP BY. So that's it. Let's go and execute it.
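The query this task arrives at might look like the following sketch. Sales.Orders and OrderDate are assumed names for the sales orders table and its order date column; the aliases are illustrative.

```sql
-- How many orders were placed each year?
SELECT
    YEAR(OrderDate) AS OrderYear,
    COUNT(*)        AS NrOfOrders
FROM Sales.Orders
GROUP BY YEAR(OrderDate);
```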
And with that as you can see we got the number of orders for each year. And since in our data we have only 2025 we
will get only one row. So with that the task is solved. We are now aggregating the data on the level of the year. Now
let's have another task which is the same but with a different part: how many orders were placed each month? So we
have to go and change it to a month. It's very simple. We're going to use the function MONTH, and as well in the group
by. So let's go and execute it. And now as you can see in the output we don't have one row. Now we have three rows.
And that's because we have three months inside our data. And for each month we will get the total number of orders. So
for the January we have four, February we have four and March we have two orders. Now you might say you know what
I don't want the months as numbers. I would like to have the full name of the month. So in order to do that we're
going to go and use the function DATENAME. So let's go and use DATENAME and then we have to specify the date part.
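As a sketch of the DATENAME variant (again assuming the names Sales.Orders and OrderDate), the date part comes first and the date value second:

```sql
-- Orders per month, with the month shown by its full name instead of a number.
-- The name is returned in the session's language (e.g. January for English).
SELECT
    DATENAME(month, OrderDate) AS OrderMonth,
    COUNT(*)                   AS NrOfOrders
FROM Sales.Orders
GROUP BY DATENAME(month, OrderDate);
```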
It's going to be the month and the value is going to be the order date, and we have to have the same thing as well in the
GROUP BY. So let's go and execute it. Now you can see in the output we are getting the full name of the month which is
easier to read. So this is one of the use cases why we need to extract parts from a date in order to aggregate the
data on a specific level. So now let's have the following task and it says show all orders that were placed
during the month of February. So that means we don't need all the orders. We need only a subset of the orders based
on the order dates. Now let's go and check the data. So select star first from sales orders and let's go and
execute it. So now with that we have our 10 orders. Now if you check the order date over here you can see that we have
orders in January, February and March. Now we are interested only in the orders that were placed in February. So only
this subset. So that means we now have to filter the data based on the month information. So what we're going to do,
we're going to have a WHERE clause. And now we don't need the whole order date. We need only the month part. So we're
going to go with MONTH of the order date and this is going to be equal to two, since the output is going to be a number.
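A minimal sketch of that filter, with Sales.Orders and OrderDate as assumed names:

```sql
-- Show all orders placed in February: MONTH returns an integer, so compare with 2.
SELECT *
FROM Sales.Orders
WHERE MONTH(OrderDate) = 2;
```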
So let's go and execute it. Now as you can see SQL did filter the data and in the output we have only the orders were
placed in the month of February. So this is as well very common use case. Why do we need the parts? We use it in order to
filter the data based on a specific part of the dates. So as you can see it's very quick and easy. And here my
recommendation is that if you are filtering the data, use the numbers. So use a date function
that gives you a number, because comparing integers is generally faster than comparing characters or
strings. So don't use the DATENAME function in order to search or filter the data. It's better to use
DATEPART or MONTH, YEAR and DAY, since you can work with numbers and numbers are generally faster for retrieving and
filtering your information. Okay. So now we have a lot of functions and I would like now to do
a quick recap about the data type of their results. So as we learned we have functions like DAY, MONTH, YEAR and
DATEPART, and the output of all those functions is going to be an integer. It's going to be a number. Now we have
another function, DATENAME. If you use it the output of this function is going to be a string because here we are
extracting the name of the date part. And if you go and use DATETRUNC on our creation time you will get DATETIME2
in the output (DATETRUNC returns the same data type as its input). So you are getting both the date and time. And the last function that we learned, EOMONTH: if you use it, in
the results you will get the data type DATE. So it is really important to understand the data type of the output
so that you don't get any unexpected results. All right. So now you might say you know what those are a lot of
functions and they are all doing the same kind of thing: extracting the parts of the dates. So now you might
ask me how do you decide when to use which function? This is how I usually do it. First I ask myself which part I want
to extract. If I want to extract a day or a month then I ask the question: do I need it as an integer, as a number? If
yes, then I go and use the DAY function or the MONTH function because they are quick and I will get exactly
what I need. But if I need the full name of the month or the day then I go with the function DATENAME. Now moving
back, if I'm interested in the year part, we don't have a year name or anything like that, so I'm going to go immediately
with the function YEAR. But now let's say that I don't need the day, month or year. I'm interested in other parts like
the week, the quarter and so on. Only for this scenario, I go with the function DATEPART. So this is my
decision process. This is how I decide when to use which SQL function in order to extract the parts of the
dates. All right. So now I have prepared for you here a list of all parts that we can use inside those three
functions, DATEPART, DATENAME and DATETRUNC. And you can see in this table the different outputs of those
three functions. So for example if you go and use month with DATEPART you will get 8, but for DATENAME
you will get August and for DATETRUNC you will get the date and time truncated at the level of the month, where you
reset the days and times. So this is a full list of all examples, you can go and check it. And one more thing that I have
prepared for you in order to practice with all those different parts. I have made one big query with all different
parts. So if you go and download the queries of this chapter, you will find the following files and let's go now and
open all date parts. So we're going to go inside it and here we have a long query. So what we're going to do, we're
going to select everything and copy it and let's go back to SQL and paste it. So let me just zoom out and then
let's go and execute the whole thing. So now in my code I have just done a union for each possible part. For example for
the year we have DATEPART, DATENAME and DATETRUNC, and I'm currently using GETDATE. So we are manipulating this
one and then the output is presented over here. So you can see it like this. So if you use the part year for
DATEPART you will get 2024. The same thing for DATENAME, and this is for DATETRUNC. And with that you have all
possible parts that you can use in SQL in one query. So with that you can learn what the outputs are for the different
parts. All right. So with that we have learned all those functions for extracting the parts of dates. All right.
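As a compact recap of the part extraction functions and their result types, here is a sketch against a single sample value. The variable and its timestamp are made up for illustration (the date matches the 20th August 2025 example used earlier), and DATETRUNC assumes SQL Server 2022 or later.

```sql
-- Sample value chosen for illustration only.
DECLARE @d DATETIME2 = '2025-08-20 18:55:45';

SELECT
    MONTH(@d)                          AS MonthNumber,    -- INT: 8
    DATEPART(quarter, @d)              AS QuarterNumber,  -- INT: 3
    DATENAME(month, @d)                AS MonthName,      -- string: August (English session language)
    DATETRUNC(month, @d)               AS MonthStart,     -- DATETIME2: 2025-08-01 00:00:00
    EOMONTH(@d)                        AS MonthEnd,       -- DATE: 2025-08-31
    CAST(DATETRUNC(month, @d) AS DATE) AS MonthStartDate; -- DATE: 2025-08-01
```

The last column is the start-of-month trick: truncate to the month, then cast away the time part so it lines up with what EOMONTH returns.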
Moving to the second category. We're going to learn how to do formatting and casting for the date informations in SQL
using three functions. So now before we deep dive to the formatting and casting I would like you
to understand what is date format. So back to our example we have here the date and time informations and we
understood there are components: year, month, day and so on. Now if you check the date time there is a combination of
numbers and characters. For example 2025 is a number but between the year and the month there is a minus
and this is a character. So this is a very specific format and in SQL we can have a code for this
format. So for example let's start with the year. We have here four digits and we can represent them with four y, so yyyy, and we
call those characters format specifiers. So this is how we represent the year. Then between the year and the
month there is this small minus and then the month is two digits and we're going to represent it with two capital M, so
MM. Then between the month and the day there is a minus, so we have a minus as well, and then the day is going to be
represented with two small d, dd. Then we have a space between the date and time and then we start with the
time. So it starts with the hour, HH, with capital H because here we have the 24-hour system, and then we have a colon
and two small m, mm. So as you can see here the formats are case sensitive. So there is a big difference between
a small m and a capital M. A small m stands for a minute and a capital M stands for a month. So as you can see
the format is case sensitive: two small m means minutes but two capital M means month. Then a colon
and two small s, ss. So now the whole code, yyyy-MM-dd HH:mm:ss, is called the date format. So this is the date format representation of this
value. Now in the world there are different ways to represent a date. So for example in SQL
we have the international standard ISO 8601 and the date format is like we have learned: first it starts with the
year. So four digits for the year, minus, two digits for the month, minus, two digits for the day. So year, month, day. But in
the USA we have a different standard. So first it starts with the month, so we have MM, and then after that it is
followed by the day, dd, and after that at the end we have the year. So this is the standard
format that is used in the USA. And in Europe we have a different representation of the date.
It starts with the day, then the month and then the year. So this is exactly the opposite of the international
standard. So as you can see we don't have one standard. We have different ways to represent dates. But in
SQL, SQL Server follows the format of the international standard. So SQL Server always starts with the year,
then month, then day. So all dates that are used in our SQL database follow this
format. Okay. So after we understood what is date format, now let's talk about formatting and casting. So what is
formatting? It is changing the format of a value from one to another. So we are changing how the data looks. So for
example, we have our date. It follows the international standard: it starts with year, then month, then day. Now we
can go and change the format using the function FORMAT, where we can go and define a different date format, for example it
starts with the month and then we have a slash instead of a minus and then the day, slash, year. So in the output we're going
to get it like this, and the year is even only two digits, not four. So here we are giving SQL the format that we
would like to see the data in. Or you can go with another format where you have three capital M and then four digits for the
year and between them just a space. So in the output you will get the abbreviation of the month name and then a
space and the year. So this is one way to format data. But in SQL there is another function that helps us
to format data and that is CONVERT. Here we provide not the format itself but a style number. So for example
style number 6. It shows the date like this: day, space, then the abbreviated name of the month, and
then two digits of the year. Or if you use another style, 112, then you will get the year, month, day without any
separator between them. And of course we can style not only date and time, we can style numbers as well, and here we
can use the function FORMAT in order to change the format of the number. So here if you're using the format N for numeric
values then the thousands will be separated with commas, or if you use C for currency then you will get the dollar
sign, or if you go and use P then you will get the percentage, with the percent character at the end. So as
you can see we can change the format of numbers as well, not only of dates. So this is what we mean by
formatting: we are just changing how the value looks. Now on the other hand, casting can go and
change the data type from one to another. So for example if we have the value 123 as a string we can go and
convert it from the data type string to an integer. So in the output we will get 123 as well but as a number. Or we can
go and change the data type from date to string. So in the output it is not a date anymore, it is a string value. Or
the other way around, we can change the data type from a string to a date. So as you can see we can change the data type from
one to another and we can do that using two functions. The first and most famous one is the CAST function, or in
SQL Server we can use as well the CONVERT function in order to change the data type. So this is what we mean by
casting: changing the data type from one to another. All right. So let's start with
the first function, FORMAT. So what is FORMAT? As the name suggests, it formats a date or time value. So it's
like we are changing how the date and time looks. Okay. So let's check the syntax of the format and here it accepts
two parameters and the third one is optional. So the first one we have to provide a value. It could be a date or a
number. And the second one we have to provide the format. So here we are specifying the new look the new format
for this value. Now the third one is optional. It is the culture. Culture means show me the value, whether it's a
date, time or number, in the style of a specific country or region. So each country, each region has a
different format. So here we can go and change it to a specific region's format. But as I said it is optional. Let's have an
example. So here we are saying go and format the order date using the following format: dd for the day, then a slash,
then the month, then a slash, then the year. So it's going to go and format the date with this new format. And as you can see
here we didn't specify any culture since it's optional. Let's see another option where we can say you know what I would
like to have the order date formatted with this format but we would like to go and add the style of Japan. So we are
specifying here the culture code of Japan. And of course we can go and use FORMAT not only for dates but as
well for formatting numbers. So here we are specifying the value. The format is D. And as well we have used the
culture option. We are using the style of France. So this is the syntax of FORMAT. Using the culture option is not really
common. I rarely see someone using it. So the first example is the most used one in projects,
where we have the culture as default or we are not using the culture at all. And of course if you don't specify anything
it's going to go and use the default culture of your session, which is typically en-US. So this is all about the syntax of FORMAT. All
right. So now let's have a few examples using the format. So we're going to go and format the creation time. So we're
going to do it like this: FORMAT. And what are we formatting? We are formatting the creation time and now you
can go and define any specifier you want. For example, let's say dd like this. So let's go and check the output.
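The FORMAT calls discussed so far can be sketched as follows. Sales.Orders, OrderID and CreationTime are assumed names, and 'ja-JP' is used here as the culture code for Japan.

```sql
SELECT
    OrderID,
    CreationTime,
    FORMAT(CreationTime, 'dd')                  AS DayTwoDigits,  -- two-digit day with leading zero
    FORMAT(CreationTime, 'dd/MM/yyyy')          AS DayMonthYear,  -- no culture given: session default
    FORMAT(CreationTime, 'dd/MM/yyyy', 'ja-JP') AS WithCulture    -- optional third argument
FROM Sales.Orders;
```

Note the capital MM for the month; a lowercase mm in the same position would print the minutes instead.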
So execute it. Now if you are using dd, you will get the day information. So we can see if you're using this specifier,
we are getting two digits for the day, and we are getting the leading zero as well. So we are getting 01,
05, and all those values are the day information. Now let's go and try something else, adding one more d. So
let's have it as ddd, and here as well. So let's go and execute it. So now if you check the output, we are getting the name
of the day. It is not the full name. So we are getting a short or abbreviated name of the day. So this is sometimes
nice if you are creating a calendar or something. Let's go and add one more d. So we're going to have dddd. And let's
go and check the result for this one. Now in the output we are getting the full name of the day. So it's really
nice. Now we are getting full flexibility on how to format our day. Okay. So now let's keep playing. Let's
get something else. I'm just going to go and duplicate everything and I will go with the month now. So this is MM, MMM
and MMMM. Let me do it like this. So let's go and execute it. Now as you can see we are getting the same thing but
for the month. So with MM we will get the two digits, with MMM we will get the abbreviated name of the month and with MMMM
we will get the full name of the month. So it's like we are extracting the date part with the format, but of course we
don't use it like this. We will go and write the whole format that we need for a date. So for example let's go and
change this format to the USA format. In order to do it we're going to go over here. So let's say FORMAT again, the
creation time. And now we're going to write the format of the USA. So it's going to be MM. Then after the
month we're going to have a minus, then the day, dd, and then after that we're going to get the year, so four times y, and
that's it. Let's call it USA format. So let's go and execute it. And now you can see in the output we got a new column
where we see the date information but in the USA standard. So it starts with the month, then the day, and then
afterward we get the year. And of course we can do the same thing in order to generate the standard format of Europe.
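A sketch of both regional formats side by side, assuming the names Sales.Orders, OrderID and CreationTime; the European column simply swaps the day and month specifiers:

```sql
SELECT
    OrderID,
    CreationTime,
    FORMAT(CreationTime, 'MM-dd-yyyy') AS USA_Format,   -- month, day, year
    FORMAT(CreationTime, 'dd-MM-yyyy') AS EURO_Format   -- day, month, year
FROM Sales.Orders;
```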
So what we're going to do, I'll just duplicate it. And now the format is going to start with the day, then the
month and then the year. So now if you check the output you can see it starts with the day, a minus, then we have the month,
then a minus and the year. So as you can see we are changing the format of the date from the creation time to something new. All
right. So now we have the following task and it says show creation time using the following format. Now we have a very
weird format. It starts with the word Day. Then after that we have the abbreviation of the day and then the
abbreviation of the month. Then comes the quarter information, then the year, and after that we have the time and we're
going to say whether it's PM or AM. So it's a little bit of a weird format that you don't see everywhere, but still we
want to practice how to construct such a custom format. So let's do it step by step. I'm going to go over here and a
new line. So the first one is the word Day. We don't have any format for that. It's just characters. So this one is
going to be static for the whole format. So what are we going to do? We're going to write it as a string, 'Day'. So
let's go and execute it. So with that we got a static value. Everywhere we have the word Day. So that's it. And after
that we have a space. So I'm going to go and include it after Day in the string. So we have Day, then a space, and
after that we need the abbreviation of the day name. So what we're going to do, we're going to go first with the plus
operator in order to concatenate the strings. Then we need the FORMAT function for the creation time. And what do we
need? We need the short name. So it's going to be three times d, ddd. Let's go and execute it. Let me just call it
custom format. So now as you can see in the output we have here Day, then a space and then the
abbreviation of the name of the day. So it looks good so far. Now after that what do we need? We need a space and then
the abbreviation of the month. We can go and add all of that to the same format here, so we don't have to
create two formats. So a space and the abbreviation of the month, which is MMM. So let's go and test it. Great. So now as
you can see we got the abbreviation of the month as well, side by side. So far we have covered this part. Now we
have to move to the second part. So we still need a space and then Q1. Well, the Q is going to be static. So we cannot go
and extend this format. We have to start a new one. So what I'm going to do, I'm just going to add a plus here and a new
line. So what do we need? We need first a space between the month and the quarter. So let's go and add a space and
we need the Q as a static value like this. Let me just move it like this. And now after that we need this one, right?
So now we need the quarter information and we don't have a format specifier for that. That's why we have to go and
use the part extraction functions, and since we are building a string I will go with
DATENAME. So quarter, and we are extracting from the creation time. So let's go and test it. So now in the output you can see
we have Q1 everywhere and that's because all of those dates are in Q1. All right. So now we are halfway through
our format. So next, what do we need? We need a space and then the year information and then the time
information. So now in order to get the space we're going to do it very simply: concatenate and we're going to
have a space. Now let's go to a new line and in order to get the year I will go with FORMAT as well. So FORMAT and
what do we have? We're going to have the creation time again. So how are we going to format it now?
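For orientation, the complete expression this task builds toward (static text, abbreviated day and month, quarter, year, 12-hour time and AM/PM designator) might look like the sketch below. Sales.Orders, OrderID and CreationTime are assumed names, and the alias is illustrative.

```sql
SELECT
    OrderID,
    CreationTime,
    'Day ' + FORMAT(CreationTime, 'ddd MMM')            -- static word, short day name, short month name
    + ' Q' + DATENAME(quarter, CreationTime)            -- no format specifier for quarter, so DATENAME
    + ' '  + FORMAT(CreationTime, 'yyyy hh:mm:ss tt')   -- year, 12-hour time, AM/PM designator
        AS CustomFormat
FROM Sales.Orders;
```

DATENAME is used for the quarter rather than DATEPART because it returns a string, which concatenates with + without an extra cast.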
What do we need? We need the year. So it's going to be four times y, yyyy, and after that we have a space and then
the time information. We can still do that inside the same format, right? So we're going to have a space here. And then next
what do we have? We have the hours. So it's going to be hh, the small h, because here we are talking about PM and AM.
It's not the 24-hour system. And then after that what do we have? A colon. Then the minutes are going to
be two small m, mm. And then after that the seconds, ss. So far this is exactly this part over here. And now what is missing?
A space and the AM/PM designator. So in order to do that we're going to have a space as well and then two small t, tt. All
right. So we are almost there. Let's go and execute it. Now you can see it is working. So we have the year, then a space,
the hours, minutes, seconds, a space, and then we have the designator. So this is PM and this is AM, which is correct. So that's
it. We are done. This is how you can create those crazy formats in SQL with the help of FORMAT, or maybe DATENAME, or
maybe some static values like we just added here. So I think it's really fun formatting the dates in
SQL. Now one use case for the format that I frequently use in my project is using it to format the date before doing
aggregations. So it's like part extraction but here we have more customizations on how we represent the
date at the reports. So we can show a report like sales by month where we display for example the date as
abbreviation name of the month Jan and as well two digits for the year 25. So once we change the format like this and
then do data aggregations we will have a nice report about the sales by month. So let's have a quick aggregations using
the format. So, we're going to go and say select and now the order date and count the number of
orders from our table sales orders and then GROUP BY. But now before we start using the order date, we have to go and
format it. If you take the plain order date and go and execute it, then as you can see the level of detail is
very high and we have here 10 rows and for each day we have one order. Now we learned we can go and use
DATEPART in order to extract one part and then aggregate on it. So now instead of that we're going to go and use the
FORMAT function. So let's go and apply FORMAT to the order date. And our format is going to be like this:
three capital M and then two digits for the year. That's it. And let's call it order date. And we need this as well for the
order date over here in the GROUP BY, and here a comma. So that's it. Let's go and execute it. So in the output as you can
see over here we have three months and here we have the aggregation, the number of orders for each month. So now
it's like DATEPART but now we are customizing the format as we want. So we can use FORMAT in order to change
the granularity of the date in order to do aggregations. Now I'm going to show you
a real use case for the formatting in real projects. Now our data could be stored in different technologies like
a CSV file, or we can get our data using an API call, or in a very common scenario our data could
be stored in a database. So what we usually do is extract the data from these different sources into one
central storage. It could happen that you are getting different formats for the dates and of course this is a
problem for analytics. You cannot present different formats for the dates. What we're going to do, we're going to go
and clean up the formats into one standard format. So that means we have to convert the incoming data to the new
format, and once we have one standard format we can use it in analytics and reports. So this is a very common use case
in data preparation and data cleanup: bringing different formats into one standard
format. Now in SQL we have many different date and time specifiers and I said they are case sensitive and each
one of them has a different meaning. So I prepared for you as well all possible specifiers that we can use with the
formats. Not only that, if you go back to the queries that you can find in this chapter, you can find here date format.
So all date formats. If you go inside it, you can go and copy the whole query and then go back to SQL then execute it.
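As a rough sketch of how FORMAT is used in this part, the first query below applies a handful of case-sensitive date specifiers to GETDATE(), and the second repeats the month-level aggregation with the 'MMM yy' format. The sample rows, the alias names, and the choice of specifiers are illustrative assumptions, not the contents of the downloadable query; in the lesson the aggregation runs against the Sales.Orders table.

```sql
-- A few date specifiers applied to the current date and time.
-- Specifiers are case sensitive: 'MM' is the month, 'mm' is the minute.
SELECT
    FORMAT(GETDATE(), 'dd')         AS DayNumber,
    FORMAT(GETDATE(), 'ddd')        AS DayShortName,
    FORMAT(GETDATE(), 'MM')         AS MonthNumber,
    FORMAT(GETDATE(), 'MMM')        AS MonthShortName,
    FORMAT(GETDATE(), 'yy')         AS YearTwoDigits,
    FORMAT(GETDATE(), 'yyyy-MM-dd') AS YearMonthDay;

-- Month-level aggregation using FORMAT as the grouping key.
-- The rows below are made-up sample data standing in for Sales.Orders.
WITH SampleOrders AS (
    SELECT OrderID, CAST(OrderDate AS DATE) AS OrderDate
    FROM (VALUES
        (1, '2025-01-01'),
        (2, '2025-01-05'),
        (3, '2025-02-10'),
        (4, '2025-03-15')
    ) AS v (OrderID, OrderDate)
)
SELECT
    FORMAT(OrderDate, 'MMM yy') AS OrderDate,
    COUNT(*)                    AS NrOfOrders
FROM SampleOrders
GROUP BY FORMAT(OrderDate, 'MMM yy');
```

Grouping by the formatted string means the result is text, so the months sort alphabetically unless you order by a real date expression.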
The result is a live example, because the query formats the current GETDATE() value. So you get a list of all the
date specifiers that you can use with FORMAT. I would say go and practice with those different date formats in
order to understand what is possible in SQL. As we learned, we can change not only the format of a date but also
the format of a number using the FORMAT function, and those are the different specifiers you can use to change
the format of numbers. I have prepared all of those specifiers in one big query as well. If you open it, copy it,
paste it into SQL, and execute it, you will find all the possibilities we have as specifiers to change the format
of the
numbers. All right. So what is CONVERT? It's very simple. It changes a value to a different data type, and at the
same time it can help with formatting the value. Okay, so let's check the syntax of CONVERT. It starts with the
function name CONVERT and it accepts two required parameters. The data type comes first, since we can use this
function to cast data types, so you can use string, integer, date, and so on. Then we have to specify the value,
so which value should be cast. The last parameter is an optional one where you define the style, the format of
the value. Let's have this very simple example. We are saying convert to the data type INT, and the value that
should be converted is '123' as a string, so it's going to convert it to an integer. In the next one we are
saying convert to VARCHAR, and the value that should be converted is the order date. The order date is a date, so
we're going to convert it from date to VARCHAR using the style 34. So here we are specifying a style, a format,
for this value. Of course it is optional, and if you don't specify anything, the default style that's going to be
used is 0. So this is the syntax of CONVERT in SQL. All right. So now we're going to have a few examples
on how to work with CONVERT. Let's convert, for example, a string to an integer. So we're going to say CONVERT.
What is the target data type? It's going to be INT. And the value is going to be, for example, '123'. Let's call
it like this: string to int CONVERT. In the column name, as you can see, I'm using square brackets, and that's
because I'm using spaces and so on, and with that I get more freedom in how to name things. So this is just the
name, it is not a function or anything. Let's execute it. As you can see, it works. We are converting from a
string value to an integer, and the 123 in the output is not a string, it has the data type integer. All right.
Now let's have another example where we want to convert from string to date. The target is going to be DATE, the
value is our usual one, and we're going to call it string to date CONVERT. Okay, let's execute it. In the output
we get this string as a date, and with that we have converted the data type from string to date. Now let's have
another example where we want to convert a datetime to a date. As you remember, the creation time is a datetime,
and we would like to have it as a date only. So let's say CONVERT, the target is DATE again, but this time the
value is the column CreationTime, and let's give it a name: datetime to date CONVERT. Of course here we have to
select from a table, so FROM Sales.Orders. Let's execute it. As you can see in the output, we got only the date.
I'm going to select the creation time in the query as well. Now you can see that the creation time was a datetime
before, so we have the time information as well. But if you cast it using CONVERT and make it a date only, SQL is
going to convert it to a date and you lose all the information about the time. So far, what we are doing here is
just casting.
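The three casting-only calls described so far could look like the following sketch. The literal date and the sample datetime value are assumptions made so the query runs without the course database; in the lesson the last conversion reads the CreationTime column from Sales.Orders.

```sql
-- CONVERT(target_type, value [, style]) used purely for casting (no style given).
SELECT
    CONVERT(INT, '123')          AS [String to Int CONVERT],
    CONVERT(DATE, '2025-08-20')  AS [String to Date CONVERT],
    -- Hypothetical datetime value standing in for the CreationTime column.
    CONVERT(DATE, CAST('2025-08-20 14:30:00' AS DATETIME2))
                                 AS [Datetime to Date CONVERT];
```

The square brackets only delimit column aliases that contain spaces; they have no effect on the conversion itself.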
So, we are changing the data type from one to another. But with CONVERT we can do both: casting and formatting.
Let's see how we can do that. I will just get rid of the columns at the start and keep the creation time. Now
we're going to convert the datetime of the creation time to a VARCHAR, to a string, and also give it the USA
standard format. We're going to start with CONVERT. We are changing now to VARCHAR, so this is the new data type,
and the value is the creation time. If I don't give it a style, it's going to stay with the standard format, but
we would like to have the USA standard. In order to do that, we're going to add the style of the format, and it's
going to be 32. That's it. Let's have a name like this: USA standard, and we are using the style 32. This is just
a name again, so it's not a function. Let's go ahead and execute it. Now in the output we got a new field, and
the data type of this field is VARCHAR, so it's not a date or datetime. And as you can see, the date is now
formatted using style 32, the US standard format. It starts with the month, then the day, and then the year. Now
let's do the same thing in order to get the standard format in Europe. I will just copy the whole thing and
change the style: instead of 32 we're going to go with 34. And I will change the name as well. So we are just
changing the style. Let's go ahead and execute it. As you can see, we got the same kind of result. We have a
VARCHAR as well, and the format is now different: we have the day, then the month, and then the year. So this is
how you work with the CONVERT function. You can use it to do only casting, or you can do casting and formatting
together, so you have both things in one function. And now, if we're talking about which
styles are available, we have many styles that you can use inside CONVERT. So I have prepared for you a list of
all the styles that you can use with CONVERT. We have styles only for dates, other styles only for times, and
styles only for datetimes. Now, in the download folder you can find one file called all culture formats. In it
there is one query that I have prepared, and inside it you can find the different cultures and the examples. So
let's copy it, go back to SQL, paste it, and see the results. If you check the output, the first column is the
culture that is used. We have a lot of cultures, around 17, and you can see how the numbers or the dates are
formatted based on each culture. So it's really fun. You can check, for example, how the format looks in Japan or
Korea or France, and the German one. If you scroll down, you can find the Arabic, the Russian, and so on. So you
can see that the format of each date is changing based on the culture.
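To tie the last two demos together, here is a minimal sketch of CONVERT with a style and FORMAT with a culture. The lesson uses the style numbers 32 and 34; the sketch swaps in 101 and 103, the documented U.S. and British/French styles that produce month/day/year and day/month/year. The sample value, the aliases, and the culture codes are assumptions chosen for illustration.

```sql
-- Hypothetical sample value standing in for the CreationTime column.
DECLARE @CreationTime DATETIME2 = '2025-08-20 14:30:00';

-- CONVERT: cast to VARCHAR and pick a date style in one call.
SELECT
    CONVERT(VARCHAR(10), @CreationTime, 101) AS [USA Std. Style:101],   -- mm/dd/yyyy
    CONVERT(VARCHAR(10), @CreationTime, 103) AS [EURO Std. Style:103];  -- dd/mm/yyyy

-- FORMAT: the optional third argument is a culture code.
-- 'D' is the long date pattern and 'N' is the number pattern of that culture.
SELECT
    FORMAT(@CreationTime, 'D', 'en-US') AS LongDateUS,
    FORMAT(@CreationTime, 'D', 'de-DE') AS LongDateGermany,
    FORMAT(@CreationTime, 'D', 'ja-JP') AS LongDateJapan,
    FORMAT(1234567.89, 'N', 'en-US')    AS NumberUS,
    FORMAT(1234567.89, 'N', 'de-DE')    AS NumberGermany;
```

Both functions return strings here, so the results are meant for presentation rather than for further date arithmetic.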
So I would say have fun. Go and try those different culture formats in order to format your numbers or
dates. So what is the CAST function? It converts a value to a different data type, so it turns one data type into
another. All right. Now let's check the syntax of CAST. I really like this one. It is not the typical function
syntax in SQL. CAST is the function name, and inside it we need two things, but they are not separated with a
comma as we learned with all the other functions. This time they are separated with the keyword AS. So it's like
natural English: you are saying cast the value as a data type, so you are casting the value to a new data type.
Let's have this very simple example: cast the value '123' as INT. Previously it is a string, and it is going to
be converted to an integer. As you can see, it's very simple. In the next example we are saying cast this string
value as a DATE, so it is converted from string to date. And as you can see, with CAST we don't have any option
for formatting or styling the values. It is dedicated only to casting the value from one data type to another.
So this is the syntax of CAST. It is a very straightforward and really nice function. Okay. So now let's have a
few examples of CAST. Let's convert a value from a string to an integer. It's very simple. We're going to say
CAST, and now we need the value, so let's go with '123'.
So we have here a string. Then we're going to say AS, and then we have to define the data type, which is going to
be INT. That's it. Let's give it a name like this: string to int. Let's execute it. As you can see, we got the
value, but with the data type integer, so from string to integer. Now let's do it the other way around and cast
from integer to string. We're going to say CAST 123 AS VARCHAR, and we're going to give it the name int to
string. Let's execute it. In the output we have 123, but this time it has the data type VARCHAR. Now let's work
with dates. We're going to convert a string value to a date. Our value is going to be the usual one, and we want
it from string to date, so we're going to have the data type DATE. Let's give it the name string to date and
execute it. Now we have this value with the data type date. That's it. Now let's say that I would like to have
this value as a datetime. I will just copy the whole thing, go to a new line, and say DATETIME2. The name of this
one is going to be string to datetime. Let's execute it. In the output, as you can see, we are getting not only
the date but the time information as well. But since we didn't provide SQL with any time information, SQL is
going to show it as zeros.
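The four CAST calls from this walkthrough can be sketched in one query. The literal '2025-08-20' is an assumed stand-in for the usual sample value, and the aliases are illustrative.

```sql
-- CAST(value AS data_type): casting only, no style or format argument.
SELECT
    CAST('123' AS INT)              AS [String to Int],
    CAST(123 AS VARCHAR)            AS [Int to String],
    CAST('2025-08-20' AS DATE)      AS [String to Date],
    -- No time was supplied, so the time part is filled with zeros.
    CAST('2025-08-20' AS DATETIME2) AS [String to Datetime];
```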
Now let's do one more casting where we change the data type from datetime to date. For that we need our creation
time, and we have to get it from the table, so FROM Sales.Orders. Let's execute it. In the output you can see
that the creation time is a datetime. We have the time information, but we are not interested in it. I would
like to have this field as a date. What we're going to do is very simple. We're going to say CAST, the value is
the creation time, then the keyword AS, and we need it as a DATE. We're going to give it the name datetime to
date. Let's execute it. As you can see in the output, we got the creation time but only with the date
information. We don't have anything about the time, so we get it as a date instead of a datetime. That's it. This
is an amazing function in SQL. It's very simple, and we can use it only for casting, so only to change the data
type from one to another. We cannot use this function to change the format, so if you are casting you will
always get the standard format from SQL. Now let's compare our functions side by side. So we have our
three functions, CAST, CONVERT, and FORMAT, and we can do two things: either casting or formatting. Starting with
the casting: with the first function, CAST, we can change any type to any other type, so there is no restriction
here. The same goes for CONVERT, we can convert anything to anything. But with FORMAT we can change only to a
string, so any data type like a date or a number becomes a string value, because the main purpose of FORMAT is
not changing the data type. Now if we are talking about changing the format of the values, you cannot use the
CAST function for that. The CAST function is only for casting, which makes sense. As for CONVERT, we can use it
to change the format of the date and time, but we cannot use it to change number formats. For that we have the
dedicated function FORMAT, which we can use to change the format of the date and time and of numbers as well. So
those are the main differences between those three functions. All right. With those three functions we have
learned how to do formatting and casting on date information. Now moving on to the third group, we have the date
calculations, and here we have two functions on how to do date calculations
or mathematical operations on the dates. Okay, so now we're going to start with the first function, DATEADD. So
what is DATEADD? DATEADD allows us to add or subtract a specific time interval to or from a date. Let's
understand how DATEADD works. Here again we have our date, August 20th 2025. In some scenarios we would like to
add years to our date. For example, let's say I would like to add three years to our date. We can do that using
DATEADD, and in the output you will get 2028 August 20th. Only the year part has changed, because we added three
years. In other scenarios you would like to add months. For example, let's add two months to August. In the
output you will get 2025-10-20, and with that we have added two months. And of course we can add days to our
date. For example, we're going to add five days. In the output we'll get the same year 2025 and the same month
August, but the day is changed to 25. So we have added five days to the original date. And of course we can
subtract as well, even though the function is called DATEADD. For example, we can subtract three years from our
date, and you will get 2022 August 20th. Or if you subtract two months from our date, it stays in the same year
2025, but instead of August we go back to June, with the same day, 20. The same thing happens for the days if
you subtract five days: the same year 2025, the same month August, but the day is going to be 15 instead of 20.
So as you can see, with DATEADD you can manipulate the years, the months, and the days by subtracting or adding
intervals. So this is how DATEADD works. All right. So now let's check the syntax of
DATEADD. Here things are a little bit more complicated. We have to provide three pieces of information. The first
one is the part: what do you want to add? Do you want to add years or months or days and so on? The second one is
the interval, so how many days, how many years, how many months. And the last one is the date. This is the date
that we're going to be manipulating by adding or subtracting intervals. Let's check the following example. We are
saying here DATEADD, and the part is year. That means we want to manipulate only the year part. Then the interval
is 2, so it is positive: we want to add two years. It's going to go to each order and add two years to each date
value. Now let's check another example. Here we are saying DATEADD month, so we want to manipulate the month
part. But here we are saying minus 4, which means we want to subtract four months from each value in the order
date. So as you can see, with the sign of the interval, positive or negative, we are controlling whether the
function does addition or subtraction. Let's have a few examples of DATEADD using our field OrderDate. For
example, let's add two years to each date. We can do it like this: DATEADD. We are adding years, so we're going
to go with the part year. How many years are we adding? Two years, so this is our interval, and our value is the
order date. Now in the output, as you can see, we got a date, and this date is always 2 years later than the
order date. So everywhere you see 2027.
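A sketch of this DATEADD query, including the month and day variants that the walkthrough turns to next, might look like this. The sample rows stand in for Sales.Orders and are made up, as are the alias names.

```sql
-- Made-up sample rows standing in for Sales.Orders.
WITH SampleOrders AS (
    SELECT OrderID, CAST(OrderDate AS DATE) AS OrderDate
    FROM (VALUES
        (1, '2025-01-01'),
        (2, '2025-02-05'),
        (7, '2025-03-15')
    ) AS v (OrderID, OrderDate)
)
SELECT
    OrderID,
    OrderDate,
    -- DATEADD(part, interval, date): a positive interval adds, a negative one subtracts.
    DATEADD(year,  2,   OrderDate) AS TwoYearsLater,
    DATEADD(month, 3,   OrderDate) AS ThreeMonthsLater,
    DATEADD(day,   -10, OrderDate) AS TenDaysBefore
FROM SampleOrders;
```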
Now let's add maybe three months to each date. I'm just going to copy it and say month. Let's change the
interval to 3, and we're going to call it three months later. If you check the output, we have a new date, and
it is always three months later than the order date. For example, here we have January, and in the new one we
have April. For the next one we have February, and in the new field we have May. So as you can see, we are
adding months to our original field, the order date. Now let's say that I would like to subtract 10 days. Let's
do the same. We're going to have DATEADD. Since we are talking about days, the part is going to be day. We're
going to subtract 10 days, so minus 10, from the order date. Let's call it ten days before and execute it. We
got a new date here as well, and this date is always 10 days before the order date. For example, let's take
order number seven. In the order date we have the 15th, but in the new column we have the 5th. So we have
subtracted 10 days from the original field, the order date. As you can see, it's very simple to add or subtract
days, years, and months using DATEADD. All right. So what is DATEDIFF?
Diff stands for difference, and DATEDIFF allows us to find the difference between two dates. All
right. So let's understand how DATEDIFF works in SQL. Imagine we have two dates. We have the order date, 2025
August 20th, and the shipping date is the 1st of February in the next year, 2026. Now we might ask the question:
how many years have passed between the order date and the shipping date? In order to answer this question we can
use the function DATEDIFF and define the part year. If you do it like this, it subtracts those two dates and it
returns 1. DATEDIFF counts the year boundaries crossed between the two dates, so the difference comes back as one
year even though less than six months have actually passed. But now if the question is how many months are
between the order date and the shipping date, here again we can use DATEDIFF between the order date and the
shipping date, but with the part month. If you do it like this, in the output you will get 6 months. And of
course, if the question is how many days are between the order date and the shipping date, we can use DATEDIFF
with the part day, and in the output you will get 165. So this is how DATEDIFF works. You subtract two different
dates and you get in the output a number:
how many years, how many months, how many days. So that's it. All right. Now to the syntax of DATEDIFF. It
accepts three parameters here as well. The first one is the part, as usual year, month, day. And then here we
need two dates, not only one. We need the start date and the end date. That means the start date is the earlier
date and the end date is the later one. For example, here we have DATEDIFF and we are saying find the difference
in years between the order date, which is the start date, and the shipping date. So which date normally happens
first? First we have to order something, so we have the order date, and once you order, what can happen next is
the shipping. That's why the shipping date is the end date. So we want to find the difference between them in
years, or of course, if you want to find the difference between them in days, you have to change the part from
year to day. So as you can see, the syntax is very simple and very logical, right? All right, let's have the
following simple task, and it says calculate the age of employees. Let's see how we can solve that. We're going
to select first all the information from employees, so Sales.Employees. Okay, let's execute it.
Now in the employees we don't have any information about the age, but we have the birth date. So we can
transform this birth date into an age. And of course, how do we calculate the age? We count how many years are
between this year and the birth date. That means we have to use two functions, DATEDIFF and GETDATE, the second
one in order to get the current year. So let's do that. I'm going to first select only a few columns, so
EmployeeID and BirthDate. Let's start with DATEDIFF. If we are talking about the age, we are calculating how many
years, so that's why the part is going to be year. What is the start date? It is the birth date of the person,
so it's going to be BirthDate. And now we need the end date. We don't have anything in the table for the end
date. The end date is going to be the current year. In order to get the current year, we're going to go with the
function GETDATE, and with that we are getting the current date information.
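Putting the three arguments together, the age calculation might look like the sketch below. The first query simply runs DATEDIFF on the two example dates from the explanation above, so each part can be checked directly. The employee rows are invented sample data standing in for Sales.Employees. One caveat worth keeping in mind: DATEDIFF with the year part only counts calendar-year boundaries, so a person whose birthday falls later in the current year is already counted as a year older.

```sql
-- DATEDIFF(part, start_date, end_date) counts how many part boundaries are crossed.
SELECT
    DATEDIFF(year,  '2025-08-20', '2026-02-01') AS YearsBetween,
    DATEDIFF(month, '2025-08-20', '2026-02-01') AS MonthsBetween,
    DATEDIFF(day,   '2025-08-20', '2026-02-01') AS DaysBetween;

-- Age in years: start date is the birth date, end date is the current date.
-- Invented sample rows standing in for Sales.Employees.
WITH SampleEmployees AS (
    SELECT EmployeeID, CAST(BirthDate AS DATE) AS BirthDate
    FROM (VALUES
        (1, '1991-03-14'),
        (2, '1972-11-02')
    ) AS v (EmployeeID, BirthDate)
)
SELECT
    EmployeeID,
    BirthDate,
    DATEDIFF(year, BirthDate, GETDATE()) AS Age
FROM SampleEmployees;
```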
And this is exactly what we want. So let's close it and call it Age. It's very simple: we are counting how many
years are between the birth date and the current date. Let's execute it. Now we are getting the ages. As you can
see, the first person is 33, the second one is 52, and so on. You might be getting different values than I'm
getting now, and that's because maybe you are doing the course in 2025 or 2026 and the employees are going to be
older by then. Right now it is 2024 and I'm getting those ages. So this is how we calculate the age with the help
of two functions, DATEDIFF and GETDATE. Okay, so now we have another task for
DATEDIFF, and it says find the average shipping duration in days for each month. So here we have a lot of
information. Let's do it step by step. Let's first find the shipping duration in days. So let's
select a few columns from our table: select OrderID, then we have the OrderDate and the
ShipDate, and I think that's it, from Sales.Orders. Let's go ahead and execute it. Now we have our 10
orders, with the order date and the shipping date. Next we have to create a new field called shipping
duration. So what is the shipping duration? It is the number of days between the order date and the shipping
date, so how many days it took from the order placement until the day of shipping. That means we have two
dates and we have to find the difference between them. We're going to go with the function DATEDIFF. Since we
are saying in days, we have to go with the part day. What is the start date? The start date is the order
date. And what is the end date? It's going to be the shipping date, like this. I'm going to call it days to
ship. Let's execute it. Now, checking the result, for example order one was ordered on the 1st of January and
shipped on the 5th of January. Between those two dates we have 4 days, so four is the shipping duration. And if
you go to order number three, the difference between the order date and the shipping date is 15 days. With that
we have solved this part, the shipping duration in days. But the task says we have to find the average duration
for each month. That means we have to take, for example, the month of January and find the average duration. So
we have to do a simple aggregation. We're going to go to the start of the DATEDIFF and say AVG, and we're going
to close it over here. Let's rename it to average shipping. Now we have to aggregate by the month. We don't need
the whole order date, we need the month of the order date, like this. Of course we don't need the order ID
anymore, but we do need to group the data by this dimension, the month of the order date. That's it. Let's
execute it. In the output you can see we have three months, and for each month we have the average shipping
duration in days. For the first month it is around 7 days, for February it is 7 days as well, and for March we
have a shorter duration, 5 days. With that we have solved the task. As you can see, DATEDIFF is a very strong
function for doing data analytics on date information. All right.
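The finished query for this task might look like the following sketch. The sample rows are made up and stand in for Sales.Orders; MONTH(OrderDate) is one assumed way to express "the month of the order date", and the aliases are illustrative. Note that AVG over an integer expression returns an integer in SQL Server, so fractional days are dropped.

```sql
-- Made-up sample rows standing in for Sales.Orders.
WITH SampleOrders AS (
    SELECT
        OrderID,
        CAST(OrderDate AS DATE) AS OrderDate,
        CAST(ShipDate  AS DATE) AS ShipDate
    FROM (VALUES
        (1, '2025-01-01', '2025-01-05'),
        (2, '2025-01-05', '2025-01-10'),
        (3, '2025-01-10', '2025-01-25'),
        (4, '2025-02-01', '2025-02-08'),
        (5, '2025-03-10', '2025-03-15')
    ) AS v (OrderID, OrderDate, ShipDate)
)
SELECT
    MONTH(OrderDate)                        AS OrderMonth,
    -- Shipping duration per order in days, averaged within each month.
    AVG(DATEDIFF(day, OrderDate, ShipDate)) AS AvgShip
FROM SampleOrders
GROUP BY MONTH(OrderDate);
```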
So now we have the following task, and it says find the number of days between each order and the previous
order. There's a lot of stuff going on here, so let's do it step by step. Let's start by selecting the basic
columns: select OrderID and OrderDate from the table Sales.Orders. Let's execute it. We have our 10 orders
and we have the current order dates. Now we have to find the difference between two dates: the order date of
the current order and the order date of the previous one. In our data we have the current order date, but we
don't have the previous order date for each order. In order to calculate the previous one, do you remember the
window functions? We can use LAG in order to access a value from a previous record. So let's do that. The
order date I'm just going to call current order date. And let's find the previous order date. We're
going to go with the LAG of the order date, because we are interested in the value of the order date. In the
OVER clause we have to sort the data, so we're going to sort it by the order date as well. This is
going to help us always access the previous value of the order date. We're going to call it
previous order date. Let's execute it and check the result. For the first order we don't have
anything before it, so that's why we are getting a NULL. For the second record, the current order date is the
5th of January and the previous one is the 1st of January, and this value comes from the previous record, the
previous order. Great. With that we now have the two dates, the current date and the previous one, and we can
very simply find the number of days between those two dates using the function DATEDIFF. We are interested in
days, so the part is going to be day. What is the start date? If you check those two dates, you can see that
the previous order date is the start date. So we're going to take the whole thing, the whole window function,
and put it over here. So here is the previous order date. And now the end date, what's it going to be? It's
going to be the current order date, which is our order date, like this.
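The expression assembled in this step could be sketched as follows: the LAG window function supplies the previous order date as the start, and OrderDate is the end. The sample rows are invented and stand in for Sales.Orders, and the alias names are assumptions.

```sql
-- Invented sample rows standing in for Sales.Orders.
WITH SampleOrders AS (
    SELECT OrderID, CAST(OrderDate AS DATE) AS OrderDate
    FROM (VALUES
        (1, '2025-01-01'),
        (2, '2025-01-05'),
        (3, '2025-01-10'),
        (4, '2025-01-20')
    ) AS v (OrderID, OrderDate)
)
SELECT
    OrderID,
    OrderDate                                AS CurrentOrderDate,
    -- Previous row's order date when rows are sorted by OrderDate (NULL for the first row).
    LAG(OrderDate) OVER (ORDER BY OrderDate) AS PreviousOrderDate,
    -- Start = previous order date, end = current order date.
    DATEDIFF(day,
             LAG(OrderDate) OVER (ORDER BY OrderDate),
             OrderDate)                      AS NrOfDays
FROM SampleOrders;
```

Because the first row has no previous order, its start date is NULL and DATEDIFF returns NULL for that row.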
Again, we are finding the number of days between the previous date and the current date. That's it. Let's close
it. I'm just going to call it number of days. Let's execute it. Now of
course we have a NULL here, so we get a NULL in the output as well. And you can check over here how many days
are between those two dates: we have exactly four days. For the next ones we have 5 days, 10 days, and so
on. So we have solved the task. We now have the number of days between each order and the previous order. This
type of analysis is very important in business. We call it time gap analysis, and we have done it with the
help of a window function together with the date function DATEDIFF. So DATEDIFF is an amazing function for data
analysis. All right. With those two functions we have learned how to do mathematical operations on date
information, or we can call it date calculations. Now moving on to the easiest and the last group, we have the
date validation, and here we have only one function, ISDATE. Okay. So what is ISDATE?
ISDATE is very simple. It checks whether a value is a date. It returns 1 if the string value
is a valid date or 0 if it is not a valid date. Okay. So let's quickly check the syntax of ISDATE. It's very
simple. ISDATE is the function name and it accepts only one value. For example, you can pass a
string like this and ask SQL, is it a date? So ISDATE and the value, and of course for this example you will get
true, so 1. As you can see, we are passing a string value here and we are validating whether it is good enough
to be a date. You can also specify a number, like 2025 here. So is this value a date? SQL
is going to accept it and say yes, this is a year, so you will get a 1 as well. So you can also pass a number,
an integer. You are just checking whether the values are suitable to be a date. That's all about the
syntax of ISDATE. Okay. So now let's have a few examples. Let's select, and we're going to
say ISDATE and check a value. Let's say this value is the string '123'. Let's call it date check one.
Let's execute it. Now in the output it's going to say no, it is not a date, and that's why we are getting the
value 0, which is correct because '123' is not a date. Let's pick another value. The same thing, ISDATE, and now
the value is going to be the following: 2025 August 20. Let's call it date check two. And let's execute it.
Now in the output we get 1. That means the value that we have provided is a date, and that's why we have a 1 in
the output: SQL is saying this is a date. Now let's have another example. We're going to take the whole
thing, so this is check three, and I would like to change the format. Let's
say that we start with the day, then the month, and then the year. Let's check. In the output you can see it
is 0, because SQL does not understand this format. We are not following the standard format of the database,
and that's why SQL is going to say no, this is not a date, this is just a string value. So this means that only
if the value follows the standard format is SQL going to understand that this is a date. Now let's check another
thing. For example, let's say ISDATE and have only the year, so 2025, and let's give it the name date check
four. Let's execute it. In the output we get 1, so SQL is considering this value a date. That means
SQL is smart enough to understand that we have provided a year, and it is going to accept it and say, okay, maybe
this is the 1st of January 2025. Now let's do the same thing for the month and see whether SQL is going to
accept it. So check five, and we have the month of August. Let's check. Now it says no, I don't understand this
value, and the result is 0. That means the value provided is not a date. So, checking those results, as you can
see SQL understands only the standard formats, and it also accepts a year on its own as a date. So this is how
ISDATE works in SQL. And now you might ask, well, when am I going to do this? When am I going to check whether the
value is a date or not? Let me give you the following scenario. Imagine that we have the following data: we
have four values as strings. If you check the data, you can see that we are following the standard format, but
one value has an issue. So we have a data quality problem here. What we want to do is cast
these string values to dates. We don't want them to stay as strings. We would like to have them in the final
result as dates. What we usually do is have a subquery on top of those values, like this.
Now we're going to say that we would like to cast the order date as DATE, because we don't
want it as a string, and we're going to call it OrderDate, from these values. Let me just make it like this and
execute it. Now SQL is going to give you an error and say, well, I cannot convert everything to a date
because you may have corrupt data. This is of course because of this one row: SQL is not able to convert this
string to a date. Of course the example here is very simple and we know where the problem is, but if you have a
huge table it's going to be really hard to identify those issues. Still, I would like to convert those values. I
don't want to get an error, and if there are some values like this one that are corrupt, they can simply become
NULL. So how can we get SQL to convert the data type from string to date and not give us this error? For this we
can use the help of the function ISDATE. Let me show you how I usually do it. Let's check
whether the order date is a date, like this. Before we execute, I'm going to turn the CAST into
a comment, because if I execute it like this we will get an error. And let's get the order date in our
SELECT. Let's execute it. As you can see in the output, we have our string values, so they are not yet
dates, and we have the result of our check. For the first row we are getting a 0, so it's saying
this value is not a date. For all the other values we are getting 1, so they pass the check and they are
dates. Now we're going to build a logic where we say: cast the
value from string to date only if the flag, the check, is equal to 1. That means we can use the help of
the CASE WHEN statement. Let me show you how we can do that. Let's do it step by step. We're going to say CASE WHEN.
Now we need the check, so ISDATE of the order date. If the output of this check is equal to one, then you
are allowed to do the casting. So let's put the cast as the result of this condition, and if it's not equal to one,
then it can stay as a null. So let's have a null if it didn't pass the test. Then END, and we can call it new
order date. So now let's go and execute it. Now as you can see, we are not getting an error from SQL. If you
check the output, for the invalid date we are getting a null instead of an SQL error, and
the string values are cast only if they are valid dates. So you can cast string values to dates
even though you have bad data quality. This is a very important step in preparing the data before
doing analysis, and it helps us as well to find data quality issues. For example, we can go over here and say:
let's search for all issues. We take the ISDATE check
and say, let me see all string values that are invalid, that are failing the test. So let me execute
it. With that we are getting this record. Now imagine we have a lot of data: it's really easy to
identify those issues by just using ISDATE. So this is as well an amazing way to identify data quality
issues. Now of course you might say, I don't want to see a null here, maybe let's get a dummy value. Well, it's
very easy. We can go over here and say ELSE, and we can use, for example, a very large value, something like
this, that is easy to identify. With that, instead of getting nulls inside your data, you get such a
dummy value. So now you understand the use case of ISDATE and why this function is amazing for doing data
cleanup. All right. So with that we have covered 13 different date and time functions in SQL. We have learned how
to extract the date parts using seven different functions, and we have learned as well when to use which one. They
are amazing for doing data aggregations and filtering. Then we learned how to change the
date format from one to another and how to change the data types. Then we learned how to do mathematical
operations on our dates: how we can add or subtract days, months, or years from a date, or the amazing function
DATEDIFF, where we can find the difference in days or years between two dates. And lastly, we can
validate whether the values that we have are dates or not. So as we learned, date functions are amazing for
doing data analysis and reporting. All right, my friends. With that we have learned a lot of very important SQL
functions and how to manipulate the date and time values in your database using SQL. Now in the next section we're
going to start talking about the null functions, in order to handle the nulls inside your tables. So let's go.
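Before moving on, the ISDATE cleanup pattern from the end of the date section can be sketched roughly like this. It is a minimal T-SQL sketch, not the exact query from the session: the table variable @Orders, the four sample strings, the column aliases, and the dummy value 9999-01-01 are all assumptions made for illustration. ISDATE results can also depend on the session's language and date format settings.

```sql
-- Hypothetical sample data: one corrupt value and three valid date strings.
DECLARE @Orders TABLE (OrderDate VARCHAR(20));
INSERT INTO @Orders (OrderDate)
VALUES ('2025-13-45'), ('2025-08-10'), ('2025-08-11'), ('2025-08-12');

SELECT
    OrderDate,                                -- still a string
    ISDATE(OrderDate) AS DateCheck,           -- 1 = convertible, 0 = not convertible
    CASE WHEN ISDATE(OrderDate) = 1
         THEN CAST(OrderDate AS DATE)         -- cast only the values that pass the check
    END AS NewOrderDate,                      -- no ELSE branch: failing values become NULL
    CASE WHEN ISDATE(OrderDate) = 1
         THEN CAST(OrderDate AS DATE)
         ELSE '9999-01-01'                    -- assumed dummy value that is easy to spot
    END AS NewOrderDateWithDummy
FROM @Orders;

-- Data quality check: list only the strings that fail the test.
SELECT OrderDate
FROM @Orders
WHERE ISDATE(OrderDate) = 0;
```

The first query never raises a conversion error because the CAST is only reached for rows where the check returns 1.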
So what are nulls? Imagine you are filling out a form, and there are usually fields that are required
and other fields that are optional. So what usually happens? We leave those optional fields unanswered. We don't
provide any values and we leave them empty. Once we are done filling out the form and we click on register,
the data will be inserted into database tables. So what happens? The fields where you have provided answers
are filled inside the table, while the unanswered fields have no value, and this is what we call
a null in SQL. In databases a null means nothing, unknown. It is not equal to anything: it is not equal to zero,
or an empty string, or a blank space. A null is simply nothing. It tells us there is no value, it is missing. It's like
saying, I don't know what this value is. So this is what a null means in SQL. All right friends, now we're
going to do a deep dive into the special SQL functions for handling the nulls inside our data. In some scenarios
we have nulls inside our tables and we would like to remove them and replace them with a new value, for
example 40. To do that in SQL we have two functions: the first one is called ISNULL and the second one
is called COALESCE. Now let's say we have another scenario where we have a value inside our table, like the 40, and
we want to make it a null. So now we are doing the exact opposite: we are replacing the value with a null, and
for that we have the SQL function NULLIF. As you can see, in those two scenarios we are replacing stuff,
from null to value or from value to null. So they are really helpful for manipulating the data inside our
databases. Now moving on to another scenario where we don't want to manipulate anything; we just want to
check. We don't want to replace or convert anything, we just want to check in our database whether we have a null
value. For that we have the keyword IS NULL. Note that there is a space between IS and NULL, so it is
different from the first function. If you apply IS NULL, you get a boolean, true or false; for this scenario
you will get true. The second option is to check whether the value is not null. So we can use IS NOT NULL,
and for this example you get false. In the output we are getting a boolean, true or false. Those keywords
are really amazing for checking whether we have nulls inside our data. So this is the big picture of all the
functions that we have in SQL for handling the nulls. So now let's go and understand those functions one by one.
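As a quick orientation, the big picture above can be sketched in one query. This is a minimal T-SQL sketch under assumptions: the table variable @Prices, its three sample values, and the column aliases are made up for illustration. Since T-SQL has no boolean column type in a select list, the IS NULL and IS NOT NULL checks are wrapped in CASE expressions here to make their result visible.

```sql
-- Hypothetical sample data: a missing value, the value 40, and another value.
DECLARE @Prices TABLE (Price INT);
INSERT INTO @Prices (Price) VALUES (NULL), (40), (25);

SELECT
    Price,
    ISNULL(Price, 40)   AS IsNullResult,     -- NULL is replaced with 40
    COALESCE(Price, 40) AS CoalesceResult,   -- NULL is replaced with 40
    NULLIF(Price, 40)   AS NullIfResult,     -- the value 40 is replaced with NULL
    CASE WHEN Price IS NULL     THEN 'true' ELSE 'false' END AS IsNullCheck,
    CASE WHEN Price IS NOT NULL THEN 'true' ELSE 'false' END AS IsNotNullCheck
FROM @Prices;
```

In practice IS NULL and IS NOT NULL are mostly used as conditions, for example in a WHERE clause, rather than as output columns.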
So let's start with the first function, ISNULL. ISNULL replaces a null with a specific value. The
syntax of ISNULL is very simple: we use the keyword ISNULL and it accepts two arguments, first the
value and then the replacement value. So let's have an example. We can use ISNULL
for the column called shipping address. We are checking for nulls inside it, and if SQL encounters any null, it
replaces it with the value 'unknown'. So this is going to be like a default value for the nulls. The
first value is a column and the second value is static: it's always going to be 'unknown' if we find any nulls. Now of
course in other scenarios we don't want to always have 'unknown'. We would like to use another column to help
the first one. So let's have this scenario. With this syntax we are checking the values of the shipping
address, and if we find any nulls, SQL gets the replacement from the billing address. So in this example
we have two columns and no static value. We get the
values of the billing address only if the shipping address is null. So here we are replacing the nulls with the help
of another column, and in the first scenario we are replacing the nulls with a static
value, the default value. So let's have a very simple example to learn
how this works. What we are doing is checking whether the value is null. If yes, then we
get the value from the replacement, and if the value is not null, then we show the value itself. So we have the
following example. We are going to check the values from the shipping address, and if there are nulls, replace them
with the default value 'N/A'. Let's see how SQL is going to execute this very simple example. We have two orders. For
the first order we check the shipping address: is the value of this address null? Well, no, we have the value A.
That's why SQL returns the same value, so in the output we will get A. If it's not null, it
returns the same value. Now it moves to the second order, and here we have the shipping address as
a null. So what is going to happen here? If the value is null, then we get the replacement value. And what is the
replacement value? It is the 'N/A'. That's why in the output we will not get a null, we will get the 'N/A'. So if you
check the result, what happens? We get the addresses from the shipping address, and only if we have a null do we
get the default value. It's very important to understand that if you are using a default value, you will
never get a null in the output. All right. So let's have another example for the second scenario, where we are not
using a default value, we are using a column. So we have a supporting column that's going to be checked. In this
scenario we are saying ISNULL of shipping address and billing address. We have two
columns, and of course the logic is going to be the same, right? We are checking only once. Let's see how SQL is
going to execute this example. This time we have three orders, and we have addresses from the shipment and as well
from the billing. SQL is always focusing on the shipping address since it is the first
column, so the billing address itself is never checked for nulls. It starts with
the first order. Is it null? Well, no, we have the value A. So we will get it in the output, and SQL will not
take anything from the billing address. So that's it for the first order. Now SQL goes
to the second order, and this time we have a null. In the rule, we are saying: if the shipping
address is a null, go get the value from the billing address. So this time we go to the replacement, right?
We will get the value C in the output because the shipping address is null. Now let's move to the third row.
As you can see, here we have a null again. So SQL is going to get the value from the billing address. But in
this scenario the billing address is null as well. That's why we will get the value null in the output. So as you can
see, when the replacement values come from a column, there is no guarantee that there will always be a value. Like
here in the third order, it is a null, and that's why we get a null in the output as well. So if you think you are
using ISNULL to replace all the nulls by having two columns, you might still end up having a null in the output if
the replacement has nulls. If you want to make sure you don't get any nulls in the output, you have to use a
static value. So this is how SQL executes ISNULL. All right. So what is COALESCE?
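Before answering that, the two ISNULL scenarios walked through above can be sketched as follows. This is a minimal T-SQL sketch, assuming a table variable @Orders with made-up column names; the billing value 'B' for the first order is also an assumption, since the walkthrough only mentions A, C, and the nulls.

```sql
-- Hypothetical orders: order 1 has both addresses, order 2 only billing, order 3 neither.
DECLARE @Orders TABLE (
    OrderID     INT,
    ShipAddress VARCHAR(50),
    BillAddress VARCHAR(50)
);
INSERT INTO @Orders (OrderID, ShipAddress, BillAddress)
VALUES (1, 'A', 'B'), (2, NULL, 'C'), (3, NULL, NULL);

SELECT
    OrderID,
    ShipAddress,
    BillAddress,
    ISNULL(ShipAddress, 'N/A')       AS WithStaticDefault,  -- A, N/A, N/A
    ISNULL(ShipAddress, BillAddress) AS WithBillingAddress  -- A, C, NULL
FROM @Orders;
```

The second column shows the caveat from the walkthrough: when the replacement is itself a column, order 3 still ends up as NULL.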
COALESCE returns the first non-null value from a list. All right. The syntax of COALESCE is way
better than ISNULL. Here it accepts a list of many values. So for example we have value 1, 2, 3, and you can add
four, five, as many as you want. So we are creating a list of values to be checked. For example, we can still
use it like ISNULL, where we have the shipping address and we replace the null with a static value, the 'unknown',
or as we learned we can use two columns, the shipping address and the billing
address. So far these are the same use cases as ISNULL. But now of course COALESCE is not limited to
two values, we can use three. So we are saying: go check the shipping address; if it's null, then go check the
billing address; if that is null as well, then use the default value at the end, the static
one, the 'unknown'. So as you can see, we can use more than two values with
COALESCE. Okay. So now let's understand COALESCE and how it works. The workflow is similar to
ISNULL. In this example we have two columns, shipping address and billing address. SQL is going to consider
them as a list and start checking from left to right. It checks the first value, from the shipping
address, whether it's null. If no, it's not null, then we get the value one, so we get the value
from the shipping address. And if yes, it is null, then we get the value two, so we get the
value from the billing address. Now we have similar data, with three orders. Let's see how SQL is going to execute
it. It starts with the first row and it focuses on the shipping address. Here the value is
not null, we have an A. That's why we will get the value one, so we get the value from the shipping
address and nothing else is going to be checked. Now moving on to the second row. This time the shipping address is
null, so SQL gets the value from the second column, and it's going to be the C, right? So in the
output we will get C. Now to the last row: the shipping address is a null, so SQL gets the value from the
second column, and this time we get a null as well, like with the ISNULL function. So in the results we are
getting exactly the same result as ISNULL. For this scenario it doesn't matter whether you use ISNULL or
COALESCE. Now of course we are still not happy with that, because I don't want to see any nulls in the output, and I
still need to use the billing address rather than only a static value. So I would like to have the
values from the billing address and as well a default value at the end, so that I don't have any
nulls in the output. So how are we going to solve it? Now we can use the power of COALESCE, where we can include
multiple values in one function. What we're going to do is have the shipping address first, then the
billing address, and at the end the default value. So now we have a list of three values, and of
course our workflow is going to be a little bit bigger. Again, it starts from the left and goes to the right.
First it checks the value one. If it is null, then it goes on to check the value two.
And if the value two is null as well, we will get the last value, the value three. So now let's run the
example again using the new COALESCE. The first thing we check is the first value, which is the
shipping address, for record number one. As you can see, the value is not null, we have an A here. So what
is going to happen? We get the value A in the output. That
means this path is activated and we will not check anything else. So in the output it's going to be
like this: the first value is returned and everything else is ignored. SQL will not check anything else.
So as you can see, we are returning the first non-null value. Now let's move to the second order.
We check the first value again. Is it null? Well, yes, as you can see, we have a null here. That
means we activate this path over here on the right side. Now SQL will not blindly put anything
from the billing address into the results.
First SQL has to check it. So SQL checks whether it's null or not. Here it is not null, so SQL returns it in
the output. We have activated this path, and SQL is returning the value two, which is the value from the billing
address. So now let's move to the third order. SQL first checks the shipping address. Is it null? Well,
yes, it is null. That's why SQL starts checking the second value. This time SQL will not return
the billing address value since it's null. It's going to return the third value. And what is the third
value? It is our static value, the 'N/A'. So in the output we get the 'N/A', our default value. With that, as you
can see, we will not get any nulls in the output. We are using the default value and as well multiple columns. So
if you check the output, the first priority is always the values from the first column, the shipping address.
If it's null, then the second priority is the billing address. If that's null, then the last priority is
the default value. So as you can see, SQL checks the values from left to right and stops immediately
once it encounters the first non-null value and returns it in the results. So this is how COALESCE works.
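The left-to-right workflow above can be sketched as follows. This is a minimal T-SQL sketch, assuming a table variable @Orders with made-up column names and a made-up billing value 'B' for the first order; the remaining values follow the three orders in the walkthrough.

```sql
-- Hypothetical orders matching the walkthrough: A / NULL / NULL in shipping, C in billing for order 2.
DECLARE @Orders TABLE (
    OrderID     INT,
    ShipAddress VARCHAR(50),
    BillAddress VARCHAR(50)
);
INSERT INTO @Orders (OrderID, ShipAddress, BillAddress)
VALUES (1, 'A', 'B'), (2, NULL, 'C'), (3, NULL, NULL);

SELECT
    OrderID,
    COALESCE(ShipAddress, BillAddress)        AS TwoValues,    -- A, C, NULL (same as ISNULL)
    COALESCE(ShipAddress, BillAddress, 'N/A') AS ThreeValues   -- A, C, N/A (no NULL left)
FROM @Orders;
```

Because the list ends with a static value, the three-value version cannot return NULL, whatever the two columns contain.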
All right. So now let's have a quick summary of the differences between COALESCE and ISNULL. As we learned,
ISNULL is limited to only two values, whereas COALESCE is amazing because you can have a list of multiple values,
which is a great advantage compared to ISNULL. Now if we are talking about performance, ISNULL is faster than
COALESCE. So if you want to optimize the performance of your query, then go with ISNULL. Now there is another
problem with ISNULL: we have different keywords for different databases. For Microsoft SQL Server
we use ISNULL as we learned, but Oracle has a different implementation and uses NVL, and in
other databases like MySQL you have IFNULL. All three functions do the same thing, but we have different
implementations for different databases. On the other hand, COALESCE is available in all the different databases.
So here we have like an agreement, a standard, between the databases on using COALESCE. Again, this is a
great advantage for COALESCE, because if you are writing scripts and someday you want to migrate from one
database to another, with COALESCE you don't have to change
anything, but with ISNULL you have to go and adjust your queries and scripts to the correct
functions. That's why I always tend to use COALESCE and avoid using ISNULL. Only if it's really necessary,
when I have really bad performance, do I go and try ISNULL. But I usually stick
with COALESCE. So that is my advice for you: go with COALESCE and stick
with the standard. Now the use cases of COALESCE and ISNULL are very similar, and we mainly
use them to handle the nulls before doing any SQL task. For example, we can use them to handle the
nulls before doing data aggregations. So let's understand what this means. Imagine that we have three sales:
15, 25, and a null. If you use an aggregate function like the average, what's going to happen? SQL
is going to calculate it like this: 15 + 25, divided by two, and the average is going to be 20. As you can see, SQL is
including only the two values 15 and 25 and totally ignores the null value. The null is not
included in the calculation, because if SQL included it, the output would be null as well. So the
nulls are totally ignored. The same thing happens with the other aggregate functions, like the SUM, the COUNT
if you are counting the sales, MIN and MAX. There is only one exception, the aggregate function COUNT. If you
are using it with the star, SQL is considering not the values but the rows. That's why SQL
includes all those rows, and the final output is going to be three. Now
in some scenarios, if your business understands the null as zero, you're
going to have a problem with the result of your analysis if you don't handle the nulls. So what do we have to do? We
have to handle the nulls before doing the aggregations: we replace a null with zero using either
ISNULL or COALESCE. Once you do that, the calculation changes for the average. It's going to be 15
+ 25 + 0, divided by 3, and the output this time is going to be 13.3. With that you get more accurate
results for the business, if they understand nulls as zero. All right. So now we have the following example. It
says: find the average score of the customers. So let's go and solve it. We're going to select the
customer ID and the score from the table customers. Let's go and execute it. As you can see, we have four
customers with a score, and the last one doesn't have any score, so we have it
as a null. Let's calculate the average of the score, and I would like to use the window function in order to
see the details as well. So this is the average score. Let's go and execute it. Now what is going on
here? The four values are added together and divided by four, and the
null is totally ignored. Of course the question is what the business
understands by the null. If it is zero, then we have inaccurate results. So let's go and fix it. This time we're
going to say, okay, we're going to have the average, but instead of the score directly, we
handle the nulls first. So we have to replace any nulls with zero.
We can use COALESCE or ISNULL. I will go with COALESCE
like this: the score, and if you find any null, make it zero. That's it, and I will go with the window function as
well. Let's call it average score two. Now let's go and execute it. As you
can see in the output, we got 500, and this is different from the previous average. That's because we have
replaced the null with zero. Let's just display it in order to understand it. I will copy
it and put it here. Let's call it score two and execute it. So now SQL
adds up all those values and divides by five, and that's why we are getting the 500. So if our business
understands the null as a zero, this average is more accurate after we handle the null. As you can see, in
some scenarios we have to handle the nulls before doing any data aggregations.
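The three-sales example above (15, 25, and a null) can be sketched as follows. This is a minimal T-SQL sketch, assuming a table variable @Sales and made-up column aliases; a DECIMAL column is used so the average is not truncated to a whole number.

```sql
-- Hypothetical sample data: the three sales from the explanation.
DECLARE @Sales TABLE (Sales DECIMAL(10, 2));
INSERT INTO @Sales (Sales) VALUES (15), (25), (NULL);

-- Aggregates skip NULLs unless the NULL is replaced first.
SELECT
    AVG(Sales)              AS AvgIgnoringNulls,  -- (15 + 25) / 2 = 20
    AVG(COALESCE(Sales, 0)) AS AvgNullAsZero,     -- (15 + 25 + 0) / 3, roughly 13.33
    COUNT(Sales)            AS CountOfValues,     -- 2, the NULL is skipped
    COUNT(*)                AS CountOfRows        -- 3, rows are counted, not values
FROM @Sales;

-- Window-function form, to see the details next to the averages as in the customer example.
SELECT
    Sales,
    COALESCE(Sales, 0)               AS Sales2,
    AVG(Sales) OVER ()               AS AvgSales,
    AVG(COALESCE(Sales, 0)) OVER ()  AS AvgSales2
FROM @Sales;
```

Which of the two averages is correct depends on whether the business reads a missing sale as zero or as unknown.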
All right, moving on to the next use case for COALESCE and ISNULL. We can use them to handle the nulls
before doing any mathematical operations. So let's understand what this means using the plus operator.
If you use the plus operator between two numbers, like 1 + 5, you are adding the values and you will get six. And
if you use the plus operator between string values, like A + B, what we are
doing is data concatenation, and the output is going to be AB. Now
if you replace the one with a value like zero, so 0 + 5, we will get
five. Nothing fancy about that. And for the strings, if you replace a
value with an empty string, so there are
zero characters between the two quotes, plus the B, in the output you will
get only B. So it's fine and nothing is critical. But now we come to the problem. If you use a null, if you
replace the one with a null, in the output you will get a null. Because you are
saying, okay, five plus something that I don't know. So SQL says: you are
adding a value to a missing value, it is unknown, so I don't know
what the answer is going to be either. That's why SQL is going to say it's null: it just doesn't know what the
answer is. And the same thing happens with anything else, like the strings. If you're saying null
plus B, SQL is going to say the same thing: the null is unknown and the answer
is going to be unknown as well. So my friends, this is very critical in analysis and working with data. It
means we have to handle the nulls before doing any mathematical operations. And this is not only for the plus
operator, it's as well for the other operators like minus and so on. All right. So now let's have the following
task. It says: display the full name of the customers in a single field by merging their first and last names,
and add 10 bonus points to each customer's score. So let's go and solve it. We're going to select first the basic information.
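Before working through the task, the plus-operator behaviour described above can be sketched as follows. This is a minimal T-SQL sketch: the variable names and column aliases are made up for illustration, and the string case assumes SQL Server's default setting in which concatenating with NULL yields NULL.

```sql
-- Hypothetical variables standing in for a missing number and a missing string.
DECLARE @MissingNumber INT = NULL;
DECLARE @MissingText   VARCHAR(10) = NULL;

SELECT
    1 + 5                            AS NumberPlusNumber,   -- 6
    'A' + 'B'                        AS TextPlusText,       -- AB
    0 + 5                            AS ZeroPlusNumber,     -- 5
    '' + 'B'                         AS EmptyPlusText,      -- B
    @MissingNumber + 5               AS NullPlusNumber,     -- NULL
    @MissingText + 'B'               AS NullPlusText,       -- NULL
    COALESCE(@MissingNumber, 0) + 5  AS HandledNumber,      -- 5
    COALESCE(@MissingText, '') + 'B' AS HandledText;        -- B
```

The last two columns show the general fix: replace the NULL with a neutral value (zero for numbers, an empty string for text) before applying the operator.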
Let's get the customer ID. What do we need? The first name, the last name, and we need the scores. So that's it,
from sales customers. Let's go and execute it. Now the first task is that we have to generate a new field called
full name, where we merge, or concatenate, the first and last names. So let's go and do that. We need the
first name, plus, then let's have a space between the first and last name, and then plus the last name,
as full name. So let's go and execute it. Now if you check the result for the
first customer, it is working: we have Joseph Goldenberg. The same for the second customer. But for the third
customer we have a problem. The customer doesn't have a last name, but she has a first name,
Mary. The full name here is completely null, which is not correct. For this example we should at least
show the first name Mary, even though the last name is missing. So the result is not really accurate, and that's
because we are using the plus operator between a null and Mary. That means we have to handle the nulls
before using the plus operator. Again, here we can go with COALESCE or
ISNULL. So let's create a new field using COALESCE. It's going to be the last name, and now we have to
define a new value if it's null. We could have something like 'unknown', or we could have an empty string, and we
can do that using two quotes with nothing between
them. So we are using an empty string. Let's check the results. Last name two. Let's go
and execute it. Now we can see that the last name for Mary has an empty string and it is no longer a
null. So now SQL knows, okay, this is a string and there are no characters inside it. With that SQL has more
information, and we can now concatenate those values. So let's go and do that. We're going to take the
whole thing and replace the last name with the COALESCE. So let me just remove this last name over here and execute it.
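The query at this point can be sketched roughly as follows. This is a minimal T-SQL sketch and not the exact query from the session: the table variable @Customers and its column names are assumptions, and only two illustrative rows are included (one complete name and Mary, whose last name is missing).

```sql
-- Hypothetical customer rows: one complete name and one with a missing last name.
DECLARE @Customers TABLE (
    CustomerID INT,
    FirstName  VARCHAR(50),
    LastName   VARCHAR(50)
);
INSERT INTO @Customers (CustomerID, FirstName, LastName)
VALUES (1, 'Joseph', 'Goldenberg'), (2, 'Mary', NULL);

SELECT
    CustomerID,
    FirstName,
    LastName,
    COALESCE(LastName, '')                   AS LastName2,    -- empty string instead of NULL
    FirstName + ' ' + LastName               AS FullNameRaw,  -- NULL for Mary
    FirstName + ' ' + COALESCE(LastName, '') AS FullName      -- 'Mary ' (note the trailing space)
FROM @Customers;
```

The trailing space after Mary comes from the separator; it is usually harmless, but it is worth knowing about when comparing strings later.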
So now as you can see, things look better. Now we have in the full name for Mary only the first name. And of
course, if you don't like it like this and would like to have another default value,
you can go over here and say something like 'N/A', not available. Let's go and
execute it. With that you can see immediately that there is a missing
last name here. But it doesn't really look good, so I will just remove it and go with the empty string. Let's go
and execute it. With that we have solved the first part of the task: we have the full names and we are not
missing any information from the first name and the last name. Now let's go to the second part of the task, where
we have to add 10 bonus points to each customer's score. So we have to add a 10 to each score. Let's go and do
it. I'm going to put it at the end: score + 10, and let's give it the name score with bonus. That's it. Let's go
and execute it. Now in the output you can see it's very easy: we have added a 10 to each score, so we have
increased the score points for each customer. But for the last customer, Anna, you can see over here she doesn't
have a value in the scores, and that's why SQL didn't add the 10. So we get a
null as well. And of course it might not be fair that the last customer is not
getting any points even though we have increased the score for all the others. That
means we have to handle the null by replacing the null with zero, and only after that do we add the points to
it. So let's go and do that. I'm going to add a COALESCE: if it is null, then make it
zero, and afterward add the 10 points. So let's go and execute it. Now as you can see in the results,
everything is fair: we have 10 bonus points for each customer, even if the customer doesn't have any value
in the scores. Like here, Anna has a null, but she is still getting 10 points. So again, as you can see, if
you don't handle the nulls correctly before doing mathematical operations, you might get unexpected results. So
be careful with the nulls and handle them correctly before adding anything. Okay, moving on to the next use case for
COALESCE and ISNULL. We can use them to handle the nulls before doing joins. This is a little bit of an advanced
use case, but it's very important to understand it. So let's understand why this is important. Let's have, for
example, two tables, table A and table B. In some scenarios we have to combine those two tables using
joins. Now in order to join two tables, we have to specify the keys between table A and table B to
join on. In this example, we have two keys for joining the tables. Now here comes the special case.
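The special case discussed next, nulls inside the join keys, can be sketched as follows. This is a minimal T-SQL sketch under assumptions: the table variables @Table1 and @Table2 and their column names are made up, and most of the orders and sales figures are invented for illustration (only the key combinations, the 50 orders, and the 300 sales follow the walkthrough). The second query applies the idea of handling the nulls inside the keys by replacing them with an empty string.

```sql
-- Hypothetical tables with identical keys, two of which contain a NULL type.
DECLARE @Table1 TABLE ([Year] INT, [Type] VARCHAR(10), Orders INT);
DECLARE @Table2 TABLE ([Year] INT, [Type] VARCHAR(10), Sales  INT);

INSERT INTO @Table1 ([Year], [Type], Orders)
VALUES (2024, 'A', 30), (2024, NULL, 40), (2025, 'B', 50), (2025, NULL, 60);

INSERT INTO @Table2 ([Year], [Type], Sales)
VALUES (2024, 'A', 100), (2024, NULL, 200), (2025, 'B', 300), (2025, NULL, 400);

-- Plain join: NULL = NULL is not true, so the two rows with a NULL type are lost.
SELECT t1.[Year], t1.[Type], t1.Orders, t2.Sales
FROM @Table1 AS t1
INNER JOIN @Table2 AS t2
    ON  t1.[Year] = t2.[Year]
    AND t1.[Type] = t2.[Type];            -- returns 2 rows

-- Handled join: NULL types are mapped to an empty string on both sides before comparing.
SELECT t1.[Year], t1.[Type], t1.Orders, t2.Sales
FROM @Table1 AS t1
INNER JOIN @Table2 AS t2
    ON  t1.[Year] = t2.[Year]
    AND ISNULL(t1.[Type], '') = ISNULL(t2.[Type], '');   -- returns all 4 rows
```

The replacement value only has to be the same on both sides and must not collide with a real key value; the empty string is one common choice.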
If those keys don't have any nulls inside it and all the data are filled, then your join going to work perfectly
and you will get the expected results. And now you might have a special case where there are nulls inside the keys.
So there are missing values and this is a big problem because in the output you will get unexpected results and some
records will be totally missing. So in this scenario we have to handle the nulls inside the keys before doing the
joins. Let's have a very simple example in order to understand this behavior. All right. So now let's have this very
simple example where we have two tables and we want to combine them. So in the first table we have a year type orders
and in the second table we have as well year type and we have sales. So now we would like to go and combine those two
tables in order to have all informations in one result. Now we can go of course and use the inner join between the table
one and table two and the keys for the joins here. As you can see we have the year in both of the tables and as well
the type. So we're going to go and use both of those columns as a key for the join. So let's do it step by step how
going to execute this. So we need the year type and the results. So it's going to go and take those two columns to the
results and we need the orders and sales. So it's going to take as well the orders and the sales from the second
table. So now let's start doing it row by row. So the first key going to be those two columns. So we have 2024 and
the type A. So now it's going to start searching for those two informations in the second table. And as you can see we
have here a match, right? So the first row is as well matching since it's inner join it going to present in the output
only the matching rows from left and right. So in the outputs we're going to get the whole row from the table one and
we will get the sales from the table two. All right. So that's all for the first row. Now let's move to the second
row over here. What are the values of the keys? We have 2024 and null. Now if you check the matches on the right
side, you can see we have a match here, right? It is logical: it's as well 2024 and null, so everything is matching
and we should get it in the result. But SQL cannot use the equal operator to compare nulls. So even though
logically it makes sense to have this row in the output, SQL cannot compare the nulls, and that's why this is
a problem. For this combination SQL will not find any match, so we will not get any information for the combination of
2024 and null. For us in the business, of course, this means missing information and inaccurate
results. So we're going to miss this row, and SQL is going to jump to the third row. Here, what are the
values of the key? We have 2025 and B. Now it's going to search for it in the second table, and it's going to
find a match over here. So in the output we're going to get those values: the orders are going to be 50 and the sales
300. Now it's going to go to the last row, and here we have the same problem again. We have 2025 and null. And
of course if you check the data you will say, yes, we have a match over here, but SQL will ignore it. So we have exactly
the same situation and we will not find it in the results. In the output we will get only two rows even though
those two tables are identical if you compare the keys. With that we are losing data in the results and we
are providing inaccurate results. So my friends, if you have nulls inside your keys, you will lose
records in the output. So here it's very important to handle the nulls inside the keys before doing the joins. All right,
so now in order to fix it we're going to use either COALESCE or ISNULL in the join. As you can see, we
are not using the type directly. We are handling it by replacing the null with an empty string. It doesn't matter which
value you are using. The main thing is that you have a value and SQL can map it. So you could have it as an empty
string or a blank or any default value, as long as it doesn't collide with a real value in the key. But I usually go with the empty string since it's a little bit faster than using
any other characters. So what's going to happen is that we replace those nulls everywhere with
an empty string. Now we don't have any nulls inside our keys, so let's see what happens. We're going
to start with the first row again. Here we have a match from the right table and we're going to see the whole record
in the output, so we will get the sales as 100 as well. Now it's going to go to the second row over here. This
time we don't have a null. We have 2024 and an empty string. So now it's going to search for a match and it's
going to find it over here: we have as well 2024 and an empty string. So what happens in the output? We're
going to get 2024 for the year, but for the type we will get a null. So we will not get an empty string, we will get
a null over here, and that's because we are handling the null only in the join. As you can see, we have the ISNULL
on the type in the join but we don't have it in the select. So in the select the type is going to be like the original data,
and the original data was a null. We are handling the null in the join just in order to let SQL understand how
to map and match the data. In this example, I'm not changing the values in the select, so that's why we will get
the original value. The orders we will get as 40 and the sales are going to be 20. Now moving on to the third row. I
think you already get it: SQL is going to find the match and the sales are going to be 300. All right. Now we're going to
move to the last one, and here we have the same scenario. We have 2025 and an empty string, so it's not null
anymore. SQL is going to search for those values and it's going to find them over here. In the output, SQL is going
to show the type as null, not as an empty string, because in the select we didn't handle it. So
the orders are going to be 60 and the sales are going to be 200. As you can see, now the result is complete. We successfully
combined both of those tables into one result using joins, but also with the help of the ISNULL function in order
to have complete results and not miss any value. So my friends, be very careful: always check whether the keys have
nulls or not, and if you find nulls, handle them immediately so you don't lose any records in the results and you
get accurate analyses. All right, moving on to the next use case for ISNULL. We can use
it in order to handle the nulls before sorting the data. Imagine we have the following sales: 15, 25, and null. Now if
you sort the data by the sales ascending, from the lowest to the highest, what happens? SQL is going to show the
nulls at the start, and that is not because null is the lowest value, because null has no value. But SQL shows
it like this: it's going to place the null at the start, then below it we're going to have the lowest value, the
15, and at the end we're going to have the 25. Now if you do the exact opposite and sort the data
from the highest to the lowest using descending, it's going to sort it like this: we're going
to have 25, then 15, and the last thing that appears in the list is going to be the null. So here SQL is showing
the nulls at the end, and that is again not because null is the lowest value, since it has no value, but SQL does it like this and
shows it at the end. This is how SQL Server deals with the nulls when you are sorting the data. In order to understand this
use case let's have the following task. The task says: sort the customers from the lowest to the highest scores with
nulls appearing last. All right, let's solve it. This is going to be a very interesting one. We need the customer
information, so let's select the customer ID and the score from Sales.Customers and
execute it. Now we have a simple list of all customers and their scores. But now we have to sort the data from the
lowest to the highest, so we're going to use the ORDER BY clause with the score field. Since it's
lowest to highest, we need ascending order, and in SQL that is the default, so we don't have to
mention it. Let's execute it. As you can see in the results, it starts from the lowest and goes to the highest, and
the first part of our task is solved. But now of course we have an issue, right? We have a null, and as we learned,
SQL is going to put it in the first place on the list. But the task says "with nulls appearing last". So we really don't
want to see the nulls at the start. We would like to have them at the end of the list.
That means we have to handle the nulls before sorting the data. And here we have two ways to do it: one way that
is lazy and another one that is more professional. Let me show you the lazy way first. We're going to
replace the null with a very big number. So for example, we're going to use COALESCE
and say: score, and then a lot of digits so that we have a really big score. I just want
to select it in order to see the results. As you can see, it's a very big number here. Now take this and
replace the score in the ORDER BY with the new expression. That's it. Let's execute it. If you check the results, we have
already solved the task. We have listed all the customers from the lowest to the highest score and the nulls are at the end.
So now the question: why do we call this lazy or not professional? That's because we are defining a static value.
Of course for this example it is working, but we don't know what's going to happen later. Maybe things change
and in the scores you're going to get a higher value than this, and then sorting the data will make no sense
since the null is going to sit in between the real values. Who knows, your placeholder might even be a real value inside the data.
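As a rough sketch of the lazy approach just described, the query below sorts by a COALESCE expression instead of the raw score. The table and column names (Sales.Customers, CustomerID, Score) and the placeholder 9999999 are assumptions for illustration, not values confirmed by the course database.

```sql
-- Assumed names: table Sales.Customers with columns CustomerID and Score.
SELECT
    CustomerID,
    Score
FROM Sales.Customers
-- NULL scores are swapped for a large placeholder only while sorting,
-- so they land after every real score in ascending order.
-- 9999999 is an arbitrary value chosen for this sketch.
ORDER BY COALESCE(Score, 9999999);
```

The placeholder only works while every real score stays below it, which is exactly the weakness of a static value.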
Now let me show you the other way, which is more professional, where we don't rely on luck
at all. Let's do that. Let me just move this a little bit here. I'm going to create a new logic where
we say CASE WHEN the score is null THEN we want the value one, otherwise (ELSE)
the value is going to be zero, and then END. So we are just creating a flag with zero and one: if the score is null, we're
going to get the flag of one, and if we have a value for the score, we will get zero. Let's have it like this, and I will
just get rid of this COALESCE. Let's execute it. Now if you check our new flag, you can see we have
zeros everywhere we have a value in the score, and only where we have a null do we get the flag of one. So now
that we have this, we're going to sort our data based on this flag and the score, even
though the task is not mentioning anything about the flag. We are using it in order to force the nulls to be at
the end of the result. Let me show you how we're going to do that. Let me just remove all this. First we want
to sort the data by our new flag in order to make sure that the nulls are at the end. So we're going to have our flag, and
then afterward we sort the data by the score. So let's add the score. So again, what we are doing: first sort
the data by the flag in order to push the nulls to the end. Then, among the rows where the flag values are equal to each other,
SQL is going to sort the data by the score. So SQL is going to use the scores in order to sort
the data, and both of them are ascending. Let's execute it. Now as you can see, we get exactly the same
results: the values from the lowest to the highest, and the nulls are at the end. And as you can see, in the ORDER
BY we didn't use any static values or any big numbers. Of course we don't need the flag in the select, so we can
remove it. Let's execute it. And with that we have solved the task. As you can see, we can use those nice
functions like COALESCE or ISNULL in order to handle the nulls before sorting your
data. So what is the function NULLIF? NULLIF is going to compare two values and return a null if
they are equal; otherwise, if they are not equal, it returns the first value. Okay. So now the syntax of
NULLIF: it accepts only two values, value one and value two. Here again, of course, you can use a column
with a static value like 'unknown', so we are comparing the values between a column and a static value, or you can
compare two columns, like the shipping address and the billing address. So again, it accepts only two values.
We cannot have it like COALESCE, where we have a list of multiple values. All right. So now let's understand exactly
what we mean by NULLIF. The workflow is going to be like this. SQL is going to check two values,
value one and value two. If they are equal, SQL is going to return a null. But if the two values are
not equal, it's going to return the first value, the one on the left side. So by checking the outcomes
here, we will never have a scenario where we get the second value. That means the second value is always used
as a check: we are checking against this value. So we're going to get either value one or a null. Let's have this
very simple example. We are saying NULLIF of price and minus one, so we are checking whether the price is equal to minus one. We are saying:
if the price is equal to minus one, then replace it with a null, because it is a data quality issue that we have a price
that is negative. It makes no sense for our business. If it is minus one, then for us it means a null: we don't know
the price of this product. So we will correct it using NULLIF. Let's check this very simple example. We have
two orders. SQL is going to start with the first order and check the first value. What is the first value? It's
the price. Here we have a 90. SQL is going to check: is 90 equal to minus one? Well, no. That means it's
going to execute this path, so in the output we will get the first value, which is 90.
Now let's move to the second order. Here we have a minus one. So SQL is going to check: is
the minus one here equal to the minus one that we have in the NULLIF? Well, yes. That means SQL is going to execute
this path, where we get the null value in the output. So now if you
compare the result from NULLIF with the price, you can see we don't have the minus one anymore. And as you can see, now we
are doing exactly the opposite of COALESCE and ISNULL: we are replacing a real value with a null. Now moving on to
the second example, and this is a very interesting one in analytics, where we can use two columns inside the
NULLIF. In this example we are saying NULLIF of original price and discount price. So SQL has to
compare the prices between those two columns, and if they are equal it should return a null. Now you might say:
okay, in this example, why are we doing this? Well, we can use it in order to highlight or flag special cases inside
our data. The special case here is when the original price is equal to the discount price. If those two prices
are equal, that means we have an issue in our program or something went wrong as we were inserting the data. So let's
see what's going to happen. For the first row we're going to compare the 150 from the original price with the
discount price. They are not equal, right? So that means it's going to return the original price, the 150, in the
output. Let's move to the second order. Here we have the original price 250 and the discount price is also
250. So they are equal, and if they are equal then we will get a null in the output. As you can see, again here we
are not getting any values from the discount price. We are using it only for a check. So with that we have a quick flag,
using the nulls as a flag in order to identify where we have equal values. So this is how NULLIF works.
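Both NULLIF examples can be tried in one small query. This is only a sketch: the two walkthroughs are merged into a single inline table, the column names are made up for illustration, and the discount price of 120 in the first row is a hypothetical value (the walkthrough only says it differs from 150).

```sql
-- Inline sample rows; column names and the 120 discount price are assumptions.
SELECT
    OrderID,
    Price,
    NULLIF(Price, -1) AS CleanPrice,                    -- row 1: 90, row 2: NULL
    OriginalPrice,
    DiscountPrice,
    NULLIF(OriginalPrice, DiscountPrice) AS PriceFlag   -- row 1: 150, row 2: NULL
FROM (VALUES
    (1, 90, 150, 120),
    (2, -1, 250, 250)
) AS o (OrderID, Price, OriginalPrice, DiscountPrice);
```

In both columns the second argument never shows up in the output; it only decides whether the first argument is kept or turned into a null.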
All right friends, here we have a very nice use case for NULLIF, and that is preventing the error of dividing by
zero. Let's see what this means. Okay, let's have the following task. It says: find the sales price for each order
by dividing the sales by the quantity. Let's solve it. This should be very easy. We need the order ID, the
sales, and the quantity from Sales.Orders. Let's execute it. So now we have 10 orders.
Those are the sales and the quantity. Now it's very easy to calculate the price: it's going to be the sales
divided by the quantity, and we're going to call it price. Let's execute it. Now as you can see, we got an error that
says "Divide by zero error encountered." That means somewhere we have a zero for the quantity, and this is a problem.
Let's check the data again. I'm just going to comment out the whole calculation and execute it.
Now by checking the result, yes, for the order ID 10 we have a quantity of zero. So of course it will not work if
you divide by zero. So how can we solve it? We can use the magic of NULLIF, where we
replace the zero with a null. Getting a null is way better than getting an error, right? So let's go and do that.
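A minimal sketch of that fix, assuming the table is Sales.Orders with columns named OrderID, Sales, and Quantity (the exact identifiers are assumptions):

```sql
-- Assumed names: Sales.Orders(OrderID, Sales, Quantity).
SELECT
    OrderID,
    Sales,
    Quantity,
    -- NULLIF turns a zero quantity into NULL, and dividing by NULL
    -- yields NULL instead of raising a divide-by-zero error.
    Sales / NULLIF(Quantity, 0) AS Price
FROM Sales.Orders;
```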
I'm just going to remove the comments. And here we're going to say NULLIF: if the quantity is equal to zero, return a null.
That's it. Let's execute it. Now as you can see, it is working. With that we are making sure that we are not
dividing by zero, and that's because we replace the zero with a null. And if you divide anything by null, you will get a
null. So if you check the result over here, for order 10 we got the price of null, which is correct, and for all the
other orders everything is working because we have values and we didn't replace them with a null. That's why we
have values for the price. This is a very common use case for NULLIF: we can use it in order to prevent dividing
by zero. All right, so what is IS NULL? It's going to return true if the value is
null. So it is checking the value: if it's null, it returns true; otherwise it returns false.
Now the exact opposite happens if you use IS NOT NULL. If you use these keywords, it returns true if
the value is not null; otherwise, if it is null, it returns false. Okay. So the syntax for that is
very simple. It starts with a value or expression, and after that we have the keywords IS, space, NULL.
IS NOT NULL is exactly the same: we have a value, then IS, then the NOT
operator, and then NULL. So it's very simple. Let's have an example. We are checking whether the value of the shipping
address is null. So we can have it like this: shipping address IS NULL. Or we can check the opposite, whether it's not
null: shipping address IS NOT NULL. It's very easy. Okay. So now let's understand how this works. We are
checking the value. If the value is null, then it returns true; if it is not null, then it returns false. As you
can see, it never returns the value itself or any nulls. We are getting a boolean of true or false, so we are
creating something like a boolean flag in order to assist us with the checks. We have this very simple example, price IS NULL,
and we have those two rows. We are checking whether the price is null. In the first order it is not null, right?
That's why we will get false in the output. In the second order the value is null, so the check is correct, and that's why we will
get true. Now of course if we use IS NOT NULL, it's going to be the exact opposite. Is the price not null? Well,
yes, it's not null, and that's why you will get true over here. For the second check it is null, right? So the
output is going to be false. We will get the exact opposite. So that's it. It's very simple how IS NULL and IS NOT NULL
work. All right. One very obvious use case for IS NULL and IS NOT NULL is searching for missing information, that is,
searching for nulls. And maybe after that we can clean up our data by removing the nulls from our data set.
Let's have the following task. It says: identify the customers who have no scores. All right, let's solve
it. This is very simple. Let's start by selecting star from Sales.Customers, so we get everything. Let's
execute it. Now as you can see, we have our five customers. But the task says we need all the customers who have
no score. That means the result should return only the last record, since the score of Anna is null. So let's
add a WHERE clause. So WHERE, and now what do we need? We need the score. Then we don't use the equal operator; we use IS NULL
like this. That's it. Let's execute it. And with that, as you can see, it's very simple. We have filtered
our data and now we can see all the customers where the score is null. This is a very basic check to understand
whether our data contains nulls. All right, moving on to the next task. It says: show a list of all customers who
have scores. Back to our example, this time we're going to do exactly the opposite. We want a list of all
customers where we have a value in the score. So we're going to say WHERE score IS NOT NULL.
If you execute it, you can see we get a clean list where all the customers have a score. And with
that, we get rid of all nulls inside the score field. Maybe this is helpful in order to do further analyses.
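The two tasks above can be sketched as the following pair of queries. The identifiers Sales.Customers and Score are assumed spellings of the table and column mentioned in the walkthrough.

```sql
-- Task 1: customers who have no score (assumed names: Sales.Customers, Score).
SELECT *
FROM Sales.Customers
WHERE Score IS NULL;

-- Task 2: customers who do have a score.
SELECT *
FROM Sales.Customers
WHERE Score IS NOT NULL;
```

Writing `WHERE Score = NULL` would not work here, because a comparison with null using the equal operator never evaluates to true.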
All right friends, now we come to a very interesting use case for IS NULL, and that is introducing a new type of
join between tables that's going to help us find the unmatched rows between two tables. Let's have a quick
recap of the joins in SQL in order to understand the new types. Basically we have two sets, or let's say
two tables, the left and the right. If you use an inner join, we are finding only the
matching rows between the left table and the right table. So in the result we will get only the matching rows. Now we
have another type of join called the left outer join. If you use this type, in the result you will get all the rows
from the left table and only the matching rows from the right table. Now we have another type which is exactly
the opposite, the right join. Here we're going to get all the rows from the right table and only the matching
information from the left table. And now to the last type that we learned: we have the full join, where we get all
the rows from the left and all the rows from the right. So we will not be missing anything. Those are the
four basic joins that we have learned in SQL. But in SQL we also have other types that are more advanced, and we
don't have any keywords in SQL for them. The first one is called the left anti join. What we are saying here is that we need all
the rows from the left table, but this time without the matching rows. So all the information that is matching with
the right table we don't want to see in the results. As I said, we don't have an extra keyword for this type
of join. But in order to get this effect we're going to combine the left join with IS NULL. With
that we're going to get all the data from the left side but without anything that is matching the right side.
This we call the left anti join. And we have another advanced type of join called the right anti join. This
is exactly the opposite. We are saying: all the rows from the right table without any matching rows from
the left table, so all the information on the right side that is not matching the left side. Again, here we don't
have a keyword for that. We're going to work with the right join plus IS NULL. So with that, as you can see,
we have two new types of joins added to our four basic joins. Now this might be confusing, so let's have the following task
in order to understand it: show a list of all details for customers who have not placed any orders. All right. So
let's see how we can create the effect of the left anti join. Let's do it step by step. We need two tables here:
the customers and the orders. Since we are focusing on the customers, the left table is going to be
the customers. So let's do that. We're going to say select star from Sales.Customers. This is our first
table, and we are using the alias c. Let's execute it. Now as you can see, we got the list of all
customers, so we have all the details for our customers. But now we have to join it with the orders.
In order to do that, let's have a new line: LEFT JOIN Sales.Orders, and let's give it the
alias o. Now we have to define the key for the join. So ON: it's going to be the customer ID equal to the customer ID in
the orders table. Before we execute it, we're going to show the order ID
from the orders table, just to see whether we have a match or not. Let's have it like this and execute it.
Now let's check the results. As you can see, those four columns come from the customers table and only the
last column comes from the orders. What is interesting now is to check the order ID, whether we have nulls or not.
As you can see, for customer one we have everything matching. For customer two we have orders as well, and the
same for customer three. Only for the last one, customer ID 5, do we have a null. That means SQL was not able to find any
order for this customer. So again, what this means is that we have only one customer, Anna, who doesn't have any order,
but all the other customers did place an order, and we know that because we have values from the right table. Once we have
values, that means we have a match, but since here we have a null, that means we don't have any match. Now,
the left anti join says we would like to have all the data from the left table without any match in the
right table. That means for this example we would like to get only this customer, Anna. And this is exactly
what fulfills our task. The task says: list all details for customers who have not placed any order, so all data from
customers where we don't have a match in the orders. I think you already got how to get this effect. We're
going to filter the data like the following. We're going to have the WHERE clause, and now we need a column
from the right table, from the orders. So we're going to go with the customer ID that comes from the orders. We're going to
say o dot customer ID IS NULL. Of course you can go with the order ID as well; you're going to get the same
effect. But I always like to use the key that we are using in the join. Let's execute it. And now as
you can see, we got the effect of the left anti join, and with that we got the customer that we are
aiming for. Here we have the data from the left side that is not matching the right side: the customers who
have not placed an order. With that we have solved the task. As you can see, we have implemented the left anti
join by combining the left join with IS NULL. So this is the power of playing with the nulls in SQL.
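Put together, the left anti join from this walkthrough might look like the sketch below. The identifiers (Sales.Customers, Sales.Orders, CustomerID, OrderID) are assumed spellings of the names used in the narration.

```sql
-- Left anti join: customers that have no matching row in the orders table.
-- Assumed names: Sales.Customers(CustomerID, ...), Sales.Orders(OrderID, CustomerID, ...).
SELECT
    c.*,
    o.OrderID            -- always NULL in this result; kept only to show the missing match
FROM Sales.Customers AS c
LEFT JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID
WHERE o.CustomerID IS NULL;  -- keep only the left rows without a match on the right
```

Swapping LEFT JOIN for RIGHT JOIN and filtering on `c.CustomerID IS NULL` would give the right anti join instead.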
Now my friends, there is something that really confuses a lot of developers, or anyone working with data in
databases and SQL, and that is the difference between nulls, empty strings, and blank spaces. With the null, as we
learned, we are saying: I don't know what the value is, it is unknown. On the other hand, with the empty string you are
saying: I know the value, and it is nothing. The empty string is a string value that has zero characters. This is
totally different from the nulls, where we don't know anything about the value. Now, sometimes it may happen that
you are filling in a form, you come to one field, you hit the space bar by mistake, and with that you enter
a space into the field, and you just jump to the next field without entering any other value. So now we have a
space character inside the field. This is really evil in databases, because once the user enters a blank space, the database is going
to store it as a value, and it's going to take storage. It could be one space or many spaces,
depending on how long you press the space bar. So the blank space is a string, but the size is not zero like the empty
string. It has a size of however many spaces you have entered. So here it's not like the null: we know the
value. It is a string, and its characters are spaces. Okay. So let's see those three scenarios inside SQL.
Now I have some dummy data built using a CTE statement. Don't worry about it; I'm going to teach you all of that
in the next tutorials. We have four rows here. The first one has the value a. The next one has null. The
third one has an empty string; as you can see, there is nothing between those two quotes. And the last one has a
space between those two quotes. Now let's query this temporary table. So select star from orders and execute.
Now by looking at the values of the categories, you can find all the scenarios. The first scenario
is the easiest one, where we have a normal value: we have an a here. But in the other three rows we don't have normal
values. We have something empty. The first one is the null. We don't have a value; this is the special
marker from SQL. It says NULL, so there is no value. The other two are really confusing. As you can see, it's
really hard to tell by just looking at the data or the results whether a value is an empty string or a blank space. This confuses
a lot of developers, or anyone working with data, when they see those results. It's really hard to detect data quality
issues by just looking at the results. So in this scenario, what I do is calculate the length of each value
inside my column. Let's do that. In SQL Server, we're going to use the
function DATALENGTH, and our field is going to be the category. Let's call it category length. So let's go execute it.
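Here is a small sketch of that check. The dummy data is rebuilt with a CTE under the assumption that it has an Id column and a Category column; only the four category values come from the walkthrough.

```sql
-- Dummy data rebuilt as a CTE; the Id column and exact layout are assumptions.
WITH Orders AS (
    SELECT 1 AS Id, 'a' AS Category UNION ALL
    SELECT 2, NULL UNION ALL
    SELECT 3, ''   UNION ALL
    SELECT 4, ' '
)
SELECT
    *,
    DATALENGTH(Category) AS CategoryLen  -- 1, NULL, 0, 1 for the four rows
FROM Orders;
```

DATALENGTH counts bytes and includes trailing spaces, which is why it exposes the hidden blank. LEN would not help here, because in SQL Server it ignores trailing spaces.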
And now let's check the result. For the first one, since we have only one character, the length is going
to be one, which is correct. Now to the next row: we have the category null. We don't know the value, and so we
don't know the length of the value either, right? That's why we get a null. Now moving to the next one: as you
can see, those two look exactly the same. But now with the help of the DATALENGTH function we
can see that the third row, the third category value, has a length of zero. That means it is an empty string and we
don't have any hidden characters over here. So with that we are sure this is an empty string. But now let's move
to the last one. Here it is very tricky and evil: we have a hidden space inside this value, and we can see that from
the length of this field. As you can see, we have a one here, which means we have one hidden space inside this
value and it is not an empty string. So I have only one space here. Let's give it another space and
calculate the length. As you can see, now we have two spaces, and that's why the length is two. So don't count on your
eyes in order to detect the spaces. Calculate the length in order to be very precise. So now let's
compare the three scenarios side by side. Let's start with the first point, the representation in the table.
The null we're going to see as NULL inside the table. The empty string is going to be two quotes with nothing
between them. And the blank space is two quotes as well, with one or many spaces between them. Now if you are
talking about the meaning, the null means unknown: we don't know the value. The empty string is known, but it is
nothing; it is an empty value. And the third one, blank spaces, is known as well, and the spaces are the value. Now if you
are talking about the data types: since the null is no value, we don't have a data type for it; it is
a special marker in SQL. The empty string has a data type: it is a string, and the size of this string is
zero, since we have zero characters inside the empty string. Moving on to the blank spaces: it is a string, since a
space is a character, and its size is one or many. Now if we are talking about the storage, the
null is the best: nulls don't occupy a lot of storage, while the empty string and the blank spaces occupy
storage and memory and waste space. So if you are worried about the storage, the best option here is a
null. Now talking about the performance: you will get the best performance if you are using nulls. The empty string is
fast as well, but not as fast as the nulls. The worst option here is the blank spaces; they are slow. So
again, if speed is important for you, you have to store those scenarios as a null. Now if you are talking about
the comparison, when you are searching for those values: if you want to search for the null, you have to use IS NULL.
On the other hand, if you want to search for the empty string and the blank spaces, you have to use the
equal operator. So that's all. Those are the main differences between the null, the empty string, and blank spaces.
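The comparison rules can be checked with a short sketch on the same kind of dummy data (the Id column and the flag column names are assumptions). One caveat worth knowing on SQL Server: the equal operator ignores trailing spaces when comparing strings, so a blank space also matches an empty string, and the length is what tells them apart.

```sql
-- Same assumed dummy data as before: a value, a null, an empty string, a blank.
WITH Orders AS (
    SELECT 1 AS Id, 'a' AS Category UNION ALL
    SELECT 2, NULL UNION ALL
    SELECT 3, ''   UNION ALL
    SELECT 4, ' '
)
SELECT
    Id,
    CASE WHEN Category IS NULL THEN 1 ELSE 0 END AS IsNullValue,    -- 1 only for row 2
    CASE WHEN Category = ''    THEN 1 ELSE 0 END AS EqualsEmpty,    -- 1 for rows 3 and 4 on SQL Server
    DATALENGTH(Category)                         AS CategoryLen     -- separates '' (0) from ' ' (1)
FROM Orders;
```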
Now you might ask you know what why do I have to understand the differences between all those stuff the nulls empty
strings and the blanks everything's like empty so why do I care well in new projects I'm going to promise you that
you will be working with sources and data that has bad data quality and you might encounter all those three
scenarios in your data and now if you don't do any data preparations like cleaning up the data handling those
three scenarios and bringing standards to your data and you jump immediately to the analysis without doing all those
stuff, you will end up providing inaccurate results in your reports and analyses which leads to wrong decisions.
So preparing your data before doing any analysis by cleaning up the data, handling those three scenarios and as
well bringing standards is a very important step before doing any analysis. So this is how we're going to
do it. Together with the stakeholders and the users of your reports and analyses, you have to define clear data
policies. It's like rules and you have to commit yourself during the implementations by following those
rules. And here we have three different options. The first one you can go and define the data policies like this. Only
use nulls and empty string but avoid using blank spaces. In my project I cannot imagine that there is a scenario
where we need blank spaces. They are just evil. Just go get rid of them. All right. So with this policy, we
have to go and get rid of all blank spaces inside our data. And in order to do that, we have a wonderful function in
SQL called trim. The trim function in SQL going to go and remove the spaces from a string from the left side and as
well from the right side. So all the leading spaces and the trailing spaces going to be removed. So now if we go and
apply the trim function on that category, what's going to happen? All the blank spaces going to be removed and
it going to be turned into empty string. So let's go and do that. It's very simple. So we're going to use the trim
function and we're going to apply it on the category. Let's go and call it policy
one. So let's go and execute it. So now by just comparing the policy one with the category. You see like it's
identical but it's not. Now in order to have a better feeling about this we can go and test it using the data length.
Now let's go again and use the data length function. So we're going to use it for the whole results and as well I'm
going to go and use it for the category in order to just compare it. So without the
trim so like this. Let's go and execute it. Now if you go and check the result as you can see here again we have the
length of two because here we have two spaces but with the policy one we have zero. So those two values after applying
the trim function they have the length of zero and with that we don't have blank spaces. So that means now we are
sure after applying the trim we have either a null or empty string. So let me just get rid of all those informations.
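Policy one can be sketched like this. Again SQLite (through Python's sqlite3 module) is only a stand-in, and the table demo, its rows and the LENGTH function are assumptions for illustration, not the course's own objects.

```python
import sqlite3

# Made-up sample rows: a value, a null, an empty string and two blank spaces.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, category TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [(1, "A"), (2, None), (3, ""), (4, "  ")],
)

# Policy 1: TRIM removes leading and trailing spaces, so blanks become an empty string.
policy1_rows = conn.execute(
    """
    SELECT id,
           LENGTH(category)       AS len_before,
           TRIM(category)         AS policy1,
           LENGTH(TRIM(category)) AS len_after
    FROM demo
    ORDER BY id
    """
).fetchall()

for row in policy1_rows:
    print(row)
```

The row with the two blanks goes from a length of two to a length of zero, while the null stays a null.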
Now I am sure both of them are empty string. So as you can see it's very simple using only one SQL function you
are cleaning up the data and bringing standards. All right moving on to the option two. You can define your data
policies like this. Only use nulls and avoid both empty strings and as well blank spaces. So that means in our
business we don't have anything meaningful for the empty string and the blank spaces. We can go and use only the
nulls. Okay. So now let's go and implement this rule. We have to go and convert a value to a null. So the value
is going to be the empty string, converted to a null. And as we learned we can go and use the NULLIF function in order to get nulls
instead of values. So let's go and apply this policy. But now here we have two values the empty string and spaces. Now
instead of having two rules for that I'm going to convert first the blank spaces to an empty string like we have done
here. So I'm going to take the result of this function first as a first step and afterwards we're going to go and use the
NULLIF. So we're going to say NULLIF for the result of the trim: if you find any empty strings, convert them to
null. So that's it, policy 2. So as you can see in the result we have converted those empty strings and blanks to a null.
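The two steps of policy 2 (first trim, then NULLIF) can be sketched as follows. SQLite via Python's sqlite3 module is used only as a stand-in, and the table demo with its four rows is an assumption for illustration.

```python
import sqlite3

# Made-up sample rows: a value, a null, an empty string and two blank spaces.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, category TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [(1, "A"), (2, None), (3, ""), (4, "  ")],
)

# Policy 2: TRIM turns blanks into an empty string, then NULLIF turns '' into null.
policy2_rows = conn.execute(
    """
    SELECT id,
           NULLIF(TRIM(category), '') AS policy2
    FROM demo
    ORDER BY id
    """
).fetchall()

print(policy2_rows)
```

Nesting the trim inside the NULLIF means one rule is enough for both the empty string and the blanks.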
So with that we are getting three nulls and of course we're going to get the value a. And now if you compare those
three columns side by side you're going to see the policy 2 is really easier to understand compared to the previous
ones. Right? So now if you compare the policy two now to the policy one, you can see it's easier to understand and
it's easier as well to handle. So again it's very easy to do data cleanup with only two functions we have now like
standards inside our data. And now moving on to the last option, we can define our data policy like this. Use
only a default value unknown and avoid using anything else like nulls, empty strings and blank spaces. So that means
in the analyses and reports we want to see the value unknown and we have to handle all those three scenarios and
convert them to unknown. So now in order to implement the policy 3 we have to go and replace a null with a
default value and here we have two options: either use the ISNULL or we can go and use the COALESCE, and I will go with
the COALESCE. So COALESCE, and I'm going to use directly the category. So if you find any null
replace it with the default value unknown and let's call it policy 3. So let's go and execute it. So now if you
check the result over here you see that we got it correct only once. So we replaced the null with the unknown but
we still have like empty strings and blanks and that's because we rushed using the COALESCE and we skipped the
other steps. So as you can see preparing the data you have to do it slowly step by step. So first we have to go and
convert everything to a null like the policy 2. And after that the last step we're going to go and use the default
value. So that means instead of using the category we have to go and get the result of the policy 2. So let's go and
copy it and replace the category with those two steps and let's go and execute it. So now as you can see we have the
default value for all those three scenarios. First we have to trim the data in order to remove all the blank
spaces. The second step, we're going to go and replace all the empty strings with a null. And with that, we're going
to get a null for all those three scenarios. And finally, we're going to go and replace the nulls with a default
value, the unknown. So, that's it for the three policies. And this is the different ways in order to clean up the
data and bring standards before doing analysis. And now you might ask me, okay, which one should I use in my
project? Like if I want to suggest something for the users, which one should I use? Well, it really depends on
the business, but I always try to avoid this one, the policy one, because it's always confusing and I always have to
explain it to the users. So now we are left with the two and three. Well, I use both of them in different scenarios. I
normally go with the policy 2 because it takes less storage and as well the performance of your queries afterward
going to be really good. So if I'm doing data preparations in my ETL before inserting it inside a table, I go with
the policy 2. But on the other hand, if I'm doing a preparation step before showing it in a report like in Tableau
or PowerBI. So if it is like one of the last steps before showing the data to the users, I go with the policy 3
because if you present a null inside a report, it's going to be really hard to read. So having like a word like
unknown, it's easier to understand. Okay, we have here missing data. So again if the data preparations is
exactly before I present the data to the users I go with the policy 3 where I use default values but if I'm using a data
preparations before inserting it in the database I go with the policy 2 because it's going to optimize the storage and
it's going to be really bad if you go with the policy 3 because it's really bad to store the whole word each time
there is no value, like the unknown. It's going to take a lot of space and as well you're going to get bad performance as
you are building the queries. That's why I tend to store the data using nulls. If you present it to the users go and show
it as a default value. So as you can see it's very important to understand the differences between the nulls empty
strings and blanks and how to prepare the data by cleaning up the data and bringing standards and policies before
doing any analysis. So with this we have cleared up the confusion between those scenarios and if you encounter it
in your projects you know how to deal with it. All right. So now let's have quick
summary about the nulls. Nulls are special markers in SQL in order to say there is no value. It is missing. It is
unknown. So nulls are not equal to zero or empty string or any blank spaces. And using nulls inside our database is going
to save some storage and as well provide a strong performance in your queries. And in SQL we have different functions
in order to handle the nulls. So now if you want to replace a null with a value we can go either with the function
COALESCE or ISNULL, or if you want to do the opposite where you want to replace a value with null you can go use the
function NULLIF, or in other cases where we want only to check whether there are nulls or not we can use the IS NULL or
is not null. And we have learned as well that we have to treat the nulls especially before doing any tasks. So
that means we have to handle the nulls before doing for example data aggregations like average, sum, max, min
and so on. And we have to handle the nulls as well before doing any operations with the plus operator, like adding
two numbers or concatenating two strings. And in some scenarios as we learned we have to handle the nulls as
well before doing joins. And in other cases we have as well to handle the nulls before sorting the data. And we
have learned as well that by combining the joins and the IS NULL we introduce new types of joins, like as we learned the
left anti-join and the right anti-join where we exclude the matching rows using the IS NULL, and we can use the null
functions in order to provide standards and data policies in our data like using the nulls or using a default values like
the unknown. All right my friends. So with that you have learned how to handle the nulls inside your data and now we're
going to move to a very special topic called the case statements. This is a very important tool in order to do data
transformations. So let's go. Case statements allow you to build conditional logic in your SQL
query by evaluating a list of conditions one by one and return a value when the first condition is met. So now let's
understand the syntax of the case statements and what this means. Okay. So now let's see the syntax
step by step. It start with the keyword case. This case indicates now we are starting a logic a conditional logic in
SQL. It's like programming languages as you start with the if else. So the if is the keyword of a logic and the whole
logic as well ends with another keyword called end. So once SQL sees the end. So this is the end of the conditional
logic. So the case is the start and the end is the end. So now what we're going to have in between is the conditional
logics right. So the conditional logic start with the keyword when. Now we are telling SQL we have a condition to be
evaluated and then we're going to go and specify that conditional logic. So now we have to tell SQL what can happen if
this condition is fulfilled. So now we have to use another keyword called then. So now we are telling SQL show this
results if the condition is true. So as you can see it's very simple. It's like the natural language, right? It's like
in English when the condition one is met then show the results. It's very logic, right? And now of course we can go and
add a second condition inside our case statements. So we're going to have the same setup. When condition two if this
is true then show the result number two. We specify the keyword when then we have a second condition. And if this
condition is true then we tell SQL to show another results. And of course it's very important to understand in the
syntax that SQL is going to process the conditions from the top to the bottom. So the first, most important
condition should be at the start. So SQL going to first check this condition. If it fails and it's not true then it going
to go and jump to the second condition. So the order of the conditions is very important in your logic. And now of
course we can go and add multiple conditions depend on the logic using the keyword when. And now once we are done
defining all the conditions we can go and specify an else keyword. The else can introduce the default value and it
is optional. You can go and skip it. So the value of the else or the default going to be used only if all the
condition failed. So that means all our conditions are not true and nothing is fulfilled then SQL going to go and use
the value from the else. So it is the default value that's going to be used if all conditions are false. So those are
the keywords that you must use inside each case statement. So we have case, when, then and end, and only the else is
optional. So you can go and use it or skip it. So this is the main structure and the syntax of each case
statement. Now let's have a very simple example in order to understand how SQL execute the case statements behind the
scenes. All right, let's have this very simple example where we have only one condition. So as you can see in the
syntax, it starts with case and ends with end, and in between we have only one condition and we are evaluating here the sales. So the
condition says if the sales is higher than 50 then show as a result the value of high. So it's very simple only one
condition and on the right side we have here a flowchart in order to understand how the logic is executed. And now what
we're going to do, we're going to go and evaluate those four sales through this logic and see what the output going to
be with the case statement. So let's do it one by one. Let's start with the first sales. It is 60. So here we're
going to go and check is 60 higher than 50. Well, yes. That means this sales is meeting this condition and we will get
true and we're going to get in the output the value of high. So here we're going to get the value high in the
output. So that means the first sales is fulfilling the requirement the condition and SQL going to give us the value from
this condition. All right. So now SQL going to go to the next value and we're going to start evaluating the 30. Now
we're going to ask the same question the same condition is 30 higher than 50. Well no. So that means in the output for
this condition we will get false. So we will take the path of the false. Now if you take the path of the false we will
not get any value. Right? So that means the output going to be a null. So the output for the 30 is null. And that's
because we didn't define in our logic anything about the default option. So we don't have here an else. And this is
what going to happen. If you don't use else, you will get a null in the output for the case statement. So now let's
move to the next one. It's going to be the same thing. So 15 is smaller than 50. So it's not fulfilling the
condition. And as well we're going to get a null. And for the last one since it's null we will get as well a null
since it will not fulfill the condition. So now after evaluating all those sales only the first sales is fulfilling that
condition and that's why we have only one value the high. All right. So now let's keep moving and adding stuff to
our case statements. Now we are adding a second condition. So it says after checking the sales whether it's higher
than 50 and it fails check again the sales whether it's higher than 20. If yes then show the value of medium. So
now in our workflow we are adding a second condition to be checked if the first one is false. So now let's go and
evaluate our sales again and check the output the first one the 60. So as you can see the 60 is higher than 50. So we
are fulfilling the first requirement that's why we will get the value of high. So it's same like before. So here
we're going to get high in the output. And now here very important to understand one thing is that SQL didn't
evaluate here in this scenario the second condition. So SQL didn't waste any time by checking the other
condition. It skipped everything once it gets a true from one condition. So this is exactly how SQL processes the case when.
It's going to check each condition from top to bottom and once it finds a true it's going to stop everything
immediately and throw the value from this condition and it will not evaluate any other conditions. So now it's going
to go and jump to the next value. We are at the value of 30. So let's evaluate the conditions. Is 30 higher than 50?
Well, it's not. So it's false. So now what can happen is going to go and jump to the next condition and start
evaluating the second one whether it's true or false. So now we're going to check here. Is 30 higher than 20? Well,
yes. So it's going to be fulfilled and we will get the value of medium. So it's going to stop everything and show in the
output for this value the medium. So we're going to get medium here. So in this scenario, we have evaluated both of
the conditions that we have in the case statement. Now it's going to go to the third one. We have 15. Is 15 higher than
50? Well, no. So we will get false for the first condition. Then it's going to go and jump to the second condition and
check it. Is 15 higher than 20? Well, no as well. So now what's going to happen? The false path is going to be activated over
here. And we will not get any value as a return. So we will get the value of null in the output. And now for the last one
we have null. We will get as well null because it will not fulfill any of those conditions and that's because we didn't
define an else in the case statement. So if we define these conditions like this, we will get the category medium for the
30. And this is how SQL evaluate multiple conditions in the case statement. All right. Now we're going to
go to the final form of our case statements and we're going to go and add an else. So we're going to have a
default value. So we are saying here: if the sales is neither higher than 50 nor higher than 20, then show a default value
as low. So that means any sales that is equal or smaller than 20 going to get the value of low. And now very
interesting if you check the workflow over here you can see that we have now a value for each path. So for the first
condition we're going to get high for the second one medium. And if nothing is fulfilled we're going to get always the
value of low. So there is no way in this chart to get any nulls. Right? So let's go and evaluate again our values. I
think you already get it. The 60 is fulfilling the first requirement and SQL going to stop everything immediately and
just show the value of high. So on the right side over here nothing going to be evaluated because the first condition is
true. So here in the output we're going to get the value of high. So nothing changed like the two previous examples.
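The three forms of the case statement from this walkthrough can be put side by side in a small sketch. It runs the SQL through Python's built-in sqlite3 module as a stand-in database; the table name orders and the column aliases are assumptions, and the four sales values are the ones used in the walkthrough (60, 30, 15 and a null).

```python
import sqlite3

# The four sales values from the walkthrough; table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 60), (2, 30), (3, 15), (4, None)],
)

case_rows = conn.execute(
    """
    SELECT sales,
           -- one condition, no ELSE: everything that is not true becomes null
           CASE WHEN sales > 50 THEN 'High' END AS one_condition,
           -- two conditions, checked from top to bottom, still no ELSE
           CASE WHEN sales > 50 THEN 'High'
                WHEN sales > 20 THEN 'Medium'
           END AS two_conditions,
           -- final form: ELSE supplies the default when no condition is true
           CASE WHEN sales > 50 THEN 'High'
                WHEN sales > 20 THEN 'Medium'
                ELSE 'Low'
           END AS with_else
    FROM orders
    ORDER BY id
    """
).fetchall()

for row in case_rows:
    print(row)
```

Only the last column is free of nulls, because the else catches the 15 and the null sales as well.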
Now it's going to go to the next value. We have the 30. So we're going to evaluate the first one. It's going to be
false. The next one it's higher than 20. It is true. And that's why it's still going to show the value of medium. And
this as well we had in the previous example. So medium. So now SQL is going to move to the next one. And
here things going to get interesting. So the value of 15. We're going to evaluate the first condition. Is it higher than
50? Well, no. Is it higher than 20? Well, no. So now we are in scenario where none of those conditions are true.
So that's why SQL going to go and execute the else. So if you check our chart it's going to be false and we will
get the value of low. So in the output we will not get this time a null because we have else we will get the value of
low. The same thing now for the null. Null will not fulfill the first condition as well the second condition
and that's why we will get as well the value from the else. So here in the output we will get as well the value of
low. So now as you can see if you use an else inside the case statements you will make sure that there will be no nulls in
the output. So with that you have learned the different options that we have inside the case statements and how SQL
execute the case behind the scenes. All right friends so now we come to the part where I'm going to show you
the most useful use cases of the case statements that I usually use in my projects. So let's start. The main
purpose of the case statement is to do data transformations. And data transformations is very important
process in each data project. And one very important task in data transformations is that we can generate
new informations. We can go and create new columns based on the existing data that we have in the database using the
case statements and this of course can help us derive new information for our analyses without modifying the
source database only for analytics. So my friends, the main purpose of the case statement is to do data transformations
by creating and generating new columns. So now let's start with the first use case and the most important and famous
one is we use case statement in order to categorize the data. This means we are going to group up the data into
different categories based on certain conditions. And now you might ask why this use case is important. Well,
classifying and grouping data is fundamental in data analysis and reporting because it makes the data
easier to understand and as well to track. But what's more important, it going to help us aggregating the data
based on the categories. All right. So now let's have the following task. And it says generate a report showing total
sales for each of the following categories. category high if the sales is over 50. Category medium if the sales
is between 20 and 50 and low if the sales is 20 or less and sort the categories from the highest sales to the
lowest. Okay, so let's do it step by step. And now before we do any data aggregations, we have to go and create a
new column called categories because we don't have it in the database. So now let's start with very simple select
statements. So select what do we need? Let's take the order ID, the sales and that's it for now. So from sales orders
let's go and execute it. And now we have our 10 orders and we have to go and now create a new column called categories.
And we're going to do that using the case statements. So let's take a new line and we start with case and then
again a new line in order to define the first condition using the when. So the first condition is the high where sales
is over 50. So it's very simple. So when the sales is higher than 50, what can happen if this is true? We want to show
the value high. So this is the first condition. And then let's move to the second one. If the sales is higher than
20, that means it's less than 50 and higher than 20, then we want to see the value medium. And now for the last
category, the low, we don't have to go and create a condition for that because if those two fails, then that means that
the sales either equal to 20 or less. So what we're going to do, we're going to just do a simple else and show the value
low like this. Let me make this a little bit smaller. Now what is missing in our case is of course the end. Without it,
you're going to get an error. So end and let's give it a name category. So we are ready. Let's go and
execute it. So now let's check randomly stuff. So as you can see here we have the sales of 15 it is low which is
correct and then we have here 60 it's above 50 and we have the category high and now if you check the order number
six we have the order 50 it's medium because it is not higher than 50 it is between 50 and 20. So now as you can see
we have now classified our orders using the category. Now the next step that we're going to go and aggregate the
data. So how we going to do that? We will use a subquery. So let's do it like this. So we're going to go and select
and of course we're going to group up the data by the category. So we're going to go and select that category and we
need the total sales. That means we're going to go and use the function sum for the sales and we're going to call it
total sales. So now we have to nest the queries together. So from this is our query like this and then we have to
close it and group by. So we are grouping by the category. Okay. So with that we are now aggregating the sales by
that category. It's very simple. Let's go and execute it. So now in the result we have only three categories. We don't
have the 10 orders because now we are doing data aggregations. So now the granularity now on the level of
category. So now we can see the total sales for the high is 210. The low we have 65 and the medium we have 105. And
of course we are not done yet because in the task it says sort the categories from the highest sales to the lowest.
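The whole task (categorize, aggregate with a subquery, then sort) can be sketched like this. SQLite through Python's sqlite3 module is only a stand-in for the course database, and the table orders with its six rows is hypothetical sample data, so the totals here are not the ones from the course result.

```python
import sqlite3

# Hypothetical sample orders, not the course's ten orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 60), (2, 30), (3, 15), (4, 90), (5, 20), (6, 50)],
)

report = conn.execute(
    """
    SELECT category,
           SUM(sales) AS total_sales
    FROM (
        -- inner query: derive the new category column with CASE
        SELECT order_id,
               sales,
               CASE
                   WHEN sales > 50 THEN 'High'
                   WHEN sales > 20 THEN 'Medium'
                   ELSE 'Low'
               END AS category
        FROM orders
    ) AS t
    GROUP BY category
    ORDER BY total_sales DESC  -- highest total first
    """
).fetchall()

print(report)
```

Note that the sales of 50 lands in Medium and the sales of 20 lands in Low, because both conditions use higher than and not higher than or equal.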
That means we have to go and use an order by statement at the end and we're going to sort the data by the sales from
the highest to the lowest. That means descending. So that's it. Let's go and execute it. And now with that we have
our reports. Now we are showing the total sales by the categories and the data is sorted from the highest to the
lowest. So the highest category is high then medium and then the last one is low. So my friends as you can see with
the help of the case when we have created new information from our data: we have the category, and then we have
created insights or report based on this new informations where we have aggregated our data using this new
information. So the use case of categorizing data using case statements is fundamental and very important in
each data project. Okay. So now one more thing before we jump to the next use
case is that there is one rule to follow if you are using case statements and that is the data types of the result
must be matching. So what this means if we check again our example over here we can see that the result of each
condition is a string. So as you can see we have here high, medium and low and all of those informations are following
the same data type. So it is correct. So now if I go and break this rule for example after this then let's have the
value two. So now we have a number and we have characters. So let's go and execute it. And now of course we're
going to get an error because now SQL is trying to convert the value low to an integer which is incorrect. So the data
types of the output of the result must be matching and that's not only include the value after the then but also the
value after the else because this value is as well part of the output. So let's have here again medium. And now let's go
and change this to let's say one. So let's go and execute it again. SQL is going to throw an error because this is an
integer, a number, and the others are string characters. So this is the rule of using the case statement. The data
types after then and after else must be matching. And if you ask me whether there is restriction about where you can
use the case statement in which clauses you can use it everywhere in select, in joins, from, where, group by, order by,
everywhere. So there are no restrictions and we have only this one rule. Okay friends, another use case for the
case statement. We can use it in order to map values. So we can use the case statement in order to transform the data
from one form to another in order to make it more readable and more usable for analytics. One scenario of mapping
values is that sometimes the database developers stores the data and values inside the database as codes and as
flags. So for example, the status of the order could be stored as one and zero instead of having inactive and active.
And this is one technique in order to optimize the performance of the database for the application because one and zero
is way faster than storing the whole string. But in data analysis, we usually generate a report to be read by humans, by
persons. And now instead of showing the data as zero and one, it's going to be more nicer and readable if you show the
data as active and inactive. So for these scenarios, we're going to go and use the case statement in order to
translate those cryptic and technical values into readable terms. Otherwise, everyone who consumes your report
is going to ask you what you mean with the zero and one. Let's have the following task and it says retrieve
employee details with gender displayed as full text. Okay. So now let's go and solve it. First we're going to go and
explore few informations. So let's go and show the employee ID and let's take the first name, last
name and we need the gender informations. So gender from sales employees. So that's it. Let's go and
execute it. So now as you can see in the result we got our five employees and now the gender informations are stored as
only one character F and M. And of course it's easy to understand that the F is female and M is male. but we would
like to show it in the report as a full text. So, female and male instead of those abbreviations. So, in order to do
that, we're going to go and use the case statement in order to do the mapping between the old value and the new value.
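As a sketch of this kind of mapping, here is one way it could look. SQLite through Python's sqlite3 module is used as a stand-in, and the table employees, its three rows and the alias gender_full_text are assumptions for illustration; the third row has a null gender only to show what the else does.

```python
import sqlite3

# Hypothetical employees; the null gender is there to trigger the default value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, first_name TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Anna", "F"), (2, "Ben", "M"), (3, "Chris", None)],
)

gender_rows = conn.execute(
    """
    SELECT employee_id,
           first_name,
           gender,
           CASE
               WHEN gender = 'F' THEN 'Female'
               WHEN gender = 'M' THEN 'Male'
               ELSE 'Not Available'   -- default for nulls or unexpected codes
           END AS gender_full_text
    FROM employees
    ORDER BY employee_id
    """
).fetchall()

for row in gender_rows:
    print(row)
```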
So, let's go and create a new column using the case. So, we're going to have here two conditions because we have two
values. Let's start with the first one. So, we're going to have a new line and when. So when the gender equals to f
ladies first then female and now for the second value it's going to be exactly the same when gender equal to m then
we're going to have male be careful for the case sensitivity of the values. So of course we will not end this without
an else. So else then we can have the default value. We're going to have the default value not available. It's better
than having nulls. So what we are missing is the end. So we're going to have an end over here and we're going to
call it gender full text. So that's it. Let's go and execute it. Now if you check the results, we have now done the
mapping between the old format of the value with the new format. So instead of m we have males and females. And of
course we don't have here any nulls. That's why we don't have a not available in the data. But if you have huge data
of course you can have somewhere a null and then you will get this default value. So this is how you can do mapping
between values very easily using the case statements. Okay let's have another task for the mapping use case and the
task says retrieve customer details with abbreviated country code. Sometimes as we are generating reports maybe using
PowerBI or Tableau we don't have enough spaces in order to use the full name of values. So what do we need? We need
abbreviations. we need short form of the values and we can go and use in SQL the case statement in order to map the full
value to an abbreviated value. So it's like the previous example but the other way around. All right. So now let's go and
solve it. We're going to go and select few details like the customer ID. Let's take the first name, last name and what
do we need? We need the country information from sales customers. So that's it. Let's go and execute it. And
now as you can see we get our five customers and we have the country informations as a full name. Now of
course for the report we need abbreviated values from this. So we're going to go and map those full names of
the countries to a short form. But in real project you might get big tables where you have thousands and millions of
records. So you cannot just check it like this. So how I usually do it I go and retrieve a distinct list of all
values from one column. So I usually go and have a separate query for that. So we're going to have select distinct
country from the table sales customers. It's just for me to see all the possible values inside the database. So now you
see the second result over here. We have only two values Germany and USA. And then I can go and map the data
correctly. So always if you are mapping data using the case win you have to understand all the possible values that
you have inside the table. So let's go and generate this new informations. Let's start with case and then you line
when country equal to the first value. It's going to be Germany. Make sure you write it exactly like in the database.
The first character is a capital and the rest is lowercase. So what happens then? We're going to have the abbreviation of Germany. It's going to be DE. All right. So this is for the first value. And then let's move to the second one. It's going to be country equal to USA. It's already abbreviated, but maybe we can get it down to only two characters. So US, like this. And now let's go and add an ELSE. It's optional, but it helps in case we have nulls in the data or we get a new value. So ELSE is "not available", so n/a. So that's it. And never forget about the END. So END, and the name is going to be country abbreviation. So that's it. Let me just get rid of the other query. So the mapping is correct.
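
The mapping built up above can be sketched as follows. This is only a minimal sketch under stated assumptions: it uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for the course database, and the table name `customers`, the first names, and the extra NULL row are hypothetical, added just to show the ELSE branch.

```python
import sqlite3

# Hypothetical stand-in for the customers table (in-memory SQLite, invented rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, first_name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Anna", "Germany"), (2, "Ben", "USA"), (3, "Carla", None)],
)

# Step 1: list every distinct value before writing the mapping.
distinct_countries = {row[0] for row in conn.execute("SELECT DISTINCT country FROM customers")}
print(distinct_countries)

# Step 2: full-form CASE, where each WHEN carries a complete condition.
query = """
SELECT customer_id,
       country,
       CASE
           WHEN country = 'Germany' THEN 'DE'
           WHEN country = 'USA'     THEN 'US'
           ELSE 'n/a'               -- NULLs and unmapped values land here
       END AS country_abbr
FROM customers
ORDER BY customer_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that the string comparison is exact, which is why the spelling in the WHEN conditions has to match the stored values.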
Let's go and execute it. And now if you check the results, we got a new column called country abbreviation. And as you can see, the mapping is working. Here we have Germany and we get DE, and for the USA we get US. So with that we have solved the task, and we have done the mapping correctly between the old value and the new value.

All right friends, now there is a special case in the syntax of the CASE statement if you are using it for mapping values. So let's go and check it. Let's say that we have a lot of different distinct values inside the country column, not only two values. If you are mapping the values using CASE WHEN, you're going to end up always writing the same thing: country equals Germany, country equals India, country equals United States, and so on. So the conditions always use one column, the country, and the operator is always equals. Only for this scenario, we have another syntax for the CASE statement, and it looks like this. We start with the keyword CASE, but immediately after that we put the column that we want to evaluate. Here you can use only one column; you cannot use multiple columns. So now we are telling SQL that we are evaluating one column, the country, and then for each condition we say WHEN Germany, which means when country is equal to Germany, THEN DE. So as you can see, we don't have the whole condition here; we have only a possible value that we can find inside the country column. So we are asking: is the value of country Germany? If it's true, then show DE. The next one: is it India? Then IN. United States, US, and so on.

So we call this syntax the quick form of the CASE statement, and the one on the left side we call the full form of the CASE statement. And of course the restriction of the quick form is that you can use only one column, and only the equals operator. So that means you can use the quick form only for these scenarios. If things get a little bit complicated, where you have to mix conditions and build complex logic, you cannot use the quick form. So I would say, if you are sure that the logic will not get complicated and you can always stay with the same column, you can go with the quick form. But I would recommend always going with the full form, for one simple reason: if you add one small piece of logic, you have to rewrite the whole CASE statement back to the full form. But of course there is nothing wrong with using the quick form if the logic can stay static, you are sure you are using only one column, and you are just doing mapping with no extra logic.

Okay. So now let's try this quick form of the CASE statement on the previous example. I will just copy everything to a new column, and I'm going to rename it with a 2. And now how are we going to do it? It's going to be CASE, but this time we write country right after it, and then inside each WHEN we have only the values. So there is no need for the condition. It's going to be like this. Let me scroll up. So that's it. As you can see, it's shorter and quicker than writing the whole condition each time. So now let's go and execute this. And as you can see in the result, we get identical values. So now you know one more trick in the CASE statement.
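
To see the two forms side by side, here is a minimal sketch, again assuming an in-memory SQLite database via Python's sqlite3 rather than the course database. The table name `customers` and its rows (including the India and NULL rows) are hypothetical.

```python
import sqlite3

# Hypothetical data: three known countries plus one NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Germany"), (2, "USA"), (3, "India"), (4, None)],
)

query = """
SELECT customer_id,
       -- Full form: a complete condition in every WHEN
       CASE
           WHEN country = 'Germany' THEN 'DE'
           WHEN country = 'India'   THEN 'IN'
           WHEN country = 'USA'     THEN 'US'
           ELSE 'n/a'
       END AS country_abbr,
       -- Quick form: the column once after CASE, then only the values
       CASE country
           WHEN 'Germany' THEN 'DE'
           WHEN 'India'   THEN 'IN'
           WHEN 'USA'     THEN 'US'
           ELSE 'n/a'
       END AS country_abbr_2
FROM customers
ORDER BY customer_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Both columns come out identical here because every branch is a plain equality test on one column, which is exactly the situation the quick form is limited to.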
All right, moving on to the next use case for the CASE statement. We can use it in order to handle nulls. Handling nulls means replacing a null with a value. And as we learned before with the window aggregate functions, sometimes nulls lead to incorrect calculations and results, which leads to wrong decision-making. We're going to have a dedicated chapter later on how to handle nulls in SQL, but now we're going to learn how to handle nulls using CASE statements.

So now let's have the following task. It says: find the average score of the customers and treat nulls as zero, and additionally provide details such as the customer ID and the last name. Okay. So now let's solve it step by step. Again we have details here, and as well we have to do an aggregation, which means we have to use a window function. And we must not forget that we have to treat the nulls, so we have to handle them.

So now let's start with a very simple SELECT. So SELECT customer ID, we need the last name, and as well we need the scores, FROM Sales.Customers. Let's go and execute it. So as usual we have our five customers and the scores. And here we have a null. Now we're going to write the window function, but without handling the nulls, just in order to see the difference. So we need the average function, for what? For the scores. Do we have to partition the data? Well, no, so we're going to leave the OVER empty. We need the average score of all customers. So that's it. Let's give it a name and then execute it. I think I have a mistake here: the column is score, not scores. And now as you can see we have the average of 625. And as you learned before, SQL sums up those four values and divides by four, so the null is simply skipped.

But our business understands the nulls as zero, not as missing information. So we have to handle the null. Let's create a new column for the scores, but this time we're going to use the CASE statement. It's going to be very simple. So we're going to say WHEN the score IS NULL, THEN zero. In SQL we don't write "equals null", we say IS NULL. So with that we are replacing the nulls with zero, right? Now otherwise, what happens? If it's not null, we need the score as it is. We should not manipulate anything. So the default value is the score itself if the score is not null. So now let's END it and call it score clean. Let's go and execute it. Now if you check the result over here, it's almost identical to the score. We don't have any new values for the scores; only the nulls are now zero, and all other values are not affected. We didn't touch them. We didn't transform them at all. So this is what we mean by handling nulls: replacing nulls with another value.

So now, in order to finish the task, we have to take the average of the score clean and not of the original score. So how are we going to do it? Let's copy the whole CASE statement. I'm just going to do it in another column. So let's have an average, and inside it we have the CASE statement, like this. Let me just arrange it like this. And now what is missing is the OVER, and it's going to be empty. So average customer, let's call it clean. So this is the logic. Let me just make everything smaller. So now as you can see it's exactly like the previous one, but instead of using the original score we are using the column that we have created. But of course we don't need the alias in here, so we have to remove it. So it starts with CASE and ends with END. Let's go and execute it. And now you can see in the output we got a new value for the average, and it is more accurate for the business. So now we have 500. Previously we had 625. So as you can see, you have to understand what the nulls mean in your business and handle them correctly.
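
The difference between the two averages can be reproduced with a minimal sketch. It assumes an in-memory SQLite database through Python's sqlite3 (window functions need SQLite 3.25 or newer), and the table name `customers` and the five scores are hypothetical values picked so that the averages come out as the 625 and 500 mentioned above.

```python
import sqlite3

# Hypothetical five-customer table; one score is NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, 350), (2, 900), (3, 750), (4, 500), (5, None)],
)

query = """
SELECT customer_id,
       score,
       -- Replace NULL with 0, keep every other score untouched
       CASE WHEN score IS NULL THEN 0 ELSE score END AS score_clean,
       -- AVG skips NULLs: 2500 / 4
       AVG(score) OVER () AS avg_customer,
       -- With NULL treated as 0: 2500 / 5
       AVG(CASE WHEN score IS NULL THEN 0 ELSE score END) OVER () AS avg_customer_clean
FROM customers
ORDER BY customer_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The empty OVER () keeps all five rows in the output while repeating the overall average on each of them.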
If you get this wrong, you will end up with wrong results. So that's it. We used CASE statements in order to handle the nulls inside our data.

Conditional aggregation means we apply an aggregate function in SQL, like SUM, AVG, or COUNT, but this time only on a subset of the data that meets specific conditions. This technique is amazing for deep-dive analyses or targeted analyses on a specific subset of the data. So now let's have the following SQL task in order to understand this use case. The task says: count how many times each customer has made an order with sales greater than 30.

All right. So, as usual, we can do it step by step. What do we need? We need the orders. So let's get the order ID, and as well the customer ID, and the sales from Sales.Orders. Let's go and execute it. What else am I going to do? I'm going to order the data by customer ID. So let's execute it again.

Okay. So the task sounds easy, but it's a little bit tricky. We have to count the number of orders for each customer where the sales are higher than 30. Let's have an example: customer number one. The total number of orders is three, right? But we have to count only the orders where the sales are higher than 30. And in this example, we have only one order where the sales are higher than 30, which is order number four. So the count for customer ID number one should be one. Now let's check another customer, for example number two. As you can see, we have three orders, but none of them has sales higher than 30. So the count should be zero here.

So how are we going to do that? We have to flag each row according to whether it's higher than 30 or not. If it's higher than 30, it gets the flag one. If it's less than or equal to 30, it gets zero. And then we're going to sum up all those flags in order to get the count. So let's do it step by step. Let's first create the flag. We're going to use CASE, and then our condition is very easy. We're going to say WHEN. So what is the condition? Sales greater than 30. So sales is higher than 30.
Then what happens? We're going to flag it with a one, because later we're going to sum up the ones. And now the ELSE: if it's not higher than 30, so equal to 30 or less, it's going to get zero. All right. So now let's END it, and let's call it sales flag. Now let's go and execute it and check the results.

All right. So now if you check the results, we got a very nice flag that shows which orders have sales higher than 30. For example, let's take customer ID number one. As you can see, only order number four has sales higher than 30 and it's flagged with one, and all the others are zero. Now let's take customer ID number three. As you can see, we have two orders where the sales are higher than 30, so we have the one twice. And now we can use this flag in order to do the aggregation. If you sum up the flag for customer ID number three, you get two, and this is the count of orders where the sales are higher than 30, right? Let's take another example, customer ID number two. We have zero everywhere, and if we sum up those values we get zero, which is the count of orders where the sales are higher than 30. That is correct.

So as you can see, first we have built an extra column in order to help us with the aggregation, and now in the next step we're going to aggregate this column. So let's do that. We don't need all this information. We don't need the order ID. We need the customer ID, because it is the granularity for the aggregation. Let's remove the ORDER BY, and now let's GROUP BY the customer ID. But of course we need the aggregate function. So how are we going to do it? We're going to wrap the whole flag in a SUM. And now of course we're going to rename this, since it is now an aggregated column. We're going to call it total orders. So now let's go and execute it.

Now let's check the result. As you can see, we have our four customers. For customer ID number one, we got only one order higher than 30. The second one has no orders higher than 30. The third has two, and the fourth has one. And with that, we have solved the task.

Now I would like to add one more thing to our query in order to see the normal aggregation next to the conditional aggregation. So usually we count, for example, the star in order to get the total orders. And let's rename the previous one to high sales.
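
The finished query, with the conditional count next to the plain count, can be sketched like this. It is only an illustration under assumptions: an in-memory SQLite database via Python's sqlite3, a hypothetical `orders` table, and invented sales values arranged so that the per-customer counts match the ones described above.

```python
import sqlite3

# Hypothetical orders: (order_id, customer_id, sales).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 2, 10), (2, 3, 15), (3, 1, 20), (4, 1, 60), (5, 2, 25),
        (6, 3, 50), (7, 1, 30), (8, 4, 90), (9, 2, 20), (10, 3, 60),
    ],
)

query = """
SELECT customer_id,
       -- Conditional aggregation: flag each row with 1 or 0, then add up the flags
       SUM(CASE WHEN sales > 30 THEN 1 ELSE 0 END) AS total_orders_high_sales,
       -- Normal aggregation: every order counts
       COUNT(*) AS total_orders
FROM orders
GROUP BY customer_id
ORDER BY customer_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

An order with sales of exactly 30 gets the flag 0, because the condition is strictly greater than 30.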
So let's go and execute it. So here we are just doing an aggregation without any conditions, and now we can see how many orders each customer made. We can see that customer ID number one ordered three times, but only one order was higher than 30. So this one is a normal aggregation, and this one is a conditional aggregation using the CASE statement.

All right friends. So now let's do a recap of the CASE statement. The CASE statement evaluates a list of conditions one by one and returns a value as soon as the first condition is met. And if we talk about the rules for using the CASE statement, we have only one: the data types of the results after THEN and ELSE must match. And now, if we talk about the use cases of the CASE statement, the main use case is doing data transformations, especially by creating new columns and deriving new information. As we saw, there are amazing use cases for the CASE statement. For example, we can use it in order to categorize our data. As we learned, we can create new groups of data that are then aggregated for our reports. Another use case we saw is mapping values. We can use the CASE statement to map the cryptic technical values that are stored in databases to new values that are more readable and friendlier to use. The next use case that we have learned is handling nulls. We can use the CASE statement in order to replace the nulls with a value to make our aggregations more accurate. And the last use case that we have learned, and I think the most used one in my projects, is conditional aggregation, where we aggregate a subset of data that meets specific conditions in order to do focused and targeted analysis.

Okay my friends. So with that we have covered all the topics and all the functions for transforming a single value in SQL, the row-level functions. That was very important, especially for data engineers. So we are done with this chapter. Now we are moving to a very interesting chapter.
Finally we're going to talk about data analytics in SQL, and we will now be covering the aggregate and the analytical functions that we have in SQL. So first we're going to start with the basics. We will learn simple functions for aggregating your data. So let's go.

Hey my friends. So now we're going to talk about the aggregate functions in SQL. They are amazing if you are a data analyst or data scientist, where we usually use them in order to uncover insights about our data. The aggregate functions accept multiple rows as input, and the output of an aggregate function is usually one single value. So now we're going to cover the basic aggregate functions in SQL first. So let's go.

So now in our database we have four orders, and we have the sales information for each one of them. One question that comes to mind is: what is the total number of orders in our business? How many orders do we have? In order to answer that we use the function COUNT, because what it does is count the number of rows inside our table. So if you apply the COUNT function to this data, SQL starts counting how many rows we have. The total number is four, and in the output we will get four. So as you can see, we don't really care about the content of the table. SQL is just counting how many rows there are. The number is not based on the sales information or the orders. So this is how the COUNT function works.

Now we have another question: I would like to find the total sales in our data, in our business. That means we have to add up all the sales that we have in the orders, and for that we have the SUM function. So if you apply the SUM function, it adds up all the sales and returns the total sales at the end. In this example, it's going to be 80.
So, as you can see, the aggregate function accepts multiple rows, multiple values, and the output is one single value, the aggregated value. Now, moving on, I would like to understand what the average sales are in our business. It sounds simple. In order to do that, we're going to use the AVG function. If you apply it to the sales, it adds up all those values and divides the sum by the number of values. So you will get the average of 20. Now comes an interesting question, where you want to find the highest sales in your data. For that we can use the function MAX. Once you apply it, it starts searching for the highest value inside our table. So this time we are not really aggregating the data into something new. It's like searching for the highest value among multiple values. So in this example we will get 35 as the highest sales.
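
The functions in this walkthrough (COUNT, SUM, AVG, and MAX, plus MIN, which is described right after this) can be tried together in one query. This is a minimal sketch assuming an in-memory SQLite database via Python's sqlite3; the table name `orders` and the four individual sales values are hypothetical, chosen so the results match the numbers quoted in the walkthrough.

```python
import sqlite3

# Hypothetical four-order table: (order_id, sales).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 35), (2, 15), (3, 20), (4, 10)],
)

# Each aggregate takes all four rows as input and returns one single value.
query = """
SELECT COUNT(*)   AS total_orders,
       SUM(sales) AS total_sales,
       AVG(sales) AS avg_sales,
       MAX(sales) AS highest_sales,
       MIN(sales) AS lowest_sales
FROM orders
"""
total_orders, total_sales, avg_sales, highest_sales, lowest_sales = conn.execute(query).fetchone()
print(total_orders, total_sales, avg_sales, highest_sales, lowest_sales)
```

Without a GROUP BY, the whole table is treated as one group, so the query returns exactly one row.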
Now of course if you want to see the lowest sales inside your business, you can use the MIN function. If you apply it, the same thing happens: it starts searching for the lowest value in the sales. And in this example, it's going to be 10. So as you can see guys, the aggregate functions are very simple, yet very powerful. They are really useful for insights, in order to understand how well your business is performing. So now let's go to SQL in order to try those functions.

Okay. So now we're going to analyze the orders table inside our database by doing very simple aggregations. Let's start with the first task. It says: find the total number of orders. So this time we are targeting the table orders. Let's just start with the SELECT. Now we can see we have four orders, and we would like to have one number. What can we do? We can say COUNT star as total number of orders. So let's go and execute it. And with that we got one number. It is four. This is the total number of orders.

Now let's move to the second task. It says: find the total sales of all orders. So this time we have to add up all the sales values into one big value. How do we do it? We're going to use the function SUM, and this time we are targeting the sales, and we're going to call it total sales. So let's go and execute it. And with that we have 80 as the total sales. All the sales values are summed up into one big value. So as you can see, we are now exploring the business, right? We are understanding how many sales, how many orders. So this is really the basics of analytics in SQL.

Now let's go to the next task. Let's find the average sales of all orders. So we're going to have AVG, this time of the sales, as average sales. Again very simple. Let's go and execute it. Now the total sales is 80, but the average sales is 20. All the values of the sales are summed up and then divided by the number of orders, so 80 divided by four. And with that SQL finds 20 as the average.

Now let's get to the interesting stuff. Let's find the highest sales of all orders. So what is the highest sales that happened in our business? In order to do that, we can use the function MAX of sales as highest sales. Very nice. Let's go and execute. So the highest sales in the database is 35. And now I think you already know what the next task is: find the lowest sales of all orders. This is exactly the opposite. So we're going to use MIN of sales as lowest sales. Let's go and execute. The lowest sales in our business was 10. So my friends,
as you can see, the aggregate functions are really amazing. And if you use them like this, you will get the big numbers about your business. But don't forget: if you combine the aggregate functions with a GROUP BY, you break those big numbers down, so you are aggregating by the customer ID, maybe by a date, or by a country. Anything you specify in the GROUP BY breaks those big numbers into smaller numbers based on the column that you are using. For example, let's go with the customer ID over here, and let's put it at the start of the SELECT as well. And now if you go and execute it, as you can see in the output, all those numbers are no longer big numbers. We drill down to more details based on the column that we have specified. So now we have for each customer the total number of orders, the total sales, the average sales, the highest sales, and the lowest sales. Of course the data is very small, and those numbers can be more interesting if you have bigger data. So if you combine the aggregate functions with the GROUP BY, you break those big numbers into more details based on the column that you are grouping by.

So now what you can do is apply those functions to the customers as well. There we have a score, and you can find the average score, the highest score, and the lowest score, and then you can group the data by the country, for example. So pause the video and do some aggregations on the table customers.

All right my friends. So with that you have learned the basics of how to aggregate your data using SQL. Now we're going to move to a more advanced way of aggregating your data. We will start talking about the window functions, the analytical functions. First we're going to talk about what exactly window functions are, and we're going to cover the basics of this topic. So let's go.

Window functions, or as we sometimes call them, analytical functions, are very important functions in SQL. Everyone must know them, especially if you are doing data analysis. Each time I write a SQL script in order to do data analytics, I end up using them. So as usual, we're going to understand the concept behind them first, and then we're going to start practicing. So let's
go. Okay guys, so now let's start with the first question. What are SQL window functions? They are functions that allow you to do calculations like aggregations, but on top of a subset of data, without losing the level of detail of the rows. So it is something very similar to the GROUP BY, but here we have a special case: you don't lose the level of detail. In order to understand the definition, let's have a very simple example.

Okay. So now let's understand how SQL works with the GROUP BY clause. Let's say that we have a very simple example. We have four orders: two orders for the caps and two orders for the gloves. And let's say that I would like to see the total sales for each product. If we decide to use the GROUP BY, what SQL does is take the first two orders for the caps and put them in one row. So in the output we're going to have only one row for the caps, with the total sales of 40. And the same thing is going to happen for the gloves. SQL takes the two rows of the gloves from the input, and in the output we're going to have only one row for the gloves. So that means the number of rows depends on the number of products we have in our data. We have two products, so we get two rows. That means SQL is really smashing or squeezing the results in the output. And this is exactly what the GROUP BY does to our data. It aggregates the rows, aggregates the data, into a different level of detail. So on the left side we see four rows, on the right side we have two rows, and with that we are losing some details in the results. But still, we have solved the task.

So now let's see what happens if you use a window function in SQL. Okay. So now we have the same data and the same task: we have to find the total sales for each product. If you use a window function, SQL does the following. It processes each row individually. So what happens? It starts with the first row, order ID one. In the output we get the same stuff, order ID one, the same row, but we also get the total sales for the caps. Here the total sales is 10 + 30, so we get 40. Then it jumps to the second row and processes it as well. So in the output we get order ID two, the product caps, and the same aggregation, since we are talking about the same product.
So we will get 40. Then it goes to the third order, and here we have the gloves. So in the output again we have order ID three, the product gloves, and the total sales this time is 5 + 20, so we get 25. Then it goes to the last row, order ID number four. In the output we get four, gloves, and 25 as well. So now we can notice that if you use the window function, you do not lose the level of detail of your data. We are doing something called row-level calculations. If we have four orders in the input data, we get four orders in the output, and we get our aggregations correctly as well.

So now if you compare both methods side by side, we can see that we are solving the same task. We are finding the total sales for each product. But with the GROUP BY we are smashing, squeezing, the results from four orders into two rows, one row for each product. That means with the GROUP BY the granularity is changing, right? In the input, the order ID controls the level of detail, but in the output of the GROUP BY, the product controls the level of detail. So we have a different granularity. On the other hand, with the window functions we are still able to do aggregations, but we are not losing the level of detail. The granularity of the input is the same as the granularity of the output. So this is exactly the main difference between the GROUP BY and the window function. If you just want to do simple aggregations, then go with the GROUP BY.
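
The side-by-side comparison above can be reproduced with a minimal sketch. It assumes an in-memory SQLite database through Python's sqlite3 (window functions need SQLite 3.25 or newer), a hypothetical table named `orders`, and the four caps and gloves orders from the example.

```python
import sqlite3

# The four orders from the example: (order_id, product, sales).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Caps", 10), (2, "Caps", 30), (3, "Gloves", 5), (4, "Gloves", 20)],
)

# GROUP BY: four input rows are squeezed into one row per product.
grouped = conn.execute(
    "SELECT product, SUM(sales) AS total_sales FROM orders GROUP BY product ORDER BY product"
).fetchall()

# Window function: same totals, but every order row is kept.
windowed = conn.execute(
    """
    SELECT order_id,
           product,
           sales,
           SUM(sales) OVER (PARTITION BY product) AS total_sales
    FROM orders
    ORDER BY order_id
    """
).fetchall()

print(grouped)
for row in windowed:
    print(row)
```

The GROUP BY result has the product as its granularity, while the window result keeps the order ID as its granularity and simply repeats each product total on every matching row.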
If, on the other hand, you care about the level of detail and you need to add more details to your results, then you can go with the window function, where you can do aggregations and keep more details. And now if you compare the functions available with the window and with the GROUP BY, you find that both of them have exactly the same functions for aggregations. So we have COUNT, SUM, AVG, MIN, MAX. And here comes another difference between the window and the GROUP BY. The GROUP BY has only the aggregate functions. So that's it.
But in the window functions, we have way more functions to use for analytics. For example, we have the ranking functions. And we have another group of functions here, the value functions, or as we call them, analytical functions. So that means with SQL window functions we have a lot of functions. We can cover a lot of analytical use cases and advanced, complex stuff. But with the GROUP BY we have only the aggregate functions, only for simple use cases. So this is another difference between the GROUP BY and the window. Use GROUP BY if you have simple analyses, simple aggregations. Window functions we're going to use for more advanced data analysis, where we're going to cover a lot of use cases.

All right guys, so now we're going to have a few tasks in order to understand one thing: why do we need SQL window functions, and why in some scenarios is the GROUP BY not enough, so that we have to use SQL window functions? So let's go. All right. Let's start with a very simple task. It says: find the total sales across all orders. So we need one value with the total sales.
Let's see how we can do that. First make sure that you are using the database. So use the sales database, in case you have closed the client, so that we don't get any errors. Now we're going to start with the first thing. We're going to select the sales. You're going to find it in the table Sales.Orders. So now let's just query the data. As you can see, we have 10 orders with 10 sales. We didn't aggregate anything yet, so we have the raw data now. In order to solve the task, we're going to use the function SUM. So SUM of sales, and we're going to give it a new name, total sales. We don't have to use any GROUP BY, because we don't have to group anything. So that's it. Let's go and execute that. And as you can see, SQL returns one value, 380. This is the total sales that we have inside our data, and this is the highest level of aggregation. So with that we have solved the task. We have the total sales across all orders. We don't have to group anything.

Let's move to the next example. Let's say that in the next task we want to find the total sales again, but this time for each product. So not for all orders together; for each product we want to find the total sales. This time we don't need only one value. We need one value for each product. In order to do that, we're going to use the GROUP BY clause, and we're going to group by the product ID. And the GROUP BY needs the dimension in the SELECT as well. So we can do it like this. So that's it.
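
The two queries so far, the overall total and the total per product, can be sketched as follows. This assumes an in-memory SQLite database via Python's sqlite3 and a hypothetical `orders` table. The ten rows are invented: they are arranged to reproduce the overall total of 380 and the totals of 140 and 105 for the first two products, while the values for products 104 and 105 are pure assumptions.

```python
import sqlite3

# Hypothetical ten orders: (order_id, product_id, sales).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 60), (5, 104, 25),
        (6, 104, 50), (7, 102, 30), (8, 101, 90), (9, 101, 20), (10, 102, 60),
    ],
)

# Highest level of aggregation: one value for all orders, no GROUP BY needed.
total_sales = conn.execute("SELECT SUM(sales) AS total_sales FROM orders").fetchone()[0]

# One value per product: the grouping column must also appear in the SELECT.
per_product = conn.execute(
    """
    SELECT product_id, SUM(sales) AS total_sales
    FROM orders
    GROUP BY product_id
    ORDER BY product_id
    """
).fetchall()

print(total_sales)
print(per_product)
```

Ten input rows become four output rows, one for each distinct product ID.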
Let's go and execute the query. Now as you can see in the results, we don't have one value. We don't have the highest level of aggregation. This time we are drilling down to the next level of detail, and the level of detail here is the product ID. We have one row for each product. For the first product we have 140, for the next one 105, and so on. So as you can see, we are now splitting the data at the level of the product ID. We went from 10 orders to four rows in the results, and that's because we have four products. So the number of rows in the output is defined by the dimension, the product ID. And with that we have solved the task. We have the total sales for each product.

All right guys, so let's keep progressing in our examples. The next one is going to be a little bit more advanced, where we have the same aggregation: find the total sales for each product. Additionally, provide details such as the order ID and the order date. So, as you can see, we have already solved the first part. We are finding the total sales for each product. Now we just have to add some additional information, like the order ID and the order date. So let's go over here and just add it to our SELECT: order ID, and let's have the order date. So let's go and execute that. I'm just going to make it a little bit bigger. So let's go. But now as you can see, SQL is not happy. It throws an error and says that the columns you are adding to your SELECT are not included in the GROUP BY. As you can see, in the GROUP BY we have only one dimension, one field, called the product ID. But in our SELECT we have three dimensions: the order ID, the order date, and the product ID. So the SELECT and the GROUP BY do not match, and SQL will not allow it.

And now you might say, you know what, let's add everything to the GROUP BY. With that we're going to get our aggregation, and we're going to get our details as well. So let's try that. I'm just going to zoom out a little bit, and instead of having only the product ID, let's add everything: the order ID, the order date, and the product ID. So now they match, and SQL should not throw any error. Let's go and execute it. So now
let's check whether we have solved the task. The task has two parts right. We have to do the aggregations and to
provide details. So as you can see we have solved the second part. We have the details, order ID and order dates. But
now the first part finding the total sales for each product is destroyed because if you check the results, we
have the product ID 101. It has the total sales of 10. But in the third order, we have it as a 20 for the same
product. So actually the data is not aggregated and that's because we are aggregating at different levels and we
have included way more stuff that we don't need for the aggregations. We are aggregating at the order ID level. So as
you can see now we are hitting the limits of group by. We cannot provide aggregations and as well provide
additional informations from our data. You have to pick one. That's why we have to go to the second option where we can
use the window functions. So let's do that. I'm just going to get rid of the group by parts and as well all the
fields. Let's go back to the start. So now we have the sum of sales and if I execute this I'm going to get one value. So we
are at the highest level of aggregations. So now we need to use the window function. I'm just going to
remove the name. And now we're going to tell SQL this is a window function. Using over after the aggregation or the
function tells SQL we are talking about a window function. So let's just execute it like this. And with that we got 10
rows and that's because we have 10 orders and for each row we have exactly the same value. So we have the total
sales of all orders for each row. So as you can see SQL understands this is a window function and SQL should not like
group all the data in one row. It should keep exactly the same rows or same number of rows like the input. So with
that we have the window function but we have to split the data by the products. So now we're going to use the keywords
partition by, it's like the group by but with another wording, followed by the product ID, the same dimension. And as a name we have the
total sales by products. So let's go and execute this. So now as you can see in the output we still have the
same number of rows. We have 10 orders. We have 10 rows but the result did change because now we are aggregating
the data at the level of product ID. In order to understand the results we have to add more informations to our select.
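A minimal sketch of the contrast so far. The inline sample rows and the column names (OrderID, OrderDate, ProductID, Sales) are assumptions for illustration, not the course's actual table, and the syntax is written with SQL Server in mind (it should also run on PostgreSQL).

```sql
-- Hypothetical sample rows; names and values are illustrative only.
WITH Orders AS (
    SELECT *
    FROM (VALUES
        (1, '2025-01-01', 101, 10),
        (2, '2025-01-05', 102, 15),
        (3, '2025-01-10', 101, 20),
        (4, '2025-01-20', 105, 60)
    ) AS v (OrderID, OrderDate, ProductID, Sales)
)
-- group by: one output row per product, so order details cannot be selected.
SELECT ProductID, SUM(Sales) AS TotalSales
FROM Orders
GROUP BY ProductID
ORDER BY ProductID;
-- 101 -> 30, 102 -> 15, 105 -> 60   (3 rows)

WITH Orders AS (
    SELECT *
    FROM (VALUES
        (1, '2025-01-01', 101, 10),
        (2, '2025-01-05', 102, 15),
        (3, '2025-01-10', 101, 20),
        (4, '2025-01-20', 105, 60)
    ) AS v (OrderID, OrderDate, ProductID, Sales)
)
-- window function: every input row is kept and the product total is repeated.
SELECT OrderID, OrderDate, ProductID,
       SUM(Sales) OVER (PARTITION BY ProductID) AS TotalSalesByProduct
FROM Orders
ORDER BY OrderID;
-- order 1 -> 30, order 2 -> 15, order 3 -> 30, order 4 -> 60   (4 rows)
```

The only difference between the two queries is where the dimension goes: in group by it controls the number of output rows, in partition by it only controls how the aggregate is computed.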
So now let's add the same dimension. It's going to be the product ID. I'm just going to add it at the front over
here. So let's select. And as you can see now it makes more sense. We have those products and they have always the
exact same sales, and as well for the next product and so on. And now here comes the magic of the window function.
We can add more informations to our select statement without having any errors. So now we need additional
informations like the order ID. So we can go over here and say order ID, order date, any type of column you can add it
to your select and let's go and execute. So as you can see now we got the result even though that those three dimensions
in the select are not part of the window aggregation. So with that we have solved the tasks. We have additional
informations. We have the order ID, the order date and as well the first part of the task to find the total sales for
each products. So each of those values are the total sales for each product. And with that we have solved the tasks
and this is exactly why we need window functions. In real projects things get really complicated. You are doing
different tasks in one query. So you are doing aggregations, you are doing some other stuff. So just focusing on the
aggregations is not going to be enough. You have always to add additional information to your query. So as you
can see we use group by to do simple analysis but as things get complicated in the analytics we use the window
functions in order to show the aggregations and as well add additional information. All right everyone. So now we're going
to go and deep dive into the syntax of the SQL window functions. We're going to cover everything each part of the syntax
for you to understand how to use them. So let's go. All right. So let's start first by understanding the basic
components or the basic parts of each window syntax. Mainly we have two parts. The first part going to be the window
function. We have like sum, average and so on. The second main part is going to be the over clause. And inside the over clause
we have three different parts. The first one is going to be the partition clause, the second the order clause, and the last one we
have is the frame clause. And those are all the components that you can use inside the window function. So two main parts:
the window function and the over clause. And inside the over we have partition, order and frame. Let's go into more detail. So
for example we have the following window function. So as you can see we have a lot of stuff going on here. We're going
to understand them step by step component by component. Let's start from the left from the first one. So what do
we have over here? We have a function window function. So what is a window function? Like here we have the average.
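As a point of reference, here is a sketch of what such a full window function can look like, with each part labeled. It is reconstructed from the pieces named in this section (average of sales, partition by category, order by order date, rows unbounded preceding); the exact statement shown on screen, the sample rows and the column names are assumptions.

```sql
-- Hypothetical sample rows; names and values are illustrative only.
WITH Orders AS (
    SELECT *
    FROM (VALUES
        (1, '2025-01-01', 'A', 10),
        (2, '2025-01-02', 'A', 30),
        (3, '2025-01-03', 'B', 20)
    ) AS v (OrderID, OrderDate, Category, Sales)
)
SELECT
    OrderID,
    Category,
    OrderDate,
    Sales,
    AVG(Sales)                        -- window function with its expression
        OVER (                        -- over clause: marks this as a window function
            PARTITION BY Category     -- partition clause: divides the rows into windows
            ORDER BY OrderDate        -- order clause: sorts the rows inside each window
            ROWS UNBOUNDED PRECEDING  -- frame clause: subset of rows inside the window
        ) AS AvgSales
FROM Orders
ORDER BY Category, OrderDate;
```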
It's like any other function in SQL. You can use it in order to do calculations on top of the window. So the first thing
to do or to define in a window is to define the function of the window. And as we learned before, we have a long
list of many window functions available in SQL. And we group them into three groups. The first one we have the
aggregate functions. So we have the count, sum, average, max. All those functions we have them as well for the
group by. So those are used for the aggregations. The second group of functions we have the ranking functions.
So we have the row number, rank, ntile and so on. So we can use those functions in order to give a rank for our data. The
last group we call it value or sometimes analytics functions. So here we have very important functions like the lead,
lag, first value and the last value in order to access a specific value and of course we're going to go and learn all
of them one by one, understanding the concepts with some examples, and as well for you to understand when to use them for
data analysis. All right so now let's keep moving, understanding the other parts of the window syntax. Now inside
the function average we have here a field name or column name called sales. This is called a function expression. It's
like a value, a parameter, an argument that we can pass to the function. And here we can use different things,
depending on the function of course. So here it could be empty, like here in the rank. It doesn't allow you to
use an expression. So it should be always empty. Or we can use a column, like in the example where we use the sales. So
we use the column name as an argument or an expression. For the average we are finding the average of sales. Or we could
use a number. So here in the ntile we are allowed to use only numbers. Or we could have multiple arguments. For example
in the lead we can have sales, then numbers and so on. So things get complicated. Don't worry about it. I'm
going to explain that. So here we have multiple arguments. Or we can have a whole conditional logic. So for example here
we have the case when and so on inside the sum. So the whole thing over here is an expression for the sum. So as you can
see we can build here a complex logic and the output of this logic can be passed to the function sum. So that
means as an expression for the function we can use different stuff of course depends whether the function allows it
or not. All right. So now let's have a quick overview in order to understand which data types are allowed in the
expressions for those functions. Let's see the aggregate functions. As you can see the count function accepts any data
type, but the sum and average allow only numerical data types (min and max also work on dates and strings). All right. So now
let's move to the rank functions. The expression is pretty easy: it should be empty. They don't allow any argument
or anything inside those functions. So as you can see all of them are empty, except one that accepts a numerical value,
which is the ntile. You have to define a numeric value. And now moving on to the last type we have the value
functions. They accept any data type inside the expressions. So as you can see each function has its own
specifications and you have to be careful which data type you are using in the expressions. Okay. So now let's keep
moving to the next one. We have a very important part in the window syntax. So so far what do we have? We have a
function. We have an expression. It's like usual stuff. We have done that before using the group by. Now we have
to tell SQL that we are dealing with the window function. It's not a normal one. In order to do that we have to specify
the keyword over. So the second main part in the syntax is the over clause and we use it in order to define a
window and inside it we can define multiple things like the partition by, the order by, the frame, but all of those
are optional. We can skip them and leave it empty. So the main task of the over: first it tells SQL we are dealing
with a window function here, and as well you can use it in order to define a window of your data. So now we're going
to go and cover everything inside the over clause and we're going to start with the first one, the partition
by. All right. So now we're going to learn how to define a window inside the over clause. The first part that we can
define is the partition by. So for example here we have partition by category. We have to define the
dimension. It's very similar to the group by, just another wording. So the first part is going to be the partition
clause. What is it going to do? It's going to divide the entire data set into groups, or you can call them windows or
partitions. So here we tell SQL how to divide our data. And here we have two options. Let me just show you. So if we
don't use anything so we have it empty. You see over and partition by is not used. What going to happen? SQL going to
use the entire data in order to do the calculations. So the whole data the entire data going to be counted as one
window. So we are telling SQL don't divide anything leave it as it is. The second option that we have is to divide
the data by partition by. So we define the window like this: partition by product, for example. So SQL going to go
and divide the entire data into different windows. For example here two windows. And here this time the
calculation the sum of sales will not apply on the entire data set. This time it going to be applied on the different
windows individually. So we're going to find the sum of sales for window one separately from the total sales of
window 2. All right. So now we have this very simple example. We have here three fields. The month, product, sales. They
are really simple information. And now we have the following SQL window function. So we have sum of sales and inside the
over clause we are not using anything. So we are not using partition by. So how is SQL going to define the window? Now SQL
going to say okay I don't have to divide anything. The entire data set is one window. So SQL going to go over here and
say the whole thing is one window. So there is no partitions, there is nothing. We have only one window. So the
entire data going to be aggregated. So this is what happens if you don't use partition by and you leave the
over clause empty. The entire data is one window. All right. So now let's move to the next example. We don't want to have only one
window. We would like to have multiple windows. So we have to divide the data by something. So in the over clause
we're going to define the window like the following partition by month. So it's not empty. We are now dividing the
data by the field month. So the values inside this column going to divide the data sets. So here we have two months
January and February. So what's going to do? SQL going to go and divide the data into two sets. The first window going to
be this one of January. So we have the first window going to make it smaller and the second window going to be the
February. So it's going to be two windows inside our data and the calculation going to be happening on
each window separately. So here as you can see we are using the month in order to divide our data sets into two
windows. One window for January and another window for the February. So now let's have a quick overview about the
options that we have with the partition by. The first option, as we learned: we can just skip it. So without partition by,
for example here, we get the total sales across all rows, and we don't define anything inside the over. The second option: we can
use one field one column for example partition by product. So we are using one dimension but we can go and mix
stuff. We can use multiple columns or multiple dimensions in the partition by for example here partition by product
and order status. So here with the partition by we can define a list of dimensions that could be used in order
to divide our data. So in this example we are saying find the total sales for each combination of products and order
status. So those are the different options on how to work with the partition by. So now let's have this
overview again for all functions. The partition by for all those functions is optional. So if you don't use the
partition by in all those functions you will not get any errors. So now let's go back to SQL in order to start practicing
with this clause. Okay. So now we have the following task. Find the total sales across all orders. And we have to
provide additional informations like the order ID and the order date. So let's go and solve it step by step. First I would
like to provide the details. So I'm going to select the order ID and the order dates from the table sales orders.
And next we're going to work with the aggregations. So we need to find the total sales across all orders. Again
since we have here details and aggregations we cannot use group by. We have to use the window function. So
we're going to go use the function sum for sales. And now we have to tell SQL we are working with window functions.
That's why we're going to use the over clause. And now the next step we have to think about defining the window. So
let's check the task. It says total sales across all orders. So that means we don't have to partition or divide the
data sets into like chunks or partitions. We have to leave it as it is like the whole data going to be one
window. And that's why we don't use partition by inside the definition. We're going to leave it empty. Let's go
now and give it a name. It's going to be the total sales. Let's go and execute this. And now at the results, as you can
see, we have all the orders, all the details, and as well, we have the total sales across all orders. So with that,
we have solved the tasks. We have the total sales and as well some details about the order. All right. So now let's
move to the next task. It's going to be very similar. So it says find the total sales for each product. And we have to
provide additional informations like the order ID and the order dates. So it's going to be very similar task but this
time we have to divide the entire data into windows and that's going to be by the product. Since we are saying total
sales for each product. So this time we have to go and divide the data. So we're going to define the window like this
partition by and we can use the dimension product ID. Let's go and execute this. So now you can see in the
total sales we don't have anymore the total sales of the whole data but they are divided but in order to understand
the results let's go and include the product ID in the results. So product ID and execute. So now by looking to the
results you can see that the data is divided into four windows. Let's see them. It's going to be by the product
ID. So this dimension going to be controlling the partition. So the first window going to be the product ID 101.
So we have the total sales for this product 140 and the next window going to be 102. The third one 104 and the last
window it's going to be only one row the 105 and the total sales of 60. So with that we have solved the task. We have
the total sales for each product and as well we have some details. Now I would like to show you the dynamic of the
window function. We can add multiple aggregations on multiple levels. Let me show you what I mean. Let's say we stay
with the same example but we're going to find the total sales across all orders and as well the total sales for each
products. So what we can do we can do the window functions on different levels by for example here removing the whole
definition. So here we have the total sales for the entire data for the first task and the next one going to be the
total sales but divided by the product ID. Let's here rename it by products. Let's go and execute this. And
now you know what I'm going to go and add the sales as well just to explain the flexibility of the window function.
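The query being built up here, with all three levels of sales side by side, might look like the following sketch. The sample rows and column names are assumptions, not the course's actual data.

```sql
-- Hypothetical sample rows; names and values are illustrative only.
WITH Orders AS (
    SELECT *
    FROM (VALUES
        (1, 101, 10),
        (2, 102, 15),
        (3, 101, 20),
        (4, 105, 60),
        (5, 102, 25)
    ) AS v (OrderID, ProductID, Sales)
)
SELECT
    OrderID,
    ProductID,
    Sales,                                                     -- no aggregation: one value per order
    SUM(Sales) OVER ()                       AS TotalSales,    -- whole data set is one window
    SUM(Sales) OVER (PARTITION BY ProductID) AS SalesByProduct -- one window per product
FROM Orders
ORDER BY OrderID;
-- OrderID  ProductID  Sales  TotalSales  SalesByProduct
-- 1        101        10     130         30
-- 2        102        15     130         40
-- 3        101        20     130         30
-- 4        105        60     130         60
-- 5        102        25     130         40
```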
So let's go add the sales and execute it again. And now by looking to the results you can see we have the sales
information three times but with different granularities. The first one, the sales, is the sales without any
aggregations. It is the highest level of detail of the sales, and we have the sales for each order. The next
one the total sales with the window function. Here we have the highest level of aggregation. So we have the total
sales of all orders and the last one the total sales by product it's something like in the middle we are aggregating on
a window and the window going to be the product ID. So as you can see we have different granularities of the
aggregations and this is exactly the flexibility that we have with the window function. We can do all those stuff in
one query. Okay. So now let's keep moving and adding stuff to our task. It's going to say find the total sales
for each combination of the products and the order status. So this time we have to divide the data not only by the
product but as well by another dimension, the order status. So now let's see how we can do that. I'm going to
just show the dimension order status in the results and we're going to add the following thing. So sum of sales, over since
it's a window function and let's go now and define the window partition by. So we have again the product ID but not
only this dimension as well the order status and let's go and call it sales by products and status. Let me just rename
those stuff. Okay. So let's go and execute. All right. So now let's check the results. It is the last aggregation
over here. And as you can see here the aggregation has a different granularity than the previous one. And we have more
details. This time we are splitting the data by two dimensions. So the first window going to be the product ID with
the order status, and it's going to be only those two rows. So we have the product ID 101 and the order status delivered. So
the total sales of this going to be 10 + 20 and we're going to have 30. The next window going to be the same product but
with different status. So it's going to be the 101 shipped and we're going to go and summarize those two values and we're
going to have 110. The next product and order status going to be the 102 and we have it only once. So 102 delivered it's
only once. So it's going to be the same value. The next partition or window it's going to be two rows. 102 with the
shipped is going to be those two things 60 + 15 we're going to get 75. So as you can see here the product ID and the
order status they are controlling how many windows we're gonna get. So we get here around like six windows. With the
product ID we got only four windows and without using anything inside the over clause we will get only one window.
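Partitioning by one column versus two can be compared in a single query. In this sketch the sample rows loosely follow the totals quoted above (30 and 110 for product 101, 75 for 102 shipped), but the individual values, the single 102 delivered row and the column names are assumptions.

```sql
-- Hypothetical sample rows; names and values are illustrative only.
WITH Orders AS (
    SELECT *
    FROM (VALUES
        (1, 101, 'Delivered', 10),
        (2, 102, 'Shipped',   15),
        (3, 101, 'Delivered', 20),
        (4, 102, 'Shipped',   60),
        (5, 101, 'Shipped',   90),
        (6, 101, 'Shipped',   20),
        (7, 102, 'Delivered', 30)
    ) AS v (OrderID, ProductID, OrderStatus, Sales)
)
SELECT
    OrderID,
    ProductID,
    OrderStatus,
    Sales,
    SUM(Sales) OVER (PARTITION BY ProductID)              AS SalesByProduct,
    SUM(Sales) OVER (PARTITION BY ProductID, OrderStatus) AS SalesByProductAndStatus
FROM Orders
ORDER BY ProductID, OrderStatus, OrderID;
-- OrderID  ProductID  OrderStatus  Sales  SalesByProduct  SalesByProductAndStatus
-- 1        101        Delivered    10     140             30
-- 3        101        Delivered    20     140             30
-- 5        101        Shipped      90     140             110
-- 6        101        Shipped      20     140             110
-- 7        102        Delivered    30     105             30
-- 2        102        Shipped      15     105             75
-- 4        102        Shipped      60     105             75
```

With these rows, partitioning by product gives two windows and partitioning by product and status gives four; each distinct combination of the partition columns becomes its own window.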
So this is how the partition by works. All right. So that was the first part of the window definition within the
over clause. Let's move to the next part. We have the order by. For example, we can use order by order date. It's just a
field. So the order clause is very important in order to sort your data within a window. So the order by is very
important as well for many functions. So by just checking the overview over here for the aggregate functions it is
optional. So you could just leave it out or add it. But for the rank functions and as well for the value functions it is a
must. So if you want to use those functions you must use the order clause because it makes no sense for example if
you are ranking the data without sorting your data first. Okay guys. So now back to our very simple example and we have
the following query. So the function this time going to be rank. So we have to rank the data and the definition of
the window going to be partition by month. So that means we divide the data by the months. So we have it over here.
And then the second part going to be order by sales descending. So we have to sort each window by descending order.
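The query being traced here can be sketched as follows. The February values (70, 40, 5) come from this walkthrough; the January values, the product labels and the column names are assumptions.

```sql
-- Hypothetical sample rows; names and values are illustrative only.
WITH MonthlySales AS (
    SELECT *
    FROM (VALUES
        ('Jan', 'A', 20),
        ('Jan', 'B', 10),
        ('Jan', 'C', 30),
        ('Feb', 'A', 40),
        ('Feb', 'B', 70),
        ('Feb', 'C', 5)
    ) AS v (SalesMonth, Product, Sales)
)
SELECT
    SalesMonth,
    Product,
    Sales,
    RANK() OVER (
        PARTITION BY SalesMonth  -- one window per month
        ORDER BY Sales DESC      -- highest sales first inside each window
    ) AS RankInMonth
FROM MonthlySales
ORDER BY SalesMonth, RankInMonth;
-- Feb: 70 -> 1, 40 -> 2, 5 -> 3
-- Jan: 30 -> 1, 20 -> 2, 10 -> 3
```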
That means we start with the highest value and we end up by the lowest value. So let's see how SQL going to go and
execute this. So first partition by month. So it's going to divide the data into two partitions because we have two
values by the month. So let's see how this going to look like. So one window for January and another window for
February. All right. So now SQL going to go to the second part and execute order by sales descending. So what's going to
happen? SQL going to go for each window separately and start sorting the data from the highest to the lowest without
checking the other window. So in those three values, the highest one is this one. So it's going to be on top. Let me
just sort it. This is going to be the lowest. You're going to be in the middle. So SQL going to sort this window
separately from the next one. And then once it's done, it's going to go to the second one. So the highest value going
to be this one. You are the lowest. Let me just do it like this. So SQL going to sort it like this. The highest one is
70. The next one is 40. And the last one is five. So with that SQL is done with the definition of the window. So it's
split by the month, and each window is sorted by the sales. The next step is to go and rank those values. So
it's really simple. In the output, it's going to rank the data like this. So the first one going to be this value. The
next one going to be two and the third one going to be three. So as you can see, SQL is ranking only this window and
it's going to go and repeat the same stuff for the second window. So each rank is separate from the others. So
as you can see, it's very simple. This is how SQL executes partition by together with the order by for the rank
function. All right. So now let's have a quick task for the order by. It says rank each order based on their sales
from the highest to the lowest. And we have to provide additional informations like order ID and order date. So let's
see how we can write the query. So we have the basic stuff order ID, order date and sales. And now we're going to
go and rank the data using window function. So we're going to use the function rank and then we're going to
tell SQL this is a window function and inside it we have now to provide the definition of the window. So now by
checking the task you can see that we don't have to divide the data. So we don't have to use partition by we have
just to use rank, and with rank we have to use the order by, it is a must. So we're going to use order by, the field going to
be the sales and from the highest to the lowest. So let's just call it rank sales and let's go and execute this. And as
you can see our results going to be sorted from the highest to the lowest. So you can see the sales 90 at the top
and the lowest going to be the 10. And as well we have a rank. So for the top rank it's going to be one and the lowest
rank going to be 10. So as you can see we just quickly create a rank in SQL. It's very simple. The whole thing is one
window since we are not using partition by. And of course if you want to have ascending, so from the lowest to the
highest, you can just remove the descending because by default it's going to be ascending. So let's go and execute the query. So now
we can see the orders are sorted the other way around. So we start with the lowest and end up with the highest. And of course
you're going to get the same results if you go over here and add ascending. So if you execute you see we got exactly
the same results. So this is how you use the order by inside the window definition.
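Both sort directions can be shown next to each other in one sketch. The sample rows and column names are assumptions for illustration.

```sql
-- Hypothetical sample rows; names and values are illustrative only.
WITH Orders AS (
    SELECT *
    FROM (VALUES
        (1, '2025-01-01', 10),
        (2, '2025-01-05', 60),
        (3, '2025-01-10', 20),
        (4, '2025-01-20', 90)
    ) AS v (OrderID, OrderDate, Sales)
)
SELECT
    OrderID,
    OrderDate,
    Sales,
    RANK() OVER (ORDER BY Sales DESC) AS RankHighToLow,  -- highest sales gets rank 1
    RANK() OVER (ORDER BY Sales)      AS RankLowToHigh   -- ascending is the default
FROM Orders
ORDER BY Sales DESC;
-- Sales  RankHighToLow  RankLowToHigh
-- 90     1              4
-- 60     2              3
-- 20     3              2
-- 10     4              1
```

The order by inside the over only controls how the rank is calculated. The outer order by is added in this sketch so that the order of the output rows does not depend on how the engine happens to return them.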
Okay guys, so with that we have covered the second part of the window definition. Now we're going to go to the
last part to the most advanced part of window and we have the following stuff. So we have rows unbounded preceding. We
call this the frame clause or window frame. So what we are doing over here is that we are defining a subset of rows within
each window that is relevant for the calculation. Totally understand if this is confusing at the start or complex. It
was for me as well. So what we're going to do we're going to deep dive into the concept in order to understand how this
works and we're going to do it step by step. So don't worry about it. All right. So now let's understand what is
going on with the frame clause from the basics. So now if you do aggregations and you don't use a window function you're
going to consider the entire data or rows inside the table. But what we can do, we can go and divide the data using
partition by into windows. So for example here we have window one and window two. Now if you go and do aggregations all
the rows in window one are going to be aggregated and then SQL is going to go to window two and aggregate all the
rows. What we can do in SQL is that we can say you know what I don't want all rows inside the window. I want a subset
of rows inside the window. So what we are doing over here is that we have those two windows but we specify a scope
or we specify subset of data from each window to be involved in the aggregations. And of course not only
aggregations we can do ranking other stuff. So I mean calculations. So here like we have a window inside a window.
So we are defining a scope of rows. Not all rows should be involved in the calculation but only specific subset of
data. And we can do that using the frame clause. So again the partition by you can use it in order to divide the entire
data set into multiple windows. And now for the frame clause: if you don't want to consider all the rows within each
window in the calculation, and you want to focus on only a subset of data within each window, then you are going to go
and use the frame clause. All right. So now let's go and understand the syntax of the frame clause. Let's have the
following example. We are saying the window function is the average of sales and then we define the window. So we
have first partition by category, order by order date, and then we have the frame clause. It's going to be the
following: rows between current row and unbounded following. Rows is the frame type, and we have two types. We have
rows and range. Then we have between and and, which define a range. So the first one is going to be the frame boundary, the
lower value. And here it accepts three types of keywords: the current row, or N preceding, or the unbounded
preceding. And then we have another frame boundary. It's going to be the higher value, and it accepts the
following: we can use the current row, N following, or unbounded following. So as you can see we are defining like
boundary or a range from low value to higher value. So now we have some rules. We cannot use the frame clause without
order by. So order by must exist in the definition in order to use the frame clause. And the second rule says the
lower boundary must be before the higher boundary. So always we start with the lower boundary and we end up having the
higher boundary. You cannot switch that. Okay. So now we have a very simple example. We have the month and the sales
and the following query. Sum of sales. This is the window function. And the definition of the window going to be
order by month. We are not using partition by just in order to make our life easier. And the frame close going
to be defined like this. Rows between current row and the two following. So now let's see how SQL going to execute
this. The first definition order by month. As you can see the months are sorted already. So now SQL going to work
with the frame definition current row and the two following. So SQL going to process this row by row. So it's going
to start with the first row and it's going to be our current row as here in the SQL. So this is our current row and
we say the range until two rows, two following rows. So it's going to be February and March. So that means the
pointer going to be over here for the two following. So with this we have the frame boundaries and SQL have the
following scope for the first row. So we have three rows and the summarization of those three rows going to be around 70.
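The query being traced in this walkthrough can be sketched as follows. The monthly figures are assumed values picked to reproduce most of the totals quoted in this part; the January figure in particular is a guess, and the column names are hypothetical.

```sql
-- Hypothetical sample rows; names and values are illustrative only.
WITH MonthlySales AS (
    SELECT *
    FROM (VALUES
        (1, 'Jan', 20),
        (2, 'Feb', 10),
        (3, 'Mar', 30),
        (4, 'Apr', 5),
        (6, 'Jun', 70)
    ) AS v (MonthNo, MonthName, Sales)
)
SELECT
    MonthName,
    Sales,
    SUM(Sales) OVER (
        ORDER BY MonthNo
        ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING  -- current row plus the next two rows
    ) AS SalesNextThreeRows
FROM MonthlySales
ORDER BY MonthNo;
-- MonthName  Sales  SalesNextThreeRows
-- Jan        20     60    (20 + 10 + 30)
-- Feb        10     45    (10 + 30 + 5)
-- Mar        30     105   (30 + 5 + 70)
-- Apr        5      75    (5 + 70, only one following row exists)
-- Jun        70     70    (no following rows exist)
```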
So we will get for the first row 70 because the scope is not all rows but only this subset of data. Okay. So with
that SQL is done with the first row and it's going to jump to the second row. So the pointer going to be the current row
at February and the two following going to be at April. So with
in the subset of data or in the window and with that we have a new scope a new subset and the summarization of all
those values going to be 45. So that's it. I think you get it already. It's going to go to the next one. The pointer
going to be on March and the two following going to be on June and it's going to slide like this. We have those
three rows in the scope and the summarization of that going to be 105. So now things gets interesting for the
next row. So the pointer for the current row going to be April but the two following going to be like after the end
of the table or something like that. So as we slide down as you can see the scope now or the subset of the frame
going to be only two rows and the output going to be 75. And finally if you go to the last row it's going to be the
current row and we're going to have only one row for this subset because the two following is just outside of the table
and we're going to get the same value as the summarization. So as you can see that's it. It's very simple right? So
the frame we use it in order to scope which rows are involved in the calculations. So all you have to do is
to define the boundaries of the frame, the lower and the upper boundary. Let's see what other options do we have with
the frames. Okay. So here we have the same example but we redefine the boundaries of the frame like this. Rows
between current row this is the first boundary and unbounded following. This means that we are targeting always the
last record in the window or in the table. So unbounded following going to be always static and it's going to be in
this example pointing to June. And now it's still going to go row by row and the current row going to be like the
start January and then February. I'm just going to take this example the pointer is on February and the subsets
or the frame going to be those four rows. So it's going to be February, March, April, June. So it's going to be
four rows and the total aggregation of that going to be 115. So you can do it like this. And previously it was like
flexible more flexible it was two following but this time we have unbounded following that means always
the boundary going to be the last one. So as we are moving with the records over here the boundary is going to be
smaller smaller and like this and the last one they going to be both in the same record. So the current record going
to be as well the unbounded following. Okay let's see the next one. The definition of the window going to be the
following: rows between one preceding and the current row. So here it is the other way around: one preceding is lower than the
current row. So let's see how SQL going to execute this. Let's say that we are currently at March. So this is the
current row and we are saying between one preceding. So that means one row before the current row. So the frame
going to be like this and we have only two rows. So the value going to be the summarization of those two rows and it's
going to be 40. So that means we are always targeting the rows before the current row. Okay. So now let's keep
going with the other options in order to understand everything about the frame. So we redefine like this rows between
unbounded preceding and the current row. So unbounded preceding going to be the first row in the table or in the window.
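The frame definitions walked through so far can be put side by side in one query. This is only a sketch under assumptions: the table and column names (monthly_sales, month_no, month_name, sales) are made up, and while the January, April and June figures follow from the totals quoted in the walkthrough, the February/March split is assumed (only their sum of 40 is implied).

```sql
-- Hypothetical data. Feb = 10 and Mar = 30 are assumptions; only Feb + Mar = 40 is implied.
WITH monthly_sales AS (
    SELECT *
    FROM (VALUES (1, 'Jan', 20),
                 (2, 'Feb', 10),
                 (3, 'Mar', 30),
                 (4, 'Apr', 5),
                 (6, 'Jun', 70)) AS v (month_no, month_name, sales)
)
SELECT
    month_name,
    sales,
    -- current row plus the next two rows (fewer near the end of the window)
    SUM(sales) OVER (ORDER BY month_no
                     ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)         AS next_two,
    -- current row down to the last row of the window
    SUM(sales) OVER (ORDER BY month_no
                     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS to_end,
    -- previous row plus current row
    SUM(sales) OVER (ORDER BY month_no
                     ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)         AS with_previous,
    -- first row of the window up to the current row
    SUM(sales) OVER (ORDER BY month_no
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM monthly_sales
ORDER BY month_no;
```

With these figures the February row shows 115 in to_end, the March row shows 40 in with_previous and 60 in running_total, and the April row shows 75 in next_two, which matches the numbers quoted in the walkthrough.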
So it's going to be static like this. It's going to be the first one January. And let's say that we are at this
current row in March. So the window or the subset going to look like this. Those three rows and the total of that
going to be 60. So now as SQL is proceeding to the next one, it's going to fix the first boundary. So it's going
to be always pointing to January and the subset going to be a little bit bigger until we reach the last one. And with
that we're going to have the subsets the whole rows. So with that we get really great flexibility on how to define the
subset and how the subset is shifting through the window. Okay, so now we are just having fun. So we are just playing
around with the boundaries. We don't always have to use the current row as a boundary. So we can use for example here in this
definition rows between one preceding and one following. So we don't use the CURRENT ROW keyword at all in the
boundaries, although the current row itself still sits inside the frame. So let's say again our current row is March. So one preceding is February and one
following is April. So with that our frame is going to be the three rows, February, March and April. And
the aggregation of this is going to be 45. So with that as you can see the boundaries are one
preceding and one following. So a boundary does not always have to be the current row. All right. So now I think you already
get it. What going to be the last option? We're going to have everything. So the definition of the frame going to
be rows between unbounded preceding and unbounded following. What we're going to have over here. The unbounded preceding
going to be January and the unbounded following going to be June. And now the frame going to be everything all the
rows. And it doesn't matter where are we with the current row, it's going to be always a fixed subsets. So it's going to
be always everything. So if we are over here or February or March, we're going to be considering all rows and the total
sales of that going to be 135. So we will get the exact same results for everything for all rows. So with that I
think it's not that complicated, right? We just have to provide the boundaries and then the calculation going to be
depending on the frame on the subset of data. Okay guys, so now let's go back to SQL and start practicing in order to
understand how the frame work. So let's go and define a window like this. So sum of sales and the window definition like
this. We going to divide the data by order status and let's say we're going to sort it by order date. And let's
define a frame like this. rows between current row and two following. Let's give it a name total sales. So let's go
and execute it. So now let's look to the data. You see that SQL going to divide our results into two sections, two
windows delivered and shipped. And you can see that the data is sorted by the order date. So as you can see over here
for example in this status delivered we can see that 1 of January 10 and so on. And then the third part we have defined
a frame in each window. So for example, let's take the first one. So this is the current row. So we say the frame is
between the current row and the two following orders. So that means the scope is going to be like this. So 10 + 20 +
25, it's going to be 55. And now what is interesting as well to check here is the last record of each window. So now let's
take this window over here and the last record going to be number seven. So this order and let's say this is the current
record. So we set the frame between current record and the two following. But since it is the last record of this
window, SQL will not go and consider the next two orders because those two orders are outside of the window, and that's why
we have here 30: SQL doesn't reach into the next window and add up those values. So we have 30 and there is nothing after that.
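The query being described here can be sketched as follows. The orders table, its column names, the dates and the Shipped figures are all assumptions for illustration; only the Delivered values 10, 20, 25 and the final 30 come from the walkthrough.

```sql
-- Hypothetical orders. Dates are ISO strings so they sort chronologically.
WITH orders AS (
    SELECT *
    FROM (VALUES (1, 'Delivered', '2025-01-01', 10),
                 (3, 'Delivered', '2025-01-10', 20),
                 (5, 'Delivered', '2025-02-01', 25),
                 (7, 'Delivered', '2025-02-15', 30),
                 (2, 'Shipped',   '2025-01-05', 15),
                 (4, 'Shipped',   '2025-01-20', 60),
                 (6, 'Shipped',   '2025-02-05', 50)) AS v (order_id, order_status, order_date, sales)
)
SELECT
    order_id,
    order_status,
    order_date,
    sales,
    -- the frame never crosses a partition boundary
    SUM(sales) OVER (PARTITION BY order_status
                     ORDER BY order_date
                     ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS total_sales
FROM orders
ORDER BY order_status, order_date;
```

In this sketch the first Delivered row gets 10 + 20 + 25 = 55, while the last Delivered row (order 7) gets only its own 30, because the Shipped rows belong to a different window.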
That's why we will get 30. So as you can see the frame going to be calculated within one window. So it will not
consider anything outside of the window. So this is how the frame works within partitions. So now I would like to show
you as well a few things about the frames. We can use a shortcut, but only with PRECEDING. So
for example let's say I'm going to change the definition like this: rows between two preceding and current row. So let's go
and execute it and we will get those results. So now if you want to check the results quickly, let's take for example
this order over here. We are always summing the current order and the two previous orders. So that means those
three orders are going to be involved in the frame and the output is going to be 55. So now there is a shortcut in SQL, but only
for PRECEDING, where we can drop the BETWEEN ... AND CURRENT ROW part. So we can go and remove everything else and leave it like this:
ROWS 2 PRECEDING. And if you go and execute it we will get exactly the same results. So this is a quick way or a shortcut on how
to define a frame, but it only works with PRECEDING. So for example, if I go over here and say
UNBOUNDED PRECEDING, it's going to work. So we will get the results between unbounded preceding and the current row. But if
you go over here and you say, you know what, let's have UNBOUNDED FOLLOWING, SQL is going to raise an error. And
the same thing if you remove the unbounded and say for example one following: SQL will not like it. So you
can use the shortcut only with PRECEDING. And one last thing about the frames: there is a default frame.
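A sketch of the shortcut next to its long form, plus the default frame just mentioned. Table, columns and values are assumptions. The last column relies on the default frame that applies when ORDER BY is present and no frame is written; it matches the explicit ROWS version here because the assumed order dates have no ties.

```sql
-- Hypothetical orders, one partition only to keep the comparison short.
WITH orders AS (
    SELECT *
    FROM (VALUES (1, '2025-01-01', 10),
                 (3, '2025-01-10', 20),
                 (5, '2025-02-01', 25),
                 (7, '2025-02-15', 30)) AS v (order_id, order_date, sales)
)
SELECT
    order_id,
    sales,
    -- long form and shortcut: same frame
    SUM(sales) OVER (ORDER BY order_date
                     ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)         AS long_form,
    SUM(sales) OVER (ORDER BY order_date ROWS 2 PRECEDING)             AS shortcut,
    -- three spellings of a running total
    SUM(sales) OVER (ORDER BY order_date
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_explicit,
    SUM(sales) OVER (ORDER BY order_date ROWS UNBOUNDED PRECEDING)     AS running_shortcut,
    SUM(sales) OVER (ORDER BY order_date)                              AS running_default
    -- Not valid: ROWS 2 FOLLOWING or ROWS UNBOUNDED FOLLOWING without BETWEEN
FROM orders
ORDER BY order_date;
```

For the third row, long_form and shortcut both give 10 + 20 + 25 = 55, and the three running-total columns agree on every row.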
So if you don't use any frame and you use order by what can happen SQL going to use a default frame. So if you check
the result you will notice that for this window over here those values are not like the whole values of the sales.
There is a hidden frame. The default frame in SQL runs from unbounded
preceding to the current row. Strictly speaking the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which gives the same result as the ROWS version as long as the order by values have no ties. So this is the default frame if you use order by. So now if you go and just execute it you
will see that we get exactly the same results. So be careful: once you use order by with the aggregate functions
there will be a hidden frame or a default frame like this, between unbounded preceding and the current
row. So that means there are three ways to get this frame between unbounded preceding and current
row. Either write it out like this, or you can go and use the shortcut like this. Let me just execute it. So we'll get the
same result. Or just remove it completely. We will get as well the same results. Now again the hidden frame or
the default frame is only working with the order by. So if you go for example here and remove the order by let's see
the results. The whole window will be aggregated. So again let me just select it. So you can see that SQL going to
consider all the rows in the aggregations and we will get the total sales for the whole window. So there
will be no frame defined; it's only going to be present once you use order by. All right friends, so with the frame clause
we have now covered all the components on how to define a window inside an OVER clause, and with that we have covered
everything about the syntax of the window functions. Okay guys, so now we're going to go and
understand the rules or let's say the limitations of window functions. So let's learn what you are not allowed to
do while using window functions. Okay, so the first rule is that you are allowed to use the window function only in the
SELECT clause and as well in the ORDER BY clause. So here we have again the same example where we are finding the total sales
by the order status. So as you can see we used the window function in the select clause and we didn't get any
error right. So now we can go and use it as well in the order by. So let's say order by and let's go and copy
everything but not the name in the order by. So if I go and execute this there will be no errors and SQL going to allow
it. And as you can see the result didn't change. So let's go and sort it for example descending. So I'm going to
write here descending and let's execute. Now we have the total sales with the highest values then the lowest values.
So having this rule that we can use it only in select and order by means we cannot use window functions in order
to filter data. So let me show you: for example, instead of order by let's have a WHERE clause where the total sales is, let's say,
bigger than 100. So let's go and execute this. And as you can see SQL is going to say no, you are not allowed to do that.
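Rule one can be sketched like this. The data and names are assumptions, and the wrapping CTE is a common workaround for filtering on a window result that the walkthrough itself does not show at this point.

```sql
-- Hypothetical orders: Delivered sums to 85, Shipped to 125.
WITH orders AS (
    SELECT *
    FROM (VALUES (1, 'Delivered', 10),
                 (3, 'Delivered', 20),
                 (5, 'Delivered', 25),
                 (7, 'Delivered', 30),
                 (2, 'Shipped',   15),
                 (4, 'Shipped',   60),
                 (6, 'Shipped',   50)) AS v (order_id, order_status, sales)
),
status_totals AS (
    -- allowed: window function in the SELECT list
    SELECT order_id,
           order_status,
           sales,
           SUM(sales) OVER (PARTITION BY order_status) AS total_sales
    FROM orders
    -- Not allowed here: WHERE SUM(sales) OVER (PARTITION BY order_status) > 100
)
SELECT order_id, order_status, sales, total_sales
FROM status_totals
WHERE total_sales > 100                                   -- filtering on the computed column is fine
ORDER BY SUM(sales) OVER (PARTITION BY order_status) DESC, -- allowed: window function in ORDER BY
         order_id;
```

With the assumed figures only the three Shipped rows survive the filter, since the Delivered total of 85 is below 100.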
You can do that only for select and order by. We are not allowed to use it for filtering data using the WHERE clause
and as well you are not allowed to use it in the group by. So if I go and do a group by and as well remove the
condition over here. So if you execute it you're going to get the same error. You are not allowed to use the window
function in the group by. So only with the order by or as well in the select clause. Okay. So now to the second rule.
You cannot use window functions inside another window function. So that means you cannot go and nest window functions
together. Let me show you what I mean with that. So let's remove the GROUP BY. Now everything should be working.
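The nesting rule just stated can be sketched as follows, with assumed table and column names. The nested form is shown only as a comment because it is rejected; the version that runs computes the inner window function in a CTE first and applies the outer one to its result.

```sql
-- Rejected (window function inside another window function):
--   SUM(SUM(sales) OVER (PARTITION BY order_status)) OVER (PARTITION BY order_status)

-- Hypothetical orders.
WITH orders AS (
    SELECT *
    FROM (VALUES (1, 'Delivered', 10),
                 (3, 'Delivered', 20),
                 (2, 'Shipped',   15),
                 (4, 'Shipped',   60)) AS v (order_id, order_status, sales)
),
inner_totals AS (
    -- step 1: the inner window function on its own
    SELECT order_id,
           order_status,
           sales,
           SUM(sales) OVER (PARTITION BY order_status) AS status_total
    FROM orders
)
-- step 2: the outer window function works on the already computed column
SELECT order_id,
       order_status,
       sales,
       status_total,
       SUM(status_total) OVER (PARTITION BY order_status) AS total_of_totals
FROM inner_totals
ORDER BY order_status, order_id;
```

The total_of_totals column has no business meaning here; it only mirrors the shape of the nested expression from the demonstration.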
Let's take and copy the whole window function over here and let's just nest it. So instead of sales, we're going to
have now window function inside another window function. So as you can see this is the inner window function and the
rest, the outside, is the outer window function. So if I go and execute this you will see that SQL is going to tell us
you cannot use a window function in the context of another window function. So we cannot do nesting using window
functions. So as you can see this is another limitation for those functions. All right, moving to the third rule, or
let's say an info: the window function will be executed after filtering the data with the WHERE clause. Let's have an
example. So okay, so now let's say that I would like to have the same information, the total sales for each
status, but only for two products, 101 and 102. So let's go and do that. We're going to use the WHERE clause and then
we're going to say product ID IN, and we're going to specify 101 and 102. So let's go and execute this. Now you can see we
still have two partitions. So one for the delivered and one for the shipped, but the total sales is reduced because
we are only focusing on two products and we filtered the whole data set. So how does SQL work? First the WHERE clause is going
to be executed and then the window functions are going to be calculated. So that means first filtering and then
aggregations. Okay guys, now we're going to move to the last rule to the most interesting one and it says the
following. You are allowed to use the window function together with the group by clause only if you use the same
columns. So let me explain what do I mean but first some coffee. Let's have the following task and it
says rank the customers based on their total sales. So now it sounds really easy but if you check it you have here
two calculations. The first one you have to rank the customers and the second calculation is an aggregation. You have
to find the total sales for each customers. Okay. So now I'm going to show you step by step how I usually
solve those tasks. So for now let's check the total sales. It is an aggregation right? So we can use the sum
function and this function is available both with GROUP BY and as a window function. So for now I'm going to
go with the group by and that's because the task is very simple. We don't have to show any other details. Right? So
it's all about aggregations. So why not use the group by? And now to the first part where we have to rank the
customers. We cannot use the rank function with the group by alone, right? GROUP BY works only with aggregate functions. So here we are
forced to use the window function. So that means for the rank I'm going to use window function. For the total sales I'm
going to use a group by. So now let's do it step by step. So first we have to find the total sales for each customer
using group by. It's very simple. So I'm just going to remove all those stuff in our select statements. We need the
customer ID and then we don't need a window function over here. And then after the from we're going to have a
group by customer ID. So now I'm just grouping the customers and finding the sum of all sales. Let's go and execute
this. So now as you can see in the results we have four customers and that's why we have four rows and as well
we have the total sales. So let's say the half of the tasks is already solved. Right now what is missing that we need a
rank. So let's go and build that. The second step we're going to use the rank function and we can define a window for
that. So OVER, and inside it we will not partition the data at all because it's already grouped. So what we're
going to do is OVER with ORDER BY. The rank function always needs an order by; don't worry about it, we can talk about it
later. So now we are ranking the data based on the total sales, that means the sum of sales. So what we're going to do,
let's just go and copy this and put it after the ORDER BY. And now we have to decide whether ascending or descending.
It's going to be descending. So the highest sales first and then the lowest sales. So now as you can see we have now
a rank customers and we have a window function now together with the group by. So now let's go and execute this and see
whether SQL going to allow it. So let's run it and as you can see SQL runs it and we will get the rank for each
customers. So the customer three has the highest total sale. Then the customer number one and the last one going to be
customer number two with the lowest total sales. All right. So we solve the tasks. We have now ranked the customers
based on their total sales. So as you can see SQL allows you to use window function together with the group by but
only with one rule. Anything that you are using inside the window function should be part of the GROUP BY query, either a grouped column or an aggregate. So for
example, we fulfilled the rule because we are using the sum of sales, and the sum of sales is part of the GROUP BY query,
right? So now let me break the rule by not using the sum, just using the sales. So if I just remove the sum and
use only the sales, SQL will not allow it because the raw sales column is not part of the GROUP BY. So as you can see SQL is very
strict with this. If you want to use everything in one query without using subqueries and so on, you have to
use the exact same columns. So for example, if I go over here and instead of sales I use the customer ID, then since
the customer ID is part of the group by, SQL allows it. So be careful using a window function together with the
group by. As long as you are using the same columns, nothing is going to go wrong and SQL is going to allow it. Okay, so now
I'm just going to go and fix this and let's run it. So now as you can see it's really easy if you follow those steps.
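The ranking task can be sketched as one query. The order rows below are assumptions, chosen only so that customer 3 comes first, customer 1 second and customer 2 last, as in the walkthrough.

```sql
-- Hypothetical orders: totals are 110, 55, 125 and 60 for customers 1 to 4.
WITH orders AS (
    SELECT *
    FROM (VALUES (1, 1, 40), (2, 1, 30), (3, 1, 40),
                 (4, 2, 15), (5, 2, 20), (6, 2, 20),
                 (7, 3, 50), (8, 3, 60), (9, 3, 15),
                 (10, 4, 60)) AS v (order_id, customer_id, sales)
)
SELECT
    customer_id,
    SUM(sales)                             AS total_sales,    -- step 1: the GROUP BY aggregate
    RANK() OVER (ORDER BY SUM(sales) DESC) AS rank_customers  -- step 2: window function on the same aggregate
    -- Rejected: RANK() OVER (ORDER BY sales DESC), because sales alone is not part of the GROUP BY
FROM orders
GROUP BY customer_id
ORDER BY rank_customers;
```

The window function runs on the grouped result, so it sees one row per customer; that is why it may reference SUM(sales) or customer_id but not the raw sales column.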
First build the query using group by. So don't think about the window function yet, just build the group by, and
then in the next step, the last one, you go and define and build the window function. So with that you can solve
really nice analytical use cases with one simple query, without having to build subqueries and so on. You
can go and use group by together with the window functions. All right guys so those are the four rules for the SQL
window functions. All right friends, so now let's have a quick recap about the SQL window
functions. Let's start with the definition. It will go and perform calculations like aggregations on top of
subset of data without losing the level of details. So that means we can do aggregations and at the same time we are
not losing the details. Now, of course, there is a lot of similarity between the window function and the group by. But
the main difference is that window functions are very powerful and dynamic compared to the group by. We have way
more functions than the group by. Right? But now if you are doing data analysis and you have an advanced use case, then
you have to go and use window functions. They are more suitable for complex and advanced data analysis. But on the other
hand, if you have a simple question, simple data analysis, then you can go and use the aggregate functions using the
group by and of course you can go and use them in the same query in the same select you can go and mix the group by
together with the window function with only one rule you have to use the same columns and of course the first step is
to do the group by and then later you do the window function in the same query. And now to the next point about the
window components we have two main components. The first one is the window function and the second part is the
window definition using the OVER clause. And inside the OVER clause we can define three things. If you want to divide the
data to create windows you can use the partition by. In the second section we have the order by in order to sort your data.
And the last part you can go and specify a subset of data like a frame within each window. Now let's move to the last
part. We have rules for the SQL window functions. So the first thing is that if you have two window functions or
multiple window functions, you cannot go and nest them together. You have to go and use multiple subqueries. The next
point is that you can use the window function only in the select and the order by clause. So for example, you
cannot use the window function together with the WHERE clause in order to filter the data. Talking about filtering data, when does SQL
go and execute the window function? It's always after SQL filters the data. All right. So those are the
basic stuff about the SQL window function. So with that we have learned the basics about the window functions in
SQL. And next we're going to start talking about the functions. So the first group is the window aggregate
functions. And here we're going to learn how to summarize our data for a specific group of rows. So let's
go. Okay guys, let's say that in our data we have the following informations. We have the months and the sales. Now if
you apply any aggregate functions in SQL what going to happen SQL going to go through all rows of the window or the
entire data and start aggregating the data. So that means in the result in the output SQL going to give you one single
aggregated value. SQL going to go and summarize all those values and in the output you're going to find for example
here the total sales it's going to be 175 or you can use the average or count the data and so on. So the aggregate
functions going to deliver at the end one aggregated value for a window or for the entire data. Okay. So now let's have
a quick overview of the syntax of all aggregate functions. Most of them follow the same rule. So first as usual we have
to define the function name. And in this example we have the average. Then to the next part we have to define inside it as
well the expression. We cannot leave it empty. So here we are using the sales. And the second rule, for all functions
besides the count: the data type of this field should be a number. And this of course makes sense, right? So we cannot
find the average of the first name of customers or something like that. So we have to pass a number. Then next we
have to define the window. So we have the partition by and it is optional. So you could use it or leave it, it depends. And
then the next one, we have the order by. It is as well optional. It is not a must or required. So you could use it or
leave it. That means the whole definition of the window could be empty for the aggregate functions. Let's have a look
at all functions. So we have the count, sum, average, min, max. And as you can see, only the count accepts all data
types as an expression or argument. All others expect a number as a data type (strictly speaking, MIN and MAX also work on other sortable types such as dates and strings, while SUM and AVG need numbers). And for all functions, the
partition by is optional. The same for order by and frame. So everything is optional over here. So now what we're
going to do with that, we're going to go and deep dive into each of those functions in order to understand how
they work, what are the use cases, and of course, we're going to practice in SQL. So we're going to start with the
first one with the function count. Okay. So what is the count function? It's really simple. It's going
to return the number of rows within each window. So it's going to help you to understand how many rows do you have
within each subset of data. So now let's go and understand how SQL works with this function. All right guys, so now we
have again this very simple example for the orders and we have the following informations. We have the products and
sales and now we want to solve very simple task. How many orders do we have within each products? So in order to
solve it, we can use the function count like the following. So we can say count and then we pass for it an argument or
expression the star. So with that we are telling SQL go and count how many rows do we have in our table. But we have a
window definition like this over partition by products. So now what SQL going to do? We're going to go and
divide the data sets into two partitions. We're going to have one partition for the caps and another one
for the gloves. So with that we have prepared our data into windows and we are ready to do aggregations. So how
many rows do we have within each window? It's going to be three. So for this window it's going to be three rows and
as well for the next window we have as well three rows. So we're going to have three three and three. It's very simple
right guys? We are just finding the number of rows within each window. But now with the aggregate functions we have
to be very careful with the null values. For the count star, as you can see over here, we are not specifying anything
about the sales. So we are just saying find me the number of rows. So that means SQL will just count a row with a null as
one row. So that means if we are using the star as an argument for the function count, the nulls will not affect anything.
So whether we have nulls or not, we are just counting how many rows we have inside our data. But in some scenarios,
we should be ignoring the nulls in our count. For example, let's say that I would like to count how many sales we
have within each product. That means if we have nulls, they should not be counted. So now in order to achieve this task,
what we're going to do, instead of a star over here we're going to have the field sales. So now
with this, we are telling SQL, don't just count blindly how many rows we have within each window. You should be
very careful with the values. Find how many sales we have within each window. So now let's see what's going to
happen. For the first window we have three sales. So we have three values. So the number of rows is correct. But for
the next one, how many sales do we have? We have two. So we have this sale and then the 70. But the last one is null.
So it will not be counted. It will be ignored. That's why we're going to get in the output the value two. We have two
sales. So as you can see the result did change and we are now more sensitive to the null values. So be careful what you
are specifying for the count. If you are using a column name like this it will ignore the nulls. But if you have a star
it just going to go and find how many rows do we have within each partition. Okay. So now if you go and compare the
results side by side you can see that if you specify a column within the count function it's going to be sensitive with
the nulls. So it's going to ignore it and will not use it within the aggregations. That's why we have here
only two rows. But if you go and use the star within the count function, what going to happen? SQL just going to go
and count it. So we're going to find the number of rows that we have inside our table. And there is one more way in
order to do the same thing here on the left side. You can use instead of star you can use one. So you might find it
somewhere that people are using count one and then the same window function and we will get exactly the same result.
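The three variants compared here can be sketched in one query. The table name, column names and most values are assumptions; only the shape (three caps rows, three gloves rows, one of them with a null sale and one with 70) follows the walkthrough.

```sql
-- Hypothetical orders: the last Gloves row has no sales value.
WITH orders AS (
    SELECT *
    FROM (VALUES (1, 'Caps',   20),
                 (2, 'Caps',   10),
                 (3, 'Caps',   5),
                 (4, 'Gloves', 30),
                 (5, 'Gloves', 70),
                 (6, 'Gloves', NULL)) AS v (order_id, product, sales)
)
SELECT
    order_id,
    product,
    sales,
    COUNT(*)     OVER (PARTITION BY product) AS rows_star,   -- counts rows, nulls included
    COUNT(1)     OVER (PARTITION BY product) AS rows_one,    -- same as COUNT(*)
    COUNT(sales) OVER (PARTITION BY product) AS sales_count  -- skips rows where sales is null
FROM orders
ORDER BY product, order_id;
```

For Caps all three columns show 3; for Gloves rows_star and rows_one show 3 while sales_count shows 2.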
So the nulls will be counted and will not be ignored. So now you might ask me which one should I use the one or the
star? Well, I would say it doesn't matter right we are getting the same results and if you are thinking about
the performance, I hardly find any difference between them, so you can go and try both of them and stick with the
one that is giving you better performance. Now we have a special case for the count function: compared to all
other aggregate functions it allows any data type. So that means we can use numbers we can use characters dates and
so on. So that means we can go and specify something like the product for the count instead of sales. So we can go
over here and say product and it's going to go and count how many rows do we have for the product. So it's going to be
three over here. And since here we don't have any nulls, it's going to go and count it like this. So we have three
rows and be careful here. We are not counting the unique rows. We are just counting the rows that we have inside
our data. So this will not be counted as one and this as well would not be one. So we have three times the caps. That's
why we have here three. Okay. Okay. So now we have this very simple example. Find the total number of orders. This is
very simple task in order to find how many rows, how many records do we have inside the table orders. So let's go and
solve it. So let's start by selecting just star from the table orders without anything like this. So as you can see we
have 10 orders. It's very simple. It's very easy as well. But now let's say that you have thousands or millions of
rows. You cannot do it like this by just checking the rows. What you're going to do? We're going to go and use the
function count. So we can go over here and say COUNT(*) and then let's give it a name, total orders. So let's go and
execute it. So now as you can see we got only one record, one value. We don't see any other details. We got the 10 orders.
So this is the total number of orders. This is very helpful in order to understand the content of your data. So
this we call an overall analysis, or let's say having the big numbers about your business. For example, how many
orders do we have? how many customers, products, employees and so on. So having those big numbers going to help us to
track our business to understand how well we are doing with the orders and with the customers and so on. So this is
the basics of reporting. Now let's go and extend our task by saying provide details such as the order ID and the
order dates. So let's go and do that. So select order ID, order dates. And now of course we cannot do it like this. So let
me just execute it. we will get an error because here we have different level of details in our select. So in order to
solve this what we going to do we're going to use the over clause and with that we are telling SQL this is a window
function. So now let's go and execute it. So with that you can see with that we have solved the task we have details
we have the order ID order dates. So this is the highest level of details since we have the order ID and as well
we have the highest level of aggregations. we have the total number of orders in the entire table orders. So
now let's keep going and add more stuff to our task. Let's say that we want to find the total number of orders but for
each customers. So that means this time we have to go and divide our data by the customers. So let's go and do that.
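The query built up to this point, extended with the per-customer count the task asks for, might look like the sketch below. The orders rows, dates and column names are assumptions; the shape (ten orders, three each for customers 1 to 3 and one for customer 4) follows the walkthrough.

```sql
-- Hypothetical orders table with ten rows.
WITH orders AS (
    SELECT *
    FROM (VALUES (1, '2025-01-01', 1), (2, '2025-01-05', 1), (3, '2025-01-10', 1),
                 (4, '2025-01-20', 2), (5, '2025-02-01', 2), (6, '2025-02-05', 2),
                 (7, '2025-02-15', 3), (8, '2025-02-18', 3), (9, '2025-03-10', 3),
                 (10, '2025-03-15', 4)) AS v (order_id, order_date, customer_id)
)
SELECT
    order_id,
    order_date,
    customer_id,
    COUNT(*) OVER ()                         AS total_orders,        -- whole table: 10 on every row
    COUNT(*) OVER (PARTITION BY customer_id) AS orders_by_customers  -- one window per customer
FROM orders
ORDER BY customer_id, order_id;
```

A plain SELECT COUNT(*) next to order_id and order_date would be rejected without a GROUP BY; the empty OVER () keeps the row-level detail and repeats the overall count on each row.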
We're going to use as well a window function. So count star over we have to divide the data using partition by and
we're going to use the field customer ID. So let's call it orders by customers and I would like to see as well the
customer informations in the query. That's why I'm going to go and add it. All right. So that's all. Let's go and
execute it. Now as we learned before that SQL first going to go and divide the data. So that means we have four
customers. We're going to get four windows. The first window going to be for the customer ID number one. And as
you can see we have three rows. That's why we have here three orders. And the same thing for the customer two. We have
three orders. customer three three orders but only the last customer the customer ID number four we have only one
row and one order. So now if you go and look at the total orders and the orders by customers, you can see now we are not
doing the overall analysis, we are doing a comparison between different categories, and of course in this example
the category is the customers and with that we can understand as well the behavior of our customers. So you can
see that we have three customers that has exactly the same amount of orders. So they are very similar but we have one
extreme which is the customer ID number four. This customer has only one order. So this is the only customer that has
different behavior than all other customers. So you see with very simple query we are able now to analyze our
business and understand the behavior of our customers. So if you divide the data by partition by and using count you can
go and now compare stuff together. All right. So now let's keep moving. Next we're going to understand the special
cases that we have with the function count. So now we have this very simple task. It says find the total number of
customers and additionally we have to provide all customers details. So I think it's very easy to solve. What
we're going to do we're going to go and select star since we need all details from customers from sales customers. So
let's just have a look. So we have five customers and the function is count star over and we don't have to divide the
data since we have to find the total number of customers for the entire table and it's going to be total customers. So
nothing new, that's it, we have five customers. And now as we learned before, if you are passing the star to the count
function, what you are telling SQL is: just go and count how many rows we have inside the table customers.
So SQL just going to go and start counting and going to say we have five customers, five rows. So it doesn't
matter whether we have nulls inside our data like in the last name or the score. It's just going to count the number of
rows. So now let's say that we have the following task. It's going to say find the total number of scores for
customers. So what do we need with this task is to find out how many scores inside our data. So as you can see we
have around four scores but the last customer doesn't have any score. So we have it as a null. So the result should
be four. We cannot go now and use the star for it because we're going to get five. We have to go and count the
scores. So let's see how we're going to do that. We're going to count as well. But this time the score and the
definition of the window going to be empty. So total scores and let's go and execute this. So now we can see in the
results we got four scores which is very correct because SQL did ignore the null and SQL now focusing only on one column.
When SQL focuses on those values, the nulls are not counted. This is really great for checking the quality of
your data. So let's say that you are not expecting any nulls inside your data.
Instead of going manually through all the records, you can find the total number of customers like this, and then count
the total number of scores, and you can see there is a difference. So just by checking these numbers I can say, you know
what, we have one null, without checking every record in our data. With that we can check the quality of our data and
understand very quickly how many nulls we have in the field score. And you can do the same, for example, for
the first name. Let me show it to you. So I'm just going to copy this and let's say first name, or let's say country
actually. So I will go with the country and call it total countries. So let's go and execute this.
So now if you check the result you can see we have five rows with countries. SQL is going to focus
on the countries, and it will not find any nulls. So we have complete data here. We don't have any nulls because
the total number of customers is equal to the total number of values within the country. And I can immediately say, okay,
the data quality of the country is very good. All right. So now one more thing about the count function that we have
learned before. We can use either star or one in order to count how many rows we have. So let's just try it. I'm
just going to duplicate it, and instead of having a star, let's have a one. I'm just going to give each a name here:
this one is the version with one and this one is the version with star. So let's go and execute it. Now if you check
the output, we got exactly identical results. So there is no difference between those two queries. It's up to you.
You can try it and check the performance. I usually go with the star instead of one.

Okay. So now we're going to talk about a very important use
case for the SQL window function count that I frequently use in my real projects. The data that we use for data
analysis usually has bad data quality. And if we don't find those data quality issues and clean them before
doing the analysis, what is going to happen? We're going to deliver bad results, bad analysis, which is going to
lead to bad decisions. One very common data quality issue that you might encounter in your projects or in your
data is having duplicates. Duplicates are really bad for data analysis. So in order to discover,
or let's say identify, the duplicates in our data, we can use the SQL window function count. So now let's go
and have some examples. Okay. So now the task says: check whether the table orders contains any duplicate rows. So how are we
going to do that? By looking at the table orders over here, we can see that there are many orders. But how do we find
out the duplicates? Well, the first step is to understand what the primary key of the table orders is. So what we usually
do we go and check the data model if there is one. So for example for this course we have the following data model
and we can see that it is defined that the order ID is the primary key for the orders. The product ID is primary key
for the products. So that means for our table the orders we have the order ID as the primary key and it should be unique.
It should not contain any duplicates. So now let's go to our data and check the order ID. Just by looking at the data
you can see that we don't have any duplicates, right? All of them are unique. So we have 1, 2, 3, 4 and so on.
But of course in real projects you cannot do it like this. You have to build a query in order to find out
whether the primary key is unique. Now you might say the primary keys are usually unique because we can define that
in the DDL, in the rules for building the table. Well, that's true. If you have it like this, then you don't have to look for
any duplicates. But usually in data analysis we export a lot of files and load a lot of data into a separate database, and
we don't build such rules. So in order to check the quality of the primary keys that you get from the
source, we can use the count function. So let's go and build it. I'm just going to select the order ID first as a detail.
And now we're going to do the following: count and then star, and let's define the window. It's going to be
partition by, and here the field is going to be the primary key, so the order ID. I'm now checking the quality of this field.
It should not contain any duplicates. And now we're going to give it a name, check primary key. My
expectation is that the result of this should be at most one. That means we have one row for each primary key, and
that means as well that it is unique. If we get anything more than one, then it means we have duplicates. Let's go and
run the query. As you can see in the results, we get one for each primary key. So that's great. That means we don't
have any duplicates inside our data and the primary key is unique. So the table orders is clean and we don't
have any duplicates inside it. Now let's check our database. We have here another table called orders archive. Let's go
and check the table. First I'm just going to select the data, so select star from orders archive, that is
Sales.OrdersArchive. Let's check the results. Here we can see that we have exactly the same structure as the table orders.
So now let's check whether the data here is clean as well. What we're going to do is use
exactly the same query as before, but instead of using the table orders, we're going to take the orders archive. So
that's it. Let's go and execute it. Now by checking the data, you can see that we don't have one everywhere.
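The primary key check just described can be sketched as follows, together with an outer query that keeps only the keys occurring more than once. This is a rough sketch assuming Python's built-in sqlite3 module (SQLite 3.25 or newer); the table name orders_archive, the alias check_pk and the sample rows are hypothetical and were chosen so that order ID 4 appears twice and order ID 6 three times.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_archive (order_id INTEGER, sales INTEGER)")
# Hypothetical rows: order 4 appears twice and order 6 three times.
conn.executemany(
    "INSERT INTO orders_archive VALUES (?, ?)",
    [(1, 10), (2, 15), (3, 20), (4, 60), (4, 60),
     (5, 25), (6, 50), (6, 50), (6, 50), (7, 30)],
)

# Count the rows sharing each order_id; anything above 1 is a duplicate.
check_query = """
SELECT
    order_id,
    COUNT(*) OVER (PARTITION BY order_id) AS check_pk
FROM orders_archive
"""
checked = conn.execute(check_query).fetchall()

# A window function cannot be used in WHERE, so the filter goes in an outer query.
filter_query = f"SELECT * FROM ({check_query}) AS t WHERE check_pk > 1 ORDER BY order_id"
duplicates = conn.execute(filter_query).fetchall()
print(duplicates)
```

The first query returns one row per original row with its key count; the second one lists only the rows whose key is not unique.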
Sometimes we have two rows for the same primary key, which is really bad. For the order ID four we have
two orders with the same order ID, and for the order ID six we have three orders. That means those rows are
duplicates and they violate our data model. What else we can do is generate a list specifically for
the data quality issue, where we have duplicates. Anything that has one we are not interested in. In order to do
that we're going to use a subquery. So let's say select star from, and then we're going to use the first query as a
subquery, and in our filter we're going to say where the check primary key is higher than one. That means I need
only the order IDs where we have duplicates. So let's go and execute this. Now I have a list with the primary
keys where we have duplicates: the order ID four and the order ID six. So guys, as you can see,
the window count function is wonderful for finding data quality issues like duplicates. All right guys, so
those are the four most important use cases of the SQL window function count. First, we can use it
to do overall analysis. Second, we can use it to do category analysis, like the analysis we have done on customer
behavior. Another use case is to check the nulls inside our data. And the last use case is
to identify or discover duplicates, a data quality issue, in our data. So now let's go and check the next
function. We have the sum.

All right. So now let's understand
what the sum function is. It's very simple. It's going to return the sum of all values within each window. So now
let's go and understand how SQL works with this function. All right. So this is very easy, and we are using the same
simple example. Now we would like to find the total sales for each product. So we can define it like this: sum of sales,
since we are finding the total sales, and then we define the window like this: over, partition by product. As we learned,
SQL is first going to divide our data into two windows, one window for the caps and another window for the
gloves, right? After SQL defines the windows, it starts aggregating the data. So the sum of
sales: for the first window we have three sales, and it's going to simply add up all those
values. So we are adding 20 + 10 + 5 and we get the result 35. In the output we get 35 everywhere. So
that's it for the first window, and as you can see SQL aggregates the data within each window separately.
That means while we are aggregating the data for the caps, SQL will not check anything in the gloves. They are
completely separated. Now it goes to the next window, where we have two values and a null. Again,
the null is just ignored. So what are we going to have? We're going to have 30 + 70, and the total sales for
that window is going to be 100. So as you can see it is very simple, right? So 100, 100, and so on. Guys, that's it. It's really simple.
We don't have a lot of special cases here like with the count function. It's only that it ignores the nulls in the
calculation, and the requirement here is that it allows only numbers. We cannot go and say sum
the products, since the products are not numbers, they are characters. So you can only use numbers for the sum function.
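The caps and gloves walkthrough above can be reproduced with a few rows. This is a minimal sketch assuming Python's built-in sqlite3 module (SQLite 3.25 or newer); the table name sales_demo and its column names are hypothetical, while the values are the ones used in the explanation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_demo (product TEXT, sales INTEGER)")
# Values from the walkthrough: three caps rows, two gloves rows plus one NULL.
conn.executemany(
    "INSERT INTO sales_demo VALUES (?, ?)",
    [("Caps", 20), ("Caps", 10), ("Caps", 5),
     ("Gloves", 30), ("Gloves", 70), ("Gloves", None)],
)

query = """
SELECT
    product,
    sales,
    SUM(sales) OVER (PARTITION BY product) AS total_sales  -- NULL is ignored
FROM sales_demo
"""
rows = conn.execute(query).fetchall()

# Every row of a window carries the same total, so collapse to one per product.
totals = {product: total for product, _, total in rows}
print(totals)
```

All six rows are still returned; the total is simply repeated on every row of its window.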
Let's go now and have some tasks and some use cases in order to practice in SQL. Find the total sales across all
orders, and also find the total sales for each product. Additionally, we have to provide some details like the
order ID and the order date. So let's go and do that. Select order ID, order date, and let's get the sales as well.
Now we have to find the total sales across all orders. That means we're going to use the window function sum of
sales, and the definition of the window is going to be empty since we don't have to divide the data. So that's it, total
sales, and we have to select from the table Sales.Orders. Let's go and execute it. With that, as you can see,
we got all the details that we need and also the total sales, the sum of all those sales in one
field. So with that we have our overall analysis, one big number for our reporting. We know how much sales we
made in the entire business. So now let's go for the next task. It says total sales for each product. I think
you know already what we're going to do. So sum of sales, and we're going to do it like this: partition
by product ID. That's it. We're going to call it sales by product. With that we are dividing the data by the
product. So let's go and execute it. As you can see, we don't have the product information, so let's add the
product ID to the query just in order to analyze the results. We can see from the data that the winner is the product
ID 101. As you can see, it has the highest sales if you compare it with the other products, and the lowest one
is the product ID 105. So we can use the window function sum together with partition
by in order to compare the products, for example to understand the performance
of the products. So it's a really great analysis of performance.

All right. Now we're going to move to a very
interesting use case for the aggregate functions, not only for the sum but for the others as well. It is the
comparison analysis. Okay, so let's understand quickly what the comparison use cases are. The idea is to
compare the current value with an aggregated value. For example, let's say we are currently in the month of March and the sales are 30.
So we're going to compare this value, the current sales, with an aggregated value, for example the total
sales using the sum function. What happens if you compare the current value with the total sales? You are
doing an analysis called part-to-whole analysis, which helps us understand how important the
sales in this month were compared to the total sales. Or we can compare it to the best month, to the highest value.
For example, if the highest value is in June, we can compare this month with the best month of the year, or with
the lowest month of the year. Or we can compare the sales of the current month with the average in order to
understand whether we are above the typical sales or below the average. This is a very important analysis in order to
study and understand the performance of the current data. All right, let's have an example in order to understand the
use case. Find the percentage contribution of each product's sales to the total sales. So let's go and solve
it step by step. What we're going to do is select the order ID, and let's take the
product ID and the sales as well, just like this, from Sales.Orders. So let's go and execute it. Okay. So now as you
can see in the results we got the first part of the equation. We have the sales, so nothing crazy over here. Now
we need the total sales over all the data. So what we're going to do is have the sum of sales, and the definition
is going to be empty. So this is the total sales. Let's go and execute it. Now we have everything for the equation. We
have the sales and the total sales, and that is enough in order to find the percentage of the contribution.
So the calculation for that is going to be very simple. We're going to divide the sales by the total sales. So it's
really simple. Let's go and do that. It's going to be the sales divided by the total sales, so we're going to
copy the whole window function over here. And then we're going to multiply it by 100. So that's it. Let's go and
execute it. Now you notice that in the output we got zeros. This is because of the data type. If we go to our
table over here on the left side, you can see that the sales column in the orders table has the data type
integer. If you divide integers, you will not get a float or decimal number. You have to change the data
type. So what we're going to do is change the data type for one of them. It's enough to do it for the
sales over here. So we're going to use the following expression: CAST(Sales AS FLOAT). I'm just converting the
integer to a float. So that's it. Let me just give it a name: it's going to be percentage of total. Let's go and execute it.
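The integer-division pitfall and the cast that fixes it can be sketched as below. This is a sketch under assumptions: it uses Python's built-in sqlite3 module (SQLite 3.25 or newer), where the float type is spelled REAL instead of FLOAT, and the ten order rows are hypothetical values picked so that the total is 380 and order eight is the largest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
# Hypothetical sample rows (total sales = 380).
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 60), (5, 104, 25),
     (6, 104, 50), (7, 102, 30), (8, 101, 90), (9, 101, 20), (10, 102, 60)],
)

query = """
SELECT
    order_id,
    sales,
    SUM(sales) OVER () AS total_sales,
    -- integer / integer is integer division, so every share becomes 0
    sales / SUM(sales) OVER () * 100 AS pct_integer,
    -- casting one operand to a float type gives the real percentage
    CAST(sales AS REAL) / SUM(sales) OVER () * 100 AS pct_of_total
FROM orders
ORDER BY order_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Only one side of the division needs the cast; the window function itself stays unchanged.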
So now in the output, you can see we got the percentage of the total, or let's say the percentage of contribution. Now
what we're going to do with that is round those numbers, because we have a lot of decimals. In
order to do that, we're going to use the round function like this, and we're going to have two decimals. Let's go
and execute it. Now, as you can see, it is much easier to read because we have only two decimals. And we can see
immediately that order eight is the highest contributor to the total. So this is what we call part-to-whole
analysis, where we find the percentage of the total. It is a very common analysis in order to understand the performance of
each order compared to the total. So this is an example of how the window function is helping us here to compare
the current value with an aggregated value. All right everyone. So that's all for the window function sum. Next we're
going to talk about the average function.

All right. So now let's understand what
is an average function. As the name says, it's going to find the average of the values within each window. So now let's
go and understand how SQL works with the average. All right. So now back to our very simple example, and the task says
find the average sales for each product. It's really easy. We're going to use the average, then pass to it the column
sales, and we define the window like this: partition by product. The first thing that SQL is going to do is define
the window. It's going to divide our data into two partitions, one for the caps and one for the gloves. And now I
hope that everyone knows how to calculate the average. As you know, it's going to add up all
the values and divide the sum by the number of rows. So it's going to add up 20 + 10 + 5 and divide by
three rows, and the output is going to be 11 (11.67 before the decimals are dropped). We're going to get it for each row.
As you can see, SQL just ignored everything in the next window. We are focusing only on the caps. Now it's
going to go to the second window and start doing the same aggregation. But here we have the special case of null.
The null is going to be ignored in the calculation, and we're going to have it like this. It's going to say, you
know what, 30 + 70, and we are including only two rows. So it's going to be divided by two, and the average
is going to be 50. We will get the result 50 for each row,
and we are completely ignoring the nulls. But now you might be in a scenario where your users understand the business
like this: if we find a null in the sales, it means a zero. So there are no sales and it is actually a zero, but we
store it in the database as a null. That means the average that we have provided is not really correct. We have
to divide by three. So first we have to handle the nulls before doing the aggregation, before finding the
average. Now, we're going to have a whole chapter on how to handle nulls in SQL and what the different functions are. But
for now we're going to go with the function COALESCE. Okay. So what we're going to do is not use the
sales as it is. First we're going to handle the nulls. That means we're going to use COALESCE on the sales and
replace the nulls with zeros. As you can see, we are not using the sales immediately; we are handling it first and then we're
going to find the average. SQL is going to go over here, and if it finds any null it's going to replace it with zero, and
that's then going to have an effect on our average over here. So it's going to be 30 + 70, but now plus 0. And now we
have three rows. So instead of dividing by two, it's going to divide by three, and the result is going to
be like this: 33. That means we're going to have 33 in the output for each row, and with that we are now fulfilling
the expectation from the business: if you have a null, it's going to be handled as zero, and the result is going to be more
accurate. You see, right, it is very tricky. If you are doing data analysis and aggregations, be very careful with
the nulls. Understand them, understand what they mean for the business, and handle them correctly in order to get correct
results in your analysis. So now let's go back in order to practice SQL using some tasks and use cases. Okay, so let's
start with the basics. We have the following task: find the average sales across all orders, and also find the
average sales for each product. And don't forget the details. So now let's go and solve it step by step. Select
order ID, order date, and let's get the sales as well. And let's find the average sales. It's going to be a
window function, and we have the sales inside it, the usual stuff. The window is going to be empty. Average sales
is what we're going to call it, and the table is going to be Sales.Orders. So that's it. Let's go and execute it. Oh, we have to select
everything of course. So what SQL did in the output is add up all those values and then divide the sum by
10. With that we have the average sales of 38. Very easy. This is again what we call an overall analysis. Let's
move to the next one: find the average sales for each product. Again we're going to build the window
function like this: average of sales, over, and we're going to divide the data by product ID. We're going to call it average
sales by product, and we're going to add the product ID to the query. So that's it. Let's go and execute. And we
missed something here, the partition by, so I'm going to execute again. With that we have the following data.
Now SQL is going to divide the data. For example, for this product we have those four orders. What is going to
happen is that SQL adds up the four values and then divides the sum by four. That's why we have
35 here. The same thing for the next product: it's going to divide by three. And the last one is just going to divide
by one. That's why we have 60. As you can see, the aggregation is done separately for each window, and this is
also a very nice way to compare the averages between the different products. Okay. So now let's
have an example in order to learn how to deal with the nulls. Let's say that we have the following task: find the
average score of the customers and show additional information as well, like the customer ID and the last name. So let's
go and solve this. We are now targeting the table customers, so let's just select it first,
like this. And now let's include the customer ID and the last name, and let's have the score as well. But this
time we're going to find the average score. So it's going to be the average of the score, and since we don't
partition the data, we're going to leave the window definition empty like this and call it average score. So that's it.
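The average score query, and the COALESCE handling discussed earlier for the case where a null means zero, can be sketched side by side. This is a minimal sketch assuming Python's built-in sqlite3 module (SQLite 3.25 or newer); the table, the column names and the five sample customers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, last_name TEXT, score INTEGER)")
# Hypothetical sample data: four scores and one NULL.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alpha", 350), (2, "Beta", 900), (3, "Gamma", 750), (4, "Delta", 500), (5, "Omega", None)],
)

query = """
SELECT
    customer_id,
    last_name,
    score,
    AVG(score) OVER ()              AS avg_score,                -- NULL ignored: sum / 4
    COALESCE(score, 0)              AS customer_score,           -- NULL replaced by 0
    AVG(COALESCE(score, 0)) OVER () AS avg_score_without_nulls   -- sum / 5
FROM customers
ORDER BY customer_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Which of the two averages is right depends on what a null means for the business, so the second column is only correct if a missing score really stands for zero.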
Let's go and execute it. So now as you can see, we have the average score of 625. SQL is going to add up
the four values and divide by four. But here we have a null. So now we have to understand the business, or ask,
what the null means in the scores of the customers. Is it zero or is it something empty? If it's zero, then the
average that we have is wrong, because the sum should be divided by five and not four. So let's say it's zero. That means we
have to handle the nulls. What we're going to do now is use the function COALESCE. So COALESCE
on the score, replacing the null with zero, and here we have the customer score. Let's go and execute this. Now
as you can see, if there is a value it's going to be exactly the same value, and only if we have a null it's going to be
replaced with zero. So now let's correct the average. I'm just going to do it like this: let's copy
the whole thing, but now instead of using the score we're going to use the score with the nulls handled. So I'm
just going to replace it like this, and call it average score without nulls. Let's go and execute it. Now as you can see we
are getting a more valid result in the output compared to the previous one. And this applies only if the null
means zero. So guys, as you see, be very careful with the nulls, especially if you are doing aggregations, and handle them
correctly before doing any aggregation like the average. All right. Moving on to the last use case. We have the
comparison analysis, and the task says: find all orders where the sales are higher than the average sales across all
orders. That means we have to compare the current sales with an aggregated value, this time the
average of the sales. So now let's do it step by step. What we're going to do is select, of
course, the order ID. What else do we need? Let's take the product ID, and we need the current sales, so it's going to be
the sales as it is, and that's it for now. So from Sales.Orders. That's it. Let's go and execute it. Now by
checking the result, you can see that we got the first part of the equation, right? We have the sales for each order.
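One way to sketch the complete task stated above, keeping only the orders whose sales exceed the overall average, is to compute the average with a window function in an inner query and filter in an outer query, because a window function cannot appear in a WHERE clause. This is a sketch assuming Python's built-in sqlite3 module (SQLite 3.25 or newer); the table, the alias avg_sales and the ten sample rows (average 38) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
# Hypothetical sample rows (average sales = 38).
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 60), (5, 104, 25),
     (6, 104, 50), (7, 102, 30), (8, 101, 90), (9, 101, 20), (10, 102, 60)],
)

query = """
SELECT *
FROM (
    -- inner query: every order next to the overall average
    SELECT
        order_id,
        product_id,
        sales,
        AVG(sales) OVER () AS avg_sales
    FROM orders
) AS t
WHERE sales > avg_sales   -- the filter has to live outside the subquery
ORDER BY order_id
"""
above_average = conn.execute(query).fetchall()
print(above_average)
```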
Now we need the second part, the average sales across all orders. In order to do that, we're going to use the
window function average of sales, and we're going to use over with an empty definition, since across all
orders means we don't partition. Let's give it a name, average sales, and let's go and execute it. So
now in the output we got the average sales. It's going to be 38. Now we
need all the orders that are higher than the average. As you can see, for
example, order one is not higher, but order four is higher than the average. In order to filter the data
we cannot use the window function in the WHERE clause, right? So what we're going to do, sadly, is use
a subquery. It's going to be like this: select star from, and then we're going to define the condition outside
the subquery. It's going to be where the sales is higher than the average sales. So that's it. Let's go and
execute it. And now, as you can see, it's very simple. We got all the orders that are higher than the average, right? So
you can see all those sales are higher than the average. It would be nice if we could do all of that in the first
query, but since we cannot, we need to use subqueries in order to filter the data afterward. With that we
can understand the importance of the comparison analysis. For example, here we are evaluating the data,
whether the values are above the average or below the average. This is very important in business analysis. All
right, everyone. So that's all for the window function average. Next, we're going to talk about two very interesting
functions, the min and max.

All right guys, so what are the min and max functions? They are very simple, yet
very powerful functions for analytics. The min is simply the function that returns the minimum, or let's say the
lowest, value within a window, whereas the max is exactly the opposite. It's going to find the maximum value, or the
highest value, within a window. So now let's go and understand how SQL works with these functions. All right. So now
we have the same data and we have two tasks. First we have to find the lowest sales for each product. And second,
side by side, we would like to find the highest sales for each product. So we're going to use min and max.
As you can see, the syntax is very simple: min of the sales, and then the partition is going to be by the product.
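The two tasks, lowest and highest sales per product side by side, can be sketched on the same caps and gloves rows. This is a minimal sketch assuming Python's built-in sqlite3 module (SQLite 3.25 or newer); the table name sales_demo and the column aliases are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_demo (product TEXT, sales INTEGER)")
# Same small example: the gloves window contains one NULL.
conn.executemany(
    "INSERT INTO sales_demo VALUES (?, ?)",
    [("Caps", 20), ("Caps", 10), ("Caps", 5),
     ("Gloves", 30), ("Gloves", 70), ("Gloves", None)],
)

query = """
SELECT
    product,
    sales,
    MIN(sales) OVER (PARTITION BY product) AS lowest_sales,   -- NULL is ignored
    MAX(sales) OVER (PARTITION BY product) AS highest_sales
FROM sales_demo
"""
rows = conn.execute(query).fetchall()

# Each window repeats its own min and max on every row.
ranges = {product: (low, high) for product, _, low, high in rows}
print(ranges)
```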
And here as well the same stuff, but with the max. Okay. So now let's see how SQL is going to execute the first query. As
usual, first it's going to prepare the data. It's going to split the data into two windows, one for the caps and
another one for the gloves. After that it's going to search for the lowest sales within each window separately. So
for the first window we have the following values: 20, 10, and 5. Of course the lowest value is going to be the
five. That's why SQL is going to find it over here, and everywhere for this window it's going to be the value five.
So we have it as the lowest sales for the product caps. Now it's going to jump to the next window, for the gloves,
and start searching the values. As you can see, we have 30, 70, and null. Null will be ignored, so null will not be
considered the lowest value. SQL is going to find the lowest sales, the 30. It's actually the
first row within this window, and the output is going to be 30 for each row. So that's it. It's very simple,
right? Now let's move to the next one. We have the same stuff but using max. The data is partitioned, and for the first
partition, what is the highest value? It's going to be the first row, right? The 20. SQL is going to find it, and in
the output we will get the highest sales, 20, for this window. Then it's going to go to the second window and search
for the highest value. Here we have two values, 30 and 70, and it's going to be the 70, right? So it's going to point
to it over here, and in the output we will get 70 everywhere. So guys, it's really simple, right? Now let's go back to our
scenario from the average, where in our business we understand nulls as zero in the sales. That means first we have
to handle the nulls and replace them with zero, and then we're going to search for the value. So what's going to
happen? We're going to replace nulls with zero. For the max, nothing is going to change: the highest value is going
to be 70 and we're going to get the same output. But for the min we now have a new lowest value. It's no longer the
30, it's actually the zero. So SQL is going to go over here and replace the 30 with zero, and zero is the lowest sales for
the product gloves. So again, guys, the nulls are very tricky, and those functions are really sensitive to
nulls. Understand what the nulls mean and handle them correctly so that you get correct results in the output. So that's
it. Let's go back to SQL to have some tasks and use cases in order to practice SQL. All right everyone, let's start
with the basic stuff. Find the highest and lowest sales of all orders, and also find the highest and lowest sales
for each product, and we have to provide additional information. So let's solve it. Select order ID, order date, and
let's take the product ID as well. Now let's find the highest sales of all orders. It's going to be the max function
for the sales, and the window definition is going to be empty, since it is over all orders. So here we have the highest sales. Let's go
for the lowest sales of all orders. It's going to be exactly the opposite: the min function for sales, over, and then we
have the lowest sales. I'm just going to make it capital. Then let's select from the table Sales.Orders. I
think that's it. Let's have the sales as well, actually. All right. So now let's go and execute it. This is very
simple, right? These are all the sales. What is the highest sales? We have the 90 of order eight. So, as you can
see, we now have the highest sales, the 90, and the lowest sales is the 10. The first order is the lowest. So it's very
easy. Now we're going to repeat the same stuff per product. So we have to partition the data by the
product ID. What I'm going to do is just copy and paste stuff around. The first one is going to
be partitioned by the product ID, so highest sales by product. And the next one is going to be the same stuff, copy
and paste, by the product. So that's it. Let's go and execute it. Now again the data is going to be partitioned,
divided by the product. For the first window, what is the highest sales? It's going to be the 90, and the lowest sales
is going to be the 10. So it's exactly like the overall, right? Now let's go to the second window over here. We can see
that the highest sales is the 60, the first one, and the lowest this time is 15. This is great in order
to see that SQL executes each of those functions for each window separately. Let's go to the last
window. It's a funny one. The sales is 60 and we have only one row, so it's going to be the highest and also the
lowest sales. With that, as you can see, we can define a range for each product, and the range differs from
one product to another. For example, for this product 101 the range is from 10 to 90, but for the second
product we have it between 15 and 60. Okay guys, let's move to the next one, which is one of my favorites in the
window functions, where we filter the data using the MIN and MAX functions. Let's have the following task. It says: show the
employees who have the highest salaries. This sounds very simple, but we can use the help of window functions in
order to solve it. So now we are working with the table employees. Let's just select the data: SELECT * FROM Sales.Employees.
So that's it. Let's go and execute it. Now we have five employees and we have those different
salaries. Let's go and find the highest salary: MAX of salary, and let's use the window function with OVER, but we don't
partition the data at all. We're going to call it highest salary. So let's go and execute it. And now by checking
the results we got a new column called highest salary, and inside it we have the 90k. If you check those five salaries
you can see that the highest is from the employee Michael. But still the task is not solved. We have to show only the
employees who have the highest salaries, so we have to somehow filter the data and only show this employee. In order
to do that we have to use a subquery, since we cannot use a window function in the WHERE clause. So what we're going
to do is SELECT * FROM, and then our first query is going to be the inner query. Then we add the following condition in the outer WHERE clause:
the salary should be equal to the highest salary. So it's very simple. With
that we are comparing the salaries with the highest salary. If there is a match, the row is going to be presented. So
let's go and execute that. And that's it. As you can see we got the employee with the highest salary. And if there
are multiple employees with the same salary of 90k, of course we're going to get all of them in the results. I think
Michael is going to need a new job. Right. This is the worst. So this is another use case for the
MIN and MAX window functions. All right. So now we come to the use case of comparison analysis, where we want to
compare the current sales with the highest and the lowest value. We have the following task. It says: find the
deviation of each sale from the minimum and the maximum sales amounts. As you can see, this is our sales, this is
the highest, and this is the lowest. So now we just have to subtract the values from each other in order to get
the deviation. It's very simple. Let's get the first deviation, where we subtract the lowest value from the
sales. It's going to be like this. What we are doing over here is subtracting the lowest sales of all records
from the sales of the current row. We're going to call it deviation from min. So let's go and
execute it. Now we can see from those values how far the current value is from the extreme. The extreme here is the
lowest value. This is a really great way to analyze the extremes in your data. The nearer we are to the
extreme, the lower the value is going to be. As you can see, here we have a zero. This is the lowest because this row is exactly
the extreme. So actually this is our value, the 10. The next one, which is 15, is a little bit away from the extreme,
so we have a five here. This is not far away from our extreme value. And then if you check
this value over here, we have 80. So the distance is very far from our extreme value, the lowest sales. This
is a really nice analysis for evaluating the sales in your data. Now of course we can go and
evaluate our data against another extreme, which is the highest sales. In order to do that, we first
take the highest sales and subtract the sales from it. So this is the deviation from
the max. Let's go and execute it. Now we can see in the output that we get exactly the opposite distances.
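As a rough sketch of the two MIN/MAX use cases above, the snippet below runs both queries against an in-memory SQLite database from Python (SQLite 3.25 or newer is needed for window functions). The table names, column names, and most of the sample rows are assumptions made for illustration; only Michael's 90k salary and the sales values 10, 15, 60, and 90 come from the walkthrough, and the course's own database is a different engine.

```python
import sqlite3

# In-memory database with hypothetical sample data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Employees (EmployeeID INTEGER, FirstName TEXT, Salary INTEGER);
INSERT INTO Employees VALUES
  (1, 'Frank', 55000), (2, 'Kevin', 65000), (3, 'Mary', 75000),
  (4, 'Michael', 90000), (5, 'Carol', 55000);

CREATE TABLE Orders (OrderID INTEGER, Sales INTEGER);
INSERT INTO Orders VALUES (1, 10), (2, 15), (3, 20), (4, 60), (8, 90);
""")

# Use case 1: a window function is not allowed in WHERE, so it is
# computed in an inner query and filtered in the outer query.
top_earners = con.execute("""
SELECT FirstName, Salary
FROM (
    SELECT *, MAX(Salary) OVER () AS HighestSalary
    FROM Employees
) AS t
WHERE Salary = HighestSalary
""").fetchall()

# Use case 2: distance of every row from the lowest and highest sales.
deviations = con.execute("""
SELECT OrderID, Sales,
       Sales - MIN(Sales) OVER () AS DeviationFromMin,
       MAX(Sales) OVER () - Sales AS DeviationFromMax
FROM Orders
ORDER BY OrderID
""").fetchall()

print(top_earners)
for row in deviations:
    print(row)
```

With this sample data, the row that sits on the lowest extreme has a deviation from min of 0, and the row that sits on the highest extreme has a deviation from max of 0, which mirrors the "opposite distances" described above.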
So order number one is the farthest from the extreme; as you can see, we have the value of 80. And order eight
is identical to the extreme, so that's why we have the distance of zero. So now we can see as well very quickly which data
points are the nearest to the extreme, to the highest sales. So as you can see, guys, using the window functions MIN and
MAX is very powerful in order to understand and evaluate your data points against the extremes.

All right everyone, so now we're going to focus on a very important
use case. One of the must-know use cases for data aggregations is doing the running total and the rolling total. These two
concepts are very important for data analysis and reporting. The key use case for those
two concepts is tracking. For example, we can go and track the current total sales against the target sales in our
business. And as well, it's great for doing historical analysis of trends. Okay. So now the question is:
what are the running and rolling totals? They are basically very similar. They aggregate a sequence of
members, and the aggregation gets updated each time we add a new member to the sequence. A sequence could be like a
time sequence. That's why we call this type analysis over time. So now we still have the question: what is the
difference between the running and the rolling totals? The running total is going to aggregate everything from the
beginning until the current data point without dropping off any old data. The rolling total, on the other hand,
focuses on a specific time window, like the last 30 days or the last two months, and each time we add a
new member or a new data point to the window we drop off the oldest data point in the window. With
this we get the effect of a rolling or, let's say, shifting window. Okay, I totally understand if this might
be complicated. Now let's go and have a very simple example in order to understand this concept and as well how
we can solve it using SQL. All right guys, so now we have a very simple example. We have the months and sales, and we have it
twice because I want to show you side by side how SQL works with the running total and the rolling total. So what
is the task? On the left side, we want to find the running total of sales for each month, and on the right side we would
like to find the three-month rolling total of sales for each month. They sound very similar, but on the right side
we have a fixed window. So how can we solve this using SQL? On the left side we can use SUM of sales, since we want
to aggregate all the sales using the SUM function. And the definition of the window is going to be like this: ORDER
BY month. Of course you can use other functions; for example, you can have an average here. If you use an average with
ORDER BY you will get the running average, and likewise the running max or the running count and so on. So that means
whenever you mix an aggregate function with an ORDER BY you generate the effect of a running
total. Now on the right side we can have the same stuff. So we can have an aggregate function together with ORDER
BY: SUM of sales, ORDER BY month. So far we have everything like the left side, right? But now you might ask why
this generates the effect of the running total. We didn't specify any crazy stuff here, right? It's
all about the definition of the frame clause. Do you remember? If you use an ORDER BY and you don't specify a
frame clause, you get a hidden or, let's say, default frame clause. Strictly speaking it is RANGE BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW, which gives the same result as ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
as long as the ORDER BY values are unique, as the months are here. And
what was the definition of the running total? It aggregates
all the data from the very beginning, the unbounded preceding, until the current position, the current
row, without dropping off any old members. So that means the definition of the running total is exactly the
definition of the default frame clause. That's why it generates the effect of the running total. Now
let's go to the right side, the rolling total. Here again we have the same stuff, right? We aggregate
the data using the SUM function and we sort the data with ORDER BY month. So with that we are as
well generating the effect of a running total, as happens each time you use ORDER BY with an aggregate function. But in the
rolling total we always want to specify a frame; here in this example, three months. That means if we get
a new month, we don't want to keep the oldest months. We always want a fixed window. Now in order to have this
fixed window effect we have to redefine the frame clause, because if you leave it as the default, like the running
total, the frame is going to keep extending. You will see this effect in the example. So now we define it like this: ROWS
BETWEEN 2 PRECEDING AND CURRENT ROW. So the total number of rows included in each window is going to be a
maximum of three months. Now I know you might be saying: what are you talking about? Maybe you didn't get anything.
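The two window definitions just described can be sketched side by side. This is a minimal illustration run through Python's built-in sqlite3 module rather than the course's own database; the table name MonthlySales and its column names are hypothetical, and the four monthly values (20, 10, 30, 5) are the first four months of the example that follows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE MonthlySales (MonthNo INTEGER, MonthName TEXT, Sales INTEGER);
INSERT INTO MonthlySales VALUES
  (1, 'Jan', 20), (2, 'Feb', 10), (3, 'Mar', 30), (4, 'Apr', 5);
""")

totals = con.execute("""
SELECT MonthName, Sales,
       -- Running total: default frame, everything up to the current row.
       SUM(Sales) OVER (ORDER BY MonthNo) AS RunningTotal,
       -- Rolling total: fixed frame of at most three rows.
       SUM(Sales) OVER (ORDER BY MonthNo
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS RollingTotal
FROM MonthlySales
ORDER BY MonthNo
""").fetchall()

for row in totals:
    print(row)
# ('Jan', 20, 20, 20)
# ('Feb', 10, 30, 30)
# ('Mar', 30, 60, 60)
# ('Apr', 5, 65, 45)
```

The two columns agree for the first three months and diverge in April, when the rolling frame drops January while the running frame keeps it.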
It's totally normal; you will understand it only with an example. So in order to do this let's start with the left side.
First SQL is going to sort the data, so everything is sorted from the earliest month until the latest one. So from
January until July everything is good. And now it's going to start working with the frame. The frame
says unbounded preceding. This is going to be static; it's always going to be pointing to January. This is the
unbounded preceding, the first row in the data set. And now of course we are starting from top to bottom. The current
row is going to be pointing to January as well. So the frame is going to look like this: it's only one row, and
the total sales of this row is 20. That's why we're going to have 20 in the output. So now let's move to the
right side. The current row is going to be January as well. And what is the two preceding? We don't have it yet, so
it's going to be pointing maybe somewhere here before the table. So again, what is the frame? It's going to
be one row as well. So in the output we get exactly the same result, 20. So far there are no differences between
the running total and the rolling total. But let's keep going. Now we go to the next row over here. So what
happens to our frame? It's going to extend, right? We now have two months in this frame. And
what is the total sales over here? It's going to be 30. So we added a new member. You can calculate it like this.
Either calculate all the sales within the frame, or say this is the previous aggregated value
plus the new member. The previous one was 20, the new member is 10, so we get 30. Both of them are correct. So now
let's move to the right side. What's going to happen? We're at February as well. The two preceding is
still pointing somewhere outside. And here the window is going to extend like this. We have two months and
the same aggregation is going to happen, so we have 30. So far nothing crazy, right? Let's go to the next month, March.
The frame is going to be extended, so we now have three months. And the aggregation is going to be 60, either by summing the frame
or as 30 + 30. We get the running total of 60. And now on the right side, what is going to happen? We
point to March as well. And this time the two preceding is pointing to January. This is the first time
we are getting the whole fixed frame, right? We have three months in this frame. So what is the total of
that? It's going to be 60. Okay. So now you say, okay, we're still getting the same results, there's no difference. I'm
going to say: wait for it. It's going to be the next one. As we go to April, the effect here is that the frame
gets extended to four months, because we always start from the first month until the current month without dropping
any member. So what is the total of this? It's going to be 65. So now on the right side,
what is going to happen? We add a new member, April, but we are at the maximum size of the window:
we can have only three. That's why the two preceding shifts down as well, so the boundary is going to
be from February until April, and with that we are dropping off January. Now you can see the effect: it is
sliding, rolling, or shifting from top to bottom, and that's because the boundaries are shifting as well. So you can
see now the effect of the rolling total: the newest member is added and the oldest member is dropped, because we are
allowed to have only three months. So what is the total of this? It's going to be 45. This time we are not
adding the five to the previous value of 60. We are aggregating the values within the window. So now let's
keep going. Now we are at June. What happens on the left side? The frame gets bigger, and with that we get
the result of 135. So the frame is getting really big. But on the right side we have a fixed frame, so
we are just sliding, shifting, and rolling. With that we are adding a new member while another member, the
oldest one, is leaving. And the total over here is going to be 105. And now we go to the last row. We have
everything for the running total, so the whole data set is aggregated. This is the maximum that
we're going to get. It's going to be around 175. But on the right side it just keeps shifting until we
reach the last record; the frame is shifting as well, like this. So the total of this is going to be
105. Okay guys, so you see it's very simple. The running total always considers everything from the starting
position until the current row without dropping any member. The rolling total always drops the oldest member in
order to add something new, and the window keeps shifting. The running total is great for
tracking, for example budget tracking, or tracking the current total sales against a target or
something like that. So we are always considering the whole data set. But with the rolling total we always do a
focused analysis; here we are always interested in a window of three months. So they might sound very similar, but
they have completely different scopes of analysis, though both of them are doing aggregations over time. They're going
to help us do analysis over time, like checking whether our business is growing or declining over time. So guys, as you
can see, using very simple SQL with the window functions we can do really great analysis on our data. This stuff is
really fundamental for data analysis or reporting for our business. So window functions are really powerful for
data analytics. Okay. So now we have the following task, and it says: calculate
the moving average of sales for each product over time. So now we have here something called a moving average. It
is very similar to the running total. In the running total we used COUNT and SUM and so on. But here we're going to
use the AVG function, and instead of calling it a running average we call it a moving average. So let's go and solve
the task. Let's start as always by selecting the usual stuff. So let's get the order ID, let's get the product ID,
and I would say since it's over time I will get the order date as well, and the last one is the sales, from our table
Sales.Orders. So that's it. Let's go and execute it. So now we got our 10 orders with the products, order date, and sales.
Let's start building our window function step by step. So which function do we need? We need the average. This is the
easiest one. It says moving average. So that's it. We need the sales, so it's going to be the average of sales. Let's
go and define the window. Do we have to divide the data, partition the data? Well, yes. It says for each
product, which means we're going to use the PARTITION BY clause on the product ID. So now I would say that's it
for the first step: average by product. So let's go and execute it. Now if you check the result, you can see
that we got our windows. The first one is for product 101, and the total average of the sales is 35. So
we have aggregated one value for each window. The same thing for the next product, and the next, and so on. So
we don't have any progress over time or something like a moving average. Right? We don't have this effect.
We have just one average for each window. So now in order to have the effect of the moving average, it's going
to be like the running total: we have to use the aggregate function together with ORDER BY. I'm just going to make
it a new column. I'm going to copy everything like here. And now what are we going to add? ORDER BY. And since it's
over time, we're going to use the order date. And we're going to have it ascending because it's
over time. Over time always starts with the earliest dates and ends with the latest dates. So from the lowest to the
highest, we're going to leave it like this. Let's call it moving average. So now let's go and execute it. And we
got an extra comma here because of the copy and paste. So let's execute it again. All right. So now let's check the
results. Let's take the first window over here. You can see that in the moving average we have progress. So
it goes 10, 15, 40, 35. So there is a moving average. We don't have one solid number for the average. We have
different values. So how does SQL solve this? It's really simple. It goes row by row. For the first
row, what is the average of 10? It's going to be 10. Then moving on to the next one, it's going to be 10 + 20
divided by 2, so you get 15. Moving to the third one, all those three values are summed and divided by
three, so you get 40. And now for the last row in the window, it sums all those four values and
divides by four, so you get 35. And this is exactly the same value as in the previous column. There you have the
average by product, where we don't have ORDER BY, and you got 35 as well, exactly like this last row. That's because we have the
same calculation: it sums all those four values and divides by four. But now the next value is interesting.
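The two columns compared above (one average per product versus a moving average) might be sketched as follows. This runs through Python's sqlite3 module as a stand-in for the course's database. The sales values for product 101 (10, 20, 90, 20) follow the walkthrough; the order dates, the Orders table layout, and the second and third values for product 102 are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Orders (OrderID INTEGER, ProductID INTEGER, OrderDate TEXT, Sales INTEGER);
INSERT INTO Orders VALUES
  (1, 101, '2025-01-01', 10), (3, 101, '2025-01-10', 20),
  (8, 101, '2025-02-18', 90), (9, 101, '2025-03-10', 20),
  (2, 102, '2025-01-05', 15), (7, 102, '2025-02-15', 30),
  (10, 102, '2025-03-15', 60);
""")

rows = con.execute("""
SELECT OrderID, ProductID, OrderDate, Sales,
       -- One average per product window (no ORDER BY).
       AVG(Sales) OVER (PARTITION BY ProductID) AS AvgByProduct,
       -- Moving average: same window, ordered by date.
       AVG(Sales) OVER (PARTITION BY ProductID ORDER BY OrderDate) AS MovingAvg
FROM Orders
ORDER BY ProductID, OrderDate
""").fetchall()

moving_101 = [r[5] for r in rows if r[1] == 101]
moving_102 = [r[5] for r in rows if r[1] == 102]
print(moving_101)  # [10.0, 15.0, 40.0, 35.0]
print(moving_102)  # [15.0, 22.5, 35.0]
```

The moving average restarts at the first row of product 102, because each partition is calculated on its own.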
So as you can see, the next value comes from another window. You see here we have 15 for product 102, and
the average is 15 as well. So SQL is not considering the old values from the other window. SQL is going to
calculate each window separately. So again, this is the first value of this window, 15, and the average is 15. Then
it's the same stuff, right? Summing those values, dividing by two, and so on. In data analysis, this
last field over here is what we call a moving average, and you can implement it very
simply using the AVG function together with ORDER BY. All right,
let's move to the next task, and it says: calculate the moving average of sales
for each product over time, including only the next order. As you can see, the first part we have already done,
right? We have the moving average and the PARTITION BY for the products. But here we have more specifications. It
says including only the next order. That means we are talking about the current order and the next order. So
here we have a fixed frame or fixed window. We don't need the whole average of the window. We need a
maximum of two orders in each calculation. So how are we going to do that? We can have our custom frame clause inside our window
function. That means we cannot leave it as the default. We have to specify it. So let's go and do that. I will just
copy the old definition of the window because we have the exact same stuff. So we have AVG of sales OVER, PARTITION BY
product ID, ORDER BY order date. This is the first part. Now we would like to have this fixed window, so we're going to
define our frame clause. I'm just going to zoom out a little bit. It's going to be ROWS BETWEEN, so we have now
the boundaries of the frame. It says including the next order, so we're going to use FOLLOWING. The
first boundary is going to be the current row. And since it's the next order, the second is
going to be 1 FOLLOWING. So that is
our frame including only the next order: ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING. Let's call it rolling average. So
that's it. Let's go and execute. Now let's check the result. You can see the moving average has completely
different values from the rolling average. So let's go and understand why. We're going to do it row by row. Let's take
the first row over here. The sales here is 10 and the rolling average is 15. Why is that? Because in the
calculation we are considering the next value. So 10 + 20 divided by 2 gives 15. That means SQL defined
the frame as those two rows for the calculation of the first row. Now moving on to the second row. SQL is
going to include the third one as well, right, the next one. But since the window is only two orders, it's going to
drop the first row. So the next frame is going to be like this. And as you can see it's going to be 20 + 90 divided by
2, so you get 55. So now you can see the effect of the rolling average, right? For the next one it's going to
be exactly the same. We are at the third row. It's going to include the
next one, and we're going to get the same
value because 90 + 20 divided by two is 55. Now the last row in the window over here is interesting. It will not
consider a next value, because there is no next row inside this window. So the sales is 20 and the average stays at
20 as well. So that's it. All right guys. So with that we have learned about the moving average, the rolling average, and
those amazing concepts using the window functions. All right. Now we're going to have a quick overview of the different
use cases of the aggregate functions and how the definition of the window changes the whole use case. The
first use case is finding the overall total. Here, if you don't define anything in the window, if you leave it
empty, what is going to happen? You are doing an overall analysis. You
aggregate the whole data set
and then provide this aggregation for each row. So this is what happens if you leave it empty. You don't define
anything, so you are aggregating the whole data set. Now moving to the next step, we can do an analysis called total per
group. What we're going to do is add PARTITION BY to the definition of the window. By adding, for example,
PARTITION BY product here, what happens? The data is going to be split into two categories or two groups, and
the aggregation is going to be done for each window separately. This is of course a great analysis in order to
compare different products, like here the caps and gloves. So this is helpful in order to compare categories. You
can do this total per group analysis if you use PARTITION BY. Now if you use ORDER BY, you're going to
land in the third use case. As we learned, we will be doing a running total. As you can see here in the output, we
are building a cumulative value for the sales, and this is going to help us do progress-over-time analysis in
order to understand the performance of our business. And now moving on to the last use case, the final phase of the
window function with the aggregation. Here you have the aggregate function together with ORDER BY and a
customized fixed window. Of course we can use it to help us track progress over time in a specific
fixed window. And in all those use cases you will get the same effect if you use the other functions,
not only SUM: you can use AVG, COUNT, MAX, so all aggregate functions. So guys, as you can see, the window function
in SQL is very important for data analytics. Just by changing part of the window definition you are
generating a whole new use case for data analytics. All right friends, so now let's do a quick recap of the window
aggregate functions. So what do they do? They aggregate a set of values and return a single aggregated
value for each row. So it's very similar to GROUP BY, but here we don't lose details. Now to the next point: what are
the rules for the syntax? About the expressions, they all expect a number in the expression. So you have to pass a
number like sales or any integer, and only for COUNT can you use any data type. And the rules for the
window definition are very simple. Everything is optional inside the definition of the OVER clause, the
definition of the window. So you can use PARTITION BY, ORDER BY, and frames, or not, or just leave everything empty. So
everything is optional. Now, as we learned, we have a lot of use cases for the aggregate functions and they are
really amazing for analytics. The first one, the simplest one: you can do an overall analysis if you just leave the
window definition empty. So you will get one big number about your business. In the next use case we can do total per
group analysis. As you've learned, we can use PARTITION BY in order to compare categories with each other, like
comparing the products or customers and so on. Moving on to the next one, we can do part-to-whole analysis. We can
compare the performance of each data point with the overall. So you can, for example, compare the sales to the total
sales in the window or to the whole data set. And we have many comparison analyses. We can compare the
current value with the average, or we can compare it to the extremes, to the highest sales or the lowest sales, and so
on. In another use case, we can identify data quality issues in our data. For example, we can
identify duplicates using the COUNT function. Moving on to the next use case, we have outlier detection. We
can find out which data points are above the average and below the average and so on. Then the next one: we
have the running total. As we learned, it is a great tool to track the progress or to monitor the performance
of our business over time. Or if you want to be more specific, you can use the rolling total in order to have
a specific window and only track this window, like three months or something like that. And in the last use
case, we can calculate the moving average of our data. So it's really amazing how ORDER BY and aggregate
functions can open a door for you to advanced analysis. So guys, as you can see, we have a lot of use
cases for the window aggregate functions in the world of data analytics. All right. So with that we have
covered the aggregate window functions, and the next step is going to be very important. We will learn how to
rank our data using window functions. So let's go.

All right. So now let's say that we
have the following data. We have products and their sales. If you now want to rank your products, first
you have to sort the data based on something, for example ranking the products based on their sales. That
means SQL first is going to sort your data from the highest to the lowest. So sorting the data is
always the first thing SQL has to do before ranking anything. Now in order to rank our data we have two methods. The
first method we call integer-based ranking. That means SQL is going to assign each row an integer,
a whole number, based on the position of the row. Looking at the example, in the first row we have the
product E with the sales 70, so it's going to be rank number one. Then the next row, the product B with 30 sales, gets
rank number two, then the next ones are going to be three and four, and the last one is going to be five. So that means SQL here
is assigning an integer to each row based on its position in the sorted list. This method we call integer-based
ranking. Now let's go to the second method: we have percentage-based ranking. In this
method SQL is going to first calculate the relative position of the row
compared to all the others and then assign a
percentage to each row. So here in the output it's going to assign percentages instead of integers, and we're
going to have a scale from zero to one. Now if you compare both of the methods, you can see that on the left
side, in the integer-based ranking, we have discrete, distinct values. It starts from 1, then 2, 3, and ends in this
example with five. So it really depends on how many rows we have in the results. It could be five, it could be 500, 5
million, and so on. But on the right side we always have the same scale from zero to one. Between zero and one we have
an infinite number of data points, and this scale we call a normalized scale, or we call it a continuous scale with continuous
values. So now the question is when to use which method. The percentage-based ranking is great to
answer questions such as: find the top 20% of products based on their sales. This method is a great way to
understand the contributions of data values to the overall total, and we call
this kind of analysis a distribution
analysis. On the other hand, with integer-based ranking we can answer questions like: find the top three
products. With this question we are not interested in the contribution of each product to the overall total. We
are just interested in the position of the value within a list. This is as well very commonly used in analysis and
reporting. We call it top/bottom N analysis. So now let's group up our ranking functions based on those two
methods. In the first group, the integer-based ranking, we have four functions: ROW_NUMBER, RANK, DENSE_RANK, and
NTILE. On the other hand, we have only two functions that generate percentage-based ranking: we have
CUME_DIST and PERCENT_RANK. So that was an introduction, an overview of those methods and how we
group up those ranking functions. Next we're going to learn about the syntax of the ranking functions. Most of
them follow the same rules. For example, we always start with the function name. So we have here RANK.
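To make the two ranking methods concrete, here is a small sketch that puts an integer-based rank (RANK) next to a percentage-based rank (PERCENT_RANK). It runs through Python's sqlite3 module, not the course's own database. Products E (70) and B (30) come from the example above; the table name and the remaining three rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ProductSales (Product TEXT, Sales INTEGER);
INSERT INTO ProductSales VALUES
  ('A', 20), ('B', 30), ('C', 10), ('D', 15), ('E', 70);
""")

ranking = con.execute("""
SELECT Product, Sales,
       -- Integer-based: position in the sorted list (1, 2, 3, ...).
       RANK() OVER (ORDER BY Sales DESC) AS SalesRank,
       -- Percentage-based: relative position on a scale from 0 to 1.
       PERCENT_RANK() OVER (ORDER BY Sales DESC) AS SalesPercentRank
FROM ProductSales
ORDER BY SalesRank
""").fetchall()

for row in ranking:
    print(row)
# ('E', 70, 1, 0.0)
# ('B', 30, 2, 0.25)
# ('A', 20, 3, 0.5)
# ('D', 15, 4, 0.75)
# ('C', 10, 5, 1.0)
```

Note that both calls have empty parentheses and both windows contain an ORDER BY, which is the syntax pattern described in this part of the lesson.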
But as you can see we don't use any expressions. These functions don't allow you to use any argument inside them. It must be
empty. So this is the first rule of using rank functions. Then about the definition of the window: as usual,
PARTITION BY is optional. You can use it or leave it. And now to the second part: we have the ORDER BY, and it
is required. You must order or sort your data in order to do ranking, so you cannot leave it
empty. That means for the definition of the window we should have at least an ORDER BY, for example here on sales. So we
cannot leave it empty. All right. So the two requirements: you cannot use any expressions for those functions, and
you have to sort your data using ORDER BY. Okay. So now let's have an overview of all functions. As you can
see, all those functions are ranking functions and almost all of them don't allow you to use any expressions inside
them. The exception is this function here, NTILE. It accepts a number inside it. That means you cannot use it
empty. You have to use a number inside it. All the others must be empty. Now, the PARTITION BY is
optional for all of them, and the ORDER BY is required for all of
them. So you must use ORDER BY. And the frame clause is not
allowed in the ranking functions, so you cannot change the definition of the frame inside the window function. So
now what we're going to do, as usual, is deep dive into all of those functions in order to
understand when to use them and what the use cases are, and as well practice in SQL. We're going to start with the
first one, ROW_NUMBER.

All right. So what is ROW_NUMBER in SQL? The ROW_NUMBER function
going to go and assign for each row a unique number as a rank and it doesn't care at all about the ties. That means
if you have two rows sharing the same value, they will not share the same rank. Okay. So now we have very simple
example. We have a list of all sales and we have the following query. So it's going to start with the ranking function
row number. It doesn't accept any argument inside it. And the definition of the window going to be like this
order by sales disk. So that means we're going to go and sort the data descending from the highest to the lowest. So SQL
going to go and do the following. The highest going to be the 100. The lowest going to be the 20. And here we have
twice the 80. So now once SQL done sorting the data, what's going to happen? It's going to start assigning a
rank. So the row number going to go and assign a unique number for each row. So that means it's going to start with the
first one. The 100 going to be the rank number one. The next one going to be rank number two. The 80 going to be rank
number three. And the 54. And then the last one going to be five. And now if you check the output you can see that
all those numbers are unique. We don't have any repetitions. So 1 2 3 4 5 there's no repetitions. They are unique
distinct value. And as well there are no skipping of ranking. So that means we have here 1 2 3 there is no jumping to 6
7 or something. They are clear sequence of distinct value and there are no gaps. But still there is something special in
our data. We can see that in the sales we have the same value twice. So we have two rows with the same sales. As you can
see in the row number they will get distinct values. So they will not share the same ranking. So that means row
number does not handle the ties. If you have multiple rows sharing the same values they will not share the same
rank. They are going to get distinct, different ranks. So this is how the row number works in SQL. It generates unique
ranks for each row. It does not handle the ties and as well it doesn't leave any gaps. So there is no skipping of
ranking. So now let's go to SQL in order to have few examples and use cases. All right. So now we have the following
task. It's very simple. Rank the orders based on their sales from the highest to the lowest. So now this is very easy.
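Before we build the query for this task, the row number behaviour from the example above can be sketched with a few inline rows. This is only a sketch: the CTE name DemoSales and the column alias are made up for illustration, and the value 50 is a filler, since the example only names 100, 80, 80 and 20.

```sql
-- Hypothetical inline data standing in for the five sales values of the example.
WITH DemoSales (Sales) AS (
    SELECT 100 UNION ALL
    SELECT 80  UNION ALL
    SELECT 80  UNION ALL
    SELECT 50  UNION ALL   -- filler value, not taken from the course data
    SELECT 20
)
SELECT
    Sales,
    -- No argument inside ROW_NUMBER(); ORDER BY in the window is required.
    ROW_NUMBER() OVER (ORDER BY Sales DESC) AS SalesRank_Row
FROM DemoSales
ORDER BY SalesRank_Row;
-- Sales | SalesRank_Row
-- 100   | 1
-- 80    | 2
-- 80    | 3   <- the tie still gets two different numbers
-- 50    | 4
-- 20    | 5
```

The numbers are unique and have no gaps, even though the value 80 appears twice.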
We're going to go and select first the data. So order ID, product ID. Let's take the sales as well and select the
table. So it's going to be sales orders. Let's go and execute it. So with that we got all our orders. What we're going to
do now is to assign for each row a rank. So that means we need a column here that contains the rank for each row. So in
order to do that we're going to go and use the window function row number. It doesn't accept any argument inside it.
So should be empty. And then we have to define the window. So as we learned in the ranking functions we cannot leave it
empty. We have to sort the data using order by. So order by is a must. We don't have to use any partition by. So
we're going to rank all the data that we have inside the table. So how to sort the data? It says it should be based on
their sales from highest to lowest. That means we order by sales, and since it is from highest to lowest we have to use the
descending. And now we're going to go and give it a name, sales rank, and let's add row to it since we are using the row
number. So that's it. It's very simple. Let's go and execute it. So now let's have a look at the results. Before, the rows
came back by order ID since we didn't define any sorting. But since now we order by sales descending, SQL went
and sorted the data by the sales from the highest to the lowest and started assigning a rank, or let's say a
unique integer for each row. So now the highest order going to be the order number eight. We have the sales of 90.
This is the highest one. So as you can see we have 1 2 3 4 5 until 10. So now by checking the results you can see that
the ranking here is unique. So there is no duplicates over here and as well there is no skipping or gaps. So we have
everything between 1 and 10 even though we have in our data a couple of sales sharing the same value. So
for example we have those two orders, you can see both of them have 60 as the sales but they don't share the same
ranking. Right? And we have here as well the orders 9 and 3, they share the same value 20 but they don't share the same
ranking. So with that we have solved the task. It's very simple. We have now a rank based on the sales from highest to
the lowest. All right. So what is a rank function in SQL? The rank function going to go and
assign for each row a number a rank and this time it going to go and handle the ties. So that means if in your data you
have two rows having the same values they going to share the same ranking. One thing about the ranking function is
that it's going to go and leave gaps in the ranking. So there is possibility of skipping ranks. In order to understand
how the rank function works in SQL, we're going to have a very simple example. All right. So again with the
same data but with a different function. So our window looks like this. It starts with the function rank, which doesn't accept
any argument inside it. Then we have the window like this: order by sales descending, from the highest to the
lowest. And our data is already sorted like that. So now how is SQL going to go and assign the ranks? The first row
going to be the highest rank. So the value 100 is going to be one. Then the second one going to be two. But now for
the third one, as you can see, we have here two values that are the same. So we have a tie and this time SQL going to go
and let them share the same rank. So both of them going to be the rank two. So it's not like the row
number where we have over here three. This time we have two because we have a tie. So having same values means they
going to share the same rank. And now the next value going to be the tricky one, because if you check over
here you might think that the next rank should be three, right? So we have one, two, and then the next value
generated in the rank should be three. But SQL going to say, you know what, the position of this value is number
four. So as you can see 1, 2, 3, 4. So actually the position number here is four and SQL going to go and give it the
rank of four. So with that SQL going to be leaving a gap in the ranking. You can see we are skipping the rank number
three, and this always happens once you have a tie where rows are sharing the same rank. So for the next one it's
going to be easy. It is row number five, so it gets the rank five. So now by looking at the output of the rank function you can
see that we don't have a unique ranking. Here we have shared ranking in case of the ties. So it handles the ties but
here we have gaps in the ranks. So we are skipping ranks. When I think about the rank function I think about the
Olympics. If two athletes tie for the gold medal, the first place, there will be no silver medal for the second place;
the next medal going to be the bronze, for the third place. All right. So now let's go to SQL in order to practice
the rank function. All right. Now we're going to go and solve the same task but using the rank function. So what we're
going to do, we're going to stay with the same example over here and we're going to rank the order based on their
sales from highest to lowest but this time using the rank function. So we use the rank and everything inside is going
to be empty and then our window going to be exactly the same as before. So OVER, ORDER BY sales DESC. So let's give
it a name, sales rank, and let's add rank to it. So that's it. As you can see the syntax is very simple and very
similar to the row number. We just changed the function. So now let's go and execute this in order to check the
results. So now let's go and check the results by looking to the new rank. If you go and compare it with the old rank,
we can see that we are sharing some ranking, right? We have here the two twice. So the rank number two, we have
it twice because we have over here the same value. So 60 60 we have it here two and two. But if you compare it to the
row number, you can see that it is not sharing the same ranking. So this is one difference. And as well here the same
thing. They have the same value. The sales is 20. So we have twice the rank number seven, whereas in the row number
they get different values. And for the next value, as you can see, we are skipping a rank. So there is a gap, there is no rank
eight. You can see that this is row number nine and that's why it gets the nine. The same thing happens over
here. If you check those two rows with rank two, the next one should be three. But since it is in row number four it's
going to get the rank four. So by checking the results we can see that we have shared ranks and as well we
have gaps. So this is how the rank works. All right. So what is a dense rank? It
is very similar to the rank function. It's going to go and assign for each row a number, a rank, and it as well handles the
ties. So same values going to share the same rank, but this time it doesn't leave any gaps, unlike the rank
function. So the dense rank will not leave any gaps. It will not skip any rank. So in order to understand this
we're going to have a very simple example. So let's go. All right. So again the same data but with different
function. We have this time the ranking function dense rank, and the window going to be the same: order by sales descending,
from the highest to the lowest. So now the data is as well sorted already. Let's see how SQL going to go and assign
the ranks. As usual, the first row going to be the rank number one and the second row rank number two, but again here we have the same
values. So we have a tie, and like the rank function, the tied rows going to share the same rank. So both of them going to
have the rank number two. And now you might say, well this is very similar to the rank function. So why do we have
dense rank? I'm going to say wait for it. We're going to have the difference in the next value. So it's going to come
over here. This value is exactly after the tie. In rank SQL went and took the position number. So the row number it
was four, right? So 1 2 3 4. But this time with the dense rank SQL will not leave gaps in ranking. So there will be
no skipping the next rank in the sequence going to be three. So that's why we're going to have the rank three
for this value. So as you can see there is no gap. We have one, we have two and three. So we are not skipping, we are
not leaving any gaps. And the last one going to be four. So this is exactly the difference between the dense rank and
the rank. So now by checking the output of the dense rank, you can see that we don't have unique ranks. We have here
shared ranks. As you can see, we have here repetition. So, it handles the ties and as well it doesn't leave any gaps.
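To see the two tie-handling functions next to each other, here is a small sketch on the same kind of five-row example. The CTE name DemoSales and the aliases are hypothetical, and the value 50 is a filler that the example does not name.

```sql
-- Hypothetical inline data: 100, 80, 80 and 20 come from the example, 50 is a filler.
WITH DemoSales (Sales) AS (
    SELECT 100 UNION ALL
    SELECT 80  UNION ALL
    SELECT 80  UNION ALL
    SELECT 50  UNION ALL
    SELECT 20
)
SELECT
    Sales,
    RANK()       OVER (ORDER BY Sales DESC) AS SalesRank_Rank,   -- shares ranks, leaves gaps
    DENSE_RANK() OVER (ORDER BY Sales DESC) AS SalesRank_Dense   -- shares ranks, no gaps
FROM DemoSales
ORDER BY Sales DESC;
-- Sales | SalesRank_Rank | SalesRank_Dense
-- 100   | 1              | 1
-- 80    | 2              | 2
-- 80    | 2              | 2
-- 50    | 4              | 3   <- RANK skips 3, DENSE_RANK does not
-- 20    | 5              | 4
```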
It doesn't skip anything in the ranking. Okay, so that's it. Now, let's go back to SQL to practice the dense rank. All
right, so now we have the same task. Rank the orders based on their sales from highest to lowest. So, we're going
to do the same stuff, but this time using the function dense rank. So, dense rank, and its parentheses are going to be empty. And then
we're going to define the window like all the others: OVER, ORDER BY sales DESC. And then we're going to give it the name sales rank
dense. And that's it. So as you can see all of those functions have exactly the same syntax, right? So let's go and execute
it. Okay. So now let's go and check the results. We got our newest rank using the dense rank. And by just checking the
results, you can see that it handles the tie. We have two twice, right? So let's check the example over here. We have the
sales 60 twice. That's why they are sharing the same ranking in the dense and as well in the normal rank. But now
what is interesting is the value after the tie. So as you can see over here with the dense rank we have three. So we
didn't skip any rank. We don't have any gap: 1, 2 and then 3. But the rank just focuses on the position
number. So it is row number four, that's why it's four, and with that we have a gap. So as you can see now we don't
have any gaps in the dense rank. So we have three, four, five. And now we have over here again two equal values. So we
have sales of 20 and 20 and they share the six twice. So as you can see there is a difference now between the dense rank and the
rank. In the rank we have seven and seven, but in the dense rank we are at six and six. We have this difference between them
because the rank function skipped the rank number three before. For the remaining rows you can see we have seven and eight. So now
if you compare those three rankings you can see that they all start with the rank number one but they don't all end
with the same rank. So the row number and the rank really focus on the position number, or the row number, of the
orders. So you can see over here it is row number 10. That's why we have here 10 and 10. So the scale of the rank is from 1
to 10. And that is exactly the same for the row number, from 1 to 10. But with the dense rank over here we have it from 1 to 8,
and that's because rows shared the same rank and no ranks were skipped afterwards. So the scale is different
from the two others. And that's because we have ties twice. This is one tie and as well we have over here one tie.
That's why we are missing over here two ranks. So this is how the dense rank works. And you can go and compare now
all three together in order to understand how those ranks are working. All right. So now let's quickly
compare the three functions side by side. Let's start with the first point about the uniqueness of the rank. And if
you compare those three you can see that only the row number generates unique, distinct ranks. So this going to be
a unique rank, and in the two others we have duplicates or let's say shared ranks. Okay. So now the second point: whether
the function handles the ties. The only one that doesn't handle the ties is the row number. So this one doesn't
handle the ties and the two others handle the ties since they offer the shared rank. And now we have the last
point about leaving gaps or skipping ranking. So now if you check the row number and the dense rank you can see
there will be no skipping. So there is no gaps for the row number and as well for the dense rank only for the rank
function, the middle one, we are skipping ranks and we are leaving gaps. So that's it guys. These are the differences between
those three functions. I usually tend to work with the row number more often than the two others.
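The comparison can be reproduced with one query that runs all three functions over the same window. This is a sketch under assumptions: the ten rows below are hypothetical and only mimic the shape described above (order 8 with sales 90, one tie at 60, one tie at 20 for orders 3 and 9), and the CTE name and aliases are made up.

```sql
-- Hypothetical ten orders with two ties (60/60 and 20/20).
WITH Orders (OrderID, Sales) AS (
    SELECT 1, 10  UNION ALL
    SELECT 2, 15  UNION ALL
    SELECT 3, 20  UNION ALL
    SELECT 4, 60  UNION ALL
    SELECT 5, 25  UNION ALL
    SELECT 6, 50  UNION ALL
    SELECT 7, 30  UNION ALL
    SELECT 8, 90  UNION ALL
    SELECT 9, 20  UNION ALL
    SELECT 10, 60
)
SELECT
    OrderID,
    Sales,
    ROW_NUMBER() OVER (ORDER BY Sales DESC) AS SalesRank_Row,    -- unique, no ties, no gaps
    RANK()       OVER (ORDER BY Sales DESC) AS SalesRank_Rank,   -- shared ranks, gaps
    DENSE_RANK() OVER (ORDER BY Sales DESC) AS SalesRank_Dense   -- shared ranks, no gaps
FROM Orders
ORDER BY Sales DESC, SalesRank_Row;
-- Sales | Row | Rank | Dense   (which of two tied orders gets the lower row number is arbitrary)
-- 90    | 1   | 1    | 1
-- 60    | 2   | 2    | 2
-- 60    | 3   | 2    | 2
-- 50    | 4   | 4    | 3
-- 30    | 5   | 5    | 4
-- 25    | 6   | 6    | 5
-- 20    | 7   | 7    | 6
-- 20    | 8   | 7    | 6
-- 15    | 9   | 9    | 7
-- 10    | 10  | 10   | 8
```

Row number and rank both end at 10, while the dense rank ends at 8 because each of the two ties uses one rank less.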
All right guys, so now I had a look at those three functions and I checked my real projects and I found out
that there are many more use cases for the function row number compared to the other functions dense rank and rank. So
now what I'm going to do is show you a few use cases for the row number that I usually use in my real
projects in order for you to understand how important the row number function is. So let's go to SQL. All right.
So now let's start with the first use case and we have the task: find the highest sales for each product. So
this is very classic in reporting or data analysis. We call this top-N analysis. So here the managers or
decision makers would like to have the best performers or the biggest successes in our data. So for example the top
five customers or the top five products or categories and so on. So this is a very important analysis in order
to focus on the best products or on the most important customers and so on, and this is as I said very classic and
very important in order to make decisions in the business. So now let's see how we can solve this. So we're
going to start with the usual stuff. Let's first select the data. So select order ID. Let's take as well the product
ID and the sales from sales orders. So let's go and execute this. And now as we know that for each product we have
multiple orders and we have multiple sales but we are interested only in the highest sales for each product. So we
have to go and create a rank. In order to do that we're going to use the window function row number and we have to
define the window now. So do we need partition by? Check the task. It says for each product, so that means we have to
divide the data by the product ID. So let's go and use PARTITION BY product ID. And now we must use the
order by. So ORDER BY. And now how to sort the data? By the sales, right? And it is from the highest to the lowest. So
let's go with sales, and we add here descending. So from highest to lowest. Let's go and give it a name. So it is
going to be rank by product. So let's go and execute this. And now by looking at the result, you can see that SQL did
divide the data by the product ID. So we have here four windows. The first one over here, you can see that the
rank starts from one and ends with four. So the rank one is the order number eight with the sales of 90, and then it
goes down to rank four. Now as you can see, in the second window we have a new ranking. So it resets. The first going to
be the order number 10 and the last one going to be order number two. So as you can see each window has its own
ranking, and as well the last one we have it only as one row. So now of course in the task we have to return the highest.
So we are not interested in the others. We have to return this row this row as well and this one and this one. So as
you can see we have to return everything that has the rank one. We are not interested in the rank 2 3 4 and so on.
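A minimal sketch of this top-N pattern: rank inside each product, then wrap the ranked query in a subquery and keep only rank one. The rows are hypothetical and only follow the shape described above (four products, order 8 on top of the first window, order 10 first and order 2 last in the second, one product with a single row); the CTE name, product IDs and aliases are assumptions.

```sql
-- Hypothetical orders for four products.
WITH Orders (OrderID, ProductID, Sales) AS (
    SELECT 1, 101, 10  UNION ALL
    SELECT 2, 102, 15  UNION ALL
    SELECT 3, 101, 20  UNION ALL
    SELECT 4, 105, 60  UNION ALL
    SELECT 5, 104, 25  UNION ALL
    SELECT 6, 104, 50  UNION ALL
    SELECT 7, 102, 30  UNION ALL
    SELECT 8, 101, 90  UNION ALL
    SELECT 9, 101, 20  UNION ALL
    SELECT 10, 102, 60
)
SELECT *
FROM (
    SELECT
        OrderID,
        ProductID,
        Sales,
        -- The ranking restarts at 1 inside every product window.
        ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY Sales DESC) AS RankByProduct
    FROM Orders
) AS t
WHERE RankByProduct = 1      -- keep only the highest sale per product
ORDER BY ProductID;
-- OrderID | ProductID | Sales | RankByProduct
-- 8       | 101       | 90    | 1
-- 10      | 102       | 60    | 1
-- 6       | 104       | 50    | 1
-- 4       | 105       | 60    | 1
```

The filter has to sit outside the subquery because a window function cannot be used directly in the WHERE clause of the same query level.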
So we would like to have the highest. So now in order to filter the data what we're going to do we're going to go and
use a subquery. So SELECT star FROM the ranked query, and then we're going to have the following condition. So WHERE, and we're going to
say rank by product equals one. So we are interested only in the rank number one. So let's go and execute it. And
with that, since we have four products in our data, we're going to have only four rows and we have the highest sales. So
as you can see we have only number one over here. And those sales are the highest for each product. And with that
we have solved the task with a top-N analysis. Okay, moving on to the next use case. We
have the following task and it says find the lowest two customers based on their total sales. So now we have the exact
opposite use case. We call it bottom-N analysis. So now in this example in the business the decision makers want to
optimize the costs, they want to cut costs, and with that they have to analyze the lowest performers among the products or the
lowest performers among the employees in order to cut costs. So now with this analysis the decision makers are not
focusing on the most successful stuff. We are focusing on the lowest performers. So now let's solve
this task. So now if you check the question we have multiple things, right? We have the total sales and as well we have
to find the lowest two customers. So we have ranking and as well aggregations. Remember, we can use window functions together with
the GROUP BY. So now let's do it step by step. First let's select the data, right? So what do we need? Order ID, customer ID
and we need the sales from sales orders. So let's go and execute this. So now if you check the customers over here we
have four customers and they have multiple sales. Now we would like to have the total sales for each customer
in order to find the lowest two. So let's start first with the aggregations. So what we going to do? We're going to
go and aggregate the sales. So the SUM of sales, and let's call it total sales. And now in order to do the GROUP BY we
have to keep only the customer ID in the select. So GROUP BY customer ID. So it is a very simple GROUP BY statement. Let's
go and execute this. So now by checking the result we can see that SQL did aggregate the data. We have four rows
and that's because we have four customers and we have their total sales. So we have solved the first part of the
task. We have the total sales for each customer. Now let's move to the second part. It says lowest two customers. That
means we have to use the ranking functions in order to rank those customers. So we are not interested in
all customers. We are interested only in the lowest two. So in order to do that now we're going to go and use the window
function row number. So and then over. Now do we have to partition the data? Well no we don't have to do that. We
have now to sort the data. So order by. So this time we're going to go and use the aggregations in the order by. So the
sum of sales and we want to have it sorted from the lowest to the highest. So I'm just going to go and use the
defaults. So it is ascending. Now let's call it rank customers. So that's it. Again here the rule is that if you are
using a window function together with the GROUP BY clause, the window function can only use columns that are in the GROUP
BY, or aggregations like the sum of sales here. So this should be working. Let's go and execute it. So now as you can see in the results, we got an extra column for
the rank. So now the lowest customer going to be the customer number two. The second one going to be customer number four with 90
total sales. And the customer with the highest sales going to be the last one, customer number three with 125. So
now we have almost everything, but the list should contain only the lowest two. So in order to filter the
data, we're going to go and use a subquery. So SELECT star FROM the ranked query, and then we have to define the
condition: WHERE rank customers should be smaller than or equal to two. Right? So with that we will get the first two. So
let's go and execute this. And with that we got the lowest two customers based on their total sales. So the customer IDs
two and four. So that's it. We have solved the task and now we have done a bottom-N
analysis. Okay let's keep moving to the next use case and we have the following task. It says assign unique IDs to the
rows of the table orders archive. So now guys you might be in a situation where you have a table without any primary key and
you would like to create an ID for each row. So in order to do that we can use the function row number in order to
generate a unique identifier for each row inside our table if we don't have one. And generating such an ID for each
row is very important for doing stuff like importing data, exporting data, maybe joining tables as well using this
ID, or let's say optimizing the performance of a query using the ID. So now let's see how we can generate that
using row number. Okay. So now let's first select the table orders archive in order to understand the content. So
SELECT star FROM sales orders archive. So let's go and execute. So now by checking the result you can see that we
have 10 rows and we have repetitions in the order ID over here. So it is not really a primary key. As you can see over
here we have twice the ID four and here we have three times the ID six. So now what we're going to do, we're going to go and
generate a unique identifier for each row. So in order to do that, we're going to go over here and say row
number and then we're going to define the window. We don't partition the data at all but we have to sort the
data, so ORDER BY order ID, or you can use something else as well like the order date,
it doesn't matter. So let's add the order date to it as well and let's call it unique ID. Let's go and execute this. Now by
checking the data you can see that we have a new ID over here that comes from the row number and we have like a unique
identifier. As you can see we have 10 rows and with that we have as well 10 distinct, unique IDs. So with
this, as you can see, we have solved the task and we have now a unique identifier, an ID, for the table orders archive. So
now having this ID we can do many things like joining tables or doing something special and important called pagination.
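The unique-ID step can be sketched as follows. The rows are hypothetical stand-ins for the orders archive (order ID 4 twice, order ID 6 three times), and the CTE name, column names and dates are assumptions made for illustration.

```sql
-- Hypothetical archive rows without a usable primary key.
-- Dates are ISO-formatted strings so that they sort correctly as text.
WITH OrdersArchive (OrderID, OrderDate) AS (
    SELECT 1, '2024-04-01' UNION ALL
    SELECT 2, '2024-04-05' UNION ALL
    SELECT 4, '2024-04-20' UNION ALL
    SELECT 4, '2024-04-20' UNION ALL
    SELECT 6, '2024-06-15' UNION ALL
    SELECT 6, '2024-06-15' UNION ALL
    SELECT 6, '2024-06-15'
)
SELECT
    -- No PARTITION BY: one running number across the whole table.
    ROW_NUMBER() OVER (ORDER BY OrderID, OrderDate) AS UniqueID,
    OrderID,
    OrderDate
FROM OrdersArchive
ORDER BY UniqueID;
-- UniqueID runs from 1 to 7, one distinct value per row,
-- even where OrderID and OrderDate are repeated.
```

Rows that are identical in the ORDER BY columns receive their numbers in an arbitrary order, so the generated ID is unique but is only stable between runs if the sort columns identify each row.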
Imagine we have like a huge table and we would like to retrieve the data. So now in order to not have all the data in one
go we can go and divide the data by the primary key or by a unique identifier. For example, we can make a page from 1 until
100,000 and then the second page goes from 100,001 until 200,000. So now by dividing the data, we can maybe improve exporting
or importing data or we can have faster retrieval for the users. We don't want to have the whole data in one go, in one
page. So pagination has a lot of benefits, and this kind of paging relies on having a nice ID like
this. All right. Now I'm going to show you the last use case for the function row number that I usually use
in my real projects. So sometimes if you are doing data analysis you're going to find out that there are data quality
issues, especially with duplicates. So what I usually do is use the row number in order to identify the
duplicates. Not only that, I can use it in order to delete the duplicates. So we can use it in order to do data
cleansing. And this is an essential task for each data engineer, not only data analysts, in order to prepare and clean
up the data before doing data analysis. So let's have the following task. Identify duplicate rows in the table
orders archive and return a clean result without any duplicates. So not only we have to identify the duplicates, we have
to return no duplicates in our results. So let's see how we can do this. Let's first select the data. So select star
from sales orders archive. So let's go and execute. So now by looking to the data you can see that we have
duplicates. We have an issue. So the order ID number four is twice in our database. It doesn't make sense, right?
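The rest of this walkthrough builds up one pattern step by step: number the rows inside each order ID, newest first, and keep only number one. As a sketch of where it is heading, with hypothetical rows and assumed names (OrdersArchive, OrderStatus, CreationTime, rn):

```sql
-- Hypothetical archive rows: order 4 appears twice, order 6 three times.
-- Timestamps are ISO-formatted strings so that they sort correctly as text.
WITH OrdersArchive (OrderID, OrderStatus, CreationTime) AS (
    SELECT 1, 'Delivered', '2024-04-01 12:34:56' UNION ALL
    SELECT 4, 'Shipped',   '2024-04-10 12:34:56' UNION ALL
    SELECT 4, 'Delivered', '2024-04-11 12:34:56' UNION ALL
    SELECT 6, 'Shipped',   '2024-05-01 08:00:00' UNION ALL
    SELECT 6, 'Shipped',   '2024-05-02 08:00:00' UNION ALL
    SELECT 6, 'Delivered', '2024-05-03 08:00:00'
)
SELECT *
FROM (
    SELECT
        OrderID,
        OrderStatus,
        CreationTime,
        -- Newest row per order gets rn = 1, older copies get 2, 3, ...
        ROW_NUMBER() OVER (PARTITION BY OrderID ORDER BY CreationTime DESC) AS rn
    FROM OrdersArchive
) AS t
WHERE rn = 1                 -- change to rn > 1 to list the outdated copies instead
ORDER BY OrderID;
-- OrderID | OrderStatus | CreationTime        | rn
-- 1       | Delivered   | 2024-04-01 12:34:56 | 1
-- 4       | Delivered   | 2024-04-11 12:34:56 | 1
-- 6       | Delivered   | 2024-05-03 08:00:00 | 1
```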
It should be only one. So which one is the correct one? If you check the data over here, you can see that this order
was first shipped and then delivered. So it looks like the last one is the correct one. So how can we pick it? If you just
scroll to the right, you can see that we have a creation time. And we usually use such a timestamp in order to identify
which row was the last valid one for an order. And here we can see immediately that this creation time is later than the previous
one, which means this row is the more up to date, right? The more current. So what we're going to do, we're going to go and
rank our data for each order ID and sort the data by the creation time in order to find the last inserted or created row
for this order. So let's see how we can do that. What we going to do? We're going to go over here and say let's have
a row number and then OVER, and what we're going to do, we're going to partition by what should be the primary key. So
PARTITION BY order ID, and as we said we have to order the data by this timestamp at the end. So after the PARTITION BY, ORDER BY
creation time, descending. So we want the latest first, then the older ones. So that's it. Let's call it rn and execute the
query. So now by checking the data: if everything is clean and we don't have duplicates, every value should be one,
because for each primary key we should have at most one row. But you can see over here we have a two, and over
here a two and a three. So this is an indicator that we have duplicates inside our data. So now let's check them one by one.
As you can see the order ID one appears only once, so it has the rank one. The second one as well has the rank one. But here we
have the issue. So as you can see we have now two ranks for the order ID four. So now which one is the correct one? In
our logic, we say it is the last row that was inserted into our data, and this is rank number one. So if you
scroll to the right side you can see that the creation time here is later than that of the second one. So with that we
have identified what we want. We want the last inserted row for each ID. And now let's check order ID six over here. So here
we have it three times. The rank says the first one has the latest creation time. So if you go to the right side and
compare those timestamps you can see that this record, the first one, is the latest one that was inserted
into our data. So as you can see this one is the one that we need. The other two we don't need because they hold old
information. So now everything that doesn't have the rank number one is not valid. It's something old and it's
actually bad data quality. So we want to remove it or not select it. So now in order to have clean data what we going
to do, we're going to go and use the query as a subquery. So SELECT star FROM the subquery, and now we are interested
only in the rank number one, so WHERE rn equals one. We don't need anything else. So let's go and execute. And now if you check the
results you can check the order ID over here. It is unique. We don't have any duplicates. Right? 1, 2, 3, 4, 5, 6, 7. There
are no duplicates at all. And we have now only the latest inserted row for each order, and we don't have any duplicates
or data quality issues. So now of course we can go with these results and do the analysis, and this is
exactly what data engineers usually do: clean up the data and prepare the data before doing any data analysis. And of
course you might want to communicate those data quality issues to the source of the data, let's say you are not the owner of
that information. You can generate a list of all the bad rows and you can send it to the source system and
tell them to clean it up at the source. So now in order to select the bad data what we're going to do is we
can just change here the condition and say if rn is higher than one then the row is bad data. So let's go and
execute this. And now with this we have in the results all records that shouldn't exist in the data in the first
place. So we can go and export it and communicate it to the source and tell them check here you have something wrong
in your system and this information should not have been inserted into the data. So the row number is very strong, right? It is
very powerful. I use it a lot in my projects. There are many use cases for the row number function in SQL. We can
use it in order to do top-N analysis and bottom-N analysis, to find the best performers and the worst performers, and
as well we can assign unique IDs to do pagination, or we can use it in order to discover data quality issues and clean up
our data. So it is an amazing function in SQL and you're going to use it a lot. So that's it for the three functions row
number, rank and dense rank. Now we're going to talk about NTILE. Okay. So what is NTILE? NTILE in
SQL is very simple. It's going to go and divide your rows, your data, into a specific number of almost equal groups,
or sometimes we call them buckets. So now in order to understand this and how SQL works with this function, we're
going to have a very simple example. So let's go. Okay, we have the following setup. We have four rows for sales and
we would like to divide it into two groups or into two buckets. So in order to do that we can use the NTILE
function. It has a different syntax than the other ranking functions. So it starts with NTILE, then we must define a
number. So we cannot leave it empty like the other ranking functions. So here we have two buckets, then OVER, and here again we have
to sort the data. So order by is a must: ORDER BY sales descending, from the highest to the lowest. So now as usual SQL going to go
and sort the data. We have it already sorted in this example. Then it going to start assigning each of those rows to
buckets. But SQL first has to calculate the bucket size, so how many rows fit inside each bucket. So the
calculation is very simple. It says the bucket size equals to the number of rows divided by the number of buckets. So
what is the number of rows here? We have four rows, right? So we have four over here. Then the number of buckets we
define it in the syntax of the query. So here we defined two buckets. We need two groups. So that means we are dividing
four by two. And the size of the bucket going to be two. So now with this SQL is ready and going to start assigning each
row to a bucket. So it's going to start on the top. The first one going to be in the bucket number one. Then go to the
next one. It's going to say okay, we still have enough space in the bucket, right? So it's going to assign it as well to
one. But with this we reach the maximum number of rows for the first bucket. So the next row going to be assigned to
another bucket. So it's going to be two, and the last one going to be as well two. So as you can see it's very simple.
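Here is a minimal sketch of that four-row example. The sales values, the CTE name DemoSales and the alias are hypothetical, since the example only says there are four rows and two buckets.

```sql
-- Hypothetical four sales values, to be split into two buckets.
WITH DemoSales (Sales) AS (
    SELECT 100 UNION ALL
    SELECT 80  UNION ALL
    SELECT 50  UNION ALL
    SELECT 20
)
SELECT
    Sales,
    -- NTILE needs the number of buckets as its argument and an ORDER BY in the window.
    NTILE(2) OVER (ORDER BY Sales DESC) AS TwoBuckets
FROM DemoSales
ORDER BY Sales DESC;
-- Bucket size = 4 rows / 2 buckets = 2 rows per bucket
-- Sales | TwoBuckets
-- 100   | 1
-- 80    | 1
-- 50    | 2
-- 20    | 2
```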
We have just assigned our sales, based on the sorting of course, into two buckets. These two sales belong to the bucket
number one and the other two belong to the bucket number two. Very easy. So that was very straightforward because we
are dividing an even number of rows and we got perfectly sized buckets. But now what going to happen if we have an odd
number? So we have here five rows instead of four. So the bucket size going to be five divided by two. We're going to get
2.5. And now of course SQL will not go and put two and a half rows in each bucket, because then we would be splitting a row between
two buckets. Of course this will not be working. We should have now a bucket with three and another bucket with two. So
now the rule in SQL makes it very clear. It says larger groups come first, then smaller. So that means if the rows
don't divide evenly like this, the larger group going to be the first group. So that's going to look like this. I'm
going to reset everything. So let's see what's going to happen. The first one going to be one. The second one as
well one. The third one going to be as well one. So the first bucket going to be larger than the second one. Then the
rest going to be two. So as you can see the larger group comes first, then the smaller. And this is how SQL going
to work if the rows don't divide evenly. So you don't have here perfectly sized buckets. You have approximately or roughly
equally sized buckets. So this is how NTILE works. Now let's go back to SQL in order to practice this
function. Okay. So now let's have some fun working with this function. So we're just going to select something like
OrderID, Sales from Sales.Orders. So let's go and execute it. And with that we got our 10 rows. Now let's say that I
would like to create only one bucket from the data. So NTILE with only one bucket, then OVER, and let's say no
PARTITION BY, let's take ORDER BY Sales descending. So that's it. I'm going to call it one bucket. So let's go and
execute it. As usual SQL is going to go and sort the data and then calculate the bucket size. It's going to be 10 rows
divided by one. So the size of the bucket is going to be 10. So that's why you're going to see everywhere here a
one, because all those rows are going to fit into one bucket. So this is very simple. We have only one bucket. Let's go and
now have two buckets. So I'm just going to copy and paste. And instead of one, we're going to have two, and let's call
it two buckets. So let's go and execute this. So now here again, what is the size of the buckets? It is 10 divided by
two. So we will get perfectly sized buckets. So the first bucket is going to be five rows and the second one is going to be
the next five rows. So it is perfect. Let's go to the next one. Let's have three buckets. So three. So let's
go and execute. So now what is going to happen is that SQL is going to go and divide 10 by three in order to get the size of the
bucket. And it's going to be 3.3. So it is a decimal and we will not get perfectly sized buckets. So again the larger group
comes first, then the smaller. So as you can see we have to fit four rows in the first group in order to get the others
with three. So that's why the first bucket is going to be the biggest one. So four rows go into the first bucket. Then
the next three rows are going to be in bucket two. And as well the last three are going to be in bucket three. So as you can
see the largest group is going to be the first bucket. So now let's keep playing with the data. Let's go and take now
four. We would like to have four buckets. Now things are going to get interesting. So now by checking the
result, SQL is going to divide 10 by four and we will get 2.5. So again we will
not get perfectly sized groups. So SQL has to fit now 10 rows into four groups. So the first three rows are going to fit
in bucket number one, and as well the second three rows like this are going to be in bucket number two. And then you
can see over here we have two buckets with a size of two. And with that we can fit 10 rows into four groups. And again you
can see the larger groups come first, like this one and then the second, and the smaller ones come later. Okay. So this
is how NTILE works in SQL. And now you might say, you know what, why do I need buckets in the first place? So what
is the use case? There are two use cases for the NTILE function in my projects. On one
hand, if I am a data analyst, I'm going to use the NTILE function in order to segment my data. On the other hand, if
I'm a data engineer, I'm going to use the NTILE function in order to do ETL processing and as well to do load
balancing. So now let's start with the first use case as a data analyst, where you want to do segmentation with the
NTILE function. Segmentation is a very nice way to understand your data. So you can go and segment your
data into different buckets or groups, like for example doing segmentation for the customers. So you can go and group
your customers depending on their behavior, like the total sales or the total number of orders. So with that you
can make, for example, a VIP segment, and then the medium and then the low. So now in order to understand the
segmentation use case let's have the following task. Okay. The task says: segment all orders into three categories,
high, medium and low sales. So in order to solve this let's do the basic stuff, right? So select OrderID. Let's take
the Sales from our table Sales.Orders and let's go and execute it. So as usual we got our 10 sales. So now if you check
the task it says we need three categories. So that means we need three buckets, right? And it says high, medium
and low sales. So that means we are dividing by the sales. So let's go and do it step by step. So we're going to
use NTILE since we need to segment the data. Three categories means three buckets. And then let's define the
window: OVER, and we don't have to divide the data with PARTITION BY, we just need to sort it first by the sales. So it's
going to be ORDER BY Sales, and let's take DESC since we want to sort it from the highest to the lowest. So that's it.
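As a sketch of what this window produces, the query below puts the three-bucket split next to the one, two and four bucket experiments from the earlier practice. The ten OrderID and Sales pairs are hypothetical stand-ins for the course table, chosen without ties so the result is deterministic; the column aliases are illustrative names.

```sql
-- Ten hypothetical orders, bucketed four different ways over the same sort.
SELECT
    OrderID,
    Sales,
    NTILE(1) OVER (ORDER BY Sales DESC) AS OneBucket,    -- sizes: 10
    NTILE(2) OVER (ORDER BY Sales DESC) AS TwoBuckets,   -- sizes: 5, 5
    NTILE(3) OVER (ORDER BY Sales DESC) AS ThreeBuckets, -- sizes: 4, 3, 3
    NTILE(4) OVER (ORDER BY Sales DESC) AS FourBuckets   -- sizes: 3, 3, 2, 2
FROM (VALUES
    (1, 10), (2, 15), (3, 20), (4, 60), (5, 25),
    (6, 50), (7, 30), (8, 90), (9, 35), (10, 55)
) AS Orders (OrderID, Sales)
ORDER BY Sales DESC;
```

Whenever the row count does not divide evenly, the extra rows go to the lowest-numbered buckets, which is the "larger groups come first" rule.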
Let's call it Buckets. So let's go and execute this. So now if you check the data you can see that they are
segmented into three buckets. So the first bucket is going to contain all orders with high sales. Then the second one
is going to be all orders with medium sales. And then the last one is going to be all orders with low sales. So as you can
see we have already categorized our data into three groups. But now as you can see we have numbers, and maybe the user
is expecting to have the text high, medium, low. So that means what we're going to do now, we're going to go and
translate those numbers into text, into words. And of course we cannot do that inside the window function. We're going
to use data transformation using the CASE WHEN statement. Don't worry about it. We're going to have a complete
dedicated section explaining CASE WHEN. So for now just follow me in order to see how this works. We're going to go
and use a subquery. So it's going to be SELECT, and let's take the star for everything, and then let's have the
following logic. CASE WHEN Buckets equals one, then it is 'High', the sales are high. So we are just mapping the numbers
into text. Next, WHEN Buckets equals two, then we are targeting the medium, 'Medium'. And then the
last group, WHEN Buckets equals three, then those sales are 'Low'. So let's END it and let's call it
SalesSegmentations. So that's it. Let me just make it a little bit smaller in order for you to see it. All right, so FROM,
and then we have our subquery like this. So as you can see we just mapped the numbers into text. We are just doing
translations. So let's go and execute it. And now by checking the results we got our three categories for the users.
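Put together, the segmentation task might look like the sketch below. The Orders CTE with its ten rows is a hypothetical stand-in for the course's orders table, and the names Buckets and SalesSegmentations follow the narration as best it can be heard, so treat them as assumptions.

```sql
-- Hypothetical stand-in for the orders table (ten rows, no ties).
WITH Orders (OrderID, Sales) AS (
    SELECT * FROM (VALUES
        (1, 10), (2, 15), (3, 20), (4, 60), (5, 25),
        (6, 50), (7, 30), (8, 90), (9, 35), (10, 55)
    ) AS v (OrderID, Sales)
)
SELECT
    *,
    -- Translate the bucket numbers into readable labels.
    CASE
        WHEN Buckets = 1 THEN 'High'
        WHEN Buckets = 2 THEN 'Medium'
        WHEN Buckets = 3 THEN 'Low'
    END AS SalesSegmentations
FROM (
    -- Inner query: sort by sales and split into three buckets.
    SELECT
        OrderID,
        Sales,
        NTILE(3) OVER (ORDER BY Sales DESC) AS Buckets
    FROM Orders
) AS t
ORDER BY Sales DESC;
```

With ten rows and three buckets, the four highest sales are labeled High and the remaining two groups of three are labeled Medium and Low.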
So the first category is going to be the high sales. The second one is going to be the medium sales and the third one is going
to be the low sales. So guys, you see, NTILE is very powerful for segmenting our data. So now you can go and
segment stuff like the customers by their total sales, or the products by prices, employees by their salaries and
so on. All right. So this is the first use case for the NTILE function as a data
analyst, where you go and segment your data in order to understand the behavior. Now on the other hand, if you
are a data engineer, you can use the NTILE function in order to do load balancing in your ETL. So now I'm just going to
explain it with a very simple sketch. All right. So now we have the following scenario where we have two databases and
we would like to move one big table from database A to database B. So in this case I'm doing something called a full
load. That means I'm loading all the rows from one database to another. So if you do it in one go, what could happen is
that it could take a long time. So it could take hours or even sometimes days, and maybe at the end you will get
some network errors because you have stressed the network between those two databases, and everything is going to break
and you're going to lose the data and you have to start again. So now instead of loading this table in one go, what we
can do is go and split it into fractions or, let's say, packets. So we can split this table for example into
four small tables using the NTILE function. So now after we split this big table into small tables, we're going to
go and start moving those small tables one after another, and with that we are not stressing the network and it's
going to succeed. So now after loading everything, at the end in the target database we're going to have those small
tables, and of course we can go and use UNION in order to merge them so that we have again the big table that
we have in the original database. So this is a very common use case for
NTILE, in order to split the load and to
balance the processing of extracting data. All right. So now we have the following SQL task. It says: in order to
export the data, divide the orders into two groups. So let's go and do that. First we're going to select everything
from the table, just in order to see the data, Sales.Orders. So let's go and execute it. So now we got our 10 orders
and what we have to do is go and split them into two groups. In order to do that we can use the NTILE function. Two
groups means two buckets. So let's define the window. So here we don't have to partition the data using PARTITION BY
but we have to specify the ORDER BY. So now which column are we going to use in order to sort the data? Of course here
there is no rule, like you can go and split the data by sales or by the order status, by date, by anything you want. But
we usually go and use the primary key. It's just systematic, better, and cleaner, especially if you have a sequence
of numbers in the order ID. So you can export the first range of the orders, then you can go to the next group and so
on. So let's go with the OrderID and let's give it the name Buckets. So that's it. Let's go and hit execute. Now, as
you can see, it's very simple. We got our two groups. So this is the first batch of the data and this is the
second batch of data. So now we can go and select the first batch and export it, import it in the next system. And
then after that we go with the second batch. And of course if you still suffer from the size of those packets, you can
go and split them into smaller sizes. So you can go over here and make it four. So with that we're going to get smaller
buckets and it might be easier to export the data. So this is a really great use case for the NTILE function. All right
everyone. So with this you have learned the two use cases for the NTILE function that I usually follow in my
projects. So as a data analyst you can use it in order to do segmentation, and as a data engineer you can use it in
order to do load balancing of the ETL. Okay everyone, so with that we have covered everything about the integer
based ranking functions. Now we're going to talk about the second method. We have the percentage-based ranking
functions, and here we have two functions, CUME_DIST and as well PERCENT_RANK. So now let's have a quick recap. So with
the percentage-based ranking, SQL is going to go and calculate a relative position as a percentage and assign it to each
row. So the output is going to be a continuous normalized scale from 0 to 1. And this is really amazing in order
to do distribution analysis. So those functions are going to consider in their calculation the overall total, the whole
size of the data set, which can help us in order to find out the contribution of each value to the overall total. And now
in SQL in order to generate the percentage we have two different formulas. So on one hand we have the
function CUME_DIST and on the other hand we have PERCENT_RANK. So that means we have two different functions with
different formulas in order to generate and calculate the percentage. So now let's start with the first function,
CUME_DIST. All right everyone. So now let's start with the first function. We have CUME_DIST and it stands for
cumulative distribution. It's going to go and calculate the distribution of your data points within
a window. So what does this mean? In order to understand it, we're going to go and have a very simple example to understand
how SQL works with this function. So let's go. All right. Again we have our very simple example of the sales and we
have the following query. So CUME_DIST, then we don't give any argument inside it. So it's going to be empty, and the
window is going to be like usual: ORDER BY Sales descending, from the highest to the lowest, and the ORDER BY is a must. So the
first step is that SQL is going to go and sort the data. We have it already sorted from the highest to the lowest. So now the
next step is that SQL is going to go and start calculating the percentage for each row. And we have a very simple
formula. It says CUME_DIST equals the position number of the value divided by the number of rows. It's very simple. Let's
do it step by step. So SQL is going to start with the first value in our list. So it is going to be calculated like this.
So what is the position number of the first value? It's going to be one, right? So this is the first value in our
list. And what is the total number of rows? We have five rows, right? So 1, 2, 3, 4, 5. So we're going to divide one by
five. And the result is going to be 0.2. So this is going to be the value for the first row. Okay. So now SQL is going to go
to the next row. And this time we're going to get a special case. As you can see, we have the 80 twice. So we have
here a tie. So now first we need the position number. As you can see, we are at position number two, right? But
since we have the 80 multiple times, SQL is going to go and take the last position where we see the value 80, and the last
position is going to be record number three. So that's why SQL is going to say for this record it's going to be
position number three and not two, and then it's going to go and divide it by five and we will get the value of 0.6.
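The formula and the tie rule can be checked with a small query. This is a sketch under one assumption: the narration gives the sales values 80, 80, 50 and 30 but not the highest one, so 100 is used here as a placeholder. DistRank is an illustrative alias.

```sql
-- CUME_DIST = (last position of the value in the sort order) / (number of rows)
SELECT
    Sales,
    CUME_DIST() OVER (ORDER BY Sales DESC) AS DistRank
FROM (VALUES (100), (80), (80), (50), (30)) AS t (Sales)
ORDER BY Sales DESC;
-- DistRank, top to bottom: 0.2, 0.6, 0.6, 0.8, 1
```

Both rows with the value 80 get 3 / 5 = 0.6, because the last position of that value is used for every row in the tie.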
So this is the most confusing thing with this function. So if SQL finds a tie, it will completely ignore the current
position number. So we don't have two. It is going to go and take the last position number for the same value. And
the last in our list is going to be record number three. So that's why we have three over here. Okay. So now let's
keep moving. Let's go to the third row. And as you can see, we are again in the tie. But this time, this is the last
time we see 80. So next we don't have 80. So what's going to happen? We're going to have the exact same result. So it's
going to be 3 divided by 5. So as you can see, if we have a tie, the rows are going to share the same percentage. So that means
with CUME_DIST, if you have the same values, they are going to share the same rank. So let's keep moving to the fourth
one. So now what is the position number of the 50? We are at record four. So position number four divided by five, we
will get 0.8. Okay. So now let's move to the last one and it is the easiest one. So which position do we
have over here? It is position number five. It's the last one. And the number of rows is five. That's why we
will get one. So guys, that's it. This is how the cumulative distribution works. Once you understand the formula,
it's going to be very easy to understand the output. So as you can see, calculating the percentage always
depends on the total size of our data set. You can see here the number of rows. So with this we're going to get an
output that helps us to understand the distribution of our data points within the data
set. All right everyone. So now we're going to go and focus on the second function that generates a percentage as a
rank. We have PERCENT_RANK. So PERCENT_RANK is going to go and focus on generating the relative position of each
row within a window. So in order to understand what this means, we can have a very simple example in order to
understand how SQL works with this function. So let's go. Okay, again we have those sales, a very simple example, and
the syntax is going to be like this: PERCENT_RANK, and inside it we don't use any arguments, and the window is going to be
like this: ORDER BY, it is a must, Sales descending, from the highest to the lowest. The first step that SQL is going to
do is to go and sort the data from the highest to the lowest, and we have it already like this. And next
SQL is going to go and start calculating the percentage, which is very similar to the cumulative distribution, but this time
it's going to be like this: the position number minus one, divided by the number of rows
minus one. So it's like the same formula, but we are subtracting one from both numbers. Okay. So now
let's go through all rows step by step and see the output. So SQL is going to start with the first row, right? So
what is the position number of the first row? It's going to be one. Then we have to subtract one from it. That's why we
will get zero. Now what is the total number of rows? So we have here five rows, and we subtract one from it, that's
why we're going to get four. So now 0 divided by any value, the output is going to be zero. So that's why for the first
value we will get zero. All right. So now let's move to the second row over here. And here we have our special case
where we have a tie. So we have two sales sharing the same value 80. So now for PERCENT_RANK SQL is going to have
a different behavior than with CUME_DIST. Remember, with CUME_DIST SQL searched for the last position of the shared
value. So it was position number three, since this is the last time we see 80. But now with PERCENT_RANK SQL is
going to stick with the first occurrence of the shared value. So now by checking those two 80s, what is the
first occurrence? It is record number two. So that's why we have position number two, and after subtracting one we
will get one. And here it's the same: the total number of rows is five, and after
subtracting one we have four. So now if
you divide one by four we will get the result of 0.25. So this is the percentage of this value. So now let's
go to the third row. Here we have again the tie. So SQL is going to stick with position number two, the first
occurrence. So it's going to be the same: two minus one, we will get one. And as well the total number of rows,
five minus one, we will have four. That's why we will get the exact same
result. So here as you can see with
PERCENT_RANK it's like with CUME_DIST: rows with a shared value are going to share as well the same percentage rank. Now let's move to
the fourth one. So we have the value 50. So what is the position of this? It's going to be record number four.
Subtract one from it and we will get three. And if you divide three by four you will get
0.75. And now moving to the last value over here, it's going to be easy. So what is the position number of the 30? It is
five. Five minus one is going to be four. And as well we're going to have four here for the total
number of rows minus one. So if you divide four by four you will get one. So that's it guys. This is how PERCENT_RANK
works. It always has a scale from 0 to 1. So it's always like this. It doesn't matter which values we have
inside, and it's going to have a continuous scale. And again here, if you have a tie, the rows are going to go and share
the same percentage rank. Okay guys. So now if you go and compare those two functions you're going to see that they
are really similar to each other. With the output of both functions we are generating a percentage-based ranking, and
both of them handle ties as well, so tied values share the same percentage rank. If you check the syntax
they are very similar. And now by checking the formulas of both of them, we are always considering the overall size
of the data set. So here the size is considered in the calculation to help us find the relative position of each
value to the overall, and this is very important in the analysis in order to measure the contribution of each value
to the overall. So now about the use cases: if you want to focus on the distribution of your data points, go with
the cumulative distribution, but if you want to focus on the relative position of each row, then go with
PERCENT_RANK. All right. So now there is one more difference between CUME_DIST and PERCENT_RANK, and that's if you check the
formulas. You can see that CUME_DIST is more inclusive. We always consider the position number of the current row. But
with PERCENT_RANK we don't consider the current row. We skip it, or make it exclusive. So we say
PERCENT_RANK is more exclusive and the
cumulative distribution is more inclusive. So now if you ask me the hard
question, which one to use, I'm going to say if you want to be more inclusive, go with the cumulative distribution. If you
want to be more exclusive with the current row, go with PERCENT_RANK. So they are very similar to each other. So
if you want to calculate the distribution of your data, go with the cumulative distribution. If you want to
find the relative position of each row, then go with PERCENT_RANK. All right. So now we have the following task
that says: find the products that fall within the highest 40% of the prices. Let's go and solve this. Now we are
targeting the table products and I will just select two columns, Product, Price from Sales.Products. So that's it.
Let's go and execute this. So now as you can see we got five products and their prices. And the task says find the
highest 40%. So we have to generate a percentage rank. In order to do that we have the two functions CUME_DIST
and PERCENT_RANK. I will go this time with CUME_DIST. So let's go and do that. So CUME_DIST, and then let's go
and define the window like this. It's going to be ORDER BY, and we are targeting now the prices, right? So ORDER BY
Price DESC, from the highest to the lowest, and let's give it a name, DistRank. So let's go and execute this. So now with
that SQL is going to go and generate for us a percentage ranking using the formula that we just learned before. So now in
the output we are getting all the products, but the task says we have to get only the products that are in the
highest 40%. So that means the first row, the second row, and that's it. So those rows are in the highest 40%, the rest are
below that. So in order to filter the data we're going to use a
subquery. So SELECT star FROM, and then
we have our subquery like this, and then our filter is going to be DistRank smaller than or equal to 0.4. So this is our
threshold in order to get the data. So let's go and execute this. And now as you can see we got the top products, the
top 40%. Now of course you can go and format the percentage. We can do that like this. So let's take the DistRank
and multiply it by 100. So let's go and execute this. So as you can see we got 20 and 40. We can go and add to it
as well the percentage character, right? So we can go and say CONCAT and we're going to add the character after that
like this, and let's call it DistRankPerc. So that's it. Let's go and
execute it. So with that we have solved the
task. We have the products that fall within the highest 40%. Now, of course, you can go and try PERCENT_RANK. So,
it's very simple. We just have to go and switch the cumulative distribution with the function PERCENT_RANK. So, let's go
and execute it. Now, as you can see, we will get the exact same results. So,
we're still getting the gloves and caps
as the products within the highest 40% of the prices. So, guys, that's it. It's very simple, right?
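The full solution might look like the following sketch. Only Gloves and Caps are named in the narration as the two most expensive products; the other three product names and all five prices are hypothetical, and the Products CTE stands in for the course table. The aliases DistRank and DistRankPerc are reconstructed from the narration and should be treated as assumptions.

```sql
-- Hypothetical stand-in for the products table (five rows, no ties).
WITH Products (Product, Price) AS (
    SELECT * FROM (VALUES
        ('Gloves', 30), ('Caps', 25), ('Socks', 22),
        ('Tires', 15), ('Bottle', 10)
    ) AS v (Product, Price)
)
SELECT
    *,
    -- Format the fraction as a percentage string.
    CONCAT(DistRank * 100, '%') AS DistRankPerc
FROM (
    SELECT
        Product,
        Price,
        -- Swap CUME_DIST() for PERCENT_RANK() to try the other formula.
        CUME_DIST() OVER (ORDER BY Price DESC) AS DistRank
    FROM Products
) AS t
WHERE DistRank <= 0.4;
-- Returns the two most expensive products: Gloves and Caps
```

With five distinct prices, CUME_DIST gives 0.2 and 0.4 for the top two rows, while PERCENT_RANK gives 0 and 0.25, so the same two rows pass the 0.4 threshold with either function. On other data the two functions can select different rows, so the choice of threshold should be checked against the formula in use.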
All right friends, so now let's have a quick recap of the window ranking functions. So what they're going to do,
they're going to go and assign a rank to each row within a window. And we have two types of ranking, right? The
first one is the integer-based ranking. It's going to go and assign a number, an integer, to each row. And here we have
four functions: ROW_NUMBER, RANK, DENSE_RANK, and NTILE. And the second type of ranking, we have the percentage-based
ranking. So SQL here is going to go and calculate a percentage rank and then assign it to each row. And here we have two types
of formulas or functions. So we have CUME_DIST, the cumulative distribution, and the second one, we have
PERCENT_RANK. And now to the next point, if we are talking about the rules of the syntax. So the expression should be
empty. We should not pass any argument to the functions, with the exception of NTILE, which takes the number of
buckets. We must use ORDER BY in order to sort our data. So it is
required, and the frame clause is not allowed. So you cannot go and customize a frame within the window
function. And as we learned, there are many use cases for the ranking functions. For example, we have the top-N
analysis and the bottom-N analysis in order to identify our top performers or the worst performers in our business.
Another use case: using ROW_NUMBER we can identify and remove duplicates in our data. So we can use it in order to
find data quality issues and as well to improve the quality. And another use case: if our table doesn't have a clean
primary key we can go and generate unique IDs using ROW_NUMBER, and use them as well for paginating. One more use
case was data segmentation: you can use NTILE in order to segment your customers, your products, employees
and so on. And another use case: we can do data distribution analysis. As we learned, we can use CUME_DIST in order to
understand the distribution of our data points compared to the overall. And the last use case, it's more for data
engineering: we can use the NTILE function in order to equalize the loading process of our ETLs. So as you
can see there are many use cases for the ranking functions. Okay, so that's all about how to rank your data using the
window functions, and now we're going to cover the last group. We will learn about the value window functions. How to
access other records. So let's go. All right everyone. So now we have this very simple example. We have the
months and the sales. Now we can use the value functions in order to access a value from another row. So in order to
understand it, let's say that SQL is now processing the months and we are currently at the month of March. So now
for example I would like to access the value from the previous month, from February. So in order to do that we can
use the LAG function in order to get the value of 10. So with that we have in the same row the current sales of the month
March and as well the sales from the previous month, February. And maybe in other cases I would like to get the
sales of the next month, from April. In order to do that we can use the function LEAD and we will get in the same row the
value five. So now I can very quickly compare the current month with the previous month and as well with the next
month. And now in other cases you might be interested in the first month of your list. So it's going to be here
January. So in order to get the sales of the first month you can use the function FIRST_VALUE. So we're going to get in
the same row 20. And now for the last option, I think you already get it. We can go and get the value of sales of the
last month. So here we can get July. So for that we're going to use the function LAST_VALUE and we will get the
value of 40. So this is exactly the purpose of the value functions or analytical functions. We can access a
value from other rows. And here it is really important to understand as well that the value functions are like the ranking
functions. We have to use ORDER BY in order to sort the data, in order to understand what is the first row and the
last row. In this example, the data is sorted by the month. So guys, these access functions are really important for
analytics. You can use them in order to access a value from other rows in order to do comparisons. All right. So
now let's have a quick overview of the syntax and the rules for the value functions. So here we have four
functions: LEAD, LAG, FIRST_VALUE and LAST_VALUE. So as you can see we can group them into two groups. So we have
LEAD and LAG. They are very similar to each other. Especially with the syntax, we can use three
arguments inside them: expression, offset, default, for both of them. For FIRST_VALUE and LAST_VALUE we can use only an expression. So
that means we have to pass a value to all of those functions. You cannot leave it empty. So now about the expression data
type, you can use any field with any data type. There are no restrictions, like for example using only numbers.
Any data type is allowed. Now about the definition of the window. The PARTITION BY as usual is optional, like in any other
group. The ORDER BY here is a must. You must define an ORDER BY. It's like the ranking. So here you cannot leave it
empty. And now we come to the last one. We have the frame clause. There is really different stuff over here. So for
the first two functions, LEAD and LAG, you are not allowed to define any frame. So you are not allowed to define any subset
of data. It's very similar to the ranking. So you must use ORDER BY but you cannot define the frame of the
window. But for the other two functions, FIRST_VALUE and LAST_VALUE, the frame is optional. You can go and use it.
And for LAST_VALUE it is recommended to define a frame clause. Don't worry about it. We're going to have enough examples
in order to understand. So as you can see those functions have different requirements. So there is no generic
rule for all of them. But one thing that they all agree on is that you must use ORDER BY. So now as usual what we're
going to do, we're going to go and deep dive into those functions. We're going to address first the two functions LEAD
and LAG because they are very similar to each other. We're going to understand the use cases, when to use them, and of
course we're going to practice in SQL. So let's go. LEAD and LAG functions. The LEAD
function can allow you to access a value from the next row within a window where the lack function is exactly the
opposite. It's going to allow you to access a value from a previous row within a window. It sounds very easy,
right? So let's understand how is SQL going to execute those functions. Okay. So now let's have a quick overview of
the syntax for both of the functions lead and lag. We have here very simple example for the lead function. So as
usual we start with the function name. It's going to be the lead. And now after that we're going to go and pass the
arguments. And as you can see we have here multiple stuff. So let's do it step by step. So the first thing is that
we're going to go and specify an expression. And the data type could be any data type. It could be a number like
here the sales. It could be a character like names or dates or anything. So this is required. We have to specify an
expression. We cannot leave it empty. And we can use any data type. Now moving on to the next one. We have here a
number. So what is this? This is the offset and this offset is optional. So you can go and skip it. So what offsets
means? What we are doing over here? We are specifying for SQL the number of rows forward or backward from the
current row. So here in this example we are specifying the offset as two using the lead. And with that we are telling
SQL: jump two rows forward and get me the value. And if you are using lag, it means you are telling SQL: go back two
rows and get me the value. So here you are telling SQL how many rows it needs to jump, and if you don't specify
anything, like leave it empty, SQL is going to go and use one. So the default for the offset is going to be one if
you don't specify anything. All right, moving on to the last one, the third one. This is optional as well. You
can go and leave it empty. So here it is the default value. Now what happens with those functions is that sometimes SQL
jumps to the next two rows or something like that and SQL doesn't find anything. So there are no more rows available to
access, and with that SQL is going to go and return a null. So that means if SQL goes to the next rows or goes to the previous
rows and doesn't find anything, SQL by default is going to go and return a null. So if you don't specify anything over
here, in those scenarios you will have a null as the return from the whole function. But in some scenarios you
don't want to have a null, you would like to have a value. So here you are defining the default value. So it should
not be a null, it should be a 10. So SQL, if you don't find anything, return a 10. Don't return a null. So again
guys, the default value, the offset, all this information is optional for you to configure. But you
should know the defaults: if you don't specify anything, the offset is going to be one and the default value is going to be
null. But you must specify an expression. So here you cannot leave it empty. All right. So that's all about
the arguments that you can pass to the lead or lag functions. Then the next parts are the standard stuff. So we have
the OVER clause, then we have the partition by. As usual, partition by is optional. And then to the order by: those functions
are like the rank functions. They require you to sort the data. So it is a must to sort the data, otherwise SQL will
not know what is the next row and what are the previous rows. So we have to sort the data. It is required. You cannot
skip this. So it is not optional. All right. So the syntax is not crazy right? We have the usual stuff but only we can
go and configure the default value and the offsets. Okay guys, now we have very simple example. We have months and sales
and we're going to go and understand how the SQL works for both of the functions lead and lag side by side. So now in the
first example we are interested in the sales of the next month. So in order to do that we're going to use the lead
function. So lead and then we're going to specify the argument. It is the sales. We want the value of sales and
then we define the window like this order by month. So it's going to be ascending. And now in the right side
we're going to be interested in the sales of the previous months. So in order to do that we're going to use the
lag function. So it's going to be very similar to the lead. We have lag and then the sales since we are interested
in the sales and we're going to sort the data by the month. So now let's see how SQL is going to process this information side by side
and row by row. So it's going to start with the first row over here. What is the next month of January? It is
February and we are interested in the sales of this row. So SQL going to take the value from the next row and we're
going to have the value of 10. So now by looking through the January we can see the sales of the next month of February
in the same row. So now let's check the right side over here. Now we are interested in the previous month. So
what is the previous month of the first row? It will be nothing. Right? So we cannot point to anything. That's
why SQL is going to say this is null. There is no previous month for the current row. And we're going to have it as a null.
Okay. So now it's going to go to the next row. We are at February. What is the next month? It's going to be March.
And it's going to point to it. So we will get the 30 as the sales of the next month of March. And on the right side,
what is the previous month of February? It's going to be January, right? So, it's going to get the value the sales of
the previous month. And here we will get 20. So, as you can see, it's very simple. On the lead, we are always
checking the next values. On the lag, we are always checking the previous value. So, let's keep going. We are currently
at March. What is the next month? It's going to be April. So, it's going to go and point to it like this. and we will
get the sales of the next month April. For the March on the right side, what is the previous month? It is February.
Right? So, it's going to go and point to February. So, we will get the sales of 10. And now, interesting to the last row
over here. You can see that we are at April. What is the next month of April? There is nothing because we are at the
end of our table, right? So, since there is no month after that, we will get a null in the output. But for the lag, we
still have a previous month for April. So what is the previous month? It is March. And we will get the sales of the
March. So it's going to be 30. So that's it guys. It's really simple, right? It's just like they are doing the opposite
things. So now if you check those values side by side, you can see that with the lead, we will always get a value for the
first row, but for the last row, it will always be null because there is no next value. We are at the end of the table.
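To make the row-by-row walkthrough concrete, here is a minimal sketch that runs both functions on the four months from the example. It uses Python's built-in sqlite3 module only so it can be run anywhere (window functions need SQLite 3.25 or newer). The table name monthly_sales and its columns are assumptions made for illustration, not names from the course database.

```python
import sqlite3

# Hypothetical table mirroring the slide: month number, month name, sales.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?)",
    [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5)],
)

# LEAD looks one row ahead, LAG one row back; ORDER BY inside OVER is required.
rows = conn.execute(
    """
    SELECT month,
           sales,
           LEAD(sales) OVER (ORDER BY month_no) AS next_month_sales,
           LAG(sales)  OVER (ORDER BY month_no) AS prev_month_sales
    FROM monthly_sales
    ORDER BY month_no
    """
).fetchall()

for row in rows:
    print(row)  # NULL comes back as None in Python
```

April has no next row and January has no previous row, so those two cells come back as None.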
But if you check the lag for the first value, we will always get a null because there is no previous value or previous
record from the first row. And for the last record, as you can see, we're always going to get a value because we
will have a previous value. Okay, let's move on in order to understand how SQL works this time with the offsets and the
default value. So now we have the same data, but we have different task. So now on the left side, we would like to get
the sales of two months ahead. So it's not the next month, it's going to be two months. And we would like to tell SQL: if
you don't find any value, don't return null, return a zero for us. So this is going to be our default. Now if you
check the syntax it's going to be exact like before but we are adding now an offset of two because we are interested
in two months ahead and we are specifying here a default value zero. So if you don't find anything put zero
don't put null. Now on the right side we have the exact opposite. We are interested in the sales of two months
ago. So we are not interested in the direct previous month we need the sales of two months ago. And here the same
thing if you don't find anything don't return null give us a zero. So as you can see we have the same syntax but
using the function lag. So now let's understand how SQL is going to execute this step by step and side by side. So it's going
to start with the first month, January. So now SQL is going to ask: what is the sales of two months ahead? So we are at
January. It will not be February it's going to be the month of March. So it's going to go and point it like this and
we will get the value of 30. So 30 is the sales of two months ahead. And now on the right side we are as well at
January. It's going to ask the question what is the sales of two months ago. So we don't have any previous data. Right?
So we will not get anything. It would return null, but it's going to check: do we have a default value? Well yes. So
this time SQL will not return null. It's going to return the default value. And this time it's going to be zero. All
right. So now let's go to the next value. We are currently at February. What is the sales of two
months ahead? So it will not be March, it's going to be April. So it's going to go and point it like this and we will
get the value of five. So now on the right side we are currently at February. Now the question is what is the sales of
two months ago? We have history. We have the previous month but we don't have two months in the history. That's why we
will still get zero as the output with the default value. Okay. So now let's keep going to the next value. We
are currently at March. SQL is going to ask: what is the sales of two months ahead? We have only one month after that
but we don't have two months. That's why SQL will not find anything. It would return null, but it's going to
go and use the default. So here we're going to go and get the value of zero. There is no more data available in the
table. But now on the right side we are currently at March and we are asking what is the sales of two months ago. So
now we have enough history in the past and it's going to get the value of 20. All right. So now let's go to the last
month over here in our table. April. What is the sales of two months ahead? We don't have any data. So it's going to
be zero as well. But now on the right side, we are currently at April. What is sales of two months ago? We have enough
history. That's why SQL is going to go and point to it like this. So we will get the February value, which is 10. So that's it.
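The offset and default-value walkthrough can be reproduced with a small sketch. As an assumption for illustration, it uses Python's built-in sqlite3 module (SQLite 3.25 or newer) and a hypothetical monthly_sales table holding the four months from the example.

```python
import sqlite3

# Hypothetical table with the same four months used in the walkthrough.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?)",
    [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5)],
)

# Second argument = offset (2 rows), third argument = default used instead of NULL.
rows = conn.execute(
    """
    SELECT month,
           sales,
           LEAD(sales, 2, 0) OVER (ORDER BY month_no) AS two_months_ahead,
           LAG(sales, 2, 0)  OVER (ORDER BY month_no) AS two_months_ago
    FROM monthly_sales
    ORDER BY month_no
    """
).fetchall()

for row in rows:
    print(row)
```

Wherever there is no row two steps away, the default 0 shows up instead of a null.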
This is how SQL works with the lead and lag using offsets and as well default value. Let's go back in SQL in order to
practice those two functions. Okay, so now we have the following task and it says analyze the
month over month performance by finding the percentage change in sales between the current and the previous month. So
that means we have to go and compare the current month with the previous month. So the main use case for lead and lag
is to do comparison analysis, and we have a very common use case: it's called time series analysis. So it is the
method of analyzing our business and our data in order to understand the patterns and trends over time. And one of the
most important and classic questions that you're going to get from the decision makers or the business is to do
year-over-year analysis or month-over-month analysis. So the year-over-year analysis is going to help us
understand the overall growth or decline in the performance of our business over the years. But on the
other hand, we have month-over-month analysis in order to do short-term trend analysis and as well discover the
patterns in the seasonality. So the main focus is to understand the performance of our business over time. So now
let's go back to it in order to solve the task. Okay guys, so now let's go and do it step by step. Now what is the
first step? Before we go and compare things together, we have to collect the data. We have to do the calculations
first. So we have to find out first the total sales for the current month and then the total sales for the previous
month. And after that we can go and compare them. So now let's start with the easy stuff. We have to find out the
current sales for the current month. So in order to do that, let's just do very simple select. So what do we need? We
need, let's take the order ID. Let's take the order date because inside it we have the month. And let's go and collect the
sales. So that's it for now from sales orders. So let's go and execute this. So now in the result we got the usual
stuff. We have 10 orders, sales and order dates. But the order date is on the level of the days and we are not
interested on the whole date. We would like to get only the month in order to calculate the total sales for the month.
Now we're going to go and use a function in order to extract the month from a date. Don't worry about it. We're going
to have a dedicated chapter in order to show you how to deal with the dates format in SQL. So now what we're going
to do, we will use a very simple function called MONTH on the order date. And let's call it order month. So that's
it. Let's go and execute it. Now, as you can see, we got a new field where we have only the month information. So
here we have January, February, and March. So now the next step is that we want to find the total sales for each
month. So what we're going to do, we're going to go and use group by. So, let's do that. We're going to go and say we
want the sum of sales. I'm just going to call it current month sales. And let's go and get rid of all those
informations. We're going to go and group by the month, right? So, group by and let's have the month. So, that's it.
Let's go and execute it. So, it's very simple, right? We got now the three months and the total sales of the
current month. So now with that we got the first information that we need in order to do the comparison. We have for
each row the total sales for the current month. So now the next thing that we're going to do is to find out the total
sales for the previous month like side by side in the same row. And in order to do that we have learned we can go and
use the lag function. So we're going to go and integrate the lag window function in the same group by. So we're going to
do it like this. So lag we are now interested in the previous month. So that's why we're going to go and get the
sum of sales as an expression inside it. And after that we're going to define the window. It's going to be like this over
and order by is a must. So we're going to go and sort the data by the month. Right? So let's go and do it. And with
that we have defined the previous month sales. So let's name it previous month sales. So now let's go and execute it in
order to see the results. All right. So now let's check the results. The first row what is the previous month? There is
no previous month. We are at the first record and the first month that's why we have null. Now let's go to February.
What is the sales of the previous month from January? It is 105. So this is correct. And now to the last value to
the March. What is the sales of February? The previous month it is 195. So with that we got the two
pieces of information. We have the current month and as well the previous month. So guys as you can see it's magic right? It's
very simple. We can go and use the lead and lag functions in order to access values from other rows without
doing any complicated joins and so on. Okay. So now what is the next step? We're going to go and subtract the previous
month's sales from the current month's total sales. So in order to do that we're going to go and use a subquery
like this. So select star from and we're going to have it like this as subquery. And now the calculation is very simple.
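The whole month-over-month task can be sketched end to end as follows: monthly totals, the previous month via lag inside the same GROUP BY, then the change and the percentage change in an outer query. This is a sketch under stated assumptions, not the course's exact query: it runs on Python's built-in sqlite3 module, the sales_orders rows are made up to reproduce the monthly totals quoted in the walkthrough, strftime stands in for MONTH, and the cast uses REAL where the course casts to float.

```python
import sqlite3

# Made-up orders: (order_id, order_date, sales). Not the course data.
ORDERS = [
    (1, "2025-01-05", 25), (2, "2025-01-10", 10), (3, "2025-01-15", 20),
    (4, "2025-01-20", 50), (5, "2025-02-01", 25), (6, "2025-02-05", 50),
    (7, "2025-02-15", 30), (8, "2025-02-18", 90), (9, "2025-03-10", 20),
    (10, "2025-03-20", 60),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_orders (order_id INTEGER, order_date TEXT, sales INTEGER)")
conn.executemany("INSERT INTO sales_orders VALUES (?, ?, ?)", ORDERS)

rows = conn.execute(
    """
    SELECT order_month,
           current_month_sales,
           previous_month_sales,
           -- current minus previous: positive means growth
           current_month_sales - previous_month_sales AS mom_change,
           -- cast before dividing, otherwise integer division gives 0
           ROUND(CAST(current_month_sales - previous_month_sales AS REAL)
                 / previous_month_sales * 100, 1)      AS mom_percentage
    FROM (
        SELECT CAST(strftime('%m', order_date) AS INTEGER) AS order_month,
               SUM(sales) AS current_month_sales,
               LAG(SUM(sales)) OVER (
                   ORDER BY CAST(strftime('%m', order_date) AS INTEGER)
               ) AS previous_month_sales
        FROM sales_orders
        GROUP BY CAST(strftime('%m', order_date) AS INTEGER)
    ) AS monthly
    ORDER BY order_month
    """
).fetchall()

for row in rows:
    print(row)
```

The first month has no previous month, so its change and percentage stay null (None in Python).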
Let me just move this a little bit down. So it is the current month sales minus the previous month sales, and let's go and
call it month over month change. So that's it. Let's go and execute this. So now let's go and check the results for
the first month. You can see that we don't have any value and that is correct because the previous month is empty. So
there is no change. And now moving on to the February. You can see over here we got plus 90. That means we have here
improvement in the performance of our sales. Now moving on to the last one. It's really bad. We have decline in our
performance. We can see that we have minus 115. So that means the current month is doing really bad compared to
the previous month. So the March is really bad month. Okay. So now as you can see in the output we got the
absolute numbers but the task says find the percentage change. So we have to convert this to a percentage and we can
do it like this. It's very simple. Let's do it in a new column. Just going to zoom out a little bit. So, it's going to
be the change the differences divided by the previous month sales. And then let's go and multiply it with 100 in order to
get the percentage. So, like this. And now, as you can see, we got zeros. And that's because those numbers are
integer. So, we have to go and cast one of those values. Just going to do it for the first. So, cast and float. So,
that's it. Let's go and execute it again. Now the result looks better. We have the percentages but we have a lot
of decimals. So let's go and round the number to, let's say, one decimal. So only one, and let's give it a name:
month over month percentage. So let's execute. So now as you can see things get better. And with that we have
calculated the percentage change in sales between the current and the previous months. And this is how we do
month-over-month analysis. All right. So now we have another use case for the lead and lag functions. We
can use them in order to do customer retention analysis. It's all about measuring customer behavior and
loyalty. So we are helping the business and decision makers to build strong relationships with the loyal customers
and as well to focus on their needs. So now let's see how we can use the lead and lag functions in order to do
customer retention analysis. So let's go. All right. So now we have the following task and it says: in order to
analyze customer loyalty rank customers based on the average days between their orders. So there is a lot of things
going on over here. Let's do it step by step. And I would like always to start with a very simple select. So let's go
select informations like the order ID. Let's get the customer ID and as well since we want the days we would like to
have the date. So order dates from the table sales orders and let's go and sort the data. So order by customer ID and
order dates. So that's it. Let's go and execute. So now as usual we got our 10 orders, the customers and when they did
order. So now let's check the task. Let's solve this over here. Days between their orders. So we have to find how
many days are between two orders. For example, if we check the customer number one over here, he did order around 10
January and the second order is like after 10 days 20 January. So we have to go and subtract those two dates. Now in
order to subtract those informations and do calculations, we have to have everything in the same row. So for
example, if we are at the first row over here, I would like to have as well one column about the next order. So the date
of the next order. So we have to access a value from another row. Of course, we can go and do joins, but we have lead
and lag functions. And for this scenario, we're going to go and use the lead window function. So let's go and do
that. I'm going to go and call the order date over here the current order. And let's go and calculate the lead. So I
would like to get the next order date. So I would like to get this value over here in the same row. That's why this
time we're going to pass the order date. And now let's go and define the window. Now we have to go and partition the data
because we are analyzing each customer separately, right? So that's why we have to partition by the customer ID.
And of course in order to do the lead, we have to use the order by. So let's go and define that as well. Order by and
it's going to be by the order date. So now we have to give it a name. The order date here is the current order. This
going to be the next order. So next order. Let me zoom out a little bit and make this smaller. So let's go and
execute it. So now as you can see in the output we got a new column called next order. And with that we got the current
order, the current row and as well the value from the next row. So what is the next row? It's going to be the 20
January. The same thing of course for the next row. Over here we have the current order date and the next order
date. So this value going to be exactly as the next one over here 15 of February and then since we are working with
window since this is the whole window over here the last order for this customer it's 15 of the February there
is no next order so this going to be null the same thing if you check the other customers you're going to see
always the last order don't have any next order so looks like everything is fine and for the last customer he has
only one order so now with this we got all the informations for our calculations. So we have the current
order and the next order in the same row. Now we can go and subtract them in order to get the days between those two
orders. And now in order to subtract dates we have to use the function DATEDIFF. Don't worry about those functions.
We're going to explain all those stuff in the next chapters. So now just follow me with those steps. What we're going to
do, we're going to go and subtract this date the order date with the whole thing over here. Right? So the whole thing
here is the next order. So let's do it in a new line and it's going to be very simple. So DATEDIFF: we are finding the
difference between two dates. So the syntax is going to be like this. First we have to define what we are talking
about. Are they days, months, years and so on. So we have to tell SQL: find me the difference in days. Now we have to
specify two dates. So the first one is going to be the order date. This is the current date, and the second date is going
to be the whole thing from here. So let's take it and put it side by side and this calculation going to give us
number of days. So we're going to call this days until next order. All right. So now let's go and execute the whole
thing. So now let's check the result. As you can see over here we got 10. So this is 10 days between those two dates and
the next one we have around 26 days. Here we have a null because we don't have here a date and for the next one we
have 31 days. So we have a whole month over here. So everything is working perfectly and with that we have solved
only this part days between their orders. So guys you see right this is the magic of the lead and lag function.
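The "days between their orders" step can be sketched like this. Assumptions for illustration: Python's built-in sqlite3 module, a made-up sales_orders table (only customer 1's dates follow the dates mentioned in the walkthrough), and a julianday subtraction standing in for DATEDIFF, which SQLite does not provide.

```python
import sqlite3

# Made-up orders: (order_id, customer_id, order_date). Not the course data.
ORDERS = [
    (1, 2, "2025-01-05"), (2, 1, "2025-01-10"), (3, 3, "2025-01-15"),
    (4, 1, "2025-01-20"), (5, 4, "2025-02-01"), (6, 2, "2025-02-05"),
    (7, 1, "2025-02-15"), (8, 3, "2025-02-18"), (9, 2, "2025-03-10"),
    (10, 3, "2025-03-20"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_orders (order_id INTEGER, customer_id INTEGER, order_date TEXT)")
conn.executemany("INSERT INTO sales_orders VALUES (?, ?, ?)", ORDERS)

# LEAD is partitioned by customer, so the "next order" never crosses customers.
rows = conn.execute(
    """
    SELECT order_id,
           customer_id,
           order_date AS current_order,
           LEAD(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS next_order,
           CAST(
               julianday(LEAD(order_date) OVER (PARTITION BY customer_id ORDER BY order_date))
               - julianday(order_date) AS INTEGER
           ) AS days_until_next_order
    FROM sales_orders
    ORDER BY customer_id, order_date
    """
).fetchall()

for row in rows:
    print(row)
```

The last order of every customer has no next order, so both new columns are null there.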
We can very easily access any information we need in the same row in order to do such an important analysis
and with very simple query. We are not doing any crazy stuff like joining and stuff. We are just specifying the lead
function. So now we got all the informations that we need. Next we're going to go and calculate the average of
those days. So in order to do that we have to go and use a subquery. So let me just zoom out. So let's go and select
star just prepare the subquery. So the whole thing going to be a subquery. I'm going just get rid of the order by it's
not now necessary. So let's me just put it like this and shift it. So now what do we need? We need the average of the
days. So we need the average of this value. So what can we do? We're going to go and use a group by. So customer ID
since we have to find the average for each customers and we're going to get this value and say average days until
the next order and we're going to call it average days. So and we have here to group by. So group by customer ID. So
like this just make this a little bit smaller and zoom in here. So that's it. Now we are just doing a very simple
average with a GROUP BY statement. So let's go and execute it. Now as you can see it's going to go and aggregate the data.
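A minimal sketch of this aggregation step, wrapping the lead calculation in a subquery and averaging per customer. As before, the sqlite3 setup, the made-up sales_orders rows, and the julianday subtraction (in place of DATEDIFF) are assumptions for illustration.

```python
import sqlite3

# Made-up orders: (order_id, customer_id, order_date). Not the course data.
ORDERS = [
    (1, 2, "2025-01-05"), (2, 1, "2025-01-10"), (3, 3, "2025-01-15"),
    (4, 1, "2025-01-20"), (5, 4, "2025-02-01"), (6, 2, "2025-02-05"),
    (7, 1, "2025-02-15"), (8, 3, "2025-02-18"), (9, 2, "2025-03-10"),
    (10, 3, "2025-03-20"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_orders (order_id INTEGER, customer_id INTEGER, order_date TEXT)")
conn.executemany("INSERT INTO sales_orders VALUES (?, ?, ?)", ORDERS)

# Inner query: days until the next order per row. Outer query: average per customer.
rows = conn.execute(
    """
    SELECT customer_id,
           AVG(days_until_next_order) AS avg_days
    FROM (
        SELECT customer_id,
               CAST(
                   julianday(LEAD(order_date) OVER (PARTITION BY customer_id ORDER BY order_date))
                   - julianday(order_date) AS INTEGER
               ) AS days_until_next_order
        FROM sales_orders
    ) AS t
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()

for row in rows:
    print(row)
```

AVG skips nulls, so a customer with a single order ends up with a null average.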
So we have now only four customers and for each customer we have the average days between their orders. So now what
is missing in our task? If you check over here it says rank the customers based on this average. So we have to go
and use the rank function. So here again another window function that we have to go and use. We're going to do it
together with the group by. So let me just make this a little bit smaller and then let's do it over here. So I'm just
going to go with the rank function. Then we're going to define the window like this over order by and then we're going
to go and sort the data by the average days. So that means we're going to go and get this calculation over here and
put it as order by it's going to be ascending. So we are focusing on the lowest average days. So that's it. Let's
call it rank average. So now let's go and execute this. So now by checking the result, you can see now we have a
ranking for the average. And here SQL says that the number one customer, or the number one loyal customer, is
customer number four, which is not really correct because for number four we don't have a lot of information about this
customer: he or she ordered only once. So either you go and filter the data and remove this customer, where you
say if the average is null then don't put it in the rank, or we can go and replace this value with a very huge
value in order to put it at the end of our list. For example, we can go over here and replace the null with
COALESCE like this. And we say if the average is null, then let's say give me a crazy number like this, a very huge one.
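The ranking with and without the COALESCE fix can be compared side by side in a small sketch. Assumptions for illustration: Python's built-in sqlite3 module, made-up sales_orders rows, julianday subtraction in place of DATEDIFF, and 999999 as the arbitrary "very huge" replacement value.

```python
import sqlite3

# Made-up orders: (order_id, customer_id, order_date). Not the course data.
ORDERS = [
    (1, 2, "2025-01-05"), (2, 1, "2025-01-10"), (3, 3, "2025-01-15"),
    (4, 1, "2025-01-20"), (5, 4, "2025-02-01"), (6, 2, "2025-02-05"),
    (7, 1, "2025-02-15"), (8, 3, "2025-02-18"), (9, 2, "2025-03-10"),
    (10, 3, "2025-03-20"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_orders (order_id INTEGER, customer_id INTEGER, order_date TEXT)")
conn.executemany("INSERT INTO sales_orders VALUES (?, ?, ?)", ORDERS)

rows = conn.execute(
    """
    SELECT customer_id,
           AVG(days_until_next_order) AS avg_days,
           -- nulls sort first in ascending order, so the one-order customer wins
           RANK() OVER (ORDER BY AVG(days_until_next_order)) AS rank_with_null,
           -- replacing null with a huge number pushes that customer to the end
           RANK() OVER (ORDER BY COALESCE(AVG(days_until_next_order), 999999)) AS rank_avg
    FROM (
        SELECT customer_id,
               CAST(
                   julianday(LEAD(order_date) OVER (PARTITION BY customer_id ORDER BY order_date))
                   - julianday(order_date) AS INTEGER
               ) AS days_until_next_order
        FROM sales_orders
    ) AS t
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()

for row in rows:
    print(row)
```

Filtering out customers whose average is null would be the other option mentioned above; this sketch only shows the replacement approach.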
So that's it. Let's go and execute. And now as you can see this customer going to be at the end of our list. And now we
can see that the most loyal customer is number one. And then the other two customers are in the rank two. Here we
are sharing the same rank since we have the same average. So guys with that we have solved the task and we have ranked
the customers based on the average days between their orders. So we have now a really nice rank and we can understand
now the behavior of the customers, and maybe we have to go and focus on customer number one and understand his
or her needs. And of course the function that helped us here to do such a customer retention analysis is the
lead function, in order to find the next order and calculate the days. So this is how you use the lead function for such a
use case.
FIRST_VALUE and LAST_VALUE Functions
I think the name says
everything, right? So first value is going to allow you to access a value from the first row within a window, whereas
last value is exactly the opposite. It is going to allow you to access a value from the last row within a window. Easy,
right? So now let's understand how SQL execute those functions. Okay. So now as usual, we have this very simple example.
we have the months and sales and we have it twice because we would like now to go and compare side by side the two
functions first value and last value. So now for the left sides we would like to get the sales of the first month and on
the right sides we would like to get the sales of the last month. So now for the first task we can go and use the first
value. It's very simple. So the first value function then the argument going to be sales since we want the sales and
then the window going to be defined like this order by month because we want to get the first month. So as usual we must
use order by now on the right side in order to get the sales of the last months we can go and use the last value
right so the same things last value sales over order by month. So as you can see on the left and right we don't use
any frame definition, so the default frame is going to be used. All right. So now let's see how SQL is going to
process both of those queries side by side. So the first step is SQL going to go and sort the data. They are already
sorted from the lowest to the highest. And then the next step is going to start row by row finding the first value on
the left side. So what is the unbounded preceding? It's going to be static and always pointing to January. So this is
always going to be the unbounded preceding. We have it on both sides like this. And what is the current row?
It's going to be at the start the first row. And on the right side the same things over here. So the window
definition is going to be only one row, right? So what is the first value in this window? It is 20. Right? The same
things on the right side. What is the last value in this window? It is as well 20. So we will get exactly same results.
Now let's move to the second row. So it's going to be pointing to February. And the frame definition going to be
here extended like this. So what is the first value in this frame? It's going to be as well 20. Right? So in the output
we're going to get 20. And now in the right side the current row going to be as well pointing to February and the
window going to go get extended. So now what is the last value of this frame? It's going to be 10. Now let's keep
going. We're going to go to the March and the window going to get extended. What is the first value? It's always
going to be the same. So 20 on the right side window going to get extended. What is the last value? It's going to be 30.
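The side-by-side comparison with the default frame can be run as a small sketch. Assumptions for illustration: Python's built-in sqlite3 module (SQLite 3.25 or newer) and a hypothetical monthly_sales table. The third window spells out the frame that applies by default when an ORDER BY is present, to show it gives the same result as writing no frame at all.

```python
import sqlite3

# Hypothetical table with the four months from the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?)",
    [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5)],
)

rows = conn.execute(
    """
    SELECT month,
           sales,
           FIRST_VALUE(sales) OVER (ORDER BY month_no) AS first_val,
           LAST_VALUE(sales)  OVER (ORDER BY month_no) AS last_val_default,
           -- the default frame written out explicitly
           LAST_VALUE(sales)  OVER (
               ORDER BY month_no
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS last_val_explicit_default
    FROM monthly_sales
    ORDER BY month_no
    """
).fetchall()

for row in rows:
    print(row)
```

Because the frame always ends at the current row, the last value of the frame is simply the current row's own sales.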
So as you can see, the default definition always has a static start, always the same start of the subset, and as we
are moving with the current row the frame is going to get extended. So now moving to the last one, and with that we
will get the whole data set inside the frame, and the first value is going to be 20. On the right side, the same thing:
the frame is going to get extended like this, and this time the last one is going to be April, so five. So now if you go and compare them
side by side you see that on the left side the task is solved and everything is working correctly right. So we have
for each row always the sales of the first row and what is the first row it is January. So we have everywhere a 20
which is correct. But now if you check the right side you can see there is something wrong right? We are getting
not the last value. We should always get April, right? We should have here everywhere a five. Instead we have here
exactly the same values as the sales column. So it's really useless to use it like this, right? And that's of course because SQL
is using the default definition of the window frame. Last value is the one function among the window functions covered here where
the default frame definition does not give you the expected result. You have to go and customize the frame definition in order to get the
effect of the last value. For the first value, everything is working. If you're using a default frame, if you are not
specifying anything, but for the last value, you will not get the effect correctly without customizing the frame
window. So my friends, you can go and use the first value function like all other window functions without defining
a frame. You can go with the default and you will get the effect of the first value, but the last value you have to go
and define a frame. So let's see how we can solve that. All right. So now in order to solve this, we going to define
the frame like this. It's going to be the rows between the current row and the unbounded following. So we just switch
things around. So now let's see how this going to work. Now of course it's going to go and sort the data and so on. Now
it's still going to have a pointer to the unbounded following. So it's going to point always to the last row in our
data set and then it's going to proceed step by step. So the first row going to be like this and the frame going to be
the whole thing, right? So from the current row until the unbounded following. So what is the last value the
last row? It's going to be the five, right? The April. So we will get in the output five. Now let's proceed to the
next value. The frame going to be shorter and smaller. And what is the last value? It's going to be as well the
five. Right? So now we jump to the next one. And the frame going to be like this. What is the last value? As well
five. And then we will get the last value like this. Current row is equal to the unbounded following. We have only
one row and it's going to be as well five. So as you can see it's very simple just fix the frame clause and you will
get the last value working as expected. So this is how SQL going to go and do it. Now let's go back to SQL and start
practicing. All right. So now we have the following task. It says find the lowest and highest sales for each
product. So now let's see how we can do this. As usual we're going to start with very simple select statement. So select
order ID. We need the product ID and as well their sales. So let's select the table sales orders. So that's it. Let's
go and select this. Now in the output we got our orders, products and sales. So now let's start with the first part of
the task. Find the lowest sales for each product. So in order to do that, we can use the first value function. So let's
go and do that. First value. Then what we are talking about, we have to give an expression. We need the lowest and
highest sales. So let's go and have the sales inside it. And now we have to define the window. So over since we are
saying for each product that means we have to go and make windows. So we have to divide the data using partition by
product ID. And then we must use an order by, right? So we have to go and sort the data by the sales. Since the
first value should be the lowest value, we have to do it as ascending from the lowest sales to the highest sales. So
we're just going to leave it like this as a default and we're going to call it lowest sales. So let's go and execute
this. So now let's go and check our results. First going to go and partition the data by the product ID. So as you
can see we got now here four windows. Then sort the data by the sales. So the data are sorted from the lowest to the
highest from 10 to 90. So now what is the first value of the sales? It is the first row, right? So it's going to be
10. That's why we have everywhere a 10. Let's check another one. Let's take this one here. So this window has two rows
and it is sorted the lowest sales or let's say the first value is 25. So with that we have solved the first part of
the task finding the lowest sales for each product. Let's go to the next one. We have to find out the highest sales
for each product. So let's go and use the last value for this. So let's have a new line. We're going to have a last
value again the sales. Then we're going to go and define the window. So it's going to be the
exact same window. We have to partition the data by the product ID and order the data by sales. So let's go and just copy
the previous one and let's call it for now highest sales. So let's go and execute it. So now if you check the
results, you will see our issue over here again. Right? We are not getting the highest sales for this window. The
highest sales is 90. But as you can see, we are getting the exact same sales. And we have explained that in the previous
example. So in order to fix this, we're going to go and add for it the frame. So rows between current row and the
unbounded following. So now let's go and execute this. So now let's check the result. As you can see
over here, we got the highest sales correctly. So for this window, the highest one is 90, and as well for this
window the 60 and so on. So with that we have solved both of the tasks, the lowest and the highest sales. But now I
would like to give you my honest opinion about this task. I would not go and use the last value to find the highest
sales. So let me show you how I usually do it. I'm going to go and use the first value in order to find the last value.
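For reference, the two steps built so far (lowest sales with first value, highest sales with last value plus the frame fix) might look like the sketch below. It assumes SQL Server syntax; the inline Orders data, its column names and its values are hypothetical stand-ins for the course's sales orders table, so the query can run on its own.

```sql
-- Hypothetical sample data standing in for the sales orders table (assumption).
WITH Orders (OrderID, ProductID, Sales) AS (
    SELECT * FROM (VALUES
        (1, 101, 10),
        (2, 101, 20),
        (3, 101, 90),
        (4, 102, 25),
        (5, 102, 60)
    ) AS v (OrderID, ProductID, Sales)
)
SELECT
    OrderID,
    ProductID,
    Sales,
    -- Lowest sales: first row of each product window when sorted ascending.
    FIRST_VALUE(Sales) OVER (
        PARTITION BY ProductID
        ORDER BY Sales
    ) AS LowestSales,
    -- Highest sales: last row of each product window. The frame has to reach
    -- the end of the partition, otherwise it stops at the current row.
    LAST_VALUE(Sales) OVER (
        PARTITION BY ProductID
        ORDER BY Sales
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
    ) AS HighestSales
FROM Orders
ORDER BY ProductID, Sales;
```

With this sample data every row of product 101 shows 10 as the lowest and 90 as the highest sales, and product 102 shows 25 and 60. If you remove the rows clause, the last value simply returns the sales of the current row, because the default frame ends at the current row.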
So now let me show you what I mean. Let's go and add a new row. I will just take the whole thing from the lowest
sales. But what I'm going to do, I'm just going to go and change the order. So that means we will not go and sort
the data like this ascending from the lowest sales to the highest sales. We're going to go and switch it. So we're
going to go and sort the data from the highest sales to the lowest sales. And with that, the first value going to be
the highest sales. So let me just rename it highest sales. Let's give it like two. So let's go and execute this. And
now you can see over here we got the exact same results because we sorted the data differently and we get the first
value. So this is going to give you the exact same effect like the last value. And as you can see I don't have to
define now any window or something like that. I can stick with the default frame but just twisting the order by. So this
is how you can do it as well using only the first value. So now just for the sake of this task there's as well
another possibility in how to solve this. You can go and use the min and max functions. So let me just take the same
and have a new one for the highest sales. We can go and say you know what, let's get the max. So we are saying find me the
maximum sales, and we don't have to go and sort anything. So we can go and just partition it like this. So let's give it
another ID. Let's go and execute it. So as you can see we got the exact same results as the other two highest
sales. So as you can see we can solve this task using three different functions. Either go and use the last
value but you have to define the frame or you can go and use the first value where you switch or flip the order by or
simply just using the max function in order to get the highest sales. So guys as you can see we can use the first
value and the last value in order to find out the extremes like here in this example the lowest and the highest
sales. So there is like similarity between those two functions and as well the min and max. And of course what
we're going to do with this value over here we can go and compare it with the current sales. So for example we can go
and extend our task where we say find the difference in sales between the current and the lowest sales. So in
order to do that let me just clean up all those stuff and let's stick with the first value and the highest value like
this. So we have to compare now the current sales, which is this field over here, the sales, the original one, with
the lowest sales, with the whole thing from here. So let's go and do that. So we're going to have a new line and we're
going to say just simply subtract the lowest sales from the sales like this. And let's give it a name, sales
difference. So that's it. Let's go and execute it. Now as you can see the result in one row I'm comparing the
current sales which is 90 with the lowest sales from this product. It's going to be the 10. So with that we're
going to get the distance let's say between those two informations and it going to be 80. So now for the next one
the distance between this value and the lowest value is shorter. So we are near the lowest value. So as you can see over
here we can now compare the sales between the current sales and one extreme in order to find the distances
between two values. So this is again a very important analysis in order to do comparison analysis.
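The alternatives discussed above (first value with a flipped order by, the max window function) and the sales difference can be put side by side in one sketch. Again this assumes SQL Server syntax, and the Orders data, column names and aliases are hypothetical, chosen only so the example is self-contained.

```sql
-- Hypothetical sample data standing in for the sales orders table (assumption).
WITH Orders (OrderID, ProductID, Sales) AS (
    SELECT * FROM (VALUES
        (1, 101, 10),
        (2, 101, 20),
        (3, 101, 90),
        (4, 102, 25),
        (5, 102, 60)
    ) AS v (OrderID, ProductID, Sales)
)
SELECT
    OrderID,
    ProductID,
    Sales,
    -- Lowest sales per product: ascending sort, default frame.
    FIRST_VALUE(Sales) OVER (PARTITION BY ProductID ORDER BY Sales) AS LowestSales,
    -- Highest sales, option 2: same function, sort order flipped, no frame needed.
    FIRST_VALUE(Sales) OVER (PARTITION BY ProductID ORDER BY Sales DESC) AS HighestSales2,
    -- Highest sales, option 3: aggregate window function, no sorting at all.
    MAX(Sales) OVER (PARTITION BY ProductID) AS HighestSales3,
    -- Distance between the current sales and the lowest sales of the product.
    Sales - FIRST_VALUE(Sales) OVER (PARTITION BY ProductID ORDER BY Sales) AS SalesDifference
FROM Orders
ORDER BY ProductID, Sales;
```

For the row with sales 90 in product 101, the sales difference comes out as 80, which matches the walkthrough above. Options 2 and 3 both avoid the frame clause, which is the main reason to prefer them over the last value.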
All right friends, so now let's do a quick recap about the value functions or we call them sometimes analytical
functions. So what they do, they're going to go and allow you to access a specific value from another row. This
going to help you in order to do complex calculations with very simple SQL without having you joining tables
together or doing self joins. And for the value functions we have four types, or let's say four functions. The first one
allows you to access the previous value, like the previous month, using the lag function. The next one allows you to
access the next value, like the next month, using the lead function. Then we have another one that allows you to access the
first value in a subset using the first value function. And another option we can go and access the last value in a
subset using the last value function. Moving on to the next one, we have the rules of the syntax. So about the first
point, it is the expressions. We can go and use any data type. It could be a number, string, a date, anything. Now in
order to perform those functions, we have to go and sort the data by the order by. So order by is required. It is
a must. Then for the frame, you are allowed to use it. So it is an optional thing. I would say always leave the frame
empty. Only for the last value do you have to go and customize it, otherwise it will not work as expected. Now to the
next point, we have the use cases. We have simply very important use cases for the value functions in data analytics.
So what can we do? We can do time series analysis. As we learned, we can do month-over-month analysis and
year-over-year analysis. Those analyses are classics, and it's always the first question in data analysis in order to
measure: are we growing with the business or are we declining? How is the performance between the current year and the
previous year? So as you can see, we are always doing comparisons using those window functions. The next use case is
as well about time. We can do time gap analysis, as we did when we analyzed the customer behavior, the customer
retention, where we calculated the average days between two orders. And the last use case is as well about comparison:
comparison analysis. We can go and use the value functions in order to compare the current value with an extreme, like
comparing the current sales with the highest sales or with the lowest sales. So my friends, those analyses are essential
in data analysis. You will be encountering them in each company, in each business. You have to answer those questions, and
you can do that very easily using the SQL window functions. All right my friends, so that's all about the window
value functions and with that we have covered everything about how to aggregate your data using SQL and those
are very important tools on how to do data analytics in SQL especially if you are a data scientist and data analyst.
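As a compact recap, the four value functions can be sketched in a single query. This is a minimal sketch assuming SQL Server syntax; the MonthlySales data, its column names and its values are hypothetical and only there to make the query runnable on its own.

```sql
-- Hypothetical monthly totals (assumption, not taken from the course database).
WITH MonthlySales (OrderMonth, TotalSales) AS (
    SELECT * FROM (VALUES
        (1, 20),
        (2, 10),
        (3, 30),
        (4, 5)
    ) AS v (OrderMonth, TotalSales)
)
SELECT
    OrderMonth,
    TotalSales,
    -- Previous month (NULL for the first month, there is no earlier row).
    LAG(TotalSales) OVER (ORDER BY OrderMonth) AS PreviousMonthSales,
    -- Next month (NULL for the last month).
    LEAD(TotalSales) OVER (ORDER BY OrderMonth) AS NextMonthSales,
    -- Month-over-month change: current value minus previous value.
    TotalSales - LAG(TotalSales) OVER (ORDER BY OrderMonth) AS MoMChange,
    -- First value works with the default frame.
    FIRST_VALUE(TotalSales) OVER (ORDER BY OrderMonth) AS FirstMonthSales,
    -- Last value needs the frame extended to the end of the window.
    LAST_VALUE(TotalSales) OVER (
        ORDER BY OrderMonth
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
    ) AS LastMonthSales
FROM MonthlySales
ORDER BY OrderMonth;
```

All four functions need the order by inside the window, and only the last value needs a customized frame.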
So with that we are done with this chapter and I can tell you with that we have covered the intermediate level. So
we have learned how to filter the data, how to combine the data and as well the most important functions in SQL. Now
we're going to go to the third and last level. We will cover now the advanced level. So the first chapter is going to be
about the advanced SQL techniques. So now if you go inside it, in SQL there are different techniques in order
to organize our complex projects. So first I'm going to explain to you what exactly I'm talking about, what
complex queries are and why we have them, and then we're going to start with the first topic, the subqueries. So let's
go. Normally in projects we have a database and we have a person that is responsible for the database the
database administrator who takes care of the database structure. And now in a very simple scenario we're going to have
a user that is writing queries in order to retrieve data from the database. So he or she going to write an SQL query
and then this query going to be sent to the database where it's going to execute it and then the database going to return
the results. So at the end our user going to see the result of the query that he wrote. So this is a very
simplified scenario on how we use a database. But my friends in the real world things are totally different.
Things in real projects get very complicated like this. So for example, you have a financial analyst that is
writing a huge block of SQL query that is very complex and there will be like another user that have different role
like a risk manager that is as well writing a very complex query and from different departments from different
projects for different tasks. You will have a lot of analysts that are writing many complex queries. So all those
analysts and managers have direct access to your database, and they are executing complex analytical queries
in order to generate maybe a report or something. Now, not only are those guys doing analysis on your database, you
will have as well our friend the data engineer that is saying you know what I'm building a data warehouse and I
would like to extract your data. So that data engineer going to go and write an extract query in order to extract the
data from the database. And then he has a different script for the transformations in order to manipulate,
filter, clean up, aggregate your data. And then a third script in order to collect the result of the
transformations and load it into another database called a data warehouse. A data warehouse is like a special database that
collects data from different sources and integrates it in one place in order to do analytics and reporting. And now at
the end of this chain, you will have a data analyst, and she writes as well queries in order to analyze the data in
the data warehouse. Or you might have a different query in order to prepare the data before inserting it into a tool like
Power BI in order to generate visualizations and reports. So we call this a data warehouse system or a
business intelligence system that extracts your data, manipulates it and transforms it for
analysis. Now not only do we have a data engineer and data analyst accessing your database and doing queries, we have as
well our friend the data scientist. So now our data scientist as well has a direct access to your database. So he
might write like different queries in order to extract the data and as well to manipulate the data that are needed in
order to develop a model and doing machine learning and AI. And now one more scenario that I see in many
projects is where the result of the data analyst is going to be used in another query in order to prepare the results
for data visualizations in Power BI or in order to export like an Excel list. So as you can see we have a lot of people with
different roles that want to access your database and do analysis on top of it, and that's because everyone wants to
answer questions based on the data. And now if I look at this, I still think this is a simplified version of how things
work in data projects, and I can tell you in real projects things are way more complicated than this. So now if you
sit back and look at this, you will find many challenges and problems. For example, all those people are not talking to each
other, and each one of them is creating their own query. But if you go and take all those queries and compare them
side by side, you will find in the scripts and queries logic that keeps repeating. So the queries from the
analyst or the data scientists and data engineers, they might contain a redundant logic. And of course the issue
with this is that we have the same effort repeated over and over, and maybe not everyone is getting the logic
implemented correctly, because not all of them have the right skills in SQL. So this is a big issue in this setup. And
now we have another challenge with this scenario. If you don't optimize it, you will have performance issues
everywhere. So the data warehouse or the data engineer's scripts might take like 5 hours, and the query from the analyst
might take like 40 minutes, and before inserting the data into reports we might have 30 minutes, and 1 hour there, 30
minutes there, and everyone else is as well suffering from bad performance on their queries, and the performance
everywhere is really bad. So if everyone is writing big complex queries don't expect that they will have a good
performance. Now to the third challenge that I observed in many projects and that is the complexity. Now behind the
original database you might have a data model that is prepared and optimized only for one application. So you will
have in the data model a lot of tables and all those tables have different relationship between them and of course
only the developers and the experts of this database understand the physical data model behind this database. And now
if you give access to all those analysts they will have a lot of questions because first they have to understand
the data model before writing any query. So that means a lot of data workers keep asking the experts of this
database questions. So for example how to connect the table A with the table B and where do I find my columns? What
this table means? I'm getting bad result in my query because your data is really corrupt. So the developers of the
database will get a lot of questions from the analyst and they have to explain over and over their data model
so that the users are able to write those complex queries. So that means all those users are stressing the database
team by many questions and as well the users are writing very complex queries. So the complexity is a really big
challenge. Now as well, by looking at this picture you will find a lot of arrows from those queries to the
database, and this might cause a lot of database stress. So repeatedly executing big complex queries is going
to put really big stress on the database, and it could bring the database down. And the last challenge of
this picture is the data security. So if you leave it like this, by giving the users direct access to your
database tables you might have a problem because it might be okay for like some data engineers and so on but you don't
want to give for each data analyst a full access to the database tables. So you have to protect your tables the
columns the rows everything. So you cannot leave it like this where everyone having a direct access to the physical
database tables. Now enough talking about challenges, problems and issues. Let's be solution-oriented. So what are
the solutions to those issues? Of course, there are many solutions, but we're going to focus now on five
techniques. We can go and use subqueries or CTEs, common table expressions. We can introduce views to
our database, or temporary tables, or we can go and use the technique of CTAS, create table as select. So this is
exactly why we have to understand those five techniques in order to solve all those issues that we might face in our
data projects. All right friends, so now after we understood the importance of
those five techniques, let's take a quick and simplified look to the database architecture because I want you
to understand what happens behind the scenes and how the database execute the queries from these five techniques. So
by understanding this architecture you will understand how things works. So let's go. For each story there are two
sides. We have the server side and the client side. On the client side it's like, for example, you are writing an
SQL query for a specific purpose. Now on the server side we have many things. So the server is where the database lives
and it has many components like the database engine. The database engine is the brain of the database that handles
different operations like storing, retrieving and managing data in the database. So each time you execute a
query, the database engine going to take care of it. And now in the database we have very important component that is
the storage and the two main types of storage in a database are disk storage and cache. The disk storage is like a
long-term memory where the data is stored permanently. So it's like the disk at your PC. It stores the data
permanently even if you turn off the system. And one important feature of the disk is that it can store a lot of
data. But the disadvantage of the disk storage is that it is slow. So it is slow to write and to read. Now on the
other hand we have the cache. It is a short-term memory where the data is stored temporarily. It's like the RAM in
your PC. It holds the most frequently used data, so the database can access it quickly in order to retrieve data. And
the big advantage of the cache is that it is fast. So it is very fast for the database to retrieve data from the cache
compared to the disk. But the disadvantage of the cache, the data is stored there only for short period. So
it's like tradeoff between the speed and how much data you can store and how long. Now let's talk about the disk
storage. This is very important in databases. There are typically three types of storage areas. There we have
the user data, the system catalog and the temporary data and each storage type has a different purpose. So what is user
data storage? It is the main content of the database. So it stores the actual data all the informations that are
relevant for the users. So all the important data that the users care about is stored there. So this is the
storage that the users are interacting with all the time. So where do we find the user data? If you go to our database
sales DB and then you go to the tables, we find all these tables that we have already used, the customers, employees,
orders and so on. Those tables are the user data. So now if I go and select from the sales orders table, all the
information that we are seeing now is the user data. So this is what we users actually care about. All other stuff
that we see inside databases as a user we don't care about it. We care only about our data. But in the database, we
don't have only the user data. We have many other informations. So this is what we mean with the user data
storage. Now what is system catalog? This is the internal storage for the database for its own information. So
it's like a blueprint that keeps tracking everything about the database itself. So that means the main purpose
of the system catalog is that it holds the metadata informations about the database. So what is a metadata?
Metadata is data about data. Now let's understand what this means. What we have done so far is that we have created a
table called customers and we have defined inside it like multiple columns like the customer ID, first name, last
name and then we have inserted our data inside this table. So we have inserted five customers. So those informations
are my data. I have created those informations and stored it inside the database. That's why we call it the user
data. So nothing so far is new. So now what happens behind the scenes is that the database server will not only store
the user data that you have provided but also it's going to go and store a different type of data inside the
database and this data is the metadata. So the database server is going to store the metadata of the customers table, and
it's going to look like this. There is a table name, there are column names, and those are the column names
that you defined when you were creating the table customers. And it's going to store as well additional information
like the data type, for example the customer ID is INT and the last name is VARCHAR, and many other things like
the length of the column and
whether the column is nullable or not. So as you can see, in the metadata we have a description, data about the
structure of the customers and in the metadata we can find a lot of informations about not only the tables
and columns but as well about the schemas and the database. So you can find a full catalog about the structure
of your database. Basically, the customers table contains the actual data. So it stores data about
the customers. But the metadata of the customers table contains data about data. So in databases, each table
that you are using in order to store your data has a twin table that describes the structure of your data. So
this is what we mean by the system catalog or metadata. And now you might ask, where can I find all this system
catalog and metadata inside our client here? Well, you cannot navigate through that information in the object
explorer like we used to do for the user data. But you can find it in a special hidden schema
called the information schema. The information schema (INFORMATION_SCHEMA) in SQL Server is a system-defined schema
that contains a set
of built-in views that help us to find information about our database, like tables, columns, and other objects. So
let's go and explore it. We're going to go and say select star from information schema. And then let's have
a dot. And now we get from SQL a list of all views that are available in order to browse the metadata of our database. So
for example, you can see here tables. You can see information about the views and as well about the columns. So let's
go and select the columns view (INFORMATION_SCHEMA.COLUMNS) and let's go and execute it. And now in the output we can
find information about the schema,
about the table names, like for example here the customers. Let me just go and select this table. And then we find all
the columns inside this table and how they are sorted. So we have here the order of each column and as well the data type
and the size of each column and many other things. So as you can see we got here all the information, all the
metadata of each table and as well of each column inside the table. So with that you can check which tables
exist in your database. For example, I find here something called test two. So maybe I was trying to test
something. I can go now and clean up stuff, right? And this is exactly why the database maintains such a catalog. It
helps the database to quickly find the structure of each table and of each column, and it helps me as well as a
user to browse the catalog of the database. So for example I can go over here and say okay, let's get the distinct
table names. So with that I will get a list of every table that I have inside the database. So we have the customers,
employees and some tests that I have done. So metadata is awesome. Now we come to the third
storage, the temporary data storage. It is a temporary space used by the database for short-term tasks like
processing a query or sorting data. And once these tasks are done, what's going to happen? The database is going to go and
clean up the storage. And now of course the question is, where can we find the temporary tables that are using the
temporary storage in the disk. Well actually if you go to the object explorer you will not find it inside our
database sales DB but you will find it inside the system databases. Now since we are working locally we have the full
access to everything inside SQL Server. But in real projects, if you are just a user or let's say a developer, you
will not have access to the system databases. That is only for the database administrators. But now we are working
on the local copy. So let's go to the system databases, and here you have a special database from SQL Server
called tempdb. And if you go inside it, we will find here tables and temporary tables. So this is exactly where you can
find all the temporary tables that you are generating. Now currently we didn't create any temporary tables, that's why
it's empty. But once you start creating temporary tables you will find those tables underneath this folder. We will
learn about the temporary tables in the next sections. So these are the main
component of the database architecture. So now let's have an example. Now we have a table called orders that is
stored inside the user storage and the metadata of this table is stored in the catalog. So now let's say that you are
at the client side and you write a simple select query in order to select the data of the orders. So now that
query is sent to the server in order to be executed and the database engine going to take the query in order to
process it. So first the database engine going to check whether we have the data in the cache because if the data is
stored in the cache then things going to be really fast and the database engine can solve the task quickly but in this
scenario we don't have the orders informations in the cache that's why the database engine going to say okay it's
not in the cache let's check the disk so it will find the orders information in the disk and the query going to be
executed then the result of this query going to be sent back to the client side where at the end in return you will see
in the output the result of the table orders. So this is how the SQL database executes a very simple select
query. A subquery is a query inside another query. So what does this mean? Let's have a sketch to understand it. So far, what
we have learned we have different database tables like the orders customers and so on and we write a
simple SQL queries like select from where. So the SQL going to retrieve data from the database tables and in the
output we will get some kind of results. So this is so far what you have done. We have done very simple queries. Now in
our query we can have things little bit different. So we could have another query that is inside our query where we
do the same things like select from where. So we have now a query inside our query, and this embedded query we
call a subquery, and the original query, the first one where we have select from, we call the main query. So now if
you execute the whole query, what's going to happen? SQL is first going to go and pick the subquery and then it's going
to execute it. So it's going to go and retrieve data from our database tables, and the result of the
subquery will not be sent to the users, to us. So we cannot see it. What's going to happen? The result is going to stay
inside the
query as an intermediate result, and then our main query can go and start interacting with this intermediate
result from the subquery. So the main query going to do some kind of operations on top of this intermediate
results and use it for filtering or joining or any purpose and still the main query can go and query the original
database tables. So now the main query has two sources for data. The original database tables and as well the result
from another query. So now by looking to this you can see the subquery is a query inside the main query and it play a role
of supporter. So it supports the main query with data and the main job of the main query is of course to get all those
data and to show us at the end the final results. Now there are two things about this intermediate result that we
got from the subquery. Once the execution of the query is completely done, what's going to happen is SQL is going to go and
destroy this intermediate result. So it's going to totally drop it. So we will not find it anywhere. It's
completely lost. Now the other thing about the intermediate results is that imagine you are making another query
that is completely outside of the first query. We are selecting a few tables from our database. Now you might say, you know
what, is it possible to access the intermediate result from the first query? Since we are now talking about a
completely external query, you cannot do that. The intermediate result of the subquery is known only locally to the
main query itself, and it is not globally available for any other query. So the subquery can be used only from the main
query. So with that we have understood what subqueries are, and now you might ask me why
do we need them in the first place? Why are subqueries important? Let's have the following sketch. Now in our complex
task we might have to do several things in our query. For example, in the first step we have to go and join tables in
order to prepare the data and then the outcome of the joins should be filtered. So this going to be our step two. And
then on top of that in the step three we have to go and do transformations like maybe handling the nulls or creating new
columns and many other stuff. And the last step we want to go and do data aggregations like summarizing the data
or finding the average. Now if you go immediately and start writing the SQL query without having a plan, what's going to
happen? You're going to end up having a long, complex SQL query, and it's going to be really hard to write and as well to
understand and read. And now what we can do instead is divide our task based on those steps. So
we're going to write one query section for each step. For example, for joining tables we're going to have one query for
filtering another one transformation another one and for the aggregation we're going to have the last query. So
now since each step is like a preparation for the next step we can go and say each of those queries is a
subquery. So for step one, step two, step three, we have sub queries and they are all doing like calculations and
preparations for the last step to the aggregations and we call the last step the main query and of course the whole
thing can exist in one single query. So if you want to visualize this, you have a subquery as a circle, and then this
circle belongs to a bigger circle called the main query. By the way, sometimes we call the main query the outer query,
and the subquery we can call an inner query. And of course, we can have many subqueries, many small circles inside
each other, to form something called nested queries. So this is the main purpose of using subqueries in our
scripts and queries. It's going to help us to reduce the complexity and going to make it easier to read and we can have
like a flow logical flow inside our queries. Now for the sub queries there are many
different types and categories. So now what we're going to do I'm going to show you an overview of all those types and
categories and then later we're going to deep dive into each of those types. So first of all if you are thinking about
the dependencies between the subquery and the main query, there are mainly two types of subqueries. We have the
non-correlated subquery. That means the subquery is independent of the main query. And the second type is the
correlated subquery. It's exactly the opposite. The subquery depends on the main query. Of course, we will
explain all of this in detail. Don't worry about it. So, this is the first group. Now, there is another way
to group the subqueries, depending on the result type. So, I mean with this that subqueries have
different outputs and results. For example, we have the scalar subquery. It returns only one single value. Or
another type, called the row subquery. It's going to return multiple rows of a single column. And the final type is
called the table
subquery. It is a subquery that returns multiple rows and as well multiple columns. Now we come to the third way
and the last way to categorize the subqueries, and this time it's based on the location and the clauses. So we are
describing here where the subquery is going to be used within the main query. We can use it in different locations and
clauses, like the select clause, or we can use it in the from clause, and this is the most common type for the subqueries,
or we can use it before joining tables, and we can use it in order to filter the data in the where clause. And in the where
clause, as we learned, there are two different sets of operators. We can use the subquery together with the
comparison operators, the less than, greater than, equal and so on. Or we can use it with the logical operators like IN, ANY,
ALL and EXISTS. So now those are the different types and categories for the subqueries and we're going to now deep
dive into all of them. So now let's go and start with the easiest category, the result types of the subqueries.
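To make the first grouping concrete, here is a minimal sketch contrasting a non-correlated and a correlated subquery. It assumes the course tables are named Sales.Orders and Sales.Customers, that both carry a CustomerID column, and that Sales.Orders has OrderID and Sales columns; those exact identifiers are assumptions taken from how the names are spoken in this walkthrough.

```sql
-- Assumed tables (names are illustrative, not confirmed by the course):
--   Sales.Orders(OrderID, CustomerID, Sales)
--   Sales.Customers(CustomerID, ...)

-- Non-correlated: the inner query can run entirely on its own.
SELECT OrderID, Sales
FROM Sales.Orders
WHERE Sales > (SELECT AVG(Sales) FROM Sales.Orders);

-- Correlated: the inner query references a column of the outer query
-- (c.CustomerID), so it cannot run on its own.
SELECT c.CustomerID
FROM Sales.Customers AS c
WHERE EXISTS (
    SELECT 1
    FROM Sales.Orders AS o
    WHERE o.CustomerID = c.CustomerID
);
```

A quick way to tell the two apart: highlight only the inner query and run it. The non-correlated one executes; the correlated one fails because c.CustomerID is unknown outside the main query.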
Now we have different types of subqueries based on the results. So this means the amount of data that the
subquery is going to return. So the first type is the scalar subquery. It is a subquery that is going to return only
one single value, like for example the value three. Let's have an example for the scalar subquery. So in this query
for example, if you say select star, you will get all the columns and all the rows from one table. But for the scalar
subquery we need only one value. How we usually get it is by doing some aggregation. For example, if you go and
say let's get the average of sales. So let's execute it. And with that, in the output we have only one value, 38.
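The scalar query just executed, together with the row and table variants described next, can be sketched as three stand-alone queries. The table name Sales.Orders and the columns Sales, CustomerID, OrderID and OrderDate are assumed spellings of the names mentioned in the walkthrough.

```sql
-- Scalar query: one row, one column (a single value).
SELECT AVG(Sales) AS AvgSales
FROM Sales.Orders;

-- Row query (as the term is used in this course): one column, many rows.
SELECT CustomerID
FROM Sales.Orders;

-- Table query: many rows and many columns.
SELECT OrderID, OrderDate
FROM Sales.Orders;
```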
We call such a query a scalar query. It has only one row and only one column. So this is a scalar query. All
right. So now to the second type: we have the row subquery. It is a subquery that is going to return multiple rows and a
single column. So we're going to have values like 1, 2, 3. So it is only one column with multiple rows. Let's have an
example for the row query. As you can see now we are saying select star from the table orders and now we are getting
multiple rows and multiple columns. But for the row queries we need only one column. So you can go over here for
example say customer ID. And if you go and execute it. So now if you check the output we have a single column and as
well multiple rows. So we have a list of values, and this is what we call a row query. All right. So now to the last
type: we have the table subquery. It's going to return multiple rows and multiple columns, like any
regular table. So this subquery is going to return a lot of values. Okay. So let's see an example of a table
query. So if you check our example here, select star from orders, we got here multiple rows and as well multiple
columns and of course we can go and select multiple columns, like for example the order ID and the order date. So if
we execute it, here in the output we have multiple columns, we have two columns, and multiple rows. That's why this
kind of query is a table query as well. All right. So with that we have learned the different types of subqueries based
on the result type. Now we're going to go and learn how to use the subqueries in different locations in our query. So
we're going to start with how to use a subquery in the from
clause. Okay. So we typically use the subqueries in the from clause in order to create temporary result sets that act
as a table for the main query. So it's like in some scenarios we cannot use the tables directly from the database. We
have to prepare it somehow before we do our actual query. Okay. So let's check the syntax of the sub query inside the
from clause. So we start with the usual stuff where we go and say select and few columns that we want to retrieve and
then we say okay from usually after the from comes the table name from our database that we want to query. But this
time instead of writing the table name, we're going to have another SQL query. So that means we don't define the table
name, we define another select statement, where again we select a column from a specific table and
then maybe we have a filter. And in order to tell SQL this is a subquery, we have to use
parentheses. So we're going to have a parenthesis at the start and at the end. This is a subquery. This is not the main
query. And after the parentheses, we can go and define the alias for the results that we're going to get from this
subquery. In some databases this alias is optional, but in SQL Server we have to go and specify an alias. So
it is a must in SQL Server. So, again, we call this a subquery and the outer query we call the main query. So, this
is the syntax of the subquery in the from clause. Okay, so now we have the following task and it says find the
products that have a price higher than the average price of all products. So we're going to do it step by step and
here we have two steps. The first one is that we have to go and calculate the average price of all products and the
second step we're going to use this value in order to filter the table products, to find the products whose price
is higher than this average price. So let's start with the first step where we're going to find the average price.
I'm going to select the following information: product ID and price from the table sales products. So
let's go and execute it. So now we have the product and as well the prices and we need this price here in order to
compare it with the average price. So that means we need this price and as well side by side we need the average
price. So that means we need aggregations and details side by side, and that's why we're going to go with the window
function average. So let's go and do this. This is very simple. So it's going to be the average of the
price, and we don't want to partition the data, so it's going to be an empty over, and this is going to be the average price,
like this. So let's go and execute it. And with that we have calculated the average price. So now we have all the
information from the first step. We have the average price, we have the price and the products. So now the next
step is that we have to go and filter the data to find out all the products where the price is higher than the
average. That means we will do this step based on those information that we have now. So that means we have to go and use
the logic of subquery and main query. Since this is the first step to prepare the data, we're going to use this as a
subquery. So we're going to call this a subquery like this. And we have to go and use it in the main query. So how are we
going to do that? We have to go and write the main query. I'm going to start over here: select,
and then I will take all the columns, from. So this is the main query. Let me just make this a little bit smaller. And
what we're going to do now: the main query is going to get the data from the subquery. So the whole thing is going
to be used inside the from clause. So now in order to put the subquery inside the main query we have to go and use
parentheses. So we're going to have one at the start and one at the end, and what we usually do is add a
tab in order to understand, okay, this is the subquery and then this is the main query. So now one more thing that we
have to add for the whole subquery in SQL Server: we have to give it an alias. So you can go and give it any
name that you would like. I usually go with only one character, the t. It stands for table. So you can use
anything that you want. But in SQL Server we have to give an alias to the subquery. So now what we are saying, we
are saying select everything from the subquery. If you go over here and execute it, you will get the exact same
results because the main query is doing nothing. It's saying just select everything from the subquery. But now in
order to solve the task, we are not interested in all products. We are interested only in the products where the
price is higher than the average. That's why we have to go and use the where clause. So we're going to say where the
price is higher than the average price. So this filtering is done in the main query. It's not inside the subquery. So
now that means in the main query we are doing something. Let's go and execute it. And with that we have solved the task. We
are getting now two products where the price is higher than the average price. So as you can see it's very simple. If
the task has multiple steps then we can do that using multiple subqueries until we have the main query, and we can learn
from this that the subquery is here only to support the main query. So we are preparing here all the data
that we need in order to have the final result for the main query.
So for this kind of task we cannot immediately put everything in one
select query. We have first to prepare the data in one subquery and then pass the values to the main query. And this
is what we mean by the table subquery. And here one quick tip for you. If you would like to see the intermediate
results that we are getting from the subquery, you can go and highlight the subquery itself without the parenthesis.
So we are just highlighting the subquery. You can go now and execute it. And with that SQL will not go and
execute everything. SQL is going to execute only what you are highlighting. So this is a really nice way to see the
results of the subquery when you are debugging or searching for errors. You can go and see the intermediate results
that are used by the main query. And of course if you deselect, so nothing is highlighted, and execute, SQL is going to
execute everything, the whole query. So this is how we use the table subquery inside the from clause. All right.
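The finished query from this task can be sketched as follows. The identifiers Sales.Products, ProductID, Price and the alias names AvgPrice and t are assumed spellings; the structure (window average in the subquery, filter in the main query, mandatory alias) follows the steps described above.

```sql
-- Main query: keeps only the products priced above the average.
SELECT *
FROM (
    -- Subquery: every product next to the overall average price.
    -- An empty OVER () means the average is taken over all rows.
    SELECT
        ProductID,
        Price,
        AVG(Price) OVER () AS AvgPrice
    FROM Sales.Products
) AS t                      -- SQL Server requires an alias here
WHERE Price > AvgPrice;
```

The filter has to live in the main query because a window function such as AVG(...) OVER () cannot be referenced in the where clause of the same select that defines it.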
So let's have another task and it says rank the customers based on their total amount of sales. So again if you check
here we have like two steps. First we have to find the total amount of sales and then after that we have to go and
rank the customers. So again we have like two steps and we can use the subqueries in order to solve it. So
let's start with the first step where we're going to find the total amount of sales. So let's go and select the
customer ID and as well the sales from the table sales orders. Let's go and execute it. So now
in the output we have like multiple customers and their sales. We have to go and now find the total amount of sales
for each customer. That means we have to go and use the group by. So we're going to go and summarize the
sales. So total sales and then group up the data by the customer ID. So like this. Let's go and execute it. Now as
you can see in the output we have four customers and we have the total sales for each customer. And with that we have
solved the first step. We have the total amount of sales for each customer and we have now prepared the data for the next
query in order to rank the customers. So now I think you are already getting how important the subqueries are in order to
do step-by-step analysis. So this is our subquery. Now we need the main query. So I will start preparing it. So main query
like this. And let's go first and select everything. So select star from, let me just make this a little bit bigger like
this. And now we have to go and convert this query to a subquery. So we need the parentheses, the starting and the
ending, and for SQL Server I'm going to give it an alias, and I would like to push everything to the right side. So
let's go and execute it. Perfect. So it is working. With that the subquery is passing the data in the from clause to
the main query. Now of course the main query is useless right now. It's just selecting the data. We have to go and
calculate the rank, and for that we have a very nice window function. So we're going to go and use rank. It
doesn't need any parameters. Then over, and we have to sort the data with order by. So we have to go and sort the data by the total sales
descending, from the highest to the lowest. So we're going to go with the total sales and descending. So now as
you can see we are using the total sales that we have already prepared in the subquery. So without preparing first the
data we will not be able to rank the customers in the main query. So that's it. Let's go and execute it. And with
that SQL sorted our data and we have a nice ranking based on the data that we had from the subquery. So this is the
customer with the highest sales, and then the customer number one and so on. So again in this task we have multiple
steps and we use the power of the subqueries in order to do it step by step. So that's all on how to use the
subquery inside the from clause. Okay. So now let's see quickly how SQL executed our query. So we have here our query and
we are querying the table orders. So the first step is that SQL is going to go and identify the subquery, and then it is going
to go and execute it. So SQL going to go and execute the subquery part where we are aggregating the data based on the
customer ID. So once the subquery is executed, the next step is that the result is going to be held as
intermediate results. These results we will not see in the output. They are going to be temporarily saved in
memory. So now the next step is that SQL is going to go to the main query and execute it based on the
intermediate results. So that means the main query will not go back to the original table. It's going to go and
query the intermediate results. So here what SQL is going to do is rank the intermediate results by
introducing a new column where we see the ranks 1, 2, 3, 4, and the output of the main query is going to be the final
result. So as you can see it's very simple. First SQL is executing the subquery, and the result of the subquery is
going to be used in the main query, and once the main query is executed we will get the final results. So the subquery
here is only supporting the main query. So those are the steps that SQL uses in order to execute the
subqueries. So now let's understand how the database server executes the subqueries behind the scenes. Let's go.
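The ranking task and the two logical steps just described can be sketched like this. Sales.Orders, CustomerID, Sales and the aliases TotalSales, CustomerRank and t are assumed names based on the walkthrough.

```sql
-- Step 2 (main query): rank the intermediate result, highest total first.
SELECT
    *,
    RANK() OVER (ORDER BY TotalSales DESC) AS CustomerRank
FROM (
    -- Step 1 (subquery): one row per customer with the summed sales.
    SELECT
        CustomerID,
        SUM(Sales) AS TotalSales
    FROM Sales.Orders
    GROUP BY CustomerID
) AS t;
```

The "subquery first, then main query" order is a helpful way to reason about the result. The database's optimizer is free to pick a different physical plan as long as the output is the same.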
So now let's say that you are a data analyst and you are writing a query at the client side where you have a
subquery inside the main query. Once you go and execute it, what's going to happen is the database engine is going to
identify the subquery, and in this situation the database is going to execute the subquery first. So here the subquery is
selecting and retrieving data from the table orders. So that means the database has to retrieve the data from
the disk storage, from the user data. So now once the subquery is executed, the intermediate result is going to
be stored in the cache. So this means the result of the subquery is temporary and very fast to retrieve. And
now once the database engine is done with the subquery, it is going to start executing the main query. And
in this scenario it is completely dependent on the result of the subquery. So that means the main query is going to
interact with the cache storage. So this means the data is now going to be retrieved very fast from the result of
the subquery. Once it's done, it's going to forward the result to the database engine, and the database engine is going to
forward the results to the client side. And at your side, you will find the final result. And of course, once
everything is executed, the database engine is going to clean up the cache. So the subquery results are going to
be destroyed and removed completely from the cache in order to free up space for other queries. So this is how the
database server executes the subqueries behind the scenes. All right. So now we're going to
talk about how to use the subquery in the select clause. So now we typically use the subqueries in the select clause
to aggregate the data side by side with the columns of the main query. Okay. So let's check the syntax of the subquery
in the select clause. So we start with the simple stuff where we say okay let's go and select a column that we want to
retrieve from a specific table. So nothing new, we are just querying a table. And now what we can do in this query is that not
only can we select the columns from a specific table, we can also insert here inside the select another query,
a full query with select, from and where. So again it's a query inside another query, and we call this of course
a subquery. In order to tell SQL this is a subquery we go and add the parentheses. With that SQL is going to
understand, this is a subquery, and the result of this query is going to be used in the select. So we can handle it
like any other column. We can go and give it an alias. Here it is optional, it is not a must to add an alias. So
this inner query we call a subquery and the outer query is going to be the main query. So this is how you put a subquery
in the select clause. But there is one rule for this query: the result of this subquery must be a scalar value.
That means the result must be a single value, because otherwise it will not work. SQL here is expecting only one
value. So this is how we use the subquery inside the select clause. All right, let's have the following task and
it says show the product ids, product names, prices and the total number of orders. So now if we check the task
there is like two parts. The first part is that we are showing the details about the products and the second part that we
have to go and calculate the total number of orders. So let's see what we're going to do. First let's go and
solve this simple part here where we have the product ID, product names and prices. So we're going to go and select
the product ID and the product and then the price from the table sales products. Let's go and execute it. So with that we
have solved the first part of the task. We have the details about the products. Now we go and solve the second part. We
have to go and calculate the total number of orders. Now this information comes from a different table than the
products. We cannot calculate it from products. We have to go and query the orders. So now what am I going to do?
I'm going to go and calculate this part in a separate query, instead of having it here inside the products. So let's have
a semicolon in order to have a second query. So we're going to go and select the total number of orders. That means
we can simply do a count star from the table sales orders. Let me just make it a little bit bigger. So we're going
to call it total orders, and a semicolon as well. So now if you just execute the whole thing, you will get here two
parts in the results. First you have the details of the products, and in the second part we have now the total number of
orders. We have 10 orders. But now with that we have two different queries, separated from each other, and we
have two different results. But in the task we have to show all this information in one result. So now what
we can do is put one query inside another query. So now if you check the second query, the total orders, you can
see we have only a single value. So we have a scalar query, a scalar subquery. That's why we can go with this as a
subquery, like this. And I'm going to go and put everything in one line in order
to see it. So let's remove the semicolons. We don't need them. And now what we're going to do, we're going to
go and take the whole thing and put it inside the main query. So this is the main query. And now think about it as
new column. So I will put the query here. So it is just one new column in our select. But in order to have it as a
subquery, we have to use the parenthesis at the start and at the end. And of course, we have to go and give it a
name. So I'm going to go and use the same name over here. So it's going to be as total orders. So with that, the setup
for the subquery is ready and it is inside the select clause in the main query. Let's go and execute it. Now, as
you can see, we have everything together. We have the three pieces of information, the product details, side by
side with the total orders, and since it is always the same value it is going to be repeated for each row. So this is
what we call a scalar subquery inside the select clause, and here again it is very important to understand: if you are using
a subquery inside the select clause, only the scalar subquery is allowed. So for example, instead of having one value from
the aggregation we can go and use the order ID. So let's see what is going to happen. We will get an error. It is going
to say the subquery is returning more than one value, and this is not allowed because we are using the subquery in the
select clause. So that's why we have to have only one value, and by using the aggregation you will get one value. So
let's repair it. And it's working. And now again if you would like only to see the results from the subquery, what you
can do is highlight the subquery like this, without the parentheses of course, and you go and
execute it, and with that you can see in the output the 10. This is the intermediate result that's going to be
passed to the main query. And if you want the whole thing to be executed, just unmark it and execute, and with that
everything is executed, the subquery and the main query. So this is the scalar subquery in the select clause.
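A minimal sketch of the query built in this task, assuming the tables are Sales.Products (with ProductID, Product, Price) and Sales.Orders; the column spellings and the alias TotalOrders are assumptions based on the spoken names.

```sql
SELECT
    ProductID,
    Product,
    Price,
    -- Scalar subquery: returns exactly one value, repeated on every row.
    (SELECT COUNT(*) FROM Sales.Orders) AS TotalOrders
FROM Sales.Products;

-- Swapping COUNT(*) for a plain column (for example OrderID) makes the
-- subquery return many rows, and SQL Server rejects the statement with a
-- "subquery returned more than 1 value" error.
```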
Okay, so now let's see quickly how SQL executed this query step by step. So this is our original query and we need
two tables from our database for it. So the first step is that SQL going to go and identify the subquery and it's going
to go and execute it. So this is the first step. So the query is targeting the orders table and we are just simply
doing a count. So in the output we will get an intermediate result where we are counting the number of rows of the
orders. Now the next step is that SQL is going to go and pass this value to the main query. So this is the second step,
and if you go and pass this value to the main query, it's going to look like this. So you are saying product ID,
product and the 10. So after SQL has prepared the main query, SQL is going to go and execute it. So this time we are
targeting the products, and in the output we will get all the information from the products without any filter, because
here we don't have any where clause, and the final results we will get like this. So we will have the product ID,
the product and the total that we got from the subquery. So as you can see here the subquery is a scalar
subquery where we have only one single value. So again it's very simple: SQL always starts with the subquery, and then
it's going to go and pass the values to the main query, and at the end the main query is going to be executed and we will
get the final result from it. So this is how SQL executed our query. All right, next we're going to talk
about how to use the subquery in the join clause. All right, so now as we are joining tables in SQL, sometimes we have
to go and prepare the data before doing the join, to dynamically create a result set for joining with another table. So
again here we cannot join tables directly. We have to do a preparation step before doing the joins. Okay, let's
have the following task and it says show all customer details and find the total orders of each customer. Now, of course,
in SQL, you don't have only one solution, you have multiple solutions. But I would like to solve this task
using the subquery. So, now if you check the task, we have like two parts. The first part we have to show all the
customer details. And the second part, we have like here an aggregation find the total orders of each customer. So,
now let's solve those different parts using two different queries. Let's start with the easiest one. Show all customer
details. So I think this is very simple. So select star from sales customers. So let's go and execute it. So in the
output we have all the details about the customers and we have solved the first part. Very simple. Now let's go and
solve the second part. We have to find the total number of orders of each customer. That means, let me just have a
semicolon over here, we have to go to the table orders. So let's go and select first the order ID, customer
ID from the table sales orders like this. So I will just highlight the second query and execute it. Now in the
output we have 10 orders and we have the different customers. Now in order to find the total orders for each customer
we have to go and use the group by. In order to do that it's very simple. We're going to go over here and say count,
let's go with the star, and then we're going to go and group up the data by the customer ID. I will go and call this
total orders. So let's go and execute only this part, and with that we have four customers and we have the total
number of orders. So with that we have solved the second part of the task. So now what I'm going to do, I'm going to
go and execute both of those queries, separated by the semicolon, like this. I will just make this a little bit
bigger. So let's go and execute it. Now in the output we have the two results, all details about the customers and the
total number of orders for each customer. So now what we want to do is to go and combine those two results in
one. And in order to do that we can use the joins. So now we have to think about what is the first query, what is the
second query. Since the first query returns all the customers that we have in the database, I would like to have
this as the left table and since in the second query we have only four customers, I would like to have it then
as the right table and I will go with the left join so that I don't miss any customer because if I do the inner join,
I will lose the customer number five. So let's go and do that. So this is the first query, the main query. So I'm
going to call this main query. And now I'm going to give this an alias as well, like the c. And now we're going
to go and join this table from the database together with the result, the output of this query. So that means
we're going to do it like this: left join, and now we're going to join with a subquery. So we will have our
parentheses. I will just put here a few spaces so that it's clear it is a subquery, and we need an alias for this.
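Put together, the join being built in this walkthrough would look roughly like the sketch below. Sales.Customers, Sales.Orders, CustomerID and the alias TotalOrders are assumed spellings; the aliases c and o and the join on the customer ID are the ones chosen in the surrounding steps.

```sql
-- Main query: all customers, kept even when they have no orders.
SELECT
    c.*,
    o.TotalOrders
FROM Sales.Customers AS c
LEFT JOIN (
    -- Subquery: one row per customer with the number of orders.
    SELECT
        CustomerID,
        COUNT(*) AS TotalOrders
    FROM Sales.Orders
    GROUP BY CustomerID
) AS o
    ON c.CustomerID = o.CustomerID;
```

With a left join, a customer who has no row in the subquery still appears in the output, with NULL in TotalOrders.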
So let's go and say for example the o. So with that we are joining a table with the result of a subquery. And now of
course what is missing is joining the tables using a key. Now if you check the two results you can see in both queries
we have the customer ID. That's why we're going to join on the customer ID. So on, then the customer
ID with the customer ID from the subquery, like this. So we have everything, and let's go and execute it. Now as you
can see in the output we have all the details about the customers, together with the total number of orders
for each customer, and as you can see we didn't miss any
customer. So we have all the customers from the database and we can see that Anna doesn't have any orders. Now you
might say, you know what, we have here the customer ID twice. So what I'm going to do, I will select all the columns from
the customers, but from the subquery I'm interested only in the total orders. So like this, let's go and execute it. Let's
make this a little bit smaller. So now the results are really clean. We have all the details from the customers and
the total orders of each customer. And of course, as we learned, if you would like to check the results from only the
subquery you go and highlight it and execute it. So as you can see you can put the subqueries almost everywhere, and
this is how we use subqueries inside joins. Okay. So now we're going to focus on how to use the subquery in the where
clause. So now in SQL, as we learned, we can go and filter the tables using the where clause by using
static values. But now in real data projects we're going to go and filter the data based on complex logic. So
now in order to prepare this complex logic we go and use the subqueries in order to do dynamic filtering for
our main tables. And now in order to filter data using the where clause we have to go and use operators, and we can
split them into two groups. We have the comparison operators and another set we can call logical operators, or
sometimes we call them subquery operators. So now first we're going to talk about the comparison operators.
They are operators that we can use in order to compare two values, to help us filter the data based on a
specific condition. And now in SQL basics we have learned that we have different comparison operators and they
are very simple. So in order to compare two values we have operators like equal, and we have not equal, the
opposite. We have greater than, less than, and we have greater than or equal to, and the last one, we have less
than or equal to. So they are very simple. Now instead of comparing two values, we're going to go and compare a
value with the result of a subquery using the comparison operators. All right, let's check the syntax of the subquery
inside the where clause using the comparison operators. So we start with the standard stuff where we say select
a few columns that we want to retrieve, and we want to get the data directly from a specific table in our database, and now
we come to the where condition where we want to filter the table. So we say where, and then we select a specific column
from table one. Now since we are talking about the comparison operators we can go with an operator, for example
equal, and usually we go and specify here a static value like a number or a string, but instead of having a static
value, what we can do is get the value from another select statement, another query, like here where we are saying
select a column from table two, with a filter. So now whatever comes from this subquery is going to be used in order
to filter table number one. And of course we are telling SQL this is a subquery by defining the parentheses at
the start and at the end, and the outer query is going to be the main query. So as you can see we are using the subquery in
order to filter the main query. And here in SQL, if you're using a subquery with the comparison operators, we have a rule: the
subquery must be a scalar subquery, so only one single value. So that's all about how to use the subquery in the
where clause using the comparison operators. All right. So now we have again the same task and it says find the
products that have a price higher than the average price of all products. We have solved this task already using the
subquery inside the from clause. But now we're going to go and solve it again using the subquery, but this time inside
the where clause. So let's do it step by step. Let's go and get the information that we need. So we need the product ID,
we need the price from the table sales products. So let's go and execute it. So now we got the list of all products. But
we have to go and filter this information using the column price. So with that in the result, we got all the
products, but we don't need all the products. We need only the products where the price is higher than the
average. That means we have to go and filter the table based on the values of the price. So now in order to do that,
what we're going to do is use the where clause, and we have to go and filter the data based on the price,
and since we need higher than, we're going to go and use the comparison operator greater than. Now next we need
the value average price. So how we going to do it? We don't have the average price like out of the box in the table
products. We have to go and calculate it. That's why we're going to go and write another query where we're going to
go and find the average price from the table sales products like this. So now let's go and highlight it
and then execute it. And with that we got now the average price of our products. And as you can see in the
output we have only one single value. So this is a scalar query. So now what do we need? We need this value to be
used in order to filter the first query. So that's why the first query is the main query, the bigger one, and the second one is the
subquery that is going to support the main query in order to filter the data. So now what we're going to do, we're going
to take the subquery and use it in the WHERE clause. And now of course we have to tell SQL this is a subquery. That's
why we have to put it inside parentheses. So with that we have the subquery inside the WHERE clause in
order to filter the main query. So let's go and execute it. And now as you can see in the output we have now only two
products where the price is higher than the average price. So with that we have solved the task but this time using the
subquery in the WHERE clause in order to filter the main query. And of course in order to see this value in our select,
since it is a scalar subquery, we can as well go over here and put it in our SELECT just in order to see the value.
So average price. So let's go and execute it. And with that we can see as well in our results the average price.
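
As a minimal sketch, the finished query might look like the following. The schema-qualified table name Sales.Products and the column names ProductID and Price are assumptions based on how the walkthrough names them, and the AvgPrice alias is only for illustration.

```sql
-- Assumed schema (not confirmed by the walkthrough): Sales.Products(ProductID, Price)
SELECT
    ProductID,
    Price,
    -- Scalar subquery repeated in the SELECT list only to display the average
    (SELECT AVG(Price) FROM Sales.Products) AS AvgPrice
FROM Sales.Products
-- Scalar subquery in the WHERE clause: it returns exactly one value,
-- so it can sit on the right-hand side of a comparison operator
WHERE Price > (SELECT AVG(Price) FROM Sales.Products);
```

The inner SELECT can be highlighted and run on its own, which is a quick way to check that it really returns a single value.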
So this is how we use the subquery in the WHERE clause using the comparison operator. Okay. So let's see quickly how
SQL is going to execute our query step by step. So as usual, first SQL is going to go and identify the subquery. It's going to
be our SELECT of the average price and so on. And now in the next step SQL is going to go and execute our subquery. So it is
based on the products, and since we are doing an aggregation without GROUP BY, at the output we will get only one value.
So the average is going to be 20. This value is stored as an intermediate result in memory, so we will not see it in the
output. SQL is going to go and pass this value to the main query. So the main query is going to look like this. We are
selecting few columns from the table and we are filtering the data based on the price that is higher than the value 20
that we got it from the subquery. So now once SQL have everything for the main query SQL going to go and execute it. So
SQL going to go to the products and only select the products where the price is higher than 20. So it's only those two
rows and in the output we will get the final results the two products as well. So product ID and product price. So
that's it. It's very simple. This is how SQL executed our query. So as usual first starting with the subquery passing
the value to the main query, and at the end the main query is going to be executed with the information from the
subquery and we will get the final results. So that's
it. All right. So now we're going to talk about the second group of operators and we're going to start with the IN
operator. So what is the IN operator? As we learned before, with the comparison operators we can go and filter the data
based on only one single value. But now in some scenarios, we have to go and filter the data based on multiple
values, not only one. In this case, we can go and use the IN operator. So if you go and use the IN operator, it's going to
go and check whether the value matches any value from a list, so a list of multiple values. If it matches any of
them, we will get a true. Okay. So now let's have a quick look at the syntax of the subquery using the IN
operator. So we start with the classic stuff where we say okay we would like to retrieve the column one column two from
the table one and we want to filter the data based on the column from the table one. Now after specifying the column
we're going to use the in operator and after that we can go and specify static values but since we are talking about
the subqueries the values going to come from another query. So here we have another select statements from table two
and we filter the data for this query. And now the result of this subquery going to be used in order to filter the
data using the IN operator. And now the big difference between the IN operator and the comparison operators is that the
subquery is allowed to return multiple rows. So there is no rule about having a single-value scalar subquery.
We can have in the result a list of multiple values. So this is the syntax of the subquery using the IN operator.
All right, let's practice using this task. It says show the details of orders made by customers in Germany. So let's
see how we can solve this task. First it needs the details of orders. So as we know we have the
table sales orders. So let's go and execute it. So in the output we have all orders and with all details. But for the
task we don't need all the orders. We need only the orders that were made by customers from Germany. So now if you
check the table orders, you don't find any information about the countries, right? So we have to go and get it from
another table. And as we know, we can find this information in the table customers. So let's build another query.
So let's say select star from sales customers like this. So let's go and execute only the second query like this.
Now, as you can see in the customers, we have the country column, and this is exactly what we need. So, now let's make
a list of all customers from Germany. So, we don't need all customers. We need only the ones that come from Germany.
That's why we're going to go and use the WHERE clause and we say country equal to the value Germany like this. So, let's
go and execute it again and check the results. Now, in the output, we have our German customers number one and number
four. So now we're going to go and use this information in order to filter the table orders. So let's go back to the
table orders over here. And here we have the customer ID information. And as we can see we need the orders where the
customer is either one or four. Now in order to filter that we're going to go to the first query and use the WHERE
clause like this and say the customer ID. So now since we have two values, one and four, we can go and use the
operator IN. So let's go and use the IN and let's go and build the list. So let's go and have the one and four. So
let's go and execute it. Now we can see the results. We have the orders but only from the customers one and four. So with
that we have solved the task. We have the details of orders made by customers in Germany. Right? And now of course
this is a really bad solution, because what if we get a new customer in the future? You don't want to go and keep
adding values here each time you have a new customer. We want to make the values for this list
dynamic. So we don't need static values, we need dynamic values, and we can use the subqueries in order to
retrieve that information. Right? And we have it already in the second query. So let's go back to the second query
over here. We need only those two values one and four. That's why we're going to go to the query and say okay let's
retrieve the customer ID. So let's go and execute it again. And with that we have the one and four, exactly like we
have it here in the first query. And of course in the future if there's another customer that comes from Germany,
this list is going to be a little bit longer. So this query is always going to retrieve all the customer IDs that have the
country equal to Germany. So now what we're going to do, we're going to take this as a sub query. Let's go and get
everything from it and now put it instead of those static values. So of course we're going to go now and put few
spaces to the right side in order to understand this is subquery and of course here we don't use any aliases. So
now what we are doing the results from this subquery going to be used in order to filter our main query. So let me just
call it main query like this and make this smaller. So let's go and execute it. And now we
are getting the same results. We are getting the orders from only the customers one and four, who come
from Germany. And this information comes dynamically from the subquery, so we don't have to worry about new customers
from Germany. They're going to be added here automatically. And this query is always going to return all the orders from
Germany. So this is the power of the subquery together with the IN operator if you are having multiple values,
multiple rows. So we have solved the task. All right. Now one more thing. Let's say that the task is exactly the
opposite. It says show the details of orders made by customers who don't come from Germany. So now here there's like
two ways in order to do it. Either you go to the subquery and you say you know what the country should not be equal to
Germany. So if you go and execute it, you will get all the customer IDs that are not from Germany. And if you execute
the whole thing, you will get all the orders where the customers are not from Germany. So either you do that or you
stay with the equal to Germany, but you go and invert the whole logic by using the operator NOT. So now we are saying
the customer ID should not be equal to any of those values. So it should not be equal to one or four. And for that we
are using the NOT IN operator. So let's go and execute it. So now with that we are getting all the orders where the
customers don't come from Germany by just using the NOT IN operator. So that's all about the NOT IN and the IN operators.
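
The two queries built in this part can be sketched roughly as follows. The names Sales.Orders, Sales.Customers, CustomerID and Country are assumptions inferred from the spoken table and column names, not confirmed identifiers.

```sql
-- Assumed schema: Sales.Orders(..., CustomerID, ...), Sales.Customers(CustomerID, Country, ...)

-- Orders made by customers in Germany: the list for IN comes from the subquery,
-- so new German customers are picked up without editing the query
SELECT *
FROM Sales.Orders
WHERE CustomerID IN (
    SELECT CustomerID
    FROM Sales.Customers
    WHERE Country = 'Germany'
);

-- Orders made by customers who are not from Germany: same subquery, inverted with NOT IN
SELECT *
FROM Sales.Orders
WHERE CustomerID NOT IN (
    SELECT CustomerID
    FROM Sales.Customers
    WHERE Country = 'Germany'
);
```

One caveat worth keeping in mind: if the subquery can ever return a NULL, NOT IN matches no rows at all, so this form is safest when the subquery column cannot be NULL (a key column such as a customer ID usually cannot).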
All right. So now let's see step by step how SQL executes our query. So we are targeting two tables, the customers and
the orders. So the first step is that SQL going to go and identify the subquery and it's going to go and
execute it. So the subquery here is filtering the data based on the country. So the query going to be executed and in
the output we will get only two rows. So it is one column with multiple rows. This is the row subquery and this is our
intermediate result, which is going to be passed to the main query. So our main query is going to look like this. We are
selecting some information from the orders and we are filtering the table orders based on the customer ID, where we
are saying the customer ID must be one of those values, one or four. So the subquery here is supporting the main
query with the information for the filter. Now once SQL has everything, it's going to go and execute our main query,
and this is going to go like the following. So we will start with the first row, and here the customer ID is equal to two. So
the value two is not equal to 1 or four. That's why this row will be excluded from the final results. Now let's move
to the second row. We have here the value three and the value three is not equal to one of those values. That's why
this row is going to fail as well. So we will not have it at the output. And then SQL is going to go to the
next one. Now this time the customer ID is one and it is equal to one of those values. It's equal to one. So we have a
match. That's why this row will be included to the results. And the same thing for the next row because we have
the customer ID one and so on. Now after SQL checking all those customer ids whether they are in the list one or four
we will get the final results where we have here all the orders where the customer ID either one or four. So this
is how SQL executed the IN operator using the subqueries. Okay. So now moving on to
the ANY operator. So we can go and use the ANY operator in order to compare a value against the values from a
list. That means we can go and use it in order to check whether a condition is true for at least one of the values in a
list. Okay. So now let's check quickly the syntax of the subquery using the ANY and ALL operators. So as we learned
before, we can go and use a subquery inside the WHERE clause in order to filter the main query using the
comparison operators, like here less than. Now the syntax of the ANY operator is that you're going to go and use the
comparison operator and immediately after that you use the keyword ANY. And for the ALL operator it's going to be exactly
the same, where you're going to go and put the keyword ALL after the comparison operator. So the syntax is very
simple. We just add those keywords. So let's practice using the following task. Find female employees whose salaries are
greater than the salaries of any male employee. So that means we want to go and compare the salaries between the
male and female and specifically we are searching for female employees whose salary is greater than at least one male
employee. So let's solve it step by step. Let's go and start by selecting some information, for example the
employee ID, first name, gender, and salary from the table
Sales.Employees. So let's go and execute it. So now we have five employees. Three of them are male and two are
female. So now since we want to compare the data between male and female let's go and create two queries. The first one
is filtering the data based on the gender. So the first one is for the female employees, and we can go and remove this
information over here. Let me just make this a little bit smaller and zoom out. And the second query is going to be
the exact opposite. Let's go and get the employee information for the male employees. So let's go and execute it. Now the first
result is for the female employees and
the second one is for the male employees. So now what do we need in the output? We need the female employees.
That means this is going to be our main query. So we are focusing on the female employees and we are using the male
employees only as a filter, and from them we need only the salary information. That's why we can prepare
it like this. I will just put everything in one line to make it clear. So this is going to be our subquery. So now we're
going to go and work with the main query where we're going to add one more filter where we're going to filter the data
based on the salary. Right? So we're going to say the salary is greater than, and now we need the values from the
subquery, right? So this is our subquery. We're going to put it like this, and don't forget about the parentheses at
the start and at the end. And I would still like to keep those two queries. So let's go ahead and execute it, and now we
will get an error, and that's because our subquery is returning multiple rows and this is not acceptable: we are using a
comparison operator, and SQL expects the subquery to be a scalar subquery, so only one single value. But now in order
to solve this issue, we can go and use the logical operators, either ALL or ANY. So now since we are saying it's enough
for the salary of the female employee to be higher than at least one male employee, we will go with the operator
ANY. So let's go after the comparison operator and add the keyword ANY. And let's go and execute it again. And now
as you can see in the output we got only one female employee whose salary is higher than that of at least one of those male employees.
So let me just go and get the first name as well from the second query just to have it like this.
So now if you go and compare the salary of Mary it is not higher than Michael but it is higher than Frank and Kevin.
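
A rough sketch of the ANY query at this point is shown below. The table name Sales.Employees, the column names, and the gender codes 'F' and 'M' are assumptions; the actual stored values may differ.

```sql
-- Assumed schema: Sales.Employees(EmployeeID, FirstName, Gender, Salary)
-- Assumed gender codes: 'F' and 'M'
SELECT
    EmployeeID,
    FirstName,
    Gender,
    Salary
FROM Sales.Employees
WHERE Gender = 'F'
  -- True when the salary is greater than at least one value returned by the subquery
  AND Salary > ANY (
      SELECT Salary
      FROM Sales.Employees
      WHERE Gender = 'M'
  );
```

Without the ANY keyword the comparison would require a scalar subquery, which is why the multi-row version raised an error earlier.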
And since we are using the ANY operator, it's enough for Mary to have a salary higher than one of those values. In this
case, it's higher than both Frank's and Kevin's, so the condition is fulfilled. That's why we are getting Mary. And
the other female employee, let me just check what else we have. So, we have Carol, whose salary is less than all the salaries of
the male employees, but it would have to be higher than at least one of the male employees. So, with that, we have solved
the task, right? All right. So, now we have another operator that is similar. We call it the ALL operator. We can go
and use it in order to compare a value against all values in a list. So that means we can go and use it if we
need to check whether a condition is true for every value in a list. I know that might sound a little bit complicated, but
don't worry about it. We can have examples. Now let's say that our task says: find female employees whose salaries
are greater than the salaries of all male employees. So that means now the condition is more restrictive. Mary
should now have a salary higher than every male employee. So it should be higher than all those values that we have
from the male employees. And of course in this scenario it's not, because we have Michael. Mary has a lower salary
than Michael. And this is a problem because Mary should have a higher salary than everyone. So let's go and try it.
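
The stricter variant only swaps the keyword. As before, this is a sketch that assumes a Sales.Employees table with EmployeeID, FirstName, Gender and Salary columns and the gender codes 'F' and 'M'.

```sql
-- Assumed schema: Sales.Employees(EmployeeID, FirstName, Gender, Salary)
-- Assumed gender codes: 'F' and 'M'
SELECT
    EmployeeID,
    FirstName,
    Gender,
    Salary
FROM Sales.Employees
WHERE Gender = 'F'
  -- True only when the salary is greater than every value returned by the subquery
  AND Salary > ALL (
      SELECT Salary
      FROM Sales.Employees
      WHERE Gender = 'M'
  );
```

A single male employee with an equal or higher salary is enough to exclude a row, which is what makes ALL more restrictive than ANY.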
If I go and write ALL here and execute it, you will see we will not find any results that fulfill this
requirement. So we don't have any female employee whose salary is higher than that of all male employees, and that's because we
have a very small data set. So this is how we use the ALL and ANY operators in our subqueries in SQL. All right. So with
that we have covered almost everything about how to use the subqueries in different locations and clauses. But we
didn't talk about the EXISTS operator, and that's because I would like you to understand a very important concept in
the subqueries, where we have two different types of subqueries based on the dependencies: the non-correlated
and the correlated subqueries. And after that we're going to go back to the EXISTS operator. All right friends. So now we
come to the part where it is a little bit complicated about the subqueries. Now we're going to talk about the
dependencies between the subquery and the main query. So far all the examples and the subqueries that we have learned
were non-correlated subqueries. A non-correlated subquery means a subquery that can run independently from the main
query. So that means the subquery is like a standalone query. But on the other hand we have the exact opposite type of
the subquery. We have the correlated subquery. A correlated subquery is a subquery that relies on values from the
main query for each row it processes. So that means the subquery here is completely depending on the main query.
So I know this might be a little bit confusing. That's why we can have the following very simple sketch in order to
exactly understand how this works. So as usual we have the database tables, and now this time SQL is going to start by
executing the main query first. This is the first thing that happens. So the main query is going to go and query the database
in order to get results, and SQL is going to process the results row by row. So now what is going to happen? The main query
is going to go and pass the first row's information to the subquery. So now the subquery is going to get the data from
the main query. So SQL is going to execute the subquery. So here the subquery is going to return a value, for example one.
So here it's very important to understand that now the SQL or the main query going to check is there a result
from the sub query in this example yes we have a results. So here SQL is checking the output for the subquery for
the first row. So if there is a result SQL going to go and return the row in the final result. So this is the whole
iteration happened only for the first row. So we're going to process the whole thing again from the start for the
second row. So the main query going to get the second row from the database and it going to pass it to the subquery.
Once the subquery gets this new information, SQL is going to go and execute the subquery once again. So now
let's say that after executing the subquery, there were no results. So the subquery is not returning anything after
the execution. So now what can happen? SQL and the main query going to check okay there is no result from the sub
query and this means this row should be excluded and not presented in the output. So we will not see this row at
the output. So as you can see SQL is executing the subquery once again for the second row. So this will keep
happening as long as we have rows. For example, we have another row. The main
The subquery going to be executed for the third time and the result of the subquery is going to be one. So the same
thing going to happen. SQL going to check it. Okay, we have a value. So this row is allowed to be in the final
results and so on. The cycle going to keep repeating for each row that's going to be retrieved from the main query and
once we have processed all the rows, the final result going to be presented in the output. So what we have understood
so far: the correlated subquery always depends on the main query, and the subquery is going to be executed for
each row that we're going to get from the main query. So in this example we have four rows and the subquery is
executed four times. So this is how the correlated subquery works. It's a little bit more complicated than the
non-correlated subquery. The non-correlated subqueries are really straightforward. So first the subquery
is going to query the database only once, and the output of the subquery is going to be an intermediate result
that is going to be used by the main query. So the main query is going to go and query the intermediate result, and in
the output we're going to get the final results. So as you can see in the execution of the non-correlated subquery
it is straightforward. There's no iterations everything going to be executed only once. So now if you
compare them side by side you can see that with the non-correlated subquery it is completely independent from the main
query. So that means the subquery going to be executed only once and after that SQL going to go and as well execute the
main query only once using the result from the subquery. But with the correlated subquery, the subquery is going to be executed
multiple times and it is completely dependent on the main query, and there is an iteration for each row that's
going to be retrieved from the main query. So the process going to be cycling until all the rows are processed
and this is exactly how the correlated and the non-correlated subqueries work in SQL. All right. So now let's have the
following task and it says show all customer details and find the total orders of each customer. We have already
solved this task, and you know in SQL we don't have only one query in order to solve something. We have multiple ways
to do it. So we solved this task before using the subqueries and the joins. Now we're going to go and solve
this task using a subquery in the SELECT clause, and as well using the correlated subqueries. So again let's do it step by
step. It's very simple. First we need all the customer details. So as we learned, SELECT * FROM Sales.Customers.
So if you execute it you will get all the details of all customers. Now we need to find the total number of
orders of each customer. Now before, we have solved this using a simple query where we used the COUNT function
together with a GROUP BY, but this time we're going to do it a little bit differently. So let's go and write a query
saying select count star from the table sales orders. So now let's go and execute it. With that we have the total
number of orders. So let's go and take this sub query and use it in the select. So we are using it as a scalar subquery.
So let's just put it over here. And this is the main query. And in order to make this as a subquery, what we're going to
do, we're going to have the parenthesis and we're going to say the total sales. So now let's go and execute it. So now
as you can see, we have here all the details about the customers and we have the total sales. But we have one issue.
We don't need just the overall total of orders. We need the total orders for each customer. So each customer has a different total
of orders. And we cannot have the following setup: we cannot say GROUP BY customer ID and then add
the customer ID here and so on. If you go and execute it, you will get a problem. And that's because if you go and execute
this subquery over here, you will get multiple rows and multiple columns. So you have a table subquery. And this
type of subquery is not allowed to be used in the SELECT clause, right? We have to have only a scalar
subquery. So that's why we cannot do that. So we have to go and remove all that stuff.
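
One way to get a per-customer count while keeping the subquery scalar is to correlate it with the outer row, which is the approach the walkthrough takes next. The sketch below assumes tables Sales.Customers and Sales.Orders that share a CustomerID column; the aliases c and o and the column alias TotalOrders are illustrative (the walkthrough labels this column total sales).

```sql
-- Assumed schema: Sales.Customers(CustomerID, ...), Sales.Orders(..., CustomerID, ...)
SELECT
    c.*,
    -- Correlated scalar subquery: it references c.CustomerID from the main query,
    -- so it is evaluated per customer row and returns one count each time
    (SELECT COUNT(*)
     FROM Sales.Orders AS o
     WHERE o.CustomerID = c.CustomerID) AS TotalOrders
FROM Sales.Customers AS c;
```

Because the subquery refers to c.CustomerID, it cannot be highlighted and executed on its own. Replacing that reference with a literal such as 1 is a simple way to inspect what it returns for a single customer.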
But we can go and solve it using the correlated subqueries. Right now the subquery is still completely independent from
the main query. So in order to correlate it, what we're going to do, we're going to go and connect it. So I'm going to
give aliases for the tables and I'm going to say where the customer ID equal to the customer ID from the main query
from the customers. So again we are connecting the customer ID from the orders in the subquery with the customer
ID from the table customers that comes from the main query. So now we are saying okay execute this only for a
specific customer not for the whole table. So let's go and execute it. So now in the output we have the total
sales for each customer and we don't have here like the total sales in the whole table orders and that's because
what is happening is that for each row the subquery is going to be executed. So for the customer number one this query is going
to be executed like this: count the total number of orders where the customer ID is equal to one. So let me just show
you what this means. If I go and remove this from here and just put the number one, and then execute this, you
will see the customer ID one has three orders. And let's just put it back and execute. And the same thing is going to
happen for each customer. So for each customer, for each row, this subquery is going to be executed, and it is
filtered with the customer ID that comes from the main query. So this is another way to solve this task using the
correlated subqueries. So now let's summarize and understand quickly what are the differences between the
non-correlated and the correlated subqueries. So now if you are talking about the definition the non-correlated
subqueries are subqueries that are independent of the main query, but on the other hand the correlated subqueries are
dependent on the main query. And now if you're talking about the execution, the non-correlated subquery is going to be
executed only once and then the results are going to be used by the main query, but with the correlated subqueries the
subquery is going to be executed for each row that we have from the main query. And as we learned, a non-correlated
subquery can be executed on its own. So we can go and select it and execute it. But a correlated subquery
cannot be executed on its own, so we always have to execute the whole thing. And if you are talking about which one is
easier, I think it's clear that the non-correlated subqueries are easier to write and to read. And on the other
hand, the correlated subqueries are harder to read and more complex as well. Now, if you're talking about the
performance of the database, since the non-correlated subqueries are executed only once, this of course is going to lead
to better performance, because things are really straightforward and not complicated. But on the other hand,
with the correlated subqueries there is more effort, because SQL has to check a lot of stuff and the subquery is going to
be executed many times. So the non-correlated subqueries are usually faster. We use the non-correlated subqueries in
order to do a static comparison. So the subquery is executed only once, and the value that we get from it is
a static value that we use for filtering and so on. But on the other hand we use correlated
subqueries in order to do a row-by-row comparison. And since we don't have a static value here, each time the subquery
is going to run we're going to have different results. This is going to make the filters more dynamic, since we don't
have a static value. So those are the big differences between the non-correlated and the correlated
subqueries. All right. So now that we have understood the concept of the two types, correlated and non-correlated subqueries,
we're going to go now and cover the last operator for the subqueries. We have EXISTS. So what is the EXISTS
operator? All right. So now we're going to talk about a very interesting operator in SQL, EXISTS. So
now in some scenarios, as you are querying the data from one table, you would need to go and check whether the
rows of this table exist in another table. So that means you are checking the existence of your rows in a
different table. And exactly in this scenario we go and use subqueries together with the operator EXISTS. So
the EXISTS operator is very simple. It simply checks whether the subquery returns any results, any rows.
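
To make the idea concrete before the syntax is broken down, here is a small sketch that reuses the earlier task of finding orders made by customers in Germany. The names Sales.Orders, Sales.Customers, CustomerID and Country, as well as the aliases o and c, are assumptions for illustration.

```sql
-- Assumed schema: Sales.Orders(..., CustomerID, ...), Sales.Customers(CustomerID, Country, ...)
SELECT *
FROM Sales.Orders AS o
WHERE EXISTS (
    -- The selected value is irrelevant; only whether a row comes back matters
    SELECT 1
    FROM Sales.Customers AS c
    WHERE c.Country = 'Germany'
      -- Correlation: ties the subquery to the current order row of the main query
      AND c.CustomerID = o.CustomerID
);
```

An order row is kept when the subquery finds at least one matching customer and dropped when it finds none. Swapping in NOT EXISTS would keep the opposite set of rows.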
So now let's understand the syntax of the correlated subqueries using the EXISTS operator. This can be a little bit
complicated but we're going to do it step by step. Don't worry about it. So let's start with the easy stuff. In the
main query we're going to go and write a simple SELECT. We are selecting a few columns from table two. And now we
don't need all the data from table two. We want to filter the table using the WHERE clause. Now what we're going to do
after the WHERE keyword is immediately write another keyword called EXISTS. So we don't specify any column
before the EXISTS like we have done with the comparison operators or the IN operator. We don't need that because we
are not filtering based on a value. We are filtering based on the logic. That's why we have the word EXISTS immediately.
And now directly after the EXISTS, we're going to go and define the subquery like this. So we're going to start by saying
SELECT 1 from table one. Well, it is not a must or something, but it is very common
to specify a one here. We are not using the subquery in order to retrieve information from table one. We are
just testing whether the subquery is going to return a row or not. And we don't care about the returned value. It could
be one, it could be a column, it could be anything. So we don't care about the data that is retrieved. We just care
whether the subquery is returning anything. So that's why we go and write any value, like a one here. So now we are
not done yet. This subquery is not yet connected to the main query. We have somehow to go and connect them together.
And we can do that using the WHERE clause, where we go and connect the ID from table one in the subquery with the ID
from the outer query, the main query. And with that we are building a relationship between the subquery
and the main query. So with that the subquery is now depending on the values from the main query, because here we have
the table two ID. So the IDs from the main query are going to filter the subquery. So this is the syntax of correlated
subqueries using EXISTS, where we are making the subquery totally dependent on the main query. So let's understand how
EXISTS works. So now each row that we have from the main query is going to trigger an execution of the
subquery. This subquery going to help us to evaluate this row. So we are testing this row. Now if the subquery doesn't
return anything, so there are no results, what happens? The row that we are evaluating from the main query will be
excluded from the final results. But on the other hand, if the subquery is returning a value, so we have some
kind of result, then this row that we are evaluating is going to be included in the final results. So the subquery is
used in order to do a test: do we have a result or not? And based on this, SQL is either going to include or exclude
the row from the final results. So this is the logic behind EXISTS in SQL. All right. So now we're going to go and
solve the same task using the exists. So the task says show the details of orders made by the customers in Germany. So we
have already solved this task using the in operator and the subquery. Now we're going to go and solve it using the
exists. So again we're going to have the same logical steps that we have done before. So first we're going to go and
select all the details from the table sales orders. So let's execute it. And with that we have all the orders and all
the details. But of course we don't need all that information. We need only the orders that are made by customers from
Germany. So that is the first query. Let's go and construct the second query. We're going to say select star from
sales customers. But we don't need all the customers. We need only the customers from country equal to the
value Germany. So let's go and execute it. So now we have all customers that come from Germany. Now we have to go and
put those two queries together in order to get the final results. So as we learned before the second query going to
be our subquery. So it's going to be supporting the first query in order to filter the data. So the first query
going to be our main query. Let me just make this smaller and the text as well. Now we don't need all the orders, right?
We need only the orders where the customer comes from Germany. So we need the WHERE clause. So now we can have the
filter logic like this: show the order details only if the customer ID exists in the subquery. And now we have to go
and put our subquery. So our subquery going to be this one over here. So let's just move it to the right side. And in
order to have it as a subquery, we have to close the parenthesis. And now, since EXISTS takes a correlated subquery, we cannot
leave it like this. We have to connect the subquery with the main query. Right now the subquery is
still independent from the main query. Because we want to check, for each order in the orders table,
whether its customer exists in the subquery, we're going to add the condition like the following. And
now it's like the joins we have to go and connect the customer ids together. So we're going to go over here and give
it like an alias and as well for the subquery. And now we're going to say customer ID from the orders should be
equal to the customer ID from the subquery the table customers like this. So again this customer ID
come from the subquery and this customer ID comes from the main query. So now since we are using the subquery only in
order to test the existence of the customer. So if the subquery returns anything or not, it doesn't matter what
you are selecting in the subquery. So you can go with the star or a column or any static value. But by convention
most SQL developers go with the static value one. And of course you can go and add a column like
the customer ID, but there is no reason to, because EXISTS never uses the selected values. Most query optimizers
ignore the select list inside EXISTS anyway, so SELECT 1 is mainly about making the intent clear. So let's stick with the best
practices. Use the value one if you are working with EXISTS. So this is our subquery and I think we have everything.
Let's go and execute it. Now as you can see in the output we got all the orders where the customers come from Germany.
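As a sketch, the query built in this walkthrough could look like the following. The schema-qualified table names Sales.Orders and Sales.Customers and the column names CustomerID and Country are assumptions based on how the tables are described here, and the aliases o and c are only illustrative.

```sql
-- Orders placed by customers from Germany, using a correlated subquery with EXISTS.
-- Table and column names are assumed, not confirmed by the transcript.
SELECT o.*
FROM Sales.Orders AS o
WHERE EXISTS (
    SELECT 1                            -- the selected value is irrelevant; only row existence is tested
    FROM Sales.Customers AS c
    WHERE c.Country = 'Germany'
      AND c.CustomerID = o.CustomerID   -- correlation: ties the subquery to the current outer row
);

-- Negated form: orders for which no matching German customer exists.
SELECT o.*
FROM Sales.Orders AS o
WHERE NOT EXISTS (
    SELECT 1
    FROM Sales.Customers AS c
    WHERE c.Country = 'Germany'
      AND c.CustomerID = o.CustomerID
);
```

Note that the NOT EXISTS form also keeps orders that have no matching customer row at all, since for those the subquery returns nothing either.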
Now of course if you want to go and try another value and execute you will get exactly the same results. So it doesn't
matter which value you are using. So with that we have solved the task this time using the exists. Now if the task
says show the details of orders made by customers that don't come from Germany it's going to be very simple. We're
going to use the operator NOT before the EXISTS, so WHERE NOT EXISTS. So now we are flipping the whole logic
and we are saying there should be no match in the subquery. So now if you go and execute it you will get all
the orders where the customers don't come from Germany, simply by using the NOT logic. And there is one more thing
that is annoying about the correlated subqueries. If you compare to the non-correlated subqueries as we learned
before, let me go back to the IN operator. Now this is a non-correlated subquery. And if I go and select only
the subquery, I can go and execute it independently. So I can go and check the intermediate results and like validate
my query. But the problem with the correlated subquery, I cannot go and highlight the subquery and then go and
execute it. And that's because in the subquery we are referencing a column that is outside our subquery, one that
comes from the main query. So this piece of information is currently unknown to SQL, and that's why we are getting
this error: SQL is saying, okay, I don't know where this column comes from. So this is a little bit annoying: with
correlated subqueries you cannot test the intermediate results directly. What I usually do is test an
intermediate result for only one row. So for example, I'm going to go and pick like a customer here. For example, two.
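The single-row test described here can be sketched as follows. The outer reference is swapped for a literal so the subquery can run on its own; the table name Sales.Customers and the columns Country and CustomerID are assumed names.

```sql
-- Stand-alone test of the correlated subquery for one outer row.
-- The outer column (o.CustomerID in the full query) is replaced by a literal value.
SELECT 1
FROM Sales.Customers AS c
WHERE c.Country = 'Germany'
  AND c.CustomerID = 2;   -- value copied by hand from one row of the main query
```

If this returns a row, the outer row with that customer ID would pass the EXISTS test; if it returns nothing, that outer row would be filtered out.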
So I'm going to go and say okay, the customer ID should be equal to two. So let me just remove this from here. I got
this value from the main query. So if I go now and execute it, I can see here. Okay, the subquery is not returning
anything because there is no such value. So with that, I'm just testing one row. And of course, in order to
make this work again I have to add back the column from the main query. So this is why correlated subqueries are
a little harder to understand compared to the non-correlated ones: we cannot test the intermediate
results like we can do there. So this is another way to solve this task, using a correlated subquery with the
operator EXISTS. Okay. So now let's see step by step how SQL executes a correlated subquery using the EXISTS
operator. This time SQL will not start with the subquery. SQL is going to start immediately with the main
query. SQL is first going to identify the main query and execute it, but it's going to evaluate it
row by row. So the first row is going to be the first customer. So now SQL is going to put the first customer under the
test. So now the next step is that SQL going to go and pass the value of the customer ID from the main query to the
subquery. So we are doing now exactly the opposite. So now what going to happen? SQL going to prepare the
subquery with the following information. So we are saying the customer ID equal to one and then SQL going to go and
execute it. Once SQL executes this query, we will get the result of one, and that's because we have here
multiple rows where the customer ID is equal to one. So there are rows in the orders table where the customer ID is equal
to one. So now what is going to happen? The row from the main query is going to pass the test and this customer is going to be
included in the final results. So now for the next step, SQL is going to start testing the second customer.
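The query being traced in this walkthrough is not shown in the transcript. A query of the following shape would match the description, with customers as the main query and orders as the subquery. The names Sales.Customers, Sales.Orders and CustomerID are assumptions.

```sql
-- Customers that have at least one order.
-- Logically, the subquery is evaluated once per customer row, with
-- c.CustomerID taking the value of the row currently being tested.
SELECT c.*
FROM Sales.Customers AS c
WHERE EXISTS (
    SELECT 1
    FROM Sales.Orders AS o
    WHERE o.CustomerID = c.CustomerID
);
```

The row-by-row description is the logical model. A query optimizer is free to pick a different physical strategy, such as a semi-join, as long as the result is the same.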
So we're going to put this customer under the test. Now we're going to go and pass the value to the subquery. So
here we're going to have the value of two and then SQL going to go and execute this query and of course we will get a
result because we have here multiple times where the customer ID equal to two. So that's why in the output of this
subquery we will get one. So now SQL is going to say, great, we have a value from the subquery, that's why it is safe
to show this customer in the output. And now SQL is going to go to the next row and so on. So for the next two
customers the same thing is going to happen. All of those customers will have a value from the subquery, and that's why
they are all passing the test. So we will have them in the output. Now SQL is going to go to the last row from the
table customers. So we have Anna, and we're going to put Anna to the test. So now what is going to happen? SQL is going to
pass the value five to the subquery and execute this query against the table orders. Now once
SQL executes this query, nothing will be returned, and that's because we don't have in the table orders a
customer ID equal to five. And now SQL is going to say, well, we are not getting any results from the subquery. That's why
this customer is going to fail and SQL will not show it in the output. So it will be completely removed. So the customer Anna
is excluded because the subquery is not returning anything. Customer ID number five, Anna, does not exist in the table
orders. So it's going to fail the test and we will have in the final results only four customers. So this is exactly
the purpose of EXISTS: we are checking the existence of our rows in another table or another
query. So this is how SQL executes correlated subqueries using the operator
exists. All right friends, so with that you have covered everything about the subqueries, all the different categories
and types of the subqueries and now we're going to do a quick recap about the subqueries. So as we learned
a subquery is simply a query inside another query. And we use subqueries in order to break down complex queries
into smaller, simpler, easy-to-manage pieces that make everything easier to develop and to read. And as we
learned there are like many different use cases for the subqueries. So we use subqueries in order to create temporary
result sets to be used later from another query. And we learned that we can use the subqueries in order to
prepare the data before joining the tables. And another very important use case for the subquery is that we can use
it in order to filter our data using dynamic and complex filter logic. And as we learned, we can
use correlated subqueries with the EXISTS operator in order to check the existence of rows in other
tables, and correlated subqueries also help us to do row-by-row comparisons. All right my friends, so
with that we have covered an important technique on how to nest your queries in SQL. Now in the next step we're going to
talk about one of the most famous techniques for doing multiple steps in SQL: the CTE, the common table expression. So
let's go. A CTE, or common table expression, is a temporary named result set, like a
virtual table, that can be used multiple times within your query to simplify and organize complex queries. So
let's understand what this means using the following sketch. So we have our database tables like orders, customers
and so on. And in very simple scenario we write a simple SQL in order to query and retrieve the data from the database
and then in the output we will get the result of the query. So this is the simplest version of querying data. Now
things get complicated in our project and we could have the following technique in our query. So we still have
this section where we are saying select from. But now inside our query we can write another query like for example
SELECT ... FROM ... WHERE, which has nothing to do with the first query, and we can give this new query inside our
query a name. We call this query a CTE query, a common table expression. And the first query outside
this CTE we call the main query. Now if you check this, we have a query inside another query. So now let's see
what SQL is going to do with this. First it is going to execute the CTE query. So the CTE query is going to
be executed and retrieve some information from our database tables. Now the output is going to
be available only within the query, and it is going to have the shape of a table, for example sales. So now
the sales table and the orders tables both of them are tables but one is stored in the database and the other one
is an intermediate virtual table. So now what can happen in the main query we can go and start querying the sales table
the result from the CTE as any other normal table like we do to the database tables. So the main query going to go
and retrieve few informations and maybe do some manipulations on top of the sales table or let's say the CTE results
and of course the main query as well can go and say you know what let's go and query as well few tables from the
database. So the main query has two sources of tables. Either get it directly from the database or get it
from the table that is created inside the query and then once everything is done the final results of the main query
going to be presented to the user as the final result. So as you can see, the CTE query has one task: it generates
a table that lives inside our query, and we can use it as we want. So now this intermediate table that is
created from the CTE has two characteristics. First, this table will not live long. Once the query ends, SQL
is going to destroy this table. So it will not be available afterward and we are not able to query it anymore.
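A minimal sketch of this lifetime, assuming a Sales.Orders table with CustomerID and Sales columns. The CTE name sales follows the sketch in the explanation; the query inside it is only a placeholder.

```sql
-- The CTE "sales" exists only for the single statement that defines it.
WITH sales AS (
    SELECT CustomerID, SUM(Sales) AS TotalSales
    FROM Sales.Orders
    GROUP BY CustomerID
)
SELECT CustomerID, TotalSales
FROM sales;              -- works: same statement as the CTE definition

-- A later, separate statement can no longer see the CTE.
-- Uncommenting the next line would raise an "invalid object name" error in SQL Server:
-- SELECT * FROM sales;
```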
So SQL is doing a cleanup here. And the second characteristic: let's imagine that we have another, separate query
that is retrieving data directly from the database tables. Now if you say let's join those tables as well
with the sales from the first query, well, it will not work, because SQL is going to say I don't know what you are talking
about and that's because the sales is only locally available for the main query in the same query. So that means
it's not globally available like the database tables for any query. It is dedicated only for the main query within
the same query. And now you might tell me: but wait, I have heard this story before, right? So this is an identical
story to the one that you have told us about the subqueries. So what is exactly the difference between the subquery and
the CTE? Well, you are totally right. The story is identical between the subqueries and the CTE but still there
are differences between them. So let me show you few differences. Now let's put them side by side. We have on the left
side the subqueries, and on the right side the CTE. Now if you look at how we wrote the CTE and the subquery, you
can see that on the subquery we are writing it from bottom to top. So first we have this inner query the subquery
and then on top of it we have the main query. But now on the other hand the CTE we are writing it from top to bottom. So
first we write this inner query the CTE query and then beneath it we're going to go and write the main query. So this is
the first difference between them on the way we write the query. So if I'm thinking about subqueries, I start from
bottom to top. If I'm thinking about CTE, I think from top to bottom. But still you say, you know what, I don't
care how we write it. They are doing the same thing. The subquery is introducing an intermediate result that is used
later by the main query. And the same goes for the CTE: it presents an intermediate table that is used as well
by the main query. Now let me tell you, the big difference between them is that with a subquery the result can be used
only once. So you cannot have another place in your main query where you reuse the result from the subquery.
You can use it in only one position and only once. But on the other hand, with the CTE technique, you can
think about the sales table as a virtual table and not only you can use it in one place in the main query, you can go and
use it in many other places. So you can go and join it again. So that means I'm using the output from the CTE query in
two different places in the main query or maybe from three different places. So you can have another place where you go
as well and query the sales table that is only available in our query. So this is the main and the most important
difference between the subquery and the CTE. It's from the name common table expression. We think about the result of
the CTE as a table. So we can select from it. We can join it with any other table. So it is like a hidden
virtual table that lives inside our query. But with subqueries it's totally different. It's a result for only one
position in the main query and it's used only once. So that means if you want the subquery in, say, three different places,
you have to go and write the subquery three different times. So now you understand why do we have CTE and why do
we have subqueries. All right. So with that you have understood what is CTE. Now the
question is why do we need CTE in the first place? What is the main purpose of the CTE? Let's go back to the sketch.
Now let's say in our complex SQL task we have to do the following step. Step one we have to go and join the tables
together in order to prepare all the data that we need for the next step. And now in the second step we have to go and
aggregate the data. Maybe we are doing summarizations. Now in our task we have to do as well different types of
aggregations based on different data. And now what might happen is that we have to go and join again the same
tables in order to prepare the data and perform different type of aggregations like for example the average which going
to be in the last step. Now we have learned before we can go and use the subqueries in order to make this logical
flow. So for step one, step two, step three, we will have subqueries and the final step going to be in the main
query. But now if we keep doing this we're gonna have a problem and that is we are repeating the same step more than
once. So we are joining the table twice in step number one and three for different purposes which cause us to
have two different subqueries that looks exactly the same and this is exactly the weak point of the subqueries. It might
introduce redundancies. So that means the subqueries alone will not help you to eliminate all the duplicates in your
code. But still we have different techniques in order to solve this issue. So what we going to do? We're going to
have only one step in order to join the tables. And then this data going to be used in the step two in order to
aggregate the data. And then we don't need the step three of joining again the data. We're going to reuse the step one.
And we're going to use the same data for the step four which is aggregating the data using average. And we can do this
with the help of the amazing CTE. So now if you compare the steps in the subqueries with the steps with the CTE
you can see that with the CTE we are reducing the number of steps, which can reduce the size of the query. So now
again, with subqueries we think about the steps from bottom to top, but with the CTE it's the other way around: we think from
top to bottom. So that means the first step at the top is going to be joining the tables, and then below it are going to be
step two and step three. And of course since we are repeating the join we're going to put it in CTE and then we can
use it twice in different places in the main query. So as you can see there are a lot of benefits of the CTE. It's like
the subqueries. We are breaking down complex queries into smaller pieces that are easier to write manage understand
and as well we have like a logical flow from step one to three but with one more benefit that we reduce the redundancies
of our code. So we don't have to join the tables twice. Now I'm going to show you a simple example how the CTE makes
our life easier in our query. We might have to do different stuff like for example we have to go and find the top
customers. So we can put this in one CTE. We might also need to calculate the top products, and we can put
this in another CTE. So you don't have to put everything in one big CTE, otherwise you end up with the same issue
of having a complex query. And let's say that we also have to calculate the daily revenue. For
this as well, we put it in its own CTE. Now once we have all those parts, we can put everything together in the
main query. So now if you look to this structure, you can see it's really easy to understand this code. It's easy to
read. So CTE improves the readability of our queries. So that means your code is divided into clear sections making it
easier to understand what each part does. Now if you keep looking at this, we have another advantage: the CTE
introduces modularity. It breaks your code into smaller, manageable parts. So instead of writing
one huge complex query, you break it down into smaller chunks using CTEs. Each CTE is self-contained and handles a
specific part of the problem, and then you can combine them all together in the final query. It's like we are putting
together a puzzle piece by piece. And now one very important advantage of the CTE is the reusability. So that means we
can have a result set that is used multiple times inside our query. So that means you write the logic the code only
once and then use it in different places inside your query. This is very important. Not only do you avoid wasting time
writing the same stuff over and over, it also reduces the errors and mistakes that you might make if you are
repeating the same code. Especially if you later want to change the logic, you would have to visit every
place where you wrote this logic and make the change, and you might forget some places. That's why the CTE is amazing.
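To make the reuse point concrete, here is a small sketch in which one CTE is referenced twice in the main query: once as the row source and once to compute an overall average. The table names Sales.Customers and Sales.Orders, the columns CustomerID, FirstName and Sales, and the CTE name CTE_Customer_Totals are all assumptions for illustration.

```sql
-- Step 1 (written once): join the tables and total the sales per customer.
WITH CTE_Customer_Totals AS (
    SELECT c.CustomerID, c.FirstName, SUM(o.Sales) AS TotalSales
    FROM Sales.Customers AS c
    JOIN Sales.Orders AS o
      ON o.CustomerID = c.CustomerID
    GROUP BY c.CustomerID, c.FirstName
)
-- Main query: customers whose total is above the average total.
SELECT ct.CustomerID, ct.FirstName, ct.TotalSales
FROM CTE_Customer_Totals AS ct            -- first use of the CTE
WHERE ct.TotalSales > (
    SELECT AVG(1.0 * TotalSales)          -- 1.0 * avoids integer averaging if Sales is an integer column
    FROM CTE_Customer_Totals              -- second use of the same CTE
);
```

With plain subqueries, the join and aggregation would have to be written out in both places, so a later change to that logic would need to be made twice.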
You can write the logic once and then you go and reuse it in different places. So these are the advantages of using
this technique, the CTE, inside your queries. So again, you are at the client
side and you are data analyst. You are writing a query where you are defining a CTE called details and inside it you
have some logic and now in the main query you are selecting the data from the orders and as well you are joining
it with the details with the CTE multiple times using multiple conditions. Now once you go and execute
this query, the database engine is going to read the query and say, aha, we have a CTE here and it has priority. So
that means it is going to execute the CTE first. And now let's say that in the CTE you are retrieving data from
the table orders, and the table orders is of course in the disk storage inside the user data. Once the CTE is
completely executed, the database engine is going to place the results in the cache and name this result
as details. It's like a table name. So the database engine is done with the CTE. It's going to go now and grab the
main query and it's going to start executing it step by step. So the first step is that to get the data from the
orders. So since the orders exist in the disk storage, it going to go and retrieve it from there. Now the database
engine going to check the details. Okay, we have it in the cache. That means we don't have to search for it in the disk
storage and it going to start retrieving the data from the details with high speed. And now it's going to go to the
second step as well joining the data with the details. So again the database engine going to go to the cache and
going to see the table details and retrieve the data based maybe in different conditions. And then to the
third time as well we are joining to the details and we're going to get the data from the cache. So as you can see from
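the main query we are using the result from the CTE multiple times in different places, and the retrieval of all that
information is happening at high speed. So in this picture, one big benefit of using the CTE is that it utilizes the high-speed
memory of the cache: retrieving the data from the cache, from details, is way faster than
retrieving the data from the disk storage, from orders. Keep in mind that this is a simplified mental model: whether a CTE result is actually materialized depends on the database engine, and SQL Server, for example, typically expands the CTE into the main query and may evaluate it once per reference. Now once the main query is completely executed the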
result going to be returned to the database engine and then it's going to send it back to the client side and we
will see the results in the output. So that's it. It's amazing right? This is how the database server execute the
amazing technique the CTE behind the scenes. All right. So now for the CTE, we don't
have only one CTE. We have different types of CTE. So mainly there are like two types of CTE. We have the
nonrecursive CTE and recursive CTE. And we can say for the nonrecursive CTE, we have two subtypes. The first type is the
standalone CTE and the second one is the nested CTE. And now what we're going to do, we're going to deep dive into each
type. And we will start with the easiest form of the CTE, the standalone CTE. It is the simplest
form. So what is standalone CTE? It is a CTE query that is defined and used independently in the query. So that
means it is self-contained and it doesn't depend on anything. It doesn't depend on any other CTE or queries. So
that means we can run the standalone query independently from anything inside our query. So let's understand what this
means. We have our CTE. It's going to go and query the database tables and in the output we will get an intermediate
results and then the output can be used from the main query. So the main query going to query the intermediate results
and present in the output the final results. So now if you check our CTE, it is completely independent from anything
else. So it simply query the database and it has one output. So since this CTE is independent from anything else we
call it a standalone CTE. Now if you compare this CTE with the main query, you can see that the main query cannot be
executed alone. And that's because it needs the result from the first query. So we cannot say the main query is
independent; it cannot be executed alone. It always depends on the CTE query. So the CTE first needs to be executed,
then the main query can be executed. So this is what we mean by a standalone CTE. It doesn't depend on anything
else. So now we can understand the syntax of the CTE. So we have a very simple query select from where. So it is
a very simple SELECT statement. Now in order to put it inside a CTE we use the WITH clause. So it starts
with the keyword WITH, then the CTE name, which is like a table name, and then we have the keyword AS in order to say this CTE
is defined as follows. So this is the definition of the CTE, and it has two parentheses, the opening and the
closing. With this you are telling SQL, okay, now we are talking about a CTE and it has a name. If you are using a
query inside a WITH clause, we call it a CTE query; it is where you define the CTE. Now of course we don't want only to
define a CTE. We want to use it. So outside of this definition we can use it like this. We are saying
SELECT ... FROM the CTE name. That means we want to select the data from the result of the CTE. And here it's very
important to use exactly the same name as you defined in the WITH clause. And this part we can call
the main query. It is the place where we use the CTE. So this is the syntax of a very simple CTE in SQL.
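The syntax just described can be sketched as follows. The CTE name CTE_German_Customers, the table Sales.Customers and its columns are illustrative assumptions, not names taken from the course material.

```sql
-- CTE definition: WITH <name> AS ( <query> )
WITH CTE_German_Customers AS (
    SELECT CustomerID, FirstName
    FROM Sales.Customers
    WHERE Country = 'Germany'
)
-- Main query: reads from the CTE using exactly the name defined above.
SELECT CustomerID, FirstName
FROM CTE_German_Customers;
```

The definition and the main query form one statement, so they have to be selected and executed together.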
Okay. So now what we're going to do, we're going to have like a task that's going to keep progressing through this
section. So we're going to start with the first step and we will keep adding steps as we progress in the CTE. So now
the first step in this task says find the total sales per customer. And now of course since we have only one step, it
makes no sense to use the CTE. But we will use it since we know that there will be different steps later. So let's
start doing that. Now before I use any CTE, I would like just to write our query first. So we need the total sales
for each customer. It's very simple. So we're going to select, and what do we need? Let's get the customer
ID, and we need to aggregate the sales. So we sum the sales and call it total sales, from the
table. And now since this is our first query, we have to get the data from our database. So we don't have any other
option. Our data going to be in the sales orders. So let's go and get it. And don't forget to group by for the
aggregation. We are grouping by the customer ID. That's it. Let's go and execute it. And as you can see in the
output, nothing is fancy. We are just aggregating the sales by the customers. So with that, we have solved the task.
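Written out, the first step could look like this, assuming the orders live in a table called Sales.Orders with CustomerID and Sales columns:

```sql
-- Step 1: total sales per customer (table and column names are assumed).
SELECT
    CustomerID,
    SUM(Sales) AS TotalSales
FROM Sales.Orders
GROUP BY CustomerID;
```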
But now I would like to put my query in a CTE. And that's because later we're going to add more steps. So let's put
our query in a CTE. In order to do that, we're going to start with the WITH keyword. And now we have to define the
name of the CTE. So I'm going to call it CTE_Total_Sales, like this. And then
afterward we're going to say AS, and then we have to add the parentheses at the start and at the end. And
with that we are telling SQL this query is a CTE query. So that means SQL should keep the result of this query
available to be used later in the main query. So this is our CTE, and of course what is missing is the main query, and
you have to write it directly after the definition of the CTE. I will just add a small comment here about the main
query. Let me just make this smaller, like this. And now we have to write a very simple SELECT
statements from. And now I would like to get more details from the customers table. So I will just go now to the
customers. So now we are not querying the CTE right? We are just querying the database table that we have and I would
like to get from the customer the customer ID and the first name and let's go and get as well the last
name. So now if we go and query this what happens in the output we are getting the data actually completely
from the database table the customers and of course we are not using at all the CTE inside our main query. Of
course, we can do that, but then the CTE is just wasted, because we defined it and never
read from it. And of course, we would like to use the CTE in our main query. So, let's go and do
that. Let's do a join, but this time we're going to join the data from the CTE. So, let's go and get the
name, and I will just give it the alias CTS. So what we are doing now: we are joining the physical table, customers, with the
virtual table that we have created with the CTE, which exists only in our query. And of course we are not only joining the
tables, we would like to get information from the CTE. So CTS, and we need only the total sales. So total
sales. That means those three columns come from our database table customers, and only this column, the total sales,
comes from our CTE. So let's execute the whole thing. Now as you can see in the output, everything is working.
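Put together, the query at this point could look like the sketch below. The exact spelling of the CTE name and the column names (CustomerID, FirstName, LastName, Sales) are assumptions reconstructed from the spoken description; the left join is the join type mentioned when the result is discussed.

```sql
-- Step 1: total sales per customer, wrapped in a CTE.
WITH CTE_Total_Sales AS (
    SELECT
        CustomerID,
        SUM(Sales) AS TotalSales
    FROM Sales.Orders
    GROUP BY CustomerID
)
-- Main query: customer details from the physical table,
-- total sales from the CTE.
SELECT
    c.CustomerID,
    c.FirstName,
    c.LastName,
    cts.TotalSales
FROM Sales.Customers AS c
LEFT JOIN CTE_Total_Sales AS cts
    ON cts.CustomerID = c.CustomerID;
```

Because of the left join, a customer without any orders still appears in the result, with NULL as the total sales.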
We have the three columns from the table customers and we have the total sales for each customer and this total sales
comes from our CTE. Now as you can see, the last customer has a NULL over here, and that's because in the table orders
we don't have customer five. And now you might say, you know what, I would like to see the intermediate result from the
CTE, because what we are seeing now in the output is the final result from the main query. So what we can do in
order to see the result of the CTE: we mark the query inside the CTE, of course without the parentheses or the
WITH, so just the query, and execute it. And with that you can see in the output the intermediate results that we are
passing to the main query. And as you can see, we don't have customer number five here. That's why in the final
results we are getting NULL, and that's of course because we are using the LEFT JOIN. So if I execute the whole thing,
you can see we are getting customer five over here with the NULL. So as you can see, it is very simple. We just treat it
as any normal database table. But this table is created from the query that we have defined in the CTE over here. Now
of course in the CTE you can use any kind of clause, like SELECT, FROM, JOIN, GROUP BY, HAVING, everything that you want,
window functions, all aggregate functions. But there is one restriction: you cannot use the ORDER BY clause, so
you cannot sort the data in the CTE. Let's try it out. Let's say ORDER BY, and let's say I want to sort by
the order ID, for example. So let's execute it. You can see here SQL is saying, okay, I cannot do it for you,
because ORDER BY is not allowed in many places. You cannot use it in views, in subqueries, or in common table
expressions, the CTE over here, unless TOP, OFFSET or FOR XML is also specified, as the SQL Server error message says. But of course you can
sort the data in the main query. So if you go over here and say ORDER BY customer ID and execute it, it's
going to work. So in the main query you can use ORDER BY, but in the CTE this is the only thing that you
cannot use. So that's it. This is our first CTE in this section. All right. So this is the
simplest form of the CTE, the standalone CTE. Now we can have not only one CTE, we can have multiple
CTEs. So it's going to look like this. We have our database, and this time we don't have only one CTE. We have multiple
CTEs in our query, and each CTE goes directly to the database and queries it in order to prepare
its intermediate result. So in this example four CTEs are going to the database and preparing four different
intermediate results, and of course SQL is going to execute them from top to bottom. So first CTE 1, then 2, 3, 4,
but they have nothing to do with each other. So now once we have all four intermediate results, the main query
is going to retrieve all that information and do some magic in order to prepare the final result for the end
user. So by looking at this sketch you can understand that all those CTEs are independent of each other. So there
is no nesting or anything. Each CTE is self-contained, and it could be executed on its own without depending on any
results from any other CTE or any other query. So it goes directly to the database and gets the data. So that's why
all of them are standalone CTEs. And since we have multiple CTEs, we call them multiple standalone CTEs. That's it. It's
simple. So now let's check the syntax of multiple standalone CTEs. So we're going to start writing our first CTE.
So it starts with the WITH clause, and then we have the CTE name and then the logic of our CTE. So nothing new. This
is how we define the CTE. And then in order to use it, we're going to have our main query where we select from our new
CTE, and we make sure we are using the name of our CTE. So nothing new. Now in order to add another CTE to our query,
what we're going to do is go after the definition of the first CTE, and below it we're going to start
defining CTE 2. But this time, as you can see, we are not using the WITH clause. We are using a comma. So that
means only the first CTE is going to use the WITH clause in order to tell SQL we are talking about CTEs. All the
other CTEs you're going to separate using a comma. So the syntax is going to be a comma instead of WITH, then the name
of the CTE, and then we're going to say AS, followed by the definition. So we're going to write here the query of the
second CTE. So now of course if you want to add more CTEs, you use the comma below it and in the same way you
define the third CTE. So you can have as many CTEs as you want, always separated with a comma, but only the
first CTE starts with WITH. And of course in the main query we can go and use the results from CTE 2, where we
are for example here joining the data between CTE 1 and CTE 2. So as you can see, in the main query here we are
collecting the data from these different CTEs in order to do the final step in the main query. It starts
with WITH, so SQL understands, okay, now we are talking about a CTE. And once SQL sees a comma after the parenthesis,
SQL understands, okay, now we are talking about another CTE. And if you don't use a comma after the
parenthesis, SQL understands, okay, we don't have any more CTEs, the next query is the main query. So this is how
you create multiple standalone CTEs. All right, so now back to our task, where we are creating a report step by step,
so now we have in the task a second step, where it says: find the last order date for each customer. So now we have to go
and add one more piece of information about our customer: when was the last time the customer placed an order. So how are we going to
do it? We have to add this to our query, and I would like to use a CTE again to hold this logic. So
as we learned from the first task, this is the first step, which finds the total sales for each customer. And here
we have the main query. Now I would like to put another CTE in between. And as we learned from the syntax, we have
to go and add a comma. We cannot use WITH again. And we have to give it a name. So let's call it CTE last
order, like this, and we have to define it. So AS and then a pair of parentheses. And now in between we have to go and add
our logic. So now we focus only on this logic. Forget about the other CTE and the main query. We have to
find the last order date for each customer. So we're going to query the table orders again. So what do we
need? We need the customer ID. We need the order date from our table Sales.Orders. So
that's it for now. Let's just select it and execute it. And with that you can see all the customers and
all their orders. But we would like to have the latest order date for each customer, and we can use an
aggregate function, the MAX function. So what we're going to do is like here at the top. We have to use the
function MAX and group by the customer ID. So GROUP BY customer ID. Let me just shift it like this.
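The aggregation described here, and the way it slots in as a second standalone CTE, could look roughly like the following. This is a sketch in SQL Server syntax: the table and column names (`Sales.Orders`, `Sales.Customers`, `OrderDate`, `Sales`) and the CTE names and aliases are assumptions made for illustration, not the course's exact script.

```sql
-- Names below are assumed for illustration.
WITH CTE_Total_Sales AS          -- only the first CTE starts with WITH
(
    SELECT CustomerID, SUM(Sales) AS TotalSales
    FROM Sales.Orders
    GROUP BY CustomerID
)
, CTE_Last_Order AS              -- every further CTE starts with a comma
(
    -- Step 2: latest order date per customer
    SELECT CustomerID, MAX(OrderDate) AS LastOrder
    FROM Sales.Orders
    GROUP BY CustomerID
)                                -- no comma after the last CTE
SELECT
    c.CustomerID,
    c.FirstName,
    c.LastName,
    cts.TotalSales,
    clo.LastOrder
FROM Sales.Customers AS c
LEFT JOIN CTE_Total_Sales AS cts ON cts.CustomerID = c.CustomerID
LEFT JOIN CTE_Last_Order  AS clo ON clo.CustomerID = c.CustomerID;
```

Both CTEs read only from physical tables, so the SELECT inside either pair of parentheses can be highlighted and executed on its own to inspect its intermediate result.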
And let's give it the name last order. So like this. And as you can see, I'm highlighting only my query now. I'm
not selecting everything. And I keep executing just to check the results before we integrate it into the
main query. So now as you can see, we have one row for each customer, and we have the latest order date for each
customer. So with that we have solved this subtask. So as you can see, it's really easy to extend. I'm just making
another box and adding inside it the business logic that I want, and this is going to solve one problem of the
whole task. So now you feel exactly the power of the CTE. We are building complex logic, but it's still easy to add to. Now
imagine you are not doing this, and you are always extending one big query. It's going to be really hard to extend, and
that's why a lot of SQL developers really love using CTEs, and they use them in each query or in each task that
they have. So we have solved this task, and now we have to integrate it into the main query. It's going to be very
simple. So we're going to go over here and just add another join. So we're going to join it with the
CTE, and as you can see, SQL is now offering it as a table even though it is not a physical table that exists in our
database. It only lives inside our query, but SQL still treats it as a table. And this is exactly what we are doing. We
treat that information as a table. So CTE last order, and I will give it the alias CLO. And then of course we have to go and
use the same condition as here. So the CLO customer ID should be equal to the customer ID from the first table, the
customers. And of course we have to go and add this new information to the main query. So
CLO dot last order. So now what we're going to do is execute the whole thing. So we now have
two CTEs as well as our main query. So let's go and execute it. Now again let's check the data. The first three columns
come from the physical table customers. The fourth one, the total sales, comes from our first CTE over here, so from
here, and the last order comes from the new CTE that we just defined, CTE number two. So as you can see, guys,
everything feels organized and structured, and we have a flow. And of course those CTEs are standalone
CTEs, so we can always select one CTE's query and execute it separately. It doesn't need anything else from outside
its own query. It just needs the tables inside your database. So guys, again, pay attention here: if you want to add more
CTEs, use the comma. You cannot use, for example, another WITH here. If I execute it, I will get an error. So you
have to separate them with this comma. And another mistake that I make frequently is that I forget and add a comma after
the last CTE, and this happens to me if I'm using a lot of CTEs. So if I do it like this, I will get
an error as well, because the main query doesn't need a comma. So the last CTE should not have a comma after the
parenthesis. So I just remove it and execute. So guys, with that we now have multiple CTEs inside our
query. All right. So now what is a nested CTE? It is a CTE that builds on another CTE. So it's kind of like subqueries, a
query inside another query. So not only the main query can use the result of a CTE, another CTE can use the result of a
CTE as well. And of course the nested CTE, like a main query, depends on another query. That means you cannot select it
and run it independently. You always have to run the CTE it uses first before seeing the result of the
nested CTE. Okay. So now let's understand what this means. Again we have our database and we have a CTE
query that goes directly to the database and queries the data from there, and in the output we will get the intermediate
results. And now in this scenario we will not have only one intermediate result, because we have
many different steps. We need another intermediate result before everything is prepared for the main query. So that
means we have another step that's going to be built on top of the first intermediate result. So that means we
can have another CTE that's going to query the results from the first CTE and build on top of it another
intermediate result. So as you can see here, we have CTE 1 and CTE 2, and that means we now have two intermediate
results. And of course we can go and add CTE 3, 4 and so on. But let's say that CTE 2 is going to prepare the final
intermediate result for the main query. So now the main query is going to query the second intermediate result
and do the final step, where the final result can be presented to the user. And of course, if it is
needed, the main query can access not only the second intermediate result from the second CTE but also the first
intermediate result from CTE 1. Now we call the first CTE a standalone CTE because it doesn't depend on any
intermediate results. It goes directly to the database and gets the data. But since the second CTE is completely
dependent on CTE 1, this time we're going to call this CTE a nested CTE, because we cannot execute it
on its own. It always depends on CTE 1. And of course the main query depends on everything. So as you can
see, using CTEs we're going to build something like a chain. So this is what we mean by the standalone CTE
and nested CTE. Okay. So now let's understand the syntax of the nested CTE. So we start as usual with the
definition of the first CTE using the WITH clause, and then the name of the CTE and the definition of the CTE. So
here it's nothing new. Now we go and define the second CTE, as we learned, using the comma, then the name of the CTE
and the definition. So this is our CTE number two. So now the second CTE should depend on the results of the first
CTE. So how are we going to do it? It's very simple. For CTE number two, we're going to select the data from
CTE number one. And with that, we are making the second CTE dependent on the first one. So this means the second CTE
is getting the data from the first one and querying it in order to do the second step. And with that we are
nesting one CTE in another, and CTE 2 is completely dependent on the first one. So again, we call the first CTE a
standalone CTE because it doesn't depend on anything. We can execute it on its own, and it just needs the data directly
from the database. But the second CTE, since it is completely dependent on CTE number one, we call a nested
CTE. So they are very similar. We are just selecting the data from CTE number one. And now comes our main
query. And of course it's going to use the data from the second step, so it's going to select the data
from CTE number two. But of course that's not a rule. It can also access and select the data from
CTE number one. So this is how we can create a nested CTE in SQL. All right guys, back to our project where we
are creating a report about the customers, and we would like to add one more step. So the task is: rank the
customers based on total sales per customer. So this is one more step inside our project, and we would like to
use CTEs as well in order to implement this step. So now what do we need? We need to rank the customers
based on total sales for each customer. So here we have two steps. First we have to calculate the total sales per
customer, and then we have to rank them based on this information. And of course the sales are stored inside the
orders. So now let's go and start implementing the CTE. So we're going to have a comma, and we're going to call it
CTE customer rank, AS, and then we're going to have the parentheses, and inside them we're
going to develop the logic. So first we have to aggregate the data by the total sales. So select customer ID
and then sum the sales from the table Sales.Orders, and then of course group
by the customer ID. And now I can hear you telling me: we have already done this. We already have this logic.
So why are we repeating it? If we go to the first CTE, you can see we have already done that. And you are totally right. We
already have the logic, so it makes no sense to repeat it. And if we do this, then we haven't understood the power
of the CTE. So we don't have to repeat the same logic, and we can reuse the CTE inside another CTE. So now we don't
need all that stuff. We can focus immediately on ranking the customers. So first let me just select
the data from the first CTE. So I'm going to go and select. So what do we have? We have customer
ID and we have total sales. And we're going to select this time not from any physical table. We're
going to select from our CTE. So like this. And now what we're going to do is highlight the whole thing
and execute it. Well, this is the issue with nested CTEs. Sadly, this CTE is completely dependent on the first CTE.
So we cannot execute it on its own. And this is of course very annoying, because each time I execute the query, by
the end of the query SQL is going to destroy all the CTEs. So in memory we will not find the CTE, and that's why once
I executed it, SQL doesn't know anything about this CTE. And in order to see the result of this, we always have to
execute with it the CTE that I'm using. So what I usually do is go over here, comment out everything
in the main query, select from the nested CTE instead, and now I can execute the whole thing and
see in the output the outcome of this
nested CTE. So this is the big difference between the standalone CTEs, like here, and the nested ones. So now let's
go back to our task. We have to rank those customers based on the total sales. So we can use the RANK function from
the window functions. So RANK, OVER, and now we don't have to partition the data. We just want to sort the data by the
total sales descending. So like this, the highest sales is going to get the rank number one.
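A sketch of what this nested CTE and its temporary test query might look like in SQL Server syntax. The object names (`Sales.Orders`, `CTE_Total_Sales`, `CTE_Customer_Rank`, and the column names) are assumptions for illustration.

```sql
-- Names below are assumed for illustration.
WITH CTE_Total_Sales AS
(
    -- Standalone CTE: reads from a physical table.
    SELECT CustomerID, SUM(Sales) AS TotalSales
    FROM Sales.Orders
    GROUP BY CustomerID
)
, CTE_Customer_Rank AS
(
    -- Nested CTE: reads from the first CTE, so the aggregation is not repeated.
    SELECT
        CustomerID,
        TotalSales,
        RANK() OVER (ORDER BY TotalSales DESC) AS CustomerRank
    FROM CTE_Total_Sales
)
-- Temporary main query, used only to inspect the nested CTE.
SELECT *
FROM CTE_Customer_Rank;
```

The ORDER BY here sits inside the OVER clause of a window function, which is allowed in a CTE. It defines the ranking order and does not sort the CTE's output.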
So let's go and give it the name customer rank. Now as you can see, we have a really nice rank beside that
information. Customer three has the highest sales and customer two has the lowest total sales. So with that, as you
can see, we didn't repeat ourselves. We just reused another CTE in our current CTE. And this is exactly why this
technique is so useful for reducing redundancy and the complexity of the whole query. So nested CTEs
are annoying to execute, but they reduce the redundancy in our code. Now we are done with our logic. We tested
everything. So what we're going to do is integrate it into our main query. So let me just remove
the comments from here, and let's go and add it to the main query. So we will do the same thing. We're going to do
a left join with the last CTE that we just created. So let me just give it the alias CCR, and the same condition. We are
always joining on the customer ID. But don't forget to change the alias. So it is CCR customer ID equal to the customer
ID from the first table. And of course we have to select the new information. So CCR dot customer rank.
And now let's go and execute the whole thing. Now as you can see in the results, those three columns come from the
customers table. The total sales comes from the first CTE, the last order from the second CTE, and the customer rank
comes from the nested CTE that we just created. So guys, creating such a report is not a simple task, because it
involves different aggregations and different functions, but our work is organized. As you can see, it's very
simple. We have step one, step two, step three, and the main query. And it's really easy to add more components to
our query. Now, I would really like to keep practicing those nested queries. So, we have the following task.
We would like to add one more step to our report: segment the customers based on their total sales. So I would like to
implement this using a CTE as well. So let's go and solve it. We want to add a new CTE. It's going to be CTE
customer segments, AS, and then we have to define our logic. Now if you check our
task, it has two parts. We have to find the total sales, and then we have to segment the customers based on this
information. So it is very similar to what we have done in step three. So that means we don't have to
calculate the total sales again. We can once more use our amazing first CTE. So let's go and do it. What
do we need? We need the customer ID, like this. And let's do a basic segmentation using CASE WHEN. So let's say CASE
WHEN the total sales is higher than 100, then let's say the customer belongs to the group High. And let's go
and add another category. If it's not higher than 100 but it is higher than 50, then the customer belongs to
Medium. And if the total sales is less than or equal to 50, what's going to happen? We're going to say ELSE, the
customer belongs to the Low category. So that's it. We're going to have an END, and let's call it customer
segments. All right. But of course we have to select it from a table, and it's going to be our CTE, so the total
sales one, and with that our new CTE is defined. And I would like to test it before putting it inside our main query. That's
why I will comment out everything in my main query, since it is sadly a nested CTE. And we will just select from
our new nested CTE like we have done before. So let's go and execute it. Now as you can see in the output, we have two
customers with the category High and two customers with Medium. But in order to make sure that everything is working
perfectly, I would like to add the total sales just to see the numbers. So let's go and execute it. Well, you
can see everything is correct. Those customers have more than 100 in total sales, and those two have more
than 50. But let's change things around. I would like to use 80 as the Medium threshold, just in order to have a Low.
So with that, customer number two has total sales lower than 80. That's why we are getting the segment Low.
Everything is done, and we have segmented the customers into different categories. So I don't need to test anymore. Let's go
integrate it into our main query. So we're going to do the same thing over here. We're going to say LEFT JOIN and we're
going to get our new CTE, with the alias CCS, and we have to write the join condition. Don't forget to change the alias. And we have to
select our nice new information. It's going to be the customer segments. And now we can go and execute the whole
thing. So we now have four different CTEs and one main query. And now we can see in the output that we got all
three columns from the table customers, then the first CTE, the second, the third, and this is our new column that we
just created. So again we have done this using a nested CTE, like this. Let me just add
that, and it was really easy to extend and to add to our report. All right guys, so with that we have done a mini
project where we have analyzed the customer information based on different aspects of our data, and we have done
it step by step. And now you have a feeling for how to write complex SQL queries with the help of CTEs.
So as you can see, if you go through the script you can understand, okay, it is
divided into multiple steps, and each block is responsible for one specific problem of the whole report. And this is
exactly the power of the CTE: it introduces modularity. So each CTE is self-contained and deals with one issue,
and this is an amazing way to organize your project using SQL and to structure your work.
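Put together, the four steps and the main query might look like the following sketch in SQL Server syntax. The table, column, and CTE names are assumptions for illustration. The segment thresholds (100 and 80) are the ones mentioned in the narration.

```sql
-- Names below are assumed for illustration.
-- Step 1: total sales per customer (standalone CTE)
WITH CTE_Total_Sales AS
(
    SELECT CustomerID, SUM(Sales) AS TotalSales
    FROM Sales.Orders
    GROUP BY CustomerID
)
-- Step 2: last order date per customer (standalone CTE)
, CTE_Last_Order AS
(
    SELECT CustomerID, MAX(OrderDate) AS LastOrder
    FROM Sales.Orders
    GROUP BY CustomerID
)
-- Step 3: rank customers by total sales (nested CTE)
, CTE_Customer_Rank AS
(
    SELECT
        CustomerID,
        RANK() OVER (ORDER BY TotalSales DESC) AS CustomerRank
    FROM CTE_Total_Sales
)
-- Step 4: segment customers by total sales (nested CTE)
, CTE_Customer_Segments AS
(
    SELECT
        CustomerID,
        CASE
            WHEN TotalSales > 100 THEN 'High'
            WHEN TotalSales > 80  THEN 'Medium'
            ELSE 'Low'
        END AS CustomerSegments
    FROM CTE_Total_Sales
)
-- Main query: collect every intermediate result per customer
SELECT
    c.CustomerID,
    c.FirstName,
    c.LastName,
    cts.TotalSales,
    clo.LastOrder,
    ccr.CustomerRank,
    ccs.CustomerSegments
FROM Sales.Customers AS c
LEFT JOIN CTE_Total_Sales       AS cts ON cts.CustomerID = c.CustomerID
LEFT JOIN CTE_Last_Order        AS clo ON clo.CustomerID = c.CustomerID
LEFT JOIN CTE_Customer_Rank     AS ccr ON ccr.CustomerID = c.CustomerID
LEFT JOIN CTE_Customer_Segments AS ccs ON ccs.CustomerID = c.CustomerID
ORDER BY c.CustomerID;   -- sorting belongs in the main query, not in a CTE
```

Steps 3 and 4 both read from `CTE_Total_Sales`, so the aggregation is written once and reused.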
All right, my friends. So, now let's have a little break in order to have a real talk about the CTE. But first,
some coffee. And now I can say that I have been working with SQL for a really long
time, over 15 years. And I can say as well, I have met a lot of SQL developers in different projects. And if
there is one thing that all those SQL developers love, it is the CTE. They love using it everywhere. Each time they
write a query, they are going to write a CTE. And of course that's fine, it's not a bad thing, but the problem is that
they overuse it. Of course not all of them, but a lot of SQL developers overuse the CTE. Of course the CTE is very
powerful, but remember: with great power comes great
responsibility. So my advice for you, especially if you are new to CTEs: try not to add a new CTE each time you are
doing something new. I have seen it a lot: for each new calculation, for each new column, they jump immediately and
create a new CTE. And what happens at the end is that we can have a massive number of CTEs inside one query, and the developer
thinks that now everything is organized and easy to read. But believe me, it's exactly the opposite. If you open any code and
you have a lot of CTEs, and especially if they are nested CTEs, it is very hard to understand what is going on. Even if the
developer describes each CTE and the task of the CTE, it's going to be really hard to understand and to read.
If everything is nested and you have, I don't know, 20 CTEs in one query, it's going to be almost impossible to
read and to understand, and you may also be using a lot of memory and you might get bad performance. So my
advice for you: as you are creating new CTEs, always try to think about whether you can merge two CTEs into one. So it is
really important to always rethink and refactor your CTEs in order to merge them and to reduce the number of
CTEs. But now if you ask me how many CTEs are okay in one query, well, I don't have a magic number for that. But normally I
tend to say between three and five CTEs is fine. It's going to be easy to understand and to read and so on. But
once you get more than five CTEs, then you have to rethink your code. Maybe you have to create another complete query so
you don't have to put everything in one query. So this is my advice for you. Try not to overuse CTEs in your
projects. Not a new CTE for each step: always refactor the CTEs, consolidate them, and try not to have more than five CTEs in
one query. So that's my advice for you. Be responsible using the CTE. And let's go back to our course.
So with that we have learned the standalone CTE and the nested CTE, and both of them belong to a type called
non-recursive CTE. So what is a non-recursive CTE? It means it is a CTE that is executed only once. So there are
no repetitions or looping or anything. SQL is going to execute it in one go and that's it. But on the other hand, the
recursive CTE is exactly the opposite. So a recursive CTE is a self-referencing query that repeatedly
processes the data until a certain condition is met. And we usually use the recursive CTE if we have a
hierarchical structure and we want to navigate and travel through the hierarchy. I know this might be
confusing, but don't worry about it. We're going to have very simple examples. Now again we have our tables
in the database and we have a CTE. Now the query of the CTE is going to be executed for the first time, and in the
results we're going to have the initial data from the CTE, but it is not everything yet. This intermediate
result is not ready yet for the main query. Instead, it's going to go back to the CTE, and the CTE is going to
check whether the current result meets a specific condition. So now if the check says no, it's not meeting the
condition, what's going to happen is that the CTE query is executed for the second time. So as you can see, we are
looping through the CTE. Now the result of the second iteration, the second execution, will be added to the
intermediate result. So now the intermediate result has more data, and again, before we can use it from the main
query, it is going to be checked by the CTE. Does the result fulfill the condition? If it's still no, then go and
execute the CTE again. So we're going to have a third iteration, and new data is going to be added to the intermediate
result. So this is our third iteration. Now it's going to be checked again by the CTE. Did we fulfill the condition?
If the answer is yes, then the loop is going to break and everything ends. So there will be no fourth iteration of the
CTE. The CTE will not be executed for a fourth time, and now the CTE is going to say,
okay, I'm done, my intermediate result is ready to be used by the main query. And now nothing new happens. The main
query is going to retrieve the data from the intermediate result and do some magic in order to prepare the final
result. So that means there will be no iterations or looping inside the main query. The looping is going to happen
only in the CTE, and that's why we call it a recursive CTE. So now if you compare it with the other types, all other types
always go in one direction and each CTE is executed only once, but the recursive CTE is going to keep
looping until the condition is met, and only then is it going to forward the data to the main query. And normally we use
the recursive CTE if we are navigating through a hierarchical structure. So if you have hierarchical
structures in your data, you can use the recursive CTE in order to navigate through them. So this is the recursive
CTE. Okay. So now let's check the syntax of the recursive CTE. It is a little bit complicated, but we're going
to do it step by step. So what do we have? We have a query and we would like to put it in a CTE. So we're going to
have the usual stuff: the WITH clause, the name of the CTE, AS, and then the query. So this is the definition of our
CTE. But now if you leave it like this, SQL is going to execute it only once. But we would like to make a loop, an iteration.
So in order to do that, we have to define a second SELECT statement inside our CTE, like this. So we are selecting
the data, and here we have to define a breaking condition. So here in the second query we are defining a condition
in order to break the loop. Otherwise it's going to loop forever, or the system is going to stop it. You could put it
in the WHERE clause, or you can put it even in an inner join, because both of them filter the data and you can
use them in order to break the loop.
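The anchor query, the second SELECT with its break condition, and the two pieces the narration turns to next (the self-reference and UNION ALL) combine into a shape like the one below. This is a minimal sketch in SQL Server syntax. The names `Numbers` and `n` are hypothetical, chosen only for illustration.

```sql
-- Hypothetical example: count from 1 to 5.
WITH Numbers AS
(
    -- First SELECT: runs once and supplies the starting row.
    SELECT 1 AS n

    UNION ALL

    -- Second SELECT: reads from the CTE itself and runs repeatedly.
    SELECT n + 1
    FROM Numbers
    WHERE n < 5      -- break condition: returns no row once n reaches 5
)
-- Main query: sees the accumulated result only after the loop has ended.
SELECT n
FROM Numbers;
```

Each pass of the second SELECT sees only the rows produced by the previous pass. When a pass returns no rows, the loop ends and the main query receives the values 1 through 5.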
like things looping? Well, we have to reference this CTE to itself. So what we going to do? We're going to say the
second query going to select the data from the same CTE. So that means we have now a query that is quering itself. And
this is of course what we want. We want to make iterations and we want to make a loop. That's why we have to go and
reference it to itself. And now in SQL you cannot have it like this. You cannot have like two select statements in one
query. you have to connect it somehow. That's why we can go and use the union all or union depend if you want to have
duplicates or not. So now we call the first query the anchor query. The anchor query going to be the first query that
interacts with the database and provide us the initial intermediate results. So it is the starting point of the
iteration and we can say it is the first step in the process. So this going to be executed only once and it going to
provide us the initial step the first step in the process. Now we call the second step as a recursive query and we
call it like this because this query going to be executed multiple times and it will keep repeating and add data to
the intermediate results until the condition is met or let's say there will be no more data that is available to be
processed. So this is the syntax of the city query for the main query nothing is changed. So we have to go and use the
city name in the main query. So this is the syntax of the recursive city. So think about it like this. SQL going to
go and execute the anchor query only once and then after that going to go through the recursive query and keep
looping and looping and iterating until a certain condition is met and then SQL going to go out from the CTE. So this is
actually what we mean with the anchor and recursive queries. All right. Right. So now let's have a simple task in order
to understand the recursive CTE. So the task says: generate a sequence of numbers from 1 to 20. So now let's do it step by
step. That means we have to create a loop from 1 to 20, and after 20 the loop should stop. So let's go and do it. Now
the first step of the recursive CTE is to build the anchor query. The anchor query is responsible for the first
iteration, so that means the first row of the output. So what is the first value between 1 and 20? It is one.
So let's go and write a query that generates the value one. So SELECT, and we're going to say 1 AS, and I'm going to
give it the name my number. So that's it. Let's go and execute it. Now you can see in the output we have the first
member of our sequence. And this is exactly the task of the anchor query. It retrieves the first step in the
iteration. So let's add a comment and call it anchor query. Now for the next step,
we have to build the iteration, so we need a CTE. So I will build the CTE now. We're going to
say WITH, we're going to call it Series, AS, and then we're going to put everything in parentheses, and then we're going to
go to the main query. So this is the main query, and we will select everything from Series, the CTE. So
let's go and execute it just to make sure that everything is working fine. So we didn't create any loop or anything.
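At this stage the statement is still a plain, non-recursive CTE wrapped around the anchor query. A sketch of it in SQL Server syntax, using the names from the narration (the exact spellings `Series` and `MyNumber` are assumptions):

```sql
WITH Series AS
(
    -- Anchor query: produces the first value of the sequence
    SELECT 1 AS MyNumber
)
-- Main query
SELECT *
FROM Series;
```

Running it returns a single row with `MyNumber` equal to 1, since nothing inside the CTE refers back to `Series` yet.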
We have just created a CTE on top of the anchor query, and we just call it from the main query. So now we come to
the second step of building the recursive CTE: we have to build the recursive query. So let's do it. I will
just make this a little bit smaller. Now before we start writing the query, we have to use UNION
ALL in order to connect the anchor query with the recursive query. And let me label this as the
recursive query. So how are we going to build it? Let's start with the SELECT. Next, what I usually do is
make sure that we are making a recursive CTE. So I go with selecting FROM, and then we use the
name of the current CTE, so that the CTE references itself. This makes the CTE recursive and does the
looping. Now here comes the tricky part. So we need to create like the sequence. Now what is the current value? The
current value is one. Right? Now what do we need? We need the second value in the sequence which is two. So we can do it
by 1 + 1. So if you do it like this you will get the output two. But actually what we are doing here we are always
taking the current value and we are saying plus one in order to generate the next value. So in order to do that
instead of saying one, we're going to take MyNumber, the current value, and we're going to add one to it in
order to generate the second value in the sequence. That means MyNumber always holds the current value, and we do
the operation + 1 in order to generate the next value in the sequence. So having it like this, we are generating
the sequence of numbers. Now if you go and execute it like this, let me just execute it, what will happen? It is going to
break, because by default SQL Server limits a recursive CTE to 100 recursions. After more than 100, SQL breaks the query so
that we don't loop infinitely. This is bad, because we didn't define the breaking mechanism of
the loop. So now we have to define in the recursive query how the loop is going to end, and we usually use a
condition. For example, we can use the WHERE clause and say: keep looping and keep generating, but
always check whether the value of MyNumber is less than 20. And you might ask: shouldn't it be less than or equal to
20? Well, no, because with less than or equal to 20, once MyNumber is equal to
20 you allow one more iteration, and you would get 21 in the output. That's why we use less than 20. So
now let's go and execute it and check the sequence. It starts with 1, 2, 3, 4, 5 and continues until we reach 20. So with
that we have solved the task. Again, it's not that hard, right? We are just providing the initial step, and then we
are providing the loop, where we define inside it how the loop is going to end. Now there is one more thing that
you can do with the recursive CTE: define the limit of iterations. For example, in your code you can say: if
this iterates more than 10 times, then SQL should break and stop. So you can define for SQL the maximum
number of recursions. How can we do that? We can do it in the main query. If you go over here and say OPTION,
then a pair of parentheses, and then MAXRECURSION, after that you can define the limit. For example, let's go with
10. Now of course our code iterates more than 10 times, since it counts up to 20, but here we are making the rule that it should not
iterate more than 10 times. So let's go and execute it. Now we can see that our SQL breaks, and it says the maximum
recursion is 10. So as you can see, in the output we are getting the error of having more than 10 iterations, which
is not allowed. With that you can control how many recursions you can have. Let's say that you would like to
have a thousand iterations. So you go over here and say, you know what, I would like to have a sequence of 1,000.
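Putting the pieces so far together, a minimal T-SQL sketch of the 1-to-20 sequence query could look like this. The names Series and MyNumber are assumed spellings from the narration, and the OPTION hint is written out with SQL Server's default of 100 only to show where it goes.

```sql
-- Recursive CTE that generates the numbers 1 to 20.
-- Series and MyNumber are assumed spellings of the names used in the narration.
WITH Series AS (
    -- Anchor query: runs once and returns the first value
    SELECT 1 AS MyNumber

    UNION ALL

    -- Recursive query: references the CTE itself to produce the next value
    SELECT MyNumber + 1
    FROM Series
    WHERE MyNumber < 20          -- stop condition: the last value produced is 20
)
-- Main query
SELECT MyNumber
FROM Series
OPTION (MAXRECURSION 100);       -- 100 is the default limit when the hint is omitted
```

Lowering the hint to `OPTION (MAXRECURSION 10)` makes this query fail, because it needs more than 10 recursions. A longer sequence, such as `WHERE MyNumber < 1000`, needs a higher limit than the default.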
Let me just comment this out. If you execute it, you will get an error, because the default is 100. But of
course you can go and increase the maximum recursion. For example, let's go with 5,000. Now it works,
and you get a sequence of 1,000 in the output. So with this you can control how many iterations are allowed in your query, so
that you have control over it. Okay. So now let's understand step by step how SQL executed the recursive
query. Here we have a flow diagram in order to understand the process, the steps of executing the
recursive query. So let's go through it. At the start, the first step is to run the anchor query. Our
anchor query is just a SELECT for the value one. So in the output we get the value one in MyNumber, and as you
can see, the anchor query is executed only once. There are no iterations or anything. SQL executes it
once and then goes to the next step. So what is the next step? It's going to execute the recursive query. So it's
going to go over here, and now what happens? We get the current value of MyNumber. The current value is one,
and then we add one to it. So 1 + 1: we get the value two from the recursive query, which is added to our
results. Now SQL checks the condition: is MyNumber now smaller than 20? Well yes, it's smaller than 20, and
since that is true, SQL goes and re-executes the recursive query. So now we are doing the
second iteration. Again it goes to the recursive query and asks: what is the current value of
MyNumber? It is two. So 2 + 1: the second iteration gives us the value three. As you can see, each time the
recursive query is executed, it adds more values to our result. The same question is asked again: is MyNumber now
smaller than 20? Well yes, it is smaller, so SQL re-executes the recursive query. SQL
keeps looping and iterating and adding values to the output until we reach the value 20. Now SQL
asks: is MyNumber, which is now 20, smaller than 20? Well, no. It is false, so the chain breaks and we
do not loop anymore. That is the end of the CTE, and this is the final result that is
used by the main query. So this is how SQL executed this recursive CTE. Okay. So now let's have another task for the
recursive CTE. This time it's going to be a little bit more advanced. The task says: show the employee hierarchy by
displaying each employee's level within the organization. That means we have to show for each employee, for each row, a
level that tells us the position of the employee in the hierarchy. So first let's go and explore the Employees table. Let's go and
select everything from Sales.Employees. Okay, let's execute it. Now, looking at the results, we have
some information about the employees. We have information about the department, the gender, the salary, but
the key here is the manager ID. It is a self-reference to the same table. For example, for the first
employee the value is NULL. That means this employee has no manager, which makes this employee the big boss, the
CEO. Now looking at the next two employees, they have the manager ID one. So who is the manager of those two? It's
the first row, the employee with ID number one. So employee number one is the boss of those two employees.
And then for the fourth one, we can see the manager ID number two. So the manager of Michael is actually Kevin,
the second row. And for Carol the manager ID is three. That means Mary is the manager of Carol. And this is
exactly what we can do with the recursive CTE. We can use this kind of information in order to create a
loop. So let's go and do it step by step. First we start with the anchor query as usual. So this is
the anchor query, and here the first step, the first record, is going to be the highest manager, which is the CEO, right?
The first record. In order to select only the first record, we can say WHERE the manager ID IS
NULL. So let's go and execute it. With that we have the first row, and we can use this as the first step in our
iteration. Now let's pick a few columns in the SELECT, like the employee ID and the first name, and
let's get the manager ID as well. And now we have to start creating the levels, right? This is the first
level, so I'm going to have the value one AS, let's call it Level. So our CEO has level number one. Let's
go and execute it. Now as you can see, Frank is the CEO and he is at level number one. So this is our anchor query.
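A minimal sketch of this anchor query follows. To keep it self-contained, the Employees CTE below is a stand-in built from the five rows described in the narration; in the course database you would select from Sales.Employees instead. The column names EmployeeID, FirstName, and ManagerID are assumed spellings.

```sql
-- Stand-in for Sales.Employees with only the columns used here (assumed names).
WITH Employees AS (
    SELECT EmployeeID, FirstName, ManagerID
    FROM (VALUES
        (1, 'Frank',   NULL),
        (2, 'Kevin',   1),
        (3, 'Mary',    1),
        (4, 'Michael', 2),
        (5, 'Carol',   3)
    ) AS e (EmployeeID, FirstName, ManagerID)
)
-- Anchor query: the employee without a manager starts the hierarchy at level 1
SELECT
    EmployeeID,
    FirstName,
    ManagerID,
    1 AS Level
FROM Employees
WHERE ManagerID IS NULL;
```

With the rows above, this returns only Frank with level 1.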
Now we have to do the iteration, right? So we have to start creating the
CTE. So we say WITH, let's call it CTE employee hierarchy, then AS, and then this is the definition of our CTE. Let me just
indent it like this. And of course, what do
we need? We need the main query. In the main query we select everything from our new CTE, like this.
So let's go and test it. All right. Now we have prepared the CTE and the main query, and of course the next step
is to build the recursive query. But first we need the UNION ALL in order to connect the
two queries. This is the recursive query, and now we can start building the logic. We want to find all the employees
whose manager is employee ID number one, right? Because they are going to have the second level in the hierarchy. So what
we do: we SELECT, and we need the same columns. We would like to get the employee ID, the
first name, and the manager ID. And we need the level. So this is going to be level number two. It's not correct yet.
I just want to show what this means. We need to get the employee ID, the first name, and so on, and we cannot
get them yet from the CTE, because in the CTE we have only one employee. So we still have to go to the database and
grab the next employees. So I will select from the Employees table as well and give it an alias, E.
So far we are not doing any recursion yet, right? In the recursive query we're still querying the database. But now we
don't need all the employees from this table. We need all the employees where the manager ID is equal to one right now.
Of course, to get those employees where the manager ID is equal to one, we could do it with the WHERE clause,
for example, and say the manager ID is equal to one. Let me just select this and query it. Now we get those two employees
whose manager is the CEO, the top manager. But of course we cannot do it like this. What we do instead: we
join this table with our current CTE in order to make a loop. Let me show you what I mean. We
remove this, we use an INNER JOIN, we reference the CTE, and we give it the
alias CH. Then we connect them like this: ON, the manager ID of the employee should be equal to the
employee ID from the CTE. The employee ID at the start is number one. So it's going to be like
this: the manager ID from E equals the employee ID from CH. Now we are connecting the manager ID with the employee ID, and we
are reusing the CTE inside itself in order to make the iterations. And here we don't need the WHERE clause, because the inner join filters
the data automatically. As we learned, the inner join shows only the matching rows from the left and the right,
so that means there is filtering. So we are almost there, but of course we don't want to show the level as a fixed two. What
we do is show it like this: Level + 1. The current level is one. The second
iteration is going to be two, and the third iteration is going to be three. So I think we have everything for our iteration.
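The complete recursive query described so far can be sketched as follows. The Employees CTE is again a self-contained stand-in for Sales.Employees built from the rows mentioned in the narration, and the identifiers (CTE_Emp_Hierarchy, the aliases e and ch, and the column names) are assumed spellings rather than the exact ones typed in the video.

```sql
-- Stand-in for Sales.Employees (assumed column names, rows taken from the narration).
WITH Employees AS (
    SELECT EmployeeID, FirstName, ManagerID
    FROM (VALUES
        (1, 'Frank',   NULL),
        (2, 'Kevin',   1),
        (3, 'Mary',    1),
        (4, 'Michael', 2),
        (5, 'Carol',   3)
    ) AS v (EmployeeID, FirstName, ManagerID)
),
CTE_Emp_Hierarchy AS (
    -- Anchor query: the top manager (no manager of their own) is level 1
    SELECT EmployeeID, FirstName, ManagerID, 1 AS Level
    FROM Employees
    WHERE ManagerID IS NULL

    UNION ALL

    -- Recursive query: employees whose manager was found in the previous step
    SELECT e.EmployeeID, e.FirstName, e.ManagerID, ch.Level + 1
    FROM Employees AS e
    INNER JOIN CTE_Emp_Hierarchy AS ch
        ON e.ManagerID = ch.EmployeeID
)
-- Main query
SELECT EmployeeID, FirstName, ManagerID, Level
FROM CTE_Emp_Hierarchy
ORDER BY Level, EmployeeID;
```

With these five rows the result is Frank at level 1, Kevin and Mary at level 2, and Michael and Carol at level 3. No WHERE clause is needed in the recursive part, because the inner join returns no rows once nobody reports to the employees found in the previous step.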
Let me just check and make this smaller. Now again, we have here our anchor query. This is only for the top-level manager.
And then here we are just connecting the managers with the employees. We are reusing the CTE in order to create the
effect of the loop, and we are using the inner join in order to break the loop once there are no more rows to
process. So let's go and execute it. Now let's check the output. This is our top manager, so level one. This information
comes from the anchor query. Then in the second iteration we get the employees where the manager ID is equal to one. So
it's going to be those two employees. Those employees are at the second level in our organization. And
now to the third iteration: we search for all employees whose manager ID is equal to either two or three. This
results in two employees, Carol and Michael,
because their manager ID is equal to three or two, and they get the level three. After that
SQL tries to search for employees where the manager ID is equal to four or five. SQL does not find anything, and
that's why the loop ends. So with that we have solved the task. All right. I totally understand if this is
complicated, so now we're going to go through it step by step in order to understand how SQL executed this and why we have done
it in this way. Again we have our flow diagram. We start by running the anchor query, then the recursive query,
and then we have a check. If there is still something left to process, we iterate; otherwise we end. So
let's do it step by step. Here we have
the Employees table, and beneath it we have the result of the CTE. The first step says we run the anchor
query, and we run it only once. So SQL goes to the anchor query and starts executing it. Here we are
selecting from the Employees table, but we are filtering on the manager ID: the manager ID should be NULL. So
that means we get the record of Frank. Frank goes to the output, and we are saying the level of
this employee is one, so we have level one here. This is the output of the anchor query, and that's
it. The anchor query will never be executed again. Now we go to the next step: we run the recursive query. So what's going to
happen? In the recursive query we are saying: okay, I would like to select data from the employees as well and join it
with the CTE result. The join should be an inner join, so we get only the matching data between the CTE and the
employees. And now comes the join condition, and this is the key for this iteration: we are saying the manager ID
of the employee should match the employee ID from the CTE. So SQL goes and joins the table with the
CTE. Right now the CTE holds only employee ID number one. So SQL goes row by row, searching for any
matches. For the first row we don't have a match, because the manager ID is NULL, not one. That's why it is
not included in the result. In the second row, the manager ID is equal to one, and this is a match with the
employee ID. So SQL takes it and puts it in the output. Not only that, SQL increases the level. We have
here the current value one, so Level + 1 gives us the value two. We are still in the same
iteration. We are not iterating yet. This is the first iteration of the recursive query, and it lasts until the whole join
is done. On to the next row: we have a match as well, because the manager ID is equal to one. And we have the same
thing: the level is two as well, because the value of the level didn't change. The current value is still
equal to one. And this keeps going. For the remaining rows, with manager IDs two and three, we don't have any matches. With that, SQL is done
executing the recursive query. All right. So now SQL asks: okay, did we process everything? Well,
no. We still have missing output, we still have missing employees. That's why we didn't fulfill the condition, and
we run the recursive query again. Now in the second iteration, SQL again joins the CTE result with the
employees by matching the manager ID and the employee ID. But this time it focuses only on the two IDs from the previous iteration,
two and three. So SQL goes and finds any rows where the manager ID is equal to two or three. It
does it row by row. The first one is not a match. The second one is not either. The third one is not, because the manager ID
is one. But now at employee number four we have a match. So SQL takes this one and puts it in the
output like this. In this iteration, what is the current level? It is two, and we add one to it. That's why
we get three in the output. Then SQL keeps going. Here we have employee number five, and the manager ID
is equal to three. So what happens? SQL takes it as well and puts it in the output as part of the result of the CTE, and
the current level is again two + one, so we get three as well. With that SQL is done joining the tables
and asks again: did we process all employees? Well, yes. That means we don't have to do any more
iterations, because in another iteration SQL would not find anything. For example, if you go over here, let
me just remove this, and let's say we are joining with four and five. What happens? SQL searches in the
manager IDs for four and five and it does not find anything. So that means we will not be adding anything to the CTE.
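This last step can be checked in isolation with a small sketch. The Employees CTE is a stand-in for Sales.Employees built from the rows mentioned in the narration, with assumed column names; the query looks for anyone who reports to employee 4 or 5.

```sql
-- Stand-in for Sales.Employees (assumed column names, rows taken from the narration).
WITH Employees AS (
    SELECT EmployeeID, FirstName, ManagerID
    FROM (VALUES
        (1, 'Frank',   NULL),
        (2, 'Kevin',   1),
        (3, 'Mary',    1),
        (4, 'Michael', 2),
        (5, 'Carol',   3)
    ) AS v (EmployeeID, FirstName, ManagerID)
)
-- Who reports to the employees found in the last iteration (IDs 4 and 5)?
SELECT EmployeeID, FirstName, ManagerID
FROM Employees
WHERE ManagerID IN (4, 5);   -- returns no rows, so the recursion has nothing to add
```

An empty result from the recursive part is what ends the loop.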
That's why SQL stops. We have the complete result, we now have all the data from the employees in the output,
and this result is passed to the main query. So this is why we have done it like this, and this is how SQL
executed this recursive query. I would like to visualize for you what the levels mean, the structure of the
organization. The hierarchy looks like this. At level one, the top manager is Frank. So this is level number
one. Then we go to level number two, where we have two employees, Kevin and Mary. They work together and their
boss is Frank. So it looks like this, and they are at level two. Then we have Michael, who directly
reports to whom? To Kevin, because Michael's manager ID is two and Kevin's employee ID is two. So
we have one employee here. Carol is at level three as well, and she
reports to Mary. Both Michael and Carol are at level three. So this is
what we mean by the level. It helps us identify which employee is at which level in the organization. If you have
a hierarchy in your data, and you can see that in one table rows are referencing each other, like here where the manager ID is
actually an employee ID, this means there is a hierarchy
and a structure in this table, and you can use the recursive CTE in order to build those levels and to
navigate through the hierarchy as well. All right. So that's all for the recursive CTE, and with that we have
covered all the different types of CTEs that we have in SQL. So now let's have a quick recap. So
we have learned that the CTE, the common table expression, is a temporary named result, like a virtual table, that can
be used from different places in the query. The CTE has a lot of advantages. The main one is that it breaks
the complexity of a query into multiple small pieces, which makes our query much easier to read and
understand. So it improves readability. Another advantage of the CTE is that
those small pieces are really easy to manage and to develop. They are self-contained, which makes our queries more modular. So
it introduces modularity into our queries. We also learned that the CTE helps us reduce the redundancy
inside our queries, because it makes the result of one query usable in multiple places inside our query. So it makes our
code smaller and reduces redundancy. One more advantage of the CTE is that it helps us do looping and iterating
in SQL by using the recursive CTE. We have also understood that we can treat the CTE result like any other
physical table inside our database. So we can treat it and handle it like any other table, with one exception:
this table lives only within one query, so we cannot query the CTE from an external query. Now we have learned that the
result of the CTE can be used from the main query. This is the classical case. But not only can we use it in the main
query, we can also use it in another CTE query, which leads to nested CTEs. And of course we have learned as
well that we can use the result of the CTE within itself, which makes the CTE recursive and allows for looping and
iterating. And I can only keep recommending not to use more than five CTEs in one query. Otherwise you're
going to get the exact opposite of the benefits of CTEs: your code is going to be really hard to understand, to
read, and even to extend. Okay my friends, with that we have covered this amazing and very important technique in SQL, the
common table expression, the CTE. Now in the next step we're going to talk about a new type of object that you can
use in databases. We don't have only tables, we have views as well. And views are amazing for giving you dynamics
and flexibility in your project. So let's talk about views. A view is not just a query
that we run in SQL. It is an object that we can find in the database. So before we jump straight to the view,
I would like to give you the big picture, the whole structure of the database. So let's go. We have a
hierarchical structure, and the highest level of this hierarchy is the SQL Server. The SQL Server manages multiple
databases. It's like the control center that keeps everything running and accessible. Now inside the SQL Server,
we have multiple databases. A database is a collection of information that is stored in a structured way. It's
where all your data is kept and organized in different tables and objects. Each database is separate
from the others and has its own data. Now inside each database we can find multiple schemas. A schema is a
logical way to group related objects, like tables and views, together within a database. For example, if
you have a database called Sales, we can group the different tables about the orders underneath the schema Orders. And
maybe we have multiple views and tables about the customers, which we can put in the schema Customers. So if
you find multiple tables and views that describe the same object, the same topic, we put them all together
underneath one schema. So again, we could have databases like the Sales database and the HR database. They hold
completely different types of data. And underneath Sales, we can have different sections: a section
about the orders and a section about the customers. Now moving on: what can we find inside the schema? We can find
tables. A table is where your data is actually stored. It contains multiple columns and rows. So it is where the
data physically lives. And inside the schemas we have another type of object, which we call a view. Of course,
in this section we are focusing on the views. A view is like a virtual table that has a structure and everything, but
inside it we don't have any data. The view does not store any data. In order to see the data we have to execute
the query behind the view, and only after that do we see some data. It is not like the tables: it doesn't store
the data permanently. Now inside the tables we can define multiple things, like columns and keys, and the same
goes for the views: inside the views we can define multiple columns. And at the last level, for each column we have a
name and a data type. So as you can see, databases are really organized, and we have a hierarchy where the top
node is the SQL Server and the lowest node is the columns and rows. So this is what we call the database structure. Now
in order for you to build and manage this structure, we have a set of commands that we call DDL, short for data
definition language. DDL is a set of commands that allows us to define and manage the structure of the
database. We have commands like CREATE, which helps us create databases, schemas, tables, and views.
Another command is called ALTER. Of course, after you create something you might later want to make changes and
updates. And of course we have DROP in order to remove any database object, like dropping a schema, a
database, tables, or views. So as you can see, the DDL commands help us manage the database structure. From
this picture we have understood that we can create views inside schemas in the database. Now if you check the client
and the Object Explorer, you can find the exact same hierarchy. It starts with the SQL Server. This is our local server
that runs on our machine. Inside it we can find multiple databases, and one of them is our SalesDB that you
have installed together with other databases like AdventureWorks. So now if you go to SalesDB over here,
you can drill down to the next level. Here we find a lot of objects, and among the ones that you know we
have tables and views. Now you might say: okay, but between the database and tables we have schemas, so where are the
schemas? Well, if you go inside Tables, you're going to find our tables, Customers, Employees, and so on, but
before each name we have a prefix, as in Sales.Customers. You can find it everywhere: Sales.Customers, Sales.Employees,
and so on. Sales is the schema that brings all those tables together underneath one logical schema.
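To connect the DDL commands mentioned earlier with the schema-qualified names seen in the Object Explorer, here is a small sketch. DemoSchema and DemoCustomers are hypothetical names chosen for illustration, not objects from the course database, and GO is the batch separator understood by SSMS and sqlcmd.

```sql
-- Hypothetical objects for illustration only (not part of the course database).

-- CREATE: a schema, then a table inside that schema
CREATE SCHEMA DemoSchema;
GO

CREATE TABLE DemoSchema.DemoCustomers (
    CustomerID INT NOT NULL PRIMARY KEY,
    FirstName  VARCHAR(50)
);
GO

-- ALTER: change the structure later, here by adding a column
ALTER TABLE DemoSchema.DemoCustomers
    ADD Country VARCHAR(50);
GO

-- The table is addressed as schema.table, the same pattern as Sales.Customers
SELECT CustomerID, FirstName, Country
FROM DemoSchema.DemoCustomers;
GO

-- DROP: remove the objects again (the table first, then the now empty schema)
DROP TABLE DemoSchema.DemoCustomers;
DROP SCHEMA DemoSchema;
GO
```

The schema has to be empty before it can be dropped, which is why the table is removed first.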
So we have a database called SalesDB, a schema called Sales, and a table called Customers. Now
if you would like to see all the schemas inside this database, what can you do? You can go to the Security folder over here,
and there we have a folder called Schemas. If you go in there, you will find the list of all schemas
that we have in this database. You might say: but we didn't create all of those. We only know Sales.
Well, when you create a database in SQL Server, you get a lot of other default system schemas that the server
creates. One of them is INFORMATION_SCHEMA, which holds a lot of views about the catalog and the
metadata, where you can find the list of columns, tables, views, and so on. So here we have only one schema that was
created by the user: Sales. So let's go back. Now if you go inside one of those tables, you will find
multiple things, like columns, keys, constraints, and so on. And if you go to the columns, you
end up at the lowest level of the hierarchy. Here we have the columns, like the customer ID, and we have some
extra information, like the data type, length, and so on. So this is the structure and the hierarchy of
databases. Now I would like you to understand a fundamental database concept in
order to understand the views: the three-level architecture of the database. This architecture describes
the different levels of data abstraction in a database. So let's see what this means. The architecture is
divided into three levels. The first level is the physical level, then we have the logical level, and the third one
is the view level. Now let's understand what each level means. The physical level is the lowest level of
the database, where the actual data is stored in physical storage. Usually the ones who have access to this layer are the
database administrators, because they are the experts who manage the access and the security of this layer.
They also manage a lot of other things, like optimizing the performance, making sure that
everything is secure, managing backup and recovery, doing all the configuration, and many other tasks. So
at the physical layer we have to deal with a lot of things, like data files, partitions, logs, catalogs, blocks, and
caches, and many other things that each database needs in order to store your data. As you can see, this layer is
very complicated, and you really need to be an expert in databases in order to manage all of it. So
we call this layer the physical layer, or sometimes the internal layer. So now let's move to the next level. We
have the logical level. The logical layer is less complicated than the physical layer. At this level you
deal with how to organize your data. Normally we have here application developers or
data engineers who access the logical level in order to define the structure of your data. Those developers can
focus on how to structure your data rather than on how exactly the data is stored physically in
storage. They don't have to deal with all those details. They leave that to the database administrator and
focus only on how to structure the data. That's why this kind of role needs its own abstraction level, which is
the logical level. So what are the developers actually doing at this level? Well, they are creating tables and
defining the relationships between those tables, or they define views. They can create indexes on the tables in
order to optimize the performance of the tables, or maybe they create stored procedures, functions, and other
code in order to manage those tables. So as you can see, they are building the data model, they are structuring your
data, but they don't care at all where the data is stored physically in the database. As you can see, things here
are less complicated than at the physical layer, and it is the perfect abstraction for developers to build projects. We call
this the logical layer, or sometimes the conceptual layer. Okay. So now moving on to another level of
abstraction. We have the view level. The view level is the highest level of abstraction in the database, and it is
what the end users and applications can access and see. For example, you could have one set of views for business
analysts: you prepare and customize views that are suitable only for the business analysts. And you might say, you
know what, let's prepare another set of views that are suitable for data visualization and reporting. For example, you
can connect Power BI in order to create dashboards. So these are fully customized views, prepared
to be connected to the Power BI reports. And you can keep doing that by creating multiple sets of views
that are suitable for a specific purpose and use case. So as you can see, at this level we are exposing our data to
multiple users and multiple applications. So now the question is: what do we have to deal with at the view
level? Well, there you have only views that hold only the information relevant to the use case or users.
So the users at this level have only views. They don't have to deal with tables, indexes, stored procedures,
files, logs, partitions, or anything else. This is the highest level of abstraction, because the focus of this layer is to
make the data friendly for the end users and easy to consume. We call this layer the view layer, or sometimes
the external layer. So this is the three-level architecture of databases, or as we also call it, the three
abstraction levels of the database. The physical layer has the highest complexity and the lowest abstraction, and
the view layer has the highest abstraction. This is one more reason why views are a very important concept
in SQL databases. Okay. So with that we have
enough fundamentals in order to start talking about views. So the question is, what are views? A view is a virtual
table in SQL that is based on the result of a query, without actually storing the data in the database. So in short, this
means a view is a stored, or persisted, SQL query in the database. So let's understand what this exactly means. Now,
in what you have learned so far, we have a database table, and all we have done is create a select query in
order to retrieve the data from this table. So once we execute our query, we get the result back. Now if we are
talking about views, they have the structure of a table as well, but without any data inside. And for each
view there is a query attached to it. So there is no data, but we have a query in order to get the data. We call
the normal table a physical table, and the view we call a virtual table. So now, how exactly are we going to get the
data? If you go and write a query that selects data from the view, not from the table but from the view, what is going to
happen is that SQL is going to go and trigger the query that is attached to the view, and this query is responsible for querying the
physical table, and then the result is going to fill the structure of the view and we get back, of course, the results. So
we are directly querying a view, but actually we are indirectly querying a physical table. So the view sits
between us and the data. That means my real data is stored inside the database tables, and the views are like
an abstraction layer between me and my real data. And of course the data will not be stored inside the view. Each time
I query the view, what's going to happen is that the SQL query behind the view is executed again. So it's
going to go and retrieve the data and get it back to the view, and then I will see it in the output. So this is what we
mean by a SQL view. So now let's have a quick comparison between tables and views.
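A minimal sketch of that behavior, assuming SQL Server. The table dbo.Demo_Orders and the view dbo.V_Demo_Orders are hypothetical names made up for this illustration and are not part of the course database; the CREATE VIEW syntax itself is covered later in this lesson. The point is only that the view holds no rows of its own, so a row added to the table shows up the next time the view is queried.

```sql
-- Hypothetical demo objects (assumed names, not from the course database).
CREATE TABLE dbo.Demo_Orders (
    OrderID INT PRIMARY KEY,
    Sales   INT NOT NULL
);
GO  -- GO is a batch separator used by SSMS/sqlcmd; CREATE VIEW must start its own batch

-- The view stores only this query, not the rows it returns.
CREATE VIEW dbo.V_Demo_Orders AS
(
    SELECT OrderID, Sales
    FROM dbo.Demo_Orders
);
GO

INSERT INTO dbo.Demo_Orders (OrderID, Sales) VALUES (1, 25);
SELECT * FROM dbo.V_Demo_Orders;   -- the view's query runs against the table and sees the first row

INSERT INTO dbo.Demo_Orders (OrderID, Sales) VALUES (2, 40);
SELECT * FROM dbo.V_Demo_Orders;   -- the query runs again and now also sees the second row
```

Nothing was ever inserted into the view; both results come from the physical table at the moment the view is queried.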
Tables store the actual data physically in the database. So tables are where the data is persisted. On the other hand,
views are virtual tables and they do not store any data inside the database, but they present the data from
the underlying tables. So that means views don't persist any data physically. Now, tables are hard to maintain and
hard to change as well. It takes a lot of effort to make any change: adding columns or moving columns
always requires a lot of effort for the migration, especially if you have large tables. But on the other hand, views
are way easier to maintain and very flexible to change. All you have to do is change the query of the
view. So that means you can change things very quickly in views compared to tables. But if we are talking about
performance, tables are generally faster than views. For example, if you go and do a simple select on the table, you get
the data back as soon as the database fetches it. But if you are selecting something from the view, there are
actually two queries: the query that comes from the user, and the second query, which is the view query. And the
query of the view could be very complicated in order to extract the data from the underlying tables. So selecting
something from the view is usually slower than selecting the same thing from a table. Now if you have a table, you can read
from the table and you can write to the table as well, but views are treated here as read only; as the name says, it is only a view. You
do not go and write something to the database using the view. Okay. So those are the big differences between views
and tables. All right. So with that we have a clear understanding of what views are. But now
you might ask me, why do we need views? That's why what we're going to do now is deep dive into multiple
scenarios and use cases that you might encounter in your SQL projects. So let's start with the first use case. The first
use case, and the core reason why we use views in our data projects, is to store central logic from a complex query in
the database so that everyone can access it, and with that we improve reusability between multiple queries and we reduce
the complexity of the overall project as well. So let's understand what this means. So now in our project we have
two tables in the database, orders and customers, and we have learned previously that if we have a
complex query we can go and use a CTE. So for example, in our CTE we are joining tables and doing some
aggregations using SUM, and the CTE is going to hold the data as an intermediate result, and then we have
the main query. For example, we are doing step two, where we are ranking the data. So the whole thing is in one query,
and let's say that a financial analyst is doing this type of analysis. Now what could happen is that you might have
another user, for example a budget analyst, who is doing exactly the same first step. So he has a
CTE query as well, where first the data is joined and then aggregated using SUM. But in the last step, in the main query,
he's not doing ranking, he's just doing MAX and MIN. And not only that, we have a third user, the risk analyst,
who is doing the same initial step as well, using the CTE, joining the tables and doing the summarization. But here
the risk analyst, in this scenario, is just comparing the data in the last step in the main query. So now if you sit
back and look at this, you can see that all three of those data workers are doing the same first step. So all of
them are doing the same CTE. They are joining the data and then doing the summarization. And of course it is a
complete waste of time that each one of them first has to create the CTE from scratch in order to do some
analysis. So it is complete redundancy and makes no sense. So this is exactly the disadvantage of using only CTEs in
projects. Now what we can do instead: those three data workers are going to decide to say, you know what, let's put
the first step as a view in the database. So instead of using a CTE each time, we're going to take this script and put it in
the database. So we now have a central logic that is stored in the database, where everyone can use it. So we have
this query, this logic, only once, and everyone can benefit from it. So now the financial analyst, instead of going
directly to the physical tables, can go to the view. This means she needs to write only one script, the rank
script. The same thing goes for the budget analyst: he only has to write the query for the MAX and MIN, and for the
risk analyst as well, he just needs to compare the data. So as you can see, all those queries are reduced and the analysts can
focus only on the analysis. So this is exactly the magic of views in data analytics. This logic, this knowledge, can
be centralized in the database, and this is way faster and better than having this logic written each time someone
wants to do an analysis. So this is why we need views in data projects. So now if you compare views with CTEs,
a CTE is used in order to reduce the redundancy within one single query. So it improves the reusability within one
query. On the other hand, with views we are reducing the redundancy across multiple queries. So we are
reducing the complexity of the whole project. So views improve the reusability across multiple queries. Now
think about it like this. We use views in order to persist logic in the database. So the logic is so important
that we want to persist it in the database. It's like this: in tables we persist data, but with views we are
persisting logic. On the other hand, in a CTE the logic is not persisted. It is temporary and is going to be
calculated only on the fly within the scope of one query. So this logic is important only in this scenario and it
is not important for any other queries. That's why it makes no sense to persist it using a view. So you have to
decide: if this logic is very important, then take it out of the CTE and put it in a view. But if you think, you
know what, this logic is not really important and only matters in this one query, then stay with the CTE, because
creating views always needs some extra steps in order to maintain the view. You have to create the view. You have to
drop the view if you don't need it. But for the CTE, there is almost no maintenance. The database is going to do
the cleanup automatically once the query is done. So there is no extra activity to drop a CTE or anything. That's why
a CTE is easier to use than a view. So those are the big differences between views and
CTEs. Okay. So now let's quickly check the syntax of a view. So now we have a query like select from where. So this is
a query, a simple select statement. But now, in order to create a view, an object in the database, we have to go and use a DDL
command, CREATE. So we're going to say CREATE VIEW because we want to create a view, then the name of the view, and then, just
like with the CTE, we say AS and then a pair of parentheses around the query. So as you can see it's very simple, and we call this a DDL command,
where we are telling the database: go and create a view, and the logic of the view comes from this query. So it's very
simple. This is how you can create views in a database. Okay. So now let's have the following task, and it says: find the
running total of sales for each month. I'm going to start this task by solving it using a CTE. So first I'm going to
go and do a few aggregations on top of the month. So let's go and select. So now what do we need? We need the order
date, but we need it as a month. I'm going to go and use DATETRUNC like this and say, okay, I would like to
have the date at the granularity of month. So let's go and call it order month. And now after that we're going to
do a few aggregations. For example, let's go and get the sum of sales, and we're going to call it total sales. And
that's it for the start. So now let's go and select it from the table Sales.Orders, and group by, and we are grouping
by the month. So something like this. Let's go and execute it. And now with this we get the total
sales for each month. And the next step is that we have to go and calculate the running total for the sales. This is of course
not the running total yet. So this means this is our first step and we
need a second step. So either we use subqueries or CTEs. I will go with the CTE over here. So I'm going to say WITH,
then the CTE name, monthly summary, and we're going to define it like this. And now what we're
going to do, we're going to go and define the main query. So the main query is going to be simple. So select, and let's
go and get the order month. And now we have to build the running total. So we're going to go and use a window
function. So SUM of total sales, and then we're going to say OVER. We don't have to partition the data. We will just sort it
by the order month and we can leave it ascending. So this is the running
total, and of course we have to go and select from our CTE here. So let's go and execute it, and with that we are
getting the running total. Of course we can go and add the total sales to the output in order to understand the
results. So here in the output we are just building the cumulative sales. So for this scope everything is fine, and we
are using the CTE. But now imagine that this logic is important for multiple queries. So it's really nice to have
such a report where we are aggregating the data at the level of the month, and this could be used by different users
and different queries. So now we say, how about we put this logic in one view so that everyone can access it and we don't
have to repeat the same aggregations over and over? And now, before we put it in a view, maybe some
other user says, you know what, we would like to have one more aggregation, not only the total sales. Let's make the
scope a little bit bigger so that everyone can benefit from it. So for example we can go over here and say, you know
what, let's go and add the total number of orders. So we can go over here and say COUNT, and let's take the order ID
and say this is the total orders. Or maybe someone else says, let's get the quantities as well. So we can go and
sum the quantity like this, and we call it total quantities. So with that we are
doing a lot of aggregations at the month level. Let's go and execute only the CTE. So now we have a really nice report
that is based on the months and can be used by many different queries. So now what we're going to do, we're going to
take this and put it in a view. Let's go and select only this logic and create a new query. And now what we're going to
do, we're going to put our query here, and now we have to write the DDL in order to create a view. So it's going to
be like this: CREATE VIEW. Let's give it a name, maybe starting with V underscore, and this is going to be the
monthly summary. So this is the name of the view, and then AS, and then we put everything in parentheses. It's like you are
building a CTE. So we have here our logic, and here is our DDL query in order to create the view. So now let's go and
execute it. Now as you can see, in the output it says only that the command is completed, because this is not a select
query. This is a DDL command. So SQL is just going to tell you, okay, either I created it successfully or not. So now the
question is, where do I find my view? Well, if you go to the object explorer, you can see over here, underneath our
database SalesDB, we have something called tables, where we are used to querying those tables. But beneath
it, we have our views as well. So if you check the views and expand the folder, we are not seeing any view yet, because we just
created the view. So go over here and refresh. And once you do that, you will see the newly created view. So this
is the one that we just created. So now what we can do, we can go and create a new query and let's go and just query
the view. So select star from V_Monthly_Summary. Let's go and execute it. And now as you can see, we are
getting the result of the view, and I'm accessing this logic from a completely external query. So now I can
think about the view like any other table that we have in the database. And again, the big difference between views
and tables: the tables have data, actual data, and everything there is persisted, but the view is just an
abstraction for me, and behind it there is a query that goes to the tables and queries them in order to present
the results. But I don't care about all those details. I can go immediately to the query over here and
start querying. So now in order to create the running total of sales, I don't have to create the CTE or subqueries.
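The steps so far can be sketched roughly as follows, assuming SQL Server 2022 or later (for DATETRUNC). The table name Sales.Orders comes from the lesson; the column names OrderID, OrderDate, Sales and Quantity, and the CTE name CTE_Monthly_Summary, are assumptions based on how they are read aloud, so adjust them to your own schema.

```sql
-- Assumed columns on Sales.Orders: OrderID, OrderDate, Sales, Quantity.

-- Version 1: everything in one query, with the monthly aggregation in a CTE.
WITH CTE_Monthly_Summary AS
(
    SELECT
        DATETRUNC(month, OrderDate) AS OrderMonth,   -- DATETRUNC requires SQL Server 2022 or later
        SUM(Sales)                  AS TotalSales
    FROM Sales.Orders
    GROUP BY DATETRUNC(month, OrderDate)
)
SELECT
    OrderMonth,
    TotalSales,
    SUM(TotalSales) OVER (ORDER BY OrderMonth) AS RunningTotal   -- no PARTITION BY, ascending by default
FROM CTE_Monthly_Summary;
GO  -- batch separator in SSMS/sqlcmd; CREATE VIEW must be the first statement in its batch

-- Version 2, step 1: persist the (extended) aggregation as a view.
-- No schema is given here, so the view lands in the default schema.
CREATE VIEW V_Monthly_Summary AS
(
    SELECT
        DATETRUNC(month, OrderDate) AS OrderMonth,
        SUM(Sales)                  AS TotalSales,
        COUNT(OrderID)              AS TotalOrders,
        SUM(Quantity)               AS TotalQuantities
    FROM Sales.Orders
    GROUP BY DATETRUNC(month, OrderDate)
);
GO

-- Version 2, step 2: the analysis query now contains only the running total.
SELECT
    OrderMonth,
    TotalSales,
    SUM(TotalSales) OVER (ORDER BY OrderMonth) AS RunningTotal
FROM V_Monthly_Summary;
```

Both versions should return the same running total; the difference is that the aggregation in the second version is written once and can be reused by any other query.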
I just go and get, for example, our main query. Let's go back over here. So now instead of using the CTE, I can go
directly and access the view. So as you can see, my query is now very simple. I'm doing step two immediately, without
having to prepare the data first. So if I go and execute it, I will get exactly the same results. And now if you compare the
query on top of the view with the CTE query, you can see that the CTE query has more steps and is a little bit
more complicated than the query on top of the view, and this is exactly the benefit of the view. We reduce the
complexity, and it is very easy to consume from the users' point of view. So this is how you can put your logic in
a central place using views, and with that we have learned how to create a view. Now one more thing, about schemas. If
you check our tables over here, they all have one schema. So we have Sales.Customers, Sales.Employees,
Orders and so on. Our new view has the schema dbo. If you create any object, whether it's a table or a view, and you don't
specify a schema, it lands in the default schema, which here is dbo. And now let's go back to our DDL script. So as you can see over
here, we didn't specify any schema. We just said, okay, this is the view name. And now, in order to put our view in the
correct schema, since we don't want it to be in the default one, you have to go and specify the schema name in the DDL. And in
order to do that, we go to the name of the view and write the schema name first, separated with a dot. So the
first part is the schema name and the second part is the view name. So now let's go and execute it. Now if you
check over here, you don't see anything new. But if you refresh, you will find another view in the correct schema. So
we have Sales.V_Monthly_Summary, and this is exactly what we want. So this is how you can assign a view, or
even a table, to the correct schema if you don't want to use the default one. All right. So now the next
step is that you say, you know what, I would like to clean up. I don't need those two views in my database. So how
do we delete a view? We can go and use the command DROP. It is very simple. If you go and create a new query, you say
DROP and then you say what you want to drop. You want to drop a view, and then you have to specify the schema and name
of the view. But since this one is in the default schema dbo, I don't have to write the schema down. So we can start
immediately with the view name. So V_Monthly_Summary. So that's it. It's very simple. So now we go and execute it. It
says it's completed, but as you can see, nothing has changed. We go and refresh. And now we can see that the database did
remove the view with the schema dbo. So it's very simple. This is how you can drop a view in SQL. Okay. So now to the
next step. Let's go back to our DDL for creating the view Sales.V_Monthly_Summary. And now you say, you know what, I would
like to change the logic inside the view. So how can we update this content? How can I update my query? Say you
go and, for example, delete this column, because you need only three columns, and you go and execute it. The database says,
I cannot do it for you, because we already have such a view. So SQL will not go and replace anything; it is going to say no, we
already have an object with the same name and I cannot do anything about it. So how can we update the view? Well, in other databases like
PostgreSQL, for example, it's very simple. You can go over here and say CREATE OR REPLACE VIEW. So it's like you are
telling the database: create this view, or if it already exists, then replace it. And you will not get an error in PostgreSQL.
But in SQL Server it is a little bit more complicated. We don't have this command. So here you have two ways.
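As a rough sketch of the two ways walked through in this part: it assumes the view Sales.V_Monthly_Summary already exists, and it reuses the assumed column names OrderID, OrderDate and Sales on Sales.Orders from this lesson.

```sql
-- Way 1: two separate steps. Drop the view, then recreate it with the new logic.
DROP VIEW Sales.V_Monthly_Summary;
GO  -- batch separator in SSMS/sqlcmd

CREATE VIEW Sales.V_Monthly_Summary AS
(
    SELECT
        DATETRUNC(month, OrderDate) AS OrderMonth,
        SUM(Sales)                  AS TotalSales,
        COUNT(OrderID)              AS TotalOrders   -- the quantity column was removed in this version
    FROM Sales.Orders
    GROUP BY DATETRUNC(month, OrderDate)
);
GO

-- Way 2: one re-runnable script. T-SQL checks the catalog and drops the view only if it exists.
IF OBJECT_ID('Sales.V_Monthly_Summary', 'V') IS NOT NULL   -- 'V' restricts the lookup to views
    DROP VIEW Sales.V_Monthly_Summary;
GO

CREATE VIEW Sales.V_Monthly_Summary AS
(
    SELECT
        DATETRUNC(month, OrderDate) AS OrderMonth,
        SUM(Sales)                  AS TotalSales,
        COUNT(OrderID)              AS TotalOrders
    FROM Sales.Orders
    GROUP BY DATETRUNC(month, OrderDate)
);
GO
```

The second script can be executed any number of times without the "already exists" error. As a side note that goes beyond what the lesson uses, SQL Server 2016 SP1 and later also accept CREATE OR ALTER VIEW, which covers the same need in a single statement.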
Either you go over here and say, you know what, let's first drop the view. So you write DROP VIEW with the same name over
here, and then what you're going to do, you're going to go and select only the DROP VIEW statement. So if you execute it like this, the
view is going to be dropped, and then we recreate the view like this. So what we have done: we destroy the view and then
we recreate it using the new logic. Or you say, you know what, I would like to have everything in one go; I don't
want to do it in two steps. I would like to have everything in one script, and for that, in SQL Server, you have to use
T-SQL, Transact-SQL. It is an extension of SQL that is specific to SQL Server. Well, it's like programming, where you can
go and add variables or you can go and add checks. We will not do a deep dive into this language, but I would like
to show you how to do it for views, so just follow me. With that, I'm going to go and replace the whole thing, and then
we're going to say IF, and now we are checking the system catalog with OBJECT_ID.
And now we go and specify the view name. So let's go and copy the whole thing, with the schema as well. And then we're
going to tell SQL that this is a view. So if this object exists, meaning the result is not null, so it exists in
the catalog, then what should SQL do? It should drop this view. So we're going to say DROP VIEW, like we have done
at first, and then a semicolon, and then we say GO, and with that we are telling SQL that the T-SQL part is done. So
the logic is done, and after that we have the DDL for our view. So again, what we are doing: we are checking, before
creating the view, whether the view exists. If it exists, then we are telling SQL to go and drop it, and if it
doesn't exist, that means we haven't created this view yet; it is a completely brand new view, and then this step is going to
be skipped, so there is nothing to drop. So now if you go and execute the whole thing, it will work, and of course
if you go and refresh over here, you still see the view. So SQL did destroy the view first and then recreated it, and the same happens
if you execute it again. So this is how you replace the logic of a view in SQL Server. And with that we have learned
all possible scenarios: how to create a view, how to drop a view, and how to update the logic of a
view. Now back to our database architecture, and let's understand how the database executes views. So now
let's say that the data engineer is creating a view called top ten. So the query is going to be sent to the database
engine, and the database engine understands that this is a view, not a table. So now the database engine is going
to go to the disk storage and to the catalog, and it will store not only the metadata about the view but also the SQL
that is responsible for the view. So it's going to take the SQL statement that you defined in the CREATE VIEW
and place it in the catalog as well. So if you compare it to tables: for tables we have only metadata, but for views we
have both the metadata and the query of the view. And you can also see that the database engine will not
create a table in the user data. So the data is not stored anywhere on the disk or in the cache. The actual data,
the physical data, will not be stored anywhere. We are storing only the metadata and the query inside the system catalog.
So now we tell our data analyst, okay, we have a new view, and the data analyst can go and write a query in order to
retrieve the data from the view. So he is going to say select from the view and execute it. The database engine
is going to take it and understand, okay, now we are talking about a view. So the database first has to retrieve not the
data; it is going to retrieve the query from the catalog in order to understand what we now have to execute. Then the
database is going to execute the query of the view first, and the data for this query comes from a physical table called
orders. So now the database engine is querying the orders table to retrieve the data, so that we have data for the end user,
and then the user's query is going to be executed and the result is going to be sent back to the data analyst. So as you can see, there are
conceptually two queries. The SQL engine first has to execute the query from the view, and only after that can the database engine
execute the query that comes from the user. So actually the data always comes from a physical table, but we are
not giving the data analyst access to the table. We are just giving access to the view. So this happens
each time an end user selects data from the view. The database engine always grabs the query from the
catalog, executes it first in order to get the data, and then executes what the end user wants. And now say the data
engineer says, no, let's go and drop the view. So she writes a query in order to drop the view. And the database engine
is going to go to the system catalog and delete both the metadata and the query. So as you can see, if you are dropping a
view, you are not losing the actual data. So there will be no user data lost at all. So don't worry about it. What
you are losing is only the query and the metadata about your view. Only if you drop a physical table, like
orders, will you lose your data. So dropping a view is not as bad as dropping a database table. So this is
how the database works with views behind the scenes. Now moving on to the second scenario, to
the next use case of views in projects: we use views in order to hide complexity and to improve
abstraction. In many scenarios we work with very large and complex databases, and we can use views in order to reduce
the complexity and make things easier for the users. So let's understand what this means. Now I'm going to explain
a scenario that happens in almost every project. If you get access to a database where you want to do
analysis, you will often end up in a scenario where you find a large database in which the
tables are very complex to understand. They have a lot of columns. They have technical and cryptic names, and
how the tables are connected to each other, and the relationships between them, are almost impossible to understand. Then
you have to be deeply involved with the data models, with the documentation and with experts until you understand how to
query this database. So if you are not a developer, from an end-user perspective it can be a nightmare, where you are trying
to do multiple joins in order to make a simple analysis. And of course, from the database perspective, this data model is
good enough for one application, but if you are opening your database to multiple data analysis projects, this
can be a nightmare, because you have to go and explain to each user how to query the data. So what we usually do,
instead of giving direct access to such a technical and hard-to-understand data model: we as developers create
multiple views, since we are the experts on the data model, and these new views are going to be an abstraction over the
complexity that I have in my database. And we have to make sure that those views provide objects that are
friendly. So they have full English names that make sense, and the columns are friendly as well, and we try
not to offer a lot of views, so that the users don't have to do all the joins. So we provide a few views that are friendly
and have a lot of the information that the users need for their analysis. So with that, the users have access to
something more friendly and easy to consume, and then they can write simple queries in order to do analysis on top
of these friendly views. And we can give this a name: we are providing a data product on top of my complex
physical database. So here again you see how important views are for providing abstraction and easy-to-consume objects
for the users. And with that I can hide all my complexity, and the script of the view is going to be developed by the
experts, and only once, so that the users don't have to understand or write these complex SQL joins. And with that
you can make your data projects way easier than before. So this is another important use case for views, where
we use them in order to provide abstraction as well as easy and friendly objects for the end users.
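To make this concrete, here is a sketch of the kind of friendly view that the following task builds step by step. The table names (Sales.Orders, Sales.Products, Sales.Customers, Sales.Employees) follow the lesson; the exact column names (OrderID, OrderDate, Sales, Quantity, ProductID, CustomerID, SalesPersonID, EmployeeID, Product, Category, FirstName, LastName, Country, Department) and the view name Sales.V_Order_Details are assumptions based on how they are read aloud, so treat them as placeholders.

```sql
-- One combined, user-friendly view; the technical IDs are replaced by readable details.
CREATE VIEW Sales.V_Order_Details AS
(
    SELECT
        o.OrderID,
        o.OrderDate,
        p.Product,
        p.Category,
        -- COALESCE turns a missing name part into an empty string so the whole name does not become NULL
        COALESCE(c.FirstName, '') + ' ' + COALESCE(c.LastName, '') AS CustomerName,
        c.Country                                                  AS CustomerCountry,
        COALESCE(e.FirstName, '') + ' ' + COALESCE(e.LastName, '') AS SalesName,
        e.Department,
        o.Sales,
        o.Quantity
    FROM Sales.Orders AS o
    -- LEFT JOINs keep every order, even when a product, customer or employee is missing
    LEFT JOIN Sales.Products  AS p ON p.ProductID  = o.ProductID
    LEFT JOIN Sales.Customers AS c ON c.CustomerID = o.CustomerID
    LEFT JOIN Sales.Employees AS e ON e.EmployeeID = o.SalesPersonID
);
GO  -- batch separator in SSMS/sqlcmd

-- What an end user writes: no joins, no knowledge of the underlying tables.
SELECT *
FROM Sales.V_Order_Details;
```

The joins and the null handling are written once by whoever knows the data model; everyone else only needs the final SELECT.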
Okay. So now let's have the following task, and it says: provide a view that combines details from orders, products,
customers and employees. So now instead of having all those tables from our database, we have to provide one
combined view that has everything, well, almost everything. So now let's see how we can create such a view. So let's
start with the table orders. I'm going to go and first select star from Sales.Orders and
let's go and execute it. This is the central table that connects everything. You can see here we have the order ID,
product ID, sales, customers and so on. So it is a great starting point. So now we're going to go and be picky about the
columns. I will not show all the columns, but I would say let's go and show, for example, the order ID. This is
essential. It's nice to have a unique identifier. Now the product ID: I will not show it, but I will just list it over
here. The same for the customer ID and salesperson ID. Those I would like to replace later. So I will just make them
a comment so I don't forget about them, because it makes no sense to show the product ID and customer IDs and so on.
We would like to show the details about each object, because instead of having the product ID, I would like to show, for
example, the product name itself and some other information from the table products. And with that we are reducing
the complexity. So now what else can we get from the table orders? We can go and get the order date. I will put it here.
And maybe we can go and get things like sales and quantity. So like this. Of course, we
can go and add all the columns, but for now I will go with this information. Now, it's important, since we're going to
have a lot of tables, to make sure we are using aliases. So now we're going to have the O in front of each of those
columns. All right. Fine. So now we have four details from the table orders. Now, what is next? We have the product
ID. So let's go and get the information from products. What we're going to do, we're going to use a
left join, just to make sure we don't miss any order. If you go with an inner join, you might miss some orders. So I
will not do that. So let's join it with products like this. And now we have to go and join the
tables. So we can use the keys: product ID equals the order's product ID. All right. So now the question is, which
information do we want to show to the users? Let's go to the table products. So we have the product, the category and the
price. I would say let's go and get the product and category. That's enough. So now instead of the ID I'm going to have
it like this. So it's going to be the product and the category. Now let's go and test it. I'm
going to execute it. Now as you can see, we don't have a product ID. We have the product name, which is more friendly. So
we now have those two columns from orders, those two from products, and the last two from
orders as well. So it looks really nice and friendly, and with that the users don't need an extra table called products. We
have everything in one. Now let's go and do the same for the customers. So let's go
and do the same thing. So let's join Sales.Customers as C, and join them using the key: customer ID equals
the customer ID. Now we have to go and grab a few columns from customers. Let's go and check. So we have first
name, last name, country and score. I would say I would go with the names and the country, but instead of having
first name and last name separately, I'm going to put everything in one. So we have to go and concatenate the information. So
we're going to take the first name, then plus, then a space between the first name and the last name, and then
the last name, like this. Now we will not call it name. We're going to go and call it
customer name, because later we're going to have an employee name as well. All right. So next we want to get the
country, and we have to say this is the country of the customer. So we're going to call it customer country, and
that's it. Let's go and execute it. Now we can see we have again our orders and products, and now we have the
information from the customer. But here we have an issue: we have some nulls, and that's because there is no
last name. So what we're going to do, we're going to go and handle the nulls for the last name and for the
first name as well. So we're going to use COALESCE: if the last name is null, then make it an empty string, and the same thing
for the first name. All right. So now let's go and execute it. So with that we are still getting the
first name if the last name is missing, or if the first name is missing we get the last name. So it
looks good. With that, we have the customer's details. The last thing we have to go and get is the employees. So the
employee here is called salesperson ID, which we can connect directly to the table employees. So if you go to
employees over here, which columns do we need? We have the first name, last name, department and so on. I would say let's
go and get the names and the department. So first let's go and join it. So LEFT JOIN Sales.Employees,
and we're going to join it using the employee ID, and we're going to match it with the salesperson ID that
comes from the orders table. So now instead of the salesperson ID we're going to have the same thing as for the customer. So I will
just go and copy and paste this. So instead of the alias C we're going to have E, and E over here as well, and we're going to
call it sales name, and we're also going to have the department. So
department, and that's it. Let's go and execute it. So now we have a lot of information in our view. So we have the
first columns from orders, then the ones from products, and here we have the ones from customers, and those two from
employees, and the last two again from orders. So with that we have now combined all the relevant information from
multiple tables in our database in only one view. This result is relatively big, but still we have all the information
in one place, and it is more friendly for the users to consume our data this way,
instead of going and joining all those four tables together. So now in the next step we're going to put the result
that our end users can start consuming it. So how we going to do it? This is our combined query and now we're going
to write the DDL for it. So create view and now we're going to give it the name order details and then as and we're
going to put the whole thing in two parenthesis. So at the start and at the end and of course don't forget the
schema. So our schema is sales sales dot then we have the view name just in order to have it in the correct schema and not
in dbo. So everything is ready. Let's go ahead and execute it. So now let's go and check our database. So if you go and
refresh, you will find our second view order details. So now let's go and test it. We're going to say select star
from sales v order details. Let's go and execute it. And with that we are getting now a combined view that are showing all
important informations from the database. So this is what the users can see. And with that the users don't care
about how many tables do we have in the tables and how to join all those tables. We have only one view and we can start
working on it. This is a very common use case for the views. Okay. Moving on to the next
scenario to the next use case. We use SQL views in order to implement security and to protect our data in the database.
In many scenarios we have sensitive information in our data and we cannot go and share it with everyone. So one of
the best practices is to create views in order to protect your data before sharing it with the users. Let's
understand what this means. First let's look at the scenario without views, with only tables. Let's say that
you have the table orders with four columns and three rows, and you have, for example, a manager who has access
directly to the database and starts writing some queries in order to retrieve data. But in your project you
have multiple people who have access to your database. For example there is a data analyst, and she is writing a
script in order to retrieve data from the orders as well, and maybe you have a student who has access to your
database and is querying the data like any other role, like the manager and the data analyst. So as you can see, you
now have different roles in your project, and all of them have the same rights because they access your table directly.
A manager, a data analyst or a student all see the whole table, all rows and all columns. And of course in
real projects this is a big problem. Sometimes the data is sensitive and you cannot give access to everyone. And
if you are using only tables this is going to be a nightmare, because you can go and create multiple tables, but
it's going to be really hard to keep all those tables in sync. Instead of that we have views. What you can do is
remove all access to the physical table and instead create multiple views, one for each
role. For example, you can go and create a view called orders managers, and maybe you give it all the data and all the
columns, because the managers are allowed to see, let's say, sensitive data. It's still nice to create a view, because
maybe you change your mind later and go and remove something. Now let's say that for the data analysts you want to
offer all the data, but there is one column, D, that is very sensitive. So what you can do is create another
view called orders analyst. In this view only three columns are available, A, B and C, and then you give access to all
data analysts, and with that you have protected this sensitive information. We call this column-level security. And
now we come to our poor students. Here we create another view where we are not only protecting column D but
also protecting a few rows, for example row number three, because we want to offer only limited information to the
students. So we are protecting the columns and the rows as well, and for that we can create another dedicated
view called, for example, orders students, and offer it to the students. With that we are doing column-level
security and row-level security as well. So we are offering multiple views very easily, without having to worry about
how to load the data from one table to another. Creating those views is really easy, and it gives us a perfect tool
to manage the security of our data. So this is one very common use case of using views in data projects. All right.
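As a minimal sketch of the three views just described, assuming SQL Server syntax: the table dbo.OrdersDemo, its columns A to D, the sample rows, the view names and the role name are all made up for illustration. GO is the batch separator used by tools like SSMS and sqlcmd, not a SQL statement; it is needed here because CREATE VIEW has to be the first statement in its batch.

```sql
-- Hypothetical demo table: four columns, three rows; column D is the sensitive one.
CREATE TABLE dbo.OrdersDemo (
    A INT PRIMARY KEY,
    B DATE,
    C VARCHAR(50),
    D DECIMAL(10, 2)
);
INSERT INTO dbo.OrdersDemo (A, B, C, D) VALUES
    (1, '2025-01-01', 'Germany', 100.00),
    (2, '2025-01-02', 'France',  250.00),
    (3, '2025-01-03', 'USA',      75.00);
GO

-- Managers: all rows and all columns, but still through a view.
CREATE VIEW dbo.V_Orders_Managers AS
SELECT A, B, C, D FROM dbo.OrdersDemo;
GO

-- Analysts: column-level security, the sensitive column D is left out.
CREATE VIEW dbo.V_Orders_Analysts AS
SELECT A, B, C FROM dbo.OrdersDemo;
GO

-- Students: column-level and row-level security (no column D, no row 3).
CREATE VIEW dbo.V_Orders_Students AS
SELECT A, B, C FROM dbo.OrdersDemo WHERE A <> 3;
GO

-- Access is granted on the view, not on the physical table (role name is an assumption).
CREATE ROLE DataAnalysts;
GRANT SELECT ON dbo.V_Orders_Analysts TO DataAnalysts;
```

In SQL Server, granting SELECT on a view is usually enough when the view and the table have the same owner, so direct access to the physical table can be left out or revoked.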
So now let's have the following task, and it says: provide a view for the EU sales team that combines details from all
tables and excludes data related to the USA. The first part of the task is similar to what we have already done, but we
cannot offer all the data to the users. This time we are providing a view that is specifically created for one team,
the EU sales team. The first part we have already done, where we combined all details in one view. But the problem
with the view that we have created is that it shows all data. Now the requirement has changed: we cannot show all
data. We have to go and exclude the USA data from our details. So let's see how we can do that. It's very simple. We're
going to go and grab the same query; we will not repeat that. So here as well we are joining the tables and preparing
everything. But instead of showing all data, what we're going to do is filter the data based on
the customer country. So it's very simple: at the end we will have a WHERE clause where c.Country is not equal
to 'USA'. So now we have a filter. Let's go and execute it. And with that, as you
can see in the output, we are getting the orders that are not from the USA. With that we are protecting the data of
the USA, and the EU sales team can access only their data. So it looks nice and protected. And with that we are
now doing row-level security. That means we are hiding all the orders, all the rows, that are not allowed to be
seen and consumed by this group of users. So now what is the next step? It is very simple. We're going to go and
put everything in one view. We have the query ready and we can go and create the new view. So we're going
to say CREATE VIEW. Then we need the schema, and the name is going to be almost the same: order details, but EU. And
then we have to have AS plus the parentheses, like this. So everything is ready. Let's go and execute it. And now
we can go and refresh in order to see our new view. If you still don't see it, you can go to the Views folder over
here and refresh the folder as well. With that I can see we have our new view. Now, of course, as the next step we go
and test it. So let's create a new query: SELECT * FROM Sales.V_Order_Details_EU. So
let's test it. And with that, as you can see, we are getting the combined view only for the data that is relevant for
the EU sales team. I'm not seeing any USA records here. So with that we are providing a view that protects a few
rows, like the orders from the USA. As you can see, views are really great for providing security for our data, whether
we are protecting the columns or the rows. For example, in our view we can say not only do I want to remove the USA
orders, but let's say the department information is sensitive and I would like to hide it from the view. So you
can just simply remove it from the SELECT, and with that you are doing column-level security. So now I have two
options that I can provide to the users. The first option doesn't have any row-level security. It is the first
view, the order details. We don't have any filters there, so it's going to show all the orders. Here we give access
only to people that are allowed to see all data. And we have another option, the order details for the EU. It doesn't
show all data. It shows only a subset that is relevant for the EU team. So now it's really easy to control the security
of my data using views. And this is a very important use case for the
views. Okay, so moving on to the next use case for views: we can use them in order to have more agility and
flexibility in our projects. Let's understand what this means. Say you have a table and you have multiple users
accessing this table. Now what can happen? You might change your mind about the design and the data model of your
database. You might say, you know what, instead of having one table I'm going to split it into two tables. Or maybe
another decision: you know what, I'm going to rename a table. Or on another day you decide, you know what,
let's rename a few columns, or maybe add a column or remove a column. So you are making changes to your physical data
model and you are changing things in the tables. You know what's going to happen: all those users that are accessing
the tables are going to scream, because all of them have complex SQL queries and your small changes to the tables are
breaking everything in their queries. And what does this mean? It means escalations, and you no longer have the
freedom to change anything in your database without first talking to 100 people. So we don't do
that. Instead we use views. What's going to happen? You create a view and you tell the users, okay, take
this view and consume it and leave me alone. And now you have your freedom again to make any changes you want. So
you go to your tables and do splitting, renaming and changing everything you want, as long as you update the
query between the table and the view to make sure that the users don't notice any change. For example, if
you go and split the table into two tables, then you have to put a join or a union in the view in order to
reconstruct the same structure that the users are used to. And if you would like to rename something in your database,
say instead of ID you are now calling it a key, all you have to do is go to the query of the view and rename it back
from key to ID. So no one is going to notice that you are making changes to the physical tables. Using views and
offering them to users is a game changer for you, because giving the users views gives you more freedom and
flexibility to change anything in your data model and the tables without getting any headaches. So this is an amazing
use case for views. Okay, moving on. We have a lot of use cases for views. They are just
amazing. So the next one is that we can use views in order to introduce a second version of my data model in another
language, so we can offer multiple languages to the users. Let's understand what this means. We have the
following scenario. We have again our table orders, where the data is persisted and everything is in English. And of
course what happens sometimes is that you have an international team accessing your data. So you have a team in the
USA, and maybe you have a team from Germany that are end users as well and want to access the data. Of course it
depends on the number of users that are using your database. But if you have a lot of users that come from Germany and
from India as well, it might make sense that you go and translate your data and the table structure into another
language. So for example, instead of giving access to the table orders, we can create another view called
Bestellungen. That's "orders" in German. And you are not only giving a new name to the object; you could also go and
rename all the columns inside the view. Then the German users are going to access the German view and it's going to be
easier for them to understand the content of your database. The same thing for the Indian team: for the Indian users
you can go and provide a view in
Hindi. I'm not sure whether I'm pronouncing the word correctly, but this is the first word that I have said in Hindi.
I don't promise that I'm going to learn the Hindi language, because it's enough to learn German. So I'm trying as well
to write this word, Aadesh. I hope it is correct. And to be honest, it is really interesting how you write this word in
Hindi. So now back to the topic. As you can see, we are using the views in order to provide a translation of
our database by just giving new names to the views and to the columns as well. So this is another nice use
case that I usually use in my projects in order to provide multiple languages for the data model that I have,
and I can do that with the power of views. Now we come to my favorite use case for views, one that I personally
recommend in each project: we can use views as virtual data marts in a data warehouse. So why is this my
favorite? Because I'm a specialist in data warehouses and data lakes, and this topic is a very important decision in
each project like this. So let's understand what this means. A classical data warehouse architecture based on the
approach of Inmon is going to look like this. We have multiple source systems where our data is spread, and now we
would like to go and extract all our data from these multiple sources and put it in one big database called the data
warehouse. And there will be a lot of operations on this central database: the data is first going to be cleaned,
then maybe integrated, and maybe we are building some historical data there. So we're going to be doing
multiple steps in order to prepare the data for complex reporting and analyses. And what we usually do in the data
warehouse is store all that information as physical tables. Now once we have built the data
warehouse, what's going to happen? We're going to have multiple use cases that would like to access the data warehouse
in order to do different kinds of reporting. Now, it's going to be very complex if we connect a
reporting engine like Power BI directly to the data warehouse. So instead of this, we try to split the data warehouse
into multiple subsets. We can split it by topic, domain or department, and we call those subsets data marts.
So a data mart is always specific to a use case that focuses on one topic. For example, we could have a dedicated
mart for sales and another data mart that is dedicated only to finance topics, but both of them come from our
data warehouse. Then the last layer is going to be, for example, the reporting and dashboarding. Maybe you
have something like Power BI where you are creating a dashboard on one data mart, like sales, and maybe with a few
things from other marts as well. But now the big question here in the data mart is: how should I store the data?
Should I store the data using tables or should I use views? And the best practice says: if you are building data
marts, then use views. We call these virtual data marts. And there are many reasons why using views in a data mart is
way better than using tables. For example, they are more dynamic and quicker to change, because usually in the data
mart you are building a lot of business logic and you want to have some flexibility and speed. And the
maintenance effort is much lower: there is no need to build any ETLs or data loads from the data warehouse to the data
marts, and this makes the data warehouse a real single point of truth for your data. Once you start copying data
from one layer to another, it's going to be really hard to maintain and chaotic, and you have to have really
strict monitoring and data quality checks. With views you are always going to reflect the current state of the data
warehouse, and this of course helps you with data consistency, which is a critical point in each data
warehouse project. So there are many reasons why we build virtual data marts and go with views in this layer.
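To make the idea of a virtual data mart a bit more concrete, here is a small sketch in SQL Server syntax. The warehouse table dbo.DwhOrdersDemo, its columns, the sample rows and the view name are assumptions for illustration only; a real warehouse would have its own model. The point is only that the mart is a view holding business logic on top of the warehouse table, with no data load in between.

```sql
-- Hypothetical warehouse table (physical, already cleaned and integrated).
CREATE TABLE dbo.DwhOrdersDemo (
    OrderID   INT PRIMARY KEY,
    OrderDate DATE           NOT NULL,
    Country   VARCHAR(50)    NOT NULL,
    Sales     DECIMAL(10, 2) NOT NULL
);
INSERT INTO dbo.DwhOrdersDemo (OrderID, OrderDate, Country, Sales) VALUES
    (1, '2025-01-05', 'Germany', 100.00),
    (2, '2025-01-20', 'Germany',  50.00),
    (3, '2025-02-03', 'France',   75.00);
GO

-- Virtual sales mart: monthly sales per country, defined as a view.
-- No rows are copied; the aggregation runs against the warehouse on each query.
CREATE VIEW dbo.V_Mart_Sales_Monthly AS
SELECT
    Country,
    DATEFROMPARTS(YEAR(OrderDate), MONTH(OrderDate), 1) AS SalesMonth,
    COUNT(*)   AS TotalOrders,
    SUM(Sales) AS TotalSales
FROM dbo.DwhOrdersDemo
GROUP BY Country, DATEFROMPARTS(YEAR(OrderDate), MONTH(OrderDate), 1);
GO

-- A reporting tool would query the mart, not the warehouse table.
-- With the sample rows: Germany / 2025-01-01 / 2 / 150.00 and France / 2025-02-01 / 1 / 75.00.
SELECT Country, SalesMonth, TotalOrders, TotalSales
FROM dbo.V_Mart_Sales_Monthly
ORDER BY SalesMonth, Country;
```

If the business logic changes, only the view definition has to be altered; there is no load job to rerun.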
So as you can see, views play a very important role in building a data warehouse. This is
another amazing and very important use case of using views in your data projects. All right friends, so now
let's have a quick recap about views. We have learned that a view is a virtual table that is based on the result of a
query without actually storing any data in the database. We use views in order to persist complex SQL logic
in the database. And we have learned that in some scenarios views are better than CTEs, because they improve
reusability across multiple queries, which reduces the complexity of the whole project, whereas
a CTE only improves reusability within one query. We have also learned that in some scenarios views are
better than tables. They are very flexible and easier to maintain since they don't store any data,
and it's really fast and easy to change things in a view compared to a table. But we have also learned that
tables are faster than views. Now there are endless use cases for views, but from my experience in
projects I have chosen the best use cases for you. The first use case: if we find common,
repeated logic in SQL queries, we can go and store this logic in a view in the database so that the users don't have
to keep repeating the logic over and over. So we use views in order to have central business logic. Another use case
is to hide the complexity of your physical data model and to offer the users a highly abstracted layer. You
provide the users with something very friendly and you hide all the complex technical data model that you have in
the database, because not everyone is an expert in your data model. One more use case: we can use views in order to
implement security and to protect our sensitive data in the database. We can offer multiple views in order to
protect columns or rows in a table. Another use case: we have learned that we can use views in order to have more
agility and flexibility for your database, where we offer the users a stable view and then you have the freedom
to change things in your physical data model without affecting all users. Another nice use case for views: we
can offer multiple languages for our data model. And the last use case: we have learned how views play an important
role in a data warehouse system. So views are amazing. All right my friends, so with that we have learned everything
about this new object, the view in databases. Views are amazing for flexibility and agility in your
projects. Now in the next one we're going to learn how to create tables based on a query, and we will learn about
temporary tables. So let's go. Okay, so now first let's have a look again at the database structure. We have
learned that in each SQL Server there are multiple databases, and in each database there are multiple schemas. And
inside each schema we can define multiple objects, like tables and views. And now we will be
focusing on the object table. We have also learned that we can use DDL, the data definition language,
which is a set of SQL commands for defining this database structure. So we can use the SQL command CREATE in order
to define a new table, ALTER in order to update the structure, or DROP in order to drop the whole table. So a table
is an object in the database structure. We have also learned that there are three levels of the database architecture,
and we have understood that at the logical level, the middle one, the conceptual level, we deal as application
developers or data engineers with the tables. So we define tables and the relationships between them. If you are an
end user or a business analyst, it's going to be a little bit harder to work with tables. You have to be a developer or
a data engineer. But working with tables is way easier than working with the complexity of the database at the physical level.
So you don't have to be a database expert or administrator to work with tables. The difficulty here is somewhere
in the middle. The abstraction is not that low, but not that high either. So now let's answer the question: what are
tables? A database table is a structured collection of data. It's like a simple grid or spreadsheet that you might find
in Excel. It has different columns, and each column represents a field like the ID, name or country. The table also
has multiple rows, and each row represents a record or an entry of the data. So for example, if this table is
about the employees, then each record, each row, is one employee. The intersection between a row and a column
we call a cell, and a cell is a single piece of data. Now the whole table is going to be stored physically in the
database as database files. So in the database there are multiple files that are holding the information about the
table, and those files are stored physically on the disk storage of the database. That means your data inside the
tables is not stored like a spreadsheet, like an Excel file, but in special database files, and usually developers and
end users don't have access to those files. So tables, again, are like an abstraction and representation of the
actual data that is in the files. Each time you are querying the database table, the database has to go to
those files and fetch the data for you. All right, so this is what we mean by database
tables. Okay, so now we have different types of tables in SQL. We have tables that stay forever; we call
them permanent tables. They stay as long as you don't drop them. And you have another type of table called
temporary tables. Those tables are going to be deleted and dropped once the session ends. Now we're going to
focus on the first type, the permanent tables. There are two ways to create them. The first way is
the classical way, where you create the table from scratch and then you go and insert your data. We call it
CREATE/INSERT. The other way is called CREATE TABLE AS SELECT. It's going to create the table as well, but based on a SQL query.
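As a rough sketch of the two ways just named, in SQL Server syntax: the table names dbo.PersonsDemo and dbo.PersonsDemoCopy and the sample rows are made up for illustration. Note that SQL Server writes the query-based variant as SELECT ... INTO, while databases such as PostgreSQL, MySQL and Oracle use CREATE TABLE ... AS SELECT.

```sql
-- Way 1: CREATE/INSERT. Define the structure first, then load the rows.
CREATE TABLE dbo.PersonsDemo (
    ID   INT         NOT NULL,
    Name VARCHAR(50) NOT NULL
);
INSERT INTO dbo.PersonsDemo (ID, Name) VALUES (1, 'Frank'), (2, 'Maria');

-- Way 2: CTAS. The new table takes both its columns and its rows
-- from the result of a query (SELECT ... INTO is the SQL Server spelling).
SELECT ID, Name
INTO dbo.PersonsDemoCopy
FROM dbo.PersonsDemo
WHERE ID = 1;

-- The copy has the same two columns and holds one row: 1, Frank.
SELECT ID, Name FROM dbo.PersonsDemoCopy;
```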
So let's understand the differences between them. The CREATE/INSERT method is the
classical way to define and create tables in SQL: first we have to go and create the table and define
the structure, and after that we insert our data into the database table. The other method is CTAS, CREATE TABLE
AS SELECT. This one is going to create a new table as well, but this time based on the result of a SQL query. So let's
understand what this means. Okay, so now to the first method, CREATE/INSERT. Here we have two steps. The first step
is a DDL statement where we use the command CREATE. Once we execute the first step, what's going to happen is that
the database engine goes and creates an empty table for us. It is a brand new table that can hold our
data. With that we have defined the structure of our table, but it's still an empty table. So in the next step we
have to go and insert our data into this new table. Our data can come from multiple sources, like a CSV file, or
maybe from a completely different database when we are doing a migration, or maybe you are inserting your data
manually, or maybe it comes from an application. At the end, once
you execute the INSERT, what's going to happen is that your data gets inserted into this new table. So in this method
we have two steps: first we define the structure of the table, and in the second step we take care of inserting our
data into the table. And now this new table and your data are going to be persisted permanently. Now let's check the
other method, CTAS. Here it's only one step: you define a query, and once you execute this query, what happens is that
the database has to retrieve the data from another table. It might retrieve data from the new table that we just
created using CREATE/INSERT. So once the query is executed we will get a result. So now what is the database going to
do? It's going to create a brand new table, but this time the definition and the data of this new table don't come
from any definition that we specify. They come from the result of the query. So whatever structure we have in the
result is going to be reflected in our new table. Again, the definition and the data that we see in this new table
come one-to-one from the result of our query. In this method we don't have to define anything or insert any
data. We are just writing a query, and the output of this query is going to define the table. But as you can
see, this method always needs an existing database table to run the query against, whereas with the
CREATE/INSERT method we are creating something from scratch. So these are the two different ways to create
tables in SQL and the differences
between them. Okay, so now you might say, you know what, CTAS is very similar to
views. We have a query, and the output of this query is going to be an object in the database. So what are the
differences between them? Let's check this. Let's say that in our database we have a table that has three columns,
A, B and C. What we can do is create a view based on a query. So you write the DDL statement in order to
create the view in the database. The database is going to store the query in the database, and the view is
going to be empty. There will be no data, because views do not store any data, and the query of the view will not
be executed yet. On the other hand, say you go and create a table using CTAS. Here again we have a query
attached to the object, to the table. What happens here is that the database has to execute the query in order to
understand the structure and also the data that should be inserted into the table. So our SQL query is going to be
executed and the result of the query is going to be inserted into the table. That means this new table is already
storing the result of the query. So this is the first difference between the table and the view: when you create a
view, the query is not executed and we don't have anything of the result of the query, whereas with CTAS we already
have the result of the query stored inside the table and everything is prepared. Now let's see what's going to happen once the user
selects something from the view. Now the database is going to execute the query of the view for the first time in
order to fetch the data from the original table and then present it as a result to the user. On the
other hand, say the user goes and queries the table that was created with CTAS. What happens now? SQL will not
execute the CTAS query again, because the database has already done that and prepared everything. That means
we are not querying anything from the original table, and the data can be fetched directly from the new table. So
the user is going to get the result immediately from the table that was created with CTAS. Here comes the second
difference between tables and views: views are slower than CTAS tables, and that's because the database has an
extra task here. It must execute the query of the view in order to get the data. With CTAS, querying is going to be
faster than with the view, because we have already executed everything and prepared it for the user. That's why tables
from CTAS are way faster than views. Now there is another difference, another perspective on this, which from my
point of view is more important than the performance. Let's say that on the next day we are doing data updates
on the original table, for example updates on column C and on column B as well. Let's see what this
means for the users if they are using views. The user on the next day is executing the same query again, and again
the database has to execute the query of the view in order to fetch the data from the original table. That
means today we are getting different data from the view than yesterday, because we have new data and new updates, and
the user is going to see the new updates and the fresh data in the result as well. So the user is seeing exactly the
state of the data in the original tables. But now let's see what's going to happen if the user goes and queries the
table from CTAS. In the CTAS table we still have the data from yesterday. All those new updates to
the original data will not be reflected in this new table, because when the user selects something from this table, the
database will not go and query or fetch the new changes from the original table; we have already prepared the
data yesterday. That means our user is now getting old data from the CTAS table, and the only way to get
fresh data from the CTAS table is to re-execute the CTAS query. And of course this is another step, and it makes the
table from CTAS harder to maintain. This is a big difference for the users between views and tables from
CTAS. Now think about views as ordering a pizza at a restaurant. Every time you are querying the view you
are placing an order, and the chef is going to make a pizza from scratch using the freshest ingredients. That
means you are always getting a fresh, hot pizza. And think about CTAS as a frozen pizza from a grocery store. The
pizza was prepared earlier and stored in the freezer, and if you want to eat it you have to go and heat it up in the
oven. But it's still not like a fresh pizza that is made on the spot from scratch. Now I made myself hungry,
because I love pizza. So I think I'm going to go for a quick break.
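The fresh pizza versus frozen pizza difference can be tried out with a small sketch like the following, written in SQL Server syntax. The table dbo.SourceDemo, its sample rows, the view name and the snapshot table name are assumptions for illustration; GO is the batch separator of tools like SSMS and sqlcmd.

```sql
-- Hypothetical original table with three columns A, B, C.
CREATE TABLE dbo.SourceDemo (A INT PRIMARY KEY, B VARCHAR(20), C INT);
INSERT INTO dbo.SourceDemo (A, B, C) VALUES (1, 'x', 10), (2, 'y', 20);
GO

-- View: only the query is stored; it runs every time the view is queried.
CREATE VIEW dbo.V_SourceDemo AS
SELECT A, B, C FROM dbo.SourceDemo;
GO

-- CTAS (SELECT ... INTO in SQL Server): the query runs once, right now,
-- and its result is copied into a new physical table.
SELECT A, B, C
INTO dbo.SourceDemo_Snapshot
FROM dbo.SourceDemo;
GO

-- "Next day": the original table is updated.
UPDATE dbo.SourceDemo SET C = 99 WHERE A = 1;
GO

SELECT C FROM dbo.V_SourceDemo        WHERE A = 1;  -- 99: the view shows the update
SELECT C FROM dbo.SourceDemo_Snapshot WHERE A = 1;  -- 10: the CTAS table still has yesterday's value
GO

-- Getting fresh data into the CTAS table means running the CTAS again.
DROP TABLE dbo.SourceDemo_Snapshot;
GO
SELECT A, B, C
INTO dbo.SourceDemo_Snapshot
FROM dbo.SourceDemo;
```

In practice the refresh step at the end would typically be scheduled, for example as a nightly job, which is the extra maintenance effort mentioned above.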
Okay, so now let's quickly check the syntax of those two methods. The first one is CREATE/INSERT. As the first step we
have to go and create a table using a DDL statement. So we use the command CREATE, and then we have to tell SQL whether
we are creating a table or a view. In this scenario we are creating a table, and then we specify the name of the table.
After that we have two parentheses, and inside them we make a list of all the columns that we need in this table.
So we have two columns, the ID and the name. And after that we define the data type of those columns and maybe
the length as well. There are a lot of options that we can add to this syntax, but for now we are just checking the
simplest form of creating a table. The next step is that we need an INSERT statement. So we are saying INSERT INTO
our new table the following VALUES: we are inserting the ID number 1, and the value for the name is going to be Frank.
So this is the classical way of creating a new table and inserting data into it. Now let's move to the second method, CTAS.
Now this time we have an SQL query like select from where and some extra logic. So this is our query and then we're
going to go and put our query inside a DDL statement. It's like we have done it in the views. It's exactly like we have
done it in the views but this time instead of saying view we're going to say table. So again we have the create
command and we are creating a table then the name of the table and then we say as and then we have two parenthesis and
inside them we have our query and this is where the name come from create table as select cas. So it is very simple in
one statement you have everything you are creating a new table and as well you are inserting the data that comes from
this query. Now this syntax is used in databases like MySQL, Postgress and Oracle. But in MySQL we have like a
shorter way on how to do it. Again we have our query select from where. But now in SQL server we can insert a
command between the select and from like this. So we are saying select the following columns into new table. So we
have this keyword into then the table name and then you continue after that with your query from where aggregations
and so on. So here it's like the DDL is inside your query itself but in the other databases you can have like the
query is separated from the DDL statements. Personally, I prefer that separated syntax over having this INTO, because if
you have a big complex query, the INTO can be really hard to see and easy to miss after the column selection. So, this is the
syntax for creating a new table from a query, the CTAS, in different
databases. Okay. So, now we're going to check the scenarios and use cases where it makes sense to use it. So, let's start
with the first one. Now we have learned before it makes sense to have a complex logic stored inside the database so that
our end users don't have to keep repeating the same logic over and over and it's as well maybe complicated for
some users. So that's why we have used views and the result of the view going to be used from our users. So everything
can stay easy and friendly to consume for our users. But now what might happen is that the logic of the view could be
very complicated and needs a lot of time to be executed by the database. So it takes a really long time until we get the
intermediate result from the database. That means if it takes 30 minutes, then each user has to wait 30
minutes until the query is executed, and none of your users is going to be happy with this situation. In this scenario,
you first have to try to optimize the query. But if you cannot do anything about that, you have to
switch from the view to a CTAS table. So what you have to do is take the same logic and put it
in a CTAS, so that the intermediate results are stored in a table. Of course, at the moment of creating the table, it will
still take 30 minutes. It takes a long time because it is the same query, and the database needs that time to
create the intermediate results. But the big advantage is that once everything is prepared, maybe during the
night, then in the morning when your users are online and start querying the data, they have everything prepared. So the
users are going to go and start selecting and analyzing the intermediate result,
but this time using the table that you have created from the CTAS, and the response time is going to be normal and fast again for all users. So if you have a
scenario where your views are very slow, you have to go and prepare the data during the night using CTAS and prepare the
tables to be analyzed by the end users. This is the most common use case for CTAS, and this scenario
happens a lot in projects, where you decide to go with CTAS instead of views in order to have persisted
data and gain performance. Okay, so finally back to SQL. Let's go and create a table using CTAS. We're going to go and
create a table that shows the total number of orders for each month. Let's go and do it. So first what do we need?
We need a query. So let's write it. SELECT: I'm going to go with the DATENAME function in order to get the name of the
month from our order date, and we're going to call it order month. And then we're going to go and aggregate the data
by counting the order ID as total orders from our table Sales.Orders. Don't forget to group by our month. So
something like this. Let's go and execute it. So the result is very simple. We have the order month and the
total orders. So we have two columns and three rows. So we have our query and of course we didn't create anything yet.
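The query built in this step might look like the sketch below. The exact identifiers are assumptions: the walkthrough only says "order date", "order ID", "order month", "total orders" and the table "sales orders", so the spellings OrderDate, OrderID, OrderMonth, TotalOrders and Sales.Orders are illustrative.

```sql
-- Plain query: returns one row per month name; nothing is created yet.
-- Column and table spellings are assumed for this sketch.
SELECT
    DATENAME(month, OrderDate) AS OrderMonth,   -- month name, e.g. 'January'
    COUNT(OrderID)             AS TotalOrders   -- number of orders in that month
FROM Sales.Orders
GROUP BY DATENAME(month, OrderDate);
```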
Now in SQL Server, in order to create a table from the query, what we're going to do is write INTO exactly before the FROM,
and then we have to specify the schema and the table name. I'm going to stay with the schema Sales and I'm
going to call it MonthlyOrders, like this. So that means we have our query, and the DDL is exactly between
the SELECT and the FROM. Now if I go and execute this, what is going to happen? We will not see the result of the
query here. We're going to get something like "3 rows affected", because this statement now creates a table; it is no longer just a query,
and the database is telling us: I have now created a table with three rows. So now if you check our tables, we don't see
it yet. Let's go and refresh and check again the tables. Now we can see our table here sales monthly orders. Now of
course we have to go and check whether everything is fine. So let's go and select the rows from our new table sales
monthly orders. So let's go select it first and execute. And now we can see again the
result of our query. But we are not writing here the query. We are just selecting it from the table. So our data
is stored in our table. And we can go and check the structure of this table. So if you go to the columns you can see
we have here the order month and the total orders, and that information comes from our query. So SQL is saying
here the order month is an NVARCHAR, a character type, which is correct because here we have the names
of the months. So SQL is able to define the data types of the table from our
query. And the second column, the total orders, is an integer, and that's because we have numbers here. So as you
can see, SQL is defining the structure of the table based on the result of our
query over here. And of course the data inside the table comes from the query as well. And the content of this table
is going to stay like this as long as you don't change anything. So if you go and close this and open it after one year,
it's going to show exactly the same results. It's going to live in the database as long as you don't drop this table. But
if things change in the table orders, this table will not be updated automatically, unlike what we have learned about
views. So now if you want to say, you know what, I would like to go and drop this table, well, it is very simple: just
go and say drop table and the table name over here. So make sure you select it and execute it. And now if you go over
here and refresh. So let's check the tables. You can see here the table is dropped. And now if you say you know
what, let's go and refresh the table that comes from the CTAS every day so that we always get fresh data inside this
table. So now let's go and execute our CTAS again. And with that, if we go and refresh, we're going to find our
table again. Now if you go and execute it one more time in order to refresh the data of the table, what are you
going to get? You're going to get an error. The database is going to tell you: this table already exists, so we cannot
create it again. So now the question is how we can update the content of this table. Well, we have to go and drop it
first and then recreate it. And if you want to put everything in one script, we have to go and use T-SQL. It is
Transact-SQL, an extension where you can do some programming inside SQL. So in order to do that, what we're going
to do is go to the start over here and make an IF logic. So we're going to go and search
for the object. We're going to say IF OBJECT_ID, and now we have to go and specify the name of this object
together with the schema. Make sure to select everything, Sales.MonthlyOrders, and put it inside here. And then we have
to define the type of this object. Here we're going to go with 'U'; it means
user-defined table. So we are saying: if the object ID of Sales.MonthlyOrders is not null, that means it exists. So what
do we want to do? We have to go and drop it. I'm going to take the statement from here and put it
after the IF over here. So we are saying: if this table exists, then drop the table; otherwise don't do anything, because
there is no table to drop and the query is going to work. And at the end of the T-SQL part
we have GO, in order to say the T-SQL batch is done, and then our usual query after all
that. So let's go and execute the whole thing, and as you can see it is working. So what happened? The database found
this table, dropped it, and then executed our query. So if you keep executing this, you are just refreshing the
content of this table. So this is how we work with CTAS in SQL. All right, moving on to another
common use case for CTAS that I usually use in my projects as well. We use CTAS in order to create a persistent
snapshot of the data at a specific time in order to analyze data quality issues. So let's understand what this means. Now in
some scenarios you have like a table and you are analyzing an issue. So there is like a data quality issue at your data
and you are analyzing this scenario in order to understand why it happens. But the problem is that at the same time
there will be updates on the table and your data is changing. So there will be updates maybe on some fields or you are
getting new records and everything is getting mixed up and you will not be able to analyze the scenario where the
data quality issue happened. So now it's almost impossible to find the root cause of your issue. Instead, what
we do if we have an issue in the data is go and create a fixed, persisted snapshot of the data in a separate table
using CTAS, so that we make sure nothing is changing and everything is fixed. And with that I can keep doing my analysis
on the same data without worrying that the data is changing. So this is another reason why we use CTAS in projects:
to make sure that we have a snapshot of the data, so that our analyses are done on the same scenario that caused
the bug, and it is going to be used as a foundation for finding the problem and fixing
it. All right, moving on to another use case of CTAS. We can use it in order to create our data marts, to make them physical
data marts instead of virtual data marts using views. So let's understand what this means. As we learned before, if you
have a data warehouse system, our data warehouse layer is going to store the data inside tables. But for the second layer,
the data marts, we can go and use views in order to have the dynamics and flexibility to generate multiple data marts. And
we called it the virtual layer. But in some scenarios, if things get complicated, your data marts and reports
are going to be slow, because for each action you are generating a query. So the Power BI reports and dashboards are
creating queries on your data marts, and your data marts always have to go to the data warehouse in order to retrieve the
data for the reports, and the whole thing could take minutes or maybe sometimes hours. So in these scenarios we cannot
stay with views, because they are slowing everything down. Instead, we have to convert our data marts to
a physical layer. That means instead of using views we have to go and use tables. And one very common way
to generate the tables of the data marts on a daily basis is to use CTAS queries between the data warehouse layer and the data
mart layer. It's still going to take maybe 30 minutes. That's why you can go and prepare the data during the night. But
at the reporting layer, where the performance really matters, the
performance is going to be better, because the response time from the tables is way faster than from views, and the reports don't
always have to waste time waiting for the data marts to get data from the warehouse. So this is another use case
where you use CTAS: where the views at the data marts are slow and we have to
go and replace them with tables using CTAS to speed things up. But still, my
recommendation here is to start first with the views. So create a virtual data mart using views, because the
implementation is going to be very dynamic and fast and you are always getting fresh data from the warehouse. But maybe
later, if you notice that some data marts and models are complex, then maybe go and replace a few marts from views to tables
using CTAS. So this is another use case for CTAS, and it is a nice workaround for your data warehouse system. All right
friends, so with that we have covered now the first type of the tables that we have in databases. The permanent tables
where you create a table and it's going to live forever until you go and drop it. Now we're going to talk about
another type of tables in databases. We have the temporary tables. So let's understand what are temporary
tables. So temporary tables, or as a shortcut temp tables, store intermediate results in a
temporary storage in the database during a session, and the database automatically drops these tables after the session
ends. So let's understand what this means. We have learned that with CTAS we could use a query in order to retrieve
data from one table and then put the intermediate results in a brand new table in the database. So with that we are
creating another table based on a query. The same thing goes for temporary tables. We have a query as well that goes and
retrieves the data from a table, and the database is going to go and create a brand new table that has
the structure and the data from the result of the query. So it is exactly like CTAS. What is the difference here?
Well, it is about the lifetime of the table. The database tables that you have created using create/insert or CTAS
are going to stay permanent, and they're going to live in the database as long as you don't drop them. So even if
the system is completely offline, the data is still going to be in the database once it is online again. But the temporary
tables are going to get deleted and dropped from the database automatically once the session ends. So what does session mean?
Once you open the client and connect to the database and start running queries, we call the time between
connecting to the database and disconnecting from the database
a session. So that means once you
close the client and disconnect from the database, and maybe shut down your PC and do something else, what is going to
happen? The database is going to go and destroy and delete all the temporary tables that you have created during the
session. So that means the table is going to live as long as you have a session, and during this time you can access the
table as you access any other permanent table. So this is what we mean by temporary tables, or
as a shortcut, temp tables. Okay. So now let's check the
easiest syntax ever. For the temporary table, the syntax is going to look like this. You're going to have a
query, SELECT, FROM, WHERE, and as we learned with CTAS, if you say INTO and then the table name, it's going to
go and create a new physical table. But now if you want it as a temporary table, what you're going to do is
just put a hash (#) before the name of the table. Then SQL Server understands: okay, now we are talking about a temporary table, and
the database is going to store it in the temporary storage. So it is very simple. This is the syntax of temporary
tables. So far we have learned that we have a database called SalesDB, and
have created the customers, employees, orders and so on. Those are our tables and they are always there like if you go
and close everything and then start it or in the next day you're going to find always those tables with the same data.
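The two INTO forms described above can be put side by side as a small sketch. The names Sales.OrdersCopy and #OrdersCopy, and the columns OrderID and OrderStatus, are hypothetical and chosen only for illustration; the only difference between the two statements is the # prefix.

```sql
-- Permanent table: created in the user database and kept until it is dropped.
-- Table and column names are assumptions for this sketch.
SELECT OrderID, OrderStatus
INTO Sales.OrdersCopy
FROM Sales.Orders;

-- Temporary table: same statement, but the # prefix places the table in tempdb,
-- and it is dropped automatically when the session ends.
SELECT OrderID, OrderStatus
INTO #OrdersCopy
FROM Sales.Orders;
```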
So they're going to exist as long as we are not dropping them. Now the question is where do we find the temporary
tables? Well, as we learned, if you go over here to the system databases, you will find multiple databases from
SQL Server, and normally only the database administrator has access to them. One of those databases is called
tempdb, the temporary database. So let's go inside it. Now we can find multiple objects, and one of them is
the temporary tables. And of course we don't have anything inside it because we didn't create anything. So
let's go and create one. We already have an open and active session with SQL Server. As you can see here, we
are connected to the database and we can start creating temporary tables. So now
few modifications on the table orders. But I will not do it directly at the table orders. I would like to take a
copy from the sales DB and create from it a temporary table. So let's go and do that. What do we need first? We need a
query. So I would like to select everything all the columns all the rows from the table orders. So from sales
orders. So this is my query. Now so far nothing is created. We have only select statements. But now in order to create a
temporary table what we're going to do we're going to put a statement between the select and from. So exactly before
the from go over here and say into then in order to make sure it is a temporary table we use hash and then the table
name. So we're going to call it orders. So that's it. We have our query and in between we have the into and make sure
you are using hash in order to be a temporary table. So let's go and execute it. And now we can see that 10 rows are
affected and we don't have any error. And now of course we cannot see it yet because we have to go and refresh the
object explorer. So let's go and do that. And now let's expand it. And now we can see our temporary tables. As you
can see it is at the schema dbo because we haven't defined any schema. And this is the default one from the database. So
nice. Now we have the table, so let's go and check a few things. Let's go and select from the table itself. So SELECT star
FROM, and make sure to say #Orders. Let's go and execute it. And now we are getting the data from the temporary
table and not from the original table, the orders in the database SalesDB. So all that information comes from the
temporary table. Now, of course, you can do whatever you want to this temporary table, because it's not that important
and it's going to get deleted anyway. So let's say that I would like to delete all the orders where the order
status is equal to delivered. So let's go and do that. What we're going to do is DELETE FROM our #Orders, so make
sure we are selecting the temporary table, and then in the WHERE we're going to say the
order status is equal to, what did I say, delivered. Yeah, delivered. So delivered like this. Let's go and execute it.
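The steps of this demo so far could be written roughly as below. The column name OrderStatus and the literal 'Delivered' are assumed spellings; the walkthrough only says "order status equal to delivered".

```sql
-- Copy all columns and rows of the orders table into a temporary table.
-- The # prefix makes it session-scoped and stores it in tempdb.
SELECT *
INTO #Orders
FROM Sales.Orders;

-- Read from the temporary copy, not from the original table.
SELECT *
FROM #Orders;

-- Modify only the copy: remove the delivered orders.
-- OrderStatus and 'Delivered' are assumed names for this sketch.
DELETE FROM #Orders
WHERE OrderStatus = 'Delivered';
```

Because the DELETE targets #Orders, the original Sales.Orders table is left untouched.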
Okay, with that it says five rows are affected. Let's go and select it again. So
SELECT star FROM #Orders, and let's check that. As you can see, now we don't have all the orders. We
have only the orders where the status is equal to shipped. So all delivered orders are removed. And now we can do
whatever we want to this copy. We can analyze it. We can modify it. We can go and insert a new data. So we can do
whatever manipulation we want on this copy. And now if you say, you know what, I like this result and I would like to
have it not only during the session. Maybe I'm going to need it for tomorrow or something. So now what we're going to
do, we're going to do the exact opposite. We're now going to store the result of the temporary table back in
our database so that we don't lose this intermediate result. In order to do that, we're going to say INTO, and then
make sure to specify Sales dot, because we want to select the correct schema, and then let's say it is orders
and I'm going to call it test, like this. So let's go and execute it. It says five rows are affected. Now we have to
see that table in the SalesDB. We still don't have this table over here, so right-click on the database and then
refresh it. Let's go to the tables again, and now you can see we have our new table, orders test. So it is amazing,
right? What we have done is we took a copy of the original table orders into a temporary space. We have done some
modifications and played with the data and done some analyses, and then the end result of our temporary table we
have loaded back into another new table called orders test, maybe in order to keep working on it the next day. So it is
a really nice way to make changes in a place where you say, you know what, it is temporary and whatever mistakes you
make, it's okay; it is like a playground. So now we still have an active session with the database and our temporary
table going to be always here. Now let's see what going to happen if we end our session. So in order to do that let's go
and just close everything. So I will just close and we'll not store anything. So with that we have now ended the
session. Let's go and start it again and see whether we still have the temporary table. So we have now again to connect
to SQL Server, and now we have another session. That means the old session is already gone. Let's go to the
databases, to the system databases, to tempdb, and let's go to the temporary tables. As you can see, the database
has already cleaned up everything, and this space is empty again for any new temporary table that I'm going to
create. So as you can see, once you close the session, everything in the temporary table is
lost. Now let's go back to our SalesDB
over here, to the tables. We can see the table that we have created, orders test; it is still living here and still has
the data that we have created. So this is how things work with temporary tables in
SQL. Now let's see how the database server executes the temporary table SQL. So let's say that you are a data
analyst. You have created a query and then you say INTO a temporary table. The database engine is going to
identify the query, and first it's going to go and execute the query, and
maybe we're going to get the data from the table orders. After the query is executed, the database engine has the
results. Now two things happen. First, the database engine is going to go and store the metadata information in
the system catalog. And second, the database engine is going to create a table, but this time not in the
user storage but in the temporary storage on the disk. So the table is going to live there for a short time. And now what you
can do you can write multiple SQL queries that are doing maybe multiple analysis on top of this table. So each
time you select something the database engine has to go to the temporary storage and fetch the data from there.
And once you are finished and, let's say, you close your client, the session between you and the database is going to
end. Now the database is going to understand: okay, there is no more connection to this user, and it is going to
go and clean up the temporary storage of any tables that were created by this session. So that means the
database is automatically cleaning up the storage, maybe for other sessions. So this is how the database engine works
with temporary tables. So now the question is: why do we need temporary tables? Let's see the
following scenario. Now let's say that in our source database we have a table called orders and now we would like to
go and load the table into our data warehouse. We have to do several transformations in order to prepare the
data for the analyses in the data warehouse. So maybe you have one query to remove the duplicates and another one
to handle the nulls and maybe you are doing filtering and cleaning up and the last step you would like to aggregate
the data. And now of course those queries those transformations want to change the content of the table orders
and there is no scenario where you can do that directly on the source database and of course this is not allowed.
That's why in data warehousing we have to go and get our own copy of the data and then on top of this data we can do
our transformations. Now one way to do this using the temporary tables. So you have one script in order to extract the
data from the table orders and put it in temporary table as an intermediate results and then you come with the
transformations and all those queries and they start manipulating and changing the data of this extra copy in the
temporary table and the last step you have the load where you go and load the final version of the intermediate
results in the database. This is if you would like to do the whole ETL before inserting the data to the database. So
now the orders table and the final table in the data warehouse both of them are tables. So they are permanent tables and
they will stay there as long as we don't drop them. So they are very important tables. But now for the intermediate
results it is not that important. It is just an intermediate step that we have done in order to have our extra copy of
the data to manipulate it and so on in order to prepare it to be inserted in the data warehouse. So after we loaded
it in the data warehouse, this copy of the data is not anymore important. It shouldn't stay like for a long time.
That's why in this scenario maybe we can go and use temporary tables instead of normal tables for the
intermediate results. And that's because of one advantage: the database is going to do an automatic
cleanup after the whole session ends. It comes out of the box, automatically from the database. So that means I don't
have to deal with the dropping mechanism of this table for the next load. However, if there is something wrong in the
data warehouse, you would always like to check the copy where the transformations are done in order to debug and find
issues. So I don't normally use temporary tables in these scenarios; I just use normal tables. But for other
small projects, maybe this makes sense. So this is one use case for when to use temporary tables in your projects.
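The extract, transform, load flow described above might be sketched like this. It is an illustration under assumptions, not the course's script: the names #OrdersStaging and Sales.OrdersByStatus, the columns OrderID and OrderStatus, and the specific cleanup rules are all hypothetical. The drop-if-exists checks at the top use the same OBJECT_ID pattern shown earlier so that the script can be rerun.

```sql
-- Make the script rerunnable: drop leftovers from a previous run, if any.
IF OBJECT_ID('tempdb..#OrdersStaging') IS NOT NULL
    DROP TABLE #OrdersStaging;
IF OBJECT_ID('Sales.OrdersByStatus', 'U') IS NOT NULL
    DROP TABLE Sales.OrdersByStatus;
GO

-- Extract: take a private copy of the source table into a temporary table.
SELECT *
INTO #OrdersStaging
FROM Sales.Orders;

-- Transform: clean the copy only; the source table stays untouched.
DELETE FROM #OrdersStaging
WHERE OrderID IS NULL;              -- hypothetical rule: remove rows without a key

DELETE FROM #OrdersStaging
WHERE OrderStatus IS NULL;          -- hypothetical rule: remove rows without a status

-- Load: persist the aggregated final result in a permanent table.
SELECT
    OrderStatus,
    COUNT(*) AS TotalOrders
INTO Sales.OrdersByStatus
FROM #OrdersStaging
GROUP BY OrderStatus;
```

The temporary staging table disappears on its own when the session ends, which is the cleanup advantage mentioned above; the cost is that the intermediate copy is no longer available for debugging afterwards.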
We use them in order to store intermediate results temporarily until we are done with the session, and once we are done,
the database can go and drop the temporary table. All right guys, now a quick word
about temporary tables. To be honest, I never use them in my projects. If I need intermediate results in one
query, I can go and use CTEs. And if my intermediate results are very important, then I put them in either a view
or a CTAS table. But it is a nice technique to learn; maybe you can utilize it in one of your
projects. All right guys, so now let's have a quick summary about tables. Tables in a database are like a spreadsheet
or grid that contains columns and rows and your actual data are stored in these tables. And we have learned there are
two types of tables: permanent tables and temporary tables. Permanent tables live in the database forever, as
long as you don't drop them. On the other hand, temporary tables have a short lifetime. They will be
dropped from the database once you end the session. We have learned as well that there are two methods for creating
tables in databases. The first method is create/insert. This method involves two steps: the first one is defining and
creating the table, and the second step is inserting the data into this new table. So you are creating something
from scratch. The second method we call CTAS. It creates a brand new table as well, but based on the result of a
query. So this type is done in only one step, but it always needs another existing table. And we have learned as
well the difference between tables and views, where the main advantage of using tables created from CTAS is to
ensure the performance is fast enough for the end users or your reporting system. So we use CTAS instead of views
if the logic of the view is very complex and takes a lot of time to be executed in the database. And one more nice use
case for CTAS is that we can go and persist a snapshot of the data in order to analyze a bug or data quality issue
and to ensure that we have the exact data in order to find a solution for the bug or the issue. We have learned
as well that we can use temporary tables in order to store intermediate results in a temporary storage, and the main
advantage of the temporary table is that the database automatically drops all temporary tables when the session ends,
and that's because for you the intermediate results are not important enough to live a long
time. Hey my friends. So we have learned that in real data projects, if you have a database, there will be a lot of
analytical use cases that want to access your data and do analytics. And what is going to happen? They're going to write
complex queries, because in many scenarios they are doing complex analyses. And if you don't do anything
about it in your projects, you're going to face a lot of challenges like complexity and a lot of redundancy of
the same complex logic across multiple users, and maybe performance and security issues. And we have learned that we have five
amazing techniques in order to solve those problems. We have learned subqueries and CTEs, and as well how to
create objects like views, CTAS and temporary tables. So now what we're going to do, we're going to go and
compare them side by side in order to have a big picture about the advantages and the disadvantages of each method. So
let's go and compare them. Okay. So now we have our five methods, and the first criterion that I would like to compare
them on is the storage type. We have learned that if you are using subqueries and CTEs, the
database is going to put the result of those two techniques in memory, in the cache, so that later the main query
has fast access to those intermediate results. On the other hand, if you
are using temporary tables or tables from CTAS, the newly created table is
stored in the disk storage. And for the views, as we understood, there
is no data storage, which means we are not using any storage from the
database. Now if you are talking about
the lifetime, that means how long the object is going to live or persist in the
database. Our three techniques, subqueries, CTEs and temporary tables, are all
going to live a short time in the database. So all of them are temporary.
But if you are talking about creating objects using CTAS and views, those two are going to be permanent. That
means they're going to live in the database as long as you don't drop them. Now we're going to compare them on
something similar: when the database is going to go and drop or delete those
objects. We have learned that subqueries and CTEs have a short
lifetime. They are going to live only during the execution of the query. So once the
query ends, the database is going to go to the cache and delete everything. But the temporary tables live a little
bit longer, as long as you are in the session. Once you end the session, the database is going to go and
drop and delete your table as well. Now for the objects that come from CTAS and views, as we learned, they are persistent
and permanent, and the database only deletes them if you ask it to by using the DDL command DROP.
So the database will not delete anything for these two on its own. So now the next one is the query scope, that is, how we can access
those objects. For the subquery and the CTE, the scope is very small. It is accessed only from one single query,
the query itself where you write the CTE or subquery. So you cannot access it from external queries. But we have
learned that temporary tables, CTAS tables and views can all be accessed from multiple queries. That
means you can access those objects from multiple external queries. Now the next one: if you are thinking about
reusability and you look at the subqueries, they are very limited. The subquery is going to be used only in one
query and only in one place. So if you need it in multiple places, you have to go and repeat the same logic. So
subqueries are the worst in terms of reusability. But if you are talking about the CTE, it is a little bit better.
You can still access it only from one single query, but you can access it from multiple places in the same query. So
you can access it multiple times from different joins, and you don't have to repeat the same logic over and over.
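A small sketch of this point: one CTE defined once and referenced twice in the same query through a self-join. The CTE name MonthlyOrders, the aliases, and the columns OrderDate and OrderID are assumptions for illustration, and the month arithmetic ignores the year for brevity.

```sql
-- The logic is written once in the CTE ...
WITH MonthlyOrders AS (
    SELECT
        MONTH(OrderDate) AS OrderMonth,    -- month number 1..12 (year ignored in this sketch)
        COUNT(OrderID)   AS TotalOrders
    FROM Sales.Orders
    GROUP BY MONTH(OrderDate)
)
-- ... and used twice in the main query: once as the current month, once as the previous month.
SELECT
    cur.OrderMonth,
    cur.TotalOrders,
    prev.TotalOrders AS PreviousMonthOrders
FROM MonthlyOrders AS cur
LEFT JOIN MonthlyOrders AS prev
    ON prev.OrderMonth = cur.OrderMonth - 1;
```

With a subquery, the aggregation would have to be written out twice, once for each side of the join.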
But still it is limited, because you have only one query that is using the logic. Now if you think about the temporary
tables, I would say the reusability here is medium, and that's because you can access the data from multiple queries but
only during the session. Once the session has ended you cannot access it anymore, which means you have to recreate
it in order to reuse it. So it is more reusable than the CTE and the subquery, but not as good as
CTAS and views. Those techniques offer the highest reusability. They are always there for multiple
users and multiple queries, so they can eliminate a lot of redundancy and you have to do the job only once. Now moving
into the next one. If you are thinking about the intermediate result of those techniques, the question is how fresh is
the data? Is the data from these objects always up to date? Now for the subqueries and the cities they are
always up to date because the SQL is executing the logic on the fly and storing the data in the memory and
immediately after that going to come the main query and get the data. So always the intermediate results in the memory
are up to date. But now if you think about that temporary tables and the CTIS the query is only executed once and if
there is like any update and changes on the original table you will not find those changes in those objects and
that's because SQL executed once and that's all. So if you query those tables there is no guarantee that the data are
up to date. So if you want fresh data you have always to drop the table and create it again from the query. Now if
you are talking about the views they are amazing they are always up to date because views does not store any data.
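A minimal sketch of that freshness difference in SQL Server, where the CTAS idea is written as SELECT ... INTO. The table, view and column names below are assumptions for illustration, and GO is the batch separator used by client tools such as SSMS.

```sql
-- Hypothetical schema: Sales.Customers(CustomerID, Country, Score).

-- Temporary table: a snapshot taken once, dropped when the session ends.
SELECT Country, COUNT(*) AS TotalCustomers
INTO #CustomersPerCountry
FROM Sales.Customers
GROUP BY Country;

-- CTAS-style permanent table: also a snapshot, but it stays until dropped.
SELECT Country, COUNT(*) AS TotalCustomers
INTO Sales.CustomersPerCountrySnapshot
FROM Sales.Customers
GROUP BY Country;
GO

-- View: only the query is stored, so the base table is read on every call.
CREATE VIEW Sales.V_CustomersPerCountry AS
SELECT Country, COUNT(*) AS TotalCustomers
FROM Sales.Customers
GROUP BY Country;
GO

-- If rows are added to Sales.Customers after this point,
-- only the view reflects them.
SELECT * FROM #CustomersPerCountry;             -- data as of the snapshot
SELECT * FROM Sales.CustomersPerCountrySnapshot; -- data as of the snapshot
SELECT * FROM Sales.V_CustomersPerCountry;       -- current data
```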
So each time you ask the view for data, the database goes to the original table and
fetches the data for the view. So your data is always fresh and up to date. So this is the big picture of the behavior of
those advanced techniques that you can use in SQL projects. And if you ask my opinion, my favorite is
views, in first place. Second on my list is the CTE. CTEs are amazing, but don't use more than five
CTEs in one query. Otherwise, it's going to be really annoying and hard to read. In third
place I'm going to say subqueries, and then CTAS. I use CTAS if the views are slow. In that scenario, I jump to
CTAS and create a permanent physical table from my query. And the last one, which I rarely use, is temporary
tables. So this is how I rank those techniques in my SQL projects. Now I would also like to show you a
big picture of how things work in my projects, so you can see all those different techniques and possibilities
that you can use. It's like a big picture and recap. So, story time. You have a database, and things start where
you have a database administrator, or let's say a data engineer, who is creating a new table from scratch.
So he is going to write a DDL statement in order to create one physical table in our database. Now our database table
is empty. That's why in the second step he is going to write an INSERT statement in order to fill our new table
with data. Once we have a table, we're going to give access to, maybe, a data scientist or data analyst so she can
start writing SQL queries. Now the first thing that could happen is that the logic is complex and she has to
do it in two steps. The first step is a query that prepares the data for the second step. So
that's why she is going to use a subquery, and the main query is going to retrieve the data from the
intermediate results in order to prepare the final results for the analyst. Now what could happen is that there is
some SQL logic that keeps repeating in the query. Instead of writing another subquery for that, she
is going to put this logic in a CTE, and then she goes to the main query and uses the result of the CTE in
multiple places in the same query. So all of this, the subqueries, the CTEs, the main query,
happens in one single query. And now what could happen is that she has written some amazing code. So instead of
using it only in her query, she is going to persist this logic in the database. She is going
to put it in the database as a view, so that all other users and analysts can benefit from this logic and don't
have to write it again. Instead, they query the view, and this makes life easier.
And of course our data analyst can use this view in her main query as well. And now one more thing: she has
another piece of logic that is really complex, and again everyone can benefit from it. But the issue is that this query is very slow.
So now she has to decide: do I put it in a view, or do I create a new table based on the query using CTAS? Now of course,
because of the performance, since the view takes around 30 minutes to execute, she decides to run the query using
CTAS, which generates a physical table, so that all the other analysts can access this new table and
reuse the results, and of course she can use it in her main query. And with that you have now seen how things work
in real projects. It is not a simple SELECT query from a table. It is like this: people are creating subqueries, CTEs, views,
temporary tables and CTAS tables for different purposes. All right, my friends. So that's all about CTAS and
temporary tables. And with that we have learned all the techniques for organizing our complex projects. Now next
we're going to start talking about something completely different. We're going to talk about stored
procedures and how to put our code inside the database. This is all about programmability and how to add things
like parameters, variables and error handling. So it's like programming. So let's uncover this world of
stored procedures, and let's go. Now think about stored procedures like this. Every time you go to a coffee
shop, you say, "I would like a large coffee with coconut milk, no sugar, and extra whipped cream." And you repeat
this over and over, each time you go to this coffee shop. If you are working with stored procedures, it's
going to be like this: whenever you go to the coffee shop, you just say, "Give me my usual," and the barista knows
exactly what you mean, and you get exactly your order without specifying and repeating everything word
by word. This is exactly what happens when you work with stored procedures. So let's have some coffee,
right? All right, so now we can continue. Let's start again from scratch. We always have these two
sides: the client side and the server side of the database. As we have learned, we have a database, and
you as a user can write different SQL statements. For example, you can write a
SELECT statement in order to retrieve data from the database, or another statement where you are inserting data
into the database, and another one where, let's say, you are updating the content of your tables, and so on. So you have
different statements for interacting with the database. Now let's say that what you are doing is not a
one-time job. You keep repeating those steps over and over. So you are always doing an insert, then an
update, and then a select, and you keep repeating that day after day. Now imagine that you do something
crazy and go on vacation, but the job still has to be done. So what do you do? You hand over all those SQL statements to
your colleagues, and they have to run them every day while you are gone. So you give them all those SQL
scripts and you tell them: okay, you have to execute the first query, then the second query, and then the third query.
This is of course not a good way to do things, because there will be some human errors, where
the scripts are executed in the wrong order, like first updating and then inserting, and
things can go wrong. That's exactly
why we have stored procedures in SQL. What we can do is put all those SQL statements together in one frame, in one
program, and we call it a stored procedure. Once you do that, your SQL statements no longer stay on the client
side. They are now stored on the server side of the database. So that means with stored procedures we are storing
our SQL statements inside the database, and you don't have to hand over your SQL statements to your colleagues.
All you have to do in order to interact with your SQL statements is execute the stored procedure.
So you write a very simple command, EXECUTE followed by the procedure name, for example. With that you are calling your
stored procedure that is stored inside the server. Once you execute this, the database goes to the stored
procedure and starts executing all the SQL statements that you have inside it, and it does so
exactly in the order that you have defined, from top to bottom. Once the database has gone through all your
SQL statements, it returns to the user the data that we get from the selects. With that, things
are really easy, and you can tell your colleagues: okay, just execute this stored procedure and the rest is done by
the database. So you minimize the human errors and you make sure that everything is executed as you wish.
And when you are back from your vacation, things are easier as well. You just have to execute the stored procedure.
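The insert, update, select routine described above could be bundled roughly like this. It is only a sketch: the procedure name, the staging table and the column names are all hypothetical.

```sql
-- Hypothetical objects: Sales.CustomersStaging feeds Sales.Customers.
CREATE PROCEDURE dbo.RunDailyCustomerJob
AS
BEGIN
    -- 1. Insert rows that are in staging but not yet in the target table.
    INSERT INTO Sales.Customers (CustomerID, Country, Score)
    SELECT s.CustomerID, s.Country, s.Score
    FROM Sales.CustomersStaging AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM Sales.Customers AS c WHERE c.CustomerID = s.CustomerID
    );

    -- 2. Update: replace missing scores with 0.
    UPDATE Sales.Customers
    SET Score = 0
    WHERE Score IS NULL;

    -- 3. Select: return the report to the caller.
    SELECT Country, COUNT(*) AS TotalCustomers
    FROM Sales.Customers
    GROUP BY Country;
END;
GO

-- A colleague only needs this one line; the order of the steps is fixed
-- inside the procedure.
EXECUTE dbo.RunDailyCustomerJob;
```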
So this is what we mean by a stored procedure. You can store multiple SQL statements inside it in a specific
order, save it inside the database, and each time you need your SQL statements you can simply execute
them. Now let's have a quick comparison between a normal query, a normal SQL statement, and a stored
procedure. In a normal SQL query you have SELECT, FROM, WHERE and so on. This is like a one-time transaction. You
are asking the database for one thing and the database is answering. So it is a one-time request. On the
other hand, in a stored procedure you have multiple SQL statements, and once you execute the stored procedure there
will be many interactions with the database in one go. So that means you have multiple transactions
happening in your stored procedure. An SQL query is like a simple request. You need one thing and you get
it. A stored procedure, on the other hand, is like a program. Just as when you write code in any programming
language, it is more than one request and it can contain a lot of things. For example, you can build looping logic where you
iterate through something, or you can build control flow where you have logic like IF ELSE statements.
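To give a feel for what looping and control flow look like in T-SQL, here is a tiny standalone sketch. The variable name and the messages are made up for illustration, and CONCAT assumes SQL Server 2012 or later.

```sql
-- A loop with an IF / ELSE branch inside it.
DECLARE @Step INT = 1;

WHILE @Step <= 3
BEGIN
    IF @Step % 2 = 0
        PRINT CONCAT('Step ', @Step, ' is even');
    ELSE
        PRINT CONCAT('Step ', @Step, ' is odd');

    SET @Step = @Step + 1;   -- move to the next iteration
END;
```

The same statements can be placed between BEGIN and END of a stored procedure.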
So there are different paths in your code. And as in programming, we have parameters and variables
to make our code dynamic and flexible, and we can build error handling into our code in order to
customize what happens if there is an issue. So a stored procedure is like having code in, for example,
Python. That means you can do more complicated things compared to a simple query where you have only one
request. In stored procedures you are programming and coding, and it is more advanced than just
having a query. So if you are working with stored procedures, things are going to get more complicated and
advanced, but of course you get a lot of flexibility and reusability compared to a simple
query. Now, there is an alternative to stored procedures. Well,
you can put all your SQL statements in Python code and things can work as well. So either you put
your SQL statements inside a stored procedure or in Python code. But now the big question is: what are the
differences between them? Well, there is a disadvantage if Python runs on a different server, because you have to
build a connection between that server and the database server, and a connection always means networking, so
you might get slightly worse performance. So this is one advantage of the stored procedure. Another
advantage of the stored procedure is that all the scripts you store inside it in the
database are going to be precompiled. Precompiled means the database server already knows about your SQL
statements: there was already a check of whether the syntax is correct, and the database can
prepare things for executing the stored procedure, like the execution plan and a lot of other stuff.
So if you store your SQL statements inside a stored procedure in the database, the code is very close to the database, the
database knows everything about your scripts, and it is ready to execute them. But if you put all your SQL statements
outside of the database, of course the database has no chance to know what is coming. It cannot
compile anything until Python sends the code to the database. So this is another advantage of the stored procedure. But
if you build your SQL statements in Python, you get a lot of advantages too. For example, you can
build very flexible Python code where you use Python features together with the SQL, and with that you
open the door to many possibilities and a lot of flexibility. Another thing: with Python you get great version control, since
everything is integrated with the Python tooling. And one more advantage is that if you have a complex requirement in your
project, it's going to be really hard to implement in stored procedures. It's going to cost you a lot of lines of
code and things are not going to be comfortable. But if you are implementing complex logic in Python, things are going
to be way easier. So with Python, you can implement complex logic much more easily than with stored procedures. So
those are the big differences between stored procedures and Python. Now I have to be honest with you about having
your code in stored procedures or in Python. If you are working together on a data project, I would never
recommend using stored procedures if you have the possibility to have your code in Python. That's because I have seen
a lot of projects using stored procedures, and most of them ended in chaos. They are really hard to debug and really hard
to test. It's catastrophic. So really, don't use stored procedures in your projects, especially if you have
a big project with a lot of data and tables and so on. You can manage everything perfectly using
Python. Especially if you have a platform like Databricks or Snowflake, then of course the best way to control your data
projects is using Python. But of course, if you don't have this possibility and you only have a database server that you
can work with, then you don't have any other option. You have to work with stored procedures. But if you
have the possibility to put your project inside Python and to run your scripts from there, then that is way
better than having stored procedures. Well, this is my opinion, and I'm talking about working on big
projects. If you have small projects, a few tables and so on, then it's fine to stay with stored
procedures. But never build a big project using stored procedures, because I tell you, it will never work. So always try
to think about having the right platform to run your projects. And now that I'm thinking about it, maybe I
should have put this tip at the end of the video, not in the middle. So, whatever. If you still want to learn
stored procedures, we're going to continue with that, and I have a really nice example of how to
build stored procedures step by step, like a mini project. So why not learn both of them? So let's
go. Okay. So now let's have a quick look at the syntax of the stored procedure. It is very simple. It always has two
parts. First we have to define the stored procedure. We do it like this: CREATE PROCEDURE, then we define
the procedure name, then we say AS, and then we have BEGIN and END. This is very important for SQL to understand
where the definition starts and where it ends. Between the BEGIN and END we put a set of SQL
statements. Here you can put whatever you want: inserts, updates, queries, anything. Once you have defined the
stored procedure, the next step is to execute it. The syntax is very simple. We
say EXECUTE and then the procedure name. That's it. With that, SQL goes to the stored procedure and starts executing
all the SQL statements that you have in the definition. So this is the syntax of the stored procedure. As I said, it is very
simple. All right guys. So now let's do it step by step. The first step is that we're going to write a query. So
let's say that we have a very simple task, and it says: for US customers, find the total number of customers and the
average score. So let's do it. It's very simple: SELECT COUNT(*) as total customers, and then the average of
score as average score, from our table Sales.Customers. And since it says US customers, we have to filter
the data on the column country equal to USA. So that's it. This is our query. Let's execute it. So we
have a quick, nice report about the total number of customers and the average score. Now let's say that I
have a weekly meeting and I have to present this report over and over. So that means I have to execute this
query frequently, on a weekly basis, in order to get the data for the report. What this means is that I have to
save this query in order to use it later, or else I have to rewrite it each time. So I have to store this text
somewhere so that I don't rewrite the query over and over. What I usually do is copy the
whole query, create a new text file, and let's say it's going to be my weekly query, as a SQL file.
So I edit it and save my query there, and each time I need this query I have to copy
it, go back to my SQL editor and paste it in order to execute it. So either I write it each
time or I copy and paste it. Well, we don't have to do that. We have stored procedures. So that means we're going to
go to step two, where we turn this query into a stored procedure. So let's do that. It's very simple.
We say CREATE PROCEDURE, and now we have to give it a name. So it's going to be GetCustomerSummary.
After that we say AS, and then we need the BEGIN and END, and in between we put our
query. So let's copy our query and just put it in between. That's it. Let's execute it, and with
that we have created our stored procedure. Now in order to see our stored procedure we can go to the Object
Explorer, to our database SalesDB. Here we have a folder called Programmability. So let's go inside it.
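For reference, the procedure built in this step could look roughly like the following. The column names Score and Country and the output aliases are assumptions based on the spoken description, and the catalog query is just an alternative to browsing the Programmability folder.

```sql
-- Assumed schema: Sales.Customers with Country and Score columns.
CREATE PROCEDURE GetCustomerSummary
AS
BEGIN
    SELECT
        COUNT(*)   AS TotalCustomers,
        AVG(Score) AS AvgScore
    FROM Sales.Customers
    WHERE Country = 'USA';
END;
GO

-- Check that the procedure exists without using Object Explorer.
SELECT name
FROM sys.procedures
WHERE name = 'GetCustomerSummary';

-- Run the stored query.
EXECUTE GetCustomerSummary;
```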
Here we have a lot of stuff, like functions and triggers, and we have stored procedures. So let's go inside that. And
we can see over here our newly created stored procedure. So we are almost there. The next step is
to call our stored procedure, and this is the easiest part: we execute the stored
procedure. The syntax is very simple: EXECUTE and then the name of the stored procedure, so GetCustomerSummary.
Let's execute it. And with that, as you can see, we get the result of our query. So as you can see
it is very simple. In just a few steps we created a stored procedure, and in the future you don't need the whole
thing. You just execute the stored procedure. I don't have to store the query locally on my PC or copy and
paste anything. If I want this report now, I just have to execute the stored procedure like this and I get the
results. Okay. So now let's keep moving. Now we're going to talk about parameters inside stored procedures. So
what is a parameter? It is like a placeholder through which you can pass information into the stored
procedure while running it. Using parameters in a stored procedure makes it flexible, reusable and
dynamic. So let's understand what this means. Let's say that you got a new task. It says: for German customers,
find the total number of customers and the average score. So that means we now have to generate two reports, one
for USA and one for Germany, and in both of them you are doing the same aggregation. Again we have to
start by writing the query. It's going to be very similar to the one in the previous example. We are doing
the same aggregations, and the only change is that we use another value to filter the data. So
instead of USA we say Germany here. Let's execute this one. And with that we can
see that the total number of customers is two. So this is the report that we have to provide on a weekly basis. And
again, in order not to copy and paste stuff, we create a stored procedure for that. At the end
we have an END. But of course we cannot use the same name, so we add Germany to the
name here. So let's execute it. The next step is to execute the stored procedure, like this. Let's
execute it. The whole logic is now stored inside the database. Let's refresh the Object Explorer over here.
You can see that we now have two stored procedures. But now you have to feel that there is something wrong. In
programming and coding, if you find yourself repeating the same task over and over, there is always a smarter
way to do it. Repeating stuff in code is always a bad thing. And now we are clearly repeating the same
query in two different stored procedures. If you compare them, you see that the only difference is the value. We have here
the value for the filter, once Germany and once USA. Those values are static. So in the first procedure the value is
always going to stay USA. But instead of that we can replace those static values with a parameter. Then
you decide, as you are executing the stored procedure, which country you want to run it for. So
let's do that. I'm just going to remove everything from here and focus only on the first stored procedure. Now
after giving the name of our stored procedure, we have to define our parameter. It starts with
@, and with that SQL understands: aha, now we are talking about parameters. Now we need the name of the parameter.
It's going to be Country, but it could be any name that you want. After that we have to define the data type for SQL.
It's like when you are creating a table and defining columns: you assign a data type to each column. It's the same thing
here. You have to assign a data type to each parameter as well. We're going to use the data type NVARCHAR, and for
countries it's enough to have a length of 50. With that we are telling SQL that for this stored procedure we can pass
information to the stored procedure, and that this value is going to be available inside this parameter. Now, after
we have defined this parameter, we can use it anywhere inside our query. And of course we want to
use it instead of this static value. So we remove the static value and put
the parameter there instead. Now we are saying: filter the table based on the value that comes from the user,
and no longer on the static value USA. As I said, you can use this parameter everywhere, even here in the SELECT
statement. It is a value that can be used everywhere in your query. So that's it. We have defined our new
parameter and we have used it in our query. Now we have to update the stored procedure. We
cannot leave it as CREATE. Instead we say ALTER. So we are saying ALTER PROCEDURE with the
new information. Let's execute it. And now we have to run the procedure. So what we do is
say EXECUTE GetCustomerSummary. But now our stored procedure is expecting a value from you as
input. We do it exactly as we did with the name over here: we say the parameter @Country
is equal to Germany. So that means the value of this parameter comes from me, from
the input, and this information is passed to my query in the stored procedure. So let's execute it.
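The parameterized version described here could be sketched as follows. It assumes GetCustomerSummary already exists, and the column names and aliases are again assumptions.

```sql
-- Replace the static 'USA' filter with a parameter.
ALTER PROCEDURE GetCustomerSummary
    @Country NVARCHAR(50)          -- value supplied by the caller
AS
BEGIN
    SELECT
        COUNT(*)   AS TotalCustomers,
        AVG(Score) AS AvgScore
    FROM Sales.Customers
    WHERE Country = @Country;      -- filter on the caller's value
END;
GO

-- The caller decides which country the report is for.
EXECUTE GetCustomerSummary @Country = N'Germany';
EXECUTE GetCustomerSummary @Country = N'USA';
```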
And with that, as you can see, we get the customer report for Germany. Now if you say, okay, let's
generate the report for USA, all you have to do is change the parameter value. Instead of
Germany we say USA. So let's execute it. Great. Now we are getting the report for US
customers as well. So that means, my friends, that for those two reports I need just one stored procedure, and with the
help of the parameter I have made my stored procedure more flexible and professional. This is exactly the power of parameters:
they make everything reusable and dynamic. And now of course we don't need the stored procedure for Germany. So
we can drop it. We say DROP PROCEDURE, and the name was like this, with Germany. So we don't need
that stored procedure, and we stay with only one dynamic stored procedure. So this is how to use
parameters in stored procedures and why they are important. Okay. The next step is that we can add default
values for the parameters. Let's say that I run this report very frequently with the country equal to
USA, and I don't want to specify each time that the parameter value is USA. If you are using a value very frequently,
you can add it as a default inside the definition of the stored procedure, and it is very simple. You go to the
definition again over here, and after the parameter you say equal to USA. Now it's very important to understand
that the country will not always be USA. You are just saying: if I don't get any value from the user,
then as a default I'm going to use USA. So let's change the definition of our stored
procedure again using ALTER, and execute. Now we can go to our stored procedure call, skip the whole parameter part over
here, and execute it. So now, by default, I'm getting the report for USA without passing any information to the stored
procedure, because I know the default is USA. But if you need it for Germany, of course you have to
specify that. So you say: execute the stored procedure with the country equal to Germany. If you execute it like this,
SQL is going to use your value. The value that comes as input from the user of course has priority over
the default. And with that we get the Germany report. So as you can see, it's really nice to use
parameters in stored procedures. All right, moving on to the next step. Now we can work with multiple
queries inside one stored procedure. This is what we learned at the start: we can have multiple SQL
statements in one stored procedure. Now we have a new report and query to generate. It says: find the total number
of orders and the total sales. So let's do it quickly. We can write it like this.
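A sketch of this step: first the new query on its own, then one way both reports could sit in a single procedure with a default parameter value. Table and column names (Sales.Orders, OrderID, Sales, CustomerID) are assumptions based on the spoken description, and the procedure is assumed to exist already.

```sql
-- The new report as a standalone query.
SELECT
    COUNT(o.OrderID) AS TotalOrders,
    SUM(o.Sales)     AS TotalSales
FROM Sales.Orders AS o
JOIN Sales.Customers AS c
    ON c.CustomerID = o.CustomerID
WHERE c.Country = 'USA';
GO

-- Both reports in one procedure; each query ends with a semicolon.
ALTER PROCEDURE GetCustomerSummary
    @Country NVARCHAR(50) = 'USA'   -- default used when no value is passed
AS
BEGIN
    -- Report 1: customers and average score
    SELECT
        COUNT(*)   AS TotalCustomers,
        AVG(Score) AS AvgScore
    FROM Sales.Customers
    WHERE Country = @Country;

    -- Report 2: orders and total sales
    SELECT
        COUNT(o.OrderID) AS TotalOrders,
        SUM(o.Sales)     AS TotalSales
    FROM Sales.Orders AS o
    JOIN Sales.Customers AS c
        ON c.CustomerID = o.CustomerID
    WHERE c.Country = @Country;
END;
GO

EXECUTE GetCustomerSummary;                        -- falls back to 'USA'
EXECUTE GetCustomerSummary @Country = N'Germany';  -- caller's value wins
```

Each call returns two result sets, one per query, in the order they appear in the procedure.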
SELECT COUNT of order ID as the total orders, and then the SUM of sales as total sales, from our table Sales.Orders.
And of course we are always creating a report for a specific country. So that means we have to join the orders
with the customers table in order to filter the data, on customer ID equal to customer ID. Now we
filter the data: country equal to USA. So something like this. Let's execute it. And with that,
for the US customers, we have six orders and total sales of 180. And of course, we do the same thing for
Germany. Now of course we will not create an extra stored procedure for this. We're going to put
everything in one stored procedure. So let's copy the whole thing and put it inside here. After the first
report we have the second report. And now the best practice here, if you have multiple queries in a stored
procedure: add a semicolon at the end of each query. It is just easier to understand that this is the end of
this query, especially if you have big, complex queries with CTEs, UNIONs and so on. There it's going to be really
hard to see that we are now talking about a completely new query. It is not something the database requires,
it's just easier to read. So just add semicolons at the end of each query. Now let's execute the whole
thing in order to change the definition of our procedure. And one more thing: of course, don't forget that we don't
need static values over here. We add our nice parameter, so @Country. I think
with that everything is ready to be executed. So let's change the definition of our stored procedure. And
now let's start with the default, where the country is USA. Let's execute it. In the output,
as you can see, we have two results, and that's because we have two queries. The first report is for the first query
and the second one is for the new one that we just created. It's the same thing if you execute the stored procedure
for Germany: we get two results as well. Here we can see that we have four orders and 200 in total sales for
Germany. So as you can see, it's very simple. You can now add multiple SQL statements, and not only queries. You can
do an update, an insert, a delete, any kind of SQL statement. You can just add it inside your
program. And as usual, SQL is going to execute it from top to bottom. Since this is the first SQL statement,
SQL executes it first, and after that it goes to the next one. So this is how you can add multiple
SQL statements to your stored procedure. All right everyone. So now we're going to talk about variables.
So what is a variable? It is like a placeholder in which you store a value in order to use it later inside
your stored procedure. So a variable holds a value in memory, and you can reuse it anywhere
you want inside your stored procedure. But it's not like a parameter. Parameters come from outside
the stored procedure. A parameter is an input from whoever is executing the stored procedure, and the stored procedure
has to adapt to it. A variable is something that lives inside the stored procedure, and we as
developers use it to make our code dynamic and to move a value from one place to another. So let's have a very
simple example. Let's say that we don't want our report about the total customers here as a query. I don't
want it as a result set in the output. Let's say I always generate the report like this: the total customers
from Germany is equal to two, and the average score from Germany is equal to 425. So I need it as text, not as a
table like here. In order to do that we can use the T-SQL PRINT statement to show a message after executing the stored
procedure. The syntax of PRINT is very simple. We go over here and say PRINT, then we have single quotes,
and let's get the whole message from here without the comment marker, and then the semicolon. We can repeat that for
the second message, for the average score, and we put a semicolon over here as well. Now if you do it like this,
the message is always going to be static. We will always have two for the total customers, and the average score
is always going to be like this even though the data is changing. So we cannot have it static like this. We have to
make it dynamic, especially if we are calling this procedure for USA, because then we cannot have Germany here. So
let's see how we can make this dynamic. Now let's start with the easy stuff. Instead of the Germany over here we can
go and put our parameter right. So instead of this so we're going to say at country but now the problem is it is
part of the whole string we cannot do that so we're going to stop the text and you can see the coloring is changing and
then have a plus in order to have concatenations. So this text comes first then the value from the country and then
we're going to have as well the double point as a static text and again a concatenation and then we have the two
we can talk about later. So let's do the same stuff over here. So we're going to say plus add country caring is not
changing because of this code. So let me just remove it and then afterward plus make it static again plus and remove the
final quotes. So with that in the message we have now dynamic where we get the value of the country from the
parameter. And now we come to the interesting part. We have here an issue those two values they come from this
query. And of course we cannot use a parameter for that. We have to use now the variables. Now in order to make a
variable, we have three steps. The first step is that we have to tell SQL about our new variable, so SQL can prepare and
make a placeholder for it in memory. That is, we have to declare our new variables. Now usually
we put all the declarations of our variables at the start of the stored procedure, immediately after BEGIN. So
that means we're going to go over here and say DECLARE, and after that it's like the parameters. It's very simple.
So @TotalCustomers: this is the name of the variable. After that we have to define the data type. Of course
you have to work out the data type from the query. Since we are saying COUNT(*), the output is going to be
an integer. That's why we're going to write it like this: INT. And now we need another one for the average. So
what we're going to do is add a comma. Now we are declaring another variable, @AvgScore,
and the data type of this one is going to be FLOAT because we have an average. So that's it for the first step. We are
telling SQL we have two variables, and SQL is going to create an empty placeholder for each. So now in the second step
we have to give our variables a value. So where are we going to get the values? We're going to get them from the query. So
let's do that. Let's start with the first column. As you can see we have here the COUNT(*). And as we learned,
anything that we write on the right side of it is going to be an alias for the column. But in T-SQL, if you go and
write a variable and an equals sign before it, the value is assigned to that variable. So we can do it like this: @TotalCustomers and then equals.
So now we are saying: whatever value this query returns should be stored inside my new variable. That is how I'm assigning
values to my variable. But there is one thing here: we cannot have aliases anymore, because our query will not
return any results. Our query now has only one task, to assign values to my variables. So that's why we cannot have
it like this. We have to remove the alias. And we're going to do the same thing for the average: @AvgScore
equals the average of the score, and we have to remove the alias. So that's it. Now our query has a different purpose.
It is not for returning results; it is for assigning values to our variables. So now we have values. In the next step we
have to go and use them. We can use our variables everywhere inside our stored procedure. It could be in the PRINT,
it could be in the next query, so in any SELECT statement, in any place. Sometimes we use variables in order to
pass information from one query to another. But in this example, we want to use our variables inside the
PRINT statements. So it is very simple. We're going to go and replace the static number, and it's like the parameter:
we're going to say @TotalCustomers, and the same thing for the average, @AvgScore. So that's it. It's very
simple. So again: in step one we have to declare them, to define them for SQL, and with that we get an empty
variable. In the second step we have to assign values to those variables. And in the last step we have to go and use those
variables. It makes sense, right? Now if you check our message over here, you can see that everything is dynamic and
we don't have any static values. But there is one more thing: in the PRINT, everything should be a string.
So we cannot have dates, numbers, floats and so on. That's why you have to
check, if you're adding any parameters and variables, that all of them are strings. The country is okay
because it has a string data type, but the total number and the average score are not okay
because they have different data types,
and we have to go and cast them to a string type. So we're going to say CAST, and we're going to say
AS NVARCHAR here, so that we don't get any errors from SQL. And CAST here as well, AS
NVARCHAR, like this. All right. So I think we are ready. Let's go and change the definition of our stored procedure in
order to test. So let's go and execute. Perfect. And now let's go and test. Let's start with the default, where we
have the parameter as USA. Now as you can see we are getting one result, and this is from the second query. The
first query is not returning anything in the output anymore. But if you go to the messages over here, you can see we
have a new message. It says the total customers from USA is equal to 3 and the average score from USA is equal to
825. And this is exactly what we wanted for our report. Now let's go and execute with the parameter equal to Germany.
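
Here is a minimal sketch of the three steps (declare, assign, use) put together in one procedure. The procedure name GetCustomerSummary, the NVARCHAR lengths, the message wording, and the column names Country and Score on Sales.Customers are assumptions for illustration, not taken from the code on screen; the second report query is left out to keep it short. CREATE OR ALTER needs SQL Server 2016 SP1 or later, and GO is the batch separator used by SSMS and sqlcmd.

```sql
-- Assumed: a table Sales.Customers with columns Country and Score already exists.
-- The procedure name GetCustomerSummary is hypothetical.
CREATE OR ALTER PROCEDURE GetCustomerSummary
    @Country NVARCHAR(50) = 'USA'   -- parameter: supplied by the caller, defaults to USA
AS
BEGIN
    -- Step 1: declare the variables (empty placeholders)
    DECLARE @TotalCustomers INT, @AvgScore FLOAT;

    -- Step 2: assign values from a query (no aliases, no result set is returned)
    SELECT
        @TotalCustomers = COUNT(*),
        @AvgScore       = AVG(Score)
    FROM Sales.Customers
    WHERE Country = @Country;

    -- Step 3: use the variables; the numbers are cast because the message must be a string
    PRINT 'Total customers from ' + @Country + ': ' + CAST(@TotalCustomers AS NVARCHAR(20));
    PRINT 'Average score from ' + @Country + ': ' + CAST(@AvgScore AS NVARCHAR(30));
END;
GO

-- Run it for Germany; the text appears in the Messages tab, not in the result grid
EXEC GetCustomerSummary @Country = 'Germany';
```

Without the CAST calls, T-SQL would try to convert the message text to a number and the PRINT would fail with a conversion error, which is why the casting step matters.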
Again, we have only one result. And in the messages, we get: the total customers from Germany is equal to 2
and the average score from Germany is equal to 425. So this is exactly how we work with variables. We use them in
order to hold a piece of information in one place and reuse it later in a different place. So that's it for
variables. All right everyone. Now we're going to talk about how to control the
flow in your stored procedure, and we're going to learn how to do that using IF ELSE statements. So let's have
the following scenario. If you check our query over here, we are taking the average of the score, and if you check the
data you can see that in the scores we have NULLs, and NULLs are really bad for aggregations. So we usually have to
clean up our data before doing any aggregations. In this scenario we can treat a NULL as a zero. And how
are we going to clean up and handle the data? We're going to go and run an UPDATE on our table where we say: if
there is a NULL, then make it a zero. And we will do this as a pre-step inside our stored procedure. So that
means first we have to clean up the data, and then afterward we generate the reports. This is what
we usually do in SQL projects. The logic is going to be very simple. First we have to check: do we have NULLs
in the score? If the answer is yes, then we have to go and update the NULL values to zero. But if the answer is no,
we don't have any NULL values, then we can skip everything. So now we're going to go and build this logic inside our
stored procedure in order to clean up and prepare the data. So let's go. Okay. So this part we're going to call
generating reports, and we're going to have another part called prepare and clean up data.
So let's first prepare the structure of the IF statement. The syntax is going to look like this: IF, and then
BEGIN and END. This is the block of the IF, and we're going to do the same thing for the ELSE. So we have ELSE and
we have BEGIN and END. Let me just separate them. So how does this work? We have to create a condition. If the
condition is met, then the IF block is going to be executed. But if the condition is not fulfilled and we have
false, then the ELSE block is going to be executed. So what is the condition? We have to check whether there is a NULL
in the scores. So let's write a very simple query. It's going to say SELECT 1 FROM Sales.Customers
WHERE Score IS NULL, and we always have to check the country as well, so Country equal to, let's say, USA. So let's go
and execute this one over here. Now we are getting a result in the output. If we are getting a result, that means
somewhere there are NULLs. But if you go, for example, and say Germany here and execute the same query, in the output you
see that we don't have any results. That means for the German customers we don't have any NULLs in their scores. So if
this query returns something, we have NULLs. If it doesn't return anything, then there are no NULLs. And we're going to
use exactly this query as the condition. So we're going to take our check and say IF EXISTS, then a pair of parentheses,
and then we put our query inside. So what we are saying with IF EXISTS is: if this query returns anything, then go and
execute the first block, and if it does not, meaning the query returns nothing, then go and execute the second block. So it's a
simple piece of logic, right? Now of course, instead of having a static value over here we can use our parameter,
@Country. And now we have to tell SQL what to do if it exists. In between we can have an UPDATE statement. So UPDATE
Sales.Customers, and we're going to set the score equal to zero. But, very important, we have to use a WHERE
condition, otherwise it's going to go and update everything: the score IS NULL and the country is equal to our
parameter @Country. With that we are updating exactly the NULLs for the specific country. And let's have a semicolon at
the end. And at the start, just to have a nice message in the output, I'm going to say PRINT, and we can have the
message 'Updating NULL scores to 0', with a semicolon at the end as well. So if
there are any NULLs, then execute the
whole thing: print the message and update the table. The next step is that we're going to go and tell SQL what should
happen if the condition is not fulfilled, meaning we don't have any NULLs. Well, we don't have to update the
table at all because we don't have to clean up anything. But I'm going to go and add a PRINT over here. So PRINT, and
we're going to give the message 'No NULL scores found'. And at the end I'm going to go and put a semicolon. So
that's it. This is our logic. We check our condition, and if the condition is met we
update the table with zero instead of NULL, and if the condition is not met we don't do anything, we just print a
message. Now you might say: you know what, why are you doing this? We can just use this UPDATE statement and we don't need
the whole IF ELSE statement. So why are we checking in the first place? Each time I run this stored
procedure I could just go and update all the NULLs, if they exist, to zero. Well, this is not really professional
because you are wasting resources. You would run an UPDATE statement like this each time. So imagine that you have a
big table, and each time you run your stored procedure, SQL has to go and check whether there are any NULLs and so on.
And this of course consumes resources. It's way better if you go and check first whether it's really needed. That's why
we are doing this logic. Now as you can see, our stored procedure is getting bigger and bigger. We have two parts. The
first part is preparing and cleaning up the data, and in the second part we are generating reports. Let's go and update
the whole thing and execute it. And now we have to do it step by step. So let's check our query over here. You can
see we have a NULL here for a USA customer. So let's first go and execute it for the USA as the default. And now let's go
and check the messages. It says 'Updating NULL scores to 0'. That means the first block was executed because SQL did
find a customer with a NULL. And with that the average of the scores is going to be different than previously. So we
now have a more accurate average in our report. If you go and check our query again, you can see that now we have a
zero instead of NULL. Let's go and execute it for Germany like this. And let's go and check the messages. It says
'No NULL scores found'. And that is correct because for Germany we don't have any NULLs. So with that we have
created a control flow using IF ELSE statements. And as you can see, we are not writing just simple queries anymore.
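
The clean-up logic described above can be sketched as a standalone batch. This is only an illustration: the local variable @Country stands in for the procedure parameter, and the table and column names (Sales.Customers, Score, Country) are assumed from the narration.

```sql
-- Assumed: Sales.Customers with columns Score and Country already exists.
DECLARE @Country NVARCHAR(50) = 'USA';   -- stands in for the procedure parameter

-- Condition: does at least one customer of this country have a NULL score?
IF EXISTS (SELECT 1 FROM Sales.Customers WHERE Score IS NULL AND Country = @Country)
BEGIN
    PRINT 'Updating NULL scores to 0';

    -- The WHERE clause limits the update to the NULL scores of this one country
    UPDATE Sales.Customers
    SET Score = 0
    WHERE Score IS NULL AND Country = @Country;
END
ELSE
BEGIN
    PRINT 'No NULL scores found';
END;
```

Note that there is no semicolon between the first END and ELSE; putting one there would end the IF statement and make the ELSE a syntax error.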
We are creating something like a mini program. And now it's like an ETL, where first we prepare the data and second we
generate reports. You can imagine how big those stored procedures get in a real project, where you have a lot of
tables and a lot of things to do. Okay. So now we're going to talk about error handling in stored
procedures. Error handling is an essential thing to do while programming because it gives you control over what
happens once you have an error. And there are a lot of things that you can do, like maybe deleting data, printing a
well-structured message, or maybe doing some logging and so on. So you have full control over what to do if
there is an error, and of course we can do that in a stored procedure. So let's quickly check the syntax. It
usually has two parts. The first part is the TRY part. The syntax is like this: BEGIN TRY, END TRY. So you are
defining the boundaries of the TRY, and in between you have all your SQL statements and your code. The
second part is the CATCH part. You say BEGIN CATCH and END CATCH. So you are defining the boundaries, and then
in between you can tell SQL what to do if there is an error. So what is TRY and CATCH? As the word says, TRY
means you are attempting to do something that might fail. You are telling SQL: try to execute this code. So
SQL is going to go and try to execute your code. And if any error happens while executing your code, SQL
is going to jump to the second block and start doing whatever you have defined in the CATCH. But if there are no errors
at all, this part will not be executed. So the CATCH is like your backup plan: if something goes wrong here, then go to
plan B and do something. So let's see the workflow of TRY CATCH. First SQL is going to go and execute
the TRY, and then it checks: is there any error? If we don't have any error, then everything ends and that's
it. But if SQL hits an error during execution, it's going to go and execute the CATCH. So as you
can see the workflow is very simple, and this is what we mean by TRY and CATCH. So let's go back to SQL to have some
example. All right. So now back to our stored procedure. Let's go and introduce an error in our code. Let's go
over here, and maybe in our query we're going to divide by zero, which is of course a problem. So we have this
error over here, and let's go and update the definition of our stored procedure. And now if you go and execute it, so
let's go and do that, we get an error saying you cannot divide by zero. But what I would like to do is
have something else, where we have a customized message when an error happens. I would like to
control which information is displayed if there is an issue. And in order to do that we have to use TRY
and CATCH. It's going to be very simple. Now this is my whole code. The whole thing, from preparing to
generating the report, is my code, and we have to put the whole thing in a TRY. So how to do that?
Right after the first BEGIN we're going to have another BEGIN, but for the TRY: BEGIN TRY. And what we're going to do
next is go to the last END over here and put an END TRY before it. With that we have now put the whole code inside the
TRY. And after that we're going to introduce the CATCH: BEGIN CATCH and END CATCH. And now in between we have to
tell SQL what should happen if we encounter an error. We can do many things here, but I would like to focus on
customizing the error message. Let's start with the first line. I'm going to say PRINT, let's say, 'An error
occurred'. This is the first thing. Then on the next line I'm going to print more information. And now we're going to say
the error message. So 'Error Message', colon, space. And now we can go and use some predefined functions from
SQL, for example ERROR_MESSAGE(). This function returns the description of the error, like the
one we have here, 'Divide by zero error encountered.' And we can keep adding things the way we need, like
maybe the error number. So we can have it like this, and for that we have a function as well,
called ERROR_NUMBER(). I think we have to cast this one because it is a number, and in the messages we can only have
strings. So this is going to be AS NVARCHAR, like this. And we can keep adding things to our message; for example, let's
take the error line. For that we have a function as well, so it's going to be
ERROR_LINE(), like this, and we have to cast it because it is a number as well. And what is also really important is the
name of the stored procedure. So 'Error Procedure', and we have a function for that, ERROR_PROCEDURE(), like this. It
returns a string, so that's why I don't have to cast it. With that we have defined for SQL what to do if
there is an error in our code. So let's go and execute the whole thing. And now let's go and execute our stored
procedure. So let's go and do that. Now as you can see, in the output we are not getting any results and it is not
raising an error. But if you go to the messages, you will see a very nice message. It says an error
occurred, the error message is divide by zero, and we have the error number, the line, and also the stored
procedure name. So as you can see, it's amazing. This is how we use TRY and CATCH in order to have more
control over what happens if there is an error. Now the next step what I'm going
to do is organize our stored procedure. As you can see, everything is getting bigger. So
what we usually do is use Tab to indent each section. The first section is between the
first BEGIN and the last END. So we have to select everything and hit Tab once. Now it is easier to read:
the whole thing is our code. The next level is the block of the TRY. The whole thing over here is the TRY. So
let's go and do that. I'm just going to select everything until here and then hit Tab. So now we can see it better, right?
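
Put together, the TRY and CATCH blocks and the indentation levels might look like the following sketch. The procedure name DemoErrorHandling and the message labels are hypothetical, and the body is reduced to a single deliberate divide by zero instead of the full clean-up and report code; the four ERROR_ functions are the built-in T-SQL ones named above.

```sql
-- Hypothetical procedure that only demonstrates the TRY...CATCH layout.
CREATE OR ALTER PROCEDURE DemoErrorHandling
AS
BEGIN
    BEGIN TRY
        -- Everything the procedure normally does goes inside the TRY block.
        -- A deliberate error stands in for the real code here.
        SELECT 1 / 0 AS WillFail;
    END TRY
    BEGIN CATCH
        -- Runs only if a statement inside the TRY block raised an error
        PRINT 'An error occurred.';
        PRINT 'Error Message: '   + ERROR_MESSAGE();
        PRINT 'Error Number: '    + CAST(ERROR_NUMBER() AS NVARCHAR(10));
        PRINT 'Error Line: '      + CAST(ERROR_LINE() AS NVARCHAR(10));
        PRINT 'Error Procedure: ' + ERROR_PROCEDURE();
    END CATCH
END;
GO

EXEC DemoErrorHandling;
```

Each nesting level (procedure body, TRY or CATCH block, statements) is indented one step further, which makes it easy to see which BEGIN belongs to which END.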
And the same thing for the CATCH. I think I have already done that, so it's already indented. Now we go to the next
level. Between this BEGIN and END, everything is indented, so this looks nice. The same thing over here: it's
indented as well. And then we don't have any BEGIN and END here, so it looks okay. And the same thing over here. So
all our BEGIN and END blocks are now aligned correctly. The next step is that we can go and improve the comments a little
bit. We can split our code into multiple sections. So what we're going to do is go over here and
say this is step one. And what I like to do is add a separator line using equals signs or any special character that
you like, and here as well. With that we have the first step: we are preparing the data. Then let's go and copy the
whole thing and go over here and say this is step two. And we're going to say this is
generating summary reports, something like this. And of course below that we can say what this report is about:
calculate total customers and average score for a specific country. And we can go
over here as well and add a comment: calculate total number of orders and total sales for a specific country. And of
course we have to go and remove this divide by zero over here, otherwise we'll get an error. And we can add something
about the CATCH as well, where again we add a few comments: we're going to say error
handling. So let's go and execute it again in order to make sure we have the newest version. And with that we are
done. We have a really nice stored procedure with multiple steps, and it is professional: we have error
handling in it, and everything looks well organized and easy to read. So this is how we build stored procedures. All
right my friends. So that's all about stored procedures. It is an amazing feature for adding
programmability to SQL. Now in the next step we're going to quickly cover the topic of triggers. So let's
go. All right. So previously we have understood that we can put all our SQL statements in one stored procedure and
you have to go and manually execute the stored procedure. That means in order to trigger the stored procedure, you have
to execute it manually, and this is of course a problem. How about doing that automatically? Triggers in SQL
are special stored procedures that automatically run, or let's say fire, in response to a specific event that
happens on a table. So what exactly does this mean? Let's say that we have a table in our database, and something
could happen to this table, like inserting data, deleting, or updating data. All those things that happen we
call events. And what we can do is attach a trigger on top of this table, and each time an event
happens, like an insert, update, or delete, something else is going to be triggered, like maybe inserting data
somewhere else in another table, or checking whether we are allowed to delete the data in the first place, or maybe
sending a warning message or something. So based on any changes to the table we can trigger other events, and we can do
that using SQL triggers. For SQL triggers we have multiple types. First, the DML triggers: this type of
trigger responds once we have INSERT, UPDATE, or DELETE statements. Another type of trigger is the DDL
trigger: you can make a trigger respond to schema changes, like creating, altering, or dropping a table, or
even a view, by the way, not only tables. And the third type of trigger is the logon trigger, so the trigger can
respond to login events. In this tutorial we're going to focus on the DML triggers, the INSERT, UPDATE, and DELETE. And
for the DML triggers we have two types: AFTER triggers and INSTEAD OF triggers. As the name
suggests, if you use AFTER, the trigger is executed after the event. The other type, INSTEAD OF, is something that
does not wait until the event has happened: this time the trigger is executed in place of the event, not after it.
So now in order to understand all of this we're going to have a really nice use case. And the use case is about
maintaining audit logs. So what do we mean by that? Let's take for example the table Employees. Employee data
is usually very sensitive information, because there we can see which employees are added, the salary updates, the
employee terminations, and this makes the table very important: we would like to track all the changes that are
happening to this table. So each time we are inserting, updating, or deleting, we would like to maintain a log of all
those changes in order to analyze them later. Such logs are of course very important for compliance and for
auditors, and in case there is a problem we can go to the logs to understand when it happened, who made
the changes, and what exactly changed. And in order to maintain the logs we're going to use the power of triggers. So
what we're going to do is attach a trigger to the table Employees, and each time we insert new
data into Employees we trigger another event. What happens is that this new employee is going to be inserted into the
audit log in order to have a record of this activity. So that means each time you insert data
into the table Employees you are automatically inserting data into the log, and this is a really amazing use case
for triggers. So let's go and implement it. Okay. So now let's quickly check the syntax of triggers. So
we start with the usual CREATE TRIGGER, then the trigger name, and then we have to specify on which table this trigger
is going to be built, with ON. So now we are attaching a trigger on top of one table. After that we have to define
for SQL when this trigger is going to fire, so what is actually triggering the trigger. First you
define AFTER or INSTEAD OF, and then
you define the operation:
INSERT, UPDATE, DELETE, or a combination of them. With that you are telling SQL when
exactly this should happen. After that we have to tell SQL what is going to happen when the trigger is
fired. So here we have BEGIN and END, and in between we have one or more SQL statements that
describe what's going to happen once the trigger fires. So that's it. As you can see the syntax is very simple. Okay.
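
As a rough sketch, here is that syntax applied to the audit-log use case described above: a log table plus an AFTER INSERT trigger. The object names and column sizes are illustrative assumptions, and it presumes a Sales schema with an Employees table that has an integer EmployeeID column. The special table named inserted holds the rows added by the statement that fired the trigger.

```sql
-- Assumed: schema Sales and table Sales.Employees (with an INT EmployeeID) already exist.
CREATE TABLE Sales.EmployeeLogs (
    LogID      INT IDENTITY(1,1) PRIMARY KEY,  -- auto-generated sequence number
    EmployeeID INT,
    LogMessage VARCHAR(255),
    LogDate    DATE
);
GO

CREATE TRIGGER trg_AfterInsertEmployee ON Sales.Employees
AFTER INSERT                                   -- when: after rows are inserted
AS
BEGIN
    -- what: write one log row per newly inserted employee
    INSERT INTO Sales.EmployeeLogs (EmployeeID, LogMessage, LogDate)
    SELECT
        EmployeeID,
        'New employee added = ' + CAST(EmployeeID AS VARCHAR(10)),
        GETDATE()
    FROM inserted;                             -- virtual table, visible only inside the trigger
END;
```

Because the trigger reads from inserted with a SELECT rather than handling a single value, it also logs correctly when one INSERT statement adds several employees at once.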
So now let's do it step by step. First I would like to create a table where we're going to store the log information.
It's going to be a very simple table. We're going to say CREATE TABLE, then we're going to call it Sales.EmployeeLogs,
and we're going to have the following columns in it. Let's start with the primary key. It's going
to be LogID with the data type INT, and then we want something like a sequence, so we're going to add
IDENTITY, and this is the primary key. Let's go to the next one. It's going to be EmployeeID, and the data type
is going to be INT. The next one is going to be LogMessage. Let's have it as a VARCHAR, and I'm going to make it
255. Then the next one is going to be LogDate, and for that we're going to use, let's say, a DATE or a
DATETIME. So that's it. Let's go and execute it, and with that we have a new table in our database. Now the next
step is that we're going to go and create our trigger. So we're going to say CREATE TRIGGER, and I'm going to name
it like this: trg, which is just a prefix to indicate that this is a trigger, and then after insert
employee. And now we have to define the table. So it's going to be ON Sales.Employees. With that we are saying
we now have a trigger on the table Employees. And now we have to define the event. We're going to use AFTER
INSERT. That means after we insert any record into the table Employees, the following things should happen. So we're
going to say AS and then BEGIN and END, and in between we can have our logic. So what should happen after a new record
is inserted into Employees? We're going to go and insert a new record into the employee logs. So we're going to have
INSERT INTO Sales.EmployeeLogs, and we're going to list the three columns here: EmployeeID,
LogMessage, and LogDate. So now which values are going to be inserted? They are going
to come from a query. So we're going to say SELECT, and we're going to say EmployeeID as well, and for the log message
we can have a customized one, let's say 'New employee added', and it's going to be equal to the employee ID. So in order
to include the employee ID it's going to be like this, with a concatenation. So that's it. Now to the next one:
we need the log date. It's going to be GETDATE(). And now you might say: okay, but where is this employee ID coming from?
Well, it is going to come from a table: FROM inserted. So what actually is inserted? It is a special virtual
table that holds all the newly inserted data for our table Employees. Anything we insert into Employees
will be available in this table. And of course it is only available during the execution of this trigger. So you
cannot go outside of the trigger and start querying the table inserted, because you will not find anything. It
is only a virtual table that contains whatever you are inserting into the table Employees, and you find a lot
of information there, like the salary, the age and so on. So that's it for inserted. Now we have to make sure that
in our message everything is a string, and the employee ID is an
integer, so we have to cast it. So CAST,
and then we're going to say AS VARCHAR, like this, otherwise we'll get an error. So I think we have our trigger ready.
We have a new trigger on the table Employees. The first question is: when is this trigger going to fire? Well,
it fires after inserting data into Employees. And the second question: what's going to happen? Well,
once we have this event, the whole thing here is going to be executed, where we are saying insert into the logs the
employee ID, the message, and the date when this happened. And we get all that information from the
virtual table inserted. So I think we are ready. Let's go and execute it. And now if you go to the Object Explorer, to
our database, let's go to our table Employees and then to Triggers. If you refresh over here you can see the
new trigger that we just created. So with that we have defined our trigger and we are ready. Now the next step is
that we're going to go and trigger our trigger. So let's go and do that. Let's have a new query. But first I'm going to
have a look at our logs: Sales.EmployeeLogs. So let's query this one. And as you can see our
log is empty because we didn't insert anything into the table Employees. Let's go and do that. Let's trigger our
trigger. What we're going to do is say INSERT INTO Sales.Employees, and we're going to have the
following values. For the ID we are at, I think, six. Let's have the first name
Maria. The last name an then we're going to have the position. It's going to be the HR for example. The birth date,
let's pick something. I don't know. We have a female here. And the salary: let's go and get
this salary. And the hierarchy can be, for example, three. So let's go and execute it. And with that, as you can see,
we have inserted new data into Employees. Let's check the logs now. So let's query it. We have a nice log entry here
about employee number six, and we have a nice message and when it happened. Of course you can go and insert
another employee, let's say seven, with the same data. So let's do that and check the logs. And with that we have
another log entry for the new employee. So this is a really amazing use case for maintaining a log of your data, and you
can go and do some analysis on how many inserts happen. And of course not only on INSERT: you can have it
on UPDATE and DELETE too. So as you can see it is very simple. This is how we create triggers in SQL. All right my
friends. So that's all about triggers. With that we have
covered all the concepts and topics that you have to learn about SQL. Now the next chapter is going to be about
performance. As you start writing queries and so on, you will start noticing that some queries are really slow.
Now what we're going to do in this chapter is learn different techniques for optimizing
performance. And the first and most famous one is to go and build indexes in databases. So let's understand what this
means. So what is an index? An index is a data structure that provides quick access to the rows in order to improve the
speed of your queries. So an index is like a guide for your database that speeds up the process of searching for
data, especially if you have big tables. Now in order to understand what indexes are, imagine you have a huge
book and you want to find a specific topic or chapter. Instead of flipping through every single page in order to find
the topic that you are searching for, you would use the index at the back of the book in order to jump straight to the
right page. And that's exactly what an index does, but for your data. Another analogy that I use in order to
understand indexes is to think about a big hotel. Let's say that in the hotel we don't have any
guide and you would like to find room number, let's say, 5001. What are you going to do? You're going to go and
search for your room floor by floor, checking each room until you find yours. But instead of that, thankfully
hotels have a numbering system, and you can ask for a map from the reception in order to understand in which building
and on which floor you can find your room. So by just following the map and maybe some signs, it's going to be very
quick to locate your room in such a big hotel. And that's exactly what each database needs. It needs an index in
order to help the database find and locate the right data without having to scan
everything. And now let's say that you ask me, you know what, I have this big table and I would like to speed up the
queries using indexes. And my first question going to be, what are you exactly doing with this table? Are you
using this table to search for text or are you doing like complex analyszis with this table? And the reason why I'm
asking this is that we have different indexes in databases for different purposes. So now let's have a quick look
to the different types of indexes that we have in database. I divide the indexes in databases into three
categories. The first one is by the structure how the database is organizing and referencing the data. And here we
have two types. The clustered index and the non-clustered index. Those are very important to understand. Now we have
another category for the indexes. We can divide them by the storage. And in this category we are talking about how the
data is stored physically in the database. So we have two types. We have the row store index and the column store
index. And the third type is the functions and here we have two types. We have the unique index and the filtered
index. Now each index type has its own strings but as well there is always a tradeoff. Some might improve their read
performance. The other one might improve the insert and update operations. So it's all about choosing the right type
of index for the job. So now what we're going to do, we're going to go and deep dive into each of those types in order
to understand how they work and how we can create them. And we will start with the first category, the structure. We
have the clustered index and the nclustered index. Now before we dive into how the indexes
works in databases, let's understand first what happens to the database tables if you don't use any index. When
you create a new table in your database like for example the customers table where you have let's say 20 customers
inside this table. What you're going to see at the client side is like spreadsheets like a table with rows and
columns but behind the scenes the database store it a bit differently. It's going to store the data in a data
file on the disk and inside this file the data can be stored inside blocks called pages. So it's not like rows and
columns that are stored inside data files and inside the data files we have pages. So what is a page? A page is the
unit of data storage in a database and it is a fixed size of 8 kilobyt where the SQL database can store anything
inside it. It can store inside it the rows of your tables or columns metadata indexes and every time you are
interacting with your data the SQL is reading and writing to those pages. So as you can see the SQL is not storing
the data inside like rows and columns. So if you are running a query the SQL is not like selecting a specific column it
always fetch a data page in order to read the rows inside this page. And the main two types that we're going to learn
is the data page and the index page. So how the data page looks like it is divided into multiple sections. The
first section is the page header where the database can store key informations about the metadata like the page ID and
it has the following format. It start with the file ID like one and then we have a unique number for each page. So
for example 150. So the page header is a fixed size of 96 bytes. Now to the next section, we're going to have a variable
size. This is where your data row is going to be stored. So your actual data and row is going to be stored in this
section. And the SQL going to try and fits as many rows as it can in one single page. And this of course depends
on the size of each row. So if you have like a large table where the rows are really big, so SQL can fit only few rows
in one single page. And now moving on to the last section in the data page, we have the offset array. This is like a
quick index for the rows stored inside this page. It keeps track of where each rows begins so that the SQL can easily
locate a specific row without having SQL like scanning the entire page in order to find a row. So this is the structure
of the data page and this is exactly how the SQL stores data inside the databases. So now back to our example
where we have the customers table and 20 rows. So let's see how SQL going to be creating those pages. Now if you are not
using any index in this table. So now what going to happen? SQL going to insert the data inside those pages as
you are inserting the data inside the customers. So maybe first you are inserting the customers like 12 5 6 7
and SQL going to insert it to the data pages exactly like that. So that means SQL is just inserting the data as you
insert it to the table. So let's say each data page is like fitting only five rows. So after we insert five customers,
SQL going to go and create another data page for the next rows. So in the next page, the SQL going to insert the next
five customers. And once it's full, it's going to create another data page in order to start adding the next customer
until we have like for example four pages for that 20 customers. So now if you check the customers inside those
four pages you see that they are not sorted at all and that's because in this scenario we are not using any index. So
we call this structure as a heap structure. So a heap table is a table without a clustered index. That means
the rows are stored randomly without any particular order. This is not a really bad because it's going to be very quick
to insert data inside this table. But of course finding something from this table going to be very slow. So this is the
first tradeoff. You have a very fast writes but a very bad reads. Think about it like you are throwing all your papers
in a drawer without organizing them. So you can toss things very quickly in this drawer. But if you want to search for
specific paper later, it's going to be very long process until you find it because nothing's in order. So now let's
see how the SQL going to handle if you read something from this table. Let's say that you are searching for the
customer with the ID 14. So now SQL has totally no idea where to find this customer. So SQL going to start fetching
each data page and start scanning each row. So it's going to start with the first data page and start scanning.
Well, SQL will not find 14 here, so it goes to the next page and scans that as well, searching for ID
14, and again nothing is found. The same goes for the third page. So SQL goes
to the last data page, and there, after scanning four rows, it finally finds customer
number 14 and returns it to the client. So as you can see, in order to find one customer, SQL read
four different pages and scanned 19 rows, and this process is called a full table scan.
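
The heap behaviour described above can be observed directly. The following is a minimal sketch for SQL Server (T-SQL), not the course's own script: the temporary table `#HeapDemo`, its columns, and the sample names are made up for illustration.

```sql
-- Hypothetical demo table: no primary key and no index, so SQL Server stores it as a heap.
CREATE TABLE #HeapDemo (
    CustomerID INT          NOT NULL,
    FirstName  NVARCHAR(50) NOT NULL
);

-- Rows arrive in no particular order, as in the example (12, 5, 6, 7, ...).
INSERT INTO #HeapDemo (CustomerID, FirstName)
VALUES (12, N'Anna'), (5, N'Ben'), (6, N'Carla'), (7, N'David'), (14, N'Eva');

-- A heap shows up in sys.indexes as index_id = 0 with type_desc = 'HEAP'.
SELECT index_id, type_desc
FROM tempdb.sys.indexes
WHERE object_id = OBJECT_ID('tempdb..#HeapDemo');

-- With no index to navigate, the only way to find customer 14 is to scan the table.
-- STATISTICS IO reports the number of pages read ("logical reads") in the messages output.
SET STATISTICS IO ON;
SELECT CustomerID, FirstName
FROM #HeapDemo
WHERE CustomerID = 14;
SET STATISTICS IO OFF;

DROP TABLE #HeapDemo;
```

If the actual execution plan is enabled in the client, the SELECT should appear as a Table Scan operator, which corresponds to the full table scan from the walkthrough. With only five rows everything fits into one page, so the cost only becomes visible on larger tables.
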
A full table scan means SQL scans the entire table, page by page and row by row, in order to find a specific
row. For this table maybe it's not a big deal. But if you have a big table with
hundreds of thousands or maybe millions of rows, searching through the heap structure is going to be very painful and
slow just to locate one row. And this is exactly why we need indexes in SQL
databases. So let's understand the first type of index, the clustered index. All right. So now let's understand what
happens if you create a clustered index on your table. Say you create a clustered index on the ID
column of customers. The first thing that happens is that SQL
physically sorts all the data by the ID column. The rows are
rearranged across the data pages from lowest to highest. In the first page we have
customer ID 1, then 2, 3, 4, 5, until in the last page we reach the last customer, number 20. So as you can see,
the first page has the lowest values and the last page has the highest. But that's not all. The next step is that
SQL builds the B-tree. So what is a B-tree? B-tree is short for balanced tree.
It is a hierarchical structure that stores the data as an upside-down tree.
It starts with the root node and keeps branching out until it eventually reaches the leaves.
Between the leaf nodes and the root node we have the intermediate nodes. There can be
one level or multiple levels between the root and the leaves. Once SQL has
constructed the B-tree, it's very easy for it to navigate through the
tree to find specific information. So let's see how SQL
builds the B-tree for the clustered index. It's very important to understand that the leaf nodes of the
B-tree for a clustered index contain the actual data, the data pages. So all your nicely sorted data pages are
stored at the leaf level. After that, SQL starts building the
intermediate nodes, and here the database uses a different type of page: the index page.
In an index page we don't find the actual data, the
entire rows. Instead, the index page stores a key together with a pointer to another index page or to a
data page. For example, we have the key 1, and its value is the ID of a data page. So
here we don't have the whole row, only a pointer to a data page. We
are telling SQL: if you are searching for IDs between 1 and 5, you can find them in data page
1:100. Then we store another pointer in this index page that tells SQL: if you are searching between 6
and 10, you can find them in the second data page. So this is the
structure of the index page: it contains only pointers to other pages. The
same goes for the other two data pages. SQL creates another index page
which says: if you are searching for IDs between 11 and 15, you can find them in the third page,
1:102, and for the last group, between 16 and 20, there is another
pointer to the last page, 1:103.
So as you can see, inside those index pages we have a pointer for
each group of IDs, for each cluster. For the group of customers between 1 and 5 we have one pointer, and for the
second group between 6 and 10 we have another. That means we don't have a pointer for each row; we
have a pointer for each group, each cluster, which is a handy way to remember the name clustered index. Once SQL is done building
the intermediate nodes, it builds the last node, the root node,
which says: if you are searching for
customers between 1 and 10, go to the index page with ID 1:200. So the root node
points to another index page, not directly to a data page. And in the same
way we need another pointer for the
second index page: for customers between 11 and 20, go to the index page with ID
1:201. This is exactly what happens if you create a clustered index in SQL. First it
physically sorts all your data in the data pages. So if the data was stored in random order at first,
SQL has to rearrange everything and sort it from
scratch. Then it builds this structure, where
the root node is an index page, the intermediate nodes are index pages, and
at the leaf level we have
the actual data, the data pages. Now let's see what happens if you
query the table and search for
ID 14. SQL starts at the root node and checks which pointer to use. Since 14 is in the group between 11 and 20, it
uses the second pointer, to the index page with ID
1:201. SQL opens
this index page and checks the pointers. Since 14 is between 11 and 15, it uses the pointer to
data page 1:102. With that SQL has located the correct data page, the
third page, and now it opens
this data page and finds customer ID 14. So as you can see, it was very fast for SQL to locate the correct data
page: with only three jumps, from the root node to the intermediate node to the data page.
And here SQL needs to read only one data page instead of the four
data pages we saw with the heap structure. Of course you might say: but we are still reading
three pages. Well, the index pages hold only keys and pointers, so navigating them is quick, and on a big table
a handful of page reads is far less work than scanning every data page. So
as you can see, this B-tree structure, the clustered index structure, helps
SQL and the database locate the right data in the right data pages. And this is exactly how the
clustered index works in a SQL database. All right. So now we're going to move to
the second type and understand how exactly SQL builds the nonclustered index. So let's
go. We are back at the heap structure, where our table doesn't have any
index and our data is stored in random order
inside the data pages. Now if you create a nonclustered index on the
customer ID, what happens? Here's
the big difference: SQL will not touch or change anything in the actual physical
data, in the data pages. The
data pages stay as they are, and SQL immediately starts building the B-tree
structure. It starts by building an index page, and this index page is a little bit
different from the one we learned about previously. Since it's an index page, it stores pointers. But
this time SQL stores the customer ID as the key. So 1 is the customer ID, and the value, the pointer,
will not be just a data page ID. We will be more specific: we're going to have an address that says exactly where the row is stored.
So it starts with the file ID and the page number, because customer ID 1 is stored in page
1:102. But SQL also adds the offset of the row, which says exactly where in the page we can find this ID. The
whole thing is called a RID, the row identifier. Now let's quickly see how the index page points exactly
to the row inside the data page. The first part of the row identifier
maps to the data page ID, and then
the 96 takes us to the offset, which is exactly the location of row number one. So 96 is
where row number one starts, and that takes
us exactly to the place where we can
read the information for customer ID 1. This is how the index
page locates the exact place of each
row. SQL continues and assigns a pointer to the exact location for each customer ID. So as you
can see, in this index page we don't
have a pointer for each group of customers, as we learned for the
clustered index. We now have a pointer
for each ID, and this type of index page is called a row locator page. SQL continues
mapping a pointer for each customer ID in our table, so we end up with multiple index
pages pointing to our data pages. As you can see, we have a lot of pointers. The data inside the index pages is of
course sorted, but the data pages are left as they are. Those index pages that hold the row identifiers
are stored at the leaf level of the B-tree. So at the leaf level we don't have the actual data, the data pages; we have
index pages with pointers to the actual data. Then SQL starts building the
intermediate nodes, exactly as for the clustered index, where each entry
points to another index page. So customers between
1 and 5 are in index page number 200.
Nothing changes here; it is the
same structure: an index page pointing to another index page, but this
time for a group of customers. And then
we have the root node as well. Again we call this a B-tree structure, and it points to
the data pages, but the data pages are not part of the B-tree. Now let's
say we are searching for customer ID
14. What happens? It starts again from the root node and finds the pointer
to the intermediate node, jumps
to the intermediate node, and there finds the
pointer to the index page for IDs between 11 and 15. Then it scans
this index page and finds that for
customer ID 14 we have the following address. So it locates the exact data page and
also the exact place of the row, and it
can jump immediately to the row without scanning anything else. So
this time, with the nonclustered index,
SQL read three different index pages and finally the one data page in
order to find the data. If you compare that to the clustered index, you can see that we have one extra layer,
one extra index page to read, in order to find the right place of the
row. This is how SQL creates the B-tree
for the nonclustered index and how it navigates it to find the
information. All right. So now when I
think about the clustered index and the nonclustered index, I think about a
book. You can think of the clustered
index as the table of contents at the front of the book. The table of contents tells you where to find
each chapter, and the chapters are sorted exactly like the table of
contents. This is exactly what the
clustered index does. On the other hand, think of the nonclustered
index as the index that you find at
the end of the book. The index of a book is a very detailed list of topics, terms, and keywords that points
exactly to the location where you can find each of them in the book. And the content of the book is not sorted like
the index of the book. This is exactly what the nonclustered index does. It coexists with the data. It is an
extra list that points exactly to where we can find the data inside our table. All right. So now let's
put those two indexes side by side to understand the differences between them. The structure of the clustered
index is a B-tree. It starts with the root node, where we have an index page. This index page points to the
intermediate nodes, which are index pages as well, and those index pages point to the actual data, the data
pages. So at the leaf level of the clustered index we have the data pages, the actual data. What's special about
the clustered index is that it physically sorts the data inside those pages. Everything here is physically
rearranged and sorted. If we talk about the nonclustered index, we have a B-tree as well. It's the same
thing: at the root node we have an index page pointing to an intermediate index page. But this time the intermediate
nodes point to another index page, not to a data page as in the
clustered index.
So if you check this structure, you can see that
at the leaf level of the clustered index
we have the actual data, the data pages, while at the leaf level
of the nonclustered index we don't have
the actual data. We have index pages, and those index pages point to the
actual data, the data pages. The
big difference is that the data pages are not part of the B-tree. The B-tree of the
nonclustered index is a separate
structure that does not contain any data. We have only index pages, and they just point to the data pages without
changing anything physically in your data. In reality you can have both types of indexes,
clustered and nonclustered,
on one table. What happens then is that the leaf level of the nonclustered index
points to the data pages of the
clustered index, because those index pages don't care whether the data pages are
sorted or not. They just point to the correct page and the
correct row. So that means we now have
two different B-tree structures pointing to the data. And there is one thing you have to
understand here: you can create only
one clustered index on a table. This rule makes sense, because physically the
data can be sorted in only one way.
On the other hand, you can create as many
nonclustered indexes as you need. You can
create three or four, and all of them point to the same data pages, because
in the B-tree of a nonclustered index we don't store any data pages. We store only pointers to the data, and you can
have multiple sets of pointers. So this is the most important, the main difference between those two indexes.
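
The coexistence of one clustered index with several nonclustered indexes can be checked with a small T-SQL sketch. Everything in it is illustrative: the temporary table `#IndexDemo`, its columns, and the index names are assumptions, not objects from the course database.

```sql
-- Hypothetical demo table for looking at index types side by side.
CREATE TABLE #IndexDemo (
    CustomerID INT          NOT NULL,
    FirstName  NVARCHAR(50) NOT NULL,
    LastName   NVARCHAR(50) NOT NULL
);

-- One clustered index: it decides the physical order of the rows.
CREATE CLUSTERED INDEX idx_IndexDemo_CustomerID ON #IndexDemo (CustomerID);

-- Several nonclustered indexes: each one is a separate B-tree over the same rows.
CREATE NONCLUSTERED INDEX idx_IndexDemo_LastName  ON #IndexDemo (LastName);
CREATE NONCLUSTERED INDEX idx_IndexDemo_FirstName ON #IndexDemo (FirstName);

-- Lists one CLUSTERED entry (index_id = 1) followed by the NONCLUSTERED entries.
SELECT name, index_id, type_desc
FROM tempdb.sys.indexes
WHERE object_id = OBJECT_ID('tempdb..#IndexDemo')
ORDER BY index_id;

DROP TABLE #IndexDemo;
```

Adding a second `CREATE CLUSTERED INDEX` on the same table would be rejected, which matches the rule that the data can be physically sorted in only one way.
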
Now if we put them side by side: we have learned that the clustered index physically sorts and stores
the rows in the B-tree, while the nonclustered index creates a separate B-tree structure with
pointers to the actual data. And by the way, the clustered index is what we call the
main index of a table. It is the
main one, the most important one that you can use on each table in your
database. As we learned, when it comes to the number of indexes, you
can create at most one clustered index per
table, but for nonclustered indexes there is no such restriction: you can create
multiple of them on each table. Now
let's compare read performance: how fast can we get
data using a clustered index? Well, it is
faster than the nonclustered index. That's because in the nonclustered index we have this extra layer at the
leaf level of the B-tree, and this extra layer means SQL
has to do extra work to find the
data. That's why the clustered index is faster than the nonclustered index. On the other hand, if we talk
about write performance, how fast
we can insert data into the table,
writing to a table with a clustered
index is slower than writing to one with a nonclustered index. That's because as you
insert data, SQL
always has to check the data pages: is everything sorted correctly? If not,
SQL has to physically re-sort
the data to restore
the correct order. So keeping the data sorted with
a clustered index takes a lot of effort. With the nonclustered index we don't
have this. The physical data
stays as it is; we are just creating new pointers. So writing to a
table where you have a clustered index
is going to be slower than writing to
a table where you have a nonclustered index.
And of course the fastest way to write
data to a table is to have no indexes at all, a heap structure. SQL just
inserts data into the
data pages without maintaining any extra structures. So as you can see, it's always a trade-off. You can read fast, but
then you're going to write slower. You cannot have everything. Now let's talk about storage efficiency.
The clustered index is better on storage than the nonclustered
index, and that's for the same
reason: with the nonclustered index we have this extra layer of index pages, and index pages need storage. That's why
a nonclustered index takes up more storage than the clustered index. Now let's talk
about use cases: when should you use
a clustered index? A column
has to meet a few
criteria to be a good candidate
for the clustered index. First, it's good if the values in
the column are unique. Second, and
this is even more important, the values of this column should not change
a lot. If the column gets a
lot of update operations and the data keeps changing, SQL has to keep
re-sorting the data
again and again. So a column that changes frequently is not
good for a clustered index. That's why
the primary keys of tables are perfect candidates: first, they are unique,
and second, we rarely update a
primary key value; we just keep appending new ones. That's why
primary keys are perfect for the clustered
index. And one more place where I use a clustered index is to optimize the performance of range queries. If you
are querying the data between one value and another, a clustered index works
really well. On the other hand,
we can use a nonclustered
index on columns that are
used in search conditions. Or, if you
are joining tables without using the primary keys, you can apply
a nonclustered index to get
faster joins. Or you can use it to
optimize performance when you are searching for an exact value, an exact
match. So those are the main and most important differences between the clustered and the nonclustered indexes.
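
To make the two use cases concrete, here is a short T-SQL sketch under stated assumptions: the temporary table `#UseCaseDemo`, its columns, the index names, and the search values are all hypothetical. Whether the optimizer actually picks an index also depends on table size and statistics, so treat this as an illustration of the query shapes rather than a guarantee.

```sql
-- Hypothetical demo table for the two typical use cases.
CREATE TABLE #UseCaseDemo (
    CustomerID INT          NOT NULL,
    LastName   NVARCHAR(50) NOT NULL
);

-- Clustered index on a unique, rarely updated column (typically the primary key).
CREATE CLUSTERED INDEX idx_UseCaseDemo_CustomerID ON #UseCaseDemo (CustomerID);

-- Nonclustered index on a column that is used in search conditions.
CREATE NONCLUSTERED INDEX idx_UseCaseDemo_LastName ON #UseCaseDemo (LastName);

-- Range query on the clustered key: matching rows sit next to each other in sorted order.
SELECT CustomerID, LastName
FROM #UseCaseDemo
WHERE CustomerID BETWEEN 100 AND 200;

-- Exact-match search on the nonclustered column.
SELECT CustomerID, LastName
FROM #UseCaseDemo
WHERE LastName = N'Brown';

DROP TABLE #UseCaseDemo;
```
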
All right. So before we go to SQL and start practicing, I would like to show you the syntax for an index.
It's very simple. It starts with CREATE, then we can define whether the index is CLUSTERED or NONCLUSTERED, and then
comes the keyword INDEX. The type is optional: if you don't define
anything, the default is
nonclustered. So if you say CREATE INDEX, SQL Server creates a
nonclustered index. After that we
define the name of the index, and then we tell SQL on which table to create the index with ON and the
table name. Then we can define
one column or multiple columns for the index. An index with multiple
columns is called a composite index. For example, we can create a clustered index using this command: CREATE
CLUSTERED INDEX, the index name, and then we specify the table and the ID. So we
are saying: create a clustered index based
on the ID column of the
customers table. And if you want to create a
nonclustered index, you say CREATE
NONCLUSTERED INDEX, and the rest is the same. So far we have used one column in the index, but we can also create a
composite index with multiple columns,
as in this example. We
say CREATE INDEX, and as you can see,
we skipped defining the type, because the default is
a nonclustered index. Here we are
specifying two columns, the last name and the first name. And as you can see, we are also
telling SQL how to sort
the data: the last name should be sorted
ascending, from lowest to highest, but the
first name should go the other way around, from highest to lowest. So you can
control how the keys are sorted
in the index. As you
can see, it is very simple. This is the syntax for creating an index in SQL.
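
Written out, the syntax described above looks roughly like this. It is a sketch for SQL Server (T-SQL); the temporary table `#SyntaxDemo`, its columns, and the index names are placeholders chosen for illustration.

```sql
-- General shape (square brackets mark optional parts; this comment is not runnable):
--   CREATE [CLUSTERED | NONCLUSTERED] INDEX index_name
--   ON table_name (column1 [ASC | DESC], column2 [ASC | DESC], ...);

-- Hypothetical table to run the statements against.
CREATE TABLE #SyntaxDemo (
    ID        INT          NOT NULL,
    FirstName NVARCHAR(50) NOT NULL,
    LastName  NVARCHAR(50) NOT NULL
);

-- Clustered index on a single column.
CREATE CLUSTERED INDEX idx_SyntaxDemo_ID
ON #SyntaxDemo (ID);

-- Nonclustered index, with the type spelled out.
CREATE NONCLUSTERED INDEX idx_SyntaxDemo_LastName
ON #SyntaxDemo (LastName);

-- Type omitted, so SQL Server creates a nonclustered index.
-- Composite index: LastName ascending, FirstName descending.
CREATE INDEX idx_SyntaxDemo_LastName_FirstName
ON #SyntaxDemo (LastName ASC, FirstName DESC);

DROP TABLE #SyntaxDemo;
```
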
Right. So back in SQL, the first question is: where do we find indexes in the database? Well, you can
explore them. If you go to the Object Explorer over here and check any table in our SalesDB, for example
customers, you have a folder called Indexes. If you expand it, you will find an index there. I didn't create
any of those indexes in the database. But in SQL Server, if you define a
column as the primary key, SQL
Server by default creates a clustered index for the primary key,
because it almost always makes sense to have
the clustered index on the primary key. So this one was created by default, and as you can see, its name starts with
the primary key prefix and the table name, and it is marked as clustered. Now I would like to start
from scratch, so I'm going to
create a new table
without any indexes. What we're going to do is load the
customers table into a new table. How do we do that? We
say SELECT * FROM Sales.Customers,
and before the FROM we say INTO a new table, which is going to be DBCustomers. So like this.
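
The statement dictated above would look roughly as follows. The object names are reconstructed from the narration and are assumptions about the exact spelling: database `SalesDB`, schema `Sales`, source table `Customers`, new table `DBCustomers`.

```sql
-- Assumed names: database SalesDB, schema Sales, source table Customers.
USE SalesDB;
GO

-- SELECT ... INTO creates Sales.DBCustomers and copies the rows into it.
-- Indexes and constraints of the source table are not carried over,
-- so the new table starts out as a heap.
SELECT *
INTO Sales.DBCustomers
FROM Sales.Customers;
```

`GO` is a batch separator understood by client tools such as SSMS and sqlcmd, not a T-SQL statement.
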
Let's go ahead and execute it. Now if you go to the left side and refresh the tables, you can find a new
table called DBCustomers. Let's
check whether we have any indexes on it. The Indexes folder is empty. So we
don't have anything, no clustered index or anything else, and this table has
a heap structure. The data
is stored there without any order.
And if I go over here and, for
example, select
from this new table where the customer ID equals 1 and execute it, SQL Server does a full
scan of the table to find this customer ID. So our new table DBCustomers is a heap. But let's
change that. What we're going to do is create a new
clustered index. We say
CREATE CLUSTERED INDEX and then
give the index a name. We
usually follow a naming convention: we have a prefix for index, then
the table name, DBCustomers,
and then the key of the index, the column that we are using
to index the table. It's
important to stick to the same naming convention for index names, because
later, when you are monitoring your
indexes, it's really easy to understand: okay, this index is for the table DBCustomers and it uses the
customer ID. After that
we specify on which table we are creating the index, so ON Sales.DBCustomers,
and then we
specify the column name. So we are saying: build a clustered index for me
based on the customer ID. Let's
execute it. As you can see, it's very fast because we have only five
rows, so the database rearranged the data pages very quickly. Now let's
check our new index. Let's
refresh and go into the folder, and
we can see our new clustered index based on the customer
ID. As we learned, we cannot create
multiple clustered indexes. But let's
test that. I will take the
whole statement and say I would like to create a clustered index based on the first
name as well. Let's
execute it. As you can see, the error says that you cannot create
more than one clustered index on this
table. That means we can create only one clustered index. Now let's say that
after you created the index you realize you chose
the wrong column, and you would like to
change it to the first name. Then
we have to
drop the index. We say DROP INDEX,
then the index name, which was
this one, and then you have to specify
the table with ON, so Sales.DBCustomers,
like this. If I run it like
this and refresh again, you
can see that we don't have any indexes
anymore, and the table is back to a heap structure. Now you could create
the correct clustered index for this
table. But to be honest, I'm going to
stick with the customer ID. I will
not create a clustered index on the
first name, because the first name
of course is not unique; you can have
multiple customers with the same
name. Updates could also happen
on the first name, and that would be
very expensive. So I'm going
to stick with my index on the customer ID. Let's execute it. And now I have my index on my table again.
let's say that I have the following SELECT statement on our customers table, and I'm filtering on the last
name, where, let's say, we are searching for Brown. Let's go and execute it. Now let's say that we are getting more
and more customers, our table is getting bigger, and I frequently use this query, so I'm searching for specific
customers using the last name. What we can do is create a nonclustered index on the last name in
order to improve the performance of this query. So let's go and create that. We're going to say CREATE
NONCLUSTERED INDEX, and we give it a name using the naming convention, so DBCustomers and then the
last name, since that is the column for this index. Then ON Sales.DBCustomers, and we use the
column LastName for the index. Let's go and execute it. Now if you go to our indexes and refresh, we will find
our new index over here. As you can see, it says it is nonclustered and also non-unique. We will talk about
uniqueness later. So as you can see, it's very easy: we have just created a nonclustered index on the last name. And
as we learned, we can create multiple nonclustered indexes on the same table. Let's say, for example, our
query now looks like this: we are searching on the first name, for example for the value Anna. This query runs a
lot and may be slow, so we can create a new nonclustered index. Let me just set it up like this. For a
nonclustered index you don't always have to write NONCLUSTERED: by default an index is
nonclustered, so we can skip the keyword. Let's call this one FirstName, and the column that we are using is the first
name. Let's go and create this index and refresh our indexes. As you can see, SQL did
create a nonclustered index for the first name. So if you don't specify the type of the index, it is by
default a nonclustered index. All right. So now let's talk about the composite index. It is an
index that has multiple columns inside the same index. So far we have used only one column per index, but we can
specify multiple columns, and that's because sometimes our WHERE conditions are more complicated and based on multiple
columns. For example, let's say that we are searching for country equal to USA and at the same time
the score should be higher than 500. That means in this condition we are using two columns, and we would like to
speed up this query. So how are we going to do it? We create an index and give it a name, DBCustomers
and then CountryScore, ON Sales.DBCustomers. And now it is very important to do the following thing:
we have to define the list of columns that we want to include in this index, and the order of that list is
crucial. The first column is the leading column of the index, so it should be a column that your queries always filter
on. Our query filters on the country and the score, so the first column is going to be the
country and then the score. Let's go and create this index. If you go to the indexes over here, you can
see that we have created our new index. Now, once you create such an index, SQL has to keep it
updated together with the table, so you have to be committed to it and actually use it in your queries.
One note on the order: what matters is the order of the columns in the index definition, not the order in which
you write the conditions in the WHERE clause. The optimizer treats country then score and score then country in
the WHERE clause as the same filter, so both forms can use this index. Still, writing the conditions in the same order as the
index is a good habit, because it makes it obvious which index the query is meant to use. What does change
things is which columns you filter on. If your queries mostly filter on the score alone, you either adjust your queries or
recreate the index with the columns switched. So be very careful with
composite indexes: the column order is very
crucial, so choose it deliberately. And now you might say: you know what, now we have
a nice index for those two columns, but what is going to happen if I use only one of them in my query, for example
the country? Will SQL still use this index even though I
don't filter on the score? Well, yes, because it follows the leftmost prefix rule. This means SQL can use the index as long as
your query filters on the leftmost columns. In our index, country is on the left, and that's why it works here. But
if you skip the left column, it will not work. If you, for example, filter
only on the score, higher than 500, we have skipped the
country in this query, and that's why SQL cannot seek on this index. So as long as you are including the left columns, it will
work even if it is only one column. In this scenario, the first query is going to use the index and the second one
will not. Now let me give you a very simple example in order to understand how this works. Let's say
that we have an index on four columns: A, B, C, D. Now, if your query filters on column A, the index is going
to be used. The same thing happens if you filter on A and B: if you're using those two columns, you will
be using the index. So those are cases where the index will be used. Now let's look at the scenarios where the index won't
be used. For example, if you jump immediately to column B, you are not using the left column,
A, and that's why you will not be using the index. As well, if your query uses A and skips B,
so you have A and then C, the index cannot help with C: at best SQL can seek on A only. So you always have
to use the left columns without gaps. If you are using A, B, C, you will be using the index. And if you are
using A, B and then you jump and skip to
D, only the A, B part can be used for the seek, not D. So this is what we mean by the leftmost prefix rule for the
composite index. If you're using multiple columns inside one index, be careful with the order of the columns
that you are defining. All right. So that's all for this category, the clustered and nonclustered indexes. Now we're going
to move to the second category, where we talk about the indexes by storage: the row store and the column store.
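Before moving on to the storage types, the statements from the clustered, nonclustered and composite index walkthrough above can be sketched in T-SQL as follows. The object names (`Sales.DBCustomers`, the `idx_DBCustomers_*` index names, and the columns `CustomerID`, `FirstName`, `LastName`, `Country`, `Score`) are assumptions reconstructed from the spoken names, so adjust them to your own schema.

```sql
-- Assumed names: Sales.DBCustomers and its columns are inferred from the narration.

-- Clustered row store index: only one is allowed per table.
CREATE CLUSTERED INDEX idx_DBCustomers_CustomerID
ON Sales.DBCustomers (CustomerID);

-- To change the clustered key, drop the index first (the table becomes a heap again):
-- DROP INDEX idx_DBCustomers_CustomerID ON Sales.DBCustomers;

-- Nonclustered indexes: several per table are allowed.
CREATE NONCLUSTERED INDEX idx_DBCustomers_LastName
ON Sales.DBCustomers (LastName);

-- NONCLUSTERED is the default, so the keyword can be omitted.
CREATE INDEX idx_DBCustomers_FirstName
ON Sales.DBCustomers (FirstName);

-- Composite index: the column order in this list defines the leftmost prefix.
CREATE INDEX idx_DBCustomers_CountryScore
ON Sales.DBCustomers (Country, Score);

-- These filter on the leading column, so a seek on the composite index is possible.
SELECT * FROM Sales.DBCustomers WHERE Country = 'USA' AND Score > 500;
SELECT * FROM Sales.DBCustomers WHERE Country = 'USA';

-- This skips the leading column, so a seek on the composite index is not possible.
SELECT * FROM Sales.DBCustomers WHERE Score > 500;
```

Whether the optimizer really picks a seek is visible in the actual execution plan; on a five-row table it may well prefer a scan regardless of the indexes.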
So now let's say that we have a table with multiple rows and multiple columns. If we use a row store index, which is
the classical one, what is going to happen? Our table is going to be split into groups of rows, and as we learned, each
group of rows is going to be stored inside a data page. That means we are organizing the data row by row, which
means all the columns of each row are going to be stored together. This is the traditional way databases
organize their data, where the information is stored row by row. On the other hand, if you use a column
store index, SQL is going to split your table into separate columns and then
store the values of one column together in a data page. That means if you open a data page, you will find only
the values of one column; you will not find the entire row. If it's the first name, you will see only the first
names, not the last names, in this data page. So if you compare them, the row store
index stores the data row by row and the column store index stores the data column by column. This is a very
high-level representation of how the column store index is stored. As you know me, we go into the details in order to understand
exactly how SQL works with the column store index. So let's go. All right. So now let's say that we have
a table for the customers with three columns: ID, name and status. We also have around 2 million rows, 2 million
customers. As we learned, by default the table is going to be built as a heap structure where the rows are
stored row by row inside data pages. But now we create a column store index on top of this table. Once
you do that, SQL goes through a process in order to build the column store. In the first step, SQL is going to
divide the rows into row groups. In SQL Server each row group can contain around 1 million rows (at most 1,048,576).
So in this example our table is going to be split into two row groups: the first 1 million rows in one group and the
second million in another row group. Now you might ask me: we are talking about columns, so why are we splitting the rows?
Well, this is just a preliminary step in order to optimize the performance and to allow parallel processing. And of course,
the data will not be stored like this, because we have the second step. In the next step, SQL is going to
segment the columns. SQL goes through each row group and splits the data by column, and that's why
we call it a column store: we are separating the columns from each other. That means we have one segment for
the ID, another one for the name and a third one for the status, and this happens for each row group. Now we
move to the third step in this process: data compression. This is the most important step in
this process, because it is the reason why the column store is very fast compared to the row store. In this step
there are different techniques for compressing the data, and the most well-known one is to
create a dictionary. Let's take for example the column status, the status of the customer, whether it is active or
inactive. The words active and inactive are going to be repeated 2 million times because we have 2 million
customers, and since these are strings, they take a lot of storage. Instead of that, SQL is
going to compress the data. First it creates a dictionary that replaces the values active
and inactive with smaller values like 1 and 2, so we have a mapping from the long value to a small value.
After that, SQL stores a data stream that contains only those two small values: 1, 2, 1, 2 and so on, so we
have one big stream of 2 million entries. SQL does this for each column, and with that the size
of each column is going to change, depending of course on how many distinct values you have in each column. So this
step is very important in order to reduce the size of the data and also to increase the performance. Now, once
everything is organized and compressed, SQL starts storing the results in data pages. But SQL will not
use the standard data pages that we have learned about previously. Instead it is going to use a special page type called LOB, large
object page. Let's quickly compare the structure of the normal data page that we have learned in the row
store with the new one in the column store, the LOB data page. As usual, each page has a header; this is the same as
any data page. But the next section is the segment header. It holds metadata about the
column segment that is stored in this page: we have the segment ID, the row group ID, the column ID, and
also a very important piece of information, the ID of the dictionary page. The dictionary page is also a type of
page in SQL. It has a header as well, but inside it we have a mapping: it maps the original value, the long
one, for example inactive, to the smaller version of this value, for example 2. And that's all for the dictionary page: it
has the mapping between the original values and the smaller values. Beneath the segment header, we have
the important place where our data is stored, the data stream. It is a sequence of IDs from the
dictionary that represent the values of the column, side by side. And of course, we cannot fit the whole 1 million rows
inside one data stream, so we're going to have multiple LOB data pages. So this is exactly how SQL stores your
data if you decide to go with the column store. Now back to the process. As you
can see, SQL is storing the data in LOB data pages. This is the last step, and with that SQL has converted your table
into a column store. Now, we cannot just create a column store without defining whether it is a clustered index
or a nonclustered index. Let's start with the first one, the clustered column store index. If you create such an
index, SQL of course will not be building a B-tree structure. SQL is going to use exactly this structure, the column store
structure. As we learned, the clustered index is a complete makeover of your table: when you apply it, SQL is going
to format everything column-wise, and it fully replaces the old row-based table structure that we had at the
start. So once you apply the clustered column store index, it will not leave anything behind and your table is going to
be completely structured as a column store. One more thing, which makes sense of course: all the columns from the
original table are going to be converted to the column store, so nothing is left behind. On the other
hand, if you are using a nonclustered column store index, as we learned, it is like a companion to your existing table.
So it coexists with the table and it will not replace anything. The column store index is an additional structure
that is stored beside your table. That means the original table will not be replaced, unlike with the clustered
column store index. The data stays in the old row-based storage, the regular table, and your data
is also stored in a separate structure, the column store index. And of course, with the nonclustered column
store index, since we are creating an extra index outside of your original table, you can define which
columns should be included in this process. It does not have to be all the columns. You can go, for example, with only the
status. That means you build a column store index for only one column, the status of the customers. So this is what
we mean by the clustered column store index and the nonclustered column store index. All right friends, so now you
might ask me why we are doing all of this. Why would I split my data by columns? Well, it's all because of
analytics. In analytics we have big, complex queries with a lot of data aggregations on
big tables, and the column store index is designed precisely to improve the performance of such big queries.
That's why databases like SQL Server and also BI tools like Tableau and Power BI adopted this method in order
to offer fast platforms for data analysis. So now let's understand exactly why the column store index is
much faster for data analysis than the row store index. So let's go. Again we have the customers table, and let's
say we have five customers with ID, name and status. As we learned before, if we are using a row store
index, the data is stored in multiple data pages, and in each data page we have whole records, all the
information about each customer. For this example we're going to have
three data pages. But if you are using the column store index, the data is
stored a little bit differently. The first column, the ID, is going to be stored
in one data page, and here SQL will not build a dictionary, because the IDs are already short, so we
have one data stream with all the IDs. The next column, name, is going to be stored in a separate data
page, with an extra dictionary page where each name is mapped to one small value, so the
data is compressed and we save storage. Then the database creates one more data page for the third column, the
status, and the dictionary here is going to be very small: for active we have 1
and for inactive we have 2, and in the data stream we store only the IDs from the dictionary.
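The dictionary idea can be imitated with ordinary T-SQL to make it concrete. This is only a conceptual sketch: the engine builds its dictionaries and data streams internally, and the five sample customers below (names and statuses) are made up for illustration.

```sql
-- Hypothetical sample data: five customers, three of them active.
DECLARE @Customers TABLE (ID INT, Name VARCHAR(50), Status VARCHAR(10));

INSERT INTO @Customers (ID, Name, Status) VALUES
    (1, 'Anna',  'Active'),
    (2, 'Max',   'Inactive'),
    (3, 'Maria', 'Active'),
    (4, 'John',  'Active'),
    (5, 'Peter', 'Inactive');

-- "Dictionary": one small integer per distinct status value.
-- Ordering by Status assigns the IDs alphabetically (Active before Inactive).
SELECT DISTINCT
    Status,
    DENSE_RANK() OVER (ORDER BY Status) AS DictID
FROM @Customers;

-- "Data stream": the status column rewritten as dictionary IDs, in row order.
SELECT
    ID,
    DENSE_RANK() OVER (ORDER BY Status) AS StatusDictID
FROM @Customers
ORDER BY ID;
```

The second result set is much narrower than the original strings, which is the effect the compression step relies on when a column has few distinct values.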
So now let's understand why the column store is faster. Let's take the following query: we want to find the
total number of customers that are active. So we have the query SELECT COUNT(*) FROM Customers, and we
filter the data by the status, where it is equal to active. Now if we query the table with the row store,
what happens? SQL first has to collect the data. It goes to the first data page and collects
the first two customers, then to the second page, to the third and so on. As you can see, SQL here is reading
everything, the whole row, the ID, the name, the status, even though for the query we don't actually need all
that information. We just need to count how many customers have the status active, but SQL still cannot
selectively read only the status; it has to read the whole record. After SQL has all the data, it
filters the data, so it removes the inactive rows, and then SQL does the aggregate operation on
the three rows that remain. That's why the total count of active customers is going to be three. But now
let's see how SQL is going to query the column store. SQL first has to analyze: okay, which columns do I actually need
for this query? Well, we need only the status. So SQL will not open all three data pages and read them.
SQL will target only one data page, the one where we have the column status. It takes this very
simple data stream, looks up the dictionary, and removes all the values
that are equal to 2. So in the intermediate output we have only three values, and SQL does a very quick
count of those values. In the output we get three as well, the total number of active customers. Now if you
compare the intermediate result sets from the row store and the column store, you can see that in the row store we
have fetched a lot of unnecessary information for this query, and this of course makes the
query slow. The column store, in contrast, reads exactly what it needs for this aggregation: it didn't read
any extra information about the names of the customers or the IDs, and it didn't open any extra data pages. It gets exactly
the data that it needs for the aggregation, and that's exactly why
queries with aggregations and data analysis are much faster on a
column store compared to the row store. So
that's why we use the column store for big data and data analytics. All right. So now let's summarize the differences
between the row store and the column store indexes side by side. Let's start with the definition. The row store
organizes and stores the data row by row, which is a really nice method if you need a lot of columns from
one row. On the other hand, the column store index stores and organizes the data column by
column, which is really great if you're focusing on specific columns. Now, if we are talking about storage
efficiency, the row store index is going to take more space compared to the column store index, and that's because, as we
learned, the column store compresses the data, which saves a lot of storage if you have large tables.
Now to the next point, which is more important: the performance, that is, the read and write optimization. We can say that
for the row store things are more balanced, so you get a decent speed for both write and read operations. But
things in the column store are different. It is fast for reading, especially if you are doing data analytics, but writing
data, like inserting and updating, is slower, because as we learned there are multiple steps until the data is
written to the pages. So on the one hand you are optimizing the speed of your analytical queries, but on the other hand
changing data is slower than with the row store index. Now let's talk about the next point, input and output efficiency.
Well, the row store index is not really good here, because you are retrieving a lot of columns, so a lot of data has to be read
from the disk storage in order to answer your queries. On the other hand, for the column store the I/O is lower, and that's
because it targets exactly the data and columns that are needed for the query. So there will generally be less data
read from the disk storage, and of course that's why we are getting fast read performance. Now, if you are
wondering which systems are best for the row store index: the row store index is very suitable for OLTP systems,
online transaction processing systems like banking and e-commerce systems, where full records are accessed very
frequently. On the other hand, the column store index is great for OLAP. OLAP systems are online analytical
processing systems, where you have data warehouses, data lakes, business intelligence. You are building reports
and analyses, you have large data sets and very complicated aggregate queries. So if you have such a project, then the
column store index is the way to go. That means the use case for the row store index is high-frequency
transactions, where the system has to quickly access records, and the use case for the column store is big data
analytics, where SQL has to scan large data sets. So those are the main differences between the row store index
and the column store index. All right. So now let's check the syntax of the column store index. Well,
it is really easy. What we do is put the COLUMNSTORE keyword between CLUSTERED or
NONCLUSTERED and INDEX. Once you specify that, you are telling SQL you want to create a column store index,
and the rest stays as it is. Now if you want to create a row store index, you don't have to specify
anything; there is no keyword for the row store. As we learned before, we can create a nonclustered index
or a clustered index, and both of those statements tell SQL we are creating a row store index. But if you
use the COLUMNSTORE keyword, then you are telling SQL that you want to create either a clustered or a nonclustered
column store index. And here there is a syntax rule: if you are creating a clustered column store index, then you must not
specify anything for the columns. You cannot specify an ID or country or any columns here,
because it makes no sense: once you say clustered column store, all the columns are going to be included in the new
structure. So this is the syntax of the column store index. All right. So back to SQL, let's check how we can create a
column store index. If you check our table here, DBCustomers, that we created previously and go to the
indexes, you can see that we have created a few indexes, and one of them is the clustered index. This one is a row store
index, so our table is organized by rows. Now let's change that. Let's make our table organized by
columns using the column store. So we're going to say CREATE
CLUSTERED COLUMNSTORE INDEX, and we give it a name, index DBCustomers, and it's going to be on the
table Sales.DBCustomers. And here, if you specify a column, it's going to be a mistake. So let's go and check that.
If you execute it, it fails with an error saying a key list, so the columns, is not allowed. So we cannot have this.
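The statement that fails here, and its corrected form, might look like this. The index name `idx_DBCustomers_CS` and the column `CustomerID` are assumptions; the narration only says that the name refers to DBCustomers.

```sql
-- Fails: a clustered column store index does not accept a column list,
-- because it always covers every column of the table.
-- CREATE CLUSTERED COLUMNSTORE INDEX idx_DBCustomers_CS
-- ON Sales.DBCustomers (CustomerID);

-- Valid syntax: no column list.
-- Note: this only succeeds once any existing clustered index on the table has been dropped.
CREATE CLUSTERED COLUMNSTORE INDEX idx_DBCustomers_CS
ON Sales.DBCustomers;
```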
So let's remove it, and now we have the correct syntax. Let's execute it again. We get another error, because it
says that one table cannot have more than one clustered index, and we already have one. You have to decide: do you
want to organize your table by columns or by rows? That's why we have to drop the previous index. So we're going
to do it like this: DROP INDEX, then the name of the index, and then we have to specify the table
name. That's it. Let's drop the index. Now if you refresh, we can no longer see our clustered index, and our
CREATE statement should work. So let's run that. Now let's check the indexes again, and as you can see, we got a new
clustered index, but this time it is a column store. You can see that at the start we have an icon that looks
like a bar chart, like analytics and reports, and that's because the main purpose of creating a column store is analytics
and reporting. Now of course we cannot create multiple clustered column store indexes; we can have at most one. So
now if you say, you know what, let's go and create another index for the first name, but this time a
column store. So I copy the whole statement over here, make it a nonclustered column store index,
call it for example FirstName, and define the FirstName column over here. That's it. Let's go and execute
it. You will see that we get an error where SQL tells us you cannot create multiple column store indexes.
That means you can create only one column store index for each table, and you have to decide whether it is
clustered or nonclustered. Unlike the row store, you cannot create multiple nonclustered indexes. So you are
allowed only one column store index. But this limitation applies here in SQL Server. In other databases, as far as I know,
multiple column store indexes are allowed, for example in Azure SQL. So now, in order
to practice creating a nonclustered column store index, you can drop the first one and
create the one that you need as a nonclustered index. So actually let's go and do that. Let's drop the first one.
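Putting the column store steps of this demo together, a sketch could look like this. The index names are assumptions that follow the naming convention used earlier; only the table and the FirstName column come from the narration.

```sql
-- 1) Remove the clustered row store index so the table can get a clustered column store.
DROP INDEX idx_DBCustomers_CustomerID ON Sales.DBCustomers;

-- 2) Clustered column store index: covers all columns, so no column list.
CREATE CLUSTERED COLUMNSTORE INDEX idx_DBCustomers_CS
ON Sales.DBCustomers;

-- 3) To switch to a nonclustered column store index, drop the clustered one first ...
DROP INDEX idx_DBCustomers_CS ON Sales.DBCustomers;

-- 4) ... then create the nonclustered column store index on selected columns only.
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_DBCustomers_CS_FirstName
ON Sales.DBCustomers (FirstName);
```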
So DROP INDEX, and this is our index on this table. Let's do that. Once it is dropped, creating the nonclustered column store
index is going to work. If you refresh over here, you will see that we have a nonclustered column store index
for the first name. Okay. So now, as we learned, the column store compresses the data, so the storage
needed for the entire table should be less than with the row store. So let's see whether that is really true.
Now, in order to check this, I will not use the database SalesDB, because everything there is small. We're
going to use another database, AdventureWorksDW2022; if you have a newer version, that's okay. So
what is the plan? We're going to create three identical copies of one table with different
structures. The first one is going to be a heap, the second one a row store structure and the third
one a column store structure, and then we're going to compare the storage of those three. So now we
have to pick one of the tables. We need one big table, so for example FactInternetSales. Let's see
how we can do that. Let's start with the heap structure. We're going to say SELECT * INTO a new table, so it's
going to be FactInternetSales and then underscore HP for the heap. And we're going to get it
from the table FactInternetSales, like this. And here it's very important: if you are switching databases, you have
to tell SQL which database to use, so USE AdventureWorksDW2022. Execute this at the start to make
sure that you are switching to the new database. Now let's go and execute our heap statement. With that we have
created the heap table, as you can see with about 60,000 rows. Since we didn't define any clustered index, this table is going to be a
heap structure. Now let's create another table that uses a clustered row store index. What we're going to do is
copy the whole statement over here, call this one row store, and of course
change the name to RS, but we are still selecting from the same table. Let's execute this first. But now,
in order to make it a clustered row store, we have to create an index. So it is going to be like this: CREATE
CLUSTERED INDEX. We don't have to specify row store, because that is the default.
Let's call it index FactInternetSales RS and then PK for the primary
key. Now we need the table, FactInternetSales_RS, and then the columns of the primary key. Well,
actually I don't know what the primary key is, so let's go and check that. It is a composite primary key, so
it's going to be the SalesOrderNumber and the SalesOrderLineNumber, like this. Let's go and execute this. With
that we have a clustered row store index. Let's check what we have over here and refresh
everything. We now have two tables, the heap and the row store. Let's expand the row store table and check the indexes. As
you can see, we have the clustered index. Now we need the third table. It's going to be the column store. I'm just
going to copy the whole thing over here. This is the column store, so it is going to be CS here and CS there, and of course
we don't need any columns for the column store. Don't forget to add the COLUMNSTORE keyword, so CREATE CLUSTERED
COLUMNSTORE INDEX, and we have to rename the index as well. Let's go and execute our new statements: we first create the
table and then we convert it to a column store. So let's do that, refresh and check our
tables. This is our third table; let's check the indexes, and we have a clustered column store index. All right.
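The three copies built in this demo can be sketched as one script. Table names follow the narration (`_HP`, `_RS`, `_CS` suffixes); the exact index names are assumptions. The final `sp_spaceused` calls are an addition that was not part of the demo: they report data and index size per table as an alternative to opening the table properties.

```sql
USE AdventureWorksDW2022;
GO

-- Heap: SELECT INTO creates a table without any clustered index.
SELECT * INTO FactInternetSales_HP FROM FactInternetSales;

-- Row store: same copy plus a clustered index on the composite primary key columns.
SELECT * INTO FactInternetSales_RS FROM FactInternetSales;

CREATE CLUSTERED INDEX idx_FactInternetSales_RS_PK
ON FactInternetSales_RS (SalesOrderNumber, SalesOrderLineNumber);

-- Column store: same copy converted with a clustered column store index (no column list).
SELECT * INTO FactInternetSales_CS FROM FactInternetSales;

CREATE CLUSTERED COLUMNSTORE INDEX idx_FactInternetSales_CS_PK
ON FactInternetSales_CS;

-- Not part of the demo: compare reserved, data and index size per table.
EXEC sp_spaceused 'FactInternetSales_HP';
EXEC sp_spaceused 'FactInternetSales_RS';
EXEC sp_spaceused 'FactInternetSales_CS';
```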
So now we are done. We have our three different tables. Now let's check the storage of those three tables. So
now let's check our first table, the heap table. Right-click on it and go to Properties. We can see
a lot of information about our table here, but we are interested in the
storage, so click here on the Storage page. Now we can see some information about the storage, and one
of the values is the data space. It is around
9 MB, and the index space is almost nothing, so we don't have anything there. This is the storage of the heap
structure; we don't have any indexes. Let's go now to the row store. We go to the RS table and its properties,
then to Storage. As you can see, the data space is exactly the same, and that's because whether it
is a heap or a row store index, the data is stored in data pages as rows, so the size of the data itself does not
change; it is just sorted differently. What changed here is the size of the index: now we are consuming more storage
for the index. That means the overall storage of the table with a clustered row store index is more than the heap
structure. Let's now check our column store. We go to the CS table and its properties. And now it
is interesting to see whether our table is getting smaller. Let's go to the storage. As you can see, the data
space is around 1 megabyte compared to the 9 megabytes. I know those are small numbers, but it is still a massive
reduction, because everything is compressed. And of course we are not using any index space, because we don't
have the B-tree structure in the column store. So as you can see, compared to the others it is the winner: the
table that is using the column store is consuming far less storage than the others. If you want to rank them
by storage, the best one is the column store table, the next one is the table with the heap structure,
and the worst one is the table with the row store clustered index. So it is true: the column store index consumes
less space than the other types of indexes. All right. So now what is a unique index? A unique index is a special
type of index that makes sure there are no duplicates in your data. There are a couple of reasons why it is
important to have a unique index. The first and most obvious reason is data integrity. The unique
index enforces uniqueness in your data, and that is very helpful, for example, if you have a column like
an email address or a product ID. Having duplicates in such columns can mess up your data very badly, so having a unique
index on a column like an email makes sure there are no sneaky duplicates inside your data. The second
reason why a unique index is important is performance. For example, if you are searching for a
specific email, SQL starts searching for the email value, and once SQL finds the value, it can
stop searching, because it is sure that there are no duplicates in the data. So with that you are improving the
performance of your queries. If you are creating an index and you know the column is unique, then make sure to make
the index a unique index. Now if you have a look again at our clustered index, where we have the B-tree structure: if you
make this index as unique then you are giving an extra task for the SQL that's going to go and make sure that all those
ids of the customer going to be unique. So SQL has to guarantee that there are no duplicates at all inside your data in
the databases. So now since we are giving SQL an extra task to prove the uniqueness of the data building the
clustered index going to be little bit slower. So that means inserting new data writing data going to be slower as the
normal clustered index. But now if you are talking about the read performance the performance of our query it's going
to be optimized a little bit faster than a normal clustered index. So again this tradeoff we are making writing data
slower but we are gaining more speed on the query performance. So this is what we mean with unique index. Okay. So
let's keep extending the syntax of the index. So now in order to tell whether it is unique or not we can specify it
right at the start. So we say CREATE UNIQUE, just before CLUSTERED or NONCLUSTERED, and then afterward the
columnstore keyword, and nothing changes for the rest. So with this keyword we tell T-SQL the index should be unique. And if you don't
write anything before CLUSTERED, the index is going to be non-unique. So for example, this one says create an
index. So we didn't specify anything here, and duplicates are allowed in the index. But if you go and specify a
unique index, then duplicates are not allowed. So it is very simple.

Okay. So now let's go and create a unique
index. Now let's go and target the table products. Let's go and first select the data from the table. So Sales.Products
and execute it. Now let's say that I'm going to go and create a unique index on the column category. Let's go
and try it. So CREATE UNIQUE NONCLUSTERED INDEX, and let's give it the name index products
category, on the table Sales.Products, and we are targeting the column category. So let's go and execute it. Now we get
an error because the category has duplicates. So if you go and query our table again, you can see we have
duplicate values here, and SQL cannot create a unique index for this table. It's too late. But you can still create
this index if the table is empty, and SQL will then not allow you to insert any duplicate categories. And of
course it makes no sense to have a unique index on the categories because of course we're going to get duplicates
here. But maybe you say, you know what, my products are unique. The product name should be unique, and we are not allowed
to have two products with the same name in this table. So if you have such a rule in your business, you can go and
define a unique index for the products. So let's go and do that. Now we're going to go and replace the category with the
product, and the same thing over here. So we are targeting the column product. Let's go and execute it. As you can see,
now it is working because we don't have any duplicates inside the table products. And if you go and check the
indexes over here, we can see our new index. And as you can see at the start here, it says it is a unique nonclustered
index. Now let's go and test the data integrity. Are we allowed to add a duplicate to this table? So let's go and
try that out. Let's have an insert statement. Let's say INSERT INTO Sales.Products. And I would like to
insert only the product ID and the product name, so we're going to insert two values. VALUES, let's say we're going to
have a new ID, 106, but we're going to go and insert a duplicate for the product name. So we're going to say caps. We
already have a product called caps over here. So we are now inserting a duplicate. Let's go and try it. Now
you get an error saying you cannot insert duplicates into this table because we have a unique index. So as you can see,
this index is now helping us and improving the quality of my table. So this is how we work with the unique
index in SQL.

Okay. So now what is a filtered index? A filtered index is a regular
index but with a twist: it only includes rows that meet a specific condition. So let's understand what this means. So
again we have our nonclustered index and the B-tree structure. So now at the leaf nodes we get only the IDs, the data,
that fulfill a specific condition. So for example, if we say we want only the active customers, this is the
condition. So that means on the leaf nodes we have only the customer IDs that are active, and any inactive
customer is not included at all in the data pages and the nodes. So that means our B-tree structure is going to be a little
bit smaller than usual because we have less data included in the structure. So our index is going to be smaller than the
regular nonclustered index. So now the question is why is it important to have a filtered index? Well, the biggest
benefit is that we get targeted optimization. So for example, say our analysis always focuses on the active
users and the inactive users are totally irrelevant. Then having only the relevant subset of data in the index
makes the whole index much smaller, which leads to faster performance. So it's going to be faster
to query this filtered B-tree structure. So that means we are doing targeted optimization and we are improving the
query performance. Now the second benefit is about storage: since the size of the B-tree structure is going
to be smaller, we need less storage space to store the index, which is a great thing if
you have large tables in your database. So the filtered index makes the structure of the index smaller, which
improves speed and performance and also reduces the storage that is needed for your index.

Okay. So now let's check the syntax of the filtered index. It's very simple. It's like any query: you can go and add
the WHERE clause at the end of the index creation, and then the condition, as you do in any SELECT statement.
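
As a rough sketch of that syntax in T-SQL: a filtered index is an ordinary CREATE INDEX statement with a WHERE clause appended. The table Sales.Customers and the column Country follow the example used in this part of the course; the index names and the exact spelling of the identifiers are assumptions made for illustration.

```sql
-- Regular nonclustered index: every row of the table gets an index entry.
-- (Index name is an assumed spelling.)
CREATE NONCLUSTERED INDEX idx_Customers_Country
ON Sales.Customers (Country);

-- Filtered nonclustered index: only rows that satisfy the WHERE predicate
-- are stored in the index, so the B-tree is smaller.
-- (Second index name is hypothetical, chosen to avoid a name clash.)
CREATE NONCLUSTERED INDEX idx_Customers_Country_USA
ON Sales.Customers (Country)
WHERE Country = 'USA';
```

A query that filters on `Country = 'USA'` can be served by the second index; a query for any other country cannot, because those rows are not in it.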
But SQL Server is restrictive with this type of index. You cannot put a filter on a clustered index,
so it is only allowed for the nonclustered index. That makes sense: if you create a clustered index, the entire
table has to be reorganized and ordered, so it cannot work for only a subset of the data. And as well you cannot create a
filtered index on a clustered columnstore (newer SQL Server versions do accept a filter on a nonclustered columnstore
index). So here we use it with the rowstore, and you can go and combine the
unique index together with the filtered index. There are no restrictions. So it's going to be like this: CREATE UNIQUE
NONCLUSTERED INDEX on the table, and then you specify the WHERE condition. So this is the syntax of the filtered index, and
we have these restrictions.

All right. So now let's say that we have the following query where we are selecting
data from customers, but in our program or in our report we are always selecting only the customers from the USA.
So we have the following condition. It says WHERE country equals USA, and execute. So this is the basis of many
queries that we have in our project, and we are always filtering the customers based on the country. So in one query we
are finding maybe the top customers, and in another query we are finding the average of scores, and so on. But we are always
filtering the data like this: WHERE country equals USA. So now, since we are using this column a lot and our
table may grow to millions of records, we can go and create a nonclustered index on this column. So
the usual way, we go over here and say CREATE NONCLUSTERED INDEX, and we call it like
this, index customers country, and then it's going to be on the
table Sales.Customers, and we select the column country like this. So if you do it like this, SQL is going to go and create
a nonclustered index for all customers, not only those from the USA but for everything, even for the customers that come
from Germany, which is not really necessary because in our project we only focus on the customers from the USA. So instead of that
we can go and include the WHERE condition inside our index. So it's very simple: we're going to go and say WHERE country
equals USA, exactly like our query. So now the index that's going to be built will be focused and targeted only on a
subset of the data, only on the data that fulfills this condition. So now let's go and create our filtered index, and it is
working. Let's go and check our indexes on the customers. So let's go to the indexes over here and refresh. Now we
can see our index over here. It says it is non-unique because we didn't define anything at the start. So duplicates are
allowed, of course, which is what we defined here. And as well it is filtered. So it doesn't contain all the
rows from your table. It contains only the rows that fulfill our condition. So that means now if I go and execute this
query, the index can be used because the rows of this query are included in the index. But if I go over
here and say Germany and execute the query, it's going to be slower because the rows of that query are not
part of our index. So this index will not be used at all to improve that query. So this is how we work with
the filtered index in SQL.

All right. So now we're going to summarize and talk quickly about how to
use the right index. So when do we use which type? Let's start with the first one. We have the heap structure. So as
we learned, it is a table without any index. So in which scenario do we not need any indexes? In case you want
to have fast inserts. So if you want to have fast write performance, then don't use any index. So you stay with the
default, with the heap structure of your table, and we usually use it for less important tables like staging tables
or temporary tables, where we want to insert the data fast and then get rid of the data later. So here there is no need
to use any index. Now if we are talking about the clustered index, we usually use the clustered index for
primary keys. It is even the default in the database: if you create a primary key, then SQL is going to go and create a
clustered index. So this is the main usage of the clustered index: you use it for the primary keys. And if there is
no primary key in your table, then you can go and pick another column where sorting the data is important, like for
example a date column. So it could be a good candidate for your clustered index. Now moving on to another type, we have
the columnstore index. So when I said clustered index here, I meant the clustered rowstore index, of course. But now the
question is when do we use the columnstore index? If you have big, complex analytical queries where you are
aggregating a lot of data, then go for the columnstore index because it is going to give you
amazing performance. And as well, if you are struggling with the size of tables, so if you have a super large table, you
can go and use the columnstore index because it can compress the data and reduce the size of the whole table.
So for those scenarios we use the columnstore index. So again, we usually use the rowstore clustered index for
OLTP systems where you have a lot of transactions and so on, but we usually use the columnstore for
OLAP systems where you have a data warehouse, reporting system, business intelligence, and so on. Now moving on to
another type, we have the nonclustered index. We usually use this index for non-primary-key columns. So that means the
rest of the columns of your tables could be candidates for a nonclustered index. And there are a lot of reasons why you
would do that. For example, for the foreign keys, or for the columns that are used to join two
tables, and another place where you can use the nonclustered index is for the columns that are used in the WHERE
clause. So there are many scenarios where we can use the nonclustered index, but not for the primary keys. Now moving
on to another type, we have the filtered index. We use it in order to target a subset of data. So if in our queries and
analysis we are only focusing on a subset of the data all the time, it makes no sense to have one big index for all the
data; we can use the filtered index to have a focused index. And of course, if the size of the index is a problem, then you
can use a filtered index to reduce the overall storage of the index. And then the last type:
we have the unique index. You can go and use the unique index to ensure the data integrity of your table, and as well
it might slightly improve the performance of your query, and that's because SQL has less work to do if the index is unique:
once SQL finds a match, it stops the search. So this is a quick summary and guide on when to use which index
type, which usually helps me find the right index.

All right, friends. So now let's say
that you have created your indexes in your database, your queries are optimized, and you have fast performance,
but the job is not done yet,
because over time the indexes get fragmented, outdated, or unused, and this is going to lead to poor performance in
your queries, and as well it is going to increase the storage costs, and the overall performance of your database
is going to drop. So indexes are like having a car: it needs maintenance. You need to change the oil and the tires of
the car. And the same thing goes for the indexes. You have to maintain them. They need attention to keep everything
running smoothly. So now I'm going to show you how I manage, maintain, and monitor the indexes of my SQL projects.
So let's go.

The first and most important task is to monitor the usage of your
indexes. So of course the first question we have to ask ourselves over time is: are we really using the indexes that we have
created? Are they really helping to improve the speed of my queries, or was it just a good idea at the start of the
project, and later no one used those indexes? This is very crucial because if you have an unused index, you are
consuming unnecessary storage space, and as well the write performance on the tables is going to be slower, which is
completely unnecessary if you are not using the index. So now our task is to find out the usage of each index that
you have in the project. So let's see how we can do that. So previously we created multiple indexes on
the table DB customers. So if you go to DB customers and to the indexes, you can see that we have four indexes. Now
we can go and show this information by using a special system stored procedure from SQL Server called sp_helpindex.
Let's go and do that. So sp_helpindex. It is a system stored procedure that comes with the database. This stored
procedure needs only one value, and that is the table name. So we have it over here, Sales.DBCustomers. Let's go and
run it. So we have four indexes. Then we have a nice description of each index. So it says whether it is a nonclustered index and
whether it is columnstore. And it says where it is located. So it says it is located on PRIMARY. PRIMARY is the name
of the filegroup where the data is stored, and by default everything is stored on PRIMARY. And the next
information we have is the index keys. It is nice information to understand which keys, or which columns, are used
for the index. So for the first one you can see we have two columns; that means it is a composite index. And of course for the
columnstore we don't have any columns, and then we have the first name and last name. So this is a really nice, quick
stored procedure to see information about our indexes.

Okay. So now let's focus on our task on how to
monitor the usage of the indexes. Now in databases we have a lot of schemas and tables that record the metadata of our
database. And in SQL Server, we have a special schema called sys where you can find a lot of metadata about
SQL Server, metadata like the description of the tables, views, columns, and as well the indexes. So now
let's check what we can find inside the table indexes. So let's go and do it. SELECT star FROM sys; this is the schema
name. And then as you can see we have a list of many objects, but we want to focus on the indexes. Now let's go and
execute it. Now we get a huge list of all the indexes that we have and a lot of information for each index. We don't
have to go and understand each column now. But I'm going to go and select the most important information from
this table. So what do we need? The object ID. This is the table ID. So the object
ID, and we have the name. It is the index name. And then here we have nice information on whether it is clustered or
nonclustered. So let's go and select it, type_desc, and let's call it index type, and we can go and check whether it is a
primary key or not. So let's get this information as well, is_primary_key. I will go and just rename it to is primary
key. And what else do we need? Whether it is unique. So it is nice information to have as well. So
is_unique. So of course you can go and grab a lot of stuff. It depends really on
what you are monitoring. So for example, I'm going to go and check whether it is disabled or not. So
is_disabled, and I'll just rename it. So with that I have focused monitoring. I don't have to have all
that information. So let's go and execute. But now I would like to go and change a few things. For example, I
don't want the object ID; I would like to have the full name of the table. And as well there are a lot of indexes that
are irrelevant for my database. So now in order to do that we have to go and get the information from another metadata
table. So let's go and give this table an alias for the index, and let's go and join it with another metadata table. It's called tables. So
tbl, and we're going to go and join it on the index object ID equal to the table object ID. And now
if you would like to see the content of this table, we can go and query it separately. So SELECT star FROM our new table. So
let's see the content of this table. You can see we have the name, which is the table name. And I think that's all
we need. We have a lot of other information about the table, but I just need the table name. So let's go
and put it at the start: tbl name AS table name, and I don't need the object
ID anymore. But of course we have to go and use the alias for each of those columns in order to make clear that
they come from the index table. So let's go and do that. All right. So my query is ready. Let's go and execute it
again. So now as you can see we are getting the table name, and the list is very short because it only covers
the tables that you have in the database. And this filtering happens because of the inner join. But one more
thing: I would like to go and sort the data. So I'm going to say ORDER BY; I would like to sort it by the table name
and then the index name. All right. So now let's go and check, for example, the table customers. You can see that we
have two nonclustered indexes, and one of them is a columnstore index. Those two we created in the previous tutorial,
and we have an index on the primary key; as you can see here, is primary key equals one, and this one is unique as well. So
with that we have a really nice list of all the indexes that we have in our database. But we are not there yet
because our task is to monitor the usage of the indexes. Now in order to get the usage for each of those indexes, we
have to go to a special kind of view called a dynamic management view. There SQL Server provides a lot of
statistics about the usage of each index. And we can find it as well in the same schema. So let's go and query this
view. So it's going to be SELECT star FROM, so the same schema, sys, and then
dm_db_index_usage_stats. So let's go and explore this view and execute it. Now in those statistics we can find the
usage of two indexes, index number three and index number one. And we can see there are three usage entries for
index number one. And next we have user seeks, user scans, and user lookups. So this is how many times the index was
used for seeks, scans, or lookups. We will understand this information as we learn about the execution plan. And here
we have very nice information about how many times our index got updated. So as you can see here it is zero because I
didn't add any new data after creating the index. But of course all those numbers might be different on your side
because it depends on how many queries you ran while practicing. And you can find more information here about when
exactly the last usage of those indexes was, and much more useful information. So now let's go and integrate this view
with our query. So what I'm going to do is a left join, because if I do an inner join, I will only find
the used indexes. But I don't want that, because I want to see a full picture of all my indexes in the database. So LEFT
JOIN, and we're going to go and get our view and call it s. And then we have to join it on the keys. So ON: I would
say let's go and grab the object ID equal to the index object ID. And of course we have to join on the index ID as well.
So it's going to be the index ID equal to the index ID, like this. Now we have to go and select some information from
this view. So I'm going to go and select all those usage counts. So from s let's get the user
seeks, the user scans, and the lookups, and maybe as well the user
updates. And it is really nice information to understand when it was last used. So last user
seek and the last user scan. Let me just correct it over here. And actually I can go and put
those two dates in one column, because if it's the last seek, the scan is going to be null over here, or the opposite:
when we have a value in one of them, the other is going to be null, and vice versa.
So we can do that using the null function COALESCE, like this, and we can get this
over here, and we can call the whole thing last update. So like this, and maybe I'm going
to go and rename all those columns. All right. So now we are done.
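
Put together, the report built in the steps above might look roughly like the following T-SQL. The system views and their columns are the real SQL Server ones; the aliases and output column names are assumptions, and the `database_id` condition is an addition that is not part of the walkthrough (the usage view is server-wide, so the extra condition keeps only rows for the current database).

```sql
SELECT
    tbl.name           AS TableName,
    idx.name           AS IndexName,
    idx.type_desc      AS IndexType,
    idx.is_primary_key AS IsPrimaryKey,
    idx.is_unique      AS IsUnique,
    idx.is_disabled    AS IsDisabled,
    s.user_seeks       AS UserSeeks,
    s.user_scans       AS UserScans,
    s.user_lookups     AS UserLookups,
    s.user_updates     AS UserUpdates,
    -- One "last used" date: take the seek date, fall back to the scan date.
    COALESCE(s.last_user_seek, s.last_user_scan) AS LastUpdate
FROM sys.indexes AS idx
INNER JOIN sys.tables AS tbl              -- keeps only indexes on user tables
    ON idx.object_id = tbl.object_id
LEFT JOIN sys.dm_db_index_usage_stats AS s -- LEFT JOIN keeps unused indexes too
    ON  s.object_id   = idx.object_id
    AND s.index_id    = idx.index_id
    AND s.database_id = DB_ID()            -- assumption: restrict to current database
ORDER BY tbl.name, idx.name;
```

Indexes with NULL in all usage columns have no recorded use since the statistics were last reset.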
Let's go and execute it. Okay. So let's go and check our new report over here. So this is our query, and let's start
with the first table, for example the customers, and go to the right side. And now we can see that we have three
indexes, and of these indexes we have only one index that is not used at all. So we can see over here that the
nonclustered index on the country is not being used, and that's because we have another index on the country that
comes from the columnstore. So it could be that you are querying the table by country, but SQL
says, I would like to go and use this index instead of the first one. So we can say, okay, this one is not really
useful, maybe we can go and drop it, right? And for the rest you can see, okay, this columnstore index is used twice and the
next one once. Again, the numbers on your side might be different. And if we have a look at all the other tables, we have
a lot of nulls, so that means of all those indexes that we created on DB customers, let me check, only one is used.
But now you might say, you know what, I've used the index, so why am I not seeing any numbers for it here? Well, that's
because those numbers do not live forever, and we are now using the Express edition locally on our PC. So each time
you shut down your PC and close the client, the database is going to shut down as well, and those statistics are going to be
lost because they are kept in memory. But in real projects the numbers are going to be totally different than here, and of
course you're going to get realistic numbers. Now let's try to target one of those unused indexes. So for example,
let's go with this index. It is the nonclustered index on the product. So let's go and query that. Currently it is
not used at all. So let me go and select from it: SELECT star FROM Sales.Products WHERE
product equals caps. So with that we have used the index, I think. Let's go back and run the report again, and let's go to our
index and check whether it is used. Well, it is correct. So our query did use this index, and we can see here it is used
once. And now you can go and analyze all the indexes that you have on the tables in your project, and you can see
whether your queries are really using them or not. And if you are not using the index, of course you have to
make a decision about it. Maybe, if you are working in a team, ask who created it and why. Maybe there is
one task in the database that is not run frequently. Maybe it's something that runs once a month
or something like that. So the index is needed, but not that frequently. But still, now we have insights about
what is going on with those indexes and whether we need them or not. And if you don't need them, go and drop them. All
right, my friends. So here is the secret, something that 90% of SQL developers don't do, and it's going to make you, in 1 minute, the
hero of the project. So once I join a project, and after saying hello to everyone, I open the database of the
project and run one query: I check the usage of the indexes of the project. And I can tell you, after working 15 years
with SQL, that 90% of indexes created in projects are totally untouched and unused. So I collect all the unused indexes
and discuss them with the team. And if we don't find real usage for those indexes, we go and drop them. So after dropping
all those unused indexes, you have done two great things for the project. First, you have saved a lot of storage
in the database. And second, which is way more important, you have improved the write performance on
the database. So on your first day, with one query, you have optimized the performance of the database, you have
saved storage, and you're going to shine like an expert in your project. So if you haven't done that, do that
now.

All right. And now moving on to the next one. As we learned, identifying an unused index is an important task. But
on the other hand, identifying a missing index is also very important to improve the performance of your queries.
So in SQL Server, you can get recommendations from the database itself about missing indexes for your queries. So
let's see where we can find those recommendations. All right. So now let's say that you are running multiple queries
and you are doing analysis and so on. For example, I have this query over here. It is a query on the database
AdventureWorksDW, and I'm joining just two tables, the fact with the dimension, and then filtering the data based on the
color and as well on the date key, where I have a range over here. So once I executed it, I got the following
information. It could be any query that you run while practicing and analyzing and so on. So now if you have
a slow query, you can go and check the recommendations from the
database about missing indexes. So in
order to do that we can go and check the metadata of the database
system again to see the recommendations about
the missing indexes. So let's go and do that. So we're going to go and SELECT FROM, and now we have to go and target
the dynamic management views, and it is like this: sys.dm_db_missing_index_details.
So let's go and explore the content over here. And don't forget that this information lives
in the cache of the server, and if there is a restart or something on the server, you will lose all of it.
So now for my query there are a few suggestions and recommendations from the database. Let's go and check
them. So we can see here there are four recommendations about missing indexes from the database. So now let's go and
check the first recommendation over here. You can go and look up the table name from the object ID, or you can find
it here in the statement column. So here the database is suggesting an index for the table dimension product, and it is
recommending that we make an index on the column color, and that's because if you check our query, we have a
filter here, the WHERE condition, where we are saying the color equals black, and
since we don't have an index on the
color, SQL is suggesting that we create an index on the color, and of course in this situation we can go and use a
nonclustered index. Now after that we have three recommendations for the same table, fact internet sales. So for
example, here it is suggesting to make an index on the order date key because we are using it in the filter over here, and as
well it is suggesting to make an index on the product key since we are using it for the join. So this is a really nice report
about missing indexes in the database, and it can help you find things that you hadn't thought about.
But here my recommendation is: evaluate this information very carefully. Don't go and create an index for each
suggestion from the database. You still have to think about it. Is it really necessary? Do we really use this query
very frequently, and so on? So don't go blindly creating indexes for each recommendation from the database. So
this is a really nice tool and assistant for building a good indexing strategy. So this is how you
find the recommendations for missing indexes from the SQL database.

Okay. So now to the next
step: we have to go and monitor duplicate indexes. If you are working in a team with multiple developers
and you are working in parallel to optimize the performance of the queries, what might happen is that
different developers create different indexes for the same column in the same table. Of course, this should not
happen if you have a clean and solid review process in the project. But we are human, and those things can happen.
So that's why you have to monitor whether there are duplicates. So the mission is to find whether there is
a column that is involved in multiple indexes. So let's see how we can monitor that in SQL. Okay. So now it's very
simple in order to find the duplicates of indexes inside your database. So we have learned before that we can find the
list of all indexes in this table indexes in the system schema and then we join it with the tables in order to get
the table name and then we have another table in order to find the columns that are involved in the index. Those
informations we can find it inside the index columns and now in order to get the full name of the columns we're going
to go and join it with the columns table. So it's very simple and makes sense. Let's go and execute the whole
query. Now as you can see it is sorted by the table name and the column name and that's because we can find then
easier the duplicate. So let's go and check the first table. So the country is part of this index where we have the
column store nonclustered and again the country is involved in another index where we have the customer's country and
this is a row store nonclustered index. So this is of course bad thing. We have to go and decide now do we want it as a
column store or row store. And if we check as well this table, we can find the first name in two different clusters
the same story. And that's because we were practicing and creating those indexes. And that's it. But now if you
have a large schema and a lot of indexes, I would go and make a flag in order to understand whether we have a
duplicate or not. And that's by counting the number of rows for each unique combination of table name and column name. And we can do
that very easily using the window functions. So let's add a new line. And we're going to go and use the function
COUNT, since we want to find the number of rows, and then OVER. Then we're going to go and PARTITION
BY; we need the table name and the column name as well. Our expectation for this column is one. If we have more
than one, then there is an issue, and that means the column is inside two different indexes. And now let's go and sort by
this new column, descending. So let's go and execute it. And now we have here a nice flag where we can see how many
rows we have for a specific column in a table. So if it's one, like those columns, they are fine. Those columns
are involved only once, in one index. But for the first four rows we have here an issue, because the count here is two.
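The report described here could be sketched roughly as follows. The catalog views (sys.indexes, sys.tables, sys.index_columns, sys.columns) are standard in SQL Server, but the column aliases and the exact sort order are assumptions, since the on-screen query is not part of this transcript.

```sql
-- Sketch of the duplicate-index report (aliases are illustrative, not from the course).
SELECT
    tbl.name      AS TableName,
    col.name      AS ColumnName,
    idx.name      AS IndexName,
    idx.type_desc AS IndexType,
    -- Flag: number of indexes on this table that include this column.
    -- Partitioning by name follows the narration; partitioning by tbl.object_id
    -- would be safer if two schemas contain tables with the same name.
    COUNT(*) OVER (PARTITION BY tbl.name, col.name) AS ColumnIndexCount
FROM sys.indexes AS idx
JOIN sys.tables AS tbl
    ON idx.object_id = tbl.object_id
JOIN sys.index_columns AS ic
    ON idx.object_id = ic.object_id
   AND idx.index_id  = ic.index_id
JOIN sys.columns AS col
    ON ic.object_id = col.object_id
   AND ic.column_id = col.column_id
ORDER BY ColumnIndexCount DESC, TableName, ColumnName;
```

Rows where ColumnIndexCount is greater than 1, like the count of two seen here, are the ones to review.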
That means we have two indexes for the same column. So as you can see the query is very simple and with that we have a
nice report about the duplicates of indexes inside our database. Okay, one more thing in order to
maintain our indexes is updating the statistics. The database engine uses statistics in order to understand
which index should be used for our query. And if these statistics are not up to date, SQL is going to make wrong
decisions. So let's understand what this means. Now let's say that you have created a table and you start inserting
data into this new table. Now the database engine is going to go and create your new table and insert the data. Behind the
scenes, the database engine is going to go and create statistics for your new table. It's like metadata
about your data, and that's like a report or insights about your table where you can find a lot of
information like the number of rows, the distribution of values in a column, and as well we can find the number of
distinct values and a histogram and patterns and much other information about your table. So now of course the
question is why do we have this information in the database? Now imagine that you are doing a SELECT ... FROM ...
WHERE. What's going to happen? The database engine has to go and create an execution plan. We're going to learn about this
later in detail. It is just a road map on how to execute this query. So here for example, in order to load the data
from the table, there are different ways to do it. So there is a table scan, an index scan, an index seek. So
that means the database engine has here three different ways to do it. And now in order for the database to
decide which way to use, it's going to go and read the statistics of the table. So it's going to go and collect
information. Okay, how many rows do we have? Are the values unique? How is the distribution of the data, and
so on. And now based on those statistics and numbers, the database can make a good decision about which method to use
in order to load the data. So for example, here the index scan is the best way to load our table. So this is
exactly why the database needs the statistics in order to make the correct decision and to use the correct index.
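To make the idea of statistics as metadata more concrete, here is a minimal sketch for inspecting them in SQL Server. The table name Sales.DBCustomers is taken from later in this walkthrough; the statistic name is hypothetical and has to be replaced with one that exists in your database.

```sql
-- List the statistics objects that exist on one table.
SELECT s.name AS StatisticsName,
       s.auto_created,   -- 1 if the engine created it automatically
       s.user_created    -- 1 if a user created it explicitly
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('Sales.DBCustomers');

-- Show the header (row counts, last update), the density vector, and the histogram
-- of a single statistic. 'idx_DBCustomers_CustomerID' is a hypothetical name.
DBCC SHOW_STATISTICS ('Sales.DBCustomers', 'idx_DBCustomers_CustomerID');
```

The histogram part of the output is what the optimizer reads when it estimates how many rows a filter will return.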
So now you might ask, okay, this is something internal to the database, why do we have to care about it? Well, there
is an issue. Now for example, in our table we have 50 rows, and let's say that the next day you went and inserted into
this table around 1 million rows. Now the issue that could happen is that the statistics about this table will not get
updated, and the statistics can still say that we have only 50 rows. So that means the statistics of this table
are now outdated. And the big issue is that once you query this table, the SQL engine doesn't know at all about the 1
million rows that you have inserted in the table, because it's going to go and ask the statistics, and they are going to
answer with only 50 rows, and the database is going to say, okay, this is a very small table, let's maybe skip an
index or something. So that means the database is going to make wrong decisions because the statistics are outdated. And
now your task is to monitor those statistics and to keep updating them. So let's see how we can do that. Okay. So
now the first thing that we have to do is to find out whether our statistics are up to date or outdated. In order to
do that we have as well to access the metadata about our database. And for that as well we have tables and dynamic
management functions in the system schema where we can find a lot of details about the statistics. And in
order to monitor the statistics, I have prepared a query like this. So here I'm using a table called sys.stats, where
you're going to get a list of all statistics inside our database and the name of the statistics, and then I'm
joining it with the tables in order to get the table name, and what is very important is the dynamic management
function. So here we're going to get very important information like the last update and the number of rows and
the number of modifications. So let's go and query it. So here we can see information like the table name, the
statistics name, and now what is very important, the last time the statistics got updated. So now let's go
and check our table DBCustomers. We can see here the statistics name, and what is very important is the last update. So
this tells us how old the statistics are. So for me it is like 4 days. And then we can find the total number of rows in
this table. And now what is very important is the number of modifications that have been done on the table. So
after updating the statistics on the 19th of October, there were around 15 rows that got modified. This could be
an insert, update, or delete. So any operation on the table is considered to be a modification. So you can see
there were a lot of modifications. So these statistics should be updated. So now for the table Customers, you can see
that the statistics are up to date. So we have here zero modifications and there will be no need to update the
statistics. So this is how you can go and check the statistics information inside your database in order to make a
decision: should I update the statistics or not? So now let's say that I would like to go and update the statistics of
our table DBCustomers. Now as you can see we have here multiple statistics. So over here we have this statistic on
this table, and as well we have the statistics on the index. So as you can see we have here multiple statistics in
one table: one for the table itself and one for each index that we have in this table. So now let's say that I would
like to go and update only one statistic. I don't want to update everything in this table, only one
statistic. Let's go and do that. So it's going to be very simple: UPDATE STATISTICS. And then we have to go and
mention the table name. So it's going to be Sales.DBCustomers. And then we have to specify the name of the statistic. So
let's go and get this over here and let's go and execute it. So it was very fast. Let's go and re-execute our query
and check the data. So now let's go and find it. It was exactly this one. And as you can see it just got updated, and the
number of rows is five and the number of modifications is zero. So we now have an up-to-date statistic for this table. But
let's say that I would like to go and update the rest, but I don't want to do it one by one. So what we can do, we can
just copy the same thing over here, but we don't specify any name of the statistic. So we are saying UPDATE
STATISTICS and then only the table name. So let's go and execute it. So now what's going to happen is SQL is going to go and
update all the statistics that belong to this table. So let's go and check our query again. Now you can see everything
disappeared and DBCustomers is completely up to date with no pending modifications. So this is how
you can go and update your table, and you can then do it for the rest as well. But now there is one more thing, where
you can go and update the statistics of the whole database. But beware, this might take a really long time, and we're
going to do that by executing a special stored procedure. So EXECUTE sp_updatestats. This one over here. Let's go and
do that. And now it is done. And we have here a pretty long log. It was fast because we don't have a big database. It
is a very small database. So it's not comparable to any real databases. So now we can see over here that SQL is going
through everything that you have in the database and trying to update the statistics. So in many situations it's
not going to be necessary because there is nothing to update. There were no modifications and so on. That's why the
database is smart enough to say no, it is not required, and it goes and skips it. So now how I usually do it in my project is
that I have a job on the weekend where it's going to go and update the whole database statistics. So with that
I make sure all my tables and indexes have up-to-date statistics. Of course, if you have a small database you can run
this every day, but if this takes a long time then you can schedule it on the weekend. And as well, sometimes I know in
the project that there will be a lot of new incoming data in one day, for example when we are doing some kind of data migration.
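For a scheduled job or a post-migration step like this, the statistics check and the three update variants shown above could be scripted roughly as follows. This is a sketch, not the course's exact query: it assumes the dynamic management function meant here is sys.dm_db_stats_properties (its columns match the description), and the statistic name is hypothetical.

```sql
-- 1) Monitor: when was each statistic last updated, and how many rows changed since?
SELECT
    SCHEMA_NAME(t.schema_id)                  AS SchemaName,
    t.name                                    AS TableName,
    s.name                                    AS StatisticsName,
    sp.last_updated                           AS LastUpdated,
    DATEDIFF(DAY, sp.last_updated, GETDATE()) AS DaysSinceUpdate,
    sp.rows                                   AS [Rows],
    sp.modification_counter                   AS ModificationsSinceUpdate
FROM sys.stats AS s
JOIN sys.tables AS t
    ON s.object_id = t.object_id
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
ORDER BY sp.modification_counter DESC;

-- 2) Update a single statistic (hypothetical statistic name).
UPDATE STATISTICS Sales.DBCustomers idx_DBCustomers_CustomerID;

-- 3) Update every statistic on one table.
UPDATE STATISTICS Sales.DBCustomers;

-- 4) Update statistics across the whole database; can take long on large databases.
EXEC sp_updatestats;
```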
So I go and update the statistics after the data migration is done, just to make sure we have up-to-date statistics. So
this is how we monitor and update the statistics of the
database. Okay, so now moving on to the final task that I usually do in order to monitor and manage the indexes,
which is to monitor the index fragmentation. Over time, as data is inserted, updated, and deleted in your tables,
indexes can become fragmented. So what is fragmentation? It means there are unused spaces in your database and the
database is not filling them, or your data is no longer sorted correctly in the index, and this of course leads to
inefficient use of the storage and as well is going to slow down your queries. And in SQL, in order to get
everything organized again, we have two methods. The first method is reorganize, so it's going to go and defragment the
leaf level of the index in order to get it organized and sorted again in the logical order. So it is a very light
operation and it will not block the user from using your table. And the second method is called rebuild; this is a
heavyweight operation. It's going to go and drop the whole index and recreate it from scratch. And this means of
course not only is the data going to get sorted again, but as well the fragmentation inside your database and
the index is going to be eliminated. So let's see how we can do that in SQL. Okay. So now back to our database, and
the first question that you have to ask is: do we have an issue with fragmentation in our indexes? So we
have to check the health of our indexes in the database. And in order to do that, we have again to go to the system
metadata that we have, and we're going to check the dynamic management functions. So there is a special
function in order to get an answer in SQL Server. Let's go and do that. So we're going to go and SELECT * FROM
the function. So it is sys., so it's going to be sys.dm_db_index_physical_stats, this one. And this is a function
to which we have to pass a few parameters. We will not go into details, just follow me with this. So we have to give it DB_ID(),
then a NULL, another NULL, and a third NULL, and the last one is going to be 'LIMITED'. So we have to do it like this.
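Written out, the call described here looks like the following. The five arguments are the database ID, object ID, index ID, partition number, and scan mode; passing NULL for the middle three means all objects, all indexes, and all partitions.

```sql
-- Fragmentation details for every index in the current database.
-- 'LIMITED' is the lightest scan mode.
SELECT *
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED');
```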
So let's go and query it. Now what do we find? We have the object ID. We have the index ID and some other information, but
the most important one is the average fragmentation in percent. So this column gives us the degree of
fragmentation in our index. If it is zero then it is perfect. We have no fragmentation in the index and our index
is very healthy. But if it is like 100, then that means it is completely out of order and we have to do something about
it. And now you might say, you know what, I don't know which object this is and which index. Well, you have to go and
join a few tables like sys.tables and sys.indexes in order to get that information. So we have to go and do
that like we have done in the first query. So okay, offline I have done that. So I joined with the tables and
the indexes, and I'm sorting the data by the average fragmentation in percent, descending, in order to get
the problems at the start, because we are interested in where we have a high percentage. So let's go and execute
this. And now since it is a practice database, I didn't insert any data and so on. But in real projects you will get
different numbers here. And here are my recommendations about the percentage. If the fragmentation is between zero
and 10, that means everything is okay and you don't have to do anything about it. But if the percentage is
between 10 and 30, then here we have to do something about it. So here I recommend using the reorganize method
in order to sort the data correctly again. But if you have more than 30%, then here my recommendation is to go and
rebuild the whole index, because not only is the data in the wrong order, but as well there are unused spaces in your data pages
in the index. So you have to do something about it. So now let's go and imagine one of those indexes, for example
this one over here, has a fragmentation of 15%. So now what we have to do is to go and reorganize this index. Let's see how
we can do that. So let's go over here and say the following: ALTER INDEX, and then we need the index name. So let's go
and get it from here, and then you have to mention the table name where the index exists. So we have it on the
customers, so ON Sales.Customers. So now we are editing the index and we have to tell SQL what to do now. So we just
want to reorganize the index. So you go and use the keyword REORGANIZE. So REORGANIZE, and that's it. This is very
simple. So let's go and do that. And as you can see it is completed, and it was very fast because we have a small
database. But sometimes it takes a little more time if you have a big index and a big table. So after reorganizing you can
go and check the table over here again and see the results, and it should be zero here. Now let's say that we
have another index where the fragmentation is around 50%. So let's go and copy it, and this time instead of
reorganize we're going to do rebuild. So I'm going to take the whole thing, and this time we're going to go and rebuild
this index over here on the same table, and instead of REORGANIZE we're going to say REBUILD. So let's go and execute
that. And with that SQL did drop the whole index and create it from scratch. And this usually takes more
time than reorganize, of course. And the next step of course is to go and check the fragmentation again and so on. So
that's all about how to make your index healthy and remove the fragmentation from your index. All right, my friends.
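The fragmentation workflow from this part can be put together as one sketch. The joins to sys.tables and sys.indexes follow the description; the index name idx_Customers_Country is hypothetical, and the 10% and 30% thresholds are the speaker's rule of thumb rather than fixed limits.

```sql
-- Fragmentation report with readable table and index names, worst first.
SELECT
    tbl.name AS TableName,
    idx.name AS IndexName,
    ps.avg_fragmentation_in_percent,
    ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
JOIN sys.tables AS tbl
    ON ps.object_id = tbl.object_id
JOIN sys.indexes AS idx
    ON ps.object_id = idx.object_id
   AND ps.index_id  = idx.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;

-- Roughly 10% to 30% fragmentation: lightweight defragmentation of the leaf level.
ALTER INDEX idx_Customers_Country ON Sales.Customers REORGANIZE;  -- hypothetical index name

-- More than roughly 30%: recreate the index from scratch.
ALTER INDEX idx_Customers_Country ON Sales.Customers REBUILD;     -- hypothetical index name
```

Re-running the first query afterwards shows whether the percentage has dropped.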
So as you can see, improving the performance of your queries doesn't end with creating the indexes. It's all about staying
proactive. So monitor the usage of the indexes, check whether there are any missing indexes, and always make sure
the statistics of the database are up to date, and keep your eyes on the fragmentation and make sure you have
healthy indexes. So with that you have learned how I manage and monitor the indexes once I create them, and I really
recommend that you follow those steps. All right friends, so now let's say that you have a large, complex
analytical SQL query and it involves a lot of joins and aggregations and so on, but it is slow, and of course you want to
go and optimize the performance of your query, maybe by using indexes. And now the big question is where exactly am I
going to build this index: on which table, on which columns? So that means you have to understand where exactly the
problem is. Is it in joining tables, or sorting data, or in the aggregations? Now in order to answer all those questions
we have something called the execution plan. So what is that? The execution plan is going to show you exactly how the database
processes your query step by step. And this is what we need. It's going to show us where exactly we have a
performance issue. So in other words, the execution plan is like your window into how the SQL database thinks, and once
you understand that, then you're going to make the right decision on building an index. So let's understand exactly what
this means. Okay. So now let's imagine that you are doing a query like selecting from a table and then joining
the data with another table. So now once you execute this query, the database engine will not go immediately and start
fetching data from the disk. Instead, SQL first has to make a plan. So it's like you are planning a
trip where you check Google Maps in order to find the best route to reach the destination, and the execution
plan is exactly the same thing. The database first has to plan how to execute your query, and it's going to
build this plan step by step based on your query and the statistics as well. So the first step, for example, is how to get
the data from the tables, and there are multiple ways, like an index scan or a full table scan, and then after that it
needs to decide which type of join is going to be done, like is it a hash join or a loop join, and then at the end of this
plan it's going to be the SELECT statement. So once the execution plan is ready, the database engine is going to
start carrying out the steps. So it's going to go and start reading your tables, for example from the disk, and
then after that it's going to join the tables and then select the columns and at the end send the results to the end
user. And now once everything is done, the database engine is going to do one more thing, where it's going to go and take
this execution plan and store it in the cache. And that's because the database engine can go and reuse this plan if we
have a similar query. So for example, if you go and execute the same query again, the database engine here is going to
understand, ah, this is the same query, we have already built an execution plan for that. So it's going to go and check the
cache, and it is way faster to get it immediately from the cache instead of building it. So in this scenario, the
database engine doesn't have to make any decisions or something like that. It's going to go and get the plan from the cache
and start immediately executing the plan. And of course, the database engine will not hide the execution plan from
the users. You can go and check it, so you can see how the database loaded the data, how the tables are
joined and so on. And then you can make a correct decision on how to optimize your query, maybe by adding indexes. So
let's go back to SQL and see how we can do that. Okay, so now we're going to work with
the database AdventureWorksDW2022. And now we're going to go to our tables and we're going to focus on the fact table
FactResellerSales. Now let's go and check the type of this table. So if you go inside it and go to the indexes, you can
see that we have an index on the primary key. So we have a clustered rowstore index. So that means the data is
structured as a B-tree. So now what we're going to do, we're going to go and create a mirror of this table but
without any indexes. So it's going to be very simple: SELECT * FROM our FactResellerSales, and we're going to insert
it into a new table. So INTO FactResellerSales, and I'm going to add
HP for heap. So let's go and execute it. And now you can see we have inserted around 60,000 rows into the new table. So now
we can go and refresh our tables in order to find our new table. So it is over here, the FactResellerSales HP table, and if
you check the indexes you will not find any. So that means it is a heap table. Now let's go and do a very simple query
on top of our new table. So SELECT * FROM the FactResellerSales HP table, like this. So let's go and execute it, and we got the
results. So now the question is, I would like to see the execution plan of this query. Now in order to see the execution
plan we're going to go to the toolbar over here, and we have three things. The first one says Display Estimated
Execution Plan, and we have another one that says Include Actual Execution Plan, and a third one that says Include Live Query
Statistics. So now the question is, what are the differences between them? Let's start with the first one, Display
Estimated Execution Plan. So here what's going to happen? SQL is going to go and guess the execution plan without
executing the query. So this is only a guess, an estimation. The second one is the actual
one. So this is going to show you the execution plan that is used in order to process your query. So after executing
your query, SQL is going to show you which plan was used. So that means the estimated plan is something before
executing your query and the actual plan is something after executing your query. And the third one is while executing the
query. So you're going to get a real-time view of the execution of your query and you can see how your execution plan is working. So
now we can go and try that. Let's go and activate the estimated execution plan. Now we can see over here we have a new
output where you can see a few boxes. So this is an estimated execution plan without executing your query. But now if
you go over here and switch it to the actual execution plan, nothing is going to happen, because first you have to execute
your query. So let's go and do that. So once we have executed, we got the results, the messages, and here we have a new tab
called Execution Plan. So if you go over here you will find the real execution plan that is used to process your query.
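The steps so far can also be done in plain T-SQL instead of the toolbar. The following is a sketch under a few assumptions: the heap copy is named FactResellerSales_HP (the transcript only says "HP" is appended), the tables live in the dbo schema, and GO is the batch separator used by SSMS and sqlcmd.

```sql
USE AdventureWorksDW2022;
GO

-- SELECT ... INTO creates a new table without any indexes, so the copy is a heap.
-- (Fails if the target table already exists.)
SELECT *
INTO dbo.FactResellerSales_HP      -- assumed name for the heap copy
FROM dbo.FactResellerSales;
GO

-- Estimated plan: the query is compiled but NOT executed.
-- SET SHOWPLAN_XML has to be the only statement in its batch.
SET SHOWPLAN_XML ON;
GO
SELECT * FROM dbo.FactResellerSales_HP;
GO
SET SHOWPLAN_XML OFF;
GO
```

While SHOWPLAN_XML is on, the SELECT returns the plan as XML instead of the rows, which corresponds to the estimated plan button.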
And let's go and try the third one. And let's go and execute. It was pretty fast because the
query is very fast. But here we can see how the data and the plan are working during the execution. So this is the
live execution plan. And of course we have the last one which is the current execution plan. So those are the
differences between them. Now you might ask, why do we have both estimated and actual execution plans? Well, it is
a really nice tool to understand whether everything is healthy in your database, because if the guess is
something different from the actual execution plan, that is an indicator that something is wrong with the
statistics or the indexes in your database. So if the estimated and the actual match, then everything
looks good. But now we're going to focus on only one type of those execution plans. We're going to stick with the
actual execution plan. So now what we're going to do, we're going to go and open two queries side by side, and one is going
to be from the clustered index and the other one is from the heap structure. So it's going to be like one to one.
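A minimal sketch of this side-by-side comparison, using SET STATISTICS XML as the T-SQL counterpart of the actual execution plan button. The heap table name FactResellerSales_HP and the dbo schema are assumptions.

```sql
-- Actual plan: each query runs, and its plan is returned as an extra XML result.
SET STATISTICS XML ON;

SELECT * FROM dbo.FactResellerSales;      -- table with the clustered index
SELECT * FROM dbo.FactResellerSales_HP;   -- heap copy without indexes (assumed name)

SET STATISTICS XML OFF;
```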
Let's go and query both of them. And now let's go and try to read the execution plan. But make sure that you have
activated the actual execution plan. So we have here now two plans. So now we are at the heap table, and we don't have
any indexes. So now the question is how to read this execution plan? Well, now the plan is very simple because we have
a very simple query, but we read it from right to left. So the first operation is the table scan, and then we
have here a very small arrow to the next one, where we have the SELECT. So from right to left. So now of course the
first operator is how to read your data inside the table, and here we have different types of scans, and one of them
is the table scan. So a table scan is actually scanning the entire table. So it's going to go and scan all the
rows inside your table in order to execute this query. Now if you go and hover the mouse over the table scan, you will
find a lot of details about what is happening during loading the data or scanning the table. But it is a little bit
annoying. Better than that, if you right-click on it and then go to Properties, you will get on the right
side the same details, but it is easier to read. So the first thing that we have to read is the number of rows that have
been read. So we can see that we have read all the rows inside the table, which is not really good, and we have other
important information about the resources and the cost. So we have the CPU cost and the I/O cost, and
what is interesting is the logical operation, the table scan, and we can see some nice information about the
storage. It says it is rowstore. Now let's go and check the execution plan of the other table, where we have a
clustered index. So let's go to the execution plan. And now you can see that we have on the right side something
else. We don't have a table scan. We have something called a clustered index scan. It is either scanning the entire table
again or only a range or a part of the index. And of course in the details we can see whether it read all the
rows or not. Now if you go and check the number of rows, again the whole index is read in order to get these
results. So again we have here the total number of rows inside our table. And as well you can see over here the logical
operation, it is a clustered index scan. So it is not a table scan. Now of course we have to go and check the CPU and the
I/O costs, whether we are consuming the same effort or not. So we can go and compare. So here we
have like 0.07. And if you go over here you can see we didn't gain a lot
from having an index on this table. And that's of course logical, because this query doesn't benefit from any
index. It is just selecting everything from the whole table. So now let's go and extend it
where we're going to sort the data by the primary key, the sales order number. So let's go and get this one, and as well
for the heap structure. So let's go and execute it and check the execution plan, and the same thing for our clustered
table. Now let's check first the heap structure. As you can see here, we have two steps. First, it's going to go
and scan the whole table, and then we have a sort operator in order to go and sort all the data in order to present it
in the output. And at the end, we have the SELECT, which is not really important. So here we have two
operators. But now if you go to our clustered index, you can see that we have only the scan and the SELECT. There is no
sort step, right? And that's because the clustered index is already sorted and SQL doesn't have to go and sort the data
again. So it doesn't have to go and sort anything. The data is already sorted. So this is the first win that you have if
you have an index. So everything is already sorted, and if you have an ORDER BY on this column, then SQL doesn't have to
do it during the query. So now if you want to go and compare the cost, you can see here we still have the same cost for
the CPU and the I/O. In the heap structure, without any index, we have here like double the cost. The first cost is for
the table scan. It is the exact same amount of CPU and I/O as the clustered one, but as well on top of it we
have a high cost for sorting the data. So we are consuming more CPU and I/O. And if you sum up those costs,
of course this query is going to be slower and worse compared to the clustered index. So with that, in the execution plan you
can understand exactly the benefit of your index. And one more thing about this plan, if you go over here. So if you
go to the object and let me just expand it like this, you can see the name of the index that has been used for your
query. So it says the index is PK for primary key, and then we have the whole thing. So now if you go to our table on
the left side and check the indexes, it's going to be exactly this index. So in the execution plan you can find as well
which index has been used in your query. And this is very important to check. If you create a new index, then run your
query and check whether the database is using your newly created index. And if not, then you are making the wrong decisions
about your index. So each time you create a new index, make sure to check whether in the execution plan the
database is using your new index. Okay, so now let's keep going.
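The sorting comparison above can be reproduced with a pair of queries like these, run with the actual execution plan enabled. The heap table name and the dbo schema are assumptions; the last query is an optional extra for cross-checking the index name that the plan reports.

```sql
-- Clustered table: rows are already ordered by the clustered key,
-- so the plan typically needs no Sort operator.
SELECT *
FROM dbo.FactResellerSales
ORDER BY SalesOrderNumber;

-- Heap copy: a Table Scan followed by an explicit Sort operator.
SELECT *
FROM dbo.FactResellerSales_HP      -- assumed name for the heap copy
ORDER BY SalesOrderNumber;

-- List the indexes on the table to compare against the Object shown in the plan.
SELECT name, type_desc, is_primary_key
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.FactResellerSales');
```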
Now instead of using the primary key, I'm going to go and filter the data based on one of those columns that we
have in this table. So let me check the results, and let's take for example the carrier tracking number. So carrier
tracking number, and let's go and pick a value, the first one here, like this, and let's do the same thing for the heap
table and execute it. And now in the execution plan you see we still have a table scan, and on this table let's see
the execution plan with the clustered index. Now let's say that I would like to go and create a nonclustered index for
this column. So let's go and do it. So CREATE NONCLUSTERED INDEX, and I'm going to call
it index, fact reseller, and then the column name. So ON our table FactResellerSales, and the column is going to be
the carrier tracking number. So I'm going to take it from here and let's go and create it. Now let's see whether our
query is going to use this index. So let's go and execute it and let's go to the execution plan. Now things look
completely different than before. So what is going on? We can see that we have now something new. We don't have a
clustered index scan. We have something called an index seek. An index seek is an amazing sign in your execution plan
because it tells us that SQL Server did find a way to use the index in order to find the exact data that we need without
scanning a lot of stuff. So that means now we have three ways of reading the data. We have the table scan, where SQL is
going to go and scan the whole table, and this can happen in the heap structure, and the second one, we have the index
scan, and here we don't know whether it is scanning the whole index or a part of the index, and the last one, we have the
index seek, where the database is able to find the data directly without scanning a lot of stuff. So the worst type is the
table scan. Then we have the index scan, and the best one is the index seek. So if you check the details here, you can
see the number of rows that have been read is only 12. This is amazing. Let's go and check the heap scan over here. So
to the execution plan, and if you go over here you can see that we are reading around 60,000 rows in order to get 12.
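The comparison being made here can be reproduced with a sketch like the one below. The index name, the heap table name, and the filter value are placeholders: the transcript does not give the exact index name or the tracking number that was picked, so substitute a value that exists in your data.

```sql
-- Nonclustered index on the filter column (index name is hypothetical).
CREATE NONCLUSTERED INDEX idx_FactResellerSales_CarrierTrackingNumber
ON dbo.FactResellerSales (CarrierTrackingNumber);

-- Indexed table: the plan can use an Index Seek and read only the matching rows.
SELECT *
FROM dbo.FactResellerSales
WHERE CarrierTrackingNumber = N'REPLACE-WITH-A-REAL-VALUE';   -- placeholder value

-- Heap copy: still a Table Scan that reads every row to find the same matches.
SELECT *
FROM dbo.FactResellerSales_HP                                  -- assumed name for the heap copy
WHERE CarrierTrackingNumber = N'REPLACE-WITH-A-REAL-VALUE';   -- placeholder value
```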
But with the index we are reading only 12 in order to get 12, and this is amazing and very fast of course, and of
course the cost of this is very, very small. So if you check the CPU and the I/O, you can see those numbers
are next to nothing, and of course if you go to the object over here you can see which index has been used, and this is exactly
the index that we have just created. So that means it was a really good decision to create this index, and SQL was
very happy about it and used it in order to find our data fast. So now let's go and check the rest of the plan. And now
you can see over here we have a key lookup. The key lookup is an operation that we need in order to get the rest of
the columns, because from this index we are getting the data of only one column, the carrier tracking number. But since
in our query we are saying SELECT *, that means we have a lot of columns and those columns are not part of the index.
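To see where the key lookup comes from, it can help to run two variants of the same filter and compare their plans. This is a sketch with a placeholder filter value; it assumes the nonclustered index on CarrierTrackingNumber already exists.

```sql
-- SELECT * needs columns that are not stored in the nonclustered index, so the plan
-- typically shows Index Seek + Key Lookup, combined by a Nested Loops operator.
SELECT *
FROM dbo.FactResellerSales
WHERE CarrierTrackingNumber = N'REPLACE-WITH-A-REAL-VALUE';   -- placeholder value

-- Only the indexed column is requested, so the index alone can answer the query
-- and the Key Lookup is no longer needed.
SELECT CarrierTrackingNumber
FROM dbo.FactResellerSales
WHERE CarrierTrackingNumber = N'REPLACE-WITH-A-REAL-VALUE';   -- placeholder value
```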
So in this index is called don't know anything about the rest. That's why has to go and search for the other columns
and of course it is called a lookup not a scan or something like that and that's why we have here as well only 12 rows
but from this step we will get the rest of the columns. So and now the next step is that SQL going to go and join those
two informations. So we have from the first one the carrier tracking number and the second one we have the rest of
course SQL has to go and merge all those stuff in one in order to have it as a results. And now this operation called a
nested loops. Behind the scenes there are different types of joins not the one that we know the inner lift and so on
but there is another types of joints. We have the nested loop. We have the merge join and the hash join. The nested loop
is very good for small stuff. If you have large tables, then the merge and the hash joints are way better than the
nested loop. So that means if you are getting here a lot of data from the index and the lookups and you seek is
using a nested loop, this is not good. But for now it is okay because we are getting only 12 rows and the operation
is going to be fast enough. And now one more thing that we can see inside our execution plan is the cost in
percentage. So from checking this plan you can see the SELECT is costing almost nothing. The cost of the nested loop is
as well like 0%. And then we have like 6% for the index seek. That's because it is pretty fast, and the most expensive
operation in our query is the key lookup, of course, because it's going to go and get all the columns. And now
if you go and compare to the heap structure: even though the execution plan of the heap structure looks very
small, that doesn't mean it is faster than the index that we have. If you go and add up all those numbers, the index plan is still
way faster than the heap structure. Now I would like to show you one more thing. If you want to get rid of this
key lookup, in your query select only the carrier tracking number. Let's go and execute it and go
to the execution plan. As you can see there is no need for the lookup, because we have only one column and this data we
can get it completely from our index. So as you can see it is interesting to understand how SQL is working with your
table and with your index and this is how to validate whether you are making correct decisions about your
indexes. Okay. So now let's go and add more stuff where we are doing aggregations, joins, and so on. Let's
extend our query. So I'm going to go and join it with another dimension, for example the dim products, and the join
is going to be on the product key. So product key equal to product key. Now after that we're going to go
and aggregate a few things. So we're going to aggregate by the product name. So I'm going to take the product name. So it's
going to be the English product name, and let's call it product name. And let's go and aggregate the sales. So SUM,
and we're going to get it from the fact table. It's going to be sales amount, aliased as
total sales, and of course we have to go and do GROUP BY, and not the French name. It's going to be the English
name. So let's group by the product name. And that's it. Let's go and execute it. Now we have a nice list of
products and total sales. But let's go and check the execution plan. And oh my god, we have a lot of stuff. So let's
start from the right side and go quickly from right to left. So the first thing is that it's going to
go and get the data from the fact. So it is using the clustered index. And then after that it's going to go and do a
hash match for the aggregation. And after that it's going to go and sort the data because it is doing later a merge join.
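For reference, the query whose plan is being read here can be sketched as follows. The table names dbo.FactResellerSales and dbo.DimProduct are assumed from the AdventureWorksDW sample (the recording only says "dim products"); the column names follow what is said aloud.

```sql
-- Assumed table names; aliases s and p are illustrative.
SELECT
    p.EnglishProductName AS ProductName,
    SUM(s.SalesAmount)   AS TotalSales
FROM dbo.FactResellerSales AS s
JOIN dbo.DimProduct AS p
    ON p.ProductKey = s.ProductKey
GROUP BY p.EnglishProductName;
```

Which operators appear in the plan (hash match, sort, merge join) depends on the indexes and statistics on your own copy of the tables, so your plan may differ from the one described.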
So all those steps are preparing the fact table. And then we have another clustered index scan for the dimension. So it is
going to go and select the information from the dimension as well. And we have here not a lot of rows, so it
is a very small table, about 600 rows. And now of course the result of the clustered index scan is sorted as well, right? Because as we
learned, the clustered index keeps the data sorted. So we have here a sorted output together with another
sorted output. So we have two data sets that are sorted, and SQL here decided to go with the merge join, which
is a good join for joining two sorted data sets. It is way faster than joining using the nested loop. So
everything is fine, and then the data is going to be sorted and presented at the output. And now if you are checking this
plan you can see the most expensive thing happened at the fact table. So 71% of the total cost happened in this
step. Now let's say that the query is slow and I would like to go and optimize it. We have learned that if you are
doing aggregations on big tables then the column store index is a good idea. So let's go and find out whether that is
true. So I'm going to go to our other table. So our sales table was with the heap structure. And now you say, you know
what, let's go and convert this heap structure to a column store. So let's go and do that. So we're going to say
CREATE CLUSTERED COLUMNSTORE INDEX, and we're going to call it index and then the
whole name, fact reseller sales HP, and we don't have to specify any columns. So it's going to be ON
our table, and that's it. Let's go and execute it. So now our table is not a heap structure anymore. It should be a
column store. So if you go and check the information we can see we have a clustered column store index on it. So
now let's go and do the same query and check whether we have a better performance. Let's go and execute it.
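The statement described above might look like the sketch below. The exact spelling of the table name (dbo.FactResellerSales_HP) and the index name are assumptions based on what is said aloud.

```sql
-- Converts the heap into a clustered columnstore index.
-- No column list is given: a clustered columnstore covers the whole table.
CREATE CLUSTERED COLUMNSTORE INDEX idx_FactResellerSales_HP
    ON dbo.FactResellerSales_HP;

-- Check which indexes the table has now.
SELECT name, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.FactResellerSales_HP');
```

After this, the aggregation query can be rerun against the converted table with the actual execution plan enabled to compare costs.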
And of course you have to go and activate the execution plan. So I'm going to do that, and now let's go and check from
the right again. So this is our fact table, and as you can see already it is costing only 6%. Interesting. So let's
go and compare what happened to our fact table. First of all, we can see that the physical operation is a column store
index scan. And if you go to the object over here, you can see that SQL did use the column store. And that is of
course going to happen, because the whole data is stored only in the index. So there is no way around it. So it has to go
and use the index. But now what is interesting, maybe we have to go and compare the CPU costs. So if we
check over here, it is like 0,000.67 almost the same thing for the input output. Let's go to the previous
plan where we don't have a column store and check our fact table. So as you can see, here it is way more expensive reading
the fact table than with the column store, and we have reduced the I/O costs as well. So as you can see we went from
71% of total cost for the fact table to only 6%. And the resources used to execute the query are way less than with
a normal clustered rowstore. And this is exactly the power of this index, the column store index. You can use it on
big tables like the fact tables, like we are doing here in this query, and you will be getting amazing performance for this
scenario. So of course you can go and compare the execution plans by moving left and right. So as you can see if I
click over here and I just switch to the other tab, I can like quickly compare the numbers. But there is another way on
how to compare execution plans and that is if you go to the execution plan and right click on it then go to save
execution plan as and then you have to go and give it a name for example query pro store. So let's go and save it and
then you can go to the second query where we have the row store and then right click on the execution plan and
say compare show plan. So once you click on that then you have to go and select the one that you want to compare with.
So open, and now on top you have your query and at the bottom you have the execution plan that you have saved, and
then you have here a lot of information comparing both execution plans, and with that you can go into more
detail in order to understand which plan is better. All right friends, so as you can see, having the execution plan
is amazing. We can see how SQL is working behind the scenes and we can understand how SQL is processing my
query step by step, how many resources it is consuming, and whether my indexes are useful or useless, and I can go and
experiment. I can go and add an index, then test and check whether I gained some performance or not. And
we can go and compare multiple execution plans before and after until we get the right index for the right
table and the right column. So execution plans are amazing in order to help us understand whether our
indexing strategy is correct or not. All right friends, so far we have learned that SQL Server is going
to make its own decisions on how to execute your queries, and it makes those plans based on the statistics. But
sometimes the plan that you are getting from the database might not be the best one for your query, and there could be
many reasons why this could happen. Maybe the statistics are not up to date, or you have a lot of indexes and the
database engine gets confused, and this is exactly where we need the SQL hints. So you can use SQL hints in order to
force the database on how exactly your SQL query should be executed. So you can intervene and
change the steps in the execution plan. So let's see how we can do that. All right. So now let's have a very simple
query. We are just joining the table orders with the customers and we are showing like few columns. Now if you go
and execute it and we go and check the execution plan, we can see in this plan that it is using the clustered index in
order to read the data from the orders and the customers and then it is using the nested loop in order to do the
joins. Now let's say that our tables are really big but still the SQL is using the nested loops and of course this is
not good for large tables and maybe the SQL was confused with the indexes and statistics and so on and it decided to
use the nested loops. So now in order to force SQL to use another type of join, we can go and give a hint in our
query for SQL to use a different join type. So let's go and do that. We're going to go to the end of our
query and we're going to say OPTION, and inside it we're going to say HASH JOIN, like this. So that's it. This
is our query and at the end we are giving the database a hint for the execution plan. So let's go and try that
out. So let's check the execution plan. And now as you can see it is using a different type of join. So with that we
are intervening in the execution plan and we are making choices. So with that we have changed the technique for how
SQL is joining those two tables. All right. So now let's go and change something else, like for example instead
of having an index scan I would like to have an index seek. So if you have the right index on your table, you can go
and tell SQL how to read your data in the table. So let's go and do that. Currently here we have an index scan on
the table customers. So we can go over here next to the table and we're going to say WITH, and inside it we're going to
say FORCESEEK. So we are forcing SQL to use an index seek. So we can use those keywords next to the table in
order to specify for SQL how to load the data. If you are not specifying anything, like here with the orders, where we don't have
any hints, that means we are counting on the execution plan that is generated by SQL. But if you don't
want its recommendations, you can go and specify which method should be used. So now let's go and execute it. Now we got an
error because SQL is not able to process what we are asking for, and I think it is because we are using FORCESEEK
and the hash join hint at the same time. Let me just comment the hash join hint out and let's go and give it another try, and now it is
working. So let's go to the execution plan. So you can see we got the nested loop again. And now if you go to the
customers table you can see it is now using the index seek. So it is not using the index scan anymore. So as you can
see, again we are intervening and forcing SQL to use the method that might be better for our query. Now let's say you are
creating a lot of indexes on one table and SQL is still not targeting the right index. So if you check the object
you can see it is targeting a specific index. But if you have a better index than that, you can give a hint for
SQL to use a specific index. And we can do that like this. If you go over here and remove the FORCESEEK and you say
INDEX, and then we have to go and specify the index name. So let's go and get the primary key again over here. Now
I'm telling SQL you have to go and use this index in order to scan the table customers. So let's go and try this out.
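The three hints used in this walkthrough can be sketched side by side. The schema below (Sales.Orders and Sales.Customers joined on CustomerID, the selected columns, and the index name PK_Customers) is hypothetical and only stands in for the tables on screen.

```sql
-- 1) Query hint: request hash joins for the whole statement.
SELECT o.OrderID, o.OrderDate, c.FirstName, c.LastName
FROM Sales.Orders AS o
JOIN Sales.Customers AS c
    ON c.CustomerID = o.CustomerID
OPTION (HASH JOIN);

-- 2) Table hint: force an index seek on Customers.
--    Shown without OPTION (HASH JOIN); in the walkthrough, combining
--    the two hints produced an error.
SELECT o.OrderID, o.OrderDate, c.FirstName, c.LastName
FROM Sales.Orders AS o
JOIN Sales.Customers AS c WITH (FORCESEEK)
    ON c.CustomerID = o.CustomerID;

-- 3) Table hint: force a specific index by name (assumed name).
SELECT o.OrderID, o.OrderDate, c.FirstName, c.LastName
FROM Sales.Orders AS o
JOIN Sales.Customers AS c WITH (INDEX ([PK_Customers]))
    ON c.CustomerID = o.CustomerID;
```

Query hints go in the OPTION clause at the end of the statement, while table hints go in a WITH clause right after the table they apply to.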
And if you go to the execution plan you can see it is targeting this index as well. So not only can you force SQL into
a specific type of loading or joining, you can also force SQL to use a specific index that you created. All right
friends, so as you can see, SQL hints are very powerful, but we have to be very careful with them, because I really
had a bad experience using them in my projects. So here are my recommendations and what can happen. So what could happen
is that you are optimizing the performance in the development database and you start using the hints and the
speed is really good, and once you roll that out to another database, the production database, this hint will not
be working correctly. The same hint that you are using might not improve the performance, and one reason is that
the production database often has much more data than the
development database. So you really have
to test the hint in each database that you have. If your hint is working in one environment, that doesn't mean it is
going to work in the other one. So always make sure to test. And the second recommendation is: don't use the
hint as a permanent fix for your queries. So what does this mean? Let's say that you are working in a project and
one of your queries is very slow. Now, if it's not clear why the execution plan is really bad, you can go and use the
hints as a workaround in order to speed up your query again, but it's still a temporary workaround. You still have to
invest and spend time in order to analyze what the root cause was. So maybe it is old statistics, or you
have wrong indexing, and so on. So use hints only as a workaround to speed up your queries, but don't use them as a
permanent fix. So friends, SQL hints are really amazing in order to control the execution plan, but use them very
carefully and only if there is an emergency. All right friends, so now for each SQL
data project, we have to make sure that we create clear guidance about the index strategy, and everyone in the team
has to commit to and follow the strategy in order to make sure that each index that is created in the project fulfills a
purpose. That's because without a clear strategy about the indexing, I promise you there will be a lot
of redundancy, unused indexes, wasted storage, and the whole system of your project is going to be slow. So
now what we're going to do: I'm going to show you the indexing strategy that I usually follow in my projects. But I'm
going to tell you up front, there is not one strategy that can fit any project and any scenario. That's why the
team of each project should brainstorm in order to make their own strategy. So now let's have a look at my indexing
strategy. And now if I have to pick only one recommendation from me to you in this
indexing tutorial, I'm going to have this advice for you: avoid overindexing. Overindexing is the biggest mistake and
trap that a lot of developers fall into, where they think that by adding more indexes we are speeding things up
and our queries will be fast. But I have to tell you, this leads to exactly the
opposite. And here's why. As we learned,
each time you add new data to your table, your indexes have to get updated, sorted, rearranged. So with
too many indexes, what's going to happen? Your insert, update, and delete operations are going to be slow. And this
means your database is slower and not faster. And one more very important reason why overindexing is bad is that you
make the database confused while creating the execution plan. As we learned, the SQL database has to create
the best execution plan for your query. And if you have a lot of indexes in your database, it makes the process
of creating an execution plan complicated, which of course makes it harder for the database
to choose the best path and index, and you open the door for bad execution plans. It also
slows the query itself, because the database first has to create the execution plan before executing your query.
So overindexing confuses the execution plan and makes the
query slower as well. So that's why I call this a golden rule and you have to commit to it. Just avoid overindexing, because it
is a double-edged sword, and you have to have the mindset of less is more. So having a few effective indexes
is way better than having a lot of indexes. So keep it in mind and write it in your development guidelines for the
team as a big statement: avoid overindexing. So this is the first statement in your indexing strategy. So
now let's check the rest. All right. So now we can split the
indexing strategy into four phases, and each phase has multiple steps. So now in the first phase we're going to go and
create an initial indexing strategy. Once you start a new SQL project you have to define the objectives of the
project very clearly. So that means we have to make it clear what we are focusing on and what we want to achieve, and
in order to define the goal of your indexing strategy you have to understand your system. We have mainly two types of
databases. On one hand we have OLAP databases. It stands for online analytical processing. The purpose of
this database is data analytics, and an example of that is the data warehouse. So in data warehousing we go
and extract the data from multiple sources, and then we prepare it and transform it and put it in one big
storage, and we call this process an ETL process. And then at the front end we have reports and dashboards where the
data is summarized and aggregated and presented to the end user. And these reports are used by users in
order to analyze and get insights about the data. And now in order to generate those reports there will be heavy
reading on the data warehouse database. So that means there will be huge queries that are going to access the database in
order to aggregate and prepare the data for the visualization. But now on the other hand we have the OLTP systems,
online transactional processing. These are things like e-commerce, finance, or banking, where you have at the back end a database
where the data is stored and on the front end we have applications for the end users. So now as the users
are interacting with the app, this causes write operations on the database, so inserting new data or changing data,
and there will be read operations on the database as well in order to show the data in the app. So we have
both write and read. So now of course we have to ask ourselves what is the goal, what do we want to achieve, and here
there are mainly two strategies: either you want to improve the read performance or the write performance. Now if you are
looking at the OLAP system, here you really have to understand where the struggle is in the project. Sometimes
it could be that the ETL process itself is slow, and mainly the ETL is writing data from the sources into the data
warehouse, and maybe you have a scenario where it takes 10 hours every day, and 10 hours is of course a problem
because you cannot wait so long in order to get fresh data into the reports every day. So you can make the
goal of the project to optimize the write performance: you want to speed up the ETL. But actually most of those
projects have another issue, and it is the read operations on the database, because data warehouses normally have
really big data sets and at the front end the reports generate large, complex queries on the database. So that means
the read process is going to be the pain point in most OLAP systems. So normally the big goal in an OLAP system is going
to be how to optimize the read performance. But now on the other hand, with OLTP, we have a different nature
of database and scenario. What is going to happen? You will not have big queries from the apps. You're going to
have many small queries, many transactions happening between the application and the database. So you're going to have
a massive amount of read and write transactions. So the whole time we are reading, writing, reading, writing, and
so on. But with OLAP we have something bigger and slower, because we usually run the ETL only once.
That means we are writing new data to the database only once, and this usually happens at night, but in the
transactional systems you have a lot of reads and writes all the time. Again it depends on the project, but usually the main pain point
in OLTP is the write operation. So it could be like this: if you are building an OLTP system, the main goal is to
optimize the write performance. Now of course the question is how to do that. How are we going to optimize that? Well,
again we have to understand the nature of the database. What we have in the OLAP systems is usually a data
model where you have very big fact tables, and around the fact we have multiple dimensions that are connected
to the facts. So those fact tables are really big tables in the database, and they are used each time we
build a report, so the reports are going to be using those facts all the time in order to prepare the data for the
visualizations, and a lot of aggregation queries are going to be run on the facts. And now of course you have to answer the
question: which type of index should we use in this scenario? Well, we have a perfect one called a column store index.
So the best practice here, and you can make it a strategy for the whole project, is that we make all fact tables
column store indexes, because this is what we are doing in OLAP: we are aggregating large data sets. But now the
data model and the scenario are completely different on the OLTP side. Here we're going to have a lot of
tables, and they have different sizes and so on, and there are a lot of relationships between all those
tables. So it is completely connected. So you have a lot of primary key and foreign key relationships between
them, and normally those tables are completely normalized tables. So they are like small pieces, while on the OLAP side
we have denormalized tables as facts. So here is one strategy that we can follow in the indexing of OLTP:
we create a clustered index for each primary key of our tables. This of course can improve a lot of things like
searching, sorting, and joining tables together. But of course since we are focusing on optimizing the write
performance in OLTP, you have to be more sensitive about adding new indexes compared to OLAP, because each index
you add could be a reason why the data is written very slowly. So in OLTP you have to be way more careful
adding indexes. So now as you can see, you have to understand the nature of your project. You have to understand
what the main issue is. Once you understand your project, you can go and define a goal for optimizing the
system, so either read or write or maybe both of them, and with that you are making the initial strategy for
indexing your system. All right. So with that we have
an initial strategy for our indexing and we have a rough plan. Now in the next phase we have usage patterns indexing.
So now we're going to do a deep dive into our project. And the first thing that we have to do is
identify the frequently used tables and columns. So that means you have to go and check the queries used in your
project in order to understand what the most important tables are, the ones used in many queries. Like for example here
we have the fact internet sales. It is used in many, many queries in our scripts. So here you are developing
a feeling for which tables are the most important and frequently used, and not only that, you can go and check how we
are filtering the data in those queries. So for example, over here we are filtering by the order date key, and this
kind of filtering is used in multiple queries. So as you can see we have here a couple of queries where
we are always doing the same thing, filtering the data by the dates. So with that we understand there is a
pattern inside our project where this column is used mainly for filtering and for aggregating as well. So that means
you do a deep dive in order to understand what the most frequently used tables and columns are
inside your scripts. And now of course what I usually do is go and use the help of AI like ChatGPT, where I give it my
code and then ask questions about it. For example, this prompt says: "Analyze the following SQL queries and
generate a report on table and column usage statistics. For each table, provide the total number of times the
table is used across all queries and a breakdown for each column in the table showing the number of times each column
appears. I would like to see as well the primary usage of each column: filtering, joining, grouping, and so on."
And in the output, as you can see, we got nice statistics about my scripts. So as you can see the most used
fact table is fact internet sales. It is used 13 times in the project, and then we can see statistics about
each column inside this fact. So most of the time the sales column is used for aggregating, and as we saw the order
date key is used five times for filtering, and the other keys are used for joining tables. So as you can see it's
amazing, right? Now we can identify which tables are important and which columns are important as well, and based
on that information we can maybe derive the indexing for our database. So with that we have identified our frequently used
tables and columns, and now in the next step we have to go and choose the right index type. As we learned before, we have
multiple types of indexes, and it really depends on the usage and the scenario. So for example, if your
columns are primary keys, then go with the clustered index. And if you are using columns that are not primary keys
where you are doing joining or filtering, then think about the non-clustered index. And of course, if the table is
very big, as we said, you can go and use the column store index. And if you are always targeting a subset of the data,
like only one year of data, then you can think about the filtered index. And the last one: if you have a
unique column where you don't have any duplicates, then you can go and apply a unique index. So depending on the
scenario and the usage, you have to choose the right index. And of course the last step in this phase is that you
have to go and test your index to check whether everything is working fine. So that's all for phase two.
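The index types listed in phase two map to statements roughly like the following. Every table, column, and index name here is made up for illustration, and the statements assume tables that do not already have conflicting indexes.

```sql
-- Primary key column: clustered index (here via the primary key constraint).
ALTER TABLE dbo.Customers
    ADD CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED (CustomerID);

-- Non-key column used for joining or filtering: non-clustered index.
CREATE NONCLUSTERED INDEX idx_Orders_CustomerID
    ON dbo.Orders (CustomerID);

-- Very big table used for aggregations: columnstore index.
-- Assumes dbo.FactInternetSales is currently a heap.
CREATE CLUSTERED COLUMNSTORE INDEX idx_FactInternetSales_CS
    ON dbo.FactInternetSales;

-- Queries that always target a subset (one year of data): filtered index.
CREATE NONCLUSTERED INDEX idx_Orders_OrderDate_2024
    ON dbo.Orders (OrderDate)
    WHERE OrderDate >= '2024-01-01' AND OrderDate < '2025-01-01';

-- Column with no duplicates: unique index.
CREATE UNIQUE NONCLUSTERED INDEX idx_Customers_Email
    ON dbo.Customers (Email);
```

After creating any of these, rerun the affected queries with the execution plan enabled to confirm the index is actually used.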
Then we go to phase three, scenario-based indexing. So here we have to tackle and focus on specific issues and specific
pain points. So that means we first have to identify the slow queries. They could be reported by users, or the team
is analyzing the logs to understand which queries are causing performance issues. And now once
you get a list of slow queries, you have to analyze them one by one, and it is time to dig into the execution plans.
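The recording mentions user reports and logs as sources for the list of slow queries. As one additional starting point that is not shown in the recording, SQL Server's plan cache statistics can be queried; this is a minimal sketch and only covers queries whose plans are still cached.

```sql
-- Top 10 cached statements by average elapsed time.
-- total_elapsed_time is reported in microseconds.
SELECT TOP (10)
    qs.execution_count,
    qs.total_elapsed_time  / qs.execution_count AS avg_elapsed_microseconds,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    st.text AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_elapsed_microseconds DESC;
```

Each statement found this way can then be run with the actual execution plan enabled and analyzed one by one.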
So as we learned, we can check how SQL is executing our queries and start looking for problem areas, for example where
SQL is doing a full scan of the tables or maybe using expensive operations like nested loop joins on large inputs and so on. So once you
understand where exactly the pain point is, the next step is that you have to go and choose the right index. So which
type of index are we going to use in order to optimize the query? And once you go and create the index, the last
step is that you have to go and test it. So you're going to check the execution plan again in order to make sure
that your query is using the index that you have just created. So that means you have to go and compare the execution
plans before and after. And if you see that there is no benefit, then something is wrong. That means you have to go and
investigate more, analyze the execution plan, and maybe choose a better index. And you have to repeat
this process for each slow query until you get all your queries fast. But of course, don't forget indexing is not the
only method to optimize the speed of queries. So as you can see, through these three phases we went from
a very generic method of indexing our system to something very specific and scenario-based. So as you can see, as
we move through the phases, we are doing a deeper dive into our project. All right. So now moving to the last
phase, we have the monitoring and maintenance of our indexes. As we learned, the job doesn't stop at just
creating and implementing indexes. We have to be responsible and keep an eye on the health of our indexes. And here the
database offers a lot of statistics and metadata that you can use in this phase. So the first step is
to monitor the usage of the indexes. And as we learned, we can use the dynamic management views or functions that we
can find in the system schema, where we can see how many times each index was used and when our queries last
used it. So with that we can go and find all those indexes that we have created and that have never been used in
our projects. And now the next step is that we can go and monitor the missing indexes. So here we can go and check
the recommendations from the database, where the database is reporting missing indexes from the execution plans,
and again we can go and use those dynamic management views or functions in order to see more details. And as well we
can go and monitor whether we have duplicate indexes. It happens a lot if you have a lot of developers
in your team. So it could be that they are working in parallel to optimize the performance of slow queries and then go
and create multiple indexes for the same column. So this is something that we can go and check, whether we have duplicates
in our indexes, and if you have duplicates then you have to go and find out how you can consolidate them.
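The usage and missing-index checks described above can be sketched with SQL Server's dynamic management views. The queries below are one possible shape, not necessarily the ones used earlier in the course; duplicate detection is left out because it needs a comparison of index key columns.

```sql
-- Index usage: indexes with NULL or zero seeks/scans/lookups since the last
-- restart are candidates for review.
SELECT
    OBJECT_NAME(i.object_id) AS table_name,
    i.name                   AS index_name,
    i.type_desc,
    s.user_seeks, s.user_scans, s.user_lookups, s.user_updates,
    s.last_user_seek, s.last_user_scan
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
    ON  s.object_id   = i.object_id
    AND s.index_id    = i.index_id
    AND s.database_id = DB_ID()
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY table_name, index_name;

-- Missing-index recommendations recorded by the optimizer.
SELECT
    d.statement AS table_name,
    d.equality_columns, d.inequality_columns, d.included_columns,
    gs.user_seeks, gs.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS gs
    ON gs.group_handle = g.index_group_handle
WHERE d.database_id = DB_ID()
ORDER BY gs.avg_user_impact DESC;
```

Both views are reset when the server restarts, so the numbers only describe activity since then, and missing-index suggestions should be weighed against the overindexing rule rather than applied blindly.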
Then in the next step we have to go and update the statistics. So as we learned, statistics are very important for the
execution plan, because the database engine uses that information to decide the best execution plan for your query,
and if the statistics are old then the database is going to make wrong decisions about how to execute your query, which
might lead to bad performance. So here again we have special functions in order to monitor the statistics, but
my recommendation here is that each weekend you have a job that goes and updates all the statistics of your database. And in the
last step we must not forget about monitoring the fragmentation. As we
learned, over time, as you are making
modifications to the tables, what could happen is that the order of the data gets mixed up, or there is free space
in the database that is not used. So we have fragmentation in the index, and in the same way we have to monitor
the fragmentation of each table, and here if the percentage is between 0 and 10 then there is no issue, but if the
fragmentation is between 10 and 30 then we have to go and reorganize the index, and if it's more than 30 then this is
alarming and you have to go and rebuild the whole index. And usually for the monitoring I go and build an automated
dashboard in Power BI or Tableau, where I go and extract all that metadata and create nice dashboards in order to
monitor the health of the database, or you can go and buy some other tools that are advanced in order to do those
things. All right. So this is my indexing strategy that I usually follow in my projects. And as you can see, each phase
builds upon the previous one. Moving from a general strategy to more targeted, refined, specific strategy
where we define first the goal of the indexing strategy of the projects. And as we move with the phases, we're going
to be targeting more specific scenarios. And this cycle keep repeating. It's not only one time. So you have to keep
discussing is the goal still suitable for the projects. You have to keep analyzing the frequently used tables and
columns and keep searching and finding those slow queries and always keep an eye monitoring the indexes and of course
I can only keep repeating this: avoid over-indexing. All right, my friends, so that's all about the indexes. That was a
lot of information and a lot of techniques, so now you know everything about indexing in SQL. Now in the next
one there is another important technique for optimizing the performance. So we're going to talk
about partitions: how to divide our data in order to optimize the performance. So let's
go. All right. So what is SQL partitioning? It's a technique to divide a large table into smaller
pieces, and each piece we call a partition. Well, this sounds like we are dividing one big table into smaller
tables, but it's not like that. We are just dividing one table into smaller partitions. So we're going to see it in the
database still as one solid table, but behind the scenes it is split into multiple partitions. So now let's go and
understand what this means. Okay. So now let's say that you have a table at your database and over the time this table is
getting bigger and bigger where you have like hundreds of millions of rows. Now once you have such a big table what's
going to happen? Everything is going to be slow. So for example, if you are reading the table and the execution plan is
doing a full scan of the table, it can take SQL a long time until all the rows are fetched. And if you decide to make
an index for this table, what's going to happen? SQL is going to go and build a very big B-tree index where
there are a lot of branches and leaves and so on. And having a big index is not always a good thing, because if you do
operations like deleting, updating or inserting rows, these operations are going to need a long time to process. So
having a big index doesn't mean that you get good performance for your big table. So that means having a big table
is problematic because everything is going to be slow. So now what can we do in order to optimize the performance of
this big table? Well, we can use SQL partitioning and in order to do that, we have to understand the behavior and the
transactions that are happening on our table. What usually happens is that the table grows over time. So, you
can have a subset of data that belongs to 2023, another one that was created and updated in 2024, and then you
have something more current in 2025. So that means we have in our table old data as well as new data, and
we usually interact with the new data more often than the old data. So maybe for example for 2023 there is only
one read transaction, and for the data in 2024 we have done two reads and one write. So it is a little bit more than
2023, but for the new data, for the current year, there will be heavy transactions. So we're going to have a
lot of reads and a lot of writes. We are updating, inserting, reading. So a lot of things are going on for the new data.
So that means we are frequently accessing the big table only to interact with the new data, and we rarely
need the old data. So what can we do? We can go and divide this big table, and we usually divide it by a date. So
that means we can go and split this table by year and put each year in one partition. So at the end we're
going to have three partitions. And now it's really important to understand that those are three partitions.
They are not three tables. So that means at the client side the users can see only one table but behind the scenes we
have three partitions. Now let's say that you have a query in order to read the data from 2025. What is
going to happen? SQL will not go and scan all the data from the table. It's going to go and target only one
partition, the 2025 one. So that means SQL is scanning only the relevant partition and not the
entire table. And we have another benefit of having partitions. Let's say that you're using a modern database;
normally they support parallel processing. So if you have the infrastructure for that, what can happen is that
the database engine processes each partition independently and in parallel, whether you are reading or writing
data. So SQL is going to process your queries in parallel, which of course can reduce the overall
execution time. So that means if you have a modern infrastructure, like for example Azure Synapse and so on,
go with partitions, because the partitions could then be stored on different servers, and this of course helps
the SQL engine to use all the resources at once. So that means partitions allow scalability as well as
parallel processing. Partitions also make indexing more efficient. So instead of having one very big index for
the whole table, if you put an index on a partitioned table, what's going to happen? Each partition is going to get its
own index, which means the size of the indexes is going to be smaller. And of course, this helps a lot with searching
for data and with extending the index itself. So for example, if you are inserting data into the partition 2025,
SQL will not go and change anything in the other indexes; it's going to go and change only the index of the
partition 2025. So you can see the power of partitioning: it significantly improves the performance of your
table whether you are reading or writing data to this big table. So this is what we mean with partitioning and why we
need it. All right, friends. So now we're going to go through the process of creating
partitions in SQL. At the start it might sound a little bit complicated, but we're going to do it step by step, and I
have a sketch for that. We have four steps, because we have multiple layers in the database. So let's
see how we can do that. Let's go. The first step is that we're going to go and define the partition function. What
is that? In the function we define the logic of how to divide the table into partitions, and
this is based on the partition key. So that means we need a column in order to define the logic, and we usually use
columns with dates, like for example the order date; in other scenarios we can use the region or country and so on.
But the most common one is the date, and that's because our tables get bigger over time. There are
multiple types of functions; we're going to focus on the range function. So how is it going to work? We're going to have
a range of dates, and then we have to define boundary values. Let's say that I would like to make a
partition for each year. In order to do that we have to define the partition boundaries. A boundary is a value:
the boundary of a year could be the first day of the year or the last day of the year. Here in this example we're
going to take the last day of the year for the boundary. So the last day of 2023, 2024 and 2025. We call those
values the boundaries of our function. Now between the boundaries we're going to have our partitions. So for example, all the
rows for 2023 and earlier years are going to be partition one. So the first boundary and everything before it is one
partition, and after that, between the first two boundaries, we have partition two. So this partition is going to be for all rows
of 2024. And then we have another section, partition three, where we have all rows of 2025, and then everything after
the last boundary onwards is going to be partition four, and here we're going to have all the rows from
2026 onward. So with that we now have a logic: we are telling SQL how to divide our data into multiple partitions. And
here there are two methods, left and right. So what are those two methods? Again we have our boundary,
and now the big question: to which partition does this boundary belong? Is it partition one or partition two?
That's why we have those two methods. If you say it is left, that means the boundary belongs to partition number
one. On the other hand, if you say it is right, then the boundary belongs to partition number
two. So you have to decide whether the boundaries belong to the left partition or to the right partition. With left,
partition one is going to have all the rows of 2023 including the last day of 2023, because in
partition two we only focus on 2024. So the boundary just belongs to the left partition. It's very simple. Now
let's go and implement that in SQL. So let's do it. The syntax is very simple. We're going to say CREATE PARTITION
FUNCTION and then we have to give it a name. So it's going to be partition by year, since we are dividing
the data by year. After that we have to define the data type. We are splitting the data by a date, so it's
going to be DATE. After that we have to define the partition function type; in our example we are using
RANGE. And now we have to define whether it is LEFT or RIGHT. We're going to stick with LEFT. And now comes the
very important step: we have to define the boundaries. So we're going to say FOR
VALUES and we're going to enter three boundaries here, like in our example; for each year we're going to define a date.
So 2023 and the last day of the year. The same goes for
2024, and for the last one, 2025. So with that we have defined the logic, the range, we have defined the
boundaries, and we have told SQL that the boundaries are dates. So let's go and execute our function. Okay, so that's
it. As you can see, it's very simple. We just created a function that splits the data by date using RANGE LEFT.
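Written out, the statement dictated above might look like the sketch below. The function name PartitionByYear and the exact date literals are assumptions reconstructed from the spoken description. The second statement uses the built-in $PARTITION function to show which partition number a given date would land in, which is a quick way to check the RANGE LEFT behavior at the boundaries.

```sql
-- Sketch: the name PartitionByYear and the literals are assumptions
-- reconstructed from the spoken walkthrough.
CREATE PARTITION FUNCTION PartitionByYear (DATE)
AS RANGE LEFT FOR VALUES ('2023-12-31', '2024-12-31', '2025-12-31');

-- $PARTITION.<function>(value) returns the partition number for a value.
-- With RANGE LEFT, each boundary date belongs to the partition on its left.
SELECT
    $PARTITION.PartitionByYear(CAST('2023-12-31' AS DATE)) AS LastDayOf2023,   -- partition 1
    $PARTITION.PartitionByYear(CAST('2024-01-01' AS DATE)) AS FirstDayOf2024,  -- partition 2
    $PARTITION.PartitionByYear(CAST('2025-12-31' AS DATE)) AS LastDayOf2025,   -- partition 3
    $PARTITION.PartitionByYear(CAST('2026-01-01' AS DATE)) AS FirstDayOf2026;  -- partition 4
```

Three boundaries produce four partitions, which matters when the file groups are mapped to them.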
And of course, this function is not yet attached to any tables or anything. It is just a logic that is stored in the
database. All right. So now, since our partition function is stored inside the database, we have metadata about
those functions stored in the system schema. There is a dedicated system view there, sys.partition_functions, and
in it we're going to find information about all partition functions that we have inside our database. So let's go and execute
it. And as you can see, we now find our newly created partition function. So partition by year, it is a range, and it
has an ID and so on. And I really recommend that you check it before creating any new partition function.
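A minimal version of that metadata check could look like this. The column list is just a selection from the sys.partition_functions catalog view; the exact query shown in the course is not known.

```sql
-- Sketch: list all partition functions in the current database.
SELECT
    name,
    function_id,
    type_desc,                 -- RANGE
    fanout,                    -- number of partitions the function creates
    boundary_value_on_right    -- 0 = RANGE LEFT, 1 = RANGE RIGHT
FROM sys.partition_functions;
```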
Maybe you already have one in the project. Okay. So now let's check the next step in our process. We're
going to go and build the file groups. So what is a file group? It is a logical container of one or more
data files. It's very simple, it's like a folder. We're going to go and create multiple folders now, so
later we can put files inside them. And this is really nice because it gives us freedom and flexibility:
we can decide how the data files are organized for each partition. What we usually do is create a file group for each
partition. So we're going to have four folders, or four file groups, for 2023, 2024 and so on. So now
let's go back to SQL in order to do that. All right. So now let's go and create those file groups. The syntax is
very simple. It's going to say ALTER DATABASE, and now we have to tell SQL in which database these file groups should
be stored. I'm going to stay with the sales DB. Then we have to say ADD FILEGROUP,
and after that we have to define the name of the file group. So the first one is going to be for
2023. The syntax is very simple. Let's go and do it for the other years. So we need
2024, 2025 and 2026. Okay. So that's all. We can just select everything and execute. As you can see it's very simple. We
have just created four file groups and they are empty, so we don't have anything inside those containers. Now
let's say that you have made a mistake with the naming and so on and you would like to drop one of them. The syntax
is very easy as well. It's going to say ALTER DATABASE sales DB, and instead of ADD you're going to say REMOVE.
Once you execute this, the file group will be dropped. But we need it, so let's go and recreate it. Now as usual, after creating
stuff, let's check whether everything is created correctly and whether we have any duplicates or anything wrong. For
that we have a file group view as well, sys.filegroups, inside the system schema, and let's go and query it. I'm just filtering
on the type FG for file group. So let's execute it. And now we can see that in our database we have five file groups.
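The file group steps described so far can be sketched as follows. The database name SalesDB and the file group names FG_2023 to FG_2026 are assumed spellings of what is said in the walkthrough.

```sql
-- Sketch: SalesDB and the FG_20xx names are assumed spellings.
ALTER DATABASE SalesDB ADD FILEGROUP FG_2023;
ALTER DATABASE SalesDB ADD FILEGROUP FG_2024;
ALTER DATABASE SalesDB ADD FILEGROUP FG_2025;
ALTER DATABASE SalesDB ADD FILEGROUP FG_2026;

-- To drop an (empty) file group again, ADD becomes REMOVE:
-- ALTER DATABASE SalesDB REMOVE FILEGROUP FG_2023;

-- Check the result; sys.filegroups is scoped to the current database.
USE SalesDB;
SELECT name, type, type_desc, is_default
FROM sys.filegroups
WHERE type = 'FG';
```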
Four of those file groups are the ones we just created, right? So we have 2023, 2024 and so on. But there is one more, called the
primary file group. This is the default file group that is created for each database. So it is a container for all
data files in your database. As you can see, we have a flag here saying whether it is the default. For the primary we have
a one, and for the rest they are not the default. So it is really nice to see all the file groups inside your database
to check that you don't have duplicates and so on. Okay. Now moving on to the third
step, where things are going to get more physical. So far we have a function and the file groups, and all of that
is logical stuff. We don't have data yet. In order to have data, we have to go and create data files. As we
learned before, data files contain our actual data, and they are stored physically in the
database. You can go and assign one or multiple data files to each file group. The file format here is
NDF, which means secondary data files. We have primary and secondary data files, and for
partitions we usually go with this format, the NDF. So again, the file groups are logical containers and the data
files are physical files where our actual data is going to be stored. So now let's go back to SQL in order to create some data files. Okay. So now
we come to the slightly annoying part where we're going to go and create files. But the syntax is
very simple as well. We're going to say the same thing, ALTER DATABASE, and our database is sales DB. And then this time
we're going to say ADD FILE. Now we have to give SQL not only the name but also the physical location of the file. So
let's do it step by step. We're going to open two parentheses. First we have to define the logical name for SQL.
It is not the file name, it is the logical name of the file. So let's give it a name, for example P 2023, and then a
comma. So this is the logical name. The next one is the physical name of the file together
with the path. So we're going to say FILENAME equals, and now we have to define for SQL the complete path of the
file in SQL server there is like a default path where the data going to be stored and I'm going to go and use the
same path and the path really depends on the version and as well the type of the SQL server that you are using. So for
the current version that I'm using for this tutorial we can find it over here in this path. So if you go to the C then
program files Microsoft SQL Server MSSQL and the version for me is 16 SQL Express and then inside MSSQL data and so on. So
we're going to go inside this folder and now we can see over here all the database files. So we can see for
example here the sales DB the sales DB logs and we have here the adventure works and so on. So you're going to see
all the files of your database. And what we're going to do, we're going to put as well our partitions files inside the
default folder. But for real project, you have to ask the database administrators about the exact location
where you can put your partitions. So let's go back to SQL, and I'm going to put this path over here. Then we
have to specify the file name. So it's going to be P 2023, dot, and now we have to specify the file extension, so NDF.
With that we now have a complete path with the file name. So we are almost there, but we are not done yet. We have
to tell SQL where to put this file, in which container, in which file group. So we're going to go over here and we're
going to say TO FILEGROUP, and here make sure to select the correct one, so FG 2023. All right. So that's all. Let's
go and execute it. With that we have created a file inside a file group. I will not be creating
like multiple files inside one file group. It's going to be like one to one. So now what we're going to do we're
going to go and create the other files for each file group for each year. So we just have to copy and paste and just
change the names. So for 2024 it's going to be like this. So that's it. And the same thing
for 2025, and for the last one, 2026. We can go and select
everything now and execute it. So that's it; with that we have now created four different files and we have mapped
each file to the correct file group as well. I usually don't create a lot of files. I just create one for each
year, or maybe for a bunch of years. So you don't have to go and make a partition for each day or something like that.
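Put together, the four ADD FILE statements might look like the sketch below. The logical names P_2023 to P_2026 and the folder are assumptions: the path is a guess at the default data folder of a SQL Server 2022 Express instance as described in the walkthrough, and it will differ on other installations.

```sql
-- Sketch: logical names and the folder path are assumptions; adjust the path
-- to your own instance (or to whatever location the DBA assigns).
ALTER DATABASE SalesDB ADD FILE
(
    NAME = P_2023,   -- logical name, not the file name
    FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL16.SQLEXPRESS\MSSQL\DATA\P_2023.ndf'
) TO FILEGROUP FG_2023;

ALTER DATABASE SalesDB ADD FILE
(
    NAME = P_2024,
    FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL16.SQLEXPRESS\MSSQL\DATA\P_2024.ndf'
) TO FILEGROUP FG_2024;

ALTER DATABASE SalesDB ADD FILE
(
    NAME = P_2025,
    FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL16.SQLEXPRESS\MSSQL\DATA\P_2025.ndf'
) TO FILEGROUP FG_2025;

ALTER DATABASE SalesDB ADD FILE
(
    NAME = P_2026,
    FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL16.SQLEXPRESS\MSSQL\DATA\P_2026.ndf'
) TO FILEGROUP FG_2026;
```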
Okay. As usual, after creating stuff we have to go and check the metadata. I have prepared a query here where we
query the file groups together with the files. All the data file information can be found inside the view sys.master_files,
so we join those and filter on our database. So let's go and query this one. And now we're going to
get a list of all files inside your database. We see over here that we have the primary for the database itself, and
you can see the path of the file as well as its size. And we can see over here that we have four files, the file
group that each is assigned to, and the complete path of each file. Of course you can monitor here how the size
of each file is growing over time. Maybe one of them is getting really big, and then you can think about
splitting it into multiple files. So that's it about how to create data files.
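One possible shape for that file check is sketched below. It is an approximation, not the exact query from the course: it joins sys.filegroups to sys.master_files and converts the size from 8 KB pages to megabytes. SalesDB is the assumed database name.

```sql
-- Sketch: file groups together with their data files for one database.
USE SalesDB;   -- sys.filegroups is scoped to the current database

SELECT
    fg.name          AS FilegroupName,
    mf.name          AS LogicalFileName,
    mf.physical_name AS PhysicalFilePath,
    mf.size / 128    AS SizeInMB          -- size is stored in 8 KB pages
FROM sys.filegroups AS fg
JOIN sys.master_files AS mf
    ON mf.data_space_id = fg.data_space_id
WHERE mf.database_id = DB_ID('SalesDB');
```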
All right. So now we're going to move to the last step, where we're going to go and define the partition scheme. Now if
you have a look at this picture, you see that there is something missing. On one side, we have defined how to divide
our data into multiple partitions. On the other side, we have prepared all the files and the file groups and so
on. What is missing is the connection: how to connect those partitions to the file groups. And we
can do that by using the partition scheme. So all we are doing now is just defining which partition belongs to
which file group. For example, we're going to go and map partition one to the file group 2023, and with that all
the data of 2023 and earlier is going to go to the file group 2023. Of course we have to go and map each partition to a
file group. If you don't do that, you will get an error in SQL. Once we build the partition scheme, we have
all the components ready in order to have a partitioned table. So now let's have a quick summary. The partition function
decides how to split your data into multiple partitions. The partition scheme maps the
partitions to file groups. The file groups are like folders in order to organize your files. And each file group
has one or more data files where your actual data is going to be stored physically. At the
start it might be confusing, but now that you understand each layer, it's going to be easier for you to build
partitions. So now let's go back to SQL in order to build the partition scheme. Okay, so now we have the easiest part
where we're going to connect everything together. The syntax is very simple as well. It's going to say CREATE
PARTITION SCHEME, and now we have to give it a name. So let's go with something like scheme partition by year. Now we have to
map the partition function to the file groups. So first we're going to say AS
PARTITION, and then we need the partition function that we have created. So AS
PARTITION partition by year, and after that we're going to map it to the file groups. Here it is very important to
map them in the correct order. So the first one is file group 2023, the second one
2024, then we have 2025, and the last one is 2026. So again, the order is very important, and it's also a
little bit tricky. Sometimes, as you are creating the functions, you may make the mistake of not knowing how
many partitions are going to be created. In our example we have three boundaries and SQL is going to create four partitions.
So it happens sometimes that you think, okay, I have three boundaries, so I'm going to get three partitions, which
is not correct. For example, let me just remove one of those, so let's say I have only three file groups,
and let's go and execute this one over here. Now we are getting an error. It says the partition function generates more
partitions than there are file groups. And that is correct, because our definition of the logic splits the
data into four partitions, and now we are giving SQL only three file groups, which is not correct. So we have to go
and add the plus one. One more thing: SQL will not go and check whether you are mapping things correctly to the file
groups, because it doesn't care about the naming of those file groups. So for example, if you go and put the 2023
one at the end, what's going to happen? It's going to be a big problem. All the data of 2023 is going to be stored
inside 2024, 2024 is going to be in 2025, so everything is going to be mixed up, and SQL will do it just as you tell it. So
that's why you should make sure you have the correct order. So that's it. Let's go and create our scheme. So it is working.
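As a sketch, the scheme and a check of its mapping could be written like this. The name SchemePartitionByYear and the FG_20xx names are assumed spellings. The second statement is one way to read the mapping back from the catalog views and is not necessarily the query used in the course.

```sql
-- Sketch: three boundaries give four partitions, so four file groups are
-- listed, in partition order (partition 1 first).
CREATE PARTITION SCHEME SchemePartitionByYear
AS PARTITION PartitionByYear
TO (FG_2023, FG_2024, FG_2025, FG_2026);

-- Read the mapping back: which partition number goes to which file group.
SELECT
    ps.name            AS PartitionSchemeName,
    pf.name            AS PartitionFunctionName,
    dds.destination_id AS PartitionNumber,
    fg.name            AS FilegroupName
FROM sys.partition_schemes AS ps
JOIN sys.partition_functions AS pf
    ON pf.function_id = ps.function_id
JOIN sys.destination_data_spaces AS dds
    ON dds.partition_scheme_id = ps.data_space_id
JOIN sys.filegroups AS fg
    ON fg.data_space_id = dds.data_space_id
ORDER BY ps.name, dds.destination_id;
```

If partition 1 does not show FG_2023 in the result, the order in the TO list is wrong.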
This is very simple: we just mapped the partitions to the file groups. And as usual we check things after creating them.
I have prepared a really nice query here on the metadata in order to see the whole thing: the functions, the file
groups, the schemes. You can of course add the data files to it, but I'm just going to stick with this. So again,
in SQL Server we have a dedicated view for the partition schemes, sys.partition_schemes. I'm just joining it with the functions and then
with the destination data spaces in order to get the partition number and the file groups. So let's go and execute
it. Now we can see very nicely the scheme that we have created and the name of the partition function. Then
we can see the partition number and the file group name, so we can see how things are mapped together. If you
get it like this, then so far everything is good. All right. So far, what have we
done? We have prepared all the layers. So the setup is ready to be used in any table. We have the function, the
files, the file groups and the scheme, and everything is ready. But we are still not using it. The logic just exists and
the files are empty. So now what we're going to do is create a table, but not a normal one: a partitioned
table. So let's go and do that. It's very simple as well. So CREATE TABLE, and we have to give it a name. Let's put
it in the schema sales as well, so sales orders, and I'm just going to add partitioned to the name. Now we just have to
define a few columns inside this table. So let's get an order ID with the data type INT. And let's go and get an order
date with the data type DATE. And maybe just one more called sales, with the data type INT. So this
is a very normal table like we create in databases, but it is not yet partitioned. Now in order to use
everything that we have defined, we're going to do the following. We're going to say ON, and now we have to tell
SQL only the name of the partition scheme. Everything else is connected and mapped together, because
the scheme maps the function to the file groups, the file groups are mapped to the data files, and everything
is connected together. Here in the table we just have to give the name of the scheme. So the name of the
partition scheme is scheme partition by year. And now it's very important to give a column. Since
the whole logic of the function is based on a date, we cannot go and specify here for example the order ID or
sales, because it makes no sense. We're going to go and pick the order date and put it over here. With that, we have
created a partitioned table. So now what we're going to do is start inserting data into our table.
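The table definition described above might be written as follows. The table name Sales.Orders_Partitioned and the column names are assumed spellings, and the Sales schema is assumed to exist already.

```sql
-- Sketch: table and column names are assumed spellings.
CREATE TABLE Sales.Orders_Partitioned
(
    OrderID   INT,
    OrderDate DATE,
    Sales     INT
) ON SchemePartitionByYear (OrderDate);   -- partition key must match the function's DATE type
```

The ON clause is the only difference from a regular CREATE TABLE: it names the partition scheme and the column whose value decides the partition.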
So let's go and do that. We're going to say INSERT INTO sales orders partitioned, and we're going to pick
values like this. So one, and then let's take any date in 2023, for example the middle of a month, and the sales
could be anything, let's say 100. So let's go and execute this, and let's go and query our
table. So it is this one over here. All right. So now we have one record inside our partitioned table. And
now the big question is: in which partition, in which data file, did SQL store this record? We have to test
whether everything is working fine. In order to do that I have prepared a query as well. We are again querying the
table partitions together with the destination data spaces, where we're going to get the number of rows in each partition, and
then we have the file group, and we are focusing on our table orders partitioned.
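A sketch of the first insert and of such a check is shown below. The date 2023-05-15 is a made-up example value, and the query is one possible formulation rather than the course's own: it goes through sys.indexes so that only the partition scheme of this table is matched.

```sql
-- Sketch: example row; the exact date is a hypothetical value in 2023.
INSERT INTO Sales.Orders_Partitioned (OrderID, OrderDate, Sales)
VALUES (1, '2023-05-15', 100);

-- Number of rows per partition and the file group each partition lives in.
SELECT
    p.partition_number AS PartitionNumber,
    fg.name            AS FilegroupName,
    p.rows             AS NumberOfRows
FROM sys.partitions AS p
JOIN sys.indexes AS i
    ON i.object_id = p.object_id
   AND i.index_id  = p.index_id
JOIN sys.destination_data_spaces AS dds
    ON dds.partition_scheme_id = i.data_space_id
   AND dds.destination_id      = p.partition_number
JOIN sys.filegroups AS fg
    ON fg.data_space_id = dds.data_space_id
WHERE p.object_id = OBJECT_ID('Sales.Orders_Partitioned')
ORDER BY p.partition_number;
```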
now we can see very easily we have the four partitions. our new record is inserted in the correct place in 2023
file group and in the correct partition. So with that we make sure our function and the whole logic that we have built
is working correctly. So now let's go and add more records. I'm just going to go and duplicate it. Record number two.
And I'm just going to pick a date in 2024. And this one going to be like 20. Let's just change the value. So 50.
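The second row could be inserted as sketched here; the date 2024-07-15 is a hypothetical value, since only the year is given. As an alternative to the metadata query, $PARTITION can be applied to the column itself to show the partition number next to each row.

```sql
-- Sketch: example row; the exact date is a hypothetical value in 2024.
INSERT INTO Sales.Orders_Partitioned (OrderID, OrderDate, Sales)
VALUES (2, '2024-07-15', 50);

-- Show each row with the partition number the function assigns to its date.
SELECT
    OrderID,
    OrderDate,
    Sales,
    $PARTITION.PartitionByYear(OrderDate) AS PartitionNumber
FROM Sales.Orders_Partitioned
ORDER BY OrderID;
```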
Let's go and execute it. And now we have a second row inside our table. And again the big question is
whether it is working. So let's go and execute this again. And now we can see our record is inserted in the partition
2, in the file group 2024, which is correct. Now let's go and check the boundaries, whether they are working
correctly. So here in the third row I'm going to say the last day of 2025. So it's going to be
month 12 and the last day, so the 31st. Let's go and insert it and check our table. So we have a new record. And now let's go
and check. My expectation here is that this row is going to be inserted in the file group
2025. So let's go and execute. And that is correct. As you can see, the record is inserted in the correct partition. And
this is really important to test the boundaries whether they are working correctly because it's a little bit
tricky. You have this range left right and boundaries and so on. So you can do it like this to check whether the
expectation of your logic is working correctly. And the last one I'm just going to do it very fast. So let's do it
2026. And I'm going to pick the first day of this year. So let's go and insert it. And now
what is the expectation? I think it is pretty simple. So let's go and query. And the first day of this year is
inserted in the partition number four. So I can say everything is working correctly. If you get it like this then
you have created successfully a partition table and you have prepared all the layers of this partition
correctly. I know this is a lot of work but to be honest it is fun because for the first time in database you feel like
you are controlling stuff. Usually in database everything like behind the scenes and you don't know exactly where
the files are stored of your tables and so on. There is a lot of abstraction in databases but here like we are getting
deep in databases and we are controlling and managing all those files which is sometimes it's nice to have this freedom
and flexibility. All right one quick thing that I would like to show you that if you go to the database in the
explorer then let's go to the storage over here. So let's expand it and here you can find easily informations about
the partitions. So over here we can find our partition scheme and as well the partition function that we have created.
It is just quick access instead of querying the metadata. So now let's have a quick
summary of how everything is connected together. We have a table, and we specify for SQL that it is connected to a
partition scheme, and in the partition scheme we have everything connected. It is linked to a specific partition
function, and there we have the partitions, and at the same time it is connected to file groups, and the file
groups are connected to the data files. So as you can see, all those layers and elements are connected together. Now
let's see how this works. We have inserted the last day of 2025, and the first thing that's going to happen is that
the partition function decides to which partition it belongs. As you can see, it is a boundary value, and since
we have defined the function as LEFT, it is going to target the left partition, partition three. Then the partition scheme
connects it to the right file group, and in this scenario it's going to be the file group 2025. We have
only one file here, so it is going to go to the correct data file as well, and in this file SQL is going to store this row. So
it is pretty easy. And now we come to a very important part, where we can understand how the
partitions are really improving the performance of my query, and of course we can do that by checking the execution
plan. In order to compare the behavior with and without the partitions, what we have to do is
create a mirror table without partitions. So we have our table here, the partitioned one. What I'm just going to
do is select from it, go over here and say INTO, and we're going to call the new table sales orders no partition. So we are taking
the data and the structure from the orders partitioned table, and of course the copy will not be partitioned. So let's go and
execute it. Now if you go over here, we can see that we have two tables: the no partition and the partitioned
one. So now what we're going to do is write a query on both tables and then compare the execution plans.
First let's start with the no partition one. So SELECT star FROM, and now in order to see the effect of the partitions, what we're
going to do is say WHERE order date equals, and now we're just going to pick a value like 2026, the 1st
of January. So let's go and query it, and we're going to do the same thing in a new query, but this time for the partitioned table.
So now in order to see the execution plan, make sure to activate it. We go to the action bar over here and we're
going to say include the actual execution plan. So let's click on it and execute. With that we have an
execution plan here. Let's do the same thing for the no partition table. So execute, and we have an execution plan here. So now
let's check what we have in the execution plan. We're going to focus on this one over here. So right-click on it and then
go to properties. Now we can see a lot of details about the execution plan, but what is interesting is the number of
rows. As you can see, we are reading four rows, that means the whole table. And of course we have the CPU and
the other costs here. Now let's go and check the partitioned one. So let's click over here. If you check over here, you can
see that the total number of rows is one. So SQL didn't read all four rows. It read only one row, and that's because we
have only one row in this partition. And as you can see, the number of partitions that is used is only one as well. So as
you can see, using partitions we have reduced the number of rows that are retrieved from the files. Now let's go
and retrieve data from two different partitions and check the
execution plan. So let's also target 2025, the last day of the year, like this. So let's go and execute it. And the same thing for the other
query. So let's check the one without partitions. We are still reading four rows. But now if you go to the
other one, if you check the execution plan and check the table scan, you can see we are reading only two rows, and
this time the number of partitions that are involved in this query is two, and that's because we touch the partitions for
2025 and 2026. So as you can see, it's worth the effort. We have optimized our queries, and this has a great impact on
big tables. The number of resources and the number of reads are going to be reduced massively. All right, my friends. So
that's all about the partitions in SQL. It is amazing and you can use it as well not only in databases but as well in
many other data platforms and tools where you always can divide your data in order to optimize the performance. Now
in the next step what I have prepared for you after 15 years working in real projects using SQL. I have a lot of best
practices and tips for you. So I have collected everything that I know and now I'm going to show you the best practices
and tips and tricks that I can give you in order to optimize the performance in SQL. So let's go.
And now, before we deep dive into the 30 best practices, I'm going to give you the golden rule. The SQL optimizer responds differently to different table sizes. That means if you have small or medium tables, say hundreds of thousands of rows, you might not notice any performance difference when you follow the best practices, because the data is small. But if you have millions or hundreds of millions of records in your tables, you will immediately notice how much faster things can be if you follow the best practices. And here is my golden rule.
If you get any best practice from me, or let's say you are reading something on the internet, you always have to test it using the execution plan. For example, if you have two queries that return the same result, I recommend you check the execution plan of each. If you notice there are no differences between them in the execution plan, then pick the one that is easier to read and understand, because sometimes following the performance best practices makes your query a little more complicated. So always write the query to be understandable, and only optimize it if you notice it is slow. So the golden rule here is: always test. If you find the new query improves performance, then pick that one, and if there is no gain in performance, then focus on making your queries readable. So this is the golden rule: always test, test, test using the execution plan.

So let's deep dive into the best practices, and we're going to start by optimizing the performance of our
queries.

All right, let's start with the easy stuff. The first tip is: select only what you need. What I usually see in many queries is that developers just select all the columns from a table, and I can tell you I cannot think of one scenario where you need all the columns of a table in one query. So for sure the result will contain unnecessary columns, and of course reading unnecessary information is going to make your query slower. So this is usually a bad practice. Don't use SELECT *; instead, list the columns that you need for your query. With that you don't risk reading unnecessary information from the database. So always make sure that you select exactly what you need for a query; don't go with a star.

Okay. Tip number two: avoid unnecessary DISTINCT and ORDER BY. I
have noticed that many developers, as they write a lot of queries, tend to add DISTINCT and ORDER BY to every query by default. And as we review the code and discuss it with the developer, we see that we really don't need to remove any duplicates in the query, because there are no duplicates; it was only a habit to remove duplicates using DISTINCT. The same goes for ORDER BY: in many situations there is no need to sort the data at all. And those operations, removing duplicates with DISTINCT and sorting the data, are very expensive operations in your execution plan. They're going to take a lot of resources and slow down your query. So it is considered a bad practice to use DISTINCT when it's not needed, or to use ORDER BY to sort the data when it is not necessary. The best practice here is to avoid them: use DISTINCT or ORDER BY only if it is necessary.

Okay. The next one: for exploration purposes, limit the rows. So
sometimes, especially if you are working with a new database, you would like to explore the tables and take a quick peek at their content. If your database has a lot of big tables with millions of rows, you will consume a lot of resources if you just select the data like this. Now imagine that the orders table has 100 million rows. As you run this query, the database has to fetch all 100 million rows for you. Usually for exploration it's enough to see 10 rows. That's why it is considered a bad practice to explore tables without a LIMIT or TOP. A good practice would be to say SELECT TOP 10 and then have the same query. If you go over here, you will get only 10 rows, and the database will not fetch 100 million; it fetches only 10 rows. And if you are exploring a lot of tables, you will not consume a lot of resources from the database. So if you are exploring, always limit the number of rows that you are retrieving.

All right. So now we're going to talk about how to optimize filtering in SQL. The tip here is to create a nonclustered index on columns that are frequently used in the WHERE clause. Of course, you have to check your queries first. If you see that you are frequently filtering the data by the order status, then it makes sense to create a nonclustered index for this column in order to improve the performance of your query.
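The tips so far (select only the needed columns, limit rows while exploring, and index a frequently filtered column) can be sketched in T-SQL. This is a minimal sketch: the schema `Sales.Orders`, its column names, and the index name `IX_Orders_OrderStatus` are assumptions for illustration and are not taken from the recording.

```sql
-- Assumed (hypothetical) table: Sales.Orders(OrderID, CustomerID, OrderDate, OrderStatus, ...)

-- Select only the columns the query needs instead of SELECT *.
SELECT OrderID, CustomerID, OrderStatus
FROM Sales.Orders;

-- When exploring an unfamiliar table, cap the number of rows returned.
SELECT TOP (10) OrderID, CustomerID, OrderStatus
FROM Sales.Orders;

-- If queries frequently filter on OrderStatus, a nonclustered index
-- on that column gives the optimizer something to seek on.
CREATE NONCLUSTERED INDEX IX_Orders_OrderStatus
    ON Sales.Orders (OrderStatus);

-- A filter like this one is then a candidate for an index seek.
SELECT OrderID, CustomerID
FROM Sales.Orders
WHERE OrderStatus = 'Delivered';
```

Whether the optimizer actually uses the index depends on the data and on how selective the filter is, so the execution plan is still the place to confirm it.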
So for this situation I'm going to create a nonclustered index on the sales orders table for the order status. Once you create it, you are improving the performance of your query.

Okay. The next one is: avoid applying functions to columns in the WHERE clause. In many cases, what we usually do is transform the columns before filtering the data. For example, here I'm applying the function LOWER to the order status because I'm searching for the value "delivered", and I'm not sure whether the values in the table are capitalized, uppercase, or anything else. To make sure that I find the value, I say LOWER of the order status and then give a lowercase value, and of course it's going to work. If we search for it, you can see we have the status Delivered, and the stored value differs from the one I used because it has a capital first character. But here we have a problem. We have an index on the order status, and if you apply a function to the column, like LOWER here, SQL typically cannot seek on the index. That means the whole index is now useless, and that's why we consider it a bad practice to apply functions to columns in the WHERE clause. Instead, the good practice is to not use any function and to write the value exactly as it is stored in your data. With that, SQL is going to be happy and use the index that you have created.

Okay, let's have another example of this rule. Here we are selecting all the customers where
the first name starts with an A. We can use the function SUBSTRING to get the first character of the first name, and once you match it with 'A' you get the result, and here we have Anna. This is again bad if you have an index on the first name, because we are applying a function to the column. So this is considered a bad practice. Instead, we can use LIKE. We search for a pattern that starts with A followed by a wildcard, because we don't care about the rest; the value just has to start with A. If you execute it, you will get the same results. So try as much as you can to avoid functions in the WHERE clause so that the index can be used. In many scenarios there is a workaround that gets the same result without transforming the column. So try your best to avoid using functions if your columns have an index.

All right, one more example that you see a lot in queries: filtering by the year. So
we are searching for the orders that happened in 2025, and we usually use YEAR of the order date. Now if you have an index on the order date, this again will not work, because you are using the function YEAR. So this is considered a bad practice. Instead of using the YEAR function, you can use BETWEEN. We don't apply a function to the order date; we say the order date is between the boundaries of the year. Of course, our query now doesn't look as cool and easy as the first one, but with the second one we are hitting the index. So again, while you are filtering, try not to use functions on the columns, because it is really a waste if you have an index and you are not using it, and in most cases there is a workaround for your function. So those are the three examples that I wanted to show you for this tip.

All right, moving on to a similar one: avoid leading wildcards, as they prevent index usage. Let's say, for example, I'm searching for the word "gold" inside the last name. Here we have to be careful about what we are searching for. Should "gold" exist anywhere in the last name, or are we only searching for last names that start with "gold"? If we are searching only for last names that start with "gold", then we are doing it wrong here. In SQL, if you use a leading wildcard, SQL will not use the index. But a wildcard at the end, a trailing one, is fine and will not prevent index usage. So the leading wildcard is considered a bad practice, because you will not be hitting the index. It is better not to use a leading wildcard, and if a trailing one is enough for your search, then you are hitting and using the index.

Okay, moving on to the next one: use IN instead of multiple ORs. The OR
operator is very evil for performance, so try to avoid it. It can really kill your performance, whether it is in filters or in joins. Say we want to show the orders where the customer is equal to one or two or three. Written with ORs, this is considered a bad practice and is hard to read. Please don't do that. Instead we have the IN operator, where we say: if the customer is one of those values, then show the orders. If you run it, you will get exactly the same results, and it not only looks nicer than the first query but can have better performance as well. So if you find yourself writing a lot of ORs, think about the IN operator. Those are the best practices for filtering data to improve performance.

Okay, so now we're going to focus on how
to optimize joining tables in SQL. The first tip here is to understand the speed of joins and to use INNER JOIN when possible. As we learned before, we have different types of joins: inner, left, right, and full outer join. If we talk about performance, you will get the best performance from the inner join, because SQL works only on the matching rows. That means the effort and the processing time are lower than for the other joins. Next in the ranking are the left and right joins. They are slightly slower than the inner join because they usually process more rows: SQL works not only with the matching rows but also with the unmatched rows. So for right and left joins SQL has to do more than for the inner join. And the worst type is the full outer join, because it works with the biggest number of rows compared to the other types. It presents unmatched rows from both the left and the right tables. That means SQL has a lot to do, and that's why this join has the worst performance. So my advice is: always try to use the inner join if it's enough to work with the matching rows, and if the matching rows are not enough, then go with the left join. But try your best to use the inner join instead of the left join. And don't forget, the inner join filters the data.

Okay. The next one: use explicit joins (ANSI joins) instead of implicit joins. It is considered a bad practice to join tables like this, with the implicit or non-ANSI join. It's better to use the normal modern join, where you write INNER JOIN, for example. As for performance, there is no difference between them, and this scenario is very simple. But if you have a complex query, joining tables like this can be very confusing, really hard to read, and complex to optimize. That's why the best practice says: go with the normal inner join. So go with the ANSI join instead of the non-ANSI join.

Okay, to the next tip: make sure to index the columns used in the ON clause. We have to make sure that both of those columns have an index, because indexes speed up the lookup process. Without an index on those columns, the database might scan the entire tables in order to find a match.
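The two join tips above can be sketched as follows. The tables `Sales.Customers` and `Sales.Orders`, their columns, and the index name are hypothetical names chosen for illustration; the sketch also assumes `Sales.Orders.CustomerID` has no index yet.

```sql
-- Implicit (non-ANSI) join: the join condition is buried in the WHERE clause.
SELECT o.OrderID, c.FirstName
FROM Sales.Customers AS c, Sales.Orders AS o
WHERE c.CustomerID = o.CustomerID;

-- Explicit (ANSI) join: same result, but the join condition is stated in ON.
SELECT o.OrderID, c.FirstName
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID;

-- Index the column used in the ON clause on the orders side
-- (typically the foreign key), so the join does not need a full scan.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON Sales.Orders (CustomerID);
```

The customers side of the join is usually covered already when `CustomerID` is the primary key, since a primary key gets an index by default.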
And that is really slow if you have big tables. Now if you go to the customers table over here and then to its indexes, we can see that we have a clustered index on the customer ID. But if you check the customer ID in the orders table, we don't have an index for that. To fix that, we're going to create a nonclustered index on the orders table for the customer ID, since it is a foreign key. Once we do that, we have an index on both of those columns, and with that our join is going to be faster.

Okay. So now we come to a tip where we say it really depends; there is not one clear way to do
it. But let's say, if you have big tables, it is better to filter data before joining. Here we have three different scenarios that deliver the same results, and of course the question is which one is the best for performance. So let's have a look at them. What we are doing here is joining two tables and then filtering the result based on the order status that comes from the orders table. In the first query, we first join the tables and at the end use a WHERE clause to filter the data. So we are filtering the data after joining the tables.

But there is another way to do it. You can join the tables and add the condition order status equals Delivered to the join condition. So we are matching the data by the customer ID and at the same time filtering the data by the order status. Since we are using the inner join, the filtering happens during the join.

Or you can do it like this, where we have more to add: we don't join the customers directly with the orders table. We first prepare the orders table before joining it with the customers. Our preparation here is selecting just the columns that we need and filtering the data before the join, using a subquery. But if you run all those queries, you will get the
exact same results. And of course there is another way to do it: you can prepare the data not in a subquery but in a CTE, and then join the result of the CTE with the customers table.

So now about the performance. If your query is small and not that complex, and you don't have big data in your tables, all three queries are going to deliver the same performance. I know it might sound weird, because here we are filtering after joining and there we are filtering during the join. But in databases today the SQL optimizers are very smart; they can understand that there is a filter and decide on the best execution plan for you. So wherever you put your filter, after, during, or before the join, SQL is smart enough to do it correctly. So if you don't have a complex query and you don't have big tables, go with the one that suits you. I really recommend the first one, because it's logical and easier to understand. But if you have big tables and complex queries, the best practice says: always try to prepare the data before joining it. So try to isolate the preparation step in a subquery or a CTE before joining it with any other tables. In many scenarios in my projects where I had a big table, this did help: the execution plan was better when I isolated and prepared the data before joining it.
So if you have small or medium tables, go with the normal way and use the WHERE clause. But if you have complex queries on big tables, prepare the data in a subquery or CTE and then join it with the other tables.

Okay. And now moving on to tip number 12. It is similar to the previous one, but this time it says: aggregate data before joining tables. Again, it is a special case to improve the performance on big tables. We have the following scenario: we are joining the orders and the customers and aggregating the data by the customer ID, and we are joining the customers table only because we need the first name. So as a result we have the customer ID, the first name, and the order count. The standard way is to join the tables and then do a GROUP BY in order to summarize the data.

Now if you look at this query, we actually don't need the join in order to do the aggregation. We can do the aggregation first, preparing the orders with the aggregated data, and then join the result with the customers in order to get the first name. So again we prepare first and then we do the join, and we can do that using either a subquery or a CTE. In this scenario we first do the GROUP BY and aggregate the data, and the result is joined with the customers table in order to get the first name.

Of course, there are many ways to do it, for example using a correlated subquery, where we put the subquery in the SELECT statement and use the WHERE condition in it to correlate it with the outer query. All three are going to deliver the same results, but the question here again is: which one has the best performance?
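As a rough sketch, the three variants being compared could be written like this. The names `Sales.Customers`, `Sales.Orders`, and `OrderCount` are assumptions for illustration.

```sql
-- 1) Standard way: join first, then group.
SELECT c.CustomerID, c.FirstName, COUNT(o.OrderID) AS OrderCount
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerID, c.FirstName;

-- 2) Pre-aggregate the orders in a subquery, then join to get the first name.
SELECT c.CustomerID, c.FirstName, o.OrderCount
FROM Sales.Customers AS c
INNER JOIN (
    SELECT CustomerID, COUNT(OrderID) AS OrderCount
    FROM Sales.Orders
    GROUP BY CustomerID
) AS o
    ON c.CustomerID = o.CustomerID;

-- 3) Correlated subquery in the SELECT list.
--    Note: unlike 1) and 2), this also returns customers without any
--    orders (with a count of 0), so the results match only when every
--    customer has at least one order.
SELECT c.CustomerID,
       c.FirstName,
       (SELECT COUNT(o.OrderID)
        FROM Sales.Orders AS o
        WHERE o.CustomerID = c.CustomerID) AS OrderCount
FROM Sales.Customers AS c;
```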
Well, I can tell you immediately that the correlated subquery is the worst one. Always avoid using correlated subqueries. They have really bad performance, because SQL is going to do the aggregation for each customer individually. It goes row by row, doing the aggregation for one row, then for the next row, and so on, so it takes a long time. So this is a bad practice; don't use it.

Now we are left again with the first option and the second option, and my tip is going to be like the previous one. If you have small to medium tables, go with the first one, because it is easier to read and understand and you will get exactly the same performance as with the subquery. But if your tables are big, the best practice is to prepare the data first, to group and filter it and isolate it in a subquery or a CTE, before joining it with the final table in the final query. But again, this is only for big tables, and always test: check the execution plan to see whether you are really getting any benefit from it. All right. So if you have big tables, try to prepare the data first in a CTE or subquery and then join.

Okay, moving on to the next tip: use UNION instead of the OR operator in joins. So what does this mean? Sometimes, let's
say that you are joining two tables, the customers and the orders. About the join key, you can see over here that it says the customer ID should be equal to the customer ID from the orders, or the customer ID should be equal to the salesperson ID. If one of these two conditions is fulfilled, then we have a match. And I can tell you the OR operator over here is a performance killer. It has really bad performance, so try to avoid it. Don't use OR in joins. It causes a lot of problems: it can prevent index usage, it can produce nested loop joins, and so on. That's why we consider it a bad practice.

To get the same results, we can split the join. We can have two queries: the first query joins the data based on the customer ID and the second one based on the salesperson ID, and then we merge those two results using UNION. It sounds bigger and like too much work for SQL, but with this you will get better performance than with the simple OR operator. So again, if you have big tables, try to avoid using OR in joins and use UNION instead.

Okay, the next tip says: check for nested loops and use SQL hints. Now
imagine that we have big tables and we are joining them. When you check the execution plan, you always have to check the join type. For example, here it is using nested loops, which is of course okay because we have small tables. But if you have big tables and SQL is still using nested loops for some reason, then this is alarming. To change this, we can use SQL hints to force SQL to use the hash join. The hash join is really good if you have a big table, like the orders, that is joined with a small table, like the customers. So at the end of the query we can write OPTION (HASH JOIN). Let's execute it and check the execution plan, and with that we have forced SQL to use the hash join, or hash match. Again, you really have to evaluate your tables. If you have small tables, don't bother with that. But if you have big tables and SQL is still doing nested loops, consider the hint: nested loops are usually very slow on big tables because there are a lot of iterations, while with the hash join the small table is stored in memory and you get really quick matching between the two tables. So those are all the best practices and tips on how to optimize joining tables in SQL.

All right, so now we're going to talk about UNION, and here is the best practice. It says: use UNION ALL instead
of UNION if duplicates are acceptable. It's very simple. If duplicates are acceptable, or there are no duplicates, then don't go with UNION, because it needs more time to execute. SQL has to check row by row whether we have duplicates, and this usually takes longer than using UNION ALL. So if duplicates are acceptable or you don't have any duplicates in your data, go with UNION ALL. It just merges all the data without checking anything, and the performance is going to be faster.

All right, the next one is a little bit tricky. It says: use UNION ALL together with DISTINCT instead of UNION if duplicates are not acceptable, so you want to remove the duplicates. We have learned that in order to do that we use UNION. It merges the data and also removes the duplicates, which is really okay if you have small or medium tables. But if you have huge tables, with hundreds of millions of rows, the best practice says: go with UNION ALL and apply DISTINCT afterwards. So in the subquery we use UNION ALL, and to remove the duplicates we use DISTINCT on top. But again, you have to test it and check the execution plan. If you are getting a benefit, then go with this version. But if your data is not really big, say hundreds of thousands of rows, just go with the normal UNION: the code is smaller and you will get the same effect. Only for large tables should you go with this best practice. So that's all I have for you on UNION.

Okay. So now let's talk about aggregations, and here the tip says: use
a columnstore index for aggregations on large tables, for example fact tables. That's because a columnstore index compresses the data, so the size of the data is smaller, and the aggregation is super fast because we read only the relevant columns. That makes it a perfect setup for aggregating large tables. Now let's say that we have hundreds of millions of orders and we have this query over here. The best practice says: convert this table to a clustered columnstore index. If you create this clustered columnstore index over here, the whole table is going to have amazing performance for aggregations like this.
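A minimal sketch of this tip in T-SQL follows. It assumes a hypothetical fact table `Sales.Orders` that is currently a heap (no clustered index); the index name `CCI_Orders` and the aggregation query are illustrative, since the query shown on screen is not part of the transcript.

```sql
-- Convert the (assumed) heap Sales.Orders into a clustered columnstore index.
-- If the table already has a clustered rowstore index, that index has to be
-- dropped or replaced first; this sketch does not cover that case.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Orders
    ON Sales.Orders;

-- An aggregation like this reads only the columns it needs
-- from the compressed column segments.
SELECT CustomerID, COUNT(OrderID) AS OrderCount
FROM Sales.Orders
GROUP BY CustomerID;
```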
All right, to the next one: pre-aggregate data and store it in a new table for reporting. Let's say that we have a big query where we are aggregating the data, and this query takes a really long time, say 5 minutes. The problem is that I would like to show the results as a report, maybe to my manager or during a meeting, and it's going to be really bad if everyone has to wait until the query is done. So the best practice here, if you have a query that runs very slowly, is to store the results in a table. If I go over here and say INTO sales summary, SQL is going to store the result inside this table. So let's execute it. With that we have a nice table where everything is prepared, and all you have to do is query this table. Of course it's going to be very fast, because it's only a SELECT statement. With that you have prepared and pre-aggregated the data for fast reports. So don't forget this: if you have a big query, you can insert its result into a new table in order to use it later for reporting. But one thing you have to make sure of is that you always update this table. New orders will not appear in the sales summary on their own; you have to run this query again in order to get the new data into the sales summary. So those are the tips on how to improve the performance of your aggregations in SQL.

So now what is happening here? I would like to show the orders, but only from customers from the USA. If you check this query over here, we are joining the orders and customers tables, but we are mainly showing the order information. That means we are using the customers only to filter the orders table, and there are multiple ways to do this task. It's not only joins: you can use EXISTS with a subquery, and you can also use the IN operator with a subquery. And now comes the old-but-gold question: which one is better? Should we use JOIN, EXISTS, or IN? And oh my god, if you go to the forums, you will see people fighting about which one is the best.
Now, about the best practices, everyone agrees:
don't use the IN operator. So this is the bad practice; avoid it. And of course I'm always speaking about big tables, okay? Not small tables. So we don't use IN to filter one table based on the result of another table. Don't use the IN operator in this scenario.

Now here comes the conflict: we have JOIN and EXISTS. The performance of those two is very similar for medium tables, say hundreds of thousands of rows. But you still have to test it. You have to compare the execution plans, and if you are getting identical results and both have the same speed, then I prefer to go with the JOIN, because to be honest it is easier to write than EXISTS. So from my point of view, JOIN is the best practice if its performance is equal to that of EXISTS. But what happens for me is that sometimes I get better performance using EXISTS, and then, from my point of view, EXISTS is the best practice.

Now you might ask why we get better performance with EXISTS than with the inner join. That's because SQL only has to check the existence of data from the subquery. With the inner join, on the other hand, SQL has to match rows between the two tables, so it evaluates all matching records rather than just checking whether a match exists. Also, SQL sometimes has to deal with more rows, because you might introduce duplicates as you join tables, and this will not happen with EXISTS. So in some scenarios you might get better performance with EXISTS than with JOIN, but everyone agrees not to use the IN operator.

Okay, the next tip is to avoid redundant logic
in your query. This happens a lot if you have many subqueries; if you analyze them, you might sometimes find redundancy. For example, in this query I would like to tag each employee by whether the salary is above or below the average. We might do it like this: we say, okay, let's get the employees where the salary is higher than the average, and we calculate the average in a subquery. If it's higher, we write "Above Average". Then we say, okay, let's do the below average. So we do a UNION ALL, and the condition is that the salary is less than the average. By checking this, you see that there's a problem. First of all, we are querying the employees four times: one, two, three, four. So we are scanning the employees table four times, and we also have the same logic twice: we are calculating the average salary twice. So this is, I can say, a bad practice, and there are many ways to do it better. For example, you can put this subquery in a CTE and then use it multiple times. But there is a better solution using a window function. If you check this, it is very simple. Let me execute it. We are reading the employees table only once and then using a CASE statement that compares the salary with a window function, which calculates the average over the whole employees table. If the salary is higher, write "Above Average"; if it's lower, "Below Average". As you can see, it is easier to read, it is smaller, and the performance is way better than reading the employees four times and repeating the same logic. So you always have to look at your queries, and if you see that you are repeating the same things over and over, then you are writing a bad query. Think about alternatives like CTEs and window functions, and I'm sure you will find a better way than reading the table several times or repeating the same logic several times. So as you can see, optimizing queries is not always about using indexes and partitions. It's about using best practices.

All right, guys. With that we have covered
a lot of best practices on how to optimize the performance of your query. And as you can see it's not always
creating indexes, right? In many scenarios it's about how you write the query. And now in the next section I'm
going to show you the best practices on how to create tables. So the best practices of DDL data definition
language. If you have a poor definition of your tables, this has a great impact on the performance of your queries. All
right. So now we have here a DDL statement that creates a table, customers info, and it is not really following best
practices. So let's go through it one by one. The first tip is to avoid the data types VARCHAR and TEXT where
possible. VARCHAR and TEXT are among the worst data types for performance because they consume a lot
of resources whatever you do. For example, sorting the data by a column that is VARCHAR or TEXT is a very
expensive operation. The same thing if you create an index on top of such a column: it's going to be
expensive as well, and these types cause a lot of problems with data fragmentation and many other issues. So try as much as you
can to skip those data types if it's possible. So now let's review all those columns to see whether we
can change something, because the table has a lot of VARCHARs. The first one over here is a VARCHAR because it
is the first name. Well, that is okay. Moving on to the next one, we have the last name as TEXT, which is not really
good because TEXT is worse than VARCHAR. So it's better to use VARCHAR than TEXT. So here we have to fix it: VARCHAR, and
I'm going to go with the length 50. Now moving on to the country. The country is going to stay VARCHAR. We cannot
change that because it contains characters. The next one is the score of the customer. Here we can do something
about it because scores are only numbers. So that's why we can skip the VARCHAR here. Let's remove it and
say it is an INT, and with that we have avoided using VARCHAR. The same thing goes for the birthday. The
birthday is a date and here we have it as a VARCHAR. Well, this is not really good, and we can avoid that by defining this
column as a DATE. DATE is way better than VARCHAR. All right. And the next one is an integer already. So with that we
have fixed a few things: the score and the birthday. And with that we have saved some storage. If we
have an index on the score, it's going to be way better than having it on a VARCHAR. And if you are filtering the data based on the
birthday, it's going to be faster. So again, try your best to avoid VARCHAR and TEXT. I have seen in many
projects that a lot of developers tend to use VARCHAR, and I understand: it is easier to make everything a VARCHAR
than to decide whether it is an integer, date, float and so on, because you can fit everything in VARCHAR and TEXT. But
this is lazy. Take time to understand the content of each column and assign it the correct data type,
because this really has an impact on performance. Okay, to the next one. It says avoid using MAX or overly large
lengths. So now we have to keep our eyes on the length of each data type, especially the VARCHARs. Not only is an
oversized length going to waste a lot of storage, it's also going to mislead SQL Server into creating large indexes, which is
totally unnecessary because the data itself is small. But because you have defined a large length, SQL Server is going to
check that information and decide to build a big index, and large indexes are always problematic because
they slow everything down: sorting the data, retrieving data, updating the index. So it is really bad
practice to blindly define MAX or 255 everywhere. Again, take the time to think
about each column and predict a length for it. For example, if you check over here, we are saying first name VARCHAR(MAX).
Well, most first names are short. So we don't need the maximum size of a VARCHAR to fit a first name.
So here we can easily replace MAX with 50. The same thing goes for the column country. We don't need 255
characters for a country name. We can go with something more realistic, like around 50. I think you can even go
smaller, but it's fine to have 50. So the best practice here is to analyze your data and predict the size of
each column. Don't be lazy by just defining MAX everywhere. I know it's faster, but it's bad for performance.
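The DDL shown on screen is not part of this transcript, so the following is only a sketch of the two tips so far, written in T-SQL. The table names, column names, and the original lengths of the score and birthday columns are assumptions reconstructed from the narration.

```sql
-- Before (as narrated): text types and MAX/255 lengths used by default.
-- All names here are hypothetical; VARCHAR(255) for Score and BirthDate is assumed.
CREATE TABLE CustomersInfo_Before (
    CustomerID INT,
    FirstName  VARCHAR(MAX),   -- far larger than any first name
    LastName   TEXT,           -- TEXT instead of VARCHAR
    Country    VARCHAR(255),   -- oversized for a country name
    Score      VARCHAR(255),   -- numbers stored as text
    BirthDate  VARCHAR(255)    -- dates stored as text
);

-- After: each column gets a type and length that match its content.
CREATE TABLE CustomersInfo_After (
    CustomerID INT,
    FirstName  VARCHAR(50),
    LastName   VARCHAR(50),
    Country    VARCHAR(50),
    Score      INT,            -- scores are only numbers
    BirthDate  DATE            -- a real date type instead of text
);
```

The length 50 is the value chosen in the walkthrough; in your own tables, pick lengths by looking at the actual data.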
Okay. What else do we have? Use the NOT NULL constraint as much as possible. NOT NULL is amazing. It has a lot of
advantages. Of course, the biggest advantage is the data integrity of your table: with it, you make
sure no NULLs are inserted into a specific column. But it is also good practice to use it for improving performance,
because if you are creating an index, you're going to get better index performance since SQL Server knows there are no
NULLs inside the tree of the index. And on the other side, when we write queries we tend to use a filter
where we say a specific column should not be NULL. But if you make sure in the DDL that it is NOT NULL, then you can
skip this filter, and with that you are reducing the size of your query. So what we're going to do is go
through all the columns and decide for each one whether it is NOT NULL or nullable. For example, the first name and the last name
should not be NULL. So that's why I'm going to say NOT NULL, and the same thing for the last name, NOT NULL. For
the customer ID, we're going to talk about it soon because we're going to convert it to a primary key, and primary
keys cannot be NULL. Now for the country, we might have a rule in the business that it should not be NULL. So
we make a constraint for it. Now about the total purchases and scores: if it is a new customer, maybe we
can have a NULL in our data. So we're going to leave them nullable. And I think the birthday is usually
optional, so we're going to leave it as well. And whether the customer is an employee or not, this could also be
NULL. So with that we have found three columns where we can have a NOT NULL constraint. And if we
create an index on the country, it's going to be a better index. Okay. Moving on to the next one.
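Before the next tip, here is a minimal T-SQL sketch of the NOT NULL decisions just described. The column names are assumptions reconstructed from the narration, and the type of TotalPurchases is not stated in the transcript, so FLOAT is only a placeholder.

```sql
-- Hypothetical table; names follow the narration, not the on-screen DDL.
CREATE TABLE CustomersInfo (
    CustomerID     INT,                    -- handled separately as the primary key
    FirstName      VARCHAR(50) NOT NULL,   -- a name is always required
    LastName       VARCHAR(50) NOT NULL,
    Country        VARCHAR(50) NOT NULL,   -- required by an assumed business rule
    TotalPurchases FLOAT       NULL,       -- assumed type; a new customer may have none yet
    Score          INT         NULL,       -- a new customer may have no score yet
    BirthDate      DATE        NULL,       -- optional
    EmployeeID     INT         NULL        -- not every customer is linked to an employee
);

-- An index on a NOT NULL column, as mentioned for the country.
CREATE NONCLUSTERED INDEX IX_CustomersInfo_Country
    ON CustomersInfo (Country);

-- Because Country is declared NOT NULL, the query needs no
-- "WHERE Country IS NOT NULL" filter.
SELECT Country, COUNT(*) AS TotalCustomers
FROM CustomersInfo
GROUP BY Country;
```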
It says make sure that all your tables inside the database have a clustered primary key. A primary key helps you
build the relationships between tables, where you have primary keys and foreign keys, so you can join tables very
easily. A primary key is also important for performance: in SQL Server it creates a
clustered index by default, and it is really good to have an index on the primary key because when you are doing update
or delete operations, or joining tables, it helps with the lookups. So there are a lot of
performance benefits to having a primary key; make sure that all your tables have one. As you can see,
the issue with our table is that we don't have a primary key, and our primary key is going to be the customer ID. So let's
do that: PRIMARY KEY. As I said, by default it is clustered, but I'm going to write it down explicitly, so in case you
are working with a different database, make sure it is clustered. Okay, moving on to the next one. It's not only about
the primary key; we have to take care of our foreign keys too. The best practice says create a nonclustered index for the
foreign keys if they are frequently used. Foreign keys are usually important for connecting and joining
two tables, so we use them frequently, and not only that, we sometimes use them to filter the data. If you
create a nonclustered index for that, it can improve the speed. So what we can do is very simple: we're going to
create a nonclustered index on our table customers info for the foreign key employee ID. How to do it is very
simple. We say CREATE NONCLUSTERED INDEX on our table customers info, on our foreign key, the
employee ID. But again, make sure that this is an important foreign key that is used frequently in your queries. All
right friends, so as you can see, there are a lot of best practices for improving and optimizing the DDL. Having a
healthy DDL can improve the performance of your queries. Now in the next section I'm going to show you the best practices,
tips and tricks about indexing. So let's go. All right, the fifth best practice and
the most important one is to avoid over-indexing, because too many indexes are going to slow down the insert, update and
delete operations, and they can also confuse the optimizer when it has to choose the right index for the execution plan, so the
performance of the whole system goes down. Another tip is to monitor the usage of your indexes. I can tell
you from experience, 90% of the indexes that get created are usually not used at all. So they are taking a lot of space and slowing
everything down. So go and drop those unused indexes in your system. The next best practice is to have a regular job,
maybe a weekly job. First, you have to update the statistics regularly. As you are inserting new data and
modifying data inside your database, the statistics and the metadata of your tables might get outdated, and this is
really bad because you will not get an optimal execution plan for your queries, and this can slow them down of
course. So regularly make sure that all the statistics are updated in order to get an optimal execution plan. What
else we can do in this weekly job is rebuild and reorganize our indexes. That is to
make sure that we are preventing data fragmentation in our indexes. Fragmentation in your indexes is really
bad because there will be a lot of unused space, and the order of your clustered index will not be correct. So
make sure that at least weekly you are rebuilding or reorganizing your indexes. Those are the best practices
for improving performance and optimizing your indexing. If you are struggling with very large tables in
your projects, like fact tables, then use SQL partitioning to divide these tables into
smaller pieces, which can improve the performance whether you are reading data from the table or writing data. And of
course you can mix things: apply a columnstore index on this partitioned table, and then you
will get the best performance for large tables. All right friends, so that's all.
Those are the best practices, tips and tricks that I've collected over many years of working with SQL. And now my final
thought about this: always try to focus on writing clear queries. Make them easy to read and easy to understand,
and optimize the performance only if it's needed. If you have a small database, don't worry a lot about
performance, because the SQL optimizer is going to pick a good plan for you; focus only on having simple
queries. And if there is a performance problem, always test using the execution plan. It should be your judge. So if you
are adding an index or rewriting a query, always compare before and after using the execution
plan. And if you are gaining performance, then adopt the new query or the new index. All right my friends. So
those are all the tips, tricks and best practices that I have for you for optimizing performance. And with
that we have now covered everything in this chapter, performance optimization. Now in the next chapter
I'm going to show you how I use AI to assist me while I'm working with SQL. So let's
go. All right. So now I would like to share something important with you, especially as a future developer
who is working with AI. One of the best ways to truly build skill and to grow as a developer is by working on
complex tasks and issues on your own. When you are stuck on a complex task and you are pushing yourself to find a
solution for it, and you are writing the code yourself, that is where the magic happens and the real learning happens. If
you jump too quickly and ask the AI for a solution, you are skipping an essential step toward
becoming an expert. And more important than that, you won't develop the skills to recognize when and where the
AI is wrong. So my recommendation here is to have discipline. Always try to solve the task on your own, and only turn
to AI if you don't have any more ideas on how to solve it. So that's my opinion and my advice for you.
So quickly, what is ChatGPT? It is an AI program developed by OpenAI that is trained to understand questions
and provide humanlike answers. So what does GPT stand for? The G stands for generative. That means the model
can generate new content, new text. The P stands for pre-trained: the model is already trained on a huge amount of
data. And the T stands for transformer. It is a type of neural network architecture that processes the
sentences in your prompts in order to understand the context behind them quickly and accurately. On the other hand
we have GitHub Copilot. It is developed by GitHub and uses language models from
OpenAI. So that means ChatGPT and Copilot are both using language models developed by OpenAI.
GitHub Copilot was trained on tons of code that is available on GitHub. So how does it work? As you are writing code
in a code editor, for example Visual Studio Code, it provides real-time suggestions as you are
typing. Now if we compare the two, ChatGPT and Copilot, we can say that ChatGPT is a
standalone application that you interact with through a website or an app, where you start a
conversation with the AI. Copilot, on the other hand, is directly integrated in your code editor, for
example Visual Studio Code. For coding, this is way better than ChatGPT because you have real-time interaction with the AI.
This is a great advantage for Copilot because everything is in one place. So with Copilot you are
getting real-time assistance during your coding. The main purpose of ChatGPT is to have a conversation with the AI
about any topic that you like, not limited to software development. Copilot, on the other hand, focuses only on
assisting software development: as you are writing your code, you are getting auto
completion of the code, or maybe a block of code as a suggestion. So these are the key differences between ChatGPT and
Copilot. Now if you are doing software development or you are working with
data projects, and of course depending on your role in the project, there will be many different types of tasks and
activities that have to be done, like brainstorming about new ideas and
coding solutions, debugging, generating documentation, discussing different types of architecture, doing root cause
analyses. So the spectrum of activities and tasks in each project is usually huge. And of course we can
use the help of different AI tools to assist us with those tasks and activities, and there is not one AI
tool that can cover all of it. I tend to jump between Copilot and something like ChatGPT. Okay. So now I'm
going to map those different tasks to either ChatGPT or Copilot. Let's focus on ChatGPT first. The first
one is brainstorming and ideas. If we have a big task in our project, or let's say a big issue that we want to
find a solution for, I tend to use tools like ChatGPT to have a discussion about the topic, to
explore and discuss multiple ideas, and then start evaluating all those ideas. The next one where I find myself using
ChatGPT is project planning. It is also something high level. You can discuss with ChatGPT
the design of your project, and you can also discuss the milestones and the roadmap of the project. The next
thing that I find myself using ChatGPT for is learning, knowledge and research. If you are working with big data projects,
you will be overwhelmed by the number of cloud services and AI analytics tools. And of course you can
learn new things and gather information and knowledge using ChatGPT. Okay, moving on to the next task. We have generating
documentation. Writing documentation is always a painful process and consumes a lot of time, and I tend to use tools like
ChatGPT to generate it. But of course, I always review the documentation and make it
short. Okay, moving on to another topic where I use ChatGPT: discussing architecture. Of course, if you are
starting a new project, there will be different types of architecture that could be used to implement it. And of
course, you can discuss the different types of architecture with ChatGPT, and if you give it the
specifications of your project, then you can discuss which architecture is suitable for the
project. Another task where I find myself always researching is exploring best practices, tips and
tricks. You can have a discussion with ChatGPT about the recommendations, what the best practices are, what
the common pitfalls are, in order to make sure that your code and your solution are always up to date with the best
practices. And one more thing: if there is a very complex task in the project, then I tend to have a
discussion with a tool like ChatGPT in order to break this complex task into small pieces and start finding the
solution for each piece. On the other hand, I'm using Copilot to solve different types of tasks. So
here is where I get my hands dirty in the code. While I'm coding, I'm using Copilot all the time to assist me,
because it provides inline suggestions directly and helps me to code faster and reduce the human errors that I might
make. So while I'm writing code or debugging, I tend to use Copilot, and I don't find myself going to ChatGPT to
ask about code or syntax. We can do it directly in Copilot. One task that is very common in any software
development is refactoring. If you have code that is slow and badly designed and you want to refactor the
whole thing, you can do it directly in your editor together with Copilot in order to find optimizations. I also use
Copilot to add inline comments. So I don't find myself going to ChatGPT and asking it to add comments to my
code. You can do it directly in your code using Copilot. And of course, if everything is working perfectly, I have
the best practices, the good performance, I have the comments, you still have to maintain a nice style
and format for your code. And of course now we can do that directly using Copilot. We don't have to jump to
ChatGPT in order to style and format the code. As you can see, I'm currently using both of them for different types
of tasks. So again, if I have the feeling that I have to discuss something, I go to ChatGPT. But once the idea is very clear
and I know the solution, then I start using Copilot to write the code, and with the help of Copilot I can
deliver clean and professional code. So this is how I currently use both ChatGPT and
Copilot. Okay friends, so now I'm going to show you a quick guide to GitHub Copilot in
Visual Studio Code. Once you create a profile and connect it to Visual Studio Code, you will get a new icon for
Copilot. Once you go there, you can quickly see the status, and you can also disable Copilot. If
you have it like this, that means Copilot is active. Now once you have everything up and running, what you have
to do is very simple. Just start writing your code. So start typing any SELECT statement. And now you can see
that we have gray text. This gray text is called ghost text. It is an auto completion from Copilot. Now it
says SELECT * FROM table. And as you can see, when I hover the mouse over it, I can switch between
different suggestions. Here we have three suggestions: one, two, three. And I'm going to go with the third one.
So now, as it says here, if you want to accept the suggestion, all you have to do is press Tab. So let's
do it. With that you are accepting the whole thing. But now say you want to accept only part of the
code. So let's write SELECT again. This time we're going to be selective. In order to do that, hold
Ctrl and press the right arrow, and with that we are accepting part of the ghost text, not everything. But of course,
if you are accepting the whole thing, just go with Tab. Now there is another way to trigger the
ghost text, and that is by first writing a comment. For example, we want to select the top three customers based on
the score. Once you start writing the query, Copilot writes a query that is relevant to the
comment. As you can see, we are getting TOP 3 from customers because we want the top three customers, and here
we have two suggestions: with the ORDER BY or without it. So I will go with the ORDER BY and hit
Tab. And now here is another suggestion, which is correct, in order to sort the data from the highest to the lowest. All
right, moving on to the next one. As we learned, in SQL there can be multiple solutions to a task
and multiple variants of a query that solve the same task. So let's say that we have this task: rank customers based
on their total order sales. If you start writing the query, we get the ghost text. But now
what we can do is hit Ctrl+Enter. What happens is that on the right side you get different
suggestions, and here we have nine suggestions on how to solve this task in SQL. Now what you have to do is
go through all those suggestions and pick one. For example, I can go with suggestion number three and say accept
suggestion, and you will get it in your code editor. So this is what we mean by Copilot autocompletion and
integrating the AI directly as you are developing and writing code. Now in Copilot we are not limited to the ghost
text and the autocompletion; we can also interact with the AI using inline chat. So it's something like ChatGPT.
Now in order to trigger the chat, you hit Ctrl+I, and then you
get a place where you can ask Copilot any question, for example: join the query with the
table sales orders. So let's submit it. And now as you can see, we got a full
query where customers is joined with orders, and the way the tables are joined is totally correct. So that means
Copilot already knows the tables that I have in the database, as well as the columns and how to join them. This
is amazing. If you like it, you accept it of course, and this is way faster than using ChatGPT, because in
ChatGPT you have to introduce your database, your columns and so on before even asking anything. This is exactly
the power of Copilot. Now what else can we do with that? We can highlight part of our code and then
start the chat again, and here we can say: replace this column with an aggregation of the sales. So let's
submit it. Now as you can see, it replaced it with an aggregate function. And one thing that is very important: the
code is not changed yet. The change is highlighted and shown to you as a suggestion, and now you have to accept it or discard
it. If you discard, nothing is going to change in your code. But once you say accept, it replaces
your original code. So if you do that, your code is now replaced with the AI suggestion. Okay. Another thing about
Copilot: it tries to fix issues that you have in your code. For example, we have an error here. If you
hover the mouse over it, you can see a menu from Copilot to view the error or to fix it. Another way to do
that: if you right-click on it and go to Copilot, you can see that we can explain or fix. If you choose
explain, you will get another window with an explanation of the issue in your code. And once you
understand it, you can ask Copilot to fix it. So let's go over here and go to
fix. And with that, Copilot fixed the issue. It was all about the order of the clauses in the SELECT statement: first you have
the GROUP BY, then the ORDER BY. So it helps you to find issues and to fix them as well. And now, as you might have already
noticed, as we are writing code and interacting with Visual Studio Code, you will often get a sparkle, this little
yellow sparkle on the left side. You will see this icon each time Copilot thinks it can help. If you
click on it, you will get a menu of different things that Copilot can do for you, like fixing, explaining,
modifying, and so on. Well, my friends, that's it. This is Copilot, and it is very simple, yet very powerful
for developers. And of course, not only for SQL, but for anything, like Python and so on. Everything is integrated in
one place. I don't have to jump to ChatGPT and ask things. It is live, and I can do it directly as I'm writing my
code. So that's all for Copilot. All right friends. So now let's switch to ChatGPT. So let's start first by
understanding the structure and the basic components of a ChatGPT prompt. The first component, and the most
important one, is the task. You have to be very clear in defining what the AI should do; without a
clear task the AI will not understand what to do. So this is mandatory in each prompt. After that you have to
provide some context. You give some background information, for example you say I am a student or I am a data
engineer and so on. Another component we can add is the specifications. In the task you give
the main thing the AI should do, but with the specifications you go into details, for example which topics
should be included or excluded, or the word count. So here you are specifying a lot of wishes, small
details and specifications in order to get an answer that meets your expectations. So both the context and the
specifications are important. After that we have some nice-to-have components, for example
specifying a role. Here you give the AI a role, for example you tell it to act as an expert, a teacher or an
interviewer. So you are setting the AI to play a role. The last component that you can add is the tone.
Here you are defining the voice of the answer, just to make the answer more friendly, easy to
read and engaging. So the role and the tone are nice to have, and if you use all those components you will
get better results from the AI. Let's take for example the following prompt: explain SQL window functions.
This is very simple and very short, and here we have only one component, the task. You are not giving any
context, whether it is for data analytics or for data engineering. So you leave it up to the AI, and maybe the answer that
you get will not meet the expectations that you have. Now if you want to shape the answer the way that you
want, you have to add more components. For example, in this prompt you are saying: you are a senior SQL expert. So
here we are defining the role for the AI. The AI should now act as an SQL expert. In the next section we are
adding context to the prompt. We are saying: I'm a data analyst working on SQL projects using SQL Server. So now
the answer that you get from the AI is going to use the syntax of SQL Server and focus on the topic of
analytics. That's why the context is very important. Then we specify the task in the prompt, the main task. We
say: explain the concept of SQL window functions and do the following. And now we give finer details about
what the AI should provide. We are saying: explain each window function and show the syntax, describe why they are
important and when to use them, and list the top three use cases. So you are now specifying what you are expecting from
the AI. After that, and this is of course nice to have, we specify the tone of the explanation. We say the tone should
be conversational and direct, as if you are speaking to me one-to-one, so that it doesn't feel like reading a document;
you are reading something that is engaging. I know this prompt is really big, but you will still get way
better results than by only saying explain the concept. So those are the main components that I usually use when I'm
starting a conversation or a discussion with ChatGPT. Okay. Next I'm going to show
you the prompts that I use most frequently in my projects. But first, a little bit of awareness about using ChatGPT in
companies. If you are working in a new company, make sure to ask about the rules for using ChatGPT, because some
companies offer their own chatbots for security reasons. So always check the rules before
jumping straight to ChatGPT. All right. So let's start with the first prompt. We can use ChatGPT to solve an SQL
task that you have in the project. So let's look at this prompt. It starts with the context. I'm saying that I
have a SQL Server database and we have two tables. I have to explain to ChatGPT the database that I
have. So I'm saying we have a table called orders with the following columns, and we have another table called
customers, and here are the columns for the customers. With that I gave ChatGPT context about the tables that I have in
my database, and I was also precise about the database: it is SQL Server. Now after we have the context, the next
step is to tell the AI what to do. So I'm telling the AI: do the following, write a query to rank
customers based on their sales. And then I'm detailing what I'm expecting to have in the output: the result should
include customer ID, full name, country, total sales and so on. And here I'm adding more tasks. It's not enough
to have a query. I would also like to have comments. So I'm saying include comments but avoid commenting on obvious
parts, because if you just say include comments, you will get a lot of unnecessary comments. Now of course in
SQL there is not just one solution for a task. There are always different variants of how to achieve the
same task. So usually I would like to understand what my options are. That's why I'm telling ChatGPT: write three
different versions of the query to achieve this task. And then I would like to evaluate each of those versions,
that's why I'm giving the AI the task to evaluate them and to focus on two things: being easy to read
and having good performance. Okay. So let's see what results ChatGPT is going to give us. We can see the
first solution over here, where ChatGPT is using a CTE. We can see in the CTE that the tables are first
joined, and then we have a GROUP BY to aggregate the sales. In step two we can see that we have
the RANK window function to rank the sales. So of course you can do that. Let's check version number two
over here. Here the AI used a subquery, and it is also a nice solution where ChatGPT first prepared the data: it
did the aggregation before joining the data. Let's look at the last solution over here. Here we have a
single query using a window function, which as you can see is the smallest one. We don't have a CTE, we don't have any
subqueries. It is joining the tables first and doing the GROUP BY together with the window function. And
after that we get an evaluation from the AI where, as you can see, it focuses on two things: the readability and the
performance. It is saying that with the CTE the readability is really high compared to the subquery and to the
last version, where you have the GROUP BY together with the window function. I totally agree with ChatGPT: the
first version is the best one for readability. Now checking the performance: you can see the performance of the first version
is moderate, the second one, the subquery, is good, and the last one is the best for performance. But of
course, always test with the execution plan. As you can see, there is a trade-off between the readability and
the performance. If the priority is readability, then go with version one. But if the priority is
performance, then go with version three. As you can see, we got three solutions for our one task, and you can
now evaluate which one you want to use. And this is really amazing, right? All
frequently use: improving readability. As you are creating an SQL query for a complex task, you might
end up writing a lot of CTEs and subqueries. You might end up having a lot of joins, subqueries, CTEs, hundreds of
lines, and you might lose the big picture. So what I always do, I give the query to ChatGPT and ask it to optimize
it in order to be more readable and to find any redundancy in my query in order to consolidate it. So now let's check
the prompt. It says: the following SQL Server query is long and hard to understand. And then we're going to give
the AI tasks. So the first task is to improve its readability, and the next one is to detect any redundancy in the code
in order to remove it and to consolidate the query, so to make our query compact and small. And of course to include some
comments, but not to comment the obvious parts. And whenever there are optimizations, there should always be a
learning process. So I'm asking the AI to explain each improvement, to understand the reasons behind it, so that
next time I'm writing queries I can avoid those mistakes. And of course you have to go and give the query to the AI.
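Just to give a feeling for what "detect redundancy and consolidate" can mean, here is a minimal sketch. It is not the long query from this example; the table Sales.Orders and the columns CustomerID and Sales are assumptions for illustration only.

```sql
-- Before: two CTEs each scan the same (hypothetical) table.
WITH CTE_TotalSales AS (
    SELECT CustomerID, SUM(Sales) AS TotalSales
    FROM Sales.Orders
    GROUP BY CustomerID
),
CTE_OrderCount AS (
    SELECT CustomerID, COUNT(*) AS OrderCount
    FROM Sales.Orders
    GROUP BY CustomerID
)
SELECT t.CustomerID, t.TotalSales, c.OrderCount
FROM CTE_TotalSales AS t
JOIN CTE_OrderCount AS c
    ON c.CustomerID = t.CustomerID;

-- After: one CTE computes both aggregates in a single pass over the table.
WITH CTE_CustomerStats AS (
    SELECT CustomerID,
           SUM(Sales) AS TotalSales,
           COUNT(*)   AS OrderCount
    FROM Sales.Orders
    GROUP BY CustomerID
)
SELECT CustomerID, TotalSales, OrderCount
FROM CTE_CustomerStats;
```

Both statements should return the same rows; the second one simply reads the table once instead of twice and drops the join.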
All right. So now let's check the answer from ChatGPT for my prompt. So as you can see we have a really long query and
here we have now from the result the improved query. So we can see that we have only one CTE. Well that is crazy.
We had like five or six CTEs before, and we can see here that ChatGPT managed to put everything in one CTE and then do
all the aggregations and the window function, and then we have here the final SELECT. Well, this is a huge improvement over
the previous query. Let's check here the explanation. So it says it consolidated the CTEs, so combined all the CTEs
into one, and many other things; for example, there were a lot of unnecessary joins and so on. And here is a small improvement where
it uses CONCAT instead of the plus operator, because CONCAT is supported across multiple databases. And here we have the final
benefits. So we have a shorter query: instead of five CTEs we have only one, and by combining the logic you can reduce the
number of scans of the tables, which is correct. So as you can see it is the magic of the AI. It found the issues in
my code, improved the readability and reduced all the redundancy and unnecessary joins and so on in the SQL
script. Okay, moving on to the next prompt. It is about optimizing the performance of my query. And if you are
working in big projects where you have millions of rows in your tables, it can be an issue if you are writing
queries that are not following the best practices for performance. So that's why I go and double check with the AI
whether my script is following the best practices for performance. So as usual in the prompt we have to go and
give the context: the following SQL Server query is slow. And then we start giving the AI some tasks. So propose
optimizations to improve its performance and then provide me the improved SQL query. And I would always like to
understand the reason why it's better to write it in another way, so that next time I do better while I'm writing
the query. So explain each improvement to understand the reasoning behind it. And then at the end we go and give our
query. Okay. So now let's run the prompt on the following query over here. So in this query we have a lot of
bad practices, like for example doing aggregations using a correlated subquery. We are using a lot of functions inside
the WHERE clause, which is not really good for indexing, and we are using a lot of OR operators, and here we have again a
subquery. So let's check whether ChatGPT is going to find all those bad practices. So let's check the results from
ChatGPT. And as you can see, now we have an optimized query. It is a little bit longer, but I think we have here better
practices. So we have here a lot of changes. Let's check what it did. So first it removed the LOWER in the query. It
says it's not really good to use functions in the WHERE clause, because they can keep the index from being used. So it replaced the LOWER call with
the order status column without the function. The next one: it is avoiding the correlated subquery. So instead of that
it is using a LEFT JOIN. So it is joining the table normally without doing any correlated queries. And as well it is
avoiding the YEAR function in the WHERE clause, and instead of that it is using a date range with BETWEEN. And the next one: it
is using EXISTS rather than IN, which is often better for performance. So as you can see you can use the AI in
order to optimize the performance of your query and to convert it to a script that is following the best practices. Of
course, my recommendation is always: don't go blindly with all the changes that are suggested by ChatGPT. Always
take each recommendation one by one. Test it and evaluate it using your knowledge. Okay, on to the next one. It is an
interesting one. We can use AI to explain the execution plan. So now, execution
plans are usually an advanced topic. So you need a lot of know-how and experience in order to understand and read the execution
plan, and if you have a big query it's going to be a real nightmare to understand the flow and where exactly
the issue is. But now we are not alone. We have an assistant, the AI, to help us understand this complex stuff. So
what we can do: we can take a screenshot of the execution plan and upload it to ChatGPT, and we say the image is the execution
plan of an SQL Server query. And now we give the following tasks: first, describe the execution plan step by step. After that
I'm going to tell it to identify the performance bottlenecks and where exactly the issue is, what makes my query
slow. This is of course the hardest part of reading an execution plan. And once it identifies the performance issues, I'm
going to ask it to suggest ways to improve the performance and optimize the execution plan. So first
understand the execution plan, then identify the issues, and then how to optimize it. Okay. So now after uploading the screenshot and
asking the AI we have the following results. So now we can see a detailed explanation about the execution plan and
there are a lot of details. I will not go through everything. So we start with the table scans, then the clustered index
scan and the nested loops. So we have several nested loops and then the aggregation and the final step. So
now we have a nice explanation of what SQL is doing behind the scenes for my query, and you don't have to be an expert
at understanding the execution plan. You can ask the AI about it. Now what is very important is to understand where
the bottlenecks are, what the problems are. So let's see what we have here. So the first one: we have
a table scan, which is really bad. That means this table, the orders archive, does not have any index. So it says the table
scan indicates a lack of a useful index on the table, which forces the engine to scan all the rows of the table. And now
what is very important is the nested loops in the joins. This is really bad if you have big tables. So here it's
saying it's fine if you have small data sets, but it's going to be really problematic if you have many rows. So as
you can see we are getting more knowledge about the issues that we have from our execution plan. And the last
step is the suggestions. So the first one and the most obvious one is to add an index to the orders archive, a
nonclustered index. Well, if there's no index at all, I would go first with a clustered index, not immediately with a
nonclustered index. And then some other best practices, but I think this one is very relevant: to change the join
type. So you can use hints in order to use a merge join or a hash join. So now we understand how it works, where
the issues are and what the suggestions are to fix them. All right, the next prompt is about debugging. As you are writing a
complex SQL query, you might get from the database an error when you execute it and sometimes it is challenging to
find the root cause of the issue. So we have the following prompt. First the context is going to say: the following
SQL Server query is causing this error. Then we can paste the error message that we are getting, and then we ask the AI to
do the following stuff. First explain the error message. So I would like to have better understanding of the error.
And then we ask the AI to find the root cause of the issue in my script. And after finding the problem and the issue,
we're going to ask the AI to suggest how to fix it. And of course, we have to give in the prompt as well our SQL
query. All right. So now I have the following query and if I execute it, I'm getting the following error. It says the
column Sales.Orders.Sales is invalid in the select list because it is not contained in an aggregate function and so on.
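A query of roughly the following shape should trigger this kind of error in SQL Server. This is only a sketch with assumed names (a Sales.Orders table with CustomerID and Sales columns), not the exact query from this example. The second statement shows the usual repair, which is also where the explanation that follows ends up.

```sql
-- Fails: Sales is neither grouped nor aggregated, yet RANK() orders by it.
SELECT
    CustomerID,
    SUM(Sales) AS TotalSales,
    RANK() OVER (ORDER BY Sales DESC) AS SalesRank
FROM Sales.Orders
GROUP BY CustomerID;

-- Works: the window function ranks the aggregated value instead.
SELECT
    CustomerID,
    SUM(Sales) AS TotalSales,
    RANK() OVER (ORDER BY SUM(Sales) DESC) AS SalesRank
FROM Sales.Orders
GROUP BY CustomerID;
```

The point of the sketch is that in a grouped query the window function runs on the grouped result, so it can only see grouped columns and aggregates.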
So I don't really understand what's going on. Let's ask the AI about it. So let's check what ChatGPT answered. When
you are using GROUP BY, every column in the SELECT must be used in the GROUP BY as well, unless it is aggregated. And it says in your query you
are selecting a few columns: this one is valid, the other two are valid as well, but we have one inside the RANK
function, and it is invalid. Okay. So now we can see here more details about the root cause. It is saying when you are using
a window function like RANK in a grouped query, it doesn't work directly with a column that is not aggregated. So here it indicates clearly
that the sales inside the RANK function is the issue. So let's see the fix over here. So since sales is not in the GROUP BY
at all, you cannot use plain sales inside the window function. That's why the fix here is to use the SUM of sales, because we have it
in the SELECT. And here you have as well a nice explanation about the fix. So you can see here we have an explanation
about the error message and the root cause; it's pointing to exactly where the issue is, suggesting a fix and explaining
the fix. And these are exactly the steps that you have to do if you are debugging code. All right, moving on to the next
prompt: we can use AI to explain the result that I'm getting from SQL. Well, sometimes you might have an SQL query
in your project and you don't understand why you are getting specific results. So as usual we start
with the context. We tell the AI: I don't understand the result of the following SQL Server query. And then we ask the AI
to do the following. First break down how SQL processes the query step by step, and as well I would like to get an
explanation for each stage and how the result is formed. So as you can see here I don't need any optimizations. I don't
need any query in the output. I just need an explanation, and then at the end you're going to go and paste your query.
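The query used in the walkthrough that follows is a recursive CTE generating the numbers 1 to 20. A minimal sketch of such a query could look like this; the CTE name Series and the column name MyNumber are my own placeholders, not necessarily the names used in the example.

```sql
WITH Series AS (
    -- Anchor query: runs once and returns the first value.
    SELECT 1 AS MyNumber
    UNION ALL
    -- Recursive query: adds 1 to the current value on each iteration.
    SELECT MyNumber + 1
    FROM Series
    WHERE MyNumber < 20   -- Termination filter: no new row once 20 is reached.
)
-- Main query: reads the accumulated rows from the CTE.
SELECT MyNumber
FROM Series;
```

The WHERE filter in the recursive part is what ends the loop: when the current value is 20, the recursive query returns no row and the recursion stops.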
Okay. So now we have the following query. We have a recursive CTE where we are generating numbers between 1
and 20. I can tell you, recursive CTEs are usually complicated to understand. So now maybe we are having a hard time
understanding the result of this query. After asking the AI about it, we got the explanation first about the query
structure. So it says you are using the CTE with the main query. Well, okay. But what is very interesting is to
understand step by step how SQL executed this query. So it says in step one it's going to go and execute the anchor
query, and that's why we first get the 1. And then in the next step the recursive query is going to be executed for
the first time. So it is saying okay we are adding one to the current value. So as you can see 1 + 1 we will get two and
then in iteration two we will get 2 + 1 = 3, and it will keep repeating this process until we get all the results from
1 to 20. And then as well we have here an explanation about the termination of the recursive query. So it's saying the
filter is the way out of the loop. So once we reach 20 it will stop. And then some information about the main
query, and with that you will get a deep knowledge about how it works and why you are seeing those results. This is a really
amazing use case for ChatGPT. All right friends. So now we're going to talk about my favorite prompts. So we can use
the AI to style and format my code. So now once you are done writing a complex query to solve a task and everything is
correct and optimized as well for the performance. Now it's time to go and review your code in order to style and
format your script. So we have the following prompt. It says the following SQL server query is hard to understand.
So now we ask the AI to do the following. Restyle the code to make it easier to read. And the next task for AI
is to align all the column aliases. Sometimes if you are using any tool to style and format your code, you will
find that it is introducing a lot of new lines. So I tell the AI: keep it compact, do not introduce unnecessary new lines.
And the last task for the AI is to make sure it is following the best practices. And of course, what do we need at the
end? Our query. Okay, so now we have the following query. And as you can see, we have a very annoying query where it is
really hard to read, and that's because the format and the styling of the query are really bad. I don't even want to speak
about the alignment and so on. But as you can see, we have lowercase here and sometimes uppercase there for the
keywords. And of course, if you are developing and writing code and you are delivering something like this, it is
really not nice. So let's see how ChatGPT can fix it. Okay. So now after executing the prompt, as you can see, now my
query looks way nicer. So first of all, all the keywords are uppercase, and then you can see our CTEs are really nice to
read. We have here enough spacing. The alignment of everything looks really nice and the CASE is very clear and the
main query over here is easy to read as well. So it has done a wonderful job styling and formatting my code, and here you have
an explanation of what changed. So first it is saying okay all the keywords are capitalized, the alignment of the
aliases and the columns and so on. So with that we got a really nicely styled and formatted query that we can share with
others. Okay, moving on to the next one. We can use AI in order to generate documentation and as well to add
comments to my code. Creating documentation and adding comments to code is usually something very annoying
for developers. And sadly I see a lot of developers who tend to not add any comments or anything to their
code. And of course, this is really bad because you are not thinking about other developers that are reading your code.
And since this process is annoying and
takes time, we can use the help of AI to speed up creating those things. So let's check the following
prompt. It says: the following SQL Server query lacks comments and documentation. So we are saying first insert a leading
comment at the start of the query describing its overall purpose. So this is what we usually do. We add at the
start a short description about the following code, and then it should go and add comments only where clarification
is necessary and, very important, it should avoid obvious statements. So it's like indexing: don't over-comment your
code. And usually if you are creating a query for data analytics, it's really good to explain the business rules and
transformations that you are doing inside your query, and maybe have another document describing how the query
works. So for now we are asking to add comments and documentation, and of course you have to go and add your
query. Okay. So now I just used this prompt on one of my queries. Let's go and check the results. Now the first
comment is the most important one because it gives the overall purpose of the whole query. So let's see what it's
saying. It's saying this query classifies customers based on their total sales and provides a list of customers with their
total sales and their assigned segments. So we have here customer segmentation. We have high value,
medium value and low value. So with this comment we have the overall purpose of the query and then we have the inline
comments like here. So it says it calculates the total sales for each customer for the first CTE, and now for
the second CTE we have here a full description how the segment is built and this is built of course from the
business rule of the customer segments. So it says high value is for total sales above like 100, and between, and so
on. Well, this CASE WHEN is really easy, so actually you can read it from the CASE WHEN. But if you have complex
queries, it's really nice to have the full text for the CASE WHEN. And then at the main query you can see here the
final output and the inline comments. So as you can see these are really nice comments inside our code. And now the next one:
we have a document about the business rule. And I totally agree with the AI that the business rule is here
about the customer segmentation. So we have here again very nice, short documentation about the business rules
that we have, and then we have another document about how the query is working. Well I think this is too much for a small
query. We can go and ask ChatGPT to make the documentation shorter. So as you can see we have a full
documentation about our query about our business rules and we have really nice comments in our code. All right. Now
moving on to the next prompt. It is very important to improve the whole project, the whole database. So what
we're going to do, we're going to go and take our DDL scripts and give it to the AI and start asking AI to optimize our
database DDL. So here there are a lot of things that you can optimize with the database. So let's check this prompt.
It's going to say: the following SQL Server DDL script has to be optimized, and we ask the following tasks of the
AI. The first one is to check the naming. So if you have a database where you have a lot of tables and columns and
so on, you should be always working with a specific naming convention. So here just to make sure that the naming that
you are using is consistent. Then what is very important in DDLs is the data type. Data types play a very crucial role in
optimizing your queries. So we are telling the AI to check the data types and whether they are optimized as well.
And now the next point is about the data integrity. So if you are building a relational database, you will have a lot
of primary keys and foreign keys and you can tell the AI to check the integrity of all those keys. The next point is
about indexes. Here you can tell the AI to check the overall indexing that you are using in the DDL scripts just to
make sure that you are not missing anything and as well to check whether we have duplicates. So it is a really great
check. And the last check is to check the normalization of the tables, to check the data model and whether there
are any suggestions about splitting and normalizing tables, or whether there is some weird redundancy. Okay. So
now what we're going to do, we're going to let ChatGPT optimize the DDL of the SalesDB. So now we have here
the DDL of the customers employees orders and so on. And after running it we have the following results. So now we
have here again the DDL, but an optimized one. And here the AI is adding comments about the changes. So here it added the
auto-increment for the primary key. And here for example a check that the score is not negative. And for the employees,
here is another check to make sure that the birth date is not something in the future. So all those constraints are there to
make sure that the quality of the table is good. And here for the gender it is restricting the valid values that could
be used inside this column and many other stuff. And at the end we have like the key changes. So about the naming
it's saying that we have to stick with one naming convention. So here it did understand that we are using Pascal
case, and for those two columns we have an issue; for example this Product column, it should be called ProductName. And for
the data types I don't want to go into all the details. So here for example it says don't use INT, use a DECIMAL for the
price and sales. For the integrity it is saying go and add foreign keys. I think for the orders we don't have any foreign keys
that are used in the DDL. So ChatGPT did go and add all the foreign keys in the DDL. So that was good. And now about the
indexing it says since we have primary keys we will automatically get a clustered index, and the foreign keys
should get as well an index in order to improve the queries and so on. So as you can see there is a lot of optimizations
that could be done in our DDL. So now if you are working on the project and you have a DDL, go ask the AI what we could
optimize. I'm sure you will find something, and this is very critical because having a solid and optimized DDL
improves of course the speed of the queries. All right so now we come to a very useful use case of using AI for
your SQL projects and that is by using AI to generate test data sets. It is always really nice to have small data
sets in order to test the logic of your query. Sometimes you are building a logic that does not exist yet in your
database and of course if you are not able to test the scenario that you are developing it can be really bad and it
is always a very painful process to generate data sets for your code, but of course now it is easier because
we have the help of AI. So let's check the following prompt. It says: I need data sets for testing the following SQL
Server DDL. And now next we have to specify for the AI different tasks. The first one is we have to define the shape
of the data sets. So how do you want the output? Do you want it as INSERT statements or do you want it as an Excel
or a file and so on. The next specification: I would always like to have a data set that is realistic. So I
would always like to have a data set that is relevant and realistic, not dummy data. So again, these are
only configurations about the data set. The next configuration is that I would like to have small data sets. Of course,
you can go and specify for ChatGPT the exact size of your data sets. You can say I would like to have like 100,000
rows or millions of rows and so on. So you can define the size that you want. For me, I would like to have like small
data sets. And now what is very important that if you have multiple tables in your DDL and those table have
primary keys and foreign keys, the data set should be correct. So the AI should generate keys that are joinable. So if
you go and join data together, you will not get weird results. And of course, you can go and keep adding
specifications whether you want to have nulls or no nulls inside your data set. So here for example, I'm saying don't
introduce any null values. And of course at the end you have to go and give the DDL for the AI. It could be one table or
the whole database. So you could generate a data set for one table or hundreds of tables. Okay. So now I'm
asking ChatGPT to create test data sets for two tables, the employees and the orders. Let's check the results. So now
we can see very small nice INSERT statements for the table employees. So we have over here like five employees
with different information. And now for the table orders we have a lot of columns. So as you can see we have four
orders. And what is very important is that the salesperson ID comes from the table employees. So as you can see we
have 2 and 1, which we already have in the employees. And for the rest of the information we have here
fake addresses and stuff. So with that we have very nice test data sets to be inserted into our database to
test our queries. Of course we can go and ask maybe to extend it maybe instead of only four orders we can go with 20
orders and so on. So we can go and change the size of it and here we have some notes about the data itself. So it
is really amazing: we are now generating this data using our DDLs. All right. So now we have the following query and of
course we are using the SQL server and let's say that you are migrating from SQL server to MySQL. So let's ask
ChatGPT to convert my code to MySQL. All right. So after running it, as we can see, now we have the same query but in MySQL.
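To make the differences easier to follow, here is a small sketch of such a conversion. The Customers table and its columns are assumptions for illustration, not the query from this example; only the function and keyword swaps are the point.

```sql
-- SQL Server (T-SQL) version
SELECT TOP 10
    CustomerID,
    FirstName + ' ' + ISNULL(LastName, '') AS FullName,
    GETDATE() AS ReportDate
FROM Customers
ORDER BY CustomerID;

-- MySQL version of the same query
SELECT
    CustomerID,
    CONCAT(FirstName, ' ', COALESCE(LastName, '')) AS FullName,
    NOW() AS ReportDate
FROM Customers
ORDER BY CustomerID
LIMIT 10;
```

Each statement only runs on its own database system: TOP, ISNULL and GETDATE belong to SQL Server, while LIMIT and NOW belong to MySQL.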
So instead of ISNULL we are using COALESCE, and here we are using CONCAT instead of the plus
operator, and instead of GETDATE, in MySQL we use the NOW function. And the last thing: we are using TOP 10 here, but
in MySQL we use LIMIT 10. And here we have a really nice explanation about the transition. So as you can see it is
amazing, and if you are working in companies and on projects, it might happen that there is a decision to
start migrating from one database to another database and then your project going to get a big task of migrating the
data migrating the DDLs and the queries and everything and I really recommend using ChatGPT in order to help with the
migration, otherwise this big task might take a really long time. So as you can see this is really amazing how ChatGPT can
improve the speed of your projects. Okay. Now in the next section I'm going to show you the prompts that
you can use as a student or if you are learning any new programming language. Okay. So the first thing that you can do
with ChatGPT is that you can ask it to generate an SQL course. So you can ask
ChatGPT to guide you step by step in your journey learning any programming language, and you want to do it completely one-to-one with the AI. So
first, what is very important in creating a course is to give enough context. So in this example it is very short, I'm
saying create an SQL course with a detailed road map and agenda. But of course you can go and give more
specifications. You can tell about your current knowledge. You can specify which database type you would like to work
with, MySQL or SQL Server. So the more context and details you give for the AI, the better results you're going to get.
And then you go and configure your course. So you can say for example start with SQL fundamentals and advance to
complex topics. And as well we can say make it beginner friendly and it is important if it is the first time you
are learning about the topic. And now we have to shape the focus of the course like I'm saying here include topics that
is relevant for data analytics because SQL is widely used in different topics for data engineering data analytics and
it's really important in each course to focus on use cases. So we are saying focus on real world data analytics use
cases and scenarios and of course you can go and add more details about your course. Okay. So now I just asked the
shivity in order to make this course. So now let's see the road map and the structure of our course. So let's start
with the phase one with the SQL fundamentals. So it start with the basic select where and so on. Then the next
section we are talking about order by group by and insert update delete. So the basic stuff. Now in the road map you
get the phase two intermediate SQL. So here we are talking about inner joins few functions about the text the date
and the case statements and views. And now to the phase three we have the advanced SQL for analytics. So we have
the window functions, the CTE and data cleaning using the null functions and few transformations. Then we go to the
phase number four. Here in your road map you start talking about real world use cases. And here you have like multiple
projects. So as you can see this is a really solid road map in order to learn SQL. And now in the next step what you
can do, you can start deep diving into each of those chapters and tell ChatGPT to start, okay, with phase number one,
with week one, and to give more details. All right. So now the next one: once you have the agenda and the road map
for learning SQL, now you can go and focus on a specific chapter, specific SQL concepts. So in this prompt we are
saying the context first I want detailed explanation about SQL window functions and now after that we are specifying for
the AI the exact structure of the explanation. So first it should explain what are the window functions and maybe
as well to give an analogy in order to understand exactly what window
functions are, and after that it should
explain why we need them and when to use the window functions. So once you understand the basics then you can start
learning about the syntax of the window functions and it should provide as well a few simple examples, and at the end the
AI should show you the best or the most frequent use cases for the SQL window functions. So this is the pattern
that I like in order to learn something new. All right. So now let's see how the AI is going to explain the SQL window
functions. So as you can see it starts with the big title: understanding SQL window functions. So we have here a
quick definition and then we have here an analogy and the analogy about like a teacher grading students. Well that's
nice because we have the rank function. So you have here a nice analogy about the window function and then we
understand why do we need the window functions. Well I totally agree in order to have row level details with the
aggregations. So you can do aggregations while maintaining the row-level details and as well you can do complex
calculations, because you cannot do everything with a GROUP BY; there are functions that only work with the window.
And then we have some explanation when to use them. So we see here for example the syntax of the window function. So it is
divided into a function, OVER, PARTITION BY and ORDER BY, and here are a few explanations about
that. Then we have a few simple examples
with queries. So explaining the different functions but not all of them. Of course, you can go and ask
ChatGPT to extend the examples for all functions. And now we can see the top three use cases for the window
functions. So we use it in order to rank the data and as well to build the running totals and the moving average.
And at the end we have a summary. So as you can see we have a wonderful explanation about the concept of the SQL
window functions. Okay, moving on to the next one. And this one I use it very frequently in my projects. There are
in programming always different concepts that are very close to each other, and sometimes it is confusing and not really
clear what are the big differences between them. So here I have for you a prompt in order to compare different SQL
concepts. So now the prompt says I want to understand the differences between SQL window functions and the GROUP BY.
So both of them are used usually to aggregate data in SQL and I would like to understand more what are the
differences between them. So we define for the AI the following tasks. Explain the key differences between the two
concepts and then it's really important to understand when to use what. So describe when to use each concept with
examples and it's really nice to understand as well the advantages and the disadvantages of each concept and at
the end you would like maybe to get a quick summarization about the differences between those two functions
side by side in one table. Okay. So now let's see how ChatGPT can compare those two concepts. So first we have
a really nice table in order to see the differences between those two. So for
example the output granularity: it says
the window function provides calculations at the row level of detail, whereas the GROUP BY provides aggregated results at the group
level of detail. And if you are talking about the functions, it allows ranking, running totals, moving averages, and the
GROUP BY allows only the basic aggregations like SUM, AVG, COUNT. So this is a really nice overview of the
differences. Then we have when to use which concepts. So it's telling the window function it is used if you want
role level details together with the aggregations and here you have like a nice example for the group by it says
you can use it for example when summarizing data into categories like here grouping up the data by the region
and then after that we have like pros and cons for each concept. So the advantage of the window function we get
all the rows and for the group I it is like easier to understand and to use. For the disadvantage of the window
function it is more complex. For the group I the disadvantage is it removes the details about the rows and at the
end we have like sideby-side comparison between those two concepts. So as you can see we have really nice full
detailed comparison between those two SQL concepts. Practicing SQL with the AI. Well, it is really not enough to
just read about something or follow and watch a course in order to learn it. You always have to
practice. And of course, it is really hard to find materials to practice a new programming language. So
we can do it like this. We give a role, act as an SQL trainer, and then a context where we say: help me practice SQL
window functions. Then we go and configure this practice by doing the following. We tell it to make
the practice interactive, so the AI provides a task and you give a solution. What else is important is that it
provides you with a simple data set, and of course you can specify which data set you want, whether it is an industrial data set or
healthcare or anything you want. Then we tell the AI: give SQL tasks that gradually increase in difficulty. So we
start with the basics and work up to advanced tasks. And you can tell the AI to act as an SQL server and show the
results of your query, so you get as a result not only the correct solution or feedback, you also see the
result of the query that you gave. Then finally the AI should review your queries, provide feedback, and
suggest improvements. Okay, so now let's start practicing. I gave the prompt to ChatGPT, and now we have a simple data set,
so it is very simple: we have the sales ID, employee, region, sales date, and amount. And then we have the first task.
It says: write a query to rank employees by their total sales. Here you have an example output, and now
it says your turn, so ChatGPT is waiting for your answer. Okay. So now I have just prepared a query for it. Let's see
what happens once I post it. Oh no, I got some errors in the query. So let's see what we have. It says there is an error in
the aggregation: you should use amount instead of sales. And it says there is an unnecessary PARTITION BY in the RANK, and
so on. So let's check the correct query. We have here the GROUP BY, and then we have to use the window function without
PARTITION BY. So that was a mistake, and the result of this query is going to be this one. And here I have
really nice feedback about the first task. Now it asks me about the next task, so I'm going to say yes. Now we
have task number two, about the running total. We have a task and we have the data, and now we have to write
a query to solve the task. So my friends, it is nice, right? It is interactive, and not only SQL: you can go and practice any
programming language. Now moving on to the last prompt. You can use AI to prepare you for an SQL interview. So
let's say that you are invited to an interview and you would like to prepare yourself for it. So you can do a quick
preparation together with the AI. You can say the following: act as an interviewer and prepare me for an SQL interview. And
now you can go and configure the interview, where you can say: ask common SQL interview questions and make it
interactive, so it provides a question and then waits for you to answer. Then you can say: gradually progress to
advanced topics, so from basics to advanced. And it is very important that it evaluates your answer and gives you
feedback. So it is a really great way to prepare for interviews and I really recommend doing it, and you can prepare
yourself not only for an SQL interview, you can prepare yourself for an SQL exam too. Okay. So now let's prepare
for an SQL interview. And here we have the first question. ChatGPT asks: what is the difference between WHERE and
HAVING? So now it is waiting for an answer. We can say: WHERE filters data before
aggregation and HAVING filters data after aggregation. So let's check the answer. Here it is
giving me an example of a very solid answer, but in general I have answered correctly. It says the answer is
correct, but the feedback says the interviewer may need more details, not only one sentence about the
differences. So here it is encouraging me to speak more and to give more details in my answer, but still the
answer is correct. So now let's go to the next question. What do we have here? Can you explain the differences between
INNER JOIN and LEFT JOIN? I hope you know the answer, but as you can see it is very interactive and nice, and I think
those questions are really relevant. If I'm interviewing someone, I'm going to ask these questions: what is the
difference between WHERE and HAVING, and as well the differences between the join types. So this is amazing, right? I
really recommend that if you have an interview, you go and prepare yourself using ChatGPT, so you can practice and
prepare yourself before going to the interview. All right. So with that you have learned how I use AI to
assist me while I'm coding in SQL. And now, my friends, we come to the most important chapter of the whole course.
You have now learned a lot of things about SQL: a lot of advanced techniques, a lot of functions, how to transform
data, how to aggregate data. But now what you have to do is to take everything and apply it in SQL
projects. And those projects are not just easy projects. I built projects for you that are very similar to
the real projects that I do in the industry. So you will learn not only how to do a project in SQL but as
well what the main steps are and how we implement projects in the real world. And here I have for you three projects: data
warehousing, data exploration, and advanced data analytics. We're going to start with the first one, the data
warehousing project. This one is going to be amazing, so let's deep dive into it. All right, my friends. So now if
you want to do data analytics projects using SQL, we have three different types. The first type of project you can do is
data warehousing. It's all about how to organize, structure, and prepare your data for data analysis. It is the
foundation of any data analytics project. In the next step, you can do exploratory data analysis, EDA. And
all you have to do is to understand and uncover insights about your data sets. In this kind of project, you can learn
how to ask the right questions and how to find the answers using just basic SQL skills. Now moving on to
the last stage, where you can do advanced analytics projects, where you're going to use advanced SQL techniques to
answer business questions like finding trends over time, comparing performance, segmenting your data into
different sections, and as well generating reports for your stakeholders. So here you will be solving real business
questions using advanced SQL techniques. Now what we're going to do is start with the first type of project,
SQL data warehousing, where you will gain the following skills. First, you will learn how to do ETL and ELT processing using
SQL in order to prepare the data. You will learn as well how to build a data architecture, how to do data
integration, where we're going to merge multiple sources together, and as well how to do data loading and data modeling.
So if I got you interested, grab your coffee and let's jump into the project. All right, my friends. So now
before we deep dive into the tools and the cool stuff, we first have to have a good understanding of what exactly a
data warehouse is and why companies try to build such a data management system. So now the question is, what is a data
warehouse? I will just use the definition from the father of the data warehouse, Bill Inmon: a data warehouse is a
subject-oriented, integrated, time-variant, and nonvolatile collection of data designed to support management's
decision-making process. Okay, I know that might be confusing. Subject-oriented means that the
warehouse always focuses on a business area like sales, customers, finance, and so on. Integrated because it goes
and integrates multiple source systems. Usually you build a warehouse not only for one source but for multiple sources.
Time-variant means you can keep historical data inside the data warehouse. Nonvolatile means that once the
data enters the data warehouse, it is not deleted or modified. So this is how Bill Inmon defined the data warehouse.
Okay. So now I'm going to show you the scenario where your company doesn't have real data management. Let's say
that you have one system, and you have one data analyst who has to go to this system and start collecting and
extracting the data, and then he is going to spend days and sometimes weeks transforming the raw data into something
meaningful. Then once they have the report, they're going to go and share it. And this data analyst is sharing the
report using Excel. Then you have another source of data, and you have another data analyst who is doing
maybe the same steps: collecting the data, spending a lot of time transforming the data, and then sharing at the end a
report. This time she is sharing the data using PowerPoint. And a third system, the same story, but this time he is
sharing the data using maybe Power BI. So now if the company works like this, then there are a lot of issues. First, this
process takes way too long. I saw a lot of scenarios where it sometimes takes weeks and even months until the
employees have manually generated those reports. And of course, what happens for the users? They are consuming
multiple reports with multiple states of the data. One report is 40 days old, another one 10 days, and a third one is
5 days. So it's going to be really hard to make a real decision based on this structure. A manual process is
always slow and stressful, and the more employees you involve in the process, the more you open the door to human
errors, and errors in reports of course lead to bad decisions. Another issue, of course, is handling big data. If
one of your sources is generating a massive amount of data, then the data analyst is going to struggle collecting the
data, and maybe in some scenarios it will no longer be possible to get the data. So the whole process can break and you
cannot generate fresh data anymore for specific reports. And one last very big issue with that: if one of your
stakeholders asks for an integrated report from multiple sources, well, good luck with that, because merging all that
data manually is very chaotic, time-consuming, and full of risk. So this is just a picture of a company
working without proper data management, without a data lake, a data warehouse, or a data lakehouse. So in order
to make real and good decisions, you need data management. So now let's talk about the scenario of a data warehouse.
So the first thing that's going to happen is that you will not have your data team collecting the data manually.
You're going to have a very important component called ETL. ETL stands for extract, transform, and load. It is a
process that you run in order to extract the data from the sources and then apply multiple transformations on that
data, and at the end it loads the data into the data warehouse. This one is going to be the single point of truth for
analysis and reporting, and it is called the data warehouse. So now what happens is that all your reports are going to be consuming
this single point of truth. With that you create your multiple reports, and as well you can create integrated reports
from multiple sources, not only from one single source. So now, looking at the right side, it already looks organized,
right? And the whole process is completely automated. There are no more manual steps, which of course reduces
human error, and as well it is pretty fast. Usually you can load the data from the sources all the way to the reports in
a matter of hours or sometimes minutes. So there is no need to wait weeks and months in order to refresh anything.
And of course the big advantage is that the data warehouse itself is completely integrated. That means it
goes and brings all those sources together in one place, which makes reporting much easier. And not only is it
integrated, you can build history in the data warehouse as well. So we now have the possibility to access historical
data. What is also amazing is that all those reports have the same data status. So all those reports can have
the same status, maybe one day old or something. And of course, if you have a modern data warehouse on a cloud
platform, you can really easily handle any big data sources. So no need to panic if one of your sources is
delivering a massive amount of data. And of course, in order to build the data warehouse you need different types of
developers. Usually the one that builds the ETL component and the data warehouse is the data engineer. They
are the ones accessing the sources, scripting the ETLs, and building the database for the data warehouse. And
now for the other part, the one that is responsible for that is the data analyst. They are the ones
consuming the data warehouse, building different data models and reports, and sharing them with the stakeholders. So
they are usually contacting the stakeholders, understanding the requirements, and building multiple
reports based on the data warehouse. So now if you have a look at those two scenarios, this is exactly why we need
data management. Your data team is not wasting time fighting with the data. They are now more organized and more
focused, and with a data warehouse you are delivering professional and fresh reports that your company can
count on in order to make good and fast decisions. So this is why you need data management like a data warehouse.
Think about a data warehouse as a busy restaurant. Every day different suppliers bring in fresh ingredients:
vegetables, spices, meat, you name it. They don't just use it immediately and throw everything in one pot, right? They
clean it, chop it, and organize everything, and store each ingredient in the right place, fridge or freezer. So
this is the preparation phase. And when the order comes in, they quickly grab the prepared ingredients and create a
perfect dish and then serve it to the customers of the restaurant. This process is exactly like the data
warehouse process. It is like the kitchen where the raw ingredients, your data, are cleaned, sorted, and stored. And
when you need a report or analysis, it is ready to serve up exactly what you
need. Okay. So now we're going to zoom in and focus on the ETL component. If you are building such a project, you're
going to spend almost 90% of the work just building this component, the ETL. So it is the core element of the data warehouse, and I
want you to have a clear understanding of what exactly ETL is. So our data exists in a source system. And now what
we want to do is to get our data from the source and move it to the target. Source and target could be, for example, database
tables. So now the first step that we have to do is to specify which data we have to load from the source. Of course
we can say that we want to load everything, but let's say that we are doing incremental loads. So we're going
to go and specify a subset of the data from the source in order to prepare it and load it later to the target. This
step in the ETL process we call extract. We are just identifying the data that we need. We pull it out and we
don't change anything. It's going to be one-to-one like the source system. So the extract has only one task: to
identify the data that we have to pull out from the source and to not change anything. So we will not manipulate the
data at all. It can stay as it is. So this is the first step in the ETL process, the extract. Now moving on to
stage number two. We're going to take this extracted data and we will do some manipulations and transformations, and
we're going to change the shape of that data. This step is really heavy work. We can do a lot of stuff like
data cleansing, data integration, and a lot of formatting and data normalization. So there is a lot of stuff we can
do in this step. So this is the second step in the ETL process, the transformation. We're going to take the
original data and reshape it, transform it into exactly the new format and shape that we
need for analysis and reporting. Now, finally, we get to the last step in the ETL process. We have the load. In
this step, we're going to take this new data and we're going to insert it into the target. So it is very simple. We're
going to take this prepared data from the transformation step and we're going to move it into its final destination,
the target, like for example a data warehouse. So that's ETL in a nutshell. First extract the raw data, then
transform it into something meaningful, and finally load it to a target where it's going to make a difference. So
that's it. This is what we mean by the ETL process. Now in real projects, we don't have only a source and a target.
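Before looking at layered architectures, the basic extract, transform, load flow just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the course's implementation: the `source_orders` and `target_orders` tables, their columns, and the date watermark are made up, both tables live in one in-memory SQLite database for simplicity, and Python's built-in sqlite3 module is used so the sketch runs on its own.

```python
import sqlite3

def extract(conn, since):
    """Extract: pull only the subset we need (rows changed after `since`), unchanged."""
    return conn.execute(
        "SELECT id, name, amount, updated_at FROM source_orders "
        "WHERE updated_at > ? ORDER BY id",
        (since,),
    ).fetchall()

def transform(rows):
    """Transform: trim and format names, cast the amount from text to a number."""
    return [(row_id, name.strip().title(), float(amount))
            for row_id, name, amount, _updated_at in rows]

def load(conn, rows):
    """Load: insert the prepared rows into the target table."""
    conn.executemany(
        "INSERT INTO target_orders (id, customer, amount) VALUES (?, ?, ?)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_orders (id INTEGER, name TEXT, amount TEXT, updated_at TEXT);
CREATE TABLE target_orders (id INTEGER, customer TEXT, amount REAL);
INSERT INTO source_orders VALUES
  (1, '  alice ', '10.50', '2024-01-01'),
  (2, 'BOB',      '20',    '2024-01-05'),
  (3, ' carol',   '7.25',  '2024-01-09');
""")

raw = extract(conn, since="2024-01-02")   # incremental: only rows newer than the watermark
clean = transform(raw)
load(conn, clean)

loaded = conn.execute(
    "SELECT id, customer, amount FROM target_orders ORDER BY id"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the point made above: the extract only selects data and changes nothing, all reshaping happens in the transform, and the load just writes the prepared rows.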
Our data architecture is going to have multiple layers, depending on your design and whether you are building a data warehouse,
a data lake, or a data lakehouse. And usually there are different ways to load the data between all those
layers. In order to load the data from one layer to another, there are multiple ways to use the
ETL process. Usually, if you are loading the data from the source to layer number one, you only extract the
data from the source and load it directly into layer number one without doing any transformations, because I want
to see the data as it is in the first layer. And now between layer number one and layer number two you might
go and use the full ETL. So we're going to extract from layer one, transform it, and then load it to layer number
two. With that we are using the whole process, the ETL. And now between layer two and layer three we can do only
transformation and then load. We don't have to deal with how to extract the data, because it is maybe using the
same technology and we are taking all the data from layer two to layer three. So we transform the whole of layer two and then
load it to layer three. And now between three and four you can use only the load. So maybe it's something like duplicating
and replicating the data, and then you are doing the transformation. So you load to the new layer and then transform
it. Of course, this is not a real scenario. I'm just showing you that in order to move from a source to a target,
you don't always have to use a complete ETL. Depending on the design of your data architecture, you might use only a few
components of the ETL. Okay. So this is how ETL looks in real projects. Okay. So now I would like to show you an
overview of the different techniques and methods in ETL. We have a wide range of possibilities where you have to make
decisions on which one you want to apply to your project. So let's start first with the extraction. The first thing
that I want to show you is that we have different methods of extraction. Either you are going to the source system and
pulling the data from the source, or the source system is pushing the data to the data warehouse. So those are the two
main methods of extracting data. And then we have two types of extraction. We have a full extraction:
everything, all the records from the tables, and every day we load all the data to the data warehouse. Or we do a
smarter one, where we say we're going to do an incremental extraction, where every day we identify only the new or
changed data. So we don't have to load the whole thing, only the new data: we go and extract it and then load it to the data
warehouse. And in data extraction we have different techniques. The first one is manual, where someone has to
access a source system and extract the data manually. Or we connect ourselves to a database and then we have a query in
order to extract the data. Or we have a file that we have to parse into the data warehouse. Another technique is
to connect ourselves to an API and make calls in order to extract the data. Or if the data is available as a stream, like
in Kafka, we can do event-based streaming in order to extract the data. Another way is to use change data capture;
CDC is as well something very similar to streaming. Or another way is by using web scraping, where you have code that is
going to run and extract all the information from the web. So those are the different techniques and types that
we have in the extraction. Now if we are talking about the transformation, there is a wide range of different
transformations that we can do on our data, like for example data enrichment, where we add values to our
data sets, or data integration, where we have multiple sources and we bring everything into one data model, or we
derive new columns based on already existing ones. Another type of data transformation is data
normalization. The sources have values that are like codes, and you go and map them to friendlier values for the
analysts, which are easier to understand and to use. Another type of transformation is business
rules and logic: depending on the business, you can define different criteria in order to build new columns. And
what also belongs to transformations is data aggregation. Here we aggregate the data to a different granularity. And
then we have a type of transformation called data cleansing. There are many different ways to clean our data.
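As a rough illustration of a few of these transformation types in a single query, here is a small sketch: normalizing a coded value, deriving a new column, and some basic cleansing. Everything in it is an assumption made for the example: the `raw_customers` table and its rows are made up, the reference year 2024 is hard-coded to keep the result fixed, and the query runs on SQLite through Python's built-in sqlite3 module instead of SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_customers (id INTEGER, name TEXT, gender_code TEXT, birth_year TEXT);
INSERT INTO raw_customers VALUES
  (1, '  Ana ', 'F',  '1990'),
  (1, '  Ana ', 'F',  '1990'),   -- exact duplicate row
  (2, 'Ben',    'M',  '1985'),
  (3, ' Cy',    NULL, '2000');   -- missing gender code
""")

query = """
SELECT DISTINCT                                   -- cleansing: remove duplicates
    id,
    TRIM(name) AS name,                           -- cleansing: remove unwanted spaces
    CASE gender_code                              -- normalization: code -> friendly value
        WHEN 'F' THEN 'Female'
        WHEN 'M' THEN 'Male'
        ELSE 'n/a'                                -- cleansing: handle missing values
    END AS gender,
    CAST(birth_year AS INTEGER) AS birth_year,    -- cleansing: cast the data type
    2024 - CAST(birth_year AS INTEGER) AS age_in_2024  -- derived column (2024 is assumed)
FROM raw_customers
ORDER BY id;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The cleansing in this sketch covers only a handful of cases, and it can take many more forms.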
For example, removing duplicates, filtering data, handling missing data, handling invalid values,
removing unwanted spaces, casting data types, detecting outliers, and many more. So we have different
types of data cleansing that we can do in our data warehouse, and this is a very important transformation. So as you can
see, we have different types of transformations that we can do in our data warehouse. Now moving on to the
load. So what do we have over here? We have different processing types. Either we are doing batch processing or
stream processing. Batch processing means we are loading the data warehouse in one big batch of data that's going to
run and load the data warehouse. So it is one job that runs in order to refresh the content of the data warehouse and as
well the reports. That means we are scheduling the data warehouse load to run once or twice a day. And
the other type is stream processing. This means that if there is a change in the source system,
we're going to process this change as soon as possible. So we're going to process it through all the layers of the
data warehouse once something changes in the source system. So we are streaming the data in order to have a real-time
data warehouse, which is a very challenging thing to do in data warehousing. And if we are talking
about the loads, we have two methods: either we are doing a full load or an incremental load. It's the same thing as
extraction, right? For the full load in databases there are different methods to do it. For example,
we truncate and then insert, which means we make the table completely empty and then we insert everything from
scratch. Or another one: you are doing an update and insert, which we call upsert. So we can go and update all the records and
then insert the new ones. And another way is to drop, create, and insert. That means we drop the whole table and then
we create it from scratch and then we insert the data. It is very similar to the truncate, but here we are as well
removing and dropping the whole table. So those are the different methods of full loads. For the incremental load we can
use the upsert as well, so update and insert. We're going to run update or insert statements against our tables. Or
if the source is something like a log, we can do insert only. So we can always append the data to the table
without having to update anything. Another way to do an incremental load is to do a merge. It is very similar
to the upsert but with a delete as well: update, insert, delete. So those are the different methods to load the
data into your tables. And one more thing in data warehousing: we have something called slowly changing dimensions. So
here it's all about the historization of your table. And there are many different ways to handle
historization in your table. The first type is SCD 0. We say there is no historization and nothing should be
changed at all. That means you are not going to update anything. The second one, which is more famous, is SCD
1. You are doing an overwrite. That means you are updating the records with the new information from the source
system by overwriting the old value. So we are doing something like the upsert, update and insert, but you are of course losing
the history. Another one is SCD 2, and here you want to add historization to your table. So what
we do is this: for each change that we get from the source system, we insert a new record, and we do not
overwrite or delete the old data. We just make it inactive, and the new record is going to be the
active one. So there are different methods to do historization
as well while you are loading the data into the data warehouse. All right. So those are the different types and techniques that you might encounter in data
management projects. So now what I'm going to show you quickly is which of those types we will be using in our project.
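Before narrowing things down to the project's choices, the load methods and historization types described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the `customer_scd1` and `customer_scd2` tables, the `is_active` flag, and the sample rows are made up, and it runs on SQLite through Python's built-in sqlite3 module. SQLite has no TRUNCATE, so `DELETE` stands in for it, and its `INSERT ... ON CONFLICT` upsert (SQLite 3.24 or newer) stands in for the update-and-insert pattern. A real SCD 2 table would usually also carry validity dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_scd1 (id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE customer_scd2 (id INTEGER, city TEXT, is_active INTEGER);
""")

def full_load(rows):
    """Full load (truncate and insert): empty the table, then insert everything."""
    conn.execute("DELETE FROM customer_scd1")  # stand-in for TRUNCATE TABLE
    conn.executemany("INSERT INTO customer_scd1 (id, city) VALUES (?, ?)", rows)

def upsert(rows):
    """Upsert / SCD 1: overwrite existing rows, insert new ones; history is lost."""
    conn.executemany(
        "INSERT INTO customer_scd1 (id, city) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET city = excluded.city",
        rows,
    )

def scd2_load(rows):
    """SCD 2: keep the old version as inactive and insert the change as the active row."""
    for cust_id, city in rows:
        active = conn.execute(
            "SELECT city FROM customer_scd2 WHERE id = ? AND is_active = 1",
            (cust_id,),
        ).fetchone()
        if active is not None and active[0] == city:
            continue  # nothing changed, keep the current active row
        conn.execute(
            "UPDATE customer_scd2 SET is_active = 0 WHERE id = ? AND is_active = 1",
            (cust_id,),
        )
        conn.execute(
            "INSERT INTO customer_scd2 (id, city, is_active) VALUES (?, ?, 1)",
            (cust_id, city),
        )

day1 = [(1, "Berlin"), (2, "Paris")]
day2 = [(2, "Rome"), (3, "Oslo")]   # customer 2 moved, customer 3 is new

full_load(day1)
upsert(day2)
scd1_rows = conn.execute("SELECT id, city FROM customer_scd1 ORDER BY id").fetchall()

scd2_load(day1)
scd2_load(day2)
scd2_rows = conn.execute(
    "SELECT id, city, is_active FROM customer_scd2 ORDER BY id, is_active"
).fetchall()

print(scd1_rows)
print(scd2_rows)
```

After both days, the SCD 1 table only knows that customer 2 lives in Rome, while the SCD 2 table still holds the inactive Paris row next to the active Rome row.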
So now if we are talking about the extraction over here, we will be doing a pull extraction, and as for full or
incremental, it's going to be a full extraction. As for the technique, we are going to be parsing files into the
data warehouse. And now about the data transformations. Well, here we will cover everything: all those types of
transformations that I'm showing you now are going to be part of the project, because I believe in each data project
you will be facing those transformations. Now if you have a look at the load, our project is going to use
batch processing, and as for the load method, we will be doing a full load, since we have a full extraction, and it's
going to be truncate and insert. And now about the historization: we will be doing SCD 1. That means we
will be updating the content of the data warehouse. So those are the different techniques and types that we will be
using in our ETL process for this project. All right. So with that we now have a clear understanding of what a data
warehouse is, and we are done with the theory part. So now in the next step we're going to start with the project. The
first thing that we have to do is to prepare our environment to develop the project. So let's start with
that. All right. So now we go to the link in the description, and from there we're going to go to the downloads, where
you can find all the materials for all courses and projects. But the one that we need now is the SQL data warehouse
project. So let's go to the link, and here we have a bunch of links that we need for the project. But the most important
one, to get all the data and files, is this one: download all project files. So let's go and do that. After you do that
you're going to get a zip file with a lot of stuff in it. So let's go and extract it. And now inside it, if you
go over here, you will find the repository structure from Git. The most important part here is the data
sets. You have two sources, the CRM and the ERP, and in each one of them there are three CSV files. So those are
the data sets for the project. As for the other stuff, don't worry about it. We will be explaining that during the
project. So go and get the data and put it somewhere on your PC where you won't lose it. Okay. So now what else do we
have? We have here a link to the Git repository. This is the link to my repository that I created during
the project. You can go and access it, but don't worry about it. We're going to explain the whole structure
during the project, and you will be creating your own repository. And as well we have the link to Notion.
Here we are doing the project management. Here you're going to find the main steps, the main phases of the
SQL project that we will do, and as well all the tasks that we will be doing together during the project. And now we
have links to the project tools. So if you don't have it already, go and download SQL Server Express. It's
a server that's going to run locally on your PC, where your database is going to live. Another one that you have
to download is SQL Server Management Studio. It is just a client to interact with the database, and there
we're going to run all our queries. Then there is a link to GitHub and as well a link to draw.io. If you don't have it
already, go and download it. It is a free and amazing tool for drawing diagrams. Throughout the project we
will be drawing data models, the data architecture, and the data lineage, so we'll be doing a lot of stuff using this tool. So
go and download it. And the last thing, which is nice to have: you have a link to Notion, where you can go and create
a free account if you want to build the project plan and as well follow me by creating the project steps
and the project tasks. Okay. So that's all, those are all the links for the project. So go and download all those
stuff create the accounts and once you are ready then we continue with the projects. All right. So now I hope that
you have downloaded all the tools and created the accounts. Now it's time to move to very important step that almost
all people skip while doing projects and that is by creating the project plan and for that we will be using the tool
notion. Notion is of course a free tool and it can help you to organize your ideas, your plans and resources all in
one place. I use it very intensively for my private projects like for example creating this course and I can tell you
creating a project plan is the key to success. Creating a data warehouse project is usually very complex. And
according to Gartner reports, over 50% of data warehouse projects fail. In my opinion about any complex project, the
key to success is to have a clear project plan. So now at this phase of the project, we're going to go and
create a rough project plan because at the moment we don't have yet clear understanding about the data
architecture. So let's go. Okay. So now let's create a new page and let's call it data warehouse projects. The first
thing is that we have to go and create the main phases and stages of the project, and for that we need a table.
So in order to do that, hit slash and then type "database inline", and then let's go and call it something like Data
Warehouse Epics, and we're going to go and hide its title because I don't like it. And then on the table we can go and rename
it, like for example Project Epics, something like that. And now what we're going to do, we're going to go and list
all the big tasks of the project. So an epic is usually a large task that needs a lot of effort in order to complete
it. So you can call them epics, stages, phases of the project, whatever you want. So we're going to go and list our
project steps. So let's start with the requirements analysis, and then designing the data
architecture, and as another one we have the project initialization. So those are the first three
big tasks in the project. And now what do we need? We need another table for the small chunks of work, the
subtasks, and we're going to do the same thing. So we're going to go and hit slash and we're going to search for the
inline table and we're going to do the same thing. So first we're going to call it Data Warehouse Tasks, and then we're
going to hide it, and over here we're going to rename it and say this is the Project Tasks. So now what we're going
to do, we're going to go to the plus icon over here and then search for Relation. This one over here with the
arrow. And now we're going to search for the name of the first table. So we called it Data Warehouse Epics. So let's
go and click it, and we're going to choose a two-way relation as well. So let's go and add the relation. So with that we
got a field in the new table called Data Warehouse Epics. This comes from this table, and as well we have here Data
Warehouse Tasks, which comes from the table below. So as you can see, we have linked them together. Now what I'm going
to do I'm going to take this to the left side and then what we're going to do we're going to go and select one of
those epics. Like for example let's take design the data architecture. And now what we're going to do, we're going to
go and break down this epic into multiple tasks. Like for example, choose data management approach. And then we
have another task. What we're going to do, we're going to go and select as well the same epic. So maybe the next step is
brainstorm and design the layers. And then let's go to another epic for example the project initialization. And
we say over here, for example, create the Git repo and prepare the structure. We can go and make another one in the same epic.
Let's say we're going to go and create the database and the schemas. So, as you can see, I'm just defining the subtasks
of those epics. So, now what we're going to do, we're going to go and add a checkbox in order to understand whether
we have done the task or not. So, we go to the plus and search for check. We need a checkbox. And what we're going to
do, we're going to make it really small like this. And with that, each time we are done with the task, we're going to
go and click on it just to mark that we have done the task. Now, there is one more thing that is not really
working nicely, and that is here. We're going to have a long list of tasks and it's really annoying. So, what we're
going to do, we're going to go to the plus over here and let's search for Rollup. So, let's go and select it. So, now
what we're going to do, we have to go and select the relation. It's going to be Data Warehouse Tasks. And after
that, we're going to go to the property and set it to the checkbox. So, now as you can see in the first table, we are
showing how many tasks are closed. But I don't want to show it like this. What we can do, we're going to go to the
calculation, then to Percent, and then Percent checked. And with that, we can see the progress of our project. And now
instead of the numbers, we can have a really nice bar. Great. So as well, we can go and give it a name like Progress.
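For readers who think in SQL, the relation plus the "Percent checked" rollup can be sketched as two tables and one GROUP BY query. This is only an illustration under stated assumptions: Notion does not expose its data this way, the table and column names (`epics`, `tasks`, `done`) are made up for the example, and Python's built-in sqlite3 module is used just to keep the sketch runnable.

```python
import sqlite3

# Hypothetical schema mirroring the two Notion tables: epics and their tasks.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE epics (
        epic_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE tasks (
        task_id INTEGER PRIMARY KEY,
        epic_id INTEGER NOT NULL REFERENCES epics(epic_id),  -- the relation
        name    TEXT NOT NULL,
        done    INTEGER NOT NULL DEFAULT 0                    -- the checkbox
    );
""")
conn.executemany("INSERT INTO epics VALUES (?, ?)", [
    (1, "Requirements analysis"),
    (2, "Design data architecture"),
    (3, "Project initialization"),
])
conn.executemany("INSERT INTO tasks (epic_id, name, done) VALUES (?, ?, ?)", [
    (2, "Choose data management approach", 1),
    (2, "Brainstorm and design the layers", 0),
    (3, "Create Git repo and prepare the structure", 0),
    (3, "Create the database and the schemas", 0),
])

# Rollup: percent of checked tasks per epic (an epic without tasks shows 0).
progress = conn.execute("""
    SELECT e.name,
           COALESCE(ROUND(100.0 * SUM(t.done) / COUNT(t.task_id)), 0) AS pct_done
    FROM epics AS e
    LEFT JOIN tasks AS t ON t.epic_id = e.epic_id
    GROUP BY e.epic_id, e.name
    ORDER BY e.epic_id
""").fetchall()

for name, pct in progress:
    print(f"{name}: {pct:.0f}%")
```

With one of the two architecture tasks checked, that epic reports 50%, which is the number the Notion progress bar visualizes.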
So that's it. And we can go and hide the Data Warehouse Tasks field. And now with that, we have a really nice progress bar for
each epic. And if we close all the tasks of this epic, we can see that we have reached 100%. So this is the main
structure. Now we can go and add some cosmetics and rename stuff in order to make things look nicer. Like for
example, if I go to the tasks over here, I can go and call it Tasks and as well go and change the icon to something like
this. And if you'd like to have an icon for all those epics, what you're going to do is go to the epic, for
example Design Data Architecture. And then if you hover over the title, you can see "Add icon". And you can go
and pick any icon that you want. So for example, this one. And now as you can see, we have defined it here at the top.
And the icon is going to be in the table below as well. Okay. So now one more thing that we can do for the project tasks is
that we can go and group them by the epics. So if you go to the three dots and then go to Group, then we can
group by the epics. As you can see, now we have a section for each epic, and you can go and sort the epics if you
want. If you go over here to Sort and then Manual, you can start sorting the epics as you want. And
with that you can expand and collapse each group, if you don't always want to see all the tasks in one go. So this is
a really nice way to build something like project management for your projects. Of course, in companies, we use
professional tools in order to run projects, like for example Jira. But for the private personal projects that I do, I
always do it like this, and I really recommend you do it not only for this project but for any project that you are
doing. Because if you see the whole project in one go, you can see the big picture, and closing tasks and doing it like
this, these small things are going to make you really satisfied, keep you motivated to finish the whole project,
and make you proud. Okay friends, so now I just went and added a few icons, renamed stuff, and added more tasks for
each epic, and this is going to be our starting point in the project, and once we have more information we're going to
go and add more details on how exactly we're going to build the data warehouse. So at the start we're going to go and
analyze and understand the requirements and only after that we're going to start designing the data architecture and here
we have three tasks. First we have to choose the data management approach and after that we're going to do
brainstorming and designing the layers of the data warehouse, and at the end we're going to go and draw the data
architecture. So with that we have a clear understanding of what the data architecture looks like, and after that we're going to
go to the next epic, where we're going to start preparing our project. So once we have a clear understanding of the data
architecture, the first task here is to go and create detailed project tasks. So we're going to go and add more epics and
more tasks. And once we are done, then we're going to go and create the naming conventions for the project, just to make
sure that we have rules and standards in the whole project. And next we're going to go and create a Git repository,
and we're going to prepare the structure of the repository as well, so that we always commit our work there. And then
we're going to start with the first script, where we're going to create the database and schemas. So my friends, this
is the initial plan for the project. Now let's start with the first epic. We have the requirements
analysis. Now, when analyzing the requirements, it is very important to understand which type of data warehouse you're going to
go and build, because there is not only one standard for how to build it. And if you go blindly implementing the
data warehouse, you might be doing a lot of stuff that is totally unnecessary and you will be burning a lot of time. So
that's why you have to sit with the stakeholders, with the department, and understand what exactly we have to build,
and depending on the requirements you design the shape of the data warehouse. So now let's go and analyze the
requirements of this project. Now the whole project is split into two main sections. In the first section we have to
go and build a data warehouse. So this is a data engineering task, and we will go and develop the ETL and the data warehouse.
And once we have done that, we have to go and build analytics and reporting, business intelligence. So we're going to
do data analysis. But first we will be focusing on the first part, building the data warehouse. So what do we have
here? The statement is very simple. It says: develop a modern data warehouse using SQL Server to consolidate sales
data, enabling analytical reporting and informed decision-making. So this is the main statement, and then we have
specifications. The first one is about the data sources. It says: import data from two source systems, ERP and CRM, and
they are provided as CSV files. And now the second task is talking about the data quality. We have to clean and fix
data quality issues before we do the data analysis, because let's be real, there is no raw data that is perfect. It is
always messy and we have to clean that up. Now the next task is talking about the integration. So it says we have to
go and combine both of the sources into one single user-friendly data model that is designed for analytics and reporting.
So that means we have to go and merge those two sources into one single data model. And now we have here another
specification. It says: focus on the latest data set. So there is no need for historization. So that means we
don't have to go and build histories in the database. And the final requirement is talking about the documentation. So
it says: provide clear documentation of the data model, so that means the final product of the data warehouse, to support
the business users and the analytical teams. So that means we have to generate a manual that's going to help the users
and make life easier for the consumers of our data. So as you can see, maybe these are very generic requirements,
but they already have a lot of information for you. So they say that we have to use the platform SQL Server. We have two
source systems using CSV files, and it sounds like we really have bad data quality in the sources, and as well they
want us to focus on building a completely new data model that is designed for reporting, and they say we don't have to
do historization, and it is expected from us to generate documentation of the system. So these are the
requirements for the data engineering part, where we're going to go and build a data warehouse that fulfills these
requirements. All right. So with that we have analyzed the requirements, and as well we have closed the first and
easiest epic. So we are done with this. Let's go and close it. And now let's open another one. Here we have to design
the data architecture, and the first task is to choose the data management approach. So let's
go. Now, designing the data architecture is exactly like building a house. So before construction starts, an
architect is going to go and design a plan, a blueprint for the house: how the rooms will be connected, how to make the
house functional, safe, and wonderful. And without this blueprint from the architect, the builders might create
something unstable, inefficient, or maybe unlivable. The same goes for data projects. A data architect is like a
house architect. They design how your data will flow, integrate, and be accessed. So as data architects we make
sure that the data warehouse is not only functioning but also scalable and easy to maintain. And this is exactly what we
will do now. We will play the role of the data architect and we will start brainstorming and designing the
architecture of the data warehouse. So now I'm going to show you a sketch in order to understand what are the
different approaches in order to design a data architecture. And this phase of the projects usually is very exciting
for me because this is my main role in data projects. I am a data architect and I discuss a lot of different projects
where we try to find out the best design for the projects. All right. So now let's
go. Now the first step of building a data architecture is to make a very important decision: to choose between
four major types. The first approach is to build a data warehouse. It is very suitable if you have only structured
data and your business wants to build solid foundations for reporting and business intelligence. And another
approach is to build a data lake. This one is way more flexible than a data warehouse, because you can store not only
structured data but semi-structured and unstructured data as well. We usually use this approach if you have mixed types of data,
like database tables, logs, images, and videos, and your business wants to focus not only on reporting but also on
advanced analytics or machine learning. But it's not as organized as a data warehouse, and a data lake that is too
unorganized turns into a data swamp, and this is where we need the next approach. So as the next one, we can go and
build a data lakehouse. So it is like a mix between a data warehouse and a data lake. You get the flexibility of having
different types of data from the data lake, but you still structure and organize your data like we do in the
data warehouse. So you mix those two worlds into one, and this is a very modern way to build a data architecture,
and this is currently my favorite way of building a data management system. Now the last and very recent approach is to
build a data mesh. So this is a little bit different. Instead of having a centralized data management system, the idea in
the data mesh is to make it decentralized. You cannot have one centralized data management system,
because whenever you say centralized, it means bottleneck. So instead you have multiple departments and multiple
domains, where each one of them is building a data product and sharing it with the others. So now you have to go and
pick one of those approaches, and in this project we will be focusing on the data warehouse. So now the question is how to
build the data warehouse. Well, there are four different approaches to building it as well. The first one is the
Inmon approach. So again you have your sources, and in the first layer you start with the staging, where the raw data is
landing, and then in the next layer you organize your data in something called the enterprise data warehouse, where you go
and model the data using the third normal form. It's about how to structure and normalize your tables. So
you are building a new integrated data model from the multiple sources. And then we go to the third layer. It's
called the data marts, where you go and take a small subset of the data warehouse and you design it in a way
that is ready to be consumed by reporting, and it focuses on only one topic, like for example customers, sales, or
products, and after that you go and connect your BI tool, like Power BI or Tableau, to the data marts. So with that
you have three layers to prepare the data before reporting. Now moving on to the next one, we have the Kimball
approach. He says, you know what, building this enterprise data warehouse wastes a lot of time. So what we can do,
we can jump immediately from the stage layer to the final data marts, because building this enterprise data warehouse is a
big struggle and usually wastes a lot of time. So he always wants you to focus on building the data marts as quickly as
possible. So it is a faster approach than Inmon, but with time you might get chaos in the data marts, because you are not always
focusing on the big picture and you might be repeating the same transformations and integrations in different data marts.
So there is a trade-off between speed and a consistent data warehouse. Now moving on to the third approach, we have
the Data Vault. So we still have the stage and the data marts, and it says we still need this central data warehouse
in the middle, but to this middle layer we're going to bring more standards and rules. So it tells you to split this
middle layer into two layers, the raw vault and the business vault. In the raw vault you have the original data, but in
the business vault you have all the business rules and transformations that prepare the data for the data marts. So
Data Vault is very similar to Inmon, but it brings more standards and rules to the middle layer. Now I'm going
to go and add a fourth one that I'm going to call the medallion architecture, and this one is my favorite because
it is very easy to understand and to build. So it says you're going to go and build three layers: bronze, silver, and
gold. The bronze layer is very similar to the stage, but we have understood over time that the stage
layer is very important, because having the original data as it is helps a lot with traceability and finding
issues. Then as the next layer we have the silver layer. It is where we do transformations and data cleansing, but we
don't yet apply any business rules. Now moving on to the last layer, the gold layer. It is very similar to the
data marts as well, but there we can build different types of objects, not only for reporting but also for machine
learning, for AI, and for many different purposes. So they are business-ready objects that you want to share as
data products. So those are the four approaches that you can use in order to build a data warehouse. So again if you
are building a data architecture you have to specify which approach you want to follow. So at the start we said we
want to build a data warehouse and then we have to decide between those four approaches on how to build a data
warehouse and in this project we will be using the medallion architecture. So this is a very important question that
you have to answer as the first step of building a data architecture. All right. So with that we have decided on the
approach. So we can go and mark it as done. The next step we're going to go and design the layers of the data
warehouse. Now, there is no 100% standard way and no fixed rules for each layer. What you have
to do as a data architect is define exactly what the purpose of each layer is. So we start with the bronze
layer. So we say it's going to store raw and unprocessed data as it is from the sources. And why are we doing that? It is
for traceability and debugging. If you have a layer where you are keeping the raw data, it is very important to have
the data as it is from the sources, because we can always go back to the bronze layer and investigate the data of a
specific source if something goes wrong. So the main objective is to have raw, untouched data that's going to help you
as a data engineer in analyzing the root cause of issues. Now moving on to the silver layer. It is the layer where
we're going to store clean and standardized data, and this is the place where we're going to do basic
transformations in order to prepare the data for the final layer. Now the gold layer is going to contain
business-ready data. So the main goal here is to provide data that can be consumed by business users and analysts in order to
build reporting and analytics. So with that we have defined the main goal for each layer. Now next, what I would like
to do is to define the object types, and since we are talking about a data warehouse in a database, we have here
generally two types, either a table or a view. So we are going with tables for the bronze layer and the silver layer,
but for the gold layer we are going with views. So the best practice says: for the last layer in your data warehouse,
make it virtual using views. It's going to give you a lot of flexibility and of course speed in building it, since we
don't have to make a load process for it. And now the next step is that we're going to go and define the load method.
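To make the object-type decision concrete, and to preview the truncate-and-insert full load that is described next, here is a minimal sketch. It is an illustration, not the project's actual script: it uses Python's built-in sqlite3 module so it can run anywhere, and every table, view, and column name is hypothetical. SQLite has no schemas and no TRUNCATE TABLE, so a name prefix and DELETE stand in for what would be `bronze.`/`silver.`/`gold.` schemas and TRUNCATE TABLE in SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite has no CREATE SCHEMA, so the layer is encoded as a name prefix.
# All object and column names below are hypothetical.
conn.executescript("""
    CREATE TABLE bronze_crm_customers (id INTEGER, name TEXT, country TEXT);
    CREATE TABLE silver_crm_customers (id INTEGER, name TEXT, country TEXT);

    -- Gold is virtual: a view over silver, so there is nothing to load.
    CREATE VIEW gold_customers_per_country AS
        SELECT country, COUNT(*) AS customers
        FROM silver_crm_customers
        GROUP BY country;
""")


def full_load(table, rows):
    """Full load: empty the target table, then insert the complete data set."""
    with conn:  # commit the delete and the inserts together
        conn.execute(f"DELETE FROM {table}")  # SQLite stand-in for TRUNCATE TABLE
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)


# Two consecutive full loads: the second one replaces the first completely.
first_batch = [(1, "Anna", "Germany"), (2, "Ben", "France")]
second_batch = [(1, "Anna", "Germany"), (3, "Cleo", "Germany")]
for batch in (first_batch, second_batch):
    full_load("bronze_crm_customers", batch)
    full_load("silver_crm_customers", batch)

gold_rows = conn.execute(
    "SELECT country, customers FROM gold_customers_per_country ORDER BY country"
).fetchall()
print(gold_rows)
```

The view always reflects whatever the latest full load left in the silver table, which is why the gold layer needs no load process of its own.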
So in this project I have decided to go with the full load, using the method of truncating and inserting. It is just
faster and way easier. So we're going to say for the bronze layer we're going to go with the full load. And you have to
specify it for the silver layer as well. There too we're going to go with the full load. And of course for the views we
don't need any load process. So each time you decide to go with tables, you have to define the load method, whether
full load, incremental load, and so on. Now we come to the very interesting part, the data transformations. Now for the
bronze layer, this is the easiest topic, because we don't have any transformations. We have to commit
ourselves to not touching the data: do not manipulate it, don't change anything. So it's going to stay as it is. If it comes in
bad, it's going to stay bad in the bronze layer. And now we come to the silver layer, where we have the heavy
lifting. As we committed in the objective, we have to produce clean and standardized data. And for that we have
different types of transformations. So we have to do data cleansing, data standardization, data normalization.
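As a rough illustration of what cleansing and standardization can look like on the way from bronze to silver, here is a small sketch. The rules in it (trimming whitespace, dropping exact duplicate rows, mapping gender codes to readable values, replacing missing values with 'n/a') are assumptions chosen for the example, not rules stated in the course, and the table and column names are hypothetical. It runs on Python's built-in sqlite3 module; the SELECT itself would look much the same in SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bronze_crm_customers (id INTEGER, first_name TEXT, gender TEXT);
    CREATE TABLE silver_crm_customers (id INTEGER, first_name TEXT, gender TEXT);
""")

# Messy raw data lands in bronze exactly as delivered.
conn.executemany("INSERT INTO bronze_crm_customers VALUES (?, ?, ?)", [
    (1, "  Anna ", "F"),
    (2, "Ben", "m "),
    (2, "Ben", "m "),    # exact duplicate row
    (3, "Cleo", None),   # missing gender
])

with conn:
    conn.execute("DELETE FROM silver_crm_customers")
    conn.execute("""
        INSERT INTO silver_crm_customers (id, first_name, gender)
        SELECT DISTINCT                          -- cleansing: drop duplicates
               id,
               TRIM(first_name),                 -- cleansing: strip whitespace
               CASE UPPER(TRIM(gender))          -- standardization: one spelling
                    WHEN 'F' THEN 'Female'
                    WHEN 'M' THEN 'Male'
                    ELSE 'n/a'                   -- missing or unknown codes
               END
        FROM bronze_crm_customers
    """)

silver = conn.execute(
    "SELECT id, first_name, gender FROM silver_crm_customers ORDER BY id"
).fetchall()
bronze_count = conn.execute(
    "SELECT COUNT(*) FROM bronze_crm_customers"
).fetchone()[0]
print(silver, bronze_count)
```

Note that the bronze table still holds all four messy rows afterwards: the cleanup happens only in the silver layer, in line with the commitment not to touch bronze data.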
We have to go and derive new columns and do data enrichment. So there is a bunch of transformations that we have to do in
order to prepare the data. Our focus here is to transform the data to make it clean and standards-compliant, and to try to
push all business transformations to the next layer. So that means in the gold layer we will be focusing on the business
transformations that are needed by the consumers, for the use cases. So what we do here: we do data integration between
source systems, we do data aggregations, we apply a lot of business logic and rules, and we build a data model that is ready
for, for example, business intelligence. So here we do a lot of business transformations, and in the silver layer
we do basic data transformations. So it is really very important here to make clear decisions on what type of
transformations are to be done in each layer, and to make sure that you commit to those rules. Now the next aspect is about the
data modeling. In the bronze layer and the silver layer, we will not break the data model that comes from the source
system. So if the source system delivers five tables, we're going to have five tables here, and in the
silver layer as well. We will not go and denormalize or normalize or make something new; we're going to leave it
exactly as it comes from the source system, because what we're going to do, we're going to build the data model in
the gold layer. And here you have to define which data model you want to follow. Are you following the star
schema, the snowflake schema, or are you just making aggregated objects? So you have to go and make a list of all the data model
types that you're going to follow in the gold layer. And at the end, what you can specify for each layer is the target
audience. And this is of course a very important decision. In the bronze layer, you don't want to give access to any end
user. It is really important to make sure that only data engineers access the bronze layer. It makes no sense for data
analysts or data scientists to go to the bad data, because you have a better version of it in the silver layer. So
for the silver layer, of course the data engineers have to have access to it, and the data analysts and the
data scientists and so on as well, but you still don't give it to any business user who can't deal with the raw data model from
the sources, because for the business users you're going to have a better layer, and that is the gold layer. So
the gold layer is suitable for the data analysts and for the business users as well, because usually the business users
don't have deep knowledge of the technicalities of the silver layer. So if you are designing multiple layers, you
have to discuss all those topics and make a clear decision for each layer. All right my friends. So now before we
proceed with the design, I want to tell you a secret principle that each data architect must know, and that is the
separation of concerns. So what is that? As you are designing an architecture, you have to make sure to break down the
complex system into smaller independent parts, where each part is responsible for a specific task. And here comes the magic.
The components of your architecture must not be duplicated. So you cannot have two parts doing the same thing. So
the idea here is to not mix everything together. And this is one of the biggest mistakes in any big project, and I have seen
that almost everywhere. So a good data architect follows this concept, this principle. So for example, if you
look at our data architecture, we have already done that. So we have defined a unique set of tasks for each layer. So
for example, we have said that in the silver layer we do data cleansing, but in the gold layer we do business
transformations, and with that you do not allow any business transformations in the silver layer, and
the same thing goes for the gold layer: you don't do any data cleansing in the gold layer. So each layer has its own
unique tasks, and the same thing goes for the bronze layer and the silver layer. You do not allow data to be loaded from the
source systems directly into the silver layer, because we have decided that the landing layer, the first layer, is the
bronze layer. Otherwise you will have one set of source systems that is loaded first into the bronze layer and
another set that is skipping that layer and going straight to silver, and with that we have overlap: you are doing data
ingestion in two different layers. So my friends, if you have this mindset, separation of concerns, I promise you,
you're going to be a top data architect. So think about it. All right, my friends. So with that, we have designed
the layers of the data warehouse. We can go ahead and close it. As the next step, we're going to go to Draw.io and start drawing the
data architecture. So there is no single standard for how to build a data
architecture. You can add your own style and do it the way that you want. So now the first thing that we have to show in the
architecture is the different layers that we have. The first layer is the source system layer. So let's go and
take a box like this and make it a little bit bigger. And I'm just going to go and make the design. So I'm going to
remove the fill and make the line dotted one. And after that I'm going to go and change maybe the color to something like
this gray. So now we have like a container for the first layer. And then we have to go and add like a text on top
of it. So what I'm going to do, I'm going to take another box. Let's go and type inside it sources. And now I'm
going to go and style it. So I'm going to go to the text and make it maybe 24. And then remove the lines like this.
Make it a little bit smaller and put it on top. So this is the first layer. This is where the data comes from. And then
the data is going to go into a data warehouse. So I'm just going to go and duplicate this one. This one is the data
warehouse. All right. So now the third layer, what is it going to be? It's going to be the consumers, who will be consuming
this data warehouse. So I'm going to put another box and say this is the consume layer. Okay. So those are the three
containers. Now inside the data warehouse, we have decided to build it using the medallion architecture. So
we're going to have three layers inside the warehouse. So I'm going to take again another box. I'm going to call
this one. This is the bronze layer. And now we have to go and put a design for it. So I'm going to go with this color
over here. And then the text and maybe something like 20. And then make it a little bit smaller and just put it here.
And beneath that we're going to have the component. So this is just a title of a container. So I'm going to have it like
this. Remove the text from inside it. And remove the filling. So this container is for the bronze layer. Let's
go and duplicate it for the next one. So this one is going to be the silver layer. And of course, we can go and change the
coloring to gray because it is silver. And as well the lines, and remove the filling. Great. And now maybe I'm going
to make the font bold. All right. Now the third layer is going to be the gold layer. And we have to go and pick a
color for that. So style and here we have like something like yellow. The same thing for the container. I remove
the filling. So with that we are showing now the different layers inside our data warehouse. Now those containers are
empty. What we're going to do, we're going to go inside each one of them and start adding content. So now in the
sources, it is very important to make it clear what the different types of source systems are that you are connecting
to the data warehouse, because in real projects there are multiple types. You might have a database, an API, files,
Kafka, and here it's important to show those different types. In our project we have folders, and inside those folders
we have CSV files. So now what we have to do is make it clear in this layer that the input for our project is
CSV files. So it really depends on how you want to show that. I'm going to go over here and search for maybe "folder", and then I'm
going to go and take the folder and put it here inside, and then maybe search for "file", more results, and go pick one of
those icons. For example, I'm going to go with this one over here. So I'm going to make it smaller and add it on top of
the folder. So with that we make it clear for everyone seeing the architecture that the source is not a
database, is not an API, it is a file inside a folder. So now what is very important to show here is the source systems: what
are the sources that are involved in the project? So here what we're going to do, we're going to go and give it a name.
For example, we have one source called CRM, like this, and maybe make the icon, and we have another source called ERP.
So we're going to go and duplicate it, put it over here, and then rename it ERP. So now it is clear for everyone: we have
two sources for this project, and the technology used is simply a file. So now what we can do as well, we can go and
add some descriptions inside this box to make it more clear. So what I'm going to do, I'm going to take a line because I
want to split the description from the icons something like this and make it gray. And then below it, we're going to
go and add some text, and we're going to say CSV files. And as the next point, we can say the interface is simply files
in a folder. And of course you can go and add any specifications and explanations about the sources. If it is a database,
you can say the type of the database and so on. So with that we have made it clear in the data architecture what the sources
of our data warehouse are. And now as the next step, what we're going to do, we're going to go and design the content of the
bronze, silver, and gold layers. So I'm going to start by adding an icon in each container. It is to show that we
are talking about a database. So what we're going to do, we're going to go and search for "database" and then more
results. I'm going to go with this icon over here. So let's go and make it bigger. Something like this.
Maybe change the color of dots. So, we're going to have the bronze and as well here the silver and the gold. So,
now what we can do, we're going to go and add some arrows between those layers. So, we're going to go over here.
So, we can go and search for arrow and maybe go and pick one of those. Let's go and put it here. And we can go and pick
a color for that. Maybe something like this. And adjust it. So, now we're going to have this nice arrow between all the
layers, just to explain the direction of our architecture, right? So we can read it from left to right, and as well
between the gold layer and the consume layer. Okay. So now what I'm going to do next, we're going to go and add one statement
about each layer, the main objective. So let's go and grab a text and put it beneath the database, and we're going to
say, for example, for the bronze layer it's going to be raw data. Maybe make the text bigger. So: raw
data. And then the next one, in the silver: clean, standardized data. And then the last one, for the gold, we can
say business-ready data. So with that we make the objective clear for each layer. Now
below all those icons what we're going to do we're going to have a separator again like this. Make it like colored.
And beneath it we're going to add the most important specifications of this layer. So let's go and add those
separators in each layer. Okay. So now we need a text below it. Let's take this one here. So what is the object type of
the bronze layer? That's going to be a table, and we can go and add the load method. We say this is batch
processing. Since we are not doing streaming, we can say it is a full load. We are not doing incremental load. So we
can say here truncate and insert. And then we add one more section, maybe about the transformations. So we can say no
transformations. And one more about the data model. We're going to say none, as is. And now what I'm going to do, I'm
going to go and add those specifications as well for the silver and gold. So here what we have discussed the object type
the load process the transformations and whether we are breaking the data model or not the same
thing for the gold layer. So I can say with that we have really nice layering of the data warehouse and what we are
left is with the consumers over here you can go and add the different use cases and tools that can access your data
warehouse. Like for example, I'm adding here business intelligence and reporting, maybe using Power BI or Tableau, or you
can say you can access my data warehouse in order to do ad hoc analyses using SQL queries, and this is what we're going to
focus on in the projects after we build the data warehouse. And as well you can offer it for machine learning purposes. And of
course, it's really nice to add some icons in your architecture, and usually I use this nice website called Flaticon.
It has really amazing icons that you can go and use it in your architecture. Now, of course, we can go and keep adding
icons and stuff to explain the data architecture and as well the system. Like for example, it is very important
here to say which tools you are using in order to build this data warehouse. Is it in the cloud? Are you using Azure
Databricks or maybe Snowflake? So we're going to go and add for our project the icon of SQL Server, since we are building
this data warehouse completely in SQL Server. So for now I'm really happy about it. As you can see, we have now a
plan, right? All right guys, so with that we have designed the data architecture using draw.io, and with that we have
done the last step in this epic and now with that we have a design for the data architecture and we can say we have
closed this epic. Now let's go to the next one. We will start doing the first step to prepare our project. And the
first task here is to create a detailed project plan. All right, my friends. So now it's
clear for us that we have three layers and we have to go and build them. So that means our big epics are going to be
named after the layers. So here I have added three more epics. So we have build bronze layer, build silver layer and
build gold layer. And after that I went and started defining all the different tasks that we have to follow in the project.
So at the start we will be analyzing, then coding, and after that we're going to go and do testing, and once everything
is ready we're going to go and document stuff, and at the end we have to commit our work to the Git repo. All those
epics follow the same pattern in the tasks. So as you can see, now we have a very detailed project
structure, and now things are clearer for us how we're going to build the data warehouse. So with that we are
done from this task and now the next task we have to go and define the naming convention of the
projects. All right. So now at this phase of the projects we usually define the naming conventions. So what is that?
It is a set of rules that you define for naming everything in the project, whether it is a database, schema,
tables, stored procedures, folders, anything. And if you don't do that at the early phase of the project, I
promise you chaos can happen. Because what's going to happen? You will have different developers in your project,
and each of those developers has their own style, of course. So one developer might name a table dimension_customers,
where everything is lowercase with an underscore between the words, and you have another developer creating another table called
DimensionProducts, using the camel case. So there is no separation between the words and the first character of each word is
capitalized. And maybe another one is using some prefixes, like dim_categories. So we have here like a
shortcut of the dimension. So as you can see there are different designs and styles and if you leave the door open
what can happen in the middle of the project you will notice okay everything looks inconsistent and you can define a
big task to go and rename everything following a specific rule. So instead of wasting all this time at this phase you
go and define the naming conventions and let's go and do that. So we usually start with a very important decision and
that is which naming convention we're going to follow in the whole project. So you have different cases, like the camel
case, the Pascal case, the kebab case, and the snake case. And for this project, we're going to go with the
snake case, where all the letters of a word are going to be lowercase and the separation between words is going to be an
underscore. For example, a table name called customer_info: customer is lowercase, info is as well lowercase,
and between them an underscore. So this is always the first thing that you have to decide for your data projects. The
second thing is to decide the language. So for example, I work in Germany and there is always a decision that we
have to make, whether we use German or English. So we have to decide for our project which language we're going to
use. And a very important general rule is to avoid reserved words. So don't use a SQL reserved word as an object
name, like for example table. Don't name a table "table". So those are the general principles, the
general rules that you have to follow in the whole project. This applies to everything: tables, columns, stored
procedures, any names that you are giving in your scripts. Now moving on, we have specifications for the table
names. And here we have a different set of rules for each layer. So here the rule says: source system, underscore, entity. So
we are saying all the tables in the bronze layer should start first with the source system name, like for example crm
or erp, and after that we have an underscore, and then at the end we have the entity name or the table name. So
for example, we have this table name crm_customer_info. So that means this table comes from the source system CRM, and then we have the
table name, the entity name, customer_info. So this is the rule that we're going to follow in naming all tables in
the bronze layer. Then moving on to the silver layer, it is exactly like the bronze because we are not going to
rename anything. We are not going to build any new data model. So the naming is going to be one-to-one like the
bronze. So it is exactly the same rules as the bronze. But if we go to the gold here, since we are building new data
model, we have to go and rename things. And since as well we are integrating multiple sources together, we will not
be using the source system name in the tables because inside one table you could have multiple sources. So the rule
says all the names must be meaningful, business-aligned names for the tables, starting with the category prefix. So
here the rule says it starts with category, then underscore, and then entity. Now what is category? We have in
the gold layer different types of tables. So we could build a table called a fact table. Another one could be a
dimension. A third type could be an aggregation or a report. So we have different types of tables, and we can
specify those types as a prefix at the start. So for example, we are saying here fact_sales. So the category is fact
and the table name is sales. And here I just made like a table with the different types of patterns. So we could
have a dimension. So we say it starts with dim_, for example dim_customers or dim_products. And then we
have another type called fact table. So it starts with fact_. Or an aggregated table, where we have the first
three characters, agg_, like aggregating the customers or the sales monthly. So as you can see, as you are creating a naming
convention you have first to make it clear what is the rule describe each part of the rule and start giving
examples. So with that we make it clear for the whole team which names they should follow. So we talked here about
the table naming convention. Then you can as well go and make naming convention for the columns. Like for
example, in the gold layer we're going to go and have surrogate keys. So we can define it like this: the surrogate key
should start with the table name and then _key. Like for example, we can call it customer_key. It is the
surrogate key in the dimension customers. The same thing for technical columns. As data engineers, we might
add our own columns to the tables that don't come from the source system. And those columns are the technical columns,
or sometimes we call them metadata columns. Now, in order to separate them from the original columns that come
from the source system, we can have a prefix for that. Like for example, the rule says if you are building any
technical or metadata columns, the column should start with dwh_ and then the column name. For example,
if you want the metadata load date, we can have dwh_load_date. So with that, if anyone
sees that a column starts with dwh_, we understand this data comes from a data engineer. And we can keep adding rules,
like for example for the stored procedures over here. If you are making an ETL script, then it should start with the
prefix load_ and then the layer. For example, the stored procedure that is responsible for loading the
bronze is going to be called load_bronze, and for the silver, load_silver. So those are currently the rules
for the stored procedures. So this is how I do it usually in my projects. All right my friends. So with that we have a
solid naming convention for our project. So this is done, and now the next step is that we're going to go to
Git and you will create a brand new repository and we're going to prepare its structure. So let's
go. All right. So now we come to another important step in any project, and that's creating the Git repository.
So if you are new to Git, don't worry about it. It is simpler than it sounds. So it's all about having a safe place
where you can put the code that you are developing, and you will have the possibility to track everything that happens
to the code, and as well you can use it in order to collaborate with your team, and if something goes wrong you can
always roll back. And the best part here: once you are done with the project, you can share your repository as a part of
your portfolio, and it is a really amazing thing if you are applying for a job to showcase your skills, that you have
built a data warehouse, by using a well-documented Git repository. So now let's go and create the repository of the
project. Now we are at the overview of our account. So the first thing that we have to do is to go to the repositories
over here and then we're going to go to this green button and click on new. The first thing that we have to do is to
give the repository name. So let's call it SQL data warehouse project and then here we can go and give it a
description. So for example, I'm saying building a modern data warehouse with SQL Server. Now the next option is whether
you want to make it public or private. I'm going to leave it as public, and then let's go and add here a README
file. And then here about the license, we can go over here and select the MIT. The MIT license gives everyone the freedom of
using and modifying your code. Okay. So I think I'm happy with the setup. Let's go and create the repository. And with
that we have our brand new repository. Now the next step that I usually do is to create the structure of the
repository. And usually I always follow the same pattern in any project. So here we need a few folders in order to put
our files, right? So what I usually do, I go over here to Add file, Create new file, and I start creating the structure
over here. So the first thing is that we need datasets, then a slash, and with that the repository is going to understand this
is a folder, not a file. And then you can go and add anything, like here placeholder, just an empty file. This is just
going to help me to create the folders. So let's go and commit, so commit the changes. And now if you go back to the
main project, you can see now we have a folder called datasets. So I'm going to go and keep creating stuff. So I will go
and create the documents placeholder, commit the changes, and then I'm going to go and create the scripts
placeholder, and the final one that I usually add is the tests, something like
this. So as you can see, now we have the main folders of our repository. Now what I usually do next is that I'm
going to go and edit the main README. So you can see it over here as well. So what we're going to do, we're going to
go inside the readme and then we're going to go to the edit button here and we're going to start writing the main
information about our project. This really depends on your style. So you can go and add whatever you want. This is
the main page of your repository. And now as you can see, the file extension here is .md. It stands for Markdown. It is just
an easy and friendly format for writing text. So if you have documentation, or you are writing a text,
it is a really nice format in order to organize it and structure it, and it is very friendly. So what I'm going to do at the
start, I'm going to give a short description of the project. So we have the main title and then we have
like a welcome message and what this repository is about. And in the next section maybe we can start with the
project requirements and then maybe at the end you can say a few words about the licensing and few words about you.
So as you can see it's like the homepage of the project and the repository. So once you are done we're going to go and
commit the changes. And now if you go to the main page of the repository, you can always see the folders and files at the
start, and then below them we're going to see the information from the README. So again, here we have the welcome statement
and then the project requirements, and at the end we have the licensing and about me. So my friends, that's it. We
have now a repository and we have now the main structure of the project, and throughout the project, as we are building
the data warehouse, we're going to go and commit all our work to this repository. Nice, right? All right. So with that we
have now our repository ready, and as we go in the project we will be adding stuff to it. So this step is done, and
now the last step finally we're going to go to the SQL server and we're going to write our first script where we're going
to create a database and schemas. All right. Now the first step is we have to go and create a brand new database.
So now in order to do that, first we have to switch to the master database. So you can do it like this: USE master and a
semicolon. And if you go and execute it, now we are switched to the master database. It is a system database in SQL
Server where you can go and create other databases. And you can see here from the toolbar that we are now logged into the
master database. Now the next step, we have to go and create our new database. So we're going to say CREATE DATABASE,
and you can call it whatever you want. So I'm going to go with DataWarehouse and a semicolon. Let's go and execute it. And
with that we have created our database. Let's go and check it from the Object Explorer. Let's go and refresh. And you
can see our new DataWarehouse. This is our new database. Awesome, right? Now to the next step: we're going to go and
switch to the new database. So we're going to say USE DataWarehouse and a semicolon. So let's go and
switch to it. And you can see now we are logged into the DataWarehouse database. And now we can go and start building
stuff inside this data warehouse. So now the first step that I usually do is I go and start creating the schemas. So what
is a schema? Think about it like a folder or a container that helps you to keep things organized. So now, as we
decided in the architecture, we have three layers: bronze, silver, gold. And now we're going to go and create for
each layer a schema. So let's go and do that. We're going to start with the first one: CREATE SCHEMA, and the first
one is bronze. So let's do it like this, and a semicolon. Let's go and create the first schema. Nice. So we have a new
schema. Let's go to our database. And then in order to check the schemas, we go to Security and then to
Schemas over here. And as you can see, we have the bronze. And if you don't find it, you have to go and refresh the
whole Schemas folder, and then you will find the new schema. Great. So now we have the first schema. Now what we're going
to do, we're going to go and create the other two. So I'm just going to go and duplicate it. So the next one is going to
be the silver and the third one is going to be the gold. So let's go and execute those two together. We will get an error,
and that's because CREATE SCHEMA must be the first statement in a batch and we don't have a GO in between. So after each command, let's have a GO. And now if I highlight
the silver and gold and then execute, it will work. The GO in the SQL Server tools is a batch separator. So it tells the tool to first
execute the first batch completely before going to the next one. So it is just a separator. Now let's go to our schemas,
refresh and now we can see as well we have the gold and the silver. So with that we have now a database. We have the
three layers and we can start developing each layer individually. Okay. So now let's go and
commit our work to Git. So now, since it is a script and code, we're going to go to the folder scripts over here and
then we're going to go and add a new file. Let's call it init_database.sql, and now we're going to go
and paste our code over here. So now I have done a few modifications. Like for example, before we create the database we
have to check whether the database exists. This is an important step if you are recreating the database. Otherwise, if
you don't do that, you will get an error where it's going to say the database already exists. So first it is checking
whether the database exists, then it drops it. I have added a few comments, like here we are saying creating the data
warehouse, creating the schemas. And now we have a very important step. We have to go and add a header comment at the
start of each script. To be honest, 3 months from now you will not remember all the details of this
script. And adding a comment like this is like a sticky note for you later, once you visit this script again. And it
is as well very important for the other developers in the team, because if you or anyone in the team opens the file, the first question is
going to be: what is the purpose of this script, why are we doing this stuff? So as you can see here, we have a comment
saying this script creates a new data warehouse after checking if it already exists. If the database exists, it's
going to drop it and recreate it. And additionally, it's going to go and create three schemas, bronze, silver,
gold. So it gives clarity about what this script is about, and it makes everyone's life easier. Now, the second reason why
this is very important to add is that you can add warnings, and especially for this script it is very important to add
these notes, because if you run this script, what's going to happen? It's going to go and destroy the whole
database. Imagine someone opens this script and runs it. Imagine an admin opens this script and runs it on your database.
Everything is going to be destroyed and all the data will be lost, and this can be a disaster if you don't have any backup.
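To make this concrete, here is a minimal sketch of what such an init_database.sql could look like. It is not the exact script shown on screen: the database name DataWarehouse, the sys.databases existence check and the SINGLE_USER step are assumptions of this sketch, chosen as one common way to make the script re-runnable in SQL Server.

```sql
/*
=============================================================
Create Database and Schemas
=============================================================
Purpose:
    Creates a new database named 'DataWarehouse' after checking whether
    it already exists. If it exists, it is dropped and recreated.
    Then creates three schemas: bronze, silver, gold.

WARNING:
    Running this script drops the entire 'DataWarehouse' database.
    All data in it is permanently deleted. Make sure you have a backup.
*/

USE master;
GO

-- Drop the database if it already exists.
-- (Database name and the sys.databases check are assumptions of this sketch.)
IF EXISTS (SELECT 1 FROM sys.databases WHERE name = 'DataWarehouse')
BEGIN
    -- Disconnect other sessions so the drop is not blocked.
    ALTER DATABASE DataWarehouse SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE DataWarehouse;
END;
GO

-- Create the data warehouse database and switch to it.
CREATE DATABASE DataWarehouse;
GO

USE DataWarehouse;
GO

-- One schema per layer. CREATE SCHEMA must be the first statement
-- in its batch, hence the GO separator after each one.
CREATE SCHEMA bronze;
GO
CREATE SCHEMA silver;
GO
CREATE SCHEMA gold;
GO
```

The warning sits in the header on purpose: it is the first thing anyone sees when opening the file, before they reach the DROP DATABASE.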
So with that we have nice header comments and we have added few comments in our code and now we are ready to
commit our code. So let's go and commit it. And now we have our script in the git as well. And of course if you are
doing any modifications make sure to update the changes in the git. Okay my friends. So with that we have an empty
database and schemas and we are done with this task and as well we are done with the whole epic. So we have
completed the project initialization and now we're going to go to the interesting stuff. We will go and build the bronze
layer. So now the first task is to analyze the source systems. So let's go. All right. So now the big question
is how to build the bronze layer. So first things first, we do the analysis. As you are developing anything, you don't
immediately start writing code. So before we start coding the bronze layer, what we usually do is we have to
understand the source system. So what I usually do, I hold an interview with the source system experts and ask them many
many questions in order to understand the nature of the source system that I'm connecting to the data warehouse. And
once you know the source systems, then we can start coding. And the main focus here is to do the data ingestion. So
that means we have to find a way on how to load the data from the source into the data warehouse. So it's like we are
building a bridge between the source and our target system the data warehouse. And once we have the code ready, the
next step is we have to do data validation. So here comes the quality control. It is very important in the
bronze layer to check the data completeness. So that means we have to compare the number of records between
the source system and the bronze layer just to make sure we are not losing any data in between. And another check that
we will be doing is the schema check, and that's to make sure that the data is placed in the right place. And
finally, we must not forget about documentation and committing our work to Git. So this is the process that we're
going to follow to build the bronze layer. All right my friends. So now, before connecting any source system to
our data warehouse, we have to take a very important step, which is to understand the sources. So how I usually do it, I set
up a meeting with the source systems expert in order to interview them to ask them a lot of stuff about the source.
And gaining this knowledge is very important because asking the right question will help you to design the
correct scripts in order to extract the data and to avoid a lot of mistakes and challenges. And now I'm going to show
you the most common questions that I usually ask before connecting anything. Okay. So we start first by understanding
the business context and the ownership. So I would like to understand the story behind the data. I would like to
understand who is responsible for the data, which IT departments and so on. And then it's nice to understand as well
what business process it supports. Does it support the customer transactions, the supply chain, logistics or maybe
finance reporting. So with that you can understand the importance of your data. And then I ask about the system and data
documentation. So having documentation from the source is your learning material about your data. And it's
going to save you a lot of time later, when you are working and designing maybe new data models. And as well I would
always like to understand the data model of the source system. And if they have descriptions of the columns and the
tables, it's going to be nice to have the data catalog. This can help me a lot in the data warehouse with how I'm going
to go and join the tables together. So with that you get a solid foundation about the business context, the
processes and the ownership of the data. And now in the next step we're going to start talking about the technicality. So
I would like to understand the architecture and as well the technology stack. So the first question that I
usually ask is how the source system is storing the data. Do we have the data on-prem, like in SQL Server or Oracle,
or is it in the cloud, like Azure, AWS and so on? And then once we understand that, we can discuss what are the
integration capabilities, like how I'm going to go and get the data. Does the source system offer APIs, maybe Kafka, or
do they have only file extractions, or are they going to give you a direct connection to the database? So once you
understand the technology that you're going to use in order to extract the data then we're going to deep dive into
more technical questions and here we're going to understand how to extract the data from the source system and then
load it into the data warehouse. So the first thing that we have to discuss with the experts: can we do an
incremental load or a full load? And then after that we're going to discuss the data scope and the historization. Do we
need all the data? Do we need only maybe 10 years of the data? Is there history already in the source system, or should
we build it in the data warehouse, and so on. And then we're going to go and discuss what is the expected size of the
extracts. Are we talking here about megabytes, gigabytes, terabytes? And this is very important to understand whether
we have the right tools and platform to connect that source system and then I try to understand whether there are any
data volume limitations like if you have some old source systems they might struggle a lot with performance and so
on. So if you have like an ETL that is extracting large amount of data you might bring the performance down of the
source system. So that's why you have to try to understand whether there are any limitations for your extracts and as
well other aspects that might impact the performance of the source system. This is very important. If they give you an
access to the database, you have to be responsible that you are not bringing the performance of the database down.
And of course, a very important question is to ask about the authentication and the authorization, like how you are going to
go and access the data in the source system. Do you need any tokens, keys, passwords and so on? So those are the
questions that you have to ask if you are connecting a new source system to the data warehouse. And once you have
the answers for those questions, you can proceed with the next steps to connect the sources to the data warehouse. All
right, my friends. So with that, you have learned how to analyze a new source system that you want to connect to your
data warehouse. So this step is done, and now we're going to go back to coding, where we're going to write scripts in
order to do the data ingestion from the CSV files to the bronze layer. And let's have a quick look again
to our bronze layer specifications. So we just have to load the data from the sources to the data warehouse. We're
going to build tables in the bronze layer. We are doing a full load. So that means we are truncating and then
inserting the data. There will be no data transformations at all in the bronze layer. And as well we will not be
creating any data model. So these are the specifications of the bronze layer. All right. Now, in order to create the
DDL script for the bronze layer, creating the tables of the bronze, we have to understand the metadata, the structure,
the schema of the incoming data. And here, either you ask the technical experts from the source system for this
information, or you can go and explore the incoming data and try to define the structure of your tables. So now what
we're going to do we're going to start with the first source system the CRM. So let's go inside it and we're going to
start with the first table the customer info. Now if you open the file and check the data inside it, you see we have a
header information and that is very good because now we have the names of the columns that are coming from the source
and from the content you can define, of course, the data types. So let's go and do that. First we're going to say CREATE
TABLE, and then we have to define the layer. It's going to be the bronze. And now, very important, we have to follow the
naming convention. So we start with the name of the source system. It is crm and an underscore, and then after that the table
name from the source system. So it's going to be cust_info. So this is the name of our first table in
the bronze layer. Then the next step, we have to go and define, of course, the columns. And here again the column names
in the bronze layer are going to be one-to-one, exactly like the source system. So the first one is going to be the ID, and I
will go with the data type INT. Then the next one is going to be the key, an NVARCHAR, and the length I will go with is 50.
And the last one is going to be the create date. It's going to be a DATE. So with
that we have covered all the columns available from the source system. So let's go and check. And yes the last one
is the create date. So that's it for the first table. Now a semicolon of course at the end. Let's go and execute it. And
now we're going to go to the object explorer over here. Refresh. And we can see the first table inside our data
warehouse. Amazing, right? So now, next, what you have to do is to go and create a DDL statement for each file of those
two systems. So for the CRM we need three DDLs, and for the other system, the ERP, we have as well to create
three DDLs for the three files. So at the end we're going to have in the bronze layer six tables, six DDLs. So now
pause the video and go create those DDLs. I will be doing the same as well, and I will see you soon.
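For reference, a sketch of the first bronze table following the naming rule (source system, underscore, entity). Only the columns named in the walkthrough are listed. The column names cst_id, cst_key and cst_create_date are hypothetical, and the remaining columns should be taken one-to-one from the CSV header. The OBJECT_ID guard at the top is optional and only makes the script safe to re-run.

```sql
-- Optional guard: drop the table if it already exists ('U' = user-defined table),
-- so the script can be re-run after changing a column name or data type.
IF OBJECT_ID('bronze.crm_cust_info', 'U') IS NOT NULL
    DROP TABLE bronze.crm_cust_info;
GO

-- Bronze table: the schema is the layer, the name is sourcesystem_entity.
-- Column names below are illustrative; in bronze they mirror the CSV header as-is.
CREATE TABLE bronze.crm_cust_info (
    cst_id          INT,           -- the ID column
    cst_key         NVARCHAR(50),  -- the key column
    -- remaining columns from the file header go here
    cst_create_date DATE           -- the create date column
);
GO
```

The same shape repeats for the other five files: only the table name and the column list change.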
All right. So now I hope you have created all those DDLs. I'm going to
show you what I have just created. So as the second table in the source CRM we have the product information, and the
third one is the sales details. Then we go to the second system, and here we make sure that we are following the naming
convention. So first the source system, erp, and then the table name. So the second system was really easy. You can
see we have here only two columns, and for the customers only three, and for the categories only four
columns. All right. So after defining all of that, of course we have to go and execute them. So let's go and do that.
And then we go to the object explorer over here. Refresh the tables. And with that you can see we have six empty
tables in the bronze layer. And with that we have all the tables from the two source systems inside our database. But
still we don't have any data. And you can see our naming convention is really nice. You see the first three tables
come from the CRM source system and then the other three come from the ERP. So we can see in the bronze layer
things are really split nicely, and you can identify quickly which table belongs to which source system. Now there
is something else that I usually add to the DDL script, which is to check whether the table exists before creating it. So for
example, let's say that you are renaming or you would like to change the data type of a specific field. If you just go
and run this query, you will get an error because the database is going to say we already have this table. So in other
databases you can say CREATE OR REPLACE TABLE. But in SQL Server you have to go and build the logic in T-SQL. So it is very
simple. First we have to go and check whether the object exists in the database. So we say IF OBJECT_ID, and
then we have to go and specify the table name. So let's go and copy the whole thing over here, and make sure you get
exactly the same name as the table name. So there you see a space. I'm just going to go and remove it. And then
we're going to go and define the object type. So it's going to be 'U'. It stands for user. It is for user-defined
tables. So if OBJECT_ID for this table is not null, that means the database did find this object. So what's going
to happen? We say go and drop the table. So the whole thing again, and a semicolon. So again, if the table exists in the
database, so the result is not null, then go and drop the table, and after that go and create it. So now if you go and highlight the
whole thing and then execute it, it will work. So first drop the table if it exists, then go and create the table
from scratch. Now what you have to do is to go and add this check before creating any table inside our database. So it's
going to be the same thing for the next table and so on. I went and added all those checks for each table, and what will
happen if I go and execute the whole thing? It's going to work. So with that I'm recreating all the tables in the bronze
layer from the scratch. Now the methods that we're going to use in order to load the data
from the source to the data warehouse is the bulk inserts. Pulk insert is a method of loading massive amount of data
very quickly from files, like CSV files or maybe a text file, directly into a database. It is not like the classical
normal insert, which inserts the data row by row. Instead, the bulk insert is one operation
that loads all the data in one go into the database, and that's what makes it very fast. So let's go and use
this method. Okay, so now let's start writing the script in order to load the first table from the source CRM.
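Before moving on to the load itself, the existence check described above for the DDL script can be sketched roughly as follows. This is only a sketch and not the exact script from the video: the table name bronze.crm_cust_info is inferred from the spoken name, and the column names and data types are illustrative assumptions.

```sql
-- Recreate a bronze table from scratch (T-SQL, SQL Server).
-- Assumption: the bronze schema already exists.
-- Assumption: the column names and types below are illustrative only.

-- OBJECT_ID returns NULL when no object with this name and type exists.
-- 'U' limits the lookup to user-defined tables.
IF OBJECT_ID('bronze.crm_cust_info', 'U') IS NOT NULL
    DROP TABLE bronze.crm_cust_info;
GO

-- GO above is a batch separator understood by SSMS and sqlcmd,
-- not a T-SQL statement.
CREATE TABLE bronze.crm_cust_info (
    cst_id             INT,
    cst_key            NVARCHAR(50),
    cst_firstname      NVARCHAR(50),
    cst_lastname       NVARCHAR(50),
    cst_marital_status NVARCHAR(50),
    cst_gndr           NVARCHAR(50),
    cst_create_date    DATE
);
GO
```

On SQL Server 2016 and later, DROP TABLE IF EXISTS bronze.crm_cust_info; is a shorter way to express the same check.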
So we're going to load the customer info table from the CSV file into the database table. The syntax is very
simple. We start by saying BULK INSERT, so SQL understands we are not doing a normal insert, we are
doing a bulk insert. Then we have to specify the table name, so it is bronze.crm_cust_info. Now we have
to specify the full location of the file that we are trying to load into this table. So what we have to do is
get the path where the file is stored. I'm going to copy the whole path and add it to the bulk
insert, exactly matching where the data lives. For me it is on the C drive, in the SQL data warehouse project, under the data sets, in the source
CRM folder. And then I have to specify the file name, so it's going to be cust_info.csv. You have to get it exactly
like the path of your files, otherwise it will not work. After the path we come to the WITH clause. Now we
have to tell SQL Server how to handle our file. So here come the specifications. There is a lot of stuff
that we can define. Let's start with a very important one, the header row. If you check the content of
our files, you can see the first row always contains the header information of the file. That information is
actually not data. It's just the column names. The actual data starts from the second row and we have to tell
the database about this. So we're going to say FIRSTROW = 2: the first row of data is actually the second row of the file. With that we are
telling SQL to skip the first row in the file. We don't need to load it because we have already
defined the structure of our table. So this is the first specification. The next one, which is also very important
when loading any CSV file, is the separator between fields, the delimiter. It really depends on the
file structure that you are getting from the source. As you can see, all those values are split with a comma, and we
call this comma the field separator or delimiter. I have seen a lot of different CSVs: sometimes they use a semicolon
or a pipe or a special character like a hash and so on. So you have to understand how the values are split.
In this file they are split by a comma and we have to tell SQL about it. It's very important. So we're
going to say FIELDTERMINATOR and set it to the comma. Basically those two pieces of information are
very important for SQL in order to be able to read your CSV file. Now there are many different options that you
can add. For example, TABLOCK. It is an option to improve performance by locking the
entire table while loading it. So as SQL is loading the data into this table, it is going to lock the whole table.
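Put together, the statement described above might look like the following sketch. The file path is a placeholder and not the path used in the video, and the table name is inferred from the spoken name.

```sql
-- Load one CSV file into a bronze table in a single bulk operation.
-- Assumption: the path is a placeholder; replace it with the real location
-- of the file as seen from the machine running SQL Server.
BULK INSERT bronze.crm_cust_info
FROM 'C:\path\to\source_crm\cust_info.csv'
WITH (
    FIRSTROW = 2,            -- data starts on line 2; line 1 holds the column names
    FIELDTERMINATOR = ',',   -- values in this file are separated by commas
    TABLOCK                  -- take a table-level lock for the duration of the load
);
```

If the file used a semicolon or a pipe instead, only the FIELDTERMINATOR value would need to change.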
So that's it for now. I'm just going to add the semicolon, and let's insert the data from the file into
our bronze table. Let's execute it. And now we can see SQL did insert around 18,000 rows into our table. So it is
working. We just loaded the file into our database. But it is not enough to just write this script. You have to
test the quality of your bronze table, especially if you are working with files. So let's just do a simple
SELECT from our new table and run it. Now the first thing that I check is: do we have data
in each column? Well yes, as you can see we have data. The second thing is: do we have the data in the correct
column? This is very critical when you are loading data from a file into a database.
So for example, here we have the first name, which of course makes sense, and here we have the last
name. But what could happen, and this mistake happens a lot, is that you find the first name inside the
key, the last name inside the first name, and the status inside the last name. So the data is
shifted. This data engineering mistake is very common if you are working with CSV files, and there
are different reasons why it happens. Maybe the definition of your table is wrong, or the field separator is
wrong: maybe it's not a comma but something else. Or the separator is a bad choice, because sometimes in the
keys or in the first name there is a comma, and SQL is not able to split the data correctly. In that case the quality of
the CSV file is not really good. So there are many reasons why you might not get the data in the correct
column. But for now everything looks fine for us. The next step is that I count the rows inside this
table. So let's select that. We can see we have 18,493. And now what we can do is
go to our CSV file and check how many rows we have inside this file. As you can see, we have
18,494. We are almost there. There is one extra row inside the file, and that's because of the header. The
header row is not loaded into our table, and that's why our tables are always going to have one row fewer
than the original files. So everything looks fine and we have done this step correctly. Now if I run it again,
what's going to happen? We will get duplicates inside the bronze layer. So now we have loaded the file twice
into the same table, which is not correct. The method that we have discussed is first to make the table
empty and then load: truncate and then insert. In order to do that, before the bulk insert
we say TRUNCATE TABLE, then our table name, and that's it, with a semicolon. So
now what we are doing is first making the table empty and then loading from scratch. We are loading
the whole content of the file into the table, and this is what we call a full load. So now let's select
everything together and execute. And again, if you check the content of the table, you can see we have only
18,000 rows. Let's run it again. In the count of the bronze table you can see we still have the 18,000. So each
time you run this script, we are refreshing the customer info table from the file. We
are refreshing the bronze layer table. That means if there are any changes in the file, they will be loaded
into the table. So this is how we do a full load in the bronze layer: by truncating the table and then doing the
insert. And now of course what we have to do is pause the video and write the same script for all six files.
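The full load and the two checks described above could be sketched like this for one table. The path is again a placeholder, and the table name is inferred from the spoken name.

```sql
-- Full load: empty the table first, then reload the whole file.
TRUNCATE TABLE bronze.crm_cust_info;

BULK INSERT bronze.crm_cust_info
FROM 'C:\path\to\source_crm\cust_info.csv'   -- placeholder path (assumption)
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);

-- Check 1: look at a few rows. Is every column populated,
-- and does each value sit in the column where it belongs?
SELECT TOP (10) * FROM bronze.crm_cust_info;

-- Check 2: compare the row count with the file.
-- Expect the number of lines in the file minus one header line.
SELECT COUNT(*) AS row_count FROM bronze.crm_cust_info;
```

Because of the TRUNCATE, running the script several times in a row should leave the row count unchanged as long as the file itself has not changed.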
So let's go and do that. Okay, back. So I hope that you
have written all those scripts as well. I have three sections to load the first source system and then
three sections to load the second source system. As you are writing those scripts, make sure to have the
correct path. For the second source system, you have to change the path to the other folder. And as well,
don't forget that the table name in the bronze layer is different from the file name, because the table name always starts with the
source system name. With the files, we don't have that. So now I think I have everything ready. Let's go and
execute the whole thing. Perfect. Awesome. So everything is working. Let me check the messages. We can see
from the messages how many rows are inserted into each table. And now of course the task is to go through each
table and check the content. So that means we now have a really nice script to load the
bronze layer. We will use this script on a daily basis: every day we have to run it in order to get new content
into the data warehouse. And as we learned before, if you have a SQL script that is frequently used, what we can do is
create a stored procedure from it. So let's go and do that. It's going to be very simple.
We're going to go over here and say CREATE OR ALTER PROCEDURE. And now we have to define the name of the stored
procedure. I'm going to put it in the schema bronze because it belongs to the bronze layer. Then we
follow the naming convention: the stored procedure name starts with load, an underscore, and then the layer, bronze. So
that's it for the name. Then, very important, we have to define the BEGIN and the END of our SQL
statements. So here is the BEGIN, and let's go to the end and say END. Then let's highlight
everything in between and indent it once with Tab, so it is easier to read. Next,
we execute it. So let's create this stored procedure. Now if you want to
check your stored procedure, you go to the database, where we have a folder called Programmability, and
inside it we have Stored Procedures. If you refresh, you will see our new stored procedure. Let's go and test it.
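As a rough sketch, the wrapper described above, together with the call used to test it, could look like this. Only one of the six tables is shown, and the file path is a placeholder.

```sql
-- Wrap the load script in a stored procedure so it can be run with one call.
-- CREATE OR ALTER requires SQL Server 2016 SP1 or later.
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    TRUNCATE TABLE bronze.crm_cust_info;

    BULK INSERT bronze.crm_cust_info
    FROM 'C:\path\to\source_crm\cust_info.csv'   -- placeholder path (assumption)
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);

    -- ... repeat TRUNCATE + BULK INSERT for the remaining five tables ...
END;
GO

-- The GO above ends the batch. Without it, the EXEC below would be
-- compiled as part of the procedure body.
EXEC bronze.load_bronze;
```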
So I'm going to open a new query, and we're going to say EXECUTE
bronze.load_bronze. Let's execute it. And with that, we have just loaded the bronze layer completely.
As you can see, SQL did insert all the data from the files into the bronze layer. It is way easier than running
those scripts each time, of course. All right. Now the next step: as you can see, the output message
does not really contain a lot of information. The messages of your ETL stored procedure are not really clear. That's
why, if you are writing an ETL script, always take care of the messaging of your code. So let me show you a nice
design. Let's go back to our stored procedure. What we can do is divide the messages based on our
code. We can start with the main message. For example, over here let's say PRINT and describe what we are doing with
this stored procedure: we are loading the bronze layer. This is the main message, the most important one, and we can
play with separators like this. So we say PRINT and add some nice separators, for
example equals signs at the start and at the end, just to frame a section. So this is a nice message at the
start. Now, looking at our code, we can see that it is split into two sections. In the first section we are
loading all the tables from the source system CRM, and the second section is loading the tables from the ERP. So we
can split the prints by source system. Let's go and do that. We're going to say PRINT and we're going
to say loading CRM tables. This is for the first section. Then we can add some nice separators around
it. Let's take the minus. And of course, don't forget the semicolons like I did. We're going to have a
semicolon for each PRINT. Same thing over here. I will copy the whole thing because we're going to have it at
the start and at the end as well. Let's copy the whole thing for the second section. For the ERP, it starts over
here. We're going to have it like this, and we're going to call it loading ERP. With that, in the output we can
see a nice separation between loading each source system. Now we go to the next step, where we add a print
for each action. For example, here we are truncating the table. So we say PRINT, and what we can do is
add two arrows and say what we are doing: we are truncating the table. Then we can add the
table name to the message as well. This is the first action that we are doing, and we can add another
print for inserting the data. We can say inserting data into, and then the table name. With that, in the
output we can understand what SQL is doing. So let's repeat this for all the other tables. Okay. So I just added
all those prints, and don't forget the semicolon at the end. I would say let's execute it and check the
output. So let's do that, and then, just to have quick output, execute our stored procedure right at the start like
this. Now if you check the output, you can see things are more organized than before. At the start
we read: okay, we are loading the bronze layer. First we are loading the source system CRM, and then the
second section is for the ERP, and we can see the actions: truncating, inserting, truncating, inserting for each
table, and the same thing for the second source. So as you can see, it is nice and cosmetic, but it's very
important when you are debugging any errors. And speaking of errors, we have to handle the errors in our stored
procedure. So let's go and do that. The first thing that we do: we say BEGIN TRY, then we go to the
end of our script and, before the last END, we say END TRY. The next thing we have to add is the catch. So
we say BEGIN CATCH and END CATCH. Now first let's organize our code. I'm going to take the
whole code and indent it once more, and the BEGIN TRY as well, so it is more organized. As you know, with try and
catch, SQL is going to execute the TRY block, and if there are any errors while executing it, the second section
is going to be executed. So the CATCH will be executed only if SQL failed to run the TRY. Now what we have to do
is define for SQL what to do if there's an error in your code. Here we can do multiple things, like
maybe creating a logging table and adding the messages to it, or we can add some nice messaging to the
output. For example, we can add a section again over here. So again some equals signs, and we
repeat them over here and then add some content in between. We can start with something like error
occurred during loading bronze layer, and then we can add many things. For example, we can add the error
message, and here we call the function ERROR_MESSAGE. We can add as
well, for example, the error number, so ERROR_NUMBER. Of course the output of this is going to be a number, but the
message here is text, so we have to change the data type. We do a CAST as NVARCHAR like this. And
there are many functions that you can add to the output, like for example ERROR_STATE and so on. So you can
design what happens if there is an error in the ETL. Now what else is very important in each ETL process is to add
the duration of each step. For example, I would like to understand how long it takes to load this table over
here. But looking at the output, I don't have any information about how long it takes to load my tables. This is very
important because, as you are building a big data warehouse, the ETL process is going to take a long time, and you
would like to understand where the issue is, where the bottleneck is, which table is consuming a lot of time to be
loaded. That's why we have to add this information to the output as well, or maybe even log it in a table.
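The print messages and the error handling described above can be sketched as one plain batch. The message wording and separator lengths are illustrative, the path is a placeholder, and only one table is shown.

```sql
-- Messaging and error handling around one load step.
BEGIN TRY
    PRINT '================================================';
    PRINT 'Loading Bronze Layer';
    PRINT '================================================';

    PRINT '------------------------------------------------';
    PRINT 'Loading CRM Tables';
    PRINT '------------------------------------------------';

    PRINT '>> Truncating Table: bronze.crm_cust_info';
    TRUNCATE TABLE bronze.crm_cust_info;

    PRINT '>> Inserting Data Into: bronze.crm_cust_info';
    BULK INSERT bronze.crm_cust_info
    FROM 'C:\path\to\source_crm\cust_info.csv'   -- placeholder path (assumption)
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);
END TRY
BEGIN CATCH
    -- Runs when a statement in the TRY block raises a catchable run-time error.
    PRINT '================================================';
    PRINT 'ERROR OCCURRED DURING LOADING BRONZE LAYER';
    PRINT 'Error Message: ' + ERROR_MESSAGE();
    -- ERROR_NUMBER and ERROR_STATE return INT, so cast before concatenating.
    PRINT 'Error Number: ' + CAST(ERROR_NUMBER() AS NVARCHAR(20));
    PRINT 'Error State: ' + CAST(ERROR_STATE() AS NVARCHAR(20));
    PRINT '================================================';
END CATCH;
```

A quick way to see the CATCH branch in action is to point the FROM clause at a file that does not exist.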
So let's add this step as well. We go to the start. In order to calculate the duration
you need the start time and the end time. So we have to know when we started loading and when we finished loading
the table. The first thing is to declare the variables. We say DECLARE and
make one called start time, and its data type is going to be DATETIME. I need exactly the second when it
started. Then another one for the end time: another variable, end time, with the same type, DATETIME.
With that we have declared the variables, and the next step is to use them. So let's go to the first table,
the customer info, and at the start we say SET start time equal to GETDATE. So we will get
the exact time when we start loading this table. Then let's copy the whole thing and go to the end of
the load over here. This time we say SET the end time equal to GETDATE as well. With that, we
have the values of when we started loading this table and when we completed loading it. The next step is
to print the duration. Over here we can say PRINT and use
the same design again: two arrows, then very simply load duration, a colon and a space. Now
what we have to do is calculate the duration, and we can do that using the date and time function DATEDIFF,
which finds the interval between two dates. So we say plus over here and then use DATEDIFF. Here we
have to define three arguments. The first one is the unit: you can use second, minute, hour and so on.
We're going to go with second. Then we define the start of the interval, which is going to be the start
time. The last argument is the end of the interval, which is going to be the end time. And
of course the output of this is going to be a number, so we have to cast it. We say CAST as
NVARCHAR and then close it like this, and maybe at the end we say
plus, a space and seconds, in order to have a nice message. So again, what we have done: we have declared the two variables and
we are using them. At the start we get the current date and time, and at the end of loading the table we
get the current date and time again. Then we find the difference between them in order to get the load
duration, and in this case we just print this information. Now of course we can add a nice separator
between the tables, so I'm going to do it like this, just a few minuses, nothing more. Now what we have to do
is add this mechanism for each table in order to measure the speed of the ETL for each one of
them. Okay. So now I have added all of that for each table, and
let's run the whole thing. So let's first update the stored procedure definition, and then we're going to run it.
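The timing mechanism described above can be sketched in isolation. So that the snippet runs without any tables or files, the actual load is replaced here by a WAITFOR DELAY stand-in; the variable names are assumptions based on the spoken names.

```sql
-- Measure how long one load step takes.
DECLARE @start_time DATETIME, @end_time DATETIME;

SET @start_time = GETDATE();

-- Stand-in for the real work (TRUNCATE TABLE + BULK INSERT of one table).
WAITFOR DELAY '00:00:02';

SET @end_time = GETDATE();

-- DATEDIFF returns an INT, so cast it before concatenating with text.
PRINT '>> Load Duration: '
    + CAST(DATEDIFF(second, @start_time, @end_time) AS NVARCHAR(20))
    + ' seconds';
PRINT '>> -------------';
```

DATEDIFF with the second unit counts the second boundaries crossed between the two values, so a load that finishes in well under a second will usually print 0. Switching the unit to millisecond is one option when finer resolution is needed.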
So let's execute. Now as you can see, we have one more piece of information about the load durations, and everywhere
I see zero seconds. That's because loading this data is super fast: we are doing
everything locally on one PC, so loading the data from files into the database is going to be mega fast. But of course in real projects
you have different servers and networking between them, and you have millions of rows in the tables. Then
the duration is not going to be 0 seconds; things are going to be slower. Now you can easily see how long it takes
to load each of your tables. And of course what is also very interesting is to understand how long it takes to load the
whole bronze layer. So now your task is to print, at the end, information about the whole batch: how long it took
to load the bronze layer. Okay, I hope you are done. I
have done it like this. We have to define two new variables: the start time of the batch and the end time of
the batch. The first step in the stored procedure is to get the date and time for the first
variable. And right at the end, as the last thing that we do in the stored procedure, we get the
date and time for the end time. So we say again SET the batch end time to GETDATE. Then all we
have to do is print a message. We say loading bronze layer is completed, and then we print the total
load duration, using the same DATEDIFF between the batch start time and the end time, and we
calculate the seconds and so on. Now what we have to do is execute the whole thing. So let's
refresh the definition of the stored procedure and then execute it. In the output we go to
the last message, and we can see loading bronze layer is completed and the total load duration is 0 seconds as well,
because the execution time is less than 1 second. With that you are now getting a feeling for how to build an ETL
process. As you can see, data engineering is not only about how to load the data. It's about how to engineer the whole
pipeline: how to measure the speed of loading the data, what happens if there is an error, how to print each
step in your ETL process and make everything organized and clear in the output and maybe in the logging, just to
make debugging and optimizing the performance way easier. And there are a lot of things that we can add. We can
add quality measures and so on. So we can add many things to our ETL script to make our data warehouse professional.
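The batch-level timing described above can be sketched the same way as the per-table timing. The variable names @batch_start_time and @batch_end_time are assumptions, and the two WAITFOR DELAY statements stand in for the CRM and ERP load sections.

```sql
-- Measure the duration of the whole batch, from first to last statement.
DECLARE @batch_start_time DATETIME, @batch_end_time DATETIME;

-- First statement of the batch.
SET @batch_start_time = GETDATE();

-- Stand-ins for the per-table load sections.
WAITFOR DELAY '00:00:01';   -- e.g. the CRM tables
WAITFOR DELAY '00:00:01';   -- e.g. the ERP tables

-- Last statement of the batch.
SET @batch_end_time = GETDATE();

PRINT '==========================================';
PRINT 'Loading Bronze Layer is Completed';
PRINT '   - Total Load Duration: '
    + CAST(DATEDIFF(second, @batch_start_time, @batch_end_time) AS NVARCHAR(20))
    + ' seconds';
PRINT '==========================================';
```

Inside the stored procedure, the two SET statements would sit just inside the TRY block, around all six table loads.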
All right, my friends. So with that we have developed the code to load the bronze layer and we have tested it
as well. In the next step we go back to Draw.io because we want to draw a diagram of the data flow.
So let's go. Now what is a data flow diagram? We're going to draw a simple visual in
order to map the flow of your data: where it comes from and where it ends up. We just want to make clear how the data
flows through the different layers of your project. That helps us to create something called the data lineage, and
this is really nice, especially if you are analyzing an issue. If you have multiple layers and you don't have
a real data lineage or flow, it's going to be really hard to analyze the scripts in order to understand the origin of the
data. Having this diagram is going to improve the process of finding issues. So now let's create one. Okay. So
now back to Draw.io, and we're going to build the flow diagram. We start first with the source
systems. So, let's build the layer. I'm going to remove the fill and make the border dotted. Then we add
a box saying Sources and put it over here. Increase the size to 24, and again without any lines. Now, what
do we have inside the sources? We have folders and files. So, let's search for a folder icon. I'm going to
take this one over here and label it CRM. We can increase the size as well. And we have another
source, the ERP. Okay. So, this is the first layer. Now let's do the bronze layer.
So, we grab another box and set the coloring like this. And instead of auto,
maybe take the hatch fill, something like this, whatever you like. And rounded. Then we can put the title on
top of it. We label it bronze layer and increase the font size as well. Now
we add boxes for each table that we have in the bronze layer. For
example, we have the sales details. We can make it a little bit smaller, so maybe 16 and not bold. And we have
two other tables from the CRM: the customer info and the product info. So those are the three
tables that come from the CRM. Now we connect the source CRM with
those three tables. We go to the folder and start drawing arrows from the folder
to the bronze layer like this. And now we have to do the same thing for the ERP source. As you can see, the data flow
diagram shows us in one picture the data lineage between the two layers. Here we can easily see that those three tables
actually come from the CRM, and those three tables in the bronze layer come from the ERP. I understand that if
we have a lot of tables it's going to be a huge mess. But if you have a small or medium data warehouse, building
those diagrams is going to make it much easier to understand how everything flows from the sources
into the different layers of your data warehouse. All right. So with that we have the first version of the data flow.
This step is done, and the final step is to commit our code to the Git repo. Okay. So now let's go and commit
our work. Since these are scripts, we go to the scripts folder. Here we're going to have scripts for
the bronze, silver, and gold layers, so it makes sense to create a folder
for each layer. Let's start by creating the bronze folder. I'm
going to create a new file, and I type bronze, a slash, and then
the name of the DDL script for the bronze layer, ending in .sql. Now I'm going to paste the DDL code that we have
created, so those six tables. As usual, at the start we have a comment where we explain the purpose of
this script. We say this script creates tables in the bronze schema, and by running this script you
are redefining the DDL structure of the bronze tables. Let's leave it like
that, and I'm going to commit the
changes. All right. Now as you can see, inside scripts we have a folder called bronze, and inside it we have the
DDL script for the bronze layer. In the bronze folder we're also going to put our stored procedure. So we
create a new file, let's call it proc_load_bronze.sql, and then paste our script. As
usual, I have put an explanation of the stored procedure at the start. We say this stored procedure
loads the data from the CSV files into the bronze schema: it first truncates the
tables and then does a bulk insert. And about the parameters: this stored procedure does not accept any parameters
or return any values. And here is a quick example of how to execute it. All right. I think I'm happy with that, so let's go
and commit it. All right, my friends. With that we have committed our code into Git, and we are done
building the bronze layer. So the whole epic is done. Now we go to the next one. This one is going to be more
advanced than the bronze layer because there will be a lot of struggle with cleaning the data and so on. We're
going to start with the first task, where we analyze and explore the data in the source systems. So let's
go. Okay. So now we start with the big question: how to build the silver layer? What is the process? Okay.
As usual, first things first, we have to analyze. The task, before building anything in the silver layer, is
to explore the data in order to understand the content of our sources. Once we have that,
we start coding, and here the transformation that we're going to do is data cleansing. This is usually a process
that takes a really long time, and I usually do it in three steps. The first step is to check the data quality issues
that we have in the bronze layer. So before writing any data transformations, first we have to understand what the
issues are. Only then do I start writing the data transformations in order to fix all those quality issues that we
have in the bronze layer. And the last step: once I have clean results, we insert
them into the silver layer. Those are the three phases that we will go through as we write the code for the silver
layer. In the third step, once we have all the data in the silver layer, we have to make sure that the data is now
correct and that we don't have any quality issues anymore. If you find any issues, of course,
you go back to coding, do the data cleansing, and validate again. So it is a cycle between
validating and coding. Once the quality of the silver layer is good, we cannot skip the last phase, where we
document and commit our work in Git. Here we're going to have two new pieces of documentation: we're going to build the
data flow diagram and also the data integration diagram, after we have understood the relationships between the sources
in the first step. So this is the process and this is how we're going to build the silver layer.
All right. So now, exploring the data in the bronze layer. Why is it so important? Because understanding the
data is the key to making smart decisions in the silver layer. In the bronze layer the focus was not on
understanding the content of the data at all. We focused only on how to get the data into the data warehouse. That's why we
now have to take a moment to explore and understand the tables, and also how to connect them: what are the
relationships between these tables? And it is very important, as you are learning about a new source system, to create
some kind of documentation. So now let's go and explore the sources. Okay. Let's explore them one by
one. We can start with the first one from the CRM, the customer info. Right-click on it and choose Select Top
1000 Rows. This is of course important if you have a lot of data. Don't explore millions of
rows. Always limit your query. For example, here we are using the top thousand just to make sure that you are
not impacting the system with your queries. Now let's have a look at the content of this table. We can see
that we have customer information here. We have an ID, a key for the customer, first name, last name,
marital status, gender and the creation date of the customer. So simply, this is a table for the customer information with
a lot of details about the customers. Here we have two identifiers: one is a technical ID and the other
is the customer number. So maybe we can use either the ID or the key in order to join it with other tables.
Now what I usually do is draw a data model, or let's say an integration model, just to document and visualize what I
am understanding, because if you don't do that you're going to forget it after a while. So now we go and search for a
shape. Let's search for a table, and I'm going to pick this one over here. Here we can change the style:
for example, we can make it rounded or make it sketch and so on. And we can change the color. I'm
going to make it blue. Then go to the text, make sure to select the whole thing, and let's make it bigger: 26.
Then for those items, I'm just going to select them, go to Arrange, and maybe make it 40.
Something like this. Now we just put in the table name. So this is the one
that we are now learning about. And I'm just going to put the primary key here. I will not
list all the columns. The primary key was the ID, and I will remove all the other stuff. I don't need
it. Now, as you can see, the table name is not really friendly. So I can bring in a text, put it here on top, and
say this is the customer information, just to make it friendly and to not forget about it. I'm also going to
increase the size to maybe 20, something like this. Okay. With that, we have our first table, and we're going to
keep exploring. So let's move to the second one. We take the product information, right-click on it
and select the top thousand rows. I will just put it below the previous query and run it. Now, looking at this table,
we can see we have product information. We have a primary key for the product, then a key or,
let's say, product number, and after that the full name of the product, the product cost, then the
product line, and then a start and an end. Well, it is interesting to understand why we have a start and an end.
Let's have a look, for example, at those three rows. All three have the same key but they have different IDs. So
it is the same product but with different costs. For 2011 we have the cost of 12, then for 2012 we have 14, and for
the last year, 2013, we have 13. So it's like we have a history of the changes. So this table is not only holding
the current information of the product but also historical information of the product, and that's why we have those two
dates, start and end. Now let's go back and draw this information over here. So I'm just going to go and duplicate it.
So the name of this table is going to be the PRD info, and let's go and give it a short description, "current and
history products information", something like this, just to not forget that we have history in this table. And here we
have the PRD ID as well, and there is nothing that we can use in order to join those two tables. We don't have
a customer ID here, and in the other table we don't have any product ID. Okay, so that's it for this table. Let's jump to
the third table, the last one in the CRM. So let's go and select. I just shortened the other queries as well. So
let's go and execute. So what do we have over here? We have a lot of information about the orders, the sales, and a lot of
measures. Order number. We have the product key, so this is something that we can use in order to join it with the
product table. We have the customer ID. We don't have the customer key. So here we have an ID and here we have a key. So
there are two different ways to join the tables. And then we have dates here: the order date, the shipping
date, the due date. And then we have the sales amount, the quantity, and the price. So this is like an event table.
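As a rough sketch of how this event table ties the other two together, the query below joins a few made-up rows: customers on the ID and products on the key. The table and column names (`cst_id`, `prd_key`, `sls_prd_key`, `sls_cust_id`, and so on) are assumptions for illustration, not names taken from the source tables.

```sql
-- Made-up sample rows standing in for the three CRM tables (names are assumptions).
WITH customers AS (
    SELECT * FROM (VALUES
        (1, 'C-001', 'Anna'),
        (2, 'C-002', 'Ben')
    ) AS c(cst_id, cst_key, cst_firstname)
),
products AS (
    SELECT * FROM (VALUES
        (10, 'BK-100', 'Road Bike'),
        (11, 'HL-200', 'Helmet')
    ) AS p(prd_id, prd_key, prd_nm)
),
sales AS (
    SELECT * FROM (VALUES
        ('SO-1', 'BK-100', 1, 1200),
        ('SO-2', 'HL-200', 2, 35)
    ) AS s(sls_ord_num, sls_prd_key, sls_cust_id, sls_sales)
)
SELECT s.sls_ord_num, c.cst_firstname, p.prd_nm, s.sls_sales
FROM sales AS s
JOIN customers AS c ON c.cst_id  = s.sls_cust_id   -- customers connect through the ID
JOIN products  AS p ON p.prd_key = s.sls_prd_key;  -- products connect through the key, not the ID
```

Because the product table keeps history, one key can match several product rows, so a join on the key alone may return more than one row per order line.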
It is a transactional table about the orders and sales, and it is a great table for connecting the customers with
the products and with the orders as well. So let's document this new information that we have. So the table
name is the sales details. We can go and describe it like this: "transactional records about sales and
orders". And now we have to go and describe how we can connect this table to the other two. So we are not using
the product ID. We are using the product key. And now we need a new row over here. So you can hold
Ctrl and Enter, or you can go over here and add a new row. And the other row is going to be the customer ID. So now
for the customer ID it is easy: we can go and grab an arrow in order to connect those two tables. But for the product
key, we are not using the ID. So that's why I'm just going to go and remove this one and say product key. Let's check
again. So this is a product key. It's not the product ID. And if we go and check the old table, the product
info, you can see we are using this key and not the primary key. So what we're going to do now is just go and
link it like this, and maybe switch those two tables. So I will put the customers below. Just perfect. It looks
nice.

Okay. So, let's keep moving. Let's go now to the other source system. We have the ERP, and the first one is ERP
cust, and it has this cryptic name. Let's go and select the data. So now here it's a small table and we have only
three pieces of information. So we have something called CID, and then we have something that I think is the birthday,
and the gender information. So we have male, female, and so on. So it looks again like customer
information, but here we have extra data about the birthday. And now if you go and compare it to the customer table
that we have from the other source system, let's go and query it, you can see the new table from the ERP doesn't
have IDs. It actually has the customer number, or the key. So we can go and join those two tables using the customer key.
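A minimal sketch of that cross-system join, using made-up rows. The names `crm_customers`, `erp_customers`, `cid`, `bdate`, and `gen` are hypothetical stand-ins; the point is only that the ERP identifier lines up with the CRM customer key rather than with the CRM ID.

```sql
-- Made-up rows; all table and column names here are assumptions.
WITH crm_customers AS (
    SELECT * FROM (VALUES
        (1, 'C-001', 'Anna'),
        (2, 'C-002', 'Ben')
    ) AS c(cst_id, cst_key, cst_firstname)
),
erp_customers AS (
    SELECT * FROM (VALUES
        ('C-001', CAST('1990-04-12' AS date), 'Female'),
        ('C-002', CAST('1985-11-03' AS date), 'Male')
    ) AS e(cid, bdate, gen)
)
SELECT c.cst_id, c.cst_key, c.cst_firstname, e.bdate, e.gen
FROM crm_customers AS c
LEFT JOIN erp_customers AS e
       ON e.cid = c.cst_key;  -- match on the customer key, not the technical ID
```

A `LEFT JOIN` is used here so that CRM customers without an ERP record would still appear; whether that is the right choice depends on the analysis.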
Let's go and document this information. So I will just go and copy-paste and put it here on the right side. I will just
go and change the color, since we are now talking about a different source system. And here the table name is going to
be this one, and the key is called CID. Now, in order to join this table with the customer info, we cannot join it
with the customer ID. We need the customer key. That's why we have to go and add a new row here. So Ctrl+Enter, and
we're going to say customer key. And then we have to go and make a nice arrow between those two keys. So we're going
to go and give it a description, "customer information", and here we have the birth date. Okay. So now let's keep
going. We're going to go to the next one. We have the ERP location. Let's go and query this table. So what do we
have over here? We have the CID again. And as you can see, we have country information. And this is of course
again the customer number. And we have only this information, the country. So let's go and document this information.
This is the customer location. The table name is going to be like this. And we still have the same ID. So we still have
the customer ID here, and we can go and join it using the customer key. And we have to give it the description "location of
customers", and we can say here the country. Okay. So now let's go to the last table and explore it. We have the
ERP PX cat. So let's go and query this information. So what do we have here? We have an ID, a category, a
subcategory, and the maintenance. Here we have either yes or no. So by looking at this table, we have all the
categories and the subcategories of the products, and here we have a special identifier for this information. Now
the question is how to join it. I would actually like to join it with the product information. So let's go and
check those two tables together. Okay. So in the product table we don't have any ID for the categories, but we do have this
information in the product key. So the first five characters of the product key are actually the category ID.
So we can use this information over here in order to join it with the categories. So we can go and describe this
information like this, and then we have to go and give it a name. And then here we have the ID, and the ID can be
joined using the product key. So that means for the product information we don't need the product ID, the
primary key, at all. All we need is the product key, or the product number.

And what I would like to do is group
this information in a box. So let's go and grab any box here on the left side and make it bigger, and then make
the edges a little bit smaller. Let's remove the fill, and for the line I will make a dotted line. And then let's grab
another box over here and say this is the CRM. And we can go and increase the size, maybe something like 40, smaller, 35,
bold, and change the color to blue, and just place it here on top of this box. So with that we can understand that all those
tables belong to the source system CRM, and we can do the same for the right side as well. Now of course we
have to go and add the description here. So it's going to be the product categories.

All right. So with that we
now have a clear understanding of how the tables are connected to each other. We now understand the content of each table,
and of course it can help us clean up the data in the silver layer in order to prepare it. So as you can see, it is very
important to take time to understand the structure of the tables and the relationships between them before you start writing any
code. All right. So with that we now have a clear understanding of the sources, and with that we have also
created a data integration model in the diagram. So with that we have a better understanding of how to connect the sources. And
now in the next two tasks we will go back to SQL, where we're going to start checking the quality and doing a
lot of data transformations. So let's go.

Okay, so now let's have a quick look at
the specifications of the silver layer. So the main objective is to have clean and standardized data. We have to prepare
the data before going to the gold layer. And we will be building tables inside the silver layer. And the way of loading
the data from the bronze to the silver is a full load. So that means we're going to truncate and then insert. And
here we're going to have a lot of data transformations. So we're going to clean the data. We're going to bring in
normalization and standardization. We're going to derive new columns. We will be doing data enrichment as well. So a lot
of things to be done in the data transformation. But we will not be building any new data model. So those
are the specifications, and we have to commit ourselves to this scope.

Okay. So now building the DDL script for the
silver layer is going to be way easier than for the bronze, because the definition and the structure of each table in the
silver is going to be identical to the bronze layer. We are not doing anything new. So all you have to do is
take the DDL script from the bronze layer and just go and search and replace the schema. I'm just using
Notepad++ for the scripts. So I'm going to go over here and say replace "bronze." with "silver.", and I'm
going to go and replace all. So with that, all the DDL is now targeting the silver schema, which is exactly
what we need.

All right. Now before we execute our new DDL script for the silver, we have to talk about something
called metadata columns. They are additional columns or fields that the data engineers add to each table that
don't come directly from the source systems. The data engineers use them in order to provide extra information
for each record. For example, we can add a column called create date, which is when the record was loaded, or an update date, when
the record got updated, or we can add the source system in order to understand the origin of the data that we have, or
sometimes we can add the file location in order to understand the lineage, which file the data came from. Those are
great tools if you have a data issue in your data warehouse, if there is corrupt data and so on. This can help
you track exactly where this issue happened and when. And it is also great for understanding whether I
have a gap in my data, especially if you are doing incremental loads. It is like putting labels on everything, and you
will thank yourself later when you start using them in hard times, when you have an issue in your data warehouse. So now
back to our DDL script, and all you have to do is the following. So for example, for the first
table I will go and add one more extra column at the end. It starts with the prefix DWH, as we have defined in the
naming convention, then an underscore, and then let's have the create date. The data type is going to be DATETIME2, and now
what we can do is go and add a default value for it. I want the database to generate this information
automatically. We don't have to specify that in any scripts. So which value? It's going to be GETDATE. So each
record that is inserted into this table will automatically get a value from the current date and time. So now
as you can see, the naming convention is very important. All those columns come from the source system, and only
this one column comes from the data engineer of the data warehouse. Okay. So that's it. Let's go and repeat the same
thing for all the other tables. So I will just go and add this piece of information to each
DDL. All right. So I think that's it. All you have to do now is go and execute the whole DDL script for the
silver layer. Let's go and do that. All right, perfect. There are no errors. Let's go and refresh the tables in the Object
Explorer. And with that, as you can see, we have six tables for the silver layer. It is identical to the bronze layer, but
we have one extra column for the metadata.

All right. So now in the silver layer, before we start
writing any data transformations and cleansing, we first have to detect the quality issues in the bronze. Without
knowing the issues we cannot find a solution, right? We will first explore the quality issues, and only then start
writing the transformation scripts. So let's go.

Okay. So now what we're going
to do is go through all the tables of the bronze layer, clean up the data, and then insert it into the
silver layer. So let's start with the first table, the first bronze table from the source CRM. So we're going to go to
the bronze CRM customer info. So let's go and query the data over here. Now, of course, before writing any data
transformations, we have to go and detect and identify the quality issues of this table. So usually I start with
the first check, where we go and check the primary key. So we have to go and check whether there are nulls inside the
primary key and whether there are duplicates. So now in order to detect the duplicates in the primary key, what
we have to do is go and aggregate by the primary key. If we find any value in the primary key that exists more than once,
that means it is not unique and we have duplicates in the table. So let's go and write a query for that. So what we're
going to do is go with the customer ID, and then we're going to go and count, and then we have to group
the data, so group by the primary key. And of course we don't need all the results. We need only those where we
have an issue. So we're going to say HAVING count greater than one. So we are
interested in the values where the count is greater than one. So let's go and execute it. Now as you can see, we have an
issue in this table. We have duplicates, because all those IDs exist more than once in the table, which is completely
wrong. The primary key should be unique, and you can see as well that we have three records where the primary key is
empty, which is also a bad thing. Now there is an issue here. If we have only one null, it will not show up in the
result. So what I'm going to do is go over here and say OR the primary key IS NULL. Just in case we
have only one null, I'm still interested in seeing it in the results. So if I go and run it again, we'll get the same results. So
this is a quality check that you can do on the table. And as you can see, it is not meeting the expectation. So that
means we have to do something about it.

So let's go and create a new query. So here what we're going to do is
start writing the query that is doing the data transformation and the data cleansing. So let's start again by
selecting the data and executing it again. So now what I usually do is go and focus on the issue.
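The primary key check described above can be sketched like this. It runs against a few made-up rows rather than the real bronze table, and the column name `cst_id` is an assumption.

```sql
-- Made-up rows: ID 2 appears twice and one row has no ID at all.
WITH cust AS (
    SELECT * FROM (VALUES
        (1,    '2025-01-01'),
        (2,    '2025-01-05'),
        (2,    '2025-02-01'),
        (NULL, '2025-03-01')
    ) AS t(cst_id, cst_create_date)
)
SELECT cst_id,
       COUNT(*) AS row_count
FROM cust
GROUP BY cst_id
HAVING COUNT(*) > 1      -- duplicated keys
    OR cst_id IS NULL;   -- also surface a single NULL key
-- Expectation for a clean table: no rows returned.
```

Without the `OR cst_id IS NULL` part, a table with exactly one NULL key would pass the check, because that group has a count of one.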
So for example, let's go and take one of those values, and I focus on it before I start writing the transformation. So
we're going to say WHERE customer ID equals this value. All right. So now as you can see, we have the issue here
where the ID exists three times, but actually we are interested in only one of them. So the question is how to pick
one of those. Usually we search for a timestamp or date value to help us. So if you check the creation date over here,
we can understand that this record, this one over here, is the newest one and the previous two are older than it. So that
means if I have to go and pick one of those values, I would like to get the latest one, because it holds the
freshest information. So what we have to do is go and rank all those values based on the create date and
only pick the highest one. So that means we need a ranking function, and for that in SQL we have the amazing window
functions. So let's go and do that. We will use the function ROW_NUMBER, OVER, and then PARTITION BY, and here we have
to divide the table by the customer ID. So we're going to divide it by the customer ID, and in order now to rank
those rows we have to sort the data by something. So ORDER BY, and as we discussed, we want to sort the data by
the creation date. So create date, and we're going to sort it descending, so the highest first, then
the lowest. So let's go and do that. And now we're going to go and give it a name, flag last. So now let's go and execute
it. Now the data is sorted by the creation date, and you can see over here that this record is number one, then
the one that is older is two, and the oldest one is three. Of course we are interested in rank number one. Now
let's go and remove the filter and check everything. So now if you have a look at the table, you can see that in the flag
we have one almost everywhere, and that's because those primary keys exist only once. But sometimes we will not have
one; we'll have two, three, and so on, if there are duplicates. We can of course go and do a double check. So let's
go over here and say SELECT star from this query, and we can say WHERE flag last is not equal to one. So let's
go and query it. And now we can see all the data that we don't need, because they are causing duplicates in the primary
key and they have an old status. So what we're going to do is say equal to one. And with that we
guarantee that our primary key is unique and each value exists only once. So if I go and query it like this, you will see
we will not find any duplicates inside our table. And we can go and check that, of course. So let's go and check this
primary key. We're going to say AND customer ID equals this value. And you can see it now exists only once, and we
are getting the freshest data for this primary key. So with that we have defined a transformation in order to
remove any duplicates.

Okay. So now moving on to the next one. As you can see, in our table we have a lot of values
that are string values. Now for these string values we have to check for unwanted spaces. So now let's go and
write a query that's going to detect those unwanted spaces. So we're going to say SELECT this column, the first name,
from our table, the bronze customer information. So let's go and query it. Now by just looking at the data, it's
going to be really hard to find those unwanted spaces, especially if they are at the end of the word. But there is a
very easy way to detect those issues. So what we're going to do is add a filter. So now we're going
to say the first name is not equal to the first name after trimming the values. So if you use the function TRIM,
what is it going to do? It's going to go and remove all the leading and trailing spaces. So the first name. So if this
value is not equal to the first name after trimming it, then we have an issue. So it is very simple. Let's go
and execute it. So now in the result, we will get a list of all first names where we have spaces either at the start or at
the end. So again, the expectation here is no results. And we can go and check something else the same way, like for
example the last name. So let's go and do that over here and here. Let's go and execute it. We see in the results we
have 17 customers as well that have a space in their last name, which is not really good. And we can go and
keep checking all the string values that we have inside the table. So for example, the gender. So let's go and check
that and execute. Now as you can see, we don't have any results. That means the quality of the gender is better and we
don't have any unwanted spaces. So now we have to go and write a transformation in order to clean up those two columns.
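Before moving on, here is a small sketch of the two checks walked through so far: keeping the newest row per key with `ROW_NUMBER`, and finding values with unwanted spaces. It uses a temporary table with made-up rows, and the column names are assumptions. One caveat the walkthrough does not cover: SQL Server ignores trailing spaces when comparing strings with `=` or `!=`, so this sketch additionally compares `DATALENGTH` before and after trimming. `TRIM` requires SQL Server 2017 or later.

```sql
-- Made-up rows; table and column names are assumptions, not the real schema.
CREATE TABLE #cust (
    cst_id          INT,
    cst_firstname   NVARCHAR(50),
    cst_create_date DATE
);

INSERT INTO #cust (cst_id, cst_firstname, cst_create_date) VALUES
    (1, N' Anna', '2025-01-01'),  -- leading space
    (2, N'Ben',   '2025-01-05'),  -- older duplicate of ID 2
    (2, N'Ben',   '2025-02-01'),  -- newest row for ID 2
    (3, N'Cara ', '2025-01-09');  -- trailing space

-- 1) Rank rows per customer ID, newest first, and keep only rank 1.
SELECT cst_id, cst_firstname, cst_create_date
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY cst_id
               ORDER BY cst_create_date DESC
           ) AS flag_last
    FROM #cust
) AS ranked
WHERE flag_last = 1;

-- 2) List first names that change when trimmed.
--    The DATALENGTH comparison catches trailing spaces that != would ignore.
SELECT cst_id, cst_firstname
FROM #cust
WHERE cst_firstname != TRIM(cst_firstname)
   OR DATALENGTH(cst_firstname) != DATALENGTH(TRIM(cst_firstname));

DROP TABLE #cust;
```

Switching the filter in the first query to `flag_last != 1` shows the rows that would be discarded, which is a handy way to double-check the rule before relying on it.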
Now what I'm going to do is just go and list all the columns in the query instead of the star. All right. So
now I have a list of all the columns that I need. And now what we have to do is go to those two columns and start
removing the unwanted spaces. So we will just use TRIM. It's very simple. And give it a name, of course,
the same name. And we will trim the last name as well. So let's go and query this. And with that we have cleaned up
those two columns from any unwanted spaces.

Okay. So now moving on, we have those two pieces of information. We have the
marital status and the gender as well. If you check the values inside those two columns, as you can see we have
low cardinality here. So we have a limited number of possible values used inside those two columns. So what we
usually do is go and check the data consistency inside those two columns. So it's very simple what we're going to do.
We're going to do the following. We're going to say DISTINCT and we're going to check the
values. Let's go and do that. And now as you can see, we have only three possible values, either null, F, or M, which is
okay. We can stay like this, of course. But we can make a rule in our project where we say we will not be working
with data abbreviations. We will go and use only friendly full names. So instead of having an F, we're going to have
the full word female, and instead of M we're going to have male, and we make it a rule for the whole project.
So each time we find the gender information, we try to give the full name of it. So let's go and map those
two values to friendly ones. So we're going to go to the gender over here and say CASE WHEN, and we're going to say the
gender is equal to F, then make it Female. And when it is equal to
M, then map it to Male. And now we have to make a decision about the nulls. As you can see over here, we have nulls. So do
we want to leave it as a null, or do we always want to use the value unknown? So with that we are replacing the missing values
with a standard default value, or you can leave it as null. But let's say in our project that we are replacing all the
missing values with a default value. So let's go and do that. We're going to say ELSE, and I'm going to go with the NA, not
available, or you can go with unknown, of course. So that's it for the gender information, like this. And we can go and
remove the old one. And now there is one thing that I usually do in this case. Currently
we are getting the capital F and the capital M, but maybe over time something changes and you will get a
lowercase m and lowercase f. So we want to make sure that in those cases we are still able to map those values to the correct value.
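The gender mapping from this step can be sketched as follows, with upper-casing as the safeguard against lowercase codes and an optional `TRIM` against stray spaces. The rows are made up, the column name `cst_gndr` is an assumption, and `'n/a'` is just one possible default label.

```sql
-- Made-up rows covering a clean code, a lowercase code with a space, and a NULL.
WITH cust AS (
    SELECT * FROM (VALUES
        (1, 'F'),
        (2, 'm '),
        (3, NULL)
    ) AS t(cst_id, cst_gndr)
)
SELECT cst_id,
       CASE
           WHEN UPPER(TRIM(cst_gndr)) = 'F' THEN 'Female'
           WHEN UPPER(TRIM(cst_gndr)) = 'M' THEN 'Male'
           ELSE 'n/a'   -- default for NULLs and any unexpected code
       END AS cst_gndr
FROM cust;
```

The same pattern carries over to any other low-cardinality code column; only the code-to-label pairs change.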
What we're going to do is just use the function UPPER, to make sure that if you get any lowercase values
we are able to catch them. So the same thing over here as well. And now one more thing that you can add as well. Of
course, if you are not trusting the data, because we saw some unwanted spaces in the first name and the last name, you
might not trust that you won't get unwanted spaces here as well in the future. You can go and make sure to trim
everything, just to make sure that you are catching all those cases. So that's it for now. Let's go and execute. Now,
as you can see, we don't have an M and an F. We have the full words, male and female. And if we don't have a value, we
don't have a null; we are getting not available here. Now we can go and do the same for the marital status. You
can see that here as well we have only three possibilities: the S, null, and an M. We can go and do the same. So I will
just go and copy everything from here, and I will go and use the marital status and just remove this one from here. And
now what are the possible values? We have the S, so it's going to be single. We have an M for married. And we have
a null as well, and with that we are getting the not available. So with that we are doing data standardization for
this column as well. So let's go and execute it. Now as you can see, we don't have those short values. We have a full friendly
value for the status and for the gender as well. And at the same time we are handling the nulls inside those two
columns. So with that we are done with those two columns. And now we can go to the last one, the create date. For this
type of information, we make sure that this column is a real date and not a string or VARCHAR. And as we defined it
in the data type, it is a date, which is completely correct. So there is nothing to do with this column. And now the next step
is to go and write the INSERT statement. So how are we going to do it? We're going to go to the start over
here and say INSERT INTO silver CRM customer info. Now we have to go and specify all the columns that should
be inserted. So we're going to go and type them. So something like this. And then we have the query over here. Let's
go and execute it. So with that we have inserted clean data into the silver table.

So now what
we're going to do is go and take all the queries that we have used in order to check the quality of the
bronze, take them to another query, and instead of bronze we're going to say silver. So
this is about the primary key. Let's go and execute it. Perfect. We don't have any results. So we don't have any
duplicates. The same thing for the next one. So the silver, and it was for the first name. So let's go and check the
first name and run it. As you can see, there are no results. It is perfect. We don't have any issues. You can of course
go and check the last name and run it again. We don't have any results over here. And now we can go and
check those low-cardinality columns, like for example the gender. Let's go and execute
it. So as you can see, we have the not available, or the unknown, male, and female. So perfect. And you can go and
have a final look at the table, the silver customer info. Let's go and check that. So now we can have a look at all
those columns. As you can see, everything looks perfect, and you can see that the metadata information that
we have added to the table definition is working. Now it says when we inserted all those records into the table, which is
really amazing information to have for tracking and auditing.

Okay. So now by looking at this script, we have done different
types of data transformations. The first one is with the first name and the last name. Here we have done trimming,
removing unwanted spaces. This is one of the types of data cleansing. So we remove unnecessary spaces or unwanted
characters to ensure data consistency. Now moving on to the next transformation, we have this CASE WHEN.
So what we have done here is data normalization, or we sometimes call it data standardization. This
transformation is a type of data cleansing where we map coded values to meaningful, user-friendly descriptions, and
we have done the same transformation for the gender as well. Another type of transformation that we have done
in the same CASE WHEN is that we have handled the missing values. So instead of nulls we are going to have not available. So
handling missing data is also a type of data cleansing, where we are filling the blanks by adding, for example, a default
value. So instead of having an empty string or a null, we're going to have a default value like not available or
unknown. Another type of data transformation that we have done in this script is that we have removed the
duplicates. So removing duplicates is also a type of data cleansing, where we ensure only one record for each primary
key by identifying and retaining only the most relevant row, to ensure there are no duplicates inside our data. And as we
are removing the duplicates, of course we are doing data filtering. So those are the different types of data
transformations that we have done in this script.

All right, moving on to the second table
in the bronze layer from the CRM. We have the product info. And of course, as usual, before we start writing any
transformations, we have to search for data quality issues. And we start with the first one: we have to check the
primary key. So we have to check whether we have duplicates or nulls inside this key. So what we have to do is
group the data by the primary key, or check whether we have nulls. So let's go and execute it. So as you can see,
everything is safe. We don't have duplicates or nulls in the primary key. Now moving on to the next one, we have
the product key. Here in this column we have a lot of information. So now what we have to do is go and split
this string into two pieces of information. So we are deriving two new columns. So now let's start with the first one, the
category ID. The first five characters are actually the category ID, and we can go and use the SUBSTRING function in
order to extract part of a string. It needs three arguments. The first one is going to be the column that we want to
extract from. Then we have to define the position where to start extracting, and since the first part is on the left side, we
are going to start from the first position. And then we have to specify the length, so how many characters we want to
extract. We need five characters. So 1, 2, 3, 4, 5. So that's it for the category ID. Let's go and execute it.
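As a minimal illustration of the three `SUBSTRING` arguments, the query below pulls the first five characters out of two made-up product keys. The key values and the column name `prd_key` are assumptions chosen only to show the shape of the data.

```sql
-- Made-up product keys; the real values may look different.
WITH prd AS (
    SELECT * FROM (VALUES
        ('CO-RF-FR-R92B-58'),
        ('AC-HE-HL-U509')
    ) AS t(prd_key)
)
SELECT prd_key,
       SUBSTRING(prd_key, 1, 5) AS cat_id  -- column, start position, length
FROM prd;
```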
Now, as you can see, we have a new column called the category ID, and it contains the first part of the string.
And in our database, from the other source system, we have the category ID as well. Now we can go and double
check, just to make sure that we can join the data together. So we're going to go and check the ID from the bronze
ERP table for the categories. So in this table we have the category IDs, and you can see over here
those are the IDs of the category, and in the gold layer we have to go and join those two tables. But here we still have
an issue. We have an underscore here between the category and the subcategory, but in our table we
actually have a minus. So we have to replace that with an underscore in order to have matching information between those two
tables. Otherwise we will not be able to join the tables. So we're going to use the function
REPLACE. And what are we replacing? We are replacing the minus with an underscore, something like this. And if
you go now and execute it, we will get an underscore, exactly like the other table. And of course we can go and check
whether everything is matching with a very simple query, where we say this new information NOT IN, and then we have
this nice subquery. So we are trying to find any category ID that is not available in the second table. So let's
go and execute it. Now as you can see, we have only one category that is not matching. We are not finding it in this
table, which is maybe correct. So if you go over here, you will not find this category. I'll just make it a little bit
bigger. So we are not finding this one category in this table, which is fine. So our check is okay. Okay. So with that we
have the first part. Now we have to go and extract the second part and we're going to do the same thing. So we're
going to use the substring and the three argument the product key but this time we will not start cutting from the first
position we have to be in the middle. So 1 2 3 4 5 6 7. So we start from the position number seven. And now we have
to define the length how many characters to be extracted. But if you look over here you can see that we have different
length of the product keys. It is not fixed like the category ID. So we cannot go and here specify number. We have to
make something dynamic and there is trick in order to do that. We're going to go and use the length of the whole
column. With that we make sure that we are always getting enough characters to be extracted and we will not be losing
any informations. So we will make it dynamic like this. We will not have it as a fixed length and with that we have
the product key. So let's go and execute it. As you can see we are now extracting the second part from this string. Now
why we need the product key? We need it in order to join it with another table called sales details. So let's go and
check the sales details. So let me just check the column name. It is SLS product key. So from bronze
CRM sales. Let's go and check the data over here. And it looks wonderful. So actually we can go and join those
informations together. But of course we're going to go and check that. So we're going to say where and we're going
to take our new column and we're going to say not in the sub query just to make sure that we are not missing anything.
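As a rough sketch of the two extractions and the check described so far, the query below derives the category ID and the product key from one string and then lists the products that have no match in the sales table. The sample rows and the names prd_key, cat_id and sls_prd_key are assumptions for illustration; the real data sits in the bronze tables.

```sql
-- Sample rows are made up; table and column names are assumptions.
WITH prd AS (
    SELECT prd_key
    FROM (VALUES ('CO-RF-FR-R92B-58'), ('AC-HE-HL-U509')) AS v(prd_key)
),
sales AS (
    SELECT sls_prd_key
    FROM (VALUES ('HL-U509')) AS v(sls_prd_key)
)
SELECT
    prd_key AS raw_prd_key,
    -- First five characters, '-' replaced by '_' to match the category table
    REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') AS cat_id,
    -- From position 7 to the end; LEN(prd_key) is always a sufficient length
    SUBSTRING(prd_key, 7, LEN(prd_key)) AS prd_key
FROM prd
-- Keep only products whose extracted key never appears in the sales table
WHERE SUBSTRING(prd_key, 7, LEN(prd_key)) NOT IN (SELECT sls_prd_key FROM sales);
```

With this sample, only the first product is returned, with CO_RF as the category ID and FR-R92B-58 as the product key. Switching NOT IN to IN returns the products that do have orders.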
So let's go and execute. So it looks like we have a lot of products that don't have any orders. Well, I don't
have a good feeling about it. Let's go and try something like this one here. And we say where the SLS product key is like this value
over here. So I'll just cut the last three characters just to search inside this table. So we really don't have such keys. Let
me just cut the second one. So let's go and search for it. We don't have it as well. So anything that starts with the F
key, we don't have any order with the product where it starts with the F key. So let's go and remove it. But still we
are able to join the tables, right? So if I go and say in instead of not in. So with that you are able to match all
those products. So that means everything is fine. Actually it's just products that don't have any orders. So with that
I'm happy with this transformation. Now moving on to the next one. We have here the name of the product. We can go and
check whether there is unwanted spaces. So let's go to our quality checks. Make sure to use the same table and we're
going to use the product name and check whether we find any unmatching after trimming. So let's go and do it. Well,
it looks really fine. So we don't have to trim anything. This column is safe. Now moving on to the next one. We have
the costs. So here we have numbers and we have to check the quality of the numbers. So what we can do? We can check
whether we have nulls or negative numbers. So negative costs or negative prices, which are not realistic, depending on
the business of course. So let's say in our business we don't have any negative costs. So it's going to be like this.
Let's go and check whether it's something less than zero or whether we have costs that is null. So let's go and
check those informations. Well, as you can see, we don't have any negative values, but we have nulls. So we can go
and handle that by replacing the null with a zero. Of course, only if the business allows that. So in SQL Server, in order
to replace the null with a zero, we have a very nice function called ISNULL. So we are saying if it is null then replace
this value with a zero. It is very simple like this and we give it a name of course. So let's go and execute it.
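A minimal sketch of that replacement, using a few made-up rows (the names prd_id and prd_cost are assumptions):

```sql
-- Sample rows are made up; column names are assumptions.
WITH prd AS (
    SELECT prd_id, prd_cost
    FROM (VALUES (1, 12), (2, NULL), (3, 0)) AS v(prd_id, prd_cost)
)
SELECT
    prd_id,
    prd_cost            AS prd_cost_old,
    -- ISNULL returns the second argument when the first one is NULL
    ISNULL(prd_cost, 0) AS prd_cost
FROM prd;
```

ISNULL is specific to SQL Server. COALESCE(prd_cost, 0) is the standard SQL alternative and gives the same result here.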
And as you can see we don't have any more nulls. We have zero which is better for the calculations if you are later
doing any aggregate functions like the average. Now moving on to the next one we have the product line. This is again
abbreviation of something and the cardinality is low. So let's go and check all possible values inside this
column. So we're just going to use DISTINCT on the PRD line. So let's go and execute it. And as you can see
the possible values are null, M, R, S and T. And again those are abbreviations but in our data warehouse we have decided to give
full nice names. So we have to go and replace those codes those abbreviations with a friendly value. And of course in
order to get those informations I usually go and ask the expert from the source system or an expert from the
process. So let's start building our CASE WHEN. And then let's use UPPER and as well TRIM just to make sure
that we are covering all the cases. So the PRD line is equal to, so let's start with the
first value, the M. Then we will get the friendly value; it's going to be Mountain. Then to the next one. So I
will just copy and paste here. If it is an R then it is Road, and another one for, let me check what we have here. We
have M, R and then S. The S stands for other sales and we have the T. So let's go and get the T. So the T stands for
Touring. At the end we have an ELSE for the unknown values: not available. So we don't need any nulls. So that's it. And we're going
to name it as before. So product line. So let's remove the old one. And let's execute it. And as you can see, we don't
have here anymore those shortcuts and the abbreviations. We have now full friendly value. But I will go and have
here like capital O. It looks nicer. So with that we have nice friendly values. Now by looking at this CASE WHEN, as you can see
it is always like we are mapping one value to another value and we are repeating all the time UPPER and TRIM, UPPER and TRIM
and so on. We have here a quick form of the CASE WHEN if it is just a simple mapping. So the syntax is very simple: we
say CASE and then we have the column. So we are evaluating this value over here and then we just say WHEN without the
equal. So if it is an M then make it Mountain. The same thing for the next one and so on. So with that we have the
functions only once and we don't have to go and keep repeating the same function over and over. And this works only if you
are mapping values. If you have complex conditions you cannot do it like this. But for now I'm going to stay with
the quick form of the CASE WHEN. It looks nicer and shorter. So let's go and execute it. We will get the same results.
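The quick form of the CASE WHEN could look roughly like this. The column name prd_line, the sample codes and the exact label used in the ELSE branch are assumptions for illustration.

```sql
-- Sample codes are made up; prd_line and the 'n/a' label are assumptions.
WITH prd AS (
    SELECT prd_line
    FROM (VALUES ('M'), ('r '), ('S'), ('T'), (NULL)) AS v(prd_line)
)
SELECT
    prd_line AS prd_line_code,
    -- Simple CASE: UPPER and TRIM are written once, each WHEN is a plain value
    CASE UPPER(TRIM(prd_line))
        WHEN 'M' THEN 'Mountain'
        WHEN 'R' THEN 'Road'
        WHEN 'S' THEN 'Other Sales'
        WHEN 'T' THEN 'Touring'
        ELSE 'n/a'  -- covers NULL and any unexpected code
    END AS prd_line
FROM prd;
```

TRIM needs SQL Server 2017 or later; on older versions LTRIM(RTRIM(prd_line)) does the same job.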
Okay, so now back to our table. Let's go to the last two columns. We have the start and end date, so it's like defining
an interval: we have start and end. So let's go and check the quality of the start and end dates. We're going to go
and say select star from our bronze table. And now we're going to go and search it like this. We are searching
for the end date that is smaller than the start date. So let's go and query this. So you can
see the start is always like after the end, which makes no sense at all. So we have here a data issue with those two
dates. So now for this kind of data transformations what I usually do is I go and grab few examples and put it in
Excel and try to think about how I'm going to go and fix it. So here I took like two products this one and this one
over here. And for that we have like three rows for each one of them. And we have this situation over here. So the
question now how we going to go and fix it? I will go and make like a copy of one solution where we're going to say
it's very simple. Let's go and switch the start date with the end date. So if I go and grab the end date and put it at
the start, things are going to look way nicer, right? So now the start is always earlier than the end. But my
friends, the data now makes no sense, because we say from 2007 until 2011 the price was 12. But
between 2008 and 2012 we have 14, which is not really good, because if you take for example the year 2010, for 2010 it
was 12 and at the same time 14. So it is really bad to have an overlap between those two dates. It should start
from 2007 and end with 11, and then start from 12 and end with something else. There should be no overlapping between
years. So it's not enough to say the start should always be smaller than the end; as well, the end of the first
history record should be earlier than the start of the next record. This is as well a rule in order to have no overlapping.
This one has no start but already has an end, which is not really okay, because we always have to have a start. Each new
record in a historization has to have a start. So for this record over here this is as well wrong. And of course it is
okay to have a start without an end. So in this scenario it's fine because this indicates that this is the current
informations about the costs. So again this solution is not working at all. So now for the solution two what we can say
let's go and ignore completely the end date and we take only the start date. So let's go and paste it over here. But now
we go and rebuild the end date completely from the start date following the rules that we have defined. So the
rule says the end date of the current record comes from the start date of the next record. So here this end date
comes from this value over here from the next record. So that means we take the next start date and put it at the end
date for the previous records. So with that as you can see it is working the end date is higher than the start date.
And as well we are making sure this date is not overlapping with the next record. But as well in order to make it way
nicer we can subtract it with one. So we can take the previous day like this. So with that we are making sure the end
date is smaller than the next start. And now for the next record this one over here the end date going to come from the
next start date. So we will take this one for here and put it as an end date and subtract it with one. So we will get
the previous day. So now if you compare those two you can see it's still higher than the start. And if you compare it
with the next record this one over here it is still smaller than the next one. So there is no overlapping. And now for
the last record since we don't have here any informations it will be a null which is totally fine. So as you can see I'm
really happy with this scenario over here. Of course you can go and validate this with an expert from the source
system. But let's say I have done that and they approved it and now I can go and clean up the data using this new
logic. So this is how I usually brainstorm about fixing issues. If I have something complex, I go and use
Excel and then discuss it with the expert using this example. It's way better than showing a database queries
and so on. It just makes things easier to explain and as well to discuss. So now how I usually do it, I usually go
and make a focus on only the columns that I need and take only one two scenarios while I'm building the logic
and once everything is ready I go and integrate it in the query. So now I'm focusing only on these columns and only
for these products. So now let's go and build our logic. Now in SQL if you are at specific record and you want to
access another information from another records and for that we have two amazing window functions. We have the lead and
log. In this scenario, we want to access the next records. That's why we have to go with the function leads. So, let's go
and build it lead. And then what do we need? We need the lead of the start date. So, we want the start date
of the next record. And then we say over and we have to partition the data. So, the window going to be focusing on only
one product which is the product key and not the product ID. So, we are dividing the data by product key. And of course,
we have to go and sort the data. So order by and we are sorting the data by the start
date and ascending. So from the lowest to the highest and let's go and give it another name. So as let's say test for
example just to test the data. So let's go and execute. And I think I missed something here. It is partition by. So
let's go and execute again. And now let's go and check the results for the first partition over here. So the start
is 2011 and the end is 2012. And this information came from the next record. So this data is moved to the previous
record over here. And the same thing for this record. So the end date comes from the next record. So our logic is
working. And the last record over here is null because we are at the end of the window and there is no next data. That's
why we will get null and this is perfect of course. So it looks really awesome. But what is missing is we have to go and
get the previous day. And we can do that very simply using minus one. we are just subtracting one day. So we have no
overlapping between those two dates and the same thing for those two dates. So as you can see we have just built a
perfect end date which is way better than the original data that we got from the source system. Now let's take this
one over here and put it inside our query. So we don't need the end date, we need our new end date. Let's just remove
that test and execute. Now it looks perfect. All right. Now we are not done yet with those two dates. Actually we
are saying dates all the time because we don't have here any information about the time; it is always zero. So it makes no
sense to have this information inside our data. So what can we do? We can do a very simple cast and we make this column
as a date instead of date time. So this is for the first one and as well for the next one as date. So let's try that out.
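Put together, the rebuilt end date might be sketched like this. The sample history rows and the names prd_key, prd_cost, prd_start_dt and prd_end_dt are assumptions. The sketch uses DATEADD for the "minus one day" step, which works for both DATE and DATETIME; the plain "- 1" shown in the walkthrough only works on DATETIME values.

```sql
-- Sample history rows are made up; column names are assumptions.
WITH prd AS (
    SELECT prd_key, prd_cost, CAST(prd_start_dt AS DATETIME) AS prd_start_dt
    FROM (VALUES
        ('HL-U509', 12, '20110701'),
        ('HL-U509', 14, '20120701'),
        ('HL-U509', 13, '20130701')
    ) AS v(prd_key, prd_cost, prd_start_dt)
)
SELECT
    prd_key,
    prd_cost,
    CAST(prd_start_dt AS DATE) AS prd_start_dt,
    -- End date = start date of the next record for the same product, minus one day.
    -- The last record of each product has no next record, so it stays NULL.
    CAST(
        DATEADD(DAY, -1,
                LEAD(prd_start_dt) OVER (PARTITION BY prd_key ORDER BY prd_start_dt))
        AS DATE
    ) AS prd_end_dt
FROM prd;
```

For this sample the end dates come out as 2012-06-30, 2013-06-30 and NULL, so no two intervals of the same product overlap.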
And as you can see it is nicer. We don't have the time informations. Of course, we can tell the source systems about all
those issues. But since they don't provide a time, it makes no sense to have date and time. Okay, so it was a
long run, but we have now a cleaned product informations. And this is way nicer than the original product
information that we got from the source CRM. So if you grab the DDL of the server table, you can see that we don't
have a category ID. So we have product ID and product key. And as well those two columns, we just changed the data
type. So it's date time here but we have changed that to a date. So that means we have to go and do few modifications to
the DDL. So what we're going to do we're going to go over here and say category ID and I will be using the same data
type for the start and the end. This time going to be date and not date and time. So that's it for now. Let's go and
execute it in order to repair the DDL. And this is what happen in the silver layer. Sometimes we have to adjust the
metadata if the quality of the data types and so on is not good or we are building new derived informations in
order later to integrate the data. So it will be like very close to the bronze layer but with few modifications. So
make sure to update your DDL scripts. And now the next step is that we're going to go
and insert the result of this query that is cleaning up the bronze table
into the silver table. So as we have done it before: insert into silver the product info, and then we have to go and list all
the columns. I've just prepared those columns. So with that we can go and now run our query in order to insert the
data. So now as you can see this did insert the data and the very important step is now to check the quality of the
silver table. So we go back to our data quality checks and we go switch to the silver. So let's check the primary key.
There is no issues and we can go and check for example here the trims there is as well no issue and now let's go and
check the costs it should not be negative or null which is perfect let's go and check the data standardizations
as you can see they are friendly and we don't have any nulls and now very interesting the order of the dates so
let's go and check that as you can see we don't have any issues and finally what I do I go and have a final look to
the silver table and As we can see everything is inserted correctly in the correct columns. So all those columns
comes from the source system and the last one is automatically generated from the DDL indicate when we loaded this
table. Now let's sit back and have a look to our script. What are the different types of data transformations
that we have done here is for example over here the category ID and the product key we have derived new columns.
So it is when we create a new column based on calculations or transformations of an existing one. So sometimes we need
columns only for analytics and we cannot each time go to the source system and ask them to create it. So instead of
that we derive our own columns that we need for the analytics. Another transformation we have is the ISNULL
over here. So we are handling here missing information. Instead of null we're going to have a zero. And one more
transformation we have over here for the product line. We have done here data normalization. Instead of having a code
value we have a friendly value. And as well we have handled the missing data. For example, over here instead of having
a null, we're going to have not available. All right, moving on to another data transformation. We have
done data type casting. So we are converting the data type from one to another. And this considered as well to
be a data transformation. And now moving on to the last one. We are doing as well data type casting. But what's more
important, we are doing data enrichment. This type of transformation, it's all about adding a value to your data. So we
are adding new relevant data to our data sets. So those are the different types of data transformations that we have
done for this table. Okay. So let's keep going. We have the sales details and this is the
last table in the CRM. So what do we have over here? We have the order number and this is a string. Of course we can
go and check whether we have an issue with the unwanted spaces. So we can search whether we're going to find
something. So we can say trim and something like this. and let's go and execute it. So we can see that we don't
have any unwanted spaces. That means we don't have to transform this column. So we can leave it as it is. Now the next
two columns they are like keys and ids in order to connect it with the other tables. As we learned before we are
using the product key in order to connect it with the product informations and we are connecting the customer ID
with the customer ID from the customer info. So that means we have to go and check whether everything is working
perfectly. So we can go and check the integrity of those columns where we say the product key not in and then we make
a subquery and this time we can work with the silver layer right so we can say the product key from silver dot
product info so let's go and query this and as you can see we are not getting any issue that means all the product
keys from the sales details can be used and connected with the product info the same thing we can go and check the
integrity of the customer ID. Here we use not the product key but the customer ID, and we go to the customer info, where the name was CST ID.
So let's go and query that and the same thing we don't have here any issues. So that means we can go and connect the
sales with the customers using the customer ID and we don't have to do any transformations for it. So things looks
really nice for those three columns. Now we come to the challenging one. We have here the dates. Now those dates are not
actual dates. They are integers. So those are numbers and we don't want to have it like this. We would like to clean that
up. We have to change the data type from integer to date. Now if you want to convert an integer to a date, we have to
be careful with the values that we have inside each of those columns. So now let's check the quality for example of
the order dates. Let's say where order dates is less than zero for example something negative. Well, we don't have
any negative values which is good. Let's go and check whether we have any zeros. Well, this is bad. So we have here a lot
of zeros. Now what can we do? We can replace those values with a null. We can use of course the NULLIF
function like this. We can say NULLIF, and if it is zero then make it null. So let's execute it. And as you can see now
all those values are null. Now let's go and check the data again. So now this integer has the year
information at the start, then the month and then the day. So here we have to have like 1 2 3 4 5. So the length of
each number should be eight. And if the length is less than eight or higher than eight then we have an issue. Let's go
and check that. So we're going to say: or the length of the sales order date is not equal to eight, that means less or higher. Let's go and
execute it. Now let's go and check the results over here. And those two informations they don't look like a
date. So we cannot go and make from these informations a real date. They are just bad data quality. And of course you
can go and check the boundaries of a date. Like for example it should not be higher than for example let's go and get
this value 2050 and then any for the month and the date. So let's go and execute it. And if we just remove those
informations just to make sure. So we don't have any date that is outside of the boundaries that you have in your
business. Or you go for example and say the boundary should not be less than some date, depending on when your business started. Maybe
something like this. We are getting of course those values because they are less than that. But if you have values
around these dates you will get them as well in the query. So we can go and add the rest. So all those checks
validate a column that holds date information but has the data type integer. So again what are the issues
over here? We have zeros and sometimes we have like strange numbers that cannot be converted to a dates. So let's go and
fix that in our query. So we can say case when the order date is equal to zero or the length of the order
date is not equal to 8, then null. Right? We don't want to deal with those values. They are just wrong and they are
not real dates. Otherwise we say else, and it's going to be the order date. Now what we're going to do, we're going to go
and convert this to a date. We don't want this as an integer. So how can we do that? We can go and cast it first to
a varchar because we cannot cast from integer to date in SQL Server. First you have to convert it to a varchar and
then from varchar you go to a date. Well, this is how we do it in SQL Server. So we cast it first to a varchar and
then we cast it to a date like this. That's it. So we have the END and we are using the same column
name. So this is how we transform an integer to a date. So let's go and query this. And as you can see the order date
now is a real date. It is not a number. So we can go and get rid of the old column. Now we have to go and do the
same stuff for the shipping dates. So, we can go over here and replace everything with the shipping date and
let's go and query. Well, as you can see, the shipping date is perfect. We don't have any issue with this column.
But still, I don't like that we found a lot of issues with the order date. So, what we're going to do just in case this
happens for the shipping date in the future, I will go and apply the same rules to the shipping dates. Oh, let's
take the shipping date like this. And if you don't want to apply it now, you have always to build
like quality checks that runs every day in order to detect those issues. And once you detect it, then you can go and
do the transformations. But for now, I'm going to apply it right away. So that is for the shipping date. Now we go to the
due date and we will do the same test. Let's go and execute it. And as well, it is perfect. So still, I'm going to apply
the same rules. So let's get the due date everywhere here in the query. Just make sure you don't miss anything here.
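The integer-to-date rule applied to each of the three date columns can be sketched as follows, shown here for the order date only. The sample integers and the name sls_order_dt are assumptions.

```sql
-- Sample integers are made up; sls_order_dt is an assumed column name.
WITH sales AS (
    SELECT sls_order_dt
    FROM (VALUES (20101229), (0), (32154), (5489)) AS v(sls_order_dt)
)
SELECT
    sls_order_dt AS sls_order_dt_raw,
    CASE
        -- Zeros and numbers that are not eight digits long cannot be dates
        WHEN sls_order_dt = 0 OR LEN(sls_order_dt) != 8 THEN NULL
        -- INT cannot be cast to DATE directly, so go through VARCHAR first
        ELSE CAST(CAST(sls_order_dt AS VARCHAR) AS DATE)
    END AS sls_order_dt
FROM sales;
```

Only the first sample value becomes a real date (2010-12-29); the other three become NULL. An eight-digit number that is still not a valid calendar date would make CAST fail, so TRY_CAST is one option if that case can occur in your data.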
So let's go and execute now. Perfect. As you can see, we have the order date, shipping date, and due date. And all of
them are date and don't have any wrong data inside those columns. Now, still there is one more check that we can do
and it's that the order date should be always smaller than the shipping date or the due date because it makes no sense,
right? If you are delivering an item without an order. So first the order should happen then we are shipping the
items. So there is like an order of those dates and we can go and check that. So we are checking now for invalid
date orders where we can say the order date is higher than the shipping date or we are searching as well for an order
where the order date is higher than the due date. So we can have it like this due date. So let's go and check. Well,
that's really good. We don't have such a mistake on the data and the quality looks good. So the order date is always
smaller than the shipping date or the due date. So we don't have to do any transformations or cleanup. Okay
friends, now moving on to the last three columns. We have the sales, quantity and the price. All those informations are
connected to each others. So we have a business rule or calculation. It says the sales must be equal to quantity
multiplied by the price. And all sales quantity and price informations must be positive numbers. So it's not allowed to
be negative, zero or null. So those are the business rules and we have to check the data consistency in our table. Do
all those three columns follow our rules? So we're going to start first with our rule, right? So we're going to
say if the sales is not equal to quantity multiplied by the price. So we are searching where the result is not
matching our expectation. And as well we can go and check other stuff like the nulls. So for example we can say or
sales is null or quantity is null and the last one for the price and as well we can go and check whether they
are negative numbers or zero. So we can go over here and say less or equal to zero and apply it for the other columns
as well. So with that we are checking the calculation and as well we are checking whether we have null, zero or
negative numbers. Let's go and check our data. I'm going to have here a DISTINCT. So let's go and query it. And
of course we have here bad data. But we can go and sort the data by the sales quantity and the price. So let's do it.
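A sketch of that consistency check, run against a handful of made-up rows (the names sls_sales, sls_quantity and sls_price are assumptions):

```sql
-- Sample rows are made up; column names are assumptions.
WITH sales AS (
    SELECT sls_sales, sls_quantity, sls_price
    FROM (VALUES
        (10,   2, 5),     -- consistent row, should not be returned
        (NULL, 1, 2),
        (40,   1, 2),
        (0,    1, 4),
        (-9,   1, 9),
        (10,   2, NULL),
        (21,   1, -21)
    ) AS v(sls_sales, sls_quantity, sls_price)
)
SELECT DISTINCT sls_sales, sls_quantity, sls_price
FROM sales
WHERE sls_sales != sls_quantity * sls_price                           -- broken calculation
   OR sls_sales IS NULL OR sls_quantity IS NULL OR sls_price IS NULL  -- missing values
   OR sls_sales <= 0 OR sls_quantity <= 0 OR sls_price <= 0           -- zero or negative
ORDER BY sls_sales, sls_quantity, sls_price;
```

Every sample row except the first one violates at least one rule, so six rows come back.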
Now by looking to the data we can see in the sales we have nulls. We have negative numbers and zeros. So we have
all bad combinations and as well we have here bad calculations. So as you can see the price here is 50, the quantity is
one but the sales is two which is not correct. And here we have as well wrong calculations. Here we have to have a 10
and here nine or maybe the price is wrong. And by looking to the quantity now you can see we don't have any nulls.
We don't have any zeros or negative numbers. So the quantity looks better than the sales. And if you look to the
prices we have nulls we have negatives and yeah we don't have zeros. So that means the quality of the sales and the
price is wrong. The calculation is not working and we have these scenarios. Now of course how I do it here I don't go
and try now to transform everything on my own. I usually go and talk to an expert maybe someone from the business
or from the source system and I show those scenarios and discuss and usually there is like two answers either they
going to tell me you know what I will fix it in my source so I have to live with it there is incoming bad data and
the bad data going to be presented in the warehouse until the source system clean up those issues. And the other
answer you might get you know what we don't have the budget and those data are really old and we are not going to do
anything. So here you have to decide either you leave it as it is or you say you know what let's go and improve the
quality of the data. But here you have to ask for the experts to support you solving these issues because it really
depend on the rules. Different rules makes different transformations. So now let's say that we have the following
rules. If the sales value is null or negative or zero, then derive it with the formula, by multiplying
the quantity with the price. And now if the price is wrong, for example we have here a null or zero, then go and
calculate it from the sales and the quantity. And if you have a price that is a minus like minus 21, a negative
number, then you have to go and convert it to a 21. So from a negative to a positive without any calculations. So
those are the rules and now we're going to go and build the transformations. based on those rules. So let's do it
step by step. I will go over here and we're going to start building the new sales. So what is the rule says case
when of course as usual if the sales is null or let's say the sales is negative number or equal to zero or
another scenario we have a sales information but it is not following the calculation. So we have wrong
information in the sales. So we're going to say the sales is not equal to the quantity multiplied by the price. But of
course we will not leave the price like this; we wrap it in the function ABS. The absolute value is going to go and convert
everything from negative to positive. Then what we have to do is to go and use the calculation. So it's going to be the
quantity multiplied by the price. So that means we are not using the value that's come from the source system. We
are recalculating it. Now let's say the sales is correct and not one of those scenarios. So we're going to say else.
We will go with the sales as it is that comes from the source because it is correct. It's really nice. Let's go and
say an end and give it the same name. I will go and rename the old one here as an old value and the same for the price.
The quantity will not touch it because it is correct. So like this. And now let's go and transform the prices. So
again as usual we go with case when. So what are the scenarios? The price is null or the price is less or equal to
zero. Then what we going to do? We're going to do the calculation. So it's going to be the sales divided by the
quantity the SLS quantity. But here we have to make sure that we are not dividing by zero. Currently we don't
have any zeros in the quantity but you don't know in the future you might get a zero and the whole code going to break.
So what you have to do is to go and say if you get any zero replace it with a null. So NULLIF: if it is zero then make
it null. So that's it. Now if the price is not null and the price is not negative or equal to zero then
everything is fine and that's why we're going to have now the else it going to be the price as it is from the source
system. So that's it. We're going to say end as price. So I'm totally happy with that. Let's go and execute it and check
of course. So those are the old informations and those are the new transformed cleaned up informations. So
here previously we have a null but now we have two. So two multiplied with one we are getting two. So the sales is here
correct. Now moving on to the next one we have in the sales 40 but the price is two. So two multiplied with one we
should get two. So the new sales is correct. It is two and not 40. Now to the next one over here the old sales is
zero. But if you go and multiply the four with the quantity you will get four. So the sales here is not correct.
That's why in the new sales we have it correct as a four. And let's go and get a minus. So in this case we have a minus
which is not correct. So we are getting the price multiplied with one. We should get here a nine. And this sales here is
correct. Now let's go and get a scenario where the price is null like this here. So we don't have here a price but we
calculated from the sales and the quantity. So we divided the 10 by two and we have five. So the new price is
better. And the same thing for the minuses. So we have here minus 21 and in the output we have 21 which is correct.
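The two CASE expressions walked through above can be sketched as follows. The sample rows and the names sls_sales, sls_quantity and sls_price are assumptions. As in the walkthrough, a negative price is recomputed from sales and quantity, which for a row like 21, 1, -21 gives the same result as simply flipping the sign.

```sql
-- Sample rows are made up; column names are assumptions.
WITH sales AS (
    SELECT sls_sales, sls_quantity, sls_price
    FROM (VALUES
        (NULL, 1, 2),
        (40,   1, 2),
        (0,    1, 4),
        (-9,   1, 9),
        (10,   2, NULL),
        (21,   1, -21)
    ) AS v(sls_sales, sls_quantity, sls_price)
)
SELECT
    sls_sales AS old_sls_sales,
    sls_quantity,
    sls_price AS old_sls_price,
    -- Recalculate sales when it is missing, not positive, or inconsistent
    CASE
        WHEN sls_sales IS NULL OR sls_sales <= 0
             OR sls_sales != sls_quantity * ABS(sls_price)
            THEN sls_quantity * ABS(sls_price)
        ELSE sls_sales
    END AS sls_sales,
    -- Derive the price when it is missing or not positive; NULLIF guards against division by zero
    CASE
        WHEN sls_price IS NULL OR sls_price <= 0
            THEN sls_sales / NULLIF(sls_quantity, 0)
        ELSE sls_price
    END AS sls_price
FROM sales;
```

With this sample the new sales values are 2, 2, 4, 9, 10 and 21, and the two bad prices become 5 and 21. Note that dividing two integers in SQL Server gives an integer, so a decimal column type would be needed if prices can have fractions.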
So for now I don't see any scenario where the data is wrong. So everything looks better than before. And with that
we have applied the business rules from the experts and we have cleaned up the data in the data warehouse. And this is
way better than before because we are presenting now better data for analysis and reporting. But it is challenging and
you have exactly to understand the business. So now what we're going to do we're going to go and copy those
informations and integrate it in our query. So instead of sales we're going to get our new calculation and instead
of the price we will get our correct calculation and here I'm missing the end. Let's go and run the whole thing
again. So with that we have as well now cleaned sales quantity and price and it is following our business rules. So with
that we are done cleaning up the sales details. The next step is to insert the results into the silver sales details table. But
we have to check the DDL again. So all you have to do is compare those results with the DDL. So
the first one is the order number. It's fine. The product key, the customer ID are fine too, but here we have an issue. All those
date columns are now dates and not integers. So we have to go and change the data type in the DDL. And with that we have a better
data type than before. Then the sales, quantity, and price are correct. Let's go and drop the table and create it from
scratch again. And don't forget to update your DDL script. So that's it for this. And we're going to go now and
insert the results into our silver table sales details. And we have to go and list now all the columns. I have already
prepared the list of all the columns. So make sure that you have the correct order of the columns. So let's go now
and insert the data. And with that we can see that SQL did insert the data into our sales details. But
now very important is to check the health of the silver table. So what we're going to do instead here of
bronze, we're going to go and switch it to silver. So let's check over here. So here always the order is smaller than
the shipping and the due date, which is really nice. But now I'm very interested in the calculations. So here we're going
to switch it from bronze to silver. And I'm going to go and get rid of all those calculations because we don't need them
here. And now let's see whether we have any issue. Well, perfect. Our data is following the business rules. We don't
have any nulls, negative values, zeros. Now as usual the last step the final check we will just have a final look to
the table. So we have the order number the product key the customer ID those three dates we have the sales quantity
and the price and of course we have our metadata column. Everything is perfect. So now by looking to our code what are
the different types of data transformation that we are doing. So in those three columns we are doing the
following. So at the start we are handling invalid data and this is as well type of transformation and as well
at the same time we are doing data type casting. So we are changing it to more correct data type. And if you are
looking to the sales over here then what we are doing over here is we are handling the missing data and as well
the invalid data by deriving the column from already existing one. And it is as well very similar for the price. We are
handling as well the invalid data by deriving it from specific calculation over here. So those are the different
types of data transformations that you have done in these scripts. All right. Now let's keep
moving to the next system. We have the customer AZ2. So here we have only three columns, and let's start with the
ID first. So here again we have customer information, and if we go and check our model again you can see that
we can connect this table with the CRM table customer info using the customer key. So that means we have to go and
make sure that we can connect those two tables. So let's go and check the other table. We can check, of
course, the silver layer. So let's query it, and we can query both of the tables. Now we can see there are extra
characters here that are not included in the customer key from the CRM. So let's go and search, for example, for this customer
over here, where CID is LIKE this value, so we are searching for a customer that has a similar ID. Now as you can see we are finding this
customer, but the issue is that we have those three characters, NAS. There is no specification or explanation of why we
have the NAS. So actually what we have to do is remove those characters. We don't need them. So let's
check the data again. So it looks like the old data has an NAS at the start, and then afterward we have new data
without those three characters. So we have to clean up those IDs in order to be able to connect them with other tables.
So we're going to do it like this. We're going to start with a CASE WHEN, since we have two scenarios in our data.
So if the CID is LIKE the three characters NAS, so if the ID starts with those three characters, then we're going
to apply a transformation function; otherwise it's going to stay as it is. So that's it. So now we have to go and
build the transformation. So we're going to use SUBSTRING, and then we have to define the string. It's going to be the
CID, and then we have to define the position where it starts cutting or extracting. So we can say 1, 2, 3, and then
4. So we have to define position number four. And then we have to define how many characters should be
extracted. I will make it dynamic, so I will go with the length. I will not go and count how many. So we're going to
say LEN of the CID. So it looks good. If it's LIKE NAS, then go and extract from the CID, at position number four, the rest of
the characters. So let's go and execute it. And I'm missing a comma here again. Now the IDs where we don't have any NAS at the
start, if you scroll down, you can see those are not affected. So with that we now have a nice ID to be
joined with the other table. Of course we can go and test it like this: in the WHERE we take the whole
transformation and say NOT IN. We remove the alias name, of course; we don't need it. And then we write a very simple
subquery: select distinct CST key, the customer key, from the silver table, silver CRM customer
info. So that's it. So let's go and check. So as you can see it is working fine. So we are not able to find any
unmatched data between the customer info from the ERP and the CRM. But of course that is after the transformation. If you don't
use the transformation, so if I just remove it like this, we will find a lot of unmatched data. So this means our
transformation is working perfectly and we can go and remove the original value. So that's it for the first column. Okay.
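The prefix clean-up and the NOT IN test described above can be sketched roughly as follows. This is a minimal T-SQL sketch, not the course's actual script: the table names `erp_customer` and `crm_customer`, the column names `cid` and `cst_key`, and the sample IDs are all placeholders invented for illustration.

```sql
-- Hypothetical sample rows standing in for the ERP and CRM customer tables;
-- all table names, column names, and values here are illustrative assumptions.
WITH erp_customer AS (
    SELECT cid
    FROM (VALUES ('NASAW00011000'), ('NASAW00011001'), ('AW00011002')) AS v(cid)
),
crm_customer AS (
    SELECT cst_key
    FROM (VALUES ('AW00011000'), ('AW00011001'), ('AW00011002')) AS v(cst_key)
),
cleaned AS (
    SELECT
        cid AS cid_original,
        CASE
            -- Old-style IDs start with 'NAS': keep everything from position 4 on.
            -- LEN(cid) is longer than what remains, so SUBSTRING returns the rest.
            WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LEN(cid))
            ELSE cid
        END AS cid_clean
    FROM erp_customer
)
-- Any row returned here is a cleaned ID that still has no match in the CRM keys.
SELECT cid_original, cid_clean
FROM cleaned
WHERE cid_clean NOT IN (SELECT DISTINCT cst_key FROM crm_customer);
```

With these sample rows the final query returns nothing, which is the "no unmatched data" result. One thing to keep in mind with this style of check: if the subquery can return a NULL key, NOT IN will never return rows, so a NOT EXISTS variant may be safer on real data.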
Now moving on to the next field, we have the birthday of the customers. So the first thing to do is to check the data
type. It is a date. So it's fine. It is not an integer or a string. So we don't have to convert anything. But still
there is something to check with the birth date. So we can check whether we have something out of range. So for
example, we can go and check whether we have really old dates among the birth dates. So let's take 1900, or let's
say 1924, and we can take the first day of the month. So let's go and check that. Well, it looks like we have
customers that are older than 100 years. Well, I don't know. Maybe this is correct, but it sounds, of course, strange
to do the business. Of course. And then we can go and check the other boundary where it is almost impossible to have a customer
whose birthday is in the future. So we can say birth date is greater than the current date, like this. So let's go and
run this query. Well, it will not work, because we have to have an OR between the two conditions. And now if we check the
list over here, we have dates that are invalid for birth dates. So all those dates are birthdays in the
future, and this is totally unacceptable. So this is an indicator of bad data quality. Of course you can go and report
it to the source system in order to correct it. So here it's up to you what to do with those dates. Either leave them
as they are, as bad data, or we can go and clean that up by replacing all those dates with a NULL, or maybe replacing
only the extreme ones, where it is 100% incorrect. So let's go and write the transformation for that. As usual,
we're going to start with CASE WHEN birth date is greater than the current date and time, THEN NULL. Otherwise, we
can have an ELSE where we keep the birth date as it is, and then we have an END AS birth date. So, let's go and execute it.
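As a rough illustration of the range check and the clean-up just described, here is a small T-SQL sketch. The table name `erp_customer`, the columns `cid` and `bdate`, the sample dates, and the 1924-01-01 cutoff are assumptions made for the example; only the "future birth date becomes NULL" rule comes from the walkthrough.

```sql
-- Hypothetical sample rows; names, dates, and the cutoff are illustrative only.
WITH erp_customer AS (
    SELECT cid, CAST(bdate AS DATE) AS bdate
    FROM (VALUES
        ('AW00011000', '1916-02-10'),   -- suspiciously old
        ('AW00011001', '1985-07-23'),   -- plausible
        ('AW00011002', '2050-01-01')    -- in the future, clearly invalid
    ) AS v(cid, bdate)
)
SELECT
    cid,
    bdate AS bdate_original,
    -- Only the clearly wrong case is changed: a birth date in the future.
    CASE
        WHEN bdate > GETDATE() THEN NULL
        ELSE bdate
    END AS bdate_clean,
    -- Very old dates are only flagged, not changed, mirroring the decision above.
    CASE WHEN bdate < '1924-01-01' THEN 1 ELSE 0 END AS is_very_old
FROM erp_customer;
```

Keeping the very old dates and only flagging them matches the choice made in the walkthrough: they might be genuine, whereas a future birth date cannot be.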
And with that, we should not get any customer whose birthday is in the future. So, that's it for the birth
date. Now, let's move to the next one. We have the gender. Now again, the gender column has low cardinality. So we
have to go and check all the possible values inside this column. So in order to check all the possible values we're
going to use SELECT DISTINCT gen from our table. So let's go and execute it. And now the data doesn't look really
good. So we have here a NULL, we have an F, we have here an empty string, we have Male, Female, and again we have the M.
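The distinct-value check above, together with one way of collapsing those values into three standard labels (the approach the walkthrough goes on to build), might look like this in T-SQL. The table name `erp_customer` and the sample rows are invented for illustration, and writing "not available" as `'n/a'` is an assumption about the label.

```sql
-- Hypothetical sample rows reproducing the mix of values listed above.
WITH erp_customer AS (
    SELECT gen
    FROM (VALUES
        (CAST(NULL AS VARCHAR(10))), ('F'), (''), ('Male'), ('Female'), ('M')
    ) AS v(gen)
)
SELECT DISTINCT
    gen AS gen_original,
    CASE
        -- TRIM removes stray spaces; UPPER makes the match case-insensitive.
        WHEN UPPER(TRIM(gen)) IN ('F', 'FEMALE') THEN 'Female'
        WHEN UPPER(TRIM(gen)) IN ('M', 'MALE')   THEN 'Male'
        ELSE 'n/a'   -- NULL, empty string, and anything unexpected
    END AS gen_clean
FROM erp_customer;
```

Note that `TRIM` is available from SQL Server 2017 onward; on older versions `LTRIM(RTRIM(gen))` does the same job.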
So this is not really good. And what we're going to do, we're going to go and clean up all those values in order
to have only three values: Male, Female, and not available. So, we're going to do it like this. We're going to say CASE
WHEN, and now we're going to go and trim the values just to make sure there are no empty spaces. And as well, I'm
going to go and use the UPPER function just to make sure that in the future, if we get any lowercase values and so on, we are
covering all the different scenarios. So when this is IN F or, let's say, FEMALE, then make it Female, and we can
go and do the same thing for the male, like this. So if it is an M or MALE, and make sure it is in capital letters because
here we are using UPPER, then it is Male; otherwise, in all other scenarios, it should be not available. So whether it
is an empty string or NULLs and so on. So we have to have an END, of course, AS gen. So now let's go and test it and
check whether we have covered everything. So you can see the M is now Male. The NULL is not available. The F
is Female. The empty string, or maybe spaces here, is not available. Female is going to stay as it is. And the same for
the Male. So with that we are covering all the scenarios and we are following our standards in the project. So I'm
going to go and cut this and put it in our original query over here. So let's go and execute the whole thing. And with
that we have cleaned up all those three columns. Now the question is did we change anything in the DDL? Well we
didn't change anything. We didn't introduce any new column or change any data type. So that means the next step
is we're going to go and insert it into the silver layer. So as usual we're going to say here INSERT INTO silver ERP
the customer table, and then we're going to go and list all the column names: so CID, birth date, and the gender. All right. So
let's go and execute it. And with that we can see it inserted all the data. And of course the very important step as the
next is to check the data quality. So let's go back to our query over here and change it from bronze to silver. So
let's go and check the silver layer. Well of course we are getting those very old customers but we didn't change that.
We only change the birthday that is in the future and we don't see it here in the results. So that means everything is
clean. So for the next one, let's go and check the different genders. And as you can see, we have only those three
values. And of course, we can go and take a final look to our table. So you can see the C ID here, the birth date,
the gender, and then we see our metadata column. And everything looks amazing. So that's it. What are the different types
of data transformations that we have done? First with the ID, what we have done, we have handled invalid values. So
we have removed this part where it is not needed. And the same thing goes for the birth dates. We have handled as well
invalid values. And then for the last one, for the gender, we have done data normalizations by mapping the code to
more friendly value. And as well, we have handled the missing values. So those are the types that we have done in
this code. Okay. Moving on to the second table, we have the location
informations. So we have ERP location A101. So now here the task is easy because we have only two columns and if
you go and check the integration model we can find our table over here. So we can go and connect it together with the
customer info from the other system using a CID with the customer key. So those two informations must be matching
in order to join the tables. So that means we have to go and check the data. So let's go and select the CST key
from, let's go and get the silver data, customer info. So let's go. Now if you go and check the result you can see over
here that we have an issue with the CID: there is a minus between the characters and the numbers, but in the
customer key, the customer number, we don't have anything that splits the characters from the numbers. So if you go and join
those two columns it will not work. So what we have to do is get rid of this minus, because
it is totally unnecessary. So let's go and fix that. It's going to be very simple. So what we're going to do, we're
going to say REPLACE on the CID. So we're going to go and search for the minus and replace it with nothing. It's very simple, like
this. So let's go and query it again. And with that the two columns look very similar to each other. And as well we can go
and query it. So we're going to say where our transformation is not in then we can go and use this as a subquery
like this. So let's go and execute it. And as you can see we are not finding any unmatching data now. So that means
our transformation is working. And with that we can go and connect those two tables together. So if I take the
transformation away you can see that we will find a lot of unmatching data. So the transformation is okay. We're going
to stay with it. And now let's speak about the countries. Now we have here multiple values and so on. What I'm
going to do: this is low cardinality, so we have to go and check all possible values inside this column. So that means
we are checking whether the data is consistent. So we can do it like this: SELECT DISTINCT the
country from our table. I'm just going to go and copy it like this. And as well, I'm going to go and sort the data
by the country. So, let's go and check the values. Now, you can see we have a NULL. We have an empty string,
which is really bad. And then we have a full name of country and then we have as well an abbreviation of the countries.
Well, this is a mix. This is not really good because sometimes we have DE and sometimes we have Germany and then we
have the United Kingdom and then for the United States we have like three versions of the same information which
is as well not really good. So the quality of the country is not really good. So let's go and work on the
transformation. As usual we're going to start with the CASE WHEN. If TRIM country is equal to DE, then we're going
to transform it to Germany. And the next one is going to be about the USA. So if TRIM country is IN, so now let's go
and get those two values, the US and the USA. So US and USA, then it's going to be the United States. So with that we
have covered those three cases as well. Now we have to talk about the NULL and the empty string. So we're going to say
WHEN TRIM country is equal to an empty string OR country IS NULL, then it's going to be not available. Otherwise I
would like to get the country as it is, so TRIM country, just to make sure that we don't have any leading or trailing
spaces. So that's it. Let's go and say this is the country. So it is working and the country information is
transformed. And now what I'm going to do, I'm going to take the whole new transformation and compare it to the old
one. Let me just call this old country, and let's go and query it. So now we can check: those values stay as
before, so nothing changed there. The DE is now Germany. The empty string is not available. The NULL, the same thing, and
the United Kingdom stayed as it was before. And now we have one value for all those variants. So it's only
the United States. So it looks perfect. And with that we have cleaned the second column as well. So with that we have
now clean results. And now the question did we change anything in the DDL? Well we haven't changed anything. Both of
them are varchar. So we can go now immediately and insert it into our table. So insert into silver customer
location. And here we have to specify the columns. It's very simple the ID and the country. So let's go and execute it.
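Put together, the two location transformations described above could be sketched like this in T-SQL. The table name `erp_location`, the column names `cid` and `cntry`, the sample rows, and the `'n/a'` spelling of "not available" are all placeholders assumed for illustration.

```sql
-- Hypothetical sample rows covering the cases discussed above.
WITH erp_location AS (
    SELECT cid, cntry
    FROM (VALUES
        ('AW-00011000', 'DE'),
        ('AW-00011001', 'US'),
        ('AW-00011002', 'USA'),
        ('AW-00011003', 'United States'),
        ('AW-00011004', ' United Kingdom '),
        ('AW-00011005', ''),
        ('AW-00011006', CAST(NULL AS VARCHAR(50)))
    ) AS v(cid, cntry)
)
SELECT
    REPLACE(cid, '-', '') AS cid_clean,      -- drop the separator so the ID can be joined
    cntry                 AS cntry_original, -- kept only to compare old and new values
    CASE
        WHEN TRIM(cntry) = 'DE'            THEN 'Germany'
        WHEN TRIM(cntry) IN ('US', 'USA')  THEN 'United States'
        WHEN TRIM(cntry) = '' OR cntry IS NULL THEN 'n/a'
        ELSE TRIM(cntry)                     -- keep the value, minus stray spaces
    END AS cntry_clean
FROM erp_location;
```

Selecting the original and the cleaned country side by side is the same old-versus-new comparison used in the walkthrough before the SELECT is turned into the INSERT.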
And as you can see, all those values are now inserted. Of course, as the next step, we go and double-check the
data. I would just go and remove all this stuff here as well. And instead of bronze, let's go with the
silver. So, as you can see, all the values of the country look good. And let's have a final look at the table.
So, like this. So, we have the IDs without the separator. We have the countries and as well our metadata
information. So, with that, we have cleaned up the data for the location. Okay. So now what are the different
types of data transformation that we have done here? First, we have handled invalid values. So we have replaced the
minus with an empty string, and for the country we have done data normalization. So we have replaced codes with friendly
values, and at the same time we have handled missing values by replacing the empty string and NULL with not
available. And one more thing, of course: we have removed the unwanted spaces. So those are the different types of
transformation that we have done for this table. Okay guys, now keep the energy
up, keep the spirit up. We have to go and clean up the last table in the bronze layer. And of course, we cannot
go and skip anything. We have to check the quality and to detect all the errors. So now we have a table about the
categories for the products. And here we have like four columns. Let's go and start with the first one, the ID. As you
can see in our integration model, we can connect this table together with the product info from the CRM using the
product key. And as you remember in the silver layer, we have created an extra column for that in the product info. So
if you go and select those data, you can see we have a column called category ID and this one is exactly matching the ID
that we have in this table and we have done the testing. So this ID is ready to be used together with the other table.
So there is nothing to do over here. And now for the next columns, they are strings. And of course we can go and
check whether there are any unwanted spaces. So we are checking for the unwanted spaces. So let's go and check:
SELECT * FROM, and we're going to go and get the same table, like this here. And first we are checking the category.
So the category is not equal to the category after trimming the unwanted spaces. So let's go and execute it. And
as you can see we don't have any results. So there are no unwanted spaces. Let's go and check the other
column. For example, the subcategory, the next one. So let's get the subcategory and run the query as well.
We don't have anything. So that means we don't have unwanted spaces for the subcategory. Let's go now and check the
last column. So I will just copy and paste. Now let's get the maintenance and let's go and execute. And as well, no
results. Perfect. We don't have any unwanted spaces inside this table. So now the next step is that we're going to
go and check the data standardization, because all those columns have low cardinality. So what we can do, we can
say SELECT DISTINCT, and let's get the cat, the category, from our table. I'll just copy
and paste it and check all values. So as you can see we have the accessories, bikes, clothing and components.
Everything looks perfect. We don't have to change anything in this column. Let's go and check the subcategory. And if you
scroll down, all values are friendly and nice as well. Nothing to change here. And let's go and check the last column,
the maintenance. Perfect. We have only two values, yes and no. We don't have any nulls. So my friends, that means
this table has really nice data quality and we don't have to clean up anything. But still, we have to follow our
process. We have to go and load it from the bronze to the silver even if we didn't transform anything. So our job is
really easy. Here we're going to go and say insert into silver dot ERP px and so on. And we're going to go and define the
columns. So it's going to be the ID, the category, subcategory, maintenance. So that's it.
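The two kinds of checks run on this table, unwanted spaces and distinct values, can be sketched as below. This is an illustrative T-SQL sketch using a temporary table; the table name `#erp_category`, the column names, and the sample rows are assumptions rather than the course's real objects.

```sql
-- Hypothetical stand-in for the bronze category table.
DROP TABLE IF EXISTS #erp_category;
CREATE TABLE #erp_category (
    id          VARCHAR(50),
    cat         VARCHAR(50),
    subcat      VARCHAR(50),
    maintenance VARCHAR(50)
);
INSERT INTO #erp_category (id, cat, subcat, maintenance)
VALUES ('AC_BR', 'Accessories', 'Bike Racks',     'Yes'),
       ('BI_MB', 'Bikes',       'Mountain Bikes', 'Yes'),
       ('CL_CA', 'Clothing',    'Caps',           'No');

-- Check 1: unwanted spaces. Expectation: no rows.
SELECT id, cat, subcat, maintenance
FROM #erp_category
WHERE cat != TRIM(cat)
   OR subcat != TRIM(subcat)
   OR maintenance != TRIM(maintenance);

-- Check 2: standardization of a low-cardinality column. Inspect the list by eye.
SELECT DISTINCT maintenance
FROM #erp_category;
```

One caveat worth knowing: SQL Server ignores trailing spaces when comparing strings with `=` or `!=`, so the first check reliably flags leading spaces only. Comparing `DATALENGTH(cat)` with `DATALENGTH(TRIM(cat))` is one way to catch trailing spaces too.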
Let's go and insert the data. Now, as usual, what we're going to do, we're going to go and check the data. So
silver ERP. Let's have a look. All right. So we can see the ids are here, the
categories, the subcategories, the maintenance and we have our meta column. So everything is inserted correctly. All
right. So now I have all those queries and the insert statements for all six tables. And now what is important before
inserting any data, we have to make sure that we are truncating and emptying the table because if you run this query
twice, what's going to happen? You will be inserting duplicates. So first truncate the data and then do a full
load insert all data. So we're going to have one step before it's like the bronze layer. We're going to say
truncate table and then we will be truncating the silver customer info and only after that we have to go and insert
the data. And of course we can go and give this nice information at the start. So first we are truncating the table and
then inserting. So if I go and run the whole thing. So let's go and do it. It will be working. So if I can run it
again, we will not have any duplicates. So we have to go and add this step before each insert. So let's go and do
that. All right. So I'm done with all tables. So now let's go and run everything. So let's go and execute it.
And we can see in the messaging everything working perfectly. So with that we made all tables empty. And then
we inserted the data. So perfect. With that we have a nice script that loads the silver layer.
But of course, like the bronze layer, we're going to put everything in one stored procedure. So let's go and do
that. We'll go to the beginning over here and say CREATE OR ALTER PROCEDURE, and we're going to put it in the schema
silver and, using the naming convention, load silver, and we're going to go over here and say BEGIN and take the whole
code, and it is a long one, and give it one push with a tab, and then at the end we're going to say END. Perfect. So we
have our stored procedure, but we forgot the AS here. With that we will not have any error. Let's go and execute it. So
the stored procedure is created. If you go to Programmability you will find two procedures, load bronze and load
silver. So now let's go and try it out. All you have to do now is execute silver load silver. So let's
execute the stored procedure, and with that we will get the same results. This stored procedure is now responsible for
loading the whole silver layer. Now of course the messaging here is not really good, because we have learned in the
bronze layer that we can go and add many things like handling errors, doing nice messaging, capturing the duration. So
now your task is to pause the video, take this stored procedure, and go and transform it to be very similar to the
bronze layer, with the same messaging and all the add-ons that we have added. So pause the video now. I will do it as
well offline and I will see you soon. Okay. So I hope you are done and I
can show you the results. It's like the bronze layer. We have defined at the start a few variables in order to capture
the duration. So we have the start time, the end time, batch start time and batch end time. And then we are printing a lot
of stuff in order to have nice messaging in the output. So at the start we are saying loading the silver layer,
and then we start splitting by the source system. So loading the CRM tables, and I'm going to show you only one table
for now. So we are setting the timer. So we are saying start time, and we assign the current date and time to it. Then we are
doing the usual. We are truncating the table and then we are inserting the new data after cleaning it up. And
we have this nice message. We will say load duration, where we are finding the difference between the start time and
the end time using the function DATEDIFF. And we want to show the result in seconds. So we are just printing how
long it took to load this table. And we're going to go and repeat this process for all the tables. And of
course we are putting everything in TRY and CATCH. So SQL is going to go and try to execute the TRY part. And if
there are any issues, SQL is going to go and execute the CATCH. And here we are just printing a few pieces of information like the
error message, the error number, and the error state. And we are following exactly the same standard as in the bronze
layer. So let's go and execute the whole thing. And with that we have updated the definition of the stored procedure. Let's
go now and execute it. So execute silver dot load silver. So let's go and do that. It went very fast, less than
1 second, again because we are working on a local machine. Loading the silver layer, loading the CRM tables, and we can
see this nice messaging. So it starts with truncating the table, inserting the data, and we are getting the load
duration for this table, and you will see that everything is below 1 second, whereas in real projects you will
get of course more than 1 second. So at the end we have the load duration of the whole silver layer. And now I have one
more thing for you. Let's say that you are changing the design of this stored procedure for the silver layer. You are
adding different types of messaging or maybe you're creating logs and so on. So now all those new ideas and redesigns
that you are doing for the silver layer, you always have to think about bringing the same changes as well to the other
stored procedure for the bronze layer. So always try to keep your code following the same standards. Don't have one
idea in one stored procedure and an old idea in another one. Always try to maintain those scripts and to keep them
all up to date following the same standards. Otherwise, it can be really hard for other developers to understand
the code. I know that needs a lot of work and commitment, but it is your job to make everything follow the
best practices and follow the same naming convention and standards that you set for your projects. So guys, now we
have two very nice ETL scripts. One that loads the bronze layer and another one for the silver layer. So now our data
warehouse is very simple. All what you have to do is to run first the bronze layer and with that we are taking all
the data from the CSV files from the source and we put it inside our data warehouse in the bronze layer and with
that we are refreshing the whole bronze layer. Once it's done, the next step is to run the stored procedure of the silver
layer. So once you execute it you are taking all the data from the bronze layer, transforming it, cleaning it up, and
then loading it to the silver layer. And as you can see the concept is very simple. We are just moving the data from
one layer to another layer with different tasks. All right guys, so as you can see in the silver layer we have done a lot
of data transformations and we have covered all the types that we have in the data cleansing. So we remove
duplicates, data filtering, handling missing data, invalid data, unwanted spaces, casting the data types and so
on. And as well we have derived new columns, we have done data enrichment and we have normalized a lot of data. So
now of course what we have not done yet is business rules and logic, data aggregations, and data integration. This
is for the next layer. All right my friends. So finally we are done cleaning up the data and checking the quality of
our data. So we can go and close those two steps. And now to the next step we have to go and extend the data flow
diagram. So let's go. Okay. So now let's go and extend our data flow for the silver layer. So, what
I'm going to do, I'm just going to go and copy the whole thing and put it side by side to the bronze layer. And let's
call it silver layer. And the table name is going to stay as before because we have like one to one like the bronze
layer. But what we're going to do, we're going to go and change the coloring. So, I'm going to go and mark everything and
make it gray like silver. And of course, what is very important is to make the lineage. So, I'm going to go now from
the bronze and take an arrow and put it to the silver table. And now with that we have a lineage between three
layers, and if you are checking this table, the customer info, you can understand: aha, this comes from the bronze layer, from
the customer info, and as well this comes from the source system CRM. So now we can see the lineage between different layers,
and without looking at any scripts and so on, in one picture you can understand the whole project. So I don't have to
explain a lot of stuff. By just looking at this picture you can understand how the data is flowing between the sources, the
bronze layer, silver layer, and to the gold layer, of course, later. So, as you can see, it looks really nice and clean.
All right. So, with that, we have updated the data flow. Next, we're going to go and commit our work in the Git repo.
So, let's go. Okay. So, now let's go and commit our scripts. We're going to go to the
folder scripts. And here we have a silver folder. If you don't have it, of course, you can go and create it. So,
first we're going to go and put the DDL scripts for the silver layer. So let's go and I will paste the code over here.
And as usual, we have this comment as the header explaining the purpose of this script. So let's go and commit our work.
And we're going to do the same thing for the stored procedure that loads the silver layer. So I'm going to go over
here. I already have a file for that. So let's go and paste that. So we have here our stored procedure. And as usual at
the start, we have a comment as well. So this script is doing the ETL process where we load the data from bronze into silver.
So the action is to truncate the table first and then insert transformed, cleansed data from bronze to silver. There are no
parameters at all. And this is how you can use the stored procedure. Okay. So we're going to go and commit our work.
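To make the header comment, the timing, and the error handling described above concrete, here is a rough T-SQL sketch of a single load step. It is written as a stand-alone batch over temporary tables so it can run anywhere; in the walkthrough the equivalent body sits inside a CREATE OR ALTER PROCEDURE for the silver schema. The table names `#bronze_demo` and `#silver_demo`, the sample rows, and the exact message texts are assumptions for illustration.

```sql
/*
  Purpose : ETL step that loads a silver table from its bronze source.
  Actions : truncate the silver table, then insert transformed, cleansed rows.
  Params  : none.
  Note    : temp tables and message wording are illustrative placeholders.
*/
DROP TABLE IF EXISTS #bronze_demo;
DROP TABLE IF EXISTS #silver_demo;
CREATE TABLE #bronze_demo (id INT, cat VARCHAR(50));
CREATE TABLE #silver_demo (id INT, cat VARCHAR(50));
INSERT INTO #bronze_demo (id, cat) VALUES (1, ' Bikes '), (2, 'Clothing');

DECLARE @start_time DATETIME, @end_time DATETIME;

BEGIN TRY
    SET @start_time = GETDATE();              -- start the timer for this table

    PRINT '>> Truncating table: #silver_demo';
    TRUNCATE TABLE #silver_demo;

    PRINT '>> Inserting data into: #silver_demo';
    INSERT INTO #silver_demo (id, cat)
    SELECT id, TRIM(cat) FROM #bronze_demo;

    SET @end_time = GETDATE();                -- stop the timer
    PRINT '>> Load duration: '
        + CAST(DATEDIFF(SECOND, @start_time, @end_time) AS NVARCHAR(20))
        + ' seconds';
END TRY
BEGIN CATCH
    -- Runs only if something in the TRY block fails.
    PRINT 'Error message: ' + ERROR_MESSAGE();
    PRINT 'Error number : ' + CAST(ERROR_NUMBER() AS NVARCHAR(20));
    PRINT 'Error state  : ' + CAST(ERROR_STATE() AS NVARCHAR(20));
END CATCH;
```

In a full load procedure the block between the two timer assignments would be repeated once per table, with a second pair of variables timing the whole batch.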
And now one more thing that we want to commit to our project: all those queries that you have built to check the quality
of the silver layer. So this time we will not put them in the scripts. We're going to go to the tests, and here we're
going to go and make a new file called quality checks silver, and inside it we're going to go and paste all the
queries that we have built. I have just reorganized them here by table. So here we can see all the checks that we have
done during the course, and at the header we have nice comments. So here we are just saying that this script is
going to check the quality of the silver layer and we are checking for nulls, duplicates, unwanted spaces, invalid
date ranges and so on. So each time you come up with a new quality check, I'm going to recommend that you share it
with the project and with the other teams in order to make it part of the checks that you do after running the ETL. So
that's it. I'm going to go and put those checks in our repo, and in case I come up with a new check, I'm going to go and
update it. Perfect. So now we have our code in our repository. All right. So with that, our code is saved and we are
done with the whole epic. So we have built the silver layer. Now let's go and minimize it. And now we come to my
favorite layer, the gold layer. So we're going to go and build it. The first step, as usual: we have to analyze. And this
time we're going to explore the business objects. So let's go. All right. So now we come to the big
question. How are we going to build the gold layer? As usual, we start with analyzing. So now what we're going to do
here is to explore and understand what are the main business objects that are hidden inside our source system. So as
you can see we have two sources six files and here we have to identify what are the business objects. Once we have
this understanding then we can start coding and here the main transformation that we are doing is data integration.
And here I usually split it into three steps. First, we're going to build those business objects that we have
identified. After we have a business object, we have to look at it and decide what type of table it is. Is it a
dimension? Is it a fact? Or is it maybe a flat table? And the last step is of course to rename all the columns into
something friendly and easy to understand, so that our consumers don't struggle with technical names. Once we have
done all those steps, it's time to validate what we have created. The new data model that we have created should be
connectable, and we have to check that the data integration is done correctly. And once everything is fine, we cannot
skip the last step: we have to document and commit our work in Git. Here we will be introducing new types of
documentation. We're going to have a diagram of the data model. We're going to build a data dictionary where we
describe the data model. And of course we're going to extend the data flow diagram.
So this is our process. Those are the main steps that we will do in order to build the gold
layer. Okay. So what exactly is data modeling? Usually the source system delivers raw data to you:
unorganized, messy, and not very useful in its current state. Data modeling is the process of taking this
raw data and then organizing and structuring it in a meaningful way. So what we are doing is putting the data in
new, friendly, easy-to-understand objects like customers, orders, and products. Each one of them is focused on
specific information, and what is very important is that we describe the relationships between those objects
by connecting them using lines. What you have built on the right side we call a logical data model. If you compare
it to the left side, you can see the data model makes it really easy to understand our data, the relationships, and
the processes behind them. Now in data modeling we have three different stages, or let's say three different ways to
draw a data model. The first stage is the conceptual data model. Here the focus is only on the entities. So we have
customers, orders, products, and we don't go into details at all. We don't specify any columns or attributes inside
those boxes. We just want to focus on the entities that we have and the relationships between them. So
the conceptual data model doesn't focus on the details at all. It just gives the big picture. The second data model
that we can build is the logical data model. Here we start specifying the different columns that we can
find in each entity, like the customer ID, the first name, the last name and so on. We still draw the relationships
between those entities, and we make it clear which columns are the primary keys and so on. So as you can see,
we have more details here, but we don't describe a lot of details for each column, and we don't worry about how
exactly we are going to store those tables in the database. The third and last stage is the physical data model.
This is where everything gets ready before creating it in the database. So here you have to add all the technical
details, like the data type of each column, the length of each data type, and many other database techniques
and details. So again, the conceptual data model gives us the big picture, in the logical data
model we dive into the details of what data we need, and the physical data model prepares everything for the
implementation in the database. To be honest, in my projects I only draw the conceptual and the logical data models,
because drawing and building the physical data model needs a lot of effort and time, and there are many
tools, like in Databricks, that automatically generate those models. So in this project we're going to
draw the logical data model for the gold layer. All right. Now for analytics,
and especially for data warehousing and business intelligence, we need a special data model that is optimized for
reporting and analytics, and it should be flexible, scalable and easy to understand. For that we have two
special data models. The first type is the star schema. It has a central fact table in the middle,
surrounded by dimensions. The fact table contains transactions and events, and the dimensions contain descriptive
information. The relationships between the fact table in the middle and the dimensions around it form a
star shape, and that's why we call it star schema. And we have another data model called the snowflake schema. It looks
very similar to the star schema. So we have again the fact in the middle, surrounded by dimensions. But the big
difference is that we break the dimensions into smaller subdimensions, and as you extend the dimensions, the shape
of this data model starts to look like a snowflake. Now if you compare them side by side, you can see
that the star schema looks easier, right? It is usually easy to understand and easy to query, so it is really perfect
for analysis. But it has one issue: the dimensions might contain duplicates, and they get bigger over
time. Now if you compare it to the snowflake, you can see the schema is more complex. So you need a lot of
knowledge and effort in order to query something from the snowflake. But the main advantage here comes with the
normalization: as you are breaking those redundancies into small tables, you can optimize the storage. But to be honest,
who cares about the storage? So for this project, I have chosen to use the star schema because it is very commonly used.
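
As a rough sketch of the star schema idea described above: the fact table carries keys to the dimensions, a date, and measures, while each dimension carries descriptive columns. All table and column names below are hypothetical and only for illustration; they are not the project's actual gold objects.

```sql
-- Hypothetical star schema: names and columns are illustrative assumptions.

-- Dimensions hold descriptive context (who, what, where).
CREATE TABLE dim_customers (
    customer_key INT PRIMARY KEY,
    first_name   VARCHAR(50),
    country      VARCHAR(50)
);

CREATE TABLE dim_products (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(50),
    category     VARCHAR(50)
);

-- The fact table holds keys to the dimensions, a date, and measures (how much, how many).
CREATE TABLE fact_sales (
    order_number VARCHAR(20),
    customer_key INT REFERENCES dim_customers (customer_key),
    product_key  INT REFERENCES dim_products (product_key),
    order_date   DATE,
    sales_amount INT,
    quantity     INT
);

-- A typical reporting query joins the fact to its dimensions and aggregates a measure.
SELECT
    c.country,
    p.category,
    SUM(f.sales_amount) AS total_sales
FROM fact_sales AS f
LEFT JOIN dim_customers AS c ON f.customer_key = c.customer_key
LEFT JOIN dim_products  AS p ON f.product_key  = p.product_key
GROUP BY c.country, p.category;
```

Every dimension is one join away from the fact table, which is what keeps star schema queries short.
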
The star schema is also perfect for reporting, for example if you're using Power BI, and we don't have to worry about
the storage. So that's why we're going to adopt this model to build our gold layer. Okay. So now one more thing about
those data models is that they contain two types of tables: facts and dimensions. So what do I mean when I say this is a
fact table or a dimension table? Well, a dimension contains descriptive information, or categories that give some
context to your data. For example, for product info you have the product name, category, subcategory and so on. This
is a table that describes the products, and this we call a dimension. On the other hand we have facts. They are
events, like transactions. They contain three important pieces of information. First, you have multiple IDs from
multiple dimensions. Then we have date information, like when the transaction or the event happened. And the third
type of information is measures and numbers. So if you see those three types of data in one table,
then this is a fact. If you have a table that answers how much or how many, then this is a fact. But if you have a
table that answers who, what, where, then this is a dimension table. So this is what dimension and fact
tables are. All right my friends. So far, in the bronze layer and in the silver layer, we didn't discuss anything about
the business. The bronze and silver layers were very technical. We were focusing on data ingestion and on
cleaning up the data and its quality. But the tables are still very oriented to the source systems. Now comes
the fun part in the gold layer, where we're going to break the whole data model of the sources. We're
going to create something completely new for our business that is easy to consume for business reporting and analysis.
Here it is very important to have a clear understanding of the business and the processes. If you don't have it
already, at this phase you really have to invest time in meeting the process experts and the domain experts in order to
have a clear understanding of what we are talking about in the data. So now we're going to try to
detect the business objects that are hidden in the source systems. So now let's go and explore that. All
right. Now in order to build a new data model, I first have to understand the original data model. What are the main
business objects that we have? How are things related to each other? This is a very important process in
building a new model. What I usually do is start giving labels to all those tables. So if you go to the shapes
over here, let's search for label. And if we go to more icons, I'm going to take this label over
here. So drag and drop it. Then I'm going to increase the size of the font. Let's go with 20 and
bold, just to make it a little bit bigger. Now by looking at this data model, we can see that we have product
information in the CRM and in the ERP as well. And then we have customer information and a transactional table.
So now let's focus on the product. The product information is over here. We have here the current and the historical
product information, and here we have the categories that belong to the products. So in our data model we have
something called products. Let's create this label. It's going to be product, and let's give it a
color in the style. Let's pick for example the red one. Now let's move this label and put it beneath this
table over here. With that I have a label saying this table belongs to the object called products. Now I'm
going to do the same thing for the other table over here. So I'm going to tag this table as product as well,
so that I can easily see which tables from the sources have information about the product business object. All
right. Moving on, we have here a table called customer information, with a lot of information about the
customer. In the ERP we have customer information as well, where we have the birthday and the country. So those three
tables have to do with the object customer. That means we're going to label them like that. Let's call
it customer, and I'm going to pick a different color for that. Let's go with the green. So I will tag this table like
this, and do the same thing for the other tables. So copy, tag the second table and the third table. Now it is very
easy for me to see which table belongs to which business object. And now we have the final table over here, the only
table about the sales and orders. In the ERP we don't have any information about that. So this one is going to be easy.
Let's call it sales. And let's move it over here. And maybe change the color as well, to for example this color
over here. Now this step is very important when building any data model in the gold layer. It gives you the big
picture of the things that you are going to model. So now the next step is that we're going to build those
objects step by step. Let's start with the first object, our customers. Here we have three tables,
and we're going to start with the CRM. So let's start with this table over here. All right. So with that we know
what our business objects are, and this task is done. In the next step we're going to go back to SQL and
start doing data integration and building a completely new data model. So let's go and do
that. Now let's have a quick look at the gold layer specifications. This is the final stage. We're going to provide
data to be consumed by reporting and analytics. And this time we will not be building tables. We will be using views.
So that means we will not have stored procedures or any load process for the gold layer. All we are doing is
data transformation, and the focus of the data transformation is going to be data integration, aggregation, business
logic and so on. And this time we're going to introduce a new data model. We will be doing a star schema. So those are
the specifications for the gold layer, and this is our scope. This time we make sure that we are selecting data
from the silver layer, not from the bronze, because the bronze has bad data quality and in the silver everything is
prepared and cleaned up. So in order to build the gold layer, we're going to be targeting the silver layer. Let's
start with select star from, and we're going to go to the silver CRM customer info. So let's hit execute. And
now we're going to select the columns that we need to present in the gold layer. So let's start selecting
the columns that we want. We have the ID, the key, the first name. I will not get the metadata
information. This only belongs to the silver layer. Perfect. The next step is that I'm going to give this table an
alias. Let's call it CI. And I'm going to make sure that we are selecting from this alias, because later
we're going to join this table with other tables. So something like this. We're going to go with those
columns. Now let's move to the second table. Let's get the birthday information. So now we're going to jump
to the other system, and we have to join the data by the CID together with the customer key. So now we have to
join the data with another table. Here I try to avoid using an inner join, because if the other table doesn't have
all the information about the customers, I might lose customers. So always start with the master table, and if you
join it with any other table in order to get information, always try to avoid an inner join, because the other source
might not have all the customers, and if you do an inner join you might lose customers. So I tend to start from the
master table, and then everything else is a left join. So I'm going to say left join silver ERP customer a12. Let's give
it the alias CA. And now we have to join the tables. From the first table, CI, it's going to be
the customer key, equal to CA, where we have the CID. Now of course we're going to get matching data, because we checked
the silver layer. If we hadn't prepared the data in the silver layer, we would have to do a preparation step here in
order to join the tables. But we don't have to do that, because that was a pre-step in the silver layer. So now you
can see the system that we have in this bronze, silver, gold. So now after joining the
information that we need from the second table which is the birth date. So B date dates and as well from this table there
is another nice information it is the gender information. So that's all what we need from the second table. Let's go
and check the third table. So the third table is about the location information the countries and as well we connect the
tables by the CID with the key. So let's go and do that. We're going to say as well left join silver ERP location and
I'm going to give it the name LA and then we have to join Y the keys the same thing it's going to be CI customer key
equal to LA CI ID again we have prepared those ids and keys in the server layer so the join should be working now we
have to go and pick the data from the second table so what do we have over here we have the ID the country and the
metadata information so let's go and just get the country Perfect. So now with that we have joined all the three
tables and we have picked all the columns that we want in this object. So again by looking over here we have
joined this table with this one and this one. So with that we have collected all the customer informations that we have
from the two source systems. Okay. So now let's go and query in order to make sure that we have everything correct and
in order to understand whether your joins are correct, you have to keep your eye on those three columns. If you are
seeing that you are getting data, that means you are doing the joins correctly. But if you are seeing a lot of
nulls or no data at all, that means your joins are incorrect. Here it looks to me like it is working. Another check
that I do: even if your first table has no duplicates, what could happen is that after doing multiple joins you
might start getting duplicates, because the relationship between those tables is not a clear one to one. You might
get a one to many relationship or a many to many relationship. So the check that I usually do at this
stage is to make sure that I don't have duplicates in the results, so that we don't have multiple
rows for the same customer. In order to do that, we do a quick group by. We're going to group the data
by the customer ID and then do the count from this subquery. So this is the whole subquery, and after that we
say group by the customer ID, and then we say having count higher than one. This query
actually tries to find out whether we have any duplicates in the primary key. So let's execute it. We don't have
any duplicates, and that means that after joining all those tables with the customer info, those tables didn't cause
any issues and didn't duplicate my data. So this is a very important check to make sure that you are on the right track. All
right. So that means everything is fine with the duplicates. We don't have to worry about it. Now we have here an
integration issue. Let's execute it again. If you look at the data, we have two sources for the
gender information: one comes from the CRM and another one comes from the ERP. So the question is what we're going
to do with this. Well, we have to do data integration. Let me show you how I do it. First I open a new query,
then I remove all the other stuff and leave only those two columns, and I use
distinct just to focus on the integration. Let's execute it, and maybe add an order by as well. So
let's do one and two. Let's execute it again. Now here we have all the scenarios. We can see that
sometimes there is a match: from the first table we have female and from the other table we have female as well. But
sometimes we have an issue, where those two tables are giving different information, and the same thing over
here. So this is an issue as well: different information. Another scenario is where we have data from the first
table, like here we have female, but in the other table we have not available. Well this is not a problem.
So we can get it from the first table. But we have the exact opposite scenario as well, where the
data is not available from the first table but it is available from the second table. And now you might wonder why
I'm getting a null over here. We did handle all the missing data in the silver layer, and we replaced everything with
not available. So why are we still getting a null? This null doesn't come directly from the tables. It comes from
joining the tables. That means there are customers in the CRM table that are not available in the ERP table, and if
there is no match, we get a null from SQL. So this null means there was no match, and that's why
we are getting it. It is not coming from the content of the tables, and this is of course an issue. But now
the big issue is those two scenarios where we have the data but they are different. Here again we
have to ask the experts: what is the master here? Is it the CRM system or the ERP? Let's say their
answer is that the master data for the customer information is the CRM. That means the CRM information is more
accurate than the ERP information, and this is only about the customers of course. So for this scenario, where we
have female and male, the correct information is female, from the first source system. The same goes over here,
and here, where we have male and female, the correct one is male, because this source system is the master. Okay.
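
To make the steps so far concrete, here is a minimal sketch of the three-table left join, the duplicate check, and the gender comparison. The table names, column names, sample rows, and the 'n/a' text standing in for "not available" are all assumptions; the narration only names the silver tables in spoken form, so small stand-in tables are created here to keep the sketch self-contained.

```sql
-- Hypothetical stand-ins for the silver tables (names, columns and rows are assumptions).
CREATE TABLE crm_cust_info (cst_id INT, cst_key VARCHAR(20), cst_firstname VARCHAR(50), cst_gndr VARCHAR(10));
CREATE TABLE erp_cust_az12 (cid VARCHAR(20), bdate DATE, gen VARCHAR(10));
CREATE TABLE erp_loc_a101  (cid VARCHAR(20), cntry VARCHAR(50));

INSERT INTO crm_cust_info VALUES
    (1, 'AW001', 'Ana',  'Female'),
    (2, 'AW002', 'Ben',  'n/a'),
    (3, 'AW003', 'Cleo', 'Male');
INSERT INTO erp_cust_az12 VALUES
    ('AW001', '1990-05-01', 'Female'),
    ('AW002', '1985-11-23', 'Male');      -- deliberately no ERP row for AW003
INSERT INTO erp_loc_a101 VALUES
    ('AW001', 'Germany'),
    ('AW002', 'France'),
    ('AW003', 'Canada');

-- 1) Start from the master table (CRM) and LEFT JOIN everything else,
--    so customers without a match in the ERP tables are kept.
SELECT ci.cst_id, ci.cst_key, ci.cst_firstname, ci.cst_gndr, ca.bdate, ca.gen, la.cntry
FROM crm_cust_info AS ci
LEFT JOIN erp_cust_az12 AS ca ON ci.cst_key = ca.cid
LEFT JOIN erp_loc_a101  AS la ON ci.cst_key = la.cid;

-- 2) Duplicate check: the joins must not multiply customers.
--    An empty result means each customer still appears exactly once.
SELECT cst_id, COUNT(*) AS row_count
FROM (
    SELECT ci.cst_id
    FROM crm_cust_info AS ci
    LEFT JOIN erp_cust_az12 AS ca ON ci.cst_key = ca.cid
    LEFT JOIN erp_loc_a101  AS la ON ci.cst_key = la.cid
) AS joined
GROUP BY cst_id
HAVING COUNT(*) > 1;

-- 3) Compare the two gender sources side by side.
--    AW003 has no ERP row, so its ERP gender is NULL: the NULL comes from the join, not from the data.
SELECT DISTINCT ci.cst_gndr, ca.gen
FROM crm_cust_info AS ci
LEFT JOIN erp_cust_az12 AS ca ON ci.cst_key = ca.cid
ORDER BY 1, 2;
```
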
So now let's go and build this business rule. We're going to start as usual with CASE WHEN. The first, very
important rule is: if we have data in the gender information from the CRM system, from the master, then use
it. So we're going to check the gender information from the CRM table: customer gender is not equal to not
available. That means we have a value, male or female. Let me just put a comma here like this. Then what's going to
happen? Go and use it. So we're going to use the value from the master. CRM is the master for gender info. Now
otherwise, that means it is not available from the CRM table. Then go and grab the information from the second
table. So we're going to say CA gender. But now we have to be careful with this null over here. We have to convert it to
not available as well. So we're going to use COALESCE: if this is a null, then
use not available, like this. So that's it. Let's have an END. And let me just push this over here. Let's
call it new gen for now. Let's execute it and check the different scenarios. For all those
values over here, we have data from the CRM system, and this is represented in the new column as well. But now
for the second part, we don't have data from the first system, so we are trying to get it from the second system. For
the first one, it is not available, and then we try to get it from the second source system. So now we are activating
the ELSE. Well, it is null, and with that the COALESCE is activated and we are replacing the null with not available.
For the second scenario as well, the first source system doesn't have the gender information. That's why we are
grabbing it from the second. So with that we have female. And the third one is the same thing: we don't have
information, but we get it from the second source system. We have male. And for the last one, it is not available in both source systems.
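
The rule walked through above can be sketched as a single CASE expression. The column names (crm_gndr for the CRM value, erp_gen for the ERP value), the 'n/a' text standing in for "not available", and the sample pairs are assumptions for illustration only.

```sql
-- Each row is one scenario: (gender from CRM, gender from ERP).
-- A NULL in erp_gen models a customer with no matching ERP row after the left join.
SELECT
    crm_gndr,
    erp_gen,
    CASE
        WHEN crm_gndr <> 'n/a' THEN crm_gndr    -- CRM is the master for gender info
        ELSE COALESCE(erp_gen, 'n/a')           -- otherwise fall back to ERP, mapping NULL to 'n/a'
    END AS new_gen
FROM (VALUES
    ('Female', 'Female'),   -- both agree            -> Female
    ('Female', 'Male'),     -- conflict, CRM wins    -> Female
    ('Male',   'n/a'),      -- only CRM has a value  -> Male
    ('n/a',    NULL),       -- no ERP match          -> n/a
    ('n/a',    'Female'),   -- only ERP has a value  -> Female
    ('n/a',    'Male'),     -- only ERP has a value  -> Male
    ('n/a',    'n/a')       -- missing in both       -> n/a
) AS pairs (crm_gndr, erp_gen);
```

Building the expression against a small set of scenarios like this mirrors the approach in the narration: develop the logic in a separate query first, then move it into the main query.
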
In that last case we simply get not available. So with that, as you can see, we have a perfect new column where we are
integrating two different source systems into one. This is exactly what we call data integration. This piece of
information is way better than the source CRM and the source ERP on their own. It is richer and has more
information. And this is exactly why we try to get data from different source systems: in order to get rich information
in the data warehouse. So with that we have a nice logic, and as you can see, it's way easier to separate it into a
separate query in order to first build the logic and then take it to the original query. So what I'm going to do is
copy everything from here and go back to our query. I'm going to delete those two
gender columns and put our new logic over here. So a comma, and let's execute. With that we
have our nice new column. Now we have a very nice object. We don't have duplicates, and we have integrated the data
together. So we took three tables and put them in one object. The next step is that we're going to give it nice,
friendly names. The rule in the gold layer is to use friendly names and not to follow the names that we get from the
source system. And we have to make sure that we are following the rules of the naming conventions, so we are following
snake case. Let's do it step by step. For the first one, let's call it customer ID. For
the next one I will get rid of using keys and so on. I'm going to call it customer number, because those are
customer numbers. For the next one, we're going to call it first name, without any prefixes. The next
one is last name, and here we have marital status. So I will be using the exact name but without the prefix. Here we're
just going to call it gender. This one we're going to call create date, and this one birth date. And the last
one is going to be country. So let's execute it. Now as you can see, the names are really friendly. We have
customer ID, customer number, first name, last name, marital status, gender. So as you can see, the names are
really nice and really easy to understand. In the next step I'm going to think about the order of those
columns. For the first two, it makes sense to have them together. Then the first name and last name. Then I think the
country is very important information, so I'm going to take it from here and put it right after the last name. It is
just nicer. So let's execute it again. First name, last name, country. It's always nice to group relevant
columns together, right? Then we have here the marital status, the gender and so on. And then we have the create date
and the birth date. I think I'm going to switch the birth date with the create date. It's more important than the
create date. Like this. And don't forget the comma here. So execute again. It looks wonderful. Now comes a very
important decision about this object. Is it a fact table or a dimension? Well, as we learned, dimensions hold
descriptive information about an object. And as you can see, we have here descriptions of the customers.
All those columns are describing the customer. We don't have transactions and events here, and
we don't have measures and so on. So we cannot say this object is a fact. It is clearly a dimension. That's why
we're going to call this object the dimension customer. Now there is one thing: if you are creating a new
dimension, you always need a primary key for the dimension. Of course we could depend on the primary key
that we get from the source system, but sometimes you can have dimensions where you don't have a primary key
that you can count on. So what we have to do is generate a new primary key in the data warehouse. And
those primary keys we call surrogate keys. Surrogate keys are system-generated unique identifiers that are assigned to
each record to make the record unique. A surrogate key is not a business key. It has no meaning, and no one in the
business knows about it. We only use it in order to connect our data model. This way we have more control over how to
connect our data model, and we don't always have to depend on the source system. There are different ways to
generate surrogate keys, like defining them in the DDL or using the window function ROW_NUMBER. In this data
warehouse, I'm going to go with a simple solution where we use the window function. So in order to
generate a surrogate key for this dimension, what we're going to do is very simple. We're going to say ROW_NUMBER
OVER, and here we have to order by something. You can order by the create date or the customer ID or the customer
number, whatever you want, but in this example I'm going to order by the customer ID. And we have to follow the
naming convention that all surrogate keys end with key as a suffix. So now let's query that
information. As you can see, at the start we have a customer key, and this is a sequence. Of course we don't have
any duplicates here. This surrogate key is generated in the data warehouse, and we're going to use this key in order to
connect the data model. So with that our query is ready, and the last step is that we're going to create the
object. As we decided, all the objects in the gold layer are going to be virtual, so that means we're going to
create a view. We're going to say create view gold dot dim, following the naming convention where dim stands for
dimension, then customers, and after that we have AS. So with that everything is ready.
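
Putting the pieces together, the view might look like the sketch below. The silver table and column names are assumptions (the narration gives them only in spoken form), as is the 'n/a' text standing in for "not available". The output column names and their order follow the friendly names chosen above. The sketch assumes the gold schema and the silver tables already exist.

```sql
-- Sketch of the customer dimension as a view; source names are assumed, not confirmed.
CREATE VIEW gold.dim_customers AS
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.cst_id) AS customer_key,    -- surrogate key generated in the warehouse
    ci.cst_id                              AS customer_id,
    ci.cst_key                             AS customer_number,
    ci.cst_firstname                       AS first_name,
    ci.cst_lastname                        AS last_name,
    la.cntry                               AS country,
    ci.cst_marital_status                  AS marital_status,
    CASE
        WHEN ci.cst_gndr <> 'n/a' THEN ci.cst_gndr             -- CRM is the master for gender info
        ELSE COALESCE(ca.gen, 'n/a')                           -- fall back to ERP, NULL becomes 'n/a'
    END                                    AS gender,
    ca.bdate                               AS birth_date,
    ci.cst_create_date                     AS create_date
FROM silver.crm_cust_info AS ci
LEFT JOIN silver.erp_cust_az12 AS ca ON ci.cst_key = ca.cid
LEFT JOIN silver.erp_loc_a101  AS la ON ci.cst_key = la.cid;
```

Because this is a view, ROW_NUMBER is evaluated every time the view is queried, so there is no load step to run.
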
Let's go and execute it. It was successful. Let's go to the views now, and you can see our first object. We
have the dimension customers in the gold layer. Now, as you know me, the next step is that we're going to check
the quality of this new object. So let's open a new query: select star from our view dim customers. And
now we have to make sure that everything is in the right position, like this. We can do different checks, like the
uniqueness and so on. But I'm worried about the gender information, so let's have a distinct of all values.
As you can see, it is working perfectly. We have only female, male and not available. So that's it. With that we
have our first new dimension. Okay friends. So now let's build the second object. We have the
products. As you can see, product information is available in both source systems. As usual, we're going to start
with the CRM information, and then we're going to join it with the other table in order to get the category
information. So those are the columns that we want from this table. Now we come to a big decision about this
object. This object contains historical information as well as the current information. Now of course it depends on
the requirements whether you have to do analysis on the historical information. But if you don't have such
a requirement, we can stay with only the current information of the products. So we don't have to include
all the history in the object. And anyway, as we learned from the model over here, we are not using the primary key, we
are using the product key. So what we have to do is filter out the historical data and stay only with
the current data. We're going to have a WHERE condition here. In order to select the current data, we're
going to target the end date. If the end date is null, that means it is current data. Let's
take this example over here. You can see we have three records for the same product key. For the first two
records we have a value in the end date, because they are historical information. But for the last record over
here we have a null, and that's because this is the current information. It is open and not closed yet. So
in order to select only the current information, it is very simple: we can say prd end date is null. If you
now execute it, you will get only the current products. You will not have any history. And of course we can
add a comment to it: filter out all historical data. This also means we don't need the end date in our
selection, because it is always null. So with that we have only the current data. Now the next step is that
we have to go and join it with the product categories from the ERP, and we're going to use the ID here. As
usual, the master information is the CRM and everything else is going to be secondary. That's why I use the left
join, just to make sure I'm not losing or filtering out any data, because with an inner join, if there is no match,
we lose data. So left join silver ERP and the category. Let's call it PC. And now we're going to join
it using the key. From the CRM we have the category ID, equal to PC ID. And now we have to pick columns from
the second table. So from PC we have the category, which is very important, then the
subcategory, and we can get the maintenance as well. So something like this. Let's query. With that we
have all those columns that come from the first table, and those three come from the second. So with that we have
collected all the product information from the two source systems. The next step is to check the
quality of these results. And of course what is very important is to check the uniqueness. So we're going to
have the following query. I want to make sure that the product key is
unique, because we're going to use it later in order to join the table with the sales. So
from this query, and then we have group by product key, and we say having
count higher than one. So let's go and check. Perfect. We don't have any duplicates. The second table didn't
cause any duplicates in our join. This also means we don't have historical data: each product has only
one record, and we don't have any duplicates. So I'm really happy about that. So let's go and query again. Now,
of course, the next step, do we have anything to integrate together? Do we have the same information twice? Well,
we don't have that. The next step is that we're going to go and group the related columns together. So, I'm
going to say the product ID, then the product key, and the product name are together. So, all those three
columns are together. And after that, we can put all the category information together. So, we're going
to have the category ID, the category itself, the subcategory. Let me just query and see the results. So we have
the product ID, key, name, and then we have the category ID, name, and the subcategory, and then maybe as well put the
maintenance after the subcategory like this, and I think the product cost, the line, and the start date could stay at the
end. So let me just check. So those four columns about the category, and then we have the cost, line,
and the start date. I'm really happy with that. The next step, we're going to go and give nice names, friendly names
for those columns. So let's start with the first one. This is the product ID. The next one is going to be the product
number; we need the name "key" for the surrogate key later. And then we have the product name. And after that we have
the category ID and the category. And this is the subcategory. And then the next one is going to stay as it is. I don't
have to rename it. The next one is going to be the cost, then the product line, and the last one is going to be the start date. So
let's go and execute it. Now we can see very nicely in the output all those friendly names for the columns, and it
looks way nicer than before. I don't even have to describe this information; the names describe it. So perfect. Now
the next big decision is: what do we have here? Do we have a fact or a dimension? What do you think? Well, as you can see
here again, we have a lot of descriptions about the products. So all those columns are describing the
business object products. We don't have transactions, events, or a lot of different keys and IDs here. So we don't
really have facts here. We have a dimension. Each row describes exactly one object, one product. That's
why this is a dimension. Okay. So now since this is a dimension, we have to go and create a primary key for it. Well,
actually the surrogate key, and as we have done for the customers, we're going to go and use the window function
ROW_NUMBER in order to generate it, then OVER, and then we have to sort the data. I will go with the start date. So let's go
with the start date and as well the product key, and we're going to give it the name product key, like this. So let's go
and execute it. With that, we have now generated a primary key for each product, and we're going to be using it in order
to connect our data model. All right. Now, the next step with that, we're going to go and build the view. So,
we're going to say CREATE VIEW. We're going to say gold, dot, dim products, and then AS. So, let's go and create our
object. And now, if you go and refresh the views, you will see our second object, the second dimension. So, we
have here in the gold layer the dimension products. And as usual, we're going to go and have a look at this view
just to make sure that everything is fine. So dim products. So let's execute it. And by looking at the data,
everything looks nice. So with that we have now two dimensions. All right friends. So with
that we have covered a lot of stuff. So we have covered the customers and the products and we are left with only one
table where we have the transactions the sales and for the sales information we have only data from the CRM. We don't
have anything from the ERP. So let's go and build it. Okay. So now I have all that information, and now of course we
have only one table. We don't have to do any integration and so on. And now we have to answer the big question. Do we
have here a dimension or a fact? Well, by looking at those details we can see transactions. We can see events. We have
a lot of date information. We have as well a lot of measures and metrics, and as well we have a lot of IDs. So it is
connecting multiple dimensions. And this is exactly a perfect setup for a fact. So we're going to go and use this
information as a fact. And of course, as we learned, a fact is connecting multiple dimensions. We have to present
in this fact the surrogate keys that come from the dimensions. So those two columns, the product key and the
customer ID, come from the source system, and as we learned, we want to connect our data model using
the surrogate keys. So what we're going to do, we're going to replace those two columns with the surrogate keys
that we have generated, and in order to do that we have to go and join now the two dimensions in order to get the
surrogate key, and we call this process a data lookup. So we are joining the tables only in order to get one
piece of information. So let's go and do that. We will go with a left join, of course, so as not to lose any transaction. So first we're
going to go and join it with the product key. Now of course in the silver layer we don't have any surrogate keys. We
have them in the gold layer. So that means for the fact table we're going to be joining the silver layer together with
the gold layer. So, gold, dot, and then the dimension products, and I'm going to just call it PR. And we're going to join
it with SD using the product key, matched with the product number from the dimension. And now the
only information that we need from the dimension is the key, the surrogate key. So, we're going to go over here and say
product key. And what I'm going to do, I'm going to go and remove this column from here because we don't
need it. We don't need the original product key from the source system. We need the surrogate key that we have
generated on our own in this data warehouse. So the same thing is going to happen as well for the customer. So gold
dimension customers: again we are doing a lookup here in order to get the key for SD. So we are joining
using this ID over here equal to the customer ID, because this is a customer ID. And we're going to do the same
thing: we need the surrogate key, the customer key, and we're going to delete the ID because we don't need it. Now we
have the surrogate key. So now let's go and execute it. And now with that we have in our fact table the two keys from
the dimensions. And now this can help us to connect the data model, to connect the facts with the dimensions. So this is
a very necessary step in building the fact table. You have to put the surrogate keys from the dimensions in the facts.
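
The data lookup described above can be sketched roughly like this. It is only an illustration under assumptions: the silver table name silver.crm_sales_details and the sls_* column names are hypothetical (the narration does not spell them out), and the gold dimensions are assumed to expose product_number, product_key, customer_id and customer_key.

```sql
-- Sketch only: table and column names on the silver side are assumed.
-- The two LEFT JOINs exist purely to look up the surrogate keys.
SELECT
    sd.sls_ord_num,
    pr.product_key,    -- surrogate key from the product dimension
    cu.customer_key    -- surrogate key from the customer dimension
FROM silver.crm_sales_details AS sd
LEFT JOIN gold.dim_products AS pr
    ON sd.sls_prd_key = pr.product_number   -- source product key vs. product number
LEFT JOIN gold.dim_customers AS cu
    ON sd.sls_cust_id = cu.customer_id;     -- source customer ID vs. customer ID
```

Because both joins are left joins, a sales row with no matching dimension row is still returned, just with a NULL surrogate key, which makes such gaps easy to detect later.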
So that was actually the hardest part of building the facts. Now in the next step, all you have to do is to go and
give friendly names. So we're going to go over here and say order number. Then the surrogate keys are already friendly.
So we're going to go over here and say this is the order date. And the next one is going to be the shipping date. And then the
next one the due date, and for the sales I'm going to say sales amount, then the
quantity, and the final one is the price. So now let's go and execute it and look at the results. So now as you can see,
the columns look very friendly. And now about the order of the columns, we use the following scheme. So first in the
fact table we have all the surrogate keys from the dimensions. Then second we have all the dates, and you
group all the measures and the metrics at the end of the fact. So that's it for the query for the facts.
Now we can go and build it. So we're going to say CREATE VIEW gold, in the gold layer, and
this time we're going to use the fact underscore prefix, and we're going to go and call it sales, and then don't forget
about the AS. So that's it. Let's go and create it. Perfect. Now we can see the fact. So with that we have three
objects in the gold layer. We have two dimensions and one fact. And now of course the next step with that, we're
going to go and check the quality of the view. So let's have a simple select from fact sales. So let's execute it.
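
Put together, the fact view and the quick check might look something like the sketch below. The source table silver.crm_sales_details and its sls_* columns are assumed names for illustration; the friendly column names and the column order (keys, then dates, then measures) follow the narration.

```sql
-- Sketch only: silver-side names are assumptions, not taken from the narration.
CREATE VIEW gold.fact_sales AS
SELECT
    sd.sls_ord_num  AS order_number,
    pr.product_key,                    -- surrogate keys first
    cu.customer_key,
    sd.sls_order_dt AS order_date,     -- then the dates
    sd.sls_ship_dt  AS shipping_date,
    sd.sls_due_dt   AS due_date,
    sd.sls_sales    AS sales_amount,   -- measures at the end
    sd.sls_quantity AS quantity,
    sd.sls_price    AS price
FROM silver.crm_sales_details AS sd
LEFT JOIN gold.dim_products AS pr
    ON sd.sls_prd_key = pr.product_number
LEFT JOIN gold.dim_customers AS cu
    ON sd.sls_cust_id = cu.customer_id;
GO

-- Quick look at the result of the new view.
SELECT * FROM gold.fact_sales;
```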
Now by checking the result you can see it is exactly like the result from the query and everything looks nice. Okay.
So now one more trick that I usually do after building a fact is to try to connect the whole data model in order to find
any issues. So let's go and do that. We will do just a simple left join with the dimensions. So gold dimension customers,
c, and we will use the keys, and then we're going to say where customer key is null, so where there is no
match. So let's go and execute it. And with that, as you can see in the results, we are not getting anything. That
means everything is matching perfectly, and we can do as well the same thing with the products. So left join gold
dim products p on product key, and then we connect it with the fact product key, and then we're going to go and check the
product key from the dimension like this. So we are checking whether we can connect the fact together with the
dimension products. Let's go and check, and as you can see as well, we are not getting anything, and this is all right.
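
As a rough sketch, the connectivity check just described, together with the uniqueness check run earlier on the product dimension, could be written as below. It assumes the three gold views are named gold.fact_sales, gold.dim_customers and gold.dim_products and that the surrogate key columns are customer_key and product_key. Both queries are expected to return no rows.

```sql
-- Check 1: every fact row must find its dimension rows.
-- Any row returned here has a key with no match in a dimension.
SELECT f.*
FROM gold.fact_sales AS f
LEFT JOIN gold.dim_customers AS c
    ON c.customer_key = f.customer_key
LEFT JOIN gold.dim_products AS p
    ON p.product_key = f.product_key
WHERE c.customer_key IS NULL
   OR p.product_key IS NULL;

-- Check 2: the surrogate key of the dimension must be unique.
SELECT
    product_key,
    COUNT(*) AS duplicate_count
FROM gold.dim_products
GROUP BY product_key
HAVING COUNT(*) > 1;
```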
So with that we have now SQL code that is tested and that creates the gold layer. Now in the next step, as you know
from our requirements, we have to make clear documentation for the end users so they can use our data model. So let's
go and draw a data model of the star schema. Let's go and search for a table.
And now what I'm going to do, I'm going to go and take this one where I can say what is the primary key and what is the
foreign key. And I'm going to go and change a little bit the design. So it's going to be rounded. And let's say I'm
going to go and change to this color. And maybe go to the size, make it 16. And then I'm going to go and select all
the columns and make it as well 16 just to increase the size. And then go to Arrange and we can go and increase it to 39.
So now let's go and zoom in a little bit for the first table. Let's go and call it gold dimension customers and make it
a little bit bigger like this. And now we're going to go and define here the primary key. It is the customer key. And
what else we're going to do? We're going to go and list all the columns in the dimension. It is a little bit annoying
but the result is going to be awesome. So what do we have? The customer ID. We have the customer number and then we
have the first name. Now in case you want a new row, you can hold Control and press Enter, and you can go and add the
other columns. So now pause the video and then go and create the two dimensions, the customers and the
products, and add all the columns that you have built in the
view. Welcome back. So now I have those two dimensions. The third one is going to be the fact table. Now for the fact
table I'm going to go with a different color, for example the blue, and I'm going to go and put it in the middle.
Something like this. So, we're going to say gold fact sales, and here we don't have a primary key. So, we're going
to go and delete it. And I have to go and add all the columns of the fact. So, order number, product key, customer
key. Okay. All right. Perfect. Now, what we can do, we can go and add the foreign key information. So, the product key is
a foreign key for the products. So, we're going to say FK1. And the customer key going to be the foreign key for the
customers. So FK2 and of course you can go and increase the spacing for that. Okay. So now after we have the tables
the next step in data modeling is to go and describe the relationship between these tables. This is of course very
important for reporting and analytics in order to understand how I'm going to go and use the data model. And we have
different types of relationships. We have one to one, one to many. And in star schema data model the relationship
between the dimension and the fact is one to many. And that's because in the table customers we have for a specific
customer only one record describing the customer but in the fact table the customer might exist in multiple records
and that's because customers can order multiple times. So that's why in fact it is many and in the dimension side it is
one. Now in order to see all those relationships we're going to go to the menu to the left side and as you can see
we have here entity relations and now we have different types of arrows. So for example we have zero to many, one to
many, one to one and many different types of relations. So now which one we going to take? We're going to go and
pick this one. So it says one mandatory. So that means the customer must exist in the dimension table. To many, but it is
optional. So here we have three scenarios: the customer didn't order anything, or the customer ordered only
once, or the customer ordered many times. So that's why in the fact table it is optional. So we're going to take
this one and place it over here. So we're going to go and connect the one part to the customer dimension and the many
part to the facts. Well, actually we have to do it on the customers. So with that we are describing the relationship
between the dimension and the fact as one to many. One is mandatory for the customer dimension and many is optional
for the facts. So we have the same story as well for the products. So the many part goes to the facts and the one goes to
the products. So it's going to look like this. Each time you are connecting new dimension to the fact table, it is
usually one to many relationship. So you can go and add anything you want to this model like for example a text like
explaining something. For example, if you have some complicated calculations and so on, you can go and write this
information over here. So for example, we can say over here sales calculation, we can make it a little bit smaller. So
let's go with 18. So we can go and write here the formula for that. So sales equal quantity multiplied with the price
and make this little bit bigger. So it is really nice info that we can add it to the data model and even we can go and
link it to the column. So we can go and take this arrow for example put it like this and link it to the column and with
that you have as well nice explanation about the business rule or the calculation. So you can go and add any
descriptions that you want to the data model. Just to make it clear for anyone that is using your data model. So with
that you don't have only like three tables in the database. You have as well like some kind of documentations and
explanation. In one click we can see how the data model is built and how you can connect the tables together. It is
amazing really for all users of your data model. All right. So now with that we have really nice data model. And now
in the next step we're going to go and create quickly a data catalog. All right, great. So with that we have a
data model, and we can say we have something called a data product, and we will be sharing this data product with
different types of users. And there is something that every data product absolutely needs, and that is the data
catalog. It is a document that can describe everything about your data model: the columns, the tables, maybe the
relationships between the tables as well. And with that, you make your data product clear for everyone. And it's
going to be way easier for them to derive more insights and reports from your data product. And what is the most
important benefit? It is time-saving, because if you don't do that, what's going to happen? Each consumer, each user of your
data product will keep asking you the same questions: what do you mean with this column? What is this table?
How do I connect table A with table B? And you will keep repeating yourself and explaining stuff. So
instead of that you prepare a data catalog and a data model, and you deliver everything together to the users, and
with that you are saving a lot of time and stress. I know it is annoying to create a data catalog, but it is
an investment and a best practice. So now let's go and create one. Okay. So now in order to do that I have created a new
file called data catalog in the folder documents. And here what we're going to do is very straightforward. We're going
to make a section for each table in the gold layer. So for example we have here the table dimension customers. What you
have to do first is to describe this table. So we are saying it stores details about the customers with the
demographic and geographic data. So you give a short description for the table, and then after that you're going
to go and list all your columns inside this table and maybe as well the data type. But what is way more important is the
description for each column. So you give a very short description, like for example here the gender of the customer.
And now one of the best practices of describing a column is to give examples, because you can understand quickly the
purpose of the column by just seeing an example. Right? So here we are saying we can find inside it male, female, and not
available. So with that the consumer of your table can immediately understand that it will not be an M or an F. It's going
to be a full friendly value, without them having to go and query the content of the table. They can understand
quickly the purpose of that column. So with that we have a full description for all the columns of our dimension. The
same thing we're going to do for the products. So again, a description for the table and as well a description for
each column, and the same thing for the facts. So that's it. With that you have a data catalog for your data
product at the gold layer. And with that the business user or the data analyst has a better and clearer
understanding of the content of your gold layer. All right my friends. So that's all for the data catalog. In the
next step we're going to go back to Draw.io, where we're going to finalize the data flow diagram. So let's go.
Okay. So now we're going to go and extend our data flow diagram, but this time for the gold layer. So now let's go
and copy the whole thing from the silver layer and put it over here side by side. And of course we're going to go and
change the coloring to the gold. And now we're going to go and rename stuff. So this is the gold layer. But now of
course we cannot leave those tables like this. We have completely new data model. So what do we have over here? We have
the fact sales, we have dimension customers, and as well we have dimension products. So now what I'm going to do,
I'm going to go and remove all those stuff. We have only three tables. And let's go and put those three tables
somewhere here in the center. So now what you have to do is to go and start connecting those stuff. I'm going to go
with this arrow over here, direct connection, and start connecting stuff. So the sales details goes to the fact
table. Maybe put the fact table over here. And then we have the dimension customer. This comes from the CRM
customer info. And we have two tables from the ERP. It comes from this table as well. And the location from the ERP.
Now the same thing goes for the products. It comes from the product info and comes from the categories from the
ERP. Now, as you can see here, we have crossing arrows. So what you can do, we can go and select everything and we can say
line jumps with a gap. And this makes the arrows look a little bit better visually. So now for example, if
someone asks you where the data comes from for the dimension products, you can open this diagram and tell them: okay,
this comes from the silver layer. We have two tables, the product info from the CRM and as well the categories
from the ERP, and those silver tables come from the bronze layer, and you can see the product info comes from the CRM
and the category comes from the ERP. So it is very simple. We have just created a full data lineage for our data
warehouse, from the sources into the different layers in our data warehouse, and data lineage is this really amazing
documentation that can help not only your users but as well the developers. All right. So with that we have a very
nice data flow diagram and a data lineage. All right. So we have completed the data flow. It really feels like
progress, like achievement, as we are clicking through all those tasks. And now we come to the last task in building
the data warehouse, where we're going to go and commit our work in the Git repo. Okay. So now let's put our scripts
in the project. So we're going to go to the scripts over here. We have here bronze silver but we don't have a gold.
So let's go and create a new file. We're going to have gold/ and then we're going to say ddl gold.sql. So now we're going
to go and paste our views. So we have here our three views. And as usual at the start we can describe the purpose of
the views. So we are saying: create gold views. This script creates views for the gold layer, and the gold
layer represents the final dimension and fact tables, the star schema. Each view performs transformations and combines
data from the silver layer to produce business-ready data sets, and those views can be used for analytics and reporting.
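
To make this concrete, here is a minimal sketch of how such a script could start, showing the header comment and the product dimension built earlier. The silver table names (silver.crm_prd_info, silver.erp_px_cat_g1v2) and the prd_*, cat and subcat column names are assumptions for illustration; the friendly names, the filter on the end date, the left join and the ROW_NUMBER surrogate key follow the narration.

```sql
/*
  DDL script: create gold views (sketch)
  Creates the views of the gold layer, which represents the final
  dimension and fact tables (star schema). Each view transforms and
  combines data from the silver layer into business-ready data sets.
  NOTE: silver table and column names below are assumed.
*/
IF OBJECT_ID('gold.dim_products', 'V') IS NOT NULL
    DROP VIEW gold.dim_products;
GO

CREATE VIEW gold.dim_products AS
SELECT
    ROW_NUMBER() OVER (ORDER BY pn.prd_start_dt, pn.prd_key) AS product_key, -- surrogate key
    pn.prd_id       AS product_id,
    pn.prd_key      AS product_number,
    pn.prd_nm       AS product_name,
    pn.cat_id       AS category_id,
    pc.cat          AS category,
    pc.subcat       AS subcategory,
    pc.maintenance,
    pn.prd_cost     AS cost,
    pn.prd_line     AS product_line,
    pn.prd_start_dt AS start_date
FROM silver.crm_prd_info AS pn
LEFT JOIN silver.erp_px_cat_g1v2 AS pc
    ON pn.cat_id = pc.id
WHERE pn.prd_end_dt IS NULL;  -- filter out all historical data
GO
```

The customer dimension and the fact view would follow the same pattern in the same file.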
So that's it. Let's go and commit it. Okay. So with that, as you can see, we have the bronze, the silver. So we have
all our ETLs and scripts in the repository. And now as well for the gold layer, we're going to go and add all
those quality checks that we have used in order to validate the dimensions and facts. So we're going to go to the test
over here and we're going to go and create a new file. It's going to be quality checks gold, and the file type is
SQL. So now let's go and paste our quality checks. So we have the checks for the fact and the two dimensions, and as well
an explanation about the script. So we are validating the integrity and the accuracy of the gold layer. And here we
are checking the uniqueness of the surrogate keys and whether we are able to connect the data model. So let's put
that as well in our Git and commit the changes. And in case we come up with new quality checks, we're going to go
and add them to our script here. So those checks are really important if you are modifying the ETLs, or if you want to make
sure that those scripts run after each ETL, and so on. It is like a quality gate to make sure that everything is fine in
the gold layer. Perfect. So now we have our code in our repository. Okay friends. So now what you have to do is
to go and finalize the Git repo. So for example, all the documentation that we have created during the project, we can
go and upload it in the docs. So for example you can see here the data architecture, the data flow, data
integration, data model, and so on. So each time you edit those pages you can commit your work and you have a
version of that. And another thing that you can do is go to the README; for example, over here I have added
the project overview, some important links, and as well the data architecture and a little description of the
architecture. And of course don't forget to add a few words about yourself and your important profiles on the
different social media. All right my friends. So with that we have committed our work and as well closed the last
epic, building the gold layer, and with that we have completed all the phases of building a data warehouse. Everything is
100% and this feels really nice. All right my friends. So with that we have covered the first type of SQL project,
the data warehousing project. This is usually a very complex project that you can get involved in at a company, and this
is a really amazing project if you are planning to be a data engineer. But of course, if you are a data analyst, you
might end up as well building warehouses. So now we have everything prepared for the second type of project
in SQL. We will deep dive now into exploratory data analysis. So let's go. And now here we're going to cover
the second type of project, where we're going to use our basic SQL skills in order to do something called data
profiling, where we're going to try to understand all the aspects of our data sets using simple aggregations like
sum, average, count, and as well we will be using techniques like subqueries.
All right my friends. So the first step in any data project is that we need data sets. If you have done the
previous project where we have built the SQL data warehouse, then you have everything the data and the database. So
you don't have to worry about it. But if you skip that, which I don't recommend, I still have prepared for you the files
and the database. So let's get the data and create our database. All right. So now if you go to the link in the
description, we're going to go to the downloads. And of course, you can subscribe to my newsletter. And then
here we have the SQL course materials. And here we have a link for data analytics projects. Let's go to the
link. And now here you have some important links, like downloading SQL Server and the Management Studio where we're
going to write our SQL, and as well there is a link to the Git repository, and as well what is very important is to
download all the project files. So click on that and download all the files. Now extract the file and put it somewhere
safe on your PC, and now inside it you can find all the scripts and the data sets. Now there are three ways
to create the database in SQL Server. So the first one is by executing scripts. If you go to the scripts over
here, the first one we have is a file called init database. Just go inside it and copy the whole thing, and then let's
go to SQL Server. Now make a new query and make sure you switch to the master database, and then paste the whole code.
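
For orientation, a script of this kind typically has the shape sketched below. This is not the course's actual file: the database name is written as DataWarehouseAnalytics, the table and column definitions are assumed from the product dimension built earlier, and the file path is a placeholder that you would replace with the location of your own CSV files.

```sql
-- Sketch only: names, column types and the file path are assumptions.
USE master;
GO

CREATE DATABASE DataWarehouseAnalytics;
GO

USE DataWarehouseAnalytics;
GO

CREATE SCHEMA gold;
GO

CREATE TABLE gold.dim_products (
    product_key    INT,
    product_id     INT,
    product_number NVARCHAR(50),
    product_name   NVARCHAR(50),
    category_id    NVARCHAR(50),
    category       NVARCHAR(50),
    subcategory    NVARCHAR(50),
    maintenance    NVARCHAR(50),
    cost           INT,
    product_line   NVARCHAR(50),
    start_date     DATE
);
GO

-- Load the CSV file; this path is the one thing you have to adapt.
BULK INSERT gold.dim_products
FROM 'C:\path\to\csv-files\gold.dim_products.csv'
WITH (
    FIRSTROW = 2,            -- skip the header row
    FIELDTERMINATOR = ',',
    TABLOCK
);
GO
```

The customers and sales tables would be created and loaded in the same way.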
So now what we are doing here is creating a new database. We are creating a schema and then three very important
tables that we're going to use in our data analysis. Now there is only one thing that you have to change in
this script, and that is the path of the files. And once you have done that, just go and execute the whole script. And now
as you can see everything is done and the data is inserted. Now if you go to the left side to the databases and
refresh, you can find a new database called data warehouse analytics. And if you go inside the tables you will find
our three tables: customers, products, and sales. So this is one way to create the database. The second method
is to go to the databases over here. Right click on it and say new database. And for example, let's call it data
warehouse analytics. I'm going to call it two because I have already one. And then click okay. And with that you have
a new database. So what we're going to do now, we're going to right click on it and then go to tasks and then import
flat file. And now what we're going to do, we're going to go and import the CSV files to our new database. So we can go
next and then you have to go and locate your files. I have them somewhere over here. So data set CSV files and we have
to focus on the gold tables. So I'm going to go and select this one and then next. Now I'm just getting an overview
of my data. So next. Now just to make sure that you are not getting any error, I'm going to go and allow nulls and
that's all. So next and finish. So perfect. The data has been inserted. Now let's go to our database tables. And as
you can see, we have here our new table. So you have to go and repeat this three times in order to import the data. Well,
you can use this method if the first method didn't work. But I really recommend you to use the script in order
to create the database. The third way is to go and restore the database itself. Now how are we going to do it? We're
going to go again to the data sets, and as you can see we have here a database backup. So as you can see we have here a
.bak file. So now what you have to do is to go and copy that, and then we're going to go to the database location. So it
really depends on where you have installed SQL Server. So currently I have it here: Program Files, Microsoft SQL Server,
and then the Express MSSQL Backup folder, and you have to place the file over here. So I have it here, data warehouse analytics
backup. And now all you have to do is to right click on the databases and then say restore database, and then we're
going to go to Device, the three dots, and we're going to say add. And now you can see our database, data warehouse
analytics. Then we say okay and then okay. And now since I have it already I will get an error, but once you click okay
the whole database can be restored without running any scripts. So those are the three ways to create the
database for the project, and if you have built the data warehouse project with me before, you don't have to do it
because we have built that together. So pause the video and get the data for the project. All right my friends. So we're
going to start with a secret, a little trick that I usually use when analyzing any data set. So let's start with a little
coffee before we start. Ah, this is really hot. Okay. So the secret says: as I'm looking at any data set in any
project, I see the data always divided between dimensions and measures. What truth? You take the blue pill, you
take the red pill. All I'm offering is the truth. Nothing more. If you see your data like me, as
dimensions and measures, you can generate an endless amount of insights from any project, from any data set, and
you will find throughout the projects that I'm always speaking about measures and dimensions. So I'm going to show you
how I usually do it. So now usually by looking at any data set in any project, where you have multiple
columns and rows, I see the data always split into two categories, either a dimension or a measure. And now
of course the question here is: is my column a dimension or a measure? Well, in order to assign it to one of those
categories you have to ask the first question: is it a numeric value? If it's not, so you have a string or date or
any other data type, then it is a dimension. And if yes, it is numeric, then you have to ask the second question:
does it make sense to aggregate it? So if the answer to both questions is yes, it is numeric and it makes sense to
aggregate it, then it is a measure; otherwise it is a dimension. Now let's practice and have some examples. So now
by looking at the values of the column category you can see all the values are characters. So it is not numeric; that
means this column is a dimension. So it is very simple. Let's take another column. We have the sales amount. So now
as you can see the values are numeric, and as well it makes sense to aggregate those values. We can get the total sales
or the average sales and so on. So it fulfills both of the conditions. It is numeric and it makes sense to aggregate
it. That's why we say sales is a measure. Now if you're checking the values of the product name, you can see
that all of them are characters and names. So it is not numeric. That means the product is a dimension. Moving on to
the next one, we have the quantity. The values are numeric, and as well it makes sense to aggregate it. We can sum up all
those values to have the total quantity. So quantity is a measure. Now if you're looking at the values of the birth date,
you can see this is date information. It is not numeric, so that means it is a dimension, right? But if you calculate the
age from the birth date, the age of the customer is going to be numeric, and it makes sense to aggregate it, for example
finding the average age of customers. So if we derive a numeric value from a dimension, then we can use it as a
measure. So age is a measure. And now we come to something really tricky. This is the ID. So for example if you are
checking the customer ID you can see all those values are numeric. So the first condition is fulfilled. Now the very
important question: does it make sense to aggregate the IDs? Well, those IDs are unique identifiers for a customer, and if
you find the average of that, it is not helpful, right? I cannot think of one use case for aggregating the customer
ID, like having the average of all those IDs or summing up the IDs. So it makes no sense to aggregate it. That's why we
can consider the ID of a customer as a dimension, not as a measure. So as you can see it is very simple. If it is
numeric and it makes sense to aggregate, then it is a measure; otherwise it is a dimension. And this is the foundation
of any data analytics. If you see your data as dimensions and measures you can generate a lot of use cases and insights
from your data sets. Now I totally understand if you are still confused about dimensions and measures and you
might be asking why do I need measures and dimensions. Well, if you are doing any type of data analysis or you are
exploring any data set, you will always end up grouping the data by something: you are grouping the data
by countries, or grouping the data by, for example, products or categories. So we need dimensions to group our data. And
on the other side, you will be asking questions like how much, how many, what is the total of something. So you always
need to aggregate or calculate something, right? And for that you need the measures. So we need the measures in order to
answer the questions how many and how much, and we need the dimensions in order to group the data by something. So
that's why in almost any type of data analysis you need dimensions and measures, and this is going to become clearer
as we progress in the projects. All right. So now I'm going to walk you through the project road map and I have
split that into six steps. So we're going to do different types of exploration, like the database,
dimensions, measures and dates, and we're going to do some basic analysis like the magnitude and the ranking. So let's
start with the first step in our projects. We're going to do database exploration. So let's say that you have
joined a team and you got access to a database. The first thing that I usually do is explore the structure of
the database just to get a basic understanding of the database: the tables, the views, the columns. Are we
talking about like 10 tables, hundreds of tables? So it is just a few queries in order to say hello to the database.
So now let's go to SQL and explore the database of our project. So how are we going to do it? Either you go to the
left side over here and start clicking the objects of your database and explore the tables, views, columns and so on. Or,
the better way that I usually go with, you explore the database using a query. So what we can do is select
data from the system tables, because the database stores metadata about our tables and objects. So we're
going to target INFORMATION_SCHEMA. This is an internal schema in the database where we have multiple
tables and views to explore the metadata and the structure of our database. So for example, we can go with the tables.
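As a minimal sketch of that first metadata query, assuming SQL Server (where INFORMATION_SCHEMA is exposed as a set of views) and that you are connected to the course database:

```sql
-- List every table and view in the current database.
-- TABLE_TYPE tells you whether an object is a 'BASE TABLE' or a 'VIEW'.
SELECT
    TABLE_CATALOG,   -- database name
    TABLE_SCHEMA,    -- schema the object lives in
    TABLE_NAME,
    TABLE_TYPE
FROM INFORMATION_SCHEMA.TABLES;
```

Writing out the four columns instead of SELECT * is only a readability choice here; both return the same rows.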
So let's go and query it. And with that you have a list of tables, and you can see several pieces of information like the
catalog, the schema and the table names, and you can see over here the object type, whether it is a table or a view. If
you have done the data warehouse project with me then you will find a lot of tables. But if you are just doing the data
analysis you will see only those three tables: customers, products and sales. So with that we can see in our
database there are around 15 tables, or three tables. Now in the output you can see the database name, the schema and
a list of all tables, and of course don't forget to check that you are using the database that we created. So with that we have a
nice quick list with all tables inside our database. Now the next step we can go and drill down and check what are the
columns that we have inside our database. And for that we can target the same schema as well. So SELECT *
FROM INFORMATION_SCHEMA, and it is very simple: we're going to go to COLUMNS. So let's go and execute
it. And now we will see a lot of information over here. So we can see in our database we have around 101 columns.
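A sketch of the column-level query, again assuming SQL Server. The second statement narrows the list to a single table; the name 'dim_customers' is an assumption standing in for the customers dimension, so substitute whatever name your database uses:

```sql
-- Every column of every table and view in the current database.
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS;

-- The same metadata for one table only.
-- 'dim_customers' is an assumed table name.
SELECT
    COLUMN_NAME,
    ORDINAL_POSITION,   -- position of the column inside the table
    DATA_TYPE,
    IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'dim_customers'
ORDER BY ORDINAL_POSITION;
```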
So with that we can see all the columns available in our database. And what I usually do with that is select the
columns only for a specific table. So we can say WHERE TABLE_NAME equals, let's take for example, the
dimension customers. So let's query the whole thing, and with that we can see we have 10 columns inside this dimension,
and this is how the columns are ordered inside our table or view, and we can see all the metadata about each
column. So as you can see, we are now exploring the structure of our database, and this is really helpful to get an
overview of the database and the project. Are we talking about 20 tables or hundreds of tables? And we can
quickly see the naming of the columns and the tables. This is really important to get a feeling for the project, and it
sets the foundation for exploring the data inside those tables. All right friends, so with that we have done the
first step. We have explored the database structure and now we can start diving into the actual data. The first
thing that we can explore is the dimensions. Okay. So what are we going to do in the
dimension exploration? All we have to do is identify the unique values of each dimension that we have
inside our database. This can help us understand what the categories are, which countries, what product
types we have inside our database, and we have a very simple formula for that. All you need is the SQL
keyword DISTINCT together with any dimension in your data set, like DISTINCT country, DISTINCT category. So for
example, if you are checking any column that is a dimension, you can see a lot of values and repeating stuff, but once
you say DISTINCT column, you will get a list of all the unique values, and with that you can understand
quickly: I have three different types, A, B and C. This is also going to help you understand the granularity
of your dimension: does the dimension have three values or 100 values? So it is very simple. Let's go and analyze our
dimensions. Okay, so now let's explore the dimension values inside our database. Let's start with the first table, the
customers, and if you check those columns we have to find an interesting dimension, like for example the country. So now
what we can do is explore all the countries our customers come from. So let's go and do that. It is very
simple: SELECT DISTINCT, and then we have our column, the dimension country, from our table customers. So let's go and
execute it. And with that we can see in the result we have six countries. This is really nice in order to understand
the geographical spread. So our business has customers that come from six different countries: Germany,
United States, France, Canada and so on. So with that we have the first little insights about our business. Now
let's jump to another table, the products. So what we have to do is explore all the categories inside our
business, the major divisions. So we're going to say SELECT DISTINCT category from our table products. So let's go and
execute it. Now in the output you can see we have four categories. We have the accessories, bikes, clothing and
components. This gives us an overview of the product range: what are the major divisions inside our business?
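The two DISTINCT queries described so far can be sketched as follows. The object names gold.dim_customers and gold.dim_products and the column names country and category are assumptions based on how the tables are spoken of in the walkthrough:

```sql
-- Unique countries our customers come from.
-- gold.dim_customers and country are assumed names.
SELECT DISTINCT country
FROM gold.dim_customers;

-- Unique product categories (the major divisions of the business).
-- gold.dim_products and category are assumed names.
SELECT DISTINCT category
FROM gold.dim_products;
```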
Now for the next one I'm digging deeper into this information. So not only do I want to see the categories, I would also like
to see the subcategories. I'm not starting a new query because there is of course a
relationship between the category and the subcategory. Let's go now and execute it. Now you can see in the
output our categories are now split into more specific groups. So for example for the bikes over here we have
mountain bikes, road bikes and so on. So as you can see the subcategories have more details about the products than the
category. And now in order to get the full picture we are going to bring in the product name. So with that we're going
to get the big picture in one shot. So now you can see the whole hierarchy of our products. And of course it is more
interesting if you sort the data by those three columns. So let me just execute it again. So now if you go
and explore our data, for example, we have here the category accessories and we have a subcategory inside it called
lights. And in this subcategory we have three different products. And if you scroll to the end of our table you can
see that we have around 295 products. So you can see the granularity of the product name is
different from the category and the subcategory. And all three of those columns are related to each other.
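A sketch of the hierarchy query built up above, with the three columns added one after another and then sorted. The names gold.dim_products, subcategory and product_name are assumptions:

```sql
-- Product hierarchy: category > subcategory > product.
-- DISTINCT applies to the combination of all three columns.
SELECT DISTINCT
    category,
    subcategory,
    product_name
FROM gold.dim_products          -- assumed table name
ORDER BY category, subcategory, product_name;
```

Sorting by all three columns keeps each subcategory's products together, which makes the hierarchy easy to read from top to bottom.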
So as you can see, after exploring those dimensions we now have a better understanding of how the data is
organized, and this can help us in the analysis: if you are aggregating by the category you will get only four rows. If
you are aggregating by the products you will get hundreds of rows. So this is how we explore the dimensions of our
database. Okay. So with that we have a clear picture of the dimensions inside our data set. And now in the
next step we're going to dive deep into one special type of dimension: the dates. So we're going to explore the
date columns. Okay. So now what are we going to do in the date exploration? We're
going to go and explore the boundaries of the dates that we have in the data sets. What is the earliest and the
latest dates in my data? We're going to understand the time span. Do we have in our business 2 years or like 10 years?
And this is of course very important to understand in order to do different types of time analysis later. Now
the formula for that is very simple. All we need is the MIN and MAX functions in order to get the earliest
and the latest dates. And of course we're going to apply that to date columns, date dimensions. So for
example, we're going to have MIN order date, MAX create date, MIN birth date: any date that you have in your
data set. And here if you look at any date column inside your data, you will find multiple values. But what is
interesting is to understand what the earliest date is, like here for example 2018, and what the latest date is, for example
2028, and with that we can understand, aha, we have a time span of 10 years, using the DATEDIFF function. So now let's go
and apply our new formula to our date columns. All right. So now let's search for date information inside our
database. And usually you're going to find a lot in the facts. So let's go to the fact sales. And here we have
multiple dates: the order date, shipping date and due date. Now let's go and explore the boundaries of the order
date. So we have the following task: find the date of the first and last order. So how are we going to do that? We're
going to say SELECT, and we are targeting the order date from our table sales. So let's go and execute it. And now we can
see we have a lot of values inside our database. So in order to find the first date, what we're going to do is
use the function MIN in order to get the minimum order date. And we're going to call it
first order date. So let's go and execute it. So now we can see the date of the first order. It is in December
2010. Now let's go and find the date of the last order. So this time we're going to have the MAX order date. Let's
call it last order date. So let's go and explore the other boundary, and with that we can see that January 2014
is the date of the last order in our system. So with that we have explored the boundaries of the order date, the
first and the last, and of course we can now understand very quickly that we have four years of sales inside our business,
but we can go and calculate it. So now the task says: how many years of sales are available? Now in order to find the
years between those two dates, we have another SQL function. It's called DATEDIFF. And with it we subtract
two dates. Now this function needs three arguments. With the first one you specify the unit: year, month or
day. Then we start with the smallest date, so it's going to be the MIN order date. And then the last argument is
going to be the latest or the highest date, and it's going to be the MAX order date. And we can call it order
range in years. Okay. So let's go and execute it. And with that you can see in the output we have four years. Of course
if you want to check the months you can go over here and say month and execute. So between those two dates we
have 37 months. And of course now we have to go and rename it. So with that we have explored the dimension order
date. But what is more interesting is to check the customers, and here we have the birth date. So now what we can do,
we can go and find the youngest and the oldest customer. So let's go and do that. We're going to say SELECT
MIN birth date, and with that we are getting the oldest birth date, and then we get the MAX birth date, and with that we will
get the youngest birth date, from our table customers. So let's go and explore that. Now we can see the birth date of
the oldest customer. I hope he or she is still alive. So that is more than 100 years ago, and the youngest customer is
around 40 years old. So we don't have really young customers inside our business. And of course if you don't
want to see the birth dates and want to see the age instead, what you have to do is actually very simple. You're going to
use DATEDIFF as well: we want the year, and then we're going to compare the MIN birth date with the current date and time. And
for that we have a function called GETDATE, and we're going to call it oldest age. So if you go ahead and execute this
one over here you can see the age of the oldest customer: it is 109. Of course you can do the same thing for the youngest.
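The date-boundary queries from this step can be sketched as below, in SQL Server syntax. The names gold.fact_sales, gold.dim_customers, order_date and birthdate are assumptions. One detail worth knowing: DATEDIFF counts how many year (or month) boundaries lie between the two dates, not full elapsed periods, so the results are approximate spans and approximate ages.

```sql
-- First and last order date, plus the span between them.
-- gold.fact_sales and order_date are assumed names.
SELECT
    MIN(order_date)                                   AS first_order_date,
    MAX(order_date)                                   AS last_order_date,
    DATEDIFF(year,  MIN(order_date), MAX(order_date)) AS order_range_years,
    DATEDIFF(month, MIN(order_date), MAX(order_date)) AS order_range_months
FROM gold.fact_sales;

-- Oldest and youngest customer, as birth dates and as approximate ages.
-- gold.dim_customers and birthdate are assumed names.
SELECT
    MIN(birthdate)                            AS oldest_birthdate,
    DATEDIFF(year, MIN(birthdate), GETDATE()) AS oldest_age,
    MAX(birthdate)                            AS youngest_birthdate,
    DATEDIFF(year, MAX(birthdate), GETDATE()) AS youngest_age
FROM gold.dim_customers;
```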
You just replace this with MAX, and here we have the youngest age. So let's go and execute it. It is 39. So my
friends, this is how we explore the boundaries of a date, and by finding the first date and the last date and the
years between them we now have a better understanding of the time span of our business, and that's going to help us
later when doing different types of complex analysis. So this is how we explore the dates. All right. So with
that we now have a clear picture of the scope of our project and the date range inside our data set. Now in the
next step, we're going to go and explore the second type of data, the measures. All right. So now what is
exactly exploring the measures? What we're going to do is to calculate and find out the key metrics of our
business, the big numbers, the highest level of aggregations of our data. And the formula for that is very simple.
We're going to use the aggregate functions in SQL, like SUM, AVG and COUNT, on any measure inside our data
set. So for example, we're going to find the total sales by summing the sales values, find the average price,
and find the sum of quantity in order to have one big number for all sold items. So it is always an aggregate function together
with a measure. So for example, if you have a column where you have a lot of values and you sum all
those values, you will get for example 240. So this is a key metric. This is the highest level of aggregation, and
the value is not split at all. So for example, we say this is the total revenue of our business. And this is
exactly what we mean by exploring the measures. We will get those big numbers. So now let's go and apply those
aggregate functions to the measures that we have inside our data set. Okay. So now we're going to put the spotlight on
the big numbers that matter the most for our business. So based on those three tables, I have collected here the
following questions. So let's go and solve them one by one. The first one is find the total sales. So we're going to
go and sum, using the SUM function on the sales amount as total sales from our table fact sales. So
let's go and execute it. So this is the total amount of sales in our business. It is around 29 million. So this is the
business's total revenue. Now we can go to the second one. It says: show how many items are sold. So this time we need
another column, but from the same table, the fact sales. The question is how many items, so that means we want the
quantity, and we're going to stay with the same function. So we are summing all the values of the quantity and we
can call it total quantity. Let's go and explore that. So we can see our business sold around 60,000 items, and these
60,000 items generated around 30 million. So let's keep going. The next question: find the average selling
price. So that means we are targeting the same table. And here we have the price information. So we're going to
say the price. This time the aggregate function is going to be the average, AVG. And we're going to call it average price. So
let's go and execute it. So the average price in our business is 486. So that means our business is selling fairly
expensive items. Now let's go to the next question. It says find the total number of orders. And for that we're
going to use the function COUNT, and we can count the order numbers. So COUNT order number as total orders. Let's go and
execute it. So it says we have 60,000 orders. And now, whenever you are working with the COUNT function, what I usually do is
try to count the same thing but using DISTINCT. So DISTINCT order number. What I'm trying to do here is first
eliminate any duplicates in the order number and then count it. I don't want to count the same order twice inside our
sales. So, let's go and execute that. Now, as you can see, we have only 27,000 distinct orders out of 60,000 rows. So, that means the
same order number is repeated in our database. Let's actually have a look. So, SELECT * from our table, and let's
have a look. Now as you can see with the first order over here, the same order is repeated three times, and
that's because this customer ordered three things in the same order. So now of course, what is the definition of an
order? Usually the whole thing is one order. That's why in order to get an accurate number of orders you have to
use DISTINCT in order to first eliminate all duplicates and then count how many orders we have. So in this scenario
I'm going to say in our business we have around 27,000 orders. So that's why it is a little bit tricky using the COUNT
function. Always try to compare the numbers before and after using DISTINCT. So let's keep going to the next one. It
says: find the total number of products. So it is very simple. We're going to say SELECT COUNT, and we're going to say
product key as total products from the table gold products. So let's go and execute it. So as you can see we have
295, and if you make it DISTINCT just to check, you will get the same number. So that means there are no
duplicates, and of course you can count the product name instead. The names of the products are unique, so
that's why we are getting the same number as well. So that's it. Let's continue: find the total number of
customers. So the same thing: SELECT COUNT, and you can go with the customer key, for example, from gold dimension
customers, and I'm going to call it total customers. So let's go and execute it. So we can see in our system we have
18,000 registered customers. Now the next one says: find the total number of customers that have placed an order.
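The measure queries worked through up to this point can be sketched as follows. All object and column names (gold.fact_sales, gold.dim_products, gold.dim_customers, sales_amount, quantity, price, order_number, product_key, customer_key) are assumptions inferred from how they are spoken in the walkthrough:

```sql
-- Big numbers from the fact table (assumed name: gold.fact_sales).
SELECT SUM(sales_amount) AS total_sales    FROM gold.fact_sales;
SELECT SUM(quantity)     AS total_quantity FROM gold.fact_sales;
SELECT AVG(price)        AS avg_price      FROM gold.fact_sales;

-- COUNT counts rows with a non-NULL order number;
-- COUNT(DISTINCT ...) counts each order number once.
-- Comparing the two shows whether order numbers repeat.
SELECT
    COUNT(order_number)          AS order_rows,
    COUNT(DISTINCT order_number) AS total_orders
FROM gold.fact_sales;

-- Big numbers from the dimension tables (assumed names).
SELECT COUNT(product_key)  AS total_products  FROM gold.dim_products;
SELECT COUNT(customer_key) AS total_customers FROM gold.dim_customers;
```

Putting COUNT and COUNT(DISTINCT ...) side by side in one query is a small convenience for the before-and-after comparison recommended above.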
So that means having a customer inside our database doesn't mean that this customer has already placed an order.
Maybe we have customers that just registered and didn't order anything. So what we're going to do is
take the same query, but instead of targeting the customers table, we're going to target our fact, the sales. So
let's go and execute it. So now, as you can see, we are getting 60,000, which makes no sense, because one customer
might place multiple orders and so is counted more than once. So what we're going to do is say DISTINCT, and let's query it again. So
now it is more correct. We are getting around 18,000 customers. Now we can compare the two counts. As you
can see we are getting the same number. So that means all our registered customers have already placed an order,
because the numbers match. So it is very simple. We are just using aggregate functions, and with that we are
getting those key values. But what I usually do is collect all those measures in one query in order to have
an overview of all the key numbers in our business. So instead of querying each one of them individually, I combine them
in one go. So now what we're going to do is generate a report that shows all the key metrics of our
business. So how I usually do it, I'm going to go and get the first query for the total sales and put it over here.
And now I'm going to build only two columns. The first one is the name of the measure and the second one is the
value of the measure. So let me show you what I mean. Now this one over here, I will not call it total sales. I'm going
to make it generic. So I'm going to say measure value. And before it we're going to add another column from a
static string value, 'Total Sales', and we're going to call it measure name, like this. So let's go and just execute
this one over here. So the measure is total sales. It is no longer the column name. It is now a value in
the output, and the measure value is around 30 million. Now what I'm going to do is add another
measure as a second row. And in order to do that, we're going to use UNION ALL and then copy the whole thing over
here and say total quantity, and we're going to change the measure to quantity. So now let's select both of them and
query. And as you can see we now have the two big numbers in one query: the total sales and the total quantity.
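A sketch of this two-column report, extended in the same way to the other measures from this step. The measure labels and all object and column names are assumptions. UNION ALL needs every SELECT to return the same number of columns with compatible data types, which holds here because each row is a string label plus a number:

```sql
-- One row per key metric: a label and its value.
SELECT 'Total Sales' AS measure_name, SUM(sales_amount) AS measure_value
FROM gold.fact_sales
UNION ALL
SELECT 'Total Quantity', SUM(quantity)
FROM gold.fact_sales
UNION ALL
SELECT 'Average Price', AVG(price)
FROM gold.fact_sales
UNION ALL
SELECT 'Total Orders', COUNT(DISTINCT order_number)
FROM gold.fact_sales
UNION ALL
SELECT 'Total Products', COUNT(product_key)
FROM gold.dim_products
UNION ALL
SELECT 'Total Customers', COUNT(customer_key)
FROM gold.dim_customers
UNION ALL
SELECT 'Customers With Orders', COUNT(DISTINCT customer_key)
FROM gold.fact_sales;
```

The column aliases only need to appear in the first SELECT; the result set takes its column names from there.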
So now what we can do is collect all those big numbers and measures and put them in one query. So
with that we have the average price, the total number of orders, products and customers, and you can also
target different tables, because SQL here cares only that the number of columns and the data types of the columns
match. So now let's go and query this, and now in a single query we can see the big numbers, the key metrics
of our business. We can see the total sales, total quantity, average price and so on. This is a super report that you
can generate for any business, where you have in one go the full big picture of the business. So this is what I
generally do if I'm exploring a new database. I put all those big numbers and measures in one query to have a better
understanding of the business. All right my friends. So with that we now have a clear understanding of the
dimensions and also the measures of our data set. Now in the next step we're going to start combining
things together in order to generate insights. And we're going to focus now on a very basic analysis. It is the
magnitude analysis. Okay. So now what exactly is a magnitude analysis? It's all about
comparing the measure values across different categories and dimensions. And this can help us of course to understand
the importance of different categories. Now the formula for that going to be interesting. So now this time we will be
mixing stuff together. So first we have to go and aggregate a specific measure and then we say by dimension. We need
here the dimension in order to split the measure. It sounds complicated but it is very simple and basics. So for example
we can say the total sales by country, the total quantity by category, the average price by products, the total
orders by customer and if you follow this formula you will be generating endless amount of insights by just
combining any measure with any dimension. You can call it it is a new insight. So it's going to look like like
this. If you have one measure that is like for example 600 and if you put now this measure together with dimension
what's going to happen this 600 is going to be splitted by the dimension values. So A going to have like 200, B going to
have 300 and C 100. And now with that we can go and compare those categories right. So we can see now that category B
has the highest measure and the C has the lowest. And this help us to compare the values of the measure. what is the
best category and what is the worst category. So this is very basics analyszis. So let's go and apply this
formula on our data sets. Okay. So now let's go and break all our measures by dimensions. So here I have prepared few
interesting examples, where first we're going to break down the total number of customers (as we learned, we have 18,000)
by country. So the measure is total customers and the dimension is going to be the country. So let's go and
write the query for that. So we're going to SELECT, and the first thing that we're going to add is the dimension, so
it's going to be the country. And then we need the measure. It's going to be the COUNT of the customer key. So this
will give us the total customers. And we need to select our table, so it's going to be the dimension customers. And of
course we have to group the data by country. So GROUP BY country. So let's go and execute it. And
with that you see again the list of countries. So we have our six countries and then the total customers for each
country. So with that we can see the distribution of customers by country. But what we usually do is
sort the data by the measure, the total customers, like this. And we're going to sort it descending. So with
that we will get the countries with the most customers first. So let's go and execute it. So now we can see in the
results that the highest number of customers comes from the United States, then Australia, the United Kingdom and so on, and
337 customers have no country information; it is not available. So that's it, right? It is very simple. So
with that we have split the total number of customers by a dimension, the country. Now of course we can go and
split the data by a different dimension. So for the next one we are saying: find the total customers by
gender. So here it's the same thing. We have the same measure, the total customers, but we are splitting the data
by a different dimension. So just copy and paste, and now instead of country we're just going to switch it to
gender, here and over here, and that's it. So let's go and execute. So now as you can see the granularity of the gender over
here is different from the countries. We have here only three values, and we can see it is split almost evenly between
male customers and female customers. And of course this is going to help us understand the demographics of our
customers. And as you can see it was very simple. We just switched the dimension. So you can go and split
by the marital status as well, and so on. Now let's go and split the total products by category. Well, actually
the query is going to be very simple as well. So SELECT, and here we're going to have the same aggregate function, the
COUNT of product key as total products, from our table gold dimension products, and then we're going to group
by the dimension, the category, and we're also going to order by the same thing, total products, descending, from the
highest to the lowest. So let's go and execute it. And with that we can see how many products we have in each of
those categories. And we can see the biggest category is the components, and after that the bikes. And it is
interesting that we have seven products with nulls, which don't belong to any category. This is really
nice to know. Let's go to the next one. What do we have over here? What is the average cost in each category? So this is a
different style of question, but in the end we're going to have the same thing. We have over here the average cost.
This is the measure, and the category is our dimension. It's like we are saying: find the average cost by category. So what
we're going to do is copy the same query, and the dimension is the same, the category, but the
measure is different. We are not talking about the total products. We are going to say AVG, and here we're going to
have the column cost, and let's rename it average cost. So that's it. For the ORDER BY as well we have to use the
new measure. So let's go and execute it. So now we can see the most expensive category is the bikes, which cost a lot
compared to the accessories of course. So you can see the accessories are only 13 and the bikes are 900. So this
also gives us insight into how expensive each category is, and as you can see it is always the same template.
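That template, one aggregated measure split by one dimension and sorted by the measure, can be sketched for the four single-table examples above. The names gold.dim_customers, gold.dim_products, country, gender, category, cost, customer_key and product_key are assumptions:

```sql
-- Total customers by country.
SELECT country, COUNT(customer_key) AS total_customers
FROM gold.dim_customers
GROUP BY country
ORDER BY total_customers DESC;

-- Total customers by gender: same measure, different dimension.
SELECT gender, COUNT(customer_key) AS total_customers
FROM gold.dim_customers
GROUP BY gender
ORDER BY total_customers DESC;

-- Total products by category.
SELECT category, COUNT(product_key) AS total_products
FROM gold.dim_products
GROUP BY category
ORDER BY total_products DESC;

-- Average cost by category: same dimension, different measure.
SELECT category, AVG(cost) AS avg_cost
FROM gold.dim_products
GROUP BY category
ORDER BY avg_cost DESC;
```

Rows where the dimension is NULL form their own group, which is how products without a category show up in the result.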
We are splitting a specific measure by a dimension. So let's keep going to the next one. It says: what is the total
revenue generated for each category? So again here the question is: find the total revenue by category. So the total
revenue here is the measure and the category again is the dimension. Now the total revenue comes from the fact
and the category comes this time from the dimension. So that means we have to join tables, right? So how are we
going to do it? Let's start with SELECT * FROM, and I always like to start from the fact table. So
fact sales f, and then we're going to join it with the dimension, and usually I go with the LEFT JOIN in order
not to lose anything, because if you use an INNER JOIN you might lose a few orders and a few sales from the fact, and I don't want
that. So LEFT JOIN with the dimension, which is going to be the products, and the key for that is going to be very simple:
it's going to be the product key, and the same thing for the fact. So with that we join the fact table with the dimension.
So now we have to pick what we need. We need the sales from the fact, right? So sales amount. And we need
the category from the products, and we want to group the data by the category. So this part is done. What is missing is of
course the aggregation. We are actually aggregating the sales. So SUM of sales amount, and we can call it total revenue.
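A sketch of the query as assembled so far, with the usual sort by the measure included. The names gold.fact_sales, gold.dim_products, product_key, sales_amount and category are assumptions:

```sql
-- Total revenue by category.
-- The measure comes from the fact table, the dimension from the products table.
SELECT
    p.category,
    SUM(f.sales_amount) AS total_revenue
FROM gold.fact_sales AS f
LEFT JOIN gold.dim_products AS p      -- LEFT JOIN keeps every sales row,
    ON p.product_key = f.product_key  -- even if no product matches
GROUP BY p.category
ORDER BY total_revenue DESC;
```

With a LEFT JOIN, sales rows without a matching product are kept and land in a NULL category group instead of silently disappearing, which is the reason given above for preferring it over an INNER JOIN.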
So like this. And of course we can order the data by the total revenue, by our measure, descending, from highest
to lowest. So as you can see it is exactly like the previous one. But here the data doesn't come from only one
table. Here it comes from two tables. So the measure comes from the fact and the dimension comes from the dimension
products. And this is classic, right? The dimension has all those descriptions and details about the products, like the
categories. And the fact table has all those measures and dates that we use in order to calculate our measures. So
that's it. Let's go and execute it. Now, as you can see in the output, the category bikes is bringing in most of the
revenue. So here it's in the millions, 28 million of sales, and the accessories and the clothing are not really bringing in a
lot of revenue. Both of them are below 1 million. So with that you can understand our business is making a
lot of money selling bikes, right? So my friends, as we are exploring the data we are understanding more and more about
our business, right? So let's keep going to the next one. We have here the question: what is the total revenue
generated by each customer? So now we want to find out the top spenders, right? SELECT *, and again we start from
the fact table, and this time we're going to LEFT JOIN it with the customers, so the dimension customers, and we're
going to join the data using the customer key. And what we're going to do is
get maybe the customer key, and let's also get the first name, maybe a few details about the customer, and
the last name as well. So those are the columns that we want from the customers. And now what do we need? We need the
aggregation. So it's going to be the same thing: SUM of sales amount as total revenue. And we have to group
the data by all three of those columns. So we're going to copy and paste. And at the end, as usual,
we're going to order by the measure, total revenue, descending. So that's it. It is exactly like the previous one but with
different dimensions. So let's go and query it. And now we get a full list of all our customers, the 18,000. And we
can see the total revenue for each customer. So we can see Nicole and Caitlyn; they are our top spenders and
the most loyal customers, the ones that generated the most sales and revenue for our business. This is really cool, right? Now let's go to
the next one. It says what is the distribution of sold items across countries. It is like finding the total
quantity by country. So it is very simple. I'm going to go and take the same query, because the country comes from
the dimension customers and the sold items, the quantity, come from the sales. So we are doing the same joins but with
different dimensions and measures. So what we need from the customers is only the country, and the measure is going
to be the quantity. And here we're going to go and say total sold items, and we have to change the group by to the
country and sort the data by the new measure. That's it. And with that we are generating new reports by just
changing the dimensions and measures. So again this is very interesting to understand which country is generating
good business for us. So my friends, as you might have already noticed, if in the dimension we have a small number of
unique values, like in the countries we have here only seven values, in the gender we have only three, we call those
dimensions low-cardinality dimensions, because we have a low number of values inside them, and in the result we will get
only, for example, seven rows here. But if our dimension is high cardinality, like the customers, where we have 18,000 unique
customers, then our measure is going to be split by those 18,000, and in the results we will get exactly the same
number of rows as customers. So the number of rows in the results really depends on the cardinality of the dimension. So as you
can see, we can generate a lot of different reports by only following this formula, dividing the measure by a
dimension. So we just generated eight different insights and reports with only a few measures and dimensions. So now what
you can do, you can pause the video and try different dimensions and measures in order to have more insights about our
business. Okay. So as you can see, this is the basic analysis that we can do in any data set or any domain, where we
are aggregating a measure by a dimension. Now in the next and last step in our project we will be doing ranking
analysis. Okay. So what is ranking analysis? It is very basic. We're going to go and order the values of our
dimension based on a measure in order to identify the top performers and as well the bottom performers. And the formula
for that is going to be the following. So this time we're going to be ranking the dimensions by an aggregated measure.
So for example, we're going to rank the countries by the total sales, or we're going to find the top five products by
the sold items, the quantity, or the bottom three customers by total orders. So it's like the magnitude analysis.
We're going to have an ordered list of dimension values, for example from the highest to the lowest, in order to
quickly identify the top performers. And of course we can go and filter the data by saying I would like to have only the
top two categories. And with that you are removing all other dimension values that are not in the top two. And in SQL we
can use for that the keyword TOP, or we can use the ranking window functions like RANK, DENSE_RANK, ROW_NUMBER and so
on. So let's go and apply our formula in order to rank our data set. Okay. So now let's check our data. We're going to
start with the first question. Which five products generate the highest revenue? So we are searching for the
best-performing products in our business. So of course the first question is: what are the dimension and the
measure that we have in this question? Well, the revenue, that means we need the sales from the fact table, and the products,
that means we need the dimension products. Now, writing this query is going to be very simple. So
we can use the group by as well. I will not write it from scratch, so I'm just going to take this query over here
where we aggregated the total sales by the category. Now what I have to do is just change the dimension. So instead
of the category we need the product name, and we are now aggregating the data by the product name because we need the top
five products, right? So the revenue is the sum of the sales amount, and with that we have almost everything ready. So
let's go and execute it. And now we can see we have a list of all products in our business, and as well we can see the
total revenue. But the task says here we need the top five. So we don't need all the products from our database. We have
to go and select only this subset. Now in order to do that in SQL Server, it's very simple. We're going to go over here
and say TOP 5, and SQL is going to go and return only the first five rows from the results. So let's go and execute it. And
as you can see now in the results, we have only five products with the highest sales. And that's it. With that, we have
solved the task, and we can see the top five products, and all of them are bikes. Now let's go and check the other side.
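The top-five query described here can be sketched as follows. This is only a minimal illustration, not the course's own script: it runs the SQL through Python's built-in sqlite3 module on a tiny made-up data set, the table and column names (fact_sales, dim_products, product_key, product_name, sales_amount) are assumptions, and because SQLite has no TOP keyword it uses LIMIT 5 where SQL Server would use SELECT TOP 5.

```python
import sqlite3

# Tiny in-memory database; table names, column names and values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_products (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales (product_key INTEGER, sales_amount INTEGER);
INSERT INTO dim_products VALUES
    (1, 'Bike A'), (2, 'Bike B'), (3, 'Bike C'),
    (4, 'Helmet'), (5, 'Socks'), (6, 'Cap'), (7, 'Gloves');
INSERT INTO fact_sales VALUES
    (1, 900), (1, 800), (2, 1200), (3, 700),
    (4, 60), (4, 40), (5, 10), (6, 25), (7, 30);
""")

# Aggregate the measure (sales) by the dimension (product name),
# sort from highest to lowest, and keep only the first five rows.
# In SQL Server the row limit would be written as SELECT TOP 5 instead of LIMIT 5.
top_five_sql = """
SELECT p.product_name, SUM(f.sales_amount) AS total_revenue
FROM fact_sales AS f
LEFT JOIN dim_products AS p ON p.product_key = f.product_key
GROUP BY p.product_name
ORDER BY total_revenue DESC
LIMIT 5
"""
top_five = conn.execute(top_five_sql).fetchall()

for product_name, total_revenue in top_five:
    print(product_name, total_revenue)
```

Removing DESC from the ORDER BY makes the sort ascending, so the same query then returns the five lowest-revenue products.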
We want to find the five worst performing products by the same measure, the sales. And this is very simple. So
what we're going to do, we're going to go and take the same query over here. And now what we're going to do, we're
going to go and sort the data from the lowest to the highest. So instead of descending, we're going to remove it.
And with that, SQL is going to use ascending, which is the default. So let's go and execute it. And with that, as you can see, we are
getting the five worst-performing products by just sorting the data differently. So it is very simple, right?
And with that we can see our five best sellers and the five worst sellers. And now what we can do, we can go and just
change the dimension and generate different reports. Like instead of the product name, let's go and check the
subcategories: what are the best subcategories in our data? So I just change the dimension. Let's go and query.
So with that we can see the best subcategories we have in our business, and the same thing if you want to go and
check the worst-performing subcategories. So generating reports is very simple. And now, my friends, in SQL
there are two ways to create a ranking. We have a simple one where we are using the GROUP BY clause together
with the keyword TOP. But if you are generating reports where things are more complex and you need more
flexibility, you should use the window functions. So let me show you how I can solve this task using the window
function. So now I'm going to go and take almost the same query. Let's put it over here. I'm going to get rid of the
top five. And let's see, we are still working with the product name, as well with a GROUP BY. But now what we're going
to do, we're going to go and generate a rank. So we can go and use for example ROW_NUMBER. And in SQL there are
different types of window functions for ranking. One of them is ROW_NUMBER, or RANK. And then we're going
to say OVER. Now we're going to go and sort the data. It's like we have done in the previous one. We have to sort the
data by the total revenue, and the total revenue is the sum of sales, descending, and we're going to call this
rank products. So let's go and execute it. Now as you can see we have created a new column where we have a rank. So
we have for each product one rank, up to the last product at 130. So now what we are
interested in is to go and select the top five, right? Now in order to do that we need a second step. That's why we're
going to go and use a subquery. So we're going to say select star from, and then we're going to put the whole thing
in a subquery, something like that. And all you have to do is use the new rank column that we have created in order
to filter the data. So we're going to say where the rank products is smaller than or equal to five. And with that we
should get only the top five products. So let's go and execute it. And as you can see we are getting the same results.
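The window-function variant of the same ranking can be sketched like this. Again this is a hedged illustration on invented data, run through Python's sqlite3 module (SQLite 3.25 or later is needed for window functions); the table and column names are assumptions, while the alias rank_products follows the name used in the walkthrough.

```python
import sqlite3

# Same invented sample data as a stand-in for the real tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_products (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales (product_key INTEGER, sales_amount INTEGER);
INSERT INTO dim_products VALUES
    (1, 'Bike A'), (2, 'Bike B'), (3, 'Bike C'),
    (4, 'Helmet'), (5, 'Socks'), (6, 'Cap'), (7, 'Gloves');
INSERT INTO fact_sales VALUES
    (1, 900), (1, 800), (2, 1200), (3, 700),
    (4, 60), (4, 40), (5, 10), (6, 25), (7, 30);
""")

# Step 1 (inner query): aggregate revenue per product and number the rows
#                       from highest to lowest revenue.
# Step 2 (outer query): filter on the generated rank column.
ranked_sql = """
SELECT *
FROM (
    SELECT
        p.product_name,
        SUM(f.sales_amount) AS total_revenue,
        ROW_NUMBER() OVER (ORDER BY SUM(f.sales_amount) DESC) AS rank_products
    FROM fact_sales AS f
    LEFT JOIN dim_products AS p ON p.product_key = f.product_key
    GROUP BY p.product_name
) AS ranked
WHERE rank_products <= 5
ORDER BY rank_products
"""
top_five_ranked = conn.execute(ranked_sql).fetchall()

for row in top_five_ranked:
    print(row)
```

The subquery is needed because a window function cannot be referenced in the WHERE clause of the same query level that computes it.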
Now, of course, with the window function, it is more complicated than the first one. But with the window
function, we get more flexibility in selecting more columns or adding different types of aggregations and
details to the query. And as well, we can go and use different types of ranking functions that handle ties
differently. So, if the task is very simple like this, I'm going to go with the simple GROUP BY. But if you are
generating complex reports, I'm going to go with the window function. So now what you can do, you can go and rank
the data by different dimensions and measures. For example, find the top 10 customers who have generated the highest
revenue. And as well, you can go and find the three customers with the fewest orders placed. So again, we can go and
reuse the previous queries that we have written. So this query returns the customers and their total sales. And all
you have to do is say TOP 10 and then rerun the query. And with that, we are getting the top 10 customers. And
for the lowest three customers, all we have to do is go and replace the measure. So we are counting the
unique number of orders. So we're going to say total orders, and as well go and change the ORDER BY from descending to
ascending. And we need the top three. So let's go and execute it. So we can see three customers that ordered only
once, and they are the three customers with the fewest orders. So as you can see, by just switching the dimensions and
measures we are generating completely new, important insights, and as you can see, as we are exploring the data we are
understanding what the best products are and who the top customers are, which is usually very important for reporting.
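The idea of reusing one query and only swapping the measure and the sort direction can be sketched as below. This is an illustrative assumption, not the course's code: the helper rank_customers, the table and column names (dim_customers, fact_sales, order_number and so on) and all values are hypothetical, the SQL runs on SQLite via Python's sqlite3 module, and LIMIT stands in for SQL Server's TOP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customers (customer_key INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE fact_sales (order_number TEXT, customer_key INTEGER, sales_amount INTEGER);
INSERT INTO dim_customers VALUES
    (1, 'Ana', 'Lopez'), (2, 'Ben', 'Kim'), (3, 'Cara', 'Singh'), (4, 'Dev', 'Osei');
INSERT INTO fact_sales VALUES
    ('SO1', 1, 500), ('SO2', 1, 300), ('SO3', 1, 200), ('SO4', 1, 100),
    ('SO5', 2, 900), ('SO6', 2, 400), ('SO6', 2, 50),
    ('SO7', 3, 50),
    ('SO8', 4, 20), ('SO9', 4, 30), ('SO10', 4, 10);
""")


def rank_customers(measure_sql, direction, n):
    """Hypothetical helper: aggregate a measure per customer and keep n rows.

    measure_sql and direction are trusted snippets written by the analyst,
    never user input, so plain string formatting is acceptable here.
    """
    sql = f"""
    SELECT c.customer_key, c.first_name, c.last_name, {measure_sql} AS measure
    FROM fact_sales AS f
    LEFT JOIN dim_customers AS c ON c.customer_key = f.customer_key
    GROUP BY c.customer_key, c.first_name, c.last_name
    ORDER BY measure {direction}
    LIMIT ?
    """
    return conn.execute(sql, (n,)).fetchall()


# Highest revenue first: the top spenders.
top_spenders = rank_customers("SUM(f.sales_amount)", "DESC", 2)
# Fewest distinct orders first: one order can span several fact rows.
fewest_orders = rank_customers("COUNT(DISTINCT f.order_number)", "ASC", 2)

print(top_spenders)
print(fewest_orders)
```

When several customers share the same count, as with the customers who ordered only once in the walkthrough, which of them land inside the row limit is arbitrary unless a tiebreaker column is added to the ORDER BY.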
All right, my friends, so with that we have covered the last step in our project, how to rank our data, and with
that we have covered all the steps of the project road map. We have done a lot of exploration of the database,
dimensions, and measures. We have combined the dimensions and measures in order to do magnitude and ranking analysis.
Okay, my friends. So that's all about the EDA project. And now in the next one, we will do the last type of
project, the advanced data analytics. So let's go. And now the type that we're going to
cover is the advanced analytics project using SQL, where we're going to write complex SQL queries to answer real
business questions. So we're going to use the advanced window functions, CTEs, and subqueries, and we're going to go and
script two big queries in order to generate two reports. So with this type of project, you will learn how to solve
real business questions using advanced techniques. All right. So for this project as well, we have a road map
where we're going to progress through different types of steps and analyses. So we're going to do many things like change
over time, cumulative analysis, performance analysis, data segmentation, and at the end reporting, all using SQL. So
let's start with the first step in the road map. We're going to analyze the change over time. So let's
go. Okay. So now what is change over time? It is a technique to analyze how a measure evolves over
time. And this is very important in order to track trends and as well to identify seasonality in your data. And
the formula for that is very simple. We're going to go and aggregate a measure, but this time based on a date
dimension. For example, the total sales by year, the average cost by month. So if you combine any aggregated
measure with a date column or dimension, then all you are doing is analyzing the change over
time. So for example, we're going to go and break our measure down, this time by the years. And with that we
can track immediately how our business is doing over time, over the years. So for example, we can see here the best
year was 2024, and then we have a really hard decline in our business in 2025, and then it's going up slightly in 2026. So
with that we can quickly analyze the trends of our business. So now let's go and check the trends and the changes
over time in our business. Okay. So now let's analyze the trends and changes over time in our data, and in order to do
this kind of analysis we usually target the fact table, because there we usually have our measures and as well dates. So
we have the order date, shipping date and due date. Now what we can do, we can go and analyze the sales
performance over time. So as we learned, all we need is a metric and a date. Let's go for example and select the
order date and as well one of those measures, the sales amount, from our fact table. So let's go and query it. And we
can go and order the data by the order date ascending. So let's go and execute. And as you can see we have
nulls in our data. What can we do? We can go and filter those rows out. We don't need them. So we're going to say
where order date is not null. So let's go and execute it again. All right. So now we don't have those orders. Now, as
you can see, we have sales over time, right? We have a date and we have a measure. So this looks really good. But
now what we're going to do, we're going to go and aggregate the sales amount. So let's go and say sum.
And we're going to call it total sales. And then we group the data by the order date. So let's go and execute it.
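As a rough sketch of this change-over-time pattern, the example below aggregates the sales first per day and then per year, with a distinct customer count added to the yearly query. It is an assumption-laden illustration: the data is invented, the table and column names are guesses, and because it runs on SQLite through Python's sqlite3 module it extracts the year with strftime, where SQL Server offers the YEAR function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (order_date TEXT, customer_key INTEGER, sales_amount INTEGER)"
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [
        ("2012-03-10", 1, 100),
        ("2012-03-10", 2, 150),
        ("2013-01-15", 1, 300),
        ("2013-12-05", 3, 400),
        ("2014-01-09", 2, 50),
        (None, 3, 999),  # order without a date, filtered out below
    ],
)

# Day-level granularity: one row per order date.
daily = conn.execute("""
SELECT order_date, SUM(sales_amount) AS total_sales
FROM fact_sales
WHERE order_date IS NOT NULL
GROUP BY order_date
ORDER BY order_date
""").fetchall()

# Year-level granularity: only the date expression changes.
# strftime('%Y', ...) is the SQLite stand-in for SQL Server's YEAR(order_date).
yearly = conn.execute("""
SELECT CAST(strftime('%Y', order_date) AS INTEGER) AS order_year,
       SUM(sales_amount) AS total_sales,
       COUNT(DISTINCT customer_key) AS total_customers
FROM fact_sales
WHERE order_date IS NOT NULL
GROUP BY order_year
ORDER BY order_year
""").fetchall()

print(daily)
print(yearly)
```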
And with that, as you can see, for each day we have the total sales. So now the granularity of our data is the day, and
we can say of course now we are analyzing the sales over time. But usually we don't aggregate the data at
the day level. We want to have higher aggregations. For example, let's go to the years. And now in order to change the
date dimension here from a day to a year, we have to use date functions, and there are a lot of date functions to
extract a date part. And now, just to get the year, we have a quick function called YEAR, and it is going to
convert our date to a year. So let's call it order year, and of course we have to go and group the data by
the year and as well sort it by the year. So let's go and execute. Now we are at the year level and we have only
five years. So that means we have changed the aggregation from the day to the year, and now it is very easy to
analyze the performance of our business over the years. So the first year was the lowest, and you can see 2013 is the
best year in our business, and then it declined massively in 2014. And of course we can go and add more measures
to our data, not only the total sales. For example, let's go and calculate the total number of customers. So we can say
COUNT DISTINCT customer key as total customers. So let's go and execute it. And with that we can check: are we
gaining customers over time, are there any trends that we can see? And we can go and keep extending it, like
we can go and add the total quantity. So the sum of quantity as total quantity. So let's go and execute,
and with that we have a really nice picture in order to understand: is the revenue increasing or decreasing over
time, what is the best year, the worst year, are we gaining customers over time, are there any trends that we can
spot? Now by looking at the result you can see this gives us a high-level, long-term view of our data, and of
course it helps with strategic decisions. And now what we can do, we can go and drill down to the months. So we can go
and aggregate the data by the month, regardless of the year, in order to give us an idea of how each month is
performing overall. So all we have to do is switch the function from YEAR to MONTH like this, and of
course for the GROUP BY and the ORDER BY. Let's go and execute, and of course in the output we will get all the months.
And guess what, the best month for sales is of course December, because you have Christmas and so on,
and the worst month, as you can see, is February. So with that we are understanding the seasonality of our
business and the trend patterns of our business. And as we are not including the year in our analysis, we are
aggregating all the data from all years. Now what we can do, we can make it more specific for each year by
adding the year information to our query. So we can have both a year and a month. Let me just change this to a month. And
of course we have to go and add it to the GROUP BY and the ORDER BY. So let's go and execute, and with that we are
aggregating the data by month of a specific year. So now we have all the months of all years, and now if you want
to focus on only one year, what you can do, you can go and filter the data by the order year, and with that you can see how
the data is evolving over time. Now of course in SQL we can go and format the date differently. So instead of using
the year and the month in separate columns, what we can do, we can use the DATETRUNC function. So instead of this here
we're going to say DATETRUNC, and if you want the granularity of your date at the month level we're going to say month, and
then the date, and with that you will get both the year and the month. And let's call it order date like this. So let's
go and execute. Now in the output we will get exactly the same result as before, but instead of having two
columns for the year and the month we have everything in one. And because we said month, that means it is
going to go and reset the day. So as you can see it always starts with the one, so the first day of the month, and
with that you will get one row for each month of each year. And if you want to change that quickly to a year, you just
go and change the date part to year and you will get the granularity of the year. Now if you don't like this format
and you would like to have your own specific format, what you can do, you can go and use the FORMAT function. So FORMAT, the
first argument is going to be the date, and then you specify the format that you want. So for example it starts with
the year, and let's say I would like to have the abbreviation of the month name. So something like this, and of course
GROUP BY and ORDER BY. So let's go and execute it. And with that we get our format: the year, a dash, then the
abbreviation of the month. But you have to be careful which function you are using, because with FORMAT you will get
a string in the output. And as you can see you cannot sort it correctly. So the data here is sorted by the year but not
by the month. But if you are using DATETRUNC you can see the data is correctly sorted. So if we switch it to a month it
will be as well. Okay. So everything is sorted correctly because the output here is a date, and SQL is going to sort the date
correctly. It is not a string. And if you are using YEAR and MONTH, the output here is going to be an integer, and
sorting an integer is not a problem. So of course you can go and pick the one that you like. So that's it. Let's go
and execute it. And now you can go and keep analyzing by finding another date in our data set and another measure. So
as you can see it is very simple. Okay. So that's all about how to analyze the trends and the change over time. Now in
the next step we're going to do some kind of advanced aggregations by doing cumulative
analysis. Okay. So what is cumulative analysis? It is aggregating the data progressively over time, and this is
a very important technique in order to understand how our business is growing over time. So how our business is
progressing over time, whether it is growing or declining. It is a very interesting analysis. So the formula
is going to be very similar to the change over time, but instead of having a simple aggregation on the measure we're going
to aggregate our measure cumulatively. So we are adding values on top of each other, and the data again
is split by the date dimension, because we want to track the progress over time. For example, we can find the
running total of sales or the moving average of sales by month. So now let's have again our simple example
where our sales are split by the years. Now this is the classic change over time. But now, in order to make it
cumulative, what is going to happen? We're going to take the measure and add to it. For example, for 2024 we have 300. And now for
2025, we're going to add the 300 together with the 100 in order to make it cumulative. So for 2025, we're going
to have 400. And the same thing for 2026, we're going to go and add the 400 together with the 200. And with that, we
will get 600. So as you can see, we keep adding the values in order to generate something called a cumulative
value. Now for this type of analysis, we use in SQL the aggregate window functions in order to find
the cumulative values. So now let's go and apply our formula in order to find whether our business is growing or
declining. So let's go. Okay, so now we have to analyze the following. We're going to calculate the total sales for
each month and as well the running total of sales over time in order to analyze the trends. So let's see how we're going
to do that. Let's start with the easy part, where we're going to calculate the total sales for each month. So we are
calculating the change over time, and we have already done that. So all we need is a date and a measure. Our date
is going to be the order date and the measure is going to be the sales amount from our fact
table. So let's query this. And now we want to find the total sales for each month. That means we're going to change
the granularity of the order date from a day to a month. And I usually like using DATETRUNC for this kind of task.
And the granularity is going to be the month. So this is the order date. And now for the sales we're going
to use the aggregate function SUM, so the sum of sales as total sales. And of course we have to go and group the data by the
date. So let's go and execute it. So as you can see, we now have the total sales for each month. And don't forget to get
rid of the nulls. So we can say where order date is not null. Now it looks better. We don't have nulls. And
of course we can go and order the data by our date. Now our measure is just aggregated for each month individually,
right? But we don't want that. We want to have a running total. So we'd like to have a cumulative metric.
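A running total of this kind can be sketched as follows. Treat it as a hedged illustration on invented data: the SQL runs on SQLite through Python's sqlite3 module, so strftime('%Y-%m-01', ...) is used as a stand-in for SQL Server's DATETRUNC(month, ...), and the table and column names are assumptions. The first window accumulates across the whole result; the second adds a PARTITION BY on the year so that the total restarts every January.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (order_date TEXT, sales_amount INTEGER)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?)",
    [
        ("2010-12-29", 40),
        ("2011-01-03", 10),
        ("2011-01-20", 20),
        ("2011-02-11", 30),
        ("2011-03-05", 50),
        ("2012-01-07", 5),
        (None, 999),  # order without a date, filtered out in the subquery
    ],
)

running_sql = """
SELECT
    order_month,
    total_sales,
    -- With ORDER BY and no explicit frame, the window covers everything
    -- from the first row up to the current row: a running total.
    SUM(total_sales) OVER (ORDER BY order_month) AS running_total_sales,
    -- Partitioning by the year makes the total restart for each year.
    SUM(total_sales) OVER (
        PARTITION BY strftime('%Y', order_month)
        ORDER BY order_month
    ) AS running_total_in_year
FROM (
    -- Monthly totals; the date is truncated to the first day of its month.
    SELECT strftime('%Y-%m-01', order_date) AS order_month,
           SUM(sales_amount) AS total_sales
    FROM fact_sales
    WHERE order_date IS NOT NULL
    GROUP BY strftime('%Y-%m-01', order_date)
) AS monthly
ORDER BY order_month
"""
running_rows = conn.execute(running_sql).fetchall()

for row in running_rows:
    print(row)
```

Note that the partition key has to be the year: partitioning by the month-level date itself would put every month in its own partition, and the running total would simply repeat the monthly total.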
In order to do that, we have to use a window function. So let's go and do that. We will use a subquery for that,
just to make it simple. So what do we need? We need the order date and, let's say, the total sales, and here we
have to have our window function. Then we're going to put the rest in a subquery. And of course we can
go and get rid of the ORDER BY in the subquery, because the ordering is going to be defined inside the window function. So now let's start
writing our window function. We will have the sum of total sales. So we want to sum up those new values. And we're
going to build a window function like this: OVER. We don't have to go and partition anything. So we can go
immediately and say ORDER BY our new order date that we have calculated. And we want it to be ascending. So actually
that's it. So as running total sales. So let's try that out. Now if you look at the result you can see that all those
values are cumulative, and it works like this. The first running total is equal to the total sales, because
before it we don't have anything. Now for the next row, what is going to happen is it is going to go and add this value to the
previous one. And with that we get the running total value. Now moving on to the third row, it is going to go and add all
those three values together. And of course this is going to give us the running total for this month, and so on. So as
SQL is moving through the window it is always adding the current value to all previous values. And this is because of
the default frame of the window. The frame is going to be between unbounded preceding and the current row. So that
means, for example, if we are at this row over here, the current total sales for this month is this one, and the unbounded
preceding is all the values before this month. So that means we are getting all the previous values together with the
current value, and with that we will get the effect of the running total sales. And now of course, as you can see, it is
going through all the years, right? Now we can go and limit the running total to only one year. So for each new year
it has to reset and start from scratch. So that means we are partitioning the data. For each year we
would like to have a partition. For the first year, it's going to be 2010. It is one row. And for 2011, we're going
to get the whole partition over here. So, in order to partition our window, it's very simple. We're going to go and
say PARTITION BY the year of the order date. That's it. Let's go and execute it. Now, let's go and check the first partition, for
2010. You can see the running total is the same as the first month. And since we have only one month, that's it for
this year. Now, as we go to the next year, as you can see, it resets. So you can see the running total sales for
2011: it is exactly the same as January. It is not adding the current value to the previous one now,
because the previous one is outside of the window. So as you can see we are getting a running total for the whole year,
and once we hit a new year it is going to reset. So it is working, and this is how you can create cumulative values in
SQL. And of course if you would like to change the granularity of our data it is very simple. All you have to do is
go over here and say, instead of month, we're going to make it a year. And of course don't forget to change as well
the GROUP BY. So let's go ahead and execute. And with that we are creating cumulative values for each year. But of
course it makes no sense to partition by the years now. Let's go and remove it and execute it again. And with that you are
creating the running total sales, the cumulative metric, over the years. So as you can see it is very simple. Now we
can go and add another measure and another aggregation. For example, instead of finding the running total we
can find the moving average. So let's for example go and get the moving average of the price. So first we have
to calculate the average of the price as average price. And now what we have to do is go and make another window
function over here where we are saying the average of the average price, and we're going to go and call it moving
average. That's it. So let's go and execute it. And with that you are getting the moving average price of our
sales. All right. So now you might still be asking what is really different between using a normal aggregation and
a cumulative aggregation. Well, we usually use normal aggregations in order to check the performance of each individual
row. Like if I want to see how each year is performing, I'm going to go and do a normal aggregation. But if you want to
see a progression and you want to understand how your business is growing, you have to go and use cumulative
aggregations, because you can easily see here the progress of your business over the years. So there is a difference
between using a cumulative value and a normal aggregation. All right. So with that you are done with the cumulative
analysis and you have learned the different types of aggregations. Now, as the next step in our road map, we're going to
do performance analysis. Okay. So what is performance analysis? It is the process of
comparing the current value with a target value to measure the performance of a specific category, and this can help
us to measure success and to compare performance. So the formula for that is very simple. We're going to
find the difference between the current measure and the target measure by subtracting them. For example, we
can go and compare the current sales with the average sales, or the current year's sales with the previous year's sales, or
the current sales with the lowest sales or maybe the highest sales. So as you can see we are always comparing the
current measure with a target, with something else. So for example, we have here again a measure that is
split by three categories. So those values are the current values. Now say you have a target, for example the
average. As you can see, for each row we have 200. Now what we can do, once we have those two things in one row,
we can go and simply subtract them. So for A the current value is exactly equal to the average. Both of them are
200 and the difference between them is zero. So this category is performing at the average. Now for the next one we have
300 and the target is 200. So the difference between them is 100. That means this category is performing very
well. So this is a good performer. Now for the last one we will get minus 100. So that means it is below the average.
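The three-category example can be reproduced with a small sketch. The values 200 and 300 and the average of 200 come from the example above; the third value of 100 is inferred from the stated difference of minus 100, and the table name, column names and category labels B and C are assumptions made for illustration. The SQL runs on SQLite through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category_sales (category TEXT, current_value INTEGER);
INSERT INTO category_sales VALUES ('A', 200), ('B', 300), ('C', 100);
""")

# AVG(...) OVER () puts the overall average (the target) on every row,
# so current value and target sit side by side and can be subtracted.
performance = conn.execute("""
SELECT category,
       current_value,
       AVG(current_value) OVER () AS target_avg,
       current_value - AVG(current_value) OVER () AS diff
FROM category_sales
ORDER BY category
""").fetchall()

for row in performance:
    print(row)
```

A positive difference marks a category above the target, zero marks one exactly at it, and a negative difference marks one below it.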
So it is not performing very well. And for this type of analysis we usually use window functions like the aggregate
window functions, SUM, AVG, MAX, MIN, or the value window functions like LEAD and LAG. So now let's go back to
SQL and apply this formula in order to measure the performance of our business. So let's go. All right, my friends. So
now we have the following task: analyze the yearly performance of products by comparing their sales to both the
average sales performance of the product and the previous year's sales. Okay, this sounds a little bit
complicated and serious. Let's have some coffee before we start. Okay, so what do we have over
here? So it is talking about the yearly performance of products. So that means we need the order date as a dimension
and as well the product, and the measure that is used over here is the sales. So let's do it step by step. So we need
things from our fact table, so fact sales, and we need the product. So I'm going to go and get it from the
dimension products in order to have a nice name. So we have to join the data by the product key, and I'm going to go
and change the alias to p. So product key. Okay. So with that we have our two tables. Now let's go and select our
columns. So we need the order date, we need the product name, and we need our measure. So it's going to be the sales
amount. All right. So now let's go and query this information. Now we have to analyze the yearly performance. That
means we don't need the day. The granularity is the year. So that's why, let's go and convert it using the YEAR
function. And we're going to call it order year. And of course we then have to go and aggregate the sales. And I'm
going to call it current sales. And of course we have to group the data by the date, the year, and as well by the
product name. So that's it. Let's go and execute it. And of course I'm going to go and get rid of all those nulls. So
where order date is not null. All right. So with that we have solved the first part. So we have the yearly performance
of the products. Now in the task we have to compare this value, the current sales, to the average sales performance of the
product. So that means we need the average, and as well the previous year's sales. So that means we have to compare
each value to the previous year, for the same product of course. So that means things are getting a little bit more
complicated, and for that we need the help of the window functions. Let's do it one by one. Let's focus on the
average sales. So now what we're going to do, based on those values, based on these results, we will do new
calculations and aggregations. And now in order to do that, either we use a subquery or a CTE. I'm going to go with
a CTE because it looks nicer. So WITH yearly product sales, this is the new name that we are giving to these
results. And now what we're going to do, we're going to build queries on top of these results. So first of all I will
just select everything from this table, yearly product sales, just to test. So it is working. Now I'm selecting data from
our CTE. So now as the next step I'm going to go and list all the columns that I want in my results. So the order year,
the product name, the current sales. This is just nicer in order to have control over which
columns you want to present in the final results. Now as the next step, I'm going to go and order the data by first the
product name and then the order year. And with that we can have a better understanding of
the results. So we can see this product has three years of sales, and those are the current sales for each year. So now
we have to go and calculate the average of those three sales. So in order to do that we're going to use the
AVG of the current sales, then OVER, and we have to decide now how to partition the data. Since we are focusing on the products, we
have to partition the results by the product name. So we're going to say PARTITION
BY product name, and we don't have to sort the data because we are using the average. So it doesn't matter how the
data is sorted. So let's call it average sales. So let's go ahead and execute it. And now if you look at the
results for this product, the average sales of all those three values is 13,000. So now as you can see, for each
row we have the current sales side by side with the average sales, and the same thing for the next product as well.
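The steps so far can be sketched as one query. This is a minimal sketch in SQL Server syntax, not the exact script from the video: the table and column names (gold.fact_sales, gold.dim_products, order_date, product_key, product_name, sales_amount) and the LEFT JOIN are assumptions based on how the tables are described.

```sql
-- Assumed names: gold.fact_sales and gold.dim_products are hypothetical
-- stand-ins for the fact and product dimension tables described above.
WITH yearly_product_sales AS (
    SELECT
        YEAR(f.order_date)  AS order_year,     -- granularity: one row per year
        p.product_name,
        SUM(f.sales_amount) AS current_sales   -- yearly sales per product
    FROM gold.fact_sales AS f
    LEFT JOIN gold.dim_products AS p           -- join type is an assumption
        ON f.product_key = p.product_key
    WHERE f.order_date IS NOT NULL             -- drop rows without a date
    GROUP BY YEAR(f.order_date), p.product_name
)
SELECT
    order_year,
    product_name,
    current_sales,
    -- No ORDER BY inside OVER: an average does not depend on row order.
    AVG(current_sales) OVER (PARTITION BY product_name) AS avg_sales
FROM yearly_product_sales
ORDER BY product_name, order_year;
```

One thing to keep in mind: if sales_amount is an integer column, AVG returns an integer in SQL Server, so the average is truncated. Casting to a decimal type first keeps the fraction.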
So now, since we have both pieces of information on the same row, the current sales and the average, we can calculate the change, the
difference between the current value and the average value. So all we have to do is to go and subtract, right? So
we're going to say the current sales minus the average sales, and we're going to call
it the difference from the average. So let's go and execute it. And now as you can see, we are getting the comparison.
We have the differences between the current and the average. And of course what I like to do is to make a flag, or
like an indicator, of whether we are above the average, below the average, or at the average. So in order to do that we're
going to go and use the CASE WHEN statement. So if the difference is higher than zero, then we are above the average,
right? Above average. Oh, let's have an abbreviation for that. And if we are below zero, that means we are below the
average, right? So then below average. And if it is exactly zero, so ELSE, then it is average. So that's
it. Let's end it, and I'm going to call it average change. So let's go and execute it. Now if you focus again on
one of the products you can see the current sales of this product in 2012 it is below the average. It is really low.
And for the next year, 2013, it is above the average. It was a really nice year for this product. And the last
year, 2014, it was again below the average. So with that we have a really nice flag in order to see quickly
whether we are above or below the average, and it is interesting to see whether we have zeros. So yeah, sometimes
it is exactly the same as the average, and here we have like a zero. It's not below or above. So with that we are comparing the
performance of the sales of each product with the average. And as you can see, it is really simple using
the window functions. So let's go and check our task again. We have compared the current sales to the average sales
performance. Now we have to compare it as well with the previous year sales. So let's go back to our example over here.
This time we have to compare the current sales not with the average but with the previous year. So we don't have to write
like another CTE or query. We can continue with the same results. So now all you have to do is to access the
previous year. And in order to do that, we have an amazing window function called LAG. So let's do it step by step. So now
we're going to go and create a new column using LAG. I want to access the previous value of what? The
current sales, right? So LAG of current sales, and OVER. We still have to partition the data
by the product name because we focus on the products. So PARTITION BY product name. But now, in order to access the
previous value, we have to sort the data, and we're going to sort it by the years. We need the previous year.
So we're going to say ORDER BY order year, and we're going to sort it ascending, from the lowest to the
highest. So we're going to leave it like this. And with that, this window function is going to give us the previous year's sales
of the product. So I'm just going to call it previous year sales, like this. And I think here we have something
wrong. Okay. So let's go ahead and execute it, and let's go and focus on one of those products. So now for the first
year of this product, the previous year was null, right? So we don't have any data from the previous year. But for the
2013, we have a previous year of 2012. So that's why now we are getting the previous value of the sales based on the
years. And the same thing for the last year over here. You can see we are getting the previous sales. So it is
working. And for the next window, the same thing: for the first year we will get NULL, and the previous sales we will get
from the previous year. So with that we have now the previous sales, and if you check this over here, we have in the
same row now the current sales of the current year and as well the sales of the previous year. Now what we have to
do is the same thing: we have to go and subtract those two values in order to compare them, right? So we're going
to go and do the same thing. So we will take the current sales minus the whole thing, the whole window function, and
we're going to call it the difference from the previous year. And with that we are calculating the differences
between them. So for this year for this product as you can see the difference here is really big between the current
sales and the previous year. Now of course what we can do, we can go and make as well a flag or an indicator. I'm
going to go and copy the whole thing from the previous average flag, but we have to go and put in the right function, here and
the same over here. And now it is not above or below the average; I'm going to say it is increasing or decreasing, right?
So increase or decrease, and we're going to call it previous year change. And instead of average we can say no change.
So let's go and execute it. And I'm having here an extra comma. Let's go and execute it. So again let's go and focus
on one of those products. For the first year of this product, there is no change because there is no previous year. For
the next year of this product, we have an increase, right? Because the current sales is way higher than the previous
year. And now by going to the last year of this product, we have a decrease because the current sales is less than
the previous year. So my friends, we call this type of analysis year-over-year analysis. And if you want to
calculate the month-over-month analysis, it's very simple. All you have to do is to go and change the function from
YEAR to MONTH, and with that you are extracting the month part. And the difference between analyzing the months
and years is of course the scope. Year-over-year is good for long-term trend analysis, where on the other hand
month-over-month is for short-term trend analysis. You are just focusing on the seasonality of your data. So this
is how we analyze the performance of our business by comparing the current measure with a target measure and you
can go and use different dimensions and stuff. So instead of the sales you can check the quantity instead of products
you can check the customers and you can go and compare the current information not only with the average or the
previous year you can compare it with the lowest sales and the highest sales and it can open the door for many
different insights. But we are always using the same methods using the window functions. We compare the current value
with another value in our data set. So this is how we do performance comparison. All right. So with that you have
learned how to analyze the performance of our business. Now in the next step we're going to do part-to-whole analysis.
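The full performance comparison walked through above, average plus previous year, can be sketched as follows. It is a sketch under assumptions rather than the author's exact script: gold.fact_sales, gold.dim_products, their column names, the output aliases, and the label texts are illustrative.

```sql
-- Assumed names: gold.fact_sales and gold.dim_products are hypothetical.
WITH yearly_product_sales AS (
    SELECT
        YEAR(f.order_date)  AS order_year,
        p.product_name,
        SUM(f.sales_amount) AS current_sales
    FROM gold.fact_sales AS f
    LEFT JOIN gold.dim_products AS p
        ON f.product_key = p.product_key
    WHERE f.order_date IS NOT NULL
    GROUP BY YEAR(f.order_date), p.product_name
)
SELECT
    order_year,
    product_name,
    current_sales,
    -- Comparison 1: current year against the product's average
    AVG(current_sales) OVER (PARTITION BY product_name) AS avg_sales,
    current_sales
        - AVG(current_sales) OVER (PARTITION BY product_name) AS diff_avg,
    CASE
        WHEN current_sales
             - AVG(current_sales) OVER (PARTITION BY product_name) > 0 THEN 'Above Avg'
        WHEN current_sales
             - AVG(current_sales) OVER (PARTITION BY product_name) < 0 THEN 'Below Avg'
        ELSE 'Avg'
    END AS avg_change,
    -- Comparison 2: current year against the previous year (LAG needs ORDER BY)
    LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) AS py_sales,
    current_sales
        - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) AS diff_py,
    CASE
        WHEN current_sales
             - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) > 0 THEN 'Increase'
        WHEN current_sales
             - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) < 0 THEN 'Decrease'
        ELSE 'No Change'  -- also reached when LAG returns NULL (first year)
    END AS py_change
FROM yearly_product_sales
ORDER BY product_name, order_year;
```

For the first year of each product LAG returns NULL, so both comparisons are unknown and the CASE falls through to 'No Change', which matches the behavior described above. Note that swapping YEAR for MONTH groups by month number only (1 to 12), so the same month from different years lands in one row; that suits a seasonality view but not a month-by-month timeline.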
So let's go. Okay. So now what exactly is part-to-whole analysis? Well, we use it in
order to find out the proportion of a part relative to the whole. Here we're going to analyze how an individual
category is contributing to the overall total, in order to understand which category has the most impact on the overall
business. So now for the formula, it is very simple. You have to go and pick one of your measures, divide it by the total of
the measure, and then multiply it by 100 in order to find the percentage by a specific dimension. Like for example, if
you take the sales, you divide the sales by the total sales, multiplied by 100, by the category. Or you take the
quantity divided by the total quantity and then find the percentage by a country. So for example, again we have
our measure split by categories. But now instead of having this number, what we're going to do, we're going to
calculate the percentage. So for the first one, we're going to take the 200 divided by 600, multiplied by 100. So
we're going to get the percentage 33. So once we do that for all the categories, it's going to be very easy to see
that the category P is contributing to the overall number by 50%, which makes it of course a top performer. So
you can visualize it in your head as like a pie chart, and you can see how each part is contributing to the whole pie chart.
And with that it can help us to understand the importance of each category to our business. So now let's
go and apply this formula to our measures in order to understand the importance of our categories. So let's
go. Okay. So now let's do part-to-whole analysis. All we need is one dimension and one measure. So for
example we have the following task. It is very simple. Which categories contribute the most to the overall
sales. So now let's go and do it step by step. So first we're going to go and collect the information. So we need the
category. We need the sales amount, and that information comes as usual from the fact sales and from our dimension
the product, right? So we have to quickly go and connect them using the product key. Okay. So that's all we need
for our query. So let's go and select. So we have here the categories and the sales amount. So now the first thing we
have to calculate the total sales for each category. So let's go and do that. It is very simple. So sum total sales
and we are grouping up the data by the category. So this is basics. Right now we have the total sales for each of
those categories. Now in order to calculate the percentage we need two measures the total sales for each
category and we have it here already and as well side by side we need the total sales across all categories. So the big
number without any dimension. But now as you look at the result, you can see the granularity here is the category. Now
we need the total sales again at a different granularity. And in order to mix those things together we use the
window functions. So now how are we going to do it? Either you go over here and start writing your window function, and of
course you can do it together with the GROUP BY, or you can do it as a second step in your query using either a CTE or
a subquery. So I'm going to go with the CTE just to make it clear. So category_sales, like this. So now let's start
again selecting the same information. So category and total sales from our CTE, category_sales. So let's go and
execute it. So now we have the same results and now we're going to go and build our window function like this. So
we're going to say SUM. We want to aggregate all those values, right, to get the total sales over the whole data
set. So we're going to say SUM of total sales. And now in order to get the big number we're going to say OVER, and
inside it we will not define anything because we don't want to partition the data. We don't want to introduce any
dimension. We just want the big number. And with that we will get the overall sales. So let's go and execute it. Now
as you can see, this is the total sales by the category. So the total sales is split by the categories. And this is
the overall sales of all orders, of everything, the highest number. Now since we have them side by side, what we can do,
we can very easily calculate the part-to-whole, or the percentage. So let's start doing that. We need the total sales and
we want to go and divide it by the overall sales. So we're going to take our window function and put it over
here. So let's go and multiply it now with 100. I'm going to go and call it percentage of total. So let's go and
execute it. Now as you can see, we are getting zeros, and that's because the total sales is an integer, not a float, so the division drops the decimals. So what we
have to do is to go and cast it to something like a decimal. So FLOAT, like this. So let's go and re-execute it. And
now, as you can see, we are getting the percentages, but we have a lot of digits after the decimal point. So we're going
to go and round the numbers now. So let's go to the start, ROUND, and then go to the end, comma, and let's have like
two decimals. So let's go and execute it again. Now it looks perfect. Now what we can do, we can go and add like a
percentage sign. And with that, we are converting the whole thing to a string. So we're going to do concatenation. So
CONCAT at the start, and go to the end, and let's add the percentage character. And as well we can go and order the data
by the total sales descending. So let's go and execute it. So now by looking at the result, you can see the category
bikes is dominating. So it is overwhelmingly the top-performing category. It is making 69% of the
total sales of our business. So this means my friends most of the business revenue comes from the bikes. And as you
can see, the accessories and clothing are really minor contributors to our business, which is not really good,
and this is actually a dangerous thing. If you have like one category dominating your whole business, you are over-relying
on only one category in your business, and if this category fails, then the whole business is going to fail. So by
looking at this, either the business has to decide to remove all the products in those two categories, or to focus more on
bringing more revenue from the products that are inside those two categories. So as you can see, guys, those insights are
really amazing for the business and help the managers and the decision makers to understand what is going on
quickly and make very critical decisions. And now you can see as well from the results perfectly why the part-to-whole
analysis is very important, because by just looking at those numbers it's going to be really hard to
understand the importance of the categories. But seeing the data as a percentage how each category is
contributing to the whole sales of the business makes it easier to understand which category is underperforming or top
performing. And now you have a very simple formula where you can go and change the metrics. For example, instead
of total sales, you can go and change the aggregations to total number of orders or the total number of customers.
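The part-to-whole query built in this walkthrough can be sketched like this. The names gold.fact_sales, gold.dim_products, category, and sales_amount are assumptions standing in for the tables described above, and the output aliases are illustrative.

```sql
-- Assumed names: gold.fact_sales and gold.dim_products are hypothetical.
WITH category_sales AS (
    SELECT
        p.category,
        SUM(f.sales_amount) AS total_sales      -- the "part": sales per category
    FROM gold.fact_sales AS f
    LEFT JOIN gold.dim_products AS p
        ON p.product_key = f.product_key
    GROUP BY p.category
)
SELECT
    category,
    total_sales,
    -- Empty OVER (): no partition, so every row sees the grand total
    SUM(total_sales) OVER () AS overall_sales,
    -- CAST avoids integer division; ROUND keeps two decimals;
    -- CONCAT turns the number into text with a percent sign
    CONCAT(
        ROUND(CAST(total_sales AS FLOAT) / SUM(total_sales) OVER () * 100, 2),
        '%'
    ) AS percentage_of_total
FROM category_sales
ORDER BY total_sales DESC;
```

To change the metric, only the aggregate inside the CTE needs to change, for example a count of distinct orders or customers instead of the sum of sales. Because CONCAT returns text, it is usually safer to keep sorting by the numeric total_sales column, as done here.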
So you can go and bring any type of measure into this analysis, and you're going to generate a completely
new view for the decision makers in order to develop a new strategy for the business. It was very interesting. Now
in the next step, we're going to do my favorite topic, where we're going to start doing data segmentation using
SQL. So let's go. Okay. So now what is data segmentation? What we're going to do here is we're
going to go and group up the data based on a specific range. So that means we're going to go and create new categories
and then go and aggregate the data based on the new category. And the formula for that is going to be very interesting. So
it's going to be this time we're going to have a measure by a measure not by dimension. So you have to go and pick
two different measures and convert one of those measures to a range or to a group and then aggregate the data by
this measure. So for example, we're going to go and calculate the total number of products by the sales range or
the total number of customers by the age group. So as you can see we have two measures and we are trying to combine
them together in order to create new insights. Let's have the following example. So here for example we have
like two measures and now the first step is that we're going to take one of those measures and convert it to a dimension.
converted to a category. For example, we're going to say if the values are like equal or below 100, it will be
converted to a category called low. And between 100 and 200, it's going to be assigned to a new category called
medium. And everything above 200, it's going to be large. So, as you can see what we are doing, we are taking one
measure and based on the range of this measure, we are building a new categories, new dimension. And now the
final step is the easiest one. We're going to go and aggregate another measure based on the new category. So
we're going to have seven for low, six for medium, and 15 for large. So with that, as you can see, we are creating
new categories or segments based on a measure. And then we are aggregating another measure based of this new
segments. And in SQL, in order to create those new categories and segments, we use the amazing case when statements
because it's going to help us to define the rules and based on the range, it's going to go and create a new category
and labels. So now let's go and apply this formula on our data set in order to segment our data. So let's go. Okay. So
now let's go and segment our data and all what we need is two measures. So now we have the following task and it says
segment products into cost ranges and count how many products fall into each segment. So now by looking to this task
we have two measures. First the costs and as well the second one is the total number of products. And of course we
have to go and segment one of those two measures. And in this task we are segmenting the costs. So we have to
focus now on taking this measure and converting it to a dimension. So now all that information is available in the
table products. So now let's go and select a few columns. We're going to get the product key, and let's get the
product name and the cost. That's all we need. So let's execute it. Now as you can see, this is our measure, the
cost. Now we have to go and convert this measure to a dimension. And in order to do that, we use the CASE WHEN
statement. We always use the CASE WHEN statement in order to create new categories. So let's go and do that.
CASE WHEN. Let's start with the first range. Let's say it is below 100. So all the costs that are below 100, we're
going to label with a new value. It's going to be below 100. So now let's go to the next range. We are saying WHEN
cost is now between 100 and 500. So all costs in this range will get the label 100 to 500. So this is very
simple. Let's go and get another range. For example, between 500 and 1,000. Then it's going to get the label between 500
and 1,000. And now it depends on how many categories and segments you want to create. Each row of this CASE WHEN, each
condition, will be creating like a new value for your dimension. So I'm going to stop with that. I'm going to say at
the end ELSE. So if the cost is not fulfilling any of those, it's going to be above 1,000, right? So that's it.
Let's give it a name. It's going to be cost range. So now let's go and execute it. Now let's go and check the result.
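To see how the CASE WHEN ranges above assign labels at the boundaries, here is a small self-contained sketch. The sample cost values and the label texts are made up for illustration; only the ranges come from the walkthrough.

```sql
-- Illustrative sample values, not data from the course database.
SELECT
    s.cost,
    CASE
        WHEN s.cost < 100                 THEN 'Below 100'
        WHEN s.cost BETWEEN 100 AND 500   THEN '100-500'
        WHEN s.cost BETWEEN 500 AND 1000  THEN '500-1000'
        ELSE 'Above 1000'
    END AS cost_range
FROM (VALUES (0), (100), (500), (1000), (1500), (NULL)) AS s(cost);
-- 0    -> Below 100
-- 100  -> 100-500
-- 500  -> 100-500     (BETWEEN is inclusive; the first matching branch wins)
-- 1000 -> 500-1000
-- 1500 -> Above 1000
-- NULL -> Above 1000  (no comparison is true, so ELSE applies)
```

The last row shows a detail worth checking on real data: a missing cost ends up in the ELSE bucket, so it may be worth filtering NULLs out or giving them their own label.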
For example, the cost here is zero. It is below 100, which is correct. This value is above 1,000. This is between
500 and 1,000. And this is between 100 and 500. So everything looks correct. Nice. So with that we are done with the
first step where we have converted one measure into a dimension. So with that we have now our segments. The next step
is that we're going to go and aggregate the data based on this new dimension. So either you do it in one go,
or what I usually do, I put everything in one CTE or a subquery, and I'm going to call it product_segments.
Then based on these results I'm going to go and aggregate the data. So this is my temporary result, and now
we're going to go and just aggregate the data like this. So let's get first our dimension, cost range, and then we need
our measure. So it's going to be COUNT of product key AS total products, from our CTE. It
was product_segments. And then GROUP BY our new dimension. That's it. It's very simple. Let's go and execute it
now. Now you can see in the output we have our segmented measure, and we can see the total number of products in each
segment and range. And of course we can go and order the data by our aggregation, the total products. Let's go and execute
it maybe descending. So now as you can see we have a lot of products that are not costing a lot. It is below 100.
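Put together, the two steps (convert the measure to a dimension, then aggregate by it) might look like the sketch below. The table name gold.dim_products and the columns product_key, product_name, and cost are assumptions; the label texts are illustrative.

```sql
-- Assumed name: gold.dim_products is a hypothetical product table.
WITH product_segments AS (
    -- Step 1: turn the cost measure into a cost_range dimension
    SELECT
        product_key,
        product_name,
        cost,
        CASE
            WHEN cost < 100                 THEN 'Below 100'
            WHEN cost BETWEEN 100 AND 500   THEN '100-500'
            WHEN cost BETWEEN 500 AND 1000  THEN '500-1000'
            ELSE 'Above 1000'
        END AS cost_range
    FROM gold.dim_products
)
-- Step 2: aggregate another measure (number of products) by the new dimension
SELECT
    cost_range,
    COUNT(product_key) AS total_products
FROM product_segments
GROUP BY cost_range
ORDER BY total_products DESC;
```

Keeping the CASE expression in the CTE means the range logic lives in one place; the outer query only refers to cost_range by name.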
After that comes between 100 and 500, and the lowest number of products is in the range that is above 1,000. So we don't
have a lot of products that cost a lot, and that's because maybe we have a lot of accessories in the business. So
my friends, this is very powerful. If the dimensions in your data set are not enough to create insights, you can take
one of your measures, convert it to a dimension using CASE WHEN, and then aggregate your other measures based on
this new dimension. So we are deriving new information, and as I told you, by just following this concept of measures and
dimensions you can generate an endless number of reports even if your business or your data set is small. Okay my
friends so now let's go and segment something else. So this time it's going to be a little bit more complicated. So
we have the following task and it says group customers into three segments based on their spending behavior. So we
have the VIP customers. They are the customers with at least 12 months of history and spending more than 5,000.
And the second category, we have the regular customers. They have as well at least 12 months of history, but they spend
like less than 5,000. And the last category, we have the new customers. Their lifespan is less than 12 months.
And we have to find the total number of customers by each group. So now here we have a lot of measures and stuff. So the
first one is the total number of customers. This is going to be the final aggregation that we're going to do. But
what is interesting, we're going to build the segments and this time is based on different columns. So first it
is based on a measure the total number of months for each customer and as well the total spending, the total number of
sales. So we have the sales, we have the total number of months and as well the total number of customers. So now we're
going to do it step by step. Don't you worry about it. So now what I usually do, I start collecting all the data that
I need. So what do we need? We need a customer key. In order to do the aggregation for the total number of
customers, we need as well the sales amount right for the spending. And now in order to calculate those number of
months, we need a date. And for that, we have to calculate the lifespan of a customer. And usually we create it using
the order date. I'm going to show you how we're going to do it. So we need the order date. And of course, we have to
select our table. So let's start with the fact table. So fact sales and we're going to join it with the
customers. So our dimension customers and the key for that it is the customer key as well for the customers. And here
we have to specify which column come from which table. So the first one from the customers, the sales from the fact
and the order date from the fact as well. So now let's go and execute. Now we can see we have our customers, the
sales and the order dates. So now the sales going to help us in order to specify the range of spending. But now
what is interesting we have to calculate the lifespan. So now in order to get the lifespan we have to find out the first
order and the last order of each customer. So how many months is between the first order and the last order. So
in order to do that we need the min function for the order dates. So this is the first order and the max in order to
get the last order. Right. And since we are using min and max, we have to go and group up the data. And we
need to do that anyway in order to get the total spending. So for the sales amount, we're going to have the sum in
order to have the total spending. And we don't need the order date itself. And the dimension where we're going
to group up the data is by the customer key. So let's go and execute it. So now in the results we have a list of all our
customers and as well the total spending for each customer and we have the first order date and the last order dates. Now
in order to calculate how many months are between the first order and the last order, we can go and use the function
DATEDIFF in order to get a new measure. So let's go and do that, DATEDIFF. And now since we need the number of months,
we're going to use MONTH, and then the second argument is going to be the first order. So MIN order date. And the
third one is going to be the latest. So MAX order date, and we're going to call this lifespan. So let's go and query and
let's have a look at our results. You can see for this customer 712, between the first order and the last order we
have 11 months, and for this customer over here we have zero because the first order and the last order are in the same
month, and maybe there is only one order. So with that we have the lifespan, and as you can see, guys, we have derived a new
measure from the dimension order date, in order later to derive from this new measure a new dimension, the segments. So
we are converting a dimension to a measure, and then from a measure to a new dimension, and this is usually what we do
in analysis and in SQL. So now do we have all the information for the logic? So we have the lifespan, so we have the
total number of months, we have the total spending, and I think we are ready to start building our segments. So now
what we're going to do, we're going to create the segments based on these results that we have prepared. So this
result is the intermediate result before the final one. Now either you're going to put it in a CTE or subquery. Well, I
usually go and use the CTE. It is nicer. So WITH customer_spending, and I'm going to put the whole
thing in a CTE, and we can start writing a new query from scratch based on the intermediate results. So let's go and select
again the customer key. I'm going to get the total spending and the lifespan. So we don't actually need the first and the
last order, and we're going to get all that information from our new CTE. So let's go and execute. And now let's
start building the segments. And as usual, we're going to go and use the CASE WHEN statement. It is just an amazing
statement in order to derive and build new columns. So now what do we have for the first category? So they are the
customers over 12 months and spending more than 5,000. So now we're going to say if the lifespan is higher than 12
and the total spending is higher than 5,000, then we have our VIP customers. So this is the first label. Let's go to the
second one. If the lifespan is as well, I think, more than 12. So let's go and check. Well, it is at least 12. I have
a mistake here. So it's going to be larger or equal. So now it is more correct. So the customers that have at least 12
months but spend like 5,000 or less. So that means the lifespan condition is going to stay the same, but the total
spending will be less than or equal to 5,000, and they are the regular customers. So they will get this label. Now if it is
not fulfilling those two conditions, what does this mean? This means this is a new customer, right? So they will get this
label. Let's go and have an END, and let's call it customer segments. So let's go and execute it. Now let's have
a look at this customer 712. So the total spending is less than 5,000. So this customer is not a VIP, and as well
the lifespan is less than 12. So that means for us it is a new customer. Now the next one, we have a VIP. So this
customer has a history of at least 12 months. So we have here 16 months, and as well the total spending is more than 5,000.
That's why this customer is a VIP. But now let's go and search for a regular customer,
2349. So this customer spent less than 5,000. So we are fulfilling this condition over here and as well this
customer has at least 12 months of history that's why we have a regular. So now as you can see we have derived a new
dimension from two measures the lifespan and the total spending. Now of course the last step what is going to be we
have to go and find the total number of customers for each of those categories. So now what we're going to do we're
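The whole customer segmentation, including the final count per segment just mentioned, can be sketched in three steps. This is a sketch under assumptions: gold.fact_sales, gold.dim_customers, and their column names are hypothetical, the thresholds follow the corrected conditions above (lifespan of at least 12 months), and a second CTE is used here for the segment step.

```sql
-- Assumed names: gold.fact_sales and gold.dim_customers are hypothetical.
WITH customer_spending AS (
    -- Step 1: one row per customer with total spending and lifespan in months
    SELECT
        c.customer_key,
        SUM(f.sales_amount) AS total_spending,
        MIN(f.order_date)   AS first_order,
        MAX(f.order_date)   AS last_order,
        DATEDIFF(month, MIN(f.order_date), MAX(f.order_date)) AS lifespan
    FROM gold.fact_sales AS f
    LEFT JOIN gold.dim_customers AS c
        ON f.customer_key = c.customer_key
    GROUP BY c.customer_key
),
customer_segments AS (
    -- Step 2: derive the segment dimension from the two measures
    SELECT
        customer_key,
        CASE
            WHEN lifespan >= 12 AND total_spending > 5000  THEN 'VIP'
            WHEN lifespan >= 12 AND total_spending <= 5000 THEN 'Regular'
            ELSE 'New'
        END AS customer_segment
    FROM customer_spending
)
-- Step 3: count customers per segment
SELECT
    customer_segment,
    COUNT(customer_key) AS total_customers
FROM customer_segments
GROUP BY customer_segment
ORDER BY total_customers DESC;
```

DATEDIFF with the month part counts month boundaries crossed rather than full elapsed months, so two orders on January 31 and February 1 give a lifespan of 1. Defining the CASE expression once in its own step avoids repeating it in both SELECT and GROUP BY.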
going to remove all that stuff, and we're going to start with our new dimension, and then comes the aggregation,
COUNT of customer key AS total customers. And then we have to group up the data by our new dimension. So this is
going to be really annoying if I take this CASE expression here and put it in the GROUP BY, because this means each time I'm
changing the logic I have to take care of that twice: once in the SELECT statement and the second time in the
GROUP BY. So now actually, instead of that, I changed my mind. I'm going to still have the
aggregation in a second step. So we need the customer key, and we have the definition of our customer segments. And
now I'm going to go and use a subquery, where I put the aggregation as a second step. So my friends, that means this is
again a second intermediate result. You can of course put it in a second CTE. So that means this is the first
intermediate result, where we have created the lifespan and the total spending, and the second intermediate
result is creating the customer segments, and the third step, the last one, is doing the final aggregation. So we're
going to do it like this. Select our dimension customer segments. Then we're going to go and count the customer key
from our sub query. So this is our subquery and don't forget to group by our dimension customer segments. I think
I have it wrong. All right. So this is the subquery and this is the final step where we are aggregating everything. I'm
going to go and order the data by the total customers like this. So now let's go and execute the whole thing. Well
descending not ascending. Okay. Okay. So now we can see from our results the highest number of our customers belong
to the category new. So we have 14,000 customers that are new in our business. And then the second category we have the
regular customers. So we have around 2,000 customers. And in VIP we have a lot of VIP customers. So we have
1,655 VIP customers in our business. So with that, my friends, we have done data segmentation. It is amazing. We have
segmented our customers based on their spending behavior, and as you can see, all that information is totally derived
from our data, and this helps us to have a deep understanding of the behavior of our customers, and of course
this can help as well in making smart decisions. All right my friends, so with that we have covered the five different
types of data analytics that we can do using SQL. Now what I usually do as the last tip in my project is that I try to
collect all the different types of explorations and analyses that I have done on my data sets so that I can put
everything in one, for example, view or table and then offer it to other users. And with that it is going to help the other
users or stakeholders to make a quick analysis for decision-making. So now what we're going to do, we're going to
have like some kind of requirements where we're going to bring a lot of different analyses into one big script in
order to have insights about one object, like for example the customers. So I'm going to show you the requirements of
this report, and we're going to analyze them and start writing the script. So let's go. Okay friends. So now let's
create a customer report and here are the requirements for the report. So now we have like a general statement. It
says this report should consolidate key customer metrics and behaviors. So it says first we have to gather all the
details about the customers like names, age, transaction details and then we have to segment the customers into
categories VIB, regular and new and as well by the age groups and we have to provide as well aggregations like the
total order, total sales, quantity, products and so on. And we have to generate important KPIs like the
recency, the average order value, the average monthly spends. So we have a lot of things and we're going to do it step
by step. All right. Now I'm going to take you step by step in the process of building a complex query that I usually
use in order to build a report. Now the first thing that I usually do is I start selecting the data from the database, and
I usually start with the fact table. So this is my starting point, and then usually I join it with the dimensions,
and here I use a left join. After that I think about how to filter the data, because usually we don't need all the
data that is available in the database, and of course in the result I will not be selecting all the columns. I'm going
to be selecting only the relevant columns that I need for my report. So since we have a complex query, we will
be dividing the process into multiple steps, and I usually call this step the base data. This is going to be the
foundation, the scope, for the next steps, and since we have multiple steps I'm going to put this in a CTE so we
have it as an intermediate result. What we're going to do in this step as well is a few
transformations, like maybe calculating and deriving new columns or maybe formatting the date, so some basic
transformations. So now let's go and build this result for our report. The first step is retrieving the core
columns from the tables. So let's go and do it together. So we need of course our fact table and we need our
customer dimension from the gold layer, and as usual we're going to go and join them. All right. Okay. So this is the basis, and
now what we're going to do, we're going to go and retrieve all the columns that we need for our report. So let's start
picking stuff. So order number, let's get the product key, the order date, sales amount, quantity, and I think that's all
from the facts. Let's go and get some information from the customers. So let's get the customer key, the customer number,
the first name and as well the last name. And what else? We can go and get the birth date because we have to create
the age groups. So birth date. Let's go and query. So I think those are all the columns that we need in order to do the
next steps. And now before we go and proceed with the aggregations, what we're going to do, we're going to think
about filtering the data. As I recall, we have some orders where the order date is null. So I'm going to go and remove
those. So order date is not null. So that means in the first query, the base query, not only am I selecting the
columns that I need for the report, I'm also defining the scope of the data set by filtering the data. So you could as
well limit the scope here to only one year or something. Now what else we can do is to think about all those columns and
whether we can do any type of transformation in order to prepare them for the aggregations. Like for example,
I'm going to go and say, you know what, instead of first and last name I'm going to put them together in one. So it's
going to be the customer name. It's better than having two columns. So, let's go and do it. We're going to say
CONCAT, and then we're going to start with the first name, and we're going to have a separator between them. You can
have a minus or a white space like this, and after that the last name. So, let's call it customer name. And we can
go and get rid of those two columns. So, let's go and execute. And with that, you have everything in one column. Now,
another thing that we can prepare: we don't need the birth date. What we actually need for our report is the age
groups. So that means we have to go and calculate the age. So let's go and transform it. So DATEDIFF, we want it in
years, the birth date and the current date from the system, and we're going to call it age. So let's execute again. Perfect.
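The base query described so far can be sketched roughly as follows. This is only an illustration in T-SQL: the column names (order_number, customer_key, birthdate and so on) are assumptions, since the transcript does not spell them out, and a few made-up rows stand in for the real fact and dimension tables so that the statement runs on its own.

```sql
-- Hypothetical sample rows standing in for the fact and dimension tables.
WITH fact_sales AS (
    SELECT * FROM (VALUES
        ('SO1001', 10, 1, CAST('2013-01-15' AS date), 120, 2),
        ('SO1002', 11, 1, CAST('2013-06-20' AS date),  80, 1),
        ('SO1003', 10, 2, CAST(NULL AS date),           50, 1)  -- removed by the filter
    ) AS f(order_number, product_key, customer_key, order_date, sales_amount, quantity)
),
dim_customers AS (
    SELECT * FROM (VALUES
        (1, 'AW0001', 'Jon',  'Yang',  CAST('1971-10-06' AS date)),
        (2, 'AW0002', 'Ruby', 'Huang', CAST('1976-05-10' AS date))
    ) AS c(customer_key, customer_number, first_name, last_name, birthdate)
)
SELECT
    f.order_number,
    f.product_key,
    f.order_date,
    f.sales_amount,
    f.quantity,
    c.customer_key,
    c.customer_number,
    CONCAT(c.first_name, ' ', c.last_name)  AS customer_name,
    -- DATEDIFF counts year boundaries crossed, so the age can be
    -- one year too high before this year's birthday.
    DATEDIFF(year, c.birthdate, GETDATE())  AS age
FROM fact_sales AS f
LEFT JOIN dim_customers AS c
    ON c.customer_key = f.customer_key
WHERE f.order_date IS NOT NULL;   -- defines the scope of the report
```

In the real project the two sample CTEs would be replaced by the actual fact and customer dimension objects of the gold layer.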
So with that we have all the data that we need for our report. Let's go and put everything in one CTE. So I'm going
to call it WITH base query AS and put everything in this CTE. And I'm going to go and put this comment over here
inside the CTE. Perfect. And now we're going to go and write a query from scratch, based on our intermediate
results. So from base query. Let's execute. All right. So now by looking at our report, with that we have the important
columns. Right. So now in the next step we're going to do aggregations on top of these intermediate results. So here
we're going to do all the aggregations that are needed for the report, and we're going to put everything again in a CTE as
an intermediate result, which makes everything modular and easy to read. So now let's go and do the necessary
aggregations on the result that we have previously prepared. So that's why this is very important as a second step in
our report. Always try to make a separate CTE only for aggregations. So let's go and do that. I'm going to go
and select again all the customer information like the customer key, number, name and age. So I'm just going to copy
and paste and put it over here. And we just need the column names. So the key, number, name, and age.
Now after that, we're going to start doing aggregations. So what we want to aggregate is first, for example, the
total number of orders. So we're going to go and count distinct order number as total orders. So this is one
aggregation. We can go and sum up all those sales amounts as
total sales, and the quantities as well. So SUM quantity as total quantity, and as well we can go and count how many
products our customer ordered. So count the product key as total products. So what I'm doing now, I'm just
looking at our intermediate results and trying to figure out what we can aggregate. For example, it makes no sense to
aggregate the ages, right? So from the order number we have total orders, then total products, sales amount and
quantity, and from the right side we cannot aggregate anything, and that's because they are the details of the
customers. But from the fact table we can do a lot of aggregations. So now what can we do with the order date over here? We
can for example find the last order date of our customer, which is really nice information. So we can say MAX
order date as last order, and of course we can go and calculate the lifespan, and we're going to need it, as you
remember, in order to categorize our customers. So I will just copy and paste it from the previous query. It is the DATEDIFF
in months between the first order from the customer and the last order of the customer. And we call this lifespan.
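The aggregation step can be sketched like this, including the GROUP BY on the customer details that the aggregations require. The rows in base_query are invented stand-ins for the real intermediate result, the column names are assumptions, and counting products with DISTINCT is also an assumption (the transcript only says the product key is counted).

```sql
WITH base_query AS (
    -- Hypothetical output of the base query (one row per order line).
    SELECT * FROM (VALUES
        ('SO1001', 10, CAST('2013-01-15' AS date), 120, 2, 1, 'AW0001', 'Jon Yang',   54),
        ('SO1002', 11, CAST('2013-06-20' AS date),  80, 1, 1, 'AW0001', 'Jon Yang',   54),
        ('SO1002', 10, CAST('2013-06-20' AS date),  60, 1, 1, 'AW0001', 'Jon Yang',   54),
        ('SO1003', 12, CAST('2014-02-01' AS date),  50, 1, 2, 'AW0002', 'Ruby Huang', 49)
    ) AS b(order_number, product_key, order_date, sales_amount, quantity,
           customer_key, customer_number, customer_name, age)
)
SELECT
    customer_key,
    customer_number,
    customer_name,
    age,
    COUNT(DISTINCT order_number)                      AS total_orders,
    SUM(sales_amount)                                 AS total_sales,
    SUM(quantity)                                     AS total_quantity,
    COUNT(DISTINCT product_key)                       AS total_products,
    MAX(order_date)                                   AS last_order_date,
    -- Months between the first and the last order of the customer.
    DATEDIFF(month, MIN(order_date), MAX(order_date)) AS lifespan
FROM base_query
GROUP BY
    customer_key,
    customer_number,
    customer_name,
    age;
```

A customer whose orders all fall in the same month gets a lifespan of 0, which matters later when the lifespan is used as a divisor.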
Okay. So we derived two measures or aggregations from the order date. Now I think we have done everything possible,
and what is missing of course is a GROUP BY, because we are doing aggregations and we are grouping by the
customer details. So it's going to be customer key, customer number, name and age. So I think we have everything for
our aggregations. Let's go and execute it. We get a list of all customers, with a few details about the customers, and now
we have a lot of measures. So the total orders, total sales, total quantity, total products, the last order and the
lifespan. And with that we have covered this part over here where we have provided aggregations on the customer
level. So we have the details and we have the aggregations. All right. So with that we have now all the
preparations that are required to build the final result. So it really depends on the scenario. If it's possible we can
take all the data from one CTE, or if it's needed we can get it from multiple
CTEs. But in our scenario, we're going
to take it from the second CTE, the aggregations, and we're going to prepare the final result. So here we're going
to bring everything together and we might introduce final transformations that are needed for the report. So let's
go and write the query for the final result. Now we can go and start segmenting our customers and as well
creating the KPIs. So let's go to the third step. I'm going to go and put this in a CTE. So let's call it customer
aggregation. And now based on these results, we will write the final query. So I always like
to put a comment about the steps. So the first CTE is the base query where we just joined the data and prepared it.
And then the second query is for the aggregations. And the final one is for the final results. So let's go and start
writing our final query. We will start with SELECT. And I'm going to go and list again all the customer
information. So I'm going to go and get again the same things. We have the customer key, customer number, name, age and so
on. And after that we need to create the age categories. And after that I'm going to go and get all those
measures as well from our query. But of course without the calculations, I just need the names of
them. So with that we have everything from our previous CTE, the customer aggregation. Okay. So let's just test
it. Now everything is working. So now what do we have to do? We have to create a few categories: the age category and as well
the segments of the customers, right? For segmenting the customers we have already written the query, so I will just copy and
paste it from the previous analysis. It looks like this: if the lifespan is at least 12 months and the sales are above
5,000 then VIP, if less than or equal to 5,000 then regular, otherwise it is a new customer. So this is our first segment. But the
second segment, about the ages, we're going to go and build now. And again, how are we going to do it? CASE WHEN. So if the age
is for example less than 20, then the customer is under 20. Let's make another range where we say if the
customer age is between 20 and let's say 29, then we have the second range, and we
can keep repeating the same thing for the next one. It really depends on how many categories you want to build. So 30
and 39, I belong to this group. Now the next one, let's have the 40s as well, right? So 40 to 49, same thing over here. And
now ELSE, let's say 50 and above, right? And above. So let's go and end it as age group. I just want to sort it a little bit
like this. Okay, now it looks nice. So with that again we have turned a measure into a dimension, and let's go and execute it
now. So now by checking the results, we have the details of the customers, and now we have a new category. So as you
can see, it is working. 54, it is above 50. This one is in the range between 40 and 49. We have here 67, above 50. I believe
we don't have any customer that is below 20, right? Or even between 20 and 30. Okay. So with that we have created our
two categories, and by looking at the report requirements you see we can now segment the customers into categories: VIP,
regular and new, and the age group. And with that we have covered all those three requirements, and we come now to the last
requirement. We have to calculate the following KPIs. Now the first one is an easy one. It is the recency: how many
months since the last order. We have calculated over here the last order for the customer. It is this one. And now in
order to find the recency, it is very simple. So all we have to do is to take this over here. I will just put it maybe
after the segmentation. And all you have to do is to use DATEDIFF as usual. So month, the last order date
and GETDATE. So as you can see, we are using this setup in many analyses, right? We always find the
difference between a date from our data set and the current date and time, and with that we will get the recency. So
let's go and execute it. Now you can see how many months have passed since the last order of the customer, and of course you can go
and test it using the last order date. And this is really important in order to understand whether the customer is still
active or inactive. Okay, so this is the first easy KPI. Now let's go to the second one. It says calculate the
average order value. So how are we going to do this? Let's go back over here. Now in order to compute the average order
value, we have to divide the total sales by the total orders. So how much revenue did the customer generate? And we divide
it by the total number of orders, and that gives us the average. So it is very simple. Let's go and write
that. We're going to go to the end of our table where we're going to put our KPIs, and I'm going to say here compute
average order value, so as a shortcut AOV. So we say total sales divided by total orders. And let's call it average
order value. So let's go and execute it. And if you go to the last column over here, you can see the average order value of our
customers. But now if you are dividing numbers, you have to be careful that you are not dividing by zero,
otherwise you will get an error. So imagine that a customer has zero orders, didn't order anything; you might get an
error. In our scenario, we don't have that because we are starting from the order table, the fact table. But
still, I like to make sure this never happens. And for that, I usually go and use a CASE WHEN statement. A very
simple one. If the total orders is equal to zero, then make it zero. Otherwise, do the calculation that we talked about.
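A minimal sketch of this guard, using three invented rows in place of the customer aggregation CTE (the third row is a hypothetical customer with no orders, which the real data does not contain):

```sql
WITH customer_aggregation AS (
    SELECT * FROM (VALUES
        (1, 260, 2),
        (2,  50, 1),
        (3,   0, 0)   -- hypothetical customer with zero orders
    ) AS a(customer_key, total_sales, total_orders)
)
SELECT
    customer_key,
    total_sales,
    total_orders,
    -- Average order value (AOV), protected against division by zero.
    CASE
        WHEN total_orders = 0 THEN 0
        ELSE total_sales / total_orders
    END AS avg_order_value
FROM customer_aggregation;
```

Note that dividing two integer columns in T-SQL is an integer division, so if the sales amounts are stored as integers the result is truncated; casting one side to a decimal type avoids that when fractions matter.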
So like this. And at the end, we will add an END. So that's it. And with that, I make sure we will never divide by
zero. So that's it. It was simple, right? Let's go to the last KPI, the average monthly spend. So how will we
calculate that? Compute average monthly spend. So now since we are speaking about the spending,
that means we need the total sales, right? So how much sales did the customer generate in total, and then we
divide it by the number of months, and with that we will get the average monthly spend, right? So that means we
can divide the total sales by the lifespan that we calculated. It is the period where the customer has been
active, from the start until the end. Okay. So now let's do it step by step. First we have to be careful that we are
not dividing by zero, and I believe in the lifespan we have zeros. So what we're going to say as usual is CASE WHEN
lifespan is equal to zero, then this time we will not make it zero. The customer exists only for one month. So what we can
do, we can take the total sales of the customer, and we don't have to divide it by the months in order to find the
average, because the average is equal to the total sales. So with that we make sure we are
not dividing by zero. Otherwise we're going to have our calculation. So total sales divided by lifespan. So the total
sales divided by the months, and with that we will get the average monthly spend. So END and AS, and we're going to call
it average monthly spend. Perfect. So let's go and try that out. Let's go to the right side. And with that we have
our third KPI, the average monthly spend. And with that, guys, we now have a full report about the
customers and we have covered all the requirements. All right. So with that we have the final results and we have
fulfilled the requirements. So what we're going to do, we're going to take the whole query and put it in the
database as a view. And once we have the view, the report, in the database, we can share it with others. Now another
data analyst in the team can go and maybe create a dashboard in order to visualize the data using a BI tool like Tableau
or Power BI. In this scenario, the user can go and connect your view, the final prepared data, to the dashboard. And
with that the user can quickly generate insights without doing a lot of steps in order to prepare the data for the
visualizations. And of course the data analyst could go and connect the dimensions and facts, but having this
one solid view is going to be way easier to consume. And of course the data analyst can as well write a query
on top of your view in order to generate quick insights. So as you can see, using only SQL you are covering a lot of
complex steps in order to make the data ready for reporting and analysis, and this is what usually happens in real
projects. We're going to go and put the query in the database so that others can use it. So what we're going to do is
very simple: CREATE VIEW, and we're going to put it in the gold layer, and we're going to call it report customers,
and then AS like this, and let's go and execute it. It is successful. Now if you go to our database and check the views,
you will find a new view called gold.report_customers. Now all you have to do is to go and run a simple SELECT.
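Put together, the three steps and the view could look roughly like the sketch below. It is not the exact script from the video: the source tables gold.fact_sales and gold.dim_customers and all column names are assumptions, and the 'Regular' branch assumes that regular customers also need a lifespan of at least 12 months.

```sql
-- Assumes a gold schema containing gold.fact_sales and gold.dim_customers
-- (hypothetical object and column names).
CREATE VIEW gold.report_customers AS
WITH base_query AS (
    -- 1) Base query: join, filter and basic transformations
    SELECT
        f.order_number, f.product_key, f.order_date, f.sales_amount, f.quantity,
        c.customer_key, c.customer_number,
        CONCAT(c.first_name, ' ', c.last_name) AS customer_name,
        DATEDIFF(year, c.birthdate, GETDATE()) AS age
    FROM gold.fact_sales AS f
    LEFT JOIN gold.dim_customers AS c
        ON c.customer_key = f.customer_key
    WHERE f.order_date IS NOT NULL
),
customer_aggregation AS (
    -- 2) Aggregations at the customer level
    SELECT
        customer_key, customer_number, customer_name, age,
        COUNT(DISTINCT order_number)                      AS total_orders,
        SUM(sales_amount)                                 AS total_sales,
        SUM(quantity)                                     AS total_quantity,
        COUNT(DISTINCT product_key)                       AS total_products,
        MAX(order_date)                                   AS last_order_date,
        DATEDIFF(month, MIN(order_date), MAX(order_date)) AS lifespan
    FROM base_query
    GROUP BY customer_key, customer_number, customer_name, age
)
-- 3) Final result: segments and KPIs
SELECT
    customer_key, customer_number, customer_name, age,
    CASE
        WHEN age < 20 THEN 'Under 20'
        WHEN age BETWEEN 20 AND 29 THEN '20-29'
        WHEN age BETWEEN 30 AND 39 THEN '30-39'
        WHEN age BETWEEN 40 AND 49 THEN '40-49'
        ELSE '50 and above'          -- a NULL age also ends up here
    END AS age_group,
    CASE
        WHEN lifespan >= 12 AND total_sales > 5000  THEN 'VIP'
        WHEN lifespan >= 12 AND total_sales <= 5000 THEN 'Regular'
        ELSE 'New'
    END AS customer_segment,
    last_order_date,
    DATEDIFF(month, last_order_date, GETDATE()) AS recency,
    total_orders, total_sales, total_quantity, total_products, lifespan,
    CASE WHEN total_orders = 0 THEN 0
         ELSE total_sales / total_orders END AS avg_order_value,
    CASE WHEN lifespan = 0 THEN total_sales
         ELSE total_sales / lifespan END     AS avg_monthly_spend
FROM customer_aggregation;
```

In SQL Server, CREATE VIEW has to be the first statement of its batch, so the statement is run on its own before any query against the view.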
So from gold.report_customers, and you will get an amazing report about the customers. This kind of reporting is
very important because you are giving a full picture, a 360° view, of all your customers. So you have details,
categories and measures, everything in one go, and it's going to make life easier for any user of this view to quickly
understand the data and generate maybe insights based on this one view, which of course helps your customers. So I
just want to show you now what this means if a user is using your report, either in SQL or maybe they're going to
go and connect it to Power BI or Tableau. They can immediately generate insights. So for example, if they go and say COUNT
customer number AS total customers, and then they're going to go and take any dimension, for example the age group.
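Such a query on top of the report might look like the sketch below. To keep it runnable on its own, a small CTE with invented rows stands in for the view; against the real database the FROM clause would simply read gold.report_customers. The column names are assumptions.

```sql
WITH report_customers AS (
    -- Hypothetical rows standing in for the report view.
    SELECT * FROM (VALUES
        ('AW0001', '50 and above', 'VIP',     6200),
        ('AW0002', '40-49',        'Regular', 1800),
        ('AW0003', '50 and above', 'New',      300)
    ) AS r(customer_number, age_group, customer_segment, total_sales)
)
SELECT
    age_group,
    COUNT(customer_number) AS total_customers,
    SUM(total_sales)       AS total_sales
FROM report_customers
GROUP BY age_group;
```

Swapping age_group for customer_segment in both the SELECT list and the GROUP BY gives the same analysis per customer segment.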
So something like this, and then group by the age group. Let's just put it here first. And then they're going to go and
add any other measure, for example the total sales, and any other measure that you
have in this view, and then execute, and quickly they can do analysis on top of your view without having to go to
the facts and dimensions. So this is like one extra prepared layer on top of the data model that you have built. And if you
don't want to group it by the age group, you can go and use the customer segment and it will work. So quickly they
can analyze the newly derived information that you have prepared in your report. So guys, this is an amazing report about
the customers. And now what you're going to do, you're going to go and prepare the
second report, where you have to build complete insights about the products of the business. It is very similar to the
customers. So we want to generate a report for the products. You have to provide details like the product name,
category, subcategory and the cost. You have to segment the products by the revenue, so you can have categories like
high, medium and low. And then you have to provide the basic aggregations at the level of the products and then calculate
a few KPIs. So as you can see, it is very similar to the customers. And now what you have to do: you have to pause the
video and follow the same steps as for the customers, where we join the tables, create aggregations and put everything
in CTEs, and at the end, once you are done, create the view where you have the report about the products. So I'm going
to go now and do it offline and I will see you
soon. Okay my friends, I hope you are done with the report. I'm going to show you quickly how I've done it. So I've
just created a new view called report products, and then we start with the base query where we have joined the fact
table with the dimension products and collected all the columns that we need for the report, and we put everything in
the first CTE. So this is the first step, and there was from my side no need for any transformations over here. So we
go now to the second step, and here we have to put all the different types of aggregations in one go. So we calculate
the lifespan, the last sales order, total orders, total customers, sales and quantity, and as well I have created the
average selling price of the products. It is very simple. We are dividing the sales amount by the quantity. So these are
the basic aggregations about the products, and finally we have the final query. So we start with selecting the
basic information about the products. So we have the key, name, category, and then we have here the recency and we
have our new segments. This one is very easy for the products. So we are saying if the total sales is higher than 50,000
then this is a high performer, and if it's between 10,000 and 50,000 then this is a mid-range, otherwise it is a low
performer. So the segmentation of the products is very simple, and after that we have all our measures that we
aggregated in the CTE, and now we come to the two KPIs. It is very similar to the customers. So the first one, the average
order revenue, is simply dividing the sales by the total orders, and you have to take care of the zeros of course, and
for the average monthly revenue we divide the total sales by the lifespan of the products, and of course if the lifespan
is zero, so it is only one month, then it is the total sales, and with that you generate the average monthly revenue. So
as you can see it is very similar to the customers, but still the focus here is the products. Now of course we put this
query in a view. So we have the report products side by side with the report customers, and now we have a really amazing
report about the products where we have everything. So we have a lot of details about the products. We have as well a
dimension in order to segment our products, and we have a lot of measures that are really important about each
product. So we have the total number of orders, sales, how many customers ordered the product, the average price,
the average revenue and the monthly average revenue. And this gives you really deep insights about each product
of your business. And of course, this is very helpful in order to compare the products, right? And now, of course,
this is a core analysis that you're going to need a lot in your business. That's why we offer it as a view. So, I
think we have now two amazing reports about our data. All right, my friends. So, now
don't forget to put all your work in a Git repository in order to share it with others as a successful project. So as
usual we have the data sets, documentation and as well the scripts that you have written throughout this project,
and here I'm putting everything together. So we have all the activities of the exploration as well as the
advanced analyses that we have done. So we have the change over time, the cumulative analysis, performance, data
segmentation, part-to-whole analysis and as well our two new reports. So I recommend that, if you haven't done that
yet, you go and create a repository now and put all your work there to make sure that everyone can access and see your work.
And my friends, don't forget to add nice comments to your code; the formatting and styling of your code should be perfect.
So if you haven't done that yet, go and do it now. All right my friends, so with that we have done the last step in our
roadmap. We have created two solid reports for our users. And with that, we have completed all the steps of our
advanced analytics project. And with this project and the previous projects, you can see now the full picture of how
to do data analytics on any data set using SQL, starting with the first step where we explored the database and
ending up with very solid reports where we have consolidated everything in one view. And with that we have now a really
great understanding of the business and of our data. And now what you can do, you can go and grab any data set on the
internet and go through all these phases again, and I promise you at the end you will have a full picture and
understanding of the business, and this is exactly what I do in each project if I want to understand any type of data
set. All right my friends. So with that we have covered the last type of SQL project, the advanced data analytics.
And with that we have now three solid projects using SQL, and they are very similar to real-world projects in the
industry, especially if you want to be a data engineer or a data analyst. And my friends, we have covered the last chapter
in our course. So this is the advanced level in SQL. And those are all the chapters that I have designed for you to
take you from the basics to intermediate and then to the advanced topics. My friend, you made it. Congrats. You
should be really proud of yourself. And now with that, I can say that I have shared everything that I know about SQL,
and you can now solve any complex task using SQL like I do in my real projects. And I hope that you have enjoyed the
journey. And if you did and you want me to create more free courses like this, make sure to support the channel by
subscribing, liking, and commenting. This of course is going to make the channel grow, reach others, and as well
motivate me to make more content like this. So nothing left to say. Thank you so much for watching and I will see you
in the next course.
SQL window functions allow you to perform calculations across sets of rows related to the current query row without collapsing the result. You can use functions like RANK(), LAG(), and SUM() OVER() to analyze trends over time, compute cumulative totals, and perform segmentation. For example, use LAG() to compare current metrics with prior periods, enabling effective time-based performance analysis.
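As a small illustration of these window functions, the following sketch computes a running total, the previous month's value with LAG() and a rank over three invented monthly totals (the CTE name and the figures are made up for the example):

```sql
WITH monthly_sales AS (
    -- Hypothetical monthly totals.
    SELECT * FROM (VALUES
        (CAST('2013-01-01' AS date), 100),
        (CAST('2013-02-01' AS date), 150),
        (CAST('2013-03-01' AS date), 120)
    ) AS m(order_month, total_sales)
)
SELECT
    order_month,
    total_sales,
    -- Cumulative total up to and including the current month
    SUM(total_sales) OVER (ORDER BY order_month)                 AS running_total,
    -- Value of the previous month (NULL for the first month)
    LAG(total_sales) OVER (ORDER BY order_month)                 AS previous_month_sales,
    -- Change compared with the previous month
    total_sales - LAG(total_sales) OVER (ORDER BY order_month)   AS change_vs_previous,
    -- Rank of each month by sales, highest first
    RANK() OVER (ORDER BY total_sales DESC)                      AS sales_rank
FROM monthly_sales
ORDER BY order_month;
```

Every input row is kept in the output, which is the difference from a GROUP BY aggregation.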
To optimize SQL queries, avoid using SELECT * to reduce unnecessary data retrieval, minimize DISTINCT and ORDER BY clauses when not required, and limit rows during exploration with WHERE conditions. Avoid functions on indexed columns to ensure indexes are used efficiently, prefer IN over multiple OR conditions, and monitor execution plans regularly. Also, maintain indexes by updating statistics and managing fragmentation for sustained performance.
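A few of these tips can be sketched on a throwaway temporary table. The table, index name and rows are invented for the example, and how much each rewrite helps depends on the actual data and execution plan.

```sql
-- Hypothetical temporary table with an index on order_date.
CREATE TABLE #fact_sales (
    order_number varchar(20),
    customer_key int,
    order_date   date,
    sales_amount int
);
CREATE INDEX ix_order_date ON #fact_sales (order_date);

INSERT INTO #fact_sales (order_number, customer_key, order_date, sales_amount)
VALUES ('SO1001', 1, '2013-01-15', 120),
       ('SO1002', 2, '2013-06-20',  80),
       ('SO1003', 3, '2014-02-01',  50);

-- Function on the indexed column: the index on order_date
-- generally cannot be used for a seek.
SELECT order_number, sales_amount
FROM #fact_sales
WHERE YEAR(order_date) = 2013;

-- Same result with the bare column compared against a range.
SELECT order_number, sales_amount
FROM #fact_sales
WHERE order_date >= '2013-01-01'
  AND order_date <  '2014-01-01';

-- Named columns instead of SELECT *, IN instead of chained ORs,
-- and TOP to limit rows while exploring.
SELECT TOP (100) order_number, customer_key, sales_amount
FROM #fact_sales
WHERE customer_key IN (1, 2, 3);

DROP TABLE #fact_sales;
```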
Stored procedures encapsulate reusable SQL code with parameters and control flow (IF...ELSE) allowing dynamic and flexible execution, while triggers automate actions in response to data changes on INSERT, UPDATE, or DELETE events. For instance, triggers can maintain audit logs automatically, ensuring data integrity and compliance without manual intervention.
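The sketch below shows one possible shape of an audit trigger and a parameterized stored procedure in T-SQL. All object names (dbo.customers_demo, dbo.customer_audit_demo, the trigger and the procedure) are hypothetical, and GO is the batch separator used by tools such as SSMS and sqlcmd rather than a T-SQL statement.

```sql
-- Hypothetical demo tables.
CREATE TABLE dbo.customers_demo (
    customer_key  int PRIMARY KEY,
    customer_name nvarchar(100)
);
CREATE TABLE dbo.customer_audit_demo (
    audit_id     int IDENTITY(1,1) PRIMARY KEY,
    customer_key int,
    logged_at    datetime2 DEFAULT SYSDATETIME(),
    note         nvarchar(200)
);
GO

-- Trigger: writes one audit row for every inserted customer.
CREATE TRIGGER dbo.trg_customers_demo_insert
ON dbo.customers_demo
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.customer_audit_demo (customer_key, note)
    SELECT customer_key, N'New customer inserted'
    FROM inserted;
END;
GO

-- Stored procedure with a parameter and IF...ELSE control flow.
CREATE PROCEDURE dbo.get_customer_count_demo
    @min_key int = NULL
AS
BEGIN
    SET NOCOUNT ON;
    IF @min_key IS NULL
    BEGIN
        SELECT COUNT(*) AS total_customers FROM dbo.customers_demo;
    END
    ELSE
    BEGIN
        SELECT COUNT(*) AS total_customers
        FROM dbo.customers_demo
        WHERE customer_key >= @min_key;
    END
END;
GO

-- Usage: the insert fires the trigger, the procedure returns the count.
INSERT INTO dbo.customers_demo (customer_key, customer_name) VALUES (1, N'Jon Yang');
EXEC dbo.get_customer_count_demo;
SELECT customer_key, logged_at, note FROM dbo.customer_audit_demo;
```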
Design your data warehouse following the medallion architecture: ingest raw data into the bronze layer, clean and standardize it in the silver layer, and create business-ready models in the gold layer. Use star schema modeling with fact and dimension tables, implement ETL/ELT pipelines, and document data lineage. This structured approach supports scalability, data quality, and efficient analytics.
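A very reduced sketch of the three layers in T-SQL could look like this. The table and column names are invented, the cleaning rules are only examples, and a real pipeline would add load procedures, more sources and fact tables.

```sql
-- One schema per layer (CREATE SCHEMA must run in its own batch).
CREATE SCHEMA bronze;
GO
CREATE SCHEMA silver;
GO
CREATE SCHEMA gold;
GO

-- Bronze: raw data exactly as delivered by the source system.
CREATE TABLE bronze.crm_customers (
    customer_id nvarchar(50),
    first_name  nvarchar(50),
    last_name   nvarchar(50)
);
-- Silver: cleaned and standardized data with proper types.
CREATE TABLE silver.crm_customers (
    customer_id int,
    first_name  nvarchar(50),
    last_name   nvarchar(50)
);
GO

-- Bronze to silver: trim text and drop rows with an unusable id.
INSERT INTO silver.crm_customers (customer_id, first_name, last_name)
SELECT
    TRY_CAST(customer_id AS int),
    LTRIM(RTRIM(first_name)),
    LTRIM(RTRIM(last_name))
FROM bronze.crm_customers
WHERE TRY_CAST(customer_id AS int) IS NOT NULL;
GO

-- Gold: business-ready dimension with a surrogate key, exposed as a view.
CREATE VIEW gold.dim_customers AS
SELECT
    ROW_NUMBER() OVER (ORDER BY customer_id) AS customer_key,
    customer_id,
    first_name,
    last_name
FROM silver.crm_customers;
GO
```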
AI tools can enhance your SQL coding by generating query ideas, optimizing existing code, and providing inline suggestions for syntax and logic via GitHub Copilot. ChatGPT excels at explaining concepts, planning projects, and practicing interview questions, making learning and development faster and more interactive. Incorporating these tools allows more efficient and error-free SQL scripting.
Use views to encapsulate complex queries for reuse and simplify access to business logic without storing data, as views are virtual and dynamically generate results. Temporary tables are suitable for storing intermediate results within a session, improving performance during complex transformations or when repeated access to the intermediate dataset is needed. Choose based on whether you need persistent reusable logic (views) or session-specific data storage (temporary tables).
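To make the difference concrete, here is a sketch that builds the same aggregation once as a view and once as a temporary table. It assumes a source object gold.fact_sales with customer_key and sales_amount columns, and the view name gold.customer_sales_demo is made up.

```sql
-- View: stores only the query definition; rows are computed at query time.
CREATE VIEW gold.customer_sales_demo AS
SELECT customer_key, SUM(sales_amount) AS total_sales
FROM gold.fact_sales
GROUP BY customer_key;
GO

-- Temporary table: materializes the rows for the current session only.
SELECT customer_key, SUM(sales_amount) AS total_sales
INTO #customer_sales
FROM gold.fact_sales
GROUP BY customer_key;

-- The intermediate result can now be read several times
-- without re-running the aggregation.
SELECT COUNT(*)         AS customers_above_5000 FROM #customer_sales WHERE total_sales > 5000;
SELECT MAX(total_sales) AS highest_total_sales  FROM #customer_sales;

DROP TABLE #customer_sales;
```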
Key performance tuning techniques include analyzing execution plans to identify bottlenecks, implementing appropriate indexing strategies like clustered and columnstore indexes, and maintaining indexes by reducing fragmentation and updating statistics. Partitioning large tables can also significantly improve query performance by limiting data scans. Writing modular queries with CTEs and avoiding unnecessary computations on indexed columns helps maintain responsiveness.
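The indexing and maintenance part of this can be sketched as follows on a hypothetical demo table. Columnstore indexes and partitioning are left out to keep the example short, and whether to reorganize or rebuild depends on the fragmentation actually measured.

```sql
-- Hypothetical demo table.
CREATE TABLE dbo.sales_orders_demo (
    order_id     int  NOT NULL,
    customer_key int  NOT NULL,
    order_date   date NOT NULL,
    sales_amount int  NOT NULL
);

-- Clustered index: defines the physical order of the rows.
CREATE CLUSTERED INDEX cix_sales_orders_demo_order_date
    ON dbo.sales_orders_demo (order_date);

-- Non-clustered index supporting lookups by customer.
CREATE NONCLUSTERED INDEX ix_sales_orders_demo_customer
    ON dbo.sales_orders_demo (customer_key)
    INCLUDE (sales_amount);

-- Check fragmentation per index.
SELECT i.name, s.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.sales_orders_demo'), NULL, NULL, 'LIMITED') AS s
JOIN sys.indexes AS i
    ON i.object_id = s.object_id
   AND i.index_id  = s.index_id;

-- Maintenance: reorganize (lighter) or rebuild (heavier), then refresh statistics.
ALTER INDEX ix_sales_orders_demo_customer ON dbo.sales_orders_demo REORGANIZE;
ALTER INDEX ALL ON dbo.sales_orders_demo REBUILD;
UPDATE STATISTICS dbo.sales_orders_demo;
```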
Heads up!
This summary and transcript were automatically generated using AI with the Free YouTube Transcript Summary Tool by LunaNotes.

