The Ultimate Guide to a Career in Data Analytics: Roles, Responsibilities, and Skills
Introduction
In today’s digital landscape, the demand for data analysts is soaring as companies recognize the importance of data-driven decision-making. This comprehensive guide explores the various facets of a career in data analytics, highlighting job roles, responsibilities, necessary skills, and potential salaries. Whether you’re looking to dive into this exciting field or simply want to upgrade your current skill set, this article serves as a definitive roadmap.
Understanding Data Analytics
Data analytics involves the systematic computational analysis of data. It transforms raw data into meaningful insights, enabling organizations to make informed decisions. Here’s a closer look at what data analytics encompasses:
Key Components of Data Analytics
- Data Collection: Gathering data from various sources, such as surveys, transactions, and APIs.
- Data Cleaning: Processing data to remove inconsistencies and erroneous entries.
- Data Analysis: Employing statistical methods to analyze the data and uncover patterns.
- Data Visualization: Presenting data insights through graphical representations to facilitate understanding.
- Decision Making: Applying analytical findings to inform business strategies.
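To make these steps concrete, here is a minimal sketch in base R (the language used in the hands-on demo later in this guide) that walks one invented record set through cleaning, analysis, and visualization; the survey data frame is purely illustrative:

```r
# Toy pipeline: collect -> clean -> analyze -> visualize
survey <- data.frame(                       # "collected" data, invented here
  respondent = 1:6,
  spend      = c(120, 95, NA, 310, 95, 180),
  region     = c("N", "S", "S", "N", "S", "N")
)

clean <- survey[!is.na(survey$spend), ]     # data cleaning: drop missing entries
clean <- unique(clean)                      # ...and any duplicate rows

# data analysis: a simple statistic per region
avg_by_region <- tapply(clean$spend, clean$region, mean)
print(avg_by_region)

# data visualization: present the insight graphically
barplot(avg_by_region, main = "Average spend by region")
```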
Career Opportunities in Data Analytics
Data analytics offers a multitude of career paths. Here are some of the key roles you may encounter:
1. Data Analyst
Responsibilities:
- Collecting and analyzing data to identify trends.
- Creating data visualizations and reports for stakeholders.
- Collaborating with other teams to solve business problems.
Skills Required:
- Proficiency in SQL, Python, or R.
- Familiarity with tools like Excel, Tableau, and Power BI.
- Strong statistical analysis skills.
Average Salary:
- India: ₹5,23,000 per annum
- United States: $62,400 per annum
2. Business Analyst
Responsibilities:
- Evaluating business processes and identifying areas for improvement.
- Communicating with stakeholders to gather requirements and insights.
- Delivering data-driven recommendations to enhance efficiency.
Skills Required:
- Knowledge of programming languages such as R or Python.
- Proficiency in SQL and data visualization tools.
- Strong analytical and critical thinking abilities.
Average Salary:
- India: ₹7,00,000 per annum
- United States: $68,446 per annum
3. Data Scientist
Responsibilities:
- Building predictive models using machine learning algorithms.
- Performing data mining to extract necessary data from large datasets.
- Analyzing the business impact of various data-driven decisions.
Skills Required:
- Strong programming skills (Python, R, Java).
- Knowledge of machine learning and statistical modeling.
- Experience with data visualization libraries.
Average Salary:
- India: ₹10,47,000 per annum
- United States: $113,000 per annum
4. Data Engineer
Responsibilities:
- Designing data pipelines and architecture.
- Managing and optimizing database systems for analysis.
- Working closely with data analysts and scientists to manage data flows.
Skills Required:
- Proficiency in big data technologies (e.g., Hadoop, Spark).
- Strong programming skills (Python, Java).
- Understanding of SQL and NoSQL databases.
Average Salary:
- India: ₹8,85,000 per annum
- United States: $103,000 per annum
5. Machine Learning Engineer
Responsibilities:
- Developing algorithms that allow systems to make predictions.
- Collaborating with data analysts and scientists to refine models.
- Implementing solutions to enhance model accuracy.
Skills Required:
- Expertise in machine learning algorithms.
- Strong programming skills (Python, R).
- Familiarity with software development practices.
Average Salary:
- India: ₹8,00,000 per annum
- United States: $114,000 per annum
Skills Needed for a Data Analyst Career
To thrive in a data analytics career, you need a blend of technical, analytical, and soft skills:
Technical Skills
- Statistical Knowledge: Strong understanding of statistical tools and techniques.
- Programming Expertise: Proficiency in Python, R, or SQL for data manipulation and analysis.
- Data Visualization: Ability to visualize data effectively using tools like Tableau and Power BI.
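As a tiny illustration of pairing the statistical knowledge with the programming expertise listed above, the following R snippet computes a correlation and runs a t-test on R's built-in mtcars data set, chosen here only because it ships with R:

```r
# Does horsepower correlate with fuel efficiency? (a strong negative correlation)
cor(mtcars$hp, mtcars$mpg)

# Do automatic and manual cars differ in mpg? A two-sample t-test answers this.
t.test(mpg ~ am, data = mtcars)
```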
Analytical Skills
- Critical Thinking: Ability to analyze data and derive actionable insights.
- Problem-Solving: Aptitude for identifying data-related challenges and proposing solutions.
Soft Skills
- Communication: Strong verbal and written skills to present findings to stakeholders.
- Collaboration: Ability to work effectively in teams and with various departments.
Data Analytics Certifications
Pursuing certifications can significantly boost your employability and expertise:
- Certified Analytics Professional (CAP)
- Google Data Analytics Professional Certificate
- IBM Data Analyst Professional Certificate
- Microsoft Certified: Data Analyst Associate
- Tableau Desktop Specialist
Salary Ranges in Data Analytics
Data analytics salaries vary greatly by role, location, and experience. Typically, entry-level positions start around ₹5,00,000 to ₹8,00,000 in India and $60,000 to $70,000 in the U.S. With experience, salaries can increase substantially:
- Mid-Level Data Analyst:
- India: ₹10,00,000 - ₹12,00,000
- U.S.: $80,000 - $100,000
- Senior Data Analyst:
- India: ₹15,00,000 and above
- U.S.: $120,000 and above
Conclusion
A career in data analytics is not only rewarding financially but also offers an opportunity to make meaningful contributions to organizations. By acquiring the relevant skills and certifications, and understanding the various roles in this field, you can set yourself on a path to success. Continuous learning and adaptation to technological advancements will further enhance your prospects in the ever-evolving landscape of data analytics.
Hey everyone, welcome to Simplilearn! In today's session we will go through the data analyst full course. But before we begin: if you are a tech geek on a continuous hunt for the latest technological trends, then consider subscribing to our YouTube channel, and don't forget to hit that bell icon to never miss an update from Simplilearn. Now, without further ado, let's go through the agenda for today's session. First, we will get started with a brief understanding of data analytics as a career, including job roles, responsibilities, and salary descriptions. Followed by that, we will go through a detailed data analyst roadmap. Then, to add on, we will go through some important top certifications for data analysts, and we will understand the fundamental differences between a data analyst and a data scientist. Followed by that, we will have the data analytics basics. Next, we will deep-dive into OLAP and OLTP and also understand the differences between them. Then we will get started with the practical demonstration by understanding the ETL pipeline, then the top 10 data analyst tools, and followed by that we will get through MS Excel for data analytics, time series analysis, and then data analytics with Python and data analytics with SQL. Followed by that, we will go through some of the most important data visualization tools, such as Power BI and Tableau, and we will also use some of the most important data sets for the use cases. Finally, to make things more interesting, we will also go through some of the most frequently asked data analytics interview questions in real time. I hope I made myself clear with the description.

To fast-track your ambitions, whether you're making a switch or aiming higher, Simplilearn has your back. And if you're an aspiring data analyst, try giving a shot to Simplilearn's Postgraduate Program in Data Analytics from Purdue University in collaboration with IBM; the link in the description box and the pinned comment should take you to the program offering. With that, over to our training experts.

Hello everyone! We have an interesting topic for today, and that
is data analytics jobs, career, and salary. I will run you through the top six data analytics job roles. Before I dive deep into the various job roles, let's quickly understand how important a career in data analytics is and what the data landscape looks like. Back in the early 2000s there was relatively less data generated, but with the rapid rise in technologies, and with the increase in the number of social media platforms and multinational companies across the globe, the generation of data has increased by leaps and bounds. Did you know that, according to the IDC, the total volume of data is expected to reach 175 zettabytes in 2025? Now that's a lot of data!

Let's take a look at how organizations leverage all of this data. As you know, there are zillions of companies across the world, and these companies generate loads of data on a daily basis. When I say data here, it simply refers to business information, customer data, customer feedback, product innovations, sales reports, and profit and loss reports, to name a few. Companies utilize all of this data in a wise way: they use this information to make crucial decisions that can either hamper or boost their businesses. You might have heard of the term "data is the new oil". Well, it definitely is, but only if organizations analyze all the available data very well; then this oil is definitely valuable. And for that we have data analytics. Organizations take the help of data analytics to convert the available raw data into meaningful insights. So, what is data analytics?
Technically, you can say it is a process wherein data is collected from various sources, then cleaned, which involves removing irrelevant information, and finally transformed into meaningful information that can be interpreted by humans. Various technologies, tools, and frameworks are used in the analysis process. You might have heard of the term "data never sleeps"; well, it surely doesn't. Every millisecond, some data or the other is generated, and this is a constant process that is only going to accelerate in the near future with the advent of newer technologies. The data analytics domain holds paramount importance in every sector. Companies want to leverage all the generated big data to boost their businesses, so they need professionals who can play with data and convert it into crucial insights. Organizations are constantly on the lookout for such candidates, and this opportunity will only increase, as data is only going to grow every second.

So, if you want to start your career in this field, or if you want to switch into a role in the data analytics domain, then we have a set of job profiles that you can look at. We will look into six job roles in the data analytics field and learn what each job role is all about: the responsibilities of a professional working in that particular role, the skills required to get that particular job, the average annual salary of a professional working in that role, and finally the companies hiring for that role. So let's start off. First, we have the job role of a data analyst.
A data analyst is a person who collects, processes, and performs statistical analysis of large data sets. Every business generates and collects data, be it marketing research, sales figures, logistics, or transportation costs. A data analyst will take this data and figure out a variety of measures, such as how to price new materials, how to reduce transportation costs, or how to deal with issues that cost the company money. They deal with data handling, data modeling, and reporting.

Now, talking about their responsibilities: data analysts recognize and understand the organization's goals. They collaborate with different team members, such as programmers, business analysts, engineers, and data scientists, to gather data from various databases. They filter and clean data using different modern tools and techniques and make it ready for analysis, and they also perform data mining from primary and secondary data sources. Data analysts identify, analyze, and interpret trends in complex data sets; this is done using statistical tools such as R and SAS. Another key responsibility of a data analyst is to create summary reports, build various data visualizations for decision-making, and present them to the stakeholders.

Next, let us discuss the important skills that you need to become a data analyst. Firstly, you should have a bachelor's degree in computer science or information technology; a master's degree in computer applications or statistics is also preferable. You must have a good understanding of programming languages like R, Python, and JavaScript, and also understand SQL. In addition to that, it is beneficial if you have hands-on experience with statistical and data analytics tools such as SAS Miner, Microsoft Excel, and SSAS. A basic understanding of machine learning and its algorithms would be an advantage. Acquaint yourself with descriptive, predictive, and prescriptive analytics and with various data visualization software, along with presentation skills; this will help you pitch your ideas and findings to stakeholders.

Talking about the salary, a data analyst earns nearly ₹5,23,000 per annum in India, while in the United States they earn around $62,400 per annum. Let's now look at a few of the companies hiring data analysts: we have the American e-commerce giant Amazon, then Microsoft, the American online payment company PayPal, then Walmart, Bloomberg, and Capital One. So that was all about the data analyst. The next job role is that of a business analyst.
Business analysts help improve products, services, and software through data-driven solutions. They are responsible for bridging the gap between IT and business, using data analytics to evaluate processes, determine requirements, and deliver data-driven recommendations. They are also responsible for creating new models that support business decisions and for coming up with initiatives and strategies to improve the business.

Moving on to the responsibilities of a business analyst: business analysts have a good understanding of the requirements of the business. Their vital role is to work in accordance with the relevant project stakeholders to understand their requirements and translate them into details that the developers can understand. They frequently interact with developers and come up with a plan to design the layout of a software application, and they also run meetings with stakeholders and other authorities. They engage with business leaders and users to understand how data-driven changes to products, services, software, and hardware can improve efficiencies and add value. They ensure that the project is running smoothly as per the requirements and the planned design; through user acceptance and validation testing, they make sure all the features are being incorporated into the application. BAs rely on different software to write documentation and design visualizations to explain all the findings, and it is extremely critical for any BA to effectively document the findings, where each requirement of the client is mentioned in detail.

Now let us look at the skills required for a BA. A bachelor's degree in the field of science, engineering, statistics, or any related domain will suffice. Knowledge of programming languages such as Python and Java is beneficial. You should be really good at writing complex SQL queries, and you should also have knowledge of various data visualization tools. Analytical and problem-solving skills are necessary to solve software and business issues, and you also need excellent presentation and communication skills, both oral and written.

Moving on to their salary: a business analyst is expected to earn around ₹7,00,000 per annum in India, while in the US they earn nearly $68,446 per annum. Companies hiring business analysts include Dell, Philips, Honeywell, the famous American messaging platform WhatsApp, and the UK-based company Ernst & Young.
Next is the database administrator, a professional who maintains a successful database environment by directing or performing all related activities to keep the organization's data secure. They are responsible for storing, organizing, and retrieving data from several databases and data warehouses. Their top responsibility is to maintain data integrity: a database administrator maintains a database to ensure that the data in it is properly stored, organized, and managed well. They maintain data integrity by preventing unauthorized access, and they keep databases up to date. They run tests and modify the existing databases to ensure that they operate reliably, and they also inform end users of changes in databases and train them to use them. They are responsible for taking system backups in case of power outages and other disasters, so they should have an efficient disaster recovery plan.

Now let's have a look at their skills. To become a database administrator, you should have a bachelor's degree in computer science or information technology. Knowledge of programming languages such as Python, Java, and Scala is important. You need to carry at least 3 to 5 years of experience in data management, and you need to have an understanding of different databases such as Oracle DB, MongoDB, MySQL, SQL Server, and PostgreSQL. You should also have an idea about database design and writing SQL queries. Finally, you need a good understanding of operating systems such as Windows, macOS, and Linux. A database administrator in India can earn up to ₹4,97,000 per annum, while in the US they earn around $78,000 per annum. Let's have a look at the companies hiring database administrators: we have BookMyShow, Oracle, the American MNC Intel, Amazon, Robert Half, and the New York Times, to name a few. Fourth in the list of job roles, we have the data engineer.
A data engineer is someone who's involved in preparing data for analytical and operational uses. A data engineer transforms data into a useful format for analysis; they build and test scalable big data ecosystems that support the work of data scientists.

Now let's jump into their responsibilities. Data engineers develop, test, and maintain architectures. They are responsible for managing, optimizing, and monitoring data retrieval, storage, and distribution throughout the organization. They discover opportunities for data acquisition, find trends in data sets, and develop algorithms to help make raw data more useful to the enterprise. Data engineers build large data warehouses using ETL for storing and retrieving data.
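Since ETL comes up repeatedly in this role, here is a minimal ETL-style sketch in R; the file sales_raw.csv, its column names, and the SQLite warehouse are hypothetical stand-ins, not part of the original walkthrough:

```r
library(dplyr)
library(DBI)
library(RSQLite)

raw <- read.csv("sales_raw.csv")                  # Extract (hypothetical file)

clean <- raw %>%                                  # Transform
  filter(!is.na(amount)) %>%                      # drop incomplete records
  mutate(order_date = as.Date(order_date)) %>%    # normalize types
  distinct()                                      # remove duplicate rows

con <- dbConnect(RSQLite::SQLite(), "warehouse.db")
dbWriteTable(con, "sales_clean", clean, overwrite = TRUE)   # Load
dbDisconnect(con)
```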
They also recommend ways to improve data quality and efficiency, along with building algorithms to give easier access to raw data. Data engineers often work with big data and submit their reports to data scientists for analysis purposes, and they need to recommend, and sometimes implement, ways to improve data reliability, efficiency, and quality.

Moving on to the skills of a data engineer: a data engineer should hold a bachelor's degree in computer science or information technology. They should have good hands-on experience with Python, R, and Java. Also, data engineers should be well versed in big data technologies such as Hadoop, Apache Spark, Scala, Cassandra, and MongoDB. Data warehousing and ETL experience are essential to this position, along with in-depth knowledge of SQL and other database solutions. Basic knowledge of statistical analysis will be an advantage, along with an idea about operating systems. Here is what a data engineer can earn: in India, a data engineer can earn up to ₹8,85,000 per annum, while in the US they earn around $103,000 per annum.
Next up is the data scientist, a professional who uses statistical methods, data analysis techniques, machine learning, and related concepts to understand and analyze data and draw business conclusions. They make sense of messy, unstructured data and bring value out of it, employing techniques and theories drawn from many fields within the context of mathematics, statistics, and computer science. A data scientist looks at the challenges in a business and comes up with the best solutions, using modern tools and techniques to analyze and visualize data and build prediction models that drive business decisions.

Let us now look at their responsibilities in the industry. Data scientists clean, process, and manipulate data using several data analytics tools. They perform ad hoc data mining and collect large sets of structured and unstructured data from disparate sources. They design and evaluate advanced statistical models to work on big data, and they also create automated anomaly detection systems and keep constant track of their performance. Data scientists interpret the analysis of big data to discover insights and build reports and dashboards for relevant stakeholders. They also adopt new business models and approaches, and apart from this they regularly build predictive models and machine learning algorithms.

Now, moving on to the skills: a bachelor's degree in computer science or information technology will be fine, but a master's degree in the field of data science will hold a major advantage. You also need good experience in the analytics domain. You should be proficient in programming languages such as Python and R, and in frameworks like Hadoop. In addition to knowing programming languages, you also need to know SQL, machine learning, and deep learning. Data visualization and BI skills are necessary for creating reports and dashboards, and you should also be able to communicate and present information and ideas properly.

Now, talking about their salary: a data scientist in India can expect an annual salary of ₹10,47,000, while in the US they can earn around $113,000 per annum. Coming to the companies hiring data scientists, we have a few companies named here: yet again Amazon, Citibank, Apple, Google, the Japanese electronic commerce and online retailing company Rakuten, and Facebook. And finally, we have the machine learning engineer.
Machine learning engineers are professionals who develop intelligent machines that can learn from data. They apply algorithms and statistical modeling to make sense of data, and they design and develop machine learning and deep learning algorithms; their main goal is to create self-running software.

Let's have a look at their responsibilities. Machine learning engineers design and develop machine learning systems. They use exceptional mathematical skills to perform faster computations and work with algorithms to create sophisticated models. They perform A/B testing and use data modeling to fine-tune the results, and they use data modeling and evaluation strategies to find hidden patterns and predict unseen instances. Machine learning engineers work closely with data engineers to build data pipelines, and they interact with stakeholders to get clarity on the requirements. Most importantly, they analyze complex data sets to verify data quality, perform model tests and experiments, choose and implement the right machine learning algorithm, and select the right training data sets.

Moving on to their skills: a machine learning engineer should have a degree in computer science or information technology, and an advanced degree in computer science or maths holds an advantage. In addition to this, they should have experience in programming languages like Python, C++, and Java. Knowledge of statistics, probability, and linear algebra is necessary, as all the machine learning algorithms have been derived from mathematics; also, having an idea of signal processing would be beneficial. Machine learning engineers need a good understanding of data manipulation and of machine learning libraries such as NumPy, pandas, scikit-learn, etc. They should also have good oral and written communication skills.

Let us now have a look at their salary structure: a machine learning engineer earns around ₹8,00,000 per annum in India, while in the US they earn around $114,000 per annum. Let's have a look at the companies hiring machine learning engineers: we have Amazon, Microsoft, Oracle, Salesforce, Rapido, and Accenture, to name a few. That was all about the job role of a machine learning engineer.
Now that we have seen the different job roles in the field of data analytics, let's also go ahead and see what the resume of a data analyst looks like; you can grab some ideas from this and incorporate them in your resume. Nowadays it's quite common to have a professional photograph of yourself on the resume, so you can go ahead and have that, then your name in bold, followed by your contact details like email ID and phone number. Moving on, you would have to write a summary: briefly explain your current job role and what you're looking for in the future. Having a LinkedIn profile link works well these days; employers can just go ahead, look at your profile, and gauge you, so make sure to have an active LinkedIn profile. In addition to the LinkedIn profile, it's also good to have a GitHub profile link, which can show your coding or other technical skills; if it's impressive enough, then a lot of times the rest of your resume is just secondary.

As I mentioned, this is a resume of a data analyst, so as you can see, in the summary here we have just spoken about the basic responsibilities of a data analyst. Moving on to the experience part: you have to write the job title, and below that you can mention the company and the tenure accordingly. Here you would have to give a brief description of achievements in the organization, any relevant accomplishments related to the job you're applying for, and the tools and the various technologies you have worked with. In the sample, you can see we have spoken about data visualization using R and Tableau. Next, we have spoken about how the candidate has worked with other teams for a better business outcome. Most data analysts use SQL and Excel to handle data for reporting and database maintenance, and we have mentioned that here as well; do make sure that you always specify the tools you use. Then you can also mention if you have worked on improving data delivery; for example, here we have spoken about developing and optimizing SQL queries, data aggregations, and ETL to improve data delivery.

For the job prior to becoming a data analyst, here we have taken the role of a statistical assistant, since it's easier for a candidate with this job role to shift into the data analytics field; nevertheless, you can still mention your prior experience here, be it in any domain. Under the responsibilities for this job role, we have given basics such as coding data prior to computer entry, compiling statistics from various reports, computing and analyzing data, and finally some visualization and reporting.

Moving to the education: here you can mention the name of your degree and the university name. If you have a postgraduation, well and good; you can list both the degrees here. Also, if you have any certifications, you can mention them under the education category.

Now, moving to the skills: depending on your skills and your choice, you can either shift this part to the beginning of the resume or have it here. As you see on your screens, this is just a different way of displaying your skill sets: you can have all five stars colored if you are excellent in that particular tool or language, so it's crystal clear what the candidate's strong areas are. You can have various categories like the ones shown; for example, under software development you can list the languages that you know and how proficient you are in each of them. It's clear that the candidate knows Python better than JavaScript here, so the employer gets a clear idea about the skills you possess and the depth of them. Similarly, you can mention the databases as well; the few mentioned here are more or less a requirement to become a data analyst, and at least SQL is a must. Not to forget, data visualization is also very important when it comes to the job role of a data analyst; mention the tools you know here and similarly give yourself a rating out of five, five stars shaded being the highest. Here we have mentioned Tableau and Excel, which are more than sufficient to become a data analyst. Moving to the non-technical skills, you can mention the languages you know; here we have taken English and German. In addition to the languages, you can also feel free to mention your hobbies. So that is what the resume of a data analyst should look like; you can alter it according to your achievements, skills, and experience.
A study from NewVantage Partners suggests that 97.2% of companies are now investing in data and its analysis. Nowadays every company needs a data expert. Do you also want to become a part of this market? If yes, how do you become a part of it, and what are the essential skills one should have? This video will answer all these questions. But before watching this video, please subscribe to Simplilearn's YouTube channel and press the bell icon to never miss any updates. First of all, we are going to discuss who a data analyst is, the work of a data analyst, the skills required to become a data analyst, the tools required, and the companies hiring; finally, we are going to discuss the salary of a data analyst. Let me tell you how Simplilearn can help you in your journey to become a data analyst: check out the course on data analytics in collaboration with IBM. With real-time projects and business case studies, you will learn tools like SciPy and pandas and programming languages like Python and R. Enroll now; the link is in the description box below. And here's a question for you involving Jupyter Notebook: please leave the answer in the comment section below.

Moving on: who is a data analyst? A data analyst collects, analyzes, and interprets data; a data analyst will convert raw data into useful information. Data analysts are in high demand because every industry uses data analysis. Now, the work of a data analyst: as a data analyst you will work closely with the raw data and generate valuable insights to help companies decide their future goals. If you like thinking out of the box, you are the perfect fit for this domain. Data analysts help maximize output when it comes to generating revenue, working closely with both business and data. Nevertheless, this field boasts handsome salaries for all levels of expertise. Can you become a data analyst without prior experience? Yes, anyone can become a data analyst if they enjoy solving real-world problems, have a strong background in statistics, and have a creative mind; if you feel you don't have it, you can definitely develop it.
So let us look at the skills in detail. What are the basic skill sets required for a data analyst? A data analyst must know basic mathematics and statistics, programming skills, machine learning, and data visualization tools. So let us see the basics that you need to learn as a data analyst.

Mathematics: it is always better to know basic mathematics, like linear algebra and probability fundamentals. Linear algebra is used in data pre-processing and transformation, which is a critical process for every data analyst. Statistics: a branch of mathematics that deals with the collection, analysis, presentation, and interpretation of data. Probability: we know that probability is the study of how likely something is to happen, which is essential for drawing conclusions. Both probability and statistics are the backbones of data analysis. It is feasible to become a data analyst with only a basic understanding of these three areas of mathematics, but in order to remain relevant and grow as a data analyst, one's mathematical knowledge should not stay restricted.
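A small base-R sketch of those fundamentals; the mtcars columns and the coin-flip numbers are illustrative choices, not part of the original lesson:

```r
# Linear algebra in everyday pre-processing: standardize the columns of a
# numeric matrix (subtract each column's mean, divide by its standard deviation)
X <- as.matrix(mtcars[, c("mpg", "hp", "wt")])
X_scaled <- scale(X)
colMeans(X_scaled)        # approximately 0 for every column
apply(X_scaled, 2, sd)    # exactly 1 for every column

# Basic probability: chance of at least 55 heads in 100 fair coin flips
pbinom(54, size = 100, prob = 0.5, lower.tail = FALSE)
```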
Next, Excel: it is the most well-known spreadsheet software in the world, and it also has computation and graphing features that are excellent for data analysts. No matter your area of expertise or the additional software you might want, Excel is a standard in the industry. Its useful built-in features include form design tools and pivot tables, and it can also generate a wide range of charts.

Then, Python: every data analyst should know Python. It is easy to learn and has a simple syntax. Python is quite adaptable and includes a vast variety of resource libraries that are appropriate for a wide range of diverse data analytics activities. These libraries help in numerical and data computation; the pandas and NumPy libraries, for instance, are excellent for supporting standard data processing and streamlining highly computational operations. You can also choose between Python and R: R is a well-known open-source programming language, much like Python.

Data visualization tools: as we previously mentioned, a data visualization tool is also necessary to become a data analyst. Power BI has a user-friendly interface that makes building interactive visual reports and dashboards simple; its most vital selling point is its superb data integration, and it works flawlessly with cloud sources like Google and Facebook analytics as well as text files, SQL servers, and Excel. Tableau is one of the best commercial data analysis tools available; it handles huge amounts of data better than many other BI tools and is effortless to use, with a visual drag-and-drop interface. However, because it has no scripting layer, there is a limit to what Tableau can do.

Then we have MySQL: a lot of the time, SQL is the standard language for interacting with databases, and it is very helpful when working with structured data. SQL-based tools can create user-friendly dashboards that present data in various ways, since it is so simple to send complex commands to databases and change data in seconds, with commands to add, edit, and delete data. In addition, SQL is an excellent tool for creating data warehouses because of its simplicity, clarity, and interactivity.
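As a small, hedged illustration of sending SQL from an analysis environment, here is an R sketch using the DBI package; the shop.db file and the orders table are hypothetical stand-ins:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "shop.db")   # hypothetical database file

# Standard SQL runs unchanged through dbGetQuery(): aggregate revenue per product
revenue <- dbGetQuery(con, "
  SELECT   product, SUM(quantity * price) AS revenue
  FROM     orders
  GROUP BY product
  ORDER BY revenue DESC
")
print(revenue)

dbDisconnect(con)
```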
Overall, I would suggest that to become a data analyst you should work on programming languages like Python or R, plus MySQL to work on databases; adding to that, Excel plus visualization tools like Tableau or Power BI. You now know what skills and tools are required.

Coming to the work of a data analyst in an organization: you will have to troubleshoot the reporting database environment and reports. As a data analyst, you will use statistical methods to analyze data sets and spot any valuable trends that may develop over time, and you will evaluate a company's functional and non-functional requirements. Data analysts also assess data warehousing, inspecting, and reporting needs. These are all the responsibilities of a data analyst in an organization.

Coming to the companies hiring data analysts: IBM, Accenture, Capgemini, TCS, Facebook, Amazon, Flipkart, and Meta are among the top companies hiring, but data suggests that every small and medium-sized company needs a data analyst. The demand for data analysts exists in every company, so there is no need to worry.

Finally, the salary of a data analyst: it is high all over the world. When it comes to the USA, the average salary for a data analyst as a beginner goes as high as $70,000+ per annum, and for experienced professionals it goes as high as $120,000 per annum. In India, for a fresher it goes as high as ₹8,00,000 per annum, and for experienced professionals it is ₹20,00,000+ per annum; such is the demand for data analysts. Now that we have covered every important skill, it's time for you to start working on it.
Certifications have become an essential benchmark for professional growth and recognition in the dynamic field of data analytics. They not only validate your expertise but also open doors to exciting new career opportunities, and when it comes to top certifications, Simplilearn has established itself as a trusted provider of choice. So let's embark on a journey to discover the best data analyst certifications available for you.

Are you eager to step into the exciting world of data analytics? Look no further than Simplilearn's Postgraduate Program in Data Analytics. This comprehensive program is designed to equip you with the essential skills and knowledge needed to thrive in this rapidly evolving field; let's look into the features and benefits that make it stand out from the crowd. First and foremost, Simplilearn's career services are here to ensure that your talents are noticed by top hiring companies, giving you a competitive edge in the job market. Upon successful completion of the program, you will receive a prestigious Postgraduate Program certificate, a powerful testament to your commitment and expertise in the field of data analytics; this certificate holds a value that can open several career opportunities. To further enhance your learning experience, we have introduced an exclusive partnership with industry giant IBM. Through this collaboration we bring you unparalleled opportunities, such as exclusive hackathons and ask-me-anything sessions with IBM experts. This unique experience allows you to engage directly with industry professionals, gain invaluable insights, and tackle real-world challenges faced by data analysts, ensuring that you receive cutting-edge knowledge and stay ahead of the curve. At Simplilearn we believe in the power of interactive learning; that's why our live online classes offer an 8x higher level of live interaction compared to other programs. Delivered by industry experts, these engaging sessions provide you with a platform to ask questions, participate in discussions, and gain practical insights that will deepen your understanding of the subject matter. One of the defining features of our program is the hands-on approach to learning: through our innovative applied learning model, you will have the remarkable opportunity to work on a variety of real-world projects using industry data sets from reputable sources such as the Google Play Store and the World Bank. With our 14 data analytics projects, you will gain valuable practical experience and develop an impressive portfolio that showcases your skills to potential employers. To enrich your learning journey even further, we present master classes delivered by esteemed faculty from Purdue and IBM experts; their immense knowledge and expertise will provide you with valuable insights and a unique perspective, ensuring that you receive a well-rounded education in data analytics.

Now, you may be wondering if this program is suitable for you. The answer is yes: our Postgraduate Program in Data Analytics is thoughtfully designed to meet the needs of all working professionals. Whether you are an experienced data analyst looking to upskill or someone new to the field, this program is a perfect fit; no prior programming knowledge is required, as we cover everything from the fundamentals to advanced technologies. Throughout the program you will acquire a wide range of skills essential for a successful career in data analytics, including statistical analysis using Excel, data analysis with Python and R, data visualization using Tableau and Power BI, modeling with linear and logistic regression, clustering using k-means, and supervised learning techniques. Our comprehensive curriculum ensures that you are equipped with the necessary tools to excel in this rapidly growing field. So why wait any longer? Join Simplilearn's Postgraduate Program in Data Analytics and unlock a world of opportunities. Prepare to dive deep into the world of data analytics, collaborate with industry experts, and gain hands-on experience with real-world projects; your journey to becoming a data analytics professional starts right here. The link for this program will be in the description box below, so do check it out and enroll now.
Next, we have the Caltech Data Analytics Bootcamp: stay ahead of the data revolution with this collaborative program offered by Caltech, designed specifically for working professionals and students in the US. This bootcamp adopts an applied learning approach, providing integrated labs and real-world projects for the business environment. Experience academic excellence with the Caltech Data Analytics Bootcamp while also enjoying the benefits of Caltech Campus Connect. The program offers highly interactive learning, ensuring active participation and engagement, and provides ample hands-on experience to refine your skills. The Caltech Data Analytics Bootcamp caters to individuals from diverse backgrounds, offering substantial benefits to working professionals. You will gain skills such as Excel proficiency, data-driven presentation techniques, data manipulation with SQL, data analytics using Python, and data visualization using Tableau; additionally, you will learn data analytics with tools like AWS and other industry-relevant technologies. Covering a comprehensive curriculum, the Caltech Data Analytics Bootcamp dives into key concepts including data analytics with Excel, Python-based data analytics, database management with SQL, Tableau for data visualization, and data analytics on AWS. By completing this bootcamp, you will be equipped with a range of tools and technologies, and by choosing it you open doors to exciting career opportunities with renowned companies such as Microsoft, Google, Amazon, IBM, Apple, and many more. The Caltech Data Analytics Bootcamp has successfully empowered numerous aspiring data analysts, and their testimonials are available on the course page, accessible through the link in the description box. Now, if you aspire to become a data analyst and acquire job-ready skills, don't miss out on this intensive training program: join the Caltech Data Analytics Bootcamp and embark on your journey to excel in the world of analytics. The link for this bootcamp will be in
the description box; do check it out. Now we have another professional certification course in data analytics, provided by IIT Kanpur for Indian professionals and students. Are you ready to unlock the secrets hidden within vast amounts of data and gain a competitive edge in today's cutthroat business world? Simplilearn's data analyst course, delivered in collaboration with IIT Kanpur, will provide you with extensive expertise in the booming field of data analytics. This course is designed to provide a deep understanding of the principles, technologies, and applications of data analytics, empowering you to efficiently analyze, interpret, and extract actionable insights from data. The course follows a structured learning path that covers various aspects of data analytics, including business analytics using Excel, SQL, programming foundations using Python, data analytics with R programming, and Tableau training. Some of the key features of this program include master classes delivered by distinguished IIT Kanpur faculty, hands-on lab experience to help you master 14+ tools and frameworks, and industry-ready projects designed to advance your career trajectory. Simplilearn's job assistance services are here to help you get noticed by top hiring companies, increasing your chances of securing a rewarding position with renowned companies like Microsoft, Google, Amazon, IBM, Goldman Sachs, and many more. Upon successful completion of the program, you will receive a professional certificate from IIT Kanpur, adding immense value to your credentials. So what are you waiting for? Enroll now, embark on this transformative journey with IIT Kanpur's data analytics course, and unlock the boundless potential of data. Countless aspiring data analysts have benefited from the data analytics program; you can find their testimonials by following the link to the course page in the description box below.

Hi guys, this is R from Simplilearn, and today we're going
to look at three very important data-related roles in the field of data science, and then we're going to pit them against each other. So welcome to Data Scientist versus Data Analyst versus Data Engineer. Let's have a look at what's in store for you: firstly, we'll talk about the job descriptions, then the skill sets required for each role, the salary, the roles and responsibilities, and the companies hiring for these positions.

Now let's have a look at each of these roles in detail. First off, the data scientist. A data scientist is able to create machine-learning-based tools or processes within the company. They use advanced data techniques such as clustering, decision trees, neural networks, and so on, so that they can derive business conclusions. They are the most senior member in a team that also includes a data engineer as well as a data analyst, and they need to have in-depth knowledge of statistics, data handling, and machine learning. They also take inputs from data engineers as well as analysts so that they can formulate actionable insights for the business. A data scientist needs the same skills as a data analyst and an engineer, but with a lot more in-depth knowledge and expertise. Next up, we have the data analyst. A data analyst is someone who's able to translate numeric data into a form that everyone in the organization can understand. This is an entry-level position in the data analytics team. He or she needs to have technical skills in programming languages such as Python, have knowledge of tools like Excel, and understand the basics of data handling, modeling, and reporting. In due time, they can move up the ranks by taking up the roles of data engineer and data scientist with the experience they accumulate over the years. And finally, we have the data engineer. A data engineer is someone who's involved with preparing data for analytical or operational purposes; they are the intermediary between the data analyst and the data scientist. He or she needs to have a lot of experience when it comes to developing, constructing, and maintaining architectures. They generally work on big data and submit their reports to the data scientist so that they can be analyzed.

Now let's have a look at the skill sets required for each of these roles. First off, the data scientist: since this role is a little more coding oriented, you need to know a great deal when it comes to programming languages, languages such as Python, R, SQL, SAS, Java, and so on. You also need to be well versed with frameworks relating to big data, such as Pig, Spark, and Hadoop. Speaking of Hadoop, if you want to learn more about how it works, I suggest you click on the top right corner and watch our video on What is Hadoop. Coming back: data scientists also need to be well versed with machine learning, deep learning, and other similar technologies. Next up, the data analyst. This role is much less technical compared to a data scientist or a data engineer, considering it's entry level; here, knowing programming languages is a great bonus, so an idea about languages such as Python, R, SQL, JavaScript, SAS, and so on is a great benefit. At the same time, you do need to be well versed with tools such as SAS Miner, Microsoft Excel, SSAS, SPSS, and so on. And finally, the data engineer. Being a data engineer requires you to be well versed with a bunch of programming languages as well as frameworks: you need to know languages such as Python, R, SQL, SAS, Java, and so on, while having expertise in frameworks such as Hadoop, MapReduce, Hive, Pig, Apache Spark, data streaming, NoSQL, and so on.

Now let's talk about money, or the salary each of these roles gets. Firstly, we have the data scientist, who earns the highest of the three; the data analyst earns a pretty high salary when you consider that it's only an entry-level job; and the data engineer sits in the median, with $116,000 per annum.

Now let's talk about the roles and responsibilities. Firstly, the data scientist: a data scientist gets to work with a lot of unstructured data, so they need to mine and clean the data so that it's usable. They need to be able to design machine learning models to work on big data, infer and interpret the analysis of big data, lead an entire team to achieve the goals of the organization, and deliver conclusions that have a direct business impact. Now the roles and responsibilities of a data analyst: they need to use queries to gather information from a database, process the data, and provide summary reports. They use basic algorithms in their work, such as linear regression, logistic regression, and so on, and have core skills in statistics, data munging, data visualization, and exploratory data analysis. And finally, the data engineer: they need to mine through the data so that they can gain insights from it, convert erroneous data into a usable form so that it can be further analyzed, write queries on data, maintain the design as well as the architecture of the data, and create large data warehouses using ETL, or extract, transform, load.

Now let's have a look at some of the companies hiring for these roles. Firstly, for data scientists you have Citibank, Facebook, Schneider, Intel, Amazon, and so on; for data analysts you have Infosys, Oracle, Visa, Capital One, Walmart, and so on; and for data engineers you have Google, Cisco, Cognizant, Apple, Spotify, and much more.

Now that we have looked at the various steps involved in data analytics, let's now see
the different tools that can be used to perform these steps. As you can see, we have seven tools, including a few programming languages, that will help you perform analytics better; now let's discuss them.

First we have Python, an open-source programming language that supports a range of libraries for data manipulation, data visualization, and data modeling. Python programmers have developed tons of free and open-source libraries that you can use, and to install them Python provides the default package installer called pip. Python has libraries such as NumPy for numerical computation of data and pandas to manipulate data in numerical tables and time series; then you have SciPy for technical and scientific computations. It also provides scikit-learn, a machine learning library for creating classification, regression, and clustering algorithms, and finally it has PyTorch and TensorFlow for deep learning.

Up next we have R. R is an open-source programming language majorly used for numerical and statistical analysis. It provides a range of libraries for data analysis and visualization; some of these libraries are ggplot2, tidyverse, plotly, dplyr, and caret.

Then we have Tableau. Tableau is a popular data visualization and analytics tool that helps you create a range of visualizations to interactively present data, and build reports and dashboards to showcase insights and trends. It can connect with multiple data sources and surface hidden business insights.

Next we have Power BI, a business intelligence tool developed by Microsoft that has easy drag-and-drop functionality and supports multiple data sources, with features that make data visually appealing. Power BI supports features that help you ask questions of your data and identify trends.

The next tool is QlikView. QlikView provides interactive analytics with in-memory storage technology to analyze vast volumes of data and use data discoveries to support decision-making. It provides social data discovery and interactive guided analytics, and it can manipulate huge data sets.

Next we have Apache Spark, a data analytics engine that can process data in real time and carry out complex analytics using SQL queries and machine learning algorithms. It has Spark SQL to help with writing SQL queries, and it also has Spark MLlib, a library with a repository of machine learning algorithms.

And finally we have SAS. SAS is a statistical analysis software that can help you perform analytics, visualize your data, write SQL queries, perform statistical analysis, and build machine learning models to make future predictions. SAS empowers its customers to move the world forward by transforming data into intelligence; SAS is investing a lot to drive software innovation for analytics, and Gartner has positioned SAS as a Magic Quadrant leader for data science and analytics.
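To give a feel for the programming-language tools above, here is a tiny sketch combining two of the R libraries just mentioned, dplyr and ggplot2, on R's built-in mtcars data set:

```r
library(dplyr)
library(ggplot2)

# Group cars by cylinder count and compute average fuel efficiency
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))

# Present the result as a bar chart
ggplot(by_cyl, aes(x = factor(cyl), y = avg_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Average MPG")
```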
Data analytics is being used in almost every sector of business these days; let's discuss a few of them. The first application is retail. Customers expect products when they need them, and data analytics helps retailers meet those demands: retailers not only have an in-depth understanding of their customers, but they can also predict trends, recommend new products, and boost profitability. Retailers create assortments based on customer preferences, invoke the most relevant engagement strategy for each customer, and optimize the supply chain and retail operations at every step of the customer journey.

The second application is in healthcare. Healthcare industries analyze patient data to provide life-saving diagnosis and treatment options; they also deal with healthcare plans and insurance information to derive key insights. Using analytics, they can discover new drugs and come up with new drug development methods and improved medical imaging.

At number three we have manufacturing. For manufacturers, solving problems is nothing new: they fight difficult problems and situations on a daily basis, from complex supply chains to motion applications to labor constraints and equipment breakdowns. Using data analytics, manufacturing sectors can discover new cost-saving and revenue opportunities.

The fourth application is in finance. Financial institutions analyze structured and unstructured data to derive analytical insights and make sound financial decisions; using analytics, they can find probable loan defaulters and customer churn rates and detect fraudulent transactions.

Finally, logistics companies use data analytics to develop new business models that can ease their business and improve productivity. They can optimize routes to ensure deliveries arrive on time in a cost-efficient manner, and they also focus on improving order processing capabilities as well as performance management.

With that, let's now look at some of the companies using data analytics. First we have the supplier of health information technology solutions, services, devices, and hardware, Cerner, followed by Target and the antivirus company McAfee. Next we have Rapido, which is an Indian bike rental company based in Bangalore; after that we have Flipkart and the world's largest retail company, Walmart.

With that, let's understand a case study from Walmart and see how it uses data analytics to grow its business and serve its customers better. Walmart is an American multinational retail company that has over 11,500 stores in 27 countries worldwide, and it has e-commerce websites in 10 different countries. It has more than 5,900 retail units operating outside the United States, with 55 banners in 26 countries and more than 700,000 associates serving more than 100 million customers every week. It has over 2.2 million employees around the world, 1.5 million of them in the United States alone. Walmart's e-commerce branch alone employs more than 3,000 technologists. Millions of customers shop at Walmart each week, online and at its banner stores; walmart.com sees up to 100 million unique visitors a month, with data coming in from 1 million customers every hour. That's really huge!

Now, to make sense of all this information, Walmart has created the Data Café, a state-of-the-art analytics hub located within its Bentonville, Arkansas headquarters. Here, over 200 streams of internal and external data, including 40 petabytes of recent transactional data, can be modeled, manipulated, and visualized. Teams from any part of the business are invited to bring their problems to the analytics experts and then see a solution appear before their eyes on the nerve center's touchscreen smartboards. Walmart also constantly analyzes over 100 million keywords to know what people near each store are saying on social media, to understand customer behavior and what they like and dislike. Walmart uses modern tools and technologies to derive business insights and improve customer satisfaction; some of these tools include Python, SAS, NoSQL databases such as Cassandra, and Hadoop. Using all these technologies and data analysis techniques, Walmart can better manage its supply chain, optimize product assortment, personalize the shopping experience, give relevant product recommendations, and finally optimize and analyze transportation lanes and routes for its fleet of trucks.

With that, let's jump into our use case demo, where we will predict sales based on
advertising expenditure, using a linear regression model in R. The advertising expenditure has been made via different mediums, such as radio, television, and newspaper. We will use the R programming software, which can be downloaded from the CRAN website; it is easy to learn and use. The R language is built specifically for performing statistical analysis, data manipulation, and data mining, using packages such as plyr, dplyr, tidyr, and lubridate. R supports data visualization with the help of packages such as ggplot2, googleVis, RColorBrewer, leaflet, and ggmap, and finally the R software can be used in a wide range of analytical modeling, including classical statistical tests and more.

Now let's have a look at the data we will be using for this demo. Here is our advertising CSV data set, which has four columns: the first column is the TV ads expenditure, the next column is for radio ads, then we have the newspaper ads, and the last column is our target column, that is, the sales. So, for example, consider the second row: suppose you spend around $230 on TV ads, then $37.8 on radio ads, and $69.2 on newspaper ads; you can sell around 22 units of a certain item. Similarly, if you are spending $44.5 on TV advertising, $39.3 on radio ads, and $45 on newspaper ads, you can sell around 10 units.

We will analyze this data using linear regression. Linear regression is a supervised learning algorithm, which means the data has labeled columns, and it is used to predict numeric, continuous variables; our sales column here is the target column, and it has continuous numeric values.
Now let me go over to the RStudio script. The next step is to install all the necessary packages that we need for this demo. If you already have the packages installed in your RStudio, you need not do it again; you can just call these packages using the library function and pass the package names. First I will install the dplyr package, which is used for data manipulation: I'll use the install.packages function and give the package name. I'm not going to run this because I already have it installed in my RStudio. Next, I'll call this package using the library function, give the package name dplyr, and run it. Then I'll install the broom package; it takes the messy output of built-in functions in R, such as the linear model function lm or t.test, and turns them into a tidy data frame. I'll copy the above code and change the package name. Then I'll install the caTools package, which will help us build our linear regression model, paste the same code, and call it with the library function. Now, sometimes people face issues with installing this particular package; this is the RStudio Community page, and they have the solution here, so you can just go through this thread. Next I'll install the ggplot2 package for data visualization. I'm not running install.packages again because I have already installed all of these; I'll just call the library function for each. With that, let's now load the data set. For this I will use the read.csv function and provide the location where my data is located, followed by the data set name and extension. Here is my advertising CSV data set, and this is the location where it is present. One thing to note: we have to change all the backslashes in the path to forward slashes, otherwise R won't accept it.
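Here is a minimal sketch of the setup steps just described, assuming the data frame is named ads (the name used later in the demo); the file path is a placeholder you would replace with your own:

```r
# Install once; on later runs just load with library()
install.packages(c("dplyr", "broom", "caTools", "ggplot2"))

library(dplyr)    # data manipulation
library(broom)    # turns messy model output into tidy data frames
library(caTools)  # train/test splitting
library(ggplot2)  # data visualization

# Load the advertising data; note the forward slashes in the path
ads <- read.csv("C:/path/to/advertising.csv")
```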
Now let us look at how our data set looks, using the head function. I'll give a comment and run it, and you can see the head function has displayed the first six rows from the advertising data set. Let me now check the dimensions of the data set: I'll use the dim function and pass in the ads variable, and you can see it has given the number of rows, which is 200, and the total number of columns, which is four. Now, if you want to get a summary of the data set, you can use the summary function, so I'll directly type summary and give it ads. Let me expand this. The summary function gives you a few statistics for each of the columns: you can see the minimum value for each column, the maximum value, the mean, the median, and the first and third quartile values. The first quartile, or lower quartile, is the value that cuts off the first 25% of the data when it is sorted in ascending order; the second quartile is the median; and the third quartile, or upper quartile, is the value that cuts off the first 75% of the data.
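The inspection steps in code, assuming the ads data frame from above:

```r
head(ads)     # first six rows of the data set
dim(ads)      # number of rows (200) and columns (4)
summary(ads)  # min, quartiles, median, mean, max for each column
```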
Moving ahead, let's do some data visualization. Since our data has only numeric values, using scatter plots is the best option, so we will visualize our sales against each of the independent variables. For that I will use the plot function, giving sales on the x-axis and the independent variable on the y-axis. I'll give a comment, "data visualization," and use the plot function: on the x-axis, using the dollar symbol, I'll give sales, and on the y-axis I'll give my independent variable; you can see R automatically gives you suggestions, and I'll select TV. Then I'll set type equal to, in quotes, "p", which stands for points. The points are pretty much aligned in one direction, which means that if you increase the expenditure on TV ads, the units sold can also be expected to increase. I'll close it. Now let's look at how sales vary based on radio advertising expenditure. If you look at the blue dots, it is not as linear as our previous graph; you can see there are a few scattered data points. Still, you can expect a decent amount of sales if you are willing to spend on radio advertising. I'll close this one too. Let's now look at how sales vary based on newspaper advertising expenditure: I'll change the radio column to the newspaper column, and this time I'll take the color blue. You can see the points are very hazily placed; the data is completely nonlinear, and there seems to be a low correlation between sales and newspaper advertising expenditure. Now, if you want to look at all these plots at a time, you can use the pairs function, so I'll type pairs and pass in my variable name. In the resulting matrix of visualizations you can see sales against TV expenditure, against radio expenditure, and against newspaper expenditure.
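A sketch of the plotting calls being described, assuming the column names TV, radio, newspaper, and sales; following the narration, sales goes on the x-axis, and the blue color for the later plots is as I understood it:

```r
# Scatter plots of sales against each independent variable
plot(ads$sales, ads$TV, type = "p")
plot(ads$sales, ads$radio, type = "p", col = "blue")
plot(ads$sales, ads$newspaper, type = "p", col = "blue")

# All pairwise scatter plots in a single matrix
pairs(ads)
```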
Moving ahead, let's check the correlation between the variables and see what insights we can get. We will use correlation analysis; for this I will have to install the corrplot package, which I have already got installed. First we grab only the numeric columns. Our data already has only numeric columns, but let me still show you how to do it, since correlations are based on numeric columns only. This can be done using the sapply function: I'll create a variable, num.cols, for the numeric columns, pass the ads variable into sapply, and check whether each variable is numeric or not using is.numeric. Let me run it; you can see it says TRUE for TV, which means TV has numeric values, and the same holds for radio and similarly for newspaper. Then I'll use the cor function to display the correlations between the variables: I'll name my variable cor.data, take the cor function, pass in the ads variable, and filter out only the numeric columns; the comma before the numeric columns means we need all the rows and only the selected columns. The correlations are all above zero, which means there is a positive correlation between the variables, and a change in one is associated with a change in the others. TV ads have the maximum correlation with sales, at around 0.78; then there is radio advertising, which has a correlation of about 0.57 with sales; and newspaper ads have the lowest correlation of the three, at about 0.22. You can also build a correlation matrix using the correlation plot method, which gives you a visual representation of the correlation between the variables, so let's see how that looks. In the matrix, on the right you can see the scale: minus one is for negative correlation, then there's light red, then zero, which is almost white, then light blue, and finally dark blue for the maximum positive correlation. The diagonals are dark blue because they represent the same variable in the row and the column. TV ads and sales have the highest correlation, radio ads the next highest, while newspaper ads have the lowest correlation with sales.
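The correlation steps, reconstructed as best I can from the narration; the method argument to corrplot is my assumption for producing the colored matrix described:

```r
library(corrplot)

# Flag the numeric columns (all four are numeric here)
num.cols <- sapply(ads, is.numeric)

# Correlation matrix over all rows, numeric columns only
cor.data <- cor(ads[, num.cols])
cor.data

# Visual correlation matrix
corrplot(cor.data, method = "color")
```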
With that, let's jump into the most important part of this analysis, which is building our regression model. First we will look at a simple linear regression model, where we take one input variable, TV ads. I'll be using the lm function, or linear model function, to build the model: I'll create model_simple, and in the lm function I'll give my target variable, sales, followed by the tilde and the input variable, and run it. Now that we have built our linear regression model, let's check the summary: I'll take the summary function, pass in model_simple, and run it. If I expand this, you can see our intercept estimate, and the TV coefficient tells us the average increase in sales we can expect for every $1,000 increase in the TV advertising budget. You can also view this output in a cleaner form with the tidy function from the broom package: if I call tidy and give the model name, model_simple, the messy summary is returned as a tidy data frame.
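A sketch of the simple regression step:

```r
# Simple linear regression: sales explained by TV spend alone
model_simple <- lm(sales ~ TV, data = ads)
summary(model_simple)

library(broom)
tidy(model_simple)  # same coefficients as a tidy data frame
```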
Next, we'll build a multiple linear regression model. I'll take my variable name as model_multiple, with sales as the target; after the tilde I'll take all the input column names, TV, then the addition operator, then radio, then the newspaper column, and I'll take my data as ads. Let's run it. I'll follow the same drill and call the summary function over this newly created model, so I'll write summary and pass in the model. The interpretation of the coefficients is the same as in the simple linear regression model. First, we see that our coefficients for the TV and radio advertising budgets are statistically significant, since their p-values are less than 0.05, while the coefficient for newspaper is not, with a p-value of around 0.86; thus, changes in the newspaper budget do not appear to have any relationship with changes in sales. For TV ads, however, our coefficient suggests that for every $1,000 increase in the TV advertising budget, holding all other predictors constant, we can expect an increase in sales of about 45 units on average; similarly, the radio coefficient suggests an expected increase in sales for every $1,000 increase in radio advertising, holding all the other predictors constant. You can also call the tidy function over this multiple linear regression model: I'll call tidy and pass in model_multiple, and you can see it has given the output. You can also pull out the coefficients of the model using another method, the coefficient matrix; calling it shows the coefficients of our multiple linear regression model.
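The multiple regression steps; the narrator's exact call for the coefficient matrix is not fully audible, so coef() below is my stand-in:

```r
# Multiple linear regression with all three predictors
model_multiple <- lm(sales ~ TV + radio + newspaper, data = ads)
summary(model_multiple)
tidy(model_multiple)

# Coefficients pulled out on their own
coef(model_multiple)
```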
Now let's split the data and rebuild the regression model using the caTools library. First I'll take a random seed value so the split is reproducible; then I'll call sample.split, passing in the ads data and a split ratio. I'll use the subset function to get the training set, and then give the same parameters but this time take sample equal-equal FALSE, which means the test data set won't have any rows that are present in the training data. Next I'll assign the linear model to a model variable: I'll take sales as my target column and use the tilde followed by a dot, which means I'm taking all the remaining variables as the independent variables, and I'll select my training data set. With that, let's check the summary of this newly created model as well; you can also inspect the residuals using the residuals function, so let me assign a variable called res for the residuals. Then let's make predictions: I'll use the predict function and pass in my model followed by the test data set. Now let's compare the predicted sales values to the original sales for the test data. For that I'll use the cbind function: I'll take sales.predictions and the sales column from the test data. Checking the values, you have the predicted sales values and the original values of sales, but you can see the columns don't have any names assigned to them, so let me assign the column names using the colnames function and convert the result into a data frame to make it look better: I'll use the colnames function on my results variable and give it a vector of names. Now on the left you have the predicted values and on the right you have the real values, so we have successfully built our multiple linear regression model and tested it on unseen data.
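Putting the split/train/predict sequence together; the seed value, split ratio, and column labels are my assumptions where the narration elides them:

```r
library(caTools)

set.seed(101)                                  # any fixed seed gives a reproducible split
sample <- sample.split(ads$sales, SplitRatio = 0.7)
train  <- subset(ads, sample == TRUE)
test   <- subset(ads, sample == FALSE)

model <- lm(sales ~ ., data = train)           # "." means all remaining columns as predictors
summary(model)
res <- residuals(model)

sales.predictions <- predict(model, test)
results <- as.data.frame(cbind(sales.predictions, test$sales))
colnames(results) <- c("predicted", "real")
results
```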
You can also go ahead and find the accuracy of this model, to know how good your model is. We won't be covering that as part of this tutorial; I'll leave it for you and encourage you to do some research on how you can find the accuracy of a linear regression model. You will come across terms such as mean squared error, root mean squared error, and the R-squared value. If you are able to find the accuracy, please post the results in the comment section, or if you face any issues with it, please post your queries; we'll be happy to help you.
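As a starting point for that exercise, here is one common way to compute those metrics on the test predictions (my own sketch, using the predicted and real columns from the results data frame above; the video deliberately leaves this part out):

```r
# Mean squared error and root mean squared error on the test set
mse  <- mean((results$real - results$predicted)^2)
rmse <- sqrt(mse)

# R-squared: share of variance in the test sales the model explains
sse <- sum((results$real - results$predicted)^2)
sst <- sum((results$real - mean(results$real))^2)
r2  <- 1 - sse / sst

c(MSE = mse, RMSE = rmse, R2 = r2)
```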
And if you're an aspiring data analyst, try giving a shot to Simplilearn's Postgraduate Program in Data Analytics from Purdue University, in collaboration with IBM; the link in the description box and the pinned comment should take you to the program being offered.
Online analytical processing (OLAP) and online transaction processing (OLTP) are two popular database systems that have evolved separately to serve distinct purposes. While both systems offer data storage and retrieval capabilities, they differ significantly in terms of their architecture, data flow, and performance characteristics. OLAP is designed specifically for data analysis and reporting, while OLTP is designed for online transaction processing, such as online banking, inventory management, and order processing. OLAP is optimized for analyzing large volumes of data and reporting metrics, while OLTP is optimized for quickly processing individual transactions. There are many other differences between these two crucial data processing systems, so in this video we'll be discussing the major differences between them and how you can implement them in large transaction processing systems. Without any further ado, let's get started. Also, if you're an aspiring data analyst looking for online training and certification from prestigious universities, in collaboration with leading experts, then search no more: Simplilearn's Postgraduate Program in Data Analytics from Purdue University, in collaboration with IBM, should be your right choice. For more details, use the link in the description box below, and with that in mind, over to our training experts.
Two approaches stand out in data processing and data analysis: online analytical processing, that is, OLAP, and online transactional processing, OLTP. Although they share the common goal of handling data, these methodologies differ significantly in their purpose, data structure, performance characteristics, and design principles. OLAP is a powerful tool for organizations seeking to extract valuable insights from vast volumes of historical data. Its multi-dimensional data model, organized in cubes comprising dimensions and measures, allows for sophisticated analysis and in-depth exploration. With a focus on aggregating and summarizing information, OLAP empowers users to identify trends, patterns, and correlations across multiple dimensions. Despite longer response times, which are acceptable given its analytical nature, OLAP excels at providing flexible, ad hoc query capabilities for complex data exploration. Conversely, OLTP is designed to handle real-time transactional processing, ensuring the integrity and consistency of daily operational activities. By adopting a normalized data model, OLTP optimizes the storage and retrieval of individual records in high-speed environments. Its primary objective revolves around efficient data modification, such as inserting, updating, and deleting records, to support concurrent transactional operations. With a focus on rapid response times and maintaining data accuracy at the record level, OLTP caters to the needs of time-sensitive business operations. Understanding the distinctions between OLAP and OLTP is crucial for organizations choosing the appropriate data processing approach for their specific requirements. Today we will understand the difference between OLAP and OLTP by going through the following details: purpose, data structure, data volume, response time, query complexity, data modification, data granularity, concurrency, data backup and recovery, and finally system design. With the briefing of OLAP and OLTP done and the agenda for the session discussed, let's get to the point of today's session, the major, or top 10, differences between OLAP and OLTP.
1. Purpose: OLAP is designed for data analysis and reporting, enabling users to gain insights from large volumes of historical data. OLTP is designed for real-time transaction processing, handling day-to-day operations such as inserting, updating, and deleting individual records; its primary objective is to ensure data integrity and support high-speed transactional operations.
2. Data structure: OLAP uses a multi-dimensional data model called a cube. It organizes data into dimensions, such as time, geography, and product, and measures, such as sales and profit, to facilitate multi-dimensional analysis and drill-down capabilities. OLTP uses a normalized data model with tables and relationships, aiming for efficient transactional processing; it minimizes data redundancy and ensures data consistency through normalization techniques.
3. Data volume: OLAP deals with large volumes of historical data, typically spanning years, and focuses on analyzing and summarizing this vast amount of information. OLTP deals with relatively smaller volumes of data, usually representing real-time transactions happening within a shorter time frame.
4. Response time: OLAP allows for longer response times, since it deals with complex queries and large data sets; users expect analytical reports to be generated within minutes or even hours. OLTP requires very fast response times to support real-time transaction processing; users expect quick responses, usually in milliseconds or seconds, a lot faster.
5. Query complexity: OLAP queries are complex, involving aggregations, grouping, filtering, and calculations across multiple dimensions; users need flexible, ad hoc querying capabilities to perform data analysis. OLTP queries are relatively simple, primarily focused on retrieving or modifying individual records based on specific transactional needs; queries are typically short and transaction-oriented.
6. Data modification: OLAP data is read-only or minimally updated; data is loaded into OLAP cubes periodically, for example daily or weekly, to refresh the analytical store. OLTP handles frequent data modifications, including insertions, updates, and deletions, ensuring that the transactional database remains up to date and reflects the current state of the business.
7. Data granularity: OLAP deals with aggregated and summarized data, providing a high-level view of information across various dimensions; it focuses on trends, patterns, and overall performance analysis. OLTP operates at a detailed level, capturing individual transactions at the record level.
8. Concurrency: OLAP involves a low level of concurrent users, since users typically perform separate analysis and reporting tasks; the emphasis is on analytical activities rather than concurrent transactional processing. OLTP requires a high level of concurrency to handle multiple users simultaneously accessing and modifying the same data, with a focus on maintaining data consistency and isolation among concurrent transactions.
9. Data backup and recovery: OLAP data is usually derived from OLTP systems, so backup and recovery are less critical; it can be regenerated from the transactional database if necessary. In the case of OLTP, the data is critical, and backup and recovery processes are essential; regular backups protect against failures.
10. System design: OLAP systems are typically designed with a focus on read-intensive operations; they employ specialized data storage and indexing techniques optimized for analytical queries and aggregations, and OLAP databases are often denormalized to improve query performance. OLTP systems are designed to handle a high volume of concurrent read and write operations; they prioritize data consistency and transactional integrity, often using normalized database structures to minimize redundancy and ensure data accuracy.
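To make the contrast concrete, here is a small illustration of the two workloads (my own example, not from the video), using an in-memory SQLite database from R: the OLAP-style query scans and aggregates many rows for reporting, while the OLTP-style statement quickly touches a single record:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "orders", data.frame(
  id     = 1:4,
  region = c("East", "West", "East", "West"),
  amount = c(120, 80, 200, 150)
))

# OLAP-style: aggregate across the table for a report
dbGetQuery(con, "SELECT region, SUM(amount) AS total
                 FROM orders GROUP BY region")

# OLTP-style: modify one record as part of a transaction
dbExecute(con, "UPDATE orders SET amount = 95 WHERE id = 2")

dbDisconnect(con)
```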
In today's fast-paced digital landscape, businesses face a daunting challenge: extracting valuable insights from massive amounts of data. Enter the ETL pipeline, the backbone of data processing and analytics. In this tutorial we will embark on an accelerating journey, unveiling the secrets of building a powerful ETL pipeline. Whether you are a seasoned data engineer or just starting your data-driven adventure, this video is your gateway to unlocking the full potential of your data. Together we will demystify the ETL process step by step. We'll dive into the extract phase, where we retrieve data from multiple sources ranging from databases to APIs, and then seamlessly transition into the transformation phase, where we clean, validate, and reshape the data into a consistent format. But wait, there's more: we will explore cutting-edge techniques for handling large data sets, leveraging cloud-based technologies, and ensuring data quality. We aim to equip you with the tools and knowledge to create robust and scalable ETL pipelines that can handle any data challenge. So buckle up and get ready to revolutionize your data workflow; join us on this accelerating journey to master the art of ETL pipelines. Having said that, if you're an aspiring data analyst looking for online training and certification from prestigious universities, in collaboration with leading experts, then search no more: Simplilearn's Postgraduate Program in Data Analytics from Purdue University, in collaboration with IBM, should be the right choice. For more details, use the link in the description box below. With that in mind, over to our training experts. Hey everyone, so without further ado, let's
get started with ETL pipelines. ETL basically stands for extract, transform, and load. ETL pipelines fall under the umbrella of data pipelines: a data pipeline is simply a medium for data extraction, filtration, transformation, exporting, and loading activities, through which data is delivered from producer to consumer. To make it a little simpler, data is produced in two types. Let's say you run a vehicle showroom, making you a data producer; the data you produce is very small and could basically fit into an Excel sheet. This type of data might need an update once every 24 hours, or based on your audit cycle; here we call it batch data, and it is processed using the OLTP model and batch processing tools. But now let's say you're running an entire vehicle manufacturing plant: the data you're dealing with is voluminous and includes various types, structured, unstructured, and semi-structured, ranging from spares inventory all the way up to robotic assembly sensor data. Based on requirements, this type of data may need updates every hour, every minute, or even every second; such data is called real-time data, it needs real-time data streaming frameworks, and it is processed using OLAP models. Now, ETL is involved in both these
approaches. Now let's dive in and understand what exactly an ETL pipeline is. ETL stands for extract, transform, and load, representing the three core steps in the data integration and transformation process; let's dive into each phase and explore its significance. First, extract: the first step in an ETL pipeline is extracting data from various sources, which can range from relational databases and data warehouses to APIs or even streaming platforms. The goal is to gather raw data and bring it into a centralized location for further processing; tools like Apache Kafka, Apache NiFi, or even custom scripts can be used to perform the extraction efficiently. Next is transform: once the data is extracted, it often requires significant cleaning, validation, and restructuring. This is the transformation phase, which ensures that the data is consistent, standardized, and ready for analysis. Transformations can include tasks such as data cleansing, filtering, aggregating, joining, or applying complex business rules; tools like Apache Spark and Talend, or Python libraries like pandas, are commonly used for these transformations. Lastly, we have the load phase: the final step is loading the transformed data into target systems such as a data warehouse, data lake, or database optimized for analysis, which allows business users and analysts to access and query the data easily. Loading can involve batch processing or real-time streaming, depending on the requirements of the business; technologies like Apache Hive, Amazon Redshift, or Google BigQuery are often employed for efficient data loading.
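As a toy illustration of the three phases in code (a sketch in R, the language of the earlier demo; the file name and the SQLite target are assumptions, standing in for the heavier tools just mentioned):

```r
library(dplyr)
library(DBI)
library(RSQLite)

# Extract: pull raw data from a source system (here, a CSV file)
raw <- read.csv("sales_raw.csv")

# Transform: clean, filter, and aggregate into a consistent shape
clean <- raw %>%
  filter(!is.na(amount)) %>%                 # drop erroneous rows
  mutate(region = toupper(region)) %>%       # standardize values
  group_by(region) %>%
  summarise(total_sales = sum(amount), .groups = "drop")

# Load: write the transformed data into an analysis-ready target
con <- dbConnect(RSQLite::SQLite(), "warehouse.db")
dbWriteTable(con, "sales_summary", clean, overwrite = TRUE)
dbDisconnect(con)
```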
Now that we have understood the core phases, let's explore some key concepts and best practices for building robust ETL pipelines. First, data quality: ensuring data quality is crucial for reliable analysis; implementing data validation checks, handling missing values, and resolving data inconsistencies are vital to maintaining data integrity throughout the pipeline. Next is scalability: as data volumes grow exponentially, scalability becomes essential; distributed computing frameworks like Apache Spark enable processing large data sets in parallel, allowing pipelines to handle increasing data loads efficiently. Third, we have error handling and monitoring: robust error handling mechanisms, such as retries, logging, and alerting, should be implemented to handle failures gracefully; additionally, monitoring tools can provide real-time insight into pipeline performance, allowing quick identification and resolution of issues. Next we have incremental loading: for continuously evolving data sets, incremental loading strategies can significantly improve pipeline efficiency; rather than processing the entire data set each time, only the new or modified data is extracted and transformed, reducing processing time and resource consumption. And lastly, we have data governance and security: incorporating data governance practices and adhering to security protocols is crucial for protecting sensitive data and ensuring compliance with regulations like GDPR or HIPAA.
Now that we have covered what exactly ETL is, the ETL stages, and the best practices for ETL pipelines, let's proceed to understanding the popular ETL tools. The first among them is Apache Airflow, an open-source platform that allows you to schedule, monitor, and manage complex workflows; Apache Airflow provides a rich set of operators and connectors, enabling seamless integration with various data sources and destinations. Next is Talend, a comprehensive ETL tool that offers a visual interface for designing data integration workflows; Talend provides a vast array of pre-built connectors, transformations, and data quality features, making it an ideal choice for enterprises. And lastly, we have Informatica, a widely used enterprise-grade ETL tool that supports complex data integration scenarios; its PowerCenter offering provides a robust set of features like metadata management, data profiling, and data lineage, empowering
organizations. So what is data analysis? Data analysis is not just a single step but a set of processes. It is the process of collecting data, then cleaning it, which simply means removing the irrelevant data, and then transforming it into meaningful information. We can relate this process to making a jigsaw puzzle: just as you gather all the pieces and fit them together to bring out a beautiful picture, data analysis works on almost the same grounds. To achieve the goals of data analysis, we use a number of data analysis tools; companies rely on these tools to gather and transform their data into meaningful insights. So which tool should you choose to analyze your data, and which tool should you learn if you want to make a career in this field? We will answer that in this session. After extensive research, we have come up with these top 10 data analysis tools; here we will look at the features of each of these tools and the companies using them.
So let's start off: at number 10, we have Microsoft Excel. All of us would have used Microsoft Excel at some point, right? It is easy to use and one of the best tools for data analysis. Developed by Microsoft, Excel is basically a spreadsheet program: using Excel you can create grids of numbers, text, and formulas, and it is one of the most widely used tools. Excel works with almost every other piece of software in Office; we can easily add Excel spreadsheets to Word documents and PowerPoint presentations to create more visually appealing reports or presentations. The Windows version of Excel supports VBA programming; programming with VBA allows spreadsheet manipulation that is difficult with standard spreadsheet techniques, and in addition, the user can automate tasks such as formatting or data organization in VBA. One of the biggest benefits of Excel is its ability to organize large amounts of data into orderly, logical spreadsheets and charts; by doing so, it becomes a lot easier to analyze data, especially while creating graphs and other visual data representations, and the visualizations can be generated from a specified group of cells. Those were a few of the features of Microsoft Excel; let's now have a look at the companies using it. Most organizations today use Excel; a few that use it for analysis are Ernst & Young, UrbanPro, and Wipro. At number nine, we have Rapid
Miner, a data science software platform. RapidMiner provides an integrated environment for data preparation, analysis, machine learning, and deep learning, and it is used in almost every business and commercial sector. RapidMiner also supports all the steps of the machine learning process. Seen on your screen is the interface of RapidMiner. Moving on to the features of RapidMiner: firstly, it offers drag-and-drop functionality; it is very convenient to drag and drop columns as you are exploring a data set and working on some analyses. RapidMiner allows the usage of any data, and it also gives you the opportunity to create models. It offers features such as graphs, descriptive statistics, and visualization, which allow users to get valuable insights for the task at hand. Let's now have a look at the companies using RapidMiner: we have the Caribbean airline Leeward Islands Air Transport, then the UnitedHealth Group, the American online payment company PayPal, and the Australian telecom company Mobilecon. So that was all about RapidMiner; now let's see which tool we have at number
eight. At number eight, we have Talend. Talend is an open-source software platform which offers data integration and management, and it specializes in big data integration. Talend is available in both open-source and premium versions, and it is one of the best tools for cloud computing and big data integration. The interface of Talend is as seen on your screen. Moving on to the features of Talend: firstly, automation is one of the great boons Talend offers; it even maintains the tasks for the users, which helps with quick deployment and development. It also offers open-source tools, which Talend lets you download for free; development costs reduce significantly as the processes gradually speed up. Talend provides a unified platform: it allows you to integrate with many databases, SaaS, and other technologies with the help of its data integration, letting you work faster. Those were the features of Talend. The companies using Talend are Air France, L'Oréal, and Capgemini, among others. At number seven, we
have KNIME, the Konstanz Information Miner. KNIME is a free and open-source data analytics, reporting, and integration platform. It can integrate various components for machine learning and data mining through its modular data pipelining concept. KNIME has been used in pharmaceutical research and in other areas like CRM customer data analysis and business intelligence. It provides a graphical user interface for creating visual workflows using the drag-and-drop feature. The use of JDBC allows assembly of nodes blending different data sources, including pre-processing such as ETL, that is, extraction, transformation, loading, for modeling, data analysis, and visualization with minimal programming. It supports multi-threaded in-memory data processing: KNIME allows users to visually create data flows, selectively execute some or all analysis steps, and later inspect the results, models, and interactive views. KNIME Server automates workflow execution and supports team-based collaboration. KNIME integrates various other open-source projects, such as machine learning algorithms from Weka, H2O, Keras, Spark, and the R project, and it allows analysis of, for example, 300 million customer addresses. The companies using KNIME include UnitedHealth Group, ASML, Fractal Analytics, Atos, and the LEGO Group. Let's now move on to the next tool: we
have SAS at number six. SAS facilitates analysis, reporting, and predictive modeling with the help of powerful visualizations and dashboards. In SAS, data is extracted and categorized, which helps in identifying and analyzing data patterns. As you can see on your screen, this is how the interface looks. Moving on to the features of SAS: using SAS, better analysis of data is achieved through automatic code generation and SAS SQL. SAS allows you access through Microsoft Office, letting you create reports with it and distribute them through it. SAS helps with an easy understanding of complex data and allows you to create interactive dashboards and reports. Let's now have a look at the companies using SAS: we have companies like Genpact, IQVIA, Accenture, and IBM. To quickly repeat our list: at number 10 we have Microsoft Excel, at number nine RapidMiner, at number eight Talend, at number seven KNIME, and at number six SAS. So far, do you all agree with this list? Let us know in the comment section below. Let's now move on to the next five tools in our list. So at number five we have both R
and Python. R is a programming language used for analysis as well; it has traditionally been used in academics and research. Python is a high-level programming language which has a Python data analysis library; it is used for everything starting from importing data. Coming to the features of both R and Python: when it comes to availability, both R and Python are completely free, hence they can be used without any license. R used to compute everything in memory, and hence the computations were limited, but now that has changed; both R and Python have options for parallel computation and good data handling capabilities. As mentioned earlier, both R and Python are open in nature. Among the companies using R, we have Uber, Google, and Facebook, to name a few; Python is used by many companies as well, again to name a few, Amazon, Google, and the American photo and video sharing social networking service Instagram. That was all about R and
Python. At number four, we have Apache Spark. Apache Spark is an open-source engine developed specifically for handling large-scale data processing and analytics. Spark offers the ability to access data from a variety of sources, including the Hadoop Distributed File System (HDFS), OpenStack Swift, Amazon S3, and Cassandra. Apache Spark is designed to accelerate analytics on Hadoop while providing a complete suite of complementary tools that include a fully featured machine learning library, a graph processing engine, and stream processing. This is how the interface of Apache Spark looks. Now let's look at the important features. Spark helps accelerate the speed of analytics: it can run an application on a Hadoop cluster up to 100 times faster in memory and 10 times faster on disk. It supports multiple languages, allowing developers to write applications in Java, Scala, R, or Python. Spark comes with over 80 high-level operators for interactive querying, and Spark code can be reused for batch processing, joining streams against historical data, or running ad hoc queries on stream state. Analytics can be performed better, as Spark has a rich set of SQL queries, machine learning algorithms, complex analytics, and so on. Apache Spark provides fault tolerance through Spark RDDs: Spark Resilient Distributed Datasets are designed to handle the failure of any worker node in the cluster, thus ensuring that data loss is reduced to zero. Conviva, Netflix, IQVIA, Lockheed Martin, and eBay are some of the companies that use Apache Spark on a daily
basis. At number three, we have QlikView. The QlikView software is a product of Qlik, for business intelligence and data visualization; QlikView is a business discovery platform, and with QlikView you can analyze data and use your data discoveries to support decision making in organizations. On the screen you can see how the interface of QlikView looks. Now, talking about its features: QlikView provides interactive guided analytics with in-memory storage technology; during the process of data discovery and interpretation of collected data, the QlikView software helps the user by suggesting possible interpretations. QlikView uses a patented in-memory architecture for data storage: all the data from the different sources is loaded into the RAM of the system, where it is ready to be retrieved. It has the capability of efficient social and mobile data discovery: social data discovery offers to share individual data insights within groups or outside them, and a user can add annotations as an addition to someone else's insights on a particular data report. QlikView supports mobile data discovery with an HTML5-enabled touch feature, which lets the user search the applications. QlikView performs OLAP and ETL features to perform analytical operations and extract data from multiple sources. The companies where you can start your career with QlikView include Mercedes-Benz, Capgemini, Citibank, Cognizant, and Accenture, to name a
few. At number two, we have Power BI. Power BI is a business analytics solution that lets you visualize your data and embed insights in an app or website; it can connect to hundreds of data sources and bring your data to life with live dashboards and reports. Power BI is the collective name for a combination of cloud-based apps and services that help organizations manage and analyze data through a user-friendly interface. Power BI is built on the foundation of Microsoft Excel and has several components, such as the Windows desktop application called Power BI Desktop, an online software-as-a-service offering called Power BI Service, and mobile Power BI apps available on Windows phones and tablets, as well as for iOS and Android devices. Here is how the Power BI interface looks: as you can see, there is a visually interactive sales report with different charts and graphs. Moving on to the features of Power BI: it has easy drag-and-drop functionality, with features that make data visually appealing; you can create reports without knowledge of any programming language. Power BI helps users see not only what has happened in the past and what is happening in the present, but also what might happen in the future. To create reports and dashboards, you can select several charts and graphs from the visualization pane. Power BI has machine learning capabilities with which it can spot patterns in data and use those patterns to make informed predictions and run what-if scenarios. Power BI supports multiple data sources, such as Excel, text, CSV, Oracle, SQL Server, PDF, and XML files. The platform integrates with other popular business management tools like SharePoint, Office 365, and Dynamics 365, as well as other non-Microsoft products like Spark, Hadoop, Google Analytics, SAP, Salesforce, and MailChimp. Some of the companies using Power BI are Adobe, AXA, Carlsberg, and Capgemini.
Topping our list at number one, we have Tableau. Gartner's Magic Quadrant of 2020 classified Tableau as a leader in business intelligence and data analytics. Headquartered in California, Tableau is a data visualization software that is used for data science and business intelligence, and it can create a wide range of visualizations to present data and showcase insights. The important products of Tableau are Tableau Desktop, Tableau Public, Tableau Server, Tableau Online, and Tableau Reader. This is how the interface of Tableau Desktop looks. Now, coming to the features of Tableau: data analysis is very fast with Tableau, and the visualizations created are in the form of dashboards and worksheets. Tableau delivers interactive dashboards that support insights on the fly. It can translate queries to visualizations and import all ranges and sizes of data; writing simple SQL queries can help join multiple data sets, and with filters and highlighters, Tableau allows you to ask questions, spot trends, and identify opportunities. The companies using Tableau include Deloitte, Adobe, Cisco, LinkedIn, and the American e-commerce giant Amazon, to name a few.
And there you go, those are the top 10 data analysis tools. Let's now have a question and answer session; please feel free to post your queries in the comment section, and we'll respond in the chat. Before the question and answer session, let's recap quickly; in the meanwhile, you can all post your questions in the comment section below. At number 10 we have Microsoft Excel; at number nine, RapidMiner; at number eight, Talend; at number seven, KNIME; at number six, SAS; at number five, R and Python; at number four, Apache Spark; at number three, QlikView; at number two, Power BI; and finally we have Tableau topping the list at number one. Welcome to this tutorial on Microsoft Excel. We
will learn about functions and formulas, conditional formatting, data validation, pivot charts, and pivot tables. Now let's look at a scenario. One day in a startup, one professional remarks that their business is growing and they need an efficient way to work with their data; they have to find a way to work faster when storing and analyzing data. To that, a colleague responds that they can make use of Microsoft Excel to do the job. The question is: will Excel be able to cater to their business needs? The colleague responds that they can use Excel in several ways, and that it is also a cost-efficient option. In that case, says the professional who posed the question, let's go ahead with Excel and train our employees in it. The suggestion is welcomed, as it would make the job easier for them, so they decide on using Excel and on taking a training right away to start learning it. Now, before we move to Excel, one question is: why should we use Excel? Let's look at some of the points. Excel proves to be a great platform to perform various mathematical calculations on large data sets, which is one of the biggest requirements of various organizations these days. Various features in Excel, like searching, sorting, and filtering, make it easier for you to play with the data, and Excel also allows you to beautify your data and present it in the form of charts, tables, and data bars. When it comes to reporting: reporting, accounting, and analysis can all be performed with the help of Excel, and it can help you with your task lists, your calendars, and goal-planning worksheets. Excel also provides good security for your data: Excel files have the feature of password protection, so this way your information can be kept
safe. Now, when we talk about what Excel is and how it can be used: Excel, or what you might have heard called a spreadsheet, can be used for a lot of different tasks beyond just storing information in a so-called tabular format. Microsoft Excel is an application used for recording, analyzing, and visualizing data, in the form of a spreadsheet. Let's have a look at a few of the functions and formulas used in Excel, and before we do that, let's quickly take a small tour to understand how to work with Excel. To do that, type "Excel" into your search bar and select the Excel app which is installed. Here you see you have a lot of options: "Take a tour," a drop-down list example, "Get started with formulas," making your first PivotTable, going forward with pie charts, and much more. We can click on the one that says "Take a tour," and that pops up a window which says welcome to Excel: if you have always wanted to be better at Excel, this can help you. Let's click on Create, and that takes us to the tour workbook, which opens with instructions for screen readers and talks about ten different steps in which we can learn Excel and use the spreadsheet app. There are more than 11 sheets, which we see at the bottom, and each one gives us a simple example to work on. For example, if I click on ADD, it takes me to a page which shows how we add numbers. Now, you might be provided data, which you can bring in by loading a file from your machine, getting data from a web source, or even connecting to a database; there are various options, which we will see in some time. Here we have an option called Data: you can click on this one, and it has options where you can use existing connections if you have created some; you can also click on "From Other Sources" and get your data from SQL Server, from Analysis Services, from an OData data feed, from XML, from the Data Connection Wizard, or from Microsoft Query. You can run different queries, which show up under the option that says New Query, and there is a Connections option which will display all the connections for this particular workbook; we do not have any as of now, but we can create them. But let's look at simple examples now. You can follow the
instructions here, which are basically about adding up numbers, and that can easily be done by placing your cursor in the target cell. You can either type in the formula, specifying from which row to which row you want to add the data: for example, I can start typing a sum, and Excel shows all the different functions that are available. Then we open up a parenthesis, and I can say I am interested in totaling the amounts from column D, so I select D4, and then say D4 onwards till D7; that's the data I'm interested in. Close the parenthesis and hit Enter, and that gives you the total, so the formula is =SUM(D4:D7). There is also a shortcut for this: first delete the formula, place your cursor in the cell, and just use Alt and equals; that automatically selects the numeric cells, and the selection can be expanded or collapsed at any time. I'll select this, which says the function needs its numbers, number1 and number2, and then hit Enter, and that gives you the total. Similarly, we can get totals by selecting all the fields, and here the sheet also notes that you can use the shortcut. What we can also do is add only the numbers over 50, by selecting the yellow cell and giving a condition: I can use SUMIF, open a parenthesis, select the range by dragging, which tells me D11 to D15, then put in a comma and give the condition, in quotes, that we are interested in numbers only above 50; close the quotes and the parenthesis, and that's your formula, =SUMIF(D11:D15,">50"). Doing this gives me the total, which is 100. Similarly, we could do that for the amounts here. There is also another option: I can click on Home and go for AutoSum; that's one more way of doing it, which, as the tooltip says, is also Alt plus equals, and it automatically adds up your values. I can try an AutoSum, which automatically selects my rows, and then I get my total. Now, as per this activity, it says to try adding another SUMIF formula, but adding amounts that are less than 100, and the result should be 160. We can select all the numbers which are less than 100, the way we did earlier. There is always a shortcut: if you want to avoid typing in the formula, you can copy it from the earlier cell, hit Enter, go to the target cell, and paste it. Then, as per the requirement, we need to select anything which is less than 100, so I adjust the range to column G, changing the end value to G15, and that's one more way; now we see our selected rows have changed. I hit Enter and check the result. Since we are looking for numbers less than 100, I also have to change the condition to a less-than sign ("<100"), and that gives me the total, which is 160. So that's how you can simply add numbers: you can use AutoSum, you can type in the formula and select the fields, or you can just place your cursor where you want the sum and do an Alt-equals, which populates the sum.
Now let's look at some easy options for filling your cells, or automatically populating the values in your cells within your Excel sheet. Here we have a cell showing 100; clicking on it shows that it is a sum of C4 to D4. If I click on this one, I can check that this is row number four, and I know this is column C, and I also have D, so this formula is giving me the sum of C4 to D4. What we can do is place our cursor at the bottom-right corner of the cell and just drag down, and this gives me the total for all the different rows. This is one shortcut to get the totals: Excel automatically fills in the formulas, which we call filling down. In the same way, if we want the totals here, we can first check what this 200 is, and it tells me it is the total of C11 to C14, so it is totaling column C from the 11th row till the 14th row. Similarly to the above, we can do a filling right, which means bringing your cursor to the corner and dragging it all the way to where you need the totals, and that fills them in. There is one more quick way to check that this is right: select the cell, then highlight and select all of the fields, and once they are selected press Ctrl+R, and that gives you the totals. If we were doing this top-down, I could select all the rows for a particular column and do a Ctrl+D: that's your filling down, and the other one was filling right. So this is an easier option for doing a fill when you want the formula applied to every row as it occurs in the first row or the last row. We can test this by, for example, selecting these fields and deleting them; I have a cell here which says 130, and I can place my cursor on it and drag all the way up, and that does the same magic we saw from the top down. So this is a simple way to fill up your cells and automatically propagate, or move, your computation to all the cells.
Let's look at the split option, which helps us split data when there is some kind of pattern, or some kind of delimiter, in the data in one particular column and we want to derive the values out of it. The easiest option: for example, we have our email column, which has email IDs, and we can clearly see a first-name-dot-last-name pattern. I see that a last name, Smith, is filled in here, but the first name is empty, so what I can do is type in, say, "Nancy" here; that's the first name. I then start typing the second name, and as soon as you do that, you will see a faded list of suggestions; that's your clue to hit Enter, and once you do, all the first names are filled in. If you want to maintain the case-sensitivity, you can just delete these and type it exactly as it occurs: say "Nancy" as the first name, go down to the next cell and start typing "Andy," and there is your grayed-out list, so just hit Enter, and that fills up your first names. What we can also do is select the field and either type Ctrl+E, which fills in all the options (I can undo that by clicking or typing Ctrl+Z), or select a particular field, go to the Home option, and under Home there is an option that says Fill; select it, then choose Flash Fill, which is what we are doing here. Click on Flash Fill, and that automatically fills in the values. In this way you can work within your spreadsheet and fill in the values where a delimiter is understood by default, and the data can be split.
However, sometimes you might have data with a different kind of delimiter, and there is again a smarter way of splitting your data. You can scroll down to where it says "splitting a column based on a delimiter": we have some values in the data column, and the values in each row are separated by commas. Select this; your data is already selected, choose Text to Columns, pick Delimited with Comma selected, and click Next. It then asks for the destination: select a cell, and it shows the data preview; I can then click Finish and say OK, and now you see our data has been placed in the columns appropriately. So this is how you can split your data based on a delimiter and then organize your data in a better way. There are some advanced options which we can learn later, but the sheet also shows splitting using a formula: if you have a name in one cell and want to split it into a first name, with a helper column, a middle name, and a last name, that can also be done using formulas. The example shows how you would extract characters from the cell on the left and place them in the cell on the right; you can try this activity, which is a little more advanced. The benefit is that if you do this kind of transformation using formulas and your original data gets updated, then the split data will also get updated; that's the advantage of using formulas to place values from one cell into multiple cells, based on the evaluation
of the formulas. How about using the transpose option? You might have heard of situations where you want to switch or turn your rows into columns and your columns into rows, and that's where transposing comes into the picture. It might be useful when you have your data in rows and columns, as on an x- and y-axis, and you want the rows to become the columns and the columns to become the rows. The simplest way: select all your values; here we basically have six columns and two rows. Select all of these, then select an empty cell, for example the one highlighted here. You can always do a Ctrl+Alt+V, that's a shortcut; or, once you have selected all your fields, just copy them with Ctrl+C, click on an empty cell, and do a special paste, or Paste Special: under Home you have the Paste option, go to Paste Special, select the Transpose option, and click OK. Now you will see that the columns and the rows have been transposed: your row name "item" has become the column heading, your row name "amount" has become the column heading, and all your values have been transposed into this format. There is another way of doing this, and again that's using formulas. You can transpose with a formula as well, and that works on the same kind of data: this range has six columns and basically two rows, so we look at the cell references, which tell me it starts at C33 and C34 and ends in column H, at H34. We know that we have six columns and two rows, so transposing would give me two columns and six rows; we select two columns and six rows in our Excel sheet, then type the formula referencing the range, =TRANSPOSE(C33:H34), and do a Ctrl+Shift+Enter. Now you see all the values have been populated. You can place your cursor in any one of the cells, and if you look at the formula bar, the formula remains the same: this is because it is an array formula. We can read more about array formulas here; an array formula performs calculations on more than one cell in an array, and in this example the array is the original data, C33 to H34. TRANSPOSE is just changing the horizontal orientation to the vertical orientation, so this is a very simple way to use Excel's capability to transpose.
on additions subtractions filling up your data sorting the data or basically splitting your data transposing your
data one of the other requirements is sorting and filtering your data now that can be very handy when you're working on
huge data and you would want to sort it in a particular order say ascending or descending or might be based on a
particular field or if that field was or if the cell was highlighted with a particular color sorting the data so
let's look at how Excel can be used for sorting and filtering examples are pretty simple here so let's check that
so if we're going to sort and filter and say this is the data I have say for example I would want to sort the values
in the department column alphabetically so what I can do is I can select Department column and I'm already in the
Home tab I can straight away go here which says sort and filter I can then say sort A to Z and that's basically
alphabetically sorting your department column and once once I do this you would see the data has been sorted but it's
not just this data we can just do a control Zed and check what are the values we have so here we have meat
which is beef and 90,000 110,000 the values then you have Bakery which should ideally be the first row if we sort it
in an alphabetical order which goes with Bakery as deserts you have the values so we can check this again so select
department and then just do a sort and filter and let's say sort a to zed and if you see the data has changed but it's
not just in changing your First Column but then it has taken care of all the data however the data has been sorted
based on the department column so you have Bakery which aligns with deserts which has the values and now we have all
What we can also do is sort December's amounts from largest to smallest. I can click any cell in the December column, say the one with 20,000, go into Sort & Filter, and choose Sort Largest to Smallest. Bakery Breads is the row with the smallest value, or perhaps Deli Sandwiches, which also looks small. If I do the largest-to-smallest sort, you can see the rows have shifted: the order is no longer based on the Department column, because the data is now sorted by the values in the December column, and Bakery, which was alphabetically first, has become second to last. So you can either sort the data based on the Department column, whose values are all strings or words and therefore sort alphabetically, or, if you have numbers, sort on those values. You can also do a Custom Sort at any time: choose Custom Sort, pick the column you want to use for sorting, choose what the sort should be based on (cell values, cell color, font color, or conditional formatting icon), and then choose the order.
That's one more way to do it. If you scroll down, the sheet also shows how you can sort by date or by color. For example, to sort based on the expense date there are different options: I can select the date field, right-click, go into Sort, and choose Sort Oldest to Newest. Since I selected the date field, it sorts all the data, taking the expense date into consideration. There are also the filter buttons you see on the column headings; we could have used those instead. Selecting one shows which dates I'd be interested in looking at, and it also offers Sort by Color as well as Sort Oldest to Newest or Newest to Oldest. Now suppose the data has colored cells. If I want to bring the colored cells to the top, I can select one, right-click, go into Sort, and choose Put Selected Cell Color On Top, which makes sure my data is sorted with those cells first, here in descending order. In this way you can sort, or for that matter filter, your data.
What we can also do is add filters. Sometimes we can use formulas for this, but we can also simply use the filter that has been applied here. How does the filter get there? I can select a particular row, typically the header row, and decide to add a filter to it; that's how the filter buttons appear. Once we have the filter, we can click the drop-down, where you have options like Number Filters. We can choose one of these, for example Above Average; selecting it shows me only the matching values. We can also remove it by clicking the filter and saying we're not interested in it anymore: clear the filter and all the values come back. Or I can click some other field, for example Food, go into Number Filters, and choose Below Average, Above Average, or perhaps Greater Than, and then supply the value. For example, if I'm interested in food values greater than 25, I can give that value, say OK, and the filter is applied. Similarly, you can select it again and clear your filter, and your data is back. Remember, no data is lost; based on the filter it is just hidden, not shown, which is good enough for us. In this way you can sort and filter the data. For more details, all these sheets have links pointing to more information on the web, and you can always refer to those. So this is a simple way to sort and filter any amount of data stored within a particular sheet in Excel.
Now that we have learned about fills, split, transpose, sorting, and filtering, it is also good to learn how to work with tables, that is, converting your data into a tabular format and then doing some easy computations. Click on the Tables option here. We see some data in five columns and N number of rows. I can select this data, go to Insert, choose the Table option, and when it says "My table has headers," accept that and say OK. Now the table has been created: it has the filter drop-downs we learned to use earlier, and the table itself is a collection of cells with some special features. We can easily add rows and columns to the table and even do some calculations. For example, I can click in the row just below the table, enter some value, and hit Enter, and a new row is inserted, where we can easily fill in values, say 25,000 or maybe 35,000. You can continue adding rows in this way. If you want more columns, grab the handle in the bottom-right corner of the table and drag it to the right, and that automatically creates columns for the next months, where I can feed in the data. So this is a simple way to keep adding rows and columns to your data once it has been converted into a tabular format. Now let me do a Ctrl+Z, which removes the added columns, another Ctrl+Z to delete the last row we added, and I can keep pressing Ctrl+Z to remove the values and rows until we're back where we started. So this is how I converted my data into a table so that I can easily work on it.
What I can also do is some calculations. We have a table here with a Total field, and we can select one cell in it. As we learned earlier, we can press Alt and equals (AutoSum), and it tells us what it is doing: calculating the sum of the last three months. If that's what you want to do, just hit Enter, and the formula gets filled in. I can select any particular cell and look at the formula bar: it has already given me a formula that calculates the sum from the October column through the December column and produced the calculated values. We can also get a total row in the table, which is an even simpler option: select any cell in this particular table, and the Table Tools Design tab shows up. Select it, tick the Total Row option, and it automatically populates the total here. If you want the average rather than the total, select the total cell and pick Average from the drop-down, and that gives you the average of the values. So we can always do simple computations by converting our data into table format.
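As a sketch of the kind of formula AutoSum writes inside a table, assuming the table is named Table1 and the monthly columns are named Oct, Nov, and Dec (hypothetical names for illustration), the Total column would hold a structured reference such as:

```
=SUM(Table1[@[Oct]:[Dec]])
```

The @ means "this row," so the same formula fills the entire Total column automatically as rows are added.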
Let's learn about one more efficient way of working with data, and that's using drop-downs. Let's see how drop-downs work. Say you have data with values in the Food column while Department is empty, and you want anyone entering values into Department to select only from Produce, Meat, and Bakery; these should be the only three options available to any user filling in the values. How do we do that? First we can convert the data into a table: under Department I select one of the cells and press Ctrl+T, which converts this into a table, and say OK; my table is created. Once that part is done, we select all the blank fields where we want the drop-down to be applicable. Then, under the Data tab, go in and select Data Validation; by default it says Allow: Any value. Here I will say I want to allow a List of values, and then I can type in my values: Produce, then for example Meat, and then Bakery. These are the values, so click OK. Once we have done that, we have a drop-down, for example next to Apples, which will only show the values we are allowed to feed into the Department column. I can go into every cell and choose from the drop-down which department handles that item. So this is an easy way of creating your drop-down and then filling in the values from the set of values you defined. This was a simple example of using drop-downs, and together with tables, sort and filter, transpose, split, and filling in data, it shows how you can use Excel for more than one use case, relying on its built-in features to easily work with your data.
Let's see how we can import data into Excel from your local machine or from an external web source. Say you have been provided a text file or a CSV file and you want to import that data into your Excel sheet; that can be easily done. Right now I've opened a blank Excel sheet. I can click on Data, where I have an option called Existing Connections, as well as From Other Sources, which is one way to connect to your different data sources and get the data from one of them. What we can also do is click on Connections. It says there are none, so I click Add, and it offers to show connection files on the network or connection files on this computer. Let's get some files from this computer. If that does not show anything, say Browse for More, which shows you different options. Let's select a folder where I have some data sets, click in, and pick this particular file; I know it is a CSV file, so let's click Open. If you want to verify this, you could go and look at the properties of the file, which say it is a CSV file, which is what we are interested in. So I take this file and say Open, and this brings up the Text Import Wizard, which asks whether the file is delimited; I say yes and click Next. I select comma as my delimiter and set the text qualifier to none, and the data preview already shows me what the data in the CSV file looks like. Click Next, and you then have options for the data format: General, Date, or Advanced options. I'll just say Finish, and the connection has been created, so I can click Close. Once you have done that, you can click Existing Connections and it shows the connection we have created. Choose to put the data in a worksheet; you can also say Add this data to the Data Model if you're doing some data modeling. Click OK, and now the data has been imported into this particular sheet.
What we can also do is start a new sheet that has no data and get some other data from the web. I can go into my GitHub and say I'm interested in this CSV file: I select it, and this is my GitHub path, a path on the web. I click on Raw, which gives me the raw path where this particular file lives, and I copy that path. Back in the Excel sheet, we want to get the data from the web (we could also use From Text, where we would have to specify the delimiters, but let's go to From Web). Here I give the web path from which I want the file, the publicly available GitHub path, and click Import. Note that in this file the values are within double quotes, separated by commas; we'll handle that in a moment, so first let's click Import. Once we do this, it fetches the data from the web and asks where to put it; it says existing worksheet, and since we had already created a new worksheet, let's click OK. Now the data comes in, but it all lands in one particular field per row. So we can do a Ctrl+A, which selects all my columns, and that's my data; then we can split it out. Choose Text to Columns: it's a delimited file, click Next, select comma, and let the text qualifier be quotes; it shows the data preview, click Next, keep the General format, confirm the destination column, and click Finish. Now your data has been split, and you have the data imported from the web alongside the data that came in from my local machine.
Similarly, we can even create a connection to an existing database. I can click on Connections, which shows where the selected connections are used, click Add, and choose whether to get the files from the network or from this computer, as we did earlier. I can click Browse for More, which shows different options for creating connections: say you want to create a new SQL Server connection, or connect to a new data source coming from a different place; you can choose what kind of connection you want. These are all the different options we can go for, and we can connect to a database; for example, with an Access database I can check whether there are files for that particular database and import them. Similarly, we can also click on New Query, which gives you options for getting data from your files and folders, or from databases: you can import data from a MySQL database, provided it is set up on your local machine or on a particular server, from the cloud, from online services, or from other sources, which include From Web, from your Hadoop file system, from Active Directory, or from a blank query. You can even combine queries, where you run the Power Query Editor, get the data from different sources, and then bring it into Excel. In this way you can get data from different sources into your Excel spreadsheet and continue working on those data sets.
We have already learned some basic operations in Excel, so let's implement our knowledge by working on this particular data, which comes from a housing data set. Here we see fields such as Agent, Date Listed, Area, and List Price, and this data has been sorted in newest-to-oldest order of Date Listed. How do we arrive at this? I can click on Date Listed and either sort from here or select Sort, get into Custom Sort, and choose the column based on which I want to sort the data. I want the newest data first and the oldest last, which means a descending order of dates; put differently, the oldest date or earliest month will be towards the bottom of your sheet. So we select Date Listed, let it sort based on cell values, and set the order we want, Newest to Oldest. I say OK, and the dates have been sorted: we have 10/18/2007 on top, which seems to be the latest date, and as we go down we see earlier and earlier dates in the Date Listed column. So we have sorted our data into newest-to-oldest order based on the date column, and the result shows up here.
Now we can pose different questions to answer. For example, I want to sort the data in ascending order of Area and, within that, descending order of Agent name. How do we do that? I already have the result here; how did I get it? When you do this with two successive single-column sorts, the order matters: sort by the secondary key first and the primary key last, because the most recent sort wins. So in this Excel sheet I first select the Agent name column, which we want in descending order; we could either use Sort and choose the descending option, Sort Z to A, or use the filter button at the top, and the data gets arranged with the Agent column in descending order. Then I go into Area and sort again, this time ascending, and that changes the order not just of this particular column but of my complete data. If we do that, the data has been sorted: you can check how many values we have and the Area values we see, and this is how you get your result. I'll do a Ctrl+Z a couple of times, and I'm back to my original data, alongside the sorted result we were looking at.
Similarly, we can answer other questions, for example: sort the data according to a particular order of Area values that we choose, namely South County, then Central, then North County. If I look into sheet three, I have data containing some South County rows, then Central, and then North County. We want to sort the data to solve our problem, which is ordering the Area column as South County, Central, North County. With the Area field selected, I can go for Custom Sort and choose the column I'm interested in: let's go for Area, sorting on cell values (you can also explore the conditional formatting icon option if that's what you want to use). For the order, instead of A to Z, I say Custom List and type in the new list: South County, Central, North County. We say Add, since that's the order we want, then say OK, and OK again. Now we can compare the result with the values we expect: it starts with Kelly, and in the 12th row you have something like Lang, which is what we were after. So we can arrange the data in a particular order by choosing a custom list and then sorting. That's one more simple task we have done, sorting the data so that under our Area column we first get South County, then the Central data, and then North County. This is how you can do it.
Now let's look into one more problem: find all the houses in the Central area, applying a regular filter. Let's see how we do that. We click on this sheet, and here we have the data; the problem statement is to find all the houses in the Central area. We could do a sort, but we want to use the filter you see implemented here. How do you do it? Select the header row and say you want to apply a filter; I can just go in and get a filter on my first row, and now the filters are applied. We are interested in the Central-area houses, so let's open the Area drop-down: it says all these values are selected, meaning it shows everything where Area has any of these values. Unselect all, tick only Central, and say OK. Now the Area filter has been applied and we are looking only at the Central rows. So we have applied a simple filter and are looking at our data. At any point of time, if you do not want the filter, I can select it and say I'm interested in all the data again, or you can use Clear Filter from Area, and you get your data back. So that's one way you can filter your data.
Let's look at an example of sort and filter where we have to filter the data based on two or more columns with different kinds of values, combining AND and OR conditions. Say this is the data I have, and this is the question we need to answer: find the list of all houses in the Central region with a pool, and in South County without a pool. If it were a simple filter based on one column, I could have selected my header row, applied the filter, and then, in Area, where I have three regions, kept only Central and South County and gotten rid of North County. That's fine, but we have two different conditions here: the Central rows must have the Pool value true, and the South County rows must be without a pool. How do we do that? First we create a copy of the headers off to the side. Then we fill in the criteria rows: in one row the Area is Central and the Pool value is with pool; in the next row the Area is South County and the Pool value is without pool. Conditions on the same row are ANDed together and separate rows are ORed, so this is exactly our criteria. To decide where the result should go, place your cursor and check the address: this is column M, 8th row, and that's where we want the result. Now click on Data, and under Filter you have an option called Advanced. Here I could filter the list in place, but that's not what I want to do, so I'll say Copy to another location. The list range tells me this is the data, A1 to J126, so columns A to J and all the rows; the criteria range is based on what I've entered, M1 to V3; and for Copy To I'm saying M8 to V8, which will hold my filtered result. Just say OK, and now I have my filtered data: South County without a pool, Central with a pool, again South County without a pool, and the Central rows all showing a pool. This is an advanced filter, where we have simply filtered the data based on conditions in two columns and gotten our result. In this way you can have your customized filter applied across different columns, and the result can either replace the existing content or sit in the same sheet in a different set of columns and rows.
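As a sketch of the criteria range just described (assuming the sheet's Pool column holds TRUE/FALSE values; match the header text to your actual headers), same-row entries are ANDed and separate rows are ORed:

```
Area            Pool
Central         TRUE
South County    FALSE
```

This reads as (Area = Central AND Pool = TRUE) OR (Area = South County AND Pool = FALSE), which is exactly the question being asked.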
Let's look at one more example of filtering, where you filter the data based on an AND condition met in multiple columns and then want the output for only specific columns. The situation: find the agents with a house in the North County area having two bedrooms and of single family type, and the requirement is to populate only these columns: Agent, Area, Bedrooms, and Type. As I explained earlier, you can get your result in the same sheet at a different location, so here I have created these output headers: Agent, Area, Bedrooms, and Type. Above them is a copy of all the column headers we have (Agent, Date Listed, Area, List Price, Bedrooms, Bath, Square Feet, and so on); this is where we will give our advanced criteria to filter the data. The conditions that need to be met: we need to look at North County, so under Area I go ahead and enter North County; the criterion is having two bedrooms only, so under Bedrooms I give the value 2; and we want a single family home, so under Type I give single family. This is my AND condition, since all three values sit on the same criteria row: North County area, two bedrooms, and type single family. If I select the criteria, it spans M1 to V2; this is what we have, and we want to filter based on it. So let's go to Data, Filter, Advanced, and here it says filter the list in place; that's not what we want to do, so I'll say Copy to another location. The list range is the data with its columns and rows selected; the criteria range is M1 to V2, which we have given here; and for Copy To I say M7 to P7, the place where I want the result, matching the four output headers. Say OK, and now I get the data answering the question that was asked: the agents with a house in the North County area having two bedrooms and single family type. In this way you can do advanced filtering and get the result stored anywhere in the sheet at a different location. I could also have done the filtering in place, which would have replaced the data we have, but that's not what we want; we want the filtered result in a different place. So this is how you can do some advanced filtering.
We can also use Excel to filter the data in one particular column conditionally, using number filters. Say the problem statement is to display all the houses whose list price is between 45,000 and 600,000, or say we want to filter on something else, for example between 300,000 and 400,000. There are two easy ways to do it. One: I look at List Price, select it, and apply a filter; in the List Price drop-down, where we want to do the filtering, it's pretty easy: click it, go into Number Filters, and choose Between. I say I'm looking for values greater than or equal to 300,000 and less than or equal to 400,000, and once I do this the filter is applied and my data is filtered based on my criteria. That's one easy way of doing it. Let me do a Ctrl+Z, leaving the filter buttons in place, which you can anyway use later. The other way, as we have seen in earlier methods, is to set up a copy of the column headers as a criteria range. One thing to note about laying out criteria: for an AND condition on the same column, you add that column's header a second time so the two conditions can sit side by side; if the conditions are on different columns, you keep the same set of columns. Either way, AND conditions lie on the same row and OR conditions lie on different rows. Here I give the values: I'm looking for the list price to be between 300,000 and 400,000, so under the first List Price header I say greater than or equal to 300,000, and under the second, less than or equal to 400,000. That's my criteria, and now I need my result. Go into Data, then Advanced, and say Copy to another location; it selects the relevant ranges, so let's say OK, and now your data has been filtered out to a different location in the sheet based on your AND condition. So you can filter the data this way, or just apply a filter on a column and give the conditions there.
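A sketch of the duplicated-header layout for the same-column AND condition described above; the comparison operators are typed directly into the criteria cells:

```
List Price    List Price
>=300000      <=400000
```

Because both conditions sit on the same row, a row must satisfy both of them to appear in the filtered result.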
Now let's solve one more interesting problem, where we want Excel to combine an AND and an OR condition. Say this is the data given to you, and the question is to find all the houses in North County (the question text has a spelling mistake there) where the bedrooms carry an OR condition, either three or four, and the list price is greater than 300,000. I could obviously have selected the columns and gone for a regular filter: I do the filtering here, and since I'm looking for the list price to be greater than 300,000, I use a number filter, Greater Than, and give 300,000; that's the filter we want. Then I select Bedrooms, which should be just three or four: I open the drop-down, unselect all, and tick three and four. So I am getting the data where the list price is greater than 300,000 and the bedroom values are either three or four. That's one way of doing it; let's do a Ctrl+Z to get back to where we were, or you can just say Clear Filter so your data is back as it was. The other way is to give the criteria in a criteria range. For List Price, this is the condition I want: on the first row I say greater than 300,000 with Bedrooms three, and on the second row I again say greater than 300,000 with Bedrooms four. This gives me the situation where the list price has to be greater than 300,000 and the bedrooms should be either three or four, because each row repeats the price condition and the rows are ORed together. Having given our filtering criteria, to get the result we go into Data, then Advanced, and say Copy to another location: the list range is selected, columns A to J, rows 1 to 126; the criteria range is given in M1 to V3, where we specified the conditions; and we say the result should go in M7 to V7. If I do this, I get the same data we were seeing earlier, where the bedroom values are three or four and the list price is greater than 300,000. So this is a simple way to create your filters, and all this advanced filtering will be saved with your sheet; you can always go back and change a criteria value. With plain filters, by contrast, someone has to open the filter drop-down to see which values were selected, whereas the advanced-filter criteria stay visible on the sheet.
Now that we have looked at some operations we can perform in Excel, such as filtering, sorting the data, and creating tables, let's also quickly look at functions and formulas, which can be used for easy calculations and computations. Excel ships with many inbuilt functions that help in different kinds of data analysis, and we can always search for a particular function. For example, if I type "is" into the function search, it shows me all the possible matching functions, and I can look at the detail of each: ISEVEN returns TRUE if the number is even, and if I search for ISLOGICAL, it tells me that the function checks whether a value is a logical value and returns TRUE or FALSE. We can likewise search for SUBTOTAL, and the description tells me what the function can be used for: it returns a subtotal in a list or database. There are many other such useful functions, for example INT, SUM, and AVERAGE, and you may be interested in truncating some data, getting the absolute value, getting the square root, getting a count, or getting a max value; you can look up any particular function within your Excel sheet. You also have functions such as NOW and TIME. If I search for NOW, the description says it returns the current date and time, formatted as a date and time, and if I just enter the function it tells me the current time. Let's look at the description of TIME: it converts hours, minutes, and seconds, given as numbers, into an Excel serial number formatted with a time format. For example, if I give 2 hours, 30 minutes, and 30 seconds, it converts that into your time format. So you can always use the different inbuilt functions for your work. We will also look at some more advanced functions like SUMIF and SUMIFS, and you have COUNTIF and COUNTIFS; with these and the other functionalities of Excel you can easily do calculations and computations while working with your data and your different cell values.
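A few minimal examples of the built-ins just mentioned (the argument values are illustrative):

```
=ISEVEN(42)        returns TRUE
=NOW()             returns the current date and time
=TIME(2,30,30)     returns 2:30:30 AM as a time-formatted serial number
```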
Let's look at some examples of using functions like SUM and SUMIF. For that, let's go to this sheet, which has some data and where I have already applied a filter. The question: find the total units that were sold in the East region. We know the Region column contains East among multiple regions, so I could unselect all, select only East, and say OK, which gives me just the units that were sold there. If I place my cursor below and do an AutoSum, it shows me the function being used, something like SUBTOTAL working on rows E2 to E44, and doing this gives me the total. That is fine and you could do it that way, but it would be good to know how to use a function like SUMIF instead: with the SUBTOTAL approach I have filtered the region and I'm getting a total, but the formula itself does not clearly show how the sum was calculated from all the values listed. So let's do a Ctrl+Z and get the data back; we still want the total units sold in the East region. I can start typing my formula using an inbuilt function: I'm interested in SUMIF, which the help describes as adding the cells specified by a given condition or criteria (SUMIFS is the variant where you can give a set of conditions, multiple criteria). If I just enter it without proper arguments, it obviously gives me an error, because the formula is not right yet, so let's build it up. When I type SUMIF it shows me there is a function by that name, and once I open the bracket it tells me the first argument is the range of data you are interested in. We are interested in the region, so I select the Region column, skipping the header value, from B2 all the way to the end; it has now selected B2:B44. Next we give the criteria, which is either a literal value or a pointer to a cell containing that value; as per our problem statement we are looking for the units sold in the East region, so I point to a cell containing East. Finally we are interested in summing the units, which is the E column, so instead of selecting I can just type E2:E44, which covers all the values. Completing the formula gives me the sum, 691. Here the criteria pointed to a cell and picked up whatever value that cell has; I could instead have typed East as the value, and it still does the same thing. This way you have more clarity about what SUMIF is doing: you give the rows, you give the criteria, and you give the range over which you want to sum the values.
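Putting the pieces together, a sketch of the finished formula, assuming Region in B2:B44 and Units in E2:E44 as in the walkthrough:

```
=SUMIF(B2:B44, "East", E2:E44)
```

The "East" literal could equally be a reference to a cell containing the word East.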
Similarly, if the question were what total revenue was generated from Binder, we want to find the Binder rows and total the revenue. We have the Revenue Generated field here, and there is no region to filter; we are just looking for Binder. So let's start doing the same thing: we go for SUMIF and open the bracket. It needs the range to evaluate, which this time is the Item column, since Binder is the filtering criterion; then we give the criteria, Binder; and then the range over which the sum should run, the Revenue Generated column, down to row 44, which selects the column, and you get your sum. That tells me the total revenue generated from Binder. I could be doing this for other items too: I can just drag the formula to another cell and then come in and change the criteria to Pencil, if that's what I'm interested in; remember to adjust the ranges so that you still take in all the values and the relevant rows. Then I know this is the revenue generated from Pencil and this is the revenue generated from Binder. This is a simple use case where we are using SUMIF.
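A sketch of those formulas, assuming Item in column D and Revenue Generated in column G (the column letters are taken from the later part of this walkthrough):

```
=SUMIF(D2:D44, "Binder", G2:G44)
=SUMIF(D2:D44, "Pencil", G2:G44)
```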
What if we want to use SUMIFS? Let's have a look at how we get there. SUMIFS is for when you want to do a calculation with multiple criteria, so let's work on this problem statement: what is the total revenue generated from the Central region where the item is a Pencil? That's what we want to check. When answering this question, we can follow the order in which things are asked. It says the total revenue generated from the Central region: we know there is a Revenue column and we want its total; the filtering criterion is the Central region; and within that, we are interested only in rows where the item is Pencil. How do I do it? I can use SUMIFS, where you can pass in multiple criteria. Let's start with SUMIFS and open the bracket: it asks for the sum range, then a criteria range, then one criterion, and you can give any number of criteria pairs. We are interested in the total revenue generated, which is my G column, so following the same order I say G2 as the first value and, knowing there are 44 rows, go to G44 (you can check that this has selected all the rows). That's my total revenue generated, so I'm setting this as the sum range. Then I need to give the criteria range: the Central region lives in column two, which is B, so let's say B2 to B44. Then you give your criterion, the region being Central: I could either point to a cell with that value or just give the exact value here, and you can even give a wildcard or matching pattern, which also works. That takes care of the region; we are also interested in the item being Pencil. For the item we know the column is D, so let's select D2 to D44, which takes all the rows in the D column, then a comma, then the value Pencil as the filtering criterion. Close the bracket, and that gives you the result. We just followed the order of the question: what is the total revenue generated, so we look at the Revenue column and select all its rows; from the Central region, so we select the Region column and give the filtering criterion Central, or point to a cell containing that value; and only the item Pencil, so we select the column with all the items and give the criterion Pencil. That's the easy way of giving your criteria: you select your rows, give your filtering criteria, and give your sum range; in SUMIFS we simply give multiple conditions.
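A sketch with the columns named in the walkthrough (Revenue in G, Region in B, Item in D):

```
=SUMIFS(G2:G44, B2:B44, "Central", D2:D44, "Pencil")
```

Note the argument order: unlike SUMIF, SUMIFS takes the sum range first, followed by criteria-range/criteria pairs.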
by sales representative Jones or Jones where the cost of each item was greater than four so how many units were sold by
sales representative so when we talk about how many units that's your e column so let's
start with that so let's say some ifs I would be interested in E column and let's give the range so it
says sum range so those are the number of units on which we would want to find the sum then it says you need to give
the criteria range so we say sales representative where the name is Jones so sales representative is in sales rep
column C so let's say C2 to c44 now then we need to give our filtering right IA so let's say
Jones is the sales representative where we are interested about whom we are interested in and then we the question
says where the cost of each item so cost of each item is what we are interested in you have unit cost so that's what we
are interested in so that would be F and then say F2 to f44 and then you need to give your cost
so it says where each item is greater than 4 so let's select this and let's do this so this tells me
units units that were sold and that units or that should not include the pencil item how do we start
doing this so let's start with some ifs now we know that you start with some ifs you need to give the sum range so we
are interested in the number of units so let's basically go in and select our number of units which were sold so
that's your column e so I can say E2 to e44 that's where I would want to perform the sum now I'm
saying how many units that were sold where we are talking about sales rep being Jones so let's see let's select
the columns C and then give the range after that we need to give our giving filtering criteria which
is Jones and then we are interested in the items but excluding pencil so items is in column D so let's say d to
d44 and then we have to give our criteria so we can say well that should exclude pencil so I can basically
formula which says that these are the number of units which the sales representative whose name is Jones had
sold and that does not include pencil as an item let's also look at an example of using count if or count ifs now both of
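Sketches of both formulas under the same layout (Units in E, Sales Rep in C, Unit Cost in F, Item in D); "<>" is Excel's not-equal operator:

```
=SUMIFS(E2:E44, C2:C44, "Jones", F2:F44, ">4")
=SUMIFS(E2:E44, C2:C44, "Jones", D2:D44, "<>Pencil")
```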
Let's also look at examples of using COUNTIF and COUNTIFS; both can be very useful when you want to count values that meet conditions. For example, let's try solving this problem with COUNTIF. Remember that you can answer these questions using filters, and that can be easier, but sometimes you may want the formulas so that your spreadsheet and your calculations stay more dynamic in nature, depending on the values in the columns and rows. The problem: find the total number of times Gil has made a sale. If I look at my data, every sales representative has some value in the sales column, and the sheet also asks which reps have sales greater than three: Jones, for example, has sales greater than three, as does Jardine, and so on. What we are interested in first is a quick count of the total number of times Gil has made a sale. We can use the COUNTIF function, which is described as counting the number of cells within a range that meet the given condition. What's our condition? Our condition is Gil, since we want to find out how many times the name Gil appears, that is, how many sales Gil has made. So I say COUNTIF, open the bracket, give the range of rep cells, and then give the condition, the name Gil, and close it; that tells me Gil appears five times. We can check this: I can go in, apply the filter, and it gives me five as well, so we can always cross-verify, and we can keep using formulas like this. Now what about the next question, which asks which sales representative made a sale more than three times? It might look a little confusing, so for example let me clear out this filter. We want to find out which sales representative made a sale more than three times, and I can check this for every sales representative: I type equals and start with COUNTIF, give the range, and for the criterion I point to the sales representative's name in that row; then I compare the resulting count with three. It gives me a Boolean value, TRUE, saying yes, this rep has made sales more than three times. Then I can just drag the formula down, which gives me the value for the other sales reps; you can see the criterion automatically changing to the value in each row's cell. For example, this one has a count of two, so it tells me FALSE. You can get the values for all the rows, and that tells me which sales representatives made a sale more than three times.
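Sketches of the two counts, assuming the rep names sit in C2:C44; the absolute row anchors keep the range fixed when the second formula is dragged down a helper column:

```
=COUNTIF(C2:C44, "Gil")
=COUNTIF(C$2:C$44, C2) > 3
```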
Like SUMIFS, we also have COUNTIFS, where you can give multiple criteria. For example, the question is: how many orders were placed from the East region after this particular date? We have a date criterion as well as the region criterion, so we need COUNTIFS, which lets you keep adding criteria. It says how many orders were placed from the East region after a particular date, and the date is in my first column, so for the first criteria range I could start with A2 and go to A44, which should select all the rows. Once I've given the criteria range, I give the criterion: the date has to be greater than 10th Feb, so let's give it as greater than 2/10/2019. Then you give the second criteria range: we are looking at the orders placed from the East region, so we select the Region column and give East as the criterion. Once you have done this and completed the formula to get the total number of orders, it tells me 13. Is that right? We are looking at your dates in A2:A44, where I have given the criterion that it should be strictly greater than 10th Feb, because I do not want to count 10th Feb itself (the question says after 10th February), and then you're saying the region has to be East; with region East, that gives me the result. Similarly, you can also find out how many times Gil sold pencils: here we give the rep range with the criterion Gil and the item range with the criterion Pencil, and it tells me it's twice that Gil has sold pencils. We can obviously check this by going in, choosing my filter, searching for the rep being just Gil, saying OK, and then keeping only the item Pencil; that also tells me twice. So you can verify with filters, but calculating with functions and formulas is always good, as it makes your computation and calculation more dynamic.
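Sketches of the two counts, assuming Date in A, Region in B, Rep in C, and Item in D; the date literal follows the month/day format used in the walkthrough, and ">"&DATE(2019,2,10) would be a locale-safe alternative:

```
=COUNTIFS(A2:A44, ">2/10/2019", B2:B44, "East")
=COUNTIFS(C2:C44, "Gil", D2:D44, "Pencil")
```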
Let's look at one more interesting feature of Excel: conditional formatting. As you see on the screen, conditional formatting has different rules that can be applied to your data, allowing you to differentiate or easily identify data values based on certain criteria or rules. When you talk about conditional formatting, you have different options: you can use Highlight Cell Rules, get top and bottom values, apply different rules, apply different color scales, and easily manage these rules. Conditional formatting is very useful for people who work with huge amounts of data and want formatting applied easily: you can format cells based on a preset condition, for example to identify sales values, as shown on the left side. How do we work with conditional formatting? Let's have a quick look. Say we have our Excel sheet, and here I am highlighting the salespeople who have generated revenue greater than 10,000: the cells where the revenue generated by a particular salesperson is greater than 10,000 carry a particular color. How do we get there? Let's select the data, and what I can do is go into Conditional Formatting. I can use Highlight Cell Rules and just say Greater Than, which is the easier way (I could also go ahead and create a new rule, but I'll use this option). I say Greater Than and give some value, maybe 12,000, and it asks what color you want to apply; for example, yellow fill with dark yellow text. Let's say OK. So right now the sheet already had all the values highlighted where the revenue generated was greater than 10,000, and I have additionally highlighted all the salespeople who generated revenue greater than 12,000; I can just do a Ctrl+Z to see the previous result. The new rule basically highlighted the values greater than 12,000. So this is one simple example.
Now we can look at some other examples. Say you want to format cells using a three-color scale: if you look at the values here, I have a three-color scale, mainly in green, yellow, and red. How do you do this? I can go in and open Conditional Formatting, then go for Color Scales, and here you can create different rules: we can set up a two-color scale, we can say format only values that are above or below average, I can format only cells that contain something, or I can get the top and bottom values; these are different ways in which I can build a color-based scale. What I will do is select the data and show you the rules I have: I can go into Manage Rules, and you see there are certain rules that have been specified. What does that mean? If I look at my first rule, it tells me I'm choosing a three-color scale, where I can choose the lowest value, a percent or percentile midpoint style, and the highest value, and the cells get colored based on their values. One thing you should always carefully remember is that the rules are applied in the order shown. I can delete these rules I created, say Apply, and say OK; my data is back and does not have any highlighting. Now I can go in, say Conditional Formatting, and go for Color Scales or into New Rule: I want the cells to use a three-color scale, so let's choose that. When you say three-color scale, it asks what the color of the lowest value will be, and we could choose any of these; let's choose red. I can say the midpoint is percentile 50, and the highest value is green, and if that looks good, let's say OK. Now the lowest values have been highlighted in red, you have the mid values in between, and the highest values in green. So this is a three-color scale, and it easily helps me identify the data based on the cell values.
In conditional formatting, what you can also do is color the cells based on their relation to the average. What we are seeing here is that if the revenue generated is greater than average, it shows in green, and if the revenue generated is less than average, it shows in orange. How do we do that? We can again manage some rules: I can create a New Rule and select the option that says format only values that are above or below average, which is the option we want. I select this and it says format values that are above average; in our case that was green, so I'll say above average and then go for a particular format. Let's go and look into the formatting and choose a color; for example, let's choose yellow and say OK. So now I'm saying that wherever the cell values are above average, they will be yellow instead of green. Let's go back into Manage Rules, where we can see the rule we are applying, and add the companion rule: this time we'll go for below average, go to Format, choose red, and say OK. These are the rules we have created, but the Applies To range has not been set, so they have not taken effect yet. I select the Applies To box for one rule, choose my data area, and hit Enter; similarly I go in, select the area for the other rule, hit Enter, say Apply, and say OK. Now, I have admittedly chosen really bright colors, but I have said that wherever my revenue generated is above average it should be yellow, and below average it should be red; we had wanted above average in green and below average in orange, which is what the original sheet shows. So you can always color-code your cell values based on rules you set up.
Similarly, you can also find the top 10 and bottom 10 values, and that's pretty easy: you can just select the data, go into Conditional Formatting, choose Top and Bottom Rules with Top 10 Items or Bottom 10 Items, or go for More Rules, where you can say format only top or bottom ranked values. I pick a blue format and say OK, and now my top 10 values are blue. Similarly, I can add one more rule: New Rule, go for top or bottom, this time bottom, go to Format, choose orange, say OK, and OK again. That's it: you now have your top and bottom 10 values highlighted. You're using conditional formatting to highlight your cell values in different colors, and these easy built-in rules help us do that.
Similarly, you can also add data bars, which show at a glance how the values increase. Select your columns; you could apply this to all of them. Here I have applied it only to Jan and April, so now let me apply it to June as well. You can go for a gradient fill or a solid fill; you just select the color, and that takes care of it. Once it's applied, I might want to change the formatting, so I go into Manage Rules, which tells me what rules have been applied and in what order. I can do an Edit Rule, and it shows that this is a solid fill, the color is black, and there is no border, and I can change any of that. So you can use conditional formatting for various use cases, highlighting the values so that anyone who looks at the sheet automatically notices which values are higher and which are lower, or where revenue was being generated but did not grow beyond a particular value, and so on.
Similarly, you can go in for different options. Say we want to see whether the revenue was going up for a particular salesperson; here we are looking at Carol. In Jan the revenue she generated was very high, in Feb it was falling, in March it was more or less stable, and in April it went way below. We can grade the cell values to show this. Go into the highlight options and choose Color Scales or Icon Sets; that's where you choose among different shapes. For example, I'm interested in the directional indicators, so I can use the set of three arrows, or go with a color set. We can then go into Manage Rules, which tells us what rules have been applied; the latest one is the icon set I chose, applied to the selected columns. I can do an Edit Rule and see the settings: the format style is Icon Sets (not a data bar, not a color scale), here is the icon style I've chosen, and here you give the threshold values. The green icon shows when the value is greater than or equal to 67 percent, the sideways (minus) arrow when it is below 67 percent, and the red one when it is way below, under 33 percent. You can edit these thresholds and easily highlight your cell values based on the icon set, then click Apply. And that's how I use
conditional formatting. So conditional formatting can be very useful whether you want to use icon sets or data bars, highlight particular values, color-code cells based on rules you set up, or just find values based on some simple calculation. It is used extensively by data analysts, by people working in business intelligence teams, and by anyone who wants to use Excel to easily identify the data, identify the cells that contain a particular value, or find the less significant and more significant sales, and then pull out values and carry out computations, calculations, or analysis. And if you're an aspiring data analyst, try giving a shot to Simplilearn's postgraduate program in data analytics from Purdue University in collaboration with IBM; the link in the description box and the pinned comment should take you to the program offer.

So why exactly do we need to do time series analysis?
Typically, we want to predict something in the future. It could be stock prices, it could be sales, or anything else that needs to be projected forward; that is when we use time series analysis. As the name suggests, it is forecasting. Note that when we talk about predicting in machine learning and data analysis, we are not necessarily talking about the future, but in time series analysis we typically are: we have some past data and we want to predict the future. What are some examples? Daily stock or share prices, weekly interest rates, or the sales figures of a company. These are examples where we use time series data: we have historical data that is dependent on time.
Based on that, we create a model to predict the future. So what exactly is a time series? Time series data has time as one of its components, as the name suggests. In this example, say this is stock price data; there are two columns, where column B is the price and column A is the time information. Here the time unit is a day: the closing price of a particular stock has been recorded on a daily basis. So this is time series data, and the time interval is a day. Time series intervals can be daily, weekly, or hourly, and for something like sensor data they can even be every few milliseconds or microseconds. The size of the time interval can vary, but it is fixed: if I say the data is daily, the interval is fixed at a day; if I say it is hourly, the data is captured every hour; and so on. The intervals themselves are fixed, and you decide the interval based on what kind of data you are capturing. This is the graphical representation of what we previously saw as a table, and this is how we plot the data: the price on the y-axis and time on the x-axis; plotted against time, this is how a time series graph looks. So, as the name suggests, time series data is a sequence of data recorded over specific intervals of time, and based on the past values we try to forecast the future. Again, it is time-dependent: time is one of the components of the data. Time series data consists of primarily four components: the trend, the seasonality, the cyclicity, and last but not least the irregularity, which is sometimes also referred to as the
random component. Let's see what each of these components is. What is trend? The trend is the overall change or pattern in the data. Let me pull up the pen and show you: say you have a time series data set somewhat like this. What is the overall pattern? There is an overall upward trend, as we call it here. It is not continuously increasing; there are times when it dips, times when it rises, and times when it falls, but overall, from the time we start recording to the time we stop, there is a trend, an upward trend in this case. The trend need not always be upward; there can be a downward trend as well, as in this example here. So that is what a trend is: whether overall the data is increasing or decreasing. Then we have the next component, seasonality. What is seasonality?
Seasonality, as the name suggests, is once again change over a period of time, but periodic change: there is a certain repeating pattern. Take the sales of warm clothes, for example, plotted along the months: January, February, March, April, and so on up to December; I'll just mark December as D, then January, February, March again until the next December, and so on, and for simplicity let's treat each December as the end of a year. Now, what happens with warm clothes? Sales will increase, probably around December when it is cold, then come down; then around the next December they will increase again and come down again, and so on. Say that is the sales pattern. You'll notice there is a trend here as well, an upward trend: across multiple years (this is year one, this is year two, this is year three, and so on) the sales are increasing overall. But it is not a continuous increase; there is a pattern. Every December the sales increase, peaking for that particular year; then a new year starts, and as December approaches the sales increase again, and so on and so forth. This is known as seasonality: a certain fluctuation that is periodic in nature. Then comes cyclicity. What is cyclicity?
is periodic in nature so this is known as seasonality then cyclicity what is cyclicity now cyclicity is somewhat
similar to seasonality but here the duration between two cycles is much longer so seasonality typically is
referred to as an annual kind of a sequence like for example we saw here so it is pretty much like every year in the
month of December the sales are increasing however cyclicity what happens is first of all the duration is
pretty much not fixed and the duration or the Gap length of time between two cycles can be much longer so recession
is an example so we had let's say recession in 2001 or 2002 perhaps and then we had one in
2008 and then we had probably in 200 2012 and so on and so forth so it is not like every year this happens probably so
there is usually when we say recession there is a slump and then it recovers and then there is a slump and then it
recovers and probably there is another bigger slump and so on right so you see here this is similar to seasonality but
first of all this length is much more than a year right that is number one and it is not fixed as well it is not like
every four years or every six years that duration is not fixed so the the duration can vary at the same time the
gap between two cycles is much longer compared to seasonality all right so then what is irregularity irregularity
Irregularity is the random component of the time series data. You have the trend, which tells you whether the data is increasing or decreasing overall; you have seasonality, which is a specific repeating pattern; you have cyclicity, which is again a pattern, but at much longer intervals; and on top of that there is a random component, one that cannot easily be accounted for. It can be truly random, as the name suggests, and that is the irregularity component.
So those are the various components of time series data. Now, are there conditions where we cannot use time series analysis? In other words, can we do time series analysis with any kind of data? No, not really. There is some data that is collected over a period of time but really is not changing, and it would not make sense to perform any time series analysis on it. For example, if we take x as the time and y as the value of whatever output we are talking about, and the y value is constant, there is really no analysis you can do, let alone time series analysis. That is one case. Another possibility is that the data is changing, but changing according to a very fixed function, like a sine wave or a cosine wave; time series analysis will not make sense in this kind of situation either, because there is a definite pattern, a definite function that the data is following, so it will not make sense to do a time series analysis.
Now, before performing any time series analysis, the data has to be stationary, and typically raw time series data is not stationary; in that case you need to make it stationary before applying any model such as the ARIMA model. So what exactly is stationary data, and what is meant by it? Let's first look at non-stationary data. If you recall from one of my earlier slides, time series data has four components: the trend, seasonality, cyclicity, and the random component, or irregularity. If these components are present in time series data, it is non-stationary, and since they typically are present, most raw time series data that is collected is not stationary; it has to be converted to stationary data before we apply any of these algorithms. A non-stationary time series would look like this: here, for example, there is an upward trend, the seasonality component is there, and so is the random component. If the data is not stationary, time series forecasting will be affected; you cannot really perform time series forecasting on non-stationary data. So how do we differentiate between a stationary and a non-stationary time series? One way, of course, is to do it visually: in stationary data the series will look flatter; the seasonality may still be there, but the trend will not be. If we plot it, it moves along a horizontal line, compared to the original data, which had an upward trend and kept drifting upward. That is how non-stationary data and stationary data look visually.
What does this mean technically? Technically, the stationarity of the data depends on three things: the mean, the variance, and the covariance. Let's take a look at each. For stationary data, the mean should not be a function of time, which means the mean should remain pretty much constant over the whole period; there shouldn't be any systematic change. That is how stationary data looks, while in the non-stationary example I showed in the previous slide the mean is increasing, which means there is an upward trend. Second, the variance of the series should not be a function of time either; the variance should also be pretty much constant. Visually, this is how stationary data looks, with the variance not changing; here the variance is changing, so this is non-stationary, and we cannot apply time series forecasting to this kind of data. Similarly, the covariance, between the i-th term and the (i+m)-th term, should not be a function of time; covariance here is not just the variance at the i-th term but the relationship between the series at the i-th and the (i+m)-th positions. Once again, visually, this is how it looks when the covariance changes with respect to time. So all three components should be essentially constant; that is when you have stationary data, and in order to perform time series analysis the data should be stationary. The three conditions are summarized below.
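In symbols, the three conditions for (weak) stationarity can be written as follows; this is the standard textbook formulation rather than something shown in the video:

```latex
E[Y_t] = \mu                            \quad \text{(mean does not depend on } t\text{)}
\mathrm{Var}(Y_t) = \sigma^2            \quad \text{(variance does not depend on } t\text{)}
\mathrm{Cov}(Y_t, Y_{t+m}) = \gamma_m   \quad \text{(depends only on the lag } m\text{, not on } t\text{)}
```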
Now let's take a look at the concept of the moving average, or the method of moving averages, and see how it works with some simple calculations. Say this is our sample data: we have the sales for three months, January, February, and March, given in thousands of dollars (thousands rather, not hundreds), and we want to find the moving average. We call it MA(3): you take three of the values or readings, add them up, and divide by three, which is simply the way we take a mean or average of three values. It's as simple as that; that's the average, first of all. So what makes it a moving average? If you have a series of data, you keep sliding the window: you take three values and average them, then the next three values and average them, and so on and so forth. That is how you take a moving average.
Now let's take a slightly more detailed example, with car sales. This is the sales data of a particular car, or a showroom, for four years: year one with quarters 1 through 4, then year two with quarters 1 through 4, and so on and so forth. We want to forecast for year five: we have the data for four years and now want to forecast the fifth. Let's see how it works. First of all, if we plot the raw data as it is, this is how it looks. Do you think it is stationary? No, because there is an upward trend, so this is not stationary data. Later we will see how to make it stationary, but to start with this example we will not worry about that for now; we will just go ahead and do the forecasting manually, using what is known as the moving average method. We are not applying any algorithm here; in the next video we will see how to apply an algorithm, how to make the data stationary, and so on. Here we see that the components we talked about are present: there is a trend, there is seasonality, and of course there is some random component as well. Cyclicity may not be applicable in every situation; for sales especially, unless you are looking at 20 or 30 years of data, cyclicity may not come into play. So we will consider primarily the trend, the seasonality, and the irregularity (the random component; we have been calling it random or irregularity), which are the three main components we will talk about in this case, and this is the trend component. Now let's see how to do the calculations. We redraw the table including a time code: we add another column, the time code, and just number it 1, 2, 3, 4, up to 16; the rest of the data remains the same.
Now let us do the moving average calculations, or MA(4) as we call it, for each year: we take all four quarters and average them. If we add up the first four values and divide by four, we get a moving average of 3.4, and we start by putting that value at the third quarter, the middle of the window (1, 2, 3: the third quarter). Then we move on: we take the next four values, as you see here, and their average is the moving average for the next quarter, and so on and so forth. Now, if we just compute the moving average like this, it is not centered, so we add one more column and calculate the centered moving average, as shown here. For that we take the average of two adjacent moving-average values: for example, the first value for the third quarter is actually the average of the windows at the third and fourth quarters, so we get 3.5; it is now centered. Similarly, the next value is the average of 3.6 and 3.9, about 3.7, and so on and so forth. That is the centered moving average, and it is done primarily to smooth the data, so there are not too many rough edges. If we visualize the data now, this is how it looks: with the centered moving average there is a gradual increase; had we not centered it, the changes would have been much sharper. So that is the smoothing we are talking about; a small R sketch of this calculation follows.
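As a rough sketch of the MA(4) and centering arithmetic just described, here is how it can be done in R with the forecast package; the sales figures below are hypothetical stand-ins, not the exact numbers from the video:

```r
library(forecast)

# Hypothetical quarterly sales (4 years x 4 quarters), in thousands of dollars
sales <- ts(c(2.8, 2.1, 4.0, 4.5,  3.8, 3.2, 4.8, 5.4,
              4.0, 3.6, 5.5, 5.8,  4.8, 4.2, 6.0, 6.5),
            frequency = 4)

# Centered moving average of order 4: averages each window of four quarters,
# then averages adjacent windows so the result lines up with a single quarter
cma <- ma(sales, order = 4, centre = TRUE)
cma   # the first and last two values are NA, since those windows are incomplete
```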
the fifth year so in order to do the forecast what we will do is we will take the center Ed moving average as our
Baseline and then start doing a few more calculations that are required in order to come up with the prediction so what
we are going to do is we are going to use this multiplicity or multiplicative model in this case and this is how it it
looks so we take the product of seasonality and uh the trend and the irregularity components and we just
multiply that and in order to get that this product of these two We have basically the actual value divided by
CMA YT value divided by CMA will give you the predicted value of YT is equal to the product of all three components
therefore St into YT is equal to YT by CMA so this is like this is equal to YT right so therefore if we want St into YT
the product of seasonality and irregularity is equal to YT by CMA so that is how we will work it out I also
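Written out, the multiplicative model and the rearrangement being used are:

```latex
Y_t = T_t \times S_t \times I_t
% The centered moving average estimates the trend component, CMA_t \approx T_t, so
S_t \times I_t = \frac{Y_t}{CMA_t}
```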
I also have an Excel sheet with the actual data, so let me pull that up. This is how the data looks in Excel: year one quarters 1 through 4, year two quarters 1 through 4, and so on; this is the sales data, then the moving average calculated as I described, and then the centered moving average, which is the primary column we will work from. Next, since we want the product St × It, which equals Yt / CMA, these values are simply the actual value divided by the CMA: in this case 4 / 3.5, which is 1.14; similarly 4.5 / 3.7, about 1.22; and so on and so forth. So we have the product St × It, and the next step is to calculate the average of the respective quarters, which is what we are doing here: the average for each quarter position gives the seasonal index St. Then we calculate the deseasonalized values: to deseasonalize we divide Yt by the St we just calculated; for example, 2.8 / 0.9 gives the deseasonalized value here. From that we get the trend, and then the predicted values. To get the predicted values, we predict for the known periods as well: for example, for year one quarter 1 we know the value, but now that we have our model we predict it ourselves and see how close it is. We predicted 2.89 where the actual value is 2.8, then 2.59 where the actual is 2.1, and so on, just to see how our model works; then we continue into the fifth year, because for the fifth year we don't have a reference value.
And if we plot this, we will come to know how good our calculations are, how well our manual model performs; in this case we did not really use a model, we did it on our own, manually. It shows the trend: the predicted value is this gray line here, and you can see it is actually following the actual value, the blue line, pretty closely. Wherever we know the values, up to year four, our predicted values follow, or are very close to, the actual values; from where year five starts, the blue line is not there, because we don't have actual values, only predicted ones. Since the model was following the trend closely for the last four years, we can safely assume it has understood the pattern and is predicting sensibly for the next year, the next four quarters. For those four quarters we do not have actual data, but we have the predicted values. Let's go back to the slides and see how this works.
We already saw this part, and I think it was easier to see in the Excel sheet. We calculated St × It, the product of St and It, using the formula Yt / CMA. Then we got St itself: this is the average of the first quarters across all four years, and similarly this is the average of the second quarters across all four years, and so on. These values repeat; they are calculated only once per quarter. Then Yt / St: we calculated St, and we have Yt, so Yt divided by St gives the deseasonalized data, and with that we have got rid of the seasonal and irregular components. What we are left with is the trend, and as I mentioned earlier, before we start time series forecasting or analysis we need to completely get rid of the non-stationary components. We are still left with the trend component, so now let us also remove it. To do that, we have to calculate the intercept and slope of the data, because those are required to calculate the trend. And how are we going to do that?
We will use the regression, or analytics, tool that is available in Excel. Remember, we have our data in Excel, so let me take you there. Here we need to calculate the intercept and the slope; to do that we use the regression mechanism, and to use the regression mechanism we need the Analysis ToolPak that comes with Excel. How do you activate this tool? From Excel you go to File, then Options; under Options there is Add-ins, and under Add-ins you will see Analysis ToolPak. Select it and click Go, and it will open a box like this; tick Analysis ToolPak and click OK. Note a couple of things: there are two options, one with VBA and one without, and you should use the one without VBA; and instead of just clicking OK in the Options dialog, take care to click Go, because only then do you get these checkboxes. Since I have already added it, it shows at the top for me, but when you do it for the first time it will appear under Inactive Application Add-ins. Now, when you come back to the regular view of Excel, in the Data tab you will see Data Analysis activated. If you click on it, there are a bunch of options for the kind of data analysis you want to do; right now we just want regression, because we want to find the slope and the intercept. So select Regression and click OK, and you get the options for the input Y range and the input X range. The input Y range is the YT values, so you select that column and press Enter; for the input X range you can start with the baseline time codes, or you can use the deseasonalized values. Click through and say OK. I have already calculated it, so these are the intercept and the coefficients we get for these values, and we will use them to calculate our trend.
The trend goes in the J column, and the formula is: trend = intercept + slope × time code. The intercept is out here, as you can see in the slide as well: this is our intercept, and the lower value is the slope we calculated; the formula is shown here too. The time code is nothing but that T column numbered 1, 2, 3, 4, and so on. That is how you calculate the trend, and that is how you use the Data Analysis tool in Excel. Using these two numbers we calculate the predicted values, again with the formula trend = intercept + slope × time code, and then we can plot them and see how they look. We see that the predicted values are pretty close to the actual values, so we can safely assume that our calculations, our manual model, are working, and hence we go ahead and predict for the fifth year. Up to year four we know the actual values as well, so we can check how our model performs; for the fifth year we have no reference values, so we use the equations to calculate, or predict, the values. When we plot the fifth year as well, we see that the predicted values have captured the pattern, and we can safely assume the predictions are fairly accurate, as we can also see from the graph in the Excel sheet we have already seen. For reference, the same intercept-and-slope step is sketched in R below.
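The regression step does not have to be done in Excel; as a minimal sketch in R (the deseasonalized numbers here are hypothetical placeholders for the Yt / St column computed above):

```r
# Hypothetical deseasonalized values (Yt / St) for 16 quarters
deseasonalized <- c(3.1, 3.2, 3.3, 3.5, 3.6, 3.6, 3.8, 3.9,
                    4.0, 4.1, 4.3, 4.4, 4.5, 4.6, 4.8, 4.9)
t <- seq_along(deseasonalized)   # the time-code column: 1, 2, ..., 16

fit <- lm(deseasonalized ~ t)    # simple linear regression
coef(fit)                        # the intercept and the slope

# trend = intercept + slope * time code, exactly as in the Excel formula
trend <- coef(fit)[1] + coef(fit)[2] * t
# Multiplying the trend back by each quarter's St would give the re-seasonalized
# predictions, which is how the fifth-year forecast above is produced.
```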
So let's go and plot it: this is how the plot looks. This green line is the CMA, the centered moving average; the blue line is the actual data; and the red line is the predicted value, predicted by our handcrafted model. Remember, we did not use any regular forecasting model or tool; we did this manually, and the actual tool will be used in the next video. This was just to give you an idea of how, behind the scenes or under the hood, time series forecasting is performed. It looks like it has captured the trend properly: up to here we have known reference values, and from here onwards it is purely predicted, and as I mentioned earlier, we can safely assume the values are accurate and predicted properly for the fifth year. So let's go ahead and implement a time series forecast in R.
First of all, we will be using the ARIMA model to forecast this time series data, so let us try to understand what the ARIMA model is. ARIMA is actually an acronym: it stands for Auto-Regressive Integrated Moving Average. The model is specified by three parameters, p, d, and q, and these correspond to the three parts of the name: p stands for the auto-regressive part, d for the integrated part, and q for the moving-average part. Concretely, p is the number of auto-regressive terms (we will see that in a little bit), d is how many levels of differencing we need to do, and q is the number of lagged forecast errors. Let's see what each of these is. AR is the auto-regressive component, denoted by p; so what exactly are AR terms?
In terms of a regression model, the auto-regressive components refer to the prior values of the current value. What do we mean by that? When we talk about time series data, focus on the fact that there is regression involved. What happens in regression? In simple linear regression we fit an equation like y = mx + c, where there are two variables: a dependent variable and an independent variable. Here, though, we are talking about auto-regression, and auto-regressive, as the name suggests, means regression on itself. There is only one variable, say the cost of flights or whatever it is, and it is time-dependent, so we denote its value at any given time t as Yt. There is no separate x; there is only one variable, y, and the predicted value Yt at time interval t depends on the previous values: a1 × Yt−1, plus a2 × Yt−2, plus a3 × Yt−3, and so on. So there is only one variable, but there is still a regression component: we are doing a regression on the series itself, which is how the term auto-regression comes into play. The only difference is that the current value Yt depends on the previous time-lag values: the first lag, the second lag, the third lag, and so on. That is the auto-regressive component; this is what is shown here (in this case the variable is called X instead of Y, which is the same thing), represented by an equation of that sort depending on how many lags we take. The term p determines how many lags we are considering; the equation is written out cleanly below.
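Cleaned up, the auto-regressive equation sketched on screen looks like this for p lags (the intercept c and error term εt are standard additions that the on-screen sketch omitted):

```latex
\text{AR}(p): \quad Y_t = c + a_1 Y_{t-1} + a_2 Y_{t-2} + \cdots + a_p Y_{t-p} + \varepsilon_t
```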
Now, what is d? d is the degree of differencing, where differencing handles the non-seasonal differences. For example, take the values 5, 4, 6, 7: if you difference each value against the previous one, 4 - 5 gives -1, 6 - 4 gives 2, and 7 - 6 gives 1. This is first-order differencing, and here we say d = 1; in the same way we can have second-order, third-order, and so on. The last one is q. q is called the moving-average term, but in reality it refers to the errors of the model, the lagged forecast errors, which we also sometimes write as et. The differencing arithmetic can be checked directly in R, as sketched below.
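R's built-in diff function reproduces exactly this example:

```r
# First-order differencing (d = 1) of the values from the slide
x <- c(5, 4, 6, 7)
diff(x)                    # -1  2  1  -- each value minus the previous one
diff(x, differences = 2)   # second-order differencing: the diff of the diff
```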
Now, the ARIMA model works on the assumption that the data is stationary, which means the trend and seasonality of the data have been removed; we discussed in the first part what stationary data is and how to remove the non-stationary parts. In studying whether the data is stationary, and more generally its structure, two important tools are considered: the autocorrelation function and the partial autocorrelation function, referred to as the ACF and the PACF. What is autocorrelation? Autocorrelation is the similarity between values of the same variable across observations, as the name suggests. How do we actually find the autocorrelation function? It is basically done by plotting. The autocorrelation function tells you how correlated points are with each other based on how many time steps separate them, which is exactly the time lag we were talking about, and it is also used to determine how past and future data points are related. The value of the autocorrelation function can vary from -1 to 1, and if we plot it, an ACF plot looks somewhat like this; there is a readily available function in R for it, which we will see in RStudio in a little bit. Similarly, you have the partial autocorrelation function: the PACF is the degree of association between two variables while adjusting for the effect of one or more additional variables. This can also be measured and plotted, its value once again ranges from -1 to 1, and it gives the partial correlation of the time series with its own lagged values (lag, again, we discussed a couple of slides back). This is how a PACF plot looks; we will see that in RStudio as well, and both plots can be produced with the built-in functions sketched below.
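Both functions are built into base R; a minimal sketch, using the AirPassengers data that the upcoming use case also loads:

```r
data("AirPassengers")    # monthly airline passenger counts, 1949-1960
acf(AirPassengers)       # autocorrelation function plot; values range from -1 to 1
pacf(AirPassengers)      # partial autocorrelation function plot
```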
With that, let's get into RStudio and take a look at our use case. Before we go into the code, let's quickly understand the objective. We are going to forecast some values: we have the airline ticket sales data of previous years, and we will try to predict, or forecast, the values for future years. We will identify the time series components, the trend, seasonality, and random behavior, visualize them in RStudio, and then forecast the values based on the past values, the historical data. Those are the steps we will follow; we will see them in RStudio in a little bit, but first let's quickly go through them.
We load the data, which is time series data; if we check what class it belongs to, it is a ts object. The data is the AirPassengers dataset, which comes preloaded with R, so we will be using that. We can take a look at the data, then at the starting point and the end point (these are all functions that are available, and we'll be using them), and then at the frequency, which is 12: the data has been collected every month, so there are 12 observations per year. Then we can check for missing values, if there are any, and take a look at the summary of the data; this is what we do in exploratory data analysis. We can plot the data, visualize how it looks, and we will see that it has a trend, seasonality, and so on and so forth. We can look at the cycle of the data using the cycle function and see that it is monthly: at the end of every 12 months a new cycle begins, so data is available for each month of the year. We can also draw box plots to see, for each month, how the data varies over the ten to twelve years we will be looking at. From this exploratory data analysis we can identify that there is a trend and a seasonality component, and the box plots also show how the seasonal component varies. And we can decompose the data, using the decompose function, to see the various components: the seasonality, the trend, and the irregularity. We will see all of this in RStudio; this is how it will look once you decompose, and this is how you can visualize the data: this is the actual data, this is the trend (as you can see, it goes upward), this is the seasonal component, and this is your random, or irregularity, component (we call it irregularity, or we can also call it random).
Yes, and as discussed, the data must have a constant variance and mean, which means it must be stationary, before we start any time series analysis; only if it is stationary is it easy to model the data and perform the analysis. We can then go ahead and fit the model; as we discussed earlier, we will be using an ARIMA model. There are techniques to find what the parameters should be, and we will see them when we go into RStudio: the auto.arima function basically tells us the parameters, which are the p, d, and q we talked about, as shown here. If you use auto.arima, it will take the possible values of these p, d, q parameters, find out which combination is best, and recommend it; that is the advantage of using auto.arima. And if we set the parameter trace = TRUE, it will also tell us, for each combination of p, d, and q, the value of the AIC, which has to be minimal (the lower the value, the better), and then it recommends the best model to us, namely whichever combination has the lowest AIC. Once we have that, we can get the model; the model is nothing but the equation based on the parameters we get. We can then do some diagnostics: we can plot the residuals, and we can also take a look at the ACF and PACF and plot them. Then we can do some forecasting for future years: in this case we have data up to 1960, and we will see how to forecast the next 10 years, up to 1970. And once we have done this, can we validate the model? Yes, definitely: to validate the findings we use the Ljung-Box test. You just call Box.test, pass in the parameters, and the values that are returned tell us how accurate the model is, or how accurate the predictions are. The p-values are quite insignificant in this case, as we will see, and that indicates that our model is free of autocorrelation. That will basically be it, so let's go into RStudio and go through these steps in real time.
First we have to import the forecast library; if the forecast package is not installed, you have to go here, to the Packages pane, and install it. That is the easy way to install, rather than putting the install in the code. I will not do it now because I have already installed it; it is a one-time step, and after that you just have to load the library into memory and keep going. Then we load the AirPassengers data by calling the data method, and you can see the AirPassengers data is loaded here. If we check its class, it is time series data, a ts object. We can check the dates (we can also view the data in a little bit): the start date is January 1949, the end date is December 1960, and the frequency is 12, which means it was collected monthly. Then we check whether there are any missing values; there are none. Then we take a look at the summary of the data (this is all exploratory data analysis), and if we just display the data, this is how it looks. Next we need to decompose this data, so we store the series in an object, ts_data, use it to decompose, and store the new values; let me just clear this for now. Decomposing, as we have seen in the slides, is breaking the series into the trend, seasonality, and irregular (random) components. Then you can go ahead and plot it. When you plot it, let me zoom in: this is our original plot, or observed values as they are labeled; then we have the three decomposed parts, which are the trend (as you can see, there is an upward trend), the seasonal component (a regularly occurring pattern), and the random values, for which you cannot really give any equation or function. That is what this plotting has done, and you can also plot them individually: these are the individual plots for the trend, for the seasonal component, and for the random component. A sketch of all these steps is below.
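For reference, a compact sketch of the exploratory steps just walked through, using only functions shown or named in the demo:

```r
library(forecast)

data("AirPassengers")        # preloaded monthly airline passenger counts
class(AirPassengers)         # "ts" -- a time series object
start(AirPassengers)         # 1949 1
end(AirPassengers)           # 1960 12
frequency(AirPassengers)     # 12 -- monthly data, 12 observations per year
sum(is.na(AirPassengers))    # 0 -- no missing values
summary(AirPassengers)

plot(AirPassengers)                # the raw series
ts_data <- AirPassengers
decomposed <- decompose(ts_data)   # split into trend / seasonal / random
plot(decomposed)                   # all panels at once
plot(decomposed$trend)             # or each component individually
plot(decomposed$seasonal)
plot(decomposed$random)
```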
Now let's take a look at the original data and see the trend: if we overlay a linear regression line, it shows the series going upward. We can also take a look at the cycles: we have a frequency of 12, so the cycle display runs January, February, through December, then back to January, February, and so on and so forth. And if we draw box plots of the monthly data, one box per month over the ten or so years of data we have, we see a certain pattern; this is also, in a way, finding the seasonality component. While January and February sales are relatively low, around July and August the sales pick up; in July especially the sales seem to be the highest, and this happens pretty much every year: every year there is a peak in July, then it goes down, with a slightly higher bump again in December, and so on. That is again part of our exploratory data analysis; the trend line and box plots can be produced as sketched below.
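A short sketch of the trend line, cycle, and box plot commands described here (all standard base R idioms for a ts object):

```r
plot(AirPassengers)
# Overlay a simple linear regression line to show the overall upward trend
abline(reg = lm(AirPassengers ~ time(AirPassengers)))

cycle(AirPassengers)                           # month index (1..12) of each point
boxplot(AirPassengers ~ cycle(AirPassengers))  # one box per month; July peaks
```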
Once again, let's just plot the data. Now, as I said, in order to fit an ARIMA model we need the values of p, d, and q, and there are actually multiple ways of finding them. The earlier method was to draw the autocorrelation function plot and the partial autocorrelation function plot, observe where they change, and from that identify what the values of p and q should be. But R has a very nice method we can use to avoid all that manual process: a method called auto.arima. If we just call auto.arima, it will go and test the ARIMA model for the possible values of the p, d, q parameters, work out the best model, and return that best model with the right values of p, d, and q, so we as data scientists don't have to do any manual trial-and-error kind of work. So we got the model: its p, d, q values are (2,1,1), and this other part is the seasonal part of the model, which we can ignore for now. If we want to understand how it arrived at (2,1,1) as the best values, there is another feature we can use: the trace parameter. If you pass the trace parameter to auto.arima, it will show you how it does the calculation, printing the value of the AIC, which essentially reflects the quality of the model (the lower, the better), for each combination of p, d, and q. Rather than me talking so much, let's run it: if we run auto.arima with trace, you see the red marker here, meaning it is executing, and then we see the display. It starts with certain values of p, d, q, finds the value too high, starts again with, say, (0,1,1), and so on and so forth, and ultimately it tells us the best model; you see here it says the best model is (2,1,1). Let's go back and check: did we get the same one when we ran without trace? Yes, the same. And why (2,1,1)? Here is our (2,1,1), and if you compare the values you see that roughly 1017 is pretty much the lowest AIC; all other values are higher, so it reports this as the best model. That is how you get your model; a minimal sketch of these calls follows.
get your model and now that you have your model what you have to do you need to predict the values right so before
that let us just do some test of these values so for that you install T Series again if you're doing it for the first
time you would rather use this package and install and say T Series and install it and then you just use this Library
function to load it into your memory all right so now that we got our model using Auto ARA let us go ahead and forecast
and also test the model and also plot the ACF and PF remember we talked about this but we did not really use it we
don't have to use that but at least we will visualize it and uh for some of the stuff we may need this T Series Library
so if you are doing this for the first time you may have to install it and my recommendation is don't use it in the
code you go here and install T Series and I will not do it now because I already installed it but this is a
preferred method and once you install it you just load it using this libraries function and then you can plot your
residuals and this is how the residuals look and you you can plot your ACF and PF okay so this is how your PF looks and
this is how your ACF looks for now there is really nothing else we need to do with ACF and PF this just to visualize
how that how it looks but as I mentioned earlier we were actually using these visualizations or these graphs to
identify the values of p d and q and how that was done it's uh out of scope of this video so we we'll leave it at that
Then we forecast for the next 10 years. How do we do that? We call forecast, pass the model, pass the confidence level we want, which is 95%, and the number of periods: we want 10 years, which is 10 × 12 time periods. That is what we are doing here, and now we can plot the forecast values. You see the original values up to about 1960, and then it continues up to about 1970; the blue portion is the predicted values. Let's zoom in so we can see it better: focusing on this region, it looks like our model has learned the pattern, and the forecast looks very similar to what we see in the actual data. Now, how do we test our model? We do what is known as a Box test: we call Box.test on the model residuals with different lags, and from the resulting p-values we find that they are reasonably high, that is, insignificant, which indicates the residuals are free of autocorrelation. A sketch of the forecast and the test follows.
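A minimal sketch of this last stretch, combining the forecast call and the Ljung-Box test on the residuals (the specific lags of 5 and 10 are illustrative choices, not taken from the video):

```r
library(forecast)

fit <- auto.arima(AirPassengers)

# Forecast the next 10 years (10 x 12 monthly periods) at the 95% level
fc <- forecast(fit, level = c(95), h = 10 * 12)
plot(fc)    # the blue region past 1960 is the predicted path

# Ljung-Box test on the residuals: high (insignificant) p-values suggest
# the residuals are free of autocorrelation, i.e. the model fits well
Box.test(fit$residuals, lag = 5, type = "Ljung-Box")
Box.test(fit$residuals, lag = 10, type = "Ljung-Box")
```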
Now, the sky is the limit on applications. In today's world almost every business, almost every act of life, your music on Spotify, is driven by data analytics, but some of the big players when you go job hunting are these. Fraud analysis: if you want to make a lot of money, you're good with numbers, and you like that kind of work, go join the banks and track down the criminals who are stealing money; protecting credit cards, sales and purchases, and catching bad checks is a huge business. Healthcare is exploding: everything from trying to find cures for the COVID virus or any of the viruses out there, to using your cell phone to help diagnose ailments so you don't have to go and see the doctor; you can take a picture of the funky growth on your arm (hopefully it's not too big), send it in, and data analytics looks at it and says this is what it is, and this is the professional you need to see, or don't need to see. And that's just one aspect of healthcare: the databases being generated by healthcare, routing people to the right doctors, and helping doctors analyze whether something is benign or malignant, whether it's cancerous, are all part of the ongoing growth of data analytics in healthcare. Inventory management: think of one of those huge warehouses shipping out goods; how do you organize the inventory so that the items purchased most often sit near the entrance and everything else toward the back, or even pre-ship stock? Being able to manage your inventory that way is huge, and pretty soon a drone will come in, pick up some of those boxes, move them around, and even deliver them. Logistics: again, getting from point A to point B, which you can combine with inventory, pre-shipping stock if you know a certain area is more likely to purchase it, working out how to reach the most destinations in the shortest amount of time, and even pre-stacking the trucks going out so everything comes off in the right order; that's all done with data analytics. Targeted marketing is a huge industry: any kind of marketing, whether you're generating the right content, deciding whom to target with it, or researching what people want so you know what products to market. And these are just a few examples; you can go way beyond this, from tracking forest fires to astronomy and studying the stars; data analytics now plays a huge role in all these different areas. City planning is another one: picture a nicely organized city where you can get in and out of the neighborhoods if you're a fire truck, where police officers can get in and out, where you want your tourists to be able to come in, yet you still want the place to look nice and to have the right commercial development, the right industrial development, and enough residences for people to stay; all of that is part of city planning, which again leans heavily on data analytics. So the sky is the limit on what you can use it for.
There are so many ways, but we're going to start by looking at the most basic questions you'll be asking in data analytics. The first is descriptive analytics: what has happened, hindsight. How many sales per call are coming out of the call center? If we had 500 tourists in a forest at a certain temperature, how many fires were started? How many times did the police have to show up at certain houses? All of that is descriptive. The next is predictive analytics: what will happen next. This is great if you have an ice cream store and want to predict how many people should work on a given day based on the temperature coming up and the time of the year. And then one of the biggest-growing and most important parts of the industry is prescriptive analytics, which you can think of as combining the first two: foresight, what can we change to make this work better? In all the industries we looked at before, we can start asking such questions. City development has a good one: if we want our city to generate more income, and we want that income to be commercial, what kind of commercial buildings do we need to build in an area to bring people over? Do we need huge warehouse-style stores like Costco, or little mom-and-pop shops that will bring in people from the country to come shop there? Or do we want an industrial setup, and if so, is there a car industry available in that area, and if not, what other industries are? All of those questions are prescriptive: we're making informed guesses about what we can do to fix things. What can we do to reduce crime, and with education, what kind of education will help people understand what's going on so we lower the crime rate and help our communities grow? That's all prescriptive: foresight into how we can make it happen and make it better. And we really can't go into enough detail on these three, because a lot of people stumble on them when they come into analytics; whether you're the manager, a shareholder, or the data scientist coming in, you really need to understand them. Descriptive analytics: studying the total units of furniture sold and the profit that was made in the past. Predictive analytics: predicting the total units that will sell and the profit we can expect in the future, so we can gear up for how many employees we need and how much money we're going to make. Prescriptive analytics: finding ways to improve the sales and the profit, perhaps by selling a different kind of furniture, making an informed guess at what the area is looking for and how the marketing is going to change.
Data analytics process steps: let's take a look at the basic processing involved when you're working with this data. There are five basic steps, and note that this varies; there's a lot going on when people talk about agile programming, and the whole concept of agile is that you take some kind of framework like this and build on it depending on what your business needs. The first step is data collection: you might have one team responsible for database management, another pulling APIs, pulling data off of, say, the Census Bureau, or something very domain-specific; if you're analyzing cancerous growths and how to understand them, the data collection will be the measurements taken from the MRI, or even the MRI images themselves. So there's a lot involved in data collection and in controlling it: making sure it has what you need, that it's clean, and that you don't have bad information coming in. Once you have the data collected, stage two is data preparation: we take that data and format it into something we can use. Probably one of the most common examples is processing text: how do you process text? You use what's called a one-hot encoder, where each word is represented by a yes/no kind of setup, essentially a long array of bits, so bit number one might be "the," bit number two "has," or whatever the words are. Other preparations: if you're using neural networks, you might take integers or floating-point numbers and convert them to a value between zero and one, so that no single feature creates a bias. There's a lot that goes into data preparation; it's said to be 80% of data science, and a quick sketch of those two encoding ideas follows.
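As a rough illustration of those two preparation steps, one-hot encoding and scaling to the zero-to-one range, here is a sketch in R (keeping the one code language used earlier in this guide; in Python the same ideas are commonly handled with pandas' get_dummies and scikit-learn's MinMaxScaler):

```r
# One-hot encoding: each category (here, each word) becomes its own 0/1 column
words <- factor(c("the", "cat", "sat", "the"))
model.matrix(~ words - 1)    # one indicator column per distinct word

# Min-max scaling: map numeric values into the 0..1 range so that no single
# large-valued feature dominates (a common step before neural networks)
x <- c(10, 25, 40, 55, 100)
(x - min(x)) / (max(x) - min(x))
```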
Data analytics leans a little more toward the math side, and the data scientist is usually described as the overall preparer of this stuff: you'll spend around 80% of your time on data preparation. Step three is data exploration. That's the fun part, and it's maybe 10 to 15% of the time you spend with the data, but it's probably the most important step, because this is where you have to start asking questions. If you ask your questions wrong, you're going to get wrong information. If you're working with a company that wants to know its marketing values, you really have to focus on "how do we generate money for this company?", or, for fraud,
"how do we lower the fraud rate while still generating a profit?" Step four is data modeling: this is where we actually get into the code and choose which model predicts what's going to happen. And step five is result interpretation. We want to be able to interpret those results, and you usually see that with the matplotlib library, where you create nice images that show up on a dashboard for the marketing manager or the CEO, so they can take a quick look and say "hey, I can see what's going on there." You want to reduce it to something they can easily read; they don't want to hear the scientific terms, they want something they can use. We'll talk about that a little more when we start looking at the demo.
Since this is data analysis with Python, we have to ask the question: why
Python for data analytics? There's C++, there's Java, there's .NET from Microsoft; why do people go to Python for it? There are a number of reasons. One, it's easy to learn, with simple syntax: you don't have the strict typing you have in Java and other languages, which lets you be a little lazy in your programming. That doesn't mean you shouldn't be careful; it just means you can spin up code much quicker in Python. The same task that takes one, two, three, or four lines in Python often left me with 10, 12, or 20 lines when I did it in Java. Two, it's very scalable and flexible: you can do a lot with it, and you can easily scale up from something on your machine to PySpark in the Spark environment, spread across hundreds if not thousands of servers and terabytes or petabytes of data. Three, there's a huge collection of libraries. This one's always interesting, because Java has a huge collection of libraries, so do C and .NET and Scala for your Spark, and they're always in competition to get those libraries out; but because Python is open source, you almost always have easy-to-access libraries that anybody can use, without checking special licensing like you do with some packages. Four, graphics and visualization: Python has really powerful packages for that, so it's easy to create nice displays for people to read. And five, community support: because Python is open source, it has a huge community behind it, and a quick Google search will probably turn up a solution for almost anything you're working on.
Python libraries: let's bring it
together. We have data analytics and we have Python, so let's talk about the Python libraries for data analytics. The big five players are NumPy, Pandas, Matplotlib, SciPy, and Scikit-learn. NumPy supports n-dimensional arrays and provides numerical computing tools useful for linear algebra. It isn't even restricted to numbers: you can put words, characters, and just about anything into an array. Think of a grid, and then a grid inside a grid, and you end up with a nice three-dimensional array. For a concrete three-dimensional example, think of images: you have your three channels of color (four if you have an alpha channel), and then the x-y coordinates of each pixel, so you can index by x, y, and then the three channel values that generate that color. And NumPy isn't restricted to three dimensions: imagine watching a movie. Now you have your movie clips, each clip has some number of frames, each frame has its x-y pixel coordinates, and each pixel has its three color values. NumPy is just a great way to work with n-dimensional arrays.
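To make those shapes concrete, here's a tiny sketch; the sizes are arbitrary:

```python
import numpy as np

# a single 480x640 RGB image: rows, columns, 3 color channels
image = np.zeros((480, 640, 3))

# a short clip: 100 frames, each 480x640 with 3 channels -> a 4-D array
clip = np.zeros((100, 480, 640, 3))
print(image.shape, clip.shape)   # (480, 640, 3) (100, 480, 640, 3)
```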
Closely tied to NumPy is Pandas: it's useful for handling missing data, performing mathematical operations, and it provides functions to manipulate data. Pandas has become huge because it is basically a data frame, and if you're working with big data in Spark or any of the other major packages out there, you realize the data frame is central to a lot of that. You can look at it like an Excel spreadsheet: you have your columns, you have your rows or indexes, and you can do all kinds of manipulations of the data within, including filling in missing data, which is a big deal when you're dealing with large pools or lakes of data that might be collected differently from different locations.
Then there's the Matplotlib library. We did skip over SciPy, which handles a lot of the mathematical computation that usually runs in the background beneath NumPy and Pandas, although you do use it directly for plenty of other things. Matplotlib is the final piece: it's what you show people, the plotting library in Python. Several toolkits extend Matplotlib's functionality; there are something like a hundred of them, ranging from a very specific one built just for properly displaying star constellations in astronomy, all the way to very generic ones. We'll actually add Seaborn in when we do the labs in a minute. These toolkits extend Matplotlib and can create interactive visualizations, so there are all kinds of cool things you can do as far as displaying graphs, and some even let you build interactive graphs. We won't do the interactive ones, but you'll get a pretty good grasp of the different things you can do in Matplotlib. Let's jump over to the demo, which is my favorite part: roll up our sleeves and
get our hands into what we're doing. Now, there are a lot of options when you're working with Python. PyCharm is a really popular one; you'll see it all over the place, and it's one of the main IDEs out there. There are plenty of others: I used to use NetBeans, which has rather fallen out of favor; I don't even have it installed on my new computer. PyCharm is really popular for general Python development, but for data science we usually go to Jupyter Notebook or Anaconda, and we're going to jump into Anaconda because it's my favorite: it bundles a lot of external tools for us. We won't dig into those, but we'll pop in so you can see what it looks like. With Anaconda we have JupyterLab and the Notebook; these are essentially identical, JupyterLab being an upgrade to the Notebook that adds multiple tabs, and we'll be using the Notebook. You can see PyCharm is so popular with Python that it's even highlighted in Anaconda as part of the setup, and Jupyter Notebook can also run standalone. Then you have your different environments: I'll be under one I've labeled py36 (there's also a root environment), and the reason is that, as of this recording, TensorFlow only works in Python 3.6, not 3.7 or 3.8, for doing neural networks. You can have multiple environments, which is nice: they separate the kernels, which helps protect your computer while you're doing development. This is also just a great way to give a demo, whether you're pulling up your laptop at a job interview or broadcasting to the big screen in a meeting so the CEO
can see what you're looking at. When we launch the Notebook, it opens a file browser in whatever web browser you have (this happens to be Chrome), and under New there are a lot of options depending on what you have installed; choosing Python 3 creates an untitled notebook. You can see I'm in a Simplilearn folder for other work I've done, which is where I save all my stuff, and I can browse through other folders, making it really easy to jump from one project to another. Under here we'll rename the notebook "data analytics", just so I can remember what I was doing, which is probably true of about 50 of the files in here right now. So let's jump in and take a look.
We'll start with NumPy, the least visually exciting library, and I'll zoom in so you can see what we're doing. The first thing we want to do is import NumPy, and we'll import it as np; that's the most common NumPy convention. Let's also change the view to show line numbers, just for easy reference. Then we'll create a one-dimensional array: we'll call it arr1, and it equals np.array with the array contents inside. In this case we'll spell the values out, though you could use a range or any of the many other ways to generate arrays. When we print arr1 and run the cell, you can see it prints 1 2 3, and you can see why this is a really nice interface for showing other people what you're doing. So that's the basics: we've created a one-dimensional array holding 1, 2, 3. One of the nice things about the Jupyter Notebook is that whatever ran in the first cell is still running in the kernel, so NumPy is still imported as np and our variable arr1 is still set. If we ask what type arr1 is and run that, it says class 'numpy.ndarray': the array is its own class, and all we're doing is checking what that class is.
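Here's a minimal reconstruction of what those first cells look like:

```python
import numpy as np   # "np" is the standard alias

arr1 = np.array([1, 2, 3])   # a one-dimensional array
print(arr1)          # [1 2 3]
print(type(arr1))    # <class 'numpy.ndarray'>
```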
If you're going to look at the array class, probably the thing you'll do most often (I've lost count of how many times I do this, because I forget whether I'm working with a three- or four-dimensional array and have to reshape it to fit whatever else I'm using) is check the shape. arr1.shape is just three, because it has three members in a one-dimensional array. Also note a Jupyter Notebook quirk: if a variable is the last statement in a cell, it prints the same as a print statement, so writing arr1[2] on the last line is identical to print(arr1[2]). We'll stick with print on this one, and it gives us 3: that's position two, since indexing runs 0, 1, 2. We can easily change a value too: assign to arr1[2] and the array comes out 1, 2, 5. There I left the print statement off, because the variable is the last line in the cell and it prints anyway. That's a Jupyter Notebook thing; don't do it in PyCharm. I've forgotten that before while giving a demo.
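Roughly, continuing the same notebook session:

```python
print(arr1.shape)   # (3,) -- one dimension, three members
print(arr1[2])      # 3 -- indexing starts at zero

arr1[2] = 5         # reassign an element in place
arr1                # last line of a cell auto-prints: array([1, 2, 5])
```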
We also talked about multiple dimensions, so let's make a two-dimensional array: the first row is 1, 2, 3 and the second row is 3, 4, 5. If we put arr2 on the last line and run it, there's our array, 1 2 3 over 3 4 5. We can also index into it: if I ask for row 1, column 2 and run it, it prints 5, because 1 2 3 sits on row zero and 3 4 5 on row one (we always start with zero), and counting 0, 1, 2 across row one lands on the 5. And maybe we forgot what we were working with, so we check arr2.shape: two rows, each with three elements, a two-dimensional array, (2, 3). If you look back at the one-dimensional array, its shape was just "three comma nothing": a single entity is always saved as a tuple with a blank. So we just did arr2[1, 2] and got the 5. You can also count backwards, which is kind of fun (you may have noticed I switched notation on you, since arr2[1, 2] gets you to the same spot as the bracket-of-brackets form): two is the last column, 0, 1, 2, so counting backwards with minus one gives the same answer whether we count forward as 0, 1, 2 or backward as -1, -2, -3. And if I change that minus one to a minus two and run it, I get 4, going backwards minus one, minus two. So there are a lot of different ways to reference what you're working on inside a NumPy array.
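In code, a sketch of that two-dimensional indexing:

```python
arr2 = np.array([[1, 2, 3], [3, 4, 5]])
print(arr2.shape)    # (2, 3) -- two rows, three elements each
print(arr2[1, 2])    # 5 -- row 1, column 2 (both start at zero)
print(arr2[1, -1])   # 5 -- negative indices count back from the end
print(arr2[1, -2])   # 4
```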
NumPy is really a cool tool with a lot you can do in it, and we talked about the fact that it can also hold things that aren't numeric values. We'll create an array called arr_s, for strings: np.array with our words in brackets, and it even reports a dtype like U6, a fixed-width unicode string type. A lot of times when you're messing with data you'll also want ranges, so we'll create arr_r, for range, to keep it uniform: np.arange is a command inside NumPy that creates a range of numbers. If you're testing data, maybe you have equal time increments spaced a certain distance apart; in this case we'll just do integers from 0 to 20, skipping every other one, and print it out to see what that looks like. You can see we get 0 2 4 6 8 10 12 14 16 18, skipping every other value like you expected. One quick note: there's no 20 on there. Why? It starts at zero and counts up to, but not including, 20; if you're used to another language, it behaves like the loop condition x < 20 rather than x <= 20.
So arange gives you a uniformly stepped set, 0, 2, 4, 6. But what happens if I want to create numbers from 0 to 10 and I need exactly 20 increments? For that there's np.linspace: you give it the start, the stop, and how many evenly spaced points you want. Say we have a data set that we know covers a ten-year period and we need to divide that period into 20 equal steps; linspace does exactly that, and you can see it produces 20 values running from 0 all the way up to 10, endpoint included.
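A sketch of all three; the string values are invented for the example, so the dtype differs from the one in the video:

```python
arr_s = np.array(["buy", "sell", "hold"])
print(arr_s.dtype)    # <U4 -- fixed-width unicode strings

arr_r = np.arange(0, 20, 2)    # start, stop (exclusive), step
print(arr_r)                   # [ 0  2  4  6  8 10 12 14 16 18] -- no 20

arr_l = np.linspace(0, 10, 20)  # 20 evenly spaced points, endpoint included
print(arr_l[0], arr_l[-1])      # 0.0 10.0
```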
Then we can also do random numbers: there's np.random, and if you're building neural networks you usually start by seeding them with random values. We'll just call np.random.random (we'll stop giving everything unique names), print it, and run it, and you can see we get random numbers between zero and one; all these values are under one. You can easily alter that by multiplying them out, say if you want 0 to 100, and you can round them up if you need integers from 0 to 100; there are all kinds of things you can do, but by itself it generates random floats between zero and one. You also have a couple of options for shape: you can reshape the result, or you can just generate the values in whatever shape you want. Here we asked for three and four, so you get three rows by four values, which is the same thing as generating 12 values and reshaping them to three by four.
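Continuing the session, roughly:

```python
arr = np.random.random((3, 4))               # 3x4 grid of floats in [0, 1)
arr100 = (arr * 100).round()                 # stretch 0-1 values out to 0-100
same = np.random.random(12).reshape(3, 4)    # 12 values reshaped to 3x4
print(arr)
```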
And if you're going to do that, you might need an empty data set. I've had this come up many times: maybe you need to start off with zeros because you're going to be adding values in, or you need zeros and ones because you're removing the background of an image. You set the background to zero, figure out where the subject is, set all those boxes to one, and you've created a mask; creating masks over images is a really big use of NumPy arrays. So we'll do np.zeros, and in this case we'll pass the shape 2, 3 (I forgot the parentheses around the shape at first; I knew I was forgetting something). When we run it, you can see our zeros in a row: maybe this is a mask for a very small image, two rows of three pixels, a little tiny picture. And maybe you want to work the opposite way: instead of creating a mask of zeros and filling in ones, create a mask of ones and fill in zeros. Just like before, np.ones with a 3 by 4 shape, and when we run it, it's all ones. We could even do a 10 by 10 icon with its three color channels, which creates quite a large array for doing pictures and the like once you add that third dimension in. If we take that off, we're back to a little array.
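A sketch of those empty and pre-filled shapes:

```python
mask = np.zeros((2, 3))        # a tiny all-zero "image mask"
ones = np.ones((3, 4))         # the opposite: all ones, zeroed in later
icon = np.zeros((10, 10, 3))   # 10x10 pixels x 3 color channels
print(mask, ones, icon.shape, sep="\n")
```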
Let's take a small array, 0, 1, 2, and look at repeat. You can do a repeat of the array; maybe you need it repeated three times. Run it and you'll see 0 0 0 1 1 1 2 2 2. Whenever I think of "repeat" I don't really picture repeating the first digit three times, then the second, and so on; I always picture 0 1 2, 0 1 2, 0 1 2, and it catches me every time. That pattern is what tile gives you: run it with three and you generate 0 1 2 0 1 2 0 1 2. And if you're dealing with an identity matrix, we can do that too, if you're big on doing your matrices. I'll spell it out as "identity matrix" today; the command we're looking for is np.eye, with 3, and we'll print it out. There's our identity matrix: a 3 by 3 array with the ones down the middle, for doing your different matrix math. We can manipulate that a little bit too: np.diag with 1, 2, 3, 4, 5, and when we run it (again, putting a bare value on the last line is the same as wrapping it in print) it generates a matrix with the diagonal 1 2 3 4 5. There's the beginning of a matrix setup for working with matrices.
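Roughly, in code (note that the 0 1 2, 0 1 2 pattern comes from tile rather than repeat):

```python
a = np.array([0, 1, 2])
print(np.repeat(a, 3))   # [0 0 0 1 1 1 2 2 2] -- each element repeated
print(np.tile(a, 3))     # [0 1 2 0 1 2 0 1 2] -- the whole array repeated

print(np.eye(3))                  # 3x3 identity: ones down the diagonal
print(np.diag([1, 2, 3, 4, 5]))   # 5x5 matrix with 1..5 on the diagonal
```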
We can actually go in reverse, too. Let's create an array equal to np.random.random with a 5 by 5 shape (oops, there we go, five by five), and print it without the surrounding brackets so you can see what it looks like: there's our 5 by 5 array. Now, since we're working with matrices, we might want to extract the diagonals, and we simply type np.diagonal with our array in it. That returns a value, so on the last line of a cell it prints out, and you can see the diagonal running across our matrix. We did talk about shape earlier: you can print the shape, and you can also look at ndim, which is very similar and just reports two dimensions, and size, which comes out as 25 for this array. So: size 25, two dimensions, and of course the 5 by 5 shape from earlier. And remember random? You can also do random integers; I talked a little about manipulating the 0-to-1 floats, but np.random.randint generates integers directly. We'll generate four random integers between minus 10 and 10, and when we run it we get something like 7, -3, -6, -3: four of them, all in that range.
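In code, continuing the session:

```python
arr = np.random.random((5, 5))
print(np.diagonal(arr))   # extract the diagonal back out of a matrix
print(arr.shape)          # (5, 5)
print(arr.ndim)           # 2 -- number of dimensions
print(arr.size)           # 25 -- total number of elements

print(np.random.randint(-10, 10, 4))   # four random ints in [-10, 10)
```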
Now we jump into some of the functionality of arrays, which is really great, because this is where they shine. Take my array from up above and add 10: that adds 10 to every value. Oh, these are decimals, that's right; this is the random-decimal array I had stored, so it takes the random numbers from 0 to 1 and adds 10 to each. We can just as easily divide: divide by two and every random number we generated is cut in half, so now all these numbers are under 0.5. That's another way to shift numbers to whatever range you need. As you dig deeper into NumPy, we can also do exponentials: np.exp is the exponential function, e to the x applied element-wise, which generates some interesting numbers off the random values (I don't even remember what the originals were, so here they are for comparison). And just like e to the x, you can do the log: if you're doing logarithmic functions, say in reinforcement learning, you might use some kind of log setup, and you can see the logarithm of each value in the array. Note that it isn't the log of a literal 1, 2, 3, 4, 5: np.log and np.log2 are actual commands, not variables going in, and there are a number of them, so you'll have to go look at the documentation. You also get trigonometry: np.sin takes the sine of every value, and if you have sine you of course have cosine; there's the cosine of those. And if you're doing activations in your NumPy array, there's the hyperbolic tangent: tanh is one of the classic neural-network activation functions, because it forms a nice curve between one and negative one.
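A sketch of those element-wise operations on the same array:

```python
print(arr + 10)     # adds 10 to every element
print(arr / 2)      # halves every element
print(np.exp(arr))  # e**x, element-wise
print(np.log(arr))  # natural log; np.log2 and np.log10 also exist
print(np.sin(arr), np.cos(arr))
print(np.tanh(arr)) # hyperbolic tangent, the classic neural-net activation
```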
Then we get into aggregation; let me put the array back out there so we can see it while we're doing this. You can sum the values: np.sum gives the summation of every value in the array, and if you added all of these together (I don't know what the whole total is off-hand) that's the number you'd get. One of the things you can also do is sum by axis. With axis=0 we're summing down each column, so you get one total per column; switch it to axis=1 and you get the summation of each row, and so forth going down. And maybe you don't need the summation: maybe what you're looking for is the minimum. Here's np.min, and this comes up a lot, because say you have errors and you want to find the minimal error in each column: 0.645 is the smallest number in this first column, and so on. And if you have a minimum, you might also want the max: maybe we're looking for the maximum profit, and you can see 0.79 is the maximum in that first column. Just like before, you can change the axis to one, or take the axis out entirely and find the max value for the whole array, which here was around 0.8344.
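Roughly:

```python
print(np.sum(arr))            # grand total of every element
print(np.sum(arr, axis=0))    # one sum per column
print(np.sum(arr, axis=1))    # one sum per row
print(np.min(arr, axis=0))    # smallest value in each column
print(np.max(arr, axis=0))    # largest value in each column
print(np.max(arr))            # single largest value in the whole array
```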
And since we're talking data analytics, we want to look at the mean, which is pretty much the average; this is the mean across the whole array, and just like before, you can pass axis=0 and get the mean of each column. If we have the mean, we might want the median, the middle value when the data is sorted. And with the median we might want the standard deviation: if you report the average, a lot of times you give the mean alongside the standard deviation. If we're doing standard deviations, there's also variance, which is its square. So we've looked at variance, standard deviation, the median, and the mean; there are more, but those are the most common ones used in data analytics when you're going through your data and figuring out what you're going to present to the shareholders.
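In code:

```python
print(np.mean(arr))           # average of the whole array
print(np.mean(arr, axis=0))   # average per column
print(np.median(arr))         # middle value
print(np.std(arr))            # standard deviation
print(np.var(arr))            # variance (the std squared)
```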
Some other things we can do: we can take slices, and you'll hear that terminology a lot. Say we have a 5 by 5 array but we don't want the whole thing: maybe we want rows from one onward (we don't want row zero), and on the second part just column two. The notation arr[1:, 2:3] says "row one to the end", and when we run it you can see it generates rows one through four but only a single column, because, remember, the end of a slice isn't included: 2:3 stops before three. If you wanted columns two and three, you'd go 2:4, which runs up to, but not including, four. We can also do this in reverse, just like we learned earlier: go to minus one (oops), and it's the same thing, because with columns 0 1 2 3 4, a stop of -1 means up to the last one, the same as 2 to 4.
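A sketch of that slicing:

```python
print(arr[1:, 2:3])    # rows 1..4, column 2 only (slice ends are excluded)
print(arr[1:, 2:4])    # rows 1..4, columns 2 and 3
print(arr[1:, 2:-1])   # same columns via a negative stop: up to the last one
```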
Also very common with arrays: you're going to want to sort them. np.sort, and we'll throw an axis back in there: with axis=1 you can see it sorts each row, from the lowest value (the 0.2) up to the highest. You can change that to axis=0 if your values are organized by column and you want to sort down the columns. And of course you can sort the whole array flattened out; I don't usually do that, but I guess it might come up, and you can see you get a nice sorted array.
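Roughly:

```python
print(np.sort(arr, axis=1))      # sort the values within each row
print(np.sort(arr, axis=0))      # sort down each column instead
print(np.sort(arr, axis=None))   # flatten and sort the entire array
```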
Let's reprint our array so we can look at it again; we're starting to get too many cells up there. Something else you can do with an array is transpose it, and this comes up more than you'd think. When you transpose, the rows and the columns are swapped, so the value that was at one index (the 0.795, say) moves to the mirrored index. You can see this more dramatically if we take a slice of the first couple of rows with all their columns, and print the same slice next to its transpose: the slice comes up with five columns across a couple of rows, and after transposing we have five rows and a couple of columns instead. When they first put this together it was a function, np.transpose, and that still works: you can see it generates the same value as just the capital .T. We flip data like this a lot: you'll have x-y values, or an image, being read one way by one process while the next process needs it the opposite way, so it really pays to know how to transpose data quickly. Sticking with the transpose, we might instead need something called flattening. Why would you flatten your data? If this array is going into a neural network, you might want to send it in as one long row of values instead of two rows. So we've covered our scientific pieces: mean, transpose, median, and some different variations.
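In code:

```python
print(arr.T)                 # rows and columns swapped
print(np.transpose(arr))     # the older function form -- same result
print(arr[:2, :].T.shape)    # a 2x5 slice becomes 5x2 once transposed
print(arr.flatten())         # the 2-D array squashed into one long row
```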
Some of the other things we want to do: what happens if we want to append to our array? Let's create a new one; I'm getting tired of looking at the same set of random numbers we generated earlier, so we'll make something simpler that's easier to watch: 4, 5, 6, 7, 8. That's good enough, and if we print this array, there it is. Now we might want to append something to extend it. You have to be careful about appending to NumPy arrays, for a couple of reasons. One is runtime: because of the way a NumPy array is laid out, a lot of times you build your data elsewhere and then push it into the NumPy array, instead of continually adding onto the array. It also usually generates a copy automatically, which protects your data but costs time. So there are reasons to be careful, but you can certainly do it: take our array, append an 8, assign the result to arr1, and when we print arr1 you'll see 4, 5, 6, 7, and there's our 8 appended onto the end.
If you're going to append to an array (whoops, arr1, let's try that again; there we go, now we have the 8 appended on the end, so you can see 4, 5, 6, 7, 8 and then the extra 8), you'd probably also want insert, because it might be that you need to keep a certain order. We do the same thing with our array, but insert at the beginning: insert 1, 2, 3 at position zero, print the result, and you can see 1, 2, 3 inserted at the front. Insert is a lot more powerful in that you can put the values anywhere in the array: we can move them to position one, and there we go. We can do a minus one just for fun, and you'll see it lands counting backwards by one from the end. I imagine you could even do minus zero, because that just registers as zero: it takes the minus sign off.
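Roughly:

```python
arr1 = np.array([4, 5, 6, 7, 8])
arr1 = np.append(arr1, 8)             # returns a COPY with 8 on the end
print(arr1)                           # [4 5 6 7 8 8]

arr2 = np.insert(arr1, 0, [1, 2, 3])  # insert 1,2,3 at the very front
print(arr2)                           # [1 2 3 4 5 6 7 8 8]
print(np.insert(arr1, 1, [1, 2, 3]))  # or at position 1 instead
print(np.insert(arr1, -1, 9))         # negative: lands before the last element
```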
And just like we add numbers on, we might want to delete numbers, so let's do np.delete. Let's keep it easy to watch: we'll create arr3 as np.delete on the array we were just working with, print it out, and we've deleted the element at index one right out of there. We can also pass several positions at once, say 1 and 3, and if we run that, you'll see we've deleted the index-one and index-three spots, which removed our 2 and our 4. Keep in mind when you're deleting that there's a time element involved, as far as where the data is coming from, and it's really easy to delete the wrong data and corrupt what you're working on, or to insert things where you don't want them. So there's always the copy: we'll create arr_c equal to a copy of the NumPy array we just made. Maybe you want to protect your original data, or maybe you're making a mask: you copy the array, make all your alterations on the new one, and change its values to zeros and ones to mask over the first one. And of course, if we display arr_c, since it equals a copy, it shows the same values.
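In code:

```python
arr3 = np.delete(arr2, 1)         # remove the element at index 1
print(arr3)
print(np.delete(arr2, [1, 3]))    # remove indices 1 and 3 in one call

arr_c = arr3.copy()               # an independent copy protects the original
arr_c[0] = 99
print(arr3[0], arr_c[0])          # the original is untouched
```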
Next, combining and splitting arrays. I end up doing a lot of this, and I don't know how many times I end up fiddling with it and making a mess, but you do it constantly: you combine your arrays, you split them, you need one set of data for one thing and another set for something else. So let's create two arrays, arr1 and arr2, and the terminology to look for is concatenate: we're taking arrays one and two and joining them. It's very important to pay attention to your axes and your counts: if the shapes don't line up and I'm merging on axis zero, it's going to give me an error and I'll have to reshape them, so you've got to make sure that whatever you're concatenating works along the axis you chose. Along the zero axis these each have four values, so the result is stacked 2 by 4 arrays; switch the axis to one and it flips a little, so now we have 1 2 3 4 5 6 7 8 running along the rows. It's interesting which shapes work: with two 2 by 2 arrays and axis one it runs and gives me an answer, but if I switch to axis zero where the lengths come out mismatched, three against five, it gives an error. So be really careful that whatever axes you're putting together actually match. Since axis one has two entities and we're going row by row, you can see it lets it merge right onto the end there.
You could imagine this being an x-y plot: the x value going in, the predicted y value coming out, and then another prediction you want to combine alongside; this works really easily for that. Let's put this back the way we had it (oops, I forgot how many changes I made; there we go). So we went through the different concatenations, and the axis really matters when you're doing your concatenate; we'll switch back to one because I like the look of that better. There's also vstack, which does the same concatenation without you passing an axis, because the v stands for vertical: it's the same as making the axis zero. And if you're going to have a vertical stack, you can also have a horizontal one: change vstack to hstack and running it matches what the axis argument gives you. The process is identical in the background; vstack and hstack are something of a legacy setup, and most people just use concatenate and put the axis in, because it has a lot more clarity.
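A sketch of that joining:

```python
a1 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
a2 = np.array([[9, 10, 11, 12], [13, 14, 15, 16]])

print(np.concatenate((a1, a2), axis=0))  # stacked as extra rows (shapes must match)
print(np.concatenate((a1, a2), axis=1))  # glued on as extra columns

print(np.vstack((a1, a2)))   # legacy shorthand for axis=0
print(np.hstack((a1, a2)))   # legacy shorthand for axis=1
```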
This next part is kind of data exploration, and that'll make a little more sense in just a moment; sometimes they're called set operations. Say we have an array like 1, 2, 3, 4, 5, 6, 3, and so on, whatever it is: a nice little array. What I want to do is find the unique values in it. Maybe I'm generating a one-hot encoder, so each word is represented by a number and I need to know how long my bit array is going to be, which means I need to know how many distinct words there are; or maybe we're doing a word count, a very popular thing to do. When we run np.unique, we get 1 2 3 4 5 6: those are our unique values. We can go further than just the uniques and ask for the count of each unique value too. It's very similar to what we just did, but we add a little more into the arguments: return_counts=True. Instead of returning only the unique values, it tells us how many of each there are: the unique values 1 through 6 just like before, and then two ones, two twos, two threes, two fours, one five, two sixes. You could go through and count them by hand if you wanted, but this is a quick way to find the distribution of your values; you might want to know how often the word "the" is used versus another word, if each word is represented by a unique number.
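Roughly:

```python
a = np.array([1, 2, 3, 1, 2, 3, 4, 5, 6, 6, 4])
print(np.unique(a))   # [1 2 3 4 5 6]

values, counts = np.unique(a, return_counts=True)
print(values)   # the unique values...
print(counts)   # ...and how many times each one appears
```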
Along with the set operations (let me just put a note up here), the next thing we want to know is: where do these two arrays intersect? We have 1, 2, 3, 4, 5 and 3, 4, 5, 6, 7, and running the intersection shows they meet at 3, 4, 5: that's what they have in common. Since we're going to go through a couple of different options, let's change it from intersect1d and print each one the same way. So we might want the intersection, where they have commonalities. Another unique word is union1d: instead of the intersection, we want all the values that appear across both arrays, and when we run it we get 1 2 3 4 5 6 7, every unique value in either. Then we might want the set difference (remember, "set" is what they call these things): the set difference of the 1-D arrays shows that 1 and 2 live only in the first array, and the reverse tells you what's in array two but not array one. So we have four different options here: the intersection, what they both have in common; the union, all the unique values in both arrays; the difference, what's in array one but not array two, setdiff1d; and setxor1d, what's in one or the other but not both.
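In code:

```python
a1 = np.array([1, 2, 3, 4, 5])
a2 = np.array([3, 4, 5, 6, 7])

print(np.intersect1d(a1, a2))  # [3 4 5] -- in both
print(np.union1d(a1, a2))      # [1 2 3 4 5 6 7] -- all unique values
print(np.setdiff1d(a1, a2))    # [1 2] -- in a1 but not a2
print(np.setxor1d(a1, a2))     # [1 2 6 7] -- in one but not both
```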
So we dug deep into NumPy, because there are a lot of different little mathematical things going on in it. A lot of this can also be done in Pandas, although the heavy lifting is usually left to NumPy, since that's what it's designed for. Let's open up another Python 3 notebook, because now we want to explore what happens when you want to display this data. This is where it starts getting, in my opinion, a little fun, because you're actually playing with it and you have something to show people. We'll rename this one "pandas and pyplot", just so we remember it next time.
We want to import the necessary libraries. We're going to import pandas as pd; remember, this gives us a data frame, so we're talking rows and columns, and you'll see how nicely Pandas works when you're actually showing data to people. Then we have NumPy in the background; NumPy works with Pandas, so a lot of times you just import it by default. Seaborn sits on top of the Matplotlib library: it's one of those hundred-odd packages that extend it, probably the most commonly used because of all its built-in functionality, and I usually pull it in almost by default in case I need it. And of course we import matplotlib's pyplot as plt. Note the aliases: pd, np, sns, plt. Those are pretty standard, so when you're doing your imports I'd keep them, so other people can read your code and it makes sense to them; that's pretty much the convention nowadays. Then we have the strange line: the percent sign followed by "matplotlib inline". That's for Jupyter Notebook only; if you run this in a different package you'll get a popup window when it goes to display the Matplotlib output, whereas this renders it on the page. With the most current version of Jupyter you can usually leave it out and it will still display inline, and we'll see what that looks like. Then we do the Seaborn sns.set with color_codes=True, keeping the defaults so we don't have to think about it. If we don't run this cell and then access one of these libraries afterward, it'll crash. The cool thing about Jupyter notebooks is that if you forgot to import one of these, or forgot to install it (you do have to install them under your Anaconda setup, or whatever setup you're in), you can flip over to Anaconda, run the install, and then just come back and run the cell; you don't have to close anything out.
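The import cell looks roughly like this:

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline   # Jupyter-only magic: render plots inside the notebook

sns.set(color_codes=True)   # keep seaborn's default color handling
```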
We'll paste this next one in real quick: car = pd.read_csv() with the actual path. The path will of course vary depending on what you're working with; it's wherever you saved the file. You can see mine is quite long, something like OneDrive/Documents/SimplyLearn/Python data analytics using Python/car.csv. When we open that file up, it's a CSV with the make, the model, the year, the engine fuel type, the engine horsepower, cylinders, and so on. It's just a comma-separated file: each row is a row of data (think of it as a spreadsheet), each comma-separated position is a column, and as you can see, the make, model, and so on form a header row. Pandas does an excellent job of pulling all of this in automatically, so when you see it land in Pandas, you realize you're already about halfway done with getting your data in; I just love Pandas for that reason. NumPy can also load a CSV directly, but we're working with Pandas here.
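Roughly (the path depends entirely on where you saved the file):

```python
car = pd.read_csv("car.csv")   # adjust the path to wherever your CSV lives
```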
This is where it really gets cool: I can come down here and, remember the print statement, we can actually get rid of it and just write car.head(), because the bare expression prints. head shows the top rows of the data file we just read, and you can see it does a nice printout, all nicely inline because we're in Jupyter Notebook; I can scroll back and forth and look at the different data, and just as we expected, it brought the header right in as column names. One thing to note is the index: it automatically created one, 0, 1, 2, 3, 4, and so on, and with head we're looking at rows 0 through 4. You can change this: maybe you just want the top two, and running that gives our top two BMWs. Another thing we can do, instead of head, is tail, to look at, say, the last three rows in the file, and you can see it numbered them all the way up to 11,910. Oh my goodness, they put a lot of data in this file; I didn't even look to see how big it was. So you can really easily get through and view the data in here. When you're talking about big data, you almost never just print out the whole frame: in fact, if we run a cell with just car in it, it's so huge that Pandas automatically truncates it and shows head plus tail. We really don't want to look at the whole thing, so we'll go back to head for displaying our data; there's the head of our data, a quick look at what's actually in there.
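In code:

```python
car.head()    # first five rows, with the CSV header as column names
car.head(2)   # just the top two
car.tail(3)   # last three rows -- the index runs up past 11,900
car           # a bare frame this big gets truncated to head + tail
```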
I can zoom out for a better view, though we'll keep it zoomed in so you can see the code I'm working on. From the data standpoint, of course, we want to look at the data types: what's going on with our data, what does it look like? (The table view itself is like a spreadsheet, a nice way of displaying pieces of the data when you're talking to people.) With the data types we're getting into the data science side of it. What are we working with? We have a make and model as objects, an int64 for the year, and engine fuel type as an object; if you scroll up you can see why, since most of those are text values. Everything comes out as a float64, an int64, or an object, which is the way Pandas classifies the columns. We can also ask for the columns, and since it loaded the header automatically, we have the make, the model, the year, the engine size, all the way up to the MSRP. One thing you'll see come up a lot: whenever you're in Pandas and you type .values, it converts from a Pandas structure to a NumPy array, and that's true of any of these, so watch for that little switch in how the data is actually stored. In this case we did car.columns, which gives the total list of column names.
Like any good data scientist, we want an analytical summary of the data set: what's going on with our data, so we can start piecing things together. A nice Pandas command for that is car.describe(include='all'); if you've worked with R, this should start looking familiar. Coming down through the output you can see, for each column like make, model, and year: the count, how many unique values, the top (most common) value and its frequency, and the mean. Clearly, for the object columns it can't tell you an average, so you just get the top values; for numeric columns like year you get the average, and below that your standard deviation, your minimum value, your maximum value, the 25% mark for the lower quarter, the 50% mark (where the median line sits), and the 75% mark, the top quarter heading into the max.
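In code:

```python
car.describe(include="all")   # count, unique, top, freq, mean, std,
                              # min, 25% / 50% / 75% quartiles, max
```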
Now, this next part is just cool: this is what we always wanted computers to be back in the '90s. Instead of 5,000 lines of code to do this, well, maybe not 5,000: I built my own plotting library back in '95, and a simple plot took, I don't know, probably about 100 lines of code. This is being done in one line. We have car, the Pandas data frame we generated, and we call hist for histogram; that's the power of this stack, with Seaborn sitting on top while it still renders through Matplotlib underneath. We can pass a figsize so it fits nicely on the page, and with something that simple it comes up, does its subplots, and we're looking at a histogram of every numeric column in our data. Engine cylinders is always a good one: you can see some came out at zero because they had nulls in there, maybe one car had a two-cylinder engine way back when, four is common, six a little less common, and then you see the 8-cylinder and 12-cylinder engines; boy, that's got to be a Speedster or something. It just breaks it all down, so now you have how many cars with how many cylinders, horsepower, and so on, and it does a nice job displaying it. If you're going into a demo, it's really nice to be able to type that in and boom, there it all is.
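The one-liner, roughly:

```python
car.hist(figsize=(20, 15))   # one histogram per numeric column
plt.show()
```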
Next we'll call Seaborn's box plot, sns.boxplot, and plot vehicle size against engine horsepower, an x-y plot with the data coming from car. If we run it, we end up with a nice box plot: you see midsize, compact, and large, you can see the variation, and there's our outlier showing up on the compact (that must be a high-end sports car); the large cars might have a couple of engine options, and again we have all these outliers and the deviation around each box. It's a very powerful and quick way to zero in on one small piece of data and display it for people who need it reduced to something they can see, look at, and understand. That's our Seaborn box plot, sns.boxplot.
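Roughly as follows; the column names are assumed from this particular CSV's header:

```python
sns.boxplot(x="Vehicle Size", y="Engine HP", data=car)
plt.show()
```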
Backing out for a wider look, there's what they call pair plotting. Run it and you can see that Seaborn just does all the work for you: it takes every numeric column and plots it against every other one in a grid. In this grid, if you look at one space (you might not be able to read the small label, but it says engine horsepower), it's engine horsepower against the year the car was built, and everything to the right of the middle diagonal is just the mirror image of what's on the left. As you'd expect, engine horsepower gets bigger and bigger as time goes on: the further up the years you go, the more horsepower, and you can see trends like that across the whole pair plot. And look how fast that was: it took a moment to process, but right away I get a nice view of all this information that I can scan visually to see how things group.
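The whole grid comes from one call (it can take a moment on a frame this size):

```python
sns.pairplot(car)   # scatter of every numeric column against every other
plt.show()
```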
Now, if I were running a meeting, I probably wouldn't show all that data. One of the things I've learned over the years is that people, myself included, love to show all our work; we were taught in school to show your work and prove what you know. The CEO does not want to see a huge grid of graphs, I
guarantee it. So what we want to do is drop the stuff we might not be interested in. I'm not really a car person (the guy in the back obviously is), so: engine fuel type, we're going to drop that, and we'll drop market category, vehicle style, popularity, number of doors, and vehicle size. Notice we pass the axis here: if you remember from NumPy, we have to include the axis to make it clear what we're operating on, and that's also true with Pandas. Then we look at the head, and you can see we dropped out those categories; now we have the make, model, year, and so forth. When you start working with Pandas (I just love Pandas for this reason), look how easy that is, and it just displays the result as a nice spreadsheet you can look at and view very easily. It's also the same kind of view you'll get working in Spark, or PySpark, which is Python for Spark across big data; this is the kind of interface they converge on, and it's why Pandas is so powerful.
We might look at this and decide we don't like these column names either, so we can go in and rename them. Instead of the lengthy "engine horsepower" we just want "HP"; we don't need to know it's the engine's horsepower, and "engine cylinders" doesn't need the "engine" either, because if we're talking about cars there's only one thing with cylinders. We run it, and here's car.head() again: you can see how that changed, with the shorter names next to the model and year. The more readable you get it, the better, and we can also adjust the display width a little so that instead of splitting a row across two lines, it prints on a single line.
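A sketch of the drop and rename; the exact column names are assumed from this CSV:

```python
# drop the columns we will not present
car = car.drop(["Engine Fuel Type", "Market Category", "Vehicle Style",
                "Popularity", "Number of Doors", "Vehicle Size"], axis=1)

# shorten the remaining headers for readability
car = car.rename(columns={"Engine HP": "HP",
                          "Engine Cylinders": "Cylinders"})
car.head()
```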
And if you remember shape from NumPy, Pandas works the same way: we can look at the shape of the data, and we have 11,914 rows and 10 columns, so you see some similarities, because Pandas sits on top of NumPy. Next, duplicate rows. Look at this switch in syntax: car[car.duplicated()] is a selection, a Pandas selection with the brackets, but we're selecting based on car.duplicated(), so we're pulling out just how many duplicates are in there. It starts to look a little different from how we've accessed the data so far, and this can be one of those troubleshooting things we end up doing a lot more than we really feel we should. We might run a car.count() just to see how many rows we're dealing with, and right after that say: let's drop the duplicates, since we just found them all. So car = car.drop_duplicates(), and we can print the head again; the data looks the same as before. Just note that we did reassign the result to car: some operations can work on the actual values in place, depending on what you're doing, but by default this returns a copy, so we assign it back. Same header, but we want to see how the count changed: run car.count() and you can see the number of rows drops from 11,914.
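Roughly:

```python
print(car.shape)                     # (rows, columns), just like numpy
duplicates = car[car.duplicated()]   # boolean selection: only repeated rows
print(duplicates.shape)

car = car.drop_duplicates()          # returns a copy, so reassign it to car
print(car.count())                   # per-column row counts, after the drop
```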
Next, null values. car.isnull() flags the values that are null, and then we want to sum that up: when we do car.isnull().sum(), we find that HP, the horsepower, has 69 null values and cylinders has 30. If you don't put the sum on the end, it just returns a mask of true/false, is-it-null-or-not, as zeros and ones, so you're summing up the ones under each column. Then, of course, you have to decide what to do with the null values, and there are a lot of options: it might be that you put in the average or mean, maybe you want the median value; there are a lot of different ways to fill them. Usually when you first start out with a data set, a lot of the time you just drop your null values, and you can see that here: car.dropna(), then we count again, and we're down to around 10,827 rows. This is really a big part of cleaning data: you need to know how to get rid of your null values, or at least count them and decide what to do with them. And if we go back to counting our null values, there should now be no nulls, and you'll see there are zero. I don't know how many times I've been running a model that doesn't take null values and it crashes, and I just sit there looking at it, trying to figure out why it crashed when it should have worked: it's because I forgot to remove the null values.
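In code:

```python
print(car.isnull().sum())   # count of missing values per column

car = car.dropna()          # simplest fix: drop every row holding a null
print(car.count())
print(car.isnull().sum())   # should now be all zeros
```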
We've been jumping around a lot; let's go back to finding outliers and bring that back into our Seaborn. If you remember, we did a box plot earlier; this time we'll do a box plot on just the price. You can see our price value, the deviation shown by the two thinner bars on each side of the main box, and then, as we get out here, all these outliers, including one way out at the end. If you were doing fraud analysis you'd be jumping all over these outliers: why do these deviate from the standard, what are these people doing? Here it's probably, like I said, a really high-end expensive car out there; that's what we're looking at. We can also look at the box plot for the horsepower: put that in down below and run it, and again, there's our horsepower, and it just jumps out that there are these really odd, huge muscle cars out at the edge as outliers.
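Roughly, using the column names from this frame (MSRP is the price column here, HP the renamed horsepower):

```python
sns.boxplot(x=car["MSRP"])   # lone points past the whiskers are the outliers
plt.show()
sns.boxplot(x=car["HP"])
plt.show()
```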
Making this a little more presentable as you start displaying your data or your information to your shareholders, let's plot a histogram of the number of cars per brand. The first thing we do with our car frame is take the make column's value counts, keep the largest ones, and plot them with kind equals bar and a figsize. value_counts gives the count per make, and you can see how the different makes compare: Chevrolet puts out a lot of different kinds of cars; I didn't realize they made that many types. Then, for readability, let's add a title, "Number of cars by make", and label the axes with number of cars and make. If you had looked at this chart cold the first time, you'd have wondered what the heck you were looking at; now it says we're looking at the number of cars by make, and you can see the spread across the different manufacturers: Lotus, I guess, only had a few different kinds of cars over there.
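A sketch; the number of makes to keep is an assumption:

```python
car["Make"].value_counts().nlargest(30).plot(kind="bar", figsize=(10, 5))
plt.title("Number of cars by make")
plt.ylabel("Number of cars")
plt.xlabel("Make")
plt.show()
```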
One of the things I'm most interested in is the relationship between the variables, so this is always a place to start: we want to know what's going on with our variables and how they connect with each other. The first thing we do is set a figure size, because we want to make sure the graph fits: plt.figure with a figsize of 20 by 10. If you've never used Matplotlib, which is sitting behind Seaborn: whatever is in plt is like a canvas you're painting on, so the second you load pyplot as plt, anything you do to it affects everything Seaborn draws. Then we create a variable c, for correlations, equal to car.corr(): that's the correlation computed on top of Pandas, again one line, and you get the whole correlation matrix. And because we're working with Seaborn, let's put it into a nice heat map; if you're not familiar with heat maps, that means the numbers are rendered as colors, a
visual. We can see here that Seaborn, connected to the Pandas frame, prints out a nice chart. We'll talk about the color in a second, but this is the chart I look at as a data scientist; these are the numbers I want to look at. Let's highlight one: here's cylinders versus horsepower. The closer to one, the higher the correlation, so 0.788 is a pretty high correlation between the number of cylinders and how much horsepower there is. I'm betting if you looked at year versus horsepower: here it is, 0.314, not as strong, but if you consider them together (you don't literally add them) you start to see the increase in horsepower per year, and with cylinders you could probably get a correlation there too. And just as 0.78 is a positive correlation, you'll notice negatives: look at horsepower against mileage, or, a bigger number, cylinders against miles per gallon, and it's about minus 0.6. Then the chart you'd actually show people is the nice heat map: it's all our colors, just those numbers rendered as color, and the darker the color, the higher the correlation. Straight down the middle diagonal, obviously, the year correlates directly with the year, horsepower with horsepower, and so on; that's why it's a one. The closer to the one, the higher the correlation between the two pieces of data.
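Roughly as follows; the colormap and the annotation of the raw numbers are choices of mine, and newer pandas versions may need corr(numeric_only=True):

```python
plt.figure(figsize=(20, 10))    # pyplot is the canvas seaborn paints on
c = car.corr()                  # the whole correlation matrix, one line
sns.heatmap(c, cmap="BrBG", annot=True)   # colors plus the raw numbers
plt.show()
```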
This was a good introduction, but Pandas goes way beyond it: most of the functionality in NumPy is also in Pandas, since Pandas sits on top of it, and then it has additional features of its own. We also used Seaborn pretty extensively sitting on top of our pyplot, so keep in mind that pyplot has a ton of other features we didn't even touch; we couldn't cover them all even with a whole course on it, there are just so many things hidden in there depending on the domain you're working in. But you can see here our Seaborn and our Matplotlib, which produced all the graphics we did, and how nicely Seaborn works with the Pandas frame. And with that: if you have queries regarding any of the topics covered in this session, or if you require any of the resources we used, like the presentation, the demonstration code, documentation, or data sets, please feel free to let us know in the comment section below, and our team of experts will be more than happy to resolve all your queries at the earliest. Until next time: thank you, stay safe, and keep learning.