Introduction to Apache Hive
Apache Hive is an open-source data warehousing software built on top of Apache Hadoop, providing a SQL-like interface for querying and analyzing large datasets stored in Hadoop's HDFS and other file systems like Amazon S3. It simplifies Hadoop data processing by abstracting complex MapReduce jobs and eliminating the need to learn Java or Hadoop APIs.
Why Apache Hive?
- Traditional RDBMSs cannot handle massive data volumes, such as Facebook's billions of users and terabytes of data.
- Hadoop handles big data but lacks an easy query interface.
- Hive bridges this gap by offering SQL-like queries on Hadoop data.
Key Features of Apache Hive
- SQL-like query language for ease of use.
- OLAP-based design for multi-dimensional data analysis.
- High scalability and extensibility using Hadoop file systems.
- Faster query development on large datasets thanks to the familiar SQL-like syntax (execution itself runs as batch jobs).
- Supports ad hoc querying and data summarization.
Apache Hive Architecture
- Hive Client: Supports Java, Python, C++ applications via Thrift Server, JDBC, and ODBC drivers.
- Hive Services: Includes CLI, Web UI, Metastore (central metadata repository), Hive Server, Driver, Compiler, and Execution Engine.
- Execution Engine: Converts queries into MapReduce jobs that run over data stored in the Hadoop Distributed File System (HDFS).
Components of Apache Hive
- Shell: Interface to write and execute Hive queries.
- Metastore: Stores metadata about tables, partitions, and schemas.
- Execution Engine: Translates queries into executable tasks.
- Driver: Manages query lifecycle and execution.
- Compiler: Compiles HiveQL into MapReduce jobs.
Installing Apache Hive on Windows
- Use Oracle VirtualBox to run Cloudera QuickStart VM.
- Import and start the VM with at least 8GB RAM.
- Access Hive through Hue web interface with default credentials (username/password: cloudera).
Hive Data Types and Models
- Supports standard data types: tinyint, smallint, int, bigint, float, double, string, boolean.
- Data models include databases, tables (internal/managed and external), partitions, and buckets.
- Partitions help organize data for efficient querying (e.g., by course or section).
- Bucketing clusters data into a fixed number of files (buckets) based on a hash of a chosen column, which can improve query performance.
Creating and Managing Tables
- Internal tables store data managed by Hive; deleting the table deletes data.
- External tables link to data stored externally; deleting the table does not delete data.
- Commands to create, describe, and alter tables including adding columns and renaming.
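As a quick illustration, the two table types look roughly like this in HiveQL (the employee columns follow the demo later on this page; names are otherwise illustrative):

    -- internal (managed) table: DROP TABLE also deletes the data files
    CREATE TABLE employee (id INT, name STRING, salary FLOAT, age INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- external table: DROP TABLE removes only the metadata, the files remain
    CREATE EXTERNAL TABLE employee_ext (id INT, name STRING, salary FLOAT, age INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/cloudera/employee_ext';   -- path is an example

    DESCRIBE FORMATTED employee;              -- "Table Type" shows MANAGED_TABLE or EXTERNAL_TABLE
    ALTER TABLE employee RENAME TO emp_table; -- rename the table
    ALTER TABLE emp_table ADD COLUMNS (surname STRING);  -- add a column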
Partitioning in Hive
- Static Partitioning: Manually specify partition values when loading data.
- Dynamic Partitioning: Hive automatically partitions data based on column values.
- Example: Partitioning student data by course (Hadoop, Java, Python).
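A sketch of both partitioning styles for the student example (table, column, and path names follow the demo later on this page; the staging table name is an assumption):

    CREATE TABLE student (id INT, name STRING, age INT)
    PARTITIONED BY (course STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- static partitioning: the partition value is supplied by hand
    LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/student.csv'
      INTO TABLE student PARTITION (course = 'hadoop');

    -- dynamic partitioning: Hive derives the partition value from the data itself
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    INSERT INTO TABLE student_part PARTITION (course)
    SELECT id, name, age, course FROM student_staging;   -- partition column comes last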
Bucketing in Hive
- Bucketing divides data into a fixed number of buckets based on a hash function applied to a chosen column.
- Example: Bucketing employee data by employee ID into three buckets.
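A minimal sketch of that bucketing example (table and column names follow the demo later on this page; the source table is an assumption):

    SET hive.enforce.bucketing = true;
    CREATE TABLE emp_bucket (id INT, name STRING, salary FLOAT, age INT)
    CLUSTERED BY (id) INTO 3 BUCKETS
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    -- Hive hashes the employee ID and routes each row to one of the three bucket files
    INSERT OVERWRITE TABLE emp_bucket SELECT * FROM employee;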
Query Operations in Hive
- Arithmetic operations: addition, subtraction on numeric columns.
- Logical operations: filtering data based on conditions.
- Aggregate and mathematical functions: MAX, MIN, SUM, SQRT.
- String functions: converting text to uppercase or lowercase.
- Group By: Aggregating data by categories (e.g., country).
- Order By and Sort By: Sorting query results.
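Representative queries for these operations might look like the following (table and column names follow the employee examples used in the demo later on this page):

    SELECT name, salary + 5000 FROM employee;                     -- arithmetic
    SELECT * FROM employee WHERE salary >= 25000;                 -- logical filter
    SELECT MAX(salary), MIN(salary), SUM(salary) FROM employee;   -- aggregates
    SELECT SQRT(salary), UPPER(name), LOWER(name) FROM employee;  -- math and string functions
    SELECT country, SUM(salary) FROM employee2 GROUP BY country;  -- grouping by category
    SELECT * FROM employee ORDER BY salary DESC;                  -- global ordering
    SELECT * FROM employee SORT BY salary DESC;                   -- ordering within each reducer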
Join Operations in Hive
- Supports INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN.
- Example: Joining employee and department tables on department ID.
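A sketch of that join (department column names such as dept_id and dept_name are assumptions; only the join keyword changes between the four variants):

    SELECT e.name, d.dept_name
    FROM employee e
    JOIN department d ON e.dept_id = d.dept_id;   -- INNER JOIN
    -- swap JOIN for LEFT OUTER JOIN, RIGHT OUTER JOIN, or FULL OUTER JOIN as needed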
Limitations of Apache Hive
- Not suitable for real-time data processing; designed for batch processing.
- High query latency compared to real-time processing tools like Spark or Kafka.
- Not designed for online transaction processing (OLTP).
Conclusion
This tutorial covered Apache Hive's fundamentals, installation, architecture, data models, and query capabilities with practical examples. The provided code files and detailed explanations enable hands-on learning and preparation for real-world big data analytics using Hive.
For further learning and certification, consider enrolling in comprehensive Big Data and Hadoop courses that offer real-time projects and industry-relevant training.
For a deeper understanding of the underlying technologies, check out the Ultimate Guide to Apache Spark: Concepts, Techniques, and Best Practices for 2025 which complements Hive's capabilities in big data processing. Additionally, if you're interested in database management, our Comprehensive Guide to PostgreSQL: Basics, Features, and Advanced Concepts provides valuable insights into relational databases that can enhance your data handling skills.
Hello and welcome, everyone, to yet another tech-enthusiast video from Edureka. Today we will learn about Apache Hive. Let's quickly begin with our session. Apache Hive is one of the best open-source software utilities, with a SQL-like interface used for data querying and data analytics. Here is today's agenda: first we shall understand why exactly we needed Apache Hive, followed by what Apache Hive is and its important features. Then comes the important stage where we understand the Apache Hive architecture and the components involved in Apache Hive. After that we will learn how to install Apache Hive on the Windows operating system, followed by the data types, operators, and data models present in Hive, and finally we shall go through a brief demo of Apache Hive. So let's quickly begin with the first topic: why exactly did we need Apache Hive?
It all began at Facebook. As the number of Facebook users grew to nearly one billion, the data grew along with them to thousands of terabytes, with roughly one lakh (100,000) queries and about 500 million photographs uploaded daily. This was a huge amount of data that Facebook had to process. The first thought everybody had was to use an RDBMS, and we all know an RDBMS could not handle such a huge amount of data, nor was it capable of processing it. The next big player capable of handling all this big data was Hadoop. But even when Hadoop came into the picture, managing all the queries was not easy; they took a lot of time to execute. The one thing all the Hadoop developers had in common was SQL, so they came up with a new solution that had Hadoop's capacity and an interface like SQL. That is when Hive came into the picture.
Now let's look at the exact definition of Apache Hive. Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and data analysis. Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. As a data warehousing utility it can be used for data analytics, it is built for SQL users, it manages querying of structured data, and it simplifies and abstracts the load on Hadoop. Lastly, there is no need to learn Java or the Hadoop APIs to handle data using Hive. Next, we shall look at Apache Hive applications.
Apache Hive is used in many major applications; a few of them are as follows. First, data warehousing: Hive is a data warehousing infrastructure for Hadoop whose primary responsibility is to provide data summarization, query, and analysis, and it supports analysis of large data sets in Hadoop's HDFS as well as on the Amazon S3 file system. Second, document indexing: the goal of Hive indexing is to improve the speed of query lookups on certain columns of a table. Without an index, a query might have to load an entire table or partition and process every row, which would be troublesome; Hive indexing addresses this problem. Third, predictive modeling: the data manager lets you prepare your data so it can be processed by automated analytics, offering a variety of preparation functionalities including the creation of analytical records and time-stamped populations. Fourth, business intelligence: Hive is the data warehousing component of Hadoop and works well with structured data, enabling ad-hoc queries against large transactional data sets, which makes it a best-in-class tool for business intelligence and helps many companies predict their business requirements with high accuracy. Last but not least, log processing: Apache Hive allows processing of data with SQL-like queries and is very pluggable, so we can configure it to process our logs quite easily. These are a few of the important Hive applications.
Now let us move ahead and understand Apache Hive's features. The first and foremost is SQL-type queries: the SQL-like queries in Hive help Hadoop developers write queries with ease. The next important feature is its OLAP-based design. OLAP stands for online analytical processing; it allows users to analyze information from multiple database systems at one time, and using Apache Hive we can achieve OLAP with high accuracy. The third feature is that Apache Hive is fast to work with: because of the SQL-like interface over HDFS, writing and running queries becomes much quicker for developers. Next, Apache Hive is highly scalable: Hive tables are defined directly on the Hadoop file system, so Hive is scalable and easy to learn. It is also highly extensible, since Apache Hive uses the Hadoop file system, and HDFS provides horizontal scalability. Finally, ad-hoc querying: using Hive we can run ad-hoc queries to analyze data and make predictions. These are the few important features of Apache Hive.
Let us move on to our next topic, the Apache Hive architecture. The following architecture explains the flow of a query submitted to Hive. The first layer is the Hive client. Hive allows writing applications in various languages, including Java, Python, and C++, and it supports different types of clients through the Thrift server, the JDBC driver, and the ODBC driver. So what exactly is the Thrift server? It is a cross-language service provider platform that serves requests from all the programming languages that support Thrift. The JDBC driver is used to establish a connection between Hive and Java applications; it is provided by the class org.apache.hadoop.hive.jdbc.HiveDriver. Finally, the ODBC driver allows applications that support the ODBC protocol to connect to Hive.

Next come the Hive services: the Hive CLI, the Hive web user interface, the Hive metastore, the Hive server, the driver, the compiler, and the execution engine. The Hive CLI, or command line interface, is a shell where we can execute Hive queries and commands. The Hive web UI is an alternative to the CLI; it provides a web-based graphical user interface for executing Hive queries and commands. The Hive metastore is a central repository that stores all the structural information of the various tables and partitions in the warehouse, including metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored. The Hive server, also referred to as the Apache Thrift server, accepts requests from different clients and passes them on to the Hive driver. The Hive driver receives queries from different sources such as the web UI, CLI, Thrift, and the JDBC or ODBC drivers, and transfers them to the compiler. The purpose of the Hive compiler is to parse the query and perform semantic analysis on the different query blocks and expressions; it converts HiveQL statements into MapReduce jobs. Finally, the Hive execution engine, working with the optimizer, generates a logical plan in the form of a DAG (directed acyclic graph) of MapReduce tasks and HDFS tasks, and then executes the incoming tasks in the order of their dependencies. Below these sit MapReduce and HDFS: MapReduce is the processing layer that executes the map and reduce jobs on the data provided, and HDFS, the Hadoop Distributed File System, is the location where the data we provide is stored. That is the architecture of Apache Hive.
Moving on, we have the Apache Hive components. What are the different components present in Hive? First, the shell: the shell is where we write our queries and execute them. Next, the metastore: as discussed in the architecture, the metastore is where all the details related to our tables, such as the schema, are stored. Then the execution engine, the component of Apache Hive that converts the query we have written into tasks that Hive can actually run. The driver is the component that manages and executes the query, represented as a directed acyclic graph of tasks, and lastly the compiler, which compiles whatever code we write into an executable plan and gives us the output. These are the major Hive components.
Moving ahead, we shall walk through the Apache Hive installation on the Windows operating system. Edureka is all about providing technical knowledge in the simplest way possible and then playing around with the technology to understand its more complicated parts, so let's try to install Hive on our local system in the simplest possible way. To do so we need Oracle VirtualBox. Once you have downloaded and installed Oracle VirtualBox on your local system, the next step is to download the Cloudera QuickStart VM (the link is provided in the description box below). Now let's start the Cloudera QuickStart VM with Oracle VirtualBox: select the Import option and provide the location of your Cloudera QuickStart VM file; on my local system it is on local disk drive F. Select Open, and make sure the RAM allocated is more than 8 GB; I am giving it 9000 MB, which is just above 8 GB, so that Cloudera runs smoothly. Select Import, and you can see the Cloudera QuickStart VM being imported. Once it has been imported successfully it is ready for deployment; just double-click it and it will start. You can see that the Cloudera VM has started and we are live on Cloudera, with Hue, Hadoop, HBase, Impala, Spark, and more pre-installed. Our concern is to start up Hive, and to do that you first need to start Hue. Remember one thing: in Cloudera, every username and password is cloudera by default, so for the Hue login here the username is cloudera and the password is also cloudera. Let's sign in (you may tick the Remember option in case you forget your passwords). Now we are connected to Hue and it is up; from here we can get into HDFS, and Hive is available. Now that we have successfully installed Hive on our local system, let us move on and understand a few more concepts.
First, we shall deal with the data types. The data types are very similar to those of any other programming language: tinyint, smallint, int, and bigint, and similarly float and double, where float is used for single precision and double when you want double precision, followed by string and boolean, which again behave like their counterparts in the languages we use every day.

Next, the Hive data models. These are the basic data models we use in Hive: we create databases and store our data in the form of tables, and sometimes we also need partitions; we will look at each of these data models in the demo ahead. First we create databases, inside the databases we create tables, and inside the tables we store data in the form of rows and columns. Along with that we have partitions. Partitions are a more advanced way of organizing data. Imagine you are in a school, say Standard One, and inside Standard One you have sections A, B, C, and D; a partition is like splitting the class into section A, section B, section C, and section D, with different students stored in different sections. When you query for a particular record, say you are searching for a kid called Sam and you know his section is B, you do not have to search all four sections; you go directly to section B, call Sam, and you have him. That is how partitions work. After partitions we have buckets, which work in a broadly similar way. We will understand each of these much better through a practical demo.

After the data models, we come to Hive operators. Operators here are the same kinds of operators we use in normal programming languages, such as arithmetic operators and logical operators. In the demo we will run some arithmetic and logical operations on the data we store as tables in Hive.
Before we get started, let's take a brief look at the CSV files I have created for today's demo. These are small CSV files that I created in MS Excel and saved as .csv files. I kept them small to keep the execution time as short as possible; since we are using Cloudera, execution can be a little slow, so smaller CSV files are better. The first file is employee.csv, which has the employee ID, employee name, salary, and age. Similarly, there is employee2.csv, which has the same details plus one more column, country; I included country because we will use it in the joins we perform later. Next is department.csv, which has the department ID and department name: development, testing, product relationship, admin, and IT support. We also have student.csv, another file I created, with the ID, name, course, and age of each student, and finally student_report.csv, which holds the report of each student with columns such as gender, ethnicity, parental education, lunch, course, math score, reading score, writing score, and so on. These are the CSV files we will use in today's demo. Now let's quickly begin with the demo.
To start Hive, we open a terminal. Firing up Hive in Cloudera is really simple: you just type hive and press Enter. You will see that logging is initialized using the configuration files, along with a note that the Hive CLI is deprecated and migration to Beeline is recommended, and then the Hive terminal (CLI) has started. First, let's try to create a database. To save time, I have already prepared a document with all the code we will execute today; it will be linked in the description box below, so you can use the same file and run the same code on your own system for practice. So the first thing we do today is create a database, using the SQL-style command CREATE DATABASE followed by the database name, which is edureka. The database is created successfully. You can then use SHOW DATABASES to check whether your database was created: you will see the default database, which is pre-existing, and below it the database we just created, edureka.
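In HiveQL, those steps look like this (the database name follows the video):

    CREATE DATABASE edureka;   -- create the demo database
    SHOW DATABASES;            -- lists 'default' plus the new 'edureka' database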
Next we will move ahead and create a table. When it comes to tables, you need to understand that there are two types in Hive: managed (internal) tables and external tables. What is the difference? An internal, or managed, table is the default kind created whenever you create a table in Hive; if you create a new table without any extra keywords, Hive treats it as an internal table. With an internal table your data is not protected against accidental deletion: imagine you are working in a team and all your team members have access to your Hive or Hue, and some inexperienced person changes a few things and accidentally drops the table. If the table was created as an internal table, the data is erased along with it. That is the disadvantage of internal tables. With an external table, on the other hand, dropping the table only removes the table definition in Hive; the underlying data files stay where they are. That is the best part of using external tables. We will look at both.
First, let's create an internal table. This code uses the SQL-style command CREATE TABLE; the table name is employee, and the columns are the employee's id, name, salary, and age. The row format is delimited and, since the source is a CSV file, the fields are terminated by a comma. Don't forget the semicolon: the statement is not complete without it. Fire it off, and the table is created successfully. Now let's describe the table, which shows the columns present in it: use the keyword DESCRIBE, the table name employee, and the semicolon. You can see the table has the columns id, name, salary, and age, the four columns we included. Next, let's check whether this table is an internal (managed) table or an external one. For that we can run DESCRIBE FORMATTED followed by the table name and a semicolon. (I hit a small typo the first time, a missing letter in DESCRIBE, but on the second attempt it works.) The output shows that this table's type is MANAGED_TABLE.
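The statements for this step would look roughly like this (column names follow the employee.csv described earlier):

    CREATE TABLE employee (
      id INT,
      name STRING,
      salary FLOAT,
      age INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',';

    DESCRIBE employee;            -- lists the columns and their types
    DESCRIBE FORMATTED employee;  -- 'Table Type' shows MANAGED_TABLE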
Now let's move ahead and try out external tables; you can press Ctrl+L first to clear the screen. Creating an external table is almost identical to creating an internal one; the only difference is that you add the keyword EXTERNAL. Fire it off and the table, employee2, gets created. Let's describe it (again, don't forget the semicolon; I keep repeating this because a missing semicolon is the most common source of errors) and you can see the columns inside the table. Now let's check whether this is an external table or a managed one, using the same DESCRIBE FORMATTED command as before with the table name employee2. (Another typo on my part the first time, but once corrected it runs.) There you go: the table type is EXTERNAL_TABLE.
That is how we create an internal (managed) table and an external table. Now that we know how to create a database, a table, and both table types, let's create an external table at a particular location. The code is the same except that we also specify a LOCATION, a directory under /user/cloudera that Hive will use for this table. Fire it off, and it is created successfully.
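A sketch of the external-table statements from these steps (the LOCATION path is illustrative; the exact directory used in the video is hard to make out):

    CREATE EXTERNAL TABLE employee2 (
      id INT,
      name STRING,
      salary FLOAT,
      age INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',';

    DESCRIBE FORMATTED employee2;   -- 'Table Type' shows EXTERNAL_TABLE

    -- external table pinned to an explicit HDFS directory (path is an assumption)
    CREATE EXTERNAL TABLE edu_emp (
      id INT,
      name STRING,
      salary FLOAT,
      age INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION '/user/cloudera/edureka_employee';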
Let's go back to Hue and see whether the table shows up. One thing to remember: when you create tables, Hive keeps them under its warehouse directory, so inside Hive you have the warehouse, inside the warehouse the databases we have created (our first was the edureka database), and under that the tables employee and employee2, while the location-based table lives under /user/cloudera. Sometimes the data does not show up immediately because of network issues; don't worry, it will appear. Back in Hue, if you need to upload a file you can click the plus (+) option, which opens a dialog box where you select the file you want to upload; let me pick student_report.csv and select Open. The upload runs and the data file is uploaded successfully; click it and you can see all your data loaded into Hue. You can also run queries on this data from Hue: select Query, then Editor, and you will see the various editors available, such as Pig, Impala, Java, Spark, MapReduce, shell, and Sqoop, and Hive is here as well. Select Hive and you get an editor where you can type your commands or queries, with suggestions to help you along. Let's not spend too much time here, since we have a lot to cover, and continue with the next topic.
Now we shall try altering tables. I have created a new table, employee3, with the columns id, name (string), salary, and age (float). The first alteration is to rename the table to emp_table: the table was named employee3 and we rename it using the ALTER keyword. Fire it off, and the name is changed to emp_table. To confirm, run DESCRIBE emp_table; we get the same column names in the description, so the rename worked. Next, let's add a column to emp_table: a new column surname of type string, using ALTER TABLE with the keyword ADD COLUMNS, the column name surname, and the data type string. Fire it off and the new column is added; describe the table again and you can see surname listed at the end. You can also rename existing columns, so let's try that as well. One of the columns in emp_table is name, which holds the employees' names; since I added surname, I will change name to firstname using the corresponding ALTER TABLE command. Fire it off, describe the table once more, and you can see that what used to be name is now firstname, alongside the surname column. Let's clear the screen; that is all for alterations.
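The ALTER statements from this section would look roughly like this (the data types for employee3 are assumed to mirror the earlier employee table):

    CREATE TABLE employee3 (id INT, name STRING, salary FLOAT, age INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    ALTER TABLE employee3 RENAME TO emp_table;            -- rename the table
    ALTER TABLE emp_table ADD COLUMNS (surname STRING);   -- append a new column
    ALTER TABLE emp_table CHANGE name firstname STRING;   -- rename an existing column
    DESCRIBE emp_table;                                   -- verify the new layout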
Now we shall move ahead to our next major topic, the data model called partitioning. We have already dealt with the first two data models, databases and tables: we have learned how to create a database, how to create internal (managed) and external tables, how to create an external table at a particular location, how to load data into a table, and how to alter tables, changing column names, renaming the table, and adding new columns. So far so good; now let's continue with partitioning. As discussed earlier, partitioning is a lot like a school or a college. Imagine a college with several branches, say computer science, mechanical, and electronics and communication, and imagine your name is Harry. If someone comes to the college looking for Harry, there may be many Harrys, but if they are asking specifically for Harry from computer science, the query becomes simple: they do not have to search electronics or mechanical, they go straight into the computer science class, look for Harry, and there you are. That is how partitions work.
To execute queries on partitions we will create a whole new database and start everything fresh, a separate database just for this data model. I am creating a new database called edureka_student, and it is created successfully. Next, let's use this database: add the keyword USE and the database name, fire it off, and we are now working inside edureka_student. Now let's create a table in it, a normal managed table named student with basic columns such as the student's id, name, and age, plus the course. You will not find course in the column list because I am going to partition the table by course: it appears in the PARTITIONED BY clause instead. Recall our student CSV file: the courses this particular institute offers are Hadoop, Java, and Python, and I am going to categorize, or partition, the students based on their course. So the table has all its columns and is partitioned by course; fire the statement off and the partitioned table is created. Before loading data, let's describe the table to see its columns: the course column is present, so even though the CREATE statement looked as if we had left course out, we did not; it is simply the partition column. Now let's load the students by course. We load the data with LOAD DATA LOCAL INPATH, pointing at the student.csv file in my local location, home/cloudera/Desktop/student.csv, into the student table in Hive, into the partition for the course hadoop. Fire the command off; some MapReduce work runs, and the data is loaded successfully.
Let's now refresh Hue. You can refresh in two ways: click the refresh button in the browser, or use the manual refresh option. After the refresh you can see the new edureka_student database we just created, the student table inside it, and the file of students for the course hadoop. Now let's add a few more students for the course java: all you need to do is replace the course name with java in the same LOAD statement, fire it off, and you can see the output. We also had another course, python, so let's run the command for that too. We have now uploaded the student details into Hive and, using one of our data models, partitioned them into three categories based on Hadoop, Java, and Python. Back in Hue, refresh again; if there is still no sign of java and python, a manual refresh helps, and then the two new directories for java and python appear. You now have all three partitions, hadoop, java, and python; open them and you can see the student details.
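Put together, the static-partitioning steps look roughly like this (a sketch; the file path and column names follow the video's student.csv, with the column layout simplified, and in the demo the same CSV is loaded into each partition for illustration):

    CREATE DATABASE edureka_student;
    USE edureka_student;

    CREATE TABLE student (id INT, name STRING, age INT)
    PARTITIONED BY (course STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- static partitioning: the partition value is given explicitly for each load
    LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/student.csv'
      INTO TABLE student PARTITION (course = 'hadoop');
    LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/student.csv'
      INTO TABLE student PARTITION (course = 'java');
    LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/student.csv'
      INTO TABLE student PARTITION (course = 'python');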
Now that we have seen partitioning in action, I should mention that there are two types: static partitioning and dynamic partitioning. In static (manual) partitioning you are required to pass the values of the partition columns yourself when loading data into the table, and ideally the data file itself does not contain the partition column; you saw that we supplied the partition values manually for hadoop, java, and python. With dynamic partitioning you only have to load the data once and all three partition directories are configured and created automatically. So what is dynamic partitioning? In dynamic partitioning the values of the partition columns exist within the table's data, so it is not required to pass them manually. Don't worry if this is not clear yet; we will run code for dynamic partitioning and it will make much more sense.
better way now let's clear our screen now let's start fresh again let's try to create a new database for
dynamic partitioning and let's start again fresh so here we'll be creating a new database
that is eduracast student2 so earlier we created eduraca student and now we'll be testing our dynamic
partitioning on our new database that is edureka student2 so there you go the database has been successfully created
now we shall use this particular database currently a weaver and eureka store into
one database now we'll enter into student 2 database so we'll use it now now we are in indirect student 2. now
before we start up with dynamic partitioning we have to set high execution to dynamic partition is equal
to true because by default the partitions that will be taking place in hype will be static so we need to
convert that into dynamic partition by specifying this particular code now we are good to go with dynamic partitioning
along with that we need to execute another command which says partition mode would be non-strict
so by default when you are partitioning using the static partition the partition mode will be strict so now
you're specifying it to be non-strict now let's execute this so there you go we have executed the two
required codes for that now let's create a new table so the name of the table will be
edureka student that is adu sdud and this will have the same columns which are the id of the student
name of the student course age etc now we will try to load in the data from our local path that is home
cloudera desktop student.csv into the table edu sdud so the data has been successfully loaded and the size is 267
kb number of files is one now comes the tough part so here we are going to partition so we will be
partitioning the table based on the same thing which is the course and we will be separating the data using
the comma now let's fire and enter now the table has been separated based on course and
now we will be loading the data to this particular table which is the student part so this particular table that we
have created based on dynamic partitioning and we are going to partition the data based on course
now it's been created so the student part table has been successfully created now the only part remaining is to load
the data to this particular table now we will be writing a code so using that code the mapreduce will
automatically segregate the data members or the students based on their courses so the guys which are in hadoop will be
separated guys in java will be separated and loaded into different file and similarly with python
now let's see how to do it using the code there you go we are going to insert into
student part partition based on course select id name course h from the table editor
so the data will be imported from the table what we have created here that is eureka student
so this particular location has the student.csv file now let's fire and enter and see if it's
You can see the MapReduce jobs being executed; there are three of them, one for Hadoop, one for Java, and one for Python, and the first one runs first. This takes a little while, which is exactly why I chose small CSV files, to save time. (When you take the course from Edureka you get to work on real-time data, so you gain hands-on experience that helps you get placed in good companies.) The stages finish successfully and the data is loaded. Now let's look at the data in student_part: there is the output, the rows in the partitioned table, separated, that is, partitioned, by their courses, Hadoop, Java, and Python.
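Gathered into one place, the dynamic-partitioning steps would look roughly like this (a sketch; the table names edu_stud and student_part follow the demo, and the SELECT puts the partition column last, as Hive requires):

    CREATE DATABASE edureka_student2;
    USE edureka_student2;

    SET hive.exec.dynamic.partition = true;           -- enable dynamic partitions
    SET hive.exec.dynamic.partition.mode = nonstrict;  -- allow all partitions to be dynamic

    CREATE TABLE edu_stud (id INT, name STRING, course STRING, age INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/student.csv' INTO TABLE edu_stud;

    CREATE TABLE student_part (id INT, name STRING, age INT)
    PARTITIONED BY (course STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- Hive derives the course partition for each row from the data itself
    INSERT INTO TABLE student_part PARTITION (course)
    SELECT id, name, age, course FROM edu_stud;

    SELECT * FROM student_part;   -- verify the partitioned data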
Now that we have covered both dynamic and static partitioning, we shall move on to the last type of data model, bucketing. Once we finish bucketing we will get into the query operations that can be performed in Hive, then some of the functions present in Hive, then things like GROUP BY, ORDER BY, and SORT BY, and finally we shall wind up the session with the joins available in Hive. Before we continue, let's go back to Hue and check that the partitions were created; refresh, and do a manual refresh as well. Our database was edureka_student2, and inside it is the student_part table, with directories for each partition, one per course, plus the default partition that holds the remaining rows, as we discussed earlier.
Now let's start with the last data model in Hive, buckets. We create a new database, edureka_bucket, and switch to it with USE edureka_bucket. Next we create a new table, emp_bucket, containing the id, name, salary, and age of the employees, and the table is created. Let's load the data; the file is the same employee.csv as before, and it loads successfully into place. Now comes the main part, the bucketing itself. To enable bucketing in Hive we set hive.enforce.bucketing to true. With that done, we cluster, or classify, the data by id into three different buckets, and then populate the bucketed table with an INSERT OVERWRITE. Fire it off and you can see the MapReduce jobs being taken care of, one mapper and three reducers, so stage one runs and we end up with three tasks. The process finishes and the data is inserted successfully. Back in Hue, after a refresh (a manual refresh works best), you can see our edureka_bucket database, the bucketed employee table inside it, and the data from employee.csv.
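A sketch of the bucketing steps, assuming (as the narration implies) a plain staging table loaded from employee.csv and a separate bucketed table filled with INSERT OVERWRITE; the exact table names and file path in the video are hard to make out:

    CREATE DATABASE edureka_bucket;
    USE edureka_bucket;

    -- staging table loaded straight from the CSV (path assumed to match the student.csv location)
    CREATE TABLE emp_stage (id INT, name STRING, salary FLOAT, age INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/employee.csv' INTO TABLE emp_stage;

    SET hive.enforce.bucketing = true;   -- make Hive honour the bucket count on insert

    -- bucketed table: rows are routed to one of three files by a hash of id
    CREATE TABLE emp_bucket (id INT, name STRING, salary FLOAT, age INT)
    CLUSTERED BY (id) INTO 3 BUCKETS
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    INSERT OVERWRITE TABLE emp_bucket SELECT * FROM emp_stage;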
Now let's move ahead and look at the basic operations we can perform in Hive. Again we start fresh with a new database; I am creating a separate database for each operation in this tutorial just to keep things sorted. As you can see in the file system, everything is separated: there is one database for bucketing, one for partitioning, and one for the basic database-and-table work, purely to keep things arranged, which looks much tidier. For the operations we create a database called hive_query_language and switch to it (repeating these steps each time is also good revision of what we have done so far). The table is created successfully, so let's load some data into it, the employee data, which loads fine. To see the contents we run SELECT * FROM the table edureka_employee, and there you go: those are the rows present in the table.
Now let's see what operations we can run on this data. Since arithmetic and logical operations are both supported in Hive, let's start with an addition. I select the salary column; the salaries are 25, 30, 40, and 20 thousand rupees, and I add 5,000 to each using the addition operator. Run it and every value goes up by 5,000: the first employee was at 25,000 and is now at 30,000, so every employee got a 5,000-rupee raise all of a sudden. Now let's take away 1,000: just replace the addition operator with the subtraction operator, the minus sign. Run it and every employee loses 1,000 relative to the original values, so the first one goes from 25,000 to 24,000; the arithmetic is always applied to the stored values. Next, some logical operations. Clear the screen, then fetch the employees whose salary is greater than or equal to 25,000; those are the employees at or above that figure. Similarly, run the query that finds employees with salaries below 25,000: two employees, Amit and Chaitanya, have the lower salaries. That is how you run basic operations in Hive.
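These operator examples would look roughly like this in HiveQL (the table and column names follow the employee data used above):

    SELECT name, salary + 5000 AS new_salary FROM edureka_employee;  -- arithmetic: raise everyone by 5,000
    SELECT name, salary - 1000 AS new_salary FROM edureka_employee;  -- arithmetic: deduct 1,000
    SELECT * FROM edureka_employee WHERE salary >= 25000;            -- logical filter: higher earners
    SELECT * FROM edureka_employee WHERE salary <  25000;            -- logical filter: lower earners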
Now let's move ahead and look at the functions you can use in Hive. In the same way, we create a new database for Hive functions and switch to it, then create a table, employee_function, in this database; it is created, we load the data, and a quick check shows the data loaded correctly. Now let's apply some functions to this data. The first one is the square root function, where I compute the square root of each employee's salary; for the 25,000 salary the square root comes out to roughly 158 point something. That is how you apply basic functions to your data. Next, let's find the maximum salary. A MapReduce job runs for this; I expect the biggest salary to be Sanjana's, and indeed the maximum salary is 40,000, and the employee is Sanjana. Since we are on Cloudera and the system configuration is limited, execution is a bit slow, but on a real cluster this would take only a few seconds. Similarly, the minimum salary is 15,000, and that is Chaitanya. Let's run a couple more, such as converting the employee names to uppercase, and then to lowercase; you can see the names converted each way. This is how you learn a technology: you play with it, discover its advantages and disadvantages, and figure out the ways to make things work.
Now let's move ahead and understand the GROUP BY function in Hive. For that we create a separate database called group, switch to it with USE group, and create a table. The table is created, and this time we load the new CSV file, employee2.csv, because it has an additional column, country. As discussed before, we will group the employees by country; looking at the data first, we have three countries, USA, India, and UAE, so we will categorize the employees by their country. (I made a mistake while creating the table and gave it the wrong name, employee_order, so let's drop it; to drop a table you use the keyword DROP TABLE followed by the table name, and it is gone. My first attempt failed only because the keyword TABLE was missing.) We were supposed to create a different table, employee_group, so let's create it and load the data into it, using employee2.csv since it carries the country column. Now we use the GROUP BY function to categorize the employees by country. Some MapReduce jobs run, and there you go: the employees are grouped by country, India, UAE, and USA, together with the sum of salaries. The people working in India have a combined salary of 90,000, UAE is about 105,000 (one lakh five thousand), and USA is 80,000. Let's also run a different GROUP BY command that keeps only the groups whose salary total is greater than or equal to 15,000; it is similar to the previous command, and since every group clears that threshold we get the same output.
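A sketch of the GROUP BY step (the table name employee_group follows the demo; expressing the threshold with HAVING is an assumption based on the narration):

    SELECT country, SUM(salary)
    FROM employee_group
    GROUP BY country;              -- one row per country with the salary total

    SELECT country, SUM(salary)
    FROM employee_group
    GROUP BY country
    HAVING SUM(salary) >= 15000;   -- keep only groups at or above the threshold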
Now let's move ahead and understand the ORDER BY and SORT BY methods. For that we create a new database, orders, switch to it, and create a new table, employee_order; the table is created and we load the data into it (by now you have had good practice creating a database, creating a table, and loading data into it). With the data loaded, we order the rows of this table by salary in descending order. A MapReduce job runs, and we see the employees ordered by their salaries, highest first and lowest last: Sanjana is in first place with the highest salary of 40,000, working for UAE, and Chaitanya has the lowest salary, 15,000, working for India. Now let us also run a command based on SORT BY; we first used ORDER BY, and now we produce the same output with SORT BY. The two behave similarly on this small data set (SORT BY orders data within each reducer, while ORDER BY enforces a single global order), and the records come out sorted by descending salary.
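The ordering queries would look roughly like this (a sketch; the table name follows the demo):

    SELECT * FROM employee_order ORDER BY salary DESC;  -- single global ordering across all reducers
    SELECT * FROM employee_order SORT BY salary DESC;   -- ordering within each reducer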
Now that we have covered the various operations that can be performed in Hive, the arithmetic and logical operations, and functions such as MAX and MIN along with GROUP BY, ORDER BY, and SORT BY, let's move on to the last kind of operation in Hive: joins.
For that, we again create a new database, edureka_join, and switch to it with USE; we are now in edureka_join. Next we create a table, emp_join (you can see I forgot the semicolon at first), and load the employee data into it. Join operations always need two tables, so in the same edureka_join database, alongside the first table emp_join, we create a second table, the department table, with the fields department id and department name, and load the department data into it. So employee2.csv gives us id, name, salary, age, and country, while department.csv gives us the department ids and the department names: development, testing, product relationship, admin, and IT support. Both tables are now created and loaded.

Hive offers four different joins: inner join, left outer join, right outer join, and full outer join. Let's perform the first one, the inner join: we select the employee name and the department, and join the two tables on the employee id and the department id. You can see the jobs running; the MapReduce tasks complete successfully, the first join finishes, and the output is generated. Next, the left outer join: the only difference is the keyword LEFT OUTER JOIN; the job runs and the output of the left outer join is generated as well. Then the right outer join, for which you use the keyword RIGHT OUTER JOIN; fire the command, the jobs run, and the output of the right outer join is displayed. Finally, the last join operation, the full outer join, using the keyword FULL OUTER JOIN; run the command and the output of the full outer join is displayed. That is how the join operations are executed in Hive.
To recap, we have learned how to create a database, how to create a table, how to load data, and the data models present in Hive, namely databases, tables, partitions, and bucketing. After that we looked at the various operations, the arithmetic and logical operations, functions such as square root, sum, minimum, and maximum, and then GROUP BY, SORT BY, and ORDER BY, and finally the joins possible in Hive: inner, left outer, right outer, and full outer. Every operation that could reasonably be run in Hive has been demonstrated in this tutorial, everything is organized database by database, and you will find the code I used in the description box below so you can try it out yourself. If you are looking for an online certification and training on big data and Hadoop, check out the link in the description box as well; during the training you get hands-on experience with real-time data and will learn a great deal.
Now we shall also discuss some of the limitations of Apache Hive. First, Hive is not capable of handling real-time data; it is built for batch processing. If you have to work with real-time data, you should go with real-time tools such as Spark or Kafka. Hive has to load the data before it can process it: imagine you are working with Twitter and have one lakh (100,000) comments on a particular post; to process those comments you must first load them into Hive and then process them, and while you are loading the data from Twitter into Hive a few more comments may arrive and be missed. So Hive is not preferable for real time; it is suited to batch-mode processing. Second, it is not designed for online transaction processing (OLTP), which by nature works in real time, and Hive cannot support real-time processing. Last but not least, Hive queries have high latency: they take a long time to process, as you have seen; even the small CSV file I chose took a while. These are the few important, noticeable limitations of Hive. With this we have come to the end of this tutorial. If you have any queries regarding it, or if you require the code that we have executed