Comprehensive Apache Hive Tutorial: Installation, Features, and Queries


Introduction to Apache Hive

Apache Hive is an open-source data warehousing software built on top of Apache Hadoop, providing a SQL-like interface for querying and analyzing large datasets stored in Hadoop's HDFS and other file systems like Amazon S3. It simplifies Hadoop data processing by abstracting complex MapReduce jobs and eliminating the need to learn Java or Hadoop APIs.

Why Apache Hive?

  • Traditional RDBMSs cannot handle massive data volumes like Facebook's billions of users and terabytes of data.
  • Hadoop handles big data but lacks an easy query interface.
  • Hive bridges this gap by offering SQL-like queries on Hadoop data.

Key Features of Apache Hive

  • SQL-like query language for ease of use.
  • OLAP-based design for multi-dimensional data analysis.
  • High scalability and extensibility using Hadoop file systems.
  • Efficient query execution on very large datasets, optimized for batch throughput rather than low latency.
  • Supports ad hoc querying and data summarization.

Apache Hive Architecture

  • Hive Client: Supports Java, Python, C++ applications via Thrift Server, JDBC, and ODBC drivers.
  • Hive Services: Includes CLI, Web UI, Metastore (central metadata repository), Hive Server, Driver, Compiler, and Execution Engine.
  • Execution Engine: Converts queries into MapReduce jobs (or, in later Hive versions, Tez or Spark jobs) executed over data in the Hadoop Distributed File System (HDFS).

Components of Apache Hive

  • Shell: Interface to write and execute Hive queries.
  • Metastore: Stores metadata about tables, partitions, and schemas.
  • Execution Engine: Translates queries into executable tasks.
  • Driver: Manages query lifecycle and execution.
  • Compiler: Compiles HiveQL into MapReduce jobs.

Installing Apache Hive on Windows

  • Use Oracle VirtualBox to run Cloudera QuickStart VM.
  • Import and start the VM with at least 8GB RAM.
  • Access Hive through Hue web interface with default credentials (username/password: cloudera).

Hive Data Types and Models

  • Supports standard data types: tinyint, smallint, int, bigint, float, double, string, boolean.
  • Data models include databases, tables (internal/managed and external), partitions, and buckets.
  • Partitions help organize data for efficient querying (e.g., by course or section).
  • Bucketing clusters data into a fixed number of files (buckets) for optimized query performance.
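
As a sketch, a table definition using several of these types might look like the following (table and column names are illustrative):

```sql
-- Hypothetical table showing common Hive primitive data types
CREATE TABLE student (
  id        INT,
  name      STRING,
  gpa       FLOAT,
  credits   SMALLINT,
  fees      DOUBLE,
  is_active BOOLEAN
);
```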

Creating and Managing Tables

  • Internal tables store data managed by Hive; deleting the table deletes data.
  • External tables link to data stored externally; deleting the table does not delete data.
  • Commands to create, describe, and alter tables including adding columns and renaming.
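
A minimal sketch of these commands, assuming illustrative table names and an example HDFS path:

```sql
-- Managed (internal) table: Hive owns the data; DROP TABLE removes it
CREATE TABLE employees_internal (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: data lives at an HDFS path Hive does not own;
-- DROP TABLE removes only the metadata, not the underlying files
CREATE EXTERNAL TABLE employees_external (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/employees';  -- illustrative HDFS path

-- Inspect and alter the schema
DESCRIBE employees_internal;
ALTER TABLE employees_internal ADD COLUMNS (salary DOUBLE);
ALTER TABLE employees_internal RENAME TO staff_internal;
```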

Partitioning in Hive

  • Static Partitioning: Manually specify partition values when loading data.
  • Dynamic Partitioning: Hive automatically partitions data based on column values.
  • Example: Partitioning student data by course (Hadoop, Java, Python).
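
The two partitioning styles can be sketched as follows (file paths and the staging table are assumptions for illustration):

```sql
-- Table partitioned by course
CREATE TABLE student (id INT, name STRING)
PARTITIONED BY (course STRING);

-- Static partitioning: the partition value is named explicitly at load time
LOAD DATA LOCAL INPATH '/home/cloudera/hadoop_students.csv'
INTO TABLE student PARTITION (course = 'Hadoop');

-- Dynamic partitioning: Hive derives partitions from the column's values
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE student PARTITION (course)
SELECT id, name, course FROM student_staging;
```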

Bucketing in Hive

  • Bucketing divides data into a fixed number of buckets based on a hash of the bucketing column.
  • Example: Bucketing employee data by employee ID into three buckets.
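
A sketch of the employee example, with rows routed to buckets by hashing the ID (table names are illustrative):

```sql
-- Rows are assigned to buckets roughly as hash(emp_id) % 3
CREATE TABLE employee_bucketed (
  emp_id INT,
  name   STRING,
  salary DOUBLE
)
CLUSTERED BY (emp_id) INTO 3 BUCKETS;

-- In older Hive versions, bucketing must be enforced explicitly
SET hive.enforce.bucketing = true;

INSERT INTO TABLE employee_bucketed
SELECT emp_id, name, salary FROM employee_staging;
```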

Query Operations in Hive

  • Arithmetic operations: addition, subtraction on numeric columns.
  • Logical operations: filtering data based on conditions.
  • Aggregate functions: MAX, MIN, SUM, COUNT, AVG; mathematical functions such as SQRT.
  • String functions: converting text to uppercase or lowercase.
  • Group By: Aggregating data by categories (e.g., country).
  • Order By and Sort By: ORDER BY produces a total ordering through a single reducer, while SORT BY sorts within each reducer only.
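
These operations can be sketched together in a couple of queries (the employee table and its columns are assumptions for illustration):

```sql
-- Arithmetic and logical operations
SELECT name, salary + 500 AS adjusted_salary
FROM employee
WHERE salary > 30000 AND department = 'Sales';

-- Aggregates with GROUP BY, plus a string function on the grouping key
SELECT country,
       COUNT(*)       AS employees,
       MAX(salary)    AS max_salary,
       SUM(salary)    AS total_salary,
       UPPER(country) AS country_uc
FROM employee
GROUP BY country
ORDER BY total_salary DESC;  -- total ordering via a single reducer
```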

Join Operations in Hive

  • Supports INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN.
  • Example: Joining employee and department tables on department ID.
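
The join variants for the employee/department example might look like this (table and column names are illustrative):

```sql
-- INNER JOIN keeps only rows that match in both tables
SELECT e.name, d.dept_name
FROM employee e
JOIN department d ON e.dept_id = d.dept_id;

-- LEFT OUTER JOIN keeps all employees, with NULLs for missing departments
SELECT e.name, d.dept_name
FROM employee e
LEFT OUTER JOIN department d ON e.dept_id = d.dept_id;

-- FULL OUTER JOIN keeps unmatched rows from both sides
SELECT e.name, d.dept_name
FROM employee e
FULL OUTER JOIN department d ON e.dept_id = d.dept_id;
```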

Limitations of Apache Hive

  • Not suitable for real-time data processing; designed for batch processing.
  • High query latency compared to low-latency engines such as Spark or streaming platforms such as Kafka.
  • Not designed for online transaction processing (OLTP).

Conclusion

This tutorial covered Apache Hive's fundamentals, installation, architecture, data models, and query capabilities with practical examples. The provided code files and detailed explanations enable hands-on learning and preparation for real-world big data analytics using Hive.

For further learning and certification, consider enrolling in comprehensive Big Data and Hadoop courses that offer real-time projects and industry-relevant training.

For a deeper understanding of the underlying technologies, check out the Ultimate Guide to Apache Spark: Concepts, Techniques, and Best Practices for 2025 which complements Hive's capabilities in big data processing. Additionally, if you're interested in database management, our Comprehensive Guide to PostgreSQL: Basics, Features, and Advanced Concepts provides valuable insights into relational databases that can enhance your data handling skills.

Heads up!

This summary and transcript were automatically generated using AI with the Free YouTube Transcript Summary Tool by LunaNotes.
