Most Data Scientists use SQL queries to explore data and extract valuable insights from it. But as the volume of data grows at such a high pace, we need new, dedicated tools to deal with it.
Initially, Hadoop came up and became one of the most popular tools to process and store big data. But developers were required to write complex map-reduce code to work with Hadoop. This is where Facebook’s Apache Hive came to the rescue. Hive is another tool designed to work with Hadoop: we can write SQL-like queries in Hive, and in the backend it converts them into map-reduce jobs.
In this article, we will look at the architecture of Hive and how it works. We will also learn how to perform simple operations like creating a database and a table, loading data, and modifying a table.
Apache Hive is a data warehouse system developed by Facebook to process huge amounts of structured data in Hadoop. We know that to process data using Hadoop, we need to write complex map-reduce functions, which is not an easy task for most developers. Hive makes this work very easy for us.
It uses a query language called HiveQL, which is very similar to SQL. So now, we just have to write SQL-like commands, and at the backend Hive will automatically convert them into map-reduce jobs.
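For instance, here is a minimal sketch of what a HiveQL query looks like (the employees table and its columns are hypothetical, used only for illustration); Hive compiles a query like this into one or more map-reduce jobs behind the scenes:

-- an ordinary SQL-style aggregation, executed as map-reduce by Hive
SELECT department, COUNT(*) AS num_employees
FROM employees
GROUP BY department ;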
Let’s have a look at the following diagram which shows the architecture.
Now, let’s have a look at how Hive works on top of the Hadoop framework.
Hive data types are divided into the following 5 categories: Numeric types (such as INT, BIGINT, FLOAT, DOUBLE, and DECIMAL), Date/Time types (such as TIMESTAMP and DATE), String types (STRING, VARCHAR, and CHAR), Miscellaneous types (BOOLEAN and BINARY), and Complex types (ARRAY, MAP, STRUCT, and UNIONTYPE).
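To give a feel for how these types appear in practice, here is a sketch of a table definition that mixes primitive and complex types (the table and column names are hypothetical):

CREATE TABLE IF NOT EXISTS employee_profiles (
    id INT,
    name STRING,
    joined DATE,
    scores ARRAY<INT>,
    properties MAP<STRING, STRING>,
    address STRUCT<city:STRING, zip:STRING>) ;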
Creating and dropping a database is very simple and similar to SQL. We need to assign a unique name to each database in Hive. If the database already exists, Hive will throw an error; to suppress it, you can add the keywords IF NOT EXISTS after the DATABASE keyword.
CREATE DATABASE <<database_name>> ;
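For example, assuming a hypothetical database named sales_db, the statement with the suppression keywords would look like this:

CREATE DATABASE IF NOT EXISTS sales_db ;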
Dropping a database is also very simple: you just need to write DROP DATABASE followed by the name of the database to be dropped. If you try to drop a database that doesn’t exist, Hive will throw a SemanticException error.
DROP DATABASE <<database_name>> ;
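Similarly, the IF EXISTS keywords suppress the SemanticException when the database is missing. A sketch using the same hypothetical sales_db:

DROP DATABASE IF EXISTS sales_db ;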
We use the CREATE TABLE statement to create a table, and the complete syntax is as follows.
CREATE TABLE IF NOT EXISTS <<database_name.>><<table_name>> (
    column_name_1 data_type_1,
    column_name_2 data_type_2,
    . .
    column_name_n data_type_n)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE ;
If you are already using the database, you are not required to write database_name.table_name; in that case, you can write just the table name. With big data, most of the time we import data from external files, so here we can pre-define the delimiter used in the file and the line terminator, and we can also define how we want to store the table.
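To make this concrete, here is a minimal sketch that creates a hypothetical employees table in sales_db, backed by a tab-separated text file (all names here are assumptions for illustration):

CREATE TABLE IF NOT EXISTS sales_db.employees (
    id INT,
    name STRING,
    department STRING,
    salary FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE ;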
There are 2 different types of Hive tables: internal (managed) and external tables. Please go through this article to know more about the concept: Types of Tables in Apache Hive: A Quick Overview
Now that the table has been created, it’s time to load data into it. We can load data from a local file on our system using the following syntax.
LOAD DATA LOCAL INPATH <<path of file on your local system>> INTO TABLE <<database_name.>><<table_name>> ;
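For example, assuming the hypothetical employees table above and a tab-separated file on the local machine (the path here is made up for illustration), the statement would be:

LOAD DATA LOCAL INPATH '/home/user/employees.txt' INTO TABLE sales_db.employees ;

If the file already resides in HDFS rather than on the local file system, drop the LOCAL keyword.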
When we work with a huge amount of data, there is a possibility that some rows contain values that don’t match the declared data types. In that case, Hive will not throw an error; instead, it will fill those values with NULLs. This is a very useful feature, as loading big data files into Hive is an expensive process and we do not want to reload the entire dataset just because of a few bad rows.
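A quick way to check how many rows were affected after a load is to count the NULLs in a column, as in this sketch (again using the hypothetical employees table):

SELECT COUNT(*) FROM sales_db.employees WHERE salary IS NULL ;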
In Hive, we can make multiple modifications to existing tables, like renaming a table or adding more columns to it. The commands to alter a table are very similar to SQL commands.
Here is the syntax to rename the table:
ALTER TABLE <<table_name>> RENAME TO <<new_name>> ;
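For instance, renaming the hypothetical employees table to staff would look like this:

ALTER TABLE employees RENAME TO staff ;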
Syntax to add more columns to the table:
-- to add more columns
ALTER TABLE <<table_name>> ADD COLUMNS (
    new_column_name_1 data_type_1,
    new_column_name_2 data_type_2,
    . .
    new_column_name_n data_type_n) ;
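As an example, adding two hypothetical columns to the employees table:

ALTER TABLE employees ADD COLUMNS (hire_date DATE, city STRING) ;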
In this article, we have seen the architecture of Apache Hive, how it works, and some of the basic operations to get started. In the next article of this series, we will look at the more advanced and important concepts of partitioning and bucketing in Hive.
If you have any questions related to this article do let me know in the comments section below.