You might have seen an encyclopedia in your school or college library. It is a set of books that will give you information about almost anything. Do you know what is the best thing about the encyclopedia?
Yes, you guessed it correctly. The words are arranged alphabetically. For example, you have a word in mind “Pyramids”. You will directly go and pick up the book with the title “P”. You don’t have to search that in other books. Can you imagine how tough would the task be to search for a single book if they were stored without any order?
Here storing the words alphabetically represents indexing, but using a different location for the words that start from the same character is known as bucketing.
Similar kinds of storage techniques like partitioning and bucketing are there in Apache Hive so that we can get faster results for the search queries. In this article, we will see what is partitioning and bucketing, and when to use which one?
Apache Hive allows us to organize the table into multiple partitions where we can group the same kind of data together. It is used for distributing the load horizontally. Let’s understand it with an example:
Suppose we have to create a table in the hive which contains the product details for a fashion e-commerce company. It has the following columns:
Now, the first filter that most of the customer uses is Gender then they select categories like Shirt, its size, and color. Let’s see how to create the partitions for this example.
CREATE TABLE products ( product_id string, brand string, size string, discount float, price float ) PARTITIONED BY (gender string, category string, color string);
Now, the hive will store the data in the directory structure like:
/user/hive/warehouse/mytable/gender=male/category=shoes/color=black
Partitioning the data gives us performance benefits and also helps us in organizing the data. Now, let’s see when to use the partitioning in the hive.
In the above example, we know that we cannot create a partition over the column price because its data type is float and there is an infinite number of unique prices are possible.
Hive will have to generate a separate directory for each of the unique prices and it would be very difficult for the hive to manage these. Instead of this, we can manually define the number of buckets we want for such columns.
In bucketing, the partitions can be subdivided into buckets based on the hash function of a column. It gives extra structure to the data which can be used for more efficient queries.
CREATE TABLE products ( product_id string, brand string, size string, discount float, price float ) PARTITIONED BY (gender string, category string, color string) CLUSTERED BY (price) INTO 50 BUCKETS;
Now, only 50 buckets will be created no matter how many unique values are there in the price column. For example, in the first bucket, all the products with a price [ 0 – 500 ] will go, and in the next bucket products with a price [ 500 – 200 ] and so on.
In this article, we have seen what is partitioning and bucketing, how to create them, and are pros and cons of them.
I would highly recommend you go through the following resources to learn more about Apache Hive:
If you have any questions related to this article do let me know in the comments section below.
Very nice article .I read from many places but no one explain in this simple and easy way.liked the article.
Very nice article .I read from many places but no one explain in this simple and easy way.