This article was published as a part of the Data Science Blogathon.
Apache Hive is a data warehouse system built on top of Hadoop that gives the user the flexibility to write complex MapReduce programs in the form of SQL-like queries. Performance tuning is an essential part of running Hive queries, as it helps the programmer reduce execution time and increase the efficiency of the code. By applying the correct approaches and best optimization practices, we can achieve significant savings in both the execution time and storage footprint of Hive queries, as well as improve the overall performance of our cluster, and it is here that performance tuning plays an important role.
In this article, I want to share some of the practices of Performance Tuning configuration and queries, which I have found effective while using Hive.
First of all, let’s understand the steps for tuning Hive queries.
There are several Hive optimization techniques we can apply when running our Hive queries to improve performance, each contributing to performance tuning:
It is extremely important to make sure that the tables being used as sources in a Hive query are not being used by another process at the same time. Otherwise the tables can get locked, and our query can be stuck for an unknown amount of time.
We can use the parameters below for making sure that the tables are not being locked:
set hive.support.concurrency=false;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager;
set hive.txn.strict.locking.mode=false;
While executing Hive queries, the TEZ execution engine is the preferred choice because it eliminates unnecessary disk access: it reads data from disk once, performs the calculations, and produces the output, thus saving us from multiple disk traversals. We can consider TEZ a much more flexible and powerful successor to the MapReduce framework.
We can set the parameter below to use the TEZ engine:
set hive.execution.engine=tez;
Apache Hive provides a cost-based optimizer (CBO) to improve performance. By examining the query cost, it generates more efficient execution plans, deciding, for example, how to order joins, which type of join to perform, and the degree of parallelism. The statistics driving these decisions are collected by ANALYZE statements or from the metastore itself, ultimately cutting down on query execution time and reducing resource utilization.
We can set the parameter using:
set hive.cbo.enable=true;
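Since the CBO relies on table and column statistics, it also helps to collect them explicitly and enable the related statistics parameters. A sketch (the table name is illustrative; the settings and ANALYZE syntax follow standard Hive configuration):

```sql
-- Let the optimizer answer simple queries from statistics and fetch
-- column/partition stats when planning
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

-- Collect table-level and column-level statistics for the CBO to use
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;
```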
We can improve the performance of aggregations, filters, and joins in our Hive queries by using vectorized query execution, which means scanning the data in batches of 1024 rows at once instead of a single row at a time.
We should also explore the parameters below, which bring in more parallelism and can significantly improve query execution time:
set hive.vectorized.execution.enabled=true;
set hive.exec.parallel=true;
For example:
Select a.*, b.*
from (select * from table1) a
join (select * from table2) b
on a.id = b.id;
As we can see, the two subqueries are independent, so parallel execution might increase efficiency here.
One important thing to note is that parallel execution will increase cluster utilization. If the cluster utilization is already very high, parallel execution will not help much.
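As a sketch, `hive.exec.parallel.thread.number` (a standard Hive setting, default 8) caps how many independent stages may run at once; the two branches of the UNION ALL below have no dependency on each other, so Hive can launch them as parallel jobs (table names are illustrative):

```sql
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;  -- max stages executed concurrently

-- The two aggregation branches are independent of each other,
-- so with parallel execution enabled they can run at the same time.
select 'web' as channel, count(*) as cnt from web_orders
union all
select 'store' as channel, count(*) as cnt from store_orders;
```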
When we are joining multiple tables, we can use the STREAMTABLE hint. By default, the right-most table in the join gets streamed.
For example, if we join two tables as ‘huge_table’ JOIN ‘small_table’, by default ‘small_table’ gets streamed because it is the right-most table. In that case ‘huge_table’, being the bigger table, will be buffered into memory, which might cause Java heap space issues or make the job run longer. What we can do instead is add the hint /*+ STREAMTABLE(huge_table) */, which makes ‘huge_table’ the streamed table rather than the one loaded into memory.
Hence, in this way, we do not have to remember the order in which the tables are joined.
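A minimal sketch of the hint in a full query (table and column names are illustrative; the hint names the table or alias to stream):

```sql
select /*+ STREAMTABLE(huge_table) */ huge_table.col1, small_table.col2
from huge_table
join small_table
on huge_table.id = small_table.id;  -- huge_table is streamed, not buffered
```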
If one of the tables in the join is a small table and can be loaded into memory, we can force a MAPSIDE join like shown below:
Select /*+ MAPJOIN(small_table) */ large_table.col1, large_table.col2
from large_table
join small_table on large_table.col1 = small_table.col1;
A map-side join is performed entirely within the mappers, without requiring a reduce step.
Also, we can let the execution engine take care of this automatically by setting hive.auto.convert.join to true:
set hive.auto.convert.join=true;
We should avoid using calculated fields in JOIN and WHERE clauses, as they make Hive queries slow to run. We can use a CTE (Common Table Expression) to handle those calculations ahead of time and optimize our queries.
For example:
Original query:
select a.col1, b.col2
from table1 as a
join table2 as b
on (a.col1 + 50 = b.col2);
Optimized query:
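A sketch of the rewrite, precomputing the calculated field in a CTE so the join condition compares plain columns (column names follow the example above):

```sql
with a_calc as (
  -- compute the calculated field once, up front
  select col1, col1 + 50 as col1_plus50
  from table1
)
select a_calc.col1, b.col2
from a_calc
join table2 as b
on (a_calc.col1_plus50 = b.col2);
```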
Hive supports both ORDER BY and SORT BY clauses. ORDER BY enforces a total ordering through a single reducer, which causes a performance bottleneck. SORT BY, in contrast, orders the data only within each reducer, performing a local ordering where each reducer’s output is sorted, which ensures better performance.
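A sketch (table and column names are illustrative): pairing SORT BY with DISTRIBUTE BY keeps all rows for a given key on the same reducer while still sorting locally, which is often a good substitute for a full ORDER BY:

```sql
select customer_id, order_date, amount
from orders
distribute by customer_id          -- route each customer's rows to one reducer
sort by customer_id, order_date;   -- sort locally within each reducer
```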
While using Hive, if we need only a few columns from a table, we should avoid SELECT * FROM, as it adds unnecessary time to the execution.
Using an appropriate file format based on the data can significantly increase query performance. Hive comes with columnar input formats like RCFile, ORC, etc. Compared to Text, Sequence, and RC file formats, ORC shows better performance because Hive has a vectorized ORC reader, which reduces read operations in analytics queries by allowing each column to be accessed individually.
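A sketch of creating and loading an ORC table (table names and the compression property are illustrative; STORED AS ORC and TBLPROPERTIES are standard Hive DDL):

```sql
create table sales_orc (
  id        int,
  amount    double,
  sale_date string
)
stored as orc
tblproperties ('orc.compress' = 'SNAPPY');  -- columnar storage with compression

-- Populate the ORC table from an existing text-format table
insert overwrite table sales_orc
select id, amount, sale_date from sales_text;
```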
This is a fundamental principle of any tuning: filter or drop records as early in the process as possible, so that we avoid long-running queries and unnecessary intermediate results.
For example:
a join b where a.col is not null
can be written as
(select * from a where a.col is not null) join b
Here we build a common dataset (a superset) and then use it to insert into multiple tables, based on the specific conditions in each WHERE clause. This lets us load multiple tables in parallel when they share a common superset of data.
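A sketch of Hive’s multi-table insert syntax for this pattern (table and column names are illustrative): the source is scanned once, and each INSERT branch filters the superset with its own WHERE clause:

```sql
from staging_orders s
insert overwrite table orders_us
  select s.* where s.country = 'US'
insert overwrite table orders_eu
  select s.* where s.country in ('DE', 'FR', 'UK');
```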
This helps in terms of maintenance of the code. It also helps in restarting jobs in failure scenarios, and it can be used to achieve parallelism by running modules that are independent of each other, thus saving time.
We should avoid setting a lot of Hive parameters when they are not needed. We should first understand the data well and then use parameters efficiently; otherwise we might end up degrading overall cluster utilization and impacting other processes.
Most of these techniques will work on all jobs, but spend the effort where it counts: start with the jobs that have the biggest impact on performance, that is, the ones that take many hours.
For example, a 10% gain on a 5-hour job saves 30 minutes of processing time!
In this article, we covered optimization techniques for Hive queries, both configuration-wise and query-wise, and how performance tuning can be implemented in Hive. I hope this article helped you understand how to efficiently tune Hive queries so that they execute faster on your clusters. Please don’t forget to drop in your comments if you have any doubts.
Nausheen Aafreen is a Big Data Engineer and Data Research Analyst working at a multinational artificial intelligence company. She has spent over 5.6 years working on Data Engineering and Data Science projects, across technologies including big data, automation, and data research.
The media shown in this article is not owned by Analytics Vidhya and is used at the author’s discretion.