Performance issue in hive version 0.13.1

Question:

I use AWS-EMR to run my Hive queries and I have a performance issue while running hive version 0.13.1.

The newer version of hive took around 5 minutes for running 10 rows of data. But the same script for 230804 rows is taking 2 days and is still running. What should I do to analyze and fix the problem?

Sample Data:

Table 1:

Sample data for table1:

Table 2:

Sample data for Table 2:

My Query:

Attached the hive logs for the above query.

Answer:

In the very top of your Hive log output, it states “Warning: Shuffle Join JOIN[4][Tables a, b] in Stage ‘Stage-1 Mapred’ is a cross product.”

EDIT:
A ‘cross product’ or Cartesian product is a join without conditions, which returns every row in the ‘b’ table, for every row in the ‘a’ table. So, if you take an example of ‘a’ is 5 rows, and ‘b’ is 10 rows, you get the product, or, 5 multiplied by 10 = 50 rows returned. There will be a lot of rows that are completely ‘null’ for one or the other tables.

Now, if you have a table ‘a’ of 20,000 rows and join it to another table ‘b’ of 500,000 rows, you are asking the SQL engine to return to you a data set ‘a, b’ of 10,000,000,000 rows, and then perform the BETWEEN operation on the 10-million rows.

So, if you drop the number of ‘b’ rows, you see you will get more benefit than the ‘a’ – in your example, if you can filter the ip_logs table, table 2, since I am making a guess that it has more rows than your order number table, it will cut down on the execution time.
END EDIT

You’re forcing the execution engine to work through a Cartesian product by not specifying a condition for the join. It’s having to scan all of table a over and over. With 10 rows, you will not have a problem. With 20k, you are running into dozens of map/reduce waves.

Try this query:

But I’m having trouble figuring out what column your model will allow joining on. Maybe the data model for this expression could be improved? It may just be me not reading the sample clearly.

Either way, you need to filter the number of comparisons BEFORE the where clause. Other ways I have done this in Hive is to make a view with a smaller set of data, and join/match the view instead of the original table.

Leave a Reply