Question:
I use AWS-EMR to run my Hive queries and I have a performance issue while running hive version 0.13.1.
The newer version of hive took around 5 minutes for running 10 rows of data. But the same script for 230804 rows is taking 2 days and is still running. What should I do to analyze and fix the problem?
Sample Data:
Table 1:
1 2 3 4 5 |
hive> describe foo; OK orderno string Time taken: 0.101 seconds, Fetched: 1 row(s) |
Sample data for table1:
1 2 3 4 5 6 7 |
hive>select * from foo; OK 1826203307 1826207803 1826179498 1826179657 |
Table 2:
1 2 3 4 5 6 7 8 |
hive> describe de_geo_ip_logs; OK id bigint startorderno bigint endorderno bigint itemcode int Time taken: 0.047 seconds, Fetched: 4 row(s) |
Sample data for Table 2:
1 2 3 4 5 6 7 |
hive> select * from bar; 127698025 417880320 417880575 306 127698025 3038626048 3038626303 584 127698025 3038626304 3038626431 269 127698025 3038626560 3038626815 163 |
My Query:
1 2 3 4 |
SELECT b.itemcode FROM foo a, bar b WHERE a.orderno BETWEEN b.startorderno AND b.endorderno; |
Answer:
In the very top of your Hive log output, it states “Warning: Shuffle Join JOIN[4][Tables a, b] in Stage ‘Stage-1 Mapred’ is a cross product.”
EDIT:
A ‘cross product’ or Cartesian product is a join without conditions, which returns every row in the ‘b’ table, for every row in the ‘a’ table. So, if you take an example of ‘a’ is 5 rows, and ‘b’ is 10 rows, you get the product, or, 5 multiplied by 10 = 50 rows returned. There will be a lot of rows that are completely ‘null’ for one or the other tables.
Now, if you have a table ‘a’ of 20,000 rows and join it to another table ‘b’ of 500,000 rows, you are asking the SQL engine to return to you a data set ‘a, b’ of 10,000,000,000 rows, and then perform the BETWEEN operation on the 10-million rows.
So, if you drop the number of ‘b’ rows, you see you will get more benefit than the ‘a’ – in your example, if you can filter the ip_logs table, table 2, since I am making a guess that it has more rows than your order number table, it will cut down on the execution time.
END EDIT
You’re forcing the execution engine to work through a Cartesian product by not specifying a condition for the join. It’s having to scan all of table a over and over. With 10 rows, you will not have a problem. With 20k, you are running into dozens of map/reduce waves.
Try this query:
1 2 3 4 |
SELECT b.itemcode FROM foo a JOIN bar b on WHERE a.orderno BETWEEN b.startorderno AND b.endorderno; |
But I’m having trouble figuring out what column your model will allow joining on. Maybe the data model for this expression could be improved? It may just be me not reading the sample clearly.
Either way, you need to filter the number of comparisons BEFORE the where clause. Other ways I have done this in Hive is to make a view with a smaller set of data, and join/match the view instead of the original table.