The COALESCE expression returns the first not NULL value. It is a shortcut for the CASE expression. In relational database such as SQL server, it can be used in the SELECT clause, WHERE clause as well as JOIN condition. COALESCE can also be used in NON-relational database such as Hive, which use a SQL-like queries HiveQL. In this blog, I want to show you that HIVE does not like COALESCE function as the JOIN condition.
To illustrate what I mean by that, let me show you an example with two tables t1 and t2. Here is the query to create the two tables.
CREATE TABLE t1( trd_id bigint, trd_nb bigint ) ; INSERT INTO t1 VALUES(1, 1) ; INSERT INTO t1 VALUES(null, 2); INSERT INTO t1 VALUES(null, null); INSERT INTO t1 VALUES(3, null); INSERT INTO t1 VALUES(null, 4) CREATE TABLE t2( trd_id bigint, trd_nb bigint ) ; INSERT INTO t2 VALUES(1, 1) ; INSERT INTO t2 VALUES(2, null); INSERT INTO t2 VALUES(null, null); INSERT INTO t2 VALUES(3, null)INSERT INTO t2 VALUES(null, 4);
Quickly run a select statement, you will see the result as shown below.
These two tables has exactly same schemas. They both have 5 rows and 4 of 5 are identical. The only difference is that the second row in which the data for trd_id and trd_nb columns are switched. These two columns have same data type, it is possible to JOIN them with each other. To illustrate the COALESCE function, I have intentionally introduced several NULL values into these two columns.
First, Let’s run a query with trd_id as the JOIN condition.
SELECT t1.*, t2.* FROM t1 INNER JOIN t2 ON t2.trd_id = t1.trd_id
From the above table information, we can easily know that two records should be returned. This is exactly what the result shown.
What if we change the condition by combining trd_id and trd_nb two columns with COALESCE function? Let’s try the following query.
SELECT t1.*, t2.* FROM t1 INNER JOIN t2 ON COALESCE(t2.trd_id, t2.trd_nb ) = COALESCE(t1.trd_id , t1.trd_nb)
Since we have combined the two columns and COALESCE function picks the first non NULL value, one would expect that it should return 4 records. However, it only returns one record as shown below.
Sine the COALESCE function is a shortcut of CASE expression, let’s test with a CASE expression as the JOIN condition.
SELECT t1.*, t2.* FROM t1 JOIN t2 ON CASE WHEN t2.trd_id is NULL THEN t2.trd_nb ELSE t2.trd_id END = CASE WHEN t1.trd_id IS NULL THEN t1.trd_nb ELSE t1.trd_id END
Just like the COALESCE function, The CASE expression returns exactly same result (not shown).
To overcome the caveat of COALESCE function, I have designed a workaround method by creating an intermediate column using a subquery. Then use the intermediate column as the JOIN condition. The detailed query are showed as below.
SELECT * FROM (SELECT *, CASE WHEN t1.trd_id IS NULL then t1.trd_nb ELSE t1.trd_id END col1 FROM t1) a JOIN ( SELECT *, CASE WHEN t2.trd_id is NULL THEN t2.trd_nb ELSE t2.trd_id END col2 FROM t2 ) b ON a.col1 = b.col2
As expected, this workaround method does return 4 records as shown below.
In this blog, I have shown you an interesting observation that HIVE does not like the COALESCE function as the JOIN condition.