HIVE Does NOT Like COALESCE Function As JOIN Condition

The COALESCE expression returns the first not NULL value. It is a shortcut for the CASE expression.  In relational database such as SQL server, it can be used in the SELECT clause, WHERE clause as well as JOIN condition. COALESCE can also be used in NON-relational database such as Hive, which use a SQL-like queries HiveQL. In this blog, I want to show you that HIVE does not like COALESCE function as the JOIN condition.

To illustrate what I mean by that, let me show you an example with two tables t1 and t2. Here is the query to create the two tables.

CREATE TABLE t1(
trd_id bigint,
trd_nb bigint
) ;
INSERT INTO t1 VALUES(1, 1) ;
INSERT INTO t1 VALUES(null, 2);
INSERT INTO t1 VALUES(null, null);
INSERT INTO t1 VALUES(3, null);

INSERT INTO t1 VALUES(null, 4) CREATE TABLE t2(
trd_id bigint,
trd_nb bigint
) ;
INSERT INTO t2 VALUES(1, 1) ;
INSERT INTO t2 VALUES(2, null);
INSERT INTO t2 VALUES(null, null);
INSERT INTO t2 VALUES(3, null)INSERT INTO t2 VALUES(null, 4);

Quickly run a select statement, you will see the result as shown below.

Table t1

t1

Table t2

t2

These two tables has exactly same schemas. They both have 5 rows and 4 of 5 are identical. The only difference is that the second row in which the data for trd_id and trd_nb columns are switched. These two columns have same data type, it is possible to JOIN them with each other. To illustrate the COALESCE function, I have intentionally introduced several NULL values into these two columns.

First, Let’s run a query with trd_id as the JOIN condition.

SELECT t1.*, t2.* FROM t1 INNER JOIN t2 ON t2.trd_id = t1.trd_id

From the above table information, we can easily know that two records should be returned. This is exactly what the result shown.

INNERjoin

What if we change the condition by combining trd_id and trd_nb two columns with COALESCE function? Let’s try the following query.

SELECT t1.*, t2.* FROM t1 INNER JOIN t2
ON COALESCE(t2.trd_id, t2.trd_nb )  = COALESCE(t1.trd_id , t1.trd_nb)

Since we have combined the two columns and COALESCE function picks the first non NULL value, one would expect that it should return 4 records. However, it only returns one record as shown below.

COALESCEjoin

Sine the COALESCE function is a shortcut of CASE expression, let’s test with a CASE expression as the JOIN condition.

SELECT t1.*, t2.*
FROM t1 JOIN t2
ON CASE WHEN t2.trd_id is NULL THEN t2.trd_nb
        ELSE t2.trd_id 
   END 
   = 
   CASE WHEN t1.trd_id IS NULL THEN t1.trd_nb
         ELSE t1.trd_id
   END

Just like the COALESCE function, The CASE expression returns exactly same result (not shown).

To overcome the caveat of COALESCE function, I have designed a workaround method by creating an intermediate column using a subquery. Then use the intermediate column as the JOIN condition. The detailed query are showed as below.

SELECT *
FROM 
(SELECT *,
        CASE WHEN t1.trd_id IS NULL then t1.trd_nb
            ELSE t1.trd_id
        END col1
FROM t1) a 
JOIN (
SELECT *,
    CASE WHEN t2.trd_id is NULL THEN t2.trd_nb
        ELSE t2.trd_id 
    END col2
FROM t2   ) b
   ON a.col1 = b.col2

As expected, this workaround method does return 4 records as shown below.

workaround

Summary

In this blog, I have shown you an interesting observation that HIVE does not like the COALESCE function as the JOIN condition.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s