Hive Does Not Like the COALESCE Function as a JOIN Condition

The COALESCE expression returns the first non-NULL value; it is a shortcut for the CASE expression. In relational databases such as SQL Server, it can be used in the SELECT clause, the WHERE clause, and as a JOIN condition. COALESCE is also available in non-relational systems such as Hive, which uses the SQL-like query language HiveQL. In this blog, I want to show you that Hive does not like the COALESCE function as a JOIN condition.

To illustrate what I mean by that, let me show you an example with two tables, t1 and t2. Here are the queries to create them.

CREATE TABLE t1 (
trd_id bigint,
trd_nb bigint
);
INSERT INTO t1 VALUES(null, null);

CREATE TABLE t2 (
trd_id bigint,
trd_nb bigint
);
INSERT INTO t2 VALUES(null, null);

Quickly run a select statement and you will see the results shown below.

Table t1


Table t2


These two tables have exactly the same schema. Each has 5 rows, and 4 of the 5 are identical; the only difference is the second row, in which the values of the trd_id and trd_nb columns are swapped. Since these two columns have the same data type, it is possible to JOIN them with each other. To illustrate the COALESCE function, I have intentionally introduced several NULL values into these two columns.

First, let’s run a query with trd_id as the JOIN condition.

SELECT t1.*, t2.* FROM t1 INNER JOIN t2 ON t2.trd_id = t1.trd_id

From the table contents above, we can easily see that two records should be returned, and this is exactly what the result shows.


What if we change the condition by combining the trd_id and trd_nb columns with the COALESCE function? Let’s try the following query.

SELECT t1.*, t2.* FROM t1 INNER JOIN t2
ON COALESCE(t2.trd_id, t2.trd_nb )  = COALESCE(t1.trd_id , t1.trd_nb)

Since we have combined the two columns and the COALESCE function picks the first non-NULL value, one would expect the query to return 4 records. However, it returns only one record, as shown below.


Since the COALESCE function is a shortcut for the CASE expression, let’s test a CASE expression as the JOIN condition.

SELECT t1.*, t2.* FROM t1 INNER JOIN t2
ON CASE WHEN t2.trd_id IS NULL THEN t2.trd_nb
        ELSE t2.trd_id
   END =
   CASE WHEN t1.trd_id IS NULL THEN t1.trd_nb
        ELSE t1.trd_id
   END

Just like the COALESCE function, the CASE expression returns exactly the same result (not shown).

To overcome this caveat of the COALESCE function, I designed a workaround that creates an intermediate column in a subquery and then uses that column as the JOIN condition. The detailed query is shown below.

SELECT a.*, b.*
FROM (SELECT t1.*,
        CASE WHEN t1.trd_id IS NULL THEN t1.trd_nb
            ELSE t1.trd_id
        END col1
    FROM t1) a
INNER JOIN
    (SELECT t2.*,
        CASE WHEN t2.trd_id IS NULL THEN t2.trd_nb
            ELSE t2.trd_id
        END col2
    FROM t2) b
   ON a.col1 = b.col2

As expected, this workaround does return 4 records, as shown below.
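The workaround’s logic, precomputing a COALESCE key per row and then joining on that key, can be sketched in plain Python (using None for NULL; the sample rows below are hypothetical, since the full table contents appear only in the screenshots):

```python
def coalesce(*values):
    """Return the first non-None value, like SQL's COALESCE."""
    for v in values:
        if v is not None:
            return v
    return None

# Hypothetical (trd_id, trd_nb) rows with NULLs, in the spirit of t1 and t2
t1 = [(1, None), (None, 2)]
t2 = [(1, None), (2, None)]

# Inner join on the precomputed coalesce key, as in the workaround query
joined = [(a, b) for a in t1 for b in t2 if coalesce(*a) == coalesce(*b)]
print(joined)  # [((1, None), (1, None)), ((None, 2), (2, None))]
```

Both rows match because the key falls back to trd_nb whenever trd_id is NULL, which is the result the Hive COALESCE join was expected to produce.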



In this blog, I have shown you an interesting observation: Hive does not like the COALESCE function as a JOIN condition.

Useful websites for improving your SQL skills

In this blog, I have listed some of the websites which I used for honing my SQL skills.

1.  modeanalytics

One of my favorites is modeanalytics. Its SQL tutorial covers basic to advanced lessons, which I found very helpful for intermediate to experienced SQL users.

2.  sqlfiddle

Another website I highly recommend is sqlfiddle. The website provides a very interactive interface: you can create your own schema and tables. It supports most of the major RDBMSs, such as MS SQL, Oracle, MySQL, PostgreSQL and SQLite.

3. sqlzoo

sqlzoo provides practice exercises and quizzes to hone your SQL skills. It supports only the MySQL, Oracle and MS SQL engines. This website is good for beginners.

4. w3schools

W3Schools provides very detailed tutorials for almost all available RDBMS systems. This website is very helpful for beginners learning all the nuts and bolts of SQL.






Summary of Data Visualization Methods with Python

In this blog, I present a brief summary of the data visualization methods from the Open Machine Learning Course, Topic 2: Visual data analysis with Python. I hope this summary provides a quick reference for your next data visualization analysis.

1. Univariate Visualization

1.1 Quantitative features (continuous and discrete)

1.1.1 Histogram
1.1.2 Kernel density plot
1.1.3 Seaborn’s distplot
1.1.4 Box plot (box, whiskers, and outliers)
1.1.5 Seaborn’s violinplot (smoothed distribution) and boxplot

1.2 Categorical and binary features

1.2.1 Bar plot (frequency table)
1.2.2 Seaborn’s countplot

2. Multivariate visualization

2.1 Quantitative-Quantitative

2.1.1 Scatter plot
2.1.2 Seaborn’s heatmap
2.1.3 Seaborn’s jointplot
2.1.4 Seaborn’s pairplot
2.1.5 DataFrame correlation matrix

2.2 Quantitative-Categorical

2.2.1 Seaborn’s lmplot
2.2.2 Seaborn’s boxplot and violinplot
2.2.3 Seaborn’s factorplot (one quantitative with two categorical variables)

2.3 Categorical-Categorical

2.3.1 Seaborn’s countplot
2.3.2 Pandas crosstab

3. Whole Dataset

3.1 Histogram and pairplot

3.2 Dimensionality reduction

3.2.1 Linear algorithm – PCA (Principal Component Analysis)
3.2.2 Non-linear algorithm – Manifold Learning – t-SNE (t-distributed Stochastic Neighbor Embedding).

For the detailed code of each method, please refer to the original blog. I hope you find this useful. Please feel free to leave any comments. Thanks.

My Collection of Blogs and Websites in My Data Science Journey

I have been studying data science for more than two years. I would like to share a list of blogs and websites collected during my learning journey. I hope it is useful for you. Many thanks to my LinkedIn connections, especially @Georgeliu, @Favio, @Sarah Nooravi, @Randy Lao.





Access Sql Server From PowerShell

PowerShell is an interactive command-line shell used for task automation and configuration management on many Microsoft platforms. Today, I am going to show you how to access Sql Server from PowerShell with the sqlserver module. To enable PowerShell to access Sql Server, you need to have the sqlserver module installed; you can follow the steps shown in this Microsoft document.

To confirm you have sqlserver module properly installed, you can run the following commands in PowerShell.

    import-module sqlserver

    get-module sqlserver

If everything goes well, you should see something similar to the screenshot shown below. In case there are any problems, you can refer to this nice blog.


To further confirm you have all the Cmdlets installed, you can run the following command.

    get-command -module sqlserver

You should see a list of Cmdlets as shown below:


Now, you are ready to access a Sql Server database. You need to know the server and instance names, and certainly you should have the right credentials. In this example, I will use the default instance name. To see a list of databases, you can run the following commands in sequence (remember to replace ServerName with your server name).


 cd SQLSERVER:
 dir SQLSERVER:\SQL\ServerName\Default\Databases

The first command switches you to the SQLSERVER provider drive; the second shows you the whole list of databases on your server. If the commands run successfully, you should see something similar to the screenshot below. As you can see, the command gives you a big picture of the databases on your server.


To select a specific database, you can add a select command followed by your database name.

 dir SQLSERVER:\SQL\ServerName\Default\Databases | select DatabaseName

In this example, I select a database named ‘Test’. The command shows only the database name and no other database information such as size, recovery model, etc.


Now, I want to create a table named ‘Persons’ in this database and then insert data into it. As you can see from the screenshot shown below, I need to invoke the Invoke-Sqlcmd command followed by the quoted query and the ServerInstance information. One thing I want to emphasize is that you need to specify the database by using a three-part table name. After inserting the data, you can check whether the insert was successful by running a select command. As you can see, it returns two rows as expected.


You can also run DDL commands such as DROP TABLE. In the following screenshot, I first run a drop command and then try to run a select command. As expected, the select statement returns an “Invalid object name” error, confirming the table was dropped.


If you want to save time, you can run all the steps from one file. In this example, I created a sqlcmd.sql file which contains the create, insert and select steps, as shown below.


To run the commands saved in the file, use the following command. If you want to save the result into a file, you can do so by specifying the full path with Out-File.

Invoke-Sqlcmd -InputFile "full path" | Out-File -FilePath "full path"

The screenshot below shows the result without specifying the out file.


As mentioned, you can route the result into a file by specifying a full path of an output file.


If you check the output file as shown below, you can easily see that the data is separated by spaces, which is exactly what we saw before.



In this blog, I have shown you how to use the sqlserver module to access Sql Server from PowerShell. You can interactively access and modify Sql Server data from the PowerShell command line. In addition, you can stack all the steps together in one file and run it with a single command for convenience. This module provides an alternative way to interact directly with Sql Server.

Thank you for your time reading my blog. If you find it helpful, please feel free to share it with your friends.


A Tweet Sentiment Analysis on Bitcoin

Last weekend, I had some time to work on a sentiment analysis project. Specifically, I worked on a bitcoin sentiment through analyzing freshly collected tweets.

To collect fresh tweets, I followed the example from Microsoft. I like this method very much because it contains a progress bar, so you can see the progress in real time. This can be very important, especially when you collect tweets on topics that are not that popular, since collection can take several hours. One problem you may not like is that the “progressbar” package for Python 3 is not available in Anaconda, at least as I tested.

To restrict the topic to bitcoin, I added “bitcoin” and “cryptocurrency” as filters. I collected 10,000 tweets in total. Since this was a weekend project, I didn’t have much time to write my own sentiment algorithm. Fortunately, there is a Python package called “TextBlob” which simplifies text processing; you can find information about TextBlob here. This package provides a simple API for common natural language processing (NLP) tasks such as classification and sentiment analysis.


To learn a little more about the tweet contents, I pooled all the tweet text together. To make it a little cleaner, I filtered out stop words as well as punctuation. Since there are a lot of links and retweets, I also filtered three additional strings (“http”, “https” and “RT”). After this simple pre-processing, I was able to generate a word cloud, shown below, using the wordcloud Python package. As expected, bitcoin is the most frequent word in the tweets. Other related words such as blockchain, Ethereum and cryptocurrency also appear with high frequency.
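The pre-processing described above can be sketched in plain Python; the sample tweet and the tiny stop-word list here are hypothetical stand-ins (a real run would use something like NLTK’s stop-word list):

```python
import string

# Hypothetical sample of the pooled tweet text
text = "RT @alice: Bitcoin https://t.co/xyz is the future of cryptocurrency!"

stop_words = {"the", "of", "is", "a", "an", "and"}  # illustrative subset
extra_filters = ("http", "https", "rt")  # links and retweet markers

# Strip punctuation, lowercase, then drop stop words and filtered strings
cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
tokens = [w for w in cleaned.split()
          if w not in stop_words and not any(f in w for f in extra_filters)]
print(tokens)  # ['alice', 'bitcoin', 'future', 'cryptocurrency']
```

The surviving tokens are what a word-cloud package would then count and draw.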


To further analyze the sentiment, I used the TextBlob package mentioned before. Out of the 10,000 tweets, 5,493 are classified as neutral, 3,698 as positive, and 809 as negative, so fewer than 10 percent of the tweets are negative.
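TextBlob assigns each tweet a polarity score in [-1, 1]; the bucketing into the three classes can be sketched as follows (the zero thresholds are an assumption, following the common convention):

```python
def classify(polarity):
    """Map a TextBlob-style polarity score in [-1, 1] to a label.

    Assumes the common convention: > 0 positive, < 0 negative, 0 neutral.
    """
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# Hypothetical polarity scores for four tweets
scores = [0.8, 0.0, -0.3, 0.1]
labels = [classify(p) for p in scores]
print(labels)  # ['positive', 'neutral', 'negative', 'positive']
```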



As I mentioned earlier, I didn’t have much time to work on the project, so the processing steps are a little simple. For example, emojis used in tweets could be a good indicator for sentiment analysis. Furthermore, the sample size is also a little small, so it may not reflect real-world sentiment. Anyway, from this analysis, it seems that bitcoin sentiment is still pretty positive. If you want to learn more about how I did it, please feel free to check it out on my Github.

If you have any comments, you are more than welcome to leave them. Thank you for your time.

Extracting tables in PDF Format with Tabula-py

Tabula is a PDF table extraction tool written in Java. Although there are wrappers in R, Ruby, and Node.js, a Python wrapper wasn’t available until recently. I had been searching for one for a while. Luckily, I recently found one, as documented in chezou’s github and Aki Ariga’s blog.

I have done some simple tests and it works great! To learn how to install it, please refer to the blog and the github linked above.

1. Read a single PDF table within one page

It is easy to read a single table within one page; you almost don’t need to specify any parameters. For the output format, there are several choices such as DataFrame (default), JSON, CSV and txt.


2. Read two PDF tables in one page

When reading multiple tables in a PDF file, we should enable the multiple_tables parameter. If you forget to enable it, all the tables will be stacked together, as in the example shown below.


When you enable the multiple_tables parameter by setting its value to True, the read_pdf() function returns a list.


We can check the data type, the list length and slice the list.


To convert the original table into a data frame, we can run the following command.


3. Read multiple tables in multiple pages

When reading multiple tables from multiple pages, we must specify the pages and enable the multiple_tables parameter. The returned result is a list, as shown below.


To convert the original table into a data frame, we can run the following command.


4. Summary

In this blog, I have shown you how to read PDF tables with Tabula-py. This open source tool is very powerful and can easily extract tables from PDF files without your knowing any Java. However, for Tabula-py to function, you do need to install Java. For further details and examples, please refer to my github and chezou’s github.

If you have any questions, please leave a comment. I hope you find this blog helpful. Thanks!



Quick Check of Google Colaboratory

Colaboratory is a Chrome App developed by a joint team from Google Research, Matthew Turk, and the IPython/Jupyter project. It works like a Jupyter notebook. Colaboratory merges many open source packages with Google technologies, which facilitates collaboration and reduces the burden of setting up a working environment.

To check what packages Colaboratory has, I have done a simple test by importing packages. Here is what I found.

Data Processing Packages. Four commonly used data processing/analysis packages like numpy and pandas come in handy

  • Pandas
  • Numpy
  • Scipy
  • Statsmodels

Visualization Packages. I have tested four packages and three of them are available. The one tested but not found is Bokeh.

  • matplotlib
  • seaborn
  • plotly

Machine Learning Packages. I have tested five common python machine learning packages and two of them are available. Theano, Caffe and XGBoost are not available.

  • sklearn
  • tensorflow

Image Processing Packages. Two packages, Pillow and Skimage, are available. OpenCV, one of my favorite packages, is not available.

  • Pillow
  • Skimage

Natural Language Processing Packages. Only NLTK was tested and it is available.

  • NLTK

Packages that are not available

  • OpenCV
  • Bokeh
  • XGBoost
  • Theano
  • Caffe

Please check the notebook for detailed information.
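The availability test described above can be reproduced with the standard library’s importlib, which locates a package without fully importing it; the package list mirrors the ones tested in this post, and the results depend on your environment:

```python
import importlib.util

# Packages tested in this post; availability varies by environment
packages = ["pandas", "numpy", "scipy", "statsmodels",
            "matplotlib", "seaborn", "sklearn", "nltk"]

# find_spec returns None when a package cannot be located
availability = {name: importlib.util.find_spec(name) is not None
                for name in packages}

for name, ok in availability.items():
    print(f"{name}: {'available' if ok else 'not found'}")
```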



Built-in JSON function in Sql Server 2016 – OPENJSON

Starting with the 2016 version, Sql Server and Azure Sql Database have introduced built-in JSON functions which allow you to read JSON-format data or convert tabular data into JSON format. There are several built-in JSON functions, such as FOR JSON, OPENJSON, JSON_VALUE, JSON_QUERY, JSON_MODIFY and ISJSON. In my last blog, I explained and illustrated what the FOR JSON function can do. In this blog, I will explain and illustrate what OPENJSON can do.

About OPENJSON function

OPENJSON is a function which parses JSON-format data into tabular format. If you call OPENJSON without specifying the column names, it returns three columns, Key, Value and Type, as mentioned here. You can also specify the column names explicitly by using the WITH clause. Let’s use some examples to illustrate how it works.


Suppose we have JSON data as shown below.



If we run the default OPENJSON function against the above JSON data, we see the three columns mentioned before; the real schema column names appear as values in the key column. The detailed result is shown below.



Let’s look at a more complicated JSON file with a nested structure, as shown below.


If we read the above JSON data with the default OPENJSON query, we will see that the data is not completely parsed, because the value column still contains JSON-format data, as shown below.


To completely parse the nested JSON data, we need to use the WITH clause with an explicitly defined schema. For example, if we want to know the name and salary of each employee, we can use the query below.

SELECT * FROM OPENJSON(@js2, '$.EmployeeInfo')
WITH ( Name VARCHAR(50) '$.Name',
       Salary INT '$.Salary',
       Address1 VARCHAR(200) '$.Address1',
       Address2 VARCHAR(50) '$.Address2',
       ZipCode VARCHAR(50) '$.Zip',
       City VARCHAR(50) '$.City',
       State VARCHAR(50) '$.StateProvince' )

After you run the above query, you can see the structured tabular data in the image below.


We can also access nested data with a path; this is particularly useful when parsing multi-nested JSON data. For example, suppose we have the JSON data below and we want to access all of each employee’s information, including Name, Salary and Address. To access all the information, we can use OPENJSON with the WITH clause and nested paths. Below is the detailed query.

SELECT * FROM OPENJSON(@js3, '$.Employee')
WITH ( Name VARCHAR(50) '$.Name',
       Salary INT '$.Salary',
       Address1 VARCHAR(50) '$.Address.Address1',
       ZipCode VARCHAR(50) '$.Address.Zip',
       City VARCHAR(50) '$.Address.City',
       State VARCHAR(50) '$.Address.StateProvince' )

After we run the above query, we get all the information in one row, as shown below.
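The nested-path lookups that the WITH clause performs can be mimicked with Python’s json module; the sample document below is hypothetical but follows the shape the OPENJSON queries above assume:

```python
import json

# Hypothetical document shaped like the @js3 data the query reads
js3 = """{"Employee": [
  {"Name": "Kristeen", "Salary": 1420,
   "Address": {"Address1": "900 Rockville Pike", "Zip": "20852",
               "City": "Rockville", "StateProvince": "MD"}}
]}"""

rows = []
for emp in json.loads(js3)["Employee"]:
    addr = emp["Address"]  # one hop per dot in '$.Address.*'
    rows.append({"Name": emp["Name"], "Salary": emp["Salary"],
                 "Address1": addr["Address1"], "ZipCode": addr["Zip"],
                 "City": addr["City"], "State": addr["StateProvince"]})

print(rows)
```

Each dict in rows corresponds to one flattened row of the tabular result.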

In this blog, I have explained and illustrated what the OPENJSON function can do. The default OPENJSON function returns three columns: key, value and type. We can use OPENJSON with the WITH clause to define the data schema, and to access multi-nested JSON data we can specify the detailed path.

Thank you for your time reading my blog. If you have any questions, please leave a comment below.


Built-in JSON functions in Sql Server 2016 – FOR JSON

JSON is a lightweight text data format used for exchanging data in web and mobile applications. Starting with the 2016 version, Sql Server and Azure Sql Database have enabled you to convert tabular data into JSON format and vice versa via built-in JSON functions.

There are several built-in JSON functions such as FOR JSON, OPENJSON, JSON_VALUE, JSON_QUERY, JSON_MODIFY and ISJSON. In this blog, I will explain and illustrate what the FOR JSON function can do.

FOR JSON converts tabular-format data into JSON format. To illustrate what the FOR JSON function can do, let’s first create two tables with the following sql queries.

CREATE TABLE [dbo].[ADDRESS] (
[Address1] [varchar](100) NULL,
[Address2] [varchar](50) NULL,
[Zip] [varchar](10) NULL,
[City] [varchar](50) NULL,
[StateProvince] [varchar](50) NULL,
[EmployeeID] [int] NULL)

CREATE TABLE [dbo].[EMPLOYEES] (
[ID] [int] IDENTITY(1,1) NOT NULL,
[Name] [varchar](20) NULL,
[Salary] [int] NULL )

INSERT [dbo].[ADDRESS] ([Address1], [Address2], [Zip], [City], [StateProvince], [EmployeeID]) VALUES (N'900 Rockville Pike', N'Suit 10', N'20852', N'Rockville', N'MD', 1)
INSERT [dbo].[ADDRESS] ([Address1], [Address2], [Zip], [City], [StateProvince], [EmployeeID]) VALUES (N'36523 Hancock Road', NULL, N'20854', N'Potomac', N'MD', 2)

INSERT [dbo].[EMPLOYEES] ([ID], [Name], [Salary]) VALUES (1, N'Kristeen', 1420)
INSERT [dbo].[EMPLOYEES] ([ID], [Name], [Salary]) VALUES (2, N'Ashley', 2006)

To convert tabular data into JSON format, we can use the FOR JSON function. When you use this function, you can explicitly specify the output structure with FOR JSON PATH, or let the statement determine the output structure itself with FOR JSON AUTO; the latter is usually good for simple structures. For example, we can present each employee’s ID, Name and Salary in JSON format with FOR JSON AUTO, as shown below.


The result is shown in the image below. It looks not so bad using FOR JSON AUTO when the structure is simple.

Let’s look at a more complicated example with a nested structure. First we will use FOR JSON AUTO. For this data set, we want to nest all address-related information (Address1, Address2, City, Zip, and StateProvince) under Address, so we need to add the “Address.” prefix to those columns. Below is the detailed query.

SELECT E.Name, E.Salary,
A.Address1 AS [Address.Address1],
A.Address2 AS [Address.Address2],
A.City AS [Address.City],
A.Zip AS [Address.Zip],
A.StateProvince AS [Address.State]
FROM [dbo].[EMPLOYEES] E INNER JOIN [dbo].[ADDRESS] A
ON E.ID = A.EmployeeID
FOR JSON AUTO

The result is shown in the image below; we can see that all the address-related columns have ‘two-part’ column names, and the output is not so readable.

Next, let’s check FOR JSON PATH using the code shown below.

SELECT E.Name, E.Salary,
A.Address1 AS [Address.Address1],
A.Address2 AS [Address.Address2],
A.City AS [Address.City],
A.Zip AS [Address.Zip],
A.StateProvince AS [Address.State]
FROM [dbo].[EMPLOYEES] E INNER JOIN [dbo].[ADDRESS] A
ON E.ID = A.EmployeeID
FOR JSON PATH

As seen in the image below, the result looks much simpler and cleaner.

Last, we can add a root by using FOR JSON PATH, ROOT clause.

SELECT E.Name, E.Salary,
A.Address1 AS [Address.Address1],
A.Address2 AS [Address.Address2],
A.City AS [Address.City],
A.Zip AS [Address.Zip],
A.StateProvince AS [Address.State]
FROM [dbo].[EMPLOYEES] E INNER JOIN [dbo].[ADDRESS] A
ON E.ID = A.EmployeeID
FOR JSON PATH, ROOT('EmployeeInfo')

After adding the root, we can easily understand what information the result contains (see the image below).
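The nesting that the dotted aliases produce under FOR JSON PATH can be sketched in Python; this is only the flattening-to-nesting idea, not SQL Server’s implementation:

```python
import json

# Flat rows with dotted column aliases, as in the queries above
rows = [
    {"Name": "Kristeen", "Salary": 1420,
     "Address.Address1": "900 Rockville Pike", "Address.City": "Rockville"},
    {"Name": "Ashley", "Salary": 2006,
     "Address.Address1": "36523 Hancock Road", "Address.City": "Potomac"},
]

def nest(row):
    """Turn 'A.B' keys into nested objects, one level deep."""
    out = {}
    for key, value in row.items():
        first, *rest = key.split(".")
        if rest:
            out.setdefault(first, {})[rest[0]] = value
        else:
            out[first] = value
    return out

# ROOT('EmployeeInfo') wraps the array under a named root key
doc = {"EmployeeInfo": [nest(r) for r in rows]}
print(json.dumps(doc, indent=2))
```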


In this blog, I have explained and illustrated what the FOR JSON function can do. For a simple structure, we can use FOR JSON AUTO. For a more complicated structure, FOR JSON PATH produces simpler and more readable results.