Summary of Data Visualization Methods with Python

In this blog, I present a brief summary of the data visualization methods covered in Topic 2 of the Open Machine Learning Course, "Visual data analysis with Python." I hope this summary provides a quick reference for your next data visualization analysis; a short code sketch covering a few of the listed methods follows the outline below.

1. Univariate Visualization

1.1 Quantitative features (continuous and discrete)

1.1.1 Histogram
1.1.2 Kernel density plot
1.1.3 Seaborn’s distplot
1.1.4 Box plot (box, whiskers, and outliers)
1.1.5 Seaborn’s violinplot (smoothed distribution) and boxplot

1.2 Categorical and binary features

1.2.1 Bar plot (frequency table)
1.2.2 Seaborn’s countplot

2. Multivariate visualization

2.1 Quantitative-Quantitative

2.1.1 Scatter plot
2.1.2 Seaborn’s heatmap
2.1.3 Seaborn’s jointplot
2.1.4 Seaborn’s pairplot
2.1.5 DataFrame correlation matrix

2.2 Quantitative-Categorical

2.2.1 Seaborn’s lmplot
2.2.2 Seaborn’s boxplot and violinplot
2.2.3 Seaborn’s factorplot (one quantitative variable with two categorical variables)

2.3 Categorical-Categorical

2.3.1 Seaborn’s countplot
2.3.2 Pandas crosstab

3. Whole Dataset

3.1 Histogram and pairplot

3.2 Dimensionality reduction

3.2.1 Linear algorithm – PCA (Principal Component Analysis)
3.2.2 Non-linear algorithm – Manifold Learning – t-SNE (t-distributed Stochastic Neighbor Embedding).
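
To give a flavor of how these methods look in code, here is a minimal sketch covering a few of them, assuming a small pandas DataFrame with hypothetical column names (total_day_minutes, customer_service_calls, churn) used only for illustration:

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    # Hypothetical data frame; replace with your own dataset.
    df = pd.DataFrame({
        "total_day_minutes": [120.5, 200.3, 150.0, 310.2, 95.7, 180.1],
        "customer_service_calls": [1, 0, 3, 2, 1, 0],
        "churn": [0, 0, 1, 1, 0, 1],
    })

    # 1.1 Univariate, quantitative: histogram and box plot
    df["total_day_minutes"].hist(bins=5)
    sns.boxplot(x=df["total_day_minutes"])

    # 1.2 Univariate, categorical: frequency bar plot
    sns.countplot(x="churn", data=df)

    # 2.1 Quantitative-quantitative: scatter plot, pair plot, correlation heatmap
    plt.scatter(df["total_day_minutes"], df["customer_service_calls"])
    sns.pairplot(df)
    sns.heatmap(df.corr(), annot=True)

    # 2.2 Quantitative-categorical: box plot per category
    sns.boxplot(x="churn", y="total_day_minutes", data=df)

    # 2.3 Categorical-categorical: cross tabulation
    print(pd.crosstab(df["churn"], df["customer_service_calls"]))

    # 3.2 Whole dataset: dimensionality reduction with PCA and t-SNE
    features = df[["total_day_minutes", "customer_service_calls"]]
    pca_2d = PCA(n_components=2).fit_transform(features)
    tsne_2d = TSNE(n_components=2, perplexity=2).fit_transform(features)

    plt.show()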

For the detailed code of each method, please refer to the original blog. I hope you find this useful. Please feel free to leave any comments. Thanks.


My Collection of Blogs and Websites in My Data Science Journey

I have been studying data science for more than two years. I would like to share a list of blogs and websites collected during my learning journey. I hope it is useful for you. Many thanks to my LinkedIn connections, especially @Georgeliu, @Favio, @Sarah Nooravi, and @Randy Lao.

N. Face Recognition




Access SQL Server From PowerShell

PowerShell is an interactive command-line shell used for task automation and configuration management across many Microsoft platforms. Today, I am going to show you how to access SQL Server from PowerShell with the sqlserver module. To enable PowerShell to access SQL Server, you need to have the sqlserver module installed. You can follow the steps shown in this Microsoft document.

To confirm that the sqlserver module is properly installed, you can run the following commands in PowerShell.

    Import-Module sqlserver

    Get-Module sqlserver

If everything goes well, you should see something similar to the screenshot shown below. In case there are any problems, you can refer to this nice blog.


To further confirm that all the cmdlets are installed, you can run the following command.

    Get-Command -Module sqlserver

You should see a list of cmdlets as shown below:


Now, you are ready to access a SQL Server database. You need to know the server and instance names, and of course you should have the right credentials. In this example, I use the default instance name. To see a list of databases, you can run the following command (remember to replace ServerName with your server name).


    dir SQLSERVER:\SQL\ServerName\Default\Databases

This command browses the SQLSERVER: provider drive and lists all of the databases on your server. If it runs successfully, you should see something similar to the screenshot below, which gives you a quick overview of the databases on your server.


To select a specific database, you can pipe the result to a select command followed by your database name.

    dir SQLSERVER:\SQL\ServerName\Default\Databases | select DatabaseName

In this example, I select a database named ‘Test’. It only shows you the database name and no other database information such as size, recovery model, etc.


Now, I want to create a table named ‘Persons’ in this database and insert data into it. As you can see from the screenshot below, I need to invoke the Invoke-Sqlcmd cmdlet followed by the quoted query and the ServerInstance information. One thing I want to emphasize is that you need to specify the database name by using a three-part table name. After inserting the data, you can check whether the insert was successful by running a SELECT statement. As you can see, it returns two rows as expected.


You can also run DDL commands like Drop table. In the following screenshot, I first run a drop command and then try to run a select command. As expected, the select statement returns an “Invalid object name” error, confirming the table was dropped.


If you want to save time, you can run all the steps from one file. In this example, I created a sqlcmd.sql file which contains the create, insert, and select steps, as shown below.


To run the commands saved in the file, you can run the following command. If you want to save the result to a file, you can do so by piping to Out-File with a full path.

Invoke-Sqlcmd -InputFile "full path" | Out-File -FilePath "full path"

The screenshot below shows the result without specifying the out file.


As mentioned, you can route the result into a file by specifying a full path of an output file.


If you check the output file as shown below, you can easily see that the data is separated by spaces, exactly as we have seen before.



In this blog, I have shown you how to use the sqlserver module to access SQL Server from PowerShell. You can interactively access and modify SQL Server data from the PowerShell command line. In addition, you can put all the steps together in one file and run them with a single command for convenience. This module provides an alternative way to interact directly with SQL Server.

Thank you for your time reading my blog. If you find it helpful, please feel free to share it with your friends.


A Tweet Sentiment Analysis on Bitcoin

Last weekend, I had some time to work on a sentiment analysis project. Specifically, I analyzed Bitcoin sentiment using freshly collected tweets.

To collect the fresh tweets, I followed the example from Microsoft. I like this method very much because it shows a progress bar, so you know the progress in real time. This can be very important, especially when you collect tweets on topics that are not very popular, since collection can take several hours. One drawback is that the “progressbar” package for Python 3 is not available through Anaconda, at least when I tested it.

To restrict the topic to Bitcoin, I added “bitcoin” and “cryptocurrency” as filter keywords and collected 10,000 tweets in total. Since this was a weekend project, I did not have much time to write my own sentiment algorithm. Fortunately, there is a Python package called TextBlob which simplifies text processing. You can find more information about TextBlob here. This package provides a simple API for common natural language processing (NLP) tasks such as classification as well as sentiment analysis.
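
The collection script itself follows the Microsoft example linked above; purely as an illustration (an assumption, not the exact method I used), a similar filtered collection could be sketched with a library such as tweepy, using placeholder credentials and the tweepy 3.x streaming API:

    import tweepy

    # Placeholder credentials; replace with your own Twitter API keys.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    collected = []

    class TweetCollector(tweepy.StreamListener):
        def on_status(self, status):
            # Keep the tweet text; returning False stops the stream at 10,000 tweets.
            collected.append(status.text)
            return len(collected) < 10000

    stream = tweepy.Stream(auth=auth, listener=TweetCollector())
    stream.filter(track=["bitcoin", "cryptocurrency"])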


To learn a little more about the tweet contents, I pooled all the tweet text together. To make it a little cleaner, I filtered out stop words as well as punctuation. Since there are a lot of links and retweets, I also filtered out three additional strings (“http”, “https”, and “RT”). After this simple pre-processing, I was able to generate a word cloud, shown below, using the wordcloud Python package. As expected, Bitcoin is the most frequent word in the tweets. Other related words such as blockchain, Ethereum, and cryptocurrency also appear with high frequency.
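
For reference, here is a minimal sketch of how such a word cloud can be generated with the wordcloud package; the tweets list is a placeholder, and the extra filtered strings are the ones mentioned above:

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud, STOPWORDS

    # Placeholder list of collected tweet texts.
    tweets = ["Bitcoin is up today", "RT great read on blockchain https://example.com"]

    # Pool all tweet text together and filter out stop words, links, and retweet markers.
    stopwords = set(STOPWORDS) | {"http", "https", "RT"}
    text = " ".join(tweets)

    wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)

    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()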


To further analyze the sentiment, I used the TextBlob package mentioned before. Out of the 10,000 tweets, 5,493 were classified as neutral, 3,698 as positive, and 809 as negative, so fewer than 10 percent of the tweets were negative.
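
As a sketch of how TextBlob can drive this kind of classification (the polarity thresholds below are a common convention and an assumption, not necessarily the exact ones used in my notebook):

    from collections import Counter
    from textblob import TextBlob

    # Placeholder list of tweet texts.
    tweets = ["Bitcoin to the moon!", "I lost money on bitcoin today", "Bitcoin price is 8000 USD"]

    def classify(text):
        # TextBlob polarity ranges from -1 (most negative) to 1 (most positive).
        polarity = TextBlob(text).sentiment.polarity
        if polarity > 0:
            return "positive"
        if polarity < 0:
            return "negative"
        return "neutral"

    counts = Counter(classify(t) for t in tweets)
    print(counts)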



As I mentioned earlier, I did not have much time to work on the project, so the processing steps are a bit simplistic. For example, emojis used in tweets could be a good indicator for sentiment analysis. Furthermore, the sample size is fairly small, so it may not reflect real-world sentiment. Still, from this analysis, it seems that the Bitcoin sentiment is pretty positive. If you want to learn more about how I did it, please feel free to check it out on my GitHub.

If you have any comments, you are more than welcome to leave them below. Thank you for your time.

Extracting tables in PDF Format with Tabula-py

Tabula is a PDF table extraction tool written in Java. Although there are multiple wrappers in R, Ruby, and Node.js, a Python wrapper was not available until recently. I had been searching for one for a while, and luckily I finally found tabula-py, as documented in chezou’s GitHub and Aki Ariga’s blog.

I have done some simple tests and it works great! To learn how to install it, please refer to the blog as well as the GitHub repository linked above.

1. Read a single PDF table within one page

It is easy to read a single table within one page. You almost don’t need to specify any parameters. For the output format, there are several choices such as DataFrame (default), JSON, CSV, and txt.
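
As a minimal sketch (the PDF file name is a placeholder, and depending on the tabula-py version read_pdf may return a single DataFrame or a list of DataFrames):

    import tabula

    # Read the table on the page; the default output format is a pandas DataFrame.
    df = tabula.read_pdf("single_table.pdf")

    # The raw JSON output can be requested instead.
    data = tabula.read_pdf("single_table.pdf", output_format="json")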


2. Read two PDF tables in one page

When reading multiple tables from a PDF file, we should enable the multiple_tables parameter. If you forget to enable it, all the tables will be stacked together, as in the example shown below.


When you enable the multiple_tables parameter by setting it to True, the read_pdf() function returns a list.


We can check the data type and the list length, and slice the list.


To convert the original table into a data frame, we can run the following command.
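
A sketch of the whole multiple-table flow, from reading to converting one element into a data frame, might look like this (the file name and index are placeholders):

    import pandas as pd
    import tabula

    # With multiple_tables=True, read_pdf returns a list with one entry per detected table.
    tables = tabula.read_pdf("two_tables.pdf", multiple_tables=True)
    print(type(tables), len(tables))

    # Each list element can be turned into (or may already be) a pandas DataFrame.
    df_first = pd.DataFrame(tables[0])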


3. Read multiple tables in multiple pages

When reading multiple tables from multiple pages, we must specify the pages parameter and enable the multiple_tables parameter. The returned result is a list, as shown below.


To convert the original table into a data frame, we can run the following command.
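
Along the same lines, a sketch for the multi-page case (the file name, page selection, and index are placeholders):

    import pandas as pd
    import tabula

    # Read every table on every page; pages can also be a range such as "1-3".
    tables = tabula.read_pdf("report.pdf", pages="all", multiple_tables=True)

    # Convert the table of interest (here the third one) into a DataFrame.
    df = pd.DataFrame(tables[2])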


4. Summary

In this blog, I have shown you how to read PDF tables with tabula-py. This open-source tool is very powerful and makes it easy to extract tables from PDF files without knowing Java. However, for tabula-py to work, you do need to have Java installed. For more detailed information and examples, please refer to my GitHub and chezou’s GitHub.

If you have any questions, please leave a comment. I hope you find this blog helpful. Thanks!



Quick Check of Google Colaboratory

Colaboratory is a Chrome App developed by a joint team consisting of Google Research, Matthew Turk, and the IPython/Jupyter development team. It works like a Jupyter notebook. Colaboratory merges many open-source packages with Google technologies, which facilitates collaboration and reduces the burden of setting up a working environment.

To check which packages Colaboratory has, I did a simple test by importing packages; a minimal sketch of this check appears after the lists below. Here is what I found.

Data Processing Packages. Four commonly used data processing/analysis packages, such as NumPy and pandas, come in handy:

  • Pandas
  • Numpy
  • Scipy
  • Statsmodels

Visualization Packages. I have tested four packages and three of them are available. The one tested but not found is Bokeh.

  • matplotlib
  • seaborn
  • plotly

Machine Learning Packages. I have tested five common Python machine learning packages and two of them are available. Theano, Caffe, and XGBoost are not available.

  • sklearn
  • tensorflow

Image Processing Packages. Two packages, Pillow and skimage, are available. OpenCV, one of my favorite packages, is not available.

  • Pillow
  • Skimage

Natural Language Processing Packages. Only NLTK was tested and it is available.

  • NLTK

Packages that are not available

  • OpenCV
  • Bokeh
  • XGBoost
  • Theano
  • Caffe
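
The check itself can be as simple as trying to import each package. Below is a minimal sketch of that approach (the package list is abbreviated; cv2 is the import name for OpenCV):

    import importlib

    # Abbreviated list of the packages tested above.
    packages = ["pandas", "numpy", "sklearn", "tensorflow", "cv2", "bokeh", "xgboost"]

    for name in packages:
        try:
            importlib.import_module(name)
            print(name, ": available")
        except ImportError:
            print(name, ": not available")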

Please check the notebook for detailed information.



Built-in JSON Functions in SQL Server 2016 – OPENJSON

Starting with SQL Server 2016, SQL Server and Azure SQL Database have introduced built-in JSON functions that allow you to read JSON data or convert tabular data into JSON format. There are several built-in JSON functions, such as FOR JSON, OPENJSON, JSON_VALUE, JSON_QUERY, JSON_MODIFY, and ISJSON. In my last blog, I explained and illustrated what the FOR JSON clause can do. In this blog, I will explain and illustrate what OPENJSON can do.

About OPENJSON function

OPENJSON is a function that parses JSON data into tabular format. If you call OPENJSON without specifying column names, it returns three columns (key, value, and type), as mentioned here. You can also specify the column names explicitly by using the WITH clause. Let’s use some examples to illustrate how it works.


Suppose we have the JSON data shown below.



If we run the default OPENJSON function against the above JSON data, we see the three columns mentioned before. The actual JSON property names appear as values in the key column. The detailed result is shown below.



Let’s look at a more complicated JSON document with a nested structure, as shown below.


If we read the above JSON data with the default OPENJSON query, we will see that the data is not completely parsed, because the value column still contains JSON data, as shown below.


To completely parse the nested JSON data, we need to use the WITH clause with an explicitly defined schema. For example, if we want to know the name, salary, and address details of each employee, we can use the query below.

SELECT *
FROM OPENJSON(@js2, '$.EmployeeInfo')
WITH (
    Name     VARCHAR(50)  '$.Name',
    Salary   INT          '$.Salary',
    Address1 VARCHAR(200) '$.Address1',
    Address2 VARCHAR(50)  '$.Address2',
    ZipCode  VARCHAR(50)  '$.Zip',
    City     VARCHAR(50)  '$.City',
    State    VARCHAR(50)  '$.StateProvince'
)

After you run the above query, you can see the structured tabular data as shown in the image below.


We can also access nested data with a path; this is particularly useful when parsing multi-nested JSON data. For example, suppose we have the JSON data below and want to access all of each employee’s information, including Name, Salary, and Address. To access all the information, we can use OPENJSON with a WITH clause and explicit paths. Below is the detailed query we can use.

SELECT *
FROM OPENJSON(@js3, '$.Employee')
WITH (
    Name     VARCHAR(50) '$.Name',
    Salary   INT         '$.Salary',
    Address1 VARCHAR(50) '$.Address.Address1',
    ZipCode  VARCHAR(50) '$.Address.Zip',
    City     VARCHAR(50) '$.Address.City',
    State    VARCHAR(50) '$.Address.StateProvince'
)

After we run the above query, we get all the information in one row, as shown below.

In this blog, I have explained and illustrated what the OPENJSON function can do. The default OPENJSON function returns three columns: key, value, and type. We can use OPENJSON with a WITH clause to define the data schema, and to access multi-nested JSON data we can specify the detailed path.

Thank you for your time reading my blog. If you have any questions, please leave a comment below.


Built-in JSON Functions in SQL Server 2016 – FOR JSON

JSON is a lightweight text data format used for exchanging data in web and mobile applications. Starting with SQL Server 2016, SQL Server and Azure SQL Database have enabled you to convert tabular data into JSON format and vice versa via built-in JSON functions.

There are several built-in JSON functions such as FOR JSON, OPENJSON, JSON_VALUE, JSON_QUERY, JSON_MODIFY and ISJSON. In this blog, I will explain and illustrate what the FOR JSON function can do.

FOR JSON converts tabular data into JSON format. To illustrate what the FOR JSON clause can do, let’s first create two tables with the following SQL query.

CREATE TABLE [dbo].[ADDRESS] (
[Address1] [varchar](100) NULL,
[Address2] [varchar](50) NULL,
[Zip] [varchar](10) NULL,
[City] [varchar](50) NULL,
[StateProvince] [varchar](50) NULL,
[EmployeeID] [int] NULL)

CREATE TABLE [dbo].[EMPLOYEES] (
[ID] [int] IDENTITY(1,1) NOT NULL,
[Name] [varchar](20) NULL,
[Salary] [int] NULL )

INSERT [dbo].[ADDRESS] ([Address1], [Address2], [Zip], [City], [StateProvince], [EmployeeID]) VALUES (N'900 Rockville Pike', N'Suit 10', N'20852', N'Rockville', N'MD', 1)
INSERT [dbo].[ADDRESS] ([Address1], [Address2], [Zip], [City], [StateProvince], [EmployeeID]) VALUES (N'36523 Hancock Road', NULL, N'20854', N'Potomac', N'MD', 2)

INSERT [dbo].[EMPLOYEES] ([ID], [Name], [Salary]) VALUES (1, N'Kristeen', 1420)
INSERT [dbo].[EMPLOYEES] ([ID], [Name], [Salary]) VALUES (2, N'Ashley', 2006)

To convert the tabular data into JSON format, we can use the FOR JSON clause. When you use it, you can explicitly specify the output structure with FOR JSON PATH, or let the statement determine the output structure itself with FOR JSON AUTO; the latter is usually good for simple structures. For example, we can present each employee's ID, Name, and Salary in JSON format with FOR JSON AUTO, as shown below.


The result is shown in the image below. FOR JSON AUTO does not look bad when the structure is simple.

Let’s look at a more complicated example with a nested structure. First, we will use FOR JSON AUTO. For this data set, we want to nest all address-related information (Address1, Address2, City, Zip, and StateProvince) under Address, so we need to add the "Address." prefix to those column aliases. Below is the detailed query.

SELECT E.Name, E.Salary,
A.Address1 AS [Address.Address1],
A.Address2 AS [Address.Address2],
A.City AS [Address.City],
A.Zip AS [Address.Zip],
A.StateProvince AS [Address.State]
FROM [dbo].[EMPLOYEES] E
JOIN [dbo].[ADDRESS] A
ON E.ID = A.EmployeeID
FOR JSON AUTO

The result is shown in the image below. We can see that all address-related columns have two-part column names, which is not very readable.

Next, let’s check FOR JSON PATH using the code shown below.

SELECT E.Name, E.Salary,
A.Address1 AS [Address.Address1],
A.Address2 AS [Address.Address2],
A.City AS [Address.City],
A.Zip AS [Address.Zip],
A.StateProvince AS [Address.State]
FROM [dbo].[EMPLOYEES] E
JOIN [dbo].[ADDRESS] A
ON E.ID = A.EmployeeID
FOR JSON PATH

As seen from the image below, the result looks much simpler and more readable.

Last, we can add a root by using the FOR JSON PATH, ROOT() clause.

SELECT E.Name, E.Salary,
A.Address1 AS [Address.Address1],
A.Address2 AS [Address.Address2],
A.City AS [Address.City],
A.Zip AS [Address.Zip],
A.StateProvince AS [Address.State]
FROM [dbo].[EMPLOYEES] E
JOIN [dbo].[ADDRESS] A
ON E.ID = A.EmployeeID
FOR JSON PATH, ROOT('EmployeeInfo')

After adding the root information, we can easily understand what information the result contains (see the image below).


In this blog, I have explained and illustrated what the FOR JSON clause can do. For a simple structure, we can use FOR JSON AUTO. For more complicated structures, FOR JSON PATH produces simpler and more readable results.

Create an Azure Data Factory Using PowerShell

It is easy to create an Azure Data Factory using PowerShell. However, you might fail to create one if you follow Microsoft's example without upgrading your PowerShell.

Prerequisites: you need to have a valid Azure Subscription and a resource group.

Step 1: Log in to your Azure account. Run the following command; it will pop up a window where you simply enter your username and password.


Step 2: Find your subscription. The following command will show you all subscriptions.


Step 3: Choose your subscription. If you have multiple subscriptions, you need to choose which one you want to use. You need the subscription ID, which you can find in the result of step 2.

Select-AzureRmSubscription -SubscriptionId "your subscription Id"

Step 4: Define the resource group name and data factory name parameters. You need to know which resource group you want to put your data factory in. If you don't remember the resource group name, you can get all your resource group names by running the Get-AzureRmResourceGroup command. The data factory name must start with a letter or a number, and can contain only letters, numbers, and the dash (-) character. That means you cannot use underscores, which are commonly used in naming. The data factory name should be at least 3 characters long and no more than 63 characters long. You can check here for the data factory naming conventions.

$resourceGroupName = "Your existing resource group name";

$dataFactoryName = "Your Data Factory Name";

Step 5: Create the new data factory. In my case, I had to use the New-AzureRmDataFactory cmdlet instead of the Set-AzureRmDataFactoryV2 command.

New-AzureRmDataFactory -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName;

If everything runs well, you should see something similar to the screenshot shown below.


Generate Table-Creation Script Using FOR XML Clause with PATH Mode

There are multiple ways to automate the generation of table-creation scripts based on existing tables in SQL Server. The common way is to use the 'Generate Scripts' feature to automate this process. This GUI method is relatively easy, but it often produces a lot of extra commands that may not meet your expectations. Other methods include recursive CTEs and cursors. In this blog, I am going to show you how to automate this process with the FOR XML clause in PATH mode. If you are not familiar with this method, here is a nice blog you can check.

To create a table, we need to know the table name, schema name, column names, and column data types. Fortunately, we can get all this information from two views: INFORMATION_SCHEMA.COLUMNS and INFORMATION_SCHEMA.TABLES.

Furthermore, we need to know the length for a string data type such as varchar, and the precision and scale for a decimal data type. In contrast, we don't need to define a length for integer and datetime data types. This information is also available in INFORMATION_SCHEMA.COLUMNS. We can use a CASE expression to decide whether we need to add that information. Similarly, we can use a CASE expression for nullable or non-nullable columns. Finally, we need to concatenate all the column definitions together with the FOR XML clause.

For the detailed query, please check the attached screenshot. You can also find a similar snippet in my GitHub repository (Table_Creation).

This snippet can be useful when you need to write a table-creation script of limited length. For example, it can be used when modifying an Azure Data Factory pipeline JSON file.