Extracting Tables from PDF Files with Tabula-py

Tabula is a PDF table extraction tool written in Java. Although there are wrappers for R, Ruby, and Node.js, there was no Python wrapper available until recently. I had been searching for one for a while, and luckily I found tabula-py, as documented in chezou’s github and Aki Ariga’s blog.

I have done some simple tests and it works great! For installation instructions, please refer to the blog and the github linked above.

1. Read a single PDF table within one page

It is easy to read a single table from a single page. You hardly need to specify any parameters. For the output format, there are several choices such as DataFrame (default), JSON, CSV, and TXT.

[Screenshot: PDF_SingleTbl – single-table extraction result]
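
Since the screenshot above only shows the result, here is a minimal sketch of the call itself, assuming a hypothetical file single_table.pdf that contains one table on a single page:

import tabula

# read_pdf parses the table(s) in the PDF; depending on the tabula-py
# version it returns a single DataFrame or a list of DataFrames.
df = tabula.read_pdf("single_table.pdf")
print(df)

# The output format can also be changed, for example to JSON.
data = tabula.read_pdf("single_table.pdf", output_format="json")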

2. Read two PDF tables in one page

When reading multiple tables from a PDF file, we should enable the multiple_tables parameter. If you forget to enable it, all tables will be stacked together, as in the example shown below.

[Screenshot: PDF_TwoTbl – two tables stacked together when multiple_tables is not enabled]
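
For reference, a minimal sketch of that situation, assuming a hypothetical one-page file two_tables.pdf that contains two tables:

import tabula

# Without multiple_tables, older tabula-py versions return one DataFrame
# in which the rows of both tables end up stacked together.
stacked = tabula.read_pdf("two_tables.pdf")
print(stacked)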

When you enable the multiple_tables parameter by setting it to True, the read_pdf() function will return a list.

[Screenshot: PDF_TwoTbl1 – read_pdf with multiple_tables=True returns a list]

We can check the data type, check the list length, and slice the list.

[Screenshot: PDF_TwoTbl2 – checking the type, length, and slices of the returned list]

To convert the original table into a data frame, we can run the following command.

[Screenshot: PDF_TwoTbl3 – converting a list element into a DataFrame]
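
Putting the steps of this section together, here is a minimal sketch, again using the hypothetical two_tables.pdf:

import tabula
import pandas as pd

# With multiple_tables=True, read_pdf returns a list with one element
# per detected table.
tables = tabula.read_pdf("two_tables.pdf", multiple_tables=True)

print(type(tables))   # <class 'list'>
print(len(tables))    # 2 for this hypothetical file
print(tables[0])      # slice the list to inspect the first table

# Convert one element of the list into a DataFrame (older tabula-py
# versions return plain lists of rows rather than DataFrames).
df = pd.DataFrame(tables[0])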

3. Read multiple tables in multiple pages

When reading multiple tables from multiple pages, we must specify the pages and enable the multiple_tables parameter. The returned result is a list, as shown below.

[Screenshot: PDF_TwoTblTwopage1 – list returned when reading tables from multiple pages]

To convert the original table into a data frame, we can run the following command.

[Screenshot: PDF_TwoTblTwopage2 – converting a list element into a DataFrame]
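
A minimal sketch of this case, assuming a hypothetical file two_tables_two_pages.pdf with tables spread over the first two pages:

import tabula
import pandas as pd

# Specify the pages to read and enable multiple_tables.
tables = tabula.read_pdf("two_tables_two_pages.pdf",
                         pages="1-2",            # or pages="all"
                         multiple_tables=True)

print(len(tables))        # one list element per detected table

# Convert any element of the returned list into a DataFrame.
df = pd.DataFrame(tables[1])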

4. Summary

In this blog, I have shown you how to read PDF tables with tabula-py. This open source tool is very powerful and makes it easy to extract tables from PDF files without knowing Java. However, for tabula-py to work, you do need to install Java. For further details and examples, please refer to my github and chezou’s github.

If you have any questions, please leave a comment. I hope you find this blog helpful. Thanks!

 

 


Quick Check of Google Colaboratory

Colaboratory is a Chrome App developed by a joint team consisting of Google Research, Matthew Turk, and the IPython/Jupyter team. It works like a Jupyter notebook. Colaboratory combines many open source packages with Google technologies, which facilitates collaboration and reduces the burden of setting up a working environment.

To check which packages Colaboratory has, I did a simple test by importing packages (a sketch of the check appears at the end of this post). Here is what I found.

Data Processing Packages. Four commonly used data processing/analysis packages, such as numpy and pandas, come in handy.

  • Pandas
  • Numpy
  • Scipy
  • Statsmodels

Visualization Packages. I have tested four packages and three of them are available. The one tested but not found is Bokeh.

  • matplotlib
  • seaborn
  • plotly

Machine Learning Packages. I have tested five common Python machine learning packages and two of them are available. Theano, Caffe, and XGBoost are not available.

  • sklearn
  • tensorflow

Image Processing Packages. Two packages, Pillow and skimage, are available. OpenCV, one of my favorite packages, is not available.

  • Pillow
  • Skimage

Natural Language Processing Packages. Only NLTK was tested and it is available.

  • NLTK

Packages that are not available

  • OpenCV
  • Bokeh
  • XGBoost
  • Theano
  • Caffe

Please check the notebook for detailed information.
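
For reference, here is a minimal sketch of the kind of import check described above; the package list mirrors the ones tested in this post, and the exact code lives in the linked notebook.

import importlib

# Try importing each package and report whether it is pre-installed
# in the Colaboratory runtime.
packages = ["pandas", "numpy", "scipy", "statsmodels",
            "matplotlib", "seaborn", "plotly",
            "sklearn", "tensorflow",
            "PIL", "skimage", "nltk",
            "cv2", "bokeh", "xgboost", "theano", "caffe"]

for name in packages:
    try:
        importlib.import_module(name)
        print(name, "available")
    except ImportError:
        print(name, "not available")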

 

 

Built-in JSON Functions in SQL Server 2016 – OPENJSON

Starting with the 2016 release, SQL Server and Azure SQL Database have introduced built-in JSON functions that allow you to read JSON data or convert tabular data into JSON format. There are several built-in JSON functions, such as FOR JSON, OPENJSON, JSON_VALUE, JSON_QUERY, JSON_MODIFY, and ISJSON. In my last blog, I explained and illustrated what the FOR JSON function can do. In this blog, I will explain and illustrate what OPENJSON can do.

About OPENJSON function

OPENJSON is a function that parses JSON data into tabular data. If you call OPENJSON without specifying column names, it returns three columns, Key, Value, and Type, as mentioned here. You can also specify column names explicitly by using the WITH clause. Let’s use some examples to illustrate how it works.

Default OPENJSON.

Suppose we have JSON data as shown below.

[Screenshot: OpenjsonFig1 – sample JSON data stored in @js1]

SELECT * FROM OPENJSON(@js1)

If we run the above query with the default OPENJSON function against this JSON data, we see the three columns mentioned before. The actual schema columns appear as values in the Key column. The detailed result is shown below.

[Screenshot: OpenjsonRes1 – default OPENJSON result with Key, Value, and Type columns]

OPENJSON with WITH CLAUSE

Let’s look at a more complicated JSON document with a nested structure, as shown below.

[Screenshot: OpenjsonFig2 – nested JSON data stored in @js2]

If we read the above JSON data with the default OPENJSON query, we will see that the data is not completely parsed, because the Value column still contains JSON data, as shown below.

[Screenshot: OpenjsonRes1Fig2 – default OPENJSON result where the Value column still contains JSON]

To completely parse the nested JSON data, we need to use the WITH clause with an explicitly defined schema. For example, if we want to know the name and salary of each employee, we can use the query below.

SELECT * FROM OPENJSON(@js2, '$.EmployeeInfo')
WITH ( Name VARCHAR(50) '$.Name',
       Salary INT '$.Salary',
       Address1 VARCHAR(200) '$.Address1',
       Address2 VARCHAR(50) '$.Address2',
       ZipCode VARCHAR(50) '$.Zip',
       City VARCHAR(50) '$.City',
       State VARCHAR(50) '$.StateProvince' )

After you run the above query, you can see the structured tabular data, as shown in the image below.

[Screenshot: OpenjsonRes2Fig2 – result of OPENJSON with the WITH clause]

OPENJSON with WITH CLAUSE and PATH

We can also access nested data with a path; this method is particularly useful when parsing multi-nested JSON data. For example, suppose we have the JSON data below and want to access all of each employee’s information, including Name, Salary, and Address.

[Screenshot: OpenjsonFig3 – multi-nested JSON data stored in @js3]

To access all the information, we can use OPENJSON with the WITH clause and nested paths. Below is the detailed query.

SELECT *
FROM OPENJSON(@js3, '$.Employee')
WITH ( Name VARCHAR(50) '$.Name',
       Salary INT '$.Salary',
       Address1 VARCHAR(50) '$.Address.Address1',
       ZipCode VARCHAR(50) '$.Address.Zip',
       City VARCHAR(50) '$.Address.City',
       State VARCHAR(50) '$.Address.StateProvince' )

After we run the above query, we get all the information in one row, as shown below.

[Screenshot: OpenjsonRes1Fig3 – result of OPENJSON with nested paths]

Summary

In this blog, I have explained and illustrated what the OPENJSON function can do. The default OPENJSON function returns three columns: Key, Value, and Type. We can use the OPENJSON function with the WITH clause to define the data schema, and to access multi-nested JSON data we can specify the detailed path.

Thank you for taking the time to read my blog. If you have any questions, please leave a comment below.
 

 

Built-in JSON Functions in SQL Server 2016 – FOR JSON

JSON is a lightweight text data format used for exchanging data in web and mobile applications. Starting with the 2016 release, SQL Server and Azure SQL Database have enabled you to convert tabular data into JSON format and vice versa via built-in JSON functions.

There are several built-in JSON functions such as FOR JSON, OPENJSON, JSON_VALUE, JSON_QUERY, JSON_MODIFY and ISJSON. In this blog, I will explain and illustrate what the FOR JSON function can do.

FOR JSON converts tabular data into JSON data. To illustrate what the FOR JSON clause can do, let’s first create two tables with the following SQL query.

CREATE TABLE [dbo].[ADDRESS](
[Address1] [varchar](100) NULL,
[Address2] [varchar](50) NULL,
[Zip] [varchar](10) NULL,
[City] [varchar](50) NULL,
[StateProvince] [varchar](50) NULL,
[EmployeeID] [int] NULL)

CREATE TABLE [dbo].[EMPLOYEES](
[ID] [int] IDENTITY(1,1) NOT NULL,
[Name] [varchar](20) NULL,
[Salary] [int] NULL )

INSERT [dbo].[ADDRESS] ([Address1], [Address2], [Zip], [City], [StateProvince], [EmployeeID]) VALUES (N'900 Rockville Pike', N'Suit 10', N'20852', N'Rockville', N'MD', 1)
INSERT [dbo].[ADDRESS] ([Address1], [Address2], [Zip], [City], [StateProvince], [EmployeeID]) VALUES (N'36523 Hancock Road', NULL, N'20854', N'Potomac', N'MD', 2)

-- ID is an identity column, so enable identity inserts before specifying explicit IDs
SET IDENTITY_INSERT [dbo].[EMPLOYEES] ON
INSERT [dbo].[EMPLOYEES] ([ID], [Name], [Salary]) VALUES (1, N'Kristeen', 1420)
INSERT [dbo].[EMPLOYEES] ([ID], [Name], [Salary]) VALUES (2, N'Ashley', 2006)
SET IDENTITY_INSERT [dbo].[EMPLOYEES] OFF

To convert tabular data into JSON format, we can use the FOR JSON clause. When you use it, you can explicitly specify the output structure with FOR JSON PATH, or let the statement determine the output structure itself with FOR JSON AUTO; the latter is usually fine for a simple structure. For example, we can present each employee’s ID, Name, and Salary in JSON format with FOR JSON AUTO, as shown below.

SELECT ID, Name, Salary FROM EMPLOYEES
FOR JSON AUTO

The result is shown in the image below. The output does not look bad with FOR JSON AUTO when the structure is simple.

[Screenshot: AutoFig1 – FOR JSON AUTO result for ID, Name, and Salary]

Let’s look at a more complicated example with a nested structure, first using FOR JSON AUTO. For this data set, we want to nest all address-related information (Address1, Address2, City, Zip, and StateProvince) under Address, so we need to add the Address. prefix to those columns. Below is the detailed query.

SELECT E.Name, E.Salary,
A.Address1 AS [Address.Address1],
A.Address2 AS [Address.Address2],
A.City AS [Address.City],
A.Zip AS [Address.Zip],
A.StateProvince AS [Address.State]
FROM EMPLOYEES E
JOIN ADDRESS A
ON E.ID = A.EmployeeID
FOR JSON AUTO

The result is shown in the image below. We can see that all address-related columns end up with ‘two-part’ column names, which is not very readable.

[Screenshot: AutoFig – FOR JSON AUTO result with dotted column names]

Next, let’s check FOR JSON PATH using the code shown below.

SELECT E.Name, E.Salary,
A.Address1 AS [Address.Address1],
A.Address2 AS [Address.Address2],
A.City AS [Address.City],
A.Zip AS [Address.Zip],
A.StateProvince AS [Address.State]
FROM EMPLOYEES E
JOIN ADDRESS A
ON E.ID = A.EmployeeID
FOR JSON PATH

As seen in the image below, the result looks much simpler and more readable.

[Screenshot: PathFig – FOR JSON PATH result with a nested Address object]

Last, we can add a root node by using the FOR JSON PATH, ROOT clause.

SELECT E.Name, E.Salary,
A.Address1 AS [Address.Address1],
A.Address2 AS [Address.Address2],
A.City AS [Address.City],
A.Zip AS [Address.Zip],
A.StateProvince AS [Address.State]
FROM EMPLOYEES E
JOIN ADDRESS A
ON E.ID = A.EmployeeID
FOR JSON PATH, ROOT('EmployeeInfo')

After adding the root, we can easily understand what information the result contains (see the image below).

[Screenshot: PathRootFig – FOR JSON PATH result with the EmployeeInfo root]

Summary

In this blog, I have explained and illustrated what the FOR JSON clause can do. For a simple structure, we can use FOR JSON AUTO. For a more complicated structure, FOR JSON PATH produces simpler and more readable results.

Create an Azure Data Factory Using PowerShell

It is easy to create an Azure Data Factory using PowerShell. However, you might fail to create one if you follow Microsoft’s example without upgrading your PowerShell.

Prerequisites: you need to have a valid Azure Subscription and a resource group.

Step 1: Log in to your Azure account by running the following command. It will pop up a window where you simply enter your username and password.

Login-AzureRmAccount

Step 2: Find your subscription. The following command will show you all subscriptions.

Get-AzureRmSubscription

Step 3: Choose your subscription. If you have multiple subscriptions, you need to choose which one you want to use. You need the subscription ID, which you can find in the result of step 2.

Select-AzureRmSubscription -SubscriptionId "your subscription Id"

Step 4: Define the resource group name and data factory name parameters. You need to know which resource group you want to put your data factory in. If you don’t remember the resource group name, you can list all of your resource groups by running the Get-AzureRmResourceGroup command. The data factory name must start with a letter or a number, and can contain only letters, numbers, and the dash (-) character. That means you cannot use underscores, which are otherwise common in naming. The data factory name should be at least 3 characters long and no more than 63 characters long. You can check here for the data factory naming conventions.

$resourceGroupName = "Your existing resource group name";

$dataFactoryName = "Your Data Factory Name";

Step 5: Create the new data factory. In my case, I had to use the New-AzureRmDataFactory command instead of Set-AzureRmDataFactoryV2.

New-AzureRmDataFactory -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName;

If everything runs well, you should see output similar to the screenshot below.

[Screenshot: CreatDataFactory – output of New-AzureRmDataFactory]

Generate a Table-Creation Script Using the FOR XML Clause with PATH Mode

There are multiple ways to automate the generation of table-creation scripts based on existing tables in SQL Server. The common way is to use the ‘Generate Scripts’ feature. This GUI method is relatively easy, but it often produces a lot of extra commands that may not meet your expectations. Other methods include recursive CTEs and cursors. In this blog, I am going to show you how to automate this process with the FOR XML clause in PATH mode. If you are not familiar with this method, here is a nice blog you can check.

To create a table, we need to know the table name, schema name, column names, and column data types. Fortunately, we can get all of this information from two views: INFORMATION_SCHEMA.COLUMNS and INFORMATION_SCHEMA.TABLES.

Furthermore, we need to know the length for string data types like varchar, and the precision and scale for decimal data types. In contrast, we don’t need to define a length for integer or datetime data types. This information is also available in INFORMATION_SCHEMA.COLUMNS, and we can use a CASE expression to decide whether to add it. Similarly, we can use a CASE expression for nullable or non-nullable columns. Finally, we concatenate all of the column information together with the FOR XML clause.

For the detailed query, please check the attached screenshot. You can also find a similar snippet on my github.

[Screenshot: Table_Creation – full table-creation query]

This snippet can be useful when you need to write a table-creation script of limited length. For example, it can be used when modifying an Azure Data Factory pipeline JSON file.

 

 

Connect an On-Premises SQL Server Database with an Azure SQL Database Using a Linked Server

Azure SQL Database, also known as platform as a service (PaaS) or database as a service (DBaaS), is designed for software-as-a-service application development. Although you can move all of your data into Azure SQL Database, there are times when you want to connect an Azure SQL database to an on-premises SQL Server database.

Linked servers provide a powerful way to access data from another SQL Server instance. In this blog, I want to show you that you can also use a linked server to access data residing in an Azure SQL database from your on-premises SQL Server database.

Create a Linked Server

There are two steps to establish a connection with an Azure SQL database.

Step 1. Create a linked server. We can create a linked server using the sp_addlinkedserver stored procedure. To execute this stored procedure, you essentially need to supply four parameters, as shown in the example after this list.

  • LinkedServerName. You can give any name as long as it is legal and unique.
  • AzureServerName. You can easily find it on your azure portal. An Azure server name has four parts ending with ‘.database.windows.net’.
  • PortNumber, usually 1433.
  • AzureDatabaseName

EXEC sp_addlinkedserver
@server=N'LinkedServerName',
@srvproduct=N'Azure SQL Db',
@provider=N'SQLNCLI',
@datasrc=N'AzureServerName, PortNumber',
@catalog='AzureDatabaseName';

Step 2. Set up the login mapping. We can use the sp_addlinkedsrvlogin stored procedure. To execute this stored procedure, you need to supply three parameters.

  • LinkedServerName. The one you just created in step 1.
  • UserName.
  • Password.

EXEC sp_addlinkedsrvlogin
@rmtsrvname = 'LinkedServerName',
@useself = 'FALSE',
@locallogin = NULL,
@rmtuser = 'UserName',
@rmtpassword = 'Password'

Test a Linked Server

After executing the above two steps, you can test your linked server by executing the sp_testlinkedserver stored procedure. This stored procedure requires one parameter, the LinkedServerName you just created. If it succeeds, you will see the “Command(s) completed successfully” message in SQL Server Management Studio.

EXEC sp_testlinkedserver LinkedServerName;

Use a Linked Server 

Example 1. Read data

SELECT TOP 10 *
FROM [LinkedServerName].[AzureDatabaseName].[dbo].[AzureTableName]

Example 2. Update data

UPDATE top(5000) A
SET A.Salary = A.salary * 1.1
FROM [LinkedServerName].[AzureDatabaseName].[dbo].[AzureTableName] A

Summary

In this blog, I have shown you how to create a linked server that connects to an Azure SQL database. Basically, we need to execute two stored procedures: sp_addlinkedserver and sp_addlinkedsrvlogin. We can also test the connection with another stored procedure called sp_testlinkedserver. I have also included two simple examples that show how to use a linked server to read and update an Azure SQL database table. For more information, please check the Microsoft Azure documentation.