Extracting tables in PDF Format with Tabula-py

Tabula is a PDF table extracting tool written in Java. Although there are multiple wrappers in R, Ruby, and Node.js, there isn’t any python wrapper available yet till recently. I have been searching for it for a while.  Luckily, I have found one till recently as documented in chezou’s github and Aki Ariga’s blog.

I have done some simple test and it works great! To know how to install it, please refer the blog as well as the github I have given above.

1. Read a single PDF table within one page

It is easily to read a single table within the same page. You almost don’t need to specify any parameter. For the output format, there are several choice such as DtaFrame (default), JSON, CSV and txt.

PDF_SingleTbl

2. Read two PDF tables in one page

When reading multiple tables in PDF files, we should enable the multiple-table parameter. If you forget enable it, all tables will stack together as the example shown in below.

PDF_TwoTbl

When you enable the multiple_tables parameter by setting the value as True, the read_pdf () function will return a list.

PDF_TwoTbl1

We can check the data type, the list length and slice the list.

PDF_TwoTbl2

To convert the original table into a data frame, we can run the following command.

PDF_TwoTbl3

3. Read multiple tables in multiple pages

When reading multiple tables from multiple pages, we must specify the pages and enable the multiple_tables parameter. The returned result is a list as shown below.

PDF_TwoTblTwopage1

To convert the original table into a data frame, we can run the following command.

PDF_TwoTblTwopage2

4. Summary

In this blog, I have shown you how to read PDF table with the Tabula-py. This open source tool is very powerful and can be easily used for extracting tables from PDF files without knowing Java. However, to make Tabula-py function, you do need to install Java. For further detailed information and examples, please refer to my github and chezou’s github.

If you have any question, please leave your comment. I hope you find this blog is helpful. Thanks!

 

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s