Extract Data Tables from PDF Files in R

A new method to extract data tables from PDF files is introduced. Most available data-scraping tools are browser-based, manual in nature, and limited to one table at a time. A solution is outlined that extracts multiple tables at once by combining the R programming language with Tabula, an open-source Java program. The result is a convenient method that transforms documents into databases.

Benefit Statement
The ability to train a machine to extract data tables from PDF files has several benefits:

  • Intelligence: The analysis of public and private data is often constrained when data is locked-down in PDF files on public or private servers. Any PDF accessible via web access can now be scraped to liberate hundreds of data tables for immediate analysis with only one line of code.
  • Strategic Business Development: Strategic planning and business development transactions require rapid assimilation of data tables from numerous commercial documents.  Transaction “war rooms,” for example, often contain original documents and technical studies in PDF form absent direct data access. The ability to quickly extract data tables is essential to assess value and risk in limited time-frames.
  • Data Journalism: Data journalism is an emerging field where reporting is often delayed due to the need to process, extract and analyze data from numerous documents (e.g. Wikileaks). The ability to quickly extract tables and analyze data for public release can support subscription and advertising revenues.

Standard Approaches
There are several methods used to extract data tables from PDF files:

  • Screenshot Extraction: Taking pictures of PDF tables involves a manual copy-and-paste routine, one table at a time. The method is easy and convenient, but does not allow direct interface with table data.
  • PDF Converters: Third-party service providers offer free or subscription-based PDF converters for MS Word and Excel. These tools provide direct access to table data, but the results can be messy, requiring data or format corrections. These tools also do not support simultaneous extraction of multiple tables.
  • Optical Character Recognition: OCR uses machine learning techniques to “read” PDF files and to save data to a text or CSV file. OCR code supports extraction of multiple data tables at the same time. However, OCR tools require advanced programming skills and, sometimes, direct interface with the mathematical functions for text recognition.
  • Python Code: The mainstream programming language Python offers extraction functions through packages such as PDFMiner. These functions are convenient and avoid direct interface with source code. Python, however, is an advanced general-purpose language, which puts it beyond the reach of many analysts.

Tabula

Tabula is a locally installed software package. It is available for Windows, Mac, and Linux operating systems under an open-source MIT license. Tabula loads a PDF file and then extracts a selection of rows and columns from any table.

The creation of Tabula was sponsored by a group of investigative-journalism organizations, including the Knight Foundation, the New York Times, the Times of London, and some grassroots organizations. Tabula is easy to use, but manual intervention is still required to define the location of each data table in the PDF.

Extending Tabula
Tabula’s command line functionality can be exploited in the R programming language to “scrape” multiple data tables. To accomplish this, document parsing routines are used to auto-magically find the data tables in a PDF file. Tabula is then called to extract each table.

Two new R packages have been created to use Tabula in R:1

  • tabulizer provides the R functions that parse a PDF document and call Tabula;
  • tabulizerjars bundles the Java .jar files of the Tabula Java library (tabula-java), the engine that tabulizer calls.

Once these packages are loaded into R, a simple function call (e.g. one line of code) is sufficient to parse and extract data tables, be it one or many.
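As a sketch of that setup (the packages were hosted on GitHub under the rOpenSci organization at the time of writing; installation details may vary by version, and document.pdf is a placeholder file name):

```r
# Install the two packages from GitHub (one common route; may vary by version).
# install.packages("remotes")
# remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))

# Loading tabulizer pulls in the tabulizerjars Java library automatically.
library(tabulizer)

# One line of code extracts every table Tabula can find in a document.
tables <- extract_tables("document.pdf")
```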

Code Example: Disk File with a Single Table

Data from the 1974 issue of the magazine Motor Trend US has been pulled to test the R packages and PDF extraction routines. The data is found in the file MTdata.pdf.

The data table records fuel consumption for each car plus 10 other aspects of automobile design and performance. The table has 32 rows and 11 columns. Here is a screenshot of the data table:2


Function Calls in R

The R command extract_tables() calls the Tabula application and extracts tables. For example:
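A minimal sketch of such a call, assuming MTdata.pdf sits in the working directory; the clean-up steps after extraction are illustrative, not part of tabulizer:

```r
library(tabulizer)

# Extract all tables detected in the file; the result is a list containing
# one character matrix per table.
out <- extract_tables("MTdata.pdf")

# The Motor Trend table is the first element; promote the first row to
# column names and coerce the remaining cells to numeric.
tab <- out[[1]]
mtdata <- as.data.frame(tab[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(mtdata) <- tab[1, ]
mtdata[] <- lapply(mtdata, as.numeric)

str(mtdata)  # expect 32 observations of 11 variables
```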

A scatter plot is generated to validate that the extracted data is suitable for analysis. The plot illustrates the relationship between miles per gallon, vehicle weight, and cylinder count:
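The extracted table is the classic Motor Trend data set, so R's built-in mtcars copy of the same data can stand in for the extracted object in a reproducible sketch of the plot:

```r
d <- mtcars  # stand-in for the extracted data frame

# Miles per gallon against vehicle weight, colored by cylinder count.
cyl <- factor(d$cyl)
plot(d$wt, d$mpg, col = cyl, pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per US gallon",
     main = "Fuel Consumption vs. Weight")
legend("topright", legend = levels(cyl),
       col = seq_along(levels(cyl)), pch = 19, title = "Cylinders")
```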


Code Example: Internet File with Multiple Tables

The Qatar Ministry of Development Planning and Statistics prepares the publication Qatar Monthly Statistics. The PDF report is filled with tables, all of which can be downloaded and scraped into R with one function call.

The following two Consumer Price Index tables will be downloaded to illustrate the multi-table “scraping” capability:

Code syntax is the same as in the first example, except that an Internet URL replaces the file location. The code also expands the arguments of the extract_tables() function, since the goal is to extract only a portion of each table. The table location is now specified with four-corner coordinates to avoid extracting Arabic text fields; these coordinates are determined using the locate_areas() function. Finally, the method argument is specified to gain more control over the format of the output.
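A sketch of the call follows, with a placeholder URL and illustrative page numbers and area coordinates; the real values would come from the report itself and from an interactive locate_areas() session:

```r
library(tabulizer)

# Placeholder address for the Qatar Monthly Statistics PDF.
url <- "https://example.com/qatar-monthly-statistics.pdf"

# Run once, interactively, to click out each table's corners; returns a
# list of c(top, left, bottom, right) coordinates for reuse in scripts.
# areas <- locate_areas(url, pages = c(10, 11))

# Extract only the selected regions of the two CPI pages.
cpi <- extract_tables(
  url,
  pages  = c(10, 11),
  area   = list(c(126, 150, 400, 460),   # illustrative coordinates
                c(126, 150, 400, 460)),
  guess  = FALSE,               # use the areas given, not auto-detection
  method = "data.frame"         # return data frames rather than matrices
)
```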

Both tables have been extracted successfully. A histogram is generated to confirm that the data is suitable for analysis:
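A sketch of that check, assuming cpi holds the list of data frames returned by extract_tables() with method = "data.frame"; the column name is illustrative:

```r
# First extracted table; coerce the index column to numeric in case the
# values came back as text strings.
idx <- as.numeric(cpi[[1]][["Index"]])

hist(idx, main = "Consumer Price Index", xlab = "Index value")
```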



Limitations

The Tabula java library serves to liberate data in PDF files and to treat documents as databases. The ability to scrape hundreds of tables from a large document can be accomplished quickly, but is not free of challenges. The following limitations are present:

  • Footnotes: Many data tables contain footnotes next to individual data elements (e.g. R for revised or E for estimated). Footnotes do not prevent extraction, but they can alter the structure of the extracted data objects, and secondary code is often required to clean up the resulting output;
  • Merged Cells: Merged cells in column headers reflect poor table design and often hold information better placed in table titles, subtitles, or captions. Tables with merged cells require secondary code to extract cleanly and to ensure proper data and object formatting;
  • Numbers as Text: Numerals set in scripts other than Western (Latin) digits may prove difficult to extract, and extracted numbers are often returned as text strings. A simple conversion from text to a numeric class quickly resolves the latter problem, but it will not help with Arabic or Asian numeral formats, where optical character recognition may be the preferred solution;
  • Scanned PDFs: Data tables in scanned PDFs can still be extracted using the method profiled here. However, poor-quality scans, in which text and numbers are blurred or covered in “shadow clouds,” may produce extraction errors or blank cells in the resulting tables.

Conclusion

A new method to extract data tables from PDF files offers the potential to dramatically improve efficiency in data access, management, and analysis. The result can be more timely delivery of high-impact insights to support business strategy, transaction opportunities, or investigative analysis and reporting. The method also highlights a security consideration: any unencrypted PDF reachable over the web can now be scraped, so documents should be encrypted where security is critical. In short, the new method converts documents into databases and supports rapid data extraction.

Footnotes

  1. T.J. Leeper, “Bindings for Tabula PDF Table Extractor Library,” May 7, 2016.
  2. The 11 column fields are mpg: Miles per US gallon; cyl: Number of cylinders; disp: Displacement (cubic inches); hp: Gross horsepower; drat: Rear axle ratio; wt: Weight (1000 pounds); qsec: Quarter-mile time in seconds; vs: V/S design (0 = V engine; 1 = straight engine); am: Transmission (0 = automatic; 1 = manual); gear: Number of forward gears; carb: Number of carburetors.