Register Login
Sign In
» » Tabular data extraction from a complex and bad PDF Documents - Tips and Tricks


Tabular data extraction from a complex and bad PDF Documents - Tips and Tricks

26-10-2016, 10:03
Author:
2590



extracting tabular data from PDF

Many programmers and developers face the problem of extracting tabular data from PDF documents and in this article, I’m going to talk about important moments, solutions and will demonstrate how to work with PDF conversion.


 

1. Problems of PDF format


 

Specification of PDF format is not to describe complex data structures directly and in most cases

PDF document contains only a set of blocks of any type (text, image) with coordinates, size, and other parameters, but does not contain any tables, paragraphs, columns or rows splitters etc

The process of grabbing the plain-text usually has not brought any problems - there are many utilities that provide similar functional, but in case of converting PDF’s which contains complex and\or bad data structures, a programmer can find many problems and it will be necessary to analyze a PDF document structure by oneself.

To research the structure of the PDF file you can use intermediate formats that contain a description in a friendly form that will make a process and any requests to data easier.


 

2. Utilities, conversion, intermediate formats


 

ByteScout.com provides easy PDF conversion functional, supports many formats for export (XLS, CSV, JSON, XFDF, XML) and has .NET SDK for developers.

 

Further, you will see some ByteScout PDF extractor tips.

 

I used ByteScout PDF Multitool (free GUI version) to try to convert this complex PDF file with a very bad structure to XLS table for next calculations.

 

<test.pdf>

 

After directly converting to XLS we have the following result with errors:

 

<test directly to.xls>

 

In this file, every page has different columns count (1: A-U; 2: A-S; 3: A-AB;) and does not allow us to analyze pages correctly for performing calculations.

 

As you can see converting PDF to other end-user-editable formats directly can give wrong results and a programmer needs to use his own specific algorithms to fix the errors or to analyze PDF document structure by oneself.

 

To facilitate his own algorithm of analyzing the coordinates, we will convert the PDF to a file in XML format.

 

Also, we will use simple CSV format in our program to save the final result.


 

 

3. Analyzing intermediate XML file and extract data to CSV format


 

I used PDF Multitool for PDF to XML conversion, however, you can check how to do it easier using ByteScout PDF Extractor SDK that has PDF to XML API.

 

I received the following XML document with the structure

 

<pdfTo.xml>:

<?xml version="1.0" encoding="UTF-8"?>

<document>

<page index="0">

<row>

...

<column>

<text fontName="Arial" fontSize="6.0" fontStyle="Bold" color="#333333" x="23" y="103" width="7" height="6">V1</text>

</column>

...

</row>

...

<row>

...

<column>

<text fontName="Arial" fontSize="5.0" x="113" y="198" width="22" height="5">01JUN16</text>

</column>

...

</row>

</page>

...

</document>


 

We can see that header-rows and content-rows have different parameters in XML node definition.

 

We need to do PDF extract data and coordinates of elements that allow us to analyze real value position in the document.

 

After XML file research we can approximately define a grid with coordinates of columns (look for the grid in code).

 

For writing analyzing algorithm and conversion row to CSV, we will use C# .NET and XPath.

 

The following is the full program code in C # .NET with comments:


 

C# Code (Program.cs):

 

using System;

using System.Collections.Generic;

using System.IO;

using System.Linq;

using System.Xml;

 

namespace XmlPdfToCsv

{

class Program

{

static bool ValueBetween(int value, int v1, int v2)

{

return value >= v1 && value <= v2;

}

 

//Grid of columns after analyze coordinates of elements in XML file

static int CellFromXCoords(int x)

{

if (ValueBetween(x, 20, 26)) return 0;

if (ValueBetween(x, 38, 44)) return 1;

if (ValueBetween(x, 62, 68)) return 2;

if (ValueBetween(x, 110, 120)) return 3;

if (ValueBetween(x, 145, 150)) return 4;

if (ValueBetween(x, 205, 215)) return 5;

if (ValueBetween(x, 218, 224)) return 6;

if (ValueBetween(x, 268, 285)) return 7;

if (ValueBetween(x, 322, 340)) return 8;

if (ValueBetween(x, 365, 380)) return 9;

if (ValueBetween(x, 385, 395)) return 10;

if (ValueBetween(x, 415, 430)) return 11;

if (ValueBetween(x, 435, 445)) return 12;

if (ValueBetween(x, 470, 490)) return 13;

if (ValueBetween(x, 490, 500)) return 14;

if (ValueBetween(x, 535, 555)) return 15;

if (ValueBetween(x, 565, 575)) return 16;

if (ValueBetween(x, 630, 640)) return 17;

if (ValueBetween(x, 640, 650)) return 18;

if (ValueBetween(x, 700, 710)) return 19;

if (ValueBetween(x, 750, 760)) return 20;

if (ValueBetween(x, 790, 820)) return 21;

return -1;

}

 

static void Main(string[] args)

{

string filename = "pdfTo.xml";

string filename2 = "xmlTo.csv";

 

XmlDocument xDoc = new XmlDocument();

xDoc.Load(filename);

 

XmlElement xRoot = xDoc.DocumentElement;

 

XmlNodeList rowsNodes = xRoot.SelectNodes("//row");

 

List<string> csvLines = new List<string>();

 

foreach (XmlNode rowNode in rowsNodes)

{

//get only content rows

XmlNodeList elements = rowNode.SelectNodes(".//column/text[@fontName='Arial' and @fontSize='5.0']");

 

List<string> row = Enumerable.Repeat("", 22).ToList();

 

try

{

var els = elements.Cast<XmlNode>();

 

foreach (XmlNode node in elements)

{

//Get X coordinate

int x = Convert.ToInt32(node.Attributes.GetNamedItem("x").Value);

 

int ind = CellFromXCoords(x);

 

if (ind != -1)

{

string text = node.InnerText;

row[ind] = text;

}

}

}

catch { continue; }

 

if (!row.All(x => string.IsNullOrWhiteSpace(x)))

csvLines.Add(string.Join(";", row));

}

 

File.WriteAllText(filename2, "");

File.WriteAllLines(filename2, csvLines);

 

}

}

}


 

This code returns next CSV file

 

<xmlTo.csv>


xmlTo.csv

As you can see we have the good result which makes the process of further calculations possible.


 

4. Summary


 

Methods which were described in this article can be adapted, changed and used to extract data from many files.

Correct pattern for determining rows is a key moment and it can have any parameters: font(name, size, color), position(x,y), size(height, width) etc.

 

Also, in some cases, you can use Regular Expressions to determine type of value and type of row, for example, to determine date (01JUN16) from the fourth column you can use this regular expression pattern:

@"\d\d[A-Z]{3}\d\d” which you can use next way:

 

C# Code

 

foreach (XmlNode rowNode in rowsNodes)

{

 

XmlNodeList elements = rowNode.SelectNodes(".//column/text[@fontName='Arial' and @fontSize='5.0']");

 

var els = elements.Cast<XmlNode>();

 

foreach (XmlNode node in elements)

{

if (Regex.IsMatch(node.InnerText, @"\d\d[A-Z]{3}\d\d”)

{

//your actions - determine row type, adding, etc

}

}

 

}

 

In process of analyzing file structure, you can use the methods described above, combine them and invent new ones.

 

Hope this article and PDF extractor tricks help you to extract info from PDFdocuments!

 




Rate:
  
If You Would Like To Submit Your Own Free Theme Template Design, We Encourage You To Register and Get An Account on Template4all.com
Add a Comment
Your Name: *
Your e-mail: *
Comment:
Are You Human: What is 14+36 ?
Question Answer:
Enter The Code Shown On The Image: