In making the http://isthistaxilegal.apphb.com I have had the need to parse pdf files, that contains tables inside. A more daunting task is hard. I have investigating the existing software – and the most used one was https://github.com/tabulapdf – however, not even him was closed to achieve . The problem was not transforming the pdf into text. The problem was transforming tables in PDF to csv or something similar. One of the reasons that is difficult is stated here: https://stackoverflow.com/questions/31579922/extracting-tables-lines-and-coordinates-from-pdf-using-pdfsharp-c :
The (my) solution
Starting from a PDF reader (iTextSharp) you need to:
1. read the lines (hopefully only vertical and horizontal lines);
2. join the lines (a line of a table could be several lines, for example one per cell);
3. understand where the tables are (sometimes making some hypothesis based on your needs);
4. optionally find the text outside the tables (better to keep all the text) and insert it in paragraphs;
5. Insert text inside the cells of the table
So I tried the old automation solution –Word. And word can read PDF files. More ,the tables are transferred as real tables! So – the next task to do is to transform Word tables into CSV.Word Macro to the rescue:
Sub ExportTables()
Dim d As Document
Set d = ActiveDocument
MsgBox d.Tables.Count
Dim t As Table
Dim r As Row
Dim iRow As Integer
Dim iCell As Integer
Dim str As String
Dim start As Integer
start = 1
For Each t In d.Tables
'str = str + t.ConvertToText(Separator:=wdSeparateByCommas).Text
Dim nrRows As Integer
nrRows = t.Rows.Count
For iRow = start To nrRows
Set r = t.Rows.Item(iRow)
Dim nrCells As Integer
nrCells = r.Cells.Count
For iCell = 1 To nrCells
str = str & Replace(Replace(Replace(r.Cells.Item(iCell).Range.Text, vbCr, ""), vbLf, ""), Chr(7), "") & "|"
Next iCell
str = str & vbCrLf
Next iRow
If start = 1 Then start = 2 ' do not read headers for each table
Next t
Open d.Path & "\Cluj.csv" For Output As #1
Print #1, str
Close #1
End Sub
Then from CSV to sqlite I have used sqlite3.exe with the following code
.open ../taxis.sqlite3
.import ../bucuresti.csv bucuresti
.quit
And the image shows the flow: