Interpreting pdf to database(sqlite)
In making the http://isthistaxilegal.apphb.com I have had the need to parse pdf files, that contains tables inside. A more daunting task is hard. I have investigating the existing software – and the most used one was https://github.com/tabulapdf – however, not even him was closed to achieve . The problem was not transforming the pdf into text. The problem was transforming tables in PDF to csv or something similar. One of the reasons that is difficult is stated here: https://stackoverflow.com/questions/31579922/extracting-tables-lines-and-coordinates-from-pdf-using-pdfsharp-c :
The (my) solution
Starting from a PDF reader (iTextSharp) you need to:
1. read the lines (hopefully only vertical and horizontal lines);
2. join the lines (a line of a table could be several lines, for example one per cell);
3. understand where the tables are (sometimes making some hypothesis based on your needs);
4. optionally find the text outside the tables (better to keep all the text) and insert it in paragraphs;
5. Insert text inside the cells of the table
So I tried the old automation solution –Word. And word can read PDF files. More ,the tables are transferred as real tables! So – the next task to do is to transform Word tables into CSV.Word Macro to the rescue:
Sub ExportTables() Dim d As Document Set d = ActiveDocument MsgBox d.Tables.Count Dim t As Table Dim r As Row Dim iRow As Integer Dim iCell As Integer Dim str As String Dim start As Integer start = 1 For Each t In d.Tables 'str = str + t.ConvertToText(Separator:=wdSeparateByCommas).Text Dim nrRows As Integer nrRows = t.Rows.Count For iRow = start To nrRows Set r = t.Rows.Item(iRow) Dim nrCells As Integer nrCells = r.Cells.Count For iCell = 1 To nrCells str = str & Replace(Replace(Replace(r.Cells.Item(iCell).Range.Text, vbCr, ""), vbLf, ""), Chr(7), "") & "|" Next iCell str = str & vbCrLf Next iRow If start = 1 Then start = 2 ' do not read headers for each table Next t Open d.Path & "\Cluj.csv" For Output As #1 Print #1, str Close #1 End Sub
Then from CSV to sqlite I have used sqlite3.exe with the following code
.open ../taxis.sqlite3
.import ../bucuresti.csv bucuresti
.quit
And the image shows the flow:
Leave a Reply