Interpreting pdf to database(sqlite)

In making the http://isthistaxilegal.apphb.com I have had the need to parse pdf files, that contains tables inside. A more daunting task is hard. I have investigating the existing software – and the most used one was https://github.com/tabulapdf – however, not even him was closed to achieve . The problem was not transforming the pdf into text. The problem was transforming tables in PDF to csv or something similar. One of the reasons that is difficult is stated here: https://stackoverflow.com/questions/31579922/extracting-tables-lines-and-coordinates-from-pdf-using-pdfsharp-c :

The (my) solution
Starting from a PDF reader (iTextSharp) you need to:
1. read the lines (hopefully only vertical and horizontal lines);
2. join the lines (a line of a table could be several lines, for example one per cell);
3. understand where the tables are (sometimes making some hypothesis based on your needs);
4. optionally find the text outside the tables (better to keep all the text) and insert it in paragraphs;
5. Insert text inside the cells of the table

So I tried the old automation solution –Word. And word can read PDF files. More ,the tables are transferred as real tables! So – the next task to do is to transform Word tables into CSV.Word Macro to the rescue:

Sub ExportTables()
    Dim d As Document
    Set d = ActiveDocument
    MsgBox d.Tables.Count
    Dim t As Table
    Dim r As Row
    Dim iRow As Integer
    Dim iCell As Integer
    Dim str As String
    Dim start As Integer
    start = 1
    For Each t In d.Tables
        
        'str = str + t.ConvertToText(Separator:=wdSeparateByCommas).Text
        Dim nrRows As Integer
        nrRows = t.Rows.Count
        For iRow = start To nrRows
            
            Set r = t.Rows.Item(iRow)
            Dim nrCells As Integer
            nrCells = r.Cells.Count
            For iCell = 1 To nrCells
                str = str & Replace(Replace(Replace(r.Cells.Item(iCell).Range.Text, vbCr, ""), vbLf, ""), Chr(7), "") & "|"
            Next iCell
            
            str = str & vbCrLf
            
        Next iRow
            
        If start = 1 Then start = 2 ' do not read headers for each table
    Next t
    Open d.Path & "\Cluj.csv" For Output As #1
    Print #1, str
    Close #1
End Sub


Then from CSV to sqlite I have used sqlite3.exe with the following code

.open ../taxis.sqlite3
.import ../bucuresti.csv bucuresti
.quit

And the image shows the flow: