Download large files with R and RCurl

First, read documentation. Try with

#getCurlOptionTypes(opts = getCurlOptionsConstants())
#print (getCurlOptionTypes)
#listCurlOptions()
#getCurlOptionsConstants()
#getCurlOptionTypes(opts = getCurlOptionsConstants())

Then, this is the code ( pretty self explanatory)

downloadFile <-function(url, file){
      curl = getCurlHandle()
curlSetOpt( .opts = list(http.version =HTTP_VERSION_1_0, verbose = TRUE),
                curl = curl)
    
     f = CFILE(file, mode=”wb”)
     a = curlPerform(url = url,curl=curl, writedata = f@ref, noprogress=FALSE)
     close(f)

}

Github as a repository for database (sqlite)

In making http://isthistaxilegal.apphb.com I neede to have a database. I decided to have sqlite, since it needs just a file and the dll’s.( Side Note: The good dll’s for sqlite and .net core are not the original ones, that have a dependency of .NET Framework – but Microsoft.Data.Sqlite ) . But how to deploy this every time?

Simplest way: Verify at start if the file exists, and , if not, download from GitHub

The code is

var url ="https://raw.githubusercontent.com/ignatandrei/IsThisTaxiLegal/master/datas/taxis.sqlite3";
using (var client = new HttpClient())
{

    using (var result = await client.GetAsync(url))
    {
        if (result.IsSuccessStatusCode)
        {
            var bytes=await result.Content.ReadAsByteArrayAsync();
            File.WriteAllBytes("taxis.sqlite3",bytes);
        }

    }
}

And the idea is not so bad  – having the database at github and download every time is a good idea…

Interpreting pdf to database(sqlite)

In making the http://isthistaxilegal.apphb.com I have had the need to parse pdf files, that contains tables inside. A more daunting task is hard. I have investigating the existing software – and the most used one was https://github.com/tabulapdf – however, not even him was closed to achieve . The problem was not transforming the pdf into text. The problem was transforming tables in PDF to csv or something similar. One of the reasons that is difficult is stated here: https://stackoverflow.com/questions/31579922/extracting-tables-lines-and-coordinates-from-pdf-using-pdfsharp-c :

The (my) solution
Starting from a PDF reader (iTextSharp) you need to:
1. read the lines (hopefully only vertical and horizontal lines);
2. join the lines (a line of a table could be several lines, for example one per cell);
3. understand where the tables are (sometimes making some hypothesis based on your needs);
4. optionally find the text outside the tables (better to keep all the text) and insert it in paragraphs;
5. Insert text inside the cells of the table

So I tried the old automation solution –Word. And word can read PDF files. More ,the tables are transferred as real tables! So – the next task to do is to transform Word tables into CSV.Word Macro to the rescue:

Sub ExportTables()
    Dim d As Document
    Set d = ActiveDocument
    MsgBox d.Tables.Count
    Dim t As Table
    Dim r As Row
    Dim iRow As Integer
    Dim iCell As Integer
    Dim str As String
    Dim start As Integer
    start = 1
    For Each t In d.Tables
        
        'str = str + t.ConvertToText(Separator:=wdSeparateByCommas).Text
        Dim nrRows As Integer
        nrRows = t.Rows.Count
        For iRow = start To nrRows
            
            Set r = t.Rows.Item(iRow)
            Dim nrCells As Integer
            nrCells = r.Cells.Count
            For iCell = 1 To nrCells
                str = str & Replace(Replace(Replace(r.Cells.Item(iCell).Range.Text, vbCr, ""), vbLf, ""), Chr(7), "") & "|"
            Next iCell
            
            str = str & vbCrLf
            
        Next iRow
            
        If start = 1 Then start = 2 ' do not read headers for each table
    Next t
    Open d.Path & "\Cluj.csv" For Output As #1
    Print #1, str
    Close #1
End Sub


Then from CSV to sqlite I have used sqlite3.exe with the following code

.open ../taxis.sqlite3
.import ../bucuresti.csv bucuresti
.quit

And the image shows the flow:

License plate number from image

One feature to implement is to have the plate number recognized from image – the user, instead of entering the number, take a picture of the plate and sends to the application. So I decided to do this. Roxana Beloiu pointed me to https://github.com/openalpr/openalpr   –  that can recognize the plate number. I have had some problems with the application in .NET- the latest releases are better than the source code. However, this was done in .NET 4.6 – and my site has .NET Core. First idea: having a mix between those frameworks – no!

Second idea: why not port from .NET 4.6 to .NET Core ? No – too many things that are not yet into .NET Core( e.g. Image)

Third idea: why do not make a serverless function in Azure ?

And this struggle me for 4 hours . The problem? The .NET Dll alpr depends on C libraries and other data – that are copied in the bin folder when you are making a reference from  a console application – but not when you are making a reference from a VS Azure function project.

Solution? Old times to the rescue

Making a console application that receives the file name and send as output the plate number

Uploading this console application and all the necessary dlls with Kudu

Receiving a base64 array in the Azure function, converting to byte, write to file, running the console, delete the file, sending text back.

Not the most advantageous, but hey,. it works.

Making IsThisTaxiLegal–evolving from Proof of Concept

What we need first is automating some tasks – like downloading the authorizations pdf ( and parsing after) from http://www.pmb.ro/adrese_utile/transport_urban/autorizatii_taxi/autorizatii_TAXI.php .

For this I want to use R – because C# I know already and is a good occasion to learn R.

The file is on https://github.com/ignatandrei/IsThisTaxiLegal/blob/master/applications/RDownloadFile/Script.R 

Some ideas from here from a C# developer:

  • Very easy to work with R – as soon as you grasp the concepts
  • Nuget from C# = CRAN from R
  • Does not provide in VS something like nuget, but pacman is perfect

if (!require(“pacman”)) install.packages(“pacman”)
pacman::p_load(‘XML’, “magrittr”, “RCurl”, “rlist”, “rvest”, “pdftools”, “dplyr”, “devtools”)
#if (!require(XML)) install.packages(‘XML’)
#install.packages(“magrittr”)
#install.packages(“RCurl”)
#install.packages(“rlist”)
#install.packages(“rvest”)
library(stringr)
library(magrittr)
library(XML)
library(RCurl)
library(rlist)
library(rvest)

  • R is supports normal programing and  functional programming  and magrittr helps

Code example:

print(head(splitData))  # normal

splitData %>% head(10) %>% print # showing data # function

  • Some WTF ( paste function is concatenate strings  ,  return in function must have () )
  • Many , many WTF when you try to make something in R until you grasp it. ( partially)
  • CRAN have many libraries free – someones requires java on your PC …. C# has a good point here , because he has now Linux + Mac + Windows natively and self contained

You will find code at https://github.com/ignatandrei/IsThisTaxiLegal/blob/master/applications/RDownloadFile/Script.R