Read a PDF File using Progress

Posted by lucas_sergio@denso.com.br on 03-Oct-2019 17:38

Hi,

How can I use Progress 4GL to read information in a PDF? Is there a way to do this?

Tks

All Replies

Posted by gus bjorklund on 03-Oct-2019 19:44

In theory, you could write 4GL to read pdf files. But in practice, you can't.

The pdf file format is extremely complicated and writing and testing a decoder would take you several years.

I advise you to invoke an already written third-party pdf reader utility or a pdf extractor utility. There are utilities that can extract all the text from a pdf, and others that extract the pictures.
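As a rough illustration of calling such a utility, here is a minimal Python sketch that shells out to `pdftotext` (from Poppler/Xpdf). It assumes `pdftotext` is installed and on the PATH; the function names and file names are made up for this example. From ABL, the same command line could presumably be launched with OS-COMMAND.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for the pdftotext command-line tool."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page,
        # which helps when the data is arranged in columns.
        command.append("-layout")
    command += [pdf_path, txt_path]
    return command


def extract_text(pdf_path, txt_path):
    """Run pdftotext and return the extracted text (needs pdftotext on PATH)."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Calling `extract_text("schedule.pdf", "schedule.txt")` would leave the text in `schedule.txt`, which an ABL program can then read like any other text file.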

Posted by Patrick Tingen on 03-Oct-2019 20:35
Posted by Jean-Christophe Cardot on 04-Oct-2019 06:09

Hi

As suggested by Patrick, I'd use pdfInclude (disclaimer: I'm the one behind pdfInclude).

@Gus, this is written in 4GL and reads and writes pdf files ;-)

What kind of info do you need to read from the pdf file? Metadata or the text printed on the page?

TIA

JC

Posted by bronco on 04-Oct-2019 09:29

If I read Lucas correctly, he is interested in a solution where he can read information *from* an existing PDF.

Maybe Lucas can clarify what his requirements are.

EDIT: never mind, I read the features of the product and didn't see any reading capabilities. According to JC it is possible.

Posted by Patrick Tingen on 04-Oct-2019 10:11

Yes, reading from PDF has been possible with PDFInclude for many years, even in the older version. The original author, Gordon Campbell, passed away in 2006. Jean-Christophe picked up maintenance of PDFInclude some years ago.

Posted by David Abdala on 04-Oct-2019 10:14

As has already been mentioned, PDFInclude reads PDFs and is implemented in ABL. ABLPDF, which is based on PDFInclude but is object-oriented, also reads existing PDF files.

Whether they fit or not depends on what exactly you need (in both cases you have the source to adapt to your needs, and the latter is "easy" to extend). There are also several tools and, contrary to what Gus said, a partial implementation of the PDF standard to get "what you need" can be done in 1 to 2 weeks (assuming you are good at string handling).

Posted by Jean-Christophe Cardot on 04-Oct-2019 10:58

David, I would nuance your sentence about pdf reading. It is not only about string handling, or you will not be able to read every pdf, only a subset of them, depending on which tool was used to create them. There might be binary characters embedded in the strings, in particular CHR(0), especially when utf-16 is used, which prevents normal ABL/4GL string handling from working. Also, the strings can be encoded following various schemes, so once you have extracted a string, you still have to decode it. For this reason and others, I have completely rewritten the pdf reader part of pdfInclude, which can now handle any pdf file.
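To make the CHR(0) point concrete, here is a small Python sketch (not pdfInclude code) of decoding one such string. It relies on the PDF convention that a text string starting with the bytes FE FF is UTF-16BE; for everything else it uses Latin-1 as a stand-in for PDFDocEncoding, which is an approximation. The function name is made up for this example.

```python
def decode_pdf_text_string(raw: bytes) -> str:
    """Decode the bytes of a PDF text string (e.g. a Title or Author entry)."""
    if raw.startswith(b"\xfe\xff"):
        # A leading byte-order mark FE FF marks the string as UTF-16BE.
        # Every ASCII character then carries a zero byte, e.g. "A" is 00 41.
        return raw[2:].decode("utf-16-be")
    # Otherwise the string uses PDFDocEncoding; Latin-1 is used here as an
    # approximation (assumption: the two differ for some code points).
    return raw.decode("latin-1")


raw_title = b"\xfe\xff" + "Schedule".encode("utf-16-be")
print(b"\x00" in raw_title)               # the raw bytes contain zero bytes
print(decode_pdf_text_string(raw_title))
```

The zero bytes in `raw_title` are exactly what a CHARACTER variable cannot hold, which is why the raw data has to be handled as bytes before decoding.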

ABLPDF, being based on pdfInclude v3, has not had this rewrite (last time I checked), and as such cannot read every pdf file. Only specific ones, internally formatted the way ABLPDF is expecting them.

pdfInclude no longer has this limitation and can read any pdf file, but the "strings" we are speaking about above are the metadata of the file (author, title, subject, keywords, various dates, strings defining hyperlinks, document outline/summary, etc.), not the content itself.

Reading the text content of a page is a completely different matter, much more difficult to achieve (and not implemented in pdfInclude yet).

In the simplest situation the page content is encoded as ASCII and left uncompressed, which makes it easy to read. But this is not usually the case.

First you have compression, easily defeated using zlib.
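A minimal Python sketch of that step, assuming the stream uses the common FlateDecode filter (which is zlib/deflate). The sample operators are invented and compressed on the spot so the example is self-contained; in a real file the bytes would come from between `stream` and `endstream`.

```python
import zlib

# Stand-in for the body of a page content stream whose dictionary declares
# /Filter /FlateDecode (content made up for this example).
page_operators = b"BT /F1 12 Tf 72 720 Td (PART 12345) Tj ET"
compressed_stream = zlib.compress(page_operators)


def inflate_stream(stream_bytes: bytes) -> bytes:
    """Undo FlateDecode compression on the raw bytes of a PDF stream."""
    return zlib.decompress(stream_bytes)


print(inflate_stream(compressed_stream))
```

Other filters (and chains of filters) exist, so this only covers the most common situation.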

Then you obtain all the pdf operators and arguments, including font selection, colours, graphical elements, drawings, etc. (you can even have embedded pictures here!) out of which you have to extract only the ones which output text. Not an easy task.
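For a feel of what "extract only the ones which output text" means, here is a deliberately naive Python sketch that picks out the `Tj` and `TJ` text-showing operators from an uncompressed content stream. It is an illustration under strong assumptions: literal strings only, no escaped or nested parentheses, no hex strings, no inline images. A real parser needs a proper tokenizer.

```python
import re

# A literal string without escapes or nested parentheses (simplification).
LITERAL = re.compile(rb"\(([^()\\]*)\)")
# "(text) Tj" shows one string; "[(te) -20 (xt)] TJ" shows an array of pieces.
SHOW_TEXT = re.compile(rb"\(([^()\\]*)\)\s*Tj|\[([^\]]*)\]\s*TJ")


def text_fragments(content: bytes) -> list:
    """Return the strings shown by Tj/TJ operators, in file order."""
    fragments = []
    for match in SHOW_TEXT.finditer(content):
        if match.group(1) is not None:
            fragments.append(match.group(1).decode("latin-1"))
        else:
            # Join the string pieces of a TJ array, dropping the kerning numbers.
            pieces = LITERAL.findall(match.group(2))
            fragments.append(b"".join(pieces).decode("latin-1"))
    return fragments


sample = b"q 0.5 g 10 10 100 50 re f Q BT /F1 12 Tf 72 720 Td (Qty) Tj [(Pa) -20 (rt)] TJ ET"
print(text_fragments(sample))
```

Note that the bytes returned here are still glyph codes in the selected font, not necessarily characters.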

Then you have the text in the page. The text? Not really, because it is in fact composed of font glyphs, i.e. not characters, but their graphical representation. There is no reason a glyph would have the same code as a character, even in the Unicode space.

Also the pdf file you are reading may have placed each character or word so that it looks nice on the page, but each word or even character can be given in any random order in the pdf file itself, provided they are preceded by text placement operators which will place them correctly on the page.

So in order to really obtain the text, you would have to extract not only the text operators, but also the matrices and text placement operators, and figure out where on the page each given word/character is going. And this cannot be a pure algorithm. Heuristics will have to be implemented (e.g. when do you consider that 2 characters are on the same line? If one character is only one point (roughly 1/72 inch) below the other? etc.)
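One possible form of such a heuristic, as a Python sketch: fragments whose baselines differ by no more than a tolerance are treated as one line. The `(x, y, text)` input format, the 2-point tolerance and the function name are all assumptions made for this example, and the positions are assumed to be already computed from the matrices.

```python
def group_into_lines(fragments, tolerance=2.0):
    """Group (x, y, text) fragments into lines, top of page first.

    y is in points and grows upward, as in default PDF user space.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= tolerance:
            # Close enough to the current line's baseline: same line.
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within a line, order the fragments from left to right.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


fragments = [(200, 699, "40"), (72, 700, "PART-1"), (72, 680, "PART-2"), (200, 680.5, "15")]
print(group_into_lines(fragments))
```

The tolerance is the heuristic part: too small and superscripts or slightly misaligned columns split into separate lines, too large and tightly spaced lines merge.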

So now you have the glyph codes, positioned in a text file.

Add to this the font subsetting system of pdf (a pdf file can embed a ttf file containing only the characters which have been used, in order to save space), then you have to open and parse the embedded ttf file (quite a complex task to be done in ABL, very far away from string handling. I know it very well, having developed the font subsetting functionality in the latest versions of pdfInclude) in order to get the correspondence between glyph codes and character codes.
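Once that correspondence is known, the last step is a plain lookup. The Python sketch below shows only that lookup; the table is entirely hypothetical (invented codes), and building it from the embedded font is the hard part that is not shown.

```python
# Hypothetical glyph-to-character table, as it might be recovered from a
# subset font; the codes are invented for this example.
GLYPH_TO_CHAR = {1: "Q", 2: "t", 3: "y", 4: ":", 5: " ", 6: "4", 7: "0"}


def glyphs_to_text(glyph_codes, table=GLYPH_TO_CHAR):
    """Translate glyph codes to characters using a per-font table."""
    # Unknown glyph codes become U+FFFD so that gaps stay visible.
    return "".join(table.get(code, "\ufffd") for code in glyph_codes)


print(glyphs_to_text([1, 2, 3, 4, 5, 6, 7]))
```

The table is per font, so a page using several subset fonts needs one table for each.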

Then yes, you have the text.

This is a very complex task, and the heuristics I mentioned above (along with the lack of need) are the main reason it has not been implemented in pdfInclude, by the way.

regards

JC

Posted by lucas_sergio@denso.com.br on 04-Oct-2019 11:20

Hi, thanks for the suggestions. I need to read some information. We receive schedules from our customers (Honda, Toyota, ...) and we need to read and validate some data such as quantity, part number, and date. We don't need to read the metadata, only the text.
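For illustration only, this is roughly what the validation could look like once the text has been extracted by some tool. The line layout (part number, quantity, date as DD/MM/YYYY, separated by spaces) is purely an assumption; real customer schedules will differ and the pattern would have to be adapted per customer.

```python
import re
from datetime import datetime

# Assumed layout of one schedule line: PART-NUMBER  QUANTITY  DD/MM/YYYY
LINE = re.compile(
    r"^\s*(?P<part>[A-Z0-9-]+)\s+(?P<qty>\d+)\s+(?P<date>\d{2}/\d{2}/\d{4})\s*$"
)


def parse_schedule(text):
    """Return (rows, rejected): parsed (part, qty, date) tuples and bad lines."""
    rows, rejected = [], []
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            if line.strip():
                rejected.append(line)  # keep non-empty lines that do not fit
            continue
        try:
            due = datetime.strptime(match["date"], "%d/%m/%Y").date()
        except ValueError:
            rejected.append(line)      # e.g. 31/02/2019 is not a real date
            continue
        rows.append((match["part"], int(match["qty"]), due))
    return rows, rejected


sample_text = "HD-1001  40  15/10/2019\nTY-2002  15  31/02/2019\nDelivery schedule\n"
print(parse_schedule(sample_text))
```

Keeping the rejected lines makes it possible to review by hand whatever the pattern did not recognise instead of silently dropping it.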

If I understand, PDFInclude does that. Today we use PDFInclude only to create a PDF.

Posted by Jean-Christophe Cardot on 04-Oct-2019 12:17

Hi Lucas.

No, pdfInclude does not currently do what you want.

Neither the old version nor the newest one, sorry.

Maybe my last message was too long and confusing ;-)

Regards

JC

Posted by David Abdala on 04-Oct-2019 12:37

Sure, but you are talking about a "full" and "generic" reading implementation. I'm not. When you need to get some specific info from a PDF, in a specific PDF version, things get a lot easier.

I've done more than 5 of these implementations and, as you state, there are lots of complexities if you plan to make it fully standard compliant (Gus is referring to this too).

To be more clear: you and Gus are absolutely right about the amount of work required to get anything from a PDF in a fully standard-compliant way.

In most situations where you want "something" from a PDF, this is not the case. Usually you need to extract a specific piece of information from a specific PDF format, which makes things a lot easier.

To make things even easier, there are free tools that transform PDF streams (I don't remember which one I'm using right now) and convert "any" PDF format to the simpler "older" formats with unencoded streams, from which getting "text" becomes a simpler matter.

I'm not saying any implementation like this is easy or fast. Getting to some understanding of the PDF standard alone requires a couple of weeks.

"Text Extraction" is not implemented in ABLPDF (not even close), just as it wasn't implemented in PDFInclude, and not much additional work has been done on it besides "moving" code to an OO version. But you can use what PDFInclude already does (primarily metadata extraction) to retrieve the streams from which to get the text you are looking for.

Easy? no

Challenging? absolutely

Recommended? NO

Solution? Ask for the data in a format other than PDF; it was not designed for data exchange.

Posted by Jean-Christophe Cardot on 04-Oct-2019 12:58

Agreed!

Posted by gus bjorklund on 04-Oct-2019 14:22

> @Gus, this is written in 4GL and reads and writes pdf files ;-)

Indeed it does, and it does it well. However, it is not a complete implementation of the pdf spec (not suggesting the missing parts be added). Jean-Christophe Cardot has done an admirable job caring for it. Thank you Jean Christophe.

Posted by jquerijero on 04-Oct-2019 14:52

If you are doing the processing on the client-side only, you might want to consider the PDF library from either Infragistics or Telerik that comes with PDSOE.

Posted by Jean-Christophe Cardot on 04-Oct-2019 15:43

<embarrassed>Thanks Gus</embarrassed> :-)

I've read the PDF spec though and fixed inconsistencies/implemented much of this since I took over. However some parts won't be (not relevant, like javascript, videos or 3D objects :-D), and some might be in the future, according to the needs.

By the way, I implemented AES-256 (a.k.a. PDF 2.0) encryption for the next major version of pdfInclude, and lacked SHA-384 in ABL (whereas SHA-256 and SHA-512 are present); I had to use the OpenSSL library just for that. Could this be an enhancement request? (deviating from the original question a bit :p)

Posted by lucas_sergio@denso.com.br on 04-Oct-2019 17:04

Jean, thanks for your excellent work. I understand now, with PDFInclude it is not possible yet. ;-)

I will talk to our key-users about this and look for another option (we found some java libraries to turn a PDF into TXT).

Thank you everyone!

Posted by Marco Mendoza on 04-Oct-2019 20:18

Explore OCR software too.

The automotive industry is strong in data exchange (EDI, XML). You should explore those options with your customers.

This thread is closed