OCR for construction documents does not work, we fixed it

Posted by wcisco17 4 hours ago

OCR for construction documents does not work, we fixed it (www.getanchorgrid.com)

We've built an API and trained models that detect fixtures, extract schedules, and analyze construction documents. Check us out!

More examples: - https://www.getanchorgrid.com/developer/docs/endpoints/drawi...

Main website: - https://www.getanchorgrid.com/developer

Why we did it: https://www.getanchorgrid.com/developer/docs/changelog/const...

61 points | 42 comments

fithisux 4 hours ago

Of course it is not working. PDF and images are supposed to be tamper resistant. OCR tries to reverse engineer them.

kube-system 3 hours ago

Since when is tamper resistance a part of PDF or any common image format?

pwagland 3 hours ago

PDF files can be signed; that is tamper resistance. Tamper resistance doesn't have to make any difference to the readability of the document.

kube-system 3 hours ago

So can any type of file -- that doesn't have any relevance to the supposed design of every file type in existence. Now, later versions of PDF do have explicit support for signatures, but what does this have to do with preventing OCR? OCR reads a file, it doesn't change the original file.

fithisux 2 hours ago

True, but you can make modified copies if you reverse engineer it with OCR.

ranger_danger 3 hours ago

Some OCR solutions do change the original file, like OCRmyPDF. They take layers that were just images before and replace them with text layers so that you can search the document.

kube-system 3 hours ago

That isn't OCR, but an application of the resulting output of OCR. Again, a signature on a PDF or any type of file doesn't prevent you from reading it. (It also doesn't technically prevent you from changing it, it just enables the detection of changes to a particular file.)

There's nothing about PDFs or image formats that prevents anyone from doing OCR. The reason construction documents are difficult to OCR is that OCR models are not well trained for them, and they're very technical documents where small details are significant. It doesn't have anything to do with the file format.
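To illustrate the point that a signature detects changes rather than preventing reads or edits, here is a minimal Python sketch. It is not how PDF signatures are actually implemented: it swaps in an HMAC-SHA256 tag from the standard library as a stand-in for a real digital signature, and the key, the document bytes, and the `sign`/`verify` helper names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical key and document bytes, for illustration only.
SECRET_KEY = b"demo-signing-key"


def sign(document: bytes, key: bytes = SECRET_KEY) -> str:
    """Return a hex tag over the document (HMAC stand-in for a signature)."""
    return hmac.new(key, document, hashlib.sha256).hexdigest()


def verify(document: bytes, tag: str, key: bytes = SECRET_KEY) -> bool:
    """Report whether the document bytes still match the tag."""
    return hmac.compare_digest(sign(document, key), tag)


original = b"door schedule: D-101, hollow metal"
tag = sign(original)

# Reading the bytes (all that OCR or text extraction needs) is unaffected.
readable = original.decode("latin-1")

# Editing is not blocked either; it only makes verification fail afterwards.
tampered = original.replace(b"D-101", b"D-102")

print(verify(original, tag))
print(verify(tampered, tag))
```

The unchanged bytes verify and the edited copy does not, yet nothing stopped either the read or the edit.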

ranger_danger 3 hours ago

Can't one just remove the signature and re-sign it with anything else after tampering? Who verifies PDFs that hard?

kube-system 2 hours ago

If you're performing OCR, you're, almost by definition, disregarding the source file. The whole point of OCR is to be transformative.

fithisux 2 hours ago

You can't change a PDF; it is designed not to be easy to OCR.

kube-system 1 hour ago

PDFs are merely a collection of objects that can be read plainly from the file. Some of those are straight-up plain text that doesn't even need to be OCR'd; it can simply be extracted. It is also possible to embed image objects in PDFs (this is common for scanned files), which might be what you are thinking of. But that is not a design feature of PDF; it is the output format of a scanner: an image. Editing a PDF is a matter of editing a file, which you can do as you would with any other.
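As a rough illustration of text that can be extracted without OCR, the Python sketch below pulls the strings out of a hand-written, uncompressed PDF content stream. The stream contents and the `extract_text` helper are made up for this example. Real files usually compress their streams and may use font-specific encodings, so a proper PDF library is the practical choice. A scanned page would contain an image object and no text operators at all, which is where OCR comes in.

```python
import re

# A hand-written, uncompressed content stream (illustrative, not a full PDF).
CONTENT_STREAM = b"""BT
/F1 12 Tf
72 720 Td
(DOOR SCHEDULE) Tj
0 -16 Td
(D-101 HOLLOW METAL) Tj
ET"""

# Matches a literal string operand followed by the Tj (show text) operator.
# Simplification: ignores escaped parentheses, TJ arrays, and hex strings.
TJ_PATTERN = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def extract_text(stream: bytes) -> list[str]:
    """Return the literal strings shown by Tj operators, in stream order."""
    return [raw.decode("latin-1") for raw in TJ_PATTERN.findall(stream)]


print(extract_text(CONTENT_STREAM))
```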

ware-intel 2 hours ago

Your smart features look like a game changer. Nice job!