My Digital Document – R.I.P.

Recent information from AIIM and others indicates that high percentages of digitally created documents are being scanned into our document repositories – that is sad for so many reasons.

Quality – The overwhelming reason to keep a digital document in digital form is to preserve the quality of the original. Scanned documents, regardless of the scanner used, do not look as good as the original document unless you invest serious time and energy in remastering the resulting image. On the other hand, scanned documents, which are often skewed, offset on the page and littered with gray-scale debris, can look a lot worse without any effort at all.

Size – The second biggest reason to avoid scanning documents that were born digital is file size. This is only in second place if we’re using integers; if we included two decimals, I’d rank this at 1.01. Consider that a 35-page PowerPoint presentation with a healthy mix of text and graphics is a 1.4 MB pptx file. That same presentation, saved as PDF from within PowerPoint, is a 3.9 MB pdf file. That same presentation, printed and then scanned to PDF (using maximum compression), requires 21 MB for storage. That’s over 5 times the size of the saved-to-PDF version and 15 times the size of the original file! More important, it’s over twice the most common email file size limit of 10 MB. That translates directly to slower email send times (if you could send it at all) and slower downloads from SharePoint.
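If you want to check my math, here is a quick sketch in Python. The sizes are the ones quoted above; the variable and function names are just my own labels for illustration, and the ratios are rounded to one decimal.

```python
# File sizes in MB, taken from the PowerPoint example in the text.
SIZES_MB = {
    "pptx": 1.4,          # original PowerPoint file
    "saved_pdf": 3.9,     # saved as PDF from within PowerPoint
    "scanned_pdf": 21.0,  # printed, then scanned to PDF
}
EMAIL_LIMIT_MB = 10.0     # the email size limit assumed in the text


def ratio(numerator_mb, denominator_mb):
    """Return how many times larger one size is than another, to one decimal."""
    return round(numerator_mb / denominator_mb, 1)


scanned = SIZES_MB["scanned_pdf"]
vs_saved_pdf = ratio(scanned, SIZES_MB["saved_pdf"])
vs_original = ratio(scanned, SIZES_MB["pptx"])
vs_email_limit = ratio(scanned, EMAIL_LIMIT_MB)

print(f"Scanned vs. saved-as-PDF: {vs_saved_pdf}x")
print(f"Scanned vs. original pptx: {vs_original}x")
print(f"Scanned vs. email limit: {vs_email_limit}x")
```

Swap in the sizes of your own files and the comparison usually tells the same story.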

Utility – Clocking in at 1.02 on our scale of reasons not to scan content that was born digital is the usefulness or, in the case of the scanned document, uselessness of the result. If we go back to the presentation in the above example, let’s consider what we can do with the PowerPoint version. First, we can edit it. That means we can extend its life by updating it, reusing it, and repurposing the content. That also means that we can share certain good slides with others and save them the time of preparing those slides. Second, we can search for the words in the presentation. We can search our hard drives and network drives, and we can search SharePoint. And, if we can search, that means others can search too.

Now let’s look at what we can do with the save-as-PDF version. Hint: almost as much. We can’t actually edit the presentation, but we could correct a few spelling errors if we had to, say, if I were the original author. We can search for individual words in the PDF file, and in SharePoint. We can repurpose the text and the graphics as well. Acrobat holds these artifacts in their original form, so we can copy text out of the PDF and we can copy graphics as JPGs. If we save as PDF/A, we can also extend the useful life of that presentation or keep it for archive purposes.

What about that scanned-to-PDF version? Nothing, zip, nada, we can’t do anything with that puppy. We can’t search, we can’t reuse, we can’t edit, we can’t send it, and even if we could, we would likely be embarrassed by the quality. Yeah, yeah, I know, Acrobat includes OCR but the concept of performing OCR on a scanned copy of a born-digital document is, well, it hurts my head.

If this practice is so bad, why does it happen? Usually, the reason is time; it’s faster for the person who has to deal with the document(s) to simply scan them. Of course, that person isn’t considering the time other people will waste if they need the content out of those documents. Other reasons I hear include “I couldn’t find the originals” and “…is a composite presentation of multiple original documents”. These conditions are the fault of the authors. The solution is easy: make that material easy to find. Better yet, save your originals (in original format or as PDF) and keep them in a Document Library called Source Material. Of course, sometimes composite documents are not all born digital. Even in this case, you are better off adding digital PDFs to graphic PDFs; at least that’s heading in the right direction.

How about, for 2010, we steal a phrase from Vegas and agree that “What’s born digital stays digital”?