2001 Conference Proceedings



CONVERTING ARIEL FILES TO TIFF FILES FOR OPTICAL CHARACTER RECOGNITION PROCESSING  

James Bailey
Disability Services
5278 University of Oregon, Eugene, OR  97403
jbailey@darkwing.uoregon.edu

University library patrons increasingly use Interlibrary Loan to obtain journal articles. Many of these articles move from one library to another electronically. One of the most widely used electronic file transfer systems is the Ariel(r) document delivery system. Ariel(r) is produced by the Research Libraries Group (RLG). According to RLG, there are over 5000 copies of Ariel(r) currently in use around the country. More than 90% of academic libraries and many public libraries use Ariel(r) to send articles to each other.

At the University of Oregon, over 90% of the articles requested are received through Ariel(r). Likewise, over 90% of the articles sent out by the University of Oregon to other libraries are scanned into Ariel(r).

Articles from the owning library are scanned and then transmitted electronically to the borrowing library. The borrowing library then converts these files to Portable Document Format (PDF) files. The articles are e-mailed as attachments, delivered as web documents, or printed and sent to the patron. Unfortunately, these files are image-only and contain no text for screen readers or the Adobe accessibility plug-in to retrieve.

Converting such files to electronic text for patrons with either vision or reading issues entails printing the PDF document and then scanning it into an optical character recognition (OCR) program. The fidelity of the electronic text file to the original article depends on various factors. One factor is the copy's number of generations from the original. Unlike digital copies, analog copies, such as Xerox(r) copies, lose fidelity with each generation. Scanning an original will yield a more accurate conversion than scanning a copy that is several generations from the original. If it is necessary to print out a file and then scan it yet again, one more generation has been introduced, and with it a loss of accuracy.


OCR uses a picture of a document as the basis for the conversion to text. A logical question is: what sort of electronic file is an Ariel(r) file when it is delivered, but before it is converted to PDF? When the file arrives at the borrowing library via network or modem, it has a unique name and extension. Neither the name nor the extension gives a hint as to the file type. In fact, it is usually referred to simply as an Ariel(r) file.

It turns out that the file is a Tagged Image File Format file (TIFF, referred to here by its usual extension, TIF) preceded by a GEDI header. With this header in place, the file cannot be opened as a TIF file. It is possible to manually remove the header from the file and then save it as a TIF file. TIF files are a common type of image file used by most OCR programs for the conversion of paper documents to digital text.
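One way to confirm whether a file already begins with image data is to look for the signature that the TIFF specification places at the start of every TIFF file: a byte-order mark, "II" or "MM", followed by the number 42. The Python sketch below is illustrative only and is not part of the procedure in this paper; the function name is hypothetical, and it inspects nothing beyond the first four bytes.

```python
# Signatures defined by the TIFF specification:
# "II" + 42 (little-endian) or "MM" + 42 (big-endian).
TIFF_SIGNATURES = (b"II*\x00", b"MM\x00*")


def looks_like_tiff(data: bytes) -> bool:
    """Return True if data begins with a TIFF signature."""
    return data[:4] in TIFF_SIGNATURES
```

An Ariel(r) file as delivered would be expected to fail this check, because the GEDI header comes first.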

To remove the header, the file must be opened in an ASCII text editor or word processor. The file from Ariel(r) will have a name and extension, but they have little meaning for this process. To open the file, start the word processor first and then open the file from within it. If you have ever inadvertently opened an application in a word processor, you know that you get a jumble of letters, numbers, and other ASCII symbols in a seemingly random order. That is exactly what you will find when you open the Ariel(r) file. Once it is open, look for the point where the header ends and the TIF file begins. This point comes after a series of question marks followed by two capital "I"s. Everything from the start of the file up to, but not including, the two capital "I"s must be removed. Note that the file, especially the header, might contain intelligible words or phrases.

Find "?II" (a question mark and two capital "I"s); the editor's search function can locate it. Cut between the "?" and the first "I", removing everything from the start of the file to this point.
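The same cut can be sketched in Python by working on the file's raw bytes. This is a minimal sketch under stated assumptions, not the manual procedure described above: it assumes the embedded image uses the little-endian TIFF signature ("II" followed by the bytes 0x2A 0x00, as defined in the TIFF specification) and that the first occurrence of that signature marks the start of the image. Searching for all four bytes rather than "II" alone makes a chance match inside the header text less likely. The name strip_gedi_header is hypothetical.

```python
# Little-endian TIFF signature: "II" followed by 42 as a 16-bit integer.
TIFF_LE_SIGNATURE = b"II*\x00"


def strip_gedi_header(data: bytes) -> bytes:
    """Return data from the first TIFF signature onward, dropping what precedes it."""
    offset = data.find(TIFF_LE_SIGNATURE)
    if offset == -1:
        raise ValueError("no little-endian TIFF signature found")
    return data[offset:]
```

The input would be the whole Ariel(r) file read in binary mode, for example with open(path, "rb").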

After removing the header, save the file, giving it a .tif extension. Some text editors force a .txt extension and will not allow a .tif extension on the file. If that is the case, save the file and rename it afterward.
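If the header is removed with a script rather than a text editor, the save step can be sketched as follows. This is an illustrative assumption, not part of the original procedure: the hypothetical function save_as_tif writes the image bytes next to the source file and appends ".tif" to the full original name, so the unique name assigned by Ariel(r) is kept and no editor can force a different extension.

```python
from pathlib import Path


def save_as_tif(source: Path, tiff_bytes: bytes) -> Path:
    """Write tiff_bytes beside source, appending '.tif' to the original file name."""
    target = source.with_name(source.name + ".tif")
    target.write_bytes(tiff_bytes)  # binary write, so the bytes are stored unchanged
    return target
```

The function returns the path of the new file, which can then be opened in the OCR program.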

Once the file is in the TIF format, popular OCR programs such as TextBridge or Omnipage can open it and process it into a text file. Consult your OCR application's documentation for how to open and process TIF files.

In conclusion, this technique intercepts a file that would otherwise become an inaccessible document and produces an accessible one instead. It does so without the content degradation of additional printing and scanning. By eliminating unnecessary steps, it also speeds the delivery of an accessible document to the library patron.

Endnotes

The acronym GEDI, used above in describing the file header, stands for Group on Electronic Document Interchange, an international group that created standards for electronic document delivery.




Reprinted with author(s) permission. Author(s) retain copyright.