Building the Digital Collection
Scanning the Printed Material
Systems Integration Group (SIG) of Lanham, Maryland, digitized paper-based printed documents for American Notes: Travels In America, 1750-1920. Each item was reproduced as facsimile page images. The image capture took place at the Library of Congress. In order to preserve the originals, bound works were scanned face-up in their bindings, one page at a time. The master version of the textual pages (containing typography and line art) is a 300-pixel-per-inch (ppi) bitonal image in the TIFF format, with ITU Group IV compression. Pages with printed halftone illustrations, finely detailed line drawings, or pages with significant color, including book covers, were captured as 8-bit grayscale or 24-bit color images, as appropriate, and stored in the JFIF image format (with JPEG compression).
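The storage implications of these capture settings can be illustrated with a little arithmetic. The following is a minimal sketch, not part of the Library's workflow: the function name `uncompressed_bytes` and the 6 x 9 inch page size are assumptions chosen for illustration, and the figures ignore row padding, file headers, and the Group IV or JPEG compression applied to the stored masters.

```python
def uncompressed_bytes(width_in, height_in, ppi=300, bits_per_pixel=1):
    """Return (width_px, height_px, size_in_bytes) for an uncompressed scan.

    bits_per_pixel is 1 for bitonal, 8 for grayscale, 24 for color.
    """
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to a whole number of bytes.
    return width_px, height_px, (total_bits + 7) // 8


# Hypothetical 6 x 9 inch page, captured at the 300 ppi named above.
for label, bpp in [("bitonal", 1), ("8-bit grayscale", 8), ("24-bit color", 24)]:
    w, h, size = uncompressed_bytes(6, 9, bits_per_pixel=bpp)
    print(f"{label}: {w} x {h} pixels, {size:,} bytes before compression")
```

Under these assumptions a color capture of a page holds 24 times as much raw data as a bitonal capture of the same page, which suggests why grayscale and color were reserved for pages that needed them.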
Bitonal text pages were scanned using a Minolta PS 3000 or a Bookeye scanner. Most of the grayscale and color foldout illustrations scanned by SIG were captured with the Pulnix MFCS-50. The remaining foldouts were scanned in-house by the Library's Digital Scan Center at 300 ppi on a Power Phase FX with a 4 x 5 camera back.
The browser-display images for all document pages are in the GIF format. Library staff produces these images by creating scripts in Image Alchemy for processing batches of the master images. When bitonal images are processed, gray tones are added and the resulting image is blurred to mimic grayscale. The image is then reduced in scale to fit the typical display monitor and sharpened to enhance legibility. When the source image is grayscale, only rescaling and sharpening are undertaken to create the GIF image.
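The sequence of steps described above (add gray tones, blur, reduce, sharpen) can be sketched in plain Python. This is not the Image Alchemy script used by Library staff; the function names, the 3 x 3 box blur, the block-averaging reduction, and the sharpening kernel are all assumptions standing in for whatever filters the production scripts applied. Images are represented as lists of rows, and file input and GIF encoding are omitted.

```python
BLUR = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]        # box blur, used with divisor 9
SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]  # simple sharpening kernel


def to_gray(bitonal):
    """Map 1 (ink) to 0 (black) and 0 (paper) to 255 (white)."""
    return [[0 if px else 255 for px in row] for row in bitonal]


def convolve3(img, kernel, divisor=1):
    """Apply a 3x3 kernel, clamping at the edges and clipping to 0..255."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += img[yy][xx] * kernel[dy + 1][dx + 1]
            row.append(min(255, max(0, round(acc / divisor))))
        out.append(row)
    return out


def downscale(img, factor):
    """Reduce the image by averaging factor x factor blocks of pixels."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [
        [
            round(
                sum(img[y * factor + j][x * factor + i]
                    for j in range(factor) for i in range(factor))
                / factor ** 2
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def display_image(master, factor, bitonal=True):
    """Derive a display-sized grayscale image from a master image."""
    img = master
    if bitonal:
        # Bitonal masters: add gray tones, then blur to mimic grayscale.
        img = convolve3(to_gray(img), BLUR, divisor=9)
    # Both kinds of master: reduce in scale, then sharpen for legibility.
    img = downscale(img, factor)
    return convolve3(img, SHARPEN)


# Tiny hypothetical "page": left half ink, right half paper.
page = [[1, 1, 0, 0] for _ in range(4)]
print(display_image(page, factor=2))
```

Passing `bitonal=False` skips the first two steps, mirroring the shorter path the text describes for grayscale masters.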
Some books contained foldouts that were scanned as uncompressed 8-bit grayscale or 24-bit color TIFF images and then converted to MrSID images by Library staff. MrSID, a multi-resolution seamless image database, uses a wavelet-compression technology made available to the Library of Congress by LizardTech of Seattle, Washington. Its distinguishing feature is the ability to decompress only the portion of the image that the user requests. The compression ratio is approximately 22:1, depending on image content and color depth. MrSID is well suited to viewing maps, orthophotos, terrain models, and satellite data. In contrast to compression software that relies on tiling, MrSID delivers its full resolution from within a single compressed image. It requires no special hardware and allows immediate access to any part of an image, of any size, at any resolution, regardless of file size. The software, designed for the storage and retrieval of large digital images, derives from research at Los Alamos National Laboratory, New Mexico.
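To give a sense of what a 22:1 ratio means in practice, the following rough estimate compares the uncompressed size of a foldout with its approximate size after compression. It is a sketch under stated assumptions: the function name `mrsid_estimate` and the 20 x 30 inch foldout are hypothetical, the 300 ppi figure is borrowed from the in-house foldout scans, and the real ratio varies with image content and color depth.

```python
def mrsid_estimate(width_in, height_in, ppi=300, bits_per_pixel=24, ratio=22.0):
    """Return (uncompressed_bytes, estimated_compressed_bytes).

    ratio is the approximate compression ratio quoted in the text (22:1);
    actual results depend on image content and color depth.
    """
    pixels = round(width_in * ppi) * round(height_in * ppi)
    raw = pixels * bits_per_pixel // 8
    return raw, round(raw / ratio)


# Hypothetical 20 x 30 inch color foldout.
raw, compressed = mrsid_estimate(20, 30)
print(f"uncompressed TIFF: about {raw / 1e6:.0f} MB")
print(f"at roughly 22:1:   about {compressed / 1e6:.1f} MB")
```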
Creating the Searchable Text
After Library of Congress staff approved the images, searchable texts were prepared offsite, where a subcontractor re-keyed the documents from the page images. These typescript materials were converted to machine-readable form at an accuracy rate of 99.95 percent and encoded with Standard Generalized Markup Language (SGML) according to the American Memory Document Type Definition (DTD). This DTD is a markup scheme that conforms to the guidelines of the Text Encoding Initiative (TEI), the work of a consortium of scholarly institutions. The online presentation of the texts also includes a version in HTML (HyperText Markup Language) produced by the Library in an automated process. Because it does not require special software, the HTML version is easier for most users to access.
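The 99.95 percent accuracy rate can be translated into an expected number of keying errors. The sketch below is illustrative only: the function name `expected_errors` and the character counts for a page and a book are assumptions, not figures from the collection, and it treats the rate as applying uniformly per character.

```python
from decimal import Decimal


def expected_errors(char_count, accuracy_percent="99.95"):
    """Expected number of mis-keyed characters at the stated accuracy rate.

    Decimal arithmetic keeps the small error fraction exact.
    """
    error_percent = Decimal(100) - Decimal(accuracy_percent)
    return Decimal(char_count) * error_percent / Decimal(100)


# Hypothetical sizes: a 2,000-character page and a 500,000-character book.
for label, chars in [("page", 2_000), ("book", 500_000)]:
    print(f"{label}: {chars:,} characters, about {expected_errors(chars)} errors")
```

At this rate the allowance works out to 5 errors in every 10,000 characters keyed.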