Digitizing The Collection

The more than 190 books included in "California as I Saw It:" First-Person Narratives of California's Early Years, 1849-1900 represent approximately 40,000 pages and include more than 3,000 illustrations. Creation of a digital version of this collection began during the Library of Congress's American Memory pilot program in 1990 as an early experiment in digitizing printed matter. The program's planners had developed a dual proposal for reproducing textual collections: printed matter would be reproduced as searchable text, without accompanying facsimile images of the pages, while manuscripts would be reproduced as facsimile images, sometimes with accompanying searchable texts. This approach was based on the following premises: (1) searchable texts of adequate quality could be derived from printed matter but not manuscripts and (2) the large numbers of facsimile page images needed for a large book collection could not be crowded onto the CD-ROM disks being used for dissemination by the pilot project. By 1992, after the initial scanning of this collection had been completed, the American Memory program changed its approach and began producing facsimile images for book pages as well as manuscripts. The change answered requests from researchers to see images of the original pages and to have a means for correcting errors in transcription.

Following the normal standard of the text-conversion industry, American Memory arranged for the conversion of these texts to machine-readable form at 99.95 percent accuracy. This percentage permits 5 incorrect characters for every 10,000, amounting to about one typographic error per original book page. As part of the experiment, conversion was carried out by a variety of contractors. Some used optical character recognition (OCR) hardware and software, while others turned to rekeying. The conversion of these texts coincided with the Library's exploration of the use of Standard Generalized Markup Language (SGML). At that time, the American Memory planners employed an ad hoc approach to markup, and this approach was used for the first versions of the converted texts in this collection.
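
The accuracy arithmetic above can be checked with a short calculation. The following is a minimal sketch, not part of the Library's production process. The figure of roughly 2,000 characters per page is an assumption chosen for illustration, since it is the page length implied by "about one error per page"; the names ACCURACY, ASSUMED_CHARS_PER_PAGE, and allowed_errors are hypothetical.

```python
from fractions import Fraction

# Accuracy target stated in the text: 99.95 percent.
# A Fraction keeps the arithmetic exact (no floating-point rounding).
ACCURACY = Fraction("0.9995")

# Assumption for illustration only: about 2,000 characters per printed page.
ASSUMED_CHARS_PER_PAGE = 2_000


def allowed_errors(char_count, accuracy=ACCURACY):
    """Return how many incorrect characters the accuracy target permits."""
    return char_count * (1 - accuracy)


if __name__ == "__main__":
    # Errors permitted in 10,000 characters.
    print(allowed_errors(10_000))
    # Errors permitted on one page of the assumed length.
    print(allowed_errors(ASSUMED_CHARS_PER_PAGE))
```

Under the assumed page length, the two calls give the 5-per-10,000 figure and the one-per-page figure quoted in the paragraph; a denser or sparser page would shift the per-page number proportionally.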

Although images of every page were not to be included, the plan called for reproduction of all of the illustrations. In 1991 and 1992, when production got under way, the software that the Library was using for its CD-ROM disks, e.g., HyperCard for Apple Macintosh computers, was not capable of displaying grayscale images. In addition, the program's planners believed that it was important to offer copies of the illustrations suitable for printing on a typical laser printer, e.g., for use in a school classroom. For these reasons, the illustrations were reproduced as bitonal images (pure black and white, one bit of information at each pixel or picture element). These could be incorporated in the software and would print well.

There is little difficulty in reproducing line art in a digital environment. Printed halftones, however, pose special problems. Printed halftone illustrations are those in which shades of gray are conveyed by patterns of dots; printed halftones are typically used to reproduce photographs. When these are scanned, the intersection of the grid of scan lines (or pixels) and the halftone dots creates moire patterns that degrade the image. One solution is to randomize the scanner's grid pattern, and most of the California book illustrations were captured using a scanner that employed a sophisticated "dithering" algorithm to reduce the visibility of the moire patterns.
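
The difference between a plain bitonal conversion and a randomized one can be illustrated in a few lines. This is only a sketch of the general idea: the source does not describe the scanner's actual algorithm, and the random-threshold method below is a simple stand-in, not the "sophisticated" dithering the scanner used. The function names and the convention that 0 is black and 255 is white are assumptions made for the example.

```python
import random


def to_bitonal(gray, threshold=128):
    """Fixed threshold: 1 (white) if the pixel is at or above threshold, else 0 (black)."""
    return [[1 if value >= threshold else 0 for value in row] for row in gray]


def random_dither(gray, seed=0):
    """Random threshold: each pixel is compared against its own random cutoff.

    A pixel with gray value v (0-255) becomes white with probability of
    about v/255, so the average density of white dots tracks the original
    shade without following any regular grid.
    """
    rng = random.Random(seed)  # seeded so the result is repeatable
    return [[1 if value > rng.random() * 255 else 0 for value in row] for row in gray]


def white_fraction(bitonal):
    """Fraction of pixels that are white in a bitonal image."""
    total = sum(len(row) for row in bitonal)
    return sum(sum(row) for row in bitonal) / total


if __name__ == "__main__":
    # A uniform mid-gray patch, 64 x 64 pixels.
    mid_gray = [[128] * 64 for _ in range(64)]
    print(white_fraction(to_bitonal(mid_gray)))     # every pixel lands on one side
    print(white_fraction(random_dither(mid_gray)))  # roughly half the pixels are white
```

With a fixed threshold, a mid-gray area collapses to solid white or solid black, and any regular pattern in the source (such as halftone dots) can interfere with the pixel grid. Randomizing the cutoff trades that regular interference for fine noise, which is the general reason randomization reduces visible moire.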

Production on the California books reached a plateau in 1993, as the American Memory pilot began to draw to a close, and the project was held in abeyance. In 1996, the Library's digital collections activity, under the rubric National Digital Library Program (NDLP), got under way and the team turned back to the California books. In this context, it was possible to take advantage of several new developments. Dissemination had moved from CD-ROM disks to the Internet, where the Library was placing collections on the World Wide Web (WWW) and a more versatile array of software could be used to present the collection to end-users. In addition, the team had adopted a full implementation of SGML that conformed to the guidelines developed by the Text Encoding Initiative (TEI).

In 1996-1997, to the degree and quality possible, the NDLP produced supplementary grayscale versions of the bitonal illustration images and retrofitted the searchable texts to conform to the new SGML implementation. At the same time, versions of the texts marked up with HTML (HyperText Markup Language) were derived for use on the WWW. In addition, the six books identified in the following paragraph were included in the online collection The Evolution of the Conservation Movement 1850-1920. In order to have these works resemble the other books in that collection, facsimile page images were produced for them. (More page images may be scanned if time and resources permit.)

All of these factors have come together to shape "California as I Saw It" as presented at its Internet release in the summer of 1997. The texts are offered in HTML (with illustrations as inline grayscale images) and SGML versions. The bitonal illustration images are offered as a supplementary resource for printing. Facsimile page images are available for six books: Mary Austin's The Land of Little Rain; Lafayette Houghton Bunnell's Discovery of the Yosemite, and the Indian War of 1851; Clarence King's Mountaineering in the Sierra Nevada; Joseph Le Conte's Ramblings through the High Sierra; and John Muir's Mountains of California and My First Summer in the Sierra.

National Digital Library Program
Library of Congress
Spring, 1997