Trails to Utah and the Pacific: Diaries and Letters, 1846-1869

Integration into American Memory at the Library of Congress

Introduction | Descriptive records | Page-turners | Marked up text | Linking from records and text to images


For this collection, the image files are mounted at Brigham Young University (BYU) by the Harold B. Lee Library. The search and browse interface for the collection at Brigham Young University is supported by CONTENTdm digital library collection software. The collection includes five categories (source formats) of material: manuscript diaries; printed volumes; prints and photographs; maps; and biographical notes for the diarists.

The service copies at BYU for the collection's prints and photographs are JPEG images; for maps they are MrSID compressed files. The published trail guides are available as sequences of page images through CONTENTdm's compound document view. The biographical notes are marked up as simple HTML files. For the diaries, the focus of the collection, two views are available at BYU: page images and transcribed text through a page-turning interface built using CONTENTdm; and a transcription as a text-only PDF file.

Descriptive records were exported from CONTENTdm for each category of material and delivered by BYU to the Library of Congress as flat, tab-delimited files. The full-text transcriptions of the diaries were delivered marked up according to an XML DTD using a subset of TEI-Lite tags and encoded as UTF-8 (Unicode).

Preparing the Descriptive Records for American Memory

CONTENTdm supports collection-specific element sets, with collection-specific configuration for indexing and labeling; cross-collection searching is supported through mappings to simple Dublin Core. Based on sample data, the Library of Congress chose to receive five sets of descriptive records, a set for each category of material, with slightly different but rich element sets. The five record sets were merged into a single database using Microsoft Access. The mappings to Dublin Core provided by BYU served as a guide in this task. For the diaries, only the work-level records were used; the page-level topic terms and names assigned with care by staff at BYU and its partner institutions are searchable with the full text. Because American Memory search results display a title rather than a brief record, independently meaningful titles for the diaries were formed from the diarist's name, the title from the MARC record (usually "Journals," "Diary," or similar), and the dates. A sample constructed title is "Young, Joseph. Diary and accounts, 1848-1855." The same title was embedded in LC's copy of the diary transcription to ensure consistency in the interface. For each diary, the URLs for presentations through CONTENTdm were replaced by URLs for LC's presentations of the same work. Other minor local modifications were made globally for compatibility across American Memory collections. Selected elements (fields) were then exported and tagged for indexing and labeling within American Memory.
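The title construction described above can be sketched in a few lines of Python. This is an illustration only, not the procedure actually used (the records were handled in Microsoft Access). The column names `creator`, `title`, and `date` and the function names are hypothetical, and the punctuation pattern is inferred from the single sample title quoted above.

```python
import csv
import io


def build_title(diarist: str, marc_title: str, dates: str) -> str:
    """Combine diarist name, MARC title, and dates into one display title."""
    name = diarist.strip().rstrip(".")
    title = marc_title.strip().rstrip(".,")
    return f"{name}. {title}, {dates.strip()}."


def titles_from_tsv(text: str) -> list[str]:
    """Read flat, tab-delimited records and build a title for each row.

    The column names used here are assumptions made for this sketch.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [build_title(r["creator"], r["title"], r["date"]) for r in reader]


# A one-record sample in the assumed tab-delimited layout.
SAMPLE = "creator\ttitle\tdate\nYoung, Joseph\tDiary and accounts\t1848-1855\n"

if __name__ == "__main__":
    for line in titles_from_tsv(SAMPLE):
        print(line)
```

Building the title once and reusing the same string in both the descriptive record and the transcription is what keeps the two displays consistent.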

Creating Page-Turners for American Memory

Because it was not technically feasible to link from text displayed at the Library of Congress to a specified page through the CONTENTdm page-turning interface, the Library of Congress used its own system for page-turners. Links to individual image files at BYU were used in the data structure in place of references to local files. The varied and inconsistent page-numbering on the manuscript diaries proved a challenge for both Brigham Young University and the Library of Congress. One item has three different sequence numbers on many pages, one possibly contemporary, and two different sequences applied later by curators or conservators. Other items are really collections of letters and each letter has an individual sequence. The correspondence of page-references in the transcriptions with sequence locations in the page-turner was achieved through a mixture of information compiled at BYU prior to loading compound documents into CONTENTdm and harvesting the HTML source for the navigation frame in the CONTENTdm page-turning display. Where numbering visible on the pages was consistent (or relatively consistent), the Library of Congress used that as the basis for support for jumping among pages in its page-turner. However, in some cases, it was deemed less confusing to use a one-up page sequence instead.
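The choice between visible page numbers and a one-up sequence can be pictured with the following Python sketch. It rests on a simplifying assumption: visible numbering is treated as usable only when every page image carries a number and no number repeats. The source describes a looser, case-by-case judgment ("relatively consistent"), so this rule and the function name `build_jump_index` are illustrative only.

```python
from typing import Optional


def build_jump_index(visible_labels: list[Optional[str]]) -> dict[str, int]:
    """Map a page label a reader can type to a 1-based image position.

    visible_labels[i] is the number written on the (i+1)-th page image,
    or None when that page carries no usable number.
    """
    present = [label for label in visible_labels if label is not None]
    consistent = (
        len(present) == len(visible_labels)      # every page is numbered
        and len(set(present)) == len(present)    # no number is repeated
    )
    if consistent:
        # Use the numbering visible on the pages themselves.
        return {label: seq for seq, label in enumerate(visible_labels, start=1)}
    # Otherwise fall back to a plain one-up sequence.
    return {str(seq): seq for seq in range(1, len(visible_labels) + 1)}
```

For example, a diary whose first image is numbered 3 keeps its own numbering, so a reader jumping to "4" lands on the second image, while a bundle of letters that each restart at page 1 falls back to the one-up sequence.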

Preparing the Marked Up Text for American Memory

Versions of the diary transcriptions marked up in XML were also delivered to the Library of Congress. This allows users to search and display the full text through American Memory, and to link from the text to the corresponding page image. Staff at the Library of Congress converted the TEI-Lite XML files automatically to SGML. This was necessary because the existing tools and program code used by American Memory were acquired and developed to work with SGML, before the World Wide Web became commercially popular and XML was defined. The conversion included two major functions: replacing the XML treatment of empty tags with the appropriate SGML treatment; and converting characters outside the 7-bit ASCII set to corresponding SGML entities (from the standard character entity files usually used with TEI-Lite and the American Memory DTDs). Then, guided by an earlier mapping developed for the Sunday school books marked up by Michigan State University and presented in Sunday School Books: Shaping the Values of Youth in Nineteenth-Century America, and by BYU's display treatment in the PDF files for other TEI-Lite tags, tags were mapped to support the equivalent functionality provided by the American Memory system for documents marked up in the American Memory DTD. The transcripts were divided into "chunks" using the higher-level <div> tags, those that corresponded to months, years, or major sections. A table of contents was generated automatically from the headings of these divisions. For each page, a link to the page-turning interface was provided. The elements and attributes used by BYU to assign broad topical terms (e.g. "religious life" or "diseases") to pages and provide normalized forms of personal and geographic names were used to construct a set of "added terms" to be displayed as a separate paragraph in square brackets at the end of the relevant page or diary entry. These terms are indexed for searching with the full text.

The transformations to perform the mapping tasks and prepare the files for indexing with InQuery are programmed in the Omnimark programming language, which recognizes the structure of the DTD. These tasks are run in batch mode before indexing.
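The two functions of the XML-to-SGML conversion described in this section can be illustrated with a short Python sketch. The production work was done in Omnimark, so this is a stand-in for explanation, not the actual program. The entity table below is a small illustrative subset of the ISO entity names, and the fallback to a numeric character reference is an assumption of this sketch rather than documented practice.

```python
import re

# Illustrative subset of ISO entity names; a real run would load the
# full entity files used with the DTD.
ENTITY_NAMES = {
    "\u00e9": "eacute",   # e with acute accent
    "\u2014": "mdash",    # em dash
    "\u00bd": "frac12",   # vulgar fraction one half
}

# Matches an XML empty-element tag such as <lb/> or <pb n="3"/>.
EMPTY_TAG = re.compile(r"<([^<>/!?][^<>]*?)\s*/>")


def unwrap_empty_tags(text: str) -> str:
    """Rewrite XML empty tags (<lb/>) in the SGML form (<lb>)."""
    return EMPTY_TAG.sub(r"<\1>", text)


def to_entities(text: str) -> str:
    """Replace characters outside 7-bit ASCII with SGML entity references."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in ENTITY_NAMES:
            out.append(f"&{ENTITY_NAMES[ch]};")
        else:
            # Assumed fallback: numeric character reference.
            out.append(f"&#{ord(ch)};")
    return "".join(out)


def xml_to_sgml(text: str) -> str:
    """Apply both conversion steps in order."""
    return to_entities(unwrap_empty_tags(text))
```

A regular expression is enough here only because the input is assumed to be well-formed markup with no ">" inside attribute values; a DTD-aware tool such as the one named above does not need that assumption.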

Linking from Descriptive Records and Transcripts in American Memory to the Page Images at Brigham Young University

Bibliographic displays for these books in American Memory provide links to two presentations of the texts via the Library of Congress. The thumbnail links to the page-turning display; although the code and structure of this display are at the Library of Congress, each embedded page image is retrieved from BYU. The link under the thumbnail leads to the table of contents generated automatically at the Library of Congress from the marked up transcription. This serves as a menu to individual chunks of the overall work, usually months in the diaries, but including other miscellaneous elements (e.g., accounts, genealogy, and extraneous sections). The individual sections incorporate links to the page-turning interface for each page; these links are constructed from numbers assigned to each diary and page sequence numbers identified in the marked up text. Links at the bottom of each bibliographic display for a diary provide access to the CONTENTdm page-turner at BYU and the PDF text-only view of the transcription.
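As a rough illustration of how a page link can be assembled from a diary number and a page sequence number, consider the Python sketch below. The base URL, the query parameter names `diary` and `seq`, and the use of a `<pb n="...">` page-break tag to carry the sequence number are all assumptions made for the example; the source does not document the actual URL pattern or the element used.

```python
import re
from urllib.parse import urlencode

# Hypothetical address of a page-turner; not the real American Memory URL.
BASE = "https://example.org/pageturner"

# Assumed form of a page-break marker carrying the sequence number.
PAGE_BREAK = re.compile(r'<pb\s+n="(\d+)"\s*/?>')


def page_link(diary_id: int, seq: int) -> str:
    """Build a link to one page image from a diary number and page sequence."""
    if diary_id < 1 or seq < 1:
        raise ValueError("diary_id and seq must be positive")
    return f"{BASE}?{urlencode({'diary': diary_id, 'seq': seq})}"


def links_for_text(diary_id: int, markup: str) -> list[str]:
    """Return one page link per page break found in the marked up text."""
    return [page_link(diary_id, int(n)) for n in PAGE_BREAK.findall(markup)]
```

Because each link is derived from only these two numbers, links can be generated in batch for every page of every transcription without consulting the descriptive records.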

The two views of the transcriptions, at the Library of Congress and at Brigham Young University, are generated from the same underlying markup but the resulting presentations are distinctly different. This provides an excellent demonstration of the combination of markup that describes the structure of the document and the use of a stylesheet (or equivalent transformation) to control online display based on the structural markup. At BYU, each book is presented as a single text-only PDF file. This is convenient once downloaded, but may take a long time to download on slower Internet connections, and provides no link to the page images. A single downloaded file can be stored for future reference. The American Memory presentation uses the markup to divide the work into smaller chunks and construct a table of contents "navigator." This is useful for skimming a work, provides chunks that are individually faster to display, and supports linking from the text transcription to the page image. Searches of the full text of the entire collection return a chunk from a work rather than the entire work.
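The idea of deriving both chunks and a table of contents from the same structural markup can be sketched in Python as follows. This sketch parses a tiny invented XML sample with the standard library, whereas the American Memory processing ran on SGML with Omnimark; the sample text, the element layout, and the function names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Invented sample with month-level divisions, loosely in the TEI-Lite style.
SAMPLE = """<text><body>
<div type="month"><head>June 1848</head><p>Left camp at dawn.</p></div>
<div type="month"><head>July 1848</head><p>Crossed the river.</p></div>
</body></text>"""


def chunk_by_div(xml_text: str) -> list[tuple[str, str]]:
    """Split a transcription into (heading, text) chunks at top-level divs."""
    body = ET.fromstring(xml_text).find("body")
    chunks = []
    for div in body.findall("div"):
        head = div.findtext("head", default="(untitled section)").strip()
        text = " ".join("".join(p.itertext()).strip() for p in div.findall("p"))
        chunks.append((head, text))
    return chunks


def table_of_contents(chunks: list[tuple[str, str]]) -> list[str]:
    """Build a numbered navigator line for each chunk heading."""
    return [f"{i}. {head}" for i, (head, _) in enumerate(chunks, start=1)]


if __name__ == "__main__":
    for entry in table_of_contents(chunk_by_div(SAMPLE)):
        print(entry)
```

A different rendering of the same input, such as concatenating every chunk into one file, would correspond to the single-document view, which is the point of separating structural markup from presentation.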
