Caroline R. Arms
National Digital Library Program, Library of Congress
August 18, 1997



This document and two others are intended to take the place of the 1994 document titled Elements of Digital Historical Collections. The documents have been prepared, in part, to offer guidance to applicants in the Library of Congress/Ameritech competition. The other two documents are Digital Historical Collections: Types, Elements, and Construction and Digital Formats for Content Reproductions. Those two present a snapshot of the Library of Congress digital conversion activity as of August 1996. This document was revised in August 1997.

The ideas and approaches outlined in these papers represent the collection-digitization effort of the American Memory pilot program (1990-1994) and the operational National Digital Library Program that has followed the pilot. The institution recognizes that many avenues remain unexplored and that new technology will lead to changing practices.

The challenge, and hence the uncertainty, is perhaps greatest in the area of facilitating access to the growing archives of digital materials. The traditional two-level model of a catalog of records representing items on library shelves has worked well for monographs. For serials, a separate abstracting and indexing industry has developed. For manuscript collections and large archives of pictures and photographs, finding aids (such as archival registers) in the form of structured documents provide access when the effort to catalog each individual item fully cannot be justified. The challenge is to integrate these diverse access mechanisms to provide coherent access to digital reproductions of different classes of material. Given the richness of its special collections, the Library of Congress is facing this challenge in mounting its own digital collections. Rather than suggest rigid rules that applicants for the LC/Ameritech competition must follow, the Library has attempted to offer a range of options, encouraging applicants to choose the approach that is appropriate for the materials they propose to digitize.

Another aspect to the challenge is interoperability among digital archives at different institutions. Interoperability will be an essential attribute for easy and effective use of materials in the National Digital Library. While practice within digital libraries is evolving continuously, applicants are expected to propose technical approaches consistent with those in use in August 1997. Alternatively, approaches consistent with standards in advanced stages of development within a national library, archival, or Internet association or community may be proposed by institutions prepared to adapt as those standards take their final form. An example would be the use of Dublin Core data elements in item-level descriptive records. Elements required for interoperability with the Library of Congress have been held to a minimum.

Each collection must be supported by a coherent aid or set of aids to intellectual access, usually through catalog records or a finding aid (archival register). The provision of searchable reproductions of textual materials is encouraged as an additional aid to access. In the networked, distributed environment envisaged for the National Digital Library, each digital reproduction must be identified in a way that permits retrieval across the World Wide Web. This identifier serves the role played by a "call number" in a traditional library. The identifier supports an active link between access aids and the digital reproductions they describe. The exact linking mechanism will depend on the form of the access aid.

To integrate the collections for which digitization is funded through the LC/Ameritech competition and to facilitate the interoperability of old and new collections, applicants must be prepared to use forms of access aid and linking mechanisms that are compatible with those used by the Library of Congress, usually through consistent use of certain fields or tags. Applicants may, but need not, choose to follow current Library of Congress practices exactly. In the weeks following the announcement of awards, awardee institutions will be expected to work closely with the Library of Congress to ensure that the details of their technical approach allow their materials to be integrated effectively with other American Memory collections.

General framework for interoperability for the LC/Ameritech competition

A collection will include digital reproductions (content) and a coherent access aid (typically, a set of descriptive records or a browsable finding aid). The applicant institution may choose to mount the content for Internet access or have the Library of Congress mount it. In order to support searching across all American Memory collections, copies of access aids for all collections will be mounted at the Library of Congress. See Figure 1 for a conceptual view of the relationship between access aids and content and how they support searching across collections within American Memory.

Figure 1: Conceptual framework for supporting search across American Memory collections.
Notes for Figure 1
1. In this context, items are "bibliographic" items. A bibliographic item may represent a group of physical items (e.g., a set of photographs or a folder of papers) that are described, identified, and presented as a whole. An item-level link points to a single, but possibly multi-part, digital "object."
2. Indexing and searching for American Memory currently uses the free-text search engine, InQuery. The use of a general indexing engine allows simultaneous retrieval from indexes generated from textual material in different formats.
3. The search interface for American Memory is implemented by LC using HTML forms and "middleware" using the World Wide Web's common gateway interface (CGI) mechanism.

Links between access aids and the digital reproductions should be at the same level as bibliographic description, normally for an item or a group of related items that is described as a whole. Hence, a link from a descriptive record should point to a web-based presentation of the digital object corresponding to the bibliographic item described. If a group of physical items is described as a whole, the record should describe the group and have a link to a presentation of the group. The presentation might be a grid of thumbnails, a table of contents, some other form of overview that provides links to the individual items, or the first item with navigation options. The choice will vary with the type of material and the preference and capabilities of the mounting institution. Applicant institutions wishing to describe materials at a group level will be responsible for the design and generation of such presentations. A finding aid will have many links; as for descriptive records, these links should be at the level of an item or a grouping of items as described in the finding aid. For example, a folder of correspondence might be presented as a group through a single link.

Institutions wishing to mount their own digital content and merely send the Library of Congress copies of the access aids must use identifiers that support links that are resolvable over the Internet into current physical locations for the digital content. Today, these identifiers must take the form of URLs. Copies of the access aids in the awardee institution's system can point to the same repositories of digital content, as shown in Figure 2, in which the left-hand column corresponds to the right-hand column of Figure 1. Those same identifiers can support hyperlinks from exhibits, bibliographies, term papers, and scholarly articles. Many institutions that have made digital materials accessible through a search interface on the World Wide Web have discovered that users want two things: a way to return directly to an item they have already located without searching again, and a way to provide active citations from online documents, such as reading lists, term papers, or scholarly articles in electronic journals.

Figure 2: Conceptual framework for access to content of LC/Ameritech collections from other systems, applications, and interfaces.
Note: Visualize this figure as being overlapped with Figure 1. The left-hand column of this figure repeats the right-hand column of Figure 1.

Introductory materials can be mounted either at the Library or the applicant's site. A collection-level MARC record in one or more national bibliographic utilities will provide a link to a home page for the collection (either at LC or the awardee institution) and facilitate discovery of the collection by a broad range of library users.

Some suggestions for alternative approaches to identifying items and for forms of access aid follow.

Identifiers for digital reproductions

The Uniform Resource Locators (URLs) used currently to identify resources on the Internet are unsatisfactory as long-term identifiers, since they usually identify a particular file on a particular computer. When the file must be moved, perhaps when the computer is replaced, the URL will usually change. If URLs are embedded directly in bibliographic records or finding aids, those access aids must be modified (and the links tested) when the files are moved. Applicants should consider mechanisms that ensure persistence of URLs and hence minimize the need for manual editing of descriptive records or finding aids in the future.

Some institutions have taken a local approach to establishing "persistent" URLs that might be feasible for some applicants. URLs for individual reproductions are actually database queries retrieving a single known item using a unique, but internal identifier. The databases being queried might be specialized archives used to manage the digital content or tables mapping identifiers or "logical" names to the URLs representing their "physical" location.
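The local approach described above can be illustrated with a minimal Python sketch. It assumes a simple in-memory table mapping internal identifiers to current file locations; the identifiers, the host names, and the `id` query parameter are all hypothetical and stand in for whatever database an institution actually uses.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical mapping table: internal item identifier -> current physical URL.
# In practice this would be a database table maintained by the institution.
LOCATIONS = {
    "map0001": "http://server1.example.org/maps/vol1/map0001.gif",
    "ltr0042": "http://server2.example.org/letters/box3/ltr0042.sgm",
}


def persistent_url(item_id, base="http://www.example.org/cgi-bin/item"):
    """Build the query-style URL that would be stored in an access aid."""
    return f"{base}?id={item_id}"


def resolve(url, table=LOCATIONS):
    """Return the current physical location for a query-style persistent URL."""
    ids = parse_qs(urlparse(url).query).get("id", [])
    if len(ids) != 1:
        raise ValueError("expected exactly one 'id' parameter")
    try:
        return table[ids[0]]
    except KeyError:
        raise LookupError(f"unknown item identifier: {ids[0]}") from None
```

When a file moves, only the table entry changes; the URL stored in the catalog record or finding aid stays the same.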

Applicants from institutions planning to make use of OCLC's PURL (Persistent URL) system might consider using PURLs as identifiers (see

There are currently several proposed schemes for long-term identifiers known as Uniform Resource Names (URNs), which would be resolved into physical locators, such as URLs, when users follow links. The Library of Congress offers to help awardees establish URNs for items in their collections using the "handle" service proposed as a URN scheme by the Corporation for National Research Initiatives (see See Selected Topics from NDLP Internal Documentation for more information on URLs, URNs, PURLs, and handles.

The Library of Congress has installed a handle-server to allow experimentation with URNs. However, American Memory's current practice for long-term identification of items is based on following rigid rules for naming files and storing them in a directory tree in a way that allows automatic derivation of names for different digital versions of the same bibliographic item. In effect, field 856 contains a "logical" name in subfields $d and $f from which "physical" file locators can be derived. If files are moved from one server to another or to a different level of the file hierarchy, the derivation procedure must be changed, but not the individual catalog records. Applicants wishing to follow Library practices and use this approach will be expected to work with the Library closely, after receiving an award, to establish a naming and storage scheme. Very general information about the storage/naming approach is available within Identifiers for Digital Resources (one of a collection of documents selected from NDLP Internal Documentation), but final details must be worked out after awards are made.
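The idea of deriving physical locators from a logical name can be sketched in Python as follows. This is an illustration only, not the Library's actual scheme: the base URL, the directory layout, and the filename suffixes for each version are hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivationRule:
    """Hypothetical rule for turning a logical name into file locators."""
    base_url: str   # current server and root directory (assumed)
    suffixes: dict  # version name -> filename suffix (assumed convention)


RULE = DerivationRule(
    base_url="http://server.example.org/collections",
    suffixes={"thumbnail": "t.gif", "reference": "r.jpg", "archival": "u.tif"},
)


def derive_locators(path_d, name_f, rule=RULE):
    """Derive one locator per digital version from 856 $d (path) and $f (name)."""
    root = f"{rule.base_url.rstrip('/')}/{path_d.strip('/')}/{name_f}"
    return {version: root + suffix for version, suffix in rule.suffixes.items()}
```

If the files move to another server, a new `DerivationRule` is supplied and every record resolves to the new location without being edited.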

Suggested formats for access aids

1. Bibliographic records in MARC format

As of July 1997, all bibliographic records for current American Memory collections are in MARC format. For the digital reproductions in American Memory, the Library does not create a separate MARC record for the digital version, but adds access information to a record for the original item. Applicants choosing to prepare catalog records in USMARC format need not create full-level records. In addition to the leader and required fixed fields, the MARC fields listed below must be present. No other fields are mandatory for integration with American Memory, although the use of author, subject, and note fields is strongly recommended to facilitate access.

245 Title statement
856 Electronic access
There is not yet consensus on how best to apply field 856 when several versions of an item are available (such as page-images and searchable text for a document or different qualities of reproduction for an image). The Library uses a single 856 field to link to a presentation incorporating links to alternate versions.

Several changes to the 856 field were approved by MARBI at their February and June 1997 meetings.

  1. Proposal 97-09 proposed that the definition of subfield $u be extended to allow URNs as well as URLs. MARBI agreed that the 856 field definition should allow for URNs, but not in subfield $u. The choice of subfield was left to the MARC Standards Office, which has subsequently decided that subfield $g will be redefined to hold a URN.
  2. Proposal 97-08, which was passed as proposed, redefined $q as "electronic format type."
  3. Proposal 97-01, which was passed as proposed, specifies a second indicator (previously unused) to characterize the relationship between the item to which the 856 field provides access and the item described in the MARC record.
This LC local field is needed to record the institution supplying the records and/or reproductions, a unique code identifying the digital collection, and the fact that the collection was contributed as a result of the 1997/8 LC/Ameritech competition. This field will normally be identical for all records provided by an awardee for a single collection. Details will be worked out with awardees after awards are announced.
This LC local field will hold a broad categorization for the "mode of expression" of the original material, as distinct from the digital format used for reproduction. This categorization is for use with broad navigational functions of the American Memory interface and is independent of the various ways in which format, medium, and genre are specified in the MARC standard. For example, subfield $a will usually contain text or image, since the 1997/8 LC/Ameritech competition is limiting its scope to textual and graphical materials. This field might also be used for more specific terms describing form and genre. The choice of terms and schemas for form and genre is an issue of considerable current discussion within the cataloging and Internet communities. Since there is no clear candidate for a standard, American Memory expects to use a "local" field, rather than 655, to hold the information needed to support its interface.

The Library is unable to accept records using the 774 (Constituent Item) field linked to 856 fields as described in MARBI Proposal 96-4 and integrated into the March 1996 update to the USMARC Bibliographic Format.

Applicants whose local practices conflict with these requirements are encouraged to explain the conflicts and suggest alternatives. After awards are made, there will be an opportunity to resolve differences.
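As a rough illustration of the minimal-record requirement, the following Python sketch checks that a record carries the 245 and 856 fields named above. The record representation (a list of tag and subfield pairs) and the sample values are assumptions made for this example; a real workflow would read records with MARC software, and the LC local fields would be added once their details are agreed.

```python
# Only the two standard fields named in this section are checked here.
REQUIRED_TAGS = ("245", "856")


def missing_required_fields(record, required=REQUIRED_TAGS):
    """Return the required tags that do not appear in the record.

    `record` is a list of (tag, subfields) pairs, where subfields is a dict
    keyed by subfield code. This layout is a simplification for illustration.
    """
    present = {tag for tag, _subfields in record}
    return [tag for tag in required if tag not in present]


# Hypothetical record with a title and a logical name in 856 $d and $f.
sample = [
    ("245", {"a": "Panoramic view of a city"}),
    ("856", {"d": "maps/ct0001", "f": "pm000123"}),
]
```

Calling `missing_required_fields(sample)` returns an empty list; a record lacking field 856 would be reported as missing it.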

2. Bibliographic records following the Dublin Core approach

In March 1995, at a metadata workshop organized by OCLC and the National Center for Supercomputing Applications and held in Dublin, Ohio, participants representing librarians, archivists, publishers, computer and information scientists, and members of Internet Engineering Task Force working groups approached the task of developing a short list of data elements that could be used by information providers to describe their own resources. This group proposed a set of thirteen elements that is known as the Dublin Core. Further working meetings have been held around the world, with strong participation from Europe and Australia. The set of core elements to support access and retrieval has been extended to fifteen: Title, Author or Creator, Subject or Keywords, Description, Publisher, Other Contributor, Date, Resource Type, Digital Format, Resource Identifier, Source, Language, Relation, Coverage, Rights Management. For more detailed information about the Dublin Core element set, see "The 4th Dublin Core Metadata Workshop Report" by Stuart Weibel, et al. in the June 1997 issue of D-Lib Magazine [The URL for the article is

This approach provides an alternative for cataloging that requires less specialized software than the MARC format. However, the Dublin Core concept is still under development and there is not yet an agreed communications format for Dublin Core data. Applicants proposing to create non-MARC bibliographic records should consider how their data elements can be mapped either to MARC or to the Dublin Core element set and be prepared to work with the Library to deliver copies of the database records in a straightforward non-proprietary format, perhaps marked up in SGML.

Requirements for interoperability will be similar to those for MARC bibliographic records, with the following fields being mandatory. Details will be worked out in consultation with the Library after awards are made.

Title
must be present
Resource Identifier
required, with details depending on approach proposed for identifying digital reproductions
Resource Type
required to indicate a broad categorization of original material type
required to support a link to the parent digital collection
Other Contributor
required to identify the institution providing the record

SGML-based syntaxes suggested for a communications format for Dublin Core records are sufficiently simple that special SGML authoring software should not be necessary. It should be feasible to prepare the SGML records as "reports" from many commonly used database software packages. Alternatively, the tags could be added manually using a word-processor or text-editing software.
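To show how such a "report" might be produced, here is a minimal Python sketch that writes one record as SGML-style tagged text. Because no communications format has been agreed, the `dc-record` wrapper, the lower-case hyphenated tag names, and the use of `&amp;`, `&lt;`, and `&gt;` for reserved characters are all assumptions made for illustration.

```python
from html import escape


def dc_record(fields):
    """Format (element name, value) pairs as a simple SGML-style record.

    Tag names are derived from the Dublin Core element names by lower-casing
    and replacing spaces with hyphens; this convention is hypothetical.
    """
    lines = ["<dc-record>"]
    for name, value in fields:
        tag = name.lower().replace(" ", "-")
        # Replace &, <, and > so the value cannot be mistaken for markup.
        lines.append(f"  <{tag}>{escape(value, quote=False)}</{tag}>")
    lines.append("</dc-record>")
    return "\n".join(lines)


# Hypothetical row exported from a local database.
row = [
    ("Title", "Letters & Papers, 1861"),
    ("Resource Identifier", "http://www.example.org/cgi-bin/item?id=ltr0042"),
    ("Resource Type", "text"),
    ("Other Contributor", "Example Historical Society"),
]
```

Calling `dc_record(row)` for each database row yields a plain-text file that needs no special SGML authoring software to create.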

3. Structured headers in searchable text reproductions

The Library recognizes that some applicant institutions may have elected to incorporate the bibliographic information usually found in a catalog record into structured headers within the digital texts and not to create independent bibliographic records. Applicants taking this approach should be prepared to extract header information into bibliographic records using one of the approaches described above. The Library of Congress has chosen to prepare and manage bibliographic records separately from its searchable text reproductions.

Applicants are strongly encouraged to use SGML document type definitions (DTDs) that conform to the guidelines of the Text Encoding Initiative (TEI) when preparing searchable reproductions of textual materials. These guidelines have been widely adopted for reproductions of historical texts. Applicants are encouraged to consider using the TEI-Lite DTD or the American Memory DTD, which is simpler still. The experience of the Library of Congress indicates that a simple approach to tagging is appropriate for many historical documents, especially when the SGML version is accompanied by page-images.

A format has been proposed for holding the Dublin Core descriptive elements (see the previous section) in the header section of HTML documents. The proposal (reported by Stuart Weibel at ) may be adopted as a standard, but not necessarily in exactly the form described. Applicants choosing this approach for cataloging and preparing their digital reproductions should be prepared to modify the headers of the converted texts when a standard syntax is agreed upon.
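One way such headers might be generated is sketched below in Python. Since the syntax has not been standardized, the `DC.` name prefix and the use of one META tag per element are assumptions for illustration and may differ from whatever form is eventually adopted.

```python
from html import escape


def dc_meta_tags(elements, prefix="DC"):
    """Render (element, value) pairs as HTML META tags for a document header.

    The NAME convention "<prefix>.<element>" is an assumed syntax, not a
    settled standard.
    """
    return "\n".join(
        f'<META NAME="{prefix}.{name}" CONTENT="{escape(value, quote=True)}">'
        for name, value in elements
    )
```

Keeping header generation in one small routine like this means the converted texts can be regenerated with revised headers once a standard syntax is agreed upon.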

For interoperability, there will be a few mandatory elements in headers for searchable texts. As for MARC records, the following items must all be present: a title, a unique identifier for the document, a pointer to the parent digital collection, and an identifier for the contributing institution. Details will be worked out in consultation with the Library after awards are made. Sample document headers satisfying the minimal requirements will be available soon.

4. Finding aids following the proposed standard for Encoded Archival Description

This format for structured finding aids marked up in SGML is currently being used at the Library of Congress and at many other institutions. The standard is maintained at the Library of Congress and details are available on the WWW pages of the Encoded Archival Description (EAD) maintenance agency at

Development of the standard has been supported by the Society of American Archivists and has involved representation from two international bibliographic utilities (OCLC and the Research Libraries Group). Several consortia have plans for collaborative online archives of finding aids in the EAD format. The initial effort in the archival community seems to be aimed at the production of finding aids for collections of original materials. However, the EAD DTD does provide a mechanism for providing access to digital reproductions (digital archival objects).

To provide links from the finding aid to the digital reproductions, the Library recommends the use of logical identifiers within the main document in conjunction with the SGML "entity" mechanism. Short logical identifiers in the digital archival object element (<dao>) can be resolved into full filenames in an associated entity file. At some future stage, when SGML software recognizes URN identifiers, the entity file, or the entity declaration section within the SGML file, could be modified through a global change to use location-independent identifiers rather than URLs or physical filenames.
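The entity-based linking described above can be illustrated with a small Python sketch that resolves logical identifiers in <dao> elements against an entity file. The `entityref` attribute name, the form of the entity declarations, the sample markup, and the URLs are assumptions for illustration; a real system would rely on an SGML parser rather than regular expressions.

```python
import re

# Matches declarations such as: <!ENTITY name SYSTEM "location" NDATA html>
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\S+)\s+SYSTEM\s+"([^"]*)"[^>]*>')
# Matches the logical identifier carried by a <dao> element (assumed attribute).
DAO_REF = re.compile(r'<dao\b[^>]*\bentityref="([^"]+)"', re.IGNORECASE)

# Hypothetical entity file: short logical names mapped to full locations.
ENTITY_FILE = """
<!ENTITY corr0001 SYSTEM "http://server.example.org/corr/box1/f0001.html" NDATA html>
<!ENTITY corr0002 SYSTEM "http://server.example.org/corr/box1/f0002.html" NDATA html>
"""

# Hypothetical fragment of a finding aid describing one folder.
FINDING_AID = (
    "<c02><did><unittitle>Correspondence, 1861</unittitle></did>"
    '<dao entityref="corr0001"></dao></c02>'
)


def parse_entities(declarations):
    """Return a dict mapping each declared entity name to its location."""
    return dict(ENTITY_DECL.findall(declarations))


def resolve_dao_links(finding_aid, entities):
    """Return (logical name, location) pairs; location is None if undeclared."""
    return [(name, entities.get(name)) for name in DAO_REF.findall(finding_aid)]
```

Because the finding aid holds only the short names, a move to new servers or to location-independent identifiers requires changing the entity file alone.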

Finding aids describing some non-digital collections at the Library of Congress are available at There is currently no free or inexpensive SGML viewer that is satisfactory for general use. The Library expects to create HTML versions automatically from the SGML versions, to support wider accessibility. The Library will be prepared to generate HTML versions of EAD finding aids prepared by awardees.

Supplementary Aids to Access

In addition to a general access aid to support searching, applicants proposing to mount the digital reproductions themselves are encouraged to provide additional support for access by browsing where appropriate. For example, the Library's collections of WPA Life Histories and Panoramic Maps can be accessed by state through a clickable map.