Identifiers for digital resources [August 13, 1996]

Libraries traditionally use unique identifiers for physical items in their collections -- by assigning call numbers and sticking labels on covers. A reader who identifies a book in a catalog can retrieve it by going to the shelf and looking for the label. The call number is the "key" that links the catalog record to the item it identifies. When the item is moved, its identifier goes with it. If library shelves are re-organized, individual call numbers need not be changed, only the signs on the shelves. In conjunction with a map of the stacks, these signs provide vital access support for the reader (or, in the case of closed stacks, for the deck attendant). The map helps the reader "resolve" the call number into a physical location.

Digital resources must also be identified uniquely. Until recently, no attempt was made to provide standard names for digital resources in general, except for very limited applications or in closed systems, such as within a single database. However, a digital library built for the long term cannot be a closed system. It must be built out of modular components that can be supplemented and upgraded as new technology is developed. As in the traditional collection, the name for an item in the digital library will be the "key" that links catalogs, compilations, and references to the item itself. Figure 1 represents the modular design for supporting access to American Memory collections: the user interface with which the patron interacts, the tools for access (catalogs, free-text indexes, finding aids, etc.) that support that interface, and the archive that contains the digital collections. An item in the digital archive may be accessible through several paths. The item must have a unique name that can be used in references from anywhere, on the Internet or in print.

Figure 1. Primary access paths to American Memory collections


The characteristics of digital resources pose challenges for naming. A file has no "cover" on which a label can be permanently fixed so that it remains available to any user even if the file is copied to another computer. The only "cover" for a generic file is its filename. For digital resources that comprise a single file and have a fixed location, filenames do provide a basis for naming. Every file attached to a particular computer, whatever its operating system, must have a name, and the full name (which includes the "path" of nested directories in which the file is stored) must be unique within that computer's file system. Since every computer on the Internet (or any other computer network) must have an identifier unique across the network, the combination of computer identifier and filename provides a unique identifier for any file on the Internet.

The World Wide Web uses Uniform Resource Locators (URLs) as names

Uniform Resource Locators (URLs), the identifiers used on the World Wide Web (WWW) today, generalize the two-part identifier (computer name, file name) by adding a third component specifying the network protocol that should be used to access the file. The addition of the protocol component to the identifier allows names to be given to resources that are not files, such as interactive terminal sessions or database query forms. Some links below point to more details about URLs. The URL approach has proved powerful and flexible for identifying Internet resources and is one of three building blocks on which the World Wide Web is based. The World Wide Web has been a phenomenal success because each of those building blocks was simple to understand, implement, and use in the Internet environment of the early 1990s. The American Memory project took advantage of the WWW environment to provide access to its historical collections across the Internet. Each item in the collections is accessed through its URL.
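As a minimal illustration of the three components, the sketch below splits a URL into protocol, computer name, and file name using Python's standard library. The URL itself is hypothetical and is not an actual American Memory address.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only to illustrate the three components.
url = "http://www.example.org/collections/photo/4a32371r.jpg"

parts = urlsplit(url)
protocol = parts.scheme   # network protocol used to reach the resource
computer = parts.netloc   # name of the computer that holds the resource
filename = parts.path     # full path of the file on that computer

print(protocol, computer, filename)
```

If the file is moved to another computer or directory, the second or third component changes, and so does the URL as a whole.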

However, there is a problem with URLs as long-term identifiers for digital items and resources. The URL incorporates the names of the computer and files that hold the resource. When a file or resource is moved (perhaps because a computer has failed or no longer has the capacity to handle user demand), the URL is no longer valid. Regular users of the WWW routinely come across links that lead nowhere. In some cases, old links lead to documents that apologize for the inconvenience and provide a link to the new URL, but that is hardly a reliable approach for the long term.

How the Library of Congress avoids the problems with URLs now

Although LC uses URLs to retrieve digital reproductions, URLs have not been embedded directly into American Memory catalog records. LC has created two-part "logical" names and followed strict rules for storing files in a hierarchical arrangement of directories that allows the actual address for a file to be derived from its logical name.

Each collection (or custodial "aggregate" within a collection) is given a unique name of up to 8 characters. The Digital Conversion Team has found it useful to limit names to 8 characters for compatibility with DOS filenaming limitations. Within a collection (or aggregate) an item has a unique name, often limited to fewer than 8 characters because an item may comprise several files. For example, most images are stored in three digital versions. The image with "logical" name "detroit/4a32371" is represented by a thumbnail (4a32371t.gif), a compressed "reference" version for routine access (4a32371r.jpg), and an uncompressed version (4a32371u.tif). A pamphlet or book (say "nawsa/n7111") has a pair of SGML files (n7111.sgm and n7111.ent) that represent the document. Images for each page are numbered in sequence (n7111001.tif, n7111002.tif, etc.). Images for illustrations and tables are numbered in a separate sequence.
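The naming pattern for component files can be sketched as follows. This is an illustration inferred from the examples above, not LC's actual code: the function names are hypothetical, and the three-digit page number is an assumption based on names such as n7111001.tif.

```python
# Suffix and extension for each image version, taken from the example
# "detroit/4a32371" in the text.
IMAGE_VERSIONS = {
    "thumbnail": ("t", "gif"),
    "reference": ("r", "jpg"),
    "uncompressed": ("u", "tif"),
}


def image_filenames(logical_name):
    """Return the filename of each version of an image item."""
    collection, item = logical_name.split("/", 1)
    return {
        version: f"{item}{suffix}.{ext}"
        for version, (suffix, ext) in IMAGE_VERSIONS.items()
    }


def page_image_filename(item, page):
    """Return the filename of a page image (assumes 3-digit page numbers)."""
    return f"{item}{page:03d}.tif"


print(image_filenames("detroit/4a32371"))
print(page_image_filename("n7111", 1))
```

Because the version suffix and page number are appended to the item name, the item name itself has to stay shorter than the 8-character limit.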

The two parts of the logical name for an item are stored in MARC bibliographic records in subfields $d and $f of field 856. Links to page-images and illustrations from SGML documents use logical names (as SGML entities) for the individual components (e.g. "n7111001" for the first page-image of document "nawsa/n7111"). Subsidiary "entity" files provide additional information about the type and location of the linked items. LC stores all component files associated with a single SGML document in the same directory.

Currently, files for each collection are stored in a Unix directory structure in a hierarchy patterned after the numerical part of the name. For example, the thumbnail version of photograph "detroit/4a32371" is stored in directory /4a/4a30000/4a32000/4a32300/ as 4a32371t.gif, and the version used for general WWW access is in the same directory with name 4a32371r.jpg. When a MARC record is retrieved during a search, the $d and $f subfields of field 856 are identified and used in conjunction with a "locator table" to derive the full pathnames (URLs) for an item and its versions or component files.
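The derivation can be sketched as below, under stated assumptions. The directory rule is generalized from the single example above and assumes item names of the form two characters plus five digits; the locator table entry and its base URL are hypothetical, as are the function names. The real locator table and rules may differ.

```python
import re

# Hypothetical locator table: collection name -> base URL for its files.
LOCATOR_TABLE = {"detroit": "http://www.example.org/detroit"}


def item_directory(item):
    """Derive the nested directory path from an item name such as 4a32371.

    Assumes a 2-character prefix followed by 5 digits (an assumption
    generalized from the one example in the text).
    """
    match = re.fullmatch(r"([0-9a-z]{2})(\d{5})", item)
    if match is None:
        raise ValueError(f"unexpected item name: {item!r}")
    prefix, digits = match.groups()
    levels = [prefix]
    # Keep 1, 2, then 3 leading digits and pad the rest with zeros.
    for keep in (1, 2, 3):
        levels.append(prefix + digits[:keep] + "0" * (len(digits) - keep))
    return "/" + "/".join(levels) + "/"


def resolve(logical_name, filename):
    """Combine the locator table entry, directory path, and filename."""
    collection, item = logical_name.split("/", 1)
    return LOCATOR_TABLE[collection] + item_directory(item) + filename


print(item_directory("4a32371"))
print(resolve("detroit/4a32371", "4a32371r.jpg"))
```

If a collection moves to another server, only the locator table entry changes; the logical names held in catalog records stay the same.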

This scheme allows the Library to avoid embedding identifiers dependent on particular physical file locations into MARC records and other access aids. However, it requires custom coding that is not appropriate in the long term for a fully interoperable digital library, distributed over many sites. The Library has been observing proposals for schemes for Uniform Resource Names (URNs) for the World Wide Web.

Uniform Resource NAMES (URNs) will be valid for the long term

The WWW community recognizes the shortcoming of URLs and has developed the concept of a Uniform Resource Name (URN). A URN is valid for the long term and independent of location, while still being globally unique. Several promising schemes for implementing a system of URNs have emerged. They address the form for names, methods to guarantee global uniqueness, and the design and deployment of a distributed system that provides an efficient address lookup function to "resolve" URNs into pointers to actual locations, with capabilities for publishers/authors/librarians to manage "their" names.
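The idea of resolution can be illustrated with a toy lookup table. This is a sketch only and does not follow any of the proposed URN schemes: the class, the name syntax, and the URLs are all hypothetical, and a real system would be distributed across many servers.

```python
class NameResolver:
    """Toy resolver mapping location-independent names to current URLs."""

    def __init__(self):
        self._locations = {}

    def register(self, name, url):
        """Record (or update) the current location for a name."""
        self._locations[name] = url

    def resolve(self, name):
        """Return the current location for a name."""
        return self._locations[name]


resolver = NameResolver()
name = "example-authority/detroit.4a32371"  # hypothetical name syntax

resolver.register(name, "http://old.example.org/detroit/4a32371r.jpg")
first = resolver.resolve(name)

# The file moves: the manager of the name updates the table,
# but references that cite the name do not change.
resolver.register(name, "http://new.example.org/detroit/4a32371r.jpg")
second = resolver.resolve(name)

print(first)
print(second)
```

The name cited in catalogs and references stays fixed; only the table entry maintained by the name's manager is updated when the resource moves.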

At the December 1995 meeting of the Internet Engineering Task Force (IETF), the most active groups with proposals agreed to go ahead and deploy a variety of systems in a way that allows them to work together and be tested through use by the Internet community. There are links below to information about proposals for URN schemes, including Internet Drafts prepared for the June 1996 IETF meeting.

The URN proposals have some commonalities:

Figure 2. Access to the American Memory collections using URNs


By running its own name authority system, LC can use local naming conventions that support the production, indexing, and management of locally produced digital collections and are compatible with cataloging norms.

LC is working during 1996 with CNRI (Corporation for National Research Initiatives) on a prototype digital archive, based on the Handle System (CNRI's URN scheme) and a Repository that supports management and access control for items in the digital collections. This design can coexist with the current approach. The NDLP will form its URN/handle identifiers from the two parts of the current logical names.

