by Clifford Lynch
This paper has been modestly revised based on discussion at the workshop and a reading of the other papers presented there. All of the papers, but particularly those of David Levy and Peter Hirtle, raise important issues that are relevant to the topic of this article. From Hirtle’s paper, I had the opportunity to learn something of the science of diplomatics, and at the workshop, I had the opportunity to learn much more from Luciana Duranti. Her book, Diplomatics: New Uses for an Old Science (1998), offers valuable and fresh insights on the topics discussed here. These other works provide important additional viewpoints that are not fully integrated into this paper and I urge the reader to explore them. My thanks also to the participants in the Buckland/Lynch Friday Seminar at the School of Information Management and Systems at the University of California, Berkeley, for their comments on an earlier version of this paper.
This paper seeks to illuminate several issues surrounding the ideas of authenticity, integrity, and provenance in the networked information environment. Its perspective is pragmatic and computational, rather than philosophical. Authenticity and integrity are in fact deep and controversial philosophical ideas that are linked in complex ways to our conceptual views of documents and artifacts and their legal, social, cultural, and historical contexts and roles. (See Bearman and Trant for an excellent introduction to these issues.)
In the digital environment, as Larry Lessig (1999) has recently emphasized, computer code is operationalizing and codifying ideas and principles that, historically, have been fuzzy or subjective, or that have been based on situational legal or social constructs. Authenticity and integrity are two of the key arenas where computational technology connects with philosophy and social constructs. One goal of this paper is to help distinguish between what can be done in code and what must be left for human and social judgment in areas related to authenticity and integrity.
Gustavus Simmons wrote a paper in the 1980s with the memorable title “Secure Communications in the Presence of Pervasive Deceit.” The contents of the paper are not relevant here, but the phrase “pervasive deceit” has stuck in my mind because I believe it perfectly captures the concerns and fears that many people are voicing about information on the Internet. There seems to be a sense that digital information needs to be held to a higher standard for authenticity and integrity than printed information has been. In other words, many people feel that in an environment characterized by pervasive deceit, it will be necessary to provide verifiable proof for claims related to authorship and integrity that would usually be taken at face value in the physical world. For example, although forgeries are always a concern in the art world, one seldom hears concerns about (apparently) mass-produced physical goods, such as books, journal issues, and audio CDs, being undetected and undetectable fakes.1
This distrust of the immaterial world of digital information has forced us to closely and rigorously examine definitions of authenticity and integrity (definitions that we have historically been rather glib about), using the requirements for verifiable proofs as a benchmark. As this paper will demonstrate, authenticity and integrity, when held to this standard, are elusive properties. It is much easier to devise abstract definitions than testable ones. When we try to define integrity and authenticity with precision and rigor, the definitions recurse into a wilderness of mirrors, of questions about trust and identity in the networked information world.
While there is widespread distrust of the digital environment, there also seems to be considerable faith and optimism about the potential for information technology to address concerns about authenticity and integrity. Those unfamiliar with the details of cryptographic technology assume the magical arsenal of this technology has solved the problems of certifying authorship and integrity. Moreover, there seems to be an assumption that the solutions are not deployed yet because of some perverse reluctance to implement the necessary tools and infrastructure.2 This paper will take a critical view of these cryptographic technologies. It will try to distinguish between the problems that cryptographic technologies can and cannot solve and how they relate to the development of infrastructure services. There seems to have been surprisingly little examination of these questions.
Before attempting to define integrity or authenticity, it is worth trying to gain an intuitive sense of how the digital environment differs from the physical world of information-bearing artifacts (“meatspace,” as some now call it). The archetypal situation is this: We have an object and a collection of assertions about it. The assertions may be internal, as in a claim of authorship or date and place of publication on the title page of a book, or external, represented in metadata that accompany the object, perhaps provided by third parties. We want to ask questions about the integrity of the object: Has the object been changed since its creation, and, if so, has this altered the fundamental essence of the object? (This can include asking these questions about accompanying assertions, either embedded in the object or embodied in accompanying metadata). Further, we want to ask questions about the authenticity of the object: If its integrity is intact, are the assertions that cluster around the object (including those embedded within it, if any) true or false?
How do we begin to answer these questions in meatspace? There are only a few fundamental approaches.
- We examine the provenance of the object (for example, the documentation of the chain of custody) and the extent to which we trust and believe this documentation as well as the extent to which we trust the custodians themselves.
- We perform a forensic and diplomatic examination of the object (both its content and its artifactual form) to ensure that its characteristics and content are consistent with the claims made about it and the record of its provenance.
- We rely on signatures and seals that are attached to the object or the claims that come with it, or both, and evaluate their forensics and diplomatics and their consistency with claims and provenance.
- For mass-produced and distributed (i.e., published) objects, we compare the object in hand with other versions (copies) of the object that may be available (which, in turn, means also assessing the integrity and provenance of these other versions or copies).
In the digital environment, there are few forensics or diplomatics,3 other than the forensics and diplomatics of content itself. We cannot evaluate inks, papers, binding technology, and similar physical characteristics.4 We can note, just as with a physical work, that an essay allegedly written in 1997 that makes detailed references to events and publications from 1999 is either remarkably prescient or incorrectly dated. There are limited forensics of availability, and they mainly provide negative information. For example, if a document claims to have been written in 1998 and we have copies of it that were deposited on various servers in 1997 (and we trust the claims of the servers that the material was in fact deposited in 1997), we can build a case that it was first distributed no later than 1997, regardless of the date contained in the object. Nevertheless, this does not tell us when the document was written.
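The reasoning about deposit dates can be made concrete with a minimal sketch. The records, server names, and trust flags below are hypothetical; the only point illustrated is that trusted deposits bound the date of first distribution from above and say nothing about the date of writing.

```python
from datetime import date

# Hypothetical deposit records: (server, claimed deposit date, do we trust the claim?)
deposits = [
    ("server-a.example", date(1997, 6, 2), True),
    ("server-b.example", date(1997, 3, 15), True),
    ("server-c.example", date(1996, 1, 1), False),  # untrusted claim, ignored
]

def latest_possible_first_distribution(records):
    """Return the earliest trusted deposit date, or None if no claim is trusted.

    The document must have been distributed on or before this date.
    Nothing here tells us when the document was actually written.
    """
    trusted = [when for _, when, is_trusted in records if is_trusted]
    return min(trusted) if trusted else None

bound = latest_possible_first_distribution(deposits)
claimed_year = 1998  # the date asserted inside the object itself
# The internal date conflicts with the evidence if copies existed before it.
conflict = bound is not None and bound.year < claimed_year
print(bound, conflict)
```

Note that the untrusted 1996 record is discarded: the inference is only as strong as our trust in each server's deposit claim.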
The fundamental concept of publication in the digital environment (the dissemination of a large number of copies to arbitrary interested parties, copies that are subsequently autonomously managed and maintained) has come under great stress from numerous factors in the networked information environment. These factors include, for example, the move from sale to licensing, limited distribution, making copies public for viewing without giving viewers permission to maintain the copies, and technical protection systems (National Research Council 2000). While the basic principle of broad distribution and subsequent autonomous management of copies remains valid and useful as a base of evidence against which to test the authenticity of documents in question, the availability of relevant and trustworthy copies may be limited in the digital environment, and assessing the copies is likely to be more difficult. Moreover, the forensics and diplomatics of evaluating seals and signatures, and documentation of provenance, become much more formal and computational. It is difficult to say whether seals and signatures are more or less compelling in the digital world than in the analog world, but their characters unquestionably change. Finally, provenance and chains of custody in the digital world begin to reflect our evaluation of archives and custodians as implementers and operators of “trusted systems” that enforce the integrity and provenance records of objects entrusted to them.
At some level, authenticity and integrity are mechanical characteristics of digital objects; they do not speak to deeper questions of whether the contents of a digital document are accurate or truthful when judged objectively. An authentic document may faithfully transmit complete falsehoods. There is a hierarchy of assessment in operation: forensics, diplomatics, intellectual analyses of consistency and plausibility, and evaluations of truthfulness and accuracy. Our concern here is with the lower levels of this hierarchy (i.e., forensics and diplomatics as they are reconceived in the digital environment) but we must recognize that conclusive evaluations at the higher levels may also provide evidence that is relevant to lower-level assessment.
Exploring Definitions and Defining Terms:
Digital Objects, Integrity, and Authenticity
The Nature of Digital Information Objects
Before we can discuss integrity and authenticity, we must examine the objects to which we apply these characterizations.
Most commonly, computer scientists are concerned with digital objects that are defined as a set of sequences of bits. One can then ask computationally based questions about whether one has the correct set of sequences of bits, such as whether the digital object in one’s possession is the same as that which some entity published under a specific identifier at a specific point in time. However, this is a simplistic notion. There are additional factors to consider.
Bits are not directly apprehended by the human sensory apparatus; they are never truly artifacts. Instead, they are rendered, executed, performed, and presented to people by hardware and software systems that interpret them. The question is how sophisticated these environmental hardware and software systems are and how integral they are to the understanding of the bits. In some cases, the focus is purely on the bits: numeric data files, or sensor outputs, for example, that are manipulated by computational or visualization programs. Documentary objects are characterized primarily by their bits (think of simple ASCII text), but the craft of publishing begins to make a sensory presentation of this collection of bits, to turn content into experience. Text, marked up in HTML and displayed through a Web browser, takes on a sensory dimension; the words that make up the text being rendered no longer tell the whole story. Digital objects that are performed (music, video, images that are rendered on screen) incorporate a stronger sensory component. Issues of interaction with the human sensory system (psychoacoustics, quality of reproduction, visual artifacts, and the like) become more important. The bits may be the same across space and time, but because of differences in the hardware and software used by recipients, the experience of viewing them may vary substantially. This raises questions about how to define and measure authenticity and integrity. In the most extreme case, we have objects that are rendered experientially (video games, virtual reality walk-throughs, and similar interactive works), where the focus shifts from the bits that constitute the digital object to the behavior of the rendering system, or at least to the interaction between the digital object and the rendering system.
Thus, we might think about a hierarchy of digital objects that could be expressed as follows:
(Interactive) experiential works
Sensory presentations
Documents
Data
As we move up the hierarchy, from data to experiential works, the questions about the integrity and authenticity of the digital objects become more complex and perhaps more subjective; they address experience rather than documentary content (Lynch 2000). This paper will focus on the lower part of the digital object hierarchy. The upper part is poorly understood and today is addressed only in a limited way; for example, through discussions about emulation as a preservation strategy (Rothenberg 1999, 1995). It seems conceivable that one could extend some of the observations and assertions discussed later in this paper to the more experiential works by performing computations on the output of the renderings rather than on the objects themselves. However, this approach is fraught with problems involving canonical representations of the user interface (which, in the most complex cases, involves interaction and not just presentation) and agreeing on what constitutes the authentic experience of the work.
In meatspace, we cheerfully extend the notion of authenticity to much more than objects-in fact, we explicitly apply it to the experiential sphere, speaking of an “authentic” performance of a baroque concerto or an “authentic” Hawaiian luau. To the extent that we can make the extension and expansion of the use of authenticity as a characteristic precise within the framework and terminology of this paper, these statements seem to parallel statements about integrity of what in the digital environment could be viewed as experiential works, or performance.
Even as we struggle with definitions and tests of integrity and authenticity for intellectual works in the digital environment, we are seeing new classes of digital objects (for example, e-cash and digital bearer bonds) that explicitly involve and rely upon stylized and precise manipulation of provenance, authenticity, identity and anonymity, and integrity within a specific trust framework and infrastructure. While these fit somewhere between data and documents in the digital object hierarchy, they are interesting because they derive their meaning and significance from their explicit interaction with frameworks of integrity, authenticity, provenance, and trust.
Canonicalization and (Computational) Essence
Often, we seek to discuss the essence of a work rather than the exact set of sequences of bits that may represent it in a specific context; we are concerned with integrity and authenticity as they apply to this essence, rather than to the literal bits. Discussions of essence become more problematic as we move up the digital object hierarchy. However, even at the lower levels of data and documents, we encounter a troublesome imprecision that is a barrier to making definitions operational computationally when we move beyond the literal definition of precisely equivalent sets of sequences of bits. Those approaching the question from a literary or documentary perspective cast the issue in a palette of grays: there are series (not necessarily a strict hierarchy; at best a partial ordering) of intellectual abstractions of a document that capture its essence at various levels, and the key problem is whether this abstract essence is retained. The abstraction may involve words, layout, typography, or even the feel of the pages. Are hardcover and paperback editions of a book equivalent? Does equivalence depend on whether the pagination is identical? Elsewhere, I have proposed canonicalization as a method of making such abstractions precise (Lynch 1999). The fundamental point of canonicalization as an organizing principle is that it defines computational algorithms (called “canonicalizations”) that can be used to extract the “essence” of documents according to various definitions of what constitutes that essence. If we have such computational procedures for extracting the essence of digital objects, we can then compare digital objects through the prism of that definition of essence. We can also make assertions that involve abstract representations of this essence, rather than more specific (and presumably haphazard) representations that incorporate extraneous characteristics.
The hard problem, of course, is precisely defining and achieving a consensus about the right canonicalization algorithm, or algorithms, for a given context.
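To make the idea of canonicalization concrete, here is a minimal sketch. The two canonicalization functions and the sample texts are illustrative assumptions, not algorithms proposed in the paper; they stand for two possible definitions of "essence" (the literal characters versus the sequence of words).

```python
import hashlib
import re
import unicodedata

def canonicalize_literal(text: str) -> bytes:
    """Strictest definition of essence: the literal encoded characters."""
    return text.encode("utf-8")

def canonicalize_words(text: str) -> bytes:
    """Looser definition of essence: the sequence of words,
    ignoring case, punctuation, line breaks, and spacing."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    words = re.findall(r"\w+", normalized)
    return " ".join(words).encode("utf-8")

def equivalent(a: str, b: str, canonicalization) -> bool:
    """Compare two objects through the prism of one canonicalization."""
    digest_a = hashlib.sha256(canonicalization(a)).hexdigest()
    digest_b = hashlib.sha256(canonicalization(b)).hexdigest()
    return digest_a == digest_b

# Two hypothetical "editions" that differ only in layout.
edition_1 = "Authenticity and integrity  are elusive\nproperties."
edition_2 = "Authenticity and integrity are\nelusive properties."

print(equivalent(edition_1, edition_2, canonicalize_literal))  # False
print(equivalent(edition_1, edition_2, canonicalize_words))    # True
```

Whether the two editions count as "the same" depends entirely on which canonicalization the community agrees to apply, which is exactly the consensus problem noted above.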
When we say that a digital object has “integrity,” we mean that it has not been corrupted over time or in transit; in other words, that we have in hand the same set of sequences of bits that came into existence when the object was created. The introduction of appropriate canonicalization algorithms allows us to consider the integrity of various abstractions of the object, rather than of the literal bits that make it up, and to operationalize this discussion of abstractions into equality of sets of sequences of bits produced by the canonicalization algorithm.
When we seek to test the integrity of an object, however, we encounter paradoxes and puzzles. One way to test integrity is to compare the object in hand with a copy that is known to be “true.”5 Yet, if we have a secure channel to a known true copy, we can simply take a duplicate of the known true copy. We do not need to worry about the accuracy of the copy in hand, unless the point of the exercise is to ensure that the copy in hand is correct-for example, to detect an attempt at fraud, rather than to be sure that we have a correct copy. These are subtly different questions.6
If we do not have secure access to an independently maintained, known true copy of the object (or at least a digest surrogate), then our testing of integrity is limited to internal consistency checking. If the object is accompanied by an authenticated (“digitally signed”) digest, we can check whether the object is consistent with the digest (and thus whether its integrity has been maintained) by recomputing the digest from the object in hand and then comparing it with the authenticated digest. But our confidence in the integrity of the object is only as good as our confidence in the authenticity and integrity of the digest. We have only changed the locus of the question to say that if the digest is authentic and accurate, then we can trust the integrity of the object. Verifying integrity is no different from verifying the authenticity of a claim that “the correct message digest for this object is M” without assigning a name to the object. The linkage between claim and object is done by association and context: by keeping the claim bound with the object, perhaps within the scope of a trusted processing system such as an object repository.
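The digest check described above can be sketched as follows. This is an illustration under stated assumptions: SHA-256 is an arbitrary choice of digest, and an HMAC under a shared secret key stands in for a true digital signature (the Python standard library has no public-key signatures). The key and document are hypothetical.

```python
import hashlib
import hmac

def digest(obj: bytes) -> str:
    """Message digest of the object in hand."""
    return hashlib.sha256(obj).hexdigest()

def authenticate_digest(d: str, key: bytes) -> str:
    """Stand-in for signing the digest: an HMAC under a shared secret."""
    return hmac.new(key, d.encode("ascii"), hashlib.sha256).hexdigest()

def check_integrity(obj: bytes, claimed_digest: str, tag: str, key: bytes) -> bool:
    # Step 1: is the claimed digest itself authentic?
    expected_tag = authenticate_digest(claimed_digest, key)
    if not hmac.compare_digest(expected_tag, tag):
        return False
    # Step 2: recompute the digest from the object and compare.
    return hmac.compare_digest(digest(obj), claimed_digest)

key = b"hypothetical shared secret"
document = b"The text of the object in hand."
d = digest(document)
tag = authenticate_digest(d, key)

print(check_integrity(document, d, tag, key))          # True
print(check_integrity(document + b"!", d, tag, key))   # False
```

The result of `check_integrity` is only as trustworthy as the key and whoever holds it, which is the shift in the locus of the question that the text describes.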
In the digital environment, we also commonly encounter the issue of what might be termed “situational” integrity, i.e., the integrity of derivative works. Consider questions such as “Is this an accurate transcript?”, “Is this a correct translation?”, or “Is this the best possible version given a specific set of constraints on display capability?” Here we are raising a pair of questions: one about the integrity of a base object, and another about the correctness of a computation or other transformation applied to the object. (To be comprehensive, we must also consider the integrity of the result of the computation or transformation after it has been produced.) This usually boils down to trust in the source or provider of the computation or transformation, and thus to a question of authentication of source or of validity, integrity, and correctness of code.
Validating authenticity entails verifying claims that are associated with an object; in effect, verifying that an object is indeed what it claims to be, or what it is claimed to be (by external metadata). For example, an object may claim to be created on a given date, to be authored by a specific person, or to be the object that corresponds with a name or identifier assigned by some organization. Some claims may be more mechanistic and indirect than others. For example, a claim that “This object was deposited in a given repository by an entity holding this public/private key pair at this time” might be used as evidence to support authorship or precedence in discovery. Typically, claims are linked to an object in such a way that they include, at least implicitly, a verification of integrity of the object about which claims are made. Rather than simply speaking of the (implied) object accompanying the claim (under the assumption that the correct object will be kept with the claims, and that the object management environment will ensure the integrity of the object), one may include a message digest (and any necessary information about canonicalization algorithms to be applied prior to computing the digest) as part of the metadata assertion that embodies the claim.
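A metadata assertion that carries its own digest and canonicalization information might look like the following sketch. The field names, the two canonicalizers, and the sample claim are hypothetical; no particular metadata standard is implied.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical named canonicalizations applied before digesting.
CANONICALIZERS = {
    "literal-bytes": lambda data: data,
    "collapse-whitespace": lambda data: b" ".join(data.split()),
}

@dataclass(frozen=True)
class Claim:
    statement: str          # e.g., "authored by A"
    canonicalization: str   # which canonicalization to apply first
    digest_algorithm: str   # e.g., "sha256"
    object_digest: str      # digest of the canonicalized object

def make_claim(statement, obj, canonicalization="literal-bytes",
               digest_algorithm="sha256"):
    canonical = CANONICALIZERS[canonicalization](obj)
    d = hashlib.new(digest_algorithm, canonical).hexdigest()
    return Claim(statement, canonicalization, digest_algorithm, d)

def claim_refers_to(claim: Claim, obj: bytes) -> bool:
    """Check that the object in hand is the one the claim is about."""
    canonical = CANONICALIZERS[claim.canonicalization](obj)
    d = hashlib.new(claim.digest_algorithm, canonical).hexdigest()
    return d == claim.object_digest

claim = make_claim("authored by A", b"some  text\n", "collapse-whitespace")
print(claim_refers_to(claim, b"some text"))    # True
print(claim_refers_to(claim, b"other text"))   # False
```

Because the claim names the object by digest, it no longer depends on the storage environment to keep the right object next to it. Whether the statement itself is true remains a separate question.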
It is important to note that tests of authenticity deal only with specific claims (for example, “did X author this document?”) and not with open-ended inquiry (“Who wrote it?”). Validating the authenticity of an object is more limited than is an open-ended inquiry into its nature and provenance.
There are two basic strategies for testing a claim. The first is to believe the claim because we can verify its integrity and authenticate its source, and because we choose to trust the source. In other words, we validate the claim that “A is the author of the object with digest X” by first verifying the integrity of the object relative to the claim (that it has digest X), and then by checking that the claim is authenticated (i.e., digitally signed) by a trusted entity (T). The heart of the problem is ensuring that we are certain who T really is, and that T really makes or warrants the claim. The second strategy is what we might call “independent verification” of the claim. For example, if there is a national author registry that we trust, we might verify that the data in the author registry are consistent with the claim of authorship. In both cases, however, validating a claim that is associated with an object ultimately means nothing more or less than making the decision to trust some entity that makes or warrants the claim.
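The two strategies can be contrasted in a minimal sketch. The set of trusted entities, the registry contents, and the claim fields are all hypothetical, and the signature check is reduced to a boolean computed elsewhere.

```python
# Hypothetical trust decisions and registry contents, for illustration only.
TRUSTED_ENTITIES = {"T"}
AUTHOR_REGISTRY = {"digest-X": "A"}  # stands in for a national author registry

def verify_via_trusted_source(claim, object_digest, signature_is_valid):
    """Strategy 1: the object matches the claim, the claim is signed,
    and we choose to trust the signer."""
    digest_matches = claim["object_digest"] == object_digest
    signer_trusted = claim["signed_by"] in TRUSTED_ENTITIES
    return digest_matches and signature_is_valid and signer_trusted

def verify_independently(claim, registry=AUTHOR_REGISTRY):
    """Strategy 2: consult an independent source that we choose to trust."""
    return registry.get(claim["object_digest"]) == claim["author"]

claim = {"author": "A", "object_digest": "digest-X", "signed_by": "T"}
print(verify_via_trusted_source(claim, "digest-X", signature_is_valid=True))  # True
print(verify_independently(claim))                                            # True
```

In both functions the decisive input is a trust decision made outside the code: membership in `TRUSTED_ENTITIES`, or the choice to believe the registry.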
Several final points about authenticity merit attention. First, trust in the maker or warrantor of a claim is not necessarily binary; in the real world, we deal with levels of confidence or degrees of trust. Second, many claims may accompany an object; in evaluating different claims, we may assign them differing degrees of confidence or trust. Thus, it does not necessarily make sense to speak about checking the authenticity of an object as if it were a simple true-or-false test, a computation that produces a one or a zero. It may be more constructive to think about checking authenticity as a process of examining and assigning confidence to a collection of claims. Finally, claims may be interdependent. For example, an object may be accompanied by claims that “This is the object with identifier N,” and “The object with identifier N was authored by A” (the second claim, of course, is independent of the document itself, in some sense). Perhaps more interesting, in an archival context, would be claims that “This object was derived from the object with message digest M by a specific reformatting process” and “The object with message digest M was authored by A.” (See Lynch 1999 for a more detailed discussion of this case.)
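One way to picture authenticity as graded confidence over interdependent claims is the sketch below. The confidence values are invented, and multiplying confidences along dependencies is only one simple combination rule chosen for illustration (it assumes the claims are independent); the paper does not prescribe any particular rule.

```python
# Hypothetical claims with invented confidence values in [0, 1].
claims = {
    "is_object_N": {"confidence": 0.9, "depends_on": []},
    "N_authored_by_A": {"confidence": 0.8, "depends_on": []},
    # A derived conclusion that holds only if both claims above hold.
    "this_object_authored_by_A": {
        "confidence": 1.0,
        "depends_on": ["is_object_N", "N_authored_by_A"],
    },
}

def effective_confidence(name, claims):
    """Confidence in a claim, discounted by the claims it depends on."""
    claim = claims[name]
    result = claim["confidence"]
    for dependency in claim["depends_on"]:
        result *= effective_confidence(dependency, claims)
    return result

for name in claims:
    print(name, round(effective_confidence(name, claims), 2))
```

The derived claim ends up with lower confidence than either of its supports, which is the practical consequence of interdependence: a chain of claims is weaker than its weakest link.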
Comparing Integrity and Authenticity
It is an interesting, and possibly surprising, conclusion that in the digital environment, tests of integrity can be viewed as just special cases and byproducts of evaluations of authenticity. Part of this comes from the perspective of the environment of “pervasive deceit” and the idea that checking integrity of an object means comparing it with some precisely identified and rigorously vetted “original version” or “authoritative copy.” In fact, much of the checking for integrity in the physical world is not about ferreting out pervasive deceit and malice, but rather about accepting artifacts for roughly what they seem to be on face value and then looking for evidence of damage or corruption (e.g., torn-out pages or redacted text). For this kind of integrity checking, a message digest that accompanies a digital object as metadata serves as an effective mechanism to ensure that the object has not been damaged or corrupted. This is true even if the message digest is not supported by an elaborate signature chain and trust assessment, but only by a general level of confidence in the computational context in which the objects are being stored and transmitted. In the digital environment, there is a tendency to downplay the need for this kind of integrity checking in favor of stronger measures that combine authenticity claims with integrity checks.
The Role of Copies
David Levy argues that all digital objects are copies; this echoes the findings of the National Research Council Committee on Intellectual Property in the Emerging Information Infrastructure that use (reading, for example) implies the making of copies (National Research Council 2000). If we accept this view, authenticity can be viewed as an assessment that we make about something in the present (something that we have in hand) relative to claims about the past (predecessor copies). The persistent question is whether a given object X has the same properties as object Y. There is no “original.” This is particularly relevant when we are dealing with dynamic objects such as databases, where an economy of copies is meaningless. In such cases, there is no question of authenticity through comparison with other copies; there is only trust or lack of trust in the location and delivery processes and, perhaps, in the archival custodial chain.
The term provenance comes up often in discussions of authenticity and integrity. Provenance, broadly speaking, is documentation about the origin, characteristics, and history of an object; its chain of custody; and its relationship to other objects. The final point is particularly important. There are two ways to think about a digital object that is created by changing the format of an older object that has been validated according to some specific canonicalization algorithm. We might think about a single object the provenance of which includes a particular transformation, or we might think about multiple objects that are related through provenance documentation. Thus, provenance is not simply metadata about an object-it can also be metadata that describe the relationships between objects. Because provenance also includes claims about objects, it is part of the authentication and trust infrastructures and frameworks.
I do not believe that we have a clear understanding of (and surely not consensus about) where provenance data should be maintained in the digital environment, or by what agencies. Indeed, it is not clear to what extent the record of provenance exists independently and permanently, as opposed to being assembled when needed from various pools of metadata that may be maintained by various systems in association with the digital objects that they manage. We also lack well-developed metadata element sets and interchange structures for documenting provenance. It seems possible that the Dublin Core, augmented by semantics for signing metadata assertions, might form a foundation for this, although attributes such as relationship would need to be extended to allow for very precise vocabularies to describe algorithmically based derivations of objects from other objects (or transformations of objects). We would probably also need to incorporate metadata assertions that allow an entity to record claims such as “Object X is equivalent to object Y under canonicalization C.”
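As an illustration of provenance as metadata about relationships between objects, the sketch below records a few assertions and assembles a derivation chain on demand. The record structure, relation vocabulary, and identifiers are hypothetical; this is not a Dublin Core schema or any existing interchange format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProvenanceAssertion:
    subject: str      # object identifier or digest
    relation: str     # e.g., "derived-from", "equivalent-under", "authored-by"
    target: str       # the related object or agent
    qualifier: str    # the transformation or canonicalization involved
    asserted_by: str  # who makes (and would sign) the assertion

# Hypothetical pool of assertions, possibly held by different systems.
assertions = [
    ProvenanceAssertion("digest:Y", "derived-from", "digest:M",
                        "reformat:hypothetical-process", "archive.example"),
    ProvenanceAssertion("digest:M", "authored-by", "person:A",
                        "", "registry.example"),
    ProvenanceAssertion("digest:X", "equivalent-under", "digest:Y",
                        "canonicalization:C", "archive.example"),
]

def derivation_chain(start, assertions):
    """Assemble, on demand, the chain of objects that `start` derives from."""
    chain = [start]
    current = start
    while True:
        step = next((a for a in assertions
                     if a.subject == current and a.relation == "derived-from"),
                    None)
        if step is None or step.target in chain:  # end of chain, or a cycle
            return chain
        chain.append(step.target)
        current = step.target

print(derivation_chain("digest:Y", assertions))  # ['digest:Y', 'digest:M']
```

Each assertion names its asserter because, as with any other claim, its value depends on whether we trust the entity that made it.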
Watermarks, Authenticity, and Integrity
In the most general sense, watermarking can be viewed as an attempt to ensure that a set of claims is inseparably bound to a digital object and thus can be assumed to travel with the object; one does not have to trust transport and storage systems to correctly perform this function. The most common use of watermarks today is to help protect intellectual property by attaching a copyright claim (and possibly an object-specific serial number to allow tracing of individual copies) to an object. Software exists to scan public Web sites for objects that contain watermarks and to notify the rights holders about where these objects have been found. A serial number, if present, helps the rights holder not only identify the presence of a possibly illegal copy but also determine where it came from. Various trusted system-based architectures for the control of copyrighted works have also been proposed that use watermarking (for example, the Secure Digital Music Initiative). The idea is that devices will refuse to play, print, or otherwise process digital objects if the appropriate watermarks are not present.7 The desirable properties of watermarks include being very hard to remove computationally (at least without knowledge of the private key as well as the algorithm used to generate the watermark) and being resilient under various alterations that may be applied to the watermarked file (lossy compression, for example, or image cropping). The development of effective watermarking systems is currently a very active area of research.8
From the perspective of authenticity and integrity, watermarks present several problems. First, they deliberately and systematically corrupt the objects to which they are applied, in much the same way that techniques such as lossy compression do. Fingerprints (individualized watermarks) are particularly bad in this regard since they defeat comparisons among copies as a way of establishing authenticity; indeed, this is exactly what they are designed to do: to make each copy unique and traceable. Applying a watermark to a digital object means changing bits within the object, but in such a way that they change the perception of the object only slightly. Thus, finding and verifying a watermark in a digital object give us only weak evidence of its integrity. In fact, the very presence of the watermark means that integrity has been compromised at some level, unless we are willing to accept the watermarked version of the object as the actual authoritative one: an image or sound recording that includes some data that allegedly does not much change our perception of the object. If a watermark can easily be stripped out of an object (a bad watermark design, but perhaps characteristic of watermarking systems that try to minimize corruption), then the absence of such a watermark does not tell us much about the possible corruption of other parts of the object.
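The effect of watermarking on bit-level integrity can be seen in a toy example. The sketch below uses least-significant-bit embedding purely as an illustration; it is an easily stripped scheme of the "bad design" kind, not how robust production watermarks work, and the sample values and mark bits are invented.

```python
import hashlib

def embed_lsb(samples: bytes, mark_bits):
    """Overwrite the least significant bit of the first samples with the mark."""
    out = bytearray(samples)
    for i, bit in enumerate(mark_bits):
        out[i] = (out[i] & 0xFE) | bit
    return bytes(out)

def extract_lsb(samples: bytes, n: int):
    """Read back the first n embedded bits."""
    return [b & 1 for b in samples[:n]]

original = bytes(range(100, 116))     # 16 hypothetical 8-bit samples
fingerprint_1 = [1, 0, 1, 1, 0, 0, 1, 0]  # serial number for copy 1
fingerprint_2 = [0, 1, 1, 0, 1, 0, 0, 1]  # serial number for copy 2

copy_1 = embed_lsb(original, fingerprint_1)
copy_2 = embed_lsb(original, fingerprint_2)

# Each sample changes by at most 1, so perception is barely affected...
max_change = max(abs(a - b) for a, b in zip(original, copy_1))
# ...but the digests no longer match the original or each other.
digests = {hashlib.sha256(x).hexdigest() for x in (original, copy_1, copy_2)}

print(extract_lsb(copy_1, 8), max_change, len(digests))
```

Three distinct digests for what a viewer would call "the same" object show why fingerprinted copies cannot be validated by comparing them with one another or with a digest of the unmarked original.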
A second problem is that some watermarking systems do not emphasize preventing the creation of fake watermarks; they are concerned primarily with the preservation of legitimate watermarks as evidence of ownership or status of the watermarked object. To use watermarking to address authenticity issues, it seems likely that one would need to use it simply as a means of embedding a claim in an object, under the assumption that the claim would then have to be separately verifiable (for example, by being digitally signed).
To summarize: If one obtains a digital object that contains a watermark, particularly if that watermark contains separately verifiable claims, it can provide useful evidence about the provenance and characteristics of the object, including good reasons to assume that it is a systematically and deliberately corrupted version of a predecessor digital object that one may or may not have access to or be able to locate. The watermark may have some value in forensic examination of digital objects, but it does not seem to be a good tool for the management of digital objects within a controlled environment such as an archive or repository system that is concerned with object integrity. It seems more appropriate to require that the environment take responsibility for maintaining linkages and associations between metadata (claims) and the objects themselves. Watermarks are more appropriate for an uncontrolled public distribution environment where integrity is just one variable in a complex set of trade-offs about the management and protection of content.
Semantics of Digital Signatures
One serious shortcoming of current cryptographic technology has to do with the semantics of digital signatures or, more precisely, the lack thereof. In fairness, many cryptographers are not concerned with replicating the higher levels of semantics that accompany the use of signatures in the physical world. They regard these issues as the responsibility of an applications environment that uses digital signatures as a tool or supporting mechanism. But wherever we assign responsibility for establishing a system of semantics, the need for such semantics is very real, and I believe that many people outside the cryptographic community have been misled by their assumptions about the word signature. They do not understand that the semantics problem is still largely unaddressed.
At its core, a digital signature is a mechanical, computational process. Some entity in possession of a public/private key pair was willing to perform a computation on a set of data using this key pair, which permits someone who knows the public key of the key pair to verify that the data were known to and computed upon by an entity that held the key pair. A digital signature amounts to nothing more than this. Notice that any digital data can be signed: not just documents or their digests, but also assertions about documents. The interface between digital signature processing and documents is extremely complex, questions about the semantics of signatures aside. The reader is invited to explore the work of the joint World Wide Web Consortium/Internet Engineering Task Force on digital signatures for XML documents (1998) to get a sense of how issues such as canonicalization come into play here.
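The mechanical character of this computation can be seen in a minimal sketch. It assumes textbook RSA over a SHA-256 digest, with a toy key pair built from two small Mersenne primes; the key size, the absence of padding, and the names `sign` and `verify` are illustrative assumptions only, and nothing here is secure enough for real use.

```python
import hashlib

# Toy key pair (assumption: for illustration only, far too small to be secure).
P = 2**61 - 1                      # a Mersenne prime
Q = 2**89 - 1                      # another Mersenne prime
N = P * Q                          # public modulus
E = 65537                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (Python 3.8+)


def digest_as_int(data: bytes) -> int:
    """Reduce the SHA-256 digest of the data to an integer below N."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def sign(data: bytes) -> int:
    """The holder of the private exponent computes on the data's digest."""
    return pow(digest_as_int(data), D, N)


def verify(data: bytes, signature: int) -> bool:
    """Anyone who knows (N, E) can check that the computation was performed."""
    return pow(signature, E, N) == digest_as_int(data)


document = b"Text of some document"
assertion = b"claim: the document above was received on deposit"

doc_sig = sign(document)
claim_sig = sign(assertion)  # an assertion about a document is signed the same way

print(verify(document, doc_sig))               # signature checks against the data
print(verify(document + b" (edited)", doc_sig))  # altered data no longer checks
```

Nothing in `verify` says why the key holder performed the computation; it reports only that the holder of the private exponent computed on exactly these bits.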
The use of digital signatures in conjunction with a public key infrastructure (PKI) offers a little more.9 People can choose to trust the procedures of a PKI to do the following kinds of things:
- To verify, according to published policies, a user’s right to an “identity” and to subsequently document the binding between that identity and a public/private key pair. Verification policies vary widely, from taking someone’s word in an e-mail message to demanding witnesses, extensive documentation such as passports and birth certificates, personal interviews, and other proof. In essence, one can trust the PKI service to provide the public key that corresponds to an identity. The identity can be either a name (“John Smith”) or a role (“Chief Financial Officer of X Corporation”). Attributes can also be bound to the identity.
- To provide a means for determining when a key pair/identity binding has been compromised, expired, or revoked and should no longer be considered valid.
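These two services can be sketched as a small in-memory registry. This is a minimal illustration under stated assumptions: the class names `Binding` and `Registry`, the placeholder key string, and the dates are hypothetical, and a real PKI publishes signed certificates and revocation information rather than answering lookups from a dictionary.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass(frozen=True)
class Binding:
    """A documented association between an identity and a public key."""
    identity: str                     # a name or a role
    public_key: str                   # placeholder for real key material
    valid_from: date
    valid_until: date
    revoked_on: Optional[date] = None

    def is_valid_on(self, day: date) -> bool:
        """True if the binding was neither expired nor revoked on `day`."""
        if day < self.valid_from or day > self.valid_until:
            return False
        return self.revoked_on is None or day < self.revoked_on


class Registry:
    """Hypothetical stand-in for the lookup service a PKI provides."""

    def __init__(self) -> None:
        self._bindings: Dict[str, Binding] = {}

    def register(self, binding: Binding) -> None:
        self._bindings[binding.identity] = binding

    def key_for(self, identity: str, day: date) -> Optional[str]:
        """Return the public key bound to `identity` if valid on `day`."""
        binding = self._bindings.get(identity)
        if binding is not None and binding.is_valid_on(day):
            return binding.public_key
        return None


registry = Registry()
registry.register(Binding(
    identity="Chief Financial Officer of X Corporation",
    public_key="PUBLIC-KEY-PLACEHOLDER",
    valid_from=date(2000, 1, 1),
    valid_until=date(2002, 12, 31),
    revoked_on=date(2001, 6, 1),
))

role = "Chief Financial Officer of X Corporation"
print(registry.key_for(role, date(2000, 7, 1)))  # within validity, not yet revoked
print(registry.key_for(role, date(2001, 7, 1)))  # after revocation
```

The lookup takes a date because a signature may need to be evaluated as of the time it was made, not only as of today.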
Compare this mechanistic view of signatures with the rich and diverse semantics of signatures in the real world. A signature might mean that the signer
- authored the document;
- witnessed the document and other signatures on it;
- believes that the document is correct;
- has seen, or received, the document;
- approves the actions proposed in the document; or
- agrees to the document.
There are questions not only about the meaning of signatures but also about their scope. In some situations, for example, documents are signed or initialed on every page; in others, a signature witnesses only another signature, not the entire document. Questions of scope become complex in a digital world, particularly as signed objects undergo transformations over time (because of reformatting, for example). Considerable research is needed in these areas.
Digital signatures alone can neither differentiate among the possible semantics outlined earlier, nor provide direct evidence of any one of them. In other words, there is no reasonable “default” meaning that can be given to a signature computation. Such signatures can tell us that a set of bits has been computed upon, and, in conjunction with a PKI, they can tell us who performed that computation. We clearly need a mechanism for expressing semantics of signatures that can be used in conjunction with the actual computational signature mechanism: a vocabulary for expressing the meaning of a signature in relationship to a digital object (or, in fact, a set of digital objects that might include other signed assertions).
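One way to picture such a mechanism is a small, explicit vocabulary whose terms are embedded in the bytes that get signed. The sketch below is an assumption-laden illustration: the `Meaning` terms simply mirror the list of real-world signature meanings given earlier, and the JSON layout and the name `make_claim` are hypothetical rather than drawn from any existing standard.

```python
import hashlib
import json
from enum import Enum


class Meaning(Enum):
    """Hypothetical vocabulary mirroring the signature meanings listed above."""
    AUTHORED = "authored"
    WITNESSED = "witnessed"
    BELIEVES_CORRECT = "believes-correct"
    RECEIVED = "received"
    APPROVES = "approves"
    AGREES = "agrees"


def make_claim(signer: str, meaning: Meaning, object_bytes: bytes) -> bytes:
    """Build the byte string that would be handed to the signature computation.

    The claim names the signer, the intended meaning, and the digest of the
    object it refers to, serialized in one fixed (canonical) form.
    """
    claim = {
        "signer": signer,
        "meaning": meaning.value,
        "object_sha256": hashlib.sha256(object_bytes).hexdigest(),
    }
    return json.dumps(claim, sort_keys=True, separators=(",", ":")).encode("utf-8")


report = b"Annual report, final text"
authored = make_claim("John Smith", Meaning.AUTHORED, report)
received = make_claim("John Smith", Meaning.RECEIVED, report)

# Same signer and same object, but different bytes to sign, so a signature
# over one claim cannot be read as a signature over the other.
print(authored != received)
print(json.loads(authored)["meaning"])
```

Because the signature is computed over the claim rather than over the bare object, the meaning travels with the signature instead of being left to the reader's assumptions.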
One can imagine defining such a vocabulary and interchange syntax for the management and preservation of digital objects-for a community of archives and cultural heritage organizations, for example. But there is another problem that has not been well explored, to my knowledge. It is likely that we will see the development of one or more “public” vocabularies for commerce and contracting, and perhaps additional ones for the registry and management of intellectual property. These vocabularies might vary among nations, or even among states in a nation such as the United States, where much contracting is governed by state law.10 In addition, we will almost certainly see the development of organization-specific “internal” vocabularies in support of institutional processes. Many of the initial claims about objects will likely be expressed in one of these other vocabularies rather than the vocabularies of the cultural heritage communities; consequently, we will face complex problems of mapping and interpreting vocabularies. We will also face the problems of trying to interpret vocabularies that may belong to organizations that no longer exist or vocabularies in which usage has changed over time, perhaps in poorly documented ways.
The Roles of Identity and Trust
Virtually all determination of authenticity or integrity in the digital environment ultimately depends on trust. We verify the source of claims about digital objects or, more generally, claims about sets of digital objects and other claims, and, on the basis of that source, assign a level of belief or trust to the claims. As a second, more intellectual form of analysis, we can consider the consistency of claims, and then further consider these claims in light of other contextual knowledge and common sense. For example, an object that claims to have been authored in 2003 by someone who died in 2001 would reasonably raise questions, even if all of the signatures verify. We can draw precious few conclusions from objects standing alone, except by applying this kind of broader intellectual analysis. As we have seen, ensuring the validity of linkages between claims and the objects about which those claims make assertions is an important question. The question becomes even more difficult when we recognize that both objects and sets of claims evolve independently and at different rates, because of maintenance processes such as reformatting or the expiration of key pairs and the issuance of new ones.
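The consistency analysis described above is separate from signature verification and can be sketched on its own. The record layout and the function name `consistency_problems` are hypothetical; the single rule shown is the authorship-date example from the text, and the date of death is assumed to come from other contextual knowledge rather than from the object itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AuthorshipClaim:
    """A claim attached to an object, plus one piece of outside context."""
    author: str
    year_authored: int
    signatures_verify: bool
    author_death_year: Optional[int] = None  # contextual knowledge, if any


def consistency_problems(claim: AuthorshipClaim) -> List[str]:
    """Return human-readable reasons to question the claim."""
    problems: List[str] = []
    if (claim.author_death_year is not None
            and claim.year_authored > claim.author_death_year):
        problems.append(
            f"{claim.author} is claimed as author in {claim.year_authored} "
            f"but died in {claim.author_death_year}"
        )
    return problems


suspect = AuthorshipClaim(
    author="A. Writer",
    year_authored=2003,
    signatures_verify=True,
    author_death_year=2001,
)
plausible = AuthorshipClaim(
    author="B. Writer",
    year_authored=1999,
    signatures_verify=True,
)

# Valid signatures do not settle the matter: the first claim is still flagged.
print(consistency_problems(suspect))
print(consistency_problems(plausible))
```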
Ultimately, trust plays a central role, yet it is elusive. Signatures can allow us to trust a claim if we trust the holder of a key pair, and a public key infrastructure can allow us to know the identity (name) of the holder of a key pair if we trust the operator of the PKI. If we know the name of the entity we trust, we can thus use the PKI to determine its public key and use that to verify signatures that the entity has made. We can establish the link between identity and keys directly (we can directly obtain, through some secure method, the public key from a trusted entity) or through informal intermediaries (we can securely obtain the key from someone we know and trust, as is done in the Pretty Good Privacy [PGP] system) (Zimmermann 1995).
It is important to recognize that trust is not necessarily an absolute, but often a subjective probability that we assign case by case. The probability of trustworthiness may be higher for some PKIs than for others, because of their policies for establishing identity. Moreover, we may establish higher levels of trust based on identities that we have directly confirmed ourselves than on those confirmed by others. Considerable research is being done on methods that people could use to define rules about how they assign trust and belief. These rules can drive computations for a calculus of trust in evaluating claims within the context of a set of known keys and identities and PKI services that maintain identities. An interesting question, which I do not think we are close to being able to answer, is whether there will be a community consensus on trust assignment rules within the cultural heritage community, or whether we will see many, wildly differing, choices about when to establish trust.
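As a rough illustration of what a rule-driven calculus of trust might compute, the sketch below treats each link between a claim and the evaluator as a subjective probability and multiplies the links together. The independence of the links, the particular numbers, the threshold, and the names `chain_trust` and `accept` are all assumptions made for the example; they are not taken from any of the research referred to above.

```python
from math import prod
from typing import Sequence


def chain_trust(link_probabilities: Sequence[float]) -> float:
    """Combine per-link trust values, assuming the links are independent."""
    return prod(link_probabilities)


def accept(trust: float, threshold: float) -> bool:
    """A local rule: believe the claim only above a chosen threshold."""
    return trust >= threshold


# Subjective values chosen by the evaluator (assumptions for illustration).
SIGNER = 0.99        # trust that the key holder behaves properly
STRICT_PKI = 0.95    # identity checked with documents and interviews
LAX_PKI = 0.60       # identity taken from an e-mail message

direct = chain_trust([SIGNER])                   # key confirmed in person
via_strict_pki = chain_trust([STRICT_PKI, SIGNER])
via_lax_pki = chain_trust([LAX_PKI, SIGNER])

THRESHOLD = 0.90
for label, value in [("direct", direct),
                     ("strict PKI", via_strict_pki),
                     ("lax PKI", via_lax_pki)]:
    print(label, round(value, 4), accept(value, THRESHOLD))
```

Two communities that pick different thresholds or different per-link values will reach different conclusions from the same signatures, which is the consensus question raised above.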
We also need an extensive inquiry into the nature of identity in the digital world as it relates to authenticity questions such as claims of authorship. Consider just a few points here. Identity in the digital world means that someone has agreed to trust an association between a name and a key pair, because he or she has directly verified it or trusts an intermediary, such as a PKI, that records such an association. Control of an identity, however, can be mechanically transferred or shared by the simple act of the owner of a key pair sharing that key pair with some other entity. We have to trust not only the identity but also the behavior of the owner of that identity.
If we are to trust a claim of authorship, whom do we expect to sign it? The author? The publisher? A registry such as the copyright office, which would more likely sign a claim stating that the author has registered the object and claimed authorship?
Identity is more than simply a name. We frequently find anonymous or pseudonymous authorship; how are these identities created and named? We have works of corporate authorship, including the notion of “official” works that are created through deliberate corporate acts and that represent policy or statements with legal implications. In this case, the signatory may be someone with a specific role or office within a corporation (an officer of the corporation or the corporate secretary, for example). These may be very volatile in an era of endless mergers and acquisitions, as well as occasional bankruptcies. Finally, we have various ad-hoc groups that come together to author works; these groups may be unwilling or unable to create digital identities within the trust and identity infrastructure (consider, for example, artistic, revolutionary, or terrorist manifestos).
We know little about how identity management systems operate over very long periods. Imagine a digital object that is released from an archive in 2100 for the first time, an object that had been sealed since its deposit in 2000. A group of experts is trying to assess the claims associated with the object. One scenario is that all claims were verified upon deposit, and the archive has recorded that verification; the experts then trust the archive to have correctly maintained the object since its deposit and to have appropriately verified the claims. A second scenario is that the group of experts chooses to re-verify the claims. This may take them into an elaborate exploration of the historical evolution of policies of certificate authorities and public key infrastructure operators that have long since vanished, of histories of key assignment and expiration, and perhaps even of the evolution of our understanding of the vulnerabilities of cryptographic algorithms themselves. This suggests that our ability to manage and understand authenticity and integrity over long periods of time will require us to manage and preserve documentation about the evolution of the trust and identity management infrastructure that supports the assertions and evaluation of authenticity and integrity. This, in turn, raises the concern that relying on services and infrastructure that are being established primarily to support relatively short-term commercial activities may be problematic. At a minimum, it suggests that we may need to begin a discussion about the archival requirements for such services if they are to support the long-term management of our cultural and intellectual heritage.
Authorship is just one example of the difficulties involved in “literary” signature semantics. Consider the problem of assigning publication dates as another example. Every publisher has different standards and thus different semantics.
Conclusions
In an attempt to explore the central roles of trust and identity in addressing authenticity and integrity for digital objects, this paper points to a wide-ranging series of research questions. It identifies the need to begin considering standardization efforts in areas such as signing metadata claims and the semantics of digital signatures to support authenticity and integrity.
But a set of more basic issues about infrastructure development and large-scale deployment also needs to be carefully considered. A great deal of technology and infrastructure now being deployed will be useful in managing integrity and authenticity over time. However, these developments are being driven by commercial requirements with short time horizons in areas such as authentication, electronic commerce, electronic contracting, and management and control of digital intellectual property. The good news is that there is a huge economic base in these areas that will underwrite the development of infrastructure and drive deployment. To the extent that we can share this work to manage cultural and intellectual heritage, we need to worry only about how to pay to use it for these applications, not about how to underwrite its development. Even there, however, we need to think about who will pay to establish the necessary identities and key pairs and to apply them to create the appropriate claims that will accompany digital objects. The less-good news is that we need to be sure that the infrastructure and deployed technology base actually meet the needs of very long-term management of digital objects. To take one example, knowing the authorship of a work is still important, even after all the rights to the work have entered the public domain. It is essential that institutions concerned with the management and preservation of cultural and intellectual heritage engage, participate in, and continue to critically analyze the development of the evolving systems for implementing trust, identity, and attribution in the digital environment.
Footnotes
1. Confusingly, however, we have the appearance of perfect forgeries (at least in terms of content; the packaging is often substandard) of digital goods in the form of pirate audio CDs, DVDs, and software CD-ROMs. In these cases, the purpose is not usually intellectual fraud so much as commercial fraud through piracy. One might argue that these copies have integrity (they are, after all, bitwise equivalent); however, their authenticity is dubious, or at least needs to be proved by comparison with copies that have a provenance that can be documented. Another case that bears consideration and helps refine our thinking is the bootleg or “gray-market” recording—perhaps an audio CD of a live performance of a well-known band, released without the authorization of the performers and not on their usual record label. This does not stop the recording from being authentic and accurate, albeit unauthorized. The performers may or may not be willing to vouch for the authenticity of the recording; alternatively, one may have to rely on the evidence of the content (i.e., nobody else sounds like that) and, possibly, metadata provided by a third party that potentially has its own provenance.
2. It would be useful to better understand why there has not been a greater effort to deploy these capabilities, even though they have substantial limitations. Contributing factors undoubtedly include export controls and other government regulations on cryptography, both in the United States and elsewhere; legal and liability issues involved in an infrastructure that addresses authentication and identity; and social and cultural concerns about privacy, accountability, and related topics. Patent issues are a particular problem. It is hard to develop infrastructure, widely deployed standards, and critical mass when key elements are tied up by patents. With the recent insane proliferation of patents on software methods, algorithms, business models, and the like, uncertainty about patent issues is also a serious barrier to deployment. All of these have been well covered in the literature and the press. What has been less well examined is the lack of clear, well-established economic models to support systems of authentication and integrity management. To put it bluntly, it is not clear who is willing to pay for the substantial development, deployment, and operation of such a system. While many people say they are worried about authenticity and integrity in a digital environment, it is not clear that they are willing to pay the increased costs to effectively address these concerns.
3. It is worth carefully examining the forensic clues available when evaluating a digital object as an artifact. Today, many of them seem trivial, but as our history with digital technology grows longer, understanding them will likely become a specialized body of expertise. Examples include character codes, file formats, and formats of embedded fonts, all of which can help at least establish the earliest time at which a digital object could have been created, and perhaps even provide evidence to argue that it was unlikely to have been created after a certain time. For an object that has undergone format conversions over time as part of its preservation, these forensic clues help only in the evaluation of the record of provenance.
4. For digital objects created by digitizing physical artifacts, if we can identify and obtain access to the source physical artifact, we can apply well-established forensic and diplomatic analysis practices to the source object.
5. As soon as we begin to speak of copies, however, we need to be very careful. Unless we know the location of the copy through some external (contextual) information, we run the risk of confusing authenticity and integrity. For example, if we have an object that includes a claim that “the identifier of this object is N” and we simply go looking for copies of objects with identifier N on a server that we trust, and then securely compare the object in hand with one of these copies, what we have really done is simply to trust the server to make statements about the assignment of the identifier N and then confirmed we had an accurate copy of the object with that identifier in hand. The key difference is between trusting the server to keep a true copy of an object in a known place and trusting the server to vouch for the assignment of an identifier to an object.
6. One thing that we can do with cryptographic technology (specifically, digest algorithms) is to test whether two copies of an object are identical without actually exchanging the object. This is important in contexts where economics and intellectual property come into play. For example, a publisher that is offering copies of a digital document for license can also offer a verification service, where the holder of a copy of a digital object can verify its integrity without having to purchase access to a new copy. Or, two institutions, each of which holds a copy of a digital object but does not have the rights to share it with another institution, can verify that they hold the same object. Digest algorithms are also useful for efficiency purposes, because they avoid the need to transmit copies of what may be very large objects in order to test integrity. We should note, however, that a match between digests is a probabilistic statement; the algorithms are designed to make it very unlikely that two different objects (particularly two similar but distinct documents) will have the same digest.
7. This is not a universally accepted definition of a digital watermark. The term is also used to refer to other things, such as modifications to images that allow them to be viewed on-screen with only moderate degradation but that produce very visible and unsightly artifacts when the image is printed. The description here characterizes what I believe to be the most commonly used definition of the technology. Sometimes “watermark” is reserved for a “universal” encoding hidden in all copies of a digital object that are distributed by a given source (for example, containing an object identifier) and the term “fingerprint” is reserved for watermarks that are copy-specific, that is personalized to given recipients (containing a serial number or the recipient’s identifier). The fingerprint individualizes an object to a version associated with a specific recipient.
8. See, for example, the proceedings of the series of conferences on Information Hiding (Anderson 1996, Aucsmith 1998, Pfitzmann 2000). See also proceedings from the first, second, and third international conferences on financial cryptography (Hirschfeld 1997, Hirschfeld 1998, Franklin 1999).
10. In the United States, some of this is likely to be determined by how quickly federal law regarding digital signatures is established and by the extent to which federal law preempts developing state laws. Changes to the Uniform Commercial Code will likely play a role. See http://washofc.epic.org/crypto/dss/ for information on a variety of material on current legislative and standards developments related to digital signatures.
References
Anderson, Ross, ed. 1996. Information Hiding: First International Workshop, Cambridge, U.K., May 30-June 1, 1996, proceedings. Lecture Notes in Computer Science, vol. 1174. Berlin and New York: Springer.
Aucsmith, David, ed. 1998. Information Hiding: Second International Workshop, Portland, Oregon, U.S.A., April 14-17, 1998, proceedings. Lecture Notes in Computer Science, vol. 1525. Berlin and New York: Springer.
Bearman, David, and Jennifer Trant. 1998. Authenticity of Digital Resources: Towards a Statement of Requirements in the Research Process, D-Lib Magazine (June). Available from http://www.dlib.org/dlib/june98/06bearman.html.
Duranti, Luciana. 1998. Diplomatics: New Uses for an Old Science. Lanham, Md.: Scarecrow Press.
Hirschfeld, Rafael, ed. 1997. Financial Cryptography: First International Conference, Anguilla, British West Indies, February 24-28, 1997, proceedings. Lecture Notes in Computer Science, vol. 1318. Berlin and New York: Springer.
Hirschfeld, Rafael, ed. 1998. Financial Cryptography: Second International Conference, Anguilla, British West Indies, February 23-25, 1998, proceedings. Lecture Notes in Computer Science, vol. 1465. Berlin and New York: Springer.
Feghhi, Jalal, Jalil Feghhi, and Peter Williams. 1999. Digital Certificates: Applied Internet Security. Reading, Mass.: Addison Wesley.
Ford, Warwick, and Michael S. Baum. 1997. Secure Electronic Commerce: Building the Infrastructure for Digital Signatures and Encryption. Upper Saddle River, N.J.: Prentice Hall.
Franklin, Matthew, ed. 1999. Financial Cryptography: Third International Conference, Anguilla, British West Indies, February 22-25, 1999, proceedings. Lecture Notes in Computer Science, vol. 1648. Berlin and New York: Springer.
Lessig, Lawrence. 1999. Code and Other Laws of Cyberspace. New York: Basic Books.
Lynch, Clifford. 2000. “Experiential Documents and the Technologies of Remembrance,” in I in the Sky: Visions of the Information Future, edited by Alison Scammell. London: Library Association Publishing.
Lynch, Clifford. 1999. Canonicalization: A Fundamental Tool to Facilitate Preservation and Management of Digital Information, D-Lib Magazine 5(9) (September). Available from http://www.dlib.org/dlib/september99/09lynch.html.
National Research Council. 2000. The Digital Dilemma: Intellectual Property in the Information Infrastructure. Washington, D.C.: National Academy Press.
Pfitzmann, Andreas, ed. 2000. Information Hiding: Third International Workshop, Dresden, Germany, September 29-October 1, 1999, proceedings. Lecture Notes in Computer Science, vol. 1768. Berlin and New York: Springer.
Rothenberg, Jeff. 1999. Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation. Washington, D.C.: Council on Library and Information Resources. Available from http://www.clir.org.
Rothenberg, Jeff. 1995. Ensuring the Longevity of Digital Documents. Scientific American 272(1):24-9.
Secure Digital Music Initiative. 2000. Available from http://www.sdmi.org.
World Wide Web Consortium/Internet Engineering Task Force on Digital Signatures for XML Documents. 1998. Digital Signature Initiative. Available from http://www.w3.org/DSig.
Zimmermann, Philip R. 1995. The Official PGP User’s Guide. Cambridge, Mass.: MIT Press.