The steady growth of digital information as a component of major research collections has significant implications for college and research libraries. Many institutions, including Cornell University Library (CUL), have been creating or collecting digital information produced in a wide variety of standard and proprietary formats, including ASCII, common image formats, word processing, spreadsheet, and database documents. Each of these formats continues to evolve, becoming more complex as revised software versions add new features or functionality. It is not uncommon for software enhancements to “orphan,” or leave unreadable, files generated by earlier versions. The threat to aging digital information has surpassed the danger of unstable media or obsolete hardware. The most pressing problems confronting managers of digital collections are data format and software obsolescence.
There is a tacit assumption that digital libraries will preserve the electronic information they create or the information that is entrusted to their care. To preserve this information, institutions must manage collections in a consistent and decisive manner. It is important to decide what should be preserved, in what priority, and with what techniques. Unfortunately, there is little guidance in this area. Leading organizations such as the National Archives and Records Administration have been cautious in adopting standards for document formats other than ASCII; specialized reports prepared by national committees have focused either on broad recommendations (Task Force on Archiving of Digital Information 1996) or on organizational and legal issues (Euhlir 1997). On the basis of its experience in managing electronic collections, the CUL chose to develop a method of “risk management” to replace “heroic rescue” as a means of preserving digital information. The concept of an information life cycle is emerging as a major theme in digital preservation, and as a model it provides some guidance on where risk-management efforts should be directed. In the abstract, a digital life cycle plans for the creation and stages of use of information and, ultimately, for whether the file will remain in a terminal, unchanging state or be transformed into another format for reuse. The choice of how or when to assess risk in the digital life cycle depends on circumstances, the state of the digital information, and the general preservation strategy adopted.
Currently, there are two radically different strategies for managing the later period of a digital life cycle: migration and emulation. Preserving Digital Information defines migration broadly, as “the periodic transfer of digital materials from one hardware/software configuration to another, or from one generation of computer technology to a subsequent generation” (Task Force on Archiving of Digital Information 1996). A more specific definition would indicate that migration changes the structure of the original data file. With the exception of files that are simple data streams, most files contain two basic components: structural elements and data elements. A file format represents the arrangement of the structural and data elements in a unique and specific manner. In this context, migration is the process of rearranging the original sequence of structural and data elements (the source format) to conform to another configuration (the target format).
In practice, migration is prone to generating obvious and subtle errors. An obvious error occurs when the set of structural elements in the source format does not fully match the structural elements of the target format. For instance, in a spreadsheet file a structural element defines a cell containing a numeric value. If a comparable element is missing from the format specifications of the target format, data will be lost. A subtle error might occur if the data themselves do not convert properly. Floating-point numbers (numbers with fractions) are found in many numeric files. Some formats might allow a floating-point number of 16 digits (e.g., 26.0012670099819070) while others might allow only 8 digits (e.g., 26.00126701). For some applications, such as vector calculations in geographic information system (GIS) programs, small but significant errors could creep into calculations. In other situations, migration might preserve the content of the file but lose the internal relationships or context of the information. For example, a spreadsheet file migrated to ASCII may save the current values of all the cells but lose any formulas embedded within the cells that are used to create those values.
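The subtle error described above can be made concrete with a minimal sketch. It assumes a hypothetical target format that keeps only eight fractional digits; the function name `store_with_places` and the sample value are illustrative and are not taken from any particular conversion program.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def store_with_places(value: str, places: int) -> Decimal:
    """Round a decimal string to the number of fractional digits a format can hold."""
    quantum = Decimal(1).scaleb(-places)  # places=8 gives Decimal('1E-8')
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN)


# Hypothetical source value with 16 fractional digits.
source_value = "26.0012670099819070"

# Value as it would be stored in the assumed 8-place target format.
migrated = store_with_places(source_value, 8)

# Absolute error introduced by the migration of this single cell.
error = abs(Decimal(source_value) - migrated)

print(migrated)
print(error)
```

For this sample the stored value becomes 26.00126701, and the difference from the source is on the order of 1e-11: negligible for a single cell, but the kind of discrepancy that can accumulate when many such values feed chained calculations.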
An alternative preservation approach, emulation, is concerned with preserving the original software environment. Emulators are programs that mimic computer hardware. Strategies adopting this approach store copies of the initial software and descriptions of how to emulate the initial hardware to run the software along with the data files (Rothenberg 1999; 1995). Emulation has been practiced for many years, and there are several commercial and public domain emulators for a variety of hardware/operating system configurations. A good example is MS-DOS emulation in the Windows 95/98/NT operating system.
Emulation as a strategy has some limitations. Emulation assumes future access to the following multiple data objects in a cluster or package:
- the data file to be preserved and reused,
- the application software that generated the data file,
- the operating system in which the application functioned, and
- the hardware environment emulated in software using detailed information about the attributes of that hardware.
This complex environment would most likely fail if one or more components were missing. Moreover, emulation is a patchwork effort, with contributions from commercial vendors and private individuals. There is no system for coordinating or maintaining these emulators, and maintaining obsolete emulators may prove to be as problematic as migrating obsolete file formats.
With two complex and very different strategies, it would be difficult to examine both options simultaneously. Our decision to select migration was partially based on the resources at our disposal. With locally developed and commercial off-the-shelf data migration software, migration could be tested, measured, and evaluated on the basis of certain common criteria from which we could design a suite of risk-assessment tools. File migration was also appealing because it could encompass the following different preservation scenarios:
- the routine refreshing of digital files;
- varying changes in digital formats when files are converted from one application to another;
- radical changes in digital formats, such as the conversion of numeric files from proprietary formats to ASCII; and
- the migration of derivative access copy systems; for instance, system software might convert Tagged Image File Format (TIFF), a master storage format for scanned images, into a Portable Document Format (PDF) derivative designed for easy use by the reader.
For the reasons described above, Cornell concentrated exclusively on developing aids to assess the safety of a migration strategy for its digital information.
We reviewed the literature for information concerning digital preservation, digital migration, risk assessment, and file formats.
Digital Preservation and Migration
An extensive survey of the library literature identified many papers that provided in-depth analyses of issues associated with different aspects of digital preservation. The Task Force on Archiving of Digital Information (1996) documents these issues most effectively, and they will not be repeated here. Most of the remaining literature discussed digital reformatting or file copying from one medium to another. We identified four papers that directly related to our project. The first is the work of John Bennett (1997). His study evaluates preservation requirements by genre, format, media, and platform and uses a rudimentary risk-assessment scoring system. Displayed in a two-dimensional matrix, these requirements effectively communicate the complexity and interdependence of digital materials. Haynes et al. (1997) reported on an in-depth investigation into the responsibilities associated with maintaining digital archives. This paper summarizes numerous interviews with focus groups and individuals and effectively communicates the range of opinions and expectations associated with different stakeholders. The third work is the Reference Model for an Open Archival Information System (OAIS) (CCSDS 1999). The report is remarkable for its breadth and depth. In the authors’ words, the model they describe “provides a framework for the understanding and increased awareness of archival concepts needed for long-term digital information preservation and access, and for describing and comparing architectures and operations of existing and future archives.” The last item is a report written by Ann Green, JoAnn Dionne, and Martin Dennis (1999). Their study describes a project at Yale to convert data from column binary to spread ASCII format. The nine-step data migration process is well documented, and the findings and recommendations clarify important preservation issues.
Our search of the library literature for information concerning risk assessment was not fruitful. We then examined the literature for computer science. In the last 50 years, computer science has witnessed numerous cycles of software development migration, and the literature contains many studies, case reports, and models. Several publications were very useful in developing our understanding of risk assessment of digital information. Rapid Development (McConnell 1996) is a monograph on the general problems associated with software development. In many respects, software development exhibits several of the same problems associated with basic digital preservation. Chapter 5 of McConnell’s book, which concerns risk management, provides an excellent theoretical and practical introduction to controlling risk in software development. It is a good primer for risk studies in digital preservation. Van Scoy (1992) examines a similar topic in a study funded by the U.S. Department of Defense. His study identifies risk-management participants and their activities. A later study (Sisti and Joseph 1994), also for the Department of Defense, expands on the work of Van Scoy and offers a highly detailed software risk evaluation method. All three studies pay particular attention to the organizational issues in risk management.
While researching risk assessments, we were struck by the vast differences in basic definitions used by different disciplines. (For example, see Reinert, Bartell, and Biddinger; Warren-Hicks and Moore; McNamee; Wilson and Crouch; Starr; and Lagadec.) Numerous professions measure risk, and each assigns risks a unique vocabulary and context. The degree and type of risk associated with any data archive may be understood differently by administrators, operational staff members, and data users, depending upon their individual training and experience. The measurement of risk was equally problematic. One paper correlated risk level with the nonlinear relative probability of risk occurring (Kansala 1997). Another publication introduced an algebraic formula (McConnell 1996). In a third instance, a research group felt that cases where one could accurately assess the probability of a future event were rare because the information technology environment for software changes so rapidly. They preferred simple estimates, such as high, medium, and low, which they believed facilitated decision making (Williams, Walker, and Dorofee 1997). Risk-measurement scales, like risk definitions, are as distinctive as their developers.
File format information was located from format specification files available on the Internet and from descriptions of file formats appearing in several monographs. Specifications for TIFF and .wk1 files were located at the following Internet sites:
- Adobe Corporation (http://www.adobe.com)
- Lotus Corporation (http://www.lotus.com)
- Unofficial TIFF Home Page (http://home.earthlink.net/~ritter/tiff/)
- Wotsit’s Format: the Programmer’s Resource (http://www.wotsit.org)
Murray and vanRyper (1996) describe TIFF with numerous illustrations and a detailed narrative about TIFF structure. Brown and Shepherd (1995) provide an effective description of the low-level data stream organization of the TIFF format. Lotus Development Corporation (1986) has prepared the definitive work for Lotus 1-2-3 .wk1 files. More than just a reference about file structure, the work explains why Lotus moved away from simple ASCII representation of spreadsheet data and documents its early attempts to use a general file format for worksheet, database, word processing, and graphics activities. The Lotus book is the best source for information about the .wk1 format. Related .wks format information, released into the public domain in 1984 and found at File Transfer Protocol (FTP) sites, or published by Walden (1986), should be used cautiously.
Risk Assessment as a Migration Analysis Method
In its present state, migration as a digital preservation strategy can be characterized as an uncertain process generating uncertain outcomes. One way to minimize the risk associated with such uncertainty is to develop a risk-management scheme that deconstructs the migration process into steps that can be described and quantified. A risk assessment is simply a means of structuring the process of analyzing risk. If the risk-assessment methodology is well specified, different individuals, supplied with the same information about a digital file, should estimate similar risk values.
We believe that three major categories of risk must be measured when considering migration as a digital strategy:
- Risks associated with the general collection. These risks include the presence or absence of institutional support, funding, system hardware and software, and the staff to manage the archive. These are essential components of a digital archive, which the Task Force on Archiving of Digital Information (1996) describes as “deep infrastructure.” The collection, and the stakeholders who use the collection, will be affected to some degree by a migration of data. Legal and policy issues associated with digital information will introduce additional risks.
- Risks associated with the data file format. These include the internal structural elements of the file that are subject to modification.
- Risks associated with a file format conversion process. The conversion software may or may not produce the intended result; conversion errors may be gross or subtle.
Analysis of these three categories can be illuminating. Table 1 presents information from the image file case study that illustrates the risks specific to image files in migration. The findings are based on research, discussions with digital preservation specialists, and our own experience.
| Category | Risk |
|---|---|
| (bit configuration, including bit stream, form, and structure) | Bits/bit streams are corrupted by software bugs or mishandling of storage media, mechanical failure of devices, etc. |
| | File format is accompanied by new compression that alters the bit configuration. |
| | File header information does not migrate or is partially or incorrectly migrated. |
| | Image quality (e.g., resolution, dynamic range, color spaces) is affected by alterations to the bit configuration. |
| | New file format specifications change byte order. |
| Security | Format migration affects watermark, digital stamp, or other cryptographic techniques for “fixity.” |
| Context and integrity (the relationship and interaction with other related files or other elements of the digital environment, including hardware/software dependencies) | Because of different hardware and software dependencies, reading and processing the new file format require a new configuration. |
| | Linkages to other files (e.g., metadata files, scripts, derivatives such as marked-up or text versions or on-the-fly conversion programs) are altered during migration. |
| | New file format reduces the file size (because of file format organization or new compression) and causes denser storage and potential directory-structuring problems if one tries to consolidate files to use extra storage space. |
| | Media become more dense, affecting labels and file structuring. (This might also be caused by file organization protocols of the new storage medium or operating system.) |
| (the ability to locate images definitively and reliably over time among other digital objects) | File extensions change because of file format upgrade and its effect on URLs. |
| | Migration activity is not well documented, causing provenance information to be incomplete or inaccurate (a potential problem for future migration activities). |
| Cost | Long-term costs associated with migration are unpredictable because each migration cycle may involve different procedures, depending on the nature of the migration (routine migration vs. paradigm shift). |
| | The value of the collection may be insufficiently determined, making it impossible to set priorities for migration. |
| | Costs may be unscalable unless there is a standard architecture (e.g., centralized storage, metadata standards, file format/compression standards) that encompasses the image collections so that the same migration strategy can be easily implemented for other similar collections. |
| Staffing | Staff turnover and lack of continuity in migration decisions can hurt long-term planning, especially if insufficient preservation metadata is captured and the migration path is not well documented. |
| | Decisions must be made whether to hire full-time, permanent staff or use temporary workers for rescue operations. |
| | Staff may have insufficient technical expertise. |
| | The unpredictability of migration cycles makes it difficult to plan for staffing requirements (e.g., skills, time, funding). |
| Functionality | Features introduced by the new file format may affect derivative creation, such as printing. |
| | If the master copy is also used for access, changes may cause decreased or increased functionality and require interface modifications (e.g., static vs. multiresolution image, inability of the Web to support the new format). |
| | Unique features that are not supported in other file formats may be lost (e.g., the progressive display functionality when Graphics Interchange Format [GIF] files are migrated to another format). |
| | The artifactual value (original use context) may be lost because of changes introduced during migration; as a result, the “experience” may not be preserved. |
| Legal | Copyright regulations may limit the use of new derivatives that can be created from the new format (e.g., the institution is allowed to provide images only at a certain resolution so as not to compete with the original). |
Table 1. Risks associated with file-format-based migration for image collections
As each risk category was explored, we recognized that we needed to develop different methods, or tools, to sample each situation and to help quantify risk probability and impact. Over the course of the project, we developed three assessment tools:
- A risk-assessment workbook for the general collection. The workbook provides a general review of risks associated with migration at the collection level.
- A software reader to examine specific files, or collections of files, for high-risk format elements.
- A test file in the .wk1 format, with known structural and data elements, to test, or exercise, conversion software.
Individually, these three tools provide useful information. Together, they offer a means to gauge the readiness of any archive to migrate information successfully from one format to another.
Risk Assessment of General Collections
In an ideal situation, risk assessments would be performed by a team of experts; each member would be a specialist in a specific area and would have general knowledge of digital preservation. In reality, access to expert advice is costly and not always timely. In place of a human adviser, a workbook can provide a systematic approach to assessing risks and problems. If the questions or exercises are sufficiently developed, the workbook can help the user not only identify potential risks but also measure risk in terms of impact.
When used as a common method of analysis, a workbook should identify and describe problems in a concise, uniform, and easily understood manner that could be shared by administrators and archivists in a given setting.
For the risk-assessment workbook developed in this study, we prepared two risk-assessment scales: one to measure the probability a hazard would occur, and another to measure the impact of that occurrence. These scales were prepared for a risk-assessment case study of a numeric file collection, the test bed for much of our project. Admittedly, the scales lack scientific precision, and at the end one does not simply sum the results and decide to migrate on the basis of a single number. On the other hand, assessment scales can more precisely convey meaningful assessments of risk, and this can help set priorities in preparing for a migration project (Beatty 1999).
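The idea of pairing a probability scale with an impact scale can be sketched as follows. This is a minimal illustration, not the workbook in Appendix A: the scale labels, the numeric values, the sample hazards, and the `Hazard` class are all hypothetical. Multiplying the two ratings is one common convention, used here only to rank hazards against one another, not to produce a single number on which to base a migration decision.

```python
from dataclasses import dataclass

# Hypothetical ordinal scales; the workbook's actual scales are in Appendix A.
PROBABILITY = {"low": 1, "medium": 2, "high": 3}
IMPACT = {"minor": 1, "moderate": 2, "severe": 3}


@dataclass
class Hazard:
    description: str
    probability: str  # key into PROBABILITY
    impact: str       # key into IMPACT

    @property
    def priority(self) -> int:
        """Relative ranking value: probability rating times impact rating."""
        return PROBABILITY[self.probability] * IMPACT[self.impact]


def rank(hazards):
    """Return hazards ordered from highest to lowest priority."""
    return sorted(hazards, key=lambda h: h.priority, reverse=True)


# Hypothetical hazards and ratings, for illustration only.
hazards = [
    Hazard("Staff lack technical expertise", "medium", "moderate"),
    Hazard("Formulas lost in conversion", "high", "severe"),
    Hazard("File extensions change, breaking URLs", "low", "moderate"),
]

for hazard in rank(hazards):
    print(f"{hazard.priority:>2}  {hazard.description}")
```

Keeping the ratings as labels rather than bare numbers reflects the point made above: the scales convey relative priority and lack scientific precision.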
The complete workbook is presented in Appendix A.
Risk Assessment of File Formats
As noted earlier, file migration is the process of altering structural and data elements in one file format to conform to a new configuration in another format. In our project, we label the original format the “source” format and the new format the “target” format. Software programs that convert source formats into target formats are grouped into three general categories:
- Translation programs for a specific project written by a company, by the owner of the information, or by a third-party vendor. Data archives often write these programs at considerable cost. The CUL experience with locally developed software is described in the TIFF image file case study.
- A commercial translation program written for a specific purpose. For example, some products extract data fields from numerous files with different formats and create a new data product with a different format. Programs such as DataJunction are written specifically for this purpose.
- A general-purpose commercial translation program. Conversions Plus by DataVis is a good example of this growing genre of software.
Each of these approaches to conversion has its benefits and liabilities. Many conversion programs developed by archives can incorporate extensive knowledge about the functions of the translation software, but require lengthy development cycles and are expensive to prepare. Off-the-shelf commercial programs provide little information about the translation process but offer many features at a low cost.
A format risk assessment has to explore two distinct areas of risk: the risk introduced by the conversion program and the magnitude of recurring risk inherent in a large collection. In addition, the features and usability of the conversion software should be considered as well as the impact on the metadata associated with the files.
Assessing Risk in Conversion Software
Assessing risk inherent in conversion programs can be accomplished by examining a file before and after migration. A test file can be passed through the conversion software, migrating from source to target format. If, following the format conversion, the fields and field values of the original source file are properly reproduced in the target file, the risks incurred in migration are significantly reduced. On the other hand, if the fields or their values are not properly converted, the risks of migration are significantly increased. If the field tags and values in the test file are known, data changes associated with file conversion can be independently verified.
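A before-and-after comparison of this kind might be sketched as below. The sketch assumes that the fields of the source and target files have already been read into dictionaries keyed by cell address; the function name `compare_fields` and the sample cells are hypothetical.

```python
def compare_fields(source: dict, target: dict) -> dict:
    """Classify each known source field as preserved, changed, or missing."""
    report = {"preserved": [], "changed": [], "missing": []}
    for name, expected in source.items():
        if name not in target:
            report["missing"].append(name)      # field dropped by the conversion
        elif target[name] == expected:
            report["preserved"].append(name)    # field and value reproduced
        else:
            report["changed"].append(name)      # field present, value altered
    return report


# Hypothetical test-file contents before conversion.
source_cells = {
    "A1": "26.0012670099819070",
    "B1": "@AVG(H293..DC293)",
    "C1": "Total",
}

# Hypothetical contents after conversion: A1 lost precision, B1 was dropped.
target_cells = {
    "A1": "26.00126701",
    "C1": "Total",
}

print(compare_fields(source_cells, target_cells))
```

Because the contents of the test file are known in advance, the three lists can be checked independently of the conversion software being evaluated.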
In the numeric file case study, a test file for the Lotus 1-2-3 .wk1 format was created. With the use of public domain specifications and reference manuals published with the original application software, a large file was generated that exercised all the field tags and field values. A simple conversion test might determine how well a conversion program preserves formulas by comparing the following known values with those generated by a formula:
Fig. 1. Sample test values for assessing conversion accuracy (Lotus 1-2-3 file)
In the example shown above, the “average” function (@AVG) operates on a range of cells (H293..DC293). The precomputed correct result (495) is compared with the computed result derived from the expression, and any differences between the two are recorded. In a similar manner, other complex formulas and functions can be compared before and after conversion.
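The formula check described above can be sketched in a few lines. The cell values here are hypothetical stand-ins chosen so that their average is 495; they are not the contents of the actual test file, and `formula_discrepancy` is an illustrative name.

```python
from statistics import mean


def formula_discrepancy(precomputed, cells):
    """Return the recomputed average minus the stored, precomputed result."""
    return mean(cells) - precomputed


# Hypothetical stand-in for the 100 cells in the range H293..DC293.
source_cells = list(range(0, 1000, 10))

# Simulated converted file in which the last cell of the range was dropped.
converted_cells = source_cells[:-1]

print(formula_discrepancy(495, source_cells))
print(formula_discrepancy(495, converted_cells))
```

In this sketch the intact range reproduces the precomputed value exactly, while dropping a single cell shifts the recomputed average from 495 to 490, a difference that would be recorded as a conversion error.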
It took us about three hours to compare our test files manually before and after conversion. Although this method is somewhat laborious, it is quite accurate for the formats we tested. Conversion of different structural elements and data elements is not always a matter of “hit or miss.” We were able to identify conversions that were almost, but not quite perfect. Testing these problematic conversions, we were able to develop a rough scale of conversion risk (1=minor risk, 5=high risk). Documentation for the test file can be found in Appendix B.
Assessing Recurring Risk Inherent in a Large Heterogeneous File Collection
Manual identification of risk associated with file structures is possible for a small number of files. For large digital collections that have thousands or millions of files that may contain one or more of these at-risk elements, manual methods are expensive and inefficient. One way to measure the collection for files that contain at-risk elements would be to prepare a file reader programmed to examine each file for these items. If one or more risk items are found, the program could be written to produce a report that identifies the file, its location in the collection, and the type and number of at-risk elements associated with that file. Good design would make the program flexible enough to read most, if not all, files with defined structural elements.
A program was developed for the project that can read structured ASCII and binary files. Named Examiner, the program reads a file and detects the presence and frequency of specific file format elements. It does not read or evaluate the data value, although this feature could be implemented. The following example shows a few lines from a report generated during a scan of .wk1 files in the USDA Economics and Statistics System, hosted at Mann Library.
/usda/ftp/usda/data-sets/crops/94018/budget.wk1: Risk Level 5
Tag 14: NUMBER: Floating point number Qty: 584
/usda/ftp/usda/data-sets/crops/94018/charactr.wk1: Risk Level 5
There are no tags in this file at this level
/usda/ftp/usda/data-sets/crops/94018/conf_int.wk1: Risk Level 5
Tag 14: NUMBER: Floating point number Qty: 59
In the output just listed, Examiner has examined a series of .wk1 files in a single subdirectory with the absolute path /usda/ftp/usda/data-sets/crops/94018. In two of the three files, it located a structural element, or Tag. The program writes to a report file the structural element number (14), the name of the structural element given in the format specifications (NUMBER:), a short description of the structural element (Floating-point number), and the total count of floating-point numbers discovered in that specific file (Qty:). The program also describes the risk level for the structural element. The risk level was determined during the initial source-target analysis described previously. The program can be set to report at-risk tags only if the risk value equals or exceeds a certain threshold.
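The kind of scan described above can be illustrated with a short sketch. This is not the Examiner source code (Examiner is written in Java). It assumes the record layout commonly described for .wk1 files: a 2-byte little-endian record type, a 2-byte length, then the record body, with type 0 opening the file and type 1 closing it. The risk table, the function names, and the synthetic byte stream are hypothetical; only the Tag 14 entry mirrors the report excerpt.

```python
import struct
from collections import Counter

# Hypothetical risk table: record type -> (name, description, risk level 1-5).
RISK_TABLE = {14: ("NUMBER", "Floating point number", 5)}


def count_tags(data: bytes) -> Counter:
    """Count record types in a stream of (type, length, body) records.

    Only record headers are read; data values are never interpreted.
    """
    counts = Counter()
    offset = 0
    while offset + 4 <= len(data):
        tag, length = struct.unpack_from("<HH", data, offset)
        counts[tag] += 1
        offset += 4 + length  # skip the body without evaluating it
        if tag == 1:          # assumed end-of-file record
            break
    return counts


def report(data: bytes, threshold: int = 5) -> list:
    """List tags whose risk level equals or exceeds the threshold."""
    lines = []
    for tag, qty in sorted(count_tags(data).items()):
        name, description, risk = RISK_TABLE.get(tag, ("UNKNOWN", "", 0))
        if risk >= threshold:
            lines.append(f"Tag {tag}: {name}: {description} Qty: {qty}")
    return lines


# Synthetic stream: one opening record, two NUMBER records, one closing record.
sample = (
    struct.pack("<HHH", 0, 2, 0x0406)
    + struct.pack("<HHBHHd", 14, 13, 0xFF, 0, 0, 26.0)
    + struct.pack("<HHBHHd", 14, 13, 0xFF, 1, 0, 495.0)
    + struct.pack("<HH", 1, 0)
)

print(report(sample))
```

Like the program it imitates, the sketch only reads: it walks the stream from beginning to end, skips each record body, and reports what it found.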
One strong feature of the Examiner program is that it is nondestructive. It simply reads a file from beginning to end and declares what is found. Also, Examiner can be set to read a single file, all the files in a directory, or all the files on a drive. The program is reasonably efficient and scans approximately 10,000 .wk1 files per hour. Finally, Examiner is written in Java, a modern programming language designed to be easily compiled on different operating systems. The program has been fully tested in the Unix and Windows 95/NT environments. General documentation for Examiner is described in Appendix C. The source code and full documentation are available on the Web site of the Council on Library and Information Resources.
Assessing Risk Associated with the File Conversion Process
Finally, there are risks associated with the features of different conversion software. The project examined two commercial off-the-shelf programs and quickly scanned the advertisements or published reviews of six others. In any mix of conversion programs available, each will provide some or all “core” functions as well as optional features. General performance benchmarks, which can be tailored for specific migration scenarios, provide some uniformity of measurement and highlight obvious defects. For example, we examined DataJunction as a general-purpose conversion program for spreadsheet and database formats. Conversion of .wk1 formats was trouble-free, except for one major flaw: DataJunction was difficult to program to work in batch mode. We did not recognize this flaw until the evaluation was nearly complete. Obviously, a project timetable could be seriously jeopardized by such a limitation. Although not an intended product of the project, we recorded software assessment questions that we should have asked at the start of the project. From these, we developed a short functionality assessment that is now available on the Web site of the Council on Library and Information Resources.
Identification of Metadata-Related Risk
We frequently think of disk files as the sole object of migration because, at first glance, the information they contain is what we have to move from one format to another. The individual files in a collection, however, are frequently useless without other information describing how the files are to be used or how they relate to one another. In other words, any group of files that constitute a cohesive unit can be considered a digital object, and what makes the digital object intelligible is metadata describing the contents and providing structure for the group. When such digital objects exist, the metadata, as well as the individual files containing the raw data, must be successfully migrated.
Metadata at the digital-object level can take various forms. For example, in the collection of TIFF images in one of our case studies, a file in a proprietary format, Raster Document Object (RDO), contains metadata that provides structure to the multiple TIFF files. The RDO file relates the page image stored in each TIFF file to the others that compose the document; in this case, the navigable and searchable digital object represents a paper document containing pages and chapters and other logical constructs. A second example, from our case study of a collection of numeric files in the .wk1 format, shows another way of structuring and describing digital objects. Each digital object (a set of related binary data files) has three metadata components: one that contains information about the structure of the object, one that describes the content of the object, and one that creates a link between the two. The structural metadata is contained in an HTML file whose links point to the individual files that constitute the digital object. The content metadata is in an English-language ASCII file. Its purpose is to provide searchable text so that the object can be located in a search across the larger collection of objects. The third component is a record in a database that creates a relationship between the content file and the structural file. In a successful migration to another data format, the structural metadata in the HTML file would have to be changed if the name or location of the individual files in the digital object were changed. The content description and the database record would not have to be touched.
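The change to the structural metadata can be sketched as a link-rewriting step. This is a minimal illustration under stated assumptions: the HTML fragment, the file names, and the function `update_links` are hypothetical, and the pattern handles only lowercase, quoted `href` attributes. A production tool would more likely use a full HTML parser.

```python
import re

# Matches href="..." or href='...' and captures the quote and the link target.
HREF_PATTERN = re.compile(r"""href=(["'])(.*?)\1""")


def update_links(html: str, renamed: dict) -> str:
    """Rewrite href values that point at renamed files; leave other links alone."""
    def swap(match):
        quote, target = match.group(1), match.group(2)
        return f"href={quote}{renamed.get(target, target)}{quote}"
    return HREF_PATTERN.sub(swap, html)


# Hypothetical structural metadata for one digital object.
structure_html = (
    '<a href="budget.wk1">Budget</a> '
    '<a href="readme.txt">Notes</a>'
)

# Hypothetical mapping from source file names to migrated file names.
renamed_files = {"budget.wk1": "budget.csv"}

print(update_links(structure_html, renamed_files))
```

Driving the rewrite from an explicit mapping of old names to new names also leaves a record of what was changed, which can be kept with the documentation of the migration.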
The risk-assessment tools developed were tested on two digital collections at the Cornell University Library: the Ezra Cornell Papers and the USDA Economics and Statistics System. Each collection contains a dominant file format: TIFF or .wk1. The assessments of these two collections are presented in Appendixes D and E.
Findings and Recommendations
Migration Risk Can Be Quantified
Migration, or the conversion of data from one format to another, has measurable risk. The amount of risk will vary, sometimes significantly, given the context of the migration project. One form of risk depends on the nature of the source and target formats. We have shown that it is possible to compare formats in a number of ways and to identify the level of risk for different format attributes. The format analysis techniques and software may be technical, but the results can be described in general terms. Since basic file structure concepts are common to many file formats, experience with one format can be used to understand other formats.
We draw a similar conclusion concerning organizational, hardware, software, and metadata risks. Information delivery systems must sustain a certain level of organization simply to function. Consistent components of these systems (for example, personnel, funding, and metadata) can be evaluated, and rough but quantifiable measures of risk can be established for them.
The greatest challenge is interpreting the risk, that is, determining when a risk is acceptable. Risk-assessment tools cannot replace experience and good judgment. The tools can be compared with navigation aids used on the high seas. Following five centuries of intensive effort to develop risk-reducing technologies, ships’ helms are still manned, and collisions between ships at sea still occur.
In this study, we provide examples to illustrate the evaluation process. In practice, the risk-assessment tools are not fully developed. We recommend the further refinement of these tools to provide results that are more reliable. We must recognize, however, that this will take some time, during which we will lose some data.
This study is unable to recommend a cost-effective, off-the-shelf commercial software program to implement a migration strategy. From our analysis, we believe that migration software should perform the following functions:
- Read the source file and analyze the differences between it and the target format.
- Identify and report the degree of risk if a mismatch occurs.
- Accurately convert the source file(s) to target specifications.
- Work on single files and large collections.
- Provide a record of its conversions for inclusion in the migration project documentation.
Neither of the two programs analyzed in this case study met all these criteria, although our results suggest that commercial conversion programs, with further development, have the potential to meet them. Considering the cost of writing conversion software for a wide range of file formats, we believe a commercially developed solution for migration software will ultimately be cheaper and more flexible than locally developed conversion software. We recommend further work with vendors, such as DataJunction and DataViz, to educate them about our needs and help them develop products that promote safer file migration.
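The five functions listed above can be sketched in outline. This is a simplified illustration and not a description of either program analyzed: it assumes that each format can be summarized as a set of named features, and the class name, function names, feature names, and risk thresholds are all hypothetical.

```python
from dataclasses import dataclass

RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


@dataclass
class MigrationReport:
    """One entry in the migration project documentation."""
    source_name: str
    unsupported: list
    risk: str
    converted: bool = False


def assess(source_features, target_features):
    """Compare a source file with the target format.

    Returns the features the target cannot represent and a coarse risk
    level. The thresholds are arbitrary placeholders.
    """
    unsupported = sorted(set(source_features) - set(target_features))
    if not unsupported:
        risk = "low"
    elif len(unsupported) <= 2:
        risk = "medium"
    else:
        risk = "high"
    return unsupported, risk


def migrate_collection(files, target_features, convert, max_risk="medium"):
    """Assess every file, convert those within max_risk, and log the result.

    files maps a file name to the features found in that file; convert is
    a caller-supplied function that performs the actual conversion.
    """
    log = []
    for name, features in files.items():
        unsupported, risk = assess(features, target_features)
        report = MigrationReport(name, unsupported, risk)
        if RISK_ORDER[risk] <= RISK_ORDER[max_risk]:
            convert(name)
            report.converted = True
        log.append(report)
    return log
```

The same loop handles a single file or a large collection, and the returned list of reports is the record that would be kept with the migration project documentation. Files above the chosen risk threshold are reported but not converted, which leaves the decision to a person.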
Access to Format Data
The most difficult aspect of this project was the acquisition of complete and reliable file format specifications. Throughout the project, format-specific information was difficult to acquire from a single source. Ultimately, format information for this study was acquired from general sources that included the following:
- software developers
- public FTP archives
- Internet discussion lists
Developers of software applications who use a specific proprietary file format should be the best source for file format information. At the start of our search for Lotus .wk1 format information, this was not the case. Lotus, like other large software companies, treats file format information as a business product to sell to software developers. Lotus business products evolved, responding to revisions in 1-2-3 as well as to changes in the DOS/Windows operating system. With the introduction of Windows 3.1, developer interest in earlier DOS specifications disappeared. Since the specifications for the .wk1 format were integrated into the format specifications for later releases (i.e., .wk3, .wk4), the specifications and documentation for the earlier .wk1 format quietly disappeared. Lotus as a company also evolved, and key members of the early development staff (often the corporate memory in software companies) moved on to establish their own companies. In the last months of this project, we were able to contact an individual at Lotus who had been with the company since the mid-1980s. This individual helped us acquire a copy of Lotus File Formats for 1-2-3, Symphony, and Jazz. This work, authored by Lotus, is the only surviving documentation from the company for that period. Fortunately, it describes the .wk1 format in complete detail.
Throughout the year, Lotus staff repeatedly referred us to their FTP archive that contains 1-2-3 .wk1 format specifications. These specifications were indirectly certified by Walden (1986), who describes the specification in detail and provides a sample .wk1 file analyzed byte by byte. Unfortunately, these specifications are incomplete and describe the .wks file format, the format of 1-2-3 release 1A. We were surprised that Walden made such an oversight, but Wotsit’s Format Web site (Oliver 1999) and the comp.apps.spreadsheets FAQ (1999) repeat the error. It is clear that neither the professionals nor the amateurs recognized the mistake.
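A confusion between .wks and .wk1 can in principle be detected from the first bytes of a file. The sketch below assumes the record layout given in commonly circulated descriptions of the Lotus worksheet formats: each record starts with a two-byte opcode and a two-byte length, both little-endian, and the first record is a beginning-of-file (BOF) record whose two-byte body holds a version code. The version codes in the table are assumptions taken from those descriptions and should be verified against the Lotus documentation before being relied on; the function name is hypothetical.

```python
import struct

# Assumed BOF version codes; verify against the Lotus documentation.
BOF_VERSIONS = {
    0x0404: "1-2-3 release 1A (.wks)",
    0x0406: "1-2-3 release 2 (.wk1)",
}


def identify_worksheet(data):
    """Describe the worksheet format indicated by the leading BOF record."""
    if len(data) < 6:
        return "too short to contain a BOF record"
    # Opcode, body length, and version code, each a little-endian 16-bit value.
    opcode, length, version = struct.unpack_from("<HHH", data, 0)
    if opcode != 0x0000 or length != 2:
        return "no BOF record at start of file"
    return BOF_VERSIONS.get(version, f"unrecognized version code 0x{version:04x}")
```

Passing the first six bytes of each file in a collection to identify_worksheet would give a quick inventory of which release wrote each file, independent of the file name extension.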
TIFF specifications are accessible from two Internet locations. The official specifications for TIFF 6.0 are available from the Adobe developers’ support site. Adobe’s site does not list the specifications for TIFF 4.0 and 5.0. These can be located at the Unofficial TIFF Home Page. Our manual examination of the specifications showed them to be consistent with each other, but they are incomplete. For years, developers have been adding their own proprietary tags, which they register with Adobe, to the TIFF specification. These special tags do not appear in either the official or unofficial specifications. Several books describe the TIFF file format specifications as part of a survey of many file formats. However, no single work presents a clear, comprehensive description of the TIFF file format specification or of information about proprietary tags.
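Although the meaning of a proprietary tag may be undocumented, its presence in a file can be detected. The following minimal sketch reads the tag numbers from the first image file directory (IFD) of a TIFF file and reports those at or above 32768, the range that the TIFF 6.0 specification sets aside for private tags. It examines only the first IFD, does not interpret tag values, and uses hypothetical function names.

```python
import struct

# TIFF 6.0 reserves tag numbers from 32768 upward for private use.
PRIVATE_TAG_START = 32768


def list_tags(data):
    """Return the tag numbers found in the first IFD of a TIFF byte string."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF: unknown byte-order mark")
    # The header holds the magic number 42 and the offset of the first IFD.
    magic, ifd_offset = struct.unpack_from(order + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a TIFF: bad magic number")
    (count,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = []
    for index in range(count):
        # Each IFD entry is 12 bytes; the tag number is the first 2 bytes.
        entry_offset = ifd_offset + 2 + 12 * index
        (tag,) = struct.unpack_from(order + "H", data, entry_offset)
        tags.append(tag)
    return tags


def private_tags(data):
    """Return the tags in the first IFD that fall in the private range."""
    return [tag for tag in list_tags(data) if tag >= PRIVATE_TAG_START]
```

Run across a collection, a check of this kind would show which files carry private tags and would therefore need closer inspection before migration.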
We expect these difficulties to be repeated when other formats are explored. Conceptually, the solution is to adopt “open” format specifications, where complete, authoritative specifications are available for anyone to access and analyze. Our experience with TIFF and .wk1 suggests that with file formats, there are two specifications at work. One is the public document, which describes the basic or core elements of any format. The other is a private, nonstandard set of file elements, usually developed to extend the functionality of a file format. These private file elements provide the competitive edge for third-party software and rarely are openly circulated. Over time, new format elements are often integrated into format revisions. For example, TIFF grew from 37 tags in version 4 to 74 tags in version 6.0. New proprietary tags for TIFF version 6.0 are registered with Adobe, which does not make them public. It is uncertain whether all or some of these difficult-to-identify tags will be integrated into the anticipated TIFF version 7.0. We endorse the concept of open specifications and recommend that more thought be directed at coordinating access to both the relatively static, public domain specifications and the dynamic, nonstandard elements.
Public Access Archives of Format Information
If we measured the risk associated with public domain archives on the Internet, we would assess all these sites as high-risk operations. Sites such as Wotsit’s represent the public service efforts of individuals. They lack any vision or plan to sustain the information. This limitation, combined with the unreliable nature of the information contained within these sites, makes it unlikely that these sites will contribute meaningfully to digital preservation efforts. There is a pressing need to establish reliable, sustained repositories of file format specifications, documentation, and related software. We recommend the establishment of such depositories for format-specific materials related to migration as a preservation strategy. The same need applies to emulation programs and their documentation.
References
Beatty, J. Kelly. 1999. The Torino Scale: Gauging the Impact Threat. Sky & Telescope 98(4):32-3.
Bennett, John C. 1997. A Framework of Data Types and Formats, and Issues Affecting the Long Term Preservation of Digital Material. British Library Research and Innovation Report, No. 50. West Yorkshire, U.K.: British Library Research and Innovation Centre. Available from http://www.ukoln.ac.uk/services/elib/papers/supporting/#blric.
Brown, C. Wayne, and Barry J. Shepherd. 1995. Graphic File Formats. Greenwich, Conn.: Manning Press.
comp.apps.spreadsheets. 1999. comp.apps.spreadsheets FAQ. Available from http://www.faqs.org/faqs//spreadsheets/faq.
Consultative Committee for Space Data Systems. 1999. Reference Model for an Open Archival Information System, Red Book, Issue 1 (CCSDS 650.0-R-1). Available from http://wwwdev.ccsds.org/.
Euhlir, Paul. 1997. Framework for the Preservation of and Public Access to USDA Digital Publications. Available from http://preserve.nal.usda.gov:8300/npp/frameprt.html.
Green, Ann, JoAnn Dionne, and Martin Dennis. 1999. Preserving the Whole: A Two-Track Approach to Rescuing Social Science Data and Metadata. Washington, D.C.: Digital Library Federation. Available from http://www.clir.org/pubs/reports/pub83/contents.html.
Haynes, David, et al. 1997. Responsibility for Digital Archiving and Long Term Access to Digital Data. JISC/NPO Studies on the Preservation of Electronic Materials. British Library Research and Innovation Report, No. 67. West Yorkshire, U.K.: British Library Research and Innovation Centre. Available from http://www.ukoln.ac.uk/services/elib/papers/supporting/#blric.
Kansala, Kari. 1997. Integrating Risk Assessment with Cost Estimation. IEEE Software (May/June):61-7.
Lagadec, Patrick. 1982. Major Technological Risk: An Assessment of Industrial Disaster. Oxford, U.K.: Pergamon Press.
Lotus Development Corporation. 1986. Lotus File Formats for 1-2-3, Symphony and Jazz: File Structure Descriptions for Developers. Cambridge, Mass.: Lotus Books, and Reading, Mass.:
McConnell, Steve. 1996. Rapid Development: Taming Wild Software Schedules. Redmond, Wash.: Microsoft Press.
McNamee, David. 1996. Assessing Risk Assessment. Available from http://www.mc2consulting.com/riskart2.htm.
Murray, James D., and William vanRyper. 1996. Encyclopedia of Graphics File Formats, second edition. Cambridge, Mass.: O’Reilly & Associates, Inc.
Oliver, Paul. 1999. Wotsit’s Format: the Programmer’s Resource. Available from http://www.wotsit.org/.
Reinert, Kevin H., Steven M. Bartell, and Gregory R. Biddinger, eds. 1994. Ecological Risk Assessment Decision-support System: A Conceptual Design. Pensacola, Fla.: SETAC Press.
Rothenberg, Jeff. 1999. Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation. Washington, D.C.: Council on Library and Information Resources. Available from http://www.clir.org/pubs/reports/rothenberg/contents.html.
Rothenberg, Jeff. 1995. Ensuring the Longevity of Digital Documents. Scientific American 272(1):42-7.
Sisti, Frank J., and Sujoe Joseph. 1994. Software Risk Evaluation Method. Version 1.0. Technical Report CMU/SEI-94-TR-19/ESC-TR-94-019. Pittsburgh, Penn.: Software Engineering Institute, Carnegie Mellon University.
Starr, Chauncey. 1969. Social Benefits versus Technological Risk: What is Our Society Willing to Pay for Safety? Science 165:1232-8.
Task Force on Archiving of Digital Information. 1996. Preserving Digital Information. Report to the Commission on Preservation and Access and the Research Libraries Group. Washington, D.C.: Commission on Preservation and Access. Available from http://www.rlg.org/ArchTF/.
Van Scoy, Roger L. 1992. Software Development Risk: Opportunity, Not Problem. Technical Report CMU/SEI-92-TR-30/ESC-TR-93-030. Pittsburgh, Penn.: Software Engineering Institute, Carnegie Mellon University. Available from http://www.sei.cmu.edu/publications/documents/92.reports/92.tr.030.html.
Warren-Hicks, William J., and Dwayne R. J. Moore. 1995. Uncertainty Analysis in Ecological Risk Assessment. Pensacola, Fla.: SETAC Press.
Walden, Jeff. 1986. File Formats for Popular PC Software: A Programmer’s Reference. New York, N.Y.: John Wiley and Sons, Inc.
Williams, Ray C., Julie A. Walker, and Audrey J. Dorofee. 1997. Putting Risk Management into Practice. IEEE Software (May/June):75-82.
Wilson, Richard, and E. A. C. Crouch. 1987. Risk Assessment and Comparisons: An Introduction. Science 236:267-70.
Web sites noted in report:
Adobe developers’ support site: http://partners.adobe.com/asn/developer/technotes.html.
Council on Library and Information Resources: www.clir.org.
The Unofficial TIFF Home Page: http://home.earthlink.net/~ritter/tiff/.