Archiving Computer-based Artworks

Jonathan Farbowitz
The Electronic Media Review, Volume Five: 2017-2018

ABSTRACT

Art museums throughout the world have been acquiring computer-based artworks with increasing confidence. As artist-created hardware and software enter museum collections, they present unique challenges for long-term preservation. Conservation staff at these institutions face urgent questions about which elements are required for acquisition and preservation and how to define the technical, functional, and conceptual constituents of these artworks.

Through the lens of the Guggenheim’s initiative on Conserving Computer-based Art in its collection, this article takes a critical look at the physical and digital elements that museums acquire or generate in order to archive and preserve their computer-based artworks. Drawing from the Conserving Computer-based Art collection survey and back-up project, which encompasses artworks dating from 1989 to the present that employ a variety of technologies, the article provides an overview of collected digital assets and documentation and proposes elements that museums should consider obtaining or creating in order to sustain the life of software- and computer-based artworks. The article makes recommendations for information to capture about the hardware and software dependencies and discusses why this documentation is important for future migration, recreation, or replacement of the equipment. Methods for collecting this information through examining the hardware itself and by using software built into the Windows, Mac OS, and Linux operating systems are described.

Disk imaging is introduced as a method of extracting data from artist-provided computers and physical information carriers (such as floppy disks and optical discs), as these carriers are not suitable for long-term data storage. The Guggenheim’s disk imaging workflow is described in detail, including several methods for quality control of disk images after they are created. The article concludes with recommended deliverables for the acquisition of computer- and software-based artworks.

INTRODUCTION

Artworks in which software and computer systems are employed as an artistic medium can take many forms, including artist-created websites, custom computer programs, or complex installations that involve microcontroller units. New acquisitions of computer-based artworks (also known as software-based artworks) often include artist-created or modified hardware and software. As a result, museum staff are faced with concerns about how to archive these collection items, including taking preventive measures to ensure the long-term accessibility of data and documenting each component of the artwork for future conservators. Further complicating matters, each artwork may use vastly different technologies and present unique requirements for long-term preservation.

The Solomon R. Guggenheim Museum currently has 25 computer-based artworks in its permanent collection, dating from 1989 to 2018. The technologies represented include Flash animations, virtual reality experiences, Internet-connected installations, custom-coded robotics, and artist-modified computers. In order to research and develop best preservation practices for these works, the Guggenheim launched the Conserving Computer-based Art (CCBA) initiative in 2014 (Dover 2016).

The goals of the initiative are to survey and archive the computer-based artworks in the collection; to derive improved methods for collection care and new acquisitions; and to conduct in-depth case studies of selected artworks, including source code analysis, documentation, and treatment. The case study work of the CCBA is conducted as a cross-disciplinary collaboration between the Guggenheim Conservation department and the Department of Computer Science at New York University (NYU). Research partners at NYU have annotated source code and created prototypes of code migration for artworks needing treatment.

This article reflects the research and findings of the first project phase of the CCBA initiative, the survey and backup of the collection, and explores measures that can be taken to archive computer- and software-based artworks beginning with their physical components (such as computers and microcontroller units) and information carriers (such as hard drives, floppy disks, and optical discs). The discussion then focuses on saving the digital assets stored on these computers or information carriers. In certain cases, museum staff may need to create additional digital assets for preservation purposes. A significant portion of the article examines disk imaging as one strategy for archiving the content of computers and information carriers and describes the Guggenheim’s workflow for disk imaging.

There are currently no widely agreed-upon strategies for archiving the components of computer- and software-based artworks. This article aims to contribute to the ongoing research and development of such strategies. As conserving computer-based artworks is a rapidly evolving field, these are preliminary findings. The article concludes with a set of preliminary recommendations for components to collect when acquiring computer- and software-based artworks.

Approaches to archiving software- and computer-based works vary from artwork to artwork and call for flexible and multipronged preservation strategies. With computer-based artwork, collecting the components and information for preservation as soon as possible is essential, as they may become much harder or impossible to collect in the future. As Serexhe and others have observed, “often it is only when the works are to be presented again that data loss and incompatibility with current hardware and software are discovered” (Serexhe 2013, 83). Exhibition or acquisition may constitute the only opportunities for the conservator to observe or document the artwork running according to the artist’s specifications. Thus, creating documentation of the artwork through video, audio, still images, or other means will prove critically important as reference material later.

Archiving the components of a computer-based artwork means first analyzing all hardware and software involved. As with any time-based media work, this is a critical step to “identify the meaning and work-defining properties of the artwork’s constituent parts, as well as the work’s aesthetic or conceptual dependence on certain devices or technologies” (Phillips n.d.). This analysis enables the conservator to ask the right questions concerning future preservation of the work, though the answers are often artwork dependent.

Any archiving strategies must address this question: “What ‘resources’ are necessary to sustain the artwork?” In other words, what files, hardware, reference materials, and information will be needed to effectively steward this artwork into the future? These considerations involve not just creating a copy of the appropriate data but also considering what information may be needed for future treatments of the artwork, such as migrating code to another programming language or running the artwork within an emulator or virtual machine (software that mimics a different computing environment, usually an older/obsolete computer or operating system).

PHYSICAL COMPONENTS

Through its acquisitions of computer- and software-based artworks, the Guggenheim has received a variety of computers and microcontroller units from artists. This artist-provided equipment typically contains the software that runs the artwork.

The status of physical equipment in a collection depends on its significance for the artwork (Laurenson 2004; Phillips 2012). Artists often provide their own equipment to run these artworks for a variety of reasons; some works employ unique, dedicated equipment. For his work entitled Color Panel (1999), John F. Simon Jr. (b. 1963) custom-modified an Apple Powerbook 280c laptop (fig. 1). In other cases of dedicated equipment, an artist may use unmodified computer hardware, though its appearance may be integral to the artwork.

Fig. 1. Image of John F. Simon Jr.’s Color Panel (1999), an artwork that employs an artist-modified Apple Powerbook 280c laptop. (Courtesy of the ©Guggenheim Museum)

In other cases, the hardware may not be dedicated, but the artist may still provide it. The artwork may require specific hardware and software properties and settings; the artist may decide to independently source the equipment for convenience or the artist may provide off-the-shelf hardware to a museum in order to simplify installation (so that the piece can simply be plugged in and turned on).

Computers provided by artists are often consumer-grade models (such as the two HP Pavilion 8140 computers included with an artwork acquired in 2001) and were never intended by the manufacturer to last for more than a few years. While this equipment is functioning and still available to examine, gathering information about the hardware and software configuration of these computers, as well as any connected devices or peripherals, will prove useful to sustain the artwork. In the future, it may be necessary or desirable to replace nondedicated hardware with equipment that shares the same or a similar configuration.

An artwork may also be run independently of the original hardware using a disk image running in an emulator or virtual machine. Gathering information about the hardware, software, and connected devices gives future conservators a reference for choosing the proper settings. For example, in an artwork that uses a serial port to send signals to a projector, this type of connection must be set up in the emulator or virtual machine for the artwork to function properly. In addition, the specific hardware and software configurations used for the work may have aesthetic or conceptual significance or reveal the artist’s working methods, which may affect future art historical literature related to the artist or artwork.

The specifications of a computer can have a direct impact on the aesthetic properties and behavior of an artwork. For example, the clock speed of the central processing unit (CPU) can determine the speed at which an artwork runs. After NYU Computer Science student Xincheng Huang analyzed the source code of Color Panel (1999), he discovered that the artist did not specify a frame rate in the code. Therefore, the speed at which Simon’s program animates its changing color patterns depends on the clock speed of the CPU running the artwork.
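The dependence on clock speed can be illustrated with a minimal sketch. This is not Simon’s code; the function names, the palette size, and the rate of 30 frames per second are assumptions chosen only for illustration. The first function advances a color pattern once per loop iteration, so its state depends on how many iterations the processor has completed. The second derives the state from elapsed time, so it behaves the same on any machine.

```python
def color_index_per_iteration(iterations_done, palette_size):
    """No frame rate specified: the pattern advances once per loop iteration,
    so a faster CPU moves through the palette faster."""
    return iterations_done % palette_size


def color_index_per_time(elapsed_seconds, palette_size, fps=30):
    """Frame rate specified: the pattern advances `fps` times per second of
    elapsed time, regardless of how fast the CPU runs the loop."""
    return int(elapsed_seconds * fps) % palette_size


# Two hypothetical machines, observed two seconds after the program starts.
slow_cpu_iterations = 2 * 1_000      # assumed 1,000 loop iterations per second
fast_cpu_iterations = 2 * 50_000     # assumed 50,000 loop iterations per second

slow_state = color_index_per_iteration(slow_cpu_iterations, 256)
fast_state = color_index_per_iteration(fast_cpu_iterations, 256)
timed_state = color_index_per_time(2.0, 256)
```

In this sketch the two machines reach different points in the pattern (208 and 160) when no frame rate is set, while the time-based version reaches the same point (60) on both. This is why the original CPU speed is worth recording before the hardware is replaced or emulated.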

While the artwork is still functioning, reference video, screen captures, or still images can be created to illustrate the work’s behaviors. This documentation of the native display environment gives future conservators another point of reference to guide any future migration or re-creation of the artwork. For example, video documentation of the original work running can be compared to the work running under emulation or virtualization.

There is an incredible breadth of technical information that can be generated about the hardware and software of a single computer, but it is unclear at this time exactly which information is critical to obtain if the museum chooses to run the artwork on another computer or in an emulator or virtual machine. A list of minimum information to gather about computer equipment and its configuration is still very much in development. The following findings are preliminary and deserve further research and discussion but may provide some guidance.

By examining the manufacturer’s box or the computer’s external case, one can record the serial number and model number of the machine. The serial number can provide a unique identifier for the computer. It may also indicate when the machine was manufactured. The model number can also be cross-referenced to a specific configuration. Without switching the computer on, one can document the available ports on the computer and any PCI/PCIe cards installed. These ports or cards may be used to attach peripherals that the artwork uses, such as video monitors, audio interfaces, or additional hard drives. This information can be captured through photos. By opening up the computer’s case, it may be possible to determine the model of the CPU, the amount of random access memory (RAM), and the model of graphics processing unit (GPU or graphics card) used in the computer.

If booting up the artwork’s computer is possible, much more detailed information can be obtained. Computers running the Windows, Linux, and Mac OS operating systems have built-in software to export extremely detailed information about their hardware and software configurations. In Mac OS, the program “System Report” can export this information as an .spx file (fig. 2).

Fig. 2. A small section of the contents of a Mac OS .spx file that lists the CPU, amount of RAM, and the computer serial number, among other information. (Courtesy of the ©Guggenheim Museum)

Within this file, information about the computer is encoded in the eXtensible Markup Language (XML) format. This includes details about the CPU, GPU, and RAM, which programs and drivers are installed, and even any devices that are connected to the computer at the time that the report is exported. In most versions of Windows, running the command “msinfo32” can produce similar results. The user can save an .nfo file, which is also XML encoded. In Debian Linux (and its popular variant Ubuntu) the program “hardinfo” also collects similar information, which can be exported to an HTML file and converted to XML if desired in the future. These types of files have been created when examining computers as part of the CCBA survey and backup, and the process of exporting software and hardware configuration information has been integrated into the Guggenheim’s disk imaging workflow.
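As a rough sketch of how such an export might be scripted, the following Python code selects a reporting tool according to the operating system. It is not the Guggenheim’s procedure. The function names and the output file stem are hypothetical, and the command-line options shown (`system_profiler -xml`, `msinfo32 /nfo`, `hardinfo -r -f html`) are assumptions that should be checked against the versions installed on the machine being examined.

```python
import platform
import subprocess

# File extension for each platform, and whether the tool prints its report
# to standard output (True) or writes the output file itself (False).
REPORT_TOOLS = {
    "Darwin": (".spx", True),
    "Windows": (".nfo", False),
    "Linux": (".html", True),
}


def build_command(output_stem, system=None):
    """Return (argv, output_path, uses_stdout) for the given platform."""
    system = system or platform.system()
    if system not in REPORT_TOOLS:
        raise ValueError(f"no report tool known for {system!r}")
    extension, uses_stdout = REPORT_TOOLS[system]
    output_path = output_stem + extension
    if system == "Darwin":
        argv = ["system_profiler", "-xml"]          # assumed option
    elif system == "Windows":
        argv = ["msinfo32", "/nfo", output_path]    # assumed option
    else:
        argv = ["hardinfo", "-r", "-f", "html"]     # assumed options
    return argv, output_path, uses_stdout


def export_report(output_stem, system=None):
    """Run the platform's tool and save the configuration report."""
    argv, output_path, uses_stdout = build_command(output_stem, system)
    if uses_stdout:
        with open(output_path, "wb") as handle:
            subprocess.run(argv, stdout=handle, check=True)
    else:
        subprocess.run(argv, check=True)
    return output_path
```

Keeping the command selection separate from the code that runs it makes it possible to review the exact command before anything is executed on an artist-provided computer.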

Several artworks in the Guggenheim’s collection use microcontroller units, custom built by the artist or the artist’s technician. A microcontroller unit is a small programmable microprocessor that runs a single program. Several microcontroller platforms have been popular among artists, including PIC, BASIC Stamp, and Arduino. Microcontrollers are often used to automate a physical process within a complex installation work, such as starting or stopping conveyor belts and fans at specific times, as in Phoebe Washburn’s Regulated Fool’s Milk Meadow (2007; fig. 3) or to choreograph the movements of motorized objects covered by gunny sacks, as in Susanta Mandal’s Caged Sacks (2007–2008).

Fig. 3: The microcontroller unit used for Phoebe Washburn’s Regulated Fool’s Milk Meadow (2007). This microcontroller runs code that controls conveyor belts, fans, and watering mechanisms within the installation. (Courtesy of the ©Guggenheim Museum)

With works that employ microcontrollers, understanding each of the microcontroller’s pin connections is essential to making sense of the artwork’s code and ultimately determining how the piece functions. In many cases, the result of the code running on the microcontroller is that an electrical current is sent through a specific pin to turn a piece of physical equipment (such as a conveyor belt or a fan) on or off. Some pins on microcontroller units are also input pins and may receive signals from physical equipment, such as motion sensors or light sensors.
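One way to document pin connections is as a small structured record kept alongside the source code. The sketch below is illustrative only: the pin numbers, devices, and notes are hypothetical and do not describe any artwork in the collection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PinConnection:
    pin: int
    direction: str   # "output" drives equipment; "input" reads a sensor
    device: str
    note: str = ""


# Hypothetical pin map for an installation with two outputs and one input.
PIN_MAP = [
    PinConnection(2, "output", "conveyor belt relay", "high = belt running"),
    PinConnection(3, "output", "fan relay"),
    PinConnection(7, "input", "motion sensor"),
]


def pins_by_direction(pin_map, direction):
    """List the pin numbers used as inputs or as outputs."""
    return sorted(c.pin for c in pin_map if c.direction == direction)


def describe(pin_map, pin):
    """Return a one-line description of a pin, or flag it as undocumented."""
    for connection in pin_map:
        if connection.pin == pin:
            return f"pin {connection.pin} ({connection.direction}): {connection.device}"
    return f"pin {pin}: not documented"
```

A record of this kind lets a future reader of the code look up what a pin number refers to, and it makes gaps in the documentation visible.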

INFORMATION CARRIERS

Artists also provide the museum with software and files on a variety of information carriers. Stored on these information carriers are source code files, executable files, project files, backups of related files, and, occasionally, disk images. Computers themselves contain an information carrier, an internal hard drive.

Internal and external hard drives, floppy disks, and optical discs are not sustainable information carriers, optical media in particular (Byers 2003; Owens 2014). The time of physical failure for these items is unpredictable; often, technological obsolescence will precede physical breakdown. While there has been little long-term research on the longevity of hard drives, a study produced by cloud storage provider Backblaze of more than 25,000 hard drives has shown that drives experience an 11.8% failure rate over the first four years of the drive’s life. A study by Gibson and Schroeder found that in large IT installations, annual failure rates “typically exceed 1%, with 2-4% [being] common” (2007, 1). Thus, data should be copied off of these information carriers as soon as possible, before they become unreadable (McKinley 2013). The ideal site for the extracted data is dedicated archival storage, also known as a trusted digital repository (Online Computer Library Center et al. 2007). Fortunately, conservation staff at the Guggenheim have not come across media that are entirely unreadable. However, in the course of the CCBA survey and backup, some older hard drives were found to contain “bad sectors,” areas of the drive that are no longer readable or writable due to physical damage or deterioration.
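To relate the two failure figures cited above, the short calculation below converts between a cumulative failure rate and an annual one. It assumes a constant annual failure rate, which is a simplification (real drives tend to fail at different rates at different ages), and the ten-year horizon is an arbitrary choice for illustration.

```python
def cumulative_failure(annual_rate, years):
    """Chance that a drive has failed within `years`, assuming a constant
    annual failure rate (a simplifying assumption)."""
    return 1 - (1 - annual_rate) ** years


def annualized_rate(cumulative_rate, years):
    """Constant annual rate that would produce the given cumulative rate."""
    return 1 - (1 - cumulative_rate) ** (1 / years)


# The 11.8% four-year figure expressed as an equivalent annual rate.
four_year_as_annual = annualized_rate(0.118, 4)

# The 2-4% annual range projected over an assumed ten years in storage.
ten_year_low = cumulative_failure(0.02, 10)
ten_year_high = cumulative_failure(0.04, 10)
```

Under these assumptions, 11.8% over four years corresponds to roughly 3% per year, which falls within the 2-4% range, and a drive left untouched for ten years would have roughly an 18% to 34% chance of having failed.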

DIGITAL ARTWORK ASSETS

Contained on the physical carriers that the museum collects are the digital artwork assets. A short description of each category of assets follows.

Source Code

Source code is the set of instructions of a computer program (originally written by the artist or the artist’s technician). For new acquisitions, the Guggenheim always attempts to collect the source code of a computer-based artwork. Source code is written in high-level programming or scripting languages such as Python, C, or ASP (fig. 4).

Fig. 4: Screenshot of source code written in ASP for Siebren Versteeg’s Untitled Film II (2004). The code also includes artist-written comments. (Courtesy of the ©Guggenheim Museum)

Source code is human readable if one understands the programming language; analyzing the source code allows conservators to understand the behaviors of an artwork, which shapes future conservation decisions (Engel and Wharton 2014). Previous research has concluded that “technical research on artist-generated source code not only serves conservation, but it can also aid art-historical research on artists’ aesthetic aims and their working methods” (Engel and Wharton 2015, 91). Without understanding the code, conservators may not be able to make informed decisions about future preservation strategies or treatment.

During acquisition, source code is delivered to the museum as one or more digital files that can be opened in a text editor, not as a PDF or as a hard copy printout. It is sometimes overlooked that microcontroller units run code as well and the source code may need to be requested separately during acquisition.

Project Files

Computer-based artworks written in proprietary development environments such as Adobe Flash, Macromedia Director, or Max MSP do not have source code in the traditional sense (though these development environments sometimes have integrated scripting languages in which code can be added). These are primarily visual programming languages in which a user creates a project and can manipulate elements in a timeline to create a sequence of events or connects various virtual modules to create a logical flow (fig. 5).

Fig. 5: Screenshot of the Max MSP project (also called a patcher) for Doug Wheeler’s PSAD Synthetic Desert III (1971/2017). (Courtesy of the ©Guggenheim Museum)

In these cases, the Guggenheim collects the original project file for the artwork, as this is equivalent to the source code for these development environments.

Software Libraries and Plug-ins

Whatever programming language or development environment was used, an additional consideration is whether software libraries or plug-ins were used in the artwork. Libraries are collections of files with prewritten code that handle commonly used routines and make writing new code easier. If a programmer uses a library, it may be necessary to have a complete copy of the entire library in order for the original source code to run or compile properly. In this case, the library becomes a “software dependency,” a necessity for the artwork to function. Plug-ins are used in development environments such as Macromedia Director, Adobe Flash, or Max MSP. These additional files add functionality that is not included in the environment by default. Plug-ins may be dependencies for a project file to open correctly. In the case of Max MSP, in which artworks are often run via the project, the plug-in would be necessary to exhibit the work. In Doug Wheeler’s PSAD Synthetic Desert III (1971, realized 2017), which uses Max MSP to generate a soundscape for the piece in real time, a plug-in called “Spatialisateur” is used to randomly delay sending audio to different speakers throughout the art space.

For Internet-connected artworks, libraries are sometimes called from a separate website. If the website that maintains the library ceases to exist, then the artwork may not function. This issue can be mitigated by collecting a full copy of the library or placing it inside the code base. In the restoration of the Guggenheim’s web artwork Brandon by Shu Lea Cheang (b. 1954), the JQuery library was used. A full copy of this library was placed in all of the areas of the site where it was used.
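As an aid to spotting this kind of dependency, a page’s markup can be scanned for scripts loaded from another host. The sketch below uses only the Python standard library. It is not the method used in the Brandon restoration, and the class and function names are hypothetical. It flags any script whose address includes a host name, so absolute links back to the artwork’s own site would also be listed and need to be reviewed by hand.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ScriptSourceCollector(HTMLParser):
    """Collect the src attribute of every <script> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def remote_scripts(html_text):
    """Return script sources that name a host rather than a local path."""
    collector = ScriptSourceCollector()
    collector.feed(html_text)
    # urlparse(...).netloc is empty for relative paths such as "js/local.js"
    return [src for src in collector.sources if urlparse(src).netloc]
```

Each address returned is a candidate for collecting a full local copy of the library, as described above.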

During acquisition, the artist or the artist’s technician might not always think about including libraries when transferring source code or project files to a museum. Identifying the necessary libraries or plug-ins at the outset of collecting a computer-based artwork helps ensure that the museum obtains all of the artwork’s software dependencies.

Executable Files

Exhibition copies of software-based artworks are often delivered as executable files. The executable simplifies the process of starting the artwork—the executable can simply be clicked or it can be set to autostart when the computer starts. Creating an executable file also eliminates the need to have certain software installed on the computer running the artwork. For example, executable files are created for Adobe Flash projects so that the computer does not need a full version of Adobe Flash installed. Windows executable files typically have the file extension .exe and executables on Mac OS have the extension .app.

To create executable files, source code or a project file is typically compiled, turning it into either binary code or a lower-level programming language and making it unreadable to humans. Thus, the executable often exists as a “black box”; in this form, it is impossible to analyze its specific logic and behaviors. Even if the executable file is collected, the Guggenheim collects the source code or project files as well. Unlike the executable file, the source code or project file allows for analysis of the artwork’s behaviors.

Related Media Assets

The programs behind computer-based artworks often make use of audio files, video files, or still image files. These media assets are also collected as part of the acquisition. Depending on the situation, they may also need to be extracted from project files or separate file directories. The Guggenheim collects these files in the event that they need to be used in any future migration or reconstruction of the work.

MUSEUM-CREATED DIGITAL ASSETS

Guggenheim staff also creates digital assets related to computer-based artworks. These files also require archiving. For example, as part of the Guggenheim’s ongoing collaboration with New York University’s Department of Computer Science, the department’s students conduct source code analysis of artworks. As part of the analysis, the students often annotate the code with detailed descriptions of how it functions. In addition to the original source code, these annotated copies of the source code are tracked and saved as reference material along with other components of the artwork.

In the past, research partners at NYU have decompiled code in order to analyze it because the original source code for an artwork was not accessible at the time. Decompilation means turning a compiled executable file into readable source code. However, there is a significant loss of information in decompiled code—the best source for analysis remains the original source code (Engel and Phillips forthcoming). Nevertheless, copies of decompiled and annotated code have been archived as reference material for artworks.

The Guggenheim has engaged in two restorations of web artworks that used the strategy of code migration: Brandon and John F. Simon Jr.’s Unfolding Object.1 In each case, a new restoration version of the fileset was created, and this restoration version was archived as a separate component of the artwork.

Conservation staff also creates a variety of preservation elements, including web archives (which provide an interactive historical record of the appearance and behavior of web artworks), transcoded files, and disk images.

Transcoding

Files in obsolete or obscure formats require transcoding, which means taking the content of a given file and translating it into a different format, usually for access or preservation purposes. For example, during the restoration of Brandon, audio files in the now-obsolete compressed format Real Media audio were transcoded to the WAV format, which is uncompressed and regarded as the “de facto standard for digital archival audio” by the Association of Recorded Sound Collections and the Library of Congress (Brylawski et al. 2015, 111). In the case of another artwork, Color Panel, an Apple image file that could be opened only in older versions of Photoshop was transcoded to TIFF, an uncompressed format. Even if transcoding occurs, the original file is always kept. Transcoding is never meant to replace the original but rather to prevent its content from becoming inaccessible in the future.

Disk Images and Web Server Images

Creating a disk image is currently the best strategy for archiving the information on a hard drive, floppy disk, or optical disc. As mentioned previously, physical information carriers such as hard drives and disks are quite vulnerable. Creating a disk image is the most effective method for providing a comprehensive backup of all data contained on a physical carrier. The contents of the carrier are written to a single file (the disk image), which cannot be easily modified. With forensic disk image formats, such as E01, metadata can be embedded in the disk image file and this metadata can be connected to the item’s record in a collection management system or inventory. Disk-imaging software reads the entire physical area of the storage device. Therefore, in addition to all the files intentionally saved, the disk image can also contain deleted files and unused empty space on the storage device.

In order to create a disk image of an internal hard drive, it often must be physically removed from the computer’s case. The drive is then connected to a write blocker, also known as a forensic bridge (fig. 6).

Fig. 6: An overview of the disk imaging process for a hard drive that illustrates the use of a write blocker. (Courtesy of the ©Guggenheim Museum)

The write blocker prevents the computer that is creating the disk image from writing to or altering any data on the drive or disk while it is being imaged, whether by creating hidden files or by altering the metadata of existing files. Write blockers are used by criminal investigators in the field of digital forensics, who must prove that the data they obtained was not altered in any way. Hardware write blockers in use by the digital forensics community are tested by the National Institute of Standards and Technology (NIST) to ensure that they will not write to any connected storage media (Allen 2017). These include write blockers manufactured by Wiebetech and Tableau that are in use in the Guggenheim media lab.

If a write blocker is unavailable, some floppy disks have their own form of write protection built into the disk itself via a plastic slider. Optical discs do not require write blocking because they are typically “read only” storage media. Even if rewritable discs like CD-RW or DVD-RW are being imaged, an examiner would need to take several intentional steps to write data to them, whereas with hard drives, the automated processes of the connected computer could create files or alter metadata.

The Guggenheim’s three web artworks exist on a virtualized web server hosted by a cloud service provider. For the restoration of Brandon, the conservation department collected a virtual disk image of the web server in order to preserve the pre-restoration version of the artwork. The Guggenheim’s cloud service provider had the option to export a virtual hard drive (VHD) from the museum’s virtual server. While it may not be possible to start up an instance of this server again outside of the cloud provider (owing to proprietary settings), the web server image still serves as a valuable historical snapshot of the artwork and the server settings.

DISK IMAGING WORKFLOW

As part of the CCBA initiative, a workflow was developed for disk imaging (fig. 7).

Fig. 7: A visual overview of the steps of the Guggenheim’s disk imaging workflow. (Courtesy of the ©Guggenheim Museum)

This procedure is based on research of the workflows used by libraries and archives (Gengenbach 2012), including workflows used by the Guggenheim’s Library and Archives, other museum workflows (such as that of the Denver Art Museum), and experience in the Guggenheim media lab. The workflow benefited significantly from the Guggenheim’s participation in the Museum of Modern Art’s December 2017 Peer Forum on Disk Imaging and from the advice of colleagues in conservation and in libraries and archives.2 The workflow differs for hard drives, optical discs, and floppy disks. SD cards and USB flash drives are treated similarly to hard drives in the workflow.

Pre-Imaging Documentation

Every item being imaged goes through a pre-imaging stage in which documentation about the physical media is collected. Conservators take check-in photos of the item, including any connectors and labels, and gather technical information, including its capacity, physical size, serial number, model name, and model number. This information is entered into a disk imaging report that is completed for each image. If the hard drive was inside of a computer, staff collects information about the configuration of the computer (using procedures detailed in the earlier “Physical Components” section of this article) during this stage as well.
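The pre-imaging information listed above could be captured in a structured record so that incomplete entries are easy to spot. The following is a minimal sketch, not the Guggenheim’s actual disk imaging report: the class name, the `item_id` field, and the example values are hypothetical, and the remaining fields simply mirror the items named in this section.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class PreImagingRecord:
    item_id: str          # hypothetical link to the collection record
    carrier_type: str     # e.g. "hard drive", "floppy disk", "optical disc"
    capacity: str
    physical_size: str
    serial_number: str
    model_name: str
    model_number: str
    check_in_photos: list = field(default_factory=list)


def missing_fields(record):
    """Names of fields that are still empty and need to be filled in."""
    return [name for name, value in asdict(record).items()
            if value in ("", None, [])]


def to_json(record):
    """Serialize the record so it can be stored alongside the disk image."""
    return json.dumps(asdict(record), indent=2)
```

A check such as `missing_fields` could be run before imaging begins, while the physical item is still on the bench and its labels can be read.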

Creation of Disk Image

With the initial documentation of the storage device and the computer completed, the disk image can then be created. Information about how the disk image was created is recorded in the disk imaging report, for example, which forensic bridge and which adapters were used.

There are several file formats to consider for disk images, but these formats generally fall into two broad categories: forensic disk images and raw disk images. Forensic disk images allow technical and descriptive metadata (including several fields that an institution can fill out in any way it chooses) to be embedded in the disk image itself. Forensic images can also compress the data and are especially efficient at compressing empty space on a hard drive. Forensic images also store cyclic redundancy checks (CRC32s), which validate that every sector of the hard drive has been copied accurately into the disk image.

The leading candidate for a forensic image format appears to be E01, also known as the EnCase format, based on both widespread adoption by professionals in libraries and archives and current support by a wide range of software (AVPreserve 2016; Knight 2011). However, with E01 files and all other forensic images, specific programs or libraries that interpret the forensic file format are needed to open them. Whether these tools for opening E01 files will continue to be supported in the future raises questions about the sustainability of the format. As the forensics community creates new formats, it becomes unclear whether older formats will be supported.

A raw image is simply a sector-by-sector copy of all data on a hard drive with no metadata wrapper and no compression. Therefore, a raw disk image of a 1TB hard drive will be exactly 1TB regardless of how many files were actually stored on the drive. This also means that raw images do not require any interpretation or decompression in order to be read. The file is simply a raw stream of all of the data from the device being imaged.
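The idea of a raw image as an uninterpreted stream of bytes can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the tool used in the Guggenheim's workflow: the function name `copy_raw` and the chunk size are hypothetical, the source is treated as an ordinary readable file, and real imaging would use a write blocker and a program that handles read errors.

```python
import hashlib


def copy_raw(source_path, image_path, chunk_size=512 * 1024):
    """Copy every byte of source_path to image_path without interpretation.

    Returns (bytes_copied, md5_hex). The hash is computed while copying so
    that the image can later be compared against the source.
    Hypothetical helper for illustration only.
    """
    md5 = hashlib.md5()
    copied = 0
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of the source reached
                break
            dst.write(chunk)  # empty (zero-filled) regions are copied too
            md5.update(chunk)
            copied += len(chunk)
    return copied, md5.hexdigest()
```

Because nothing is skipped or compressed, the resulting file is exactly as large as the source, which is the behavior described above for a 1TB drive.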

In the Guggenheim’s workflow, a disk image is first created in the E01 format. Then a raw image is exported from the E01 file. Creating the raw image is an extra preservation measure should support for E01 disk images no longer be available.

Disk Image QC

After creating a disk image, some degree of quality control (QC) inspection must take place to ensure that the disk image is an accurate copy of the data and will be usable in the future. As the disk image is being created, most imaging programs (such as FTK Imager and Guymager) automatically check for bad sectors on the drive. These are damaged areas of the drive from which the imaging program was not able to extract data. If a drive has bad sectors, it may need to be read with a program such as GNU ddrescue, which can perform multiple passes on the drive to extract data. The presence of bad sectors may indicate that the drive is failing (Hoffman 2013).

Methods also exist to verify that the disk image is a bit-for-bit copy of a storage device. Most imaging programs automatically verify the image, ensuring that the data in the disk image and the data on the physical information carrier are identical. The imaging programs verify the image (which can also be done manually) by comparing checksums. A checksum is a short alphanumeric value produced by running a mathematical algorithm over all of the bits that constitute a file or storage device. A checksum is sometimes called a “digital fingerprint,” but checksum algorithms produce values that are far less likely to coincide than fingerprints. For example, the popular checksum algorithm Message Digest 5 (MD5) produces a 32-digit hexadecimal value, meaning that there are 16^32 (about 3.4 × 10^38) theoretically possible checksum values (fig. 8).

Fig. 8: A screenshot of running the command “md5” in the Mac OS terminal, which produces an MD5 checksum for a file. The checksum value is displayed on the last line. (Courtesy of the ©Guggenheim Museum)

Due to the avalanche effect in the calculation of the checksum, changing even a single bit of a file produces a radically different checksum (Wikipedia 2018). If the storage device being imaged and the disk image file produce the same checksum, then the image is considered verified.
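Manual verification and the avalanche effect can both be sketched in a few lines of Python using the standard `hashlib` module. The function names (`md5_of_file`, `is_verified`, `flip_lowest_bit`) are hypothetical and chosen for illustration; imaging programs perform the equivalent comparison internally.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as 32 hexadecimal digits."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that very large images do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_verified(source_path, image_path):
    """An image counts as verified when both checksums are identical."""
    return md5_of_file(source_path) == md5_of_file(image_path)


def flip_lowest_bit(data):
    """Return a copy of data with a single bit of the first byte flipped."""
    return bytes([data[0] ^ 0x01]) + data[1:]


if __name__ == "__main__":
    original = b"computer-based artwork"
    altered = flip_lowest_bit(original)
    # The two inputs differ by one bit; compare the printed checksums.
    print(hashlib.md5(original).hexdigest())
    print(hashlib.md5(altered).hexdigest())
    # 32 hexadecimal digits correspond to 128 bits.
    print(16 ** 32 == 2 ** 128)
```

Running the script prints two checksums that look unrelated even though the inputs differ by one bit.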

As part of the QC stage in the Guggenheim’s workflow, ensuring that the images are accessible is also important. Three tests were devised to determine the level of accessibility. First, the disk image is opened in FTK Imager or BitCurator’s Disk Image Access Interface. The partitions are examined to make sure that they match those of the original drive or disk. This check is completed in order to prevent a situation encountered by Rechert and von Suchodoletz (2015) in which imaging appeared successful but the disk image did not produce the proper partition listing.
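To give a sense of what a partition listing is, the following Python sketch reads the partition table from the first sector of a raw image. It assumes a classic PC-style master boot record (MBR) with 512-byte sectors; drives that use other schemes (for example, GUID partition tables or the Apple Partition Map found on older Macs) would need different parsing. The function names are hypothetical, and the sketch is not a substitute for the tools named above.

```python
import struct

SECTOR_SIZE = 512  # assumption: 512-byte sectors


def read_mbr_partitions(image_path):
    """List the non-empty primary partitions recorded in an MBR."""
    with open(image_path, "rb") as handle:
        sector = handle.read(SECTOR_SIZE)
    # A valid MBR ends with the two-byte boot signature 0x55 0xAA.
    if len(sector) < SECTOR_SIZE or sector[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature found")
    partitions = []
    for index in range(4):  # the table holds four 16-byte entries
        entry = sector[446 + 16 * index: 446 + 16 * (index + 1)]
        part_type = entry[4]  # partition type code
        # Starting sector and sector count are little-endian 32-bit values.
        start_lba, sectors = struct.unpack("<II", entry[8:16])
        if part_type != 0:  # type 0 marks an unused entry
            partitions.append({"index": index, "type": part_type,
                               "start_lba": start_lba, "sectors": sectors})
    return partitions


def same_layout(listing_a, listing_b):
    """True when two partition listings describe the same layout."""
    return listing_a == listing_b
```

Comparing the listing read from the image with the listing recorded for the original drive is one way to catch an image whose partition table did not survive imaging.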

If the image passes the previous test, files are exported from it. If exporting files is successful, the image passes the second stage of QC. If the disk image is of a hard drive containing an operating system, the third step is an attempt to run the image in an emulator or virtual machine. If the image is of an optical disc or a floppy disk, mounting the image on different operating systems is attempted at this stage. The results are recorded in the disk imaging report, including any settings that were modified in the emulator or virtual machine. At this point in the CCBA research, an image that does not run in an emulator or virtual machine does not fail QC on those grounds. The results are simply noted, and staff may return to the image at a later date to determine why it did not run.

Disk Image Analysis and Transfer

After QC is complete, analysis of the contents of the image can be conducted. The goal of this stage is to generate a Digital Forensics XML (or DFXML) file for the contents of the image. DFXML is a standard used to record metadata about a group of files or an entire disk image in XML format. It was created by the digital forensics community and later adopted by librarians, archivists, and conservators. Digital preservation software such as Archivematica has DFXML creation built in as a microservice, but DFXML files can also be created via the command line or through the BitCurator suite’s reports module.

Two programs capable of creating DFXML files are Hashdeep and Fiwalk. Hashdeep traverses file directories and can output DFXML for all files and folders contained within a given directory. Fiwalk (or Filesystem Walk) is used for traversing and analyzing the entire file system of a disk image. However, Fiwalk cannot handle all file systems, such as the classic Mac file system HFS (Hierarchical File System), used on all versions of Mac OS before OS 8.1 (released in 1998).

For a single file, a DFXML record contains a wealth of information, such as the full path to the file (which includes the filename) as well as MD5 and SHA1 checksums for the file (fig. 9).

Fig. 9: The DFXML record for one file in a disk image. Line 87293 shows the path to the file. Lines 87313 and 87314 show checksums computed for the file. Lines 87305 to 87308 show the timestamp metadata, and line 87309 shows the file format identification. (Courtesy of the ©Guggenheim Museum)

The DFXML file also contains file format identification, such as whether the file is a text document or a GIF image, through checking for the “Magic Number” of a file. The Magic Number is an alphanumeric signature of the file type that is usually found within the first few bytes of the file.
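Identification by Magic Number can be sketched in Python by comparing the first bytes of a file against a small table of known signatures. The table below lists only a handful of well-known signatures and the function names are hypothetical; format identification tools rely on far larger signature databases.

```python
# A few well-known file signatures (leading bytes -> format name).
SIGNATURES = [
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
]


def identify_bytes(header):
    """Return a format name for the given leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path):
    """Read the first few bytes of a file and identify its format."""
    with open(path, "rb") as handle:
        return identify_bytes(handle.read(16))
```

Note that a plain text file has no signature at all, so a sketch like this reports it as unknown; real tools apply additional heuristics in such cases.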

Going through an analysis step for a disk image has several benefits. It creates a machine-readable inventory of every file in the image, including its checksum, metadata, and file format. The checksum can be used to ensure that any files exported from the image are exact copies. File metadata may be used as one point of reference to glean the last time that the file was modified. File format identification may also help staff get a clearer picture of the threatened or already obsolete file formats in their collections that may need to be transcoded.
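As a sketch of how such an inventory might be used, the following Python code reads filenames and MD5 checksums from a DFXML-style document and checks whether an exported file matches its recorded checksum. The sample XML is a heavily simplified, hand-written stand-in (real DFXML files carry namespaces and many more fields), and the function names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

# Simplified, hand-written sample; not the output of any real tool.
SAMPLE = """<dfxml>
  <fileobject>
    <filename>artwork/readme.txt</filename>
    <filesize>5</filesize>
    <hashdigest type="md5">5d41402abc4b2a76b9719d911017c592</hashdigest>
  </fileobject>
</dfxml>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}fileobject'."""
    return tag.rsplit("}", 1)[-1]


def inventory(dfxml_text):
    """Map each recorded filename to its MD5 checksum."""
    result = {}
    for element in ET.fromstring(dfxml_text).iter():
        if local_name(element.tag) != "fileobject":
            continue
        name, md5 = None, None
        for child in element:
            tag = local_name(child.tag)
            if tag == "filename":
                name = child.text
            elif tag == "hashdigest" and (child.get("type") or "").lower() == "md5":
                md5 = (child.text or "").strip().lower()
        if name and md5:
            result[name] = md5
    return result


def matches_inventory(name, exported_bytes, recorded):
    """True when the exported bytes hash to the recorded checksum."""
    return recorded.get(name) == hashlib.md5(exported_bytes).hexdigest()
```

For example, `matches_inventory("artwork/readme.txt", b"hello", inventory(SAMPLE))` checks an exported file against the sample record.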

DFXML contains timestamp metadata for the file, including the last modified time (Mtime), last changed time (Ctime), and last accessed time (Atime). However, digital forensics practitioners are skeptical about using this metadata to construct a narrative timeline of events because it can be altered easily, and its meaning is not always intuitive. They suggest gathering additional evidence to corroborate the times and dates found in file timestamp metadata (Manes n.d.; Whitfield n.d.). In general, Mtime is considered the most reliable because it is the hardest field to change (Woods 2018). Some file systems, such as those used for Mac OS and Windows, also record a “creation” or “birth” time for the file. Though this timestamp may seem to indicate when the file was originally created, it can be highly deceptive. If a file on a Mac is later copied to a Windows computer, the creation time can be altered to the date that the file was copied to the second computer; thus, the creation time does not necessarily reflect when the file was first created.

Timestamp metadata is heavily dependent on the file system; moving a file from one file system to another can cause the timestamps to change. A disk image encapsulates the file system so that its timestamps remain intact, but creating disk images for all components of a computer-based artwork that a museum receives is not always possible. For example, if the artist emails conservators a ZIP file, there is no physical carrier from which to create a disk image.
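How fragile these timestamps are can be seen with Python's standard library, which exposes them through `os.stat`. The sketch below is illustrative only and its function names are hypothetical. It assumes typical behavior: `shutil.copy` copies a file's contents so the copy receives a fresh modification time, while `shutil.copy2` additionally attempts to preserve it. The meaning of `st_ctime` differs by platform, and `st_birthtime` is only present on some systems.

```python
import os
import shutil


def timestamps(path):
    """Collect the timestamps that the operating system reports for a file."""
    info = os.stat(path)
    result = {
        "mtime": info.st_mtime,  # last modification of the contents
        "atime": info.st_atime,  # last access
        "ctime": info.st_ctime,  # platform dependent (see lead-in)
    }
    # A creation or "birth" time is only exposed on some platforms.
    birth = getattr(info, "st_birthtime", None)
    if birth is not None:
        result["birthtime"] = birth
    return result


def copy_both_ways(source, plain_copy, preserving_copy):
    """Copy a file with and without timestamp preservation.

    Returns the modification times of the plain copy and of the
    preserving copy, in that order.
    """
    shutil.copy(source, plain_copy)        # contents only: mtime becomes "now"
    shutil.copy2(source, preserving_copy)  # also tries to keep mtime and atime
    return timestamps(plain_copy)["mtime"], timestamps(preserving_copy)["mtime"]
```

The same file thus ends up with different modification times depending solely on how it was copied, which is one reason to corroborate timestamp metadata with other evidence.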

When analysis is complete, the disk imaging report is completed, which marks the end of the disk imaging workflow. A document detailing all of the steps of this workflow will be available on a forthcoming CCBA webpage on the Guggenheim’s website.

Following QC and the completion of the disk imaging report, the next step is transferring the disk image from the disk imaging computer workstation to the museum’s server storage for artworks. Whenever a transfer occurs, one must verify that the new copy is an exact duplicate of the original and that no file corruption occurred during the transfer. The Guggenheim uses the program BagIt to ensure that transfers to the server are free of corruption. The program uses a Library of Congress standard for file transfer aptly called a “bag.” BagIt packages the files in a standardized structure and creates a manifest with checksum values for all files contained inside the bag. The contents of the bag can then be verified before and after transfer and at any point later.
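The manifest idea behind a bag can be sketched with Python's standard library. The code below is a simplified illustration, not the BagIt program itself: it assumes a folder that already contains a `data` subfolder with the payload, writes a `manifest-md5.txt` file listing one checksum and relative path per line, and later reports any file that is missing or no longer matches. A complete bag contains additional files that are omitted here, and the function names are hypothetical.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir):
    """Record a checksum for every file under bag_dir/data."""
    lines = []
    for folder, _subfolders, files in os.walk(os.path.join(bag_dir, "data")):
        for name in sorted(files):
            full = os.path.join(folder, name)
            # Paths are stored relative to the bag, with forward slashes.
            relative = os.path.relpath(full, bag_dir).replace(os.sep, "/")
            lines.append(f"{file_md5(full)}  {relative}")
    manifest = os.path.join(bag_dir, "manifest-md5.txt")
    with open(manifest, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


def verify_manifest(bag_dir):
    """Return the relative paths that are missing or fail their checksum."""
    failures = []
    manifest = os.path.join(bag_dir, "manifest-md5.txt")
    with open(manifest, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            expected, relative = line.rstrip("\n").split(None, 1)
            full = os.path.join(bag_dir, *relative.split("/"))
            if not os.path.isfile(full) or file_md5(full) != expected:
                failures.append(relative)
    return failures
```

In use, `write_manifest` would run on the imaging workstation before transfer and `verify_manifest` on the server copy afterward; an empty list indicates that every file arrived intact.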

CATALOGING AND TRACKING

With so many types of files and file derivatives, tracking all elements of these artworks quickly becomes complex. Many of the challenges in managing information when collecting computer-based art are similar to cataloging and describing video art. For example, when cataloging these works, one must consider both physical information carriers and their contents. Like video art, computer-based artworks often have different versions and the differences between them may be extremely subtle. The complex relationships between different components must be understood and documented as well. An artwork may make use of multiple computers that all run different code and serve different functions, which makes accurate cataloging and tracking critical.

When cataloging computer-based art, the Guggenheim uses the same status hierarchy as with other time-based media artworks. Three general statuses exist for components: “artist-created or provided,” “museum-created,” and “research material.” In addition, every version of an artwork gets a separate set of component numbers. The version and date for each component are described as well as the relationships to other components, and information about how the component was created is recorded if applicable. The Guggenheim uses The Museum System (TMS) to track components, but many institutions use this same database in different ways. For example, the Inter Media Art Institute Foundation in Düsseldorf adapted TMS to describe the relationships between videotapes and the video files that were the result of digitization (Kumura-Myokam 2018). Such a system could be adapted for cataloging the components of computer-based artworks.

Extensive documentation about each artwork, created by conservators, is kept in the museum’s digital object file. This documentation includes recordings of interviews with artists and their technicians, installation instructions, identity and iteration reports, narrated screen recordings of artworks running, treatment reports, schematics of artworks, and flowcharts and site maps related to works. These files are kept on a separate server from files considered artwork, as the object file is frequently being revised and added to.

RECOMMENDED ASSETS FOR ACQUISITION

Currently, there are no widely agreed-upon procedures to guide conservators and curators in the acquisition of computer-based artworks. All museums can stand to improve their acquisition practices for these artworks, including the Guggenheim. Deliverables for acquisition will vary considerably depending on the artwork and which technologies were employed. A conservator will ask for different components for a Python script that scours the Internet for data versus a Flash animation that contains multiple image and audio assets.

Through the CCBA initiative, the Guggenheim has gained additional experience in the acquisition of computer-based artworks and would like to offer this list of recommended assets for acquisition as a means to contribute to the continued dialogue around acquiring computer-based artworks. However, the field is under constant development and these recommendations are preliminary:

  • Original source code or project files
  • Executable files (if the artwork employs them)
  • All supplemental assets used by the artwork, such as images, audio, video files, databases, markup files, fonts, and the like
  • A list of all hardware and software dependencies to answer the question “What will be necessary to make this work run again in the future, potentially on a different machine?”
  • A list of external dependencies (web services or web APIs used by the work, web scraping done by the work, or external servers called by the work)—many artists or technicians would never consider formally documenting this information, but it is critical for understanding the work and devising a strategy to make it persist over time.
  • Credentials necessary to operate the work, such as usernames, passwords, license keys, lock codes, and the like
  • Readme documentation for how to start up and operate the work

ACKNOWLEDGEMENTS

I would like to acknowledge the following individuals for their mentorship and assistance with my research:

  • Joanna Phillips, Senior Conservator of Time-based Media, Guggenheim
  • Deena Engel, Clinical Professor, Department of Computer Science, New York University
  • Amy Brost, Assistant Media Conservator, Museum of Modern Art
  • Peter Oleksik, Associate Media Conservator, Museum of Modern Art
  • Martina Haidvogl, Associate Media Conservator, San Francisco Museum of Modern Art
  • Mark Hellar, Owner, Hellar Studios
  • Eddy Colloton, Assistant Conservator, Denver Art Museum
  • Porter Olsen, PhD Candidate, University of Maryland
  • Kam Woods, Research Scientist, University of North Carolina at Chapel Hill
  • Ben Fino-Radin, Founder, Small Data Industries

The CCBA Initiative at the Guggenheim is supported by the Carl & Marilynn Thoma Art Foundation, the New York State Council on the Arts with the support of Governor Andrew Cuomo and the New York State Legislature, Christie’s, and Josh Elkes.

NOTES

1 More information about the restoration of Brandon can be found in the following Guggenheim blog post: www.guggenheim.org/blogs/checklist/restoring-brandon-shu-lea-cheangs-early-web-artwork. The restoration of Unfolding Object is discussed by Deena Engel and Joanna Phillips in this issue of the Electronic Media Review.

2 I would like to acknowledge Kam Woods, Porter Olsen, and Ben Fino-Radin in particular for their advice.

REFERENCES

Allen, T. A. 2017. “Hardware write block.” NIST. www.nist.gov/itl/ssd/software-quality-group/computer-forensics-tool-testing-program-cftt/cftt-technical/hardware (accessed 08/26/2018).

AVPreserve. 2016. “Disk image format matrix.” Google Docs. https://docs.google.com/spreadsheets/d/18t-fU8ZO20Pgio6-QyPYHP3BR07lqL226gg4vRZPl84/edit?usp=embed_facebook (accessed 08/26/2018).

Beach, Brian. 2013. “How long do disk drives last?” Backblaze. https://www.backblaze.com/blog/how-long-do-disk-drives-last/ (accessed 08/27/2018).

Brylawski, S., M. Lerman, R. Pike, K. Smith, Association for Recorded Sound Collections, Council on Library and Information Resources, and National Recording Preservation Board (U.S.), eds. 2015. ARSC Guide to Audio Preservation. CLIR Publication no. 164. Eugene, OR, and Washington, DC: Association for Recorded Sound Collections; co-published by Council on Library and Information Resources: National Recording Preservation Board of the Library of Congress.

Byers, Fred R. 2003. Care and Handling of CDs and DVDs: A Guide for Librarians and Archivists. Washington, DC: Council on Library and Information Resources; Gaithersburg, MD: National Institute of Standards and Technology.

Dover, Caitlin. 2016. “How The Guggenheim And NYU Are Conserving Computer Based Art—Part 1.” www.guggenheim.org/blogs/checklist/how-the-guggenheim-and-nyu-are-conserving-computer-based-art-part-1 (accessed 08/26/2018).

Engel, D., and G. Wharton. 2014. “Reading between the lines: Source code documentation as a conservation strategy for software-based art.” Studies in Conservation 59 (6): 404–15.

Engel, D., and G. Wharton. 2015. “Source code analysis as technical art history.” Journal of the American Institute for Conservation 54 (2): 91–101. https://doi.org/10.1179/1945233015Y.0000000004 (accessed 08/26/2018).

Engel, D., and J. Phillips. 2018 (forthcoming). “Applying conservation ethics to the examination and treatment of software- and computer-based art.” Journal of the American Institute for Conservation.

Gengenbach, M. 2012. “‘The way we do it here’: Mapping digital forensics workflows in collecting institutions.” Master’s Thesis, Chapel Hill, NC: University of North Carolina, Chapel Hill. http://digitalcurationexchange.org/system/files/gengenbach-forensic-workflows-2012.pdf (accessed 08/26/2018).

Hoffman, C. 2013. “Bad sectors explained: Why hard drives get bad sectors and what you can do about it.” How To Geek. www.howtogeek.com/173463/bad-sectors-explained-why-hard-drives-get-bad-sectors-and-what-you-can-do-about-it/ (accessed 08/26/2018).

Knight, G. 2011. “Forensic disk imaging report.” King’s College London. https://doi.org/10.17037/PUBS.00354890 (accessed 08/26/2018).

Kumura-Myokam, H. 2018. “Describing time-based media art in a database: Metadata and data structure for cataloging of analog and digital moving images.” Presented at the It’s About Time! Building a New Discipline: Time-Based Media Art Conservation conference, New York University, May 22, 2018.

Laurenson, P. 2004. “The management of display equipment in time-based media installations.” In Modern art, new museums: Contributions to the Bilbao Congress 13-17 September 2004, ed. Ashok Roy and Perry Smith. London: International Institute for Conservation of Historic and Artistic Works, 49–53.

Manes, G. W. n.d. “When you shouldn’t trust the metadata: The truth behind creation, modified, and accessed date information.” Avansic. https://web.archive.org/web/20180712232827/http://www.acc.com:80/chapters/louis/upload/Avansic-Why-You-Shouldnt-Trust-Metadata.pdf (accessed 05/09/2019).

McKinley, M. 2013. “Imaging Digital Media for Preservation with LAMMP.” Electronic Media Review 3. http://resources.conservation-us.org/emg-review/volume-three-2013-2014/mckinley/ (accessed 05/09/2019).

Online Computer Library Center, Center for Research Libraries, and National Archives and Records Administration. 2007. “Trustworthy repositories audit & certification: criteria and checklist.” http://www.crl.edu/sites/default/files/d6/attachments/pages/trac_0.pdf (accessed 05/09/2019).

Owens, T. 2014. “Getting public radio’s legacy off ageing rewritable CDs: An interview with WNYC’s John Passmore.” The Signal. http://blogs.loc.gov/thesignal/2014/02/getting-public-radios-legacy-off-ageing-rewritable-cds-an-interview-with-wnycs-john-passmore/ (accessed 08/27/2018).

Phillips, J. n.d. “Time Based Media.” Solomon R. Guggenheim Museum. www.guggenheim.org/conservation/time-based-media (accessed 08/20/2018).

Phillips, J. 2012. “Shifting equipment significance in time-based media art.” The Electronic Media Review 1: 139–154.

Rechert, K., and D. von Suchodoletz. 2015. “Imaging (Old) IDE Disks — Harder than Imagined.” BwFLA blog. September 16, 2015. https://web.archive.org/web/20171223171249/http://bw-fla.uni-freiburg.de/wordpress/?p=788 (accessed 03/15/2017).

Schroeder, B., and G. A. Gibson. 2007. “Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?” In FAST ’07. USENIX Association. http://static.usenix.org/events/fast07/tech/schroeder/schroeder.pdf (accessed 08/27/2018).

Serexhe, B. 2013. “On system change in cultural memory and the conservation of digital art.” In Preservation of digital art: Theory and practice: The Project Digital Art Conservation, ed. Serexhe, B., and Zentrum für Kunst und Medientechnologie Karlsruhe. Wien: Ambra V. 75–84.

Whitfield, L. n.d. “MAC Times, Mac Times, and More.” www.sans.org/summit-archives/file/summit-archive-1498168030.pdf (accessed 08/27/2018).

Wikipedia. 2018. “MD5.” https://en.wikipedia.org/w/index.php?title=MD5&oldid=854147492 (accessed 08/26/2018).

Woods, K. 2018. Personal communication. University of North Carolina at Chapel Hill.

Jonathan Farbowitz
Fellow in the Conservation of Computer-based Art
Solomon R. Guggenheim Museum
jfarbowitz@guggenheim.org