Below is an extended version of the lightning talk I presented at the Society of American Archivists’ annual meeting on August 16, 2014.
The repository I work in, Hagley Museum and Library, houses materials created by a variety of businesses and organizations, some defunct, others still active. Due to the proprietary nature of the business records we collect, as well as the privacy and security concerns of active corporations, many of our collections are on deposit or closed for twenty-five years or more.
In 2012, Hagley received a large hybrid collection, consisting primarily of textual analog materials, in addition to a number of born-digital records. The records were created by various tech corporations during the normal course of business in the late 1990s and early 2000s and document aspects of the dot-com boom and bust, an area of research where primary sources are sorely lacking.
Even though the collection is closed for twenty-five years from date of creation, I could not let the electronic records sit untouched on a shelf during that time, as we might with paper records. Given the potentially high research value of the collection, I decided the preservation of its born-digital content was a top priority, particularly since much of it resided on physical media that is already at risk of loss. With the assistance of a few coworkers, I combed through hundreds of record cartons and found the following obsolete media formats: 349 compact discs; 134 3.5” floppy disks; 113 digital linear tapes (DLT); 49 digital data storage tapes (DDS); 19 quarter-inch mini cartridges; 15 Travan cartridges; and 8 Zip disks.
Although the CDs and floppy disks presented few problems, the remaining obsolete formats offered a lesson in how complex data recovery can be. My attempts to use “freecycled” drives and jerry-rig old PCs were just not working. Even if I could connect a computer to a DLT or DDS drive of the exact generation needed to read the tapes, I would also need to know which software program was used to create the backup, which could vary widely depending on the date of creation. I would then have to install that software successfully and cross my fingers that the media was not encrypted or corrupt.
Since Hagley is a small shop with limited in-house resources, it was clear to me that outsourcing the data extraction was the best course of action. After consulting several vendors, I found a company that specializes in data extraction and indexing of backup tapes. The vendor’s office was close enough that I could visit the digital lab in person and test data retrieval on a few backup tapes, free of charge. Although the lab is not set up to read Travan or quarter-inch mini cartridges, the vendor successfully read the DLT and DDS tapes I brought.
After establishing a budget for the first phase of the project, I sent the vendor a sample consisting of five DLT and three DDS tapes. Less than a week later, the vendor provided me with access to the indexed data from seven of the eight tapes. After a brief training session, I was able to access the content in the vendor’s hosted system via a web browser, where I could eliminate duplicates, search full text and metadata at the item level, and filter content by file type, format, and date. I then tagged data of potentially high research value for download. Due to the size of the collection, I was strict with appraisal, retaining only about ten percent of the data. The original media was returned to Hagley a few weeks later. Having successfully completed the first phase of the project, we will continue to use the same company for the remaining tapes.
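For readers who end up with extracted files on their own storage rather than in a hosted system, the duplicate elimination and file-type filtering described above can be approximated with checksums. The following is only a minimal sketch, not the vendor's method: it assumes the files already sit in a local directory, and the function names (`sha256_of`, `group_by_checksum`, `unique_files`) are hypothetical names chosen for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Optional, Set


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_checksum(root: Path) -> Dict[str, List[Path]]:
    """Map each checksum to every file under root with that exact content."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return dict(groups)


def unique_files(root: Path, extensions: Optional[Set[str]] = None) -> List[Path]:
    """Keep one path per distinct content (the first in sorted order).

    If extensions is given (lowercase, with the dot, e.g. {".doc"}),
    only representatives with one of those extensions are returned.
    """
    keep = []
    for paths in group_by_checksum(root).values():
        first = paths[0]
        if extensions is None or first.suffix.lower() in extensions:
            keep.append(first)
    return sorted(keep)
```

Calling `unique_files(Path("extracted"), {".doc", ".xls"})` on a hypothetical `extracted` directory would return one copy of each distinct word-processing or spreadsheet file. Checksums only catch byte-for-byte duplicates, so near-duplicates such as successive drafts still need an archivist's appraisal.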
In conclusion, here are a few key points to consider when outsourcing data recovery and retrieval. First, ask yourself if the data is even worth recovering. Not all collections are created equal and neither are all born-digital records. Next, do you have the in-house resources to read and extract the data to a secure storage area? Even if the answer is no, this does not mean you should immediately search for a vendor. Instead, consider the short-term and long-term costs of performing the data retrieval in-house. Perhaps the costs to your repository, in both money and time, are sustainable. Remember that such costs include purchasing, installing, and maintaining equipment and software; training yourself and other employees to use the system; and perhaps even hiring a new staff member. How often do you anticipate using the system in the near future? If the data resides on a very rare and expensive media format your repository will likely never encounter again, it may not be worth the time and effort to do the work in-house. More importantly, before turning to a vendor, consider collaborating with another organization or institution to retrieve the data. They may have equipment and resources you need and vice versa. Finally, if you do decide to outsource, research and compare vendors; get quotes; read the vendor agreement carefully before committing; and always send a sample first.