preservation strategy

Big Data, Long Term Digital Information Management Strategies & the Future of (Cartridge) Tape

What is the most effective way to store and manage digital data in the long term? This is a question we have given considerable attention to on this blog. We have covered issues such as analogue obsolescence, digital sustainability and digital preservation policies. It seems that as a question it remains unanswered and up for serious debate.

We were inspired to write about this issue once again after reading an article that was published in the New Scientist a year ago called ‘Cassette tapes are the future of big data storage.’ The title is a little misleading, because the tape it refers to is not the domestic audio tape that has recently acquired much counter cultural kudos, but rather archival tape cartridges that can store up to 100 TB of data. How much?! I hear you cry! And why tape given the ubiquity of digital technology these days? Aren’t we all supposed to be ‘going tapeless’?

The reason for such an invention, the New Scientist reveals, is the ‘Square Kilometre Array (SKA), the world’s largest radio telescope, whose thousands of antennas will be strewn across the southern hemisphere. Once it’s up and running in 2024, the SKA is expected to pump out 1 petabyte (1 million gigabytes) of compressed data per day.’

SKA_dishes

Image of the SKA dishes

Researchers at Fuji and IBM have already designed a tape that can store up to 35TB, and it is hoped that a 100TB tape will be developed to cope with the astronomical ‘annual archive growth [that] would swamp an experiment that is expected to last decades’. The 100TB cartridges will be made ‘by shrinking the width of the recording tracks and using more accurate systems for positioning the read-write heads used to access them.’

If successful, this would certainly be an advanced achievement in material science and electronics. Smaller tape width means less room for error on the read-write function – this will have to be incredibly precise on a tape that will be storing a pretty extreme amount of information. Presumably smaller tape width will also mean there will be no space for guard bands either. Guard bands are unrecorded areas between the stripes of recorded information that are designed to prevent information interference, or what is known as ‘cross-talk‘.They were used on larger domestic video tapes such as U-Matic and VHS, but were dispensed with on smaller formats such as the Hi-8, which had a higher density of magnetic information in a small space, and used video heads with tilted gaps instead of guard bands.

The existence of SKA still doesn’t explain the pressing question: why develop new archival tape storage solutions and not hard drive storage?

Hard drives were embraced quickly because they take up less physical storage space than tape. Gone are the dusty rooms bursting with reel upon reel of bulky tape; hello stacks of infinite quick-fire data, whirring and purring all day and night. Yet when we consider the amount of energy hard drive storage requires to remain operable, the costs – both economic and ecological – dramatically increase.

The report compiled by the Clipper Group published in 2010 overwhelmingly argues for the benefits of tape over disk for the long term archiving of data. They state that ‘disk is more than fifteen times more expensive than tape, based upon vendor-supplied list pricing, and uses 238 times more energy (costing more than the all costs for tape) for an archiving application of large binary files with a 45% annual growth rate, all over a 12-year period.’

This is probably quite staggering to read, given the amount of investment in establishing institutional architecture for tape-less digital preservation. Such an analysis of energy consumption does assume, however, that hard drives are turned on all the time, when surely many organisations transfer archives to hard drives and only check them once every 6-12 months.

Yet due to the pressures of technological obsolescence and the need to remain vigilant about file operability, coupled with the functional purpose of digital archives to be quickly accessible in comparison with tape that can only be played back linearly, such energy consumption does seem fairly inescapable for large institutions in an increasingly voracious, 24/7 information culture. Of course the issue of obsolescence will undoubtedly affect super-storage-data tape cartridges as well. Technology does not stop innovating – it is not in the interests of the market to do so.

Perhaps more significantly, the archive world has not yet developed standards that address the needs of digital information managers. Henry Newman’s presentation at the Designing Storage Architectures 2013 conference explored the difficulty of digital data management, precisely due to the lack of established standards:

  • ‘There are some proprietary solutions available for archives that address end to end integrity;
  • There are some open standards, but none that address end to end integrity;
  • So, there are no open solutions that meet the needs of [the] archival community.’

He goes on to write that standards are ‘technically challenging’ and require ‘years of domain knowledge and detailed understanding of the technology’ to implement. Worryingly perhaps, he writes that ‘standards groups do not seem to be coordinating well from the lowest layers to the highest layers.’ By this we can conclude that the lack of streamlined conversation around the issue of digital standards means that effectively users and producers are not working in synchrony. This is making the issue of digital information management a challenging one, and will continue to be this way unless needs and interests are seen as mutual.

Other presentations at the recent annual meeting for Designing Storage Architectures for Digital Collections which took place on September 23-24, 2013 at the Library of Congress, Washington, DC, also suggest there are limits to innovation in the realm of hard drive storage.  Gary Decad, IBM, delivered a presentation on the ‘The Impact of Areal Density and Millions of Square Inches of Produced Memory on Petabyte Shipments for TAPE, NAND Flash, and HDD Storage Class‘.

For the lay (wo)man this basically translates as the capacity to develop computer memory stored on hard drives. We are used to living in a consumer society where new improved gadgets appear all the time. Devices are getting smaller and we seem to be able buy more storage space for cheaper prices. For example, it now costs under £100 to buy a 3TB hard drive, and it is becoming increasingly more difficult to purchase hard drives which have less than 500GB storage space. Compared with last year, a 1TB hard drive was the top of the range and would have probably cost you about £100.

A 100TB storage unit in 2010, compared with a smaller hard drive symbolising 2020.

Does my data look big in this?

Yet the presentation from Gary Decad suggests we are reaching a plateau with this kind of storage technology – infinite memory growth and reduced costs will soon no longer be feasible. The presentation states that ‘with decreasing rates of areal density increases for storage components and with component manufactures reluctance to invest in new capacity, historical decreases in the cost of storage ($/GB) will not be sustained.’

Where does that leave us now? The resilience of tape as an archival solution, the energy implications of digital hard drive storage, the lack of established archival standards and a foreseeable end to cheap and easy big digital data storage, are all indications of the complex and confusing terrain of information management in the 21st century. Perhaps the Clipper report offers the most grounded appraisal: ‘the best solution is really a blend of disk and tape, but – for most uses – we believe that the vast majority of archived data should reside on tape.’ Yet it seems until the day standards are established in line with the needs of digital information managers, this area will continue to generate troubling, if intriguing, conundrums.

Posted by debra in Audio Tape, Video Tape, 0 comments

Parsimonious Preservation – (another) different approach to digital information management

We have been featuring various theories about digital information management on this blog in order to highlight some of the debates involved in this complex and evolving field.

To offer a different perspective to those that we have focused on so far, take a moment to consider the principles of Parsimonious Preservation that has been developed by the National Archives, and in particular advocated by Tim Gollins who is Head of Preservation at the Institution.

racks of servers storing digital information

In some senses the National Archives seem to be      bucking the trend of panic, hysteria and (sometimes)  confusion that can be found in other literature relating  to digital information management. The advice given in  the report, ‘Putting Parsimonious Preservation into  Practice‘, is very much advocating a hands-off, rather  than hands-on approach, which many other  institutions, including the British Library, recommend.

The principle that digital information requires  continual interference and management during its life  cycle is rejected wholesale by the principles of  parsimonious preservation, which instead argues that  minimal intervention is preferable because this entails  ‘minimal alteration, which brings the benefits of  maximum integrity and authenticity’ of the digital data object.

As detailed in our previous posts, cycles of coding and encoding pose a very real threat to digital data. This is because it can change the structure of the files, and risk in the long run compromising the quality of the data object.

Minimal intervention in practice seems here like a good idea – if you leave something alone in a safe place, rather than continually move it from pillar to post, it is less likely to suffer from everyday wear and tear. With digital data however, the problem of obsolescence is the main factor that prevents a hands-off approach. This too is downplayed by the National Archives report, which suggests that obsolescence is something that, although undeniably a threat to digital information, it is not as a big a worry as it is often presented.

Gollins uses over ten years of experience at the National Archives, as well as the research conducted by David Rosenthal, to offer a different approach to obsolescence that takes note of the ‘common formats’ that have been used worldwide (such as PDF, .xls and .doc). The report therefore concludes ‘that without any action from even a national institution the data in these formats will be accessible for another 10 years at least.’

10 years may seem like a short period of time, but this is the timescale cited as practical and realistic for the management of digital data. Gollins writes:

‘While the overall aim may be (or in our case must be) for ―permanent preservation […] the best we can do in our (or any) generation is to take a stewardship role. This role focuses on ensuring the survival of material for the next generation – in the digital context the next generation of systems. We should also remember that in the digital context the next generation may only be 5 to10 years away!’

It is worth mentioning here that the Parsimonious Preservation report only includes references to file extensions that relate to image files, rather than sound or moving images, so it would be a mistake to assume that the principle of minimal intervention can be equally applied to these kinds of digital data objects. Furthermore, .doc files used in Microsoft Office are not always consistent over time – have you ever tried to open a word file from 1998 on an Office package from 2008? You might have a few problems….this is not to say that Gollins doesn’t know his stuff, he clearly must do to be Head of Preservation at the National Archives! It is just this ‘hands-off, don’t worry about it’ approach seems odd in relation to the other literature about digital information management available from reputable sources like The British Library and the Digital Preservation Coalition. Perhaps there is a middle ground to be struck between active intervention and leaving things alone, but it isn’t suggested here!

For Gollins, ‘the failure to capture digital material is the biggest single risk to its preservation,’ far greater than obsolescence. He goes on to state that ‘this is so much a matter of common sense that it can be overlooked; we can only preserve and process what is captured!’ Another issue here is the quality of the capture – it is far easier to preserve good quality files if they are captured at appropriate bit rates and resolution. In other words, there is no point making low resolution copies because they are less likely to survive the rapid successions of digital generations. As Gollins writes in a different article exploring the same theme, ‘some will argue that there is little point in preservation without access; I would argue that there is little point in access without preservation.’

Diagram explaining how emulation works to make obsolete computers available on new machines

This has been bit of a whirlwind tour through a very interesting and thought provoking report that explains how a large memory institution has put into practice a very different kind of digital preservation strategy. As Gollins concludes:

‘In all of the above discussion readers familiar with digital preservation literature will perhaps be surprised not to see any mention or discussion of “Migration” vs. “Emulation” or indeed of ―“Significant Properties”. This is perhaps one of the greatest benefits we have derived from adopting our parsimonious approach – no such capability is needed! We do not expect that any data we have or will receive in the foreseeable future (5 to 10 years) will require either action during the life of the system we are building.’

Whether or not such an approach is naïve, neglectful or very wise, only time will tell.

Posted by debra in Audio Tape, 2 comments