Alleviating the data boom

Mon, 1st Nov 2010

Unstructured data comes in many different formats, such as email attachments, documents, spreadsheets and video images. The total volume of unstructured data stored is forecast to grow at a compound annual growth rate in excess of 60%. Total digital information stored is estimated to double every 18 months, yet only an estimated 5% of that data will be traditional structured data.

Structured data, which is typically generated by a financial transaction, is stored in a relational database. This makes it easy to understand both the content and the context of the data. This is not the case with unstructured data, which makes it difficult to know what backup and retention policies should apply and how to get the best return from that information.

Why doesn’t existing technology work?

Unstructured data is usually stored outside of a database, which makes it difficult to manage. Usually, the context of the file is lost or separated from the data. In addition, the content within unstructured data is not easily understood. This makes storage decisions difficult because, for example, different spreadsheets or documents will have very different priorities, depending on their content. Storage policies based simply on file type will not suffice.

Current data storage technologies such as Network Attached Storage (NAS) were originally designed to allow concurrent access to data from multiple users. File systems store data in hierarchical structures, or “trees”, of directories, folders, subfolders and files. This allows rapid retrieval of data without requiring much metadata, that is, information associated with the underlying data. The drawback is that, with so little content information in the metadata, unstructured files cannot be differentiated for appropriate storage policies, so generally a blanket approach is taken, such as storing a particular file type indefinitely.
As the volume of unstructured data grows, the number of nested folders grows with it, which makes retrieval challenging, particularly if the metadata is limited. As the tree structure grows, the performance of the file system deteriorates and backups become more problematic.

In essence, conventional file-based storage such as NAS is not well suited to managing the explosive growth in unstructured data. High costs, unnecessary complexity and poor performance have prompted a new approach to storing unstructured data.

What is the solution?

Object-based storage addresses the challenges of unstructured data storage by retaining rich metadata about a data object. This allows both the context and the content of the underlying data to be preserved. Whereas a conventional file system may store basic metadata such as file name, originator, date of creation or modification and file type, object-based storage retains multiple elements that both identify the content and help determine how the data should be treated over its lifecycle.

The benefit of object storage is that information can be retrieved without knowing the file name, dates or file designations. In addition, the rich metadata may be used to route the file through a policy-based process that maximises its economic benefit to the firm while minimising the cost of retaining it.

Object-based storage also makes e-discovery of retained data, and its exposure to business intelligence processes, much easier. The rich metadata also makes it easy to group related information that would otherwise be stored as disparate files.
For example, an engineer may keep the text files, video files, spreadsheets and engineering drawings associated with a particular project all together, in much the same way that paper-based records are kept.

A further benefit of object-based storage is that, unlike a conventional file, each object is assigned a unique ID. This is generated using a 128-bit random number generator, which makes duplicate IDs extremely unlikely and allows an internet-type access system. It also means that the object does not need a specified location for retrieval, and millions of objects can be stored in a flat address space without a complex file system or “trees”, thus dramatically reducing complexity and costs. The migration of data from one system to another is made much simpler and less disruptive.

Furthermore, the presence of a hash signature, together with the object’s ID, allows easy deduplication and also helps to prove the integrity of the data where compliance is important. The hash signature is computed from the data within the object, so if the signature remains the same, it is strong evidence that the data is unmodified. This provides an effective means of archiving data that is subject to strict legal and compliance requirements.

Finally, object-based storage lowers storage costs and complexity. NAS overheads such as concurrency control, permissions and file locks are not needed to the same extent, so object storage can be affordably scaled to petabytes of data. Expensive primary storage can be released for its correct purpose.

Routing files according to policies based on rich metadata allows specific and appropriate treatment of objects for use and archiving, rather than treating all data as equal.

Why use cloud storage?

Conventional file system storage will co-exist with object-based storage for quite some time.
While on-premise storage costs continue to decline with improving technology, the avalanche of unstructured data points to cloud storage as a cost-effective solution. Object storage is rapidly emerging as a standard for cloud storage. A number of cloud storage service providers employing object-based storage have emerged in the past few years, including Cleversafe, Mezeo and Iron Mountain. Amazon has had an offering for some time.

Cloud storage allows the firm’s unstructured data to be stored cost-effectively while relatively scarce and expensive on-premise primary storage is dedicated to its best use.

What should you be careful of in the cloud?

As with all cloud computing services, caveat emptor applies. Each storage cloud is unique and appropriate due diligence must be performed prior to any commitment. The service level agreement (SLA) must make clear the terms of engagement, who owns the data, how it will be extracted and what standards are employed while the data is in the custody of the service provider.

The standards for object-based storage have been established by a group within the Storage Networking Industry Association (SNIA) for the T10 committee of the International Committee for Information Technology Standards (INCITS). Current object-based systems use standard APIs such as Representational State Transfer (REST) and Simple Object Access Protocol (SOAP), or proprietary APIs, that define how applications store and retrieve objects by ID. Any proprietary, non-standards-based protocol is a warning sign that calls for further due diligence.

Object-based storage offers a powerful solution to the problem of exponential growth in the quantity of unstructured data. Cloud storage service providers offer a cost-effective alternative to an on-premise solution. Due diligence is paramount to ensure future data mobility is preserved by seeking vendors who comply with industry standards.
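
For readers who want to see the object-storage ideas in concrete form, the following is a minimal Python sketch, not any vendor's implementation. It shows a random 128-bit object ID, a hash signature used for deduplication and integrity checks, retrieval by metadata rather than by file name, and a simple metadata-driven policy. Every name in it (StoredObject, FlatObjectStore, retention_tier, the metadata keys and the tier labels) is hypothetical. The choice of SHA-256 is also an assumption, since the article does not name a hash algorithm, and an in-memory dictionary stands in for the flat address space.

```python
import hashlib
import secrets
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """Hypothetical object record: data plus ID, hash and rich metadata."""
    object_id: str      # 128-bit random ID, hex-encoded (32 characters)
    data: bytes
    content_hash: str   # SHA-256 of the data (algorithm is an assumption)
    metadata: dict = field(default_factory=dict)


class FlatObjectStore:
    """Toy in-memory store with a flat ID space instead of a folder tree."""

    def __init__(self):
        self._objects = {}   # object_id -> StoredObject
        self._by_hash = {}   # content_hash -> object_id, used for deduplication

    def put(self, data: bytes, metadata: dict) -> str:
        digest = hashlib.sha256(data).hexdigest()
        existing = self._by_hash.get(digest)
        if existing is not None:
            # Identical content is already stored: return its ID and keep
            # the original metadata (a simplification for this sketch).
            return existing
        object_id = secrets.token_hex(16)  # 16 random bytes = 128 bits
        self._objects[object_id] = StoredObject(
            object_id, data, digest, dict(metadata)
        )
        self._by_hash[digest] = object_id
        return object_id

    def get(self, object_id: str) -> StoredObject:
        return self._objects[object_id]

    def verify(self, object_id: str) -> bool:
        """Recompute the hash; a match indicates the data is unmodified."""
        obj = self._objects[object_id]
        return hashlib.sha256(obj.data).hexdigest() == obj.content_hash

    def find(self, **criteria) -> list:
        """Return IDs of objects whose metadata matches every criterion."""
        return [
            obj.object_id
            for obj in self._objects.values()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


def retention_tier(metadata: dict) -> str:
    """Hypothetical policy: choose a storage tier from metadata alone."""
    if metadata.get("legal_hold"):
        return "compliance-archive"
    if metadata.get("project_status") == "active":
        return "primary"
    return "low-cost-archive"
```

In this sketch, calling store.find(project="bridge") gathers every object tagged with that project, whatever its file type, which mirrors the engineer's project folder described earlier. A production system would persist the objects and index the metadata rather than scanning it.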
