Active storage
Jul 1, 2007 12:00 PM, BY PAUL TURNER
In the evolution from tape- to file-based workflows, asynchronous IP-based storage is increasingly chosen for online and nearline archive storage.
In many cases, the mainstay of this activity has been RAID-based NAS or SAN solutions, but grid storage has made inroads over the last year or so. Offering large storage capacities and simplified system management, grid storage is an alternative approach to bulk data storage, and it also offers another possibility: active storage. This article examines the concept of active storage: what it is, how it works and the advantages it can bring to the entire workflow.
The fundamentals of grid storage
In a nutshell, grid storage consists of separate, standalone content servers, each responsible for storing only part (usually referred to as a slice) of each file loaded onto the system. In this way, the file itself is scattered across multiple autonomous content servers. Separate metadata servers decide which slice goes to which content server. (See Figure 1.) The metadata servers also provide the file system namespace to the various clients in the system.
Figure 1. A typical grid storage system
This arrangement is analogous to the operation of a standard hard drive. The content servers are similar to the sectors of a hard drive, and the metadata servers are like the file allocation table of the drive, where a file name is translated into the addresses of the sectors of the disk where the data can be found. The idea has simply been expanded in the case of grid storage.
This architecture allows clients, whether reading or writing, to first ask the metadata servers for the locations of the slices and then interact directly with each content server to gain access to an individual slice. This is significantly faster than the traditional NAS approach, where all access to storage must pass through the NAS head — an obvious bandwidth bottleneck.
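The write and read paths described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the 4-byte slice size and the round-robin placement rule are invented for the example and do not describe any particular product.

```python
from dataclasses import dataclass, field

SLICE_SIZE = 4  # bytes per slice; tiny on purpose so the example is easy to follow


@dataclass
class ContentServer:
    """Stores slices keyed by (file name, slice index)."""
    name: str
    slices: dict = field(default_factory=dict)

    def put(self, key, data):
        self.slices[key] = data

    def get(self, key):
        return self.slices[key]


class MetadataServer:
    """Decides which content server holds each slice and remembers the answer."""

    def __init__(self, servers):
        self.servers = servers
        self.layout = {}  # file name -> list of server names, one per slice

    def allocate(self, filename, size):
        count = -(-size // SLICE_SIZE)  # ceiling division
        # Round-robin placement is an assumption made for this sketch.
        self.layout[filename] = [
            self.servers[i % len(self.servers)].name for i in range(count)
        ]
        return self.layout[filename]

    def locate(self, filename):
        return self.layout[filename]


def write_file(meta, servers_by_name, filename, data):
    placement = meta.allocate(filename, len(data))
    for index, server_name in enumerate(placement):
        chunk = data[index * SLICE_SIZE:(index + 1) * SLICE_SIZE]
        servers_by_name[server_name].put((filename, index), chunk)


def read_file(meta, servers_by_name, filename):
    # One lookup at the metadata server, then direct access to each content server.
    placement = meta.locate(filename)
    return b"".join(
        servers_by_name[name].get((filename, index))
        for index, name in enumerate(placement)
    )


servers = [ContentServer(f"cs{i}") for i in range(3)]
servers_by_name = {s.name: s for s in servers}
meta = MetadataServer(servers)
write_file(meta, servers_by_name, "clip.mxf", b"0123456789")
restored = read_file(meta, servers_by_name, "clip.mxf")
```

Note that the metadata server never touches the slice data itself. It only hands out locations, which is why it does not become the kind of bottleneck that a NAS head does.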
Another distinguishing attribute of grid storage is the way it provides data protection. Protection is achieved by making copies of the slices onto other content servers in the grid. At any point in time, there are at least two copies of every slice of each file. The principle is that the failure of any individual content server does not render the data unrecoverable because there is always at least one other copy of each slice available somewhere else on the grid.
The content servers operate autonomously, so re-replication of missing data can happen simultaneously through a number of content servers operating in parallel. An important item to note is that grid storage systems rebuild data, whereas RAID systems rebuild drives. The latter includes rebuilding sectors of the replacement drive that never held valid data in the original, which is clearly wasted effort. This prolongs the rebuild time and extends the window of vulnerability to another drive failure.
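A rough back-of-the-envelope comparison shows why rebuilding data rather than drives shortens the window of vulnerability. Every figure in this sketch is hypothetical (drive size, fill level, transfer rate and number of surviving servers), and the model ignores network contention and client traffic.

```python
def raid_rebuild_hours(drive_capacity_gb, rebuild_rate_mb_s):
    """Time to rewrite every sector of the replacement drive, used or not."""
    return drive_capacity_gb * 1000 / rebuild_rate_mb_s / 3600


def grid_rereplication_hours(used_gb, rate_per_server_mb_s, helper_servers):
    """Time to copy only the lost slices, spread evenly over the surviving servers."""
    return used_gb * 1000 / (rate_per_server_mb_s * helper_servers) / 3600


# Hypothetical figures: a 500 GB drive that is 40% full, 50 MB/s per device,
# and 10 surviving content servers sharing the re-replication work.
raid_hours = raid_rebuild_hours(500, 50)
grid_hours = grid_rereplication_hours(200, 50, 10)
```

With these assumed numbers, the RAID rebuild takes about 2.8 hours, while the grid finishes in under seven minutes. The saving comes from two sources: only the 200 GB of real data is copied, and ten servers share the work.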
Re-replication of data in a grid storage system happens significantly faster than the rebuilding of a hard drive by a RAID engine, greatly reducing the window of vulnerability. If the replication factor is set to three or higher, the failure of any drive or content server will not leave the system in a vulnerable state: even if one copy of a file is completely lost, the data is safe because at least two other copies of the affected slices remain somewhere on the grid. Because the replication factor is user-selectable, operators can choose the level of data resiliency they need.
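The following sketch illustrates replication with a factor of three and the re-replication that follows a content server failure. The placement rule (consecutive servers) and all function names are assumptions made for the example, and it presumes there are at least as many servers as the replication factor.

```python
REPLICATION_FACTOR = 3  # user-selectable in the systems described above


def place_replicas(slice_ids, server_names, factor=REPLICATION_FACTOR):
    """Return {slice id: set of servers}, putting each copy on a different server."""
    placement = {}
    for n, slice_id in enumerate(slice_ids):
        placement[slice_id] = {
            server_names[(n + k) % len(server_names)] for k in range(factor)
        }
    return placement


def fail_server(placement, failed):
    """Drop every copy held by the failed server."""
    for holders in placement.values():
        holders.discard(failed)


def rereplicate(placement, healthy_servers, factor=REPLICATION_FACTOR):
    """Copy each under-replicated slice to healthy servers that lack it.

    Returns the (slice id, source, target) copies that were made. Each copy has
    its own source and target, so a real grid could run them in parallel.
    """
    copies = []
    for slice_id, holders in placement.items():
        if not holders:
            continue  # every copy was lost; nothing left to copy from
        candidates = [s for s in healthy_servers if s not in holders]
        while len(holders) < factor and candidates:
            source = sorted(holders)[0]  # any surviving copy will do
            target = candidates.pop(0)
            holders.add(target)
            copies.append((slice_id, source, target))
    return copies


servers = ["cs0", "cs1", "cs2", "cs3", "cs4"]
placement = place_replicas(["a", "b", "c", "d"], servers)
fail_server(placement, "cs2")
min_copies_after_failure = min(len(h) for h in placement.values())
copies = rereplicate(placement, [s for s in servers if s != "cs2"])
```

After the simulated failure of `cs2`, every slice still has at least two copies, so the data is never down to a single copy while the three missing replicas are being recreated.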
Replication has other advantages too. For example, the average latency encountered by each individual client decreases as the replication factor increases, which is extremely important in today's production environment.
The concept of active storage
Until recently, storage systems have been passive members of the workflow. Once media was stored on them, it remained there until external systems read the data, manipulated it and then put the result back onto the storage. This was true when media was stored on tape and has remained true in most cases when using disk-based storage.
Grid storage offers a new opportunity. As previously mentioned, grid storage is made up of separate content servers, each of which has a CPU, RAM and all of the other hardware that makes up a modern platform. It is entirely possible for a powerful content server platform to take on additional processing tasks.
For example, each CPU can examine the slices located on its hard drives and perform automatic error checking, calculating a cyclic redundancy check (CRC) from the data. It then compares the CRC to a CRC that was calculated for the slice at the time it was created and was stored along with that data as part of the write process. If the two numbers don't match, the content server can declare its slice to be invalid, and the metadata servers can respond by causing the slice to be re-replicated from a known good copy of the slice to some other storage location within the grid. This effectively makes the system self-healing, with an associated reduction in the need for manual intervention by maintenance staff.
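The self-checking loop just described can be sketched with the CRC-32 function in Python's standard `zlib` module. The dictionary-based stores and the function names are hypothetical stand-ins for a content server's disks and for the metadata servers' response. The article does not say which CRC variant such systems use.

```python
import zlib


def write_slice(store, slice_id, data):
    """Store the slice together with the CRC computed at write time."""
    store[slice_id] = {"data": data, "crc": zlib.crc32(data)}


def scrub(store):
    """Recompute each CRC and return the ids of slices that no longer match."""
    return [
        slice_id
        for slice_id, entry in store.items()
        if zlib.crc32(entry["data"]) != entry["crc"]
    ]


def heal(bad_ids, bad_store, good_store, target_store):
    """Retire each invalid copy and re-replicate it from a known good copy."""
    for slice_id in bad_ids:
        del bad_store[slice_id]
        write_slice(target_store, slice_id, good_store[slice_id]["data"])


local, replica, spare = {}, {}, {}
for store in (local, replica):
    write_slice(store, "clip.mxf/0", b"frame data 0")
    write_slice(store, "clip.mxf/1", b"frame data 1")

local["clip.mxf/1"]["data"] = b"frame dat\x00 1"  # simulate silent corruption on disk
bad = scrub(local)
heal(bad, local, replica, spare)
```

Here `scrub` flags only the corrupted slice, and `heal` places a fresh copy on a different store, mirroring the re-replication "to some other storage location" described above.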
Taking this idea a step further, it is equally possible to use some of the processing power of the content servers to manage and process media. If the storage is aware that the data it is holding are actually media files, it is possible to use some of the CPU power of the individual content servers to perform media-specific processing tasks in addition to the activity of storing and serving up data.
It is, of course, vital that such use does not impinge on the ability of the content servers to provide data services to the various clients connected to the grid, which is their primary purpose. To this end, it is necessary to add a management layer to the system's code to ensure that no content server becomes oversubscribed. The remaining CPU power can be used as raw processing capability, acting on the data stored on the grid. An external application server can even supply external data sets, along with instructions on how to manipulate them. Typically, the components of such a configuration include:
- application controllers, on which the client application GUIs can run, which manage the operation of their individual applications;
- grid resource management software, which can receive requests for CPU cycles from the application controllers and in response allocate available CPUs to each requestor; and
- a grid application loader, which runs on each content server to set up the processing environment on that server and physically launch a process.
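A minimal sketch of the resource management idea follows: grant spare CPU to a requesting application controller only if the chosen content server keeps enough headroom for its storage duties. The class names, the percentage-based accounting and the 20 percent reserve are assumptions for illustration. The application loader that would actually launch the process is not modelled.

```python
from dataclasses import dataclass


@dataclass
class ServerLoad:
    name: str
    storage_pct: int       # CPU percentage currently spent serving slices
    granted_pct: int = 0   # CPU percentage already handed to processing jobs


class GridResourceManager:
    """Grants spare CPU to application controllers without oversubscribing a server."""

    def __init__(self, servers, reserve_pct=20):
        self.servers = servers
        self.reserve_pct = reserve_pct  # hypothetical margin kept free for storage duties

    def spare(self, server):
        used = self.reserve_pct + server.storage_pct + server.granted_pct
        return max(0, 100 - used)

    def request(self, pct_needed):
        """Return the name of the server chosen for the job, or None if none can take it."""
        best = max(self.servers, key=self.spare)
        if self.spare(best) < pct_needed:
            return None
        best.granted_pct += pct_needed
        return best.name


manager = GridResourceManager([
    ServerLoad("cs0", storage_pct=50),
    ServerLoad("cs1", storage_pct=10),
    ServerLoad("cs2", storage_pct=70),
])
first = manager.request(40)    # goes to the least busy server
second = manager.request(40)   # refused: no server has 40% to spare any more
third = manager.request(25)    # a smaller job still fits
```

Refusing the second request rather than squeezing it in is the point of the management layer: serving data to clients always takes priority over processing work.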
Suddenly, the system ceases to be a mere storage repository and becomes an active part of the user's workflow. It is easy to see how adding this capability can improve the business of processing material as it passes through the workflow. And such active workflows, by the nature of their parallelism, can operate substantially faster than their passive counterparts. Figure 2 is an example of the processes needed to manage grid storage in this way.
There are several activities that immediately come to mind when considering the possibilities enabled by active storage.