The choice of data storage method can have a significant impact on factors ranging from performance and reliability to management complexity and cost. There’s no right choice for all scenarios. Storage strategies should be chosen based on the specific requirements of the application in question: the optimal choice for a geographically redundant data store would be a poor choice for a high-performance database application.

We’ll discuss three data storage options and their specific advantages, before focusing on the best option for the very large datasets often required by Big Data applications.

Three Storage Models

Much virtual storage is either file-based or block-based, but at the core of many large cloud platforms is object storage.

1. File-Based Storage

File-based storage systems are the most familiar to ordinary users. Windows PCs and Macs store data as files within directories. File storage is a hierarchical system of data and metadata managed by a file system. The file system maintains the metadata, which indicates who owns the file, where its constituent parts are located on the storage device, the type of file, and so on.
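To make the idea of file system metadata concrete, here is a minimal Python sketch that builds a small directory hierarchy and reads back a few of the attributes the file system tracks. The directory and file names are invented for illustration, and the fields shown are only a sample of what a real file system records.

```python
import stat
import tempfile
from pathlib import Path


def describe(path: Path) -> dict:
    """Return a few of the metadata fields the file system keeps for a path."""
    info = path.stat()
    return {
        "size_bytes": info.st_size,
        "owner_uid": info.st_uid,  # numeric owner ID (always 0 on Windows)
        "is_directory": stat.S_ISDIR(info.st_mode),
        "modified": info.st_mtime,  # seconds since the epoch
    }


# Build a small hierarchy in a temporary directory: reports/2024/q1.txt
with tempfile.TemporaryDirectory() as root:
    report = Path(root) / "reports" / "2024" / "q1.txt"
    report.parent.mkdir(parents=True)
    report.write_text("quarterly numbers")

    file_meta = describe(report)
    dir_meta = describe(report.parent)
    # The path itself encodes the file's position in the hierarchy.
    relative_parts = report.relative_to(root).parts
```

Note that the set of fields is fixed by the file system: an application can read them, but it cannot simply add a new attribute of its own to every file in the same way.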

File-based storage offers excellent performance for local storage and for storage shared over local area networks (NFS works like this).

File-based storage’s limitations become apparent when datasets are very large (billions of files), data is distributed over a wide area, or more flexible metadata is required.

2. Block-Based Storage

A block is simply a chunk of data. Blocks are combined by an application to recreate a file. The block storage system itself does not organize these blocks into files; that is left to the software using the volume, whether an application such as a database or an operating system’s file system. Without that software, blocks are just arbitrary pieces of data scattered across storage devices; they are only given meaning by whatever reads and writes them. Block storage does not use large amounts of metadata.
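The following toy Python sketch illustrates that division of labor. The `BlockDevice` class, the 8-byte block size, and the `store`/`load` helpers are all assumptions made for the example, not a real storage API: the device only knows numbered blocks, and the application keeps the list of block numbers needed to reassemble its data.

```python
BLOCK_SIZE = 8  # bytes per block; real devices typically use 512 or 4096


class BlockDevice:
    """A toy device: a numbered array of fixed-size blocks, nothing more."""

    def __init__(self, block_count: int) -> None:
        self.blocks = [bytes(BLOCK_SIZE)] * block_count

    def write_block(self, number: int, data: bytes) -> None:
        if len(data) > BLOCK_SIZE:
            raise ValueError("data does not fit in one block")
        # Pad short writes with zero bytes so every block has the same size.
        self.blocks[number] = data.ljust(BLOCK_SIZE, b"\x00")

    def read_block(self, number: int) -> bytes:
        return self.blocks[number]


def store(device: BlockDevice, payload: bytes, free_blocks: list) -> list:
    """Split payload into blocks and return the block numbers used, in order."""
    used = []
    for offset in range(0, len(payload), BLOCK_SIZE):
        number = free_blocks.pop(0)
        device.write_block(number, payload[offset:offset + BLOCK_SIZE])
        used.append(number)
    return used


def load(device: BlockDevice, block_numbers: list, length: int) -> bytes:
    """Reassemble the payload; only the caller knows the order and length."""
    data = b"".join(device.read_block(n) for n in block_numbers)
    return data[:length]


device = BlockDevice(block_count=16)
payload = b"customer record #1042"
# The free list is deliberately scattered to show blocks need not be contiguous.
block_map = store(device, payload, free_blocks=[9, 2, 14, 5])
restored = load(device, block_map, len(payload))
```

Without `block_map` and the original length, the contents of blocks 9, 2, and 14 are just three unrelated byte strings, which is the point the paragraph above makes.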

Block storage is an excellent choice for relatively localized applications that require high performance, but geographic distance between the storage device and the application negates the performance advantage.

3. Object-Based Storage

Object-based storage has some of the characteristics of both file- and block-based storage, but with unique advantages. An object is a chunk of data, often a file, together with all of its associated metadata. There is no fixed schema limiting the metadata that each object can be associated with. Data and metadata are held together as a single unit which is identified by its object ID. Applications access objects by presenting the object ID to the object storage system.

Of key importance to understanding object storage is its non-hierarchical nature. Object storage systems are flat; all objects occupy the same “level” and are retrieved in the same way with the object ID.
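A minimal in-memory sketch in Python shows the shape of this model. The `ObjectStore` and `StoredObject` names, the use of random UUIDs as object IDs, and the sample metadata are assumptions for illustration; real object stores differ in how IDs are generated and how data is persisted.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """Data and its metadata travel together as one unit."""
    data: bytes
    metadata: dict = field(default_factory=dict)


class ObjectStore:
    """A toy object store: one flat mapping from object ID to object."""

    def __init__(self) -> None:
        self._objects = {}  # no directories, no nesting

    def put(self, data: bytes, **metadata) -> str:
        object_id = uuid.uuid4().hex  # hypothetical ID scheme
        self._objects[object_id] = StoredObject(data, dict(metadata))
        return object_id

    def get(self, object_id: str) -> StoredObject:
        return self._objects[object_id]


store = ObjectStore()
photo_id = store.put(
    b"\xff\xd8 fake jpeg bytes",
    content_type="image/jpeg",
    tags=["calgary", "skyline"],
)
fetched = store.get(photo_id)
```

Every object is retrieved the same way, by ID, regardless of what it contains or which metadata it carries.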

Benefits Of Object Storage For Large Datasets

Of the storage types mentioned here, object storage is the optimal solution for dealing with large datasets. Its inherent flexibility where metadata is concerned means that data can be categorized and analyzed in ways that are impossible, or at least extremely complex, with file- and block-based storage, which makes it well suited to Big Data applications. It’s hard to overstate the importance of flexible metadata; it allows for the creation of storage systems and applications that conform readily to users’ needs, rather than forcing users to work around the complexity of predetermined metadata schemas.
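As a rough illustration of what flexible metadata buys, the Python sketch below filters and groups a handful of objects by attributes that were never declared in any schema. The object IDs, the metadata keys, and the `find` helper are all hypothetical; a production system would index metadata rather than scan it.

```python
from collections import Counter

# Hypothetical objects: each ID maps to its own free-form metadata.
# The objects deliberately do not share a fixed set of fields.
catalog = {
    "obj-001": {"kind": "sensor-reading", "region": "eu", "unit": "celsius"},
    "obj-002": {"kind": "sensor-reading", "region": "us", "unit": "fahrenheit"},
    "obj-003": {"kind": "photo", "region": "eu", "camera": "dslr"},
    "obj-004": {"kind": "log", "service": "billing"},
}


def find(catalog: dict, **criteria) -> list:
    """Return the IDs of objects whose metadata matches every criterion."""
    return sorted(
        object_id
        for object_id, metadata in catalog.items()
        if all(metadata.get(key) == value for key, value in criteria.items())
    )


eu_objects = find(catalog, region="eu")
eu_readings = find(catalog, kind="sensor-reading", region="eu")
# Grouping works the same way: count objects by any attribute of interest.
per_kind = Counter(metadata["kind"] for metadata in catalog.values())
```

An object that lacks a given attribute, such as `obj-004` with no `region`, simply does not match; nothing has to be migrated when a new attribute is introduced.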

The single flat address space allows object storage to be almost infinitely scalable; more storage devices can be added quickly and easily. And the major benefit is that those devices don’t have to be local, which makes object storage the perfect solution for geographically distributed storage.
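One way a flat ID space can be spread across devices is to derive each object's location from a hash of its ID. The Python sketch below uses rendezvous (highest-random-weight) hashing, which is a technique chosen here for illustration and not something this article attributes to any particular product; the node names are invented. Its useful property is that adding a node only moves objects onto the new node and leaves everything else in place.

```python
import hashlib


def place(object_id: str, nodes: list) -> str:
    """Pick the node with the highest hash score for this object ID."""

    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}:{object_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(nodes, key=score)


# Hypothetical storage nodes in different locations.
nodes = ["calgary-1", "dublin-1", "tokyo-1"]
object_ids = [f"object-{n}" for n in range(1000)]

before = {oid: place(oid, nodes) for oid in object_ids}
# Add a fourth node and recompute where every object belongs.
after = {oid: place(oid, nodes + ["sydney-1"]) for oid in object_ids}
moved = [oid for oid in object_ids if before[oid] != after[oid]]
```

Because placement depends only on the object ID and the list of nodes, no central directory tree has to be reorganized when capacity is added, and the nodes can sit in different regions.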

Object storage systems are also relatively simple to use and manage. Most use simple REST API calls to access objects, which makes it easy for developers to create applications around an object store.
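The sketch below shows what such calls might look like from Python's standard library. The endpoint `objects.example.com`, the container name, and the object ID are hypothetical, and the `X-Object-Meta-` header prefix follows the convention used by OpenStack Swift; other services use different header names and require authentication that is omitted here. The requests are only constructed, not sent.

```python
from urllib.request import Request

BASE_URL = "https://objects.example.com/v1"  # hypothetical endpoint


def build_put(container: str, object_id: str, data: bytes, metadata: dict) -> Request:
    """Build a PUT request that uploads data with custom metadata headers."""
    headers = {f"X-Object-Meta-{key}": value for key, value in metadata.items()}
    headers["Content-Type"] = "application/octet-stream"
    return Request(
        f"{BASE_URL}/{container}/{object_id}",
        data=data,
        headers=headers,
        method="PUT",
    )


def build_get(container: str, object_id: str) -> Request:
    """Build a GET request that retrieves an object by its ID."""
    return Request(f"{BASE_URL}/{container}/{object_id}", method="GET")


put_request = build_put("readings", "obj-001", b"21.5", {"Region": "eu"})
get_request = build_get("readings", "obj-001")
# Passing either request to urllib.request.urlopen() would send it;
# that step is left out because the endpoint does not exist.
```

Storing and fetching an object come down to one HTTP verb and one URL each, which is a large part of why object stores are straightforward to build against.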

Object storage is the best choice for applications that require flexible metadata, geographically distributed storage, and storage management without the complexity of file- and block-based systems.

Image: Flickr/calgaryreviews