IDC estimates that unstructured data will make up more than 80% of business information by 2025.
And while “unstructured” may be a bit of a misnomer, because all files have some metadata by which they can be searched and sorted, for example, vast volumes of that data are in the hands of companies.
In this article, we discuss the ins and outs of working with unstructured data and the storage, usually file or object, that it needs.
In the past, images, voice recordings, videos, chat logs, and documents of various kinds were largely just a storage responsibility, considered a headache for anyone who needed to manage, organize, and maintain them.
But now unstructured data is seen as a valuable source of business insight, and analytical processing can extract that value. For example, it is possible to run AI/ML against sets of advertising images and map what site visitors see to their click behavior. Analysis of unstructured image data can create structured fields that drive editorial decision-making.
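As a minimal sketch of that images-to-structured-fields idea, the Python below uses Pillow to derive a few structured attributes from each image. The average-colour calculation is a deliberately crude stand-in for a real ML model, and the directory and field names are hypothetical.

```python
from pathlib import Path
from PIL import Image  # pip install Pillow

def image_features(path: Path) -> dict:
    """Derive a few structured fields from one unstructured image file."""
    with Image.open(path) as img:
        # Shrink to 1x1 so the single remaining pixel approximates the
        # average colour -- a crude stand-in for an ML classifier.
        r, g, b = img.convert("RGB").resize((1, 1)).getpixel((0, 0))
        return {
            "file": path.name,
            "width": img.width,
            "height": img.height,
            "avg_rgb": (r, g, b),
        }

# Hypothetical directory of advertising images; the resulting records
# could be joined against click-through data in a conventional table.
rows = [image_features(p) for p in Path("ad_images").glob("*.jpg")]
for row in rows:
    print(row)
```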
Elsewhere, backup copies, long consigned to dusty and inaccessible tape archives, are now seen as a potential source of data for analytical processing. And with the threat of ransomware high on the agenda, the need for backups from which you can recover quickly is more pertinent than ever.
Structured, unstructured, semi-structured
Unstructured data, in general terms, is data and information that does not conform to a predefined data model; in other words, information that is created and lives outside of a relational database.
Business information generated by systems is most likely structured, with product and customer details, order numbers, stock levels, and shipping information created by a sales system and stored in its underlying database being typical examples.
These are more than likely SQL databases, configured with a table-based schema and data stored in rows and columns that allow for very fast data writes and queries, with very good transactional integrity. SQL databases are at the heart of the highest-performance, mission-critical applications in use.
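For a concrete picture of that table-based schema, here is a minimal sketch using Python’s built-in sqlite3 module. The table and column names are hypothetical, standing in for the sales system described above.

```python
import sqlite3

# In-memory database, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_number  INTEGER PRIMARY KEY,
        customer_id   INTEGER NOT NULL,
        product_code  TEXT    NOT NULL,
        quantity      INTEGER NOT NULL,
        ship_city     TEXT
    )
""")

# Writes happen inside a transaction, which is where the
# "very good transactional integrity" comes from.
with conn:
    conn.execute(
        "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
        (1001, 42, "SKU-7", 3, "Leeds"),
    )

# A predefined schema makes queries fast and unambiguous.
for row in conn.execute("SELECT order_number, ship_city FROM orders"):
    print(row)
```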
Unstructured/semi-structured
Unstructured data is often created by people and includes email, social media posts, voice recordings, images, videos, notes, and documents such as PDFs.
As mentioned, most unstructured data could actually be called semi-structured: while it does not reside in a database, there is some structure to its metadata. For example, an image of a delivered item would, superficially, be unstructured, but the metadata in the camera’s file makes it semi-structured.
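To see that metadata, here is a minimal sketch that reads EXIF tags from an image with Pillow. The filename is hypothetical, and not every image will carry these tags.

```python
from PIL import Image
from PIL.ExifTags import TAGS  # maps numeric EXIF tag IDs to names

# Hypothetical proof-of-delivery photo.
with Image.open("delivery_photo.jpg") as img:
    exif = img.getexif()

# The "structure" hiding inside an unstructured file: timestamp,
# camera make and model, orientation, and so on.
for tag_id, value in exif.items():
    print(f"{TAGS.get(tag_id, tag_id)}: {value}")
```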
And then there are backup files, in which all of an organization’s data is copied, compressed, encrypted, and packaged in the (usually proprietary) format of the backup provider.
The fact that backups aggregate all types of data makes them an unstructured data challenge, and one that is arguably more relevant than ever with the increased threat of ransomware.
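As a loose analogy for why a backup reads as one big unstructured blob, here is a sketch using Python’s tarfile module. Real backup products use their own proprietary formats and add deduplication and encryption, all omitted here; the filenames are dummies created for the example.

```python
import pathlib
import tarfile

# Create a few dummy files standing in for mixed unstructured data.
names = ("orders.db", "delivery_photo.jpg", "notes.pdf")
for name in names:
    pathlib.Path(name).write_bytes(b"placeholder contents")

# Aggregate them into one compressed archive -- a loose stand-in for
# a backup job.
with tarfile.open("backup.tar.gz", "w:gz") as tar:
    for name in names:
        tar.add(name)

# From the outside, the archive is a single opaque blob; the structure
# of its contents only reappears when it is indexed or extracted.
with tarfile.open("backup.tar.gz", "r:gz") as tar:
    print(tar.getnames())
```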
Unstructured and semi-structured storage needs
As we have seen, unstructured data is more or less defined by the fact that it is not created through the use of a database. It may be the case that more structure is applied to unstructured data later in its life, but then it becomes something else.
What we will look at here are the key requirements for the storage infrastructure for unstructured data. These are:
- Volume – There is typically a large amount of unstructured data, so capacity is a key requirement.
- File and/or object storage – Block storage is for databases and, as we have seen, is not a requirement for unstructured data use cases. File-based storage (NAS) and object storage fill the need.
- Performance – Historically, this would not have been on the agenda, but with the need for closer-to-real-time analytics and for quick recovery from a cyberattack, it is now more of a consideration.
Cloud and unstructured data
With these requirements in mind, cloud storage seems to be a good fit as a place to store unstructured data. However, there are potentially some things that work against you.
Cloud storage provides object (overwhelmingly, in terms of volume) and file access storage, so it’s potentially suitable in that regard.
Cloud storage can also provide capacity, and data can often be stored in bulk in the cloud extremely cost-effectively. But usually, costs can be kept very low only while the data isn’t being accessed, and that is the first potential drawback of cloud storage.
So the cloud is great for cold data, but any kind of I/O starts to drive up costs. However, that may be acceptable depending on the size and access requirements of your workload. Small data sets, or those that require infrequent access, would be ideal.
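To make that trade-off concrete, here is a back-of-the-envelope sketch in Python. All the rates are hypothetical placeholders, not any provider’s actual pricing; the point is only the shape of the curve, where at-rest storage is cheap and reading data back out drives the bill.

```python
# Hypothetical unit prices -- NOT real provider pricing.
STORAGE_PER_GB_MONTH = 0.004   # cold object tier, at rest
EGRESS_PER_GB = 0.09           # data read back out of the cloud

def monthly_cost(stored_gb: float, read_gb: float) -> float:
    """Rough monthly bill: at-rest capacity plus data read out."""
    return stored_gb * STORAGE_PER_GB_MONTH + read_gb * EGRESS_PER_GB

# A 100TB archive: cheap while cold, costly once you start reading it.
for read_tb in (0, 1, 10, 100):
    cost = monthly_cost(stored_gb=100_000, read_gb=read_tb * 1_000)
    print(f"read {read_tb:>3} TB/month -> ${cost:,.2f}")
```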
On-site object and file storage
Clustered NAS and object storage are well suited for very large volumes of unstructured data. If anything, object storage is even better suited to large amounts of data due to its superior scalability.
File-based storage is based on a file system and a hierarchical tree-like structure. This can cause performance overheads as the file system is traversed. Object storage, by contrast, is based on a flat structure with objects/files having a unique ID that makes them easy to access.
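That contrast is visible in an S3-style API, sketched below with boto3; the bucket and key names are hypothetical, and credentials are assumed to be configured. Despite the slashes, there is no directory tree to traverse: the key is a flat, unique ID.

```python
import boto3  # pip install boto3; assumes AWS credentials are configured

s3 = boto3.client("s3")

# Hypothetical bucket and key. "2024/ads/banner.jpg" is not a nested
# directory path -- it is a single flat key, and the object is fetched
# directly by that unique ID, with no file system traversal.
obj = s3.get_object(Bucket="example-unstructured-data",
                    Key="2024/ads/banner.jpg")
print(obj["ContentLength"], "bytes")
```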
On-site storage can allay concerns about data security and availability, and can potentially be less expensive than putting data in the cloud.
Either set of protocols, file or object, is suitable for storing unstructured data.
Add flash for quick access
It is quite possible to build adequately performing file and object storage on-site using spinning disk. At the capacities required, HDD is usually the cheapest option.
But advances in flash memory manufacturing have made high-capacity solid-state storage available, and storage array manufacturers have begun using it in hardware with file and object storage capabilities.
This is QLC – quad-level cell – flash. It stores four bits per flash cell to provide higher storage density, and therefore a lower cost per GB, than any other commercially usable flash today.
The trade-off with QLC, however, is reduced endurance, which makes it better suited to large volumes of data that are accessed less frequently.
But flash speed is particularly well suited to unstructured use cases, such as analytics, where fast processing and therefore fast I/O are needed, and cases where customers want to restore large data sets from backups after a ransomware attack, for example.
Storage hardware vendors that sell QLC-based arrays suitable for file and, in some cases, object storage include:
- Dell EMC, with PowerScale, which includes the (partially) rebranded EMC Isilon scale-out NAS and offers S3 object access. Its all-flash NVMe QLC options (hybrid flash is also available) come in a range of capacities that scale to tens of PB.
- NetApp, which recently launched a new family of QLC flash arrays, the C-series, aimed at higher-capacity use cases that also need the speed of SSDs. The C-series starts with three options, the C250, C400, and C800, which scale to 35PB, 71PB, and 106PB respectively. Access to object storage is possible, but limited, via the S3 protocol in the NetApp Ontap operating system.
- Pure Storage, whose FlashArray//C provides all-QLC NVMe-attached flash in two models, the //C40 and //C60, with capacities in the PB range. Meanwhile, Pure’s FlashBlade//S family is explicitly marketed as “fast file and object” storage, with NVMe QLC in its proprietary modules, in two models: the S200 emphasizes capacity with data reduction, while the S500 prioritizes performance.