Q&A: Best Practices for Working with Unstructured Data
What is unstructured data and how should IT search on and store it?
For perspective on this increasingly important type of data, we turned to Bill Roth, who is responsible for LogLogic’s worldwide marketing operation. Bill has more than 20 years of experience as a software operations executive. LogLogic specializes in IT data management.
Enterprise Systems: What is unstructured IT data?
Bill Roth: Unstructured data is everywhere. It is any data that does not fit into some kind of database, and it comes in many forms. It can show up in traditional forms such as Web-server logs and Unix/Linux system logs, but it can also come from network devices such as switches, routers, and firewalls, or from mainframe sources such as the RACF subsystem on IBM mainframes. Several surveys by industry groups have confirmed that about 30 percent of an enterprise’s storage is occupied by this kind of data.
How is it possible to classify unstructured IT data? What are some of the classifications that enterprises typically apply to unstructured data?
Not only is it possible to classify unstructured IT data, but it is not as difficult as you might think. Much of the unstructured data is clear text, and it can be read, indexed, compressed, and stored very easily. Most organizations classify their data by its origin. System log files can be kept together, as can SCADA data, device telemetry, and other forms of streaming data.
How do you search for unstructured IT data efficiently?
Three levels of searching are possible. All three of these methods are usable in modern-day IT data management systems.
The first level of search is “simple string” search. This is akin to a Google-like search, where users type in a string of keywords and the collected data is searched.
The second level of search is known as “regular-expression” search, where users enter a general pattern (via a pattern language) of what they are looking for, and the system searches based on this query.
The third level of searching is via a complex query language such as SQL. This requires that the unstructured data be “structured” somewhat, and that the end-user knows some of the structure and context of the data.
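The three levels described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log lines, the function names, and the five-column table layout are invented here and do not describe any particular product. The sketch uses only the standard `re` and `sqlite3` modules.

```python
import re
import sqlite3

# Hypothetical sample log lines, invented for illustration only.
LOG_LINES = [
    "2024-01-05 10:00:01 sshd ERROR failed login for user alice from 10.0.0.5",
    "2024-01-05 10:00:07 httpd INFO GET /index.html 200",
    "2024-01-05 10:01:13 sshd INFO accepted login for user bob from 10.0.0.9",
    "2024-01-05 10:02:44 httpd ERROR GET /admin 500",
]

def string_search(lines, keyword):
    """Level 1: keep lines that contain the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [line for line in lines if needle in line.lower()]

def regex_search(lines, pattern):
    """Level 2: keep lines that match a regular expression."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]

def load_into_sql(lines):
    """Level 3: give the lines some structure so SQL can query them.

    Assumes every line has the form: day, clock time, source, level, message.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE logs (day TEXT, clock TEXT, source TEXT, level TEXT, message TEXT)"
    )
    for line in lines:
        day, clock, source, level, message = line.split(" ", 4)
        conn.execute(
            "INSERT INTO logs VALUES (?, ?, ?, ?, ?)",
            (day, clock, source, level, message),
        )
    return conn

conn = load_into_sql(LOG_LINES)
errors_by_source = conn.execute(
    "SELECT source, COUNT(*) FROM logs WHERE level = 'ERROR' "
    "GROUP BY source ORDER BY source"
).fetchall()
```

The SQL step works only because the sketch first splits each line into named columns, which is the partial "structuring" the third level requires. The first two levels operate on the raw text as it is.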
What are some of the issues around storage of unstructured IT data?
The key issues revolve around compression, retrieval, and immutability. Because the bulk of this kind of data is clear text, it can be easily compressed, and should be. With the advent of modern compression technology, the data may also be read in place, without the need to decompress and re-compress on every access. To do this efficiently, it is advisable to make sure the indices are also available to facilitate scanning the data.
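As a rough illustration of compressing clear-text data and still searching it without unpacking it to disk, the sketch below uses Python's standard `gzip` module. The file name, the sample lines, and the function names are hypothetical. The sketch scans the whole file and does not build the index the answer recommends, so treat it as a starting point rather than an efficient design.

```python
import gzip
import os
import tempfile

def write_compressed(path, lines):
    """Store clear-text log lines as a gzip file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")

def search_compressed(path, keyword):
    """Stream the gzip file line by line and return the matching lines.

    The file on disk stays compressed; lines are decompressed in memory
    only as they are read.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if keyword in line]

# Hypothetical, highly repetitive log data invented for illustration.
sample = [
    f"2024-01-05 10:00:{i % 60:02d} httpd INFO GET /index.html 200"
    for i in range(1000)
]
sample.append("2024-01-05 10:17:00 httpd ERROR GET /admin 500")

path = os.path.join(tempfile.mkdtemp(), "httpd.log.gz")
write_compressed(path, sample)

raw_size = sum(len(line) + 1 for line in sample)   # bytes of clear text
compressed_size = os.path.getsize(path)            # bytes on disk
matches = search_compressed(path, "ERROR")
```

Comparing `raw_size` with `compressed_size` shows how much space repetitive log text gives back, while `search_compressed` never writes a decompressed copy of the file.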
This data can also be used in a legal or regulatory context, and as a result, it needs to be stored in a way that proves it has not been tampered with. This is known as “immutability.” It is generally handled by computing a key, such as a checksum or cryptographic hash, that represents the exact contents of the data when it was created. This key can then be stored in a secure location and compared to the unstructured data when needed.
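One common way to produce such a key is a cryptographic hash of the data. The interview does not name a specific technique, so that is an assumption here. The sketch below uses SHA-256 from Python's standard `hashlib` module; the function names and the sample log entry are hypothetical.

```python
import hashlib
import hmac

def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest (hex string) of the data as first stored."""
    return hashlib.sha256(data).hexdigest()

def is_unmodified(data: bytes, stored_fingerprint: str) -> bool:
    """Recompute the digest and compare it with the securely stored one."""
    return hmac.compare_digest(fingerprint(data), stored_fingerprint)

# Hypothetical log entry, invented for illustration.
original = b"2024-01-05 10:00:01 sshd ERROR failed login for user alice\n"
stored_key = fingerprint(original)   # keep this in a separate, secure location

tampered = original.replace(b"alice", b"bob")
```

Here `is_unmodified(original, stored_key)` returns `True`, while the same check on `tampered` returns `False`, since changing even one byte produces a different digest. The check is only as trustworthy as the place where `stored_key` is kept.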
What are some of the mistakes IT makes when storing unstructured data?
The biggest mistakes IT organizations make are not collecting enough data and not storing it long enough. More often than not, unstructured data is ignored by IT organizations. Much of it is log data that developers included to provide an audit trail describing what the system, program, application, or daemon actually did. Much of this data is in non-standard places, but as modern-day systems develop, it is starting to show up in more standard places.
The problem with IT data is that it is impossible to know when you will need it. Most IT organizations retain data only as long as they have to, but archiving or deleting the data makes it unusable for any future forensics investigation or search activity.
What best practices can you suggest to avoid these problems?
There are some key best practices that IT organizations should follow when making decisions about storing unstructured IT data. First, organizations should “over-provision.” We have seen that organizations routinely underestimate the kinds of data they need to record. Second, organizations need to keep data for longer than the required period. Retaining data for the period required by policy is useful, but you can never know when the data may actually be needed.