Bringing Structure to Unstructured Data

    If it were up to most enterprise executives today, they would already have production-level Big Data and Internet of Things (IoT) infrastructure in place. Too bad it’s not that simple. It turns out that the biggest hurdle blocking this digital transformation is how to manage all of the unstructured data that these fast-moving applications are expected to generate.

    The challenges to managing unstructured data are not unknown to the enterprise. To date, however, the process of finding and parsing the wealth of information hidden in emails, chat rooms and other forms of communication has been too cumbersome to make it a serious consideration. If no one can access this data, then it represents neither an advantage nor a disadvantage to anyone.

    But that view is changing, primarily because IoT and advanced analytics require that this knowledge be put to good use. As some of the latest advances in unstructured data management illustrate, this is not only a matter of storage but of deep-dive data analysis and conditioning.

According to Gartner, object storage and distributed file systems are at the vanguard of efforts to bring structure to unstructured data. Both established vendors and startups are looking for the secret sauce that allows scale-out clustered file systems and object storage systems to meet the cost and scalability requirements of emerging workloads. Distributed file systems, of course, provide fast access to data for multiple hosts simultaneously, while object storage provides the RESTful APIs that accommodate cloud platforms like OpenStack and AWS. Together, these solutions can offer the enterprise a relatively cheap and painless way to build the proper infrastructure to take on heavy data loads.
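The object-storage model mentioned above differs from a file system in a few concrete ways: a flat key namespace instead of directories, metadata stored alongside each object, and simple PUT/GET/DELETE operations that map onto RESTful verbs. A toy in-memory sketch (class and method names are illustrative, not any vendor's actual API) might look like this:

```python
# Toy sketch of object-storage semantics: a flat key space,
# per-object metadata, and PUT/GET/DELETE/LIST operations mirroring
# the RESTful verbs that S3-style stores expose over HTTP.
# All names here are invented for illustration.

class ObjectStore:
    def __init__(self):
        self._objects = {}  # key -> (data bytes, metadata dict)

    def put(self, key, data, metadata=None):
        """PUT: store an object and its metadata under a flat key."""
        self._objects[key] = (data, dict(metadata or {}))

    def get(self, key):
        """GET: return (data, metadata); raises KeyError, like a 404."""
        return self._objects[key]

    def delete(self, key):
        """DELETE: remove the object if present."""
        self._objects.pop(key, None)

    def list_keys(self, prefix=""):
        """LIST: keys are flat; 'directories' are just key prefixes."""
        return sorted(k for k in self._objects if k.startswith(prefix))

store = ObjectStore()
store.put("logs/2017/app.log", b"error: disk full",
          {"content-type": "text/plain"})
store.put("logs/2017/web.log", b"GET /index 200",
          {"content-type": "text/plain"})
print(store.list_keys("logs/"))
# -> ['logs/2017/app.log', 'logs/2017/web.log']
```

Because the namespace is flat and every operation is stateless, this model scales out horizontally far more easily than a POSIX file hierarchy, which is the property that makes it a fit for heavy, cloud-scale data loads.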

    Capturing and storing data is one thing; turning it into a useful resource is quite another, and this process becomes exponentially more difficult once we enter the fast-moving, dynamic world of the IoT. This is why many organizations are turning to smart productivity assistants, says Forbes contributor Steve Olenski. With tools like Slack and WorkChat making collaboration ever easier, knowledge workers need a way to keep all of their digital communications organized, preferably without spending hours of their day doing it themselves. Using smart assistants like Findo and Yva, employees can not only automate their mailboxes but also track and categorize data using highly contextual reference points and metadata, opening up the possibility of finding opportunities that would otherwise go unnoticed.
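The underlying idea is straightforward: attach contextual metadata to each message as it arrives so it can later be retrieved by facet rather than by scrolling. A minimal sketch of that kind of tagging and faceted lookup (the keyword list and field names are made-up examples, not how Findo or Yva actually work):

```python
# Minimal sketch of assistant-style metadata tagging: extract
# contextual reference points (sender, topic keywords) from each
# message, then retrieve messages by facet. KEYWORDS is an invented
# example vocabulary, not any product's actual model.

KEYWORDS = {"budget", "launch", "renewal"}

def tag_message(sender, subject, body):
    """Derive a small metadata record from a raw message."""
    words = {w.strip(".,!?").lower() for w in (subject + " " + body).split()}
    return {"sender": sender, "topics": sorted(words & KEYWORDS)}

def search(index, topic):
    """Return the ids of all messages tagged with the given topic."""
    return [mid for mid, meta in index.items() if topic in meta["topics"]]

index = {
    1: tag_message("ann@example.com", "Q3 budget review", "Numbers attached."),
    2: tag_message("bob@example.com", "Launch plan", "Budget impact unclear."),
}
print(search(index, "budget"))  # -> [1, 2]
```

Real assistants replace the fixed keyword set with learned, contextual models, but the payoff is the same: messages become queryable records instead of an unstructured pile.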

    A key capability in this process is categorization through text analytics, says Brion Scheidel, of customer experience platform developer MaritzCX. By first defining a set of categories and then assigning text to them, organizations can take the first step toward removing the clutter from unstructured data and putting it into a usable, quantifiable form. The two main methodologies at the moment are rules-based and machine learning, each with its strengths and weaknesses. A rules-based approach provides a clearer definition of what you hope to achieve, and how, while ML provides a more adaptive means of adjusting to changing conditions, although not always in ways that produce optimal results. At the moment, MaritzCX relies exclusively on rules-based analysis.
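The rules-based approach described here boils down to a set of explicit, human-authored tests, one per category, applied to each piece of text. A hedged sketch (the category names and keyword rules are invented examples, not MaritzCX's actual rule set):

```python
# Sketch of rules-based text categorization: each category is
# defined by an explicit, auditable rule (here, simple keyword
# tests), and a text is assigned to every category whose rule
# fires. Rules and categories are invented for illustration.

RULES = {
    "billing":  lambda t: "invoice" in t or "charge" in t,
    "shipping": lambda t: "delivery" in t or "shipping" in t,
    "support":  lambda t: "help" in t or "broken" in t,
}

def categorize(text):
    """Return every category whose rule matches the lowercased text."""
    t = text.lower()
    matched = [name for name, rule in RULES.items() if rule(t)]
    return matched or ["uncategorized"]

print(categorize("I was charged twice on my invoice"))  # -> ['billing']
print(categorize("The delivery arrived broken"))  # -> ['shipping', 'support']
```

The trade-off the paragraph describes is visible in the code: every assignment can be traced to a specific rule, which is the clarity advantage, but the rules never adapt on their own, which is exactly where a machine-learning classifier would trade transparency for flexibility.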

    In time-honored tradition, however, just as technology makes it possible to deal with one intractable problem, it opens the door to even greater complexity. In this case, we have the open-source community starting to look past mere infrastructure and applications to the data itself. The Linux Foundation recently announced a new open-data framework called the Community Data License Agreement (CDLA), which is intended to provide broad access to data that would otherwise be restricted to certain users. In this way, data-based collaborative communities can share knowledge across Hadoop, Spark and other Big Data platforms, adding still more sources of unstructured information that must be captured, analyzed and conditioned in order to be made useful. The same techniques currently being deployed in-house will likely be able to handle these new volumes, but the CDLA has the potential to significantly compound an already daunting challenge.

    The proper conditioning of unstructured data will likely be a key determinant of success in the digital economy. In today’s world, those with the most advanced infrastructure can generally leverage their might to keep competitors at bay. But the steady democratization of infrastructure through the cloud and service-level architectures is quickly leveling the playing field.

    Going forward, the winner won’t be the one with the most resources or even the most data, but the one with the ability to act upon actual knowledge.

    Arthur Cole writes about infrastructure for IT Business Edge. Cole has been covering the high-tech media and computing industries for more than 20 years, having served as editor of TV Technology, Video Technology News, Internet News and Multimedia Weekly. His contributions have appeared in Communications Today and Enterprise Networking Planet and as web content for numerous high-tech clients like TwinStrata and Carpathia. Follow Art on Twitter @acole602.

