Most organizations are still in the early stages of Big Data adoption, and few have thought beyond the technology angle to how Big Data will profoundly impact their processes and their information architecture. Whether Big Data projects are past the pilot stage and deployed in production, or still on the horizon, they require strategic thinking and adequate planning to avoid the now-typical pitfalls that stand in the way of success.
Talend recently identified five key areas companies should monitor to avoid pitfalls that can derail big data projects, and ensure those projects generate value as they move past the early pilot stages.
Big Data is large – and small. It’s extremely diverse in origin, in style, in consistency and in quality. Some organizations in certain industries are dealing with massive data volumes, while others have much smaller data sets to exploit, but might have a broader variety of sources and formats. Make sure you go after the “right” data: Identify all the sources that are relevant, and don’t be embarrassed if you don’t need to scale your data computing cluster to hundreds of nodes right away!
Some of the data you need for your Big Data projects is clearly identified, such as transactional data used or generated by business applications. Far more of it, however, is hidden in log files, manufacturing systems, desktops or various servers; this is what we call "dark data." Some of it is even going to waste in the exhaust fumes of IT. This "exhaust data" from sensors and logs is purged after a certain amount of time, or never stored in the first place. All of it is potentially relevant. Don't restrict your project to the first category: Inventory dark data, and deploy collection mechanisms for exhaust data, so that they become value contributors as well.
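A collection mechanism for exhaust data can be quite simple: parse ephemeral logs and persist the useful fields somewhere durable before they are rotated or purged. The sketch below, in Python, assumes a hypothetical sensor-log line format; the field names and pattern are illustrative, not a real product's format.

```python
import csv
import re

# Hypothetical pattern for sensor log lines such as:
#   2014-03-01T12:00:00 sensor=42 temp=71.3
LINE_RE = re.compile(
    r"(?P<ts>\S+)\s+sensor=(?P<sensor>\d+)\s+temp=(?P<temp>[\d.]+)"
)

def collect_exhaust(log_path, out_path):
    """Parse ephemeral sensor logs and persist them as CSV rows
    before the originals are rotated or purged."""
    with open(log_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["timestamp", "sensor_id", "temperature"])
        for line in src:
            match = LINE_RE.match(line)
            if match:
                writer.writerow([match.group("ts"),
                                 match.group("sensor"),
                                 match.group("temp")])
```

In practice a scheduled job like this (or a log shipper) would run on each source system, landing the extracted rows in HDFS or a warehouse staging area.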
Too many organizations, looking for ways to break down data silos, respond by bringing all the data together in one central place. While Hadoop is an excellent storage resource for large amounts of data (and is itself distributed across clusters), you need to think "distribution" beyond Hadoop. It's not always necessary to duplicate and replicate everything. Some data is already readily available in the enterprise data warehouse, with fast, random access. Some of it might be better off residing where it was produced. The "logical data warehouse" concept applies well in the non-Big Data world. Leverage it for Big Data.
Hadoop is not only a receptacle for Big Data with its distributed file system; it is also an engine with incredible potential to process data and extract meaningful information. A broad ecosystem of tools and programming paradigms exists that covers all use cases of data manipulation. From MapReduce to YARN, from Pig to HiveQL complemented by Impala, Stinger or Drill, or through the merging of Hadoop and SQL in engines like HAWQ, there are processing resources available that make it unnecessary to get data out of the platform. All the resources are there, at your fingertips.
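The MapReduce paradigm mentioned above can be sketched in a few lines of Python, in the style of a Hadoop Streaming job: a mapper emits (key, 1) pairs and a reducer sums them per key, so the raw data is processed where it lives rather than exported. The log format and field positions here are assumptions for illustration.

```python
from itertools import groupby

def mapper(lines):
    """Emit (log_level, 1) for each record; assumes the level is the
    second whitespace-separated field, e.g. '2014-03-01T12:00 ERROR ...'."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[1], 1

def reducer(pairs):
    """Sum counts per key. In a real Hadoop Streaming job the framework
    delivers mapper output grouped and sorted by key; we sort here to
    simulate that shuffle phase."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

# In the cluster, mapper and reducer would be wired up with the
# hadoop-streaming.jar launcher; locally we can chain them directly:
records = ["t1 ERROR disk full", "t2 INFO ok", "t3 ERROR timeout"]
counts = dict(reducer(mapper(records)))
```

The same aggregation could be expressed declaratively as a HiveQL `GROUP BY` or a Pig script; the point is that every one of these options runs inside the cluster.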
Sandboxes are fine for proofs of concept, but when Big Data projects go live, they need to be an integral part of the overall IT infrastructure and information architecture. You need to connect Big Data applications to other systems, upstream and downstream. Big Data must also become part of your IT and information governance policy.