Two major trends are causing organizations to rethink how they approach analytics.
Big data. First, data volumes are exploding. More than a decade ago, Wayne Eckerson, director of TDWI Research, participated in the formation of the Data Warehouse Terabyte Club, which highlighted the few leading-edge organizations whose data warehouses had reached or exceeded a terabyte in size. Today, the notion of a terabyte club seems quaint, as many organizations have blasted through that threshold. In fact, he contends, it is now time to start a petabyte club, since a handful of companies, including Internet businesses, banks, and telecommunications companies, have publicly announced that their data warehouses will soon exceed a petabyte of data.
Deep analytics. Second, organizations want to perform “deep analytics” on these massive data warehouses. Deep analytics ranges from statistics — such as moving averages, correlations, and regressions — to complex functions such as graph analysis, market basket analysis, and tokenization. In addition, many organizations are embracing predictive analytics by using advanced machine learning algorithms, such as neural networks and decision trees, to anticipate behavior and events. Whereas in the past, organizations may have applied these types of analytics to a subset of data, today they want to analyze every transaction. The reason: profits.
For Internet companies, the goal is to gain insight into how people use their websites so they can enhance visitor experiences and provide advertisers with more granular targeted advertising. Telecommunications companies want to mine millions of call detail records to better predict customer churn and profitability. Retailers want to analyze detailed transactions to better understand customer shopping patterns, forecast demand, optimize merchandising, and increase the lift of their promotions.
In all cases, there is an opportunity to cut costs, increase revenues, and gain a competitive advantage. Few industries today are immune to the siren song of analyzing big data.
This slideshow features a basic set of guidelines from TDWI and Aster Data for implementing big data analytics.
There are many reasons organizations are embracing big data analytics.
Data volumes. First, they are accumulating large volumes of data. According to TDWI Research, data warehouse data volumes are expanding rapidly. In 2009, 62% of organizations had less than 3 TB of data in their data warehouses. By 2012, 59% of those organizations estimate they will have more than 3 TB in their data warehouses, and 34% estimate they will have more than 10 TB.
Business value. Second, organizations see value in analyzing this detailed information, which encourages them to collect more data. The more data an organization collects, the more patterns and insights it can mine.
Sustainable advantage. Finally, companies see big data analytics as one of the last frontiers of achieving competitive advantage. Analytics offers a sustainable advantage because it harnesses information and intelligence, things that are unique to each organization and cannot be commoditized.
Exploration and analysis. One type of analytics is exploration and analysis. This approach involves navigating historical data in a top-down, deductive manner. To start, an analyst needs to have an idea of what is causing a high-level trend or alert. In other words, the analyst must start with a hypothesis and deduce the cause by exploring the data with ad hoc query tools, OLAP tools, Excel, or SQL. Here, the burden is on the business analyst to sort through large volumes of data and find the needle in the haystack. This type of analytics has been around for a long time and constitutes the bulk of activity done by business analysts.
Prediction and optimization. Another type of analytics is prediction and optimization. Although the algorithms that power these analyses have existed for decades, only a small number of commercial organizations have implemented them. Here, business users model historical data in a bottom-up, inductive manner. They apply data mining tools to create statistical models that identify patterns and trends in the data, which can then be used to predict the future and optimize business processes. Rather than starting with a hypothesis, you let the tools discover the trends, patterns, and outliers on your behalf. (In reality, however, it takes some knowledge of the business process and data to apply these tools with reliable accuracy.)
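As a rough illustration of the inductive approach, the sketch below fits a straight-line trend to historical values and extrapolates one period ahead. It is only a stand-in for a real data mining tool: the function names and the monthly sales figures are hypothetical, and an ordinary least-squares line is assumed as the simplest possible model.

```python
def fit_trend(values):
    """Fit y = intercept + slope * x by ordinary least squares, x = 0, 1, 2, ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Extrapolate the fitted line to period x."""
    return intercept + slope * x


# Hypothetical monthly sales figures, for illustration only.
monthly_sales = [10.0, 12.0, 14.0, 16.0]
slope, intercept = fit_trend(monthly_sales)
next_month = predict(slope, intercept, len(monthly_sales))
```

No hypothesis about the cause of the trend is supplied here. The model is derived from the data alone, which is also why the caveat about business knowledge applies: the line will extrapolate happily even when the underlying process has changed.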
Companies need to create a scalable architecture that supports big data analytics from the outset and uses existing skills and infrastructure where possible. To do this, many companies are implementing new, specialized analytical platforms designed to accelerate query performance when running complex functions against large volumes of data. Compared to traditional query processing systems, these platforms are easier to install and manage and offer a better total cost of ownership, sometimes costing as little as $10,000 per terabyte.
These systems come in a variety of flavors and sizes. There are data warehousing appliances, which are purpose-built, hardware-software solutions; massively parallel processing (MPP) databases running on commodity servers; columnar databases; and distributed file systems running MapReduce and other non-SQL types of data processing languages. Sometimes companies employ multiple types to address processing requirements. For instance, comScore, an online market research firm, uses Hadoop to acquire and transform Web log files and Aster Data’s nCluster database for analysis.
Many companies are now rethinking traditional approaches to performing analytics. Instead of downloading data to local desktops or servers, they are running complex analytics in the database management system itself. This so-called “in-database analytics” minimizes or eliminates data movement, improves query performance, and optimizes model accuracy by enabling analytics to run against all data at a detailed level instead of against samples or summaries.
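The difference in data movement can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a description of any analytical platform: SQLite stands in for the database, and the sales table, its columns, and its values are hypothetical.

```python
import sqlite3

# Hypothetical transaction table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("a", 10.0), ("a", 30.0), ("b", 5.0), ("b", 15.0), ("b", 40.0)],
)


def average_outside_database(conn):
    """Extract every detail row, then aggregate in the client program."""
    totals = {}
    rows_moved = 0
    for customer, amount in conn.execute("SELECT customer, amount FROM sales"):
        rows_moved += 1
        count, total = totals.get(customer, (0, 0.0))
        totals[customer] = (count + 1, total + amount)
    averages = {c: total / count for c, (count, total) in totals.items()}
    return averages, rows_moved


def average_in_database(conn):
    """Let the database aggregate; only one row per customer leaves it."""
    result = conn.execute(
        "SELECT customer, AVG(amount) FROM sales GROUP BY customer"
    ).fetchall()
    return dict(result), len(result)


outside, rows_moved_outside = average_outside_database(conn)
inside, rows_moved_inside = average_in_database(conn)
```

Both paths produce the same averages, but the first moves every detail row out of the database while the second moves only one summary row per customer. With five rows the gap is trivial; against billions of transactions it becomes the dominant cost.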
Many analytic computations are recursive in nature, which requires multiple passes through the database. Such computations are difficult to write in SQL and expensive to run in a database management system. Thus, today most analysts first run SQL queries to create a data set, which they download to another platform, and then run a procedural program written in Java, C, or some other language against the data set. Next, they often load the results of their analysis back into the original database.
This two-step process is time-consuming, expensive, and frustrating. Fortunately, techniques like MapReduce make it possible for business analysts, rather than IT professionals, to custom-code database functions that run in a parallel environment. With embedded functions, new analytical databases will accelerate the development and deployment of complex analytics against big data.
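The shape of a MapReduce job can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases only; it assumes nothing about Hadoop or any vendor's in-database implementation, and the function names and log format are hypothetical. In a real system, the framework runs the mapper and reducer in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


# The analyst writes only these two functions.
def page_mapper(line):
    """Emit (page, 1) for each hypothetical 'user page' log line."""
    _user, page = line.split()
    yield page, 1


def count_reducer(_page, counts):
    """Sum the per-line counts for one page."""
    return sum(counts)


log_lines = ["u1 /home", "u2 /home", "u1 /cart", "u3 /home"]
page_views = reduce_phase(
    shuffle(map_phase(log_lines, page_mapper)), count_reducer
)
```

The appeal for analysts is visible in the split: the custom logic lives in two small functions, while grouping and distribution are left to the platform.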
Know your users. First, understand who is performing the analysis, what type of analysis it is, and how much data it requires. If a power user wants to explore departmental data, then all they might need is an ad hoc query or OLAP tool and a data mart. If it’s an IT person creating a complex standard report with sophisticated metrics and functions, then it’s likely they can use a scalable BI tool running against an enterprise data warehouse. If a business analyst wants to run ad hoc queries or apply complex analytical functions against large volumes of detailed data without DBA assistance, then you probably need a specialized analytical database that supports analytical functions.
Performance and scalability. Second, understand your performance and scalability requirements. What query response times are required to make various types of analyses worth doing? If you have to wait days for a result set, then you need to upgrade your existing data warehousing environment, offload these queries to a specialized analytical platform, or reduce the amount of data by aggregating it or shortening the time span of the analysis.
In-database analytics. Third, evaluate your need for in-database analytics. If complex models or analytics drive a critical portion of your business, then it’s likely you can benefit from creating and scoring these models in the data warehouse rather than in a secondary system.
Other. Finally, investigate whether the analytic database integrates with existing tools in your environment, such as ETL, scheduling, and BI tools. If you plan to use it as an enterprise data warehouse replacement, find out how well it supports mixed workloads, including tactical queries, strategic queries, and inserts, updates, and deletes.