As with cloud computing, Big Data is a hot topic around the business right now. Expectations for what it can achieve often reach into the stratosphere. At the same time, the overwhelming amount of data involved is daunting and can seem like an insurmountable obstacle. So what is Big Data, where does it come from, and what can you actually do with it? ThreatTrack Security has identified eight facts and eight fictions regarding Big Data.
Fact #1: You are Big Data. Much of the world’s Big Data is created as metadata from users’ smartphones and GPS traffic.
Every day you create metadata with smartphones that enable GPS location services. Every picture you take, every website you visit, every route you map creates metadata that is stored and available for analysis. With more than five billion mobile phones in use, including more than one billion smartphones in 2012, according to research firm Strategy Analytics, it’s no wonder that many enterprises and government organizations are interested in gleaning valuable content from the information.
Fact #2: Big Data tends to be mined poorly to build ineffective threat analysis algorithms.
With all the metadata that exists, we are only now figuring out how to make sense of it and how to cultivate beneficial data from it. For one, enterprises traditionally haven’t had the resources in place to analyze metadata. As those investments increase, the mining for trends and useful analysis will increase as well.
Fact #3: Big Data is automating tasks that used to involve tedious manual labor.
Software companies are developing better business intelligence tools that not only analyze metadata but also automate tasks, so that businesses can put the data to use more quickly. This makes companies more flexible and makes the analysis of Big Data much less costly than in the past.
Fact #4: Big Data is being used to categorize and classify malware more effectively, grouping bad files the same way Google ranks pages.
As more information is gleaned about malware and more analysis picks up on trends, algorithms for categorizing and classifying malware are being developed to help security providers. We at ThreatTrack Security use Big Data in four ways: first, we apply CART (Classification and Regression Trees) for predictive classification of event modifiers; second, we use Shewhart control charts for outlier threat detection; third, we use splines for non-linear exploratory modeling; and fourth, we apply the goodness-of-fit principle to check the stability of historical threat data and to construct a parsimonious model for Advanced Persistent Threats (APTs).
Our case study uses a closed-loop system. It begins by identifying a file or URL, then correlates the available information to determine where the file originally came from, where it was downloaded from, how it entered the company’s data space, what it downloaded, what it installed, its current payload and so on.
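To make the control-chart idea concrete, here is a minimal sketch of a Shewhart-style individuals chart in Python. It is an illustration only, not ThreatTrack's implementation: the data (daily counts of blocked connections for one host), the function names and the choice to estimate sigma from the baseline's sample standard deviation are all assumptions. A moving-range estimate of sigma is a common alternative.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits from an in-control baseline.

    Limits are center +/- k * sigma, with sigma taken as the sample
    standard deviation of the baseline (an assumption of this sketch).
    """
    center = mean(baseline)
    sigma = stdev(baseline)
    return center - k * sigma, center, center + k * sigma


def out_of_control(values, lower, upper):
    """Return (index, value) pairs that fall outside the control limits."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical data: daily counts of blocked connections for one host.
baseline = [102, 98, 101, 97, 103, 99, 100, 104, 96, 100]
new_days = [101, 99, 131, 98, 70]

lower, center, upper = control_limits(baseline)
alerts = out_of_control(new_days, lower, upper)
print(alerts)  # days whose counts fall outside the 3-sigma limits
```

With these made-up numbers, the third and fifth new days fall outside the limits and would be flagged for review; the other days are treated as ordinary variation.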
Fact #5: Big Data theory is moving faster than the reality of what an enterprise is capable of from both a technology and manpower standpoint.
Since much of Big Data is derived from user-centric behavior and usage, it moves a lot faster than what an enterprise typically generates from its application systems. About 70 percent of the digital universe has been created by individuals, not corporations.
Big Data theory is moving faster than practice primarily because it is still working out how to manage such enormous amounts of data. Oceans of data will be created between now and the year 2020, resulting in a 4,300 percent increase in annual data generation as the macro drivers of user-generated data, along with the shift from analog to digital media, propel us to the next frontier.
The Big Data tsunami is also forcing technologies to modernize. Conventional RDBMSs, and the NoSQL databases that followed them, are insufficient for data at this scale, which cannot be accessed by direct record access methods. The current technology of choice is not a conventional RDBMS but a MapReduce framework such as Hadoop, which runs on a distributed hardware substrate.
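For readers unfamiliar with the map-reduce model, the following single-process Python sketch shows its three steps: map each record to a key/value pair, group the pairs by key, and reduce each group to a summary. It does not use Hadoop's API; the event records and function names are hypothetical, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(records):
    """Map step: emit a (key, 1) pair for every download event."""
    for rec in records:
        yield rec["host"], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each group to a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical download events; hosts and file names are made up.
events = [
    {"host": "mail.example", "file": "invoice.exe"},
    {"host": "cdn.example", "file": "update.msi"},
    {"host": "mail.example", "file": "report.scr"},
    {"host": "files.example", "file": "tool.zip"},
    {"host": "cdn.example", "file": "patch.msi"},
    {"host": "mail.example", "file": "invoice.exe"},
]

counts = reduce_phase(shuffle(map_phase(events)))
print(counts)
```

The reduced result has one entry per host no matter how many raw events went in, which is why summarized data is usually far smaller than the data it was derived from.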
Fact #6: Big Data will create a major shift in visualization of threats within the next three years.
Visualization of objects in excess of a few million in quantity requires thinking differently. For instance, imagine the complexity of modeling huge data sets that are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies, software logs, cameras, microphones, radio-frequency identification readers and wireless sensor networks. Right now, the largest memory requirements for visualizing Big Data working sets can’t be addressed by conventional computing models. That’s why the science of visualization will have to be re-imagined and re-visited in the next three years.
Fact #7: Yesterday’s antivirus endpoints are becoming tomorrow’s real threat-track vectors.
Two assumptions have long held. First, antivirus is a well-understood and mature market in which users have a reasonable idea of what spam is, what viruses are and how to remediate them using up-to-date antivirus software. Second, the conventional endpoint of a corporation is the edge of its network. Both paradigms are now outdated. Antivirus alone is not enough to protect against Advanced Persistent Threats; enterprises need anti-threat software to combat sophisticated attacks that find their way in through endpoints in new and creative ways. And, as Fact #8 explains, the endpoint is no longer the edge of the network.
Fact #8: Yesterday’s endpoints of the brick-and-mortar enterprise have shifted to the users, with the proliferation of the BYOD paradigm where user devices are the real endpoints.
This extends the point made in Fact #7. With BYOD becoming the norm in the corporate environment, the truly vulnerable endpoint of the enterprise has turned out to be the handheld smartphone. Every additional smartphone that connects to corporate networks and data is another point of entry that the organization must secure.
Fiction #1: Security companies are equipped to handle the volume and velocity of Big Data.
Like many enterprises, security companies are still learning to get their arms around Big Data, and around the theory of Big Data for that matter, while eliminating potential vulnerabilities so that the data remains clean for analysis and production. As the concept of Big Data grows and evolves, security companies must continually grow and evolve too.
Fiction #2: Security developers are easily extracting value from collected data.
The old saying “you don’t know what you don’t know” applies to security developers. Without proper analysis tools in place, security companies aren’t able to extract valuable content from the collected data. Only with those analysis tools, algorithms and applications can developers truly garner valuable insight from collected data.
Fiction #3: Analytics technology is ready-made for security.
To borrow the phrase “finding a needle in a haystack,” analytics is useless in haystacks of data that contain no needles to begin with. The hype has led us to create massive data stacks with poor references (or indices) around them. Any data analyst will attest that a well-indexed smaller data set yields better analytics than a larger data set with poor indices.
Fiction #4: Leveraging Big Data in a security context is as simple as using it for any generalized purpose.
Successfully leveraging Big Data must first address the point in Fiction #3: analytics is not ready-made for security. Establishing a security context is the next problem. Security context can be established by connecting the relationships between data sets (after map reducing the data itself) to reveal valuable insights in patterns that were not previously correlated or compared. Mining for trends requires that the data first be managed coherently. Similarly, mining for relationships requires that the trends be understood. Only after the data has been map reduced, and the trends in it understood, can you mine for relationships among the trends of the map-reduced data farms. Only when all of these prerequisites are in place can you establish the big security context of Big Data.
Think of security context as the metadata fabric of relationships, which is a lot more powerful and useful for visualizing risks, threats and predictive analytics.
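As a rough illustration of that last step, the Python sketch below starts from already-reduced per-file summaries and links files that share an attribute, such as a download source. The summaries, field names and helper function are hypothetical, chosen only to show how relationships can be mined from summarized data rather than from the raw events.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical reduced records: one summary per file.
summaries = {
    "file_a": {"source": "cdn1.example", "installs": "svc.dll"},
    "file_b": {"source": "mail.example", "installs": "helper.exe"},
    "file_c": {"source": "cdn1.example", "installs": "helper.exe"},
}


def related_pairs(summaries, key):
    """Return the set of file pairs that share the same value for `key`."""
    by_value = defaultdict(list)
    for name, info in summaries.items():
        by_value[info[key]].append(name)
    pairs = set()
    for names in by_value.values():
        # Every pair of files in the same group is related through `key`.
        pairs.update(combinations(sorted(names), 2))
    return pairs


same_source = related_pairs(summaries, "source")
same_install = related_pairs(summaries, "installs")
print(same_source, same_install)
```

Here `file_a` and `file_c` are linked by a shared source, while `file_b` and `file_c` are linked by what they install. Taken together, the links form a small graph of relationships of the kind described above.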
Fiction #5: Big Data will cause major change in the security industry within the next year.
Big Data by itself won’t cause major change in the security industry. Instead, the major change will come from detecting anomalies that can be identified as advanced security attacks. The two concepts will then work in concert to realize value for enterprises.
Fiction #6: There is a widespread belief that Big Data sets offer a higher form of intelligence that can generate insights that were previously impossible.
That’s not true by itself. We need more algorithms that can offer more intelligence, not bigger data sets. The two kinds of algorithms are: Bayesian algorithms, which deal with prior occurrences, and predictive analytics, which is forward facing. Looking at the future, big context in security is going to be more innovative than Big Data in security.
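To illustrate how a Bayesian calculation uses prior occurrences, the short Python sketch below applies Bayes' rule to estimate the probability that a file is malicious given that a detector raised an alert. All of the rates are hypothetical, chosen only for illustration.

```python
def posterior(prior, p_alert_given_malicious, p_alert_given_benign):
    """Bayes' rule: probability a file is malicious given an alert."""
    evidence = (p_alert_given_malicious * prior
                + p_alert_given_benign * (1.0 - prior))
    return p_alert_given_malicious * prior / evidence


# Hypothetical rates: 1% of files are malicious, the detector fires on
# 90% of malicious files and on 5% of benign files.
p = posterior(prior=0.01,
              p_alert_given_malicious=0.90,
              p_alert_given_benign=0.05)
print(f"P(malicious | alert) = {p:.3f}")
```

With these assumed rates the posterior is only about 15 percent, because benign files vastly outnumber malicious ones. The result depends on the prior and the quality of the detector, not on how much data was collected.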
Fiction #7: Big Data searched with dumb algorithms fails to yield what little data can yield using smarter algorithms.
The concept of Big Data should be about the algorithms and not about the data itself. Better precision and better searching techniques will trap the breaches. Better algorithms and smaller data stacks will provide more value than weaker algorithms and bigger data stacks. The better net will catch better stuff.
Fiction #8: Most data scientists have experience with Big Data. This isn’t true because much of Big Data isn’t used directly; rather, it is summarized or “map reduced” before being analyzed, and the summarized result is often not very big.
Data science is a new branch of study. Most data scientists fall into two groups that compete for data scientist jobs: statisticians turned programmers and programmers turned statisticians. While they are used to working with data sets and map-reduced data sets, they are less used to working with user profile data, which is what Big Data primarily consists of.