Analysis of data fed into data lakes promises to provide enormous insights for data scientists, business managers, and artificial intelligence (AI) algorithms. However, governance and security managers must also ensure that the data lake conforms to the same data protection and monitoring requirements as any other part of the enterprise.
To enable data protection, data security teams must ensure only the right people can access the right data and only for the right purpose. To help the data security team with implementation, the data governance team must define what “right” means in each context. For an application with the size, complexity, and importance of a data lake, getting data protection right is a critical challenge.
From Policies to Processes
Before an enterprise can worry about data lake technology specifics, the governance and security teams need to review the current policies for the company. The various policies regarding overarching principles such as access, network security, and data storage will provide basic principles that executives will expect to be applied to every technology within the organization, including data lakes.
Some changes to existing policies may need to be proposed to accommodate the data lake technology, but the policy guardrails are there for a reason — to protect the organization against lawsuits, breaking laws, and risk. With the overarching requirements in hand, the teams can turn to the practical considerations regarding the implementation of those requirements.
Data Lake Visibility
The first requirement to tackle for security or governance is visibility. In order to develop any control or prove control is properly configured, the organization must clearly identify:
- What is the data in the data lake?
- Who is accessing the data lake?
- What data is being accessed by whom?
- What is being done with the data once accessed?
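The answers to these four questions ultimately come from audit records. As a rough sketch (the field names are illustrative, not any vendor's schema), each access event should capture at least the who, what, and how:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AuditEvent:
    """One access record answering the four visibility questions above."""
    principal: str       # who accessed the data lake
    resource: str        # what data was accessed
    action: str          # what was done with the data (read, write, export)
    classification: str  # what the data is, per the classification scheme
    timestamp: datetime  # when the access occurred

# Hypothetical example record
event = AuditEvent(
    principal="analyst@example.com",
    resource="s3://example-lake/finance/budget.parquet",
    action="read",
    classification="internal",
    timestamp=datetime.now(timezone.utc),
)
print(event.principal, event.action, event.resource)
```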
Different data lakes provide these answers using different technologies, but those technologies generally fall into two categories: data classification and activity monitoring/logging.
Data classification determines the value and inherent risk of the data to an organization. The classification determines what access might be permitted, what security controls should be applied, and what levels of alerts may need to be implemented.
The desired categories will be based upon criteria established by data governance, such as:
- Data Source: Internal data, partner data, public data, and others
- Regulated Data: Privacy data, credit card information, health information, etc.
- Department Data: Financial data, HR records, marketing data, etc.
- Data Feed Source: Security camera videos, pump flow data, etc.
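As a simplified illustration of how such categories might be detected, a few regular-expression patterns can flag common sensitive-data types in text fields. The patterns below are hypothetical; real classifiers such as Macie or Purview use far more sophisticated detection than regex matching:

```python
import re

# Illustrative patterns only; production classifiers use richer detection.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){15,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def classify(text: str) -> set[str]:
    """Return the set of sensitive-data categories detected in a text field."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}

print(sorted(classify("Contact jane@example.com, SSN 123-45-6789")))
```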
The visibility into these classifications depends entirely upon the ability to inspect and analyze the data. Some data lake tools offer built-in features, or separately licensed add-ons, that enhance classification capabilities:
- Amazon Web Services (AWS): AWS offers Amazon Macie as a separately enabled tool to scan for sensitive data in a repository.
- Azure: Customers use built-in features of Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics to assign categories, and they can license Microsoft Purview to scan for sensitive data in the dataset such as European passport numbers, U.S. social security numbers, and more.
- Databricks: Customers can use built-in features to search and modify data (compute fees may apply).
- Snowflake: Customers use inherent features that include some data classification capabilities to locate sensitive data (compute fees may apply).
For sensitive data or internal designations not supported by features and add-on programs, the governance and security teams may need to work with the data scientists to develop searches. Once the data has been classified, the teams will then need to determine what should happen with that data.
For example, Databricks recommends deleting personal information from the European Union (EU) that falls under the General Data Protection Regulation (GDPR). This policy would avoid future expensive compliance issues with the EU’s “right to be forgotten” that would require a search and deletion of consumer data upon each request.
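The deletion pattern itself is straightforward. The sketch below uses SQLite as a stand-in for a data lake table (on Databricks, the same DELETE would run as Spark SQL against a Delta table); the table and column names are hypothetical:

```python
import sqlite3

# Stand-in for a data lake table; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "a@example.com", "EU"), (2, "b@example.com", "US")],
)

def forget_customer(conn, email: str) -> int:
    """Delete all records for one data subject (GDPR 'right to be forgotten')."""
    cur = conn.execute("DELETE FROM customers WHERE email = ?", (email,))
    conn.commit()
    return cur.rowcount  # number of rows removed

deleted = forget_customer(conn, "a@example.com")
print(deleted)  # 1
```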
Other common examples for data treatment include:
- Data accessible for registered partners (customers, vendors, etc.)
- Data only accessible by internal teams (employees, consultants, etc.)
- Data restricted to certain groups (finance, research, HR, etc.)
- Regulated data available as read-only
- Important archival data, with no write-access permitted
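Treatments like these can be captured as a policy table that the security team then enforces in code. The mapping below is a hypothetical sketch built from the list above; real implementations would tie into the platform's access-control features:

```python
# Hypothetical policy table: data classification -> handling rules.
HANDLING_POLICY = {
    "partner":   {"audience": "registered_partners", "write": True},
    "internal":  {"audience": "employees",           "write": True},
    "regulated": {"audience": "restricted_groups",   "write": False},  # read-only
    "archival":  {"audience": "employees",           "write": False},  # no write access
}

def may_write(classification: str) -> bool:
    """Deny writes unless the policy explicitly allows them."""
    return HANDLING_POLICY.get(classification, {}).get("write", False)

print(may_write("internal"), may_write("regulated"))  # True False
```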
The sheer size of a data lake can complicate categorization. Initially, data may need to be categorized by its input source, with teams making best guesses about the content until it can be analyzed by other tools.
In all cases, once data governance has determined how the data should be handled, a policy should be drafted that the security team can reference. The security team will develop controls that enforce the written policy and develop tests and reports that verify that those controls are properly implemented.
Activity Monitoring and Logging
The logs and reports provided by data lake tools supply the visibility needed to test and report on data access within a data lake. Monitoring and logging activity within the data lake provides the key evidence to verify that data controls are effective and that no inappropriate access is occurring.
As with data inspection, the tools will have various built-in features, but additional licenses or third-party tools may need to be purchased to monitor the necessary spectrum of access. For example:
- AWS: AWS CloudTrail provides a separately enabled tool to track user activity and events, and AWS CloudWatch collects logs, metrics, and events from AWS resources and applications for analysis.
- Azure: Diagnostic logs can be enabled to monitor API (application programming interface) requests and API activity within the data lake. Logs can be stored within the account, sent to log analytics, or streamed to an event hub. And other activities can be tracked through other tools such as Azure Active Directory (access logs).
- Google: Google Cloud DLP detects different international PII (personal identifiable information) schemes.
- Databricks: Customers can enable logs and direct the logs to storage buckets.
- Snowflake: Customers can execute queries to audit specific user activity.
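Whatever the source, the resulting logs can be analyzed programmatically. The sketch below checks simplified, CloudTrail-style JSON records against a hypothetical authorization map to flag inappropriate access; all field names and resource paths are illustrative:

```python
import json

# CloudTrail-style records, heavily simplified; field names are illustrative.
raw = """[
  {"user": "analyst1", "event": "GetObject", "resource": "lake/finance/q3.csv"},
  {"user": "intern7",  "event": "GetObject", "resource": "lake/hr/salaries.csv"}
]"""

# Hypothetical mapping of resource prefixes to authorized users.
AUTHORIZED = {"lake/finance/": {"analyst1"}, "lake/hr/": {"hr-admin"}}

def unauthorized(events):
    """Yield events whose user is not authorized for the resource prefix."""
    for e in events:
        for prefix, allowed in AUTHORIZED.items():
            if e["resource"].startswith(prefix) and e["user"] not in allowed:
                yield e

flagged = [e["user"] for e in unauthorized(json.loads(raw))]
print(flagged)  # ['intern7']
```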
Data governance and security managers must keep in mind that data lakes are huge and that the access reports associated with the data lakes will be correspondingly immense. Storing the records for all API requests and all activity within the cloud may be burdensome and expensive.
Detecting unauthorized usage requires granular controls, so that inappropriate access attempts generate alerts that are meaningful, actionable, and limited in number. The definitions of meaningful, actionable, and limited will vary based upon the capabilities of the team or the software used to analyze the logs and must be honestly assessed by the security and data governance teams.
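One common way to keep alerts limited is to aggregate repeated denials into a single actionable alert rather than alerting on every event. A minimal sketch, assuming a hypothetical threshold of three attempts:

```python
from collections import Counter

# Raw denied-access events; alerting on each one would flood the team.
denied = [("intern7", "lake/hr/"), ("intern7", "lake/hr/"), ("intern7", "lake/hr/"),
          ("analyst1", "lake/hr/")]

THRESHOLD = 3  # hypothetical cutoff: repeated denials become one alert

def aggregate_alerts(events, threshold=THRESHOLD):
    """Collapse repeated denials per (user, resource) into a single alert."""
    counts = Counter(events)
    return [{"user": u, "resource": r, "attempts": n}
            for (u, r), n in counts.items() if n >= threshold]

print(aggregate_alerts(denied))
# [{'user': 'intern7', 'resource': 'lake/hr/', 'attempts': 3}]
```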
Data Lake Controls
Useful data lakes will become huge repositories for data accessed by many users and applications. Good security will begin with strong, granular controls for authorization, data transfers, and data storage.
Where possible, automated security processes should be enabled to permit rapid response and consistent controls applied to the entire data lake.
Authorization in data lakes works similarly to any other IT infrastructure. IT or security managers assign users to groups, groups can be assigned to projects or companies, and each of these users, groups, projects, or companies can be assigned to resources.
In fact, many of these tools will link to existing user control databases such as Active Directory, so existing security profiles may be extended to the data lake. Data governance and data security teams will need to associate categorized resources within the data lake with specific groups, such as:
- Raw research data associated with the research user group
- Basic financial data and budgeting resources associated with the company’s internal users
- Marketing research, product test data, and initial customer feedback data associated with the specific new product project group
Most tools will also offer additional security controls such as security assertion markup language (SAML) or multi-factor authentication (MFA). The more valuable the data, the more important it will be for security teams to require the use of these features to access the data lake data.
In addition to the classic authorization processes, the data managers of a data lake also need to determine the appropriate authorization to grant to API connections with data lakehouse software, data analysis software, and various other third-party applications connected to the data lake.
Each data lake will have its own way to manage APIs and authentication processes. Data governance and data security managers need to clearly outline the high-level rules and allow the data security teams to implement them.
As a best practice, many data lake vendors recommend setting up the data to deny access by default, forcing data governance managers to explicitly grant access. Additionally, the implemented rules should be verified through testing and by monitoring the access logs.
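Deny-by-default can be expressed as a simple lookup: access is granted only when an explicit rule exists, and everything else is refused. A minimal sketch with hypothetical group and resource names drawn from the examples above:

```python
# Deny-by-default: access is granted only when an explicit rule exists.
GRANTS = {
    ("research", "raw_research_data"): {"read", "write"},
    ("finance",  "budget_resources"):  {"read"},
}

def check_access(group: str, resource: str, action: str) -> bool:
    """Return True only if an explicit grant covers this action; otherwise deny."""
    return action in GRANTS.get((group, resource), set())

print(check_access("research", "raw_research_data", "write"))  # True
print(check_access("marketing", "raw_research_data", "read"))  # False (no rule -> deny)
```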
A giant repository of valuable data only becomes useful when it can be tapped for information and insight. To do so, the data or query responses must be pulled from the data lake and sent to the data lakehouse, third-party tool, or other resource.
These data transfers must be secure and controlled by the security team. The most basic security measure requires all traffic to be encrypted by default, but some tools will allow for additional network controls such as:
- Limits on connection access to specific IP addresses, IP ranges, or subnets
- Private endpoints
- Specific networks
- API gateways
- Specified network routing and virtual network integration
- Designated tools (Lakehouse application, etc.)
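An IP-based restriction of the kind listed above can be sketched with Python's standard ipaddress module; the subnets here are illustrative, and a real deployment would enforce this at the network or platform layer rather than in application code:

```python
import ipaddress

# Hypothetical allowlist of subnets permitted to reach the data lake endpoint.
ALLOWED_SUBNETS = [ipaddress.ip_network("10.20.0.0/16"),
                   ipaddress.ip_network("203.0.113.0/24")]

def connection_allowed(client_ip: str) -> bool:
    """Allow the connection only if the client IP falls inside an approved subnet."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_SUBNETS)

print(connection_allowed("10.20.5.7"))     # True
print(connection_allowed("198.51.100.4"))  # False
```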
IT security teams often use the best practices for cloud storage as a starting point for storing data in data lakes. This makes perfect sense since the data lake will likely also be stored within the basic cloud storage on cloud platforms.
When setting up data lakes, vendors recommend making them private and blocking anonymous access to prevent casual discovery. The data will also typically be encrypted at rest by default.
Some cloud vendors will offer additional options such as classified storage or immutable storage that provides additional security for stored data. When and how to use these and other cloud strategies will depend upon the needs of the organization.
Developing Secure and Accessible Data Storage
Data lakes provide enormous value by providing a single repository for all enterprise data. Of course, this also paints an enormous target on the data lake for attackers who want access to that data.
Basic data governance and security principles should be implemented first as written policies that can be approved and verified by the non-technical teams in the organization (legal, executives, etc.). Then, it will be up to data governance to define the rules and data security teams to implement the controls to enforce those rules.
Next, each security control will need to be continuously tested and verified to confirm that the control is working. This is a cyclical, and sometimes even a continuous, process that needs to be updated and optimized regularly.
While it’s certainly important to want the data to be safe, businesses also need to make sure the data remains accessible, so they don’t lose the utility of the data lake. By following these high-level processes, security and data lake experts can help ensure the details align with the principles.