What is Data Warehouse Security?
Data warehouses pull data from many different sources, and warehouses have many moving parts. Security issues arise every time data moves from one place to another. Data warehouse security entails taking the necessary steps to protect information so that only authorized personnel can access it.
Data warehouse security should involve the following:
- Strict user access control so that users of the warehouse can only access data they need to do their jobs.
- Taking steps to secure networks on which data is held.
- Carefully moving data and considering the security implications involved in all data movement.
Data Warehouse Security Challenges
The scope and scale of a data warehouse in addition to the sheer number of moving parts present some unique challenges to securing information at this level.
- How to strike a balance between the need to secure data and the need for providing unrestricted access to certain users of the data in the warehouse for analytics, business intelligence, and data mining applications.
- How to classify users for data warehouse access, i.e., should you use a hierarchical approach or a role-based approach.
- How to secure the network. Organizations need to consider encrypting data and investing in highly secure networking hardware.
- How do the necessary security features affect the performance of the data warehouse?
- How to securely extract data from source transactional systems for warehouse use.
What Are the Security Concerns in Building a Data Warehouse?
When building a data warehouse, or when securing an existing one, there are several important concerns to address:
- Mapping existing data – You should be able to catalogue the data in your warehouse, understand where sensitive data is stored, and identify who has access to it. This precaution is important both in terms of risk reduction and in terms of compliance with different regulations. The dynamic nature of data storage makes this process challenging.
- Implementing access controls – Begin with proper authentication, which usually includes integration with your organization’s SSO control and may require additional restrictive policies (i.e., connection is done only from certain network segments or VPNs, require two factor authentication, etc.).
- Proper authorization – Configure which users (who are usually assigned with groups or roles) need access to and what type of access is required. For example, a certain team may need to update data, but other teams only need read access.
- Granularity in the access – In many cases, there is a need for a fine-grained access control to the data warehouse to set different levels of access based on specific columns (e.g., users of specific roles can access the customers’ table but not their PII or payment details), rows, or items (e.g., a team can access only data specific to the customers in its region, not the entire customer base). The modern approach often requires data-based access control (not to be confused with database access control), which allows or restricts access to data based on its type without specific data location limits. For example, with this approach, a certain team may have access to an entire data warehouse but not to PII (as it may be blocked or masked).
- Combating over-privilege – As projects evolve, employees change positions, and configuration issues may lead some of the data warehouse’s users to have more access than they actually require. This incongruity between need and access is a risk which should be mitigated. To solve this problem, it is important to analyze the actual usage of the different data locations compared to the access granted to users. It is important to set workflows around rarely accessed data.
- Auditing and monitoring – Maintaining audit and monitoring queries and jobs running on the data warehouse is essential. This desired state is either default or requires specific configuration as a function of the data warehouse technology. Even when enabled, the query audit or access log may need to be moved to different storage locations and monitored for purging and aggregations.
- Behavior analysis – Access to data should be analyzed for risks such as bad practices or insider threats. For example, abusing a specific account and using it for multiple purposes and/or by different teams is bad practice. An employee unexpectedly pulling large amounts of sensitive data is an example of potential insider threat.
- Continuous security – The highly dynamic nature of data environments requires continuous security, rather than one-time projects. For example, mapping the location of sensitive data may prove futile, as the results may be obsolete by the time you receive them.
Security Policies for Data Warehouses
Security policies in data warehouses address security and compliance gaps in order to reduce the risks involved. Some policies can be implemented on the data warehouse itself (e.g., restricting access to specific resources to specific roles), and others should be handled by external tools. In practice, decoupling the security policies from the data stores themselves may be more efficient and easier to manage.
Examples of security policies for data warehouses:
- A policy ensuring that only users in specific groups or roles have access to sensitive information (such as PII).
- A policy restricting access to amounts or types of data being pulled from outside of the office IP addresses.
- A policy restricting users from accessing data they have not accessed within a given period.
- A policy restricting users only to data from their geographic region.
Implementing Security Controls for Snowflake Data Warehouse
Several common concerns to note when implementing security controls for Snowflake:
- Role hierarchies may be tempting to use but may not prove effective in the long run, as they sometimes make it difficult to understand why specific users or roles have access to various resources. Make sure to plan which roles architecture you want to use for your Snowflake.
- Access logs and query logs are kept for a certain amount of time and may have data latency issues (it takes time for the data to reach the query log). Handling the logs efficiently may mean integrating with SIEM or other systems, or using a specific data warehouse protection product such as Satori.
- Achieving granular access to data may require implementing column and row-based access controls. You may implement this to a certain extent by using secure views or by decoupling the security from Snowflake and using Satori to create such restrictions.
Implementing Security Controls for AWS Redshift Data Warehouse
Common concerns when implementing security controls for AWS Redshift:
- Access and query logs are very limited (especially in terms of retention time), and you may want to enable logging to an external store (such as a S3 bucket). Doing so may require further processing, ETLs, aggregations, and purging of the data.
- In most cases consider investing the time to integrate Redshift access with an identity management system, such as Okta, to enhance the visibility of user access privileges.
Implementing Security Controls for GCP Big Query
Common concerns when implementing security controls for Big Query:
- It is important to monitor your Big Query jobs and to analyse the Big Query jobs log for insight into the queries running on your Big Query data warehouse.
- It is important to ensure that users only have access to the specific tables they need (or even to implement row and column-based security if possible).
Security Requirements
Adding security features has an impact on data warehouse performance, thus it is critical to evaluate security requirements as early as possible. After the data warehouse has gone live, it is tough to implement security features.
During the data warehouse design phase, we should consider what data sources might be added later and what the implications of adding those data sources would be. During the design process, we should evaluate the following options already widely available.
This predicament occurs when the future users and data sources are unknown. In such a case, we must leverage our business knowledge and the data warehouse's goal to determine potential requirements.
Security measures have an impact on the following activities:
- User accessibility
- Data loading
- Data transfer
- Generation of queries
- User Access
We must first classify the data before classifying the users based on the data they may access. In other words, users are categorised based on the data they have access to.
Classification of Data
To classify the data, the following two procedures might be used:
- The sensitivity of data can be classified. Data that is highly sensitive is classified as highly restricted, whereas data that is less sensitive is classified as less restrictive.
- Data can also be categorised based on job function. This restriction restricts access to specified data to specific users. Here, we limit users to viewing only the data that they are interested in and accountable for.
Example: Assume you are creating a data warehouse for a bank. Consider that the data in the data warehouse is the transaction data for all of the accounts. The issue here is who has access to the transaction data. The solution is to categorise the data based on its function.
Classification of Users
The methodologies listed below can be used to categorise users.
- Users can be classified according to an organization's user hierarchy, i.e., users can be classed by departments, sections, groups, and so on.
- Users can also be classed depending on their role, with persons divided into departments based on their role.
Audit Prerequisites
Auditing is a subset of security, and it is an expensive job. Auditing can have a significant impact on the system. To perform an audit on time, we need more hardware, hence it is advised that auditing be turned off whenever possible.
The following audit needs can be classified:
- Connections
- Disconnections
- Data accessibility
- Data modification
It is important to note that each of the aforementioned categories must be audited for success, failure, or both. Failure auditing is extremely critical for security reasons. Failure auditing is crucial because it can reveal unauthorised or fraudulent access.
Network Prerequisites
Network security is just as vital as other types of security. We cannot overlook the need for network security. We need to think about the following issues:
- Is it necessary to encrypt data before it is transferred to a data warehouse?
- Is there any restriction on the network routes that the data can take?
These constraints must be carefully evaluated.
- Overheads will rise as a result of the encryption and decryption processes. It would necessitate more processing power and time.
- Because encryption is borne by the source system, the cost of encryption can be substantial if the system is already burdened.
Data Transfer
While moving data, there are potential security consequences. Assume we need to convey certain restricted data in the form of a flat file to be loaded. When data is imported into the data warehouse, the following issues arise:
- Where is the flat file kept?
- Who has access to that hard drive space?
When we discuss the backup of these flat files, the following questions arise:
- Do you keep encrypted or decrypted versions of your files?
- Is it necessary to make these backups to special tapes that are kept separately?
- Who has access to these cassettes?
Other types of data migration, such as query result sets, must also be considered. The following are the questions that arose when constructing the temporary table:
- Where will that temporary table be located?
- How do you make a table like this visible?
We must avoid inadvertently violating security measures. Data can be available to non-authorized users if a person with restricted data access can construct accessible temporary tables. We can solve this problem by creating a distinct temporary area for users with restricted data access.
Documentation
Audit and security needs must be thoroughly documented. This will be considered part of the justification. This document may include all of the information gathered from Data categorization
- Classification of users
- Network prerequisites
- Requirements for data transit and storage
- All actions that are auditable
The Effect of Security on Design
Security has an impact on application code and development timelines. Security has an impact on the following areas:
- Development of applications
- Database layout
- Testing
Development of Applications
Security has an impact on overall programme development as well as the design of key data warehouse components such as the load manager, warehouse manager, and query manager. Checking code may be required by the load manager to filter records and arrange them in different locations. To conceal specific data, further transformation rules may be required. Extra metadata may also be required to manage any additional objects.
Extra codes may be required by the warehouse manager to develop and maintain extra views in order to enforce security. Extra checks may need to be built into the data warehouse to prevent it from moving data to a location where it should not be available. Any access limitations must be handled by the query manager, which necessitates the adjustments. All additional views and aggregations must be known to the query manager.
Database Design
The database layout is also altered since the number of views and tables increases when security measures are added. Adding security expands the database's size, increasing the complexity of database design and management. It will also complicate backup management and recovery planning.
Testing
Testing the data warehouse is a complicated and time-consuming operation. Increasing the security of the data warehouse has an impact on the testing time complexity. It has an impact on the testing in two ways:
- It will lengthen the time needed for integration and system testing.
- There will be additional functionality to test, which will increase the size of the testing suite.