In order for data as a resource to provide a competitive edge, a company needs to classify "its" data in a systematic way, which means setting up a data warehouse. This data warehouse must be able to do all manner of things: be efficient, scale easily, be networked to other systems, have clearly defined goals and yet remain easy to control. What are the technical, structural and legal requirements for such a data warehouse?
In the latest of her Duet interviews, Dr Caldarola, editor of Data Warehouse as well as author of Big Data and Law, and Martin Szugat discuss the requirements and architectures of a data warehouse.
Mr Szugat, in the book Data Warehouse, you and Dr Schwarz wrote an article on what a data warehouse could or should look like from a technical point of view. Let us start by clarifying for our readers what a data warehouse actually is, in contrast to familiar terms such as data pool, data lake and database.
Martin Szugat: In simple terms, a data warehouse is a warehouse for data: a warehouse is a central building for storing raw materials and/or products which are later processed or distributed. If we apply this term to data, a data warehouse is a central database for storing raw data and/or derived data which is then to be processed and used. So, a data warehouse is a special database for centralising data as well as for transforming and analysing it.
A data lake is a special type of data warehouse. Most data warehouse systems store transformed data in a structured form so that later analyses become more performant, i.e. faster and more efficient. A data lake, on the other hand, stores raw data and can thus hold structured as well as unstructured data.
A data pool is a special use of a data warehouse. Most companies use data warehouses to aggregate data from different internal systems, such as CRM or ERP software, and to integrate, for instance, customer and transaction data into a common data model. However, a data warehouse could also be used to aggregate transaction data from different companies; the companies in question would then pool their data, using a data warehouse as common storage.
What is the purpose of a data warehouse? When is it advisable for a company to set up a data warehouse?
The function of a data warehouse is to enable business analyses, i.e., to turn data into information which in turn informs the decisions being made; for instance, to answer questions such as: do customers living in rural areas buy more products than those living in cities? The answer (the information) is then used for a data-driven decision: e.g., increase the budget for ads in villages and decrease it for cities. To answer this question, data from different systems, such as CRM and ERP software, have to be correlated in a specific way. Thus, the need for a data warehouse often arises when information is spread among different data sources or is embedded in huge amounts of data. In these situations, companies need a technical-analytical solution: a data warehouse.
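Correlating customer data from a CRM with transaction data from an ERP can be sketched in miniature; the table and column names below are purely illustrative, not taken from any real system.

```python
import sqlite3

# Hypothetical mini-warehouse: customer master data (as from a CRM) and
# transaction data (as from an ERP) are correlated with one SQL query.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE transactions (customer_id INTEGER, revenue REAL);
INSERT INTO customers VALUES (1, 'rural'), (2, 'rural'), (3, 'urban');
INSERT INTO transactions VALUES (1, 120.0), (1, 80.0), (2, 60.0), (3, 90.0);
""")

# Join the two sources and aggregate revenue per region.
rows = con.execute("""
SELECT c.region, SUM(t.revenue) AS total_revenue
FROM customers c JOIN transactions t ON t.customer_id = c.id
GROUP BY c.region
ORDER BY c.region
""").fetchall()

for region, total in rows:
    print(region, total)
```

The warehouse's value lies precisely in this join: neither source on its own can answer the rural-versus-urban question.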
Are data warehouses standardised or does every company need a tailor-made warehouse?
Data warehouses are, in most cases, based on standard software covering a wide range of uses and can be configured to the specific needs of the company. For some industries and business sectors, standard data schemes exist that cover the most common metrics, dimensions and attributes, such as quarterly EBITDA, to name one example.
I assume that the data warehouses of various companies, for example to realise Industry 4.0, are networked with each other. Are there requirements, or a need, for standardisation with regard to this networking, as well as for the processing of the data contained in the data warehouses themselves, so that interoperability is made easier?
Yes, especially in the context of Industry 4.0, there is a strong need to share or pool data to cover fragmented production and supply chains. Accordingly, there are initiatives, such as the International Data Spaces Association, that define standards for data exchanges and so-called data spaces. A data space is a data pool with standardised data schemes and a well-defined data architecture that enables scalable data applications.
Is a data warehouse only about digital data or also about physical data?
Most data warehouse systems only support digital and especially structured data, such as tables. However, data lakes are specifically set up to handle and store any kind of data, meaning unstructured and semi-structured data, such as images or documents. Some systems integrate machine learning-based solutions for object detection on images or optical character recognition on documents, making it possible to access and analyse metadata, such as the persons in an image or the text of a document.
You have stated that data transformation depends on data modelling. Companies often have many different models in use for different purposes – for example for fraud detection or for forecasting sales or potential customers – so that several transformation paths are possible. What do you mean by that statement?
A model gives form to the data and thereby discloses the information within it. There are different manual and automated methods for modelling, that is, for creating models from data. Descriptive modelling only describes relationships between data points. For example, a simple handcrafted metrics model, such as a value driver tree, describes the effect of various success drivers on target achievement: for example, online sales are driven by the advertising budget and the ad click rate. For this, the sales transaction and ad campaign data must be transformed into actionable metrics consisting of: 1) the sum of online sales revenue driven by ad campaigns, 2) the sum of the budgets of ad campaigns and 3) the weighted average ad click rate of the campaigns.
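The three metrics above can be computed with a few lines; the campaign records and field names here are invented for illustration.

```python
# Hypothetical ad-campaign records; all names and figures are illustrative.
campaigns = [
    {"name": "A", "budget": 1000.0, "clicks": 50, "impressions": 2000, "revenue": 4000.0},
    {"name": "B", "budget": 500.0,  "clicks": 10, "impressions": 1000, "revenue": 1000.0},
]

# 1) sum of online sales revenue driven by ad campaigns
total_revenue = sum(c["revenue"] for c in campaigns)
# 2) sum of the budgets of ad campaigns
total_budget = sum(c["budget"] for c in campaigns)
# 3) impression-weighted average ad click rate of the campaigns
total_impressions = sum(c["impressions"] for c in campaigns)
weighted_ctr = sum(c["clicks"] for c in campaigns) / total_impressions

print(total_revenue, total_budget, weighted_ctr)
```

Note that the click rate must be weighted by impressions rather than averaged per campaign, otherwise small campaigns would distort the driver tree.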
Predictive modelling, on the other hand, uses machine learning methods (among others) to automatically compute the model from the data. These algorithms often expect the data to be a single flat table with the instances, that is, the examples to learn from, in rows and the features, also called attributes, in columns. So, depending on the modelling method, I first need to put the data into the appropriate form, i.e., transform the data. And because each model only represents a certain aspect of reality, I need specific models to address different questions, which in turn produce different information. Accordingly, I need to provide the data in distinct forms, which then means that I need different transformations for different information.
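What "a single flat table" means in practice can be shown with a toy transformation; the raw data and feature names are made up for the sketch.

```python
# Toy ELT-style transformation: per-customer transaction lists (raw data)
# are flattened into one feature row per customer (instance), the shape
# most machine learning libraries expect.
raw = {
    "alice": [120.0, 80.0],
    "bob":   [60.0],
}

feature_names = ["n_purchases", "total_revenue", "avg_revenue"]
feature_table = []
for _customer, amounts in sorted(raw.items()):
    feature_table.append([len(amounts), sum(amounts), sum(amounts) / len(amounts)])

print(feature_names)
print(feature_table)  # [[2, 200.0, 100.0], [1, 60.0, 60.0]]
```

Each row is one instance (a customer) and each column one feature; a different question, say churn instead of revenue, would require a different set of columns and hence a different transformation.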
In terms of data protection law, there are some requirements for a data warehouse with regard to, among other things, documentation, data origin, the basis of legitimacy and the purpose of use. Does each data point need tagging for these provisions? You speak of data lineage, a data catalogue or a metadata repository. What exactly are these and how do they work? Is it possible to use blockchain technology to guard against manipulation?
Under data protection law, these legal requirements apply only to personal data. But there are also other regulations, for example in the financial or healthcare sectors, which likewise require documentation and transparency. And practically speaking, it also makes sense to document the origin of each data point: for example, if my model outputs incorrect information and I discover that the modelling was correct but the data was incorrect, I want to know where the data came from so that I can determine the root cause of the error and prevent incorrect data in the future, for example by performing plausibility checks directly during input.
This is the idea behind data lineage: to document the data flow from the data source through transformation and storage in a data warehouse to data utilisation in various applications. A metadata repository provides data about data: the data origin, the date of the last update, permission by the user regarding how the data is to be used, consent regarding who may utilise the data etc. A data catalogue is like a shopping catalogue for data sets: what data exists in our company, where can I find it, how can I access it, who do I have to ask for permission and so on?
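A metadata repository entry and a catalogue-style permission lookup can be sketched as follows; the fields mirror the examples in the text (origin, last update, permitted uses), while the class and dataset names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date

# Minimal sketch of a metadata repository entry: data about data.
@dataclass
class DatasetMetadata:
    name: str
    origin: str                      # source system the data came from
    last_updated: date               # date of the last update
    allowed_uses: list = field(default_factory=list)  # permitted purposes

# The data catalogue: what data exists and how may it be used?
catalogue = {
    "crm_customers": DatasetMetadata(
        name="crm_customers",
        origin="CRM export",
        last_updated=date(2023, 1, 15),
        allowed_uses=["reporting", "analytics"],
    ),
}

def may_use(dataset: str, purpose: str) -> bool:
    """Catalogue-style lookup: is this use of the data permitted?"""
    meta = catalogue.get(dataset)
    return meta is not None and purpose in meta.allowed_uses

print(may_use("crm_customers", "analytics"))   # True
print(may_use("crm_customers", "marketing"))   # False
```

Real metadata repositories track far more (lineage graphs, schemas, stewards), but the principle is the same: every dataset carries machine-readable answers to "where from?" and "who may use it for what?".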
A lot of work is needed to create all this metadata. Much of it is done manually, for example by so-called data stewards; some is automated using data analytics; and, finally, machine learning methods can, for instance, classify data sources autonomously.
To finally answer your question about the blockchain: concerning internal corporate data warehouses, I don't know of any examples using blockchain to secure data, because there are other, simpler and more (energy-)efficient technical solutions for preventing data manipulation, such as those involving access control and data versioning. However, for (semi-)public and decentralised/distributed data pools and data spaces, where data producers and consumers don't know and can't trust each other, blockchain is the right solution.
A data warehouse should, on the one hand, save energy and, on the other hand, also function quickly. Which schemes are available with which advantages and disadvantages?
This always depends very much on how it is being used: for classic business intelligence, meaning business reporting such as sales figures, financial controlling and so on, the "old school" data warehouse systems are particularly suitable. One could, for example, extract the data from the original databases every 24 hours in a batch process, transform the data into a snowflake schema and load the transformed data into so-called fact and dimension tables. This is the ETL approach: Extract-Transform-Load. This scheme of data warehousing enables fast analytical queries over huge amounts of structured data, but the range of possible queries is limited.
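The ETL batch just described can be sketched schematically; the source rows, key scheme and table names are invented for the example, with plain dictionaries standing in for warehouse tables.

```python
# Schematic ETL batch: Extract rows from a source system, Transform them
# into a dimension table and a fact table, Load both into the warehouse
# (here simply Python structures).
source_orders = [
    {"order_id": 1, "product": "widget", "amount": 3, "price": 9.99},
    {"order_id": 2, "product": "widget", "amount": 1, "price": 9.99},
    {"order_id": 3, "product": "gadget", "amount": 2, "price": 19.99},
]

# Transform: the dimension table maps product names to surrogate keys ...
dim_product = {name: key
               for key, name in enumerate(sorted({o["product"] for o in source_orders}))}
# ... and the fact table keeps only keys and measures.
fact_sales = [
    {"product_key": dim_product[o["product"]], "revenue": o["amount"] * o["price"]}
    for o in source_orders
]

# Load: in a real system both tables would be bulk-inserted here.
print(dim_product)  # {'gadget': 0, 'widget': 1}
print(round(sum(f["revenue"] for f in fact_sales), 2))
```

Splitting descriptive attributes (dimensions) from numeric measures (facts) is what makes the later aggregation queries so fast, at the price of fixing the questions that can be asked.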
For advanced analytics, sometimes called business analytics, the ELT approach might be a better fit: for example, one could extract the data from the databases in real time as new data arrives, load the "raw data" into a data lake and transform the data into a single flat feature table as and when it is needed for training a machine learning model. The query then takes longer than with the ETL approach, but since training a predictive model using machine learning is time-consuming anyway, this does not matter. On the positive side, if new data, or new types of data, are collected, it is much easier to transform this new data into new features and integrate those features into a new model, which improves the model performance, for example by enhancing prediction accuracy.
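The load-first, transform-later ordering of ELT can be contrasted with ETL in a few lines; the event fields and function names are illustrative only, with a list standing in for the data lake.

```python
# ELT sketch: raw events are Loaded untransformed into a "lake" (a list),
# and the feature table is Transformed only when a model needs it.
lake = []  # raw, schema-less storage

def load(event: dict) -> None:
    lake.append(event)            # Load first, no transformation ...

def features_for_training() -> list:
    # ... Transform later, on demand, into per-customer totals.
    totals = {}
    for e in lake:
        totals[e["customer"]] = totals.get(e["customer"], 0.0) + e["revenue"]
    return [[customer, total] for customer, total in sorted(totals.items())]

load({"customer": "alice", "revenue": 120.0})
load({"customer": "alice", "revenue": 80.0})
load({"customer": "bob", "revenue": 60.0})
print(features_for_training())  # [['alice', 200.0], ['bob', 60.0]]
```

Because the raw events are kept, a new feature (say, the number of purchases) only requires a new transformation function, not a change to what was stored.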
My favourite expression is:
"All models are wrong, but some are useful."
– George Box
You point out that the success of a data warehouse depends on its architecture and organisation. What are good and bad examples of a data warehouse and what should be considered when planning and implementing one?
"Structure follows strategy," as A.D. Chandler said in 1962, and this maxim still holds true for data strategy. A data strategy defines how a company creates value from data to achieve its business objectives and thus realise its business strategy. The solutions that create value from data are called "data products". A data product is the result of data and analytics: analytics are applied to data to extract valuable information from it. The information is valuable to the user if it helps the user make better decisions. A decision for an action is good if the result of the action helps in the achievement of the company's goals. So, the core of a data strategy is a set of data products that help employees achieve the company's goals.
These data products need a common basis: the data platform. Depending on the intended use, the platform consists of different components, including a data warehouse. How these components interact defines the data architecture. The data products and the data platform must be designed, developed and operated together, with their efficient use in mind. This requires different experts and teams to cooperate – so a data organisation is needed which fits the data architecture. We often see companies which don't have a coherent and holistic data strategy. The negative consequence is that the technological platform and the personnel organisation are either oversized or undersized: if the data platform is larger than needed, the data organisation is overwhelmed by it and has no time for designing, developing and operating the data products. Conversely, the data organisation may try to build complex real-time data products on an undersized data platform that does not meet the technical requirements for those products. The consequences are the same in both cases: no or too little value is generated from the data.
Mr. Szugat, thank you for sharing your insights on what data warehouses are.
Thank you, Dr Caldarola, and I look forward to reading your upcoming interviews with recognised experts, delving even deeper into this fascinating topic.