Is a Data Warehouse both the key to and foundation for businesses in the digital age? What should one look out for?

Martin Szugat – Photo: Manuel Vescoli

In the latest of her Duet interviews, Dr Caldarola, editor of Data Warehouse as well as author of Big Data and Law, and Martin Szugat discuss the requirements and architectures of a data warehouse.

Mr Szugat, in the book Data Warehouse, you and Dr Schwarz wrote an article on what a data warehouse could or should look like from a technical point of view. Let us start by clarifying for our readers what a data warehouse actually is – in contrast to terms such as data pool, data lake and database, with which we are all familiar.

Martin Szugat: In simple terms, a data warehouse is a warehouse for data, and a warehouse is a central building for storing raw materials and/or products which are later processed or distributed. If we apply this term to data, a data warehouse is a central database for storing raw data and/or derived data which is then to be processed and used. So, a data warehouse is a special database for centralising data as well as transforming and analysing data.

A data lake is a special type of data warehouse. Most data warehouse systems store transformed data in a structured form so that later analysis becomes more performant, i.e. faster and more efficient. A data lake, on the other hand, stores raw data and thus can hold structured as well as unstructured data.

A data pool is a special use of data warehouses. Most companies use data warehouses to aggregate data from different internal systems, such as CRM or ERP software, and to integrate, for instance, customer and transaction data into a common data model. However, a data warehouse could also be used to aggregate transaction data from different companies; the companies in question would then pool their data, using a data warehouse as common storage.

What is the purpose of a data warehouse? When is it advisable for a company to set up a data warehouse?

The function of a data warehouse is to enable business analyses, i.e., to turn data into information which in turn informs the decisions being made; for instance, to answer questions such as: do customers living in rural areas buy more products than those living in cities? The answer (the information) is then used for a data-driven decision: e.g., increase the budget for ads in villages and decrease it for cities. To answer this question, data from different systems, such as CRM and ERP software, have to be correlated in a specific way. Thus, the need for a data warehouse often arises when information is spread among different data sources or is embedded in huge amounts of data. In these situations, companies need a technical-analytical solution: a data warehouse.
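To make the rural-versus-urban example concrete, here is a minimal sketch in Python with pandas; the table and column names (customer_id, region_type, revenue) are hypothetical stand-ins for CRM and ERP extracts:

```python
import pandas as pd

# Hypothetical extracts: customers from the CRM, transactions from the ERP.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "region_type": ["rural", "urban", "rural", "urban"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4, 4, 4],
    "revenue": [40.0, 25.0, 30.0, 55.0, 20.0, 15.0, 10.0],
})

# Correlate the two sources: join on the customer key, then compare
# total revenue per customer across region types.
joined = transactions.merge(customers, on="customer_id")
per_customer = joined.groupby(["region_type", "customer_id"])["revenue"].sum()
print(per_customer.groupby("region_type").mean())
```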

Are data warehouses standardised or does every company need a tailor-made warehouse?

Data warehouses are, in most cases, based on standard software covering a wide range of uses, and they can be configured to the specific needs of the company. For some industries and business sectors, standard data schemes exist that cover the most common metrics, dimensions and attributes, such as quarterly EBITDA.

I assume that the data warehouses of various companies are networked with each other, for example to realise Industry 4.0. Are there requirements for standardisation, both with regard to this networking and to the processing of the data contained in the data warehouses themselves, so that interoperability is made easier?

Yes, especially in the context of Industry 4.0, there is a strong need to share or pool data to cover fragmented production and supply chains. Accordingly, there are initiatives, such as the International Data Spaces Association, that define standards for data exchanges and so-called data spaces. A data space is a data pool with standardised data schemes and a well-defined data architecture that enables scalable data applications.

Is a data warehouse only about digital data or also about physical data?

Most data warehouse systems only support digital and especially structured data, such as tables. However, data lakes are particularly set up to handle and store any kind of data, meaning unstructured and semi-structured data, such as images or documents. Some systems integrate machine learning-based solutions for object detection in images or optical character recognition on documents, making it possible to access and analyse metadata, such as the persons in an image or the text of a document.

You have stated that data transformation depends on data modelling. Companies often have many different models in use for different purposes – for example, for fraud detection or for forecasting sales or potential customers – so that several transformation paths are possible. What do you mean by that statement?

A model gives form to the data and thereby discloses the information within the data. There are different manual and automated methods for modelling, that is, for creating models from data. Descriptive modelling only describes relationships between data points. For example, a simple handcrafted metrics model, such as a value driver tree, describes the effect of various success drivers on target achievement: for example, online sales are driven by the advertising budget and the ad click rate. For this, the sales transaction and ad campaign data must be transformed into actionable metrics consisting of: 1) the sum of online sales revenue driven by ad campaigns, 2) the sum of the budgets of the ad campaigns and 3) the weighted average ad click rate of the campaigns.
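As an illustration, these three metrics could be computed with a few lines of pandas; the column names, the example figures and the choice to weight the click rate by impressions are assumptions for the sketch:

```python
import pandas as pd

# Hypothetical source extracts: sales transactions and ad campaign data.
sales = pd.DataFrame({
    "campaign_id": [1, 1, 2, None],          # None = sale not driven by a campaign
    "revenue": [120.0, 80.0, 200.0, 50.0],
})
campaigns = pd.DataFrame({
    "campaign_id": [1, 2],
    "budget": [500.0, 1500.0],
    "click_rate": [0.02, 0.05],
    "impressions": [10_000, 40_000],
})

# 1) Sum of online sales revenue driven by ad campaigns
campaign_revenue = sales.dropna(subset=["campaign_id"])["revenue"].sum()
# 2) Sum of the budgets of the ad campaigns
total_budget = campaigns["budget"].sum()
# 3) Weighted average ad click rate (here weighted by impressions)
weighted_click_rate = (
    (campaigns["click_rate"] * campaigns["impressions"]).sum()
    / campaigns["impressions"].sum()
)
print(campaign_revenue, total_budget, weighted_click_rate)
```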

Predictive modelling, on the other hand, uses machine learning methods (among others) to automatically compute the model from the data. These algorithms often expect the data to be a single flat table with the instances, that is, the examples to learn from, in rows and the features, also called attributes, in columns. So, depending on the modelling method, I first need to put the data into the appropriate form, i.e., transform the data. And because each model only represents a certain aspect of reality, I need specific models to address different questions, which in turn produce different information. Accordingly, I need to provide the data in distinct forms, which then means that I need different transformations for different information.
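A minimal sketch of such a transformation, assuming a hypothetical churn-prediction task: relational transaction data is aggregated into one row per customer (the instances) with derived columns (the features), which a scikit-learn model can then consume directly:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical relational data: one row per transaction, plus a label table.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "revenue": [40.0, 25.0, 30.0, 55.0, 20.0, 15.0],
})
labels = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "churned": [0, 1, 0],
})

# Transform into a single flat table: one row per instance (customer),
# one column per feature (attribute).
features = transactions.groupby("customer_id").agg(
    n_purchases=("revenue", "size"),
    total_revenue=("revenue", "sum"),
).reset_index()
flat = features.merge(labels, on="customer_id")

# The learning algorithm consumes the flat table directly.
model = LogisticRegression()
model.fit(flat[["n_purchases", "total_revenue"]], flat["churned"])
```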

In terms of data protection law, there are some requirements for a data warehouse with regard to, among other things, documentation, data origin, the basis of legitimacy and the purpose of use. Does each data point need tagging for these provisions? You speak of data lineage, data catalogues and metadata repositories. What exactly are these and how do they work? Is it possible to use blockchain technology to guard against manipulation?

Under data protection law, these legal requirements apply only to personal data. But there are also other regulations, for example in the financial or healthcare sectors, which likewise require documentation and transparency. And practically speaking, it also makes sense to document the origin of each data point: for example, if my model outputs incorrect information and I discover that the modelling was correct but the data was incorrect, I want to know where the data came from so that I can determine the root cause of the error and prevent incorrect data in the future, for example by performing plausibility checks directly during input.
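Such a plausibility check at input time can be as simple as a validation function at the ingestion boundary; the record structure and thresholds below are made up for illustration:

```python
def plausible_order(order: dict) -> bool:
    """Hypothetical plausibility check, run directly at data entry."""
    return (
        order.get("customer_id") is not None
        and order.get("quantity", 0) > 0
        and 0.0 < order.get("unit_price", 0.0) < 100_000.0
    )

# Implausible records are rejected before they reach the warehouse.
record = {"customer_id": 42, "quantity": 3, "unit_price": 19.99}
assert plausible_order(record)
```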

This is the idea behind data lineage: to document the data flow from the data source through transformation and storage in a data warehouse to data utilisation in various applications. A metadata repository provides data about data: the data origin, the date of the last update, the permission given by the user regarding how the data is to be used, consent regarding who may utilise the data, and so on. A data catalogue is like a shopping catalogue for data sets: what data exists in our company, where can I find it, how can I access it, whom do I have to ask for permission, and so on?
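As a sketch of what such a metadata repository entry might hold (all field and value names are hypothetical, not a reference to any particular product):

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    """One hypothetical metadata repository entry: data about a data set."""
    name: str
    source_system: str                 # data origin, e.g. "CRM"
    last_updated: str                  # date of the last update (ISO 8601)
    legal_basis: str                   # basis of legitimacy, e.g. "consent"
    permitted_purposes: list[str] = field(default_factory=list)
    lineage: list[str] = field(default_factory=list)   # upstream processing steps

entry = DatasetMetadata(
    name="customer_transactions",
    source_system="ERP",
    last_updated="2023-01-31",
    legal_basis="contract",
    permitted_purposes=["sales_reporting"],
    lineage=["extract_erp", "join_crm", "aggregate_daily"],
)
```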

A lot of work is needed to create all this metadata. Much of it is done manually by so-called data stewards, some of it is automated using data analytics, and, finally, machine learning methods can, for instance, classify data sources autonomously.

To finally answer your question about blockchain: concerning internal corporate data warehouses, I don't know of any examples using blockchain to secure data, because there are other, simpler and more (energy-)efficient technical solutions to prevent data manipulation, such as those involving access control and data versioning. However, for (semi-)public and decentralised/distributed data pools and data spaces, where data producers and consumers don't know and can't trust each other, blockchain is the right solution.

A data warehouse should, on the one hand, save energy and, on the other hand, also function quickly. Which schemes are available, and with which advantages and disadvantages?

This always depends very much on how it is being used: for classic business intelligence, meaning business reporting, such as sales figures, financial controlling and so on, the “old school” data warehouse systems are particularly suitable. One could, for example, extract the data from the original databases every 24 hours in a batch process, transform the data into a snowflake data scheme and load the transformed data into so-called fact and dimension tables. This is the ETL approach: Extract-Transform-Load. This scheme of data warehousing enables fast analytical queries over huge amounts of structured data, but the range of possible queries is limited.
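A minimal batch ETL sketch in Python, using in-memory SQLite stand-ins for the source and warehouse databases; the table and column names (orders, fact_sales, dim_customer) are hypothetical:

```python
import sqlite3
import pandas as pd

source = sqlite3.connect(":memory:")      # stands in for the operational ERP database
warehouse = sqlite3.connect(":memory:")   # stands in for the data warehouse

# (Set-up only: give the source a few example orders.)
pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 10, 20],
    "revenue": [40.0, 25.0, 30.0],
    "order_date": ["2023-01-30", "2023-01-30", "2023-01-31"],
}).to_sql("orders", source, index=False)

# Extract: pull the new orders from the source system (e.g. every 24 hours).
orders = pd.read_sql("SELECT * FROM orders", source)

# Transform: split into a fact table and a dimension table,
# as in a star or snowflake schema.
fact_sales = orders[["order_id", "customer_id", "revenue", "order_date"]]
dim_customer = orders[["customer_id"]].drop_duplicates()

# Load: write the transformed tables into the warehouse.
fact_sales.to_sql("fact_sales", warehouse, if_exists="append", index=False)
dim_customer.to_sql("dim_customer", warehouse, if_exists="replace", index=False)
```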

For advanced analytics, sometimes called business analytics, the ELT approach might be a better fit: for example, one could extract the data from the databases in real time as new data arrives, load the “raw data” into a data lake and transform the data into a single flat feature table when and as it is needed for training a machine learning model. The query then takes longer than with the ETL approach, but since training a predictive model using machine learning is time-consuming anyway, this does not matter. On the positive side, if new data or new types of data are collected, it is much easier to transform this new data into new features and integrate those features into the new model, which improves the model performance, for example by enhancing prediction accuracy.
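A minimal ELT sketch under the same caveats (event fields and feature names are hypothetical; a plain list stands in for the data lake): raw events are loaded as they arrive, and the flat feature table is only built when training starts:

```python
import pandas as pd

data_lake: list[dict] = []   # stands in for raw-data storage (e.g. an object store)

def load_raw(event: dict) -> None:
    # E and L steps: extract on arrival, load the untransformed record as-is.
    data_lake.append(event)

def build_feature_table() -> pd.DataFrame:
    # T step: run only when a model needs training data.
    raw = pd.DataFrame(data_lake)
    return raw.groupby("customer_id").agg(
        n_events=("event_id", "size"),
        total_value=("value", "sum"),
    ).reset_index()

# Events arrive in real time and are stored without transformation...
load_raw({"event_id": 1, "customer_id": 10, "value": 40.0})
load_raw({"event_id": 2, "customer_id": 10, "value": 25.0})
load_raw({"event_id": 3, "customer_id": 20, "value": 30.0})

# ...and are only flattened into one feature table on demand.
print(build_feature_table())
```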

My favourite expression is:

“All models are wrong, but some are useful.”

George Box

You point out that the success of a data warehouse depends on its architecture and organisation. What are good and bad examples of a data warehouse, and what should be considered when planning and implementing one?

“Structure follows strategy,” as A.D. Chandler put it in 1962, and this maxim still holds true for data strategy. A data strategy defines how a company creates value from data to achieve its business objectives and thus realise its business strategy. The solutions that create value from data are called “data products”. A data product is the result of applying analytics to data in order to extract valuable information from it. The information is valuable to the user if it helps the user make better decisions. A decision for an action is good if the result of the action contributes to the achievement of the company’s goals. So, the core of a data strategy is a set of data products that help the employees achieve the company’s goals.

These data products need a common basis, which is the data platform. Depending on the intended use, the platform consists of different components, including a data warehouse. How these components interact defines the data architecture. The data products and the data platform must be designed, developed and operated with their efficient interplay in mind. This requires different experts and teams to work together – so it needs a data organisation which fits the data architecture. We often see companies which don’t have a coherent and holistic data strategy. The negative consequence is that the technological platform and the personnel organisation are either oversized or undersized: if the data platform is larger than needed, the data organisation is overwhelmed and has no time for designing, developing and operating the data products. Conversely, an oversized data organisation tries to build complex real-time data products on an undersized data platform that does not meet the technical requirements for those products. The consequences are the same in both cases: no or too little value is generated from the data.

Mr Szugat, thank you for sharing your insights on what data warehouses are.

Thank you, Dr Caldarola, and I look forward to reading your upcoming interviews with recognised experts, delving even deeper into this fascinating topic.

About me and my guest

Dr Maria Cristina Caldarola

Dr Maria Cristina Caldarola, LL.M., MBA is the host of “Duet Interviews”, co-founder and CEO of CU³IC UG, a consultancy specialising in systematic approaches to innovation, such as algorithmic IP data analysis and cross-industry search for innovation solutions.

Cristina is a well-regarded legal expert in licensing, patents, trademarks, domains, software, data protection, cloud, big data, digital eco-systems and industry 4.0.

A TRIUM MBA, Cristina is also a frequent keynote speaker, a lecturer at St. Gallen, and the co-author of the recently published Big Data and Law now available in English, German and Mandarin editions.

Martin Szugat

Martin Szugat is founder and managing director of Datentreiber, a consulting firm specialised in supporting companies to become data-driven. He is also the program director of the European Predictive Analytics World and Deep Learning World conferences and a lecturer at HWZ Hochschule für Wirtschaft Zürich. Before starting Datentreiber, he founded one of Germany’s first social media agencies and prior to that Mr Szugat studied and worked in Bioinformatics.
