Can big data in the Data Ware­house suc­ceed with­in an open data regime?

Dr Till Jaeger

In order to be able to use the growing amount of data, companies need data warehouses. Their architecture can be equipped with open-source software. But can the data in the data warehouse also be managed under an open data regime, so that access is open to everyone from all sides, quick (real-time), uncomplicated (without the various data subject rights, without data protection requirements…), uniform, free of charge, transnational and suitable for a variety of uses?

In the lat­est of her Duet inter­views, Dr Cal­daro­la, edi­tor of Data Ware­house as well as author of Big Data and Law, and Dr Till Jaeger dis­cuss pos­si­ble open data regimes in a Data Warehouse.

Dr Jaeger, you wrote a great chap­ter about open data in the book Data Ware­house. Before we go into the chal­lenges of open data regimes in data ware­hous­es, we should clar­i­fy the var­i­ous terms: open soft­ware, open hard­ware, open data, open gov­ern­ment, open con­tent, open access and others.

Dr Till Jaeger: These terms are all relat­ed to the use of “open” in Open-Source Soft­ware. This term was first used in 1998 to improve the mar­ket­ing of “Free Soft­ware”. The term “open” was quick­ly used as a syn­onym for any­thing that is copy­right­ed but can be used by any­one, roy­al­ty-free, even though “open” only relates to one cri­te­ri­on of Free Soft­ware: access to the source code. So, it’s about the licens­ing mod­el, not just about access to a pro­tect­ed asset. Accord­ing­ly, Open Con­tent refers to freely licensed works out­side of com­put­er pro­grams while Open Access refers to free access to sci­en­tif­ic pub­li­ca­tions. Open Hard­ware refers to cir­cuit dia­grams and instruc­tions for mak­ing phys­i­cal goods (think of 3D print­ers), and Open Data refers to freely usable data­bas­es. It is inter­est­ing to note that the Open Knowl­edge Foun­da­tion’s ‘Open Def­i­n­i­tion’ also focus­es on Open Data being in an open for­mat, i.e., avail­able in a for­mat of which no organ­i­sa­tion has exclu­sive con­trol. Open Gov­ern­ment has a dif­fer­ent back­ground: it is about trans­paren­cy and col­lab­o­ra­tion across the dif­fer­ent lev­els of government.

In this Duet Interview, we want to concentrate solely on the data in a data warehouse and thus on an open data regime. Here we must certainly distinguish between the individual (raw) data, the collection of data (data set), the databases (the structure in which data is stored), and the data derivatives (the information / knowledge / content). Which of these “assets” do open data licenses refer to?

Open Data licens­es can cov­er both the data itself and col­lec­tions of data. The lat­ter can be pro­tect­ed as a data­base work or under the sui gener­is right for data­base mak­ers. How­ev­er, there are also Open Data licens­es that only refer to the intel­lec­tu­al prop­er­ty rights in the data­base, leav­ing the licens­ing of the indi­vid­ual data to an Open Con­tent license. Of course, you can’t talk about Open Data if the indi­vid­ual data is licensed in a pro­pri­etary way (e.g., pho­tos) and only the rights to the data­base are free. The two must coin­cide for free reuse to be possible.
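The requirement that both layers coincide can be written down as a very small check. This is only an illustrative sketch: the class `Dataset`, its field names and the function `is_open_data` are hypothetical and do not reflect the wording of any particular license.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical model of the two licensing layers of a data collection."""
    database_rights_free: bool  # database work / sui generis right is freely licensed
    items_free: bool            # the individual data items are freely licensed


def is_open_data(ds: Dataset) -> bool:
    """Both layers must be free for the collection to be freely reusable."""
    return ds.database_rights_free and ds.items_free


# A freely licensed database structure filled with proprietary photos.
photo_archive = Dataset(database_rights_free=True, items_free=False)
print(is_open_data(photo_archive))  # False
```

A collection such as the photo archive above fails the check even though the database rights themselves are free.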

Let’s assume that personal data can also be found in a data warehouse. We therefore need to discuss whether data protection is compatible with open data licenses. I suggest we look at different aspects in this regard. With all open licenses, copyright is always the decisive aspect. Are data or databases protected by copyright? Isn’t every person, their digital twin or avatar with all their attributes – meaning their data – a creation?

There is also another aspect that seems to differ between data protection and open licenses: under data protection law, the data controller must transparently inform the data subject about the intended use, the controller(s) and the service provider / data processor involved. With open licenses, all-encompassing processing activities and uses are granted to everyone. Isn’t there a contradiction or even an incompatibility?

The relationship between data protection law and Open Data is certainly one of the most exciting issues in this context. In fact, both areas of law are often involved in data collections, namely when personal data has been aggregated in a copyrighted database. Here we come to a contradiction that cannot be resolved at the moment: according to the basic concept of the GDPR, every data processing operation requires its own authorisation. This can either be a legal permission or the consent of the data subject. However, consent cannot be given for a generic purpose. Consent must be specific, informed and unambiguous. This is already in conflict with the concept of Open Data, as Open Data licenses always allow use for any purpose. Furthermore, consent can be withdrawn at any time. As Open Data licenses are aimed at everyone, the potential pool of licensees is also unmanageable, which can make revocation practically impossible.

It is there­fore not sur­pris­ing that Open Data licens­es do not even attempt to address pri­va­cy issues. It is prob­a­bly already assumed that only anonymised data, or data with no ref­er­ence to a per­son, can be con­sid­ered as the sub­ject of a license.

I understand there is an inconsistency between the laws, and I can also see that data protection laws take precedence over a contractual license. But doesn’t the Open Data license, with its uncomplicated, immediate, international, uniform and all-round use of data, satisfy the needs of the many companies that would like to have a stock of data and want to use all of it for a wide variety of purposes, immediately, without great effort, without cost-intensive consent and revocation management and without time-consuming and costly deletion?

We see here the typ­i­cal con­stel­la­tion of dif­fer­ent inter­ests and legal inter­ests col­lid­ing. Giv­en the enor­mous poten­tial for abuse in the aggre­ga­tion of per­son­al data and the pos­si­bil­i­ty of cre­at­ing unwant­ed per­son­al­i­ty pro­files, it is not sur­pris­ing that data pro­tec­tion law is giv­en pri­or­i­ty here.

My favourite citation:

“Open access policies aim in particular to provide researchers and the public at large with access to research data as early as possible in the dissemination process and to facilitate its use and re-use. Open access helps enhance quality, reduce the need for unnecessary duplication of research, speed up scientific progress, combat scientific fraud, and it can overall favour economic growth and innovation.”

EU Direc­tive on open data and the re-use of pub­lic sec­tor infor­ma­tion, Recital 27

The con­se­quence is that open data licens­es can actu­al­ly only be used for data that has no per­son­al ref­er­ence. Is there only one open data license or are there dif­fer­ent licens­es? How do they differ?

There are numerous Open Data licenses, although not quite as many as in the Open-Source world. The most important distinction is whether or not the license has a provision for the use of modified datasets. Some licenses contain a “share alike” or “copyleft” clause, which requires that modified databases be offered under the terms of the original license when they are redistributed or made available to third parties. Licenses without such a share-alike clause are called permissive licenses.

I find the copy­left licens­es inter­est­ing. So, if the copy­left license is to extend to the data deriv­a­tives (I assume that’s the infor­ma­tion and knowl­edge gained from the data) or also to the added data, what hap­pens if the data pool hap­pens to con­tain undis­cov­ered per­son­al data? Does the copy­left prevail?

Modified databases tend to be cases where pre-existing datasets are enriched with additional data or a specific subset of data is extracted. If personal data is inadvertently involved, further use is not permitted until the personal data has been removed. Copyleft, like Open Data licenses in general, does not have the power to undermine privacy laws.

Are data derivatives merely the result of analyses or merging processes, meaning information, knowledge or extended data pools? Or does the term “data derivative” also cover algorithms trained with data, neural networks, perhaps even artificial intelligence?

This question is currently before the courts in the USA. Stability AI and Midjourney, the companies behind the text-to-image AI generators Stable Diffusion and Midjourney, have been sued by artists in California. The copyright infringement claim is that the images generated by Stable Diffusion and Midjourney are modifications of the images used as training material, which would make the generators complex “collage tools”. A similar case has been brought against Microsoft for using Open-Source Software to train the AI model behind the GitHub Copilot programming tool. The outcome is difficult to predict and is likely to depend on the technical details. In theory, however, the training data should not be included in the neural network or in the results generated by AI. A copyleft of the training data would therefore be irrelevant.

Algorithms, neural networks and artificial intelligence are usually classified as trade secrets. Let us assume that a copyleft license extends to algorithms, neural networks and AI. Would that mean that they too would have to be made freely available to the general public? This might not be in the interest of many companies. Is it technically possible to determine – in the case of an algorithm, a neural network and/or AI – whether copyleft data has been used for their training or development? If not, does that mean that the copyleft effect, unlike with software, cannot be proven retrospectively and therefore does not have a “serious”, verifiable and enforceable effect with regard to these assets (algorithms, neural networks, AI)?

The copyleft is not relevant if the data is for internal use only. Licenses with a copyleft only impose an obligation to license derived data or databases under the same terms if they are either passed on to third parties or third parties are allowed to use them, for example when the data is stored on a server but customers can access it via a web interface. Purely internal use of data does not trigger copyleft.
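The trigger described here can be sketched as a small decision rule. This is only an illustration under simplifying assumptions: the class `DataUse`, its field names and the function `copyleft_applies` are hypothetical, and actual licenses differ in their exact wording.

```python
from dataclasses import dataclass


@dataclass
class DataUse:
    """Hypothetical description of how a derived database is used."""
    license_has_share_alike: bool  # license contains a copyleft / share-alike clause
    passed_to_third_parties: bool  # derived database is handed over to others
    third_party_access: bool       # others may use it, e.g. via a web interface


def copyleft_applies(use: DataUse) -> bool:
    """Return True if the derived database must be offered under the original license."""
    if not use.license_has_share_alike:
        return False  # permissive license: no share-alike obligation
    # Purely internal use does not trigger copyleft.
    return use.passed_to_third_parties or use.third_party_access


internal = DataUse(True, False, False)
web_service = DataUse(True, False, True)
print(copyleft_applies(internal))     # False
print(copyleft_applies(web_service))  # True
```

The web-service case mirrors the example of data that stays on the provider's server while customers access it through a web interface.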

Will the Federal Government’s Open Data Strategy from 2021, the European Union’s Open Data Directive, the Data Governance Act and the Data Act, among others, bring about innovations in the near future?

The issue is quite dynam­ic and the pub­lic sec­tor in par­tic­u­lar is try­ing to stim­u­late the data econ­o­my through leg­is­la­tion, but also by releas­ing its own data as Open Data. For exam­ple, the Data Act is planned as a cross-sec­toral law that will include pro­vi­sions to facil­i­tate access to and use of data by con­sumers and busi­ness­es, in par­tic­u­lar, machine data (“data gen­er­at­ed in the use of a prod­uct or asso­ci­at­ed ser­vice”), but also rights of defence against unlaw­ful use and pro­vi­sions on data con­tracts. It also pro­vides for the right of pub­lic author­i­ties to use data held by com­pa­nies in the event of a pub­lic emer­gency, and cov­ers rules to make it eas­i­er for con­sumers to move between cloud and edge ser­vices, and the planned devel­op­ment of inter­op­er­abil­i­ty stan­dards for data to be reused by oth­er sec­tors. How­ev­er, the Euro­pean Com­mis­sion’s draft is con­tro­ver­sial, so some changes are expected.

Dr Jaeger, thank you for shar­ing your man­i­fold insights on Open Data.

Thank you, Dr Cal­daro­la, and I look for­ward to read­ing your upcom­ing inter­views with recog­nised experts, delv­ing even deep­er into this fas­ci­nat­ing topic.

About me and my guest

Dr Maria Cristina Caldarola

Dr Maria Cristina Caldarola, LL.M., MBA is the host of “Duet Interviews”, co-founder and CEO of CU³IC UG, a consultancy specialising in systematic approaches to innovation, such as algorithmic IP data analysis and cross-industry search for innovation solutions.

Cristina is a well-regarded legal expert in licensing, patents, trademarks, domains, software, data protection, cloud, big data, digital eco-systems and industry 4.0.

A TRIUM MBA, Cristina is also a frequent keynote speaker, a lecturer at St. Gallen, and the co-author of the recently published Big Data and Law now available in English, German and Mandarin editions.

Dr Till Jaeger

Dr Till Jaeger is a partner at the law firm JBB Rechtsanwälte and co-founder of the Institute for Legal Questions of Free and Open-Source Software (ifrOSS). As a lawyer specialising in copyright and media law, the focus of his practical and academic work is on free license models.
