Companies need data warehouses to make use of the growing volume of data. The architecture of such warehouses can be built with open-source software. But can the data in a data warehouse also be managed under an open data regime, so that access is open to everyone from all sides, fast (real-time), uncomplicated (without the various data subject rights, without data protection requirements…), uniform, free of charge, transnational and suitable for a variety of uses?
In the latest of her Duet interviews, Dr Caldarola, editor of Data Warehouse as well as author of Big Data and Law, and Dr Till Jaeger discuss possible open data regimes in a Data Warehouse.
Dr Jaeger, you wrote a great chapter about open data in the book Data Warehouse. Before we go into the challenges of open data regimes in data warehouses, we should clarify the various terms: open software, open hardware, open data, open government, open content, open access and others.
Dr Till Jaeger: These terms are all related to the use of “open” in Open-Source Software. This term was first used in 1998 to improve the marketing of “Free Software”. “Open” quickly became a synonym for anything that is copyrighted but can be used by anyone, royalty-free, even though “open” only relates to one criterion of Free Software: access to the source code. So, it’s about the licensing model, not just about access to a protected asset. Accordingly, Open Content refers to freely licensed works other than computer programs, while Open Access refers to free access to scientific publications. Open Hardware refers to circuit diagrams and instructions for making physical goods (think of 3D printers), and Open Data refers to freely usable databases. It is interesting to note that the Open Knowledge Foundation’s ‘Open Definition’ also requires Open Data to be in an open format, i.e., available in a format over which no organisation has exclusive control. Open Government has a different background: it is about transparency and collaboration across the different levels of government.
In this Duet Interview, we want to concentrate solely on the data in a data warehouse and thus on an open data regime. Here we must certainly distinguish between the individual (raw) data, the collection of data (data set), the databases (the structure in which data is stored), and the data derivatives (the information / knowledge / content). Which of these “assets” do open data licenses refer to?
Open Data licenses can cover both the data itself and collections of data. The latter can be protected as a database work or under the sui generis right for database makers. However, there are also Open Data licenses that only refer to the intellectual property rights in the database, leaving the licensing of the individual data to an Open Content license. Of course, you can’t talk about Open Data if the individual data is licensed in a proprietary way (e.g., photos) and only the rights to the database are free. The two must coincide for free reuse to be possible.
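The requirement that both layers coincide can be sketched as a simple check. This is a minimal illustration only: the `Dataset` type, the `is_open_data` function and the set of license labels treated as “open” are assumptions made for this example, not a legal classification.

```python
from dataclasses import dataclass

# Illustrative assumption: license labels treated as "open" in this sketch.
# This is not an authoritative or complete list.
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "ODbL-1.0", "PDDL-1.0"}


@dataclass(frozen=True)
class Dataset:
    """Hypothetical record of the two licensing layers of a data collection."""
    name: str
    database_license: str  # license covering the rights in the database itself
    content_license: str   # license covering the individual data items


def is_open_data(ds: Dataset) -> bool:
    """Free reuse is only possible if BOTH layers are openly licensed."""
    return (
        ds.database_license in OPEN_LICENSES
        and ds.content_license in OPEN_LICENSES
    )


# Usage: an open database containing proprietary photos is not Open Data.
weather = Dataset("weather", "ODbL-1.0", "CC0-1.0")
photos = Dataset("photos", "ODbL-1.0", "proprietary")
print(is_open_data(weather), is_open_data(photos))
```

The check deliberately uses a logical AND: an open license on the database alone is not enough if the individual items remain proprietary.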
Let’s assume that personal data can also be found in a data warehouse. We therefore need to discuss whether data protection is compatible with open data licenses. I suggest we look at different aspects in this regard. With all open licenses, copyright is always the decisive aspect. Are data or databases protected by copyright? Isn’t every person, or their digital twin or avatar with all its attributes (meaning their data), a creation?
There is also another aspect in which data protection and open licenses seem to differ: under data protection law, the data controller must transparently inform the data subject about the intended use, the controller(s) and the service providers / data processors involved. Open licenses, by contrast, grant all-encompassing processing activities and uses to everyone. Isn’t there a contradiction or even an incompatibility?
The relationship between data protection law and Open Data is certainly one of the most exciting issues in this context. In fact, both areas of law are often involved in data collections, namely when personal data has been aggregated in a copyrighted database. Here we come to a contradiction that cannot be resolved at the moment: according to the basic concept of the GDPR, every act of data processing requires its own authorisation. This can be either a legal permission or the consent of the data subject. However, consent cannot be given for a generic purpose; it must be specific, informed and unambiguous. This is already in conflict with the concept of Open Data, as Open Data licenses always allow use for any purpose. Furthermore, consent can be withdrawn at any time. As Open Data licenses are aimed at everyone, the potential pool of licensees is also unmanageable, which can make it practically impossible to enforce a withdrawal of consent.
It is therefore not surprising that Open Data licenses do not even attempt to address privacy issues. It is probably already assumed that only anonymised data, or data with no reference to a person, can be considered as the subject of a license.
I understand there is an inconsistency between the laws, and I can also see that data protection laws take precedence over a contractual license. But doesn’t the Open Data license, with its uncomplicated, immediate, international, uniform and all-round use of data, satisfy the needs of the many companies who would like to have a stock of data and who want to use all of it for a wide variety of purposes, immediately, without great effort, without costly consent and revocation management and without time-consuming, expensive deletion?
We see here the typical constellation of competing interests and legally protected rights colliding. Given the enormous potential for abuse in the aggregation of personal data and the possibility of creating unwanted personality profiles, it is not surprising that data protection law is given priority here.
My favourite citation:
“Open access policies aim in particular to provide researchers and the public at large with access to research data as early as possible in the dissemination process and to facilitate its use and re-use. Open access helps enhance quality, reduce the need for unnecessary duplication of research, speed up scientific progress, combat scientific fraud, and it can overall favour economic growth and innovation.”
EU Directive on open data and the re-use of public sector information, Recital 27
The consequence is that open data licenses can actually only be used for data that has no personal reference. Is there only one open data license or are there different licenses? How do they differ?
There are numerous Open Data licenses, although not quite as many as in the Open-Source world. The most important distinction is whether or not the license has a provision for the use of modified datasets. Some licenses contain a “share alike” or “copyleft” clause, which requires that modified databases be licensed under the terms of the original license when they are redistributed or made available to third parties. Licenses without such a share-alike clause are called permissive licenses.
I find the copyleft licenses interesting. So, if the copyleft license is to extend to the data derivatives (I assume that’s the information and knowledge gained from the data) or also to the added data, what happens if the data pool happens to contain undiscovered personal data? Does the copyleft prevail?
Modified databases tend to be cases where pre-existing datasets are enriched with additional data or a specific subset of data is extracted. If personal data is inadvertently involved, further use is not permitted until the personal data has been removed. Copyleft, like Open Data licenses in general, does not have the power to undermine privacy laws.
Are data derivatives merely the result of analyses or merging processes, meaning information, knowledge or extended data pools? Or does the term data derivative also cover algorithms trained with data, neural networks, perhaps even artificial intelligence?
This question is currently before the courts in the USA. Stability AI and Midjourney, the companies behind the text-to-image AI generators Stable Diffusion and Midjourney, have been sued by artists in California. The copyright infringement claim is that the images generated by Stable Diffusion and Midjourney are modifications of the images used as training material, which would make the generators complex “collage tools”. A similar case has been brought against Microsoft for using Open-Source Software to train the AI model behind the GitHub Copilot programming tool. The outcome is difficult to predict and is likely to depend on the technical details. In theory, however, the training data should not be included in the neural network or in the results generated by AI. A copyleft of the training data would therefore be irrelevant.
Algorithms, neural networks and artificial intelligence are usually classified as trade secrets. Let us assume that a copyleft license extends to algorithms, neural networks and AI: would that mean that they too would have to be made freely available to the general public? This might not be in the interest of many companies. Is it technically possible to determine, in the case of an algorithm, a neural network and/or AI, whether copyleft data was used for their training or development? If not, does that mean that the copyleft effect, unlike with software, cannot be proven retrospectively and therefore does not have a “serious”, verifiable and enforceable effect with regard to these assets (algorithms, neural networks, AI)?
The copyleft is not relevant if the data is for internal use only. Licenses with a copyleft only impose an obligation to license derived data or databases under the same terms if they are either passed on to third parties or third parties are allowed to use them, for example where the data is stored on a server and customers can access it via a web interface. Purely internal use of data does not trigger copyleft.
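Read as a rule, the trigger described here can be sketched as follows. This is a simplified illustration under assumed names (`LicenseKind`, `Use` and `share_alike_applies` are hypothetical); actual license texts define their own conditions in more detail.

```python
from enum import Enum


class LicenseKind(Enum):
    PERMISSIVE = "permissive"    # no share-alike clause
    SHARE_ALIKE = "share-alike"  # copyleft clause for modified databases


class Use(Enum):
    INTERNAL = "internal"            # data never leaves the organisation
    DISTRIBUTED = "distributed"      # passed on to third parties
    REMOTE_ACCESS = "remote-access"  # third parties use it, e.g. via a web interface


def share_alike_applies(kind: LicenseKind, modified: bool, use: Use) -> bool:
    """Return True if a modified database must be licensed under the original terms.

    Assumption of this sketch: the obligation arises only for share-alike
    licenses, only for modified databases, and only when third parties
    receive or can use the data.
    """
    return (
        kind is LicenseKind.SHARE_ALIKE
        and modified
        and use is not Use.INTERNAL
    )


# Usage: the same enriched database, used internally vs. offered to customers.
print(share_alike_applies(LicenseKind.SHARE_ALIKE, True, Use.INTERNAL))
print(share_alike_applies(LicenseKind.SHARE_ALIKE, True, Use.REMOTE_ACCESS))
```

Modelling remote access as its own case reflects the point that making data usable via a web interface counts as letting third parties use it, even though no copy is handed over.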
Will the Federal Government’s Open Data Strategy from 2021, the European Union’s Open Data Directive, the Data Governance Act, the Data Act among others bring about innovations in the near future?
The issue is quite dynamic and the public sector in particular is trying to stimulate the data economy through legislation, but also by releasing its own data as Open Data. For example, the Data Act is planned as a cross-sectoral law that will include provisions to facilitate access to and use of data by consumers and businesses, in particular, machine data (“data generated in the use of a product or associated service”), but also rights of defence against unlawful use and provisions on data contracts. It also provides for the right of public authorities to use data held by companies in the event of a public emergency, and covers rules to make it easier for consumers to move between cloud and edge services, and the planned development of interoperability standards for data to be reused by other sectors. However, the European Commission’s draft is controversial, so some changes are expected.
Dr Jaeger, thank you for sharing your manifold insights on Open Data.
Thank you, Dr Caldarola, and I look forward to reading your upcoming interviews with recognised experts, delving even deeper into this fascinating topic.