The manifold aspects of Big Data projects

In her Duet Interview with economist and entrepreneur Vicky Feygina, Dr Caldarola, author of Big Data and Law, discusses this topic.

Companies invoke Big Data-driven models as shorthand for an innovative approach. What is the appeal? What do companies strive for with Big Data projects? What do they really achieve?

Vicky Feygina: I think the days when Big Data was a sexy new term are behind us. In the early 2010s, if a company employed Big Data in its business model, it was signalling to both clients and shareholders that a digital transformation (another quickly aging term) was in progress and that even a brick-and-mortar business (i.e. a business selling a physical product) could become part of the new intangible economy by making use of its data. Now that the value of Big Data as the new fuel of our century is firmly entrenched in our business psyche, the real challenge is how to maximize this digital asset while limiting the exposure to legal and reputational risks that Big Data projects invariably entail.

What do companies strive to achieve? Most commonly, it is to create predictive models of customer behaviour. Media and entertainment companies, such as Netflix, Spotify and YouTube (Google), are probably best known for their “recommender” algorithms, based on models of their customers’ likes and dislikes. YouTube specifically, under the leadership of Cristos Goodrow, is the finest example of this type of algorithm usage, crunching enormously large and diverse data (uploads to the tune of 500,000 video hours per day) to make predictive viewing recommendations and strategic advertisement placements. A more sinister spin on this is that companies aim to increase, perhaps even exponentially, their ability to manipulate and influence consumers, or to profit from making their customer data available to third parties. Facebook comes to mind as a business case where a corporate reputation was tarnished by data-sharing with dubious third parties. Less known to the public, I think, is the use of Big Data to create new products. For now, pharmaceutical companies are the biggest beneficiaries of this approach, using, for example, algorithmic searches to correlate symptoms, medications, side effects etc. on a large scale. In the early days of the COVID-19 pandemic, there was a lot of coverage in the press about companies like Google using supercomputers to sift through giant swathes of data in search of existing medications whose effects could be turned against the novel coronavirus.
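
To make the mechanism concrete, here is a minimal sketch of an item-based recommender in Python. It is not any particular company's system: the tiny user-item matrix and the cosine-similarity scoring are illustrative stand-ins for the vastly larger models described above.

```python
# A minimal item-based recommender sketch. All data is invented; real
# systems operate at vastly larger scale with far richer signals.
import numpy as np

# Rows = users, columns = items; 1.0 means the user liked the item.
ratings = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
])

def cosine_sim(a, b):
    """Cosine similarity between two item columns."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom else 0.0

def recommend(user_idx, top_n=2):
    """Score unseen items by their similarity to items the user liked."""
    seen = ratings[user_idx]
    scores = {}
    for item in range(ratings.shape[1]):
        if seen[item]:            # skip items already consumed
            continue
        scores[item] = sum(
            cosine_sim(ratings[:, item], ratings[:, liked])
            for liked in range(ratings.shape[1]) if seen[liked]
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend(0))  # items ranked against user 0's viewing history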

Finally, it is worth noting that reliance on Big Data and algorithmic predictions will make decision-making more formulaic, which is, on the one hand, more accurate and efficient, but can also be potentially treacherous. John Kay and Sir Mervyn King discussed the pitfalls of overreliance on Big Data and predictive models in their Radical Uncertainty: Decision-making for an unknowable future (2020). Another interesting point to consider is how the availability of Big Data about consumer behaviour will influence producers of creative goods, such as artists, writers and musicians. In a recent interview, Gustav Soderstrom, Chief Research and Development Officer at Spotify, discussed how consumer preferences can loop back to musicians to let them know which chords, riffs, intros, lengths and other attributes of a song made it more popular, so their future music can be crafted by following these parameters. This approach has all the makings of a conflict between efficiency and creativity that will be interesting to gauge and understand.

Yes, businesses may collect and combine data from different sources because, first and most obviously, they may not have enough in-house data for meaningful analysis. Second, the accuracy and timeliness of data may be contingent on a multiplicity of sources: autonomous cars, for example, will rely on data from “car networks”, generated in real time by multitudes of other drivers and vehicles on the road.

Furthermore, data derived from different sources is not only different in nature (personal or non-personal data), but is also governed by different laws (e.g. personal data by data protection laws and non-personal data by civil laws) and is subject to different legal jurisdictions, depending on where the source has its residence or place of business. So, for example, a data subject living and providing his or her data in France is governed by French law, whereas a data subject residing in the US is governed by US law.

Data also results from different processes. Some data might have been generated via an e-commerce site, other data from an IoT process designed to better understand and customize product usage, and still other data may simply be purchased from information sellers. Across these various scenarios, the legal bases allowing for data collection are quite dissimilar and thus may have to be applied differently.

Last but not least, the permissible retention period for some of the data in a data pool may differ from that of the rest of the data within the same pool. This is because, under the GDPR, the right to be forgotten comes into effect depending on the data category. This means that some data has to be deleted at the end of the transaction or usage, while other data might be saved until the end of its recognized retention period, and still other data might have a longer time span.
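
As a minimal sketch of what this looks like in practice, the Python below assigns each data category its own retention period and computes when each record falls due for deletion. The categories and periods are invented assumptions for illustration, not legal guidance.

```python
# Per-category retention deadlines: data in the same pool falls due for
# deletion at different times. Categories and periods are invented.
from datetime import date, timedelta

RETENTION_DAYS = {
    "session_log": 0,            # delete at end of transaction/usage
    "order_record": 365 * 10,    # e.g. a statutory commercial period
    "marketing_opt_in": 365 * 2, # e.g. a shorter recognized period
}

def deletion_due(category, collected_on):
    """Date by which a record of this category must be deleted."""
    return collected_on + timedelta(days=RETENTION_DAYS[category])

pool = [("session_log", date(2020, 9, 1)),
        ("order_record", date(2020, 9, 1)),
        ("marketing_opt_in", date(2020, 9, 1))]

for category, collected in pool:
    print(category, "due for deletion on", deletion_due(category, collected))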

Due to these requirements, control and documentation of data use may be daunting. A company using Big Data must be fully transparent about its actions. This means that the company has to provide clear and timely information about the type of data it collects and processes, and the purpose for which the data is collected and processed, including the legal grounds for collection and processing. The company has to be able to respond to requests for information concerning the data. The company must be able to identify and delete data and its copies when consent for data processing has been withdrawn or when the lifecycle of the data has ended. Overall, the company must rigorously document everything concerning its data processing practices at an internal level, all the while safeguarding the data against external threats. The latter translates into continuous investment in maintaining an up-to-date data management system, including its cybersecurity and data privacy protection aspects.
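
Here, too, a minimal sketch may help. The Python below models the internal bookkeeping just described: each processing record notes the data category, purpose, legal basis and storage locations, so that information requests can be answered and every copy located when consent is withdrawn. All field names and structures are illustrative assumptions.

```python
# A toy register of processing activities, illustrating how transparency
# and erasure obligations can be supported by internal documentation.
from dataclasses import dataclass, field

@dataclass
class ProcessingRecord:
    subject_id: str
    category: str            # e.g. "email", "purchase_history"
    purpose: str             # why the data is processed
    legal_basis: str         # e.g. "consent", "contract"
    storage_locations: list = field(default_factory=list)

registry = []  # one ProcessingRecord per subject/category pair

def report_for(subject_id):
    """Answer a data subject's request for information."""
    return [r for r in registry if r.subject_id == subject_id]

def erase(subject_id):
    """On consent withdrawal, list every stored copy that must go."""
    copies = [loc for r in report_for(subject_id)
              for loc in r.storage_locations]
    registry[:] = [r for r in registry if r.subject_id != subject_id]
    return copies

registry.append(ProcessingRecord(
    "user-42", "email", "newsletter", "consent",
    storage_locations=["crm_db", "backup_2020_09"]))
print(erase("user-42"))  # -> ['crm_db', 'backup_2020_09']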

I think you get the point, but if there is a temptation to cut corners on costs, the consequences can be quite dire, as I am sure you are well aware. The General Data Protection Regulation, which came into application two years ago, gave supervisory authorities the power to impose fines of up to 4% of a company’s global revenues. We have already seen some hefty sums in the fines brought under the GDPR against British Airways (£183m)1 and the Marriott hotel group, which was fined nearly £100m for failing to keep its data safe from hackers.

But costs are not limited to administrative, supervisory and legal aspects. We haven’t touched upon the potentially vast undertaking of making unstructured data suitable for processing. Again, because data may come from a variety of sources, raw data may not be fit for analysis. A set of data may have to be put through multiple iterations of retrieval, cleaning, labelling and storage before it can be employed in systematic analysis.
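
A minimal sketch of that iterative loop, with invented records and placeholder validation rules, might look like this in Python:

```python
# Retrieve raw records, clean and label them, and store only those that
# pass validation, re-queuing the rest for another pass. The cleaning
# and validation rules are invented placeholders.
def clean(record):
    """Normalize string fields and drop empty ones."""
    return {k: v.strip().lower() for k, v in record.items() if v}

def label(record):
    """Attach a crude category label for downstream analysis."""
    record["label"] = "order" if "order_id" in record else "unknown"
    return record

def is_valid(record):
    return record["label"] != "unknown"

raw = [{"order_id": " A-17 ", "item": "Widget "},
       {"item": "Gadget", "note": ""}]

clean_store, retry_queue = [], []
for rec in raw:
    processed = label(clean(rec))
    (clean_store if is_valid(processed) else retry_queue).append(processed)

print(len(clean_store), "stored;", len(retry_queue), "need another pass")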

As for data crossing national borders, needless to say, it’s very different compared to transactions involving physical goods. Once data falls within a different national jurisdiction and has been copied and processed, it is not like a product that can be shipped back.

There are no cohesive, long-term worldwide standards at the moment.

In 2018 the EU established the first harmonized data protection law for its Member States: the GDPR. However, it must be emphasized that Member States diverge in their perceived benefits from the regulation and in how important they consider the GDPR to be. Although the EU is now a leader in data protection, this position is not without its difficulties. As recently reported by FT.com, the GDPR is not being adequately enforced and regulators are not being sufficiently funded at the national level, with the majority of enforcement bodies across Europe having budgets below €10m and 14 Member States reporting budgets below €5m.

Of course, as you know, a major sticking point for Big Data business is data transfer from the EU to the US, where the Patriot Act may expose data to government snooping. There is currently a woeful lack of clarity and guidance when it comes to transatlantic data transfer. This is why I found your book Big Data and Law to be absolutely indispensable in understanding the ABCs of data privacy protection. Without understanding the fundamental principles and terminology so aptly covered in your work, it would be very difficult to understand and act in accordance with current laws.

In addition, the legal landscape on data privacy protection is literally shifting as we speak. As recently as July of this year, Noyb, a non-profit privacy campaign group, won a ruling at the European Court of Justice that invalidated the Privacy Shield, a transatlantic agreement used by around 5,000 companies to transfer data from the EU to the US. Since the ruling, companies have had to rely on individual legal agreements known as standard contractual clauses, but the ECJ itself made clear that such contracts may not always satisfy European standards, an admission which has companies scrambling for legal guidance from any data protection authority that can offer specific rules. The data protection authority in Baden-Württemberg in Germany currently recommends that companies encrypt or anonymize data as a way to safeguard it from government interference.
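
To illustrate one such safeguard, here is a minimal Python sketch of keyed pseudonymization applied before a transfer, under the assumption that the key never leaves the EU. Note that pseudonymized data still counts as personal data under the GDPR; true anonymization is a stronger, harder standard. The record fields are invented.

```python
# Keyed pseudonymization (HMAC-SHA256) of an identifier before export.
# Weaker than true anonymization: whoever holds the key can re-link.
import hashlib, hmac, os

SECRET_KEY = os.urandom(32)  # kept inside the EU, never transferred

def pseudonymize(value):
    """Replace an identifier with a keyed, irreversible-looking hash."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

record = {"email": "jane@example.com", "country": "FR", "purchases": 7}
export = {**record, "email": pseudonymize(record["email"])}
print(export)  # identifier is unreadable without the key held in the EU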

You will surely agree that the problem at its core is not simply economic or managerial but largely political. If the data of EU citizens is not safe from government surveillance in the US, then the same reasoning applies to many other countries, particularly those with documented human rights abuses and mass government surveillance laws. China, a long-time champion of cyber sovereignty over the inflow of data from the outside, now finds itself on the other side of the equation, encountering stiff resistance to the possibility of European or American data ending up in Chinese hands. TikTok is a case in point. Protectionism or defence of liberal democratic principles? I will leave the answers to such questions to your other interviewees, who are more familiar with the matter.

For now, there are two possible outcomes. The first is that, due to these data privacy protection issues, the internet may splinter into zones within which data flows freely, such as the EU bloc and North America under the US-Mexico-Canada agreement. The second scenario is that there will be greater effort and, hopefully, success in harmonizing national data protection laws. But given that a common set of rules implies having fundamental common ground to work with, as is the case within the EU, it will not be easy to bring everyone into the fold. Perhaps it’s time to establish a supra-national body dedicated to harmonizing data protection, similar to but more authoritative than WIPO (the World Intellectual Property Organization), a body which has been negotiating and harmonizing patent, trademark and copyright law throughout the world. One other quick note regarding harmonization: laws can sometimes end up being harmonized on one partner’s terms; for example, the new US-Mexico-Canada trade agreement mirrors the language of Section 230 of the US Communications Decency Act, under which platforms are granted immunity for the content they host.

My favorite expression when it comes to Big Data is still “GIGO” – garbage in, garbage out.

Most Big Data that we deal with today is not properly labelled or annotated. Of course, we now have AI that can clean up data, and AI that can label data before it is passed on to yet another, smarter AI capable of analysing it. But when it comes to “learner” algorithms, there is a new approach to AI called transfer learning, in which AI learns from smaller sets of data. Just as children don’t need to see a million pictures of dogs to learn how to recognize a dog, learner algorithms may learn from a smaller, high-quality set of data if given better learning parameters. What this means is that size will matter less than the quality of content. On the other hand, when it comes to aggregating large data about infectious pandemics, we may take a more Bayesian approach: better to have a large imperfect model than to wait for a small perfect one. I guess time will tell.
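
As a minimal sketch of the transfer-learning idea, the Python below reuses a “pretrained” feature extractor and trains only a tiny head on sixteen examples. The pretrained weights are random stand-ins; a real system would load representations learned on a large corpus.

```python
# Transfer learning in miniature: freeze a feature extractor, fit only
# a small head on a small, high-quality dataset. All data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
W_pretrained = rng.normal(size=(8, 4))    # frozen "pretrained" extractor

def features(x):
    return np.tanh(x @ W_pretrained)      # reused representation

# A small, high-quality labelled set (16 examples, not millions).
X = rng.normal(size=(16, 8))
y = (X[:, 0] > 0).astype(float)

w_head = np.zeros(4)                      # only this tiny head is trained
for _ in range(500):                      # logistic-regression fine-tuning
    p = 1 / (1 + np.exp(-features(X) @ w_head))
    w_head -= 0.5 * features(X).T @ (p - y) / len(y)

preds = (1 / (1 + np.exp(-features(X) @ w_head))) > 0.5
print("training accuracy with a frozen extractor:", (preds == y).mean())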

Vicky, thank you for sharing your insights on Big Data and for so amply contextualizing the manifold aspects of Big Data projects.

Thank you, Cristina, and I look forward to reading your upcoming interviews with recognized experts, delving even deeper into this fascinating topic.


1 The fine was subsequently reduced to £20m after this interview was written: https://www.bbc.com/news/technology-54568784


About me and my guest

Dr Maria Cristina Caldarola

Dr Maria Cristina Caldarola, LL.M., MBA is the host of “Duet Interviews”, co-founder and CEO of CU³IC UG, a consultancy specialising in systematic approaches to innovation, such as algorithmic IP data analysis and cross-industry search for innovation solutions.

Cristina is a well-regarded legal expert in licensing, patents, trademarks, domains, software, data protection, cloud, big data, digital eco-systems and industry 4.0.

A TRIUM MBA, Cristina is also a frequent keynote speaker, a lecturer at St. Gallen, and the co-author of the recently published Big Data and Law now available in English, German and Mandarin editions.

Vicky Feygina

Vicky Feygina, MPA, MBA is a self-described “tireless entrepreneur and lifelong learner”. With Big 4 management consulting experience and over 10 years in the airline industry in finance and operations roles, Vicky is a TRIUM MBA and co-founder and CEO of Little Miracles Designs, a Brooklyn-based design firm providing innovative solutions for urban outdoor spaces.
