The manifold aspects of Big Data projects

Vicky Feygina

In her Duet Interview with economist and entrepreneur Vicky Feygina, Dr Caldarola, author of Big Data and Law, discusses this topic.

Companies invoke Big Data-driven models as a synonym for an innovative approach. What is the appeal? What do companies strive for with Big Data projects? What do they really achieve?

Vicky Feygina: I think the days when Big Data was a sexy new term are behind us. At the beginning of the second decade of this century, if a company employed Big Data in its business model, it was signalling to both clients and shareholders that a digital transformation (another quickly aging term) was in progress and that even a brick-and-mortar business (i.e. a business selling a physical product) could become part of the new intangible economy by making use of its data. Now that the value of Big Data as the new fuel of our century is firmly entrenched in our business psyche, the real challenge is how to maximize this digital asset while limiting the exposure to legal and reputational risks that Big Data projects invariably entail.

What do companies strive to achieve? Most commonly, it is to create predictive models of customer behaviour. Media and entertainment companies, such as Netflix, Spotify and YouTube (Google), are probably best known for their “recommender” algorithms, based on models of their customers’ likes and dislikes. YouTube, specifically, under the leadership of Cristos Goodrow, is the finest example of this type of algorithm usage, crunching enormously large and diverse data (uploads to the tune of 500,000 video hours per day) to make predictive viewing recommendations and strategic advertisement placements. A more sinister spin on this is that companies aim to increase, perhaps even exponentially, their ability to manipulate and influence consumers, or to profit from making their customer data available to third parties. Facebook comes to mind as a business case where a corporate reputation was tarnished by data-sharing with dubious third parties. Less known to the public, I think, is the use of Big Data to create new products. For now, pharmaceutical companies are the biggest beneficiaries of this approach, using, for example, algorithmic searches to correlate symptoms, medications, side effects etc. on a large scale. In the early days of the COVID-19 pandemic, there was a lot of coverage in the press about companies like Google using supercomputers to sift through giant swathes of data, searching for existing medications that could be used against the novel coronavirus.
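A recommender of the kind mentioned above can be illustrated with a minimal user-based collaborative-filtering sketch. The users, items and ratings below are invented for illustration; production systems at Netflix or YouTube are vastly more sophisticated.

```python
import math

# Toy user–item ratings (hypothetical data, invented for illustration).
ratings = {
    "alice": {"docu": 5, "comedy": 1, "thriller": 4},
    "bob":   {"docu": 4, "comedy": 2, "thriller": 5},
    "carol": {"docu": 1, "comedy": 5},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[k] * v[k] for k in common)
    den = (math.sqrt(sum(u[k] ** 2 for k in common))
           * math.sqrt(sum(v[k] ** 2 for k in common)))
    return num / den

def recommend(user):
    """Rank unseen items by other users' ratings, weighted by similarity."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, r in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `recommend("carol")` suggests the one item Carol has not yet rated, weighted by how similar her tastes are to Alice's and Bob's.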

Finally, it is worth noting that reliance on Big Data and algorithmic predictions will make decision-making more formulaic, which is, on the one hand, more accurate and efficient, but can also be potentially treacherous. Sir Mervyn King, together with John Kay, discussed the pitfalls of overreliance on Big Data and predictive models in Radical Uncertainty: Decision-making for an unknowable future (2020). Another interesting point to consider is how the availability of Big Data about consumer behaviour will influence producers of creative goods, such as artists, writers and musicians. In a recent interview, Gustav Soderstrom, Chief Research and Development Officer at Spotify, discussed how consumer preferences can loop back to musicians to let them know what chords, riffs, intros, lengths and other attributes of a song made it more popular, so that their future music can be crafted by following these parameters. This approach has all the makings of a conflict between efficiency and creativity that will be interesting to gauge and understand.

Yes, businesses may collect and combine data from different sources because, first and most obviously, they may not have enough in-house data for meaningful analysis. Second, the accuracy and timeliness of data may be contingent on a multiplicity of sources: for example, autonomous cars will rely on data from “car networks,” generated in real time by multitudes of other drivers and vehicles on the road.

Furthermore, data derived from different sources is not only different in nature (personal or non-personal data), but is also governed by different laws (e.g. personal data by data protection laws and non-personal data by civil laws) and is subject to different legal jurisdictions, depending on where the source has its residence or place of business. So, for example, a data subject living and providing his/her data in France is governed by French law, whereas a data subject who is a resident of the US will be governed by US law.

Data also results from different processes. Some data might have been generated via an e-commerce site, other data from an IoT process designed to better understand and customize product usage, and still other data may simply be purchased from information sellers. When we consider these various scenarios, the legal bases allowing for data collection are quite dissimilar and thus may be applied differently.

Last but not least, the length of time for which some of the data in a data pool may be used can differ from that of the rest of the data within the same pool. This is because, under the GDPR, the right to be forgotten comes into effect depending on the data category. This means that some data has to be deleted at the end of the transaction/usage, while other data might be kept until the end of its recognized retention period, and still other data might have a longer time span.
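The idea that different data in the same pool carries different deletion deadlines can be sketched as a simple retention check. The category names and periods below are hypothetical; real retention rules come from the applicable law and the company's documented policy.

```python
from datetime import date, timedelta

# Hypothetical retention periods per data category; actual periods depend
# on the applicable law and the company's documented retention policy.
RETENTION = {
    "transaction": timedelta(days=0),        # delete at end of transaction/usage
    "invoice": timedelta(days=365 * 10),     # e.g. a statutory retention period
    "marketing_consent": timedelta(days=365 * 2),
}

def deletion_due(category: str, last_used: date, today: date) -> bool:
    """Return True if data of this category is past its retention period."""
    period = RETENTION.get(category)
    if period is None:
        raise KeyError(f"no retention rule defined for {category!r}")
    return today > last_used + period
```

In a real data pool, a check like this would run per record, so that records of one category are deleted while records of another category in the same pool are lawfully kept.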

Due to these requirements, control and documentation of data use may be daunting. A company using Big Data must be fully transparent about its actions. This means that the company has to provide clear and timely information about the type of data it collects and processes and the purpose for which the data is collected and processed, including the legal grounds for collection and processing. The company has to be able to respond to requests for information concerning the data. The company must be able to identify and delete data and its copies when consent for data processing has been withdrawn or when the lifecycle of the data has ended. Overall, the company must rigorously document everything concerning its data processing practices at an internal level, all the while ensuring data safety in the face of external threats. The latter translates into continuous investment in maintaining an up-to-date data management system, including cybersecurity and data privacy protection aspects.

I think you get the point – but if there is a temptation to cut corners on costs, the consequences can be quite dire, as I am sure you are well aware. The General Data Protection Regulation, instituted two years ago, gave supervisory authorities the power to impose fines of up to 4% of a company’s global revenues. We have already seen some hefty sums in the fines imposed under the GDPR on British Airways (£183m)1 and the Marriott hotel group, which was fined nearly £100m for failing to keep customer data safe from hackers.

But costs are not limited to administrative, supervisory and legal aspects. We haven’t touched upon the potentially vast undertaking of making unstructured data suitable for processing. Again, because data may come from a variety of sources, raw data may not be fit for analysis. A set of data may have to be put through multiple iterations of retrieval, clean-up, labelling and storage before it can be employed in systematic analysis.
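The iterative clean-up just described might be sketched roughly as follows. The stage functions are deliberately simplistic stand-ins for what is, in practice, a large and often partly manual effort.

```python
# A minimal sketch of a clean-up pipeline; the stage names mirror the
# steps mentioned above (retrieval, clean-up, labelling, storage).

def retrieve(raw_records):
    # Drop records that are empty or not text at all.
    return [r for r in raw_records if isinstance(r, str) and r.strip()]

def clean(records):
    # Normalise whitespace and case.
    return [" ".join(r.split()).lower() for r in records]

def label(records):
    # Attach a placeholder label; real labelling is often manual
    # or model-assisted, and is usually the most expensive stage.
    return [{"text": r, "label": "unlabelled"} for r in records]

def store(records):
    # Stand-in for writing the result to a datastore.
    return list(records)

def pipeline(raw_records):
    return store(label(clean(retrieve(raw_records))))
```

In practice such a pipeline is run repeatedly, with each iteration catching defects (encoding issues, duplicates, mislabelled items) that the previous pass let through.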

As for data crossing national borders, needless to say, it’s very different compared to transactions involving physical goods. Once data falls within a different national jurisdiction, and has been copied and processed, it is not like a product that can be shipped back.

There are no cohesive, long-term worldwide standards at the moment.

In 2018 the EU established the first harmonized data protection law for its Member States: the GDPR. However, it must be emphasized that Member States diverge in their perceived benefits from the regulation and in how important they consider the GDPR to be. Although the EU is now a leader in data protection, this position is not without its difficulties. As recently reported by FT.com, the GDPR is not being adequately enforced and regulators are not being sufficiently funded at the national level, with the majority of enforcement bodies across Europe having budgets below €10m and 14 Member States reporting budgets below €5m.

Of course, as you know, a major sticking point for Big Data business is data transfer from the EU to the US, where the Patriot Act may expose data to government snooping. There is currently a woeful lack of clarity and guidance when it comes to transatlantic data transfer. This is why I found your book Big Data and Law to be absolutely indispensable in understanding the ABCs of data privacy protection. Without understanding the fundamental principles and terminology, which are so aptly covered in your work, it would be very difficult to understand and act in accordance with current laws.

In addition, the legal landscape on data privacy protection is literally shifting as we speak. As recently as July of this year, Noyb, a non-profit privacy campaign group, won a ruling at the European Court of Justice that invalidated the Privacy Shield, a transatlantic agreement used by around 5,000 companies to transfer data from the EU to the US. Since the ruling, companies have had to rely on individual legal agreements, but even the ECJ has acknowledged that these contracts may not satisfy European standards, an admission which has companies scrambling for legal guidance from any data protection authority that can offer specific rules. The data protection authority in Baden-Württemberg in Germany currently recommends that companies encrypt or anonymize data as a way to safeguard it from government interference.
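The encrypt-or-anonymize recommendation can be illustrated with a minimal keyed-hash pseudonymization sketch. Note the caveat built into the code comment: pseudonymized data is still personal data under the GDPR, because whoever holds the key can re-link the records, so this technique alone does not take the data out of the regulation's scope.

```python
import hashlib
import hmac

def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (a pseudonym).

    This is pseudonymization, not anonymization: whoever holds the key
    can recompute the mapping and re-link records, so under the GDPR the
    result remains personal data, and the key should stay with the EU
    data exporter rather than travel with the data.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

The same identifier hashed with the same key always yields the same pseudonym, so records can still be joined for analysis abroad, while the raw identifier never leaves the exporter.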

You will surely agree that the problem at its core is not simply economic or managerial but largely political. If the data of EU citizens is not safe from government surveillance in the US, then these guidelines would apply to many other countries, specifically those with documented human rights abuses and mass government surveillance laws. China, a long-time champion of cyber sovereignty over the inflow of data from the outside, now finds itself on the other side of the equation, encountering stiff resistance to the possibility of European or American data ending up in Chinese hands. TikTok is a case in point. Protectionism or defence of liberal democratic principles? I will leave the answers to such questions to your other interviewees, who are more familiar with the matter.

For now, there are two possible outcomes. The first is that, due to these data privacy protection issues, the internet may splinter into zones within which data flows freely, such as the EU bloc and North America under NAFTA. The second scenario is that there will be greater effort and, hopefully, success in harmonizing national data protection laws. But given that a common set of rules implies having fundamental common ground to work with, as is the case within the EU, it will not be easy to bring everyone into the fold. Perhaps it’s time to establish a supranational body dedicated to harmonizing data protection, similar to but more authoritative than WIPO (the World Intellectual Property Organization), a body which has been negotiating and harmonizing patent, trademark and copyright law throughout the world. One other quick note regarding harmonization: laws can sometimes end up being harmonized to the tune of one partner; for example, the new US-Mexico-Canada trade agreement mirrors the language of Section 230, under which platforms are granted immunity for the data they host.

My favorite expression when it comes to Big Data is still “GIGO” – garbage in, garbage out.

Vicky Feygina

Most Big Data that we deal with today is not properly labelled or annotated. Of course, we now have AI that can clean up data, AI that can label data, and then yet another, smarter AI capable of analysing the data. But when it comes to “learner” algorithms, there is a new approach to AI called transfer learning, in which AI learns from smaller sets of data. Just as children don’t need to see a million pictures of dogs to learn how to recognize a dog, learner algorithms may learn from a smaller, high-quality set of data if given better learning parameters. What this means is that size will matter less than the quality of content. On the other hand, when it comes to aggregating large data about infectious pandemics, we may take a more Bayesian approach: better to have a large imperfect model than to wait for a small perfect one. I guess time will tell.
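The frozen-backbone-plus-small-head pattern behind transfer learning can be sketched in a toy example. The "pretrained" extractor here is just a fixed random projection standing in for a network trained on a large corpus; only the small linear head is fitted on the small labelled set.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" feature extractor: a stand-in for a network trained on a
# large corpus. It stays frozen during transfer; only the head is fitted.
W_pretrained = rng.normal(size=(10, 4))

def features(x):
    return np.tanh(x @ W_pretrained)

# Small, high-quality labelled set: the point of transfer learning is
# that only the tiny head below needs to be trained on it.
X_small = rng.normal(size=(20, 10))
y_small = (X_small[:, 0] > 0).astype(float)

# Fit only a linear "head" on top of the frozen features (least squares).
Phi = features(X_small)
head, *_ = np.linalg.lstsq(Phi, y_small, rcond=None)

def predict(x):
    return (features(x) @ head > 0.5).astype(float)
```

The same idea scales up to fine-tuning large pretrained networks: the expensive representation is learned once on big data, and each downstream task needs only a small, high-quality set to fit its head.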

Vicky, thank you for sharing your insights on Big Data and for so amply contextualizing the manifold aspects of Big Data projects.

Thank you, Cristina, and I look forward to reading your upcoming interviews with recognized experts, delving even deeper into this fascinating topic.

1 The fine was subsequently reduced to £20m after this interview was conducted: https://www.bbc.com/news/technology-54568784

About me and my guest

Dr Maria Cristina Caldarola

Dr Maria Cristina Caldarola, LL.M., MBA is the host of “Duet Interviews”, co-founder and CEO of CU³IC UG, a consultancy specialising in systematic approaches to innovation, such as algorithmic IP data analysis and cross-industry search for innovation solutions.

Cristina is a well-regarded legal expert in licensing, patents, trademarks, domains, software, data protection, cloud, big data, digital eco-systems and industry 4.0.

A TRIUM MBA, Cristina is also a frequent keynote speaker, a lecturer at St. Gallen, and the co-author of the recently published Big Data and Law now available in English, German and Mandarin editions.

Vicky Feygina

Vicky Feygina, MPA, MBA is a self-described “tireless entrepreneur and lifelong learner”. With Big 4 management consulting experience and, over 10 years in the airline industry in finance and operations roles, Vicky is a TRIUM MBA and a co-founder and CEO of Little Miracles Designs, a Brooklyn-based design firm providing innovative solutions for urban outdoor spaces.
