Big Data and Algorithms

Prof. Dr. Katharina Zweig – Photo: Felix Schmitt

Do you know how old algorithms really are? Or where they came from? What is an algorithm? Are there different types of algorithms? Muhammad al-Khwarizmi was a 9th-century Persian astronomer and mathematician. His textbook On Indian Numerals was translated into Latin four centuries later. It is considered the most important source for the Indo-Arabic numeral system and written arithmetic. His Latinised name, "Algorismi", is even the phonetic root of the word "algorithm". Ada Lovelace was the first person to write down an algorithm intended for a computer, in 1843. She is considered to be the first female programmer.

In the latest of her Duet interviews, Dr Caldarola, author of Big Data and Law, and IT expert Prof. Dr Katharina Zweig talk about algorithms and heuristics and their daily use.

Let's start our duet by explaining what an algorithm is.

Prof. Dr Katharina Zweig: We have heard a lot about algorithms in the last decades. However, it is a specific term with a slightly different meaning from how it is often used: an algorithm is a sequence of instructions that provably solves a mathematical problem. For example, there are multiple ideas about how to sort things on a computer: a bank might want to sort all of its borrowers by their creditworthiness. Given any sequence of instructions that supposedly solves this problem, it needs to be proven mathematically correct before it can be called an algorithm. In the media, however, any type of software is nowadays called an algorithm. This is a bit unfortunate, as AI in almost all cases does not use real algorithms, but only so-called heuristics. A heuristic is a sequence of instructions that is likely to yield a solution, maybe even a good one. But in contrast to an algorithm, we have no mathematical guarantee that the solution will be optimal for a given problem.
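The sorting example above can be made concrete. Sorting is one of the problems for which proven algorithms exist; merge sort, for instance, is mathematically guaranteed to produce a correctly ordered result. A minimal sketch, with made-up borrower data:

```python
# Sketch of the bank example (hypothetical data): merge sort is a real
# algorithm in the strict sense, provably correct with O(n log n) comparisons.
def merge_sort(items, key):
    """Return `items` sorted by `key`; provably yields a correct ordering."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid], key)
    right = merge_sort(items[mid:], key)
    merged, i, j = [], 0, 0
    # Repeatedly take the smaller front element of the two sorted halves.
    while i < len(left) and j < len(right):
        if key(left[i]) <= key(right[j]):
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

# Hypothetical borrowers with credit scores, lowest score first after sorting.
borrowers = [("Ada", 710), ("Ben", 640), ("Cara", 805)]
print(merge_sort(borrowers, key=lambda b: b[1]))
```

The correctness proof (every merge step preserves sortedness of both halves) is exactly what distinguishes this from a heuristic.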

Is an algorithm more objective than the human brain and, if so, why?

Again, we need to differentiate between real algorithms and those that are actually heuristics. Using the definition above, an algorithm is objective: it is proven to find the best solution to a given problem. For example, when you use a navigation system, the algorithm behind it will find the absolute shortest path between your location and some other location. However, if you want the fastest route given real traffic, no algorithm can guarantee it. Instead, the system will use AI to predict which street will be congested at which point in time. There are heuristics for these predictions, and we can measure in hindsight how well they performed. But there is no objective guarantee that the heuristic will find the fastest way. Thus, algorithms are always to be preferred, but in most cases no algorithm exists and we need to resort to heuristics, which are not objective.
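The navigation example can be illustrated with Dijkstra's algorithm, which is proven to find the shortest path in any graph with non-negative edge weights. The toy road network below is hypothetical:

```python
import heapq

def dijkstra(graph, start):
    """Provably returns the shortest distance from `start` to every
    reachable node, assuming non-negative edge weights."""
    dist = {start: 0}
    queue = [(0, start)]  # priority queue of (distance, node)
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale entry, a shorter path was already found
        for neighbour, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(queue, (nd, neighbour))
    return dist

# Hypothetical road network: junctions with distances in km.
roads = {
    "A": [("B", 4), ("C", 2)],
    "C": [("B", 1), ("D", 7)],
    "B": [("D", 3)],
}
print(dijkstra(roads, "A"))  # {'A': 0, 'B': 3, 'C': 2, 'D': 6}
```

If the weights were predicted travel times from a traffic model, the shortest-path computation would still be exact, but the overall answer inherits the uncertainty of the heuristic predictions, which is precisely the distinction drawn above.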

How much data do you need to develop an algorithm? Does the quality of the algorithm depend on the quantity and quality of the data used? What quantity and quality of data is required?

The interesting thing is that the development of an algorithm in the strict definition mentioned above does not need any data. It needs mathematical creativity, pens, a blackboard, time, frustration tolerance, and a bit of luck. Finding the best sequence of instructions to solve a given problem is an abstract, theoretical process. But now comes the twist: many interesting problems cannot be solved efficiently by an algorithm, for one of two reasons: either we haven't found one yet, or the ones we have found are too slow. Sometimes we have an algorithm, but it would literally take longer than 1,000 years to wait for the result.

Thus, nowadays, computer science answers some questions by designing heuristics that try to find statistically robust patterns in large amounts of data, independent of the concrete problem to solve. This approach is called machine learning. Machine learning consists of many different heuristics, based on some human intuition of how learning by seeing multiple examples works. For example, the machine could be shown information on a bank's borrowers and on whether they paid back their loans or not. The machine could then be tasked with finding the best patterns to identify those who are not creditworthy. These ideas about learning can be simple mathematical equations, like a line through some data, or the famous neural networks. Neural networks are just huge sets of mathematical equations with an even larger set of parameters, or weights, that model which of the presented pieces of information is most important. All these ideas of learning have in common that they read data from the past to find patterns for making decisions about the future. But they identify and store these patterns in different structures. In the above examples, the information is stored in the weights of the respective mathematical equations.

The problem to solve is then to find the best patterns in the data for making future decisions: for example, the best line through the data or the best weights in the neural network, such that they can identify creditworthy future applicants. However, only the simplest of these problems can be solved by an algorithm, namely finding the best possible line. Yet the line is very often a poor representation of the data, so it cannot be used in reality. All other ideas about learning create mathematical problems for which we do not have any algorithms, only heuristics. In short, we only have algorithms with proven optimal solutions for the simplest learning tasks. For more complex learning tasks, we only have heuristics that might not result in the best model to represent the data.
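The "best line" case mentioned above is the one learning task with a real algorithm: ordinary least squares has a closed-form solution that provably minimises the squared error. A minimal sketch, with made-up data:

```python
# Ordinary least squares for a line y = slope * x + intercept.
# The closed-form solution below is guaranteed to minimise the squared
# error; the data points are made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# slope = covariance(x, y) / variance(x); the intercept follows from the means.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
print(round(slope, 3), round(intercept, 3))  # 1.94 0.15
```

Training a neural network, by contrast, has no such closed form: gradient descent and its variants are heuristics with no guarantee of reaching the globally best weights.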

It is believed that enough data will make it very likely that the heuristics can come close to a very good solution. However, in most cases we do not know exactly how much data we need.

Are AI systems developed with raw data or do they require processed data? If the latter is the case, what 'mold' is required and how is it 'made'?

My favourite quote, by Geoffrey Bowker, answers that question:


"Raw data is both an oxymoron and a bad idea."

Geof­frey Bowker

The moment someone decides to store some information is the moment the data is "cooked" according to some flavour of choice. What we select and what remains undocumented is a choice. There is no such thing as raw data; all data is cooked.

AI systems work with probabilities and correlations, while our life, science and law follow causality. How is this compatible?

In my view, machine learning is interesting as a tool to find hypotheses about how the world works. However, the patterns found may be incidental. Hundreds of years of epistemology, the science of how we can deduce facts from observations, have shown us that identifying correlations is not enough. Thus, if an AI system actually does show superior decision-making, it is necessary to identify why that is the case. For this, we have methods from the natural sciences: we need to isolate possible causal factors and find out whether they are indeed causes or not. I see no way to shortcut this process. The question of what is a fact cannot be left to the methods of machine learning alone.

AI systems are touted for their efficiency, measurability, standards, automation and much more. Is that at the expense of humanity because, for instance, people are seen as "poor" decision-makers or because people are predictable based on past behaviour?

The first question is: under which circumstances are humans poor decision-makers? Of course, psychologists and behavioural economists have shown ample examples in recent years where humans have made biased decisions; sometimes these can also be called irrational decisions. The book Noise by Kahneman, Sibony and Sunstein summarises many of these studies, in which experts are asked to come up with decisions but turn out not to agree very well in their findings. This can be costly, e.g. if the risk of some insurance case is wrongly calculated. The authors also provide good rules to reduce the noise. So, can AI systems help to make better decisions than humans? I do not know many studies in which this has actually been proven. I am open to the idea, but I would like to see studies. Furthermore, the question is what we would conclude from such a finding. The thing is that machine learning is based on identifying statistical correlations, not necessarily causal relations. Thus, we do not know whether a finding can be generalised and applied to future data or not. Can a machine replace a human decision under this condition? The answer to this question is what I consider in my new book. My general answer is that a machine can only safely replace humans in decision-making if the machine comes to an answer that can be verified independently. For all kinds of judgments, like the suitability of a job candidate, it cannot.

We talk about errors in an AI system as a matter of course. But when is the result of such a system actually "good"? The word "good" certainly differs according to various cultures, ethics, morals, etc. How does that go hand in hand with the worldwide use of an algorithm?

A decision is good if it is correct. If we cannot define what a correct decision is, then we should not use AI systems. The decision of a machine is correct if an independent verification process comes to the same conclusion. If there is no such verification process, we should not use AI systems that somebody simply claims are good to go, except in situations that are not very harmful, such as product recommendation systems.

How does a consumer of a product or service that contains an AI system get the necessary evidence, or the training data behind it, to be able to show that the algorithm contains an error? Does the average consumer have the necessary competence for this, and isn't the AI system rather "opaque"?

If there is a verification process, even if it is expensive to use, the person can at least identify incorrect decisions. For example, if a machine predicted a Corona infection based on the coughing sound of a patient, a PCR test could be used to verify that result. For creditworthiness, it is already much more complicated: of course, it can be checked in ten years whether persons deemed creditworthy by the system actually did pay their loans back or not. But we cannot know whether the persons deemed unworthy of credit would have paid back their loans or not. Thus, for risk predictions, we already have a problem of knowing whether the system is accurate or not.

I assume from your answer that it takes a lot of knowledge and experience to detect an error in an AI system. And I also assume that it takes even more expertise to find the cause for it after the fact. If an employee in a company has detected an error in an AI system, it will certainly be difficult to find a colleague who is competent both in IT (for the AI system) and in the specialist area (e.g. taxes as the area of application) who can verify the opinion. It is even more difficult to persuade colleagues to stop using the system or even "repair" it, because for the company this means effort, time, loss of sales, possibly loss of reputation and so on. Is that an unsolvable dilemma?

An AI system is a process used by someone. If that process is faulty and the consequences of this faulty process are high, society needs to find ways to stop it. I do not think that there is something inherently different here with respect to whether AI is used or not. However, as already discussed, it is difficult to find out whether such systems are faulty. Thus, systems whose faultiness we have a hard time proving should not be used for decisions with a high social impact.

Prof. Dr Zweig, thank you for sharing your insights on algorithms, heuristics and their impact on daily life.

Thank you for having me.

About me and my guest

Dr Maria Cristina Caldarola

Dr Maria Cristina Caldarola, LL.M., MBA is the host of “Duet Interviews”, co-founder and CEO of CU³IC UG, a consultancy specialising in systematic approaches to innovation, such as algorithmic IP data analysis and cross-industry search for innovation solutions.

Cristina is a well-regarded legal expert in licensing, patents, trademarks, domains, software, data protection, cloud, big data, digital eco-systems and industry 4.0.

A TRIUM MBA, Cristina is also a frequent keynote speaker, a lecturer at St. Gallen, and the co-author of the recently published Big Data and Law now available in English, German and Mandarin editions.

Prof. Dr. Katharina Zweig

Katharina Zweig studied Biochemistry and Bioinformatics, completed her doctorate in theoretical computer science and accomplished her postdoctoral work in statistical physics. Currently she is a Professor for Computer Science at the RPTU in Kaiserslautern. There, she created the unique field of study called socioinformatics, which is concerned with the impact of software on society, and heads the algorithm accountability lab. She is a bestselling author, a founder, a policy consultant, and has received multiple awards for her work.