
More and more data are being collected. As part of the Internet of Things, everything will be equipped with a sensor for data collection. What do we want from all this data? Do we expect it to yield more knowledge? More insight? Technological progress? Are we evaluating data with the necessary due diligence? Do we have the quantity and quality of data needed to obtain good evaluation results?
In the latest of her Duet interviews, Dr Caldarola, editor of Data Warehouse as well as author of Big Data and Law, and Prof. Dr Brödner discuss the intentions, the goals and the quality of data analysis.
People always seem to be saying that we need more data. Will Big Data / AI lead to a new understanding of the world? Will they generate progress?
Prof. Dr Brödner: You are opening up a huge discussion, and I am very grateful to you for this question. What matters is how data relates to the real world. First of all, data is nothing more than numerals and other characters. These characters have no meaning per se. People assign meaning to these characters. Consequently, the question arises on the basis of which theoretical or other considerations people assign this meaning. Therefore, we must start by clarifying what relationship we humans have to the world or to reality.
There have been a wide variety of approaches throughout history. Without going into detail, I am an advocate of the humanistic tradition, according to which we humans – ourselves the product of natural evolution and enabled by it through socialisation to engage in a collectively active confrontation with nature – have both practical abilities to act and reflective consciousness. Aristotle already drew a clear distinction between practical reason – under which he also included technology, as the ability to produce something that can be used for a purpose – and theoretical reason as the source of knowledge. According to Aristotle, practical reason is essential and decisive for action. Theoretical knowledge, by contrast, must always be laboriously acquired by means of the intellect.
Later, scientists such as Galileo or Kepler postulated that one must first start from indisputable facts from which a question is derived. Undeniable facts are nature and, within a social environment, practices, traditions and habits. Of course, we can change nature as well as our social environment through intervention, but we cannot reshape the big picture according to our own wishes. It is true that we make our own conditions, but not under circumstances of our own choosing; rather, under circumstances found and handed down in each case. These early scientists came up with the idea of inventing hypotheses and feasible experiments based on the knowledge of their time, for which nature would provide the answers. Examples are Galileo’s famous free-fall experiments or the abundance of observational data on cosmic points of light gathered with the help of technical devices such as telescopes, timing devices, etc. This made it possible to explain the seemingly chaotic movements in the cosmos. By means of hypothetical explanations, Kepler and Galileo were able to test the behaviour of nature through measurements and to provide counter-evidence against earlier ideas. In this way, researchers were able to determine a tiny aspect of “truth” again and again.
If we now come back to data: we humans have questions that we try to explain and validate today with measured data. The problem nowadays is that we have a great deal of data. With the “Internet of Things”, for example, many things now carry a sensor that provides data. Often, however, this data does not fit our questions, or we measure without having a question at all (e.g. because a thing with a sensor simply sells better), or we cannot recognise the supposed connection.
Let’s assume that a door has a sensor that records the time of each movement with a time stamp. What do we actually know from these measured time and movement data? Only that the door moved at a certain time – but not for what purpose, for what reason, because of which person, possibly not even with which tool or how it was opened. We simply do not know the reasons or circumstances. And then humans start to think and add further sensors such as cameras, which in turn raise new questions.
This brings us back to the question of what data is, what meaning it has and what it actually tells us. And it is precisely this question that humans can only answer if they trace why and how the data was generated. At the beginning there is usually surprise at a certain event; to explain it, scientists formulate a hypothesis, devise an experiment and collect data in a methodologically sound manner in order to be able to validate that hypothesis.
Moreover, the full practical significance of the data only becomes apparent in the respective context of action, for example when answering the question: How much is a lot? Thus, the measured data obtained must first be interpreted – based on the respective theoretical background as well as on the context of action of a particular social practice – in order to determine what they “tell” us. Apart from a careful assessment of the methodological validity and reliability of the data, the question of how much a measured quantity matters in a pragmatic context of action can only be answered in comparison with other, similarly relevant quantities. Meaningful and effective action thus depends entirely on this dual interpretation. And it is precisely here that mistakes in the handling of data are frequently rooted.
In short: As pure signals, as numbers and letters taken in themselves, the data do not say anything at all, as such they are initially completely meaningless. They gain their significance exclusively on the basis of the theory behind the measurement method, on the basis of which the method and the measuring apparatus were developed, as well as on the basis of their proven validity and reliability.
What do companies or governments expect from still more data? More knowledge? Better forecasts and decisions? Technological leadership in AI…?
Today, we are often under the illusion that we can explain and validate everything with the available data without knowing for sure how it came about or being able to assess its quality. In doing so, we succumb to the temptation of ascribing meaning to the data without systematic review, because this meaning or interpretation fits current projects, opinions, ideas or even wishes. In addition, with the advent of digitalisation, companies have a great desire for data because they hope that it will ultimately lead to progress. That’s why companies everywhere are siphoning off data. Furthermore, many people have faith in technology; they give more credence to data evaluation than to their own human instinct and evaluation.
But has it ever been carefully examined whether these pools of widely varying data can actually be used to explain something, to sell something better, to win an election or to achieve a technological competitive advantage? Have our politicians defined carefully enough what they want to achieve with the Internet of Things, with data and its evaluation, with “big data”, and so on? Even with a methodologically sound approach, these would at best be tools for achieving a goal – and the goal itself does not seem to be clear. Thus, more data by no means provides more reliable knowledge; it merely creates the illusion of knowledge. Nor does it lead to more accurate forecasts or decisions.
Accordingly, it is also questionable to strive for technological leadership in so-called “artificial intelligence (AI)” without clarifying the background in more detail. Quite apart from the fact that the meaning of “AI” is often not very clear, the performance of data-driven machine-learning approaches – whether based on decision trees or on the “connectionist” artificial neural networks – also depends decisively on the quality of the data used for “training”, i.e. for calculating the initially undetermined parameters.
Data quality is therefore of great importance – yet it is often unknown. In most cases, we do not know where the data comes from or under what conditions it was collected, so the data quality cannot be assessed, or only with difficulty.
In some of the previous interviews, we have heard that the results of data evaluation always depend on the quality as well as the quantity of the data. In other words, if the data going in is bad, the results coming out are also bad – in short, “garbage in, garbage out”. How do we get good or better data?
In technology, medicine and the social sciences, there is a methodologically sound and proven approach. One has to follow the premises of stochastics, i.e. a random experiment must be carried out and examined using a random sample. Hypotheses developed from theoretical considerations, together with data collected specifically to test them, then ensure that the data evaluation is sound, valid, low in error, complete and appropriate – in other words, that the data quality is fit for purpose. All of this rests on a theoretically derived hypothesis, which of course may still be wrongly or insufficiently specified. Therefore, the random experiment as well as the associated methods of data collection must themselves be validated. The question to be answered is whether the method really measures exactly what is supposed to be measured – the question of validity. In addition, it must be determined whether the method measures reliably, undisturbed by interfering factors – essentially the question of reliability.
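A minimal sketch of my own (the population, sample sizes and threshold are illustrative assumptions, not taken from the interview) shows why the premise of random sampling matters: an estimate from a genuinely random sample recovers the population value, while a selectively drawn sample does not.

```python
# Illustrative sketch: random sample vs. biased "convenience" sample.
import random
from statistics import mean

random.seed(42)

# Hypothetical population of 100,000 measured values (e.g. durations in minutes).
population = [random.gauss(50, 15) for _ in range(100_000)]

# Random sample: every member of the population is equally likely to be drawn.
random_sample = random.sample(population, 1_000)

# Biased sample: only large values get recorded at all,
# e.g. because only heavy users trigger the sensor in the first place.
biased_sample = [x for x in population if x > 55][:1_000]

print(f"population mean:    {mean(population):6.2f}")
print(f"random-sample mean: {mean(random_sample):6.2f}  (close to the population value)")
print(f"biased-sample mean: {mean(biased_sample):6.2f}  (systematically too high)")
```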
There are examples. Google, for example, thought that, because of the sheer amount of their data, they could make better, more accurate and faster predictions of flu epidemics from their search queries than the health authorities with their stochastic methods. However, this could not be confirmed, because Google did not have sufficient knowledge of the background of their data and its attribution of meaning. Social mechanisms came into play here which sociologists call self-fulfilling prophecy: when people perceive many flu cases, more queries are made on the Internet, and this can become exaggerated, so that there are more such queries than correspond to the actual flu outbreak and its course.
My favourite quote:
Albert Einstein
“Not everything that counts can be counted, and not everything that can be counted counts.”
Another illuminating example: On the occasion of the terrorist attack at the Christmas market at the Berlin Memorial Church on December 19th 2016, a one-year mass test with automatic facial recognition was later carried out at Berlin’s Südkreuz train station. After the evaluation, the Minister of the Interior proudly announced a hit rate of 80% and a false alarm rate of 0.1% – i.e. out of 10 suspects, 8 were correctly identified, and out of 1,000 innocent passers-by, only one was falsely suspected. This means that, with a large number of about 100,000 passers-by daily and assuming 100 actual suspects among them, 80 are correctly identified and about 100 are incorrectly flagged, i.e. 56% (= 100/180) of the flagged passers-by are falsely suspected; therefore, all 180 must be checked individually. However, like MEP Hohlmeier, most people believe that a false alarm rate of 0.1% means 99.9% correct hits – which testifies to the widespread inability to deal with conditional probabilities (Bayes’ theorem), according to which, with large observation numbers and a small number of actual suspects, there are still many false positives. So mass surveillance becomes part of the problem and not part of the solution.
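One way to spell out this arithmetic (my own notation; the numbers are exactly those quoted above) is to insert the figures into Bayes’ theorem, writing S for “is an actual suspect” and A for “the system raises an alarm”:

```latex
% Bayes' theorem applied to the figures quoted above:
% prior P(S) = 100/100,000, hit rate P(A|S) = 0.8, false alarm rate P(A|not S) = 0.001.
\[
P(S \mid A)
  = \frac{P(A \mid S)\,P(S)}{P(A \mid S)\,P(S) + P(A \mid \neg S)\,P(\neg S)}
  = \frac{0.8 \cdot 0.001}{0.8 \cdot 0.001 + 0.001 \cdot 0.999}
  \approx 0.44
\]
```

Only about 44% of all alarms therefore point to actual suspects; the remaining roughly 56% are the false positives mentioned above – far from the 99.9% that a naive reading of the 0.1% false alarm rate suggests.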
Yet another example: A vaccination with zero efficacy can show an effect of 70 percent in studies such as those conducted during the pandemic. In order to estimate the protective effect of mRNA vaccines, authorities, vaccination commissions and physicians have largely depended on observational studies and model calculations since the beginning of 2021. The problem with this is that such studies can lead to results that do not reflect reality. According to John Ioannidis, infectiologist and epidemiologist at Stanford University, most reports on the effectiveness of the Covid vaccine are based on “distorted findings”.[i] Three scientists have now conducted a thought experiment: they assumed that the Covid mRNA vaccinations were completely ineffective – and calculated what effect observational studies could nevertheless ascribe to the vaccinations. A vaccine that has zero effect can still “achieve” a supposed protective effect of 67 percent. Highly regarded observational studies, some of which were widely cited and served as a guideline for the authorities, had greatly overestimated the protective effect of the mRNA vaccines – such is the conclusion of the three scientists Peter Doshi, Kaiser Fung and Mark Jones.
So what have we learned from these examples? That we need precise knowledge about the data, how it comes about and its possible disruptive factors – a requirement that can never be met 100 percent, because we can never know these things exactly. But we can get closer. That is why I actually feel that Google’s approach to predicting a peak flu outbreak is like reading tea leaves – the hypothesis and the measurement method are not good enough. And that is why I speak of dataism as a kind of blind belief in the objectivity of data, and why I would like to mount a defence of stochastics and probability theory.
We also know from the theory of measurement that every measurement is, for fundamental reasons, in error. The true value of a physical quantity can never be obtained, only the measured or read-off value. Recording the data produces a diagram, a kind of point cloud, which in turn shows a tendency. With the help of a mathematical function (a straight line or a polynomial of higher order), the deviations of these measuring points from that function can be determined. The sum of the squared deviations can then be minimised, which brings us to the so-called regression analysis. This generates a curve that matches the measured values as well as possible. The function can then also be used for future forecasts. This method can be clearly justified and the measured values can be reasonably interpreted – namely against the background of my theoretical considerations. Science has been working with this for a long time – at least since Gauss.
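As a minimal sketch of this procedure (my own illustration with made-up measurement data, not an example from the interview), a least-squares fit and a forecast derived from it can be written in a few lines:

```python
# Least-squares regression in the sense described above, using NumPy
# and simulated "measurements" of an assumed linear law with noise.
import numpy as np

rng = np.random.default_rng(0)

# Simulated measurements: an assumed true law y = 2.0*t + 1.0, observed with noise.
t = np.linspace(0, 10, 25)
y_measured = 2.0 * t + 1.0 + rng.normal(scale=1.5, size=t.size)

# Fit a straight line (degree 1) by minimising the sum of squared deviations.
coeffs = np.polyfit(t, y_measured, deg=1)
model = np.poly1d(coeffs)

# The fitted curve can then be used for forecasts beyond the measured range.
t_future = 12.0
print(f"fitted line: y = {coeffs[0]:.2f}*t + {coeffs[1]:.2f}")
print(f"forecast at t = {t_future}: {model(t_future):.2f}")
```

The fitted coefficients are exactly those that minimise the sum of squared deviations mentioned above.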
In order to be able to assess the quality of the results of a data evaluation, transparency about the data and the underlying logic, explainability and provability are required. For humans, traceability and verifiability seem difficult. How good or sufficient is the verifiability by scientifically recognised mathematical methods such as LRP or LIME in the case of AI?
In the case of AI, the first thing to do is to define what is meant by it. Today, it is usually understood to mean artificial neural networks. These can be modelled mathematically with precision using the tools of linear algebra (matrix calculus). Such networks are “trained” for a specific use case by optimally adapting their large number of initially undetermined parameters to large amounts of data from the field of application. The trained network can then be used to make predictions, comparable to the procedure of regression.
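To make this concrete, here is a deliberately tiny sketch of my own (layer sizes, learning rate and the XOR toy task are illustrative assumptions): a one-hidden-layer network is nothing but matrix products plus an elementwise nonlinearity, and “training” merely adjusts the initially undetermined parameters to the data.

```python
# A tiny neural network written out as plain linear algebra and trained by
# gradient descent on the XOR problem. Purely illustrative sketch.
import numpy as np

rng = np.random.default_rng(1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

# Initially undetermined parameters: two weight matrices and two bias vectors.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(10_000):
    # Forward pass: nothing but matrix calculus and an elementwise nonlinearity.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)

    # Backward pass: gradients of the squared error, again pure linear algebra.
    grad_out = (p - y) * p * (1 - p)
    grad_W2 = h.T @ grad_out
    grad_b2 = grad_out.sum(axis=0)
    grad_h = grad_out @ W2.T * (1 - h ** 2)
    grad_W1 = X.T @ grad_h
    grad_b1 = grad_h.sum(axis=0)

    # Adapt the parameters a small step towards lower error.
    lr = 0.5
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

print(np.round(p, 2))  # should be close to [0, 1, 1, 0] after training
```

Every step in the loop is ordinary matrix calculus; the only thing the data determine are the numerical values of the parameters W1, b1, W2 and b2.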
There is some talk that a neural network is a black box that is not comprehensible. Here I have to disagree vehemently, insofar as a neural network and its behaviour can also be explained and recalculated by means of linear algebra. The computational procedure does what the formal description by the program specifies and delivers a determinate result. And that is exactly why we can be sure that there are no mistakes at this level – irrespective, however, of the inevitable stochastic uncertainties. Of course, there are further challenges within the linear algebra, such as individual matrices being poorly conditioned, so that the computational procedure does not run well. These are numerical problems. But the description by linear algebra is equivalent to the neural network. Neural networks are mathematical spaces with tens of thousands of dimensions, and although this overwhelms human perception, the mind can rely on the calculations. Beyond that, however, an individual result cannot be explained.
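What “poorly conditioned” means can be shown with a small example of my own (the matrix is a textbook-style choice, not one from the interview): the condition number measures how strongly tiny input errors are amplified in the result.

```python
# Ill-conditioning: for a nearly singular matrix, a tiny perturbation of the
# right-hand side changes the solution drastically.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])        # almost linearly dependent rows
b = np.array([2.0, 2.0001])

print("condition number:", np.linalg.cond(A))    # very large => ill-conditioned

x = np.linalg.solve(A, b)                        # solution close to [1, 1]
x_perturbed = np.linalg.solve(A, b + np.array([0.0, 0.0001]))  # tiny change in b

print("solution:           ", x)
print("perturbed solution: ", x_perturbed)       # jumps far away from [1, 1]
```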
Let’s assume that the quality of the data is excellent; the result is still one with a certain uncertainty, a certain probability. It is like regression, where the curve only describes possible outcomes, not the real ones. Even if the probability of false results from a neural network is relatively small, we only know that a certain percentage of all results are false – but not which ones. So there is always inherent uncertainty.
How we should deal with this fundamental uncertainty cannot be answered in general terms; it depends on the situation. One situation can tolerate a higher error rate, another only a lower one. We already know this from many technical systems.
But do we want to tolerate mistakes and simply accept the result from a neural network, or should or must we check the result individually? The individual inspections are time-consuming and accepted errors can have considerable monetary consequences – possibly also in terms of our lives. And how do supervisors react when mistakes happen – even if they know that there is an error rate and only probable results are delivered?
The fact of a probability-based error rate cannot be eliminated and it is precisely this question that society must deal with. Perhaps there can be a comparison between the susceptibility to error of a neural network and the susceptibility to error in humans, and perhaps tolerance limits for high-risk AI must then be defined in laws and contracts so that responsibility is clarified and fully automated processes become possible. Certification processes are also conceivable. Then there is the assessment by the judiciary, which must continue to deal with the knowledge of the susceptibility to error and the tolerance limit in the event of damage.
Do we need a philosophy of science adapted to new technological changes such as AI / Big Data? Or perhaps a completely new one?
No, we need neither a new nor an adapted theory of science. Rather, we must use the proven and tested methods.
Around 2008, Chris Anderson, then editor-in-chief of the online magazine “Wired”, put forward the bold thesis that, in view of the vast amounts of data on the Internet, there was no longer any need for hypothesis-based analyses: one could simply read the truth about reality directly from data correlations.
This is a blatant, ancient form of the fallacy “cum hoc, ergo propter hoc” (Latin: “with this, therefore because of this”), i.e. the fallacy of pseudo-causality, in which the joint occurrence of events or the correlation between characteristics is taken for a causal relationship without closer examination. But correlation does not imply causality, even if the connection may seem to suggest it. It is therefore important to know the difference between causality and correlation: just because A and B occur together, we do not know whether A causes B, or B causes A, or whether both are caused by some third factor C.
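A small simulation of my own (the variables and coefficients are purely illustrative assumptions) shows how a hidden third factor C can produce a strong correlation between A and B even though neither causes the other:

```python
# Correlation without causation: A and B are both driven by a hidden factor C,
# so they correlate strongly although neither causes the other.
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

C = rng.normal(size=n)                        # hidden common cause
A = 2.0 * C + rng.normal(scale=0.5, size=n)   # A depends only on C (plus noise)
B = -1.5 * C + rng.normal(scale=0.5, size=n)  # B depends only on C (plus noise)

print("corr(A, B)      =", round(np.corrcoef(A, B)[0, 1], 3))  # strongly negative

# Removing C's influence (looking at the residuals) makes the
# spurious association disappear.
A_resid = A - 2.0 * C
B_resid = B + 1.5 * C
print("corr(A, B | C) ~", round(np.corrcoef(A_resid, B_resid)[0, 1], 3))  # about 0
```

Once the influence of C is removed, the apparent association between A and B vanishes.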
And it is precisely this statement that is an indicator for me of how far our society and science have regressed when an editor-in-chief can make such a statement in a globally read magazine. An outrageous scandal!
The French postmodern philosophers after Sartre also bear a good deal of blame – even if they were never really taken seriously in France and Europe – because in the USA they were celebrated and courted at the elite universities from Harvard to Stanford. From there comes the questionable claim that reality is nothing more than a chaos of narratives – the theoretical groundwork for what would later have to be fought as “fake news”.
There are two logicians, Charles Sanders Peirce and Gottlob Frege, who independently developed first-order predicate calculus with existential and universal quantifiers. This makes it possible to formalise arguments in important areas of many sciences, in both practice and theory, and to check their validity. In addition, they developed a logically advanced concept of the sign, which is particularly important for computer technology.
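As a small illustration of such formalisation (the classic textbook example, not one used in the interview), a simple argument can be written in first-order form so that its validity can be checked purely by the rules of the calculus:

```latex
% The classic syllogism, formalised in first-order predicate logic
% with a universal quantifier.
\begin{align*}
&\text{Premise 1:} && \forall x\,\bigl(\mathrm{Human}(x) \rightarrow \mathrm{Mortal}(x)\bigr)\\
&\text{Premise 2:} && \mathrm{Human}(\mathrm{socrates})\\
&\text{Conclusion:} && \mathrm{Mortal}(\mathrm{socrates})
\end{align*}
```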
Their concept of the sign is of the utmost importance, especially for our dialogue, because both say, independently of each other, that a sign is not a thing but a triadic relation. Three entities are related to each other: first, a physical sign carrier or body, called the »representamen« (e.g. a figure made of printer’s ink on paper); second, the object named or designated by it, about which the sign says something (e.g. a cup); and third, the interpretant, the concept or meaning that an interpreter assigns to the sign depending on the situation.
In this structural relation, the nature of the object plays a decisive role. If the object is only a thought object, as in mathematics – no one has ever seen a point or a line – then strict logical rules apply to these thought objects. Such an object can have a certain predicate, for example the point as a thing without extension (unlike the atom). These are “inventions” of mathematics. David Hilbert put it beautifully: “Mathematics is a game with few rules and meaningless characters on paper”. So, if we want to talk about imaginary mathematical objects, then what is said must agree with the rules of mathematics and of logic (especially predicate logic). If there is no conformity, then we know that the statement or the assignment of meaning is wrong. If, on the other hand, we have a natural object such as a star in front of us, then we can only talk about it meaningfully if we recognise the laws of gravity and bring our statements into line with these laws. That is why the object is so important: it provides the criterion for right and wrong statements. At the same time, it follows that the computational procedures used in computer technology operate only with data in the form of binary signals, without “knowing” what they stand for or what they mean.
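This last point can be made tangible with a tiny sketch of my own (the byte values are arbitrary): one and the same bit pattern is an integer, a floating-point number or text only by virtue of the interpretation imposed on it from outside.

```python
# The same binary signal carries no meaning of its own: identical bytes can be
# read as an integer, a floating-point number or text, depending entirely on
# the interpretation chosen by the user.
import struct

raw = bytes([0x42, 0x48, 0x49, 0x21])           # one and the same bit pattern

as_int = int.from_bytes(raw, byteorder="big")    # interpreted as an integer
as_float = struct.unpack(">f", raw)[0]           # interpreted as a 32-bit float
as_text = raw.decode("ascii")                    # interpreted as characters

print(as_int)    # 1112033569
print(as_float)  # roughly 50.07
print(as_text)   # 'BHI!'
```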
How can science have a better impact on our social and political sphere in order to advance our civilisation?
Data or AI per se will not bring about any progress; at best, data and AI will help us in terms of efficiency, but not in the field of creativity.
Computer technology is completely overrated in economic terms. It does not represent an increase in productivity. This has been shown empirically time and time again, because increases in productivity can only be achieved through a reorganisation of sign-based cooperation processes and not through mere data processing.
The delusion from the 1980s: “Experts go, expert systems stay” will not come true this time either. Although the approaches of “symbolic AI” at that time have been largely replaced by those of “connectionist AI” this time around, they are not “intelligent” either, but, like all artifacts, only objectify insights from the intelligence of their creators. Jean Piaget said at the end of his life: “Intelligence is what you use when you don’t know what to do”. And the algorithmic control of computational procedures is exactly the opposite, because you have to know exactly what to do and how to control the process. Therefore, AI systems, their computer programs, algorithms, etc. are not intelligent.
I have studied the history of computers and have collected many newspaper headlines from the early days of computer technology in the 1940s. One headline read: “30-ton electron brain at Philadelphia University thinks faster than Einstein”, and a 1949 book by E. C. Berkeley is entitled “Giant Brains, or Machines That Think”. This shows how long the “thinking machine” and its supposed possibilities have been an issue in our society. They remain today’s leitmotif in the cognitive sciences – in the form of the so-called “computational theory of mind”.
We are dealing here with views from a long rationalist tradition, in which, like Pierre-Simon Laplace, people believed that the world could be explained by a system of differential equations, and according to which intellectual insights were nothing more than calculation (man as machine). This is also contained in the word “rational”, because ratio means both calculation and reason.
In the 1950s, there were philosophers such as Gilbert Ryle who argued against this view. He made a strict distinction between “knowing how” and “knowing that”, i.e. between implicit ability and explicit knowledge. The chemist Michael Polanyi wrote something similar in his book “The Tacit Dimension”. We also find this in Aristotle, who postulated a practical competence for action that also draws on our creativity and intuition.
Today we know that we cannot explicate all implicit ability as knowledge. This is only possible to a limited extent with the scientific methods mentioned above; only individual aspects of reality or of our ability can be explicated. We find this again, in essence and in nuance, in Daniel Kahneman’s “Thinking, Fast and Slow” or in Albert Einstein’s dictum “The sign of true intelligence is not knowledge, but imagination.”
In view of all this, I ask myself with a view to today’s social situation: How are upcoming existential crises to be overcome in the future if it is no longer even possible to interpret data appropriately against the background of their emergence and to answer the fundamental question “How much is a lot?” in a relevant comparison? When it is no longer even possible to classify an observed event in its factual and temporal context. When it is no longer even possible to conduct rational discourses based on evidence-based arguments as the core of reason-driven knowledge gain. When the awareness of the fundamental difference between theoretical and practical reason (Aristotle), between intuitive judgment and calculating reason, between experience-based ability and explicit knowledge has disappeared. When it is no longer possible to combine the disciplinary fragments of knowledge of a science to form a coherent overall picture of a complex situation that has occurred?
We are currently experiencing all of these phenomena. This is by no means a matter of subtle sophistry or subject-specific methodological subtleties that would only be of interest to the relevant experts, but of the most elementary logical and epistemic knowledge and skills that determine our access to and understanding of the world. In a society so thoroughly shaped by technical artifacts, these should be a core component of elementary education – but this, too, has apparently been largely lost. This loss of a realistic and methodologically secured access to and understanding of the world is tantamount to an uprooting and means nothing other than a profound decline in civilisation.
Prof. Dr Brödner, thank you for sharing your insights on dataism and our question whether data quantity automatically leads to quality if the data set is big enough.
Thank you, Dr Caldarola, and I look forward to reading your upcoming interviews with recognised experts, delving even deeper into this fascinating topic.
[i] “Wie ein unwirksamer Impfstoff wirksam erscheinen kann” [How an ineffective vaccine can appear effective], Infosperber, https://www.infosperber.ch/gesundheit/wie-ein-unwirksamer-impfstoff-wirksam-erscheinen-kann/ (accessed 16 January 2025).