»All the beauty with data«

BI-Spektrum spoke with Jörg Vogler, associate at TOLERANT Software, about how data can be collected and maintained with high quality so that it can be put to good use for analytics and artificial intelligence, and about the head start some US companies have over their local counterparts.

The interview was conducted by Christoph Witte, Editor-in-Chief of BI-Spektrum.

BI-Spektrum: Companies are more data-hungry than ever before. Business intelligence and analytics need usable data, and artificial intelligence applications also demand more and more data. Can companies actually satisfy this hunger for data?

Vogler: They can satisfy it if they take a disciplined approach to data collection. What is really important is that companies pay attention to quality and completeness at the point of collection. If these criteria are met and the data collection is legally permissible, this hunger for data can certainly be satisfied. This means that already in the first contacts with customers, the corresponding information obligations towards them are fulfilled, so that the data one would like to use may actually be used. That is the second major problem area we see. Companies too often tacitly assume that their data is recorded correctly and that the data fields are filled in properly. They often notice too late – for example, when the data is handed over to an analytics application or is supposed to feed an AI – that fields are not filled in correctly or that the data has gaps and inconsistencies.

BI-Spektrum: Why is the quality of the data entry so important?

Vogler: Companies often capture data in order to be able to address customers correctly, not just to evaluate it later. If the data is not captured correctly, I cannot address the customer properly and do not know which target group they belong to. In addition, quality is important when combining customer information from different sources. For that, too, I need reliable identifying characteristics. These have existed for years and are supported by master data management, for example. But we still see considerable discrepancies between the ideal and reality.

BI-Spektrum: The issue of data quality has been a problem for as long as IT has existed. Why can’t we get a grip on it?

Vogler: On the one hand, there are very good approaches that have been known for a long time, such as the data steward. But in many companies the processes for data quality management are unfortunately still put on the back burner. It is seen as a chore, especially because workloads keep getting tighter. A sales employee whose actual job is looking after customers does the data collection just well enough that it does not cost him too much time. However, we also have customers who live from and with data, credit agencies for example. They have their data under control. They have the appropriate measuring points to check the quality of incoming data, they have regular routines to correct weaknesses, and they invest a lot in monitoring and diagnostics. More recently, however, we have noticed that the issue of data quality is becoming more important because of data protection. This may sound surprising, but it has to do with the companies’ duty of care for the data: only if the data is recorded correctly and I have obtained the appropriate authorisations can it be used properly. Overall, however, it must be noted that data quality is not yet sufficiently supported by systems and the corresponding processes are not yet consistently lived.
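To make the “measuring points” mentioned here more concrete, the following is a minimal sketch of how per-field quality metrics on an incoming data batch might look: completeness and validity rates with a simple threshold alert. The field names, rules and threshold are illustrative assumptions, not how credit agencies or TOLERANT Software actually implement this.

```python
# Minimal sketch of quality "measuring points" for an incoming data batch:
# per-field completeness and validity rates with a threshold alert.
# Field names, rules and threshold are illustrative assumptions.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

VALIDATORS = {
    "email": lambda v: bool(EMAIL_RE.match(v)),
    "postcode": lambda v: v.isdigit() and len(v) == 5,  # e.g. German postcodes
}

def batch_metrics(records, fields, threshold=0.95):
    """Compute completeness (field filled) and validity (field passes its rule)
    per field, and flag fields that fall below the threshold."""
    report = {}
    for field in fields:
        values = [r.get(field) for r in records]
        filled = [v for v in values if v not in (None, "")]
        completeness = len(filled) / len(values) if values else 1.0
        validator = VALIDATORS.get(field)
        valid = [v for v in filled if validator(v)] if validator else filled
        validity = len(valid) / len(filled) if filled else 1.0
        report[field] = {
            "completeness": round(completeness, 2),
            "validity": round(validity, 2),
            "alert": completeness < threshold or validity < threshold,
        }
    return report

if __name__ == "__main__":
    incoming = [
        {"email": "a.meier@example.com", "postcode": "70173"},
        {"email": "not-an-address", "postcode": "7017"},
        {"email": "", "postcode": "70190"},
    ]
    for field, metrics in batch_metrics(incoming, ["email", "postcode"]).items():
        print(field, metrics)
```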

BI-Spektrum: But in view of the lack of data quality, you must feel sorry for the people who are supposed to enable data-based decisions and generally make companies more data-driven.

Vogler: As data quality professionals, we of course try to support the companies. In doing so, we also have to act as a catalyst for interdepartmental communication, especially between IT, which is supposed to provide the corresponding systems, and the business departments that want to work with the data. Then there are the data protection and compliance guidelines that define what may be done with the data.

BI-Spektrum: Aren’t you overreaching yourself if you, as a provider of data quality tools, also want to take care of communication? That is actually a completely different topic.

Vogler: Of course we don’t take care of the communication processes themselves. But we do help to create an awareness of what works and what doesn’t – for example, when an AI initiative is started in a company and the IT department is asked to build an AI model. They do that, but very quickly realise that the data is simply the way it is. We can help to create transparency, make people aware of why data is missing, check the quality of the existing data, explain how it can be improved and how to obtain the data that is still missing. In doing so, we bring a little more sense of reality into companies. Often, especially in higher management, there is no awareness of the importance of data quality.

BI-Spektrum: So companies could benefit much more from AI if the data were cleaner?

Vogler: Yes, especially if they had their data entry processes under control. There are certainly examples of IT companies in the US where this is the case. Of course, a lot can be done to improve data quality after the fact, but to really leverage the potential, as is done in some cases in the USA, the quality of the incoming data has to improve. The historically grown IT landscape in companies, with its different data models and their respective quirks and weaknesses, does not make the task any easier either.

BI-Spektrum: Why are the Americans so much better at this?

Vogler: The big role models like Google or Amazon in particular have the clear advantage of holding all customer data in one place, in a relatively homogeneous environment. For one thing, they do not have the problems of legacy systems; for another, they have a very clear service architecture with precisely defined transfer points where they can get their hands on the data. In addition, they understood much earlier what a central role data plays and what value it has. This understanding, also with regard to the value of “by-catch data”, was developed very early by the large American companies.

BI-Spektrum: By-catch data?

Vogler: This refers to the usage data that is generated when systems and devices are used. When you read an e-book, for example, it is recorded which pages you have already read, where you last stopped and how fast you read. On the one hand, conclusions about reading behaviour can be drawn from this information; on the other, optimisations for the e-book itself and recommendations for other readers can be derived. The Americans realised very early on how valuable such data can be. The data on market size and prices, products in demand and buying behaviour that Amazon obtains simply by opening its platform to other retailers is tremendously valuable.

BI-Spektrum: Isn’t one reason for the high data quality at Amazon that it’s not employees who have to perform the annoying duty of data entry, but the customers themselves?

Vogler: Partly, but they also have very thorough checking routines in the background and know exactly how much they can expect of their customers. Moreover, they not only check the data automatically; suspicious data sets are also reviewed by humans, and in this combination of automation and human review lies one of these companies’ great strengths. The same is true for the tools we offer: they find a lot, but they only become perfect in combination with human review.

BI-Spektrum: You mentioned that it is possible to repair incorrectly or incompletely recorded data. What can be done?

Vogler: Of course, we have options for data field assignment and for harmonising master data such as telephone numbers, e-mail addresses and the like. Data entry can be standardised, and the data can be checked semantically – whether, for example, names are entered in the correct fields or the gender is specified correctly. A lot can be smoothed out automatically. We can verify relocations or correct many things in the corporate environment, because we have many external references. In addition, we offer duplicate detection and correction. The difficulty there lies not so much in the detection as in merging the duplicates together with their order histories. So we can check data fields, we can complete data and we can check it for currency. In addition, we can validate data sets for our clients.
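As an illustration of the kinds of operations described here – field-level checks, harmonisation of contact data and duplicate detection – the following is a minimal sketch under simplifying assumptions. All field names and rules are hypothetical; this is not TOLERANT Software’s actual implementation, and real duplicate matching would use fuzzy name and address comparison rather than exact matches.

```python
# Minimal sketch of rule-based data quality checks: field-level validation,
# harmonisation of phone numbers and e-mail addresses, and naive duplicate
# detection. All field names and rules are illustrative assumptions.
import re
from itertools import combinations

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def normalise_phone(raw):
    """Keep only digits; a real implementation would also harmonise
    country codes and trunk prefixes."""
    return re.sub(r"\D", "", raw or "")

def validate_record(rec):
    """Return a list of quality issues found in a single customer record."""
    issues = []
    if not rec.get("last_name"):
        issues.append("missing last name")
    if rec.get("email") and not EMAIL_RE.match(rec["email"]):
        issues.append(f"implausible e-mail: {rec['email']}")
    if rec.get("gender") not in {"f", "m", "x", None}:
        issues.append(f"unknown gender code: {rec['gender']}")
    return issues

def find_duplicates(records):
    """Very naive duplicate detection: same e-mail (case-insensitive)
    or same normalised phone number."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        same_mail = a.get("email") and a["email"].lower() == (b.get("email") or "").lower()
        same_phone = a.get("phone") and normalise_phone(a["phone"]) == normalise_phone(b.get("phone"))
        if same_mail or same_phone:
            pairs.append((i, j))
    return pairs

if __name__ == "__main__":
    customers = [
        {"last_name": "Meier", "email": "a.meier@example.com", "phone": "+49 711 123456", "gender": "f"},
        {"last_name": "Maier", "email": "A.Meier@example.com", "phone": "0711/123456", "gender": "W"},
        {"last_name": "", "email": "broken-at-example", "phone": "", "gender": None},
    ]
    for idx, customer in enumerate(customers):
        print(idx, validate_record(customer))
    print("possible duplicates:", find_duplicates(customers))
```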

BI-Spektrum: Give three more tips for companies that want to have clean data.

Vogler: It’s basically like being a doctor: first, a diagnosis has to be made, then I have to suggest treatment measures, and I have to be able to say how to avoid “data diseases” in the future, i.e. make suggestions for prevention. In addition, I have to anchor an important principle in the company: all the beauty with data only works if the data is properly maintained.

The interview was published in the magazine BI-Spektrum, issue 3/2019, pp. 30–32.