Tips and Tricks: Duplicates

Large amounts of data require special tools. Anyone who wants to match or search millions of data records needs good performance. TOLERANT Match offers tailored options for optimising it.

Partitioning

Partitioning is a central element in TOLERANT Match. It has a direct influence on performance as well as on the kind and number of records found. In partitioning, all data records are assigned (»partitioned«) to N data areas (»partitions«) before processing. The allocation is based on configurable criteria that map each data record to a partition (1…N).

During a search, the system only looks in the partitions determined by the values in the search query. In this way, the time for a search can be significantly reduced compared to processing without partitioning. For large amounts of data (more than about 1 million data records), partitioning should therefore always be used.
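The principle can be sketched in a few lines of Python. This is only an illustration and not the TOLERANT Match implementation: the function names, the record layout and the choice of the first two postcode digits as the partitioning criterion are assumptions made for the example.

```python
from collections import defaultdict


def partition_key(record, prefix_len=2):
    # Hypothetical criterion: the leading digits of the postcode.
    return record["postcode"][:prefix_len]


def build_partitions(records, prefix_len=2):
    # Assign every record to exactly one partition before any search runs.
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_key(record, prefix_len)].append(record)
    return partitions


def candidates(partitions, query, prefix_len=2):
    # Only the partition selected by the query's postcode is scanned.
    return partitions.get(partition_key(query, prefix_len), [])


records = [
    {"name": "Anna Meier", "postcode": "70173"},
    {"name": "Anna Mayer", "postcode": "70190"},
    {"name": "Jon Smith", "postcode": "10115"},
]
partitions = build_partitions(records)
hits = candidates(partitions, {"name": "Ana Meier", "postcode": "70174"})
print(len(hits), "of", len(records), "records need to be compared")
```

The expensive field-by-field comparison then runs only on the records in `hits`. The sketch also shows the trade-off: records in other partitions are never compared with the query, so the partitioning criterion affects what can be found.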

The products »Match Batch« and »Match Service« differ with regard to partitioning: Match Batch uses one field for partitioning, whereas Match Service uses several. The advantage is that with Match Service, partitioning (and thus a performance gain) remains possible even if searches do not always use the same fields.

The postcode is a typical example of a field that is used for partitioning. Records with the same postcode or common initial digits of the postcode are assigned to the same partition and can therefore be compared with each other. In TOLERANT Match Service, you could use other fields besides the postcode, such as a customer type or a date of birth, for partitioning. The advantage is obvious: Optimised search operations are possible for different combinations of search terms, even without always having to include the same field (such as the postcode).
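A minimal sketch of partitioning over several fields could look as follows. It is a generic illustration, not the way Match Service works internally: the field names `postcode_prefix` and `customer_type`, the index structure and the strategy of intersecting partitions are assumptions.

```python
from collections import defaultdict

# Hypothetical partitioning fields; real field names come from the configuration.
PARTITION_FIELDS = ("postcode_prefix", "customer_type")


def build_indexes(records):
    # One index per partitioning field: value -> positions of matching records.
    indexes = {field: defaultdict(set) for field in PARTITION_FIELDS}
    for pos, record in enumerate(records):
        for field in PARTITION_FIELDS:
            value = record.get(field)
            if value:
                indexes[field][value].add(pos)
    return indexes


def candidate_positions(indexes, query, total):
    # Use every partitioning field that the query actually fills.
    result = None
    for field in PARTITION_FIELDS:
        value = query.get(field)
        if value:
            part = indexes[field].get(value, set())
            result = part if result is None else result & part
    # No partitioning field in the query: fall back to a full scan.
    return set(range(total)) if result is None else result


records = [
    {"name": "Anna Meier", "postcode_prefix": "70", "customer_type": "private"},
    {"name": "Meier GmbH", "postcode_prefix": "70", "customer_type": "business"},
    {"name": "Jon Smith", "postcode_prefix": "10", "customer_type": "private"},
]
indexes = build_indexes(records)
by_type = candidate_positions(indexes, {"customer_type": "private"}, len(records))
by_both = candidate_positions(
    indexes, {"postcode_prefix": "70", "customer_type": "private"}, len(records)
)
print(sorted(by_type), sorted(by_both))
```

A query without a postcode still benefits here, because the customer type alone narrows the candidates.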

Please note that partitioning should only be based on fields whose values are completely filled and reliable. In the case of the postcode, all data records should therefore first undergo a postal check and standardisation, for example with the software TOLERANT Post. This ensures that records are assigned to the correct partitions.
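Whether a field is completely filled can be checked before it is chosen for partitioning. The following helper is a hypothetical pre-check written for this article, not part of any TOLERANT product. It measures completeness only and says nothing about whether the values are postally correct.

```python
def fill_rate(records, field):
    """Share of records whose value for `field` is non-empty (0.0 to 1.0)."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if str(record.get(field) or "").strip())
    return filled / len(records)


records = [
    {"postcode": "70173", "phone": "0711 123"},
    {"postcode": "10115", "phone": ""},
    {"postcode": "80331", "phone": None},
    {"postcode": "20095"},
]
for field in ("postcode", "phone"):
    print(field, fill_rate(records, field))
```

In this sample data, `postcode` would be a candidate for partitioning, while `phone` would not, because most records have no value for it.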

Prefer matching type DICECOEFF

TOLERANT Match offers a number of matching types (DICECOEFF, COMPARE, etc.) for comparing two field values. These matching types differ mainly in the way short matching text fragments affect the result; accordingly, each has its typical areas of application.

Above all, however, the matching types influence performance. For large amounts of data, DICECOEFF should be used because it has a specially optimised implementation.
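The name DICECOEFF suggests the Sørensen-Dice coefficient, which measures how many short text fragments two strings share. The sketch below shows the textbook version over character bigrams. That DICECOEFF works this way is an assumption; the optimised implementation in TOLERANT Match is not described here and may differ in fragment length, normalisation and scaling.

```python
from collections import Counter


def bigrams(text):
    # Multiset of overlapping two-character fragments, case-insensitive.
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice_coefficient(a, b):
    # 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|)
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Strings shorter than two characters have no bigrams.
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((grams_a & grams_b).values())
    return 2 * overlap / total


print(dice_coefficient("night", "nacht"))  # one shared bigram ("ht") out of 4 + 4
print(dice_coefficient("Meier", "meier"))
```

Because the score is built from counts of shared fragments, a single short matching fragment contributes only a small share to the result.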

Use emptyScore selectively

TOLERANT Match offers an elegant way to take unfilled fields into account when searching: the configuration element »emptyScore« defines how empty fields are scored. As a result, a configuration needs only a few rules.

However, if this option is used with very large amounts of data, performance may decrease. In such cases, it is recommended to define several rules instead, each covering a different combination of empty fields. Conditions control when each rule is applied.
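The difference between the two approaches can be illustrated with a simplified scoring model. Everything in this sketch is hypothetical: the function names, the averaging of field scores, the value 0.8 and the exact-match comparison are placeholders and do not reflect the TOLERANT Match configuration syntax or scoring.

```python
def similarity(a, b):
    # Placeholder comparison: exact match after normalisation.
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def score_with_empty_score(query, record, fields, empty_score=0.8):
    # One rule for all cases: an empty field on either side scores empty_score.
    scores = []
    for field in fields:
        q, r = query.get(field) or "", record.get(field) or ""
        scores.append(empty_score if not q or not r else similarity(q, r))
    return sum(scores) / len(scores)


# Alternative: several rules, each guarded by a condition on the query.
RULES = [
    (lambda q: bool(q.get("street")), ("name", "city", "street")),
    (lambda q: not q.get("street"), ("name", "city")),
]


def score_with_rules(query, record):
    for condition, fields in RULES:
        if condition(query):
            scores = [similarity(query.get(f) or "", record.get(f) or "") for f in fields]
            return sum(scores) / len(scores)
    return 0.0


query = {"name": "Anna Meier", "city": "Stuttgart", "street": ""}
record = {"name": "Anna Meier", "city": "Stuttgart", "street": "Hauptstr. 1"}
print(score_with_empty_score(query, record, ("name", "city", "street")))
print(score_with_rules(query, record))
```

In the first variant, a single rule handles every combination of filled and empty fields. In the second, the condition selects a rule that compares only the fields that are actually filled.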

Hardware platform

In addition to an optimal configuration, the available hardware platform is of course a decisive factor in the performance of TOLERANT Match. In contrast to other TOLERANT products, the sizing of Match depends very much on the data to be processed. The decisive factors for high performance are as much main memory as possible, especially with »Match Service«, and high CPU performance.

Database

If TOLERANT Match Service uses a database to store data, the database should be connected to »Match Service« with the lowest possible latency. Fast query processing is also essential. We therefore recommend installing the database on the same server as Match Service, or on a server that can be reached via a very fast network connection. For the database server, SSD storage has proved successful.

Execution plan

The speed with which TOLERANT Match processes search queries depends not only on the hardware used and the amount of data, but also strongly on the configuration of the program. Different field types, partitioning, the fields used in a search and the specific values in a search all influence the speed. In order to always achieve the shortest possible response time, a plan for executing the individual search query is determined before each search. This execution plan delivers a good result in the vast majority of cases. If the execution plan is to be optimised manually, this can be done at any time via the configuration.
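How TOLERANT Match determines its execution plan is not described here. As a generic illustration of the idea, a planner can pick, for each query, the access path that leaves the fewest records to compare. The function `choose_plan`, the field names and the partition sizes below are invented for this sketch.

```python
def choose_plan(partition_sizes, query):
    """Return the query field with the smallest partition, or None for a full scan."""
    options = []
    for field, value in query.items():
        sizes = partition_sizes.get(field)
        if sizes is not None:
            # A value that occurs in no record selects an empty partition.
            options.append((sizes.get(value, 0), field))
    return min(options)[1] if options else None


# Hypothetical statistics: records per partition for each partitioning field.
partition_sizes = {
    "postcode_prefix": {"70": 40_000, "10": 55_000},
    "customer_type": {"private": 900_000, "business": 100_000},
}

query = {"name": "Anna Meier", "postcode_prefix": "70", "customer_type": "private"}
print(choose_plan(partition_sizes, query))
print(choose_plan(partition_sizes, {"name": "Meier GmbH", "customer_type": "business"}))
```

The choice depends on the specific values in the query: the same field can be a good entry point for one value and a poor one for another.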