Dataset Similarities

Introduction

When browsing for datasets, users may want to know which datasets are similar to a particular one. The Dataset Similarities service fingerprints each dataset by hashing a combination of its title and description with the TLSH algorithm, a locality-sensitive hash. One file is generated for each catalogue, and each dataset's URI is written to it together with the corresponding hash value.

An incoming dataset URI is looked up in the file of its parent catalogue. The dataset's hash value is then compared with the hash values of the other datasets, which allows the most similar datasets to be retrieved.
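The lookup and comparison steps can be sketched as follows. This is a minimal illustration, not the service's implementation: the in-memory `catalogue` mapping, the example URIs and hash strings, and the function names `hex_distance` and `most_similar` are all assumptions. In particular, `hex_distance` is a simple stand-in and is not the TLSH distance score; it only shares the property that 0 means identical and larger values mean less similar.

```python
from typing import Callable, Dict, List, Tuple


def hex_distance(a: str, b: str) -> int:
    """Stand-in distance: sum of absolute differences between hex digits.

    This is NOT the TLSH scoring function. It is a placeholder so the
    ranking logic can run without a TLSH library installed.
    """
    if len(a) != len(b):
        raise ValueError("hashes must have equal length")
    return sum(abs(int(x, 16) - int(y, 16)) for x, y in zip(a, b))


def most_similar(
    uri: str,
    catalogue: Dict[str, str],
    distance: Callable[[str, str], int] = hex_distance,
    limit: int = 3,
) -> List[Tuple[str, int]]:
    """Return up to `limit` (uri, distance) pairs, closest first."""
    # Look up the incoming dataset's hash in its catalogue.
    target = catalogue[uri]
    # Compare it with the hash of every other dataset in the catalogue.
    scored = [
        (other_uri, distance(target, other_hash))
        for other_uri, other_hash in catalogue.items()
        if other_uri != uri
    ]
    # Smallest distance first; ties are broken by URI for stable output.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:limit]


# Hypothetical catalogue content: dataset URI -> hash value (shortened).
catalogue = {
    "https://example.org/dataset/a": "a1b2c3d4",
    "https://example.org/dataset/b": "a1b2c3d5",
    "https://example.org/dataset/c": "ffffffff",
}

result = most_similar("https://example.org/dataset/a", catalogue)
```

Passing the distance function as a parameter keeps the ranking logic independent of the hash algorithm, so a real TLSH comparison could be swapped in for the stand-in without touching `most_similar`.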

For deployment configuration options, please consult the service's readme.md. A Helm chart is available for this service.

Warning

At the moment, the Dataset Similarities service only works with datasets whose language is tagged as English.

Info

No further features are planned for this service, because it is due to be replaced by a new one.

API

This service is not a pipe module.

Rehashing of datasets can be triggered manually or via a cron job. The default cron job runs once a day. Similar datasets can be retrieved via a dedicated endpoint.

The OpenAPI description is available in the service repository.