Harvesting a piveau instance

The piveau hub-repo offers two endpoints that return dataset metadata as RDF. They can be used to harvest the datasets of a portal:

List datasets (/datasets) for a list of all datasets in the portal
List datasets of catalogue (/catalogues/{catalogueId}/datasets) for a list of all datasets in a specific catalogue

Both endpoints offer mostly the same set of query parameters to navigate through the list:

valueType: has to be metadata to return the complete metadata of the datasets
offset: the offset of the first returned dataset, can be used together with limit to get the next set of datasets in the list
limit: the number of returned datasets. Max 5000, default 100. Can be used together with offset to get the next set of datasets in the list
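As a minimal sketch of how offset and limit work together, the following Python helpers build the URLs for successive pages of the list datasets endpoint. The base URL `https://example.org/hub-repo` and the function names are hypothetical, and the total number of datasets is assumed to be known to the caller.

```python
from urllib.parse import urlencode

MAX_LIMIT = 5000  # maximum page size stated in the parameter list above


def page_url(base_url: str, offset: int, limit: int = 100) -> str:
    """Build the URL for one page of the dataset list (hypothetical helper)."""
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    if offset < 0:
        raise ValueError("offset must not be negative")
    # valueType=metadata requests the complete metadata of each dataset
    query = urlencode({"valueType": "metadata", "offset": offset, "limit": limit})
    return f"{base_url.rstrip('/')}/datasets?{query}"


def page_urls(base_url: str, total: int, limit: int = 100):
    """Yield one URL per page needed to cover `total` datasets."""
    for offset in range(0, total, limit):
        yield page_url(base_url, offset, limit)
```

For 250 datasets and the default limit of 100, this yields three URLs with offsets 0, 100 and 200.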

Only available in the list datasets endpoint:

hydra: set to true to include a HYDRA pagination object in the RDF response. This object will include a link to the next page of datasets.

Returned format and Accept header: If no Accept header is set, the response has the content type application/ld+json. Other possible formats are listed in the API description and include, for example, application/rdf+xml and text/turtle.
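To request a different serialization, set the Accept header on the request. The sketch below uses only the Python standard library; the helper name `build_request` and the example URL are assumptions made for illustration.

```python
from urllib.request import Request


def build_request(url: str, rdf_format: str = "text/turtle") -> Request:
    """Prepare a GET request that asks for a specific RDF serialization."""
    # The Accept header selects the format; without it the service is
    # described above as returning application/ld+json.
    return Request(url, headers={"Accept": rdf_format})


request = build_request("https://example.org/hub-repo/datasets?valueType=metadata")
```

The prepared request can then be passed to `urllib.request.urlopen` to fetch the page.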

Warning

Please always refer to the API documentation for an up-to-date reference.

Hydra paging in hub-repo

The listDataset endpoint has an optional parameter hydra, which adds HYDRA paging to the dataset list. The HYDRA graph looks like this:

{
    "@id": "http://piveau.io/datasets/?valueType=metadata&hydra=true&limit=3&offset=0",
    "@graph": [
        {
            "@id": "http://piveau.io/datasets/?valueType=metadata&hydra=true&limit=3&offset=0",
            "@type": "http://www.w3.org/ns/hydra/core#PartialCollectionView",
            "http://www.w3.org/ns/hydra/core#totalItems": {
                "@value": "1234",
                "@type": "http://www.w3.org/2001/XMLSchema#int"
            },
            "http://www.w3.org/ns/hydra/core#next": "http://piveau.io/datasets/?valueType=metadata&hydra=true&limit=3&offset=3",
            "http://www.w3.org/ns/hydra/core#first": "http://piveau.io/datasets/?valueType=metadata&hydra=true&limit=3&offset=0",
            "http://www.w3.org/ns/hydra/core#last": "http://piveau.io/datasets/?valueType=metadata&hydra=true&limit=3&offset=0"
        }
    ]
}
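A harvester can follow the next link until none is left. The following sketch pulls the next link and the total count out of a parsed JSON-LD document. It assumes the expanded key names shown in the graph above; the function names are hypothetical, and a production harvester would more likely parse the response with an RDF library, since the exact JSON-LD layout of a real response may differ.

```python
HYDRA = "http://www.w3.org/ns/hydra/core#"
VIEW_TYPE = HYDRA + "PartialCollectionView"


def find_view(node):
    """Recursively search parsed JSON-LD for the PartialCollectionView node."""
    if isinstance(node, dict):
        types = node.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if VIEW_TYPE in types:
            return node
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_view(child)
        if found is not None:
            return found
    return None


def _plain(value):
    """Unwrap a JSON-LD value object ({"@value": ...} or {"@id": ...})."""
    if isinstance(value, dict):
        return value.get("@value", value.get("@id"))
    return value


def next_page(document):
    """Return the URL of the next page, or None if there is none."""
    view = find_view(document)
    return _plain(view.get(HYDRA + "next")) if view else None


def total_items(document):
    """Return the total number of datasets as an int, or None if absent."""
    view = find_view(document)
    raw = _plain(view.get(HYDRA + "totalItems")) if view else None
    return int(raw) if raw is not None else None
```

Applied to the graph above (parsed with `json.loads`), `next_page` returns the URL ending in `offset=3` and `total_items` returns 1234.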

Warning

As of this writing, this parameter is only available on the list datasets endpoint, not on the list datasets of catalogue endpoint.

If this service runs behind a proxy that rewrites the URI, the paging links may contain wrong URIs. To fix this, configure your proxy to send the correct absolute URI of the current request in the X-Original-URI header; the service will then use that value. For a Kubernetes ingress, you might set it like this:

    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header X-Original-URI $scheme://$host$request_uri;
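From the harvester's side, a rewritten URI typically shows up as a paging link that points to a different scheme or host than the one that was requested. The following sketch is one way to detect this before following a link; the helper name `same_origin` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def same_origin(requested_url: str, link_url: str) -> bool:
    """Check that a paging link uses the same scheme and host as the request."""
    requested, link = urlsplit(requested_url), urlsplit(link_url)
    return (requested.scheme, requested.netloc.lower()) == (
        link.scheme,
        link.netloc.lower(),
    )
```

If the check fails for a next link, the proxy is probably not forwarding the original URI, and the harvester can fall back to building the page URLs from offset and limit itself.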