Harvesting a piveau instance
The piveau hub repo offers two endpoints that return the metadata of the datasets in RDF. These can be used to harvest the datasets of a portal:
- List datasets (/datasets) for a list of all datasets in the portal
- List datasets of catalogue (/catalogues/{catalogueId}/datasets) for a list of all datasets in a specific catalogue
Both endpoints offer mostly the same set of query parameters to navigate through the list:
- valueType: has to be metadata to return the complete metadata of the datasets
- offset: the offset of the first returned dataset, can be used together with limit to get the next set of datasets in the list
- limit: the number of returned datasets. Max 5000, default 100. Can be used together with offset to get the next set of datasets in the list
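The offset and limit parameters can be combined into a simple paging loop. The following is a minimal sketch, not part of piveau itself: the function names, the example base URL, and the fetch callable (which is expected to request a URL and return the datasets it found as a list) are hypothetical, and stopping at the first page shorter than limit is an assumption about how the end of the list shows up.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode

MAX_LIMIT = 5000  # upper bound for `limit` as stated above


def page_url(base_url: str, offset: int, limit: int = 100) -> str:
    """Build the URL for one page of the dataset list."""
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    query = urlencode({"valueType": "metadata", "offset": offset, "limit": limit})
    return f"{base_url}?{query}"


def iter_pages(
    base_url: str,
    fetch: Callable[[str], list],  # hypothetical: requests a URL, returns its datasets
    limit: int = 100,
) -> Iterator[list]:
    """Yield non-empty pages until a page is shorter than `limit` (assumed end of list)."""
    offset = 0
    while True:
        page = fetch(page_url(base_url, offset, limit))
        if page:
            yield page
        if len(page) < limit:
            break
        offset += limit
```

Passing fetch in as a parameter keeps the paging logic independent of the HTTP client and RDF parser used for the actual harvest.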
Only available in the list datasets endpoint:
- hydra: set to true to include a HYDRA pagination object in the RDF response. This object will include a link to the next page of datasets.
Returned format & Accept header: If no specific Accept header is set, the returned content type is application/ld+json. Other possible formats are listed in the API description and include, e.g., application/rdf+xml or text/turtle.
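As an illustration of selecting the response format, the sketch below builds a request with an explicit Accept header using Python's standard library. The helper name build_request and the example host are assumptions made for this example; only the media types come from the text above.

```python
from urllib.request import Request

# Media types named above; the API description may list more.
JSON_LD = "application/ld+json"  # returned when no Accept header is set
RDF_XML = "application/rdf+xml"
TURTLE = "text/turtle"


def build_request(url: str, rdf_format: str = TURTLE) -> Request:
    """Create a GET request that asks for the given RDF serialization."""
    return Request(url, headers={"Accept": rdf_format})


# Hypothetical host; replace with the base URL of the hub repo to harvest.
request = build_request("https://example.org/datasets?valueType=metadata")
```

The request can then be sent with urllib.request.urlopen(request).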
Warning
Please always refer to the API documentation for an up-to-date reference.
Hydra paging in hub-repo
The listDataset endpoint has an optional parameter hydra, which adds HYDRA paging to the dataset list. The HYDRA
pagination is compatible with the Hydra specification.
The HYDRA graph will look like this:
{
  "@id": "http://piveau.io/datasets/?valueType=metadata&hydra=true&limit=3&offset=0",
  "@graph": [
    {
      "@id": "http://piveau.io/datasets/?valueType=metadata&hydra=true&limit=3&offset=0",
      "@type": "http://www.w3.org/ns/hydra/core#PartialCollectionView",
      "http://www.w3.org/ns/hydra/core#totalItems": {
        "@value": "1234",
        "@type": "http://www.w3.org/2001/XMLSchema#int"
      },
      "http://www.w3.org/ns/hydra/core#next": "http://piveau.io/datasets/?valueType=metadata&hydra=true&limit=3&offset=3",
      "http://www.w3.org/ns/hydra/core#first": "http://piveau.io/datasets/?valueType=metadata&hydra=true&limit=3&offset=0",
      "http://www.w3.org/ns/hydra/core#last": "http://piveau.io/datasets/?valueType=metadata&hydra=true&limit=3&offset=0"
    }
  ]
}
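A harvester can follow the next link instead of computing offsets itself. The sketch below reads the link from a JSON-LD response shaped like the example above. The helper names find_view and hydra_value are hypothetical, and the code assumes the expanded property IRIs shown in the example; a real response also contains the dataset nodes and may be serialized differently, so a full JSON-LD or RDF parser is the more robust choice.

```python
import json
from typing import Optional

HYDRA = "http://www.w3.org/ns/hydra/core#"


def find_view(document: dict) -> Optional[dict]:
    """Return the first node typed hydra:PartialCollectionView, if any."""
    for node in document.get("@graph", []):
        types = node.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if HYDRA + "PartialCollectionView" in types:
            return node
    return None


def hydra_value(view: dict, name: str) -> Optional[str]:
    """Read a hydra property given as a plain string or an {"@id"/"@value"} object."""
    value = view.get(HYDRA + name)
    if isinstance(value, dict):
        value = value.get("@id") or value.get("@value")
    return value


# Trimmed copy of the example graph above.
SAMPLE = json.loads("""
{
  "@graph": [
    {
      "@id": "http://piveau.io/datasets/?valueType=metadata&hydra=true&limit=3&offset=0",
      "@type": "http://www.w3.org/ns/hydra/core#PartialCollectionView",
      "http://www.w3.org/ns/hydra/core#totalItems": {
        "@value": "1234",
        "@type": "http://www.w3.org/2001/XMLSchema#int"
      },
      "http://www.w3.org/ns/hydra/core#next": "http://piveau.io/datasets/?valueType=metadata&hydra=true&limit=3&offset=3"
    }
  ]
}
""")

view = find_view(SAMPLE)
next_url = hydra_value(view, "next")          # URL of the next page
total = int(hydra_value(view, "totalItems"))  # total number of datasets
```

When hydra_value(view, "next") returns None, the sketch treats the current page as the last one; this stopping rule is an assumption, not documented behaviour.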
Warning
As of this writing, this parameter is only available on the list datasets endpoint, not on the list datasets of catalogue endpoint.
If this service is behind a proxy that rewrites the URI, wrong URIs might show up in the pagination links. To fix this, configure your
proxy to send the correct current absolute URI in the X-Original-URI header; the service will then use it.
For a Kubernetes ingress you might want to set it like this:
nginx.ingress.kubernetes.io/configuration-snippet: |
  proxy_set_header X-Original-URI $scheme://$host$request_uri;
If you do not want to use $request_uri, but want to use normalized URIs, you can use:
nginx.ingress.kubernetes.io/configuration-snippet: |
  proxy_set_header X-Original-URI $scheme://$host$uri$is_args$args;
$uri is the normalized URI without query string and $is_args$args adds the query string if present.
Info
It is important to set the full absolute URI including the full path and query string, otherwise the links in the HYDRA
pagination will not work correctly. E.g. the piveau demo instance currently uses $scheme://$host/api/hub/repo$uri$is_args$args
to include the full path /api/hub/repo and the query string.
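To make the composition of the header value concrete, the following sketch mimics in Python how the nginx variables expand into the absolute URI. It is an illustration only and not code that runs in the proxy: the function original_uri and the host example.org are hypothetical, and the path prefix is the one mentioned above.

```python
def original_uri(scheme: str, host: str, uri: str, args: str = "", prefix: str = "") -> str:
    """Mimic $scheme://$host<prefix>$uri$is_args$args."""
    # $is_args expands to "?" only when the request has a query string.
    is_args = "?" if args else ""
    return f"{scheme}://{host}{prefix}{uri}{is_args}{args}"


# With a path prefix in front of the normalized URI:
with_prefix = original_uri(
    "https", "example.org", "/datasets/",
    args="valueType=metadata&hydra=true", prefix="/api/hub/repo",
)

# Without a query string, no trailing "?" is added:
without_args = original_uri("https", "example.org", "/datasets/")
```

If the prefix is left out while the service is published under a sub-path, the generated links point to the wrong location, which is the failure described above.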