piveau RDF Vocabularies#
Introduction#
Vocabularies play a crucial role in structuring and standardizing data within the piveau open data portal. They help ensure interoperability, reusability, and meaningful indexing of datasets. This document explains what vocabularies are, how they are used in piveau, and provides detailed technical instructions on configuring and managing them.
What Are Vocabularies?#
A vocabulary is a structured set of terms and concepts used to define and categorize data in a
machine-readable format. In piveau, vocabularies help to:
- Standardize dataset properties (e.g., dcat:theme linking to predefined categories).
- Enable efficient data retrieval and enrichment.
- Improve search capabilities through structured indexing.
For more information on vocabularies in the semantic web, refer to the W3C ontology standards.
Vocabularies Used in piveau#
Linking Data to Vocabularies#
The RDF schema in piveau recommends using specific vocabularies for certain RDF properties.
For example, dcat:theme links datasets to categories defined in the EU Data-Theme Vocabulary.
Example: Assigning a Theme to a Dataset
@prefix dcat: <http://www.w3.org/ns/dcat#> .
<http://data.europa.eu/88u/dataset/simple-dataset> dcat:theme <http://publications.europa.eu/resource/authority/data-theme/AGRI> .
The dataset can reference the theme (AGRI) without needing additional metadata, as the vocabulary already defines the term.
Querying Vocabulary-Linked Data#
Using SPARQL, users can retrieve structured information about a dataset's assigned vocabulary terms:
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?theme ?prefLabel
WHERE {
<http://data.europa.eu/88u/dataset/simple-dataset> dcat:theme ?theme .
?theme skos:prefLabel ?prefLabel .
}
Warning
- The piveau triplestore does not resolve external URIs automatically.
- Vocabulary graphs must be stored locally in the triplestore to enable querying and indexing.
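The effect of the warning above can be illustrated with a minimal sketch that uses a plain Python set of triples as a stand-in for the triplestore. This is not piveau code; the `theme_labels` helper and the label text are assumptions made for illustration. The lookup mirrors the SPARQL join on dcat:theme and skos:prefLabel and only finds a label once the vocabulary triple is stored locally.

```python
# In-memory stand-in for a triplestore: a set of (subject, predicate, object) triples.
DCAT_THEME = "http://www.w3.org/ns/dcat#theme"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
DATASET = "http://data.europa.eu/88u/dataset/simple-dataset"
AGRI = "http://publications.europa.eu/resource/authority/data-theme/AGRI"


def theme_labels(triples, dataset):
    """Join dcat:theme with skos:prefLabel, like the SPARQL query above."""
    themes = {o for s, p, o in triples if s == dataset and p == DCAT_THEME}
    return sorted(
        (s, o) for s, p, o in triples if s in themes and p == SKOS_PREF_LABEL
    )


# Only the dataset triple is stored. The theme URI is never dereferenced,
# so the join finds no label.
store = {(DATASET, DCAT_THEME, AGRI)}
without_vocabulary = theme_labels(store, DATASET)

# Once the vocabulary triple is stored locally, the same lookup succeeds.
# The label text is illustrative, not copied from the published vocabulary.
store.add((AGRI, SKOS_PREF_LABEL, "Agriculture (illustrative label)"))
with_vocabulary = theme_labels(store, DATASET)

print(without_vocabulary)
print(with_vocabulary)
```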
Vocabulary storage structure#
Piveau organizes vocabularies using a three-layer graph structure in the triplestore, ensuring clear separation and efficient management of different types of vocabulary-related information.
graph LR
subgraph "Catalogue Graph"
Cat["Vocabularies Catalogue<br/>(dcat:Catalog)<br/>graph: https://piveau.io/id/catalogue/vocabularies"]
end
subgraph "Vocabulary Dataset Graphs"
DS1["Dataset: theme-vocabulary<br/>(dcat:Dataset)<br/>graph: https://piveau.io/set/data/theme-vocabulary"]
DS2["Dataset: language-vocabulary<br/>(dcat:Dataset)<br/>graph: https://piveau.io/set/data/language-vocabulary"]
DS3["Dataset: other-vocabulary<br/>(dcat:Dataset)<br/>graph: https://piveau.io/set/data/other-vocabulary"]
end
subgraph "Vocabulary Content Graphs"
V1["Theme Vocabulary Content<br/>(skos:ConceptScheme)<br/>graph: http://publications.europa.eu/resource/authority/data-theme"]
V2["Language Vocabulary Content<br/>(skos:ConceptScheme)<br/>graph: http://publications.europa.eu/resource/authority/language"]
V3["Other Vocabulary Content<br/>(skos:ConceptScheme)<br/>graph: http://example.org/vocabulary/other"]
end
Cat -->|dcat:dataset| DS1
Cat -->|dcat:dataset| DS2
Cat -->|dcat:dataset| DS3
DS1 -->|dcat:distribution<br/>accessURL| V1
DS2 -->|dcat:distribution<br/>accessURL| V2
DS3 -->|dcat:distribution<br/>accessURL| V3
classDef catalogStyle fill:#f9f,stroke:#333,stroke-width:2px
classDef datasetStyle fill:#bbf,stroke:#333,stroke-width:2px
classDef vocabStyle fill:#bfb,stroke:#333,stroke-width:2px
class Cat catalogStyle
class DS1,DS2,DS3 datasetStyle
class V1,V2,V3 vocabStyle
Layer 1: Vocabularies Catalogue
- Graph: https://piveau.io/id/catalogue/vocabularies
- Type: dcat:Catalog
- Purpose: Serves as the main entry point for vocabulary management
- Content: Contains metadata about all vocabulary datasets
- Visibility: Marked as hidden to exclude from general dataset indexing
Layer 2: Vocabulary Dataset Graphs
Each vocabulary has its own dataset representation stored in a dedicated graph:
- Graph Pattern: https://piveau.io/set/data/{vocabulary-id} (If following the standard piveau URI schema)
- Type: dcat:Dataset
- Purpose: Stores metadata about specific vocabularies
- Key Information:
  - Vocabulary hash (stored as dct:identifier)
  - Access URL to vocabulary content
  - Version information
  - Update timestamps
  - Description and documentation
Layer 3: Vocabulary Content Graphs
The actual vocabulary definitions are stored in separate graphs, typically using their original URIs:
- Graph Pattern: Original vocabulary URI (e.g., http://publications.europa.eu/resource/authority/data-theme)
- Type: skos:ConceptScheme
- Purpose: Contains the actual vocabulary terms and relationships
- Content:
  - SKOS concepts and their relationships
  - Labels in multiple languages
  - Hierarchical structures
  - Mappings to other vocabularies
Relationships Between Layers
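- Catalogue to Datasets: The vocabularies catalogue references each vocabulary dataset via dcat:dataset.
- Datasets to Content: Each vocabulary dataset points to its vocabulary content graph through a dcat:distribution and its access URL.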
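As a rough illustration of the three layers, the following sketch derives the graph names involved for one vocabulary. The `VocabularyGraphs` class and the `graphs_for` function are hypothetical helpers, not part of piveau, and the dataset graph pattern assumes the standard piveau URI schema mentioned above.

```python
from dataclasses import dataclass

CATALOGUE_GRAPH = "https://piveau.io/id/catalogue/vocabularies"
DATASET_GRAPH_PATTERN = "https://piveau.io/set/data/{vocabulary_id}"


@dataclass(frozen=True)
class VocabularyGraphs:
    catalogue: str  # layer 1: the shared vocabularies catalogue
    dataset: str    # layer 2: metadata about this vocabulary
    content: str    # layer 3: the SKOS concept scheme itself


def graphs_for(vocabulary_id: str, vocabulary_uri: str) -> VocabularyGraphs:
    """Return the three graph names for one vocabulary (illustrative helper)."""
    return VocabularyGraphs(
        catalogue=CATALOGUE_GRAPH,
        dataset=DATASET_GRAPH_PATTERN.format(vocabulary_id=vocabulary_id),
        content=vocabulary_uri,  # content graphs typically keep the original URI
    )


theme = graphs_for(
    "theme-vocabulary",
    "http://publications.europa.eu/resource/authority/data-theme",
)
print(theme.catalogue)
print(theme.dataset)
print(theme.content)
```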
Vocabularies in different piveau services#
Vocabulary storage and management with Repo:
- Vocabularies are stored in a catalogue structure.
- Hidden from being indexed as normal datasets to maintain system organization
- Each dataset contains metadata linking to the vocabulary.
- A hash comparison prevents unnecessary updates during import.
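The hash comparison mentioned above can be sketched as follows. This is only an illustration of the idea: the source does not say which hash algorithm piveau uses or what exactly is hashed, so SHA-256 over the raw payload bytes and the function names are assumptions.

```python
import hashlib
from typing import Optional


def content_hash(rdf_bytes: bytes) -> str:
    """Hash the raw vocabulary payload (SHA-256 is an assumption)."""
    return hashlib.sha256(rdf_bytes).hexdigest()


def needs_update(stored_hash: Optional[str], rdf_bytes: bytes) -> bool:
    """Import only when there is no stored hash or the content changed."""
    return stored_hash != content_hash(rdf_bytes)


payload = b"<rdf:RDF/>"  # placeholder payload for illustration
stored = content_hash(payload)

first_import = needs_update(None, payload)         # nothing stored yet
unchanged = needs_update(stored, payload)          # identical content
changed = needs_update(stored, payload + b"\n")    # any byte difference counts

print(first_import, unchanged, changed)
```

A byte-level hash treats two serializations of the same graph as different content, so a real implementation may normalize the RDF before hashing.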
Vocabulary usage in the Data Provider Interface:
- Users can assign themes to datasets without handling raw URIs.
- Labels are resolved for improved usability.
Vocabulary usage in search & indexing:
- Before indexing, vocabulary properties are resolved to enhance search results.
- Each vocabulary term is stored as a searchable instance within an index.
- The search service indexes each vocabulary separately (vocabulary_* naming convention).
Adding and managing vocabularies#
On first start#
piveau offers a manual command for loading the vocabularies used in DCAT-AP and DCAT-AP.de. The command is called installVocabularies, and its -h flag prints more information about it. It is available through the hub repo shell and will soon also be available via the hub repo action API.
Note
The hub repo shell can be enabled via the PIVEAU_HUB_SHELL_CONFIG environment variable, as is done in the sample config. It is then accessible via repo-url/shell.html. See the CLI reference for more details.
Non-SKOS vocabularies#
In piveau, there are some vocabularies that are not available as SKOS vocabularies but are still needed as vocabularies for the frontend. These vocabularies can be imported directly via the hub search CLI and are not stored as RDF in hub repo.
Vocabulary Management API#
The piveau API provides endpoints for managing vocabularies through standard HTTP operations.
Each vocabulary is identified by a unique vocabularyId.
Please refer to the OpenAPI description for the most up-to-date reference.
Authentication#
For write operations (PUT, DELETE), authentication is required using either:
- API Key: Provided in the X-API-Key header
- Bearer Token: Provided in the Authorization header with the Bearer prefix
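The two authentication options can be expressed as a small helper that builds the request headers. The function name `auth_headers` is hypothetical, and preferring the API key when both credentials are given is an assumption of this sketch, not documented behavior.

```python
from typing import Dict, Optional


def auth_headers(
    api_key: Optional[str] = None, token: Optional[str] = None
) -> Dict[str, str]:
    """Build headers for write operations (PUT, DELETE)."""
    if api_key:
        # Assumption: the API key wins when both credentials are supplied.
        return {"X-API-Key": api_key}
    if token:
        return {"Authorization": f"Bearer {token}"}
    raise ValueError("write operations need an API key or a bearer token")


key_headers = auth_headers(api_key="your_api_key")
token_headers = auth_headers(token="your_token")
print(key_headers)
print(token_headers)
```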
Endpoints#
1. Retrieve a Vocabulary#
Retrieves a vocabulary in RDF format.
Example request:
curl -X GET \
'https://piveau.io/api/hub/repo/vocabularies/data-theme' \
-H 'Accept: application/rdf+xml'
Response (200 OK):
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:skos="http://www.w3.org/2004/02/skos/core#">
<skos:ConceptScheme rdf:about="http://publications.europa.eu/resource/authority/data-theme">
<!-- Vocabulary content -->
</skos:ConceptScheme>
</rdf:RDF>
2. Check Vocabulary Existence#
Checks if a vocabulary exists without retrieving its content.
Example request:
Response (200 OK):
3. Create or Update a Vocabulary#
Creates a new vocabulary or updates an existing one. Requires authentication. If the vocabulary catalogue and dataset do not exist, they are created automatically.
Example request for creating/updating:
curl -X PUT \
'https://piveau.io/api/hub/repo/vocabularies/custom-theme' \
-H 'X-API-Key: your_api_key' \
-H 'Content-Type: application/rdf+xml' \
--data-binary @vocabulary.rdf
Possible responses:
- 201 Created (new vocabulary)
- 204 No Content (updated existing vocabulary)
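For scripted use, the same PUT request can be prepared with Python's standard library. The sketch below only builds the request and does not send it. The base URL is taken from the curl example above, while `build_put_request`, `put_outcome`, and the placeholder payload are assumptions for illustration.

```python
import urllib.request

BASE_URL = "https://piveau.io/api/hub/repo/vocabularies"


def build_put_request(
    vocabulary_id: str, rdf_xml: bytes, api_key: str
) -> urllib.request.Request:
    """Prepare the create-or-update request without sending it."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{vocabulary_id}",
        data=rdf_xml,
        method="PUT",
        headers={
            "X-API-Key": api_key,
            "Content-Type": "application/rdf+xml",
        },
    )


def put_outcome(status: int) -> str:
    """Map the documented success codes to a short description."""
    return {201: "created", 204: "updated"}.get(status, "unexpected")


request = build_put_request("custom-theme", b"<rdf:RDF/>", "your_api_key")
print(request.get_method(), request.full_url)
print(put_outcome(201), put_outcome(204))
```

Passing the prepared request to `urllib.request.urlopen` would send it; the status code of the response can then be interpreted with `put_outcome`.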
4. Delete a Vocabulary#
Permanently removes a vocabulary. Requires authentication.
Example request:
curl -X DELETE \
'https://piveau.io/api/hub/repo/vocabularies/custom-theme' \
-H 'X-API-Key: your_api_key'
Response (204 No Content) if successful.
Error Handling#
Common error responses:
- Authentication Errors
  - 401 Unauthorized: Missing or invalid credentials
  - 403 Forbidden: Valid credentials but insufficient permissions
- Resource Errors
  - 404 Not Found: Vocabulary doesn't exist
  - 400 Bad Request: Invalid RDF data in PUT request
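A client might translate these status codes into messages along the following lines. This is a minimal sketch: the `ERRORS` table simply restates the list above, and the `describe_error` helper is hypothetical.

```python
# Error codes and meanings as listed above.
ERRORS = {
    400: "Bad Request: invalid RDF data in PUT request",
    401: "Unauthorized: missing or invalid credentials",
    403: "Forbidden: valid credentials but insufficient permissions",
    404: "Not Found: vocabulary doesn't exist",
}


def describe_error(status: int) -> str:
    """Return a readable message for a status code, or an empty string on success."""
    if status < 400:
        return ""
    return ERRORS.get(status, f"Unexpected error status {status}")


for code in (204, 401, 404, 500):
    print(code, describe_error(code) or "success")
```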
Vocabulary Enrichment in hub-search#
Vocabulary references are enriched during indexing in hub-repo or hub-search. The fields to be indexed can be configured with piveau profile. The enrichment process itself is configured via the Elasticsearch configuration in piveau hub search.
Example: Vocabulary Enrichment#
Dataset with a Vocabulary-Defined Type#
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
<https://piveau.io/set/data/simple-dataset>
a dcat:Dataset ; # (1)!
dct:type <http://publications.europa.eu/resource/authority/dataset-type/TEST_DATA> . # (2)!
- Defines the resource as a dcat:Dataset.
- Assigns a dataset type from a controlled vocabulary.
Enriched Representation in Search Index#
{
"id": "simple-dataset",
"type": {
"id": "TEST_DATA", // (1)!
"label": "Test data", // (2)!
"resource": "http://publications.europa.eu/resource/authority/dataset-type/TEST_DATA" // (3)!
}
}
- Retains the dataset type identifier.
- Adds a human-readable label for improved usability.
- Stores the original resource URI.
Configuring Enrichment in hub-search#
To enable enrichment, hub-search must be configured in the Elasticsearch configuration:
{
"PIVEAU_HUB_SEARCH_ES_CONFIG": {
"vocabulary": {
"dataset-type": {
"fields": ["type"], // (1)!
"excludes": ["distributions"], // (2)!
"replacements": [
"id:id", // (3)!
"label:pref_label.en", // (4)!
"resource:resource" // (5)!
]
}
}
}
}
- Specifies which fields to enrich (in this case, type).
- Excludes distributions.type from enrichment to prevent conflicts.
- Maps the type.id field to the vocabulary ID.
- Replaces type.label with the preferred English label from the vocabulary.
- Ensures type.resource retains the original URI.
Configuration Details#
- Vocabulary Identifier (dataset-type):
  - Must match the vocabulary's identifier in the system
  - Used to locate the correct vocabulary data
- Fields Array (fields):
  - Lists all dataset properties that should be enriched using this vocabulary
  - Example: "type" enriches any property named "type"
- Excludes Array (excludes):
  - Specifies JSON paths where enrichment should not occur
  - Example: "distributions" prevents enriching "type" fields within distribution objects
  - Optional: Use "includes" instead to specify only where enrichment should occur
- Replacements Array (replacements):
  - Format: "target:source"
    - Target: Field name in the enriched output
    - Source: Field name in the vocabulary data
  - Common mappings:
    - "id:id" - Short identifier
    - "label:pref_label.en" - English display label
    - "resource:resource" - Full URI
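To make the interplay of fields, excludes, and replacements more concrete, here is a minimal Python sketch of how such a configuration could be applied to a document. It is not the hub-search implementation. It assumes that the unenriched field holds the term URI as a string and that vocabulary entries are available as dictionaries keyed by that URI; the `enrich` and `lookup` helpers are hypothetical.

```python
TEST_DATA = "http://publications.europa.eu/resource/authority/dataset-type/TEST_DATA"

CONFIG = {
    "fields": ["type"],
    "excludes": ["distributions"],
    "replacements": ["id:id", "label:pref_label.en", "resource:resource"],
}

# Hypothetical vocabulary data, keyed by term URI.
VOCABULARY = {
    TEST_DATA: {
        "id": "TEST_DATA",
        "pref_label": {"en": "Test data"},
        "resource": TEST_DATA,
    }
}


def lookup(entry, dotted_path):
    """Resolve a source path such as 'pref_label.en' inside a vocabulary entry."""
    value = entry
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def enrich(node, config, vocabulary, path=()):
    """Return a copy of node with configured fields replaced by vocabulary data."""
    if isinstance(node, list):
        return [enrich(item, config, vocabulary, path) for item in node]
    if not isinstance(node, dict):
        return node
    excluded = any(part in config["excludes"] for part in path)
    result = {}
    for key, value in node.items():
        known_term = isinstance(value, str) and value in vocabulary
        if key in config["fields"] and not excluded and known_term:
            pairs = (item.split(":", 1) for item in config["replacements"])
            result[key] = {
                target: lookup(vocabulary[value], source) for target, source in pairs
            }
        else:
            result[key] = enrich(value, config, vocabulary, path + (key,))
    return result


document = {
    "id": "simple-dataset",
    "type": TEST_DATA,
    "distributions": [{"id": "distribution-1", "type": TEST_DATA}],
}
enriched = enrich(document, CONFIG, VOCABULARY)
print(enriched["type"])
print(enriched["distributions"][0]["type"])
```

In this sketch the top-level type becomes an object with id, label, and resource, while the type inside distributions is left untouched because its path contains an excluded segment.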
Future Enhancements#
Future improvements will need to:
- Index additional metadata for specific vocabularies (e.g., corporate bodies).
- Extend the existing schema for more flexible enrichment strategies.
- Reference additional dataset attributes beyond standard vocabularies.
Relevant Code References:
- VocabularyHelper.kt
- Enrichment Implementation
- Specific Enrichment Logic (Line 191)
Understanding SKOS#
SKOS (Simple Knowledge Organization System) is a W3C standard used to represent controlled vocabularies, taxonomies, and thesauri in RDF. Think of it as a way to organize concepts and their relationships, similar to how you might organize items in a library catalog or product categories in an online store.
Key SKOS Concepts#
- Concepts
  - The basic building blocks of SKOS
  - Represent ideas, meanings, or categories
  - Example: "Agriculture" as a dataset theme
- Labels
  - Ways to name concepts in different languages:
    - prefLabel: The main label (only one per language)
    - altLabel: Alternative labels or synonyms
    - hiddenLabel: Labels for search matching
- Hierarchical Relationships
  - broader: Links to more general concepts
  - narrower: Links to more specific concepts
- Concept Schemes
  - Collections of concepts
  - Similar to a controlled vocabulary or taxonomy
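The concepts above map naturally onto a small data structure. The following sketch models a SKOS concept in plain Python and shows label selection with a language fallback plus matching on alternative labels. The `Concept` class, the helper functions, and the label values are illustrative assumptions, not piveau code or data copied from the published vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

AGRI = "http://publications.europa.eu/resource/authority/data-theme/AGRI"


@dataclass
class Concept:
    uri: str
    pref_label: Dict[str, str]  # prefLabel: one per language
    alt_labels: Dict[str, List[str]] = field(default_factory=dict)  # altLabel
    broader: Optional[str] = None  # URI of the more general concept, if any


def display_label(concept: Concept, language: str, fallback: str = "en") -> str:
    """Prefer the requested language, then the fallback, then the bare URI."""
    return (
        concept.pref_label.get(language)
        or concept.pref_label.get(fallback)
        or concept.uri
    )


def matches(concept: Concept, term: str) -> bool:
    """Case-insensitive match against preferred and alternative labels."""
    labels = list(concept.pref_label.values())
    for alternatives in concept.alt_labels.values():
        labels.extend(alternatives)
    return any(term.casefold() == label.casefold() for label in labels)


# Illustrative labels only.
agri = Concept(
    uri=AGRI,
    pref_label={"en": "Agriculture", "de": "Landwirtschaft"},
    alt_labels={"en": ["Farming"]},
)

print(display_label(agri, "de"))
print(display_label(agri, "fr"))
print(matches(agri, "farming"))
```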
SKOS in Piveau#
In piveau, SKOS is used to:
1. Organize dataset themes and categories
2. Provide multilingual labels for concepts
3. Support semantic search functionality
Example of a dataset using SKOS concepts:
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
<https://piveau.io/set/data/farming-dataset>
a dcat:Dataset ;
dcat:theme <http://publications.europa.eu/resource/authority/data-theme/AGRI> .
When this dataset is displayed in the portal:
- The SKOS prefLabel is shown instead of the URI
- Users can find the dataset using any altLabel