How it works
Introduction
The main purpose of the Data Manager, or rs-dam microservice, is to populate
the REGARDS meta catalog. The microservice is composed of several
modules: crawler module, dam module (main core), indexer module, model module and opensearch module.
To do so, this microservice uses:
- Data sources to retrieve products from several catalogs. Data sources are configured through a UI or an XML file (AIP, GeoJSON, external database; see REGARDS UI).
- Data models to transform and standardize crawled products before adding them into the meta catalog, represented by the Elasticsearch index.
- Data access rights to calculate access rights of each product in the meta catalog. Access rights are the permissions granted to a group of users for accessing a set of products that constitute a dataset (see REGARDS UI).
This microservice is required to expose products managed by the OAIS Products Manager (rs-ingest microservice), the
GeoJSON Products Manager (rs-fem microservice), or products accessible from an external database or a web service.
Each Elasticsearch index stores products for each project or tenant created in the REGARDS application.
Main points of data modification in Elasticsearch
The data stored in Elasticsearch consist of Datasets and DataObjects. DataObjects can originate from various sources depending on the selected plugin (OAIS, FEM, external database, etc.).
In REGARDS, there are several ways to add, modify, or delete objects stored in Elasticsearch:
- Through the CrawlingJob (data crawler/harvester), which is detailed in the following sections.
- When a Dataset is updated or when access rights to a Dataset are modified, a DatasetEvent is triggered and processed by rs-dam to update the Dataset stored in Elasticsearch, as well as all DataObjects linked to this Dataset.
- When a product is deleted from the OAIS or FEM catalogs, a FeatureEvent is generated and processed by rs-dam, which immediately requests the deletion of the DataObject from Elasticsearch.
Meta catalog population
A scheduler runs at a configurable interval, iterates through each configured datasource, and creates a CrawlOneDatasourceJob job if needed. This allows independent and parallel management of data ingestion for each datasource. The scheduling frequency can be adjusted by the user (see the static configuration).
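As an illustration of the scheduling pass described above, here is a minimal Python sketch. The Datasource class, its field names, and the needs_crawl rule (no job already running and the refresh rate elapsed) are assumptions made for the example; they are not taken from the rs-dam code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Datasource:
    """Hypothetical view of a configured datasource (not the REGARDS entity)."""
    name: str
    refresh_rate_s: int                   # refresh rate configured on the plugin
    last_crawl: Optional[datetime] = None
    job_running: bool = False


def needs_crawl(ds: Datasource, now: datetime) -> bool:
    """Assumed 'if needed' rule: no job running and the refresh rate has elapsed."""
    if ds.job_running:
        return False
    if ds.last_crawl is None:
        return True
    return now - ds.last_crawl >= timedelta(seconds=ds.refresh_rate_s)


def schedule_pass(datasources: list[Datasource], now: datetime) -> list[str]:
    """One scheduler pass: returns the names of datasources that got a job."""
    created = []
    for ds in datasources:
        if needs_crawl(ds, now):
            ds.job_running = True  # stands in for submitting a CrawlOneDatasourceJob
            created.append(ds.name)
    return created
```

Because each datasource gets its own job, a slow datasource does not delay the others in this model.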
Crawling Job (CrawlOneDatasourceJob)
The main steps performed by each job are:
1. Retrieve new products from the catalog
Using an implementation of the IDataSourcePlugin interface, the system retrieves new products and transforms the datasource-specific product format into the REGARDS standard format. See more details below.
2. Insert or update products in the Elasticsearch index
New or updated products retrieved in step 1 are inserted or updated in the appropriate Elasticsearch index using upsert requests. The upsert operation ensures that a product is either created or updated in a single atomic operation, depending on its existence. See more details below.
Upsert requests are sent in parallel to Elasticsearch to improve performance. Each upsert request contains a configurable batch of products, and the number of parallel requests is also configurable to optimize throughput according to the cluster's capacity.
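The batching and parallelism described above can be sketched as follows. This is a minimal illustration, not the rs-dam implementation: it assumes the upserts are expressed through the Elasticsearch bulk API with update actions and doc_as_upsert, that the document id is the product ipId, and that a caller-supplied send function (hypothetical) performs the HTTP call.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator


def chunks(items: list, size: int) -> Iterator[list]:
    """Split a list into consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_upsert_body(index: str, products: list[dict]) -> str:
    """Build an NDJSON _bulk body: one action line and one document line per product."""
    lines = []
    for product in products:
        # Assumption: the ipId is used as the Elasticsearch document id.
        lines.append(json.dumps({"update": {"_index": index, "_id": product["ipId"]}}))
        # doc_as_upsert creates the document when it does not exist yet.
        lines.append(json.dumps({"doc": product, "doc_as_upsert": True}))
    return "\n".join(lines) + "\n"


def upsert_in_parallel(index: str, products: list[dict], batch_size: int,
                       parallelism: int, send: Callable[[str], object]) -> list:
    """Send one bulk body per batch, with at most `parallelism` requests in flight."""
    bodies = [bulk_upsert_body(index, batch) for batch in chunks(products, batch_size)]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(send, bodies))
```

Here batch_size and parallelism play the role of the two configurable values mentioned above; their actual property names are not shown in this section.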
Ensure that the Hikari connection pool size (regards.jpa.multitenant.maxPoolSize) is STRICTLY GREATER than the Elasticsearch threadpool size (regards.elasticsearch.threadpool.size) to avoid connection starvation. Otherwise, rs-dam may enter a deadlock state, with all threads waiting for a connection from the pool and no thread available to release one.
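The constraint above can be checked before deployment with a few lines of Python. This is only a sketch: the two property names come from the text above, while the check_pool_sizes helper and the idea of passing the properties as a dictionary are assumptions for the example.

```python
def check_pool_sizes(properties: dict[str, int]) -> None:
    """Raise ValueError unless the Hikari pool is strictly larger than the ES thread pool."""
    hikari = properties["regards.jpa.multitenant.maxPoolSize"]
    es_threads = properties["regards.elasticsearch.threadpool.size"]
    if hikari <= es_threads:
        raise ValueError(
            f"maxPoolSize ({hikari}) must be strictly greater than "
            f"the Elasticsearch threadpool size ({es_threads})"
        )
```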
3. Update access groups for products
The update of access groups is performed using a dedicated Painless script in Elasticsearch, allowing all necessary changes to be applied in a single update request on the Elasticsearch side. This is achieved by sending an 'update by query' request to Elasticsearch, specifying the script id. This single request allows Elasticsearch to apply the script to all documents matching the filter (typically all the data ingested in the previous step), thus efficiently updating access groups for a large set of products at once. See more details below.
If needed (depending on access rights configuration), access groups are updated for products using an implementation of the IDataObjectAccessFilterPlugin interface to apply any custom product filtering. This operation cannot be performed by the update group script and must be manually calculated by REGARDS, updating each document individually.
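The script-based part of this step can be pictured with the request bodies below. This is a hedged sketch: the script source, the script id, the group parameter, and the filter on dataSourceId are all hypothetical; only the general shape of a stored script and of an update by query body referencing a script id follows the Elasticsearch API.

```python
# Hypothetical stored script (body of PUT _scripts/<script id>):
# adds a group to a document when it is not already present.
STORED_SCRIPT = {
    "script": {
        "lang": "painless",
        "source": "if (!ctx._source.groups.contains(params.group)) "
                  "{ ctx._source.groups.add(params.group) }",
    }
}


def update_groups_request(script_id: str, datasource_id: int, group: str) -> dict:
    """Body for POST /<index>/_update_by_query that applies a stored script."""
    return {
        # Assumed filter: every document ingested from one datasource.
        "query": {"term": {"dataSourceId": datasource_id}},
        # The script is referenced by id, so only its parameters travel with the request.
        "script": {"id": script_id, "params": {"group": group}},
    }
```

A single request of this form lets Elasticsearch update every matching document, which is why the plugin-based filtering, which needs per-document logic on the REGARDS side, cannot use the same path.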
4. Compute calculated attributes
If the data model contains calculated attributes, these are computed using an implementation of the IComputedAttribute interface.
Retrieve new products from data sources
To manage different data sources, an extension point (see the implementation of the IDataSourcePlugin interface) is used to handle the specific requirements for loading products in the REGARDS format:
- Products: A DATA Entity, as defined by a model in REGARDS
- A set of products: A DATASET Entity, as defined by a model in REGARDS
The following types of crawlers are available:
- AIP Crawlers: These crawlers allow crawling of SIPs from the rs-ingest microservice. Incremental ingestion uses the last data update.
- Feature Crawlers: These crawlers allow crawling of features from the rs-fem microservice. Incremental ingestion uses the last data update.
- Database Crawlers: These crawlers allow crawling of data from an external database, with the following modes:
  - Non-incremental ingestion (not recommended)
  - Incremental ingestion based on the last data update
  - Incremental ingestion based on the data identifier

  The user selects the incremental ingestion mode during datasource creation.
- Web Source Crawlers: These crawlers allow crawling of data from an OpenSearch web source. Incremental ingestion is based on the last data update date.
The configuration of the extension point plugin can be used to define, as needed, the type of ingestion, the data source refresh rate (in seconds), and the overlap duration (in seconds) to prevent data loss.
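To show why an overlap duration helps prevent data loss, here is a small Python sketch of incremental selection by last update date. The function names and the product dictionaries are assumptions for the example; the idea is simply that each crawl starts slightly before the previous ingestion date, so late-arriving updates are picked up again and then upserted.

```python
from datetime import datetime, timedelta
from typing import Optional


def incremental_lower_bound(last_ingestion: Optional[datetime],
                            overlap_s: int) -> Optional[datetime]:
    """Date from which products are fetched; None means 'fetch everything'."""
    if last_ingestion is None:
        return None  # first crawl of this datasource
    return last_ingestion - timedelta(seconds=overlap_s)


def select_products(products: list[dict],
                    lower_bound: Optional[datetime]) -> list[dict]:
    """Keep products updated at or after the lower bound."""
    if lower_bound is None:
        return list(products)
    return [p for p in products if p["lastUpdate"] >= lower_bound]
```

Products inside the overlap window are fetched twice, which is harmless when ingestion uses upserts.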
Next, a mapping must be created between the datasource products and the REGARDS model data before indexing the products.
Configuration options are available for various connectors used with the crawler's external database (see UI). The PostgreSQL connector is available as: postgresql-db-connection (1.0-SNAPSHOT).
Insert or Update new products in meta catalog
Dataset and Data entities are stored in a dedicated Elasticsearch index for each project/tenant in the REGARDS application. There is only one index for each tenant.
Access rights calculation for dataset
Access rights are defined for each dataset and group of users as follows:
- Dataset and Data access
- Dataset access
- Full access to dataset, but partial access to Data (filtered by dynamic plugins)
- No access
Any change in access rights between a group of users and a dataset has an impact on the meta catalog stored in Elasticsearch. Access rights are indicated in each dataset and product.
Access rights are calculated when:
- There is a data modification (dataset update, addition or removal of a data object, ...)
- There is a user group modification
Dynamic plugins (see the extension point with the IDataObjectAccessFilterPlugins interface) are used to recalculate access
rights every day. Access rights are applied to the data filtered by the OpenSearch query.
The recalculation for dynamic plugins runs once a day by default, but its periodicity is configurable in the
microservice properties with the property regards.access.rights.update.cron. The value is in standard cron format.
Meta catalog reindexation
Reindexation allows rebuilding the Elasticsearch index of a tenant from its data sources, ensuring that the meta catalog remains consistent and aligned with the latest data models.
Each tenant/project in REGARDS has its own Elasticsearch index, referenced by an alias. During a reindexation, a new index is created, populated, and validated before it replaces the current one, ensuring that the REGARDS meta catalog remains available without interruption.
Process overview
When a reindexation is triggered, the process follows these main steps:
1. Create a new index
A new Elasticsearch index is created for the tenant, using the latest model mappings and settings.
The index name is automatically generated and associated with the tenant's alias entry (EsIndexAlias entity) as the
building index. An alias points to exactly one index.
The tenant’s alias follows the naming convention: [PROJECT_NAME]_alias. Each new building index created for this tenant follows the pattern [PROJECT_NAME]_XXXXX_X, where the final digit increments with every new rebuild.
Example:
- Project/tenant: projectA
- Alias name: projectA_alias
- Current index: projectA_3472e9_3
- Building index: projectA_3472e9_4
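The naming convention above can be expressed in a few lines of Python. This is a sketch based only on the pattern and the projectA example; the helper names are hypothetical, and it assumes that the middle token stays unchanged between rebuilds and that the trailing counter may grow beyond one digit.

```python
import re

# <anything>_<counter>, where the counter is the trailing run of digits.
INDEX_PATTERN = re.compile(r"^(?P<prefix>.+)_(?P<counter>\d+)$")


def alias_name(project: str) -> str:
    """Alias used by REGARDS to query the tenant's index."""
    return f"{project}_alias"


def next_building_index(current_index: str) -> str:
    """Name of the next building index, with the trailing counter incremented."""
    match = INDEX_PATTERN.match(current_index)
    if match is None:
        raise ValueError(f"unexpected index name: {current_index!r}")
    return f"{match['prefix']}_{int(match['counter']) + 1}"
```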
2. Populate the new index
All active data sources of the tenant are crawled using the same ingestion workflow as the standard CrawlingJob.
Products are retrieved, transformed into REGARDS entities (DATA, DATASET or COLLECTION), and inserted into the
building index.
3. Wait for all ingestion jobs to finish
Once ingestion starts, the system monitors the running jobs associated with the building index until all have completed twice. This ensures that the index is fully populated and consistent before activation.
4. Switch aliases
When the new index is ready, rs-dam performs an alias switch. The alias is updated to point to the new index; the
previous index becomes obsolete and is ready to be deleted.
This operation is transparent for end users: queries using the alias never experience downtime.
5. Clean up old indexes
Once the alias has been switched, the old index (previously referenced as current) is deleted to free resources. The old ingestions and the jobs they are associated with are removed too.
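The alias switch of step 4 can be pictured with the Elasticsearch aliases API, which applies a list of actions atomically. The sketch below only builds the request body; whether rs-dam issues exactly this request is an assumption, and the function name is hypothetical.

```python
def alias_switch_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases that moves an alias from the old index to the new one."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
```

Because both actions are applied in one request, a query that uses the alias always resolves to exactly one index, which matches the one-alias-one-index rule stated in step 1.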
Alias management and consistency
The mapping between tenants and their indexes is persisted in the REGARDS database through the entity EsIndexAlias:
| Name | Type | Description |
|---|---|---|
| alias | text | The logical name used by REGARDS to query Elasticsearch |
| current | text | The name of the currently active index for the tenant |
| building | text | The name of the index currently being populated during reindexation |
The alias switch updates the current field to the new index name and clears the building field once completed.
Caching mechanisms ensure quick access to alias information while keeping the database as the source of truth.
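The effect of the switch on the persisted record can be sketched as follows. The three fields mirror the EsIndexAlias table above; the Python class itself and its switch method are illustrative assumptions, not the actual entity code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EsIndexAlias:
    """Illustrative stand-in for the persisted alias record."""
    alias: str
    current: str
    building: Optional[str] = None

    def switch(self) -> str:
        """Promote the building index and return the now obsolete index name."""
        if self.building is None:
            raise ValueError("no reindexation in progress for this alias")
        obsolete = self.current
        self.current, self.building = self.building, None
        return obsolete
```

The returned name is the index that the cleanup step can then delete.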
Elasticsearch index representation
Data entities are never stored in the REGARDS database, only in Elasticsearch.
Dataset entities are stored in the REGARDS database with the following information:
- Creation date and update date
- Identifier of Uniform Resource Name type (example: URN:AIP:DATASET:validation:39c574a0-2ad6-4f47-9f4a-251d494892b1:V1)
- Model of the products in this dataset
- Identifier of the dataset model
- Identifier of the plugin used to load products from a data source
- Subsetting clause set on the Dataset for Elasticsearch
The following tables show the structure of the entities stored in the REGARDS Elasticsearch index.
Entity (product) for DATA type
| Name | Type | Description |
|---|---|---|
| type | text | Entity type: DATA |
| creationDate | Date (format: date_optional_time) | Creation date of entity |
| lastUpdate | Date (format: date_optional_time) | Update date of entity |
| dataSourceId | long | Data source identifier |
| datasetModelNames | text | List of dataset model names |
| groups | text | List of group names for access right |
| id | long | Entity technical identifier for database |
| internal | boolean | true if the DATA entity is internal (created from an AIP); false if it is external (created from an external database) |
| ipId | text | Identifier of Uniform Resource Name type (format: URN:StringId:DATA:tenant:UUID(entityId):version[,order][:revision]) |
| metadata | Object(see details below) | Information about a group access to a specific dataset for data objects |
| model | Object | Entity model |
| model.description | text | Model description |
| model.id | long | Model technical identifier for database |
| model.name | text | Model name (identical to the model property of feature) |
| model.type | text | Model type : DATA |
| newPoint | geo_point | Bounding box north west point |
| setPoint | geo_point | Bounding box south east point |
| openSearchSubsettingClause | text | Representation of the above subsetting clause as an OpenSearch string request |
| tags | text | List of tags (including related datasets) |
| wgs84 | geo_shape | Geometry projection on WGS84 crs |
| feature | Object(see details below) | Raw entity feature |
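The ipId format shown in the table above can be split into its parts with a short parser. This is a sketch derived only from the documented format URN:StringId:DATA:tenant:UUID(entityId):version[,order][:revision]; the Urn tuple and parse_urn function are hypothetical, the version is kept as raw text, and the tenant is assumed to contain no colon.

```python
import uuid
from typing import NamedTuple, Optional


class Urn(NamedTuple):
    identifier: str          # the "StringId" part, for example AIP
    entity_type: str         # DATA, DATASET, ...
    tenant: str
    entity_id: uuid.UUID
    version: str             # raw text, for example "V1" or "LAST"
    order: Optional[str]
    revision: Optional[str]


def parse_urn(urn: str) -> Urn:
    """Split a REGARDS-style URN into its documented components."""
    parts = urn.split(":")
    if len(parts) not in (6, 7) or parts[0] != "URN":
        raise ValueError(f"not a REGARDS-style URN: {urn!r}")
    # The optional order follows the version after a comma.
    version, _, order = parts[5].partition(",")
    revision = parts[6] if len(parts) == 7 else None
    return Urn(parts[1], parts[2], parts[3], uuid.UUID(parts[4]),
               version, order or None, revision)
```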
Metadata for DATA type of entity
| Name | Type | Description |
|---|---|---|
| groups | Map | Map of group names with access right for dataset |
| groups.<name>.dataset | text | Identifier of Uniform Resource Name type for dataset |
| groups.<name>.dataAccessRight | boolean | true if access right for the dataset; otherwise false |
| modelNames | Map | Map of model names with dataset URN |
| modelNames.<name>.<URN> | text | Identifier of Uniform Resource Name type for dataset |
Feature for DATA type of entity
| Name | Type | Description |
|---|---|---|
| sessionOwner | text | Session owner |
| Session | text | Session name |
| virtualId | text | Virtual identifier of URN type in order to indicate if this is the last version (format: URN:StringId:DATA:tenant:UUID(entityId):LAST) |
| providerId | text | Provider identifier |
| entityType | text | Entity type : DATA |
| label | text | Entity label (sometimes identical to the providerId property) |
| model | text | Model name of entity (identical to the name property of model) |
| files | Map<DataType, DataFile> | Product-related entity files (example: thumbnail, quicklook, rawdata...) |
| DataType | text | Enum of data type (RAWDATA, QUICKLOOK_SD, QUICKLOOK_MD, QUICKLOOK_HD, DOCUMENT, THUMBNAIL...) |
| DataFile | Object | Data for file |
| DataFile.dataType | text | Enum of data type (RAWDATA, QUICKLOOK_SD, QUICKLOOK_MD, QUICKLOOK_HD, DOCUMENT, THUMBNAIL...) |
| DataFile.reference | boolean | false if the file is physically stored in REGARDS; true if REGARDS does not store the file and only holds a reference to it. |
| DataFile.uri | text | Uniform Resource identifier of file in order to download. This URI is created by REGARDS. |
| DataFile.mimeType | text | Mime type of file |
| DataFile.online | boolean | true if the file is online; otherwise the file is nearline in the storage service |
| DataFile.checksum | text | Checksum of file |
| DataFile.digestAlgorithm | text | Algorithm for checksum of file |
| DataFile.filesize | double | Size of file |
| DataFile.filename | text | File name |
| DataFile.types | array | Custom data file types |
| tags | text | List of tags (including the dataset identifier) |
| last | boolean | true if this is the last version; otherwise false |
| version | text | Entity version |
| id | text | Identifier of Uniform Resource Name type (identical to the ipId property) |
| geometry | Object | Information package geometry in GeoJSON RFC 7946 Format |
| geometry.coordinates | double | Geometry coordinates |
| geometry.type | text | Geometry type (Point, MultiPoint, LineString, Polygon, MultiPolygon...) |
| geometry.bbox | array | Geometry bounding box. List of points coordinates [xmin, ymin, xmax, ymax] in Double type. |
| geometry.crs | text | Coordinate reference system. If not specified, WGS84 is considered as the default CRS |
| normalizedGeometry | Object | Geometry normalized to be used on a cylindrical projection |
| normalizedGeometry.coordinates | double | Normalized geometry coordinates |
| normalizedGeometry.type | text | Normalized geometry type (Point, MultiPoint, LineString, Polygon, MultiPolygon...) |
| normalizedGeometry.bbox | array | Geometry bounding box. List of points coordinates [xmin, ymin, xmax, ymax] in Double type. |
| normalizedGeometry.crs | text | Coordinate reference system. If not specified, WGS84 is considered as the default CRS |
| type | text | Feature |
| crs | text | Coordinate Reference System (default value: WGS84) |
| properties | Object | DATA model attributes |
Entity for DATASET type
| Name | Type | Description |
|---|---|---|
| type | text | Entity type: DATASET |
| creationDate | Date (format: date_optional_time) | Creation date of entity |
| lastUpdate | Date (format: date_optional_time) | Update date of entity |
| dataModel | text | Model of Data type for entities included in this dataset |
| dataSourceId | long | Data source identifier |
| groups | text | List of group names for access right |
| id | long | Entity technical identifier for database |
| internal | boolean | true if the entity is internal (created from an AIP); false if it is external (created from an external database) |
| ipId | text | Identifier of Uniform Resource Name type (format: URN:StringId:DATASET:tenant:UUID(entityId):version[,order][:revision]) |
| metadata | Object(see details below) | Information about a group access to a specific dataset for data objects |
| model | Object | Entity model |
| model.description | text | Model description |
| model.id | long | Model technical identifier for database |
| model.name | text | Model name (identical to the model property of feature) |
| model.type | text | Model type : DATASET |
| newPoint | geo_point | Bounding box north west point |
| setPoint | geo_point | Bounding box south east point |
| openSearchSubsettingClause | text | Representation of the above subsetting clause as an OpenSearch string request |
| plgConfDataSource | Object | Plugin configuration for the extension point (IDataSourcePlugin interface) |
| plgConfDataSource.active | boolean | Whether the plugin is active |
| plgConfDataSource.businessId | text | Plugin business identifier |
| plgConfDataSource.label | text | Plugin label |
| plgConfDataSource.parameters | nested | Configuration parameters of the plugin |
| plgConfDataSource.pluginId | text | Plugin identifier |
| plgConfDataSource.priorityOrder | long | Priority order of the plugin. |
| plgConfDataSource.version | text | Plugin version |
| tags | text | List of tags |
| wgs84 | geo_shape | Geometry projection on WGS84 crs |
| feature | Object(see details below) | Raw entity feature |
Metadata for DATASET type of entity
| Name | Type | Description |
|---|---|---|
| dataObjectsGroups | Map | Map of group names with access right for dataset |
| dataObjectsGroups.<name>.groupName | text | Group name |
| dataObjectsGroups.<name>.dataFileAccess | boolean | true if access right for files of product; otherwise false |
| dataObjectsGroups.<name>.dataObjectAccess | boolean | true if access right for objects of products; otherwise false |
| dataObjectsGroups.<name>.dataAccess | boolean | true if access right for data of products; otherwise false |
| dataObjectsGroups.<name>.metaDataObjectAccessFilterPluginBusinessId | String | Plugin identifier for the extension point: IDataObjectAccessFilterPlugins |
Feature for DATASET type of entities
| Name | Type | Description |
|---|---|---|
| dataObjectsFilesAccessGranted | boolean | true if access to data object files is granted; false if it is denied |
| dataObjectsAccessGranted | boolean | true if access to data objects is granted; false if it is denied |
| licence | text | Licence for dataset |
| virtualId | text | Virtual identifier of URN type in order to indicate if this is the last version (format: URN:StringId:DATASET:tenant:UUID(entityId):LAST) |
| providerId | text | Provider identifier |
| entityType | text | Entity type : DATASET |
| id | text | Identifier of Uniform Resource Name type (format: URN:StringId:DATASET:tenant:UUID(entityId):version[,order][:revision]) |
| label | text | Label of dataset |
| model | text | Model name of entity (identical to the name property of model) |
| files | Map<DataType, DataFile> | Dataset-related entity files (example: thumbnail, quicklook, rawdata...) |
| DataType | text | Enum of data type (RAWDATA, QUICKLOOK_SD, QUICKLOOK_MD, QUICKLOOK_HD, DOCUMENT, THUMBNAIL...) |
| DataFile | Object | Data for file |
| DataFile.dataType | text | Enum of data type (RAWDATA, QUICKLOOK_SD, QUICKLOOK_MD, QUICKLOOK_HD, DOCUMENT, THUMBNAIL...) |
| DataFile.reference | boolean | false if the file is physically stored in REGARDS; true if REGARDS does not store the file and only holds a reference to it. |
| DataFile.uri | text | Uniform Resource identifier of file in order to download. This URI is created by REGARDS. |
| DataFile.mimeType | text | Mime type of file |
| DataFile.online | boolean | true if the file is online; otherwise the file is nearline in the storage service |
| DataFile.checksum | text | Checksum of file |
| DataFile.digestAlgorithm | text | Algorithm for checksum of file |
| DataFile.filesize | double | Size of file |
| DataFile.filename | text | File name |
| DataFile.types | array | Custom data file types |
| tags | text | List of tags |
| version | integer | Entity version |
| type | text | Feature |
| properties | Object | DATASET model attributes |