Version: 2.3.1

How it works

Introduction

The main purpose of the Data Manager (rs-dam) microservice is to populate the REGARDS meta catalog. The microservice is composed of several modules: the crawler module, the dam module (main core), the indexer module, the model module and the opensearch module.

To do so, this microservice uses:

Data sources to retrieve products from several catalogs. Data sources are configured through a UI or an XML file (AIP, GeoJSON, external database; see REGARDS UI).
Data models to transform and standardize crawled products before adding them into the meta catalog, represented by the Elasticsearch index.
Data access rights to calculate the access rights of each product in the meta catalog. Access rights are the permissions granted to a group of users for accessing a set of products that constitute a dataset (see REGARDS UI).

This microservice is required to expose products managed by the OAIS Products Manager (rs-ingest microservice), the GeoJson Products Manager (rs-fem microservice), or products accessible from an external database or a web service.

Each Elasticsearch index stores products for each project or tenant created in the REGARDS application.

Main points of data modification in Elasticsearch

The data stored in Elasticsearch consist of Datasets and DataObjects. DataObjects can originate from various sources depending on the selected plugin (OAIS, FEM, external database, etc.).

In REGARDS, there are several ways to add, modify, or delete objects stored in Elasticsearch:

Through the CrawlingJob (data crawler/harvester), which will be detailed in the following sections.
When a Dataset is updated OR when access rights to a Dataset are modified, a DatasetEvent is triggered and processed by rs-dam to update the Dataset stored in Elasticsearch, as well as all DataObjects linked to this Dataset.
When a product is deleted from the OAIS or FEM catalogs, a FeatureEvent is generated and processed by rs-dam, which immediately requests the deletion of the DataObject from Elasticsearch.

Meta catalog population

A scheduler is launched at a configurable interval to iterate through each configured datasource and create a CrawlOneDatasourceJob job if needed. This allows independent and parallel management of data ingestion for each datasource. The scheduling frequency can be adjusted by the user (see the static configuration).
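The scheduler pass described above can be sketched as follows. This is only an illustration: the `Datasource` fields and the rule used for "if needed" (an active datasource with no running job whose refresh period has elapsed) are assumptions, not the actual rs-dam implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Datasource:
    # Hypothetical fields, chosen for illustration only.
    id: int
    active: bool
    refresh_rate_s: int
    last_ingest: Optional[datetime] = None
    job_running: bool = False


def datasources_to_crawl(datasources, now):
    """Return the ids of the datasources that need a new crawl job."""
    due = []
    for ds in datasources:
        if not ds.active or ds.job_running:
            continue  # skip inactive sources and sources already being crawled
        never_crawled = ds.last_ingest is None
        if never_crawled or now - ds.last_ingest >= timedelta(seconds=ds.refresh_rate_s):
            due.append(ds.id)
    return due
```

Because each datasource is evaluated independently, one job per due datasource can then be submitted and run in parallel with the others.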

Crawling Job (CrawlOneDatasourceJob)

The main steps performed by each job are:

1. Retrieve new products from the catalog

Using an implementation of the IDataSourcePlugin interface, the system retrieves new products and transforms the datasource-specific product format into the REGARDS standard format. See more details below.

2. Insert or update products in the Elasticsearch index

New or updated products retrieved in step 1 are inserted or updated in the appropriate Elasticsearch index using upsert requests. The upsert operation ensures that a product is either created or updated in a single atomic operation, depending on its existence. See more details below.

note

Upsert requests are sent in parallel to Elasticsearch to improve performance. Each upsert request contains a configurable batch of products, and the number of parallel requests is also configurable to optimize throughput according to the cluster's capacity.
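A minimal sketch of batched, parallel upserts is shown below. It builds bodies for the Elasticsearch `_bulk` API using `update` actions with `doc_as_upsert`. The `send` callable, the use of `ipId` as document id, and the function names are assumptions made for the example; the real rs-dam client code may differ.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def chunks(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def bulk_upsert_body(index, products):
    """Build the NDJSON body of a _bulk request made only of upserts."""
    lines = []
    for product in products:
        # Assumption: the product URN (ipId) is used as the document id.
        lines.append(json.dumps({"update": {"_index": index, "_id": product["ipId"]}}))
        lines.append(json.dumps({"doc": product, "doc_as_upsert": True}))
    return "\n".join(lines) + "\n"


def upsert_in_parallel(index, products, batch_size, parallelism, send):
    """Send one bulk request per batch, with at most `parallelism` in flight."""
    bodies = [bulk_upsert_body(index, batch) for batch in chunks(products, batch_size)]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # `send` is a hypothetical callable that posts one body to Elasticsearch.
        return list(pool.map(send, bodies))
```

Here `batch_size` and `parallelism` play the role of the two configurable values mentioned in the note: larger batches reduce the number of requests, while more parallel requests increase the load on the cluster.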

warning

Ensure that the Hikari connection pool size (regards.jpa.multitenant.maxPoolSize) is STRICTLY GREATER than the Elasticsearch threadpool size (regards.elasticsearch.threadpool.size) to avoid connection starvation. Otherwise, rs-dam may enter a deadlock state, with all threads waiting for a connection from the pool and no thread available to release one.
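The constraint above can be checked before deployment with a few lines of code. The sketch below assumes the two properties have already been loaded into a dictionary; the function name is hypothetical and the check is not part of rs-dam itself.

```python
def check_pool_sizes(properties):
    """Return the spare JPA connections, or raise if the constraint is violated."""
    jpa_pool = int(properties["regards.jpa.multitenant.maxPoolSize"])
    es_threads = int(properties["regards.elasticsearch.threadpool.size"])
    if jpa_pool <= es_threads:
        raise ValueError(
            "regards.jpa.multitenant.maxPoolSize (%d) must be strictly greater than "
            "regards.elasticsearch.threadpool.size (%d)" % (jpa_pool, es_threads)
        )
    return jpa_pool - es_threads
```

For example, a pool of 20 connections with 10 Elasticsearch threads passes the check, while 10 and 10 is rejected because the comparison is strict.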

3. Update access groups for products

The update of access groups is performed by a dedicated Painless script in Elasticsearch, so that all necessary changes are applied in a single update request on the Elasticsearch side. rs-dam sends an 'update by query' request to Elasticsearch, specifying the script id. This single request lets Elasticsearch apply the script to all documents matching the filter (typically all the data ingested in the previous step), thus efficiently updating access groups for a large set of products at once. See more details below.
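As an illustration, the body of such an 'update by query' request could look like the one built below. The stored-script reference (`script.id` with `params`) is the standard Elasticsearch syntax; the script id, the parameter name and the filter on `dataSourceId` and `lastUpdate` are assumptions made for the example, not the actual request sent by rs-dam.

```python
def update_groups_request(script_id, datasource_id, ingestion_date, groups):
    """Build the body of an _update_by_query request that runs a stored script."""
    return {
        # Reference to a stored script; the id and params are hypothetical.
        "script": {"id": script_id, "params": {"groups": sorted(groups)}},
        # Assumed filter: documents of one datasource touched since the crawl began.
        "query": {
            "bool": {
                "filter": [
                    {"term": {"dataSourceId": datasource_id}},
                    {"range": {"lastUpdate": {"gte": ingestion_date}}},
                ]
            }
        },
    }
```

A single request of this shape replaces one update per document: Elasticsearch evaluates the query and applies the script to every match on its own side.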

note

If needed (depending on access rights configuration), access groups are updated for products using an implementation of the IDataObjectAccessFilterPlugin interface to apply any custom product filtering. This operation cannot be performed by the update group script and must be manually calculated by REGARDS, updating each document individually.

4. Compute calculated attributes

If the data model contains calculated attributes, these are computed using an implementation of the IComputedAttribute interface.

Retrieve new products from data sources

To manage different data sources, an extension point (see the implementation of the IDataSourcePlugin interface) is used to handle the specific requirements for loading products in the REGARDS format:

Products: A DATA Entity, as defined by a model in REGARDS
A set of products: A DATASET Entity, as defined by a model in REGARDS

The following types of crawlers are available:

AIP Crawlers: These crawlers allow crawling of SIPs from the rs-ingest microservice. Incremental ingestion uses the last data update.
Feature Crawlers: These crawlers allow crawling of features from the rs-fem microservice. Incremental ingestion uses the last data update.
Database Crawlers: These crawlers allow crawling of data from an external database, with the following modes:
- Non-incremental ingestion (not recommended)
- Incremental ingestion based on the last data update
- Incremental ingestion based on the data identifier

The user selects the incremental ingestion mode during datasource creation.

Web Source Crawlers: These crawlers allow crawling of data from an OpenSearch web source. Incremental ingestion is based on the last data update date.

The configuration of the extension point plugin can be used to define, as needed, the type of ingestion, the data source refresh rate (in seconds), and the overlap duration (in seconds) to prevent data loss.
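The role of the overlap duration can be illustrated with a short sketch. It assumes that an incremental crawl requests every product updated since the previous ingestion date minus the overlap; the function name and the exact rule are assumptions, not the plugin's actual code.

```python
from datetime import datetime, timedelta, timezone


def crawl_lower_bound(last_ingest, overlap_s):
    """Return the update date from which products are requested again.

    None means that no previous ingestion exists, so everything is crawled.
    """
    if last_ingest is None:
        return None
    # Going back by the overlap re-reads products committed late by the source.
    return last_ingest - timedelta(seconds=overlap_s)
```

With a previous ingestion at 10:00:00 and an overlap of 120 seconds, the next crawl asks for products updated since 09:58:00. Products read twice are harmless because they are written with upserts.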

Next, a mapping must be created between the datasource products and the REGARDS model data before indexing the products.

info

Configuration options are available for various connectors used with the crawler's external database (see UI). The PostgreSQL connector is available as: postgresql-db-connection (1.0-SNAPSHOT).

Insert or Update new products in meta catalog

Dataset and Data entities are stored in a dedicated Elasticsearch index for each project/tenant of the REGARDS application. There is only one index per tenant.

Access rights calculation for dataset

note

Access rights are defined for each dataset and group of users as follows:

Dataset and Data access
Dataset access
Full access to dataset, but partial access to Data (filtered by dynamic plugins)
No access

Any change in access rights between a group of users and a dataset has an impact on the meta catalog stored in Elasticsearch. Access rights are recorded in each dataset and in each product.

Access rights are calculated when:

There is a data modification (dataset update, addition or removal of a data object, ...)
There is a user group modification

Dynamic plugins (see the extension point with the IDataObjectAccessFilterPlugins interface) are used to recalculate access rights periodically. Access rights are applied to the data filtered by the OpenSearch query. By default the recalculation runs once a day; the period is configurable in the microservice properties through the regards.access.rights.update.cron property, whose value uses the standard cron format.

Meta catalog reindexation

Reindexation allows rebuilding the Elasticsearch index of a tenant from its data sources, ensuring that the meta catalog remains consistent and aligned with the latest data models.

Each tenant/project in REGARDS has its own Elasticsearch index, referenced by an alias. During a reindexation, a new index is created, populated, and validated before it replaces the current one, so that the REGARDS meta catalog remains available without interruption.

Process overview

When a reindexation is triggered, the process follows these main steps:

1. Create a new index

A new Elasticsearch index is created for the tenant, using the latest model mappings and settings. The index name is automatically generated and associated with the tenant’s alias entry (EsIndexAlias entity) as the building index. An alias points to exactly one index.

The tenant’s alias follows the naming convention: [PROJECT_NAME]_alias. Each new building index created for this tenant follows the pattern [PROJECT_NAME]_XXXXX_X, where the final digit increments with every new rebuild.

Example:

Project/tenant: projectA
Alias name: projectA_alias
Current index: projectA_3472e9_3
Building index: projectA_3472e9_4
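The naming convention can be expressed as a small helper. This is a sketch under assumptions: it treats the trailing part after the last underscore as a numeric counter and keeps the rest of the name unchanged, as in the example above. The function names are hypothetical.

```python
import re

# Everything up to the last underscore is kept; the trailing number is the counter.
INDEX_NAME = re.compile(r"^(?P<prefix>.+)_(?P<counter>\d+)$")


def alias_name(project):
    """Return the alias name of a project/tenant."""
    return f"{project}_alias"


def next_building_index(current_index):
    """Derive the name of the next building index from the current index name."""
    match = INDEX_NAME.match(current_index)
    if match is None:
        raise ValueError(f"unexpected index name: {current_index}")
    return f"{match['prefix']}_{int(match['counter']) + 1}"
```

Applied to the example, `next_building_index("projectA_3472e9_3")` gives `projectA_3472e9_4` and `alias_name("projectA")` gives `projectA_alias`.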

2. Populate the new index

All active data sources of the tenant are crawled using the same ingestion workflow as the standard CrawlingJob. Products are retrieved, transformed into REGARDS entities (DATA, DATASET or COLLECTION), and inserted into the building index.

3. Wait for all ingestion jobs to finish

Once ingestion starts, the system monitors the running jobs associated with the building index until all have completed twice. This ensures that the index is fully populated and consistent before activation.

4. Switch aliases

When the new index is ready, rs-dam performs an alias switch. The alias is updated to point to the new index; the previous index becomes obsolete and is ready to be deleted. This operation is transparent for end users: queries using the alias never experience downtime.

5. Clean up old indexes

Once the alias has been switched, the old index (previously referenced as current) is deleted to free resources. The old ingestions and the jobs they are associated with are removed too.

Alias management and consistency

The mapping between tenants and their indexes is persisted in the REGARDS database through the entity EsIndexAlias:

Name	Type	Description
alias	text	The logical name used by REGARDS to query Elasticsearch
current	text	The name of the currently active index for the tenant
building	text	The name of the index currently being populated during reindexation

The alias switch updates the current field to the new index name and clears the building field once completed.
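The switch can be pictured with the sketch below. The `EsIndexAlias` dataclass mirrors the three fields of the table above, and the `actions` body uses the standard Elasticsearch aliases API, in which a `remove` and an `add` listed in the same request are applied atomically. Whether rs-dam issues exactly this request is an assumption, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EsIndexAlias:
    alias: str
    current: str
    building: Optional[str] = None


def alias_switch_actions(entry):
    """Build the body of an _aliases request moving the alias to the building index."""
    if entry.building is None:
        raise ValueError("no building index to switch to")
    return {
        "actions": [
            {"remove": {"index": entry.current, "alias": entry.alias}},
            {"add": {"index": entry.building, "alias": entry.alias}},
        ]
    }


def complete_switch(entry):
    """Return the updated entry and the name of the now obsolete index."""
    obsolete = entry.current
    updated = EsIndexAlias(alias=entry.alias, current=entry.building, building=None)
    return updated, obsolete
```

After the switch, `current` holds the former building index and `building` is empty, which matches the state described above; the obsolete index name is what the clean-up step would delete.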

Caching mechanisms ensure quick access to alias information while keeping the database as the source of truth.

Elasticsearch index representation

Data entities are never stored in the REGARDS database, only in Elasticsearch.
Dataset entities are stored in the REGARDS database with the following information:

creation date and update date,
Identifier of the Uniform Resource Name (example: URN:AIP:DATASET:validation:39c574a0-2ad6-4f47-9f4a-251d494892b1:V1)
model of the products in this dataset
Identifier of the dataset model
Identifier of the plugin used to load products from a data source
subsetting criterion set on the dataset for Elasticsearch

The following tables show the structure of the entities stored in the Elasticsearch index of REGARDS.

Entity(product) for DATA type

Name	Type	Description
type	text	Entity type: DATA
creationDate	Date (format: date_optional_time)	Creation date of entity
lastUpdate	Date (format: date_optional_time)	Update date of entity
dataSourceId	long	Data source identifier
datasetModelNames	text	List of dataset model names
groups	text	List of group names for access right
id	long	Entity technical identifier for database
internal	boolean	true if the DATA entity is internal (created from an AIP); false if it is external (created from an external database)
ipId	text	Identifier of Uniform Resource Name type (format: `URN:StringId:DATA:tenant:UUID(entityId):version[,order][:revision]`)
metadata	Object(see details below)	Information about a group access to a specific dataset for data objects
model	Object	Entity model
model.description	text	Model description
model.id	long	Model technical identifier for database
model.name	text	Model name (identical with model property of feature)
model.type	text	Model type : DATA
nwPoint	geo_point	Bounding box north west point
sePoint	geo_point	Bounding box south east point
openSearchSubsettingClause	text	Representation of the above subsetting clause as an OpenSearch string request
tags	text	List of tags (related dataset identifiers)
wgs84	geo_shape	Geometry projection on WGS84 crs
feature	Object(see details below)	Raw entity feature
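The ipId format given in the table can be decomposed with a regular expression. The sketch below is an assumption-laden illustration: it supposes that the version is written `V<number>`, as in the dataset URN example of this page, that the optional order is numeric, and that the entity identifier is a 36-character UUID. The sample URN used in the note after the code is hypothetical.

```python
import re

URN_PATTERN = re.compile(
    r"URN:(?P<identifier>[^:]+):(?P<entity_type>[^:]+):(?P<tenant>[^:]+)"
    r":(?P<entity_id>[0-9a-fA-F-]{36})"
    r":V(?P<version>\d+)(?:,(?P<order>\d+))?(?::(?P<revision>[^:]+))?"
)


def parse_urn(urn):
    """Split a REGARDS URN into its components, raising on malformed input."""
    match = URN_PATTERN.fullmatch(urn)
    if match is None:
        raise ValueError(f"not a valid URN: {urn}")
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    if parts["order"] is not None:
        parts["order"] = int(parts["order"])
    return parts
```

For a hypothetical identifier such as `URN:AIP:DATA:validation:39c574a0-2ad6-4f47-9f4a-251d494892b1:V1`, the function returns the entity type `DATA`, the tenant `validation` and the version `1`, with no order and no revision.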

Metadata for DATA type of entity

Name	Type	Description
groups	Map	Map of group names with access right for dataset
groups.<name>.dataset	text	Identifier of Uniform Resource Name type for dataset
groups.<name>.dataAccessRight	boolean	true if access right for the dataset; otherwise false
modelNames	Map	Map of model names with dataset URN
modelNames.<name>.<URN>	text	Identifier of Uniform Resource Name type for dataset

Feature for DATA type of entity

Name	Type	Description
sessionOwner	text	Session owner
Session	text	Session name
virtualId	text	Virtual identifier of URN type in order to indicate if this is the last version (format: `URN:StringId:DATA:tenant:UUID(entityId):LAST`)
providerId	text	Provider identifier
entityType	text	Entity type : DATA
label	text	Entity label (sometimes identical provider identifier property)
model	text	Model name of entity (identical with name property of model)
files	Map<DataType, DataFile>	Product-related entity files (example: thumbnail, quicklook, rawdata...)
DataType	text	Enum of data type (RAWDATA, QUICKLOOK_SD, QUICKLOOK_MD, QUICKLOOK_HD, DOCUMENT, THUMBNAIL...)
DataFile	Object	Data for file
DataFile.dataType	text	Enum of data type (RAWDATA, QUICKLOOK_SD, QUICKLOOK_MD, QUICKLOOK_HD, DOCUMENT, THUMBNAIL...)
DataFile.reference	boolean	false if the file is physically stored in REGARDS; true if REGARDS only references the file without storing it
DataFile.uri	text	Uniform Resource identifier of file in order to download. This URI is created by REGARDS.
DataFile.mimeType	text	Mime type of file
DataFile.online	boolean	true if the file is online; false if it is nearline in the storage service
DataFile.checksum	text	Checksum of file
DataFile.digestAlgorithm	text	Algorithm for checksum of file
DataFile.filesize	double	Size of file
DataFile.filename	text	File name
DataFile.types	array	Custom data file types
tags	text	List of tags (related dataset identifiers)
last	boolean	true if this is the last version; otherwise false
version	text	Entity version
id	text	Identifier of Uniform Resource Name type (identical with IpId property)
geometry	Object	Information package geometry in GeoJSON RFC 7946 Format
geometry.coordinates	double	Geometry coordinates
geometry.type	text	Geometry type (Point, MultiPoint, LineString, Polygon, MultiPolygon...)
geometry.bbox	array	Geometry bounding box. List of points coordinates [xmin, ymin, xmax, ymax] in Double type.
geometry.crs	text	Coordinate reference system. If not specified, WGS84 is considered as the default CRS
normalizedGeometry	Object	Geometry normalized to be used on a cylindrical projection
normalizedGeometry.coordinates	double	Normalized geometry coordinates
normalizedGeometry.type	text	Normalized geometry type (Point, MultiPoint, LineString, Polygon, MultiPolygon...)
normalizedGeometry.bbox	array	Geometry bounding box. List of points coordinates [xmin, ymin, xmax, ymax] in Double type.
normalizedGeometry.crs	text	Coordinate reference system. If not specified, WGS84 is considered as the default CRS
type	text	Feature
crs	text	Coordinate Reference System (default value: WGS84)
properties	Object	DATA model attributes

Entity for DATASET type

Name	Type	Description
type	text	Entity type: DATASET
creationDate	Date (format: date_optional_time)	Creation date of entity
lastUpdate	Date (format: date_optional_time)	Update date of entity
dataModel	text	Model of Data type for entities included in this dataset
dataSourceId	long	Data source identifier
groups	text	List of group names for access right
id	long	Entity technical identifier for database
internal	boolean	true if the entity is internal (created from an AIP); false if it is external (created from an external database)
ipId	text	Identifier of Uniform Resource Name type (format: `URN:StringId:DATASET:tenant:UUID(entityId):version[,order][:revision]`)
metadata	Object(see details below)	Information about a group access to a specific dataset for data objects
model	Object	Entity model
model.description	text	Model description
model.id	long	Model technical identifier for database
model.name	text	Model name (identical with model property of feature)
model.type	text	Model type : DATASET
openSearchSubsettingClause	text	Representation of the above subsetting clause as an OpenSearch string request
plgConfDataSource	Object	Plugin configuration for the extension point (IDataSourcePlugin interface)
plgConfDataSource.active	boolean	Active or not the plugin
plgConfDataSource.businessId	text	Plugin business identifier
plgConfDataSource.label	text	Plugin label
plgConfDataSource.parameters	nested	Configuration parameters of the plugin
plgConfDataSource.pluginId	text	Plugin identifier
plgConfDataSource.priorityOrder	long	Priority order of the plugin.
plgConfDataSource.version	text	Plugin version
tags	text	List of tags
wgs84	geo_shape	Geometry projection on WGS84 crs
feature	Object(see details below)	Raw entity feature

Metadata for DATASET type of entity

Name	Type	Description
dataObjectsGroups	Map	Map of group names with access right for dataset
dataObjectsGroups.<name>.groupName	text	Group name
dataObjectsGroups.<name>.dataFileAccess	boolean	true if access right for files of product; otherwise false
dataObjectsGroups.<name>.dataObjectAccess	boolean	true if access right for objects of products; otherwise false
dataObjectsGroups.<name>.dataAccess	boolean	true if access right for data of products; otherwise false
dataObjectsGroups.<name>.metaDataObjectAccessFilterPluginBusinessId	String	Plugin identifier for the extension point : IDataObjectAccessFilterPlugins

Feature for DATASET type of entities

Name	Type	Description
dataObjectsFilesAccessGranted	boolean	true if access to data object files is granted; false if it is denied
dataObjectsAccessGranted	boolean	true if access to data objects is granted; false if it is denied
licence	text	Licence for dataset
virtualId	text	Virtual identifier of URN type in order to indicate if this is the last version (format: `URN:StringId:DATASET:tenant:UUID(entityId):LAST`)
providerId	text	Provider identifier
entityType	text	Entity type : DATASET
id	text	Identifier of Uniform Resource Name type (format: `URN:StringId:DATASET:tenant:UUID(entityId):version[,order][:revision]`)
label	text	Label of dataset
model	text	Model name of entity (identical with name property of model)
files	Map<DataType, DataFile>	Dataset-related entity files (example: thumbnail, quicklook, rawdata...)
DataType	text	Enum of data type (RAWDATA, QUICKLOOK_SD, QUICKLOOK_MD, QUICKLOOK_HD, DOCUMENT, THUMBNAIL...)
DataFile	Object	Data for file
DataFile.dataType	text	Enum of data type (RAWDATA, QUICKLOOK_SD, QUICKLOOK_MD, QUICKLOOK_HD, DOCUMENT, THUMBNAIL...)
DataFile.reference	boolean	false if the file is physically stored in REGARDS; true if REGARDS only references the file without storing it
DataFile.uri	text	Uniform Resource identifier of file in order to download. This URI is created by REGARDS.
DataFile.mimeType	text	Mime type of file
DataFile.online	boolean	true if the file is online; false if it is nearline in the storage service
DataFile.checksum	text	Checksum of file
DataFile.digestAlgorithm	text	Algorithm for checksum of file
DataFile.filesize	double	Size of file
DataFile.filename	text	File name
DataFile.types	array	Custom data file types
tags	text	List of tags
version	integer	Entity version
type	text	Feature
properties	Object	DATA model attributes
