Version: 2.3.0

How it works

Introduction

The main purpose of the Data Manager microservice (rs-dam) is to populate the REGARDS meta catalog. The microservice is composed of several modules: the crawler module, the dam module (main core), the indexer module, the model module and the opensearch module.

To do so, this microservice uses :

  • Data sources to retrieve products from several catalogs. Data sources are configured through a UI or an XML file (AIP, GeoJSON, external database: see REGARDS UI).
  • Data models to transform and standardize crawled products before adding them to the meta catalog, represented by the Elasticsearch index.
  • Data access rights to calculate the access rights of each product in the meta catalog. Access rights are the permissions granted to a group of users for accessing the set of products that constitutes a dataset (see REGARDS UI).

This microservice is required to expose products managed by the OAIS Products Manager (rs-ingest microservice), the GeoJson Products Manager (rs-fem microservice), or products accessible from an external database or a web service.

Each Elasticsearch index stores products for each project or tenant created in the REGARDS application.

Main points of data modification in Elasticsearch

The data stored in Elasticsearch consist of Datasets and DataObjects. DataObjects can originate from various sources depending on the selected plugin (OAIS, FEM, external database, etc.).

In REGARDS, there are several ways to add, modify, or delete objects stored in Elasticsearch:

  1. Through the CrawlingJob (data crawler/harvester), which will be detailed in the following sections.
  2. When a Dataset is updated OR when access rights to a Dataset are modified, a DatasetEvent is triggered and processed by rs-dam to update the Dataset stored in Elasticsearch, as well as all DataObjects linked to this Dataset.
  3. When a product is deleted from the OAIS or FEM catalogs, a FeatureEvent is generated and processed by rs-dam, which immediately requests the deletion of the DataObject from Elasticsearch.

Meta catalog population

A scheduler runs at a configurable interval, iterates through each configured datasource and creates a CrawlOneDatasourceJob job if needed. This allows data ingestion to be managed independently and in parallel for each datasource. The scheduling frequency can be adjusted by the user (see the static configuration).
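The scheduling decision can be pictured with the following minimal sketch. It is not the rs-dam implementation: the Datasource class, its fields and the "if needed" rule (no job already running, and the refresh rate elapsed since the last ingestion) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Datasource:
    """Hypothetical view of a configured datasource (not a REGARDS class)."""
    name: str
    refresh_rate_s: int              # refresh rate, in seconds
    last_ingest: Optional[datetime]  # None if the datasource was never crawled
    job_running: bool = False


def datasources_to_crawl(datasources, now):
    """Return the datasources for which a crawl job should be created."""
    due = []
    for ds in datasources:
        if ds.job_running:
            continue  # assumption: at most one job per datasource at a time
        never_crawled = ds.last_ingest is None
        if never_crawled or now - ds.last_ingest >= timedelta(seconds=ds.refresh_rate_s):
            due.append(ds)
    return due
```

Each returned datasource would then get its own job, which is what makes ingestion independent from one datasource to another.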

Crawling Job (CrawlOneDatasourceJob)

The main steps performed by each job are:

1. Retrieve new products from the catalog

Using an implementation of the IDataSourcePlugin interface, the system retrieves new products and transforms the datasource-specific product format into the REGARDS standard format. See more details below.

2. Insert or update products in the Elasticsearch index

New or updated products retrieved in step 1 are inserted or updated in the appropriate Elasticsearch index using upsert requests. The upsert operation creates the product if it does not exist and updates it otherwise, in a single atomic operation. See more details below.
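As an illustration, one way to express upserts with the Elasticsearch bulk API is an update action combined with doc_as_upsert. The sketch below only builds the request body. Using ipId as the document identifier, and the helper name itself, are assumptions; the requests actually sent by rs-dam may differ.

```python
import json


def bulk_upsert_body(index, products):
    """Build an NDJSON _bulk body in which every product is an upsert."""
    lines = []
    for product in products:
        # Action line: target the document identified by the product id
        # (assumption: the ipId is used as the Elasticsearch _id).
        lines.append(json.dumps({"update": {"_index": index, "_id": product["ipId"]}}))
        # Source line: doc_as_upsert creates the document when it is missing.
        lines.append(json.dumps({"doc": product, "doc_as_upsert": True}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```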

note

Upsert requests are sent in parallel to Elasticsearch to improve performance. Each upsert request contains a configurable batch of products, and the number of parallel requests is also configurable to optimize throughput according to the cluster's capacity.
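The batching and parallelism described in this note can be sketched as follows. The function names and the send_batch callback are hypothetical: send_batch stands for whatever sends one upsert request and returns the number of products written.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def upsert_in_parallel(products, batch_size, parallelism, send_batch):
    """Send each batch through `send_batch`, at most `parallelism` at a time.

    Returns the total number of products reported as written.
    """
    batches = chunked(products, batch_size)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return sum(pool.map(send_batch, batches))
```

Raising batch_size reduces the number of requests, while raising parallelism increases the load put on the cluster at a given moment; both would be tuned against the cluster's capacity.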

warning

Ensure that the Hikari connection pool size (regards.jpa.multitenant.maxPoolSize) is STRICTLY GREATER than the Elasticsearch thread pool size (regards.elasticsearch.threadpool.size) to avoid connection starvation. Otherwise, rs-dam may enter a deadlock state, with all threads waiting for a connection from the pool and no thread available to release one.
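The constraint in this warning can be checked before deployment with a small guard such as the one below. The function is hypothetical and is not part of REGARDS; only the two property names come from the configuration described above.

```python
def check_pool_sizes(max_pool_size, es_threadpool_size):
    """Raise if the JPA pool is not strictly larger than the ES thread pool."""
    if max_pool_size <= es_threadpool_size:
        raise ValueError(
            "regards.jpa.multitenant.maxPoolSize (%d) must be strictly greater "
            "than regards.elasticsearch.threadpool.size (%d)"
            % (max_pool_size, es_threadpool_size)
        )
```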

3. Update access groups for products

Access groups are updated by a dedicated Painless script stored in Elasticsearch, so that all necessary changes are applied on the Elasticsearch side. rs-dam sends a single 'update by query' request that specifies the script id, and Elasticsearch applies the script to all documents matching the filter (typically all the data ingested in the previous step). Access groups are thus updated efficiently for a large set of products at once. See more details below.
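A sketch of what the body of such an 'update by query' request can look like is given below. Referencing a stored script by id with params is standard Elasticsearch syntax; the script id, the parameter name and the filter on dataSourceId are assumptions made for the example.

```python
def update_groups_request(script_id, datasource_id, groups):
    """Build the body of an _update_by_query request using a stored script."""
    return {
        # Stored script referenced by id; `groups` is a hypothetical parameter.
        "script": {"id": script_id, "params": {"groups": sorted(groups)}},
        # Assumption: the documents to update are selected by data source.
        "query": {"term": {"dataSourceId": datasource_id}},
    }
```

Because the script runs inside Elasticsearch, the documents do not have to be fetched and sent back one by one.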

note

If needed (depending on access rights configuration), access groups are updated for products using an implementation of the IDataObjectAccessFilterPlugin interface to apply any custom product filtering. This operation cannot be performed by the update group script and must be manually calculated by REGARDS, updating each document individually.

4. Compute calculated attributes

If the data model contains calculated attributes, these are computed using an implementation of the IComputedAttribute interface.

Retrieve new products from data sources

To manage different data sources, an extension point (see the implementation of the IDataSourcePlugin interface) is used to handle the specific requirements for loading products in the REGARDS format:

  • Products: A DATA Entity, as defined by a model in REGARDS
  • A set of products: A DATASET Entity, as defined by a model in REGARDS

The following types of crawlers are available:

  • AIP Crawlers: These crawlers allow crawling of SIPs from the rs-ingest microservice. Incremental ingestion uses the last data update.
  • Feature Crawlers: These crawlers allow crawling of features from the rs-fem microservice. Incremental ingestion uses the last data update.
  • Database Crawlers: These crawlers allow crawling of data from an external database, with the following modes:
    • Non-incremental ingestion (not recommended)
    • Incremental ingestion based on the last data update
    • Incremental ingestion based on the data identifier

The user selects the incremental ingestion mode during datasource creation.

  • Web Source Crawlers: These crawlers allow crawling of data from an OpenSearch web source. Incremental ingestion is based on the last data update.

The configuration of the extension point plugin can be used to define, as needed, the type of ingestion, the data source refresh rate (in seconds), and the overlap duration (in seconds) to prevent data loss.
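The role of the overlap duration can be illustrated with a short sketch: the lower bound of the next incremental crawl is moved back by the overlap, so that products updated just before the previous crawl finished are not missed. The function name is hypothetical, and the exact rule applied by each plugin is an assumption.

```python
from datetime import datetime, timedelta


def incremental_lower_bound(last_ingest, overlap_s):
    """Return the date from which updated products should be retrieved.

    Returns None when there is no previous ingestion (full crawl).
    """
    if last_ingest is None:
        return None
    return last_ingest - timedelta(seconds=overlap_s)
```

Products falling inside the overlap window may be retrieved twice, which is harmless since they are written with upsert requests.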

Next, a mapping must be created between the datasource products and the REGARDS model data before indexing the products.

info

Configuration options are available for the various connectors used by the crawler to reach an external database (see UI). The PostgreSQL connector is available as: postgresql-db-connection (1.0-SNAPSHOT).

Insert or Update new products in meta catalog

Dataset and Data entities are stored in a separate Elasticsearch index for each project/tenant of the REGARDS application. There is only one index for each tenant.

Access rights calculation for dataset

note

Access rights are defined for each dataset and group of users as follows:

  • Dataset and Data access
  • Dataset access
  • Full access to dataset, but partial access to Data (filtered by dynamic plugins)
  • No access

Any change in access rights between a group of users and a dataset has an impact on the meta catalog stored in Elasticsearch. Access rights are stored in each dataset and each product.

Access rights are calculated when:

  • There is a data modification (dataset update, add or remove data object, ...)
  • There is a user group modification

Dynamic plugins (see the extension point with the IDataObjectAccessFilterPlugins interface) re-calculate access rights periodically. Access rights are applied to the data filtered by the OpenSearch query. The re-calculation runs once a day by default; this periodicity is configurable in the microservice properties through the regards.access.rights.update.cron property, whose value is in standard cron format.

Meta catalog reindexation

Reindexation rebuilds the Elasticsearch index of a tenant from its data sources, ensuring that the meta catalog remains consistent and aligned with the latest data models.

Each tenant/project in REGARDS has its own Elasticsearch index, referenced by an alias. During a reindexation, a new index is created, populated, and validated before it replaces the current one, so that the REGARDS meta catalog remains available without interruption.

Process overview

When a reindexation is triggered, the process follows these main steps:

1. Create a new index

A new Elasticsearch index is created for the tenant, using the latest model mappings and settings. The index name is automatically generated and recorded in the tenant's alias entry (EsIndexAlias entity) as the building index. An alias points to exactly one index.

The tenant’s alias follows the naming convention: [PROJECT_NAME]_alias. Each new building index created for this tenant follows the pattern [PROJECT_NAME]_XXXXX_X, where the final digit increments with every new rebuild.

Example:

  • Project/tenant: projectA
  • Alias name: projectA_alias
  • Current index: projectA_3472e9_3
  • Building index: projectA_3472e9_4
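The naming convention can be expressed as a small helper, shown here as a sketch. The function names are hypothetical, and treating the final counter as a possibly multi-digit number is an assumption.

```python
import re

# An index name ends with "_<counter>"; everything before it is kept as-is.
INDEX_PATTERN = re.compile(r"^(?P<prefix>.+)_(?P<counter>\d+)$")


def alias_name(project):
    """Return the alias name of a project/tenant."""
    return f"{project}_alias"


def next_building_index(current_index):
    """Return the name of the next building index, with the counter incremented."""
    match = INDEX_PATTERN.match(current_index)
    if match is None:
        raise ValueError(f"unexpected index name: {current_index!r}")
    return f"{match['prefix']}_{int(match['counter']) + 1}"
```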

2. Populate the new index

All active data sources of the tenant are crawled using the same ingestion workflow as the standard CrawlingJob. Products are retrieved, transformed into REGARDS entities (DATA, DATASET or COLLECTION), and inserted into the building index.

3. Wait for all ingestion jobs to finish

Once ingestion starts, the system monitors the running jobs associated with the building index until all have completed twice. This ensures that the index is fully populated and consistent before activation.

4. Switch aliases

When the new index is ready, rs-dam performs an alias switch. The alias is updated to point to the new index; the previous index becomes obsolete and is ready to be deleted. This operation is transparent for end users: queries using the alias never experience downtime.
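On the Elasticsearch side, an alias can be moved from one index to another atomically by sending a remove action and an add action in a single _aliases request. The sketch below builds such a body; whether rs-dam issues exactly this request is an assumption, and the function name is hypothetical.

```python
def alias_switch_actions(alias, old_index, new_index):
    """Build the body of an _aliases request moving `alias` to `new_index`."""
    return {
        "actions": [
            # Both actions are applied together, so the alias always
            # resolves to exactly one index.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
```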

5. Clean up old indexes

Once the alias has been switched, the old index (previously referenced as current) is deleted to free resources. The old ingestions and the jobs they are associated with are removed too.

Alias management and consistency

The mapping between tenants and their indexes is persisted in the REGARDS database through the entity EsIndexAlias:

Name | Type | Description
alias | text | The logical name used by REGARDS to query Elasticsearch
current | text | The name of the currently active index for the tenant
building | text | The name of the index currently being populated during reindexation

The alias switch updates the current field to the new index name and clears the building field once completed.
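The effect of the switch on this entity can be sketched as follows. The dataclass only mirrors the three fields listed above; it is not the actual EsIndexAlias class, and the switch function is hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class EsIndexAlias:
    """Simplified stand-in for the persisted alias entry."""
    alias: str
    current: str
    building: Optional[str] = None


def switch(entry):
    """Promote the building index to current and clear the building field."""
    if entry.building is None:
        raise ValueError("no reindexation in progress for " + entry.alias)
    return replace(entry, current=entry.building, building=None)
```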

Caching mechanisms ensure quick access to alias information while keeping the database as the source of truth.

Elasticsearch index representation

Data entities are never stored in the REGARDS database, only in Elasticsearch.
Dataset entities are stored in the REGARDS database with the following information:

  • Creation date and update date
  • Uniform Resource Name identifier (example: URN:AIP:DATASET:validation:39c574a0-2ad6-4f47-9f4a-251d494892b1:V1)
  • Model of the products in this dataset
  • Identifier of the dataset model
  • Identifier of the plugin used to load products from a data source
  • Subsetting criterion set on the dataset for Elasticsearch

The following tables show the structure of the entities stored in the REGARDS Elasticsearch index.

Entity (product) for DATA type

Name | Type | Description
type | text | Entity type: DATA
creationDate | Date (format: date_optional_time) | Creation date of the entity
lastUpdate | Date (format: date_optional_time) | Update date of the entity
dataSourceId | long | Data source identifier
datasetModelNames | text | List of dataset model names
groups | text | List of group names for access rights
id | long | Technical identifier of the entity in the database
internal | boolean | true if an entity of DATA type is internal (created from AIP); false if it is external (created from an external database)
ipId | text | Identifier of Uniform Resource Name type (format: URN:StringId:DATA:tenant:UUID(entityId):version[,order][:revision])
metadata | Object (see details below) | Information about the access of a group to a specific dataset for data objects
model | Object | Entity model
model.description | text | Model description
model.id | long | Technical identifier of the model in the database
model.name | text | Model name (identical to the model property of the feature)
model.type | text | Model type: DATA
newPoint | geo_point | Bounding box north west point
setPoint | geo_point | Bounding box south east point
openSearchSubsettingClause | text | Representation of the above subsetting clause as an OpenSearch string request
tags | text | List of tags (including related datasets)
wgs84 | geo_shape | Geometry projection on the WGS84 CRS
feature | Object (see details below) | Raw entity feature
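The ipId format given in the table can be parsed with a regular expression, as in the sketch below. Several points are assumptions inferred from the format string and from the example URN:AIP:DATASET:validation:39c574a0-2ad6-4f47-9f4a-251d494892b1:V1: the version is written "V" followed by digits, the order is numeric, and the entity identifier is a 36-character UUID. The function and field names are hypothetical.

```python
import re

URN_PATTERN = re.compile(
    r"^URN:(?P<identifier>[^:]+)"
    r":(?P<entity_type>[^:]+)"
    r":(?P<tenant>[^:]+)"
    r":(?P<entity_id>[0-9a-fA-F-]{36})"
    r":V(?P<version>\d+)"            # assumption: "V" followed by digits
    r"(?:,(?P<order>\d+))?"          # optional order
    r"(?::(?P<revision>.+))?$"       # optional revision
)


def parse_urn(urn):
    """Split a URN into its components; optional parts are None when absent."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a recognised URN: {urn!r}")
    fields = match.groupdict()
    fields["version"] = int(fields["version"])
    if fields["order"] is not None:
        fields["order"] = int(fields["order"])
    return fields
```

Virtual identifiers ending in LAST do not carry a version number and are deliberately rejected by this pattern.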

Metadata for DATA type of entity

Name | Type | Description
groups | Map | Map of group names with access right for dataset
groups.<name>.dataset | text | Identifier of Uniform Resource Name type for dataset
groups.<name>.dataAccessRight | boolean | true if access right for the dataset; otherwise false
modelNames | Map | Map of model names with dataset URN
modelNames.<name>.<URN> | text | Identifier of Uniform Resource Name type for dataset

Feature for DATA type of entity

Name | Type | Description
sessionOwner | text | Session owner
Session | text | Session name
virtualId | text | Virtual identifier of URN type, indicating whether this is the last version (format: URN:StringId:DATA:tenant:UUID(entityId):LAST)
providerId | text | Provider identifier
entityType | text | Entity type: DATA
label | text | Entity label (sometimes identical to the provider identifier property)
model | text | Model name of the entity (identical to the name property of the model)
files | Map<DataType, DataFile> | Product-related entity files (example: thumbnail, quicklook, rawdata...)
DataType | text | Enum of data type (RAWDATA, QUICKLOOK_SD, QUICKLOOK_MD, QUICKLOOK_HD, DOCUMENT, THUMBNAIL...)
DataFile | Object | Data for file
DataFile.dataType | text | Enum of data type (RAWDATA, QUICKLOOK_SD, QUICKLOOK_MD, QUICKLOOK_HD, DOCUMENT, THUMBNAIL...)
DataFile.reference | boolean | false if the file is physically stored in REGARDS; true if REGARDS only references the file without storing it
DataFile.uri | text | Uniform Resource Identifier used to download the file. This URI is created by REGARDS.
DataFile.mimeType | text | MIME type of the file
DataFile.online | boolean | true if the file is online; false if it is nearline in the storage service
DataFile.checksum | text | Checksum of the file
DataFile.digestAlgorithm | text | Algorithm used for the checksum of the file
DataFile.filesize | double | Size of the file
DataFile.filename | text | File name
DataFile.types | array | Custom data file types
tags | text | List of tags (including the dataset identifier)
last | boolean | true if this is the last version; otherwise false
version | text | Entity version
id | text | Identifier of Uniform Resource Name type (identical to the ipId property)
geometry | Object | Information package geometry in GeoJSON RFC 7946 format
geometry.coordinates | double | Geometry coordinates
geometry.type | text | Geometry type (Point, MultiPoint, LineString, Polygon, MultiPolygon...)
geometry.bbox | array | Geometry bounding box. List of point coordinates [xmin, ymin, xmax, ymax] of Double type.
geometry.crs | text | Coordinate reference system. If not specified, WGS84 is considered as the default CRS
normalizedGeometry | Object | Geometry normalized to be used on a cylindrical projection
normalizedGeometry.coordinates | double | Normalized geometry coordinates
normalizedGeometry.type | text | Normalized geometry type (Point, MultiPoint, LineString, Polygon, MultiPolygon...)
normalizedGeometry.bbox | array | Geometry bounding box. List of point coordinates [xmin, ymin, xmax, ymax] of Double type.
normalizedGeometry.crs | text | Coordinate reference system. If not specified, WGS84 is considered as the default CRS
type | text | Feature
crs | text | Coordinate Reference System (default value: WGS84)
properties | Object | DATA model attributes

Entity for DATASET type

Name | Type | Description
type | text | Entity type: DATASET
creationDate | Date (format: date_optional_time) | Creation date of the entity
lastUpdate | Date (format: date_optional_time) | Update date of the entity
dataModel | text | Model of Data type for the entities included in this dataset
dataSourceId | long | Data source identifier
groups | text | List of group names for access rights
id | long | Technical identifier of the entity in the database
internal | boolean | true if an entity of DATA type is internal (created from AIP); false if it is external (created from an external database)
ipId | text | Identifier of Uniform Resource Name type (format: URN:StringId:DATA:tenant:UUID(entityId):version[,order][:revision])
metadata | Object (see details below) | Information about the access of a group to a specific dataset for data objects
model | Object | Entity model
model.description | text | Model description
model.id | long | Technical identifier of the model in the database
model.name | text | Model name (identical to the model property of the feature)
model.type | text | Model type: DATASET
newPoint | geo_point | Bounding box north west point
setPoint | geo_point | Bounding box south east point
openSearchSubsettingClause | text | Representation of the above subsetting clause as an OpenSearch string request
plgConfDataSource | Object | Plugin configuration for the extension point (IDataSourcePlugin interface)
plgConfDataSource.active | boolean | Whether the plugin is active
plgConfDataSource.businessId | text | Plugin business identifier
plgConfDataSource.label | text | Plugin label
plgConfDataSource.parameters | nested | Configuration parameters of the plugin
plgConfDataSource.pluginId | text | Plugin identifier
plgConfDataSource.priorityOrder | long | Priority order of the plugin
plgConfDataSource.version | text | Plugin version
tags | text | List of tags
wgs84 | geo_shape | Geometry projection on the WGS84 CRS
feature | Object (see details below) | Raw entity feature

Metadata for DATASET type of entity

Name | Type | Description
dataObjectsGroups | Map | Map of group names with access right for dataset
dataObjectsGroups.<name>.groupName | text | Group name
dataObjectsGroups.<name>.dataFileAccess | boolean | true if access right for files of product; otherwise false
dataObjectsGroups.<name>.dataObjectAccess | boolean | true if access right for objects of products; otherwise false
dataObjectsGroups.<name>.dataAccess | boolean | true if access right for data of products; otherwise false
dataObjectsGroups.<name>.metaDataObjectAccessFilterPluginBusinessId | String | Plugin identifier for the extension point: IDataObjectAccessFilterPlugins

Feature for DATASET type of entities

Name | Type | Description
dataObjectsFilesAccessGranted | boolean | true if access to data object files is granted; otherwise access is denied
dataObjectsAccessGranted | boolean | true if access to data objects is granted; otherwise access is denied
licence | text | Licence of the dataset
virtualId | text | Virtual identifier of URN type, indicating whether this is the last version (format: URN:StringId:DATASET:tenant:UUID(entityId):LAST)
providerId | text | Provider identifier
entityType | text | Entity type: DATASET
id | text | Identifier of Uniform Resource Name type (format: URN:StringId:DATA:tenant:UUID(entityId):version[,order][:revision])
label | text | Label of the dataset
model | text | Model name of the entity (identical to the name property of the model)
files | Map<DataType, DataFile> | Dataset-related entity files (example: thumbnail, quicklook, rawdata...)
DataType | text | Enum of data type (RAWDATA, QUICKLOOK_SD, QUICKLOOK_MD, QUICKLOOK_HD, DOCUMENT, THUMBNAIL...)
DataFile | Object | Data for file
DataFile.dataType | text | Enum of data type (RAWDATA, QUICKLOOK_SD, QUICKLOOK_MD, QUICKLOOK_HD, DOCUMENT, THUMBNAIL...)
DataFile.reference | boolean | false if the file is physically stored in REGARDS; true if REGARDS only references the file without storing it
DataFile.uri | text | Uniform Resource Identifier used to download the file. This URI is created by REGARDS.
DataFile.mimeType | text | MIME type of the file
DataFile.online | boolean | true if the file is online; false if it is nearline in the storage service
DataFile.checksum | text | Checksum of the file
DataFile.digestAlgorithm | text | Algorithm used for the checksum of the file
DataFile.filesize | double | Size of the file
DataFile.filename | text | File name
DataFile.types | array | Custom data file types
tags | text | List of tags
version | integer | Entity version
type | text | Feature
properties | Object | DATA model attributes