Advanced usage of "CDF as code"

Beta

The CDF Toolkit (cdf-tk) is currently in beta. It should be stable and mature enough to bootstrap and configure Cognite Data Fusion projects. However, if you use the tool to manage production projects, we recommend that you first test it in a staging project to ensure that you know what the tool is going to do. Part of the beta is to gather feedback on the tool and improve how it supports project lifecycle management, so please engage with us on hub.cognite.com.

This section is for advanced users who want to understand how CDF configurations as code are structured and how to use the CDF Toolkit to manage the lifecycle of a CDF project. From here on, the toolkit refers to the cdf-tk command-line tool and the YAML configurations that specify how to configure a CDF project. The templates are the pre-built configurations that come with the toolkit.

Target usage

Any user with admin control of a CDF project may use the toolkit. However, please be aware of the following:

  • The toolkit is designed to manage the lifecycle of a project, starting with a new project (day 0 and 1), i.e. either provisioned from Cognite or created as part of the initial setup. If you have an existing project, you may use the bundled templates, but if you have overlapping external IDs for configuration entities like transformations, data models, and so on, you will have to adjust the variables, and possibly the configurations, before applying them to your project.
  • Once a project is provisioned from the YAML configurations, you can continue to manage the governed parts of your project using the configuration files. The tool only targets resources in CDF that are specified with external IDs in the YAML configuration files. Any other resources in CDF are not touched.
  • Your own modules should NOT have the cdf_ prefix as it is reserved for official Cognite product modules.

Packages

Many of the pre-configured modules from Cognite are organized in packages. A package is a list of modules, each module consisting of one or more YAML configuration files.

You cannot define packages yourself. These are pre-defined from Cognite and are documented at Modules and packages. If you want to deploy a set of your own modules, you need to specify the full list in the environment.selected_modules_and_packages section of the config.<env>.yaml file.
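To make this concrete, here is a sketch of the relevant section of a config.<env>.yaml file. The project, package, and module names are placeholders, not part of the toolkit:

```yaml
# config.dev.yaml (fragment) -- names are illustrative placeholders
environment:
  name: dev
  project: my-dev-project
  selected_modules_and_packages:
    - infield            # a package pre-defined by Cognite
    - my_custom_module   # one of your own modules (no cdf_ prefix)
```

Listing a Cognite package pulls in all of its modules; your own modules must each be listed individually.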

Environments

An environment refers to a specific CDF project. Typically, you will have a personal or local environment, a dev CDF project used for development, a staging CDF project to validate configurations, and finally a production CDF project. Different projects require different modules to be loaded. For example, you may want to edit and change configurations in the development environment so you can iterate quickly, but in production, you want to lock down configurations and only allow changes through the YAML configuration files. You will find the definition of the environment in the environment section of the config.<env>.yaml file. You create a new environment by creating a new config.<env>.yaml file (remember to edit the environment.name property in the file).

As the example configuration shows, you can load different modules for groups and authentication in the different environments. For example, the demo_infield package loads groups for read/write access, as well as sample data. In your production environment, you want to load the infield package instead.
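The dev/prod split above could be sketched as two config files, shown here as YAML documents separated by `---` (fragments only; the names are placeholders):

```yaml
# config.dev.yaml (fragment): load the demo package with sample data
environment:
  name: dev
  selected_modules_and_packages:
    - demo_infield
---
# config.prod.yaml (fragment): load the production package, no demo data
environment:
  name: prod
  selected_modules_and_packages:
    - infield
```

Selecting the environment at build time (e.g. cdf-tk build --env=prod) determines which file, and therefore which packages, are used.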

CI/CD pipeline

The following is the high-level process. The build step should be executed as part of your pipeline to pick up the modules to be deployed, parse them, and replace template variables with values from the config.<env>.yaml file. The basic syntax is validated as part of loading the YAML files. The results are written to the build/ directory (unless you specify another target directory).

The deploy step is then executed with the build/ directory as (default) parameter. Environment variables will be resolved as part of the deploy step.

Once validations have passed, the configurations will be pushed to the CDF project. The --dry-run option is your friend and you should use it to validate that the configurations are correct before you push them.
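One way to wire these steps into a pipeline is sketched below as a GitHub Actions workflow. The workflow structure, the pip package name, and the secret names are assumptions for illustration, not part of the toolkit; adapt them to your CI system and identity provider:

```yaml
# .github/workflows/deploy-cdf.yaml (sketch -- names and secrets are placeholders)
name: deploy-cdf-config
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install cognite-toolkit
      # Build: pick up selected modules and substitute template variables
      - run: cdf-tk build --env=prod
      # Dry run first: report what would change without touching the project
      - run: cdf-tk deploy --dry-run build/
        env:
          IDP_CLIENT_ID: ${{ secrets.IDP_CLIENT_ID }}
          IDP_CLIENT_SECRET: ${{ secrets.IDP_CLIENT_SECRET }}
      # Deploy: push the built configurations to the CDF project
      - run: cdf-tk deploy build/
        env:
          IDP_CLIENT_ID: ${{ secrets.IDP_CLIENT_ID }}
          IDP_CLIENT_SECRET: ${{ secrets.IDP_CLIENT_SECRET }}
```

In a real pipeline you would typically gate the final deploy step on a manual approval or a successful dry run review.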

Work process for governed configurations

This section describes the design principle behind the Cognite templates and how you should build your own modules.

Conceptually, data in a CDF project can be split into three different types as illustrated by the diagram below.

The governed configuration is the data that is managed by the YAML configurations in your project. Once these data pipelines and processing configurations (and more) have been applied to the CDF project, the data ingestion should start. Your project should NOT manage the data that is ingested into the project, but it should configure and enable the ingestion to run as well as set up the access control mechanisms. Using the Toolkit to load data is considered an antipattern, only to be used for static data or demo purposes.

The extractors running outside CDF (i.e. inside the customer network, to get access to source systems) are not deployed directly from your project, but you can configure extraction pipelines that will be the "receiving end" of an extractor. You can configure an extraction pipeline with an extractor configuration, data sets, data models, and transformations to fully control the configuration of an end-to-end data pipeline. The examples/cdf_data_pipeline_asset_valhall module is an example of such a data pipeline configuration.
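As a rough sketch of what such a "receiving end" looks like (all identifiers are placeholders, and the authoritative field list is in the YAML configuration reference), an extraction pipeline configuration file could resemble:

```yaml
# extraction_pipelines/ep_src_asset.ExtractionPipeline.yaml (sketch; values are placeholders)
externalId: ep_src_asset_valhall
name: src:asset:valhall
dataSetExternalId: ds_asset_valhall
description: Receives asset data from the on-premises extractor into CDF RAW
rawTables:
  - dbName: asset_valhall
    tableName: assets
```

The extractor itself runs in the customer network and is configured separately; the pipeline entry above is what lets CDF monitor and govern that ingestion.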

The governed data should be configured with CDF data sets and data model spaces and should not be modified directly by users in CDF. As part of the template configurations, certain data sets and spaces will be created that users in CDF can write to. This is the third level in the diagram, user-generated data. This data can be Charts, comments, and annotations, and even transformations, functions, and data models.

In the governed data, you will also find the contextualizations that are done in CDF. Some of the contextualizations may go through a process where initial contextualizations are done by the user and live in the user generated data, but then are later promoted to governed data.

User generated data like Cognite Functions, transformations, data models, and other data typically start out as quick iterations and prototyping to solve a specific problem. Once the solution has been approved for production use and/or you want to scale the solution to multiple assets, you may want to move the solution to be governed. This is done by exporting the configurations for a solution and then importing them into your governed configuration (i.e. your version of this template). (Note: this is not yet supported by the toolkit, but is a manual process.)

note

This work process is particularly suitable for a workflow with three CDF projects: development, staging, and production. These three projects are then used to evolve the governed configuration.

Structure and setup for template modules

Template modules are organized in a flat structure under core/, common/, modules/, experimental/, infield/, and examples/ in the cognite_modules directory of your project root (there may be more in the future). Each module has a structure with directories for each type of configuration, e.g. modules/<moduleA>/transformations and modules/<moduleA>/data_models.

See the module reference for details on each module, and the YAML configuration reference for details on the YAML configuration files.

The config.<env>.yaml files in the root directory are used to specify the actual modules that should be deployed.

tip

You are free to delete and edit modules in the cognite_modules directory. You can also add your own modules in the custom_modules directory. The cdf-tk tool picks up all modules in the cognite_modules and custom_modules directories. If you keep modules in the cognite_modules directory untouched, you will get updates to these modules when you run cdf-tk init --upgrade. If you have edited them, the upgrade will instead give you migration help and guide you through adapting your configurations to any changes in how the tool interprets the YAML files.

Templating and configuration

In the environment section of config.<env>.yaml, you specify the details of the environments you want to deploy to. The cdf-tk build command will set CDF_ENVIRON (the environment name) and CDF_BUILD_TYPE (dev, staging, or prod) as environment variables, so these can be used as variables in your YAML files.

Configuration variables used across your module configurations should be defined in the config.<env>.yaml files. Each environment's file is managed separately.

Template variables in files in the modules should be in the form {{variable_name}}. If you want template variables to be replaced by environment variables, use the following format in the config.<env>.yaml file: variable_name: ${ENV_VAR_NAME}. If you want variables to be set dependent on the environment you deploy to (e.g. cdf-tk build --env=prod), you need to edit the corresponding config.<env>.yaml file for the environment you are deploying to.
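To make the substitution concrete, here is a sketch with a variable definition and its use in a module file (shown as two YAML documents separated by `---`; all names are placeholders, and the exact nesting of variables in config.<env>.yaml depends on your toolkit version):

```yaml
# config.dev.yaml (fragment) -- variable definitions; names are placeholders
modules:
  my_module:
    dataset_external_id: ds_assets_dev
    # Resolved from the environment variable at deploy time
    client_secret: ${MY_MODULE_CLIENT_SECRET}
---
# modules/my_module/transformations/my_transformation.yaml (fragment)
# The build step replaces {{dataset_external_id}} with ds_assets_dev
externalId: tr_my_transformation
destination:
  type: assets
dataSetExternalId: '{{dataset_external_id}}'
```

Note the two stages: {{...}} placeholders are replaced at build time from config.<env>.yaml, while ${...} values pass through the build and are resolved from environment variables at deploy time.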

The config.<env>.yaml file is used by the build step to process the configurations and create a build/ directory where all the configurations are merged into a single directory structure.

Pull changes made through UI into repository

Certain configuration changes are easier and more efficient to make in the Cognite Data Fusion web user interface, for example in the early phases of a project where it is useful to make many small incremental changes and see the results immediately. This is especially true for Transformations where the preview feature allows you to test and verify queries before making them part of the data pipeline.

The Toolkit is designed to support this workflow for Transformations and instance Nodes. The cdf-tk pull command compares the configuration in the Cognite Data Fusion project with the configuration in your local modules and downloads any changes made in the CDF project so that you can store them in version control.

For the cdf-tk pull command to be able to recognise the destination of the data pulled from the CDF project, the resource has to be managed by the Toolkit. In practice, this means that the resource must be configured in a YAML file in a module listed under selected_modules_and_packages in your config.<env>.yaml, and be built and deployed according to the usual process.

In short, the process looks like this:

  • Configure a skeleton Transformation YAML file and an SQL file locally.
  • Build and deploy the module to the CDF project.
  • Edit and test the Transformation in the CDF user interface as many iterations as needed.
  • Run cdf-tk pull to download the changes made in the CDF project to the local repository.
  • Use Git to bring the changes into version control.

Repeat as necessary. From now on you will be able to restore the Transformation configuration if unwanted changes are made in the CDF project, or deploy it to other CDF projects.
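A minimal skeleton for the first step could look like the following (a sketch only; the identifiers are placeholders and the full schema is in the YAML configuration reference):

```yaml
# transformations/tr_asset_cleanup.Transformation.yaml (sketch; identifiers are placeholders)
externalId: tr_asset_cleanup
name: Asset cleanup
destination:
  type: assets
ignoreNullFields: true
# The SQL query lives in a companion file next to this one,
# e.g. transformations/tr_asset_cleanup.Transformation.sql.
# Iterate on the query in the CDF UI, then run `cdf-tk pull`
# to bring the changes back into this module.
```

Because the skeleton is deployed through the normal build and deploy steps, cdf-tk pull can match the Transformation in CDF back to this file and its SQL companion.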