Advanced usage of "CDF as code"
This section is for advanced users who want to understand the details of CDF configurations as code, how they are structured, and how to use the CDF Toolkit to manage the lifecycle of a CDF project.
From here on, the toolkit refers to the `cdf-tk` command-line tool and the YAML configurations that specify how to configure a CDF project. The templates are the pre-built configurations that come with the toolkit.
Target usage
Any user with admin control of a CDF project may use the toolkit. However, please be aware of the following:
- The toolkit is designed to manage the lifecycle of a project, starting with a new project (day 0 and 1), i.e., either provisioned from Cognite or done as part of the initial setup. If you have an existing project, you may use the bundled modules, but if you have overlaps in external IDs for configuration entities like transformations, data models, etc., you will have to adjust the variables and possibly the configurations before applying them to your project.
- Once a project is provisioned from the YAML configurations, you can continue to manage the governed parts of your project using the configuration files. The tool will only target resources in CDF that are specified with external IDs in the YAML configuration files. Any other resources in CDF will not be touched.
- Your own modules should NOT have the `cdf_` prefix, as it is reserved for official Cognite product modules.
Packages
Many of the pre-configured modules from Cognite are organized in packages. A package is a list of modules, each module consisting of one or more YAML configuration files.
You cannot define packages yourself; they are pre-defined by Cognite and documented in
Modules and packages. If you want to deploy a set of your own modules,
you need to specify the full list in the `environment.selected_modules_and_packages`
section of the `config.<env>.yaml` file.
Environments
An environment refers to a specific CDF project. Typically, you
will have a personal or local environment, a dev CDF project used for development, a staging CDF project
to validate configurations, and then finally a production CDF project. Different projects require different
modules to be loaded. For example, you may want to be able to
edit and change configurations in the development environment, so you can quickly iterate. On the other hand, in
production, you want to lock down configurations and only allow changes through the YAML configuration
files. You will find the definition of the environment in the `environments`
section of the `config.<env>.yaml` file. You create a new environment by creating a new
`config.<env>.yaml` file (remember to edit the `environment.name` property in the file).
You can load different modules for groups and authentication in the different environments. For example, the "demo_infield" package loads groups for read/write access as well as sample data, while in your production environment you would load the "infield" package instead.
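A minimal environment definition might look like the sketch below. Only `environment.name` and `environment.selected_modules_and_packages` are taken from this section; the `project` field value and the module names are illustrative:

```yaml
# config.dev.yaml -- a hedged sketch; 'project' and the module
# names below are illustrative, adapt them to your own setup.
environment:
  name: dev
  project: my-cdf-project-dev
  selected_modules_and_packages:
    - demo_infield        # demo package: read/write groups plus sample data
    - my_custom_module    # a module of your own, without the cdf_ prefix
```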
CI/CD pipeline
The following outlines the high-level process. The build step should be executed as part of your pipeline to
pick up the modules to be deployed, parse them, and replace template variables with values from the
`config.<env>.yaml` file. The basic syntax is validated as part
of loading the YAML files. The results are written to the `build/` directory (unless you specify another target
directory).
The deploy step is then executed with the `build/` directory as its (default) parameter. Environment variables are resolved as part of the deploy step.
Once validation has passed, the configurations are pushed to the CDF project. The `--dry-run` option
is your friend; use it to verify that the configurations are correct before you push them.
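As a sketch, the build and deploy steps can be wired into a CI/CD workflow such as the hypothetical GitHub Actions job below. The step layout, the `--env` flag, the pip package name (`cognite-toolkit`), and the secret name are assumptions; check your pipeline tooling and `cdf-tk --help` for the exact syntax:

```yaml
# Hypothetical GitHub Actions workflow -- names and flags are
# assumptions, not a definitive pipeline.
name: Deploy CDF configurations
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install cognite-toolkit
      # Build: merge selected modules and substitute variables
      # from config.prod.yaml into the build/ directory.
      - run: cdf-tk build --env prod
      # Validate first; environment variables are resolved here.
      - run: cdf-tk deploy --dry-run build/
      - run: cdf-tk deploy build/
        env:
          IDP_CLIENT_SECRET: ${{ secrets.IDP_CLIENT_SECRET }}
```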
Work process for governed configurations
This section describes the design principle behind the Cognite templates and how you should build your own modules.
Conceptually, data in a CDF project can be split into three different types as illustrated by the diagram below.
The governed configuration is the data that is managed by the YAML configurations in your project. Once these data pipelines and processing configurations (and more) have been applied to the CDF project, the data ingestion should start. Your project should NOT manage the data that is ingested into the project, but it should configure and enable the ingestion to run as well as set up the access control mechanisms. Using the Toolkit to load data is considered an antipattern, only to be used for static data or demo purposes.
The extractors running outside CDF (i.e. inside the customer network to
get access to source systems) are not deployed directly from your project, but you can configure extraction
pipelines that will be the "receiving end" of an extractor. You can
configure an extraction pipeline with an extractor configuration, data sets, data models, and transformations
to fully control the configuration of an end-to-end data pipeline. The
`examples/cdf_data_pipeline_asset_valhall` module
is an example of such a data pipeline configuration.
The governed data should be configured with CDF data sets and data model spaces and should not be modified directly by users in CDF. As part of the template configurations, certain data sets and spaces will be created that users in CDF can write to. This is the third level in the diagram, user-generated data. This data can include Charts, comments, annotations, and even transformations, functions, and data models.
In the governed data, you will also find the contextualizations that are done in CDF. Some contextualizations may go through a process where initial contextualizations are done by the user and live in the user generated data, but then are later promoted to governed data.
User generated data like Cognite Functions, transformations, data models, and other data typically start out as quick iterations and prototyping to solve a specific problem. Once the solution has been approved for production use and/or you want to scale the solution to multiple assets, you may want to move the solution to be governed. This is done by exporting the configurations for a solution and then importing them into your governed configuration (i.e. your version of this template). (Note: this is not yet supported by the toolkit, but is a manual process.)
NOTE! This work process is particularly suitable for a workflow with three CDF projects: development, staging, and production. These three projects are then used to evolve the governed configuration.
Structure and setup for template modules
Template modules are organized in a flat structure under `./core`, `./common`, `./modules`, `./experimental`, `./infield`, and `./examples`
in the `cognite_modules` directory of your project root (there may be more in the future). Each module has a structure
with directories for each type of configuration, e.g., `./modules/<moduleA>/transformations` and `./modules/<moduleA>/data_models`.
See the module reference for details on each module, and the YAML configuration reference for details on the YAML configuration files.
The `config.<env>.yaml` files in the project root are used to specify the actual modules that should be deployed.
You are free to delete and edit modules in the `cognite_modules` directory. You can also add your own modules
in the `custom_modules` directory. The `cdf-tk` tool will pick up all modules in the `cognite_modules` and `custom_modules`
directories. If you keep modules in the `cognite_modules` directory untouched, you will get updates to these
modules when running `cdf-tk init --upgrade`. If you have edited template modules, the upgrade will instead guide you
through migrating your configurations to any updates or changes in how the tool interprets YAML files.
Templating and configuration
In `config.<env>.yaml`, the environment section specifies the details of the environments you want to deploy.
The `cdf-tk build` command will set `CDF_ENVIRON` and `CDF_BUILD_TYPE` (dev, staging, or prod)
as environment variables, so these can be used as variables in your YAML files.
Configuration variables used across your module configurations should be defined in the `config.<env>.yaml`
files; each environment's variables are managed separately.
Template variables in module files should be in the form `{{variable_name}}`.
If you want template variables to be replaced by environment variables in the deploy step, use the following
format in the `config.<env>.yaml` file: `variable_name: ${ENV_VAR_NAME}`.
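As a sketch of both mechanisms (the variable names, module name, and exact nesting inside `config.<env>.yaml` are illustrative), a variable defined once in the environment file can be referenced from any YAML file in a module:

```yaml
# config.dev.yaml (excerpt) -- nesting and names are illustrative
my_module:
  dataset_external_id: ds_timeseries_dev
  # Left unresolved at build time; substituted from the
  # environment during the deploy step:
  transformation_client_secret: ${TRANSFORMATION_CLIENT_SECRET}
---
# custom_modules/my_module/transformations/load_timeseries.yaml
externalId: tr_load_timeseries
dataSetExternalId: '{{dataset_external_id}}'
authentication:
  clientSecret: '{{transformation_client_secret}}'
```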
You need one `config.<env>.yaml` file for each environment you want to deploy to. For example, if you have a
`prod`, a `staging`, and a `dev` environment, you will have `config.prod.yaml`, `config.staging.yaml`,
and `config.dev.yaml` configurations.
The `config.<env>.yaml` file is used by the build step to
process the configurations and create a `build/` directory where all the configurations are
merged into a single directory structure.
Version control changes made through the Cognite Data Fusion user interface
Certain configuration changes are easier and more efficient to make in the CDF user interface, especially in the early phases of a project, where it's useful to make many small incremental changes and see the results immediately. This is especially true for Transformations, where the preview feature allows you to test and verify queries before making them part of the data pipeline.
The Toolkit is designed to support this workflow for Transformations and instance nodes. The `cdf-tk pull`
command compares the configuration in the CDF project with the local modules and downloads any changes made in the CDF project, so that you can store them in version control.
::: caution
The resource has to be managed by the Toolkit in the first place, i.e., it has to belong to a module listed under `selected_modules_and_packages` in `config.<env>.yaml` and be built and deployed according to the usual process, for the `cdf-tk pull` command to be able to recognize it.
:::
In short, the process looks like this:
- Configure a skeleton Transformation YAML and an SQL file locally.
- Build and deploy the module to the CDF project.
- Edit and test the Transformation in the CDF user interface through as many iterations as needed.
- Run `cdf-tk pull` to download the changes made in the CDF project to the local repository.
- Use Git to bring the changes into version control.
Repeat as necessary. From now on, you will be able to restore the Transformation configuration
if unwanted changes are made in the CDF project, or deploy it to other CDF projects such as `staging`
and `production`.
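The skeleton Transformation in the first step might look like the sketch below. All identifiers, field values, and the convention of a same-named `.sql` file next to the YAML are illustrative; the authoritative schema is in the YAML configuration reference:

```yaml
# custom_modules/my_module/transformations/tr_asset_hierarchy.yaml
# A hedged skeleton -- all identifiers below are illustrative.
externalId: tr_asset_hierarchy
name: Asset hierarchy
destination:
  type: assets
ignoreNullFields: true
# The query is assumed to live in tr_asset_hierarchy.sql next to
# this file; start with a trivial query and refine it in the CDF
# UI, then bring the changes back with cdf-tk pull.
```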