# Using Templates

This section covers steps 3-5 below. For steps 1-2, see Getting started.
| Step | Command | Description |
|---|---|---|
| 1. | `cdf-tk init <proj_dir>` | Create a new configuration folder, `cd <proj_dir>`, and initialise the project. |
| 2. | `cdf-tk auth verify --interactive` | Check that you have access to the project and create a `.env` file. You can skip this step if you have configured environment variables. Alternatively, run `cdf-tk auth verify` just to verify that everything works. |
| 3. | Edit `config.<env>.yaml` in `<proj_dir>` | Specify the modules you want to deploy for each environment you deploy to. `config.<env>.yaml` also contains all the variables the modules expect. Change the variables for the modules that are relevant to your deployments. |
| 4. | `cdf-tk --verbose build --env=dev` | Build the configurations into the `build/` directory using the `config.dev.yaml` configuration file. |
| 5. | `cdf-tk deploy --dry-run --env=dev` | Test-deploy the configurations from the `build/` directory to your CDF project. Then remove `--dry-run` to actually push the configurations to the project. |
## Introduction
When you are starting out, your Cognite Data Fusion (CDF) project is empty. You need to set up the project with a core structure that fits with how you want to work with CDF. Depending on your industry and use case, you may want tailored data models, and the systems where your existing data resides will have different naming schemes and structures, so you need to configure CDF to extract, transform, and contextualise your data.
Although CDF is a flexible platform that can be configured to fit most use cases, there are some common patterns and use cases that are useful to have as a starting point. The CDF Toolkit comes with a set of templates that you can use to get started with your project.
## What is a template?

A template is a set of configurations that can be deployed to a CDF project through Cognite's open APIs. You don't need to know the APIs to use the templates, but if you create your own configurations, it is useful to know that they mirror the APIs. Technically, a template is a set of YAML-formatted files that follow the CDF API specifications and thus let you describe, in text, how the CDF project should be set up.
When we refer to templates, we mean the configuration sets that are quality assured and come bundled with the `cdf-tk` tool. To start using the templates, run `cdf-tk init <folder>` to create a new local project folder with the templates pre-installed. You then edit the configuration variables that tailor the templates to your project, and you can add new configurations for your project. You can also modify and adapt the templates to fit your project.
## Modules and packages

The simplest possible template is a single YAML text file that configures one small, simple thing in a CDF project, like a group or a data set.

A group of such YAML files can then be put together into a module. A module is a bundle of CDF configurations that logically belong together, are deployed together, and that gives you a certain piece of functionality in your CDF project. For example, the Infield application needs a set of configurations that are shared with other Asset Performance Management (APM) use cases and applications, including the APM data model. These configurations are bundled in the `cdf_apm_base` module.
A package is simply a list of modules that are deployed together in a specific order. For example, the `cdf_infield` package gives you all the modules necessary for Infield to work.
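For a sense of scale, a module is just a directory of YAML files grouped into resource folders. A hypothetical minimal module might look like this (directory and file names are illustrative, not taken from the bundled templates; see the pre-installed modules for real examples):

```
my_module/
  auth/
    readwrite.group.yaml        # a CDF group definition
  data_sets/
    my_data.dataset.yaml        # a data set definition
  transformations/
    ingest_assets.yaml          # a transformation configuration
```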
The pre-installed templates are found as modules in the `modules/`, `examples/`, `common/`, and `experimental/` directories below the `cognite_modules` directory in your project directory. You are free to edit the configurations in these modules (or copy them to `custom_modules/`), but if you do not edit them, you get the benefit of being able to run `cdf-tk init --upgrade` to get the latest version of the templates installed into your project.
## Practical steps

The basic flow is as follows: first, you build the templates to resolve variables (as defined in `config.<env>.yaml`) and gather the modules that should be deployed. Then you deploy what was built to the CDF project environment of your choice.
### Configuring what to deploy

This step describes how to configure what to deploy to each of your project environments. These are configured in the `config.<env>.yaml` files found in the root of your project directory. `<env>` is the name of the environment you want to manage. By default, two environments are created: `dev` and `prod`. You can create any number of environments by copying a `config.<env>.yaml` file, changing the `<env>` in the file name, and editing the `environment.name` property.
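For example, copying `config.dev.yaml` to `config.test.yaml` and editing it might give an environment section like this (a hypothetical third environment; the project name is a placeholder):

```yaml
# config.test.yaml (hypothetical additional environment)
environment:
  name: test                 # must match the <env> part of the file name
  project: <customer-test>
  type: dev                  # kept as dev here for illustration
```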
If you want to configure the `dev` environment, you edit the `config.dev.yaml` file. Open up `config.dev.yaml` in the root of the project directory you created with `cdf-tk init <folder>`. This file is the starting point for how your project is configured. It defines the environment and, for that environment, which modules to deploy.
Here is a snippet of the `config.dev.yaml` file that defines the environment:

```yaml
environment:
  name: dev
  project: <customer-dev>
  type: dev
  selected_modules_and_packages:
    - cdf_demo_infield
    - cdf_oid_example_data
  common_function_code: ./common_function_code
```
Edit the `project` property to match the name of your CDF project. This is used as a safety measure to ensure that you don't accidentally deploy to the wrong project. The `type` property is used to distinguish between different types of environments; it is not currently used by the `cdf-tk` tool, but will be used to support migrations in the future.

The `selected_modules_and_packages` property is a list of modules and packages to deploy. The modules can be found in any of the module directories below the `cognite_modules` and `custom_modules` directories.

Finally, the `common_function_code` property is the path to a directory where you can put common code that is used by your functions. The default code found in `common_function_code` is used to support local execution of functions.
### Testing Functions locally

The toolkit repository might not be the ideal environment for active code development, because of how modules are "packaged" in the directory hierarchy (it is easy to get lost). A suggested way of working is to think of the toolkit repository (its commit history) as snapshots of a fully working state. We therefore recommend developing and testing functions separately, then copying in the verified files.

With that disclaimer out of the way, here's a guide to running locally:
To run, for example, `fn_context_files_oid_fileshare_annotation`, simply call the file `handler.py` normally from the root folder of the toolkit, or any folder below it, as long as you don't enter the "package" itself, i.e. `fn_context_files_oid_fileshare_annotation`:

```
cognite_toolkit/
  cognite_modules/
    examples/
      cdf_data_pipeline_files_valhall/
        functions/
```
Assuming you have navigated to `functions/`, the full command would be (you may skip `poetry run` if you have already activated your virtual environment):

```shell
poetry run python fn_context_files_oid_fileshare_annotation/handler.py
```
This works because a special `run_locally` method has been added (and the imports have been made to work). If the required environment variables, mostly for authentication towards CDF, are not set correctly, an error listing them will be raised (you may also inspect the handler file directly).
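To illustrate the pattern, here is a simplified sketch, not the toolkit's actual `handler.py`; the function and environment-variable names below are hypothetical:

```python
import os

# Environment variables assumed for this sketch; the real handler lists
# its own required variables for authenticating towards CDF.
REQUIRED_ENV_VARS = ["CDF_PROJECT", "CDF_CLUSTER", "IDP_CLIENT_ID", "IDP_CLIENT_SECRET"]

def handle(client, data):
    # Entry point called by Cognite Functions when the function is deployed.
    return {"status": "ok", "input": data}

def run_locally():
    # Fail fast with a clear message if authentication variables are missing.
    missing = [name for name in REQUIRED_ENV_VARS if not os.environ.get(name)]
    if missing:
        raise EnvironmentError(f"Missing required environment variables: {missing}")
    client = None  # In the real code, a CogniteClient would be constructed here.
    return handle(client, {"local": True})
```

In the real handler, a `__main__` guard calls `run_locally()`, which is why `python handler.py` works directly instead of waiting for the cloud runtime to invoke `handle`.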
### Changing the default variables

Each module has variables you may want to change to adapt it to your project. It may be the name of your default location (i.e. plant/asset/site), or other things. The configuration variables can be found in the same `config.<env>.yaml` file as the environment configuration, further down in the `modules` section.

You are free to delete modules that you don't need, both from the `cognite_modules` directory and from the `config.<env>.yaml` file.
Most of the variables are set to default values that are useful for the example data set that comes with the templates. You can deploy the configurations as they are without changing these variables, but you will probably want to adapt them to your project. Other variables are set to `<change_me>`. If these are not changed, some functionality will not work.
### Module-specific configurations

Below are the default modules and how to edit their variables. If you don't use these modules, the sections below serve as examples of how to configure the modules you do use.
#### common: cdf_auth_readwrite_all

```yaml
readwrite_source_id: <change_me>
readonly_source_id: <change_me>
```

These are the group IDs from your identity provider. The `readwrite_source_id` should be the group ID of the group you created for the `cdf-tk` tool. The `readonly_source_id` should be the group ID of a group that administrators belong to, so they can use the Fusion UI and API to read data.
#### core: cdf_apm_base

```yaml
apm_datamodel_space: 'APM_SourceData'
apm_datamodel_version: '1'
```

These should not be changed; they will be updated with new versions of the APM data model.
#### examples: cdf_oid_example_data

```yaml
default_location: oid
source_asset: workmate
source_workorder: workmate
source_files: fileshare
source_timeseries: pi
```

Each of the `source_*` variables should simply be the name of the system the data originates from. The defaults here match the example data set that comes with the templates ("Open Industrial Data", or OID).
The `default_location` is the default location (plant/asset/site) used in the example data set. Here we just use `oid`, but this should be something short and meaningful to your project, like `houston` or `plantY`.
#### infield: cdf_infield_common

```yaml
applicationsconfiguration_source_id: <change_me>
```

Users who are members of this group in your identity provider will be able to configure Infield.
#### infield: cdf_infield_location

```yaml
default_location: oid
module_version: '1'
apm_datamodel_space: APM_SourceData
apm_app_config_external_id: default-infield-config-minimal
apm_config_instance_space: APM_Config
source_asset: workmate
source_workorder: workmate
workorder_raw_db: workorder_oid_workmate
workorder_table_name: workorders
root_asset_external_id: WMT:VAL
infield_default_location_checklist_admin_users_source_id: <change_me>
infield_default_location_normal_users_source_id: <change_me>
infield_default_location_template_admin_users_source_id: <change_me>
infield_default_location_viewer_users_source_id: <change_me>
```
The Infield location module references the `cdf_oid_example_data` module by default. This means that if you change the `source_*` variables there (or create your own example data module), you need to change them here as well. Note also that `workorder_raw_db` and `workorder_table_name` reference the actual database and table name in RAW where the work order data is stored. So, if you changed the `cdf_oid_example_data` module's `default_location` to `my_location` and its `source_workorder` to `my_workorder`, the `workorder_raw_db` would be `workorder_my_location_my_workorder`.
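The naming convention described above can be expressed as a small helper (illustrative only; the toolkit does not expose such a function):

```python
def workorder_raw_db_name(default_location: str, source_workorder: str) -> str:
    # The RAW database name follows the workorder_<location>_<source> pattern
    # described above; keep it in sync with the module variables.
    return f"workorder_{default_location}_{source_workorder}"

# With the defaults from cdf_oid_example_data:
print(workorder_raw_db_name("oid", "workmate"))  # workorder_oid_workmate
```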
You also need to change the `root_asset_external_id` to the root asset of your project, i.e. the asset that is the parent of all other assets. Finally, you need to configure the group IDs from your identity provider for the different Infield roles. These will typically be new groups that you create for Infield use. The corresponding CDF groups will be created automatically when you deploy the configurations.
### Building

Once you have configured what to deploy and changed the variables you need to change, you can build the configurations:

```shell
cdf-tk build --clean --env=dev
```
This will substitute the variables into the templates and create a `build/` directory with the configurations that will be deployed. The `--env=dev` flag specifies that the `config.dev.yaml` file you edited in the previous step should be used.
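Conceptually, the build step replaces placeholders in the module YAML files with the values from `config.dev.yaml`. Here is a simplified illustration of that idea (the placeholder syntax and logic are a sketch, not the toolkit's actual implementation; the real `cdf-tk build` also selects modules, validates configurations, and more):

```python
import re

def substitute_variables(template: str, variables: dict) -> str:
    # Replace {{ name }} placeholders with values from the config file;
    # raise on any placeholder that has no corresponding variable.
    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"Unresolved template variable: {name}")
        return str(variables[name])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, template)

print(substitute_variables(
    "space: {{ apm_datamodel_space }}",
    {"apm_datamodel_space": "APM_SourceData"},
))  # space: APM_SourceData
```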
### Deploying

Finally, you can deploy the configurations to your CDF project. Do a test run first:

```shell
cdf-tk deploy --dry-run --env=dev
```

Then drop the `--dry-run` flag to deploy for real:

```shell
cdf-tk deploy --env=dev
```
The deploy command does a diff against the CDF project and only deploys what has changed, updating the configurations in place. This ensures that run history, logs, etc. are kept. However, if you want to deploy from a clean state, you can remove the existing configurations before deploying by using `--drop`:

```shell
cdf-tk deploy --env=dev --drop
```
You can even add `--drop-data` to also delete all the data managed by the configurations (this is a dangerous operation, similar to `clean` below):

```shell
cdf-tk deploy --env=dev --drop --drop-data
```
If you want to delete everything in your project that is managed by your configurations, use (with caution!):

```shell
cdf-tk clean --dry-run --env=dev
```

Then drop `--dry-run` once you understand what it is going to do.
## Next steps

Once you have tried out the scripts and used the templates to deploy to your CDF project, the next step is to set up a CI/CD pipeline so you can deploy to your staging and production environments as part of your development workflow. The advanced documentation explains in more detail how to use a DevOps approach and build modules for your own projects.