Entity matching
Use entity matching to contextualize your data with machine learning (ML) and rules engines, and then let domain experts validate and fine-tune the results.
Different sources of industrial data can use different naming standards when they refer to the same entity. With CDF, you can match entities such as time series, files, sequences, and 3D models from different source systems to assets.
About entity matching
Assets in CDF connect related data from different sources, such as time series, files, sequences, and events. The data often use different naming conventions, even when they refer to the same entity. Entity matching applies artificial intelligence techniques to automatically match the different resources according to name, description, etc.
See the entity matching API documentation for more information on working with entity matching.
Example data
This tutorial uses a small example data set to show how to do entity matching. The data is given as Python literals, and you can convert it to whatever format fits your programming language.
sources = [
{"id":0, "name" : "KKL_21AA1019CA.PV", "description": "correct"},
{"id":1, "name" : "KKL_13FV1234BU.VW", "description": "ok"}
]
targets = [
{"id":0, "name" : "21AA1019CA", "description": "correct"},
{"id":1, "name" : "21AA1019CA", "description": "wrong"},
{"id":2, "name" : "13FV1234BU"},
{"id":3, "name" : "13FV1234BU", "description": "ok"}
]
true_matches = [{"sourceId": 0,"targetId": 0}]
Fit a supervised ML model and predict for the same data
The supervised model calculates one or more similarity measures between source and target items. It then uses these similarity measures as features and fits a classification model on the labeled data.
Note that a set of candidate matches is selected before the similarity measures are calculated and the model is trained. To be considered a candidate, a source and target pair must have at least one token in common, and only the candidate combinations are used in training. This reduces computing time: calculating similarity measures for all possible combinations can be extremely expensive (10,000 time series and 30,000 assets give 300,000,000 combinations).
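The candidate selection step can be sketched in Python. The exact tokenizer CDF uses isn't specified here, so the alphanumeric split below is an assumption for illustration:

```python
import re

def tokenize(value):
    # Assumed tokenization for illustration: runs of letters and runs of digits,
    # so "KKL_21AA1019CA.PV" becomes {"KKL", "21", "AA", "1019", "CA", "PV"}.
    return set(re.findall(r"[A-Za-z]+|\d+", value))

def candidate_pairs(sources, targets):
    # Keep only source/target pairs that share at least one token; similarity
    # measures are computed for these candidates, never for the full cross product.
    pairs = []
    for source in sources:
        source_tokens = tokenize(source["name"])
        for target in targets:
            if source_tokens & tokenize(target["name"]):
                pairs.append((source["id"], target["id"]))
    return pairs

# A subset of the tutorial data:
sources = [{"id": 0, "name": "KKL_21AA1019CA.PV"}]
targets = [{"id": 0, "name": "21AA1019CA"}, {"id": 2, "name": "13FV1234BU"}]
```

Here only the pair (0, 0) survives, since the second target shares no token with the source.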
When to use a supervised ML model?
A supervised ML model is applicable when you have many labeled data items; the more labeled data you have, the better results you can achieve. Don't apply a supervised model if you have fewer than 500 labeled data items.
Create a model
POST /api/v1/projects/publicdata/context/entitymatching
Content-Type: application/json
{
"sources":sources,
"targets":targets,
"trueMatches":true_matches,
}
Predict
When predict is called without any data, predictions are made on the training data. The numMatches parameter determines the number of matches to return for each source item. The default is 1.
POST /api/v1/projects/publicdata/context/entitymatching/predict
Content-Type: application/json
{
"id": modelId,
"numMatches": 3,
"scoreThreshold": 0.7
}
Get matching results
Replace <job_id> in the following request with the jobId returned by the response above.
GET /api/v1/projects/publicdata/context/entitymatching/jobs/<job_id>
Content-Type: application/json
Example output:
[
{
"matchFrom": {
"description": "correct",
"id": 0,
"name": "KKL_21AA1019CA.PV"
},
"matches": [
{
"matchTo": { "description": "correct", "id": 0, "name": "21AA1019CA" },
"score": 0.35,
"target": { "description": "correct", "id": 0, "name": "21AA1019CA" }
},
{
"matchTo": { "description": "wrong", "id": 1, "name": "21AA1019CA" },
"score": 0.35,
"target": { "description": "wrong", "id": 1, "name": "21AA1019CA" }
}
],
"source": { "description": "correct", "id": 0, "name": "KKL_21AA1019CA.PV" }
},
{
"matchFrom": { "description": "ok", "id": 1, "name": "KKL_13FV1234BU.VW" },
"matches": [
{
"matchTo": { "id": 2, "name": "13FV1234BU" },
"score": 0.35,
"target": { "id": 2, "name": "13FV1234BU" }
},
{
"matchTo": { "description": "ok", "id": 3, "name": "13FV1234BU" },
"score": 0.35,
"target": { "description": "ok", "id": 3, "name": "13FV1234BU" }
}
],
"source": { "description": "ok", "id": 1, "name": "KKL_13FV1234BU.VW" }
}
]
For both source items, the two matches have an equal score, and the model isn't able to distinguish between the correct and incorrect match. The scores are also relatively low. When the data set is this small, unsupervised learning makes more sense.
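Prediction runs as an asynchronous job, so clients typically poll the jobs endpoint until the status is Completed. A minimal polling helper, assuming any callable that performs the GET request and returns the parsed body (the function name and parameters are illustrative):

```python
import time

def wait_for_job(get_job_status, job_id, poll_interval=1.0, max_attempts=60):
    # get_job_status is any callable that performs
    # GET .../context/entitymatching/jobs/<job_id> and returns the parsed body.
    for _ in range(max_attempts):
        job = get_job_status(job_id)
        if job["status"] in ("Completed", "Failed"):
            return job
        time.sleep(poll_interval)
    raise TimeoutError(f"Job {job_id} did not finish in time")
```

A back-off between polls keeps the helper from hammering the API while a queued job waits to start.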
Refit
Refit lets you retrain a model, using the same parameters, with extra labels (true matches). The new true match (0,3) is added to the true_matches list from the original model. To fit a model using only the (0,3) label, you must train a new model with fit.
true_matches=[{"sourceId": 0,"targetId": 3}]
POST /api/v1/projects/publicdata/context/entitymatching/refit
Content-Type: application/json
{
"id": modelId
"trueMatches":true_matches,
}
POST /api/v1/projects/publicdata/context/entitymatching/predict
Content-Type: application/json
{
"id": modelId,
"numMatches": 3,
"scoreThreshold": 0.7
}
GET /api/v1/projects/publicdata/context/entitymatching/jobs/<job_id>
Content-Type: application/json
In this example, the results are the same: the new true match follows the exact same pattern as the original.
Fit an unsupervised ML model
If no true_matches are included in the fit call, an unsupervised model is applied.
As with supervised models, candidates are selected and similarity measures between the candidates are calculated. Instead of training a classification model, the average of the similarity measures is calculated and returned as the score.
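This averaging scheme can be sketched in a few lines. The actual similarity measures aren't specified here, so the Jaccard token overlap below stands in for them as an assumption:

```python
import re

def tokens(value):
    # Assumed tokenization: runs of letters and runs of digits.
    return set(re.findall(r"[A-Za-z]+|\d+", value))

def jaccard(a, b):
    # One possible similarity measure: token-set overlap.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def unsupervised_score(source, target, fields=(("name", "name"),)):
    # Without labels there is no classifier to train, so the score is
    # simply the mean of the per-field similarity measures.
    sims = [jaccard(source.get(sf, ""), target.get(tf, "")) for sf, tf in fields]
    return sum(sims) / len(sims)
```

With this stand-in measure, KKL_21AA1019CA.PV scores 4/6 against 21AA1019CA (four shared tokens out of six distinct ones) and 0 against an unrelated name.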
When to use an unsupervised ML model?
Use an unsupervised ML model when there are no or few true matches (labeled data).
Create a model
POST /api/v1/projects/publicdata/context/entitymatching/
Content-Type: application/json
{
"sources":sources,
"targets":targets,
"name":"simple_model_1",
"description":"Simple model 1",
"featureType":"simple",
"classifier":"randomforest"
}
Example response:
{
"classifier": "randomforest",
"createdTime": 1606151476539,
"description": "Simple model 1",
"externalId": "None",
"featureType": "simple",
"id": 7895111848480381,
"ignoreMissingFields": True,
"matchFields": [
{
"source": "name",
"target": "name"
},
{
"source": "description",
"target": "description"
}
],
"name": "simple_model_1",
"startTime": "None",
"status": "Queued",
"statusTime": 1606151476539
}
Create a matching job
In the following POST body, replace id with the id returned by the response above. Use numMatches to set the maximum number of matches returned per source item, and scoreThreshold to filter out matches with lower scores.
POST /api/v1/projects/publicdata/context/entitymatching/predict
Content-Type: application/json
{
"id": 7895111848480381,
"numMatches": 3,
"scoreThreshold": 0.7
}
Example response:
{
"createdTime":1606151691970,
"jobId":6147120367590349,
"startTime":"None",
"status":"Queued",
"statusTime":1606151691970
}
Get matching results
In the following request, replace <job_id> with the jobId returned by the response above.
GET /api/v1/projects/publicdata/context/entitymatching/jobs/<job_id>
Content-Type: application/json
Example response:
{'createdTime': 1606216545060,
'items': [{'matchFrom': {'description': 'correct',
'id': 0,
'name': 'KKL_21AA1019CA.PV'},
'matches': [{'matchTo': {'description': 'correct',
'id': 0,
'name': '21AA1019CA'},
'score': 1.0,
'target': {'description': 'correct', 'id': 0, 'name': '21AA1019CA'}},
{'matchTo': {'description': 'wrong', 'id': 1, 'name': '21AA1019CA'},
'score': 1.0,
'target': {'description': 'wrong', 'id': 1, 'name': '21AA1019CA'}}],
'source': {'description': 'correct', 'id': 0, 'name': 'KKL_21AA1019CA.PV'}},
{'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
'matches': [{'matchTo': {'id': 2, 'name': '13FV1234BU'},
'score': 1.0,
'target': {'id': 2, 'name': '13FV1234BU'}},
{'matchTo': {'description': 'ok', 'id': 3, 'name': '13FV1234BU'},
'score': 1.0,
'target': {'description': 'ok', 'id': 3, 'name': '13FV1234BU'}}],
'source': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'}}],
'jobId': 6147120367590349,
'startTime': 1606216546372,
'status': 'Completed',
'statusTime': 1606216546552}
Add the matchFields parameter
By default, only name in sources and name in targets is used to calculate similarity measures. The matchFields parameter lets you specify all the combinations of fields in sources and targets that should be used to calculate features.
In this example, it looks like also comparing the description field of sources and targets could improve the model.
Calculating similarity measures can be time-consuming, so don't use matchFields combinations that add little or no information to the model.
POST /api/v1/projects/publicdata/context/entitymatching/
Content-Type: application/json
{
"sources":sources,
"targets":targets,
"name":"simple_model_1",
"description":"Simple model 1",
"featureType":"simple",
"matchFields":[
{
"source":"name",
"target":"name"
},
{
"source":"description",
"target":"description"
}
],
"classifier":"randomforest",
}
The request above results in an error because one of the items in targets is missing description. If ignoreMissingFields is set to true, empty strings replace the missing values.
Add ignoreMissingFields
To avoid errors when items in sources or targets have missing values, add "ignoreMissingFields": true.
POST /api/v1/projects/publicdata/context/entitymatching/
Content-Type: application/json
{
"sources":sources,
"targets":targets,
"name":"simple_model_1",
"description":"Simple model 1",
"featureType":"simple",
"matchFields":[
{
"source":"name",
"target":"name"
},
{
"source":"description",
"target":"description"
}
],
"classifier":"randomforest",
"ignoreMissingFields":true
}
The model gives the correct matches a score of 1 and the incorrect matches a score of 0.5.
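The effect of ignoreMissingFields can be pictured as a preprocessing step that fills the gaps before similarity measures are computed. A minimal sketch (the server-side mechanism is only described above, not shown, so this is an assumption for illustration):

```python
def fill_missing(items, fields=("name", "description")):
    # Replace absent field values with empty strings so that every
    # candidate pair has something to compare, instead of raising an error.
    filled = []
    for item in items:
        copy = dict(item)
        for field in fields:
            copy.setdefault(field, "")
        filled.append(copy)
    return filled

# The target with id 2 has no description in the tutorial data:
targets = [
    {"id": 2, "name": "13FV1234BU"},
    {"id": 3, "name": "13FV1234BU", "description": "ok"},
]
```

An empty string simply contributes zero similarity for that field, which is why the incorrect matches end up with a lower score rather than an error.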
Using feature types
By default, featureType is set to simple. The different feature types are designed to improve the model's accuracy for different kinds of input data, and which feature type works best depends on what your data looks like. The options for featureType are "simple", "bigram", "frequency-weighted-bigram", "bigram-extra-tokenizers", and "bigram-combo". This section illustrates the strengths and weaknesses of each feature type.
When to use featureType=simple?
"Simple" is the default feature type and is preferred when one string is close to being a substring of the other. For example, "BCDEF" is a substring of "ABCDEFG", and the remaining characters "A" and "G" don't appear in "BCDEF". This is also the fastest option.
Limitations of the simple feature type
The data below is the same as in the earlier examples, except for two new items in the targets.
Ids 10 and 13 are similar to ids 0 and 3, respectively, but the first letter combination ("AA" and "FV") is swapped with the prefix of the source items (KKL).
This leads to difficulties when using the "simple" feature type.
sources = [
{"id":0, "name" : "KKL_21AA1019CA.PV", "description": "correct"},
{"id":1, "name" : "KKL_13FV1234BU.VW", "description": "ok"}
]
targets = [
{"id":0, "name" : "21AA1019CA", "description": "correct"},
{"id":10, "name" : "21KKL1019CA", "description": "correct"},
{"id":1, "name" : "21AA1119CA", "description": "wrong"},
{"id":2, "name" : "13FV1234BU"},
{"id":3, "name" : "13FV1334BU", "description": "ok"},
{"id":13, "name" : "13KKL1234BU", "description": "ok"}
]
POST /api/v1/projects/publicdata/context/entitymatching/
Content-Type: application/json
{
"sources":sources,
"targets":targets,
"name":"simple_model_1",
"description":"Simple model 1",
"featureType":"simple",
"matchFields":[
{
"source":"name",
"target":"name"
}
],
"classifier":"randomforest",
}
Example results:
{'createdTime': 1606153318209,
'items': [{'matchFrom': {'description': 'correct',
'id': 0,
'name': 'KKL_21AA1019CA.PV'},
'matches': [{'matchTo': {'description': 'correct',
'id': 0,
'name': '21AA1019CA'},
'score': 0.8944271909999159,
'target': {'description': 'correct', 'id': 0, 'name': '21AA1019CA'}},
{'matchTo': {'description': 'correct', 'id': 10, 'name': '21KKL1019CA'},
'score': 0.8944271909999159,
'target': {'description': 'correct', 'id': 10, 'name': '21KKL1019CA'}}],
'source': {'description': 'correct', 'id': 0, 'name': 'KKL_21AA1019CA.PV'}},
{'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
'matches': [{'matchTo': {'id': 2, 'name': '13FV1234BU'},
'score': 0.8944271909999159,
'target': {'id': 2, 'name': '13FV1234BU'}},
{'matchTo': {'description': 'ok', 'id': 13, 'name': '13KKL1234BU'},
'score': 0.8944271909999159,
'target': {'description': 'ok', 'id': 13, 'name': '13KKL1234BU'}}],
'source': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'}}],
'jobId': 5049964554908808,
'startTime': 1606153318586,
'status': 'Completed',
'statusTime': 1606153318777}
The new target items have scores identical to the correct matches. This is because with featureType=simple, only the number of matching tokens is considered.
In the target item with id 0, the tokens 21, AA, 1019, and CA each match a token in the source item with id 0. In the target item with id 10, the tokens 21, KKL, 1019, and CA each match a token in the same source item. The number of matching tokens is therefore the same.
The model doesn't consider that the target item with id 0 has more and longer contiguous sequences of matching tokens.
When to use featureType=bigram?
The "bigram" feature type does account for sequences of tokens. In addition to counting the number of matching tokens, it counts the number of matching bigrams, that is, the tokens formed when each pair of adjacent tokens is joined.
sources = [
{"id":0, "name" : "KKL_21AA1019CA.PV", "description": "correct"},
{"id":1, "name" : "KKL_13FV1234BU.VW", "description": "ok"}
]
targets = [
{"id":0, "name" : "21AA1019CA", "description": "correct"},
{"id":10, "name" : "21KKL1019CA", "description": "correct"},
{"id":1, "name" : "21AA1119CA", "description": "wrong"},
{"id":2, "name" : "13FV1234BU"},
{"id":3, "name" : "13FV1334BU", "description": "ok"},
{"id":13, "name" : "13KKL1234BU", "description": "ok"}
]
POST /api/v1/projects/publicdata/context/entitymatching/
Content-Type: application/json
{
"sources":sources,
"targets":targets,
"name":"bigram_model_1",
"description":"bigram model 1",
"featureType":"bigram",
"matchFields":[
{
"source":"name",
"target":"name"
}
],
"classifier":"randomforest",
}
Example results:
{'createdTime': 1606153733720,
'items': [{'matchFrom': {'description': 'correct',
'id': 0,
'name': 'KKL_21AA1019CA.PV'},
'matches': [{'matchTo': {'description': 'correct',
'id': 0,
'name': '21AA1019CA'},
'score': 0.9149207688467006,
'target': {'description': 'correct', 'id': 0, 'name': '21AA1019CA'}}],
'source': {'description': 'correct', 'id': 0, 'name': 'KKL_21AA1019CA.PV'}},
{'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
'matches': [{'matchTo': {'id': 2, 'name': '13FV1234BU'},
'score': 0.9149207688467006,
'target': {'id': 2, 'name': '13FV1234BU'}}],
'source': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'}}],
'jobId': 6677803085358639,
'startTime': 1606153733995,
'status': 'Completed',
'statusTime': 1606153734183}
When to use featureType=frequency-weighted-bigram?
The "frequency-weighted-bigram" feature type calculates a similarity score based on the sequence of the tokens and gives higher weight to less commonly occurring tokens. This can be helpful when the "simple" feature type doesn't return useful results.
sources = [
{"id":0, "name" : "KKL_21AA1019CA.PV", "description": "correct"},
{"id":1, "name" : "KKL_13FV1234BU.VW", "description": "ok"}
]
targets = [
{"id":0, "name" : "21AA1019CA", "description": "correct"},
{"id":10, "name" : "21KKL1019CA", "description": "correct"},
{"id":1, "name" : "21AA1119CA", "description": "wrong"},
{"id":2, "name" : "13FV1234BU"},
{"id":3, "name" : "13FV1334BU", "description": "ok"},
{"id":13, "name" : "13KKL1234BU", "description": "ok"}
]
POST /api/v1/projects/publicdata/context/entitymatching/
Content-Type: application/json
{
"sources":sources,
"targets":targets,
"name":"fbw_model_1",
"description":"fwb model 1",
"featureType":"Frequency-Weighted-Bigram",
"matchFields":[
{
"source":"name",
"target":"name"
}
],
"classifier":"randomforest",
}
Example results:
{'createdTime': 1606153836538,
'items': [{'matchFrom': {'description': 'correct',
'id': 0,
'name': 'KKL_21AA1019CA.PV'},
'matches': [{'matchTo': {'description': 'correct',
'id': 0,
'name': '21AA1019CA'},
'score': 0.9149207688467006,
'target': {'description': 'correct', 'id': 0, 'name': '21AA1019CA'}}],
'source': {'description': 'correct', 'id': 0, 'name': 'KKL_21AA1019CA.PV'}},
{'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
'matches': [{'matchTo': {'id': 2, 'name': '13FV1234BU'},
'score': 0.9149207688467006,
'target': {'id': 2, 'name': '13FV1234BU'}}],
'source': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'}}],
'jobId': 5204435061404568,
'startTime': 1606153836790,
'status': 'Completed',
'statusTime': 1606153837071}
When to use featureType=bigram-extra-tokenizers?
The "bigram-extra-tokenizers" feature type is similar to "bigram", but can learn that leading zeros and spaces should be ignored in matching, for example when matching KKL_21AA1019CA.PV against 000021AA1019CA.
sources = [
{"id":0, "name" : "KKL_21AA1019CA.PV", "description": "correct"},
{"id":1, "name" : "KKL_13FV1234BU.VW", "description": "ok"}
]
targets = [
{"id":0, "name" : "000021AA1019CA", "description": "correct"},
{"id":10, "name" : "21KKL1019CA", "description": "correct"},
{"id":1, "name" : "21AA1119CA", "description": "wrong"},
{"id":2, "name" : "000013FV1234BU"},
{"id":3, "name" : "13FV1334BU", "description": "ok"},
{"id":13, "name" : "13KKL1234BU", "description": "ok"}
]
POST /api/v1/projects/publicdata/context/entitymatching/
Content-Type: application/json
{
"sources":sources,
"targets":targets,
"name":"BET_model_1",
"description":"BET model 1",
"featureType":"Bigram-Extra-Tokenizers",
"matchFields":[
{
"source":"name",
"target":"name"
}
],
"classifier":"randomforest",
}
Example results:
{'createdTime': 1606154023578,
'items': [{'matchFrom': {'description': 'correct',
'id': 0,
'name': 'KKL_21AA1019CA.PV'},
'matches': [{'matchTo': {'description': 'correct',
'id': 0,
'name': '000021AA1019CA'},
'score': 0.8477338488399959,
'target': {'description': 'correct', 'id': 0, 'name': '000021AA1019CA'}}],
'source': {'description': 'correct', 'id': 0, 'name': 'KKL_21AA1019CA.PV'}},
{'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
'matches': [{'matchTo': {'id': 2, 'name': '000013FV1234BU'},
'score': 0.8477338488399959,
'target': {'id': 2, 'name': '000013FV1234BU'}}],
'source': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'}}],
'jobId': 6675842531673047,
'startTime': 1606154023915,
'status': 'Completed',
'statusTime': 1606154024190}
When to use featureType=bigram-combo?
The "bigram-combo" feature type calculates all of the above feature types and relies on the model to find the appropriate features to use. This is the slowest option and is mainly suitable for supervised models.
POST /api/v1/projects/publicdata/context/entitymatching/
Content-Type: application/json
{
"sources":sources,
"targets":targets,
"name":"BC_model_1",
"description":"BC model 1",
"featureType":"Bigram-Combo",
"matchFields":[
{
"source":"name",
"target":"name"
}
],
"classifier":"randomforest",
}
Example results:
{'createdTime': 1606154392343,
'items': [{'matchFrom': {'description': 'correct',
'id': 0,
'name': 'KKL_21AA1019CA.PV'},
'matches': [{'matchTo': {'description': 'correct',
'id': 0,
'name': '21AA1019CA'},
'score': 0.8195704160778803,
'target': {'description': 'correct', 'id': 0, 'name': '21AA1019CA'}}],
'source': {'description': 'correct', 'id': 0, 'name': 'KKL_21AA1019CA.PV'}},
{'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
'matches': [{'matchTo': {'id': 2, 'name': '13FV1234BU'},
'score': 0.8195704160778803,
'target': {'id': 2, 'name': '13FV1234BU'}}],
'source': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'}}],
'jobId': 4397635623636246,
'startTime': 1606154392670,
'status': 'Completed',
'statusTime': 1606154392848}
About the score
A score above 0.8 indicates that the source and the target match with high probability.
Note that a score between 0.5 and 0.8 doesn't indicate that the source and target match with more than 50% probability. The example below illustrates that even if the source and the target don't match at all, they can still receive a score above 0.6:
sources = [{"id":0, "name" : "J04_ONSTREAM_HOUR_AVG", "description": "correct"}]
targets = [{"id":0, "name" : "87-JB-004-J04", "description": "correct"}]
POST /api/v1/projects/publicdata/context/entitymatching/
Content-Type: application/json
{
"sources":sources,
"targets":targets,
"name":"bigram_model_1",
"description":"bigram model 1",
"featureType":"bigram",
"matchFields":[
{
"source":"name",
"target":"name"
}
],
"classifier":"randomforest",
}
POST /api/v1/projects/publicdata/context/entitymatching/predict
Content-Type: application/json
{
"id": 6147120367590349,
"numMatches": 3,
"scoreThreshold": 0.5
}
GET /api/v1/projects/publicdata/context/entitymatching/jobs/<job_id>
Content-Type: application/json
Example matching results
{'createdTime': 1606154525247,
'items': [{'matchFrom': {'description': 'correct',
'id': 0,
'name': 'J04_ONSTREAM_HOUR_AVG'},
'matches': [{'matchTo': {'description': 'correct',
'id': 0,
'name': '87-JB-004-J04'},
'score': 0.6049029006116509,
'target': {'description': 'correct', 'id': 0, 'name': '87-JB-004-J04'}}],
'source': {'description': 'correct',
'id': 0,
'name': 'J04_ONSTREAM_HOUR_AVG'}}],
'jobId': 1019931106940009,
'startTime': 1606154525580,
'status': 'Completed',
'statusTime': 1606154525758}
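The nonzero score here comes from a single overlap: with an alphanumeric tokenization (an assumption for illustration), J04 is the only token the two names share, yet the score still lands above 0.6. Treat the score as a ranking signal rather than a calibrated match probability:

```python
import re

def tokens(value):
    # Assumed tokenization: runs of alphanumeric characters.
    return set(re.findall(r"[A-Za-z0-9]+", value))

# The only token the two names have in common is J04.
shared = tokens("J04_ONSTREAM_HOUR_AVG") & tokens("87-JB-004-J04")
```

This is why a fixed scoreThreshold should be chosen by inspecting scores on your own data, not interpreted as a probability cutoff.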
Get model info
If you have a model id and want to know which parameters were used when training the model, retrieve the model. Replace <model_id> in the following request with the model id.
GET /api/v1/projects/publicdata/context/entitymatching/<model_id>
Content-Type: application/json