Permissions required: "Modify Datasets"
What are Custom Regex Entities?
A Custom Regex Entity can be used to extract and format spans of text that have a known repetitive structure, such as IDs or reference numbers.
This is a useful option for simple, structured entities with little variation. For entities with significant variation, or where context strongly influences predictions, a machine-learning based entity is the better choice. Both kinds can be combined in any dataset within Communications Mining.
A broader Regex (i.e. a set of rules that defines the entity) can also be used as the base of a custom entity. This combines the rules with contextual, machine-learning based refinement through training within Communications Mining to create sophisticated custom entities. This approach gives the best performance while still restricting the extracted values as automation requires.
Custom Regex Template
A Custom Regex Entity is made up of one or more Custom Regex Templates. Each template expresses one way to extract (and format) the entity.
Combined together, these templates offer a flexible and powerful way to cover multiple representations of the same entity type.
A template is made of two parts:
- The regex (regular expression), which describes the constraints that need to be met by a span of text to be extracted as an entity
- The formatting, which expresses how to normalise the extracted string into a more standard format
For instance, if your customer IDs can be either the word “ID” followed by 7 digits, or an alphanumeric string of 9 characters, you would define two templates, one for each format.
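As a rough illustration, the two customer ID formats could be expressed with patterns like the ones below. This is a sketch in Python's `re` syntax rather than the platform's own output. The exact patterns, the `\b` word boundaries, the reading of "alphanumeric" as ASCII letters and digits, and the helper name `extract_customer_ids` are all assumptions.

```python
import re

# Hypothetical patterns for the two templates (assumptions, not product output).
ID_PREFIX = re.compile(r"\bID\d{7}\b")            # the word "ID" followed by 7 digits
ALPHANUMERIC = re.compile(r"\b[A-Za-z0-9]{9}\b")  # exactly 9 ASCII letters or digits

def extract_customer_ids(text: str) -> list[str]:
    """Return the spans matched by either template, in order of appearance."""
    # A set removes duplicates: "ID1234567" is 9 alphanumeric characters,
    # so it satisfies both patterns.
    spans = {
        (m.start(), m.end(), m.group())
        for template in (ID_PREFIX, ALPHANUMERIC)
        for m in template.finditer(text)
    }
    return [value for _, _, value in sorted(spans)]

print(extract_customer_ids("Customer ID1234567 and account AB12CD34E"))
# ['ID1234567', 'AB12CD34E']
```

Note that the two formats overlap, so a value such as ID1234567 matches both patterns. How the platform resolves such overlaps is not covered here.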
Type-ahead validation
When typing into the text box for either the Regex or the Formatting, the interface provides immediate feedback on the validity of the input. For instance, the invalid Regex input ID\d{} will be flagged as invalid.
Extraction preview
The Custom Regex Template can be tested on text to ensure that it behaves as expected. Any entity that would be extracted with the Template will be shown in a list, with its value, as well as the position of the start and end characters.
For instance, if the Regex is \d{4} and the formatting is ID-{$}, a test string containing a single four-digit number will show one extraction.
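The preview can be approximated offline. The sketch below uses Python's `re` module and a made-up test string. The `preview` helper is hypothetical, the `ID-` prefix is hard-coded to mirror the formatting ID-{$}, and the offsets follow Python's convention (end index exclusive), which may differ from what the interface displays.

```python
import re

def preview(regex: str, text: str) -> list[tuple[str, int, int]]:
    """List (formatted value, start, end) for each match, prefixing values with 'ID-'."""
    return [(f"ID-{m.group()}", m.start(), m.end()) for m in re.finditer(regex, text)]

# Hypothetical test string, not the one shown in the product interface.
rows = preview(r"\d{4}", "Order 1234 shipped")
print(rows)  # [('ID-1234', 6, 10)]
```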
Regex
The regex is the pattern used to extract entities from the text. See the regex syntax documentation for the supported syntax.
Named capture groups can be used to identify a specific section of the extracted string for subsequent formatting. The names of the capture groups should be unique across all templates, and should only contain lowercase letters or digits.
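The naming rules above can be checked before templates are entered. The following is a minimal sketch, assuming capture groups are written in the `(?P<name>...)` form. The helper `check_group_names` is hypothetical and not part of the platform.

```python
import re

GROUP_NAME = re.compile(r"\(\?P<([^>]+)>")  # finds the name in (?P<name>...)
VALID_NAME = re.compile(r"[a-z0-9]+")       # lowercase letters or digits only

def check_group_names(regexes: list[str]) -> list[str]:
    """Report invalid or repeated capture group names across all templates."""
    problems = []
    seen = set()
    for regex in regexes:
        for name in GROUP_NAME.findall(regex):
            if not VALID_NAME.fullmatch(name):
                problems.append(f"invalid name: {name}")
            if name in seen:
                problems.append(f"duplicate name: {name}")
            seen.add(name)
    return problems

print(check_group_names([r"(?P<id1>\d{3})", r"(?P<id1>\d{4})", r"(?P<My_Id>\w+)"]))
# ['duplicate name: id1', 'invalid name: My_Id']
```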
Formatting
Formatting can be provided to post-process the extracted entity.
By default, no formatting is applied and the string returned by the platform will be the string extracted by the regex. However, if needed, more complex transformations can be defined, using the following rules.
Variables
Any named capture group defined in the regex is available in the formatting logic as a variable, prefixed with the $ symbol. Note that the $ symbol by itself represents the full regex match.
Variables can then be used in the formatting string to insert the corresponding extracted span into the value returned by the platform; the variable name needs to be surrounded by { and } braces.
For instance, to extract seven digits as an ID and return them prefixed with ID-, the regex would be \d{7} and the formatting ID-{$}.
Or, using a named capture group, for example (?P<id>\d{7}), the formatting would be ID-{$id}.
Later on, if the platform is given the text: My identification number is 1234567, it will return one entity: ID-1234567.
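To make the variable substitution concrete, here is a minimal Python emulation. It is a sketch only: `apply_formatting` is a hypothetical helper, it handles plain `{$}` and `{$name}` placeholders but none of the operators or functions described later, and the regex \d{7}, the formatting ID-{$} and the group name `id` are inferred from the description above rather than taken from the product.

```python
import re

def apply_formatting(regex: str, formatting: str, text: str) -> list[str]:
    """Emulate `{$}` and `{$name}` substitution for each regex match in `text`."""
    results = []
    for match in re.finditer(regex, text):
        def resolve(var: re.Match) -> str:
            name = var.group(1)
            # A bare `$` stands for the full match; `$name` for a named group.
            return (match.group(name) or "") if name else match.group()
        results.append(re.sub(r"\{\$([a-z0-9]*)\}", resolve, formatting))
    return results

text = "My identification number is 1234567"
print(apply_formatting(r"\d{7}", "ID-{$}", text))             # ['ID-1234567']
print(apply_formatting(r"(?P<id>\d{7})", "ID-{$id}", text))   # ['ID-1234567']
```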
String Operations
Raw strings can be used, and strings can be concatenated using the & symbol.
Regex | (?P<id1>\b\d{3}\b)|(?P<id2>\b\d{4}\b) |
Formatting | {$id1 & "-" & $id2} |
Text | The first id is 123 and the second one is 4567 |
Entity returned by the platform | 123-4567 |
Functions
Some functions can also be used in the formatting to transform the extracted string. The names of the functions and their signatures are inspired by Excel.
Upper
Converts all characters in the extracted span to uppercase:
Regex | \w{3} |
Formatting | {upper($)} |
Text | abc |
Entity returned by the platform | ABC |
Lower
Converts all characters in the extracted span to lowercase:
Regex | \w{3} |
Formatting | {lower($)} |
Text | AbC |
Entity returned by the platform | abc |
Proper
Capitalises the first letter of each word in the extracted span and converts the remaining letters to lowercase:
Regex | \w+\s\w+ |
Formatting | {proper($)} |
Text | albert EINSTEIN |
Entity returned by the platform | Albert Einstein |
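The three case functions behave like their Python string counterparts on the examples above. The sketch below is an approximation under that assumption. In particular, `str.title()` treats apostrophes and digits as word boundaries, and whether the platform does the same is not stated.

```python
def upper(text: str) -> str:
    """Convert every character to uppercase."""
    return text.upper()

def lower(text: str) -> str:
    """Convert every character to lowercase."""
    return text.lower()

def proper(text: str) -> str:
    """Capitalise the first letter of each word and lowercase the rest."""
    return text.title()

print(upper("abc"), lower("AbC"), proper("albert EINSTEIN"))
# ABC abc Albert Einstein
```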
Pad
Pads the extracted span up to a given size with a given character. As the example below shows, the padding is added at the start of the span.
Function arguments:
- The text containing the characters to be padded
- Size of the padded string
- Character to be used for padding
Regex | \d{2,5} |
Formatting | {pad($, 5, "0")} |
Text | 123 |
Entity returned by the platform | 00123 |
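A Python equivalent of the example above is sketched below. It assumes that padding is always added on the left and that spans already at or beyond the target size are returned unchanged. Only the first assumption is confirmed by the example.

```python
def pad(text: str, size: int, fill_char: str) -> str:
    """Left-pad `text` with `fill_char` until it is `size` characters long."""
    # str.rjust requires fill_char to be exactly one character.
    return text.rjust(size, fill_char)

print(pad("123", 5, "0"))  # 00123
```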
Substitute
Replaces characters with other characters.
Function arguments:
- The text containing the characters to be substituted
- The characters to replace
- The characters to replace them with
Regex | ab |
Formatting | {substitute($, "a", "12")} |
Text | ab |
Entity returned by the platform | 12b |
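The example above matches Python's `str.replace`. The sketch assumes that every occurrence is replaced (as Excel's SUBSTITUTE does by default); the single-occurrence example does not confirm this.

```python
def substitute(text: str, old: str, new: str) -> str:
    """Replace each occurrence of `old` in `text` with `new`."""
    return text.replace(old, new)

print(substitute("ab", "a", "12"))  # 12b
```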
Left
Returns the first n characters from the span.
Function arguments:
- The text containing the characters to be extracted
- The number of characters to return
Regex | \w{4} |
Formatting | {left($, 2)} |
Text | ABCD |
Entity returned by the platform | AB |
Right
Returns the last n characters from the span.
Function arguments:
- The text containing the characters to be extracted
- The number of characters to return
Regex | \w{4} |
Formatting | {right($, 2)} |
Text | ABCD |
Entity returned by the platform | CD |
Mid
Returns n characters from the span, starting at the specified position. Positions are counted from 1, so position 2 refers to the second character.
Function arguments:
- The text containing the characters to be extracted
- The position of the first character to return
- The number of characters to return
Regex | \w{5} |
Formatting | {mid($, 2, 3)} |
Text | ABCDE |
Entity returned by the platform | BCD |
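The three slicing functions map onto Python slices as sketched below. The 1-based start position of `mid` is taken from the example above. The handling of out-of-range arguments (clamping rather than raising an error) is an assumption.

```python
def left(text: str, n: int) -> str:
    """Return the first n characters."""
    return text[:max(n, 0)]

def right(text: str, n: int) -> str:
    """Return the last n characters."""
    return text[-n:] if n > 0 else ""

def mid(text: str, start: int, n: int) -> str:
    """Return n characters beginning at the 1-based position `start`."""
    begin = max(start - 1, 0)  # convert to Python's 0-based index
    return text[begin:begin + max(n, 0)]

print(left("ABCD", 2), right("ABCD", 2), mid("ABCDE", 2, 3))
# AB CD BCD
```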