Building custom regex entities

Permissions required: "Modify Datasets"

What are Custom Regex Entities?

A Custom Regex Entity can be used to extract and format spans of text that have a known repetitive structure, such as IDs or reference numbers.

This option suits simple, structured entities with little variation. For entities that vary significantly, or where context strongly influences predictions, a machine-learning based entity is the better choice. Both kinds can be combined in any dataset within Communications Mining.

A broader Regex (i.e. a set of rules that defines the entity) can also be used as the base of a custom entity. This combines the rules with contextual, machine-learning based refinement through training within Communications Mining to create sophisticated custom entities. This approach offers the best performance while still enforcing the restrictions on extracted values that automation requires.

Custom Regex Template

A Custom Regex Entity is made up of one or more Custom Regex Templates. Each template expresses one way to extract (and format) the entity.

Combined together, these templates offer a flexible and powerful way to cover multiple representations of the same entity type.

A template is made of two parts:

The regex (regular expression), which describes the constraints that need to be met by a span of text to be extracted as an entity
The formatting, which expresses how to normalise the extracted string into a more standard format

For instance, if your customer IDs can be either the word “ID” followed by 7 digits, or an alphanumeric string of 9 characters, here is what your two templates will look like:
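As a minimal sketch of how two such templates could behave, the Python snippet below uses the standard `re` module as a stand-in for the platform's regex engine, which may differ in syntax details. The two patterns, the `TEMPLATES` list and the `extract_ids` helper are assumptions made for illustration, not the templates configured in the product.

```python
import re

# Hypothetical patterns for the two customer ID shapes described above.
TEMPLATES = [
    re.compile(r"\bID\d{7}\b"),         # the word "ID" followed by 7 digits
    re.compile(r"\b[A-Za-z0-9]{9}\b"),  # an alphanumeric string of 9 characters
]

def extract_ids(text):
    """Return every span matched by any template, in order of appearance."""
    found = set()
    for pattern in TEMPLATES:
        for match in pattern.finditer(text):
            # A span matched by both templates is only kept once.
            found.add((match.start(), match.group(0)))
    return [value for _, value in sorted(found)]

print(extract_ids("Customer ID1234567 wrote about account AB12CD34E."))
# ['ID1234567', 'AB12CD34E']
```

Note that `ID1234567` is itself nine alphanumeric characters, so both patterns match it; the sketch keeps a single copy of that span.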

Type-ahead validation

When typing into the text box for either the Regex or the Formatting, the interface will provide immediate feedback on the validity of the input. For instance, the Regex input ID\d{} is reported as invalid, because the repetition count inside the braces is missing.

Extraction preview

The Custom Regex Template can be tested on text to ensure that it behaves as expected. Any entity that would be extracted with the Template will be shown in a list, with its value, as well as the position of the start and end characters.

For instance, if the Regex is \d{4} and the formatting ID-{$}, the following test string will show one extraction:
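The preview can be approximated with the short Python sketch below. It assumes Python's `re` module as a stand-in for the platform's engine, hard-codes the `ID-` prefix to mimic the formatting `ID-{$}`, and uses a made-up test string. The `preview` helper is hypothetical, and the end offset it reports is exclusive, which may not match the convention used in the interface.

```python
import re

def preview(regex, text, prefix="ID-"):
    """List each match with its formatted value and start/end character offsets."""
    rows = []
    for match in re.finditer(regex, text):
        rows.append({
            "value": prefix + match.group(0),  # mimics the formatting ID-{$}
            "start": match.start(),            # offset of the first matched character
            "end": match.end(),                # offset just past the last matched character
        })
    return rows

# Hypothetical test string containing a single four-digit run.
print(preview(r"\d{4}", "Ref 1234 received"))
# [{'value': 'ID-1234', 'start': 4, 'end': 8}]
```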

Regex

The regex is the pattern used to extract entities in the text. See here for the syntax documentation.

Named capture groups can be used to identify a specific section of the extracted string for subsequent formatting. The names of the capture groups should be unique across all templates, and should only contain lowercase letters or digits.
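The naming rules can be checked before templates are entered, as in the rough sketch below. It relies on Python's `re` module to parse the patterns, which is an assumption: the platform's own parser may accept or reject slightly different syntax. The `check_group_names` helper is hypothetical.

```python
import re

# Names may only contain lowercase letters or digits.
VALID_NAME = re.compile(r"[a-z0-9]+")

def check_group_names(template_regexes):
    """Return a list of problems found in the named capture groups."""
    problems = []
    seen = {}  # group name -> index of the template that first used it
    for index, pattern in enumerate(template_regexes):
        for name in re.compile(pattern).groupindex:
            if not VALID_NAME.fullmatch(name):
                problems.append(f"template {index}: invalid name '{name}'")
            if name in seen:
                problems.append(
                    f"template {index}: '{name}' already used in template {seen[name]}"
                )
            else:
                seen[name] = index
    return problems

print(check_group_names([r"(?P<id>\d{3})", r"(?P<id>\d{4})"]))
# ["template 1: 'id' already used in template 0"]
```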

Formatting

Formatting can be provided to post-process the extracted entity.

By default, no formatting is applied and the string returned by the platform will be the string extracted by the regex. However, if needed, more complex transformations can be defined, using the following rules.

Variables

Any named capture group defined in the regex will be available to use in the formatting logic as a variable, prefixed with the $ symbol. Note that the $ symbol by itself represents the full regex match.

Variables can then be used in the formatting string to insert the corresponding extracted span into the value returned by the platform; the variable name needs to be surrounded by { and } braces.

For instance, if we want to extract seven digits as an ID, and return these seven digits prefixed with ID-, then the regex and the formatting would be:

Regex	\d{7}
Formatting	ID-{$}

Or, using a named capture group:

Regex	(?P<id>\d{7})
Formatting	ID-{$id}

Later on, if the platform is given the text: My identification number is 1234567, it will return one entity: ID-1234567.
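The variable substitution described in this section can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: `re` stands in for the platform's engine, the patterns `\d{7}` and `(?P<id>\d{7})` are inferred from the "seven digits" description, and the `apply_template` helper handles only plain `{$}` and `{$name}` placeholders, not string operations or functions.

```python
import re

# Matches {$} or {$name} placeholders in a formatting string.
PLACEHOLDER = re.compile(r"\{\$([a-z0-9]*)\}")

def apply_template(regex, formatting, text):
    """Return the formatted value for each regex match found in text."""
    results = []
    for match in re.finditer(regex, text):
        def fill(placeholder):
            name = placeholder.group(1)
            # An empty name means "$" on its own: the full regex match.
            return match.group(name) if name else match.group(0)
        results.append(PLACEHOLDER.sub(fill, formatting))
    return results

text = "My identification number is 1234567"
print(apply_template(r"\d{7}", "ID-{$}", text))            # ['ID-1234567']
print(apply_template(r"(?P<id>\d{7})", "ID-{$id}", text))  # ['ID-1234567']
```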

String Operations

Raw strings can be used, and strings can be concatenated using the & symbol.

Regex	(?P<id1>\b\d{3}\b)|(?P<id2>\b\d{4}\b)
Formatting	{$id1 & "-" & $id2}
Text	The first id is 123 and the second one is 4567
Entity returned by the platform	123-4567

Functions

Some functions can also be used in the formatting to transform the extracted string. The names of the functions and their signatures are inspired by Excel.

Upper

Converts all characters in the extracted span to uppercase:

Regex	\w{3}
Formatting	{upper($)}
Text	abc
Entity returned by the platform	ABC

Lower

Converts all characters in the extracted span to lowercase:

Regex	\w{3}
Formatting	{lower($)}
Text	AbC
Entity returned by the platform	abc

Proper

Capitalises the first letter of each word in the extracted span and converts the remaining letters to lowercase:

Regex	\w+\s\w+
Formatting	{proper($)}
Text	albert EINSTEIN
Entity returned by the platform	Albert Einstein

Pad

Pads the extracted span up to a given size with a given character. As the example below shows, the padding characters are added to the left of the span.

Function arguments:

The text containing the characters to be padded
Size of the padded string
Character to be used for padding

Regex	\d{2,5}
Formatting	{pad($, 5, "0")}
Text	123
Entity returned by the platform	00123

Substitute

Replaces characters with other characters.

Function arguments:

The text containing the characters to be substituted
What characters to replace
What the old characters should be replaced with

Regex	ab
Formatting	{substitute($, "a", "12")}
Text	ab
Entity returned by the platform	12b

Left

Returns the first n characters from the span.

Function arguments:

The text containing the characters to be extracted
The number of characters to return

Regex	\w{4}
Formatting	{left($, 2)}
Text	ABCD
Entity returned by the platform	AB

Right

Returns the last n characters from the span.

Function arguments:

The text containing the characters to be extracted
The number of characters to return

Regex	\w{4}
Formatting	{right($, 2)}
Text	ABCD
Entity returned by the platform	CD

Mid

Returns n characters from the span, starting at the specified position. As the example below shows, positions are counted from 1.

Function arguments:

The text containing the characters to be extracted
The position of the first character to return (the first character of the text is at position 1)
The number of characters to return

Regex	\w{5}
Formatting	{mid($, 2, 3)}
Text	ABCDE
Entity returned by the platform	BCD
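For readers more familiar with Python than Excel, the functions above map roughly onto built-in string operations. The definitions below are an illustrative sketch, not the platform's implementation; behaviour in edge cases (for example a span already longer than the pad size, or a count larger than the span) is an assumption and may differ on the platform.

```python
def upper(text):
    return text.upper()

def lower(text):
    return text.lower()

def proper(text):
    # Capitalise the first letter of each word, lowercase the rest.
    return text.title()

def pad(text, size, char):
    # Left-pad with a single character; longer text is returned unchanged.
    return text.rjust(size, char)

def substitute(text, old, new):
    return text.replace(old, new)

def left(text, n):
    return text[:n]

def right(text, n):
    # text[-0:] would return the whole string, so handle n == 0 explicitly.
    return text[-n:] if n > 0 else ""

def mid(text, start, n):
    # start is 1-based, as in the example above.
    return text[start - 1:start - 1 + n]

print(upper("abc"))                  # ABC
print(lower("AbC"))                  # abc
print(proper("albert EINSTEIN"))     # Albert Einstein
print(pad("123", 5, "0"))            # 00123
print(substitute("ab", "a", "12"))   # 12b
print(left("ABCD", 2))               # AB
print(right("ABCD", 2))              # CD
print(mid("ABCDE", 2, 3))            # BCD
```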
