PLEASE NOTE: UiPath Communications Mining's Knowledge Base has been fully migrated to UiPath Docs. Please navigate to the equivalent articles in UiPath Docs for up-to-date guidance, as this site will no longer be updated or maintained.

Building custom regex entities

Permissions required: "Modify Datasets"


What are Custom Regex Entities?


A Custom Regex Entity can be used to extract and format spans of text that have a known repetitive structure, such as IDs or reference numbers.


This is a useful option for simple, structured entities with little variation. For entities with significant variation, or where context strongly influences predictions, a machine-learning based entity is the better choice. Both kinds can be combined in any dataset within Communications Mining.


A broader Regex (i.e. a set of rules that defines the entity) can also be used as the base of a custom entity. This combines the rules with contextual, machine-learning based refinement through training within Communications Mining to create sophisticated custom entities. This approach gives the best performance while still restricting the extracted values as needed for automation.

 


Custom Regex Template


A Custom Regex Entity is made up of one or more Custom Regex Templates. Each template expresses one way to extract (and format) the entity. 

Combined together, these templates offer a flexible and powerful way to cover multiple representations of the same entity type.

 

A template is made of two parts:


  1. The regex (regular expression), which describes the constraints that need to be met by a span of text to be extracted as an entity
  2. The formatting, which expresses how to normalise the extracted string into a more standard format

 

For instance, if your customer IDs can be either the word “ID” followed by 7 digits, or an alphanumeric string of 9 characters, you would define two templates, one for each representation.
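As a rough illustration of how two templates could cover both representations, the Python sketch below tries each pattern in turn. The patterns, the word boundaries and the helper name extract_customer_ids are assumptions made for this example; they are not the platform's implementation or its exact template syntax.

```python
import re

# Hypothetical regexes, one per template (assumed for illustration).
TEMPLATES = [
    re.compile(r"\bID\d{7}\b"),         # the word "ID" followed by 7 digits
    re.compile(r"\b[A-Za-z0-9]{9}\b"),  # any alphanumeric string of 9 characters
]


def extract_customer_ids(text):
    """Return every span matched by at least one template, without duplicates."""
    seen_spans = set()
    values = []
    for pattern in TEMPLATES:
        for match in pattern.finditer(text):
            # "ID" plus 7 digits is itself 9 alphanumeric characters, so the
            # same span can satisfy both templates; report it only once.
            if match.span() not in seen_spans:
                seen_spans.add(match.span())
                values.append(match.group(0))
    return values


print(extract_customer_ids("Customer ID1234567 wrote about account AB12CD34E."))
```

Note that the two representations overlap here, which is why the sketch keeps track of the spans it has already reported.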

 


Type-ahead validation


When typing into the text box for either the Regex or the Formatting, the interface provides immediate feedback on the validity of the input. For instance, the regex ID\d{} is reported as invalid as soon as it is typed.
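A similar check can be approximated outside the platform by attempting to compile the pattern. The sketch below uses Python's re module, whose syntax rules differ from the platform's: Python accepts ID\d{} by treating the braces as literal characters, so the invalid example here uses an unbalanced parenthesis instead. The function name check_regex is an assumption made for this example.

```python
import re


def check_regex(pattern):
    """Return (is_valid, error_message) for a candidate regex."""
    try:
        re.compile(pattern)
    except re.error as exc:
        # The message describes what Python's regex parser rejected.
        return False, str(exc)
    return True, ""


print(check_regex(r"ID\d{7}"))   # a valid pattern
print(check_regex(r"ID(\d{7}"))  # invalid: the group is never closed
```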



Extraction preview


The Custom Regex Template can be tested on text to ensure that it behaves as expected. Any entity that would be extracted with the Template will be shown in a list, with its value, as well as the position of the start and end characters.


For instance, if the regex is \d{4} and the formatting is ID-{$}, a test string containing exactly one four-digit number will show one extraction.
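The preview can be imitated with a few lines of Python. In this sketch the test string, the function name preview and the way the ID- prefix is applied are assumptions; Python reports the end position as exclusive, which may differ from how the platform displays it.

```python
import re


def preview(regex, prefix, text):
    """List each extraction with its formatted value and character positions."""
    return [
        {"value": prefix + match.group(0), "start": match.start(), "end": match.end()}
        for match in re.finditer(regex, text)
    ]


# Hypothetical test string containing one four-digit number.
print(preview(r"\d{4}", "ID-", "My code is 1234"))
```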



Regex


The regex is the pattern used to extract entities in the text. See here for the syntax documentation.


Named capture groups can be used to identify a specific section of the extracted string for subsequent formatting. The names of the capture groups should be unique across all templates, and should only contain lowercase letters or digits.

 


Formatting


Formatting can be provided to post-process the extracted entity.


By default, no formatting is applied and the string returned by the platform will be the string extracted by the regex. However, if needed, more complex transformations can be defined, using the following rules.

  

Variables


Any named capture group defined in the regex will be available to use in the formatting logic as a variable, prefixed with the $ symbol. Note that the $ symbol by itself represents the full regex match.


Variables can then be used in the formatting string to insert the corresponding extracted span into the value returned by the platform; the variable name needs to be surrounded by { and } braces.


For instance, if we want to extract seven digits as an ID, and return these seven digits prefixed with ID- then the regex and the formatting would be:

Regex: \d{7}
Formatting: ID-{$}

Or, using a named capture group:

Regex: (?P<id>\d{7})
Formatting: ID-{$id}


Later on, if the platform is given the text: My identification number is 1234567, it will return one entity: ID-1234567.
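The same behaviour can be sketched in Python with a named capture group. The group name id, the exact regex and the function name format_id are assumptions chosen for this illustration.

```python
import re


def format_id(text):
    """Extract seven digits and return them prefixed with 'ID-', or None."""
    match = re.search(r"(?P<id>\d{7})", text)
    if match is None:
        return None
    # match.group("id") plays the role of the $id variable in the formatting.
    return "ID-" + match.group("id")


print(format_id("My identification number is 1234567"))
```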

 

String Operations


Literal strings (enclosed in double quotes) can be used, and strings can be concatenated using the & symbol.


Regex: (?P<id1>\b\d{3}\b)|(?P<id2>\b\d{4}\b)
Formatting: {$id1 & "-" & $id2}
Text: The first id is 123 and the second one is 4567
Entity returned by the platform: 123-4567


Functions


Some functions can also be used in the formatting to transform the extracted string. The names of the functions and their signatures are inspired by Excel.


Upper

Converts all characters in the extracted span to uppercase:

Regex: \w{3}
Formatting: {upper($)}
Text: abc
Entity returned by the platform: ABC


Lower

Converts all characters in the extracted span to lowercase:

 

Regex: \w{3}
Formatting: {lower($)}
Text: AbC
Entity returned by the platform: abc


Proper

Capitalises the first letter of each word in the extracted span and converts the remaining letters to lowercase:

 

Regex: \w+\s\w+
Formatting: {proper($)}
Text: albert EINSTEIN
Entity returned by the platform: Albert Einstein
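For readers who want to reproduce the three case functions outside the platform, the sketch below gives approximate Python equivalents. The function names mirror the formatting functions, but the mapping to Python's string methods is an assumption; in particular, str.title() may treat apostrophes and digits differently from the platform's proper.

```python
def upper(span):
    """Approximate equivalent of upper($)."""
    return span.upper()


def lower(span):
    """Approximate equivalent of lower($)."""
    return span.lower()


def proper(span):
    """Approximate equivalent of proper($): capitalise each word."""
    return span.title()


print(upper("abc"), lower("AbC"), proper("albert EINSTEIN"))
```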


Pad

Pads the extracted span up to a given size with a given character. As the example below shows, the padding characters are added to the left of the span.

Function arguments:

  1. The text containing the characters to be padded
  2. Size of the padded string
  3. Character to be used for padding

 

Regex: \d{2,5}
Formatting: {pad($, 5, "0")}
Text: 123
Entity returned by the platform: 00123
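A minimal Python sketch of the same padding, assuming (as the example suggests) that the padding is added on the left and that the padding character is a single character:

```python
def pad(span, size, fill):
    """Left-pad span with fill until it is size characters long."""
    # str.rjust leaves the span unchanged if it is already long enough.
    return span.rjust(size, fill)


print(pad("123", 5, "0"))
```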


Substitute

Replaces characters with other characters.

Function arguments:

  1. The text containing the characters to be substituted
  2. What characters to replace
  3. What the old characters should be replaced with

 

Regex: ab
Formatting: {substitute($, "a", "12")}
Text: ab
Entity returned by the platform: 12b
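In Python, a comparable substitution can be sketched with str.replace. This assumes that every occurrence of the old characters is replaced, which the single-occurrence example above does not confirm.

```python
def substitute(span, old, new):
    """Replace every occurrence of old in span with new (assumed behaviour)."""
    return span.replace(old, new)


print(substitute("ab", "a", "12"))
```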


Left

Returns the first n characters from the span.

Function arguments:

  1. The text containing the characters to be extracted
  2. The number of characters to return

 

Regex: \w{4}
Formatting: {left($, 2)}
Text: ABCD
Entity returned by the platform: AB


Right

Returns the last n characters from the span.

Function arguments:

  1. The text containing the characters to be extracted
  2. The number of characters to return

 

Regex: \w{4}
Formatting: {right($, 2)}
Text: ABCD
Entity returned by the platform: CD


Mid

Returns n characters from the span, starting at the specified position. Positions are counted from 1, so position 2 is the second character.

Function arguments:

  1. The text containing the characters to be extracted
  2. The position of the first character to return
  3. The number of characters to return

 

Regex: \w{5}
Formatting: {mid($, 2, 3)}
Text: ABCDE
Entity returned by the platform: BCD
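The three slicing functions map naturally onto Python string slices. The sketch below is an approximation: it assumes 1-based positions for mid, consistent with the example above, and does not claim to match the platform's handling of out-of-range arguments.

```python
def left(span, n):
    """First n characters of span."""
    return span[:n]


def right(span, n):
    """Last n characters of span."""
    # span[-0:] would return the whole string, so handle n <= 0 explicitly.
    return span[-n:] if n > 0 else ""


def mid(span, start, n):
    """n characters of span, starting at the 1-based position start."""
    begin = start - 1
    return span[begin:begin + n]


print(left("ABCD", 2), right("ABCD", 2), mid("ABCDE", 2, 3))
```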

 

