What are entities?
Entities are additional elements of structured data that can be extracted from the verbatims in your dataset. They include data points such as monetary quantities, dates, currency codes, organisations, people, email addresses, and URLs, as well as many other industry-specific categories (see below for an example).
Example email verbatim with address line, city name, and policy number entities predicted
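Purely as an illustration of how extracted entities can be represented as structured data, the short Python sketch below shows hypothetical predictions for an email like the one above; the field names, spans, and values are assumptions for this example, not the platform's actual output schema.

```python
# Hypothetical entity predictions for a single email verbatim.
# Field names, spans, and values are illustrative only.
predicted_entities = [
    {"kind": "address-line",  "text": "45 Example Street", "span": (28, 45)},
    {"kind": "city",          "text": "London",            "span": (47, 53)},
    {"kind": "policy-number", "text": "POL-123456",        "span": (90, 100)},
]

for entity in predicted_entities:
    print(f"{entity['kind']}: {entity['text']}")
```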
Unlike labels, most entities (except those trained from scratch) can be predicted by the platform as soon as they are enabled, because the platform can identify them from their typical, or in some instances very specific, format and a training set of similar entities.
As with labels, users can accept or reject correctly or incorrectly predicted entities, improving the model's ability to identify them in future.
Types of entities
There are currently three main types of entities:
- Pre-trained entities that are typically based on a set of standard or custom-defined rules - e.g. Monetary Quantity, URL, and Date
- Pre-trained entities that are machine learning based and rely on a large pool of available training data - e.g. Organisation and Person (i.e. names)
- Entities trained from scratch by a user (in the same way they would train labels), which are machine learning based
Trainable versus non-trainable entities
All entities are either 'trainable' by nature (those trained from scratch) or can be made 'trainable' when they are enabled (all other entity types).
'Trainable' entities are those that will update live in the platform based on training provided by users. For more detail on training entities, see here.
If you enable training on a pre-trained entity that is based on a set of standard or custom-defined rules, you can refine the platform's understanding of that entity within the parameters of those rules. Essentially, further training narrows the scope of what the platform will consider that entity, but does not broaden it.
This is because many of these entities, like dates (e.g. 'tomorrow') and monetary quantities (e.g. £20), need to be normalised into a structured data format for downstream systems. Similarly, entities like ISINs or CUSIPs must follow a fixed format, so the platform should not be taught to predict anything that does not conform to those defined formats.
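As a rough sketch of the normalisation and format checking described above, the example below uses simplified, hypothetical rules; they are not the platform's actual logic, and the ISIN check ignores the check-digit calculation.

```python
import re
from datetime import date, timedelta

# Simplified, hypothetical normalisation rules for illustration only.
def normalise_date(text, today):
    """Map a free-text date expression like 'tomorrow' to a concrete date."""
    if text.strip().lower() == "tomorrow":
        return today + timedelta(days=1)
    return None  # a real system would handle many more expressions

def normalise_money(text):
    """Turn a string like '£20' into a structured amount and currency."""
    match = re.fullmatch(r"£(\d+(?:\.\d{2})?)", text.strip())
    if match:
        return {"amount": float(match.group(1)), "currency": "GBP"}
    return None

# An ISIN has a fixed format: 2 letters, 9 alphanumeric characters, 1 digit.
# (The check-digit validation is omitted here.)
ISIN_PATTERN = re.compile(r"[A-Z]{2}[A-Z0-9]{9}[0-9]")

def looks_like_isin(text):
    return bool(ISIN_PATTERN.fullmatch(text.strip()))

print(normalise_date("tomorrow", date(2024, 1, 1)))  # 2024-01-02
print(normalise_money("£20"))                        # {'amount': 20.0, 'currency': 'GBP'}
print(looks_like_isin("US0378331005"))               # True
```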
When any trainable entities are assigned, the platform looks at both the text of the entity and its context within the rest of the communication, i.e. what appears before and after the entity value (in the same paragraph, and in the paragraphs immediately above and below). It learns to better predict the entity based on the values themselves, as well as on how those values appear within the context of the communication.
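The snippet below is a minimal sketch of that idea, assuming a verbatim split into paragraphs and a known entity position; the function is illustrative, not the platform's implementation.

```python
def context_window(paragraphs, entity_paragraph_index):
    """Return the paragraph containing the entity plus the one above and below."""
    start = max(0, entity_paragraph_index - 1)
    end = min(len(paragraphs), entity_paragraph_index + 2)
    return paragraphs[start:end]

verbatim_paragraphs = [
    "Hi team,",
    "Please update policy POL-123456 with my new address.",
    "Thanks, Alex",
]

# The entity value 'POL-123456' sits in paragraph 1, so the model would also
# see paragraphs 0 and 2 as surrounding context.
print(context_window(verbatim_paragraphs, entity_paragraph_index=1))
```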
If a pre-trained entity is not set as trainable (see detail on enabling entities on a dataset here), users can still accept or reject the entity predictions they see in their dataset. These entities are then updated and refined offline using the in-platform feedback provided by users, so it is still helpful for users to accept or reject them when reviewing verbatims.
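To illustrate the kind of signal this feedback produces, here is a purely hypothetical record shape; it is not the platform's API or data model.

```python
# Hypothetical accept/reject feedback records, for illustration only.
feedback = [
    {"verbatim_id": "v-001", "entity_kind": "monetary-quantity",
     "predicted_text": "£20", "action": "accept"},
    {"verbatim_id": "v-002", "entity_kind": "organisation",
     "predicted_text": "Acme Ltd", "action": "reject"},
]

# Offline, records like these could be aggregated and used to refine the
# pre-trained entity models.
accepted = sum(1 for record in feedback if record["action"] == "accept")
rejected = sum(1 for record in feedback if record["action"] == "reject")
print(f"accepted={accepted}, rejected={rejected}")
```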