Multilingual sources and datasets

Communications Mining supports multilingual sources and datasets. This means that its models can interpret sources containing several different supported languages without translating them first.

The following languages are currently in 'General Availability' for multilingual sources and datasets:

English
French
German
Spanish
Italian
Portuguese
Dutch

We will be expanding this list over time.

What this means in practice is that if users work and do business in several languages that are supported by Communications Mining, they can train models on verbatims in those languages, rather than translating everything into a single language.

A large list of additional languages is supported 'In Preview' (included at the bottom of this article), meaning that we will be working to fine-tune them over time as our customers and partners begin to use them. A large proportion of these languages should already perform strongly and will require little to no fine-tuning by our teams to achieve high performance.

Important considerations when looking to use multilingual sources and datasets:

If a dataset is multilingual, users cannot see translations of any verbatims (as they can in translated datasets), so they need to understand every language in the dataset to train their model effectively.
Understanding multiple languages is a more complex machine learning problem than understanding a single language, so multilingual datasets may perform slightly worse than datasets in a single language.
The platform can only interpret text written in one of the supported languages listed in this article. If a dataset contains verbatims in other languages, applying labels that are also used on verbatims in supported languages will confuse the model. It is better to give these verbatims their own specific labels that capture the language itself, although the platform will still not be able to interpret the content of the unsupported language.
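As an illustration of the last consideration, the following Python sketch checks which tier a language falls into before its verbatims are labelled. It is not part of the platform or its API: the function names `support_tier` and `suggested_label`, and the "Unsupported language: ..." naming scheme, are hypothetical. The only data taken from this article are the two language lists, and the sketch assumes the language of each verbatim is already known.

```python
from typing import Optional

# General Availability languages, copied from this article.
GENERAL_AVAILABILITY = {
    "English", "Dutch", "French", "German", "Italian", "Portuguese", "Spanish",
}

# 'In Preview' languages, copied from the list at the bottom of this article.
IN_PREVIEW = {
    name.strip()
    for name in """
    Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque,
    Belarusian, Bengali, Bengali (Romanized), Bosnian, Breton, Bulgarian, Burmese,
    Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish,
    Esperanto, Estonian, Filipino, Finnish, Galician, Georgian, Greek, Gujarati,
    Hausa, Hebrew, Hindi, Hindi (Romanized), Hungarian, Icelandic, Indonesian, Irish,
    Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz,
    Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi,
    Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Punjabi,
    Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak,
    Slovenian, Somali, Sundanese, Swahili, Swedish, Swiss German, Tamil,
    Tamil (Romanized), Telugu, Telugu (Romanized), Thai, Turkish, Ukrainian, Urdu,
    Urdu (Romanized), Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish
    """.split(",")
}


def support_tier(language: str) -> str:
    """Return 'general availability', 'preview', or 'unsupported'."""
    name = language.strip()
    if name in GENERAL_AVAILABILITY:
        return "general availability"
    if name in IN_PREVIEW:
        return "preview"
    return "unsupported"


def suggested_label(language: str) -> Optional[str]:
    """Hypothetical naming scheme: a dedicated label per unsupported language."""
    if support_tier(language) == "unsupported":
        return f"Unsupported language: {language.strip()}"
    # Supported languages share the dataset's normal labels.
    return None
```

A language that appears in neither list gets its own language-specific label suggestion, while supported languages return `None` and continue to use the normal labels in the dataset.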

How do you create multilingual sources and datasets?

For both sources and datasets, the language family is selected when they are created and cannot be changed afterwards.

Simply select 'multilingual' from the language family dropdown on the create source or create dataset modal (it's typically the last setting to select).

Please Note: Multilingual datasets can contain sources of any language family that the platform supports.

For more detail on creating a source in the UI, see here.

For more detail on creating a dataset, see here.

General Availability Languages

English
Dutch
French
German
Italian
Portuguese
Spanish

Supported languages 'In Preview'

Afrikaans
Albanian
Amharic
Arabic
Armenian
Assamese
Azerbaijani
Basque
Belarusian
Bengali
Bengali (Romanized)
Bosnian
Breton
Bulgarian
Burmese
Catalan
Chinese (Simplified)
Chinese (Traditional)
Croatian
Czech
Danish
Esperanto
Estonian
Filipino
Finnish
Galician
Georgian
Greek
Gujarati
Hausa
Hebrew
Hindi
Hindi (Romanized)
Hungarian
Icelandic
Indonesian
Irish
Japanese
Javanese
Kannada
Kazakh
Khmer
Korean
Kurdish (Kurmanji)
Kyrgyz
Lao
Latin
Latvian
Lithuanian
Macedonian
Malagasy
Malay
Malayalam
Marathi
Mongolian
Nepali
Norwegian
Oriya
Oromo
Pashto
Persian
Polish
Punjabi
Romanian
Russian
Sanskrit
Scottish Gaelic
Serbian
Sindhi
Sinhala
Slovak
Slovenian
Somali
Sundanese
Swahili
Swedish
Swiss German
Tamil
Tamil (Romanized)
Telugu
Telugu (Romanized)
Thai
Turkish
Ukrainian
Urdu
Urdu (Romanized)
Uyghur
Uzbek
Vietnamese
Welsh
Western Frisian
Xhosa
Yiddish
