User permissions required: ‘Datasets admin’
To create a new dataset:
Navigate to the datasets page and click 'New Dataset', which opens a modal for creating the new dataset.
New Dataset modal
Complete the form with all the relevant information, clicking continue to progress through each step:
- Add basic info
- Use the dropdown menu (showing 'team-training' in this example) to select the project that the dataset should exist in. You can assign the dataset to any project that you are a member of.
- Use the title and description boxes to provide more information on the dataset you’re creating. It’s good practice to reference the data sources and the purpose of the analysis. These fields are not mandatory, but they help make your dataset more easily identifiable.
- Give the dataset a useful, descriptive name in the API name field, using hyphens instead of spaces, e.g. zendesk-cs-chats
- Copy in an existing taxonomy or customise sources
- Select whether you would like to copy an existing taxonomy from another dataset (this will auto-select the same sources, entities, and sentiment selection as that dataset)
- Select all the (additional) sources which you wish to connect to the dataset
- Set the sentiment and language(s) of the dataset
- Enable or disable sentiment analysis. With sentiment analysis enabled, every label in the taxonomy has an associated positive or negative sentiment. To understand why you would or wouldn't enable it, see here
- Confirm the model family, i.e. the language(s). If you're selecting a multilingual model, see here for more detail and important considerations
- Add pretrained labels
- Add entities
- Select any entities that you wish to enable. You do not have to enable any during dataset creation; you can always enable them later on the dataset settings page.
Lastly, click 'Create Dataset'.
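For the API name in the basic info step, a small helper can turn a working title into the hyphenated form before you type it in. This is only a minimal sketch: `to_api_name` is a hypothetical helper and not part of the platform, and lowercasing is an assumption based on the example name above. The platform's exact rules for allowed characters are not stated here, so treat the output as a suggestion.

```python
import re


def to_api_name(title: str) -> str:
    """Turn free text into a lowercase, hyphen-separated name.

    Hypothetical helper for illustration; the allowed character set
    is an assumption, not the platform's documented rule.
    """
    # Replace every run of characters that is not a lowercase letter
    # or digit with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    # Drop any hyphen left at the start or end.
    return slug.strip("-")


print(to_api_name("Zendesk CS Chats"))  # zendesk-cs-chats
```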
Please Note:
- You can add up to 20 individual sources to a dataset in the GUI
- Sources can sit in a different project to a dataset. As long as users have the appropriate permissions in each project, they will be able to see the verbatims and label as usual
- If there are multiple sources in a dataset, they should share a similar intended purpose for your analysis
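If you plan a dataset's sources in a script or spreadsheet before opening the modal, the 20-source GUI limit can be checked up front. The sketch below is a local check only: `check_sources` and `MAX_GUI_SOURCES` are hypothetical names introduced for illustration, and the code does not call the platform.

```python
MAX_GUI_SOURCES = 20  # limit stated in the note above


def check_sources(source_names: list[str]) -> list[str]:
    """Return a list of problems; an empty list means no issues were found."""
    problems = []
    if len(source_names) > MAX_GUI_SOURCES:
        problems.append(
            f"{len(source_names)} sources selected; "
            f"the GUI allows up to {MAX_GUI_SOURCES}"
        )
    # Flag names that appear more than once in the planned selection.
    duplicates = sorted({s for s in source_names if source_names.count(s) > 1})
    if duplicates:
        problems.append("duplicate sources: " + ", ".join(duplicates))
    return problems


planned = ["zendesk-chats", "support-emails", "zendesk-chats"]
for problem in check_sources(planned):
    print(problem)
```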
What does copying a taxonomy mean and why would you do it?
When you create a new dataset, you can choose to essentially create a carbon copy of a pre-existing dataset. This means that you copy over the same sources, entities, sentiment selection, labels and reviewed examples as the dataset you've copied the taxonomy from.
You can then work on the copy dataset (which will require a different name) and make changes to it freely without impacting the original.
There are two main reasons why you might want to do this:
- You want to make major changes to your model, for instance to the taxonomy structure, and want to preserve the original dataset in case you need to revert to it
- You want to utilise the work already done by labelling the original dataset and create a new dataset that you then add additional sources of a similar nature to