Working with data#
In this chapter you will learn how to manipulate intent classification data with AutoIntent.
[1]:
import importlib.resources as ires
import datasets
from autointent import Dataset
[2]:
datasets.logging.disable_progress_bar() # disable tqdm outputs
Creating a dataset#
The first thing you need to think about is your data. Collect a set of labeled utterances and save it as a JSON file with the following schema:
{
    "train": [
        {
            "utterance": "Hello!",
            "label": 0
        },
        "...",
    ],
    "test": [
        {
            "utterance": "Hi!",
            "label": 0
        },
        "...",
    ]
}
Note: For a multilabel dataset, the label field should be a list of integers representing the corresponding class labels.
The test split is optional. By default, a portion of the training split will be allocated for testing.
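As a minimal sketch of producing such a file, the snippet below builds a small dataset in memory and writes it with the standard json module. The utterances, the label meanings, and the file name my_intents.json are hypothetical and only serve as an illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical toy data with two intents (0 = greeting, 1 = farewell).
data = {
    "train": [
        {"utterance": "Hello!", "label": 0},
        {"utterance": "Good morning", "label": 0},
        {"utterance": "Bye!", "label": 1},
        {"utterance": "See you later", "label": 1},
    ],
    # The test split is optional and can be left out.
    "test": [
        {"utterance": "Hi!", "label": 0},
        {"utterance": "Goodbye", "label": 1},
    ],
}

# Written to a temporary directory here; use your own path in practice.
out_path = Path(tempfile.mkdtemp()) / "my_intents.json"
out_path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")

# Read the file back to confirm that it round-trips unchanged.
loaded = json.loads(out_path.read_text(encoding="utf-8"))
print(sorted(loaded), len(loaded["train"]))
```

For a multilabel dataset, the same code applies with each label written as a list of integers, for example [0, 1].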
Loading a dataset#
Once you have converted your labeled data to JSON, you can load it into AutoIntent as a Dataset. To demonstrate this, we will load a sample dataset provided by the AutoIntent library.
[3]:
path_to_dataset = ires.files("tests.assets.data").joinpath("clinc_subset_unsplitted.json")
dataset = Dataset.from_json(path_to_dataset)
Note: to load your own data, just change the path_to_dataset variable.
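Before pointing path_to_dataset at your own file, it can help to check that the file follows the schema described earlier. The helper below is a hypothetical sketch, not part of AutoIntent: check_intent_json is an illustrative name, it uses only the standard library, and it assumes every sample in the train and test splits carries a label. Relax the label check if your data contains unlabeled samples.

```python
import json
from pathlib import Path


def _is_valid_label(label):
    # Multiclass: a single int. Multilabel: a list of ints.
    if isinstance(label, int):
        return True
    return isinstance(label, list) and all(isinstance(x, int) for x in label)


def check_intent_json(path):
    """Hypothetical helper: return a list of schema problems (empty if none)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        return ["top-level JSON value must be an object"]
    problems = []
    if "train" not in data:
        problems.append("missing 'train' split")
    # Only the two splits named in the schema are inspected.
    for split in ("train", "test"):
        for i, record in enumerate(data.get(split, [])):
            if not isinstance(record, dict):
                problems.append(f"{split}[{i}]: not an object")
                continue
            if not isinstance(record.get("utterance"), str):
                problems.append(f"{split}[{i}]: 'utterance' must be a string")
            if not _is_valid_label(record.get("label")):
                problems.append(f"{split}[{i}]: 'label' must be an int or a list of ints")
    return problems
```

If the returned list is empty, the file matches the schema shown in this chapter and the same path can be passed to Dataset.from_json.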
Accessing dataset splits#
The Dataset class organizes your data as a dictionary of datasets.Dataset objects keyed by split name. Additional keys can appear as well: for example, after initialization, an oos key may be added if out-of-scope (OOS) samples are provided.
[4]:
dataset["train"]
[4]:
Dataset({
features: ['utterance', 'label'],
num_rows: 36
})
Working with dataset splits#
Each split in the Dataset class is an instance of datasets.Dataset, so you can work with it through the usual datasets.Dataset interface, such as indexing and slicing.
[5]:
dataset["train"][:5] # get first 5 train samples
[5]:
{'utterance': ['do they take reservations at mcdonalds',
'i would like an update on the progress of my credit card application',
'can you tell me why is my bank account frozen',
'why in the world am i locked out of my bank account',
'who froze my bank account'],
'label': [0, 3, 1, 1, 1]}
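As the output above shows, slicing a split returns a column-oriented dictionary: one list per feature. The sketch below uses only the standard library and a literal copy of that output (so it runs without loading the dataset) to turn the batch into per-sample records and to count labels.

```python
from collections import Counter

# The batch printed above for dataset["train"][:5], copied verbatim.
batch = {
    "utterance": [
        "do they take reservations at mcdonalds",
        "i would like an update on the progress of my credit card application",
        "can you tell me why is my bank account frozen",
        "why in the world am i locked out of my bank account",
        "who froze my bank account",
    ],
    "label": [0, 3, 1, 1, 1],
}

# Column-oriented -> row-oriented: one dict per sample.
rows = [dict(zip(batch, values)) for values in zip(*batch.values())]
print(rows[0])
# {'utterance': 'do they take reservations at mcdonalds', 'label': 0}

# Count how often each label occurs in the batch.
label_counts = Counter(batch["label"])
print(label_counts.most_common(1))
# [(1, 3)]
```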
Save Dataset#
To share your dataset on the Hugging Face Hub, use the push_to_hub method.
[6]:
# dataset.push_to_hub("<repo_id>")
Note: ensure that you are logged in using huggingface-cli (for example, by running huggingface-cli login).
Once the dataset is on the Hub, you can load it with one line. For example:
[7]:
Dataset.from_hub("AutoIntent/banking77")
[7]:
{'train': Dataset({
features: ['utterance', 'label'],
num_rows: 10003
}),
'test': Dataset({
features: ['utterance', 'label'],
num_rows: 3080
})}
See Also#
Next chapter of the user guide “Using modules”: 02_modules