Working with data#

In this chapter you will learn how to manipulate intent classification data with AutoIntent.

[1]:
import importlib.resources as ires

import datasets

from autointent import Dataset
[2]:
datasets.logging.disable_progress_bar()  # disable tqdm outputs

Creating a dataset#

The first thing you need to think about is your data. You need to collect a set of labeled utterances and save them as a JSON file with the following schema:

{
    "train": [
        {
            "utterance": "Hello!",
            "label": 0
        },
        "...",
    ],
    "test": [
        {
            "utterance": "Hi!",
            "label": 0
        },
        "...",
    ]
}

Note:

  • For a multilabel dataset, the label field should be a list of integers representing the corresponding class labels.

  • Test split is optional. By default, a portion of the training split will be allocated for testing.
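As a sketch, a file in the schema above can be produced with the standard json module. The utterances, labels, and file name below are invented for illustration:

```python
import json
import tempfile
from pathlib import Path

# Minimal single-label data in the expected schema (invented examples).
# For a multilabel dataset, each "label" would instead be a list of
# integers, e.g. "label": [0, 2].
data = {
    "train": [
        {"utterance": "Hello!", "label": 0},
        {"utterance": "Good morning!", "label": 0},
        {"utterance": "Bye!", "label": 1},
    ],
    # an optional "test" split could be added here in the same format
}

path = Path(tempfile.mkdtemp()) / "my_dataset.json"
path.write_text(json.dumps(data, indent=4))
```

A file written this way can then be passed to Dataset.from_json.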

Loading a dataset#

After you have converted your labeled data into JSON, you can load it into AutoIntent as a Dataset. To demonstrate this functionality, we will load a sample dataset that ships with the AutoIntent library.

[3]:
path_to_dataset = ires.files("tests.assets.data").joinpath("clinc_subset.json")
dataset = Dataset.from_json(path_to_dataset)

Note: to load your own data, just change the path_to_dataset variable.

Accessing dataset splits#

The Dataset class organizes your data as a dictionary of datasets.Dataset splits. For example, after initialization, an oos key may be added if out-of-scope (OOS) samples are provided.

[4]:
dataset["train"]
[4]:
Dataset({
    features: ['utterance', 'label'],
    num_rows: 37
})

Working with dataset splits#

Each split in the Dataset class is an instance of datasets.Dataset, so you can work with them accordingly.

[5]:
dataset["train"][:5]  # get first 5 train samples
[5]:
{'utterance': ['can i make a reservation for redrobin',
  'is it possible to make a reservation at redrobin',
  'does redrobin take reservations',
  'are reservations taken at redrobin',
  'does redrobin do reservations'],
 'label': [0, 0, 0, 0, 0]}

Save Dataset#

To share your dataset on the Hugging Face Hub, use the push_to_hub method.

[6]:
# dataset.push_to_hub("<repo_id>")

Note: ensure that you are logged in using huggingface-cli.

See Also#

  • Next chapter of the user guide “Using modules”: 02_modules