Working with data#

This chapter expands on the data chapter of the basic user guide, covering in more detail how to manipulate intent classification data with AutoIntent.

[1]:
import importlib.resources as ires

import datasets

from autointent import Dataset
[2]:
datasets.logging.disable_progress_bar()  # disable tqdm outputs

Creating a dataset#

To create a dataset, you need to provide a training split containing samples with utterances and labels, as shown below:

{
    "train": [
        {
            "utterance": "Hello!",
            "label": 0
        },
        "...",
    ]
}

For a multilabel dataset, the label field should be a list of integers representing the corresponding class labels.
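For illustration, a hypothetical multilabel training split (the utterances and class indices here are made up) could look like this:

```python
# A hypothetical multilabel training split: each "label" is a list of
# class indices rather than a single integer.
multilabel_split = {
    "train": [
        {"utterance": "Hello! I also need help with my card.", "label": [0, 2]},
        {"utterance": "Goodbye!", "label": [1]},
    ]
}

# Every label in the multilabel format is a list of ints.
assert all(
    isinstance(sample["label"], list)
    and all(isinstance(i, int) for i in sample["label"])
    for sample in multilabel_split["train"]
)
```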

Handling out-of-scope samples#

To indicate that a sample is out-of-scope (see concepts), omit the label field from the sample dictionary. For example:

{
    "train": [
        {
            "utterance": "OOS request"
        },
        "...",
    ]
}
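Because an OOS sample is simply an entry without a `label` field, you can separate in-scope and out-of-scope samples with a plain membership check. A minimal sketch (the sample data is made up):

```python
train = [
    {"utterance": "Hello!", "label": 0},
    {"utterance": "OOS request"},  # no "label" key -> out-of-scope
]

# Split samples by the presence of the "label" field.
in_scope = [s for s in train if "label" in s]
oos = [s for s in train if "label" not in s]

assert len(in_scope) == 1 and len(oos) == 1
```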

Validation and test splits#

By default, a portion of the training split will be allocated for validation and testing. However, you can also specify a test split explicitly:

{
    "train": [
        {
            "utterance": "Hello!",
            "label": 0
        },
        "...",
    ],
    "test": [
        {
            "utterance": "Hi!",
            "label": 0
        },
        "...",
    ]
}

Adding metadata to intents#

You can add metadata to intents in your dataset, such as regular expressions, intent names, descriptions, or tags, using the intents field:

{
    "train": [
        {
            "utterance": "Hello!",
            "label": 0
        },
        "...",
    ],
    "intents": [
        {
            "id": 0,
            "name": "greeting",
            "tags": ["conversation_start"],
            "regex_partial_match": ["\bhello\b"],
            "regex_full_match": ["^hello$"],
            "description": "User wants to initiate a conversation with a greeting."
        },
        "...",
    ]
}
  • name: A human-readable representation of the intent.

  • tags: Used in multilabel scenarios to predict the most probable class listed in a specific Tag.

  • regex_partial_match and regex_full_match: Used by the RegExp module to predict intents based on provided patterns.

  • description: Used by the DescriptionScorer to calculate scores based on the similarity between an utterance and intent descriptions.

All fields in the intents list are optional except for id.
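The two regex fields correspond to the usual distinction between full and partial matching. A rough sketch of the semantics using the standard `re` module (this is only an illustration, not the RegExp module's actual implementation):

```python
import re

# Patterns as they would appear in the intent metadata.
regex_full_match = ["^hello$"]
regex_partial_match = [r"\bhello\b"]

utterance = "hello there"

# Full match: the whole utterance must match the pattern.
full = any(re.fullmatch(p, utterance) for p in regex_full_match)
# Partial match: the pattern may match anywhere in the utterance.
partial = any(re.search(p, utterance) for p in regex_partial_match)

assert not full  # "hello there" is not exactly "hello"
assert partial   # but it does contain the word "hello"
```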

Loading a dataset#

There are three main ways to load your dataset:

  1. From a Python dictionary.

  2. From a JSON file.

  3. Directly from the Hugging Face Hub.

Creating a dataset from a Python dictionary#

You can load data into Python using the Dataset object.

[3]:
dataset = Dataset.from_dict(
    {
        "train": [
            {
                "utterance": "Please help me with my card. It won't activate.",
                "label": 0,
            },
            {
                "utterance": "I tried but am unable to activate my card.",
                "label": 0,
            },
            {
                "utterance": "I want to open an account for my children.",
                "label": 1,
            },
            {
                "utterance": "How old do you need to be to use the bank's services?",
                "label": 1,
            },
        ],
        "test": [
            {
                "utterance": "I want to start using my card.",
                "label": 0,
            },
            {
                "utterance": "How old do I need to be?",
                "label": 1,
            },
        ],
        "intents": [
            {
                "id": 0,
                "name": "activate_my_card",
            },
            {
                "id": 1,
                "name": "age_limit",
            },
        ],
    },
)

Loading a dataset from a file#

The AutoIntent library includes sample datasets.

[4]:
path_to_dataset = ires.files("tests.assets.data").joinpath("clinc_subset.json")
dataset = Dataset.from_json(path_to_dataset)
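If your data lives elsewhere, you can write it to a JSON file in the format shown earlier and pass the path to `Dataset.from_json`. A minimal sketch that only prepares such a file with the standard library (the file name and contents are arbitrary):

```python
import json
import tempfile
from pathlib import Path

data = {
    "train": [
        {"utterance": "Hello!", "label": 0},
        {"utterance": "Goodbye!", "label": 1},
    ]
}

# Write the dictionary to disk as JSON; Dataset.from_json accepts this path.
path = Path(tempfile.mkdtemp()) / "my_dataset.json"
path.write_text(json.dumps(data, indent=4))

# Round-trip check: the file parses back to the same structure.
assert json.loads(path.read_text()) == data
```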

Loading a dataset from the Hugging Face Hub#

If your dataset on the Hugging Face Hub matches the required format, you can load it directly using its repository ID:

[5]:
dataset = Dataset.from_hub("AutoIntent/clinc150_subset")

Accessing dataset splits#

The Dataset class organizes your data as a dictionary of datasets.Dataset. For example, after initialization, an oos key may be added if OOS samples are provided.

[6]:
dataset["train_0"]
[6]:
Dataset({
    features: ['utterance', 'label'],
    num_rows: 18
})

Working with dataset splits#

Each split in the Dataset class is an instance of datasets.Dataset, so you can work with them accordingly.

[7]:
dataset["train_0"][:5]  # get first 5 train samples
[7]:
{'utterance': ['do they take reservations at mcdonalds',
  'i would like an update on the progress of my credit card application',
  'can you tell me why is my bank account frozen',
  'why in the world am i locked out of my bank account',
  'who froze my bank account'],
 'label': [0, 3, 1, 1, 1]}
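Note that slicing a `datasets.Dataset` returns a column-oriented dictionary (one list per feature) rather than a list of rows. That conversion can be sketched in plain Python (the sample rows are made up):

```python
rows = [
    {"utterance": "hello", "label": 0},
    {"utterance": "bye", "label": 1},
    {"utterance": "thanks", "label": 2},
]

# datasets.Dataset slicing groups values by column name.
def to_columns(batch: list[dict]) -> dict:
    return {key: [row[key] for row in batch] for key in batch[0]}

first_two = to_columns(rows[:2])
assert first_two == {"utterance": ["hello", "bye"], "label": [0, 1]}
```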

Working with intents#

The metadata you added to intents in your dataset is stored in the intents attribute.

[8]:
dataset.intents[:3]
[8]:
[Intent(id=0, name=None, tags=[], regex_full_match=[], regex_partial_match=[], description='some description to some intent'),
 Intent(id=1, name=None, tags=[], regex_full_match=[], regex_partial_match=[], description='some description to another intent'),
 Intent(id=2, name=None, tags=[], regex_full_match=[], regex_partial_match=[], description='another description to some intent')]

Pushing dataset to the Hugging Face Hub#

To share your dataset on the Hugging Face Hub, use the push_to_hub method. Make sure you are logged in via the huggingface-cli tool first:

[9]:
# dataset.push_to_hub("<repo_id>")