Working with data#

This chapter expands on the data chapter of the basic user guide, describing in more detail how to manipulate intent classification data with AutoIntent.

[1]:
import importlib.resources as ires

import datasets

from autointent import Dataset
[2]:
datasets.logging.disable_progress_bar()  # disable tqdm outputs

Creating a dataset#

To create a dataset, you need to provide a training split containing samples with utterances and labels, as shown below:

{
    "train": [
        {
            "utterance": "Hello!",
            "label": 0
        },
        "...",
    ]
}

For a multilabel dataset, the label field should be a list of integers representing the corresponding class labels.
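For example, a multilabel training split might look like the following sketch (the utterances and class indices are illustrative):

```python
# A sketch of a multilabel dataset dictionary: each sample's "label"
# is a list of class indices instead of a single integer.
multilabel_data = {
    "train": [
        {
            "utterance": "Hello! I want to open an account.",
            "label": [0, 1],  # two classes apply to this utterance
        },
        {
            "utterance": "Goodbye!",
            "label": [2],  # a single class is still wrapped in a list
        },
    ]
}
```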

Handling out-of-scope samples#

To indicate that a sample is out-of-scope (see concepts), omit the label field from the sample dictionary. For example:

{
    "train": [
        {
            "utterance": "OOS request"
        },
        "...",
    ]
}

Validation and test splits#

By default, a portion of the training split will be allocated for validation and testing. However, you can also specify a test split explicitly:

{
    "train": [
        {
            "utterance": "Hello!",
            "label": 0
        },
        "...",
    ],
    "test": [
        {
            "utterance": "Hi!",
            "label": 0
        },
        "...",
    ]
}

Adding metadata to intents#

You can add metadata to intents in your dataset, such as regular expressions, intent names, descriptions, or tags, using the intents field:

{
    "train": [
        {
            "utterance": "Hello!",
            "label": 0
        },
        "...",
    ],
    "intents": [
        {
            "id": 0,
            "name": "greeting",
            "tags": ["conversation_start"],
            "regexp_partial_match": ["\bhello\b"],
            "regexp_full_match": ["^hello$"],
            "description": "User wants to initiate a conversation with a greeting."
        },
        "...",
    ]
}
  • name: A human-readable representation of the intent.

  • tags: Used in multilabel scenarios to predict the most probable class among those listed in a given Tag.

  • regexp_partial_match and regexp_full_match: Used by the RegExp module to predict intents based on provided patterns.

  • description: Used by the DescriptionScorer to calculate scores based on the similarity between an utterance and intent descriptions.

All fields in the intents list are optional except for id.
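The difference between partial and full matching can be sketched with the standard re module (this only mimics the behaviour; the actual RegExp module in AutoIntent may differ in details):

```python
import re

# Patterns in the spirit of the intent metadata above; the helper
# function below is a hypothetical sketch, not AutoIntent's API.
partial_patterns = [r"\bhello\b"]  # may match anywhere in the utterance
full_patterns = [r"^hello$"]       # must match the whole utterance

def matches_greeting(utterance: str) -> bool:
    text = utterance.lower()
    partial = any(re.search(p, text) for p in partial_patterns)
    full = any(re.fullmatch(p, text) for p in full_patterns)
    return partial or full

print(matches_greeting("hello"))             # both patterns match
print(matches_greeting("well hello there"))  # only the partial pattern matches
print(matches_greeting("helloing"))          # neither matches: \b needs a word boundary
```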

Loading a dataset#

There are three main ways to load your dataset:

  1. From a Python dictionary.

  2. From a JSON file.

  3. Directly from the Hugging Face Hub.

Creating a dataset from a Python dictionary#

One can load data into Python using our Dataset object.

[3]:
dataset = Dataset.from_dict(
    {
        "train": [
            {
                "utterance": "Please help me with my card. It won't activate.",
                "label": 0,
            },
            {
                "utterance": "I tried but am unable to activate my card.",
                "label": 0,
            },
            {
                "utterance": "I want to open an account for my children.",
                "label": 1,
            },
            {
                "utterance": "How old do you need to be to use the bank's services?",
                "label": 1,
            },
        ],
        "test": [
            {
                "utterance": "I want to start using my card.",
                "label": 0,
            },
            {
                "utterance": "How old do I need to be?",
                "label": 1,
            },
        ],
        "intents": [
            {
                "id": 0,
                "name": "activate_my_card",
            },
            {
                "id": 1,
                "name": "age_limit",
            },
        ],
    },
)

Loading a dataset from a file#

The AutoIntent library includes sample datasets.

[4]:
path_to_dataset = ires.files("tests.assets.data").joinpath("clinc_subset.json")
dataset = Dataset.from_json(path_to_dataset)
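To load your own data this way, first save a dictionary in the required format to a JSON file. A minimal sketch using only the standard library (the file name is illustrative):

```python
import json
import tempfile
from pathlib import Path

# Write a dataset dictionary in the required format to a JSON file;
# the directory and file name here are illustrative.
data = {
    "train": [
        {"utterance": "Hello!", "label": 0},
        {"utterance": "Goodbye!", "label": 1},
    ]
}
path = Path(tempfile.mkdtemp()) / "my_dataset.json"
path.write_text(json.dumps(data, indent=4))

# The file can then be loaded as shown above:
# dataset = Dataset.from_json(path)
```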

Loading a dataset from the Hugging Face Hub#

If your dataset on the Hugging Face Hub matches the required format, you can load it directly using its repository ID:

[5]:
dataset = Dataset.from_hub("AutoIntent/clinc150_subset")

Accessing dataset splits#

The Dataset class organizes your data as a dictionary of datasets.Dataset. For example, after initialization, an oos key may be added if OOS samples are provided.

[6]:
dataset["train"]
[6]:
Dataset({
    features: ['utterance', 'label'],
    num_rows: 60
})

Working with dataset splits#

Each split in the Dataset class is an instance of datasets.Dataset, so you can work with them accordingly.

[7]:
dataset["train"][:5]  # get first 5 train samples
[7]:
{'utterance': ['does acero in maplewood allow reservations',
  'do they take reservations at bar tartine',
  'does cowgirl creamery in san francisco take reservations',
  'can i get a reservation at melting pot tomorrow',
  'will they take reservations at torris'],
 'label': [0, 0, 0, 0, 0]}

Working with intents#

The metadata you added to intents in your dataset is stored in the intents attribute.

[8]:
dataset.intents[:3]
[8]:
[Intent(id=0, name=None, tags=[], regexp_full_match=[], regexp_partial_match=[], description=None),
 Intent(id=1, name=None, tags=[], regexp_full_match=[], regexp_partial_match=[], description=None),
 Intent(id=2, name=None, tags=[], regexp_full_match=[], regexp_partial_match=[], description=None)]

Pushing dataset to the Hugging Face Hub#

To share your dataset on the Hugging Face Hub, use the push_to_hub method. Ensure that you are logged in using the huggingface-cli tool:

[9]:
# dataset.push_to_hub("<repo_id>")