Working with data#
This chapter expands on the data chapter of the basic user guide, covering in more detail how to manipulate intent classification data with AutoIntent.
[1]:
import importlib.resources as ires
import datasets
from autointent import Dataset
[2]:
datasets.logging.disable_progress_bar() # disable tqdm outputs
Creating a dataset#
To create a dataset, you need to provide a training split containing samples with utterances and labels, as shown below:
{
"train": [
{
"utterance": "Hello!",
"label": 0
},
"...",
]
}
For a multilabel dataset, the label field should be a list of integers representing the corresponding class labels.
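For instance, a multilabel training split could look like the following sketch (the utterances and class indices are invented for illustration):

```python
# A minimal multilabel example: each sample's "label" is a list of
# class indices, so one utterance can belong to several intents at once.
multilabel_data = {
    "train": [
        {"utterance": "Hello! I need help with my card.", "label": [0, 1]},
        {"utterance": "Goodbye!", "label": [2]},
    ],
}

# Every label must be a list of integers, even when a sample
# belongs to a single class.
for sample in multilabel_data["train"]:
    assert isinstance(sample["label"], list)
```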
Handling out-of-scope samples#
To indicate that a sample is out-of-scope (see concepts), omit the label field from the sample dictionary. For example:
{
"train": [
{
"utterance": "OOS request"
},
"...",
]
}
Validation and test splits#
By default, a portion of the training split will be allocated for validation and testing. However, you can also specify a test split explicitly:
{
"train": [
{
"utterance": "Hello!",
"label": 0
},
"...",
],
"test": [
{
"utterance": "Hi!",
"label": 0
},
"...",
]
}
Adding metadata to intents#
You can add metadata to intents in your dataset, such as regular expressions, intent names, descriptions, or tags, using the intents field:
{
"train": [
{
"utterance": "Hello!",
"label": 0
},
"...",
],
"intents": [
{
"id": 0,
"name": "greeting",
"tags": ["conversation_start"],
"regexp_partial_match": ["\bhello\b"],
"regexp_full_match": ["^hello$"],
"description": "User wants to initiate a conversation with a greeting."
},
"...",
]
}
name: A human-readable representation of the intent.
tags: Used in multilabel scenarios to predict the most probable class listed in a specific Tag.
regexp_partial_match and regexp_full_match: Used by the RegExp module to predict intents based on provided patterns.
description: Used by the DescriptionScorer to calculate scores based on the similarity between an utterance and intent descriptions.
All fields in the intents list are optional except for id.
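To illustrate the difference between the two pattern fields, here is a small sketch using Python's re module. It mirrors the partial-match versus full-match distinction, not AutoIntent's internal implementation:

```python
import re

# Hypothetical intent metadata with both kinds of patterns.
intent = {
    "id": 0,
    "regexp_partial_match": [r"\bhello\b"],  # may match anywhere in the utterance
    "regexp_full_match": [r"^hello$"],       # must match the whole utterance
}

utterance = "well hello there"

# A partial match succeeds because "hello" occurs somewhere in the text;
# a full match fails because the utterance is not exactly "hello".
partial = any(re.search(p, utterance) for p in intent["regexp_partial_match"])
full = any(re.fullmatch(p, utterance) for p in intent["regexp_full_match"])

print(partial, full)  # True False
```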
Loading a dataset#
There are three main ways to load your dataset:
From a Python dictionary.
From a JSON file.
Directly from the Hugging Face Hub.
Creating a dataset from a Python dictionary#
You can load data into Python using the Dataset object.
[3]:
dataset = Dataset.from_dict(
{
"train": [
{
"utterance": "Please help me with my card. It won't activate.",
"label": 0,
},
{
"utterance": "I tried but am unable to activate my card.",
"label": 0,
},
{
"utterance": "I want to open an account for my children.",
"label": 1,
},
{
"utterance": "How old do you need to be to use the bank's services?",
"label": 1,
},
],
"test": [
{
"utterance": "I want to start using my card.",
"label": 0,
},
{
"utterance": "How old do I need to be?",
"label": 1,
},
],
"intents": [
{
"id": 0,
"name": "activate_my_card",
},
{
"id": 1,
"name": "age_limit",
},
],
},
)
Loading a dataset from a file#
The AutoIntent library includes sample datasets.
[4]:
path_to_dataset = ires.files("tests.assets.data").joinpath("clinc_subset.json")
dataset = Dataset.from_json(path_to_dataset)
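To load your own data the same way, first save a dictionary in the format described above to a JSON file. The snippet below is a minimal sketch; the file name is arbitrary:

```python
import json
import tempfile
from pathlib import Path

# A tiny dataset in the expected format.
data = {
    "train": [
        {"utterance": "Please help me with my card. It won't activate.", "label": 0},
        {"utterance": "How old do you need to be to use the bank's services?", "label": 1},
    ],
}

# Write the dataset to disk as JSON.
path = Path(tempfile.mkdtemp()) / "my_dataset.json"
path.write_text(json.dumps(data, indent=2))

# dataset = Dataset.from_json(path)  # then load it back with AutoIntent
```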
Loading a dataset from the Hugging Face Hub#
If your dataset on the Hugging Face Hub matches the required format, you can load it directly using its repository ID:
[5]:
dataset = Dataset.from_hub("AutoIntent/clinc150_subset")
Accessing dataset splits#
The Dataset class organizes your data as a dictionary of datasets.Dataset splits. For example, after initialization, an oos key may be added if OOS samples are provided.
[6]:
dataset["train"]
[6]:
Dataset({
features: ['utterance', 'label'],
num_rows: 60
})
Working with dataset splits#
Each split in the Dataset class is an instance of datasets.Dataset, so you can work with them accordingly.
[7]:
dataset["train"][:5] # get first 5 train samples
[7]:
{'utterance': ['does acero in maplewood allow reservations',
'do they take reservations at bar tartine',
'does cowgirl creamery in san francisco take reservations',
'can i get a reservation at melting pot tomorrow',
'will they take reservations at torris'],
'label': [0, 0, 0, 0, 0]}
Working with intents#
The metadata you added to intents in your dataset is stored in the intents attribute.
[8]:
dataset.intents[:3]
[8]:
[Intent(id=0, name=None, tags=[], regexp_full_match=[], regexp_partial_match=[], description=None),
Intent(id=1, name=None, tags=[], regexp_full_match=[], regexp_partial_match=[], description=None),
Intent(id=2, name=None, tags=[], regexp_full_match=[], regexp_partial_match=[], description=None)]
Pushing dataset to the Hugging Face Hub#
To share your dataset on the Hugging Face Hub, use the push_to_hub method. Ensure that you are logged in via the huggingface-cli tool:
[9]:
# dataset.push_to_hub("<repo_id>")