Data#
In this chapter you will learn how to work with intent classification data in AutoIntent. We’ll cover creating datasets, loading data from different sources, and inspecting, splitting, and saving your data.
[1]:
import datasets
from autointent import Dataset
[2]:
datasets.logging.disable_progress_bar() # disable tqdm outputs
Creating your first dataset#
The easiest way to get started is by creating a dataset from a Python dictionary. Let’s start with a simple banking intent classification example:
[3]:
# Create a simple intent classification dataset
data = {
    "train": [
        {"utterance": "What is my account balance?", "label": 0},
        {"utterance": "Check my current balance", "label": 0},
        {"utterance": "Show me my account details", "label": 0},
        {"utterance": "I want to transfer money", "label": 1},
        {"utterance": "How do I send funds to someone?", "label": 1},
        {"utterance": "Make a payment to my friend", "label": 1},
        {"utterance": "Cancel my last transaction", "label": 2},
        {"utterance": "Reverse the payment I just made", "label": 2},
        {"utterance": "Stop this transfer", "label": 2},
    ],
    "validation": [
        {"utterance": "Display my balance", "label": 0},
        {"utterance": "Send money to John", "label": 1},
        {"utterance": "Cancel the last payment", "label": 2},
    ],
    "test": [
        {"utterance": "How much money is in my account?", "label": 0},
        {"utterance": "Transfer funds to my savings", "label": 1},
        {"utterance": "Undo my recent payment", "label": 2},
    ],
}
# Load the data into AutoIntent
dataset = Dataset.from_dict(data)
print(f"Dataset created with {len(dataset['train'])} training samples")
Dataset created with 9 training samples
This creates a dataset with three intent classes:
0: Balance inquiries
1: Money transfers
2: Transaction cancellations
Important notes about data splits:
Training split: Required - this is the data your model learns from
Validation split: Optional - if not provided, AutoIntent will automatically split your training data
Test split: Highly recommended - a frozen evaluation set that’s never used during training
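The rules above can be sketched as a small pre-flight check on a raw data dict. Note that `check_splits` is purely illustrative and not part of the AutoIntent API:

```python
def check_splits(data: dict) -> list[str]:
    """Return a list of warnings about the split layout of a raw data dict."""
    warnings = []
    if "train" not in data or not data["train"]:
        warnings.append("'train' split is required and must be non-empty")
    if "test" not in data:
        warnings.append("no 'test' split: you lose a frozen final-evaluation set")
    if "validation" not in data:
        warnings.append("no 'validation' split: AutoIntent will carve one out of 'train'")
    return warnings

# A dict with only a training split triggers both optional-split warnings
print(check_splits({"train": [{"utterance": "hi", "label": 0}]}))
```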
Understanding the data format#
AutoIntent expects your data in a specific format. Here are the key requirements:
Single-label classification#
For most intent classification tasks, each utterance belongs to exactly one class:
{
    "train": [
        {"utterance": "Hello!", "label": 0},
        {"utterance": "Book a flight", "label": 1},
        {"utterance": "What's the weather?", "label": 2}
    ],
    "test": [  # Recommended: frozen test set
        {"utterance": "Hi there!", "label": 0}
    ]
    # validation split is optional - AutoIntent will create one if needed
}
Multi-label classification#
For tasks where utterances can belong to multiple classes, use a list of labels:
{
    "train": [
        {"utterance": "Book urgent flight to Paris", "label": [1, 0, 1]},  # booking=1, weather=0, urgent=1
        {"utterance": "What's the weather like?", "label": [0, 1, 0]}
    ],
    "test": [
        {"utterance": "Emergency flight booking", "label": [1, 0, 1]}
    ]
}
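If your labels start out as lists of active class indices rather than multi-hot vectors, the conversion is straightforward. The helper below (`to_multi_hot` is an illustrative name, not an AutoIntent function) builds vectors like the ones shown above:

```python
def to_multi_hot(active_classes: list[int], n_classes: int) -> list[int]:
    """Convert a list of active class indices into a multi-hot label vector."""
    vec = [0] * n_classes
    for idx in active_classes:
        vec[idx] = 1
    return vec

# "Book urgent flight to Paris" activates booking (0) and urgent (2), but not weather (1)
print(to_multi_hot([0, 2], n_classes=3))  # → [1, 0, 1]
```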
Loading data from different sources#
AutoIntent supports multiple ways to load your data:
From a dictionary (recommended for getting started)#
Perfect when you have your data ready in Python:
[4]:
# Example with a complete dataset including all splits
banking_data = {
    "train": [
        {"utterance": "What is my account balance?", "label": 0},
        {"utterance": "Check my savings balance", "label": 0},
        {"utterance": "How much money do I have?", "label": 0},
        {"utterance": "Transfer $100 to savings", "label": 1},
        {"utterance": "Send money to my friend", "label": 1},
        {"utterance": "Make a payment", "label": 1},
        {"utterance": "Cancel my last payment", "label": 2},
        {"utterance": "Stop this transaction", "label": 2},
        {"utterance": "Reverse my transfer", "label": 2},
    ],
    "validation": [
        {"utterance": "Display my balance", "label": 0},
        {"utterance": "Send $50 to John", "label": 1},
        {"utterance": "Stop my last transaction", "label": 2},
    ],
    "test": [
        {"utterance": "Show me my current balance", "label": 0},
        {"utterance": "I want to transfer funds", "label": 1},
        {"utterance": "Cancel this payment", "label": 2},
    ],
}
dataset_from_dict = Dataset.from_dict(banking_data)
print("✅ Dataset loaded from dictionary")
print(f"Splits: {list(dataset_from_dict.keys())}")
✅ Dataset loaded from dictionary
Splits: ['train', 'validation', 'test']
From a JSON file#
When you have your data saved as a JSON file with the same structure:
[5]:
# Example: dataset_from_json = Dataset.from_json("/path/to/your/data.json")
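Assuming `from_json` expects the same split-to-records layout as `from_dict`, you can prepare such a file with the standard library. The file name here is arbitrary:

```python
import json
from pathlib import Path

# Same structure as the dicts above: split name → list of records
data = {
    "train": [
        {"utterance": "What is my account balance?", "label": 0},
        {"utterance": "I want to transfer money", "label": 1},
    ],
    "test": [
        {"utterance": "How much money is in my account?", "label": 0},
    ],
}

# Write the dict to disk as JSON
path = Path("intents.json")
path.write_text(json.dumps(data, indent=2))

# The file could then be loaded with Dataset.from_json(str(path))
loaded = json.loads(path.read_text())
print(list(loaded.keys()))  # → ['train', 'test']
```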
From Hugging Face Hub#
For loading public datasets or sharing your own:
[6]:
# Load a sample dataset from HuggingFace Hub
dataset_from_hub = Dataset.from_hub("DeepPavlov/banking77")
print("✅ Dataset loaded from Hugging Face Hub")
print(f"Training samples: {len(dataset_from_hub['train'])}")
✅ Dataset loaded from Hugging Face Hub
Training samples: 10003
Working with your dataset#
Once loaded, your dataset behaves like a dictionary of Hugging Face datasets:
[7]:
# Access different splits
print("Available splits:", list(dataset_from_hub.keys()))
print(f"Training samples: {len(dataset_from_hub['train'])}")
# View the first few samples
print("\nFirst 3 training samples:")
train_split = dataset_from_hub["train"][:3]
for i, (utterance, label) in enumerate(zip(train_split["utterance"], train_split["label"], strict=True)):
    print(f"{i+1}. '{utterance}' → label {label}")
Available splits: ['train', 'test']
Training samples: 10003
First 3 training samples:
1. 'Please help me with my card. It won't activate.' → label 0
2. 'I tired but an unable to activate my card.' → label 0
3. 'I want to start using my card.' → label 0
Working with individual samples#
[8]:
# Access specific samples
first_sample = dataset_from_hub["train"][0]
print(f"First sample: '{first_sample['utterance']}' (label: {first_sample['label']})")
# Slice multiple samples
batch = dataset_from_hub["train"][5:10]
print("\nBatch of 5 samples:")
for utterance, label in zip(batch["utterance"], batch["label"], strict=True):
    print(f"  '{utterance}' → {label}")
First sample: 'Please help me with my card. It won't activate.' (label: 0)
Batch of 5 samples:
'How do i activate my card' → 0
'Can someone assist me with activating my card?' → 0
'My card needs to be activated.' → 0
'I was unable to activate my card.' → 0
'Is my card ready for use or does it need activated and if so how?' → 0
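Because each split exposes its columns as parallel lists, checking class balance takes only a few lines of standard-library Python. The sketch below uses toy labels standing in for `dataset_from_hub["train"]["label"]`:

```python
from collections import Counter

# Toy labels in place of a real split's "label" column
labels = [0, 0, 0, 1, 1, 2]

# Count how many samples fall into each class
distribution = Counter(labels)
for label, count in sorted(distribution.items()):
    print(f"label {label}: {count} samples ({count / len(labels):.0%})")
```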
Saving and sharing datasets#
Save to Hugging Face Hub#
To share your dataset with others or for reproducibility:
[9]:
# dataset.push_to_hub("your-username/your-dataset-name")
# Note: Make sure you're logged in with `huggingface-cli login`
Save to a local JSON file#
You can also save your dataset to a local JSON file for backup or sharing outside the Hugging Face Hub. Use the to_json method:
[10]:
# Save the dataset to a local JSON file
# dataset_from_hub.to_json("my_banking77_dataset.json")
Best practices#
1. Data quality matters#
Ensure consistent labeling across your dataset
Include diverse examples for each intent
Aim for balanced classes when possible
2. Split your data wisely#
Training: 60-80% of your data (required)
Validation: 10-20% (optional - AutoIntent will create from training if not provided)
Test: 10-20% (highly recommended - keep this frozen for final evaluation)
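AutoIntent handles the automatic validation split for you, but if you want to pre-split your own data along these guidelines, a stratified split fits in a few lines of standard-library Python. `stratified_split` below is an illustrative sketch, not an AutoIntent utility:

```python
import random
from collections import defaultdict

def stratified_split(samples, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle samples per label and split them into train/validation/test."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample["label"]].append(sample)

    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for group in by_label.values():
        rng.shuffle(group)
        n_val = round(len(group) * val_frac)
        n_test = round(len(group) * test_frac)
        splits["validation"].extend(group[:n_val])
        splits["test"].extend(group[n_val:n_val + n_test])
        splits["train"].extend(group[n_val + n_test:])
    return splits

# 30 toy samples, 10 per label, split roughly 70/15/15 within each class
samples = [{"utterance": f"u{i}", "label": i % 3} for i in range(30)]
splits = stratified_split(samples)
print({name: len(rows) for name, rows in splits.items()})
```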
3. Start small, then scale#
Begin with a small representative sample (10-20 examples per intent)
Use AutoIntent to find the best approach
Scale up with more data once you’ve validated your setup
Tip: If you have limited data, consider using AutoIntent’s augmentation tools (see index)
Next steps#
Now that you know how to work with data in AutoIntent, you’re ready to explore the different modules and techniques available for intent classification.
Up next: Learn how to use individual modules for more control over your intent classification pipeline.
Next chapter: 02_modules
See also: 01_data, which covers advanced topics like OOS samples and adding intent metadata