Data#
In this chapter you will learn how to work with intent classification data in AutoIntent. We’ll cover creating datasets, loading data from different sources, and manipulating your data for optimal results.
[1]:
import datasets
from autointent import Dataset
[2]:
datasets.logging.disable_progress_bar()  # disable tqdm outputs
Creating your first dataset#
The easiest way to get started is by creating a dataset from a Python dictionary. Let’s start with a simple banking intent classification example:
[3]:
# Create a simple intent classification dataset
data = {
    "train": [
        {"utterance": "What is my account balance?", "label": 0},
        {"utterance": "Check my current balance", "label": 0},
        {"utterance": "Show me my account details", "label": 0},
        {"utterance": "I want to transfer money", "label": 1},
        {"utterance": "How do I send funds to someone?", "label": 1},
        {"utterance": "Make a payment to my friend", "label": 1},
        {"utterance": "Cancel my last transaction", "label": 2},
        {"utterance": "Reverse the payment I just made", "label": 2},
        {"utterance": "Stop this transfer", "label": 2},
    ],
    "validation": [
        {"utterance": "Display my balance", "label": 0},
        {"utterance": "Send money to John", "label": 1},
        {"utterance": "Cancel the last payment", "label": 2},
    ],
    "test": [
        {"utterance": "How much money is in my account?", "label": 0},
        {"utterance": "Transfer funds to my savings", "label": 1},
        {"utterance": "Undo my recent payment", "label": 2},
    ],
}
# Load the data into AutoIntent
dataset = Dataset.from_dict(data)
print(f"Dataset created with {len(dataset['train'])} training samples")
Dataset created with 9 training samples
This creates a dataset with three intent classes:
- 0: Balance inquiries 
- 1: Money transfers 
- 2: Transaction cancellations 
Important notes about data splits:
- Test split: Highly recommended as a frozen evaluation set that’s never used during training 
- Validation split: Optional - if not provided, AutoIntent will automatically split your training data 
- Training split: Required - the data your model learns from 
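The notes above mean a dataset dict can legally omit the validation split. A minimal sketch (plain Python, no AutoIntent calls) of a train-plus-test-only dataset:

```python
# A dataset dict with only the required "train" split and the recommended
# "test" split; "validation" is omitted on purpose, since AutoIntent can
# carve a validation set out of the training data automatically.
minimal_data = {
    "train": [
        {"utterance": "What is my account balance?", "label": 0},
        {"utterance": "I want to transfer money", "label": 1},
        {"utterance": "Cancel my last transaction", "label": 2},
    ],
    "test": [  # frozen evaluation set, never used during training
        {"utterance": "How much money is in my account?", "label": 0},
    ],
}

assert "validation" not in minimal_data  # only train and test are supplied
print(sorted(minimal_data))  # ['test', 'train']
```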
Understanding the data format#
AutoIntent expects your data in a specific format. Here are the key requirements:
Single-label classification#
For most intent classification tasks, each utterance belongs to exactly one class:
{
    "train": [
        {"utterance": "Hello!", "label": 0},
        {"utterance": "Book a flight", "label": 1},
        {"utterance": "What's the weather?", "label": 2}
    ],
    "test": [  # Recommended: frozen test set
        {"utterance": "Hi there!", "label": 0}
    ]
    # validation split is optional - AutoIntent will create one if needed
}
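To make the requirements concrete, here is a small hypothetical helper (not part of the AutoIntent API) that checks the single-label format described above before you hand the data over:

```python
# Hypothetical format check (not an AutoIntent function): every sample must
# have a string "utterance" and an integer "label".
def check_single_label(split):
    for i, sample in enumerate(split):
        assert isinstance(sample.get("utterance"), str), f"sample {i}: utterance must be a string"
        assert isinstance(sample.get("label"), int), f"sample {i}: label must be an int"


check_single_label([
    {"utterance": "Hello!", "label": 0},
    {"utterance": "Book a flight", "label": 1},
])
print("format looks OK")
```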
Multi-label classification#
For tasks where utterances can belong to multiple classes, use a list of labels:
{
    "train": [
        {"utterance": "Book urgent flight to Paris", "label": [1, 0, 1]},  # booking=1, weather=0, urgent=1
        {"utterance": "What's the weather like?", "label": [0, 1, 0]}
    ],
    "test": [
        {"utterance": "Emergency flight booking", "label": [1, 0, 1]}
    ]
}
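If your raw data stores the *indices* of active classes rather than binary vectors, converting to the vector format shown above is a one-liner. A sketch (the helper name is hypothetical, not an AutoIntent API):

```python
# Hypothetical converter: turn a list of active class indices into the
# binary label vector format used for multi-label data above.
def to_binary_vector(active_classes, n_classes):
    vec = [0] * n_classes
    for idx in active_classes:
        vec[idx] = 1
    return vec


# "Book urgent flight to Paris": booking (class 0) and urgent (class 2) active
print(to_binary_vector([0, 2], n_classes=3))  # [1, 0, 1]
```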
Loading data from different sources#
AutoIntent supports multiple ways to load your data:
From a dictionary (recommended for getting started)#
Perfect when you have your data ready in Python:
[4]:
# Example with a complete dataset including all splits
banking_data = {
    "train": [
        {"utterance": "What is my account balance?", "label": 0},
        {"utterance": "Check my savings balance", "label": 0},
        {"utterance": "How much money do I have?", "label": 0},
        {"utterance": "Transfer $100 to savings", "label": 1},
        {"utterance": "Send money to my friend", "label": 1},
        {"utterance": "Make a payment", "label": 1},
        {"utterance": "Cancel my last payment", "label": 2},
        {"utterance": "Stop this transaction", "label": 2},
        {"utterance": "Reverse my transfer", "label": 2},
    ],
    "validation": [
        {"utterance": "Display my balance", "label": 0},
        {"utterance": "Send $50 to John", "label": 1},
        {"utterance": "Stop my last transaction", "label": 2},
    ],
    "test": [
        {"utterance": "Show me my current balance", "label": 0},
        {"utterance": "I want to transfer funds", "label": 1},
        {"utterance": "Cancel this payment", "label": 2},
    ],
}
dataset_from_dict = Dataset.from_dict(banking_data)
print("✅ Dataset loaded from dictionary")
print(f"Splits: {list(dataset_from_dict.keys())}")
✅ Dataset loaded from dictionary
Splits: ['train', 'validation', 'test']
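Because the dict format is plain Python, checking class balance before training takes only the standard library. A sketch, shown on a trimmed copy of the training list above:

```python
from collections import Counter

# Trimmed copy of the banking training data defined above
train = [
    {"utterance": "What is my account balance?", "label": 0},
    {"utterance": "Transfer $100 to savings", "label": 1},
    {"utterance": "Make a payment", "label": 1},
    {"utterance": "Cancel my last payment", "label": 2},
]

# Count how many utterances fall into each class
label_counts = Counter(sample["label"] for sample in train)
print(dict(sorted(label_counts.items())))  # {0: 1, 1: 2, 2: 1}
```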
From a JSON file#
When you have your data saved as a JSON file with the same structure:
[5]:
# Example: dataset_from_json = Dataset.from_json("/path/to/your/data.json")
From Hugging Face Hub#
For loading public datasets or sharing your own:
[6]:
# Load a sample dataset from HuggingFace Hub
dataset_from_hub = Dataset.from_hub("DeepPavlov/banking77")
print("✅ Dataset loaded from Hugging Face Hub")
print(f"Training samples: {len(dataset_from_hub['train'])}")
✅ Dataset loaded from Hugging Face Hub
Training samples: 10003
Working with your dataset#
Once loaded, your dataset behaves like a dictionary of Hugging Face datasets:
[7]:
# Access different splits
print("Available splits:", list(dataset_from_hub.keys()))
print(f"Training samples: {len(dataset_from_hub['train'])}")
# View the first few samples
print("\nFirst 3 training samples:")
train_split = dataset_from_hub["train"][:3]
for i, (utterance, label) in enumerate(zip(train_split["utterance"], train_split["label"], strict=True)):
    print(f"{i+1}. '{utterance}' → label {label}")
Available splits: ['train', 'test']
Training samples: 10003
First 3 training samples:
1. 'Please help me with my card.  It won't activate.' → label 0
2. 'I tired but an unable to activate my card.' → label 0
3. 'I want to start using my card.' → label 0
Working with individual samples#
[8]:
# Access specific samples
first_sample = dataset_from_hub["train"][0]
print(f"First sample: '{first_sample['utterance']}' (label: {first_sample['label']})")
# Slice multiple samples
batch = dataset_from_hub["train"][5:10]
print("\nBatch of 5 samples:")
for utterance, label in zip(batch["utterance"], batch["label"], strict=True):
    print(f"  '{utterance}' → {label}")
First sample: 'Please help me with my card.  It won't activate.' (label: 0)
Batch of 5 samples:
  'How do i activate my card' → 0
  'Can someone assist me with activating my card?' → 0
  'My card needs to be activated.' → 0
  'I was unable to activate my card.' → 0
  'Is my card ready for use or does it need activated and if so how?' → 0
Saving and sharing datasets#
Save to Hugging Face Hub#
To share your dataset with others or for reproducibility:
[9]:
# dataset.push_to_hub("your-username/your-dataset-name")
# Note: Make sure you're logged in with `huggingface-cli login`
Save to a local JSON file#
You can also save your dataset to a local JSON file for backup or sharing outside the Hugging Face Hub. Use the `to_json` method:
[10]:
# Save the dataset to a local JSON file
# dataset_from_hub.to_json("my_banking77_dataset.json")
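Note that the plain-dict format shown earlier is ordinary JSON, so the standard library alone can back it up and restore it. A sketch (the exact on-disk layout produced by `to_json` may differ; this only illustrates the dict format):

```python
import json
import os
import tempfile

# A small dataset in the plain-dict format used throughout this chapter
data = {"train": [{"utterance": "Hello!", "label": 0}]}

# Write it out and read it back with the standard library
path = os.path.join(tempfile.gettempdir(), "backup.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

with open(path, encoding="utf-8") as f:
    restored = json.load(f)

assert restored == data  # the round trip is lossless
print("round-trip OK")
```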
Best practices#
1. Data quality matters#
- Ensure consistent labeling across your dataset 
- Include diverse examples for each intent 
- Aim for balanced classes when possible 
2. Split your data wisely#
- Training: 60-80% of your data (required) 
- Validation: 10-20% (optional - AutoIntent will create from training if not provided) 
- Test: 10-20% (highly recommended - keep this frozen for final evaluation) 
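The ratios above can be produced with a simple shuffle-and-slice; a hedged stdlib sketch on synthetic data (remember AutoIntent can also create the validation split for you):

```python
import random

# Synthetic data: 100 samples across 3 classes
samples = [{"utterance": f"utterance {i}", "label": i % 3} for i in range(100)]

rng = random.Random(42)  # fixed seed for reproducibility
rng.shuffle(samples)

# 80/10/10 split
n = len(samples)
train = samples[: int(0.8 * n)]
validation = samples[int(0.8 * n) : int(0.9 * n)]
test = samples[int(0.9 * n) :]

print(len(train), len(validation), len(test))  # 80 10 10
```

For imbalanced datasets, consider a stratified split (per-label shuffling) so every class appears in each split.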
3. Start small, then scale#
- Begin with a small representative sample (10-20 examples per intent) 
- Use AutoIntent to find the best approach 
- Scale up with more data once you’ve validated your setup 
- Tip: If you have limited data, consider using AutoIntent’s augmentation tools (see index) 
Next steps#
Now that you know how to work with data in AutoIntent, you’re ready to explore the different modules and techniques available for intent classification.
Up next: Learn how to use individual modules for more control over your intent classification pipeline.
- Next chapter: 02_modules 
- See also: 01_data, which covers advanced topics like out-of-scope (OOS) samples and attaching additional information to intents