AutoML Pipeline Configuration#
AutoML (Automated Machine Learning) in AutoIntent allows you to automatically find the best configuration for your intent classification pipeline. Instead of manually tuning hyperparameters and selecting components, AutoML explores different combinations to find the optimal setup for your specific dataset.
[1]:
from autointent import Pipeline
In this tutorial, we’ll walk through the pipeline auto-configuration process step by step. We’ll learn how to:
Use predefined search spaces and presets
Customize search configurations
Set up logging and validation strategies
Run the optimization process
Save and load optimized pipelines
Let’s start by loading a small subset of the popular clinc150 dataset for demonstration.
[2]:
from autointent import Dataset
# Load the dataset from Hugging Face hub
dataset = Dataset.from_hub("DeepPavlov/clinc150_subset")
print(f"Dataset contains {len(dataset)} splits")
dataset
Dataset contains 5 splits
[2]:
{'train_0': Dataset({
features: ['utterance', 'label'],
num_rows: 18
}),
'train_1': Dataset({
features: ['utterance', 'label'],
num_rows: 18
}),
'validation_0': Dataset({
features: ['utterance', 'label'],
num_rows: 4
}),
'validation_1': Dataset({
features: ['utterance', 'label'],
num_rows: 8
}),
'test': Dataset({
features: ['utterance', 'label'],
num_rows: 12
})}
Let’s examine the structure of our dataset by looking at a sample utterance:
[3]:
sample = dataset["train_0"][0]
print(f"Sample utterance: '{sample['utterance']}'")
print(f"Intent label: '{sample['label']}'")
sample
Sample utterance: 'do they take reservations at mcdonalds'
Intent label: '0'
[3]:
{'utterance': 'do they take reservations at mcdonalds', 'label': 0}
Search Space#
AutoIntent provides default search spaces. You can use one by constructing a Pipeline with the from_preset factory:
[4]:
pipeline = Pipeline.from_preset("classic-light")
You can inspect the structure and default values of any preset:
[5]:
from pprint import pprint
from autointent.utils import load_preset
preset = load_preset("classic-light")
pprint(preset)
{'embedder_config': {'model_name': 'intfloat/multilingual-e5-large-instruct'},
'hpo_config': {'n_startup_trials': 10, 'n_trials': 20, 'sampler': 'tpe'},
'search_space': [{'node_type': 'scoring',
'search_space': [{'k': {'high': 20, 'low': 1},
'module_name': 'knn',
'weights': ['uniform',
'distance',
'closest']},
{'module_name': 'linear'},
{'k': {'high': 20, 'low': 1},
'module_name': 'mlknn'}],
'target_metric': 'scoring_f1'},
{'node_type': 'decision',
'search_space': [{'module_name': 'threshold',
'thresh': {'high': 0.9, 'low': 0.1}},
{'module_name': 'argmax'},
{'module_name': 'jinoos'},
{'module_name': 'tunable'},
{'module_name': 'adaptive'}],
'target_metric': 'decision_accuracy'}]}
Customizing Search Spaces#
The search space can be customized to fit your specific needs. For example, you can modify hyperparameter ranges:
[6]:
# Example: modify the maximum k value for KNN-based components
preset["search_space"][0]["search_space"][0]["k"]["high"] = 10
custom_pipeline = Pipeline.from_optimization_config(preset)
See the 03_search_space_configuration tutorial for details on how the search space is structured.
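Besides tweaking hyperparameter ranges, you can prune whole modules from a node's search space. Here is a minimal sketch, based on the preset structure printed above, that keeps only the threshold and argmax modules in the decision node:
# Keep only two of the five decision modules to shrink the search space
preset["search_space"][1]["search_space"] = [
    {"module_name": "threshold", "thresh": {"high": 0.9, "low": 0.1}},
    {"module_name": "argmax"},
]
custom_pipeline = Pipeline.from_optimization_config(preset)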
Logging and Storage Configuration#
During the AutoML process, you’ll want to control what artifacts are saved and where they’re stored. The LoggingConfig allows you to specify:
project_dir: Directory where results will be saved
dump_modules: Whether to save trained model files
clear_ram: Whether to clear models from memory after training to save RAM
[7]:
from pathlib import Path
from autointent.configs import LoggingConfig
logging_config = LoggingConfig(
project_dir=Path.cwd() / "runs", # Save results to 'runs' directory
dump_modules=False, # Don't save large model files
clear_ram=False, # Keep models in memory for inference
)
custom_pipeline.set_config(logging_config)
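The resolved run directory is exposed as LoggingConfig.dirpath; we will use it later to load the pipeline back. You can inspect it at any point:
# Inspect where this run's artifacts will be stored
print(logging_config.dirpath)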
Model Configuration#
You can specify which transformer models to use for text embeddings and cross-encoding. This is useful when you want to:
Use smaller/faster models for experimentation
Apply domain-specific pre-trained models
Control model parameters like tokenizer settings
[8]:
from autointent.configs import CrossEncoderConfig, EmbedderConfig, TokenizerConfig
# Configure embedding model (used for vector representations)
custom_pipeline.set_config(EmbedderConfig(model_name="prajjwal1/bert-tiny"))
# Configure cross-encoder model (used for scoring text pairs)
custom_pipeline.set_config(
CrossEncoderConfig(model_name="cross-encoder/ms-marco-MiniLM-L2-v2", tokenizer_config=TokenizerConfig(max_length=8))
)
See the documentation for EmbedderConfig and CrossEncoderConfig for all available customization options.
Validation Strategy#
Choose between two validation approaches based on your dataset size:
Hold-out validation (default): Uses separate train/validation splits. Best when you have plenty of data (a configuration sketch follows the cell below).
Cross-validation: Splits data into k folds for more robust evaluation. Better for smaller datasets as it uses all data for both training and validation.
[9]:
from autointent.configs import DataConfig
# Use 3-fold cross-validation for better performance on small datasets
custom_pipeline.set_config(DataConfig(scheme="cv", n_folds=3))
See the DataConfig documentation for other available customization options.
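If your dataset is large enough, you can keep hold-out validation and configure it explicitly instead. A minimal sketch, assuming the hold-out scheme is named "ho" in DataConfig:
from autointent.configs import DataConfig

# Hold-out validation: train on the train splits, evaluate on the validation splits
custom_pipeline.set_config(DataConfig(scheme="ho"))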
Complete Example#
Let’s put everything together in a comprehensive example that demonstrates the full AutoML workflow:
[10]:
from autointent import Dataset, Pipeline
from autointent.configs import LoggingConfig
from autointent.utils import load_preset
# Step 1: Load your dataset
dataset = Dataset.from_hub("DeepPavlov/clinc150_subset")
print(f"Loaded dataset with {len(dataset)} splits")
# Step 2: Load and customize a preset configuration
preset = load_preset("classic-light")
# You can modify the preset here if needed
# preset["search_space"][0]["search_space"][0]["k"]["high"] = 5
# Step 3: Create pipeline from the configuration
pipeline = Pipeline.from_optimization_config(preset)
# Step 4: Configure logging and storage
logging_config = LoggingConfig(
dump_modules=True, # Save trained models for later use
clear_ram=False, # Keep models in memory for immediate inference
)
pipeline.set_config(logging_config)
# Step 5: Run AutoML optimization
print("Starting AutoML optimization...")
context = pipeline.fit(dataset)
print("✅ AutoML optimization completed!")
# Step 6: Test the optimized pipeline
test_utterances = ["hello world!", "I want to transfer money", "book a flight"]
predictions = pipeline.predict(test_utterances)
print(f"Predictions: {predictions}")
Loaded dataset with 5 splits
Starting AutoML optimization...
[I 2025-08-01 06:20:20,750] A new study created in RDB with name: scoring
"argmax" is NOT designed to handle OOS samples, but your data contains it. So, using this method reduces the power of classification.
/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/site-packages/autointent/modules/decision/_jinoos.py:150: RuntimeWarning: invalid value encountered in scalar divide
accuracy_oos = correct_oos / total_oos
✅ AutoML optimization completed!
Predictions: [2, 1, None]
Here None indicates that the decision module treated the last utterance as out-of-scope, i.e. not matching any of the trained intents.
Dump Results#
You can save all results of the auto-configuration process to the file system (to LoggingConfig.dirpath):
[11]:
context.dump()
Or you can dump only the configured pipeline to any desired location (LoggingConfig.dirpath by default):
[12]:
pipeline.dump()
Load Pipeline for Inference#
[13]:
loaded_pipe = Pipeline.load(logging_config.dirpath)
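The restored pipeline is ready for inference right away, using the same predict API shown earlier:
# Run the loaded pipeline on new utterances
predictions = loaded_pipe.predict(["I want to transfer money"])
print(predictions)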
Since this notebook runs automatically when the docs are built, we will clean up the working directory, if you don’t mind :)
[14]:
import shutil
shutil.rmtree(logging_config.dirpath)