AutoML#

[1]:
from autointent import Pipeline

In this tutorial we will walk through the pipeline auto-configuration process.

Let us use a small subset of the popular clinc150 dataset for the demonstration.

[2]:
from autointent import Dataset

dataset = Dataset.from_hub("AutoIntent/clinc150_subset")
dataset
[2]:
{'train_0': Dataset({
     features: ['utterance', 'label'],
     num_rows: 18
 }),
 'train_1': Dataset({
     features: ['utterance', 'label'],
     num_rows: 18
 }),
 'validation_0': Dataset({
     features: ['utterance', 'label'],
     num_rows: 4
 }),
 'validation_1': Dataset({
     features: ['utterance', 'label'],
     num_rows: 8
 }),
 'test': Dataset({
     features: ['utterance', 'label'],
     num_rows: 12
 })}
[3]:
dataset["train_0"][0]
[3]:
{'utterance': 'do they take reservations at mcdonalds', 'label': 0}

Search Space#

AutoIntent provides default search spaces. One can use them by constructing a Pipeline with the from_preset factory:

[4]:
pipeline = Pipeline.from_preset("light_extra")

One can explore its contents:

[5]:
from pprint import pprint

from autointent.utils import load_preset

preset = load_preset("light_extra")
pprint(preset)
{'sampler': 'random',
 'search_space': [{'node_type': 'scoring',
                   'search_space': [{'k': {'high': 20, 'low': 1},
                                     'module_name': 'knn',
                                     'n_trials': 10,
                                     'weights': ['uniform',
                                                 'distance',
                                                 'closest']},
                                    {'module_name': 'linear'},
                                    {'k': {'high': 20, 'low': 1},
                                     'module_name': 'mlknn',
                                     'n_trials': 10}],
                   'target_metric': 'scoring_roc_auc'},
                  {'node_type': 'decision',
                   'search_space': [{'module_name': 'threshold',
                                     'n_trials': 10,
                                     'thresh': {'high': 0.9, 'low': 0.1}},
                                    {'module_name': 'argmax'}],
                   'target_metric': 'decision_accuracy'}]}

The search space can be customized:

[6]:
preset["search_space"][0]["search_space"][0]["k"] = [1, 3]
custom_pipeline = Pipeline.from_optimization_config(preset)

See the tutorial 02_search_space_configuration for how the search space is structured.
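For example, one can narrow a hyperparameter grid or drop a whole module from a node. The snippet below operates on a plain-dict copy of the preset printed above (pure Python, no AutoIntent calls); the edited dict can then be passed to Pipeline.from_optimization_config exactly as in cell [6]:

```python
# A plain-dict copy of the "light_extra" preset printed above,
# so the customization below is easy to follow without the library.
preset = {
    "sampler": "random",
    "search_space": [
        {
            "node_type": "scoring",
            "target_metric": "scoring_roc_auc",
            "search_space": [
                {"module_name": "knn", "k": {"low": 1, "high": 20},
                 "weights": ["uniform", "distance", "closest"], "n_trials": 10},
                {"module_name": "linear"},
                {"module_name": "mlknn", "k": {"low": 1, "high": 20}, "n_trials": 10},
            ],
        },
        {
            "node_type": "decision",
            "target_metric": "decision_accuracy",
            "search_space": [
                {"module_name": "threshold", "thresh": {"low": 0.1, "high": 0.9},
                 "n_trials": 10},
                {"module_name": "argmax"},
            ],
        },
    ],
}

# Replace the knn "k" range with a fixed grid of values to try.
preset["search_space"][0]["search_space"][0]["k"] = [1, 3]

# The warnings later in this tutorial note that "argmax" cannot reject
# out-of-scope samples, so we drop it from the decision node entirely.
decision = preset["search_space"][1]
decision["search_space"] = [
    module for module in decision["search_space"]
    if module["module_name"] != "argmax"
]
```

After these edits, only the threshold module remains in the decision node, and the knn neighbour count is searched over {1, 3} instead of the full 1–20 range.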

Logging Settings#

An important consideration is which assets you want to save during the pipeline auto-configuration process. You can control this with LoggingConfig:

[7]:
from pathlib import Path
from autointent.configs import LoggingConfig

logging_config = LoggingConfig(project_dir=Path.cwd() / "runs", dump_modules=False, clear_ram=False)
custom_pipeline.set_config(logging_config)

Default Transformers#

One can specify which embedding model and cross-encoder model to use, along with their default settings:

[8]:
from autointent.configs import EmbedderConfig, CrossEncoderConfig

custom_pipeline.set_config(EmbedderConfig(model_name="prajjwal1/bert-tiny", device="cpu"))
custom_pipeline.set_config(CrossEncoderConfig(model_name="cross-encoder/ms-marco-MiniLM-L2-v2", max_length=8))

See the docs for EmbedderConfig and CrossEncoderConfig for options available to customize.

Cross-Validation vs Hold-Out Validation#

If you have plenty of training and evaluation data, you can use the default hold-out validation strategy. If not, you can choose cross-validation: it takes a bit more time but uses all of the available data for better hyperparameter tuning.
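As a back-of-the-envelope illustration (plain Python, not AutoIntent internals): with hold-out, utterances reserved for validation never contribute to training, while with k-fold cross-validation every utterance is trained on in k-1 of the k folds:

```python
# Illustrative count for a small labelled set of 26 utterances
# (roughly the 18 train + 8 validation utterances in the subset above).
n_total = 26

# Hold-out: one fixed split; the validation part never trains the model.
holdout_train = 18
holdout_eval = n_total - holdout_train  # 8 utterances unused for training

# 3-fold CV: each utterance is evaluated once and trained on twice.
n_folds = 3
fold_sizes = [n_total // n_folds + (1 if i < n_total % n_folds else 0)
              for i in range(n_folds)]
cv_train_sizes = [n_total - size for size in fold_sizes]

print(fold_sizes)      # [9, 9, 8]
print(cv_train_sizes)  # [17, 17, 18]
```

Each CV trial thus sees nearly the whole dataset during training, at the cost of fitting the model once per fold.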

This behavior is controlled with DataConfig:

[9]:
from autointent.configs import DataConfig

custom_pipeline.set_config(DataConfig(scheme="cv", n_folds=3))

See the docs for DataConfig for other options available to customize.

Complete Example#

[10]:
from autointent import Dataset, Pipeline
from autointent.configs import LoggingConfig
from autointent.utils import load_preset

# load data
dataset = Dataset.from_hub("AutoIntent/clinc150_subset")

# customize search space
preset = load_preset("light_extra")

# make pipeline
custom_pipeline = Pipeline.from_optimization_config(preset)

# custom settings
logging_config = LoggingConfig()

custom_pipeline.set_config(logging_config)

# start auto-configuration
context = custom_pipeline.fit(dataset)

# inference on-the-fly
custom_pipeline.predict(["hello world!"])
[I 2025-03-08 22:27:31,124] A new study created in memory with name: no-name-c9c772b6-dee6-46e1-a6fd-20719feaaaa6
"argmax" is NOT designed to handle OOS samples, but your data contains it. So, using this method reduces the power of classification.
/home/runner/.cache/pypoetry/virtualenvs/autointent-FDypUDHQ-py3.10/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
[10]:
[None]

Dump Results#

One can save all results of the auto-configuration process to the file system (to LoggingConfig.dirpath):

[11]:
context.dump()

Or one can dump only the configured pipeline to any desired location (by default LoggingConfig.dirpath):

[12]:
custom_pipeline.dump()
Attribute _artifact of type <class 'autointent.context.optimization_info._data_models.ScorerArtifact'> cannot be dumped to file system.
Attribute _artifact of type <class 'autointent.context.optimization_info._data_models.DecisionArtifact'> cannot be dumped to file system.

Load Pipeline for Inference#

[13]:
loaded_pipe = Pipeline.load(logging_config.dirpath)

Since this notebook is launched automatically while building the docs, we will clean up the saved artifacts, if you don’t mind :)

[14]:
import shutil

shutil.rmtree(logging_config.dirpath)