AutoML Customization#

In this guide, you will learn how to configure a custom hyperparameter search space.

Python API#

Before reading this guide, we recommend familiarizing yourself with the Concepts and Optimization sections.

Optimization Module#

To set up the optimization module, you need to create the following dictionary:

[1]:

knn_module = {
    "module_name": "knn",
    "k": [1, 5, 10, 50],
    "embedder_config": ["sergeyzh/rubert-tiny-turbo"],
}

The module_name field specifies the name of the module. You can list the available names yourself:

[2]:

from autointent.modules import SCORING_MODULES, DECISION_MODULES, EMBEDDING_MODULES, REGEX_MODULES

print(list(SCORING_MODULES.keys()))
print(list(DECISION_MODULES.keys()))
print(list(EMBEDDING_MODULES.keys()))
print(list(REGEX_MODULES.keys()))

['dnnc', 'knn', 'linear', 'description', 'rerank', 'sklearn', 'mlknn']
['argmax', 'jinoos', 'threshold', 'tunable', 'adaptive']
['retrieval', 'logreg_embedding']
['simple']

All fields except module_name are lists that define the search space for each hyperparameter (see KNNScorer). If you omit them, the default set of hyperparameters will be used:

[3]:

linear_module = {"module_name": "linear"}

See the LinearScorer documentation for the default values.
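To make the idea of a per-hyperparameter search space concrete, the sketch below expands a module dictionary into the individual configurations it describes. The helper expand_module is hypothetical and not part of AutoIntent; it only assumes, as stated above, that every field other than module_name holds a list of candidate values.

```python
from itertools import product


def expand_module(module: dict) -> list[dict]:
    """Hypothetical helper (not part of AutoIntent): list every combination in a module's search space."""
    # Every key except "module_name" is assumed to map to a list of candidate values.
    names = [key for key in module if key != "module_name"]
    grids = [module[key] for key in names]
    return [
        {"module_name": module["module_name"], **dict(zip(names, values))}
        for values in product(*grids)
    ]


knn_module = {
    "module_name": "knn",
    "k": [1, 5, 10, 50],
    "embedder_config": ["sergeyzh/rubert-tiny-turbo"],
}

for config in expand_module(knn_module):
    print(config)
```

This prints four configurations, one per value of k. In this sketch, a module that contains only module_name expands to a single entry, which stands in for the library's default hyperparameters.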

Optimization Node#

To set up the optimization node, you need to create a list of modules and specify the target metric for optimization:

[4]:

scoring_node = {
    "node_type": "scoring",
    "target_metric": "scoring_roc_auc",
    "search_space": [
        knn_module,
        linear_module,
    ],
}
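Because a node is a plain dictionary, a typo in a key is easy to miss. The following is a minimal sanity check based only on the structure described in this guide; check_node and REQUIRED_NODE_KEYS are hypothetical names, not part of AutoIntent, and the library may perform its own, stricter validation.

```python
REQUIRED_NODE_KEYS = {"node_type", "target_metric", "search_space"}


def check_node(node: dict) -> list[str]:
    """Hypothetical sanity check (not part of AutoIntent): return a list of problems found in a node."""
    problems = []
    missing = REQUIRED_NODE_KEYS - node.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for i, module in enumerate(node.get("search_space", [])):
        if "module_name" not in module:
            problems.append(f"module #{i} has no 'module_name'")
            continue
        # Assumption from this guide: every other field is a list of candidate values.
        for key, values in module.items():
            if key != "module_name" and not isinstance(values, list):
                problems.append(f"{module['module_name']}.{key} should be a list of candidates")
    return problems


scoring_node = {
    "node_type": "scoring",
    "target_metric": "scoring_roc_auc",
    "search_space": [
        {"module_name": "knn", "k": [1, 5, 10, 50]},
        {"module_name": "linear"},
    ],
}

print(check_node(scoring_node))
print(check_node({"node_type": "scoring", "search_space": [{"module_name": "knn", "k": 5}]}))
```

The first call reports no problems. The second reports the missing target_metric key and the scalar k that should have been wrapped in a list.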

Search Space#

The search space for the entire pipeline looks approximately like this:

[5]:

search_space = [
    {
        "node_type": "embedding",
        "target_metric": "retrieval_hit_rate",
        "search_space": [
            {
                "module_name": "retrieval",
                "k": [10],
                "embedder_config": ["avsolatorio/GIST-small-Embedding-v0", "sergeyzh/rubert-tiny-turbo"],
            }
        ],
    },
    {
        "node_type": "scoring",
        "target_metric": "scoring_roc_auc",
        "search_space": [
            {"module_name": "knn", "k": [1, 3, 5, 10], "weights": ["uniform", "distance", "closest"]},
            {"module_name": "linear"},
            {
                "module_name": "dnnc",
                "cross_encoder_config": ["cross-encoder/ms-marco-MiniLM-L-6-v2"],
                "k": [1, 3, 5, 10],
            },
        ],
    },
    {
        "node_type": "decision",
        "target_metric": "decision_accuracy",
        "search_space": [{"module_name": "threshold", "thresh": [0.5]}, {"module_name": "argmax"}],
    },
]
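Before launching an optimization run, it can help to estimate how many explicit hyperparameter combinations each node contains. The sketch below counts them for the scoring node of the search space above; count_configs is a hypothetical helper, and it treats a module with no listed hyperparameters as a single (default) configuration, which is an assumption of this example.

```python
from math import prod


def count_configs(node: dict) -> int:
    """Hypothetical helper: number of explicit hyperparameter combinations in one node."""
    # For each module, multiply the lengths of its candidate lists;
    # prod() of an empty sequence is 1, so a defaults-only module counts once.
    return sum(
        prod(len(values) for key, values in module.items() if key != "module_name")
        for module in node["search_space"]
    )


# Copy of the scoring node from the search space above.
scoring_node = {
    "node_type": "scoring",
    "target_metric": "scoring_roc_auc",
    "search_space": [
        {"module_name": "knn", "k": [1, 3, 5, 10], "weights": ["uniform", "distance", "closest"]},
        {"module_name": "linear"},
        {
            "module_name": "dnnc",
            "cross_encoder_config": ["cross-encoder/ms-marco-MiniLM-L-6-v2"],
            "k": [1, 3, 5, 10],
        },
    ],
}

print(count_configs(scoring_node))  # 12 (knn) + 1 (linear) + 4 (dnnc) = 17
```

Applied to the other two nodes, the same count gives 2 for the embedding node and 2 for the decision node.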

Load Data#

Let us use a small subset of the popular clinc150 dataset:

[6]:

from autointent import Dataset

dataset = Dataset.from_hub("AutoIntent/clinc150_subset")

Start Auto Configuration#

[7]:

from autointent import Pipeline

pipeline_optimizer = Pipeline.from_search_space(search_space)
pipeline_optimizer.fit(dataset, sampler="random")

[I 2025-03-08 22:23:14,898] A new study created in memory with name: no-name-8a51bbb9-dab5-4e44-84ae-ef2a6bf74439
"argmax" is NOT designed to handle OOS samples, but your data contains it. So, using this method reduces the power of classification.
/home/runner/.cache/pypoetry/virtualenvs/autointent-FDypUDHQ-py3.10/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

[7]:

<autointent.context._context.Context at 0x7f88f8b93340>

There are three hyperparameter tuning samplers available:

“random”
“brute”
“tpe”

All the samplers are implemented with Optuna.
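As a rough intuition for how two of these strategies differ, the standard-library sketch below contrasts an exhaustive search with a random one on a toy problem. It is a conceptual illustration only: it does not use Optuna or AutoIntent, the search space and toy_score objective are invented for this example, and the model-based "tpe" strategy is left out because it cannot be reproduced in a few lines.

```python
import random
from itertools import product

# Toy search space and toy objective, both invented for illustration only.
space = {"k": [1, 3, 5, 10], "weights": ["uniform", "distance", "closest"]}


def toy_score(config: dict) -> float:
    """Made-up objective, not a real metric: peaks at k=5 with 'distance' weights."""
    bonus = {"uniform": 0.0, "distance": 0.1, "closest": 0.05}[config["weights"]]
    return 1.0 / (1 + abs(config["k"] - 5)) + bonus


def brute_trials(space: dict) -> list[dict]:
    """Exhaustive strategy: every combination is tried exactly once."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


def random_trials(space: dict, n_trials: int, seed: int = 0) -> list[dict]:
    """Random strategy: each trial draws every hyperparameter independently."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()} for _ in range(n_trials)]


best_brute = max(brute_trials(space), key=toy_score)
best_random = max(random_trials(space, n_trials=5), key=toy_score)
print("brute :", best_brute, round(toy_score(best_brute), 3))
print("random:", best_random, round(toy_score(best_random), 3))
```

The exhaustive search always finds the best of the 12 combinations at the cost of 12 evaluations, while the random search spends a fixed budget of 5 evaluations and may or may not hit the optimum.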

For finer control over the optimization, you can use the more versatile OptimizationConfig together with from_optimization_config.