Search Space Configuration#

In this guide, you will learn how to configure a custom hyperparameter search space.

Python API#

Before reading this guide, we recommend familiarizing yourself with the Concepts and Optimization sections.

Optimization Module#

To set up the optimization module, you need to create the following dictionary:

[1]:
knn_module = {
    "module_name": "knn",
    "k": [1, 5, 10, 50],
    "embedder_name": ["avsolatorio/GIST-small-Embedding-v0", "infgrad/stella-base-en-v2"],
}

The module_name field specifies the name of the module. You can find the names, for example, in…

TODO: Add docs for all available modules.

All fields except module_name are lists that define the search space for each hyperparameter (see KNNScorer). If you omit them, the default set of hyperparameters will be used:

[2]:

linear_module = {"module_name": "linear"}

See the LinearScorer docs.
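You can also pin down only some of the hyperparameters; any field you leave out keeps its default search values. A minimal sketch, reusing the field names from the knn example above:

knn_partial = {
    "module_name": "knn",
    "k": [3, 5],  # search only over k; embedder_name falls back to its defaults
}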

Optimization Node#

To set up the optimization node, you need to create a list of modules and specify the metric for optimization:

[3]:
scoring_node = {
    "node_type": "scoring",
    "metric_name": "scoring_roc_auc",
    "search_space": [
        knn_module,
        linear_module,
    ],
}

Search Space#

The search space for the entire pipeline looks approximately like this:

[4]:
search_space = [
    {
        "node_type": "embedding",
        "metric": "retrieval_hit_rate",
        "search_space": [
            {
                "module_name": "retrieval",
                "k": [10],
                "embedder_name": ["avsolatorio/GIST-small-Embedding-v0", "infgrad/stella-base-en-v2"],
            }
        ],
    },
    {
        "node_type": "scoring",
        "metric": "scoring_roc_auc",
        "search_space": [
            {"module_name": "knn", "k": [1, 3, 5, 10], "weights": ["uniform", "distance", "closest"]},
            {"module_name": "linear"},
            {
                "module_name": "dnnc",
                "cross_encoder_name": ["BAAI/bge-reranker-base", "cross-encoder/ms-marco-MiniLM-L-6-v2"],
                "k": [1, 3, 5, 10],
            },
        ],
    },
    {
        "node_type": "decision",
        "metric": "decision_accuracy",
        "search_space": [{"module_name": "threshold", "thresh": [0.5]}, {"module_name": "argmax"}],
    },
]
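Note that the scoring entry above mirrors the scoring_node defined earlier. Since nodes are plain dictionaries, you can just as well assemble the list from the objects you already have (a sketch using the names from cells [1]–[3]):

search_space_alt = [
    {
        "node_type": "embedding",
        "metric": "retrieval_hit_rate",
        "search_space": [
            {
                "module_name": "retrieval",
                "k": [10],
                "embedder_name": ["avsolatorio/GIST-small-Embedding-v0"],
            }
        ],
    },
    scoring_node,  # defined above, reused as-is
    {
        "node_type": "decision",
        "metric": "decision_accuracy",
        "search_space": [{"module_name": "argmax"}],
    },
]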

Load Data#

Let us use a small subset of the popular clinc150 dataset:

[5]:

from autointent import Dataset

dataset = Dataset.from_hub("AutoIntent/clinc150_subset")

Start Auto Configuration#

[6]:
from autointent import Pipeline

pipeline_optimizer = Pipeline.from_search_space(search_space)
pipeline_optimizer.fit(dataset)
[6]:
<autointent.context._context.Context at 0x7f7128f429e0>
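Once fitting completes, the optimized pipeline can be used for inference. A minimal sketch, assuming the fitted pipeline exposes a predict method that accepts a list of utterances (see the inference guide for the exact API):

# `predict` and its signature are an assumption here; consult the inference docs
predictions = pipeline_optimizer.predict(["can you tell me my account balance?"])
print(predictions)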

See Also#