Pipeline Auto Configuration (AutoML)#

[1]:
from autointent import Pipeline

In this tutorial, we will walk through the pipeline auto-configuration process.

Let us use a small subset of the popular clinc150 dataset for the demonstration.

[2]:
from autointent import Dataset

dataset = Dataset.from_hub("AutoIntent/clinc150_subset")
dataset
[2]:
{'train': Dataset({
     features: ['utterance', 'label'],
     num_rows: 60
 }),
 'oos': Dataset({
     features: ['utterance', 'label'],
     num_rows: 10
 })}
[3]:
dataset["train"][0]
[3]:
{'utterance': 'does acero in maplewood allow reservations', 'label': 0}

Search Space#

AutoIntent provides default search spaces for multi-label and single-label classification problems. One can use them by constructing a Pipeline with the default_optimizer factory method:

[4]:
multiclass_pipeline = Pipeline.default_optimizer(multilabel=False)
multilabel_pipeline = Pipeline.default_optimizer(multilabel=True)

One can explore their contents:

[5]:
from pprint import pprint

from autointent.utils import load_default_search_space

search_space = load_default_search_space(multilabel=True)
pprint(search_space)
[{'metric': 'retrieval_hit_rate_intersecting',
  'node_type': 'embedding',
  'search_space': [{'embedder_name': ['deepvk/USER-bge-m3'],
                    'k': [10],
                    'module_name': 'retrieval'}]},
 {'metric': 'scoring_roc_auc',
  'node_type': 'scoring',
  'search_space': [{'k': [3],
                    'module_name': 'knn',
                    'weights': ['uniform', 'distance', 'closest']},
                   {'module_name': 'linear'}]},
 {'metric': 'decision_accuracy',
  'node_type': 'decision',
  'search_space': [{'module_name': 'threshold', 'thresh': [0.5]},
                   {'module_name': 'adaptive'}]}]

The search space can be customized:

[6]:
search_space[1]["search_space"][0]["k"] = [1, 3]
custom_pipeline = Pipeline.from_search_space(search_space)

See the tutorial 02_search_space_configuration for how the search space is structured.
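Since the search space is plain Python lists and dictionaries, any part of it can be edited before building the pipeline. A minimal sketch using the default multi-label space printed above (the module and parameter names come from that output; the literal below just reproduces it so the example is self-contained):

```python
# The default multi-label search space, as printed above: a list of
# optimization nodes, each with a target metric and candidate modules.
search_space = [
    {
        "metric": "retrieval_hit_rate_intersecting",
        "node_type": "embedding",
        "search_space": [
            {"embedder_name": ["deepvk/USER-bge-m3"], "k": [10], "module_name": "retrieval"}
        ],
    },
    {
        "metric": "scoring_roc_auc",
        "node_type": "scoring",
        "search_space": [
            {"k": [3], "module_name": "knn", "weights": ["uniform", "distance", "closest"]},
            {"module_name": "linear"},
        ],
    },
    {
        "metric": "decision_accuracy",
        "node_type": "decision",
        "search_space": [
            {"module_name": "threshold", "thresh": [0.5]},
            {"module_name": "adaptive"},
        ],
    },
]

# Widen the kNN scorer's grid and try more decision thresholds.
search_space[1]["search_space"][0]["k"] = [1, 3, 5]
search_space[2]["search_space"][0]["thresh"] = [0.3, 0.5]

# The edited space can then be passed to Pipeline.from_search_space(search_space).
```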

Embedder Settings#

Embedder is one of the key components of AutoIntent. It affects both the quality of the resulting classifier and the efficiency of the auto-configuration process.

To select embedding models for your optimization, you need to customize the search space (see 02_search_space_configuration). Here, we cover the settings that affect efficiency.

Several options are customizable via EmbedderConfig. The defaults are the following:

[7]:
from autointent.configs import EmbedderConfig

embedder_config = EmbedderConfig(
    batch_size=32,
    max_length=None,
    use_cache=False,
)
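Of these, batch_size controls how many utterances are embedded in a single forward pass. As a rough, library-independent illustration of the trade-off (this is a sketch, not AutoIntent's internal code):

```python
def batched(texts: list[str], batch_size: int) -> list[list[str]]:
    """Split texts into consecutive batches of at most batch_size items."""
    return [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]

utterances = [f"utterance {i}" for i in range(100)]

# Larger batches mean fewer forward passes (faster on a GPU, but more
# memory); smaller batches reduce memory at the cost of more passes.
print(len(batched(utterances, 32)))  # 4 passes
print(len(batched(utterances, 8)))   # 13 passes
```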

To apply the chosen settings, the set_config method is provided:

[8]:
custom_pipeline.set_config(embedder_config)

Vector Index Settings#

VectorIndex is one of the key utilities of AutoIntent. During the auto-configuration process, many retrieval operations are performed. By modifying VectorIndexConfig, you can choose whether to save the built vector index to the file system and where to save it.
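To build some intuition for the retrieval the vector index performs, here is a toy top-k nearest-neighbour lookup over embedding vectors using cosine similarity (pure Python with made-up 2-d "embeddings"; AutoIntent's actual index is more sophisticated):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query: list[float], index: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the utterances whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [utterance for utterance, _ in ranked[:k]]

# Toy 2-d "embeddings" for three stored utterances.
index = [
    ("book a table", [1.0, 0.1]),
    ("cancel my reservation", [0.9, 0.3]),
    ("what is the weather", [0.0, 1.0]),
]

print(top_k([1.0, 0.0], index, k=2))  # → ['book a table', 'cancel my reservation']
```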

Default options are the following:

[9]:
from autointent.configs import VectorIndexConfig

vector_index_config = VectorIndexConfig(db_dir=None, save_db=False)
  • db_dir=None tells AutoIntent to store intermediate files in the current working directory

  • save_db=False tells AutoIntent to remove all the files after the auto-configuration is finished

These settings can be applied in a familiar way:

[10]:
custom_pipeline.set_config(vector_index_config)

Logging Settings#

Another important consideration is which assets you want to save during the pipeline auto-configuration process. You can control this with LoggingConfig. The default settings are the following:

[11]:
from autointent.configs import LoggingConfig

logging_config = LoggingConfig(run_name=None, dirpath=None, dump_dir=None, dump_modules=False, clear_ram=False)
custom_pipeline.set_config(logging_config)

Complete Example#

[12]:
from autointent import Dataset, Pipeline
from autointent.configs import EmbedderConfig, LoggingConfig, VectorIndexConfig
from autointent.utils import load_default_search_space

# load data
dataset = Dataset.from_hub("AutoIntent/clinc150_subset")

# customize search space
search_space = load_default_search_space(multilabel=False)

# make pipeline
custom_pipeline = Pipeline.from_search_space(search_space)

# custom settings
embedder_config = EmbedderConfig()
vector_index_config = VectorIndexConfig()
logging_config = LoggingConfig()

custom_pipeline.set_config(embedder_config)
custom_pipeline.set_config(vector_index_config)
custom_pipeline.set_config(logging_config)

# start auto-configuration
custom_pipeline.fit(dataset)

# inference
custom_pipeline.predict(["hello world!"])
[12]:
array([2])