autointent.context.data_handler.StratifiedSplitter

class autointent.context.data_handler.StratifiedSplitter(test_size, label_feature, random_seed, shuffle=True, is_few_shot=False, examples_per_label=8)

A class for stratified splitting of datasets.

This class provides methods to split a dataset into training and testing subsets while preserving the distribution of target labels. It supports both single-label and multi-label datasets.
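To illustrate what "preserving the distribution of target labels" means, here is a minimal standard-library sketch of a single-label stratified split. It is not the library's implementation: the function `stratified_split` and the sample labels are hypothetical, and only the `test_size` and seed ideas are borrowed from the class signature.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=0):
    """Hypothetical helper: return (train_idx, test_idx) so that each
    label contributes roughly `test_size` of its samples to the test set."""
    rng = random.Random(seed)

    # Group sample indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Take the same fraction of every label for the test split.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Illustrative data: a 2:1 ratio between two intents.
labels = ["greet"] * 8 + ["bye"] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=42)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
```

In this sketch both splits keep the original 2:1 ratio between the two labels, which is the property a stratified splitter is meant to guarantee.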

Parameters:
  • test_size (float)

  • label_feature (str)

  • random_seed (int | None)

  • shuffle (bool)

  • is_few_shot (bool)

  • examples_per_label (int)

test_size
label_feature
random_seed
shuffle = True
is_few_shot = False
examples_per_label = 8
__call__(dataset, multilabel, allow_oos_in_train=None)

Split the dataset into training and testing subsets.

Parameters:
  • dataset (datasets.Dataset) – The input dataset to be split.

  • multilabel (bool) – Whether the dataset is multi-label.

  • allow_oos_in_train (bool | None) – Whether out-of-scope (OOS) utterances may appear in the train split. Must be set explicitly when the dataset contains OOS samples.

Returns:

A tuple containing the training and testing datasets.

Raises:

ValueError – If OOS samples are present but allow_oos_in_train is not specified.

Return type:

tuple[datasets.Dataset, datasets.Dataset]

has_oos_samples(dataset)

Check if the dataset contains out-of-scope samples.

Parameters:

dataset (datasets.Dataset) – The dataset to check.

Returns:

True if the dataset contains OOS samples, False otherwise.

Return type:

bool
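
The check can be pictured with the following sketch. It assumes that an out-of-scope sample is one whose label is None and that the dataset is a plain list of dicts; both are assumptions made for illustration, and the function below is a hypothetical stand-in rather than the library's method.

```python
def has_oos_samples_sketch(records, label_feature="label"):
    """Hypothetical stand-in: return True if any record has no label.

    Assumption: an out-of-scope (OOS) sample is marked by a None label.
    """
    return any(record[label_feature] is None for record in records)


# Illustrative records; the field names are assumptions.
in_scope_only = [
    {"utterance": "hi there", "label": 0},
    {"utterance": "see you later", "label": 1},
]
with_oos = in_scope_only + [{"utterance": "what is the weather on Mars?", "label": None}]

no_oos_result = has_oos_samples_sketch(in_scope_only)
oos_result = has_oos_samples_sketch(with_oos)
```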

get_stratify_inputs(dataset, multilabel, allow_oos_in_train)

Return the effective dataset and post-split hook for stratification.

This method is the single source of truth for OOS handling: both splitting and readiness checks call it, so the logic is not duplicated.

Parameters:
  • dataset (datasets.Dataset) – The input dataset (may contain OOS).

  • multilabel (bool) – Whether the dataset is multi-label.

  • allow_oos_in_train (bool | None) – Whether OOS samples are allowed in the train split. Must be set when the dataset contains OOS samples.

Returns:

A StratifyInputs object holding the dataset to stratify on and a post_split_fn hook to apply after the split.

Raises:

ValueError – If the dataset contains OOS samples and allow_oos_in_train is None.

Return type:

StratifyInputs
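
The contract described above (an effective dataset plus a post-split hook, with a ValueError when OOS handling is left unspecified) might look roughly like the sketch below. Everything here is hypothetical: `StratifyInputsSketch` only mimics the two documented fields, OOS samples are assumed to carry a None label, and the policy for where OOS samples end up after the split is an illustrative guess, not the library's documented behavior.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Records = list[dict]
Split = tuple[Records, Records]


@dataclass
class StratifyInputsSketch:
    """Hypothetical stand-in for StratifyInputs."""

    dataset: Records                          # data to stratify on
    post_split_fn: Callable[[Split], Split]   # hook applied after the split


def get_stratify_inputs_sketch(
    records: Records,
    allow_oos_in_train: Optional[bool],
    label_feature: str = "label",
) -> StratifyInputsSketch:
    # Assumption: an OOS sample is one whose label is None.
    oos = [r for r in records if r[label_feature] is None]
    if not oos:
        # Nothing special to do: stratify on everything, identity hook.
        return StratifyInputsSketch(records, lambda split: split)
    if allow_oos_in_train is None:
        raise ValueError("Dataset contains OOS samples; set allow_oos_in_train.")

    in_scope = [r for r in records if r[label_feature] is not None]

    def post_split(split: Split) -> Split:
        train, test = split
        # Illustrative policy only: route OOS samples to train when
        # allowed, otherwise to test.
        if allow_oos_in_train:
            return train + oos, test
        return train, test + oos

    return StratifyInputsSketch(in_scope, post_split)


records = [{"label": 0}, {"label": 1}, {"label": None}]
inputs = get_stratify_inputs_sketch(records, allow_oos_in_train=False)
train, test = inputs.post_split_fn(([{"label": 0}], [{"label": 1}]))
```

Keeping the OOS decision in one function means a caller that only wants to check whether a dataset can be split hits the same ValueError path as the splitter itself.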