autointent.context.data_handler.StratifiedSplitter
- class autointent.context.data_handler.StratifiedSplitter(test_size, label_feature, random_seed, shuffle=True, is_few_shot=False, examples_per_label=8)
A class for stratified splitting of datasets.
This class provides methods to split a dataset into training and testing subsets while preserving the distribution of target labels. It supports both single-label and multi-label datasets.
- Parameters:
- test_size
- label_feature
- random_seed
- shuffle = True
- is_few_shot = False
- examples_per_label = 8
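To illustrate what "preserving the distribution of target labels" means, here is a minimal single-label sketch in plain Python. It is not the library's implementation: the function `stratified_split` and its rounding rule are assumptions made for illustration, and only the parameter names `test_size`, `random_seed`, and `shuffle` are borrowed from the signature above.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, random_seed=0, shuffle=True):
    """Return (train_indices, test_indices), splitting each label group separately.

    Hypothetical helper for illustration only; not part of autointent.
    """
    # Group sample indices by their label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(random_seed)
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        if shuffle:
            rng.shuffle(indices)
        # Assumption: each group sends round(size * test_size) samples to test.
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["greet"] * 8 + ["bye"] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.25, random_seed=42)
test_labels = [labels[i] for i in test_idx]
```

With these inputs the test split receives two `greet` samples and one `bye` sample, so the 2:1 label ratio of the full dataset carries over to both splits.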
- __call__(dataset, multilabel, allow_oos_in_train=None)
Split the dataset into training and testing subsets.
- Parameters:
dataset (datasets.Dataset) – The input dataset to be split.
multilabel (bool) – Whether the dataset is multi-label.
allow_oos_in_train (bool | None) – Whether out-of-scope (OOS) utterances may appear in the train split. Must be set when the dataset contains OOS samples.
- Returns:
A tuple containing the training and testing datasets.
- Raises:
ValueError – If OOS samples are present but allow_oos_in_train is not specified.
- Return type:
tuple[datasets.Dataset, datasets.Dataset]
- has_oos_samples(dataset)
Check if the dataset contains out-of-scope samples.
- Parameters:
dataset (datasets.Dataset) – The dataset to check.
- Returns:
True if the dataset contains OOS samples, False otherwise.
- Return type:
bool
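The check can be pictured with a short sketch over plain dictionaries. It assumes that an out-of-scope sample is a record whose label is `None`; that convention, the function name `has_oos_samples_sketch`, and the default `label_feature="label"` are assumptions for illustration, not confirmed details of the library.

```python
def has_oos_samples_sketch(samples, label_feature="label"):
    """Return True if any record has no label (assumed OOS marker: None)."""
    return any(sample[label_feature] is None for sample in samples)


in_scope_only = [
    {"utterance": "hello", "label": 0},
    {"utterance": "goodbye", "label": 1},
]
with_oos = in_scope_only + [{"utterance": "qwerty", "label": None}]

found_in_clean = has_oos_samples_sketch(in_scope_only)
found_in_mixed = has_oos_samples_sketch(with_oos)
```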
- get_stratify_inputs(dataset, multilabel, allow_oos_in_train)
Return the effective dataset and post-split hook for stratification.
This method is the single source of truth for OOS handling: both splitting and readiness checks call it, so the logic is not duplicated.
- Parameters:
dataset (datasets.Dataset) – The input dataset (may contain OOS).
multilabel (bool) – Whether the dataset is multi-label.
allow_oos_in_train (bool | None) – Whether OOS samples are allowed in the train split. Must be set when the dataset contains OOS samples.
- Returns:
StratifyInputs with the dataset to stratify on and a post_split_fn.
- Raises:
ValueError – If the dataset contains OOS samples and allow_oos_in_train is None.
- Return type:
StratifyInputs
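The "effective dataset plus post-split hook" pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `StratifyInputsSketch` and `get_stratify_inputs_sketch` are hypothetical names, OOS samples are assumed to be records with a `None` label, and the two OOS branches (keep them in the data when allowed, otherwise hold them out and append them to the test split) are one plausible policy. Only the `ValueError` contract and the `dataset` / `post_split_fn` pairing come from the documentation above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StratifyInputsSketch:
    """Hypothetical stand-in for StratifyInputs."""

    dataset: list  # records to stratify on
    post_split_fn: Callable  # applied to (train, test) after the split


def get_stratify_inputs_sketch(dataset, allow_oos_in_train, label_feature="label"):
    # Assumption: OOS records are the ones whose label is None.
    oos = [row for row in dataset if row[label_feature] is None]
    if not oos:
        # Nothing special to do: split the data as-is.
        return StratifyInputsSketch(dataset, lambda train, test: (train, test))
    if allow_oos_in_train is None:
        raise ValueError("Dataset contains OOS samples; set allow_oos_in_train.")
    if allow_oos_in_train:
        # Assumed policy: OOS records stay in the data being split.
        return StratifyInputsSketch(dataset, lambda train, test: (train, test))
    # Assumed policy: hold OOS records out and append them to the test split.
    in_scope = [row for row in dataset if row[label_feature] is not None]
    return StratifyInputsSketch(in_scope, lambda train, test: (train, test + oos))


rows = [
    {"utterance": "hello", "label": 0},
    {"utterance": "goodbye", "label": 1},
    {"utterance": "qwerty", "label": None},
]
inputs = get_stratify_inputs_sketch(rows, allow_oos_in_train=False)
# Stand-in for a real split: first record to train, the rest to test.
train, test = inputs.post_split_fn(inputs.dataset[:1], inputs.dataset[1:])
```

Because both the splitter and any readiness check would call the same function, the decision about OOS records is made in exactly one place.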