detector_benchmark.dataset_loader.cnn_dataset

Classes

Module Contents

class detector_benchmark.dataset_loader.cnn_dataset.CNNDataLoader(dataset_size: int, hf_dataset_path: str = 'abisee/cnn_dailymail', text_field: str = 'article', prefix_size: int = 10, max_sample_len: int = 500, train_fraction: float = 0.8, eval_fraction: float = 0.1, test_fraction: float = 0.1, seed: int = 42)

Bases: detector_benchmark.dataset_loader.fake_true_dataset.FakeTruePairsDataLoader

dataset_size
text_field
prefix_size
hf_dataset_path
max_sample_len
seed
train_fraction
eval_fraction
test_fraction
dataset_name = 'cnn_dailymail'
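
The defaults for train_fraction, eval_fraction, and test_fraction (0.8, 0.1, 0.1) describe a three-way split controlled by seed. How the class applies them internally is not documented here, so the following is only a minimal sketch of one plausible reading: a seeded shuffle followed by contiguous cuts. The helper name split_indices is hypothetical and is not part of detector_benchmark.

```python
import random


def split_indices(n, train_fraction=0.8, eval_fraction=0.1,
                  test_fraction=0.1, seed=42):
    """Hypothetical helper: shuffle range(n) with a seed, then cut it
    into train/eval/test index lists according to the fractions."""
    if abs(train_fraction + eval_fraction + test_fraction - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_fraction)
    n_eval = int(n * eval_fraction)
    return {
        "train": indices[:n_train],
        "eval": indices[n_train:n_train + n_eval],
        # The test split takes whatever remains, so no index is dropped.
        "test": indices[n_train + n_eval:],
    }


splits = split_indices(100)
```

With the same seed, repeated calls return identical splits, which is presumably the purpose of the seed argument.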
regroup_pairs(dataset_true: datasets.Dataset, dataset_fake: datasets.Dataset) -> datasets.Dataset

Merge the two datasets by pairing each human-written (true) sample with the AI-generated (fake) sample that shares its prefix. Within each pair, which sample comes first is chosen at random.

Parameters:
dataset_true: Dataset

The dataset containing the true samples.

dataset_fake: Dataset

The dataset containing the fake samples.

Returns:
Dataset

The merged dataset.
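
The pairing described above can be illustrated on plain Python lists. This is a sketch under stated assumptions, not the library's implementation: the record fields prefix, text, and label are hypothetical, and the real method operates on datasets.Dataset objects rather than lists of dicts.

```python
import random


def regroup_pairs_sketch(true_samples, fake_samples, seed=42):
    """Pair each true sample with the fake sample sharing its prefix,
    and randomize which of the two comes first in the merged list."""
    rng = random.Random(seed)
    # Index the fake samples by prefix for constant-time lookup.
    fake_by_prefix = {s["prefix"]: s for s in fake_samples}
    merged = []
    for true in true_samples:
        fake = fake_by_prefix.get(true["prefix"])
        if fake is None:
            continue  # no AI counterpart for this prefix; skip it
        pair = [true, fake]
        rng.shuffle(pair)  # random order within the pair
        merged.extend(pair)
    return merged


true_samples = [
    {"prefix": "The storm hit", "text": "The storm hit the coast.", "label": 0},
    {"prefix": "Officials said", "text": "Officials said little.", "label": 0},
]
fake_samples = [
    {"prefix": "Officials said", "text": "Officials said a lot.", "label": 1},
    {"prefix": "The storm hit", "text": "The storm hit at dawn.", "label": 1},
]
merged = regroup_pairs_sketch(true_samples, fake_samples)
```

Keeping the two members of a pair adjacent while shuffling their order means a detector cannot infer the label from a sample's position.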

clean_dataset(dataset: datasets.Dataset) -> datasets.Dataset

Clean the dataset by removing bloat from the text field.

Parameters:
dataset: Dataset

The dataset to clean.

Returns:
Dataset

The cleaned dataset.
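
What counts as "bloat" is not specified here. As an illustration only, the sketch below assumes it includes the "(CNN) --" byline that opens many CNN/DailyMail articles, plus stray whitespace. The function clean_text and its regular expression are hypothetical and may not match what clean_dataset actually removes.

```python
import re

# Assumed bloat pattern: an optional dateline such as "LONDON, England "
# followed by "(CNN)" and an optional "--" separator.
_BYLINE = re.compile(r"^\s*(?:[A-Z][A-Za-z .,]*\s)?\(CNN\)\s*(?:--\s*)?")


def clean_text(text):
    """Strip the assumed byline prefix and collapse runs of whitespace."""
    text = _BYLINE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("LONDON, England (CNN) -- Harry Potter star  turns 18.")
```

A per-record function like this could be applied to the text field with datasets.Dataset.map, which accepts a function that takes a record and returns the updated fields.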

process_data(dataset: datasets.DatasetDict, sample_size: int = None) -> datasets.DatasetDict

Main processing method for the dataset. Called by load_data.

Parameters:
dataset: DatasetDict

The dataset to process.

Returns:
DatasetDict

The processed dataset.

load_data() -> datasets.DatasetDict

Entry point for loading the dataset. Calls process_data and returns the result.

Returns:
DatasetDict

The processed dataset.