detector_benchmark.dataset_loader.cnn_dataset
Classes
Module Contents
- class detector_benchmark.dataset_loader.cnn_dataset.CNNDataLoader(dataset_size: int, hf_dataset_path: str = 'abisee/cnn_dailymail', text_field: str = 'article', prefix_size: int = 10, max_sample_len: int = 500, train_fraction: float = 0.8, eval_fraction: float = 0.1, test_fraction: float = 0.1, seed: int = 42)
Bases: detector_benchmark.dataset_loader.fake_true_dataset.FakeTruePairsDataLoader
- dataset_size
- text_field
- prefix_size
- hf_dataset_path
- max_sample_len
- seed
- train_fraction
- eval_fraction
- test_fraction
- dataset_name = 'cnn_dailymail'
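The three fraction attributes default to 0.8, 0.1 and 0.1, which sum to 1. As a rough illustration of how such fractions could translate into split sizes, here is a minimal sketch. The helper name `split_sizes`, the split keys and the rounding rule are assumptions made for this example, not the loader's documented behavior.

```python
def split_sizes(dataset_size: int,
                train_fraction: float = 0.8,
                eval_fraction: float = 0.1,
                test_fraction: float = 0.1) -> dict:
    """Hypothetical helper: turn split fractions into sample counts."""
    total = train_fraction + eval_fraction + test_fraction
    # Tolerate floating-point noise when checking that the fractions sum to 1.
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"fractions must sum to 1, got {total}")

    n_train = int(dataset_size * train_fraction)
    n_eval = int(dataset_size * eval_fraction)
    # Assumption: the test split takes the remainder so no sample is lost.
    n_test = dataset_size - n_train - n_eval
    return {"train": n_train, "eval": n_eval, "test": n_test}
```

Giving the remainder to the last split keeps the three counts adding up to `dataset_size` even when the products are not whole numbers.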
- regroup_pairs(dataset_true: datasets.Dataset, dataset_fake: datasets.Dataset) -> datasets.Dataset
Merge the two datasets by pairing each human-written (true) sample with the AI-generated (fake) sample that shares its prefix. Which of the two samples comes first in each pair is chosen at random.
- Parameters:
- dataset_true: Dataset
The dataset containing the true samples.
- dataset_fake: Dataset
The dataset containing the fake samples.
- Returns:
- Dataset
The merged dataset.
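The pairing described for regroup_pairs can be sketched in plain Python. This is an illustration only: it works on lists of dictionaries rather than `datasets.Dataset` objects, and the function name `regroup_pairs_sketch`, the field names `prefix`, `text` and `label`, and the label convention (0 for human, 1 for AI) are all assumptions.

```python
import random


def regroup_pairs_sketch(true_samples, fake_samples, seed=42):
    """Pair true and fake samples that share a prefix, in random order."""
    rng = random.Random(seed)  # seeded so the ordering is reproducible
    # Index the fake samples by prefix for constant-time lookup.
    fake_by_prefix = {sample["prefix"]: sample for sample in fake_samples}

    pairs = []
    for true in true_samples:
        fake = fake_by_prefix.get(true["prefix"])
        if fake is None:
            continue  # no AI counterpart for this prefix, skip it
        # Assumed label convention: 0 = human-written, 1 = AI-generated.
        pair = [(true["text"], 0), (fake["text"], 1)]
        # Randomly decide which sample comes first in the pair.
        if rng.random() < 0.5:
            pair.reverse()
        pairs.append({
            "prefix": true["prefix"],
            "text": [text for text, _ in pair],
            "label": [label for _, label in pair],
        })
    return pairs
```

Randomizing the order inside each pair prevents a detector from learning the position of the sample instead of its content.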
- clean_dataset(dataset: datasets.Dataset) -> datasets.Dataset
Clean the dataset by removing bloat from the text field.
- Parameters:
- dataset: Dataset
The dataset to clean.
- Returns:
- Dataset
The cleaned dataset.
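This page does not say what counts as bloat. Purely as an illustration, the sketch below assumes it means a leading wire-service dateline such as "(CNN) -- " plus stray whitespace. The function name `clean_text` and the regular expression are hypothetical and may differ from what clean_dataset actually removes.

```python
import re

# Assumed pattern: an optional location followed by "(CNN) --" or a similar
# agency tag at the very start of the article.
_DATELINE = re.compile(r"^\s*(?:[A-Z][\w .,'-]*?)?\((?:CNN|Reuters|AP)\)\s*--\s*")


def clean_text(text: str) -> str:
    """Hypothetical cleaner: drop a leading dateline and collapse whitespace."""
    text = _DATELINE.sub("", text)
    # Collapse runs of spaces and newlines into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

With a `datasets.Dataset`, a per-row function like this would typically be applied through `Dataset.map` on the configured text field.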
- process_data(dataset: datasets.DatasetDict, sample_size: int = None) -> datasets.DatasetDict
Main processing method for the dataset; called by load_data.
- Parameters:
- dataset: DatasetDict
The dataset to process.
- Returns:
- DatasetDict
The processed dataset.
- load_data() -> datasets.DatasetDict
Load and process the dataset. This is the method callers should use to obtain the data.
- Returns:
- DatasetDict
The processed dataset.
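Putting the constructor and load_data together, a typical call might look like the sketch below. The wrapper name `load_cnn_splits` and the chosen `dataset_size` are assumptions for illustration. The constructor arguments come from the signature above, and calling the function requires the `detector_benchmark` package and access to the Hugging Face dataset.

```python
def load_cnn_splits(dataset_size: int = 1000):
    """Hypothetical wrapper: build a CNNDataLoader and return its splits."""
    # Imported lazily so this sketch can be defined without the package installed.
    from detector_benchmark.dataset_loader.cnn_dataset import CNNDataLoader

    loader = CNNDataLoader(
        dataset_size=dataset_size,
        hf_dataset_path="abisee/cnn_dailymail",  # default from the signature
        text_field="article",                    # default from the signature
        seed=42,                                 # default from the signature
    )
    # load_data() returns a datasets.DatasetDict with the processed data.
    return loader.load_data()
```

The remaining parameters (`prefix_size`, `max_sample_len` and the three fractions) keep their defaults here and can be passed the same way.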