detector_benchmark.dataset_loader
=================================

.. py:module:: detector_benchmark.dataset_loader


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/detector_benchmark/dataset_loader/cnn_dataset/index
   /autoapi/detector_benchmark/dataset_loader/dataset_loader_utils/index
   /autoapi/detector_benchmark/dataset_loader/fake_true_dataset/index


Classes
-------

.. autoapisummary::

   detector_benchmark.dataset_loader.CNNDataLoader
   detector_benchmark.dataset_loader.FakeTruePairsDataLoader


Package Contents
----------------

.. py:class:: CNNDataLoader(dataset_size: int, hf_dataset_path: str = 'abisee/cnn_dailymail', text_field: str = 'article', prefix_size: int = 10, max_sample_len: int = 500, train_fraction: float = 0.8, eval_fraction: float = 0.1, test_fraction: float = 0.1, seed: int = 42)

   Bases: :py:obj:`detector_benchmark.dataset_loader.fake_true_dataset.FakeTruePairsDataLoader`


   .. py:attribute:: dataset_size

   .. py:attribute:: text_field

   .. py:attribute:: prefix_size

   .. py:attribute:: hf_dataset_path

   .. py:attribute:: max_sample_len

   .. py:attribute:: seed

   .. py:attribute:: train_fraction

   .. py:attribute:: eval_fraction

   .. py:attribute:: test_fraction

   .. py:attribute:: dataset_name
      :value: 'cnn_dailymail'


   .. py:method:: regroup_pairs(dataset_true: datasets.Dataset, dataset_fake: datasets.Dataset) -> datasets.Dataset

      Merge the two datasets by regrouping the pairs of human and AI samples
      with the same prefix. The first element of each pair is chosen randomly
      among the true and fake samples.

      Parameters:
          dataset_true: Dataset
              The dataset containing the true samples.
          dataset_fake: Dataset
              The dataset containing the fake samples.

      Returns:
          Dataset
              The merged dataset.


   .. py:method:: clean_dataset(dataset: datasets.Dataset) -> datasets.Dataset

      Clean the dataset by removing bloat from the text field.

      Parameters:
          dataset: Dataset
              The dataset to clean.

      Returns:
          Dataset
              The cleaned dataset.


   .. py:method:: process_data(dataset: datasets.DatasetDict, sample_size: int = None) -> datasets.DatasetDict

      Main method for processing the dataset; called by ``load_data``.

      Parameters:
          dataset: DatasetDict
              The dataset to process.

      Returns:
          DatasetDict
              The processed dataset.


   .. py:method:: load_data() -> datasets.DatasetDict

      Load and process the dataset.

      Returns:
          DatasetDict
              The processed dataset.


.. py:class:: FakeTruePairsDataLoader(dataset_size, dataset_path, text_field, prefix_size=10, max_sample_len=500, load_local=True, dataset_name='', seed=42)

   .. py:attribute:: dataset_size

   .. py:attribute:: text_field

   .. py:attribute:: prefix_size

   .. py:attribute:: dataset_path

   .. py:attribute:: max_sample_len

   .. py:attribute:: dataset_name

   .. py:attribute:: load_local

   .. py:attribute:: seed


   .. py:method:: regroup_pairs(dataset_true, dataset_fake) -> datasets.Dataset

      Merge the two datasets by regrouping the pairs of human and AI samples
      with the same prefix. The first element of each pair is chosen randomly.

      Parameters:
          dataset_true: Dataset
              The dataset containing the true samples


   .. py:method:: process_data(dataset: datasets.DatasetDict) -> datasets.DatasetDict

   .. py:method:: load_data() -> datasets.DatasetDict
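
The pairing described for ``regroup_pairs`` (match human and AI samples that share a prefix, then randomize which one comes first) can be illustrated with a minimal sketch. This is not the package's implementation: it works on plain lists of strings instead of ``datasets.Dataset`` objects, and the names ``prefix_of`` and ``regroup_pairs_sketch``, the word-based prefix, the ``text``/``label`` keys, and the label convention (1 = human, 0 = AI) are all assumptions made for illustration.

```python
import random


def prefix_of(text: str, prefix_size: int) -> str:
    """Return the first `prefix_size` whitespace-separated words of `text`."""
    return " ".join(text.split()[:prefix_size])


def regroup_pairs_sketch(true_texts, fake_texts, prefix_size=10, seed=42):
    """Pair human and AI texts that share a prefix, in random order per pair.

    Hypothetical stand-in for `regroup_pairs`; operates on lists of strings.
    """
    rng = random.Random(seed)  # seeded so the ordering is reproducible

    # Index the AI samples by prefix so each human sample can find its match.
    fake_by_prefix = {prefix_of(t, prefix_size): t for t in fake_texts}

    pairs = []
    for true_text in true_texts:
        fake_text = fake_by_prefix.get(prefix_of(true_text, prefix_size))
        if fake_text is None:
            continue  # no AI sample shares this prefix, so no pair is formed

        # Assumed label convention: 1 = human-written, 0 = AI-generated.
        pair = [(true_text, 1), (fake_text, 0)]
        rng.shuffle(pair)  # randomly decide which sample comes first
        pairs.append(
            {"text": [p[0] for p in pair], "label": [p[1] for p in pair]}
        )
    return pairs


if __name__ == "__main__":
    humans = ["a b c human ending", "x y z human ending"]
    machines = ["x y z machine ending", "a b c machine ending"]
    for pair in regroup_pairs_sketch(humans, machines, prefix_size=3):
        print(pair)
```

Shuffling within each pair keeps a detector from learning that the first position always holds the human sample; fixing the seed makes the resulting order repeatable across runs.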