How to add a dataset

First, let’s describe the dataset format used for testing the detectors. The datasets are Hugging Face Datasets where each sample has the following fields:

label: the label of the text (0 for human written, 1 for AI-written)
text: contains the full text of the sample
prefix: the prefix taken from the human-written text and used to generate the AI-written text. For each prefix, we always have the corresponding true sample (label 0) and the fake one (label 1) sharing the same prefix.
generation_config: the config (data from the config file in conf) used to generate the AI-generated text for the dataset
watermark_config: the watermarking config used to generate the text (which watermarking algorithm, if any, …)
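
As a rough illustration of the format above, the following sketch shows what one true/fake pair could look like once the AI text has been generated, written as plain Python dictionaries. The texts and the contents of the two config fields are placeholders invented for this example, and `is_valid_pair` is a hypothetical helper, not part of the library.

```python
REQUIRED_FIELDS = {"label", "text", "prefix", "generation_config", "watermark_config"}

# Illustrative values only: the config contents below are placeholders,
# not the real structure of the generation or watermarking configs.
human_sample = {
    "label": 0,  # human-written
    "text": "The quick brown fox jumps over the lazy dog near the river.",
    "prefix": "The quick brown fox",
    "generation_config": {"placeholder": "data from the config file in conf"},
    "watermark_config": {"placeholder": "watermarking settings, if any"},
}
fake_sample = {
    "label": 1,  # AI-written
    "text": "The quick brown fox was spotted again this morning by a hiker.",
    "prefix": "The quick brown fox",  # same prefix as the human sample
    "generation_config": {"placeholder": "data from the config file in conf"},
    "watermark_config": {"placeholder": "watermarking settings, if any"},
}


def is_valid_pair(human, fake):
    """Hypothetical check: both samples have all fields, labels 0/1, shared prefix."""
    has_fields = REQUIRED_FIELDS.issubset(human) and REQUIRED_FIELDS.issubset(fake)
    return (
        has_fields
        and human["label"] == 0
        and fake["label"] == 1
        and human["prefix"] == fake["prefix"]
    )


print(is_valid_pair(human_sample, fake_sample))
```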

To add a dataset other than the existing ones, add a dataset loader class that inherits from the base FakeTruePairsDataLoader class inside the dataset_loader folder. This class should implement the load_data function with the following signature:

def load_data(self) -> DatasetDict:

This function should return a DatasetDict (Hugging Face dataset format) with train, eval and test splits that respects the dataset format described above. To see how to apply the correct format to the dataset, see the existing dataset loaders.

Note: load_data returns a dataset in which the fake samples (label 1) have an empty text and no generation_config or watermark_config fields, since the AI texts have not been generated yet. For AI samples, the only fields that should be filled in are label and prefix.
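
The note above can be sketched in plain Python as follows. This is a minimal illustration, not the way the existing loaders build their rows. The helper `build_pair_rows` and its `prefix_words` parameter are hypothetical, and taking the first few whitespace-separated words as the prefix is an assumption made for the example.

```python
def build_pair_rows(human_texts, prefix_words=10):
    """Build true/fake row pairs as they might look before any AI text is generated.

    Hypothetical helper: the prefix rule (first `prefix_words` words) is an
    assumption for illustration, not the rule used by the existing loaders.
    """
    rows = []
    for text in human_texts:
        prefix = " ".join(text.split()[:prefix_words])
        # True sample: full human-written text.
        rows.append({"label": 0, "text": text, "prefix": prefix})
        # Fake sample: same prefix, empty text until generation happens.
        rows.append({"label": 1, "text": "", "prefix": prefix})
    return rows


rows = build_pair_rows(["one two three four five six"], prefix_words=3)
for row in rows:
    print(row)
```

Inside load_data, rows like these could be built for each of the train, eval and test splits and then wrapped into Hugging Face objects, for example with `Dataset.from_list` and a `DatasetDict` keyed by split name. The existing dataset loaders remain the reference for the exact format.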