hugdatafast.transform

Module Contents

Classes

SimpleTokenize Initialize self. See help(type(self)) for accurate signature.
CombineTransform Base class for transforms that combine multiple original samples into a new sample.
LMTransform Transform any dataset that has tokenized text into a dataset for (autoregressive) language modeling.
ELECTRADataTransform Process any text corpus for ELECTRA’s use

Functions

my_map(self: datasets.arrow_dataset.Dataset, *args, **kwargs) The same as datasets.arrow_dataset.Dataset.map, but it automatically adds the cache directory and the .arrow extension to cache_file_name.
my_map(self: datasets.dataset_dict.DatasetDict, *args, **kwargs) The same as datasets.dataset_dict.DatasetDict.map, but it can infer cache names for us.
hugdatafast.transform.my_map(self: datasets.arrow_dataset.Dataset, *args, **kwargs)

The same as datasets.arrow_dataset.Dataset.map, but it automatically adds the cache directory and the .arrow extension to cache_file_name.

Example

>>> dataset.map(a_func, cache_file_name='processed')
# cache file path becomes "<dataset cache directory>/processed.arrow"
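
The path rule described above can be illustrated with a small standalone sketch. It is not the library's implementation: resolve_cache_file_name is a hypothetical helper, and the choices to leave None untouched and to skip names that already contain a directory are assumptions made for illustration.

```python
import os


def resolve_cache_file_name(cache_dir, cache_file_name):
    """Hypothetical helper: prepend the cache directory and append '.arrow'."""
    if cache_file_name is None:
        # Assumption: no name given, so the caller falls back to its default.
        return None
    if not cache_file_name.endswith(".arrow"):
        cache_file_name += ".arrow"
    if os.path.dirname(cache_file_name) == "":
        # Only a bare file name was given, so place it in the cache directory.
        cache_file_name = os.path.join(cache_dir, cache_file_name)
    return cache_file_name


path = resolve_cache_file_name("/tmp/cache", "processed")
print(path)
```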
hugdatafast.transform.my_map(self: datasets.dataset_dict.DatasetDict, *args, **kwargs)

The same as datasets.dataset_dict.DatasetDict.map, but it can infer cache names for us.

Example

>>> datasets.map(a_func, cache_file_names='processed_{split}')
# cache file paths : "<dataset cache directory>/processed_train.arrow", "<dataset cache directory>/processed_validation.arrow", "<dataset cache directory>/processed_test.arrow"
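
A minimal sketch of how a '{split}' template could expand into one cache name per split. expand_cache_file_names is a hypothetical helper written for illustration; the actual inference logic in the library may differ.

```python
def expand_cache_file_names(template, splits):
    """Hypothetical helper: fill the '{split}' placeholder once per split."""
    return {split: template.format(split=split) + ".arrow" for split in splits}


names = expand_cache_file_names("processed_{split}", ["train", "validation", "test"])
for split, name in names.items():
    print(split, name)
```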
class hugdatafast.transform.SimpleTokenize(cols, hf_toker)

Initialize self. See help(type(self)) for accurate signature.

class hugdatafast.transform.CombineTransform(hf_dset, in_cols, out_cols, drop_last=False)

Base class for transforms that combine multiple original samples into a new sample.

Parameters:
  • hf_dset (Dataset or DatasetDict) – The Hugging Face dataset(s) to do the transformation
  • in_cols (List[str]) – names of the input columns used to produce samples
  • out_cols (List[str]) – names of output columns to put combined samples.
  • drop_last (Optional[bool], default False) – whether to drop the last accumulated sample.
reset_states(self)

Child Class should implement this method.

Reset all containers, flags to their initial values.

accumulate(self, *args)

Child Class should implement this method.

Given an example, call self.commit_example(self.create_example()) when a new combined sample is ready.

Parameters:
  • args – values of in_cols (passed to __init__()) for one example

create_example(self)

Child Class should implement this method.

Use internal states stored in the child class instance to create a combined example (dict).

When nothing can be created, return None or raise an exception to signal it.

map(self, batch_size=1000, cache_file_name=None, **kwargs)
Parameters:
  • batch_size (int) – See datasets.Dataset.map; must not be None here
  • cache_file_name – Same as the cache_file_name of my_map()
  • kwargs – passed to datasets.Dataset.map
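
The contract between reset_states, accumulate, create_example and drop_last can be sketched with a toy class that does not depend on datasets. PairCombiner, its run() loop and the body of commit_example are assumptions for illustration only; the real class drives these hooks through datasets.Dataset.map.

```python
class PairCombiner:
    """Toy stand-in for a CombineTransform child: joins every two texts into one."""

    def __init__(self, drop_last=False):
        self.drop_last = drop_last
        self.committed = []  # combined samples produced so far
        self.reset_states()

    def reset_states(self):
        # Reset all containers to their initial values.
        self.buffer = []

    def create_example(self):
        if not self.buffer:
            return None  # nothing can be created
        example = {"pair": " ".join(self.buffer)}
        self.reset_states()
        return example

    def commit_example(self, example):
        # Assumed behavior: ignore None, store everything else.
        if example is not None:
            self.committed.append(example)

    def accumulate(self, text):
        self.buffer.append(text)
        if len(self.buffer) == 2:  # a new combined sample is ready
            self.commit_example(self.create_example())

    def run(self, texts):
        for text in texts:
            self.accumulate(text)
        if not self.drop_last:
            # Flush whatever is left in the buffer as a final sample.
            self.commit_example(self.create_example())
        return self.committed


combined = PairCombiner().run(["a", "b", "c"])
dropped = PairCombiner(drop_last=True).run(["a", "b", "c"])
print(combined, dropped)
```

With drop_last=True the leftover "c" never becomes a sample, which is the trade-off the parameter controls.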
class hugdatafast.transform.LMTransform(tokenized_hf_dset, max_len, text_col, x_text_col='x_text', y_text_col='y_text', **kwargs)

Transform any dataset that has tokenized text into a dataset for (autoregressive) language modeling.

Caution: the context window spans across examples, so make sure the texts in consecutive examples of the dataset are contiguous or related, unless you know what you are doing.

Parameters:
  • tokenized_hf_dset (Dataset or DatasetDict) – tokenized Hugging Face dataset(s) to do LM transform
  • max_len (int) – the length, in tokens, of each output sample
  • text_col (str) – the name of the column in tokenized_hf_dset that contains the tokenized text (ids)
  • x_text_col (str) – the name of the output column holding the input token ids
  • y_text_col (str) – the name of the output column holding the target token ids (the input shifted by one token)
  • kwargs – passed to CombineTransform

Example

>>> lm_dataset = LMTransform(tokenized_cola['validation'], max_len=20, text_col='text_idxs').map()
>>> lm_dataset[0]
{'x_text': [ 1996, 11279,  8469,  1996,  9478,  3154,  1997,  1996,  5749,  1012,
    1996, 15871,  2081,  1996,  8164,  7683,  2058,  1996,  4139,  3240],
 'y_text': [11279,  8469,  1996,  9478,  3154,  1997,  1996,  5749,  1012,  1996,
   15871,  2081,  1996,  8164,  7683,  2058,  1996,  4139,  3240,  1012]}
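
The shape of the output above (y_text is x_text shifted by one token) can be reproduced with a small standalone sketch. This is not the library's code: lm_pairs is a hypothetical function, and both the one-token overlap between successive windows and the dropping of leftover tokens are assumptions.

```python
def lm_pairs(tokenized_texts, max_len):
    """Yield dicts where 'y_text' is 'x_text' shifted left by one token.

    Tokens are pooled across examples, so a window may span two texts.
    """
    pool = []
    for ids in tokenized_texts:
        pool.extend(ids)
        # An input/target pair of length max_len needs max_len + 1 tokens.
        while len(pool) >= max_len + 1:
            window = pool[: max_len + 1]
            yield {"x_text": window[:-1], "y_text": window[1:]}
            # Assumption: the last target token starts the next window.
            pool = pool[max_len:]
    # Leftover tokens that cannot fill a window are dropped in this sketch.


pairs = list(lm_pairs([[1, 2, 3], [4, 5], [6, 7]], max_len=3))
for pair in pairs:
    print(pair)
```

Note that the first window mixes tokens from the first and second text, which is why the caution about contiguous texts matters.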
reset_states(self)

Child Class should implement this method.

Reset all containers, flags to their initial values.

create_example(self)

Child Class should implement this method.

Use internal states stored in the child class instance to create a combined example (dict).

When nothing can be created, return None or raise an exception to signal it.

accumulate(self, text)

Child Class should implement this method.

Given an example, call self.commit_example(self.create_example()) when a new combined sample is ready.

Parameters:
  • text – value of the input column (text_col passed to __init__()) for one example

class hugdatafast.transform.ELECTRADataTransform(hf_dset, is_docs, text_col, max_length, hf_toker, delimiter='\n', **kwargs)

Process any text corpus for ELECTRA’s use

Parameters:
  • hf_dset (Dataset or DatasetDict) – untokenized Hugging Face dataset(s) to do the transform
  • is_docs (bool) – Whether each sample of this dataset is a doc
  • text_col (str) – the name of the dataset column that contains the text
  • max_length (int) – max length of each sentence
  • hf_toker (transformers.PreTrainedTokenizer) – Hugging Face tokenizer
  • delimiter (str) – the delimiter used to split the input text into sentences
  • kwargs – passed to CombineTransform
reset_states(self)

Child Class should implement this method.

Reset all containers, flags to their initial values.

accumulate(self, text)

Child Class should implement this method.

Given an example, call self.commit_example(self.create_example()) when a new combined sample is ready.

Parameters:
  • text – value of the input column (text_col passed to __init__()) for one example

create_example(self)

Child Class should implement this method.

Use internal states stored in the child class instance to create a combined example (dict).

When nothing can be created, return None or raise an exception to signal it.

add_line(self, tokids)

Adds a line of text to the current example being built.
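
As a rough illustration of how delimiter, max_length and line-by-line accumulation could fit together, the sketch below splits a text on a delimiter and packs the tokenized lines into examples of at most max_length ids. It is an assumption-laden toy: pack_sentences and the word-length "tokenizer" are hypothetical, and special tokens and segment handling are left out entirely.

```python
def pack_sentences(text, max_length, tokenize, delimiter="\n"):
    """Split text on delimiter and pack tokenized lines into examples."""
    examples, current = [], []
    for line in text.split(delimiter):
        tokids = tokenize(line)
        if not tokids:
            continue  # skip empty lines
        if current and len(current) + len(tokids) > max_length:
            examples.append(current)  # the current example is full
            current = []
        # Truncate a single over-long line so no example exceeds max_length.
        current.extend(tokids[:max_length])
    if current:
        examples.append(current)  # flush the last, possibly short, example
    return examples


def toy_tokenize(line):
    # Stand-in for a real tokenizer: each word's length serves as a fake id.
    return [len(word) for word in line.split()]


examples = pack_sentences("a bb\nccc\n\ndddd e f", max_length=3, tokenize=toy_tokenize)
print(examples)
```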