`hugdatafast.transform`¶

Module Contents¶

Classes¶

`SimpleTokenize`	Initialize self. See help(type(self)) for accurate signature.
`CombineTransform`	Base class for transforms that combine multiple original samples into a new sample.
`LMTransform`	Transform any dataset that has tokenized text into a dataset for (autoregressive) language modeling.
`ELECTRADataTransform`	Process any text corpus for ELECTRA’s use

Functions¶

`my_map`(self: datasets.arrow_dataset.Dataset, *args, **kwargs)	The same as `datasets.arrow_dataset.Dataset.map`, but it adds the cache directory and the .arrow extension to cache_file_name automatically.
`my_map`(self: datasets.dataset_dict.DatasetDict, *args, **kwargs)	The same as `datasets.dataset_dict.DatasetDict.map`, but it can infer cache names for us.

hugdatafast.transform.my_map(self: datasets.arrow_dataset.Dataset, *args, **kwargs)¶

The same as datasets.arrow_dataset.Dataset.map, but it adds the cache directory and the .arrow extension to cache_file_name automatically.

Example

>>> dataset.map(a_func, cache_file_name='processed')
# the cache file path becomes "<dataset cache directory>/processed.arrow"
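The path resolution described above can be sketched in plain Python. This is only an illustration, not the library's code: `resolve_cache_path` is a hypothetical helper, and skipping the extension when the name already ends in `.arrow` is an assumption.

```python
import os


def resolve_cache_path(cache_dir: str, cache_file_name: str) -> str:
    """Hypothetical helper (not part of hugdatafast): build the full cache path."""
    # Assumption: the extension is appended only when it is not already present.
    if not cache_file_name.endswith(".arrow"):
        cache_file_name += ".arrow"
    # Place the file inside the dataset's cache directory.
    return os.path.join(cache_dir, cache_file_name)


print(resolve_cache_path("cache_dir", "processed"))
```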

hugdatafast.transform.my_map(self: datasets.dataset_dict.DatasetDict, *args, **kwargs)

The same as datasets.dataset_dict.DatasetDict.map, but it can infer the cache file name of each split for us.

Example

>>> datasets.map(a_func, cache_file_names='processed_{split}')
# cache file paths : "<dataset cache directory>/processed_train.arrow", "<dataset cache directory>/processed_validation.arrow", "<dataset cache directory>/processed_test.arrow"
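How one template might expand into one path per split can be sketched as follows. `expand_cache_file_names` is a hypothetical helper written for this illustration; the use of `str.format` for the `{split}` placeholder is an assumption about the mechanism, not a statement about the library's internals.

```python
import os
from typing import Dict, Iterable


def expand_cache_file_names(cache_dir: str, template: str, splits: Iterable[str]) -> Dict[str, str]:
    """Hypothetical helper: map each split name to its own cache file path."""
    return {
        # Fill in the {split} placeholder, then add the directory and extension.
        split: os.path.join(cache_dir, template.format(split=split) + ".arrow")
        for split in splits
    }


paths = expand_cache_file_names("cache_dir", "processed_{split}", ["train", "validation", "test"])
for split, path in paths.items():
    print(split, path)
```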

class hugdatafast.transform.SimpleTokenize(cols, hf_toker)¶: Initialize self. See help(type(self)) for accurate signature.

class hugdatafast.transform.CombineTransform(hf_dset, in_cols, out_cols, drop_last=False)¶

Base class for transforms that combine multiple original samples into a new sample.

Parameters:

hf_dset (`Dataset` or `DatasetDict`) – the Hugging Face dataset(s) to transform
in_cols (List[str]) – names of the input columns that are used to produce samples
out_cols (List[str]) – names of the output columns that hold the combined samples
drop_last (Optional[bool], default False) – whether to drop the last accumulated sample

reset_states(self)¶

Child Class should implement this method.

Reset all containers, flags to their initial values.

accumulate(self, *args)¶

Child Class should implement this method.

Given an example, call self.commit_example(self.create_example()) when a new combined sample is ready.

Parameters:

args – values of in_cols (passed to __init__()) of an example

create_example(self)¶

Child Class should implement this method.

Use internal states stored in the child class instance to create a combined example (dict).

When nothing can be created, return None or raise an exception to signal it.

map(self, batch_size=1000, cache_file_name=None, **kwargs)¶

Parameters:

batch_size (int) – see `datasets.Dataset.map`; it must not be None here
cache_file_name – the same as the one of `my_map()`
kwargs – passed to `datasets.Dataset.map`
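The accumulate / create_example / commit_example protocol described above can be illustrated with a small stand-alone class. This sketch does not import hugdatafast or subclass `CombineTransform`: `PairCombiner`, its `run` driver, and the body of `commit_example` are all assumptions made for the illustration. Only the method names and the role of `drop_last` come from the documentation.

```python
from typing import Any, Dict, Iterable, List, Optional


class PairCombiner:
    """Toy stand-in for a CombineTransform child: joins every two values into one sample."""

    def __init__(self, drop_last: bool = False):
        self.drop_last = drop_last
        self.new_examples: List[Dict[str, Any]] = []
        self.reset_states()

    def reset_states(self) -> None:
        # Reset all containers to their initial values.
        self.buffer: List[Any] = []

    def create_example(self) -> Optional[Dict[str, Any]]:
        # Return None when nothing can be created.
        if not self.buffer:
            return None
        return {"combined": list(self.buffer)}

    def commit_example(self, example: Optional[Dict[str, Any]]) -> None:
        # Assumed behaviour: store the example unless it is None.
        if example is not None:
            self.new_examples.append(example)

    def accumulate(self, value: Any) -> None:
        self.buffer.append(value)
        if len(self.buffer) == 2:  # a new combined sample is ready
            self.commit_example(self.create_example())
            self.reset_states()

    def run(self, values: Iterable[Any]) -> List[Dict[str, Any]]:
        # Hypothetical driver playing the role of map() over a single column.
        self.new_examples = []
        self.reset_states()
        for value in values:
            self.accumulate(value)
        if not self.drop_last:
            # Keep whatever is left in the buffer as a final, shorter sample.
            self.commit_example(self.create_example())
        return self.new_examples


print(PairCombiner().run([1, 2, 3]))
print(PairCombiner(drop_last=True).run([1, 2, 3]))
```

With `drop_last=True` the trailing incomplete sample is discarded instead of being committed.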

class hugdatafast.transform.LMTransform(tokenized_hf_dset, max_len, text_col, x_text_col='x_text', y_text_col='y_text', **kwargs)¶

Transform any dataset that has tokenized text into a dataset for (autoregressive) language modeling. !! Caution: the context window spans across examples, so make sure the texts in consecutive examples of the dataset are contiguous or related, unless you know what you are doing.

Parameters:

tokenized_hf_dset (Dataset or DatasetDict) – tokenized Hugging Face dataset(s) to do LM transform
max_len (int) – the length of a sentence
text_col (str) – the name of the column of tokenized_hf_dset that contains the tokenized text (ids)
x_text_col (str) – the name of the output column that holds the input tokens
y_text_col (str) – the name of the output column that holds the target tokens (the input tokens shifted by one position)
kwargs – passed to `CombineTransform`

Example

>>> lm_dataset = LMTransform(tokenized_cola['validation'], max_len=20, text_col='text_idxs').map()
>>> lm_dataset[0]
{'x_text': [ 1996, 11279,  8469,  1996,  9478,  3154,  1997,  1996,  5749,  1012,
    1996, 15871,  2081,  1996,  8164,  7683,  2058,  1996,  4139,  3240],
 'y_text': [11279,  8469,  1996,  9478,  3154,  1997,  1996,  5749,  1012,  1996,
   15871,  2081,  1996,  8164,  7683,  2058,  1996,  4139,  3240,  1012]}
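The shifted-by-one relationship between `x_text` and `y_text` in the output above can be reproduced with a small stand-alone sketch. `lm_windows` is a hypothetical function, not the library's implementation. Assumptions: it consumes `max_len` tokens per sample, reuses the last target token as the first input of the next window, and drops an incomplete final window.

```python
from typing import Dict, Iterable, List


def lm_windows(token_lists: Iterable[List[int]], max_len: int) -> List[Dict[str, List[int]]]:
    """Hypothetical sketch: cut a token stream into (input, target) windows."""
    # Concatenate all examples, so a window may span across example boundaries.
    stream = [tok for toks in token_lists for tok in toks]
    examples = []
    # Each sample needs max_len + 1 tokens: max_len inputs plus one extra target.
    for start in range(0, len(stream) - max_len, max_len):
        chunk = stream[start:start + max_len + 1]
        examples.append({"x_text": chunk[:-1], "y_text": chunk[1:]})
    return examples


for example in lm_windows([[1, 2, 3], [4, 5], [6, 7]], max_len=3):
    print(example)
```

The second window here mixes tokens from all three input examples, which is why the texts should be contiguous or related.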

reset_states(self)¶

Child Class should implement this method.

Reset all containers, flags to their initial values.

create_example(self)¶

Child Class should implement this method.

Use internal states stored in the child class instance to create a combined example (dict).

When nothing can be created, return None or raise an exception to signal it.

accumulate(self, text)¶

Child Class should implement this method.

Given an example, call self.commit_example(self.create_example()) when a new combined sample is ready.

Parameters:

text – value of text_col (passed to __init__()) of an example

class hugdatafast.transform.ELECTRADataTransform(hf_dset, is_docs, text_col, max_length, hf_toker, delimiter='\n', **kwargs)¶

Process any text corpus for ELECTRA’s use

Parameters:

hf_dset (Dataset or DatasetDict) – untokenized Hugging Face dataset(s) to do the transform
is_docs (bool) – whether each sample of this dataset is a document
text_col (str) – the name of the column of the dataset that contains the text
max_length (int) – max length of each sentence
hf_toker (transformers.PreTrainedTokenizer) – Hugging Face tokenizer
delimiter (str) – the delimiter used to segment sentences in the input text
kwargs – passed to CombineTransform

reset_states(self)¶

Child Class should implement this method.

Reset all containers, flags to their initial values.

accumulate(self, text)¶

Child Class should implement this method.

Given an example, call self.commit_example(self.create_example()) when a new combined sample is ready.

Parameters:

text – value of text_col (passed to __init__()) of an example

create_example(self)¶

Child Class should implement this method.

Use internal states stored in the child class instance to create a combined example (dict).

When nothing can be created, return None or raise an exception to signal it.

add_line(self, tokids)¶: Adds a line of text to the current example being built.
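As a rough illustration of line-by-line accumulation, the sketch below splits text on a delimiter, tokenizes each line, and greedily packs lines into samples of at most `max_length` tokens. It is not ELECTRA's actual preprocessing and not this class's implementation: `pack_lines`, the toy tokenizer, and the truncation of over-long lines are all assumptions, and special tokens and sentence segments are left out.

```python
from typing import Callable, List


def pack_lines(text: str, tokenize: Callable[[str], List[int]],
               max_length: int, delimiter: str = "\n") -> List[List[int]]:
    """Hypothetical sketch: greedily pack tokenized lines into fixed-budget samples."""
    examples: List[List[int]] = []
    current: List[int] = []
    for line in text.split(delimiter):
        ids = tokenize(line)
        if not ids:
            continue  # skip empty lines
        # Start a new sample when this line would overflow the token budget.
        if current and len(current) + len(ids) > max_length:
            examples.append(current)
            current = []
        # Assumption: a single line longer than the budget is truncated.
        current = current + ids[:max_length]
    if current:
        examples.append(current)
    return examples


def toy_tokenize(line: str) -> List[int]:
    # Toy tokenizer for the demo only: one "id" per word, equal to the word length.
    return [len(word) for word in line.split()]


print(pack_lines("a bb\nccc\n\ndddd e f", toy_tokenize, max_length=3))
```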
