hugdatafast.transform
¶
Module Contents¶
Classes¶
SimpleTokenize |
Initialize self. See help(type(self)) for accurate signature. |
CombineTransform |
Base Class for Transform that combine multiple original samples into a new sample. |
LMTransform |
Transform any dataset has tokenized text into dataset (autotgressive) language model. |
ELECTRADataTransform |
Process any text corpus for ELECTRA’s use |
Functions¶
my_map (self: datasets.arrow_dataset.Dataset, *args, **kwargs) |
The same as datasets.arrow_dataset.Dataset , but it can add cache directory and .arrow to cache_file_name autmomatically for us. |
my_map (self: datasets.arrow_dataset.Dataset, *args, **kwargs) |
The same as datasets.arrow_dataset.Dataset , but it can add cache directory and .arrow to cache_file_name autmomatically for us. |
-
hugdatafast.transform.
my_map
(self: datasets.arrow_dataset.Dataset, *args, **kwargs)¶ The same as
datasets.arrow_dataset.Dataset
, but it can add cache directory and .arrow to cache_file_name autmomatically for us.Example
>>> dataset.map(a_func, cache_file_name='processed') # cache file path become "<dataset cache directory>/processed.arrow"
-
hugdatafast.transform.
my_map
(self: datasets.dataset_dict.DatasetDict, *args, **kwargs) The same as
datasets.dataset_dict.DatasetDict
, but it can infer cache names for us.Example
>>> datasets.map(a_func, cache_file_names='processed_{split}') # cache file paths : "<dataset cache directory>/processed_train.arrow", "<dataset cache directory>/processed_validation.arrow", "<dataset cache directory>/processed_test.arrow"
-
class
hugdatafast.transform.
SimpleTokenize
(cols, hf_toker)¶ Initialize self. See help(type(self)) for accurate signature.
-
class
hugdatafast.transform.
CombineTransform
(hf_dset, in_cols, out_cols, drop_last=False)¶ Base Class for Transform that combine multiple original samples into a new sample.
Parameters: - hf_dset (
Dataset
orDatasetDict
) – The Hugging Face dataset(s) to do the transformation - in_cols (List[str]) – names of input columns that used to produce samples
- out_cols (List[str]) – names of output columns to put combined samples.
- (Optional[bool], default (drop_last`) – False): whether to drop the last accumulated sample.
-
reset_states
(self)¶ Child Class should implement this method.
Reset all containers, flags to their initial values.
-
accumulate
(self, *args)¶ Child Class should implement this method.
Given a example, do self.commit_example(self.create_example()) when a new combined sample is ready. :param args: values of
inp_cols
( passed to__init__()
) of an example
-
create_example
(self)¶ Child Class should implement this method.
Use internal states stored in the child class instance to create a combined example (dict).
When nothing can’t be created, return
None
or raise any exception to show it.
- hf_dset (
-
class
hugdatafast.transform.
LMTransform
(tokenized_hf_dset, max_len, text_col, x_text_col='x_text', y_text_col='y_text', **kwargs)¶ Transform any dataset has tokenized text into dataset (autotgressive) language model. !! Caution: This span context window across examples. So make sure your texts in examples of the datasets are consecutive or relative. Or you are knowing what you are doing.
Parameters: - tokenized_hf_dset (
Dataset
orDatasetDict
) – tokenized Hugging Face dataset(s) to do LM transform - max_len (int) – the length of a sentence
- text_col (str) – the name of column that contains tokenized text (ids) of tokenized_hf_dset
- x_text_col (str) – the name of the output column
- y_text_col (str) – the name fo the output column
- kwargs – passed to :class:CombineTransform
Example
>>> lm_dataset = LMTransform(tokenized_cola['validation'], max_len=20, text_col='text_idxs').map() >>> lm_dataset[0] {'x_text': [ 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 1996, 15871, 2081, 1996, 8164, 7683, 2058, 1996, 4139, 3240], 'y_text': [11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 1996, 15871, 2081, 1996, 8164, 7683, 2058, 1996, 4139, 3240, 1012]}
-
reset_states
(self)¶ Child Class should implement this method.
Reset all containers, flags to their initial values.
-
create_example
(self)¶ Child Class should implement this method.
Use internal states stored in the child class instance to create a combined example (dict).
When nothing can’t be created, return
None
or raise any exception to show it.
-
accumulate
(self, text)¶ Child Class should implement this method.
Given a example, do self.commit_example(self.create_example()) when a new combined sample is ready. :param args: values of
inp_cols
( passed to__init__()
) of an example
- tokenized_hf_dset (
-
class
hugdatafast.transform.
ELECTRADataTransform
(hf_dset, is_docs, text_col, max_length, hf_toker, delimiter='n', **kwargs)¶ Process any text corpus for ELECTRA’s use
Parameters: - hf_dset (
Dataset
orDatasetDict
) – untokenized Hugging Face dataset(s) to do the transform - is_docs (bool) – Whether each sample of this dataset is a doc
- text_col (str) – the name of column of the dataset contains text
- max_length (str) – max length of each sentence
- hf_toker (
transformers.PreTrainedTokenizer
) – Hugging Face tokenizer - delimiter (str) – what is the delimiter to segment sentences in the input text
- kwargs – passed to
CombineTransform
-
reset_states
(self)¶ Child Class should implement this method.
Reset all containers, flags to their initial values.
-
accumulate
(self, text)¶ Child Class should implement this method.
Given a example, do self.commit_example(self.create_example()) when a new combined sample is ready. :param args: values of
inp_cols
( passed to__init__()
) of an example
-
create_example
(self)¶ Child Class should implement this method.
Use internal states stored in the child class instance to create a combined example (dict).
When nothing can’t be created, return
None
or raise any exception to show it.
-
add_line
(self, tokids)¶ Adds a line of text to the current example being built.
- hf_dset (