hugdatafast.fastai

Module Contents

Classes

MySortedDL A DataLoader that does smart batching and dynamic padding. Unlike SortedDL, it automatically pads every attribute of samples, can filter samples, and can be cached so that sorting/filtering happens only the first time.
HF_Dataset A wrapper for datasets.Dataset. It behaves like the original datasets.Dataset,
HF_Datasets Functions as fastai.data.core.Datasets to create fastai.data.core.Dataloaders from a group of datasets.Dataset objects
class hugdatafast.fastai.MySortedDL(dataset, srtkey_fc=None, filter_fc=False, pad_idx=None, cache_file=None, **kwargs)

A DataLoader that does smart batching and dynamic padding. Unlike SortedDL, it automatically pads every attribute of samples, can filter samples, and can be cached so that sorting/filtering happens only the first time.

Parameters:
  • dataset (HF_Dataset) – Actually, any object that implements __len__ and __getitem__ and returns a tuple as a sample.
  • srtkey_fc (*args->int, optional) –

    Get the key for descending sort from a sample.

    • If None, sort by the length of the first element of a sample.
    • If False, don't sort.
  • filter_fc (*args->bool, optional) – Return True to keep the sample.
  • pad_idx (int, optional) –

    Pad each attribute of samples to that attribute's max length within the batch.

    • If List[int], specify a pad_idx for each attribute of a sample. e.g. if a sample is a tuple (masked_inputs, labels), pad_idx=[0, -100] pads masked_inputs with 0 and labels with -100 (see the sketch after the example below).
    • If False, do no padding.
    • If None, try dataset.pad_idx; do no padding if there is no such attribute.
  • cache_file (str, optional) – Path of a json file to cache info for sorting and filtering.
  • kwargs – keyword arguments for TfmdDL or DataLoader

Example

>>> samples = [ (torch.tensor([1]), torch.tensor([7,8]), torch.tensor(1)),
...             (torch.tensor([2,3]), torch.tensor([9,10,11]), torch.tensor(2)),
...             (torch.tensor([4,5,6]), torch.tensor([11,12,13,14]), torch.tensor(3)), ]
>>> dl = MySortedDL(samples,
...                 srtkey_fc=lambda *args: len(args[0]),
...                 filter_fc=lambda x1,y1: y1<3,
...                 pad_idx=-1,
...                 cache_file='/tmp/cache.json', # calls after this will load the cache
...                 bs=999, # other parameters go to `TfmdDL` and `DataLoader`
...                 )
>>> dl.one_batch()
(tensor([[ 2,  3],
         [ 1, -1]]),
 tensor([[ 9, 10, 11],
         [ 7,  8, -1]]),
 tensor([2, 1]))
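
The example above uses a single pad_idx for every attribute. The per-attribute (List[int]) form of pad_idx is not exercised there; the following is only a minimal sketch with illustrative values (reusing the hypothetical samples above), not verified output:

>>> # hypothetical values: pad the 1st attribute with 0, the 2nd with -100, the 3rd with -1
>>> dl = MySortedDL(samples,
...                 srtkey_fc=lambda *args: len(args[0]),
...                 pad_idx=[0, -100, -1],  # one pad value per attribute of a sample
...                 bs=999)
>>> dl.one_batch()  # each attribute is padded with its own value from the list
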
class hugdatafast.fastai.HF_Dataset(hf_dset, cols=None, hf_toker=None, neat_show=False, n_inp=1)

A wrapper for datasets.Dataset. It behaves like the original datasets.Dataset, but also functions as a fastai.data.core.Datasets that provides samples and decodes them.

Parameters:
  • hf_dset (datasets.Dataset) – Preprocessed Hugging Face dataset to be wrapped.
  • cols (dict, optional) –

    Columns of datasets.Dataset to be used to construct samples, and (optionally) the semantic tensor type of each of those columns to decode.

    • cols (Dict[Fastai Semantic Tensor]): encode/decode the column (key) with the semantic tensor type (value). If the value is noop, the semantic tensor of the column defaults to TensorTuple.
    • cols (list[str]): specify only the columns and use the default semantic tensor types for them.
      • if the length is 1, regard the 1st element as TensorText
      • if the length is 2, regard the 1st element as TensorText and the 2nd as TensorCategory
      • otherwise, regard all elements as TensorTuple
    • cols (None): pass hf_dset.column_names (list[str]) as cols.
  • hf_toker (transformers.PreTrainedTokenizer, optional) – Hugging Face tokenizer, used for decoding and to provide pad_idx for dynamic padding.
  • neat_show (bool, optional) – Show the original sentence instead of tokens joined by spaces.
  • n_inp (int, optional) – Take the first n_inp columns of cols as x, and the rest as y (see the sketch after the example below).

Example

>>> tokenized_cola_train_set[0]
{'sentence': "Our friends won't buy this analysis, let alone the next one we propose.",
 'label': 1,
 'idx': 0,
 'text_idxs': [ 2256,  2814,  2180,  1005,  1056,  4965,  2023,  4106,  1010,  2292, 2894,  1996,  2279,  2028,  2057, 16599,  1012]}
>>> hf_dset = HF_Dataset(tokenized_cola_train_set, cols=['text_idxs', 'label'], hf_toker=tokenizer_electra_small_fast)
>>> len(hf_dset), hf_dset[0]
8551, (TensorText([ 2256,  2814,  2180,  1005,  1056,  4965,  2023,  4106,  1010,  2292, 2894,  1996,  2279,  2028,  2057, 16599,  1012]), TensorCategory(1))
>>> hf_dset.decode(hf_dset[0])
("our friends won ' t buy this analysis , let alone the next one we propose .", '1')
# The wrapped dataset "is" also the original huggingface dataset
>>> hf_dset.column_names == tokenized_cola_train_set.column_names
True
# Manually specify `cols` with a dict; here it is equivalent to the above, and additionally decodes samples neatly.
>>> neat_hf_dset = HF_Dataset(tokenized_cola_train_set, {'text_idxs':TensorText, 'label':TensorCategory}, hf_toker=tokenizer_electra_small_fast, neat_show=True)
>>> neat_hf_dset.decode(neat_hf_dset[0])
("our friends won't buy this analysis, let alone the next one we propose.", '1')
# Note: the original dataset will be set to PyTorch format with the columns specified in `cols`
>>> tokenized_cola_train_set[0]
{'label': tensor(1),
 'text_idxs': tensor([ 2256,  2814,  2180,  1005,  1056,  4965,  2023,  4106,  1010,  2292, 2894,  1996,  2279,  2028,  2057, 16599,  1012])}
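
The n_inp parameter is not exercised above. As a minimal sketch (hypothetical dataset and column names for a sentence-pair task), the first two columns could be taken as x and the label as y:

>>> # hypothetical: 'sent1_idxs' and 'sent2_idxs' hold the token ids of a tokenized sentence pair
>>> pair_dset = HF_Dataset(tokenized_mrpc_train_set,
...                        cols={'sent1_idxs': TensorText, 'sent2_idxs': TensorText, 'label': TensorCategory},
...                        hf_toker=tokenizer_electra_small_fast,
...                        n_inp=2)  # first 2 columns of cols as x, the rest as y
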
class hugdatafast.fastai.HF_Datasets(hf_dsets: dict, test_with_y=False, **kwargs)

Functions as fastai.data.core.Datasets to create fastai.data.core.Dataloaders from a group of datasets.Dataset objects

Parameters:
  • hf_dsets (Dict[datasets.Dataset]) – Preprocessed Hugging Face datasets; each key is a split name and each value is a datasets.Dataset. Their order becomes the order in fastai.data.core.Dataloaders.
  • test_with_y (bool, optional) – Whether the test set comes with real y (answers) rather than fake y (e.g. all labels are -1). If False, tell only the test set to construct samples from the first n_inp columns (do not output the fake y). Every dataset passed in hf_dsets whose name starts with "test" is regarded as a test set.
  • kwargs – Passed to HF_Dataset. Be sure to pass the arguments that HF_Dataset needs!
dataloaders(self, device='cpu', cache_dir=None, cache_name=None, dl_kwargs=None, **kwargs)
Parameters:
  • device (str) – Device the output batches will be on. Because a batch is loaded as a test when creating the fastai.data.core.Dataloaders, to avoid always leaving a batch of tensors on cuda:0, it is suggested to keep the default value cpu and call dls.to(...) later, whenever you want another device.
  • cache_dir (str, optional) – Directory to store the caches of MySortedDL. If None, use the cache directory of the first datasets.Dataset in hf_dsets passed to HF_Datasets.__init__.
  • cache_name (str, optional) – Format string including one parameter "{split}", which is replaced with the split name to form the cache file name under cache_dir for each split. If None, tell MySortedDL not to cache.
  • dl_kwargs (list[dict], optional) – The ith item is additional kwargs passed to the initialization of the ith dataloader, i.e. the dataloader of the ith split (see the sketch after the examples below).
  • kwargs – Passed to fastai.data.core.FilteredBase.dataloaders()

Example

>>> tokenized_cola
{'train': datasets.Dataset, 'validation': datasets.Dataset, 'test': datasets.Dataset}
>>> tokenized_cola['test'][0]
{'sentence': 'Bill whistled past the house.',
 'label': -1, # Fake label. True labels are not open to the public.
 'idx': 0,
 'text_idxs': [3021, 26265, 2627, 1996, 2160, 1012]}
>>> dls = HF_Datasets(tokenized_cola,
...                   cols=['text_idxs', 'label'], hf_toker=hf_tokenizer,  # args for HF_Dataset
...                   ).dataloaders(bs=32, cache_name="dl_cached_for_{split}") # args for MySortedDL
>>> dls.show_batch(max_n=2)
                                                                                                                   text_idxs           label
---------------------------------------------------------------------------------------------------------------------------------------------
0  everybody who has ever, worked in any office which contained any typewriter which had ever been used to type any letters which had    1
   to be signed by any administrator who ever worked in any department like mine will know what i mean.
---------------------------------------------------------------------------------------------------------------------------------------------
1  playing with matches is ; lots of fun, but doing, so and emptying gasoline from one can to another at the same time is a sport best   1
   reserved for arsons.
# the test set won't produce labels because of `test_with_y=False`
>>> dls[-1].show_batch(max_n=2)
                                                                               text_idxs
------------------------------------------------------------------------------------------
0  cultural commissioner megan smith said that the five ` ` soundscape'' pieces would ` `
   give a festive air to park square, they're fun and interesting''.
------------------------------------------------------------------------------------------
1  wendy is eager to sail around the world and bruce is eager to climb kilimanjaro, but
   neither of them can because money is too tight.
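
The dl_kwargs parameter is not exercised above. As a minimal sketch (hypothetical values), per-split kwargs could skip sorting for the validation and test loaders while keeping the defaults for the training loader; the ith dict goes to the MySortedDL of the ith split:

>>> dls = HF_Datasets(tokenized_cola, cols=['text_idxs', 'label'], hf_toker=hf_tokenizer,
...                   ).dataloaders(bs=32,
...                                 dl_kwargs=[{}, {'srtkey_fc': False}, {'srtkey_fc': False}])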