hugdatafast.fastai

Module Contents

Classes
MySortedDL | A DataLoader that does smart batching and dynamic padding. Unlike SortedDL, it automatically pads every attribute of the samples, can filter samples, and can cache its sort/filter results so they are computed only the first time.
HF_Dataset | A wrapper for datasets.Dataset. It behaves like the original datasets.Dataset, but also functions as a fastai.data.core.datasets that provides samples and decodes.
HF_Datasets | Functions as fastai.data.core.Datasets to create fastai.data.core.Dataloaders from a group of datasets.Dataset s.
class hugdatafast.fastai.MySortedDL(dataset, srtkey_fc=None, filter_fc=False, pad_idx=None, cache_file=None, **kwargs)

A DataLoader that does smart batching and dynamic padding. Unlike SortedDL, it automatically pads every attribute of the samples, can filter samples, and can cache its sort/filter results so they are computed only the first time.

Parameters:
- dataset (HF_Dataset) – Actually, any object that implements __len__ and __getitem__ and returns a tuple as a sample.
- srtkey_fc (*args -> int, optional) – Gets the key used for descending sort from a sample.
  - If None, sort by the length of the first element of a sample.
  - If False, do not sort.
- filter_fc (*args -> bool, optional) – Returns True to keep a sample.
- pad_idx (int, optional) – Pad each attribute of the samples to the max length of that attribute within the batch. A sketch of the List[int] form follows the example below.
  - If List[int], specifies a pad_idx for each attribute of a sample. E.g. if a sample is a tuple (masked_inputs, labels), pad_idx=[0, -100] pads masked_inputs with 0 and labels with -100.
  - If False, do no padding.
  - If None, try dataset.pad_idx, and do no padding if there is no such attribute.
- cache_file (str, optional) – Path of a json file used to cache the sorting and filtering info.
- kwargs – Keyword arguments for TfmdDL or DataLoader.
Example
>>> samples = [ (torch.tensor([1]),     torch.tensor([7,8]),         torch.tensor(1)),
...             (torch.tensor([2,3]),   torch.tensor([9,10,11]),     torch.tensor(2)),
...             (torch.tensor([4,5,6]), torch.tensor([11,12,13,14]), torch.tensor(3)), ]
>>> dl = MySortedDL(samples,
...                 srtkey_fc=lambda *args: len(args[0]),
...                 filter_fc=lambda x1,y1: y1<3,
...                 pad_idx=-1,
...                 cache_file='/tmp/cache.json',  # calls after this will load the cache
...                 bs=999,  # other parameters go to `TfmdDL` and `DataLoader`
...                 )
>>> dl.one_batch()
(tensor([[ 2,  3],
         [ 1, -1]]),
 tensor([[ 9, 10, 11],
         [ 7,  8, -1]]),
 tensor([2, 1]))
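The List[int] form of pad_idx is not shown above, so here is a minimal sketch of it, assuming the same MySortedDL behavior as in the example; the sample data and the expected output are illustrative, not taken from the library's docs:

>>> samples = [ (torch.tensor([5, 6]),    torch.tensor([-100, 2])),
...             (torch.tensor([7, 8, 9]), torch.tensor([3, -100, 4])), ]
>>> dl = MySortedDL(samples,
...                 srtkey_fc=False,    # keep the original order, no sorting
...                 pad_idx=[0, -100],  # pad the 1st attribute with 0, the 2nd with -100
...                 bs=2)
>>> dl.one_batch()
(tensor([[   5,    6,    0],
         [   7,    8,    9]]),
 tensor([[-100,    2, -100],
         [   3, -100,    4]]))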
class hugdatafast.fastai.HF_Dataset(hf_dset, cols=None, hf_toker=None, neat_show=False, n_inp=1)

A wrapper for datasets.Dataset. It behaves like the original datasets.Dataset, but also functions as a fastai.data.core.datasets that provides samples and decodes.

Parameters:
- hf_dset (datasets.Dataset) – Preprocessed Hugging Face dataset to be wrapped.
- cols (dict, optional) – Columns of the datasets.Dataset used to construct samples, and (optionally) the semantic tensor type for each of those columns to decode.
  - cols (Dict[Fastai Semantic Tensor]): encode/decode a column (key) with the semantic tensor type (value). If the value is noop, the semantic tensor of the column defaults to TensorTuple.
  - cols (list[str]): specify only the columns and take the default semantic tensor type for each of them:
    - if the length is 1, regard the 1st element as TensorText
    - if the length is 2, regard the 1st element as TensorText and the 2nd element as TensorCategory
    - otherwise, regard all elements as TensorTuple
  - cols (None): pass hf_dset.column_names (list[str]) as cols.
- hf_toker (transformers.PreTrainedTokenizer, optional) – Hugging Face tokenizer, used to decode samples and to provide pad_idx for dynamic padding.
- neat_show (bool, optional) – Show the original sentence instead of tokens joined by spaces.
- n_inp (int, optional) – Take the first n_inp columns of cols as x, and the rest as y. A sketch using n_inp follows the example below.
Example
>>> tokenized_cola_train_set[0]
{'sentence': "Our friends won't buy this analysis, let alone the next one we propose.",
 'label': 1,
 'idx': 0,
 'text_idxs': [2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012]}
>>> hf_dset = HF_Dataset(tokenized_cola_train_set, cols=['text_idxs', 'label'], hf_toker=tokenizer_electra_small_fast)
>>> len(hf_dset), hf_dset[0]
8551, (TensorText([2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012]), TensorCategory(1))
>>> hf_dset.decode(hf_dset[0])
("our friends won ' t buy this analysis , let alone the next one we propose .", '1')
# The wrapped dataset "is" also the original huggingface dataset
>>> hf_dset.column_names == tokenized_cola_train_set.column_names
True
# Manually specify `cols` with a dict; here it is equivalent to the above, and additionally decodes samples neatly.
>>> neat_hf_dset = HF_Dataset(tokenized_cola_train_set, {'text_idxs': TensorText, 'label': TensorCategory}, hf_toker=tokenizer_electra_small_fast, neat_show=True)
>>> neat_hf_dset.decode(neat_hf_dset[0])
("our friends won't buy this analysis, let alone the next one we propose.", '1')
# Note: the original set will be set to PyTorch format, with the columns specified in `cols`
>>> tokenized_cola_train_set[0]
{'label': tensor(1),
 'text_idxs': tensor([2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012])}
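As a complement, a hedged sketch of n_inp with more than one input column; the dataset tokenized_pair_set and its column names are hypothetical, and only the call pattern is the point:

>>> # hypothetical paired-sentence dataset with columns 'idxs_a', 'idxs_b', 'label'
>>> pair_dset = HF_Dataset(tokenized_pair_set,
...                        cols={'idxs_a': TensorText, 'idxs_b': TensorText, 'label': TensorCategory},
...                        hf_toker=hf_tokenizer,
...                        n_inp=2)  # the first 2 cols become x, 'label' becomes y
>>> x1, x2, y = pair_dset[0]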
class hugdatafast.fastai.HF_Datasets(hf_dsets: dict, test_with_y=False, **kwargs)

Functions as fastai.data.core.Datasets to create fastai.data.core.Dataloaders from a group of datasets.Dataset s.

Parameters:
- hf_dsets (Dict[datasets.Dataset]) – Preprocessed Hugging Face datasets; each key is a split name and each value is a datasets.Dataset. Their order here becomes their order in the fastai.data.core.Dataloaders.
- test_with_y (bool, optional) – Whether the test set comes with real y (answers) rather than fake y (e.g. all labels -1). If False, tell only the test set to construct samples from the first n_inp columns (i.e. do not output the fake y), and every dataset passed in hf_dsets whose name starts with "test" is regarded as a test set. See the sketch after this list.
- kwargs – Passed to HF_Dataset. Be sure to pass the arguments that HF_Dataset needs!
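A small sketch of the test_with_y flag; the my_dsets splits here are hypothetical (a "test" split that ships with real labels), while the column names mirror the CoLA example below:

>>> dsets = HF_Datasets({'train': my_dsets['train'], 'test': my_dsets['test']},  # dict order = dataloader order
...                     cols=['text_idxs', 'label'], hf_toker=hf_tokenizer,  # passed through to HF_Dataset
...                     test_with_y=True)  # the test split will also output y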
dataloaders(self, device='cpu', cache_dir=None, cache_name=None, dl_kwargs=None, **kwargs)

Parameters:
- device (str) – Device the output batches will be on. Because a test batch is loaded when the fastai.data.core.Dataloaders is created, the default is 'cpu', which avoids always leaving a batch of tensors on cuda:0; call dls.to(<other device>) when you actually need it elsewhere.
- cache_dir (str, optional) – Directory in which to store the caches of MySortedDL. If None, use the cache directory of the first datasets.Dataset in the hf_dsets passed to HF_Datasets.__init__.
- cache_name (str, optional) – Format string containing one parameter, "{split}", which is replaced with the name of each split to form the cache file name under cache_dir for that split. If None, tell MySortedDL not to cache.
- dl_kwargs (list[dict], optional) – The i-th item holds additional kwargs passed to the initialization of the dataloader for the i-th split.
- kwargs – Passed to fastai.data.core.FilteredBase.dataloaders().
Example
>>> tokenized_cola
{'train': datasets.Dataset, 'validation': datasets.Dataset, 'test': datasets.Dataset}
>>> tokenized_cola['test'][0]
{'sentence': 'Bill whistled past the house.',
 'label': -1,  # fake label; the true labels are not open to the public
 'idx': 0,
 'text_idxs': [3021, 26265, 2627, 1996, 2160, 1012]}
>>> dls = HF_Datasets(tokenized_cola,
...                   cols=['text_idxs', 'label'], hf_toker=hf_tokenizer,  # args for HF_Dataset
...                   ).dataloaders(bs=32, cache_name="dl_cached_for_{split}")  # args for MySortedDL
>>> dls.show_batch(max_n=2)
     text_idxs                                                                                                                            label
-----------------------------------------------------------------------------------------------------------------------------------------------
0    everybody who has ever, worked in any office which contained any typewriter which had ever been used to type any letters which had      1
     to be signed by any administrator who ever worked in any department like mine will know what i mean.
-----------------------------------------------------------------------------------------------------------------------------------------------
1    playing with matches is ; lots of fun, but doing, so and emptying gasoline from one can to another at the same time is a sport best     1
     reserved for arsons.
# the test set won't produce labels because of `test_with_y=False`
>>> dls[-1].show_batch(max_n=2)
     text_idxs
------------------------------------------------------------------------------------------
0    cultural commissioner megan smith said that the five ` ` soundscape '' pieces would ` ` give a festive air to park square, they're fun and interesting ''.
------------------------------------------------------------------------------------------
1    wendy is eager to sail around the world and bruce is eager to climb kilimanjaro, but neither of them can because money is too tight.
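Building on the example above, a hedged sketch of dl_kwargs and of moving the dataloaders off the default cpu device afterwards; the per-split batch sizes are arbitrary:

>>> dls = HF_Datasets(tokenized_cola, cols=['text_idxs', 'label'], hf_toker=hf_tokenizer
...                   ).dataloaders(bs=32,
...                                 cache_name="dl_cached_for_{split}",
...                                 dl_kwargs=[{}, {'bs': 64}, {'bs': 64}])  # larger batches for validation/test
>>> dls = dls.to('cuda:0')  # move batches to the GPU only when you are about to use them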