It seems that the API has changed and RobertaTokenizer now returns a dict:
print(len(batch))
print(len(batch['input_ids']))
print(len(batch['attention_mask']))
2 10000 10000
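For reference, here is a minimal sketch of what the tokenizer call returns now (I'm using `'roberta-base'` as a stand-in for whatever tokenizer directory the notebook actually loads):

```python
from transformers import RobertaTokenizer

# stand-in checkpoint; swap in the tokenizer directory used in the notebook
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

batch = tokenizer(
    ['some example text', 'another example'],
    max_length=512,
    padding='max_length',
    truncation=True
)

print(type(batch))              # BatchEncoding, a dict-like object
print(batch.keys())             # dict_keys(['input_ids', 'attention_mask'])
print(len(batch['input_ids']))  # one list of token IDs per input string
```

Passing `return_tensors='pt'` to the same call makes it return torch tensors directly, which would avoid the `torch.tensor(...)` conversion below.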
I updated this:
labels = torch.tensor(batch['input_ids'])
mask = torch.tensor(batch['attention_mask'])
and it runs, but the predictions are garbage; it even predicts the mask token. The special token values also seem to have changed:
i = 0
print(batch['input_ids'][0])
[0, 692, 18622, 1357, 7751, 292, 1055, 280, 7404, 6320, 775, 725, 2144, 280, 11, 10204, 3777, 1265, 1809, 1196, 603, 1141, 10292, 30, 551, 267, 1339, 16, 385, 3374, 458, 9776, 5941, 376, 25474, 2869, 1200, 391, 2690, 421, 17926, 16995, 738, 305, 306, 22813, 376, 7949, 17823, 979, 435, 18387, 1474, 275, 2596, 391, 37, 24908, 738, 2688, 27868, 275, 5802, 624, 769, 13458, 483, 4778, 275, 12869, 532, 18, 679, 3866, 24137, 376, 7751, 17629, 18622, 1133, 8881, 269, 431, 287, 12449, 483, 8040, 6055, 275, 5285, 18, 11754, 367, 275, 6160, 317, 10527, 569, 1593, 13180, 18, 458, 16, 372, 456, 2149, 12053, 16, 500, 317, 6121, 5323, 3328, 569, 1593, 13180, 280, 14633, 18, 762, 12655, 6322, 2483, 6543, 5084, 469, 9105, 18, 679, 1008, 841, 1517, 25736, 3652, 303, 3299, 306, 3062, 292, 15163, 18, 381, 330, 2871, 343, 4721, 316, 16, 16847, 267, 5215, 317, 1008, 841, 1517, 16, 4815, 338, 330, 2756, 435, 3652, 27080, 10964, 12, 39, 13, 18714, 1864, 17, 5579, 1055, 991, 363, 18, 360, 94, 1181, 588, 1728, 841, 343, 351, 12862, 300, 841, 5239, 16, 5617, 10798, 480, 2260, 3606, 421, 14590, 16995, 18, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
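One thing that may help here: rather than assuming fixed IDs for `<s>`, `<pad>`, `</s>` and `<mask>`, they can be read off the tokenizer, and the masking step can exclude them explicitly. A rough sketch, assuming `labels` is the `input_ids` tensor built above:

```python
import torch

# read the actual special token IDs instead of assuming <s>=0, <pad>=1, </s>=2, <mask>=3
print(tokenizer.bos_token_id, tokenizer.pad_token_id,
      tokenizer.eos_token_id, tokenizer.mask_token_id)

input_ids = labels.clone()  # labels = torch.tensor(batch['input_ids']) from above

# mask ~15% of tokens at random, skipping every special token
rand = torch.rand(input_ids.shape)
mask_arr = (rand < 0.15) \
    & (input_ids != tokenizer.bos_token_id) \
    & (input_ids != tokenizer.eos_token_id) \
    & (input_ids != tokenizer.pad_token_id)

input_ids[mask_arr] = tokenizer.mask_token_id
```

If the masking/label code and the tokenizer disagree about which IDs are special, that alone could explain the model predicting the mask token.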
I wonder if something else also changed that is affecting the tokenization and, as a consequence, the learning...
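A quick way to check the tokenization side in isolation would be to round-trip a sample back to text (a sketch, using the same tokenizer and batch as above):

```python
# decode a sample back to text; if this reads as normal text, the tokenizer itself is fine
ids = batch['input_ids'][0]
print(tokenizer.decode(ids, skip_special_tokens=True))

# inspect the first few raw tokens to see where the special tokens land
print(tokenizer.convert_ids_to_tokens(ids[:20]))
```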