1. Preparing and Loading Local Data¶
Local data can be prepared in CSV or JSON format. CSV is naturally one record per line, so it needs no special handling. JSON data must likewise be arranged as one record per line (the JSON Lines format), where each line of the file is a single JSON object, for example:
{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}
The code for loading local csv and json data is as follows:
from datasets import load_dataset
# Load CSV-format data
dataset_csv = load_dataset("csv", data_files="dataset/test.csv")
# Load JSON-format data
dataset_json = load_dataset("json", data_files="dataset/test.json")
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 1002.46it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 249.36it/s]
Generating train split: 3 examples [00:00, 70.71 examples/s]
Downloading data files: 100%|██████████| 1/1 [00:00<?, ?it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 142.83it/s]
Generating train split: 1 examples [00:00, 133.03 examples/s]
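As a quick sanity check (output reproduced as comments; the feature list depends on the CSV's columns and is elided here), load_dataset returns a DatasetDict with a single train split by default:
print(dataset_csv)
# DatasetDict({
#     train: Dataset({
#         features: [...],
#         num_rows: 3
#     })
# })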
You can also specify training and test splits at load time; in that case the data_files parameter takes a dictionary as its value. Sample code is shown below:
dataset_train_test = load_dataset("json", data_files={"train": ["dataset/test_1.json", "dataset/test_2.json"], "test": "dataset/test.json"})
Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 1983.12it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 569.92it/s]
Generating train split: 2 examples [00:00, 153.38 examples/s]
Generating test split: 1 examples [00:00, 95.02 examples/s]
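Each split can then be accessed by the key it was given in the data_files dictionary:
train_ds = dataset_train_test["train"]  # 2 examples, from the two training files
test_ds = dataset_train_test["test"]    # 1 example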
2. Accessing the Dataset¶
We use the rotten_tomatoes dataset as the running example; the loading code is shown below. Loading returns a Dataset object, and the rest of this section covers how to use it.
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes", split="train")
Downloading builder script: 100%|██████████| 5.03k/5.03k [00:00<?, ?B/s]
Downloading metadata: 100%|██████████| 2.02k/2.02k [00:00<?, ?B/s]
Downloading readme: 100%|██████████| 7.25k/7.25k [00:00<?, ?B/s]
Downloading data: 100%|██████████| 488k/488k [00:00<00:00, 1.26MB/s]
Generating train split: 100%|██████████| 8530/8530 [00:00<00:00, 31398.76 examples/s]
Generating validation split: 100%|██████████| 1066/1066 [00:00<00:00, 21079.39 examples/s]
Generating test split: 100%|██████████| 1066/1066 [00:00<00:00, 22167.00 examples/s]
2.1 Accessing by Index or Column Name¶
Note: when combining an index with a column name, use the index first and the column name second. The reverse order yields the same result but is much slower, because dataset["text"] materializes the entire column before the single row is picked out (see the comparison at the end of this subsection).
# Access by index
dataset[0]
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
# Access by column name
dataset["text"]
# Combine index and column name
dataset[0]["text"]
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
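For comparison, the slower ordering mentioned above returns the same value; a minimal sketch:
# Slow: dataset["text"] first loads the entire column as a Python list,
# and only then indexes into it
dataset["text"][0]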
2.2 Slicing¶
# Get the first three rows
dataset[:3]
{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .', 'effective but too-tepid biopic'], 'label': [1, 1, 1]}
# Get rows three through five (the end of the slice is exclusive)
dataset[3:6]
{'text': ['if you sometimes like to go to the movies to have fun , wasabi is a good place to start .', "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .", 'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .'], 'label': [1, 1, 1]}
3. Tokenizing the Data¶
3.1 Passing Text Directly to the Tokenizer¶
This section covers only the simplest tokenization usage; for details, see the documentation of the Tokenizers library.
Below, tokenization uses the bert-base-uncased model as an example, and the dataset is rotten_tomatoes. The code for loading the tokenizer and the dataset is as follows:
from transformers import AutoTokenizer
from datasets import load_dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("rotten_tomatoes", split="train")
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 1.19MB/s]
The code for tokenizing a single record of the dataset is shown below. The return value contains three fields: input_ids, token_type_ids, and attention_mask. One thing to notice: when the input is a single text, each field in the return value holds a single vector; when the input is multiple texts, each field holds a list of vectors.
# Tokenize a single text: each field is a single vector
tokenizer(dataset[0]["text"])
# Tokenize a list of texts: each field is a list of vectors
tokenizer(dataset["text"])
3.2 Using the map Function¶
To use map, first define a function that tokenizes the data, such as the tokenization function in the sample code below. When map is called with the batched parameter set, the custom tokenization function receives multiple records at once, and that batched input can be used in essentially the same way as a Dataset object.
def tokenization(example):
    # With batched=True, example["text"] is a list of texts
    return tokenizer(example["text"])
dataset = dataset.map(tokenization, batched=True)
Map: 100%|██████████| 8530/8530 [00:00<00:00, 11501.64 examples/s]
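After the map call, the tokenized fields have been added as new columns alongside the original ones:
print(dataset.column_names)
# ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']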
The sample code above shows only the simplest use of map; it also supports more advanced features, such as letting the function return a different number of rows than it receives, as sketched below.
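A minimal sketch of such a size-changing batched map (the duplicate function here is our own illustration): each text is emitted twice, so the dataset doubles. When the row count changes, the original columns must be dropped via remove_columns so that their lengths cannot conflict with the new ones.
def duplicate(batch):
    # Emit every text twice; the output batch is larger than the input batch
    return {"text": [t for t in batch["text"] for _ in range(2)]}

doubled = dataset.map(duplicate, batched=True, remove_columns=dataset.column_names)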
3.3 Setting the Dataset Format¶
Hugging Face's datasets library supports not only PyTorch but also TensorFlow, so after tokenizing the dataset you still need to set which framework's format it should return. In addition, the dataset may contain fields that cannot be converted to tensors; those fields are best dropped before training, otherwise they may raise errors inside the training loop. Both operations can be done with the following single line of code:
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[27], line 1
----> 1 dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])

File d:\Software\Miniconda3\envs\llm\lib\site-packages\datasets\arrow_dataset.py:2525, in Dataset.set_format(self, type, columns, output_all_columns, **format_kwargs)
   2524 if columns is not None and any(col not in self._data.column_names for col in columns):
-> 2525     raise ValueError(
   2526         f"Columns {list(filter(lambda col: col not in self._data.column_names, columns))} not in the dataset. Current columns in the dataset: {self._data.column_names}"
   2527     )

ValueError: Columns ['labels'] not in the dataset. Current columns in the dataset: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']
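The call fails because the tokenized dataset's label column is named label, not labels, as the column list in the error message shows. A minimal sketch of one fix: rename the column first, since labels is the name most transformers models expect, then set the format.
# Rename "label" to "labels", then set the PyTorch output format
dataset = dataset.rename_column("label", "labels")
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])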