1. Preparing and Loading Local Data¶
Local data can be prepared in CSV or JSON format. CSV is naturally one record per line, so it needs no special handling. JSON data must likewise be arranged as one record per line (the JSON Lines format), where each line of the file is a single JSON object, for example:
{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}
The code for loading local csv and json data is as follows:
from datasets import load_dataset
# Load CSV-format data
dataset_csv = load_dataset("csv", data_files="dataset/test.csv")
# Load JSON-format data
dataset_json = load_dataset("json", data_files="dataset/test.json")
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 1002.46it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 249.36it/s]
Generating train split: 3 examples [00:00, 70.71 examples/s]
Downloading data files: 100%|██████████| 1/1 [00:00<?, ?it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 142.83it/s]
Generating train split: 1 examples [00:00, 133.03 examples/s]
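As a quick sanity check (output reproduced as comments; the feature list depends on the CSV's columns and is elided here), load_dataset returns a DatasetDict with a single train split by default:
print(dataset_csv)
# DatasetDict({
#     train: Dataset({
#         features: [...],
#         num_rows: 3
#     })
# })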
You can also specify training and test splits at load time; in that case the data_files parameter takes a dictionary as its value. Sample code is shown below:
dataset_train_test = load_dataset("json", data_files={"train": ["dataset/test_1.json", "dataset/test_2.json"], "test": "dataset/test.json"})
Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 1983.12it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 569.92it/s]
Generating train split: 2 examples [00:00, 153.38 examples/s]
Generating test split: 1 examples [00:00, 95.02 examples/s]
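Each split can then be accessed by the key it was given in the data_files dictionary:
train_ds = dataset_train_test["train"]  # 2 examples, from the two training files
test_ds = dataset_train_test["test"]    # 1 example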
2. Accessing the Dataset¶
We use the rotten_tomatoes dataset as the running example; the loading code is shown below. Loading returns a Dataset object, and the rest of this section covers how to use it.
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes", split="train")
Downloading builder script: 100%|██████████| 5.03k/5.03k [00:00<?, ?B/s]
Downloading metadata: 100%|██████████| 2.02k/2.02k [00:00<?, ?B/s]
Downloading readme: 100%|██████████| 7.25k/7.25k [00:00<?, ?B/s]
Downloading data: 100%|██████████| 488k/488k [00:00<00:00, 1.26MB/s]
Generating train split: 100%|██████████| 8530/8530 [00:00<00:00, 31398.76 examples/s]
Generating validation split: 100%|██████████| 1066/1066 [00:00<00:00, 21079.39 examples/s]
Generating test split: 100%|██████████| 1066/1066 [00:00<00:00, 22167.00 examples/s]
2.1 Accessing by Index or Column Name¶
Note: when combining an index with a column name, use the index first and the column name second. The reverse order yields the same result but is much slower, because dataset["text"] materializes the entire column before the single row is picked out (see the comparison at the end of this subsection).
# Access by index
dataset[0]
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
# Access by column name
dataset["text"]
# Combine index and column name
dataset[0]["text"]
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
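For comparison, the slower ordering mentioned above returns the same value; a minimal sketch:
# Slow: dataset["text"] first loads the entire column as a Python list,
# and only then indexes into it
dataset["text"][0]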
2.2 Slicing¶
# Get the first three rows
dataset[:3]
{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .', 'effective but too-tepid biopic'], 'label': [1, 1, 1]}
# Get rows three through five (the end of the slice is exclusive)
dataset[3:6]
{'text': ['if you sometimes like to go to the movies to have fun , wasabi is a good place to start .', "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .", 'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .'], 'label': [1, 1, 1]}
3. Tokenizing the Data¶
3.1 Passing Text Directly to the Tokenizer¶
This section covers only the simplest tokenization usage; for details, see the documentation of the Tokenizers library.
Below, tokenization uses the bert-base-uncased model as an example, and the dataset is rotten_tomatoes. The code for loading the tokenizer and the dataset is as follows:
from transformers import AutoTokenizer
from datasets import load_dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("rotten_tomatoes", split="train")
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 1.19MB/s]
The code for tokenizing a single record of the dataset is shown below. The return value contains three fields: input_ids, token_type_ids, and attention_mask. One thing to notice: when the input is a single text, each field in the return value holds a single vector; when the input is multiple texts, each field holds a list of vectors.
# Tokenize a single text: each field is a single vector
tokenizer(dataset[0]["text"])
# Tokenize a list of texts: each field is a list of vectors
tokenizer(dataset["text"])
3.2 Using the map Function¶
To use map, first define a function that tokenizes the data, such as the tokenization function in the sample code below. When map is called with the batched parameter set, the custom tokenization function receives multiple records at once, and that batched input can be used in essentially the same way as a Dataset object.
def tokenization(example):
    # With batched=True, example["text"] is a list of texts
    return tokenizer(example["text"])
dataset = dataset.map(tokenization, batched=True)
Map: 100%|██████████| 8530/8530 [00:00<00:00, 11501.64 examples/s]
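After the map call, the tokenized fields have been added as new columns alongside the original ones:
print(dataset.column_names)
# ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']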
The sample code above shows only the simplest use of map; it also supports more advanced features, such as letting the function return a different number of rows than it receives, as sketched below.
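A minimal sketch of such a size-changing batched map (the duplicate function here is our own illustration): each text is emitted twice, so the dataset doubles. When the row count changes, the original columns must be dropped via remove_columns so that their lengths cannot conflict with the new ones.
def duplicate(batch):
    # Emit every text twice; the output batch is larger than the input batch
    return {"text": [t for t in batch["text"] for _ in range(2)]}

doubled = dataset.map(duplicate, batched=True, remove_columns=dataset.column_names)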
3.3 Setting the Dataset Format¶
Hugging Face's datasets library supports not only PyTorch but also TensorFlow, so after tokenizing the dataset you still need to set which framework's format it should return. In addition, the dataset may contain fields that cannot be converted to tensors; those fields are best dropped before training, otherwise they may raise errors inside the training loop. Both operations can be done with the following single line of code:
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[27], line 1
----> 1 dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])

File d:\Software\Miniconda3\envs\llm\lib\site-packages\datasets\arrow_dataset.py:2525, in Dataset.set_format(self, type, columns, output_all_columns, **format_kwargs)
   2524 if columns is not None and any(col not in self._data.column_names for col in columns):
-> 2525     raise ValueError(
   2526         f"Columns {list(filter(lambda col: col not in self._data.column_names, columns))} not in the dataset. Current columns in the dataset: {self._data.column_names}"
   2527     )

ValueError: Columns ['labels'] not in the dataset. Current columns in the dataset: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']
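The call fails because the tokenized dataset's label column is named label, not labels, as the column list in the error message shows. A minimal sketch of one fix: rename the column first, since labels is the name most transformers models expect, then set the format.
# Rename "label" to "labels", then set the PyTorch output format
dataset = dataset.rename_column("label", "labels")
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])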