Solutions for when datasets and metrics fail to load

Date: 2024-10-26 07:34:29

诸神缄默不语: index of my personal **** blog posts

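While using Hugging Face's datasets package I ran into a problem where neither datasets nor metrics could be loaded, so I wrote this post to record and share the fix. Below I cover, in order, my code and environment, the error message, the cause of the error, and the solution. Datasets come first, metrics second.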

System environment:
Operating system: Linux
Python version: 3.8.12
Code editor: VSCode + Jupyter Notebook
datasets version: 2.0.0

Datasets:

Code:

import datasets
dataset=datasets.load_dataset("yelp_review_full")

Error message:

ConnectionError                           Traceback (most recent call last)
/tmp/ipykernel_21708/ in <module>
----> 1 dataset=datasets.load_dataset("yelp_review_full")

myenv/lib/python3.8/site-packages/datasets/ in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1658 
   1659     # Create a dataset builder
-> 1660     builder_instance = load_dataset_builder(
   1661         path=path,
   1662         name=name,

myenv/lib/python3.8/site-packages/datasets/ in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, use_auth_token, **config_kwargs)
   1484         download_config = download_config.copy() if download_config else DownloadConfig()
   1485         download_config.use_auth_token = use_auth_token
-> 1486     dataset_module = dataset_module_factory(
   1487         path,
   1488         revision=revision,

myenv/lib/python3.8/site-packages/datasets/ in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1236                         f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
   1237                     ) from None
-> 1238                 raise e1 from None
   1239     else:
   1240         raise FileNotFoundError(

myenv/lib/python3.8/site-packages/datasets/ in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1173             if ("/") == 0:  # even though the dataset is on the Hub, we get it from GitHub for now
   1174                 # TODO(QL): use a Hub dataset module factory instead of GitHub
-> 1175                 return GithubDatasetModuleFactory(
   1176                     path,
   1177                     revision=revision,

myenv/lib/python3.8/site-packages/datasets/ in get_module(self)
    531         revision = 
    532         try:
--> 533             local_path = self.download_loading_script(revision)
    534         except FileNotFoundError:
    535             if revision is not None or ("HF_SCRIPTS_VERSION", None) is not None:

myenv/lib/python3.8/site-packages/datasets/ in download_loading_script(self, revision)
    511         if download_config.download_desc is None:
    512             download_config.download_desc = "Downloading builder script"
--> 513         return cached_path(file_path, download_config=download_config)
    514 
    515     def download_dataset_infos_file(self, revision: Optional[str]) -> str:

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in cached_path(url_or_filename, download_config, **download_kwargs)
    232     if is_remote_url(url_or_filename):
    233         # URL, so get it from the cache (downloading if necessary)
--> 234         output_path = get_from_cache(
    235             url_or_filename,
    236             cache_dir=cache_dir,

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, local_files_only, use_etag, max_retries, use_auth_token, ignore_url_params, download_desc)
    580         _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    581         if head_error is not None:
--> 582             raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
    583         elif response is not None:
    584             raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")

ConnectionError: Couldn't reach /huggingface/datasets/2.0.0/datasets/yelp_review_full/yelp_review_full.py (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='', port=443): Read timed out. (read timeout=100)")))

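Clearly the problem is that the remote host cannot be reached: the request for the loading script timed out (the traceback shows datasets fetching the script through GithubDatasetModuleFactory).
If you can use a proxy, the best fix is simply to run the whole session through the proxy.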
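
As a minimal sketch of the proxy route: the downloads in the traceback go through requests, which reads the standard proxy environment variables, so setting them before the first download is usually enough. The helper name `use_proxy` and the address `http://127.0.0.1:7890` are hypothetical placeholders, not values from my setup.

```python
import os


def use_proxy(proxy_url: str) -> None:
    """Point this process's HTTP and HTTPS traffic at the given proxy."""
    # requests reads these variables; both spellings are set for portability.
    for name in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy"):
        os.environ[name] = proxy_url


# Hypothetical address: replace with your own proxy's host and port.
use_proxy("http://127.0.0.1:7890")

# After this, the original call can be retried unchanged:
# import datasets
# dataset = datasets.load_dataset("yelp_review_full")
```

In a notebook, run this cell before the cell that calls load_dataset.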

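If using a proxy directly is inconvenient, here is the workaround I used: load the data through a proxy on my local machine, then upload the files to the environment where the code runs. (The local machine and the server may run different operating systems.)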

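I first tried downloading the Python loading script and uploading it to the server, but after a lot of fiddling it still did not work: the data download link inside the script points to Google Cloud, and neither downloading that data and uploading it by hand nor changing the link to the S3 file helped. In short, I could not get this route to work; if you know a way that does, please tell me.
What did work, roughly, was to load the dataset on my local machine first, save it to disk, upload the folder to the server, and then load the dataset directly from disk there.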

Load the dataset locally and save it to the local disk (note that these are Windows paths):

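import datasets

# Download and cache the dataset locally; backslashes in Windows paths are escaped
dataset = datasets.load_dataset("yelp_review_full", cache_dir='mypath\\data\\huggingfacedatasetscache')

# Write the loaded dataset to a folder that can be copied to the server
dataset.save_to_disk('mypath\\data\\yelp_review_full_disk')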

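Upload the folder to the server:
This can be done with bypy and Baidu Netdisk; see my earlier post "bypy: uploading and downloading Baidu Netdisk files from the Linux command line (essential for transferring large files to a remote server)".
First upload the folder to 我的应用数据/bypy (My App Data/bypy), then download it on the server. Note that downloading a folder copies all the files inside the remote folder into the local folder rather than downloading the folder itself. In the command, the first argument is the remote folder name as uploaded and the second is the destination on the server:
bypy downdir yelp_full_review_disk mypath/datasets/yelp_full_review_disk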
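
Since the folder travels file by file, it can be worth checking that nothing was lost or truncated on the way. The following is a sketch using only the standard library, not part of my original workflow; the function names `folder_manifest` and `missing_or_changed` are made up for illustration.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def folder_manifest(root):
    """Map each file's path relative to root (POSIX style) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def missing_or_changed(local_manifest, server_manifest):
    """Return the relative paths that are absent or different on the server."""
    return sorted(
        name
        for name, digest in local_manifest.items()
        if server_manifest.get(name) != digest
    )
```

Run folder_manifest on the saved folder on each machine, carry one result across (for example as JSON), and compare the two with missing_or_changed; an empty list means every local file arrived intact.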

Then load the dataset from disk on the server:

dataset=datasets.load_from_disk("mypath/datasets/yelp_full_review_disk")
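
If the same notebook has to run both on machines that can reach the Hub and on ones that cannot, the two loading paths can be combined. This is a sketch under the assumption that the ConnectionError in the traceback above is Python's built-in exception; `load_with_fallback` is a hypothetical helper, and the loaders are passed in as callables so that the pattern itself does not depend on datasets.

```python
def load_with_fallback(load_remote, load_local):
    """Try load_remote(); on ConnectionError, return load_local() instead."""
    try:
        return load_remote()
    except ConnectionError as error:
        print(f"Remote load failed ({error}); falling back to the local copy.")
        return load_local()


# Intended use in the notebook, with the paths from the text:
# import datasets
# dataset = load_with_fallback(
#     lambda: datasets.load_dataset("yelp_review_full"),
#     lambda: datasets.load_from_disk("mypath/datasets/yelp_full_review_disk"),
# )
```

Note that the remote attempt still has to time out before the fallback runs (the traceback shows a read timeout of 100 seconds), so on a server that is known to be offline it is quicker to call load_from_disk directly.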

After that, the dataset can be used as usual.
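Note that according to the datasets documentation, the dataset can also be saved directly to an S3FileSystem (/docs/datasets/v2.0.0/en/package_reference/main_classes#.S3FileSystem), that is, to Amazon S3 cloud storage. I assume this serves much the same purpose as Google Cloud or Baidu Netdisk, a place the server can download the files from, and it would probably be more convenient than saving locally and then transferring to the server.
I have not looked into this feature, so I did not use it.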

Metrics:
Code:

metric=datasets.load_metric('accuracy')

Error message:

ConnectionError                           Traceback (most recent call last)
/tmp/ipykernel_24141/ in <module>
----> 1 metric=datasets.load_metric('accuracy')

myenv/lib/python3.8/site-packages/datasets/ in load_metric(path, config_name, process_id, num_process, cache_dir, experiment_id, keep_in_memory, download_config, download_mode, revision, **metric_init_kwargs)
   1390     """
   1391     download_mode = DownloadMode(download_mode or DownloadMode.REUSE_DATASET_IF_EXISTS)
-> 1392     metric_module = metric_module_factory(
   1393         path, revision=revision, download_config=download_config, download_mode=download_mode
   1394     ).module_path

myenv/lib/python3.8/site-packages/datasets/ in metric_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, **download_kwargs)
   1322             except Exception as e2:  # noqa: if it's not in the cache, then it doesn't exist.
   1323                 if not isinstance(e1, FileNotFoundError):
-> 1324                     raise e1 from None
   1325                 raise FileNotFoundError(
   1326                     f"Couldn't find a metric script at {relative_to_absolute_path(combined_path)}. "

myenv/lib/python3.8/site-packages/datasets/ in metric_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, **download_kwargs)
   1310     elif is_relative_path(path) and ("/") == 0 and not force_local_path:
   1311         try:
-> 1312             return GithubMetricModuleFactory(
   1313                 path,
   1314                 revision=revision,

myenv/lib/python3.8/site-packages/datasets/ in get_module(self)
    598         revision = 
    599         try:
--> 600             local_path = self.download_loading_script(revision)
    601             revision = 
    602         except FileNotFoundError:

myenv/lib/python3.8/site-packages/datasets/ in download_loading_script(self, revision)
    592         if download_config.download_desc is None:
    593             download_config.download_desc = "Downloading builder script"
--> 594         return cached_path(file_path, download_config=download_config)
    595 
    596     def get_module(self) -> MetricModule:

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in cached_path(url_or_filename, download_config, **download_kwargs)
    232     if is_remote_url(url_or_filename):
    233         # URL, so get it from the cache (downloading if necessary)
--> 234         output_path = get_from_cache(
    235             url_or_filename,
    236             cache_dir=cache_dir,

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, local_files_only, use_etag, max_retries, use_auth_token, ignore_url_params, download_desc)
    580         _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    581         if head_error is not None:
--> 582             raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
    583         elif response is not None:
    584             raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")

ConnectionError: Couldn't reach /huggingface/datasets/2.0.0/metrics/accuracy/ (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='', port=443): Read timed out. (read timeout=100)")))

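The metric case is simpler: download this Python file to the machine and call that file instead. No proxy is needed for this download. I have not written a dedicated post on downloading GitHub files without a proxy, but my earlier post on a similar topic may help: "Three ways to fix PyG's Planetoid failing to download Cora and other datasets". Then load the metric from the downloaded file: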

metric=datasets.load_metric('mypath/')
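
If accuracy is the only metric needed and even the script download is a nuisance, it can also be computed by hand. This is a swapped-in alternative, not what I did above; it returns a dictionary with an "accuracy" key, which to my knowledge matches the shape of the metric's compute result.

```python
def accuracy(predictions, references):
    """Fraction of predictions equal to their references, as {"accuracy": float}."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


# Three of the four predictions match their references.
result = accuracy(predictions=[0, 1, 1, 2], references=[0, 1, 2, 2])
print(result)  # {'accuracy': 0.75}
```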

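References used in writing this post

  1. datasets documentation on loading methods: /docs/datasets/v2.0.0/en/package_reference/loading_methods
  2. Documentation for datasets.save_to_disk(): /docs/datasets/v2.0.0/en/package_reference/main_classes#.save_to_disk
  3. "HuggingFace: ConnectionError when loading data with datasets, data cannot be fetched; the data can be saved locally" (zero requiem's blog, **** blog): the method is much the same as mine; the author uses Google Colab to load and save the dataset.
  4. "ConnectionError: Couldn't reach //huggingface/datasets/1.15.1/datasets/squad/" (随便写写诶's blog, **** blog): this post probably targets an older datasets version. As far as I can see, datasets are no longer stored at that location, so the method may no longer work.
  5. "HuggingFace code fails when run locally with ConnectionError: Couldn't reach https://raw.githubuserc" (愚昧之山绝望之谷开悟之坡's blog, **** blog): I tried this method. After I put the Python file into the cache folder, it turned out that the Google Cloud data still had to be downloaded. After I put that data into the cache folder as well, other errors appeared that I could not resolve, so I gave up on this approach.
  6. "HuggingFace dataset loading error ConnectionError, no Google Colab needed" (zero requiem's blog, **** blog): similar to item 4.
  7. "Fixing the Connection Error raised by load_dataset when loading the glue dataset with the datasets library" (j_thame_myhome's blog, **** blog): upgrading datasets did not help in my case, because 2.0.0 was already the latest datasets version when I wrote this.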