Tensorflow和PyTorch在用CUDA初始化时挂起了彩。

当我尝试运行一个非常简陋的Tensorflow例子时。

import tensorflow as tf

c = tf.constant([1,2,3])

系统永远挂起(至少10分钟),没有任何迹象表明它在做什么。在这种状态下,它100%使用了一个虚拟CPU核心。当在Juypter笔记本中运行时,内核会将此输出到控制台。

2020-03-31 11:12:04.840507: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-03-31 11:12:04.840576: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-03-31 11:12:04.840589: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-03-31 11:12:05.521172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-03-31 11:12:05.539193: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-31 11:12:05.539639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.7845GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s
2020-03-31 11:12:05.539841: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-31 11:12:05.541113: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-31 11:12:05.542119: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-31 11:12:05.542324: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-31 11:12:05.543632: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-31 11:12:05.544401: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-31 11:12:05.547212: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-31 11:12:05.547337: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-31 11:12:05.548015: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-31 11:12:05.548512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-03-31 11:12:05.567845: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3393550000 Hz
2020-03-31 11:12:05.568364: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564107e16440 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-31 11:12:05.568395: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version

我之前在这个系统上用过Tensorflow 所以我觉得这可能是系统更新导致的某种库的问题。

GPU是Nvidia GTX 1070。Tensorflow的版本是2.1.0,从它工作的时候开始就没有改变过。运行Arch Linux,如果这很重要的话。

我试着从CUDA 10.2降级到10.1,但问题仍然发生。

我也可以用PyTorch重现这个问题。

import torch
import transformers

t = torch.tensor([1,2,3])
t.cuda()

(import transformers 防止出现 “CUDA.Out of memory “的问题–一定是有什么东西在初始化PyTorch,我不知道该怎么做)。内存不足 “问题–它一定是做了什么初始化PyTorch的工作,但我不知道该怎么做)。)

这也有同样的问题,它冻结了一个CPU核心,虽然它产生的输出较少。

020-03-31 11:13:41.428483: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-03-31 11:13:41.428571: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-03-31 11:13:41.428587: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

我很确定关于TensorRT的抱怨与此无关 因为之前我让它工作的时候 它也会输出这些结果

我怎样才能解决这个问题?或者至少,我还能做什么来确定它在冻结时在做什么?

解决方案:

我的问题是由我对Python进程可以消耗的虚拟内存量设置的ulimit引起的 (在使用了 ulimit -Sv 12000000 zsh中的)。) 我不知道为什么这会导致它挂起,但如果有人遇到类似的问题,请确保你没有限制虚拟内存。

本文来自投稿,不代表运维实战侠立场,如若转载,请注明出处:https://www.shizhanxia.com/225.html

(0)
上一篇 2022年6月29日 下午3:54
下一篇 2022年6月29日 下午3:54

相关推荐

发表评论

登录后才能评论