MindSpore multi-node, multi-card distributed AI training fails with RuntimeError (pipeline.cc:1456)
```bash
#!/bin/bash
# applicable to Ascend
echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash run.sh DATA_PATH RANK_TABLE_FILE RANK_SIZE RANK_START"
echo "For example: bash run.sh /path/dataset /path/rank_table.json 16 0"
echo "It is better to use the absolute path."
echo "=============================================================================================================="

execute_path=$(pwd)
echo ${execute_path}
script_self=$(readlink -f "$0")
self_path=$(dirname "${script_self}")
echo ${self_path}

export DATA_PATH=$1
export RANK_TABLE_FILE=$2
export RANK_SIZE=$3
RANK_START=$4
DEVICE_START=0

# Launch one training process per local device (8 devices per node)
for((i=0;i<=7;i++));
do
  export RANK_ID=$[i+RANK_START]
  export DEVICE_ID=$[i+DEVICE_START]
  rm -rf ${execute_path}/device_$RANK_ID
  mkdir ${execute_path}/device_$RANK_ID
  cd ${execute_path}/device_$RANK_ID || exit
  pytest -s ${self_path}/resnet50_distributed_training.py >train$RANK_ID.log 2>&1 &
done
```
I am running multi-node, multi-card distributed AI training with MindSpore on two machines, each with eight Ascend 910A cards, for sixteen cards in total. While following the distributed training tutorial at https://www.mindspore.cn/tutorials/experts/zh-CN/r1.7/parallel/train_ascend.html, I ran into a problem.
The job fails at startup. The launch commands are:
```bash
# server0
bash run_cluster.sh /path/dataset /path/rank_table.json 16 0
# server1
bash run_cluster.sh /path/dataset /path/rank_table.json 16 8
```
```text
[ERROR] DEVICE(19823,ffff9ad057e0,python3.7):2022-08-03-14:47:37.722.356 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:1268] HcclInit] Invalid environment variable MINDSPORE_HCCL_CONFIG_PATH or RANK_TABLE_FILE, the path is: cluster_rank_table_16pcs.json. Please check (1) whether the path exists, (2) whether the path has the access permission, (3) whether the path is too long.
[ERROR] DEVICE(19823,ffff9ad057e0,python3.7):2022-08-03-14:47:37.722.434 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:1183] InitDevice] HcclInit init failed
[CRITICAL] PIPELINE(19823,ffff9ad057e0,python3.7):2022-08-03-14:47:37.722.459 [mindspore/ccsrc/pipeline/jit/pipeline.cc:1456] InitHccl] Runtime init failed.
============================= test session starts ==============================
platform linux -- Python 3.7.5, pytest-7.1.2, pluggy-1.0.0
rootdir: /sunhanyuan/docs-r1.7/docs/sample_code/distributed_training
collected 0 items / 1 error

==================================== ERRORS ====================================
______________ ERROR collecting resnet50_distributed_training.py _______________
../resnet50_distributed_training.py:35: in <module>
    init()
/usr/local/python37/lib/python3.7/site-packages/mindspore/communication/management.py:142: in init
    init_hccl()
E   RuntimeError: mindspore/ccsrc/pipeline/jit/pipeline.cc:1456 InitHccl] Runtime init failed.
=========================== short test summary info ============================
ERROR ../resnet50_distributed_training.py - RuntimeError: mindspore/ccsrc/pip...
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 7.40s
```
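The error message itself enumerates three checks on the rank table path. Those checks can be scripted before launching; the sketch below is illustrative only (the path and demo file are stand-ins, not the real cluster files):

```shell
# Minimal sketch of the checks the error message suggests.
# The path below is a stand-in; substitute your real rank table path.
RANK_TABLE_FILE=/tmp/cluster_rank_table_16pcs.json
echo '{"version": "1.0"}' > "$RANK_TABLE_FILE"   # demo file so the checks pass

# (1) the path must exist and (2) be readable by the launching user
[ -r "$RANK_TABLE_FILE" ] && echo "rank table readable: ok"

# The path should also be absolute, because the launch script cd's into
# per-device working directories (device_0, device_1, ...) before running,
# so a relative path would no longer resolve there.
case "$RANK_TABLE_FILE" in
  /*) echo "absolute path: ok" ;;
  *)  echo "relative path: will break after the script changes directory" ;;
esac
```

Running this on each node before launch catches path problems early, but as the answer below shows, a path that resolves is not sufficient: the contents of the file must also be correct.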
Answer:
After investigation, the cause turned out to be an incorrect IP configuration in cluster_rank_table_16pcs.json: the IP addresses listed in the rank table must match the actual addresses of the two servers and their Ascend devices.
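For reference, a two-server rank table has roughly the following shape. This is a sketch with placeholder IPs, truncated to two devices per server; a real 16-card table lists all eight devices of each machine, and the exact schema can vary between MindSpore/CANN versions, so treat the field values below as examples rather than working configuration:

```json
{
  "version": "1.0",
  "server_count": "2",
  "server_list": [
    {
      "server_id": "10.0.0.1",
      "device": [
        {"device_id": "0", "device_ip": "192.1.27.6", "rank_id": "0"},
        {"device_id": "1", "device_ip": "192.2.27.6", "rank_id": "1"}
      ],
      "host_nic_ip": "reserve"
    },
    {
      "server_id": "10.0.0.2",
      "device": [
        {"device_id": "0", "device_ip": "192.1.27.7", "rank_id": "8"},
        {"device_id": "1", "device_ip": "192.2.27.7", "rank_id": "9"}
      ],
      "host_nic_ip": "reserve"
    }
  ],
  "status": "completed"
}
```

The `server_id` of each entry should be the reachable IP of that host, and each `device_ip` should match the NIC address actually configured for that Ascend card on the host (on machines with the standard Ascend driver tooling, these can be queried with `hccn_tool -i <device_id> -ip -g`). If any of these entries disagree with the real addresses, HCCL initialization fails with errors like the one above.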