Does setting "find_unused_parameters=True" affect convergence in multi-GPU training?

  yiyi1010 · 2023-04-01 11:42:14 +08:00 · 1097 views

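    When I set "find_unused_parameters=True", training does not converge. The model seems to learn nothing, so I suspect something is wrong with the gradients.

    When I set "find_unused_parameters=False", I get the error below. It reports that the decoder parameters did not receive gradients. What could cause the decoder to get no gradients, and are there any suggestions?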
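
    One way to narrow this down is to run a single forward/backward pass without DDP and list the parameters whose .grad is still None. The sketch below is only an illustration: ToyModel, the use_decoder flag, and params_without_grad are hypothetical names, not taken from the original model, and the skipped decoder branch is just one possible cause of the symptom.

```python
import torch
import torch.nn as nn


class ToyModel(nn.Module):
    """Hypothetical stand-in: an encoder plus a decoder that may be skipped."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 8)

    def forward(self, x, use_decoder=True):
        h = self.encoder(x)
        # If the decoder branch is skipped, its parameters never enter the graph.
        return self.decoder(h) if use_decoder else h


def params_without_grad(model):
    """Names of trainable parameters whose .grad is still None after backward()."""
    return [
        name
        for name, p in model.named_parameters()
        if p.requires_grad and p.grad is None
    ]


torch.manual_seed(0)
model = ToyModel()
loss = model(torch.randn(4, 8), use_decoder=False).sum()
loss.backward()

unused = params_without_grad(model)
print(unused)  # ['decoder.weight', 'decoder.bias']
```

    If the same check on the real model lists the decoder parameters, the decoder output is not reaching the loss on that batch (for example, a conditional branch, a detach(), or an output dropped before the loss is computed).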

    In my model training, setting "find_unused_parameters=False" raises the following error:

    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

    Parameters which did not receive grad for rank 1: decoder.0.norm1.weight, decoder.0.norm1.bias, decoder.0.self_attn.qkv.weight, decoder.0.self_attn.proj.weight, decoder.0.self_attn.proj.bias, decoder.0.norm_q.weight, decoder.0.norm_q.bias, decoder.0.norm_v.weight, decoder.0.norm_v.bias, decoder.0.cross_attn.q_map.weight, decoder.0.cross_attn.k_map.weight, decoder.0.cross_attn.v_map.weight, decoder.0.cross_attn.proj.weight, decoder.0.cross_attn.proj.bias, decoder.0.norm2.weight, decoder.0.norm2.bias, decoder.0.mlp.fc1.weight, decoder.0.mlp.fc1.bias, decoder.0.mlp.fc2.weight, decoder.0.mlp.fc2.bias, decoder.1.norm1.weight, decoder.1.norm1.bias, decoder.1.self_attn.qkv.weight, decoder.1.self_attn.proj.weight, decoder.1.self_attn.proj.bias, decoder.1.norm_q.weight, decoder.1.norm_q.bias, decoder.1.norm_v.weight, decoder.1.norm_v.bias, decoder.1.cross_attn.q_map.weight, decoder.1.cross_attn.k_map.weight, decoder.1.cross_attn.v_map.weight, decoder.1.cross_attn.proj.weight, decoder.1.cross_attn.proj.bias, decoder.1.norm2.weight, decoder.1.norm2.bias, decoder.1.mlp.fc1.weight, decoder.1.mlp.fc1.bias, decoder.1.mlp.fc2.weight, decoder.1.mlp.fc2.bias

    Parameter indices which did not receive grad for rank 1: 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179

    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 463885) of binary: /home/mnt/xyqian/miniconda3/envs/detector_21806_2/bin/python

    /home/mnt/xyqian/miniconda3/envs/detector_21806_2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:
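
    The message says every parameter must take part in producing the loss on every rank. If the decoder is legitimately skipped on some steps, a commonly used workaround is to add a zero-weighted term that touches every decoder parameter, so each one receives a (zero) gradient. The sketch below assumes that situation. The encoder and decoder here are hypothetical stand-ins, and this does not address the case where the decoder output is detached or dropped by mistake.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Linear(8, 8)  # hypothetical stand-ins for the real modules
decoder = nn.Linear(8, 8)

x = torch.randn(4, 8)
loss = encoder(x).sum()  # the decoder is not used on this step

# Zero-weighted term: puts every decoder parameter into the autograd graph
# without changing the value of the loss (assuming finite parameter values).
dummy = sum(p.sum() for p in decoder.parameters())
loss = loss + 0.0 * dummy
loss.backward()

# Every decoder parameter now has a gradient, and that gradient is all zeros.
decoder_grads_ok = all(
    p.grad is not None and bool(torch.all(p.grad == 0))
    for p in decoder.parameters()
)
```

    With this term in place, the gradient reduction should see a gradient for every parameter on every rank, so find_unused_parameters can stay False. Whether that also fixes the non-convergence depends on why the decoder was skipped in the first place.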

    No replies yet