The system hangs whenever a server process is killed, whether on the local machine or on a remote machine: the node where the scheduler is running stops displaying iteration outcomes on the terminal, and no logs are generated. Is this the expected behaviour? Shouldn't it continue to run, with gradients being updated on the backup/replicated server node, as described in the paper?
Here are the steps that I ran (from the "parameter_server/example/linear" directory):
../../script/ps.sh start -nw 4 -ns 3 -hostfile hostfile ../../build/linear -app_file ctr/online_l1lr.conf -num_replicas 2 -report_interval 1
Then I killed a server process on one of the nodes. This stops the system. Killing a worker process, by contrast, does not: SGD continues and eventually converges.
Any help in this regard would be highly appreciated.
Thanks,
Danish