WARNING:tensorflow:From /home/ray/anaconda3/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Loading environment football failed: No module named 'gfootball'
2021-02-04 17:19:43,107	INFO worker.py:655 -- Connecting to existing Ray cluster at address: 172.31.24.224:6379
2021-02-04 17:19:43,183	WARNING import_thread.py:132 -- The actor 'DistributedTorchRunner' has been exported 100 times. It's possible that this warning is accidental, but this may indicate that the same remote function is being defined repeatedly from within many tasks and exported to all of the workers. This can be a performance issue and can be resolved by defining the remote function on the driver instead. See https://github.com/ray-project/ray/issues/6240 for more discussion.
== Status ==
Memory usage on this node: 3.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 3/128 CPUs, 3/8 GPUs, 0.0/661.13 GiB heap, 0.0/198.14 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 1/4 (1 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+


2021-02-04 17:19:43,276	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:19:43,336	WARN commands.py:259 -- Loaded cached provider configuration
2021-02-04 17:19:43,336	WARN commands.py:263 -- If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2021-02-04 17:19:43,336	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:19:43,570	INFO commands.py:441 -- Shutdown i-083b602e902a78a09
2021-02-04 17:19:43,571	INFO command_runner.py:356 -- Fetched IP: 34.218.250.17
2021-02-04 17:19:43,571	INFO log_timer.py:27 -- NodeUpdater: i-083b602e902a78a09: Got IP  [LogTimer=0ms]
Warning: Permanently added '34.218.250.17' (ECDSA) to the list of known hosts.

[2m[36m(pid=4267, ip=172.31.21.209)[0m 2021-02-04 17:19:45,081	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:45779 [rank=2]
[2m[36m(pid=4268, ip=172.31.21.209)[0m 2021-02-04 17:19:45,453	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:33671 [rank=0]
[32mStopped all 25 Ray processes.[39m
[0mShared connection to 34.218.250.17 closed.

[2m[36m(pid=4267, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:19:50,513	ERROR worker.py:1053 -- Possible unhandled error from worker: [36mray::NoFaultToleranceTrainable.__init__()[39m (pid=4215, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
    self._trainer = self._create_trainer(config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
    trainer = TorchTrainer(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
    self._start_workers(self.max_replicas)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
    address=address, world_size=num_workers))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
2021-02-04 17:19:50,513	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::NoFaultToleranceTrainable.train_buffered() (pid=4215, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
    self._trainer = self._create_trainer(config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
    trainer = TorchTrainer(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
    self._start_workers(self.max_replicas)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
    address=address, world_size=num_workers))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
2021-02-04 17:19:50,758	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::NoFaultToleranceTrainable.train_buffered() (pid=4215, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
    self._trainer = self._create_trainer(config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
    trainer = TorchTrainer(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
    self._start_workers(self.max_replicas)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
    address=address, world_size=num_workers))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/661.13 GiB heap, 0.0/198.14 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 1
+---------------------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                        |
|---------------------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |            1 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:19:50,920	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=13461, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:19:50,921	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
2021-02-04 17:19:50,926	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=308, ip=172.31.18.216)[0m 2021-02-04 17:19:51,671	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:33939 [rank=2]
[2m[36m(pid=312, ip=172.31.18.216)[0m 2021-02-04 17:19:51,670	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:33939 [rank=0]
[2m[36m(pid=4266, ip=172.31.21.209)[0m 2021-02-04 17:19:51,667	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:33939 [rank=1]
[2m[36m(pid=4231, ip=172.31.21.209)[0m 2021-02-04 17:19:52,463	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:58033 [rank=0]
2021-02-04 17:19:53,562	WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffff6e87c25bbaeb5bc7c4349db903000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {30.000000/32.000000 CPU, 158.203125 GiB/158.203125 GiB memory, 0.000000/2.000000 GPU, 1.000000/1.000000 node:172.31.24.224, 49.511719 GiB/49.511719 GiB object_store_memory, 1.000000/1.000000 accelerator_type:M60}. In total there are 0 pending tasks and 3 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
[2m[36m(pid=4231, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=312, ip=172.31.18.216)[0m 
0it [00:00, ?it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /home/ray/data/cifar-10-python.tar.gz
[2m[36m(pid=312, ip=172.31.18.216)[0m 
  0%|          | 0/170498071 [00:00<?, ?it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
  0%|          | 40960/170498071 [00:00<09:51, 288275.09it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
  0%|          | 155648/170498071 [00:00<04:50, 585655.61it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
  0%|          | 270336/170498071 [00:00<04:09, 682297.12it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
  0%|          | 630784/170498071 [00:00<01:46, 1587561.34it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
  1%|          | 1187840/170498071 [00:01<01:02, 2724002.92it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
  2%|▏         | 3317760/170498071 [00:01<00:20, 8258417.15it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
  3%|▎         | 5332992/170498071 [00:01<00:14, 11448934.68it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
  5%|▍         | 8036352/170498071 [00:01<00:10, 15976323.38it/s]
[2m[36m(pid=4266, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=312, ip=172.31.18.216)[0m 
  6%|▌         | 10018816/170498071 [00:01<00:09, 16577204.38it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
  7%|▋         | 12689408/170498071 [00:01<00:08, 19496979.59it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
  9%|▊         | 14704640/170498071 [00:01<00:08, 19063345.58it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 10%|█         | 17358848/170498071 [00:01<00:07, 21210657.64it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 11%|█▏        | 19513344/170498071 [00:01<00:07, 20412667.95it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 13%|█▎        | 22159360/170498071 [00:01<00:06, 22133928.91it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 14%|█▍        | 24403968/170498071 [00:02<00:06, 21106717.00it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 16%|█▌        | 27041792/170498071 [00:02<00:06, 22582812.32it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 17%|█▋        | 29335552/170498071 [00:02<00:06, 21418204.30it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 19%|█▊        | 31891456/170498071 [00:02<00:06, 22516216.66it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 20%|██        | 34177024/170498071 [00:02<00:06, 21757868.64it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 21%|██▏       | 36610048/170498071 [00:02<00:06, 21441638.69it/s]
2021-02-04 17:19:58,261	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4257, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.55 GiB heap, 0.0/148.63 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 2
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |            1 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |            2 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=312, ip=172.31.18.216)[0m 
 23%|██▎       | 39100416/170498071 [00:02<00:05, 22394072.21it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 24%|██▍       | 41361408/170498071 [00:02<00:06, 21446177.55it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 26%|██▌       | 44015616/170498071 [00:02<00:05, 22853383.41it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 27%|██▋       | 46325760/170498071 [00:03<00:05, 21666544.06it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 29%|██▊       | 48963584/170498071 [00:03<00:05, 22923353.34it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 30%|███       | 51290112/170498071 [00:03<00:05, 21825105.48it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 32%|███▏      | 53813248/170498071 [00:03<00:05, 21716023.90it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 33%|███▎      | 56221696/170498071 [00:03<00:05, 22331231.35it/s]
[2m[36m(pid=4232, ip=172.31.21.209)[0m 2021-02-04 17:19:59,193	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:46325 [rank=2]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 34%|███▍      | 58499072/170498071 [00:03<00:05, 21799886.59it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 36%|███▌      | 60874752/170498071 [00:03<00:04, 22260662.31it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 37%|███▋      | 63217664/170498071 [00:03<00:04, 21973226.05it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 38%|███▊      | 65560576/170498071 [00:03<00:04, 22275837.53it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 40%|███▉      | 67903488/170498071 [00:04<00:04, 22013096.20it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 41%|████      | 70213632/170498071 [00:04<00:04, 22317311.67it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 43%|████▎     | 72589312/170498071 [00:04<00:04, 22004030.48it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 44%|████▍     | 74899456/170498071 [00:04<00:04, 22296391.77it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 45%|████▌     | 77275136/170498071 [00:04<00:04, 22018406.36it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 47%|████▋     | 79568896/170498071 [00:04<00:04, 22248586.75it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 48%|████▊     | 81960960/170498071 [00:04<00:04, 22035699.94it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 49%|████▉     | 84254720/170498071 [00:04<00:03, 22234369.74it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 51%|█████     | 86646784/170498071 [00:04<00:03, 22053426.05it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 52%|█████▏    | 88924160/170498071 [00:05<00:03, 22238716.80it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 54%|█████▎    | 91332608/170498071 [00:05<00:03, 22057743.67it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 55%|█████▍    | 93609984/170498071 [00:05<00:03, 22230442.86it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 56%|█████▋    | 96018432/170498071 [00:05<00:03, 22062449.00it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 58%|█████▊    | 98271232/170498071 [00:05<00:03, 22194420.08it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 59%|█████▉    | 100704256/170498071 [00:05<00:03, 22106389.05it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 60%|██████    | 102924288/170498071 [00:05<00:03, 22121912.27it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 62%|██████▏   | 105406464/170498071 [00:05<00:02, 22173622.86it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 63%|██████▎   | 107626496/170498071 [00:05<00:02, 22159703.11it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 65%|██████▍   | 110125056/170498071 [00:05<00:02, 22283033.27it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 66%|██████▌   | 112353280/170498071 [00:06<00:02, 22166949.84it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 67%|██████▋   | 114827264/170498071 [00:06<00:02, 22327024.36it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 69%|██████▊   | 117063680/170498071 [00:06<00:02, 22169101.84it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 70%|███████   | 119513088/170498071 [00:06<00:02, 22289870.33it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 71%|███████▏  | 121741312/170498071 [00:06<00:02, 21907161.40it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 73%|███████▎  | 124198912/170498071 [00:06<00:02, 22329469.42it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 74%|███████▍  | 126435328/170498071 [00:06<00:02, 21778336.98it/s]
[2m[36m(pid=4232, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 76%|███████▌  | 128884736/170498071 [00:06<00:01, 22165228.60it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 77%|███████▋  | 131104768/170498071 [00:06<00:01, 21637750.07it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 78%|███████▊  | 133570560/170498071 [00:07<00:01, 22346030.12it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 80%|███████▉  | 135815168/170498071 [00:07<00:01, 21762208.01it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 81%|████████  | 138256384/170498071 [00:07<00:01, 22379346.72it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 82%|████████▏ | 140500992/170498071 [00:07<00:01, 21784728.51it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 84%|████████▍ | 142925824/170498071 [00:07<00:01, 22395395.94it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 85%|████████▌ | 145178624/170498071 [00:07<00:01, 21745874.19it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 87%|████████▋ | 147611648/170498071 [00:07<00:01, 22420195.58it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 88%|████████▊ | 149864448/170498071 [00:07<00:00, 21779173.76it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 89%|████████▉ | 152330240/170498071 [00:07<00:00, 22509665.48it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 91%|█████████ | 154591232/170498071 [00:07<00:00, 21851667.44it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 92%|█████████▏| 157016064/170498071 [00:08<00:00, 22493407.97it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 93%|█████████▎| 159277056/170498071 [00:08<00:00, 21836628.63it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 95%|█████████▍| 161701888/170498071 [00:08<00:00, 22512059.64it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 96%|█████████▌| 163962880/170498071 [00:08<00:00, 21835997.03it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 98%|█████████▊| 166420480/170498071 [00:08<00:00, 22282436.08it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
 99%|█████████▉| 168656896/170498071 [00:08<00:00, 21919517.06it/s]
[2m[36m(pid=312, ip=172.31.18.216)[0m Extracting /home/ray/data/cifar-10-python.tar.gz to /home/ray/data
2021-02-04 17:20:04,947	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=381, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:20:04,947	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.55 GiB heap, 0.0/148.63 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 3
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |            1 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |            1 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |            2 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=381, ip=172.31.18.216)[0m 2021-02-04 17:20:04,945	INFO trainable.py:103 -- Trainable.setup took 13.263 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-02-04 17:20:04,953	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=4230, ip=172.31.21.209)[0m 2021-02-04 17:20:06,441	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:48243 [rank=2]
[2m[36m(pid=308, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:20:09,196	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4216, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=4216, ip=172.31.21.209)[0m 2021-02-04 17:20:09,163	INFO trainable.py:103 -- Trainable.setup took 24.785 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=312, ip=172.31.18.216)[0m 
170500096it [00:13, 12393845.48it/s]                               
[2m[36m(pid=4230, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=341, ip=172.31.18.216)[0m 2021-02-04 17:20:09,995	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59171 [rank=1]
[2m[36m(pid=335, ip=172.31.18.216)[0m 2021-02-04 17:20:09,995	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59171 [rank=2]
[2m[36m(pid=4223, ip=172.31.21.209)[0m 2021-02-04 17:20:09,990	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59171 [rank=0]
2021-02-04 17:20:12,194	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=367, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:20:12,195	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.55 GiB heap, 0.0/148.63 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |            2 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |            1 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |            2 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |            1 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:20:12,201	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=341, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4223, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4222, ip=172.31.21.209)[0m 2021-02-04 17:20:13,727	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:52225 [rank=0]
[2m[36m(pid=335, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:20:15,889	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=336, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=336, ip=172.31.18.216)[0m 2021-02-04 17:20:15,887	INFO trainable.py:103 -- Trainable.setup took 10.187 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=4222, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=321, ip=172.31.18.216)[0m 2021-02-04 17:20:17,059	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48855 [rank=2]
[2m[36m(pid=330, ip=172.31.18.216)[0m 2021-02-04 17:20:17,059	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48855 [rank=1]
[2m[36m(pid=4489, ip=172.31.21.209)[0m 2021-02-04 17:20:17,055	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48855 [rank=0]
2021-02-04 17:20:19,524	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4221, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:20:19,527	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.3/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.55 GiB heap, 0.0/148.63 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |            2 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |            3 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |            1 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |            2 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:20:19,536	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=330, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4489, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4497, ip=172.31.21.209)[0m 2021-02-04 17:20:21,317	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56123 [rank=0]
[2m[36m(pid=321, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:20:22,912	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=329, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=322, ip=172.31.18.216)[0m 2021-02-04 17:20:23,821	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:35023 [rank=1]
[2m[36m(pid=320, ip=172.31.18.216)[0m 2021-02-04 17:20:23,821	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:35023 [rank=2]
[2m[36m(pid=4492, ip=172.31.21.209)[0m 2021-02-04 17:20:23,816	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:35023 [rank=0]
[2m[36m(pid=4721, ip=172.31.31.247)[0m 2021-02-04 17:19:45,455	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:33671 [rank=2]
[2m[36m(pid=4721, ip=172.31.31.247)[0m 2021-02-04 17:19:45,517	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=4721, ip=172.31.31.247)[0m Traceback (most recent call last):
[2m[36m(pid=4721, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=4721, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
[2m[36m(pid=4721, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
[2m[36m(pid=4721, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
[2m[36m(pid=4721, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
[2m[36m(pid=4721, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
[2m[36m(pid=4721, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 556, in actor_method_executor
[2m[36m(pid=4721, ip=172.31.31.247)[0m     return method(__ray_actor, *args, **kwargs)
[2m[36m(pid=4721, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/distributed_torch_runner.py", line 61, in setup_process_group
[2m[36m(pid=4721, ip=172.31.31.247)[0m     url, world_rank, world_size, timeout, backend=self.backend)
[2m[36m(pid=4721, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/utils.py", line 43, in setup_process_group
[2m[36m(pid=4721, ip=172.31.31.247)[0m     timeout=timeout)
[2m[36m(pid=4721, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 446, in init_process_group
[2m[36m(pid=4721, ip=172.31.31.247)[0m     timeout=timeout)
[2m[36m(pid=4721, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 532, in _new_process_group_helper
[2m[36m(pid=4721, ip=172.31.31.247)[0m     timeout)
[2m[36m(pid=4721, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=4721, ip=172.31.31.247)[0m     sys.exit(1)
[2m[36m(pid=4721, ip=172.31.31.247)[0m SystemExit: 1
[2m[36m(pid=4722, ip=172.31.31.247)[0m 2021-02-04 17:19:45,454	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:33671 [rank=1]
[2m[36m(pid=4722, ip=172.31.31.247)[0m 2021-02-04 17:19:45,494	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=4722, ip=172.31.31.247)[0m Traceback (most recent call last):
[2m[36m(pid=4722, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=4722, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
[2m[36m(pid=4722, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
[2m[36m(pid=4722, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
[2m[36m(pid=4722, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
[2m[36m(pid=4722, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
[2m[36m(pid=4722, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 556, in actor_method_executor
[2m[36m(pid=4722, ip=172.31.31.247)[0m     return method(__ray_actor, *args, **kwargs)
[2m[36m(pid=4722, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/distributed_torch_runner.py", line 61, in setup_process_group
[2m[36m(pid=4722, ip=172.31.31.247)[0m     url, world_rank, world_size, timeout, backend=self.backend)
[2m[36m(pid=4722, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/utils.py", line 43, in setup_process_group
[2m[36m(pid=4722, ip=172.31.31.247)[0m     timeout=timeout)
[2m[36m(pid=4722, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 446, in init_process_group
[2m[36m(pid=4722, ip=172.31.31.247)[0m     timeout=timeout)
[2m[36m(pid=4722, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 532, in _new_process_group_helper
[2m[36m(pid=4722, ip=172.31.31.247)[0m     timeout)
[2m[36m(pid=4722, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=4722, ip=172.31.31.247)[0m     sys.exit(1)
[2m[36m(pid=4722, ip=172.31.31.247)[0m SystemExit: 1
[2m[33m(raylet, ip=172.31.31.247)[0m [2021-02-04 13:46:32,497 E 258 298] logging.cc:415: *** Aborted at 1612475192 (unix time) try "date -d @1612475192" if you are using GNU date ***
[2m[33m(raylet, ip=172.31.31.247)[0m [2021-02-04 13:46:32,497 E 258 298] logging.cc:415: PC: @                0x0 (unknown)
[2m[36m(pid=4497, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=322, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4492, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:20:27,058	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4498, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:20:27,058	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.6/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.55 GiB heap, 0.0/148.63 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |            4 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |            1 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |            2 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |            3 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:20:27,064	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=320, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4600, ip=172.31.21.209)[0m 2021-02-04 17:20:28,911	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:54427 [rank=2]
2021-02-04 17:20:29,593	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=13464, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=5250, ip=172.31.31.247)[0m 2021-02-04 17:20:30,427	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:57851 [rank=0]
[2m[36m(pid=5251, ip=172.31.31.247)[0m 2021-02-04 17:20:30,428	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:57851 [rank=1]
[2m[36m(pid=4692, ip=172.31.21.209)[0m 2021-02-04 17:20:30,426	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:57851 [rank=2]
[2m[36m(pid=4600, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5250, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=4692, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:20:34,683	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=13844, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:20:34,684	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/661.08 GiB heap, 0.0/198.14 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |            5 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |            2 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |            3 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |            2 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:20:34,689	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=5251, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=4671, ip=172.31.21.209)[0m 2021-02-04 17:20:36,187	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:39841 [rank=0]
2021-02-04 17:20:36,313	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=13809, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=313, ip=172.31.18.216)[0m 2021-02-04 17:20:37,160	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:55527 [rank=1]
[2m[36m(pid=311, ip=172.31.18.216)[0m 2021-02-04 17:20:37,160	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:55527 [rank=0]
[2m[36m(pid=4625, ip=172.31.21.209)[0m 2021-02-04 17:20:37,156	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:55527 [rank=2]
[2m[36m(pid=4671, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=313, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4625, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=311, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:20:41,948	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4699, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:20:41,951	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.3/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/661.08 GiB heap, 0.0/198.14 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |            5 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |            4 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |            2 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |            3 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:20:41,960	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:20:42,929	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4679, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=4617, ip=172.31.21.209)[0m 2021-02-04 17:20:43,466	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:52603 [rank=0]
[2m[36m(pid=5275, ip=172.31.31.247)[0m 2021-02-04 17:20:44,013	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:48333 [rank=1]
[2m[36m(pid=5323, ip=172.31.31.247)[0m 2021-02-04 17:20:44,013	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:48333 [rank=0]
[2m[36m(pid=624, ip=172.31.18.216)[0m 2021-02-04 17:20:44,016	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:48333 [rank=2]
[2m[36m(pid=4617, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5323, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=624, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5275, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:20:49,205	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4624, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:20:49,206	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.3/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/661.08 GiB heap, 0.0/198.14 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |            5 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |            2 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |            3 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |            6 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:20:49,212	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:20:49,880	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4626, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=4601, ip=172.31.21.209)[0m 2021-02-04 17:20:50,752	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:35443 [rank=1]
[2m[36m(pid=4616, ip=172.31.21.209)[0m 2021-02-04 17:20:50,752	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:35443 [rank=0]
[2m[36m(pid=5276, ip=172.31.31.247)[0m 2021-02-04 17:20:51,090	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:57345 [rank=1]
[2m[36m(pid=5274, ip=172.31.31.247)[0m 2021-02-04 17:20:51,090	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:57345 [rank=2]
[2m[36m(pid=4616, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5276, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=4601, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5274, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:20:56,442	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4615, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:20:56,445	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.6/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/661.08 GiB heap, 0.0/198.14 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |            5 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |            4 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |            6 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |            3 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:20:56,454	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:20:56,916	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=13783, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=4602, ip=172.31.21.209)[0m 2021-02-04 17:20:58,069	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55123 [rank=0]
[2m[36m(pid=4603, ip=172.31.21.209)[0m 2021-02-04 17:20:58,069	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55123 [rank=1]
[2m[36m(pid=5266, ip=172.31.31.247)[0m 2021-02-04 17:20:58,071	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55123 [rank=2]
[2m[36m(pid=5267, ip=172.31.31.247)[0m 2021-02-04 17:20:58,087	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:47995 [rank=2]
[2m[36m(pid=5266, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=4602, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5267, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=4603, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:21:03,857	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=14110, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:21:03,857	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.6/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/661.08 GiB heap, 0.0/198.14 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |            5 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |            6 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |            3 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |            6 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:21:03,863	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:21:03,906	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4604, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=640, ip=172.31.18.216)[0m 2021-02-04 17:21:05,397	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37325 [rank=1]
[2m[36m(pid=639, ip=172.31.18.216)[0m 2021-02-04 17:21:05,397	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37325 [rank=0]
[2m[36m(pid=4997, ip=172.31.21.209)[0m 2021-02-04 17:21:05,711	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:51375 [rank=1]
[2m[36m(pid=4996, ip=172.31.21.209)[0m 2021-02-04 17:21:05,710	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:51375 [rank=0]
[2m[36m(pid=639, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4996, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=640, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4997, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:21:11,104	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=638, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:21:11,108	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/661.08 GiB heap, 0.0/198.14 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |            5 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |            4 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |            6 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |            7 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:21:11,117	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:21:11,509	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4945, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=4995, ip=172.31.21.209)[0m 2021-02-04 17:21:12,619	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:37075 [rank=2]
[2m[36m(pid=4985, ip=172.31.21.209)[0m 2021-02-04 17:21:12,629	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:58665 [rank=0]
[2m[36m(pid=5254, ip=172.31.31.247)[0m 2021-02-04 17:21:12,631	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:58665 [rank=2]
[2m[36m(pid=5265, ip=172.31.31.247)[0m 2021-02-04 17:21:12,630	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:58665 [rank=1]
[2m[36m(pid=4995, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5265, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=4985, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5254, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:21:18,363	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=14062, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:21:18,364	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/661.08 GiB heap, 0.0/198.14 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |            5 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |            6 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |            7 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |            6 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:21:18,369	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:21:18,430	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=14063, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=4987, ip=172.31.21.209)[0m 2021-02-04 17:21:19,946	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60985 [rank=2]
[2m[36m(pid=4986, ip=172.31.21.209)[0m 2021-02-04 17:21:20,181	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=2]
[2m[36m(pid=5252, ip=172.31.31.247)[0m 2021-02-04 17:21:20,181	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=0]
[2m[36m(pid=5537, ip=172.31.31.247)[0m 2021-02-04 17:21:20,182	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=1]
[2m[36m(pid=4987, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5252, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=4986, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5537, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:21:25,606	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=14055, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:21:25,609	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.3/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/661.08 GiB heap, 0.0/198.14 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |            6 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |            7 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |            6 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |            7 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:21:25,643	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:21:25,715	INFO commands.py:441 -- Shutdown i-0ac149179edeecfcd
2021-02-04 17:21:25,716	INFO command_runner.py:356 -- Fetched IP: 34.215.60.186
2021-02-04 17:21:25,716	INFO log_timer.py:27 -- NodeUpdater: i-0ac149179edeecfcd: Got IP  [LogTimer=0ms]
Warning: Permanently added '34.215.60.186' (ECDSA) to the list of known hosts.

[2m[36m(pid=630, ip=172.31.18.216)[0m 2021-02-04 17:21:27,467	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:58629 [rank=1]
[2m[36m(pid=629, ip=172.31.18.216)[0m 2021-02-04 17:21:27,467	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:58629 [rank=0]
[32mStopped all 10 Ray processes.[39m
[0mShared connection to 34.215.60.186 closed.

2021-02-04 17:21:33,018	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 7.399 s, which may be a performance bottleneck.
2021-02-04 17:21:33,018	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
== Status ==
Memory usage on this node: 3.8/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 9/128 CPUs, 9/8 GPUs, 0.0/661.08 GiB heap, 0.0/198.14 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (1 PENDING, 3 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |            6 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |            7 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |            6 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |            7 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:21:33,031	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1458, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
2021-02-04 17:21:33,033	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=5253, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:21:33,033	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
2021-02-04 17:21:33,037	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
(pid=4961, ip=172.31.21.209) 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=1]
(pid=4960, ip=172.31.21.209) 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=0]
(pid=5585, ip=172.31.31.247) 2021-02-04 17:21:34,575	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:40689 [rank=1]
(pid=5586, ip=172.31.31.247) 2021-02-04 17:21:34,575	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:40689 [rank=2]
(pid=4960, ip=172.31.21.209) Files already downloaded and verified
(pid=5586, ip=172.31.31.247) Files already downloaded and verified
(pid=4961, ip=172.31.21.209) Files already downloaded and verified
(pid=5585, ip=172.31.31.247) Files already downloaded and verified
2021-02-04 17:21:40,245	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4964, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |            8 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |            6 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |            7 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |            8 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:21:40,398	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=14394, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:21:40,399	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
2021-02-04 17:21:40,402	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
(pid=4958, ip=172.31.21.209) 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=1]
(pid=4959, ip=172.31.21.209) 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=0]
(pid=5553, ip=172.31.31.247) 2021-02-04 17:21:41,915	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:43541 [rank=2]
(pid=5587, ip=172.31.31.247) 2021-02-04 17:21:41,915	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:43541 [rank=1]
(pid=4959, ip=172.31.21.209) Files already downloaded and verified
(pid=5587, ip=172.31.31.247) Files already downloaded and verified
(pid=4958, ip=172.31.21.209) Files already downloaded and verified
(pid=5553, ip=172.31.31.247) Files already downloaded and verified
2021-02-04 17:21:46,841	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4948, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |            9 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |            7 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |            8 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |            7 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=4948, ip=172.31.21.209)[0m 2021-02-04 17:21:46,835	INFO trainable.py:103 -- Trainable.setup took 13.044 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-02-04 17:21:46,870	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:21:46,929	INFO commands.py:441 -- Shutdown i-00f5016b41ab14ccc
2021-02-04 17:21:46,929	INFO command_runner.py:356 -- Fetched IP: 54.68.206.108
2021-02-04 17:21:46,929	INFO log_timer.py:27 -- NodeUpdater: i-00f5016b41ab14ccc: Got IP  [LogTimer=0ms]
Warning: Permanently added '54.68.206.108' (ECDSA) to the list of known hosts.

(pid=4950, ip=172.31.21.209) 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=2]
(pid=4957, ip=172.31.21.209) 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=1]
Stopped all 13 Ray processes.
Shared connection to 54.68.206.108 closed.

2021-02-04 17:21:54,262	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 7.419 s, which may be a performance bottleneck.
2021-02-04 17:21:54,267	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=14386, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.0/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 3/64 CPUs, 3/4 GPUs, 0.0/325.73 GiB heap, 0.0/99.02 GiB objects (0/2.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |            9 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |            8 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |            7 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |            8 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:21:54,274	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::NoFaultToleranceTrainable.train_buffered() (pid=5578, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
    self._trainer = self._create_trainer(config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
    trainer = TorchTrainer(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
    self._start_workers(self.max_replicas)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
    address=address, world_size=num_workers))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
2021-02-04 17:21:54,274	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
2021-02-04 17:21:54,279	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
(pid=341, ip=172.31.18.216) Files already downloaded and verified
(pid=341, ip=172.31.18.216) 2021-02-04 17:20:09,995	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59171 [rank=1]
(pid=336, ip=172.31.18.216) 2021-02-04 17:20:15,887	INFO trainable.py:103 -- Trainable.setup took 10.187 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=631, ip=172.31.18.216) 2021-02-04 17:21:27,788	ERROR worker.py:390 -- SystemExit was raised from the worker
(pid=631, ip=172.31.18.216) Traceback (most recent call last):
(pid=631, ip=172.31.18.216)   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
(pid=631, ip=172.31.18.216)   File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
(pid=631, ip=172.31.18.216)   File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
(pid=631, ip=172.31.18.216)   File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
(pid=631, ip=172.31.18.216)   File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
(pid=631, ip=172.31.18.216)   File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
(pid=631, ip=172.31.18.216)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 556, in actor_method_executor
(pid=631, ip=172.31.18.216)     return method(__ray_actor, *args, **kwargs)
(pid=631, ip=172.31.18.216)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
(pid=631, ip=172.31.18.216)     self.setup(copy.deepcopy(self.config))
(pid=631, ip=172.31.18.216)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
(pid=631, ip=172.31.18.216)     self._trainer = self._create_trainer(config)
(pid=631, ip=172.31.18.216)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
(pid=631, ip=172.31.18.216)     trainer = TorchTrainer(*args, **kwargs)
(pid=631, ip=172.31.18.216)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
(pid=631, ip=172.31.18.216)     self._start_workers(self.max_replicas)
(pid=631, ip=172.31.18.216)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
(pid=631, ip=172.31.18.216)     self.worker_group.start_workers(num_workers)
(pid=631, ip=172.31.18.216)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
(pid=631, ip=172.31.18.216)     address=address, world_size=num_workers))
(pid=631, ip=172.31.18.216)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
(pid=631, ip=172.31.18.216)     return func(*args, **kwargs)
(pid=631, ip=172.31.18.216)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1449, in get
(pid=631, ip=172.31.18.216)     object_refs, timeout=timeout)
(pid=631, ip=172.31.18.216)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 310, in get_objects
(pid=631, ip=172.31.18.216)     object_refs, self.current_task_id, timeout_ms)
(pid=631, ip=172.31.18.216)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
(pid=631, ip=172.31.18.216)     sys.exit(1)
(pid=631, ip=172.31.18.216) SystemExit: 1
(pid=322, ip=172.31.18.216) Files already downloaded and verified
(pid=640, ip=172.31.18.216) 2021-02-04 17:21:05,397	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37325 [rank=1]
(pid=639, ip=172.31.18.216) 2021-02-04 17:21:05,397	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37325 [rank=0]
(pid=321, ip=172.31.18.216) 2021-02-04 17:20:17,059	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48855 [rank=2]
(pid=322, ip=172.31.18.216) 2021-02-04 17:20:23,821	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:35023 [rank=1]
(pid=330, ip=172.31.18.216) Files already downloaded and verified
(pid=624, ip=172.31.18.216) Files already downloaded and verified
(pid=330, ip=172.31.18.216) 2021-02-04 17:20:17,059	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48855 [rank=1]
(pid=320, ip=172.31.18.216) 2021-02-04 17:20:23,821	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:35023 [rank=2]
(pid=381, ip=172.31.18.216) 2021-02-04 17:20:04,945	INFO trainable.py:103 -- Trainable.setup took 13.263 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=312, ip=172.31.18.216) Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /home/ray/data/cifar-10-python.tar.gz
(pid=312, ip=172.31.18.216) Extracting /home/ray/data/cifar-10-python.tar.gz to /home/ray/data
(pid=313, ip=172.31.18.216) 2021-02-04 17:20:37,160	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:55527 [rank=1]
(pid=311, ip=172.31.18.216) 2021-02-04 17:20:37,160	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:55527 [rank=0]
(pid=639, ip=172.31.18.216) Files already downloaded and verified
(pid=335, ip=172.31.18.216) Files already downloaded and verified
(pid=640, ip=172.31.18.216) Files already downloaded and verified
(pid=311, ip=172.31.18.216) Files already downloaded and verified
(pid=320, ip=172.31.18.216) Files already downloaded and verified
(pid=624, ip=172.31.18.216) 2021-02-04 17:20:44,016	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:48333 [rank=2]
(pid=312, ip=172.31.18.216) 2021-02-04 17:19:51,670	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:33939 [rank=0]
(pid=312, ip=172.31.18.216) 
 81%|████████  | 138256384/170498071 [00:07<00:01, 22379346.72it/s]
 82%|████████▏ | 140500992/170498071 [00:07<00:01, 21784728.51it/s]
 84%|████████▍ | 142925824/170498071 [00:07<00:01, 22395395.94it/s]
 85%|████████▌ | 145178624/170498071 [00:07<00:01, 21745874.19it/s]
 87%|████████▋ | 147611648/170498071 [00:07<00:01, 22420195.58it/s]
 88%|████████▊ | 149864448/170498071 [00:07<00:00, 21779173.76it/s]
 89%|████████▉ | 152330240/170498071 [00:07<00:00, 22509665.48it/s]
 91%|█████████ | 154591232/170498071 [00:07<00:00, 21851667.44it/s]
 92%|█████████▏| 157016064/170498071 [00:08<00:00, 22493407.97it/s]
 93%|█████████▎| 159277056/170498071 [00:08<00:00, 21836628.63it/s]
 95%|█████████▍| 161701888/170498071 [00:08<00:00, 22512059.64it/s]
 96%|█████████▌| 163962880/170498071 [00:08<00:00, 21835997.03it/s]
 98%|█████████▊| 166420480/170498071 [00:08<00:00, 22282436.08it/s]
 99%|█████████▉| 168656896/170498071 [00:08<00:00, 21919517.06it/s]
170500096it [00:13, 12393845.48it/s]
(pid=308, ip=172.31.18.216) Files already downloaded and verified
(pid=335, ip=172.31.18.216) 2021-02-04 17:20:09,995	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59171 [rank=2]
(pid=321, ip=172.31.18.216) Files already downloaded and verified
(pid=630, ip=172.31.18.216) 2021-02-04 17:21:27,467	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:58629 [rank=1]
(pid=308, ip=172.31.18.216) 2021-02-04 17:19:51,671	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:33939 [rank=2]
(pid=313, ip=172.31.18.216) Files already downloaded and verified
(pid=629, ip=172.31.18.216) 2021-02-04 17:21:27,467	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:58629 [rank=0]
(pid=5543, ip=172.31.31.247) 2021-02-04 17:22:04,336	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37551 [rank=0]
(pid=1003, ip=172.31.18.216) 2021-02-04 17:22:04,340	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37551 [rank=2]
(pid=1002, ip=172.31.18.216) 2021-02-04 17:22:04,360	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47479 [rank=2]
(pid=5544, ip=172.31.31.247) 2021-02-04 17:22:04,357	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47479 [rank=0]
(pid=1003, ip=172.31.18.216) Files already downloaded and verified
(pid=5543, ip=172.31.31.247) Files already downloaded and verified
2021-02-04 17:22:08,654	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=5551, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:22:08,657	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 6.9/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           10 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |            9 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |            7 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |            8 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:22:08,665	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
(pid=5551, ip=172.31.31.247) 2021-02-04 17:22:08,650	INFO trainable.py:103 -- Trainable.setup took 13.595 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=1002, ip=172.31.18.216) Files already downloaded and verified
(pid=5544, ip=172.31.31.247) Files already downloaded and verified
2021-02-04 17:22:10,205	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=5552, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
(pid=5552, ip=172.31.31.247) 2021-02-04 17:22:10,201	INFO trainable.py:103 -- Trainable.setup took 15.146 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=1038, ip=172.31.18.216) 2021-02-04 17:22:10,588	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:44829 [rank=1]
(pid=5542, ip=172.31.31.247) 2021-02-04 17:22:10,586	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:44829 [rank=2]
(pid=1015, ip=172.31.18.216) 2021-02-04 17:22:11,065	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47623 [rank=1]
(pid=5536, ip=172.31.31.247) 2021-02-04 17:22:11,062	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47623 [rank=0]
(pid=1038, ip=172.31.18.216) Files already downloaded and verified
(pid=5542, ip=172.31.31.247) Files already downloaded and verified
2021-02-04 17:22:14,958	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=14641, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:22:14,958	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |            9 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |            8 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |            8 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           11 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:22:14,964	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
(pid=1015, ip=172.31.18.216) Files already downloaded and verified
(pid=5536, ip=172.31.31.247) Files already downloaded and verified
2021-02-04 17:22:16,453	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=14637, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
(pid=1009, ip=172.31.18.216) 2021-02-04 17:22:16,570	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41825 [rank=0]
(pid=5852, ip=172.31.31.247) 2021-02-04 17:22:16,568	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41825 [rank=1]
(pid=1008, ip=172.31.18.216) 2021-02-04 17:22:17,373	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38443 [rank=1]
(pid=5851, ip=172.31.31.247) 2021-02-04 17:22:17,370	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38443 [rank=0]
(pid=1009, ip=172.31.18.216) Files already downloaded and verified
(pid=5852, ip=172.31.31.247) Files already downloaded and verified
2021-02-04 17:22:20,900	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=1014, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:22:20,903	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.1/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |            8 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |            9 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           11 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           10 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:22:20,913	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
(pid=5851, ip=172.31.31.247) Files already downloaded and verified
(pid=1008, ip=172.31.18.216) Files already downloaded and verified
2021-02-04 17:22:22,422	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=5829, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
(pid=1195, ip=172.31.18.216) 2021-02-04 17:22:22,814	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37165 [rank=2]
(pid=5835, ip=172.31.31.247) 2021-02-04 17:22:22,811	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37165 [rank=0]
(pid=5834, ip=172.31.31.247) 2021-02-04 17:22:23,288	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47929 [rank=0]
(pid=1237, ip=172.31.18.216) 2021-02-04 17:22:23,292	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47929 [rank=1]
(pid=4950, ip=172.31.21.209) 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=2]
(pid=4958, ip=172.31.21.209) 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=1]
(pid=4959, ip=172.31.21.209) 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=0]
(pid=4986, ip=172.31.21.209) 2021-02-04 17:21:20,181	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=2]
(pid=4986, ip=172.31.21.209) Files already downloaded and verified
(pid=4958, ip=172.31.21.209) Files already downloaded and verified
(pid=4957, ip=172.31.21.209) 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=1]
(pid=4959, ip=172.31.21.209) Files already downloaded and verified
(pid=4961, ip=172.31.21.209) 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=1]
(pid=4960, ip=172.31.21.209) 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=0]
(pid=4960, ip=172.31.21.209) Files already downloaded and verified
(pid=4961, ip=172.31.21.209) Files already downloaded and verified
(pid=4987, ip=172.31.21.209) Files already downloaded and verified
(pid=4987, ip=172.31.21.209) 2021-02-04 17:21:19,946	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60985 [rank=2]
(pid=4948, ip=172.31.21.209) 2021-02-04 17:21:46,835	INFO trainable.py:103 -- Trainable.setup took 13.044 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=1195, ip=172.31.18.216) Files already downloaded and verified
(pid=5835, ip=172.31.31.247) Files already downloaded and verified
2021-02-04 17:22:27,195	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=5845, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:22:27,196	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.1/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           10 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           11 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           10 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |            9 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:22:27,201	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=5834, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=1237, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:22:28,671	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=5846, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=5830, ip=172.31.31.247)[0m 2021-02-04 17:22:28,799	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:44279 [rank=2]
[2m[36m(pid=1220, ip=172.31.18.216)[0m 2021-02-04 17:22:28,801	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:44279 [rank=0]
[2m[36m(pid=5622, ip=172.31.21.209)[0m 2021-02-04 17:22:29,478	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=1]
[2m[36m(pid=5621, ip=172.31.21.209)[0m 2021-02-04 17:22:29,477	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=0]
[2m[36m(pid=5831, ip=172.31.31.247)[0m 2021-02-04 17:22:29,479	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=2]
[2m[36m(pid=5830, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=1220, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5621, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:22:33,080	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=1219, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:22:33,084	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 5.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           10 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           11 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |            9 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           12 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:22:33,092	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:22:33,123	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:22:33,189	INFO commands.py:441 -- Shutdown i-083b602e902a78a09
2021-02-04 17:22:33,189	INFO command_runner.py:356 -- Fetched IP: 34.218.250.17
2021-02-04 17:22:33,189	INFO log_timer.py:27 -- NodeUpdater: i-083b602e902a78a09: Got IP  [LogTimer=0ms]
Warning: Permanently added '34.218.250.17' (ECDSA) to the list of known hosts.

[2m[36m(pid=5831, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5622, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=1212, ip=172.31.18.216)[0m 2021-02-04 17:22:34,935	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:48577 [rank=2]
[32mStopped all 20 Ray processes.[39m
[0mShared connection to 34.218.250.17 closed.

[2m[36m(pid=1212, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:22:42,186	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 9.088 s, which may be a performance bottleneck.
2021-02-04 17:22:42,190	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::NoFaultToleranceTrainable.train_buffered() (pid=14643, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
    self._trainer = self._create_trainer(config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
    trainer = TorchTrainer(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
    self._start_workers(self.max_replicas)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 220, in start_workers
    self.apply_all_workers(self._initialization_hook)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 246, in apply_all_workers
    return ray.get(self._apply_all_workers(fn))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
== Status ==
Memory usage on this node: 7.8/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           10 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |            9 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           12 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           12 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:22:42,194	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=14642, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:22:42,195	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
2021-02-04 17:22:42,201	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:22:42,210	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::NoFaultToleranceTrainable.train_buffered() (pid=5597, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
    self._trainer = self._create_trainer(config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
    trainer = TorchTrainer(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
    self._start_workers(self.max_replicas)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 231, in start_workers
    ray.get(self._setup_operator())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
ray.exceptions.RayTaskError(RuntimeError): ray::DistributedTorchRunner.setup_operator() (pid=5622, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/distributed_torch_runner.py", line 92, in setup_operator
    scheduler_step_freq=self.scheduler_step_freq)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/training_operator.py", line 148, in __init__
    self.setup(config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/training_operator.py", line 1025, in setup
    state = self.register(**kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/training_operator.py", line 328, in register
    ddp_args=ddp_args)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/training_operator.py", line 161, in _configure_ddp
    for model in models
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/training_operator.py", line 161, in <listcomp>
    for model in models
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 420, in _sync_params_and_buffers
    authoritative_rank)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 979, in _distributed_broadcast_coalesced
    self.process_group, tensors, buffer_size, authoritative_rank
RuntimeError: NCCL error: unhandled system error, NCCL version 2.7.8
[2m[36m(pid=1201, ip=172.31.18.216)[0m 2021-02-04 17:22:43,727	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:51383 [rank=0]
[2m[36m(pid=1197, ip=172.31.18.216)[0m 2021-02-04 17:22:43,732	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=2]
[2m[36m(pid=5613, ip=172.31.21.209)[0m 2021-02-04 17:22:43,728	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=1]
[2m[36m(pid=5614, ip=172.31.21.209)[0m 2021-02-04 17:22:43,728	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=0]
[2m[36m(pid=1201, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5614, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=1197, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5613, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:22:49,403	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=1211, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:22:49,406	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.6/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |           11 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           12 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           12 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |           11 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:22:49,416	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:22:49,488	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=5623, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=5599, ip=172.31.21.209)[0m 2021-02-04 17:22:51,243	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=2]
[2m[36m(pid=5612, ip=172.31.21.209)[0m 2021-02-04 17:22:51,243	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=1]
[2m[36m(pid=1198, ip=172.31.18.216)[0m 2021-02-04 17:22:51,246	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=0]
[2m[36m(pid=1453, ip=172.31.18.216)[0m 2021-02-04 17:22:51,270	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:57991 [rank=2]
[2m[36m(pid=1453, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5612, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=1198, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5599, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:22:56,995	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=1196, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:22:56,996	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |           11 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           13 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |           11 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           13 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:22:57,001	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:22:57,009	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=14983, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=1468, ip=172.31.18.216)[0m 2021-02-04 17:22:58,546	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:58885 [rank=1]
[2m[36m(pid=1490, ip=172.31.18.216)[0m 2021-02-04 17:22:58,546	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:58885 [rank=0]
[2m[36m(pid=5601, ip=172.31.21.209)[0m 2021-02-04 17:22:58,580	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:51469 [rank=2]
[2m[36m(pid=5600, ip=172.31.21.209)[0m 2021-02-04 17:22:58,580	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:51469 [rank=1]
[2m[36m(pid=1490, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5601, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=1468, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5600, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:23:04,297	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=1491, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:23:04,300	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |           13 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |           12 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           13 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           12 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:23:04,309	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:23:04,343	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=1499, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=1467, ip=172.31.18.216)[0m 2021-02-04 17:23:05,823	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:54089 [rank=2]
[2m[36m(pid=5844, ip=172.31.21.209)[0m 2021-02-04 17:23:06,156	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=1]
[2m[36m(pid=5845, ip=172.31.21.209)[0m 2021-02-04 17:23:06,156	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=0]
[2m[36m(pid=1462, ip=172.31.18.216)[0m 2021-02-04 17:23:06,160	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=2]
[2m[36m(pid=1467, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5844, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5544, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5834, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5252, ip=172.31.31.247)[0m 2021-02-04 17:21:20,181	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=0]
[2m[36m(pid=5586, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5834, ip=172.31.31.247)[0m 2021-02-04 17:22:23,288	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47929 [rank=0]
[2m[36m(pid=5543, ip=172.31.31.247)[0m 2021-02-04 17:22:04,336	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37551 [rank=0]
[2m[36m(pid=5552, ip=172.31.31.247)[0m 2021-02-04 17:22:10,201	INFO trainable.py:103 -- Trainable.setup took 15.146 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=5536, ip=172.31.31.247)[0m 2021-02-04 17:22:11,062	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47623 [rank=0]
[2m[36m(pid=5542, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5587, ip=172.31.31.247)[0m 2021-02-04 17:21:41,915	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:43541 [rank=1]
[2m[36m(pid=5553, ip=172.31.31.247)[0m 2021-02-04 17:21:41,915	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:43541 [rank=2]
[2m[36m(pid=5585, ip=172.31.31.247)[0m 2021-02-04 17:21:34,575	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:40689 [rank=1]
[2m[36m(pid=5852, ip=172.31.31.247)[0m 2021-02-04 17:22:16,568	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41825 [rank=1]
[2m[36m(pid=5542, ip=172.31.31.247)[0m 2021-02-04 17:22:10,586	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:44829 [rank=2]
[2m[36m(pid=5537, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5586, ip=172.31.31.247)[0m 2021-02-04 17:21:34,575	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:40689 [rank=2]
[2m[36m(pid=5585, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5544, ip=172.31.31.247)[0m 2021-02-04 17:22:04,357	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47479 [rank=0]
[2m[36m(pid=5551, ip=172.31.31.247)[0m 2021-02-04 17:22:08,650	INFO trainable.py:103 -- Trainable.setup took 13.595 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=5851, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5553, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5537, ip=172.31.31.247)[0m 2021-02-04 17:21:20,182	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=1]
[2m[36m(pid=5830, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5851, ip=172.31.31.247)[0m 2021-02-04 17:22:17,370	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38443 [rank=0]
[2m[36m(pid=5536, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5830, ip=172.31.31.247)[0m 2021-02-04 17:22:28,799	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:44279 [rank=2]
[2m[36m(pid=5831, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5835, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5835, ip=172.31.31.247)[0m 2021-02-04 17:22:22,811	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37165 [rank=0]
[2m[36m(pid=5543, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5587, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5831, ip=172.31.31.247)[0m 2021-02-04 17:22:29,479	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=2]
[2m[36m(pid=6068, ip=172.31.31.247)[0m 2021-02-04 17:22:35,083	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=6068, ip=172.31.31.247)[0m Traceback (most recent call last):
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 375, in ray._raylet.execute_task
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 400, in load_actor_class
[2m[36m(pid=6068, ip=172.31.31.247)[0m     job_id, actor_creation_function_descriptor)
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 496, in _load_actor_class_from_gcs
[2m[36m(pid=6068, ip=172.31.31.247)[0m     actor_class = pickle.loads(pickled_class)
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/__init__.py", line 1, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.util.sgd.torch import TorchTrainer
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/__init__.py", line 12, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.util.sgd.torch.torch_trainer import (TorchTrainer,
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 13, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune import Trainable
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/__init__.py", line 2, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune.tune import run_experiments, run
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 18, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune.trial_runner import TrialRunner
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 28, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune.web_server import TuneServer
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/web_server.py", line 16, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     import requests  # `requests` is not part of stdlib.
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/requests/__init__.py", line 43, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     import urllib3
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "<frozen importlib._bootstrap>", line 980, in _find_and_load
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "<frozen importlib._bootstrap>", line 148, in __enter__
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "<frozen importlib._bootstrap>", line 174, in _get_module_lock
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=6068, ip=172.31.31.247)[0m     sys.exit(1)
[2m[36m(pid=6068, ip=172.31.31.247)[0m SystemExit: 1
[2m[36m(pid=5252, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5852, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[33m(raylet, ip=172.31.31.247)[0m [2021-02-04 13:46:32,497 E 258 298] logging.cc:415: *** Aborted at 1612475192 (unix time) try "date -d @1612475192" if you are using GNU date ***
[2m[33m(raylet, ip=172.31.31.247)[0m [2021-02-04 13:46:32,497 E 258 298] logging.cc:415: PC: @                0x0 (unknown)
[2m[36m(pid=1462, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5845, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:23:11,548	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=14957, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:23:11,548	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           12 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           14 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           12 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           14 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:23:11,556	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:23:11,948	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=5598, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=1461, ip=172.31.18.216)[0m 2021-02-04 17:23:13,090	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=2]
[2m[36m(pid=5866, ip=172.31.21.209)[0m 2021-02-04 17:23:13,086	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=1]
[2m[36m(pid=5865, ip=172.31.21.209)[0m 2021-02-04 17:23:13,086	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=0]
[2m[36m(pid=1452, ip=172.31.18.216)[0m 2021-02-04 17:23:13,113	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:49555 [rank=0]
[2m[36m(pid=1452, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5865, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=1461, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5866, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:23:18,824	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=5920, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:23:18,828	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.1/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           15 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           12 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           14 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |           13 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:23:18,836	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:23:18,871	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=5919, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
(pid=1706, ip=172.31.18.216) 2021-02-04 17:23:20,715	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:42641 [rank=1]
(pid=1704, ip=172.31.18.216) 2021-02-04 17:23:20,715	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:42641 [rank=0]
(pid=6510, ip=172.31.31.247) 2021-02-04 17:23:20,697	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58667 [rank=0]
(pid=6511, ip=172.31.31.247) 2021-02-04 17:23:20,697	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58667 [rank=1]
(pid=1704, ip=172.31.18.216) Files already downloaded and verified
(pid=6510, ip=172.31.31.247) Files already downloaded and verified
(pid=1706, ip=172.31.18.216) Files already downloaded and verified
(pid=6511, ip=172.31.31.247) Files already downloaded and verified
2021-02-04 17:23:26,465	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=1454, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:23:26,466	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.3/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           16 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           14 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |           13 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           13 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:23:26,471	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:23:26,555	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=6483, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
(pid=6497, ip=172.31.31.247) 2021-02-04 17:23:27,998	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:35053 [rank=1]
(pid=6496, ip=172.31.31.247) 2021-02-04 17:23:27,998	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:35053 [rank=0]
(pid=1755, ip=172.31.18.216) 2021-02-04 17:23:28,357	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:47463 [rank=1]
(pid=1771, ip=172.31.18.216) 2021-02-04 17:23:28,357	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:47463 [rank=2]
(pid=6496, ip=172.31.31.247) Files already downloaded and verified
(pid=1771, ip=172.31.18.216) Files already downloaded and verified
(pid=6497, ip=172.31.31.247) Files already downloaded and verified
(pid=1755, ip=172.31.18.216) Files already downloaded and verified
2021-02-04 17:23:33,796	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=6512, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:23:33,799	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.7/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           17 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           13 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           13 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           15 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:23:33,809	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:23:34,085	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=15353, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
(pid=6487, ip=172.31.31.247) 2021-02-04 17:23:35,387	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:48141 [rank=2]
(pid=6495, ip=172.31.31.247) 2021-02-04 17:23:35,387	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:48141 [rank=1]
(pid=1723, ip=172.31.18.216) 2021-02-04 17:23:35,388	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52515 [rank=1]
(pid=1757, ip=172.31.18.216) 2021-02-04 17:23:35,387	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52515 [rank=0]
(pid=1757, ip=172.31.18.216) Files already downloaded and verified
(pid=6495, ip=172.31.31.247) Files already downloaded and verified
(pid=1723, ip=172.31.18.216) Files already downloaded and verified
(pid=6487, ip=172.31.31.247) Files already downloaded and verified
2021-02-04 17:23:41,057	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=1764, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:23:41,058	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           17 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           14 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           15 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |           14 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:23:41,090	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:23:41,169	INFO commands.py:441 -- Shutdown i-00f5016b41ab14ccc
2021-02-04 17:23:41,169	INFO command_runner.py:356 -- Fetched IP: 54.68.206.108
2021-02-04 17:23:41,169	INFO log_timer.py:27 -- NodeUpdater: i-00f5016b41ab14ccc: Got IP  [LogTimer=0ms]
Warning: Permanently added '54.68.206.108' (ECDSA) to the list of known hosts.

(pid=5848, ip=172.31.21.209) 2021-02-04 17:23:42,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=1]
(pid=5860, ip=172.31.21.209) 2021-02-04 17:23:42,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=0]
(pid=1722, ip=172.31.18.216) 2021-02-04 17:23:42,628	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=2]
Stopped all 12 Ray processes.
Shared connection to 54.68.206.108 closed.

2021-02-04 17:23:48,287	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 7.224 s, which may be a performance bottleneck.
2021-02-04 17:23:48,290	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
== Status ==
Memory usage on this node: 5.7/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 9/96 CPUs, 9/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (1 PENDING, 3 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           17 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           14 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           15 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |           14 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:23:48,306	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1458, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
2021-02-04 17:23:48,307	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=1756, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:23:48,308	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
2021-02-04 17:23:48,312	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=1715, ip=172.31.18.216)[0m 2021-02-04 17:23:49,780	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:45265 [rank=2]
[2m[36m(pid=1713, ip=172.31.18.216)[0m 2021-02-04 17:23:49,833	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:40643 [rank=0]
[2m[36m(pid=6486, ip=172.31.31.247)[0m 2021-02-04 17:23:49,831	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:40643 [rank=1]
[2m[36m(pid=6488, ip=172.31.31.247)[0m 2021-02-04 17:23:49,831	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:40643 [rank=2]
[2m[36m(pid=1715, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=6486, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=1713, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=6488, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:23:55,531	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=15377, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.6/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           18 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           15 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           15 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           15 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:23:55,624	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=15379, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:23:55,625	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
2021-02-04 17:23:55,629	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=6715, ip=172.31.31.247)[0m 2021-02-04 17:23:56,734	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:42111 [rank=2]
[2m[36m(pid=6716, ip=172.31.31.247)[0m 2021-02-04 17:23:56,733	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:42111 [rank=0]
[2m[36m(pid=1705, ip=172.31.18.216)[0m 2021-02-04 17:23:57,469	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:34483 [rank=0]
[2m[36m(pid=2012, ip=172.31.18.216)[0m 2021-02-04 17:23:57,470	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:34483 [rank=1]
[2m[36m(pid=6716, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=1705, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=6715, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2012, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:24:02,428	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=1724, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.3/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |           16 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           15 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           15 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           19 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=1724, ip=172.31.18.216)[0m 2021-02-04 17:24:02,426	INFO trainable.py:103 -- Trainable.setup took 13.344 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-02-04 17:24:03,198	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=1714, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:24:03,199	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
2021-02-04 17:24:03,203	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=6776, ip=172.31.31.247)[0m 2021-02-04 17:24:03,332	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:51963 [rank=2]
[2m[36m(pid=6767, ip=172.31.31.247)[0m 2021-02-04 17:24:03,331	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:51963 [rank=0]
[2m[36m(pid=6767, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6776, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:24:09,156	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=6719, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 5.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |           16 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           15 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           19 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           17 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=6719, ip=172.31.31.247)[0m 2021-02-04 17:24:09,151	INFO trainable.py:103 -- Trainable.setup took 12.454 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=2050, ip=172.31.18.216)[0m 2021-02-04 17:24:10,062	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:59085 [rank=1]
[2m[36m(pid=6748, ip=172.31.31.247)[0m 2021-02-04 17:24:10,059	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:59085 [rank=2]
[2m[36m(pid=2048, ip=172.31.18.216)[0m 2021-02-04 17:24:10,377	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:43589 [rank=0]
[2m[36m(pid=6747, ip=172.31.31.247)[0m 2021-02-04 17:24:10,375	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:43589 [rank=1]
[2m[36m(pid=6748, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2050, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:24:14,351	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=15356, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:24:14,354	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.1/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |           17 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           15 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           19 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           17 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:24:14,363	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:24:14,395	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:24:14,463	INFO commands.py:441 -- Shutdown i-00f5016b41ab14ccc
2021-02-04 17:24:14,464	INFO command_runner.py:356 -- Fetched IP: 54.68.206.108
2021-02-04 17:24:14,464	INFO log_timer.py:27 -- NodeUpdater: i-00f5016b41ab14ccc: Got IP  [LogTimer=0ms]
Warning: Permanently added '54.68.206.108' (ECDSA) to the list of known hosts.

[2m[36m(pid=6747, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2048, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2049, ip=172.31.18.216)[0m 2021-02-04 17:24:15,803	INFO trainable.py:103 -- Trainable.setup took 11.822 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=2025, ip=172.31.18.216)[0m 2021-02-04 17:24:15,922	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41341 [rank=0]
[2m[36m(pid=6742, ip=172.31.31.247)[0m 2021-02-04 17:24:15,920	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41341 [rank=2]
Did not find any active Ray processes.
Shared connection to 54.68.206.108 closed.

[2m[36m(pid=6742, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2025, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5866, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5860, ip=172.31.21.209)[0m 2021-02-04 17:23:42,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=0]
[2m[36m(pid=5599, ip=172.31.21.209)[0m 2021-02-04 17:22:51,243	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=2]
[2m[36m(pid=5621, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5859, ip=172.31.21.209)[0m 2021-02-04 17:23:43,058	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=5859, ip=172.31.21.209)[0m Traceback (most recent call last):
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 556, in actor_method_executor
[2m[36m(pid=5859, ip=172.31.21.209)[0m     return method(__ray_actor, *args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self.setup(copy.deepcopy(self.config))
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self._trainer = self._create_trainer(config)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
[2m[36m(pid=5859, ip=172.31.21.209)[0m     trainer = TorchTrainer(*args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self._start_workers(self.max_replicas)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self.worker_group.start_workers(num_workers)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
[2m[36m(pid=5859, ip=172.31.21.209)[0m     address=address, world_size=num_workers))
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
[2m[36m(pid=5859, ip=172.31.21.209)[0m     return func(*args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1449, in get
[2m[36m(pid=5859, ip=172.31.21.209)[0m     object_refs, timeout=timeout)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 310, in get_objects
[2m[36m(pid=5859, ip=172.31.21.209)[0m     object_refs, self.current_task_id, timeout_ms)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=5859, ip=172.31.21.209)[0m     sys.exit(1)
[2m[36m(pid=5859, ip=172.31.21.209)[0m SystemExit: 1
[2m[36m(pid=4950, ip=172.31.21.209)[0m 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=2]
[2m[36m(pid=4958, ip=172.31.21.209)[0m 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=1]
[2m[36m(pid=5599, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4959, ip=172.31.21.209)[0m 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=0]
[2m[36m(pid=4986, ip=172.31.21.209)[0m 2021-02-04 17:21:20,181	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=2]
[2m[36m(pid=5622, ip=172.31.21.209)[0m 2021-02-04 17:22:29,478	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=1]
[2m[36m(pid=5612, ip=172.31.21.209)[0m 2021-02-04 17:22:51,243	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=1]
[2m[36m(pid=5613, ip=172.31.21.209)[0m 2021-02-04 17:22:43,728	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=1]
[2m[36m(pid=5600, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4986, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5621, ip=172.31.21.209)[0m 2021-02-04 17:22:29,477	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=0]
[2m[36m(pid=4958, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4957, ip=172.31.21.209)[0m 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=1]
[2m[36m(pid=5614, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5844, ip=172.31.21.209)[0m 2021-02-04 17:23:06,156	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=1]
[2m[36m(pid=4959, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5601, ip=172.31.21.209)[0m 2021-02-04 17:22:58,580	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:51469 [rank=2]
[2m[36m(pid=5845, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5601, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5866, ip=172.31.21.209)[0m 2021-02-04 17:23:13,086	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=1]
[2m[36m(pid=5614, ip=172.31.21.209)[0m 2021-02-04 17:22:43,728	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=0]
[2m[36m(pid=4961, ip=172.31.21.209)[0m 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=1]
[2m[36m(pid=4960, ip=172.31.21.209)[0m 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=0]
[2m[36m(pid=4960, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5844, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4961, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5845, ip=172.31.21.209)[0m 2021-02-04 17:23:06,156	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=0]
[2m[36m(pid=4987, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5865, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5600, ip=172.31.21.209)[0m 2021-02-04 17:22:58,580	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:51469 [rank=1]
[2m[36m(pid=5622, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5612, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5848, ip=172.31.21.209)[0m 2021-02-04 17:23:42,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=1]
[2m[36m(pid=5613, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4987, ip=172.31.21.209)[0m 2021-02-04 17:21:19,946	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60985 [rank=2]
[2m[36m(pid=4948, ip=172.31.21.209)[0m 2021-02-04 17:21:46,835	INFO trainable.py:103 -- Trainable.setup took 13.044 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=5865, ip=172.31.21.209)[0m 2021-02-04 17:23:13,086	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=0]
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:23:43,099 E 5547 5587] logging.cc:415: *** Aborted at 1612488223 (unix time) try "date -d @1612488223" if you are using GNU date ***
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:23:43,100 E 5547 5587] logging.cc:415: PC: @                0x0 (unknown)
2021-02-04 17:24:21,290	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 6.921 s, which may be a performance bottleneck.
2021-02-04 17:24:21,294	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=2024, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           15 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           19 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           17 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           18 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:24:21,298	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=2049, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:24:21,299	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
2021-02-04 17:24:21,303	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=6336, ip=172.31.21.209)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=1]
[2m[36m(pid=6338, ip=172.31.21.209)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=2]
[2m[36m(pid=6725, ip=172.31.31.247)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=0]
[2m[36m(pid=2019, ip=172.31.18.216)[0m 2021-02-04 17:24:22,826	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:41643 [rank=2]
[2m[36m(pid=6338, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6725, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2019, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=6336, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:24:28,107	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=6741, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           16 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           17 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           18 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           20 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=6741, ip=172.31.31.247)[0m 2021-02-04 17:24:28,103	INFO trainable.py:103 -- Trainable.setup took 12.878 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-02-04 17:24:28,598	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=15719, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:24:28,599	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
2021-02-04 17:24:28,603	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=6717, ip=172.31.31.247)[0m 2021-02-04 17:24:28,993	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48423 [rank=2]
[2m[36m(pid=6723, ip=172.31.31.247)[0m 2021-02-04 17:24:28,993	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48423 [rank=1]
[2m[36m(pid=2013, ip=172.31.18.216)[0m 2021-02-04 17:24:28,994	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48423 [rank=0]
[2m[36m(pid=2236, ip=172.31.18.216)[0m 2021-02-04 17:24:30,447	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:38699 [rank=2]
[2m[36m(pid=6354, ip=172.31.21.209)[0m 2021-02-04 17:24:30,443	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:38699 [rank=1]
[2m[36m(pid=6355, ip=172.31.21.209)[0m 2021-02-04 17:24:30,443	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:38699 [rank=0]
[2m[36m(pid=6723, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2013, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=6355, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6717, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2236, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:24:34,814	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=2018, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 3.8/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           17 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           18 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           20 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           18 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=2018, ip=172.31.18.216)[0m 2021-02-04 17:24:34,812	INFO trainable.py:103 -- Trainable.setup took 12.710 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=6354, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=2274, ip=172.31.18.216)[0m 2021-02-04 17:24:35,618	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:48397 [rank=2]
2021-02-04 17:24:36,248	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=6400, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:24:36,249	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
2021-02-04 17:24:36,252	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=2275, ip=172.31.18.216)[0m 2021-02-04 17:24:37,765	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37705 [rank=0]
[2m[36m(pid=6720, ip=172.31.31.247)[0m 2021-02-04 17:24:37,763	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37705 [rank=2]
[2m[36m(pid=6718, ip=172.31.31.247)[0m 2021-02-04 17:24:37,763	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37705 [rank=1]
[2m[36m(pid=2274, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=6720, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2275, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:24:41,409	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=6399, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.1/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           18 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           20 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           18 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           19 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=6399, ip=172.31.21.209)[0m 2021-02-04 17:24:41,404	INFO trainable.py:103 -- Trainable.setup took 12.054 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=6346, ip=172.31.21.209)[0m 2021-02-04 17:24:42,241	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48325 [rank=0]
[2m[36m(pid=2266, ip=172.31.18.216)[0m 2021-02-04 17:24:42,246	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48325 [rank=2]
[2m[36m(pid=6347, ip=172.31.21.209)[0m 2021-02-04 17:24:42,262	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48325 [rank=1]
[2m[36m(pid=6718, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:24:43,595	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=2267, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:24:43,596	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
2021-02-04 17:24:43,600	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=2252, ip=172.31.18.216)[0m 2021-02-04 17:24:45,160	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52591 [rank=0]
[2m[36m(pid=2266, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=6346, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6347, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:24:48,016	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=6356, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 6.9/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           21 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           18 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           19 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           19 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=6356, ip=172.31.21.209)[0m 2021-02-04 17:24:48,011	INFO trainable.py:103 -- Trainable.setup took 10.973 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=2252, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=6345, ip=172.31.21.209)[0m 2021-02-04 17:24:48,944	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45649 [rank=2]
[2m[36m(pid=7064, ip=172.31.31.247)[0m 2021-02-04 17:24:48,946	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45649 [rank=1]
[2m[36m(pid=7065, ip=172.31.31.247)[0m 2021-02-04 17:24:48,945	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45649 [rank=0]
2021-02-04 17:24:50,893	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=2251, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:24:50,894	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
2021-02-04 17:24:50,898	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=7065, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6345, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=2253, ip=172.31.18.216)[0m 2021-02-04 17:24:52,438	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:37355 [rank=2]
[2m[36m(pid=7064, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:24:54,753	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=15913, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 6.7/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |           19 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           19 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           19 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           22 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=2239, ip=172.31.18.216)[0m 2021-02-04 17:24:55,545	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53241 [rank=1]
[2m[36m(pid=2253, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=6587, ip=172.31.21.209)[0m 2021-02-04 17:24:55,540	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53241 [rank=0]
[2m[36m(pid=6627, ip=172.31.21.209)[0m 2021-02-04 17:24:55,541	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53241 [rank=2]
2021-02-04 17:24:58,243	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=15937, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:24:58,243	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
2021-02-04 17:24:58,247	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=2239, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=6587, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7155, ip=172.31.31.247)[0m 2021-02-04 17:24:59,799	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:47803 [rank=2]
[2m[36m(pid=7175, ip=172.31.31.247)[0m 2021-02-04 17:24:59,799	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:47803 [rank=1]
[2m[36m(pid=2241, ip=172.31.18.216)[0m 2021-02-04 17:24:59,800	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:47803 [rank=0]
[2m[36m(pid=6627, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:25:01,332	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=6337, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 3.8/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |           20 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |           19 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           22 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           20 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=2242, ip=172.31.18.216)[0m 2021-02-04 17:25:02,157	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:39265 [rank=2]
[2m[36m(pid=7175, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2241, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7155, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2242, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:25:05,533	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=6640, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:25:05,537	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
2021-02-04 17:25:05,543	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=7148, ip=172.31.31.247)[0m 2021-02-04 17:25:07,118	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41027 [rank=2]
[2m[36m(pid=7162, ip=172.31.31.247)[0m 2021-02-04 17:25:07,117	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41027 [rank=1]
[2m[36m(pid=2240, ip=172.31.18.216)[0m 2021-02-04 17:25:07,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41027 [rank=0]
2021-02-04 17:25:07,960	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=6625, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           20 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           22 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           20 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           21 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=2549, ip=172.31.18.216)[0m 2021-02-04 17:25:09,230	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41665 [rank=2]
[2m[36m(pid=6604, ip=172.31.21.209)[0m 2021-02-04 17:25:09,226	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41665 [rank=1]
[2m[36m(pid=6628, ip=172.31.21.209)[0m 2021-02-04 17:25:09,226	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41665 [rank=0]
[2m[36m(pid=7162, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2240, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7148, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6604, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=2549, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:25:12,955	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=15919, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:25:12,957	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
2021-02-04 17:25:12,962	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
== Status ==
Memory usage on this node: 3.7/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 9/128 CPUs, 9/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (1 PENDING, 3 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           20 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           23 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           20 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           21 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=6628, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=2578, ip=172.31.18.216)[0m 2021-02-04 17:25:14,817	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:52791 [rank=2]
2021-02-04 17:25:15,064	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7176, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=7093, ip=172.31.31.247)[0m 2021-02-04 17:25:15,873	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:55935 [rank=1]
[2m[36m(pid=7091, ip=172.31.31.247)[0m 2021-02-04 17:25:15,872	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:55935 [rank=0]
[2m[36m(pid=2579, ip=172.31.18.216)[0m 2021-02-04 17:25:15,875	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:55935 [rank=2]
[2m[36m(pid=2578, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7091, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2579, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7093, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:25:20,628	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=15920, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:25:20,631	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           23 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           21 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           21 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |           21 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:25:20,640	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:25:20,671	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:25:20,733	INFO commands.py:441 -- Shutdown i-00f5016b41ab14ccc
2021-02-04 17:25:20,733	INFO command_runner.py:356 -- Fetched IP: 54.68.206.108
2021-02-04 17:25:20,733	INFO log_timer.py:27 -- NodeUpdater: i-00f5016b41ab14ccc: Got IP  [LogTimer=0ms]
Warning: Permanently added '54.68.206.108' (ECDSA) to the list of known hosts.

[2m[36m(pid=2564, ip=172.31.18.216)[0m 2021-02-04 17:25:22,117	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52873 [rank=0]
[32mStopped all 14 Ray processes.[39m
[0mShared connection to 54.68.206.108 closed.

[2m[36m(pid=2564, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:25:27,724	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 7.078 s, which may be a performance bottleneck.
2021-02-04 17:25:27,728	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7092, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |           21 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           21 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |           21 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           24 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:25:27,757	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:25:27,813	INFO commands.py:441 -- Shutdown i-00f5016b41ab14ccc
2021-02-04 17:25:27,813	INFO command_runner.py:356 -- Fetched IP: 54.68.206.108
2021-02-04 17:25:27,814	INFO log_timer.py:27 -- NodeUpdater: i-00f5016b41ab14ccc: Got IP  [LogTimer=0ms]
Did not find any active Ray processes.
Shared connection to 54.68.206.108 closed.

2021-02-04 17:25:33,986	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 6.255 s, which may be a performance bottleneck.
2021-02-04 17:25:33,990	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=2563, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:25:33,992	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |           21 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           22 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |           21 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           24 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:25:34,001	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:25:34,010	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::NoFaultToleranceTrainable.train_buffered() (pid=2577, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
    self._trainer = self._create_trainer(config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
    trainer = TorchTrainer(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
    self._start_workers(self.max_replicas)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 220, in start_workers
    self.apply_all_workers(self._initialization_hook)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 246, in apply_all_workers
    return ray.get(self._apply_all_workers(fn))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
[2m[36m(pid=7082, ip=172.31.31.247)[0m 2021-02-04 17:25:35,493	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:49265 [rank=1]
[2m[36m(pid=7084, ip=172.31.31.247)[0m 2021-02-04 17:25:35,492	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:49265 [rank=0]
[2m[36m(pid=2553, ip=172.31.18.216)[0m 2021-02-04 17:25:35,561	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:41029 [rank=2]
[2m[36m(pid=2562, ip=172.31.18.216)[0m 2021-02-04 17:25:35,561	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:41029 [rank=1]
[2m[36m(pid=7084, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2553, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7082, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=2562, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:25:41,316	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7083, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:25:41,317	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |           23 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00002 |           21 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           24 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           22 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:25:41,322	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:25:41,348	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=16197, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:25:41,375	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:25:41,460	INFO commands.py:441 -- Shutdown i-083b602e902a78a09
2021-02-04 17:25:41,461	INFO command_runner.py:356 -- Fetched IP: 34.218.250.17
2021-02-04 17:25:41,461	INFO log_timer.py:27 -- NodeUpdater: i-083b602e902a78a09: Got IP  [LogTimer=0ms]
Warning: Permanently added '34.218.250.17' (ECDSA) to the list of known hosts.

[2m[36m(pid=2554, ip=172.31.18.216)[0m 2021-02-04 17:25:42,850	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37447 [rank=2]
[2m[36m(pid=2555, ip=172.31.18.216)[0m 2021-02-04 17:25:42,871	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:39763 [rank=2]
[2m[36m(pid=7069, ip=172.31.31.247)[0m 2021-02-04 17:25:42,847	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37447 [rank=0]
[2m[36m(pid=7066, ip=172.31.31.247)[0m 2021-02-04 17:25:42,848	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37447 [rank=1]
[32mStopped all 11 Ray processes.[39m
[0mShared connection to 34.218.250.17 closed.

[2m[36m(pid=2555, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:25:48,563	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 7.214 s, which may be a performance bottleneck.
2021-02-04 17:25:48,567	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1458, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 3/64 CPUs, 3/4 GPUs, 0.0/325.78 GiB heap, 0.0/99.02 GiB objects (0/2.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           24 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           22 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           22 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           24 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:25:48,598	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=16164, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:25:48,599	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
2021-02-04 17:25:48,603	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=5866, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5860, ip=172.31.21.209)[0m 2021-02-04 17:23:42,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=0]
[2m[36m(pid=6346, ip=172.31.21.209)[0m 2021-02-04 17:24:42,241	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48325 [rank=0]
[2m[36m(pid=6336, ip=172.31.21.209)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=1]
[2m[36m(pid=5599, ip=172.31.21.209)[0m 2021-02-04 17:22:51,243	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=2]
[2m[36m(pid=5621, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5859, ip=172.31.21.209)[0m 2021-02-04 17:23:43,058	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=5859, ip=172.31.21.209)[0m Traceback (most recent call last):
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 556, in actor_method_executor
[2m[36m(pid=5859, ip=172.31.21.209)[0m     return method(__ray_actor, *args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self.setup(copy.deepcopy(self.config))
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self._trainer = self._create_trainer(config)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
[2m[36m(pid=5859, ip=172.31.21.209)[0m     trainer = TorchTrainer(*args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self._start_workers(self.max_replicas)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self.worker_group.start_workers(num_workers)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
[2m[36m(pid=5859, ip=172.31.21.209)[0m     address=address, world_size=num_workers))
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
[2m[36m(pid=5859, ip=172.31.21.209)[0m     return func(*args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1449, in get
[2m[36m(pid=5859, ip=172.31.21.209)[0m     object_refs, timeout=timeout)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 310, in get_objects
[2m[36m(pid=5859, ip=172.31.21.209)[0m     object_refs, self.current_task_id, timeout_ms)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=5859, ip=172.31.21.209)[0m     sys.exit(1)
[2m[36m(pid=5859, ip=172.31.21.209)[0m SystemExit: 1
[2m[36m(pid=4950, ip=172.31.21.209)[0m 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=2]
[2m[36m(pid=6604, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4958, ip=172.31.21.209)[0m 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=1]
[2m[36m(pid=5599, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4959, ip=172.31.21.209)[0m 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=0]
[2m[36m(pid=6346, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6628, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4986, ip=172.31.21.209)[0m 2021-02-04 17:21:20,181	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=2]
[2m[36m(pid=6356, ip=172.31.21.209)[0m 2021-02-04 17:24:48,011	INFO trainable.py:103 -- Trainable.setup took 10.973 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=5622, ip=172.31.21.209)[0m 2021-02-04 17:22:29,478	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=1]
[2m[36m(pid=5612, ip=172.31.21.209)[0m 2021-02-04 17:22:51,243	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=1]
[2m[36m(pid=5613, ip=172.31.21.209)[0m 2021-02-04 17:22:43,728	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=1]
[2m[36m(pid=6627, ip=172.31.21.209)[0m 2021-02-04 17:24:55,541	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53241 [rank=2]
[2m[36m(pid=6627, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5600, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4986, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5621, ip=172.31.21.209)[0m 2021-02-04 17:22:29,477	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=0]
[2m[36m(pid=4958, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4957, ip=172.31.21.209)[0m 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=1]
[2m[36m(pid=5614, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5844, ip=172.31.21.209)[0m 2021-02-04 17:23:06,156	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=1]
[2m[36m(pid=4959, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5601, ip=172.31.21.209)[0m 2021-02-04 17:22:58,580	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:51469 [rank=2]
[2m[36m(pid=6338, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5845, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5601, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6587, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6354, ip=172.31.21.209)[0m 2021-02-04 17:24:30,443	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:38699 [rank=1]
[2m[36m(pid=6354, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5866, ip=172.31.21.209)[0m 2021-02-04 17:23:13,086	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=1]
[2m[36m(pid=6628, ip=172.31.21.209)[0m 2021-02-04 17:25:09,226	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41665 [rank=0]
[2m[36m(pid=6399, ip=172.31.21.209)[0m 2021-02-04 17:24:41,404	INFO trainable.py:103 -- Trainable.setup took 12.054 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=6338, ip=172.31.21.209)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=2]
[2m[36m(pid=5614, ip=172.31.21.209)[0m 2021-02-04 17:22:43,728	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=0]
[2m[36m(pid=4961, ip=172.31.21.209)[0m 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=1]
[2m[36m(pid=6604, ip=172.31.21.209)[0m 2021-02-04 17:25:09,226	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41665 [rank=1]
[2m[36m(pid=4960, ip=172.31.21.209)[0m 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=0]
[2m[36m(pid=4960, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6355, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5844, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4961, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5845, ip=172.31.21.209)[0m 2021-02-04 17:23:06,156	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=0]
[2m[36m(pid=6336, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4987, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6355, ip=172.31.21.209)[0m 2021-02-04 17:24:30,443	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:38699 [rank=0]
[2m[36m(pid=6587, ip=172.31.21.209)[0m 2021-02-04 17:24:55,540	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53241 [rank=0]
[2m[36m(pid=5865, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5600, ip=172.31.21.209)[0m 2021-02-04 17:22:58,580	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:51469 [rank=1]
[2m[36m(pid=5622, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6347, ip=172.31.21.209)[0m 2021-02-04 17:24:42,262	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48325 [rank=1]
[2m[36m(pid=5612, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5848, ip=172.31.21.209)[0m 2021-02-04 17:23:42,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=1]
[2m[36m(pid=5613, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4987, ip=172.31.21.209)[0m 2021-02-04 17:21:19,946	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60985 [rank=2]
[2m[36m(pid=6345, ip=172.31.21.209)[0m 2021-02-04 17:24:48,944	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45649 [rank=2]
[2m[36m(pid=4948, ip=172.31.21.209)[0m 2021-02-04 17:21:46,835	INFO trainable.py:103 -- Trainable.setup took 13.044 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=6347, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5865, ip=172.31.21.209)[0m 2021-02-04 17:23:13,086	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=0]
[2m[36m(pid=6345, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:23:43,099 E 5547 5587] logging.cc:415: *** Aborted at 1612488223 (unix time) try "date -d @1612488223" if you are using GNU date ***
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:23:43,100 E 5547 5587] logging.cc:415: PC: @                0x0 (unknown)
[2m[36m(pid=7105, ip=172.31.21.209)[0m 2021-02-04 17:26:01,329	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:53995 [rank=2]
[2m[36m(pid=7106, ip=172.31.21.209)[0m 2021-02-04 17:26:01,338	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:57769 [rank=2]
[2m[36m(pid=2797, ip=172.31.18.216)[0m 2021-02-04 17:26:01,333	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:53995 [rank=1]
[2m[36m(pid=2798, ip=172.31.18.216)[0m 2021-02-04 17:26:01,342	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:57769 [rank=1]
[2m[36m(pid=7105, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=2797, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:26:05,669	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=16155, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:26:05,672	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.2/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           26 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           22 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           22 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           24 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:26:05,680	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:26:05,711	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:26:05,786	INFO commands.py:441 -- Shutdown i-083b602e902a78a09
2021-02-04 17:26:05,787	INFO command_runner.py:356 -- Fetched IP: 34.218.250.17
2021-02-04 17:26:05,787	INFO log_timer.py:27 -- NodeUpdater: i-083b602e902a78a09: Got IP  [LogTimer=0ms]
Warning: Permanently added '34.218.250.17' (ECDSA) to the list of known hosts.

[2m[36m(pid=2798, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7106, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7118, ip=172.31.21.209)[0m 2021-02-04 17:26:07,208	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:38301 [rank=2]
[2m[36m(pid=2840, ip=172.31.18.216)[0m 2021-02-04 17:26:07,212	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:38301 [rank=0]
Did not find any active Ray processes.
[0mShared connection to 34.218.250.17 closed.

[2m[36m(pid=2840, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7118, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:26:12,620	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 6.935 s, which may be a performance bottleneck.
2021-02-04 17:26:12,625	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=16153, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.3/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           26 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           22 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           24 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           23 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:26:12,629	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=2841, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:26:12,630	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
2021-02-04 17:26:12,635	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=2812, ip=172.31.18.216)[0m 2021-02-04 17:26:13,438	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:54249 [rank=2]
[2m[36m(pid=2827, ip=172.31.18.216)[0m 2021-02-04 17:26:13,438	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:54249 [rank=1]
[2m[36m(pid=7112, ip=172.31.21.209)[0m 2021-02-04 17:26:13,433	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:54249 [rank=0]
[2m[36m(pid=7111, ip=172.31.21.209)[0m 2021-02-04 17:26:14,173	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:45755 [rank=0]
[2m[36m(pid=2827, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7112, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7111, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=2812, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:26:19,383	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7147, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           27 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           24 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           23 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           23 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=7147, ip=172.31.21.209)[0m 2021-02-04 17:26:19,376	INFO trainable.py:103 -- Trainable.setup took 12.935 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-02-04 17:26:19,976	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7117, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:26:19,977	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
2021-02-04 17:26:19,981	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=7284, ip=172.31.21.209)[0m 2021-02-04 17:26:20,576	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53075 [rank=0]
[2m[36m(pid=2804, ip=172.31.18.216)[0m 2021-02-04 17:26:20,581	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53075 [rank=2]
[2m[36m(pid=2803, ip=172.31.18.216)[0m 2021-02-04 17:26:20,581	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53075 [rank=1]
[2m[36m(pid=7295, ip=172.31.21.209)[0m 2021-02-04 17:26:21,483	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:45427 [rank=0]
[2m[36m(pid=6510, ip=172.31.31.247)[0m 2021-02-04 17:23:20,697	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58667 [rank=0]
[2m[36m(pid=6767, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7069, ip=172.31.31.247)[0m 2021-02-04 17:25:42,847	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37447 [rank=0]
[2m[36m(pid=6510, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5544, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6720, ip=172.31.31.247)[0m 2021-02-04 17:24:37,763	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37705 [rank=2]
[2m[36m(pid=5834, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6748, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5252, ip=172.31.31.247)[0m 2021-02-04 17:21:20,181	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=0]
[2m[36m(pid=5586, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7082, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6742, ip=172.31.31.247)[0m 2021-02-04 17:24:15,920	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41341 [rank=2]
[2m[36m(pid=6488, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6497, ip=172.31.31.247)[0m 2021-02-04 17:23:27,998	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:35053 [rank=1]
[2m[36m(pid=6747, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6715, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7064, ip=172.31.31.247)[0m 2021-02-04 17:24:48,946	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45649 [rank=1]
[2m[36m(pid=6720, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6486, ip=172.31.31.247)[0m 2021-02-04 17:23:49,831	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:40643 [rank=1]
[2m[36m(pid=7148, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5834, ip=172.31.31.247)[0m 2021-02-04 17:22:23,288	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47929 [rank=0]
[2m[36m(pid=7162, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5543, ip=172.31.31.247)[0m 2021-02-04 17:22:04,336	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37551 [rank=0]
[2m[36m(pid=5552, ip=172.31.31.247)[0m 2021-02-04 17:22:10,201	INFO trainable.py:103 -- Trainable.setup took 15.146 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7148, ip=172.31.31.247)[0m 2021-02-04 17:25:07,118	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41027 [rank=2]
[2m[36m(pid=5536, ip=172.31.31.247)[0m 2021-02-04 17:22:11,062	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47623 [rank=0]
[2m[36m(pid=5542, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6496, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7155, ip=172.31.31.247)[0m 2021-02-04 17:24:59,799	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:47803 [rank=2]
[2m[36m(pid=5587, ip=172.31.31.247)[0m 2021-02-04 17:21:41,915	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:43541 [rank=1]
[2m[36m(pid=5553, ip=172.31.31.247)[0m 2021-02-04 17:21:41,915	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:43541 [rank=2]
[2m[36m(pid=7082, ip=172.31.31.247)[0m 2021-02-04 17:25:35,493	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:49265 [rank=1]
[2m[36m(pid=7084, ip=172.31.31.247)[0m 2021-02-04 17:25:35,492	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:49265 [rank=0]
[2m[36m(pid=5585, ip=172.31.31.247)[0m 2021-02-04 17:21:34,575	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:40689 [rank=1]
[2m[36m(pid=6776, ip=172.31.31.247)[0m 2021-02-04 17:24:03,332	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:51963 [rank=2]
[2m[36m(pid=6747, ip=172.31.31.247)[0m 2021-02-04 17:24:10,375	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:43589 [rank=1]
[2m[36m(pid=6487, ip=172.31.31.247)[0m 2021-02-04 17:23:35,387	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:48141 [rank=2]
[2m[36m(pid=7175, ip=172.31.31.247)[0m 2021-02-04 17:24:59,799	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:47803 [rank=1]
[2m[36m(pid=6723, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6487, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6495, ip=172.31.31.247)[0m 2021-02-04 17:23:35,387	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:48141 [rank=1]
[2m[36m(pid=6715, ip=172.31.31.247)[0m 2021-02-04 17:23:56,734	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:42111 [rank=2]
[2m[36m(pid=5852, ip=172.31.31.247)[0m 2021-02-04 17:22:16,568	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41825 [rank=1]
[2m[36m(pid=6717, ip=172.31.31.247)[0m 2021-02-04 17:24:28,993	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48423 [rank=2]
[2m[36m(pid=5542, ip=172.31.31.247)[0m 2021-02-04 17:22:10,586	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:44829 [rank=2]
[2m[36m(pid=6716, ip=172.31.31.247)[0m 2021-02-04 17:23:56,733	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:42111 [rank=0]
[2m[36m(pid=7065, ip=172.31.31.247)[0m 2021-02-04 17:24:48,945	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45649 [rank=0]
[2m[36m(pid=5537, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6748, ip=172.31.31.247)[0m 2021-02-04 17:24:10,059	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:59085 [rank=2]
[2m[36m(pid=6719, ip=172.31.31.247)[0m 2021-02-04 17:24:09,151	INFO trainable.py:103 -- Trainable.setup took 12.454 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=5586, ip=172.31.31.247)[0m 2021-02-04 17:21:34,575	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:40689 [rank=2]
[2m[36m(pid=5585, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6725, ip=172.31.31.247)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=0]
[2m[36m(pid=6496, ip=172.31.31.247)[0m 2021-02-04 17:23:27,998	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:35053 [rank=0]
[2m[36m(pid=5544, ip=172.31.31.247)[0m 2021-02-04 17:22:04,357	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47479 [rank=0]
[2m[36m(pid=5551, ip=172.31.31.247)[0m 2021-02-04 17:22:08,650	INFO trainable.py:103 -- Trainable.setup took 13.595 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=6776, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7093, ip=172.31.31.247)[0m 2021-02-04 17:25:15,873	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:55935 [rank=1]
[2m[36m(pid=5851, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7091, ip=172.31.31.247)[0m 2021-02-04 17:25:15,872	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:55935 [rank=0]
[2m[36m(pid=5553, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6742, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5537, ip=172.31.31.247)[0m 2021-02-04 17:21:20,182	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=1]
[2m[36m(pid=6741, ip=172.31.31.247)[0m 2021-02-04 17:24:28,103	INFO trainable.py:103 -- Trainable.setup took 12.878 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7175, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5830, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5851, ip=172.31.31.247)[0m 2021-02-04 17:22:17,370	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38443 [rank=0]
[2m[36m(pid=6486, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7065, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6725, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6497, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6488, ip=172.31.31.247)[0m 2021-02-04 17:23:49,831	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:40643 [rank=2]
[2m[36m(pid=5536, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7064, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7162, ip=172.31.31.247)[0m 2021-02-04 17:25:07,117	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41027 [rank=1]
[2m[36m(pid=6767, ip=172.31.31.247)[0m 2021-02-04 17:24:03,331	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:51963 [rank=0]
[2m[36m(pid=7066, ip=172.31.31.247)[0m 2021-02-04 17:25:42,848	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37447 [rank=1]
[2m[36m(pid=5830, ip=172.31.31.247)[0m 2021-02-04 17:22:28,799	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:44279 [rank=2]
[2m[36m(pid=6717, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5831, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5835, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5835, ip=172.31.31.247)[0m 2021-02-04 17:22:22,811	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37165 [rank=0]
[2m[36m(pid=5543, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5587, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6718, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5831, ip=172.31.31.247)[0m 2021-02-04 17:22:29,479	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=2]
[2m[36m(pid=6068, ip=172.31.31.247)[0m 2021-02-04 17:22:35,083	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=6068, ip=172.31.31.247)[0m Traceback (most recent call last):
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 375, in ray._raylet.execute_task
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 400, in load_actor_class
[2m[36m(pid=6068, ip=172.31.31.247)[0m     job_id, actor_creation_function_descriptor)
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 496, in _load_actor_class_from_gcs
[2m[36m(pid=6068, ip=172.31.31.247)[0m     actor_class = pickle.loads(pickled_class)
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/__init__.py", line 1, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.util.sgd.torch import TorchTrainer
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/__init__.py", line 12, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.util.sgd.torch.torch_trainer import (TorchTrainer,
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 13, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune import Trainable
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/__init__.py", line 2, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune.tune import run_experiments, run
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 18, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune.trial_runner import TrialRunner
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 28, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune.web_server import TuneServer
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/web_server.py", line 16, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     import requests  # `requests` is not part of stdlib.
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/requests/__init__.py", line 43, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     import urllib3
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "<frozen importlib._bootstrap>", line 980, in _find_and_load
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "<frozen importlib._bootstrap>", line 148, in __enter__
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "<frozen importlib._bootstrap>", line 174, in _get_module_lock
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=6068, ip=172.31.31.247)[0m     sys.exit(1)
[2m[36m(pid=6068, ip=172.31.31.247)[0m SystemExit: 1
[2m[36m(pid=7093, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6511, ip=172.31.31.247)[0m 2021-02-04 17:23:20,697	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58667 [rank=1]
[2m[36m(pid=7155, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6718, ip=172.31.31.247)[0m 2021-02-04 17:24:37,763	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37705 [rank=1]
[2m[36m(pid=7091, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6723, ip=172.31.31.247)[0m 2021-02-04 17:24:28,993	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48423 [rank=1]
[2m[36m(pid=5252, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6716, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7084, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6495, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5852, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6511, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[33m(raylet, ip=172.31.31.247)[0m [2021-02-04 13:46:32,497 E 258 298] logging.cc:415: *** Aborted at 1612475192 (unix time) try "date -d @1612475192" if you are using GNU date ***
[2m[33m(raylet, ip=172.31.31.247)[0m [2021-02-04 13:46:32,497 E 258 298] logging.cc:415: PC: @                0x0 (unknown)
[2m[36m(pid=2804, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7284, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=2803, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7295, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:26:26,340	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=16539, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.2/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           28 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           23 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           23 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           25 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:26:27,340	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7288, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:26:27,341	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
2021-02-04 17:26:27,345	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=7704, ip=172.31.31.247)[0m 2021-02-04 17:26:27,322	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:52861 [rank=1]
[2m[36m(pid=7705, ip=172.31.31.247)[0m 2021-02-04 17:26:27,322	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:52861 [rank=0]
[2m[36m(pid=7294, ip=172.31.21.209)[0m 2021-02-04 17:26:27,321	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:52861 [rank=2]
[2m[36m(pid=2805, ip=172.31.18.216)[0m 2021-02-04 17:26:28,910	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:58393 [rank=2]
[2m[36m(pid=7704, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7294, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=2805, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7705, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:26:33,288	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7289, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.2/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |           24 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           23 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           25 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=7289, ip=172.31.21.209)[0m 2021-02-04 17:26:33,282	INFO trainable.py:103 -- Trainable.setup took 12.554 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=3017, ip=172.31.18.216)[0m 2021-02-04 17:26:34,099	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48065 [rank=0]
[2m[36m(pid=7396, ip=172.31.21.209)[0m 2021-02-04 17:26:34,096	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48065 [rank=1]
[2m[36m(pid=7435, ip=172.31.21.209)[0m 2021-02-04 17:26:34,096	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48065 [rank=2]
2021-02-04 17:26:34,707	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=16550, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:26:34,708	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
2021-02-04 17:26:34,712	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=3071, ip=172.31.18.216)[0m 2021-02-04 17:26:36,286	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:45747 [rank=2]
[2m[36m(pid=3017, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7396, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7435, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=3071, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:26:39,913	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7756, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 6.9/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |           25 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           25 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           24 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=7756, ip=172.31.31.247)[0m 2021-02-04 17:26:39,909	INFO trainable.py:103 -- Trainable.setup took 11.764 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=3045, ip=172.31.18.216)[0m 2021-02-04 17:26:40,705	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:54763 [rank=2]
[2m[36m(pid=7820, ip=172.31.31.247)[0m 2021-02-04 17:26:40,702	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:54763 [rank=0]
[2m[36m(pid=7821, ip=172.31.31.247)[0m 2021-02-04 17:26:40,703	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:54763 [rank=1]
2021-02-04 17:26:42,087	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7421, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:26:42,088	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
2021-02-04 17:26:42,092	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=7412, ip=172.31.21.209)[0m 2021-02-04 17:26:43,586	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:36205 [rank=0]
[2m[36m(pid=7413, ip=172.31.21.209)[0m 2021-02-04 17:26:43,587	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:36205 [rank=1]
[2m[36m(pid=3045, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7821, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7820, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:26:46,427	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7425, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 5.2/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |           26 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           24 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           26 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=7425, ip=172.31.21.209)[0m 2021-02-04 17:26:46,422	INFO trainable.py:103 -- Trainable.setup took 10.926 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7412, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=3036, ip=172.31.18.216)[0m 2021-02-04 17:26:47,212	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52661 [rank=2]
[2m[36m(pid=3046, ip=172.31.18.216)[0m 2021-02-04 17:26:47,212	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52661 [rank=0]
[2m[36m(pid=7413, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:26:49,262	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7426, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:26:49,263	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
2021-02-04 17:26:49,268	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=3046, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7399, ip=172.31.21.209)[0m 2021-02-04 17:26:50,832	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41233 [rank=0]
[2m[36m(pid=7402, ip=172.31.21.209)[0m 2021-02-04 17:26:50,832	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41233 [rank=1]
[2m[36m(pid=3036, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:26:53,046	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=3044, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 6.9/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           30 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           24 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           26 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           27 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=3044, ip=172.31.18.216)[0m 2021-02-04 17:26:53,045	INFO trainable.py:103 -- Trainable.setup took 10.163 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7399, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7813, ip=172.31.31.247)[0m 2021-02-04 17:26:54,129	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:33657 [rank=0]
[2m[36m(pid=7758, ip=172.31.31.247)[0m 2021-02-04 17:26:54,130	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:33657 [rank=1]
[2m[36m(pid=7402, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:26:56,576	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7411, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:26:56,579	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
2021-02-04 17:26:56,586	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=7758, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=3037, ip=172.31.18.216)[0m 2021-02-04 17:26:58,109	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:37881 [rank=1]
[2m[36m(pid=3028, ip=172.31.18.216)[0m 2021-02-04 17:26:58,109	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:37881 [rank=2]
[2m[36m(pid=7813, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:26:59,872	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=3035, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 6.8/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           25 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           26 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           27 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           31 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=7400, ip=172.31.21.209)[0m 2021-02-04 17:27:00,774	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47955 [rank=0]
[2m[36m(pid=7401, ip=172.31.21.209)[0m 2021-02-04 17:27:00,774	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47955 [rank=1]
[2m[36m(pid=7760, ip=172.31.31.247)[0m 2021-02-04 17:27:00,776	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47955 [rank=2]
[2m[36m(pid=3028, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7400, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7760, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=3037, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7401, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:27:05,688	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7759, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:27:05,689	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 5.7/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           26 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           26 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           27 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           31 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:27:05,694	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:27:06,499	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=16902, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=3027, ip=172.31.18.216)[0m 2021-02-04 17:27:07,189	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:53703 [rank=2]
[2m[36m(pid=3026, ip=172.31.18.216)[0m 2021-02-04 17:27:07,356	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:36341 [rank=0]
[2m[36m(pid=7725, ip=172.31.31.247)[0m 2021-02-04 17:27:07,354	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:36341 [rank=2]
[2m[36m(pid=7757, ip=172.31.31.247)[0m 2021-02-04 17:27:07,354	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:36341 [rank=1]
[2m[36m(pid=3027, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=7757, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7725, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=3026, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:27:12,954	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=16884, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:27:12,957	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           26 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           28 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           31 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           27 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:27:12,991	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:27:13,078	INFO commands.py:441 -- Shutdown i-00f5016b41ab14ccc
2021-02-04 17:27:13,079	INFO command_runner.py:356 -- Fetched IP: 54.68.206.108
2021-02-04 17:27:13,079	INFO log_timer.py:27 -- NodeUpdater: i-00f5016b41ab14ccc: Got IP  [LogTimer=0ms]
Warning: Permanently added '54.68.206.108' (ECDSA) to the list of known hosts.

[2m[36m(pid=3019, ip=172.31.18.216)[0m 2021-02-04 17:27:14,538	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:38389 [rank=2]
[32mStopped all 7 Ray processes.[39m
[0mShared connection to 54.68.206.108 closed.

[2m[36m(pid=3019, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:27:20,046	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 7.081 s, which may be a performance bottleneck.
2021-02-04 17:27:20,047	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
== Status ==
Memory usage on this node: 7.3/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 9/128 CPUs, 9/8 GPUs, 0.0/660.84 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (1 PENDING, 3 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           26 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           28 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           31 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           27 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:27:20,061	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=16901, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:27:20,280	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7723, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:27:20,281	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
2021-02-04 17:27:20,285	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:27:20,318	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:27:20,376	INFO commands.py:441 -- Shutdown i-00f5016b41ab14ccc
2021-02-04 17:27:20,376	INFO command_runner.py:356 -- Fetched IP: 54.68.206.108
2021-02-04 17:27:20,377	INFO log_timer.py:27 -- NodeUpdater: i-00f5016b41ab14ccc: Got IP  [LogTimer=0ms]
Did not find any active Ray processes.
Shared connection to 54.68.206.108 closed.

[2m[36m(pid=7714, ip=172.31.31.247)[0m 2021-02-04 17:27:21,945	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:57171 [rank=0]
[2m[36m(pid=7711, ip=172.31.31.247)[0m 2021-02-04 17:27:21,945	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:57171 [rank=1]
[2m[36m(pid=7714, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7711, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:27:26,564	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 6.273 s, which may be a performance bottleneck.
2021-02-04 17:27:27,701	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7724, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 5.9/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           27 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           27 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           32 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:27:27,737	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:27:27,830	INFO commands.py:441 -- Shutdown i-0ac149179edeecfcd
2021-02-04 17:27:27,831	INFO command_runner.py:356 -- Fetched IP: 34.215.60.186
2021-02-04 17:27:27,831	INFO log_timer.py:27 -- NodeUpdater: i-0ac149179edeecfcd: Got IP  [LogTimer=0ms]
Warning: Permanently added '34.215.60.186' (ECDSA) to the list of known hosts.

[2m[36m(pid=7712, ip=172.31.31.247)[0m 2021-02-04 17:27:28,592	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:59485 [rank=2]
[2m[36m(pid=3345, ip=172.31.18.216)[0m 2021-02-04 17:27:28,594	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:59485 [rank=1]
[2m[36m(pid=3018, ip=172.31.18.216)[0m 2021-02-04 17:27:28,929	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38191 [rank=2]
[2m[36m(pid=8092, ip=172.31.31.247)[0m 2021-02-04 17:27:28,925	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38191 [rank=0]
[32mStopped all 19 Ray processes.[39m
[0mShared connection to 34.215.60.186 closed.

2021-02-04 17:27:35,059	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 7.350 s, which may be a performance bottleneck.
2021-02-04 17:27:35,062	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::NoFaultToleranceTrainable.train_buffered() (pid=7713, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
    self._trainer = self._create_trainer(config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
    trainer = TorchTrainer(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
    self._start_workers(self.max_replicas)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
    address=address, world_size=num_workers))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
2021-02-04 17:27:35,063	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.0/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.31 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           28 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           27 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           32 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:27:35,070	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:27:35,077	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::NoFaultToleranceTrainable.train_buffered() (pid=16875, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
    self._trainer = self._create_trainer(config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
    trainer = TorchTrainer(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
    self._start_workers(self.max_replicas)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
    address=address, world_size=num_workers))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
[2m[36m(pid=8121, ip=172.31.31.247)[0m 2021-02-04 17:27:36,594	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:44109 [rank=2]
[2m[36m(pid=8121, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:27:42,293	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=17187, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.7/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 3/64 CPUs, 3/4 GPUs, 0.0/325.73 GiB heap, 0.0/99.02 GiB objects (0/2.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           27 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           32 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           30 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=8114, ip=172.31.31.247)[0m 2021-02-04 17:27:43,077	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:33593 [rank=0]
[2m[36m(pid=8114, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:27:49,600	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=17180, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:27:49,601	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 3/64 CPUs, 3/4 GPUs, 0.0/325.73 GiB heap, 0.0/99.02 GiB objects (0/2.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           28 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           32 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           30 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:27:49,606	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=5866, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5860, ip=172.31.21.209)[0m 2021-02-04 17:23:42,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=0]
[2m[36m(pid=7284, ip=172.31.21.209)[0m 2021-02-04 17:26:20,576	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53075 [rank=0]
[2m[36m(pid=6346, ip=172.31.21.209)[0m 2021-02-04 17:24:42,241	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48325 [rank=0]
[2m[36m(pid=6336, ip=172.31.21.209)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=1]
[2m[36m(pid=5599, ip=172.31.21.209)[0m 2021-02-04 17:22:51,243	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=2]
[2m[36m(pid=7412, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5621, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7294, ip=172.31.21.209)[0m 2021-02-04 17:26:27,321	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:52861 [rank=2]
[2m[36m(pid=5859, ip=172.31.21.209)[0m 2021-02-04 17:23:43,058	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=5859, ip=172.31.21.209)[0m Traceback (most recent call last):
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 556, in actor_method_executor
[2m[36m(pid=5859, ip=172.31.21.209)[0m     return method(__ray_actor, *args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self.setup(copy.deepcopy(self.config))
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self._trainer = self._create_trainer(config)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
[2m[36m(pid=5859, ip=172.31.21.209)[0m     trainer = TorchTrainer(*args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self._start_workers(self.max_replicas)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self.worker_group.start_workers(num_workers)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
[2m[36m(pid=5859, ip=172.31.21.209)[0m     address=address, world_size=num_workers))
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
[2m[36m(pid=5859, ip=172.31.21.209)[0m     return func(*args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1449, in get
[2m[36m(pid=5859, ip=172.31.21.209)[0m     object_refs, timeout=timeout)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 310, in get_objects
[2m[36m(pid=5859, ip=172.31.21.209)[0m     object_refs, self.current_task_id, timeout_ms)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=5859, ip=172.31.21.209)[0m     sys.exit(1)
[2m[36m(pid=5859, ip=172.31.21.209)[0m SystemExit: 1
[2m[36m(pid=4950, ip=172.31.21.209)[0m 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=2]
[2m[36m(pid=7284, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6604, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4958, ip=172.31.21.209)[0m 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=1]
[2m[36m(pid=7400, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5599, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4959, ip=172.31.21.209)[0m 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=0]
[2m[36m(pid=6346, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6628, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4986, ip=172.31.21.209)[0m 2021-02-04 17:21:20,181	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=2]
[2m[36m(pid=7435, ip=172.31.21.209)[0m 2021-02-04 17:26:34,096	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48065 [rank=2]
[2m[36m(pid=6356, ip=172.31.21.209)[0m 2021-02-04 17:24:48,011	INFO trainable.py:103 -- Trainable.setup took 10.973 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=5622, ip=172.31.21.209)[0m 2021-02-04 17:22:29,478	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=1]
[2m[36m(pid=5612, ip=172.31.21.209)[0m 2021-02-04 17:22:51,243	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=1]
[2m[36m(pid=5613, ip=172.31.21.209)[0m 2021-02-04 17:22:43,728	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=1]
[2m[36m(pid=6627, ip=172.31.21.209)[0m 2021-02-04 17:24:55,541	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53241 [rank=2]
[2m[36m(pid=6627, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5600, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7402, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7399, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4986, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7399, ip=172.31.21.209)[0m 2021-02-04 17:26:50,832	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41233 [rank=0]
[2m[36m(pid=5621, ip=172.31.21.209)[0m 2021-02-04 17:22:29,477	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=0]
[2m[36m(pid=4958, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4957, ip=172.31.21.209)[0m 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=1]
[2m[36m(pid=5614, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7425, ip=172.31.21.209)[0m 2021-02-04 17:26:46,422	INFO trainable.py:103 -- Trainable.setup took 10.926 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7105, ip=172.31.21.209)[0m 2021-02-04 17:26:01,329	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:53995 [rank=2]
[2m[36m(pid=5844, ip=172.31.21.209)[0m 2021-02-04 17:23:06,156	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=1]
[2m[36m(pid=7289, ip=172.31.21.209)[0m 2021-02-04 17:26:33,282	INFO trainable.py:103 -- Trainable.setup took 12.554 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=4959, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7396, ip=172.31.21.209)[0m 2021-02-04 17:26:34,096	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48065 [rank=1]
[2m[36m(pid=5601, ip=172.31.21.209)[0m 2021-02-04 17:22:58,580	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:51469 [rank=2]
[2m[36m(pid=6338, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5845, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5601, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6587, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6354, ip=172.31.21.209)[0m 2021-02-04 17:24:30,443	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:38699 [rank=1]
[2m[36m(pid=6354, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7118, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7118, ip=172.31.21.209)[0m 2021-02-04 17:26:07,208	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:38301 [rank=2]
[2m[36m(pid=7112, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5866, ip=172.31.21.209)[0m 2021-02-04 17:23:13,086	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=1]
[2m[36m(pid=6628, ip=172.31.21.209)[0m 2021-02-04 17:25:09,226	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41665 [rank=0]
[2m[36m(pid=6399, ip=172.31.21.209)[0m 2021-02-04 17:24:41,404	INFO trainable.py:103 -- Trainable.setup took 12.054 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7106, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6338, ip=172.31.21.209)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=2]
[2m[36m(pid=5614, ip=172.31.21.209)[0m 2021-02-04 17:22:43,728	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=0]
[2m[36m(pid=4961, ip=172.31.21.209)[0m 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=1]
[2m[36m(pid=6604, ip=172.31.21.209)[0m 2021-02-04 17:25:09,226	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41665 [rank=1]
[2m[36m(pid=7401, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4960, ip=172.31.21.209)[0m 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=0]
[2m[36m(pid=4960, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6355, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5844, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4961, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7412, ip=172.31.21.209)[0m 2021-02-04 17:26:43,586	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:36205 [rank=0]
[2m[36m(pid=7413, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5845, ip=172.31.21.209)[0m 2021-02-04 17:23:06,156	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=0]
[2m[36m(pid=6336, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4987, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7295, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6355, ip=172.31.21.209)[0m 2021-02-04 17:24:30,443	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:38699 [rank=0]
[2m[36m(pid=6587, ip=172.31.21.209)[0m 2021-02-04 17:24:55,540	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53241 [rank=0]
[2m[36m(pid=5865, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7111, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5600, ip=172.31.21.209)[0m 2021-02-04 17:22:58,580	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:51469 [rank=1]
[2m[36m(pid=7402, ip=172.31.21.209)[0m 2021-02-04 17:26:50,832	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41233 [rank=1]
[2m[36m(pid=7295, ip=172.31.21.209)[0m 2021-02-04 17:26:21,483	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:45427 [rank=0]
[2m[36m(pid=7396, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7413, ip=172.31.21.209)[0m 2021-02-04 17:26:43,587	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:36205 [rank=1]
[2m[36m(pid=5622, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7147, ip=172.31.21.209)[0m 2021-02-04 17:26:19,376	INFO trainable.py:103 -- Trainable.setup took 12.935 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7400, ip=172.31.21.209)[0m 2021-02-04 17:27:00,774	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47955 [rank=0]
[2m[36m(pid=6347, ip=172.31.21.209)[0m 2021-02-04 17:24:42,262	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48325 [rank=1]
[2m[36m(pid=7105, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7401, ip=172.31.21.209)[0m 2021-02-04 17:27:00,774	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47955 [rank=1]
[2m[36m(pid=7435, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5612, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7106, ip=172.31.21.209)[0m 2021-02-04 17:26:01,338	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:57769 [rank=2]
[2m[36m(pid=5848, ip=172.31.21.209)[0m 2021-02-04 17:23:42,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=1]
[2m[36m(pid=5613, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4987, ip=172.31.21.209)[0m 2021-02-04 17:21:19,946	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60985 [rank=2]
[2m[36m(pid=6345, ip=172.31.21.209)[0m 2021-02-04 17:24:48,944	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45649 [rank=2]
[2m[36m(pid=7112, ip=172.31.21.209)[0m 2021-02-04 17:26:13,433	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:54249 [rank=0]
[2m[36m(pid=7294, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4948, ip=172.31.21.209)[0m 2021-02-04 17:21:46,835	INFO trainable.py:103 -- Trainable.setup took 13.044 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=6347, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7111, ip=172.31.21.209)[0m 2021-02-04 17:26:14,173	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:45755 [rank=0]
[2m[36m(pid=5865, ip=172.31.21.209)[0m 2021-02-04 17:23:13,086	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=0]
[2m[36m(pid=6345, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:23:43,099 E 5547 5587] logging.cc:415: *** Aborted at 1612488223 (unix time) try "date -d @1612488223" if you are using GNU date ***
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:23:43,100 E 5547 5587] logging.cc:415: PC: @                0x0 (unknown)
[2m[36m(pid=7944, ip=172.31.21.209)[0m 2021-02-04 17:27:52,810	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45685 [rank=2]
[2m[36m(pid=8097, ip=172.31.31.247)[0m 2021-02-04 17:27:52,811	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45685 [rank=1]
[2m[36m(pid=8102, ip=172.31.31.247)[0m 2021-02-04 17:27:52,811	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45685 [rank=0]
[2m[36m(pid=7943, ip=172.31.21.209)[0m 2021-02-04 17:27:52,831	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:39101 [rank=2]
[2m[36m(pid=7944, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=8097, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7943, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=8102, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:27:58,548	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=8103, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:27:58,551	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           32 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00003 |           30 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:27:58,559	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:27:58,623	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=17136, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=7955, ip=172.31.21.209)[0m 2021-02-04 17:28:00,102	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:36377 [rank=2]
[2m[36m(pid=8096, ip=172.31.31.247)[0m 2021-02-04 17:28:00,463	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:44217 [rank=1]
[2m[36m(pid=8259, ip=172.31.31.247)[0m 2021-02-04 17:28:00,463	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:44217 [rank=2]
[2m[36m(pid=7956, ip=172.31.21.209)[0m 2021-02-04 17:28:00,461	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:44217 [rank=0]
[2m[36m(pid=7955, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=8096, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7956, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=8259, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:28:05,890	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=17135, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:28:05,891	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.2/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           31 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           33 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:28:05,923	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:28:06,034	INFO commands.py:441 -- Shutdown i-0ac149179edeecfcd
2021-02-04 17:28:06,035	INFO command_runner.py:356 -- Fetched IP: 34.215.60.186
2021-02-04 17:28:06,035	INFO log_timer.py:27 -- NodeUpdater: i-0ac149179edeecfcd: Got IP  [LogTimer=0ms]
Warning: Permanently added '34.215.60.186' (ECDSA) to the list of known hosts.

[2m[36m(pid=7950, ip=172.31.21.209)[0m 2021-02-04 17:28:07,727	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41425 [rank=0]
Did not find any active Ray processes.
Shared connection to 34.215.60.186 closed.

[2m[36m(pid=3046, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1724, ip=172.31.18.216)[0m 2021-02-04 17:24:02,426	INFO trainable.py:103 -- Trainable.setup took 13.344 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=1008, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=341, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=3037, ip=172.31.18.216)[0m 2021-02-04 17:26:58,109	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:37881 [rank=1]
[2m[36m(pid=2797, ip=172.31.18.216)[0m 2021-02-04 17:26:01,333	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:53995 [rank=1]
[2m[36m(pid=3036, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2239, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2240, ip=172.31.18.216)[0m 2021-02-04 17:25:07,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41027 [rank=0]
[2m[36m(pid=1237, ip=172.31.18.216)[0m 2021-02-04 17:22:23,292	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47929 [rank=1]
[2m[36m(pid=341, ip=172.31.18.216)[0m 2021-02-04 17:20:09,995	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59171 [rank=1]
[2m[36m(pid=2555, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=3017, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2012, ip=172.31.18.216)[0m 2021-02-04 17:23:57,470	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:34483 [rank=1]
[2m[36m(pid=2804, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1705, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=336, ip=172.31.18.216)[0m 2021-02-04 17:20:15,887	INFO trainable.py:103 -- Trainable.setup took 10.187 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=2840, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1461, ip=172.31.18.216)[0m 2021-02-04 17:23:13,090	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=2]
[2m[36m(pid=2554, ip=172.31.18.216)[0m 2021-02-04 17:25:42,850	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37447 [rank=2]
[2m[36m(pid=1197, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1467, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2579, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2827, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2275, ip=172.31.18.216)[0m 2021-02-04 17:24:37,765	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37705 [rank=0]
[2m[36m(pid=631, ip=172.31.18.216)[0m 2021-02-04 17:21:27,788	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=631, ip=172.31.18.216)[0m Traceback (most recent call last):
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 556, in actor_method_executor
[2m[36m(pid=631, ip=172.31.18.216)[0m     return method(__ray_actor, *args, **kwargs)
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
[2m[36m(pid=631, ip=172.31.18.216)[0m     self.setup(copy.deepcopy(self.config))
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
[2m[36m(pid=631, ip=172.31.18.216)[0m     self._trainer = self._create_trainer(config)
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
[2m[36m(pid=631, ip=172.31.18.216)[0m     trainer = TorchTrainer(*args, **kwargs)
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
[2m[36m(pid=631, ip=172.31.18.216)[0m     self._start_workers(self.max_replicas)
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
[2m[36m(pid=631, ip=172.31.18.216)[0m     self.worker_group.start_workers(num_workers)
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
[2m[36m(pid=631, ip=172.31.18.216)[0m     address=address, world_size=num_workers))
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
[2m[36m(pid=631, ip=172.31.18.216)[0m     return func(*args, **kwargs)
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1449, in get
[2m[36m(pid=631, ip=172.31.18.216)[0m     object_refs, timeout=timeout)
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 310, in get_objects
[2m[36m(pid=631, ip=172.31.18.216)[0m     object_refs, self.current_task_id, timeout_ms)
[2m[36m(pid=631, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=631, ip=172.31.18.216)[0m     sys.exit(1)
[2m[36m(pid=631, ip=172.31.18.216)[0m SystemExit: 1
[2m[36m(pid=1462, ip=172.31.18.216)[0m 2021-02-04 17:23:06,160	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=2]
[2m[36m(pid=322, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1003, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=640, ip=172.31.18.216)[0m 2021-02-04 17:21:05,397	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37325 [rank=1]
[2m[36m(pid=2025, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2812, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1771, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2564, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1713, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1490, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=639, ip=172.31.18.216)[0m 2021-02-04 17:21:05,397	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37325 [rank=0]
[2m[36m(pid=2797, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=321, ip=172.31.18.216)[0m 2021-02-04 17:20:17,059	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48855 [rank=2]
[2m[36m(pid=2562, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1453, ip=172.31.18.216)[0m 2021-02-04 17:22:51,270	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:57991 [rank=2]
[2m[36m(pid=322, ip=172.31.18.216)[0m 2021-02-04 17:20:23,821	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:35023 [rank=1]
[2m[36m(pid=1705, ip=172.31.18.216)[0m 2021-02-04 17:23:57,469	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:34483 [rank=0]
[2m[36m(pid=2578, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2840, ip=172.31.18.216)[0m 2021-02-04 17:26:07,212	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:38301 [rank=0]
[2m[36m(pid=1038, ip=172.31.18.216)[0m 2021-02-04 17:22:10,588	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:44829 [rank=1]
[2m[36m(pid=2048, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=3028, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2241, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1015, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2564, ip=172.31.18.216)[0m 2021-02-04 17:25:22,117	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52873 [rank=0]
[2m[36m(pid=330, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2798, ip=172.31.18.216)[0m 2021-02-04 17:26:01,342	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:57769 [rank=1]
[2m[36m(pid=2253, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1755, ip=172.31.18.216)[0m 2021-02-04 17:23:28,357	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:47463 [rank=1]
[2m[36m(pid=2549, ip=172.31.18.216)[0m 2021-02-04 17:25:09,230	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41665 [rank=2]
[2m[36m(pid=1008, ip=172.31.18.216)[0m 2021-02-04 17:22:17,373	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38443 [rank=1]
[2m[36m(pid=3044, ip=172.31.18.216)[0m 2021-02-04 17:26:53,045	INFO trainable.py:103 -- Trainable.setup took 10.163 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=624, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=330, ip=172.31.18.216)[0m 2021-02-04 17:20:17,059	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48855 [rank=1]
[2m[36m(pid=2253, ip=172.31.18.216)[0m 2021-02-04 17:24:52,438	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:37355 [rank=2]
[2m[36m(pid=1038, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1771, ip=172.31.18.216)[0m 2021-02-04 17:23:28,357	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:47463 [rank=2]
[2m[36m(pid=1461, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1009, ip=172.31.18.216)[0m 2021-02-04 17:22:16,570	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41825 [rank=0]
[2m[36m(pid=320, ip=172.31.18.216)[0m 2021-02-04 17:20:23,821	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:35023 [rank=2]
[2m[36m(pid=1195, ip=172.31.18.216)[0m 2021-02-04 17:22:22,814	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37165 [rank=2]
[2m[36m(pid=1468, ip=172.31.18.216)[0m 2021-02-04 17:22:58,546	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:58885 [rank=1]
[2m[36m(pid=2266, ip=172.31.18.216)[0m 2021-02-04 17:24:42,246	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48325 [rank=2]
[2m[36m(pid=2242, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1237, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1198, ip=172.31.18.216)[0m 2021-02-04 17:22:51,246	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=0]
[2m[36m(pid=381, ip=172.31.18.216)[0m 2021-02-04 17:20:04,945	INFO trainable.py:103 -- Trainable.setup took 13.263 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=312, ip=172.31.18.216)[0m Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /home/ray/data/cifar-10-python.tar.gz
[2m[36m(pid=312, ip=172.31.18.216)[0m Extracting /home/ray/data/cifar-10-python.tar.gz to /home/ray/data
[2m[36m(pid=1002, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1715, ip=172.31.18.216)[0m 2021-02-04 17:23:49,780	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:45265 [rank=2]
[2m[36m(pid=2252, ip=172.31.18.216)[0m 2021-02-04 17:24:45,160	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52591 [rank=0]
[2m[36m(pid=1706, ip=172.31.18.216)[0m 2021-02-04 17:23:20,715	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:42641 [rank=1]
[2m[36m(pid=1195, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2827, ip=172.31.18.216)[0m 2021-02-04 17:26:13,438	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:54249 [rank=1]
[2m[36m(pid=2812, ip=172.31.18.216)[0m 2021-02-04 17:26:13,438	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:54249 [rank=2]
[2m[36m(pid=3028, ip=172.31.18.216)[0m 2021-02-04 17:26:58,109	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:37881 [rank=2]
[2m[36m(pid=1757, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=3019, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=3017, ip=172.31.18.216)[0m 2021-02-04 17:26:34,099	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48065 [rank=0]
[2m[36m(pid=1755, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=3026, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1003, ip=172.31.18.216)[0m 2021-02-04 17:22:04,340	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37551 [rank=2]
[2m[36m(pid=1009, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=3019, ip=172.31.18.216)[0m 2021-02-04 17:27:14,538	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:38389 [rank=2]
[2m[36m(pid=2236, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2578, ip=172.31.18.216)[0m 2021-02-04 17:25:14,817	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:52791 [rank=2]
[2m[36m(pid=2553, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1212, ip=172.31.18.216)[0m 2021-02-04 17:22:34,935	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:48577 [rank=2]
[2m[36m(pid=1452, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=313, ip=172.31.18.216)[0m 2021-02-04 17:20:37,160	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:55527 [rank=1]
[2m[36m(pid=2239, ip=172.31.18.216)[0m 2021-02-04 17:24:55,545	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53241 [rank=1]
[2m[36m(pid=1704, ip=172.31.18.216)[0m 2021-02-04 17:23:20,715	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:42641 [rank=0]
[2m[36m(pid=2018, ip=172.31.18.216)[0m 2021-02-04 17:24:34,812	INFO trainable.py:103 -- Trainable.setup took 12.710 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=1462, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1490, ip=172.31.18.216)[0m 2021-02-04 17:22:58,546	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:58885 [rank=0]
[2m[36m(pid=2048, ip=172.31.18.216)[0m 2021-02-04 17:24:10,377	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:43589 [rank=0]
[2m[36m(pid=2240, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2803, ip=172.31.18.216)[0m 2021-02-04 17:26:20,581	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53075 [rank=1]
[2m[36m(pid=311, ip=172.31.18.216)[0m 2021-02-04 17:20:37,160	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:55527 [rank=0]
[2m[36m(pid=2013, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=3027, ip=172.31.18.216)[0m 2021-02-04 17:27:07,189	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:53703 [rank=2]
[2m[36m(pid=2553, ip=172.31.18.216)[0m 2021-02-04 17:25:35,561	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:41029 [rank=2]
[2m[36m(pid=2013, ip=172.31.18.216)[0m 2021-02-04 17:24:28,994	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48423 [rank=0]
[2m[36m(pid=1220, ip=172.31.18.216)[0m 2021-02-04 17:22:28,801	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:44279 [rank=0]
[2m[36m(pid=1722, ip=172.31.18.216)[0m 2021-02-04 17:23:42,628	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=2]
[2m[36m(pid=2549, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=3045, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=3037, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=639, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2805, ip=172.31.18.216)[0m 2021-02-04 17:26:28,910	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:58393 [rank=2]
[2m[36m(pid=2275, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1452, ip=172.31.18.216)[0m 2021-02-04 17:23:13,113	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:49555 [rank=0]
[2m[36m(pid=1212, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=335, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2266, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2798, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2241, ip=172.31.18.216)[0m 2021-02-04 17:24:59,800	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:47803 [rank=0]
[2m[36m(pid=640, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1706, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1723, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2274, ip=172.31.18.216)[0m 2021-02-04 17:24:35,618	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:48397 [rank=2]
[2m[36m(pid=3045, ip=172.31.18.216)[0m 2021-02-04 17:26:40,705	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:54763 [rank=2]
[2m[36m(pid=2242, ip=172.31.18.216)[0m 2021-02-04 17:25:02,157	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:39265 [rank=2]
[2m[36m(pid=3071, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=311, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2012, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1201, ip=172.31.18.216)[0m 2021-02-04 17:22:43,727	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:51383 [rank=0]
[2m[36m(pid=2803, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1713, ip=172.31.18.216)[0m 2021-02-04 17:23:49,833	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:40643 [rank=0]
[2m[36m(pid=1453, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2019, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2049, ip=172.31.18.216)[0m 2021-02-04 17:24:15,803	INFO trainable.py:103 -- Trainable.setup took 11.822 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=3345, ip=172.31.18.216)[0m 2021-02-04 17:27:28,594	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:59485 [rank=1]
[2m[36m(pid=2579, ip=172.31.18.216)[0m 2021-02-04 17:25:15,875	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:55935 [rank=2]
[2m[36m(pid=3027, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1468, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2555, ip=172.31.18.216)[0m 2021-02-04 17:25:42,871	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:39763 [rank=2]
[2m[36m(pid=1015, ip=172.31.18.216)[0m 2021-02-04 17:22:11,065	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47623 [rank=1]
[2m[36m(pid=2050, ip=172.31.18.216)[0m 2021-02-04 17:24:10,062	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:59085 [rank=1]
[2m[36m(pid=320, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=624, ip=172.31.18.216)[0m 2021-02-04 17:20:44,016	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:48333 [rank=2]
[2m[36m(pid=2252, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=312, ip=172.31.18.216)[0m 2021-02-04 17:19:51,670	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:33939 [rank=0]
[2m[36m(pid=312, ip=172.31.18.216)[0m 
170500096it [00:13, 12393845.48it/s]
[2m[36m(pid=2804, ip=172.31.18.216)[0m 2021-02-04 17:26:20,581	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53075 [rank=2]
[2m[36m(pid=1220, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=3018, ip=172.31.18.216)[0m 2021-02-04 17:27:28,929	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38191 [rank=2]
[2m[36m(pid=2019, ip=172.31.18.216)[0m 2021-02-04 17:24:22,826	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:41643 [rank=2]
[2m[36m(pid=1723, ip=172.31.18.216)[0m 2021-02-04 17:23:35,388	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52515 [rank=1]
[2m[36m(pid=2274, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1715, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1002, ip=172.31.18.216)[0m 2021-02-04 17:22:04,360	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47479 [rank=2]
[2m[36m(pid=1704, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=308, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=335, ip=172.31.18.216)[0m 2021-02-04 17:20:09,995	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59171 [rank=2]
[2m[36m(pid=1197, ip=172.31.18.216)[0m 2021-02-04 17:22:43,732	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=2]
[2m[36m(pid=2025, ip=172.31.18.216)[0m 2021-02-04 17:24:15,922	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41341 [rank=0]
[2m[36m(pid=1201, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=321, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=1757, ip=172.31.18.216)[0m 2021-02-04 17:23:35,387	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52515 [rank=0]
[2m[36m(pid=3036, ip=172.31.18.216)[0m 2021-02-04 17:26:47,212	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52661 [rank=2]
[2m[36m(pid=2805, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=630, ip=172.31.18.216)[0m 2021-02-04 17:21:27,467	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:58629 [rank=1]
[2m[36m(pid=308, ip=172.31.18.216)[0m 2021-02-04 17:19:51,671	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:33939 [rank=2]
[2m[36m(pid=3026, ip=172.31.18.216)[0m 2021-02-04 17:27:07,356	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:36341 [rank=0]
[2m[36m(pid=1198, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=3046, ip=172.31.18.216)[0m 2021-02-04 17:26:47,212	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52661 [rank=0]
[2m[36m(pid=2050, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=3071, ip=172.31.18.216)[0m 2021-02-04 17:26:36,286	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:45747 [rank=2]
[2m[36m(pid=313, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=2236, ip=172.31.18.216)[0m 2021-02-04 17:24:30,447	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:38699 [rank=2]
[2m[36m(pid=2562, ip=172.31.18.216)[0m 2021-02-04 17:25:35,561	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:41029 [rank=1]
[2m[36m(pid=629, ip=172.31.18.216)[0m 2021-02-04 17:21:27,467	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:58629 [rank=0]
[2m[36m(pid=1467, ip=172.31.18.216)[0m 2021-02-04 17:23:05,823	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:54089 [rank=2]
[2m[36m(pid=7950, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:28:12,863	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 6.966 s, which may be a performance bottleneck.
2021-02-04 17:28:12,866	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 9/128 CPUs, 9/8 GPUs, 0.0/660.79 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (1 PENDING, 3 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           31 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           33 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:28:12,883	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7981, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:28:12,910	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:28:12,970	INFO commands.py:441 -- Shutdown i-00f5016b41ab14ccc
2021-02-04 17:28:12,970	INFO command_runner.py:356 -- Fetched IP: 54.68.206.108
2021-02-04 17:28:12,970	INFO log_timer.py:27 -- NodeUpdater: i-00f5016b41ab14ccc: Got IP  [LogTimer=0ms]
Warning: Permanently added '54.68.206.108' (ECDSA) to the list of known hosts.

Stopped all 9 Ray processes.
(pid=3805, ip=172.31.18.216) 2021-02-04 17:28:14,775	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44125 [rank=2]
(pid=8269, ip=172.31.31.247) 2021-02-04 17:28:14,772	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44125 [rank=1]
(pid=8275, ip=172.31.31.247) 2021-02-04 17:28:14,772	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44125 [rank=0]
Shared connection to 54.68.206.108 closed.

(pid=8275, ip=172.31.31.247) Files already downloaded and verified
(pid=3805, ip=172.31.18.216) Files already downloaded and verified
(pid=8269, ip=172.31.31.247) Files already downloaded and verified
2021-02-04 17:28:20,097	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 7.213 s, which may be a performance bottleneck.
2021-02-04 17:28:20,098	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=7949, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:28:20,100	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 3.7/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.79 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |           32 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           29 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           33 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           30 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:28:20,106	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:28:20,471	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=8274, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:28:20,498	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:28:20,565	INFO commands.py:441 -- Shutdown i-00f5016b41ab14ccc
2021-02-04 17:28:20,565	INFO command_runner.py:356 -- Fetched IP: 54.68.206.108
2021-02-04 17:28:20,565	INFO log_timer.py:27 -- NodeUpdater: i-00f5016b41ab14ccc: Got IP  [LogTimer=0ms]
(pid=8260, ip=172.31.31.247) 2021-02-04 17:28:21,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:38561 [rank=2]
Did not find any active Ray processes.
(pid=3833, ip=172.31.18.216) 2021-02-04 17:28:21,665	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44301 [rank=2]
(pid=3834, ip=172.31.18.216) 2021-02-04 17:28:21,665	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44301 [rank=1]
(pid=8261, ip=172.31.31.247) 2021-02-04 17:28:21,662	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44301 [rank=0]
Shared connection to 54.68.206.108 closed.

(pid=8260, ip=172.31.31.247) Files already downloaded and verified
(pid=3834, ip=172.31.18.216) Files already downloaded and verified
(pid=8261, ip=172.31.31.247) Files already downloaded and verified
(pid=3833, ip=172.31.18.216) Files already downloaded and verified
2021-02-04 17:28:26,752	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 6.280 s, which may be a performance bottleneck.
2021-02-04 17:28:27,290	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=17534, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:28:27,292	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.3/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |           32 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           34 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           30 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           30 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:28:27,328	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:28:27,389	INFO commands.py:441 -- Shutdown i-00f5016b41ab14ccc
2021-02-04 17:28:27,390	INFO command_runner.py:356 -- Fetched IP: 54.68.206.108
2021-02-04 17:28:27,390	INFO log_timer.py:27 -- NodeUpdater: i-00f5016b41ab14ccc: Got IP  [LogTimer=0ms]
Did not find any active Ray processes.
Shared connection to 54.68.206.108 closed.

(pid=8431, ip=172.31.31.247) 2021-02-04 17:28:29,136	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44843 [rank=0]
(pid=8431, ip=172.31.31.247) Files already downloaded and verified
2021-02-04 17:28:33,582	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 6.280 s, which may be a performance bottleneck.
2021-02-04 17:28:33,582	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
== Status ==
Memory usage on this node: 6.9/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 9/96 CPUs, 9/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (1 PENDING, 3 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |           32 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           34 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           30 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           30 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:28:33,619	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:28:33,679	INFO commands.py:441 -- Shutdown i-083b602e902a78a09
2021-02-04 17:28:33,679	INFO command_runner.py:356 -- Fetched IP: 34.218.250.17
2021-02-04 17:28:33,679	INFO log_timer.py:27 -- NodeUpdater: i-083b602e902a78a09: Got IP  [LogTimer=0ms]
Warning: Permanently added '34.218.250.17' (ECDSA) to the list of known hosts.

Stopped all 16 Ray processes.
Shared connection to 34.218.250.17 closed.

2021-02-04 17:28:39,826	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::NoFaultToleranceTrainable.train_buffered() (pid=3832, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:28:40,784	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 7.190 s, which may be a performance bottleneck.
2021-02-04 17:28:40,789	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=8268, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.2/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/64 CPUs, 6/4 GPUs, 0.0/325.73 GiB heap, 0.0/99.02 GiB objects (0/2.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           34 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           30 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           30 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           33 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:28:40,796	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=3832, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
(pid=3832, ip=172.31.18.216) 2021-02-04 17:28:40,832	WARNING worker_group.py:359 -- Failed to shutdown gracefully, forcing a shutdown.
(pid=3819, ip=172.31.18.216) 2021-02-04 17:28:41,665	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:45863 [rank=1]
(pid=3817, ip=172.31.18.216) 2021-02-04 17:28:41,664	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:45863 [rank=0]
(pid=3817, ip=172.31.18.216) Files already downloaded and verified
(pid=3819, ip=172.31.18.216) Files already downloaded and verified
2021-02-04 17:28:47,381	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=3818, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:28:47,382	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 5.3/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 3/64 CPUs, 3/4 GPUs, 0.0/325.73 GiB heap, 0.0/99.02 GiB objects (0/2.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           31 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           30 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           33 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           35 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:28:47,388	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=3818, ip=172.31.18.216)[0m 2021-02-04 17:28:47,365	INFO trainable.py:103 -- Trainable.setup took 13.030 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=3809, ip=172.31.18.216)[0m 2021-02-04 17:28:49,198	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:44905 [rank=2]
[2m[36m(pid=3809, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=5866, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5860, ip=172.31.21.209)[0m 2021-02-04 17:23:42,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=0]
[2m[36m(pid=7284, ip=172.31.21.209)[0m 2021-02-04 17:26:20,576	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53075 [rank=0]
[2m[36m(pid=6346, ip=172.31.21.209)[0m 2021-02-04 17:24:42,241	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48325 [rank=0]
[2m[36m(pid=6336, ip=172.31.21.209)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=1]
[2m[36m(pid=5599, ip=172.31.21.209)[0m 2021-02-04 17:22:51,243	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=2]
[2m[36m(pid=7412, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5621, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7294, ip=172.31.21.209)[0m 2021-02-04 17:26:27,321	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:52861 [rank=2]
[2m[36m(pid=5859, ip=172.31.21.209)[0m 2021-02-04 17:23:43,058	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=5859, ip=172.31.21.209)[0m Traceback (most recent call last):
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 556, in actor_method_executor
[2m[36m(pid=5859, ip=172.31.21.209)[0m     return method(__ray_actor, *args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self.setup(copy.deepcopy(self.config))
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self._trainer = self._create_trainer(config)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
[2m[36m(pid=5859, ip=172.31.21.209)[0m     trainer = TorchTrainer(*args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self._start_workers(self.max_replicas)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self.worker_group.start_workers(num_workers)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
[2m[36m(pid=5859, ip=172.31.21.209)[0m     address=address, world_size=num_workers))
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
[2m[36m(pid=5859, ip=172.31.21.209)[0m     return func(*args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1449, in get
[2m[36m(pid=5859, ip=172.31.21.209)[0m     object_refs, timeout=timeout)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 310, in get_objects
[2m[36m(pid=5859, ip=172.31.21.209)[0m     object_refs, self.current_task_id, timeout_ms)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=5859, ip=172.31.21.209)[0m     sys.exit(1)
[2m[36m(pid=5859, ip=172.31.21.209)[0m SystemExit: 1
[2m[36m(pid=4950, ip=172.31.21.209)[0m 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=2]
[2m[36m(pid=7284, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6604, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7956, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4958, ip=172.31.21.209)[0m 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=1]
[2m[36m(pid=7400, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5599, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4959, ip=172.31.21.209)[0m 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=0]
[2m[36m(pid=6346, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7956, ip=172.31.21.209)[0m 2021-02-04 17:28:00,461	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:44217 [rank=0]
[2m[36m(pid=6628, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4986, ip=172.31.21.209)[0m 2021-02-04 17:21:20,181	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=2]
[2m[36m(pid=7435, ip=172.31.21.209)[0m 2021-02-04 17:26:34,096	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48065 [rank=2]
[2m[36m(pid=6356, ip=172.31.21.209)[0m 2021-02-04 17:24:48,011	INFO trainable.py:103 -- Trainable.setup took 10.973 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=5622, ip=172.31.21.209)[0m 2021-02-04 17:22:29,478	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=1]
[2m[36m(pid=5612, ip=172.31.21.209)[0m 2021-02-04 17:22:51,243	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=1]
[2m[36m(pid=5613, ip=172.31.21.209)[0m 2021-02-04 17:22:43,728	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=1]
[2m[36m(pid=6627, ip=172.31.21.209)[0m 2021-02-04 17:24:55,541	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53241 [rank=2]
[2m[36m(pid=6627, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5600, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7402, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7399, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7944, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4986, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7399, ip=172.31.21.209)[0m 2021-02-04 17:26:50,832	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41233 [rank=0]
[2m[36m(pid=5621, ip=172.31.21.209)[0m 2021-02-04 17:22:29,477	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=0]
[2m[36m(pid=4958, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4957, ip=172.31.21.209)[0m 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=1]
[2m[36m(pid=5614, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7425, ip=172.31.21.209)[0m 2021-02-04 17:26:46,422	INFO trainable.py:103 -- Trainable.setup took 10.926 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7105, ip=172.31.21.209)[0m 2021-02-04 17:26:01,329	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:53995 [rank=2]
[2m[36m(pid=5844, ip=172.31.21.209)[0m 2021-02-04 17:23:06,156	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=1]
[2m[36m(pid=7289, ip=172.31.21.209)[0m 2021-02-04 17:26:33,282	INFO trainable.py:103 -- Trainable.setup took 12.554 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=4959, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7396, ip=172.31.21.209)[0m 2021-02-04 17:26:34,096	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48065 [rank=1]
[2m[36m(pid=5601, ip=172.31.21.209)[0m 2021-02-04 17:22:58,580	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:51469 [rank=2]
[2m[36m(pid=6338, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5845, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5601, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6587, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6354, ip=172.31.21.209)[0m 2021-02-04 17:24:30,443	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:38699 [rank=1]
[2m[36m(pid=6354, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7118, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7118, ip=172.31.21.209)[0m 2021-02-04 17:26:07,208	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:38301 [rank=2]
[2m[36m(pid=7112, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5866, ip=172.31.21.209)[0m 2021-02-04 17:23:13,086	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=1]
[2m[36m(pid=6628, ip=172.31.21.209)[0m 2021-02-04 17:25:09,226	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41665 [rank=0]
[2m[36m(pid=6399, ip=172.31.21.209)[0m 2021-02-04 17:24:41,404	INFO trainable.py:103 -- Trainable.setup took 12.054 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7106, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6338, ip=172.31.21.209)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=2]
[2m[36m(pid=5614, ip=172.31.21.209)[0m 2021-02-04 17:22:43,728	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=0]
[2m[36m(pid=4961, ip=172.31.21.209)[0m 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=1]
[2m[36m(pid=6604, ip=172.31.21.209)[0m 2021-02-04 17:25:09,226	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41665 [rank=1]
[2m[36m(pid=7401, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7955, ip=172.31.21.209)[0m 2021-02-04 17:28:00,102	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:36377 [rank=2]
[2m[36m(pid=4960, ip=172.31.21.209)[0m 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=0]
[2m[36m(pid=4960, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6355, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5844, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4961, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7950, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7412, ip=172.31.21.209)[0m 2021-02-04 17:26:43,586	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:36205 [rank=0]
[2m[36m(pid=7413, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5845, ip=172.31.21.209)[0m 2021-02-04 17:23:06,156	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=0]
[2m[36m(pid=6336, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4987, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7295, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6355, ip=172.31.21.209)[0m 2021-02-04 17:24:30,443	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:38699 [rank=0]
[2m[36m(pid=6587, ip=172.31.21.209)[0m 2021-02-04 17:24:55,540	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53241 [rank=0]
[2m[36m(pid=7955, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5865, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7111, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5600, ip=172.31.21.209)[0m 2021-02-04 17:22:58,580	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:51469 [rank=1]
[2m[36m(pid=7402, ip=172.31.21.209)[0m 2021-02-04 17:26:50,832	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41233 [rank=1]
[2m[36m(pid=7295, ip=172.31.21.209)[0m 2021-02-04 17:26:21,483	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:45427 [rank=0]
[2m[36m(pid=7396, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7413, ip=172.31.21.209)[0m 2021-02-04 17:26:43,587	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:36205 [rank=1]
[2m[36m(pid=5622, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7147, ip=172.31.21.209)[0m 2021-02-04 17:26:19,376	INFO trainable.py:103 -- Trainable.setup took 12.935 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7400, ip=172.31.21.209)[0m 2021-02-04 17:27:00,774	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47955 [rank=0]
[2m[36m(pid=6347, ip=172.31.21.209)[0m 2021-02-04 17:24:42,262	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48325 [rank=1]
[2m[36m(pid=7105, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7401, ip=172.31.21.209)[0m 2021-02-04 17:27:00,774	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47955 [rank=1]
[2m[36m(pid=7435, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7943, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5612, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7106, ip=172.31.21.209)[0m 2021-02-04 17:26:01,338	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:57769 [rank=2]
[2m[36m(pid=5848, ip=172.31.21.209)[0m 2021-02-04 17:23:42,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=1]
[2m[36m(pid=5613, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4987, ip=172.31.21.209)[0m 2021-02-04 17:21:19,946	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60985 [rank=2]
[2m[36m(pid=7944, ip=172.31.21.209)[0m 2021-02-04 17:27:52,810	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45685 [rank=2]
[2m[36m(pid=6345, ip=172.31.21.209)[0m 2021-02-04 17:24:48,944	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45649 [rank=2]
[2m[36m(pid=7112, ip=172.31.21.209)[0m 2021-02-04 17:26:13,433	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:54249 [rank=0]
[2m[36m(pid=7294, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4948, ip=172.31.21.209)[0m 2021-02-04 17:21:46,835	INFO trainable.py:103 -- Trainable.setup took 13.044 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=6347, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7950, ip=172.31.21.209)[0m 2021-02-04 17:28:07,727	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41425 [rank=0]
[2m[36m(pid=7943, ip=172.31.21.209)[0m 2021-02-04 17:27:52,831	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:39101 [rank=2]
[2m[36m(pid=7111, ip=172.31.21.209)[0m 2021-02-04 17:26:14,173	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:45755 [rank=0]
[2m[36m(pid=5865, ip=172.31.21.209)[0m 2021-02-04 17:23:13,086	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=0]
[2m[36m(pid=6345, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:28:14,845 E 7901 7941] logging.cc:415: *** Aborted at 1612488494 (unix time) try "date -d @1612488494" if you are using GNU date ***
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:28:14,845 E 7901 7941] logging.cc:415: PC: @                0x0 (unknown)
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:23:43,099 E 5547 5587] logging.cc:415: *** Aborted at 1612488223 (unix time) try "date -d @1612488223" if you are using GNU date ***
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:23:43,100 E 5547 5587] logging.cc:415: PC: @                0x0 (unknown)
[2m[36m(pid=3808, ip=172.31.18.216)[0m 2021-02-04 17:28:53,899	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:36319 [rank=0]
[2m[36m(pid=8441, ip=172.31.21.209)[0m 2021-02-04 17:28:53,896	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:36319 [rank=2]
[2m[36m(pid=8442, ip=172.31.21.209)[0m 2021-02-04 17:28:53,895	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:36319 [rank=1]
2021-02-04 17:28:54,937	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=17706, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:28:54,940	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           32 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           30 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           33 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           35 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:28:54,948	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=3810, ip=172.31.18.216)[0m 2021-02-04 17:28:56,483	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:54133 [rank=2]
[2m[36m(pid=8442, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=3808, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=8441, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=3810, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:28:59,850	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=17708, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=8448, ip=172.31.21.209)[0m 2021-02-04 17:29:01,031	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:58159 [rank=1]
[2m[36m(pid=8454, ip=172.31.21.209)[0m 2021-02-04 17:29:01,030	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:58159 [rank=0]
[2m[36m(pid=4044, ip=172.31.18.216)[0m 2021-02-04 17:29:01,035	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:58159 [rank=2]
2021-02-04 17:29:02,264	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=8453, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:29:02,265	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00003: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.2/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           32 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           34 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           35 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           31 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:29:02,271	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=8454, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4044, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=8448, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:29:06,742	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=8477, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=8477, ip=172.31.21.209)[0m 2021-02-04 17:29:06,736	INFO trainable.py:103 -- Trainable.setup took 11.038 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=4054, ip=172.31.18.216)[0m 2021-02-04 17:29:07,639	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:55027 [rank=2]
[2m[36m(pid=8598, ip=172.31.21.209)[0m 2021-02-04 17:29:07,908	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:52281 [rank=0]
[2m[36m(pid=8597, ip=172.31.21.209)[0m 2021-02-04 17:29:07,908	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:52281 [rank=1]
[2m[36m(pid=4053, ip=172.31.18.216)[0m 2021-02-04 17:29:07,913	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:52281 [rank=2]
[2m[36m(pid=4054, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=8598, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4053, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=8597, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:29:13,484	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=17711, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:29:13,484	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00003 |           34 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           36 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           31 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           33 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:29:13,490	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:29:13,632	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=8447, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=8447, ip=172.31.21.209)[0m 2021-02-04 17:29:13,598	INFO trainable.py:103 -- Trainable.setup took 10.533 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=8606, ip=172.31.21.209)[0m 2021-02-04 17:29:15,005	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53823 [rank=1]
[2m[36m(pid=8611, ip=172.31.21.209)[0m 2021-02-04 17:29:15,005	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53823 [rank=0]
[2m[36m(pid=4047, ip=172.31.18.216)[0m 2021-02-04 17:29:15,011	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:46747 [rank=1]
[2m[36m(pid=4048, ip=172.31.18.216)[0m 2021-02-04 17:29:15,011	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:46747 [rank=2]
[2m[36m(pid=6510, ip=172.31.31.247)[0m 2021-02-04 17:23:20,697	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58667 [rank=0]
[2m[36m(pid=6767, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7069, ip=172.31.31.247)[0m 2021-02-04 17:25:42,847	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37447 [rank=0]
[2m[36m(pid=6510, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8259, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5544, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6720, ip=172.31.31.247)[0m 2021-02-04 17:24:37,763	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37705 [rank=2]
[2m[36m(pid=5834, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6748, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5252, ip=172.31.31.247)[0m 2021-02-04 17:21:20,181	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=0]
[2m[36m(pid=8431, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5586, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8121, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8261, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7082, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6742, ip=172.31.31.247)[0m 2021-02-04 17:24:15,920	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41341 [rank=2]
[2m[36m(pid=7813, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6488, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6497, ip=172.31.31.247)[0m 2021-02-04 17:23:27,998	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:35053 [rank=1]
[2m[36m(pid=8269, ip=172.31.31.247)[0m 2021-02-04 17:28:14,772	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44125 [rank=1]
[2m[36m(pid=8275, ip=172.31.31.247)[0m 2021-02-04 17:28:14,772	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44125 [rank=0]
[2m[36m(pid=6747, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7704, ip=172.31.31.247)[0m 2021-02-04 17:26:27,322	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:52861 [rank=1]
[2m[36m(pid=7725, ip=172.31.31.247)[0m 2021-02-04 17:27:07,354	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:36341 [rank=2]
[2m[36m(pid=6715, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7757, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7064, ip=172.31.31.247)[0m 2021-02-04 17:24:48,946	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45649 [rank=1]
[2m[36m(pid=7756, ip=172.31.31.247)[0m 2021-02-04 17:26:39,909	INFO trainable.py:103 -- Trainable.setup took 11.764 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7821, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6720, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8260, ip=172.31.31.247)[0m 2021-02-04 17:28:21,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:38561 [rank=2]
[2m[36m(pid=6486, ip=172.31.31.247)[0m 2021-02-04 17:23:49,831	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:40643 [rank=1]
[2m[36m(pid=7148, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5834, ip=172.31.31.247)[0m 2021-02-04 17:22:23,288	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47929 [rank=0]
[2m[36m(pid=8102, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7162, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7757, ip=172.31.31.247)[0m 2021-02-04 17:27:07,354	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:36341 [rank=1]
[2m[36m(pid=5543, ip=172.31.31.247)[0m 2021-02-04 17:22:04,336	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37551 [rank=0]
[2m[36m(pid=5552, ip=172.31.31.247)[0m 2021-02-04 17:22:10,201	INFO trainable.py:103 -- Trainable.setup took 15.146 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7760, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7148, ip=172.31.31.247)[0m 2021-02-04 17:25:07,118	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41027 [rank=2]
[2m[36m(pid=5536, ip=172.31.31.247)[0m 2021-02-04 17:22:11,062	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47623 [rank=0]
[2m[36m(pid=5542, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6496, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7155, ip=172.31.31.247)[0m 2021-02-04 17:24:59,799	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:47803 [rank=2]
[2m[36m(pid=5587, ip=172.31.31.247)[0m 2021-02-04 17:21:41,915	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:43541 [rank=1]
[2m[36m(pid=5553, ip=172.31.31.247)[0m 2021-02-04 17:21:41,915	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:43541 [rank=2]
[2m[36m(pid=7082, ip=172.31.31.247)[0m 2021-02-04 17:25:35,493	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:49265 [rank=1]
[2m[36m(pid=7084, ip=172.31.31.247)[0m 2021-02-04 17:25:35,492	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:49265 [rank=0]
[2m[36m(pid=5585, ip=172.31.31.247)[0m 2021-02-04 17:21:34,575	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:40689 [rank=1]
[2m[36m(pid=7714, ip=172.31.31.247)[0m 2021-02-04 17:27:21,945	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:57171 [rank=0]
[2m[36m(pid=7813, ip=172.31.31.247)[0m 2021-02-04 17:26:54,129	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:33657 [rank=0]
[2m[36m(pid=8102, ip=172.31.31.247)[0m 2021-02-04 17:27:52,811	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45685 [rank=0]
[2m[36m(pid=6776, ip=172.31.31.247)[0m 2021-02-04 17:24:03,332	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:51963 [rank=2]
[2m[36m(pid=6747, ip=172.31.31.247)[0m 2021-02-04 17:24:10,375	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:43589 [rank=1]
[2m[36m(pid=6487, ip=172.31.31.247)[0m 2021-02-04 17:23:35,387	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:48141 [rank=2]
[2m[36m(pid=7175, ip=172.31.31.247)[0m 2021-02-04 17:24:59,799	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:47803 [rank=1]
[2m[36m(pid=6723, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8114, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6487, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6495, ip=172.31.31.247)[0m 2021-02-04 17:23:35,387	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:48141 [rank=1]
[2m[36m(pid=6715, ip=172.31.31.247)[0m 2021-02-04 17:23:56,734	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:42111 [rank=2]
[2m[36m(pid=5852, ip=172.31.31.247)[0m 2021-02-04 17:22:16,568	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41825 [rank=1]
[2m[36m(pid=6717, ip=172.31.31.247)[0m 2021-02-04 17:24:28,993	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48423 [rank=2]
[2m[36m(pid=5542, ip=172.31.31.247)[0m 2021-02-04 17:22:10,586	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:44829 [rank=2]
[2m[36m(pid=6716, ip=172.31.31.247)[0m 2021-02-04 17:23:56,733	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:42111 [rank=0]
[2m[36m(pid=8259, ip=172.31.31.247)[0m 2021-02-04 17:28:00,463	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:44217 [rank=2]
[2m[36m(pid=8431, ip=172.31.31.247)[0m 2021-02-04 17:28:29,136	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44843 [rank=0]
[2m[36m(pid=7711, ip=172.31.31.247)[0m 2021-02-04 17:27:21,945	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:57171 [rank=1]
[2m[36m(pid=7065, ip=172.31.31.247)[0m 2021-02-04 17:24:48,945	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45649 [rank=0]
[2m[36m(pid=5537, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7820, ip=172.31.31.247)[0m 2021-02-04 17:26:40,702	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:54763 [rank=0]
[2m[36m(pid=6748, ip=172.31.31.247)[0m 2021-02-04 17:24:10,059	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:59085 [rank=2]
[2m[36m(pid=8096, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7758, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6719, ip=172.31.31.247)[0m 2021-02-04 17:24:09,151	INFO trainable.py:103 -- Trainable.setup took 12.454 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7712, ip=172.31.31.247)[0m 2021-02-04 17:27:28,592	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:59485 [rank=2]
[2m[36m(pid=5586, ip=172.31.31.247)[0m 2021-02-04 17:21:34,575	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:40689 [rank=2]
[2m[36m(pid=5585, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7714, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7704, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6725, ip=172.31.31.247)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=0]
[2m[36m(pid=6496, ip=172.31.31.247)[0m 2021-02-04 17:23:27,998	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:35053 [rank=0]
[2m[36m(pid=5544, ip=172.31.31.247)[0m 2021-02-04 17:22:04,357	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47479 [rank=0]
[2m[36m(pid=7821, ip=172.31.31.247)[0m 2021-02-04 17:26:40,703	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:54763 [rank=1]
[2m[36m(pid=8114, ip=172.31.31.247)[0m 2021-02-04 17:27:43,077	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:33593 [rank=0]
[2m[36m(pid=5551, ip=172.31.31.247)[0m 2021-02-04 17:22:08,650	INFO trainable.py:103 -- Trainable.setup took 13.595 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=6776, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7093, ip=172.31.31.247)[0m 2021-02-04 17:25:15,873	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:55935 [rank=1]
[2m[36m(pid=5851, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7091, ip=172.31.31.247)[0m 2021-02-04 17:25:15,872	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:55935 [rank=0]
[2m[36m(pid=5553, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6742, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5537, ip=172.31.31.247)[0m 2021-02-04 17:21:20,182	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=1]
[2m[36m(pid=6741, ip=172.31.31.247)[0m 2021-02-04 17:24:28,103	INFO trainable.py:103 -- Trainable.setup took 12.878 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7760, ip=172.31.31.247)[0m 2021-02-04 17:27:00,776	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47955 [rank=2]
[2m[36m(pid=7175, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5830, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5851, ip=172.31.31.247)[0m 2021-02-04 17:22:17,370	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38443 [rank=0]
[2m[36m(pid=7820, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6486, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7705, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7725, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7065, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6725, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6497, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6488, ip=172.31.31.247)[0m 2021-02-04 17:23:49,831	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:40643 [rank=2]
[2m[36m(pid=5536, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7064, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7162, ip=172.31.31.247)[0m 2021-02-04 17:25:07,117	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41027 [rank=1]
[2m[36m(pid=8275, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6767, ip=172.31.31.247)[0m 2021-02-04 17:24:03,331	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:51963 [rank=0]
[2m[36m(pid=7066, ip=172.31.31.247)[0m 2021-02-04 17:25:42,848	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37447 [rank=1]
[2m[36m(pid=8260, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5830, ip=172.31.31.247)[0m 2021-02-04 17:22:28,799	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:44279 [rank=2]
[2m[36m(pid=6717, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5831, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5835, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8092, ip=172.31.31.247)[0m 2021-02-04 17:27:28,925	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38191 [rank=0]
[2m[36m(pid=7758, ip=172.31.31.247)[0m 2021-02-04 17:26:54,130	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:33657 [rank=1]
[2m[36m(pid=8097, ip=172.31.31.247)[0m 2021-02-04 17:27:52,811	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45685 [rank=1]
[2m[36m(pid=5835, ip=172.31.31.247)[0m 2021-02-04 17:22:22,811	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37165 [rank=0]
[2m[36m(pid=8096, ip=172.31.31.247)[0m 2021-02-04 17:28:00,463	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:44217 [rank=1]
[2m[36m(pid=5543, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5587, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6718, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5831, ip=172.31.31.247)[0m 2021-02-04 17:22:29,479	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=2]
[2m[36m(pid=8097, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6068, ip=172.31.31.247)[0m 2021-02-04 17:22:35,083	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=6068, ip=172.31.31.247)[0m Traceback (most recent call last):
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 375, in ray._raylet.execute_task
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 400, in load_actor_class
[2m[36m(pid=6068, ip=172.31.31.247)[0m     job_id, actor_creation_function_descriptor)
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 496, in _load_actor_class_from_gcs
[2m[36m(pid=6068, ip=172.31.31.247)[0m     actor_class = pickle.loads(pickled_class)
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/__init__.py", line 1, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.util.sgd.torch import TorchTrainer
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/__init__.py", line 12, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.util.sgd.torch.torch_trainer import (TorchTrainer,
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 13, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune import Trainable
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/__init__.py", line 2, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune.tune import run_experiments, run
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 18, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune.trial_runner import TrialRunner
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 28, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune.web_server import TuneServer
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/web_server.py", line 16, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     import requests  # `requests` is not part of stdlib.
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/requests/__init__.py", line 43, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     import urllib3
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "<frozen importlib._bootstrap>", line 980, in _find_and_load
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "<frozen importlib._bootstrap>", line 148, in __enter__
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "<frozen importlib._bootstrap>", line 174, in _get_module_lock
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=6068, ip=172.31.31.247)[0m     sys.exit(1)
[2m[36m(pid=6068, ip=172.31.31.247)[0m SystemExit: 1
[2m[36m(pid=7093, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7705, ip=172.31.31.247)[0m 2021-02-04 17:26:27,322	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:52861 [rank=0]
[2m[36m(pid=6511, ip=172.31.31.247)[0m 2021-02-04 17:23:20,697	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58667 [rank=1]
[2m[36m(pid=7155, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6718, ip=172.31.31.247)[0m 2021-02-04 17:24:37,763	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37705 [rank=1]
[2m[36m(pid=7091, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8121, ip=172.31.31.247)[0m 2021-02-04 17:27:36,594	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:44109 [rank=2]
[2m[36m(pid=6723, ip=172.31.31.247)[0m 2021-02-04 17:24:28,993	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48423 [rank=1]
[2m[36m(pid=8261, ip=172.31.31.247)[0m 2021-02-04 17:28:21,662	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44301 [rank=0]
[2m[36m(pid=5252, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6716, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7084, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8269, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6495, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5852, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6511, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7711, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[33m(raylet, ip=172.31.31.247)[0m [2021-02-04 13:46:32,497 E 258 298] logging.cc:415: *** Aborted at 1612475192 (unix time) try "date -d @1612475192" if you are using GNU date ***
[2m[33m(raylet, ip=172.31.31.247)[0m [2021-02-04 13:46:32,497 E 258 298] logging.cc:415: PC: @                0x0 (unknown)
[2m[36m(pid=8606, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4048, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4047, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=8611, ip=172.31.21.209)[0m Files already downloaded and verified
2021-02-04 17:29:20,783	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=8612, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:29:20,786	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.79 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           36 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           32 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           33 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           35 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:29:20,821	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:29:20,892	INFO commands.py:441 -- Shutdown i-00f5016b41ab14ccc
2021-02-04 17:29:20,893	INFO command_runner.py:356 -- Fetched IP: 54.68.206.108
2021-02-04 17:29:20,893	INFO log_timer.py:27 -- NodeUpdater: i-00f5016b41ab14ccc: Got IP  [LogTimer=0ms]
Warning: Permanently added '54.68.206.108' (ECDSA) to the list of known hosts.

Stopped only 18 out of 19 Ray processes. Set `-v` to see more details.
Try running the command again, or use `--force`.
Shared connection to 54.68.206.108 closed.

2021-02-04 17:29:28,679	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 7.883 s, which may be a performance bottleneck.
2021-02-04 17:29:28,679	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
== Status ==
Memory usage on this node: 3.9/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 9/128 CPUs, 9/8 GPUs, 0.0/660.79 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (1 PENDING, 3 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           36 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           32 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00002 |           33 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           35 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:29:28,693	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=8641, ip=172.31.21.209)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:29:28,694	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::NoFaultToleranceTrainable.train_buffered() (pid=4175, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
    self._trainer = self._create_trainer(config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
    trainer = TorchTrainer(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
    self._start_workers(self.max_replicas)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 220, in start_workers
    self.apply_all_workers(self._initialization_hook)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 246, in apply_all_workers
    return ray.get(self._apply_all_workers(fn))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
2021-02-04 17:29:28,695	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
2021-02-04 17:29:28,698	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
(pid=4189, ip=172.31.18.216) 2021-02-04 17:29:30,184	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:34075 [rank=0]
(pid=4188, ip=172.31.18.216) 2021-02-04 17:29:30,184	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:34075 [rank=1]
(pid=8799, ip=172.31.31.247) 2021-02-04 17:29:30,204	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46259 [rank=1]
(pid=8808, ip=172.31.31.247) 2021-02-04 17:29:30,203	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46259 [rank=0]
(pid=4189, ip=172.31.18.216) Files already downloaded and verified
(pid=8808, ip=172.31.31.247) Files already downloaded and verified
(pid=4188, ip=172.31.18.216) Files already downloaded and verified
(pid=8799, ip=172.31.31.247) Files already downloaded and verified
2021-02-04 17:29:35,950	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4187, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           33 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           35 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           37 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           34 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:29:35,986	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:29:36,042	INFO commands.py:441 -- Shutdown i-00f5016b41ab14ccc
2021-02-04 17:29:36,042	INFO command_runner.py:356 -- Fetched IP: 54.68.206.108
2021-02-04 17:29:36,042	INFO log_timer.py:27 -- NodeUpdater: i-00f5016b41ab14ccc: Got IP  [LogTimer=0ms]
Warning: Permanently added '54.68.206.108' (ECDSA) to the list of known hosts.

(pid=4180, ip=172.31.18.216) 2021-02-04 17:29:36,888	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:49341 [rank=0]
(pid=4179, ip=172.31.18.216) 2021-02-04 17:29:36,888	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:49341 [rank=1]
Did not find any active Ray processes.
Shared connection to 54.68.206.108 closed.

(pid=4180, ip=172.31.18.216) Files already downloaded and verified
(pid=4179, ip=172.31.18.216) Files already downloaded and verified
(pid=8810, ip=172.31.31.247) 2021-02-04 17:29:42,730	INFO trainable.py:103 -- Trainable.setup took 13.257 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-02-04 17:29:42,888	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 6.928 s, which may be a performance bottleneck.
2021-02-04 17:29:42,892	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=8809, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:29:42,893	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           34 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           35 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           37 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           34 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:29:42,900	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:29:42,908	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=8810, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
(pid=8800, ip=172.31.31.247) 2021-02-04 17:29:44,410	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:54921 [rank=2]
(pid=4178, ip=172.31.18.216) 2021-02-04 17:29:44,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47667 [rank=1]
(pid=4327, ip=172.31.18.216) 2021-02-04 17:29:44,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47667 [rank=2]
(pid=8788, ip=172.31.31.247) 2021-02-04 17:29:44,747	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47667 [rank=0]
(pid=8800, ip=172.31.31.247) Files already downloaded and verified
(pid=4178, ip=172.31.18.216) Files already downloaded and verified
(pid=8788, ip=172.31.31.247) Files already downloaded and verified
(pid=4327, ip=172.31.18.216) Files already downloaded and verified
2021-02-04 17:29:50,159	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=17907, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:29:50,162	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.3/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           35 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           37 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           34 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           36 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:29:50,171	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:29:50,617	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=8801, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
(pid=8789, ip=172.31.31.247) 2021-02-04 17:29:51,691	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:38445 [rank=2]
(pid=8790, ip=172.31.31.247) 2021-02-04 17:29:51,753	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52717 [rank=2]
(pid=4336, ip=172.31.18.216) 2021-02-04 17:29:51,754	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52717 [rank=1]
(pid=4343, ip=172.31.18.216) 2021-02-04 17:29:51,754	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52717 [rank=0]
(pid=8789, ip=172.31.31.247) Files already downloaded and verified
(pid=4343, ip=172.31.18.216) Files already downloaded and verified
(pid=8790, ip=172.31.31.247) Files already downloaded and verified
(pid=4336, ip=172.31.18.216) Files already downloaded and verified
2021-02-04 17:29:57,518	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=17894, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:29:57,519	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.3/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           36 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           34 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           36 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           38 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:29:57,525	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:29:57,564	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4342, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
[2m[36m(pid=9043, ip=172.31.31.247)[0m 2021-02-04 17:29:59,038	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:32977 [rank=2]
[2m[36m(pid=4329, ip=172.31.18.216)[0m 2021-02-04 17:29:59,039	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:32977 [rank=0]
[2m[36m(pid=4328, ip=172.31.18.216)[0m 2021-02-04 17:29:59,039	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:32977 [rank=1]
[2m[36m(pid=9042, ip=172.31.31.247)[0m 2021-02-04 17:29:59,694	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:36601 [rank=0]
[2m[36m(pid=5866, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5860, ip=172.31.21.209)[0m 2021-02-04 17:23:42,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=0]
[2m[36m(pid=7284, ip=172.31.21.209)[0m 2021-02-04 17:26:20,576	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53075 [rank=0]
[2m[36m(pid=6346, ip=172.31.21.209)[0m 2021-02-04 17:24:42,241	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48325 [rank=0]
[2m[36m(pid=6336, ip=172.31.21.209)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=1]
[2m[36m(pid=5599, ip=172.31.21.209)[0m 2021-02-04 17:22:51,243	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=2]
[2m[36m(pid=7412, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5621, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7294, ip=172.31.21.209)[0m 2021-02-04 17:26:27,321	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:52861 [rank=2]
[2m[36m(pid=5859, ip=172.31.21.209)[0m 2021-02-04 17:23:43,058	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=5859, ip=172.31.21.209)[0m Traceback (most recent call last):
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 556, in actor_method_executor
[2m[36m(pid=5859, ip=172.31.21.209)[0m     return method(__ray_actor, *args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self.setup(copy.deepcopy(self.config))
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self._trainer = self._create_trainer(config)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
[2m[36m(pid=5859, ip=172.31.21.209)[0m     trainer = TorchTrainer(*args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self._start_workers(self.max_replicas)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
[2m[36m(pid=5859, ip=172.31.21.209)[0m     self.worker_group.start_workers(num_workers)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
[2m[36m(pid=5859, ip=172.31.21.209)[0m     address=address, world_size=num_workers))
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
[2m[36m(pid=5859, ip=172.31.21.209)[0m     return func(*args, **kwargs)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1449, in get
[2m[36m(pid=5859, ip=172.31.21.209)[0m     object_refs, timeout=timeout)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 310, in get_objects
[2m[36m(pid=5859, ip=172.31.21.209)[0m     object_refs, self.current_task_id, timeout_ms)
[2m[36m(pid=5859, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=5859, ip=172.31.21.209)[0m     sys.exit(1)
[2m[36m(pid=5859, ip=172.31.21.209)[0m SystemExit: 1
[2m[36m(pid=4950, ip=172.31.21.209)[0m 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=2]
[2m[36m(pid=7284, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6604, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7956, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=8454, ip=172.31.21.209)[0m 2021-02-04 17:29:01,030	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:58159 [rank=0]
[2m[36m(pid=4958, ip=172.31.21.209)[0m 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=1]
[2m[36m(pid=7400, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=8598, ip=172.31.21.209)[0m 2021-02-04 17:29:07,908	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:52281 [rank=0]
[2m[36m(pid=5599, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4959, ip=172.31.21.209)[0m 2021-02-04 17:21:41,119	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34263 [rank=0]
[2m[36m(pid=6346, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7956, ip=172.31.21.209)[0m 2021-02-04 17:28:00,461	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:44217 [rank=0]
[2m[36m(pid=8441, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6628, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4986, ip=172.31.21.209)[0m 2021-02-04 17:21:20,181	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=2]
[2m[36m(pid=7435, ip=172.31.21.209)[0m 2021-02-04 17:26:34,096	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48065 [rank=2]
[2m[36m(pid=8477, ip=172.31.21.209)[0m 2021-02-04 17:29:06,736	INFO trainable.py:103 -- Trainable.setup took 11.038 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=6356, ip=172.31.21.209)[0m 2021-02-04 17:24:48,011	INFO trainable.py:103 -- Trainable.setup took 10.973 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=5622, ip=172.31.21.209)[0m 2021-02-04 17:22:29,478	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=1]
[2m[36m(pid=5612, ip=172.31.21.209)[0m 2021-02-04 17:22:51,243	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:54467 [rank=1]
[2m[36m(pid=5613, ip=172.31.21.209)[0m 2021-02-04 17:22:43,728	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=1]
[2m[36m(pid=8441, ip=172.31.21.209)[0m 2021-02-04 17:28:53,896	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:36319 [rank=2]
[2m[36m(pid=6627, ip=172.31.21.209)[0m 2021-02-04 17:24:55,541	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53241 [rank=2]
[2m[36m(pid=6627, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5600, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7402, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=8611, ip=172.31.21.209)[0m 2021-02-04 17:29:15,005	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53823 [rank=0]
[2m[36m(pid=7399, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7944, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4986, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7399, ip=172.31.21.209)[0m 2021-02-04 17:26:50,832	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41233 [rank=0]
[2m[36m(pid=5621, ip=172.31.21.209)[0m 2021-02-04 17:22:29,477	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=0]
[2m[36m(pid=4958, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4957, ip=172.31.21.209)[0m 2021-02-04 17:21:47,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60783 [rank=1]
[2m[36m(pid=5614, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7425, ip=172.31.21.209)[0m 2021-02-04 17:26:46,422	INFO trainable.py:103 -- Trainable.setup took 10.926 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7105, ip=172.31.21.209)[0m 2021-02-04 17:26:01,329	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:53995 [rank=2]
[2m[36m(pid=8598, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=8448, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5844, ip=172.31.21.209)[0m 2021-02-04 17:23:06,156	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=1]
[2m[36m(pid=8606, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7289, ip=172.31.21.209)[0m 2021-02-04 17:26:33,282	INFO trainable.py:103 -- Trainable.setup took 12.554 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=4959, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7396, ip=172.31.21.209)[0m 2021-02-04 17:26:34,096	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48065 [rank=1]
[2m[36m(pid=5601, ip=172.31.21.209)[0m 2021-02-04 17:22:58,580	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:51469 [rank=2]
[2m[36m(pid=6338, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5845, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=8454, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5601, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6587, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6354, ip=172.31.21.209)[0m 2021-02-04 17:24:30,443	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:38699 [rank=1]
[2m[36m(pid=8447, ip=172.31.21.209)[0m 2021-02-04 17:29:13,598	INFO trainable.py:103 -- Trainable.setup took 10.533 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=6354, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7118, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=8597, ip=172.31.21.209)[0m 2021-02-04 17:29:07,908	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:52281 [rank=1]
[2m[36m(pid=7118, ip=172.31.21.209)[0m 2021-02-04 17:26:07,208	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:38301 [rank=2]
[2m[36m(pid=7112, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5866, ip=172.31.21.209)[0m 2021-02-04 17:23:13,086	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=1]
[2m[36m(pid=8442, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6628, ip=172.31.21.209)[0m 2021-02-04 17:25:09,226	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41665 [rank=0]
[2m[36m(pid=6399, ip=172.31.21.209)[0m 2021-02-04 17:24:41,404	INFO trainable.py:103 -- Trainable.setup took 12.054 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7106, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6338, ip=172.31.21.209)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=2]
[2m[36m(pid=5614, ip=172.31.21.209)[0m 2021-02-04 17:22:43,728	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:49057 [rank=0]
[2m[36m(pid=8606, ip=172.31.21.209)[0m 2021-02-04 17:29:15,005	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53823 [rank=1]
[2m[36m(pid=4961, ip=172.31.21.209)[0m 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=1]
[2m[36m(pid=8611, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6604, ip=172.31.21.209)[0m 2021-02-04 17:25:09,226	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41665 [rank=1]
[2m[36m(pid=7401, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7955, ip=172.31.21.209)[0m 2021-02-04 17:28:00,102	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:36377 [rank=2]
[2m[36m(pid=4960, ip=172.31.21.209)[0m 2021-02-04 17:21:34,510	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47259 [rank=0]
[2m[36m(pid=4960, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6355, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5844, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4961, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7950, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=8448, ip=172.31.21.209)[0m 2021-02-04 17:29:01,031	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:58159 [rank=1]
[2m[36m(pid=7412, ip=172.31.21.209)[0m 2021-02-04 17:26:43,586	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:36205 [rank=0]
[2m[36m(pid=8597, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7413, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5845, ip=172.31.21.209)[0m 2021-02-04 17:23:06,156	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:59089 [rank=0]
[2m[36m(pid=6336, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4987, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7295, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=6355, ip=172.31.21.209)[0m 2021-02-04 17:24:30,443	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:38699 [rank=0]
[2m[36m(pid=6587, ip=172.31.21.209)[0m 2021-02-04 17:24:55,540	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:53241 [rank=0]
[2m[36m(pid=7955, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5865, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7111, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5600, ip=172.31.21.209)[0m 2021-02-04 17:22:58,580	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:51469 [rank=1]
[2m[36m(pid=7402, ip=172.31.21.209)[0m 2021-02-04 17:26:50,832	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41233 [rank=1]
[2m[36m(pid=7295, ip=172.31.21.209)[0m 2021-02-04 17:26:21,483	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:45427 [rank=0]
[2m[36m(pid=7396, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7413, ip=172.31.21.209)[0m 2021-02-04 17:26:43,587	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:36205 [rank=1]
[2m[36m(pid=5622, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7147, ip=172.31.21.209)[0m 2021-02-04 17:26:19,376	INFO trainable.py:103 -- Trainable.setup took 12.935 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=8768, ip=172.31.21.209)[0m 2021-02-04 17:29:22,745	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=8768, ip=172.31.21.209)[0m Traceback (most recent call last):
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "python/ray/_raylet.pyx", line 375, in ray._raylet.execute_task
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 400, in load_actor_class
[2m[36m(pid=8768, ip=172.31.21.209)[0m     job_id, actor_creation_function_descriptor)
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 496, in _load_actor_class_from_gcs
[2m[36m(pid=8768, ip=172.31.21.209)[0m     actor_class = pickle.loads(pickled_class)
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/__init__.py", line 1, in <module>
[2m[36m(pid=8768, ip=172.31.21.209)[0m     from ray.util.sgd.torch import TorchTrainer
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/__init__.py", line 12, in <module>
[2m[36m(pid=8768, ip=172.31.21.209)[0m     from ray.util.sgd.torch.torch_trainer import (TorchTrainer,
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 13, in <module>
[2m[36m(pid=8768, ip=172.31.21.209)[0m     from ray.tune import Trainable
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/__init__.py", line 2, in <module>
[2m[36m(pid=8768, ip=172.31.21.209)[0m     from ray.tune.tune import run_experiments, run
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 18, in <module>
[2m[36m(pid=8768, ip=172.31.21.209)[0m     from ray.tune.trial_runner import TrialRunner
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 28, in <module>
[2m[36m(pid=8768, ip=172.31.21.209)[0m     from ray.tune.web_server import TuneServer
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/web_server.py", line 6, in <module>
[2m[36m(pid=8768, ip=172.31.21.209)[0m     from http.server import SimpleHTTPRequestHandler, HTTPServer
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/http/server.py", line 627, in <module>
[2m[36m(pid=8768, ip=172.31.21.209)[0m     class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/http/server.py", line 871, in SimpleHTTPRequestHandler
[2m[36m(pid=8768, ip=172.31.21.209)[0m     mimetypes.init() # try to read system mime.types
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/mimetypes.py", line 364, in init
[2m[36m(pid=8768, ip=172.31.21.209)[0m     db.read(file)
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/mimetypes.py", line 206, in read
[2m[36m(pid=8768, ip=172.31.21.209)[0m     self.readfp(fp, strict)
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/mimetypes.py", line 221, in readfp
[2m[36m(pid=8768, ip=172.31.21.209)[0m     for i in range(len(words)):
[2m[36m(pid=8768, ip=172.31.21.209)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=8768, ip=172.31.21.209)[0m     sys.exit(1)
[2m[36m(pid=8768, ip=172.31.21.209)[0m SystemExit: 1
[2m[36m(pid=7400, ip=172.31.21.209)[0m 2021-02-04 17:27:00,774	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47955 [rank=0]
[2m[36m(pid=6347, ip=172.31.21.209)[0m 2021-02-04 17:24:42,262	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:48325 [rank=1]
[2m[36m(pid=7105, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7401, ip=172.31.21.209)[0m 2021-02-04 17:27:00,774	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47955 [rank=1]
[2m[36m(pid=8442, ip=172.31.21.209)[0m 2021-02-04 17:28:53,895	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:36319 [rank=1]
[2m[36m(pid=7435, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7943, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=5612, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7106, ip=172.31.21.209)[0m 2021-02-04 17:26:01,338	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:57769 [rank=2]
[2m[36m(pid=5848, ip=172.31.21.209)[0m 2021-02-04 17:23:42,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:56565 [rank=1]
[2m[36m(pid=5613, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4987, ip=172.31.21.209)[0m 2021-02-04 17:21:19,946	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:60985 [rank=2]
[2m[36m(pid=7944, ip=172.31.21.209)[0m 2021-02-04 17:27:52,810	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45685 [rank=2]
[2m[36m(pid=6345, ip=172.31.21.209)[0m 2021-02-04 17:24:48,944	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45649 [rank=2]
[2m[36m(pid=7112, ip=172.31.21.209)[0m 2021-02-04 17:26:13,433	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:54249 [rank=0]
[2m[36m(pid=7294, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=4948, ip=172.31.21.209)[0m 2021-02-04 17:21:46,835	INFO trainable.py:103 -- Trainable.setup took 13.044 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=6347, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[36m(pid=7950, ip=172.31.21.209)[0m 2021-02-04 17:28:07,727	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:41425 [rank=0]
[2m[36m(pid=7943, ip=172.31.21.209)[0m 2021-02-04 17:27:52,831	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:39101 [rank=2]
[2m[36m(pid=7111, ip=172.31.21.209)[0m 2021-02-04 17:26:14,173	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:45755 [rank=0]
[2m[36m(pid=5865, ip=172.31.21.209)[0m 2021-02-04 17:23:13,086	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:55951 [rank=0]
[2m[36m(pid=6345, ip=172.31.21.209)[0m Files already downloaded and verified
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:28:14,845 E 7901 7941] logging.cc:415: *** Aborted at 1612488494 (unix time) try "date -d @1612488494" if you are using GNU date ***
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:28:14,845 E 7901 7941] logging.cc:415: PC: @                0x0 (unknown)
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:23:43,099 E 5547 5587] logging.cc:415: *** Aborted at 1612488223 (unix time) try "date -d @1612488223" if you are using GNU date ***
[2m[33m(raylet, ip=172.31.21.209)[0m [2021-02-04 17:23:43,100 E 5547 5587] logging.cc:415: PC: @                0x0 (unknown)
[2m[36m(pid=4329, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=9043, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=4328, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=9042, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:30:04,862	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=4337, ip=172.31.18.216)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:30:04,868	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/128 CPUs, 6/8 GPUs, 0.0/660.79 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           37 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           36 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           38 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           35 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:30:04,945	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:30:05,015	INFO commands.py:441 -- Shutdown i-00f5016b41ab14ccc
[2m[36m(pid=4496, ip=172.31.18.216)[0m 2021-02-04 17:30:06,749	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58701 [rank=1]
[2m[36m(pid=4497, ip=172.31.18.216)[0m 2021-02-04 17:30:06,749	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58701 [rank=2]
[2m[36m(pid=9036, ip=172.31.31.247)[0m 2021-02-04 17:30:06,747	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58701 [rank=0]
[2m[36m(pid=4496, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=9036, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:30:10,272	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 5.388 s, which may be a performance bottleneck.
2021-02-04 17:30:10,272	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
== Status ==
Memory usage on this node: 7.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 9/128 CPUs, 9/8 GPUs, 0.0/660.79 GiB heap, 0.0/198.05 GiB objects (0/4.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (1 PENDING, 3 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | RUNNING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           37 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           36 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           38 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           35 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:30:10,283	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=9013, ip=172.31.31.247)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:30:10,310	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:30:10,363	INFO commands.py:441 -- Shutdown i-0ac149179edeecfcd
2021-02-04 17:30:10,364	INFO command_runner.py:356 -- Fetched IP: 34.215.60.186
2021-02-04 17:30:10,364	INFO log_timer.py:27 -- NodeUpdater: i-0ac149179edeecfcd: Got IP  [LogTimer=0ms]
Warning: Permanently added '34.215.60.186' (ECDSA) to the list of known hosts.

[2m[36m(pid=4497, ip=172.31.18.216)[0m Files already downloaded and verified
[32mStopped all 16 Ray processes.[39m
[0mShared connection to 34.215.60.186 closed.

2021-02-04 17:30:19,306	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 9.022 s, which may be a performance bottleneck.
2021-02-04 17:30:19,311	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::NoFaultToleranceTrainable.train_buffered() (pid=18275, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 97, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 713, in setup
    self._trainer = self._create_trainer(config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 671, in _create_trainer
    trainer = TorchTrainer(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 266, in __init__
    self._start_workers(self.max_replicas)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 326, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/worker_group.py", line 231, in start_workers
    ray.get(self._setup_operator())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
2021-02-04 17:30:19,313	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00002: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 3.8/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/96 CPUs, 6/6 GPUs, 0.0/493.26 GiB heap, 0.0/148.54 GiB objects (0/3.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           38 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           38 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00001 |           35 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           37 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:30:19,325	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
2021-02-04 17:30:35,254	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1458, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
== Status ==
Memory usage on this node: 3.7/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/64 CPUs, 6/4 GPUs, 0.0/325.73 GiB heap, 0.0/99.02 GiB objects (0/2.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           38 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00001 |           35 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           37 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           39 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:30:35,262	WARNING worker.py:1107 -- The node with node id 687c2bfdd0c49b9fa1b39301bbeff27886a1441e4cae136fbe7b8e6e has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
2021-02-04 17:30:35,263	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00002: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1458, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
2021-02-04 17:30:35,289	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:30:35,350	INFO commands.py:441 -- Shutdown i-083b602e902a78a09
2021-02-04 17:30:35,351	INFO command_runner.py:356 -- Fetched IP: 34.218.250.17
2021-02-04 17:30:35,351	INFO log_timer.py:27 -- NodeUpdater: i-083b602e902a78a09: Got IP  [LogTimer=0ms]
Warning: Permanently added '34.218.250.17' (ECDSA) to the list of known hosts.

Stopped all 14 Ray processes.
Shared connection to 34.218.250.17 closed.

2021-02-04 17:30:42,307	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 7.043 s, which may be a performance bottleneck.
2021-02-04 17:30:42,308	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1458, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
2021-02-04 17:30:42,310	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 3.7/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 3/64 CPUs, 3/4 GPUs, 0.0/325.73 GiB heap, 0.0/99.02 GiB objects (0/2.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |           36 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           37 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00000 |           39 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           39 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:30:42,318	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=4047, ip=172.31.18.216)[0m 2021-02-04 17:29:15,011	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:46747 [rank=1]
[2m[36m(pid=4054, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4180, ip=172.31.18.216)[0m 2021-02-04 17:29:36,888	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:49341 [rank=0]
[2m[36m(pid=4496, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4328, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4343, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4496, ip=172.31.18.216)[0m 2021-02-04 17:30:06,749	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58701 [rank=1]
[2m[36m(pid=4497, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4189, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4179, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4179, ip=172.31.18.216)[0m 2021-02-04 17:29:36,888	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:49341 [rank=1]
[2m[36m(pid=4178, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4054, ip=172.31.18.216)[0m 2021-02-04 17:29:07,639	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:55027 [rank=2]
[2m[36m(pid=4329, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4336, ip=172.31.18.216)[0m 2021-02-04 17:29:51,754	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52717 [rank=1]
[2m[36m(pid=4329, ip=172.31.18.216)[0m 2021-02-04 17:29:59,039	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:32977 [rank=0]
[2m[36m(pid=4048, ip=172.31.18.216)[0m 2021-02-04 17:29:15,011	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:46747 [rank=2]
[2m[36m(pid=4336, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4047, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4497, ip=172.31.18.216)[0m 2021-02-04 17:30:06,749	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58701 [rank=2]
[2m[36m(pid=4497, ip=172.31.18.216)[0m 2021-02-04 17:30:12,221	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=4497, ip=172.31.18.216)[0m Traceback (most recent call last):
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 556, in actor_method_executor
[2m[36m(pid=4497, ip=172.31.18.216)[0m     return method(__ray_actor, *args, **kwargs)
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/distributed_torch_runner.py", line 92, in setup_operator
[2m[36m(pid=4497, ip=172.31.18.216)[0m     scheduler_step_freq=self.scheduler_step_freq)
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/training_operator.py", line 148, in __init__
[2m[36m(pid=4497, ip=172.31.18.216)[0m     self.setup(config)
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/training_operator.py", line 999, in setup
[2m[36m(pid=4497, ip=172.31.18.216)[0m     config)
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/training_operator.py", line 984, in _initialize_dataloaders
[2m[36m(pid=4497, ip=172.31.18.216)[0m     loaders = self.__class__._data_creator(config)
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "/home/ray/pytorch_pbt_failure.py", line 56, in cifar_creator
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/torchvision/datasets/cifar.py", line 67, in __init__
[2m[36m(pid=4497, ip=172.31.18.216)[0m     if not self._check_integrity():
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/torchvision/datasets/cifar.py", line 135, in _check_integrity
[2m[36m(pid=4497, ip=172.31.18.216)[0m     if not check_integrity(fpath, md5):
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 43, in check_integrity
[2m[36m(pid=4497, ip=172.31.18.216)[0m     return check_md5(fpath, md5)
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 35, in check_md5
[2m[36m(pid=4497, ip=172.31.18.216)[0m     return md5 == calculate_md5(fpath, **kwargs)
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 29, in calculate_md5
[2m[36m(pid=4497, ip=172.31.18.216)[0m     for chunk in iter(lambda: f.read(chunk_size), b''):
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 29, in <lambda>
[2m[36m(pid=4497, ip=172.31.18.216)[0m     for chunk in iter(lambda: f.read(chunk_size), b''):
[2m[36m(pid=4497, ip=172.31.18.216)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=4497, ip=172.31.18.216)[0m     sys.exit(1)
[2m[36m(pid=4497, ip=172.31.18.216)[0m SystemExit: 1
[2m[36m(pid=4180, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4189, ip=172.31.18.216)[0m 2021-02-04 17:29:30,184	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:34075 [rank=0]
[2m[36m(pid=4188, ip=172.31.18.216)[0m 2021-02-04 17:29:30,184	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:34075 [rank=1]
[2m[36m(pid=4343, ip=172.31.18.216)[0m 2021-02-04 17:29:51,754	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52717 [rank=0]
[2m[36m(pid=4327, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4178, ip=172.31.18.216)[0m 2021-02-04 17:29:44,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47667 [rank=1]
[2m[36m(pid=4048, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4188, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4328, ip=172.31.18.216)[0m 2021-02-04 17:29:59,039	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:32977 [rank=1]
[2m[36m(pid=4053, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4053, ip=172.31.18.216)[0m 2021-02-04 17:29:07,913	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:52281 [rank=2]
[2m[36m(pid=4327, ip=172.31.18.216)[0m 2021-02-04 17:29:44,750	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47667 [rank=2]
[2m[36m(pid=4882, ip=172.31.18.216)[0m 2021-02-04 17:30:50,603	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:33487 [rank=2]
[2m[36m(pid=4882, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:30:56,373	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00003: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=18268, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.4/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 3/64 CPUs, 3/4 GPUs, 0.0/325.73 GiB heap, 0.0/99.02 GiB objects (0/2.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |           36 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           39 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           39 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           38 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

[2m[36m(pid=4881, ip=172.31.18.216)[0m 2021-02-04 17:30:57,287	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:49553 [rank=0]
[2m[36m(pid=4927, ip=172.31.18.216)[0m 2021-02-04 17:30:57,288	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:49553 [rank=1]
[2m[36m(pid=4881, ip=172.31.18.216)[0m Files already downloaded and verified
[2m[36m(pid=4927, ip=172.31.18.216)[0m Files already downloaded and verified
2021-02-04 17:31:03,115	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=18274, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:31:03,116	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00001: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 5.5/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 3/64 CPUs, 3/4 GPUs, 0.0/325.73 GiB heap, 0.0/99.02 GiB objects (0/2.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |           37 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           39 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           39 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           38 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:31:03,149	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:31:03,213	INFO commands.py:441 -- Shutdown i-0ac149179edeecfcd
2021-02-04 17:31:03,214	INFO command_runner.py:356 -- Fetched IP: 34.215.60.186
2021-02-04 17:31:03,214	INFO log_timer.py:27 -- NodeUpdater: i-0ac149179edeecfcd: Got IP  [LogTimer=0ms]
Warning: Permanently added '34.215.60.186' (ECDSA) to the list of known hosts.

[2m[36m(pid=4896, ip=172.31.18.216)[0m 2021-02-04 17:31:04,946	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:55061 [rank=1]
[2m[36m(pid=4920, ip=172.31.18.216)[0m 2021-02-04 17:31:04,946	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:55061 [rank=0]
Stopped all 13 Ray processes.
Shared connection to 34.215.60.186 closed.

2021-02-04 17:31:10,302	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 7.180 s, which may be a performance bottleneck.
2021-02-04 17:31:10,305	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
== Status ==
Memory usage on this node: 3.6/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 6/32 CPUs, 6/2 GPUs, 0.0/158.2 GiB heap, 0.0/49.51 GiB objects (0/1.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00001 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00001 |           37 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           39 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           39 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           38 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:31:10,320	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00001: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1458, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-*.log files for more information.
2021-02-04 17:31:10,347	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:31:10,431	INFO commands.py:441 -- Shutdown i-083b602e902a78a09
2021-02-04 17:31:10,432	INFO command_runner.py:356 -- Fetched IP: 34.218.250.17
2021-02-04 17:31:10,432	INFO log_timer.py:27 -- NodeUpdater: i-083b602e902a78a09: Got IP  [LogTimer=0ms]
Warning: Permanently added '34.218.250.17' (ECDSA) to the list of known hosts.

Did not find any active Ray processes.
Shared connection to 34.218.250.17 closed.

2021-02-04 17:31:17,303	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 6.982 s, which may be a performance bottleneck.
[2m[36m(pid=8790, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6510, ip=172.31.31.247)[0m 2021-02-04 17:23:20,697	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58667 [rank=0]
[2m[36m(pid=6767, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7069, ip=172.31.31.247)[0m 2021-02-04 17:25:42,847	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37447 [rank=0]
[2m[36m(pid=6510, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8259, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5544, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8789, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6720, ip=172.31.31.247)[0m 2021-02-04 17:24:37,763	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37705 [rank=2]
[2m[36m(pid=5834, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8789, ip=172.31.31.247)[0m 2021-02-04 17:29:51,691	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:38445 [rank=2]
[2m[36m(pid=6748, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5252, ip=172.31.31.247)[0m 2021-02-04 17:21:20,181	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=0]
[2m[36m(pid=8431, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5586, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8121, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8261, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7082, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6742, ip=172.31.31.247)[0m 2021-02-04 17:24:15,920	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41341 [rank=2]
[2m[36m(pid=7813, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8808, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6488, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6497, ip=172.31.31.247)[0m 2021-02-04 17:23:27,998	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:35053 [rank=1]
[2m[36m(pid=8269, ip=172.31.31.247)[0m 2021-02-04 17:28:14,772	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44125 [rank=1]
[2m[36m(pid=8275, ip=172.31.31.247)[0m 2021-02-04 17:28:14,772	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44125 [rank=0]
[2m[36m(pid=6747, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7704, ip=172.31.31.247)[0m 2021-02-04 17:26:27,322	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:52861 [rank=1]
[2m[36m(pid=7725, ip=172.31.31.247)[0m 2021-02-04 17:27:07,354	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:36341 [rank=2]
[2m[36m(pid=6715, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8800, ip=172.31.31.247)[0m 2021-02-04 17:29:44,410	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:54921 [rank=2]
[2m[36m(pid=7757, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7064, ip=172.31.31.247)[0m 2021-02-04 17:24:48,946	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45649 [rank=1]
[2m[36m(pid=8790, ip=172.31.31.247)[0m 2021-02-04 17:29:51,753	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:52717 [rank=2]
[2m[36m(pid=7756, ip=172.31.31.247)[0m 2021-02-04 17:26:39,909	INFO trainable.py:103 -- Trainable.setup took 11.764 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7821, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6720, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8260, ip=172.31.31.247)[0m 2021-02-04 17:28:21,624	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:38561 [rank=2]
[2m[36m(pid=6486, ip=172.31.31.247)[0m 2021-02-04 17:23:49,831	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:40643 [rank=1]
[2m[36m(pid=7148, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5834, ip=172.31.31.247)[0m 2021-02-04 17:22:23,288	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47929 [rank=0]
[2m[36m(pid=8102, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7162, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7757, ip=172.31.31.247)[0m 2021-02-04 17:27:07,354	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:36341 [rank=1]
[2m[36m(pid=5543, ip=172.31.31.247)[0m 2021-02-04 17:22:04,336	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37551 [rank=0]
[2m[36m(pid=5552, ip=172.31.31.247)[0m 2021-02-04 17:22:10,201	INFO trainable.py:103 -- Trainable.setup took 15.146 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7760, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7148, ip=172.31.31.247)[0m 2021-02-04 17:25:07,118	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41027 [rank=2]
[2m[36m(pid=5536, ip=172.31.31.247)[0m 2021-02-04 17:22:11,062	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47623 [rank=0]
[2m[36m(pid=5542, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6496, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7155, ip=172.31.31.247)[0m 2021-02-04 17:24:59,799	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:47803 [rank=2]
[2m[36m(pid=5587, ip=172.31.31.247)[0m 2021-02-04 17:21:41,915	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:43541 [rank=1]
[2m[36m(pid=8808, ip=172.31.31.247)[0m 2021-02-04 17:29:30,203	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46259 [rank=0]
[2m[36m(pid=5553, ip=172.31.31.247)[0m 2021-02-04 17:21:41,915	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:43541 [rank=2]
[2m[36m(pid=7082, ip=172.31.31.247)[0m 2021-02-04 17:25:35,493	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:49265 [rank=1]
[2m[36m(pid=7084, ip=172.31.31.247)[0m 2021-02-04 17:25:35,492	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:49265 [rank=0]
[2m[36m(pid=5585, ip=172.31.31.247)[0m 2021-02-04 17:21:34,575	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:40689 [rank=1]
[2m[36m(pid=7714, ip=172.31.31.247)[0m 2021-02-04 17:27:21,945	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:57171 [rank=0]
[2m[36m(pid=7813, ip=172.31.31.247)[0m 2021-02-04 17:26:54,129	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:33657 [rank=0]
[2m[36m(pid=8799, ip=172.31.31.247)[0m 2021-02-04 17:29:30,204	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46259 [rank=1]
[2m[36m(pid=8102, ip=172.31.31.247)[0m 2021-02-04 17:27:52,811	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45685 [rank=0]
[2m[36m(pid=6776, ip=172.31.31.247)[0m 2021-02-04 17:24:03,332	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:51963 [rank=2]
[2m[36m(pid=6747, ip=172.31.31.247)[0m 2021-02-04 17:24:10,375	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:43589 [rank=1]
[2m[36m(pid=9043, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6487, ip=172.31.31.247)[0m 2021-02-04 17:23:35,387	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:48141 [rank=2]
[2m[36m(pid=7175, ip=172.31.31.247)[0m 2021-02-04 17:24:59,799	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:47803 [rank=1]
[2m[36m(pid=6723, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8114, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6487, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6495, ip=172.31.31.247)[0m 2021-02-04 17:23:35,387	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:48141 [rank=1]
[2m[36m(pid=6715, ip=172.31.31.247)[0m 2021-02-04 17:23:56,734	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:42111 [rank=2]
[2m[36m(pid=5852, ip=172.31.31.247)[0m 2021-02-04 17:22:16,568	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41825 [rank=1]
[2m[36m(pid=9042, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6717, ip=172.31.31.247)[0m 2021-02-04 17:24:28,993	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48423 [rank=2]
[2m[36m(pid=5542, ip=172.31.31.247)[0m 2021-02-04 17:22:10,586	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:44829 [rank=2]
[2m[36m(pid=6716, ip=172.31.31.247)[0m 2021-02-04 17:23:56,733	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:42111 [rank=0]
[2m[36m(pid=8259, ip=172.31.31.247)[0m 2021-02-04 17:28:00,463	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:44217 [rank=2]
[2m[36m(pid=8431, ip=172.31.31.247)[0m 2021-02-04 17:28:29,136	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44843 [rank=0]
[2m[36m(pid=7711, ip=172.31.31.247)[0m 2021-02-04 17:27:21,945	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:57171 [rank=1]
[2m[36m(pid=7065, ip=172.31.31.247)[0m 2021-02-04 17:24:48,945	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45649 [rank=0]
[2m[36m(pid=5537, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7820, ip=172.31.31.247)[0m 2021-02-04 17:26:40,702	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:54763 [rank=0]
[2m[36m(pid=6748, ip=172.31.31.247)[0m 2021-02-04 17:24:10,059	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:59085 [rank=2]
[2m[36m(pid=8096, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7758, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6719, ip=172.31.31.247)[0m 2021-02-04 17:24:09,151	INFO trainable.py:103 -- Trainable.setup took 12.454 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7712, ip=172.31.31.247)[0m 2021-02-04 17:27:28,592	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:59485 [rank=2]
[2m[36m(pid=5586, ip=172.31.31.247)[0m 2021-02-04 17:21:34,575	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:40689 [rank=2]
[2m[36m(pid=5585, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7714, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7704, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6725, ip=172.31.31.247)[0m 2021-02-04 17:24:22,251	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38955 [rank=0]
[2m[36m(pid=6496, ip=172.31.31.247)[0m 2021-02-04 17:23:27,998	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:35053 [rank=0]
[2m[36m(pid=5544, ip=172.31.31.247)[0m 2021-02-04 17:22:04,357	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47479 [rank=0]
[2m[36m(pid=7821, ip=172.31.31.247)[0m 2021-02-04 17:26:40,703	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:54763 [rank=1]
[2m[36m(pid=8114, ip=172.31.31.247)[0m 2021-02-04 17:27:43,077	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:33593 [rank=0]
[2m[36m(pid=5551, ip=172.31.31.247)[0m 2021-02-04 17:22:08,650	INFO trainable.py:103 -- Trainable.setup took 13.595 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=6776, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7093, ip=172.31.31.247)[0m 2021-02-04 17:25:15,873	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:55935 [rank=1]
[2m[36m(pid=8788, ip=172.31.31.247)[0m 2021-02-04 17:29:44,747	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:47667 [rank=0]
[2m[36m(pid=5851, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7091, ip=172.31.31.247)[0m 2021-02-04 17:25:15,872	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:55935 [rank=0]
[2m[36m(pid=5553, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=9036, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6742, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5537, ip=172.31.31.247)[0m 2021-02-04 17:21:20,182	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:46767 [rank=1]
[2m[36m(pid=6741, ip=172.31.31.247)[0m 2021-02-04 17:24:28,103	INFO trainable.py:103 -- Trainable.setup took 12.878 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=7760, ip=172.31.31.247)[0m 2021-02-04 17:27:00,776	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:47955 [rank=2]
[2m[36m(pid=7175, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=9042, ip=172.31.31.247)[0m 2021-02-04 17:29:59,694	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:36601 [rank=0]
[2m[36m(pid=5830, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5851, ip=172.31.31.247)[0m 2021-02-04 17:22:17,370	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38443 [rank=0]
[2m[36m(pid=7820, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6486, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7705, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7725, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7065, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6725, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6497, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6488, ip=172.31.31.247)[0m 2021-02-04 17:23:49,831	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:40643 [rank=2]
[2m[36m(pid=5536, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7064, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8788, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7162, ip=172.31.31.247)[0m 2021-02-04 17:25:07,117	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:41027 [rank=1]
[2m[36m(pid=9043, ip=172.31.31.247)[0m 2021-02-04 17:29:59,038	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:32977 [rank=2]
[2m[36m(pid=8275, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6767, ip=172.31.31.247)[0m 2021-02-04 17:24:03,331	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:51963 [rank=0]
[2m[36m(pid=7066, ip=172.31.31.247)[0m 2021-02-04 17:25:42,848	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37447 [rank=1]
[2m[36m(pid=8260, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5830, ip=172.31.31.247)[0m 2021-02-04 17:22:28,799	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:44279 [rank=2]
[2m[36m(pid=6717, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5831, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5835, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8092, ip=172.31.31.247)[0m 2021-02-04 17:27:28,925	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:38191 [rank=0]
[2m[36m(pid=7758, ip=172.31.31.247)[0m 2021-02-04 17:26:54,130	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:33657 [rank=1]
[2m[36m(pid=8800, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8097, ip=172.31.31.247)[0m 2021-02-04 17:27:52,811	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:45685 [rank=1]
[2m[36m(pid=5835, ip=172.31.31.247)[0m 2021-02-04 17:22:22,811	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:37165 [rank=0]
[2m[36m(pid=8096, ip=172.31.31.247)[0m 2021-02-04 17:28:00,463	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:44217 [rank=1]
[2m[36m(pid=5543, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5587, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6718, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=5831, ip=172.31.31.247)[0m 2021-02-04 17:22:29,479	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.21.209:34093 [rank=2]
[2m[36m(pid=8097, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8810, ip=172.31.31.247)[0m 2021-02-04 17:29:42,730	INFO trainable.py:103 -- Trainable.setup took 13.257 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=6068, ip=172.31.31.247)[0m 2021-02-04 17:22:35,083	ERROR worker.py:390 -- SystemExit was raised from the worker
[2m[36m(pid=6068, ip=172.31.31.247)[0m Traceback (most recent call last):
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "python/ray/_raylet.pyx", line 375, in ray._raylet.execute_task
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 400, in load_actor_class
[2m[36m(pid=6068, ip=172.31.31.247)[0m     job_id, actor_creation_function_descriptor)
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/function_manager.py", line 496, in _load_actor_class_from_gcs
[2m[36m(pid=6068, ip=172.31.31.247)[0m     actor_class = pickle.loads(pickled_class)
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/__init__.py", line 1, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.util.sgd.torch import TorchTrainer
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/__init__.py", line 12, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.util.sgd.torch.torch_trainer import (TorchTrainer,
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 13, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune import Trainable
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/__init__.py", line 2, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune.tune import run_experiments, run
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 18, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune.trial_runner import TrialRunner
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 28, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     from ray.tune.web_server import TuneServer
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/web_server.py", line 16, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     import requests  # `requests` is not part of stdlib.
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/requests/__init__.py", line 43, in <module>
[2m[36m(pid=6068, ip=172.31.31.247)[0m     import urllib3
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "<frozen importlib._bootstrap>", line 980, in _find_and_load
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "<frozen importlib._bootstrap>", line 148, in __enter__
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "<frozen importlib._bootstrap>", line 174, in _get_module_lock
[2m[36m(pid=6068, ip=172.31.31.247)[0m   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
[2m[36m(pid=6068, ip=172.31.31.247)[0m     sys.exit(1)
[2m[36m(pid=6068, ip=172.31.31.247)[0m SystemExit: 1
[2m[36m(pid=7093, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7705, ip=172.31.31.247)[0m 2021-02-04 17:26:27,322	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:52861 [rank=0]
[2m[36m(pid=6511, ip=172.31.31.247)[0m 2021-02-04 17:23:20,697	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58667 [rank=1]
[2m[36m(pid=7155, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6718, ip=172.31.31.247)[0m 2021-02-04 17:24:37,763	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:37705 [rank=1]
[2m[36m(pid=7091, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8799, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8121, ip=172.31.31.247)[0m 2021-02-04 17:27:36,594	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:44109 [rank=2]
[2m[36m(pid=6723, ip=172.31.31.247)[0m 2021-02-04 17:24:28,993	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.18.216:48423 [rank=1]
[2m[36m(pid=8261, ip=172.31.31.247)[0m 2021-02-04 17:28:21,662	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:44301 [rank=0]
[2m[36m(pid=5252, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6716, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7084, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=8269, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6495, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=9036, ip=172.31.31.247)[0m 2021-02-04 17:30:06,747	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:58701 [rank=0]
[2m[36m(pid=5852, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=6511, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[36m(pid=7711, ip=172.31.31.247)[0m Files already downloaded and verified
[2m[33m(raylet, ip=172.31.31.247)[0m [2021-02-04 13:46:32,497 E 258 298] logging.cc:415: *** Aborted at 1612475192 (unix time) try "date -d @1612475192" if you are using GNU date ***
[2m[33m(raylet, ip=172.31.31.247)[0m [2021-02-04 13:46:32,497 E 258 298] logging.cc:415: PC: @                0x0 (unknown)
[2m[36m(pid=9495, ip=172.31.31.247)[0m 2021-02-04 17:31:18,209	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:41925 [rank=2]
[2m[36m(pid=9495, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:31:23,937	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=18467, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
2021-02-04 17:31:23,941	INFO trial_runner.py:890 -- Trial NoFaultToleranceTrainable_39edb_00000: Attempting to restore trial state from last checkpoint.
== Status ==
Memory usage on this node: 7.2/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 3/64 CPUs, 3/4 GPUs, 0.0/325.73 GiB heap, 0.0/99.02 GiB objects (0/2.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00000 | RUNNING  |       | 0.001 |
| NoFaultToleranceTrainable_39edb_00002 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00000 |           40 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
| NoFaultToleranceTrainable_39edb_00002 |           39 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           38 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           38 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:31:23,950	WARNING ray_trial_executor.py:738 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
[2m[36m(pid=9502, ip=172.31.31.247)[0m 2021-02-04 17:31:25,753	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.24.224:50717 [rank=2]
[2m[36m(pid=9502, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:31:31,547	ERROR trial_runner.py:616 -- Trial NoFaultToleranceTrainable_39edb_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DeprecationWarning): ray::NoFaultToleranceTrainable.train_buffered() (pid=18566, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
== Status ==
Memory usage on this node: 7.7/239.9 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 3/64 CPUs, 3/4 GPUs, 0.0/325.73 GiB heap, 0.0/99.02 GiB objects (0/2.0 accelerator_type:M60)
Result logdir: /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+---------------------------------------+----------+-------+-------+
| Trial name                            | status   | loc   |    lr |
|---------------------------------------+----------+-------+-------|
| NoFaultToleranceTrainable_39edb_00002 | RUNNING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00003 | PENDING  |       | 0.1   |
| NoFaultToleranceTrainable_39edb_00001 | PENDING  |       | 0.01  |
| NoFaultToleranceTrainable_39edb_00000 | PENDING  |       | 0.001 |
+---------------------------------------+----------+-------+-------+
Number of errored trials: 4
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                                                                         |
|---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
| NoFaultToleranceTrainable_39edb_00002 |           39 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00002_2_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00003 |           38 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00003_3_lr=0.1_2021-02-04_17-19-50/error.txt   |
| NoFaultToleranceTrainable_39edb_00001 |           38 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00001_1_lr=0.01_2021-02-04_17-19-43/error.txt  |
| NoFaultToleranceTrainable_39edb_00000 |           41 | /home/ray/ray_results/NoFaultToleranceTrainable_2021-02-04_17-19-43/NoFaultToleranceTrainable_39edb_00000_0_lr=0.001_2021-02-04_17-19-43/error.txt |
+---------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

2021-02-04 17:31:31,576	INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:31:31,636	INFO commands.py:441 -- Shutdown i-0d2245f02cade9c2c
2021-02-04 17:31:31,636	INFO command_runner.py:356 -- Fetched IP: 54.202.148.67
2021-02-04 17:31:31,637	INFO log_timer.py:27 -- NodeUpdater: i-0d2245f02cade9c2c: Got IP  [LogTimer=0ms]
Warning: Permanently added '54.202.148.67' (ECDSA) to the list of known hosts.

[2m[36m(pid=9499, ip=172.31.31.247)[0m 2021-02-04 17:31:32,354	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://172.31.31.247:52149 [rank=0]
[2m[36m(pid=9499, ip=172.31.31.247)[0m Files already downloaded and verified
2021-02-04 17:31:43,143	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::NoFaultToleranceTrainable.train_buffered() (pid=18568, ip=172.31.24.224)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 664, in step
    return super(TorchTrainable, self).step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/sgd/torch/torch_trainer.py", line 719, in step
    "Trainable._train is deprecated and will be "
DeprecationWarning: Trainable._train is deprecated and will be removed in a future version of Ray. Override Trainable.step instead.
Error: No such container: ray_container
Shared connection to 54.202.148.67 closed.

2021-02-04 17:31:43,288	WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 11.738 s, which may be a performance bottleneck.
Traceback (most recent call last):
  File "/home/ray/pytorch_pbt_failure.py", line 136, in <module>
    stop={"training_iteration": 1} if args.smoke_test else None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 421, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 360, in step
    iteration=self._iteration, trials=self._trials)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/callback.py", line 172, in on_step_begin
    callback.on_step_begin(**info)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/mock.py", line 122, in on_step_begin
    override_cluster_name=None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 460, in kill_node
    _exec(updater, "ray stop", False, False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 912, in _exec
    shutdown_after_run=shutdown_after_run)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 627, in run
    ssh_options_override_ssh_key=ssh_options_override_ssh_key)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 519, in run
    final_cmd, with_output, exit_on_fail, silent=silent)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 445, in _run_helper
    "Command failed:\n\n  {}\n".format(joined_cmd)) from None
click.exceptions.ClickException: Command failed:

  ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/3d9ed41da7/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@54.202.148.67 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it  ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'"'"'"'"'"'"'"'"''"'"' )'

Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 52.41.99.218
Fetched IP: 52.41.99.218
