mirror of
https://github.com/wassname/ray.git
synced 2026-07-02 22:47:18 +08:00
[docs] Pictures for all the Examples (#5859)
* image
* plot resnet
* hyperparam
* fixup_pictures
* custom_direct
@@ -3,18 +3,22 @@ Examples Overview

.. customgalleryitem::
   :tooltip: Build a simple parameter server using Ray.
   :figure: /images/param_actor.png
   :description: :doc:`/auto_examples/plot_parameter_server`

.. customgalleryitem::
   :tooltip: Asynchronous Advantage Actor Critic agent using Ray.
   :figure: /images/a3c.png
   :description: :doc:`/auto_examples/plot_example-a3c`

.. customgalleryitem::
   :tooltip: Simple parallel asynchronous hyperparameter evaluation.
   :figure: /images/hyperparameter.png
   :description: :doc:`/auto_examples/plot_hyperparameter`

.. customgalleryitem::
   :tooltip: Parallelizing a policy gradient calculation on OpenAI Gym Pong.
   :figure: /images/pong.png
   :description: :doc:`/auto_examples/plot_pong_example`

.. customgalleryitem::
@@ -25,10 +29,6 @@ Examples Overview
   :tooltip: Implementing a simple news reader using Ray.
   :description: :doc:`/auto_examples/plot_newsreader`

.. customgalleryitem::
   :tooltip: Using Ray to train ResNet across multiple GPUs.
   :description: :doc:`/auto_examples/plot_resnet`

.. customgalleryitem::
   :tooltip: Implement a simple streaming application using Ray’s actors.
   :description: :doc:`/auto_examples/plot_streaming`

@@ -25,6 +25,10 @@ To run the application, first install **ray** and then some dependencies:
  pip install opencv-python-headless
  pip install scipy


.. image:: ../images/a3c.png
    :align: center

You can run the code with

.. code-block:: bash

@@ -9,6 +9,9 @@ This script will demonstrate how to use two important parts of the Ray API:
using ``ray.remote`` to define remote functions and ``ray.wait`` to wait for
their results to be ready.
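
As a rough, Ray-free illustration of the launch-then-wait pattern described above, the sketch below uses ``concurrent.futures`` from the Python standard library as a stand-in for ``ray.remote`` and ``ray.wait``. The ``evaluate`` objective and the ``run_search`` helper are hypothetical names chosen for this sketch; they are not taken from the example script.

```python
import concurrent.futures
import random


def evaluate(config):
    """Hypothetical objective: score a learning rate (higher is better)."""
    return config, -abs(config["learning_rate"] - 0.01)


def run_search(num_trials=8, seed=0):
    """Evaluate random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    configs = [{"learning_rate": 10 ** rng.uniform(-4, -1)}
               for _ in range(num_trials)]
    best = None
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        # Launch every evaluation up front (analogous to calling f.remote()).
        pending = {pool.submit(evaluate, c) for c in configs}
        while pending:
            # Block until at least one result is ready (analogous to ray.wait),
            # then process whatever finished and keep waiting on the rest.
            done, pending = concurrent.futures.wait(
                pending, return_when=concurrent.futures.FIRST_COMPLETED)
            for future in done:
                config, score = future.result()
                if best is None or score > best[1]:
                    best = (config, score)
    return configs, best
```

Processing results as they arrive, rather than waiting for the whole batch, is what lets a search loop react to fast trials while slow ones are still running.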

.. image:: ../images/hyperparameter.png
    :align: center

.. important:: For a production-grade implementation of distributed
    hyperparameter tuning, use `Tune`_, a scalable hyperparameter
    tuning library built using Ray's Actor API.

@@ -14,6 +14,10 @@ then be passed back to each Ray actor for more gradient calculation.
This application is adapted, with minimal modifications, from
Andrej Karpathy's `source code`_ (see the accompanying `blog post`_).

.. image:: ../images/pong-arch.svg
    :align: center


To run the application, first install some dependencies.

.. code-block:: bash

@@ -1,103 +0,0 @@
ResNet
======

This code uses ResNet to do data parallel training
across multiple GPUs using Ray. View the `code for this example`_.

To run the example, you will need to install `TensorFlow`_ (at
least version ``1.2.0``, which the training script checks for at startup).
Then you can run the example as follows.

First download the CIFAR-10 or CIFAR-100 dataset.

.. code-block:: bash

  # Get the CIFAR-10 dataset.
  curl -o cifar-10-binary.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
  tar -xvf cifar-10-binary.tar.gz

  # Get the CIFAR-100 dataset.
  curl -o cifar-100-binary.tar.gz https://www.cs.toronto.edu/~kriz/cifar-100-binary.tar.gz
  tar -xvf cifar-100-binary.tar.gz

Then run the training script that matches the dataset you downloaded.

.. code-block:: bash

  # Train Resnet on CIFAR-10.
  python ray/doc/examples/resnet/resnet_main.py \
      --eval_dir=/tmp/resnet-model/eval \
      --train_data_path=cifar-10-batches-bin/data_batch* \
      --eval_data_path=cifar-10-batches-bin/test_batch.bin \
      --dataset=cifar10 \
      --num_gpus=1

  # Train Resnet on CIFAR-100.
  python ray/doc/examples/resnet/resnet_main.py \
      --eval_dir=/tmp/resnet-model/eval \
      --train_data_path=cifar-100-binary/train.bin \
      --eval_data_path=cifar-100-binary/test.bin \
      --dataset=cifar100 \
      --num_gpus=1

To run the training script on a cluster with multiple machines, you will need
to also pass in the flag ``--redis-address=<redis_address>``, where
``<redis_address>`` is the address of the Redis server on the head node.

The script prints the IP address of the node on which the log files are
stored. In the single-node case, you can ignore this and run tensorboard on
the current machine.

.. code-block:: bash

  python -m tensorflow.tensorboard --logdir=/tmp/resnet-model

If you are running Ray on multiple nodes, run the command on the node whose
IP address was printed.

The core of the script is the actor definition.

.. code-block:: python

  @ray.remote(num_gpus=1)
  class ResNetTrainActor(object):
      def __init__(self, data, dataset, num_gpus):
          # data is the preprocessed images and labels extracted from the dataset.
          # Thus, every actor has its own copy of the data.
          # Set the CUDA_VISIBLE_DEVICES environment variable in order to restrict
          # which GPUs TensorFlow uses. Note that this only works if it is done before
          # the call to tf.Session.
          os.environ['CUDA_VISIBLE_DEVICES'] = ','.join([str(i) for i in ray.get_gpu_ids()])
          with tf.Graph().as_default():
              with tf.device('/gpu:0'):
                  # We omit the code here that actually constructs the residual network
                  # and initializes it. Uses the definition in the Tensorflow Resnet Example.

      def compute_steps(self, weights):
          # This method sets the weights in the network, runs some training steps,
          # and returns the new weights. self.model.variables is a TensorFlowVariables
          # class that we pass the train operation into.
          self.model.variables.set_weights(weights)
          for i in range(self.steps):
              self.model.variables.sess.run(self.model.train_op)
          return self.model.variables.get_weights()

The main script first creates one actor for each GPU, or a single actor if
``num_gpus`` is zero.

.. code-block:: python

  train_actors = [ResNetTrainActor.remote(train_data, dataset, num_gpus) for _ in range(num_gpus)]

Then the main loop passes the same weights to every model, performs
updates on each model, averages the updates, and puts the new weights in the
object store.

.. code-block:: python

  while True:
      all_weights = ray.get([actor.compute_steps.remote(weight_id) for actor in train_actors])
      mean_weights = {k: sum([weights[k] for weights in all_weights]) / num_gpus for k in all_weights[0]}
      weight_id = ray.put(mean_weights)
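
The averaging step in the loop above can be tried in isolation, without Ray or TensorFlow. This is a minimal sketch that assumes each weight is a flat list of floats (the real script averages NumPy arrays); ``average_weights`` and the sample weight names are hypothetical.

```python
def average_weights(all_weights):
    """Average a list of {name: list of floats} dicts element-wise."""
    num_replicas = len(all_weights)
    return {
        name: [sum(values) / num_replicas
               for values in zip(*(w[name] for w in all_weights))]
        for name in all_weights[0]
    }


# Two replicas that started from the same weights and trained separately.
replica_weights = [
    {"conv/DW": [1.0, 2.0], "logit/biases": [0.0]},
    {"conv/DW": [3.0, 6.0], "logit/biases": [1.0]},
]
mean_weights = average_weights(replica_weights)
```

Every replica then restarts from ``mean_weights``, which keeps the models synchronized once per round instead of once per gradient step.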

.. _`TensorFlow`: https://www.tensorflow.org/install/
.. _`code for this example`: https://github.com/ray-project/ray/tree/master/doc/examples/resnet
@@ -1,116 +0,0 @@
"""CIFAR dataset input module, with the majority taken from
https://github.com/tensorflow/models/tree/master/resnet.
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf


def build_data(data_path, size, dataset):
    """Creates the queue and preprocessing operations for the dataset.

    Args:
        data_path: Filename for cifar10 data.
        size: The number of images in the dataset.
        dataset: The dataset we are using.

    Returns:
        queue: A Tensorflow queue for extracting the images and labels.
    """
    image_size = 32
    if dataset == "cifar10":
        label_bytes = 1
        label_offset = 0
    elif dataset == "cifar100":
        label_bytes = 1
        label_offset = 1
    depth = 3
    image_bytes = image_size * image_size * depth
    record_bytes = label_bytes + label_offset + image_bytes

    def load_transform(value):
        # Convert these examples to dense labels and processed images.
        record = tf.reshape(tf.decode_raw(value, tf.uint8), [record_bytes])
        label = tf.cast(
            tf.slice(record, [label_offset], [label_bytes]), tf.int32)
        # Convert from string to [depth * height * width] to
        # [depth, height, width]. The image bytes start after both the
        # label offset and the label itself.
        depth_major = tf.reshape(
            tf.slice(record, [label_offset + label_bytes], [image_bytes]),
            [depth, image_size, image_size])
        # Convert from [depth, height, width] to [height, width, depth].
        image = tf.cast(tf.transpose(depth_major, [1, 2, 0]), tf.float32)
        return (image, label)

    # Read examples from files in the filename queue.
    data_files = tf.gfile.Glob(data_path)
    data = tf.data.FixedLengthRecordDataset(
        data_files, record_bytes=record_bytes)
    data = data.map(load_transform)
    data = data.batch(size)
    iterator = data.make_one_shot_iterator()
    return iterator.get_next()


def build_input(data, batch_size, dataset, train):
    """Build CIFAR image and labels.

    Args:
        data: A tuple of (images, labels) for the whole dataset.
        batch_size: Input batch size.
        dataset: The dataset we are using (cifar10 or cifar100).
        train: True if we are training and false if we are testing.

    Returns:
        images: Batches of images of size
            [batch_size, image_size, image_size, 3].
        labels: Batches of labels of size [batch_size, num_classes].

    Raises:
        ValueError: When the specified dataset is not supported.
    """
    image_size = 32
    depth = 3
    num_classes = 10 if dataset == "cifar10" else 100
    images, labels = data
    num_samples = images.shape[0] - images.shape[0] % batch_size
    dataset = tf.data.Dataset.from_tensor_slices(
        (images[:num_samples], labels[:num_samples]))

    def map_train(image, label):
        image = tf.image.resize_image_with_crop_or_pad(image, image_size + 4,
                                                       image_size + 4)
        image = tf.random_crop(image, [image_size, image_size, 3])
        image = tf.image.random_flip_left_right(image)
        image = tf.image.per_image_standardization(image)
        return (image, label)

    def map_test(image, label):
        image = tf.image.resize_image_with_crop_or_pad(image, image_size,
                                                       image_size)
        image = tf.image.per_image_standardization(image)
        return (image, label)

    dataset = dataset.map(map_train if train else map_test)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat()
    if train:
        dataset = dataset.shuffle(buffer_size=16 * batch_size)
    images, labels = dataset.make_one_shot_iterator().get_next()
    images = tf.reshape(images, [batch_size, image_size, image_size, depth])
    labels = tf.reshape(labels, [batch_size, 1])
    indices = tf.reshape(tf.range(0, batch_size, 1), [batch_size, 1])
    labels = tf.sparse_to_dense(
        tf.concat([indices, labels], 1), [batch_size, num_classes], 1.0, 0.0)

    assert len(images.get_shape()) == 4
    assert images.get_shape()[0] == batch_size
    assert images.get_shape()[-1] == 3
    assert len(labels.get_shape()) == 2
    assert labels.get_shape()[0] == batch_size
    assert labels.get_shape()[1] == num_classes
    if not train:
        tf.summary.image("images", images)
    return images, labels
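
The fixed-length record layout that ``build_data`` decodes can be illustrated in plain Python. This is a sketch under the assumption that the files follow the published CIFAR binary format (one label byte for CIFAR-10; a coarse label byte followed by a fine label byte for CIFAR-100; then 3072 image bytes stored channel by channel). ``record_layout`` and ``parse_record`` are hypothetical helpers, not part of the module.

```python
IMAGE_BYTES = 32 * 32 * 3  # height * width * depth


def record_layout(dataset):
    """Return (label_offset, label_bytes, record_bytes) for a dataset."""
    if dataset == "cifar10":
        label_offset, label_bytes = 0, 1
    elif dataset == "cifar100":
        # Byte 0 holds the coarse label; the fine label follows it.
        label_offset, label_bytes = 1, 1
    else:
        raise ValueError("Unsupported dataset: {}".format(dataset))
    return label_offset, label_bytes, label_offset + label_bytes + IMAGE_BYTES


def parse_record(record, dataset):
    """Split one raw record (bytes) into (label, image_bytes)."""
    label_offset, label_bytes, record_bytes = record_layout(dataset)
    if len(record) != record_bytes:
        raise ValueError("Expected {} bytes, got {}".format(
            record_bytes, len(record)))
    label = record[label_offset]
    # The image starts after the label offset and the label itself.
    image_start = label_offset + label_bytes
    return label, record[image_start:image_start + IMAGE_BYTES]
```

For CIFAR-100 the image therefore begins at byte 2 of each 3074-byte record, while for CIFAR-10 it begins at byte 1 of each 3073-byte record.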
@@ -1,257 +0,0 @@
"""ResNet training script, with some code from
https://github.com/tensorflow/models/tree/master/resnet.
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import os
import numpy as np
import ray
import tensorflow as tf

import cifar_input
import resnet_model

# Tensorflow must be at least version 1.2.0 for the example to work.
tf_major = int(tf.__version__.split(".")[0])
tf_minor = int(tf.__version__.split(".")[1])
if (tf_major < 1) or (tf_major == 1 and tf_minor < 2):
    raise Exception("Your Tensorflow version is less than 1.2.0. Please "
                    "update Tensorflow to the latest version.")

parser = argparse.ArgumentParser(description="Run the ResNet example.")
parser.add_argument(
    "--dataset",
    default="cifar10",
    type=str,
    help="Dataset to use: cifar10 or cifar100.")
parser.add_argument(
    "--train_data_path",
    default="cifar-10-batches-bin/data_batch*",
    type=str,
    help="Data path for the training data.")
parser.add_argument(
    "--eval_data_path",
    default="cifar-10-batches-bin/test_batch.bin",
    type=str,
    help="Data path for the testing data.")
parser.add_argument(
    "--eval_dir",
    default="/tmp/resnet-model/eval",
    type=str,
    help="Data path for the tensorboard logs.")
parser.add_argument(
    "--eval_batch_count",
    default=50,
    type=int,
    help="Number of batches to evaluate over.")
parser.add_argument(
    "--num_gpus",
    default=0,
    type=int,
    help="Number of GPUs to use for training.")
parser.add_argument(
    "--redis-address",
    default=None,
    type=str,
    help="The Redis address of the cluster.")

FLAGS = parser.parse_args()

# Determines if the actors require a gpu or not.
use_gpu = 1 if int(FLAGS.num_gpus) > 0 else 0


@ray.remote
def get_data(path, size, dataset):
    # Retrieves all preprocessed images and labels using a tensorflow queue.
    # This only uses the cpu.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    with tf.device("/cpu:0"):
        dataset = cifar_input.build_data(path, size, dataset)
        sess = tf.Session()
        images, labels = sess.run(dataset)
        sess.close()
        return images, labels


@ray.remote(num_gpus=use_gpu)
class ResNetTrainActor(object):
    def __init__(self, data, dataset, num_gpus):
        if num_gpus > 0:
            os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(
                [str(i) for i in ray.get_gpu_ids()])
        hps = resnet_model.HParams(
            batch_size=128,
            num_classes=100 if dataset == "cifar100" else 10,
            min_lrn_rate=0.0001,
            lrn_rate=0.1,
            num_residual_units=5,
            use_bottleneck=False,
            weight_decay_rate=0.0002,
            relu_leakiness=0.1,
            optimizer="mom",
            num_gpus=num_gpus)

        # We seed each actor differently so that each actor operates on a
        # different subset of data.
        if num_gpus > 0:
            tf.set_random_seed(ray.get_gpu_ids()[0] + 1)
        else:
            # Only a single actor in this case.
            tf.set_random_seed(1)

        with tf.device("/gpu:0" if num_gpus > 0 else "/cpu:0"):
            # Build the model.
            images, labels = cifar_input.build_input(data, hps.batch_size,
                                                     dataset, True)
            self.model = resnet_model.ResNet(hps, images, labels, "train")
            self.model.build_graph()
            config = tf.ConfigProto(allow_soft_placement=True)
            config.gpu_options.allow_growth = True
            sess = tf.Session(config=config)
            self.model.variables.set_session(sess)
            init = tf.global_variables_initializer()
            sess.run(init)
            self.steps = 10

    def compute_steps(self, weights):
        # This method sets the weights in the network, trains the network
        # self.steps times, and returns the new weights.
        self.model.variables.set_weights(weights)
        for i in range(self.steps):
            self.model.variables.sess.run(self.model.train_op)
        return self.model.variables.get_weights()

    def get_weights(self):
        # Note that the driver cannot directly access fields of the class,
        # so helper methods must be created.
        return self.model.variables.get_weights()


@ray.remote
class ResNetTestActor(object):
    def __init__(self, data, dataset, eval_batch_count, eval_dir):
        os.environ["CUDA_VISIBLE_DEVICES"] = ""
        hps = resnet_model.HParams(
            batch_size=100,
            num_classes=100 if dataset == "cifar100" else 10,
            min_lrn_rate=0.0001,
            lrn_rate=0.1,
            num_residual_units=5,
            use_bottleneck=False,
            weight_decay_rate=0.0002,
            relu_leakiness=0.1,
            optimizer="mom",
            num_gpus=0)
        with tf.device("/cpu:0"):
            # Builds the testing network.
            images, labels = cifar_input.build_input(data, hps.batch_size,
                                                     dataset, False)
            self.model = resnet_model.ResNet(hps, images, labels, "eval")
            self.model.build_graph()
            config = tf.ConfigProto(allow_soft_placement=True)
            config.gpu_options.allow_growth = True
            sess = tf.Session(config=config)
            self.model.variables.set_session(sess)
            init = tf.global_variables_initializer()
            sess.run(init)

            # Initializing parameters for tensorboard.
            self.best_precision = 0.0
            self.eval_batch_count = eval_batch_count
            self.summary_writer = tf.summary.FileWriter(eval_dir, sess.graph)
        # The IP address of the node where the tensorboard logs are stored.
        self.ip_addr = ray.services.get_node_ip_address()

    def accuracy(self, weights, train_step):
        # Sets the weights, computes the accuracy and other metrics
        # over eval_batches, and outputs to tensorboard.
        self.model.variables.set_weights(weights)
        total_prediction, correct_prediction = 0, 0
        model = self.model
        sess = self.model.variables.sess
        for _ in range(self.eval_batch_count):
            summaries, loss, predictions, truth = sess.run(
                [model.summaries, model.cost, model.predictions, model.labels])

            truth = np.argmax(truth, axis=1)
            predictions = np.argmax(predictions, axis=1)
            correct_prediction += np.sum(truth == predictions)
            total_prediction += predictions.shape[0]

        precision = 1.0 * correct_prediction / total_prediction
        self.best_precision = max(precision, self.best_precision)
        precision_summ = tf.Summary()
        precision_summ.value.add(tag="Precision", simple_value=precision)
        self.summary_writer.add_summary(precision_summ, train_step)
        best_precision_summ = tf.Summary()
        best_precision_summ.value.add(
            tag="Best Precision", simple_value=self.best_precision)
        self.summary_writer.add_summary(best_precision_summ, train_step)
        self.summary_writer.add_summary(summaries, train_step)
        tf.logging.info("loss: %.3f, precision: %.3f, best precision: %.3f" %
                        (loss, precision, self.best_precision))
        self.summary_writer.flush()
        return precision

    def get_ip_addr(self):
        # As above, a helper method must be created to access the field from
        # the driver.
        return self.ip_addr


def train():
    num_gpus = FLAGS.num_gpus
    if FLAGS.redis_address is None:
        ray.init(num_gpus=num_gpus)
    else:
        ray.init(redis_address=FLAGS.redis_address)
    train_data = get_data.remote(FLAGS.train_data_path, 50000, FLAGS.dataset)
    test_data = get_data.remote(FLAGS.eval_data_path, 10000, FLAGS.dataset)
    # Creates an actor for each gpu, or one if only using the cpu. Each actor
    # has access to the dataset.
    if FLAGS.num_gpus > 0:
        train_actors = [
            ResNetTrainActor.remote(train_data, FLAGS.dataset, num_gpus)
            for _ in range(num_gpus)
        ]
    else:
        train_actors = [ResNetTrainActor.remote(train_data, FLAGS.dataset, 0)]
    test_actor = ResNetTestActor.remote(test_data, FLAGS.dataset,
                                        FLAGS.eval_batch_count, FLAGS.eval_dir)
    print("The log files for tensorboard are stored at ip {}.".format(
        ray.get(test_actor.get_ip_addr.remote())))
    step = 0
    weight_id = train_actors[0].get_weights.remote()
    acc_id = test_actor.accuracy.remote(weight_id, step)
    # Correction for dividing the weights by the number of gpus.
    if num_gpus == 0:
        num_gpus = 1
    print("Starting training loop. Use Ctrl-C to exit.")
    try:
        while True:
            all_weights = ray.get([
                actor.compute_steps.remote(weight_id) for actor in train_actors
            ])
            mean_weights = {
                k: (sum(weights[k] for weights in all_weights) / num_gpus)
                for k in all_weights[0]
            }
            weight_id = ray.put(mean_weights)
            step += 10
            if step % 200 == 0:
                # Retrieves the previously computed accuracy and launches a new
                # testing task with the current weights every 200 steps.
                acc = ray.get(acc_id)
                acc_id = test_actor.accuracy.remote(weight_id, step)
                print("Step {}: {:.6f}".format(step - 200, acc))
    except KeyboardInterrupt:
        pass


if __name__ == "__main__":
    train()
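
The driver loop in ``train()`` overlaps evaluation with training: every 200 steps it collects the accuracy that was requested 200 steps earlier and immediately launches a new evaluation. The sketch below reproduces only that scheduling pattern, using a standard-library thread pool in place of the Ray test actor. ``train_steps``, ``evaluate`` and ``run`` are hypothetical stand-ins, and the scalar "weight" is an assumption made to keep the sketch small.

```python
import concurrent.futures


def train_steps(weights, num_steps=10):
    # Hypothetical stand-in for compute_steps: nudge a scalar "weight".
    return weights + 0.001 * num_steps


def evaluate(weights, step):
    # Hypothetical stand-in for the test actor's accuracy method.
    return step, weights


def run(total_steps=600, eval_every=200):
    """Train in rounds of 10 steps, evaluating in the background."""
    reports = []
    weights, step = 0.0, 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        # Evaluate the initial weights while the first rounds train.
        pending = pool.submit(evaluate, weights, step)
        while step < total_steps:
            weights = train_steps(weights)
            step += 10
            if step % eval_every == 0:
                # Collect the evaluation launched eval_every steps ago, then
                # start a new one so it overlaps with further training.
                reports.append(pending.result())
                pending = pool.submit(evaluate, weights, step)
    return reports
```

Because each report belongs to the evaluation started one interval earlier, the reported step lags the current step by ``eval_every``, which is why the original script prints ``step - 200``.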
@@ -1,317 +0,0 @@
"""ResNet model with most of the code taken from
https://github.com/tensorflow/models/tree/master/resnet.

Related papers:
https://arxiv.org/pdf/1603.05027v2.pdf
https://arxiv.org/pdf/1512.03385v1.pdf
https://arxiv.org/pdf/1605.07146v1.pdf
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from collections import namedtuple
import numpy as np

import tensorflow as tf
from tensorflow.python.training import moving_averages

import ray
import ray.experimental.tf_utils

HParams = namedtuple(
    "HParams", "batch_size, num_classes, min_lrn_rate, lrn_rate, "
    "num_residual_units, use_bottleneck, weight_decay_rate, "
    "relu_leakiness, optimizer, num_gpus")


class ResNet(object):
    """ResNet model."""

    def __init__(self, hps, images, labels, mode):
        """ResNet constructor.

        Args:
            hps: Hyperparameters.
            images: Batches of images of size [batch_size, image_size,
                image_size, 3].
            labels: Batches of labels of size [batch_size, num_classes].
            mode: One of 'train' and 'eval'.
        """
        self.hps = hps
        self._images = images
        self.labels = labels
        self.mode = mode

        self._extra_train_ops = []

    def build_graph(self):
        """Build a whole graph for the model."""
        self.global_step = tf.Variable(0, trainable=False)
        self._build_model()
        if self.mode == "train":
            self._build_train_op()
        else:
            # Additional initialization for the test network.
            self.variables = ray.experimental.tf_utils.TensorFlowVariables(
                self.cost)
        self.summaries = tf.summary.merge_all()

    def _stride_arr(self, stride):
        """Map a stride scalar to the stride array for tf.nn.conv2d."""
        return [1, stride, stride, 1]

    def _build_model(self):
        """Build the core model within the graph."""

        with tf.variable_scope("init"):
            x = self._conv("init_conv", self._images, 3, 3, 16,
                           self._stride_arr(1))

        strides = [1, 2, 2]
        activate_before_residual = [True, False, False]
        if self.hps.use_bottleneck:
            res_func = self._bottleneck_residual
            filters = [16, 64, 128, 256]
        else:
            res_func = self._residual
            filters = [16, 16, 32, 64]

        with tf.variable_scope("unit_1_0"):
            x = res_func(x, filters[0], filters[1], self._stride_arr(
                strides[0]), activate_before_residual[0])
        for i in range(1, self.hps.num_residual_units):
            with tf.variable_scope("unit_1_%d" % i):
                x = res_func(x, filters[1], filters[1], self._stride_arr(1),
                             False)

        with tf.variable_scope("unit_2_0"):
            x = res_func(x, filters[1], filters[2], self._stride_arr(
                strides[1]), activate_before_residual[1])
        for i in range(1, self.hps.num_residual_units):
            with tf.variable_scope("unit_2_%d" % i):
                x = res_func(x, filters[2], filters[2], self._stride_arr(1),
                             False)

        with tf.variable_scope("unit_3_0"):
            x = res_func(x, filters[2], filters[3], self._stride_arr(
                strides[2]), activate_before_residual[2])
        for i in range(1, self.hps.num_residual_units):
            with tf.variable_scope("unit_3_%d" % i):
                x = res_func(x, filters[3], filters[3], self._stride_arr(1),
                             False)
        with tf.variable_scope("unit_last"):
            x = self._batch_norm("final_bn", x)
            x = self._relu(x, self.hps.relu_leakiness)
            x = self._global_avg_pool(x)

        with tf.variable_scope("logit"):
            logits = self._fully_connected(x, self.hps.num_classes)
            self.predictions = tf.nn.softmax(logits)

        with tf.variable_scope("costs"):
            xent = tf.nn.softmax_cross_entropy_with_logits(
                logits=logits, labels=self.labels)
            self.cost = tf.reduce_mean(xent, name="xent")
            self.cost += self._decay()

            if self.mode == "eval":
                tf.summary.scalar("cost", self.cost)

    def _build_train_op(self):
        """Build training specific ops for the graph."""
        num_gpus = self.hps.num_gpus if self.hps.num_gpus != 0 else 1
        # The learning rate schedule is dependent on the number of gpus.
        boundaries = [int(20000 * i / np.sqrt(num_gpus)) for i in range(2, 5)]
        values = [0.1, 0.01, 0.001, 0.0001]
        self.lrn_rate = tf.train.piecewise_constant(self.global_step,
                                                    boundaries, values)
        tf.summary.scalar("learning rate", self.lrn_rate)

        if self.hps.optimizer == "sgd":
            optimizer = tf.train.GradientDescentOptimizer(self.lrn_rate)
        elif self.hps.optimizer == "mom":
            optimizer = tf.train.MomentumOptimizer(self.lrn_rate, 0.9)

        apply_op = optimizer.minimize(self.cost, global_step=self.global_step)
        train_ops = [apply_op] + self._extra_train_ops
        self.train_op = tf.group(*train_ops)
        self.variables = ray.experimental.tf_utils.TensorFlowVariables(
            self.train_op)

    def _batch_norm(self, name, x):
        """Batch normalization."""
        with tf.variable_scope(name):
            params_shape = [x.get_shape()[-1]]

            beta = tf.get_variable(
                "beta",
                params_shape,
                tf.float32,
                initializer=tf.constant_initializer(0.0, tf.float32))
            gamma = tf.get_variable(
                "gamma",
                params_shape,
                tf.float32,
                initializer=tf.constant_initializer(1.0, tf.float32))

            if self.mode == "train":
                mean, variance = tf.nn.moments(x, [0, 1, 2], name="moments")

                moving_mean = tf.get_variable(
                    "moving_mean",
                    params_shape,
                    tf.float32,
                    initializer=tf.constant_initializer(0.0, tf.float32),
                    trainable=False)
                moving_variance = tf.get_variable(
                    "moving_variance",
                    params_shape,
                    tf.float32,
                    initializer=tf.constant_initializer(1.0, tf.float32),
                    trainable=False)

                self._extra_train_ops.append(
                    moving_averages.assign_moving_average(
                        moving_mean, mean, 0.9))
                self._extra_train_ops.append(
                    moving_averages.assign_moving_average(
                        moving_variance, variance, 0.9))
            else:
                mean = tf.get_variable(
                    "moving_mean",
                    params_shape,
                    tf.float32,
                    initializer=tf.constant_initializer(0.0, tf.float32),
                    trainable=False)
                variance = tf.get_variable(
                    "moving_variance",
                    params_shape,
                    tf.float32,
                    initializer=tf.constant_initializer(1.0, tf.float32),
                    trainable=False)
                tf.summary.histogram(mean.op.name, mean)
                tf.summary.histogram(variance.op.name, variance)
            # epsilon used to be 1e-5. Maybe 0.001 solves NaN problem in deeper
            # net.
            y = tf.nn.batch_normalization(x, mean, variance, beta, gamma,
                                          0.001)
            y.set_shape(x.get_shape())
            return y

    def _residual(self,
                  x,
                  in_filter,
                  out_filter,
                  stride,
                  activate_before_residual=False):
        """Residual unit with 2 sub layers."""
        if activate_before_residual:
            with tf.variable_scope("shared_activation"):
                x = self._batch_norm("init_bn", x)
                x = self._relu(x, self.hps.relu_leakiness)
                orig_x = x
        else:
            with tf.variable_scope("residual_only_activation"):
                orig_x = x
                x = self._batch_norm("init_bn", x)
                x = self._relu(x, self.hps.relu_leakiness)

        with tf.variable_scope("sub1"):
            x = self._conv("conv1", x, 3, in_filter, out_filter, stride)

        with tf.variable_scope("sub2"):
            x = self._batch_norm("bn2", x)
            x = self._relu(x, self.hps.relu_leakiness)
            x = self._conv("conv2", x, 3, out_filter, out_filter, [1, 1, 1, 1])

        with tf.variable_scope("sub_add"):
            if in_filter != out_filter:
                orig_x = tf.nn.avg_pool(orig_x, stride, stride, "VALID")
                orig_x = tf.pad(
                    orig_x,
                    [[0, 0], [0, 0], [0, 0], [(out_filter - in_filter) // 2,
                                              (out_filter - in_filter) // 2]])
            x += orig_x

        return x

    def _bottleneck_residual(self,
                             x,
                             in_filter,
                             out_filter,
                             stride,
                             activate_before_residual=False):
        """Bottleneck residual unit with 3 sub layers."""
        if activate_before_residual:
            with tf.variable_scope("common_bn_relu"):
                x = self._batch_norm("init_bn", x)
                x = self._relu(x, self.hps.relu_leakiness)
                orig_x = x
        else:
            with tf.variable_scope("residual_bn_relu"):
                orig_x = x
                x = self._batch_norm("init_bn", x)
                x = self._relu(x, self.hps.relu_leakiness)

        # Integer division keeps the filter counts integral even though
        # `from __future__ import division` is in effect.
        with tf.variable_scope("sub1"):
            x = self._conv("conv1", x, 1, in_filter, out_filter // 4, stride)

        with tf.variable_scope("sub2"):
            x = self._batch_norm("bn2", x)
            x = self._relu(x, self.hps.relu_leakiness)
            x = self._conv("conv2", x, 3, out_filter // 4, out_filter // 4,
                           [1, 1, 1, 1])

        with tf.variable_scope("sub3"):
            x = self._batch_norm("bn3", x)
            x = self._relu(x, self.hps.relu_leakiness)
            x = self._conv("conv3", x, 1, out_filter // 4, out_filter,
                           [1, 1, 1, 1])

        with tf.variable_scope("sub_add"):
            if in_filter != out_filter:
                orig_x = self._conv("project", orig_x, 1, in_filter,
                                    out_filter, stride)
            x += orig_x

        return x

    def _decay(self):
        """L2 weight decay loss."""
        costs = []
        for var in tf.trainable_variables():
            if var.op.name.find(r"DW") > 0:
                costs.append(tf.nn.l2_loss(var))

        return tf.multiply(self.hps.weight_decay_rate, tf.add_n(costs))

    def _conv(self, name, x, filter_size, in_filters, out_filters, strides):
        """Convolution."""
        with tf.variable_scope(name):
            n = filter_size * filter_size * out_filters
            kernel = tf.get_variable(
                "DW", [filter_size, filter_size, in_filters, out_filters],
                tf.float32,
                initializer=tf.random_normal_initializer(
                    stddev=np.sqrt(2.0 / n)))
            return tf.nn.conv2d(x, kernel, strides, padding="SAME")

    def _relu(self, x, leakiness=0.0):
        """Relu, with optional leaky support."""
        return tf.where(tf.less(x, 0.0), leakiness * x, x, name="leaky_relu")

    def _fully_connected(self, x, out_dim):
        """FullyConnected layer for final output."""
        x = tf.reshape(x, [self.hps.batch_size, -1])
        w = tf.get_variable(
            "DW", [x.get_shape()[1], out_dim],
            initializer=tf.uniform_unit_scaling_initializer(factor=1.0))
        b = tf.get_variable(
            "biases", [out_dim], initializer=tf.constant_initializer())
        return tf.nn.xw_plus_b(x, w, b)

    def _global_avg_pool(self, x):
        assert x.get_shape().ndims == 4
        return tf.reduce_mean(x, [1, 2])
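
The GPU-dependent learning rate schedule built in ``_build_train_op`` can be checked by hand with a small plain-Python sketch. It assumes that ``tf.train.piecewise_constant`` keeps the earlier value up to and including each boundary step and switches afterwards, which is my understanding of the TensorFlow 1.x behavior. ``lr_boundaries`` and ``learning_rate`` are hypothetical helper names.

```python
import bisect
import math


def lr_boundaries(num_gpus):
    """Step boundaries of the schedule, shrunk by sqrt(num_gpus)."""
    num_gpus = num_gpus if num_gpus != 0 else 1
    return [int(20000 * i / math.sqrt(num_gpus)) for i in range(2, 5)]


def learning_rate(step, num_gpus, values=(0.1, 0.01, 0.001, 0.0001)):
    """Learning rate in effect at a given global step."""
    boundaries = lr_boundaries(num_gpus)
    # bisect_left counts the boundaries strictly below `step`, so a step that
    # equals a boundary still uses the earlier value.
    return values[bisect.bisect_left(boundaries, step)]
```

With one GPU the rate drops at steps 40000, 60000 and 80000; with four GPUs the boundaries halve to 20000, 30000 and 40000, since each averaged round already combines the work of four actors.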
@@ -11,6 +11,8 @@ try:
except NameError:
    FileNotFoundError = IOError

# This is not a top level item in the directory, so we use `../` to refer
# to images located at the top level.
GALLERY_TEMPLATE = """
.. raw:: html

@@ -18,7 +20,7 @@ GALLERY_TEMPLATE = """

.. only:: html

    .. figure:: {thumbnail}
    .. figure:: ../{thumbnail}

        {description}

@@ -71,12 +73,13 @@ class CustomGalleryItemDirective(Directive):
        if "figure" in self.options:
            env = self.state.document.settings.env
            rel_figname, figname = env.relfn2path(self.options["figure"])
            thumbnail = os.path.join("_static/thumbs/",
                                     os.path.basename(figname))

            os.makedirs("_static/thumbs", exist_ok=True)
            thumb_dir = os.path.join(env.srcdir, "_static/thumbs/")
            os.makedirs(thumb_dir, exist_ok=True)
            image_path = os.path.join(thumb_dir, os.path.basename(figname))
            sphinx_gallery.gen_rst.scale_image(figname, image_path, 400, 280)

            sphinx_gallery.gen_rst.scale_image(figname, thumbnail, 400, 280)
            thumbnail = os.path.relpath(image_path, env.srcdir)
        else:
            thumbnail = "/_static/img/thumbnails/default.png"
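
The path handling in this hunk (write the scaled image under the Sphinx source directory, then refer to it by a path relative to that directory) can be sketched on its own. The example uses ``posixpath`` so that the results are the same on every platform; ``thumbnail_paths`` and the sample directory names are hypothetical.

```python
import posixpath


def thumbnail_paths(srcdir, figname):
    """Return (absolute path to write to, path relative to srcdir)."""
    thumb_dir = posixpath.join(srcdir, "_static/thumbs/")
    image_path = posixpath.join(thumb_dir, posixpath.basename(figname))
    # The template receives a path relative to the source directory and
    # prefixes it with "../" itself.
    thumbnail = posixpath.relpath(image_path, srcdir)
    return image_path, thumbnail


image_path, thumbnail = thumbnail_paths("/docs/source",
                                        "/docs/source/images/a3c.png")
```

Anchoring the thumbnail directory at the source directory means the result no longer depends on the working directory from which the documentation build is started.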

Binary file not shown. Size: 24 KiB
Binary file not shown. Size: 24 KiB
File diff suppressed because one or more lines are too long. Size: 32 KiB
Binary file not shown. Size: 5.9 KiB
@@ -263,7 +263,6 @@ Getting Involved
   auto_examples/plot_newsreader.rst
   auto_examples/plot_hyperparameter.rst
   auto_examples/plot_pong_example.rst
   auto_examples/plot_resnet.rst
   auto_examples/plot_streaming.rst
   auto_examples/plot_parameter_server.rst
   auto_examples/plot_example-a3c.rst
