Deploying TensorFlow Models

Module 3: Scaling Machine Learning Models with Cloud ML Engine

Effective machine learning requires:

Larger data sets
More feature engineering
More complicated model architectures

Refactor the current taxi cab fare prediction machine learning model:

Read data that does not fit in memory (out-of-memory data)
Make it easy to add new input features
Make the model evaluate as part of training

Scaling TensorFlow Models

Once you have a working TensorFlow model, you can scale it up to more machines and more data

Taking the written model and scaling it out to more machines is essentially just scripting via gcloud commands

Scaling the Training Process

Most machine learning frameworks can handle toy problems and in-memory data sets

But once the data set grows much larger, you need to split the data into batches and run the model across many machines (batching and distribution both matter)
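A minimal sketch of the batching idea, in plain Python rather than TensorFlow: rows are consumed lazily and grouped into fixed-size batches, so the full data set never has to sit in memory at once. The name iter_batches and the batch size are illustrative assumptions, not part of the course code.

```python
import itertools

def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable of rows."""
    iterator = iter(rows)
    while True:
        # islice pulls only the next batch_size items from the iterator.
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Example: ten rows split into batches of four.
batches = list(iter_batches(range(10), 4))
```

The same generator works on an open file handle, which is how a large CSV could be streamed batch by batch.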

Also doing transformations:

Pre-processing (transformation, cropping, de-colorize, etc.)
Feature creation (combine features, eliminate features, transform features)
Train model (also, hyper-parameter tuning)

This is where the cloud comes in again: if the data set is large, these transformations need to run in the cloud, across many machines. The same holds for hyperparameter tuning, where you want to explore different model architectures at scale.

Scaling the Prediction Process

When using the trained model, you still need scaling. To make predictions, you turn the model into a microservice (a web application). With a TensorFlow model, you fit your estimator; then, to predict, you take the estimator and call predict() on it (via Python).

Are all clients ("customers") of the code able to run Python?
Will they all have access to the directory needed to construct the estimator object?
Will they know the feature columns you used to train the model?
The answer to all of these is, NO!
Deploy model as a microservice to serve as a layer between your client and the details of your machine learning model

Microservice architecture:

Need to shield clients from the details of the machine learning prediction (including programming language, features used, etc.)
If clients need a prediction from the model, they bundle everything into a REST API call (with all input variables needed by model)
Web service will take all input variables, convert them into tensors, send them to TensorFlow model, get results back, and convert them back to an API response (HTTP)
If you have millions of clients, and lots of requests coming in simultaneously, need to have a web service that can support this throughput
Weak link is the model evaluation step - this also needs to scale
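The request/response flow in the list above can be sketched in plain Python. This is only an illustration: fake_model stands in for the TensorFlow model, the field names are borrowed from the lab's test.json, and the "predictions" response shape is an assumption rather than the documented service format.

```python
import json

# Assumed feature order; the real model defines its own input columns.
FEATURE_NAMES = ["pickuplon", "pickuplat", "dropofflon", "dropofflat", "passengers"]

def fake_model(batch):
    """Stand-in for the TensorFlow model: returns one dummy score per row."""
    return [float(len(row)) for row in batch]

def handle_request(request_body, model=fake_model):
    """Turn a JSON request into model inputs and the outputs back into JSON."""
    instances = json.loads(request_body)["instances"]
    # Convert each JSON object into an ordered list of floats.
    batch = [[float(inst[name]) for name in FEATURE_NAMES] for inst in instances]
    scores = model(batch)
    return json.dumps({"predictions": [{"score": s} for s in scores]})

request = json.dumps({"instances": [{
    "pickuplon": -73.885262, "pickuplat": 40.773008,
    "dropofflon": -73.987232, "dropofflat": 40.732403, "passengers": 2}]})
response = handle_request(request)
```

The client only ever sees JSON in and JSON out; the feature order and the model call stay hidden behind handle_request.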

Problems in training and problems in prediction are different.

Training problems: scaling out data and training process to more machines.

Prediction problems: scaling up prediction engine to handle high throughput and lots of clients

The first-generation TPU was aimed primarily at prediction (inference) at scale: predicting/evaluating as fast as possible to handle user requests

Cloud ML Engine Workflow

Cloud ML Engine handles scaling for both training and prediction. It is focused on helping TensorFlow models scale up.

Start with CSV files
Explore datasets in Datalab using Pandas, matplotlib, etc.
Do transformations (preprocessing, feature creation, etc.) in Apache Beam, which can handle batch or streaming data. The intent is to express everything as a Dataflow pipeline, so that you can switch seamlessly from batch to streaming without changing the transformation pipeline that feeds ML Engine.

Dataflow workflow:

Work on transformations using a local Apache Beam runner, ensure everything is working
Scale it up to larger data sets by using a Dataflow runner

Cloud ML workflow:

Work on neural network locally using TensorFlow/notebooks/etc., ensure everything is working
Scale it up to execute TF code on GCP using Cloud ML Engine

Packaging TensorFlow Models as Python Modules for Training

To scale up a TensorFlow model to run on Cloud ML Engine, you need to package the model as a Python module.

We then submit the TensorFlow code by submitting this Python package. The task.py and model.py files are the key parts here.

taxifare/
taxifare/PKG-INFO
taxifare/setup.cfg
taxifare/setup.py
taxifare/trainer/
taxifare/trainer/__init__.py
taxifare/trainer/task.py
taxifare/trainer/model.py
taxifare/trainer.egg-info/
taxifare/trainer.egg-info/dependency_links.txt
taxifare/trainer.egg-info/PKG-INFO
taxifare/trainer.egg-info/SOURCES.txt
taxifare/trainer.egg-info/top_level.txt

The TensorFlow code we wrote goes into task.py and model.py (mostly model.py). When we tar up the directory structure above, we get a Python package that can be submitted for training.
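As a rough sketch of the "tar up the directory" step, the following builds an empty copy of the layout in a scratch directory and archives it. The setup.py contents are an assumption (a minimal setuptools file; the name and version mirror the taxifare-0.1.tar.gz archive that appears in the job description later in these notes), and in practice the archive is normally built by the packaging tools rather than by hand.

```python
import os
import tarfile
import tempfile

# Assumed minimal setup.py; name and version mirror the taxifare package.
SETUP_PY = (
    "from setuptools import setup, find_packages\n"
    "setup(name='taxifare', version='0.1', packages=find_packages())\n"
)

def build_skeleton(root):
    """Create an empty taxifare/ package layout under root."""
    trainer = os.path.join(root, "taxifare", "trainer")
    os.makedirs(trainer)
    with open(os.path.join(root, "taxifare", "setup.py"), "w") as f:
        f.write(SETUP_PY)
    for name in ("__init__.py", "task.py", "model.py"):
        open(os.path.join(trainer, name), "w").close()
    return os.path.join(root, "taxifare")

def tar_package(package_dir, archive_path):
    """Write package_dir into a gzipped tarball and return the member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(package_dir, arcname="taxifare")
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())

workdir = tempfile.mkdtemp()
members = tar_package(build_skeleton(workdir),
                      os.path.join(workdir, "taxifare.tar.gz"))
```

The trainer/__init__.py file is what makes trainer importable as a package, so trainer.task can be named as the module to run.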

What is in task.py

task.py:

contains a main method
parses command-line parameters
uses command line parameters to run the model

Example task.py:

Experiment(
    model.build_estimator(
        output_dir,
        embedding_size = embedding_size,
        hidden_units = hidden_units
    ),
    train_input_fn = model.generate_csv_input_fn( train_data_paths, ... ),
    eval_input_fn = model.generate_csv_input_fn( eval_data_paths, ... ),
    eval_metrics = model.get_eval_metrics(),
)

(Note that these refer to functions that must be defined in model.py, which we'll cover in a moment)

Then, use argument parsing to get train_data_paths, for example:

parser.add_argument( '--train_data_paths', required=True )
parser.add_argument( '--num_epochs', ... )
# etc...

This makes code executable as a program, and enables passing information into the program via command line arguments.
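A fuller sketch of the argument parsing, using the flags that appear in the commands in these notes. The default values are assumptions made for this example only; the real task.py may define more flags and different defaults.

```python
import argparse

def parse_args(argv=None):
    """Parse the command-line flags that task.py passes on to the model."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data_paths", required=True)
    parser.add_argument("--eval_data_paths", required=True)
    parser.add_argument("--output_dir", required=True)
    # Default of 1 is an assumption for this sketch.
    parser.add_argument("--num_epochs", type=int, default=1)
    # argparse stores --job-dir under the attribute name job_dir.
    parser.add_argument("--job-dir", default="junk")
    return parser.parse_args(argv)

# Passing a list instead of reading sys.argv makes the parser easy to try out.
args = parse_args([
    "--train_data_paths", "taxi-train*",
    "--eval_data_paths", "taxi-valid.csv",
    "--output_dir", "taxi_trained",
    "--num_epochs", "10",
])
```

The parsed values (args.train_data_paths, args.num_epochs, and so on) are what get handed to the functions in model.py.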

What is in model.py

model.py:

All code from previous chapter (estimator API, etc.) goes into model.py

We need to have a function that returns a function. The outer function takes a filename pattern as an argument (passed in via task.py), and the inner function it returns builds the features and label that TensorFlow needs from those files.

def generate_csv_input_fn( filename, num_epochs = None, ... ):

    def _input_fn():
        input_file_names = tf.train.match_filenames_once(filename)
        filename_queue = tf.train.string_input_producer(
                            input_file_names,
                            num_epochs = num_epochs,
                            shuffle = True
                        )
        reader = tf.TextLineReader()
        _, value = reader.read_up_to(filename_queue, num_records = batch_size)
        value_column = tf.expand_dims(value, -1)
        columns = tf.decode_csv( value_column, record_defaults = DEFAULTS)
        features = dict(zip(CSV_COLUMNS, columns))
        label = features.pop(LABEL_COLUMN)
        return features, label

    return _input_fn
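The "function that returns a function" pattern can be tried without TensorFlow. The sketch below is a plain-Python analogue, not the course code: make_input_fn and the three-column layout (with fare_amount as the label) are assumptions chosen to keep the example small.

```python
import csv
import io

CSV_COLUMNS = ["fare_amount", "pickuplon", "pickuplat"]  # assumed subset of columns
LABEL_COLUMN = "fare_amount"                             # assumed label name

def make_input_fn(csv_text):
    """Return a zero-argument function that yields (features, label) pairs."""
    def _input_fn():
        reader = csv.reader(io.StringIO(csv_text))
        for row in reader:
            # Pair each value with its column name, then split off the label.
            features = dict(zip(CSV_COLUMNS, map(float, row)))
            label = features.pop(LABEL_COLUMN)
            yield features, label
    return _input_fn

# The data source is fixed when the function is created, not when it is called.
input_fn = make_input_fn("12.5,-73.99,40.73\n7.0,-73.88,40.77\n")
examples = list(input_fn())
```

The closure is what lets the framework call the input function with no arguments while the file location still comes from the command line.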

Verifying the Package

To verify that the model package runs as expected, you can run the following test:

export PYTHONPATH=${PYTHONPATH}:/path/to/taxifare
python -m trainer.task \
  --train_data_paths="/path/to/dataset/taxi-train*" \
  --eval_data_paths=/path/to/dataset/taxi-valid.csv \
  --output_dir=/path/to/outputdir \
  --num_epochs=10 \
  --job-dir=/tmp

This simulates the way that the model is run in the cloud.

The PYTHONPATH variable tells Python where to look for modules
The -m flag runs a module called trainer.task
The argparse settings pass the path information from the command line on to the program
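The effect of the module search path can be seen from inside Python. In this sketch, demo_trainer is a hypothetical throwaway package created in a temporary directory; appending that directory to sys.path has a similar effect to adding it to PYTHONPATH before starting the interpreter.

```python
import importlib
import importlib.util
import os
import sys
import tempfile

def make_demo_package(root):
    """Create root/demo_trainer/{__init__,task}.py (hypothetical names)."""
    pkg = os.path.join(root, "demo_trainer")
    os.makedirs(pkg)
    open(os.path.join(pkg, "__init__.py"), "w").close()
    with open(os.path.join(pkg, "task.py"), "w") as f:
        f.write("print('task ran')\n")
    return pkg

root = tempfile.mkdtemp()
make_demo_package(root)

# Before the directory is on the search path, the package cannot be found.
found_before = importlib.util.find_spec("demo_trainer") is not None

# After adding the directory, the dotted module name resolves.
sys.path.append(root)
importlib.invalidate_caches()
found_after = importlib.util.find_spec("demo_trainer.task") is not None
```

Once the module resolves, running it with the -m flag executes task.py as the main program.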

Now that you know it works, how do you scale it up? Use the gcloud command.

Running Packaged Model in the Cloud

Now you can use the gcloud command to submit the model - either locally, or in the cloud.

To run it locally, use "local train":

gcloud ml-engine local train \
    --module-name=trainer.task \
    --package-path=/path/to/taxifare/trainer \
    -- \
    --train_data_paths ... <the rest looks like it did above>

We are running this locally, passing local directories for the package path, the training data, etc.

To run the training task in the cloud, use "jobs submit":

gcloud ml-engine jobs submit training $JOBNAME \
    --region=$REGION \
    --module-name=trainer.task \
    --package-path=/path/to/taxifare/trainer \
    --job-dir=$OUTDIR \
    --staging-bucket=gs://$BUCKET \
    --scale-tier=BASIC \
    -- \
    --train_data_paths ... <the rest looks like it did above>

This does the following:

Submits a training job in the cloud
Specifies the region (the same region where your data lives)
Specifies the module name for the job/model
Specifies the bucket location for temporary (staging) files
Specifies the scale tier, which sets the scale of the resources used (BASIC, STANDARD_1, PREMIUM_1, BASIC_GPU, CUSTOM)

The scale tier determines the cost.
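When the submit command is scripted, it can help to assemble the argument list in one place. The helper below is a hypothetical sketch that only builds the list (it does not run gcloud); it uses the flags that appear in this document's submit commands, and the example values are placeholders.

```python
def build_submit_command(job_name, region, bucket, out_dir, package_path,
                         scale_tier="BASIC", user_args=()):
    """Assemble the argument list for 'gcloud ml-engine jobs submit training'."""
    command = [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--region=" + region,
        "--module-name=trainer.task",
        "--package-path=" + package_path,
        "--job-dir=" + out_dir,
        "--staging-bucket=gs://" + bucket,
        "--scale-tier=" + scale_tier,
        "--",  # everything after this separator is passed to trainer.task
    ]
    return command + list(user_args)

cmd = build_submit_command(
    job_name="demo_job", region="us-central1", bucket="my-bucket",
    out_dir="gs://my-bucket/taxi_trained",
    package_path="/path/to/taxifare/trainer",
    user_args=["--num_epochs=100"],
)
```

Keeping the gcloud flags and the trainer flags on opposite sides of the "--" separator mirrors how the command line is split between the tool and the program it launches.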

The workflow, again, is:

Try out the job locally, and pass it local module name/location
Then submit it to the cloud

We covered training, but what about prediction?

Cloud ML Engine for Prediction

For the training task, we had the following task.py:

Experiment(
    model.build_estimator(
        output_dir,
        embedding_size = embedding_size,
        hidden_units = hidden_units
    ),
    train_input_fn = model.generate_csv_input_fn( train_data_paths, ... ),
    eval_input_fn = model.generate_csv_input_fn( eval_data_paths, ... ),
    eval_metrics = model.get_eval_metrics(),
)

For prediction, we want to make slight modifications:

Experiment(
    model.build_estimator(
        output_dir,
        embedding_size = embedding_size,
        hidden_units = hidden_units
    ),
    train_input_fn = model.generate_csv_input_fn( train_data_paths, ... ),
    eval_input_fn = model.generate_csv_input_fn( eval_data_paths, ... ),

    export_strategies = [saved_model_export_utils.make_export_strategy( 
                model.serving_input_fn,
                default_output_alternative_key = None,
                exports_to_keep = 1
    )],

    eval_metrics = model.get_eval_metrics(),
)

With exports_to_keep = 1, only one export is retained (the most recent one; older exports are removed). This also requires us to define serving_input_fn() in model.py, which maps the JSON inputs a client sends with a prediction request onto the input features that the model expects.

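Example: this creates a placeholder for each input column, and each column is a float32 (only the start of the function is shown; the part that returns the placeholders to the estimator is omitted):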

def serving_input_fn():
    feature_placeholders = {
            column.name : tf.placeholder(tf.float32, [None]) for column in INPUT_COLUMNS
    }

(This is just an example; your input data could have virtually any types.)

Once you've done that, it's time to deploy the trained model to Google Cloud Platform:

You can deploy a locally trained, locally built model
You can deploy a trained model that is stored in a Google Cloud Storage bucket

Here is an example of submitting a model that is located in a Cloud Storage bucket:

MODEL_NAME="taxifare"
MODEL_VERSION="v1"
MODEL_LOCATION="gs://${BUCKET}/taxifare/smallinput/taxi_trained/export/Servo/..."

# Create a model
gcloud ml-engine models create ${MODEL_NAME} \
        --regions $REGION

# Create a new version of this model and where it lives
gcloud ml-engine versions create ${MODEL_VERSION} \
        --model ${MODEL_NAME} \
        --origin ${MODEL_LOCATION}

Creating multiple versions allows you to do A/B testing: send 80% of your traffic to version 1 and 20% to version 2, then gradually shift traffic toward one version or the other.
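These notes do not say where the traffic split happens. One possibility is to do it on the calling side, as in the sketch below, which picks a version name per request with the stated 80/20 weights. The function pick_version and the fixed seed are assumptions for illustration only.

```python
import random

def pick_version(weights, rng):
    """Choose a version name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
weights = {"v1": 0.8, "v2": 0.2}

# Simulate routing 10,000 requests and measure the share sent to v1.
picks = [pick_version(weights, rng) for _ in range(10000)]
share_v1 = picks.count("v1") / len(picks)
```

Shifting traffic gradually then amounts to changing the two weights over time.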

Client interaction with model predictions

Now we cover how the client interacts with the model. Recall from above that the client sends JSON requests to the model. These requests are made as REST calls.

JSON request containing inputs (in a structure called request_data):

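# Imports assumed here: google-api-python-client and oauth2client
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

# Get credentials for user to make API calls
credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1beta1',
                    credentials = credentials,
                    discoveryServiceUrl = 'https://storage.googleapis.com/cloud-ml/discover/ml_v1beta1_discovery.json'
                    )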

# Set the JSON file with model inputs
request_data = [ {  'pickup_longitude' : -73.800001,
                    'pickup_latitude'  :  40.700001,
                    'dropoff_longitude': -73.980001,
                    'dropoff_latitude' :  40.730001,
                    'passenger_count'  : 2
                }]

# Now assemble the URL to which to send the model inputs:
# Set the following information:
# - name of the project
# - name of the model
# - name of the version
parent = 'projects/%s/models/%s/versions/%s' % ( 'cloud-training-demos', 'taxifare', 'v1' )

# Make the API request (call the predict function)
# The instances go in the request body; the resource name is a separate keyword argument
response = api.projects().predict( body = { 'instances' : request_data },
                                   name = parent
                                ).execute()

Recall that we specified the model and version number when we ran gcloud ml-engine models create and gcloud ml-engine versions create:

MODEL_NAME="taxifare"
MODEL_VERSION="v1"
MODEL_LOCATION="gs://${BUCKET}/taxifare/smallinput/taxi_trained/export/Servo/..."
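The link between the model/version names and the request can be sketched without any Google libraries. The two helpers below are hypothetical; they only build the resource name and the JSON body, using the project, model, and version values from the example above.

```python
import json

def version_name(project, model, version):
    """Build the resource name that identifies one deployed model version."""
    return "projects/%s/models/%s/versions/%s" % (project, model, version)

def prediction_body(instances):
    """Wrap a list of input dicts in the 'instances' envelope as JSON text."""
    return json.dumps({"instances": instances})

name = version_name("cloud-training-demos", "taxifare", "v1")
body = prediction_body([{"pickup_longitude": -73.800001, "passenger_count": 2}])
```

Changing MODEL_VERSION at deployment time only changes the last component of the name; the request body stays the same.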

Scaling with Cloud Machine Learning Laboratory

Lab link: https://codelabs.developers.google.com/codelabs/dataeng-machine-learning/index.html?index=#9

Repo: https://github.com/GoogleCloudPlatform/training-data-analyst

Subdirectory: https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/machine_learning/cloudmle

Use a single-region bucket for machine learning training inputs and outputs - enables consistency (fast reading/writing from multiple threads)

The lab will accomplish the following:

Package a TensorFlow model
Run the training locally
Run the training on the cloud
Deploy the model to the cloud
Call the model to make predictions

Start by creating a datalab instance from Cloud Shell:

$ cd training-data-analyst/courses/machine_learning
$ datalab create cloudmle

Then run Datalab on port 8081

Check out the training-data-analyst repository from GitHub into Datalab (using ungit)

Open the cloudmle IPython notebook, clear all cells, and execute them one by one.

Here's what the notebook does:

Imports tensorflow 1.2.0
Sets project/bucket/region settings
Runs gcloud commands via bash magic to set project and region
Runs curl command to get service account name for machine learning engine
Runs gsutil command to authorize service acct to access files in Cloud Storage
"Explores" the taxifare/ Python module using find command (see https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/machine_learning/cloudmle/taxifare)
Finds path to data

The notebook then demonstrates three techniques to train a model and make a prediction using the model.

Method 1: local training using python from command line

Calls the python module to train the neural network locally using the command line
Creates a test case
Makes a prediction using the local model and gcloud

Method 2: local training using gcloud

Calls gcloud to locally train the tensorflow model
Makes a prediction using the local model and gcloud (same technique as previous method)

Method 3: training in Cloud ML Engine using gcloud

Uploads training data to cloud storage
Calls gcloud to upload the model package and train it in the cloud
Results are monitored using the gcloud command to stream the logs
Output log and resulting model are put into a Cloud Storage bucket
Deploys the TensorFlow model using gcloud ml-engine models create and gcloud ml-engine versions create
Makes a prediction using gcloud ml-engine predict (get API credentials for service account, assemble input JSON/dict, pass to model via REST call)

Training and Evaluating Locally from Command Line

Here's the command to train the neural network locally, the manual way:

rm -rf taxifare.tar.gz taxi_trained
export PYTHONPATH=${PYTHONPATH}:${REPO}/courses/machine_learning/cloudmle/taxifare
python -m trainer.task \
   --train_data_paths="${REPO}/courses/machine_learning/datasets/taxi-train*" \
   --eval_data_paths=${REPO}/courses/machine_learning/datasets/taxi-valid.csv  \
   --output_dir=${REPO}/courses/machine_learning/cloudmle/taxi_trained \
   --num_epochs=10 --job-dir=./tmp

And resulting spew of output:

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fcac3312f10>, '_model_dir': '/content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': None, '_tf_random_seed': None, '_environment': 'local', '_num_worker_replicas': 0, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_evaluation_master': '', '_master': ''}
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py:268: __init__ (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05.
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:625: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
2017-10-21 21:38:30.661401: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:38:30.661475: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:38:30.661517: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:38:30.661540: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:38:30.661577: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
INFO:tensorflow:Saving checkpoints for 1 into /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt.
INFO:tensorflow:loss = 286.526, step = 1
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:625: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-10-21-21:38:31
INFO:tensorflow:Restoring parameters from /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt-1
INFO:tensorflow:Evaluation [1/10]
INFO:tensorflow:Evaluation [2/10]
INFO:tensorflow:Evaluation [3/10]
INFO:tensorflow:Evaluation [4/10]
INFO:tensorflow:Evaluation [5/10]
INFO:tensorflow:Evaluation [6/10]
INFO:tensorflow:Evaluation [7/10]
INFO:tensorflow:Evaluation [8/10]
INFO:tensorflow:Evaluation [9/10]
INFO:tensorflow:Evaluation [10/10]
INFO:tensorflow:Finished evaluation at 2017-10-21-21:38:32
INFO:tensorflow:Saving dict for global step 1: global_step = 1, loss = 1983.73, rmse = 44.5637
INFO:tensorflow:Validation (step 1): loss = 1983.73, global_step = 1, rmse = 44.5637
INFO:tensorflow:global_step/sec: 68.117
INFO:tensorflow:loss = 188.12, step = 101 (1.468 sec)
INFO:tensorflow:Saving checkpoints for 160 into /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt.
INFO:tensorflow:Loss for final step: 170.666.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:625: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-10-21-21:38:34
INFO:tensorflow:Restoring parameters from /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt-160
INFO:tensorflow:Evaluation [1/10]
INFO:tensorflow:Evaluation [2/10]
INFO:tensorflow:Evaluation [3/10]
INFO:tensorflow:Evaluation [4/10]
INFO:tensorflow:Evaluation [5/10]
INFO:tensorflow:Evaluation [6/10]
INFO:tensorflow:Evaluation [7/10]
INFO:tensorflow:Evaluation [8/10]
INFO:tensorflow:Evaluation [9/10]
INFO:tensorflow:Evaluation [10/10]
INFO:tensorflow:Finished evaluation at 2017-10-21-21:38:34
INFO:tensorflow:Saving dict for global step 160: global_step = 160, loss = 227.456, rmse = 15.1405
INFO:tensorflow:Restoring parameters from /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt-160
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/export/Servo/1508621914/saved_model.pb

The local model is saved as a saved_model.pb file under taxi_trained/export/Servo/, in a subdirectory named with a timestamp (1508621914 in the run above).

To make a prediction:

$ cat test.json
{"pickuplon": -73.885262,"pickuplat": 40.773008,"dropofflon": -73.987232,"dropofflat": 40.732403,"passengers": 2}

$ model_dir=$(ls ${REPO}/courses/machine_learning/cloudmle/taxi_trained/export/Servo)

$ gcloud ml-engine local predict \
    --model-dir=${REPO}/courses/machine_learning/cloudmle/taxi_trained/export/Servo/${model_dir} \
    --json-instances=./test.json

Output from running the prediction locally with gcloud ml-engine:

SCORES
0.851055

WARNING: 2017-10-21 21:41:46.451303: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:41:46.451371: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:41:46.451391: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:41:46.451404: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:41:46.451417: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
WARNING:root:MetaGraph has multiple signatures 2. Support for multiple signatures is limited. By default we select named signatures.
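The test.json file used above holds one JSON object per line. A small sketch for generating such a file from Python dictionaries follows; write_instances is a hypothetical helper, and the file is written to a temporary directory rather than the lab's working directory.

```python
import json
import os
import tempfile

def write_instances(path, instances):
    """Write one JSON object per line, the layout used by test.json."""
    with open(path, "w") as f:
        for instance in instances:
            f.write(json.dumps(instance) + "\n")

instance = {"pickuplon": -73.885262, "pickuplat": 40.773008,
            "dropofflon": -73.987232, "dropofflat": 40.732403,
            "passengers": 2}
path = os.path.join(tempfile.mkdtemp(), "test.json")

# Two identical instances, just to show the one-object-per-line layout.
write_instances(path, [instance, instance])

with open(path) as f:
    lines = f.read().splitlines()
```

Each line is a complete JSON document on its own, so several test cases can go in a single file.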

Training and Evaluating Locally using Gcloud

Here's the command to train the network locally, the gcloud way:

rm -rf taxifare.tar.gz taxi_trained
gcloud ml-engine local train \
   --module-name=trainer.task \
   --package-path=${REPO}/courses/machine_learning/cloudmle/taxifare/trainer \
   -- \
   --train_data_paths=${REPO}/courses/machine_learning/datasets/taxi-train.csv \
   --eval_data_paths=${REPO}/courses/machine_learning/datasets/taxi-valid.csv  \
   --num_epochs=10 \
   --output_dir=${REPO}/courses/machine_learning/cloudmle/taxi_trained

The output:

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4ce94590>, '_model_dir': '/content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': None, '_tf_random_seed': None, '_environment': u'cloud', '_num_worker_replicas': 0, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_evaluation_master': '', '_master': ''}
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py:268: __init__ (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05.
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:625: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
2017-10-21 21:58:20.069627: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:58:20.069704: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:58:20.069732: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:58:20.069748: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:58:20.069767: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
INFO:tensorflow:Saving checkpoints for 1 into /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt.
INFO:tensorflow:loss = 223.366, step = 1
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:625: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-10-21-21:58:20
INFO:tensorflow:Restoring parameters from /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt-1
INFO:tensorflow:Evaluation [1/10]
INFO:tensorflow:Evaluation [2/10]
INFO:tensorflow:Evaluation [3/10]
INFO:tensorflow:Evaluation [4/10]
INFO:tensorflow:Evaluation [5/10]
INFO:tensorflow:Evaluation [6/10]
INFO:tensorflow:Evaluation [7/10]
INFO:tensorflow:Evaluation [8/10]
INFO:tensorflow:Evaluation [9/10]
INFO:tensorflow:Evaluation [10/10]
INFO:tensorflow:Finished evaluation at 2017-10-21-21:58:20
INFO:tensorflow:Saving dict for global step 1: global_step = 1, loss = 4745.01, rmse = 68.8946
INFO:tensorflow:Validation (step 1): loss = 4745.01, global_step = 1, rmse = 68.8946
INFO:tensorflow:global_step/sec: 70.4532
INFO:tensorflow:loss = 187.626, step = 101 (1.419 sec)
INFO:tensorflow:Saving checkpoints for 160 into /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt.
INFO:tensorflow:Loss for final step: 170.484.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:625: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-10-21-21:58:22
INFO:tensorflow:Restoring parameters from /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt-160
INFO:tensorflow:Evaluation [1/10]
INFO:tensorflow:Evaluation [2/10]
INFO:tensorflow:Evaluation [3/10]
INFO:tensorflow:Evaluation [4/10]
INFO:tensorflow:Evaluation [5/10]
INFO:tensorflow:Evaluation [6/10]
INFO:tensorflow:Evaluation [7/10]
INFO:tensorflow:Evaluation [8/10]
INFO:tensorflow:Evaluation [9/10]
INFO:tensorflow:Evaluation [10/10]
INFO:tensorflow:Finished evaluation at 2017-10-21-21:58:22
INFO:tensorflow:Saving dict for global step 160: global_step = 160, loss = 227.261, rmse = 15.134
INFO:tensorflow:Restoring parameters from /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt-160
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/export/Servo/1508623102/saved_model.pb

The results of training the model locally are in the taxi_trained directory:

$ ls taxi_trained/

checkpoint                                   model.ckpt-160.index
eval                                         model.ckpt-160.meta
events.out.tfevents.1508623099.e26189a31421  model.ckpt-1.data-00000-of-00001
export                                       model.ckpt-1.index
graph.pbtxt                                  model.ckpt-1.meta
model.ckpt-160.data-00000-of-00001

A prediction can be made using the trained model in the same way as the prior section:

$ cat test.json
{"pickuplon": -73.885262,"pickuplat": 40.773008,"dropofflon": -73.987232,"dropofflat": 40.732403,"passengers": 2}

$ model_dir=$(ls ${REPO}/courses/machine_learning/cloudmle/taxi_trained/export/Servo)

$ gcloud ml-engine local predict \
    --model-dir=${REPO}/courses/machine_learning/cloudmle/taxi_trained/export/Servo/${model_dir} \
    --json-instances=./test.json

Training in Cloud using Gcloud

To train the model in the cloud, start by copying the input CSV data to a Cloud Storage bucket:

$ gsutil -m rm -rf gs://${BUCKET}/taxifare/smallinput/
$ gsutil -m cp ${REPO}/courses/machine_learning/datasets/*.csv gs://${BUCKET}/taxifare/smallinput/

Then run the gcloud command to train the model in the cloud:

OUTDIR=gs://${BUCKET}/taxifare/smallinput/taxi_trained
JOBNAME=lab3a_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${REPO}/courses/machine_learning/cloudmle/taxifare/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC \
   --runtime-version=1.0 \
   -- \
   --train_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-train*" \
   --eval_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-valid*"  \
   --output_dir=$OUTDIR \
   --num_epochs=100
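The same submission can be assembled from Python, which makes the job name and paths easier to reuse. The sketch below (Python 3) only builds the argument list; it does not run gcloud. The function name build_submit_command and the example bucket, region, and repo values are placeholders, while the flags are the ones used in the command above.

```python
from datetime import datetime, timezone


def build_submit_command(bucket, region, repo, now=None):
    """Return the gcloud argument list for the training job shown above."""
    now = now or datetime.now(timezone.utc)
    # Same naming scheme as JOBNAME=lab3a_$(date -u +%y%m%d_%H%M%S)
    job_name = "lab3a_" + now.strftime("%y%m%d_%H%M%S")
    base = "gs://%s/taxifare/smallinput" % bucket
    out_dir = base + "/taxi_trained"
    return [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--region=" + region,
        "--module-name=trainer.task",
        "--package-path=%s/courses/machine_learning/cloudmle/taxifare/trainer" % repo,
        "--job-dir=" + out_dir,
        "--staging-bucket=gs://" + bucket,
        "--scale-tier=BASIC",
        "--runtime-version=1.0",
        "--",  # arguments after this separator are passed to trainer.task
        "--train_data_paths=%s/taxi-train*" % base,
        "--eval_data_paths=%s/taxi-valid*" % base,
        "--output_dir=" + out_dir,
        "--num_epochs=100",
    ]


# Placeholder values; substitute your own bucket, region, and repo path.
cmd = build_submit_command(
    "my-bucket", "us-central1", "/path/to/training-data-analyst",
    now=datetime(2017, 10, 21, 22, 36, 47, tzinfo=timezone.utc),
)
print(" ".join(cmd))
```

Passing the list to subprocess.check_call(cmd) would submit the job. Because no shell is involved, the * wildcards reach the trainer unexpanded, which is what the quoted paths in the shell version achieve.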

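The output of the submit command explains how to monitor the job. (The CommandException on the first line comes from the gsutil rm command: the output directory did not exist yet, so there was nothing to remove.)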

CommandException: 1 files/objects could not be removed.
Job [lab3a_171021_223647] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ml-engine jobs describe lab3a_171021_223647

or continue streaming the logs with the command

  $ gcloud ml-engine jobs stream-logs lab3a_171021_223647

Running the describe command prints the job's configuration and its current state:

$ gcloud ml-engine jobs describe lab3a_171021_223647
createTime: '2017-10-21T22:36:49Z'
jobId: lab3a_171021_223647
state: PREPARING
trainingInput:
  args:
  - --train_data_paths=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi-train*
  - --eval_data_paths=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi-valid*
  - --output_dir=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained
  - --num_epochs=100
  jobDir: gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained
  packageUris:
  - gs://charlesreid1-cloudmle/lab3a_171021_223647/16c5fc53cac2844ee51c9d42f8de85b44d3adf769087f3fc2545cf10363454f2/taxifare-0.1.tar.gz
  pythonModule: trainer.task
  region: us-central1
  runtimeVersion: '1.0'
trainingOutput: {}

View job in the Cloud Console at:
https://console.cloud.google.com/ml/jobs/lab3a_171021_223647?project=not-all-broken

View logs at:
https://console.cloud.google.com/logs?resource=ml.googleapis.com%2Fjob_id%2Flab3a_171021_223647&project=not-all-broken
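To wait for a job from a script, the state field can be pulled out of the describe output. Here is a minimal Python sketch that parses text like the output above. The function names are illustrative, and the set of terminal states (SUCCEEDED, FAILED, CANCELLED) is an assumption based on the service's documented job states; only PREPARING appears in these notes.

```python
import re

# Assumed terminal states; the notes above only show a job in PREPARING.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def job_state(describe_output):
    """Return the value of the top-level 'state:' line, or None if absent."""
    match = re.search(
        r"^state:[ \t]*(\S+)[ \t]*$", describe_output, flags=re.MULTILINE
    )
    return match.group(1) if match else None


def is_finished(describe_output):
    """True once the job has reached one of the assumed terminal states."""
    return job_state(describe_output) in TERMINAL_STATES


# Abbreviated copy of the describe output shown above.
sample = """createTime: '2017-10-21T22:36:49Z'
jobId: lab3a_171021_223647
state: PREPARING
trainingOutput: {}
"""
print(job_state(sample), is_finished(sample))
```

A polling loop would call gcloud ml-engine jobs describe at intervals and stop once is_finished returns True.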

Streaming the logs is useful for following the job as it runs. Here is the output of the stream-logs command:

$ gcloud ml-engine jobs stream-logs lab3a_171021_223647

INFO    2017-10-21 15:36:49 -0700    service        Validating job requirements...
INFO    2017-10-21 15:36:49 -0700    service        Job creation request has been successfully validated.
INFO    2017-10-21 15:36:49 -0700    service        Waiting for job to be provisioned.
INFO    2017-10-21 15:36:49 -0700    service        Job lab3a_171021_223647 is queued.
INFO    2017-10-21 15:41:26 -0700    service        Waiting for TensorFlow to start.
INFO    2017-10-21 15:42:28 -0700    master-replica-0        Running task with arguments: --cluster={"master": ["master-2a1d29fbde-0:2222"]} --task={"type": "master", "index": 0} --job={
INFO    2017-10-21 15:42:28 -0700    master-replica-0          "package_uris": ["gs://charlesreid1-cloudmle/lab3a_171021_223647/16c5fc53cac2844ee51c9d42f8de85b44d3adf769087f3fc2545cf10363454f2/taxifare-0.1.tar.gz"],
INFO    2017-10-21 15:42:28 -0700    master-replica-0          "python_module": "trainer.task",
INFO    2017-10-21 15:42:28 -0700    master-replica-0          "args": ["--train_data_paths=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi-train*", "--eval_data_paths=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi-valid*", "--output_dir=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained", "--num_epochs=100"],
INFO    2017-10-21 15:42:28 -0700    master-replica-0          "region": "us-central1",
INFO    2017-10-21 15:42:28 -0700    master-replica-0          "runtime_version": "1.0",
INFO    2017-10-21 15:42:28 -0700    master-replica-0          "job_dir": "gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained"
INFO    2017-10-21 15:42:28 -0700    master-replica-0        }
INFO    2017-10-21 15:42:49 -0700    master-replica-0        Running module trainer.task.
INFO    2017-10-21 15:42:49 -0700    master-replica-0        Downloading the package: gs://charlesreid1-cloudmle/lab3a_171021_223647/16c5fc53cac2844ee51c9d42f8de85b44d3adf769087f3fc2545cf10363454f2/taxifare-0.1.tar.gz
INFO    2017-10-21 15:42:49 -0700    master-replica-0        Running command: gsutil -q cp gs://charlesreid1-cloudmle/lab3a_171021_223647/16c5fc53cac2844ee51c9d42f8de85b44d3adf769087f3fc2545cf10363454f2/taxifare-0.1.tar.gz taxifare-0.1.tar.gz
INFO    2017-10-21 15:43:09 -0700    master-replica-0        ERROR: You have configured your Cloud SDK installation to be fixed to version [138.0.0]. Make sure this is a valid archived Cloud SDK version.
INFO    2017-10-21 15:43:15 -0700    master-replica-0        Installing the package: gs://charlesreid1-cloudmle/lab3a_171021_223647/16c5fc53cac2844ee51c9d42f8de85b44d3adf769087f3fc2545cf10363454f2/taxifare-0.1.tar.gz
INFO    2017-10-21 15:43:15 -0700    master-replica-0        Running command: pip install --user --upgrade --force-reinstall --no-deps taxifare-0.1.tar.gz
INFO    2017-10-21 15:43:15 -0700    master-replica-0        Processing ./taxifare-0.1.tar.gz
INFO    2017-10-21 15:43:16 -0700    master-replica-0        Building wheels for collected packages: taxifare
INFO    2017-10-21 15:43:16 -0700    master-replica-0          Running setup.py bdist_wheel for taxifare: started
INFO    2017-10-21 15:43:16 -0700    master-replica-0        creating '/tmp/tmpATFYnWpip-wheel-/taxifare-0.1-cp27-none-any.whl' and adding '.' to it
INFO    2017-10-21 15:43:16 -0700    master-replica-0        adding 'trainer/model.py'
INFO    2017-10-21 15:43:16 -0700    master-replica-0        adding 'trainer/__init__.py'
INFO    2017-10-21 15:43:16 -0700    master-replica-0        adding 'trainer/task.py'
INFO    2017-10-21 15:43:16 -0700    master-replica-0        adding 'taxifare-0.1.dist-info/DESCRIPTION.rst'
INFO    2017-10-21 15:43:16 -0700    master-replica-0        adding 'taxifare-0.1.dist-info/metadata.json'
INFO    2017-10-21 15:43:16 -0700    master-replica-0        adding 'taxifare-0.1.dist-info/top_level.txt'
INFO    2017-10-21 15:43:16 -0700    master-replica-0        adding 'taxifare-0.1.dist-info/WHEEL'
INFO    2017-10-21 15:43:16 -0700    master-replica-0        adding 'taxifare-0.1.dist-info/METADATA'
INFO    2017-10-21 15:43:16 -0700    master-replica-0        adding 'taxifare-0.1.dist-info/RECORD'
INFO    2017-10-21 15:43:16 -0700    master-replica-0          Running setup.py bdist_wheel for taxifare: finished with status 'done'
INFO    2017-10-21 15:43:16 -0700    master-replica-0          Stored in directory: /root/.cache/pip/wheels/b0/8a/16/c8e53c6e84f363a5aac669e062e3ee5ccde849d32101ce58b5
INFO    2017-10-21 15:43:16 -0700    master-replica-0        Successfully built taxifare
INFO    2017-10-21 15:43:16 -0700    master-replica-0        Installing collected packages: taxifare
INFO    2017-10-21 15:43:16 -0700    master-replica-0        Successfully installed taxifare-0.1
INFO    2017-10-21 15:43:16 -0700    master-replica-0        Running command: pip install --user taxifare-0.1.tar.gz
INFO    2017-10-21 15:43:17 -0700    master-replica-0        Processing ./taxifare-0.1.tar.gz
INFO    2017-10-21 15:43:17 -0700    master-replica-0          Requirement already satisfied (use --upgrade to upgrade): taxifare==0.1 from file:///user_dir/taxifare-0.1.tar.gz in /root/.local/lib/python2.7/site-packages
INFO    2017-10-21 15:43:17 -0700    master-replica-0        Building wheels for collected packages: taxifare
INFO    2017-10-21 15:43:17 -0700    master-replica-0          Running setup.py bdist_wheel for taxifare: started
INFO    2017-10-21 15:43:17 -0700    master-replica-0        creating '/tmp/tmpx9Bps1pip-wheel-/taxifare-0.1-cp27-none-any.whl' and adding '.' to it
INFO    2017-10-21 15:43:17 -0700    master-replica-0        adding 'trainer/model.py'
INFO    2017-10-21 15:43:17 -0700    master-replica-0        adding 'trainer/__init__.py'
INFO    2017-10-21 15:43:17 -0700    master-replica-0        adding 'trainer/task.py'
INFO    2017-10-21 15:43:17 -0700    master-replica-0        adding 'taxifare-0.1.dist-info/DESCRIPTION.rst'
INFO    2017-10-21 15:43:17 -0700    master-replica-0        adding 'taxifare-0.1.dist-info/metadata.json'
INFO    2017-10-21 15:43:17 -0700    master-replica-0        adding 'taxifare-0.1.dist-info/top_level.txt'
INFO    2017-10-21 15:43:17 -0700    master-replica-0        adding 'taxifare-0.1.dist-info/WHEEL'
INFO    2017-10-21 15:43:17 -0700    master-replica-0        adding 'taxifare-0.1.dist-info/METADATA'
INFO    2017-10-21 15:43:17 -0700    master-replica-0        adding 'taxifare-0.1.dist-info/RECORD'
INFO    2017-10-21 15:43:17 -0700    master-replica-0          Running setup.py bdist_wheel for taxifare: finished with status 'done'
INFO    2017-10-21 15:43:17 -0700    master-replica-0          Stored in directory: /root/.cache/pip/wheels/b0/8a/16/c8e53c6e84f363a5aac669e062e3ee5ccde849d32101ce58b5
INFO    2017-10-21 15:43:17 -0700    master-replica-0        Successfully built taxifare
INFO    2017-10-21 15:43:17 -0700    master-replica-0        Running command: python -m trainer.task --train_data_paths=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi-train* --eval_data_paths=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi-valid* --output_dir=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained --num_epochs=100 --job-dir gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained
INFO    2017-10-21 15:43:19 -0700    master-replica-0        Using default config.
INFO    2017-10-21 15:43:19 -0700    master-replica-0        Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': u'master', '_environment': u'cloud', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fd9a6374290>, '_tf_config': gpu_options {
INFO    2017-10-21 15:43:19 -0700    master-replica-0          per_process_gpu_memory_fraction: 1.0
INFO    2017-10-21 15:43:19 -0700    master-replica-0        }
INFO    2017-10-21 15:43:19 -0700    master-replica-0        , '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': ''}
INFO    2017-10-21 15:43:19 -0700    master-replica-0        Create CheckpointSaverHook.
INFO    2017-10-21 15:43:23 -0700    master-replica-0        Saving checkpoints for 1 into gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained/model.ckpt.
INFO    2017-10-21 15:43:30 -0700    master-replica-0        loss = 496.208, step = 1
INFO    2017-10-21 15:43:31 -0700    master-replica-0        Starting evaluation at 2017-10-21-22:43:31
INFO    2017-10-21 15:43:34 -0700    master-replica-0        Evaluation [1/10]
INFO    2017-10-21 15:43:34 -0700    master-replica-0        Evaluation [2/10]
INFO    2017-10-21 15:43:34 -0700    master-replica-0        Evaluation [3/10]
INFO    2017-10-21 15:43:34 -0700    master-replica-0        Evaluation [4/10]
INFO    2017-10-21 15:43:34 -0700    master-replica-0        Evaluation [5/10]
INFO    2017-10-21 15:43:34 -0700    master-replica-0        Evaluation [6/10]
INFO    2017-10-21 15:43:34 -0700    master-replica-0        Evaluation [7/10]
INFO    2017-10-21 15:43:34 -0700    master-replica-0        Evaluation [8/10]
INFO    2017-10-21 15:43:34 -0700    master-replica-0        Evaluation [9/10]
INFO    2017-10-21 15:43:34 -0700    master-replica-0        Evaluation [10/10]
INFO    2017-10-21 15:43:34 -0700    master-replica-0        Finished evaluation at 2017-10-21-22:43:34
INFO    2017-10-21 15:43:34 -0700    master-replica-0        Saving dict for global step 1: global_step = 1, loss = 169.691, rmse = 13.0265
INFO    2017-10-21 15:43:36 -0700    master-replica-0        Validation (step 1): loss = 169.691, global_step = 1, rmse = 13.0265
INFO    2017-10-21 15:44:15 -0700    master-replica-0        global_step/sec: 2.22504
INFO    2017-10-21 15:44:15 -0700    master-replica-0        loss = 82.2369, step = 101
INFO    2017-10-21 15:44:54 -0700    master-replica-0        global_step/sec: 2.61386
INFO    2017-10-21 15:44:54 -0700    master-replica-0        loss = 81.7113, step = 201
INFO    2017-10-21 15:45:32 -0700    master-replica-0        global_step/sec: 2.60342
INFO    2017-10-21 15:45:32 -0700    master-replica-0        loss = 70.4588, step = 301
INFO    2017-10-21 15:46:11 -0700    master-replica-0        global_step/sec: 2.54869
INFO    2017-10-21 15:46:11 -0700    master-replica-0        loss = 79.9016, step = 401

The job took about 20 minutes from submission to completion.
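The training loss can be tracked by picking the "loss = ..., step = ..." lines out of the streamed logs. Below is a minimal Python sketch; the function name training_losses is illustrative, and the pattern assumes the log format shown above (plain decimal loss values, no scientific notation).

```python
import re

# Matches training lines such as "loss = 82.2369, step = 101".
# Evaluation lines ("loss = 169.691, rmse = ...") do not match,
# because ", step = " must follow the loss value directly.
LOSS_RE = re.compile(r"loss = ([0-9.]+), step = (\d+)")


def training_losses(log_lines):
    """Return a list of (step, loss) pairs found in the given log lines."""
    pairs = []
    for line in log_lines:
        m = LOSS_RE.search(line)
        if m:
            pairs.append((int(m.group(2)), float(m.group(1))))
    return pairs


# A few lines copied from the log output above.
sample_log = [
    "INFO    2017-10-21 15:43:30 -0700    master-replica-0        loss = 496.208, step = 1",
    "INFO    2017-10-21 15:43:34 -0700    master-replica-0        Saving dict for global step 1: global_step = 1, loss = 169.691, rmse = 13.0265",
    "INFO    2017-10-21 15:43:36 -0700    master-replica-0        Validation (step 1): loss = 169.691, global_step = 1, rmse = 13.0265",
    "INFO    2017-10-21 15:44:15 -0700    master-replica-0        global_step/sec: 2.22504",
    "INFO    2017-10-21 15:44:15 -0700    master-replica-0        loss = 82.2369, step = 101",
]
for step, loss in training_losses(sample_log):
    print(step, loss)
```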

When the model is trained in Google Cloud ML Engine, the final model is stored in Google Cloud Storage. Once the job finishes, a message like this appears in the logs:

15:54:37.533 SavedModel written to: gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained/export/Servo/1508626468656/saved_model.pb

Deploying the Model to the Cloud

Start by finding the location of the saved model. The log contains the full path, but to extract it programmatically, only the path up to the long numeric directory name (a timestamp) needs to be known. Use gsutil ls and tail to get the last entry:

MODEL_NAME="taxifare"
MODEL_VERSION="v1"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/taxifare/smallinput/taxi_trained/export/Servo | tail -1)
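Taking the last line with tail relies on the listing order. If several exports exist, comparing the directory names as integers is a little more robust. Here is a minimal Python sketch over the lines of a gsutil ls listing; the function name latest_export and the my-bucket paths are illustrative, and the sketch assumes each export directory name is a purely numeric timestamp, as in the logs above.

```python
def latest_export(gsutil_ls_lines):
    """Return the listed path whose final component is the largest integer."""
    candidates = []
    for line in gsutil_ls_lines:
        path = line.strip()
        name = path.rstrip("/").rsplit("/", 1)[-1]
        if name.isdigit():  # skips the parent directory entry and blank lines
            candidates.append((int(name), path))
    if not candidates:
        raise ValueError("no timestamped export directories found")
    return max(candidates)[1]


# Hypothetical listing: the parent directory plus two exports.
listing = [
    "gs://my-bucket/taxifare/smallinput/taxi_trained/export/Servo/",
    "gs://my-bucket/taxifare/smallinput/taxi_trained/export/Servo/1508626468656/",
    "gs://my-bucket/taxifare/smallinput/taxi_trained/export/Servo/999/",
]
print(latest_export(listing))
```

In this listing, tail -1 would return the 999 directory, while the numeric comparison returns the 1508626468656 one.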

Now the gcloud command line utility can be used to deploy the model:

echo "Deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION}

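If an old version of the model needs to be deleted, use gcloud ml-engine versions delete, followed by gcloud ml-engine models delete if the model itself should also be removed (both are shown commented out here):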

#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}

Output from deploy command:

Deleting and deploying taxifare v1 from gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained/export/Servo/1508626468656/ ... this will take a few minutes

Created ml engine model [projects/not-all-broken/models/taxifare].
Creating version (this might take a few minutes)......
..........................................................................................................................................................................................................................................................................................................................................................................................................................done.

The model will take about 5 minutes to deploy. Once the model is deployed, predictions can be made using a REST API call.

Making Predictions in Cloud using Gcloud

To make a prediction using the gcloud command line:

$ cat test.json
{"pickuplon": -73.885262,"pickuplat": 40.773008,"dropofflon": -73.987232,"dropofflat": 40.732403,"passengers": 2}

$ gcloud ml-engine predict --model=taxifare --version=v1 --json-instances=./test.json
OUTPUTS
10.8887

Making Predictions in Cloud using REST API

To make a prediction by calling the REST API, we first need OAuth credentials. These can be obtained from the application default credentials, which here belong to the environment's service account. The credentials are then passed to discovery.build, which constructs a client for version v1 of the ml API from its discovery document:

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

Once the client is built, the model inputs can be sent to it as a request body, and the prediction comes back as a JSON response. The request body lists the inputs under the instances key:

request_data = {'instances':
  [
      {
        'pickuplon': -73.885262,
        'pickuplat': 40.773008,
        'dropofflon': -73.987232,
        'dropofflat': 40.732403,
        'passengers': 2,
      }
  ]
}

# PROJECT is the Google Cloud project ID
parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'taxifare', 'v1')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))
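The response is a plain dictionary, so the predicted fares can be pulled out of it directly. Below is a minimal Python sketch, with assumptions: that a successful response holds a list under a predictions key with one entry per instance, that a failed one holds an error key, and that the output is named outputs (inferred from the OUTPUTS column that gcloud printed earlier). The function name extract_outputs and the sample response are illustrative, not captured output.

```python
def extract_outputs(response, key="outputs"):
    """Return the predicted values from a predict response dictionary.

    Assumes {'predictions': [{key: value}, ...]} on success and
    {'error': message} on failure.
    """
    if "error" in response:
        raise RuntimeError("prediction failed: %s" % response["error"])
    return [prediction[key] for prediction in response["predictions"]]


# Hypothetical response, built from the fare that gcloud printed above.
sample_response = {"predictions": [{"outputs": 10.8887}]}
print(extract_outputs(sample_response))
```

If the model's serving signature names its output differently, pass that name as key.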