GCDEC/Deploying Tensorflow/Notes
From charlesreid1
Deploying TensorFlow Models
Module 3: Scaling Machine Learning Models with Cloud ML Engine
Effective machine learning requires:
- Larger data sets
- More feature engineering
- More complicated model architectures
Refactor the current taxi cab fare prediction machine learning model:
- Read out-of-memory data (data sets too large to fit in memory)
- Make it easy to add new input features
- Make the model evaluate as part of training
Scaling TensorFlow Models
Once you have a working TensorFlow model, you can scale it up to more machines and more data
Taking the written model and scaling it out to more machines is essentially just scripting via gcloud commands
Scaling the Training Process
Most machine learning frameworks can handle toy problems and in-memory data sets
But if data size becomes much larger, need to be able to split data into batches and run model on many machines (batching and distribution are important)
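The batching idea can be sketched without any framework. The following is a minimal illustration, assuming a hypothetical helper named iter_batches (it is not part of TensorFlow or the course code) that consumes rows lazily and yields fixed-size batches, so the full data set never needs to sit in memory at once:

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable of rows."""
    iterator = iter(rows)
    while True:
        # islice pulls the next batch_size items without reading further ahead.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Example: ten rows split into batches of four (the last batch is short).
batches = list(iter_batches(range(10), 4))
print(batches)
```

Each batch could then be handed to a different worker, which is roughly what distribution adds on top of batching.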
Also doing transformations:
- Pre-processing (transformation, cropping, de-colorize, etc.)
- Feature creation (combine features, eliminate features, transform features)
- Train model (also, hyper-parameter tuning)
The need for the cloud again - if data set is large, need to do these transformations in the cloud, across many machines. Same with hyperparameter tuning - want to explore different model architectures, at scale.
Scaling the Prediction Process
When using the trained model, you still need scaling. To make predictions, you turn a model into a microservice (web application). TensorFlow Model - fit your estimator - then, to predict, take your estimator and call predict() on it (via Python).
- Are all clients ("customers") in the code able to run in Python?
- Will they all have access to the directory needed to construct the estimator object?
- Will they know the feature columns you used to train the model?
- The answer to all of these is, NO!
- Deploy model as a microservice to serve as a layer between your client and the details of your machine learning model
Microservice architecture:
- Need to shield clients from the details of the machine learning model (including programming language, features used, etc.)
- If clients need a prediction from the model, they bundle everything into a REST API call (with all input variables needed by model)
- Web service will take all input variables, convert them into tensors, send them to TensorFlow model, get results back, and convert them back to an API response (HTTP)
- If you have millions of clients, and lots of requests coming in simultaneously, need to have a web service that can support this throughput
- Weak link is the model evaluation step - this also needs to scale
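The request/response flow in the list above can be sketched in plain Python. This is only an illustration under stated assumptions: fake_model_predict is a made-up stand-in for the TensorFlow model, handle_request is a hypothetical name, and the feature names mirror the taxi fare request example that appears later in these notes. A real service would sit behind an HTTP server and call the actual model.

```python
import json

# Feature names borrowed from the taxi fare request example in these notes.
FEATURE_NAMES = [
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude", "passenger_count",
]


def fake_model_predict(columns):
    """Made-up stand-in for the model: returns one number per instance."""
    return [2.5 + 0.5 * count for count in columns["passenger_count"]]


def handle_request(request_body, predict_fn=fake_model_predict):
    """Turn a JSON request into model inputs and the model output into JSON."""
    instances = json.loads(request_body)["instances"]
    # Convert row-oriented JSON into one list per feature (the model inputs).
    columns = {
        name: [float(instance[name]) for instance in instances]
        for name in FEATURE_NAMES
    }
    predictions = predict_fn(columns)
    return json.dumps({"predictions": predictions})


request = json.dumps({"instances": [{
    "pickup_longitude": -73.8, "pickup_latitude": 40.7,
    "dropoff_longitude": -73.98, "dropoff_latitude": 40.73,
    "passenger_count": 2,
}]})
response = handle_request(request)
print(response)
```

The client only ever sees JSON in and JSON out; the feature handling and the model itself stay hidden behind handle_request.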
Problems in training and problems in prediction are different.
Training problems: scaling out data and training process to more machines.
Prediction problems: scaling up prediction engine to handle high throughput and lots of clients
First generation TPU - primarily around prediction (inference) and doing prediction at scale - predicting/evaluating as fast as possible to handle user requests
Cloud ML Engine Workflow
Cloud ML Engine does both the prediction and training scaling. Focused on helping TensorFlow models scale up.
- Start with CSV files
- Explore datasets in Datalab using Pandas, matplotlib, etc.
- Do transformations (preprocessing, feature creation, etc.) in Apache Beam, which can handle batch or streaming data. The intent is to express everything as a Dataflow pipeline, so that you can switch from batch to streaming without changing the transformation pipeline that feeds ML Engine
Dataflow workflow:
- Work on transformations using a local Apache Beam runner, ensure everything is working
- Scale it up to larger data sets by using a Dataflow runner
Cloud ML workflow:
- Work on neural network locally using TensorFlow/notebooks/etc., ensure everything is working
- Scale it up to execute TF code on GCP using Cloud ML Engine
Packaging TensorFlow Models as Python Modules for Training
To scale up a TensorFlow model to run on Cloud ML Engine, need to package the model up as a Python module.
We then submit the TensorFlow code by submitting this Python module. The task.py and model.py files are the key parts here.
taxifare/
taxifare/PKG-INFO
taxifare/setup.cfg
taxifare/setup.py
taxifare/trainer/
taxifare/trainer/__init__.py
taxifare/trainer/task.py
taxifare/trainer/model.py
taxifare/trainer.egg-info/
taxifare/trainer.egg-info/dependency_links.txt
taxifare/trainer.egg-info/PKG-INFO
taxifare/trainer.egg-info/SOURCES.txt
taxifare/trainer.egg-info/top_level.txt
The TensorFlow code we wrote goes into task.py and model.py (mostly model.py). When we tar up the directory structure above, we get a Python module.
What is in task.py
task.py:
- contains a main method
- parses command-line parameters
- uses command line parameters to run the model
Example task.py:
Experiment(
    model.build_estimator(
        output_dir,
        embedding_size = embedding_size,
        hidden_units = hidden_units
    ),
    train_input_fn = model.generate_csv_input_fn(
        train_data_paths, ...
    ),
    eval_input_fn = model.generate_csv_input_fn(
        eval_data_paths, ...
    ),
    eval_metrics = model.get_eval_metrics(),
)
(Note that these refer to functions that must be defined in model.py, which we'll cover in a moment)
Then, use argument parsing to get train_data_paths, for example:
parser.add_argument(
    '--train_data_paths',
    required=True
)
parser.add_argument(
    '--num_epochs',
    ...
)
# etc...
This makes code executable as a program, and enables passing information into the program via command line arguments.
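As a rough sketch of what the argument-parsing half of task.py might look like in full: the flag names below come from the commands used in these notes, while the types, defaults, and the parse_args wrapper are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line flags used to launch training."""
    parser = argparse.ArgumentParser(description="Sketch of task.py argument parsing")
    parser.add_argument("--train_data_paths", required=True)
    parser.add_argument("--eval_data_paths", required=True)
    parser.add_argument("--output_dir", required=True)
    # The type and default here are assumptions, not taken from the course code.
    parser.add_argument("--num_epochs", type=int, default=None)
    # The commands in these notes also pass --job-dir, so the parser accepts it.
    parser.add_argument("--job-dir", default=None)
    return parser.parse_args(argv)


# Passing a list makes the sketch runnable without a real command line.
args = parse_args([
    "--train_data_paths", "taxi-train*",
    "--eval_data_paths", "taxi-valid.csv",
    "--output_dir", "taxi_trained",
    "--num_epochs", "10",
])
print(args.train_data_paths, args.num_epochs, args.job_dir)
```

Note that argparse exposes --job-dir as args.job_dir, since hyphens in flag names become underscores in attribute names.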
What is in model.py
model.py:
- All code from previous chapter (estimator API, etc.) goes into model.py
We need a function that returns a function. The outer function takes a filename as an argument (passed in via task.py); the inner function it returns builds the TensorFlow ops that read that file and return the features and the label.
def generate_csv_input_fn( filename, num_epochs = None, ... ):
    def _input_fn():
        input_file_names = tf.train.match_filenames_once(filename)
        filename_queue = tf.train.string_input_producer(
            input_file_names, num_epochs = num_epochs, shuffle = True
        )
        reader = tf.TextLineReader()
        _, value = reader.read_up_to(filename_queue, num_records = batch_size)
        value_column = tf.expand_dims(value, -1)
        columns = tf.decode_csv( value_column, record_defaults = DEFAULTS)
        features = dict(zip(CSV_COLUMNS, columns))
        label = features.pop(LABEL_COLUMN)
        return features, label
    return _input_fn
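The function-returning-a-function pattern can be tried without TensorFlow. The sketch below is a simplified stand-in, not the course code: it takes CSV text instead of a filename, uses the standard csv module instead of TensorFlow readers, and the column names in CSV_COLUMNS are a hypothetical subset chosen for illustration.

```python
import csv
import io

# Hypothetical column layout; the real model.py defines its own lists.
CSV_COLUMNS = ["fare_amount", "pickuplon", "pickuplat"]
LABEL_COLUMN = "fare_amount"


def generate_csv_input_fn(csv_text):
    """Return a zero-argument function that produces (features, label)."""
    def _input_fn():
        rows = list(csv.reader(io.StringIO(csv_text)))
        # Transpose the rows into one list of floats per column.
        columns = [[float(value) for value in column] for column in zip(*rows)]
        features = dict(zip(CSV_COLUMNS, columns))
        label = features.pop(LABEL_COLUMN)
        return features, label
    return _input_fn


# The caller configures the input once, then hands the closure to the trainer.
input_fn = generate_csv_input_fn("12.5,-73.9,40.7\n8.0,-73.8,40.6\n")
features, label = input_fn()
print(label)
print(sorted(features))
```

The closure lets the training code call input_fn() with no arguments while the data source is fixed at construction time.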
Verifying the Package
To verify that the model package runs as expected, you can run the following test:
export PYTHONPATH=${PYTHONPATH}:/path/to/taxifare
python -m trainer.task \
    --train_data_paths="/path/to/dataset/taxi-train*" \
    --eval_data_paths=/path/to/dataset/taxi-valid.csv \
    --output_dir=/path/to/outputdir \
    --num_epochs=10 \
    --job-dir=/tmp
This simulates the way that the model is run in the cloud.
- Python path variable tells python where to look for modules
- The -m flag runs a module called trainer.task
- The argparse settings pass the path information from the command line on to the program
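The three points above can be reproduced end to end with a throwaway package. This is a sketch under assumptions: the tiny trainer/task.py written here is a made-up placeholder that only parses and prints one flag, and the temporary directory stands in for /path/to/taxifare.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

with tempfile.TemporaryDirectory() as package_root:
    trainer_dir = os.path.join(package_root, "trainer")
    os.makedirs(trainer_dir)
    # An empty __init__.py marks trainer/ as a package.
    open(os.path.join(trainer_dir, "__init__.py"), "w").close()
    with open(os.path.join(trainer_dir, "task.py"), "w") as f:
        f.write(textwrap.dedent("""
            import argparse
            parser = argparse.ArgumentParser()
            parser.add_argument("--num_epochs", type=int)
            print("epochs:", parser.parse_args().num_epochs)
        """))

    # Equivalent of: export PYTHONPATH=${PYTHONPATH}:<package_root>
    env = dict(os.environ)
    env["PYTHONPATH"] = os.pathsep.join(
        path for path in (env.get("PYTHONPATH"), package_root) if path)

    # Equivalent of: python -m trainer.task --num_epochs=10
    result = subprocess.run(
        [sys.executable, "-m", "trainer.task", "--num_epochs=10"],
        env=env, cwd=tempfile.gettempdir(),
        stdout=subprocess.PIPE, universal_newlines=True, check=True)
    output = result.stdout.strip()

print(output)
```

Without the PYTHONPATH entry, the -m lookup would fail because Python would not know where to find the trainer package.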
Now that you know it works, how do you scale it up? Use gcloud command.
Running Packaged Model in the Cloud
Now you can use the gcloud command to submit the model - either locally, or in the cloud.
To run it locally, use "local train":
gcloud ml-engine local train \
    --module-name=trainer.task \
    --package-path=/path/to/taxifare/trainer \
    -- \
    --train_data_paths ... <the rest looks like it did above>
We are running this locally, passing it local directories for the package path, local directories for the training data, etc.
To run the training task in the cloud, use "jobs submit":
gcloud ml-engine jobs submit training $JOBNAME \
    --region=$REGION \
    --module-name=trainer.task \
    --package-path=/path/to/taxifare/trainer \
    --job-dir=$OUTDIR \
    --staging-bucket=gs://$BUCKET \
    --scale-tier=BASIC \
    -- \
    --train_data_paths ... <the rest looks like it did above>
Does the following:
- Submits a training job in the cloud
- Specifies the region (same region as where your data lives)
- Specifies the module name for the job/model
- Specifies the bucket location to put temporary files
- Specifies the scale tier, which sets the scale of the resources used (BASIC, STANDARD_1, PREMIUM_1, BASIC_GPU, CUSTOM)
The scale tier determines the cost.
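Since scaling out is essentially scripting around gcloud, one way to keep the pieces straight is to assemble the command as an argument list. The sketch below only builds the list and does not run anything; build_submit_command is a hypothetical helper, and the bucket and job names in the example call are placeholders. It uses only flags that appear in the commands in these notes.

```python
def build_submit_command(job_name, region, package_path, job_dir,
                         staging_bucket, scale_tier, user_args):
    """Assemble the argument list for `gcloud ml-engine jobs submit training`."""
    command = [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--region=" + region,
        "--module-name=trainer.task",
        "--package-path=" + package_path,
        "--job-dir=" + job_dir,
        "--staging-bucket=" + staging_bucket,
        "--scale-tier=" + scale_tier,
        "--",  # everything after this goes to trainer.task, not to gcloud
    ]
    return command + list(user_args)


command = build_submit_command(
    job_name="lab3a_example",                 # placeholder job name
    region="us-central1",
    package_path="/path/to/taxifare/trainer",
    job_dir="gs://my-bucket/taxi_trained",    # placeholder bucket
    staging_bucket="gs://my-bucket",
    scale_tier="BASIC",
    user_args=["--num_epochs=100"],
)
print(" ".join(command))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues with wildcard paths such as taxi-train*.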
The workflow, again, is:
- Try out the job locally, and pass it local module name/location
- Then submit it to the cloud
We covered training, but what about prediction?
Cloud ML Engine for Prediction
For the training task, we had the following task.py:
Experiment(
    model.build_estimator(
        output_dir,
        embedding_size = embedding_size,
        hidden_units = hidden_units
    ),
    train_input_fn = model.generate_csv_input_fn(
        train_data_paths, ...
    ),
    eval_input_fn = model.generate_csv_input_fn(
        eval_data_paths, ...
    ),
    eval_metrics = model.get_eval_metrics(),
)
For prediction, we want to make slight modifications:
Experiment(
    model.build_estimator(
        output_dir,
        embedding_size = embedding_size,
        hidden_units = hidden_units
    ),
    train_input_fn = model.generate_csv_input_fn(
        train_data_paths, ...
    ),
    eval_input_fn = model.generate_csv_input_fn(
        eval_data_paths, ...
    ),
    export_strategies = [saved_model_export_utils.make_export_strategy(
        model.serving_input_fn,
        default_output_alternative_key = None,
        exports_to_keep = 1
    )],
    eval_metrics = model.get_eval_metrics(),
)
This keeps only one export (exports_to_keep = 1, so older exports are removed). It also requires us to define serving_input_fn() in model.py, which is the function that parses the JSON that the client sends when it requests a prediction. It creates all of the input features that the model expects.
Example: this creates placeholders for each input column, and each column is a float32:
def serving_input_fn():
    feature_placeholders = {
        column.name : tf.placeholder(tf.float32, [None])
        for column in INPUT_COLUMNS
    }
    # (remainder of the function omitted in these notes)
(This is just an example, could have virtually any kind of types for your input data.)
Once you've done that, it's time to deploy the trained model to Google Cloud Platform:
- Can deploy a locally-trained, locally-built model
- Can deploy a trained model that is stored in a Google Cloud Storage bucket
Here is an example of submitting a model that is located in a Cloud Storage bucket:
MODEL_NAME="taxifare"
MODEL_VERSION="v1"
MODEL_LOCATION="gs://${BUCKET}/taxifare/smallinput/taxi_trained/export/Servo/..."

# Create a model
gcloud ml-engine models create ${MODEL_NAME} \
    --regions $REGION

# Create a new version of this model and where it lives
gcloud ml-engine versions create ${MODEL_VERSION} \
    --model ${MODEL_NAME} \
    --origin ${MODEL_LOCATION}
Creating multiple versions allows you to do A/B testing... send 80% of your traffic to version 1, 20% of your traffic to version 2, and gradually scale up one model version or the other.
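To make the 80/20 idea concrete, here is a small client-side simulation. It is an illustration only: these notes do not describe how the traffic split is actually implemented, and choose_version is a hypothetical helper that picks a version name at random in proportion to a weight.

```python
import random


def choose_version(rng, weights):
    """Pick a version name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
weights = {"v1": 0.8, "v2": 0.2}
counts = {"v1": 0, "v2": 0}
for _ in range(10000):
    counts[choose_version(rng, weights)] += 1
print(counts)
```

Gradually shifting traffic then amounts to changing the weights over time, for example from 0.8/0.2 toward 0.5/0.5, while comparing the predictions from each version.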
Client interaction with model predictions
Now we cover how the client interacts with the model. Recall from above that the client sends JSON requests to the model. These requests are made via REST calls.
JSON request containing inputs (in a structure called request_data):
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

# Get credentials for user to make API calls
credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1beta1', credentials = credentials,
    discoveryServiceUrl = 'https://storage.googleapis.com/cloud-ml/discover/ml_v1beta1_discovery.json'
)

# Set the model inputs that will be sent as JSON
request_data = [
    {
        'pickup_longitude' : -73.800001,
        'pickup_latitude' : 40.700001,
        'dropoff_longitude': -73.980001,
        'dropoff_latitude' : 40.730001,
        'passenger_count' : 2
    }]

# Now assemble the URL to which to send the model inputs:
# Set the following information:
# - name of the project
# - name of the model
# - name of the version
parent = 'projects/%s/models/%s/versions/%s' % (
    'cloud-training-demos', 'taxifare', 'v1'
)

# Make the API request (call the predict function);
# body and name are separate keyword arguments
response = api.projects().predict(
    body = { 'instances' : request_data },
    name = parent
).execute()
Recall that we specified the model and version number when we ran gcloud ml-engine models create and gcloud ml-engine versions create:
MODEL_NAME="taxifare"
MODEL_VERSION="v1"
MODEL_LOCATION="gs://${BUCKET}/taxifare/smallinput/taxi_trained/export/Servo/..."
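The link between these shell variables and the name used in the prediction request can be written as a small helper. This is a sketch: version_resource_name is a hypothetical function name, and the project, model, and version values are the ones used in the request example in these notes.

```python
def version_resource_name(project, model_name, model_version):
    """Build the resource name that identifies one deployed model version."""
    return "projects/%s/models/%s/versions/%s" % (
        project, model_name, model_version)


name = version_resource_name("cloud-training-demos", "taxifare", "v1")
print(name)  # projects/cloud-training-demos/models/taxifare/versions/v1
```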
Scaling with Cloud Machine Learning Laboratory
Lab link: https://codelabs.developers.google.com/codelabs/dataeng-machine-learning/index.html?index=#9
Repo: https://github.com/GoogleCloudPlatform/training-data-analyst
Subdirectory: https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/machine_learning/cloudmle
Use a single-region bucket for machine learning training inputs and outputs - enables consistency (fast reading/writing from multiple threads)
The lab will accomplish the following:
- Package a TensorFlow model
- Run the training locally
- Run the training on the cloud
- Deploy the model to the cloud
- Call the model to make predictions
Start by creating a datalab instance from Cloud Shell:
$ cd training-data-analyst/courses/machine_learning
$ datalab create cloudmle
Then run Datalab on port 8081
Check out the training-data-analyst repo from GitHub into Datalab (using ungit)
Open the cloudmle IPython notebook, clear all cells, and execute them one by one.
Here's what the notebook does:
- Imports tensorflow 1.2.0
- Sets project/bucket/region settings
- Runs gcloud commands via bash magic to set project and region
- Runs curl command to get service account name for machine learning engine
- Runs gsutil command to authorize service acct to access files in Cloud Storage
- "Explores" the taxifare/ Python module using find command (see https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/machine_learning/cloudmle/taxifare)
- Finds path to data
The notebook then demonstrates three techniques to train a model and make a prediction using the model.
Method 1: local training using python from command line
- Calls the python module to train the neural network locally using the command line
- Creates a test case
- Makes a prediction using the local model and gcloud
Method 2: local training using gcloud
- Calls gcloud to locally train the tensorflow model
- Makes a prediction using the local model and gcloud (same technique as previous method)
Method 3: training in Cloud ML Engine using gcloud
- Uploads training data to cloud storage
- Calls gcloud to upload the model package and train it in the cloud
- Results are monitored using the gcloud command to stream the logs
- Output log and resulting model are put into cloud storage bucket
- Deploy the tensorflow model using gcloud ml-engine models create
- Make prediction using gcloud ml-engine predict (get API credentials for service account, assemble input JSON/dict, pass to model via REST call)
Training and Evaluating Locally from Command Line
Here's the command to train the neural network locally, the manual way:
rm -rf taxifare.tar.gz taxi_trained
export PYTHONPATH=${PYTHONPATH}:${REPO}/courses/machine_learning/cloudmle/taxifare
python -m trainer.task \
    --train_data_paths="${REPO}/courses/machine_learning/datasets/taxi-train*" \
    --eval_data_paths=${REPO}/courses/machine_learning/datasets/taxi-valid.csv \
    --output_dir=${REPO}/courses/machine_learning/cloudmle/taxi_trained \
    --num_epochs=10 --job-dir=./tmp
And resulting spew of output:
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fcac3312f10>, '_model_dir': '/content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': None, '_tf_random_seed': None, '_environment': 'local', '_num_worker_replicas': 0, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options { per_process_gpu_memory_fraction: 1.0 } , '_evaluation_master': '', '_master': ''}
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py:268: __init__ (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05. Instructions for updating: Monitors are deprecated. Please use tf.train.SessionRunHook.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:625: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30. Instructions for updating: Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
2017-10-21 21:38:30.661401: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:38:30.661475: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:38:30.661517: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:38:30.661540: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:38:30.661577: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
INFO:tensorflow:Saving checkpoints for 1 into /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt.
INFO:tensorflow:loss = 286.526, step = 1
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:625: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30. Instructions for updating: Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-10-21-21:38:31
INFO:tensorflow:Restoring parameters from /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt-1
INFO:tensorflow:Evaluation [1/10]
INFO:tensorflow:Evaluation [2/10]
INFO:tensorflow:Evaluation [3/10]
INFO:tensorflow:Evaluation [4/10]
INFO:tensorflow:Evaluation [5/10]
INFO:tensorflow:Evaluation [6/10]
INFO:tensorflow:Evaluation [7/10]
INFO:tensorflow:Evaluation [8/10]
INFO:tensorflow:Evaluation [9/10]
INFO:tensorflow:Evaluation [10/10]
INFO:tensorflow:Finished evaluation at 2017-10-21-21:38:32
INFO:tensorflow:Saving dict for global step 1: global_step = 1, loss = 1983.73, rmse = 44.5637
INFO:tensorflow:Validation (step 1): loss = 1983.73, global_step = 1, rmse = 44.5637
INFO:tensorflow:global_step/sec: 68.117
INFO:tensorflow:loss = 188.12, step = 101 (1.468 sec)
INFO:tensorflow:Saving checkpoints for 160 into /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt.
INFO:tensorflow:Loss for final step: 170.666.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:625: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30. Instructions for updating: Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-10-21-21:38:34
INFO:tensorflow:Restoring parameters from /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt-160
INFO:tensorflow:Evaluation [1/10]
INFO:tensorflow:Evaluation [2/10]
INFO:tensorflow:Evaluation [3/10]
INFO:tensorflow:Evaluation [4/10]
INFO:tensorflow:Evaluation [5/10]
INFO:tensorflow:Evaluation [6/10]
INFO:tensorflow:Evaluation [7/10]
INFO:tensorflow:Evaluation [8/10]
INFO:tensorflow:Evaluation [9/10]
INFO:tensorflow:Evaluation [10/10]
INFO:tensorflow:Finished evaluation at 2017-10-21-21:38:34
INFO:tensorflow:Saving dict for global step 160: global_step = 160, loss = 227.456, rmse = 15.1405
INFO:tensorflow:Restoring parameters from /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt-160
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/export/Servo/1508621914/saved_model.pb
The local model is saved as saved_model.pb under taxi_trained/export/Servo/, in a timestamped subdirectory (1508621914 in the log above).
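The useful numbers in that output are the evaluation metrics. A short sketch for pulling them out of a log line follows; it assumes the "Saving dict for global step" line format seen in the log in these notes, and parse_eval_line is a hypothetical helper, not part of TensorFlow.

```python
import re

# Matches lines such as:
# INFO:tensorflow:Saving dict for global step 160: global_step = 160, loss = ...
METRICS_PATTERN = re.compile(r"Saving dict for global step (\d+): (.*)$")


def parse_eval_line(line):
    """Return the metrics on an evaluation log line as a dict, or None."""
    match = METRICS_PATTERN.search(line)
    if match is None:
        return None
    metrics = {}
    for pair in match.group(2).split(", "):
        key, value = pair.split(" = ")
        metrics[key] = float(value)
    return metrics


line = ("INFO:tensorflow:Saving dict for global step 160: "
        "global_step = 160, loss = 227.456, rmse = 15.1405")
metrics = parse_eval_line(line)
print(metrics["rmse"])  # 15.1405
```

Comparing the rmse at step 1 (44.5637) with the rmse at step 160 (15.1405) is a quick check that training actually reduced the error.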
To make a prediction:
$ cat test.json
{"pickuplon": -73.885262,"pickuplat": 40.773008,"dropofflon": -73.987232,"dropofflat": 40.732403,"passengers": 2}

$ model_dir=$(ls ${REPO}/courses/machine_learning/cloudmle/taxi_trained/export/Servo)
$ gcloud ml-engine local predict \
    --model-dir=${REPO}/courses/machine_learning/cloudmle/taxi_trained/export/Servo/${model_dir} \
    --json-instances=./test.json
Output from running the prediction locally with gcloud ml-engine:
SCORES
0.851055
WARNING: 2017-10-21 21:41:46.451303: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:41:46.451371: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:41:46.451391: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:41:46.451404: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:41:46.451417: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
WARNING:root:MetaGraph has multiple signatures 2. Support for multiple signatures is limited. By default we select named signatures.
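The test.json input used for this prediction can also be generated from Python instead of being typed by hand. The sketch below assumes the file should hold one JSON object per line, which matches the single-line test.json shown in these notes; the temporary output path is a placeholder chosen for illustration.

```python
import json
import os
import tempfile

# The instance below repeats the values from test.json in these notes.
instances = [
    {"pickuplon": -73.885262, "pickuplat": 40.773008,
     "dropofflon": -73.987232, "dropofflat": 40.732403, "passengers": 2},
]

path = os.path.join(tempfile.mkdtemp(), "test.json")
with open(path, "w") as f:
    for instance in instances:
        # One JSON object per line.
        f.write(json.dumps(instance) + "\n")

with open(path) as f:
    lines = f.read().splitlines()
print(lines[0])
```

Adding more dictionaries to instances would produce more lines, and so more predictions from a single call.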
Training and Evaluating Locally using Gcloud
Here's the command to train the network locally, the gcloud way:
rm -rf taxifare.tar.gz taxi_trained
gcloud ml-engine local train \
    --module-name=trainer.task \
    --package-path=${REPO}/courses/machine_learning/cloudmle/taxifare/trainer \
    -- \
    --train_data_paths=${REPO}/courses/machine_learning/datasets/taxi-train.csv \
    --eval_data_paths=${REPO}/courses/machine_learning/datasets/taxi-valid.csv \
    --num_epochs=10 \
    --output_dir=${REPO}/courses/machine_learning/cloudmle/taxi_trained
The output:
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4ce94590>, '_model_dir': '/content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': None, '_tf_random_seed': None, '_environment': u'cloud', '_num_worker_replicas': 0, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options { per_process_gpu_memory_fraction: 1.0 } , '_evaluation_master': '', '_master': ''}
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py:268: __init__ (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05. Instructions for updating: Monitors are deprecated. Please use tf.train.SessionRunHook.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:625: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30. Instructions for updating: Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
2017-10-21 21:58:20.069627: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:58:20.069704: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:58:20.069732: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:58:20.069748: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-21 21:58:20.069767: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
INFO:tensorflow:Saving checkpoints for 1 into /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt.
INFO:tensorflow:loss = 223.366, step = 1
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:625: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30. Instructions for updating: Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-10-21-21:58:20
INFO:tensorflow:Restoring parameters from /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt-1
INFO:tensorflow:Evaluation [1/10]
INFO:tensorflow:Evaluation [2/10]
INFO:tensorflow:Evaluation [3/10]
INFO:tensorflow:Evaluation [4/10]
INFO:tensorflow:Evaluation [5/10]
INFO:tensorflow:Evaluation [6/10]
INFO:tensorflow:Evaluation [7/10]
INFO:tensorflow:Evaluation [8/10]
INFO:tensorflow:Evaluation [9/10]
INFO:tensorflow:Evaluation [10/10]
INFO:tensorflow:Finished evaluation at 2017-10-21-21:58:20
INFO:tensorflow:Saving dict for global step 1: global_step = 1, loss = 4745.01, rmse = 68.8946
INFO:tensorflow:Validation (step 1): loss = 4745.01, global_step = 1, rmse = 68.8946
INFO:tensorflow:global_step/sec: 70.4532
INFO:tensorflow:loss = 187.626, step = 101 (1.419 sec)
INFO:tensorflow:Saving checkpoints for 160 into /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt.
INFO:tensorflow:Loss for final step: 170.484.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:625: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30. Instructions for updating: Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-10-21-21:58:22
INFO:tensorflow:Restoring parameters from /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt-160
INFO:tensorflow:Evaluation [1/10]
INFO:tensorflow:Evaluation [2/10]
INFO:tensorflow:Evaluation [3/10]
INFO:tensorflow:Evaluation [4/10]
INFO:tensorflow:Evaluation [5/10]
INFO:tensorflow:Evaluation [6/10]
INFO:tensorflow:Evaluation [7/10]
INFO:tensorflow:Evaluation [8/10]
INFO:tensorflow:Evaluation [9/10]
INFO:tensorflow:Evaluation [10/10]
INFO:tensorflow:Finished evaluation at 2017-10-21-21:58:22
INFO:tensorflow:Saving dict for global step 160: global_step = 160, loss = 227.261, rmse = 15.134
INFO:tensorflow:Restoring parameters from /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/model.ckpt-160
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: /content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained/export/Servo/1508623102/saved_model.pb
The results of training the model locally are in the taxi_trained directory:
$ ls taxi_trained/
checkpoint                                    model.ckpt-160.index
eval                                          model.ckpt-160.meta
events.out.tfevents.1508623099.e26189a31421   model.ckpt-1.data-00000-of-00001
export                                        model.ckpt-1.index
graph.pbtxt                                   model.ckpt-1.meta
model.ckpt-160.data-00000-of-00001
A prediction can be made using the trained model in the same way as the prior section:
$ cat test.json
{"pickuplon": -73.885262,"pickuplat": 40.773008,"dropofflon": -73.987232,"dropofflat": 40.732403,"passengers": 2}

$ model_dir=$(ls ${REPO}/courses/machine_learning/cloudmle/taxi_trained/export/Servo)
$ gcloud ml-engine local predict \
    --model-dir=${REPO}/courses/machine_learning/cloudmle/taxi_trained/export/Servo/${model_dir} \
    --json-instances=./test.json
Training in Cloud using Gcloud
To train the model in the cloud, start by copying the input CSV data to a Cloud Storage bucket:
$ gsutil -m rm -rf gs://${BUCKET}/taxifare/smallinput/
$ gsutil -m cp ${REPO}/courses/machine_learning/datasets/*.csv gs://${BUCKET}/taxifare/smallinput/
Then run the gcloud command to train the model in the cloud:
OUTDIR=gs://${BUCKET}/taxifare/smallinput/taxi_trained
JOBNAME=lab3a_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
    --region=$REGION \
    --module-name=trainer.task \
    --package-path=${REPO}/courses/machine_learning/cloudmle/taxifare/trainer \
    --job-dir=$OUTDIR \
    --staging-bucket=gs://$BUCKET \
    --scale-tier=BASIC \
    --runtime-version=1.0 \
    -- \
    --train_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-train*" \
    --eval_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-valid*" \
    --output_dir=$OUTDIR \
    --num_epochs=100
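The JOBNAME line builds a unique job name from the current UTC time. For reference, the same naming scheme can be sketched in Python; make_job_name is a hypothetical helper, and the fixed timestamp in the example is only there to make the output repeatable.

```python
from datetime import datetime, timezone


def make_job_name(prefix="lab3a", now=None):
    """Mimic JOBNAME=lab3a_$(date -u +%y%m%d_%H%M%S)."""
    now = now or datetime.now(timezone.utc)
    return "%s_%s" % (prefix, now.strftime("%y%m%d_%H%M%S"))


fixed_time = datetime(2017, 10, 21, 22, 36, 47, tzinfo=timezone.utc)
job_name = make_job_name(now=fixed_time)
print(job_name)  # lab3a_171021_223647
```

Because the timestamp has one-second resolution, two jobs submitted within the same second would get the same name.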
The output of the command to submit the job gives some information about how to monitor the status of the job:
CommandException: 1 files/objects could not be removed.
Job [lab3a_171021_223647] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ml-engine jobs describe lab3a_171021_223647

or continue streaming the logs with the command

  $ gcloud ml-engine jobs stream-logs lab3a_171021_223647
Running the describe command prints some information about the current phase of the job:
$ gcloud ml-engine jobs describe lab3a_171021_223647
createTime: '2017-10-21T22:36:49Z'
jobId: lab3a_171021_223647
state: PREPARING
trainingInput:
  args:
  - --train_data_paths=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi-train*
  - --eval_data_paths=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi-valid*
  - --output_dir=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained
  - --num_epochs=100
  jobDir: gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained
  packageUris:
  - gs://charlesreid1-cloudmle/lab3a_171021_223647/16c5fc53cac2844ee51c9d42f8de85b44d3adf769087f3fc2545cf10363454f2/taxifare-0.1.tar.gz
  pythonModule: trainer.task
  region: us-central1
  runtimeVersion: '1.0'
trainingOutput: {}

View job in the Cloud Console at:
https://console.cloud.google.com/ml/jobs/lab3a_171021_223647?project=not-all-broken

View logs at:
https://console.cloud.google.com/logs?resource=ml.googleapis.com%2Fjob_id%2Flab3a_171021_223647&project=not-all-broken
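A script that waits for the job can read the state field out of the describe output. The following is a rough sketch: parse_job_state and is_finished are hypothetical helpers, and the set of terminal state names is an assumption about the service, not something shown in the output above.

```python
import re

# Assumed terminal states; only PREPARING appears in these notes.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def parse_job_state(describe_output):
    """Return the value of the top-level 'state:' line, or None."""
    match = re.search(r"^state:\s*(\S+)\s*$", describe_output, re.MULTILINE)
    return match.group(1) if match else None


def is_finished(state):
    """True once the job has reached one of the assumed terminal states."""
    return state in TERMINAL_STATES


# A shortened copy of the describe output shown in these notes.
sample_output = "\n".join([
    "createTime: '2017-10-21T22:36:49Z'",
    "jobId: lab3a_171021_223647",
    "state: PREPARING",
    "trainingInput:",
    "  region: us-central1",
])

state = parse_job_state(sample_output)
```

A polling loop would call the describe command, pass its text to parse_job_state, and sleep until is_finished returns True.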
Streaming the output is useful. Here is the output of the streaming command:
$ gcloud ml-engine jobs stream-logs lab3a_171021_223647
INFO 2017-10-21 15:36:49 -0700 service Validating job requirements...
INFO 2017-10-21 15:36:49 -0700 service Job creation request has been successfully validated.
INFO 2017-10-21 15:36:49 -0700 service Waiting for job to be provisioned.
INFO 2017-10-21 15:36:49 -0700 service Job lab3a_171021_223647 is queued.
INFO 2017-10-21 15:41:26 -0700 service Waiting for TensorFlow to start.
INFO 2017-10-21 15:42:28 -0700 master-replica-0 Running task with arguments: --cluster={"master": ["master-2a1d29fbde-0:2222"]} --task={"type": "master", "index": 0} --job={
INFO 2017-10-21 15:42:28 -0700 master-replica-0 "package_uris": ["gs://charlesreid1-cloudmle/lab3a_171021_223647/16c5fc53cac2844ee51c9d42f8de85b44d3adf769087f3fc2545cf10363454f2/taxifare-0.1.tar.gz"],
INFO 2017-10-21 15:42:28 -0700 master-replica-0 "python_module": "trainer.task",
INFO 2017-10-21 15:42:28 -0700 master-replica-0 "args": ["--train_data_paths=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi-train*", "--eval_data_paths=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi-valid*", "--output_dir=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained", "--num_epochs=100"],
INFO 2017-10-21 15:42:28 -0700 master-replica-0 "region": "us-central1",
INFO 2017-10-21 15:42:28 -0700 master-replica-0 "runtime_version": "1.0",
INFO 2017-10-21 15:42:28 -0700 master-replica-0 "job_dir": "gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained"
INFO 2017-10-21 15:42:28 -0700 master-replica-0 }
INFO 2017-10-21 15:42:49 -0700 master-replica-0 Running module trainer.task.
INFO 2017-10-21 15:42:49 -0700 master-replica-0 Downloading the package: gs://charlesreid1-cloudmle/lab3a_171021_223647/16c5fc53cac2844ee51c9d42f8de85b44d3adf769087f3fc2545cf10363454f2/taxifare-0.1.tar.gz
INFO 2017-10-21 15:42:49 -0700 master-replica-0 Running command: gsutil -q cp gs://charlesreid1-cloudmle/lab3a_171021_223647/16c5fc53cac2844ee51c9d42f8de85b44d3adf769087f3fc2545cf10363454f2/taxifare-0.1.tar.gz taxifare-0.1.tar.gz
INFO 2017-10-21 15:43:09 -0700 master-replica-0 ERROR: You have configured your Cloud SDK installation to be fixed to version [138.0.0]. Make sure this is a valid archived Cloud SDK version.
INFO 2017-10-21 15:43:15 -0700 master-replica-0 Installing the package: gs://charlesreid1-cloudmle/lab3a_171021_223647/16c5fc53cac2844ee51c9d42f8de85b44d3adf769087f3fc2545cf10363454f2/taxifare-0.1.tar.gz
INFO 2017-10-21 15:43:15 -0700 master-replica-0 Running command: pip install --user --upgrade --force-reinstall --no-deps taxifare-0.1.tar.gz
INFO 2017-10-21 15:43:15 -0700 master-replica-0 Processing ./taxifare-0.1.tar.gz
INFO 2017-10-21 15:43:16 -0700 master-replica-0 Building wheels for collected packages: taxifare
INFO 2017-10-21 15:43:16 -0700 master-replica-0 Running setup.py bdist_wheel for taxifare: started
INFO 2017-10-21 15:43:16 -0700 master-replica-0 creating '/tmp/tmpATFYnWpip-wheel-/taxifare-0.1-cp27-none-any.whl' and adding '.' to it
INFO 2017-10-21 15:43:16 -0700 master-replica-0 adding 'trainer/model.py'
INFO 2017-10-21 15:43:16 -0700 master-replica-0 adding 'trainer/__init__.py'
INFO 2017-10-21 15:43:16 -0700 master-replica-0 adding 'trainer/task.py'
INFO 2017-10-21 15:43:16 -0700 master-replica-0 adding 'taxifare-0.1.dist-info/DESCRIPTION.rst'
INFO 2017-10-21 15:43:16 -0700 master-replica-0 adding 'taxifare-0.1.dist-info/metadata.json'
INFO 2017-10-21 15:43:16 -0700 master-replica-0 adding 'taxifare-0.1.dist-info/top_level.txt'
INFO 2017-10-21 15:43:16 -0700 master-replica-0 adding 'taxifare-0.1.dist-info/WHEEL'
INFO 2017-10-21 15:43:16 -0700 master-replica-0 adding 'taxifare-0.1.dist-info/METADATA'
INFO 2017-10-21 15:43:16 -0700 master-replica-0 adding 'taxifare-0.1.dist-info/RECORD'
INFO 2017-10-21 15:43:16 -0700 master-replica-0 Running setup.py bdist_wheel for taxifare: finished with status 'done'
INFO 2017-10-21 15:43:16 -0700 master-replica-0 Stored in directory: /root/.cache/pip/wheels/b0/8a/16/c8e53c6e84f363a5aac669e062e3ee5ccde849d32101ce58b5
INFO 2017-10-21 15:43:16 -0700 master-replica-0 Successfully built taxifare
INFO 2017-10-21 15:43:16 -0700 master-replica-0 Installing collected packages: taxifare
INFO 2017-10-21 15:43:16 -0700 master-replica-0 Successfully installed taxifare-0.1
INFO 2017-10-21 15:43:16 -0700 master-replica-0 Running command: pip install --user taxifare-0.1.tar.gz
INFO 2017-10-21 15:43:17 -0700 master-replica-0 Processing ./taxifare-0.1.tar.gz
INFO 2017-10-21 15:43:17 -0700 master-replica-0 Requirement already satisfied (use --upgrade to upgrade): taxifare==0.1 from file:///user_dir/taxifare-0.1.tar.gz in /root/.local/lib/python2.7/site-packages
INFO 2017-10-21 15:43:17 -0700 master-replica-0 Building wheels for collected packages: taxifare
INFO 2017-10-21 15:43:17 -0700 master-replica-0 Running setup.py bdist_wheel for taxifare: started
INFO 2017-10-21 15:43:17 -0700 master-replica-0 creating '/tmp/tmpx9Bps1pip-wheel-/taxifare-0.1-cp27-none-any.whl' and adding '.' to it
INFO 2017-10-21 15:43:17 -0700 master-replica-0 adding 'trainer/model.py'
INFO 2017-10-21 15:43:17 -0700 master-replica-0 adding 'trainer/__init__.py'
INFO 2017-10-21 15:43:17 -0700 master-replica-0 adding 'trainer/task.py'
INFO 2017-10-21 15:43:17 -0700 master-replica-0 adding 'taxifare-0.1.dist-info/DESCRIPTION.rst'
INFO 2017-10-21 15:43:17 -0700 master-replica-0 adding 'taxifare-0.1.dist-info/metadata.json'
INFO 2017-10-21 15:43:17 -0700 master-replica-0 adding 'taxifare-0.1.dist-info/top_level.txt'
INFO 2017-10-21 15:43:17 -0700 master-replica-0 adding 'taxifare-0.1.dist-info/WHEEL'
INFO 2017-10-21 15:43:17 -0700 master-replica-0 adding 'taxifare-0.1.dist-info/METADATA'
INFO 2017-10-21 15:43:17 -0700 master-replica-0 adding 'taxifare-0.1.dist-info/RECORD'
INFO 2017-10-21 15:43:17 -0700 master-replica-0 Running setup.py bdist_wheel for taxifare: finished with status 'done'
INFO 2017-10-21 15:43:17 -0700 master-replica-0 Stored in directory: /root/.cache/pip/wheels/b0/8a/16/c8e53c6e84f363a5aac669e062e3ee5ccde849d32101ce58b5
INFO 2017-10-21 15:43:17 -0700 master-replica-0 Successfully built taxifare
INFO 2017-10-21 15:43:17 -0700 master-replica-0 Running command: python -m trainer.task --train_data_paths=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi-train* --eval_data_paths=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi-valid* --output_dir=gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained --num_epochs=100 --job-dir gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained
INFO 2017-10-21 15:43:19 -0700 master-replica-0 Using default config.
INFO 2017-10-21 15:43:19 -0700 master-replica-0 Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': u'master', '_environment': u'cloud', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fd9a6374290>, '_tf_config': gpu_options {
INFO 2017-10-21 15:43:19 -0700 master-replica-0 per_process_gpu_memory_fraction: 1.0
INFO 2017-10-21 15:43:19 -0700 master-replica-0 }
INFO 2017-10-21 15:43:19 -0700 master-replica-0 , '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': ''}
INFO 2017-10-21 15:43:19 -0700 master-replica-0 Create CheckpointSaverHook.
INFO 2017-10-21 15:43:23 -0700 master-replica-0 Saving checkpoints for 1 into gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained/model.ckpt.
INFO 2017-10-21 15:43:30 -0700 master-replica-0 loss = 496.208, step = 1
INFO 2017-10-21 15:43:31 -0700 master-replica-0 Starting evaluation at 2017-10-21-22:43:31
INFO 2017-10-21 15:43:34 -0700 master-replica-0 Evaluation [1/10]
INFO 2017-10-21 15:43:34 -0700 master-replica-0 Evaluation [2/10]
INFO 2017-10-21 15:43:34 -0700 master-replica-0 Evaluation [3/10]
INFO 2017-10-21 15:43:34 -0700 master-replica-0 Evaluation [4/10]
INFO 2017-10-21 15:43:34 -0700 master-replica-0 Evaluation [5/10]
INFO 2017-10-21 15:43:34 -0700 master-replica-0 Evaluation [6/10]
INFO 2017-10-21 15:43:34 -0700 master-replica-0 Evaluation [7/10]
INFO 2017-10-21 15:43:34 -0700 master-replica-0 Evaluation [8/10]
INFO 2017-10-21 15:43:34 -0700 master-replica-0 Evaluation [9/10]
INFO 2017-10-21 15:43:34 -0700 master-replica-0 Evaluation [10/10]
INFO 2017-10-21 15:43:34 -0700 master-replica-0 Finished evaluation at 2017-10-21-22:43:34
INFO 2017-10-21 15:43:34 -0700 master-replica-0 Saving dict for global step 1: global_step = 1, loss = 169.691, rmse = 13.0265
INFO 2017-10-21 15:43:36 -0700 master-replica-0 Validation (step 1): loss = 169.691, global_step = 1, rmse = 13.0265
INFO 2017-10-21 15:44:15 -0700 master-replica-0 global_step/sec: 2.22504
INFO 2017-10-21 15:44:15 -0700 master-replica-0 loss = 82.2369, step = 101
INFO 2017-10-21 15:44:54 -0700 master-replica-0 global_step/sec: 2.61386
INFO 2017-10-21 15:44:54 -0700 master-replica-0 loss = 81.7113, step = 201
INFO 2017-10-21 15:45:32 -0700 master-replica-0 global_step/sec: 2.60342
INFO 2017-10-21 15:45:32 -0700 master-replica-0 loss = 70.4588, step = 301
INFO 2017-10-21 15:46:11 -0700 master-replica-0 global_step/sec: 2.54869
INFO 2017-10-21 15:46:11 -0700 master-replica-0 loss = 79.9016, step = 401
Training took about 20 minutes to finish.
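The streamed log mixes packaging output with training progress, so it can help to pull out just the training loss. The sketch below matches the "loss = ..., step = ..." lines as they appear in the log above. The name extract_training_loss is hypothetical, and the pattern assumes plain decimal loss values, so a loss printed in scientific notation would be missed.

```python
import re

# Matches training lines such as "loss = 82.2369, step = 101".
# Validation lines ("loss = 169.691, global_step = 1, ...") do not match.
LOSS_PATTERN = re.compile(r"loss = ([0-9.]+), step = (\d+)")


def extract_training_loss(log_text):
    """Return a list of (step, loss) pairs in the order they were logged."""
    return [(int(step), float(loss)) for loss, step in LOSS_PATTERN.findall(log_text)]


# Three lines copied from the streamed log in these notes.
sample_log = "\n".join([
    "INFO 2017-10-21 15:43:30 -0700 master-replica-0 loss = 496.208, step = 1",
    "INFO 2017-10-21 15:43:36 -0700 master-replica-0 Validation (step 1): loss = 169.691, global_step = 1, rmse = 13.0265",
    "INFO 2017-10-21 15:44:15 -0700 master-replica-0 loss = 82.2369, step = 101",
])

losses = extract_training_loss(sample_log)
```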
When the model is trained in Google Cloud ML Engine, the final model is stored in Google Cloud Storage. Once the job finishes, you'll see a message like this in the logs:
15:54:37.533 SavedModel written to: gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained/export/Servo/1508626468656/saved_model.pb
Deploying the Model to the Cloud
Start by finding the location of the saved model. The log contains the full path, but to extract it programmatically, note that everything up to the long multi-digit directory name is already known. List the export directory with gsutil ls and take the last entry with tail:
MODEL_NAME="taxifare"
MODEL_VERSION="v1"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/taxifare/smallinput/taxi_trained/export/Servo | tail -1)
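Taking the last line of the listing relies on the listing being sorted so that the newest export comes last. A slightly more defensive variant, sketched here in Python, picks the entry with the largest numeric directory name. The function name latest_export_dir is hypothetical, and the listing is assumed to have been read already, one path per line, from the gsutil ls output.

```python
def latest_export_dir(listing):
    """Return the path whose final component is the largest integer."""
    candidates = []
    for line in listing:
        path = line.strip()
        name = path.rstrip("/").rsplit("/", 1)[-1]
        if name.isdigit():  # skip the parent directory and any other entries
            candidates.append((int(name), path))
    if not candidates:
        raise ValueError("no numeric export directory found")
    return max(candidates)[1]


# Example listing; the second numeric entry is made up to show the
# difference between numeric and lexicographic ordering.
base = "gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained/export/Servo/"
listing = [
    base,
    base + "1508626468656/",
    base + "999/",
]

model_location = latest_export_dir(listing)
```

With plain string sorting the made-up 999/ entry would come last, while the numeric comparison still selects 1508626468656/.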
Now the gcloud command line utility can be used to deploy the model:
echo "Deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION}
If an old version of the model needs to be deleted first, use the gcloud ml-engine versions delete and gcloud ml-engine models delete commands (shown commented out here):
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
Output from deploy command:
Deleting and deploying taxifare v1 from gs://charlesreid1-cloudmle/taxifare/smallinput/taxi_trained/export/Servo/1508626468656/ ... this will take a few minutes
Created ml engine model [projects/not-all-broken/models/taxifare].
Creating version (this might take a few minutes)......done.
The model will take about 5 minutes to deploy. Once the model is deployed, predictions can be made using a REST API call.
Making Predictions in Cloud using Gcloud
To make a prediction using the gcloud command line:
$ cat test.json
{"pickuplon": -73.885262,"pickuplat": 40.773008,"dropofflon": -73.987232,"dropofflat": 40.732403,"passengers": 2}

$ gcloud ml-engine predict --model=taxifare --version=v1 --json-instances=./test.json
OUTPUTS
10.8887
Making Predictions in Cloud using REST API
To make a prediction by calling the REST API, we first need OAuth credentials. These can be obtained from the application default credentials and passed to discovery.build, which constructs the API client from the service's discovery document:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

# Build an API client using the application default credentials
credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
                      discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')
With the client built, we can send the model inputs to the prediction endpoint and get a JSON response back:
# PROJECT must be set to the Google Cloud project ID
request_data = {'instances':
  [
    {
      'pickuplon': -73.885262,
      'pickuplat': 40.773008,
      'dropofflon': -73.987232,
      'dropofflat': 40.732403,
      'passengers': 2,
    }
  ]
}

# Full resource name of the deployed model version
parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'taxifare', 'v1')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))
The output:
response={u'predictions': [{u'outputs': 10.888701438903809}]}
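The request and response handling can be separated from the network call, which makes it easier to check without credentials. This sketch assumes the response shape shown above (a predictions list whose items carry an outputs key). The helper names build_predict_call and first_prediction are hypothetical.

```python
def build_predict_call(project, model, version, instances):
    """Return the resource name and request body for a predict call."""
    name = "projects/{0}/models/{1}/versions/{2}".format(project, model, version)
    body = {"instances": list(instances)}
    return name, body


def first_prediction(response):
    """Extract the first predicted fare from a response dictionary."""
    return response["predictions"][0]["outputs"]


instance = {
    "pickuplon": -73.885262,
    "pickuplat": 40.773008,
    "dropofflon": -73.987232,
    "dropofflat": 40.732403,
    "passengers": 2,
}

name, body = build_predict_call("not-all-broken", "taxifare", "v1", [instance])

# The response printed in these notes, written as a plain dictionary.
sample_response = {"predictions": [{"outputs": 10.888701438903809}]}
fare = first_prediction(sample_response)
```

The name and body values are what would be passed as the name and body arguments of api.projects().predict(...).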