
Building TensorFlow Models

Module 1: Getting Started With Machine Learning

Introduction to Machine Learning

Start with a general purpose function, and find weights (tunable parameters) for a known data set, such that the model approximates it well. You then use that tuned set of parameters to make predictions on unknown values.

The key to machine learning is to use a large amount of data to approximate the function, and that requires being able to do it at scale.

Components of machine learning:

  • Building model
  • Preparing data sets

Example: build a machine learning model to predict taxi cab fares.

Training a machine learning model:

  • Input: image + label
  • Fed into a nonlinear, parametric mathematical function
  • Output: predicted label for the image
  • Make tiny adjustments to the parameters (weights) to make the output match the label for a given input

Write definitions for machine learning terms in your own words:

  • Label - what the model is trying to predict (labels/categories)
  • Input - the data fed to the model/function (domain of model)
  • Example - a particular input with a known output
  • Model - a mathematical function with parameters (weights)
  • Training - the process of adjusting the mathematical model to perform the desired task
  • Prediction - applying the model to an input that it has not seen during training

Use cases:

  • Clustering - unsupervised learning (detecting patterns)
  • Regression - supervised learning (continuous outputs)
  • Classification - supervised learning (categorical/label outputs)

Common source of structured data - a data warehouse (e.g., BigQuery)

Applications:

Manufacturing:

  • Predictive maintenance/condition monitoring
  • Warranty reserve estimation
  • Propensity to buy
  • Demand forecasting
  • Process optimization
  • Telematics

Retail:

  • Predictive inventory planning
  • Recommendation engines
  • Upsell/cross-channel marketing
  • Market segmentation/targeting
  • Customer ROI

Healthcare:

  • Alerts and diagnostics from real-time patient data
  • Disease identification and risk stratification
  • Patient triage optimization
  • Proactive health management
  • Healthcare provider sentiment analysis

Travel and Hospitality:

  • Aircraft scheduling
  • Dynamic pricing
  • Social media and customer feedback analysis
  • Customer complaint resolution
  • Traffic patterns and congestion management

Financial Services:

  • Risk analytics and regulation
  • Customer segmentation
  • Sales and marketing campaign analysis
  • Credit worthiness evaluation

Energy, Feedstock, and Utilities:

  • Power usage
  • Seismic data processing
  • Carbon emissions and trading
  • Customer-specific pricing
  • Smart grid management
  • Energy demand and supply optimization

Simple two-variable neural network with two neurons:

Asking if the linear combination of these two neuron signals is greater than some bias:
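A minimal numeric sketch of that check (the weights and bias values here are made up for illustration):

import numpy as np

def neuron_fires(x, w, b):
    # Fire (output 1) if the weighted sum of inputs exceeds the bias
    return 1 if np.dot(w, x) > b else 0

# Two inputs, two (hypothetical) weights, one bias
print(neuron_fires(np.array([0.5, 0.8]), np.array([1.0, -0.5]), 0.2))  # 0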

Recompute the error after each batch of examples. Don't use entire data set - want to leave some out to test your model after training.

For each neuron, we make a plot of (Error) vs (Value of weight w)

We construct the curve of error versus weight, and decide which direction to move the weight (direction of steepest error gradient).

Process:

  • Start model with random weights
  • For each batch: calculate error based on label set, change weights so that error goes down
  • Every 10 epochs, every N minutes, etc: ask is the model good enough?
  • If not, keep adjusting weights
  • If so, use model for prediction
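A toy sketch of this loop, fitting a single weight w by gradient descent on MSE (the data and learning rate are made up):

import numpy as np

# Toy data: the "true" relationship is y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = np.random.rand()    # start with a random weight
lr = 0.01               # learning rate
for step in range(100):
    grad = np.mean(2 * (w * x - y) * x)   # d(MSE)/dw
    w -= lr * grad                        # move against the gradient
print(w)                                  # close to 2.0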

Machine Learning Playground

Define the following machine learning terms in your own words:

  • Weights - model parameters being adjusted by the training procedure
  • Batch size - amount of data considered at one time (larger batches mean more memory space and potentially more conflicting objectives)
  • Epoch - one pass through all available batches; one traversal through the entire data set
  • Gradient descent - the process used to determine how to adjust the weights during a single BATCH
  • Evaluation - occasionally testing model over entire data set to see how good input-output mapping is, and whether we can stop training process
  • Training - the process of using input data with known outputs to adjust the model to replicate the function/mapping


Neural network playground:

  • Showing concentric circle data set: no hidden layer means we are splitting the data with a single line (this side = blue, other side = orange)
  • If we add two neurons in a hidden layer, we're now trying to split the data set with *two* lines, then recombining them into a single "mask" - not enough
  • If we add a third neuron, we end up with a triangle shape surrounding the middle cluster of blue points - the three lines form a triangle that surrounds the inner cluster
  • If we add five or six neurons, we see that some of the neurons simply don't do anything - only three neurons matter. In that case we have *too many* neurons.

Now consider the tangled spiral data set:

  • If you have 3 hidden layers, where each layer has 8 neurons and the last layer has 5 neurons, then you can get enough individual slices of lines to form a polygon that surrounds one of the spirals
  • More hierarchies of features


Return to the concentric circle data set:

  • Is it possible to separate these into two categories *without* adding hidden layers?
  • We can do that - but to do that, we need to introduce our human insight
  • Insight: the blue points are clustered around the origin, the orange points are far from the origin
  • This introduces the concept of distance - so introduce x1^2 and x2^2 terms.
  • Technical term - we are introducing *features* (additional/transformed inputs)
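A quick sketch of why those squared features work - with x1^2 + x2^2 available, a single threshold (a line in feature space) separates the two rings (the radius value here is made up):

def classify(x1, x2, radius_sq=1.0):
    # Squared distance from the origin is a linear function
    # of the engineered features x1^2 and x2^2
    return 'blue' if x1**2 + x2**2 < radius_sq else 'orange'

print(classify(0.2, 0.3))    # blue (inner cluster)
print(classify(1.5, -1.2))   # orange (outer ring)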

Write definitions for these machine learning terms in your own words:

  • Neurons - the "unit models" composing the neural network/machine learning model (a way of combining inputs - weighted sum of inputs, followed by activation function)
  • Hidden layer - a group of neurons that is connected to an "upstream" group of neurons or inputs and a "downstream" group of neurons or outputs (a combination of neurons that all share the same set of inputs)
  • Inputs - the data being used to train the model
  • Features - the inputs to the problem - can include the inputs, but can also include transformations of the inputs (things that you calculate from the inputs)
  • Feature engineering - process of deciding which features to include or exclude (using human insight) when trying to improve a model's predictions

Making Machine Learning Models More Effective

Machine learning process:

  • Collect data - complete coverage
  • Organize data - explore it, and fix any problems
  • Create model
  • Use a machine to flesh out the model from the data
  • Deploy the model

Note: In general, the models we use are models that are already implemented in TensorFlow

Collecting Data

What considerations do we need when we are collecting data?

It isn't enough to just scrape together a bunch of data - the dataset should cover all cases and should be labeled

Need a complete data set, and a complete data set means it covers all relevant cases

Examples:

  • If classifying 8 types of screws, need pictures of all 8 screws
  • If classifying clouds, need to first ask, what types of clouds are there? Then you can get pictures that cover every type of cloud.

Further, you need negative examples/near-misses - what kind of mistakes could the ML model make?

  • Texture-based recognition - what else is fluffy that could be mistaken for a cloud?
  • Shape-based/color-based recognition - cartoon clouds
  • Wrong shape - contrails from planes/rockets

Example:

  • MNIST - no negative examples
  • Post office handwriting digit classification - must be more complex
  • Machine learning pipeline - tasks happen in multiple steps; looks at an envelope, recognizes where the zip code is, identifies each individual digit

Organizing Data

Explore the data, identify any outliers, fix outliers

Don't always want to throw out the outliers - there is usually a systemic problem underlying the outlier that will come back again

Think carefully about your error metrics:

  • Start a model with random weights
  • Calculate the error based on the data set; change weights so the error goes down
  • If the model is good enough, use the model for prediction; otherwise, change weights so that the error goes down

MSE - mean squared error - the "default" good error metric to use

For ordinal/continuous problems, use the MSE, where ŷ ("y-hat") is the model estimate and y is the true value:

    MSE = (1/n) * Σᵢ (ŷᵢ - yᵢ)²

For classification problems, use the cross-entropy error (which is differentiable):

    Cross-entropy = -(1/n) * Σᵢ [ yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ) ]

(if you predict 1, and the actual value is 1, the entropy is 0)

(a handy function that handles both the positive and negative classes, with probabilities from 0 to 1, but it is harder to interpret than MSE)

These are the quantities that are minimized when adjusting the weights
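A minimal sketch of both metrics in numpy (the clipping epsilon is a standard guard against log(0), not something from the course):

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, p_pred, eps=1e-15):
    # y_true is 0 or 1; p_pred is the predicted probability of class 1
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))    # 0.25
print(cross_entropy(np.array([1.0]), np.array([1.0])))    # ~0 (predict 1, actual 1)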

Confusion Matrix

A way of presenting classes of outcomes for categorization problems

2 x 2 matrix:

                          [What the ML system guessed]
                            Positive      Negative
    [Truth]   Positive         TP            FN
              Negative         FP            TN

TP = true positive, FP = false positive, FN = false negative, TN = true negative

You can compute three possible numbers from this matrix:

  • accuracy
  • precision
  • recall

Accuracy: computing (TP + TN)/( TP + FP + FN + TN)

  • How many times it made a correct prediction
  • If the ML system said "this is a cat" and it was in fact a cat, or if the ML system said "this is not a cat" and it was not a cat - in both cases, we increment by one
  • (Most intuitive to understand)
  • (Doesn't always work though)

Example where accuracy fails:

  • System to identify empty parking spaces
  • 1000 spaces, 990 taken, 10 available
  • Our ML system identifies just ONE of the 10 available spaces:
  • 991/1000 = 99.1% accuracy
  • But this is a really skewed representation

Unbalanced data sets - use precision or recall

Accuracy: computing (TP + TN)/(TP + TN + FP + FN)

  • Focuses on the TOTAL number of questions answered correctly

Precision: computing TP / (TP + FP)

  • Focuses on the thing that you want to predict
  • "Positive predictive value" - only using the POSITIVES (true positive, false positive)
  • If the model says it is a cat, how often is the model correct? (Focusing on the POSITIVE question that you want to predict the answer to)
  • Only keep the set of images that the system predicted was a cat
  • Precision in the parking lot example: the POSITIVE question we are trying to predict is whether a parking space is empty
  • If we ONLY consider the positive predictions (the one space predicted empty), and ask how many were actually empty, the result is 1/1 = 100% precision

Recall: computing TP / (TP + FN)

  • This is the TRUE positive rate - the denominator only uses the actual positives (true positives plus false negatives - a false negative is really a positive that was missed)
  • Instead of only considering the model's POSITIVE answers to the question, we only consider the TRUE answers to the question
  • Cat example: only consider the pictures that are actually cats
  • Recall is the fraction of X that the model actually finds
  • Back to parking lot example: we have 10 true parking spaces, and correctly predicted 1, so we have a recall of 1/10 or 10%
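Plugging the parking-lot numbers into the three formulas (TP = 1 space correctly flagged available, FN = 9 available spaces missed, FP = 0, TN = 990 occupied spaces correctly flagged occupied):

TP, FN, FP, TN = 1.0, 9.0, 0.0, 990.0

print((TP + TN) / (TP + TN + FP + FN))   # accuracy:  0.991
print(TP / (TP + FP))                    # precision: 1.0 (100%)
print(TP / (TP + FN))                    # recall:    0.1 (10%)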

Digging deeper: classification model outputs a 0 or a 1 - but actually, it outputs some number BETWEEN 0 and 1, which is turned into a 0 or 1

ROC curve - (true positive rate, i.e., recall) vs. (false positive rate)

  • Receiver operating characteristic
  • The name comes from radar, a problem with similar challenges
  • Is this blip actually a plane? Is this (real) plane actually detected?


Define ML terms in your own words:

  • MSE - mean squared error - error metric for continuous variables, averaging the squared distance between real values and model predictions
  • Cross-entropy - an error metric for categorical predictions, useful because it can be differentiated (and thus, we can find the gradient of the error to figure out how to adjust the weights to improve the error)
  • Accuracy: accounts for all cases, (TP + TN)/(TP + TN + FP + FN)
  • Precision: focus on positives, (TP)/(TP + FP)
  • Recall: focus on truth, (TP)/(TP + FN)


Example:

  • Building a machine learning model to make predictions about approval for loans
  • Which error metric should we use?
  • Going back to parking lot example: do we have an unbalanced problem? Do we reject and approve equal numbers of people?
  • If we have a balanced problem, use accuracy
  • But we don't have a balanced problem - more people rejected than approved
  • If we have an imbalanced problem, use precision or recall
  • We want to focus on the positive answers - people we decide to give a loan to
  • There's little risk in rejecting a creditworthy applicant, but we want to maximize profit, so we want to maximize the number of POSITIVE cases (approved loans) that we predict correctly
  • POSITIVE predictive value

Another example:

  • Trying to identify fraudulent transactions
  • This is an unbalanced problem - not many transactions are fraudulent
  • We want either precision or recall
  • Important to identify *true* fraud events - don't want to falsely flag legit transactions as fraudulent, and don't want to let fraudulent transactions slip by unnoticed
  • We want recall - the TRUE positive rate

Exploring and Creating Data Sets

Consider a data set of fare amount vs. distance traveled

Two models:

  • One is a simple, single line, with a higher MSE
  • Other is a squiggly, high-order polynomial or complicated predictor that has 0 MSE
  • Intuitively, we know the first model is better - even though it has a higher MSE
  • How to FORMALIZE - use the model to make predictions, and quantify the MSE in the prediction
  • Training MSE: 0, Testing MSE: 32

Split our original, single dataset into pieces so that we can perform this test:

  • Training data
  • Validation data
  • Go through the training process (using training data) and evaluation process (using validation data)
  • Gradually increase model complexity
  • Stop before you overfit - you know you're overfitting when the increase in error between training and validation data sets is larger than some threshold

This process is called hyperparameter tuning

  • Training and validation data sets are used for hyperparameter tuning

But now, how do you evaluate the FINAL result? We can't really use the training or validation data sets... So 3-way split:

  • Training data - used to improve the model parameter values
  • Validation data - used to determine when model is overfit vs. underfit
  • Test data - used to evaluate the model once hyperparameter tuning is finished
  • (Not a great approach, since it leaves data lying around...)
  • (Let data coming in over time, on the back end, form the test data set)

Alternative approach: cross-validation

  • Split the original data into a bundle of smaller training data sets, and a bundle of smaller validation data
  • Advantage: get a range of error measures
  • Disadvantage: involves 10x as much training
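A sketch of how the folds might be generated (a hypothetical helper, not from the course materials):

import numpy as np

def kfold_indices(n, k, seed=0):
    # Shuffle the n example indices, then split into k folds;
    # each fold takes a turn as the validation set
    idx = np.random.RandomState(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        valid = folds[i]
        train = np.concatenate(folds[:i] + folds[i+1:])
        yield train, valid

# 10-fold cross-validation: 10 training runs, 10 error measures
for train_idx, valid_idx in kfold_indices(1000, 10):
    pass  # train on train_idx rows, measure error on valid_idx rows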

Lab: Creating and Preparing Data Set

Link: https://codelabs.developers.google.com/codelabs/dataeng-machine-learning/index.html?index=..%2F..%2Findex#0

Doing Steps 1-5

Setting up the right project:

Start by running a cloud shell, and check that you are authenticated for your project:

$ gcloud auth list

List what project you are on with this Google Cloud shell:

$ gcloud config list project

If wrong project, switch to the right one:

$ gcloud config set project 

Launch Datalab to do machine learning in a notebook.

Pick a zone to run in:

$ gcloud compute zones list
$ datalab create my-machine-learning-vm --zone <ZONE>

Once Datalab launches, open the Web Preview icon and switch to port 8081

Open ungit from /content/datalab

Clone the training-data-analyst repo (clone repository > https://github.com/GoogleCloudPlatform/training-data-analyst)

Now open the courses/machine_learning/datasets directory

Open create_datasets.ipynb

Notebook contents:

Start by extracting 10,000 records from BigQuery

Plot values to visualize and identify outliers/problems, and add them into WHERE clause for BigQuery:

  • Zero distance rides (trip_distance > 0)
  • Negative fare amounts (fare amounts >= 2.5)

Other problems, filtered out with Pandas:

  • Zero passenger rides (passenger_count > 0)
  • Latitude/longitude outside of Manhattan:
  • long > -78, long < -70 (pickup and dropoff)
  • lat > 37, lat < 45 (pickup and dropoff)

Finally, split data set into training (70%), validation (15%), and testing (15%)
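A minimal sketch of that split, assuming the cleaned data is already in a pandas DataFrame named df (the notebook's own splitting code may differ):

import numpy as np
import pandas as pd

# Stand-in for the cleaned taxi data (the real df comes from the BigQuery step)
df = pd.DataFrame({'fare_amount': np.random.rand(1000)})

np.random.seed(1)
r = np.random.rand(len(df))               # one uniform draw per row
df_train = df[r < 0.70]                   # ~70% training
df_valid = df[(r >= 0.70) & (r < 0.85)]   # ~15% validation
df_test  = df[r >= 0.85]                  # ~15% testing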

Export a CSV file for each data set (three files total), with columns:

['fare_amount', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'key']

Primitive benchmark model

Before building a machine learning model, try coming up with a really basic model that will create a benchmark to compare the ML to.

(mean_fare_amount)/(mean_distance) gives an average rate per distance

Then compute RMSE of that model for all three data sets:

  • Rate = $2.56/km
  • Train RMSE = 6.79
  • Valid RMSE = 6.20
  • Test RMSE = 5.84
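A sketch of that benchmark, reusing the df_train/df_valid/df_test names from the split sketch above and assuming each frame has 'fare_amount' and a precomputed 'distance' column (the 'distance' name is an assumption):

import numpy as np

def compute_rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

# The rate is learned from the training set only
rate = df_train['fare_amount'].mean() / df_train['distance'].mean()

for name, df in [('train', df_train), ('valid', df_valid), ('test', df_test)]:
    print(name, compute_rmse(df['fare_amount'], rate * df['distance']))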

Now, put it all together by creating a function that will construct an SQL query that does all of the filtering/querying that we need, and pass the results of the query to another function that computes the RMSE

Using the same query each time, our goal is to beat the RMSE of the rule of thumb, (trip_price)/(trip_distance)

Module 2: Building Machine Learning Models with TensorFlow

Building Models with TensorFlow

Goals:

  • TensorFlow code
  • Loops in TensorFlow
  • Monitoring results using TensorBoard

Labs:

  • What is TensorFlow
  • Machine learning
  • Gaining flexibility - more machines
  • Experiment framework

Overview of TensorFlow Models

TensorFlow is portable (like Java) - write code that works on a variety of hardware platforms, whether they are GPUs, CPUs, or phone processors

Two aspects: training, and predicting. Training often happens on higher-power processing hardware (GPUs), and prediction often happens on low-power processing hardware (phones, embedded processors, etc.)

Computations represented with data flow graphs, which abstract the computations away from the specific hardware

TensorFlow toolkit hierarchy:

  • High-level out-of-the-box API (scikit-learn compatible): Estimator API
  • Custom NN model components: tf.layers, tf.losses, tf.metrics
  • Python API for full control: Core TensorFlow (Python)
  • C++ API for very low-level control: Core TensorFlow (C++)
  • Runs on different hardware: CPU, GPU, TPU, Android

Core TensorFlow is implemented in C++ - that's what's portable across different platforms (similar to the way the JVM is implemented in C/C++, allowing you to write in a higher-level language, Java)

Can write Python code to create TensorFlow models - but even then there are several levels at which to write Python code (control all the details of the TensorFlow components, or abstract them away with high-level API)

Estimator API - highest level abstraction (uses pre-packaged models, intended for "getting stuff done")

Can run TensorFlow code written at any of these levels using CloudML to scale it up


Core TensorFlow Python API

Lets you build and run directed acyclic graphs

This looks a lot like numpy: operations are written as functions operating on data structures

c = tf.add(a,b)

a and b are tensors

This constructs components on a computation graph

To actually perform the computation, create a TensorFlow session and run it:

session = tf.Session()
numpy_c = session.run(c, feed_dict=...)

Contrasting numpy and TensorFlow: TF is lazy

In numpy, if we run this code, we get an immediate result:

>>> import numpy as np
>>> a = np.array([1, 2, 3])
>>> b = np.array([4, 5, 6])
>>> c = np.add(a, b)
>>> print(c)
[5 7 9]

In TF, we get an abstract operation that is not performed:

>>> import tensorflow as tf
>>> a = tf.constant([1, 2, 3])
>>> b = tf.constant([4, 5, 6])
>>> c = tf.add(a, b)
>>> print(c)
Tensor("Add_7:0", shape=(3,), dtype=int32)

To actually evaluate this, we have to do:

>>> with tf.Session() as sess:
...     result = sess.run(c)
...     print(result)
...
[5 7 9]

Motivation behind the compute graph: you can separate the creation of the graph from the running of the graph

Additionally, graphs can include send/receive nodes between other nodes in the graph - allows assignment of different parts of the graph to different machines (e.g., one component goes to GPU, another component goes to CPU)

C++/Python front ends assemble the computation graphs

The core TensorFlow Execution System can further modify the graph's operations (add, mul, print, reshape, etc) to insert additional nodes

These are then passed on to the kernels (CPU, GPU, Android, etc.)


Area of a Triangle

Implementation of Heron's Formula using TensorFlow

List of functions in TensorFlow: https://www.tensorflow.org/api_docs/python/tf
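The core of the exercise looks roughly like this (a sketch of the idea, not the notebook verbatim):

import tensorflow as tf

def compute_area(sides):
    # sides is an (N, 3) tensor: one row of side lengths per triangle
    a = sides[:, 0]
    b = sides[:, 1]
    c = sides[:, 2]
    s = (a + b + c) * 0.5                      # semi-perimeter
    areasq = s * (s - a) * (s - b) * (s - c)   # Heron's formula
    return tf.sqrt(areasq)

with tf.Session() as sess:
    # Building the graph is lazy; sess.run() triggers the computation
    area = compute_area(tf.constant([[5.0, 3.0, 7.1]]))
    print(sess.run(area))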

By default, the notebook is python2. Everything works fine.

Changing the notebook to python3 and updating the print() statements also works fine.

print(dir(tf.logging))

['DEBUG', 'ERROR', 'FATAL', 'INFO', 'TaskLevelStatusMessage', 'WARN', '_GetFileAndLine', '_GetNextLogCountPerToken', '_THREAD_ID_MASK', 
'__builtins__', '__doc__', '__file__', '__name__', '__package__', '_allowed_symbols', '_get_thread_id', '_handler', '_interactive', 
'_level_names', '_log_counter_per_token', '_log_prefix', '_logger', '_logging', '_logging_target', '_os', '_sys', '_time', 'debug', 
'error', 'fatal', 'flush', 'get_verbosity', 'info', 'log', 'log_every_n', 'log_first_n', 'log_if', 'set_verbosity', 'vlog', 
'warn', 'warning']

Change to tf.logging.ERROR to have less output. Unfortunately, there's very little clear explanation of what's going on in this notebook....

I'm confused about the multiple RMSE values flying around. There was an RMSE of around 6 for the training, validation, and test datasets in the first notebook. And then there's another query, the benchmark query, that results in an RMSE of 8 with the simple rule-of-thumb model of (mean fare)/(mean distance).

(Gonna be honest - none of this makes any sense. There's no attempt to explain anything about what's going on with the TensorFlow models. Basically, the lab says, "LOOK AT THIS CODE THEN PRESS ENTER OKAY NOW YOU ARE TENSORFLOW MASTER CONGRATS HERE IS UR CERTIFICATE")

Cloud MLE notebook:

Working with TensorFlow Estimator API

Two-step process when creating a machine learning model.

First, to know how to set up the machine learning model:

  • Start by answering question: are we solving a regression problem, or a classification problem?
  • What is the label we are using?
  • What are the features of the problem?

Then carry out the machine learning steps:

  • Train the ML model
  • Evaluate the ML model
  • Predict with the ML model

Structure of TF Estimator API Model

When we create the estimator object, we can either create a LinearRegressor object or a LinearClassifier object

import tensorflow as tf

# Define input feature columns
# In this silly example, we only use ONE column
feature_columns = [ tf.contrib.layers.real_valued_column("sq_footage") ]

# Instantiate the LinearRegressor model
# feature_columns is a list with just ONE column (one input)
estimator = tf.contrib.learn.LinearRegressor(feature_columns=feature_columns)

feature_columns specifies the inputs into the machine learning algorithm.

Next, we carry out the ML steps:

# Train
def input_fn_train():
    feature_data = {'sq_footage' : tf.constant([1000,2000])}

    # Label data is the "true value" of the quantity you are trying to predict
    label_data = tf.constant([100000,200000]) # These are house prices: $100,000 and $200,000

    return feature_data, label_data

# One step = one step of gradient descent
# One step is typically done on a batch,
# but because our data set is small,
# one step = one epoch
estimator.fit(input_fn=input_fn_train, steps=100)

# Predict
def input_fn_pred():
    feature_data = {'sq_footage' : tf.constant([1500])}
    return feature_data

list(estimator.predict(input_fn=input_fn_predict))

Setting steps=100 means do gradient descent 100 times.

To clarify a bit more:

  • A "step" refers to carrying out the gradient descent process on a single batch of data.
  • However, in this case, our entire data set is so small that it fits in memory, and we only have one batch.
  • Therefore, a "step" is actually carrying out the gradient descent process on the entire dataset.
  • This is normally called an epoch - finishing doing training on the entire batch of data.
  • In this (special) case of a small data set that fits in memory, 1 step = 1 epoch

Call the fit function on the estimator, and pass it the function that it should use to get the training data (inputs PLUS outputs).

Then we specify steps = 100 - which means, carry out gradient descent 100 times.

Lastly, we can now make predictions - pass in another function, this time a function that it should use to get the prediction data (inputs ONLY, no outputs).


Structure of fit input function:

  • Need to provide the features (the input quantities) in the form of a dictionary: column names as keys, each mapped to a tensor of values
  • Need to provide the labels (the output quantity being predicted) as a plain tensor/list of values

Structure of predict input function:

  • Need to provide the features (the test input quantities) in the form of a dictionary: column names as keys, each mapped to a tensor of test values
  • (No other quantities are needed - labels are being predicted!)


Going beyond linear regression with TF

Define a deep neural network with DNNRegressor instead of LinearRegressor:

model = DNNRegressor(feature_columns=[...], hidden_units=[128, 64, 32])

Constructor takes the architecture of the network. This case: 3 hidden layers [128, 64, 32] plus one input layer and one output layer, for a total of 5 layers. The architecture of the network is a result of some rules of thumb, and some trial and error.

Define a classifier like this:

model = LinearClassifier(feature_columns = [...])

model = DNNClassifier(feature_columns=[...], hidden_units=[...])

The feature columns can be several different types. We saw previously (with the square footage example) that we can define a real_valued_column, but we will see other types of feature columns as well.

The feature_columns input argument is a list of feature columns (of whatever type).

Feature engineering is the step where we'll cover different types of feature_columns.

More on the Estimator API

Helper functions for other types of inputs:

tf.contrib.layers.sparse_column_with_keys(column_name = "gender", keys=["female","male"])
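A sketch of how such a categorical column might combine with the earlier real-valued one (assuming the same tf.contrib.learn API shown above):

import tensorflow as tf

gender = tf.contrib.layers.sparse_column_with_keys(
    column_name="gender", keys=["female", "male"])
sq_footage = tf.contrib.layers.real_valued_column("sq_footage")

# Linear models can consume sparse (categorical) columns directly
model = tf.contrib.learn.LinearClassifier(
    feature_columns=[gender, sq_footage])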

Estimator API Lab

Notebook for this lab: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/tensorflow/b_tflearn.ipynb

Start by reading in the CSV data; create a pandas dataframe and extract the column names

Separate the column names into features (inputs) and labels (outputs)

Define a function that returns the training data (make_input_fn)

  • Note that this is slightly confusing, since it involves nested functions, but basically this is a bundle of code that will extract the correct data/columns

Now create a LinearRegressor

Batch size and epoch size are the same here - one step covers the whole data set - since we are holding all of the data (df_train) in memory at once

Now that we have the model, we evaluate the model by passing it the validation data set (df_valid) and use RMSE to quantify how good the model is

Pretty high error - the benchmark was 8, and the RMSE is 14 or 15 - but we'll focus on how to IMPROVE the model in later steps of the lab

(This shows the importance of creating a "simple rule" benchmark before you begin!)


Stop and Take Stock

Flustered by mysterious goings-on, both on the GCP side and on the local TensorFlow notebooks side. Specifically:

  • Datalab notebooks and git version control of said notebooks is a bit of a mess. Things don't work the way they should - repos are not created when and where they're supposed to be. Other, smaller things are broken everywhere - keyboard shortcuts for Jupyter notebook don't work. HTTPS connections to ungit don't work. Stuff like that.
  • Cloud networking is a big hassle. It's one of those large, mysterious systems where, when something breaks, my first thought is not "Okay, I can do some thinking to figure out what I did wrong," it's "Oh great, what happened this time?" It's like being a Linux newbie and building a program from source - you don't know when it's your own fault for doing something stupid, or when something is actually broken.
  • Like right now - I know datalab is running on this virtual machine, and I know it's listening on port 8080, and I know I have a working firewall rule, but I still can't open Datalab in my browser. (And yet again - running on an n1-standard-1 datalab image, there's a bunch of initialization scripts and a second disk, and NO FRICKIN APT-GET). Is it because it needs more time? I've already waited 5 minutes. It seems to be up and operational. (Turns out, yes, I just needed to wait longer.)


Module 2b: Refactor Model for Flexibility and Scaling Up

Gaining More Flexibility:

  • Laboratory
  • Building effective ML for big data
  • Refactoring tf.learn model
  • Components for building neural network models
  • Reading data in batches
  • Reading csv files num_epochs times
  • Reading local/GCS csv files in batches
  • Lab overview

To refactor the tf.learn model:

  • Refactor it to read out of memory data
  • Refactor it to add new features easily
  • Refactor it to evaluate model architecture as a part of training

TensorFlow Architecture for Out of Memory Learning

Back to the middle layer:

  • Reminder, these are the components that are useful when building custom NN models
  • tf.layers, tf.losses, tf.metrics


Recap of terminology:

  • We will store our data in multiple files
  • One step = going through single batch of training data once
  • One epoch = going through entire training data once
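Worked example (numbers made up): with 100,000 training examples and a batch size of 512, one epoch is about 100,000 / 512 ≈ 196 steps, so 50 epochs is roughly 9,800 steps.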

Reading data from out of memory:

  • To go through our data for 50 epochs, we just need to create a filename queue (from randomly shuffled filenames) that contains our file names 50 times each
  • Example: dealing with three files A, B, C: our filename queue should be B B C A ... (enqueue_many function)
  • Then we dequeue each file, one at a time, using a Reader (dequeue function)
  • The reader reads records from the file and decodes them into examples
  • The examples then go into an Example Queue (using the enqueue function)
  • Why shuffle filenames and add them in random order? When doing distributed learning, we don't want to bias our learning process, or have one file cause a slowdown (on exact same machine each time)
  • Each Reader will be on a different machine; each Reader takes filenames from the queue, and creates an example queue (an example is an input plus a label)
  • Then, TensorFlow model reads data from the Example Queue

Reading a CSV file num_epochs times:

Start by setting labels for the columns in the files being read:

CSV_COLUMNS = ['fare_amount', 'pickuplon', 'pickuplat', ...]
LABEL_COLUMN = 'fare_amount'

# Now define default values that each column will take on
# (This keeps the ML model from choking if there are one or two missing pieces of data)
DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]

Next, define an input function that will do a wildcard match, and assemble each filename and put it into the Filename Queue:

def input_fn():
    input_file_names = tf.train.match_filenames_once(filename)
    filename_queue = tf.train.string_input_producer( input_file_names, num_epochs=num_epochs, shuffle=True)

    # now make the Reader
    reader = tf.TextLineReader()
    _, value = reader.read_up_to(filename_queue, num_records = batch_size)

    value_column = tf.expand_dims(value, -1)
    columns = tf.decode_csv(value_column, record_defaults = DEFAULTS)
    features = dict(zip(CSV_COLUMNS, columns))

    # Take the label item and pop it from the features dictionary.
    # features.pop() returns the popped value, so label is the tensor
    # of values for the label column.
    label = features.pop(LABEL_COLUMN)

    return features, label


Reading CSV Files num_epochs times

In the input function:

  • Match all filenames (can have a wildcard, like train.*) or sharded files (train-00001-36, train-00002-36, etc)
  • Then, take those input files and repeat them (in a shuffled way) num_epochs times
  • Now, create the reader with TextLineReader() to read CSV files
  • Tell the reader to read a batch of records from the filename queue
  • Each record read is just a line of text, so we use expand_dims() to give the value tensor an extra dimension
  • Then we do decode_csv to decode this as a comma-separated string
  • We need to tell TF what the datatypes are, and what to do if the value of the field is missing
  • We now have our values
  • But our features have to be a dictionary - where each column is an entry in the dict, with the key being the name of the column
  • Associate the field names with the tensor values to make it a dictionary (that's features). One key is fare_amount, next key is pickuplon, next is pickuplat, etc.
  • Each key has a tensor associated with it
  • Those are our features - except that fare_amount is the label column: it isn't an input, it's the quantity we're trying to predict
  • Calling features.pop(LABEL_COLUMN) removes the quantity we're trying to predict (the output) from the dictionary of inputs
  • We then return the features (the dictionary mapping each column name to its tensor of values) and the label (the tensor of label values)

TextLineReader() can read from local files, or from GCS

What it is doing is:

  • Decoding CSV
  • Creating a dictionary of features
  • Extracting the labels (via features.pop(LABEL_COLUMN))
  • Returning features and labels

This pipeline (reader plus decode_csv) can be fed CSV files from a local disk, or from Google Cloud Storage

Next lab:

  • Refactor TensorFlow model
  • Read from a potentially large data set/file in batches
  • Do a wildcard match on filenames and feed them to a filename queue
  • Break up the one-to-one relationships between inputs and features (unclear what this means, exactly)

This will smooth the way to running this TensorFlow model at scale.


Refactoring the ML Model for Big Data

Link to lab: https://codelabs.developers.google.com/codelabs/dataeng-machine-learning/index.html?index=#7

Link to notebook: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/tensorflow/c_batched.ipynb

First refactoring: reading input data in batch

The first refactoring addresses how the input data are being read. A filename queue is added to the TensorFlow graph, instead of reading the file directly into a Pandas dataframe. We pass the filename, and use this tf.train.match_filenames_once() thing. We use a string producer to generate the (one single) filename over and over. We shuffle the input filename queue. We repeat each file num_epochs times. Here's the whole mess:

def read_dataset(filename, num_epochs=None, batch_size=512, mode=tf.contrib.learn.ModeKeys.TRAIN):
  def _input_fn():
    input_file_names = tf.train.match_filenames_once(filename)
    filename_queue = tf.train.string_input_producer(
        input_file_names, num_epochs=num_epochs, shuffle=True)
    reader = tf.TextLineReader()
    _, value = reader.read_up_to(filename_queue, num_records=batch_size)

    value_column = tf.expand_dims(value, -1)
    columns = tf.decode_csv(value_column, record_defaults=DEFAULTS)
    features = dict(zip(CSV_COLUMNS, columns))
    label = features.pop(LABEL_COLUMN)
    return features, label

  return _input_fn


Second refactoring: treat input data and features as different

The second refactoring addresses the way we turn input data into features. They refactor this so that they are specifically extracting the input variables in one step, then explicitly specifying the model features in another, separate step. What they mean by "break the one-to-one relationship between inputs and features" is, we aren't forced to use the input data and only the input data as our model features. Once we change the way the input data is loaded (i.e., if we don't read data straight from the input file into the model), we can transform input variables, leave certain input variables out of the model, normalize them, combine them together, etc.
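A sketch of what that separation might look like in code - raw inputs come out of the input function, and a separate step derives features from them (the helper function and the dropoff column names are assumptions for illustration):

import tensorflow as tf

def add_engineered(features):
    # features: the dict returned by the input function (raw inputs)
    # Derive a new feature from the raw lat/lon columns
    latdiff = features['pickuplat'] - features['dropofflat']
    londiff = features['pickuplon'] - features['dropofflon']
    features['euclidean'] = tf.sqrt(latdiff**2 + londiff**2)
    return features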

Third refactoring: Move model evaluation into training loop

The problem with the notebook, as is, is that we're specifying a fixed number of epochs. Instead, we want to evaluate the model as we go, and stop when we reach a stopping criterion.

(This will happen in the next lab.)

Also a checkpointing problem - we save checkpoints during the training, and use the final checkpoint as the final model. (Discussion of overfitting - we may not want the last checkpoint, because it may be overfit.) This will also be improved by stopping the model training when we reach some error criterion.

Train the model on the training data set, and every few steps, stop and assess RMSE on the validation data set. Stop when the RMSE on the validation data set starts to increase (indicates we're overfitting).
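A toy illustration of that pattern (made-up data; real code would checkpoint and restore rather than keep everything in one loop):

import numpy as np

rng = np.random.RandomState(0)
x_tr, x_va = rng.rand(100), rng.rand(30)
y_tr = 3 * x_tr + 0.1 * rng.randn(100)
y_va = 3 * x_va + 0.1 * rng.randn(30)

w, lr, best = 0.0, 0.1, float('inf')
for step in range(1000):
    w -= lr * np.mean(2 * (w * x_tr - y_tr) * x_tr)   # one training step
    if step % 50 == 0:
        rmse = np.sqrt(np.mean((w * x_va - y_va) ** 2))
        if rmse > best:
            break        # validation RMSE went up: stop training
        best = rmse
print(w, best)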

What to improve further?

Handle machine failure in distributed training - what if something goes wrong? Want to be able to pick up training wherever we left off.

Monitor training - especially useful if training is expected to take a very long time. Answer questions like, which epoch are we on, what is the current RMSE, etc.

Choose a model based on the validation data set - use a smarter stopping criteria than number of epochs.

How much does a reasonably realistic machine learning model cost?

It will cost a few thousand dollars for a reasonably realistic model

References

Reading out-of-memory data: schematic: https://www.tensorflow.org/programmers_guide/reading_data
