GCDEC/Building Tensorflow/Notes
From charlesreid1
Contents
- 1 Building TensorFlow Models
- 1.1 Module 1: Getting Started With Machine Learning
- 1.2 Module 2: Building Machine Learning Models with Tensorflow
- 1.2.1 Building Models with TensorFlow
- 1.2.2 Overview of TensorFlow Models
- 1.2.3 Core TensorFlow Python API
- 1.2.4 Area of a Triangle
- 1.2.5 Working with TensorFlow Estimator API
- 1.2.6 Structure of TF Estimator API Model
- 1.2.7 Going beyond linear regression with TF
- 1.2.8 More on the Estimator API
- 1.2.9 Estimator API Lab
- 1.2.10 Stop and Take Stock
- 1.3 Module 2b: Refactor Model for Flexibility and Scaling Up
- 2 References
- 3 Flags
Building TensorFlow Models
Module 1: Getting Started With Machine Learning
Introduction to Machine Learning
Start with a general purpose function, and find weights (tunable parameters) for a known data set, such that the model approximates it well. You then use that tuned set of parameters to make predictions on unknown values.
The key to machine learning is to use a large amount of data to approximate the function, and that requires being able to do it at scale.
Components of machine learning:
- Building model
- Preparing data sets
Example: build a machine learning model to predict taxi cab fares.
Training a machine learning model:
- Input: image + label
- Fed into nonlinear, parametric mathematical function
- Output: image + predicted label
- Make tiny adjustments to parameters (weights) to make output match label for given input
Write definitions for machine learning terms in your own words:
- Label - what the model is trying to predict (labels/categories)
- Input - the data fed to the model/function (domain of model)
- Example - a particular input with a known output
- Model - a mathematical function with parameters (weights)
- Training - the process of adjusting the mathematical model to perform the desired task
- Prediction - applying the model to an input that it has not seen during training
Use cases:
- Clustering - unsupervised learning (detecting patterns)
- Regression - supervised learning (continuous outputs)
- Classification - supervised learning (categorical/label outputs)
Common source of structured data - data warehouse (e.g., BigQuery)
Applications:
Manufacturing:
- Predictive maintenance/condition monitoring
- Warranty reserve estimation
- Propensity to buy
- Demand forecasting
- Process optimization
- Telematics
Retail:
- Predictive inventory planning
- Recommendation engines
- Upsell/cross-channel marketing
- Market segmentation/targeting
- Customer ROI
Healthcare:
- Alerts and diagnostics from real-time patient data
- Disease identification and risk stratification
- Patient triage optimization
- Proactive health management
- Healthcare provider sentiment analysis
Travel and Hospitality:
- Aircraft scheduling
- Dynamic pricing
- Social media and customer feedback analysis
- Customer complaint resolution
- Traffic patterns and congestion management
Financial Services:
- Risk analytics and regulation
- Customer segmentation
- Sales and marketing campaign analysis
- Credit worthiness evaluation
Energy, Feedstock, and Utilities:
- Power usage
- Seismic data processing
- Carbon emissions and trading
- Customer-specific pricing
- Smart grid management
- Energy demand and supply optimization
Simple two-variable neural network with two neurons: ask whether the linear combination of the two neuron signals is greater than some bias.
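A minimal numpy sketch (mine, not from the course) of that idea - a single neuron computes a weighted sum of two inputs and compares it to a bias; the weights and bias values are made up:

import numpy as np

# Made-up weights and bias, just to illustrate the idea
w = np.array([0.7, -1.2])
bias = 0.5

def neuron(x):
    # "Fire" (output 1) if the weighted combination of the two inputs exceeds the bias
    return 1 if np.dot(w, x) > bias else 0

print(neuron(np.array([2.0, 0.3])))  # 1.4 - 0.36 = 1.04 > 0.5  --> 1
print(neuron(np.array([0.5, 1.0])))  # 0.35 - 1.2 = -0.85 <= 0.5 --> 0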
Recompute the error after each batch of examples. Don't use entire data set - want to leave some out to test your model after training.
For each neuron, we make a plot of (Error) vs (Value of weight w)
We construct the curve of error versus weight, and decide which direction to move the weight (direction of steepest error gradient).
Process:
- Start model with random weights
- For each batch: calculate error based on label set, change weights so that error goes down
- Every 10 epochs, every N minutes, etc: ask is the model good enough?
- If not, keep adjusting weights
- If so, use model for prediction
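A toy version of this loop (my own sketch, not course code): fit a single weight w to made-up data by repeatedly computing the error and nudging w in the direction that lowers it:

import numpy as np

# Made-up data: y is roughly 3 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.9, 9.2, 11.8])

w = np.random.randn()   # start the model with a random weight
lr = 0.01               # learning rate: size of each adjustment

for step in range(200):
    pred = w * x
    error = np.mean((pred - y) ** 2)       # mean squared error on this "batch"
    grad = np.mean(2 * (pred - y) * x)     # d(error)/dw
    w = w - lr * grad                      # change the weight so the error goes down
    if step % 50 == 0:
        print(step, round(error, 3), round(w, 3))

print("final weight:", w)   # ends up close to 3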
Machine Learning Playground
Define the following machine learning terms in your own words:
- Weights - model parameters being adjusted by the training procedure
- Batch size - amount of data considered at one time (larger batches mean more memory space and potentially more conflicting objectives)
- Epoch - one pass through all available batches; one traversal through the entire data set
- Gradient descent - the process used to determine how to adjust the weights during a single BATCH
- Evaluation - occasionally testing model over entire data set to see how good input-output mapping is, and whether we can stop training process
- Training - the process of using input data with known outputs to adjust the model to replicate the function/mapping
Neural network playground:
- Showing the concentric circle data set: no hidden layer means we are splitting the data with a single line (this side = blue, other side = orange)
- If we add two neurons in a hidden layer, we're now trying to split the data set with *two* lines, then recombining them into a single "mask" - not enough
- If we add a third neuron, we end up with a triangle shape surrounding the middle cluster of blue points - the three lines form a triangle that surround the inner cluster
- If we add five or six neurons, we see that some of the neurons simply don't do anything - only three neurons matter. So there we have *too many* neurons.
Now consider the tangled spiral data set:
- If you have 3 hidden layers, and if each layer has 8 neurons, and the last layer has 5 neurons, then you can get enough individual slices of lines to form a polygon that surrounds one of the spirals
- More hierarchies of features
Return to the concentric circle data set:
- Is it possible to separate these into two categories *without* adding hidden layers?
- We can do that - but to do that, we need to introduce our human insight
- Insight: the blue points are clustered around the origin, the orange points are far from the origin
- This introduces the concept of distance - so we introduce x1^2 and x2^2 terms.
- Technical term - we are introducing *features* (additional/transformed inputs)
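A tiny numpy illustration (mine, with made-up points) of why the x1^2 and x2^2 features help: distance-from-origin becomes an explicit input, so a single linear boundary can separate the two classes:

import numpy as np

# Made-up points: "blue" near the origin, "orange" far from it
blue   = np.array([[0.2, 0.1], [-0.3, 0.2], [0.1, -0.4]])
orange = np.array([[2.0, 0.1], [-1.8, 1.1], [0.3, -2.2]])

def add_squared_features(x):
    # Append x1^2 and x2^2 as two extra feature columns
    return np.hstack([x, x ** 2])

# x1^2 + x2^2 is small for blue points and large for orange points,
# so a simple threshold on the new features separates them - no hidden layers needed
print(add_squared_features(blue)[:, 2:].sum(axis=1))    # roughly [0.05 0.13 0.17]
print(add_squared_features(orange)[:, 2:].sum(axis=1))  # roughly [4.01 4.45 4.93]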
Write definitions for these machine learning terms in your own words:
- Neurons - the "unit models" composing the neural network/machine learning model (a way of combining inputs - weighted sum of inputs, followed by activation function)
- Hidden layer - a group of neurons that is connected to an "upstream" group of neurons or inputs and a "downstream" group of neurons or outputs (a combination of neurons that all share the same set of inputs)
- Inputs - the data being used to train the model
- Features - the inputs to the problem - can include the inputs, but can also include transformations of the inputs (things that you calculate from the inputs)
- Feature engineering - process of deciding which features to include or exclude (using human insight) when trying to improve a model's predictions
Making Machine Learning Models More Effective
Machine learning process:
- Collect data - complete coverage
- Organize data - explore it, and fix any problems
- Create model
- Use a machine to flesh out the model from the data
- Deploy the model
Note: In general, the models we use are models that are already implemented in TensorFlow
Collecting Data
What considerations do we need when we are collecting data?
It isn't enough to just scrape together a bunch of data - the dataset should cover all cases and should be labeled
Need a complete data set, and a complete data set means it covers all relevant cases
Examples:
- If classifying 8 types of screws, need pictures of all 8 screws
- If classifying clouds, need to first ask, what types of clouds are there? Then you can get pictures that cover every type of cloud.
Further, you need negative examples/near-misses - what kind of mistakes could the ML model make?
- Texture-based recognition - what else is fluffy that could be mistaken for a cloud?
- Shape-based/color-based recognition - cartoon clouds
- Wrong shape - contrails from planes/rockets
Example:
- MNIST - no negative examples
- Post office handwriting digit classification - must be more complex
- Machine learning pipeline - tasks happen in multiple steps; looks at an envelope, recognizes where the zip code is, identifies each individual digit
Organizing Data
Explore the data, identify any outliers, fix outliers
Don't always want to throw out the outliers - there is usually a systemic problem underlying the outlier that will come back again
Think carefully about your error metrics:
- Start a model with random weights
- Calculate the error based on the data set; change weights so the error goes down
- If the model is good enough, use the model for prediction; otherwise, change weights so that the error goes down
MSE - mean squared error - "default" good error metric to use
MSE = (1/n) * sum_i ( Y_i - Yhat_i )^2
Yhat ("Y-hat") is the model estimate
For regression (continuous) problems, use the MSE.
For classification problems, use the cross-entropy error (which is differentiable):
cross-entropy = -(1/n) * sum_i [ Y_i * log(Yhat_i) + (1 - Y_i) * log(1 - Yhat_i) ]
(if you predict 1, and the actual value is 1, the entropy is 0)
(cool function that can deal with + or -, probability from 0 to 1, but hard to interpret)
These are the quantities that are minimized when adjusting the weights
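A small numpy sketch (mine, with made-up numbers) of both error metrics; the cross-entropy shown is the common binary form:

import numpy as np

# Regression: made-up actual values and model estimates (Y-hat)
y_true = np.array([1.0, 2.0, 3.0])
y_hat  = np.array([1.1, 1.8, 3.4])
mse = np.mean((y_true - y_hat) ** 2)            # average squared difference

# Binary classification: made-up true labels and predicted probabilities
labels = np.array([1.0, 0.0, 1.0])
probs  = np.array([0.9, 0.2, 0.6])
cross_entropy = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(mse)            # 0.07
print(cross_entropy)  # about 0.28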
Confusion Matrix
A way of presenting classes of outcomes for categorization problems
2 x 2 matrix (rows = truth, columns = what the ML system guessed):

                      Guessed positive    Guessed negative
  Actually positive         TP                  FN
  Actually negative         FP                  TN

TP = true positive, FP = false positive, FN = false negative, TN = true negative
You can compute three possible numbers from this matrix:
- accuracy
- precision
- recall
Accuracy: computing (TP + TN)/( TP + FP + FN + TN)
- How many times it made a correct prediction
- If the ML system said "this is a cat" and it was in fact a cat, or if the ML system said "this is not a cat" and it was not a cat - in both cases, we increment by one
- (Most intuitive to understand)
- (Doesn't always work though)
Example where accuracy fails:
- System to identify empty parking spaces
- 1000 spaces, 990 taken, 10 available
- Our ML system identifies ONE of the 10 available spaces (and correctly labels the 990 occupied ones)
- (1 + 990)/1000 = 99.1% accuracy
- But this is a really skewed representation
Unbalanced data sets - use precision or recall
Accuracy: computing (TP + TN) / (TP + TN + FP + FN)
- Focuses on the TOTAL number of questions answered correctly
Precision: computing TP / (TP + FP)
- Focuses on the thing that you want to predict
- "Positive predictive value" - only using the POSITIVES (true positive, false positive)
- If the model says it is a cat, how often is the model correct? (Focusing on the POSITIVE question that you want to predict the answer to)
- Only keep the set of images that the system predicted was a cat
- Precision in the parking lot example: we are trying to answer the POSITIVE question, "is this parking space empty?"
- If we ONLY consider the positive predictions (the spaces the model said were empty), and ask how many were actually empty, the result is 1/1 - 100% precision
Recall: computing TP / (TP + FN)
- This is the TRUE positive rate - only using examples whose true value is positive (TP is truly positive; FN means the model said negative and was wrong, so the truth is positive - a double negation)
- Instead of only considering the model's POSITIVE answers to the question, we only consider the TRUE answers to the question
- Cat example: only consider the pictures that are actually cats
- Recall is the fraction of X that the model actually finds
- Back to parking lot example: we have 10 true parking spaces, and correctly predicted 1, so we have a recall of 1/10 or 10%
Digging deeper: classification model outputs a 0 or a 1 - but actually, it outputs some number BETWEEN 0 and 1, which is turned into a 0 or 1
ROC curve - plots the true positive rate (recall) vs. the false positive rate
- Receiver operating characteristic curve - the name comes from radar
- Radar is a problem with similar challenges
- Is this blip actually a plane? Is this (real) plane actually detected?
Define ML terms in your own words:
- MSE - mean squared error - error metric for continuous variables, averaging the squared differences between real values and model predictions
- Cross-entropy - an error metric for categorical predictions, useful because it can be differentiated (and thus, we can find the gradient of the error to figure out how to adjust the weights to improve the error)
- Accuracy: accounts for all cases, (TP + TN)/(TP + TN + FP + FN)
- Precision: focus on positives, (TP)/(TP + FP)
- Recall: focus on truth, (TP)/(TP + FN)
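Putting the three metrics together with the parking-lot numbers from above (1000 spaces, 10 actually empty, the model correctly flags 1 empty space and calls everything else occupied):

# Confusion-matrix counts for the parking lot example
TP, FP, FN, TN = 1, 0, 9, 990

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 0.991 - looks great, but misleading
precision = TP / (TP + FP)                    # 1.0   - every "empty" call was correct
recall    = TP / (TP + FN)                    # 0.1   - found only 1 of the 10 empty spaces

print(accuracy, precision, recall)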
Example:
- Building a machine learning model to make predictions about approval for loans
- Which error metric should we use?
- Going back to parking lot example: do we have an unbalanced problem? Do we reject and approve equal numbers of people?
- If we have a balanced problem, use accuracy
- But we don't have a balanced problem - more people rejected than approved
- If we have an imbalanced problem, use precision or recall
- We want to focus on the positive answers - people we decide to give a loan to
- Little risk if a "true" good applicant is rejected for a loan, but to maximize profits we want to maximize the number of POSITIVE cases (loans we approve) that we correctly predict
- POSITIVE predictive value
Another example:
- Trying to identify fraudulent transactions
- This is an unbalanced problem - not many transactions are fraudulent
- We want either precision or recall
- Important to identify *true* fraud events - don't want to falsely flag legit transactions as fraudulent, and don't want to let fraudulent transactions slip by unnoticed
- We want recall - the TRUE positive rate
Exploring and Creating Data Sets
Consider a data set of fare amount vs. distance traveled
Two models:
- One is a simple, single line, with a higher MSE
- Other is a squiggly, high-order polynomial or complicated predictor that has 0 MSE
- Intuitively, we know the first model is better - even though it has a higher MSE
- How to FORMALIZE - use the model to make predictions, and quantify the MSE in the prediction
- Training MSE: 0, Testing MSE: 32
Split our original, single dataset into pieces so that we can perform this test:
- Training data
- Validation data
- Go through the training process (using training data) and evaluation process (using validation data)
- Gradually increase model complexity
- Stop before you overfit - you know you're overfitting when the gap between the training error and the validation error grows larger than some threshold
This process is called hyperparameter tuning
- Training and validation data sets are used for hyperparameter tuning
But now, how do you evaluate the FINAL result? We can't really use the training or validation data sets... So 3-way split:
- Training data - used to improve the model parameter values
- Validation data - used to determine when model is overfit vs. underfit
- Test data - used to evaluate the model once hyperparameter tuning is finished
- (Not a great approach, since it leaves some data sitting unused...)
- (Alternative: let data coming in over time, on the back end, form the test data set)
Alternative approach: cross-validation
- Split the original data repeatedly into different (smaller) training and validation sets, training and validating once per split
- Advantage: get a range of error measures
- Disadvantage: involves 10x as much training
Lab: Creating and Preparing Data Set
Doing Steps 1-5
Setting up the right project:
Start by running a cloud shell, and check that you are authenticated for your project:
$ gcloud auth list
List what project you are on with this Google Cloud shell:
$ gcloud config list project
If wrong project, switch to the right one:
$ gcloud config set project <PROJECT_ID>
Launch Datalab to do machine learning in a notebook.
Pick a zone to run in:
$ gcloud compute zones list
$ datalab create my-machine-learning-vm --zone <ZONE>
Once Datalab launches, open the Web Preview icon and switch to port 8081
Open ungit from /content/datalab
Clone the training-data-analyst repo (clone repository > https://github.com/GoogleCloudPlatform/training-data-analyst)
Now open the courses/machine_learning/datasets directory
Open create_datasets.ipynb
Notebook contents:
Start by extracting 10,000 records from BigQuery
Plot values to visualize and identify outliers/problems, and add them into WHERE clause for BigQuery:
- Zero distance rides (trip_distance > 0)
- Negative fare amounts (fare_amount >= 2.5)
Other problems, filtered out with Pandas:
- Zero passenger rides (passenger_count > 0)
- Latitude/longitude outside a rough NYC bounding box:
- long > -78, long < -70 (pickup and dropoff)
- lat > 37, lat < 45 (pickup and dropoff)
Finally, split data set into training (70%), validation (15%), and testing (15%)
Export three CSV files for each data set with columns:
['fare_amount', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'key']
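A sketch of how the random 70/15/15 split and CSV export might look in pandas (the filenames here are my assumption, not necessarily what the notebook uses):

import pandas as pd

df = pd.read_csv('taxi_trips_cleaned.csv')   # hypothetical cleaned-up query results

# Shuffle, then slice into 70% train / 15% validation / 15% test
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
n = len(shuffled)
df_train = shuffled.iloc[: int(0.70 * n)]
df_valid = shuffled.iloc[int(0.70 * n) : int(0.85 * n)]
df_test  = shuffled.iloc[int(0.85 * n) :]

df_train.to_csv('taxi-train.csv', index=False)
df_valid.to_csv('taxi-valid.csv', index=False)
df_test.to_csv('taxi-test.csv', index=False)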
Primitive benchmark model
Before building a machine learning model, try coming up with a really basic model that will create a benchmark to compare the ML to.
(mean_fare_amount)/(mean_distance) gives an average rate per distance
Then compute RMSE of that model for all three data sets:
- Rate = $2.56/km
- Train RMSE = 6.79
- Valid RMSE = 6.20
- Test RMSE = 5.84
Now, put it all together by creating a function that will construct an SQL query that does all of the filtering/querying that we need, and pass the results of the query to another function that computes the RMSE
Using the same query each time, our goal is to beat the RMSE of the rule of thumb, (trip_price)/(trip_distance)
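A sketch of the rule-of-thumb benchmark (my own approximation of the idea - the notebook's actual distance calculation may differ): learn a single rate from the training data, then measure RMSE on all three sets:

import numpy as np
import pandas as pd

def estimate_distance(df):
    # Very crude straight-line distance (in degrees) - only for the rule-of-thumb model
    return np.sqrt((df['dropoff_latitude'] - df['pickup_latitude']) ** 2 +
                   (df['dropoff_longitude'] - df['pickup_longitude']) ** 2)

def compute_rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

df_train = pd.read_csv('taxi-train.csv')   # assumed filenames from the previous step
df_valid = pd.read_csv('taxi-valid.csv')
df_test  = pd.read_csv('taxi-test.csv')

# Rule of thumb: fare = rate * distance, with rate = (mean fare)/(mean distance)
rate = df_train['fare_amount'].mean() / estimate_distance(df_train).mean()

for name, df in [('train', df_train), ('valid', df_valid), ('test', df_test)]:
    print(name, compute_rmse(df['fare_amount'], rate * estimate_distance(df)))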
Module 2: Building Machine Learning Models with Tensorflow
Building Models with TensorFlow
Goals:
- TensorFlow code
- Loops in TensorFlow
- Monitoring results using TensorBoard
Labs:
- What is TensorFlow
- Machine learning
- Gaining flexibility - more machines
- Experiment framework
Overview of TensorFlow Models
TensorFlow is portable (like Java) - write code that works on a variety of hardware platforms, whether they are GPUs, CPUs, or phone processors
Two aspects: training, and predicting. Training often happens on higher-power processing hardware (GPUs), and prediction often happens on low-power processing hardware (phones, embedded processors, etc.)
Computations represented with data flow graphs, which abstract the computations away from the specific hardware
TensorFlow toolkit hierarchy:
- High-level out-of-the-box API (scikit-learn compatible): Estimator API
- Custom NN model components: tf.layers, tf.losses, tf.metrics
- Python API for full control: Core TensorFlow (Python)
- C++ API for very low-level control: Core TensorFlow (C++)
- Hardware running on different hardware: CPU, GPU, TPU, Android
Core TensorFlow is implemented in C++ - that's what's portable across different platforms (similar to the way the JVM is implemented in C, which allows you to write in a higher-level language, Java)
Can write Python code to create TensorFlow models - but even then there are several levels at which to write Python code (control all the details of the TensorFlow components, or abstract them away with high-level API)
Estimator API - highest level abstraction (uses pre-packaged models, intended for "getting stuff done")
Can run TensorFlow code written at any of these levels using CloudML to scale it up
Core TensorFlow Python API
Lets you build and run directed acyclic graphs
This looks a lot like numpy: operations are written as functions operating on data structures
c = tf.add(a,b)
a and b are tensors
This constructs components on a computation graph
To actually perform the computation, create a TensorFlow session and run it:
session = tf.Session()
numpy_c = session.run(c, feed_dict=...)
Contrasting numpy and TensorFlow: TF is lazy
In numpy, if we run this code, we get an immediate result:
>>> a = np.array([1, 2, 3])
>>> b = np.array([4, 5, 6])
>>> c = np.add(a, b)
>>> print(c)
[5 7 9]
In TF, we get an abstract operation that is not performed:
>>> a = tf.constant([1, 2, 3])
>>> b = tf.constant([4, 5, 6])
>>> c = tf.add(a, b)
>>> print(c)
Tensor("Add_7:0", shape=(3,), dtype=int32)
To actually evaluate this, we have to do:
>>> with tf.Session() as sess:
...     result = sess.run(c)
...     print(result)
[5 7 9]
Motivation behind the compute graph: you can separate the creation of the graph from the running of the graph
Additionally, graphs can include send/receive nodes between other nodes in the graph - allows assignment of different parts of the graph to different machines (e.g., one component goes to GPU, another component goes to CPU)
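For example, parts of a graph can be pinned to particular devices with tf.device (a minimal sketch; the device names depend on what hardware is actually available):

import tensorflow as tf

with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0])
    b = tf.constant([4.0, 5.0, 6.0])

with tf.device('/gpu:0'):
    c = a * b

# allow_soft_placement lets TF fall back to the CPU if no GPU is present
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(c))   # [ 4. 10. 18.]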
C++/Python front ends assemble the computation graphs
The core TensorFlow Execution System can further modify the graph's operations (add, mul, print, reshape, etc) to insert additional nodes
These are then passed on to the kernels (CPU, GPU, Android, etc.)
Area of a Triangle
Implementation of Heron's Formula using TensorFlow
List of functions in TensorFlow: https://www.tensorflow.org/api_docs/python/tf
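The lab's implementation is along these lines (a sketch, in the same TF 1.x style as the rest of these notes): compute the semi-perimeter s = (a+b+c)/2 and then area = sqrt(s(s-a)(s-b)(s-c)) for a batch of triangles:

import tensorflow as tf

def compute_area(sides):
    # sides is a (batch_size, 3) tensor: one (a, b, c) triple of side lengths per triangle
    a = sides[:, 0]
    b = sides[:, 1]
    c = sides[:, 2]
    s = (a + b + c) * 0.5                             # semi-perimeter
    return tf.sqrt(s * (s - a) * (s - b) * (s - c))   # Heron's formula

with tf.Session() as sess:
    area = compute_area(tf.constant([[5.0, 3.0, 7.1],
                                     [2.3, 4.1, 4.8]]))
    print(sess.run(area))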
By default, the notebook is python2. Everything works fine.
Changing the notebook to python3 and updating the print() statements also works fine.
print(dir(tf.logging))
['DEBUG', 'ERROR', 'FATAL', 'INFO', 'TaskLevelStatusMessage', 'WARN', '_GetFileAndLine', '_GetNextLogCountPerToken', '_THREAD_ID_MASK', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '_allowed_symbols', '_get_thread_id', '_handler', '_interactive', '_level_names', '_log_counter_per_token', '_log_prefix', '_logger', '_logging', '_logging_target', '_os', '_sys', '_time', 'debug', 'error', 'fatal', 'flush', 'get_verbosity', 'info', 'log', 'log_every_n', 'log_first_n', 'log_if', 'set_verbosity', 'vlog', 'warn', 'warning']
Change to tf.logging.ERROR to have less output. Unfortunately, there's very little clear explanation of what's going on in this notebook....
I'm confused about the multiple RMSE values flying around. There was an RMSE of around 6 for the training, validation, and test datasets in the first notebook. And then there's another query, the benchmark query, that results in an RMSE of 8 with the simple rule-of-thumb model of (mean fare)/(mean distance).
(Gonna be honest - none of this makes any sense. There's no attempt to explain anything about what's going on with the TensorFlow models. Basically, the lab says, "LOOK AT THIS CODE THEN PRESS ENTER OKAY NOW YOU ARE TENSORFLOW MASTER CONGRATS HERE IS UR CERTIFICATE")
Cloud MLE notebook:
- Had to do some digging, but basically the curl call to get the authorization JSON was trying to do a whole bunch of stuff all at once, but the MLE API was not enabled
- Instead of parsing the JSON request, I just printed it out whole, saw an authorization error due to Cloud MLE not being enabled, and a link
- https://console.developers.google.com/apis/api/ml.googleapis.com/overview?project=not-all-broken&pli=1
Working with TensorFlow Estimator API
Two-step process when creating a machine learning model.
First, to know how to set up the machine learning model:
- Start by answering question: are we solving a regression problem, or a classification problem?
- What is the label we are using?
- What are the features of the problem?
Then carry out the machine learning steps:
- Train the ML model
- Evaluate the ML model
- Predict with the ML model
Structure of TF Estimator API Model
When we create the estimator object, we can either create a LinearRegressor object, or a LinearClassifier model
import tensorflow as tf

# Define input feature columns
# In this silly example, we only use ONE column
feature_columns = [ tf.contrib.layers.real_valued_column("sq_footage") ]

# Instantiate the LinearRegressor model
# feature_columns is a list of ONE column (one input)
estimator = tf.contrib.learn.LinearRegressor(feature_columns=feature_columns)
feature_columns specifies the inputs into the machine learning algorithm.
Next, we carry out the ML steps:
# Train
def input_fn_train():
    feature_data = {'sq_footage' : tf.constant([1000, 2000])}
    # Label data is the "true value" of the quantity you are trying to predict
    label_data = tf.constant([100000, 200000])  # These are house prices: $100,000 and $200,000
    return feature_data, label_data

# One step = one step of gradient descent
# One step is typically done on a batch,
# but because our data set is small,
# one step = one epoch
estimator.fit(input_fn=input_fn_train, steps=100)

# Predict
def input_fn_pred():
    feature_data = {'sq_footage' : tf.constant([1500])}
    return feature_data

list(estimator.predict(input_fn=input_fn_pred))
Setting steps=100 means do gradient descent 100 times.
To clarify a bit more:
- A "step" refers to carrying out the gradient descent process on a single batch of data.
- However, in this case, our entire data set is so small that it fits in memory, and we only have one batch.
- Therefore, a "step" is actually carrying out the gradient descent process on the entire dataset.
- This is normally called an epoch - finishing doing training on the entire batch of data.
- In this (special) case of a small data set that fits in memory, 1 step = 1 epoch
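(As a made-up illustration of the usual case: with 1,000,000 training examples and a batch size of 512, one epoch would take roughly 1,000,000 / 512 ≈ 1,950 steps - so steps and epochs are normally very different quantities.)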
Call the fit function on the estimator, and pass it the function that it should use to get the training data (inputs PLUS outputs).
Then we specify steps = 100 - which means, carry out gradient descent 100 times.
Lastly, we can now make predictions - pass in another function, this time a function that it should use to get the prediction data (inputs ONLY, no outputs).
Structure of fit input function:
- Need to provide the features (the input quantities) as a dictionary: feature names as keys, each with a corresponding tensor of values
- Need to provide the labels (the output quantity being predicted) as just a tensor/list of values
Structure of predict input function:
- Need to provide the features (the test input quantities) as a dictionary: feature names as keys, each with a corresponding tensor of test values
- (No other quantities are needed - labels are being predicted!)
Going beyond linear regression with TF
Define a deep neural network with DNNRegressor instead of LinearRegressor:
model = DNNRegressor(feature_columns=[...], hidden_units=[128, 64, 32])
Constructor takes the architecture of the network. This case: 3 hidden layers [128, 64, 32] plus one input layer and one output layer, for a total of 5 layers. The architecture of the network is a result of some rules of thumb, and some trial and error.
Define a classifier like this:
model = LinearClassifier(feature_columns=[...])
model = DNNClassifier(feature_columns=[...], hidden_units=[...])
The feature columns can be several different types. We saw previously (with the square footage example) that we can define a real_valued_column, but we will see other types of feature columns as well.
The feature_columns input argument is a list of feature columns (of whatever type).
Feature engineering is the step where we'll cover different types of feature_columns.
More on the Estimator API
Helper functions for other types of inputs:
tf.contrib.layers.sparse_column_with_keys(column_name = "gender", keys=["female","male"])
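A sketch (my own, with made-up column names) of how a few of these helpers can be combined into a single feature_columns list; sparse_column_with_hash_bucket is another TF 1.x contrib helper, useful when the full set of keys isn't known in advance:

import tensorflow as tf

feature_columns = [
    tf.contrib.layers.real_valued_column("sq_footage"),
    tf.contrib.layers.sparse_column_with_keys(column_name="gender",
                                              keys=["female", "male"]),
    tf.contrib.layers.sparse_column_with_hash_bucket("zipcode", hash_bucket_size=1000),
]

model = tf.contrib.learn.LinearRegressor(feature_columns=feature_columns)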
Estimator API Lab
Notebook for this lab: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/tensorflow/b_tflearn.ipynb
Start by reading in the CSV data; create a pandas dataframe and extract the column names
Separate the column names into features (inputs) and labels (outputs)
Define a function that returns the training data (make_input_fn)
- Note that this is slightly confusing, since it involves nested functions, but basically this is a bundle of code that will extract the correct data/columns
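The nested-function pattern looks roughly like this (a sketch, not copied verbatim from the notebook; treat the column names as illustrative):

import tensorflow as tf

FEATURES = ['pickuplon', 'pickuplat', 'dropofflon', 'dropofflat', 'passengers']
LABEL = 'fare_amount'

def make_input_fn(df):
    # The outer function captures the dataframe; the inner function is what the Estimator calls
    def input_fn():
        features = {k: tf.constant(df[k].values) for k in FEATURES}
        label = tf.constant(df[LABEL].values)
        return features, label
    return input_fn

# Usage: model.fit(input_fn=make_input_fn(df_train), steps=num_steps)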
Now create a LinearRegressor
Batch size and epoch size are the same, since we are holding all of the data (df_train) in memory all at once
Now that we have the model, we evaluate the model by passing it the validation data set (df_valid) and use RMSE to quantify how good the model is
Pretty high error - benchmark was 8, and RMSE is 14 or 15 - but we'll focus on how to IMPROVE the model in later steps of the lab
(This shows the importance of creating a "simple rule" benchmark before you begin!)
Stop and Take Stock
Flustered by mysterious goings-on, both on the GCP side and on the local TensorFlow notebooks side. Specifically:
- Datalab notebooks and git version control of said notebooks is a bit of a mess. Things don't work the way they should - repos are not created when and where they're supposed to be. Other, smaller things are broken everywhere - keyboard shortcuts for Jupyter notebook don't work. HTTPS connections to ungit don't work. Stuff like that.
- Cloud networking is a big hassle. It's one of those large, mysterious systems where, when something breaks, my first thought is not "Okay, I can do some thinking to figure out what I did wrong," it's "Oh great, what happened this time?" It's like being a Linux newbie and building a program from source - you don't know when it's your own fault for doing something stupid, or when something is actually broken.
- Like right now - I know datalab is running on this virtual machine, and I know it's listening on port 8080, and I know I have a working firewall rule, but I still can't open Datalab in my browser. (And yet again - running on an n1-standard-1 datalab image, there's a bunch of initialization scripts and a second disk, and NO FRICKIN APT-GET). Is it because it needs more time? I've already waited 5 minutes. It seems to be up and operational. (Turns out, yes, I just needed to wait longer.)
Module 2b: Refactor Model for Flexibility and Scaling Up
Gaining More Flexibility:
- Laboratory
- Building effective ML for big data
- Refactoring tf.learn model
- Components for building neural network models
- Reading data in batches
- Reading csv files num_epochs times
- Reading local/GCS csv files in batches
- Lab overview
To refactor the tf.learn model:
- Refactor it to read out-of-memory data
- Refactor it to add new features easily
- Refactor it to evaluate model architecture as a part of training
TensorFlow Architecture for Out of Memory Learning
Back to the middle layer:
- Reminder, these are the components that are useful when building custom NN models
- tf.layers, tf.losses, tf.metrics
Recap of terminology:
- We will store our data in multiple files
- One step = going through single batch of training data once
- One epoch = going through entire training data once
Reading data from out of memory:
- To go through our data for 50 epochs, we just need to create a filename queue (from randomly shuffled filenames) that contains our file names 50 times each
- Example: dealing with three files A, B, C: our filename queue should be B B C A ... (enqueue_many function)
- Then we dequeue each file, one at a time, using a Reader (dequeue function)
- The Reader reads each file and decodes its contents into data
- The data then goes into an Example Queue (using the enqueue function)
- Why shuffle filenames and add them in random order? When doing distributed learning, we don't want to bias our learning process, or have one file cause a slowdown (on exact same machine each time)
- Each Reader will be on a different machine; each Reader takes filenames from the queue, and creates an example queue (an example is an input plus a label)
- Then, TensorFlow model reads data from the Example Queue
Reading a CSV file num_epochs times:
Start by setting labels for the columns in the files being read:
CSV_COLUMNS = ['fare_amount', 'pickuplon', 'pickuplat', ...]
LABEL_COLUMN = 'fare_amount'

# Now define default values that each column will take on
# (This keeps the ML model from choking if there are one or two missing pieces of data)
DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]
Next, define an input function that will do a wildcard match, and assemble each filename and put it into the Filename Queue:
def input_fn():
    input_file_names = tf.train.match_filenames_once(filename)
    filename_queue = tf.train.string_input_producer(
        input_file_names, num_epochs=num_epochs, shuffle=True)

    # Now make the Reader
    reader = tf.TextLineReader()
    _, value = reader.read_up_to(filename_queue, num_records=batch_size)
    value_column = tf.expand_dims(value, -1)
    columns = tf.decode_csv(value_column, record_defaults=DEFAULTS)
    features = dict(zip(CSV_COLUMNS, columns))

    # Take the one label item and pop it out of the features dictionary.
    # label is now the tensor for that one column.
    label = features.pop(LABEL_COLUMN)
    return features, label
Reading CSV Files num_epochs times
In the input function:
- Match all filenames (can have a wildcard, like train.*) or sharded files (train-00001-36, train-00002-36, etc)
- Then, take those input files and repeat them (in a shuffled way) num_epochs times
- Now, create the reader with TextLineReader() to read CSV files
- Tell the reader to read a batch of records from the filename queue
- This is just a line, so we use expand_dims() to make the scalar into a tensor
- Then we do decode_csv to decode this as a comma-separated string
- We need to tell TF what the datatypes are, and what to do if the value of the field is missing
- We now have our values
- But our features have to be a dictionary - where each column is an entry in the dict, with the key being the name of the column
- Associate the field names with the tensor values to make it a dictionary (that's features). One key is fare_amount, next key is pickuplon, next is pickuplat, etc.
- Each key has a tensor associated with it
- Those are our features - except that fare_amount is the label column (the thing we ARE trying to predict), so it shouldn't be one of the inputs
- Saying features.pop(LABEL_COLUMN) tells TF to leave out the quantity we're trying to predict (as output) from the list of inputs
- We then return the features (a dictionary of column name:tensor for each feature) and the label (the tensor popped out of the features dictionary)
TextLineReader() can read from local files, or from GCS
What it is doing is:
- Decoding CSV
- Creating a dictionary of features
- Extracting the label tensor (via features.pop(LABEL_COLUMN))
- Returning features and labels
This decode_csv can be fed a CSV from a local disk, or from Google Cloud Storage
Next lab:
- Refactor TensorFlow model
- Read from a potentially large data set/file in batches
- Do a wildcard match on filenames and feed them to a filename queue
- Break up the one-to-one relationships between inputs and features (unclear what this means, exactly)
This will smooth the way to running this TensorFlow model at scale.
Refactoring the ML Model for Big Data
Link to lab: https://codelabs.developers.google.com/codelabs/dataeng-machine-learning/index.html?index=#7
Link to notebook: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/tensorflow/c_batched.ipynb
First refactoring: reading input data in batch
The first refactoring addresses how the input data are being read. A filename queue is added to the TensorFlow graph, instead of reading the file directly into a Pandas dataframe. We pass the filename, and use this tf.train.match_filenames_once() thing. We use a string producer to generate the (one single) filename over and over. We shuffle the input filename queue. We repeat each file num_epochs times. Here's the whole mess:
def read_dataset(filename, num_epochs=None, batch_size=512, mode=tf.contrib.learn.ModeKeys.TRAIN):
    def _input_fn():
        input_file_names = tf.train.match_filenames_once(filename)
        filename_queue = tf.train.string_input_producer(
            input_file_names, num_epochs=num_epochs, shuffle=True)
        reader = tf.TextLineReader()
        _, value = reader.read_up_to(filename_queue, num_records=batch_size)
        value_column = tf.expand_dims(value, -1)
        columns = tf.decode_csv(value_column, record_defaults=DEFAULTS)
        features = dict(zip(CSV_COLUMNS, columns))
        label = features.pop(LABEL_COLUMN)
        return features, label
    return _input_fn
Second refactoring: treat input data and features as different
The second refactoring addresses the way we turn input data into features. They refactor this so that they are specifically extracting the input variables in one step, then explicitly specifying the model features in another, separate step. What they mean by "break the one-to-one relationship between inputs and features" is, we aren't forced to use the input data and only the input data as our model features. Once we change the way the input data is loaded (i.e., if we don't read data straight from the input file into the model), we can transform input variables, leave certain input variables out of the model, normalize them, combine them together, etc.
Third refactoring: Move model evaluation into training loop
The problem with the notebook, as is, is that we're specifying a number of epochs. Instead, we want to evaluate the model as we go, and stop when we reach some criterion.
(This will happen in the next lab.)
Also a checkpointing problem - we save checkpoints during the training, and use the final checkpoint as the final model. (Discussion of overfitting - we may not want the last step, because it may be overfit.) This will also be improved by stopping the model training when we reach some error criteria.
Train the model on the training data set, and every few steps, stop and assess RMSE on the validation data set. Stop when the RMSE on the validation data set starts to increase (indicates we're overfitting).
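A rough sketch of that idea (hypothetical names - model, train_input_fn, valid_input_fn - and it assumes an 'rmse' metric has been defined; the course does this properly with the experiment framework in the next lab):

max_rounds = 20              # arbitrary upper bound on training rounds
best_rmse = float('inf')

for i in range(max_rounds):
    # Train a bit more, then check the error on the validation set
    model.fit(input_fn=train_input_fn, steps=1000)
    metrics = model.evaluate(input_fn=valid_input_fn, steps=1)
    rmse = metrics['rmse']
    if rmse > best_rmse:
        break                # validation RMSE started increasing --> likely overfitting, stop
    best_rmse = rmse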
What to improve further?
Handle machine failure in distributed training - what if something goes wrong? Want to be able to pick up training wherever we left off.
Monitor training - especially useful if training is expected to take a very long time. Answer questions like, which epoch are we on, what is the current RMSE, etc.
Choose a model based on the validation data set - use a smarter stopping criteria than number of epochs.
How much does a reasonably realistic machine learning model cost?
It will cost a few thousand dollars for a reasonably realistic model
References
Reading out-of-memory data: schematic: https://www.tensorflow.org/programmers_guide/reading_data