Using The TensorFlow Estimator Design Pattern

Author: Josh Patterson

Date: April 19th, 2019

This tutorial covers the basics of how to use TensorFlow's Estimator API to write modeling code that runs consistently across multiple execution modes. In a previous article we looked at how to run a pre-built TensorFlow program in distributed mode on Kubeflow. However, the TensorFlow code itself was rather complex, as it required the user to deal with many concerns beyond the model training itself. In this tutorial the reader will learn:

  • What is the Estimator API?
  • Why is it relevant?
  • Sample code that models the Iris dataset locally with an Estimator

Let's jump right into "What is the Estimator API?"

Introduction to TensorFlow's Estimator API

For newer readers who aren't familiar with the landscape of machine learning tooling, we'll start off by defining TensorFlow:

"TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications."

https://www.tensorflow.org/

Building on that definition, the TensorFlow Estimator API is a high-level TensorFlow API that makes machine learning programming easier when dealing with different execution modes (e.g., "local", "distributed"). Historically, TensorFlow coding has involved a lot of low-level details, such as placing specific operations on specific GPUs. Estimators make it easier to share model implementations between data scientists. Estimators also build the TensorFlow graph for you, and there is no explicit Session to manage. Many data scientists do not want to deal with these kinds of details, so Estimators make things considerably simpler. Ultimately, the Estimator API lets the user get consistent results regardless of whether they are executing locally or in the cloud in distributed mode.

Estimators provide a standard way to deal with the following actions:

  • training
  • evaluation
  • prediction
  • export for serving

There are many pre-built Estimators in the TensorFlow library, but you may write your own custom Estimator as well. Any Estimator, whether pre-built or one we create ourselves, is based on the tf.estimator.Estimator class.
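To make that relationship concrete, below is a minimal sketch of a custom Estimator built from a model_fn, loosely following the custom Estimator pattern in the TensorFlow documentation. The feature column, layer sizes, and optimizer here are illustrative choices of ours, not part of the Iris example later in this article:

    import tensorflow as tf

    def my_model_fn(features, labels, mode):
        # Illustrative network: one numeric feature column, one hidden layer.
        columns = [tf.feature_column.numeric_column('x')]
        net = tf.feature_column.input_layer(features, columns)
        net = tf.layers.dense(net, units=10, activation=tf.nn.relu)
        logits = tf.layers.dense(net, units=3, activation=None)

        predicted_classes = tf.argmax(logits, axis=1)
        if mode == tf.estimator.ModeKeys.PREDICT:
            predictions = {'class_ids': predicted_classes[:, tf.newaxis],
                           'probabilities': tf.nn.softmax(logits)}
            return tf.estimator.EstimatorSpec(mode, predictions=predictions)

        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels,
                                                      logits=logits)
        if mode == tf.estimator.ModeKeys.TRAIN:
            optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
            train_op = optimizer.minimize(
                loss, global_step=tf.train.get_global_step())
            return tf.estimator.EstimatorSpec(mode, loss=loss,
                                              train_op=train_op)

        # EVAL mode: report accuracy alongside the loss.
        accuracy = tf.metrics.accuracy(labels=labels,
                                       predictions=predicted_classes)
        return tf.estimator.EstimatorSpec(
            mode, loss=loss, eval_metric_ops={'accuracy': accuracy})

    # A custom Estimator is just the base class plus our model_fn.
    estimator = tf.estimator.Estimator(model_fn=my_model_fn)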

Keras Support with Estimators

TensorFlow now supports converting any Keras model into an Estimator, speeding up model development. This is done by defining a model with tf.keras.Model and then converting it to a tf.estimator.Estimator object with the tf.keras.estimator.model_to_estimator() method. Once we have an Estimator representation of the Keras model, we can train the model in the same way we'd train any Estimator model.
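As a quick, hedged illustration of that conversion (the model architecture below is arbitrary and not tied to the Iris example later in this article):

    import tensorflow as tf

    # Define and compile a small Keras model (arbitrary architecture).
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(16, activation='relu', input_shape=(4,)),
        tf.keras.layers.Dense(3, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    # Convert the compiled Keras model into an Estimator.
    keras_estimator = tf.keras.estimator.model_to_estimator(keras_model=model)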

Deprecation of the Experiment Class

Previously we might have used the Experiment class for building TensorFlow training code, but at this point the Experiment class has been marked deprecated. The Estimator class should now be used directly where we'd have used the Experiment class, and it appears to be a better design pattern as well.

Writing TensorFlow Code with Estimators

The primary steps necessary to write TensorFlow training code with Estimators are:

  1. Build your Estimator model or use a pre-built one already in the TensorFlow library
  2. Define how data is fed into the model for both the training and test datasets (often these input functions are set up the same way)
  3. Define the training and evaluation specifications (TrainSpec and EvalSpec, respectively) to be passed to tf.estimator.train_and_evaluate

The EvalSpec can also include information on how to export your trained model for prediction (serving).

The .train_and_evaluate(...) method provides a consistent interface for training locally or in the cloud, non-distributed or in a distributed fashion. Check out the TensorFlow documentation for more details. In the next section we show an example of the Estimator API in practice.

Example TensorFlow Code for Modeling the Iris Dataset

Below we walk through a basic TensorFlow Estimator API example that models the canonical Iris dataset (the full code listing is in the GitHub repository linked at the end of this article). While this example is not a complex deep learning model, the Iris dataset is simple and well understood, which lets us see the Estimator API in action without the distractions of a more complex model.

In the sections below, we provide commentary on the following areas of the example code:

  • Loading the initial dataset, creating feature columns
  • Configuring our TensorFlow job with the RunConfig class
  • Setting up an Estimator with the DNNClassifier class
  • Setting up the TrainSpec class
  • Setting up the EvalSpec class
  • Training the model

Loading the Iris Dataset

We use the included iris_data.py utility functions to download and load the Iris dataset locally, as seen in the code snippet below:

    # Fetch the data
    (train_x, train_y), (test_x, test_y) = iris_data.load_data()

    # Feature columns describe how to use the input.
    my_feature_columns = []
    for key in train_x.keys():
        my_feature_columns.append(tf.feature_column.numeric_column(key=key))

The iris_data.py utilities do a few things under the hood (sketched below):

  1. download the dataset
  2. read the CSV files into Pandas dataframes
  3. create dataset features and labels
  4. create separate train and test datasets
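For readers curious what those steps look like, here is a rough sketch of a load_data() function in the style of the standard TensorFlow Iris sample that iris_data.py follows. Treat the details as an approximation rather than a verbatim copy of the file:

    import pandas as pd
    import tensorflow as tf

    # Names and URLs follow the standard TensorFlow Iris sample.
    TRAIN_URL = "http://download.tensorflow.org/data/iris_training.csv"
    TEST_URL = "http://download.tensorflow.org/data/iris_test.csv"
    CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth',
                        'PetalLength', 'PetalWidth', 'Species']

    def load_data(label_name='Species'):
        # 1. Download (and cache) the train and test CSV files.
        train_path = tf.keras.utils.get_file(TRAIN_URL.split('/')[-1], TRAIN_URL)
        test_path = tf.keras.utils.get_file(TEST_URL.split('/')[-1], TEST_URL)

        # 2. Read the CSVs into Pandas dataframes.
        train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
        test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)

        # 3. and 4. Split each dataframe into features and labels,
        # keeping the train and test sets separate.
        train_x, train_y = train, train.pop(label_name)
        test_x, test_y = test, test.pop(label_name)
        return (train_x, train_y), (test_x, test_y)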

Once we have our data, we can move on to configuring our modeling job with the RunConfig class.

Configuration with RunConfig

The RunConfig class is part of the Estimator API in TensorFlow and it specifies the configurations for an Estimator run.

We can see the RunConfig class in action in our code in the snippet below:

    config = tf.estimator.RunConfig(
                model_dir="/tmp/tf_estimator_iris_model",
                save_summary_steps=1,
                save_checkpoints_steps=100,
                keep_checkpoint_max=3,
                log_step_count_steps=10)

This configuration does a few things for us, but primarily sets things like the directory where we'll save our model checkpoints (e.g., model_dir="/tmp/tf_estimator_iris_model") and how often to checkpoint the model (e.g., save_checkpoints_steps=100). All of the properties for the RunConfig class are listed in the documentation; we show its __init__ signature here:

    __init__(
        model_dir=None,
        tf_random_seed=None,
        save_summary_steps=100,
        save_checkpoints_steps=_USE_DEFAULT,
        save_checkpoints_secs=_USE_DEFAULT,
        session_config=None,
        keep_checkpoint_max=5,
        keep_checkpoint_every_n_hours=10000,
        log_step_count_steps=100,
        train_distribute=None,
        device_fn=None,
        protocol=None,
        eval_distribute=None,
        experimental_distribute=None
    )

There are other properties that can be set for RunConfig, but we'll save those for a future article. Now that we've looked at how to configure our job, let's now move on to how to create the Estimator itself.

Setting up an Estimator with the DNNClassifier

For this example we're going to build a small multi-layer perceptron neural network to model our Iris dataset.

    # Build 2 hidden layer DNN with 10, 10 units respectively.
    estimator = tf.estimator.DNNClassifier(
        feature_columns=my_feature_columns,
        model_dir="/tmp/tf_estimator_iris_model",
        # Two hidden layers of 10 nodes each.
        hidden_units=[10, 10],
        # The model must choose between 3 classes.
        n_classes=3)

In the code section above, we can see the DNNClassifier class being instantiated with our feature columns, the directory where we'll save the model, and the size of the hidden layers (10 nodes per layer). Finally, we tell TensorFlow that we want the model to choose between 3 output classes. Let's now move on to setting up the TrainSpec for the Estimator.

TrainSpec and Data Input

    train_input_fn = lambda: iris_data.train_input_fn(train_x, train_y,
                                                      opts.batch_size)

    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn,
                                        max_steps=1000)

In the code section above we can see two statements. The first builds a train_input_fn with the lambda keyword in Python, based on the train_input_fn(...) method in our iris_data.py utilities.

The second statement takes the train_input_fn we created and passes it to the TrainSpec class in the Estimator API. This class instance will be used by the Estimator when we train the model in a moment.

Oftentimes we will not have a pre-made input function for our Estimator. To write a custom input function for a TrainSpec, check out the TensorFlow documentation on datasets for Estimators.
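For reference, here is a minimal sketch of such an input function, modeled on the one in the TensorFlow datasets guide for Estimators (the shuffle buffer size of 1000 is an arbitrary choice):

    import tensorflow as tf

    def train_input_fn(features, labels, batch_size):
        # Convert the inputs to a tf.data.Dataset of (features, label) pairs.
        dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

        # Shuffle, repeat indefinitely, and batch the examples.
        return dataset.shuffle(1000).repeat().batch(batch_size)

Now we'll move on to evaluation with Estimators.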

EvalSpec and Model Evaluation

    # Evaluate the model.
    eval_input_fn = lambda: iris_data.eval_input_fn(test_x, test_y,
                                                    opts.batch_size)

    eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn,
                                      steps=None,
                                      start_delay_secs=0,
                                      throttle_secs=60)

As with the TrainSpec, we use Python's lambda keyword to create an input function, which is then passed directly as a parameter to the EvalSpec class. In this configuration, steps=None tells the Estimator to evaluate until the input function raises an end-of-input exception, start_delay_secs=0 starts evaluation immediately, and throttle_secs=60 ensures at least 60 seconds pass between evaluations.

Training Method for Estimators

    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Lastly, we'll highlight how it all comes together for training an Estimator. We pass the Estimator, TrainSpec, and EvalSpec as parameters to the tf.estimator.train_and_evaluate(...) method. Veterans of TensorFlow programming will instantly recognize that this function encapsulates a number of details of training runs and is far easier for data scientists to use consistently. Another interesting aspect of this method is that it can train TensorFlow models in a distributed fashion without changing the training code.
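Under the hood, train_and_evaluate decides between local and distributed execution by inspecting the TF_CONFIG environment variable. As a hedged illustration, a worker process in a distributed run might be configured with something like the following (the hostnames, ports, and cluster layout are placeholders):

    import json
    import os

    # Placeholder cluster layout; real deployments (e.g., Kubeflow) set this up.
    os.environ['TF_CONFIG'] = json.dumps({
        'cluster': {
            'chief': ['host0:2222'],
            'worker': ['host1:2222', 'host2:2222'],
            'ps': ['host3:2222']
        },
        'task': {'type': 'worker', 'index': 0}
    })

With no TF_CONFIG set, the same training code simply runs locally.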

Running Our Estimator TensorFlow Program

To run this example application, first we need to pull the code down from GitHub:

git clone https://github.com/pattersonconsulting/tensorflow_estimator_examples.git

Next, change into the new project directory that was just created:

cd tensorflow_estimator_examples

The user needs to account for dependency management when running any Python program; for example, by installing TensorFlow with pip inside a virtualenv or Anaconda environment.

Once you have at least TensorFlow 1.12.0 installed locally, you should be able to run the example with the command:

python tf_estimator_iris_single.py

The console output should look similar to the output shown below:

INFO:tensorflow:global_step/sec: 116.118
INFO:tensorflow:loss = 10.430826, step = 401 (0.862 sec)
INFO:tensorflow:global_step/sec: 147.967
INFO:tensorflow:loss = 6.974675, step = 501 (0.675 sec)
INFO:tensorflow:global_step/sec: 161.068
INFO:tensorflow:loss = 5.9006176, step = 601 (0.621 sec)
INFO:tensorflow:global_step/sec: 152.809
INFO:tensorflow:loss = 3.3194108, step = 701 (0.654 sec)
INFO:tensorflow:global_step/sec: 167.913
INFO:tensorflow:loss = 7.634383, step = 801 (0.596 sec)
INFO:tensorflow:global_step/sec: 175.906
INFO:tensorflow:loss = 5.5086565, step = 901 (0.568 sec)
INFO:tensorflow:Saving checkpoints for 1000 into /tmp/tf_estimator_iris_model/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-04-18-19:00:17
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tf_estimator_iris_model/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-04-18-19:00:18
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.93333334, average_loss = 0.06142591, global_step = 1000, loss = 1.8427773
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: /tmp/tf_estimator_iris_model/model.ckpt-1000
INFO:tensorflow:Loss for final step: 2.909092.

Summary

In this example we walked through the new Estimator API for TensorFlow and highlighted some of its core concepts. We hope the reader enjoyed the walkthrough. In future articles we'll take a look at how this example can be extended to distributed TensorFlow and then further executed on systems such as Kubeflow for on-premise/cloud/hybrid operations.

If you'd like further help in topics such as:

  • General machine learning education
  • Advanced deep learning modeling
  • Enterprise machine learning infrastructure
please reach out to us and say hello.