tf-serving, if you don’t know, is a tool that Google has built to serve models built using tensorflow. Even keras models with a tensorflow backend should work just fine.

Even though there are a lot of guides on how to use tf-serving, I could not find anything coherent and simple. So I decided to write one, mostly so that the next time I have to do this I have something to refer to.

Why tf-serving

You could just put your model behind a simple flask API and that will work just fine for small use cases.

tf-serving mostly comes in handy when you have a heavy load. It is also pretty useful when you have to version your models and have something like a CI/CD pipeline. This video explains it pretty well.

How tf-serving

OK, now let us get to the part you might be reading this blog for. We will be using the tensorflow/serving docker container to run the whole thing. This makes things a whole lot simpler. Also, later when you have to put the whole thing behind kubernetes acting as a load balancer, you will end up using it anyway.

Folder structure

tf-serving needs the model files to be in a specific structure. It should look something like this.

models                                             # base folder for all the models
└── mymodel                                        # model name
    └── 1                                          # model version
        ├── saved_model.pb
        └── variables
            ├── variables.data-00000-of-00001
            └── variables.index

We will have a base folder called models (you could name it anything, but we will have to pass on the same name to tf-serving).

Inside the base folder we will have different models. The name of the model that I am using here is mymodel, so we have that as the folder name here.

Inside that we will have folders with names 1, 2, 3 … etc. These will be the different versions. It is set up like this so that when you have a new version, you can just add a new folder and tf-serving will automatically switch to the new model without restarting. Plus you get some form of versioning.

What goes inside them

OK, now that we know where to put the files, let us see what to put in there.

tf-serving will need the files to be in a format it calls SavedModel. You can find more about it here.

There are utils inside tensorflow which let us convert our models into SavedModel. Here I will show how to do it for a keras model.

signature = tf.saved_model.signature_def_utils.predict_signature_def(
    inputs={"image": model.input}, outputs={"scores": model.output}
)
builder = tf.saved_model.builder.SavedModelBuilder("./models/mymodel/1")
builder.add_meta_graph_and_variables(
    sess=keras.backend.get_session(),
    tags=[tf.saved_model.tag_constants.SERVING],
    signature_def_map={tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature},
)
builder.save()

You could add that code right at the end of something like the script below, and you should end up with a model in the path ./models/mymodel/1 with the directory structure specified above.

import tensorflow as tf
from tensorflow import keras
import numpy as np

fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

train_images = train_images / 255.0
test_images = test_images / 255.0

model = keras.Sequential(
    [
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(128, activation=tf.nn.relu),
        keras.layers.Dense(10, activation=tf.nn.softmax),
    ]
)
model.compile(
    optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
model.fit(train_images, train_labels, epochs=5)
test_loss, test_acc = model.evaluate(test_images, test_labels)

predictions = model.predict(test_images)
res = np.argmax(predictions[0])
print("res:", res)

Running using docker

Well, I assume you know what docker is. If you don’t, let us think of it as a super lightweight VM (I couldn’t be more wrong when I say lightweight VM, but it is a good analogy). Just install docker from here.

Btw, if you don’t know docker, look into it. It is pretty awesome.

Now you can run something like this.

docker run -t --rm -p 8501:8501 \
   -v "$(pwd)/models:/models" \
   -e MODEL_NAME=mymodel \
   tensorflow/serving

OK, what we do here is we use the image tensorflow/serving from Docker Hub. It is a preconfigured tensorflow serving setup.

The -p option says that we map port 8501 of the container to port 8501 on our local machine. This is the default REST port in tf-serving. For gRPC it is 8500.

With -v we mount $(pwd)/models to /models inside the container as that is where tf-serving will look for the files.

Also we specify the MODEL_NAME as mymodel so that tf-serving will run that model.
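
Once the container is up, you can check that the model actually loaded before wiring up a client. tf-serving exposes a model status endpoint on the REST port; a quick check (a small sketch using requests, which we will use for the client anyway) looks like this:

import requests

# 8501 is the REST port we mapped above, mymodel is the MODEL_NAME we set.
resp = requests.get("http://localhost:8501/v1/models/mymodel")
print(resp.json())  # should report version 1 with state AVAILABLE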

Simple client

import json
import requests
import numpy as np
from tensorflow import keras

fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
test_images = test_images / 255.0

url = 'http://localhost:8501/v1/models/mymodel:predict'
headers = {"content-type": "application/json"}
data = json.dumps({"instances": test_images.tolist()})

resp = requests.post(url, data=data, headers=headers)
if resp.status_code == 200:
    predictions = resp.json()['predictions']
    res = np.argmax(predictions[0])
    print("res:", res)

Not a whole lot of changes from simple prediction. We pretty much replace the line

predictions = model.predict(test_images)

with the lines

resp = requests.post(url, data=data, headers=headers)
predictions = resp.json()['predictions']
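
One thing to keep in mind: the client above posts the entire test set (10,000 images) in a single request. For a quick smoke test you could send just one image instead; something like this should work (same url and headers as above, just a smaller payload):

# Send only the first test image instead of the whole test set.
data = json.dumps({"instances": [test_images[0].tolist()]})
resp = requests.post(url, data=data, headers=headers)
print("res:", np.argmax(resp.json()["predictions"][0]))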

Well, that is pretty much it for running tf-serving. Now put load balancing on top of it and you got a pretty solid production deployment.

Btw, here are the tf-serving docs for people who want to use tensorflow instead of keras and gRPC instead of REST.
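
If you do want to go the gRPC route, the client looks roughly like the sketch below. Treat it as a starting point and check the docs for the details; it assumes you have the tensorflow-serving-api and grpcio packages installed, and it talks to port 8500, which we would also need to publish in the docker run command (-p 8500:8500).

import grpc
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

_, (test_images, _) = keras.datasets.fashion_mnist.load_data()
test_images = test_images / 255.0

# 8500 is the default gRPC port in tf-serving.
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "mymodel"
# "image" and "scores" are the input/output names we used in the signature above.
request.inputs["image"].CopyFrom(tf.make_tensor_proto(test_images[:1], dtype=tf.float32))

result = stub.Predict(request, 10.0)  # 10 second timeout
scores = tf.make_ndarray(result.outputs["scores"])
print("res:", np.argmax(scores[0]))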