5. Full MNIST Example

MNIST is a well-known dataset for evaluating machine learning models. It has effectively been solved at this point, but it remains a good starting point for getting to know how Nemesyst works, and for showing how to use Nemesyst in practice. It is also relatively clean, so little pre-processing is required beyond turning it into a directly usable form.

The dataset will be downloaded for you by the cleaning module.

5.1. Requirements

Please ensure you have both MongoDB and the following Python dependencies installed as a bare minimum:

examples/requirements/mnist.txt
ConfigArgParse>=0.14.0
pymongo>=3.8.0
future>=0.17.1
scikit-learn>=0.21.3
keras>=2.3.1
tensorflow-gpu>=2.0.0

If you are using pip you can quickly install these using:

Files-only/ development pip requirements installation example:
pip install -r examples/requirements/mnist.txt

Note

Please also ensure you have the Nemesyst source files at hand (Files-only/ development), as they contain the extra example files you will need later on, which are only present in the Files-only/ development variant.

5.2. Configuring

For this example we have created a configuration file for you, so nothing additional needs to be done, although it is advised that you read it through. It is a .ini-style file. Each of these options can also be passed to Nemesyst as CLI or environment options, but we believe a configuration file makes for a much nicer introduction.

examples/configs/nemesyst/mnist.conf
# please see full documentation at:
#
# this config file assumes you are in the directory nemesyst from:
# https://github.com/DreamingRaven/nemesyst
# we use relative paths here so they may not work if you aren't there.

# mongodb options for your experimental database
--db-user-name=groot          # change this to your desired username
--db-password=True            # this will create a password prompt
; --db-init=True                # initialises the database with user
; --db-start=True               # starts the database
--db-port=65530               # sets the db port
--db-name=data                # sets the database name
--db-path=./data_db/          # sets the path to create a db
--db-log-path=./data_db/      # sets the parent directory of log files
--db-log-name=mongo_log       # sets the file name to use for log
--db-authentication=SCRAM-SHA-1 # sets db to be connected to using user/pass

# cleaning specific options
; --data-clean=True                                             # nothing will be cleaned unless you tell nemesyst explicitly to do so, even if other information is given
--data-cleaner=examples/cleaners/mnist_cleaner.py             # the path to the cleaner in this case MNIST example cleaner
--data-collection=mnist                                       # sets the collection to import to

# learning specific options
; --dl-learn=True                                               # nothing will be learned unless you tell nemesyst explicitly to do so even if other information is given
--dl-learner=examples/learners/mnist_learner.py               # the path to the learner in this case MNIST example learner
--dl-batch-size=32                                            # set the batch sizes to use
--dl-epochs=12                                                # set the number of epochs we want (times to train on the same data)
--dl-output-model-collection=models

# inferring specific options
; --i-predict=True                                              # nothing will be predicted unless you tell nemesyst explicitly to do so even if other information is given
--i-predictor=examples/predictors/mnist_predictor.py          # the path to the predictor in this case MNIST example predictor
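
Each of these options surfaces inside your cleaner, learner, and predictor modules through the args dictionary that Nemesyst passes to main(). A minimal sketch of that pattern, mirroring the learner shown later (note that repeatable options arrive as lists indexed by the current process number):

Hypothetical module skeleton example:
def main(**kwargs):
    """Sketch of how a Nemesyst module receives the parsed options."""
    args = kwargs["args"]  # merged config-file, cli, and environment options
    db = kwargs["db"]      # ready-made database helper object
    # repeatable options are lists, one entry per configured process:
    batch_size = args["dl_batch_size"][args["process"]]
    epochs = args["dl_epochs"][args["process"]]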

If you would like to skip the rest of this example, for instance because you are more interested in checking that Nemesyst is working, simply remove the symbol “;” from the start of any line it appears on to uncomment that line, and then run everything using:

Files-only/ development automated example:
./nemesyst --config ./examples/configs/nemesyst/mnist.conf

5.3. Serving

For this example Nemesyst will create and start a database for us, since we pass in the options to initialize and start the database on top of the config file (see Configuring). We can do this using:

Files-only/ development serving example:
./nemesyst --config ./examples/configs/nemesyst/mnist.conf --db-init --db-start

This example will start the database. To close the database again you can:

Files-only/ development stopping database example:
./nemesyst --config ./examples/configs/nemesyst/mnist.conf --db-stop

Note

Nemesyst may ask you for a password. As long as you are using the same password between runs this won't cause you any issue, as you are simultaneously using and creating (when using --db-init) the password for the default user in our config file. You can change this behavior, but we wanted to include it so we don't end up creating universal passwords that lazy users might overlook.

For more complex scenarios please refer to Serving with MongoDB.

5.4. Checking up on the database

It may be necessary after each of the following steps to check on the database to ensure it has done exactly what you expect. To log in to the database easily you can use:

Files-only/ development logging into running database example:
./nemesyst --config ./examples/configs/nemesyst/mnist.conf --db-login

This should put you in the Mongo shell, a JavaScript-based interface to MongoDB for direct user intervention, where you can do all sorts of operations and checks. This is of course optional but recommended. If you would rather have a more graphical interface, you can use any of the plethora of tools to visualize the database, but we recommend MongoDB Compass, in particular for its aggregation helper.
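
If you would rather check programmatically, the same sanity checks can be sketched with pymongo. The connection details below mirror the example config file; the password placeholder is yours to fill in, the authSource is our assumption about where the user was created, and the expected counts assume the cleaning step below has already run:

Hypothetical pymongo check example:
from pymongo import MongoClient

# connection details mirror examples/configs/nemesyst/mnist.conf
client = MongoClient("localhost", 65530,
                     username="groot",
                     password="<your password>",  # whatever you entered
                     authSource="data",           # assumption: user lives in "data"
                     authMechanism="SCRAM-SHA-1")
db = client["data"]
print(db.list_collection_names())        # expect "mnist" after cleaning
print(db["mnist"].count_documents({}))   # expect 70000 after cleaning
print(db["mnist"].find_one({}, {"x": 0}))  # one document, minus the image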

5.5. Cleaning

In this step we will launch the example MNIST cleaner, which downloads the data using scikit-learn to get a much cleaner version of the dataset for us. It then inserts the data into individual dictionaries row-wise, so that each dictionary is a single complete example/observation with its associated target feature. To put it back into the database we need only yield each dictionary and Nemesyst will handle the iteration for us. This document dictionary can also be used to house useful metadata about the dataset, so that you can filter further using more advanced Nemesyst and MongoDB functionality that goes beyond the scope of this simple introduction.

To begin cleaning you need only tell Nemesyst to clean the data using:

Files-only/ development cleaning example:
./nemesyst --config ./examples/configs/nemesyst/mnist.conf --data-clean

The example MNIST cleaner is shown below for convenience.

examples/cleaners/mnist_cleaner.py
# @Author: George Onoufriou <archer>
# @Date:   2019-08-15
# @Email:  george raven community at pm dot me
# @Filename: mnist_cleaner.py
# @Last modified by:   archer
# @Last modified time: 2019-08-16
# @License: Please see LICENSE in project root

import datetime
from sklearn.datasets import fetch_openml


def main(**kwargs):
    print("downloading mnist dataset...")
    x, y = fetch_openml('mnist_784', version=1, return_X_y=True)
    utc_import_start_time = datetime.datetime.utcnow()
    print("importing mnist dataset to mongodb...")
    for i in range(len(x)):  # could use enumerate but only interested in index
        document = {
            "x": x[i].tolist(),     # converting to list to be bson compatible
            "y": int(y[i]),         # Ensuring is num
            "img_num": i,           # saving the image number
            "utc_import_time":  utc_import_start_time,
            "dataset": "mnist",
            "img_count": len(x)
        }
        yield document

5.6. Learning

To learn from the now cleaned database-residing data, you can:

Files-only/ development learning example:
./nemesyst --config ./examples/configs/nemesyst/mnist.conf --dl-learn

This example trains a CNN and yields a tuple (metadata_dictionary, pickle.dumps(model)), which is then stored in MongoDB using gridfs, as most models exceed the base MongoDB 16MB document size limit. This example is derived from one of the pre-existing Keras MNIST examples, but transformed into a relatively efficient Nemesyst variant. The major differences are that we use fit_generator, which takes a generator (in our case a database cursor and pre-processor) for the training set and another generator for the validation set. For this example we have simply validated against the test set, as we aren't attempting to blind ourselves for the purposes of scientific rigor and over-fitting prevention. Care should be taken when reading the pipelines, as they can express quite complex operations to solve very tough problems, but here we simply use them to separate the dataset into train and validation sets.

examples/learners/mnist_learner.py
# @Author: George Onoufriou <archer>
# @Date:   2019-08-16
# @Email:  george raven community at pm dot me
# @Filename: mnist_learner.py
# @Last modified by:   archer
# @Last modified time: 2020-01-31T16:13:08+00:00
# @License: Please see LICENSE in project root

import numpy as np
import pickle

import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D


def main(**kwargs):
    """Entry point called by Nemesyst, always yields dictionary or None.

    :param **kwargs: Generic input method to handle infinite dict-args.
    :rtype: yield dict
    """
    # # there are issues using RTX cards with tensorflow:
    # # https://github.com/tensorflow/tensorflow/issues/24496
    # # if this is the case please uncomment the following two lines:
    # import os
    # os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # use cpu

    # just making these a little nicer to read, but in a real application
    # we would not want these hardcoded; thankfully the database can provide!
    args = kwargs["args"]
    db = kwargs["db"]
    img_rows, img_cols = 28, 28
    num_classes = 10
    # creating two database generators to iterate quickly through the data
    # these are not random; they split the data using 60000 as the boundary
    train_generator = inf_mnist_generator(db=db, args=args,
                                          example_dim=(img_rows, img_cols),
                                          num_classes=num_classes,
                                          pipeline=[{"$match":
                                                     {"img_num":
                                                      {"$lt": 60000}}}
                                                    ])
    test_generator = inf_mnist_generator(db=db, args=args,
                                         example_dim=(img_rows, img_cols),
                                         num_classes=num_classes,
                                         pipeline=[{"$match":
                                                    {"img_num":
                                                     {"$gte": 60000}}}
                                                   ])
    # ensuring our input shape is in whatever style keras backend wants
    if K.image_data_format() == 'channels_first':
        input_shape = (1, img_rows, img_cols)
    else:
        input_shape = (img_rows, img_cols, 1)

    model = generate_model(input_shape=input_shape,
                           num_classes=num_classes)
    model.summary()
    hist = model.fit_generator(generator=train_generator,
                               # note: 219 steps of 32 covers only ~7000 of
                               # the 60000 training examples per epoch; use
                               # ceil(60000/32)=1875 to cover the full set
                               steps_per_epoch=219,
                               validation_data=test_generator,
                               validation_steps=219,
                               epochs=args["dl_epochs"][args["process"]],
                               initial_epoch=0)

    excluded_keys = ["pylog", "db_password"]
    # yield metadata, model for gridfs
    best_model = ({
        # metadata dictionary (used to find the model later)
        "model": "mnist_example",
        # "validation_loss": float(hist.history["val_loss"][-1]),
        # "validation_accuracy": float(hist.history["val_acc"][-1]),
        "loss": float(hist.history["loss"][-1]),
        "accuracy": float(hist.history["accuracy"][-1]),
        "args": {k: args[k] for k in set(args.keys()) -
                 set(excluded_keys)},
    }, pickle.dumps(model))

    yield best_model


def generate_model(input_shape, num_classes):
    """Generate the keras CNN"""
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3),
                     activation="relu",
                     input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation="relu"))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation="softmax"))

    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=keras.optimizers.Adadelta(),
                  metrics=['accuracy'])
    return model


def inf_mnist_generator(db, args, example_dim, num_classes, pipeline=None):
    """Infinite generator of data for keras fit_generator.

    :param db: Mongo() object to use to fetch data.
    :param args: The user provided args and defaults for adaptation.
    :param example_dim: The tuple dimensions of a single example (row, col).
    :param num_classes: The number of classes for one-hot encoding.
    :param pipeline: The MongoDB aggregate pipeline [{},{},{}] to use.
    :type db: Mongo
    :type args: dict
    :type example_dim: tuple
    :type num_classes: int
    :type pipeline: list(dict())
    :return: Tuple of a single data batch (x_batch,y_batch).
    :rtype: tuple
    """
    # empty pipeline if none provided
    pipeline = pipeline if pipeline is not None else [{"$match": {}}]
    # loop infinitely over the pipeline
    while True:
        c = db.getCursor(db_collection_name=str(args["data_collection"]
                                                [args["process"]]),
                         db_pipeline=pipeline)
        # iterate through the data in batches to minimise requests
        for data_batch in db.getBatches(db_batch_size=args["dl_batch_size"]
                                        [args["process"]], db_data_cursor=c):
            # we recommend you take a quick read of:
            # https://book.pythontips.com/en/latest/map_filter.html
            y = list(map(lambda d: d["y"], data_batch))
            y = np.array(y)  # converting list to numpy ndarray

            x = list(map(lambda d: d["x"], data_batch))
            x = np.array(x)  # converting nested lists to ndarray

            # shaping the np array into whatever keras is asking for
            if K.image_data_format() == 'channels_first':
                y = y.reshape((y.shape[0], 1))
                x = x.reshape((x.shape[0], 1,
                               example_dim[0], example_dim[1]))
                # input_shape = (1, example_dim[0], example_dim[1])
            else:
                y = y.reshape((y.shape[0], 1))
                x = x.reshape((x.shape[0],
                               example_dim[0], example_dim[1], 1))
                # input_shape = (example_dim[0], example_dim[1], 1)

            # normalising to 0-1
            x = x.astype('float32')
            x /= 255

            # convert class vectors to binary class matrices
            y = keras.utils.to_categorical(y, num_classes)

            # returning properly formatted data, batch by batch
            yield x, y

5.7. Inferring

Warning

Work in progress section

In this stage we retrieve the previously trained model, stored in MongoDB as gridfs chunks, and unpack it again for reuse and prediction. We can predict using the gridfs-stored model by passing:

Files-only/ development inferring example:
./nemesyst --config ./examples/configs/nemesyst/mnist.conf --i-predict

As in the previous sections, this lets Nemesyst know to run the predictor specified in the config file, which can be seen below. This predictor loads the most recently stored mnist model and uses it to predict against the testing set.

examples/predictors/mnist_predictor.py
# @Author: George Onoufriou <archer>
# @Date:   2019-08-16
# @Email:  george raven community at pm dot me
# @Filename: mnist_predictor.py
# @Last modified by:   archer
# @Last modified time: 2019-08-16
# @License: Please see LICENSE in project root


def main(**kwargs):
    """Entry point called by Nemesyst, always yields dictionary, tuple or None.

    :param **kwargs: Generic input method to handle infinite dict-args.
    :rtype: yield dict
    """
    args = kwargs["args"]
    db = kwargs["db"]

    db.connect()

    # define a pipeline to get the latest gridfs file in any collection
    fs_pipeline = [{'$sort': {'uploadDate': -1}},  # sort most recent first
                   {'$limit': 1},  # we only want one model
                   {'$project': {'_id': 1}}]  # we only want its _id
    args["dl_output_model_collection"]
    # we add a suffix to target the metadata collection specifically
    # at the end of the top level model collection name we specified in our
    # config file
    model_coll_root = args["dl_output_model_collection"][args["process"]]
    model_coll_files = "{0}{1}".format(model_coll_root, ".files")
    # apply this pipeline to the collection we used to store the models
    fc = db.getCursor(db_collection_name=model_coll_files,
                      db_pipeline=fs_pipeline)
    # we could return several models, but we have limited everything to one;
    # to remain extensible this shows how to get models from the db in
    # batches, though since we only have one model a batch size higher
    # than one does nothing
    for batch in db.getFiles(db_batch_size=1, db_data_cursor=fc,
                             db_collection_name=model_coll_root):
        for doc in batch:
            # now read the gridout object to get the model (pickled)
            model = doc["gridout"].read()
            print(doc, type(model))

    yield None
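
As the warning above notes, this example predictor stops at retrieving the pickled bytes. A minimal sketch of how you might continue, deserialising the model and predicting against the test set; it assumes you reuse the inf_mnist_generator from the learner above, which is our illustration rather than part of the shipped predictor:

Hypothetical continuation of the inner loop above:
import pickle

# in place of the read/print above, rebuild the keras model
model = pickle.loads(doc["gridout"].read())

# reuse the learner's generator to pull one batch of test images
# (img_num >= 60000 is the same test boundary the learner used)
test_generator = inf_mnist_generator(
    db=db, args=args, example_dim=(28, 28), num_classes=10,
    pipeline=[{"$match": {"img_num": {"$gte": 60000}}}])
x_batch, y_batch = next(test_generator)

predictions = model.predict(x_batch)
print("predicted:", predictions.argmax(axis=1))  # predicted digit per image
print("actual:   ", y_batch.argmax(axis=1))      # true digit per image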