Built-in Algorithms

Prerequisite:

This example walks through one of the built-in algorithms, logistic regression, and applies it to a dataset. For simplicity the dataset is MNIST, since it is widely known and well explored.

If you want to skip the explanation, the full script is available at the bottom of this page.

Step 1: Get Dataset

SystemDS provides a built-in helper for downloading and setting up the MNIST dataset. To set it up, simply use:

from systemds.context import SystemDSContext
from systemds.examples.tutorials.mnist import DataManager
from systemds.operator.algorithm import multiLogReg, multiLogRegPredict

d = DataManager()

X = d.get_train_data()
Y = d.get_train_labels()

Here the DataManager contains the code for downloading the data and setting up the NumPy arrays that hold it.

Step 2: Reshape & Format

Data usually does not come in a format that perfectly fits an algorithm. To make this tutorial more realistic, some preprocessing is required to make the input fit.

First, the training data, X, has multiple dimensions, resulting in a shape of (60000, 28, 28). The dimensions correspond to the number of images, 60000, then the number of pixel rows, 28, and finally the number of pixel columns, 28.

To use this data for logistic regression we have to reduce the number of dimensions. The algorithm requires the training data X to have two dimensions: the first is the number of instances, and the second is the number of features.

Therefore, to make the data fit the algorithm, we reshape X like so:

X = X.reshape((60000, 28*28))

This takes the pixel rows of each image and appends them to each other, making a single feature vector of 784 values per image.
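The effect of the reshape is easier to see on a tiny array. The sketch below uses three made-up 2x2 "images" in place of the real MNIST data; only the shapes differ from the tutorial.

```python
import numpy as np

# Three tiny 2x2 "images" standing in for the 60000 MNIST images of 28x28.
images = np.arange(12).reshape((3, 2, 2))

# Flatten each image into one row of 2*2 = 4 features.
flat = images.reshape((3, 2 * 2))

print(flat.shape)  # (3, 4)
print(flat[0])     # [0 1 2 3]
```

The first image's rows, [0 1] and [2 3], end up side by side in a single feature vector, which is the same thing that happens to the 28 pixel rows of each MNIST image.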

The Y dataset does not perfectly fit the logistic regression algorithm either. The labels in this dataset are values ranging from 0 to 9, and each label corresponds to the digit shown in the image. Unfortunately, the algorithm requires the labels to be distinct integers from 1 upwards.

Therefore we add 1 to each label so that the labels range from 1 to 10, like this:

Y = Y + 1
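One consequence of this shift is that predictions also come back in the range 1 to 10, so 1 has to be subtracted again to recover the digit. A minimal sketch with made-up labels, one per class:

```python
import numpy as np

# Made-up labels, one per digit class, in the 0 to 9 range MNIST uses.
labels = np.arange(10)

shifted = labels + 1     # 1-based labels, as the algorithm expects
restored = shifted - 1   # map a 1-based label back to the digit it stands for

print(shifted.min(), shifted.max())  # 1 10
```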

With these steps we are now ready to train a simple model.

Step 3: Training

To start with, we set up a SystemDS context. Call sds.close() once you are finished with it:

sds = SystemDSContext()

To reduce the training time and verify that everything works, it is usually a good idea to start by training on a smaller sample of the data:

sample_size = 1000
X_ds = sds.from_numpy(X[:sample_size])
Y_ds = sds.from_numpy(Y[:sample_size])
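Taking the first sample_size rows only gives a representative sample if the rows are not sorted by label. A quick way to check is to count how often each class appears in the sample. The sketch below uses randomly generated stand-in labels (Y_demo is not the real MNIST data), so it only shows the technique:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Stand-in for the shifted training labels (values 1 to 10); not real MNIST data.
Y_demo = rng.integers(low=1, high=11, size=60000)

sample_size = 1000
sample = Y_demo[:sample_size]

# Count occurrences of each label 1..10 in the sample (index 0 is unused).
counts = np.bincount(sample, minlength=11)[1:]
print(counts)
```

If some classes were missing or heavily under-represented, drawing a random sample would likely be a better choice than slicing.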

And now everything is ready for our algorithm

from systemds.operator.algorithm import multiLogReg
bias = multiLogReg(X_ds, Y_ds)

Note that nothing has been calculated yet. In SystemDS, the computation only happens when you call compute:

bias_r = bias.compute()
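The deferred execution described above can be pictured with a toy class. This is only a conceptual sketch in plain Python: the class name LazyValue is made up for illustration, and it is not how SystemDS is implemented internally.

```python
class LazyValue:
    """Toy operation node: stores the work and only runs it on compute()."""

    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs
        self.evaluated = False

    def compute(self):
        # Resolve inputs first; nested LazyValue objects are computed recursively.
        args = [i.compute() if isinstance(i, LazyValue) else i for i in self.inputs]
        self.evaluated = True
        return self.func(*args)


a = LazyValue(lambda x, y: x + y, 1, 2)  # nothing is calculated yet
b = LazyValue(lambda x: x * 10, a)       # still nothing calculated
assert not a.evaluated and not b.evaluated

result = b.compute()                     # the whole chain runs here
print(result)  # 30
```

Building the chain first and running it in one go is what gives a system the chance to look at the whole program before executing it.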

bias is a matrix that, when matrix multiplied with an instance, returns a distribution of values in which the highest value marks the predicted class. This is the matrix that can be saved and used for predicting labels later.
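How a weight matrix turns an instance into a predicted label can be sketched with NumPy. This is a simplified illustration with made-up numbers. It assumes one column of weights per class and ignores details such as an intercept, so the exact layout of the matrix returned by multiLogReg may differ.

```python
import numpy as np

# Made-up weights: 3 features (rows) x 4 classes (columns).
weights = np.array([
    [2.0, 0.0, -1.0, 0.5],
    [0.0, 1.5, 0.0, 0.5],
    [-1.0, 0.0, 3.0, 0.5],
])

# Two instances with 3 features each.
instances = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])

scores = instances @ weights  # one score per class and instance

# Softmax turns the scores into a distribution; each row sums to 1.
exp = np.exp(scores - scores.max(axis=1, keepdims=True))
probabilities = exp / exp.sum(axis=1, keepdims=True)

# The highest value wins; +1 because the labels start at 1.
predicted = probabilities.argmax(axis=1) + 1
print(predicted)  # [1 3]
```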

Step 4: Validate

To see what accuracy the model achieves, we have to load the test dataset as well.

It can also be obtained from the built-in MNIST loader. To keep the tutorial short, the operations are combined:

Xt = sds.from_numpy(d.get_test_data().reshape((10000, 28*28)))
Yt = sds.from_numpy(d.get_test_labels()) + 1

The above loads the test data, reshapes the X data the same way the training data was reshaped, and adds 1 to the test labels.

Finally we verify the accuracy by calling:

from systemds.operator.algorithm import multiLogRegPredict
[m, y_pred, acc] = multiLogRegPredict(Xt, bias, Y=Yt).compute()
print(acc)

There are three outputs from the multiLogRegPredict call.

  • m is the mean probability of correctly classifying each label.

  • y_pred is the predictions made using the trained model, bias.

  • acc is the accuracy achieved by the model.
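As a sanity check, the accuracy can be recomputed by hand from the predictions and the true labels. The sketch below uses small made-up arrays (y_true and y_hat are illustrative names) and assumes accuracy is reported as the percentage of correct predictions:

```python
import numpy as np

# Made-up true labels and predictions, both in the 1 to 10 label range.
y_true = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y_hat = np.array([1, 2, 3, 4, 5, 6, 7, 8, 1, 1])

correct = (y_true == y_hat)        # True where the prediction matches
accuracy = correct.mean() * 100.0  # share of correct predictions, in percent
print(accuracy)  # 80.0
```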

When the 1000-image sample of the training data is used, you can expect an accuracy of about 85% in this example.

Step 5: Tuning

Now that we have a working baseline we can start tuning parameters.

But first it is valuable to know how large the difference in performance is between the training data and the test data. This indicates whether we have exhausted the learning potential of the training data.

To measure the accuracy on the training data, we use the predict function again, but with our training data:

[m, y_pred, acc] = multiLogRegPredict(X_ds, bias, Y=Y_ds).compute()
print(acc)

In this specific case we achieve 100% accuracy on the training data, indicating that the model has fit the training sample completely and has nothing more to learn from the data as it is now.
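The comparison can be reduced to a single number, the gap between training and test accuracy. A minimal sketch using the accuracies quoted in this tutorial as inputs (the helper name accuracy_gap is made up for illustration):

```python
def accuracy_gap(train_acc, test_acc):
    """Return how many percentage points the model loses on unseen data."""
    return train_acc - test_acc


# Values quoted in this tutorial for the 1000-image sample.
gap = accuracy_gap(train_acc=100.0, test_acc=85.0)
print(gap)  # 15.0
```

A large gap combined with perfect training accuracy suggests that more or more varied training data is likely to help more than further fitting of the same sample.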

To improve further we have to increase the amount of training data. Here, for example, we increase it from our sample of 1k images to the full training dataset of 60k. In this example the maxi parameter is set to limit the number of iterations the algorithm runs, again to reduce training time:

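# Reload the raw NumPy arrays: Y was already shifted by 1 above, and
# Xt and Yt were turned into SystemDS matrices during validation.
X = d.get_train_data().reshape((60000, 28*28))
Y = d.get_train_labels()
Xt = d.get_test_data().reshape((10000, 28*28))
Yt = d.get_test_labels()

with SystemDSContext() as sds:
    # Train Data
    X_ds = sds.from_numpy(X)
    Y_ds = sds.from_numpy(Y) + 1.0
    bias = multiLogReg(X_ds, Y_ds, maxi=30, verbose=False)
    # Test data
    Xt_ds = sds.from_numpy(Xt)
    Yt_ds = sds.from_numpy(Yt) + 1.0
    [m, y_pred, acc] = multiLogRegPredict(Xt_ds, bias, Y=Yt_ds, verbose=False).compute()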

With this change the accuracy improves from the previous value to 92%. This is still a basic implementation, which can be replaced by a variety of algorithms and techniques.

Full Script

The full script is shown below; some steps are combined to shorten it. One noteworthy change is that the + 1 is applied to the matrix after it has been handed to SystemDS, which makes SystemDS responsible for adding 1 to each value.

import logging

from systemds.context import SystemDSContext
from systemds.examples.tutorials.mnist import DataManager
from systemds.operator.algorithm import multiLogReg, multiLogRegPredict

d = DataManager()

X = d.get_train_data().reshape((60000, 28*28))
Y = d.get_train_labels()
Xt = d.get_test_data().reshape((10000, 28*28))
Yt = d.get_test_labels()

with SystemDSContext() as sds:
    # Train Data
    X_ds = sds.from_numpy(X)
    Y_ds = sds.from_numpy(Y) + 1.0
    bias = multiLogReg(X_ds, Y_ds, maxi=30, verbose=False)
    # Test data
    Xt_ds = sds.from_numpy(Xt)
    Yt_ds = sds.from_numpy(Yt) + 1.0
    [m, y_pred, acc] = multiLogRegPredict(Xt_ds, bias, Y=Yt_ds, verbose=False).compute()

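# logging.info output is hidden at the default WARNING level, so enable INFO first.
logging.basicConfig(level=logging.INFO)
logging.info(acc)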