.. ------------------------------------------------------------- .. .. Licensed to the Apache Software Foundation (ASF) under one .. or more contributor license agreements. See the NOTICE file .. distributed with this work for additional information .. regarding copyright ownership. The ASF licenses this file .. to you under the Apache License, Version 2.0 (the .. "License"); you may not use this file except in compliance .. with the License. You may obtain a copy of the License at .. .. http://www.apache.org/licenses/LICENSE-2.0 .. .. Unless required by applicable law or agreed to in writing, .. software distributed under the License is distributed on an .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY .. KIND, either express or implied. See the License for the .. specific language governing permissions and limitations .. under the License. .. .. ------------------------------------------------------------ Python end-to-end tutorial ========================== The goal of this tutorial is to showcase different features of the SystemDS framework that can be accessed with the Python API. For this, we want to use the `Adult `_ dataset and predict whether the income of a person exceeds $50K/yr based on census data. The Adult dataset contains attributes like age, workclass, education, marital-status, occupation, race, [...] and the labels >50K or <=50K. Most of these features are categorical string values, but the dataset also includes continuous features. For this, we define three different levels with an increasing level of detail with regard to features provided by SystemDS. In the first level, shows the built-in preprocessing capabilities of SystemDS. With the second level, we want to show how we can integrate custom-built networks or algorithms into our Python program. Prerequisite: - :doc:`/getting_started/install` Level 1 ------- This example shows how one can work the SystemDS framework. More precisely, we will make use of the built-in DataManager, Multinomial Logistic Regression function, and the Confusion Matrix function. The dataset used in this tutorial is a preprocessed version of the "UCI Adult Data Set". If one wants to skip the explanation then the full script is available at the end of this level. We will train a Multinomial Logistic Regression model on the training dataset and subsequently use the test dataset to assess how well our model can predict if the income is above or below $50K/yr based on the features. Step 1: Load and prepare data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First, we get our training and testing data from the built-in DataManager. Since the multiLogReg function requires the labels (Y) to be > 0, we add 1 to all labels. This ensures that the smallest label is >= 1. Additionally we will only take a fraction of the training and test set into account to speed up the execution. .. include:: ../code/guide/end_to_end/part1.py :code: python :start-line: 20 :end-line: 51 Here the DataManager contains the code for downloading and setting up either Pandas DataFrames or internal SystemDS Frames, for the best performance and no data transfer from pandas to SystemDS it is recommended to read directly from disk into SystemDS. Step 2: Training ~~~~~~~~~~~~~~~~ Now that we prepared the data, we can use the multiLogReg function. First, we will train the model on our training data. Afterward, we can make predictions on the test data and assess the performance of the model. .. include:: ../code/guide/end_to_end/part1.py :code: python :start-line: 53 :end-line: 54 Note that nothing has been calculated yet. In SystemDS the calculation is executed once compute() is called. E.g. betas_res = betas.compute(). We can now use the trained model to make predictions on the test data. .. include:: ../code/guide/end_to_end/part1.py :code: python :start-line: 56 :end-line: 57 The multiLogRegPredict function has three return values: - m, a matrix with the mean probability of correctly classifying each label. We do not use it further in this example. - y_pred, is the predictions made using the model - acc, is the accuracy achieved by the model. Step 3: Confusion Matrix ~~~~~~~~~~~~~~~~~~~~~~~~ A confusion matrix is a useful tool to analyze the performance of the model and to obtain a better understanding which classes the model has difficulties separating. The confusionMatrix function takes the predicted labels and the true labels. It then returns the confusion matrix for the predictions and the confusion matrix averages of each true class. .. include:: ../code/guide/end_to_end/part1.py :code: python :start-line: 59 :end-line: 60 Full Script ~~~~~~~~~~~ In the full script, some steps are combined to reduce the overall script. .. include:: ../code/guide/end_to_end/part1.py :code: python :start-line: 20 :end-line: 65 Level 2 ------- In this level we want to show how we can integrate a custom built algorithm using the Python API. For this we will introduce another dml file, which can be used to train a basic feed forward network. Step 1: Obtain data ~~~~~~~~~~~~~~~~~~~ For the whole data setup please refer to level 1, Step 1, as these steps are almost identical, but instead of preparing the test data, we only prepare the training data. .. include:: ../code/guide/end_to_end/part2.py :code: python :start-line: 20 :end-line: 47 Step 2: Load the algorithm ~~~~~~~~~~~~~~~~~~~~~~~~~~ We use a neural network with 2 hidden layers, each consisting of 200 neurons. First, we need to source the dml file for neural networks. This file includes all the necessary functions for training, evaluating, and storing the model. The returned object of the source call is further used for calling the functions. The file can be found here: .. include:: ../code/guide/end_to_end/part2.py :code: python :start-line: 48 :end-line: 51 Step 3: Training the neural network ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Training a neural network in SystemDS using the train function is straightforward. The first two arguments are the training features and the target values we want to fit our model on. Then we need to set the hyperparameters of the model. We choose to train for 1 epoch with a batch size of 16 and a learning rate of 0.01, which are common parameters for neural networks. The seed argument ensures that running the code again yields the same results. .. include:: ../code/guide/end_to_end/part2.py :code: python :start-line: 52 :end-line: 58 Step 4: Saving the model ~~~~~~~~~~~~~~~~~~~~~~~~ For later usage, we can save the trained model. We only need to specify the name of our model and the file path. This call stores the weights and biases of our model. Similarly the transformation metadata to transform input data to the model, is saved. .. include:: ../code/guide/end_to_end/part2.py :code: python :start-line: 59 :end-line: 65 Step 5: Predict on Unseen data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Once the model is saved along with metadata, it is simple to apply it all to unseen data: .. include:: ../code/guide/end_to_end/part2.py :code: python :start-line: 66 :end-line: 77 Full Script NN ~~~~~~~~~~~~~~ The complete script now can be seen here: .. include:: ../code/guide/end_to_end/part2.py :code: python :start-line: 20 :end-line: 80