Create a Decision Tree with Scikit-Learn

This walkthrough shows how to use the ML Data API to build a Decision Tree classifier in Scikit-Learn.

  1. Assuming you have Scikit-Learn and requests installed, we start by importing the following libraries, functions, and classes:

    import requests
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score
  2. Next we will load our data through the API. For this tutorial we will use the famous Iris dataset.
    • We use the requests library in Python to connect to the API, then json.loads to convert the JSON response to a Python dictionary:

      response = requests.get("http://www.mldata.io/get-data/dataset/scikit-label/iris")
      dataset = json.loads(response.text)
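If the API endpoint is unreachable, the rest of the walkthrough can still be followed with a local stand-in. The sketch below assumes scikit-learn's bundled load_iris is an acceptable substitute; it is not part of the ML Data API, and its feature encoding may differ from what the API returns. The dictionary simply mirrors the 'values', 'X' and 'Y' keys used in this tutorial.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of Iris stands in for the API response.
iris = load_iris()

# Mirror the layout used in this tutorial: dataset['values']['X'] and ['Y'].
dataset = {
    "values": {
        "X": iris.data.tolist(),    # 150 rows of 4 numeric features
        "Y": iris.target.tolist(),  # 150 integer class labels (0, 1, 2)
    }
}

print(len(dataset["values"]["X"]), "samples loaded")
```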
  3. Scikit-Learn needs the data split into input features (X) and target classes (Y). In the API response, both are stored under the values key of the dataset dictionary, as X and Y:

    X = dataset['values']['X']
    Y = dataset['values']['Y']
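Before training, it can help to confirm that X and Y line up row for row. The following is a minimal sketch using a small hypothetical dictionary with the same layout as above (the three rows are made-up illustration values, not API output); converting to NumPy arrays is optional, since Scikit-Learn also accepts plain lists.

```python
import numpy as np

# Hypothetical stand-in with the same layout as the API response.
dataset = {
    "values": {
        "X": [[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3], [7.7, 3.0, 6.1, 2.3]],
        "Y": [0, 1, 2],
    }
}

X = np.asarray(dataset["values"]["X"])
Y = np.asarray(dataset["values"]["Y"])

# Every row of X needs exactly one label in Y.
if X.shape[0] != Y.shape[0]:
    raise ValueError("X and Y must have the same number of rows")

print(X.shape, Y.shape)
```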
  4. In machine learning, it is best practice to split your data into a training set and a test set, so that the model is evaluated on data it has not seen. Scikit-Learn has a train_test_split function that does this for us. In this example we hold out 20% of the data as the test set and fix random_state so the split is reproducible:

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                test_size=0.2, random_state=100)
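As an optional variation, train_test_split also accepts a stratify argument that keeps class proportions the same in both sets. The sketch below assumes the bundled load_iris data (150 rows, 50 per class) as a local stand-in for the API data; stratification is an addition for illustration and is not used in the rest of this tutorial.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: local stand-in for the API data (150 rows, 3 balanced classes).
X, Y = load_iris(return_X_y=True)

# stratify=Y keeps the class balance identical in the train and test sets.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=100, stratify=Y
)

print(len(X_train), len(X_test))   # sizes of the 80% / 20% split
class_counts = Counter(Y_test.tolist())
print(class_counts)                # test samples per class
```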
  5. Now all that is left for us to do is train our model and test its accuracy.
    • First we create a DecisionTreeClassifier object. Here we use Gini impurity as the criterion for choosing splits, a random seed of 100, a maximum tree depth of 10, and a minimum of 1 sample per leaf:

      clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                  max_depth=10, min_samples_leaf=1)
    • Next we train our classifier using the fit method
      clf_gini.fit(X_train, Y_train)
    • Finally, we measure accuracy by predicting labels for the test set and comparing them with the actual labels:
      Y_pred = clf_gini.predict(X_test)
      print("Accuracy for DT (feature onehotEncode, predictor labelEncode) is ",
                  round(accuracy_score(Y_test, Y_pred)*100, 2))
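To make the criterion="gini" setting above more concrete, here is a minimal sketch of how Gini impurity is computed for a set of labels: one minus the sum of squared class proportions. The function name gini_impurity is hypothetical and written for illustration only; it is not the code Scikit-Learn runs internally.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k ** 2) over the class proportions p_k in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure = gini_impurity([0, 0, 0, 0])    # a single class: no impurity
mixed = gini_impurity([0, 0, 1, 1])   # two classes, evenly mixed
print(pure, mixed)
```

A node containing a single class scores 0, and the score rises as classes become more evenly mixed, so the tree prefers splits that produce purer child nodes.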
  6. The final result for this configuration is as follows:

    Accuracy for DT (feature onehotEncode, predictor labelEncode) is  96.67
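Accuracy is simply the fraction of test samples predicted correctly. The sketch below checks that against accuracy_score on a few hypothetical labels, then shows the arithmetic behind a figure like the one above, assuming the dataset has 150 rows so that a 20% split leaves 30 test samples.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 test samples, 3 of them predicted correctly.
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 2, 1]

# Accuracy by hand: correct predictions divided by total predictions.
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(manual, accuracy_score(y_true, y_pred))

# Assumption: 30 test samples with 29 correct, expressed as a percentage.
percent = round(29 / 30 * 100, 2)
print(percent)
```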
  7. That's it! With Scikit-Learn and the MLData API, it's simple to create, train, and use models.