Create a Decision Tree with Scikit-Learn


This is a walkthrough of how to use the ML Data API to create a Decision Tree classifier in Scikit-Learn.


  1. Assuming you have Scikit-Learn installed, we start by importing the following libraries, functions, and classes:

  2. 
    import requests
    import json
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score
            
  3. Next we will import our data through the API. For this tutorial we will use the famous Iris dataset.
    • We are going to use the requests library in Python to connect to our API and convert the JSON response to a Python dictionary:

    • 
      response = requests.get("http://www.mldata.io/get-data/dataset/scikit-label/iris")
      dataset = json.loads(response.text)
                  
  4. Scikit-Learn needs the data split into input features (X) and target classes (Y). Through the API this can be done by accessing the X and Y keys under the values key of the dataset dictionary.

  5. 
    X = dataset['values']['X']
    Y = dataset['values']['Y']
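
As a minimal sketch, the extraction above can be wrapped in a small helper that fails early when the response does not have the expected shape. The helper name `extract_xy` is hypothetical, and the payload layout (a `values` entry holding `X` and `Y` lists of equal length) is assumed from the keys used in this tutorial, not taken from API documentation.

```python
def extract_xy(dataset):
    """Return (X, Y) from a payload shaped like {'values': {'X': [...], 'Y': [...]}}.

    The layout is an assumption based on the keys used in this tutorial.
    """
    try:
        values = dataset["values"]
        X, Y = values["X"], values["Y"]
    except (KeyError, TypeError) as exc:
        raise ValueError("payload is missing values['X'] or values['Y']") from exc
    if len(X) != len(Y):
        raise ValueError(f"X has {len(X)} rows but Y has {len(Y)} labels")
    return X, Y


# Hand-written stand-in payload, not real API output
sample = {"values": {"X": [[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]], "Y": [0, 1]}}
X, Y = extract_xy(sample)
print(len(X), len(Y))  # prints: 2 2
```

Raising a ValueError here makes a malformed or error response visible immediately, instead of surfacing later as a confusing failure inside Scikit-Learn.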
            
  6. In Machine Learning, it is best practice to split your data into a training set and a testing set. Scikit-Learn has a train_test_split function that does this for us. In this example we use 20% of the total data as the test set, with a fixed random_state so that the split is reproducible.

  7. 
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                test_size=0.2, random_state=100)
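
To see what the split produces without calling the API, the following sketch uses Scikit-Learn's bundled copy of the Iris data as a stand-in. This is an assumption: the API response is not guaranteed to match it row for row. With 150 samples and test_size=0.2, 30 rows go to the test set and 120 to the training set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled Iris set, not the API response
X, Y = load_iris(return_X_y=True)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=100
)

print(len(X_train), len(X_test))  # prints: 120 30
```

If class balance matters, train_test_split also accepts stratify=Y, which keeps the class proportions roughly equal in both sets.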
            
  8. Now all that is left for us to do is train our model and test its accuracy.
    • First we create a Decision Tree object. In this case we use the Gini impurity criterion to choose the best split at each node, set the random seed to 100, limit the depth of the tree to 10, and set the minimum number of samples per leaf to 1.

    • 
      clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                  max_depth=10, min_samples_leaf=1)
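
For intuition about the gini criterion, the sketch below computes Gini impurity, which is 1 minus the sum of squared class proportions, for a list of labels. The function name `gini_impurity` is hypothetical, and this illustrates the formula only; it is not Scikit-Learn's internal implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


print(gini_impurity([0, 0, 0, 0]))  # prints: 0.0
print(gini_impurity([0, 0, 1, 1]))  # prints: 0.5
```

A node containing a single class scores 0, and the score grows as classes become more evenly mixed. The tree prefers splits whose child nodes have a lower weighted impurity than the parent.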
              
    • Next we train our classifier using the fit method:
    • 
      clf_gini.fit(X_train, Y_train)
              
    • Finally, we measure accuracy by generating predictions for our test set and comparing them with the actual values:
    • 
      Y_pred = clf_gini.predict(X_test)
      print("Accuracy for DT (feature onehotEncode, predictor labelEncode) is ",
                  round(accuracy_score(Y_test, Y_pred)*100, 2))
              
  9. The final result for this configuration is as follows:

  10. 
    Accuracy for DT (feature onehotEncode, predictor labelEncode) is  96.67
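
A single accuracy figure hides which classes are confused with each other. As a hedged sketch, the confusion matrix below breaks the test predictions down by class. It uses Scikit-Learn's bundled Iris data as a stand-in for the API response (an assumption), so its numbers need not match the 96.67 reported above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: scikit-learn's bundled Iris set, not the API response
X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=100
)

clf_gini = DecisionTreeClassifier(
    criterion="gini", random_state=100, max_depth=10, min_samples_leaf=1
)
clf_gini.fit(X_train, Y_train)
Y_pred = clf_gini.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(Y_test, Y_pred, labels=[0, 1, 2])
print(cm)
print("Accuracy:", round(accuracy_score(Y_test, Y_pred) * 100, 2))
```

Entries on the diagonal are correct predictions, and any off-diagonal entry shows which actual class was mistaken for which predicted class.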
            
  11. That's it! Using Scikit-Learn and the ML Data API, it's simple to create, train, and use models.