Decision Trees: Theory and Implementation

Rajkeshav
Last Updated: May 13, 2022

Introduction

 

Decision Trees: Decision Trees are the Supervised Machine learning algorithm that can be used for Classification and Regression problems. A classification problem identifies the set of categories or groups to which an observation belongs. A Decision Tree uses a tree or graph-like model of decision. Each internal node represents a "test" on attributes, and each branch represents the outcome of the test. Each leaf node represents a class label (decision taken after computing all features). 

 

Common Examples:

 

Let's draw a Decision Tree for calculating the XNOR table. The table is as follows:

 

INPUT-1

X1

INPUT-2

X2

OUTPUT

Y

1

1

1

1

0

0

0

1

0

0

0

1

 

Here we have three features, INPUT-1, INPUT-2, and OUTPUT.

Let's make the feature INPUT-1 the root node and build a Decision Tree.

 

Sourcealliance.seas.uppnn.edu

 

Here we have split our features at each level depending upon the requirement, and at last, we have final decisions. In actual practice, the data is not as simple as this. The complexity of trees would increase as the data tend towards more practical real-world examples. Let's have a look at a similar kind of example:

Suppose we have to predict whether a student has been invited for an interview or not. We have three features, college type, is-intern, and has-project, and based on these, we have to make the Decision Tree.

 

TIER-1/TIER-2

TIER-3

IS-INTERN

HAS-PROJECT

GOT A CALL

YES

 

YES

YES

YES

YES

 

YES

NO

YES

YES

 

NO

YES

YES

YES

 

NO

NO

NO

 

YES

YES

NO

NO

 

YES

NO

YES

NO

 

YES

YES

YES

YES

 

 

 

In the Machine Learning algorithm, we try to find a nearly accurate result and not the exact one. The task would be very complex if we train the algorithm to get the precise result every time. Similarly, the best version of the Decision Tree is very expensive and computationally impossible to build. So we try to develop a perfect Decision Tree that will give decent accuracy and be much easier to make. According to the above Decision Tree, if a student is from tier-3 college and has a fantastic internship(say Google) but doesn't have any project, they will not get an interview call, which is practically not feasible. Mistakes will come in the way when we have exceptions that don't match the data upon which the Decision Tree is built.  

 

Mathematics behind Decision Tree:

 

Following are the steps that we need to follow to build a very good Decision Tree:-

 

1) We need to select the target attribute/feature from the given attributes. In the above example, the target attribute is “GOT A CALL,” which has two classes of values, either yes or no. 

2) Calculate the information gain of the selected target attribute

    Information Gain = -PP+N log2(PP+N) -  NP+N log2(NP+N)

    Where P = number of the elements in class 1 of the target attribute.

    N = number of the elements in class 2 of the target attribute.

In our case, the target attribute has two classes, “YES” and “NO”. So, P=5, and N=5.

 

3) Now for the remaining features/attributes. 

  1. We calculate the entropy for all the Categorical variables.
  2. We take average information entropy for the current attribute.

The entropy of a given attribute is given by:

Entropy = i=1v Information gain of that i th categorical variable x Probability of i th categorical variable

Here 'V is the number of categorical variables in that attribute.

For example, if the attribute is "IS-INTERN," then it has two categorical variables, "YES" and "NO."

I.G of “YES” = -47log2(47)  say E1

I.G of “NO” = -37log2(37)   say E2

 

The entropy of the attribute “IS-INTERN” = E1+E2 

 

4)  Wel will now find the Gain of each attribute (except the target attribute).

Gain of Individual attribute = Information Gain of the target attribute - Entropy of that attribute.

 

5) The attribute having maximum Gain value will be our Root node. We will divide the root node into subsets that contain possible values for the best feature.

 

6) Recursively make new decision trees using the subsets of the dataset created in step 5, till the point where we don't have any features to split upon.  

 

Implementation:

 

Now we will see the implementation of the Decision Tree using Scikit-Learn. 

 

clf.fit(x_train, y_train) is used to build the Decision Tree. Once the tree is built, we can export this into the pdf. To do this, I import a function from sklearn.tree import export graphviz

 

We supply the classifier clf to the function export_graphviz(clf, out_file=None)out_file=None means I don't want to save the output into any external file. I want this function to return the data. The function’s output is the dot () data, and then I want to export it into a dot pdf. 

pydotplus.graph_from_dot_data(dot_data) returns a pydot object which I have taken in a graph object. 

graph.write_pdf() is a method to assign the name of the pdf.

 

Output:

 

FAQs:

Q1) What are the advantages of a Decision Tree?
Ans. Following are the advantages of Decision Tree

  1. As the implementation is like a binary tree, the time complexity for building and querying is very less.
  2. Very easy to build.
  3. The implementation is non-linear, so the accuracy is high.
  4. It can handle both continuous and discrete data.
  5.  

Q2) What are the disadvantages of a Decision Tree?
Ans. a) Memory consumption is very high as the calculation requires lots of variables
b) Very expansive to create
c) Less appropriate for predicting the continuous values
d) It often lead to overfitting of the data, which can give the wrong prediction on the testing data

 

Q3) What are the different types of nodes in the Decision Tree?
Ans. a) Root node: It is the uppermost node in the Decision Tree.
b) Decision nodes: Decision node helps in split the data into multiple segments
c) Leaf nodes: The final output is obtained from the leaf nodes.

 

Q4) How does a Decision Tree handle the missing values of an attribute?
Ans. a) It fills the missing values by the most common value of that attribute
b) It fills the missing values by assigning a probability to each of the attribute's possible values based on the other samples.

 

Q5) What are the standard algorithms used for deriving the Decision Tree?
Ans. The standard algorithms are:

  1. ID3(Iterative Dichotomiser)
  2. C4.5 (Successor of ID3)
  3. CART ( Classification and Regression Trees)

 

Key Takeaways:

I hope I was able to deliver the actual concept behind these algorithms. There are various exciting Machine Learning algorithms apart from the Decision Tree. If you are interested, you can go through them. 

Thank you a lot, Happy Learning 😇😇

Was this article helpful ?
0 upvotes

Comments

No comments yet

Be the first to share what you think