 ## R Programming - Individual Analytics

• 22nd Sep, 2021
• 16:54 PM

Analysis Report

Predicting Safety issues using dials – Decision trees technique

1. Introduction

Decision trees are versatile machine learning algorithms that can perform both classification and regression tasks. They are powerful and capable of fitting complex datasets. A decision tree represents choices and their outcomes as a tree-shaped graph: the nodes represent an event or choice, and the edges represent the decision rules or conditions. We use the decision tree technique to predict the safety issue.

2. Data set and Methodology

The data set contains the measurements from 20 dials along with the safety issue, which is labeled red, yellow, or green. We apply the decision tree algorithm to predict the safety issue (red, yellow, or green) from the measurements of the 20 dials. The data set consists of 1000 observations. To validate the classification model, the data set was split into train and test sets in a 7:3 ratio; the model was built on the train set and validated on the held-out test set. The CP (complexity parameter) is one of the key parameters of a decision tree and is used to control tree growth: a split is attempted only if it reduces the overall relative error by at least CP, so growth stops once no remaining split meets that threshold. The first step is therefore to find the optimal complexity parameter.
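The effect of the 7:3 split and of CP on tree size can be sketched as follows. This is only an illustration under stated assumptions: the built-in `iris` data stands in for the dials data, the seed is arbitrary, and `n_splits` is a hypothetical helper, not part of rpart.

```r
library(rpart)

set.seed(1)  # arbitrary seed so the split is repeatable

# iris stands in for the dials data (assumption)
n <- nrow(iris)
train_idx <- sample(n, size = floor(0.7 * n))  # 70% of rows for training
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]                    # remaining 30% for testing

# cp = 0 places no complexity penalty on splits, so the tree grows as far as
# the other stopping rules allow; cp = 1 is deliberately extreme
full_tree  <- rpart(Species ~ ., data = train, method = "class",
                    control = rpart.control(cp = 0))
small_tree <- rpart(Species ~ ., data = train, method = "class",
                    control = rpart.control(cp = 1))

# Hypothetical helper: number of splits = number of non-leaf nodes
n_splits <- function(fit) sum(fit$frame$var != "<leaf>")
c(cp_0 = n_splits(full_tree), cp_1 = n_splits(small_tree))
```

A larger CP demands a bigger improvement from every split, so the tree ends up smaller; values between the two extremes trade fit against simplicity.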

3. Tuning the complexity parameter

According to Figure 01, the error levels off at a complexity parameter value of 0.025. We therefore take 0.025 as the optimal CP for the rest of the analysis.
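As a complement to reading the value off the plot, the complexity table stored in the fitted model can be inspected programmatically. The sketch below uses a different criterion from the one in this report (the CP with the lowest cross-validated error rather than the point where the curve levels off), and it assumes the built-in `iris` data as a stand-in for the dials data.

```r
library(rpart)

set.seed(1)  # the cross-validated errors in the table depend on the seed

# iris is a stand-in data set (assumption); cp = 0 grows the largest tree
base_model <- rpart(Species ~ ., data = iris, method = "class",
                    control = rpart.control(cp = 0))

# cptable is a matrix with columns CP, nsplit, rel error, xerror, xstd
cp_table <- base_model$cptable

# Pick the CP whose cross-validated error (xerror) is smallest
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]

# Prune the full tree back to that complexity
pruned <- prune(base_model, cp = best_cp)
best_cp
```

The two criteria need not agree: the minimum-error CP can keep a few more splits than the value at which the curve visibly flattens.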

4. Decision tree results

Figure 02 shows the decision tree used to predict the safety issue from the measurements of the 20 dials. It contains 7 rules, each with its confidence. For example:
Rule 01: If dial14 >= -0.57 and dial15 >= 1.5, then the predicted safety issue is Green with 84% confidence.
Rule 02: If dial14 >= -0.57, dial15 < 1.5 and [...] >= -0.39, then the predicted safety issue is Red with 77% confidence.
A safety issue can therefore be predicted by following the tree from the root down to a leaf.
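Rules of this kind can also be listed from a fitted tree instead of being read off the plot. The following is a minimal sketch, assuming the built-in `iris` data in place of the dials data; "confidence" is taken here to mean the share of training rows in a leaf that belong to the predicted class, which is an assumption about how the percentages in the figure were derived.

```r
library(rpart)

# iris stands in for the dials data (assumption)
fit <- rpart(Species ~ ., data = iris, method = "class")

# Leaf nodes are the rows of fit$frame whose split variable is "<leaf>"
leaf_ids <- rownames(fit$frame)[fit$frame$var == "<leaf>"]

# For each leaf, path.rpart returns the conditions from the root to that leaf
paths <- path.rpart(fit, nodes = as.numeric(leaf_ids), print.it = FALSE)

# Class probabilities for every training row
probs <- predict(fit, type = "prob")

rules <- vapply(leaf_ids, function(id) {
  frame_row <- which(rownames(fit$frame) == id)
  # All training rows in the same leaf share the same probabilities
  p <- probs[which(fit$where == frame_row)[1], ]
  conditions <- paste(paths[[id]][-1], collapse = " and ")  # drop "root"
  sprintf("If %s then %s (%.0f%% confidence)",
          conditions, names(which.max(p)), 100 * max(p))
}, character(1))

cat(rules, sep = "\n")
```

The rpart.plot package also provides `rpart.rules()`, which prints a similar rule table directly from a fitted model.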

5. Model Evaluation

The decision tree was evaluated on the test set that was split off at the first stage.
Figure 02: Decision tree

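| | Predicted Green | Predicted Red | Predicted Yellow |
|---|---|---|---|
| Actual Green | 57 | 22 | 25 |
| Actual Red | 15 | 78 | 4 |
| Actual Yellow | 4 | 22 | 73 |

Table 01: Confusion matrix (rows: actual class, columns: predicted class)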

According to Table 01, the model correctly categorized 57 of the 104 green safety issues, 78 of the 97 red safety issues, and 73 of the 99 yellow safety issues. The overall accuracy is 208/300 = 69.33%.
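The arithmetic behind these figures can be reproduced directly from Table 01. The sketch below re-enters the counts by hand; the per-class recall and precision are additional summaries not reported above, and the variable names are illustrative.

```r
# Counts from Table 01: rows = actual class, columns = predicted class
classes <- c("Green", "Red", "Yellow")
conf_mat <- matrix(c(57, 22, 25,
                     15, 78,  4,
                      4, 22, 73),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(actual = classes, predicted = classes))

# Overall accuracy: correct predictions (diagonal) over all test rows
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)    # 208 / 300

# Recall: share of each actual class that was predicted correctly
recall <- diag(conf_mat) / rowSums(conf_mat)

# Precision: share of each predicted class that was actually that class
precision <- diag(conf_mat) / colSums(conf_mat)

round(100 * accuracy, 2)
round(100 * rbind(recall, precision), 1)
```

Recall comes out at about 54.8% for Green, 80.4% for Red, and 73.7% for Yellow, so the Green class is where the tree misses most often.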

6. Recommendation and next steps.

The decision tree model reaches an accuracy of almost 70% on the test set, roughly twice the 35% that would result from always predicting the most frequent class, and can be used to predict safety issues from the 20 dial measurements. Of the 20 dials, dial15, dial6 and dial8 appear to be the most important, as they appear in the top nodes of the tree, with dial15 the most important of the three. Management can therefore pay particular attention to dial15 and take the necessary actions to control safety issues. Ensemble methods such as random forest and bagging can be used to increase model accuracy, so the next step is to continue the work with these methods to make the predictions more accurate and reliable.
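The two follow-up ideas mentioned here, checking which variables matter and trying bagging, can be sketched with rpart alone. This is a hand-rolled illustration under stated assumptions: the built-in `iris` data stands in for the dials data, the seed and the number of trees (`n_trees`) are arbitrary choices, and the result says nothing about how much bagging would help on the real data.

```r
library(rpart)

set.seed(1)  # arbitrary seed for a repeatable illustration

# iris stands in for the dials data (assumption); 70/30 split as in the report
idx   <- sample(nrow(iris), floor(0.7 * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Variable importance of a single tree, as stored by rpart
single_tree <- rpart(Species ~ ., data = train, method = "class")
single_tree$variable.importance

# Bagging: fit one tree per bootstrap resample of the training set
n_trees <- 25  # hypothetical choice
votes <- sapply(seq_len(n_trees), function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  tree <- rpart(Species ~ ., data = boot, method = "class")
  as.character(predict(tree, test, type = "class"))
})

# Majority vote across the trees for each test row
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
bagged_acc  <- mean(bagged_pred == test$Species)
bagged_acc
```

The randomForest package offers a packaged alternative that additionally samples a random subset of variables at each split.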

7. Appendix.

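library(readxl)      # read_excel()
library(rpart)       # rpart(), printcp(), plotcp()
library(rpart.plot)  # rpart.plot()

# Load the dial measurements and the Safety_Issue label
data <- read_excel("Data for Analytics Individual Assignment Combined.xlsx")
View(data)
summary(data)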

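# Split into train (70%) and test (30%) sets
set.seed(123)  # assumed seed, not part of the original run; makes the split reproducible
sample_ind <- sample(nrow(data), nrow(data) * 0.7)
train <- data[sample_ind, ]
test <- data[-sample_ind, ]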

# CP tuning: grow the full tree (cp = 0) to inspect the complexity table
base_model <- rpart(Safety_Issue ~ ., data = train, method = "class",
                    control = rpart.control(cp = 0))

# Examine the complexity plot
printcp(base_model)
plotcp(base_model)

# Decision tree with the chosen complexity parameter
model <- rpart(Safety_Issue ~ ., data = train, method = "class",
               control = rpart.control(cp = 0.025))
rpart.plot(model, box.palette = list("green", "red", "yellow"))

# Validate the decision tree on the test set
pred <- predict(model, test, type = "class")  # named pred to avoid masking predict()

# Confusion matrix (rows: actual, columns: predicted)
table_mat <- table(test$Safety_Issue, pred)
table_mat

# Accuracy
accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
accuracy_Test