clemens.scholz@student.uni-halle.de
akin.sir@student.uni-halle.de
# Add paths to the different data sets
# path for training data set
path <- "/Users/clemensscholle/Desktop/Master/Statistical_Machine_Learning/Project/titanic/train.csv"
# path for test data set
path1 <- "/Users/clemensscholle/Desktop/Master/Statistical_Machine_Learning/Project/titanic/test.csv"
# path for the corresponding dependent variables for the test data set
path2 <- "/Users/clemensscholle/Desktop/Master/Statistical_Machine_Learning/Project/titanic/result.csv"
## Variable ## Definition ## Key
# survived # Survival # 0 = No, 1 = Yes
# pclass # Ticket class # 1, 2, 3 (e.g. 1 for first class.)
# sex # Sex # 0 = Female, 1 = Male
# Age # Age in years
# Sibsp # [#] of siblings / spouses abroad the Titanic (Brother, sister... & husband, wife)
# parch # [#] of parents / children abroad the Titanic (mother, father, daugher, son...)
# fare # Passenger fare
# embarked # Port of Embarkation # C = Cherbourg, Q = Queenstown, S = Southampton
It is important to acknowledge that the variable called
Passenger ID
was not useful for our analysis. We,
therefore, dropped this column. Next, we assigned a binary variable(s)
for Survived
variable, which is better illustration for
showing if the person alive (1) or dead (0). When it comes to the
graphical illustration we decided to choose “deepskyblue2"
colour for the person who survived, "black"
otherwise.
The Fare
variable indicates the passenger fare the
person paid. Being able to pay higher prices may be good illustration of
the wealth of passenger.
There are three different classes on the ship, which were indicated
as 1, 2 and 3. We created separate columns for each class in order to
assign a binary variable for each class. Thus, we will be able to
analyse them. For example, 1 if the passenger is in the first class, 0
otherwise. It is clear that the variable Pclass
is also
essential for our analysis. It might be a good illustration of the
wealth of the passengers who were able to afford first class. Thus, it
is highly relevant to the Fare
variable. Reasonably,
FirstClass
passengers had more luxuries available to them.
Most of first class was located in the middle hull and in the upper
structures. There were two staircases connecting the different decks.
SecondClass
was distributed along the height in the middle
of the ship with only one staircase connecting the different decks.
Amenities there would be the same as for the first class of other ships
of the time. The sole staircase may have impeded evacuation during the
crash. ThirdClass
was located on the deeper parts of the
ship as well as parts directly on the bow and the rear of the ship. The
location of the cabins would have obvious disadvantages during the
crash.
In addition, we assigned a binary variable for determining the gender
of the observed passengers. Therefore, Sex
equals 1 if the
passenger is a man and 0 otherwise. It is important to note that during
evacuation women were given priority, whilst men might have higher
chances of survival due to their physique. Children were prioritized
during evacuation but they would also be less likely to survive due to
their lesser strength compared to adults. Older people may exhibit the
same impediment to a higher degree.
When it comes to the Age
variable, it indicates the age
of the passenger. Unfortunately, the Age
variable was not
available for all cases of this data set. We, therefore, decided to drop
these missing cases in the cleaning process.
The variables SibSp
and Parch
were related
to their family bonds. The SibSp
variable showing how many
siblings or spouses were with the person aboard the ship. We assumed
that higher numbers may mean that more people assisted the person during
evacuation or that they may have spend valuable time during evacuation
looking for relatives. The Parch
variable showing how many
parents or children were with the person aboard the ship. In case of
young children this variable may increase survival as parents will try
to save their children. We also assumed that in case of parents this may
lower chances of survival in case they were looking for their children
or gave up seats on a lifeboat for them or it may improve survival
chances as they would gain a seat on a lifeboat together with their
children.
Finally, the Embarked
variable indicates the port of
embarkation of the passenger. There were three different ports of
embarkation available, that was: Southhampton (S), Cherbourg (C), and
Queenstown (Q). Similarly, we created separate columns for each port of
embarkation of the passengers in order to assign a binary variable for
each of them. Furhermore, we assumed that embarking earlier offers the
passenger more time and opportunity to explore the ship, know their
surroundings and get used to sea travel and possible sea sickness. This
factor may be compounded by earlier passengers having a less populated
ship to explore.
# Importing the dataset and preparing data to analysis
clean_data <- function(raw_data) {
# Find NA values in Age and Embarked
print(paste0("The number of missing values in Age variables: ", sum(is.na(raw_data$Age))))
print(paste0("The number of missing values in Embarked variables: ", sum(is.na(raw_data$Embarked))))
print(paste0("The number of missing values in Fare variables: ", sum(is.na(raw_data$Fare))))
# Delete rows with NA values
raw_data <- raw_data[complete.cases(raw_data$Age), ]
raw_data <- raw_data[complete.cases(raw_data$Embarked), ]
raw_data <- raw_data[complete.cases(raw_data$Fare), ]
# Converting Sex variable as boolean type.
raw_data$Sex <- ifelse(raw_data$Sex == "male", 1, 0)
# Putting dummies for our variables which are Pclass and Embarked (thus, we will be able to analyse them in later process)
# Basically what we do is creating a new column for all Embarked destionations Southhamptopn, Cherbourg and Queenstown with binary values. For example, in Southhampton 1 if "Embarked" is "S", 0 otherwise
raw_data$Southhampton <- ifelse(raw_data$Embarked == "S", 1, 0)
raw_data$Cherbourg <- ifelse(raw_data$Embarked == "C", 1, 0)
raw_data$Queenstown <- ifelse(raw_data$Embarked == "Q", 1, 0)
# Let us do the same process for the Pclass variable:
raw_data$FirstClass <- ifelse(raw_data$Pclass == "1", 1, 0)
raw_data$SecondClass <- ifelse(raw_data$Pclass == "2", 1, 0)
raw_data$ThirdClass <- ifelse(raw_data$Pclass == "3", 1, 0)
# Creating new data frame for cleaned data
clean_data <- data.frame(Name = raw_data$Name, Sex = raw_data$Sex, Age = raw_data$Age, Sibsp = raw_data$SibSp, Parch = raw_data$Parch, FirstClass = raw_data$FirstClass, SecondClass = raw_data$SecondClass, ThirdClass = raw_data$ThirdClass ,Southhampton = raw_data$Southhampton, Cherbourg = raw_data$Cherbourg, Queenstown = raw_data$Queenstown, Fare = raw_data$Fare, Survived = raw_data$Survived, Embarked = raw_data$Embarked, Pclass = raw_data$Pclass)
return(clean_data)
}
raw_data <- read_csv(path)
Rows: 891 Columns: 12── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Sex, Ticket, Cabin, Embarked
dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data <- clean_data(raw_data)
[1] "The number of missing values in Age variables: 177"
[1] "The number of missing values in Embarked variables: 2"
[1] "The number of missing values in Fare variables: 0"
data
# Clean up
rm(list = c("raw_data"))
# Let's illustrate the basic calculations
total_passenger <- length(data$Sex)
total_female <- sum(data$Sex == 0)
total_male <- sum(data$Sex == 1)
total_survival <- sum(data$Survived == 1)
total_female_survival <- sum(data$Sex == 0 & data$Survived == 1)
total_male_survival <- sum(data$Sex == 1 & data$Survived == 1)
## Class calculations ##
total_pass_1_class <- sum(data$Pclass == 1)
total_pass_1_class_survived <- sum(data$Pclass == 1 & data$Survived == 1)
total_pass_2_class <- sum(data$Pclass == 2)
total_pass_2_class_survived <- sum(data$Pclass == 2 & data$Survived == 1)
total_pass_3_class <- sum(data$Pclass == 3)
total_pass_3_class_survived <- sum(data$Pclass == 3 & data$Survived == 1)
## Class calculation ends ##
ggplot(data, aes(x = factor(Survived), fill = factor(Sex))) +
geom_bar() +
labs(
title = "Number of female and male passengers survived",
caption = "Sex: 0 = Female, 1 = Male; \n Survived: 0 = No, 1 = Yes",
x = "Number of passengers survived",
y = "Count",
fill = "Sex") +
scale_fill_manual(values = c("0" = "orchid3", "1" = "skyblue3")) +
theme(axis.line = element_line(color = "#000000"))
## General information ##
section1_gen_info_1 <- paste0("The total number of passengers: ", total_passenger)
section1_gen_info_2 <- paste0("The total number of female passengers: ", total_female)
section1_gen_info_3 <- paste0("The total number of male passengers: ", total_male)
section1_gen_info_4 <- paste0("The total number of passengers who survived: ", total_survival)
section1_gen_info_5 <- paste0("The total number of female passengers who survived: ", total_female_survival)
section1_gen_info_6 <- paste0("The total number of male passengers who survived: ", total_male_survival)
## General information ends ##
for (i in 1:6) {
variable_name <- paste0("section1_gen_info_", i)
variable_value <- get(variable_name)
print(variable_value)
}
[1] "The total number of passengers: 712"
[1] "The total number of female passengers: 259"
[1] "The total number of male passengers: 453"
[1] "The total number of passengers who survived: 288"
[1] "The total number of female passengers who survived: 195"
[1] "The total number of male passengers who survived: 93"
# Plotting the age distribution though histogram. We also wanted to illustrate the estimate of the density.
ggplot(data = data, aes(x=Age)) +
geom_histogram( aes(y=..density..)) +
geom_density(fill="#f1b147", color="#f1b147", alpha=0.11) +
labs(
title = "Age distribution",
x = "Age",
y = "Density") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
# Illustrating the number of age group and their fare. We also would like to illustrate survival rate as a color and parch as a size.
ggplot() +
geom_point(data = data, aes(x = Age, y = Fare, color = factor(Survived), size = Parch)) +
scale_color_manual(values = c("black", "deepskyblue3")) +
labs(
title = "Age & Fare & Familial relations & Survival",
x = "Age",
y = "Fare",
color = "Survival") +
geom_point() +
theme(axis.line = element_line(color = "#000000"))
scatter3D(data$Age, data$Fare, data$Pclass,
theta=22, phi=29,
xlab="Age of passengers", ylab="Fare", zlab="Ticket classes of passengers",
ticktype="detailed",
pch = 20,
cex.axis=0.45, cex.lab=0.95, expand=0.73,
colvar = data$Survived,
col = c("black", "deepskyblue3"))
# Illustrating scatter plot of Survival vs. Age with color by Sex
ggplot(data = data, aes(x = Age, y = factor(Survived), color = factor(Sex))) +
geom_jitter(width = 0.3, height = 0.3) +
labs(
title = "Distribution of survival by age and gender",
caption = "Sex: 0 = Female, 1 = Male; \n Survived: 0 = No, 1 = Yes",
x = "Age of passengers",
y = "Surviving passengers",
color = "Sex"
) +
theme(axis.line = element_line(color = "#000000"))
ggplot(data = data, aes(x = "", y = Pclass, fill = factor(Pclass))) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
labs(
title = "Number of passengers from each classes",
fill = "Class"
) +
theme_void()
## Class information ##
section1_gen_class_info_1 <- paste0("The total number of passengers from first class: ", total_pass_1_class)
section1_gen_class_info_2 <- paste0("The total number of passengers from first class who survived: ", total_pass_1_class_survived)
section1_gen_class_info_3 <- paste0("The total number of passengers from second class: ", total_pass_2_class)
section1_gen_class_info_4 <- paste0("The total number of passengers from second class who survived: ", total_pass_2_class_survived)
section1_gen_class_info_5 <- paste0("The total number of passengers from third class: ", total_pass_3_class)
section1_gen_class_info_6 <- paste0("The total number of passengers from third class who survived: ", total_pass_3_class_survived)
## Class information ends ##
for (i in 1:6) {
variable_name <- paste0("section1_gen_class_info_", i)
variable_value <- get(variable_name)
print(variable_value)
}
[1] "The total number of passengers from first class: 184"
[1] "The total number of passengers from first class who survived: 120"
[1] "The total number of passengers from second class: 173"
[1] "The total number of passengers from second class who survived: 83"
[1] "The total number of passengers from third class: 355"
[1] "The total number of passengers from third class who survived: 85"
# Scatter plot of Sex vs. Pclass with color by Survived
ggplot(data = data, aes(x = factor(Pclass), y = factor(Survived), color = factor(Sex))) +
geom_jitter(width = 0.3, height = 0.3) +
labs(
title = "Scatter plot of gender and class coloured by survival",
caption = "Sex: 0 = Female, 1 = Male; \n Survived: 0 = No, 1 = Yes \n 1 = first class, 2 = second class, 3 = third class",
x = "Ticket classes of passengers",
y = "Number of passengers survived",
color = "Sex") +
scale_color_manual(values = c("0" = "orchid3", "1" = "skyblue4")) +
theme(axis.line = element_line(color = "#000000"))
# Clean up
rm(list = c("i", "section1_gen_class_info_1", "section1_gen_class_info_2", "section1_gen_class_info_3", "section1_gen_class_info_4", "section1_gen_class_info_5", "section1_gen_class_info_6", "section1_gen_info_1", "section1_gen_info_2", "section1_gen_info_3", "section1_gen_info_4", "section1_gen_info_5", "section1_gen_info_6", "total_female", "total_female_survival", "total_male", "total_male_survival", "total_pass_1_class", "total_pass_2_class", "total_pass_3_class", "total_pass_1_class_survived", "total_pass_2_class_survived", "total_pass_3_class_survived", "total_passenger", "total_survival", "variable_name", "variable_value"))
# We decided to work on two columns that are Amount of Fare from passengers and Age of passengers. This is because all the other columns contain small numeric variables such as boolean 1 or 2, TRUE or FALSE or 1,2 and 3 for class.
set.seed(1234)
dt_kmeans <- data[, c("Age", "Fare")]
# First of all, let's start with 2 k-means (two centers) for now.
model_kmeans <- kmeans(dt_kmeans, centers = 2)
clustering_1_info1 <- paste0("The total number of observations for first cluster is ", model_kmeans$size[1], ".")
clustering_1_info2 <- paste0("The total number of observations for second cluster is ", model_kmeans$size[2], ".")
print(c(clustering_1_info1, clustering_1_info2 ))
[1] "The total number of observations for first cluster is 46."
[2] "The total number of observations for second cluster is 666."
# Here is the center of our clusters:
print(model_kmeans$centers)
Age Fare
1 31.60696 192.59855
2 29.50638 23.65218
# Let us visualise it by showing the centers as well
ggplot(data = dt_kmeans, aes(x = Age, y = Fare, color = as.factor(model_kmeans$cluster))) +
geom_point() +
# Let us illustrate the centers of clusters
geom_point(data = data.frame(model_kmeans$centers),
shape = "circle filled",
size = 7,
color = "#000000",
fill = c("deepskyblue2", "gold2"),
alpha = 0.75) +
guides(color = "none") +
scale_color_manual(values = c("deepskyblue2","gold2")) +
labs(
x = "Age of passengers",
y = "Passenger fare") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
# Let us standardise inputs
# We have seen the standardization of variables before: Every variable is centered on its mean and then scaled by its standard deviation.
dt_kmeans_scaled <- data.table(scale(dt_kmeans))
model_kmeans_scaled <- kmeans(dt_kmeans_scaled, centers = 2)
ggplot(data = dt_kmeans_scaled, aes(x = Age, y = Fare, color = as.factor(model_kmeans_scaled$cluster))) +
geom_point() +
geom_point(data = data.frame(model_kmeans_scaled$centers),
shape = "circle filled",
size = 7,
color = "#000000",
fill = c("deepskyblue2", "gold2"),
alpha = 0.75) +
guides(color="none") +
scale_color_manual(values = c("deepskyblue2", "gold2")) +
labs(
x="Age of passengers",
y= "Passenger fare") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
We find two clusters mainly depending on the age variable.
We also would like to illustrate elbow plot to find the optimal k. To do so, we will be needing the total within sum of squares (TWSS) of our scaled model. TWSS also means that the squared distance of all data points to their cluster centers. First of all, let us create a loop function and we will loop through different k values from 1 to 20 in this case. Next, we will illustrate an elbow plot to investigate the sensitivity of TWSS to the number of clusters k.
k <- 25
elbow <- data.table("k" = 1:k, "WSS" = NA)
# Let us create our loop function
for (i in 1:k) {
elbow$k[i] <- i
# Next, we are setting the new center of our cluster
elbow$WSS[i] <- kmeans(dt_kmeans_scaled, centers = i, nstart=50)$tot.withinss
}
Warning: did not converge in 10 iterations
# Illustrating our findings with the help of line chart
ggplot(data = elbow, aes(x = k, y = WSS)) +
geom_line() +
labs(
title = "Elbow plot",
x = "k",
y = "TWSS") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
NbClust(data=dt_kmeans_scaled, method="kmeans")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 5 proposed 2 as the best number of clusters
* 4 proposed 3 as the best number of clusters
* 1 proposed 4 as the best number of clusters
* 1 proposed 6 as the best number of clusters
* 1 proposed 10 as the best number of clusters
* 6 proposed 12 as the best number of clusters
* 1 proposed 13 as the best number of clusters
* 4 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 12
*******************************************************************
$All.index
KL CH Hartigan CCC Scott Marriot TrCovW TraceW
2 2.2054 372.2797 483.7325 -5.9295 729.605 719419.3 380094.971 932.8642
3 3.1755 554.0442 201.6686 -7.1991 1334.611 692042.4 77488.085 554.8426
4 0.7899 540.8836 137.7367 -7.7737 1729.603 706446.6 43973.842 431.9720
5 0.3825 518.2829 252.7282 -8.8798 2064.002 690125.1 37237.721 361.6210
6 4.3812 612.5190 93.7143 -4.4907 2393.777 625373.5 16891.388 266.3942
7 1.9768 592.9652 112.7439 -5.3835 2600.348 636842.7 15356.183 235.1769
8 0.2219 604.7835 38.0767 -4.8875 2818.839 611990.3 11175.844 202.7526
9 0.8115 561.7677 21.8105 -6.8808 2926.707 665662.2 10806.627 192.3492
10 4.5647 516.5289 82.5888 -9.1249 2977.649 765061.0 10364.909 186.5612
11 0.0511 527.0747 413.5455 -8.6338 3121.229 756663.2 7496.572 166.9230
12 70.0097 798.2867 63.3209 2.4591 3773.675 360173.1 2800.028 104.9872
13 0.1160 802.0861 13.3817 2.5410 3851.963 378688.5 1814.260 96.2781
14 0.6323 754.5096 119.1954 0.8346 3912.351 403475.5 2160.602 94.4695
15 110.4838 827.5847 47.0911 3.2983 4130.540 340922.9 1703.729 80.6903
Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
2 1.4887 1.5243 0.1050 1.3807 0.4615 1.4540 -211.6973 -0.3116 0.3954
3 3.1304 2.5629 0.1238 0.9002 0.4603 1.3762 -115.0938 -0.2722 0.4508
4 4.9496 3.2919 0.0993 0.9209 0.4266 1.3420 -74.6670 -0.2542 0.4167
5 7.3138 3.9323 0.0891 1.0011 0.3872 2.9908 -161.7503 -0.6632 0.3853
6 8.9009 5.3380 0.0879 0.8025 0.4369 2.7302 -169.2059 -0.6309 0.3680
7 10.8642 6.0465 0.0763 0.8895 0.4247 2.0025 -113.1438 -0.4982 0.3452
8 13.0753 7.0135 0.0697 0.8262 0.4441 2.5076 -117.2349 -0.5980 0.3273
9 14.6396 7.3928 0.0656 0.8731 0.3919 1.9970 -143.7807 -0.4958 0.3099
10 15.3350 7.6222 0.0622 0.9273 0.3894 2.6234 -68.0691 -0.6110 0.2947
11 16.9784 8.5189 0.0630 0.9180 0.3972 1.7144 -105.8414 -0.4137 0.2832
12 27.8408 13.5445 0.0940 0.8708 0.4085 3.7345 -133.9980 -0.7264 0.2778
13 28.5170 14.7697 0.1006 0.8715 0.4153 2.8057 -34.7535 -0.6241 0.2678
14 30.6223 15.0525 0.0887 0.8572 0.4065 1.3694 -18.6112 -0.2656 0.2582
15 35.8629 17.6229 0.0867 0.8185 0.4165 1.9926 -78.2067 -0.4897 0.2508
Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
2 466.4321 0.4615 0.1400 0.3579 0.0026 0.0010 11.5350 0.8343 1.7601
3 184.9475 0.5148 0.5897 0.4115 0.0095 0.0014 12.9759 0.7088 1.4853
4 107.9930 0.5316 1.8006 0.5287 0.0096 0.0016 10.8509 0.6034 1.1531
5 72.3242 0.4595 0.3418 0.8339 0.0050 0.0017 10.2125 0.5320 0.9741
6 44.3990 0.4600 0.7410 0.8683 0.0059 0.0018 8.9781 0.4534 0.7840
7 33.5967 0.4367 0.2640 0.9762 0.0118 0.0018 8.8506 0.4119 0.6948
8 25.3441 0.4360 2.9541 0.9639 0.0118 0.0019 8.1356 0.3791 0.5993
9 21.3721 0.3774 0.0339 1.3135 0.0118 0.0019 9.4129 0.3542 0.5528
10 18.6561 0.3800 0.1934 1.2715 0.0059 0.0019 8.6732 0.3447 0.5089
11 15.1748 0.3792 1.1639 1.2443 0.0064 0.0019 9.1083 0.3332 0.5383
12 8.7489 0.3692 -0.5533 1.3041 0.0196 0.0020 8.8987 0.2982 0.2490
13 7.4060 0.3806 1.5236 1.1929 0.0110 0.0020 8.0162 0.2875 0.2119
14 6.7478 0.3771 -0.6041 1.2135 0.0098 0.0020 11.6938 0.2834 0.2225
15 5.3794 0.3826 1.4004 1.1620 0.0098 0.0020 11.5027 0.2713 0.1787
$All.CriticalValues
CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
2 0.5699 511.7558 1
3 0.5221 385.4155 1
4 0.5572 232.8363 1
5 0.5318 213.9526 1
6 0.5179 248.5450 1
7 0.5093 217.7170 1
8 0.5022 193.3021 1
9 0.4775 315.1968 1
10 0.4058 161.0794 1
11 0.4739 282.0160 1
12 0.4618 213.2379 1
13 0.2585 154.9208 1
14 0.3779 113.5741 1
15 0.3631 275.3848 1
$Best.nc
KL CH Hartigan CCC Scott Marriot TrCovW
Number_clusters 15.0000 15.0000 12.0000 15.0000 12.0000 12.0 3.0
Value_Index 110.4838 827.5847 350.2246 3.2983 652.4456 415005.5 302606.9
TraceW Friedman Rubin Cindex DB Silhouette Duda PseudoT2
Number_clusters 3.0000 12.0000 12.0000 10.0000 6.0000 2.0000 2.000 2.0000
Value_Index 255.1511 10.8624 -3.8004 0.0622 0.8025 0.4615 1.454 -211.6973
Beale Ratkowsky Ball PtBiserial Frey McClain Dunn Hubert
Number_clusters 2.0000 3.0000 3.0000 4.0000 1 2.0000 12.0000 0
Value_Index -0.3116 0.4508 281.4846 0.5316 NA 0.3579 0.0196 0
SDindex Dindex SDbw
Number_clusters 13.0000 0 15.0000
Value_Index 8.0162 0 0.1787
$Best.partition
[1] 1 4 12 4 10 11 3 12 5 3 7 1 2 1 7 3 10 10 10 1 12 3 10 8 2 7
[27] 4 2 1 1 1 2 12 3 1 1 5 1 11 12 7 1 12 3 5 1 11 3 12 1 1 12
[53] 10 5 4 12 4 12 3 10 1 12 4 1 10 5 8 12 12 1 11 12 7 7 4 10 10 12
[79] 4 10 10 12 1 10 2 1 1 1 1 1 7 12 8 3 4 10 10 11 5 12 2 10 1 2
[105] 12 12 12 1 4 1 4 1 12 1 1 1 12 5 10 2 2 4 7 2 11 1 10 2 2 12
[131] 1 3 5 2 4 7 3 3 1 7 1 2 10 10 5 3 3 2 2 10 10 1 1 3 2 11
[157] 2 12 12 10 2 1 3 10 12 1 2 12 10 1 10 9 12 2 4 10 1 12 2 4 1 1
[183] 1 1 4 12 7 3 12 2 5 1 10 12 1 10 11 12 12 4 7 12 7 10 2 12 4 6
[209] 2 3 11 2 10 5 12 11 9 12 2 10 11 2 3 10 7 12 1 1 10 10 1 2 1 4
[235] 4 10 12 12 12 9 8 1 9 9 10 4 4 8 12 12 2 12 12 7 9 9 1 12 10 12
[261] 9 7 10 10 5 2 9 1 4 9 2 2 3 8 12 12 10 12 2 3 2 12 1 12 12 4
[287] 10 2 12 2 10 10 11 4 4 1 1 9 3 1 8 1 1 8 3 10 4 4 3 10 1 9
[313] 1 12 9 12 1 10 2 12 12 10 12 1 12 1 10 2 3 1 4 2 10 1 10 5 1 12
[339] 12 1 12 1 10 12 2 1 11 9 1 12 8 10 2 1 12 12 5 5 10 3 2 10 10 11
[365] 12 7 2 2 10 2 2 10 7 3 10 10 12 1 10 12 1 3 5 2 7 4 4 7 10 5
[391] 1 7 7 1 11 9 12 1 1 10 4 9 10 12 4 12 10 11 12 2 10 10 10 4 1 2
[417] 2 2 10 12 3 1 10 3 2 9 4 4 5 5 10 11 7 1 10 5 9 12 1 1 7 2
[443] 4 10 2 12 12 1 12 10 7 11 10 1 1 10 2 10 12 9 7 10 4 2 11 1 10 11
[469] 2 10 10 2 11 12 2 10 10 10 12 12 9 2 10 4 10 12 3 12 12 2 1 1 1 7
[495] 7 4 12 7 2 10 5 12 10 10 2 1 4 3 3 11 1 7 12 1 1 1 4 10 12 11
[521] 11 2 2 10 1 4 12 2 2 4 7 10 1 12 1 2 6 4 1 5 7 12 5 1 1 8
[547] 4 3 12 7 2 2 11 2 8 10 1 12 12 2 2 2 9 4 2 12 2 1 8 12 10 3
[573] 1 10 2 4 1 10 12 12 8 5 12 12 12 2 6 4 8 12 10 7 1 10 5 10 3 3
[599] 10 12 11 3 12 1 10 4 2 1 9 1 11 10 10 12 2 7 7 1 3 8 1 5 12 12
[625] 12 1 5 3 11 5 12 10 2 10 10 10 10 10 9 3 12 10 10 1 10 4 12 2 10 3
[651] 10 12 10 2 5 11 12 10 12 3 3 1 3 12 1 4 1 4 1 1 10 10 1 2 10 12
[677] 3 7 5 5 2 1 9 2 12 2 1 2 12 2 12 4 3 12 2 10 2 12 1 1 1 11
[703] 12 10 1 12 12 2 12 1 12 10
# According to the majority rule, the best number of clusters is 3
It is clear that 3 centers would make more suitable for our work, so let us replicate our clustering work with 3 centers.
dt_kmeans_scaled <- data.table(scale(dt_kmeans))
model_kmeans_scaled <- kmeans(dt_kmeans_scaled, centers = 3)
ggplot(data = dt_kmeans_scaled, aes(x = Age, y = Fare, color = as.factor(model_kmeans_scaled$cluster))) +
geom_point() +
geom_point(data = data.frame(model_kmeans_scaled$centers),
shape = "circle filled",
size = 7,
color = "#000000",
fill = c("deepskyblue2", "gold2", "forestgreen"),
alpha = 0.75) +
guides(color="none") +
scale_color_manual(values = c("deepskyblue2", "gold2", "forestgreen")) +
labs(
x="Age of passengers",
y= "Passenger fare") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
# Clean up
rm(list = c("k", "i", "clustering_1_info1", "clustering_1_info2", "dt_kmeans", "dt_kmeans_scaled", "elbow", "model_kmeans", "model_kmeans_scaled"))
# Estimate the logistic model
model_glm <- glm(formula = Survived ~ FirstClass + SecondClass + Sex + Age + Sibsp + Parch + Fare + Cherbourg + Queenstown,
data = data, family = binomial(link = "logit"))
# View summary statistics
print(summary(model_glm))
Call:
glm(formula = Survived ~ FirstClass + SecondClass + Sex + Age +
Sibsp + Parch + Fare + Cherbourg + Queenstown, family = binomial(link = "logit"),
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7220 -0.6460 -0.3796 0.6329 2.4461
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.634864 0.325362 5.025 5.04e-07 ***
FirstClass 2.395220 0.343356 6.976 3.04e-12 ***
SecondClass 1.205583 0.249757 4.827 1.39e-06 ***
Sex -2.637859 0.223006 -11.829 < 2e-16 ***
Age -0.043308 0.008322 -5.204 1.95e-07 ***
Sibsp -0.362925 0.129290 -2.807 0.005 **
Parch -0.060365 0.123944 -0.487 0.626
Fare 0.001451 0.002595 0.559 0.576
Cherbourg 0.402848 0.274556 1.467 0.142
Queenstown -0.420532 0.556371 -0.756 0.450
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 960.90 on 711 degrees of freedom
Residual deviance: 632.34 on 702 degrees of freedom
AIC: 652.34
Number of Fisher Scoring iterations: 5
with(model_glm, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
[1] 2.248584e-65
Looking at the result of the model, it is clear that being in first class, being female and having a younger age were statistically significant predictors of survival, while the number of siblings and other variables did not show significant influence on survival in the given data set.
### Base tree
# Build a decision tree model using rpart
model_tree <- rpart(data = data, formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch + Fare + Embarked, method = "class")
# Plot the decision tree model using prp, with extra details (extra = 2)
prp(model_tree, extra = 2, main = "Base tree for model")
# Print an plot data on tree
printcp(model_tree)
Classification tree:
rpart(formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch +
Fare + Embarked, data = data, method = "class")
Variables actually used in tree construction:
[1] Age Fare Parch Pclass Sex Sibsp
Root node error: 288/712 = 0.40449
n= 712
CP nsplit rel error xerror xstd
1 0.454861 0 1.00000 1.00000 0.045472
2 0.029514 1 0.54514 0.54514 0.038412
3 0.027778 3 0.48611 0.55208 0.038586
4 0.024306 4 0.45833 0.52778 0.037965
5 0.010417 5 0.43403 0.47569 0.036523
6 0.010000 9 0.39236 0.49306 0.037021
plotcp(model_tree)
# The tree shows that the first deciding factor seems to be gender. Afterwards mainly class and age show seem prominent deciding factors for branches though interestingly enough fare seems to play an important role as well.
# Determine the minimum complexity parameter (cp) for pruning
#cp parameter: captures model complexity
min_cp <- model_tree$cptable[which.min(model_tree$cptable[, "xerror"]), "CP"]
# Prune the decision tree model using the minimum cp
model_tree_2 <- prune(model_tree, cp = min_cp)
# Plot the pruned decision tree model using prp, with extra details (extra = 2)
prp(model_tree_2, extra = 2, main = "Pruned base tree with less complexity")
# The branch determined by being male and over the age of 6 shows the biggest group and the most wrong guesses of the model.
# Calculate the mean accuracy of the model predictions
mean(predict(model_tree, data, type = "class")==data$Survived)
[1] 0.8412921
mean(predict(model_tree_2, data, type = "class")==data$Survived)
[1] 0.8244382
# Between the accuracy for the pruned and not-pruned model there is only a 2% difference so we will choose the pruned model as representative as it is less complex
model_trees <- model_tree_2
# Print an plot data on tree
printcp(model_tree_2)
Classification tree:
rpart(formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch +
Fare + Embarked, data = data, method = "class")
Variables actually used in tree construction:
[1] Age Fare Pclass Sex Sibsp
Root node error: 288/712 = 0.40449
n= 712
CP nsplit rel error xerror xstd
1 0.454861 0 1.00000 1.00000 0.045472
2 0.029514 1 0.54514 0.54514 0.038412
3 0.027778 3 0.48611 0.55208 0.038586
4 0.024306 4 0.45833 0.52778 0.037965
5 0.010417 5 0.43403 0.47569 0.036523
plotcp(model_tree_2)
### Visualization for interesting variable combinations
# Convert 'Survived' variable to a factor
tree_data <- data
tree_data$Survived <- as.factor(tree_data$Survived)
# Build a decision tree model using rpart
model_tree_4 <- rpart(data = tree_data, formula = Survived ~ Age + Sibsp, method = "class")
# Plot the decision tree model using prp
prp(model_tree_4, extra = 2, main = "Tree for Age and Sibling/Spouse")
# Determine the minimum complexity parameter (cp) for pruning
min_cp <- model_tree_4$cptable[which.min(model_tree_4$cptable[, "xerror"]), "CP"]
# Prune the decision tree model using the minimum cp
model_tree_5 <- prune(model_tree_4, cp = min_cp)
# Plot the pruned decision tree model using prp
prp(model_tree_5, extra = 2, main = "Pruned tree for Age and Sibling/Spouse")
# Print an plot data on tree
printcp(model_tree_5)
Classification tree:
rpart(formula = Survived ~ Age + Sibsp, data = tree_data, method = "class")
Variables actually used in tree construction:
[1] Age Sibsp
Root node error: 288/712 = 0.40449
n= 712
CP nsplit rel error xerror xstd
1 0.065972 0 1.00000 1.00000 0.045472
2 0.034722 1 0.93403 0.94792 0.045049
3 0.010000 2 0.89931 0.92014 0.044786
plotcp(model_tree_5)
# Create a scatter plot with decision tree boundaries for Age and Sibsp variables
plot1 <- ggplot(data = tree_data, aes(x = Sibsp, y = Age)) +
geom_point(aes(color = Survived)) +
geom_parttree(data = model_tree_5, aes(fill = Survived), alpha = 0.1) +
scale_color_manual(values = c("black", "deepskyblue2")) +
scale_fill_manual(values = c("black", "deepskyblue2")) +
guides(color = "none") +
labs(
x = "Sibsp",
y = "Age",
color = "Survived",
title = "Scatter plot of age and familial relations showing survival and tree estimation of survival"
) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"), legend.position = "none")
# Build another decision tree model using rpart
model_tree_4 <- rpart(data = tree_data, formula = Survived ~ Age + Parch, method = "class")
# Plot the decision tree model using prp
prp(model_tree_4, extra = 2, main = "Tree for Age and Parent/Child")
# Determine the minimum complexity parameter (cp) for pruning
min_cp <- model_tree_4$cptable[which.min(model_tree_4$cptable[, "xerror"]), "CP"]
# Prune the decision tree model using the minimum cp
model_tree_6 <- prune(model_tree_4, cp = min_cp)
# Plot the pruned decision tree model using prp
prp(model_tree_6, extra = 2, main = "Pruned tree for Age and Parent/Child")
# Print an plot data on tree
printcp(model_tree_6)
Classification tree:
rpart(formula = Survived ~ Age + Parch, data = tree_data, method = "class")
Variables actually used in tree construction:
[1] Age Parch
Root node error: 288/712 = 0.40449
n= 712
CP nsplit rel error xerror xstd
1 0.052083 0 1.00000 1.00000 0.045472
2 0.027778 1 0.94792 0.97917 0.045313
3 0.017361 2 0.92014 0.98611 0.045368
4 0.011574 3 0.90278 0.98958 0.045395
5 0.010000 6 0.86806 0.96528 0.045200
plotcp(model_tree_6)
# Create a scatter plot with decision tree boundaries for Age and Parch variables
plot2 <- ggplot(data = tree_data, aes(x = Parch, y = Age)) +
geom_point(aes(color = Survived)) +
geom_parttree(data = model_tree_6, aes(fill = Survived), alpha = 0.1) +
scale_color_manual(values = c("black", "deepskyblue2")) +
scale_fill_manual(values = c("black", "deepskyblue2")) +
guides(color = "none") +
labs(
x = "Parch",
y = "Age",
color = "Survived"
) +
theme_minimal() +
theme(
axis.line = element_line(color = "#000000"),
legend.position = "bottom",
legend.box = "horizontal"
)
# Combine the two plots
plot1 + plot2
# Visualizing both age and fare as the only variables containing throughout distributed values.
# Build a decision tree model using rpart
model_tree_4 <- rpart(data = tree_data, formula = Survived ~ Age + Fare, method = "class")
# Plot the decision tree model using prp
prp(model_tree_4, extra = 2, main = "Tree for Age and Fare")
# Determine the minimum complexity parameter (cp) for pruning
min_cp <- model_tree_4$cptable[which.min(model_tree_4$cptable[, "xerror"]), "CP"]
# Prune the decision tree model using the minimum cp
model_tree_7 <- prune(model_tree_4, cp = min_cp)
# Plot the pruned decision tree model using prp
prp(model_tree_7, extra = 2, main = "pruned tree for Age and Fare")
# Print an plot data on tree
printcp(model_tree_7)
Classification tree:
rpart(formula = Survived ~ Age + Fare, data = tree_data, method = "class")
Variables actually used in tree construction:
[1] Age Fare
Root node error: 288/712 = 0.40449
n= 712
CP nsplit rel error xerror xstd
1 0.201389 0 1.00000 1.00000 0.045472
2 0.031250 1 0.79861 0.81597 0.043567
3 0.012153 3 0.73611 0.77083 0.042918
plotcp(model_tree_7)
# Create a scatter plot with decision tree boundaries for Age and fare variables
ggplot(data = tree_data, aes(x = Fare, y = Age)) +
geom_point(aes(color = Survived)) +
geom_parttree(data = model_tree_7, aes(fill = Survived), alpha = 0.1) +
scale_color_manual(values = c("black", "deepskyblue2")) +
scale_fill_manual(values = c("black", "deepskyblue2")) +
guides(color = "none") +
labs(
x = "Fare",
y = "Age",
color = "Survived",
title = "Scatter plot of age and fare showing survival and tree estimation of survival"
) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"), legend.position = "none")
# This visualization though looking quite nice does not seem to indicate any notable impact of fare in my opinion.
# Clean up
rm(list = c("min_cp", "plot1", "plot2", "model_tree", "model_tree_2", "model_tree_4", "model_tree_5", "model_tree_6", "model_tree_7", "tree_data"))
It is clear that there is only small difference between Pruned base tree and complex base tree. Between the accuracy for the pruned and not-pruned model there is only a 2% difference so we will choose the pruned model as representative as it is less complex. This is because, we believe that this will save computational memory. Moreover, the higher complexity tree does only improve on a less relevant branch.
The more important branch determined by being male and over the age of 6 shows the biggest group and the most wrong guesses of the model.
### Investigating hyperparameters
# Define the values for minimum observations, maximum depth, and complexity parameter (cp)
min_obs <- seq(from = 5, to = 50, by = 5)
max_depth <- c(1, 5, 10, 15)
cp_val <- c(0.001, 0.01, 0.05, 0.1)
# Create an empty data frame to store the results
results <- data.frame()
# Iterate over the combinations of min_obs, max_depth, and cp_val
for (i in 1:length(min_obs)) {
for (j in 1:length(max_depth)) {
for (k in 1:length(cp_val)) {
# Build a decision tree model with the specified parameters
model <- rpart(data = data, formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch + Fare + Embarked,
method = "class",
cp = cp_val[k],
maxdepth = max_depth[j],
minbucket = min_obs[i])
# Calculate the accuracy of the model predictions
acc <- mean(predict(model, data, type = "class") == data$Survived)
# Append the accuracy and parameter values to the results data frame
results <- rbind(results, c(acc, min_obs[i], max_depth[j], cp_val[k]))
}
}
}
# Set column names for the results data frame
colnames(results) <- c("accuracy", "min_obs", "max_depth", "cp")
# Create a 3D scatter plot using scatter3D
plot3D::scatter3D(x = results$min_obs, y = results$max_depth, z = results$cp,
theta = 30, phi = 10,
xlab = "Min Obs", ylab = "Max Depth", zlab = "CP",
ticktype = "detailed",
pch = 19,
cex.axis = 0.75, cex.lab = 0.75, expand = 0.75,
colvar = results$accuracy, col = viridis(100))
# Select the optimal parameter combination with the highest accuracy
opt_params <- results[which.max(results$accuracy),]
# Build a decision tree model using the optimal parameters
model_tree_3 <- rpart(data=data, formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch + Fare + Embarked,
method = "class",
cp = opt_params$cp,
maxdepth = opt_params$max_depth,
minbucket = opt_params$min_obs)
# Calculate the mean accuracy of the model predictions
mean(predict(model_tree_3, data, type = "class") == data$Survived)
[1] 0.875
# Plot the pruned decision tree with additional details
prp(model_tree_3, extra = 2, main = "Tree with optimal parameters")
# Print an plot data on tree
printcp(model_tree_3)
Classification tree:
rpart(formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch +
Fare + Embarked, data = data, method = "class", cp = opt_params$cp,
maxdepth = opt_params$max_depth, minbucket = opt_params$min_obs)
Variables actually used in tree construction:
[1] Age Fare Parch Pclass Sex Sibsp
Root node error: 288/712 = 0.40449
n= 712
CP nsplit rel error xerror xstd
1 0.4548611 0 1.00000 1.00000 0.045472
2 0.0295139 1 0.54514 0.54514 0.038412
3 0.0277778 3 0.48611 0.54861 0.038499
4 0.0243056 4 0.45833 0.52083 0.037782
5 0.0104167 5 0.43403 0.48264 0.036724
6 0.0092593 6 0.42361 0.49653 0.037119
7 0.0078125 9 0.39583 0.51042 0.037502
8 0.0069444 13 0.36458 0.50347 0.037312
9 0.0052083 15 0.35069 0.50347 0.037312
10 0.0034722 21 0.31944 0.50000 0.037215
11 0.0017361 23 0.31250 0.49653 0.037119
12 0.0010000 25 0.30903 0.49653 0.037119
plotcp(model_tree_3)
# Clean up
rm(list = c("acc", "cp_val", "i", "j", "k", "max_depth", "min_obs", "results", "opt_params", "model_tree_3", "model"))
The uncertain branch discussed before (male, age above 6) can reasonably be further distinguished by class and maybe age. Afterwards the classification seems to be mostly dependent on fare with decisions along the branch leading to no real insights. The optimal model shows only marginal improvement compared to accuracy of earlier models.
set.seed(12345) # Set a seed for reproducibility
# Data setup
X <- as.matrix(data[, c("FirstClass", "SecondClass", "ThirdClass", "Sex", "Age", "Sibsp", "Parch", "Fare", "Southhampton", "Cherbourg", "Queenstown")])
y <- data$Survived
## Model as in lecture
normalize <- layer_normalization() # Create a normalization layer
normalize %>% adapt(X) # Adapt the normalization layer to the data
print(normalize$mean) # Print the mean values used for normalization
tf.Tensor(
[[ 0.25842693 0.24297753 0.49859545 0.6362359 29.64209 0.51404494
0.4325842 34.567238 0.77808994 0.18258427 0.03932584]], shape=(1, 11), dtype=float32)
model_neural <- keras_model_sequential() # Create a sequential model
model_neural %>%
normalize() %>% # Apply normalization # L1
layer_dense(name = "HiddenLayer_1", units = 16, activation = 'relu') %>% # Add a hidden layer with 16 units and ReLU activation
layer_dense(name = "HiddenLayer_2", units = 16, activation = 'relu') %>% # Add another hidden layer with 16 units and ReLU activation
layer_dense(name = "HiddenLayer_3", units = 16, activation = 'relu') %>% # Add a third hidden layer with 16 units and ReLU activation
layer_dense(name = "OutputLayer", units = 1) # Add an output layer with 1 unit
summary(model_neural) # Print the summary of the model
Model: "sequential_82"
____________________________________________________________________________________
Layer (type) Output Shape Param # Trainable
====================================================================================
normalization_82 (Normalizatio (None, 11) 23 Y
n)
HiddenLayer_1 (Dense) (None, 16) 192 Y
HiddenLayer_2 (Dense) (None, 16) 272 Y
HiddenLayer_3 (Dense) (None, 16) 272 Y
OutputLayer (Dense) (None, 1) 17 Y
====================================================================================
Total params: 776
Trainable params: 753
Non-trainable params: 23
____________________________________________________________________________________
#plot_model(model_base)
model_neural %>% compile(
loss = "MSE",
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = "Accuracy"
) # Compile the model with mean squared error loss, Adam optimizer, and accuracy as a metric
model_neural %>% fit(
x = X,
y = y,
epochs = 500,
verbose = 0
) # Train the model on the data for 500 epochs
# model_neural$layers[[2]]$kernel # Print the kernel (weights) of the second layer
# model_neural$layers[[3]]$kernel # Print the kernel (weights) of the third layer
# model_neural$layers[[4]]$kernel # Print the kernel (weights) of the fourth layer
model_neural %>% evaluate(X, y, verbose = 0) # Evaluate the model on the data
loss Accuracy
0.07786684 0.89747190
rm(list = c("X", "y"))
### Investigating hyperparameters
# Define the values
learning_rate <- c(0.0001, 0.001, 0.01, 0.1)
width <- c(4, 16, 24)
length <- c(1, 3, 10)
# Data setup
X <- as.matrix(data[, c("FirstClass", "SecondClass", "ThirdClass", "Sex", "Age", "Sibsp", "Parch", "Fare", "Southhampton", "Cherbourg", "Queenstown")])
y <- data$Survived
# Create an empty data frame to store the results
results <- data.frame()
# Iterate over the combinations
for (i in 1:length(learning_rate)) {
for (j in 1:length(width)) {
for (k in 1:length(length)) {
#print(i)
#print(j)
#print(k)
if (length[k] == 1) {
normalize_i <- layer_normalization() # Create a normalization layer
normalize_i %>% adapt(X) # Adapt the normalization layer to the data
model_neural_i <- keras_model_sequential() # Create a sequential model
model_neural_i %>%
normalize_i() %>% # Apply normalization
layer_dense(name = "HiddenLayer_1", units = width[j], activation = 'relu') %>%
layer_dense(name = "OutputLayer", units = 1) # Add output layer with 1 unit
model_neural_i %>% compile(loss="MSE", optimizer= optimizer_adam(learning_rate = learning_rate[i]), metrics="Accuracy")
model_neural_i %>% fit(x = X, y = y, epochs = 250, verbose = 0)
} else if (length[k] == 3) {
normalize_i <- layer_normalization() # Create a normalization layer
normalize_i %>% adapt(X) # Adapt the normalization layer to the data
model_neural_i <- keras_model_sequential() # Create a sequential model
model_neural_i %>%
normalize_i() %>% # Apply normalization
layer_dense(name = "HiddenLayer_1", units = width[j], activation = 'relu') %>%
layer_dense(name = "HiddenLayer_2", units = width[j], activation = 'relu') %>%
layer_dense(name = "HiddenLayer_3", units = width[j], activation = 'relu') %>%
layer_dense(name = "OutputLayer", units = 1) # Add output layer with 1 unit
model_neural_i %>% compile(loss="MSE", optimizer=optimizer_adam(learning_rate = learning_rate[i]), metrics="Accuracy")
model_neural_i %>% fit(x = X, y = y, epochs = 250, verbose = 0)
} else if (length[k] == 10) {
normalize_i <- layer_normalization() # Create a normalization layer
normalize_i %>% adapt(X) # Adapt the normalization layer to the data
model_neural_i <- keras_model_sequential() # Create a sequential model
model_neural_i %>%
normalize_i() %>% # Apply normalization
layer_dense(name = "HiddenLayer_1", units = width[j], activation = 'relu') %>%
layer_dense(name = "HiddenLayer_2", units = width[j], activation = 'relu') %>%
layer_dense(name = "HiddenLayer_3", units = width[j], activation = 'relu') %>%
layer_dense(name = "HiddenLayer_4", units = width[j], activation = 'relu') %>%
layer_dense(name = "HiddenLayer_5", units = width[j], activation = 'relu') %>%
layer_dense(name = "HiddenLayer_6", units = width[j], activation = 'relu') %>%
layer_dense(name = "HiddenLayer_7", units = width[j], activation = 'relu') %>%
layer_dense(name = "HiddenLayer_8", units = width[j], activation = 'relu') %>%
layer_dense(name = "HiddenLayer_9", units = width[j], activation = 'relu') %>%
layer_dense(name = "HiddenLayer_10", units = width[j], activation = 'relu') %>%
layer_dense(name = "OutputLayer", units = 1) # Add output layer with 1 unit
model_neural_i %>% compile(loss="MSE", optimizer=optimizer_adam(learning_rate = learning_rate[i]), metrics="Accuracy")
model_neural_i %>% fit(x = X, y = y, epochs = 250, verbose = 0)
}
# Calculate the accuracy of the model predictions
eval <- model_neural_i %>% evaluate(X, y, verbose = 0)
acc <- eval[2]
# Append the accuracy and parameter values to the results data frame
results <- rbind(results, c(acc, learning_rate[i], width[j], length[k]))
}
}
}
# Set column names for the results data frame
colnames(results) <- c("accuracy", "learning_rate", "width", "length")
# Create a 3D scatter plot using scatter3D
plot3D::scatter3D(x = results$learning_rate, y = results$width, z = results$length,
theta = 30, phi = 10,
xlab = "Learning Rate", ylab = "Width", zlab = "Length",
ticktype = "detailed",
pch = 19,
cex.axis = 0.75, cex.lab = 0.75, expand = 0.75,
colvar = results$accuracy, col = viridis(100))
# Select the optimal parameter combination with the highest accuracy
opt_params <- results[which.max(results$accuracy),]
opt_params
# Clean up
rm(list = c("X", "y", "i", "j", "k", "width", "length", "learning_rate", "results", "opt_params", "normalize_i", "model_neural_i", "eval"))
Looking at the graph, it is evident that we improve our model with more neurons per layer as well as the with higher number of layers but in order to save computational memory we still decided to choose just three layers and sixteen neurons per layer for our model. Furthermore, our investigation shows that the best learning rate in our case is the default of 0.001.
formula <- Survived ~ FirstClass + SecondClass + Sex + Age + Sibsp + Parch + Fare + Cherbourg + Queenstown
# Building a random forest model using the randomForest function
# mtry parameter is set to 9, which controls feature subsampling
# ntree parameter is set to 500, specifying the number of trees in the random forest
model_ensemble_bagging <- randomForest(formula,
data=data,
mtry=9,
ntree=500)
Warning: The response has five or fewer unique values. Are you sure you want to do regression?
# Creating a variable importance plot for the random forest model
varImpPlot(model_ensemble_bagging, main="Bagging: Variable Importance")
# Creating a data.table to compare actual survival values with predicted values from the bagging model
model_bag_result_comparison <- data.table(actual = data$Survived,
prediction = predict(model_ensemble_bagging, newdata = data))
# Building a gradient boosting model using the gbm function
# distribution is set to "gaussian" for regression, but could be modified for classification
# interaction.depth sets the maximum depth of the individual trees
# n.trees is set to 500, indicating the number of boosting iterations
# shrinkage controls the step size when updating the weights of each tree
# cv.folds is set to 5, specifying the number of cross-validation folds for model evaluation
model_ensemble_boosting <- gbm(formula = formula,
data=data,
distribution="gaussian",
interaction.depth=1,
n.trees=500,
shrinkage=0.1,
cv.folds = 5)
# Calculating the performance of the boosting model using cross-validation
performance_ensemble <- gbm.perf(model_ensemble_boosting, method = "cv")
# Generating a summary of the boosting model, but plotit parameter is set to FALSE, so no plot will be generated
summary(model_ensemble_boosting, plotit = FALSE)
# Merging the variable importance data from the bagging model with the summary data from the boosting model
# Preparing the data for plotting by transforming percentages and converting to long format using melt function
dt_varimp <- merge(data.table(importance(model_ensemble_bagging, type = 2), keep.rownames = TRUE),
data.table(summary(model_ensemble_boosting, plotit = FALSE)),
by.x = "rn", by.y = "var")
dt_varimp <- setNames(dt_varimp, c("Variable", "Bagging", "Boosting"))
dt_varimp[, Bagging := 100*Bagging/sum(Bagging)]
dt_varimp <- melt(dt_varimp, id.vars = "Variable",
measure.vars = c("Bagging","Boosting"),
variable.name = "y",
value.name = "x")
dt_varimp <- dt_varimp[order(y,-x)]
# Plotting the variable importance comparison using ggplot2
ggplot(data = dt_varimp) +
geom_col(aes(y=Variable, x=x, fill=y)) +
labs(
x = "Relative Importance",
y = "Variables") +
scale_fill_manual(values = c("#f1b147","#4CA7DE")) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"),
axis.ticks=element_blank(),
legend.position = "none") +
facet_wrap(~y)
rm(list = c("dt_varimp", "model_bag_result_comparison", "performance_ensemble"))
In analysis of two ensemble methods which are Bagging and Boosting, it is clear that after Sex and Age, Passenger Fare is the most important variable. This supports our assumption that the wealth of passengers may actually influence chances of survival heavily. The structure of our data supports the assumption that Fare does not depend on the family size and the class of the passenger also does not have an impact on survival chances.Comparing the two models shows that both target the same variables.
### Investigating hyperparameters
# Define the values
n_trees <- seq(from = 250, to = 750, by = 250)
shrinking <- seq(from = 0.05, to = 0.2, by = 0.05)
cv_folds <- seq(from = 2, to = 8, by = 3)
# Create an empty data frame to store the results
results <- data.frame()
# Iterate over the combinations
for (i in 1:length(n_trees)) {
for (j in 1:length(shrinking)) {
for (k in 1:length(cv_folds)) {
# Build a model with the specified parameters
model <- gbm(formula = formula,
data=data,
distribution="gaussian",
interaction.depth=1,
n.trees=n_trees[i],
shrinkage=shrinking[j],
cv.folds = cv_folds[k])
# Calculate the accuracy of the model predictions
acc <- gbm.perf(model, method = "cv", plot.it = FALSE)
# Append the accuracy and parameter values to the results data frame
results <- rbind(results, c(acc, n_trees[i], shrinking[j], cv_folds[k]))
}
}
}
# Set column names for the results data frame
colnames(results) <- c("accuracy", "n_trees", "shrinking", "cv_folds")
# Create a 3D scatter plot using scatter3D
plot3D::scatter3D(x = results$n_trees, y = results$shrinking, z = results$cv_folds,
theta = 30, phi = 10,
xlab = "Number of trees", ylab = "Shrinkage", zlab = "CV folds",
ticktype = "detailed",
pch = 19,
cex.axis = 0.75, cex.lab = 0.75, expand = 0.75,
colvar = results$accuracy, col = viridis(100))
# Select the optimal parameter combination with the highest accuracy
opt_params <- results[which.max(results$accuracy),]
print(opt_params)
# Build a model using the optimal parameters
model_opt <- gbm(formula = formula,
data=data,
distribution="gaussian",
interaction.depth=1,
n.trees = opt_params$n_trees,
shrinkage = opt_params$shrinking,
cv.folds = opt_params$cv_folds)
# Calculate the mean accuracy of the model predictions
mean(predict(model_opt, data) == data$Survived)
Using 95 trees...
[1] 0
# Clean up
rm(list = c("acc", "cv_folds", "i", "j", "k", "n_trees", "shrinking", "results", "opt_params", "model_opt"))
As the performance of the boosting model seems to vary, we did not change our hyperparameters from our default options. Though since bagging does lead to the best results, we are confident in our hyperparameter choice.
Covariance matrix describe the data to finding the best direction where the data most spread. So let us find the covariance matrix of the data: center the data points and calculate the variances and covariances Next, find the eigenvectors and the corresponding eigenvalues of the covariance matrix. Sort the eigenvectors v by the absolute magnitude of their eigenvalues: these are our principal components (PCs) and the basis vectors of our new coordinate system.
We are back to dropping dimensions but this time we drop principal components instead of variables. We hope to greatly reduce the PCs while capturing most varioation along the first PCs.
We may want to capture a minimum amount of variance of the data in total or keep all the principle components that –individually– explain a minimum amount of variance. Similar to k-means, we can produce a scree plot of our principal components in descending order of their eigenvalues.
What is more is instead of dropping PCs we may be interested in exploring and possibly interpreting them. Certain real estate variables may drive the PC1 coordinate (amenities) while others drive the PC2 coordinate (location). We have already seen that principle components no longer represent single variables as our original basis vectors did. Each PC coordinate is a linear combination of the original variables. The eigen vectors v determine how single variables of x contribute to each PC coordinate via v^Tx.
# We exclude thirdClass because it is our baseline class.
dt_pca <- data.table(scale(data[, c("Age", "Fare", "Sex", "Parch", "Sibsp", "Queenstown", "Cherbourg", "FirstClass", "SecondClass")]))
# Will always use a library which decomposes the covariance matrix for us and supplies other fundamental outputs.
model_pca <- prcomp(dt_pca)
summary(model_pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Standard deviation 1.4475 1.3223 1.0565 0.9919 0.93994 0.84767 0.78654 0.7385
Proportion of Variance 0.2328 0.1943 0.1240 0.1093 0.09817 0.07984 0.06874 0.0606
Cumulative Proportion 0.2328 0.4271 0.5511 0.6604 0.75861 0.83844 0.90718 0.9678
PC9
Standard deviation 0.53853
Proportion of Variance 0.03222
Cumulative Proportion 1.00000
It is clear that the model enables us to dive into our 9 principle componets. In the summary, we also find the share that each PC contributes to explaing the variance of our data in descending order of the eigenvalues.
fviz_eig(model_pca, addlabels = TRUE) +
theme(plot.title = element_text(hjust = 0.5))
Looking at the Scree plot, it is evidence that we explain ~66% of the variance with first 4 PCs.
To investigate how our original variables contribute to the PCs and
to possibly interpret these contributions, we can look at the
eigenvectors v. They essentially conribute the coefficients to the
linear combination of our variables that produce the PCs. We find the
eigenvectors at model_pca$rotation
.
Here, $rotation
gives the eigenvectors scaled to a
specific length which are also called loadings: eigenvectors ∗
√eigenvalues
The above eigenvectors take each data point to the PC coordinate system by multiplying vector x. V: xV.
Let’s illustrate Score Plot. It plotos our passengers in only the first two dimensions (PC1 and PC2) which capture 55% of variance.
Finally, let us illustrate our principle components in a loadings plot. Basically, the direction of an arrow shows which PCs a variable contributes to and the angles between arrows indicate the difference in these directions.
# Matrix of (scaled) eigenvectors: V
round(model_pca$rotation, 2)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
Age 0.20 -0.47 0.11 -0.34 0.47 -0.01 -0.37 0.42 0.28
Fare 0.55 0.12 0.08 -0.06 0.16 -0.26 0.34 -0.43 0.53
Sex -0.18 -0.33 -0.37 0.50 0.34 -0.50 -0.16 -0.26 -0.09
Parch 0.12 0.55 0.08 -0.04 0.31 0.10 -0.67 -0.32 -0.15
Sibsp 0.05 0.56 -0.18 0.16 0.26 -0.31 0.19 0.65 0.03
Queenstown -0.09 0.07 -0.58 -0.70 -0.18 -0.33 -0.05 -0.10 -0.07
Cherbourg 0.40 -0.06 0.11 0.20 -0.64 -0.39 -0.43 0.20 0.04
FirstClass 0.60 -0.16 -0.06 -0.05 0.16 0.04 0.21 0.02 -0.74
SecondClass -0.29 0.03 0.67 -0.25 0.09 -0.56 0.11 -0.07 -0.26
fviz_pca_ind(model_pca,
geom.ind = "point",
col.ind = as.factor(data$Survived),
palette = c("#f1b147", "#4CA7DE")) +
ggtitle("Score Plot") +
theme(plot.title = element_text(hjust = 0.5))
fviz_pca_var(model_pca,
col.var="contrib",
repel = TRUE) +
theme_minimal()
We find here that Fare and First Class seem to indicate strongly in one direction possibly signifying wealth. Secondly, Sibsp and Parch indiacte closely together pointing at an overarching concept of family size.
Our matrix of (scaled) eigenvectors V above transformed our original 9 coordinates (variables) into 9 PC coordinates.
# Insert PCA into our data for continued use
pca <- as.data.frame(as.matrix(dt_pca) %*%
as.matrix(model_pca$rotation))
data <- cbind(data, pca)
# Define a function to perform PCA transformations on the test data
apply_pca <- function(data, model) {
dt_pca <- data.table(scale(data[, c("Age", "Fare", "Sex", "Parch", "Sibsp", "Queenstown", "Cherbourg", "FirstClass", "SecondClass")]))
pca <- as.data.frame(as.matrix(dt_pca) %*%
as.matrix(model$rotation))
return(cbind(data, pca))
}
To reconstruct the original variables we can multiply the PC coordinates by VT:
# Passenger 1: Original variables
# x
round(as.matrix(dt_pca)[1,],2)
Age Fare Sex Parch Sibsp Queenstown Cherbourg
-0.53 -0.52 0.76 -0.51 0.52 -0.20 -0.47
FirstClass SecondClass
-0.59 -0.57
# Passenger 1: PC coordinates
# x * V
round(as.matrix(dt_pca)[1,] %*%
as.matrix(model_pca$rotation), 2)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
[1,] -0.92 0.04 -0.8 0.91 0.1 0.09 0.36 0.26 0.19
# Passenger 1: Reconstructed original variables
# x * V * VT
round(as.matrix(dt_pca)[1,] %*%
as.matrix(model_pca$rotation) %*%
t(as.matrix(model_pca$rotation)), 2)
Age Fare Sex Parch Sibsp Queenstown Cherbourg FirstClass SecondClass
[1,] -0.53 -0.52 0.76 -0.51 0.52 -0.2 -0.47 -0.59 -0.57
However, not much is gained in terms of dimensionality reduction if we simply keep 7 PC coordinates instead of 10 original variables. We have seen at the very beginning that the first 4 PCs capture >65% of the variance of our data, so let us reduce the principle components to 4. We will call our eigenvector matrix V with only the first 4 columns Vreduced If we projected our passenger 1 onto only 4 PCs, we had only 4 values to store: xVreduced
# x * V_reduced
as.matrix(dt_pca)[1,] %*%
as.matrix(model_pca$rotation[,1:4])
PC1 PC2 PC3 PC4
[1,] -0.9219128 0.04287835 -0.7984102 0.9140071
# x * V_reduced * V_reducedT
as.matrix(dt_pca)[1,] %*%
as.matrix(model_pca$rotation[,1:4]) %*%
t(as.matrix(model_pca$rotation[,1:4]))
Age Fare Sex Parch Sibsp Queenstown Cherbourg
[1,] -0.6042263 -0.6158179 0.9094717 -0.1944706 0.2761572 -0.09504479 -0.2792797
FirstClass SecondClass
[1,] -0.5505641 -0.4989356
If we have Vreduced we can reconstruct any passenger from its 4 PC values via xVreduced VTreduced In other words, we only need the PC coordinates xVreduced and can reconstruct the original data. This reconstruction is approximate and depends on the number of PCs we use. Compare the original variable values above to the reconstruction below.
How many dimensions we can/shall drop depends on our purpose. Let us investigate the total reconstruction error for our one passenger depending on the number of eigenvectors we retain in Vreduced. We loop through the reconstruction of passenger 1 for 1 to 9 PCs:
results <- list()
for (pcs in 1:9){
passenger_reconstructed <- as.matrix(dt_pca)[1,] %*%
as.matrix(model_pca$rotation[,1:pcs]) %*%
t(as.matrix(model_pca$rotation[,1:pcs]))
results[[pcs]] <- data.table("PCs used" = pcs,
"MAE"= sum(abs(as.matrix(dt_pca)[1,]-passenger_reconstructed)),
passenger_reconstructed)
}
dt_reconstruction <- rbindlist(results)
round(dt_reconstruction,2)
# Clean up
rm(list = c("dt_pca", "dt_reconstruction", "pca", "passenger_reconstructed", "pcs", "results"))
# Define the logistic regression using PCAs
model_pca_glm <- glm(formula = Survived ~ PC1 + PC2 + PC3 + PC4,
data = data, family = binomial(link = "logit"))
# Print summary statistic
print(summary(model_pca_glm))
Call:
glm(formula = Survived ~ PC1 + PC2 + PC3 + PC4, family = binomial(link = "logit"),
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5196 -0.8359 -0.4931 0.8222 2.1363
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.45619 0.09078 -5.025 5.02e-07 ***
PC1 0.61801 0.07207 8.575 < 2e-16 ***
PC2 0.33371 0.06741 4.950 7.40e-07 ***
PC3 0.70019 0.08884 7.881 3.23e-15 ***
PC4 -0.56679 0.09276 -6.110 9.94e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 960.90 on 711 degrees of freedom
Residual deviance: 742.15 on 707 degrees of freedom
AIC: 752.15
Number of Fisher Scoring iterations: 4
# Fit of the model
with(model_pca_glm, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
[1] 3.469072e-46
# A p-value below 0.001 indicates that the model performs better than an empty model
As all inputs seem to be statistically significant at the 0.1% level.
### Base tree
# Build a decision tree model using rpart
model_tree <- rpart(data = data, formula = Survived ~ PC1 + PC2 + PC3 + PC4, method = "class")
# Plot the decision tree model using prp, with extra details (extra = 2)
prp(model_tree, extra = 2, main = "Base tree for model")
# Print an plot data on tree
printcp(model_tree)
Classification tree:
rpart(formula = Survived ~ PC1 + PC2 + PC3 + PC4, data = data,
method = "class")
Variables actually used in tree construction:
[1] PC1 PC2 PC3 PC4
Root node error: 288/712 = 0.40449
n= 712
CP nsplit rel error xerror xstd
1 0.215278 0 1.00000 1.00000 0.045472
2 0.142361 1 0.78472 0.81944 0.043614
3 0.041667 2 0.64236 0.69792 0.041704
4 0.027778 4 0.55903 0.65625 0.040912
5 0.017361 5 0.53125 0.62153 0.040194
6 0.015046 6 0.51389 0.62500 0.040268
7 0.013889 9 0.46875 0.63542 0.040487
8 0.010417 10 0.45486 0.64931 0.040773
9 0.010000 16 0.38889 0.62153 0.040194
plotcp(model_tree)
# The tree shows that the first deciding factor seems to be gender. Afterwards mainly class and age show seem prominent deciding factors for branches though interestingly enough fare seems to play an important role as well.
# Determine the minimum complexity parameter (cp) for pruning
#cp parameter: captures model complexity
min_cp <- model_tree$cptable[which.min(model_tree$cptable[, "xerror"]), "CP"]
# Prune the decision tree model using the minimum cp
model_tree_2 <- prune(model_tree, cp = min_cp)
# Plot the pruned decision tree model using prp, with extra details (extra = 2)
prp(model_tree_2, extra = 2, main = "Pruned base tree with less complexity")
# The branch determined by being male and over the age of 6 shows the biggest group and the most wrong guesses of the model.
# Calculate the mean accuracy of the model predictions
mean(predict(model_tree, data, type = "class")==data$Survived)
[1] 0.8426966
mean(predict(model_tree_2, data, type = "class")==data$Survived)
[1] 0.7851124
# Between the accuracy for the pruned and not-pruned model there is only a 2% difference so we will choose the pruned model as representative as it is less complex
model_pca_trees <- model_tree
# Print an plot data on tree
printcp(model_tree_2)
Classification tree:
rpart(formula = Survived ~ PC1 + PC2 + PC3 + PC4, data = data,
method = "class")
Variables actually used in tree construction:
[1] PC1 PC3 PC4
Root node error: 288/712 = 0.40449
n= 712
CP nsplit rel error xerror xstd
1 0.215278 0 1.00000 1.00000 0.045472
2 0.142361 1 0.78472 0.81944 0.043614
3 0.041667 2 0.64236 0.69792 0.041704
4 0.027778 4 0.55903 0.65625 0.040912
5 0.017361 5 0.53125 0.62153 0.040194
plotcp(model_tree_2)
rm(list = c("model_tree", "model_tree_2"))
As we find a notable disparity in accuracy between the base tree and the pruned tree favoring the un-pruned base tree in the case of using PCA, we choose the un-pruned model for further application.
set.seed(12345) # Set a seed for reproducibility
# Data setup
X <- as.matrix(data[, c("PC1", "PC2", "PC3", "PC4")])
y <- data$Survived
## Model as in lecture
normalize_pca <- layer_normalization() # Create a normalization layer
normalize_pca %>% adapt(X) # Adapt the normalization layer to the data
print(normalize_pca$mean) # Print the mean values used for normalization
tf.Tensor([[ 2.3283064e-09 -1.0186341e-08 5.1222742e-09 2.7939677e-09]], shape=(1, 4), dtype=float32)
model_pca_neural <- keras_model_sequential() # Create a sequential model
model_pca_neural %>%
normalize_pca() %>% # Apply normalization
layer_dense(name = "HiddenLayer_1", units = 16, activation = 'relu') %>% # Add a hidden layer with 16 units and ReLU activation
layer_dense(name = "HiddenLayer_2", units = 16, activation = 'relu') %>% # Add another hidden layer with 16 units and ReLU activation
layer_dense(name = "HiddenLayer_3", units = 16, activation = 'relu') %>% # Add a third hidden layer with 16 units and ReLU activation
layer_dense(name = "OutputLayer", units = 1) # Add an output layer with 1 unit
summary(model_pca_neural) # Print the summary of the model
Model: "sequential_119"
____________________________________________________________________________________
Layer (type) Output Shape Param # Trainable
====================================================================================
normalization_119 (Normalizati (None, 4) 9 Y
on)
HiddenLayer_1 (Dense) (None, 16) 80 Y
HiddenLayer_2 (Dense) (None, 16) 272 Y
HiddenLayer_3 (Dense) (None, 16) 272 Y
OutputLayer (Dense) (None, 1) 17 Y
====================================================================================
Total params: 650
Trainable params: 641
Non-trainable params: 9
____________________________________________________________________________________
#plot_model(model_base)
model_pca_neural %>% compile(
loss = "MSE",
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = "Accuracy"
) # Compile the model with mean squared error loss, Adam optimizer, and accuracy as a metric
model_pca_neural %>% fit(
x = X,
y = y,
epochs = 500,
verbose = 0
) # Train the model on the data for 500 epochs
#model_pca_neural$layers[[2]]$kernel # Print the kernel (weights) of the second layer
#model_pca_neural$layers[[3]]$kernel # Print the kernel (weights) of the third layer
#model_pca_neural$layers[[4]]$kernel # Print the kernel (weights) of the fourth layer
model_pca_neural %>% evaluate(X, y, verbose = 0) # Evaluate the model on the data
loss Accuracy
0.09847966 0.86516851
rm(list = c("X", "y"))
formula_pca <- Survived ~ PC1 + PC2 + PC3 + PC4
# Building a random forest model using the randomForest function
# mtry parameter is set to 9, which controls feature subsampling
# ntree parameter is set to 500, specifying the number of trees in the random forest
model_pca_ensemble_bagging <- randomForest(formula_pca,
data=data,
mtry=9,
ntree=500)
Warning: The response has five or fewer unique values. Are you sure you want to do regression?Warning: invalid mtry: reset to within valid range
# Creating a variable importance plot for the random forest model
varImpPlot(model_pca_ensemble_bagging, main="Variable Importance")
# Creating a data.table to compare actual survival values with predicted values from the bagging model
model_bag_result_comparison <- data.table(actual = data$Survived,
prediction = predict(model_pca_ensemble_bagging, newdata = data))
# Building a gradient boosting model using the gbm function
# distribution is set to "gaussian" for regression, but could be modified for classification
# interaction.depth sets the maximum depth of the individual trees
# n.trees is set to 500, indicating the number of boosting iterations
# shrinkage controls the step size when updating the weights of each tree
# cv.folds is set to 5, specifying the number of cross-validation folds for model evaluation
model_pca_ensemble_boosting <- gbm(formula = formula_pca,
data=data,
distribution="gaussian",
interaction.depth=1,
n.trees=500,
shrinkage=0.1,
cv.folds = 5)
# Calculating the performance of the boosting model using cross-validation
performance_ensemble <- gbm.perf(model_pca_ensemble_boosting, method = "cv")
# Generating a summary of the boosting model, but plotit parameter is set to FALSE, so no plot will be generated
summary(model_pca_ensemble_boosting, plotit = FALSE)
# Merging the variable importance data from the bagging model with the summary data from the boosting model
# Preparing the data for plotting by transforming percentages and converting to long format using melt function
dt_varimp <- merge(data.table(importance(model_pca_ensemble_bagging, type = 2), keep.rownames = TRUE),
data.table(summary(model_pca_ensemble_boosting, plotit = FALSE)),
by.x = "rn", by.y = "var")
dt_varimp <- setNames(dt_varimp, c("Variable", "Bagging", "Boosting"))
dt_varimp[, Bagging := 100*Bagging/sum(Bagging)]
dt_varimp <- melt(dt_varimp, id.vars = "Variable",
measure.vars = c("Bagging","Boosting"),
variable.name = "y",
value.name = "x")
dt_varimp <- dt_varimp[order(y,-x)]
# Plotting the variable importance comparison using ggplot2
ggplot(data = dt_varimp) +
geom_col(aes(y=Variable, x=x, fill=y)) +
labs(
x = "Relative Importance",
y = "Variables") +
scale_fill_manual(values = c("#f1b147","#4CA7DE")) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"),
axis.ticks=element_blank(),
legend.position = "none") +
facet_wrap(~y)
rm(list = c("dt_varimp", "model_bag_result_comparison", "performance_ensemble"))
# Define a logistic model using all PCAs
model_pca_full_glm <- glm(formula = Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
data = data, family = binomial(link = "logit"))
# Print summary statistic
print(summary(model_pca_full_glm))
Call:
glm(formula = Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 +
PC7 + PC8, family = binomial(link = "logit"), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6786 -0.6521 -0.4426 0.6552 2.4048
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.43758 0.09871 -4.433 9.30e-06 ***
PC1 0.75416 0.08797 8.572 < 2e-16 ***
PC2 0.31435 0.07661 4.103 4.07e-05 ***
PC3 0.79900 0.10184 7.846 4.30e-15 ***
PC4 -0.55678 0.10008 -5.563 2.65e-08 ***
PC5 -0.65080 0.10552 -6.168 6.93e-10 ***
PC6 0.35515 0.11641 3.051 0.00228 **
PC7 0.72030 0.13223 5.447 5.11e-08 ***
PC8 -0.28703 0.14355 -1.999 0.04556 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 960.90 on 711 degrees of freedom
Residual deviance: 652.47 on 703 degrees of freedom
AIC: 670.47
Number of Fisher Scoring iterations: 5
# Fit of the model
with(model_pca_full_glm, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
[1] 6.593554e-62
# A p-value below 0.001 indicates that the model performs better than an empty model
As all inputs seem to be statistically significant at the 0.1% level.
### Base tree
# Build a decision tree model using rpart
model_tree <- rpart(data = data, formula = Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8, method = "class")
# Plot the decision tree model using prp, with extra details (extra = 2)
prp(model_tree, extra = 2, main = "Base tree for model")
# Print an plot data on tree
printcp(model_tree)
Classification tree:
rpart(formula = Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 +
PC7 + PC8, data = data, method = "class")
Variables actually used in tree construction:
[1] PC1 PC2 PC3 PC4 PC5 PC6 PC7
Root node error: 288/712 = 0.40449
n= 712
CP nsplit rel error xerror xstd
1 0.215278 0 1.00000 1.00000 0.045472
2 0.034722 2 0.56944 0.67708 0.041317
3 0.027778 3 0.53472 0.60764 0.039891
4 0.010417 4 0.50694 0.57639 0.039176
5 0.010000 10 0.43750 0.56944 0.039010
plotcp(model_tree)
# The tree shows that the first deciding factor seems to be gender. Afterwards mainly class and age show seem prominent deciding factors for branches though interestingly enough fare seems to play an important role as well.
# Determine the minimum complexity parameter (cp) for pruning
#cp parameter: captures model complexity
min_cp <- model_tree$cptable[which.min(model_tree$cptable[, "xerror"]), "CP"]
# Prune the decision tree model using the minimum cp
model_tree_2 <- prune(model_tree, cp = min_cp)
# Plot the pruned decision tree model using prp, with extra details (extra = 2)
prp(model_tree_2, extra = 2, main = "Pruned base tree with less complexity")
# The branch determined by being male and over the age of 6 shows the biggest group and the most wrong guesses of the model.
# Calculate the mean accuracy of the model predictions
mean(predict(model_tree, data, type = "class")==data$Survived)
[1] 0.8230337
mean(predict(model_tree_2, data, type = "class")==data$Survived)
[1] 0.8230337
# Between the accuracy for the pruned and not-pruned model there is only a 2% difference so we will choose the pruned model as representative as it is less complex
model_pca_full_trees <- model_tree_2
# Print an plot data on tree
printcp(model_tree_2)
Classification tree:
rpart(formula = Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 +
PC7 + PC8, data = data, method = "class")
Variables actually used in tree construction:
[1] PC1 PC2 PC3 PC4 PC5 PC6 PC7
Root node error: 288/712 = 0.40449
n= 712
CP nsplit rel error xerror xstd
1 0.215278 0 1.00000 1.00000 0.045472
2 0.034722 2 0.56944 0.67708 0.041317
3 0.027778 3 0.53472 0.60764 0.039891
4 0.010417 4 0.50694 0.57639 0.039176
5 0.010000 10 0.43750 0.56944 0.039010
plotcp(model_tree_2)
rm(list = c("model_tree", "model_tree_2"))
In this case the pruned and un-pruned model are identical which makes sense given the nature of PCA.
set.seed(12345) # Set a seed for reproducibility
# Data setup
X <- as.matrix(data[, c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8")])
y <- data$Survived
## Model as in lecture
normalize_pca_full <- layer_normalization() # Create a normalization layer
normalize_pca_full %>% adapt(X) # Adapt the normalization layer to the data
print(normalize_pca_full$mean) # Print the mean values used for normalization
tf.Tensor(
[[ 1.3969839e-09 -2.3283064e-09 8.8475645e-09 0.0000000e+00
-1.3969839e-09 2.6775524e-09 -1.3969839e-09 -3.2596290e-09]], shape=(1, 8), dtype=float32)
model_pca_full_neural <- keras_model_sequential() # Create a sequential model
model_pca_full_neural %>%
normalize_pca_full() %>% # Apply normalization
layer_dense(name = "HiddenLayer_1", units = 16, activation = 'relu') %>% # Add a hidden layer with 16 units and ReLU activation
layer_dense(name = "HiddenLayer_2", units = 16, activation = 'relu') %>% # Add another hidden layer with 16 units and ReLU activation
layer_dense(name = "HiddenLayer_3", units = 16, activation = 'relu') %>% # Add a third hidden layer with 16 units and ReLU activation
layer_dense(name = "OutputLayer", units = 1) # Add an output layer with 1 unit
summary(model_pca_full_neural) # Print the summary of the model
Model: "sequential_120"
____________________________________________________________________________________
Layer (type) Output Shape Param # Trainable
====================================================================================
normalization_120 (Normalizati (None, 8) 17 Y
on)
HiddenLayer_1 (Dense) (None, 16) 144 Y
HiddenLayer_2 (Dense) (None, 16) 272 Y
HiddenLayer_3 (Dense) (None, 16) 272 Y
OutputLayer (Dense) (None, 1) 17 Y
====================================================================================
Total params: 722
Trainable params: 705
Non-trainable params: 17
____________________________________________________________________________________
#plot_model(model_base)
model_pca_full_neural %>% compile(
loss = "MSE",
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = "Accuracy"
) # Compile the model with mean squared error loss, Adam optimizer, and accuracy as a metric
model_pca_full_neural %>% fit(
x = X,
y = y,
epochs = 500,
verbose = 0
) # Train the model on the data for 500 epochs
#model_pca_full_neural$layers[[2]]$kernel # Print the kernel (weights) of the second layer
#model_pca_full_neural$layers[[3]]$kernel # Print the kernel (weights) of the third layer
#model_pca_full_neural$layers[[4]]$kernel # Print the kernel (weights) of the fourth layer
model_pca_full_neural %>% evaluate(X, y, verbose = 0) # Evaluate the model on the data
loss Accuracy
0.07700509 0.90028089
rm(list = c("X", "y"))
formula_pca_full <- Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8
# Building a random forest model using the randomForest function
# mtry parameter is set to 9, which controls feature subsampling
# ntree parameter is set to 500, specifying the number of trees in the random forest
model_pca_full_ensemble_bagging <- randomForest(formula_pca,
data=data,
mtry=9,
ntree=500)
Warning: The response has five or fewer unique values. Are you sure you want to do regression?Warning: invalid mtry: reset to within valid range
# Creating a variable importance plot for the random forest model
varImpPlot(model_pca_full_ensemble_bagging, main="Variable Importance")
# Creating a data.table to compare actual survival values with predicted values from the bagging model
model_bag_result_comparison <- data.table(actual = data$Survived,
prediction = predict(model_pca_full_ensemble_bagging, newdata = data))
# Building a gradient boosting model using the gbm function
# distribution is set to "gaussian" for regression, but could be modified for classification
# interaction.depth sets the maximum depth of the individual trees
# n.trees is set to 500, indicating the number of boosting iterations
# shrinkage controls the step size when updating the weights of each tree
# cv.folds is set to 5, specifying the number of cross-validation folds for model evaluation
model_pca_full_ensemble_boosting <- gbm(formula = formula_pca_full,
data=data,
distribution="gaussian",
interaction.depth=1,
n.trees=500,
shrinkage=0.1,
cv.folds = 5)
# Calculating the performance of the boosting model using cross-validation
performance_ensemble <- gbm.perf(model_pca_full_ensemble_boosting, method = "cv")
# Generating a summary of the boosting model, but plotit parameter is set to FALSE, so no plot will be generated
summary(model_pca_full_ensemble_boosting, plotit = FALSE)
# Merging the variable importance data from the bagging model with the summary data from the boosting model
# Preparing the data for plotting by transforming percentages and converting to long format using melt function
dt_varimp <- merge(data.table(importance(model_pca_full_ensemble_bagging, type = 2), keep.rownames = TRUE),
data.table(summary(model_pca_full_ensemble_boosting, plotit = FALSE)),
by.x = "rn", by.y = "var")
dt_varimp <- setNames(dt_varimp, c("Variable", "Bagging", "Boosting"))
dt_varimp[, Bagging := 100*Bagging/sum(Bagging)]
dt_varimp <- melt(dt_varimp, id.vars = "Variable",
measure.vars = c("Bagging","Boosting"),
variable.name = "y",
value.name = "x")
dt_varimp <- dt_varimp[order(y,-x)]
# Plotting the variable importance comparison using ggplot2
ggplot(data = dt_varimp) +
geom_col(aes(y=Variable, x=x, fill=y)) +
labs(
x = "Relative Importance",
y = "Variables") +
scale_fill_manual(values = c("#f1b147","#4CA7DE")) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"),
axis.ticks=element_blank(),
legend.position = "none") +
facet_wrap(~y)
rm(list = c("dt_varimp", "model_bag_result_comparison", "performance_ensemble"))
Notably again in this case our bagging model performs the same as the bagging model using only the first four PCs. This occurs as the relatively unimportant last four PCs will never be chosen by the trees in this case.
test_data <- read_csv(path1) # Read the CSV file located at the specified path and store the test_data in the 'test_data' variable
Rows: 418 Columns: 11── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Sex, Ticket, Cabin, Embarked
dbl (6): PassengerId, Pclass, Age, SibSp, Parch, Fare
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test_solution <- read_csv(path2)
Rows: 418 Columns: 2── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): PassengerId, Survived
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test_data$Survived <- test_solution$Survived
test_data <- clean_data(test_data)
[1] "The number of missing values in Age variables: 86"
[1] "The number of missing values in Embarked variables: 0"
[1] "The number of missing values in Fare variables: 1"
test_data <- apply_pca(test_data, model_pca)
test_solution <- as.data.frame(test_data$Survived)
colnames(test_solution) <- "Survived"
rm(list = c("path1", "path2"))
Let us perform predictions for different models on the validation data.
# Prediction for logistic regression
test_solution$logistic_prediction <- predict(model_glm,
newdata = test_data)
# Converting the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5
test_solution$logistic_prediction <- ifelse(test_solution$logistic_prediction > 0, 1, 0)
# Prediction for tree regression
test_solution$tree_prediction <- predict(model_trees,
newdata = test_data)[,"1"]
# Converting the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5
test_solution$tree_prediction <- ifelse(test_solution$tree_prediction > 0.5, 1, 0)
# Prediction for neural network
test_solution$neural_prediction <- predict(model_neural,
x = as.matrix(test_data[, c("FirstClass", "SecondClass", "ThirdClass", "Sex", "Age", "Sibsp", "Parch", "Fare", "Southhampton", "Cherbourg", "Queenstown")]), verbose = 0)
# Converting the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5
test_solution$neural_prediction <- ifelse(test_solution$neural_prediction > 0.5, 1, 0)
# Prediction for bagging
test_solution$bagging_prediction <- predict(model_ensemble_bagging,
newdata = test_data)
# Converting the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5
test_solution$bagging_prediction <- ifelse(test_solution$bagging_prediction > 0.5, 1, 0)
# Prediction for boosting
test_solution$boosting_prediction <- predict(model_ensemble_boosting,
newdata = test_data)
Using 107 trees...
# Converting the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5
test_solution$boosting_prediction <- ifelse(test_solution$boosting_prediction > 0.5, 1, 0)
##PCA
# Prediction for logistic regression
test_solution$logistic_pca_prediction <- predict(model_pca_glm,
newdata = test_data)
test_solution$logistic_pca_prediction <- ifelse(test_solution$logistic_pca_prediction > 0, 1, 0)
# Prediction for tree regression
test_solution$tree_pca_prediction <- predict(model_pca_trees,
newdata = test_data)[,"1"]
test_solution$tree_pca_prediction <- ifelse(test_solution$tree_pca_prediction > 0.5, 1, 0)
# Prediction for neural network
test_solution$neural_pca_prediction <- predict(model_pca_neural,
x = as.matrix(test_data[, c("PC1", "PC2", "PC3", "PC4")]), verbose = 0)
test_solution$neural_pca_prediction <- ifelse(test_solution$neural_pca_prediction > 0.5, 1, 0)
# Prediction for bagging
test_solution$bagging_pca_prediction <- predict(model_pca_ensemble_bagging,
newdata = test_data)
test_solution$bagging_pca_prediction <- ifelse(test_solution$bagging_pca_prediction > 0.5, 1, 0)
# Prediction for boosting
test_solution$boosting_pca_prediction <- predict(model_pca_ensemble_boosting,
newdata = test_data)
Using 305 trees...
test_solution$boosting_pca_prediction <- ifelse(test_solution$boosting_pca_prediction > 0.5, 1, 0)
##Full PCA
# Prediction for logistic regression
test_solution$logistic_pca_full_prediction <- predict(model_pca_full_glm,
newdata = test_data)
test_solution$logistic_pca_full_prediction <- ifelse(test_solution$logistic_pca_full_prediction > 0, 1, 0)
# Prediction for tree regression
test_solution$tree_pca_full_prediction <- predict(model_pca_full_trees,
newdata = test_data)[,"1"]
test_solution$tree_pca_full_prediction <- ifelse(test_solution$tree_pca_full_prediction > 0.5, 1, 0)
# Prediction for neural network
test_solution$neural_pca_full_prediction <- predict(model_pca_full_neural,
x = as.matrix(test_data[, c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8")]), verbose = 0)
test_solution$neural_pca_full_prediction <- ifelse(test_solution$neural_pca_full_prediction > 0.5, 1, 0)
# Prediction for bagging
test_solution$bagging_pca_full_prediction <- predict(model_pca_full_ensemble_bagging,
newdata = test_data)
test_solution$bagging_pca_full_prediction <- ifelse(test_solution$bagging_pca_full_prediction > 0.5, 1, 0)
# Prediction for boosting
test_solution$boosting_pca_full_prediction <- predict(model_pca_full_ensemble_boosting,
newdata = test_data)
Using 208 trees...
# Converting the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5
test_solution$boosting_pca_full_prediction <- ifelse(test_solution$boosting_pca_full_prediction > 0.5, 1, 0)
# Function to calculate correct guesses for a column
# This function calculates the overall accuracy of predictions for a given column by comparing it with the true 'Survived' values.
calculate_correct_guesses <- function(column, survived) {
sum(column == survived) / length(column)
}
# Function to calculate correct guesses for survived cases in a column
# This function calculates the accuracy of predictions for the 'Survived' cases (where 'Survived' = 1) in a given column.
calculate_correct_guesses_survived <- function(column, survived) {
sum(column == 1 & survived == 1) / sum(survived == 1)
}
# Function to calculate correct guesses for dead cases in a column
# This function calculates the accuracy of predictions for the 'Dead' cases (where 'Survived' = 0) in a given column.
calculate_correct_guesses_dead <- function(column, survived) {
sum(column == 0 & survived == 0) / sum(survived == 0)
}
# Calculate correct guesses for each column
# The 'sapply()' function applies the 'calculate_correct_guesses()' function to each column of 'test_solution' (excluding the first column, which is 'Survived').
# The result is the overall accuracy of predictions for each column.
correct_guesses <- sapply(test_solution[-1], calculate_correct_guesses, test_solution$Survived)
correct_guesses <- c(1, correct_guesses)
# Calculate correct guesses for survived cases in each column
correct_guesses_survived <- sapply(test_solution[-1], calculate_correct_guesses_survived, test_solution$Survived)
correct_guesses_survived <- c(1, correct_guesses_survived)
# Calculate correct guesses for dead cases in each column
correct_guesses_dead <- sapply(test_solution[-1], calculate_correct_guesses_dead, test_solution$Survived)
correct_guesses_dead <- c(1, correct_guesses_dead)
# Add the three rows to the original dataframe
# Create a new data frame 'solution' with the calculated accuracy values and corresponding column names.
solution <- data.frame("Accuracy" = correct_guesses,
"Accuracy: Survival" = correct_guesses_survived,
"Accuracy: Dead" = correct_guesses_dead)
# Remove the first row (with the value 1) as it represents the overall accuracy, not related to any specific column.
solution <- solution[-1,]
solution
The table above indicates the output of various models. The “Accuracy” column shows the overall accuracy of the model while “Accuracy Survival” indicates the accuracy of the survival rate in the specific model.
Overall the boosting without PCA does best at predicting our test data. It does when considering accuracy on all cases but also when just trying to predict only the survival of the observation. Predictions of death are surprisingly enough better performed by including all PCAs in a logistic regression. This variant also predicts the second best overall closely followed by the basic tree classification. Notably bagging and boosting under PCA perform the worst even though boosting without PCA performed second best. As a comparison we also ran validation on models which include all PCA variables to check if we could achieve more accuracy that way. Unfortunately, this did not yield better predictions except for the logistic regression.
It is important to note that errors may occur because of changing the seed.
#Optional: Merge training and test data for cross-validation
full_data <- rbind(data, test_data)
# Check if data was merged before
if (exists("full_data")) {
cv_data <- full_data
} else {
cv_data <- data
}
# Define a function to split data into K folds
# Assign a fold number to each row randomly using 'ceiling' and 'sample'.
split_K <- function(dt, folds){
dt$id <- ceiling(sample(1:nrow(dt), replace = FALSE, nrow(dt)) /
(nrow(dt) / folds))
return(dt)
}
# Set the number of folds (K) for cross-validation
K <- 10
cv_data <- split_K(cv_data, K)
# Create a data frame 'cross_validation' to store accuracy values for each model and fold
cross_validation <- data.frame(matrix(NA, nrow = 10, ncol = K))
rownames(cross_validation) <- c("Logistic", "Tree Model", "Neural Network",
"Bagging", "Boosting", "PCA Logistic", "PCA Tree Model",
"PCA Neural Network", "PCA Bagging", "PCA Boosting")
for (i in 1:K) {
# Select the current fold as the validation set and the rest as the training set
dt <- cv_data[cv_data$id == i, ]
# Prediction for logistic regression
dt$logistic_prediction <- predict(model_glm,
newdata = dt)
dt$logistic_prediction <- ifelse(dt$logistic_prediction > 0, 1, 0)
cross_validation[1, i] <- calculate_correct_guesses(dt$logistic_prediction, dt$Survived)
# Prediction for tree regression
dt$tree_prediction <- predict(model_trees,
newdata = dt)[,"1"]
dt$tree_prediction <- ifelse(dt$tree_prediction > 0.5, 1, 0)
cross_validation[2, i] <- calculate_correct_guesses(dt$tree_prediction, dt$Survived)
# Prediction for neural network
dt$neural_prediction <- predict(model_neural, x = as.matrix(dt[, c("FirstClass", "SecondClass", "ThirdClass", "Sex", "Age", "Sibsp", "Parch", "Fare", "Southhampton", "Cherbourg", "Queenstown")]), verbose = 0)
dt$neural_prediction <- ifelse(dt$neural_prediction > 0.5, 1, 0)
cross_validation[3, i] <- calculate_correct_guesses(dt$neural_prediction, dt$Survived)
# Prediction for bagging
dt$bagging_prediction <- predict(model_ensemble_bagging,
newdata = dt)
dt$bagging_prediction <- ifelse(dt$bagging_prediction > 0.5, 1, 0)
cross_validation[4, i] <- calculate_correct_guesses(dt$bagging_prediction, dt$Survived)
# Prediction for boosting
dt$boosting_prediction <- predict(model_ensemble_boosting,
newdata = dt)
dt$boosting_prediction <- ifelse(dt$boosting_prediction > 0.5, 1, 0)
cross_validation[5, i] <- calculate_correct_guesses(dt$boosting_prediction, dt$Survived)
##PCA
# Prediction for logistic regression
dt$logistic_pca_prediction <- predict(model_pca_glm,
newdata = dt)
dt$logistic_pca_prediction <- ifelse(dt$logistic_pca_prediction > 0, 1, 0)
cross_validation[6, i] <- calculate_correct_guesses(dt$logistic_pca_prediction, dt$Survived)
# Prediction for tree regression
dt$tree_pca_prediction <- predict(model_pca_trees,
newdata = dt)[,"1"]
dt$tree_pca_prediction <- ifelse(dt$tree_pca_prediction > 0.5, 1, 0)
cross_validation[7, i] <- calculate_correct_guesses(dt$tree_pca_prediction, dt$Survived)
# Prediction for neural network
dt$neural_pca_prediction <- predict(model_pca_neural, x = as.matrix(dt[, c("PC1", "PC2", "PC3", "PC4")]), verbose = 0)
dt$neural_pca_prediction <- ifelse(dt$neural_pca_prediction > 0.5, 1, 0)
cross_validation[8, i] <- calculate_correct_guesses(dt$neural_pca_prediction, dt$Survived)
# Prediction for bagging
dt$bagging_pca_prediction <- predict(model_pca_ensemble_bagging,
newdata = dt)
dt$bagging_pca_prediction <- ifelse(dt$bagging_pca_prediction > 0.5, 1, 0)
cross_validation[9, i] <- calculate_correct_guesses(dt$bagging_pca_prediction, dt$Survived)
# Prediction for boosting
dt$boosting_pca_prediction <- predict(model_pca_ensemble_boosting,
newdata = dt)
dt$boosting_pca_prediction <- ifelse(dt$boosting_pca_prediction > 0.5, 1, 0)
cross_validation[10, i] <- calculate_correct_guesses(dt$boosting_pca_prediction, dt$Survived)
}
Using 107 trees...
Using 305 trees...
Using 107 trees...
Using 305 trees...
Using 107 trees...
Using 305 trees...
Using 107 trees...
Using 305 trees...
Using 107 trees...
Using 305 trees...
Using 107 trees...
Using 305 trees...
Using 107 trees...
Using 305 trees...
Using 107 trees...
Using 305 trees...
Using 107 trees...
Using 305 trees...
Using 107 trees...
Using 305 trees...
cross_validation$Mean_accuracy <- rowMeans(cross_validation)
cross_validation$Expected_Loss <- (1 - cross_validation$Mean_accuracy)
cross_validation[, c(11, 12)]
The table indicates Mean_accuracy
and
Expected_Loss
for each model. Mean_accuracy
is
good example of representing the average accuracy of the model across
all the 10 folds in the cross-validation process. In other words, it
shows how well the model performs on average in predicting the survival
rate. On the other hand, Expected_Loss
indicates the
proportion of misclassified instances. In short, the lower the expected
loss, the better the performance of the model. It is clear that, after
cross-validation we find that bagging without PCA seems to perform best,
closely followed by bagging with PCA.