Halle (Saale), July 2023.





# Add paths to the different data sets

# path for training data set
path <- "/Users/clemensscholle/Desktop/Master/Statistical_Machine_Learning/Project/titanic/train.csv"
# path for test data set
path1 <- "/Users/clemensscholle/Desktop/Master/Statistical_Machine_Learning/Project/titanic/test.csv"

# path for the corresponding dependent variables for the test data set 
path2 <- "/Users/clemensscholle/Desktop/Master/Statistical_Machine_Learning/Project/titanic/result.csv"
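
# The exact setup chunk is not shown here; a minimal set of library calls
# matching the functions used below would be (our assumption):
library(readr)         # read_csv
library(data.table)    # data.table, melt
library(ggplot2)       # plots
library(plot3D)        # scatter3D
library(NbClust)       # cluster-number indices
library(rpart)         # decision trees
library(rpart.plot)    # prp
library(parttree)      # geom_parttree
library(patchwork)     # combining ggplots with +
library(keras)         # neural networks
library(randomForest)  # bagging / random forests
library(gbm)           # boosting
library(viridis)       # viridis colour scale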



Data Dictionary

 ## Variable              ## Definition               ## Key

 # Survived               # Survival                  # 0 = No, 1 = Yes
 # Pclass                 # Ticket class              # 1, 2, 3 (e.g. 1 for first class)
 # Sex                    # Sex                       # 0 = Female, 1 = Male
 # Age                    # Age in years
 # SibSp                  # [#] of siblings / spouses aboard the Titanic (brother, sister... & husband, wife)
 # Parch                  # [#] of parents / children aboard the Titanic (mother, father, daughter, son...)
 # Fare                   # Passenger fare
 # Embarked               # Port of Embarkation       # C = Cherbourg, Q = Queenstown, S = Southampton

It is important to acknowledge that the PassengerId variable was not useful for our analysis; we therefore dropped this column. The Survived variable is a binary indicator showing whether a person survived (1) or died (0). For the graphical illustrations we chose the colour "deepskyblue2" for passengers who survived and "black" otherwise.

The Fare variable indicates the fare the passenger paid. Being able to pay a higher price may be a good indicator of a passenger's wealth.

There were three different classes on the ship, indicated as 1, 2 and 3. We created a separate column for each class in order to assign a binary variable to each, so that we can analyse them individually; for example, FirstClass equals 1 if the passenger travelled in first class and 0 otherwise. The Pclass variable is clearly essential for our analysis: it may be a good indicator of the wealth of the passengers who were able to afford first class, and it is therefore highly related to the Fare variable. Reasonably, first-class passengers had more luxuries available to them; most of the first class was located in the middle hull and in the upper structures, with two staircases connecting the different decks. The second class was distributed along the height in the middle of the ship with only one staircase connecting the different decks; its amenities would match those of the first class on other ships of the time, but the sole staircase may have impeded evacuation after the collision. The third class was located in the deeper parts of the ship as well as in parts directly at the bow and the stern, and the location of these cabins would have had obvious disadvantages after the collision.

In addition, we assigned a binary variable for the gender of the observed passengers: Sex equals 1 if the passenger is a man and 0 otherwise. It is important to note that women were given priority during the evacuation, whilst men might have had higher chances of survival due to their physique. Children were prioritised during the evacuation, but they would also be less likely to survive due to their lesser strength compared to adults; older people may exhibit the same impediment to a higher degree.

The Age variable indicates the age of the passenger. Unfortunately, it was not available for all observations in this data set, so we decided to drop the missing cases in the cleaning process.

The variables SibSp and Parch describe family bonds. The SibSp variable shows how many siblings or spouses were aboard the ship with the person. We assumed that higher numbers may mean that more people assisted the person during the evacuation, or that the person may have spent valuable time during the evacuation looking for relatives. The Parch variable shows how many parents or children were aboard the ship with the person. For young children this variable may increase survival, as parents would try to save their children. For parents, we assumed it may lower the chances of survival if they were looking for their children or gave up seats on a lifeboat for them, or it may improve their chances if they gained a seat on a lifeboat together with their children.

Finally, the Embarked variable indicates the port of embarkation of the passenger. There were three ports of embarkation: Southampton (S), Cherbourg (C) and Queenstown (Q). Similarly, we created a separate column for each port in order to assign a binary variable to each of them. Furthermore, we assumed that embarking earlier gave a passenger more time and opportunity to explore the ship, learn their surroundings and get used to sea travel and possible sea sickness. This factor may be compounded by earlier passengers having a less populated ship to explore.

# Importing the dataset and preparing the data for analysis

clean_data <- function(raw_data) {
  # Find NA values in Age and Embarked
  print(paste0("The number of missing values in Age variables: ", sum(is.na(raw_data$Age))))
  print(paste0("The number of missing values in Embarked variables: ", sum(is.na(raw_data$Embarked))))
  print(paste0("The number of missing values in Fare variables: ", sum(is.na(raw_data$Fare))))
  
  
  # Delete rows with NA values
  raw_data <- raw_data[complete.cases(raw_data$Age), ]
  raw_data <- raw_data[complete.cases(raw_data$Embarked), ]  
  raw_data <- raw_data[complete.cases(raw_data$Fare), ]  
  
  # Converting the Sex variable to a binary indicator (1 = male, 0 = female).
  raw_data$Sex <- ifelse(raw_data$Sex == "male", 1, 0)
  
  # Creating dummy variables for Pclass and Embarked (so that we can analyse them in a later process)
  # For each port of embarkation (Southampton, Cherbourg, Queenstown) we create a new column with binary values. For example, Southhampton is 1 if "Embarked" is "S", 0 otherwise
  raw_data$Southhampton <- ifelse(raw_data$Embarked == "S", 1, 0)
  raw_data$Cherbourg <- ifelse(raw_data$Embarked == "C", 1, 0)
  raw_data$Queenstown <- ifelse(raw_data$Embarked == "Q", 1, 0)
  
  # Let us do the same process for the Pclass variable:
  raw_data$FirstClass <- ifelse(raw_data$Pclass == "1", 1, 0)
  raw_data$SecondClass <- ifelse(raw_data$Pclass == "2", 1, 0)
  raw_data$ThirdClass <- ifelse(raw_data$Pclass == "3", 1, 0) 
  
  # Creating new data frame for cleaned data
  clean_data <- data.frame(Name = raw_data$Name,
                           Sex = raw_data$Sex,
                           Age = raw_data$Age,
                           Sibsp = raw_data$SibSp,
                           Parch = raw_data$Parch,
                           FirstClass = raw_data$FirstClass,
                           SecondClass = raw_data$SecondClass,
                           ThirdClass = raw_data$ThirdClass,
                           Southhampton = raw_data$Southhampton,
                           Cherbourg = raw_data$Cherbourg,
                           Queenstown = raw_data$Queenstown,
                           Fare = raw_data$Fare,
                           Survived = raw_data$Survived,
                           Embarked = raw_data$Embarked,
                           Pclass = raw_data$Pclass)
  
  return(clean_data)
  
}

raw_data <- read_csv(path)
Rows: 891 Columns: 12
── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Sex, Ticket, Cabin, Embarked
dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data <- clean_data(raw_data)
[1] "The number of missing values in Age variables: 177"
[1] "The number of missing values in Embarked variables: 2"
[1] "The number of missing values in Fare variables: 0"
data

# Clean up
rm(list = c("raw_data"))


Basic calculations of the dataset

# Let's illustrate the basic calculations
total_passenger <- length(data$Sex)
total_female <- sum(data$Sex == 0)
total_male <- sum(data$Sex == 1)
total_survival <- sum(data$Survived == 1)
total_female_survival <- sum(data$Sex == 0 & data$Survived == 1)
total_male_survival <- sum(data$Sex == 1 & data$Survived == 1)

## Class calculations ##
total_pass_1_class <- sum(data$Pclass == 1)
total_pass_1_class_survived <- sum(data$Pclass == 1 & data$Survived == 1)
total_pass_2_class <- sum(data$Pclass == 2)
total_pass_2_class_survived <- sum(data$Pclass == 2 & data$Survived == 1)
total_pass_3_class <- sum(data$Pclass == 3)
total_pass_3_class_survived <- sum(data$Pclass == 3 & data$Survived == 1)
## Class calculation ends ##
ggplot(data, aes(x = factor(Survived), fill = factor(Sex))) +
  geom_bar() +
  labs(
    title = "Number of female and male passengers survived",
    caption = "Sex: 0 = Female, 1 = Male; \n Survived: 0 = No, 1 = Yes",
    x = "Number of passengers survived",
    y = "Count",
    fill = "Sex") +
  scale_fill_manual(values = c("0" = "orchid3", "1" = "skyblue3")) +
  theme(axis.line = element_line(color = "#000000"))


## General information ##
section1_gen_info_1 <- paste0("The total number of passengers: ", total_passenger)
section1_gen_info_2 <- paste0("The total number of female passengers: ", total_female)
section1_gen_info_3 <- paste0("The total number of male passengers: ", total_male)
section1_gen_info_4 <- paste0("The total number of passengers who survived: ", total_survival)
section1_gen_info_5 <- paste0("The total number of female passengers who survived: ", total_female_survival)
section1_gen_info_6 <- paste0("The total number of male passengers who survived: ", total_male_survival)
## General information ends ##

for (i in 1:6) {
  variable_name <- paste0("section1_gen_info_", i)
  variable_value <- get(variable_name)
  print(variable_value)
}
[1] "The total number of passengers: 712"
[1] "The total number of female passengers: 259"
[1] "The total number of male passengers: 453"
[1] "The total number of passengers who survived: 288"
[1] "The total number of female passengers who survived: 195"
[1] "The total number of male passengers who survived: 93"
# Plotting the age distribution through a histogram. We also want to illustrate the density estimate.
ggplot(data = data, aes(x = Age)) +
  geom_histogram(aes(y = after_stat(density))) +
  geom_density(fill="#f1b147", color="#f1b147", alpha=0.11) +
  labs(
    title = "Age distribution",
    x = "Age",
    y = "Density") +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"))


# Illustrating age against fare, with survival as colour and number of parents/children (Parch) as point size.
ggplot() +
  geom_point(data = data, aes(x = Age, y = Fare, color = factor(Survived), size = Parch)) +
  scale_color_manual(values = c("black", "deepskyblue3")) +
  labs(
    title = "Age & Fare & Familial relations & Survival",
    x = "Age",
    y = "Fare",
    color = "Survival") +
  theme(axis.line = element_line(color = "#000000"))



scatter3D(data$Age, data$Fare, data$Pclass, 
        theta=22, phi=29, 
        xlab="Age of passengers", ylab="Fare", zlab="Ticket classes of passengers",
        ticktype="detailed",
        pch = 20,
        cex.axis=0.45, cex.lab=0.95, expand=0.73, 
        colvar = data$Survived,
        col = c("black", "deepskyblue3"))


# Illustrating a scatter plot of survival vs. age, coloured by sex
ggplot(data = data, aes(x = Age, y = factor(Survived), color = factor(Sex))) +
  geom_jitter(width = 0.3, height = 0.3) +
  labs(
    title = "Distribution of survival by age and gender",
    caption = "Sex: 0 = Female, 1 = Male; \n Survived: 0 = No, 1 = Yes",
    x = "Age of passengers",
    y = "Surviving passengers",
    color = "Sex"
  ) +
  theme(axis.line = element_line(color = "#000000"))


ggplot(data = data, aes(x = "", y = Pclass, fill = factor(Pclass))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(
    title = "Number of passengers from each classes",
    fill = "Class"
  ) +
  theme_void()



## Class information ##
section1_gen_class_info_1 <- paste0("The total number of passengers from first class: ", total_pass_1_class)
section1_gen_class_info_2 <- paste0("The total number of passengers from first class who survived: ", total_pass_1_class_survived)
section1_gen_class_info_3 <- paste0("The total number of passengers from second class: ", total_pass_2_class)
section1_gen_class_info_4 <- paste0("The total number of passengers from second class who survived: ", total_pass_2_class_survived)
section1_gen_class_info_5 <- paste0("The total number of passengers from third class: ", total_pass_3_class)
section1_gen_class_info_6 <- paste0("The total number of passengers from third class who survived: ", total_pass_3_class_survived)
## Class information ends ##

for (i in 1:6) {
  variable_name <- paste0("section1_gen_class_info_", i)
  variable_value <- get(variable_name)
  print(variable_value)
}
[1] "The total number of passengers from first class: 184"
[1] "The total number of passengers from first class who survived: 120"
[1] "The total number of passengers from second class: 173"
[1] "The total number of passengers from second class who survived: 83"
[1] "The total number of passengers from third class: 355"
[1] "The total number of passengers from third class who survived: 85"
# Jitter plot of class vs. survival, coloured by sex
ggplot(data = data, aes(x = factor(Pclass), y = factor(Survived), color = factor(Sex))) +
  geom_jitter(width = 0.3, height = 0.3) +
  labs(
    title = "Scatter plot of gender and class coloured by survival",
    caption = "Sex: 0 = Female, 1 = Male; \n Survived: 0 = No, 1 = Yes \n 1 = first class, 2 = second class, 3 = third class",
    x = "Ticket classes of passengers",
    y = "Survived", 
    color = "Sex") +
  scale_color_manual(values = c("0" = "orchid3", "1" = "skyblue4")) +
  theme(axis.line = element_line(color = "#000000"))


# Clean up
rm(list = c("i", "section1_gen_class_info_1", "section1_gen_class_info_2", "section1_gen_class_info_3", "section1_gen_class_info_4", "section1_gen_class_info_5", "section1_gen_class_info_6", "section1_gen_info_1", "section1_gen_info_2", "section1_gen_info_3", "section1_gen_info_4", "section1_gen_info_5", "section1_gen_info_6", "total_female", "total_female_survival", "total_male", "total_male_survival", "total_pass_1_class", "total_pass_2_class", "total_pass_3_class", "total_pass_1_class_survived", "total_pass_2_class_survived", "total_pass_3_class_survived", "total_passenger", "total_survival", "variable_name", "variable_value"))



Clustering

# We cluster on the two columns passenger fare and passenger age, because all other columns contain small discrete values such as binary 0/1 indicators or the class levels 1, 2 and 3.
set.seed(1234)
dt_kmeans <- data[, c("Age", "Fare")]

# First of all, let us start with k-means with two centers (k = 2).
model_kmeans <- kmeans(dt_kmeans, centers = 2)

clustering_1_info1 <- paste0("The total number of observations for first cluster is ", model_kmeans$size[1], ".")
clustering_1_info2 <- paste0("The total number of observations for second cluster is ", model_kmeans$size[2], ".")
print(c(clustering_1_info1, clustering_1_info2 ))
[1] "The total number of observations for first cluster is 46."  
[2] "The total number of observations for second cluster is 666."
# Here are the centers of our clusters:
print(model_kmeans$centers)
       Age      Fare
1 31.60696 192.59855
2 29.50638  23.65218
# Let us visualise it by showing the centers as well
ggplot(data = dt_kmeans, aes(x = Age, y = Fare, color = as.factor(model_kmeans$cluster))) +
  geom_point() +
  # Let us illustrate the centers of clusters
  geom_point(data = data.frame(model_kmeans$centers),
             shape = "circle filled",
             size = 7, 
             color = "#000000",
             fill = c("deepskyblue2", "gold2"),
             alpha = 0.75) +
  guides(color = "none") +
  scale_color_manual(values = c("deepskyblue2","gold2")) +
  labs(
    x = "Age of passengers",
    y = "Passenger fare") +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"))



# Let us standardise the inputs
# We have seen the standardisation of variables before: every variable is centered on its mean and then scaled by its standard deviation.
dt_kmeans_scaled <- data.table(scale(dt_kmeans))
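# For a single column, scale() is equivalent to the manual standardisation sketched here:
# (dt_kmeans$Age - mean(dt_kmeans$Age)) / sd(dt_kmeans$Age)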

model_kmeans_scaled <- kmeans(dt_kmeans_scaled, centers = 2)

ggplot(data = dt_kmeans_scaled, aes(x = Age, y = Fare, color = as.factor(model_kmeans_scaled$cluster))) +
  geom_point() +
  geom_point(data = data.frame(model_kmeans_scaled$centers),
             shape = "circle filled",
             size = 7, 
             color = "#000000",
             fill = c("deepskyblue2", "gold2"),
             alpha = 0.75) +
  guides(color="none") +
  scale_color_manual(values = c("deepskyblue2", "gold2")) +
  labs(
    x="Age of passengers",
    y= "Passenger fare") +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"))


After scaling, we find two clusters that are separated mainly along the age variable.

Elbow plot

We would also like to draw an elbow plot to find the optimal k. To do so, we need the total within-cluster sum of squares (TWSS) of our scaled model, i.e. the sum of squared distances of all data points to their cluster centers. First, we loop through different values of k, in this case from 1 to 25. Then we draw the elbow plot to investigate the sensitivity of the TWSS to the number of clusters k.

k <- 25
elbow <- data.table("k" = 1:k, "WSS" = NA)

# Loop over the candidate values of k
for (i in 1:k) {
  elbow$k[i] <- i
  # Fit k-means with i centers and record the total within-cluster sum of squares
  elbow$WSS[i] <- kmeans(dt_kmeans_scaled, centers = i, nstart = 50)$tot.withinss
}
Warning: did not converge in 10 iterations
# Illustrating our findings with the help of line chart
ggplot(data = elbow, aes(x = k, y = WSS)) +
  geom_line() +
  labs(
    title = "Elbow plot",
    x = "k",
    y = "TWSS") +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"))


NbClust(data=dt_kmeans_scaled, method="kmeans")
*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 5 proposed 2 as the best number of clusters 
* 4 proposed 3 as the best number of clusters 
* 1 proposed 4 as the best number of clusters 
* 1 proposed 6 as the best number of clusters 
* 1 proposed 10 as the best number of clusters 
* 6 proposed 12 as the best number of clusters 
* 1 proposed 13 as the best number of clusters 
* 4 proposed 15 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  12 
 
 
******************************************************************* 
$All.index
         KL       CH Hartigan     CCC    Scott  Marriot     TrCovW   TraceW
2    2.2054 372.2797 483.7325 -5.9295  729.605 719419.3 380094.971 932.8642
3    3.1755 554.0442 201.6686 -7.1991 1334.611 692042.4  77488.085 554.8426
4    0.7899 540.8836 137.7367 -7.7737 1729.603 706446.6  43973.842 431.9720
5    0.3825 518.2829 252.7282 -8.8798 2064.002 690125.1  37237.721 361.6210
6    4.3812 612.5190  93.7143 -4.4907 2393.777 625373.5  16891.388 266.3942
7    1.9768 592.9652 112.7439 -5.3835 2600.348 636842.7  15356.183 235.1769
8    0.2219 604.7835  38.0767 -4.8875 2818.839 611990.3  11175.844 202.7526
9    0.8115 561.7677  21.8105 -6.8808 2926.707 665662.2  10806.627 192.3492
10   4.5647 516.5289  82.5888 -9.1249 2977.649 765061.0  10364.909 186.5612
11   0.0511 527.0747 413.5455 -8.6338 3121.229 756663.2   7496.572 166.9230
12  70.0097 798.2867  63.3209  2.4591 3773.675 360173.1   2800.028 104.9872
13   0.1160 802.0861  13.3817  2.5410 3851.963 378688.5   1814.260  96.2781
14   0.6323 754.5096 119.1954  0.8346 3912.351 403475.5   2160.602  94.4695
15 110.4838 827.5847  47.0911  3.2983 4130.540 340922.9   1703.729  80.6903
   Friedman   Rubin Cindex     DB Silhouette   Duda  Pseudot2   Beale Ratkowsky
2    1.4887  1.5243 0.1050 1.3807     0.4615 1.4540 -211.6973 -0.3116    0.3954
3    3.1304  2.5629 0.1238 0.9002     0.4603 1.3762 -115.0938 -0.2722    0.4508
4    4.9496  3.2919 0.0993 0.9209     0.4266 1.3420  -74.6670 -0.2542    0.4167
5    7.3138  3.9323 0.0891 1.0011     0.3872 2.9908 -161.7503 -0.6632    0.3853
6    8.9009  5.3380 0.0879 0.8025     0.4369 2.7302 -169.2059 -0.6309    0.3680
7   10.8642  6.0465 0.0763 0.8895     0.4247 2.0025 -113.1438 -0.4982    0.3452
8   13.0753  7.0135 0.0697 0.8262     0.4441 2.5076 -117.2349 -0.5980    0.3273
9   14.6396  7.3928 0.0656 0.8731     0.3919 1.9970 -143.7807 -0.4958    0.3099
10  15.3350  7.6222 0.0622 0.9273     0.3894 2.6234  -68.0691 -0.6110    0.2947
11  16.9784  8.5189 0.0630 0.9180     0.3972 1.7144 -105.8414 -0.4137    0.2832
12  27.8408 13.5445 0.0940 0.8708     0.4085 3.7345 -133.9980 -0.7264    0.2778
13  28.5170 14.7697 0.1006 0.8715     0.4153 2.8057  -34.7535 -0.6241    0.2678
14  30.6223 15.0525 0.0887 0.8572     0.4065 1.3694  -18.6112 -0.2656    0.2582
15  35.8629 17.6229 0.0867 0.8185     0.4165 1.9926  -78.2067 -0.4897    0.2508
       Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex Dindex   SDbw
2  466.4321     0.4615  0.1400  0.3579 0.0026 0.0010 11.5350 0.8343 1.7601
3  184.9475     0.5148  0.5897  0.4115 0.0095 0.0014 12.9759 0.7088 1.4853
4  107.9930     0.5316  1.8006  0.5287 0.0096 0.0016 10.8509 0.6034 1.1531
5   72.3242     0.4595  0.3418  0.8339 0.0050 0.0017 10.2125 0.5320 0.9741
6   44.3990     0.4600  0.7410  0.8683 0.0059 0.0018  8.9781 0.4534 0.7840
7   33.5967     0.4367  0.2640  0.9762 0.0118 0.0018  8.8506 0.4119 0.6948
8   25.3441     0.4360  2.9541  0.9639 0.0118 0.0019  8.1356 0.3791 0.5993
9   21.3721     0.3774  0.0339  1.3135 0.0118 0.0019  9.4129 0.3542 0.5528
10  18.6561     0.3800  0.1934  1.2715 0.0059 0.0019  8.6732 0.3447 0.5089
11  15.1748     0.3792  1.1639  1.2443 0.0064 0.0019  9.1083 0.3332 0.5383
12   8.7489     0.3692 -0.5533  1.3041 0.0196 0.0020  8.8987 0.2982 0.2490
13   7.4060     0.3806  1.5236  1.1929 0.0110 0.0020  8.0162 0.2875 0.2119
14   6.7478     0.3771 -0.6041  1.2135 0.0098 0.0020 11.6938 0.2834 0.2225
15   5.3794     0.3826  1.4004  1.1620 0.0098 0.0020 11.5027 0.2713 0.1787

$All.CriticalValues
   CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
2          0.5699           511.7558            1
3          0.5221           385.4155            1
4          0.5572           232.8363            1
5          0.5318           213.9526            1
6          0.5179           248.5450            1
7          0.5093           217.7170            1
8          0.5022           193.3021            1
9          0.4775           315.1968            1
10         0.4058           161.0794            1
11         0.4739           282.0160            1
12         0.4618           213.2379            1
13         0.2585           154.9208            1
14         0.3779           113.5741            1
15         0.3631           275.3848            1

$Best.nc
                      KL       CH Hartigan     CCC    Scott  Marriot   TrCovW
Number_clusters  15.0000  15.0000  12.0000 15.0000  12.0000     12.0      3.0
Value_Index     110.4838 827.5847 350.2246  3.2983 652.4456 415005.5 302606.9
                  TraceW Friedman   Rubin  Cindex     DB Silhouette  Duda  PseudoT2
Number_clusters   3.0000  12.0000 12.0000 10.0000 6.0000     2.0000 2.000    2.0000
Value_Index     255.1511  10.8624 -3.8004  0.0622 0.8025     0.4615 1.454 -211.6973
                  Beale Ratkowsky     Ball PtBiserial Frey McClain    Dunn Hubert
Number_clusters  2.0000    3.0000   3.0000     4.0000    1  2.0000 12.0000      0
Value_Index     -0.3116    0.4508 281.4846     0.5316   NA  0.3579  0.0196      0
                SDindex Dindex    SDbw
Number_clusters 13.0000      0 15.0000
Value_Index      8.0162      0  0.1787

$Best.partition
  [1]  1  4 12  4 10 11  3 12  5  3  7  1  2  1  7  3 10 10 10  1 12  3 10  8  2  7
 [27]  4  2  1  1  1  2 12  3  1  1  5  1 11 12  7  1 12  3  5  1 11  3 12  1  1 12
 [53] 10  5  4 12  4 12  3 10  1 12  4  1 10  5  8 12 12  1 11 12  7  7  4 10 10 12
 [79]  4 10 10 12  1 10  2  1  1  1  1  1  7 12  8  3  4 10 10 11  5 12  2 10  1  2
[105] 12 12 12  1  4  1  4  1 12  1  1  1 12  5 10  2  2  4  7  2 11  1 10  2  2 12
[131]  1  3  5  2  4  7  3  3  1  7  1  2 10 10  5  3  3  2  2 10 10  1  1  3  2 11
[157]  2 12 12 10  2  1  3 10 12  1  2 12 10  1 10  9 12  2  4 10  1 12  2  4  1  1
[183]  1  1  4 12  7  3 12  2  5  1 10 12  1 10 11 12 12  4  7 12  7 10  2 12  4  6
[209]  2  3 11  2 10  5 12 11  9 12  2 10 11  2  3 10  7 12  1  1 10 10  1  2  1  4
[235]  4 10 12 12 12  9  8  1  9  9 10  4  4  8 12 12  2 12 12  7  9  9  1 12 10 12
[261]  9  7 10 10  5  2  9  1  4  9  2  2  3  8 12 12 10 12  2  3  2 12  1 12 12  4
[287] 10  2 12  2 10 10 11  4  4  1  1  9  3  1  8  1  1  8  3 10  4  4  3 10  1  9
[313]  1 12  9 12  1 10  2 12 12 10 12  1 12  1 10  2  3  1  4  2 10  1 10  5  1 12
[339] 12  1 12  1 10 12  2  1 11  9  1 12  8 10  2  1 12 12  5  5 10  3  2 10 10 11
[365] 12  7  2  2 10  2  2 10  7  3 10 10 12  1 10 12  1  3  5  2  7  4  4  7 10  5
[391]  1  7  7  1 11  9 12  1  1 10  4  9 10 12  4 12 10 11 12  2 10 10 10  4  1  2
[417]  2  2 10 12  3  1 10  3  2  9  4  4  5  5 10 11  7  1 10  5  9 12  1  1  7  2
[443]  4 10  2 12 12  1 12 10  7 11 10  1  1 10  2 10 12  9  7 10  4  2 11  1 10 11
[469]  2 10 10  2 11 12  2 10 10 10 12 12  9  2 10  4 10 12  3 12 12  2  1  1  1  7
[495]  7  4 12  7  2 10  5 12 10 10  2  1  4  3  3 11  1  7 12  1  1  1  4 10 12 11
[521] 11  2  2 10  1  4 12  2  2  4  7 10  1 12  1  2  6  4  1  5  7 12  5  1  1  8
[547]  4  3 12  7  2  2 11  2  8 10  1 12 12  2  2  2  9  4  2 12  2  1  8 12 10  3
[573]  1 10  2  4  1 10 12 12  8  5 12 12 12  2  6  4  8 12 10  7  1 10  5 10  3  3
[599] 10 12 11  3 12  1 10  4  2  1  9  1 11 10 10 12  2  7  7  1  3  8  1  5 12 12
[625] 12  1  5  3 11  5 12 10  2 10 10 10 10 10  9  3 12 10 10  1 10  4 12  2 10  3
[651] 10 12 10  2  5 11 12 10 12  3  3  1  3 12  1  4  1  4  1  1 10 10  1  2 10 12
[677]  3  7  5  5  2  1  9  2 12  2  1  2 12  2 12  4  3 12  2 10  2 12  1  1  1 11
[703] 12 10  1 12 12  2 12  1 12 10

Although the majority rule in this run points to 12 clusters, such a fine partition of just two variables is hard to interpret, and small cluster numbers (2 and 3) also receive strong support from the indices. We consider 3 centers more suitable for our work, so let us replicate our clustering with 3 centers.

dt_kmeans_scaled <- data.table(scale(dt_kmeans))

model_kmeans_scaled <- kmeans(dt_kmeans_scaled, centers = 3)

ggplot(data = dt_kmeans_scaled, aes(x = Age, y = Fare, color = as.factor(model_kmeans_scaled$cluster))) +
  geom_point() +
  geom_point(data = data.frame(model_kmeans_scaled$centers),
             shape = "circle filled",
             size = 7, 
             color = "#000000",
             fill = c("deepskyblue2", "gold2", "forestgreen"),
             alpha = 0.75) +
  guides(color="none") +
  scale_color_manual(values = c("deepskyblue2", "gold2", "forestgreen")) +
  labs(
    x="Age of passengers",
    y= "Passenger fare") +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"))


# Clean up
rm(list = c("k", "i", "clustering_1_info1", "clustering_1_info2", "dt_kmeans", "dt_kmeans_scaled", "elbow", "model_kmeans", "model_kmeans_scaled"))



Logistic regression

# Estimate the logistic model
model_glm <- glm(formula = Survived ~ FirstClass + SecondClass + Sex + Age + Sibsp + Parch + Fare + Cherbourg + Queenstown, 
                 data = data, family = binomial(link = "logit"))

# View summary statistics
print(summary(model_glm))

Call:
glm(formula = Survived ~ FirstClass + SecondClass + Sex + Age + 
    Sibsp + Parch + Fare + Cherbourg + Queenstown, family = binomial(link = "logit"), 
    data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7220  -0.6460  -0.3796   0.6329   2.4461  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.634864   0.325362   5.025 5.04e-07 ***
FirstClass   2.395220   0.343356   6.976 3.04e-12 ***
SecondClass  1.205583   0.249757   4.827 1.39e-06 ***
Sex         -2.637859   0.223006 -11.829  < 2e-16 ***
Age         -0.043308   0.008322  -5.204 1.95e-07 ***
Sibsp       -0.362925   0.129290  -2.807    0.005 ** 
Parch       -0.060365   0.123944  -0.487    0.626    
Fare         0.001451   0.002595   0.559    0.576    
Cherbourg    0.402848   0.274556   1.467    0.142    
Queenstown  -0.420532   0.556371  -0.756    0.450    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 960.90  on 711  degrees of freedom
Residual deviance: 632.34  on 702  degrees of freedom
AIC: 652.34

Number of Fisher Scoring iterations: 5
with(model_glm, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
[1] 2.248584e-65

Looking at the results of the model, being in first or second class, being female, being younger and having fewer siblings or spouses aboard were statistically significant predictors of survival, while Parch, Fare and the ports of embarkation showed no significant influence in the given data set. The likelihood-ratio test against the null model (p ≈ 2.2e-65) confirms that the predictors jointly improve the fit.
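
Since the coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are easier to interpret. A quick sketch using the fitted model above:

# Odds ratios: multiplicative change in the odds of survival per unit increase
exp(coef(model_glm))
# For example, exp(-2.64) ≈ 0.07 for Sex: being male multiplies the odds of survival by roughly 0.07.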


Tree

### Base tree 
# Build a decision tree model using rpart
model_tree <- rpart(data = data, formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch + Fare + Embarked, method = "class")

# Plot the decision tree model using prp, with extra details (extra = 2)
prp(model_tree, extra = 2, main = "Base tree for model")


# Print and plot data on the tree
printcp(model_tree)

Classification tree:
rpart(formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch + 
    Fare + Embarked, data = data, method = "class")

Variables actually used in tree construction:
[1] Age    Fare   Parch  Pclass Sex    Sibsp 

Root node error: 288/712 = 0.40449

n= 712 

        CP nsplit rel error  xerror     xstd
1 0.454861      0   1.00000 1.00000 0.045472
2 0.029514      1   0.54514 0.54514 0.038412
3 0.027778      3   0.48611 0.55208 0.038586
4 0.024306      4   0.45833 0.52778 0.037965
5 0.010417      5   0.43403 0.47569 0.036523
6 0.010000      9   0.39236 0.49306 0.037021
plotcp(model_tree)

# The tree shows that the first deciding factor is gender. Afterwards, mainly class and age seem to be prominent deciding factors for the branches, though interestingly enough fare seems to play an important role as well.

# Determine the complexity parameter (cp) with the lowest cross-validated error for pruning
# The cp parameter captures model complexity
min_cp <- model_tree$cptable[which.min(model_tree$cptable[, "xerror"]), "CP"]

# Prune the decision tree model using the minimum cp
model_tree_2 <- prune(model_tree, cp = min_cp)

# Plot the pruned decision tree model using prp, with extra details (extra = 2)
prp(model_tree_2, extra = 2, main = "Pruned base tree with less complexity")


# The branch determined by being male and over the age of 6 contains the biggest group and the most wrong guesses of the model.

# Calculate the mean accuracy of the model predictions
mean(predict(model_tree, data, type = "class")==data$Survived)
[1] 0.8412921
mean(predict(model_tree_2, data, type = "class")==data$Survived)
[1] 0.8244382
# The accuracy of the pruned and the unpruned model differs by only about 2%, so we choose the pruned model as representative, as it is less complex
model_trees <- model_tree_2

# Print and plot data on the tree
printcp(model_tree_2)

Classification tree:
rpart(formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch + 
    Fare + Embarked, data = data, method = "class")

Variables actually used in tree construction:
[1] Age    Fare   Pclass Sex    Sibsp 

Root node error: 288/712 = 0.40449

n= 712 

        CP nsplit rel error  xerror     xstd
1 0.454861      0   1.00000 1.00000 0.045472
2 0.029514      1   0.54514 0.54514 0.038412
3 0.027778      3   0.48611 0.55208 0.038586
4 0.024306      4   0.45833 0.52778 0.037965
5 0.010417      5   0.43403 0.47569 0.036523
plotcp(model_tree_2)


### Visualization for interesting variable combinations
# Convert 'Survived' variable to a factor
tree_data <- data
tree_data$Survived <- as.factor(tree_data$Survived) 

# Build a decision tree model using rpart
model_tree_4 <- rpart(data = tree_data, formula = Survived ~ Age + Sibsp, method = "class")

# Plot the decision tree model using prp
prp(model_tree_4, extra = 2, main = "Tree for Age and Sibling/Spouse")


# Determine the complexity parameter (cp) with the lowest cross-validated error for pruning
min_cp <- model_tree_4$cptable[which.min(model_tree_4$cptable[, "xerror"]), "CP"]

# Prune the decision tree model using the minimum cp
model_tree_5 <- prune(model_tree_4, cp = min_cp)

# Plot the pruned decision tree model using prp
prp(model_tree_5, extra = 2, main = "Pruned tree for Age and Sibling/Spouse")


# Print and plot data on the tree
printcp(model_tree_5)

Classification tree:
rpart(formula = Survived ~ Age + Sibsp, data = tree_data, method = "class")

Variables actually used in tree construction:
[1] Age   Sibsp

Root node error: 288/712 = 0.40449

n= 712 

        CP nsplit rel error  xerror     xstd
1 0.065972      0   1.00000 1.00000 0.045472
2 0.034722      1   0.93403 0.94792 0.045049
3 0.010000      2   0.89931 0.92014 0.044786
plotcp(model_tree_5)




# Create a scatter plot with decision tree boundaries for Age and Sibsp variables
plot1 <- ggplot(data = tree_data, aes(x = Sibsp, y = Age)) +
  geom_point(aes(color = Survived)) +
  geom_parttree(data = model_tree_5, aes(fill = Survived), alpha = 0.1) +
  scale_color_manual(values = c("black", "deepskyblue2")) +
  scale_fill_manual(values = c("black", "deepskyblue2")) +
  guides(color = "none") +
  labs(
    x = "Sibsp",
    y = "Age",
    color = "Survived", 
    title = "Scatter plot of age and familial relations showing survival and tree estimation of survival"
  ) +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"), legend.position = "none")

# Build another decision tree model using rpart
model_tree_4 <- rpart(data = tree_data, formula = Survived ~ Age + Parch, method = "class")

# Plot the decision tree model using prp
prp(model_tree_4, extra = 2, main = "Tree for Age and Parent/Child")


# Determine the complexity parameter (cp) with the lowest cross-validated error for pruning
min_cp <- model_tree_4$cptable[which.min(model_tree_4$cptable[, "xerror"]), "CP"]

# Prune the decision tree model using the minimum cp
model_tree_6 <- prune(model_tree_4, cp = min_cp)

# Plot the pruned decision tree model using prp
prp(model_tree_6, extra = 2, main = "Pruned tree for Age and Parent/Child")


# Print and plot data on the tree
printcp(model_tree_6)

Classification tree:
rpart(formula = Survived ~ Age + Parch, data = tree_data, method = "class")

Variables actually used in tree construction:
[1] Age   Parch

Root node error: 288/712 = 0.40449

n= 712 

        CP nsplit rel error  xerror     xstd
1 0.052083      0   1.00000 1.00000 0.045472
2 0.027778      1   0.94792 0.97917 0.045313
3 0.017361      2   0.92014 0.98611 0.045368
4 0.011574      3   0.90278 0.98958 0.045395
5 0.010000      6   0.86806 0.96528 0.045200
plotcp(model_tree_6)



# Create a scatter plot with decision tree boundaries for Age and Parch variables
plot2 <- ggplot(data = tree_data, aes(x = Parch, y = Age)) +
  geom_point(aes(color = Survived)) +
  geom_parttree(data = model_tree_6, aes(fill = Survived), alpha = 0.1) +
  scale_color_manual(values = c("black", "deepskyblue2")) +
  scale_fill_manual(values = c("black", "deepskyblue2")) +
  guides(color = "none") +
  labs(
    x = "Parch",
    y = "Age",
    color = "Survived"
  ) +
  theme_minimal() +
  theme(
    axis.line = element_line(color = "#000000"),
    legend.position = "bottom",
    legend.box = "horizontal"
  )

# Combine the two plots
plot1 + plot2


# Visualising age and fare, the only variables with values distributed throughout their ranges.
# Build a decision tree model using rpart
model_tree_4 <- rpart(data = tree_data, formula = Survived ~ Age + Fare, method = "class")

# Plot the decision tree model using prp
prp(model_tree_4, extra = 2, main = "Tree for Age and Fare")


# Determine the complexity parameter (cp) with the lowest cross-validated error for pruning
min_cp <- model_tree_4$cptable[which.min(model_tree_4$cptable[, "xerror"]), "CP"]

# Prune the decision tree model using the minimum cp
model_tree_7 <- prune(model_tree_4, cp = min_cp)

# Plot the pruned decision tree model using prp
prp(model_tree_7, extra = 2, main = "Pruned tree for Age and Fare")


# Print and plot data on the tree
printcp(model_tree_7)

Classification tree:
rpart(formula = Survived ~ Age + Fare, data = tree_data, method = "class")

Variables actually used in tree construction:
[1] Age  Fare

Root node error: 288/712 = 0.40449

n= 712 

        CP nsplit rel error  xerror     xstd
1 0.201389      0   1.00000 1.00000 0.045472
2 0.031250      1   0.79861 0.81597 0.043567
3 0.012153      3   0.73611 0.77083 0.042918
plotcp(model_tree_7)



# Create a scatter plot with decision tree boundaries for Age and fare variables
ggplot(data = tree_data, aes(x = Fare, y = Age)) +
  geom_point(aes(color = Survived)) +
  geom_parttree(data = model_tree_7, aes(fill = Survived), alpha = 0.1) +
  scale_color_manual(values = c("black", "deepskyblue2")) +
  scale_fill_manual(values = c("black", "deepskyblue2")) +
  guides(color = "none") +
  labs(
    x = "Fare",
    y = "Age",
    color = "Survived",
    title = "Scatter plot of age and fare showing survival and tree estimation of survival"
  ) +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"), legend.position = "none")


# Although this visualization looks quite nice, it does not seem to indicate any notable impact of fare.

# Clean up
rm(list = c("min_cp", "plot1", "plot2", "model_tree", "model_tree_2", "model_tree_4", "model_tree_5", "model_tree_6", "model_tree_7", "tree_data"))

There is clearly only a small difference between the pruned base tree and the complex base tree: their accuracies differ by only about 2%, so we choose the pruned model as representative since it is less complex, which we believe will save computational memory. Moreover, the more complex tree only improves on a less relevant branch.

The more important branch, determined by being male and over the age of 6, contains the biggest group and the most wrong guesses of the model.

Investigating hyperparameters for the tree model

### Investigating hyperparameters
# Define the values for minimum observations, maximum depth, and complexity parameter (cp)
min_obs <- seq(from = 5, to = 50, by = 5)
max_depth <- c(1, 5, 10, 15)
cp_val <- c(0.001, 0.01, 0.05, 0.1)

# Create an empty data frame to store the results
results <- data.frame()

# Iterate over the combinations of min_obs, max_depth, and cp_val
for (i in 1:length(min_obs)) {
  for (j in 1:length(max_depth)) {
    for (k in 1:length(cp_val)) {
      # Build a decision tree model with the specified parameters
      model <- rpart(data = data, formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch + Fare + Embarked, 
                     method = "class",
                     cp = cp_val[k],
                     maxdepth = max_depth[j],
                     minbucket = min_obs[i])
      
      # Calculate the accuracy of the model predictions
      acc <- mean(predict(model, data, type = "class") == data$Survived)
      
      # Append the accuracy and parameter values to the results data frame
      results <- rbind(results, c(acc, min_obs[i], max_depth[j], cp_val[k]))
    }
  }
}

# Set column names for the results data frame
colnames(results) <- c("accuracy", "min_obs", "max_depth", "cp")

# Create a 3D scatter plot using scatter3D
plot3D::scatter3D(x = results$min_obs, y = results$max_depth, z = results$cp,
                  theta = 30, phi = 10,
                  xlab = "Min Obs", ylab = "Max Depth", zlab = "CP",
                  ticktype = "detailed",
                  pch = 19,
                  cex.axis = 0.75, cex.lab = 0.75, expand = 0.75,
                  colvar = results$accuracy, col = viridis(100))



# Select the optimal parameter combination with the highest accuracy
opt_params <- results[which.max(results$accuracy),]

# Build a decision tree model using the optimal parameters
model_tree_3 <- rpart(data=data, formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch + Fare + Embarked, 
                      method = "class",
                      cp = opt_params$cp, 
                      maxdepth = opt_params$max_depth, 
                      minbucket = opt_params$min_obs)

# Calculate the mean accuracy of the model predictions
mean(predict(model_tree_3, data, type = "class") == data$Survived)
[1] 0.875
# Plot the decision tree built with the optimal parameters, with additional details (extra = 2)
prp(model_tree_3, extra = 2, main = "Tree with optimal parameters")


# Print and plot data on the tree
printcp(model_tree_3)

Classification tree:
rpart(formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch + 
    Fare + Embarked, data = data, method = "class", cp = opt_params$cp, 
    maxdepth = opt_params$max_depth, minbucket = opt_params$min_obs)

Variables actually used in tree construction:
[1] Age    Fare   Parch  Pclass Sex    Sibsp 

Root node error: 288/712 = 0.40449

n= 712 

          CP nsplit rel error  xerror     xstd
1  0.4548611      0   1.00000 1.00000 0.045472
2  0.0295139      1   0.54514 0.54514 0.038412
3  0.0277778      3   0.48611 0.54861 0.038499
4  0.0243056      4   0.45833 0.52083 0.037782
5  0.0104167      5   0.43403 0.48264 0.036724
6  0.0092593      6   0.42361 0.49653 0.037119
7  0.0078125      9   0.39583 0.51042 0.037502
8  0.0069444     13   0.36458 0.50347 0.037312
9  0.0052083     15   0.35069 0.50347 0.037312
10 0.0034722     21   0.31944 0.50000 0.037215
11 0.0017361     23   0.31250 0.49653 0.037119
12 0.0010000     25   0.30903 0.49653 0.037119
plotcp(model_tree_3)


# Clean up
rm(list = c("acc", "cp_val", "i", "j", "k", "max_depth", "min_obs", "results", "opt_params", "model_tree_3", "model"))

The uncertain branch discussed before (male, age above 6) can reasonably be further split by class and perhaps age. Beyond that, the classification depends mostly on fare, with the decisions along the branch yielding no real insight. The optimal model shows only a marginal improvement over the accuracy of the earlier models.



Neural Networks

set.seed(12345)   # Set a seed for reproducibility

# Data setup
X <- as.matrix(data[, c("FirstClass", "SecondClass", "ThirdClass", "Sex", "Age", "Sibsp", "Parch", "Fare", "Southhampton", "Cherbourg", "Queenstown")])
y <- data$Survived

## Model as in lecture

normalize <- layer_normalization()   # Create a normalization layer
normalize %>% adapt(X)   # Adapt the normalization layer to the data

print(normalize$mean)   # Print the mean values used for normalization
tf.Tensor(
[[ 0.25842693  0.24297753  0.49859545  0.6362359  29.64209     0.51404494
   0.4325842  34.567238    0.77808994  0.18258427  0.03932584]], shape=(1, 11), dtype=float32)
model_neural <- keras_model_sequential()   # Create a sequential model

model_neural %>%
  normalize() %>%   # Apply normalization

  layer_dense(name = "HiddenLayer_1", units = 16, activation = 'relu') %>%   # Add a hidden layer with 16 units and ReLU activation
  layer_dense(name = "HiddenLayer_2", units = 16, activation = 'relu') %>%   # Add another hidden layer with 16 units and ReLU activation
  layer_dense(name = "HiddenLayer_3", units = 16, activation = 'relu') %>%   # Add a third hidden layer with 16 units and ReLU activation

   layer_dense(name = "OutputLayer", units = 1)   # Add an output layer with 1 unit

summary(model_neural)   # Print the summary of the model
Model: "sequential_82"
____________________________________________________________________________________
 Layer (type)                   Output Shape                  Param #    Trainable  
====================================================================================
 normalization_82 (Normalizatio  (None, 11)                   23         Y          
 n)                                                                                 
 HiddenLayer_1 (Dense)          (None, 16)                    192        Y          
 HiddenLayer_2 (Dense)          (None, 16)                    272        Y          
 HiddenLayer_3 (Dense)          (None, 16)                    272        Y          
 OutputLayer (Dense)            (None, 1)                     17         Y          
====================================================================================
Total params: 776
Trainable params: 753
Non-trainable params: 23
____________________________________________________________________________________
#plot_model(model_neural)

model_neural %>% compile(
  loss = "MSE",
  optimizer = optimizer_adam(learning_rate = 0.001),
  metrics = "Accuracy"
)   # Compile the model with mean squared error loss, Adam optimizer, and accuracy as a metric

model_neural %>% fit(
  x = X,
  y = y,
  epochs = 500,
  verbose = 0
)   # Train the model on the data for 500 epochs

# model_neural$layers[[2]]$kernel   # Print the kernel (weights) of the second layer
# model_neural$layers[[3]]$kernel   # Print the kernel (weights) of the third layer
# model_neural$layers[[4]]$kernel   # Print the kernel (weights) of the fourth layer


model_neural %>% evaluate(X, y, verbose = 0)   # Evaluate the model on the data
      loss   Accuracy 
0.07786684 0.89747190 
rm(list = c("X", "y"))

Investigating hyperparameters for the neural networks

### Investigating hyperparameters
# Define the values
learning_rate <- c(0.0001, 0.001, 0.01, 0.1)
width <- c(4, 16, 24)
length <- c(1, 3, 10)

# Data setup
X <- as.matrix(data[, c("FirstClass", "SecondClass", "ThirdClass", "Sex", "Age", "Sibsp", "Parch", "Fare", "Southhampton", "Cherbourg", "Queenstown")])
y <- data$Survived

# Create an empty data frame to store the results
results <- data.frame()

# Iterate over the combinations
for (i in 1:length(learning_rate)) {
  for (j in 1:length(width)) {
    for (k in 1:length(length)) {

      #print(i)
      #print(j)
      #print(k)

      if (length[k] == 1) {

        normalize_i <- layer_normalization()   # Create a normalization layer
        normalize_i %>% adapt(X)   # Adapt the normalization layer to the data

        model_neural_i <- keras_model_sequential()   # Create a sequential model

        model_neural_i %>%
          normalize_i() %>%   # Apply normalization

          layer_dense(name = "HiddenLayer_1", units = width[j], activation = 'relu') %>%

           layer_dense(name = "OutputLayer", units = 1)   # Add output layer with 1 unit

        model_neural_i %>% compile(loss="MSE", optimizer= optimizer_adam(learning_rate = learning_rate[i]), metrics="Accuracy")

        model_neural_i %>% fit(x = X, y = y, epochs = 250, verbose = 0)

      } else if (length[k] == 3) {

        normalize_i <- layer_normalization()   # Create a normalization layer
        normalize_i %>% adapt(X)   # Adapt the normalization layer to the data

        model_neural_i <- keras_model_sequential()   # Create a sequential model

        model_neural_i %>%
          normalize_i() %>%   # Apply normalization

          layer_dense(name = "HiddenLayer_1", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_2", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_3", units = width[j], activation = 'relu') %>%

           layer_dense(name = "OutputLayer", units = 1)   # Add output layer with 1 unit

        model_neural_i %>% compile(loss="MSE", optimizer=optimizer_adam(learning_rate = learning_rate[i]), metrics="Accuracy")

        model_neural_i %>% fit(x = X, y = y, epochs = 250, verbose = 0)

      } else if (length[k] == 10) {

        normalize_i <- layer_normalization()   # Create a normalization layer
        normalize_i %>% adapt(X)   # Adapt the normalization layer to the data

        model_neural_i <- keras_model_sequential()   # Create a sequential model

        model_neural_i %>%
          normalize_i() %>%   # Apply normalization

          layer_dense(name = "HiddenLayer_1", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_2", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_3", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_4", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_5", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_6", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_7", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_8", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_9", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_10", units = width[j], activation = 'relu') %>%

           layer_dense(name = "OutputLayer", units = 1)   # Add output layer with 1 unit

        model_neural_i %>% compile(loss="MSE", optimizer=optimizer_adam(learning_rate = learning_rate[i]), metrics="Accuracy")

        model_neural_i %>% fit(x = X, y = y, epochs = 250, verbose = 0)

      }

      # Calculate the accuracy of the model predictions
      eval <- model_neural_i %>% evaluate(X, y, verbose = 0)
      acc <- eval[2]

      # Append the accuracy and parameter values to the results data frame
      results <- rbind(results, c(acc, learning_rate[i], width[j], length[k]))
    }
  }
}
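
# Side note: the three branches above differ only in the number of hidden layers.
# A hypothetical helper (not part of our pipeline) could build the same models
# more compactly by looping over the depth, assuming the keras API used above:
build_model <- function(X, width, depth, lr) {
  normalize_i <- layer_normalization()   # Create a normalization layer
  normalize_i %>% adapt(X)               # Adapt it to the data
  model <- keras_model_sequential() %>% normalize_i()
  for (l in seq_len(depth)) {
    model %>% layer_dense(name = paste0("HiddenLayer_", l),
                          units = width, activation = "relu")
  }
  model %>% layer_dense(name = "OutputLayer", units = 1)
  model %>% compile(loss = "MSE",
                    optimizer = optimizer_adam(learning_rate = lr),
                    metrics = "Accuracy")
  model
}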

# Set column names for the results data frame
colnames(results) <- c("accuracy", "learning_rate", "width", "length")

# Create a 3D scatter plot using scatter3D
plot3D::scatter3D(x = results$learning_rate, y = results$width, z = results$length,
                  theta = 30, phi = 10,
                  xlab = "Learning Rate", ylab = "Width", zlab = "Length",
                  ticktype = "detailed",
                  pch = 19,
                  cex.axis = 0.75, cex.lab = 0.75, expand = 0.75,
                  colvar = results$accuracy, col = viridis(100))



# Select the optimal parameter combination with the highest accuracy
opt_params <- results[which.max(results$accuracy),]

opt_params

# Clean up
rm(list = c("X", "y", "i", "j", "k", "width", "length", "learning_rate", "results", "opt_params", "normalize_i", "model_neural_i", "eval"))

Looking at the graph, it is evident that the model improves with more neurons per layer as well as with a higher number of layers, but in order to save computational memory we still decided on three layers with sixteen neurons each. Furthermore, our investigation shows that the best learning rate in our case is the default of 0.001.

Ensemble methods

formula <- Survived ~ FirstClass + SecondClass + Sex + Age + Sibsp + Parch + Fare + Cherbourg + Queenstown

# Building a random forest model using the randomForest function
# mtry is set to 9, i.e. all predictors are considered at each split, so the forest reduces to bagging
# ntree parameter is set to 500, specifying the number of trees in the random forest
model_ensemble_bagging <- randomForest(formula,
                                       data=data,
                                       mtry=9,
                                       ntree=500)
Warning: The response has five or fewer unique values.  Are you sure you want to do regression?
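# The warning above appears because Survived is numeric, so randomForest fits a
# regression forest. A classification forest would use a factor response instead;
# a sketch (model_rf_class is hypothetical, not part of our pipeline):
model_rf_class <- randomForest(update(formula, as.factor(Survived) ~ .),
                               data = data, mtry = 9, ntree = 500)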
# Creating a variable importance plot for the random forest model
varImpPlot(model_ensemble_bagging, main="Bagging: Variable Importance")


# Creating a data.table to compare actual survival values with predicted values from the bagging model
model_bag_result_comparison <- data.table(actual = data$Survived,
                                          prediction = predict(model_ensemble_bagging, newdata = data))

# Building a gradient boosting model using the gbm function
# distribution is set to "gaussian" for regression, but could be modified for classification
# interaction.depth sets the maximum depth of the individual trees
# n.trees is set to 500, indicating the number of boosting iterations
# shrinkage controls the step size when updating the weights of each tree
# cv.folds is set to 5, specifying the number of cross-validation folds for model evaluation
model_ensemble_boosting <- gbm(formula = formula,
                               data=data,
                               distribution="gaussian",
                               interaction.depth=1,
                               n.trees=500,
                               shrinkage=0.1,
                               cv.folds = 5)

# Calculating the performance of the boosting model using cross-validation
performance_ensemble <- gbm.perf(model_ensemble_boosting, method = "cv")


# Generating a summary of the boosting model, but plotit parameter is set to FALSE, so no plot will be generated
summary(model_ensemble_boosting, plotit = FALSE)

# Merging the variable importance data from the bagging model with the summary data from the boosting model
# Preparing the data for plotting by transforming percentages and converting to long format using melt function
dt_varimp <- merge(data.table(importance(model_ensemble_bagging, type = 2), keep.rownames = TRUE),
                   data.table(summary(model_ensemble_boosting, plotit = FALSE)),
                   by.x = "rn", by.y = "var")
dt_varimp <- setNames(dt_varimp, c("Variable", "Bagging", "Boosting"))
dt_varimp[, Bagging := 100*Bagging/sum(Bagging)]
dt_varimp <- melt(dt_varimp, id.vars = "Variable",
                  measure.vars = c("Bagging","Boosting"),
                  variable.name = "y",
                  value.name = "x")
dt_varimp <- dt_varimp[order(y,-x)]

# Plotting the variable importance comparison using ggplot2
ggplot(data = dt_varimp) +
  geom_col(aes(y=Variable, x=x, fill=y)) +
  labs(
    x = "Relative Importance",
    y = "Variables") +
  scale_fill_manual(values = c("#f1b147","#4CA7DE")) +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"),
        axis.ticks=element_blank(),
        legend.position = "none") +
  facet_wrap(~y)


rm(list = c("dt_varimp", "model_bag_result_comparison", "performance_ensemble"))

In the analysis of the two ensemble methods, bagging and boosting, it is clear that after Sex and Age, the passenger fare is the most important variable. This supports our assumption that the wealth of passengers may heavily influence their chances of survival. The rankings also suggest that the importance of Fare is not merely a reflection of family size, and that class membership itself matters less for survival chances than fare. Comparing the two models shows that both target the same variables.

Investigating hyperparameters for the boosting

### Investigating hyperparameters
# Define the values
n_trees <- seq(from = 250, to = 750, by = 250)
shrinking <- seq(from = 0.05, to = 0.2, by = 0.05)
cv_folds <- seq(from = 2, to = 8, by = 3)

# Create an empty data frame to store the results
results <- data.frame()

# Iterate over the combinations
for (i in 1:length(n_trees)) {
  for (j in 1:length(shrinking)) {
    for (k in 1:length(cv_folds)) {
      # Build a model with the specified parameters
      model  <- gbm(formula = formula,
                               data=data,
                               distribution="gaussian",
                               interaction.depth=1,
                               n.trees=n_trees[i],
                               shrinkage=shrinking[j],
                               cv.folds = cv_folds[k])

      # Record the estimated optimal number of boosting iterations (gbm.perf with method = "cv" returns an iteration count, not an accuracy)
      acc <- gbm.perf(model, method = "cv", plot.it = FALSE)

      # Append the accuracy and parameter values to the results data frame
      results <- rbind(results, c(acc, n_trees[i], shrinking[j], cv_folds[k]))
    }
  }
}

# Set column names for the results data frame
colnames(results) <- c("accuracy", "n_trees", "shrinking", "cv_folds")

# Create a 3D scatter plot using scatter3D
plot3D::scatter3D(x = results$n_trees, y = results$shrinking, z = results$cv_folds,
                  theta = 30, phi = 10,
                  xlab = "Number of trees", ylab = "Shrinkage", zlab = "CV folds",
                  ticktype = "detailed",
                  pch = 19,
                  cex.axis = 0.75, cex.lab = 0.75, expand = 0.75,
                  colvar = results$accuracy, col = viridis(100))



# Select the optimal parameter combination with the highest accuracy
opt_params <- results[which.max(results$accuracy),]

print(opt_params)

# Build a model using the optimal parameters
model_opt  <- gbm(formula = formula,
               data=data,
               distribution="gaussian",
               interaction.depth=1,
               n.trees = opt_params$n_trees,
               shrinkage = opt_params$shrinking,
               cv.folds = opt_params$cv_folds)

# Calculate the mean accuracy of the model predictions
# (with the gaussian distribution, predict() returns continuous values that
# never exactly equal the binary outcome, which is why this evaluates to 0)
mean(predict(model_opt, data) == data$Survived)
Using 95 trees...
[1] 0
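For completeness, the continuous predictions could be thresholded before the comparison. A minimal sketch, assuming the usual 0.5 cutoff (this line was not part of the original run):

# Threshold the continuous gbm predictions at 0.5, then compare with the outcome
mean(ifelse(predict(model_opt, data) > 0.5, 1, 0) == data$Survived)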
# Clean up
rm(list = c("acc", "cv_folds", "i", "j", "k", "n_trees", "shrinking", "results", "opt_params", "model_opt"))

As the performance of the boosting model seems to vary, we did not change our hyperparameters from our default options. And since bagging leads to the best results anyway, we are confident in our hyperparameter choice.



PCA

The covariance matrix describes the directions in which the data are most spread out. So let us find the covariance matrix of the data: center the data points and calculate the variances and covariances. Next, find the eigenvectors and the corresponding eigenvalues of the covariance matrix. Sort the eigenvectors v by the absolute magnitude of their eigenvalues: these are our principal components (PCs) and the basis vectors of our new coordinate system.
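These steps can be carried out by hand in a few lines. A minimal sketch, assuming a numeric matrix X (here built from the built-in mtcars data purely for illustration):

# Center the data, compute the covariance matrix, and decompose it
X <- scale(as.matrix(mtcars[, c("mpg", "hp", "wt")]), scale = FALSE)  # centered data
C <- cov(X)       # variances and covariances
e <- eigen(C)     # eigenvalues and eigenvectors, sorted by eigenvalue
e$vectors         # the principal components as basis vectors
# prcomp(X)$rotation returns the same basis vectors (up to sign)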

Reducing principal components

We are back to dropping dimensions, but this time we drop principal components instead of variables. We hope to greatly reduce the number of PCs while capturing most of the variation along the first PCs.

We may want to capture a minimum amount of the total variance of the data, or keep all the principal components that –individually– explain a minimum amount of variance. Similar to k-means, we can produce a scree plot of our principal components in descending order of their eigenvalues.

What is more, instead of dropping PCs we may be interested in exploring and possibly interpreting them. Certain variables may drive the PC1 coordinate (say, wealth) while others drive the PC2 coordinate (say, family size). We have already seen that principal components no longer represent single variables as our original basis vectors did. Each PC coordinate is a linear combination of the original variables. The eigenvectors v determine how the single variables of x contribute to each PC coordinate via v^T x.

# We exclude ThirdClass (and likewise Southhampton) because they serve as our baseline categories. 
dt_pca <- data.table(scale(data[, c("Age", "Fare", "Sex", "Parch", "Sibsp", "Queenstown", "Cherbourg", "FirstClass", "SecondClass")]))
# We use a library function which decomposes the covariance matrix for us and supplies other fundamental outputs.
model_pca <- prcomp(dt_pca)
summary(model_pca)
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7    PC8
Standard deviation     1.4475 1.3223 1.0565 0.9919 0.93994 0.84767 0.78654 0.7385
Proportion of Variance 0.2328 0.1943 0.1240 0.1093 0.09817 0.07984 0.06874 0.0606
Cumulative Proportion  0.2328 0.4271 0.5511 0.6604 0.75861 0.83844 0.90718 0.9678
                           PC9
Standard deviation     0.53853
Proportion of Variance 0.03222
Cumulative Proportion  1.00000

The model enables us to dive into our 9 principal components. In the summary, we also find the share that each PC contributes to explaining the variance of our data, in descending order of the eigenvalues.

fviz_eig(model_pca, addlabels = TRUE) + 
         theme(plot.title = element_text(hjust = 0.5))

Looking at the scree plot, it is evident that we explain ~66% of the variance with the first 4 PCs.
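We can verify this directly from the model object; a small sketch using the sdev component of prcomp (the variances are the squared standard deviations):

# Share of variance per PC and cumulative share of the first 4 PCs
var_share <- model_pca$sdev^2 / sum(model_pca$sdev^2)
cumsum(var_share)[4]   # ~0.66, matching the summary above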

To investigate how our original variables contribute to the PCs and to possibly interpret these contributions, we can look at the eigenvectors v. They essentially contribute the coefficients to the linear combination of our variables that produces the PCs. We find the eigenvectors at model_pca$rotation.

Here, $rotation gives the (unit-length) eigenvectors. Scaling them by the square roots of the eigenvalues would yield the so-called loadings: eigenvectors ∗ √eigenvalues
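If we wanted the loadings explicitly, they can be computed from the prcomp output; a small sketch (the square roots of the eigenvalues are stored in model_pca$sdev):

# Loadings: each eigenvector column scaled by the corresponding sqrt(eigenvalue)
loadings_pca <- sweep(model_pca$rotation, 2, model_pca$sdev, `*`)
round(loadings_pca[, 1:2], 2)   # loadings on the first two PCs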

The above eigenvectors take each data point x to the PC coordinate system by multiplying it with V: xV.

Let’s illustrate the score plot. It plots our passengers in only the first two dimensions (PC1 and PC2), which capture ~43% of the variance.

Finally, let us illustrate our principal components in a loadings plot. Basically, the direction of an arrow shows which PCs a variable contributes to, and the angles between arrows indicate the difference in these directions.

# Matrix of eigenvectors: V
round(model_pca$rotation, 2)
              PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9
Age          0.20 -0.47  0.11 -0.34  0.47 -0.01 -0.37  0.42  0.28
Fare         0.55  0.12  0.08 -0.06  0.16 -0.26  0.34 -0.43  0.53
Sex         -0.18 -0.33 -0.37  0.50  0.34 -0.50 -0.16 -0.26 -0.09
Parch        0.12  0.55  0.08 -0.04  0.31  0.10 -0.67 -0.32 -0.15
Sibsp        0.05  0.56 -0.18  0.16  0.26 -0.31  0.19  0.65  0.03
Queenstown  -0.09  0.07 -0.58 -0.70 -0.18 -0.33 -0.05 -0.10 -0.07
Cherbourg    0.40 -0.06  0.11  0.20 -0.64 -0.39 -0.43  0.20  0.04
FirstClass   0.60 -0.16 -0.06 -0.05  0.16  0.04  0.21  0.02 -0.74
SecondClass -0.29  0.03  0.67 -0.25  0.09 -0.56  0.11 -0.07 -0.26
fviz_pca_ind(model_pca, 
             geom.ind = "point", 
             col.ind = as.factor(data$Survived), 
             palette = c("#f1b147", "#4CA7DE")) +
  ggtitle("Score Plot") +
  theme(plot.title = element_text(hjust = 0.5))


fviz_pca_var(model_pca, 
             col.var="contrib", 
             repel = TRUE) +
    theme_minimal()

We find here that Fare and FirstClass point strongly in one direction, possibly signifying wealth. Secondly, Sibsp and Parch point closely together, hinting at an overarching concept of family size.

Reconstructing Variables

Our matrix of eigenvectors V above transformed our original 9 coordinates (variables) into 9 PC coordinates.

# Insert PCA into our data for continued use
pca <- as.data.frame(as.matrix(dt_pca) %*%
  as.matrix(model_pca$rotation))

data <- cbind(data, pca)

# Define a function to perform PCA transformations on the test data
# (note: scale() here centers and scales with the test data's own means and
# standard deviations; strictly, the training set's parameters should be
# reused, see the sketch after this function)
apply_pca <- function(data, model) {
  dt_pca <- data.table(scale(data[, c("Age", "Fare", "Sex", "Parch", "Sibsp", "Queenstown", "Cherbourg", "FirstClass", "SecondClass")]))
  
  pca <- as.data.frame(as.matrix(dt_pca) %*%
    as.matrix(model$rotation))
  
  return(cbind(data, pca))
}
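A hedged sketch of the stricter variant, reusing the training set's centering and scaling parameters (apply_pca_trainscale and its arguments are our own names, not part of the analysis above):

# Scale new data with the *training* means/sds before projecting onto the PCs
apply_pca_trainscale <- function(new_data, model, train_data, cols) {
  mu    <- colMeans(train_data[, cols])       # training means
  sigma <- apply(train_data[, cols], 2, sd)   # training standard deviations
  X <- scale(as.matrix(new_data[, cols]), center = mu, scale = sigma)
  cbind(new_data, as.data.frame(X %*% model$rotation))
}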

To reconstruct the original variables we can multiply the PC coordinates by V^T:

# Passenger 1: Original variables
# x
round(as.matrix(dt_pca)[1,],2)
        Age        Fare         Sex       Parch       Sibsp  Queenstown   Cherbourg 
      -0.53       -0.52        0.76       -0.51        0.52       -0.20       -0.47 
 FirstClass SecondClass 
      -0.59       -0.57 
# Passenger 1: PC coordinates
# x * V
round(as.matrix(dt_pca)[1,] %*%
        as.matrix(model_pca$rotation), 2)
       PC1  PC2  PC3  PC4 PC5  PC6  PC7  PC8  PC9
[1,] -0.92 0.04 -0.8 0.91 0.1 0.09 0.36 0.26 0.19
# Passenger 1: Reconstructed original variables
# x * V * VT
round(as.matrix(dt_pca)[1,] %*%
        as.matrix(model_pca$rotation) %*%
        t(as.matrix(model_pca$rotation)), 2)
       Age  Fare  Sex Parch Sibsp Queenstown Cherbourg FirstClass SecondClass
[1,] -0.53 -0.52 0.76 -0.51  0.52       -0.2     -0.47      -0.59       -0.57

However, not much is gained in terms of dimensionality reduction if we simply keep all 9 PC coordinates instead of the 9 original variables. We have seen above that the first 4 PCs capture >65% of the variance of our data, so let us reduce the principal components to 4. We will call our eigenvector matrix V with only the first 4 columns V_reduced. If we project our passenger 1 onto only 4 PCs, we have only 4 values to store: x V_reduced

# x * V_reduced
as.matrix(dt_pca)[1,] %*%
  as.matrix(model_pca$rotation[,1:4])
            PC1        PC2        PC3       PC4
[1,] -0.9219128 0.04287835 -0.7984102 0.9140071
# x * V_reduced * V_reducedT
as.matrix(dt_pca)[1,] %*%
  as.matrix(model_pca$rotation[,1:4]) %*%
  t(as.matrix(model_pca$rotation[,1:4]))
            Age       Fare       Sex      Parch     Sibsp  Queenstown  Cherbourg
[1,] -0.6042263 -0.6158179 0.9094717 -0.1944706 0.2761572 -0.09504479 -0.2792797
     FirstClass SecondClass
[1,] -0.5505641  -0.4989356

If we have V_reduced we can reconstruct any passenger from its 4 PC values via x V_reduced V_reduced^T. In other words, we only need the PC coordinates x V_reduced and can reconstruct the original data. This reconstruction is approximate and depends on the number of PCs we use. Compare the original variable values above to the reconstruction below.

How many dimensions we can/shall drop depends on our purpose. Let us investigate the total reconstruction error for our one passenger depending on the number of eigenvectors we retain in Vreduced. We loop through the reconstruction of passenger 1 for 1 to 9 PCs:

results <- list()

for (pcs in 1:9){
  passenger_reconstructed <- as.matrix(dt_pca)[1,] %*% 
    as.matrix(model_pca$rotation[,1:pcs]) %*% 
    t(as.matrix(model_pca$rotation[,1:pcs])) 
  
  results[[pcs]] <- data.table("PCs used" = pcs,
                               # "MAE" here is in fact the total (summed) absolute reconstruction error
                               "MAE"= sum(abs(as.matrix(dt_pca)[1,]-passenger_reconstructed)),
                               passenger_reconstructed)
  }

dt_reconstruction <- rbindlist(results)
round(dt_reconstruction,2)
# Clean up
rm(list = c("dt_pca", "dt_reconstruction", "pca", "passenger_reconstructed", "pcs", "results"))

Models with PCA

PCA: Logistic regression

# Define the logistic regression using PCAs
model_pca_glm <- glm(formula = Survived ~ PC1 + PC2 + PC3 + PC4, 
                 data = data, family = binomial(link = "logit"))

# Print summary statistic
print(summary(model_pca_glm))

Call:
glm(formula = Survived ~ PC1 + PC2 + PC3 + PC4, family = binomial(link = "logit"), 
    data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5196  -0.8359  -0.4931   0.8222   2.1363  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.45619    0.09078  -5.025 5.02e-07 ***
PC1          0.61801    0.07207   8.575  < 2e-16 ***
PC2          0.33371    0.06741   4.950 7.40e-07 ***
PC3          0.70019    0.08884   7.881 3.23e-15 ***
PC4         -0.56679    0.09276  -6.110 9.94e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 960.90  on 711  degrees of freedom
Residual deviance: 742.15  on 707  degrees of freedom
AIC: 752.15

Number of Fisher Scoring iterations: 4
# Fit of the model
with(model_pca_glm, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
[1] 3.469072e-46
# A p-value below 0.001 indicates that the model performs better than an empty model

All inputs seem to be statistically significant at the 0.1% level.

PCA: Tree

### Base tree 
# Build a decision tree model using rpart
model_tree <- rpart(data = data, formula = Survived ~ PC1 + PC2 + PC3 + PC4, method = "class")

# Plot the decision tree model using prp, with extra details (extra = 2)
prp(model_tree, extra = 2, main = "Base tree for model")


# Print and plot data on the tree
printcp(model_tree)

Classification tree:
rpart(formula = Survived ~ PC1 + PC2 + PC3 + PC4, data = data, 
    method = "class")

Variables actually used in tree construction:
[1] PC1 PC2 PC3 PC4

Root node error: 288/712 = 0.40449

n= 712 

        CP nsplit rel error  xerror     xstd
1 0.215278      0   1.00000 1.00000 0.045472
2 0.142361      1   0.78472 0.81944 0.043614
3 0.041667      2   0.64236 0.69792 0.041704
4 0.027778      4   0.55903 0.65625 0.040912
5 0.017361      5   0.53125 0.62153 0.040194
6 0.015046      6   0.51389 0.62500 0.040268
7 0.013889      9   0.46875 0.63542 0.040487
8 0.010417     10   0.45486 0.64931 0.040773
9 0.010000     16   0.38889 0.62153 0.040194
plotcp(model_tree)

# Note: with PCA inputs, the splits are made on principal components (PC1 to PC4 are all used) rather than on the original variables such as gender, class, or age, so the branches are harder to interpret directly.

# Determine the minimum complexity parameter (cp) for pruning
#cp parameter: captures model complexity
min_cp <- model_tree$cptable[which.min(model_tree$cptable[, "xerror"]), "CP"]

# Prune the decision tree model using the minimum cp
model_tree_2 <- prune(model_tree, cp = min_cp)

# Plot the pruned decision tree model using prp, with extra details (extra = 2)
prp(model_tree_2, extra = 2, main = "Pruned base tree with less complexity")


# With PC inputs the branches no longer correspond to single passenger attributes such as sex or age, so a branch-level interpretation is less straightforward here.

# Calculate the mean accuracy of the model predictions
mean(predict(model_tree, data, type = "class")==data$Survived)
[1] 0.8426966
mean(predict(model_tree_2, data, type = "class")==data$Survived)
[1] 0.7851124
# The un-pruned model is roughly 6 percentage points more accurate than the pruned one, so we keep the un-pruned model despite its higher complexity
model_pca_trees <- model_tree

# Print and plot data on the tree
printcp(model_tree_2)

Classification tree:
rpart(formula = Survived ~ PC1 + PC2 + PC3 + PC4, data = data, 
    method = "class")

Variables actually used in tree construction:
[1] PC1 PC3 PC4

Root node error: 288/712 = 0.40449

n= 712 

        CP nsplit rel error  xerror     xstd
1 0.215278      0   1.00000 1.00000 0.045472
2 0.142361      1   0.78472 0.81944 0.043614
3 0.041667      2   0.64236 0.69792 0.041704
4 0.027778      4   0.55903 0.65625 0.040912
5 0.017361      5   0.53125 0.62153 0.040194
plotcp(model_tree_2)


rm(list = c("model_tree", "model_tree_2"))

As we find a notable disparity in accuracy between the base tree and the pruned tree, favoring the un-pruned base tree in the case of using PCA, we choose the un-pruned model for further application.

PCA: Neural Networks

set.seed(12345)   # Set a seed for reproducibility

# Data setup
X <- as.matrix(data[, c("PC1", "PC2", "PC3", "PC4")])
y <- data$Survived

## Model as in lecture

normalize_pca <- layer_normalization()   # Create a normalization layer
normalize_pca %>% adapt(X)   # Adapt the normalization layer to the data

print(normalize_pca$mean)   # Print the mean values used for normalization
tf.Tensor([[ 2.3283064e-09 -1.0186341e-08  5.1222742e-09  2.7939677e-09]], shape=(1, 4), dtype=float32)
model_pca_neural <- keras_model_sequential()   # Create a sequential model

model_pca_neural %>% 
  normalize_pca() %>%   # Apply normalization
  
  layer_dense(name = "HiddenLayer_1", units = 16, activation = 'relu') %>%   # Add a hidden layer with 16 units and ReLU activation
  layer_dense(name = "HiddenLayer_2", units = 16, activation = 'relu') %>%   # Add another hidden layer with 16 units and ReLU activation
  layer_dense(name = "HiddenLayer_3", units = 16, activation = 'relu') %>%   # Add a third hidden layer with 16 units and ReLU activation
 
   layer_dense(name = "OutputLayer", units = 1)   # Add an output layer with 1 unit

summary(model_pca_neural)   # Print the summary of the model
Model: "sequential_119"
____________________________________________________________________________________
 Layer (type)                   Output Shape                  Param #    Trainable  
====================================================================================
 normalization_119 (Normalizati  (None, 4)                    9          Y          
 on)                                                                                
 HiddenLayer_1 (Dense)          (None, 16)                    80         Y          
 HiddenLayer_2 (Dense)          (None, 16)                    272        Y          
 HiddenLayer_3 (Dense)          (None, 16)                    272        Y          
 OutputLayer (Dense)            (None, 1)                     17         Y          
====================================================================================
Total params: 650
Trainable params: 641
Non-trainable params: 9
____________________________________________________________________________________
#plot_model(model_base)

model_pca_neural %>% compile(
  loss = "MSE",
  optimizer = optimizer_adam(learning_rate = 0.001),
  metrics = "Accuracy"
)   # Compile the model with mean squared error loss, Adam optimizer, and accuracy as a metric

model_pca_neural %>% fit(
  x = X,
  y = y,
  epochs = 500,
  verbose = 0
)   # Train the model on the data for 500 epochs

#model_pca_neural$layers[[2]]$kernel   # Print the kernel (weights) of the second layer
#model_pca_neural$layers[[3]]$kernel   # Print the kernel (weights) of the third layer
#model_pca_neural$layers[[4]]$kernel   # Print the kernel (weights) of the fourth layer


model_pca_neural %>% evaluate(X, y, verbose = 0)   # Evaluate the model on the data
      loss   Accuracy 
0.09847966 0.86516851 
rm(list = c("X", "y"))

PCA: Ensemble methods

formula_pca <- Survived ~ PC1 + PC2 + PC3 + PC4

# Building a random forest model using the randomForest function
# mtry is set to 9, which is more than the 4 available predictors, so
# randomForest resets it to the valid maximum (see the warning below);
# using all predictors at each split makes this bagging
# ntree parameter is set to 500, specifying the number of trees in the random forest
model_pca_ensemble_bagging <- randomForest(formula_pca, 
                                       data=data, 
                                       mtry=9, 
                                       ntree=500)
Warning: The response has five or fewer unique values.  Are you sure you want to do regression?Warning: invalid mtry: reset to within valid range
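The warning appears because Survived is numeric, so randomForest fits a regression forest and also resets the invalid mtry. A sketch of the classification variant (not used further here) would convert the response to a factor and cap mtry at the number of predictors:

# Classification-mode random forest on the four PCs (illustrative only)
model_rf_class <- randomForest(as.factor(Survived) ~ PC1 + PC2 + PC3 + PC4,
                               data = data,
                               mtry = 4,      # at most the number of predictors
                               ntree = 500)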
# Creating a variable importance plot for the random forest model
varImpPlot(model_pca_ensemble_bagging, main="Variable Importance")


# Creating a data.table to compare actual survival values with predicted values from the bagging model
model_bag_result_comparison <- data.table(actual = data$Survived,
                                          prediction = predict(model_pca_ensemble_bagging, newdata = data))

# Building a gradient boosting model using the gbm function
# distribution is set to "gaussian" for regression, but could be modified for classification
# interaction.depth sets the maximum depth of the individual trees
# n.trees is set to 500, indicating the number of boosting iterations
# shrinkage controls the step size when updating the weights of each tree
# cv.folds is set to 5, specifying the number of cross-validation folds for model evaluation
model_pca_ensemble_boosting <- gbm(formula = formula_pca, 
                               data=data, 
                               distribution="gaussian",
                               interaction.depth=1,
                               n.trees=500, 
                               shrinkage=0.1,
                               cv.folds = 5)
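As noted in the comments above, the gaussian distribution treats this as a regression problem. A sketch of the classification alternative (not run here) would use the bernoulli distribution and predict on the probability scale:

# Boosting as binary classification (illustrative only; Survived must be 0/1)
model_boost_class <- gbm(formula = formula_pca,
                         data = data,
                         distribution = "bernoulli",
                         interaction.depth = 1,
                         n.trees = 500,
                         shrinkage = 0.1,
                         cv.folds = 5)
p_hat <- predict(model_boost_class, data, type = "response")   # probabilities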

# Calculating the performance of the boosting model using cross-validation
performance_ensemble <- gbm.perf(model_pca_ensemble_boosting, method = "cv")


# Generating a summary of the boosting model, but plotit parameter is set to FALSE, so no plot will be generated
summary(model_pca_ensemble_boosting, plotit = FALSE)

# Merging the variable importance data from the bagging model with the summary data from the boosting model
# Preparing the data for plotting by transforming percentages and converting to long format using melt function
dt_varimp <- merge(data.table(importance(model_pca_ensemble_bagging, type = 2), keep.rownames = TRUE), 
                   data.table(summary(model_pca_ensemble_boosting, plotit = FALSE)), 
                   by.x = "rn", by.y = "var")
dt_varimp <- setNames(dt_varimp, c("Variable", "Bagging", "Boosting"))
dt_varimp[, Bagging := 100*Bagging/sum(Bagging)]
dt_varimp <- melt(dt_varimp, id.vars = "Variable", 
                  measure.vars = c("Bagging","Boosting"), 
                  variable.name = "y", 
                  value.name = "x")
dt_varimp <- dt_varimp[order(y,-x)]

# Plotting the variable importance comparison using ggplot2
ggplot(data = dt_varimp) +
  geom_col(aes(y=Variable, x=x, fill=y)) +
  labs(
    x = "Relative Importance",
    y = "Variables") +
  scale_fill_manual(values = c("#f1b147","#4CA7DE")) +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"),
        axis.ticks=element_blank(), 
        legend.position = "none") +
  facet_wrap(~y)


rm(list = c("dt_varimp", "model_bag_result_comparison", "performance_ensemble"))

Models with full PCA

Full PCA: Logistic regression

# Define a logistic model using the first eight PCs
model_pca_full_glm <- glm(formula = Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8, 
                 data = data, family = binomial(link = "logit"))

# Print summary statistic
print(summary(model_pca_full_glm))

Call:
glm(formula = Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + 
    PC7 + PC8, family = binomial(link = "logit"), data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6786  -0.6521  -0.4426   0.6552   2.4048  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.43758    0.09871  -4.433 9.30e-06 ***
PC1          0.75416    0.08797   8.572  < 2e-16 ***
PC2          0.31435    0.07661   4.103 4.07e-05 ***
PC3          0.79900    0.10184   7.846 4.30e-15 ***
PC4         -0.55678    0.10008  -5.563 2.65e-08 ***
PC5         -0.65080    0.10552  -6.168 6.93e-10 ***
PC6          0.35515    0.11641   3.051  0.00228 ** 
PC7          0.72030    0.13223   5.447 5.11e-08 ***
PC8         -0.28703    0.14355  -1.999  0.04556 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 960.90  on 711  degrees of freedom
Residual deviance: 652.47  on 703  degrees of freedom
AIC: 670.47

Number of Fisher Scoring iterations: 5
# Fit of the model
with(model_pca_full_glm, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
[1] 6.593554e-62
# A p-value below 0.001 indicates that the model performs better than an empty model

Most inputs are statistically significant at the 0.1% level; PC6, however, is only significant at the 1% level and PC8 only at the 5% level.

Full PCA: Tree

### Base tree 
# Build a decision tree model using rpart
model_tree <- rpart(data = data, formula = Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8, method = "class")

# Plot the decision tree model using prp, with extra details (extra = 2)
prp(model_tree, extra = 2, main = "Base tree for model")


# Print and plot data on the tree
printcp(model_tree)

Classification tree:
rpart(formula = Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + 
    PC7 + PC8, data = data, method = "class")

Variables actually used in tree construction:
[1] PC1 PC2 PC3 PC4 PC5 PC6 PC7

Root node error: 288/712 = 0.40449

n= 712 

        CP nsplit rel error  xerror     xstd
1 0.215278      0   1.00000 1.00000 0.045472
2 0.034722      2   0.56944 0.67708 0.041317
3 0.027778      3   0.53472 0.60764 0.039891
4 0.010417      4   0.50694 0.57639 0.039176
5 0.010000     10   0.43750 0.56944 0.039010
plotcp(model_tree)

# Note: with PCA inputs, the splits are made on principal components (PC1 to PC7 are used) rather than on the original variables such as gender, class, or age.

# Determine the minimum complexity parameter (cp) for pruning
#cp parameter: captures model complexity
min_cp <- model_tree$cptable[which.min(model_tree$cptable[, "xerror"]), "CP"]

# Prune the decision tree model using the minimum cp
model_tree_2 <- prune(model_tree, cp = min_cp)

# Plot the pruned decision tree model using prp, with extra details (extra = 2)
prp(model_tree_2, extra = 2, main = "Pruned base tree with less complexity")


# With PC inputs the branches no longer correspond to single passenger attributes such as sex or age, so a branch-level interpretation is less straightforward here.

# Calculate the mean accuracy of the model predictions
mean(predict(model_tree, data, type = "class")==data$Survived)
[1] 0.8230337
mean(predict(model_tree_2, data, type = "class")==data$Survived)
[1] 0.8230337
# The pruned and un-pruned models achieve identical accuracy here, so we choose the pruned model as it is less complex
model_pca_full_trees <- model_tree_2

# Print and plot data on the tree
printcp(model_tree_2)

Classification tree:
rpart(formula = Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + 
    PC7 + PC8, data = data, method = "class")

Variables actually used in tree construction:
[1] PC1 PC2 PC3 PC4 PC5 PC6 PC7

Root node error: 288/712 = 0.40449

n= 712 

        CP nsplit rel error  xerror     xstd
1 0.215278      0   1.00000 1.00000 0.045472
2 0.034722      2   0.56944 0.67708 0.041317
3 0.027778      3   0.53472 0.60764 0.039891
4 0.010417      4   0.50694 0.57639 0.039176
5 0.010000     10   0.43750 0.56944 0.039010
plotcp(model_tree_2)


rm(list = c("model_tree", "model_tree_2"))

In this case the pruned and un-pruned models are identical: the minimum cross-validated error occurs at the full tree, so pruning with the corresponding cp leaves the tree unchanged.

Full PCA: Neural Networks

set.seed(12345)   # Set a seed for reproducibility

# Data setup
X <- as.matrix(data[, c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8")])
y <- data$Survived

## Model as in lecture

normalize_pca_full <- layer_normalization()   # Create a normalization layer
normalize_pca_full %>% adapt(X)   # Adapt the normalization layer to the data

print(normalize_pca_full$mean)   # Print the mean values used for normalization
tf.Tensor(
[[ 1.3969839e-09 -2.3283064e-09  8.8475645e-09  0.0000000e+00
  -1.3969839e-09  2.6775524e-09 -1.3969839e-09 -3.2596290e-09]], shape=(1, 8), dtype=float32)
model_pca_full_neural <- keras_model_sequential()   # Create a sequential model

model_pca_full_neural %>% 
  normalize_pca_full() %>%   # Apply normalization
  
  layer_dense(name = "HiddenLayer_1", units = 16, activation = 'relu') %>%   # Add a hidden layer with 16 units and ReLU activation
  layer_dense(name = "HiddenLayer_2", units = 16, activation = 'relu') %>%   # Add another hidden layer with 16 units and ReLU activation
  layer_dense(name = "HiddenLayer_3", units = 16, activation = 'relu') %>%   # Add a third hidden layer with 16 units and ReLU activation
 
   layer_dense(name = "OutputLayer", units = 1)   # Add an output layer with 1 unit

summary(model_pca_full_neural)   # Print the summary of the model
Model: "sequential_120"
____________________________________________________________________________________
 Layer (type)                   Output Shape                  Param #    Trainable  
====================================================================================
 normalization_120 (Normalizati  (None, 8)                    17         Y          
 on)                                                                                
 HiddenLayer_1 (Dense)          (None, 16)                    144        Y          
 HiddenLayer_2 (Dense)          (None, 16)                    272        Y          
 HiddenLayer_3 (Dense)          (None, 16)                    272        Y          
 OutputLayer (Dense)            (None, 1)                     17         Y          
====================================================================================
Total params: 722
Trainable params: 705
Non-trainable params: 17
____________________________________________________________________________________
#plot_model(model_base)

model_pca_full_neural %>% compile(
  loss = "MSE",
  optimizer = optimizer_adam(learning_rate = 0.001),
  metrics = "Accuracy"
)   # Compile the model with mean squared error loss, Adam optimizer, and accuracy as a metric

model_pca_full_neural %>% fit(
  x = X,
  y = y,
  epochs = 500,
  verbose = 0
)   # Train the model on the data for 500 epochs

#model_pca_full_neural$layers[[2]]$kernel   # Print the kernel (weights) of the second layer
#model_pca_full_neural$layers[[3]]$kernel   # Print the kernel (weights) of the third layer
#model_pca_full_neural$layers[[4]]$kernel   # Print the kernel (weights) of the fourth layer


model_pca_full_neural %>% evaluate(X, y, verbose = 0)   # Evaluate the model on the data
      loss   Accuracy 
0.07700509 0.90028089 
rm(list = c("X", "y"))

Full PCA: Ensemble methods

formula_pca_full <- Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8

# Building a random forest model using the randomForest function
# mtry is again set above the number of predictors, so randomForest resets it
# to the valid maximum (see the warning below), which makes this bagging
# ntree parameter is set to 500, specifying the number of trees in the random forest
model_pca_full_ensemble_bagging <- randomForest(formula_pca_full, 
                                       data=data, 
                                       mtry=9, 
                                       ntree=500)
Warning: The response has five or fewer unique values.  Are you sure you want to do regression?Warning: invalid mtry: reset to within valid range
# Creating a variable importance plot for the random forest model
varImpPlot(model_pca_full_ensemble_bagging, main="Variable Importance")


# Creating a data.table to compare actual survival values with predicted values from the bagging model
model_bag_result_comparison <- data.table(actual = data$Survived,
                                          prediction = predict(model_pca_full_ensemble_bagging, newdata = data)) 

# Building a gradient boosting model using the gbm function
# distribution is set to "gaussian" for regression, but could be modified for classification
# interaction.depth sets the maximum depth of the individual trees
# n.trees is set to 500, indicating the number of boosting iterations
# shrinkage controls the step size when updating the weights of each tree
# cv.folds is set to 5, specifying the number of cross-validation folds for model evaluation
model_pca_full_ensemble_boosting <- gbm(formula = formula_pca_full, 
                               data=data, 
                               distribution="gaussian",
                               interaction.depth=1,
                               n.trees=500, 
                               shrinkage=0.1,
                               cv.folds = 5)

# Calculating the performance of the boosting model using cross-validation
performance_ensemble <- gbm.perf(model_pca_full_ensemble_boosting, method = "cv")


# Generating a summary of the boosting model, but plotit parameter is set to FALSE, so no plot will be generated
summary(model_pca_full_ensemble_boosting, plotit = FALSE)

# Merging the variable importance data from the bagging model with the summary data from the boosting model
# Preparing the data for plotting by transforming percentages and converting to long format using melt function
dt_varimp <- merge(data.table(importance(model_pca_full_ensemble_bagging, type = 2), keep.rownames = TRUE), 
                   data.table(summary(model_pca_full_ensemble_boosting, plotit = FALSE)), 
                   by.x = "rn", by.y = "var")
dt_varimp <- setNames(dt_varimp, c("Variable", "Bagging", "Boosting"))
dt_varimp[, Bagging := 100*Bagging/sum(Bagging)]
dt_varimp <- melt(dt_varimp, id.vars = "Variable", 
                  measure.vars = c("Bagging","Boosting"), 
                  variable.name = "y", 
                  value.name = "x")
dt_varimp <- dt_varimp[order(y,-x)]

# Plotting the variable importance comparison using ggplot2
ggplot(data = dt_varimp) +
  geom_col(aes(y=Variable, x=x, fill=y)) +
  labs(
    x = "Relative Importance",
    y = "Variables") +
  scale_fill_manual(values = c("#f1b147","#4CA7DE")) +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"),
        axis.ticks=element_blank(), 
        legend.position = "none") +
  facet_wrap(~y)


rm(list = c("dt_varimp", "model_bag_result_comparison", "performance_ensemble"))

Notably, in this case our bagging model again performs the same as the bagging model using only the first four PCs. This occurs because the relatively unimportant last PCs are essentially never chosen by the trees.

Data for Validation

test_data <- read_csv(path1)  # Read the CSV file located at the specified path and store the result in the 'test_data' variable
Rows: 418 Columns: 11── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Sex, Ticket, Cabin, Embarked
dbl (6): PassengerId, Pclass, Age, SibSp, Parch, Fare
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test_solution <- read_csv(path2)
Rows: 418 Columns: 2── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): PassengerId, Survived
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test_data$Survived <- test_solution$Survived

test_data <- clean_data(test_data)
[1] "The number of missing values in Age variables: 86"
[1] "The number of missing values in Embarked variables: 0"
[1] "The number of missing values in Fare variables: 1"
test_data <- apply_pca(test_data, model_pca)

test_solution <- as.data.frame(test_data$Survived)
colnames(test_solution) <- "Survived"

rm(list = c("path1", "path2"))

Predictions for validation

Let us perform predictions for different models on the validation data.

# Prediction for logistic regression
test_solution$logistic_prediction <- predict(model_glm, 
                                             newdata = test_data)
# Converting the predicted values to binary predictions (0 or 1); predict.glm
# returns log-odds by default, and a threshold of 0 on the link scale
# corresponds to a probability of 0.5
test_solution$logistic_prediction <- ifelse(test_solution$logistic_prediction > 0, 1, 0)
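An equivalent sketch on the probability scale (p_hat is our own name, not used elsewhere):

# Same labels via predicted probabilities and a 0.5 cutoff
p_hat <- predict(model_glm, newdata = test_data, type = "response")
stopifnot(all(ifelse(p_hat > 0.5, 1, 0) == test_solution$logistic_prediction))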

# Prediction for tree regression
test_solution$tree_prediction <- predict(model_trees, 
                                         newdata = test_data)[,"1"]
# Converting the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5
test_solution$tree_prediction <- ifelse(test_solution$tree_prediction > 0.5, 1, 0)

# Prediction for neural network
test_solution$neural_prediction <- predict(model_neural, 
                                           x = as.matrix(test_data[, c("FirstClass", "SecondClass", "ThirdClass", "Sex", "Age", "Sibsp", "Parch", "Fare", "Southhampton", "Cherbourg", "Queenstown")]), verbose = 0)

# Converting the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5
test_solution$neural_prediction <- ifelse(test_solution$neural_prediction > 0.5, 1, 0)

# Prediction for bagging
test_solution$bagging_prediction <- predict(model_ensemble_bagging, 
                                            newdata = test_data)
# Converting the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5
test_solution$bagging_prediction <- ifelse(test_solution$bagging_prediction > 0.5, 1, 0)

# Prediction for boosting
test_solution$boosting_prediction <- predict(model_ensemble_boosting, 
                                             newdata = test_data)
Using 107 trees...
# Converting the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5
test_solution$boosting_prediction <- ifelse(test_solution$boosting_prediction > 0.5, 1, 0)

##PCA

# Prediction for logistic regression
test_solution$logistic_pca_prediction <- predict(model_pca_glm, 
                                             newdata = test_data)
test_solution$logistic_pca_prediction <- ifelse(test_solution$logistic_pca_prediction > 0, 1, 0)

# Prediction for tree regression
test_solution$tree_pca_prediction <- predict(model_pca_trees, 
                                         newdata = test_data)[,"1"]

test_solution$tree_pca_prediction <- ifelse(test_solution$tree_pca_prediction > 0.5, 1, 0)

# Prediction for neural network
test_solution$neural_pca_prediction <- predict(model_pca_neural, 
                                           x = as.matrix(test_data[, c("PC1", "PC2", "PC3", "PC4")]), verbose = 0)
test_solution$neural_pca_prediction <- ifelse(test_solution$neural_pca_prediction > 0.5, 1, 0)

# Prediction for bagging
test_solution$bagging_pca_prediction <- predict(model_pca_ensemble_bagging, 
                                            newdata = test_data)
test_solution$bagging_pca_prediction <- ifelse(test_solution$bagging_pca_prediction > 0.5, 1, 0)

# Prediction for boosting
test_solution$boosting_pca_prediction <- predict(model_pca_ensemble_boosting, 
                                             newdata = test_data)
Using 305 trees...
test_solution$boosting_pca_prediction <- ifelse(test_solution$boosting_pca_prediction > 0.5, 1, 0)

##Full PCA

# Prediction for logistic regression
test_solution$logistic_pca_full_prediction <- predict(model_pca_full_glm, 
                                             newdata = test_data)
test_solution$logistic_pca_full_prediction <- ifelse(test_solution$logistic_pca_full_prediction > 0, 1, 0)

# Prediction for tree regression
test_solution$tree_pca_full_prediction <- predict(model_pca_full_trees, 
                                         newdata = test_data)[,"1"]
test_solution$tree_pca_full_prediction <- ifelse(test_solution$tree_pca_full_prediction > 0.5, 1, 0)

# Prediction for neural network
test_solution$neural_pca_full_prediction <- predict(model_pca_full_neural, 
                                           x = as.matrix(test_data[, c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8")]), verbose = 0)
test_solution$neural_pca_full_prediction <- ifelse(test_solution$neural_pca_full_prediction > 0.5, 1, 0)

# Prediction for bagging
test_solution$bagging_pca_full_prediction <- predict(model_pca_full_ensemble_bagging, 
                                            newdata = test_data)
test_solution$bagging_pca_full_prediction <- ifelse(test_solution$bagging_pca_full_prediction > 0.5, 1, 0)

# Prediction for boosting
test_solution$boosting_pca_full_prediction <- predict(model_pca_full_ensemble_boosting, 
                                             newdata = test_data)
Using 208 trees...
# Converting the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5
test_solution$boosting_pca_full_prediction <- ifelse(test_solution$boosting_pca_full_prediction > 0.5, 1, 0)

Validation

# Function to calculate correct guesses for a column
# This function calculates the overall accuracy of predictions for a given column by comparing it with the true 'Survived' values.
calculate_correct_guesses <- function(column, survived) {
  sum(column == survived) / length(column)
}

# Function to calculate correct guesses for survived cases in a column
# This function calculates the accuracy of predictions for the 'Survived' cases (where 'Survived' = 1) in a given column.
calculate_correct_guesses_survived <- function(column, survived) {
  sum(column == 1 & survived == 1) / sum(survived == 1)
}

# Function to calculate correct guesses for dead cases in a column
# This function calculates the accuracy of predictions for the 'Dead' cases (where 'Survived' = 0) in a given column.
calculate_correct_guesses_dead <- function(column, survived) {
  sum(column == 0 & survived == 0) / sum(survived == 0)
}
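Equivalently, all three quantities can be read off a confusion matrix. A small sketch for one prediction column, assuming both classes occur among the predictions:

# Confusion matrix for, e.g., the boosting predictions
cm <- table(predicted = test_solution$boosting_prediction,
            actual    = test_solution$Survived)
sum(diag(cm)) / sum(cm)         # overall accuracy
cm["1", "1"] / sum(cm[, "1"])   # accuracy on survivors
cm["0", "0"] / sum(cm[, "0"])   # accuracy on the dead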

# Calculate correct guesses for each column
# The 'sapply()' function applies the 'calculate_correct_guesses()' function to each column of 'test_solution' (excluding the first column, which is 'Survived').
# The result is the overall accuracy of predictions for each column.
correct_guesses <- sapply(test_solution[-1], calculate_correct_guesses, test_solution$Survived)
correct_guesses <- c(1, correct_guesses)


# Calculate correct guesses for survived cases in each column
correct_guesses_survived <- sapply(test_solution[-1], calculate_correct_guesses_survived, test_solution$Survived)
correct_guesses_survived <- c(1, correct_guesses_survived)

# Calculate correct guesses for dead cases in each column
correct_guesses_dead <- sapply(test_solution[-1], calculate_correct_guesses_dead, test_solution$Survived)
correct_guesses_dead <- c(1, correct_guesses_dead)

# Add the three rows to the original dataframe
# Create a new data frame 'solution' with the calculated accuracy values and corresponding column names.
solution <- data.frame("Accuracy" = correct_guesses, 
                 "Accuracy: Survival" = correct_guesses_survived,
                 "Accuracy: Dead" = correct_guesses_dead)
# Remove the first row (with the value 1), which is just the trivial self-comparison of the 'Survived' column with itself.
solution <- solution[-1,]
solution

The table above summarises the output of the various models. The "Accuracy" column shows the overall accuracy of a model, "Accuracy: Survival" its accuracy on the passengers who actually survived, and "Accuracy: Dead" its accuracy on those who did not.

Overall, boosting without PCA does best at predicting our test data, both when considering accuracy on all cases and when only trying to predict the survival of an observation. Predictions of death are, surprisingly enough, better performed by a logistic regression including all PCs. This variant also predicts second best overall, closely followed by the basic tree classification. Notably, bagging and boosting under PCA perform the worst, even though boosting without PCA performed best. As a comparison we also ran the validation on models which include all PCA variables to check whether we could achieve more accuracy that way. Unfortunately, this did not yield better predictions, except for the logistic regression.

It is important to note that the exact numbers may change when the seed is changed.

Cross-Validation

#Optional: Merge training and test data for cross-validation 

full_data <- rbind(data, test_data)
# Check if data was merged before
if (exists("full_data")) {
  cv_data <- full_data
} else {
  cv_data <- data
}

# Define a function to split data into K folds
# Assign a fold number to each row randomly using 'ceiling' and 'sample'.
split_K <- function(dt, folds){
  dt$id <- ceiling(sample(1:nrow(dt), replace = FALSE, nrow(dt)) / 
                     (nrow(dt) / folds))
  return(dt)
}
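Applied to a toy data frame, split_K assigns each row one of K roughly equal-sized fold ids (a small illustration, not part of the analysis):

# 20 rows split into 4 folds of 5 rows each
toy <- data.frame(x = 1:20)
table(split_K(toy, folds = 4)$id)
# 1 2 3 4
# 5 5 5 5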

# Set the number of folds (K) for cross-validation
K <- 10

cv_data <- split_K(cv_data, K)
# Create a data frame 'cross_validation' to store accuracy values for each model and fold
cross_validation <- data.frame(matrix(NA, nrow = 10, ncol = K))
rownames(cross_validation) <- c("Logistic", "Tree Model", "Neural Network", 
                                "Bagging", "Boosting", "PCA Logistic", "PCA Tree Model",
                                "PCA Neural Network", "PCA Bagging", "PCA Boosting")

for (i in 1:K) {
  # Select the current fold as the evaluation set (the models were fitted above and are not re-trained per fold)
  dt <- cv_data[cv_data$id == i, ]
  
  # Prediction for logistic regression
  dt$logistic_prediction <- predict(model_glm, 
                                               newdata = dt)
  dt$logistic_prediction <- ifelse(dt$logistic_prediction > 0, 1, 0)
  
  cross_validation[1, i] <- calculate_correct_guesses(dt$logistic_prediction, dt$Survived)
  
  # Prediction for tree regression
  dt$tree_prediction <- predict(model_trees, 
                                           newdata = dt)[,"1"]
  dt$tree_prediction <- ifelse(dt$tree_prediction > 0.5, 1, 0)
  
  cross_validation[2, i] <- calculate_correct_guesses(dt$tree_prediction, dt$Survived)
  
  # Prediction for neural network
  dt$neural_prediction <- predict(model_neural, x = as.matrix(dt[, c("FirstClass", "SecondClass", "ThirdClass", "Sex", "Age", "Sibsp", "Parch", "Fare", "Southhampton", "Cherbourg", "Queenstown")]), verbose = 0)
  
  dt$neural_prediction <- ifelse(dt$neural_prediction > 0.5, 1, 0)

  cross_validation[3, i] <- calculate_correct_guesses(dt$neural_prediction, dt$Survived)
  
  # Prediction for bagging
  dt$bagging_prediction <- predict(model_ensemble_bagging, 
                                              newdata = dt)
  dt$bagging_prediction <- ifelse(dt$bagging_prediction > 0.5, 1, 0)
  
  cross_validation[4, i] <- calculate_correct_guesses(dt$bagging_prediction, dt$Survived)
  
  # Prediction for boosting
  dt$boosting_prediction <- predict(model_ensemble_boosting, 
                                               newdata = dt)
  dt$boosting_prediction <- ifelse(dt$boosting_prediction > 0.5, 1, 0)
  
  cross_validation[5, i] <- calculate_correct_guesses(dt$boosting_prediction, dt$Survived)
  
  ##PCA
  
  # Prediction for logistic regression
  dt$logistic_pca_prediction <- predict(model_pca_glm, 
                                               newdata = dt)
  dt$logistic_pca_prediction <- ifelse(dt$logistic_pca_prediction > 0, 1, 0)
  
  cross_validation[6, i] <- calculate_correct_guesses(dt$logistic_pca_prediction, dt$Survived)
  
  # Prediction for tree regression
  dt$tree_pca_prediction <- predict(model_pca_trees, 
                                           newdata = dt)[,"1"]
  dt$tree_pca_prediction <- ifelse(dt$tree_pca_prediction > 0.5, 1, 0)
  
  cross_validation[7, i] <- calculate_correct_guesses(dt$tree_pca_prediction, dt$Survived)
  
  # Prediction for neural network
  dt$neural_pca_prediction <- predict(model_pca_neural, x = as.matrix(dt[, c("PC1", "PC2", "PC3", "PC4")]), verbose = 0)
  dt$neural_pca_prediction <- ifelse(dt$neural_pca_prediction > 0.5, 1, 0)
  
  cross_validation[8, i] <- calculate_correct_guesses(dt$neural_pca_prediction, dt$Survived)
  
  # Prediction for bagging
  dt$bagging_pca_prediction <- predict(model_pca_ensemble_bagging, 
                                              newdata = dt)
  dt$bagging_pca_prediction <- ifelse(dt$bagging_pca_prediction > 0.5, 1, 0)
  
  cross_validation[9, i] <- calculate_correct_guesses(dt$bagging_pca_prediction, dt$Survived)
  
  # Prediction for boosting
  dt$boosting_pca_prediction <- predict(model_pca_ensemble_boosting, 
                                               newdata = dt)
  dt$boosting_pca_prediction <- ifelse(dt$boosting_pca_prediction > 0.5, 1, 0)
  
  cross_validation[10, i] <- calculate_correct_guesses(dt$boosting_pca_prediction, dt$Survived)
    
  
}
Using 107 trees...

Using 305 trees...

Using 107 trees...

Using 305 trees...

Using 107 trees...

Using 305 trees...

Using 107 trees...

Using 305 trees...

Using 107 trees...

Using 305 trees...

Using 107 trees...

Using 305 trees...

Using 107 trees...

Using 305 trees...

Using 107 trees...

Using 305 trees...

Using 107 trees...

Using 305 trees...

Using 107 trees...

Using 305 trees...
cross_validation$Mean_accuracy <- rowMeans(cross_validation)
cross_validation$Expected_Loss <- (1 - cross_validation$Mean_accuracy)
cross_validation[, c(11, 12)]

The table indicates the Mean_accuracy and Expected_Loss for each model. Mean_accuracy represents the average accuracy of the model across all 10 folds of the cross-validation process; in other words, it shows how well the model performs on average in predicting survival. Expected_Loss, on the other hand, indicates the proportion of misclassified instances: the lower the expected loss, the better the performance of the model. After cross-validation we find that bagging without PCA seems to perform best, closely followed by bagging with PCA.

---
title: "Statistical Machine Learning Project - Titanic"
subtitle: "Martin-Luther-University Halle-Wittenberg"
author: "Scholz, Clemens & Sir, Onur Akin"
output:
  html_notebook:
    fig_width: 10
    theme: spacelab
    toc: yes
    toc_depth: 3
    toc_float: yes
  html_document:
    fig_width: 10
    theme: spacelab
    toc: yes
    toc_depth: 3
    toc_float: yes
---
clemens.scholz@student.uni-halle.de <br>
akin.sir@student.uni-halle.de

##### Halle (Saale), July 2023.

<br>
<br>

<style>
.math {
  font-size: small;
}
</style>

<script>
$(document).ready(function() {
  $items = $('div#TOC li');
  $items.each(function(idx) {
    num_ul = $(this).parentsUntil('#TOC').length;
    $(this).css({'text-indent': num_ul * 10, 'padding-left': 0});
  });

});
</script>

<br>
<br>


```{r setup, warning=FALSE, message=FALSE, echo=FALSE}
# Importing the essential libraries
#install.packages(c("ggplot2","janitor", "tibble", "lubridate", "data.table", "tidyr", "dplyr", "xlsx", "openxlsx", "filesstrings", "tidyverse", "MASS", "mnormt", "factoextra", "NbClust", "scatterplot3d", "plot3D", "viridisLite", "rpart.plot", "rpart", 'randomForest', 'gbm', "tensorflow", "keras", "patchwork", "randomForest", "gbm"))
#remotes::install_github("grantmcdermott/parttree")


# -------------------> NOTE <------------------- #
# If you encounter an error when executing the Neural Networks section. We suggest installing the following packages and restarting R Studio:

# install.packages("remotes")
# remotes::install_github(paste0("rstudio/", c("reticulate", "tensorflow", "keras")))
# reticulate::install_miniconda() # skip this if you want to self-install conda or use venv
# keras::install_keras()
# ---------------------------------------------- #


library(janitor)
library(plot3D)
library(tibble)
library(lubridate)
library(data.table)
library(viridisLite)
library(rpart)
library(randomForest)
library(gbm)
library(rpart.plot)
library(tensorflow)
library(parttree)
library(patchwork)
library(rpart.plot)
library(tidyr)
library(dplyr)
library(janitor)
library(factoextra)
library(NbClust)
library(readxl)    
library(openxlsx)
library(filesstrings)
library(scatterplot3d)
library(ggplot2)
library(tidyverse)
library(ggcorrplot)
library(ggExtra)
library(keras)
library(ggridges)
library(MASS)
library(mnormt)
library(randomForest)

rm(list=ls())

```

```{r}
# Add paths to the different data sets

# path for training data set
path <- "/Users/clemensscholle/Desktop/Master/Statistical_Machine_Learning/Project/titanic/train.csv"
# path for test data set
path1 <- "/Users/clemensscholle/Desktop/Master/Statistical_Machine_Learning/Project/titanic/test.csv"

# path for the corresponding dependent variables for the test data set 
path2 <- "/Users/clemensscholle/Desktop/Master/Statistical_Machine_Learning/Project/titanic/result.csv"
```


<br>
<br>

# Data Dictionary
```{r}
 ## Variable              ## Definition               ## Key

 # survived               # Survival                 # 0 = No, 1 = Yes
 # pclass                 # Ticket class             # 1, 2, 3 (e.g. 1 for first class.)
 # sex                    # Sex                      # 0 = Female, 1 = Male
 # Age                    # Age in years               
 # Sibsp                  # [#] of siblings / spouses abroad the Titanic (Brother, sister... & husband, wife)
 # parch                  # [#] of parents / children abroad the Titanic (mother, father, daugher, son...)
 # fare                   # Passenger fare
 # embarked               # Port of Embarkation      # C = Cherbourg, Q = Queenstown, S = Southampton

```

It is important to acknowledge that the variable called `Passenger ID` was not useful for our analysis. We, therefore, dropped this column.
Next, we assigned a binary variable(s) for `Survived` variable, which is better illustration for showing if the person alive (1) or dead (0). When it comes to the graphical illustration we decided to choose "`deepskyblue2"` colour for the person who survived, `"black"` otherwise.

The `Fare` variable indicates the passenger fare the person paid. Being able to pay higher prices may be good illustration of the wealth of passenger.

There are three different classes on the ship, which were indicated as 1, 2 and 3. We created separate columns for each class in order to assign a binary variable for each class. Thus, we will be able to analyse them. For example, 1 if the passenger is in the first class, 0 otherwise. It is clear that the variable `Pclass` is also essential for our analysis. It might be a good illustration of the wealth of the passengers who were able to afford first class. Thus, it is highly relevant to the `Fare` variable.
Reasonably, `FirstClass` passengers had more luxuries available to them. Most of first class was located in the middle hull and in the upper structures. There were two staircases connecting the different decks. `SecondClass` was distributed along the height in the middle of the ship with only one staircase connecting the different decks. Amenities there would be the same as for the first class of other ships of the time. The sole staircase may have impeded evacuation during the crash. `ThirdClass` was located on the deeper parts of the ship as well as parts directly on the bow and the rear of the ship. The location of the cabins would have obvious disadvantages during the crash.

In addition, we assigned a binary variable for determining the gender of the observed passengers. Therefore, `Sex` equals 1 if the passenger is a man and 0 otherwise. It is important to note that during evacuation women were given priority, whilst men might have higher chances of survival due to their physique. Children were prioritized during evacuation but they would also be less likely to survive due to their lesser strength compared to adults. Older people may exhibit the same impediment to a higher degree.

When it comes to the `Age` variable, it indicates the age of the passenger. Unfortunately, the `Age` variable was not available for all cases of this data set. We, therefore, decided to drop these missing cases in the cleaning process.

The variables `SibSp` and `Parch` were related to their family bonds. The `SibSp` variable showing how many siblings or spouses were with the person aboard the ship. We assumed that higher numbers may mean that more people assisted the person during evacuation or that they may have spend valuable time during evacuation looking for relatives. The `Parch` variable showing how many parents or children were with the person aboard the ship. In case of young children this variable may increase survival as parents will try to save their children. We also assumed that in case of parents this may lower chances of survival in case they were looking for their children or gave up seats on a lifeboat for them or it may improve survival chances as they would gain a seat on a lifeboat together with their children.

Finally, the `Embarked` variable indicates the port of embarkation of the passenger. There were three different ports of embarkation available, that was: Southhampton (S), Cherbourg (C), and Queenstown (Q). Similarly, we created separate columns for each port of embarkation of the passengers in order to assign a binary variable for each of them. Furhermore, we assumed that embarking earlier offers the passenger more time and opportunity to explore the ship, know their surroundings and get used to sea travel and possible sea sickness. This factor may be compounded by earlier passengers having a less populated ship to explore.


```{r}
# Importing the dataset and preparing data to analysis

clean_data <- function(raw_data) {
  # Find NA values in Age and Embarked
  print(paste0("The number of missing values in Age variables: ", sum(is.na(raw_data$Age))))
  print(paste0("The number of missing values in Embarked variables: ", sum(is.na(raw_data$Embarked))))
  print(paste0("The number of missing values in Fare variables: ", sum(is.na(raw_data$Fare))))
  
  
  # Delete rows with NA values
  raw_data <- raw_data[complete.cases(raw_data$Age), ]
  raw_data <- raw_data[complete.cases(raw_data$Embarked), ]  
  raw_data <- raw_data[complete.cases(raw_data$Fare), ]  
  
  # Converting Sex variable as boolean type.
  raw_data$Sex <- ifelse(raw_data$Sex == "male", 1, 0)
  
  # Putting dummies for our variables which are Pclass and Embarked (thus, we will be able to analyse them in later process)
  # Basically what we do is creating a new column for all Embarked destionations Southhamptopn, Cherbourg and Queenstown with binary values. For example, in Southhampton 1 if "Embarked" is "S", 0 otherwise
  raw_data$Southhampton <- ifelse(raw_data$Embarked == "S", 1, 0)
  raw_data$Cherbourg <- ifelse(raw_data$Embarked == "C", 1, 0)
  raw_data$Queenstown <- ifelse(raw_data$Embarked == "Q", 1, 0)
  
  # Let us do the same process for the Pclass variable:
  raw_data$FirstClass <- ifelse(raw_data$Pclass == "1", 1, 0)
  raw_data$SecondClass <- ifelse(raw_data$Pclass == "2", 1, 0)
  raw_data$ThirdClass <- ifelse(raw_data$Pclass == "3", 1, 0) 
  
  # Creating new data frame for cleaned data
  clean_data <- data.frame(Name = raw_data$Name, Sex = raw_data$Sex, Age = raw_data$Age, Sibsp = raw_data$SibSp, Parch = raw_data$Parch, FirstClass = raw_data$FirstClass, SecondClass = raw_data$SecondClass, ThirdClass = raw_data$ThirdClass ,Southhampton = raw_data$Southhampton, Cherbourg = raw_data$Cherbourg, Queenstown = raw_data$Queenstown, Fare = raw_data$Fare, Survived = raw_data$Survived, Embarked = raw_data$Embarked, Pclass = raw_data$Pclass)
  
  return(clean_data)
  
}

raw_data <- read_csv(path)

data <- clean_data(raw_data)

data

# Clean up
rm(list = c("raw_data"))
```
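
As a quick, informal check of the family-bond assumptions discussed above, mean survival by `Sibsp` and `Parch` gives a first impression (a minimal sketch on the cleaned data):
```{r}
# Mean survival rate by number of siblings/spouses and by parents/children
aggregate(Survived ~ Sibsp, data = data, FUN = mean)
aggregate(Survived ~ Parch, data = data, FUN = mean)
```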

<br>

## Basic calculations of the dataset
```{r}
# Let's illustrate the basic calculations
total_passenger <- length(data$Sex)
total_female <- sum(data$Sex == 0)
total_male <- sum(data$Sex == 1)
total_survival <- sum(data$Survived == 1)
total_female_survival <- sum(data$Sex == 0 & data$Survived == 1)
total_male_survival <- sum(data$Sex == 1 & data$Survived == 1)

## Class calculations ##
total_pass_1_class <- sum(data$Pclass == 1)
total_pass_1_class_survived <- sum(data$Pclass == 1 & data$Survived == 1)
total_pass_2_class <- sum(data$Pclass == 2)
total_pass_2_class_survived <- sum(data$Pclass == 2 & data$Survived == 1)
total_pass_3_class <- sum(data$Pclass == 3)
total_pass_3_class_survived <- sum(data$Pclass == 3 & data$Survived == 1)
## Class calculation ends ##

```


```{r}
ggplot(data, aes(x = factor(Survived), fill = factor(Sex))) +
  geom_bar() +
  labs(
    title = "Number of female and male passengers survived",
    caption = "Sex: 0 = Female, 1 = Male; \n Survived: 0 = No, 1 = Yes",
    x = "Number of passengers survived",
    y = "Count",
    fill = "Sex") +
  scale_fill_manual(values = c("0" = "orchid3", "1" = "skyblue3")) +
  theme(axis.line = element_line(color = "#000000"))

## General information ##
section1_gen_info_1 <- paste0("The total number of passengers: ", total_passenger)
section1_gen_info_2 <- paste0("The total number of female passengers: ", total_female)
section1_gen_info_3 <- paste0("The total number of male passengers: ", total_male)
section1_gen_info_4 <- paste0("The total number of passengers who survived: ", total_survival)
section1_gen_info_5 <- paste0("The total number of female passengers who survived: ", total_female_survival)
section1_gen_info_6 <- paste0("The total number of male passengers who survived: ", total_male_survival)
## General information ends ##

for (i in 1:6) {
  variable_name <- paste0("section1_gen_info_", i)
  variable_value <- get(variable_name)
  print(variable_value)
}



# Plotting the age distribution through a histogram, together with an estimate of the density.
ggplot(data = data, aes(x=Age)) +
  geom_histogram(aes(y = after_stat(density))) +
  geom_density(fill="#f1b147", color="#f1b147", alpha=0.11) +
  labs(
    title = "Age distribution",
    x = "Age",
    y = "Density") +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"))

# Plotting age against fare, with survival as colour and Parch as point size.
ggplot() +
  geom_point(data = data, aes(x = Age, y = Fare, color = factor(Survived), size = Parch)) +
  scale_color_manual(values = c("black", "deepskyblue3")) +
  labs(
    title = "Age & Fare & Familial relations & Survival",
    x = "Age",
    y = "Fare",
    color = "Survival") +
  theme(axis.line = element_line(color = "#000000"))


scatter3D(data$Age, data$Fare, data$Pclass, 
        theta=22, phi=29, 
        xlab="Age of passengers", ylab="Fare", zlab="Ticket classes of passengers",
        ticktype="detailed",
        pch = 20,
        cex.axis=0.45, cex.lab=0.95, expand=0.73, 
        colvar = data$Survived,
        col = c("black", "deepskyblue3"))

# Illustrating scatter plot of Survival vs. Age with color by Sex
ggplot(data = data, aes(x = Age, y = factor(Survived), color = factor(Sex))) +
  geom_jitter(width = 0.3, height = 0.3) +
  labs(
    title = "Distribution of survival by age and gender",
    caption = "Sex: 0 = Female, 1 = Male; \n Survived: 0 = No, 1 = Yes",
    x = "Age of passengers",
    y = "Surviving passengers",
    color = "Sex"
  ) +
  theme(axis.line = element_line(color = "#000000"))

ggplot(data = data, aes(x = "", y = Pclass, fill = factor(Pclass))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(
    title = "Number of passengers from each classes",
    fill = "Class"
  ) +
  theme_void()


## Class information ##
section1_gen_class_info_1 <- paste0("The total number of passengers from first class: ", total_pass_1_class)
section1_gen_class_info_2 <- paste0("The total number of passengers from first class who survived: ", total_pass_1_class_survived)
section1_gen_class_info_3 <- paste0("The total number of passengers from second class: ", total_pass_2_class)
section1_gen_class_info_4 <- paste0("The total number of passengers from second class who survived: ", total_pass_2_class_survived)
section1_gen_class_info_5 <- paste0("The total number of passengers from third class: ", total_pass_3_class)
section1_gen_class_info_6 <- paste0("The total number of passengers from third class who survived: ", total_pass_3_class_survived)
## Class information ends ##

for (i in 1:6) {
  variable_name <- paste0("section1_gen_class_info_", i)
  variable_value <- get(variable_name)
  print(variable_value)
}


# Scatter plot of Sex vs. Pclass with color by Survived
ggplot(data = data, aes(x = factor(Pclass), y = factor(Survived), color = factor(Sex))) +
  geom_jitter(width = 0.3, height = 0.3) +
  labs(
    title = "Scatter plot of gender and class coloured by survival",
    caption = "Sex: 0 = Female, 1 = Male; \n Survived: 0 = No, 1 = Yes \n 1 = first class, 2 = second class, 3 = third class",
    x = "Ticket classes of passengers",
    y = "Number of passengers survived", 
    color = "Sex") +
  scale_color_manual(values = c("0" = "orchid3", "1" = "skyblue4")) +
  theme(axis.line = element_line(color = "#000000"))

# Clean up
rm(list = c("i", "section1_gen_class_info_1", "section1_gen_class_info_2", "section1_gen_class_info_3", "section1_gen_class_info_4", "section1_gen_class_info_5", "section1_gen_class_info_6", "section1_gen_info_1", "section1_gen_info_2", "section1_gen_info_3", "section1_gen_info_4", "section1_gen_info_5", "section1_gen_info_6", "total_female", "total_female_survival", "total_male", "total_male_survival", "total_pass_1_class", "total_pass_2_class", "total_pass_3_class", "total_pass_1_class_survived", "total_pass_2_class_survived", "total_pass_3_class_survived", "total_passenger", "total_survival", "variable_name", "variable_value"))

```

<br>
<br>

# Clustering
```{r}
# We work with the two columns Age and Fare, because all other columns only contain small discrete values, such as binary indicators (0/1) or the three class levels.
set.seed(1234)
dt_kmeans <- data[, c("Age", "Fare")]

# First of all, let us start with k-means using two centers (k = 2).
model_kmeans <- kmeans(dt_kmeans, centers = 2)

clustering_1_info1 <- paste0("The total number of observations for first cluster is ", model_kmeans$size[1], ".")
clustering_1_info2 <- paste0("The total number of observations for second cluster is ", model_kmeans$size[2], ".")
print(c(clustering_1_info1, clustering_1_info2 ))

# Here are the centers of our clusters:
print(model_kmeans$centers)

# Let us visualise it by showing the centers as well
ggplot(data = dt_kmeans, aes(x = Age, y = Fare, color = as.factor(model_kmeans$cluster))) +
  geom_point() +
  # Let us illustrate the centers of clusters
  geom_point(data = data.frame(model_kmeans$centers),
             shape = "circle filled",
             size = 7, 
             color = "#000000",
             fill = c("deepskyblue2", "gold2"),
             alpha = 0.75) +
  guides(color = "none") +
  scale_color_manual(values = c("deepskyblue2","gold2")) +
  labs(
    x = "Age of passengers",
    y = "Passenger fare") +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"))


# Let us standardise inputs
# We have seen the standardization of variables before: Every variable is centered on its mean and then scaled by its standard deviation.
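# In symbols: z = (x - mean(x)) / sd(x), applied per column; this is exactly what scale() computes below.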
dt_kmeans_scaled <- data.table(scale(dt_kmeans))

model_kmeans_scaled <- kmeans(dt_kmeans_scaled, centers = 2)

ggplot(data = dt_kmeans_scaled, aes(x = Age, y = Fare, color = as.factor(model_kmeans_scaled$cluster))) +
  geom_point() +
  geom_point(data = data.frame(model_kmeans_scaled$centers),
             shape = "circle filled",
             size = 7, 
             color = "#000000",
             fill = c("deepskyblue2", "gold2"),
             alpha = 0.75) +
  guides(color="none") +
  scale_color_manual(values = c("deepskyblue2", "gold2")) +
  labs(
    x="Age of passengers",
    y= "Passenger fare") +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"))
```

<br>

We find two clusters that split mainly along the age variable.

## Elbow plot
We also would like to use an elbow plot to find the optimal k. For this we need the total within sum of squares (TWSS) of our scaled model, i.e. the sum of the squared distances of all data points to their respective cluster centers.
First, we loop through different values of k, from 1 to 25 in this case. Then we draw an elbow plot to investigate how sensitive the TWSS is to the number of clusters k.
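Formally, for clusters $C_1,\dots,C_k$ with centers $\mu_1,\dots,\mu_k$, the quantity we track is
$$\mathrm{TWSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$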
```{r}
k <- 25
elbow <- data.table("k" = 1:k, "WSS" = NA)

# Loop over the candidate values of k
for (i in 1:k) {
  elbow$k[i] <- i
  # Store the total within sum of squares of a k-means fit with i centers
  elbow$WSS[i] <- kmeans(dt_kmeans_scaled, centers = i, nstart=50)$tot.withinss
}

# Illustrating our findings with the help of line chart
ggplot(data = elbow, aes(x = k, y = WSS)) +
  geom_line() +
  labs(
    title = "Elbow plot",
    x = "k",
    y = "TWSS") +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"))

NbClust(data=dt_kmeans_scaled, method="kmeans")
# According to the majority rule, the best number of clusters is  3 
```
It is clear that 3 centers are more suitable for our data, so let us replicate our clustering with 3 centers.

```{r}
dt_kmeans_scaled <- data.table(scale(dt_kmeans))

model_kmeans_scaled <- kmeans(dt_kmeans_scaled, centers = 3)

ggplot(data = dt_kmeans_scaled, aes(x = Age, y = Fare, color = as.factor(model_kmeans_scaled$cluster))) +
  geom_point() +
  geom_point(data = data.frame(model_kmeans_scaled$centers),
             shape = "circle filled",
             size = 7, 
             color = "#000000",
             fill = c("deepskyblue2", "gold2", "forestgreen"),
             alpha = 0.75) +
  guides(color="none") +
  scale_color_manual(values = c("deepskyblue2", "gold2", "forestgreen")) +
  labs(
    x="Age of passengers",
    y= "Passenger fare") +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"))

# Clean up
rm(list = c("k", "i", "clustering_1_info1", "clustering_1_info2", "dt_kmeans", "dt_kmeans_scaled", "elbow", "model_kmeans", "model_kmeans_scaled"))

```

<br>
<br>


# Logistic regression
```{r}
# Estimate the logistic model
model_glm <- glm(formula = Survived ~ FirstClass + SecondClass + Sex + Age + Sibsp + Parch + Fare + Cherbourg + Queenstown, 
                 data = data, family = binomial(link = "logit"))

# View summary statistics
print(summary(model_glm))

# Fit of the model: likelihood-ratio test against the null (intercept-only) model
with(model_glm, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
# A p-value below 0.001 indicates that the model performs better than an empty model

```
Looking at the results of the model, it is clear that being in first class, being female and being younger were statistically significant predictors of survival, while the number of siblings and the other variables did not show a significant influence on survival in the given data set.
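
To make these effect sizes easier to read, the coefficients can be exponentiated into odds ratios; a short sketch based on the `model_glm` fitted above:
```{r}
# exp(beta): multiplicative change in the odds of survival per unit change
# of each regressor (values below 1 lower the odds)
round(exp(coef(model_glm)), 2)
```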

<br>

# Tree
```{r}
### Base tree 
# Build a decision tree model using rpart
model_tree <- rpart(data = data, formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch + Fare + Embarked, method = "class")

# Plot the decision tree model using prp, with extra details (extra = 2)
prp(model_tree, extra = 2, main = "Base tree for model")

# Print and plot data on tree
printcp(model_tree)
plotcp(model_tree)
# The tree shows that the first deciding factor seems to be gender. Afterwards mainly class and age seem to be prominent deciding factors for the branches, though interestingly fare seems to play an important role as well.

# Determine the minimum complexity parameter (cp) for pruning
#cp parameter: captures model complexity
min_cp <- model_tree$cptable[which.min(model_tree$cptable[, "xerror"]), "CP"]

# Prune the decision tree model using the minimum cp
model_tree_2 <- prune(model_tree, cp = min_cp)

# Plot the pruned decision tree model using prp, with extra details (extra = 2)
prp(model_tree_2, extra = 2, main = "Pruned base tree with less complexity")

# The branch determined by being male and over the age of 6 shows the biggest group and the most wrong guesses of the model.

# Calculate the mean accuracy of the model predictions
mean(predict(model_tree, data, type = "class")==data$Survived)

mean(predict(model_tree_2, data, type = "class")==data$Survived)

# Between the accuracy for the pruned and not-pruned model there is only a 2% difference so we will choose the pruned model as representative as it is less complex
model_trees <- model_tree_2

# Print and plot data on tree
printcp(model_tree_2)
plotcp(model_tree_2)

### Visualization for interesting variable combinations
# Convert 'Survived' variable to a factor
tree_data <- data
tree_data$Survived <- as.factor(tree_data$Survived) 

# Build a decision tree model using rpart
model_tree_4 <- rpart(data = tree_data, formula = Survived ~ Age + Sibsp, method = "class")

# Plot the decision tree model using prp
prp(model_tree_4, extra = 2, main = "Tree for Age and Sibling/Spouse")

# Determine the minimum complexity parameter (cp) for pruning
min_cp <- model_tree_4$cptable[which.min(model_tree_4$cptable[, "xerror"]), "CP"]

# Prune the decision tree model using the minimum cp
model_tree_5 <- prune(model_tree_4, cp = min_cp)

# Plot the pruned decision tree model using prp
prp(model_tree_5, extra = 2, main = "Pruned tree for Age and Sibling/Spouse")

# Print and plot data on tree
printcp(model_tree_5)
plotcp(model_tree_5)



# Create a scatter plot with decision tree boundaries for Age and Sibsp variables
plot1 <- ggplot(data = tree_data, aes(x = Sibsp, y = Age)) +
  geom_point(aes(color = Survived)) +
  geom_parttree(data = model_tree_5, aes(fill = Survived), alpha = 0.1) +
  scale_color_manual(values = c("black", "deepskyblue2")) +
  scale_fill_manual(values = c("black", "deepskyblue2")) +
  guides(color = "none") +
  labs(
    x = "Sibsp",
    y = "Age",
    color = "Survived", 
    title = "Scatter plot of age and familial relations showing survival and tree estimation of survival"
  ) +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"), legend.position = "none")

# Build another decision tree model using rpart
model_tree_4 <- rpart(data = tree_data, formula = Survived ~ Age + Parch, method = "class")

# Plot the decision tree model using prp
prp(model_tree_4, extra = 2, main = "Tree for Age and Parent/Child")

# Determine the minimum complexity parameter (cp) for pruning
min_cp <- model_tree_4$cptable[which.min(model_tree_4$cptable[, "xerror"]), "CP"]

# Prune the decision tree model using the minimum cp
model_tree_6 <- prune(model_tree_4, cp = min_cp)

# Plot the pruned decision tree model using prp
prp(model_tree_6, extra = 2, main = "Pruned tree for Age and Parent/Child")

# Print and plot data on tree
printcp(model_tree_6)
plotcp(model_tree_6)


# Create a scatter plot with decision tree boundaries for Age and Parch variables
plot2 <- ggplot(data = tree_data, aes(x = Parch, y = Age)) +
  geom_point(aes(color = Survived)) +
  geom_parttree(data = model_tree_6, aes(fill = Survived), alpha = 0.1) +
  scale_color_manual(values = c("black", "deepskyblue2")) +
  scale_fill_manual(values = c("black", "deepskyblue2")) +
  guides(color = "none") +
  labs(
    x = "Parch",
    y = "Age",
    color = "Survived"
  ) +
  theme_minimal() +
  theme(
    axis.line = element_line(color = "#000000"),
    legend.position = "bottom",
    legend.box = "horizontal"
  )

# Combine the two plots
plot1 + plot2

# Visualizing age and fare, the only variables with continuously distributed values.
# Build a decision tree model using rpart
model_tree_4 <- rpart(data = tree_data, formula = Survived ~ Age + Fare, method = "class")

# Plot the decision tree model using prp
prp(model_tree_4, extra = 2, main = "Tree for Age and Fare")

# Determine the minimum complexity parameter (cp) for pruning
min_cp <- model_tree_4$cptable[which.min(model_tree_4$cptable[, "xerror"]), "CP"]

# Prune the decision tree model using the minimum cp
model_tree_7 <- prune(model_tree_4, cp = min_cp)

# Plot the pruned decision tree model using prp
prp(model_tree_7, extra = 2, main = "Pruned tree for Age and Fare")

# Print and plot data on tree
printcp(model_tree_7)
plotcp(model_tree_7)


# Create a scatter plot with decision tree boundaries for Age and fare variables
ggplot(data = tree_data, aes(x = Fare, y = Age)) +
  geom_point(aes(color = Survived)) +
  geom_parttree(data = model_tree_7, aes(fill = Survived), alpha = 0.1) +
  scale_color_manual(values = c("black", "deepskyblue2")) +
  scale_fill_manual(values = c("black", "deepskyblue2")) +
  guides(color = "none") +
  labs(
    x = "Fare",
    y = "Age",
    color = "Survived",
    title = "Scatter plot of age and fare showing survival and tree estimation of survival"
  ) +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"), legend.position = "none")

# This visualization, though it looks quite nice, does not seem to indicate any notable impact of fare.

# Clean up
rm(list = c("min_cp", "plot1", "plot2", "model_tree", "model_tree_2", "model_tree_4", "model_tree_5", "model_tree_6", "model_tree_7", "tree_data"))


```
There is only a small difference between the pruned and the complex base tree: their accuracies differ by only about 2%, so we choose the pruned model as representative since it is less complex. We believe this will save computational memory. Moreover, the higher-complexity tree only improves on a less relevant branch.

The more important branch, determined by being male and over the age of 6, contains the biggest group and the most wrong guesses of the model.
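
To see where these wrong guesses accumulate, a confusion matrix of the pruned tree on the training data offers a quick check; a minimal sketch using `model_trees` from above:
```{r}
# Rows: predicted class; columns: actual outcome (0 = No, 1 = Yes)
table(predicted = predict(model_trees, data, type = "class"),
      actual = data$Survived)
```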

## Investigating hyperparameters for the tree model
```{r}
### Investigating hyperparameters
# Define the values for minimum observations, maximum depth, and complexity parameter (cp)
min_obs <- seq(from = 5, to = 50, by = 5)
max_depth <- c(1, 5, 10, 15)
cp_val <- c(0.001, 0.01, 0.05, 0.1)

# Create an empty data frame to store the results
results <- data.frame()

# Iterate over the combinations of min_obs, max_depth, and cp_val
for (i in 1:length(min_obs)) {
  for (j in 1:length(max_depth)) {
    for (k in 1:length(cp_val)) {
      # Build a decision tree model with the specified parameters
      model <- rpart(data = data, formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch + Fare + Embarked, 
                     method = "class",
                     cp = cp_val[k],
                     maxdepth = max_depth[j],
                     minbucket = min_obs[i])
      
      # Calculate the accuracy of the model predictions
      acc <- mean(predict(model, data, type = "class") == data$Survived)
      
      # Append the accuracy and parameter values to the results data frame
      results <- rbind(results, c(acc, min_obs[i], max_depth[j], cp_val[k]))
    }
  }
}

# Set column names for the results data frame
colnames(results) <- c("accuracy", "min_obs", "max_depth", "cp")

# Create a 3D scatter plot using scatter3D
plot3D::scatter3D(x = results$min_obs, y = results$max_depth, z = results$cp,
                  theta = 30, phi = 10,
                  xlab = "Min Obs", ylab = "Max Depth", zlab = "CP",
                  ticktype = "detailed",
                  pch = 19,
                  cex.axis = 0.75, cex.lab = 0.75, expand = 0.75,
                  colvar = results$accuracy, col = viridis(100))


# Select the optimal parameter combination with the highest accuracy
opt_params <- results[which.max(results$accuracy),]

# Build a decision tree model using the optimal parameters
model_tree_3 <- rpart(data=data, formula = Survived ~ Pclass + Sex + Age + Sibsp + Parch + Fare + Embarked, 
                      method = "class",
                      cp = opt_params$cp, 
                      maxdepth = opt_params$max_depth, 
                      minbucket = opt_params$min_obs)

# Calculate the mean accuracy of the model predictions
mean(predict(model_tree_3, data, type = "class") == data$Survived)

# Plot the pruned decision tree with additional details
prp(model_tree_3, extra = 2, main = "Tree with optimal parameters")

# Print and plot data on tree
printcp(model_tree_3)
plotcp(model_tree_3)

# Clean up
rm(list = c("acc", "cp_val", "i", "j", "k", "max_depth", "min_obs", "results", "opt_params", "model_tree_3", "model"))
```

The uncertain branch discussed before (male, age above 6) can reasonably be further distinguished by class and perhaps age. Beyond that, the classification seems to depend mostly on fare, with the decisions along the branch yielding no real insights. The optimal model shows only a marginal improvement over the accuracy of the earlier models.

<br>
<br>

# Neural Networks
```{r}
set.seed(12345)   # Set a seed for reproducibility

# Data setup
X <- as.matrix(data[, c("FirstClass", "SecondClass", "ThirdClass", "Sex", "Age", "Sibsp", "Parch", "Fare", "Southhampton", "Cherbourg", "Queenstown")])
y <- data$Survived

## Model as in lecture

normalize <- layer_normalization()   # Create a normalization layer
normalize %>% adapt(X)   # Adapt the normalization layer to the data

print(normalize$mean)   # Print the mean values used for normalization

model_neural <- keras_model_sequential()   # Create a sequential model

model_neural %>%
  normalize() %>%   # Apply normalization

  layer_dense(name = "HiddenLayer_1", units = 16, activation = 'relu') %>%   # Add a hidden layer with 16 units and ReLU activation
  layer_dense(name = "HiddenLayer_2", units = 16, activation = 'relu') %>%   # Add another hidden layer with 16 units and ReLU activation
  layer_dense(name = "HiddenLayer_3", units = 16, activation = 'relu') %>%   # Add a third hidden layer with 16 units and ReLU activation

   layer_dense(name = "OutputLayer", units = 1)   # Add an output layer with 1 unit

summary(model_neural)   # Print the summary of the model

#plot_model(model_base)

model_neural %>% compile(
  loss = "MSE",
  optimizer = optimizer_adam(learning_rate = 0.001),
  metrics = "Accuracy"
)   # Compile the model with mean squared error loss, Adam optimizer, and accuracy as a metric

model_neural %>% fit(
  x = X,
  y = y,
  epochs = 500,
  verbose = 0
)   # Train the model on the data for 500 epochs

# model_neural$layers[[2]]$kernel   # Print the kernel (weights) of the second layer
# model_neural$layers[[3]]$kernel   # Print the kernel (weights) of the third layer
# model_neural$layers[[4]]$kernel   # Print the kernel (weights) of the fourth layer


model_neural %>% evaluate(X, y, verbose = 0)   # Evaluate the model on the data

rm(list = c("X", "y"))
```

## Investigating hyperparameters for the neural networks
```{r}
### Investigating hyperparameters
# Define the values
learning_rate <- c(0.0001, 0.001, 0.01, 0.1)  # Learning rate
width <- c(4, 16, 24) # Number of neurons per hidden layer
length <- c(1, 3, 10) # Number of hidden layers (note: this shadows base::length(), which still works when called as a function)

# Data setup
X <- as.matrix(data[, c("FirstClass", "SecondClass", "ThirdClass", "Sex", "Age", "Sibsp", "Parch", "Fare", "Southhampton", "Cherbourg", "Queenstown")])
y <- data$Survived

# Create an empty data frame to store the results
results <- data.frame()

# Iterate over the combinations
for (i in 1:length(learning_rate)) {
  for (j in 1:length(width)) {
    for (k in 1:length(length)) {

      #print(i)
      #print(j)
      #print(k)

      if (length[k] == 1) { # One layer model

        normalize_i <- layer_normalization()   # Create a normalization layer
        normalize_i %>% adapt(X)   # Adapt the normalization layer to the data

        model_neural_i <- keras_model_sequential()   # Create a sequential model

        model_neural_i %>%
          normalize_i() %>%   # Apply normalization

          layer_dense(name = "HiddenLayer_1", units = width[j], activation = 'relu') %>%

           layer_dense(name = "OutputLayer", units = 1)   # Add output layer with 1 unit

        model_neural_i %>% compile(loss="MSE", optimizer= optimizer_adam(learning_rate = learning_rate[i]), metrics="Accuracy")

        model_neural_i %>% fit(x = X, y = y, epochs = 250, verbose = 0)

      } else if (length[k] == 3) { # Three layer model

        normalize_i <- layer_normalization()   # Create a normalization layer
        normalize_i %>% adapt(X)   # Adapt the normalization layer to the data

        model_neural_i <- keras_model_sequential()   # Create a sequential model

        model_neural_i %>%
          normalize_i() %>%   # Apply normalization

          layer_dense(name = "HiddenLayer_1", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_2", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_3", units = width[j], activation = 'relu') %>%

           layer_dense(name = "OutputLayer", units = 1)   # Add output layer with 1 unit

        model_neural_i %>% compile(loss="MSE", optimizer=optimizer_adam(learning_rate = learning_rate[i]), metrics="Accuracy")

        model_neural_i %>% fit(x = X, y = y, epochs = 250, verbose = 0)

      } else if (length[k] == 10) { # Ten layer model

        normalize_i <- layer_normalization()   # Create a normalization layer
        normalize_i %>% adapt(X)   # Adapt the normalization layer to the data

        model_neural_i <- keras_model_sequential()   # Create a sequential model

        model_neural_i %>%
          normalize_i() %>%   # Apply normalization

          layer_dense(name = "HiddenLayer_1", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_2", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_3", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_4", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_5", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_6", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_7", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_8", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_9", units = width[j], activation = 'relu') %>%
          layer_dense(name = "HiddenLayer_10", units = width[j], activation = 'relu') %>%

           layer_dense(name = "OutputLayer", units = 1)   # Add output layer with 1 unit

        model_neural_i %>% compile(loss="MSE", optimizer=optimizer_adam(learning_rate = learning_rate[i]), metrics="Accuracy")

        model_neural_i %>% fit(x = X, y = y, epochs = 250, verbose = 0)

      }

      # Calculate the accuracy of the model predictions
      eval <- model_neural_i %>% evaluate(X, y, verbose = 0)
      acc <- eval[2]

      # Append the accuracy and parameter values to the results data frame
      results <- rbind(results, c(acc, learning_rate[i], width[j], length[k]))
    }
  }
}

# Set column names for the results data frame
colnames(results) <- c("accuracy", "learning_rate", "width", "length")

# Create a 3D scatter plot using scatter3D
plot3D::scatter3D(x = results$learning_rate, y = results$width, z = results$length,
                  theta = 30, phi = 10,
                  xlab = "Learning Rate", ylab = "Width", zlab = "Length",
                  ticktype = "detailed",
                  pch = 19,
                  cex.axis = 0.75, cex.lab = 0.75, expand = 0.75,
                  colvar = results$accuracy, col = viridis(100))


# Select the optimal parameter combination with the highest accuracy
opt_params <- results[which.max(results$accuracy),]

opt_params

# Clean up
rm(list = c("X", "y", "i", "j", "k", "width", "length", "learning_rate", "results", "opt_params", "normalize_i", "model_neural_i", "eval", "acc"))
```
Looking at the graph, it is evident that the model improves with more neurons per layer as well as with a higher number of layers, but to save computational memory we still decided on three layers with sixteen neurons each. Furthermore, our investigation shows that the best learning rate in our case is the default of 0.001.
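
The repeated if-else blocks above could also be collapsed into a single helper. The following is only a sketch (the function `build_nn` is our own naming, not part of the pipeline) that builds and compiles the same architecture for any depth and width; each grid combination would then reduce to one `build_nn()` call followed by `fit()`:
```{r}
# Sketch of a model-building helper: same architecture as above, with the
# number of hidden layers and the units per layer as arguments
build_nn <- function(X, n_layers, n_units, lr) {
  normalizer <- layer_normalization()
  normalizer %>% adapt(X)                  # fit the normalization to the data
  model <- keras_model_sequential() %>%
    normalizer()
  for (l in seq_len(n_layers)) {           # add the hidden layers in a loop
    model %>% layer_dense(name = paste0("HiddenLayer_", l),
                          units = n_units, activation = "relu")
  }
  model %>% layer_dense(name = "OutputLayer", units = 1)
  model %>% compile(loss = "MSE",
                    optimizer = optimizer_adam(learning_rate = lr),
                    metrics = "Accuracy")
  return(model)
}
```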


# Ensemble methods
```{r}
formula <- Survived ~ FirstClass + SecondClass + Sex + Age + Sibsp + Parch + Fare + Cherbourg + Queenstown

# Building a random forest model using the randomForest function
# mtry parameter is set to 9, which controls feature subsampling
# ntree parameter is set to 500, specifying the number of trees in the random forest
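# Note: with mtry equal to the number of predictors (9 here), every split
# considers all features, which makes this pure bagging rather than a random forest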
model_ensemble_bagging <- randomForest(formula,
                                       data=data,
                                       mtry=9,
                                       ntree=500)

# Creating a variable importance plot for the random forest model
varImpPlot(model_ensemble_bagging, main="Bagging: Variable Importance")

# Creating a data.table to compare actual survival values with predicted values from the bagging model
model_bag_result_comparison <- data.table(actual = data$Survived,
                                          prediction = predict(model_ensemble_bagging, newdata = data))

# Building a gradient boosting model using the gbm function
# distribution is set to "gaussian" for regression, but could be modified for classification
# interaction.depth sets the maximum depth of the individual trees
# n.trees is set to 500, indicating the number of boosting iterations
# shrinkage controls the step size when updating the weights of each tree
# cv.folds is set to 5, specifying the number of cross-validation folds for model evaluation
model_ensemble_boosting <- gbm(formula = formula,
                               data=data,
                               distribution="gaussian",
                               interaction.depth=1,
                               n.trees=500,
                               shrinkage=0.1,
                               cv.folds = 5)

# Calculating the performance of the boosting model using cross-validation
performance_ensemble <- gbm.perf(model_ensemble_boosting, method = "cv")

# Generating a summary of the boosting model, but plotit parameter is set to FALSE, so no plot will be generated
summary(model_ensemble_boosting, plotit = FALSE)

# Merging the variable importance data from the bagging model with the summary data from the boosting model
# Preparing the data for plotting by transforming percentages and converting to long format using melt function
dt_varimp <- merge(data.table(importance(model_ensemble_bagging, type = 2), keep.rownames = TRUE),
                   data.table(summary(model_ensemble_boosting, plotit = FALSE)),
                   by.x = "rn", by.y = "var")
dt_varimp <- setNames(dt_varimp, c("Variable", "Bagging", "Boosting"))
dt_varimp[, Bagging := 100*Bagging/sum(Bagging)]
dt_varimp <- melt(dt_varimp, id.vars = "Variable",
                  measure.vars = c("Bagging","Boosting"),
                  variable.name = "y",
                  value.name = "x")
dt_varimp <- dt_varimp[order(y,-x)]

# Plotting the variable importance comparison using ggplot2
ggplot(data = dt_varimp) +
  geom_col(aes(y=Variable, x=x, fill=y)) +
  labs(
    x = "Relative Importance",
    y = "Variables") +
  scale_fill_manual(values = c("#f1b147","#4CA7DE")) +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"),
        axis.ticks=element_blank(),
        legend.position = "none") +
  facet_wrap(~y)

rm(list = c("dt_varimp", "model_bag_result_comparison", "performance_ensemble"))
```
Analysing the two ensemble methods, bagging and boosting, shows that after Sex and Age, the passenger fare is the most important variable. This supports our assumption that the wealth of passengers may heavily influence the chances of survival. The structure of our data supports the assumption that Fare does not simply reflect family size, and the class of the passenger on its own does not appear to have an additional impact on survival chances. Comparing the two models shows that both target the same variables.


## Investigating hyperparameters for the boosting
```{r}
### Investigating hyperparameters
# Define the values
n_trees <- seq(from = 250, to = 750, by = 250)
shrinking <- seq(from = 0.05, to = 0.2, by = 0.05)
cv_folds <- seq(from = 2, to = 8, by = 3)

# Create an empty data frame to store the results
results <- data.frame()

# Iterate over the combinations
for (i in 1:length(n_trees)) {
  for (j in 1:length(shrinking)) {
    for (k in 1:length(cv_folds)) {
      # Build a  model with the specified parameters
      model  <- gbm(formula = formula,
                               data=data,
                               distribution="gaussian",
                               interaction.depth=1,
                               n.trees=n_trees[i],
                               shrinkage=shrinking[j],
                               cv.folds = cv_folds[k])

      # gbm.perf(method = "cv") returns the estimated optimal number of boosting iterations
      best_iter <- gbm.perf(model, method = "cv", plot.it = FALSE)
      # Record the cross-validation error at that iteration as our performance measure
      cv_err <- model$cv.error[best_iter]

      # Append the CV error and parameter values to the results data frame
      results <- rbind(results, c(cv_err, n_trees[i], shrinking[j], cv_folds[k]))
    }
  }
}

# Set column names for the results data frame
colnames(results) <- c("cv_error", "n_trees", "shrinking", "cv_folds")

# Create a 3D scatter plot using scatter3D
plot3D::scatter3D(x = results$n_trees, y = results$shrinking, z = results$cv_folds,
                  theta = 30, phi = 10,
                  xlab = "Number of trees", ylab = "Shrinkage", zlab = "CV folds",
                  ticktype = "detailed",
                  pch = 19,
                  cex.axis = 0.75, cex.lab = 0.75, expand = 0.75,
                  colvar = results$cv_error, col = viridis(100))


# Select the parameter combination with the lowest cross-validation error
opt_params <- results[which.min(results$cv_error),]

print(opt_params)

# Build a model using the optimal parameters
model_opt  <- gbm(formula = formula,
               data=data,
               distribution="gaussian",
               interaction.depth=1,
               n.trees = opt_params$n_trees,
               shrinkage = opt_params$shrinking,
               cv.folds = opt_params$cv_folds)

# Calculate the mean accuracy of the model predictions (thresholding the continuous gbm predictions at 0.5)
mean((predict(model_opt, data, n.trees = opt_params$n_trees) > 0.5) == data$Survived)

# Clean up
rm(list = c("acc", "cv_folds", "i", "j", "k", "n_trees", "shrinking", "results", "opt_params", "model_opt"))
```

As the performance of the boosting model seems to vary, we did not change our hyperparameters from the default options. Since bagging leads to the best results anyway, we are confident in our hyperparameter choice.

<br>
<br>

# PCA

The covariance matrix describes how the data spread; its leading eigenvector points in the direction along which they spread the most. So let us find the covariance matrix of the data:

1. Center the data points and calculate the variances and covariances.
2. Find the eigenvectors and the corresponding eigenvalues of the covariance matrix.
3. Sort the eigenvectors v by the absolute magnitude of their eigenvalues: these are our principal components (PCs) and the basis vectors of our new coordinate system.
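
As a minimal illustration of these steps on synthetic data (not our Titanic set), the decomposition can be done by hand with `cov()` and `eigen()`:
```{r}
# PCA by hand on synthetic data: center, covariance, eigendecomposition
set.seed(1)
x <- matrix(rnorm(200), ncol = 2) %*% matrix(c(1, 0.8, 0, 0.6), 2, 2)
x_centered <- scale(x, center = TRUE, scale = FALSE)  # center the data
S <- cov(x_centered)   # covariance matrix
eig <- eigen(S)        # eigen() returns the eigenvalues in decreasing order
eig$vectors            # columns: principal component directions
eig$values             # variance captured along each direction
prcomp(x)$rotation     # prcomp() finds the same directions (up to sign)
```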

## Reducing principal components
We are back to dropping dimensions, but this time we drop principal components instead of __variables__. We hope to greatly reduce the number of PCs while capturing most variation along the first PCs.

We may want to capture a minimum amount of variance of the data in total, or keep all the principal components that --individually-- explain a minimum amount of variance.
Similar to k-means, we can produce a scree plot of our principal components in descending order of their eigenvalues.

What is more, instead of dropping PCs we may be interested in exploring and possibly interpreting them.
In a real-estate data set, for example, certain variables may drive the PC1 coordinate (amenities) while others drive the PC2 coordinate (location).
We have already seen that principal components no longer represent single variables as our original basis vectors did. Each PC coordinate is a linear combination of the original variables. The eigenvectors v determine how the single variables of x contribute to each PC coordinate via v^T^x.
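In matrix form, with the centered and scaled data matrix $X$ and the eigenvector matrix $V$, the PC coordinates (scores) are
$$Z = XV, \qquad z_{i1} = v_1^T x_i$$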
```{r}
# We exclude ThirdClass (and Southhampton) as baseline categories.
dt_pca <- data.table(scale(data[, c("Age", "Fare", "Sex", "Parch", "Sibsp", "Queenstown", "Cherbourg", "FirstClass", "SecondClass")]))
# We use a library function which decomposes the covariance matrix for us and supplies other fundamental outputs.
model_pca <- prcomp(dt_pca)
summary(model_pca)

```
The model gives us access to our 9 principal components. In the summary, we also find the share that each PC contributes to explaining the variance of our data, in descending order of the eigenvalues.

```{r}
fviz_eig(model_pca, addlabels = TRUE) + 
         theme(plot.title = element_text(hjust = 0.5))
```
Looking at the scree plot, it is evident that the first 4 PCs explain roughly 66% of the variance.

To investigate how our original variables contribute to the PCs and to possibly interpret these contributions, we can look at the eigenvectors v. They essentially contribute the coefficients to the linear combination of our variables that produces the PCs. We find the eigenvectors at `model_pca$rotation`.

Here, `$rotation` gives the unit-length eigenvectors. Scaling each eigenvector by the square root of its eigenvalue (the standard deviation of the corresponding PC) yields the so-called loadings: eigenvectors × √eigenvalues.
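
A short sketch of this scaling, using `model_pca` from above (`sweep()` multiplies each column of the rotation matrix by the corresponding PC standard deviation):
```{r}
# Loadings: unit-length eigenvectors scaled by sqrt(eigenvalue) = PC standard deviation
loadings <- sweep(model_pca$rotation, 2, model_pca$sdev, `*`)
round(loadings, 2)
```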
 
 The eigenvectors take each data point into the PC coordinate system: multiplying the row vector x by V gives xV.
 
Let us illustrate the score plot. It plots our passengers in only the first two dimensions (PC1 and PC2), which capture 55% of the variance.
 
Finally, let us illustrate our principal components in a loadings plot. The direction of an arrow shows which PCs a variable contributes to, and the angles between arrows indicate the difference in these directions.
```{r}
# Matrix of eigenvectors: V
round(model_pca$rotation, 2)

fviz_pca_ind(model_pca, 
             geom.ind = "point", 
             col.ind = as.factor(data$Survived), 
             palette = c("#f1b147", "#4CA7DE")) +
  ggtitle("Score Plot") +
  theme(plot.title = element_text(hjust = 0.5))

fviz_pca_var(model_pca, 
             col.var="contrib", 
             repel = TRUE) +
    theme_minimal()
```

We find that Fare and FirstClass point strongly in one direction, possibly signifying wealth. Secondly, Sibsp and Parch point closely together, hinting at an overarching concept of family size.


## Reconstructing Variables
Our matrix of eigenvectors V above transformed our original 9 coordinates (variables) into 9 PC coordinates.
```{r}
# Insert PCA into our data for continued use
pca <- as.data.frame(as.matrix(dt_pca) %*%
  as.matrix(model_pca$rotation))

data <- cbind(data, pca)

# Define a function to perform PCA transformations on the test data
apply_pca <- function(data, model) {
  dt_pca <- data.table(scale(data[, c("Age", "Fare", "Sex", "Parch", "Sibsp", "Queenstown", "Cherbourg", "FirstClass", "SecondClass")]))
  
  pca <- as.data.frame(as.matrix(dt_pca) %*%
    as.matrix(model$rotation))
  
  return(cbind(data, pca))
}

```

To reconstruct the original variables we can multiply the PC coordinates by V^T^:
```{r}
# Passenger 1: Original variables
# x
round(as.matrix(dt_pca)[1,],2)

# Passenger 1: PC coordinates
# x * V
round(as.matrix(dt_pca)[1,] %*%
        as.matrix(model_pca$rotation), 2)

# Passenger 1: Reconstructed original variables
# x * V * VT
round(as.matrix(dt_pca)[1,] %*%
        as.matrix(model_pca$rotation) %*%
        t(as.matrix(model_pca$rotation)), 2)
```

However, not much is gained in terms of dimensionality reduction if we simply keep all 9 PC coordinates instead of the 9 original variables. We have seen at the very beginning that the first 4 PCs capture >65% of the variance of our data, so let us reduce the principal components to 4. We will call our eigenvector matrix V with only the first 4 columns V~reduced~. If we projected our passenger 1 onto only 4 PCs, we would have only 4 values to store: xV~reduced~

```{r}
# x * V_reduced
as.matrix(dt_pca)[1,] %*%
  as.matrix(model_pca$rotation[,1:4])

# x * V_reduced * V_reducedT
as.matrix(dt_pca)[1,] %*%
  as.matrix(model_pca$rotation[,1:4]) %*%
  t(as.matrix(model_pca$rotation[,1:4]))
```

If we have V~reduced~, we can reconstruct any passenger from its 4 PC values via xV~reduced~V^T^~reduced~.
In other words, we only need the PC coordinates xV~reduced~ to approximately reconstruct the original data. The quality of this reconstruction depends on the number of PCs we use. Compare the original variable values above to the reconstruction below.

How many dimensions we can/shall drop depends on our purpose. Let us investigate the total reconstruction error for our one passenger depending on the number of eigenvectors we retain in V~reduced~. We loop through the reconstruction of passenger 1 for 1 to 9 PCs:

```{r}
results <- list()

for (pcs in 1:9){
  passenger_reconstructed <- as.matrix(dt_pca)[1,] %*% 
    as.matrix(model_pca$rotation[,1:pcs]) %*% 
    t(as.matrix(model_pca$rotation[,1:pcs])) 
  
  results[[pcs]] <- data.table("PCs used" = pcs,
                               "MAE"= sum(abs(as.matrix(dt_pca)[1,]-passenger_reconstructed)),
                               passenger_reconstructed)
  }

dt_reconstruction <- rbindlist(results)
round(dt_reconstruction,2)
```

```{r}
# Clean up
rm(list = c("dt_pca", "dt_reconstruction", "pca", "passenger_reconstructed", "pcs", "results"))

```

# Models with PCA
## PCA: Logistic regression
```{r}
# Define the logistic regression using PCAs
model_pca_glm <- glm(formula = Survived ~ PC1 + PC2 + PC3 + PC4, 
                 data = data, family = binomial(link = "logit"))

# Print summary statistic
print(summary(model_pca_glm))

# Fit of the model
with(model_pca_glm, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
# A p-value below 0.001 indicates that the model performs better than an empty model

```
All inputs seem to be statistically significant at the 0.1% level.

## PCA: Tree
```{r}
### Base tree 
# Build a decision tree model using rpart
model_tree <- rpart(data = data, formula = Survived ~ PC1 + PC2 + PC3 + PC4, method = "class")

# Plot the decision tree model using prp, with extra details (extra = 2)
prp(model_tree, extra = 2, main = "Base tree for model")

# Print and plot data on tree
printcp(model_tree)
plotcp(model_tree)
# Note that with principal components as inputs, the splits are on PC coordinates rather than on the original variables.

# Determine the minimum complexity parameter (cp) for pruning
#cp parameter: captures model complexity
min_cp <- model_tree$cptable[which.min(model_tree$cptable[, "xerror"]), "CP"]

# Prune the decision tree model using the minimum cp
model_tree_2 <- prune(model_tree, cp = min_cp)

# Plot the pruned decision tree model using prp, with extra details (extra = 2)
prp(model_tree_2, extra = 2, main = "Pruned base tree with less complexity")


# Calculate the mean accuracy of the model predictions
mean(predict(model_tree, data, type = "class")==data$Survived)

mean(predict(model_tree_2, data, type = "class")==data$Survived)

# Here the un-pruned base tree is notably more accurate than the pruned tree, so we keep the un-pruned model (see the text below)
model_pca_trees <- model_tree

# Print and plot data on tree
printcp(model_tree_2)
plotcp(model_tree_2)

rm(list = c("model_tree", "model_tree_2"))
```

As we find a notable disparity in accuracy between the base tree and the pruned tree favoring the un-pruned base tree in the case of using PCA, we choose the un-pruned model for further application.

## PCA: Neural Networks
```{r}
set.seed(12345)   # Set a seed for reproducibility

# Data setup
X <- as.matrix(data[, c("PC1", "PC2", "PC3", "PC4")])
y <- data$Survived

## Model as in lecture

normalize_pca <- layer_normalization()   # Create a normalization layer
normalize_pca %>% adapt(X)   # Adapt the normalization layer to the data

print(normalize_pca$mean)   # Print the mean values used for normalization

model_pca_neural <- keras_model_sequential()   # Create a sequential model

model_pca_neural %>% 
  normalize_pca() %>%   # Apply normalization
  
  layer_dense(name = "HiddenLayer_1", units = 16, activation = 'relu') %>%   # Add a hidden layer with 16 units and ReLU activation
  layer_dense(name = "HiddenLayer_2", units = 16, activation = 'relu') %>%   # Add another hidden layer with 16 units and ReLU activation
  layer_dense(name = "HiddenLayer_3", units = 16, activation = 'relu') %>%   # Add a third hidden layer with 16 units and ReLU activation
 
   layer_dense(name = "OutputLayer", units = 1)   # Add an output layer with 1 unit

summary(model_pca_neural)   # Print the summary of the model

#plot_model(model_base)

model_pca_neural %>% compile(
  loss = "MSE",
  optimizer = optimizer_adam(learning_rate = 0.001),
  metrics = "Accuracy"
)   # Compile the model with mean squared error loss, Adam optimizer, and accuracy as a metric

model_pca_neural %>% fit(
  x = X,
  y = y,
  epochs = 500,
  verbose = 0
)   # Train the model on the data for 500 epochs

#model_pca_neural$layers[[2]]$kernel   # Print the kernel (weights) of the second layer
#model_pca_neural$layers[[3]]$kernel   # Print the kernel (weights) of the third layer
#model_pca_neural$layers[[4]]$kernel   # Print the kernel (weights) of the fourth layer


model_pca_neural %>% evaluate(X, y, verbose = 0)   # Evaluate the model on the data

rm(list = c("X", "y"))
```

## PCA: Ensemble methods
```{r}
formula_pca <- Survived ~ PC1 + PC2 + PC3 + PC4

# Building a random forest model using the randomForest function
# mtry is set to 4, the number of PC predictors, so each split considers all features (pure bagging)
# ntree parameter is set to 500, specifying the number of trees in the random forest
model_pca_ensemble_bagging <- randomForest(formula_pca, 
                                       data=data, 
                                       mtry=4, 
                                       ntree=500)

# Creating a variable importance plot for the random forest model
varImpPlot(model_pca_ensemble_bagging, main="Variable Importance")

# Creating a data.table to compare actual survival values with predicted values from the bagging model
model_bag_result_comparison <- data.table(actual = data$Survived,
                                          prediction = predict(model_pca_ensemble_bagging, newdata = data))

# Building a gradient boosting model using the gbm function
# distribution is set to "gaussian" for regression, but could be modified for classification
# interaction.depth sets the maximum depth of the individual trees
# n.trees is set to 500, indicating the number of boosting iterations
# shrinkage controls the step size when updating the weights of each tree
# cv.folds is set to 5, specifying the number of cross-validation folds for model evaluation
model_pca_ensemble_boosting <- gbm(formula = formula_pca, 
                               data=data, 
                               distribution="gaussian",
                               interaction.depth=1,
                               n.trees=500, 
                               shrinkage=0.1,
                               cv.folds = 5)

# Calculating the performance of the boosting model using cross-validation
performance_ensemble <- gbm.perf(model_pca_ensemble_boosting, method = "cv")

# Generating a summary of the boosting model, but plotit parameter is set to FALSE, so no plot will be generated
summary(model_pca_ensemble_boosting, plotit = FALSE)

# Merging the variable importance data from the bagging model with the summary data from the boosting model
# Preparing the data for plotting by transforming percentages and converting to long format using melt function
dt_varimp <- merge(data.table(importance(model_pca_ensemble_bagging, type = 2), keep.rownames = TRUE), 
                   data.table(summary(model_pca_ensemble_boosting, plotit = FALSE)), 
                   by.x = "rn", by.y = "var")
dt_varimp <- setNames(dt_varimp, c("Variable", "Bagging", "Boosting"))
dt_varimp[, Bagging := 100*Bagging/sum(Bagging)]
dt_varimp <- melt(dt_varimp, id.vars = "Variable", 
                  measure.vars = c("Bagging","Boosting"), 
                  variable.name = "y", 
                  value.name = "x")
dt_varimp <- dt_varimp[order(y,-x)]

# Plotting the variable importance comparison using ggplot2
ggplot(data = dt_varimp) +
  geom_col(aes(y=Variable, x=x, fill=y)) +
  labs(
    x = "Relative Importance",
    y = "Variables") +
  scale_fill_manual(values = c("#f1b147","#4CA7DE")) +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"),
        axis.ticks=element_blank(), 
        legend.position = "none") +
  facet_wrap(~y)

rm(list = c("dt_varimp", "model_bag_result_comparison", "performance_ensemble"))
```

# Models with full PCA

## Full PCA: Logistic regression
```{r}
# Define a logistic model using all PCAs
model_pca_full_glm <- glm(formula = Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8, 
                 data = data, family = binomial(link = "logit"))

# Print summary statistic
print(summary(model_pca_full_glm))

# Fit of the model
with(model_pca_full_glm, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
# A p-value below 0.001 indicates that the model performs better than an empty model

```
All inputs seem to be statistically significant at the 0.1% level.

## Full PCA: Tree
```{r}
### Base tree 
# Build a decision tree model using rpart
model_tree <- rpart(data = data, formula = Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8, method = "class")

# Plot the decision tree model using prp, with extra details (extra = 2)
prp(model_tree, extra = 2, main = "Base tree for model")

# Print and plot data on tree
printcp(model_tree)
plotcp(model_tree)
# Note that with principal components as inputs, the splits are on PC coordinates rather than on the original variables.

# Determine the minimum complexity parameter (cp) for pruning
#cp parameter: captures model complexity
min_cp <- model_tree$cptable[which.min(model_tree$cptable[, "xerror"]), "CP"]

# Prune the decision tree model using the minimum cp
model_tree_2 <- prune(model_tree, cp = min_cp)

# Plot the pruned decision tree model using prp, with extra details (extra = 2)
prp(model_tree_2, extra = 2, main = "Pruned base tree with less complexity")


# Calculate the mean accuracy of the model predictions
mean(predict(model_tree, data, type = "class")==data$Survived)

mean(predict(model_tree_2, data, type = "class")==data$Survived)

# Here the pruned and un-pruned trees turn out identical, so we keep the pruned model
model_pca_full_trees <- model_tree_2

# Print and plot data on tree
printcp(model_tree_2)
plotcp(model_tree_2)

rm(list = c("model_tree", "model_tree_2"))

```

In this case the pruned and the un-pruned model are identical, which makes sense given the nature of PCA.

## Full PCA: Neural Networks
```{r}
set.seed(12345)   # Set a seed for reproducibility

# Data setup
X <- as.matrix(data[, c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8")])
y <- data$Survived

## Model as in lecture

normalize_pca_full <- layer_normalization()   # Create a normalization layer
normalize_pca_full %>% adapt(X)   # Adapt the normalization layer to the data

print(normalize_pca_full$mean)   # Print the mean values used for normalization

model_pca_full_neural <- keras_model_sequential()   # Create a sequential model

model_pca_full_neural %>% 
  normalize_pca_full() %>%   # Apply normalization
  
  layer_dense(name = "HiddenLayer_1", units = 16, activation = 'relu') %>%   # Add a hidden layer with 16 units and ReLU activation
  layer_dense(name = "HiddenLayer_2", units = 16, activation = 'relu') %>%   # Add another hidden layer with 16 units and ReLU activation
  layer_dense(name = "HiddenLayer_3", units = 16, activation = 'relu') %>%   # Add a third hidden layer with 16 units and ReLU activation
 
   layer_dense(name = "OutputLayer", units = 1)   # Add an output layer with 1 unit

summary(model_pca_full_neural)   # Print the summary of the model

#plot_model(model_base)

model_pca_full_neural %>% compile(
  loss = "MSE",
  optimizer = optimizer_adam(learning_rate = 0.001),
  metrics = "Accuracy"
)   # Compile the model with mean squared error loss, Adam optimizer, and accuracy as a metric

model_pca_full_neural %>% fit(
  x = X,
  y = y,
  epochs = 500,
  verbose = 0
)   # Train the model on the data for 500 epochs

#model_pca_full_neural$layers[[2]]$kernel   # Print the kernel (weights) of the second layer
#model_pca_full_neural$layers[[3]]$kernel   # Print the kernel (weights) of the third layer
#model_pca_full_neural$layers[[4]]$kernel   # Print the kernel (weights) of the fourth layer


model_pca_full_neural %>% evaluate(X, y, verbose = 0)   # Evaluate the model on the data

rm(list = c("X", "y"))
```

## Full PCA: Ensemble methods
```{r}
formula_pca_full <- Survived ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8

# Building a random forest model using the randomForest function
# mtry is set to 8, the number of PC predictors, so each split considers all features (pure bagging)
# ntree parameter is set to 500, specifying the number of trees in the random forest
model_pca_full_ensemble_bagging <- randomForest(formula_pca_full, 
                                       data=data, 
                                       mtry=8, 
                                       ntree=500)

# Creating a variable importance plot for the random forest model
varImpPlot(model_pca_full_ensemble_bagging, main="Variable Importance")

# Creating a data.table to compare actual survival values with predicted values from the bagging model
model_bag_result_comparison <- data.table(actual = data$Survived,
                                          prediction = predict(model_pca_full_ensemble_bagging, newdata = data)) 

# Building a gradient boosting model using the gbm function
# distribution is set to "gaussian" for regression, but could be modified for classification
# interaction.depth sets the maximum depth of the individual trees
# n.trees is set to 500, indicating the number of boosting iterations
# shrinkage controls the step size when updating the weights of each tree
# cv.folds is set to 5, specifying the number of cross-validation folds for model evaluation
model_pca_full_ensemble_boosting <- gbm(formula = formula_pca_full, 
                               data=data, 
                               distribution="gaussian",
                               interaction.depth=1,
                               n.trees=500, 
                               shrinkage=0.1,
                               cv.folds = 5)

# Calculating the performance of the boosting model using cross-validation
performance_ensemble <- gbm.perf(model_pca_full_ensemble_boosting, method = "cv")

# Generating a summary of the boosting model, but plotit parameter is set to FALSE, so no plot will be generated
summary(model_pca_full_ensemble_boosting, plotit = FALSE)

# Merging the variable importance data from the bagging model with the summary data from the boosting model
# Preparing the data for plotting by transforming percentages and converting to long format using melt function
dt_varimp <- merge(data.table(importance(model_pca_full_ensemble_bagging, type = 2), keep.rownames = TRUE), 
                   data.table(summary(model_pca_full_ensemble_boosting, plotit = FALSE)), 
                   by.x = "rn", by.y = "var")
dt_varimp <- setNames(dt_varimp, c("Variable", "Bagging", "Boosting"))
dt_varimp[, Bagging := 100*Bagging/sum(Bagging)]
dt_varimp <- melt(dt_varimp, id.vars = "Variable", 
                  measure.vars = c("Bagging","Boosting"), 
                  variable.name = "y", 
                  value.name = "x")
dt_varimp <- dt_varimp[order(y,-x)]

# Plotting the variable importance comparison using ggplot2
ggplot(data = dt_varimp) +
  geom_col(aes(y=Variable, x=x, fill=y)) +
  labs(
    x = "Relative Importance",
    y = "Variables") +
  scale_fill_manual(values = c("#f1b147","#4CA7DE")) +
  theme_minimal() +
  theme(axis.line = element_line(color = "#000000"),
        axis.ticks=element_blank(), 
        legend.position = "none") +
  facet_wrap(~y)

rm(list = c("dt_varimp", "model_bag_result_comparison", "performance_ensemble"))
```

Notably, our bagging model on all eight PCs again performs the same as the bagging model using only the first four PCs. This occurs because the relatively unimportant last four PCs are essentially never chosen as split variables by the trees.
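
A minimal sketch to check this claim, assuming the four-PC bagging model `model_pca_ensemble_bagging` from the earlier section is still in memory: if the last four PCs are essentially never split on, the two models' class predictions should agree on almost every training observation.
```{r}
# Compare the thresholded in-sample predictions of the two bagging models
pred_full <- ifelse(predict(model_pca_full_ensemble_bagging, newdata = data) > 0.5, 1, 0)
pred_four <- ifelse(predict(model_pca_ensemble_bagging, newdata = data) > 0.5, 1, 0)
mean(pred_full == pred_four)   # share of identical class predictions

rm(list = c("pred_full", "pred_four"))
```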

# Data for Validation
```{r}
test_data <- read_csv(path1)   # Read the test features from the CSV file at 'path1'

test_solution <- read_csv(path2)   # Read the true outcomes for the test set

test_data$Survived <- test_solution$Survived   # Attach the outcomes to the test features

test_data <- clean_data(test_data)   # Apply the same cleaning steps as for the training data

test_data <- apply_pca(test_data, model_pca)   # Project the test data onto the training PCs

test_solution <- as.data.frame(test_data$Survived)   # Keep the outcomes of the cleaned cases only
colnames(test_solution) <- "Survived"

rm(list = c("path1", "path2"))

```

## Predictions for validation
Let us perform predictions for different models on the validation data.
```{r}
# Prediction for logistic regression
test_solution$logistic_prediction <- predict(model_glm, 
                                             newdata = test_data)
# The logistic predictions are on the log-odds scale, so a threshold of 0 corresponds to a probability of 0.5
test_solution$logistic_prediction <- ifelse(test_solution$logistic_prediction > 0, 1, 0)

# Prediction for tree regression
test_solution$tree_prediction <- predict(model_trees, 
                                         newdata = test_data)[,"1"]
# Converting the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5
test_solution$tree_prediction <- ifelse(test_solution$tree_prediction > 0.5, 1, 0)

# Prediction for neural network
test_solution$neural_prediction <- predict(model_neural, 
                                           x = as.matrix(test_data[, c("FirstClass", "SecondClass", "ThirdClass", "Sex", "Age", "Sibsp", "Parch", "Fare", "Southhampton", "Cherbourg", "Queenstown")]), verbose = 0)

# Converting the predicted probabilities to binary predictions (0 or 1) using a threshold of 0.5
test_solution$neural_prediction <- ifelse(test_solution$neural_prediction > 0.5, 1, 0)

# Prediction for bagging
test_solution$bagging_prediction <- predict(model_ensemble_bagging, 
                                            newdata = test_data)
# Converting the numeric predictions to binary predictions (0 or 1) using a threshold of 0.5
test_solution$bagging_prediction <- ifelse(test_solution$bagging_prediction > 0.5, 1, 0)

# Prediction for boosting
test_solution$boosting_prediction <- predict(model_ensemble_boosting, 
                                             newdata = test_data)
# Converting the numeric predictions to binary predictions (0 or 1) using a threshold of 0.5
test_solution$boosting_prediction <- ifelse(test_solution$boosting_prediction > 0.5, 1, 0)

## PCA

# Prediction for logistic regression
test_solution$logistic_pca_prediction <- predict(model_pca_glm, 
                                             newdata = test_data)
test_solution$logistic_pca_prediction <- ifelse(test_solution$logistic_pca_prediction > 0, 1, 0)

# Prediction for tree regression
test_solution$tree_pca_prediction <- predict(model_pca_trees, 
                                         newdata = test_data)[,"1"]

test_solution$tree_pca_prediction <- ifelse(test_solution$tree_pca_prediction > 0.5, 1, 0)

# Prediction for neural network
test_solution$neural_pca_prediction <- predict(model_pca_neural, 
                                           x = as.matrix(test_data[, c("PC1", "PC2", "PC3", "PC4")]), verbose = 0)
test_solution$neural_pca_prediction <- ifelse(test_solution$neural_pca_prediction > 0.5, 1, 0)

# Prediction for bagging
test_solution$bagging_pca_prediction <- predict(model_pca_ensemble_bagging, 
                                            newdata = test_data)
test_solution$bagging_pca_prediction <- ifelse(test_solution$bagging_pca_prediction > 0.5, 1, 0)

# Prediction for boosting
test_solution$boosting_pca_prediction <- predict(model_pca_ensemble_boosting, 
                                             newdata = test_data)
test_solution$boosting_pca_prediction <- ifelse(test_solution$boosting_pca_prediction > 0.5, 1, 0)

## Full PCA

# Prediction for logistic regression
test_solution$logistic_pca_full_prediction <- predict(model_pca_full_glm, 
                                             newdata = test_data)
test_solution$logistic_pca_full_prediction <- ifelse(test_solution$logistic_pca_full_prediction > 0, 1, 0)

# Prediction for tree regression
test_solution$tree_pca_full_prediction <- predict(model_pca_full_trees, 
                                         newdata = test_data)[,"1"]
test_solution$tree_pca_full_prediction <- ifelse(test_solution$tree_pca_full_prediction > 0.5, 1, 0)

# Prediction for neural network
test_solution$neural_pca_full_prediction <- predict(model_pca_full_neural, 
                                           x = as.matrix(test_data[, c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8")]), verbose = 0)
test_solution$neural_pca_full_prediction <- ifelse(test_solution$neural_pca_full_prediction > 0.5, 1, 0)

# Prediction for bagging
test_solution$bagging_pca_full_prediction <- predict(model_pca_full_ensemble_bagging, 
                                            newdata = test_data)
test_solution$bagging_pca_full_prediction <- ifelse(test_solution$bagging_pca_full_prediction > 0.5, 1, 0)

# Prediction for boosting
test_solution$boosting_pca_full_prediction <- predict(model_pca_full_ensemble_boosting, 
                                             newdata = test_data)

# Converting the numeric predictions to binary predictions (0 or 1) using a threshold of 0.5
test_solution$boosting_pca_full_prediction <- ifelse(test_solution$boosting_pca_full_prediction > 0.5, 1, 0)



```

# Validation
```{r}
# Function to calculate correct guesses for a column
# This function calculates the overall accuracy of predictions for a given column by comparing it with the true 'Survived' values.
calculate_correct_guesses <- function(column, survived) {
  sum(column == survived) / length(column)
}

# Function to calculate correct guesses for survived cases in a column
# This is the sensitivity: the share of actual survivors ('Survived' = 1) that the column predicts correctly.
calculate_correct_guesses_survived <- function(column, survived) {
  sum(column == 1 & survived == 1) / sum(survived == 1)
}

# Function to calculate correct guesses for dead cases in a column
# This is the specificity: the share of actual deaths ('Survived' = 0) that the column predicts correctly.
calculate_correct_guesses_dead <- function(column, survived) {
  sum(column == 0 & survived == 0) / sum(survived == 0)
}
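
# Example usage on small hypothetical vectors (first argument: predictions, second: true outcomes):
# calculate_correct_guesses(c(1, 0, 1), c(1, 1, 1))            # 2/3 correct overall
# calculate_correct_guesses_survived(c(1, 0, 1), c(1, 1, 1))   # 2 of 3 survivors found
# calculate_correct_guesses_dead(c(1, 0, 0), c(0, 0, 1))       # 1 of 2 deaths found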

# Calculate correct guesses for each column
# 'sapply()' applies 'calculate_correct_guesses()' to every prediction column of 'test_solution'
# (excluding the first column, which holds the true 'Survived' values).
correct_guesses <- sapply(test_solution[-1], calculate_correct_guesses, test_solution$Survived)
# Prepend a placeholder of 1 for the 'Survived' column itself; this row is dropped again below
correct_guesses <- c(1, correct_guesses)


# Calculate correct guesses for survived cases in each column
correct_guesses_survived <- sapply(test_solution[-1], calculate_correct_guesses_survived, test_solution$Survived)
correct_guesses_survived <- c(1, correct_guesses_survived)

# Calculate correct guesses for dead cases in each column
correct_guesses_dead <- sapply(test_solution[-1], calculate_correct_guesses_dead, test_solution$Survived)
correct_guesses_dead <- c(1, correct_guesses_dead)

# Add the three rows to the original dataframe
# Create a new data frame 'solution' with the calculated accuracy values and corresponding column names.
solution <- data.frame("Accuracy" = correct_guesses, 
                 "Accuracy: Survival" = correct_guesses_survived,
                 "Accuracy: Dead" = correct_guesses_dead)
# Remove the placeholder first row, which stands for the 'Survived' column itself rather than a model prediction
solution <- solution[-1,]

```

```{r}
solution
```
The table above summarises the performance of the various models on the validation data. The "Accuracy" column shows each model's overall accuracy, "Accuracy: Survival" the share of actual survivors classified correctly, and "Accuracy: Dead" the share of actual deaths classified correctly.

Overall, boosting without PCA predicts our test data best, both when considering accuracy on all cases and when predicting only the survivors. Surprisingly, deaths are predicted best by the logistic regression on all eight PCs; this variant also achieves the second-best overall accuracy, closely followed by the basic tree classification.
Notably, bagging and boosting under PCA perform worst, even though boosting without PCA performs so well. As a comparison we also validated models that include all PCA variables to check whether we could gain accuracy that way; except for the logistic regression, this did not yield better predictions.
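
To make the survival/death split of the accuracy explicit, a confusion matrix can be tabulated for any of the prediction columns; a minimal sketch for the (non-PCA) boosting model:
```{r}
# Confusion matrix: rows are true outcomes, columns are predicted classes
table(actual = test_solution$Survived,
      predicted = test_solution$boosting_prediction)
```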

Note that the exact numbers depend on the random seed: the random forest, the boosting cross-validation folds, and the neural-network initialisation are all stochastic, so results may shift slightly between runs.
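
For reproducible runs, the relevant seeds can be fixed before the models are fitted; a sketch, assuming a recent tensorflow R package that provides `set_random_seed()`:
```{r}
# set.seed(2023)                      # base R RNG: randomForest, gbm CV folds, fold splits
# tensorflow::set_random_seed(2023)   # keras/TensorFlow weight initialisation
```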

# Cross-Validation

```{r}
# Optional: merge training and test data for cross-validation

full_data <- rbind(data, test_data)

```

```{r}
# Check if data was merged before
if (exists("full_data")) {
  cv_data <- full_data
} else {
  cv_data <- data
}

# Define a function to split data into K folds
# Assign a fold number to each row randomly using 'ceiling' and 'sample'.
split_K <- function(dt, folds){
  dt$id <- ceiling(sample(1:nrow(dt), replace = FALSE, nrow(dt)) / 
                     (nrow(dt) / folds))
  return(dt)
}

# Set the number of folds (K) for cross-validation
K <- 10

cv_data <- split_K(cv_data, K)
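# Optional sanity check: the folds should be (nearly) equal in size
# table(cv_data$id)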
# Create a data frame 'cross_validation' to store accuracy values for each model and fold
cross_validation <- data.frame(matrix(NA, nrow = 10, ncol = K))
rownames(cross_validation) <- c("Logistic", "Tree Model", "Neural Network", 
                                "Bagging", "Boosting", "PCA Logistic", "PCA Tree Model",
                                "PCA Neural Network", "PCA Bagging", "PCA Boosting")

for (i in 1:K) {
  # Select the current fold as the evaluation set (the models were fitted earlier and are not re-trained per fold)
  dt <- cv_data[cv_data$id == i, ]
  
  # Prediction for logistic regression
  dt$logistic_prediction <- predict(model_glm, 
                                               newdata = dt)
  dt$logistic_prediction <- ifelse(dt$logistic_prediction > 0, 1, 0)
  
  cross_validation[1, i] <- calculate_correct_guesses(dt$logistic_prediction, dt$Survived)
  
  # Prediction for tree regression
  dt$tree_prediction <- predict(model_trees, 
                                           newdata = dt)[,"1"]
  dt$tree_prediction <- ifelse(dt$tree_prediction > 0.5, 1, 0)
  
  cross_validation[2, i] <- calculate_correct_guesses(dt$tree_prediction, dt$Survived)
  
  # Prediction for neural network
  dt$neural_prediction <- predict(model_neural, x = as.matrix(dt[, c("FirstClass", "SecondClass", "ThirdClass", "Sex", "Age", "Sibsp", "Parch", "Fare", "Southhampton", "Cherbourg", "Queenstown")]), verbose = 0)
  
  dt$neural_prediction <- ifelse(dt$neural_prediction > 0.5, 1, 0)

  cross_validation[3, i] <- calculate_correct_guesses(dt$neural_prediction, dt$Survived)
  
  # Prediction for bagging
  dt$bagging_prediction <- predict(model_ensemble_bagging, 
                                              newdata = dt)
  dt$bagging_prediction <- ifelse(dt$bagging_prediction > 0.5, 1, 0)
  
  cross_validation[4, i] <- calculate_correct_guesses(dt$bagging_prediction, dt$Survived)
  
  # Prediction for boosting
  dt$boosting_prediction <- predict(model_ensemble_boosting, 
                                               newdata = dt)
  dt$boosting_prediction <- ifelse(dt$boosting_prediction > 0.5, 1, 0)
  
  cross_validation[5, i] <- calculate_correct_guesses(dt$boosting_prediction, dt$Survived)
  
  ##PCA
  
  # Prediction for logistic regression
  dt$logistic_pca_prediction <- predict(model_pca_glm, 
                                               newdata = dt)
  dt$logistic_pca_prediction <- ifelse(dt$logistic_pca_prediction > 0, 1, 0)
  
  cross_validation[6, i] <- calculate_correct_guesses(dt$logistic_pca_prediction, dt$Survived)
  
  # Prediction for tree regression
  dt$tree_pca_prediction <- predict(model_pca_trees, 
                                           newdata = dt)[,"1"]
  dt$tree_pca_prediction <- ifelse(dt$tree_pca_prediction > 0.5, 1, 0)
  
  cross_validation[7, i] <- calculate_correct_guesses(dt$tree_pca_prediction, dt$Survived)
  
  # Prediction for neural network
  dt$neural_pca_prediction <- predict(model_pca_neural, x = as.matrix(dt[, c("PC1", "PC2", "PC3", "PC4")]), verbose = 0)
  dt$neural_pca_prediction <- ifelse(dt$neural_pca_prediction > 0.5, 1, 0)
  
  cross_validation[8, i] <- calculate_correct_guesses(dt$neural_pca_prediction, dt$Survived)
  
  # Prediction for bagging
  dt$bagging_pca_prediction <- predict(model_pca_ensemble_bagging, 
                                              newdata = dt)
  dt$bagging_pca_prediction <- ifelse(dt$bagging_pca_prediction > 0.5, 1, 0)
  
  cross_validation[9, i] <- calculate_correct_guesses(dt$bagging_pca_prediction, dt$Survived)
  
  # Prediction for boosting
  dt$boosting_pca_prediction <- predict(model_pca_ensemble_boosting, 
                                               newdata = dt)
  dt$boosting_pca_prediction <- ifelse(dt$boosting_pca_prediction > 0.5, 1, 0)
  
  cross_validation[10, i] <- calculate_correct_guesses(dt$boosting_pca_prediction, dt$Survived)
    
  
}

# Calculate mean accuracy over folds
cross_validation$Mean_accuracy <- rowMeans(cross_validation)
# Calculate expected loss
cross_validation$Expected_Loss <- (1 - cross_validation$Mean_accuracy)

```

```{r}
cross_validation[, c("Mean_accuracy", "Expected_Loss")]
```

The table reports `Mean_accuracy` and `Expected_Loss` for each model. `Mean_accuracy` is the average accuracy of the model across the 10 folds of the cross-validation, i.e. how well the model performs on average when predicting survival. `Expected_Loss` is the corresponding proportion of misclassified observations; the lower the expected loss, the better the model.
After cross-validation we find that bagging without PCA performs best, closely followed by bagging with PCA.
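
Formally, with 0-1 loss the reported quantity is the cross-validation estimate
$$\widehat{\mathrm{EL}} \;=\; \frac{1}{K}\sum_{k=1}^{K}\left(1 - \mathrm{acc}_k\right) \;=\; 1 - \overline{\mathrm{acc}},$$
where $\mathrm{acc}_k$ denotes the accuracy on fold $k$ and $K = 10$.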

