Hierarchical Clustering Analysis with Women Entrepreneurship

gracecamc168
Jan 5, 2021

Updated: Jan 7, 2021

One of the main benefits of supervised machine learning models is that they help us predict the future from past data. However, most of the time the hidden patterns and information in the data are not revealed by such models. Clustering, an unsupervised machine learning technique, can be used as part of descriptive analytics because it performs useful exploratory analysis by grouping a large number of observations in a dataset into a small number of homogeneous clusters. Put another way, the algorithm groups together observations that share similar traits, and each group is dissimilar to the others. Hierarchical clustering, which includes agglomerative and divisive clustering, is a popular choice for customer or market segmentation. The example I will use today is agglomerative clustering, which is also called "bottom-up" clustering. Unsupervised machine learning algorithms find patterns on their own: no response variable or predictors are assigned.
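To make the "bottom-up" idea concrete, here is a minimal sketch on made-up toy data (five points on a line, not the Kaggle dataset). It uses base R's hclust() with average linkage purely for illustration; the names x, hc and groups are my own.

```r
# Toy data: five points on a line, forming two obvious groups
x <- c(a = 1, b = 2, c = 3, d = 10, e = 11)

# Agglomerative ("bottom-up") clustering: every point starts as its own
# cluster, and the two closest clusters are merged at each step
hc <- hclust(dist(x), method = "average")

# Each row of `merge` is one merge step; negative numbers are single
# observations, positive numbers refer to clusters formed in earlier rows
print(hc$merge)

# Cut the finished tree into two groups
groups <- cutree(hc, k = 2)
print(groups)
```

With five observations there are four merge steps, and cutting the tree at two groups separates the points near 1 to 3 from the points near 10 to 11.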

To perform hierarchical clustering, the data must be categorical (factor), continuous, or a mix of both, and each data type calls for a different distance measure. Note that free-text columns such as reviews or names cannot be used.
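A quick way to check which columns are usable is to look at their classes. The sketch below uses a hypothetical toy data frame (the column names name, status and rate are invented for illustration) and keeps only the factor and numeric columns.

```r
# Hypothetical toy data frame (not the Kaggle dataset)
toy <- data.frame(
  name   = c("A", "B", "C"),                     # free text: not usable
  status = c("Member", "Not Member", "Member"),  # categorical
  rate   = c(1.2, 3.4, 5.6),                     # continuous
  stringsAsFactors = FALSE
)

# Convert the categorical column to a factor
toy$status <- factor(toy$status)

# Inspect the class of every column
col_types <- sapply(toy, class)
print(col_types)

# Keep only the columns that a distance measure can work with
usable <- toy[, col_types %in% c("factor", "numeric", "integer"), drop = FALSE]
print(names(usable))
```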

The dataset I use is Women Entrepreneurship and Labor Force from Kaggle. It is a small dataset with 51 rows and 9 columns, and it contains both categorical and continuous variables. Therefore, we will use the "gower" metric in daisy() for this dataset. If the data were continuous only, we could instead measure distance with dist() using a metric such as "euclidean", and we would have to scale the data before calling dist(). R is the programming language for this project.
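The two routes described above can be sketched side by side. The values below are invented toy numbers, and the column names index, inflation and member are placeholders rather than the dataset's real columns.

```r
library(cluster)

# Continuous-only toy data: scale first, then use Euclidean distance
num <- data.frame(index = c(60, 35, 58), inflation = c(0.2, 5.5, -0.1))
d_euclid <- dist(scale(num), method = "euclidean")

# Mixed toy data: add a factor column and switch to Gower via daisy()
mixed <- cbind(num, member = factor(c("yes", "no", "yes")))
d_gower <- daisy(mixed, metric = "gower")

print(d_euclid)
print(d_gower)
```

Gower dissimilarities always fall between 0 and 1, because each variable's contribution is rescaled to that range before averaging, so no separate scaling step is needed on that route.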

Step 1: transform the three variables (Level of development, European Union Membership and Currency) into 1/0 indicators.

# recode the three categorical columns as 1/0 indicators
women$Level.of.development <- ifelse(women$Level.of.development == "Developed", 1, 0)
women$European.Union.Membership <- ifelse(women$European.Union.Membership == "Member", 1, 0)
women$Currency <- ifelse(women$Currency == "Euro", 1, 0)

library(cluster)

# Gower dissimilarity between observations (column 2 is dropped)
d <- daisy(women[, -2], metric = "gower")

# agglomerative hierarchical clustering with Ward's method
result <- agnes(d, method = "ward")

# banner plot and dendrogram
plot(result)

The dendrogram plot reports an agglomerative coefficient of 0.97, which indicates a strong clustering structure. From the plot, three clusters appear to be the best choice.
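The agglomerative coefficient can also be read directly from the fitted object, which makes it easy to compare linkage methods. This is a sketch on random toy data (not the Kaggle dataset); the names toy, methods and ac are my own.

```r
library(cluster)

set.seed(1)  # toy data only, two loose groups of ten points each
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))

# Fit agnes() with several linkage methods and collect the coefficient,
# which is stored in the `ac` component of the result
methods <- c("average", "single", "complete", "ward")
ac <- sapply(methods, function(m) agnes(toy, method = m)$ac)

print(round(ac, 2))
```

Values closer to 1 suggest a stronger clustering structure. The coefficient tends to grow with the number of observations, so it is best used to compare methods on the same data rather than datasets of very different sizes.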

clusters <- cutree(result,k=3)

Each observation has now been assigned to one of the three clusters.
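As a small illustration of what cutree() returns, the sketch below clusters random toy data (not the Kaggle dataset) and counts how many rows land in each cluster. Converting with as.hclust() first is an optional step I add here so that cutree() receives a standard tree object.

```r
library(cluster)

set.seed(42)  # toy data only, three loose groups of ten points each
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 6), ncol = 2),
             matrix(rnorm(20, mean = 12), ncol = 2))

fit <- agnes(toy, method = "ward")

# cutree() returns one integer label per row of the input
labels_k3 <- cutree(as.hclust(fit), k = 3)

# Count how many rows fall in each cluster
print(table(labels_k3))
```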

# append the cluster membership to the original data frame.

women1 <- data.frame(women,clusters)

str(women1)

# obtain summary statistics for the three clusters and summarize the average value of each cluster.

summary(subset(women1,clusters == 1))

summary(subset(women1,clusters == 2))

summary(subset(women1,clusters == 3))

library(dplyr)   # provides %>%, group_by() and summarise()
library(tibble)  # provides add_column()

newdata <- women1 %>%
  group_by(clusters) %>%
  summarise(
    mean_No = mean(as.numeric(No)),
    mean_Level.of.development = mean(as.numeric(Level.of.development)),
    mean_European.Union.Membership = mean(as.numeric(European.Union.Membership)),
    mean_Currency = mean(as.numeric(Currency)),
    mean_Women.Entrepreneurship.Index = mean(Women.Entrepreneurship.Index),
    mean_Entrepreneurship.Index = mean(Entrepreneurship.Index),
    mean_Inflation.rate = mean(Inflation.rate),
    mean_Female.Labor.Force.Participation.Rate = mean(Female.Labor.Force.Participation.Rate)
  )

# add the number of observations in each cluster as a new second column
newdata <- add_column(newdata, cluster_number = c(15, 12, 24), .after = 1)

str(newdata)
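The cluster sizes do not have to be typed in by hand: dplyr's n() can count the rows of each group inside the same summarise() call. A minimal sketch on a hypothetical stand-in for women1 (the columns rate and n_countries are invented for illustration):

```r
library(dplyr)

# Hypothetical stand-in for women1: a value column plus cluster labels
toy <- data.frame(
  clusters = c(1, 1, 2, 3, 3, 3),
  rate     = c(60, 62, 66, 50, 54, 55)
)

# n() counts the rows in each group, so the sizes stay in sync with the data
cluster_summary <- toy %>%
  group_by(clusters) %>%
  summarise(n_countries = n(), mean_rate = mean(rate))

print(cluster_summary)
```

Counting inside the pipeline avoids a mismatch between the hard-coded sizes and the actual cluster assignment if the data or the number of clusters changes.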

From the output, we can make the following observations about each of the three clusters.

Cluster 1 contains 15 of the 51 countries. All of them are developed countries and European Union members that use the euro as their currency. The mean Women Entrepreneurship Index is 58.6 and the mean Entrepreneurship Index is 57.6. The mean inflation rate is -0.165, and roughly 61% of women participate in the workforce.

Cluster 2 contains 12 countries. All of them are developed countries, about half are European Union members, and they use a national currency rather than the euro. The mean Women Entrepreneurship Index is 60.3 and the mean Entrepreneurship Index is 61.9. The mean inflation rate is 0.238, and roughly 65.9% of women participate in the workforce.

Cluster 3 contains 24 countries. All of them are developing countries outside the European Union, and they use a national currency. The mean Women Entrepreneurship Index is 34.9 and the mean Entrepreneurship Index is 33.5. The mean inflation rate is 5.483, and roughly 53.3% of women participate in the workforce.

This model is purely data driven: it searches the whole dataset to find patterns. As a result, it can run very slowly on large datasets. It is also very sensitive to changes in the sample: one small change could lead to a completely different clustering output.
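One reason for the slowdown is that the method works from all pairwise distances, and their number grows quadratically with the number of rows. A small sketch (the helper n_pairs is my own name, and the toy matrix is random data):

```r
# Number of pairwise distances stored for n observations: n * (n - 1) / 2
n_pairs <- function(n) n * (n - 1) / 2

# 51 rows, as in the dataset used here
print(n_pairs(51))  # 1275

# Check against an actual distance object on random toy data with 51 rows
set.seed(1)
toy <- matrix(rnorm(51 * 3), nrow = 51)
print(length(dist(toy)) == n_pairs(51))

# The same count for a much larger dataset
print(n_pairs(100000))
```

For 51 rows the distance object is tiny, but at 100,000 rows it would hold roughly five billion values, which is why hierarchical clustering is usually reserved for small to medium datasets.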
