KMeans is an unsupervised clustering method used in machine learning which calculates the Euclidean distance to generate a measure of similarity (and inversely, dissimilarity) and represents the data in k clusters. It is a probabilistic style: it specifies a data-generating process that can reveal potential hidden clusters through exploration. It operates under a hard assignment strategy, where each observation is assigned only to the single cluster whose centroid (or mean) it matches with the highest probability. Finally, it is a flat clustering method, producing a single clustering without hierarchical divisions (Grimmer, Roberts, & Stewart 2024).
This tutorial will first explore a basic quantitative use for K-Means: generating, processing, and then visualizing the clusters output by the KMeans function. The second section of this tutorial will take text data and generate clusters using Term Frequency - Inverse Document Frequency (TF-IDF). This serves as a structured tutorial that reuses code from the first section and prompts code input to match the provided outputs.
The libraries used fall under four main categories: data management, data visualization, data querying, and natural language processing.
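As a minimal sketch of those four categories (the exact libraries in this notebook may differ), a typical import cell looks like:

```python
# Data management
import numpy as np
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt

# Data querying / machine learning
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Natural language processing (vectorizing text)
from sklearn.feature_extraction.text import TfidfVectorizer
```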
Data used in K-Means clustering must be vectorized. This means the model identifies clusters based on numeric positions within an arbitrary plane. The input can either be numeric to start with or can be some form of text that is then vectorized.
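For example, a handful of short documents can be vectorized with TF-IDF before clustering. The toy documents below are illustrative placeholders, not the tutorial's actual data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents -- stand-ins for whatever text is being clustered
documents = ["the cat sat on the mat",
             "dogs chase cats around the yard",
             "stock markets rose sharply today"]

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(documents)  # sparse matrix, shape (n_docs, n_terms)
print(X_text.shape)
```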
📖 The KMeans function documentation can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
📖 For a more detailed understanding of fitting the model and its performance, see this paper by M. Ahmed, R. Seraj, and S.M. Shamsul Islam, published in 2020.
KMeans will return a KMeans object that can be manipulated using the above attributes. Fitting the model populates attributes such as cluster_centers_, labels_, inertia_ (the sum of squared distances between each point and its closest cluster center), and n_features_in_. These tell the user new ways to organize the data, helping to identify new clusters or solidify the strength of prior-held theories.
The method fit() is what computes the k-means clustering from the object. X is passed in, which was the array defined from the make_blobs function, or would be whatever other data is made available. The method has an optional parameter, sample_weight, which by default weights every observation equally. Fitting is the process of calculating which cluster each data point should belong to.
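A minimal sketch of fitting, assuming sample blobs generated around the three centers mentioned below (variable names here are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Sample data around the three centers used in this tutorial
centers = [(1, 1), (-1, -1), (1, -1)]
X, y_true = make_blobs(n_samples=300, centers=centers,
                       cluster_std=0.4, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)  # sample_weight is omitted, so all points are weighted equally

print(kmeans.labels_[:10])    # hard assignments for the first ten points
print(kmeans.inertia_)        # sum of squared distances to the closest centers
print(kmeans.n_features_in_)  # number of features seen during fit (here, 2)
```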
cluster_centers_ is an attribute of the kmeans object, and provides the coordinates of the adjusted centers. Remember from earlier that the original cluster centers provided were at (1,1), (-1,-1), and (1,-1). It should be noted that if KMeans is interrupted before full convergence, the centers will not be consistent with the labels.
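Continuing the sketch above, the learned centers can be compared against the generating centers:

```python
print(kmeans.cluster_centers_)
# With a converged fit these land near (1, 1), (-1, -1), and (1, -1),
# though the row order (which label gets which center) is arbitrary
```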
When analyzing a K-Means clustering, the researcher has two proposed methods: either identify discriminating words from the clusters or sample documents assigned to each cluster (Grimmer et al. 2024; Quinn et al. 2010). Both methods work toward the final goal of translating algorithmic arbitrariness into a human-interpretable coding schema.
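A minimal sketch of the first method, reusing X_text and vectorizer from the vectorization example above: the highest-weighted terms in each cluster centroid serve as discriminating words.

```python
from sklearn.cluster import KMeans

km_text = KMeans(n_clusters=2, n_init=10, random_state=42)
km_text.fit(X_text)

terms = vectorizer.get_feature_names_out()
order = km_text.cluster_centers_.argsort()[:, ::-1]  # term indices by descending centroid weight

for cluster in range(km_text.n_clusters):
    top_terms = [terms[i] for i in order[cluster, :5]]
    print(f"Cluster {cluster}: {top_terms}")
```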
The source code for this graph is the scikit-learn tutorial "Comparison of the K-Means and MiniBatchKMeans clustering algorithms".
This graph first plots each cluster's points and then plots the current centroid for each cluster. Additionally, it includes the time it took to train the model. This visualization can tell a researcher about the overlap of the clusters, the distance between them, and the outliers in each cluster. Visualization provides a quick overview of the accuracy and similarity of the clustering method.
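The full plotting code lives in the scikit-learn tutorial cited above; as a simplified sketch of the same idea (assuming X from the fitting example earlier):

```python
import time
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

start = time.time()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
train_time = time.time() - start

# Points colored by assigned cluster, centroids overlaid as red crosses
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap="viridis", s=10)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100, label="centroids")
plt.title(f"K-Means clusters (train time: {train_time:.3f}s)")
plt.legend()
plt.show()
```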
This notebook provides a simplified version of K-Means. You can use this notebook 'off the shelf' as an easy way to instantiate your model without having to change the code, as sketched after the inputs and outputs below.
➡️ inputs: a vectorized dataset
⬅️ outputs: a cluster visualization and dataframe specific to the input
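A sketch of what such an off-the-shelf helper could look like; the function name and signature here are hypothetical, not the notebook's actual code:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def cluster_and_visualize(X, n_clusters):
    """Fit K-Means on a vectorized dataset, plot the clusters (first two
    dimensions only), and return a dataframe of points with cluster labels."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X)

    # Accept either a sparse matrix (e.g., TF-IDF output) or a dense array
    dense = X.toarray() if hasattr(X, "toarray") else np.asarray(X)
    plt.scatter(dense[:, 0], dense[:, 1], c=kmeans.labels_, cmap="viridis", s=10)
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                c="red", marker="x", s=100)
    plt.title(f"K-Means with {n_clusters} clusters")
    plt.show()

    return pd.DataFrame(dense).assign(cluster=kmeans.labels_)
```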
K-Means can be used to help facilitate discovery. It provides a statistical way to organize the world that may not have been apparent to researchers at first look. This facilitates new theories and methods but, like all NLP methods, requires meaning to be attributed by a subject matter expert. When using text clustering methods, more importance should be attributed to what you are clustering than to the method used.
Considering the four principles of discovery outlined by Grimmer et al. in Text as Data (context relevance, no ground truth, placing judgement on the concept, and data validity), K-Means provides a strong foundation for basic exploratory data analysis.
K-Means provides little context relevance and requires the researcher to attribute their own meaning onto the clusters. Additionally, a parameter of the clusterer asks how many clusters should be generated, which requires some understanding of the dataset beforehand. This is a strength when conducting deductive research; however, it may introduce confirmation bias and fail to clearly illustrate all of the potential phenomena in the data.
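One common heuristic for choosing that parameter (not part of the original tutorial) is the elbow method: fit K-Means across a range of k values and look for the point where inertia_ stops dropping sharply.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit one model per candidate k and record its inertia
inertias = []
k_values = range(1, 10)
for k in k_values:
    inertias.append(KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_)

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("number of clusters (k)")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.title("Elbow method for choosing k")
plt.show()
```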
Grimmer et al. argue that separate data is best, and in the case of K-Means this holds especially true. External data can be used to support external validity and thus generalization; without such data, relying on K-Means alone may be challenging. The internal validity of the data should also be carefully considered.
source: Eric Pozet on Unsplash
K-Means produces a flat clustering of the data, as opposed to hierarchical clustering, where a nested association can be generated. Embedding models like BERT (Devlin et al. 2018) can be paired with hierarchical clusterers to produce nested clusters, but this takes more computational resources. Flat clustering is achieved through hard partitions, where each document is assigned to only one category. This may be challenging when applying K-Means to data that describes a human phenomenon or concept, because there are inherently connections and multiple representations depending on positionality within a text.
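For contrast, a minimal sketch of a hierarchical alternative using scikit-learn's AgglomerativeClustering (document embeddings from a model like BERT could be passed in place of the numeric X used here):

```python
from sklearn.cluster import AgglomerativeClustering

# Agglomerative clustering builds a nested tree of merges rather than a
# single flat partition; cutting the tree at n_clusters yields the labels
agg = AgglomerativeClustering(n_clusters=3)
agg_labels = agg.fit_predict(X)
print(agg_labels[:10])
```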