Table of Contents
GitHub Directory
How to Save A BERTopic Model
Image: Bert by LuigiMarioGMod on DeviantArt
BERTopic, developed by Maarten Grootendorst, is a "topic modeling technique that leverages HuggingFace transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics" (documentation). BERTopic is popular, with roughly 1,833 citing articles since its introduction in 2020 (several hundred articles per year!). Additionally, BERTopic is a low-barrier topic modeling tool: it is entirely open source, can be customized to language needs, and can be run without (much) fine tuning. BERTopic can provide a new way to read texts from a distance, as it gives the user representative documents and a hierarchical topic organization as well. In comparison to LDA, BERTopic automatically detects potential themes through clustering, instead of requiring a seed number of topics to be passed in. In this notebook, I will provide an overview of how to instantiate a BERTopic model and different ways to customize it.
Word embeddings rely on the distributional hypothesis, or the idea that the contexts in which a word appears are clues to its meaning. Using the nearest words, the computer generates dense vectors to 'learn' the meaning of the texts. Word embeddings are useful in three main ways: they encode similarity, support automatic generalization, and provide a way to measure meaning. These methods in combination have been applied to multiple different forms of topic models, like latent Dirichlet allocation, LDA (Blei et al. 2003); Bidirectional Encoder Representations from Transformers, BERT (Devlin et al. 2018); and Global Vectors for Word Representation, GloVe (Pennington et al. 2014). Throughout these models' time in research, there have been increasing ways to integrate, analyze, and validate model outputs. High priority is placed on how models should be interpreted and generalized.
For this notebook, I recommend downloading it and running it yourself: the Plotly visualizations do not save in the notebook's outputs.
The libraries used to parse the data in this tutorial come primarily from HuggingFace, along with other common NLP tools.
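As a rough sketch, the imports for this workflow might look like the following (this assumes you have installed bertopic, datasets, scikit-learn, and pandas; adjust to your environment):

```python
# Core libraries for this tutorial (a minimal, assumed set -- adjust as needed)
from datasets import load_dataset                             # HuggingFace datasets
from bertopic import BERTopic                                 # the topic model itself
from sklearn.feature_extraction.text import CountVectorizer  # optional stop word handling
import pandas as pd
```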
Let's use this dataset posted by Google on HuggingFace. The dataset is called "Civil Comments" and was originally used to identify bias in machine learning outputs. Further documentation of the dataset can be found both on HuggingFace and in the paper Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification. The original dataset is quite large, so we will only be using a small section of it.
By conducting a topic model on the training set, it can provide further insight on what kinds of text Google is using to train its AI models.
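A minimal sketch of loading a small slice of the training split might look like this (the dataset identifier and the size of the slice are my assumptions; check the dataset card on the Hub for the current name):

```python
# Load a small slice of the Civil Comments training split
# (identifier assumed to be "google/civil_comments" on the HuggingFace Hub)
dataset = load_dataset("google/civil_comments", split="train[:1000]")

# BERTopic expects a plain list of strings
docs = [row["text"] for row in dataset]
print(len(docs), docs[0][:100])
```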
To instantiate a BERTopic model, you will pass in a list of documents. Datasets with more than about 1,000 documents tend to work best for the model. The topic model has two outputs: first, the generated topics, such as "canada_trudeau_canadians_canadian", which are named after the four or five most frequent words in the topic; and second, the probabilities for each representative document. These employ fuzzy matching but present as hard matching. Through BERTopic's hierarchical clustering you are able to explore the fuzzy clusters further.
To create the model, you can use the following call:
topic_model = BERTopic(vectorizer_model=vectorizer_model, verbose=True)
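The vectorizer_model argument above is not defined in the snippet; a common choice (my assumption, not a requirement of BERTopic) is scikit-learn's CountVectorizer with English stop words removed:

```python
# A typical vectorizer for cleaning up topic representations:
# dropping English stop words keeps words like "the" and "and" out of topic names
vectorizer_model = CountVectorizer(stop_words="english")
```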
To train the topic model, you then pass the list of documents (docs) to fit_transform:
topics, probs = topic_model.fit_transform(docs)
The get_topic_info() and get_document_info() methods show information about either the topics or the documents, and can later feed into the visualizations. get_topic_info() provides an overview of all of the topics generated by the model.
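A short usage sketch (the method names come from the BERTopic API; the output will of course depend on your data):

```python
# Overview of all topics: Topic, Count, Name, Representation, Representative_Docs
topic_info = topic_model.get_topic_info()
print(topic_info.head())

# Per-document view: which topic each document was assigned to, and with what probability
doc_info = topic_model.get_document_info(docs)
print(doc_info.head())
```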
When interpreting a model, it should be understood that some tuning variables are arbitrary. The model can still offer a powerful new way of interpreting the documents that we may not have seen before. In the words of Grimmer et al., "When making [the] assessment, our goal is to assess their ability to credibly organize documents according to a particular organization"; it is our role as researchers to decide which method is appropriate and best represents the dataset. In comparison to K-Means or Latent Dirichlet Allocation (Blei et al. 2003), BERTopic automatically generates topics for the clusters. These are heavily dependent on the embedding model and the way you have preprocessed your texts.
BERTopic generates three main types of visualization: the intertopic distance map (IDM), hierarchical document clustering, and a time series of topic change. In this tutorial, we will examine the IDM and the hierarchical clustering.
Hierarchical clustering provides a way to engage with the data through its algorithmic inference. We are able to look at the cluster outputs and provide an in-depth reading by exploring the individual documents assigned to each cluster and the words that are representative of the topics. BERTopic supports interactive engagement through its graphing library, Plotly: you can see more information about a topic as you hover over it, or even retrieve the representative documents for a single topic.
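A sketch of how both visualizations can be produced and saved (writing to HTML is one way around the caveat above that Plotly figures do not persist in notebook outputs; the file names are my own):

```python
# Intertopic distance map (IDM)
idm_fig = topic_model.visualize_topics()

# Hierarchical topic structure, built from the fitted model and the documents
hierarchical_topics = topic_model.hierarchical_topics(docs)
hierarchy_fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

# Plotly figures can be written out as standalone interactive HTML files
idm_fig.write_html("intertopic_distance_map.html")
hierarchy_fig.write_html("hierarchical_topics.html")
```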
Word_Embeddings_With_BERTopic_Model.html
Topic Table Example:
The first row, topic -1, collects "filler" or outlier documents that BERTopic could not confidently assign to a topic; it is usually not treated as a meaningful topic.
| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 688 | -1_people_don_just_like | [people, don, just, like, good, time, state, n...] | … |
| 1 | 0 | 144 | 0_trump_president_obama_news | [trump, president, obama, news, said, press, l...] | … |
| 2 | 1 | 87 | 1_police_gun_guns_shot | [police, gun, guns, shot, crime, case, moose, ...] | … |
| 3 | 2 | 83 | 2_canada_trudeau_canadians_canadian | [canada, trudeau, canadians, canadian, governm...] | … |
Hierarchical Topic Modeling Visualization
A video showing the hierarchical document visualization. A slider is used to iterate through the topic tree generated.
Intertopic Distance Map
A video showing the interactive function of an IDM. When the slider is moved, it changes the color of the topic bubble selected on the map
BERTopic is a low(er)-barrier tool for topic modeling. If you are interested in trying it out for yourself, you can use this GitHub Gist I created. The notebook will train a topic model and generate a hierarchical visualization and an intertopic distance map.
Inputs: a Pandas dataframe with at least one column of text data.
Outputs: a trained topic model for your data and two interactive visualizations for exploratory analysis.
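The Gist itself is the reference implementation; as a rough sketch of the same workflow (the function and parameter names here are my own, not taken from the Gist), a helper might look like:

```python
def topic_model_from_dataframe(df: pd.DataFrame, text_column: str):
    """Fit a BERTopic model on one text column of a dataframe and
    return the model plus the two exploratory visualizations."""
    docs = df[text_column].dropna().astype(str).tolist()

    vectorizer_model = CountVectorizer(stop_words="english")
    topic_model = BERTopic(vectorizer_model=vectorizer_model, verbose=True)
    topic_model.fit_transform(docs)

    idm_fig = topic_model.visualize_topics()
    hierarchy_fig = topic_model.visualize_hierarchy()

    return topic_model, idm_fig, hierarchy_fig
```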
If you want to learn more about how to save a BERTopic model so you can access it later, feel free to view my Model Serialization (i.e. pickling) tutorial. You will need a HuggingFace account to store your freshly pickled model on the shelf.
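For reference, recent versions of BERTopic expose save/load and Hub-upload methods along these lines (the repository name below is a placeholder; see the serialization tutorial for the full walkthrough):

```python
# Save locally using safetensors serialization (available in recent BERTopic versions)
topic_model.save("civil_comments_topic_model", serialization="safetensors")

# Or push to the HuggingFace Hub (requires being logged in with a HuggingFace account)
topic_model.push_to_hf_hub(repo_id="your-username/civil-comments-bertopic")

# Later, reload the model from disk or from the Hub
loaded_model = BERTopic.load("civil_comments_topic_model")
```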
Using a topic model can support exploratory data analysis as well as serve as a starting point for measurement (Grimmer et al. 2024). The goal of measurement is to identify a concept within your hypothesis, and a topic model such as this one supports that exploration. Additionally, examining topics in this way could inform and support the development of a qualitative codebook. Topic models essentially show which words are most likely to appear near other words. When a corpus is especially large or complex to read, this can provide an entry point into better understanding the data.
Topic models, like all NLP methods, are not actually reading the texts. They do, however, have the benefit of taking words in context, so they can parse the difference between "he got smoked" and "it's smoking hot in here!" In comparison to LDA or K-Means, BERTopic therefore allows you to engage more deeply with the topics because it is semantically informed.