‣ GitHub Repositories
Text-As-Data/Word Counting Tutorial.ipynb at main · NatalieRMCastro/Text-As-Data
Text-As-Data/Word Counting Gist.ipynb at main · NatalieRMCastro/Text-As-Data
Word counting is the simplest tool used in early NLP. The counts themselves are exact, but the method has a tendency to overlook context and does not achieve the same insight as other models. Recognizing these caveats helps you decide where word counting best supports the overall exploration of the data.
Keyword counting shows how one word (or a few) is used relative to the entire corpus of text. It can also serve as a baseline validity check for unsupervised and supervised models. If the visualizations from word counts show something that is absent from, or chronologically different than, the topic clustering (or any other tool), that mismatch points to the area where accuracy should be re-validated.
This tutorial explores the composition of the texts in two basic steps: first, a counting method identifies which words are relevant; second, the occurrence of those words and their sentiment proportions are measured.
I recommend downloading the Notebook from GitHub to explore the visualizations! They are not rendered in the HTML embed here.
This tutorial uses only a few libraries compared with many other NLP techniques. Their main function is to import a dataset, structure it, and then count it.
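As a minimal sketch of keyword counting, the snippet below counts a few keywords and reports each one's share of all tokens. The `tokenize` and `keyword_share` helpers and the three example sentences are hypothetical illustrations, not code or data from the notebook.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def keyword_share(docs, keywords):
    """Return (count, share of all corpus tokens) for each keyword."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    total = sum(counts.values())
    return {kw: (counts[kw], counts[kw] / total if total else 0.0) for kw in keywords}

# Hypothetical mini-corpus, not taken from the tutorial's dataset.
docs = [
    "Covid cases are rising again",
    "New covid test sites open today",
    "Stay home and stay safe",
]
shares = keyword_share(docs, ["covid", "stay"])
for word, (count, share) in shares.items():
    print(f"{word}: {count} occurrences, {share:.1%} of tokens")
```

Dividing by the total token count makes keyword usage comparable across corpora of different sizes.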
2.1 Data Importation
The data used in this tutorial comes from a HuggingFace dataset by the user Krushil Patel titled Covid Tweet Text Classification. The dataset consists of 9,450 different tweets and is labeled, but the labeling schema is unknown (I believe it splits the data between test and training sets). There is little metadata about this dataset, but it was selected because of its size and the powerful emotions evoked during COVID-19.
2.2 Data Pre-Processing
📌 Here is the documentation about the different kinds of stemming algorithms available through the Natural Language Toolkit
An advantage of stemming when using dictionary methods is that it will capture all word forms: the words 'cause', 'causing', and 'caused' are all treated as the same word. The trade-off is that stemming replaces the original form of each word in the data with its stem.
14,000 words would be a lot to visualize, especially if they do not occur frequently. The analysis could proceed in two ways: looking at the most frequent words or looking at the least frequent words. The most frequent words provide information about what was most relevant to the sampled population. Conversely, analyzing words with few instances can open the way to exploring silences and gaps within the dataset.
For this analysis, let's look at the words with the highest frequency and then calculate their sentiment. A few other calculations will be performed to look at the composition of the sentiment.
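The effect of stemming on counts can be illustrated with a toy suffix stripper. This is a deliberately simplified stand-in, not the Porter or Snowball algorithm from NLTK; the `toy_stem` function and its suffix list are assumptions made only for this sketch.

```python
from collections import Counter

# Toy suffix list, checked in this order. Real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "es", "e", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["cause", "causing", "caused", "causes", "test", "testing"]
raw_counts = Counter(tokens)                         # six distinct surface forms
stem_counts = Counter(toy_stem(t) for t in tokens)   # forms merged by stem
print(stem_counts)
```

Every form of 'cause' collapses into the single stem 'caus', which is not itself a dictionary word. Counts become easier to aggregate, but the original word forms are no longer visible.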
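Both directions of analysis can be sketched with `collections.Counter` from the standard library. The token list here is a hypothetical example rather than the tutorial's data.

```python
from collections import Counter

# Hypothetical, already-tokenized text.
tokens = "covid case covid test case covid mask vaccine".split()
counts = Counter(tokens)

# Most frequent words: what the sampled population talked about most.
top_two = counts.most_common(2)

# Words that occur exactly once: candidates for exploring gaps and silences.
rare = sorted(word for word, n in counts.items() if n == 1)

print(top_two)
print(rare)
```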
At this stage, we will generate a table like the following:
index | value | percentage |
---|---|---|
2 | covid | .5257 |
3 | case | .2659 |
45 | test | .2224 |
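One way to build a table of this shape is sketched below. It assumes the percentage column is the fraction of tweets that contain each word; the notebook may compute it differently, and the `document_share` helper and sample tweets are hypothetical.

```python
from collections import Counter

def document_share(docs, top_n=3):
    """Rank words by the fraction of documents that contain them."""
    doc_freq = Counter()
    for doc in docs:
        # A set counts each word at most once per document.
        doc_freq.update(set(doc.lower().split()))
    rows = [(word, n / len(docs)) for word, n in doc_freq.items()]
    # Highest share first; ties are broken alphabetically.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows[:top_n]

# Hypothetical tweets, not taken from the dataset.
docs = [
    "covid case count rises",
    "covid test result",
    "new covid case",
    "test site closed",
]
table = document_share(docs)
for word, share in table:
    print(f"{word} | {share:.4f}")
```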
📌 The documentation for the NLTK VADER lexicon is linked. It provides the source code for the original model.
Using dictionary methods, we can easily conduct a sentiment analysis to see what percentage of the corpus belongs to a certain emotion. Unlike other NLP methods, NLTK's sentiment analysis does actually 'read' the word and provides a score based on it. However, it does not 'read' the word in context when only a bag of words is passed to it.
In this section of code, the sentiment analyzer is applied to the tweets above, and the dataframe is expanded with the resulting scores. The compound score is then used in the analysis because it combines positive and negative tone into a single normalized value between -1 and 1. For example, someone could say “that’s just wonderful, I wasted my whole paycheck on that dumb thing!”, and the sentiment analyzer would weigh 'wonderful', 'wasted', and 'dumb' together to produce one score.
By calculating the frequency and percentage of each sentiment, we can generate a dataframe that holds the average percentage, the frequency, and the breakdown of each type of sentiment. These types of graphs can show how a passage is composed or what its overall tone is. In this section, the sentiment analysis was based on positive, negative, or neutral sentiment, but many other classifiers could provide more fine-grained information.
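To show how mixed words combine into one score, here is a toy compound calculation. It sums per-word valences and normalizes the total with x / sqrt(x² + 15), which to my understanding mirrors the normalization step in VADER. The valence values in `TOY_LEXICON` are invented for illustration, and VADER's extra rules for negation, intensifiers, and punctuation are left out.

```python
import math

# Hypothetical valence scores; the real VADER lexicon assigns its own values.
TOY_LEXICON = {"wonderful": 2.5, "wasted": -2.0, "dumb": -2.5}

def toy_compound(text, alpha=15):
    """Sum the valence of known words and squash the total into (-1, 1)."""
    words = text.lower().replace(",", " ").replace("!", " ").split()
    total = sum(TOY_LEXICON.get(word, 0.0) for word in words)
    return total / math.sqrt(total * total + alpha)

score = toy_compound(
    "that's just wonderful, I wasted my whole paycheck on that dumb thing!"
)
print(round(score, 4))
```

With these toy values the two negative words outweigh 'wonderful', so the compound score comes out negative even though the sentence contains a strongly positive word.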
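The breakdown can be sketched in plain Python once compound scores exist. The ±0.05 cutoffs follow the thresholds commonly suggested for VADER compound scores; the `label` and `sentiment_breakdown` helpers and the sample scores are hypothetical.

```python
from collections import Counter

def label(compound, threshold=0.05):
    """Map a compound score to positive, negative, or neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

def sentiment_breakdown(scores):
    """Return the share of scores that fall under each sentiment label."""
    labels = Counter(label(s) for s in scores)
    n = len(scores)
    return {
        name: (labels[name] / n if n else 0.0)
        for name in ("positive", "neutral", "negative")
    }

# Hypothetical compound scores, one per tweet.
scores = [0.6, -0.4, 0.0, 0.02, -0.7]
breakdown = sentiment_breakdown(scores)
mean_score = sum(scores) / len(scores)
print(breakdown, round(mean_score, 3))
```

The resulting shares can be placed in a dataframe column or plotted directly to show the overall tone of the corpus.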