🧾 Table Of Contents
‣ GitHub Repositories
Text-As-Data/Multinomial Language Models Tutorial.ipynb at main · NatalieRMCastro/Text-As-Data
Text-As-Data/Multinomial Bayes Classifier Gist.ipynb at main · NatalieRMCastro/Text-As-Data
here is a description of what the method is, how it is used, and what kind of data it is best used for. Link here also to the literature review page, but provide a short image of it here
This is the code notebook you made, embed the GitHub page for it. Provide a short overview of the notebook here
The libraries used for this notebook are primarily for data management, data parsing, vectorization, and word preprocessing.
Data used in this tutorial will be provided by scikit-learn to make it more accessible across multiple devices.
The data used in this tutorial is hosted on Hugging Face and can be imported with 'load_dataset' from the Hugging Face datasets library. The dataset was generated by Ebenge Usip and featured in the paper "Hierarchical Pre-training for Sequence Labelling in Spoken Dialog" (SILICONE). The corpus used is the "Daily Dialog Act Corpus", or DYDA_DA, which sorts each utterance into one of four categories: "commissive", "directive", "informative", or "question".
📖 More information about the dataset and paper can be found at this hyperlink, or at the repository linked there. Hugging Face provides multiple datasets that are already coded, which makes training models much easier.
Multinomial_Language_Models.html
Training a Multinomial Naive Bayes (MNB) classifier takes six steps:
These steps prepare the cleanest possible text for the model, turn that text into a quantitative representation, train the model on the data, and assess its performance in terms of the accuracy of its probability predictions.
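To make the vectorize, train, and assess stages concrete, here is a minimal from-scratch sketch that uses only the Python standard library. It is not the notebook's code: the toy utterances, the function names 'vectorize', 'fit', and 'predict', and the smoothing value 'alpha=1.0' are all assumptions made for illustration, and a real run would use the DYDA_DA data and a library implementation instead.

```python
import math
from collections import Counter, defaultdict

# Toy, made-up utterances labelled with two of the dialogue-act categories.
train = [
    ("can you open the window", "question"),
    ("what time is the meeting", "question"),
    ("is the report ready", "question"),
    ("please open the window", "directive"),
    ("send the report today", "directive"),
    ("please call the office", "directive"),
]
test = [("what time is it", "question"), ("please send the report", "directive")]


def vectorize(text):
    """Turn a text into bag-of-words counts."""
    return Counter(text.lower().split())


def fit(examples, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        class_docs[label] += 1
        word_counts[label].update(vectorize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label in class_docs:
        # Smoothing adds alpha to every vocabulary word's count.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(class_docs[label] / len(examples)),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
            },
        }
    return model


def predict(model, text):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for word, count in vectorize(text).items():
            # Words never seen in training are skipped.
            if word in params["log_likelihood"]:
                score += count * params["log_likelihood"][word]
        scores[label] = score
    return max(scores, key=scores.get)


model = fit(train)
predictions = [predict(model, text) for text, _ in test]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that is absent from one class from forcing that class's probability to zero.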
Preprocessing depends on the task at hand. The most popular options are lemmatization, stopword removal, and stemming. These practices are used well beyond MNB and apply to many other approaches, such as Bag of Words, Topic Modeling, and Word Embeddings. Preprocessing the data allows patterns to be identified regardless of word form or filler information. A lemmatizer reduces inflected forms to a shared base form, so the words 'stop', 'stopping', and 'stopped' are all treated the same, because the shared root is indifferent to tense. However, these methods should be used with discretion, as in some cases removing stopwords can confuse the analysis (for example, trying to identify a gender difference in text when the stopword list removes the pronouns she/he/they).
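The pronoun caveat is easy to see in a small sketch. The stopword set and the lemma lookup below are hypothetical, deliberately tiny stand-ins; a real pipeline would rely on a full stopword list and a proper lemmatizer rather than a hand-written dictionary.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOPWORDS = {"the", "a", "at", "and", "she", "he", "they"}
LEMMAS = {"stopping": "stop", "stopped": "stop", "stops": "stop"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def preprocess(text, stopwords=STOPWORDS, lemmas=LEMMAS):
    """Drop stopwords, then map each remaining token to its base form."""
    tokens = [t for t in tokenize(text) if t not in stopwords]
    return [lemmas.get(t, t) for t in tokens]


# Both sentences collapse to the same tokens, so the gender cue is lost.
print(preprocess("She stopped at the light"))
print(preprocess("He stops at the light"))

# Keeping pronouns out of the stopword list preserves that cue.
pronoun_safe = STOPWORDS - {"she", "he", "they"}
print(preprocess("She stopped at the light", stopwords=pronoun_safe))
```

Passing the stopword set as a parameter makes it simple to tailor the list to the research question instead of accepting a default list wholesale.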