πŸ“Œ Purpose: The purpose of this notebook is to validate whether Gemini summaries preserve the themes found in the actual transcripts of the board meetings. To continue using Gemini summaries as a source of input, the topics modeled from the summaries must cluster closely to the topics modeled from the original transcripts. Additionally, any themes present in the transcript model but missing from the Gemini summary model will be noted and assessed.

Where does this fit into the methodology?

πŸ“‚ Files Generated: Gemini Keywords by Meeting.xlsx (Section 2.1)

https://github.com/NatalieRMCastro/schoolboard-notebooks


Topic Modeling - Gemini Keyword Extraction (.ipynb)

1. Environment Building

1.1 Library Import

These are the libraries I typically import for my files. I pull on this set throughout the analysis and copy and paste it between notebooks. Not every library is always used, but I assume it is best to import them generally.

Libraries specific to this notebook are BERTopic and the Hugging Face Hub. Together they allow trained models to be stored, so that later calls can load the saved model instead of retraining it every iteration. A minimal sketch of that save/load round trip appears after the import cell below.

'''OS MANAGEMENT'''
import os
import glob
import smart_open

'''DATA MANAGEMENT'''
import pandas as pd
import json
import numpy as np
from datetime import datetime

'''DATA QUERIES'''
import regex as re

'''DATA VISUALIZATION'''
import seaborn as sb
%matplotlib inline
import matplotlib.pyplot as plt

'''NLTK'''
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stopwords_set = set(stopwords.words('english'))

''' GENSIM LIBRARIES'''
import gensim
from gensim import corpora
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string
from gensim.parsing.preprocessing import preprocess_documents
from gensim.parsing.preprocessing import stem_text
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models.doc2vec import Doc2Vec,TaggedDocument

''' TOPIC MODELING'''
from bertopic import BERTopic

from huggingface_hub import login
login()
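As noted in Section 1.1, the Hugging Face Hub login above is what allows trained models to be stored and reloaded rather than retrained. The following is a minimal sketch of that save/load round trip, not a cell from this notebook: the repository id "username/gemini-keywords-bertopic" and the min_topic_size value are placeholders.

''' SKETCH: STORING AND RELOADING A BERTOPIC MODEL (illustrative only) '''

def store_and_reload(documents, repo_id="username/gemini-keywords-bertopic"):
    ## Fit the model once on the documents
    topic_model = BERTopic(min_topic_size=5)
    topic_model.fit(documents)
    ## Push the fitted model to the Hugging Face Hub so it does not need to be retrained
    topic_model.push_to_hf_hub(repo_id=repo_id)
    ## Later, reload the stored model instead of refitting it
    return BERTopic.load(repo_id)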

1.2 Data Import

In the first section of code, I am pointing to the file path on my computer and extracting the summaries that were created in the earlier Gemini summarization step.

A text file was created for each meeting, containing the output of multiple calls, one for each equally sized passage of the meeting. A JSON-formatted object (within Gemini's text-based limitations) was returned for each passage, with keywords characterizing the summary.
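For reference, a single passage inside one of these text files looks roughly like the snippet below. The keywords and summary are invented for illustration; only the overall "keywords" / "summary" structure, which the regex patterns in Section 2.1 rely on, reflects the actual files.

{"keywords": "budget approval, teacher retention, bus routes", "summary": "The board reviewed the proposed budget and discussed staffing and transportation."}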

In the second block of code, file names are extracted from the path patterns. These file names are unique identifiers used throughout multiple notebooks: each consists of the school district number followed by the meeting date in YYYYMMDD format.

The third block of code then reads each file in the pattern and stores its contents both in a dictionary (keyed by the respective file name) and in a transcript list.

    ## Setting a pattern to extract the files
pattern = sorted(glob.glob(r"...youtube transcripts\gemini summaries\*.txt"))

    ## Setting a pattern to the folder repository (backslashes escaped so the string can be used as a regex in re.sub below)
path_pattern = r"...youtube transcripts\\gemini summaries\\"

## Extracting the filenames
paths = []
for file_path in pattern:
    paths.append(file_path)

print ('test for accuracy: paths length is:',len(paths))
'''This section of code creates a list of the filenames and their respective dates.
These file names are consistent across files, and are used as a unique meeting identifier.'''

    ## Creating a storage container for the file names
int_txtfiles = []

    ## Iterating through each pattern to extract the file name using regex methods from the entire path
for file in pattern:
    file_name1 = re.sub(path_pattern,"",file)
    file_name2 = re.sub(".txt","",file_name1)
    file_name = int(file_name2)
    int_txtfiles.append(file_name)
    
    ## Iterating through the integer file names and turning them into strings to have both formats
int_txtfiles.sort() ## Sorting here is important to match the extracted files in the following cell
txtfiles = []
for file in int_txtfiles:
    file_str = str(file)
    txtfiles.append(file_str)
'''The transcripts are then extracted from each file in the pattern and stored in a dictionary'''

## Creating storage containers for the transcript and file name.
transcript_dicts = {}
transcript_list = []

## Iterating through each file and storing it in the above container
iteration = 0
for file in pattern:
    with open(file,"r") as f:
        file_content = f.read()
        transcript_list.append(file_content)
    transcript_dicts[txtfiles[iteration]] = file_content
    iteration = iteration + 1

2. Extracting Gemini Summary Features

2.1 Extracting Keywords

Keywords were generated by Gemini based on small parts of the overall transcript. This was intended to maintain model focus and to provide a way to track themes over time.

The first cell defines regular expression patterns informed by the structure of the returned Gemini summaries. These patterns are then applied in the second cell to extract the keywords within each meeting. For each meeting, multiple lists are returned, depending on how long the original transcript was.

The third cell iterates through the extracted keywords and cleans them. It creates a storage item to hold any misbehaving keywords (batches where no keywords are present) and stores the cleaned matches for each batch in the container keyword_main. This is used to build a dataframe with the number of sections in the meeting (a proxy for length), the list of keywords returned (one list per section), and the meeting identifier. The dataframe is then created and saved in the fourth cell as Gemini Keywords by Meeting.xlsx.

The head of the keywords_macro dataframe is:

[image: head of the keywords_macro dataframe]

The fifth cell creates the micro dataframe to generate decomposition and recomposition IDs for the keywords. These later indicate which part of the meeting each keyword set comes from and what β€˜span’ the topics had throughout the meeting. Each row in the resulting dataframe holds the keywords for one section. The ID is the unique identifier, and the meeting column provides additional information about the school district and date. These can be easily decomposed to conduct a time-based or by-district analysis.

[image: head of the keywords_micro dataframe]

The sixth cell then creates functions to add the date and school district to the dataframe. The functions are based on the structure of the meeting IDs and can be applied across files to extract these fields.
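The sixth cell's code is not reproduced in this section, so the following is a minimal sketch of what such helpers could look like, assuming each identifier is the district number followed by an eight-character YYYYMMDD date; the function and column names are illustrative.

''' SKETCH: DATE AND DISTRICT HELPERS (illustrative only) '''

def extract_date(meeting_id):
    ## The last eight characters of the identifier hold the meeting date in YYYYMMDD format
    return datetime.strptime(str(meeting_id)[-8:], '%Y%m%d')

def extract_district(meeting_id):
    ## Everything before the date is the school district number
    return str(meeting_id)[:-8]

    ## Applied to the keywords_micro dataframe created in the fifth cell
keywords_micro['date'] = keywords_micro['meeting'].apply(extract_date)
keywords_micro['district'] = keywords_micro['meeting'].apply(extract_district)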

These cells create two dataframes that track the keywords generated by Gemini. The keywords are not representative of topic frequencies within the chunks and are unique to this model run; if the model were instantiated again, the keywords could differ. Re-running the keyword generation would be an interesting test of repeated validity over time. The keywords generated here are used to fit one BERTopic model and serve as one of the three parts of the Gemini validation. The question that can be asked of this method is:

<aside> ❓

Do summaries created in Gemini preserve nuanced themes identified in the original transcripts?

</aside>

This in part helps support or challenge the decision to use Gemini as an analysis tool, and whether it provides an accurate picture of the data.

''' Instantiating a pattern for the summary keywords '''

keys = list(transcript_dicts.keys())
    ## Pattern 1 looks for the section between keywords and summary returned in the text file
keyword_patt1 = r'(?<=keywords)(.*?)(?=summary)'
    ## Pattern 2 looks for the characters of the actual text between the quotation marks
keyword_patt2 = r'(?<=": ").+?(?=")'
''' Iterating through each transcript and finding the multiple keywords'''

    ## Creating a storage container for the matches. This will hold a list of the keywords throughout the meeting.
    ## There will be multiple matches for each meeting, as they vary in length.
matches = []
for key in keys:
    curr_transcript = transcript_dicts[key]
    matched = re.findall(keyword_patt1,curr_transcript)
    matches.append(matched)
''' This cell extracts the keywords and reassigns them to the original meeting. Two dataframes are created: the macro and the micro.
The macro dataframe holds information about the length of the meeting, the matches, and the meeting id.
The micro dataframe creates IDs for each chunk and has data about the meeting, meeting id, section id, keywords, and ID.'''

    ## Creating storage containers for the keywords (macro)
#keyword_matches = []
misbehaving = []
keyword_main = {}
keyword_micro = {}

    ## Matches holds all of the returned keywords in the meeting
iteration = 0
for match_batch in matches:
    ## Creating additional storage containers for the keywords (micro)
    cleaned_matches = []
    keyword_matches = []
    key = keys[iteration]
    match_iteration = 0
    
        ## A match batch is each set of keywords returned by Gemini
        ## These are iterated through to clean the keywords and return a proper dictionary object
    for match in match_batch:
        curr_keywords = match.split(sep=",")
        
        ## Checking to see if the keywords are there
        if len(curr_keywords) == 0:
            misbehaving.append(curr_keywords)
            
        ## Appending the current keywords to the list of all keywords
        else:
            keyword_matches.append(curr_keywords)
            match_iteration = match_iteration + 1
    
    ## Iterating through the recently appended matches and identifying the cleaned keywords
    for batch in keyword_matches:
        batches = []
        for find in batch:
            word_find = re.findall(r'(\w+-*\w*\s*\w*)',find)
            if len(word_find) != 0:
                word = word_find[0]
                batches.append(word.lower())
        cleaned_matches.append(batches)
      
    ## Storing the match id, and matches to keyword main storage container
    keyword_main[iteration] = {'sections':match_iteration,'matches':cleaned_matches}
    
    iteration = iteration + 1
''' Creating the keywords_macro dataframe from the container keyword_main'''

keywords_macro = pd.DataFrame(keyword_main).transpose()
keywords_macro['meeting'] = keys
keywords_macro.to_excel('Gemini Keywords by Meeting.xlsx')
''' Creating a smaller dataframe where each row is a keyword section'''

## Extracting the smaller keywords and generating section ids to have a 'meeting landscape'
all_matches = keywords_macro['matches'].to_list()

   ## Keeping track of the iteration to understand what meeting and how many sections it has. 
section_iteration = 0
meeting_iteration = 0
match_keeper = []
    
    ## Iterating through each stored list from the keywords_macro matches.
for bundle in all_matches:
    ## Creating a meeting dictionary to then turn into a dataframe
    for theme in bundle:
        meeting_dictionary = {'meeting':keys[meeting_iteration],'meeting id':meeting_iteration+1,'section':section_iteration+1,'keywords':theme}
        match_keeper.append(meeting_dictionary)
    
        section_iteration = section_iteration + 1
    meeting_iteration = meeting_iteration + 1
    
keywords_micro = pd.DataFrame(match_keeper)

## Using a lambda to create an ID based on the meeting ID and section ID.
keywords_micro['ID'] = keywords_micro.apply(lambda row: f"{row['meeting id']}.{row['section']}", axis=1)
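As noted in Section 2.1, the composite ID can be decomposed again for a time-based or by-district analysis. A minimal illustrative example follows; the new column names are placeholders.

''' SKETCH: DECOMPOSING THE ID (illustrative only) '''

    ## Splitting the composite ID back into its meeting and section components
keywords_micro[['meeting component','section component']] = keywords_micro['ID'].str.split('.', expand=True)

    ## Counting how many keyword sections each meeting contributed (a rough proxy for meeting length)
sections_per_meeting = keywords_micro.groupby('meeting component')['keywords'].count()
print(sections_per_meeting.head())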