Pain Discourse Network: An Integrated Vocabulary Approach
Hey guys! Today, we're diving into an exciting project: building a pain discourse co-occurrence network pipeline that integrates paper-validated vocabulary. This means we're creating a system to analyze how words related to pain are used together in text, and we're making sure to include a set of validated terms from research papers to make our analysis super accurate and relevant. This is crucial for understanding how people talk about pain, especially in conditions like Interstitial Cystitis (IC) and Bladder Pain Syndrome (BPS).
Our goal is to enhance the analysis of textual data, specifically from online forums like Reddit, by prioritizing clinically relevant vocabulary. By doing this, we can better capture the nuances of pain language and identify key relationships between different symptoms, anatomical sites, and the overall impact of pain on individuals' lives. This approach ensures that our analysis is grounded in both real-world patient experiences and established medical knowledge. The pipeline involves several steps, including data preparation, text normalization, co-occurrence counting, graph construction, and community detection. Each step is designed to refine and enrich the network, ultimately providing a comprehensive view of the pain discourse landscape. So, let's get started and see how we can build this awesome pipeline!
Alright, first things first, let's get our environment set up. We'll start by mounting Google Drive, which is where our data and outputs will live. This makes it super easy to access everything. If you've already mounted your drive, you can skip this step, but if not, just run the provided code snippet. Once that's done, we'll install some necessary Python libraries. Think of these as the tools we'll use to build our pipeline. We're talking about libraries like networkx
for graph analysis, python-louvain
for community detection, and nltk
for natural language processing. These are the workhorses we'll lean on for text processing and network analysis, and they'll help us crunch through the text data and extract meaningful insights. We'll also use tqdm
to track the progress of our code, because nobody likes staring at a blank screen wondering if anything's happening! So, let's install these dependencies and get ready to roll!
from google.colab import drive
import os
# Mount Google Drive (skip if already mounted)
if not os.path.exists('/content/drive/MyDrive'):
    drive.mount('/content/drive')
else:
    print("Google Drive is already mounted")
!pip -q install python-docx networkx==3.2.1 python-louvain==0.16 nltk==3.8.1 tqdm==4.66.4
Now that we've got our environment prepped and ready, it's time to bring in the cavalry – or, in this case, the Python libraries we need. We're talking about importing modules that will handle everything from text processing to network analysis. First up, we've got the usual suspects like re
for regular expressions (for text wrangling), itertools
for efficient looping, and collections
for handy data structures like counters. Then we bring in the big guns: networkx
for creating and analyzing graphs, community.community_louvain
for community detection (finding clusters of related terms), and nltk
for natural language processing tasks like tokenization and stop word removal. And of course, we can't forget tqdm
for those sweet, sweet progress bars that keep us sane during long computations!
These libraries are the building blocks of our pipeline. They provide the functions and tools we need to dissect the text data, identify patterns, and build a network that represents the relationships between different pain-related terms. By importing these modules, we're essentially loading up our toolbox with everything we need to tackle this project. So, let's make sure we've got all these imports in place and get ready to move on to the next step. This part is like gathering your materials before starting a big art project – you want to make sure you have everything you need before you start creating.
import re, itertools, math, csv, collections, os, sys, time
from tqdm import tqdm
from docx import Document
import networkx as nx
import community.community_louvain as community_louvain
import nltk
from nltk.corpus import stopwords as sw
Okay, let's talk settings! This is where we fine-tune our pipeline to get the best results. We're setting up some important variables that control how our code runs. DOCX_PATH
is where our input document lives – in this case, a .docx
file containing text from Reddit discussions. OUT_DIR
specifies where we'll save all our output files, like the network graph and analysis results. Think of these settings as the control panel for our pipeline. They allow us to adjust various parameters to optimize the analysis. For instance, WINDOW_SIZE
determines how many words we look at around each term to identify co-occurrences. A smaller window might capture more immediate relationships, while a larger window can identify broader connections.
MIN_COUNT
sets the threshold for how many times two words need to appear together to be considered a significant co-occurrence. This helps us filter out noise and focus on the most important relationships. LOWERCASE
tells the pipeline whether to convert all text to lowercase, which helps ensure that we treat words like "Pain" and "pain" the same. USE_BASIC_NORMALIZE
and USE_CLINICAL_PRIORITY
are the real game-changers here. USE_BASIC_NORMALIZE
activates a normalization process that maps common slang and abbreviations to standardized terms, making our analysis more robust. USE_CLINICAL_PRIORITY
is a new feature that prioritizes clinical vocabulary, ensuring that terms validated in research papers are given extra weight in the analysis. This is super important for making our network clinically relevant. Finally, PRINT_TOP_K
determines how many of the top results we'll display. These settings collectively define how our pipeline processes the text data and builds the co-occurrence network. By carefully adjusting these parameters, we can tailor the analysis to our specific research questions and ensure that we're capturing the most meaningful insights.
DOCX_PATH = "/content/drive/MyDrive/reddit.docx"
OUT_DIR = "/content/drive/MyDrive/pain_nlp_outputs_enhanced" # New output folder
WINDOW_SIZE = 5
MIN_COUNT = 5
LOWERCASE = True
USE_BASIC_NORMALIZE = True
USE_CLINICAL_PRIORITY = True # New feature: prioritize clinical vocabulary
PRINT_TOP_K = 30
Now, let's talk about the heart of our enhanced analysis: the paper-validated terms. We've got a set of 19 standardized terms related to Interstitial Cystitis (IC) and Bladder Pain Syndrome (BPS) that have been validated in research papers. These terms are categorized into four key areas: pain-related symptoms, urinary function symptoms, impact/bother symptoms, and anatomical sites. This is where we're making sure our analysis is grounded in real, clinically relevant language. Think of these terms as our gold standard – they're the words we know are important in the context of IC/BPS.
By including these terms, we're ensuring that our co-occurrence network accurately reflects the language used in clinical settings and research. This is a huge step towards making our analysis more meaningful and actionable. The VALIDATED_ICBPS_TERMS
dictionary organizes these terms into categories, which will be helpful for later analysis and interpretation. We then flatten this list into ALL_VALIDATED_TERMS
for easier access. This list will be used to prioritize these terms during the tokenization and co-occurrence counting steps. By focusing on these validated terms, we can build a network that is not only comprehensive but also highly relevant to the specific challenges faced by individuals with IC/BPS. This ensures that our insights are directly applicable to improving patient care and understanding the complexities of pain discourse.
# === N=19 standardized IC/BPS terms validated in the paper ===
VALIDATED_ICBPS_TERMS = {
    # Pain-related symptoms (4 terms)
    "pain_symptoms": ["burning", "discomfort", "pain", "pressure"],
    # Urinary function symptoms (4 terms)
    "urinary_symptoms": ["frequent", "need", "urgency", "urinate"],
    # Impact/bother symptoms (3 terms)
    "impact_symptoms": ["avoid", "bother", "symptoms"],
    # Anatomical sites (8 terms)
    "anatomical_sites": ["abdomen", "bladder", "pelvis", "perineum",
                         "sacrum", "testes", "urethra", "vagina"]
}
# Flatten list
ALL_VALIDATED_TERMS = []
for category in VALIDATED_ICBPS_TERMS.values():
    ALL_VALIDATED_TERMS.extend(category)
print(f"Paper-validated standardized vocabulary: {len(ALL_VALIDATED_TERMS)} terms")
for cat, terms in VALIDATED_ICBPS_TERMS.items():
    print(f" {cat}: {len(terms)} terms")
Alright, time to prep the battlefield! First, we make sure our output directory exists. This is where all our results will be saved, so we use os.makedirs
with exist_ok=True
to create the directory if it doesn't exist, and do nothing if it does. This prevents any annoying errors later on. Next, we're diving into the world of Natural Language Processing (NLP) by downloading the stopwords
corpus from nltk
. Stopwords are common words like "the", "a", and "is" that don't carry much meaning in our analysis, so we want to filter them out. We create a set called stop_en
containing English stopwords. This set will be used later to remove these words from our text data.
These steps are crucial for ensuring that our analysis is clean and efficient. Creating the output directory ensures that we have a designated space for our results, making it easier to keep track of everything. Downloading the stopwords corpus prepares us for the text processing stage, where we'll be cleaning up the data to focus on the most important words. By removing stopwords, we reduce noise and improve the accuracy of our co-occurrence analysis. Think of it as decluttering your workspace before starting a project – a clean environment leads to better results! This preparation sets the stage for the more complex steps of our pipeline, ensuring that we have a solid foundation to build upon.
os.makedirs(OUT_DIR, exist_ok=True)
nltk.download('stopwords')
stop_en = set(sw.words('english'))
Let's get down to the nitty-gritty of text processing! This is where we create a normalize_map
– a dictionary that's like our secret weapon for cleaning up the text. This map helps us convert messy, real-world language (think Reddit slang and typos) into standardized terms. We're talking about mapping things like "floo" to "floor", "bladde" to "bladder", and "ic" to "interstitial_cystitis". This is super important because it ensures that we're counting the same concepts even if they're expressed in different ways. The normalize_map
includes both original medical text normalization (fixing typos and common misspellings) and Reddit slang/abbreviations mapped to standard terms. This dual approach ensures that our analysis is both medically accurate and sensitive to the nuances of online communication.
But the real magic happens when we start normalizing to our paper-standard vocabulary. We're mapping terms like "urge" and "urgent" to "urgency", "pee" and "peeing" to "urinate", and "hurt" and "hurting" to "pain". This is how we bridge the gap between casual language and clinical terminology. By normalizing to this validated vocabulary, we're ensuring that our co-occurrence network is grounded in established medical knowledge. Think of this step as translating different dialects into a common language. It allows us to see the underlying connections between words and concepts, regardless of how they're expressed. This enhanced normalization map is a key ingredient in making our analysis robust and meaningful, allowing us to capture the true essence of pain discourse.
# 5) Enhanced normalization map (Reddit → medical terms + original fixes)
normalize_map = {
    # Original medical text normalization
    "floo": "floor", "bladde": "bladder", "theapy": "therapy",
    "theapist": "therapist", "ination": "urination", "inating": "urinating",
    "oogyn": "obgyn", "yeas": "years", "impovement": "improvement",
    "metonidazole": "metronidazole", "bebeine": "berberine", "oegano": "oregano",
    "gasto": "gastro", "intestitial": "interstitial", "cystitis": "cystitis",
    "hge": "huge", "fom": "from", "fo": "for", "withot": "without", "sibo": "sibo",
    # Reddit slang/abbr. → standard terms (important improvement)
    "yo": "you", "bt": "but", "ic": "interstitial_cystitis", "ae": "are",
    "jst": "just", "abot": "about", "othe": "other", "becase": "because",
    "eally": "really", "mch": "much", "vey": "very", "thee": "the",
    "ot": "or", "se": "see", "ty": "try", "wold": "would", "moe": "more",
    "afte": "after",
    # Normalize to paper-standard vocabulary
    "urge": "urgency", "urgent": "urgency", "pee": "urinate", "peeing": "urinate",
    "hurt": "pain", "hurting": "pain", "ache": "pain", "aching": "pain",
    "burn": "burning", "uncomfortable": "discomfort", "pbs": "painful_bladder_syndrome"
}
Time to chop up the text and keep only the good stuff! We're creating two key functions here: enhanced_tokenize
and is_enhanced_content_token
. The enhanced_tokenize
function takes a text string as input and spits out a list of tokens – individual words or units of meaning. It starts by converting the text to lowercase (if LOWERCASE
is set to True
) to ensure consistency. Then, it uses regular expressions to find all sequences of letters, effectively stripping out punctuation and other non-alphabetic characters. If we're using basic normalization (i.e., USE_BASIC_NORMALIZE
is True
), it applies our trusty normalize_map
to convert slang and abbreviations into standard terms. This function is the front line of our text processing pipeline, breaking down the raw text into manageable chunks.
But not all tokens are created equal. That's where is_enhanced_content_token
comes in. This function acts as a filter, deciding which tokens are important enough to keep for our analysis. First, it checks if we're prioritizing clinical vocabulary (USE_CLINICAL_PRIORITY
is True
) and if the token is in our ALL_VALIDATED_TERMS
list. If so, it's an automatic keeper! Otherwise, it applies some basic filtering: it removes stopwords (common words like "the" and "a"), and it discards tokens shorter than two characters. Finally, it checks against lists of additional medical terms and common non-medical words. This multi-layered filtering process ensures that we're focusing on the most relevant and meaningful terms in our analysis. By combining enhanced tokenization with intelligent content filtering, we're setting the stage for a co-occurrence network that truly captures the essence of pain discourse.
def enhanced_tokenize(text: str):
    if LOWERCASE:
        text = text.lower()
    # Keep letters only
    tokens = re.findall(r"[a-z]+", text)
    if USE_BASIC_NORMALIZE and normalize_map:
        tokens = [normalize_map.get(t, t) for t in tokens]
    return tokens
def is_enhanced_content_token(tok: str):
    # Always keep paper-validated vocabulary
    if USE_CLINICAL_PRIORITY and tok in ALL_VALIDATED_TERMS:
        return True
    # Basic filtering
    if tok in stop_en:
        return False
    if len(tok) < 2:
        return False
    # Additional medical vocabulary
    medical_terms = {
        "interstitial", "cystitis", "syndrome", "chronic", "pelvic", "floor",
        "dysfunction", "inflammation", "infection", "treatment", "therapy",
        "medication", "antibiotic", "doctor", "physician", "urologist",
        "gynecologist", "diagnosis", "diagnosed", "condition", "relief"
    }
    if tok in medical_terms:
        return True
    # Filter out common non-medical words
    common_non_medical = {
        "like", "get", "also", "one", "know", "time", "help", "take", "day",
        "make", "think", "may", "want", "feel", "good", "bad", "new", "old",
        "first", "last", "long", "way", "work", "right", "left", "high", "low"
    }
    if tok in common_non_medical:
        return False
    return True
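Before moving on, it can help to see these two functions in action. The snippet below is just an illustrative sanity check (the sample sentence is made up, not taken from the Reddit data): it runs enhanced_tokenize on a Reddit-style sentence and then filters the result with is_enhanced_content_token, so you can watch the slang mapping and the stopword/length filtering do their thing.
# Illustrative sanity check (made-up sentence, not from the source data)
sample = "My bladde pain and the urge to pee is so bad, IC is ruining my day"
raw_tokens = enhanced_tokenize(sample)
kept_tokens = [t for t in raw_tokens if is_enhanced_content_token(t)]
print("raw tokens: ", raw_tokens)
print("kept tokens:", kept_tokens)
# With the settings above, "bladde" maps to "bladder", "urge" to "urgency",
# "pee" to "urinate", and "ic" to "interstitial_cystitis"; stopwords like
# "my", "and", "the" are dropped, and "bad"/"day" are filtered out as
# common non-medical words.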
Let's get the ball rolling by reading our DOCX file and applying our enhanced processing techniques. We're using the docx
library to open and read the document specified by DOCX_PATH
. This is where we load up our textual data – the raw material for our analysis. We start a timer (t0 = time.time()
) to keep track of how long this process takes, because efficiency is key! Then, we create a Document
object from the DOCX file, which allows us to access the text content. We print the number of paragraphs in the document to get a sense of the size of our dataset.
This step is like setting the stage for a performance. We're bringing in the actors (the words and sentences) and getting ready to analyze their interactions. Reading the DOCX file is the first step in transforming raw text into a structured format that our pipeline can work with. By printing the number of paragraphs, we get a quick overview of the scope of our analysis. This initial step is crucial for the overall success of our pipeline, as it lays the foundation for the subsequent text processing and network construction phases. It's like gathering all the pieces of a puzzle before you start putting it together – you need to have all the elements in place before you can see the bigger picture. So, with the DOCX file loaded and the paragraph count noted, we're ready to move on to the next stage of our enhanced processing pipeline.
print("Reading DOCX with enhanced processing…")
t0 = time.time()
doc = Document(DOCX_PATH)
print(f"Paragraphs: {len(doc.paragraphs):,}")
Time to get down to business and count how often words appear together! This is the heart of our co-occurrence analysis. We're creating a co_counter
– a collections.Counter
object – to keep track of how many times each pair of words appears within a certain window size. We also create a validated_term_counts
counter to specifically track the occurrences of our paper-validated terms. This is where our clinical priority comes into play. We want to make sure we're paying extra attention to these important terms.
The magic happens as we loop through each paragraph in the document. For each paragraph, we tokenize the text using our enhanced_tokenize
function and filter the tokens using is_enhanced_content_token
. This gives us a list of relevant terms for that paragraph. If there are fewer than two tokens, we skip the paragraph because we need at least two words to form a co-occurrence. For each token, we check if it's in our ALL_VALIDATED_TERMS
list and, if so, increment its count in validated_term_counts
. This ensures that we have an accurate tally of how often these clinical terms are used.
Next, we implement a sliding window approach. We define a window size W
(which we set earlier) and slide a window of this size across the list of tokens. Within each window, we consider all unique pairs of terms and increment their counts in co_counter
. This is where we capture the local context of each term, identifying which words tend to appear together. By the end of this process, co_counter
will contain a comprehensive record of all word co-occurrences in the document, and validated_term_counts
will give us a clear picture of how often our key clinical terms are being used. This step is like mapping the landscape of our text data, identifying the peaks and valleys of word usage and highlighting the connections between different terms. It's a crucial step in building our co-occurrence network, providing the raw data that will inform the structure and weights of the connections between nodes.
print("Counting co-occurrences with clinical term priority…")
co_counter = collections.Counter()
validated_term_counts = collections.Counter() # Track occurrences of validated terms
W = WINDOW_SIZE
for para in tqdm(doc.paragraphs, total=len(doc.paragraphs)):
    toks = [t for t in enhanced_tokenize(para.text) if is_enhanced_content_token(t)]
    if len(toks) < 2:
        continue
    # Count validated terms
    for tok in toks:
        if tok in ALL_VALIDATED_TERMS:
            validated_term_counts[tok] += 1
    # Sliding window
    if len(toks) <= W:
        windows = [toks]
    else:
        windows = (toks[i:i+W] for i in range(0, len(toks)-W+1))
    for window in windows:
        uniq = sorted(set(window))
        if len(uniq) < 2:
            continue
        for a, b in itertools.combinations(uniq, 2):
            co_counter[(a, b)] += 1
print(f"Total unique pairs: {len(co_counter):,}")
print(f"Processing took: {time.time()-t0:.1f}s")
Now, let's check how well we're capturing those validated terms! This is where we assess the detection rate of our paper-validated vocabulary. We're going to loop through each category in VALIDATED_ICBPS_TERMS
and check if each term was detected in the text. This is a crucial step for ensuring that our pipeline is actually picking up on the key concepts we're interested in.
For each category and term, we check if the term exists as a key in our validated_term_counts
counter. If it does, we know the term was detected, and we print a message indicating its count. If not, we print a message saying the term was not detected. This gives us a clear picture of which terms are being frequently used in the text and which ones might be missing. Finally, we calculate the overall detection rate – the percentage of validated terms that were found in the text. This is a key metric for evaluating the performance of our pipeline. A high detection rate indicates that our tokenization and filtering processes are effectively capturing the relevant clinical vocabulary. This step is like taking a temperature check on our pipeline, ensuring that it's functioning as expected and highlighting any areas that might need further attention. By analyzing the detection status of our validated terms, we can gain confidence in the quality of our subsequent network analysis and ensure that our insights are grounded in the most relevant clinical concepts.
print(f"\n=== Detection status of the 19 paper-validated terms ===")
found_validated = 0
for category, terms in VALIDATED_ICBPS_TERMS.items():
    print(f"\n{category}:")
    for term in terms:
        count = validated_term_counts.get(term, 0)
        if count > 0:
            found_validated += 1
            print(f" ✓ {term}: {count:,} occurrences")
        else:
            print(f" ✗ {term}: not detected")
print(f"\nDetection rate: {found_validated}/{len(ALL_VALIDATED_TERMS)} ({found_validated/len(ALL_VALIDATED_TERMS)*100:.1f}%)")
Time to build our network! We're taking the co-occurrence counts we calculated earlier and turning them into a graph where words are nodes and connections represent how often they appear together. This is where our clinical term prioritization really shines. We're creating a networkx.Graph
object, which will be the foundation for our network analysis.
We loop through each pair of words and their co-occurrence count in co_counter
. If the count is above our MIN_COUNT
threshold, we consider it a significant connection and add an edge to the graph. But here's the cool part: we're giving a bonus to connections involving our validated terms. For each word in the pair that's in ALL_VALIDATED_TERMS
, we add a clinical_bonus
of 1. This means that connections between validated terms get an extra boost, reflecting their clinical importance.
We then compute an enhanced_weight
for each edge, which is the raw co-occurrence count plus twice the clinical_bonus
. This weighting scheme ensures that validated terms have a stronger influence on the network structure. We add the edge to the graph with the enhanced_weight
, the raw_count
, and the clinical_bonus
as attributes. This allows us to analyze the network based on different weighting schemes later on. Finally, we print some stats about the graph – the number of nodes (unique words) and edges (connections between words). If the graph is empty (no nodes or edges), we print a message suggesting that the MIN_COUNT
threshold might be too high. This step is like constructing a map of the pain discourse landscape, highlighting the key relationships between different concepts and emphasizing the importance of clinically relevant terms. By building this enhanced graph, we're setting the stage for a deeper understanding of how people talk about pain and the connections between different aspects of their experience.
print(f"Building enhanced graph with MIN_COUNT={MIN_COUNT}…")
G = nx.Graph()
for (a, b), count in co_counter.items():
    if count >= MIN_COUNT:
        # Bonus weighting for validated vocabulary
        clinical_bonus = 0
        if a in ALL_VALIDATED_TERMS:
            clinical_bonus += 1
        if b in ALL_VALIDATED_TERMS:
            clinical_bonus += 1
        # Compute weight (add bonus for validated terms)
        enhanced_weight = count + (clinical_bonus * 2)
        G.add_edge(a, b, weight=enhanced_weight, raw_count=count, clinical_bonus=clinical_bonus)
print(f"Enhanced Graph: {G.number_of_nodes():,} nodes, {G.number_of_edges():,} edges")
if G.number_of_nodes() == 0 or G.number_of_edges() == 0:
    print("No edges after threshold. Try lowering MIN_COUNT.")
    raise SystemExit
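To see what the bonus weighting actually does to the core of the network, here is a small optional inspection step (a sketch added for illustration, not part of the original pipeline): it lists the heaviest edges whose endpoints are both paper-validated terms.
# Optional inspection (illustrative sketch): heaviest edges between validated terms
validated_edges = [
    (u, v, d) for u, v, d in G.edges(data=True)
    if u in ALL_VALIDATED_TERMS and v in ALL_VALIDATED_TERMS
]
validated_edges.sort(key=lambda e: e[2]["weight"], reverse=True)
print(f"Edges between validated terms: {len(validated_edges):,}")
for u, v, d in validated_edges[:10]:
    print(f"  {u} -- {v}: weight={d['weight']}, raw={d['raw_count']}, bonus={d['clinical_bonus']}")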
Let's figure out which words are the most influential in our network! We're calculating three different centrality measures: degree centrality, betweenness centrality, and eigenvector centrality. These measures give us different perspectives on the importance of each word in the network. This is like figuring out who the key players are in a social network – who's the most popular, who's the connector, and who's the influencer.
- Degree centrality is the simplest: it just counts how many connections a word has. Words with high degree centrality are like the popular kids – they're connected to a lot of other words.
- Betweenness centrality measures how often a word lies on the shortest path between two other words. Words with high betweenness centrality are like bridges – they connect different parts of the network.
- Eigenvector centrality is a bit more complex: it measures a word's influence based on the influence of its neighbors. Words with high eigenvector centrality are like influencers – they're connected to other influential words.

We use the networkx functions degree_centrality, betweenness_centrality, and eigenvector_centrality to calculate these measures. For betweenness and eigenvector centrality, we pass the 'weight' argument so that the edge weights (our enhanced co-occurrence counts) are taken into account. For eigenvector centrality, we also handle a potential PowerIterationFailedConvergence error by falling back to an unweighted calculation on a subgraph of the top nodes. This ensures that our analysis is robust even if the eigenvector centrality calculation doesn't converge for the full graph.
By calculating these centrality measures, we can identify the words that are most central to the pain discourse network. This gives us valuable insights into the key concepts and relationships that shape how people talk about pain. This step is like analyzing the power dynamics in a social network, revealing who the key influencers and connectors are. By understanding the centrality of different terms, we can gain a deeper understanding of the structure and dynamics of the pain discourse landscape.
print("Computing enhanced centralities…")
deg_c = nx.degree_centrality(G)
bet_c = nx.betweenness_centrality(G, normalized=True, weight='weight')
try:
    eig_c = nx.eigenvector_centrality(G, max_iter=2000, tol=1e-06, weight='weight')
except nx.PowerIterationFailedConvergence:
    print("Eigenvector centrality: fallback to unweighted calculation")
    top_nodes = sorted(deg_c, key=deg_c.get, reverse=True)[:max(2000, int(G.number_of_nodes()*0.3))]
    H = G.subgraph(top_nodes).copy()
    eig_c_sub = nx.eigenvector_centrality(H, max_iter=2000, tol=1e-06)
    eig_c = {n: eig_c_sub.get(n, 0.0) for n in G.nodes()}
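If you want a quick check that the validated vocabulary really does sit near the center of the network, a simple comparison like the one below can help (again, an optional sketch rather than part of the original notebook): it contrasts the mean degree centrality of validated terms with that of everything else.
# Optional check (illustrative): average degree centrality, validated vs. other terms
validated_deg = [deg_c[n] for n in G.nodes() if n in ALL_VALIDATED_TERMS]
other_deg = [deg_c[n] for n in G.nodes() if n not in ALL_VALIDATED_TERMS]
if validated_deg and other_deg:
    print(f"Mean degree centrality, validated terms: {sum(validated_deg)/len(validated_deg):.4f}")
    print(f"Mean degree centrality, other terms:     {sum(other_deg)/len(other_deg):.4f}")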
Let's find the hidden communities within our network! We're using the Louvain algorithm, a popular method for community detection, to identify clusters of words that are more densely connected to each other than to the rest of the network. Think of these communities as groups of words that share a common theme or context. This is like finding the different cliques or groups in a social network – people who hang out together and share common interests.
We use the community_louvain.best_partition
function to find the best community structure for our graph. We set the resolution
parameter to 1.0, which controls the granularity of the communities (higher values result in smaller communities). We also specify weight='weight'
to use the edge weights (our enhanced co-occurrence counts) in the community detection process. This ensures that words that are strongly connected are more likely to be grouped together. The Louvain algorithm works by iteratively moving nodes between communities to maximize the modularity of the network – a measure of how well the network is divided into communities. The best_partition
function returns a dictionary where keys are nodes and values are the community they belong to.
By detecting communities in our pain discourse network, we can gain a deeper understanding of the underlying themes and topics that shape how people talk about pain. This step is like uncovering the hidden social structures within a network, revealing the different groups and their relationships. By identifying these communities, we can gain valuable insights into the complex landscape of pain discourse and how different concepts and experiences are interconnected.
print("Detecting communities (Louvain with clinical weighting)…")
partition = community_louvain.best_partition(G, resolution=1.0, weight='weight')
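Since best_partition only returns a node-to-community mapping, a short summary step makes the result much easier to eyeball. The sketch below (an addition for illustration, not from the original notebook) prints the largest communities and shows which validated terms landed in each.
# Optional summary (illustrative sketch): community sizes and validated members
community_sizes = collections.Counter(partition.values())
print(f"Communities found: {len(community_sizes)}")
for comm, size in community_sizes.most_common(5):
    members = [n for n, c in partition.items() if c == comm]
    validated_members = [n for n in members if n in ALL_VALIDATED_TERMS]
    preview = ", ".join(validated_members[:5]) if validated_members else "none"
    print(f"  community {comm}: {size} terms, {len(validated_members)} validated ({preview})")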
Time to save our hard work! We're generating several output files to store the results of our analysis. This is like documenting your experiment in a lab notebook – you want to make sure you have a record of everything you've done and the results you've obtained. We're creating three main types of output files:
- Validated terms analysis CSV: This file (
validated_terms_analysis.csv
) contains a detailed analysis of our paper-validated terms. For each term, we record its category, whether it was found in the graph, its frequency, its centrality measures (degree, betweenness, eigenvector), and the community it belongs to. This file allows us to assess the role of these key clinical terms in the pain discourse network. - Enhanced nodes metrics CSV: This file (
enhanced_nodes_metrics.csv
) contains metrics for each node (word) in the graph. We record whether the node is a validated term, its clinical category (if applicable), its centrality measures, and its community. This file provides a comprehensive overview of the characteristics of each word in the network. - Enhanced edges CSV: This file (
enhanced_edges_weighted.csv
) contains information about each edge (connection) in the graph. We record the source and target nodes, the enhanced weight, the raw co-occurrence count, the clinical bonus, and whether the source and target nodes are validated terms. This file allows us to analyze the connections between words and the influence of clinical term prioritization. - Enhanced GEXF: This file (
enhanced_pain_network.gexf
) is a graph exchange format file that can be opened in network visualization software like Gephi. We store the graph structure, node attributes (community, centrality measures, validation status, clinical category), and edge attributes (weight, raw count, clinical bonus). This file allows us to visualize and explore the pain discourse network in an interactive environment.

We use the csv library to write the CSV files and the networkx.write_gexf function to write the GEXF file. By generating these output files, we ensure that our results are accessible and reusable. This allows us to further analyze the data, visualize the network, and share our findings with others. This step is like archiving your research data, ensuring that it's preserved for future use and analysis.
# 12) Enhanced file outputs
# Validated terms analysis
validated_csv = os.path.join(OUT_DIR, "validated_terms_analysis.csv")
with open(validated_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["category", "term", "found_in_graph", "frequency",
                     "degree_centrality", "betweenness_centrality",
                     "eigenvector_centrality", "community"])
    for category, terms in VALIDATED_ICBPS_TERMS.items():
        for term in terms:
            found = term in G.nodes()
            freq = validated_term_counts.get(term, 0)
            deg = deg_c.get(term, 0.0) if found else 0.0
            bet = bet_c.get(term, 0.0) if found else 0.0
            eig = eig_c.get(term, 0.0) if found else 0.0
            comm = partition.get(term, -1) if found else -1
            writer.writerow([category, term, found, freq, deg, bet, eig, comm])
# Enhanced nodes with validation flags
nodes_csv = os.path.join(OUT_DIR, "enhanced_nodes_metrics.csv")
with open(nodes_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["node", "is_validated_term", "clinical_category",
                     "degree_centrality", "betweenness_centrality",
                     "eigenvector_centrality", "community"])
    for n in G.nodes():
        is_validated = n in ALL_VALIDATED_TERMS
        category = "non_clinical"
        if is_validated:
            for cat, terms in VALIDATED_ICBPS_TERMS.items():
                if n in terms:
                    category = cat
                    break
        writer.writerow([n, is_validated, category, deg_c.get(n,0.0),
                         bet_c.get(n,0.0), eig_c.get(n,0.0), partition.get(n, -1)])
# Enhanced edges
edges_csv = os.path.join(OUT_DIR, "enhanced_edges_weighted.csv")
with open(edges_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["source", "target", "enhanced_weight", "raw_count", "clinical_bonus",
                     "source_validated", "target_validated"])
    for u, v, d in G.edges(data=True):
        source_val = u in ALL_VALIDATED_TERMS
        target_val = v in ALL_VALIDATED_TERMS
        writer.writerow([u, v, d.get("weight", 1), d.get("raw_count", 1),
                         d.get("clinical_bonus", 0), source_val, target_val])
# Enhanced GEXF with validation attributes
gexf_path = os.path.join(OUT_DIR, "enhanced_pain_network.gexf")
nx.set_node_attributes(G, partition, "community")
nx.set_node_attributes(G, deg_c, "degree_centrality")
nx.set_node_attributes(G, bet_c, "betweenness_centrality")
nx.set_node_attributes(G, eig_c, "eigenvector_centrality")
# Add validation attributes
validated_attrs = {n: n in ALL_VALIDATED_TERMS for n in G.nodes()}
category_attrs = {}
for n in G.nodes():
    category_attrs[n] = "non_clinical"
    if validated_attrs[n]:
        for cat, terms in VALIDATED_ICBPS_TERMS.items():
            if n in terms:
                category_attrs[n] = cat
                break
nx.set_node_attributes(G, validated_attrs, "is_validated")
nx.set_node_attributes(G, category_attrs, "clinical_category")
nx.write_gexf(G, gexf_path)
Let's take a look at the top players in our network! We're displaying the top K terms (where K is our PRINT_TOP_K
setting) for each centrality measure: degree centrality, betweenness centrality, and eigenvector centrality. This is like announcing the MVPs of our network – the words that are the most connected, the most influential, and the most central to the pain discourse.
We define a helper function enhanced_topk
that takes a dictionary (e.g., a centrality dictionary) and returns a list of the top K key-value pairs, sorted by value in descending order. This function makes it easy to get the top terms for any centrality measure. We also define a function display_with_validation
to display the top terms in a user-friendly format. This function takes a centrality dictionary and a name (e.g., "Degree Centrality") as input. It prints the name, then loops through the top K terms and prints their rank, a checkmark if they're a validated term, the term itself, and its centrality score.
By displaying the top centrality terms, we can quickly get a sense of the key concepts and relationships in the pain discourse network. This allows us to identify the words that are most central to the discussion of pain, the words that connect different aspects of the discourse, and the words that are most influential within the network. This step is like presenting the highlights of our analysis, showcasing the most important findings and providing a clear overview of the key insights. By examining the top centrality terms, we can gain a deeper understanding of the structure and dynamics of the pain discourse landscape and identify potential areas for further investigation.
# 13) Enhanced results display
def enhanced_topk(d, k=PRINT_TOP_K):
    return sorted(d.items(), key=lambda x: x[1], reverse=True)[:k]
def display_with_validation(centrality_dict, name):
    print(f"\n=== {name} Top {PRINT_TOP_K} (✓ = validated term) ===")
    for i, (term, score) in enumerate(enhanced_topk(centrality_dict), 1):
        validated = "✓" if term in ALL_VALIDATED_TERMS else " "
        print(f"{i:2d}. {validated} {term:<20} {score:.6f}")
display_with_validation(deg_c, "Degree Centrality")
display_with_validation(bet_c, "Betweenness Centrality")
display_with_validation(eig_c, "Eigenvector Centrality")
Alright, folks, we've reached the finish line! Let's wrap things up by summarizing our results and highlighting the key takeaways from our enhanced pain discourse co-occurrence network pipeline. We've saved a bunch of files containing our analysis results. These files are like the final report of our project, documenting everything we've found.
We print the paths to the validated term analysis CSV, the enhanced nodes CSV, the enhanced edges CSV, and the enhanced network GEXF file. This makes it easy to find and access these files for further analysis and visualization. We also provide a concise summary of our analysis, including the detection rate of paper-validated terms, the size of the graph (number of nodes and edges), and a reminder that we applied a clinical weighting bonus. This summary provides a quick overview of the key aspects of our analysis and the characteristics of the resulting network. Finally, we encourage users to visualize the network in Gephi, a powerful network visualization software. We suggest that validated terms can be color-coded to highlight their importance and influence within the network. This visualization allows us to explore the network structure, identify communities, and gain a deeper understanding of the relationships between different concepts.
By summarizing our results and providing guidance for further exploration, we empower users to make the most of our analysis. This step is like presenting the conclusions of your research, highlighting the key findings and suggesting directions for future work. By providing a clear and concise summary, we ensure that our analysis is accessible and actionable, paving the way for a deeper understanding of pain discourse and its implications for clinical practice and research.
print(f"\n=== Saved files ===")
print(f"Validated term analysis: {validated_csv}")
print(f"Enhanced nodes: {nodes_csv}")
print(f"Enhanced edges: {edges_csv}")
print(f"Enhanced network: {gexf_path}")
print(f"\n=== Analysis summary ===")
print(f"• Detection rate of paper-validated terms: {found_validated}/{len(ALL_VALIDATED_TERMS)} ({found_validated/len(ALL_VALIDATED_TERMS)*100:.1f}%) ")
print(f"• Graph size: {G.number_of_nodes():,} nodes, {G.number_of_edges():,} edges")
print(f"• Clinical weighting bonus applied")
print(f"• Visualize in Gephi: validated terms can be color-coded")