Comparing TF-IDF, transformer embeddings, and ethical auditing frameworks for detecting bias in machine-generated text
Published: July 21, 2025
Introduction
As machine learning advances across industries, the need for models that are not only accurate and performant but also fair, inclusive, and explainable has become critical. From hiring systems to content feeds, these technologies increasingly shape daily decisions. With this influence comes responsibility, particularly when generated text may reflect or amplify harmful societal biases.
This article explores the evolution of bias detection through three approaches: TF-IDF features with an XGBoost classifier, transformer-based All-MiniLM-L6-v2 embeddings with the same classifier, and RAIL API as an external ethical auditing layer. SHAP is used throughout to explain the classifiers' decisions.
These components form a modular, explainable, and ethically aware pipeline for real-world NLP systems.
The Bias Problem: The Hidden Flaw in Machine Learning Models and AI
As AI systems become embedded in critical domains such as recruitment, education, and journalism, an invisible threat persists: bias. Rather than eliminating human prejudice, many models unintentionally mirror and magnify the biases present in their training data. Even advanced systems like ChatGPT or Gemini remain vulnerable, and a small bias in phrasing can have an outsized impact at scale.
Bias detection is not a luxury; it is a necessity for building trustworthy and responsible AI.
Goal of This Analysis
The primary objective is to investigate the biases inherent in text generated by ML models and conversational agents. As these technologies spread rapidly across sensitive domains, robust bias detection becomes critical.
The analysis compares a traditional machine learning pipeline (TF-IDF vectorization with XGBoost) against transformer-based embeddings (All-MiniLM-L6-v2 with XGBoost) to evaluate how effectively each identifies gender, political, and demographic bias. It also demonstrates how external fairness auditing tools such as RAIL API provide an ethical validation layer.
Focus Areas: Types of Bias in Textual Content
Political Bias
Definition: Language promoting, favoring, or disparaging specific political ideologies, parties, or viewpoints, subtly influencing public perception.
Examples:
Demographic Bias
Definition: Assumptions or stereotypes based on race, religion, location, age, or socioeconomic class that reinforce harmful social divides.
Examples:
Gender Bias
Definition: Stereotypes, inequalities, or unjust treatment based on gender, perpetuating outdated views about roles and capabilities.
Examples:
Undetected biases degrade AI quality, fairness, and trustworthiness, making bias detection foundational for ethical AI development.
Dataset Used for This Comparison
Data was gathered from text generated by Gemini and ChatGPT and from Kaggle datasets. After removing duplicates and null values, the final dataset contained 2,138 rows and 4 columns.
The class distribution showed that every bias category was represented, allowing a comparison across all of them.
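Removing null values and duplicates is a small step that is easy to get subtly wrong. A minimal stdlib sketch, assuming each record is a dict of strings; the column names (`text`, `source`, `bias_type`, `label`) and the records are hypothetical, since the article does not name its four columns. In pandas the same step is roughly `df.dropna().drop_duplicates()`.

```python
from typing import Optional

# Hypothetical record layout: the article only says the cleaned data has 4 columns.
Row = dict[str, Optional[str]]


def clean_rows(rows: list[Row]) -> list[Row]:
    """Drop rows with a missing or blank value, then drop exact duplicates."""
    seen = set()
    cleaned: list[Row] = []
    for row in rows:
        # Treat None and whitespace-only strings as nulls.
        if any(value is None or not value.strip() for value in row.values()):
            continue
        # Sorting the items makes the duplicate key independent of column order.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # keep only the first occurrence
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical records for illustration only.
records = [
    {"text": "Women are too emotional to lead.", "source": "chatgpt", "bias_type": "gender", "label": "1"},
    {"text": "Women are too emotional to lead.", "source": "chatgpt", "bias_type": "gender", "label": "1"},
    {"text": None, "source": "kaggle", "bias_type": "political", "label": "0"},
    {"text": "The council approved the budget.", "source": "gemini", "bias_type": "political", "label": "0"},
]
print(len(clean_rows(records)))  # 2
```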
Model Selection
XGBoost (Extreme Gradient Boosting) was selected as the core classifier because it efficiently handles both the high-dimensional, sparse feature spaces produced by TF-IDF and the dense vectors produced by transformer embeddings.
Key Advantages of XGBoost:
Feature Extraction with TF-IDF
Before feeding text into ML models, it must be converted to numerical format. TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used technique for this transformation.
What is TF-IDF?
TF-IDF is a statistical NLP method that represents text as numerical vectors. It measures how relevant a word is to a document within a collection by balancing the word's frequency in that document against how many documents in the corpus contain it.
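In its common textbook form (one of several variants), the TF-IDF weight of term t in document d, for a corpus of N documents where f_{t,d} is the raw count of t in d and df(t) is the number of documents containing t, is:

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot \mathrm{idf}(t)
```

A term that appears in every document gets idf = log 1 = 0 and therefore zero weight. Libraries differ in the details: scikit-learn's `TfidfVectorizer`, for example, uses raw counts for tf, a smoothed idf of ln((1 + N) / (1 + df(t))) + 1 by default, and L2-normalizes each row.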
How It Works
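TF-IDF assigns scores based on two factors: term frequency (TF), how often a word occurs in a given document, and inverse document frequency (IDF), which shrinks as the word appears in more documents of the corpus.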
This down-weights common uninformative words ("the", "is", "and") while giving important distinctive words higher scores.
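To make the two factors concrete, here is a from-scratch sketch of the textbook formulation (tf = term count divided by document length, idf = log(N / df)). It is for illustration only; real pipelines typically use a library vectorizer such as scikit-learn's `TfidfVectorizer`, whose default weighting differs slightly. The toy sentences are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # tf (count / document length) times idf (log of N / df)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "the nurse said she is caring".split(),
    "the engineer said he is logical".split(),
    "the voter is informed".split(),
]
weights = tfidf(docs)
print(weights[0]["the"])              # 0.0, because "the" occurs in every document
print(round(weights[0]["nurse"], 3))  # 0.183, a distinctive term
```

Note how "the" and "is" drop to zero while "nurse", which appears in only one document, gets the highest weight in its row.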
Pros of TF-IDF
Cons of TF-IDF
Observations on TF-IDF Model Performance
While the TF-IDF + XGBoost pipeline offers interpretability and simplicity, experiments reveal limitations:
All-MiniLM-L6-v2: Transformer-Based Embedding
What is All-MiniLM-L6-v2?
All-MiniLM-L6-v2 is a compact, pre-trained sentence embedding model from the Sentence Transformers project. It converts natural language text (sentences, phrases, or paragraphs) into dense vectors that capture semantic meaning.
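Dense embeddings are usually compared with cosine similarity: sentences with similar meaning should point in similar directions. A minimal sketch follows; the 3-dimensional vectors are hypothetical stand-ins for the 384-dimensional vectors All-MiniLM-L6-v2 produces (in the sentence-transformers library, `SentenceTransformer("all-MiniLM-L6-v2").encode(...)` returns them).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two dense vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not real model output.
a = [0.9, 0.1, 0.0]  # "She is a brilliant engineer."
b = [0.8, 0.2, 0.1]  # "He is a talented engineer."
c = [0.0, 0.1, 0.9]  # "The election results were delayed."
print(cosine_similarity(a, b) > cosine_similarity(a, c))  # True
```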
How It Works
All-MiniLM-L6-v2 is a distilled version of a BERT-style transformer, retaining the core architectural components in a smaller, faster format:
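One step that turns the transformer's per-token outputs into a single sentence vector is pooling. According to its published model card, all-MiniLM-L6-v2 averages the token embeddings (mean pooling, ignoring padding via the attention mask) and then L2-normalizes the result. The sketch below reproduces only that pooling step in plain Python; the tiny 2-dimensional token vectors are hypothetical, whereas the real model emits 384-dimensional ones.

```python
import math


def mean_pool(token_embeddings: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the token vectors whose mask is 1, so padding tokens are ignored."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale to unit length so a plain dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def sentence_embedding(token_embeddings: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Mean pooling followed by L2 normalization."""
    return l2_normalize(mean_pool(token_embeddings, attention_mask))


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # hypothetical; last row is padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```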
Pros of All-MiniLM
Cons of All-MiniLM
Observations on All-MiniLM-L6-v2
While All-MiniLM-L6-v2 offers strong semantic capabilities and efficient performance, trade-offs exist:
Vectorization Comparison: TF-IDF vs All-MiniLM
Two vectorization approaches, TF-IDF and All-MiniLM-L6-v2, were tested, each coupled with an XGBoost classifier.
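A fair comparison evaluates both pipelines on the same held-out labels. The sketch below computes precision, recall, and false positive rate for a binary setup; the label encoding (1 = biased, 0 = unbiased) and the prediction lists are hypothetical. scikit-learn's `precision_score`, `recall_score`, and `confusion_matrix` would give the same numbers.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Precision, recall, and false positive rate with 1 = biased, 0 = unbiased."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(num: int, den: int) -> float:
        # Report 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Hypothetical gold labels and predictions from the two pipelines.
y_true = [1, 1, 0, 0, 1, 0]
tfidf_pred = [1, 0, 1, 0, 1, 0]
minilm_pred = [1, 1, 1, 1, 1, 0]
for name, pred in [("TF-IDF", tfidf_pred), ("MiniLM", minilm_pred)]:
    print(name, binary_metrics(y_true, pred))
```

Reporting the false positive rate alongside recall matters here: a model can catch every biased sample while still flagging many neutral ones.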
Key Observations:
These limitations emphasize the need for post-model auditing tools like SHAP and external evaluators like RAIL API to establish trust in automated bias detection systems.
SHAP Library Insights: Explaining the Model's Decisions
To ensure transparency and interpretability, the SHAP library was incorporated into the workflow. It identifies and visualizes which features contributed most to the XGBoost classifier's predictions: individual words for the TF-IDF representation and embedding dimensions for All-MiniLM.
What is SHAP?
SHAP is a powerful Python library based on Shapley values from cooperative game theory. It provides a principled approach to interpreting ML predictions by attributing a contribution score to each feature.
How SHAP Works
SHAP treats the ML model as a "game" and each feature (word or token) as a "player" contributing to the outcome (prediction). It answers: "How much did each feature contribute to this specific prediction?"
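Formally, for a feature set F and a value function v(S) giving the model's expected output when only the features in S are known, the Shapley value of feature i is the weighted average of its marginal contributions over all coalitions that exclude it. The attributions add up to the prediction (local accuracy), starting from the base value φ₀ = v(∅):

```latex
\phi_i = \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
  \Bigl[ v\bigl(S \cup \{i\}\bigr) - v(S) \Bigr],
\qquad
f(x) = \phi_0 + \sum_{i \in F} \phi_i
```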
SHAP produces local explanations for each instance, showing whether a word pushed the prediction toward the biased or unbiased class, and by how much.
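For intuition, Shapley values can be computed by brute force when there are only a few features. This sketch makes a simplifying assumption: an "absent" feature is replaced by a single baseline value, whereas the SHAP library averages over background data, and for tree ensembles such as XGBoost its `TreeExplainer` computes the values efficiently from the tree structure instead of enumerating every subset. The toy models are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Sequence


def exact_shapley(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> list[float]:
    """Exact Shapley values by enumerating every coalition (O(2^n) model calls)."""
    n = len(x)

    def value(coalition: frozenset) -> float:
        # Features in the coalition keep their real value; the rest use the baseline.
        z = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


coef = [2.0, -1.0, 0.5]  # hypothetical linear model


def linear(z: Sequence[float]) -> float:
    return sum(w * v for w, v in zip(coef, z))


phi = exact_shapley(linear, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
# For a linear model, phi_i = coef_i * (x_i - baseline_i), and the values sum to
# linear(x) - linear(baseline).
```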
Application in This Project
SHAP was used to:
The outcome helped audit both models and understand failure points, such as over-reliance on polarizing or ambiguous terms.
SHAP for TF-IDF Vectors
Analysis revealed which terms most influenced classification across political, demographic, and gender bias categories, with detailed visualizations showing contribution strengths.
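A common way to turn per-instance SHAP values into a global ranking of terms is to average their absolute values over all explained rows, which is what SHAP's bar-style summary plot displays. A plain-Python sketch, assuming `shap_values` is a rows × features matrix for a single class; the feature names and numbers are hypothetical. With scikit-learn's `TfidfVectorizer`, the real names would come from `get_feature_names_out()`.

```python
def top_features_by_mean_abs_shap(
    shap_values: list[list[float]],
    feature_names: list[str],
    k: int = 10,
) -> list[tuple[str, float]]:
    """Rank features by mean absolute SHAP value across all explained rows."""
    n_rows = len(shap_values)
    if n_rows == 0:
        return []
    scores = [
        sum(abs(row[j]) for row in shap_values) / n_rows
        for j in range(len(feature_names))
    ]
    ranked = sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical SHAP values for two explained sentences and three TF-IDF terms.
values = [[0.5, -0.1, 0.0], [-0.3, 0.2, 0.0]]
names = ["liberal", "she", "the"]
print([name for name, _ in top_features_by_mean_abs_shap(values, names, k=2)])  # ['liberal', 'she']
```

Taking the absolute value before averaging keeps terms that push strongly in opposite directions on different sentences from cancelling out.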
SHAP for All-MiniLM
Similarly, SHAP analysis of the All-MiniLM embeddings showed how individual embedding dimensions contributed to political, demographic, and gender bias predictions.
RAIL API: Ethical Auditing Layer
While TF-IDF + XGBoost and All-MiniLM + XGBoost offer reasonable bias detection performance, they are insufficient on their own for real-world deployment. High false positive rates, limited handling of semantic nuance, and their static nature restrict their reliability and trustworthiness.
RAIL API (Responsible AI Layer API) provides a final ethical safeguard for content evaluation.
What is RAIL API?
RAIL API is a cloud-based, model-agnostic API evaluating AI-generated content across eight core ethical dimensions:
It provides a numeric score (0-10) for each dimension along with a textual justification, helping developers quantify and explain AI behavior in a structured way.
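Per-dimension scores are easiest to act on when they are checked against an explicit threshold. The sketch below is a hypothetical local helper, not part of RAIL API: the `DimensionScore` container, the dimension names, and the 5.0 cutoff are all illustrative assumptions. The actual response format should be taken from RAIL API's own documentation.

```python
from dataclasses import dataclass


@dataclass
class DimensionScore:
    # Hypothetical container; field names are illustrative, not RAIL API's schema.
    dimension: str
    score: float  # 0-10, as described for RAIL API
    justification: str


def flag_low_scores(scores: list[DimensionScore], threshold: float = 5.0) -> list[DimensionScore]:
    """Return the dimensions scoring below the threshold, lowest first."""
    for s in scores:
        if not 0 <= s.score <= 10:
            raise ValueError(f"score out of range for {s.dimension}: {s.score}")
    return sorted((s for s in scores if s.score < threshold), key=lambda s: s.score)


# Hypothetical scores for one piece of generated text.
report = [
    DimensionScore("fairness", 3.5, "Assigns leadership traits by gender."),
    DimensionScore("safety", 8.0, "No harmful instructions."),
    DimensionScore("privacy", 4.0, "Mentions a private individual by name."),
]
for item in flag_low_scores(report):
    print(item.dimension, item.score, item.justification)
```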
How to Use RAIL API
Benefits of Using RAIL API
Real-World Value
RAIL API was used to audit the 115 instances where both the TF-IDF and MiniLM models failed to detect bias. RAIL successfully identified:
Moreover, it provided reasoned explanations for each -- something missing in conventional classifiers.
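Isolating the cases that both classifiers got wrong, before passing them to an external auditor, needs only the gold labels and the two prediction lists. A minimal sketch, reading "failed" as "both predictions disagree with the gold label"; the labels and predictions below are hypothetical.

```python
from collections.abc import Hashable, Sequence


def jointly_failed(
    y_true: Sequence[Hashable],
    pred_a: Sequence[Hashable],
    pred_b: Sequence[Hashable],
) -> list[int]:
    """Indices where both models' predictions differ from the gold label."""
    if not len(y_true) == len(pred_a) == len(pred_b):
        raise ValueError("label and prediction lists must have the same length")
    return [
        i
        for i, (gold, a, b) in enumerate(zip(y_true, pred_a, pred_b))
        if a != gold and b != gold
    ]


# Hypothetical labels and predictions.
gold = ["gender", "none", "political", "demographic"]
tfidf_pred = ["none", "none", "political", "none"]
minilm_pred = ["none", "gender", "political", "political"]
print(jointly_failed(gold, tfidf_pred, minilm_pred))  # [0, 3]
```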
Conclusion: RAIL API acts as a robust final checkpoint for ethical AI, well suited to production systems where fairness and compliance are non-negotiable.