haam package

Module contents

HAAM - Human-AI Accuracy Model package.

class haam.HAAM(criterion: ndarray | Series | list, ai_judgment: ndarray | Series | list, human_judgment: ndarray | Series | list, embeddings: ndarray | DataFrame | None = None, texts: List[str] | None = None, n_components: int = 200, auto_run: bool = True, min_cluster_size: int = 10, min_samples: int = 2, umap_n_components: int = 3, standardize: bool = False, sample_split_post_lasso: bool = True)[source]

Bases: object

Simplified interface for HAAM analysis.

This class provides an easy-to-use API for performing the full HAAM analysis pipeline including statistical analysis, topic modeling, and visualization generation.

Methods

create_3d_umap_with_pc_arrows([pc_indices, ...])

Create 3D UMAP visualization with PC directional arrows.

create_all_pc_umap_visualizations([...])

Create UMAP visualizations for multiple PCs with topic labels.

create_all_pc_wordclouds([pc_indices, k, ...])

Create word clouds for all specified PCs.

create_comprehensive_pc_analysis([...])

Create comprehensive PC analysis with word cloud table and 3D UMAP visualization.

create_main_visualization([output_dir, pc_names])

Create and save main HAAM visualization.

create_metrics_summary([output_dir])

Create and save comprehensive metrics summary.

create_mini_grid([output_dir])

Create and save mini grid visualization.

create_pc_effects_visualization([...])

Create PC effects bar chart visualization.

create_pc_wordclouds(pc_idx[, k, max_words, ...])

Create word clouds for high and low poles of a specific PC.

create_top_pcs_wordcloud_grid([pc_indices, ...])

Create a grid visualization of word clouds for top PCs.

create_umap_visualization([n_components, ...])

Create UMAP visualization.

explore_pc_topics([pc_indices, n_topics])

Explore topic associations for specified PCs.

export_all_results([output_dir])

Export all results and create all visualizations.

plot_pc_effects(pc_idx[, save_path])

Create bar plot showing PC effects on outcomes.

run_full_analysis()

Run the complete HAAM analysis pipeline.

visualize_pc_umap_with_topics(pc_idx[, ...])

Create UMAP visualization for a specific PC with topic labels.

__init__(criterion: ndarray | Series | list, ai_judgment: ndarray | Series | list, human_judgment: ndarray | Series | list, embeddings: ndarray | DataFrame | None = None, texts: List[str] | None = None, n_components: int = 200, auto_run: bool = True, min_cluster_size: int = 10, min_samples: int = 2, umap_n_components: int = 3, standardize: bool = False, sample_split_post_lasso: bool = True)[source]

Initialize HAAM analysis with enhanced parameters.

Recent updates:
  • UMAP: n_neighbors=5, min_dist=0.0, metric='cosine'
  • HDBSCAN: min_cluster_size=10, min_samples=2
  • c-TF-IDF implementation with the BERTopic formula
  • Generic "X" labeling instead of "SC" in visualizations

Parameters:
  • criterion (array-like) – Criterion variable (any ground truth variable, not limited to social class)

  • ai_judgment (array-like) – AI predictions/ratings

  • human_judgment (array-like) – Human ratings

  • embeddings (array-like, optional) – Pre-computed embeddings. If None, will be generated from texts

  • texts (List[str], optional) – Text data for generating embeddings if not provided

  • n_components (int, default=200) – Number of PCA components

  • auto_run (bool, default=True) – Whether to automatically run the full analysis

  • min_cluster_size (int, default=10) – Minimum cluster size for HDBSCAN (matches BERTopic-style clustering)

  • min_samples (int, default=2) – Minimum samples for core points in HDBSCAN

  • umap_n_components (int, default=3) – Number of UMAP components for clustering (3D by default)

  • standardize (bool, default=False) – Whether to standardize X and outcome variables for both total effects and DML calculations. When True, all coefficients will be in standardized units.

  • sample_split_post_lasso (bool, default=True) – Whether to use sample splitting for post-LASSO inference. True: conservative inference with valid p-values (original behavior). False: maximum statistical power using the full sample.
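
As a quick orientation, the constructor only needs the three judgment arrays plus either embeddings or texts. A minimal sketch with synthetic data (all names and values below are hypothetical; the HAAM call itself is shown commented, since it requires the installed package):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

criterion = rng.normal(size=n)                              # ground-truth variable
ai_judgment = criterion + rng.normal(scale=0.5, size=n)     # noisy AI ratings
human_judgment = criterion + rng.normal(scale=0.8, size=n)  # noisy human ratings
embeddings = rng.normal(size=(n, 384))                      # stand-in for pre-computed embeddings

# With auto_run=True (the default), construction runs the full pipeline:
# from haam import HAAM
# haam_model = HAAM(criterion, ai_judgment, human_judgment,
#                   embeddings=embeddings, n_components=50)
```

If texts are supplied instead of embeddings, the embeddings are generated internally (see generate_embeddings below).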

run_full_analysis() Dict[str, Any][source]

Run the complete HAAM analysis pipeline.

Returns:

Dictionary containing all results

Return type:

Dict[str, Any]

create_main_visualization(output_dir: str | None = None, pc_names: Dict[int, str] | None = None) str[source]

Create and save main HAAM visualization.

Parameters:
  • output_dir (str, optional) – Directory to save output. If None, uses current directory

  • pc_names (Dict[int, str], optional) – Manual names for PCs. Keys are PC indices (0-based), values are names. Example: {4: "Lifestyle & Work", 7: "Professions", 1: "Narrative Style"}

Returns:

Path to saved HTML file

Return type:

str
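
A hedged sketch of the pc_names mapping: keys are 0-based, so the entry with key 4 labels PC5 in the figure (the names here are purely illustrative):

```python
# Keys are 0-based PC indices; values are display names (all hypothetical).
pc_names = {4: "Lifestyle & Work", 7: "Professions", 1: "Narrative Style"}

# html_path = haam.create_main_visualization(output_dir="outputs", pc_names=pc_names)

# The 0-based key maps to a 1-based label on the figure:
labels = {f"PC{idx + 1}": name for idx, name in pc_names.items()}
```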

create_mini_grid(output_dir: str | None = None) str[source]

Create and save mini grid visualization.

Parameters:

output_dir (str, optional) – Directory to save output. If None, uses current directory

Returns:

Path to saved HTML file

Return type:

str

create_pc_effects_visualization(output_dir: str | None = None, n_top: int = 20) str[source]

Create PC effects bar chart visualization.

Parameters:
  • output_dir (str, optional) – Directory to save output. If None, uses current directory

  • n_top (int) – Number of top PCs to display

Returns:

Path to saved HTML file

Return type:

str

create_metrics_summary(output_dir: str | None = None) Dict[str, Any][source]

Create and save comprehensive metrics summary.

Exports all key metrics including:
  • Model performance (R² values for X, AI, HU)
  • Policy similarities between predictions
  • Mediation analysis (PoMA percentages)
  • Feature selection statistics

Parameters:

output_dir (str, optional) – Directory to save output. If None, uses current directory

Returns:

Dictionary containing all metrics (also saved to JSON file)

Return type:

Dict[str, Any]

explore_pc_topics(pc_indices: List[int] | None = None, n_topics: int = 10) DataFrame[source]

Explore topic associations for specified PCs.

Parameters:
  • pc_indices (List[int], optional) – PC indices to explore (0-based). If None, uses top PCs

  • n_topics (int) – Number of topics to show per PC

Returns:

DataFrame with topic information

Return type:

pd.DataFrame
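
Since pc_indices is 0-based throughout the API, a small sketch of converting human-readable PC labels to the argument this method expects (labels hypothetical):

```python
# pc_indices are 0-based: PC1 is index 0, PC5 is index 4.
wanted = ["PC1", "PC5", "PC8"]
pc_indices = [int(label[2:]) - 1 for label in wanted]

# topics_df = haam.explore_pc_topics(pc_indices=pc_indices, n_topics=5)
```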

plot_pc_effects(pc_idx: int, save_path: str | None = None)[source]

Create bar plot showing PC effects on outcomes.

Parameters:
  • pc_idx (int) – PC index (0-based)

  • save_path (str, optional) – Path to save figure

create_umap_visualization(n_components: int = 3, color_by: str = 'X', output_dir: str | None = None) str[source]

Create UMAP visualization.

Parameters:
  • n_components (int) – Number of UMAP components (2 or 3)

  • color_by (str) – Variable to color by: 'X', 'AI', 'HU', or 'PC1', 'PC2', etc.

  • output_dir (str, optional) – Directory to save output

Returns:

Path to saved HTML file

Return type:

str

visualize_pc_umap_with_topics(pc_idx: int, output_dir: str | None = None, show_top_n: int = 5, show_bottom_n: int = 5, display: bool = True) str[source]

Create UMAP visualization for a specific PC with topic labels.

Parameters:
  • pc_idx (int) – PC index (0-based). E.g., 0 for PC1, 4 for PC5

  • output_dir (str, optional) – Directory to save output. If None, saves to current directory

  • show_top_n (int) – Number of high-scoring topics to label

  • show_bottom_n (int) – Number of low-scoring topics to label

  • display (bool) – Whether to display in notebook/colab

Returns:

Path to saved HTML file

Return type:

str

create_all_pc_umap_visualizations(pc_indices: List[int] | None = None, output_dir: str | None = None, show_top_n: int = 5, show_bottom_n: int = 5, display: bool = False) Dict[int, str][source]

Create UMAP visualizations for multiple PCs with topic labels.

Parameters:
  • pc_indices (List[int], optional) – List of PC indices to visualize. If None, uses top 10 PCs

  • output_dir (str, optional) – Directory to save visualizations

  • show_top_n (int) – Number of high topics to show per PC

  • show_bottom_n (int) – Number of low topics to show per PC

  • display (bool) – Whether to display each plot (set False for batch processing)

Returns:

Mapping of PC index to output file path

Return type:

Dict[int, str]

create_3d_umap_with_pc_arrows(pc_indices: int | List[int] | None = None, top_k: int = 1, percentile_threshold: float = 90.0, arrow_mode: str = 'all', color_by_usage: bool = True, color_mode: str = 'legacy', show_topic_labels: bool | int = 10, output_dir: str | None = None, display: bool = True) str[source]

Create 3D UMAP visualization with PC directional arrows.

This creates an interactive 3D scatter plot where:
  • Topics are positioned in 3D UMAP space based on their semantic similarity
  • Arrows show PC directions from the average of the bottom-k to the top-k topics
  • Topics are colored based on HU/AI usage patterns (quartiles)

The key insight: In UMAP space, PC gradients often form linear patterns, allowing us to visualize how principal components map to topic space.

Parameters:
  • pc_indices (int or List[int], optional) – PC indices to show arrows for (0-based).
    • If None and arrow_mode='all': shows arrows for PC1, PC2, PC3
    • If int: shows arrow for that single PC
    • If list: shows arrows for all PCs in the list

  • top_k (int, default=1) – Number of top/bottom scoring topics to average for arrow endpoints. Default=1 for cleaner single-topic arrows. If fewer topics meet the threshold, uses all available.

  • percentile_threshold (float, default=90.0) – Percentile threshold for selecting top/bottom topics. 90.0 means top 10% and bottom 10% of topics.

  • arrow_mode (str, default='all') – Controls which arrows to display:
    • 'single': Show arrow for a single PC
    • 'list': Show arrows for the specified list of PCs
    • 'all': Show arrows for the first 3 PCs

  • color_by_usage (bool, default=True) – Whether to color topics by HU/AI usage patterns

  • color_mode (str, default='legacy') – Coloring mode when color_by_usage=True:
    • 'legacy': Use PC coefficient-based inference (original behavior)
    • 'validity': Use direct X/HU/AI measurement (consistent with word clouds)

  • show_topic_labels (bool or int, default=10) – Controls topic label display:
    • True: Show all topic labels
    • False: Hide all labels (hover still works)
    • int: Show only the N closest topics to the camera

  • output_dir (str, optional) – Directory to save output. If None, uses current directory

  • display (bool, default=True) – Whether to display in notebook/colab

Returns:

Path to saved HTML file

Return type:

str

Examples

# Show arrows with new validity coloring (consistent with word clouds)
haam.create_3d_umap_with_pc_arrows(color_mode='validity')

# Use legacy PC-based coloring (default)
haam.create_3d_umap_with_pc_arrows(color_mode='legacy')

# Show arrow only for PC5 with validity coloring
haam.create_3d_umap_with_pc_arrows(
    pc_indices=4, arrow_mode='single', color_mode='validity'
)

# Show arrows for specific PCs with a stricter threshold
haam.create_3d_umap_with_pc_arrows(
    pc_indices=[0, 3, 7],
    percentile_threshold=95.0,  # top/bottom 5%
    arrow_mode='list',
    color_mode='validity'
)

create_pc_wordclouds(pc_idx: int, k: int = 10, max_words: int = 100, figsize: Tuple[int, int] = (10, 5), output_dir: str | None = None, display: bool = True, color_mode: str = 'pole') Tuple[Any, str, str][source]

Create word clouds for high and low poles of a specific PC.

Parameters:
  • pc_idx (int) – PC index (0-based)

  • k (int) – Number of topics to include from each pole

  • max_words (int) – Maximum words to display in word cloud

  • figsize (Tuple[int, int]) – Figure size (width, height) for each subplot

  • output_dir (str, optional) – Directory to save output files

  • display (bool) – Whether to display the plots

  • color_mode (str, optional) –

    'pole' (default): Red for high pole, blue for low pole
    'validity': Color based on X/HU/AI agreement:

    • Dark red: top quartile for all (X, HU, AI)

    • Light red: top quartile for HU & AI only

    • Dark blue: bottom quartile for all

    • Light blue: bottom quartile for HU & AI only

    • Grey: mixed signals

Returns:

Figure object, high pole output path, low pole output path

Return type:

Tuple[Figure, str, str]

create_all_pc_wordclouds(pc_indices: List[int] | None = None, k: int = 10, max_words: int = 100, figsize: Tuple[int, int] = (10, 5), output_dir: str = './wordclouds', display: bool = False, color_mode: str = 'pole') Dict[int, Tuple[str, str]][source]

Create word clouds for all specified PCs.

Parameters:
  • pc_indices (List[int], optional) – List of PC indices. If None, uses top 9 PCs by 'triple' ranking

  • k (int) – Number of topics to include from each pole

  • max_words (int) – Maximum words to display in word cloud

  • figsize (Tuple[int, int]) – Figure size for each subplot

  • output_dir (str) – Directory to save output files

  • display (bool) – Whether to display each plot

  • color_mode (str) – 'pole' or 'validity' coloring mode

Returns:

Dictionary mapping PC index to (high_path, low_path)

Return type:

Dict[int, Tuple[str, str]]
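
The return value maps each 0-based PC index to a (high_path, low_path) pair. A sketch of consuming it (the call is commented and the paths shown are hypothetical placeholders):

```python
# paths = haam.create_all_pc_wordclouds(pc_indices=[0, 1], output_dir="./wordclouds")
# Hypothetical return shape: PC index -> (high-pole path, low-pole path).
paths = {0: ("pc1_high.png", "pc1_low.png"),
         1: ("pc2_high.png", "pc2_low.png")}

report = [f"PC{idx + 1}: high={high}, low={low}"
          for idx, (high, low) in sorted(paths.items())]
```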

create_top_pcs_wordcloud_grid(pc_indices: List[int] | None = None, ranking_method: str = 'triple', n_pcs: int = 9, k: int = 10, max_words: int = 50, output_file: str | None = None, display: bool = True, color_mode: str = 'pole') Any[source]

Create a grid visualization of word clouds for top PCs.

Parameters:
  • pc_indices (List[int], optional) – List of PC indices. If None, uses top PCs by ranking_method

  • ranking_method (str) – Method to rank PCs if pc_indices not provided: 'triple', 'HU', 'AI', 'X'

  • n_pcs (int) – Number of top PCs to include if pc_indices not provided

  • k (int) – Number of topics to include from each pole

  • max_words (int) – Maximum words per word cloud

  • output_file (str, optional) – Path to save the grid visualization

  • display (bool) – Whether to display the plot

  • color_mode (str) –
    'pole' (default): Red for high pole, blue for low pole
    'validity': Color based on X/HU/AI agreement

Returns:

Figure object

Return type:

Figure

create_comprehensive_pc_analysis(pc_indices: List[int] | None = None, n_pcs: int = 15, k_topics: int = 3, max_words: int = 100, generate_wordclouds: bool = True, generate_3d_umap: bool = True, umap_arrow_k: int = 1, show_data_counts: bool = True, output_dir: str | None = None, display: bool = True) Dict[str, Any][source]

Create comprehensive PC analysis with word cloud table and 3D UMAP visualization.

This method generates a complete analysis similar to the Colab scripts, including:
  • Individual word clouds for each PC's high and low poles
  • A comprehensive table showing all PCs with X/HU/AI quartile labels
  • 3D UMAP visualization with PC arrows (optional)
  • Summary report with data availability statistics

Parameters:
  • pc_indices (List[int], optional) – List of PC indices (0-based) to analyze. If None, uses first n_pcs PCs. Example: [2, 1, 3, 4, 5] for PC3, PC2, PC4, PC5, PC6

  • n_pcs (int, default=15) – Number of PCs to analyze if pc_indices not provided

  • k_topics (int, default=3) – Number of topics to include from each pole in word clouds

  • max_words (int, default=100) – Maximum words to display in each word cloud

  • generate_wordclouds (bool, default=True) – Whether to generate word cloud table

  • generate_3d_umap (bool, default=True) – Whether to generate 3D UMAP visualization with PC arrows

  • umap_arrow_k (int, default=1) – Number of topics for arrow endpoints in UMAP (1 = single topic endpoints)

  • show_data_counts (bool, default=True) – Whether to show data availability counts (e.g., "HU: n=3" for sparse data)

  • output_dir (str, optional) – Directory to save all outputs. If None, creates 'haam_comprehensive_analysis'

  • display (bool, default=True) – Whether to display visualizations

Returns:

Dictionary containing:
  • 'wordcloud_paths': Dict mapping PC index to (high_path, low_path)
  • 'table_path': Path to the comprehensive PC table image
  • 'umap_path': Path to the 3D UMAP HTML (if generated)
  • 'report_path': Path to a text report with statistics
  • 'summary': Dict with analysis summary statistics

Return type:

Dict[str, Any]

Examples

# Analyze first 15 PCs with all defaults
results = haam.create_comprehensive_pc_analysis()

# Analyze specific PCs
specific_pcs = [2, 1, 3, 4, 5, 14, 13, 11, 12, 46, 9, 17, 16, 20, 105]
results = haam.create_comprehensive_pc_analysis(pc_indices=specific_pcs)

# Only generate word clouds, skip 3D UMAP
results = haam.create_comprehensive_pc_analysis(
    n_pcs=10, generate_3d_umap=False
)

export_all_results(output_dir: str | None = None) Dict[str, str][source]

Export all results and create all visualizations.

Parameters:

output_dir (str, optional) – Directory to save all outputs

Returns:

Dictionary of all output file paths

Return type:

Dict[str, str]

class haam.HAAMwithBWS(criterion: ndarray | Series | list, ai_judgment: ndarray | Series | list, human_judgment: ndarray | Series | list, bws_features: ndarray | DataFrame, feature_names: List[str] | None = None, texts: List[str] | None = None, standardize: bool = True, sample_split_post_lasso: bool = True, auto_run: bool = True, random_state: int = 42)[source]

Bases: object

HAAM analysis for Best-Worst Scaling (BWS) or other interpretable features.

This class provides HAAM analysis capabilities for pre-computed interpretable features, bypassing PCA while maintaining all statistical analysis and visualization capabilities of standard HAAM.

Parameters:
  • criterion (array-like) – Ground truth variable (e.g., social class)

  • ai_judgment (array-like) – AI predictions/ratings

  • human_judgment (array-like) – Human ratings

  • bws_features (array-like) – Pre-computed interpretable features (e.g., BWS scores). Shape: (n_samples, n_features)

  • feature_names (list, optional) – Names for each BWS feature for interpretability

  • texts (list of str, optional) – Original texts (for supplementary analysis)

  • standardize (bool, default=True) – Whether to standardize features before analysis

  • sample_split_post_lasso (bool, default=True) – Whether to use sample splitting for post-LASSO inference. True: conservative inference with valid p-values. False: maximum power but potential selection bias.

  • auto_run (bool, default=True) – Whether to automatically run the full pipeline

  • random_state (int, default=42) – Random seed for reproducibility
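
Because bws_features replaces the PCA feature matrix, the only structural requirement is a (n_samples, n_features) array plus optional names. A hedged sketch with synthetic features (everything below is hypothetical; the constructor call is commented since it needs the installed package and the three judgment arrays):

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_features = 300, 12

bws_features = rng.normal(size=(n, n_features))   # stand-in for BWS scores
feature_names = [f"bws_dim_{i}" for i in range(n_features)]

# model = HAAMwithBWS(criterion, ai_judgment, human_judgment,
#                     bws_features=bws_features,
#                     feature_names=feature_names)
```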

Methods

export_results([output_dir])

Export results to files.

run_analysis()

Run the complete HAAM analysis pipeline with BWS features.

__init__(criterion: ndarray | Series | list, ai_judgment: ndarray | Series | list, human_judgment: ndarray | Series | list, bws_features: ndarray | DataFrame, feature_names: List[str] | None = None, texts: List[str] | None = None, standardize: bool = True, sample_split_post_lasso: bool = True, auto_run: bool = True, random_state: int = 42)[source]
run_analysis()[source]

Run the complete HAAM analysis pipeline with BWS features.

export_results(output_dir: str = '.')[source]

Export results to files.

class haam.HAAMAnalysis(criterion: ndarray, ai_judgment: ndarray, human_judgment: ndarray, embeddings: ndarray | None = None, texts: List[str] | None = None, n_components: int = 200, random_state: int = 42, standardize: bool = False)[source]

Bases: object

Main class for Human-AI Accuracy Model analysis.

This class performs sample-split post-lasso regression analysis and generates various visualizations for understanding the relationships between human judgments, AI judgments, and a criterion variable.

Methods

display_all_results()

Display all HAAM results including coefficients and statistics in Colab.

display_coefficient_tables()

Display comprehensive LASSO and post-LASSO model outputs.

display_global_statistics()

Display comprehensive global statistics in organized sections.

display_mediation_results()

Display mediation analysis results with visualization in Colab.

export_coefficients_with_inference([output_dir])

Export both LASSO and post-LASSO coefficients with statistical inference.

export_global_statistics([output_dir])

Export comprehensive global statistics to CSV files.

export_results([output_dir, prefix])

Export results to CSV files.

fit_debiased_lasso([use_sample_splitting, alpha])

Fit debiased lasso models for all outcomes.

generate_embeddings(texts[, model_name, ...])

Generate embeddings using MiniLM model.

get_top_pcs([n_top, ranking_method])

Get top PCs based on ranking method.

__init__(criterion: ndarray, ai_judgment: ndarray, human_judgment: ndarray, embeddings: ndarray | None = None, texts: List[str] | None = None, n_components: int = 200, random_state: int = 42, standardize: bool = False)[source]

Initialize HAAM Analysis.

Parameters:
  • criterion (np.ndarray) – Criterion variable (e.g., social class)

  • ai_judgment (np.ndarray) – AI predictions/ratings

  • human_judgment (np.ndarray) – Human ratings

  • embeddings (np.ndarray, optional) – Pre-computed embeddings. If None, will be generated from texts

  • texts (List[str], optional) – Text data for generating embeddings if not provided

  • n_components (int, default=200) – Number of PCA components to extract

  • random_state (int, default=42) – Random state for reproducibility

  • standardize (bool, default=False) – Whether to standardize X and outcome variables for both total effects and DML calculations. When True, all coefficients will be in standardized units.

static generate_embeddings(texts: List[str], model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', batch_size: int = 32) ndarray[source]

Generate embeddings using MiniLM model.

Parameters:
  • texts (List[str]) – List of text documents

  • model_name (str) – Name of the sentence transformer model

  • batch_size (int) – Batch size for encoding

Returns:

Embedding matrix (n_samples, embedding_dim)

Return type:

np.ndarray
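
Since this is a static method, no HAAMAnalysis instance is needed. A minimal sketch (the call is commented because it downloads and runs the sentence-transformers model; the texts are hypothetical):

```python
texts = [
    "I take the train to my downtown office.",
    "We repaired the tractor before harvest.",
]

# embeddings = HAAMAnalysis.generate_embeddings(texts, batch_size=32)
# The default all-MiniLM-L6-v2 model produces 384-dimensional vectors,
# so embeddings.shape should be (2, 384) here.
```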

fit_debiased_lasso(use_sample_splitting: bool = True, alpha: float | None = None) Dict[str, Any][source]

Fit debiased lasso models for all outcomes.

Parameters:
  • use_sample_splitting (bool, default=True) – Whether to use sample splitting for valid inference

  • alpha (float, optional) – Regularization parameter. If None, uses CV

Returns:

Dictionary containing all results

Return type:

Dict[str, Any]

get_top_pcs(n_top: int = 9, ranking_method: str = 'triple') List[int][source]

Get top PCs based on ranking method.

Parameters:
  • n_top (int, default=9) – Number of top PCs to return

  • ranking_method (str, default='triple') – Method for ranking: 'X', 'AI', 'HU', or 'triple'

Returns:

Indices of top PCs (0-based)

Return type:

List[int]
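
The returned indices are 0-based, so they are one less than the PC numbers used in figure labels. A sketch of mapping a hypothetical result back to labels (the list below is illustrative, not real output):

```python
# Hypothetical result of get_top_pcs(n_top=3, ranking_method='triple'):
top_pcs = [2, 0, 7]                       # 0-based indices
labels = [f"PC{i + 1}" for i in top_pcs]  # 1-based display labels
```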

export_results(output_dir: str | None = None, prefix: str = 'haam_results') Dict[str, str][source]

Export results to CSV files.

Parameters:
  • output_dir (str, optional) – Output directory. If None, uses current directory

  • prefix (str, default='haam_results') – Prefix for output files

Returns:

Dictionary of output file paths

Return type:

Dict[str, str]

display_mediation_results()[source]

Display mediation analysis results with visualization in Colab.

display_global_statistics()[source]

Display comprehensive global statistics in organized sections.

display_coefficient_tables()[source]

Display comprehensive LASSO and post-LASSO model outputs.

export_global_statistics(output_dir: str | None = None)[source]

Export comprehensive global statistics to CSV files.

export_coefficients_with_inference(output_dir: str | None = None)[source]

Export both LASSO and post-LASSO coefficients with statistical inference.

display_all_results()[source]

Display all HAAM results including coefficients and statistics in Colab.

class haam.TopicAnalyzer(texts: List[str], embeddings: ndarray, pca_features: ndarray, min_cluster_size: int = 10, min_samples: int = 2, umap_n_components: int = 3)[source]

Bases: object

Analyze topics and their relationships with principal components.

Methods

create_topic_summary_for_pcs(pc_indices[, ...])

Create concise topic summaries for specified PCs.

get_pc_high_low_topics(pc_idx[, n_high, ...])

Get high and low topics for a specific PC.

get_pc_topic_associations([pc_indices, n_topics])

Get topic associations for specified PCs.

__init__(texts: List[str], embeddings: ndarray, pca_features: ndarray, min_cluster_size: int = 10, min_samples: int = 2, umap_n_components: int = 3)[source]

Initialize topic analyzer with enhanced parameters.

Now uses optimized hyperparameters:
  • UMAP: n_neighbors=5, min_dist=0.0, metric='cosine'
  • HDBSCAN: min_cluster_size=10, min_samples=2
  • c-TF-IDF: BERTopic formula for better topic extraction

Parameters:
  • texts (List[str]) – Original text documents

  • embeddings (np.ndarray) – Document embeddings

  • pca_features (np.ndarray) – PCA-transformed features

  • min_cluster_size (int) – Minimum cluster size for HDBSCAN (default: 10, matches BERTopic-style)

  • min_samples (int) – Minimum samples for core points (default: 2)

  • umap_n_components (int) – Number of UMAP components for clustering (default: 3)

get_pc_topic_associations(pc_indices: List[int] | None = None, n_topics: int = 15) Dict[int, List[Dict]][source]

Get topic associations for specified PCs.

Parameters:
  • pc_indices (List[int], optional) – PC indices to analyze. If None, analyzes all

  • n_topics (int) – Number of top/bottom topics to return per PC

Returns:

Dictionary mapping PC index to list of topic associations

Return type:

Dict[int, List[Dict]]

get_pc_high_low_topics(pc_idx: int, n_high: int = 5, n_low: int = 5, p_threshold: float = 0.05) Dict[str, List[Dict]][source]

Get high and low topics for a specific PC.

Parameters:
  • pc_idx (int) – PC index (0-based)

  • n_high (int) – Number of high topics to return

  • n_low (int) – Number of low topics to return

  • p_threshold (float) – P-value threshold for significance

Returns:

Dictionary with 'high' and 'low' topic lists

Return type:

Dict[str, List[Dict]]

create_topic_summary_for_pcs(pc_indices: List[int], n_keywords: int = 5, n_topics_per_side: int = 3) Dict[int, Dict[str, List[str]]][source]

Create concise topic summaries for specified PCs.

Parameters:
  • pc_indices (List[int]) – PC indices to summarize

  • n_keywords (int) – Number of keywords to show per topic

  • n_topics_per_side (int) – Number of high/low topics to include

Returns:

PC index -> {'high_topics': […], 'low_topics': […]}

Return type:

Dict[int, Dict[str, List[str]]]
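
A sketch of consuming that nested return shape (the dictionary contents below are hypothetical placeholders, not real topic output):

```python
# Hypothetical return shape: PC index -> {'high_topics': [...], 'low_topics': [...]}
summaries = {
    0: {"high_topics": ["career, salary, office"],
        "low_topics": ["chores, farm, harvest"]},
}

lines = []
for pc_idx, s in sorted(summaries.items()):
    # pc_idx is 0-based, so PC1 corresponds to key 0
    lines.append(f"PC{pc_idx + 1} high: {s['high_topics']} | low: {s['low_topics']}")
```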

class haam.HAAMVisualizer(haam_results: Dict, topic_summaries: Dict | None = None)[source]

Bases: object

Create visualizations for HAAM analysis results.

Methods

create_3d_pca_with_arrows(pca_features, ...)

Create 3D PCA visualization with directional arrows showing PC gradients.

create_3d_umap_with_pc_arrows(...[, ...])

Create 3D UMAP visualization with PC directional arrows.

create_all_pc_umap_visualizations(...[, ...])

Create UMAP visualizations for multiple PCs.

create_main_visualization(pc_indices[, ...])

Create main HAAM framework visualization with dynamic metrics.

create_metrics_summary([output_file])

Create a comprehensive summary of all HAAM metrics.

create_mini_visualization([n_components, ...])

Create mini grid visualization of all PCs.

create_pc_effects_plot(pc_indices[, output_file])

Create bar chart showing PC effects.

create_pc_umap_with_topics(pc_idx, ...[, ...])

Create UMAP visualization colored by PC scores with topic labels.

create_umap_visualization(umap_embeddings[, ...])

Create interactive UMAP visualization.

plot_pc_effects(pc_idx, topic_associations)

Create 4-panel bar chart showing PC effects on outcomes.

__init__(haam_results: Dict, topic_summaries: Dict | None = None)[source]

Initialize visualizer.

Parameters:
  • haam_results (Dict) – Results from HAAMAnalysis

  • topic_summaries (Dict, optional) – Topic summaries from TopicAnalyzer

create_main_visualization(pc_indices: List[int], output_file: str | None = None, pc_names: Dict[int, str] | None = None, ranking_method: str = 'HU') str[source]

Create main HAAM framework visualization with dynamic metrics.

The visualization now shows:
  • Generic "X" label instead of "SC" for the criterion
  • Dynamically calculated R², PoMA, and unmodeled path percentages
  • Custom PC names when provided (shows "-" otherwise)
  • Enhanced topic display using c-TF-IDF

Parameters:
  • pc_indices (List[int]) – List of PC indices to display (0-based)

  • output_file (str, optional) – Path to save HTML file

  • pc_names (Dict[int, str], optional) – Manual names for PCs. Keys are PC indices (0-based), values are names. If not provided, uses "-" for all PCs. Example: {0: "Formality", 3: "Complexity", 6: "Sentiment"}

  • ranking_method (str, default='HU') – Method used to rank PCs: 'HU', 'AI', 'X', or 'triple'

Returns:

HTML content

Return type:

str

create_mini_visualization(n_components: int = 200, n_highlight: int = 20, output_file: str | None = None) str[source]

Create mini grid visualization of all PCs.

Parameters:
  • n_components (int) – Total number of components to show

  • n_highlight (int) – Number of top components to highlight

  • output_file (str, optional) – Path to save HTML file

Returns:

HTML content

Return type:

str

plot_pc_effects(pc_idx: int, topic_associations: Dict, figsize: Tuple[int, int] = (15, 6)) Figure[source]

Create 4-panel bar chart showing PC effects on outcomes.

Parameters:
  • pc_idx (int) – PC index (0-based)

  • topic_associations (Dict) – Topic associations from TopicAnalyzer

  • figsize (Tuple[int, int]) – Figure size

Returns:

Matplotlib figure

Return type:

plt.Figure

create_umap_visualization(umap_embeddings: ndarray, color_by: str = 'X', topic_labels: Dict | None = None, show_topics: bool = True, output_file: str | None = None) Figure[source]

Create interactive UMAP visualization.

Parameters:
  • umap_embeddings (np.ndarray) – 2D or 3D UMAP embeddings

  • color_by (str) – Variable to color by: 'X', 'AI', 'HU', or 'PC1', 'PC2', etc.

  • topic_labels (Dict, optional) – Topic labels for points

  • show_topics (bool) – Whether to show topic labels

  • output_file (str, optional) – Path to save HTML file

Returns:

Plotly figure

Return type:

go.Figure

create_pc_umap_with_topics(pc_idx: int, pc_scores: ndarray, umap_embeddings: ndarray, cluster_labels: ndarray, topic_keywords: Dict[int, str], pc_associations: Dict[int, List[Dict]], output_file: str | None = None, show_top_n: int = 5, show_bottom_n: int = 5, display: bool = True) Figure[source]

Create UMAP visualization colored by PC scores with topic labels.

Parameters:
  • pc_idx (int) – PC index (0-based)

  • pc_scores (np.ndarray) – PC scores for all samples

  • umap_embeddings (np.ndarray) – 3D UMAP embeddings

  • cluster_labels (np.ndarray) – Cluster assignments for each point

  • topic_keywords (Dict[int, str]) – Topic ID to keyword mapping

  • pc_associations (Dict[int, List[Dict]]) – PC-topic associations from TopicAnalyzer

  • output_file (str, optional) – Path to save HTML file

  • show_top_n (int) – Number of high-scoring topics to label

  • show_bottom_n (int) – Number of low-scoring topics to label

  • display (bool) – Whether to display in notebook/colab

Returns:

Plotly figure object

Return type:

go.Figure

create_all_pc_umap_visualizations(pc_indices: List[int], pc_scores_all: ndarray, umap_embeddings: ndarray, cluster_labels: ndarray, topic_keywords: Dict[int, str], pc_associations: Dict[int, List[Dict]], output_dir: str, show_top_n: int = 5, show_bottom_n: int = 5, display: bool = False) Dict[int, str][source]

Create UMAP visualizations for multiple PCs.

Parameters:
  • pc_indices (List[int]) – List of PC indices to visualize

  • pc_scores_all (np.ndarray) – All PC scores (n_samples x n_components)

  • umap_embeddings (np.ndarray) – 3D UMAP embeddings

  • cluster_labels (np.ndarray) – Cluster assignments

  • topic_keywords (Dict[int, str]) – Topic keywords

  • pc_associations (Dict[int, List[Dict]]) – PC-topic associations

  • output_dir (str) – Directory to save visualizations

  • show_top_n (int) – Number of high topics to show

  • show_bottom_n (int) – Number of low topics to show

  • display (bool) – Whether to display each plot

Returns:

Mapping of PC index to output file path

Return type:

Dict[int, str]

create_pc_effects_plot(pc_indices: List[int], output_file: str | None = None) Figure[source]

Create bar chart showing PC effects.

Parameters:
  • pc_indices (List[int]) – List of PC indices to plot

  • output_file (str, optional) – Path to save HTML file

Returns:

Plotly figure

Return type:

go.Figure

create_metrics_summary(output_file: str | None = None) Dict[str, Any][source]

Create a comprehensive summary of all HAAM metrics.

This method exports:

  • Model performance metrics (R² values for X, AI, HU)

  • Policy similarities between predictions

  • Mediation analysis results (PoMA percentages)

  • Feature selection statistics

Output is compatible with the new generic "X" labeling.

Parameters:

output_file (str, optional) – Path to save JSON file with metrics

Returns:

Dictionary containing all metrics, including:

  • model_performance: R² values for each model

  • policy_similarities: Correlations between predictions

  • mediation_analysis: PoMA and effect decomposition

  • feature_selection: Number and indices of selected PCs

Return type:

Dict[str, Any]
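The exact JSON layout is not documented beyond the top-level key names above; a hypothetical dictionary with that structure, round-tripped through json as the output_file option implies, might look like this (all nested keys and values are illustrative, not taken from the package):

```python
import json

# Hypothetical metrics dictionary mirroring the documented top-level keys;
# real values come from create_metrics_summary().
metrics = {
    "model_performance": {"X": {"r2": 0.41}, "AI": {"r2": 0.55}, "HU": {"r2": 0.38}},
    "policy_similarities": {"AI_vs_HU": 0.62, "AI_vs_X": 0.58},
    "mediation_analysis": {"AI": {"poma_pct": 72.5}, "HU": {"poma_pct": 64.1}},
    "feature_selection": {"n_selected": 14, "indices": [0, 2, 3, 5]},
}

# Saving to output_file would amount to something like:
summary_json = json.dumps(metrics, indent=2)
```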

create_3d_umap_with_pc_arrows(umap_embeddings: ndarray, cluster_labels: ndarray, topic_keywords: Dict[int, str], pc_scores_all: ndarray, pc_indices: int | List[int] | None = None, top_k: int = 1, percentile_threshold: float = 90.0, arrow_mode: str = 'all', color_by_usage: bool = True, color_mode: str = 'legacy', criterion: ndarray | None = None, human_judgment: ndarray | None = None, ai_judgment: ndarray | None = None, show_topic_labels: bool | int = 10, output_file: str | None = None, display: bool = True) Figure[source]

Create 3D UMAP visualization with PC directional arrows.

This method creates a 3D UMAP space where:

  • Topics are positioned based on their UMAP embeddings

  • Arrows show PC directions from low to high scoring topics

  • Arrow endpoints are averages of top-k and bottom-k topic positions

  • Topics are colored by HU/AI usage patterns

Parameters:
  • umap_embeddings (np.ndarray) – 3D UMAP embeddings (n_samples x 3)

  • cluster_labels (np.ndarray) – Cluster assignments for each point

  • topic_keywords (Dict[int, str]) – Topic ID to keyword mapping

  • pc_scores_all (np.ndarray) – PC scores for all samples (n_samples x n_components)

  • pc_indices (int or List[int], optional) – PC indices to show arrows for. If None and arrow_mode='all', shows the first 3

  • top_k (int, default=1) – Number of top/bottom topics to average for arrow endpoints (default=1 for cleaner arrows)

  • percentile_threshold (float, default=90.0) – Percentile threshold for determining top/bottom topics

  • arrow_mode (str, default='all') – Arrow display mode: 'single', 'list', or 'all'

  • color_by_usage (bool, default=True) – Whether to color topics by HU/AI usage patterns

  • color_mode (str, default='legacy') – Coloring mode when color_by_usage=True:

    • 'legacy': Use PC coefficient-based inference (original behavior)

    • 'validity': Use direct X/HU/AI measurement (consistent with word clouds)

  • criterion (np.ndarray, optional) – Ground truth values (X) for validity coloring mode

  • human_judgment (np.ndarray, optional) – Human judgment values (HU) for validity coloring mode

  • ai_judgment (np.ndarray, optional) – AI judgment values (AI) for validity coloring mode

  • show_topic_labels (bool or int, default=10) –

    • If True: Show all topic labels

    • If False: Hide all topic labels (hover still works)

    • If int: Show only the N topics closest to camera (dynamic)

  • output_file (str, optional) – Path to save HTML file

  • display (bool, default=True) – Whether to display in notebook/colab

Returns:

Plotly 3D figure object

Return type:

go.Figure
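The arrow construction described above (endpoints as averages of the top-k and bottom-k topic positions) can be sketched with plain NumPy. The topic positions and per-topic PC means here are synthetic stand-ins for what the method derives from umap_embeddings and pc_scores_all:

```python
import numpy as np

rng = np.random.default_rng(1)
n_topics = 10
topic_positions = rng.normal(size=(n_topics, 3))  # topic centroids in 3D UMAP space
topic_pc_means = rng.normal(size=n_topics)        # mean score on one PC per topic

top_k = 2
order = np.argsort(topic_pc_means)

# Arrow runs from the average position of the bottom-k topics (low pole)
# to the average position of the top-k topics (high pole).
arrow_start = topic_positions[order[:top_k]].mean(axis=0)
arrow_end = topic_positions[order[-top_k:]].mean(axis=0)
direction = arrow_end - arrow_start
```

With top_k=1 (the default), the arrow simply connects the single lowest- and highest-scoring topics, which is why the docstring notes it gives cleaner arrows.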

create_3d_pca_with_arrows(pca_features: ndarray, cluster_labels: ndarray, topic_keywords: Dict[int, str], pc_indices: int | List[int] | None = None, arrow_mode: str = 'all', color_by_usage: bool = True, output_file: str | None = None, display: bool = True) Figure[source]

Create 3D PCA visualization with directional arrows showing PC gradients.

This method creates a 3D scatter plot of the first 3 PCs with:

  • Topic clusters floating in 3D space

  • Directional arrows showing high->low gradients for specified PCs

  • Color coding based on HU/AI usage patterns

  • Interactive tooltips with topic information

Parameters:
  • pca_features (np.ndarray) – PCA-transformed features (n_samples x n_components)

  • cluster_labels (np.ndarray) – Cluster assignments for each point

  • topic_keywords (Dict[int, str]) – Topic ID to keyword mapping

  • pc_indices (int or List[int], optional) – PC indices to show arrows for. If None and arrow_mode='all', shows the first 3

  • arrow_mode (str, default='all') – Arrow display mode: 'single', 'list', or 'all'

  • color_by_usage (bool, default=True) – Whether to color topics by HU/AI usage patterns

  • output_file (str, optional) – Path to save HTML file

  • display (bool, default=True) – Whether to display in notebook/colab

Returns:

Plotly 3D figure object

Return type:

go.Figure
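The "topic clusters floating in 3D space" are naturally summarised by per-cluster centroids in the first three PCs. A minimal sketch with synthetic data (the package's actual placement logic may differ):

```python
import numpy as np

rng = np.random.default_rng(2)
pca_features = rng.normal(size=(300, 20))       # n_samples x n_components
cluster_labels = rng.integers(0, 6, size=300)   # 6 topic clusters

coords = pca_features[:, :3]                    # first 3 PCs define the 3D space
centroids = {t: coords[cluster_labels == t].mean(axis=0)
             for t in np.unique(cluster_labels)}
```

Each centroid gives the anchor point where a topic's label and tooltip would sit in the scatter plot.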

class haam.PCWordCloudGenerator(topic_analyzer, analysis_results=None, criterion=None, human_judgment=None, ai_judgment=None)[source]

Bases: object

Generate word clouds for principal component poles.

Methods

create_all_pc_wordclouds([pc_indices, k, ...])

Create word clouds for all specified PCs.

create_pc_wordclouds(pc_idx[, k, max_words, ...])

Create word clouds for high and low poles of a specific PC.

create_top_pcs_wordcloud_grid(pc_indices[, ...])

Create a grid visualization of word clouds for top PCs.

__init__(topic_analyzer, analysis_results=None, criterion=None, human_judgment=None, ai_judgment=None)[source]

Initialize word cloud generator.

Parameters:
  • topic_analyzer (TopicAnalyzer) – TopicAnalyzer instance with computed topics and PC associations

  • analysis_results (dict, optional) – HAAM analysis results containing model coefficients for validity coloring

  • criterion (array-like, optional) – Ground truth values (X) for direct validity measurement

  • human_judgment (array-like, optional) – Human judgment values (HU) for direct validity measurement

  • ai_judgment (array-like, optional) – AI judgment values (AI) for direct validity measurement

create_pc_wordclouds(pc_idx: int, k: int = 10, max_words: int = 100, figsize: Tuple[int, int] = (10, 5), output_dir: str | None = None, display: bool = True, color_mode: str = 'pole') Tuple[Figure, str, str][source]

Create word clouds for high and low poles of a specific PC.

Parameters:
  • pc_idx (int) – PC index (0-based)

  • k (int) – Number of topics to include from each pole

  • max_words (int) – Maximum words to display in word cloud

  • figsize (Tuple[int, int]) – Figure size (width, height) for each subplot

  • output_dir (str, optional) – Directory to save output files

  • display (bool) – Whether to display the plots

  • color_mode (str, optional) –

    'pole' (default): Red for high pole, blue for low pole.

    'validity': Color based on X/HU/AI agreement:

    • Dark red: consensus high (all in top quartile)

    • Light red: any high signal (at least one in top quartile)

    • Dark blue: consensus low (all in bottom quartile)

    • Light blue: any low signal (at least one in bottom quartile)

    • Dark grey: opposing signals (mix of high and low)

    • Light grey: all in middle quartiles

Returns:

Figure object, high pole output path, low pole output path

Return type:

Tuple[plt.Figure, str, str]
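The six-way 'validity' coloring rule listed above can be expressed as a small standalone function. The quartile-binning convention (0 = bottom quartile, 3 = top quartile) is an assumption for illustration, not taken from the package:

```python
def validity_color(x_q: int, hu_q: int, ai_q: int) -> str:
    """Classify a topic by the quartile (0=bottom .. 3=top) it occupies for
    X, HU and AI, following the color rules described for color_mode='validity'.
    The quartile encoding is assumed; the package may bin differently."""
    qs = [x_q, hu_q, ai_q]
    high = [q == 3 for q in qs]   # top-quartile signals
    low = [q == 0 for q in qs]    # bottom-quartile signals
    if all(high):
        return "dark red"         # consensus high: all in top quartile
    if all(low):
        return "dark blue"        # consensus low: all in bottom quartile
    if any(high) and any(low):
        return "dark grey"        # opposing signals: mix of high and low
    if any(high):
        return "light red"        # any high signal, none low
    if any(low):
        return "light blue"       # any low signal, none high
    return "light grey"           # all in middle quartiles
```

Note the opposing-signal check must precede the "any high"/"any low" checks, otherwise a mixed topic would be mislabelled light red.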

create_all_pc_wordclouds(pc_indices: List[int] | None = None, k: int = 10, max_words: int = 100, figsize: Tuple[int, int] = (10, 5), output_dir: str = './wordclouds', display: bool = False, color_mode: str = 'pole') Dict[int, Tuple[str, str]][source]

Create word clouds for all specified PCs.

Parameters:
  • pc_indices (List[int], optional) – List of PC indices. If None, uses all available PCs

  • k (int) – Number of topics to include from each pole

  • max_words (int) – Maximum words to display in word cloud

  • figsize (Tuple[int, int]) – Figure size for each subplot

  • output_dir (str) – Directory to save output files

  • display (bool) – Whether to display each plot

  • color_mode (str) – 'pole' or 'validity' coloring mode

Returns:

Dictionary mapping PC index to (high_path, low_path)

Return type:

Dict[int, Tuple[str, str]]

create_top_pcs_wordcloud_grid(pc_indices: List[int], k: int = 10, max_words: int = 50, output_file: str | None = None, display: bool = True, color_mode: str = 'pole') Figure[source]

Create a grid visualization of word clouds for top PCs.

Parameters:
  • pc_indices (List[int]) – List of PC indices to visualize (max 9 for 3x3 grid)

  • k (int) – Number of topics to include from each pole

  • max_words (int) – Maximum words per word cloud

  • output_file (str, optional) – Path to save the grid visualization

  • display (bool) – Whether to display the plot

  • color_mode (str) –

    'pole' (default): Red for high pole, blue for low pole.

    'validity': Color based on X/HU/AI agreement.

Returns:

Figure object

Return type:

plt.Figure