haam.haam_init module

HAAM Package - Simplified API

Main module providing a simplified interface for HAAM analysis.

class haam.haam_init.HAAM(criterion: ndarray | Series | list, ai_judgment: ndarray | Series | list, human_judgment: ndarray | Series | list, embeddings: ndarray | DataFrame | None = None, texts: List[str] | None = None, n_components: int = 200, auto_run: bool = True, min_cluster_size: int = 10, min_samples: int = 2, umap_n_components: int = 3, standardize: bool = False, sample_split_post_lasso: bool = True)[source]

Bases: object

Simplified interface for HAAM analysis.

This class provides an easy-to-use API for performing the full HAAM analysis pipeline including statistical analysis, topic modeling, and visualization generation.

Methods

`create_3d_umap_with_pc_arrows`([pc_indices, ...])	Create 3D UMAP visualization with PC directional arrows.
`create_all_pc_umap_visualizations`([...])	Create UMAP visualizations for multiple PCs with topic labels.
`create_all_pc_wordclouds`([pc_indices, k, ...])	Create word clouds for all specified PCs.
`create_comprehensive_pc_analysis`([...])	Create comprehensive PC analysis with word cloud table and 3D UMAP visualization.
`create_main_visualization`([output_dir, pc_names])	Create and save main HAAM visualization.
`create_metrics_summary`([output_dir])	Create and save comprehensive metrics summary.
`create_mini_grid`([output_dir])	Create and save mini grid visualization.
`create_pc_effects_visualization`([...])	Create PC effects bar chart visualization.
`create_pc_wordclouds`(pc_idx[, k, max_words, ...])	Create word clouds for high and low poles of a specific PC.
`create_top_pcs_wordcloud_grid`([pc_indices, ...])	Create a grid visualization of word clouds for top PCs.
`create_umap_visualization`([n_components, ...])	Create UMAP visualization.
`explore_pc_topics`([pc_indices, n_topics])	Explore topic associations for specified PCs.
`export_all_results`([output_dir])	Export all results and create all visualizations.
`plot_pc_effects`(pc_idx[, save_path])	Create bar plot showing PC effects on outcomes.
`run_full_analysis`()	Run the complete HAAM analysis pipeline.
`visualize_pc_umap_with_topics`(pc_idx[, ...])	Create UMAP visualization for a specific PC with topic labels.

Initialize HAAM analysis with enhanced parameters.

Recent updates:

- UMAP: n_neighbors=5, min_dist=0.0, metric='cosine'
- HDBSCAN: min_cluster_size=10, min_samples=2
- c-TF-IDF implementation with BERTopic formula
- Generic "X" labeling instead of "SC" in visualizations

Parameters:

criterion (array-like) – Criterion variable (any ground truth variable, not limited to social class)
ai_judgment (array-like) – AI predictions/ratings
human_judgment (array-like) – Human ratings
embeddings (array-like, optional) – Pre-computed embeddings. If None, will be generated from texts
texts (List[str], optional) – Text data for generating embeddings if not provided
n_components (int, default=200) – Number of PCA components
auto_run (bool, default=True) – Whether to automatically run the full analysis
min_cluster_size (int, default=10) – Minimum cluster size for HDBSCAN (matches BERTopic-style clustering)
min_samples (int, default=2) – Minimum samples for core points in HDBSCAN
umap_n_components (int, default=3) – Number of UMAP components for clustering (3D by default)
standardize (bool, default=False) – Whether to standardize X and outcome variables for both total effects and DML calculations. When True, all coefficients will be in standardized units.
sample_split_post_lasso (bool, default=True) – Whether to use sample splitting for post-LASSO inference.
- True: Conservative inference with valid p-values (original behavior)
- False: Maximum statistical power using full sample
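
A minimal sketch of constructing the class, assuming the package is importable as `haam.haam_init`, as the module path above suggests. The synthetic data and the helper names `make_synthetic_inputs` and `run_haam` are illustrative and not part of the package. The NumPy and HAAM imports are deferred to `run_haam`, so the data-building step works on its own. Whether the topic-modeling steps run without `texts` is not stated on this page, so treat the second function as untested.

```python
import random


def make_synthetic_inputs(n=200, dim=16, seed=0):
    """Build toy inputs (illustrative only): one embedding row per observation."""
    rng = random.Random(seed)
    embeddings = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    # Toy criterion: driven by the first embedding dimension plus noise.
    criterion = [row[0] + rng.gauss(0.0, 0.5) for row in embeddings]
    # Toy AI and human judgments: noisy readings of the criterion.
    ai_judgment = [c + rng.gauss(0.0, 0.5) for c in criterion]
    human_judgment = [c + rng.gauss(0.0, 1.0) for c in criterion]
    return criterion, ai_judgment, human_judgment, embeddings


def run_haam(criterion, ai_judgment, human_judgment, embeddings):
    """Construct HAAM with auto_run=False, then run the pipeline explicitly."""
    import numpy as np
    from haam.haam_init import HAAM  # module path taken from this page

    analysis = HAAM(
        criterion=criterion,
        ai_judgment=ai_judgment,
        human_judgment=human_judgment,
        embeddings=np.asarray(embeddings),
        n_components=10,  # PCA cannot extract more components than dimensions (16 here)
        auto_run=False,
    )
    results = analysis.run_full_analysis()
    return analysis, results


# Only the data-building step runs here; call run_haam(*inputs) where haam is installed.
inputs = make_synthetic_inputs()
```

Passing `auto_run=False` separates construction from the (potentially slow) pipeline run, which can make it easier to check the inputs first.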

run_full_analysis() → Dict[str, Any][source]

Run the complete HAAM analysis pipeline.

Returns:

Dictionary containing all results

Return type:

Dict[str, Any]

create_main_visualization(output_dir: str | None = None, pc_names: Dict[int, str] | None = None) → str[source]

Create and save main HAAM visualization.

Parameters:

output_dir (str, optional) – Directory to save output. If None, uses current directory
pc_names (Dict[int, str], optional) – Manual names for PCs. Keys are PC indices (0-based), values are names. Example: {4: "Lifestyle & Work", 7: "Professions", 1: "Narrative Style"}

Returns:

Path to saved HTML file

Return type:

str
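
Because `pc_names` is keyed by 0-based index while PCs are usually referred to by 1-based name (index 4 is PC5), the mapping is easy to get off by one. The sketch below illustrates the convention; `pc_label` is a hypothetical helper, and the assumption that unnamed PCs fall back to a "PC<n>" label is mine, not something this page states.

```python
def pc_label(pc_idx, pc_names=None):
    """Return the manual name for a 0-based PC index, else 'PC<n>' (1-based)."""
    if pc_names and pc_idx in pc_names:
        return pc_names[pc_idx]
    return f"PC{pc_idx + 1}"


# Keys are 0-based: 4 names PC5, 7 names PC8, 1 names PC2.
pc_names = {4: "Lifestyle & Work", 7: "Professions", 1: "Narrative Style"}
labels = [pc_label(i, pc_names) for i in (0, 1, 4, 7)]
print(labels)  # ['PC1', 'Narrative Style', 'Lifestyle & Work', 'Professions']
```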

create_mini_grid(output_dir: str | None = None) → str[source]

Create and save mini grid visualization.

Parameters:

output_dir (str, optional) – Directory to save output. If None, uses current directory

Returns:

Path to saved HTML file

Return type:

str

create_pc_effects_visualization(output_dir: str | None = None, n_top: int = 20) → str[source]

Create PC effects bar chart visualization.

Parameters:

output_dir (str, optional) – Directory to save output. If None, uses current directory
n_top (int) – Number of top PCs to display

Returns:

Path to saved HTML file

Return type:

str

create_metrics_summary(output_dir: str | None = None) → Dict[str, Any][source]

Create and save comprehensive metrics summary.

Exports all key metrics, including:

- Model performance (R² values for X, AI, HU)
- Policy similarities between predictions
- Mediation analysis (PoMA percentages)
- Feature selection statistics

Parameters:

output_dir (str, optional) – Directory to save output. If None, uses current directory

Returns:

Dictionary containing all metrics (also saved to JSON file)

Return type:

Dict[str, Any]

explore_pc_topics(pc_indices: List[int] | None = None, n_topics: int = 10) → DataFrame[source]

Explore topic associations for specified PCs.

Parameters:

pc_indices (List[int], optional) – PC indices to explore (0-based). If None, uses top PCs
n_topics (int) – Number of topics to show per PC

Returns:

DataFrame with topic information

Return type:

pd.DataFrame

plot_pc_effects(pc_idx: int, save_path: str | None = None)[source]

Create bar plot showing PC effects on outcomes.

Parameters:

pc_idx (int) – PC index (0-based)
save_path (str, optional) – Path to save figure

create_umap_visualization(n_components: int = 3, color_by: str = 'X', output_dir: str | None = None) → str[source]

Create UMAP visualization.

Parameters:

n_components (int) – Number of UMAP components (2 or 3)
color_by (str) – Variable to color by: 'X', 'AI', 'HU', or 'PC1', 'PC2', etc.
output_dir (str, optional) – Directory to save output

Returns:

Path to saved HTML file

Return type:

str

visualize_pc_umap_with_topics(pc_idx: int, output_dir: str | None = None, show_top_n: int = 5, show_bottom_n: int = 5, display: bool = True) → str[source]

Create UMAP visualization for a specific PC with topic labels.

Parameters:

pc_idx (int) – PC index (0-based). E.g., 0 for PC1, 4 for PC5
output_dir (str, optional) – Directory to save output. If None, saves to current directory
show_top_n (int) – Number of high-scoring topics to label
show_bottom_n (int) – Number of low-scoring topics to label
display (bool) – Whether to display in notebook/colab

Returns:

Path to saved HTML file

Return type:

str

create_all_pc_umap_visualizations(pc_indices: List[int] | None = None, output_dir: str | None = None, show_top_n: int = 5, show_bottom_n: int = 5, display: bool = False) → Dict[int, str][source]

Create UMAP visualizations for multiple PCs with topic labels.

Parameters:

pc_indices (List[int], optional) – List of PC indices to visualize. If None, uses top 10 PCs
output_dir (str, optional) – Directory to save visualizations
show_top_n (int) – Number of high topics to show per PC
show_bottom_n (int) – Number of low topics to show per PC
display (bool) – Whether to display each plot (set False for batch processing)

Returns:

Mapping of PC index to output file path

Return type:

Dict[int, str]

create_3d_umap_with_pc_arrows(pc_indices: int | List[int] | None = None, top_k: int = 1, percentile_threshold: float = 90.0, arrow_mode: str = 'all', color_by_usage: bool = True, color_mode: str = 'legacy', show_topic_labels: bool | int = 10, output_dir: str | None = None, display: bool = True) → str[source]

Create 3D UMAP visualization with PC directional arrows.

This creates an interactive 3D scatter plot where:

- Topics are positioned in 3D UMAP space based on their semantic similarity
- Arrows show PC directions from the average of the bottom-k topics to the average of the top-k topics
- Topics are colored based on HU/AI usage patterns (quartiles)

The key insight: In UMAP space, PC gradients often form linear patterns, allowing us to visualize how principal components map to topic space.

Parameters:

pc_indices (int or List[int], optional) – PC indices to show arrows for (0-based).
- If None and arrow_mode='all': shows arrows for PC1, PC2, PC3
- If int: shows arrow for that single PC
- If list: shows arrows for all PCs in the list
top_k (int, default=1) – Number of top/bottom scoring topics to average for arrow endpoints. Default=1 for cleaner single-topic arrows. If fewer topics meet the threshold, uses all available.
percentile_threshold (float, default=90.0) – Percentile threshold for selecting top/bottom topics. 90.0 means top 10% and bottom 10% of topics.
arrow_mode (str, default='all') – Controls which arrows to display:
- 'single': Show arrow for single PC
- 'list': Show arrows for specified list of PCs
- 'all': Show arrows for first 3 PCs
color_by_usage (bool, default=True) – Whether to color topics by HU/AI usage patterns
color_mode (str, default='legacy') – Coloring mode when color_by_usage=True:
- 'legacy': Use PC coefficient-based inference (original behavior)
- 'validity': Use direct X/HU/AI measurement (consistent with word clouds)
show_topic_labels (bool or int, default=10) – Controls topic label display:
- True: Show all topic labels
- False: Hide all labels (hover still works)
- int: Show only the N topics closest to the camera
output_dir (str, optional) – Directory to save output. If None, uses current directory
display (bool, default=True) – Whether to display in notebook/colab

Returns:

Path to saved HTML file

Return type:

str

Examples

# Show arrows with new validity coloring (consistent with word clouds)
haam.create_3d_umap_with_pc_arrows(color_mode='validity')

# Use legacy PC-based coloring (default)
haam.create_3d_umap_with_pc_arrows(color_mode='legacy')

# Show arrow only for PC5 with validity coloring
haam.create_3d_umap_with_pc_arrows(
    pc_indices=4,
    arrow_mode='single',
    color_mode='validity'
)

# Show arrows for specific PCs with stricter threshold
haam.create_3d_umap_with_pc_arrows(
    pc_indices=[0, 3, 7],
    percentile_threshold=95.0,  # Top/bottom 5%
    arrow_mode='list',
    color_mode='validity'
)
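
To make the interaction of `top_k` and `percentile_threshold` concrete, here is a sketch of how arrow endpoints could be derived from per-topic PC scores and 3D positions. It is an illustration of the description above, not the package's implementation: the function names are hypothetical, and the linear-interpolation percentile is an assumption, since this page does not say which percentile method is used.

```python
def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def mean_point(points):
    """Component-wise mean of a list of equal-length coordinate tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def arrow_endpoints(scores, positions, top_k=1, percentile_threshold=90.0):
    """Return (start, end): mean positions of the lowest- and highest-scoring topics."""
    high_cut = percentile(scores, percentile_threshold)
    low_cut = percentile(scores, 100.0 - percentile_threshold)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Topics beyond each cut-off, most extreme first; keep at most top_k of them.
    low = [i for i in order if scores[i] <= low_cut][:top_k]
    high = [i for i in reversed(order) if scores[i] >= high_cut][:top_k]
    start = mean_point([positions[i] for i in low])
    end = mean_point([positions[i] for i in high])
    return start, end


# Five toy topics: PC scores and made-up 3D UMAP coordinates.
scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
positions = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 4, 4)]
start, end = arrow_endpoints(scores, positions, top_k=1, percentile_threshold=90.0)
```

With `top_k=2` on the same data only one topic clears each cut-off, so the sketch falls back to that single topic, mirroring the "uses all available" behavior described for `top_k`.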

create_pc_wordclouds(pc_idx: int, k: int = 10, max_words: int = 100, figsize: Tuple[int, int] = (10, 5), output_dir: str | None = None, display: bool = True, color_mode: str = 'pole') → Tuple[Any, str, str][source]

Create word clouds for high and low poles of a specific PC.

Parameters:

pc_idx (int) – PC index (0-based)
k (int) – Number of topics to include from each pole
max_words (int) – Maximum words to display in word cloud
figsize (Tuple[int, int]) – Figure size (width, height) for each subplot
output_dir (str, optional) – Directory to save output files
display (bool) – Whether to display the plots
color_mode (str, optional) – Coloring mode:
- 'pole' (default): Red for high pole, blue for low pole
- 'validity': Color based on X/HU/AI agreement:
  - Dark red: top quartile for all (X, HU, AI)
  - Light red: top quartile for HU & AI only
  - Dark blue: bottom quartile for all
  - Light blue: bottom quartile for HU & AI only
  - Grey: mixed signals

Returns:

Figure object, high pole output path, low pole output path

Return type:

Tuple[Figure, str, str]
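
The 'validity' color rules can be read as a small decision table. The sketch below encodes them as described for `color_mode`; `quartile_label` and `validity_color` are hypothetical names, and the use of `statistics.quantiles` for the quartile cut-offs is an assumption about how quartile membership might be determined, not the package's actual code.

```python
import statistics


def quartile_label(value, values):
    """Label a value 'top', 'bottom', or 'mid' relative to the quartiles of values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    if value >= q3:
        return "top"
    if value <= q1:
        return "bottom"
    return "mid"


def validity_color(x_label, hu_label, ai_label):
    """Map quartile labels for X, HU, and AI to the documented color names."""
    if hu_label == "top" and ai_label == "top":
        # HU and AI agree on the top quartile; darker if X agrees as well.
        return "dark red" if x_label == "top" else "light red"
    if hu_label == "bottom" and ai_label == "bottom":
        # HU and AI agree on the bottom quartile; darker if X agrees as well.
        return "dark blue" if x_label == "bottom" else "light blue"
    return "grey"  # mixed signals


# Toy values for one measure across eight topics.
topic_values = [1, 2, 3, 4, 5, 6, 7, 8]
print(quartile_label(8, topic_values))        # top
print(validity_color("mid", "top", "top"))    # light red
```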

create_all_pc_wordclouds(pc_indices: List[int] | None = None, k: int = 10, max_words: int = 100, figsize: Tuple[int, int] = (10, 5), output_dir: str = './wordclouds', display: bool = False, color_mode: str = 'pole') → Dict[int, Tuple[str, str]][source]

Create word clouds for all specified PCs.

Parameters:

pc_indices (List[int], optional) – List of PC indices. If None, uses top 9 PCs by 'triple' ranking
k (int) – Number of topics to include from each pole
max_words (int) – Maximum words to display in word cloud
figsize (Tuple[int, int]) – Figure size for each subplot
output_dir (str) – Directory to save output files
display (bool) – Whether to display each plot
color_mode (str) – 'pole' or 'validity' coloring mode

Returns:

Dictionary mapping PC index to (high_path, low_path)

Return type:

Dict[int, Tuple[str, str]]

create_top_pcs_wordcloud_grid(pc_indices: List[int] | None = None, ranking_method: str = 'triple', n_pcs: int = 9, k: int = 10, max_words: int = 50, output_file: str | None = None, display: bool = True, color_mode: str = 'pole') → Any[source]

Create a grid visualization of word clouds for top PCs.

Parameters:

pc_indices (List[int], optional) – List of PC indices. If None, uses top PCs by ranking_method
ranking_method (str) – Method to rank PCs if pc_indices not provided: 'triple', 'HU', 'AI', 'X'
n_pcs (int) – Number of top PCs to include if pc_indices not provided
k (int) – Number of topics to include from each pole
max_words (int) – Maximum words per word cloud
output_file (str, optional) – Path to save the grid visualization
display (bool) – Whether to display the plot
color_mode (str) – Coloring mode:
- 'pole' (default): Red for high pole, blue for low pole
- 'validity': Color based on X/HU/AI agreement

Returns:

Figure object

Return type:

Figure

create_comprehensive_pc_analysis(pc_indices: List[int] | None = None, n_pcs: int = 15, k_topics: int = 3, max_words: int = 100, generate_wordclouds: bool = True, generate_3d_umap: bool = True, umap_arrow_k: int = 1, show_data_counts: bool = True, output_dir: str | None = None, display: bool = True) → Dict[str, Any][source]

Create comprehensive PC analysis with word cloud table and 3D UMAP visualization.

This method generates a complete analysis similar to the Colab scripts, including:

- Individual word clouds for each PC's high and low poles
- A comprehensive table showing all PCs with X/HU/AI quartile labels
- 3D UMAP visualization with PC arrows (optional)
- Summary report with data availability statistics

Parameters:

pc_indices (List[int], optional) – List of PC indices (0-based) to analyze. If None, uses first n_pcs PCs. Example: [2, 1, 3, 4, 5] for PC3, PC2, PC4, PC5, PC6
n_pcs (int, default=15) – Number of PCs to analyze if pc_indices not provided
k_topics (int, default=3) – Number of topics to include from each pole in word clouds
max_words (int, default=100) – Maximum words to display in each word cloud
generate_wordclouds (bool, default=True) – Whether to generate word cloud table
generate_3d_umap (bool, default=True) – Whether to generate 3D UMAP visualization with PC arrows
umap_arrow_k (int, default=1) – Number of topics for arrow endpoints in UMAP (1 = single topic endpoints)
show_data_counts (bool, default=True) – Whether to show data availability counts (e.g., “HU: n=3” for sparse data)
output_dir (str, optional) – Directory to save all outputs. If None, creates 'haam_comprehensive_analysis'
display (bool, default=True) – Whether to display visualizations

Returns:

Dictionary containing:

- 'wordcloud_paths': Dict mapping PC index to (high_path, low_path)
- 'table_path': Path to comprehensive PC table image
- 'umap_path': Path to 3D UMAP HTML (if generated)
- 'report_path': Path to text report with statistics
- 'summary': Dict with analysis summary statistics

Return type:

Dict[str, Any]

Examples

# Analyze first 15 PCs with all defaults
results = haam.create_comprehensive_pc_analysis()

# Analyze specific PCs
specific_pcs = [2, 1, 3, 4, 5, 14, 13, 11, 12, 46, 9, 17, 16, 20, 105]
results = haam.create_comprehensive_pc_analysis(pc_indices=specific_pcs)

# Only generate word clouds, skip 3D UMAP
results = haam.create_comprehensive_pc_analysis(
    n_pcs=10,
    generate_3d_umap=False
)
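
The returned dictionary can be walked using the keys listed under Returns. The sketch below does this against a mock result with made-up paths; `summarize_outputs` is a hypothetical helper. Since this page does not say whether 'umap_path' is absent or set to None when the UMAP is skipped, the helper tolerates both.

```python
def summarize_outputs(results):
    """Return one 'key: value' line per documented output in a results dict."""
    lines = []
    for key in ("table_path", "umap_path", "report_path"):
        path = results.get(key)  # covers both a missing key and a None value
        lines.append(f"{key}: {path if path else 'not generated'}")
    wordclouds = results.get("wordcloud_paths") or {}
    for pc_idx in sorted(wordclouds):
        high_path, low_path = wordclouds[pc_idx]
        # PC indices are 0-based, so index 2 is reported as PC3.
        lines.append(f"PC{pc_idx + 1}: high={high_path}, low={low_path}")
    return lines


# Mock result with made-up paths, standing in for a real return value.
mock_results = {
    "wordcloud_paths": {2: ("pc3_high.png", "pc3_low.png")},
    "table_path": "pc_table.png",
    "report_path": "report.txt",
}
for line in summarize_outputs(mock_results):
    print(line)
```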

export_all_results(output_dir: str | None = None) → Dict[str, str][source]

Export all results and create all visualizations.

Parameters:

output_dir (str, optional) – Directory to save all outputs

Returns:

Dictionary of all output file paths

Return type:

Dict[str, str]

haam.haam_init.example_usage()[source]

Example usage of the HAAM package.