haam.haam_topics module

HAAM Topic Analysis Module

Functions for topic modeling and principal component (PC) interpretation.

class haam.haam_topics.TopicAnalyzer(texts: List[str], embeddings: ndarray, pca_features: ndarray, min_cluster_size: int = 10, min_samples: int = 2, umap_n_components: int = 3)

Bases: object

Analyze topics and their relationships with principal components.

Methods

  • create_topic_summary_for_pcs(pc_indices[, ...]) – Create concise topic summaries for specified PCs.

  • get_pc_high_low_topics(pc_idx[, n_high, ...]) – Get high and low topics for a specific PC.

  • get_pc_topic_associations([pc_indices, n_topics]) – Get topic associations for specified PCs.

__init__(texts: List[str], embeddings: ndarray, pca_features: ndarray, min_cluster_size: int = 10, min_samples: int = 2, umap_n_components: int = 3)

Initialize topic analyzer with enhanced parameters.

The analyzer uses the following tuned hyperparameters:

  • UMAP: n_neighbors=5, min_dist=0.0, metric='cosine'

  • HDBSCAN: min_cluster_size=10, min_samples=2 (defaults; both can be overridden through the constructor arguments)

  • c-TF-IDF: BERTopic formula for topic extraction

Parameters:
  • texts (List[str]) – Original text documents

  • embeddings (np.ndarray) – Document embeddings

  • pca_features (np.ndarray) – PCA-transformed features

  • min_cluster_size (int) – Minimum cluster size for HDBSCAN (default: 10, a BERTopic-style setting)

  • min_samples (int) – Minimum samples for core points (default: 2)

  • umap_n_components (int) – Number of UMAP components for clustering (default: 3)
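A minimal construction sketch, assuming the haam package is installed and that texts, embeddings, and pca_features are row-aligned, one row per document. That alignment requirement is an assumption and is not stated in the parameter list above. The helper names check_aligned and build_analyzer are hypothetical; the keyword arguments are the documented ones, set to their documented defaults.

```python
from typing import Any, Sequence


def check_aligned(texts: Sequence[str], embeddings: Any, pca_features: Any) -> int:
    """Return the document count, or raise if the inputs disagree in length.

    Assumption: each text has exactly one embedding row and one PCA row.
    """
    n_docs = len(texts)
    if len(embeddings) != n_docs or len(pca_features) != n_docs:
        raise ValueError(
            f"expected {n_docs} rows, got {len(embeddings)} embedding rows "
            f"and {len(pca_features)} PCA rows"
        )
    return n_docs


def build_analyzer(texts: Sequence[str], embeddings: Any, pca_features: Any) -> Any:
    """Construct a TopicAnalyzer with the documented default hyperparameters."""
    check_aligned(texts, embeddings, pca_features)
    # Imported lazily so the alignment check can be used without haam installed.
    from haam.haam_topics import TopicAnalyzer

    return TopicAnalyzer(
        texts=list(texts),
        embeddings=embeddings,
        pca_features=pca_features,
        min_cluster_size=10,  # HDBSCAN minimum cluster size
        min_samples=2,        # HDBSCAN core-point threshold
        umap_n_components=3,  # UMAP dimensions used for clustering
    )
```

Checking lengths first is only a convenience: it surfaces a mismatch between inputs before any clustering work starts.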

get_pc_topic_associations(pc_indices: List[int] | None = None, n_topics: int = 15) -> Dict[int, List[Dict]]

Get topic associations for specified PCs.

Parameters:
  • pc_indices (List[int], optional) – PC indices to analyze. If None, all PCs are analyzed

  • n_topics (int) – Number of top/bottom topics to return per PC

Returns:

Dictionary mapping PC index to list of topic associations

Return type:

Dict[int, List[Dict]]
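The keys of the per-topic dictionaries are not listed here, so a small inspection helper can be useful before writing code against the result. The sketch below relies only on the documented return shape, Dict[int, List[Dict]]; the function name association_keys and the placeholder field names are hypothetical.

```python
from typing import Any, Dict, List


def association_keys(associations: Dict[int, List[Dict[str, Any]]]) -> List[str]:
    """Collect every key used by the per-topic dictionaries, sorted by name."""
    keys = set()
    for topic_list in associations.values():
        for topic in topic_list:
            keys.update(topic.keys())
    return sorted(keys)


# Placeholder data in the documented shape; "field_a" and "field_b" are made-up keys.
placeholder = {0: [{"field_a": 1, "field_b": 2}], 1: [{"field_a": 3}]}
print(association_keys(placeholder))
```

With a real analyzer, the input would come from a call such as analyzer.get_pc_topic_associations(pc_indices=[0, 1], n_topics=15).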

get_pc_high_low_topics(pc_idx: int, n_high: int = 5, n_low: int = 5, p_threshold: float = 0.05) -> Dict[str, List[Dict]]

Get high and low topics for a specific PC.

Parameters:
  • pc_idx (int) – PC index (0-based)

  • n_high (int) – Number of high topics to return

  • n_low (int) – Number of low topics to return

  • p_threshold (float) – P-value threshold for significance

Returns:

Dictionary with 'high' and 'low' topic lists

Return type:

Dict[str, List[Dict]]
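A usage sketch for looping this call over several PCs, using only the documented signature and the documented 'high' and 'low' keys. The helper count_high_low is hypothetical, and StubAnalyzer is a stand-in that mimics the return shape so the example runs without haam; it does not reproduce the library's behavior.

```python
from typing import Any, Dict, Iterable


def count_high_low(
    analyzer: Any,
    pc_indices: Iterable[int],
    n_high: int = 5,
    n_low: int = 5,
    p_threshold: float = 0.05,
) -> Dict[int, Dict[str, int]]:
    """Report how many high and low topics each PC actually returned."""
    counts = {}
    for pc_idx in pc_indices:
        result = analyzer.get_pc_high_low_topics(
            pc_idx, n_high=n_high, n_low=n_low, p_threshold=p_threshold
        )
        counts[pc_idx] = {"high": len(result["high"]), "low": len(result["low"])}
    return counts


class StubAnalyzer:
    """Stand-in that mimics only the documented return shape (not real haam behavior)."""

    def get_pc_high_low_topics(self, pc_idx, n_high=5, n_low=5, p_threshold=0.05):
        return {"high": [{} for _ in range(n_high)], "low": [{} for _ in range(n_low)]}


print(count_high_low(StubAnalyzer(), [0, 1], n_high=2, n_low=1))
```

Counting the returned entries rather than trusting n_high and n_low is deliberate: a real analyzer may return fewer topics than requested, for example if few topics pass p_threshold, though that behavior is not documented here.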

create_topic_summary_for_pcs(pc_indices: List[int], n_keywords: int = 5, n_topics_per_side: int = 3) -> Dict[int, Dict[str, List[str]]]

Create concise topic summaries for specified PCs.

Parameters:
  • pc_indices (List[int]) – PC indices to summarize

  • n_keywords (int) – Number of keywords to show per topic

  • n_topics_per_side (int) – Number of high/low topics to include

Returns:

PC index -> {'high_topics': [...], 'low_topics': [...]}

Return type:

Dict[int, Dict[str, List[str]]]
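The returned mapping can be turned into a short text report. The sketch below assumes only the documented shape, with 'high_topics' and 'low_topics' keys holding lists of strings; the function name format_pc_summaries and the example strings are hypothetical.

```python
from typing import Dict, List


def format_pc_summaries(summaries: Dict[int, Dict[str, List[str]]]) -> str:
    """Render one 'high' line and one 'low' line per PC, in index order."""
    lines = []
    for pc_idx in sorted(summaries):
        sides = summaries[pc_idx]
        # Fall back to "(none)" when a side has no topics.
        high = "; ".join(sides.get("high_topics", [])) or "(none)"
        low = "; ".join(sides.get("low_topics", [])) or "(none)"
        lines.append(f"PC{pc_idx} high: {high}")
        lines.append(f"PC{pc_idx} low: {low}")
    return "\n".join(lines)


# Made-up summary in the documented shape, for illustration only.
example = {0: {"high_topics": ["alpha, beta"], "low_topics": []}}
print(format_pc_summaries(example))
```

With a real analyzer, the input would come from a call such as analyzer.create_topic_summary_for_pcs([0, 1], n_keywords=5, n_topics_per_side=3).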