haam.haam_topics module

HAAM Topic Analysis Module

Functions for topic modeling and PC interpretation.

class haam.haam_topics.TopicAnalyzer(texts: List[str], embeddings: ndarray, pca_features: ndarray, min_cluster_size: int = 10, min_samples: int = 2, umap_n_components: int = 3)

Bases: object

Analyze topics and their relationships with principal components.

Methods

`create_topic_summary_for_pcs`(pc_indices[, ...])	Create concise topic summaries for specified PCs.
`get_pc_high_low_topics`(pc_idx[, n_high, ...])	Get high and low topics for a specific PC.
`get_pc_topic_associations`([pc_indices, n_topics])	Get topic associations for specified PCs.

__init__(texts: List[str], embeddings: ndarray, pca_features: ndarray, min_cluster_size: int = 10, min_samples: int = 2, umap_n_components: int = 3)

Initialize topic analyzer with enhanced parameters.

Now uses optimized hyperparameters:

- UMAP: n_neighbors=5, min_dist=0.0, metric='cosine'
- HDBSCAN: min_cluster_size=10, min_samples=2
- c-TF-IDF: BERTopic formula for better topic extraction

Parameters:

texts (List[str]) – Original text documents
embeddings (np.ndarray) – Document embeddings
pca_features (np.ndarray) – PCA-transformed features
min_cluster_size (int) – Minimum cluster size for HDBSCAN (default: 10, a BERTopic-style setting)
min_samples (int) – Minimum samples for core points (default: 2)
umap_n_components (int) – Number of UMAP components for clustering (default: 3)
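
The c-TF-IDF weighting mentioned in the hyperparameter notes above can be illustrated with a minimal pure-Python sketch. It follows the formula given in the BERTopic documentation, W(t, c) = tf(t, c) * log(1 + A / f(t)), where tf(t, c) is the frequency of term t in topic c, f(t) is its frequency across all topics, and A is the average number of words per topic. The function name `c_tf_idf`, the whitespace tokenizer, and the use of raw (unnormalized) counts are assumptions made for illustration; this is not taken from the HAAM source, which may differ in detail.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(docs_by_topic: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    """Score terms per topic with a class-based TF-IDF (illustrative sketch)."""
    # Treat all documents of a topic as one long document and count its terms.
    # Assumption: naive lowercase whitespace tokenization.
    tf = {
        topic: Counter(word for doc in docs for word in doc.lower().split())
        for topic, docs in docs_by_topic.items()
    }

    # f(t): frequency of each term across all topics.
    total: Counter = Counter()
    for counts in tf.values():
        total.update(counts)

    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(tf)

    # W(t, c) = tf(t, c) * log(1 + A / f(t))
    return {
        topic: {
            term: count * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
        for topic, counts in tf.items()
    }


# Hypothetical toy input: two topics with one short document each.
scores = c_tf_idf({0: ["cat cat dog"], 1: ["dog fish"]})
top_terms = {t: max(s, key=s.get) for t, s in scores.items()}
print(top_terms)
```

Terms that are frequent inside one topic but rare elsewhere receive the highest weight, which is why such scores are commonly used to pick topic keywords.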

get_pc_topic_associations(pc_indices: List[int] | None = None, n_topics: int = 15) → Dict[int, List[Dict]]

Get topic associations for specified PCs.

Parameters:

pc_indices (List[int], optional) – PC indices to analyze. If None, all PCs are analyzed
n_topics (int) – Number of top/bottom topics to return per PC

Returns:

Dictionary mapping PC index to list of topic associations

Return type:

Dict[int, List[Dict]]

get_pc_high_low_topics(pc_idx: int, n_high: int = 5, n_low: int = 5, p_threshold: float = 0.05) → Dict[str, List[Dict]]

Get high and low topics for a specific PC.

Parameters:

pc_idx (int) – PC index (0-based)
n_high (int) – Number of high topics to return
n_low (int) – Number of low topics to return
p_threshold (float) – P-value threshold for significance

Returns:

Dictionary with 'high' and 'low' topic lists

Return type:

Dict[str, List[Dict]]
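
How n_high, n_low, and p_threshold might interact can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the helper name `split_high_low`, the per-topic keys `topic_id`, `effect`, and `p_value`, and the strict `<` comparison against the threshold are all hypothetical. Only the 'high' and 'low' keys of the result come from the documentation above.

```python
from typing import Dict, List


def split_high_low(
    associations: List[Dict],
    n_high: int = 5,
    n_low: int = 5,
    p_threshold: float = 0.05,
) -> Dict[str, List[Dict]]:
    """Keep significant topics and split them into high and low sides."""
    # Drop topics whose association is not significant (hypothetical 'p_value' key).
    significant = [a for a in associations if a["p_value"] < p_threshold]

    # Rank by signed effect, largest first (hypothetical 'effect' key).
    ranked = sorted(significant, key=lambda a: a["effect"], reverse=True)

    # Positive effects form the high side, negative effects the low side,
    # with the most extreme topics listed first on each side.
    high = [a for a in ranked if a["effect"] > 0][:n_high]
    low = [a for a in reversed(ranked) if a["effect"] < 0][:n_low]
    return {"high": high, "low": low}


# Hypothetical associations for a single PC.
example = [
    {"topic_id": 0, "effect": 0.8, "p_value": 0.01},
    {"topic_id": 1, "effect": -0.5, "p_value": 0.03},
    {"topic_id": 2, "effect": 0.9, "p_value": 0.20},  # not significant
    {"topic_id": 3, "effect": 0.3, "p_value": 0.04},
    {"topic_id": 4, "effect": -0.9, "p_value": 0.001},
]
result = split_high_low(example, n_high=1, n_low=2)
print([a["topic_id"] for a in result["high"]], [a["topic_id"] for a in result["low"]])
```

Filtering before ranking means a topic with a large but non-significant effect (topic 2 in the toy data) never appears on either side.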

create_topic_summary_for_pcs(pc_indices: List[int], n_keywords: int = 5, n_topics_per_side: int = 3) → Dict[int, Dict[str, List[str]]]

Create concise topic summaries for specified PCs.

Parameters:

pc_indices (List[int]) – PC indices to summarize
n_keywords (int) – Number of keywords to show per topic
n_topics_per_side (int) – Number of high/low topics to include

Returns:

PC index -> {'high_topics': […], 'low_topics': […]}

Return type:

Dict[int, Dict[str, List[str]]]
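
The nested return value is easier to read once flattened into text. The helper below is a small sketch that consumes a dictionary of the documented shape; the name `format_pc_summaries`, the output layout, and the assumption that each list entry is a ready-to-print keyword string are illustrative and not part of the module.

```python
from typing import Dict, List


def format_pc_summaries(summaries: Dict[int, Dict[str, List[str]]]) -> str:
    """Render {pc_idx: {'high_topics': [...], 'low_topics': [...]}} as text."""
    lines = []
    for pc_idx in sorted(summaries):
        sides = summaries[pc_idx]
        # Join the topic strings of each side; fall back to a marker when empty.
        high = "; ".join(sides.get("high_topics", [])) or "(none)"
        low = "; ".join(sides.get("low_topics", [])) or "(none)"
        lines.append(f"PC {pc_idx}")
        lines.append(f"  high: {high}")
        lines.append(f"  low: {low}")
    return "\n".join(lines)


# Hypothetical summary data in the documented shape.
demo = {
    1: {"high_topics": ["price, cost"], "low_topics": ["family, home"]},
    0: {"high_topics": [], "low_topics": ["music, band", "film, actor"]},
}
report = format_pc_summaries(demo)
print(report)
```

With a fitted analyzer, the same helper could be applied to the result of `create_topic_summary_for_pcs([0, 1])`.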