haam.haam_topics module
HAAM Topic Analysis Module
Functions for topic modeling and PC interpretation.
- class haam.haam_topics.TopicAnalyzer(texts: List[str], embeddings: ndarray, pca_features: ndarray, min_cluster_size: int = 10, min_samples: int = 2, umap_n_components: int = 3)
Bases: object
Analyze topics and their relationships with principal components.
Methods
create_topic_summary_for_pcs(pc_indices[, ...]) – Create concise topic summaries for specified PCs.
get_pc_high_low_topics(pc_idx[, n_high, ...]) – Get high and low topics for a specific PC.
get_pc_topic_associations([pc_indices, n_topics]) – Get topic associations for specified PCs.
- __init__(texts: List[str], embeddings: ndarray, pca_features: ndarray, min_cluster_size: int = 10, min_samples: int = 2, umap_n_components: int = 3)
Initialize topic analyzer with enhanced parameters.
Now uses optimized hyperparameters:
- UMAP: n_neighbors=5, min_dist=0.0, metric='cosine'
- HDBSCAN: min_cluster_size=10, min_samples=2
- c-TF-IDF: BERTopic formula for better topic extraction
- Parameters:
texts (List[str]) – Original text documents
embeddings (np.ndarray) – Document embeddings
pca_features (np.ndarray) – PCA-transformed features
min_cluster_size (int) – Minimum cluster size for HDBSCAN (default: 10, matching BERTopic-style settings)
min_samples (int) – Minimum number of samples in a neighborhood for a point to count as an HDBSCAN core point (default: 2)
umap_n_components (int) – Number of UMAP components for clustering (default: 3)
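The c-TF-IDF weighting named in the hyperparameter notes can be illustrated with a minimal, standard-library sketch. It follows the class-based formula published with BERTopic, W(t, c) = tf(t, c) * log(1 + A / f_t), where tf is the term frequency normalized within the topic, A is the average number of tokens per topic, and f_t is the term's frequency across all topics. The function name `c_tf_idf`, the whitespace tokenizer, and the sample documents are assumptions made for illustration; the actual computation inside `TopicAnalyzer` may differ.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(docs_per_topic: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    """Hypothetical helper: score terms per topic with class-based TF-IDF."""
    # Pool the tokens of every topic into one "class document".
    term_counts = {
        topic: Counter(word for doc in docs for word in doc.lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # A: average number of tokens per topic.
    avg_tokens = sum(sum(c.values()) for c in term_counts.values()) / len(term_counts)
    # f_t: frequency of each term summed over all topics.
    total_freq: Counter = Counter()
    for counts in term_counts.values():
        total_freq.update(counts)
    # W(t, c) = (tf / tokens in topic) * log(1 + A / f_t)
    return {
        topic: {
            term: (tf / sum(counts.values())) * math.log(1 + avg_tokens / total_freq[term])
            for term, tf in counts.items()
        }
        for topic, counts in term_counts.items()
    }


# Illustrative input: two tiny topics that share the word "sleep".
scores = c_tf_idf({0: ["cats purr", "cats sleep"], 1: ["dogs bark", "dogs sleep"]})
top_term_topic_0 = max(scores[0], key=scores[0].get)
```

In this toy input the shared word "sleep" receives a lower weight than the topic-specific words, which is the behavior that makes the weighting useful for labeling clusters.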
- get_pc_topic_associations(pc_indices: List[int] | None = None, n_topics: int = 15) → Dict[int, List[Dict]]
Get topic associations for specified PCs.
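The page does not say how an association is computed, so the following is only a sketch of one plausible approach: average each topic's documents on a PC and keep the `n_topics` topics with the largest absolute mean. The function name `pc_topic_associations`, the dictionary keys `topic_id`, `size`, and `mean_score`, and the choice to treat `pc_indices=None` as "all PCs" are assumptions, not the library's documented behavior. Skipping label -1 reflects HDBSCAN's convention for noise points.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional, Sequence


def pc_topic_associations(
    topic_labels: Sequence[int],
    pca_features: Sequence[Sequence[float]],
    pc_indices: Optional[List[int]] = None,
    n_topics: int = 15,
) -> Dict[int, List[Dict]]:
    """Hypothetical sketch: rank topics by their mean score on each PC."""
    n_pcs = len(pca_features[0])
    # Assumption: None means "use every available PC".
    pcs = list(range(n_pcs)) if pc_indices is None else pc_indices

    # Group document rows by topic, skipping -1 (HDBSCAN's noise label).
    rows_by_topic = defaultdict(list)
    for row, topic in enumerate(topic_labels):
        if topic != -1:
            rows_by_topic[topic].append(row)

    result: Dict[int, List[Dict]] = {}
    for pc in pcs:
        entries = [
            {
                "topic_id": topic,
                "size": len(rows),
                "mean_score": mean(pca_features[r][pc] for r in rows),
            }
            for topic, rows in rows_by_topic.items()
        ]
        # Strongest associations first, regardless of sign.
        entries.sort(key=lambda e: abs(e["mean_score"]), reverse=True)
        result[pc] = entries[:n_topics]
    return result


# Illustrative data: five documents, two PCs, the last document is noise.
labels = [0, 0, 1, 1, -1]
features = [[2.0, 0.1], [4.0, -0.1], [-1.0, 0.5], [-3.0, 0.7], [9.0, 9.0]]
assoc = pc_topic_associations(labels, features, pc_indices=[0], n_topics=2)
```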
- get_pc_high_low_topics(pc_idx: int, n_high: int = 5, n_low: int = 5, p_threshold: float = 0.05) → Dict[str, List[Dict]]
Get high and low topics for a specific PC.
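How `n_high`, `n_low`, and `p_threshold` might interact can be sketched as a filter-and-split step, assuming each topic record already carries a signed effect and a p-value. The helper name `split_high_low` and the keys `effect`, `p_value`, `high`, and `low` are hypothetical; the real method's record layout and significance test are not described on this page.

```python
from typing import Dict, List


def split_high_low(
    associations: List[Dict],
    n_high: int = 5,
    n_low: int = 5,
    p_threshold: float = 0.05,
) -> Dict[str, List[Dict]]:
    """Hypothetical sketch: keep significant topics, then split by sign."""
    # Drop topics whose association is not significant at p_threshold.
    significant = [a for a in associations if a["p_value"] < p_threshold]
    # "High" topics: positive effect, largest first.
    high = sorted(
        (a for a in significant if a["effect"] > 0),
        key=lambda a: a["effect"],
        reverse=True,
    )
    # "Low" topics: negative effect, most negative first.
    low = sorted(
        (a for a in significant if a["effect"] < 0),
        key=lambda a: a["effect"],
    )
    return {"high": high[:n_high], "low": low[:n_low]}


# Illustrative records; topic 2 has the largest effect but is not significant.
records = [
    {"topic_id": 0, "effect": 1.2, "p_value": 0.01},
    {"topic_id": 1, "effect": -0.8, "p_value": 0.03},
    {"topic_id": 2, "effect": 2.0, "p_value": 0.40},
    {"topic_id": 3, "effect": 0.3, "p_value": 0.04},
]
high_low = split_high_low(records, n_high=1, n_low=1)
```

Filtering on significance before ranking keeps a large but noisy effect, such as topic 2 in the sample records, from displacing a smaller but reliable one.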