Segment common items in a text dataset to pinpoint core themes and their distribution.
- Clusters cover the main topics/subtopics in the dataset
- Clusters are backed by accurate, LLM-generated summaries
We employ HDBSCAN for probabilistic clustering. The algorithm has several advantages:
- Don’t be wrong: Clusters can have varying densities, don’t need to be globular, and won’t include noise
- Intuitive parameters: Choosing a minimum cluster size is very reasonable, and the number of clusters k does not need to be specified (HDBSCAN infers k from the data)
- Stability: HDBSCAN is stable over runs and subsampling and has good stability over parameter choices
- Performance: When implemented well, HDBSCAN can be very efficient; the current implementation has similar performance to fastcluster’s agglomerative clustering
See the HDBSCAN docs on comparing clustering algorithms and how hdbscan works for more information.
- Datasets
- Embedding models
1. Visualizing core themes in fka/awesome-chatgpt-prompts
These figures correspond to experiments/02_09_2023_16_54_32
Figure 1. HDBSCAN splits the 153 text-to-text prompts from fka/awesome-chatgpt-prompts into two clusters: Cluster 1 with 44 prompts (orange) and Cluster 2 with 105 prompts (blue). The 4 remaining prompts (gray) were filtered out as outliers/noise.
Figure 2. The most persistent prompts in each leaf cluster are known as "exemplars". These represent the hearts around which the ultimate cluster formed. See the HDBSCAN docs on soft clustering explanation for supporting information and functions.
Figure 3. Additional clustering is conducted around the exemplars to identify sub-topics in the dataset. The cases in each sub-cluster subsequently serve as retrieved context for the LLM theme summarization calls below.
Figure 4. Visualizing the "Computer Programming and Software Development" theme, which covers 13% of the dataset. The summary was generated by gpt-3.5-turbo-16k. The above was created with jsoncrack.com/editor.
2. Drift detection for gustavosta/stable-diffusion-prompts
These figures correspond to experiments/04_09_2023_03_02_25
HDBSCAN splits the 73,718 text-to-image prompts from gustavosta/stable-diffusion-prompts into 78 clusters, which together contain 25,019 prompts (34% of the dataset). The remaining 48,699 prompts (66%) were filtered out as outliers/noise. The 5 largest clusters cover 9.5% of the dataset; these are the segments we will examine for drift below.
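The coverage figures in the paragraph above follow directly from the cluster labels. A minimal sketch of that arithmetic, assuming labels arrive as a flat list in which `-1` marks noise (the convention HDBSCAN implementations use); `summarize_clusters` is a hypothetical helper written for illustration, not part of this project:

```python
from collections import Counter


def summarize_clusters(labels, top_k=5, noise_label=-1):
    """Summarize cluster coverage from a flat list of labels (hypothetical helper)."""
    total = len(labels)
    if total == 0:
        raise ValueError("labels must not be empty")
    # Count members per cluster, leaving noise points out.
    counts = Counter(label for label in labels if label != noise_label)
    n_clustered = sum(counts.values())
    top = counts.most_common(top_k)
    return {
        "n_clusters": len(counts),
        "clustered_pct": 100 * n_clustered / total,
        "noise_pct": 100 * (total - n_clustered) / total,
        "top_k": top,
        "top_k_pct": 100 * sum(count for _, count in top) / total,
    }


# Toy labels: three clusters (0, 1, 2) and four noise points.
labels = [0, 0, 0, 1, 1, 2, -1, -1, -1, -1]
summary = summarize_clusters(labels, top_k=2)
print(summary)

# The same arithmetic applied to the counts reported for this dataset.
print(round(100 * 25_019 / 73_718, 1))  # share of prompts assigned to a cluster
print(round(100 * 48_699 / 73_718, 1))  # share of prompts filtered out as noise
```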
cluster id | theme |
---|---|
56 | Portraits and artistic depictions of female anime characters, beautiful women, and fashionable young women |
13 | Symmetrical portraits of people, characters, and sci-fi figures |
61 | Futuristic sci-fi spaceship concept art |
50 | Portraits of famous actresses as characters in various roles, outfits, and styles |
74 | Surreal, cinematic, and futuristic digital art |
cluster id | train count (73.7k rows) | test count (8.19k rows) | drift detection (% change) |
---|---|---|---|
56 | 2530 (3.43%) | 310 (3.79%) | 10.50 |
13 | 1343 (1.82%) | 149 (1.82%) | 0.00 |
61 | 1287 (1.75%) | 131 (1.60%) | -8.57 |
50 | 1055 (1.43%) | 135 (1.65%) | 15.38 |
74 | 749 (1.02%) | 109 (1.33%) | 30.39 |
Tables 1 & 2. Drift detection for the 5 largest clusters (bottom), alongside their claude-2 summaries (top).
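The drift column can be read as the relative change in a cluster's share of the data between the train and test splits. The sketch below reproduces the published values from the rounded percentage shares in the table; that the values were computed from the rounded shares rather than from raw counts is an inference from the numbers, not something stated here. The helper name `percent_change` is hypothetical.

```python
def percent_change(train_pct, test_pct):
    """Relative change of a cluster's share from train to test, in percent."""
    if train_pct == 0:
        raise ValueError("train share must be non-zero")
    return round(100 * (test_pct - train_pct) / train_pct, 2)


# (train share %, test share %) per cluster id, copied from the drift table.
shares = {
    56: (3.43, 3.79),
    13: (1.82, 1.82),
    61: (1.75, 1.60),
    50: (1.43, 1.65),
    74: (1.02, 1.33),
}

drift = {
    cluster_id: percent_change(train_pct, test_pct)
    for cluster_id, (train_pct, test_pct) in shares.items()
}
for cluster_id, change in drift.items():
    print(f"cluster {cluster_id}: {change:+.2f}%")
```

A positive value means the theme is more common in the test split than in the train split; a negative value means it is less common.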
Prompt: "Beautiful painting of an Aspen forest at sunset, digital art, award winning illustration, golden hour, smooth, sharp lines, concept art, trending on artstation"
Model: Runway Gen-2 (accessed by Daniel Furman on Sep 4, 2023)
Theme: Beautiful landscape paintings and matte art (cluster id: 75)
Prompt: "Futuristic batman, brush strokes, oil painting, greg rutkowski"
Model: Midjourney V5.2 (accessed by Daniel Furman on Sep 4, 2023)
Theme: Art and portraits of Batman characters (cluster id: 41)
Prompt: "Futuristic Porsche designed by Apple, a detailed matte painting by Kitagawa Utamaro, cgsociety, octane render, highly detailed, matte painting, concept art, sci-fi"
Model: Midjourney V5.2 (accessed by Daniel Furman on Sep 4, 2023)
Theme: Futuristic and fantasy vehicle concept art (cluster id: 52)
Figure 5. A sample of 3 text-to-image generations from various models for prompts in the gustavosta/stable-diffusion-prompts dataset (alongside their cluster ids).