--- title: "faSTM: the stm vignette, run on faSTM" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{faSTM: the stm vignette, run on faSTM} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.2, dpi = 96, message = FALSE, warning = FALSE) set.seed(2138) ``` This vignette walks through the **same analysis as the `stm` package vignette** (Roberts, Stewart & Tingley), using the identical CMU 2008 political-blog corpus, but every model is **fit live** here, because faSTM fits in seconds where `stm` takes minutes. (The `stm` vignette loads pre-computed objects to avoid the wait; faSTM does not need to.) The code mirrors the `stm` vignette's calls. Because each fit is fresh, the *topic numbers* differ from the original. The workflow, not the specific topics, is what carries over. > **A note on the plots.** faSTM's `plot()` methods are restyled (ggplot), > *re-defaulted* versions of stm's, not pixel-for-pixel copies. Two differences > worth knowing. `plot(type = "summary")` ranks words by **FREX** (stm defaults to > highest-probability, so pass `labeltype = "prob"` for stm-style labels). And > `plotModels()` draws stm's full **per-topic cloud** (faint, one point per topic > coloured by model) *and* overlays **bold model-mean points** with the > non-dominated models highlighted on a "quality frontier", so you see both the > spread and the summary at once. ```{r} library(faSTM) ``` ## Ingesting data faSTM reads prepared text from `quanteda`/`tidytext` rather than tokenizing itself. A typical preparation: ```{r prep, eval=FALSE} library(quanteda) dfmat <- corpus(my_data, text_field = "documents") |> tokens(remove_punct = TRUE) |> tokens_remove(stopwords("en")) |> dfm() |> dfm_trim(min_termfreq = 5) corpus <- as_corpus(dfmat) # quanteda docvars become the metadata ``` For this vignette we use the bundled poliblog corpus (the `stm` vignette's `poliblog5k`), already prepared: ```{r data} data(poliblog) poliblog out <- list(documents = poliblog$documents, vocab = poliblog$vocab, meta = poliblog$meta) ``` ## Estimating the structural topic model The headline call mirrors the `stm` vignette exactly. Topic prevalence varies with `rating` and a smooth function of `day`: ```{r fit} poliblogPrevFit <- stm(out$documents, out$vocab, K = 20, prevalence = ~ rating + s(day), data = out$meta, init.type = "Spectral", seed = 2138) ``` That fit took seconds, not minutes. ## Model selection and search `selectModel()` fits several models from different initializations and keeps the ones on the semantic-coherence / exclusivity frontier; `plotModels()` shows them. (Reduced to a few candidates here to keep the vignette quick.) ```{r select, fig.height=4} poliblogSelect <- selectModel(out$documents, out$vocab, K = 20, N = 5, prevalence = ~ rating + s(day), data = out$meta, seed = 2138) plotModels(poliblogSelect) ``` `searchK()` sweeps the number of topics, reporting held-out likelihood, semantic coherence, and exclusivity. It also parallelizes across K: ```{r searchk, fig.height=4.2} storage <- searchK(out$documents, out$vocab, K = c(10, 20), prevalence = ~ rating + s(day), data = out$meta, cores = 2) plot(storage) ``` ## Interpreting topics Top words by probability, FREX, lift and score: ```{r labels} labelTopics(poliblogPrevFit, c(3, 7, 20)) ``` Representative documents per topic, displayed as wrapped quotes: ```{r thoughts, fig.height=3.4} # bundled poliblog text is short (~50-char) snippets, so a few fill the panel thoughts3 <- findThoughts(poliblogPrevFit, texts = out$meta$text, n = 4, topics = 3)$docs[[1]] plotQuote(substr(thoughts3, 1, 200), width = 60, main = "Topic 3") ``` Topics ranked by their expected prevalence in the corpus: ```{r summary} plot(poliblogPrevFit, type = "summary") ``` ## Covariate effects on topic prevalence `estimateEffect()` regresses topic proportions on the covariates, propagating topic-estimation uncertainty (the method of composition): ```{r effect} out$meta$rating <- as.factor(out$meta$rating) prep <- estimateEffect(1:20 ~ rating + s(day), poliblogPrevFit, meta = out$meta, uncertainty = "Global") summary(prep, topics = 1)$tables[[1]] ``` Difference in topic prevalence between Liberal and Conservative blogs: ```{r diff, fig.height=3.6} plot(prep, covariate = "rating", topics = c(3, 7, 20), model = poliblogPrevFit, method = "difference", cov.value1 = "Liberal", cov.value2 = "Conservative", xlab = "More Conservative ... More Liberal") ``` A topic's prevalence over time (smooth term in `day`): ```{r cont, fig.height=3.6} plot(prep, "day", method = "continuous", topics = 7, model = poliblogPrevFit) ``` ## Topical content Letting word *use within topics* vary by `rating` (a SAGE content covariate), then comparing the two sides' vocabulary for a topic: ```{r content, fig.height=5} poliblogContent <- stm(out$documents, out$vocab, K = 20, prevalence = ~ rating + s(day), content = ~ rating, data = out$meta, init.type = "Spectral", seed = 2138) plot(poliblogContent, type = "perspectives", topics = 1) ``` Comparing the vocabulary of two topics: ```{r persp2, fig.height=5} plot(poliblogPrevFit, type = "perspectives", topics = c(12, 20)) ``` ## Interactions Prevalence can interact covariates (here `rating` with time), and the effect plot can condition on a moderator value: ```{r interaction, fig.height=3.8} poliblogInteraction <- stm(out$documents, out$vocab, K = 20, prevalence = ~ rating * day, data = out$meta, init.type = "Spectral", seed = 2138) prepInt <- estimateEffect(c(16) ~ rating * day, poliblogInteraction, metadata = out$meta, uncertainty = "None") plot(prepInt, covariate = "day", model = poliblogInteraction, method = "continuous", xlab = "Days", moderator = "rating", moderator.value = "Liberal", topics = 16) ``` ## More visualization A word cloud for a topic, the topic-correlation network, and the convergence trajectory: ```{r viz, fig.height=4.5} cloud(poliblogPrevFit, topic = 7) plot(topicCorr(poliblogPrevFit)) plot(poliblogPrevFit$convergence$bound, type = "l", ylab = "Approximate Objective", main = "Convergence") ``` ## Out-of-sample documents New documents get topic proportions by holding the fitted topics fixed: ```{r newdocs} theta_new <- fit_new_documents(poliblogPrevFit, poliblog) dim(theta_new) ``` --- Everything above is the `stm` vignette's workflow, run on faSTM: the same function names and arguments, the same corpus, and faSTM's restyled, re-defaulted plots (see the note up top). It fits in seconds, with an `estimateEffect` that propagates topic uncertainty. Existing `stm` scripts port with little more than the changes shown here.