---
title: "Beyond stm: faSTM's extensions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Beyond stm: faSTM's extensions}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>",
                      fig.width = 7, fig.height = 4.2, dpi = 96,
                      message = FALSE, warning = FALSE)
set.seed(2138)
```

The [companion vignette](faSTM.html) shows that faSTM runs the *same* analysis as the `stm` package. This one covers what faSTM adds **on top of** that framework: multiple content covariates, an `estimateEffect()` with weights and clustering, alternative coherence metrics, and a tidyverse-friendly surface. None of these require leaving the `stm`-compatible object; they are extra tools, not a different model.

We use a second bundled corpus, `congress`: a balanced sample of **1,679 U.S. House and Senate floor speeches**, Congresses 100–111 (1987–2011), with metadata `party` (Democrat/Republican), `chamber` (House/Senate), and `congress` (the time index). It is built for exactly this vignette: two categorical covariates that cross, plus a time axis.

```{r load}
library(faSTM)
data(congress)
congress
table(congress$meta$party, congress$meta$chamber)
```

We fit one prevalence model (topic prevalence as a function of **party** and a smooth of **time**) and reuse it throughout.

```{r basefit}
fit <- stm(congress, K = 12,
           prevalence = ~ party + s(congress),
           verbose = FALSE)
```

## Multiple content covariates

`stm` allows a single content (SAGE) covariate. faSTM accepts several and crosses them into a **saturated** content model: one topic-word distribution per combination of levels.
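
The crossing itself can be pictured with base R's `interaction()`. The sketch below is only an illustration under stated assumptions: the toy `meta` data frame is made up for this example, and faSTM's internal construction of content groups may differ.

```r
# Illustration only, base R: cross two factors into one saturated grouping.
# `meta` is a hypothetical toy data frame, not the bundled `congress` metadata.
meta <- data.frame(
  party   = factor(c("Democrat", "Republican", "Democrat", "Republican")),
  chamber = factor(c("House", "House", "Senate", "Senate"))
)

# One level per observed combination of party and chamber
group <- interaction(meta$party, meta$chamber, sep = ":")

nlevels(group)   # number of crossed groups (2 x 2)
table(group)     # documents per crossed group
```

A saturated model gives every crossed level its own topic-word distribution, so the number of content groups grows multiplicatively with each added covariate.
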
Here `party` (2) × `chamber` (2) gives a 4-group content model, so we can read how *Democrats vs Republicans* and *House vs Senate* word each topic differently:

```{r content}
fitC <- stm(congress, K = 12,
            prevalence = ~ party + s(congress),
            content = ~ party + chamber,   # crossed -> 4 SAGE groups
            verbose = FALSE)
fitC$settings$dim$A   # number of content groups
```

The fit reports the crossing as it runs, and `stm`'s SAGE tooling works unchanged on the result:

```{r sage, eval=FALSE}
sageLabels(fitC, n = 5)        # top words per topic, per party×chamber group
labelTopics(fitC, topics = 3)
```

The crossed levels are stored on the fit (`settings$covariates$contentvars`, `contenttable`) so per-covariate marginals can be recovered later.

### Framing, not agenda

Content covariates are where the interesting partisan signal lives. Below, `estimateEffect()` will show that Democrats and Republicans devote *similar prevalence* to the budget/tax topic, but they **word it very differently**. A `~ party` content model makes that concrete:

```{r framing}
fitParty <- stm(congress, K = 12,
                prevalence = ~ party + s(congress),
                content = ~ party,
                verbose = FALSE)
sl <- sageLabels(fitParty, n = 10)
tax <- which(apply(sl$marginal, 1, function(w) any(grepl("^tax", w))))[1]
sl$marginal[tax, ]             # shared topic vocabulary
sl$bygroup$Democrat[tax, ]     # how Democrats word it
sl$bygroup$Republican[tax, ]   # how Republicans word it
```

The shared vocabulary is fiscal (tax, budget, spending), but Democrats reach for *health, care, children, preexisting, denied* while Republicans reach for *taxing, taxed, CBO, authority*: same topic, opposite framing. This is the distinction prevalence covariates can't capture and content (SAGE) covariates can.

## `estimateEffect()`: cluster-robust SEs and weights

faSTM keeps `stm`'s method-of-composition uncertainty (re-drawing topic proportions per simulation) and adds design features grouped data usually needs.
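
The method of composition mentioned above can be sketched in base R on simulated data. This is a rough illustration, not faSTM's or `stm`'s implementation: `theta_mean`, `theta_sd`, and the normal perturbation are assumptions standing in for a fitted model's posterior over one topic's share.

```r
# Hedged sketch of method-of-composition pooling (base R, simulated data).
# All names and numbers here are hypothetical; a real fit would supply the
# posterior over topic proportions instead of the normal stand-in below.
set.seed(1)
n <- 200
party <- factor(sample(c("Democrat", "Republican"), n, replace = TRUE))
theta_mean <- 0.20 + 0.05 * (party == "Republican")  # simulated party gap of 0.05
theta_sd <- 0.03                                     # assumed per-document posterior sd

nsims <- 25
draws <- matrix(NA_real_, nrow = nsims, ncol = 2,
                dimnames = list(NULL, c("(Intercept)", "partyRepublican")))
for (s in seq_len(nsims)) {
  theta_s <- rnorm(n, theta_mean, theta_sd)  # 1. re-draw topic proportions
  fit_s <- lm(theta_s ~ party)               # 2. regress the draw on covariates
  V <- vcov(fit_s)                           # classical coefficient covariance
  # 3. draw one coefficient vector from N(coef, V) via the Cholesky factor
  draws[s, ] <- coef(fit_s) + drop(t(chol(V)) %*% rnorm(2))
}

est <- colMeans(draws)       # pooled point estimates
se  <- apply(draws, 2, sd)   # spread reflects both sources of uncertainty
```

Pooling over the simulated draws lets the reported uncertainty include both the regression's sampling error and the model's uncertainty about each document's topic proportions. Swapping `vcov(fit_s)` for a different covariance estimator changes only step 3.
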
**Cluster-robust standard errors.** Speeches within a Congress are not independent, so we cluster by `congress`. faSTM swaps the classical vcov for a sandwich estimator with `stm`'s finite-sample correction:

```{r cluster}
eff <- estimateEffect(1:12 ~ party + s(congress), fit,
                      metadata = congress$meta,
                      cluster = congress$meta$congress,
                      nsims = 25)
summary(eff, topics = 3)
```

`summary()` also takes `p.adjust.method` (e.g. `"BH"`) to correct across topics, and reports per-equation R² and F diagnostics.

**Survey weights** enter the same way via `weights =` (weighted least squares per draw).

## Random effects in prevalence

Prevalence formulas may include `lme4`-style random-effect terms; faSTM fits a mixed model per posterior draw and pools with Rubin's rules. This is a natural choice here if you treat Congresses as exchangeable groups:

```{r ranef, eval=FALSE}
eff_re <- estimateEffect(1:12 ~ party + (1 | congress), fit,
                         metadata = congress$meta,
                         nsims = 25)
summary(eff_re, topics = 3)   # fixed effects + pooled variance components
```

## Average marginal effects

Coefficients on splines and factors are hard to read. `ame()` reports the **average marginal effect**, the mean change in topic proportion for a Republican-vs-Democrat shift, averaged over the sample:

```{r ame}
ame(eff, covariate = "party", topics = c(1, 3, 7))
```

## Topic prevalence over time

Because prevalence includes `s(congress)`, we can trace a topic's share across the 1987–2011 window:

```{r overtime, fig.height=3.6}
plot(eff, "congress", method = "continuous", topics = 3, model = fit,
     xlab = "Congress (100 = 1987 ... 111 = 2009)")
```

## Coherence: NPMI and C_V

Alongside Mimno's semantic coherence, faSTM offers two more coherence metrics common in the topic-model literature, **NPMI** and **C_V**:

```{r coh}
data.frame(
  topic = 1:5,
  mimno = round(coherence(fit, measure = "mimno", M = 10)[1:5], 2),
  npmi  = round(coherence(fit, measure = "npmi",  M = 10)[1:5], 3),
  c_v   = round(coherence(fit, measure = "c_v",   M = 10)[1:5], 3)
)
```

`search_k()` can select K by any of these, and parallelizes across K:

```{r searchk, eval=FALSE}
search_k(congress, K = c(8, 12, 16),
         prevalence = ~ party + s(congress),
         measure = "npmi", cores = 3)
```

## A tidyverse-friendly surface

faSTM ships `broom` generics, so model output flows straight into `dplyr`/`ggplot2`.

```{r broom}
# tidy the topic-word matrix (also matrix = "frex" or "gamma")
head(tidy(fit, matrix = "beta"), 4)

# one-row model summary
glance(fit)

# tidy an estimated effect into a coefficient table
head(tidy(eff), 4)
```

`predict()` infers topic proportions for new documents against the fitted model:

```{r predict, eval=FALSE}
predict(fit, newdata = held_out_corpus)   # a faSTM_corpus / dfm of new docs
```

## Summary

Everything here returns ordinary data frames or `stm`-compatible objects, so it composes with the rest of your analysis. The companion vignette covers the `stm`-equivalent workflow; this one is the extra mile faSTM adds on top.