---
title: "Beyond stm: faSTM's extensions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Beyond stm: faSTM's extensions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.2,
                      dpi = 96, message = FALSE, warning = FALSE)
set.seed(2138)
```

The [companion vignette](faSTM.html) shows that faSTM runs the *same* analysis as
the `stm` package. This one covers what faSTM adds **on top of** that framework:
multiple content covariates, an `estimateEffect()` with clustering, weights, and
random effects, average marginal effects, alternative coherence metrics, and a
tidyverse-friendly surface. None of these requires leaving the `stm`-compatible
object; they are extra tools, not a different model.

We use a second bundled corpus, `congress`: a balanced sample of **1,679 U.S.
House and Senate floor speeches**, Congresses 100–111 (1987–2011), with metadata
`party` (Democrat/Republican), `chamber` (House/Senate), and `congress` (the time
index). It is built for exactly this vignette: two categorical covariates that
cross, plus a time axis.

```{r load}
library(faSTM)
data(congress)
congress
table(congress$meta$party, congress$meta$chamber)
```

We fit one prevalence model (topic prevalence as a function of **party** and a
smooth of **time**) and reuse it throughout.

```{r basefit}
fit <- stm(congress, K = 12, prevalence = ~ party + s(congress),
           verbose = FALSE)
```

## Multiple content covariates

`stm` allows a single content (SAGE) covariate. faSTM accepts several and crosses
them into a **saturated** content model: one topic-word distribution per
combination of levels. Here `party` (2) × `chamber` (2) gives a 4-group content
model, so we can read how *Democrats vs Republicans* and *House vs Senate* word
each topic differently:

```{r content}
fitC <- stm(congress, K = 12,
            prevalence = ~ party + s(congress),
            content    = ~ party + chamber,    # crossed -> 4 SAGE groups
            verbose = FALSE)
fitC$settings$dim$A                            # number of content groups
```
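To see what "crossing" means outside the model, base R's `interaction()` builds one level per combination of two factors. This is only an illustration on made-up metadata (`toy_meta` and `toy_group` are hypothetical names); it is not necessarily how faSTM constructs its content groups internally.

```{r crossing-sketch}
# Toy metadata with the same two factors as `congress$meta` (values are made up).
toy_meta <- data.frame(
  party   = factor(c("Democrat", "Republican", "Democrat", "Republican")),
  chamber = factor(c("House", "House", "Senate", "Senate"))
)

# interaction() creates one level for every party x chamber combination.
toy_group <- interaction(toy_meta$party, toy_meta$chamber, sep = ":")

nlevels(toy_group)   # number of crossed groups
table(toy_group)     # documents per crossed group
```

With two levels in each factor the crossed factor has four levels, which matches the four content groups described above.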

The fit reports the crossing as it runs, and `stm`'s SAGE tooling works
unchanged on the result:

```{r sage, eval=FALSE}
sageLabels(fitC, n = 5)          # top words per topic, per party×chamber group
labelTopics(fitC, topics = 3)
```

The crossed levels are stored on the fit (`settings$covariates$contentvars`,
`contenttable`) so per-covariate marginals can be recovered later.

### Framing, not agenda

Content covariates are where the interesting partisan signal lives. The
`estimateEffect()` analysis below shows that Democrats and Republicans devote
*similar prevalence* to the budget/tax topic, yet they **word it very
differently**. A `~ party` content model makes that concrete:

```{r framing}
fitParty <- stm(congress, K = 12, prevalence = ~ party + s(congress),
                content = ~ party, verbose = FALSE)
sl  <- sageLabels(fitParty, n = 10)
tax <- which(apply(sl$marginal, 1, function(w) any(grepl("^tax", w))))[1]

sl$marginal[tax, ]              # shared topic vocabulary
sl$bygroup$Democrat[tax, ]     # how Democrats word it
sl$bygroup$Republican[tax, ]   # how Republicans word it
```

The shared vocabulary is fiscal (tax, budget, spending), but Democrats reach for
*health, care, children, preexisting, denied* while Republicans reach for
*taxing, taxed, CBO, authority*: same topic, opposite framing. This is the
distinction prevalence covariates can't capture and content (SAGE) covariates
can.

## `estimateEffect()`: cluster-robust SEs and weights

faSTM keeps `stm`'s method-of-composition uncertainty (re-drawing topic
proportions per simulation) and adds the design features that grouped or
weighted data usually need.

**Cluster-robust standard errors.** Speeches within a Congress are not
independent, so we cluster by `congress`. faSTM swaps the classical vcov for a
sandwich estimator with `stm`'s finite-sample correction:

```{r cluster}
eff <- estimateEffect(1:12 ~ party + s(congress), fit,
                      metadata = congress$meta,
                      cluster = congress$meta$congress, nsims = 25)
summary(eff, topics = 3)
```

`summary()` also takes `p.adjust.method` (e.g. `"BH"`) to correct across topics,
and reports per-equation R² and F diagnostics. **Survey weights** enter the same
way via `weights =` (weighted least squares per draw).
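For reference, the textbook cluster-robust sandwich for a least-squares fit has the form below. It is a generic sketch, not a transcription of faSTM's code: $X_g$ and $\hat{u}_g$ are the design-matrix rows and residuals of cluster $g$, and $G$ is the number of clusters (here, Congresses). A finite-sample correction rescales this matrix by a scalar factor; the exact factor faSTM applies is not reproduced here.

$$
\widehat{V}_{\text{cluster}}
  = (X^\top X)^{-1}
    \left( \sum_{g=1}^{G} X_g^\top \hat{u}_g \hat{u}_g^\top X_g \right)
    (X^\top X)^{-1}
$$

When every observation is its own cluster, the middle term reduces to the usual heteroskedasticity-robust form.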

## Random effects in prevalence

Prevalence formulas may include `lme4`-style random-effect terms; faSTM fits a
mixed model per posterior draw and pools with Rubin's rules. This is natural here
if you treat Congresses as exchangeable groups:

```{r ranef, eval=FALSE}
eff_re <- estimateEffect(1:12 ~ party + (1 | congress), fit,
                         metadata = congress$meta, nsims = 25)
summary(eff_re, topics = 3)   # fixed effects + pooled variance components
```
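The pooling step mentioned above follows the standard form of Rubin's rules. As a sketch, write $\hat{Q}_i$ for a coefficient estimate from draw $i$ and $U_i$ for its estimated variance, with $m$ draws in total (presumably `nsims`; that mapping is an assumption, not something this vignette states):

$$
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m} \hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m} U_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m} \left(\hat{Q}_i - \bar{Q}\right)^2, \qquad
T = \bar{U} + \left(1 + \frac{1}{m}\right) B
$$

The pooled estimate is $\bar{Q}$ and its total variance is $T$, which combines the average within-draw variance $\bar{U}$ with the between-draw variance $B$.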

## Average marginal effects

Coefficients on splines and factors are hard to read. `ame()` reports the
**average marginal effect**: the change in expected topic proportion for a
Republican-vs-Democrat shift, averaged over the documents in the sample:

```{r ame}
ame(eff, covariate = "party", topics = c(1, 3, 7))
```
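The idea behind an average marginal effect can be sketched with base R alone. The example below uses `mtcars` purely as stand-in data and a plain `lm()`; it ignores the uncertainty in topic proportions that `estimateEffect()` accounts for, and `toy_lm` and `toy_ame` are hypothetical names, not faSTM objects.

```{r ame-sketch}
# A linear model with an interaction, so the effect of `am` varies with `wt`.
toy_lm <- lm(mpg ~ am * wt, data = mtcars)

# Counterfactual predictions: every row set to am = 1, then every row to am = 0.
pred1 <- predict(toy_lm, newdata = transform(mtcars, am = 1))
pred0 <- predict(toy_lm, newdata = transform(mtcars, am = 0))

# The AME is the sample average of the row-level differences.
toy_ame <- mean(pred1 - pred0)
toy_ame
```

Because of the interaction, no single coefficient equals this number; it is the `am` coefficient plus the interaction coefficient times the sample mean of `wt`. That is why an averaged summary is easier to read than the raw coefficient table.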

## Topic prevalence over time

Because prevalence includes `s(congress)`, we can trace a topic's share across
the 1987–2011 window:

```{r overtime, fig.height=3.6}
plot(eff, "congress", method = "continuous", topics = 3,
     model = fit, xlab = "Congress (100 = 1987 ... 111 = 2009)")
```

## Coherence: NPMI and C_V

Alongside Mimno's semantic coherence, faSTM offers two more coherence metrics
common in the topic-model literature, **NPMI** (normalized pointwise mutual
information) and **C_V**:

```{r coh}
data.frame(
  topic = 1:5,
  mimno = round(coherence(fit, measure = "mimno", M = 10)[1:5], 2),
  npmi  = round(coherence(fit, measure = "npmi",  M = 10)[1:5], 3),
  c_v   = round(coherence(fit, measure = "c_v",   M = 10)[1:5], 3)
)
```
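As a rough illustration of what NPMI measures, the sketch below scores word pairs on a made-up document-term matrix, estimating probabilities as document frequencies. That estimation choice, and the names `toy_dtm`, `npmi_pair`, and `toy_npmi`, are assumptions for this example; faSTM's `coherence()` may count co-occurrences differently.

```{r npmi-sketch}
# Toy document-term incidence matrix (made-up data): rows are documents.
toy_dtm <- matrix(c(1, 1, 0,
                    1, 1, 0,
                    1, 0, 1,
                    0, 0, 1),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(NULL, c("tax", "budget", "health")))

# NPMI for one word pair, with probabilities estimated as document frequencies.
# Assumes the pair does not co-occur in every document (that case gives 0 / 0).
npmi_pair <- function(x, y) {
  p_x  <- mean(x > 0)
  p_y  <- mean(y > 0)
  p_xy <- mean(x > 0 & y > 0)
  if (p_xy == 0) return(-1)   # words never co-occur: the lower bound of NPMI
  log(p_xy / (p_x * p_y)) / (-log(p_xy))
}

# Score every pair of words, then average to get one number for the word set.
word_pairs <- combn(colnames(toy_dtm), 2)
toy_npmi <- apply(word_pairs, 2,
                  function(p) npmi_pair(toy_dtm[, p[1]], toy_dtm[, p[2]]))
names(toy_npmi) <- paste(word_pairs[1, ], word_pairs[2, ], sep = "-")
toy_npmi
mean(toy_npmi)
```

Pairs that co-occur more often than chance score above zero, pairs that co-occur less often score below zero, and pairs that never co-occur sit at -1.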

`search_k()` can select K by any of these, and parallelizes across K:

```{r searchk, eval=FALSE}
search_k(congress, K = c(8, 12, 16), prevalence = ~ party + s(congress),
         measure = "npmi", cores = 3)
```

## A tidyverse-friendly surface

faSTM ships `broom` generics, so model output flows straight into `dplyr`/`ggplot2`.

```{r broom}
# tidy the topic-word matrix (also matrix = "frex" or "gamma")
head(tidy(fit, matrix = "beta"), 4)

# one-row model summary
glance(fit)

# tidy an estimated effect into a coefficient table
head(tidy(eff), 4)
```

`predict()` infers topic proportions for new documents against the fitted model:

```{r predict, eval=FALSE}
predict(fit, newdata = held_out_corpus)   # a faSTM_corpus / dfm of new docs
```

## Summary

Everything here returns ordinary data frames or `stm`-compatible objects, so it
composes with the rest of your analysis. The companion vignette covers the
`stm`-equivalent workflow; this one is the extra mile faSTM adds on top.