| Title: | Fast Structural Topic Models |
|---|---|
| Description: | A modern implementation of the Structural Topic Model. faSTM fits the logistic-normal STM (with prevalence and content covariates) via a multithreaded Rust core, with an opt-in stochastic-variational path for large corpora. It is self-contained: prepared text is read from 'quanteda' or 'tidytext' objects, and model inspection (labelTopics with FREX/lift/score, findThoughts, semantic coherence, exclusivity, topic correlations) and estimateEffect() (method-of-composition posterior propagation) are built in. The fitted object is structurally compatible with 'stm' so existing analyses migrate with minimal changes. |
| Authors: | Neal Caren [aut, cre], Margaret Roberts [cph] (author of stm; inspection formulas in inspect.R adapted from stm (MIT)), Brandon Stewart [cph] (author of stm (MIT)), Dustin Tingley [cph] (author of stm (MIT)) |
| Maintainer: | Neal Caren <[email protected]> |
| License: | Apache License (>= 2) |
| Version: | 0.0.0.9000 |
| Built: | 2026-06-27 16:26:25 UTC |
| Source: | https://github.com/nealcaren/faSTM |
Maps a new corpus's terms onto the term indices of a fitted faSTM model,
dropping out-of-vocabulary terms — the preprocessing needed before inferring
topics for new documents (cf. stm::alignCorpus).
align_corpus(newdata, model)
newdata |
A |
model |
A faSTM fit. |
A list with per-document ids (0-based indices into model$vocab)
and counts, plus dropped (count of out-of-vocabulary term tokens).
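The index mapping described above can be sketched in base R. This is an illustration under stated assumptions, not faSTM's implementation: `align_terms` is a hypothetical helper, and the input is assumed to be one document's terms with matching counts.

```r
# Hypothetical helper: map one document's terms onto a fitted vocabulary.
align_terms <- function(terms, counts, vocab) {
  idx <- match(terms, vocab)        # 1-based position in vocab, NA if absent
  keep <- !is.na(idx)
  list(
    ids = idx[keep] - 1L,           # 0-based indices into vocab
    counts = counts[keep],
    dropped = sum(counts[!keep])    # out-of-vocabulary tokens
  )
}

vocab <- c("budget", "health", "tax")
res <- align_terms(c("tax", "senate", "budget"), c(2L, 5L, 1L), vocab)
```

Here "senate" is not in the vocabulary, so its five tokens are counted in `dropped` and the remaining ids are shifted to 0-based.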
stm-shaped counterpart to align_corpus(): reindexes new's documents onto
old.vocab, dropping out-of-vocabulary terms (and empty documents).
alignCorpus(new, old.vocab, verbose = TRUE)
new |
An stm-style |
old.vocab |
Reference vocabulary to align onto. |
verbose |
Logical. |
list(documents, vocab, docs.removed, words.removed).
The average expected change in a topic's proportion per unit of a covariate
(continuous: average derivative; factor: average level-vs-reference contrast),
averaged over the observed data. Cleaner than reading raw coefficients,
especially with splines/interactions (cf. the margins package; stm #271).
ame(object, covariate, topics = object$topics, h = NULL, ci = 0.95)
object |
A |
covariate |
Covariate name. |
topics |
Topics to report (default: all in the fit). |
h |
Step for the numeric derivative (continuous covariates); defaults to
|
ci |
Confidence level. |
A data.frame: topic, term, ame, se, lower, upper.
faSTM does not do its own tokenization — it reads an already-prepared
document-term representation from the tools the field already uses
(quanteda, tidytext) or a plain sparse matrix. as_corpus() normalizes
any of these into the structure stm() consumes, dropping empty documents
and re-indexing the vocabulary, with metadata kept aligned.
as_corpus(x, meta = NULL, ...)
x |
A |
meta |
Optional data.frame of document metadata, one row per document,
aligned to |
... |
Unused. |
A faSTM_corpus: a list with documents (named list of 2×n integer
matrices: row 1 = 1-based term id, row 2 = count), vocab (character),
meta (data.frame or NULL), and word_counts (corpus term frequencies).
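The documents structure can be illustrated with a small base-R sketch. It assumes a dense count matrix as input and uses a hypothetical helper, `dtm_to_documents`; it drops empty documents but, unlike as_corpus(), does not re-index the vocabulary or carry metadata.

```r
# Hypothetical helper: dense document-term counts -> stm-style documents list.
dtm_to_documents <- function(dtm) {
  dtm <- dtm[rowSums(dtm) > 0, , drop = FALSE]      # drop empty documents
  docs <- lapply(seq_len(nrow(dtm)), function(d) {
    nz <- which(dtm[d, ] > 0)
    # row 1 = 1-based term id, row 2 = count
    rbind(as.integer(nz), as.integer(dtm[d, nz]))
  })
  names(docs) <- rownames(dtm)
  docs
}

dtm <- matrix(c(2, 0, 1,
                0, 0, 0,
                0, 3, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("d1", "d2", "d3"),
                              c("tax", "health", "budget")))
docs <- dtm_to_documents(dtm)
```

The empty document `d2` is removed, so `docs` has two entries, each a 2×n integer matrix.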
Returns a long data.frame (K, metric, value) ready for ggplot2 —
ggplot(as.data.frame(res), aes(K, value)) + geom_line() + facet_wrap(~metric, scales = "free_y").
## S3 method for class 'faSTM_searchk'
as.data.frame(x, ...)
x |
A |
... |
Unused. |
Fills the role of stm::asSTMCorpus: accepts a faSTM_corpus, quanteda dfm, or
document-term matrix and returns list(documents, vocab, data) in stm format.
asSTMCorpus(documents, vocab = NULL, data = NULL, ...)
documents |
A corpus/dfm/matrix, or an stm-style documents list. |
vocab |
Vocabulary (when |
data |
Optional metadata. |
... |
Ignored. |
list(documents, vocab, data).
Assigns each (document, term) cell to the topic maximizing
theta[doc, k] * beta[k, term] (cf. tidytext::augment.STM).
## S3 method for class 'faSTM'
augment(x, data = NULL, ...)
x |
A faSTM fit (carries its DTM). |
data |
Ignored (accepted for the generic). |
... |
Unused. |
A data.frame: document, term, count, .topic.
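The assignment rule can be sketched with toy matrices. The values of `theta` and `beta` below are made up for illustration, and `assign_topic` is a hypothetical helper rather than part of faSTM.

```r
theta <- matrix(c(0.9, 0.1,
                  0.2, 0.8), nrow = 2, byrow = TRUE)       # documents x topics
beta  <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)  # topics x terms

# Hypothetical helper: most likely topic for one (document, term) cell.
assign_topic <- function(doc, term, theta, beta) {
  which.max(theta[doc, ] * beta[, term])
}

cells <- data.frame(document = c(1L, 1L, 2L), term = c(1L, 3L, 3L))
cells$.topic <- mapply(assign_topic, cells$document, cells$term,
                       MoreArgs = list(theta = theta, beta = beta))
```

Term 3 is more probable under topic 2, yet in document 1 it is still assigned to topic 1 because that document's topic 1 proportion dominates the product.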
Ports of stm:::calcfrex/calclift/calcscore. Each takes a K x V logbeta
(log topic-word matrix) and returns a V x K matrix whose columns are the word
indices ordered most- to least-characteristic for each topic.
calcfrex(logbeta, w = 0.5, wordcounts = NULL)
calclift(logbeta, wordcounts)
calcscore(logbeta)
logbeta |
K x V log topic-word matrix. |
w |
FREX frequency/exclusivity weight. |
wordcounts |
Corpus term frequencies (enables the James-Stein shrinkage). |
A V x K matrix of ordered word indices.
Multinomial residual dispersion (Taddy 2012; port of stm::checkResiduals).
A dispersion well above 1 suggests too few topics.
check_residuals(model, tol = 0.01)
model |
A faSTM fit (carries its documents). |
tol |
Threshold for counting estimable residual cells. |
A list with dispersion, pvalue, and df.
Port of stm:::checkBeta: finds (topic, word) cells whose exp(logbeta)
exceeds 1 - tolerance — words that are nearly exclusive to a single topic,
which can destabilize estimation.
checkBeta(stmobject, tolerance = 0.01)
stmobject |
A faSTM/stm fit. |
tolerance |
Threshold; a word with topic-probability |
A list with problemTopics, problemWords, and error counts per content group.
Coherence scores for each topic's top-M words, computed from the fit's
stored document-term matrix. "mimno" is the UMass-style score of
semantic_coherence(); "npmi" averages pairwise normalized PMI; "c_v" is
the Roeder et al. (2015) measure (one-set segmentation, NPMI confirmation,
cosine aggregation). NPMI/c_v use document co-occurrence as the probability
estimator. Higher is more coherent (npmi/c_v are roughly in [-1, 1]).
coherence(model, measure = c("mimno", "npmi", "c_v"), M = 10L)
model |
A faSTM fit (carries its DTM). |
measure |
|
M |
Top words per topic. |
A numeric vector, one coherence score per topic.
A balanced sample of 1,679 floor speeches from the U.S. House and Senate,
Congresses 100-111 (1987-2011). Speeches are sampled evenly across
party x chamber x congress so covariate effects are estimable, then lowercased
and pruned of stop words and rare terms. Metadata: party
(Democrat/Republican), chamber (House/Senate), congress (100-111). Built
to showcase multiple (crossed) content covariates and over-time prevalence.
congress
A faSTM_corpus with 1,679 documents and a 4,110-term vocabulary.
Congressional Record, Hein-bound edition (Gentzkow, Shapiro & Taddy), congresses 100-111. The underlying floor speeches are U.S. government works (public domain).
data(congress)
fit <- stm(congress, K = 12, prevalence = ~ party + s(congress), content = ~ party + chamber)
For a multi-covariate (crossed) content model, recovers the topic-word labels for each level of a single content covariate, averaging the crossed topic-word distributions over the other covariate(s). Lets you read off how topics' vocabulary shifts with one covariate while marginalizing the rest.
content_topics(model, by = NULL, n = 7L, type = c("prob", "lift", "frex"))
model |
A content (SAGE) faSTM fit. |
by |
Content covariate name to marginalize to (default: the first). |
n |
Words per topic. |
type |
|
A named list (one entry per level of by) of K x n word matrices.
Port of stm:::convertCorpus. "Matrix" returns a documents x V sparse
dgCMatrix; "lda" returns the documents list (the lda/stm format).
convertCorpus(documents, vocab, type = c("Matrix", "lda", "slam"))
documents |
stm-style documents list. |
vocab |
Vocabulary vector. |
type |
|
The corpus in the requested format.
Returns the point estimates, standard errors and confidence bounds that
plot.faSTM_effect() would draw, so you can build a custom plot or table
(stm issue #83). Same arguments as the plot method.
effect_estimates(x, covariate, method = c("pointestimate", "continuous", "difference"), topics = x$topics, cov.value1 = NULL, cov.value2 = NULL, values = NULL, moderator = NULL, moderator.value = NULL, npoints = 50L, ci = 0.95)
x |
A |
covariate |
Covariate name. |
method |
|
topics |
Topics to include. |
cov.value1, cov.value2, values
|
Levels/values for difference or continuous range. |
moderator, moderator.value
|
Optional held-fixed interaction term. |
npoints |
Grid size for |
ci |
Confidence level for |
A data.frame with topic, value, est, se, lower, upper.
A drop-in for stm::estimateEffect() that propagates per-document
topic-estimation uncertainty: it regresses each posterior draw of topic
proportions on the covariates and pools the per-draw fits by Rubin's rules.
Propagating that uncertainty is the reason faSTM ships its own estimator
rather than inheriting stm's.
estimateEffect(formula, stmobj, metadata = meta, uncertainty = c("Global", "None", "Local"), nsims = 100L, seed = NULL, meta = NULL, documents = NULL, combine = NULL, weights = NULL, cluster = NULL, ...)
formula |
A formula whose LHS lists topic numbers (e.g. |
stmobj |
A faSTM fit (from |
metadata |
A data.frame of covariates aligned to the documents. |
uncertainty |
|
nsims |
Posterior draws for |
seed |
Optional seed for the posterior draws. |
documents |
Accepted for stm compatibility (faSTM reads nu from the fit). |
combine |
Optional list of topic vectors to also estimate as aggregate
topics (each set's proportions are summed before regressing); named entries
set the coefficient names. E.g. |
weights |
Optional per-document survey/sampling weights (weighted OLS). |
cluster |
Optional per-document cluster ids for cluster-robust SEs. |
... |
Unused (stm signature compatibility). |
An object of class c("faSTM_effect", "estimateEffect") with a
summary() method, holding pooled coefficients and standard errors per
topic.
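The pooling step can be sketched for a single coefficient using the standard form of Rubin's rules. This is an assumption about the formula, not a copy of faSTM's code; `pool_rubin` is a hypothetical helper and the numbers are made up.

```r
# Hypothetical helper: pool m per-draw estimates of one coefficient.
pool_rubin <- function(est, se) {
  m <- length(est)
  within  <- mean(se^2)                  # average within-draw variance
  between <- var(est)                    # variance of estimates across draws
  total   <- within + (1 + 1 / m) * between
  c(estimate = mean(est), se = sqrt(total))
}

pooled <- pool_rubin(est = c(0.10, 0.14, 0.12), se = c(0.02, 0.02, 0.02))
```

The pooled standard error exceeds every per-draw standard error because the between-draw spread, which reflects topic-estimation uncertainty, is added to the within-draw variance.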
Evaluate held-out log-likelihood of a fit on a held-out set
eval_heldout(model, heldout)
model |
A faSTM fit (trained on |
heldout |
A |
Mean per-document held-out log-likelihood per token.
Topic exclusivity (FREX-summary, frexw default 0.7)
exclusivity(model, M = 10L, frexw = 0.7)
model |
A faSTM fit. |
M |
Top words per topic. |
frexw |
Frequency/exclusivity weight. |
A numeric vector, one exclusivity value per topic.
Representative documents for each topic
find_thoughts(model, texts = NULL, topics = NULL, n = 3L)
model |
A faSTM fit. |
texts |
Optional character vector of the raw document texts, aligned to the fitted documents; returned alongside the indices when supplied. |
topics |
Topics to report (default all). |
n |
Documents per topic. |
A list with index (per-topic document indices) and, if texts is
given, docs (the texts).
Find topics whose top words include given words
find_topic(model, words, n = 20L, type = c("prob", "frex", "lift", "score"))
model |
A faSTM fit. |
words |
Character vector of query words. |
n |
Top words per topic to search. |
type |
Ranking metric: |
Integer vector of matching topics.
Runs the variational E-step for each new document against the fitted model's
fixed global parameters (topic-word matrix, prior mean and covariance), giving
out-of-sample topic proportions (cf. stm::fitNewDocuments). The model's
topics are held fixed; only each new document's proportions are estimated.
fit_new_documents(model, newdata)
model |
A faSTM fit (non-content; for content models the group-marginal topic-word matrix is used, with a warning). |
newdata |
A |
A new-documents × K matrix of topic proportions.
Inputs are pre-converted in R/stm.R:
docs_flat / doc_lens: documents as one concatenated 0-based token-id
stream plus per-document lengths (counts already expanded). Reassembled
here into Vec<Vec<u32>>, the shape fit_ctm wants.
prevalence: row-major D×P design matrix flattened (NULL if none).
content_groups: per-doc 0-based group id (NULL if none); num_groups.
fit_stm(docs_flat, doc_lens, num_types, num_topics, em_iters, em_tol, sigma_shrink, prevalence, num_features, content_groups, num_groups, init_spectral, init_beta, gamma_l1_alpha, diagonal, seed, inference, batch_size, tau, kappa, num_threads)
inference: "batch" -> fit_ctm (parity-validated). "svi" -> fit_ctm_svi
once topica #231 PR B (STM-SVI) is in the pinned revision; the R layer gates
the prevalence/content + svi combination until then.
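The docs_flat / doc_lens layout can be sketched in base R. The conversion in R/stm.R is not shown here, so `flatten_documents` is a hypothetical helper that only mirrors the description: counts are expanded into repeated tokens and ids are shifted to 0-based.

```r
# Hypothetical helper: stm-style documents -> flat 0-based token stream.
flatten_documents <- function(documents) {
  tokens <- lapply(documents, function(doc) {
    rep(doc[1, ] - 1L, times = doc[2, ])   # expand counts, shift ids to 0-based
  })
  list(docs_flat = unlist(tokens, use.names = FALSE),
       doc_lens  = lengths(tokens, use.names = FALSE))
}

documents <- list(
  d1 = rbind(c(1L, 3L), c(2L, 1L)),   # term 1 twice, term 3 once
  d2 = rbind(c(2L), c(3L))            # term 2 three times
)
flat <- flatten_documents(documents)
```

`doc_lens` holds token counts per document, which is what lets the receiving side cut the concatenated stream back into per-document vectors.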
Drop-in for stm::fitNewDocuments(). Holds the fitted topics fixed and runs
the variational E-step for each new document. Supports stm's prior modes and
posterior return.
fitNewDocuments(model, documents, newData = NULL, origData = NULL, prevalence = NULL, betaIndex = NULL, prevalencePrior = c("Average", "Covariate", "None"), contentPrior = c("Covariate", "Average"), returnPosterior = FALSE, verbose = TRUE, ...)
model |
A faSTM fit. |
documents |
New documents: a |
newData, origData
|
Covariate frames for the new and original documents
(used by |
prevalence |
Prevalence formula (same RHS as the fit) for the covariate prior. |
betaIndex |
Integer per-document content-group index (content models). |
prevalencePrior |
|
contentPrior |
|
returnPosterior |
If |
verbose |
Logical. |
... |
Ignored (stm signature compatibility). |
A theta matrix, or a posterior list when returnPosterior = TRUE.
FREX balances word frequency and exclusivity (Bischof & Airoldi 2012;
Roberts et al.). Unlike stm's labelTopics(), this returns the full numeric
FREX matrix, not just the ranked words (addresses a long-standing stm
request, bstewart/stm#265).
frex_scores(model, w = 0.5)
model |
A faSTM fit. |
w |
FREX frequency/exclusivity weight (0.5 = equal). |
A topics × vocabulary matrix of FREX scores (columns named by vocab).
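A minimal sketch of the FREX idea in base R follows. It assumes the rank-based form used by stm (a weighted harmonic mean of within-topic empirical CDFs of frequency and exclusivity, with `w` placed on frequency) and omits the James-Stein shrinkage; `frex_sketch` is a hypothetical helper, not the package's code.

```r
# Hypothetical helper: FREX scores from a K x V log topic-word matrix.
frex_sketch <- function(logbeta, w = 0.5) {
  # Exclusivity: each word's share of probability across topics (log scale).
  col_lse <- apply(logbeta, 2, function(x) max(x) + log(sum(exp(x - max(x)))))
  excl <- sweep(logbeta, 2, col_lse)
  # Within-topic empirical CDFs (rank / V) of frequency and exclusivity.
  V <- ncol(logbeta)
  freq_cdf <- t(apply(logbeta, 1, rank)) / V
  excl_cdf <- t(apply(excl, 1, rank)) / V
  # Weighted harmonic mean of the two CDFs.
  1 / (w / freq_cdf + (1 - w) / excl_cdf)
}

beta <- matrix(c(0.6, 0.3, 0.1,
                 0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)
scores <- frex_sketch(log(beta))
```

The result is a topics × vocabulary matrix; in this toy example the word that is both most frequent and most exclusive in a topic receives the top score of 1.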
For tidytext-style data: one row per (document, term) with a count.
from_tidy(data, document = "document", term = "term", count = "n", meta = NULL)
data |
A data.frame. |
document, term, count
|
Column names (strings) for the document id, the
term, and the count. |
meta |
Optional per-document metadata, aligned to the sorted unique documents. |
A faSTM_corpus.
One-row model summary for a faSTM fit
## S3 method for class 'faSTM'
glance(x, ...)
x |
A faSTM fit. |
... |
Unused. |
A one-row data.frame.
Out-of-sample topic inference: for each new document, run the variational
E-step against fixed globals (β, μ, Σ⁻¹) and return θ. Documents are passed
sparse: words are 0-based ids into the fitted model's vocabulary
(out-of-vocabulary terms dropped by the R caller) with their counts,
concatenated, plus per-document term counts doc_nterms.
infer_theta_new(beta_flat, num_topics, num_types, mu, siginv, words, counts, doc_nterms)
Label topics by top words (prob, FREX, lift, score)
label_topics(model, n = 7L, frexweight = 0.5)
model |
A faSTM fit. |
n |
Number of words per topic per metric. |
frexweight |
FREX frequency/exclusivity weight. |
A faSTM_labels object: per-metric top-word matrices (prob,
frex, lift, score), each topics × n.
LDA topic-word matrix via topica's CVB0 (deterministic collapsed variational
Bayes), to seed a "replicate stm's LDA init" STM fit. Mirrors stm's
collapsed-Gibbs LDA initialization; the result is fed back as init_beta.
Returns K*V row-major topic-word probabilities.
lda_init_beta(docs_flat, doc_lens, num_types, num_topics, iters, alpha, beta, seed)
Document-topic proportions as a data frame
make_dt(model, meta = NULL)
model |
A faSTM fit. |
meta |
Optional metadata to bind alongside (defaults to none). |
A data.frame with document and Topic1..TopicK columns (+ meta).
Create a held-out version of a corpus for document-completion validation
make_heldout(corpus, N = floor(0.1 * length(corpus$documents)), proportion = 0.5, seed = NULL)
corpus |
A |
N |
Number of documents to hold tokens out of (default: 10% of docs). |
proportion |
Fraction of each chosen document's term types to hold out. |
seed |
Optional RNG seed. |
A list with corpus (training corpus, held-out tokens removed) and
missing (per-document held-out terms + counts), class faSTM_heldout.
Port of stm:::makeDesignMatrix: builds the model matrix for newData using
the term structure and factor levels of origData.
makeDesignMatrix(formula, origData, newData, sparse = TRUE, ...)
formula |
A model formula. |
origData |
Data defining the terms/levels. |
newData |
Data to build the matrix for. |
sparse |
Return a sparse matrix. |
... |
Ignored. |
A (sparse) design matrix.
Runs select_model() for each K and returns the chosen model per K.
many_topics(corpus, K, N = 10L, prevalence = NULL, content = NULL, by = "sum", cores = 1L, seed = 1L, ...)
corpus |
A |
K |
Integer vector of topic counts. |
N |
Number of candidate models (distinct random inits). |
prevalence, content
|
Optional covariate formulas. |
by |
Selection rule passed to |
cores |
Candidates to fit in parallel. |
seed |
Base RNG seed (candidate i uses |
... |
Passed to |
A faSTM_manytopics: models (best per K) and a summary data.frame.
Aligns every model from a select_model() run to the first and reports how
stable each topic's top words are across runs (cf. stm::multiSTM).
multi_stm(x, n = 10L)
x |
A |
n |
Top words used for the stability score. |
A faSTM_multistm with a per-topic mean top-word agreement.
Port of stm:::optimizeDocument's interface: infers one document's topic
proportions against fixed globals and returns its variational mean lambda
(eta), Laplace covariance nu, and theta.
optimizeDocument(document, eta, mu, beta, sigma = NULL, sigmainv = NULL, ...)
document |
A 2 x n integer matrix (1-based vocab ids; counts). |
eta |
Ignored starting value (kept for signature compatibility). |
mu |
Prior mean (length K-1). |
beta |
K x V topic-word probability matrix. |
sigma, sigmainv
|
Prior covariance or its inverse (supply one). |
... |
Ignored (stm signature compatibility). |
A list with lambda, nu, and theta.
Refits the model many times with the treatment labels permuted, aligning
topics across refits, to build a null distribution for the treatment effect
on each topic (cf. stm::permutationTest). Practical because each refit is fast.
permutation_test(formula, model, treatment, corpus, nruns = 100L, seed = NULL, ...)
formula |
Prevalence formula whose RHS includes |
model |
A faSTM fit. |
treatment |
Name of a 0/1 covariate in |
corpus |
The |
nruns |
Total models (1 reference + |
seed |
RNG seed. |
... |
Passed to |
A faSTM_permtest with ref (observed per-topic effects) and null
((nruns-1) × K permuted effects).
Nodes are topics (sized by prevalence, labelled by top words); edges join
topics whose proportions are positively correlated above cutoff. Uses a
lightweight circular layout — no graph-library dependency.
plot_topic_network(model, cutoff = 0.03, n = 3L, labeltype = "frex")
model |
A faSTM fit. |
cutoff |
Correlation threshold for an edge. |
n |
Top words per topic label. |
labeltype |
Word ranking for labels. |
A ggplot object.
Plot a fitted model
## S3 method for class 'faSTM'
plot(x, type = c("summary", "labels", "perspectives", "hist"), topics = NULL, n = 5L, labeltype = "frex", ...)
x |
A faSTM fit. |
type |
|
topics |
Topics to show (for |
n |
Top words to label each topic with. |
labeltype |
Word ranking for labels: |
... |
Accepted for stm compatibility (e.g. |
A ggplot object.
Plot estimated covariate effects on topic prevalence
## S3 method for class 'faSTM_effect'
plot(x, covariate, method = c("pointestimate", "continuous", "difference"), topics = x$topics, model = NULL, cov.value1 = NULL, cov.value2 = NULL, values = NULL, moderator = NULL, moderator.value = NULL, npoints = 50L, ci = 0.95, labeltype = NULL, custom.labels = NULL, xlab = NULL, main = NULL, ...)
x |
A |
covariate |
Name of the covariate to vary. |
method |
|
topics |
Topics to show (default all in the effect object). |
values |
For |
npoints |
Grid size for |
ci |
Confidence level. |
... |
Unused. |
A ggplot object.
Faceted held-out likelihood, semantic coherence, exclusivity and bound vs K.
## S3 method for class 'faSTM_searchk'
plot(x, ...)
x |
A |
... |
Unused. |
A ggplot object.
5,000 political blog posts from the 2008 U.S. election (the stm
vignette example) as a ready-to-use faSTM_corpus. Metadata: rating
(Conservative/Liberal), day (1-365), blog, text.
data(poliblog)
A faSTM_corpus: 5,000 documents, 2,632-term vocabulary.
Eisenstein & Xing (2010), via the stm package.
The variational (Laplace) posterior of each document's logit-topic vector is
eta_d ~ N(lambda_d, nu_d), both stored on a faSTM fit. This draws nsims
samples of theta per document by sampling eta and applying the softmax (with
the reference topic appended as 0). This is the pure-R equivalent of topica's
posterior_theta_samples; no Rust call is needed because eta + nu fully
describe the posterior. Feeds estimateEffect()'s method of composition.
posterior_theta_samples(model, nsims = 100L, seed = NULL)
model |
A faSTM fit (from |
nsims |
Number of posterior draws. |
seed |
Optional integer seed for reproducible draws. |
A nsims-length list of D x K theta matrices.
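The draw-and-softmax step can be sketched for a single document. The helper name `draw_theta` and the toy values of `lambda` and `nu` are assumptions for illustration; the sketch shows the technique (Gaussian draws via a Cholesky factor, reference logit fixed at 0, softmax), not the package's internals.

```r
# Hypothetical helper: draw theta for one document from eta ~ N(lambda, nu).
draw_theta <- function(lambda, nu, nsims = 100L) {
  k1 <- length(lambda)                              # K - 1 free logits
  z <- matrix(rnorm(nsims * k1), nrow = nsims)
  eta <- sweep(z %*% chol(nu), 2, lambda, "+")      # rows ~ N(lambda, nu)
  eta <- cbind(eta, 0)                              # reference topic fixed at 0
  expd <- exp(eta - apply(eta, 1, max))             # numerically stable softmax
  expd / rowSums(expd)
}

set.seed(1)
theta <- draw_theta(lambda = c(0.5, -1), nu = diag(c(0.2, 0.1)), nsims = 200L)
```

Each row of `theta` is one posterior draw of the document's K topic proportions and sums to 1.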
Predict topic proportions for new documents
## S3 method for class 'faSTM'
predict(object, newdata, ...)
object |
A faSTM fit. |
newdata |
New documents (corpus / dfm / matrix / stm-style list). |
... |
Passed to |
A new-documents x K matrix of topic proportions.
Each line is M term:count term:count ... with 0-based term ids.
read_ldac(file)
write_ldac(documents, file)
file |
Path to the |
documents |
A list of 2×n integer matrices (1-based ids). |
read_ldac returns a list of 2×n integer matrices (faSTM/stm document
format, 1-based ids); write_ldac returns the path invisibly.
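Parsing one line of this format can be sketched in base R. `parse_ldac_line` is a hypothetical helper that handles a single well-formed line; it is not the package's reader and does no error recovery.

```r
# Hypothetical helper: parse one LDA-C line into a 2 x n integer matrix.
parse_ldac_line <- function(line) {
  fields <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  n_terms <- as.integer(fields[1])                 # leading M = distinct terms
  pairs <- strsplit(fields[-1], ":", fixed = TRUE)
  stopifnot(length(pairs) == n_terms)
  # File ids are 0-based; the in-memory document format is 1-based.
  ids    <- as.integer(vapply(pairs, `[`, character(1), 1)) + 1L
  counts <- as.integer(vapply(pairs, `[`, character(1), 2))
  rbind(ids, counts, deparse.level = 0)
}

doc <- parse_ldac_line("3 0:2 4:1 7:5")
```

The only transformation besides splitting is the id shift: term 0 in the file becomes term 1 in the matrix.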
A B-spline basis for smooth covariate effects, e.g. prevalence = ~ s(day).
Matches stm::s() exactly — including the df = min(10, nval - 1) default —
so spline-term coefficients agree with stm. (You can also use
splines::bs()/splines::ns() directly.)
s(x, df, ...)
x |
Numeric predictor. |
df |
Basis dimension; defaults to |
... |
Passed to |
A spline basis matrix (with class "s").
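The df default can be illustrated with splines::bs(). `s_sketch` is a hypothetical stand-in that reproduces only the default described above; it does not add the "s" class or any other handling the real s() may perform.

```r
library(splines)

# Hypothetical stand-in for s(): B-spline basis with df = min(10, nval - 1).
s_sketch <- function(x, df, ...) {
  nval <- length(unique(x))
  if (missing(df)) df <- min(10, nval - 1)
  bs(x, df = df, ...)
}

day <- 1:365
basis <- s_sketch(day)
```

With 365 distinct values the default resolves to 10, so `basis` has one row per observation and 10 columns.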
For models fit with a content covariate, reports each topic's marginal top
words plus, for every content group, the words most distinctive to that group
within the topic (group-vs-marginal log-ratio — the SAGE deviation).
sage_labels(model, n = 7L, frexweight = NULL)
model |
A faSTM fit with a content covariate. |
n |
Words per list. |
A faSTM_sagelabels object.
Fits the model across a range of K and reports diagnostics for choosing it:
held-out likelihood (document completion), semantic coherence, exclusivity,
and the variational bound. Unlike stm::searchK, the per-K fits parallelize
across K (a long-standing request, bstewart/stm#262) and each fit is itself
fast (Rust), so a sweep that took minutes takes seconds.
search_k(corpus, K, prevalence = NULL, content = NULL, heldout = TRUE, proportion = 0.5, residuals = FALSE, cores = 1L, M = 10L, seed = 1L, measure = c("mimno", "npmi", "c_v"), verbose = FALSE, ...)
corpus |
A |
K |
Integer vector of topic counts to try. |
prevalence, content
|
Optional covariate formulas (see |
heldout |
Logical; compute held-out likelihood via document completion. |
proportion |
Held-out token fraction (passed to |
cores |
Number of K-fits to run in parallel (forked; 1 = sequential).
When |
M |
Top words for coherence/exclusivity. |
seed |
RNG seed (held-out split + fits). |
... |
Passed to |
A faSTM_searchk object wrapping a tidy data.frame results with one
row per K (K, heldout, semcoh, exclusivity, bound).
Pick one model from a select_model run
select_best(x, by = c("sum", "semcoh", "exclusivity"))
x |
A |
by |
|
A single faSTM fit.
With random initialization the variational objective is multimodal, so the
standard workflow (cf. stm::selectModel) is to fit many models and keep
those on the semantic-coherence / exclusivity frontier, then choose among
them. faSTM fits the candidates in parallel.
select_model(corpus, K, N = 10L, prevalence = NULL, content = NULL, init.type = "Random", cores = 1L, M = 10L, frexw = 0.7, seed = 1L, ...)
corpus |
A |
K |
Number of topics. |
N |
Number of candidate models (distinct random inits). |
prevalence, content
|
Optional covariate formulas. |
init.type |
Initialization; |
cores |
Candidates to fit in parallel. |
M |
Top words for coherence/exclusivity scoring. |
frexw |
Exclusivity FREX weight. |
seed |
Base RNG seed (candidate i uses |
... |
Passed to |
A faSTM_selectmodel: models (the fits), semcoh, exclusivity,
and frontier (indices of non-dominated models).
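Finding the non-dominated models can be sketched as a small Pareto check. The sketch assumes one summary value per candidate on each criterion and that higher is better on both; `pareto_frontier` and the scores are hypothetical.

```r
# Hypothetical helper: indices of models not dominated on both criteria.
pareto_frontier <- function(semcoh, exclusivity) {
  n <- length(semcoh)
  dominated <- vapply(seq_len(n), function(i) {
    # Some other model is at least as good on both and strictly better on one.
    any(semcoh >= semcoh[i] & exclusivity >= exclusivity[i] &
          (semcoh > semcoh[i] | exclusivity > exclusivity[i]))
  }, logical(1))
  which(!dominated)
}

semcoh      <- c(-80, -75, -90, -78)
exclusivity <- c(9.5, 9.0, 9.8, 8.9)
frontier <- pareto_frontier(semcoh, exclusivity)
```

Model 4 is excluded because model 2 beats it on both coherence and exclusivity; the other three each win on at least one criterion.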
Sum over the top-M words of each topic of log((D(w_i,w_j)+1)/D(w_j)),
using document co-occurrence counts. Higher (less negative) is more coherent.
semantic_coherence(model, M = 10L)
model |
A faSTM fit (must carry its document-term matrix; faSTM stores it). |
M |
Number of top words per topic. |
A numeric vector, one coherence value per topic.
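The coherence sum can be sketched for one topic from a dense count matrix. `coherence_sketch` is a hypothetical helper; it assumes the top words are supplied most probable first and pairs each word with the higher-ranked words before it, following Mimno et al.'s ordering.

```r
# Hypothetical helper: Mimno-style coherence for one topic's top words.
# dtm: documents x terms count matrix; top: term indices, most probable first.
coherence_sketch <- function(dtm, top) {
  present <- (dtm[, top, drop = FALSE] > 0) * 1   # 1 if the word occurs in the doc
  co <- crossprod(present)                        # co[i, j] = D(w_i, w_j); diagonal = D(w_j)
  total <- 0
  for (i in seq_along(top)[-1]) {
    for (j in seq_len(i - 1)) {
      total <- total + log((co[i, j] + 1) / co[j, j])
    }
  }
  total
}

dtm <- matrix(c(1, 1, 0,
                1, 0, 1,
                1, 1, 1,
                0, 1, 0), nrow = 4, byrow = TRUE)
score <- coherence_sketch(dtm, top = c(1, 2, 3))
```

Only the pair (word 3, word 2) contributes a nonzero term here, since they co-occur in a single document while word 2 appears in three.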
A drop-in replacement for stm::stm()'s fitting step. Accepts the same
documents / vocab / prevalence / content inputs, fits with topica's
Rust core, and returns an object compatible with the stm package so that
stm::labelTopics(), stm::plot.STM(), stm::findThoughts(),
stm::sageLabels(), and stm::toLDAvis() work unmodified. Use
estimateEffect() from this package for covariate effects that propagate
topic-estimation uncertainty.
stm(documents, vocab, K, prevalence = NULL, content = NULL, data = NULL, max.em.its = 500L, emtol = 1e-05, init.type = c("Spectral", "Random", "LDA", "Custom"), init.beta = NULL, model = NULL, gamma.prior = c("Pooled", "L1"), gamma.l1.alpha = 0.001, sigma.prior = 0, seed = 1L, inference = c("batch", "svi"), batch_size = 256L, tau = 64, kappa = 0.7, num_threads = 0L, verbose = TRUE, ...)
documents |
stm-format documents: a named list of |
vocab |
Character vector of vocabulary terms. |
K |
Number of topics. |
prevalence |
A right-hand-side formula (e.g. |
content |
A right-hand-side formula naming a single categorical variable,
or a factor; the SAGE content covariate. |
data |
A data.frame of document metadata (the |
max.em.its |
Maximum EM iterations (batch) / epochs (svi). |
emtol |
Relative-bound convergence tolerance. |
init.type |
Topic initialization: |
init.beta |
Optional K x V topic-word probability matrix to start the fit
from a given initialization (overrides |
model |
A fitted model whose topic-word matrix seeds |
gamma.prior |
Prevalence-coefficient prior: |
sigma.prior |
Shrinkage applied to the topic covariance off-diagonal. |
seed |
Integer seed (batch fit is reproducible from it). |
inference |
|
batch_size, tau, kappa
|
SVI controls (minibatch size; Robbins-Monro
|
num_threads |
Worker threads for the parallel variational E-step. |
verbose |
Logical; print progress. |
An object of class c("faSTM", "STM") — an stm-compatible fit.
Tidy a faSTM fit (topic-term or document-topic distributions)
## S3 method for class 'faSTM'
tidy(x, matrix = c("beta", "gamma", "frex"), ...)
x |
A faSTM fit. |
matrix |
|
... |
Unused. |
A tidy data.frame.
Tidy an estimateEffect fit (one row per term per topic)
## S3 method for class 'faSTM_effect'
tidy(x, ...)
x |
A |
... |
Unused. |
A data.frame: topic, term, estimate, std.error, statistic, p.value.
Exports the positive-correlation topic network as an igraph object (stm
issue #242), with topic prevalence and FREX labels as vertex attributes and
the positive correlations as edge weights — ready for igraph/ggraph layouts.
topic_corr_graph(x, model = NULL, nlabel = 3L)
x |
A |
model |
The fit, if |
nlabel |
Top FREX words per topic for the vertex label. |
An undirected igraph graph.
Topic correlation graph (positive correlations of topic proportions)
topic_correlation(model, cutoff = 0.01)
model |
A faSTM fit. |
cutoff |
Correlation threshold for an edge. |
A list with cor (the K×K correlation matrix) and posadj (the
thresholded positive adjacency).
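The two returned pieces can be sketched from a toy theta matrix. `topic_cor_sketch` is a hypothetical helper; zeroing the diagonal is a choice made in this sketch and may differ from what the package returns.

```r
# Hypothetical helper: correlation of topic proportions and positive adjacency.
topic_cor_sketch <- function(theta, cutoff = 0.01) {
  cmat <- cor(theta)
  posadj <- (cmat > cutoff) * 1
  diag(posadj) <- 0                  # no self-loops (a choice made in this sketch)
  list(cor = cmat, posadj = posadj)
}

theta <- rbind(c(0.6, 0.3, 0.1),
               c(0.5, 0.4, 0.1),
               c(0.1, 0.2, 0.7),
               c(0.2, 0.1, 0.7))
res <- topic_cor_sketch(theta)
```

Topics 1 and 2 rise and fall together across documents, so they share an edge; topic 3 is negatively correlated with both and stays isolated.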
Cross-validated lasso (glmnet) of an outcome on the topic-proportion matrix
(cf. stm::topicLasso). Identifies which topics predict the outcome.
topic_lasso(formula, model, data, family = "gaussian", nfolds = 10L, seed = 2138L, ...)
formula |
|
model |
A faSTM fit (supplies the topic proportions). |
data |
Document-level data with the outcome, aligned to the documents. |
family |
glmnet family ( |
nfolds |
CV folds. |
seed |
RNG seed. |
... |
Passed to |
A faSTM_topiclasso with selected per-topic coefficients.
Returns the corpus-level expected topic proportions — the mean of theta per
topic — as a numeric table, so you can read off the values stm's
plot(type = "summary") displays (stm issue #269).
topic_proportions(model, nlabel = 3L)
model |
A faSTM fit. |
nlabel |
Top FREX words to attach as a topic label. |
A data.frame with topic, proportion, label, sorted by proportion.
Like label_topics() but returns the values behind the ranking, not just
the words — e.g. the numeric FREX score per top term (stm issue #265).
topic_terms(model, n = 7L, by = c("prob", "frex", "lift", "score"), frexweight = 0.5)
model |
A faSTM fit. |
n |
Terms per topic. |
by |
Ranking measure: |
frexweight |
FREX frequency/exclusivity weight (used when |
A tidy data.frame with topic, rank, term, score, measure.