Topic Model
A topic model is an application of machine learning to qualitative methods: topics are extracted from documents and optionally assigned to categories.
Data Structure
Topic models make use of a Document-Term matrix (DTM). Rows are documents and columns are tokens.
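As a minimal sketch of that structure (in Python for illustration, not the R packages referenced below), a DTM can be built by counting each token's occurrences per document:

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Tokenize each document (simple whitespace split).
tokenized = [d.split() for d in docs]

# Columns: the sorted vocabulary across all documents.
vocab = sorted({tok for doc in tokenized for tok in doc})

# Rows: one token-count vector per document.
dtm = [[Counter(doc)[tok] for tok in vocab] for doc in tokenized]

print(vocab)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(dtm)    # [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]
```

Real toolkits store the DTM as a sparse matrix, since most documents use only a small fraction of the vocabulary.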
Implementation
Common topic model methods include:
Probabilistic Latent Semantic Analysis (PLSA)
Latent Dirichlet Allocation (LDA)
Pachinko Allocation
LDA is a form of unsupervised learning. It takes a few inputs as priors. Importantly, κ is the number of topics. With a higher α (between 0 and 1), more topics will be extracted from each document. Short documents likely call for a small α value. With a higher β (between 0 and 1), topics are composed of more words. Dense documents likely call for a high β value.
Topic models are evaluated by perplexity of the model, coherence of topics, and exclusivity of the topics.
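Of these metrics, perplexity is typically computed as the exponent of the negative mean per-token log-likelihood on held-out text. A small sketch, using hypothetical likelihood values for illustration:

```python
import math

# Per-token likelihoods of a held-out corpus under a fitted model
# (hypothetical values for illustration).
token_probs = [0.1, 0.2, 0.1, 0.4]
log_likelihoods = [math.log(p) for p in token_probs]

# Perplexity: exp of the negative mean per-token log-likelihood.
# Lower is better; a model that predicted every token perfectly
# would approach a perplexity of 1.
perplexity = math.exp(-sum(log_likelihoods) / len(log_likelihoods))
print(round(perplexity, 3))
```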
See textmineR::FitLdaModel.
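To make the role of κ, α, and β concrete, the toy below is a collapsed Gibbs sampler for LDA written in Python. It is an illustrative sketch, not the textmineR implementation; all names and the tiny corpus are invented for the example:

```python
import random
from collections import defaultdict

def fit_lda_gibbs(docs, kappa, alpha, beta, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative sketch).

    docs:  list of token lists
    kappa: number of topics
    alpha: document-topic prior; beta: topic-word prior
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # Count tables: document-topic, topic-word, topic totals.
    ndk = [[0] * kappa for _ in docs]
    nkw = [defaultdict(int) for _ in range(kappa)]
    nk = [0] * kappa
    # Random initial topic assignment for every token.
    z = [[rng.randrange(kappa) for _ in d] for d in docs]
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            k = z[di][wi]
            ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                # Remove this token's current assignment from the counts...
                k = z[di][wi]
                ndk[di][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # ...then resample it from the full conditional:
                # p(k) ∝ (ndk + α) * (nkw + β) / (nk + V·β)
                weights = [
                    (ndk[di][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                    for t in range(kappa)
                ]
                k = rng.choices(range(kappa), weights=weights)[0]
                z[di][wi] = k
                ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, ndk

docs = [["apple", "banana", "apple"], ["engine", "wheel", "engine"]]
z, ndk = fit_lda_gibbs(docs, kappa=2, alpha=0.1, beta=0.1)
```

A small α concentrates each row of `ndk` on few topics (sparse document-topic mixtures); a small β concentrates each topic on few words, matching the intuitions above.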
LDA tuning steps are:
1. Run models with varying α and β values (a.k.a. hyperparameter tuning).
   - Using the median likely value of κ, run a model for each combination of discrete likely values of α and β.
   - Minimize for perplexity and take the optimal pair.
   - e.g. text2vec::perplexity
2. Run models with varying κ values.
   - Using the optimal (α, β) pair, run a model for each likely κ value.
3. Select the κ value that gives the best trade-off of minimizing perplexity, maximizing coherence, and maximizing exclusivity.
   - e.g. text2vec::perplexity, textmineR::CalcProbCoherence, and topicdoc::topic_exclusivity
4. Examine the topics.
   - e.g. textmineR::SummarizeTopics
   - γ terms are the most exclusive tokens for each topic.
   - φ terms are the most common tokens for each topic.
   - If these sets of tokens reveal that the modeled topics are insufficient, go back and select a different κ value.
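The tuning loop in the steps above amounts to a two-stage grid search, sketched below in Python. `evaluate_perplexity` is a hypothetical stand-in: in practice it would fit an LDA model (e.g. with textmineR in R) and score it on held-out data.

```python
import itertools

# Hypothetical stand-in for "fit a model and score it"; lower is better.
def evaluate_perplexity(kappa, alpha, beta):
    return abs(alpha - 0.3) + abs(beta - 0.5) + abs(kappa - 20) / 10

# Step 1: fix κ at its median likely value, grid over (α, β),
# and take the pair that minimizes perplexity.
kappa_candidates = [10, 20, 30]
kappa_median = sorted(kappa_candidates)[len(kappa_candidates) // 2]
grid = itertools.product([0.1, 0.3, 0.5], [0.25, 0.5, 0.75])
best_alpha, best_beta = min(
    grid, key=lambda ab: evaluate_perplexity(kappa_median, *ab)
)

# Step 2: with the optimal (α, β) pair, try each likely κ value.
best_kappa = min(
    kappa_candidates,
    key=lambda k: evaluate_perplexity(k, best_alpha, best_beta),
)
# (Step 3 would also weigh coherence and exclusivity,
# not perplexity alone, before committing to κ.)
print(best_alpha, best_beta, best_kappa)
```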