Differences between revisions 1 and 3 (spanning 2 versions)
Revision 1 as of 2023-10-28 03:21:17
Size: 364
Comment: Initial commit
Revision 3 as of 2023-12-07 18:07:39
Size: 2117
Comment: Fixed indentation
Deletions are marked like this. Additions are marked like this.
Line 11: Line 11:
== Data Structure ==

Topic models make use of a '''Document-Term matrix''' ('''DTM'''). Rows are ''documents'' and columns are ''tokens''.
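
A minimal sketch of building a DTM in R with `textmineR::CreateDtm`; the two-document corpus `docs` below is hypothetical and only illustrates the row/column layout:

{{{
library(textmineR)

# Hypothetical corpus: a named character vector, one element per document
docs <- c(doc1 = "topic models extract latent themes from text",
          doc2 = "the document term matrix has one row per document")

# Sparse document-term matrix: rows are documents, columns are tokens
dtm <- CreateDtm(doc_vec = docs,
                 doc_names = names(docs),
                 ngram_window = c(1, 1),   # unigrams only
                 lower = TRUE,
                 remove_punctuation = TRUE)

dim(dtm)  # number of documents x number of tokens
}}}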



----


Line 13: Line 23:
Categorizing and scoring topics requires some supervision and some prior assumptions about which topics are present. Common topic model methods include:

 * '''Probabilistic Latent Semantic Analysis''' ('''PLSA''')
 * '''Latent Dirichlet Allocation''' ('''LDA''')
 * '''Pachinko Allocation'''

LDA is a form of unsupervised learning. It takes a few inputs: κ, the number of topics, plus the priors α and β. With a higher α (between 0 and 1), more topics will be extracted from each document; short documents likely call for a small α value. With a higher β (between 0 and 1), topics are composed of more words; dense documents likely call for a high β value.

Topic models are evaluated by the '''perplexity''' of the model, the '''coherence''' of its topics, and the '''exclusivity''' of its topics.

See `textmineR::FitLdaModel`.
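
A hedged sketch of fitting a single LDA model with `textmineR::FitLdaModel`, reusing the `dtm` from the sketch above; the κ, α, and β values here are placeholders, not recommendations:

{{{
library(textmineR)

set.seed(42)

model <- FitLdaModel(dtm = dtm,
                     k = 10,                 # κ: number of topics (placeholder)
                     alpha = 0.1,            # document-topic prior (placeholder)
                     beta = 0.05,            # topic-word prior (placeholder)
                     iterations = 500,
                     burnin = 200,
                     calc_coherence = TRUE,  # per-topic probabilistic coherence
                     calc_likelihood = TRUE)

dim(model$phi)    # topics x tokens: P(token | topic)
dim(model$theta)  # documents x topics: P(topic | document)
}}}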

The steps for fitting and tuning an LDA model are:

 1. Run models with varying α and β values (a sketch of steps 1-4 follows this list).
    * a.k.a. '''hyperparameter tuning'''
    * Using the median likely value of κ, run a model for each combination of discrete likely values of α and β.
    * Minimize for perplexity and take the optimal pair.
    * e.g. `text2vec::perplexity`
 2. Run models with varying κ values.
    * Using the optimal (α, β) pair, run a model for each likely κ value.
 3. Select the κ value that gives the best trade-off of minimizing perplexity, maximizing coherence, and maximizing exclusivity.
    * e.g. `text2vec::perplexity`, `textmineR::CalcProbCoherence`, and `topicdoc::topic_exclusivity`
 4. Examine the topics.
    * e.g. `textmineR::SummarizeTopics`
    * γ terms are the most exclusive tokens for each topic.
    * φ terms are the most common tokens for each topic.
    * If these sets of tokens reveal that the modeled topics are insufficient, go back and select a different κ value.
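
A hedged sketch of the tuning loop in steps 1-4, reusing the `dtm` from the sketches above; the candidate grids, iteration counts, and the final choice of κ are placeholders, and exclusivity scoring is omitted from the code:

{{{
library(textmineR)
library(text2vec)

# Step 1: fix a median likely κ and grid-search (α, β), minimizing perplexity
grid <- expand.grid(alpha = c(0.05, 0.1, 0.5), beta = c(0.01, 0.05, 0.1))

grid$perplexity <- apply(grid, 1, function(row) {
  m <- FitLdaModel(dtm = dtm, k = 10,
                   alpha = row[["alpha"]], beta = row[["beta"]],
                   iterations = 500, burnin = 200)
  perplexity(dtm,
             topic_word_distribution = m$phi,
             doc_topic_distribution  = m$theta)
})

best <- grid[which.min(grid$perplexity), ]

# Step 2: with the optimal (α, β) pair, fit a model for each likely κ
ks <- c(5, 10, 15, 20)
models <- lapply(ks, function(k) {
  FitLdaModel(dtm = dtm, k = k,
              alpha = best$alpha, beta = best$beta,
              iterations = 500, burnin = 200)
})

# Step 3: compare perplexity and mean probabilistic coherence across κ
scores <- data.frame(
  k = ks,
  perplexity = sapply(models, function(m)
    perplexity(dtm, topic_word_distribution = m$phi,
               doc_topic_distribution = m$theta)),
  coherence = sapply(models, function(m)
    mean(CalcProbCoherence(phi = m$phi, dtm = dtm)))
)

# Step 4: examine the top φ and γ terms of the chosen model
chosen <- models[[2]]   # placeholder: suppose κ = 10 won the trade-off
SummarizeTopics(chosen)
}}}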
