
NLP Data Preparation

NLP methods generally rely on a data frame that is prepared in a specific format.


Terminology

A document is a collection of phrases, words, or terms. A corpus is a collection of documents.

Common data frame formats include:

  • Document-Term Matrix (DTM): rows are documents and columns are words/terms; each cell holds how often that term occurs in that document.
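
As a minimal sketch of this format, assuming each document has already been split into terms, a DTM of raw counts can be built with the Python standard library alone. The function name build_dtm and the sample corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter


def build_dtm(corpus):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # A sorted vocabulary gives a stable column order.
    vocabulary = sorted({term for document in corpus for term in document})
    matrix = []
    for document in corpus:
        counts = Counter(document)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical corpus: each document is a list of terms.
corpus = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "dog"],
]
vocabulary, dtm = build_dtm(corpus)
print(vocabulary)
for row in dtm:
    print(row)
```

Each row of the result corresponds to one document and each column to one entry of the vocabulary. Real corpora usually call for a sparse representation, since most cells are zero.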


Implementation

Typical data preparation involves:

  1. Text cleaning
    • Converting words to lowercase
    • Expanding contractions and shorthand
    • Correcting typos
    • Removing non-words
  2. Tokenization
    • Ideally with lemmatization, though stemming can be a valid alternative
    • Removing stop words
  3. Collection into documents
    • A common implementation of steps 1 and 2 uses a data frame where each row is a word/token; these rows must then be collected back into a DTM.
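
The steps above can be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not a definitive pipeline: the CONTRACTIONS and STOP_WORDS tables are tiny hypothetical stand-ins for real word lists, all function names are invented for this example, contraction expansion is a naive substring replacement, and typo correction and lemmatization/stemming are omitted because they normally depend on an external library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lookup tables (not standard lists).
CONTRACTIONS = {"don't": "do not", "it's": "it is"}
STOP_WORDS = {"the", "a", "is", "it", "do", "not", "on"}


def clean(text):
    """Step 1: lowercase, expand contractions, remove non-words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace every character that is not a letter or whitespace with a space.
    return re.sub(r"[^a-z\s]", " ", text)


def tokenize(text):
    """Step 2: split on whitespace and drop stop words."""
    return [token for token in clean(text).split() if token not in STOP_WORDS]


def to_token_rows(documents):
    """Long format: one (document id, token) row per token."""
    return [
        (doc_id, token)
        for doc_id, text in enumerate(documents)
        for token in tokenize(text)
    ]


def collect_dtm(rows, n_documents):
    """Step 3: collect token rows back into a document-term matrix."""
    counts = Counter(rows)  # counts each (document id, token) pair
    vocabulary = sorted({token for _, token in rows})
    matrix = [
        [counts[(doc_id, token)] for token in vocabulary]
        for doc_id in range(n_documents)
    ]
    return vocabulary, matrix


# Hypothetical two-document corpus.
documents = ["The cat sat on the mat.", "It's a dog; don't pet the dog!"]
rows = to_token_rows(documents)
vocabulary, dtm = collect_dtm(rows, len(documents))
print(rows)
print(vocabulary)
print(dtm)
```

Keeping the intermediate one-row-per-token structure makes it easy to filter or transform tokens before the final collection step, at the cost of an extra pass over the data.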


CategoryRicottone

Statistics/NaturalLanguageProcessingDataPreparation (last edited 2025-01-10 16:22:19 by DominicRicottone)