NLP Data Preparation

NLP methods generally rely on a data frame that is prepared in a specific format.


Terminology

A document is a collection of phrases, words, or terms. A corpus is a collection of documents.

The most common data frame formats are:


Implementation

Typical data preparation involves:

  1. Text cleaning
    • Converting words to lowercase
    • Expanding contractions and shorthands
    • Correcting typos
    • Removing non-words
  2. Tokenization
    • Reducing tokens to a base form, ideally by lemmatization, though stemming can be acceptable
    • Removing stop words
  3. Collection into documents
    • A common implementation of steps 1 and 2 uses a data frame where each row is a single word/token; these rows must then be collected back into a document-term matrix (DTM).
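
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the CONTRACTIONS and STOP_WORDS tables, the function names, and the two-sentence corpus are all hypothetical, and typo correction and lemmatization are left out because they normally rely on an external library. The one-token-per-row format is represented as a list of (doc_id, token) tuples rather than an actual data frame.

```python
import re
from collections import Counter

# Hypothetical lookup tables for illustration; real ones are much larger.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOP_WORDS = {"the", "a", "is", "it", "and"}


def clean(text):
    """Step 1: lowercase, expand contractions, and drop non-letters."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace anything that is not a letter or whitespace with a space.
    return re.sub(r"[^a-z\s]", " ", text)


def tokenize(text):
    """Step 2: split on whitespace and remove stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def to_rows(corpus):
    """Steps 1 and 2 as a long format: one (doc_id, token) row per token."""
    return [
        (doc_id, tok)
        for doc_id, doc in enumerate(corpus)
        for tok in tokenize(clean(doc))
    ]


def to_dtm(rows, n_docs):
    """Step 3: collect token rows into a document-term matrix of counts."""
    counts = Counter(rows)
    vocab = sorted({tok for _, tok in rows})
    dtm = [[counts[(doc_id, tok)] for tok in vocab] for doc_id in range(n_docs)]
    return vocab, dtm


corpus = ["The cat can't sleep.", "It's a cat and a dog!"]
rows = to_rows(corpus)
vocab, dtm = to_dtm(rows, n_docs=len(corpus))

print(vocab)  # ['cannot', 'cat', 'dog', 'sleep']
print(dtm)    # [[1, 1, 0, 1], [0, 1, 1, 0]]
```

Each row of dtm corresponds to one document and each column to one term in vocab. Passing n_docs explicitly keeps a row for any document whose tokens were all removed as stop words.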



Statistics/NaturalLanguageProcessingDataPreparation (last edited 2025-01-10 16:22:19 by DominicRicottone)