= NLP Data Preparation =

Natural language processing (NLP) methods generally rely on a data frame prepared in a specific format.

<<TableOfContents>>

----



== Terminology ==

A '''document''' is a collection of phrases, words, or terms. A '''corpus''' is a collection of documents.

The most common data frame format is:

 * '''Document-Term Matrix''' ('''DTM'''): rows are ''documents'' and columns are ''words/terms''; each cell typically holds the number of times that term occurs in that document.
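
As an illustration of the DTM layout, the following is a minimal sketch using only the Python standard library. The toy corpus and the helper name `build_dtm` are hypothetical, and splitting on whitespace stands in for real tokenization.

```python
from collections import Counter

# Hypothetical toy corpus: each string is one document.
corpus = [
    "the cat sat",
    "the dog sat down",
    "the cat saw the dog",
]


def build_dtm(documents):
    """Return (vocabulary, matrix) where matrix[i][j] is the count of
    vocabulary[j] in documents[i]. Whitespace splitting is a stand-in
    for a proper tokenizer."""
    counts = [Counter(doc.split()) for doc in documents]
    # Columns are the sorted set of all terms seen in any document.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


vocabulary, dtm = build_dtm(corpus)
print(vocabulary)
for row in dtm:
    print(row)
```

Each row of `dtm` corresponds to one document and each column to one term in `vocabulary`. Libraries commonly store the same structure as a sparse matrix, since most cells are zero for a realistic corpus.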

----



== Implementation ==

Typical data preparation involves:

 1. Text cleaning
   * Converting words to lowercase
   * Expanding contractions and shorthand
   * Correcting typos
   * Removing non-words
 2. Tokenization
   * Ideally lemmatization, but stemming can be valid
   * Removing stop words
 3. Collection into documents
   * A common implementation of steps 1 and 2 uses a data frame where each row is a ''word/token''; these rows must then be collected back into a DTM.
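
The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not a complete pipeline: the `CONTRACTIONS` and `STOP_WORDS` tables are tiny hypothetical samples, the function names are made up for this example, and typo correction and lemmatization/stemming are omitted because they normally depend on an external library.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny lookup tables; real ones are much larger.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOP_WORDS = {"the", "a", "is", "it", "do", "not", "and"}


def clean(text):
    """Step 1: lowercase, expand contractions, drop non-letter characters."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Anything that is not a-z or whitespace becomes a space.
    return re.sub(r"[^a-z\s]", " ", text)


def tokenize(text):
    """Step 2: split on whitespace and remove stop words."""
    return [tok for tok in clean(text).split() if tok not in STOP_WORDS]


def to_long_rows(documents):
    """Long format: one (doc_id, token) pair per row."""
    return [
        (doc_id, tok)
        for doc_id, text in enumerate(documents)
        for tok in tokenize(text)
    ]


def collect_dtm(rows):
    """Step 3: collect one-token-per-row data back into a DTM."""
    counts = defaultdict(Counter)
    for doc_id, tok in rows:
        counts[doc_id][tok] += 1
    doc_ids = sorted(counts)
    vocabulary = sorted({tok for _, tok in rows})
    matrix = [[counts[d][t] for t in vocabulary] for d in doc_ids]
    return doc_ids, vocabulary, matrix


documents = ["It's a cat!", "Don't pet the cat, pet the dog 2 times."]
rows = to_long_rows(documents)
doc_ids, vocabulary, dtm = collect_dtm(rows)
print(rows)
print(vocabulary)
print(dtm)
```

Note that in this sketch a document whose tokens are all removed during cleaning produces no rows, so it is absent from the resulting DTM; whether to keep such documents as all-zero rows is a design choice.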



----
CategoryRicottone