= NLP Data Preparation =

NLP methods generally rely on a data frame that is prepared in a specific format.

<<TableOfContents>>

----



== Terminology ==

A '''document''' is a collection of phrases, words, or terms.

A '''corpus''' is a collection of documents.

The most common data frame formats are:

 * '''Document-Term Matrix''' ('''DTM'''): rows are ''documents'' and columns are ''words/terms''.

----



== Implementation ==

Typical data preparation involves:

 1. Text cleaning
  * Converting words to lowercase
  * Expanding contractions and shorthand
  * Correcting typos
  * Removing non-words
 2. Tokenization
  * Ideally with lemmatization, though stemming can be valid
  * Removing stop words
 3. Collection into documents
  * A common implementation of steps 1 and 2 uses a data frame where each row is a ''word/token''; these rows must then be collected back into a DTM.

----
CategoryRicottone
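
The three preparation steps listed under Implementation can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not a recommended pipeline: the `CONTRACTIONS` and `STOP_WORDS` tables and the function names `clean`, `tokenize`, and `build_dtm` are hypothetical, typo correction and lemmatization are skipped, and the DTM is held as a list of rows rather than a data frame.

```python
import re
from collections import Counter

# Hypothetical lookup tables, kept tiny for illustration only.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOP_WORDS = {"a", "an", "the", "is", "it", "do", "not", "and", "of", "to"}


def clean(text):
    """Step 1: lowercase, expand contractions, drop non-word characters."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace anything that is not a letter or whitespace with a space.
    return re.sub(r"[^a-z\s]", " ", text)


def tokenize(text):
    """Step 2: split on whitespace and remove stop words (no lemmatization)."""
    return [tok for tok in clean(text).split() if tok not in STOP_WORDS]


def build_dtm(corpus):
    """Step 3: collect tokens into a DTM (rows = documents, columns = terms)."""
    counts = [Counter(tokenize(doc)) for doc in corpus]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


corpus = ["Cats don't like the dog.", "It's a dog, and the dog barks!"]
vocab, dtm = build_dtm(corpus)
```

Here `vocab` holds the column labels and each row of `dtm` holds the term counts for one document of `corpus`, in the same order.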