NLP Data Preparation
NLP methods generally rely on a data frame that is prepared in a specific format.
Terminology
A document is a collection of phrases, words, or terms. A corpus is a collection of documents.
The most common data frame formats are:
Document-Term Matrix (DTM): rows are documents and columns are words/terms.
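As a minimal sketch of the DTM layout, the following builds one from a toy corpus using only the Python standard library. The corpus and the variable names (`corpus`, `vocabulary`, `dtm`) are illustrative assumptions, and raw term counts are only one possible choice of cell value.

```python
from collections import Counter

# Hypothetical toy corpus: each document is already split into terms.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
]

# Columns: the sorted vocabulary across the whole corpus.
vocabulary = sorted({term for doc in corpus for term in doc})

# Rows: one list of term counts per document, aligned with the vocabulary.
dtm = []
for doc in corpus:
    counts = Counter(doc)  # missing terms count as 0
    dtm.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

Each row of `dtm` corresponds to one document and each position in a row corresponds to one term in `vocabulary`.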
Implementation
Typical data preparation involves:
1. Text cleaning
   - Converting words to lowercase
   - Expanding contractions and shorthand
   - Correcting typos
   - Removing non-words
2. Tokenization
   - Lemmatization, ideally, although stemming can be a valid alternative
   - Removal of stop words
3. Collection into documents
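The cleaning and tokenization steps above can be sketched roughly as follows. This is a simplified illustration, not a complete pipeline: the `CONTRACTIONS` and `STOP_WORDS` tables and the `prepare` function are hypothetical, and typo correction and lemmatization are left out because they normally depend on an external dictionary or NLP library.

```python
import re

# Hypothetical lookup tables; real projects would use much larger ones.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOP_WORDS = {"the", "a", "is", "it", "not", "will"}

def prepare(text):
    """Clean and tokenize one document, returning its remaining tokens."""
    # Text cleaning: lowercase first so the contraction table matches.
    text = text.lower()
    # Text cleaning: expand contractions by simple substring replacement.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Removal of non-words: keep only ASCII letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Tokenization: split on runs of whitespace.
    tokens = text.split()
    # Removal of stop words.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = prepare("It's 2024: the cat WON'T sit!")
print(tokens)
```

The order matters in this sketch: contractions are expanded before non-letters are stripped, since stripping the apostrophe first would leave fragments such as "won" and "t".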
A common implementation of steps 1 and 2 (text cleaning and tokenization) uses a data frame in which each row is a single word/token; these rows must then be collected back into a DTM.
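A rough sketch of that final collection step is shown below, using plain tuples in place of a data frame so that it runs with the standard library alone. The `rows` table and its document ids are made-up illustration data.

```python
from collections import Counter

# Hypothetical token-per-row table: one (document id, token) pair per row.
rows = [
    ("doc1", "cat"), ("doc1", "sit"),
    ("doc2", "dog"), ("doc2", "sit"), ("doc2", "sit"),
]

# Count how often each (document, token) pair occurs.
pair_counts = Counter(rows)

doc_ids = sorted({doc for doc, _ in rows})
terms = sorted({tok for _, tok in rows})

# Pivot to wide form: one row per document, one column per term.
dtm = {doc: [pair_counts[(doc, term)] for term in terms] for doc in doc_ids}

print(terms)
for doc, counts in dtm.items():
    print(doc, counts)
```

With an actual data frame library the same pivot is typically a single cross-tabulation call, for example `pandas.crosstab` on the document and token columns.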