Revision 1 as of 2023-12-07 18:18:34
Data Preparation
Natural language processing methods generally rely on a data frame that is prepared in a specific format.
Terminology
A document is a collection of phrases, words, or terms. A corpus is a collection of documents.
The most common data frame formats are:
- Document-Term Matrix (DTM): rows are documents and columns are words/terms.
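To make the DTM layout concrete, here is a minimal sketch using only the Python standard library. The toy corpus and the names `corpus`, `vocabulary`, and `dtm` are assumptions chosen for illustration; a real project would more likely use a sparse matrix from a dedicated library.

```python
from collections import Counter

# Hypothetical toy corpus: each document is already split into terms.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
]

# Columns: the sorted vocabulary across the whole corpus.
vocabulary = sorted({term for document in corpus for term in document})

# Rows: one count vector per document, aligned to the vocabulary.
dtm = []
for document in corpus:
    counts = Counter(document)
    dtm.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

Each row of `dtm` corresponds to one document and each column to one term in `vocabulary`, so a cell holds the number of times that term occurs in that document.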
Implementation
Typical data preparation involves:
- Text cleaning
  - Converting words to lowercase
  - Expanding contractions and shorthand
  - Correcting typos
  - Removing non-words
- Tokenization
- Lemmatization (ideally), although stemming can be a valid alternative
- Removal of stop words
- Collection into documents
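The cleaning, tokenization, and stop-word steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the `CONTRACTIONS` and `STOP_WORDS` tables are tiny hypothetical stand-ins, only straight apostrophes are handled, and typo correction and lemmatization are omitted because they normally need a dictionary or an NLP library.

```python
import re

# Illustrative lookup tables (assumptions); real projects need larger ones.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOP_WORDS = {"the", "a", "is", "it", "do", "and"}


def clean_text(text):
    """Lowercase, expand contractions, and drop non-word characters."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace anything that is not a letter or whitespace with a space.
    return re.sub(r"[^a-z\s]", " ", text)


def tokenize(text):
    """Split cleaned text on runs of whitespace."""
    return text.split()


def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in STOP_WORDS]


def prepare(text):
    """Run cleaning, tokenization, and stop-word removal in order."""
    return remove_stop_words(tokenize(clean_text(text)))


print(prepare("It's 42 degrees, and the cat DON'T care!"))
```

Contractions are expanded before punctuation is stripped, because removing the apostrophe first would leave fragments such as "don t" that no longer match the lookup table.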
A common implementation of the text cleaning and tokenization steps uses a data frame in which each row is a word/token; these rows must then be collected back into a DTM.
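Collecting a one-token-per-row table back into a DTM might look like the sketch below. The rows are represented as plain `(doc_id, token)` tuples rather than a real data frame, and the names `token_rows`, `counts_by_doc`, and `doc_ids` are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical long-format table: one (doc_id, token) row per token.
token_rows = [
    ("doc1", "cat"),
    ("doc1", "sat"),
    ("doc2", "dog"),
    ("doc2", "sat"),
    ("doc2", "sat"),
]

# Group the rows by document and count each token within it.
counts_by_doc = defaultdict(Counter)
for doc_id, token in token_rows:
    counts_by_doc[doc_id][token] += 1

# Fix the row and column order, then fill in the counts.
doc_ids = sorted(counts_by_doc)
vocabulary = sorted({token for _, token in token_rows})
dtm = [[counts_by_doc[doc_id][term] for term in vocabulary] for doc_id in doc_ids]

print(vocabulary)
for doc_id, row in zip(doc_ids, dtm):
    print(doc_id, row)
```

If the rows live in a pandas data frame with `doc_id` and `token` columns (an assumption about the setup), `pandas.crosstab(df["doc_id"], df["token"])` should yield the same counts.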