Revision 3 as of 2025-01-10 16:22:19
NLP Data Preparation
NLP methods generally rely on a data frame that is prepared in a specific format.
Terminology
A document is a collection of phrases, words, or terms. A corpus is a collection of documents.
The most common data frame formats are:
Document-Term Matrix (DTM): each row is a document and each column is a word/term; each cell typically holds the count of that term in that document.
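As an illustration of the DTM layout, here is a minimal sketch in Python using only the standard library. The toy corpus, the whitespace tokenization, and the variable names are assumptions made for the example, not part of any particular NLP package.

```python
from collections import Counter

# Toy corpus: each string is one document (illustrative data only).
corpus = [
    "the cat sat on the mat",
    "the dog sat",
]

# Naive whitespace tokenization, just to keep the sketch short.
tokenized = [doc.split() for doc in corpus]

# Vocabulary: the sorted set of all terms; this fixes the column order.
vocabulary = sorted({term for doc in tokenized for term in doc})

# One row per document, one column per term; cells are term counts.
dtm = []
for doc in tokenized:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

Each row of `dtm` has the same length as `vocabulary`, so a term that is absent from a document simply gets a count of 0.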
Implementation
Typical data preparation involves:
1. Text cleaning
   - Converting words to lowercase
   - Expanding contractions and shorthand
   - Correcting typos
   - Removing non-words
2. Tokenization
   - Ideally lemmatization, but stemming can be valid
   - Removing stop words
3. Collection into documents
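The preparation steps listed above can be sketched roughly as follows. This is a simplified illustration, not a complete pipeline: the `CONTRACTIONS` and `STOP_WORDS` tables and the function names are hypothetical, and typo correction and lemmatization are left out because they normally need a dictionary or an external library.

```python
import re

# Hypothetical, deliberately tiny lookup tables; real projects use fuller lists.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOP_WORDS = {"the", "a", "is", "it", "do", "not", "on"}


def clean_text(text):
    """Text cleaning: lowercase, expand contractions, drop non-word characters."""
    text = text.lower()
    # Naive substring replacement; assumes straight apostrophes in the input.
    for short, expanded in CONTRACTIONS.items():
        text = text.replace(short, expanded)
    # Keep lowercase letters and whitespace; everything else becomes a space.
    return re.sub(r"[^a-z\s]", " ", text)


def tokenize(text):
    """Tokenization: split on whitespace, then remove stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def prepare(corpus):
    """Collection into documents: keep one token list per document."""
    return [tokenize(clean_text(doc)) for doc in corpus]


docs = prepare([
    "It's raining; don't forget the umbrella!",
    "The cat sat on the mat 2 times.",
])
print(docs)
```

Contractions are expanded before punctuation is stripped, since removing the apostrophe first would leave fragments such as "don t" that no longer match the lookup table.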
A common implementation of steps 1 and 2 (text cleaning and tokenization) uses a data frame in which each row is a word/token; these rows must then be collected back into a DTM.
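Collecting one-token-per-row data back into a DTM might look like the sketch below. It assumes the rows are available as (document id, token) pairs; the `token_rows` data and the identifiers are made up for the example, and a data frame library would typically do the same with a group-and-count or pivot operation.

```python
from collections import Counter, defaultdict

# One row per token, as (document id, token) pairs; illustrative data only.
token_rows = [
    ("doc1", "cat"), ("doc1", "sat"), ("doc1", "cat"),
    ("doc2", "dog"), ("doc2", "sat"),
]

# Count how often each token occurs within each document.
counts_by_doc = defaultdict(Counter)
for doc_id, token in token_rows:
    counts_by_doc[doc_id][token] += 1

# Fix the row order (documents) and the column order (terms).
doc_ids = sorted(counts_by_doc)
vocabulary = sorted({token for _, token in token_rows})

# Rows are documents, columns are terms; missing terms count as 0.
dtm = [[counts_by_doc[d][t] for t in vocabulary] for d in doc_ids]

print(vocabulary)
for doc_id, row in zip(doc_ids, dtm):
    print(doc_id, row)
```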