Differences between revisions 1 and 2

Data Preparation

Natural language processing methods generally rely on a data frame that is prepared in a specific format.

Terminology

A document is a collection of phrases, words, or terms. A corpus is a collection of documents.

The most common data frame format is:

Document-Term Matrix (DTM): rows are documents and columns are words/terms.
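
As an illustration of the DTM layout, here is a minimal sketch using only the Python standard library. The toy corpus and the names `corpus`, `counts`, `vocab`, and `dtm` are assumptions made for this example, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

# Toy corpus (hypothetical): each string is one document.
corpus = ["the cat sat", "the dog sat on the cat"]

# Count how often each term occurs in each document.
counts = [Counter(doc.split()) for doc in corpus]

# Vocabulary: every distinct term, sorted so column order is stable.
vocab = sorted(set().union(*counts))

# DTM: one row per document, one column per term, cells are term counts.
dtm = [[c[term] for term in vocab] for c in counts]

for doc, row in zip(corpus, dtm):
    print(row, "<-", doc)
```

Each row of `dtm` lines up with one document and each column with one entry of `vocab`. In practice a sparse matrix is usually preferred, since most cells are zero.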

Typical data preparation involves:

1. Text cleaning
- Converting words to lowercase
- Expanding contractions and shorthand
- Correcting typos
- Removing non-words
2. Tokenization
- Ideally with lemmatization, though stemming can be valid
- Removing stop words
3. Collection into documents
- A common implementation of steps 1 and 2 uses a data frame where each row is a word/token; these rows must then be collected back into a DTM.
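
The steps above can be sketched end to end with the standard library alone. This is a simplified illustration, not a recommended pipeline: the `CONTRACTIONS` and `STOP_WORDS` tables, the `clean` and `tokenize` helpers, and the sample documents are all hypothetical. Typo correction and lemmatization/stemming are omitted, because they normally require a dedicated library.

```python
import re
from collections import Counter

# Hypothetical lookup tables; real projects would use much fuller lists.
CONTRACTIONS = {"don't": "do not", "it's": "it is"}
STOP_WORDS = {"the", "a", "is", "it", "do", "not", "on"}

def clean(text):
    """Step 1: lowercase, expand contractions, drop non-letter characters."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace anything other than letters and whitespace with a space.
    return re.sub(r"[^a-z\s]", " ", text)

def tokenize(text):
    """Step 2: split cleaned text on whitespace and remove stop words."""
    return [tok for tok in clean(text).split() if tok not in STOP_WORDS]

docs = ["The cat sat.", "It's 2 dogs; don't sit on the cat!"]

# Long format: one (document id, token) row per token.
rows = [(i, tok) for i, doc in enumerate(docs) for tok in tokenize(doc)]

# Step 3: collect the token rows back into a document-term matrix.
vocab = sorted({tok for _, tok in rows})
counts = Counter(rows)
dtm = [[counts[(i, term)] for term in vocab] for i in range(len(docs))]
```

Here `rows` plays the role of the one-token-per-row data frame, and `dtm` is the result of collecting it back into one row per document.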

Statistics/NaturalLanguageProcessingDataPreparation (last edited 2025-01-10 16:22:19 by DominicRicottone)

- Revision 1 as of 2023-12-07 18:18:34
  Size: 975
  Editor: DominicRicottone
  Comment: Initial
+ Revision 2 as of 2025-01-10 16:21:53
  Size: 1043
  Editor: DominicRicottone
  Comment: Killing NaturalLanguageProcessing page
 Line 1:
+## page was renamed from NaturalLanguageProcessing/DataPreparation