Differences between revisions 2 and 3

NLP Data Preparation

NLP methods generally rely on a data frame that is prepared in a specific format.

Terminology

A document is a collection of phrases, words, or terms. A corpus is a collection of documents.

The most common data frame format is the Document-Term Matrix (DTM): rows are documents, columns are words/terms, and each cell typically holds the count of that term in that document.
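
As a minimal sketch of the DTM layout, the matrix can be built with the standard library alone. The two-document toy corpus, the naive whitespace tokenization, and the use of raw counts as cell values are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each string is one document.
corpus = [
    "the cat sat on the mat",
    "the dog sat",
]

# Count terms per document (naive whitespace tokenization, an assumption).
counts = [Counter(doc.split()) for doc in corpus]

# Columns of the DTM: the sorted vocabulary across all documents.
vocabulary = sorted(set().union(*counts))

# Rows of the DTM: one list of term counts per document.
dtm = [[c[term] for term in vocabulary] for c in counts]

for row in dtm:
    print(row)
```

Each row of `dtm` corresponds to one document and each column to one entry of `vocabulary`, so a term that is absent from a document appears as 0.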

Typical data preparation involves:

1. Text cleaning
- Converting words to lowercase
- Expanding contractions and shorthand
- Correcting typos
- Removing non-words
2. Tokenization
- Ideally with lemmatization, though stemming can be valid
- Removing stop words
3. Collection into documents
- A common implementation of steps 1 and 2 uses a data frame where each row is a word/token; these rows must be collected back into a DTM.
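
The preparation steps above can be sketched end to end as follows. This is an illustrative outline, not a reference implementation: the `CONTRACTIONS` map, the `STOP_WORDS` set, the helper names, and the sample corpus are all hypothetical, and typo correction and lemmatization are omitted because they normally require a dedicated library.

```python
import re
from collections import Counter

# Illustrative assumptions: a tiny contraction map and stop word list.
CONTRACTIONS = {"don't": "do not", "it's": "it is"}
STOP_WORDS = {"the", "a", "is", "it", "do", "on"}


def clean(text):
    """Text cleaning: lowercase, expand contractions, drop non-word characters."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return re.sub(r"[^a-z\s]", " ", text)


def tokenize(text):
    """Tokenization: split on whitespace and remove stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def to_long_format(corpus):
    """Cleaning and tokenization as a token-per-row table of (doc_id, token)."""
    return [
        (doc_id, tok)
        for doc_id, text in enumerate(corpus)
        for tok in tokenize(clean(text))
    ]


def collect_dtm(rows, n_docs):
    """Collection: gather the token rows back into a document-term matrix."""
    counts = Counter(rows)
    vocabulary = sorted({tok for _, tok in rows})
    dtm = [[counts[(d, t)] for t in vocabulary] for d in range(n_docs)]
    return vocabulary, dtm


corpus = ["It's a cat, the cat sat!", "Dogs don't sit on cats."]
rows = to_long_format(corpus)
vocabulary, dtm = collect_dtm(rows, len(corpus))
```

Because no lemmatization is applied here, `cat` and `cats` end up in separate columns, which is the kind of fragmentation that lemmatization or stemming is meant to reduce.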

Statistics/NaturalLanguageProcessingDataPreparation (last edited 2025-01-10 16:22:19 by DominicRicottone)

- Revision 2 as of 2025-01-10 16:21:53
  Size: 1043
  Editor: DominicRicottone
  Comment: Killing NaturalLanguageProcessing page
+ Revision 3 as of 2025-01-10 16:22:19
  Size: 955
  Editor: DominicRicottone
  Comment: Killing NaturalLanguageProcessing page
 Line 1:
-## page was renamed from NaturalLanguageProcessing/DataPreparation
= Data Preparation =
+= NLP Data Preparation =
-Line 4:
+Line 3:
-Natural language processing methods generally rely on a data frame that is prepared in a specific format.
+NLP methods generally rely on a data frame that is prepared in a specific format.