
Bag of Words Model

A bag of words model represents each document by the counts of the words it contains.


Data Structure

Rows are documents and columns are words or phrases. Because any one document uses only a small share of the full vocabulary, most cells are zero, so this is inevitably a sparse matrix.
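
As a rough illustration of this layout, the sketch below builds the same counts twice: once as a dense list of rows and once as a sparse list of dictionaries that keep only nonzero cells. The toy corpus and the names docs, vocabulary, dense and sparse are assumptions made for the example, not part of any particular library.

```python
from collections import Counter

# Hypothetical toy corpus; each document is already split into words.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "down"],
    ["a", "bird", "sang"],
]

# Columns: the sorted vocabulary across all documents.
vocabulary = sorted({word for doc in docs for word in doc})

# Dense form: one row per document, one column per vocabulary word.
dense = [[doc.count(word) for word in vocabulary] for doc in docs]

# Sparse form: each row stores only its nonzero counts.
sparse = [dict(Counter(doc)) for doc in docs]

total_cells = len(docs) * len(vocabulary)
nonzero_cells = sum(len(row) for row in sparse)
print(vocabulary)
print(dense)
print(f"{nonzero_cells} of {total_cells} cells are nonzero")
```

Even with three short documents, fewer than half of the cells hold a count. With a realistic vocabulary the share of zeros grows quickly, which is why a sparse representation is usually preferred.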


Implementation

First, the documents must be tokenized into words and phrases.

Next, the count of each keyword or phrase is extracted from each document.

Finally, the output matrix must be interpreted. Semantic meaning is lost, so care must be taken to ensure that the source documents share a common context.
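
The three steps above can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the tokenize and bag_of_words helpers are hypothetical names, the tokenizer handles single words only (not phrases), and it simply lowercases the text and keeps runs of letters and apostrophes.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for missing words, so absent words become zero cells.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


# Two sentences with opposite meanings but the same words.
documents = [
    "The dog bit the man.",
    "The man bit the dog.",
]
vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Both documents produce the same row of counts even though they describe opposite events. That is the loss of semantic meaning in practice: the matrix records which words occur and how often, not how they relate to one another.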


CategoryRicottone

Statistics/BagOfWordsModel (last edited 2025-01-10 16:19:13 by DominicRicottone)