Differences between revisions 12 and 13

Aggregating Data with SPSS

SPSS offers several commands for computing aggregated statistics and translating datasets into aggregated formats.

Contents

Aggregating Data with SPSS
1. Statistics
2. Wide and Long Data

Statistics

Scratch variables can be used to compute aggregated statistics.

compute #TotalSales = sum(Sales, #TotalSales).
compute Total_Sales = #TotalSales.

The LEAVE command can be used in a similar manner.

compute Total_Sales = sum(Sales, Total_Sales).
leave Total_Sales.

The AGGREGATE command creates a new dataset of aggregated statistics.

aggregate outfile=* mode=addvariables
  /Total_Sales = sum(Sales).

Additionally it allows for group variables on the /BREAK subcommand.

aggregate outfile=* mode=addvariables
  /break=clientid
  /Total_Sales = sum(Sales).

Wide and Long Data

The CASESTOVARS command translates a long dataset into wide format. If the dataset already has an index variable for the within-group sequence, specify it on the /INDEX subcommand.

casestovars
  /id=clientid
  /index=fiscalquarter.

Otherwise variables will be spread into an unknowable number of sequentially-named variables.

If case-wise descriptive statistics are all that is desired from the translation, consider instead using the AGGREGATE command.

dataset declare clients.
aggregate
  /outfile="clients"
  /break=clientid
  /count=N.

The VARSTOCASES command translates a wide dataset into long format.

varstocases
  /make Sales from Sales.1 to Sales.4
  /index=fiscalquarter.

CategoryRicottone

-  ⇤ ← Revision 12 as of 2023-01-13 23:31:36 → 
  Size: 5177
  Editor: DominicRicottone
  Comment:
+   ← Revision 13 as of 2023-06-09 17:52:50 → ⇥
  Size: 1765
  Editor: DominicRicottone
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 2:
+SPSS offers several commands for computing aggregated statistics and translating datasets into aggregated formats.
-Line 9:
+Line 11:
-== Calculating higher-level values ==
+== Statistics ==
-Line 11:
+Line 13:
-=== Scratch Variables ===

To compute total sales from a dataset of sales, try:
+[[SPSS/ScratchVariables|Scratch variables]] can be used to compute aggregated statistics.
-Line 22:
+Line 20:
-For more details on using scratch variables, see [[SPSS/ScratchVariables|here]].



=== Leave ===

The '''`LEAVE`''' command explicitly suppresses re-initialization for a given variable. In all other contexts, these are normal variables with normal behavior.

Given sales data, to compute total sales, try:
+The [[SPSS/Leave|LEAVE]] command can be used in a similar manner.
-Line 37:
+Line 27:
-=== Aggregate ===

The '''`AGGREGATE`''' command can be used to calculate higher-level variables.

Given sales data, to compute total sales, try:
+The [[SPSS/Aggregate|AGGREGATE]] command creates a new dataset of aggregated statistics.
-Line 50:
+Line 34:
-For more complex analysis, it may be necessary to aggregate data into a new dataset and ''then'' merge a higher-level variable into the old dataset. To identify duplicate cases, try:
+Additionally it allows for group variables on the `/BREAK` subcommand.
-Line 53:
+Line 37:
-dataset activate Dataset1.
sort cases by KEYVARLIST.
dataset declare DUPLICATES.
aggregate
  /outfile="DUPLICATES"
  /presorted
  /break=KEYVARLIST
  /COUNT=N.
execute.

dataset activate DUPLICATES.
select if COUNT>1.
execute.

dataset activate Dataset1.
match files
 /file=*
 /file=DUPLICATES
 /by KEYVARLIST.
execute.
frequencies COUNT.
+aggregate outfile=* mode=addvariables
  /break=clientid
  /Total_Sales = sum(Sales).
-Line 75:
+Line 41:
-Note: if the calculated variables may collide with pre-existing variable names, use the `OVERWRITE=YES` subcommand.

For details on using the `AGGREGATE` command, see [[SPSS/Aggregate|here]].
-Line 84:
+Line 46:
-== Aggregating Cases ==
+== Wide and Long Data ==
-Line 86:
+Line 48:
-=== Aggregate ===

The '''`AGGREGATE`''' command can be used to create a higher-level dataset.

Given state-level sales data, to create region-level sales data, try:

{{{
dataset declare NEW.
aggregate outfile=NEW
  /break=Region
  /Total_Sales=sum(Sale).
}}}

For details on using the `AGGREGATE` command, see [[SPSS/Aggregate|here]].



=== Transforming Long to Wide ===

The '''`CASESTOVARS`''' command transforms data from long to wide format. An index variable identifies the cases that should be aggregated.

Given region-level sales data from multiple years, try:

{{{
dataset activate SALES_Y1.
compute Year=1.

dataset activate SALES_Y2.
compute Year=2.

dataset activate SALES_Y3.
compute Year=3.

add files /file=SALES_Y1 /file=SALES_Y2 /file=SALES_Y3.
sort cases by Region.
formats Year (F1).
dataset name SALES.
casestovars
 /id=Region
 /index=Year.
}}}

The ID variable (`Region`) becomes a unique identifier for each row of data. The index variable (`Year`) is dropped, and all other variables are transformed according to the index variable. For example, the `Total_Sales` column becomes a set of columns, one for each year (`Total_Sales.1`, `Total_Sales.2`, `Total_Sales.3`).

The index variables can be either numeric or string, but all cases must have a non-missing value.



==== Constants ====

If a variable does not vary within the new rows of data, they should instead be treated as constants and should not be transformed. Constants should be specified on the `/FIXED` subcommand. SPSS will automatically inspect variables and warn about any that appear to be constants but were not specified as such.

Consider:
+The [[SPSS/CasesToVars|CASESTOVARS]] command translates a long dataset into wide format. If the dataset already has an index variable for the within-group sequence, specify it on the `/INDEX` subcommand.
-Line 144:
+Line 52:
- /id=Region
 /index=Year
 /fixed=Region_Name.
+  /id=clientid
  /index=fiscalquarter.
-Line 149:
+Line 56:
-`Region_Name` would not be duplicated for each index level.
+Otherwise variables will be spread into an unknowable number of sequentially-named variables.
-Line 151:
+Line 58:
+If case-wise descriptive statistics are all that is desired from the translation, consider instead using the [[SPSS/Aggregate|AGGREGATE]] command.
-Line 152:
+Line 60:
+{{{
dataset declare clients.
aggregate
  /outfile="clients"
  /break=clientid
  /count=N.
}}}
-Line 153:
+Line 68:
-=== Transforming Wide to Long ===

The '''`VARSTOCASES`''' command transforms data from wide to long format.

Given the dataset modified above, the `Total_Sales` variable could be recreated by using:
+The [[SPSS/VarsToCases|VARSTOCASES]] command translates a wide dataset into long format.
-Line 161:
+Line 72:
- /make Total_Sales from Total_Sales.1 to Total_Sales.3.
+  /make Sales from Sales.1 to Sales.4
  /index=fiscalquarter.
-Line 164:
+Line 76:
-Each non-missing value from the source variables is split into a new row of data. All other variables are duplicated for each created row of data.



==== Identifiers and Indices ====

To create an identifier variable that specifies the row number in the original data, use the `/ID` subcommand.

Given the dataset modified above, the `Region` variable could be recreated by using:

{{{
varstocases
 /make Total_Sales from Total_Sales.1 to Total_Sales.3
 /id=Region.
}}}

To create a numeric index variable that specifies the source variable's sequence in the original data, use the `/INDEX` subcommand.

Given the dataset modified above, the `Year` variable could be recreated by using:

{{{
varstocases
 /make Total_Sales from Total_Sales.1 to Total_Sales.3
 /index=Year.
}}}

Alternatively, to create a string index variable that specifies the source variable's name in the original data, specify the subcommand like `/INDEX=Year(Total_Sales)`. In this case, `Year` would instead contain the values `"Total_Sales.1"`, `"Total_Sales.2"`, or `"Total_Sales.3"`.

To specify a variable label for a new identifier or index variable, list the label in quotes following the new variable name.

The new identifier or numeric index variable will have a format of `F8.0`.

Diff for "SPSS/AggregatingData"

Aggregating Data with SPSS

Statistics

Wide and Long Data