Differences between revisions 12 and 13
Revision 12 as of 2023-01-13 23:31:36
Size: 5177
Comment:
Revision 13 as of 2023-06-09 17:52:50
Size: 1765
Comment:
Deletions are marked like this. Additions are marked like this.
Line 2: Line 2:

SPSS offers several commands for computing aggregated statistics and translating datasets into aggregated formats.
Line 9: Line 11:
== Calculating higher-level values == == Statistics ==
Line 11: Line 13:


=== Scratch Variables ===

To compute total sales from a dataset of sales, try:
[[SPSS/ScratchVariables|Scratch variables]] can be used to compute aggregated statistics.
Line 22: Line 20:
For more details on using scratch variables, see [[SPSS/ScratchVariables|here]].



=== Leave ===

The '''`LEAVE`''' command explicitly suppresses re-initialization for a given variable. In all other contexts, these are normal variables with normal behavior.

Given sales data, to compute total sales, try:
The [[SPSS/Leave|LEAVE]] command can be used in a similar manner.
Line 37: Line 27:


=== Aggregate ===

The '''`AGGREGATE`''' command can be used to calculate higher-level variables.

Given sales data, to compute total sales, try:
The [[SPSS/Aggregate|AGGREGATE]] command creates a new dataset of aggregated statistics.
Line 50: Line 34:
For more complex analysis, it may be necessary to aggregate data into a new dataset and ''then'' merge a higher-level variable into the old dataset. To identify duplicate cases, try: Additionally it allows for group variables on the `/BREAK` subcommand.
Line 53: Line 37:
dataset activate Dataset1.
sort cases by KEYVARLIST.
dataset declare DUPLICATES.
aggregate
  /outfile="DUPLICATES"
  /presorted
  /break=KEYVARLIST
  /COUNT=N.
execute.

dataset activate DUPLICATES.
select if COUNT>1.
execute.

dataset activate Dataset1.
match files
 /file=*
 /file=DUPLICATES
 /by KEYVARLIST.
execute.
frequencies COUNT.
aggregate outfile=* mode=addvariables
  /break=clientid
  /Total_Sales = sum(Sales).
Line 75: Line 41:

Note: if the calculated variables may collide with pre-existing variable names, use the `OVERWRITE=YES` subcommand.

For details on using the `AGGREGATE` command, see [[SPSS/Aggregate|here]].
Line 84: Line 46:
== Aggregating Cases == == Wide and Long Data ==
Line 86: Line 48:


=== Aggregate ===

The '''`AGGREGATE`''' command can be used to create a higher-level dataset.

Given state-level sales data, to create region-level sales data, try:

{{{
dataset declare NEW.
aggregate outfile=NEW
  /break=Region
  /Total_Sales=sum(Sale).
}}}

For details on using the `AGGREGATE` command, see [[SPSS/Aggregate|here]].



=== Transforming Long to Wide ===

The '''`CASESTOVARS`''' command transforms data from long to wide format. An index variable identifies the cases that should be aggregated.

Given region-level sales data from multiple years, try:

{{{
dataset activate SALES_Y1.
compute Year=1.

dataset activate SALES_Y2.
compute Year=2.

dataset activate SALES_Y3.
compute Year=3.

add files /file=SALES_Y1 /file=SALES_Y2 /file=SALES_Y3.
sort cases by Region.
formats Year (F1).
dataset name SALES.
casestovars
 /id=Region
 /index=Year.
}}}

The ID variable (`Region`) becomes a unique identifier for each row of data. The index variable (`Year`) is dropped, and all other variables are transformed according to the index variable. For example, the `Total_Sales` column becomes a set of columns, one for each year (`Total_Sales.1`, `Total_Sales.2`, `Total_Sales.3`).

The index variables can be either numeric or string, but all cases must have a non-missing value.



==== Constants ====

If a variable does not vary within the new rows of data, they should instead be treated as constants and should not be transformed. Constants should be specified on the `/FIXED` subcommand. SPSS will automatically inspect variables and warn about any that appear to be constants but were not specified as such.

Consider:
The [[SPSS/CasesToVars|CASESTOVARS]] command translates a long dataset into wide format. If the dataset already has an index variable for the within-group sequence, specify it on the `/INDEX` subcommand.
Line 144: Line 52:
 /id=Region
 /index=Year
 /fixed=Region_Name.
  /id=clientid
  /index=fiscalquarter.
Line 149: Line 56:
`Region_Name` would not be duplicated for each index level. Otherwise variables will be spread into an unknowable number of sequentially-named variables.
Line 151: Line 58:
If case-wise descriptive statistics are all that is desired from the translation, consider instead using the [[SPSS/Aggregate|AGGREGATE]] command.
Line 152: Line 60:
{{{
dataset declare clients.
aggregate
  /outfile="clients"
  /break=clientid
  /count=N.
}}}
Line 153: Line 68:
=== Transforming Wide to Long ===

The '''`VARSTOCASES`''' command transforms data from wide to long format.

Given the dataset modified above, the `Total_Sales` variable could be recreated by using:
The [[SPSS/VarsToCases|VARSTOCASES]] command translates a wide dataset into long format.
Line 161: Line 72:
 /make Total_Sales from Total_Sales.1 to Total_Sales.3.   /make Sales from Sales.1 to Sales.4
  /index=fiscalquarter
.
Line 164: Line 76:
Each non-missing value from the source variables is split into a new row of data. All other variables are duplicated for each created row of data.



==== Identifiers and Indices ====

To create an identifier variable that specifies the row number in the original data, use the `/ID` subcommand.

Given the dataset modified above, the `Region` variable could be recreated by using:

{{{
varstocases
 /make Total_Sales from Total_Sales.1 to Total_Sales.3
 /id=Region.
}}}

To create a numeric index variable that specifies the source variable's sequence in the original data, use the `/INDEX` subcommand.

Given the dataset modified above, the `Year` variable could be recreated by using:

{{{
varstocases
 /make Total_Sales from Total_Sales.1 to Total_Sales.3
 /index=Year.
}}}

Alternatively, to create a string index variable that specifies the source variable's name in the original data, specify the subcommand like `/INDEX=Year(Total_Sales)`. In this case, `Year` would instead contain the values `"Total_Sales.1"`, `"Total_Sales.2"`, or `"Total_Sales.3"`.

To specify a variable label for a new identifier or index variable, list the label in quotes following the new variable name.

The new identifier or numeric index variable will have a format of `F8.0`.

Aggregating Data with SPSS

SPSS offers several commands for computing aggregated statistics and translating datasets into aggregated formats.


Statistics

Scratch variables can be used to compute aggregated statistics.

compute #TotalSales = sum(Sales, #TotalSales).
compute Total_Sales = #TotalSales.

The LEAVE command can be used in a similar manner.

compute Total_Sales = sum(Sales, Total_Sales).
leave Total_Sales.

The AGGREGATE command creates a new dataset of aggregated statistics.

aggregate outfile=* mode=addvariables
  /Total_Sales = sum(Sales).

Additionally it allows for group variables on the /BREAK subcommand.

aggregate outfile=* mode=addvariables
  /break=clientid
  /Total_Sales = sum(Sales).


Wide and Long Data

The CASESTOVARS command translates a long dataset into wide format. If the dataset already has an index variable for the within-group sequence, specify it on the /INDEX subcommand.

casestovars
  /id=clientid
  /index=fiscalquarter.

Otherwise variables will be spread into an unknowable number of sequentially-named variables.

If case-wise descriptive statistics are all that is desired from the translation, consider instead using the AGGREGATE command.

dataset declare clients.
aggregate
  /outfile="clients"
  /break=clientid
  /count=N.

The VARSTOCASES command translates a wide dataset into long format.

varstocases
  /make Sales from Sales.1 to Sales.4
  /index=fiscalquarter.


CategoryRicottone

SPSS/AggregatingData (last edited 2023-06-11 20:45:57 by DominicRicottone)