Differences between revisions 4 and 13 (spanning 9 versions)
Revision 4 as of 2022-08-22 19:53:29
Size: 3873
Comment:
Revision 13 as of 2023-06-09 17:52:50
Size: 1765
Comment:
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
= Aggregating Data = = Aggregating Data with SPSS =

SPSS offers several commands for computing aggregated statistics and translating datasets into aggregated formats.
Line 9: Line 11:
== Calculating higher-level values == == Statistics ==
Line 11: Line 13:


=== Scratch Variables ===

'''Scratch variables''' are declared with a preceding hash (`#`). These variables are deleted on reaching execution, and are never re-initialized.

Given sales data, to compute total sales, try:
[[SPSS/ScratchVariables|Scratch variables]] can be used to compute aggregated statistics.
Line 20: Line 16:
compute #Total_Sales = sum(Sales, #Total_Sales).
compute Total_Sales = #Total_Sales.
compute #TotalSales = sum(Sales, #TotalSales).
compute Total_Sales = #TotalSales.
Line 24: Line 20:


=== Leave ===

The '''`LEAVE`''' command explicitly suppresses re-initialization for a given variable. In all other contexts, these are normal variables with normal behavior.

Given sales data, to compute total sales, try:
The [[SPSS/Leave|LEAVE]] command can be used in a similar manner.
Line 37: Line 27:


=== Aggregate ===

The '''`AGGREGATE`''' command ''can'' be used to calculate higher-level values.

Given sales data, to compute total sales, try:
The [[SPSS/Aggregate|AGGREGATE]] command creates a new dataset of aggregated statistics.
Line 50: Line 34:
The `OUTFILE=*` and `MODE=ADDVARIABLES` subcommands are the default behavior. Additionally it allows for group variables on the `/BREAK` subcommand.
Line 52: Line 36:
Note: if the calculated variables may collide with pre-existing variable names, use the `OVERWRITE=YES` subcommand.



==== Functions ====

See [[SPSS/AggregateFunctions|Aggregate Functions]].



==== Break Groups ====

To calculate a higher-level value for each subgroup, use the `/BREAK=VARLIST` subcommand. The dataset must be sorted by `VARLIST`.

If the dataset is already sorted, use the `/PRESORTED` subcommand to skip re-processing. The `/PRESORTED` subcommand must precede the `/BREAK` subcommand.


{{{
aggregate outfile=* mode=addvariables
  /break=clientid
  /Total_Sales = sum(Sales).
}}}
Line 75: Line 46:
== Aggregating Cases == == Wide and Long Data ==
Line 77: Line 48:


=== Aggregate ===

The '''`AGGREGATE`''' command ''can'' be used to create a higher-level dataset.

This is mostly only useful with break groups, whereby a case is generated for each one.

Given state-level sales data, to create region-level sales data, try:

{{{
dataset define NEW.
aggregate outfile=NEW
  /break=Region
  /Total_Sales=sum(Sale).
}}}

If the dataset is already sorted, use the `/PRESORTED` subcommand to skip re-processing. The `/PRESORTED` subcommand must precede the `/BREAK` subcommand.



==== Functions ====

See [[SPSS/AggregateFunctions|Aggregate Functions]].



=== Transforming Long to Wide ===

The '''`CASESTOVARS`''' command transforms data from long to wide format. An index variable identifies the cases that should be aggregated.

Given region-level sales data from multiple years, try:

{{{
dataset activate SALES_Y1.
compute Year=1.

dataset activate SALES_Y2.
compute Year=2.

dataset activate SALES_Y3.
compute Year=3.

add files /file=SALES_Y1 /file=SALES_Y2 /file=SALES_Y3.
sort cases by Region.
formats Year (F1).
dataset name SALES.
casestovars
 /id=Region
 /index=Year.
}}}

The ID variable (`Region`) becomes a unique identifier for each row of data. The index variable (`Year`) is dropped, and all other variables are transformed according to the index variable. For example, the `Total_Sales` column becomes a set of columns, one for each year (`Total_Sales.1`, `Total_Sales.2`, `Total_Sales.3`).

The index variables can be either numeric or string, but all cases must have a non-missing value.



==== Constants ====

If a variable does not vary within the new rows of data, they should instead be treated as constants and should not be transformed. Constants should be specified on the `/FIXED` subcommand. SPSS will automatically inspect variables and warn about any that appear to be constants but were not specified as such.

Consider:
The [[SPSS/CasesToVars|CASESTOVARS]] command translates a long dataset into wide format. If the dataset already has an index variable for the within-group sequence, specify it on the `/INDEX` subcommand.
Line 143: Line 52:
 /id=Region
 /index=Year
 /fixed=Region_Name.
  /id=clientid
  /index=fiscalquarter.
Line 148: Line 56:
`Region_Name` would not be duplicated for each index level. Otherwise variables will be spread into an unknowable number of sequentially-named variables.
Line 150: Line 58:
If case-wise descriptive statistics are all that is desired from the translation, consider instead using the [[SPSS/Aggregate|AGGREGATE]] command.
Line 151: Line 60:
{{{
dataset declare clients.
aggregate
  /outfile="clients"
  /break=clientid
  /count=N.
}}}
Line 152: Line 68:
=== Transforming Wide to Long === The [[SPSS/VarsToCases|VARSTOCASES]] command translates a wide dataset into long format.
Line 154: Line 70:
The '''`VARSTOCASES`''' command transforms data from wide to long format. {{{
varstocases
  /make Sales from Sales.1 to Sales.4
  /index=fiscalquarter.
}}}

Aggregating Data with SPSS

SPSS offers several commands for computing aggregated statistics and translating datasets into aggregated formats.


Statistics

Scratch variables can be used to compute aggregated statistics.

compute #TotalSales = sum(Sales, #TotalSales).
compute Total_Sales = #TotalSales.

The LEAVE command can be used in a similar manner.

compute Total_Sales = sum(Sales, Total_Sales).
leave Total_Sales.

The AGGREGATE command creates a new dataset of aggregated statistics.

aggregate outfile=* mode=addvariables
  /Total_Sales = sum(Sales).

Additionally it allows for group variables on the /BREAK subcommand.

aggregate outfile=* mode=addvariables
  /break=clientid
  /Total_Sales = sum(Sales).


Wide and Long Data

The CASESTOVARS command translates a long dataset into wide format. If the dataset already has an index variable for the within-group sequence, specify it on the /INDEX subcommand.

casestovars
  /id=clientid
  /index=fiscalquarter.

Otherwise variables will be spread into an unknowable number of sequentially-named variables.

If case-wise descriptive statistics are all that is desired from the translation, consider instead using the AGGREGATE command.

dataset declare clients.
aggregate
  /outfile="clients"
  /break=clientid
  /count=N.

The VARSTOCASES command translates a wide dataset into long format.

varstocases
  /make Sales from Sales.1 to Sales.4
  /index=fiscalquarter.


CategoryRicottone

SPSS/AggregatingData (last edited 2023-06-11 20:45:57 by DominicRicottone)