Differences between revisions 8 and 12 (spanning 4 versions)
Revision 8 as of 2022-08-24 13:48:20
Size: 5635
Comment:
Revision 12 as of 2023-01-13 23:31:36
Size: 5177
Comment:
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
= Aggregating Data = = Aggregating Data with SPSS =
Line 15: Line 15:
'''Scratch variables''' are declared with a preceding hash (`#`). These variables are deleted on reaching execution, and are never re-initialized.

Given sales data, to compute total sales, try:
To compute total sales from a dataset of sales, try:
Line 20: Line 18:
compute #Total_Sales = sum(Sales, #Total_Sales).
compute Total_Sales = #Total_Sales.
compute #TotalSales = sum(Sales, #TotalSales).
compute Total_Sales = #TotalSales.
Line 23: Line 21:

For more details on using scratch variables, see [[SPSS/ScratchVariables|here]].
Line 41: Line 41:
The '''`AGGREGATE`''' command ''can'' be used to calculate higher-level values. The '''`AGGREGATE`''' command can be used to calculate higher-level variables.
Line 50: Line 50:
The `OUTFILE=*` and `MODE=ADDVARIABLES` subcommands are the default behavior. For more complex analysis, it may be necessary to aggregate data into a new dataset and ''then'' merge a higher-level variable into the old dataset. To identify duplicate cases, try:

{{{
dataset activate Dataset1.
sort cases by KEYVARLIST.
dataset declare DUPLICATES.
aggregate
  /outfile="DUPLICATES"
  /presorted
  /break=KEYVARLIST
  /COUNT=N.
execute.

dataset activate DUPLICATES.
select if COUNT>1.
execute.

dataset activate Dataset1.
match files
 /file=*
 /file=DUPLICATES
 /by KEYVARLIST.
execute.
frequencies COUNT.
}}}
Line 54: Line 78:
Note: running `AGGREGATE` ''without'' a `/BREAK` subcommand (see Break Groups below) is unsupported in PSPP, and unsupported in SPSS version 16 or earlier.



==== Functions ====

See [[SPSS/AggregateFunctions|Aggregate Functions]].



==== Break Groups ====

To calculate a higher-level value for each subgroup, use the `/BREAK=VARLIST` subcommand. The dataset must be sorted by `VARLIST`.

If the dataset is already sorted, use the `/PRESORTED` subcommand to skip re-processing. The `/PRESORTED` subcommand must precede the `/BREAK` subcommand.

For details on using the `AGGREGATE` command, see [[SPSS/Aggregate|here]].
Line 82: Line 90:
The '''`AGGREGATE`''' command ''can'' be used to create a higher-level dataset.

This is mostly only useful with break groups, whereby a case is generated for each one.
The '''`AGGREGATE`''' command can be used to create a higher-level dataset.
Line 89: Line 95:
dataset define NEW. dataset declare NEW.
Line 95: Line 101:
If the dataset is already sorted, use the `/PRESORTED` subcommand to skip re-processing. The `/PRESORTED` subcommand must precede the `/BREAK` subcommand.

Note: running `AGGREGATE` ''without'' a `/BREAK` subcommand is unsupported in PSPP, and unsupported in SPSS version 16 or earlier.



==== Functions ====

See [[SPSS/AggregateFunctions|Aggregate Functions]].
For details on using the `AGGREGATE` command, see [[SPSS/Aggregate|here]].

Aggregating Data with SPSS


Calculating higher-level values

Scratch Variables

To compute total sales from a dataset of sales, try:

compute #TotalSales = sum(Sales, #TotalSales).
compute Total_Sales = #TotalSales.

For more details on using scratch variables, see here.

Leave

The LEAVE command explicitly suppresses re-initialization for a given variable. In all other contexts, these are normal variables with normal behavior.

Given sales data, to compute total sales, try:

compute Total_Sales = sum(Sales, Total_Sales).
leave Total_Sales.

Aggregate

The AGGREGATE command can be used to calculate higher-level variables.

Given sales data, to compute total sales, try:

aggregate outfile=* mode=addvariables
  /Total_Sales = sum(Sales).

For more complex analysis, it may be necessary to aggregate data into a new dataset and then merge a higher-level variable into the old dataset. To identify duplicate cases, try:

dataset activate Dataset1.
sort cases by KEYVARLIST.
dataset declare DUPLICATES.
aggregate
  /outfile="DUPLICATES"
  /presorted
  /break=KEYVARLIST
  /COUNT=N.
execute.

dataset activate DUPLICATES.
select if COUNT>1.
execute.

dataset activate Dataset1.
match files
 /file=*
 /file=DUPLICATES
 /by KEYVARLIST.
execute.
frequencies COUNT.

Note: if the calculated variables may collide with pre-existing variable names, use the OVERWRITE=YES subcommand.

For details on using the AGGREGATE command, see here.


Aggregating Cases

Aggregate

The AGGREGATE command can be used to create a higher-level dataset.

Given state-level sales data, to create region-level sales data, try:

dataset declare NEW.
aggregate outfile=NEW
  /break=Region
  /Total_Sales=sum(Sale).

For details on using the AGGREGATE command, see here.

Transforming Long to Wide

The CASESTOVARS command transforms data from long to wide format. An index variable identifies the cases that should be aggregated.

Given region-level sales data from multiple years, try:

dataset activate SALES_Y1.
compute Year=1.

dataset activate SALES_Y2.
compute Year=2.

dataset activate SALES_Y3.
compute Year=3.

add files /file=SALES_Y1 /file=SALES_Y2 /file=SALES_Y3.
sort cases by Region.
formats Year (F1).
dataset name SALES.
casestovars
 /id=Region
 /index=Year.

The ID variable (Region) becomes a unique identifier for each row of data. The index variable (Year) is dropped, and all other variables are transformed according to the index variable. For example, the Total_Sales column becomes a set of columns, one for each year (Total_Sales.1, Total_Sales.2, Total_Sales.3).

The index variables can be either numeric or string, but all cases must have a non-missing value.

Constants

If a variable does not vary within the new rows of data, they should instead be treated as constants and should not be transformed. Constants should be specified on the /FIXED subcommand. SPSS will automatically inspect variables and warn about any that appear to be constants but were not specified as such.

Consider:

casestovars
 /id=Region
 /index=Year
 /fixed=Region_Name.

Region_Name would not be duplicated for each index level.

Transforming Wide to Long

The VARSTOCASES command transforms data from wide to long format.

Given the dataset modified above, the Total_Sales variable could be recreated by using:

varstocases
 /make Total_Sales from Total_Sales.1 to Total_Sales.3.

Each non-missing value from the source variables is split into a new row of data. All other variables are duplicated for each created row of data.

Identifiers and Indices

To create an identifier variable that specifies the row number in the original data, use the /ID subcommand.

Given the dataset modified above, the Region variable could be recreated by using:

varstocases
 /make Total_Sales from Total_Sales.1 to Total_Sales.3
 /id=Region.

To create a numeric index variable that specifies the source variable's sequence in the original data, use the /INDEX subcommand.

Given the dataset modified above, the Year variable could be recreated by using:

varstocases
 /make Total_Sales from Total_Sales.1 to Total_Sales.3
 /index=Year.

Alternatively, to create a string index variable that specifies the source variable's name in the original data, specify the subcommand like /INDEX=Year(Total_Sales). In this case, Year would instead contain the values "Total_Sales.1", "Total_Sales.2", or "Total_Sales.3".

To specify a variable label for a new identifier or index variable, list the label in quotes following the new variable name.

The new identifier or numeric index variable will have a format of F8.0.


CategoryRicottone

SPSS/AggregatingData (last edited 2023-06-11 20:45:57 by DominicRicottone)