Differences between revisions 1 and 4 (spanning 3 versions)

Aggregating Data

Contents

Aggregating Data
1. Calculating higher-level values
2. Aggregating Cases

Calculating higher-level values

Scratch Variables

Scratch variables are declared with a preceding hash (#). These variables are deleted on reaching execution, and are never re-initialized.

Given sales data, to compute total sales, try:

compute #Total_Sales = sum(Sales, #Total_Sales).
compute Total_Sales = #Total_Sales.

Leave

The LEAVE command explicitly suppresses re-initialization for a given variable. In all other contexts, these are normal variables with normal behavior.

Given sales data, to compute total sales, try:

compute Total_Sales = sum(Sales, Total_Sales).
leave Total_Sales.

Aggregate

The AGGREGATE command can be used to calculate higher-level values.

Given sales data, to compute total sales, try:

aggregate outfile=* mode=addvariables
  /Total_Sales = sum(Sales).

The OUTFILE=* and MODE=ADDVARIABLES subcommands are the default behavior.

Note: if the calculated variables may collide with pre-existing variable names, use the OVERWRITE=YES subcommand.

Functions

See Aggregate Functions.

Break Groups

To calculate a higher-level value for each subgroup, use the /BREAK=VARLIST subcommand. The dataset must be sorted by VARLIST.

If the dataset is already sorted, use the /PRESORTED subcommand to skip re-processing. The /PRESORTED subcommand must precede the /BREAK subcommand.

Aggregating Cases

Aggregate

The AGGREGATE command can be used to create a higher-level dataset.

This is mostly only useful with break groups, whereby a case is generated for each one.

Given state-level sales data, to create region-level sales data, try:

dataset define NEW.
aggregate outfile=NEW
  /break=Region
  /Total_Sales=sum(Sale).

If the dataset is already sorted, use the /PRESORTED subcommand to skip re-processing. The /PRESORTED subcommand must precede the /BREAK subcommand.

Functions

See Aggregate Functions.

Transforming Long to Wide

The CASESTOVARS command transforms data from long to wide format. An index variable identifies the cases that should be aggregated.

Given region-level sales data from multiple years, try:

dataset activate SALES_Y1.
compute Year=1.

dataset activate SALES_Y2.
compute Year=2.

dataset activate SALES_Y3.
compute Year=3.

add files /file=SALES_Y1 /file=SALES_Y2 /file=SALES_Y3.
sort cases by Region.
formats Year (F1).
dataset name SALES.
casestovars
 /id=Region
 /index=Year.

The ID variable (Region) becomes a unique identifier for each row of data. The index variable (Year) is dropped, and all other variables are transformed according to the index variable. For example, the Total_Sales column becomes a set of columns, one for each year (Total_Sales.1, Total_Sales.2, Total_Sales.3).

The index variables can be either numeric or string, but all cases must have a non-missing value.

Constants

If a variable does not vary within the new rows of data, they should instead be treated as constants and should not be transformed. Constants should be specified on the /FIXED subcommand. SPSS will automatically inspect variables and warn about any that appear to be constants but were not specified as such.

Consider:

casestovars
 /id=Region
 /index=Year
 /fixed=Region_Name.

Region_Name would not be duplicated for each index level.

Transforming Wide to Long

The VARSTOCASES command transforms data from wide to long format.

CategoryRicottone

-  ⇤ ← Revision 1 as of 2021-12-29 17:47:37 → 
  Size: 2275
  Editor: DominicRicottone
  Comment:
+   ← Revision 4 as of 2022-08-22 19:53:29 → ⇥
  Size: 3873
  Editor: DominicRicottone
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 9:
-== Computing Aggregated Variables ==
+== Calculating higher-level values ==
 Line 17:
+Given sales data, to compute total sales, try:
-Line 18:
+Line 20:
-compute #total = sum(sale,#total).
compute total = #total.
+compute #Total_Sales = sum(Sales, #Total_Sales).
compute Total_Sales = #Total_Sales.
-Line 26:
+Line 28:
-The `LEAVE` command explicitly suppresses re-initialization for a given variable. In all other contexts, these are normal variables with normal behavior.
+The '''`LEAVE`''' command explicitly suppresses re-initialization for a given variable. In all other contexts, these are normal variables with normal behavior.

Given sales data, to compute total sales, try:
-Line 29:
+Line 33:
-compute total = sum(sale,total).
leave total.
+compute Total_Sales = sum(Sales, Total_Sales).
leave Total_Sales.
-Line 37:
+Line 41:
-The `AGGREGATE` command can either create an aggregated data set or append aggregated variables to the active data set. This latter behavior is a silent default behavior and what will be explored here. For the former, see below.
+The '''`AGGREGATE`''' command ''can'' be used to calculate higher-level values.

Given sales data, to compute total sales, try:
-Line 41:
+Line 47:
-  /last_casenum = max(casenum)
  /total        = sum(sale).
+  /Total_Sales = sum(Sales).
-Line 45:
+Line 50:
-The `OUTFILE=*` and `MODE=ADDVARIABLES` subcommands are silent defaults and can be left off. Additionally, use the `OVERWRITE=YES` subcommand to ignore pre-existing variables.
+The `OUTFILE=*` and `MODE=ADDVARIABLES` subcommands are the default behavior.
-Line 47:
+Line 52:
-Note that if the `AGGREGATE` command is being used in this mode with break groups, '''cases must be sorted ascending'''. The command will do this automatically unless the `/PRESORTED` subcommand is specified.
+Note: if the calculated variables may collide with pre-existing variable names, use the `OVERWRITE=YES` subcommand.



==== Functions ====

See [[SPSS/AggregateFunctions|Aggregate Functions]].



==== Break Groups ====

To calculate a higher-level value for each subgroup, use the `/BREAK=VARLIST` subcommand. The dataset must be sorted by `VARLIST`.

If the dataset is already sorted, use the `/PRESORTED` subcommand to skip re-processing. The `/PRESORTED` subcommand must precede the `/BREAK` subcommand.
-Line 59:
+Line 81:
+The '''`AGGREGATE`''' command ''can'' be used to create a higher-level dataset.

This is mostly only useful with break groups, whereby a case is generated for each one.

Given state-level sales data, to create region-level sales data, try:

{{{
dataset define NEW.
aggregate outfile=NEW
  /break=Region
  /Total_Sales=sum(Sale).
}}}

If the dataset is already sorted, use the `/PRESORTED` subcommand to skip re-processing. The `/PRESORTED` subcommand must precede the `/BREAK` subcommand.



==== Functions ====

See [[SPSS/AggregateFunctions|Aggregate Functions]].
-Line 63:
+Line 106:
-The `CASESTOVARS` command transforms a data set from long to wide format. An index variable identifies the cases that should be aggregated.
+The '''`CASESTOVARS`''' command transforms data from long to wide format. An index variable identifies the cases that should be aggregated.
-Line 65:
+Line 108:
-The index variable (or multiple index variables) can be either numeric or string, but all cases must have a non-missing value.
+Given region-level sales data from multiple years, try:
-Line 67:
+Line 110:
-Constants should be specified on the `/FIXED` subcommand. SPSS will automatically inspect variables and warn about any that appear to be constants but aren't specified here.
+{{{
dataset activate SALES_Y1.
compute Year=1.
-Line 69:
+Line 114:
-The remaining variables will be duplicated for each case that is aggregated together. Extraneous variables can and should be dropped to save time.
+dataset activate SALES_Y2.
compute Year=2.

dataset activate SALES_Y3.
compute Year=3.

add files /file=SALES_Y1 /file=SALES_Y2 /file=SALES_Y3.
sort cases by Region.
formats Year (F1).
dataset name SALES.
casestovars
 /id=Region
 /index=Year.
}}}

The ID variable (`Region`) becomes a unique identifier for each row of data. The index variable (`Year`) is dropped, and all other variables are transformed according to the index variable. For example, the `Total_Sales` column becomes a set of columns, one for each year (`Total_Sales.1`, `Total_Sales.2`, `Total_Sales.3`).

The index variables can be either numeric or string, but all cases must have a non-missing value.



==== Constants ====

If a variable does not vary within the new rows of data, they should instead be treated as constants and should not be transformed. Constants should be specified on the `/FIXED` subcommand. SPSS will automatically inspect variables and warn about any that appear to be constants but were not specified as such.

Consider:
-Line 73:
+Line 143:
- /id=ID
 /index=INDEX
 /fixed=INDEX_LEVEL_VARIABLES
 /drop=FOO BAR BAZ.
+ /id=Region
 /index=Year
 /fixed=Region_Name.
-Line 79:
+Line 148:
-See a concrete example in the [[CrosswalkLongToWide#SPSS|long-to-wide crosswalk]].
+`Region_Name` would not be duplicated for each index level.



=== Transforming Wide to Long ===

The '''`VARSTOCASES`''' command transforms data from wide to long format.

Diff for "SPSS/AggregatingData"

Aggregating Data

Calculating higher-level values

Scratch Variables

Leave

Aggregate

Functions

Break Groups

Aggregating Cases

Aggregate

Functions

Transforming Long to Wide

Constants

Transforming Wide to Long