Size: 3873
Comment:
|
Size: 1765
Comment:
|
Deletions are marked like this. | Additions are marked like this. |
Line 1: | Line 1: |
= Aggregating Data = | = Aggregating Data with SPSS = SPSS offers several commands for computing aggregated statistics and translating datasets into aggregated formats. |
Line 9: | Line 11: |
== Calculating higher-level values == | == Statistics == |
Line 11: | Line 13: |
=== Scratch Variables === '''Scratch variables''' are declared with a preceding hash (`#`). These variables are deleted on reaching execution, and are never re-initialized. Given sales data, to compute total sales, try: |
[[SPSS/ScratchVariables|Scratch variables]] can be used to compute aggregated statistics. |
Line 20: | Line 16: |
compute #Total_Sales = sum(Sales, #Total_Sales). compute Total_Sales = #Total_Sales. |
compute #TotalSales = sum(Sales, #TotalSales). compute Total_Sales = #TotalSales. |
Line 24: | Line 20: |
=== Leave === The '''`LEAVE`''' command explicitly suppresses re-initialization for a given variable. In all other contexts, these are normal variables with normal behavior. Given sales data, to compute total sales, try: |
The [[SPSS/Leave|LEAVE]] command can be used in a similar manner. |
Line 37: | Line 27: |
=== Aggregate === The '''`AGGREGATE`''' command ''can'' be used to calculate higher-level values. Given sales data, to compute total sales, try: |
The [[SPSS/Aggregate|AGGREGATE]] command creates a new dataset of aggregated statistics. |
Line 50: | Line 34: |
The `OUTFILE=*` and `MODE=ADDVARIABLES` subcommands are the default behavior. | Additionally it allows for group variables on the `/BREAK` subcommand. |
Line 52: | Line 36: |
Note: if the calculated variables may collide with pre-existing variable names, use the `OVERWRITE=YES` subcommand. ==== Functions ==== See [[SPSS/AggregateFunctions|Aggregate Functions]]. ==== Break Groups ==== To calculate a higher-level value for each subgroup, use the `/BREAK=VARLIST` subcommand. The dataset must be sorted by `VARLIST`. If the dataset is already sorted, use the `/PRESORTED` subcommand to skip re-processing. The `/PRESORTED` subcommand must precede the `/BREAK` subcommand. |
{{{ aggregate outfile=* mode=addvariables /break=clientid /Total_Sales = sum(Sales). }}} |
Line 75: | Line 46: |
== Aggregating Cases == | == Wide and Long Data == |
Line 77: | Line 48: |
=== Aggregate === The '''`AGGREGATE`''' command ''can'' be used to create a higher-level dataset. This is mostly only useful with break groups, whereby a case is generated for each one. Given state-level sales data, to create region-level sales data, try: {{{ dataset define NEW. aggregate outfile=NEW /break=Region /Total_Sales=sum(Sale). }}} If the dataset is already sorted, use the `/PRESORTED` subcommand to skip re-processing. The `/PRESORTED` subcommand must precede the `/BREAK` subcommand. ==== Functions ==== See [[SPSS/AggregateFunctions|Aggregate Functions]]. === Transforming Long to Wide === The '''`CASESTOVARS`''' command transforms data from long to wide format. An index variable identifies the cases that should be aggregated. Given region-level sales data from multiple years, try: {{{ dataset activate SALES_Y1. compute Year=1. dataset activate SALES_Y2. compute Year=2. dataset activate SALES_Y3. compute Year=3. add files /file=SALES_Y1 /file=SALES_Y2 /file=SALES_Y3. sort cases by Region. formats Year (F1). dataset name SALES. casestovars /id=Region /index=Year. }}} The ID variable (`Region`) becomes a unique identifier for each row of data. The index variable (`Year`) is dropped, and all other variables are transformed according to the index variable. For example, the `Total_Sales` column becomes a set of columns, one for each year (`Total_Sales.1`, `Total_Sales.2`, `Total_Sales.3`). The index variables can be either numeric or string, but all cases must have a non-missing value. ==== Constants ==== If a variable does not vary within the new rows of data, they should instead be treated as constants and should not be transformed. Constants should be specified on the `/FIXED` subcommand. SPSS will automatically inspect variables and warn about any that appear to be constants but were not specified as such. Consider: |
The [[SPSS/CasesToVars|CASESTOVARS]] command translates a long dataset into wide format. If the dataset already has an index variable for the within-group sequence, specify it on the `/INDEX` subcommand. |
Line 143: | Line 52: |
/id=Region /index=Year /fixed=Region_Name. |
/id=clientid /index=fiscalquarter. |
Line 148: | Line 56: |
`Region_Name` would not be duplicated for each index level. | Otherwise variables will be spread into an unknowable number of sequentially-named variables. |
Line 150: | Line 58: |
If case-wise descriptive statistics are all that is desired from the translation, consider instead using the [[SPSS/Aggregate|AGGREGATE]] command. | |
Line 151: | Line 60: |
{{{ dataset declare clients. aggregate /outfile="clients" /break=clientid /count=N. }}} |
|
Line 152: | Line 68: |
=== Transforming Wide to Long === | The [[SPSS/VarsToCases|VARSTOCASES]] command translates a wide dataset into long format. |
Line 154: | Line 70: |
The '''`VARSTOCASES`''' command transforms data from wide to long format. | {{{ varstocases /make Sales from Sales.1 to Sales.4 /index=fiscalquarter. }}} |
Aggregating Data with SPSS
SPSS offers several commands for computing aggregated statistics and translating datasets into aggregated formats.
Statistics
Scratch variables can be used to compute aggregated statistics.
compute #TotalSales = sum(Sales, #TotalSales). compute Total_Sales = #TotalSales.
The LEAVE command can be used in a similar manner.
compute Total_Sales = sum(Sales, Total_Sales). leave Total_Sales.
The AGGREGATE command creates a new dataset of aggregated statistics.
aggregate outfile=* mode=addvariables /Total_Sales = sum(Sales).
Additionally it allows for group variables on the /BREAK subcommand.
aggregate outfile=* mode=addvariables /break=clientid /Total_Sales = sum(Sales).
Wide and Long Data
The CASESTOVARS command translates a long dataset into wide format. If the dataset already has an index variable for the within-group sequence, specify it on the /INDEX subcommand.
casestovars /id=clientid /index=fiscalquarter.
Otherwise variables will be spread into an unknowable number of sequentially-named variables.
If case-wise descriptive statistics are all that is desired from the translation, consider instead using the AGGREGATE command.
dataset declare clients. aggregate /outfile="clients" /break=clientid /count=N.
The VARSTOCASES command translates a wide dataset into long format.
varstocases /make Sales from Sales.1 to Sales.4 /index=fiscalquarter.