Size: 5177
Comment:
|
Size: 1765
Comment:
|
Deletions are marked like this. | Additions are marked like this. |
Line 2: | Line 2: |
SPSS offers several commands for computing aggregated statistics and translating datasets into aggregated formats. |
|
Line 9: | Line 11: |
== Calculating higher-level values == | == Statistics == |
Line 11: | Line 13: |
=== Scratch Variables === To compute total sales from a dataset of sales, try: |
[[SPSS/ScratchVariables|Scratch variables]] can be used to compute aggregated statistics. |
Line 22: | Line 20: |
For more details on using scratch variables, see [[SPSS/ScratchVariables|here]]. === Leave === The '''`LEAVE`''' command explicitly suppresses re-initialization for a given variable. In all other contexts, these are normal variables with normal behavior. Given sales data, to compute total sales, try: |
The [[SPSS/Leave|LEAVE]] command can be used in a similar manner. |
Line 37: | Line 27: |
=== Aggregate === The '''`AGGREGATE`''' command can be used to calculate higher-level variables. Given sales data, to compute total sales, try: |
The [[SPSS/Aggregate|AGGREGATE]] command creates a new dataset of aggregated statistics. |
Line 50: | Line 34: |
For more complex analysis, it may be necessary to aggregate data into a new dataset and ''then'' merge a higher-level variable into the old dataset. To identify duplicate cases, try: | Additionally it allows for group variables on the `/BREAK` subcommand. |
Line 53: | Line 37: |
dataset activate Dataset1. sort cases by KEYVARLIST. dataset declare DUPLICATES. aggregate /outfile="DUPLICATES" /presorted /break=KEYVARLIST /COUNT=N. execute. dataset activate DUPLICATES. select if COUNT>1. execute. dataset activate Dataset1. match files /file=* /file=DUPLICATES /by KEYVARLIST. execute. frequencies COUNT. |
aggregate outfile=* mode=addvariables /break=clientid /Total_Sales = sum(Sales). |
Line 75: | Line 41: |
Note: if the calculated variables may collide with pre-existing variable names, use the `OVERWRITE=YES` subcommand. For details on using the `AGGREGATE` command, see [[SPSS/Aggregate|here]]. |
|
Line 84: | Line 46: |
== Aggregating Cases == | == Wide and Long Data == |
Line 86: | Line 48: |
=== Aggregate === The '''`AGGREGATE`''' command can be used to create a higher-level dataset. Given state-level sales data, to create region-level sales data, try: {{{ dataset declare NEW. aggregate outfile=NEW /break=Region /Total_Sales=sum(Sale). }}} For details on using the `AGGREGATE` command, see [[SPSS/Aggregate|here]]. === Transforming Long to Wide === The '''`CASESTOVARS`''' command transforms data from long to wide format. An index variable identifies the cases that should be aggregated. Given region-level sales data from multiple years, try: {{{ dataset activate SALES_Y1. compute Year=1. dataset activate SALES_Y2. compute Year=2. dataset activate SALES_Y3. compute Year=3. add files /file=SALES_Y1 /file=SALES_Y2 /file=SALES_Y3. sort cases by Region. formats Year (F1). dataset name SALES. casestovars /id=Region /index=Year. }}} The ID variable (`Region`) becomes a unique identifier for each row of data. The index variable (`Year`) is dropped, and all other variables are transformed according to the index variable. For example, the `Total_Sales` column becomes a set of columns, one for each year (`Total_Sales.1`, `Total_Sales.2`, `Total_Sales.3`). The index variables can be either numeric or string, but all cases must have a non-missing value. ==== Constants ==== If a variable does not vary within the new rows of data, they should instead be treated as constants and should not be transformed. Constants should be specified on the `/FIXED` subcommand. SPSS will automatically inspect variables and warn about any that appear to be constants but were not specified as such. Consider: |
The [[SPSS/CasesToVars|CASESTOVARS]] command translates a long dataset into wide format. If the dataset already has an index variable for the within-group sequence, specify it on the `/INDEX` subcommand. |
Line 144: | Line 52: |
/id=Region /index=Year /fixed=Region_Name. |
/id=clientid /index=fiscalquarter. |
Line 149: | Line 56: |
`Region_Name` would not be duplicated for each index level. | Otherwise variables will be spread into an unknowable number of sequentially-named variables. |
Line 151: | Line 58: |
If case-wise descriptive statistics are all that is desired from the translation, consider instead using the [[SPSS/Aggregate|AGGREGATE]] command. | |
Line 152: | Line 60: |
{{{ dataset declare clients. aggregate /outfile="clients" /break=clientid /count=N. }}} |
|
Line 153: | Line 68: |
=== Transforming Wide to Long === The '''`VARSTOCASES`''' command transforms data from wide to long format. Given the dataset modified above, the `Total_Sales` variable could be recreated by using: |
The [[SPSS/VarsToCases|VARSTOCASES]] command translates a wide dataset into long format. |
Line 161: | Line 72: |
/make Total_Sales from Total_Sales.1 to Total_Sales.3. | /make Sales from Sales.1 to Sales.4 /index=fiscalquarter. |
Line 164: | Line 76: |
Each non-missing value from the source variables is split into a new row of data. All other variables are duplicated for each created row of data. ==== Identifiers and Indices ==== To create an identifier variable that specifies the row number in the original data, use the `/ID` subcommand. Given the dataset modified above, the `Region` variable could be recreated by using: {{{ varstocases /make Total_Sales from Total_Sales.1 to Total_Sales.3 /id=Region. }}} To create a numeric index variable that specifies the source variable's sequence in the original data, use the `/INDEX` subcommand. Given the dataset modified above, the `Year` variable could be recreated by using: {{{ varstocases /make Total_Sales from Total_Sales.1 to Total_Sales.3 /index=Year. }}} Alternatively, to create a string index variable that specifies the source variable's name in the original data, specify the subcommand like `/INDEX=Year(Total_Sales)`. In this case, `Year` would instead contain the values `"Total_Sales.1"`, `"Total_Sales.2"`, or `"Total_Sales.3"`. To specify a variable label for a new identifier or index variable, list the label in quotes following the new variable name. The new identifier or numeric index variable will have a format of `F8.0`. |
Aggregating Data with SPSS
SPSS offers several commands for computing aggregated statistics and translating datasets into aggregated formats.
Statistics
Scratch variables can be used to compute aggregated statistics.
compute #TotalSales = sum(Sales, #TotalSales). compute Total_Sales = #TotalSales.
The LEAVE command can be used in a similar manner.
compute Total_Sales = sum(Sales, Total_Sales). leave Total_Sales.
The AGGREGATE command creates a new dataset of aggregated statistics.
aggregate outfile=* mode=addvariables /Total_Sales = sum(Sales).
Additionally it allows for group variables on the /BREAK subcommand.
aggregate outfile=* mode=addvariables /break=clientid /Total_Sales = sum(Sales).
Wide and Long Data
The CASESTOVARS command translates a long dataset into wide format. If the dataset already has an index variable for the within-group sequence, specify it on the /INDEX subcommand.
casestovars /id=clientid /index=fiscalquarter.
Otherwise variables will be spread into an unknowable number of sequentially-named variables.
If case-wise descriptive statistics are all that is desired from the translation, consider instead using the AGGREGATE command.
dataset declare clients. aggregate /outfile="clients" /break=clientid /count=N.
The VARSTOCASES command translates a wide dataset into long format.
varstocases /make Sales from Sales.1 to Sales.4 /index=fiscalquarter.