Differences between revisions 10 and 14 (spanning 4 versions)

Joining Data with SPSS

SPSS offers several commands for joining datasets.

Contents

Joining Data with SPSS

Full Join

The primary method for any join in SPSS is the MATCH FILES command.

match files
  /file=LEFT
  /file=RIGHT
  /by ID.

If the datasets are known to have no overlap, the ADD FILES command is also an option.

Left Join

match files
  /file=LEFT /in=keep
  /file=RIGHT
  /by ID.
select if keep=1.

The STAR JOIN command is also available.

Right Join

match files
  /file=LEFT
  /file=RIGHT /in=keep
  /by ID.
select if keep=1.

Inner Join

match files
  /file=LEFT /in=left
  /file=RIGHT /in=right
  /by ID.
select if left=1 and right=1.

CategoryRicottone

-  ⇤ ← Revision 10 as of 2022-08-24 18:15:37 → 
  Size: 4706
  Editor: DominicRicottone
  Comment:
+   ← Revision 14 as of 2023-06-09 21:03:49 → ⇥
  Size: 889
  Editor: DominicRicottone
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-= SPSS Joining Data =
+= Joining Data with SPSS =
 Line 3:
-SPSS offers several commands for joining data.

The syntax of the commands is first explained, then the equivalent(s) of standard SQL joins are listed.
+SPSS offers several commands for joining datasets.
-Line 13:
+Line 11:
-== Match Files ==
+== Full Join ==
-Line 15:
+Line 13:
-The '''`MATCH FILE`''' command is ''primarily'' used for 1:1 joins, where all cases are uniquely identified in all datasets. Try:
+The primary method for any join in SPSS is the [[SPSS/MatchFiles|MATCH FILES]] command.
-Line 24:
+Line 22:
-Variables are taken in order from the files in order. For variables originating from more than one file, values are taken from the first file they appear in and metadata is taken from the first file with any (i.e. variable label, value labels, or missing values) metadata set.
+If the datasets are known to have no overlap, the [[SPSS/AddFiles|ADD FILES]] command is also an option.

----
 Line 28:
-=== Files ===
+== Left Join ==
 Line 30:
-Each `/FILE` subcommand takes one of:
+{{{
match files
  /file=LEFT /in=keep
  /file=RIGHT
  /by ID.
select if keep=1.
}}}
-Line 32:
+Line 38:
- * a star (`*`) indicating the active data set
 * the name of a data set
 * a valid filename or file handle
+The [[SPSS/StarJoin|STAR JOIN]] command is also available.
-Line 36:
+Line 40:
-If the active dataset is included in a join and referenced by a star (`*`), that dataset will be replaced in-place by the join.
+----
-Line 40:
+Line 44:
-=== Key Variables ===

All of the following must be satisfied by the key variable(s) named on the `/BY` subcommand.

 1. It must be defined as the same [[SPSS/DataTypes|format]] on all files
 2. It must be unique in each file
 3. Each file must be pre-sorted

In other words, each join could be guarded by:

{{{
dataset activate LEFT.
sort cases by ID.
compute dup=0.
if (ID=lag(ID)) dup=1.
select if dup=0.
execute.

dataset activate RIGHT.
sort cases by ID.
compute dup=0.
if (ID=lag(ID)) dup=1.
select if dup=0.
execute.
}}}

If a key variable is a string, it must additionally be defined as the same length on all files.



=== Tables ===

The `MATCH FILE` command has an extension through the '''`/TABLE`''' subcommand. This is useful for appending higher-level variables by way of lookup tables. Try:
+== Right Join ==
-Line 77:
+Line 49:
-  /table=LOOKUP
+  /file=RIGHT /in=keep
-Line 79:
+Line 51:
-}}}

Each `/TABLE` subcommand takes one of:

 * the name of a data set
 * a valid filename or file handle

In the case of a `MATCH FILE ... /TABLE` command, the key variable(s) only need to be unique across in each `/TABLE` dataset. In other words, each join could be guarded by:

{{{
dataset activate LEFT.
sort cases by ID.
execute.

dataset activate LOOKUP.
sort cases by ID.
compute dup=0.
if (ID=lag(ID)) dup=1.
select if dup=0.
execute.
+select if keep=1.
-Line 105:
+Line 58:
-== Update File ==

The '''`UPDATE FILE`''' command is used to overwrite values in a master file with non-null values in transaction files. Try:
+== Inner Join ==
-Line 110:
+Line 61:
-update file=LEFT
  /file=TRANSACTION
+match files
  /file=LEFT /in=left
  /file=RIGHT /in=right
-Line 113:
+Line 65:
-}}}

The final dataset contains all rows and variables from all datasets. Rows and variables originating from a transaction file are appended.



=== Files ===

Each `/FILE` subcommand takes one of:

 * a star (`*`) indicating the active data set
 * the name of a data set
 * a valid filename or file handle

The active dataset can be specified on a `/FILE` subcommand with a star (`*`). That dataset would then be replaced in-place by the join.

The first dataset specified on a `/FILE` subcommand is always the master file.


=== Key Variables ===

All of the following must be satisfied by the key variable(s) named on the `/BY` subcommand.

 1. It must be defined as the same type on all files
 2. It must be unique in each file
 3. Each file must be pre-sorted

In other words, each join could be guarded by:

{{{
dataset activate LEFT.
sort cases by ID.
execute.

dataset activate TRANSACTION.
sort cases by ID.
compute dup=0.
if (ID=lag(ID)) dup=1.
select if dup=0.
execute.
}}}

If a key variable is a string, it must additionally be defined as the same length on all files.

----



== Star Join ==

The final dataset contains only the variables specified on the `SELECT` and `JOIN` subcommand, and only the rows originating from the case file.

Note: unsupported in SPSS version 20 or earlier.

----



== Joins ==

=== Full Join ===

{{{
dataset activate LEFT.
match files
  /file=*
  /file=RIGHT
  /by ID.
execute.
}}}



=== Left Join ===

{{{
dataset activate LEFT.
match files
  /file=*     /in=flag_left
  /file=RIGHT /in=flag_right
  /by ID.
select if (flag_left=1).
execute.
}}}



=== Right Join ===

{{{
dataset activate LEFT.
match files
  /file=*     /in=flag_left
  /file=RIGHT /in=flag_right
  /by ID.
select if (flag_right=1).
execute.
}}}



=== Inner Join ===

{{{
dataset activate LEFT.
match files
  /file=*     /in=flag_left
  /file=RIGHT /in=flag_right
  /by ID.
select if (flag_left=1 and flag_right=1).
execute.
+select if left=1 and right=1.

Diff for "SPSS/JoiningData"

Joining Data with SPSS

Full Join

Left Join

Right Join

Inner Join