SPSS Joining Data

SPSS offers several commands for joining data.

The syntax of the commands is first explained, then the equivalent(s) of standard SQL joins are listed.


Match Files

The MATCH FILE command is primarily used for 1:1 joins, where all cases are uniquely identified in all datasets. Try:

match files
  /file=LEFT
  /file=RIGHT
  /by ID.

The final dataset contains all rows and variables from all datasets. Variables are taken in order from the datasets in order. For variables originating from more than one dataset, values are taken from the first dataset they appear in and metadata is taken from the first dataset with any (i.e. variable label, value labels, or missing values) metadata set.

Files

Each /FILE subcommand takes one of:

If the active dataset is included in a join and referenced by a star (*), that dataset will be replaced in-place by the join.

Key Variables

All of the following must be satisfied by the key variable(s) named on the /BY subcommand.

  1. It must be defined as the same format on all files

  2. It must be unique in each file
  3. Each file must be pre-sorted

In other words, each join could be guarded by:

dataset activate LEFT.
sort cases by ID.
compute dup=0.
if (ID=lag(ID)) dup=1.
select if dup=0.
execute.

dataset activate RIGHT.
sort cases by ID.
compute dup=0.
if (ID=lag(ID)) dup=1.
select if dup=0.
execute.

If a key variable is a string, it must additionally be defined as the same length on all files.

Tables

The MATCH FILE command has an extension through the /TABLE subcommand. This is useful for appending higher-level variables by way of lookup tables. Try:

match files
  /file=LEFT
  /table=LOOKUP
  /by ID.

Each /TABLE subcommand takes one of:

In the case of a MATCH FILE ... /TABLE command, the key variable(s) only need to be unique across in each /TABLE dataset. In other words, each join could be guarded by:

dataset activate LEFT.
sort cases by ID.
execute.

dataset activate LOOKUP.
sort cases by ID.
compute dup=0.
if (ID=lag(ID)) dup=1.
select if dup=0.
execute.


Update File

The UPDATE FILE command is used to overwrite values in a master file with non-null values in transaction files. Try:

update file=LEFT
  /file=TRANSACTION
  /by ID.

The final dataset contains all rows and variables from all datasets. Rows and variables originating from a transaction file are appended.

Files

Each /FILE subcommand takes one of:

The active dataset can be specified on a /FILE subcommand with a star (*). That dataset would then be replaced in-place by the join.

The first dataset specified on a /FILE subcommand is always the master file.

Key Variables

All of the following must be satisfied by the key variable(s) named on the /BY subcommand.

  1. It must be defined as the same type on all files
  2. It must be unique in each file
  3. Each file must be pre-sorted

In other words, each join could be guarded by:

dataset activate LEFT.
sort cases by ID.
execute.

dataset activate TRANSACTION.
sort cases by ID.
compute dup=0.
if (ID=lag(ID)) dup=1.
select if dup=0.
execute.

If a key variable is a string, it must additionally be defined as the same length on all files.


Star Join

The final dataset contains only the variables specified on the SELECT and JOIN subcommand, and only the rows originating from the case file.

Note: unsupported in SPSS version 20 or earlier.


Joins

Full Join

dataset activate LEFT.
match files
  /file=*
  /file=RIGHT
  /by ID.
execute.

Left Join

dataset activate LEFT.
match files
  /file=*     /in=flag_left
  /file=RIGHT /in=flag_right
  /by ID.
select if (flag_left=1).
execute.

Right Join

dataset activate LEFT.
match files
  /file=*     /in=flag_left
  /file=RIGHT /in=flag_right
  /by ID.
select if (flag_right=1).
execute.

Inner Join

dataset activate LEFT.
match files
  /file=*     /in=flag_left
  /file=RIGHT /in=flag_right
  /by ID.
select if (flag_left=1 and flag_right=1).
execute.


CategoryRicottone