SPSS Unicode

SPSS was written with the assumption that 1 character = 1 byte. That isn't true for many character encodings, including the Unicode encodings (UTF-8, for example, uses between 1 and 4 bytes per character). SPSS now works around this with Unicode mode.
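
As a minimal sketch of the mismatch, assuming Unicode mode is on and the syntax file itself is read as UTF-8: LENGTH counts bytes while CHAR.LENGTH counts characters, so the two should disagree for any value that contains non-ASCII text. The variable names and the sample value are made up for illustration.

```
* Hypothetical one-case dataset; name, nbytes, and nchars are illustrative.
data list free /name (A20).
begin data
"café"
end data.
* LENGTH returns bytes; RTRIM keeps trailing blanks from being counted.
compute nbytes = length(rtrim(name)).
* CHAR.LENGTH returns characters, ignoring trailing blanks.
compute nchars = char.length(name).
list.
```

In code page mode with a single-byte locale, the two values would be expected to match for this string.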

Setting the default encoding to Unicode further complicated things, in that Windows users were now forced to think about locale for the first time ever.

PSPP sidestepped this entire issue by observing locale.


Default Encoding

SPSS versions <21 default to the encoding prescribed by the system locale, bearing in mind that SPSS versions <16 do not support Unicode at all.

SPSS versions >=21 use Unicode by default.

SPSS servers inherit the encoding from the connected client.

Encoding Override

The SET UNICODE command toggles Unicode mode. YES or ON enables it, while NO or OFF disables it.

If Unicode mode is disabled, SPSS tries to use the encoding prescribed by the system locale.

Unicode mode cannot be altered while a data file is open. Try:

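* Close every open dataset and start from an empty one.
dataset close all.
new file.
* Enable Unicode mode, then confirm the setting.
set unicode=on.
show unicode.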

Note: SET UNICODE is not supported or needed in PSPP.

Locale Override

To check the current locale, use the SHOW LOCALE command.

The SET LOCALE command overrides the system locale within the SPSS session. Try:

set locale="Japanese".

The allowed options for locale are called LocaleIDs, which are meant to follow the IANA character sets registry.

Note: the allowed LocaleIDs changed in SPSS version 16, without backward compatibility.

Note: for SPSS servers, the allowed LocaleIDs come from a configuration file (loclmap.xml). Notably, Windows-1252 is not included by default. The system administrator needs to alter this file to make additional LocaleIDs available.


Text Data

Reading Files

SPSS versions >=21 support an /ENCODING subcommand on GET DATA. Before that, all text data had to be encoded according to the system locale. At first, the only valid options were "LOCALE" and "UTF8".

SPSS version 23 added support for "UTF16", "UTF16BE", and "UTF16LE".
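
A minimal sketch of reading a UTF-8 delimited text file, assuming SPSS version 21 or later. The file path, variable names, and formats are hypothetical placeholders.

```
* Hypothetical file and variables; adjust the path and formats to the data.
get data
  /type=txt
  /file="C:\data\survey.csv"
  /encoding="UTF8"
  /delimiters=","
  /qualifier='"'
  /arrangement=delimited
  /firstcase=2
  /variables=id F8.0 name A40.
execute.
```

Swapping "UTF8" for "LOCALE" would read the same file using the encoding prescribed by the current locale.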

Writing Files

SPSS versions >=19 support an /ENCODING subcommand on SAVE TRANSLATE with a /TYPE of CSV or TAB. Valid options are "LOCALE", "UTF8", "UTF16", "UTF16BE", "UTF16LE", a numeric Windows code page value (such as "1252"), or an IANA code page value (such as "iso8859-1").

If Unicode mode is enabled, the default is "UTF8". (Otherwise it defaults to "LOCALE".)
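
A minimal sketch of a text export with an explicit encoding, assuming a comma-separated target (/TYPE=CSV) and a hypothetical output path.

```
* Hypothetical output path; /fieldnames writes variable names as a header row.
save translate
  /outfile="C:\data\survey_out.csv"
  /type=csv
  /encoding="UTF8"
  /fieldnames
  /replace.
```

Stating the encoding explicitly keeps the output the same whether or not Unicode mode is enabled in the session.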


Binary Data

PSPP does not support proprietary binary data formats.

Reading Files

SPSS versions >=19 support an /ENCODING subcommand on GET SAS and GET STATA. Valid options include "LOCALE", "UTF8", "Windows-1252", and several other Windows and IBM codepages.
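
A minimal sketch of both commands, assuming the encoding values are spelled as listed above. The file paths are hypothetical.

```
* Hypothetical paths; the encoding values are the ones named in the text.
get stata file="C:\data\survey.dta"
  /encoding="Windows-1252".

get sas data="C:\data\survey.sas7bdat"
  /encoding="UTF8".
```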

Writing Files

SPSS versions >=19 support an /ENCODING subcommand on SAVE TRANSLATE with a /TYPE of SAS or STATA. Valid options include "LOCALE", "UTF8", "Windows-1252", and several other Windows and IBM codepages. For SAS exports, the encoding applies to both the data file (/OUTFILE) and the value labels file (/VALFILE).

The default for Stata and for SAS versions <9 is always "LOCALE". If Unicode mode is enabled, the default for SAS versions >=9 is "UTF8". (Otherwise it defaults to "LOCALE".)
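
A minimal sketch of a SAS 9 export with an explicit encoding, assuming a Windows target platform. The file paths are hypothetical.

```
* Hypothetical paths; the encoding applies to both /outfile and /valfile.
save translate
  /outfile="C:\data\survey.sas7bdat"
  /type=sas
  /version=9
  /platform=windows
  /valfile="C:\data\survey_labels.sas"
  /encoding="UTF8"
  /replace.
```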

Note: SPSS version 25 introduced interoperability with Stata 14, which is the first version of Stata to support Unicode.


SPSS/Unicode (last edited 2023-05-30 19:35:40 by DominicRicottone)