Differences between revisions 1 and 10 (spanning 9 versions)

Stata String Functions

Stata supports these string functions in the global scope.

General Purpose

Function Name	Meaning	Example
`abbrev(s,n)`
`plural(n,s)`	Return s if n is 1 or -1; otherwise append "s" to string s
`plural(n,s,p)`	As `plural` but specifying the plural form p explicitly
`real(s)`	Convert string s to a real value
`string(n)`	Convert numeric value n to a string
`string(n,f)`	Convert numeric value n to a string using format f
`stritrim(s)`	Remove duplicated internal space characters
`strofreal(n)`	Convert numeric value n to a string
`strofreal(n,f)`	Convert numeric value n to a string using format f
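
As a minimal sketch of a few of the general-purpose functions above, the following do-file lines use only literal values. The results in the comments are what the function descriptions lead one to expect; they are illustrative rather than an exhaustive specification.

```stata
* Pluralize a label depending on a count
display plural(1, "horse")              // horse
display plural(2, "horse")              // horses
display plural(2, "glass", "+es")       // glasses

* Convert between strings and numbers
display real("5.2") + 1                 // 6.2
display strofreal(4, "%9.2f")           // 4.00

* Collapse repeated internal spaces
display stritrim("hello     there")     // hello there
```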

There is a large set of functions designed for string data representing strictly ASCII-encoded values.

Function Name	Meaning	Example
`char(n)`	Character with ASCII code n
`indexnot(a,b)`
`lower(s)`	Convert to lowercase
`ltrim(s)`	Remove leading space characters
`rtrim(s)`	Remove trailing space characters
`soundex(s)`
`soundex_nara(s)`
`strlen(s)`	Length of string s in characters/bytes
`strlower(s)`	Convert to lowercase
`strltrim(s)`	Remove leading space characters
`strpos(s,p)`
`strproper(s)`	Convert to proper case
`strreverse(s)`
`strrpos(s,p)`
`strrtrim(s)`	Remove trailing space characters
`strtrim(s)`	Remove external space characters
`strupper(s)`	Convert to uppercase
`subinstr(s,p,r,n)`	Replace the first n matches of pattern p with replacement r
`subinword(s,p,r,n)`
`substr(s,o)`	Return the substring of string s from offset o
`substr(s,o,n)`	Return the substring of string s from offset o for length n characters
`trim(s)`	Remove external space characters
`upper(s)`	Convert to uppercase
`word(s,n)`
`wordcount(s)`
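
Several of the functions in this table have no description yet. The sketch below shows how some of them are commonly used, assuming plain ASCII input; the sample strings are made up for illustration.

```stata
* Position of the first and last occurrence of a substring (0 if absent)
display strpos("this is a test", "is")              // 3
display strrpos("this is a test", "is")             // 6

* Replace substrings; a missing count (.) means every occurrence
display subinstr("this is the day", "is", "X", 1)   // thX is the day
display subinstr("this is the day", "is", "X", .)   // thX X the day

* Replace whole words only
display subinword("this is the day", "is", "X", .)  // this X the day

* Pick and count space-delimited words; negative n counts from the end
display word("the quick brown fox", 2)              // quick
display word("the quick brown fox", -1)             // fox
display wordcount("the quick brown fox")            // 4

* Position of the first character of a not found in b
display indexnot("12345a", "0123456789")            // 6
```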

These are the new functions designed for Unicode-encoded values. In many cases, they are named similarly except for a 'ustr-' prefix.

Function Name	Meaning	Example
`uchar(n)`	Character with Unicode code point n
`udstrlen(s)`	Length of string s in display columns, respecting wide characters
`udsubstr(s,o,n)`	Return the substring of string s from offset o for n display columns
`uisdigit(s)`
`uisletter(s)`
`ustrcompare(a,b)`
`ustrcompare(a,b,l)`
`ustrleft(s,n)`	Return the leftmost substring of string s for length n characters
`ustrlen(s)`	Length of string s in characters
`ustrlower(s)`	Convert to lowercase
`ustrlower(s,l)`	Convert to lowercase in locale l
`ustrltrim(s)`
`ustrpos(s,p)`
`ustrreverse(s)`
`ustrright(s,n)`	Return the rightmost substring of string s for length n characters
`ustrrpos(s,p)`
`ustrrpos(s,p,o)`
`ustrrtrim(s)`
`ustrsortkey(s)`
`ustrsortkey(s,l)`
`ustrtitle(s)`	Convert to title case
`ustrtitle(s,l)`	Convert to title case in locale l
`ustrtrim(s)`	Remove external whitespace characters
`ustrupper(s)`	Convert to uppercase
`ustrupper(s,l)`	Convert to uppercase in locale l
`ustrword(s,n)`
`ustrword(s,n,l)`
`ustrwordcount(s)`
`ustrwordcount(s,l)`
`usubinstr(s,p,r,n)`	Replace the first n matches of pattern p with replacement r
`usubstr(s,o,n)`	Return the substring of string s from offset o for length n characters
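
The difference between the ASCII and Unicode families shows up as soon as a string contains a non-ASCII character. This is a small sketch, assuming a Stata version that stores strings as UTF-8 (Stata 14 or later); the local macro name `s` is arbitrary.

```stata
* Build "café" with uchar() so the example does not depend on file encoding
local s = "caf" + uchar(233)

display strlen("`s'")           // 5: bytes, because é takes two bytes in UTF-8
display ustrlen("`s'")          // 4: Unicode characters
display usubstr("`s'", -1, 1)   // é
display ustrupper("`s'")        // CAFÉ
```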

A couple of notes about the substr functions:

Negative offsets are interpreted as offsets from the end of the string value.
Missing lengths are interpreted as the maximum; read until the end of the string value.

generate skip_first_character = usubstr(string, 2, .)
generate second_character = usubstr(string, 2, 1)
generate last_character = usubstr(string, -1, 1)

Regular Expression Functions

There are two sets of regular expression functions. The first are the legacy functions designed for string data representing strictly ASCII-encoded values.

Function Name	Meaning	Example
`regexm(s,p)`	1 if string s matches pattern p, 0 otherwise	`regexm(zip5,"^[0-9][0-9][0-9][0-9][0-9]$")`
`regexr(s,p,r)`	Replace the first match to pattern p with replacement r	`regexr(filename,"\.(txt|csv|tsv)","")`
`regexs(n)`	The nth subexpression match (n in [0,9]; 0 is the entire match) from the last `regexm` call
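
A common idiom pairs `regexm` with `regexs` in a single `generate` statement. The sketch below assumes a toy dataset; the variable names `phone`, `has_number`, and `prefix` are hypothetical.

```stata
clear
input str12 phone
"555-1234"
"no number"
end

* regexm() returns 1 or 0; regexs() then reads groups from that same match
generate byte has_number = regexm(phone, "([0-9]+)-([0-9]+)")
generate prefix = regexs(1) if regexm(phone, "([0-9]+)-([0-9]+)")
list

* regexr() rewrites only the first match
display regexr("a-b-c", "-", "+")       // a+b-c
```

Rows that do not match are excluded by the `if` qualifier, so `prefix` is left empty for them.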

The second set are the new functions designed for Unicode-encoded values.

Function Name	Meaning	Example
`ustrregexm(s,p)`	1 if string s matches pattern p, 0 otherwise
`ustrregexm(s,p,b)`	Call `ustrregexm` with case-insensitivity if b is 1
`ustrregexrf(s,p,r)`	Replace the first match to pattern p with replacement r
`ustrregexrf(s,p,r,b)`	Call `ustrregexrf` with case-insensitivity if b is 1
`ustrregexra(s,p,r)`	Replace all matches to pattern p with replacement r
`ustrregexra(s,p,r,b)`	Call `ustrregexra` with case-insensitivity if b is 1
`ustrregexs(n)`	The nth pattern match from the last `ustrregexm` call

For ustrregexs, note that the 0th match is the entire portion of the string that matched the pattern.
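
The following sketch exercises the Unicode regular expression functions on literal strings. The sample text and the local macro name `s` are made up for illustration; subexpression 0 here shows the part of the string that satisfied the pattern.

```stata
local s = "Order 66 shipped"

* Subexpressions are available after a successful ustrregexm() call
if ustrregexm("`s'", "([0-9]+) (\w+)") {
    display ustrregexs(0)                   // 66 shipped
    display ustrregexs(1)                   // 66
    display ustrregexs(2)                   // shipped
}

* First match versus all matches
display ustrregexrf("a-b-c", "-", "+")      // a+b-c
display ustrregexra("a-b-c", "-", "+")      // a+b+c

* Third argument 1 turns on case-insensitive matching
display ustrregexm("STATA", "stata", 1)     // 1
```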

See here for details on Stata's regular expressions syntax.

Encoding and Decoding Functions

There are several functions meant for encoding or decoding string data.

Function Name	Meaning
`tobytes(s)`
`tobytes(s,n)`
`ustrfix(s)`
`ustrfix(s,r)`
`ustrfrom(s,e,m)`
`ustrinvalidcnt(s)`
`ustrnormalize(s,m)`
`ustrto(s,e,m)`
`ustrtohex(s)`
`ustrtohex(s,n)`
`ustrunescape(s)`

Locale Name Functions

Several of the above string functions take an optional locale name argument. This creates the need for more functions that can parse and validate locale names.

Function Name	Meaning
`collatorlocale(l,t)`
`collatorversion(l)`
`wordbreaklocale(l,t)`

Stata Name Functions

Stata offers functions for generating a safe name, for example when creating variables or macros.

Function Name	Meaning
`strtoname(s)`	Create a Stata 13 name
`ustrtoname(s)`	Create a modern Stata name

Both of these functions take an optional second argument. If that argument is 1 and the first character of s is numeric, the returned name is prefixed with an underscore character.
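
As a brief illustration of the name functions, the lines below pass literal strings to `strtoname`. Characters that are not allowed in a name are converted to underscores; the inputs are arbitrary examples.

```stata
* Spaces and punctuation become underscores
display strtoname("a name")         // a_name

* Second argument 1: a leading digit is kept and prefixed with an underscore
display strtoname("5:30", 1)        // _5_30

* Second argument 0: no prefix, so the leading digit is itself replaced
display strtoname("5:30", 0)        // __30
```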

-  ⇤ ← Revision 1 as of 2022-09-24 19:42:14 → 
  Size: 3272
  Editor: DominicRicottone
  Comment:
+   ← Revision 10 as of 2025-03-05 03:57:45 → ⇥
  Size: 9106
  Editor: DominicRicottone
  Comment: Minor rephrase
-Deletions are marked like this.
+Additions are marked like this.
 Line 2:
+Stata supports these '''string functions''' in the global scope.
-Line 8:
+Line 10:
-== General Syntax ==
+== General Purpose ==

||'''Function Name'''||'''Meaning'''                                                         ||'''Example'''||
||`abbrev(s,n)`      || || ||
||`plural(n,s)`      ||Return s if n is 1 or -1; otherwise append "s" to string s            || ||
||`plural(n,s,p)`    ||As `plural` but specifying the plural form p explicitly               || ||
||`real(s)`          ||Convert string s to a real value                                      || ||
||`string(n)`        ||Convert numeric value n to a string                                   || ||
||`string(n,f)`      ||Convert numeric value n to a string using format f                    || ||
||`stritrim(s)`      ||Remove duplicated internal space characters                           || ||
||`strofreal(n)`     ||Convert numeric value n to a string                                   || ||
||`strofreal(n,f)`   ||Convert numeric value n to a string using format f                    || ||

There is a large set of functions designed for string data representing ''strictly'' ASCII-encoded values.

||'''Function Name''' ||'''Meaning'''                                                         ||'''Example'''||
||`char(n)`           ||ASCII code n                                                          || ||
||`indexnot(a,b)`     || || ||
||`lower(s)`          ||Convert to lowercase                                                  || ||
||`ltrim(s)`          ||Remove leading space characters                                       || ||
||`rtrim(s)`          ||Remove trailing space characters                                      || ||
||`soundex(s)`        || || ||
||`soundex_nara(s)`   || || ||
||`strlen(s)`         ||Length of string s in characters/bytes                                || ||
||`strlower(s)`       ||Convert to lowercase                                                  || ||
||`strltrim(s)`       ||Remove leading space characters                                       || ||
||`strpos(s,p)`       || || ||
||`strproper(s)`      ||Convert to proper case                                                || ||
||`strreverse(s)`     || || ||
||`strrpos(s,p)`      || || ||
||`strrtrim(s)`       ||Remove trailing space characters                                      || ||
||`strtrim(s)`        ||Remove external space characters                                      || ||
||`strupper(s)`       ||Convert to uppercase                                                  || ||
||`subinstr(s,p,r,n)` ||Replace the first n matches of pattern p with replacement r           || ||
||`subinword(s,p,r,n)`|| || ||
||`substr(s,o)`       ||Return the substring of string s from offset o                        || ||
||`substr(s,o,n)`     ||Return the substring of string s from offset o for length n characters|| ||
||`trim(s)`           ||Remove external space characters                                      || ||
||`upper(s)`          ||Convert to uppercase                                                  || ||
||`word(s,n)`         || || ||
||`wordcount(s)`      || || ||

These are the new functions designed for Unicode-encoded values. In many cases, they are named similarly except for a 'ustr-' prefix.

||'''Function Name''' ||'''Meaning'''                                                         ||'''Example'''||
||`uchar(n)`          ||Unicode code n                                                        || ||
||`udstrlen(s)`       ||Length of string s in display columns, respecting wide characters     || ||
||`udsubstr(s,o,n)`   ||Return the substring of string s from offset o for n display columns  || ||
||`uisdigit(s)`       || || ||
||`uisletter(s)`      || || ||
||`ustrcompare(a,b)`  || || ||
||`ustrcompare(a,b,l)`|| || ||
||`ustrleft(s,n)`     ||Return the leftmost substring of string s for length n characters     || ||
||`ustrlen(s)`        ||Length of string s in characters                                      || ||
||`ustrlower(s)`      ||Convert to lowercase                                                  || ||
||`ustrlower(s,l)`    ||Convert to lowercase in locale l                                      || ||
||`ustrltrim(s)`      || || ||
||`ustrpos(s,p)`      || || ||
||`ustrreverse(s)`    || || ||
||`ustrright(s,n)`    ||Return the rightmost substring of string s for length n characters    || ||
||`ustrrpos(s,p)`     || || ||
||`ustrrpos(s,p,o)`   || || ||
||`ustrrtrim(s)`      || || ||
||`ustrsortkey(s)`    || || ||
||`ustrsortkey(s,l)`  || || ||
||`ustrtitle(s)`      ||Convert to title case                                                 || ||
||`ustrtitle(s,l)`    ||Convert to title case in locale l                                     || ||
||`ustrtrim(s)`       ||Remove external whitespace characters                                 || ||
||`ustrupper(s)`      ||Convert to uppercase                                                  || ||
||`ustrupper(s,l)`    ||Convert to uppercase in locale l                                      || ||
||`ustrword(s,n)`     || || ||
||`ustrword(s,n,l)`   || || ||
||`ustrwordcount(s)`  || || ||
||`ustrwordcount(s,l)`|| || ||
||`usubinstr(s,p,r,n)`||Replace the first n matches of pattern p with replacement r           || ||
||`usubstr(s,o,n)`    ||Return the substring of string s from offset o for length n characters|| ||

A couple of notes about the `substr` functions:

 * Negative offsets are interpreted as offsets from the end of the string value.
 * Missing lengths are interpreted as the maximum; read until the end of the string value.

{{{
generate skip_first_character = usubstr(string, 2, .)
generate second_character = usubstr(string, 2, 1)
generate last_character = usubstr(string, -1, 1)
}}}
-Line 11:
+Line 100:
-=== Date and Datetime Masks ===

Date and datetime conversion functions use a concept of '''masks'''. These instruct the function how to interpret the string.

A mask of `"DMY"` can parse all of:

 * `"21nov2006"`
 * `"21 November 2006"`
 * `"21-11-2006"`
 * `"21112006"`

Spaces are ignored in a mask; `"DMY"` is equivalent to `"D M Y"`.

The mask `"DMY"` cannot parse a string with a two-digit year. A two-digit prefix can be applied to "Y" in the mask, such as "DM19Y". If a string has a two-digit year, such a mask will cause the year to be interpreted as being within the 1900s. If a string has a four-digit year, the mask will not mutate the value.
-Line 31:
+Line 105:
-== Clock ==
+== Regular Expression Functions ==
-Line 33:
+Line 107:
-Convert a string date and time into the number of milliseconds since the Stata epoch (`01jan1960 00:00:00.000`).
+There are two sets of regular expression functions. The first are the legacy functions designed for string data representing strictly ASCII-encoded values.
-Line 35:
+Line 109:
-There are two functions: '''`clock`''' and '''`Clock`'''.
+||'''Function Name'''||'''Meaning'''                                               ||'''Example'''                               ||
||`regexm(s,p)`      ||1 if string s matches pattern p, 0 otherwise                ||`regexm(zip5,"^[0-9][0-9][0-9][0-9][0-9]$")`||
||`regexr(s,p,r)`    ||Replace the first match to pattern p with replacement r     ||`regexr(filename,"\.(txt|csv|tsv)","")`     ||
||`regexs(n)`        ||The nth (in [1,9]) pattern match from the last `regexm` call|| ||
-Line 37:
+Line 114:
-To create a datetime that ''ignores'' leap seconds, try:
+The second set are the new functions designed for Unicode-encoded values.
-Line 39:
+Line 116:
-{{{
generate double datetime = clock(string, "YMDhms")
format datetime %tc 
}}}
+||'''Function Name'''   ||'''Meaning'''                                          ||'''Example'''                               ||
||`ustrregexm(s,p)`     ||1 if string s matches pattern p, 0 otherwise           || ||
||`ustrregexm(s,p,b)`   ||Call `ustrregexm` with case-insensitivity if b is 1    || ||
||`ustrregexrf(s,p,r)`  ||Replace the first match to pattern p with replacement r|| ||
||`ustrregexrf(s,p,r,b)`||Call `ustrregexrf` with case-insensitivity if b is 1   || ||
||`ustrregexra(s,p,r)`  ||Replace all matches to pattern p with replacement r    || ||
||`ustrregexra(s,p,r,b)`||Call `ustrregexra` with case-insensitivity if b is 1   || ||
||`ustrregexs(n)`       ||The nth pattern match from the last `ustrregexm` call  || ||
-Line 44:
+Line 125:
-To create a datetime that ''includes'' leap seconds since the epoch, try:
+For `ustrregexs`, note that the 0th match is the entire portion of the string that matched the pattern.
-Line 46:
+Line 127:
-{{{
generate double datetime = Clock(string, "YMDhms")
format datetime %tC 
}}}

As noted above, the mask should be composed of: `"Y"`, `"M"`, `"D"`, `"h"`, `"m"`, and `"s"`. See above for details on masks.
+See [[Stata/RegularExpressions|here]] for details on Stata's regular expressions syntax.
-Line 57:
+Line 133:
-== Date ==
+== Encoding and Decoding Functions ==
-Line 59:
+Line 135:
-Convert a string date into the number of days since the Stata epoch (`01jan1960 00:00:00.000`).
+There are several functions meant for encoding or decoding string data.
-Line 61:
+Line 137:
-{{{
generate long date = date(string, "MDY")
format date %td
}}}

As noted above, the mask should be composed of: `"Y"`, `"M"`, and `"D"`. See above for details on masks.
+||'''Function Name''' ||'''Meaning'''||
||`tobytes(s)`        || ||
||`tobytes(s,n)`      || ||
||`ustrfix(s)`        || ||
||`ustrfix(s,r)`      || ||
||`ustrfrom(s,e,m)`   || ||
||`ustrinvalidcnt(s)` || ||
||`ustrnormalize(s,m)`|| ||
||`ustrto(s,e,m)`     || ||
||`ustrtohex(s)`      || ||
||`ustrtohex(s,n)`    || ||
||`ustrunescape(s)`   || ||
-Line 72:
+Line 154:
-== HalfYearly ==
+== Locale Name Functions ==
-Line 74:
+Line 156:
-Convert a string date into the number of half years since the Stata epoch (`01jan1960 00:00:00.000`).
+Several of the above string functions take an optional ''locale name'' argument. This creates the need for more functions that can parse and validate locale names.
-Line 76:
+Line 158:
-{{{
generate int halfyear = halfyearly(string, "YH")
format halfyear %th
}}}

As noted above, the mask should be composed of: `"Y"` and `"H"`. See above for details on masks.
+||'''Function Name'''   ||'''Meaning'''||
||`collatorlocale(l,t)` || ||
||`collatorversion(l)`  || ||
||`wordbreaklocale(s,n)`|| ||
-Line 87:
+Line 167:
-== Monthly ==
+== Stata Name Functions ==
-Line 89:
+Line 169:
-Convert a string date into the number of months since the Stata epoch (`01jan1960 00:00:00.000`).
+Stata offers several functions for generating a safe name, as for use in generating variables or macros.
-Line 91:
+Line 171:
-{{{
generate int month = monthly(string, "YM")
format month %tm
}}}
+||'''Function Name''' ||'''Meaning'''             ||
||`strtoname(s)`      ||Create a Stata 13 name    ||
||`ustrtoname(s)`     ||Create a modern Stata name||
-Line 96:
+Line 175:
-As noted above, the mask should be composed of: `"Y"` and `"M"`. See above for details on masks.
+Both of these functions are variadic. If the second argument is a 1, and then if the first character is numeric, the returned name is prefixed with an underscore character.
-Line 102:
+Line 181:
-== Quarterly ==
+== See also ==
-Line 104:
+Line 183:
-Convert a string date into the number of quarters since the Stata epoch (`01jan1960 00:00:00.000`).

{{{
generate int quarter = quarterly(string, "YQ")
format quarter %tq
}}}

As noted above, the mask should be composed of: `"Y"` and `"Q"`. See above for details on masks.

----



== Weekly ==

Convert a string date into the number of weeks since the Stata epoch (`01jan1960 00:00:00.000`).

{{{
generate int week = weekly(string, "YW")
format week %tw
}}}

As noted above, the mask should be composed of: `"Y"` and `"W"`. See above for details on masks.

----



== Yearly ==

Convert a string date into the number of years since the Stata epoch (`01jan1960 00:00:00.000`).

{{{
generate int year = yearly(string, "Y")
format year %ty
}}}

As noted above, the mask should be composed of: `"Y"`. See above for details on masks.
+[[https://www.stata.com/manuals/fnstringfunctions.pdf|Stata string functions]]

Diff for "Stata/StringFunctions"