Differences between revisions 8 and 9
Revision 8 as of 2023-06-13 22:48:44
Size: 7952
Comment:
Revision 9 as of 2025-03-05 03:55:08
Size: 9167
Comment: Rewrite
== General Purpose Functions ==

There are three sets of general-purpose string functions. The first set contains miscellaneous helpers, mostly for converting between numeric and string values.

||'''Function Name'''||'''Meaning''' ||'''Example'''||
||`abbrev(s,n)` || || ||
||`plural(n,s)` ||Return the plural of string s (by appending "s") if n is not 1 or -1, otherwise return the original string s|| ||
||`plural(n,s,p)` ||As `plural` but specifying the plural form p explicitly || ||
||`real(s)` ||Convert string s to a real value || ||
||`string(n)` ||Convert numeric value n to a string || ||
||`string(n,f)` ||Convert numeric value n to a string using format f || ||
||`stritrim(s)` ||Remove duplicated internal space characters || ||
||`strofreal(n)` ||Convert numeric value n to a string || ||
||`strofreal(n,f)` ||Convert numeric value n to a string using format f || ||
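
As a quick illustration of this first set, the following sketch calls a few of the functions on literal values with `display`. The inputs are made up for illustration, and the trailing comments show the result expected from each call.

{{{
* Illustrative literals only; nothing here depends on a dataset.
display plural(1, "horse")           // horse
display plural(2, "horse")           // horses
display plural(2, "mouse", "mice")   // mice
display real("5.2") + 1              // 6.2
display strofreal(4) + "F"           // 4F
display strofreal(4, "%9.2f")        // 4.00
display stritrim("hello     there")  // hello there
}}}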
 
The second set contains the legacy functions. They operate on bytes, so they are only reliable for string data representing strictly ASCII-encoded values.

||'''Function Name''' ||'''Meaning''' ||'''Example'''||
||`char(n)` ||The character with ASCII code n || ||
||`indexnot(a,b)` || || ||
||`lower(s)` ||Convert to lowercase || ||
||`ltrim(s)` ||Remove leading space characters || ||
||`rtrim(s)` ||Remove trailing space characters || ||
||`soundex(s)` || || ||
||`soundex_nara(s)` || || ||
||`strlen(s)` ||Length of string s in bytes (equal to the number of characters for ASCII data) || ||
||`strlower(s)` ||Convert to lowercase || ||
||`strltrim(s)` ||Remove leading space characters || ||
||`strpos(s,p)` || || ||
||`strproper(s)` ||Convert to proper case || ||
||`strreverse(s)` || || ||
||`strrpos(s,p)` || || ||
||`strrtrim(s)` ||Remove trailing space characters || ||
||`strtrim(s)` ||Remove external space characters || ||
||`strupper(s)` ||Convert to uppercase || ||
||`subinstr(s,p,r,n)` ||Replace the first n occurrences of substring p with replacement r || ||
||`subinword(s,p,r,n)`|| || ||
||`substr(s,o)` ||Return the substring of string s from byte offset o || ||
||`substr(s,o,n)` ||Return the substring of string s from byte offset o for length n bytes || ||
||`trim(s)` ||Remove external space characters || ||
||`upper(s)` ||Convert to uppercase || ||
||`word(s,n)` || || ||
||`wordcount(s)` || || ||
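
A few of the legacy functions, including some with no description in the table, can be tried on literal ASCII values as in the sketch below. The inputs are illustrative, and the comments show the expected results.

{{{
* Illustrative ASCII literals only.
display strpos("this", "is")                        // 3
display strpos("this", "it")                        // 0 (not found)
display subinstr("this is the day", "is", "X", 1)   // thX is the day
display subinstr("this is the hour", "is", "X", .)  // thX X the hour
display subinword("this is the day", "is", "X", 1)  // this X the day
display word("alpha beta gamma", 2)                 // beta
display wordcount("alpha beta gamma")               // 3
display strproper("mR. joHn a. sMitH")              // Mr. John A. Smith
}}}

Note that `subinstr` replaces any occurrence of the substring, while `subinword` only replaces occurrences that stand alone as a word.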

The third set contains the newer functions designed for Unicode-encoded values.

||'''Function Name''' ||'''Meaning''' ||'''Example'''||
||`uchar(n)` ||The character with Unicode code point n || ||
||`udstrlen(s)` ||Length of string s in display columns, respecting wide characters || ||
||`udsubstr(s,o,n)` ||Return the substring of string s from offset o for n display columns || ||
||`uisdigit(s)` || || ||
||`uisletter(s)` || || ||
||`ustrcompare(a,b)` || || ||
||`ustrcompare(a,b,l)`|| || ||
||`ustrleft(s,n)` ||Return the leftmost substring of string s for length n characters || ||
||`ustrlen(s)` ||Length of string s in characters || ||
||`ustrlower(s)` ||Convert to lowercase || ||
||`ustrlower(s,l)` ||Convert to lowercase in locale l || ||
||`ustrltrim(s)` || || ||
||`ustrpos(s,p)` || || ||
||`ustrreverse(s)` || || ||
||`ustrright(s,n)` ||Return the rightmost substring of string s for length n characters || ||
||`ustrrpos(s,p)` || || ||
||`ustrrpos(s,p,o)` || || ||
||`ustrrtrim(s)` || || ||
||`ustrsortkey(s)` || || ||
||`ustrsortkey(s,l)` || || ||
||`ustrtitle(s)` ||Convert to title case || ||
||`ustrtitle(s,l)` ||Convert to title case in locale l || ||
||`ustrtrim(s)` ||Remove external whitespace characters || ||
||`ustrupper(s)` ||Convert to uppercase || ||
||`ustrupper(s,l)` ||Convert to uppercase in locale l || ||
||`ustrword(s,n)` || || ||
||`ustrword(s,n,l)` || || ||
||`ustrwordcount(s)` || || ||
||`ustrwordcount(s,l)`|| || ||
||`usubinstr(s,p,r,n)`||Replace the first n occurrences of substring p with replacement r || ||
||`usubstr(s,o,n)` ||Return the substring of string s from offset o for length n characters|| ||
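
The difference between bytes, characters, and display columns is easiest to see on non-ASCII literals. The sketch below assumes the do-file is saved as UTF-8; the sample strings are illustrative.

{{{
* Assumes UTF-8 source text; sample strings are illustrative.
display strlen("médiane")        // 8 bytes: "é" takes two bytes in UTF-8
display ustrlen("médiane")       // 7 characters
display udstrlen("中值")         // 4 display columns: each character is two columns wide
display ustrleft("médiane", 2)   // mé
display ustrright("médiane", 2)  // ne
}}}
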

For `substr` and `usubstr`, negative offsets count from the end of the string value, and a missing length (`.`) reads until the end of the string value.

== CollatorLocale ==

----



== CollatorVersion ==

----



== IndexNote ==

----



== Lower ==

Deprecated name for `strlower`.

----



== LTrim ==

Deprecated name for `strltrim`.

----



== Plural ==

----



== Real ==

----



== RegexM ==

Match a string against a pattern. Returns 1 if the string matches and 0 otherwise.

The string must not contain a null byte (`char(0)`). While fixed-length strings cannot contain a null byte by design, long strings (`strL`) can. To get around this restriction, consider [[Stata/StringFunctions#UstrRegexM|ustrregexm]].

For example, to flag values that begin with a digit:
{{{
generate byte begins_with_number = regexm(string, "^[0-9]")
}}}

See [[Stata/RegularExpressions|here]] for details on Stata's regular expressions.

----



== RegexR ==

Match a string against a pattern and replace the first matching substring with a replacement substring.

The string must not contain a null byte (`char(0)`). While fixed-length strings cannot contain a null byte by design, long strings (`strL`) can. Returned substrings can be up to 1,100,000 bytes long. To get around these restrictions, consider [[Stata/StringFunctions#UstrRegexRf|ustrregexrf]].

To replace more than just the first matching substring, consider [[Stata/StringFunctions#UstrRegexRa|ustrregexra]].

{{{
generate filename_without_extension = regexr(filename,"\.(txt|csv|tsv)","")
}}}

See [[Stata/RegularExpressions|here]] for details on Stata's regular expressions.

----



== RegexS ==

Extract the nth matching subexpression from a prior `regexm` test. The 0th match is the entire portion of the string that matched the pattern.

Only the first 9 matching substrings are stored and available. Returned substrings can be up to 1,100,000 bytes long. To get around these restrictions, consider [[Stata/StringFunctions#UstrRegexS|ustrregexs]].

{{{
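* regexs(1) needs a parenthesized group, and must be paired with regexm in the same command
generate byte is_pipe_delimited = regexm(string, "^([^|]+)[|]")
generate first_field = regexs(1) if regexm(string, "^([^|]+)[|]")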
}}}

See [[Stata/RegularExpressions|here]] for details on Stata's regular expressions.

----



== RTrim ==

Deprecated name for `strrtrim`.

----



== Soundex ==

----



== Soundex_Nara ==

----



== String ==

Alias for [[Stata/StringFunctions#StrOfReal|strofreal]].

----



== StrITrim ==

----



== StrLen ==

----



== StrLower ==

----



== StrLTrim ==

----



== StrOfReal ==

----



== StrPos ==

----



== StrProper ==

----



== StrReverse ==

----



== StrRPos ==

----



== StrRTrim ==

----



== StrToName ==

----



== StrTrim ==

----



== StrUpper ==

----



== SubInStr ==

----



== SubInWord ==

----



== SubStr ==

Extract a substring from a string using a ''start'' argument and an optional ''length'' argument, as `substr(string, start, length)`. If the optional ''length'' argument is left off or set to the missing value (`.`), the extraction continues to the end of the string.

{{{
* The next two lines are equivalent; the second uses a different name so both can run
generate skip_first_character = substr(string, 2)
generate skip_first_character_alt = substr(string, 2, .)
generate second_character = substr(string, 2, 1)
generate last_character = substr(string, -1, 1)
}}}

The ''start'' and ''length'' parameters are byte positions rather than character indices, which does not matter for ASCII data but will impact many other character encodings. If the optional ''length'' argument is left off and a null byte (`char(0)`) is encountered between the ''start'' byte position and the end of the string, the extraction ends at that null byte (excluding the null byte). To get around these restrictions, consider [[Stata/StringFunctions#USubStr|usubstr]].
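
The byte-versus-character distinction can be sketched with literal strings as follows. The sample values are illustrative, and the non-ASCII example assumes the do-file is saved as UTF-8.

{{{
* ASCII data: byte positions and character positions coincide.
display substr("abcdef", 2, 3)    // bcd
display substr("abcdef", -3, 2)   // de
* Non-ASCII data: "é" occupies two bytes, so the results differ.
display substr("médiane", 2, 3)   // éd  (bytes 2-4: both bytes of "é", then "d")
display usubstr("médiane", 2, 3)  // édi (characters 2-4)
}}}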

----



== ToBytes ==

----



== Trim ==

Deprecated name for `strtrim`.

----



== UChar ==

----



== UIsDigit ==

----



== UIsLetter ==

----



== Upper ==

Deprecated name for `strupper`.

----



== UStrCompare ==

----



== UStrCompareEx ==

----



== UStrFix ==

----



== UStrFrom ==

----



== UStrInvalidCnt ==

----



== UStrLeft ==

Extract the first n characters from a string.

{{{
generate first_two = ustrleft(string, 2)
}}}

----



== UStrLen ==

The returned value is in terms of characters, irrespective of wide characters. To return a value that can be used in fixed-width fonts respecting wide characters, a variant named `udstrlen` is also available.

----



== UStrLower ==

----



== UStrLTrim ==

----



== UStrNormalize ==

----



== UStrPos ==

----



== UStrRegexM ==

Match a Unicode string against a pattern. Returns 1 if the string matches and 0 otherwise.

The optional third argument toggles case-insensitive matching. The default is 0 (case-sensitive).

{{{
generate byte begins_with_number = ustrregexm(string, "^[0-9]")
generate byte begins_with_letter = ustrregexm(string, "^[a-z]", 1)
}}}

See [[Stata/RegularExpressions|here]] for details on Stata's regular expressions.

----



== UStrRegexRf ==

Match a Unicode string against a pattern and replace the first matching substring with a replacement substring.

The optional fourth argument toggles case-insensitive matching. The default is 0 (case-sensitive).

{{{
generate filename_without_extension = ustrregexrf(filename, "\.(txt|csv|tsv)", "", 1)
}}}

See [[Stata/RegularExpressions|here]] for details on Stata's regular expressions.

----



== UStrRegexRa ==

Match a Unicode string against a pattern and replace all matching substrings with a replacement substring.

The optional fourth argument toggles case-insensitive matching. The default is 0 (case-sensitive).

{{{
generate name_without_numbers = ustrregexra(name, "[0-9]", "")
generate name_without_accented_a = ustrregexra(name, "[áàȧâäǎăāãå]", "a", 1)
}}}

See [[Stata/RegularExpressions|here]] for details on Stata's regular expressions.

----



== UStrRegexS ==

Extract the nth matching subexpression from a prior `ustrregexm` test. The 0th match is the entire portion of the string that matched the pattern.

{{{
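* ustrregexs(1) needs a parenthesized group, and must be paired with ustrregexm in the same command
generate byte is_pipe_delimited = ustrregexm(string, "^([^|]+)[|]")
generate first_field = ustrregexs(1) if ustrregexm(string, "^([^|]+)[|]")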
}}}

See [[Stata/RegularExpressions|here]] for details on Stata's regular expressions.

----



== UStrReverse ==

----



== UStrRight ==

Extract the last n characters from a string.

{{{
generate last_two = ustrright(string, 2)
}}}

----



== UStrRPos ==

----



== UStrRTrim ==

----



== UStrSortKey ==

----



== UStrSortKeyEx ==

----



== UStrTitle ==

----



== UStrTo ==

----



== UStrToHex ==

----



== UStrToName ==

----



== UStrTrim ==

----



== UStrUnescape ==

----



== UStrUpper ==

----



== UStrWord ==

----



== UStrWordCount ==

----



== USubInStr ==

----



== USubStr ==

Extract a substring from a string using ''start'' and ''length'' arguments, as `usubstr(string, start, length)`. If the ''length'' argument is the missing value (`.`), the extraction continues to the end of the string.
 * Negative offsets are interpreted as offsets from the end of the string value.
 * Missing lengths are interpreted as the maximum; read until the end of the string value.
The ''start'' and ''length'' parameters are character indices, irrespective of wide characters. To extract a substring that can be printed in fixed-width fonts to a fixed-length space respecting wide characters, a variant named `udsubstr` is also available.
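
The `substr` examples carry over to `usubstr` directly. In this sketch, `string` stands for any string variable in the dataset, and the new variable names are illustrative.

{{{
* `string` is a placeholder for an existing string variable.
generate skip_first_character = usubstr(string, 2, .)
generate second_character = usubstr(string, 2, 1)
generate last_character = usubstr(string, -1, 1)
}}}
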
== Regular Expression Functions ==

There are two sets of regular expression functions. The first are the legacy functions designed for string data representing strictly ASCII-encoded values.

||'''Function Name'''||'''Meaning''' ||'''Example''' ||
||`regexm(s,p)` ||1 if string s matches pattern p, 0 otherwise ||`regexm(zip5,"^[0-9][0-9][0-9][0-9][0-9]$")`||
||`regexr(s,p,r)` ||Replace the first match to pattern p with replacement r ||`regexr(filename,"\.(txt|csv|tsv)","")` ||
||`regexs(n)` ||The nth (in [0,9], where 0 is the whole match) subexpression from the last `regexm` call|| ||

The second set are the new functions designed for Unicode-encoded values.

||'''Function Name''' ||'''Meaning''' ||'''Example''' ||
||`ustrregexm(s,p)` ||1 if string s matches pattern p, 0 otherwise || ||
||`ustrregexm(s,p,b)` ||Call `ustrregexm` with case-insensitivity if b is 1 || ||
||`ustrregexrf(s,p,r)` ||Replace the first match to pattern p with replacement r|| ||
||`ustrregexrf(s,p,r,b)`||Call `ustrregexrf` with case-insensitivity if b is 1 || ||
||`ustrregexra(s,p,r)` ||Replace all matches to pattern p with replacement r || ||
||`ustrregexra(s,p,r,b)`||Call `ustrregexra` with case-insensitivity if b is 1 || ||
||`ustrregexs(n)` ||The nth pattern match from the last `ustrregexm` call || ||

For `ustrregexs`, note that the 0th match is the entire portion of the string that matched the pattern.

See [[Stata/RegularExpressions|here]] for details on Stata's regular expressions syntax.
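
The difference between replacing the first match and replacing all matches, and the use of capture groups, can be sketched on literal strings as follows. The sample strings and patterns are illustrative.

{{{
* Legacy function: regexr replaces only the first match.
display regexr("a1b22c", "[0-9]", "")        // ab22c
* Unicode functions: choose first-only (rf) or all (ra) explicitly.
display ustrregexrf("a1b22c", "[0-9]", "")   // ab22c
display ustrregexra("a1b22c", "[0-9]", "")   // abc
* Capture groups: ustrregexs(n) reads from the most recent ustrregexm call.
display ustrregexm("alice|bob", "^([^|]+)[|](.+)$")  // 1
display ustrregexs(1)                        // alice
display ustrregexs(2)                        // bob
}}}
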
== Encoding and Decoding Functions ==

There are several functions meant for encoding or decoding string data.

||'''Function Name''' ||'''Meaning'''||
||`tobytes(s)` || ||
||`tobytes(s,n)` || ||
||`ustrfix(s)` || ||
||`ustrfix(s,r)` || ||
||`ustrfrom(s,e,m)` || ||
||`ustrinvalidcnt(s)` || ||
||`ustrnormalize(s,m)`|| ||
||`ustrto(s,e,m)` || ||
||`ustrtohex(s)` || ||
||`ustrtohex(s,n)` || ||
||`ustrunescape(s)` || ||
== Locale Name Functions ==

Several of the above string functions take an optional ''locale name'' argument. This creates the need for more functions that can parse and validate locale names.

||'''Function Name''' ||'''Meaning'''||
||`collatorlocale(l,t)` || ||
||`collatorversion(l)` || ||
||`wordbreaklocale(s,n)`|| ||

----



== Stata Name Functions ==

Stata offers several functions for generating a safe name, as for use in generating variables or macros.

||'''Function Name''' ||'''Meaning''' ||
||`strtoname(s)` ||Create a Stata 13 name ||
||`ustrtoname(s)` ||Create a modern Stata name||

Both of these functions accept an optional second argument. If it is 1 and the first character of the string is numeric, the returned name is prefixed with an underscore character.
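
The effect of the second argument can be sketched on literal values as follows. The inputs are illustrative, and the last line assumes the do-file is saved as UTF-8.

{{{
* Characters not allowed in a name become underscores.
display strtoname("a name")         // a_name
* Second argument 1: prefix an underscore when the first character is numeric.
display strtoname("5:30", 1)        // _5_30
* Second argument 0: no prefix is added.
display strtoname("5:30", 0)        // 5_30
* ustrtoname keeps Unicode letters that modern Stata names allow.
display ustrtoname("the médiane")   // the_médiane
}}}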

----



== See also ==

[[https://www.stata.com/manuals/fnstringfunctions.pdf|Stata string functions]]

Stata String Functions

Stata supports these string functions in the global scope.




CategoryRicottone

Stata/StringFunctions (last edited 2025-03-05 03:57:45 by DominicRicottone)