Stata String Functions

Stata supports these string functions in the global scope.


General Syntax

Date and Datetime Masks

Date and datetime conversion functions use a concept of masks. These instruct the function how to interpret the string.

A mask of "DMY" can parse all of "21nov2006", "21 November 2006", "21-11-2006", and "21/11/2006".

Spaces are ignored in a mask; "DMY" is equivalent to "D M Y".

The mask "DMY" cannot parse a string with a two-digit year; the result is a missing value. A two-digit century prefix can be applied to "Y" in the mask, such as "DM19Y". If a string has a two-digit year, such a mask causes the year to be interpreted as falling within the 1900s. If a string has a four-digit year, the prefix is ignored and the year is used as written.
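The effect of a century prefix can be checked directly with display. The following is a minimal sketch; the literal date strings are illustrative only and are not taken from any particular dataset.

```stata
* "DMY" expects a four-digit year, so a two-digit year yields missing (.)
display date("21/11/06", "DMY")

* Prefixing Y with 20 places a two-digit year in the 2000s
display %td date("21/11/06", "DM20Y")

* A four-digit year is used as written, regardless of the prefix
display %td date("21/11/1998", "DM20Y")
```

The second command should display 21nov2006 and the third 21nov1998.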


Clock

Convert a string date and time into the number of milliseconds since the Stata epoch (01jan1960 00:00:00.000).

There are two functions: clock and Clock.

To create a datetime that ignores leap seconds, try:

generate double datetime = clock(string, "YMDhms")
format datetime %tc 

To create a datetime that includes leap seconds since the epoch, try:

generate double datetime = Clock(string, "YMDhms")
format datetime %tC 

The mask should be composed of "Y", "M", "D", "h", "m", and "s". See Date and Datetime Masks above for details.
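Datetimes are counted in milliseconds, so the values are too large to be held exactly in a float; this is why the examples above store the result as a double. A small self-contained sketch follows, assuming a hypothetical string variable named timestamp.

```stata
clear
* timestamp is a hypothetical variable holding "YYYY-MM-DD hh:mm:ss" text
input str19 timestamp
"2006-11-21 14:30:00"
"1960-01-01 00:00:01"
end

* Store as double: a float cannot represent millisecond counts exactly
generate double datetime = clock(timestamp, "YMDhms")
format datetime %tc
list
```

The second observation is one second after the epoch, so its underlying value is 1000.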


Date

Convert a string date into the number of days since the Stata epoch (01jan1960 00:00:00.000).

generate long date = date(string, "MDY")
format date %td

The mask should be composed of "Y", "M", and "D". See Date and Datetime Masks above for details.
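A string that does not fit the mask is converted to a missing value rather than raising an error, so it is worth checking for failed conversions. The sketch below assumes a hypothetical variable named raw_date in which one value is written in a different order than the mask expects.

```stata
clear
* raw_date is a hypothetical variable; the second row does not fit "MDY"
input str10 raw_date
"11/21/2006"
"2006-11-21"
end

generate long parsed = date(raw_date, "MDY")
format parsed %td

* List rows that had text but could not be converted
list if missing(parsed) & !missing(raw_date)
```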


HalfYearly

Convert a string date into the number of half years since the Stata epoch (01jan1960 00:00:00.000).

generate int halfyear = halfyearly(string, "YH")
format halfyear %th

The mask should be composed of "Y" and "H". See Date and Datetime Masks above for details.


Monthly

Convert a string date into the number of months since the Stata epoch (01jan1960 00:00:00.000).

generate int month = monthly(string, "YM")
format month %tm

The mask should be composed of "Y" and "M". See Date and Datetime Masks above for details.


Quarterly

Convert a string date into the number of quarters since the Stata epoch (01jan1960 00:00:00.000).

generate int quarter = quarterly(string, "YQ")
format quarter %tq

The mask should be composed of "Y" and "Q". See Date and Datetime Masks above for details.
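The half-yearly, monthly, and quarterly conversions follow the same pattern as date, differing only in the mask letters and the display format. A brief sketch with illustrative literals, assuming the period number is separated from the year by a hyphen:

```stata
* Each function returns a count of periods since 1960; the format makes it readable
display %th halfyearly("2006-2", "YH")
display %tm monthly("2006-11", "YM")
display %tq quarterly("2006-4", "YQ")
```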


RegexM

Match a string against a pattern. Returns 1 if the string matches and 0 otherwise.

The string must not contain a null byte (char(0)). While fixed-length strings cannot contain a null byte by design, long strings (strL) can. To get around this restriction, consider ustrregexm.

generate byte begins_with_number = regexm(string, "^[0-9]")

See here for details on Stata's regular expressions.
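Because regexm returns 1 or 0, it can be tried on literal strings before being applied to a variable. A minimal sketch with made-up inputs:

```stata
* Starts with a digit, so this displays 1
display regexm("2fast", "^[0-9]")

* The digit is not at the start, so this displays 0
display regexm("fast2", "^[0-9]")
```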


RegexR

Match a string against a pattern and replace the first matching substring with a replacement substring.

The string must not contain a null byte (char(0)). While fixed-length strings cannot contain a null byte by design, long strings (strL) can. Returned substrings can be up to 1,100,000 bytes long. To get around these restrictions, consider ustrregexrf.

To replace more than just the first matching substring, consider ustrregexra.

generate filename_without_extension = regexr(filename, "\.(txt|csv|tsv)$", "")

See here for details on Stata's regular expressions.
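The difference between replacing the first match and replacing every match is easiest to see on a literal. A short sketch, using an illustrative string:

```stata
* regexr replaces only the first match: displays a-b.c
display regexr("a.b.c", "\.", "-")

* ustrregexra replaces every match: displays a-b-c
display ustrregexra("a.b.c", "\.", "-")
```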


RegexS

Extract the nth parenthesized subexpression from a prior regexm test. The 0th match is the entire portion of the string that matched the pattern.

Only the first 9 matching substrings are stored and available. Returned substrings can be up to 1,100,000 bytes long. To get around these restrictions, consider ustrregexs.

generate byte is_pipe_delimited = regexm(string, "^([^|]+)[|]")
generate first_field = regexs(1) if regexm(string, "^([^|]+)[|]")

See here for details on Stata's regular expressions.
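regexs(1) only has something to return when the pattern contains a parenthesized subexpression. A common idiom is to call regexm in the if qualifier of the same command, so that the match and the extraction are evaluated together for each observation. A minimal sketch, assuming a hypothetical variable named record:

```stata
clear
* record is a hypothetical variable; only the first row contains a pipe
input str20 record
"alpha|beta|gamma"
"no delimiter here"
end

* The parentheses capture everything before the first pipe
generate first_field = regexs(1) if regexm(record, "^([^|]+)[|]")
list
```

The first observation yields alpha; the second does not match, so first_field is left empty.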


SubInStr


SubStr

Extract a substring from a string using a start argument and an optional length argument, as substr(string, start, length). If the optional length argument is left off or set to the missing value (.), the extraction continues to the end of the string.

generate skip_first_character = substr(string, 2)
generate skip_first_character = substr(string, 2, .)
generate second_character = substr(string, 2, 1)
generate last_character = substr(string, -1, 1)

The start and length parameters are byte positions rather than character indices, which does not matter for ASCII data but will impact many other character encodings. If the optional length argument is left off and a null byte (char(0)) is encountered between the start byte position and the end of the string, the extraction ends at that null byte (excluding the null byte). To get around these restrictions, consider usubstr.
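The byte versus character distinction shows up as soon as a string contains a non-ASCII character. In the sketch below, the literal "café" is assumed to be UTF-8 encoded, so the é occupies two bytes.

```stata
* Four characters, but five bytes
display strlen("café")
display ustrlen("café")

* Byte positions 1 through 3 are plain ASCII: displays caf
display substr("café", 1, 3)

* Character position 4 is the whole é, regardless of its byte width
display usubstr("café", 4, 1)
```

Asking substr for a single byte at position 4 would return only half of the é, which is not valid text on its own.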


UstrLeft

Extract the first n characters from a string.

generate first_two = ustrleft(string, 2)


UstrRegexM

Match a Unicode string against a pattern. Returns 1 if the string matches and 0 otherwise.

The optional third argument toggles case-insensitive matching. The default is 0 (case-sensitive).

generate byte begins_with_number = ustrregexm(string, "^[0-9]")
generate byte begins_with_letter = ustrregexm(string, "^[a-z]", 1)

See here for details on Stata's regular expressions.
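The effect of the third argument can be seen on a literal with a capital first letter. A minimal sketch:

```stata
* Case-sensitive by default: "A" is not in [a-z], so this displays 0
display ustrregexm("Alpha", "^[a-z]")

* With the third argument set to 1 the match ignores case: displays 1
display ustrregexm("Alpha", "^[a-z]", 1)
```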


UstrRegexRf

Match a Unicode string against a pattern and replace the first matching substring with a replacement substring.

The optional fourth argument toggles case-insensitive matching. The default is 0 (case-sensitive).

generate filename_without_extension = ustrregexrf(filename, "\.(txt|csv|tsv)$", "", 1)

See here for details on Stata's regular expressions.


UstrRegexRa

Match a Unicode string against a pattern and replace all matching substrings with a replacement substring.

The optional fourth argument toggles case-insensitive matching. The default is 0 (case-sensitive).

generate name_without_numbers = ustrregexra(name, "[0-9]", "")
generate name_without_accented_a = ustrregexra(name, "[áàȧâäǎăāãå]", "a", 1)

See here for details on Stata's regular expressions.
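The two replacement functions differ only in how many matches they act on. A short sketch with an illustrative literal:

```stata
* ustrregexrf removes only the first digit: displays ab2
display ustrregexrf("a1b2", "[0-9]", "")

* ustrregexra removes every digit: displays ab
display ustrregexra("a1b2", "[0-9]", "")
```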


UstrRegexS

Extract the nth parenthesized subexpression from a prior ustrregexm test. The 0th match is the entire portion of the string that matched the pattern.

generate byte is_pipe_delimited = ustrregexm(string, "^([^|]+)[|]")
generate first_field = ustrregexs(1) if ustrregexm(string, "^([^|]+)[|]")

See here for details on Stata's regular expressions.


UstrRight

Extract the last n characters from a string.

generate last_two = ustrright(string, 2)


USubInStr


USubStr

Extract a substring from a string using start and length arguments, as usubstr(string, start, length). If the length argument is the missing value (.), the extraction continues to the end of the string.

generate skip_first_character = usubstr(string, 2, .)
generate second_character = usubstr(string, 2, 1)
generate last_character = usubstr(string, -1, 1)

The start and length parameters are character indices, irrespective of wide characters. To extract a substring that can be printed in fixed-width fonts to a fixed-length space respecting wide characters, consider udsubstr.


UdSubStr

Extract a substring from a string using start and length arguments, as udsubstr(string, start, length). If the length argument is the missing value (.), the extraction continues to the end of the string.

generate skip_first_character = udsubstr(string, 2, .)
generate second_character = udsubstr(string, 2, 1)
generate last_character = udsubstr(string, -1, 1)

The start and length parameters are display columns. To extract a substring that is a fixed number of characters, consider usubstr.


Weekly

Convert a string date into the number of weeks since the Stata epoch (01jan1960 00:00:00.000).

generate int week = weekly(string, "YW")
format week %tw

The mask should be composed of "Y" and "W". See Date and Datetime Masks above for details.


Yearly

Convert a string date into a yearly date. Unlike the other conversions, a yearly date is stored as the calendar year itself (for example, 2006) rather than as an offset from the Stata epoch.

generate int year = yearly(string, "Y")
format year %ty

The mask should be composed of "Y" only. See Date and Datetime Masks above for details.
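A quick check on a literal shows how yearly values are stored and displayed. A minimal sketch:

```stata
* The stored value is the calendar year itself: displays 2006
display yearly("2006", "Y")

* The %ty format displays the same year
display %ty yearly("2006", "Y")
```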

