Strings
Strings are finite sequences of characters. The real trouble comes when one asks what a character is. The characters that English speakers are familiar with are the letters A, B, C, etc., together with numerals and common punctuation symbols. These characters are standardized, together with a mapping to integer values between 0 and 127, by the https://en.wikipedia.org/wiki/ASCII [ASCII] standard. There are, of course, many other characters used in non-English languages, including variants of the ASCII characters with accents and other modifications, related scripts such as Cyrillic and Greek, and scripts completely unrelated to ASCII and English, including Arabic, Chinese, Hebrew, Hindi, Japanese, and Korean. The https://en.wikipedia.org/wiki/Unicode [Unicode] standard tackles the complexities of what exactly a character is, and is generally accepted as the definitive standard addressing this problem. Depending on your needs, you can either ignore these complexities entirely and pretend that only ASCII characters exist, or you can write code that handles any of the characters or encodings that you may encounter when working with non-ASCII text. Julia makes dealing with plain ASCII text simple and efficient, and handling Unicode is as simple and efficient as possible. In particular, you can write C-style string code to process ASCII strings, and it will work as expected, both in terms of performance and semantics. If such code encounters non-ASCII text, it will fail gracefully with a clear error message, rather than silently introducing corrupt results. When this happens, modifying the code to handle non-ASCII data is straightforward.
There are several notable high-level features regarding Julia strings.
- The built-in concrete type used for strings (and string literals) in Julia is `String`. It supports the full range of https://en.wikipedia.org/wiki/Unicode [Unicode] characters via the https://en.wikipedia.org/wiki/UTF-8 [UTF-8] encoding. (A `transcode` function is provided to convert to and from other Unicode encodings.)
- All string types are subtypes of the abstract type `AbstractString`, and external packages define additional `AbstractString` subtypes (for example, for other encodings). If you define a function that expects a string argument, declare the type as `AbstractString` so that it accepts any string type.
- Like C and Java, but unlike most dynamic languages, Julia has a first-class type for representing a single character, called `AbstractChar`. The built-in `Char` subtype of `AbstractChar` is a 32-bit primitive type that can represent any Unicode character (and which is based on the UTF-8 encoding).
- As in Java, strings are immutable: the value of an `AbstractString` object cannot be changed. To construct a different string value, you build a new string from parts of other strings.
- Conceptually, a string is a partial function from indices to characters: for some index values, no character value is returned, and an exception is thrown instead. This allows for efficient indexing into strings by the byte index of the encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings.
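The advice to declare string arguments as `AbstractString` can be illustrated with a minimal sketch. The helper `count_vowels` is hypothetical (it is not part of Julia); the point is only that one method accepts both a `String` and a `SubString`:

```julia
# Hypothetical helper: counts the ASCII vowels in any string type.
count_vowels(s::AbstractString) = count(c -> c in "aeiouAEIOU", s)

full = "Hello, world"
prefix = SubString(full, 1, 5)   # "Hello", a view into `full`

println(count_vowels(full))      # called with a String
println(count_vowels(prefix))    # called with a SubString{String}
```

Had the argument been declared as `String`, the second call would fail with a `MethodError`.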
Characters
A `Char` value represents a single character: it is just a 32-bit primitive type with a special literal representation and appropriate arithmetic behaviors, and it can be converted to a numeric value representing a https://en.wikipedia.org/wiki/Code_point [Unicode code point]. (Julia packages may define other subtypes of `AbstractChar`, for example, to optimize operations for other https://en.wikipedia.org/wiki/Character_encoding [text encodings].) Here is how `Char` values are entered and shown (note that character literals are delimited with single quotes, not double quotes):
julia> c = 'x'
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
julia> typeof(c)
Char
You can easily convert a `Char` to its integer value, i.e. its code point:
julia> c = Int('x')
120
julia> typeof(c)
Int64
On 32-bit architectures, `typeof(c)` will be `Int32`. You can convert an integer value back to a `Char` just as easily:
julia> Char(120)
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
Not all integer values are valid Unicode code points, but for performance, the `Char` conversion does not check that every character value is valid. If you want to check that each converted value is a valid code point, use the `isvalid` function:
julia> Char(0x110000)
'\U110000': Unicode U+110000 (category In: Invalid, too high)
julia> isvalid(Char, 0x110000)
false
As of this writing, the valid Unicode code points are `U+0000` through `U+D7FF` and `U+E000` through `U+10FFFF`. These have not all been assigned intelligible meanings yet, nor are they necessarily interpretable by applications, but all of these values are considered to be valid Unicode characters.
You can input any Unicode character in single quotes using `\u` followed by up to four hexadecimal digits or `\U` followed by up to eight hexadecimal digits (the longest valid value only requires six):
julia> '\u0'
'\0': ASCII/Unicode U+0000 (category Cc: Other, control)
julia> '\u78'
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
julia> '\u2200'
'∀': Unicode U+2200 (category Sm: Symbol, math)
julia> '\U10ffff'
'\U10ffff': Unicode U+10FFFF (category Cn: Other, not assigned)
Julia uses your system's locale and language settings to determine which characters can be printed as-is and which must be output using the generic, escaped `\u` or `\U` input forms. In addition to these Unicode escape forms, all of https://en.wikipedia.org/wiki/C_syntax#Backslash_escapes [C's traditional escaped input forms] can also be used:
julia> Int('\0')
0
julia> Int('\t')
9
julia> Int('\n')
10
julia> Int('\e')
27
julia> Int('\x7f')
127
julia> Int('\177')
127
You can perform comparisons and a limited number of arithmetic operations with Char
values.
julia> 'A' < 'a'
true
julia> 'A' <= 'a' <= 'Z'
false
julia> 'A' <= 'X' <= 'Z'
true
julia> 'x' - 'a'
23
julia> 'A' + 1
'B': ASCII/Unicode U+0042 (category Lu: Letter, uppercase)
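The comparison and arithmetic rules above are enough to write small character transformations. The sketch below uses a hypothetical helper, `rotate_lower`, that shifts ASCII lowercase letters and leaves everything else alone; it relies only on `Char - Char` giving an integer and `Char + Int` giving a `Char`:

```julia
# Hypothetical helper: rotate an ASCII lowercase letter by `k` positions.
function rotate_lower(c::Char, k::Integer)
    'a' <= c <= 'z' || return c          # leave other characters untouched
    offset = mod((c - 'a') + k, 26)      # Char - Char gives an Int
    return 'a' + offset                  # Char + Int gives a Char
end

println(rotate_lower('x', 3))
println(map(c -> rotate_lower(c, 13), "hello, world"))
```

Mapping a `Char`-returning function over a string yields a new string, so the second call produces the ROT13 form of the input.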
Basics of working with strings
String literals are delimited by double quotes or triple double quotes (not single quotes).
julia> str = "Hello, world.\n"
"Hello, world.\n"
julia> """Contains "quote" characters"""
"Contains \"quote\" characters"
Long lines in strings can be broken up by preceding the newline with a backslash (`\`).
julia> "This is a long \
line"
"This is a long line"
If you want to extract a character from a string, you need to index it.
julia> str[begin]
'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)
julia> str[1]
'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)
julia> str[6]
',': ASCII/Unicode U+002C (category Po: Punctuation, other)
julia> str[end]
'\n': ASCII/Unicode U+000A (category Cc: Other, control)
Many Julia objects, including strings, can be indexed with integers. The index of the first element (the first character of a string) is returned by `firstindex(str)`, and the index of the last element (character) by `lastindex(str)`. The keywords `begin` and `end` can be used inside an indexing operation as shorthand for the first and last indices, respectively, along the given dimension. String indexing, like most indexing in Julia, is 1-based: `firstindex` always returns `1` for any `AbstractString`. As we will see below, however, `lastindex(str)` is in general not the same as `length(str)` for a string, because some Unicode characters can occupy multiple code units.
You can perform arithmetic and other operations with `end`, just like a normal value.
julia> str[end-1]
'.': ASCII/Unicode U+002E (category Po: Punctuation, other)
julia> str[end÷2]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
Using an index less than `begin` (`1`) or greater than `end` raises an error.
julia> str[begin-1]
ERROR: BoundsError: attempt to access 14-codeunit String at index [0]
[...]
julia> str[end+1]
ERROR: BoundsError: attempt to access 14-codeunit String at index [15]
[...]
You can also extract a substring using range indexing.
julia> str[4:9]
"lo, wo"
Note that the expressions str[k]
and str[k:k]
do not give the same result.
julia> str[6]
',': ASCII/Unicode U+002C (category Po: Punctuation, other)
julia> str[6:6]
","
The first is a single-character value of type Char
, and the second is a string value that contains only one character. These are completely different things in Julia.
Range indexing makes a copy of the selected part of the original string. Alternatively, you can create a view into a string using the `SubString` type. More simply, using the `@views` macro on a block of code converts all string slices into substrings. For example:
julia> str = "long string"
"long string"
julia> substr = SubString(str, 1, 4)
"long"
julia> typeof(substr)
SubString{String}
julia> @views typeof(str[1:4]) # @views converts fragments into substrings
SubString{String}
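The difference between a copied slice and a view can be checked directly. This is a small sketch using only `SubString` and `parent` from Base; the variable names are arbitrary:

```julia
str = "long string"

copied = str[1:4]                # range indexing builds a new String
viewed = SubString(str, 1, 4)    # a view that keeps a reference to `str`

println(typeof(copied))          # String
println(typeof(viewed))          # SubString{String}
println(copied == viewed)        # equal contents despite different types
println(parent(viewed) == str)   # the view's parent is the original string
```

Views avoid allocating a new string, which can matter when many slices of a large string are taken.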
Unicode and UTF-8
Julia fully supports Unicode characters and strings. As discussed above, in character literals, Unicode code points can be represented using the Unicode `\u` and `\U` escape sequences, as well as all the standard C escape sequences. These can likewise be used to write string literals.
julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"
Whether these Unicode characters are displayed as escapes or shown as special characters depends on your terminal's locale settings and its support for Unicode. String literals are encoded using the UTF-8 encoding. UTF-8 is a variable-width encoding, meaning that not all characters are encoded in the same number of bytes (code units). In UTF-8, ASCII characters, i.e. those with code points less than 0x80 (128), are encoded as they are in ASCII, using a single byte, while code points 0x80 and above are encoded using multiple bytes, up to four per character.
String indices in Julia refer to code units (bytes, for UTF-8), the fixed-width building blocks that are used to encode arbitrary characters (code points). This means that not every index into a `String` is necessarily a valid index for a character. If you index into a string at such an invalid byte index, an error is thrown.
julia> s[1]
'∀': Unicode U+2200 (category Sm: Symbol, math)
julia> s[2]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'∀', [4]=>' '
Stacktrace:
[...]
julia> s[3]
ERROR: StringIndexError: invalid index [3], valid nearby indices [1]=>'∀', [4]=>' '
Stacktrace:
[...]
julia> s[4]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
In this case, the character `∀` is a three-byte character, so the indices 2 and 3 are invalid and the next character's index is 4. This next valid index can be computed by `nextind(s, 1)`, the index after that by `nextind(s, 4)`, and so on.
Since `end` is always the last valid index into a collection, `end-1` references an invalid byte index if the second-to-last character is multibyte.
julia> s[end-1]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
julia> s[end-2]
ERROR: StringIndexError: invalid index [9], valid nearby indices [7]=>'∃', [10]=>' '
Stacktrace:
[...]
julia> s[prevind(s, end, 2)]
'∃': Unicode U+2203 (category Sm: Symbol, math)
The first case works because the last character `y` and the space are one-byte characters, whereas `end-2` indexes into the middle of the multibyte representation of `∃`. The correct approach here is to use `prevind(s, lastindex(s), 2)` or, if you are using that value to index into `s`, to write `s[prevind(s, end, 2)]`, where `end` expands to `lastindex(s)`.
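Stepping through a string with `nextind` can be sketched as follows. The helper name `walk` is hypothetical; it simply records each valid index together with the character stored there:

```julia
s = "\u2200 x \u2203 y"   # "∀ x ∃ y"

# Hypothetical helper: visit every character by stepping between valid indices.
function walk(s::AbstractString)
    visited = Pair{Int,Char}[]
    i = firstindex(s)
    while i <= lastindex(s)
        push!(visited, i => s[i])   # `i` is always a valid character index here
        i = nextind(s, i)           # jump to the start of the next character
    end
    return visited
end

foreach(println, walk(s))
```

Because the loop only ever lands on indices returned by `nextind`, it never triggers a `StringIndexError`.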
Substring extraction using indexing also assumes that there are valid byte indexes, otherwise an error will be returned.
julia> s[1:1]
"∀"
julia> s[1:2]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'∀', [4]=>' '
Stacktrace:
[...]
julia> s[1:4]
"∀ "
Because of variable-length encodings, the number of characters in a string (given by `length(s)`) is not always the same as the last index. If you iterate through the indices from 1 to `lastindex(s)` and index into `s`, the sequence of characters returned when errors are not thrown is the sequence of characters comprising the string `s`. Thus `length(s) <= lastindex(s)`, since each character in a string must have its own index. The following is an inefficient and verbose way to iterate through the characters of `s`.
julia> for i = firstindex(s):lastindex(s)
           try
               println(s[i])
           catch
               # ignore the index error
           end
       end
∀
 
x
 
∃
 
y
The blank lines actually have spaces on them. Fortunately, the awkward idiom above is unnecessary for iterating through the characters in a string, since you can simply use the string as an iterable object, with no exception handling required.
julia> for c in s
           println(c)
       end
∀
 
x
 
∃
 
y
If you need to obtain valid indices for a string, you can use the `nextind` and `prevind` functions to increment or decrement to the next or previous valid index, as mentioned above. You can also use the `eachindex` function to iterate over the valid character indices.
julia> collect(eachindex(s))
7-element Vector{Int64}:
1
4
5
6
7
10
11
To access the raw code units (bytes for UTF-8) of the encoding, you can use the `codeunit(s, i)` function, where the index `i` runs consecutively from `1` to `ncodeunits(s)`. The `codeunits(s)` function returns an `AbstractVector{UInt8}` wrapper that lets you access these raw code units (bytes) as an array.
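As a brief illustration of the code-unit accessors, the sketch below inspects the bytes of a short string containing one three-byte character. The example string is chosen here for illustration only:

```julia
s = "\u2200 x"   # "∀ x": one three-byte character and two ASCII characters

println(length(s))               # number of characters
println(ncodeunits(s))           # number of bytes
println(codeunit(s, 1))          # first byte of the encoding of '∀'
println(collect(codeunits(s)))   # all bytes as a Vector{UInt8}
```

Unlike character indexing, every integer from `1` to `ncodeunits(s)` is a valid argument to `codeunit`.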
Strings in Julia can contain invalid UTF-8 code unit sequences. This convention allows any byte sequence to be treated as a `String`. In such situations, the rule is that when parsing a sequence of code units from left to right, characters are formed by the longest sequence of 8-bit code units that matches the start of one of the following bit patterns (each `x` can be `0` or `1`):
- `0xxxxxxx`;
- `110xxxxx` `10xxxxxx`;
- `1110xxxx` `10xxxxxx` `10xxxxxx`;
- `11110xxx` `10xxxxxx` `10xxxxxx` `10xxxxxx`;
- `10xxxxxx`;
- `11111xxx`.
In particular, this means that overlong and too-high code unit sequences and prefixes thereof are treated as a single invalid character rather than multiple invalid characters. This rule is best explained with an example.
julia> s = "\xc0\xa0\xe2\x88\xe2|"
"\xc0\xa0\xe2\x88\xe2|"
julia> foreach(display, s)
'\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)
'\xe2\x88': Malformed UTF-8 (category Ma: Malformed, bad data)
'\xe2': Malformed UTF-8 (category Ma: Malformed, bad data)
'|': ASCII/Unicode U+007C (category Sm: Symbol, math)
julia> isvalid.(collect(s))
4-element BitArray{1}:
0
0
0
1
julia> s2 = "\xf7\xbf\xbf\xbf"
"\U1fffff"
julia> foreach(display, s2)
'\U1fffff': Unicode U+1FFFFF (category In: Invalid, too high)
We can see that the first two code units in the string `s` form an overlong encoding of the space character. It is invalid, but is accepted in a string as a single character. The next two code units form a valid start of a three-byte UTF-8 sequence. However, the fifth code unit, `\xe2`, is not a valid continuation of it. Therefore, code units 3 and 4 are interpreted together as one malformed character in this string. Similarly, code unit 5 forms a malformed character because `|` is not a valid continuation of it. Finally, the string `s2` contains one too-high code point.
Julia uses the UTF-8 encoding by default, and support for new encodings can be added by packages. For example, the https://github.com/JuliaStrings/LegacyStrings.jl [LegacyStrings.jl] package implements the `UTF16String` and `UTF32String` types. Further discussion of other encodings and how to implement support for them is beyond the scope of this document for now. For more on UTF-8 encoding issues, see the section below on byte array literals. The `transcode` function is provided to convert data between the various UTF-xx encodings, primarily for working with external data and libraries.
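A round trip through `transcode` might look like the following sketch. It converts a UTF-8 string to UTF-16 code units and back; the example string is an arbitrary choice:

```julia
s = "\u2200 x"                        # "∀ x"

utf16 = transcode(UInt16, s)          # UTF-8 String -> UTF-16 code units
println(utf16)

roundtrip = transcode(String, utf16)  # UTF-16 code units -> UTF-8 String
println(roundtrip == s)
```

The three characters here each fit in a single UTF-16 code unit, so the vector has three elements even though the UTF-8 form occupies five bytes.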
Concatenation
One of the most common and useful string operations is concatenation.
julia> greet = "Hello"
"Hello"
julia> whom = "world"
"world"
julia> string(greet, ", ", whom, ".\n")
"Hello, world.\n"
It is important to keep in mind potentially dangerous situations, such as concatenation of invalid UTF-8 strings. The resulting string may contain characters other than the characters in the input strings, and the number of characters in it may be less than the sum of the number of characters in the concatenated strings, for example:
julia> a, b = "\xe2\x88", "\x80"
("\xe2\x88", "\x80")
julia> c = string(a, b)
"∀"
julia> collect.([a, b, c])
3-element Vector{Vector{Char}}:
['\xe2\x88']
['\x80']
['∀']
julia> length.([a, b, c])
3-element Vector{Int64}:
1
1
1
This can only happen for invalid UTF-8 strings. For valid UTF-8 strings, concatenation preserves all characters in strings and the additivity of string lengths.
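One way to detect this situation, sketched here under the assumption that you can afford a validity check, is to call `isvalid` on the inputs before relying on character counts:

```julia
a, b = "\xe2\x88", "\x80"

println(isvalid(a), " ", isvalid(b))          # neither half is valid UTF-8 on its own
c = string(a, b)
println(isvalid(c))                           # the concatenation happens to be valid
println(length(a) + length(b) == length(c))   # additivity of lengths is lost
```

If both inputs pass `isvalid`, the length of the concatenation is the sum of the input lengths.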
Julia also provides `*` for string concatenation.
julia> greet * ", " * whom * ".\n"
"Hello, world.\n"
While `*` may seem like a surprising choice to users of languages that provide `+` for string concatenation, this use of `*` has precedent in mathematics, particularly in abstract algebra.
In mathematics, `+` usually denotes a commutative operation, where the order of the operands does not matter. An example of this is matrix addition, where `A + B == B + A` for any matrices `A` and `B` that have the same shape. In contrast, `*` typically denotes a noncommutative operation, where the order of the operands does matter. An example of this is matrix multiplication, where in general `A * B != B * A`. As with matrix multiplication, string concatenation is noncommutative: `greet * whom != whom * greet`. As such, `*` is a more natural choice for an infix string concatenation operator, consistent with common mathematical use.
More precisely, the set of all finite-length strings S together with the string concatenation operator `*` forms a https://en.wikipedia.org/wiki/Free_monoid [free monoid] (S, `*`). The identity element of this set is the empty string, `""`. Whenever a free monoid is not commutative, the operation is typically represented as `\cdot`, `*`, or a similar symbol, rather than `+`, which, as stated, usually implies commutativity.
Interpolation
Constructing strings using concatenation can become a bit cumbersome, however. To reduce the need for these verbose calls to `string` or repeated multiplications, Julia allows interpolation into string literals using `$`, as in Perl.
julia> greet = "Hello"; whom = "world";
julia> "$greet, $whom.\n"
"Hello, world.\n"
This is more readable and convenient, and equivalent to the string concatenation above: the system rewrites this apparent single string literal into the call `string(greet, ", ", whom, ".\n")`.
The shortest complete expression after the `$` is taken as the expression whose value is to be interpolated into the string. Thus, you can interpolate any expression into a string using parentheses.
julia> "1 + 2 = $(1 + 2)"
"1 + 2 = 3"
Both concatenation and string interpolation call `string` to convert objects into string form. However, `string` actually just returns the output of `print`, so new types should add methods to `print` or `show` instead of `string`.
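To illustrate, here is a minimal sketch with a hypothetical `Point` type (not part of Julia). Only a `show` method is defined; `string` and interpolation pick it up through `print`:

```julia
# Hypothetical type used only for illustration.
struct Point
    x::Int
    y::Int
end

# Defining `show` is enough: `print`, `string`, and interpolation fall back to it.
Base.show(io::IO, p::Point) = print(io, "<", p.x, ", ", p.y, ">")

p = Point(1, 2)
println(string(p))
println("p = $p")
```

No method is added to `string` itself, which matches the recommendation above.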
Most non-`AbstractString` objects are converted to strings that closely correspond to how they are entered as literal expressions.
julia> v = [1,2,3]
3-element Vector{Int64}:
1
2
3
julia> "v: $v"
"v: [1, 2, 3]"
`string` is the identity for `AbstractString` and `AbstractChar` values, so these are interpolated into strings as themselves, unquoted and unescaped.
julia> c = 'x'
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
julia> "hi, $c"
"hi, x"
To include the literal $
in a string literal, escape it with a backslash.
julia> print("I have \$100 in my account.\n")
I have $100 in my account.
String literals enclosed in triple quotes
When strings are created using triple quotes (`"""..."""`), they have some special behavior that can be useful for creating longer blocks of text.
First, triple-quoted strings are dedented to the level of the least-indented line. This is useful for defining strings within code that is indented. For example:
julia> str = """
           Hello,
           world.
         """
"  Hello,\n  world.\n"
In this case, the last (empty) line before the closing """
sets the indentation level.
The dedentation level is determined as the longest common starting sequence of spaces or tabs in all lines, excluding the line following the opening `"""` and lines containing only spaces or tabs (the line containing the closing `"""` is always included). Then, for all lines, excluding the text following the opening `"""`, the common starting sequence is removed (including lines containing only spaces and tabs if they start with this sequence), for example:
julia> """    This
         is
           a test"""
"    This\nis\n  a test"
Next, if the opening `"""` is followed by a newline, the newline is stripped from the resulting string.
"""hello"""
is equivalent to
"""
hello"""
but
"""

hello"""
will contain a literal newline at the beginning.
Stripping of the newline is performed after the dedentation. For example:
julia> """
         Hello,
         world."""
"Hello,\nworld."
If the newline is removed using a backslash, dedentation will be respected as well.
julia> """
         Averylong\
         word"""
"Averylongword"
Trailing whitespace is left unaltered.
Triple-quoted string literals can contain `"` characters without escaping.
Note that line breaks in literal strings, whether single- or triple-quoted, result in a newline (LF) character `\n` in the string, even if your editor uses a carriage return `\r` (CR) or CRLF combination to end lines. To include a CR in a string, use an explicit escape `\r`; for example, you can enter the literal string `"a CRLF line ending\r\n"`.
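The dedentation rules can be checked with a small sketch. The text and variable name are arbitrary; what matters is the indentation of each line relative to the least-indented one:

```julia
# The least-indented lines (including the line holding the closing delimiter)
# start with four spaces, so four spaces are stripped from every line.
# The two extra spaces before "second" survive.
text = """
    first
      second
    """

println(repr(text))
println(text == "first\n  second\n")
```

The newline right after the opening delimiter is dropped, while the newline before the closing delimiter is kept.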
Basic operations
You can compare strings lexicographically using standard comparison operators.
julia> "abracadabra" < "xylophone"
true
julia> "abracadabra" == "xylophone"
false
julia> "Hello, world." != "Goodbye, world."
true
julia> "1 + 2 = 3" == "1 + 2 = $(1 + 2)"
true
You can search for the index of a particular character using the `findfirst` and `findlast` functions.
julia> findfirst('o', "xylophone")
4
julia> findlast('o', "xylophone")
7
julia> findfirst('z', "xylophone")
You can start searching for a character from a given offset using the functions findnext
and findprev
.
julia> findnext('o', "xylophone", 1)
4
julia> findnext('o', "xylophone", 5)
7
julia> findprev('o', "xylophone", 5)
4
julia> findnext('o', "xylophone", 8)
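Combining `findfirst`, `findnext`, and `nextind` gives a simple way to locate every occurrence of a character. The helper `find_all` below is a hypothetical name used for this sketch:

```julia
# Hypothetical helper: collect the index of every occurrence of `c` in `s`.
function find_all(c::AbstractChar, s::AbstractString)
    hits = Int[]
    i = findfirst(c, s)
    while i !== nothing
        push!(hits, i)
        i = findnext(c, s, nextind(s, i))   # resume just past the previous hit
    end
    return hits
end

println(find_all('o', "xylophone"))
println(find_all('z', "xylophone"))
```

Using `nextind` rather than `i + 1` keeps the search position on a valid character boundary when the string contains multibyte characters.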
You can use the function occursin
to check if a substring is found in the string.
julia> occursin("world", "Hello, world.")
true
julia> occursin("o", "Xylophon")
true
julia> occursin("a", "Xylophon")
false
julia> occursin('o', "Xylophon")
true
The last example shows that `occursin` can also look for a character literal.
Two other handy string functions are `repeat` and `join`.
julia> repeat(".:Z:.", 10)
".:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:."
julia> join(["apples", "bananas", "pineapples"], ", ", " and ")
"apples, bananas and pineapples"
Some other useful functions include:
- `firstindex(str)` gives the minimal (byte) index that can be used to index into `str` (always 1 for strings, not necessarily true for other containers).
- `lastindex(str)` gives the maximal (byte) index that can be used to index into `str`.
- `length(str)` gives the number of characters in `str`.
- `length(str, i, j)` gives the number of valid character indices in `str` from `i` to `j`.
- `ncodeunits(str)` gives the number of code units in a string.
- `codeunit(str, i)` gives the code unit value in the string `str` at index `i`.
- `thisind(str, i)`, given an arbitrary index into a string, finds the first index of the character into which the index points.
- `nextind(str, i, n=1)` finds the start of the `n`th character starting after index `i`.
- `prevind(str, i, n=1)` finds the start of the `n`th character starting before index `i`.
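The following sketch applies several of these functions to a string that mixes one-byte and three-byte characters. The string is chosen for illustration; every call is a Base function from the list above:

```julia
s = "\u2200 x \u2203 y"    # "∀ x ∃ y"

println(length(s))         # characters
println(ncodeunits(s))     # bytes
println(lastindex(s))      # index of the last character
println(thisind(s, 3))     # index 3 lies inside '∀', which starts at index 1
println(nextind(s, 1))     # start of the character after index 1
println(prevind(s, 10))    # start of the character before index 10
println(length(s, 1, 7))   # characters whose index lies between 1 and 7
```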
Non-standard string literals
There are situations when you need to build a string or use string semantics, but the standard string construction is not exactly what you need. Julia has non-standard string literals for such cases. A non-standard string literal looks like a regular string literal enclosed in double quotes, but it has an identifier prefix. Its behavior may differ from that of a regular string literal.
Examples of non-standard string literals are regular expressions, byte array literals, and version number literals, which are described below. Users and packages can also define new non-standard string literals. Additional documentation is provided in the section Metaprogramming.
Regular expressions
Sometimes you are not looking for an exact string, but for a particular pattern. For example, suppose you are trying to extract a single date from a large text file. You don't know what that date is (that's why you are searching for it), but you do know it will look something like `YYYY-MM-DD`. Regular expressions allow you to specify these patterns and search for them.
Julia has Perl-compatible regular expressions (regexes), version 2, as provided by the https://www.pcre.org/ [PCRE] library (a description of the PCRE2 syntax can be found https://www.pcre.org/current/doc/html/pcre2syntax.html [here]). Regular expressions are related to strings in two ways: the obvious connection is that regular expressions are used to find regular patterns in strings; the other connection is that regular expressions are themselves input as strings, which are parsed into a state machine that can be used to efficiently search for patterns in strings. In Julia, regular expressions are input using non-standard string literals prefixed with various identifiers beginning with `r`. The most basic regular expression literal without any options turned on just uses `r"..."`.
julia> re = r"^\s*(?:#|$)"
r"^\s*(?:#|$)"
julia> typeof(re)
Regex
To check if a regular expression matches a string, use the function occursin
.
julia> occursin(r"^\s*(?:#|$)", "not a comment")
false
julia> occursin(r"^\s*(?:#|$)", "# a comment")
true
As you can see, `occursin` simply returns true or false, indicating whether a match for the given regex occurs in the string. Commonly, however, one wants to know not just whether a string matched, but also how it matched. To capture this information about a match, use the `match` function instead.
julia> match(r"^\s*(?:#|$)", "not a comment")
julia> match(r"^\s*(?:#|$)", "# a comment")
RegexMatch("#")
If the regular expression does not match the given string, `match` returns `nothing`, a special value that does not print anything at the interactive prompt. Other than not printing, it is a completely normal value, and you can test for it programmatically.
m = match(r"^\s*(?:#|$)", line)
if m === nothing
    println("not a comment")
else
    println("blank or comment")
end
If a regular expression does match, the value returned by `match` is a `RegexMatch` object. These objects record how the expression matches, including the substring that the pattern matches and any captured substrings, if there are any. This example only captures the portion of the substring that matches, but perhaps we want to capture any non-blank text after the comment character. We could do the following.
julia> m = match(r"^\s*(?:#\s*(.*?)\s*$)", "# a comment ")
RegexMatch("# a comment ", 1="a comment")
When calling `match`, you have the option to specify an index at which to start the search. For example:
julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",1)
RegexMatch("1")
julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",6)
RegexMatch("2")
julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",11)
RegexMatch("3")
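The start-index argument can be used to walk through successive matches. The sketch below uses a hypothetical helper, `collect_matches`, and assumes the pattern never matches the empty string; in practice, Base's `eachmatch` iterator covers the same need:

```julia
# Hypothetical helper: collect every non-overlapping match of `re` in `s`
# by restarting the search just past the previous match.
# Assumes `re` never matches the empty string (otherwise the loop would not advance).
function collect_matches(re::Regex, s::String)
    found = String[]
    i = firstindex(s)
    while i <= ncodeunits(s)
        m = match(re, s, i)
        m === nothing && break
        push!(found, m.match)
        i = m.offset + ncodeunits(m.match)   # first byte after this match
    end
    return found
end

println(collect_matches(r"[0-9]", "aaaa1aaaa2aaaa3"))
```

The `offset` field used here is the index at which the whole match begins.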
You can extract the following information from a `RegexMatch` object:
- the entire substring matched: `m.match`
- the captured substrings as an array of strings: `m.captures`
- the offset at which the whole match begins: `m.offset`
- the offsets of the captured substrings as a vector: `m.offsets`
For when a capture doesn't match, instead of a substring, `m.captures` contains `nothing` in that position, and `m.offsets` has a zero offset (recall that indices in Julia are 1-based, so a zero offset into a string is invalid). Here is a pair of somewhat contrived examples.
julia> m = match(r"(a|b)(c)?(d)", "acd")
RegexMatch("acd", 1="a", 2="c", 3="d")
julia> m.match
"acd"
julia> m.captures
3-element Vector{Union{Nothing, SubString{String}}}:
"a"
"c"
"d"
julia> m.offset
1
julia> m.offsets
3-element Vector{Int64}:
1
2
3
julia> m = match(r"(a|b)(c)?(d)", "ad")
RegexMatch("ad", 1="a", 2=nothing, 3="d")
julia> m.match
"ad"
julia> m.captures
3-element Vector{Union{Nothing, SubString{String}}}:
"a"
nothing
"d"
julia> m.offset
1
julia> m.offsets
3-element Vector{Int64}:
1
0
2
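Because an optional group may yield `nothing`, code that consumes captures usually needs a fallback. The helper `middle_or` below is a hypothetical name for this sketch; it uses Base's `something` to substitute a default:

```julia
# Hypothetical helper: return the optional second capture, or a fallback.
function middle_or(default::AbstractString, str::AbstractString)
    m = match(r"(a|b)(c)?(d)", str)
    m === nothing && return default            # the pattern did not match at all
    return something(m.captures[2], default)   # the group may not have participated
end

println(middle_or("-", "acd"))
println(middle_or("-", "ad"))
println(middle_or("-", "xyz"))
```

The two checks are distinct: the first handles a failed match, the second a successful match whose optional group captured nothing.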
It is convenient to have captures returned as an array so that you can use destructuring syntax to bind them to local variables. As a convenience, the `RegexMatch` object implements iterator methods that pass through to the `captures` field, so you can destructure the match object directly.
julia> first, second, third = m; first
"a"
Captures can also be accessed by indexing the `RegexMatch` object with the number or name of the capture group.
julia> m=match(r"(?<hour>\d+):(?<minute>\d+)","12:45")
RegexMatch("12:45", hour="12", minute="45")
julia> m[:minute]
"45"
julia> m[2]
"45"
Captures can be referenced in a substitution string when using `replace` by using `\n` to refer to the nth capture group and prefixing the substitution string with `s`. Capture group 0 refers to the entire match object. Named capture groups can be referenced in the substitution with `\g<groupname>`. For example:
julia> replace("first second", r"(\w+) (?<agroup>\w+)" => s"\g<agroup> \1")
"second first"
Numbered capture groups can also be referenced as `\g<n>` for disambiguation, as in:
julia> replace("a", r"." => s"\g<0>1")
"a1"
You can modify the behavior of regular expressions by some combination of the flags `i`, `m`, `s`, and `x` after the closing double quote mark. These flags have the same meaning as they do in Perl, as explained in this excerpt from the https://perldoc.perl.org/perlre#Modifiers [perlre manpage].
i   Do case-insensitive pattern matching. If locale matching rules are in effect, the case map is taken from the current locale for code points less than 255, and from Unicode rules for larger code points. However, matches that would cross the Unicode rules/non-Unicode rules boundary (ords 255/256) will not succeed.
m   Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string.
s   Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match. Used together, as r""ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.
x   Tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The '#' character is also treated as a metacharacter introducing a comment, just as in ordinary code.
For example, in the following regular expression, all three flags are enabled.
julia> r"a+.*b+.*d$"ism
r"a+.*b+.*d$"ims
julia> match(r"a+.*b+.*d$"ism, "Goodbye,\nOh, angry,\nBad world\n")
RegexMatch("angry,\nBad world")
The `r"..."` literal is constructed without interpolation and unescaping (except for the quotation mark `"`, which still has to be escaped). Here is an example showing the difference from standard string literals.
julia> x = 10
10
julia> r"$x"
r"$x"
julia> "$x"
"10"
julia> r"\x"
r"\x"
julia> "\x"
ERROR: syntax: invalid escape sequence
Regular expression strings with triple quotes in the form r"""..."""
are also supported (and may be useful for regular expressions containing quotation marks or newlines).
To create a valid regular expression string programmatically, you can use the Regex()
constructor. This lets you use the contents of string variables and other string operations when constructing the regular expression string. Any of the regular expression codes above can be used within the single string argument to Regex()
. Here are some examples:
julia> using Dates
julia> d = Date(1962,7,10)
1962-07-10
julia> regex_d = Regex("Day " * string(day(d)))
r"Day 10"
julia> match(regex_d, "It happened on Day 10")
RegexMatch("Day 10")
julia> name = "Jon"
"Jon"
julia> regex_name = Regex("[\"( ]\\Q$name\\E[\") ]") # interpolate the value of name
r"[\"( ]\QJon\E[\") ]"
julia> match(regex_name, " Jon ")
RegexMatch(" Jon ")
julia> match(regex_name, "[Jon]") === nothing
true
Note the use of the escape sequence \Q...\E
. All characters between \Q
and \E
are interpreted as literal characters. This is useful for matching characters that would otherwise be regular expression metacharacters. However, take care when using this feature together with string interpolation, since the interpolated string may itself contain the sequence \E
, which will unexpectedly terminate literal matching. User input must be sanitized before being included in a regular expression.
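One possible way to sanitize such input is sketched below. The helper quote_literal is hypothetical (it is not part of Base): it wraps the text in \Q...\E and, wherever the text itself contains \E, closes the quoted region, emits an escaped backslash followed by E, and reopens the quoted region.

```julia
# Hypothetical helper (not part of Base): wrap `s` in \Q...\E so that it is
# matched literally, closing and reopening the quoted region around any \E
# that occurs in `s` itself.
quote_literal(s::AbstractString) =
    "\\Q" * replace(s, "\\E" => "\\E\\\\E\\Q") * "\\E"

name = "J.n"                       # "." would normally match any character
re = Regex("^" * quote_literal(name) * "\$")
dot_is_literal = occursin(re, "J.n") && !occursin(re, "Jon")

tricky = "x\\E.*"                  # contains the sequence \E
re2 = Regex("^" * quote_literal(tricky) * "\$")
e_is_literal = occursin(re2, tricky) && !occursin(re2, "x\\Eanything")

println(dot_is_literal, " ", e_is_literal)
```

With this approach the interpolated text can no longer end the literal region early, so metacharacters that follow an embedded \E are still matched as plain characters.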
Byte array literals
Another useful non-standard string literal is the byte array string literal: b"..."
. This form lets you use string notation to express read-only literal byte arrays, i.e. arrays of UInt8
values. These objects have the type CodeUnits{UInt8, String}
. The following rules apply to byte array literals.
-
ASCII characters and ASCII escape characters create a single byte.
-
\x
and octal escape sequences produce the byte corresponding to the escape value.
-
Unicode escape sequences create a sequence of bytes encoding a given character code in UTF-8.
These rules overlap somewhat, since the behavior of \x
and octal escape sequences less than 0x80 (128) is covered by both of the first two rules, but in that case the rules agree. Together, these rules make it easy to use ASCII characters, arbitrary byte values, and UTF-8 sequences to produce byte arrays. Here is an example using all three:
julia> b"DATA\xff\u2200"
8-element Base.CodeUnits{UInt8, String}:
0x44
0x41
0x54
0x41
0xff
0xe2
0x88
0x80
The ASCII string "DATA" corresponds to the bytes 68, 65, 84, 65. \xff
produces the single byte 255. The Unicode escape sequence \u2200
is encoded in UTF-8 as the three bytes 226, 136, 128. Note that the resulting byte array does not correspond to a valid UTF-8 string:
julia> isvalid("DATA\xff\u2200")
false
As mentioned above, the CodeUnits{UInt8, String}
type behaves like a read-only array of UInt8
, and if you need a standard vector, you can convert it using Vector{UInt8}
:
julia> x = b"123"
3-element Base.CodeUnits{UInt8, String}:
0x31
0x32
0x33
julia> x[1]
0x31
julia> x[1] = 0x32
ERROR: CanonicalIndexError: setindex! not defined for Base.CodeUnits{UInt8, String}
[...]
julia> Vector{UInt8}(x)
3-element Vector{UInt8}:
0x31
0x32
0x33
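Once converted, the vector is an independent, mutable copy. The short sketch below (variable names are illustrative) modifies the copy and turns it back into a String; it also shows that codeunits returns the same kind of read-only view for an existing string.

```julia
bytes = Vector{UInt8}(b"123")   # mutable copy of the read-only code units
bytes[1] = 0x32                 # allowed on the copy

# String() may take ownership of the vector it is given, so pass a copy
# to keep `bytes` usable afterwards.
text = String(copy(bytes))
println(text)

# codeunits gives a read-only view of the bytes of an existing String
same = codeunits("123") == b"123"
println(same)
```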
Also note the significant difference between \xff
and \uff
: the former escape sequence encodes the byte 255, whereas the latter represents the code point 255, which is encoded as two bytes in UTF-8:
julia> b"\xff"
1-element Base.CodeUnits{UInt8, String}:
0xff
julia> b"\uff"
2-element Base.CodeUnits{UInt8, String}:
0xc3
0xbf
Character literals act the same way.
For code points less than \u80
, the UTF-8 encoding of each code point is just the single byte produced by the corresponding \x
escape, so the distinction can safely be ignored. For the escapes \x80
through \xff
as compared to \u80
through \uff
, however, there is a major difference: the former escapes all encode single bytes, which (unless followed by very specific continuation bytes) do not form valid UTF-8 data, whereas the latter escapes all represent Unicode code points with two-byte encodings.
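The distinction can be checked directly. This is a small sketch using only ordinary string literals; the variable names are illustrative.

```julia
# Below 0x80 the \x and \u escapes produce the same single byte.
same_ascii = b"\x41" == b"\u41"

# From 0x80 up they differ: \xff is one raw byte, \uff is a two-byte code point.
raw_len  = ncodeunits("\xff")
utf8_len = ncodeunits("\uff")

# Only the \u form yields valid UTF-8 on its own.
raw_valid  = isvalid("\xff")
utf8_valid = isvalid("\uff")

println((same_ascii, raw_len, utf8_len, raw_valid, utf8_valid))
```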
If all of this is confusing, try reading "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets". It is an excellent introduction to Unicode and UTF-8 and may help clear up the confusion.
Version number literals
Version numbers can easily be expressed with non-standard string literals of the form v"..."
. Version number literals create VersionNumber
objects that follow the https://semver.org/ [semantic versioning] specification, and are therefore composed of major, minor, and patch numeric values, followed by alphanumeric pre-release and build annotations. For example, v"0.2.1-rc1+win64"
is broken into the major version 0
, the minor version 2
, the patch version 1
, the pre-release rc1
and the build win64
. When entering a version literal, everything except the major version number is optional, so, for example, v"0.2"
is equivalent to v"0.2.0"
(with empty pre-release and build annotations), v"2"
is equivalent to v"2.0.0"
, and so on.
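The components described above can be inspected through the fields of the VersionNumber object. A brief sketch (the variable names are illustrative):

```julia
v = v"0.2.1-rc1+win64"

# major, minor and patch are numeric; prerelease and build are tuples
println((v.major, v.minor, v.patch, v.prerelease, v.build))

# Omitted components default to zero
short_equal = v"0.2" == v"0.2.0" && v"2" == v"2.0.0"
println(short_equal)
```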
VersionNumber
objects are mostly useful for comparing two (or more) versions easily and correctly. For example, the constant VERSION
holds the Julia version number as a VersionNumber
object, so you can define version-specific behavior using simple statements such as:
if v"0.2" <= VERSION < v"0.3-"
    # do something specific to the 0.2 release series
end
Note that the above example uses the non-standard version number v"0.3-"
, with a trailing -
: this notation is a Julia extension of the standard, and it indicates a version that is lower than any 0.3
release, including all of its pre-releases. So in the above example the code runs only with stable 0.2
versions, and excludes versions such as v"0.3.0-rc1"
. To also allow unstable (i.e. pre-release) 0.2
versions, the lower bound check should be changed to v"0.2-" <= VERSION
.
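To make the effect of the trailing - concrete, the sketch below wraps both range checks in two hypothetical helper functions (the names in_stable_02 and in_any_02 are assumptions for illustration) and applies them to a few sample versions.

```julia
# Hypothetical helper: is `v` a stable release in the 0.2 series?
in_stable_02(v::VersionNumber) = v"0.2" <= v < v"0.3-"

# Same check, but pre-releases of 0.2 are accepted as well.
in_any_02(v::VersionNumber) = v"0.2-" <= v < v"0.3-"

for v in (v"0.2.0-rc1", v"0.2.4", v"0.3.0-rc1", v"0.3.0")
    println(v, " => ", (in_stable_02(v), in_any_02(v)))
end
```

Only the second helper accepts v"0.2.0-rc1", and neither accepts any 0.3 version, pre-release or not.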
Another non-standard extension of the version specification allows a trailing +
to express an upper limit on build versions. For example, VERSION > v"0.2-rc1+"
can be used to mean any version above 0.2-rc1
and any of its builds: it returns false
for version v"0.2-rc1+win64"
and the value true
for v"0.2-rc2"
.
It is good practice to use such special versions in comparisons (in particular, a trailing -
should always be used on upper bounds unless there is a good reason not to), but they must not be used as the actual version number of anything, as they are invalid in the semantic versioning scheme.
Besides being used for the constant VERSION
, VersionNumber
objects are widely used in the Pkg
module to specify package versions and their dependencies.
Raw string literals
Raw strings without interpolation or escaping can be expressed using non-standard string literals of the form raw"..."
. Raw string literals create ordinary String objects that contain the enclosed contents exactly as entered, with no interpolation or unescaping. This is useful for strings that contain code or markup in other languages that use $
or \
as special characters.
The exception is that quotation marks still have to be escaped. For example, raw"\""
is equivalent to "\""
. To make it possible to express all strings, backslashes must then also be escaped, but only when they appear immediately before a quotation mark:
julia> println(raw"\\ \\\"")
\\ \"
Note that the first two backslashes appear verbatim in the output, since they do not precede a quotation mark. However, the next backslash escapes the backslash that follows it, and the last backslash escapes a quotation mark, since these backslashes appear before a quotation mark.
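The following sketch compares a few raw string literals with their equivalents written as standard string literals. The sample contents and variable names are illustrative.

```julia
# Backslashes and dollar signs are kept verbatim in a raw string.
path = raw"C:\new\table"
same_path = path == "C:\\new\\table"

template = raw"cost: $x"
same_template = template == "cost: \$x"

# Backslashes need escaping only directly before a quotation mark.
quoted = raw"say \"hi\""
trailing = raw"C:\dir\\"

println(path)
println(template)
println(quoted)
println(trailing)
```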
Annotated strings
The API for AnnotatedStrings is considered experimental and is subject to change between Julia versions.
It is sometimes useful to be able to hold metadata relating to regions of a string. An AnnotatedString
wraps another string and allows regions of it to be annotated with labelled values (:label => value). All generic string operations are applied to the underlying string. However, when possible, styling information is preserved. This means you can manipulate an AnnotatedString
(taking substrings, padding them, concatenating them with other strings) while retaining the metadata annotations.
This string type is fundamental to the StyledStrings stdlib, which uses annotations labelled :face
to hold styling information.
When concatenating an AnnotatedString
, take care to use annotatedstring
instead of string
if you want to keep the string annotations.
julia> str = Base.AnnotatedString("hello there",
[(1:5, :word, :greeting), (7:11, :label, 1)])
"hello there"
julia> length(str)
11
julia> lpad(str, 14)
"   hello there"
julia> typeof(lpad(str, 7))
Base.AnnotatedString{String}
julia> str2 = Base.AnnotatedString(" julia", [(2:6, :face, :magenta)])
" julia"
julia> Base.annotatedstring(str, str2)
"hello there julia"
julia> str * str2 == Base.annotatedstring(str, str2) # *-concatenation still works
true
The annotations of an AnnotatedString
can be accessed and modified via the annotations
and annotate!
functions.