Engee documentation
Notebook

Text search and replacement

Processing text data often involves searching and replacing substrings. There are several functions that find text and return various information: some functions confirm that the text exists, others count the number of repetitions of a text fragment, find indexes or extract substrings.


Text search

To check whether a text fragment is present, use the occursin() function. It returns a Bool value: true if the fragment occurs in the string and false otherwise.

In [ ]:
txt = "she sells seashells by the seashore"
TF = occursin("sea", txt)
Out[0]:
true
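
As a small sketch of how this result can be used (the regex variant and the variable names below are illustrative additions, not part of the original notebook): occursin() is case-sensitive for literal text, and it also accepts a regular expression, where the i flag makes the search case-insensitive.

```julia
txt = "she sells seashells by the seashore"

# Literal search is case-sensitive, so "SEA" is not found.
found_literal = occursin("SEA", txt)

# A regex with the `i` flag ignores case.
found_nocase = occursin(r"SEA"i, txt)

# The Bool result can be used directly in a condition.
message = found_nocase ? "fragment found" : "fragment not found"
println(message)
```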

You can count how many times the fragment occurs with the count() function.

In [ ]:
n = count("sea", txt)
Out[0]:
2

To find where the text is located, use the findall() function, which returns the index range of each occurrence of the fragment "sea".

In [ ]:
idx = findall("sea", txt)
Out[0]:
2-element Vector{UnitRange{Int64}}:
 11:13
 28:30
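
The introduction also mentions extracting and replacing substrings. A minimal sketch of both, assuming the same txt string as above (the variable names fragments, first_range, replaced_all and replaced_one are hypothetical): the ranges returned by findall() can index the string directly, and replace() takes a "old" => "new" pair.

```julia
txt = "she sells seashells by the seashore"

# Each range returned by findall can index the string to extract the match.
idx = findall("sea", txt)
fragments = [txt[r] for r in idx]

# findfirst returns only the first range (or `nothing` if there is no match).
first_range = findfirst("sea", txt)

# replace substitutes every occurrence; `count` limits the number of replacements.
replaced_all = replace(txt, "sea" => "ocean")
replaced_one = replace(txt, "sea" => "ocean"; count=1)
```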

Searching for text in arrays of strings

The search and replace functions can also be applied to arrays of strings by broadcasting them with the dot syntax, which works element by element. For example, find the names of colors in several song titles.

In [ ]:
songs = ["Penny Lane", "Yellow Submarine","Blackbird"]
colors = ["Red", "Yellow", "Black"]

TF = occursin.(colors,songs)
Out[0]:
3-element BitVector:
 0
 1
 1
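
The broadcast call pairs the first color with the first song, the second with the second, and so on, so the result depends on the order of the two arrays. If the goal is to check whether each title contains any of the colors, one possible sketch is the following (colors_reordered, pairwise and has_color are hypothetical names used only for illustration):

```julia
songs = ["Penny Lane", "Yellow Submarine", "Blackbird"]
colors_reordered = ["Black", "Red", "Yellow"]  # same colors, different order

# Elementwise broadcast: compares colors_reordered[i] with songs[i] only.
pairwise = occursin.(colors_reordered, songs)

# For each song, test whether any of the colors occurs in its title.
has_color = [any(c -> occursin(c, s), colors_reordered) for s in songs]

songs[has_color]
```

With the reordered colors the pairwise comparison finds nothing, while the any-based check still selects the two songs whose titles contain a color name.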

To display the songs that contain color names, use the logical array TF as an index into the original array of songs. This technique is called logical indexing.

In [ ]:
songs[TF]
Out[0]:
2-element Vector{String}:
 "Yellow Submarine"
 "Blackbird"

Matching patterns

In addition to searching for literal text such as "sea" or "yellow", you can search for text that matches a pattern. Patterns are written as regular expressions, which provide predefined character classes. For example, \d+ matches a sequence of digits. The match() function returns the first match.

In [ ]:
address = " Sesame Street, New York, NY 10128"
nums = match(r"\d+", address)
Out[0]:
RegexMatch("10128")
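
The RegexMatch object is not yet a plain string or number. A minimal sketch of how the result might be unpacked, assuming the same address string (the names m, zip_code and no_digits are hypothetical): the matched text is available in the match field, and match() returns nothing when the pattern is not found, so it is safer to check before using it.

```julia
address = " Sesame Street, New York, NY 10128"

m = match(r"\d+", address)

# match returns `nothing` when the pattern is not found, so check before use.
zip_code = m === nothing ? missing : parse(Int, m.match)

# The same search on a string without digits finds nothing.
no_digits = match(r"\d+", "Sesame Street")
```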

You can combine patterns to make the search more precise. For example, find a word starting with the letter "N". Use a string to specify the "N" character and lettersPattern to match the letters that follow it. Note that match() returns only the first match.

In [ ]:
lettersPattern = r"[a-zA-Z]+"
pat = "N" * lettersPattern
StartWithN = match(pat, address).match
Out[0]:
"New"

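Since match() stops at the first hit, collecting every word that starts with "N" needs a different function. One way to sketch this, assuming the same address and pattern as in the cell above, is eachmatch(), which iterates over all non-overlapping matches (all_matches is a hypothetical name, and concatenating a string with a Regex using * assumes Julia 1.3 or later):

```julia
address = " Sesame Street, New York, NY 10128"

lettersPattern = r"[a-zA-Z]+"
pat = "N" * lettersPattern  # String * Regex concatenation

# eachmatch iterates over every non-overlapping match in the string.
all_matches = [m.match for m in eachmatch(pat, address)]
```

For this address the result contains both "New" and "NY".
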
Other functions for working with text in Engee are described in the Text strings section (https://engee.com/helpcenter/stable/julia/base/strings.html).