Text.locate
Group: Text
Aliases: position_of, span_of
Documentation
Find the location of the term in the input. Returns a Span representing the location at which the term was found, or Nothing if the term was not found in the input.
Arguments
term: The term to find.
mode: Specifies if the first or last occurrence of the term should be returned if there are multiple occurrences within the input. The first occurrence is returned by default.
case_sensitivity: Specifies if the text values should be compared case sensitively.
Examples
Finding location of a substring.
"Hello World!".locate "J" == Nothing
"Hello World!".locate "o" == Span (Range 4 5) "Hello World!"
"Hello World!".locate "o" mode=Matching_Mode.Last == Span (Range 7 8) "Hello World!"
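The effect of the mode argument can be modelled outside Enso. The following is a minimal Python sketch, not the library's implementation: the helper locate_span and its mode strings are hypothetical, and indices are counted in Python code points rather than grapheme clusters.

```python
def locate_span(text, term, mode="first"):
    """Return (start, end) of the first or last occurrence of term, or None."""
    # str.find scans from the left, str.rfind from the right;
    # both return -1 when the term does not occur.
    start = text.find(term) if mode == "first" else text.rfind(term)
    if start < 0:
        return None
    return (start, start + len(term))


print(locate_span("Hello World!", "J"))               # None
print(locate_span("Hello World!", "o"))               # (4, 5)
print(locate_span("Hello World!", "o", mode="last"))  # (7, 8)
```

The returned pairs correspond to the Range 4 5 and Range 7 8 values in the examples above.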
Match length differences in case-insensitive matching.
term = "straße"
text = "MONUMENTENSTRASSE 42"
match = text . locate term case_sensitivity=Case_Sensitivity.Insensitive
term.length . should_equal 6
match.length . should_equal 7
Extending matches to full grapheme clusters.
ligatures = "ﬃﬄ"
ligatures.length == 2
term_1 = "IFF"
match_1 = ligatures . locate term_1 case_sensitivity=Case_Sensitivity.Insensitive
term_1.length == 3
match_1.length == 2
term_2 = "ffiffl"
match_2 = ligatures . locate term_2 case_sensitivity=Case_Sensitivity.Insensitive
term_2.length == 6
match_2.length == 2
# After being extended to full grapheme clusters, both terms "IFF" and "ffiffl" match the same span of grapheme clusters.
match_1 == match_2
Remarks
What is a Character?
A character is defined as an Extended Grapheme Cluster (see Unicode Standard Annex 29). This is the smallest unit that still has semantic meaning in most text-processing applications.
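To make the difference between code points and characters concrete, here is a rough Python sketch. It is an approximation only: the hypothetical helper approx_clusters merely attaches combining marks to the preceding base character, which covers a small subset of the rules in Unicode Standard Annex 29, and it is not how the library segments text.

```python
import unicodedata


def approx_clusters(text):
    """Group each combining mark with the character before it.

    This is a simplification of Extended Grapheme Cluster segmentation.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT + 'a'
print(len(word))                   # 3 code points
print(len(approx_clusters(word)))  # 2 approximate clusters
```

Under this model the accented letter counts as one character even though it is stored as two code points.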
Match Length
The function returns not only the index of the match but a Span instance containing both the start and end indices, which makes it possible to determine the length of the match. This is useful for case-insensitive matching. In case-insensitive mode, a single character can match multiple characters: for example, ß will match ss and SS, and the ligature ﬃ will match ffi or f, etc. Thus, in case-insensitive mode, the length of the match can be shorter or longer than the term that was being matched, so it is extremely important not to rely on the length of the term when analysing the matches, as the two may have different lengths.
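The length difference can be reproduced with Unicode case folding. The sketch below is written in Python under stated assumptions: locate_insensitive is a hypothetical brute-force helper (quadratic in the text length), it uses str.casefold as a stand-in for the library's case-insensitive comparison, and it counts code points rather than grapheme clusters.

```python
def locate_insensitive(text, term):
    """Return (start, end) of the first case-insensitive match, or None."""
    folded_term = term.casefold()
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            # A candidate matches when its case-folded form equals the folded term.
            if text[start:end].casefold() == folded_term:
                return (start, end)
    return None


term = "straße"
span = locate_insensitive("MONUMENTENSTRASSE 42", term)
print(len(term))          # 6
print(span)               # (10, 17)
print(span[1] - span[0])  # 7
```

The match is one unit longer than the term because ß folds to ss, so any slicing of the input should use the span's own bounds rather than the term length.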
Matching Grapheme Clusters
In case-insensitive mode, a single character can match multiple characters: for example, ß will match ss and SS, and the ligature ﬃ will match ffi or f, etc. Thus, in this mode, it is sometimes possible for a term to match only a part of a single grapheme cluster: for example, in the text ﬃa the term ia will match just one-third of the first grapheme ﬃ. Since we do not have the resolution to distinguish such partial matches (as that would require non-integer indices), a match which covers just a part of some grapheme cluster is extended and treated as if it matched the whole grapheme cluster.
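The extension step can be sketched as follows. This is an illustrative Python model, not the library's algorithm: locate_extended is a hypothetical helper, each Python code point stands in for one grapheme cluster, and str.casefold stands in for the case-insensitive comparison. The escapes \ufb03 and \ufb04 are the single-character ligatures for ffi and ffl.

```python
def locate_extended(text, term):
    """Find term case-insensitively and widen the match to whole characters.

    Returns (start, end) in character indices of text, or None.
    """
    folded_term = term.casefold()
    if not folded_term:
        return (0, 0)
    # Fold every character separately and record the range it occupies
    # in the folded string.
    ranges = []
    folded = ""
    for ch in text:
        part = ch.casefold()
        ranges.append((len(folded), len(folded) + len(part)))
        folded += part
    pos = folded.find(folded_term)
    if pos < 0:
        return None
    end = pos + len(folded_term)
    # Every character whose folded range overlaps the match is included whole.
    hit = [i for i, (s, e) in enumerate(ranges) if s < end and e > pos]
    return (hit[0], hit[-1] + 1)


print(locate_extended("\ufb03a", "ia"))             # (0, 2)
print(locate_extended("\ufb03\ufb04", "IFF"))       # (0, 2)
print(locate_extended("\ufb03\ufb04", "ffiffl"))    # (0, 2)
```

In the first call the term covers only the last third of the ligature, yet the reported span starts at index 0 because the ligature cannot be split. The last two calls return the same span, as in the ligature example earlier on this page.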