Text.locate_all
Group: Text
Aliases: index_of_all, position_of_all, span_of_all
Documentation
Finds all the locations of the term in the input. If the term is not found, the function returns an empty Vector.
Arguments
term: The term to find.
case_sensitivity: Specifies if the text values should be compared case sensitively.
Examples
Finding locations of all occurrences of a substring.
"Hello World!".locate_all "J" == []
"Hello World!".locate_all "o" . map .start == [4, 7]
Match length differences in case-insensitive matching.
term = "strasse"
text = "MONUMENTENSTRASSE ist eine große Straße."
match = text . locate_all term case_sensitivity=Case_Sensitive.Insensitive
term.length == 7
match . map .length == [7, 6]
Extending matches to full grapheme clusters.
ligatures = "ﬃﬄFFIFF"
ligatures.length == 7
match_1 = ligatures . locate_all "IFF" case_sensitivity=Case_Sensitive.Insensitive
match_1 . map .length == [2, 3]
match_2 = ligatures . locate_all "ffiff" case_sensitivity=Case_Sensitive.Insensitive
match_2 . map .length == [2, 5]
Remarks
What is a Character?
A character is defined as an Extended Grapheme Cluster, see Unicode Standard Annex 29. This is the smallest unit that still has semantic meaning in most text-processing applications.
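To illustrate why code points and characters differ, here is a rough Python sketch. It is not how this library segments text: the helper `approximate_clusters` is hypothetical and only attaches combining marks to the preceding base character, which covers a small part of the Annex 29 rules.

```python
import unicodedata
from typing import List


def approximate_clusters(text: str) -> List[str]:
    """Group each combining mark with the code point before it.

    Assumption: this is a simplification of Extended Grapheme Clusters.
    It ignores emoji sequences, Hangul syllables, and other UAX #29 rules.
    """
    clusters: List[str] = []
    for char in text:
        if clusters and unicodedata.combining(char) != 0:
            clusters[-1] += char  # attach the mark to the previous cluster
        else:
            clusters.append(char)  # start a new cluster
    return clusters


# "cafe" followed by U+0301 COMBINING ACUTE ACCENT
word = "cafe\u0301"
code_point_count = len(word)
cluster_count = len(approximate_clusters(word))
```

Here `word` holds five code points but only four approximate clusters, because the accent joins the preceding `e`.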
Match Length
The function returns not only the index of the match but a Span instance, which contains both the start and end indices and so makes it possible to determine the length of the match. This is useful for case-insensitive matching. In case-insensitive mode, a single character can match multiple characters: for example, ß will match ss and SS, and the ligature ﬃ will match ffi or f, etc. Thus, in case-insensitive mode, the length of a match can be shorter or longer than the term being matched, so it is important not to assume that a match has the same length as the term when analysing the matches.
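The length difference can be reproduced outside this library with Unicode case folding. The Python sketch below uses the built-in `str.casefold`, which applies full case folding. It is only an analogy for the behaviour described above, not this function's implementation, and the helper `folds_equal` is a hypothetical name.

```python
def folds_equal(left: str, right: str) -> bool:
    """Compare two strings after full Unicode case folding."""
    return left.casefold() == right.casefold()


term = "strasse"
candidates = ["STRASSE", "Straße"]

# Both candidates match the term, but they have different lengths.
lengths = [len(c) for c in candidates if folds_equal(c, term)]

# A single ligature code point (U+FB03) folds to three code points.
ligature = "\ufb03"
ligature_folded = ligature.casefold()
```

Both candidates are equal to the 7-character term after folding, yet their own lengths are 7 and 6, which is why the span length should be read from the match rather than from the term.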
Matching Grapheme Clusters
In case-insensitive mode, a single character can match multiple characters: for example, ß will match ss and SS, and the ligature ﬃ will match ffi or f, etc. Thus, in this mode, it is sometimes possible for a term to match only a part of a single grapheme cluster. For example, in the text ﬃa the term ia will match just one-third of the first grapheme ﬃ. Because indices are integers, there is no way to represent such a partial match (that would require non-integer indices), so a match that covers only part of a grapheme cluster is extended and treated as if it matched the whole grapheme cluster.
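The extension rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the library's algorithm: the helper `locate_all_folded` is hypothetical, it treats each code point as one character instead of performing real grapheme segmentation, and it returns non-overlapping `(start, end)` pairs with an exclusive end.

```python
from typing import List, Tuple


def locate_all_folded(text: str, term: str) -> List[Tuple[int, int]]:
    """Find case-insensitive matches and widen them to whole characters.

    Assumption: code points stand in for grapheme clusters.
    """
    folded_chars: List[str] = []  # folded text, one code point per entry
    owner: List[int] = []         # owner[i]: index in `text` that produced entry i
    for index, char in enumerate(text):
        for folded in char.casefold():
            folded_chars.append(folded)
            owner.append(index)

    haystack = "".join(folded_chars)
    needle = term.casefold()
    spans: List[Tuple[int, int]] = []
    if not needle:
        return spans

    position = haystack.find(needle)
    while position != -1:
        last = position + len(needle) - 1
        # A hit that covers only part of an original character is
        # widened to include that whole character on both sides.
        start, end = owner[position], owner[last] + 1
        spans.append((start, end))
        # Resume after the widened span so the same character is not reused.
        next_start = last + 1
        while next_start < len(owner) and owner[next_start] < end:
            next_start += 1
        position = haystack.find(needle, next_start)
    return spans


# U+FB03 is the "ffi" ligature, U+FB04 is the "ffl" ligature.
partial = locate_all_folded("\ufb03a", "ia")
ligatures = "\ufb03\ufb04FFIFF"
lengths_1 = [end - start for start, end in locate_all_folded(ligatures, "IFF")]
lengths_2 = [end - start for start, end in locate_all_folded(ligatures, "ffiff")]
```

In this sketch the term ia in the text ﬃa yields the span `(0, 2)`: the match touches only the last third of the ligature, but the whole ligature is included. The two length lists agree with the ligature example in the Examples section.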