Text.locate_all

locate_all term case_sensitivity

Group: Text
Aliases: index_of_all, position_of_all, span_of_all

Documentation

Finds all the locations of the term in the input. If the term is not found, the function returns an empty Vector.

Arguments

  • term: The term to find.
  • case_sensitivity: Specifies if the text values should be compared case sensitively.

Examples

Finding locations of all occurrences of a substring.

      "Hello World!".locate_all "J" == []
      "Hello World!".locate_all "o" . map .start == [4, 7]

Match length differences in case-insensitive matching.

      term = "strasse"
      text = "MONUMENTENSTRASSE ist eine große Straße."
      match = text . locate_all term case_sensitivity=Case_Sensitive.Insensitive
      term.length == 7
      match . map .length == [7, 6]

Extending matches to full grapheme clusters.

      ligatures = "ﬃﬄFFIFF"
      ligatures.length == 7
      match_1 = ligatures . locate_all "IFF" case_sensitivity=Case_Sensitive.Insensitive
      match_1 . map .length == [2, 3]
      match_2 = ligatures . locate_all "ffiff" case_sensitivity=Case_Sensitive.Insensitive
      match_2 . map .length == [2, 5]

Remarks

What is a Character?

A character is defined as an Extended Grapheme Cluster (see Unicode Standard Annex 29). This is the smallest unit that still has semantic meaning in most text-processing applications.
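To make the difference between code points and grapheme clusters concrete, here is a minimal Python sketch. It is not Enso code and not how Enso segments text: the helper `approximate_clusters` is hypothetical and only attaches combining marks to the preceding base character, which is a small subset of the rules in Unicode Standard Annex 29.

```python
import unicodedata


def approximate_clusters(text):
    """Group each combining mark with the character before it.

    Assumption: this is only a rough stand-in for extended grapheme
    clusters; the real rules also cover emoji sequences, Hangul, etc.
    """
    clusters = []
    for char in text:
        if clusters and unicodedata.combining(char) != 0:
            clusters[-1] += char  # attach the mark to the previous cluster
        else:
            clusters.append(char)  # start a new cluster
    return clusters


accented = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(accented))                        # 2 code points
print(len(approximate_clusters(accented)))  # 1 user-perceived character
```

Indices and lengths in the Spans described on this page count units of the second kind, not code points.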

Match Length

The function returns not just the index of each match but a Span instance containing both the start and end indices, which makes it possible to determine the length of the match. This matters for case-insensitive matching, where a single character can match multiple characters: for example, ß will match ss and SS, and the ligature ﬃ will match ffi or f etc. As a result, a case-insensitive match can be shorter or longer than the term being matched, so do not assume that a match has the same length as the term when analysing the results.
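The length change comes from Unicode case folding, which can be observed outside Enso as well. The following Python sketch uses the built-in `str.casefold` purely as an illustration; it makes no claim about how Enso performs the comparison internally.

```python
# Full Unicode case folding can change the number of code points.
sharp_s = "\u00df"       # ß, LATIN SMALL LETTER SHARP S
ffi_ligature = "\ufb03"  # ﬃ, LATIN SMALL LIGATURE FFI

folded_sharp_s = sharp_s.casefold()
folded_ligature = ffi_ligature.casefold()

print(len(sharp_s), len(folded_sharp_s))        # 1 2  (ß folds to "ss")
print(len(ffi_ligature), len(folded_ligature))  # 1 3  (ﬃ folds to "ffi")

# Two spellings of different lengths compare equal after folding.
print("Stra\u00dfe".casefold() == "STRASSE".casefold())  # True
```

Because the 6-character Straße and the 7-character STRASSE fold to the same text, a single 7-character term can produce matches of both lengths.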

Matching Grapheme Clusters

In case-insensitive mode, a single character can match multiple characters, for example ß will match ss and SS, and the ligature ﬃ will match ffi or f etc. In this mode it is therefore sometimes possible for a term to match only part of a single grapheme cluster: for example, in the text ﬃa the term ia will match just one-third of the first grapheme, ﬃ. Such partial matches cannot be represented, as that would require non-integer indices, so a match that covers only part of a grapheme cluster is extended and treated as if it matched the whole grapheme cluster.
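One way to picture the extension is to fold the text character by character while remembering which original character produced each folded piece, then map every match back to whole original characters. The Python sketch below follows that idea under stated assumptions: `locate_all_folded` is a hypothetical helper, single code points stand in for grapheme clusters, and matches are taken as non-overlapping. It is an illustration, not Enso's implementation.

```python
def locate_all_folded(text, term):
    """Return (start, end) spans of case-insensitive matches of term in text.

    Assumptions: indices count code points rather than grapheme clusters,
    and matches do not overlap.
    """
    folded = []
    owner = []  # owner[i] is the index in text that produced folded[i]
    for index, char in enumerate(text):
        for piece in char.casefold():
            folded.append(piece)
            owner.append(index)
    folded_text = "".join(folded)
    needle = term.casefold()

    spans = []
    if not needle:
        return spans
    position = folded_text.find(needle)
    while position != -1:
        last = position + len(needle) - 1
        # Map both ends back to original characters. A match that covers
        # only part of a folded character is widened to the whole character.
        spans.append((owner[position], owner[last] + 1))
        position = folded_text.find(needle, position + len(needle))
    return spans


def match_lengths(text, term):
    """Lengths of all matches, measured in original characters."""
    return [end - start for start, end in locate_all_folded(text, term)]


print(locate_all_folded("\ufb03a", "ia"))        # [(0, 2)]
print(match_lengths("\ufb03\ufb04FFIFF", "IFF"))  # [2, 3]
```

In the first call, ia covers only the last third of ﬃ plus the a, yet the reported span covers both characters in full. On the other examples from this page the sketch gives the same match lengths: [7, 6] for strasse and [2, 5] for ffiff.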