Text.locate

locate term mode case_sensitivity

Group: Text
Aliases: position_of, span_of

Documentation

Find the location of the term in the input. Returns a Span representing the location at which the term was found, or Nothing if the term was not found in the input.

Arguments

  • term: The term to find.
  • mode: Specifies if the first or last occurrence of the term should be returned if there are multiple occurrences within the input. The first occurrence is returned by default.
  • case_sensitivity: Specifies if the text values should be compared case sensitively.

Examples

Finding location of a substring.

      "Hello World!".locate "J" == Nothing
      "Hello World!".locate "o" == Span (Range 4 5) "Hello World!"
      "Hello World!".locate "o" mode=Matching_Mode.Last == Span (Range 7 8) "Hello World!"

Match length differences in case-insensitive matching.

      term = "straße"
      text = "MONUMENTENSTRASSE 42"
      match = text . locate term case_sensitivity=Case_Sensitivity.Insensitive
      term.length . should_equal 6
      match.length . should_equal 7

Extending matches to full grapheme clusters.

      ligatures = "ﬃﬄ"
      ligatures.length == 2
      term_1 = "IFF"
      match_1 = ligatures . locate term_1 case_sensitivity=Case_Sensitivity.Insensitive
      term_1.length == 3
      match_1.length == 2
      term_2 = "ffiffl"
      match_2 = ligatures . locate term_2 case_sensitivity=Case_Sensitivity.Insensitive
      term_2.length == 6
      match_2.length == 2
      # After being extended to full grapheme clusters, both terms "IFF" and "ffiffl" match the same span of grapheme clusters.
      match_1 == match_2

Remarks

What is a Character?

A character is defined as an Extended Grapheme Cluster (see Unicode Standard Annex #29). This is the smallest unit that still has semantic meaning in most text-processing applications.

Match Length

The function returns not just the index of the match but a Span instance containing both the start and end indices, so the length of the match can be determined. This matters for case-insensitive matching, where a single character can match multiple characters: for example, ß will match ss and SS, and the ligature ﬃ will match ffi or f etc. In case-insensitive mode the match can therefore be shorter or longer than the term being matched, so do not rely on the length of the term when analysing matches; use the length of the returned Span instead.
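The length difference can be reproduced outside Enso with Unicode full case folding. The sketch below is written in Python, not Enso, and uses `str.casefold` purely as an illustration; that this mirrors the comparison `locate` performs internally is an assumption, and the helper name `folded_length` is hypothetical.

```python
# Illustration only: Python's str.casefold applies Unicode full case folding,
# assumed here to approximate the comparison used in case-insensitive mode.

def folded_length(text: str) -> int:
    """Number of code points in `text` after full case folding (hypothetical helper)."""
    return len(text.casefold())

term = "straße"
text = "MONUMENTENSTRASSE 42"

# The term has 6 characters, but it folds to the 7-character "strasse".
print(len(term), folded_length(term))  # 6 7

# The folded term occurs in the folded text. Slicing `text` with folded indices
# is only valid here because `text` contains no characters that expand when folded.
start = text.casefold().find(term.casefold())
matched = text[start:start + folded_length(term)]
print(start, matched)  # 10 STRASSE
```

The matched text is 7 characters long while the term is 6, which is why the span, not the term, should be used to measure a match.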

Matching Grapheme Clusters

In case-insensitive mode, a single character can match multiple characters: for example, ß will match ss and SS, and the ligature ﬃ will match ffi or f etc. As a result, a term can sometimes match only part of a single grapheme cluster: in the text ﬃa, the term ia matches just one-third of the first grapheme cluster, ﬃ. Indices do not have the resolution to describe such partial matches (that would require non-integer indices), so a match covering only part of a grapheme cluster is extended and treated as if it matched the whole grapheme cluster.
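The extension rule can be sketched as follows. This is a Python illustration, not the Enso implementation. It assumes that characters can be approximated by code points (true for the ligature examples on this page, but not for grapheme clusters in general) and that `str.casefold` stands in for the case-insensitive comparison. The function `locate_folded` is hypothetical.

```python
from typing import Optional, Tuple

def locate_folded(text: str, term: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) indices into `text` of the first case-insensitive
    match, widened to whole source characters, or None if there is no match."""
    folded = []  # folded code points of the whole text
    owner = []   # owner[i] = index in `text` of the character that produced folded[i]
    for index, char in enumerate(text):
        for piece in char.casefold():
            folded.append(piece)
            owner.append(index)

    needle = term.casefold()
    if not needle:
        return None  # empty terms are out of scope for this sketch
    pos = "".join(folded).find(needle)
    if pos < 0:
        return None
    # A match that covers only part of a folded character is widened to the
    # whole source character at both ends.
    first = owner[pos]
    last = owner[pos + len(needle) - 1]
    return first, last + 1

# "\ufb03" is the ligature ffi, "\ufb04" is the ligature ffl.
print(locate_folded("\ufb03a", "ia"))            # (0, 2): match starts inside the ligature
print(locate_folded("\ufb03\ufb04", "IFF"))      # (0, 2)
print(locate_folded("\ufb03\ufb04", "ffiffl"))   # (0, 2)
```

Both `IFF` and `ffiffl` resolve to the same two-character span once partial matches are widened, which matches the behaviour described in the ligature example above.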