Substring Search Methods

Oforth examples. Feel free to post your own code.

Substring Search Methods

Postby Doug » 01 Oct 2018 14:37

The following String methods use the Boyer-Moore text search technique as the basic text search engine.
See the definition #search-text.
Methods with and without case sensitivity for the searches are provided.

Code: Select all
\ Douglas B Hoffman
\ Last Revision: 12 Oct 2018  12:47:00  dbh

\ fixed bug in #replall-txt, enabling one pass in #removeAll and #removeAllCI
\ changed set-skiptable and search-text to use byteSize and byteAt
\   this does two things: 1) greatly improves speed of searches
\                         2) handles UTF8 searches
\ added #>charIndex method to help handle UTF8
\ thanks to Franck Bensusan for all of the above

\ deleted chRemove-All and friends as they are not needed
\ added #? for testing
\ cleaned up stack effect comments
\ added removeAll and removeAllCI String methods
\ removed import: mapping
\ changed #search and #searchCI String methods to not require an offset

import: chars  \ to get >upper for Strings

\ helper method from Franck
\ Return character index corresponding to byte index
\ returns the character index at a byte index into an utf-8 string
String method: >charIndex ( n -- m )   
    1 1 while ( dup n < ) [
        self utf8At drop
        swap 1+ swap
        ]
    drop
;

: noop ;

tvar: skiptable

: set-skiptable  ( pattern -- )
  | i r m | 1 -> r \ r stores the rightmost occurrence of each char in pattern
                   \ into the skiptable
  pattern byteSize -> m
  256 m Array newWith to skiptable
  pattern byteSize loop: i [ i pattern byteAt m r - skiptable put r 1+ -> r ]
  ;

\ text string size = n
\ i = index into the text
\ pattern string size = m
\ j = index into the pattern

\ search-text uses the Boyer-Moore text search algorithm.
\ text is text to be searched for the pattern substring.
\ offset is offset into text where the search is to begin.
\ If the search is successful n is the offset into the text where
\ pattern was found, otherwise return 0.
\ flag=true  then do a case-insensitive search
\ flag=false then do a case-sensitive   search
: search-text ( pattern offset flag text  -- n | false )
  | n m i j upper |
  flag if #>upper else #noop then -> upper
  text byteSize -> n
  pattern byteSize -> m
  m offset + -> i
  m -> j
  pattern upper execute dup set-skiptable -> pattern
  begin
    i 1- n >= if false return then
    i text byteAt upper execute dup j pattern byteAt
    if=: [
      drop
      i 1- -> i  j 1- -> j
      ]
    else: [
      m j - 1+
      swap skiptable at
      max i + -> i
      m -> j
      ]
    j 1 <
  until
  i 1+ text >charIndex ;

: Search \ ( pattern offset s -- n | false )
  false swap search-text ;

: SearchCI \ ( pattern offset s -- n | false )
  true swap search-text ;

\ search for string pattern, if found return offset n, otherwise return false
String method: search \ ( pattern s -- n | false )
  0 self Search ;

\ same as #search but case-insensitive
String method: searchCI \ ( pattern s -- n | false )
  0 self SearchCI ;

\ search for s1, if found replace with s2, new str s3, offset and true returned, else false
: sch&repl-text ( s1 s2 off r s -- s3 offset true | false )
  | s3 i | String new -> s3
  s1 off s r execute
  dup if -> off
         1 off 1- for: i [ i s at s3 add ]
         s3 s2 + >string -> s3
         off s1 size +
         s size  for: i [ i s at s3 add ]
         s3 off true
      else drop false
      then ;

\ search for s1, if found replace with s2, new str s3, offset and true returned, else false
String method: sch&repl \ ( s1 s2 s -- s3 offset true | false )
  0 #Search self  sch&repl-text ;

\ same as #sch&repl but case-insensitive
String method: sch&replCI \ ( s1 s2 s -- s3 offset true | false )
  0 #SearchCI self sch&repl-text ;

\ search for s1, if found remove it, return new str s3 offset and true if found, else false
String method: sch&remove \ ( s1 s -- s3 offset true | false )
  String new 0 #Search self sch&repl-text ;

\ same as #sch&remove but case-insensitive
String method: sch&removeCI \ ( s1 s -- s3 offset true | false )
  String new 0 #SearchCI self sch&repl-text ;

\ in string s replace all occurrences of s1 with s2, using runnable r
\ return new string s3 to replace s
: replall-txt ( s1 s2 r s  -- s3 )
  | off | 0 -> off
  begin
    s1 s2 off r s sch&repl-text
    if s1 size - -> off -> s
    else s return
    then
  again
;

\ replace all occurrences of s1 with s2, return new string s3
String method: replAll \ ( s1 s2 s -- s3 )
  #Search self replall-txt ;

\ replace all occurrences of s1 with s2 case-insensitive, return new string s3
String method: replAllCI \ ( s1 s2 s -- s3 )
  #SearchCI self replall-txt ;

\ remove all occurrences of s1, return new string s3
String method: removeAll \ ( s1 s -- s3 )
   String new #Search self replall-txt ;

\ remove all occurrences of s1, return new string s3
String method: removeAllCI  \ ( s1 s -- s3 )
   String new #SearchCI self replall-txt ;


forget: Search
forget: SearchCI
forget: replall-txt
forget: sch&repl-text



test: [ "c" "ab‚Ǩcd" search  6 == ]
test: [ "‚Ǩ" "ab‚Ǩcd" search  3 == ]

test: [ "NEEDLE" "FINDINAHAYSTACKNEEDLE" search 16 == ]
test: [ "NeeDLE" "FINDINAHAYSTACKNEEDLE" search 0 == ]

test: [ "NEEDLE" "FINDINAHAYSTACKNEEDLE" searchCI 16 == ]
test: [ "NeeDLE" "FINDINAHAYSTACKNEEDLE" searchCI 16 == ]
test: [ "lo" "HI" "hellothere" sch&repl 1 == 4 ==  "helHIthere" == ]

test: [ "lo" "HI" "hellothere" sch&repl 1 == 4 ==  "helHIthere" == ] 
test: [ "LO" "HI" "hellothere" sch&repl 0 == ]
test: [ "LO" "HI" "hellothere" sch&replCI 1 == 4 ==  "helHIthere" == ]

test: [ "lo" "hellothere" sch&remove 1 == 4 == "helthere" == ]
test: [ "LO" "hellothere" sch&remove 0 == ]
test: [ "LO" "hellothere" sch&removeCI 1 == 4 == "helthere" == ]

 
test: [ "lo" "HI" "hellotherelobaaaalolo" replAll   "helHIthereHIbaaaaHIHI" == ]
test: [ "o" "H" "hellotherelobaaaalolo" replAll   "hellHtherelHbaaaalHlH" == ]
test: [ "needle" "PAT" "findinahaystacknEEdleanotherneedle" replAll   "findinahaystacknEEdleanotherPAT" == ]
test: [ "lo" "HIZZ" "hellotherelobaaaalolo" replAll "helHIZZthereHIZZbaaaaHIZZHIZZ" == ]
test: [ "HIZZ" "lo" "helHIZZthereHIZZbaaaaHIZZHIZZ" replAll "hellotherelobaaaalolo" == ]

test: [ "lo" "HI" "HELLOTHERELOBAAAALOLO" replAllCI   "HELHITHEREHIBAAAAHIHI" == ]
test: [ "needle" "PAT" "findinahaystacknEEdleanotherneedlE" replAllCI   "findinahaystackPATanotherPAT" == ]


test: [ "lo" "hellotherelobaaaalolo" removeAll "heltherebaaaa" == ]
test: [ "Lo" "hellotherelobaaaalolo" removeAllCI   "heltherebaaaa" == ]
test: [ "," "123,456,789.12" removeAll   "123456789.12" == ]
test: [ "a" "helloa byAAloap" removeAll  "hello byAAlop" == ]
test: [ "a" "bbas23AjjAa" removeAllCI   "bbs23jj" == ]
test: [ "a" "helloa byAAloap" removeAllCI   "hello bylop" == ]

test: [ "r‚àë‚àÇ∆í¬©" "abshytr‚àë‚àÇ∆í¬©uirtdgr‚àë‚àÇ∆í¬©r‚àë‚àÇ∆í¬©oskde" removeAll  "abshytuirtdgoskde" == ]


test: [ "‚àë" byteSize  7 == ]
test: [ "r∑∂ƒ©" byteSize 24 == ]

Last edited by Doug on 12 Oct 2018 18:58, edited 5 times in total.
Doug
 
Posts: 20
Joined: 20 Jul 2018 14:26

Re: Substring Search Methods

Postby Franck » 01 Oct 2018 16:48

Hello Doug,

Great, thank you !

For deferred words, version V1.2 added a new package, so you can just include it :
Code: Select all
import: defer


Into this package, word #=> is declared to update the action of a defer. So you can write :
Code: Select all
  flag if #>upper else #noop then  => upper


Regards,
Franck
Franck
 
Posts: 159
Joined: 29 Oct 2014 19:01

Re: Substring Search Methods

Postby Doug » 01 Oct 2018 21:41

Thanks for the tips, Franck.

Once the Boyer-Moore #search-text was working the rest of the code was pretty easy to do in Oforth. GC is a wonderful thing!

Regards,
-Doug
Doug
 
Posts: 20
Joined: 20 Jul 2018 14:26

Re: Substring Search Methods

Postby Doug » 07 Oct 2018 16:56

I changed the code submission as follows:

\ added #? for testing
\ cleaned up stack effect comments
\ added removeAll and removeAllCI String methods
\ removed import: mapping
\ changed #search and #searchCI String methods to not require an offset
Doug
 
Posts: 20
Joined: 20 Jul 2018 14:26

Re: Substring Search Methods

Postby Doug » 12 Oct 2018 03:25

Updated again to greatly increase search speed and to properly handle UTF8 text. Thanks to Franck in getting this done.

-Doug
Doug
 
Posts: 20
Joined: 20 Jul 2018 14:26

Re: Substring Search Methods

Postby Franck » 12 Oct 2018 08:51

You're welcome,

This has much better performances than the generic basic sub collection search and it supports UTF8 strings.

Perhap's you should consider making a package of it ?

(and perhap's I should consider opening a page for this kind of package...).

Franck
Franck
 
Posts: 159
Joined: 29 Oct 2014 19:01


Return to Oforth examples

Who is online

Users browsing this forum: No registered users and 1 guest

cron