Jul 27, 2006

Pattern matching - One or more occurances of a character

This was actually a question that popped in the discussion forum. I just want to keep all the good things that come out on the discussion forum (where I had SOME role to play) in this blog :)
Here is the issue.

Say, I have a table like this..

create table abcCollection(abc varchar(30) not null)

And I insert the following data

INSERT abcCollection values('ABC')
INSERT abcCollection values('ABxC')
INSERT abcCollection values('ABBBC')
INSERT abcCollection values('ABBBBxBC')
INSERT abcCollection values('ABxBBBBBBBC')
INSERT abcCollection values('123ABBBC456AB')
INSERT abcCollection values('dfdfddsds')
INSERT abcCollection values('acdaccdbbvfcab')
INSERT abcCollection values('acbcabbbfc')
INSERT abcCollection values('acbcabbbfc')
INSERT abcCollection values('abbbcafffc')
INSERT abcCollection values('acabc')
INSERT abcCollection values('acac')
INSERT abcCollection values ('abbdddbbc')
INSERT abcCollection values ('abbdddbabc')


Now I want to get the following result

ABC
ABBBC
123ABBBC456AB
abbbcafffc
acabc
abbdddbabc


That is, the result should contain A followed by one or more Bs followed by C. This pattern can appear anywhere within the string.

We cannot directly use the LIKE operator since it doesn't support this kind of a search.

There are quite a few methods to solve this. I am including all the solutions here.

The following methods will work for SQL Server 2000 and 2005.

Method 1: Call a recursive function.
This function keeps replacing 2 continous Bs with a single B till it doesn't find any and then search for 'ABC'

create function dbo.fn1(@input varchar(30))
returns bit
as
begin
return case when @input like '%abc%' then 1
when charindex('bb',@input) > 0 then dbo.fn1(replace(@input,'bb','b'))
else 0 end
end


And you can call the function like this..

select abc from abcCollection where dbo.fn1(abc) = 1

But, This is based on recursion and you can nest it only 32 times. So if your text has more than 32 Bs continously, then it might fail and its performance intensive too.

Method 2: Use a while loop within the function.
The logic of this function is similar to Method 1 with a different implementation.

create function fn1 (@str as varchar(30)) returns varchar(30)
as
begin
while (PATINDEX('%BB%', @str) > 0)
begin
set @str = replace (@str,'BB','B')
end
return @str
end


And you can call the function this way:

select abc from abcCollection where PATINDEX ('%ABC%', dbo.fn1(abc)) > 0

While this will work for big strings, you will have to execute a while loop for every column and so its performance intensive.

Method 3: Filter in the query (smart replace)
This solution is an absolute brainer, originally conceived by Robert Carnegie a long time back, shown to me by Steve Kass:

select abc from abcCollection where
replace(replace(replace(abc,'B','<>'),'><',''),'<>','B')
like '%ABC%'


Here is how it works:
Take for example you have the string 'ABBBC'
First replace will replace 'B' with '<>'
So the string becomes A<><><>C
The Second replace will replace '><' with an empty string( '' )
So A<><><>C becomes A<>C
Now the third replace replaces '<>' with 'B'
So it becomes ABC and now this is compared against ABC.

Method 4:
Another version conceived from Method 3

select abc from abcCollection where replace(replace(abc,'ac',''),'b','') like '%ac%'

Here is what it does:
Take a string like this 'ACDFABBBCCA'
When you replace 'AC' with empty string ( '' ) it becomes 'DFABBBCCA'
Now replace 'B' with an empty string, it becomes 'DFACCA'
Now this string is compared against 'AC'

The following is the implementation of Method 1 in SQL Server 2005 using common table expressions. Its still a bad performer when compared to Methods 3 and 4. But here it is anyways :)


with cte as
(select abc,replace(abc,'bb','b') as d from abcCollection
union all
select abc, replace(d,'bb','b') from cte where d <> replace(d,'bb','b')
)
select abc from cte where d like '%abc%'

1 comments:

kuyi said...

I have been researching a clever way of dealing with pattern matching in SQL Server 2000 for a while. This is a great post.

Thanx a bunch for the post.

Post a Comment