HOWTO: SQL Weighted Best Match Query with Fuzzy Logic
Posted: 9/29/2007 4:53:24 PM
By: Comfortably Anonymous
Times Read: 1,944
0 Dislikes: 0
Topic: Windows Administration

Doing a definitive query using T-SQL is easy. When you know all the criteria in the query it's as easy as SELECT something WHERE Criteria1=x AND Criteria2=y. But in the case where you have some loosely defined criteria and want SQL to find the best possible match rather than an exact match, it gets to be a tricky problem.

Introduction

There are two basic types of best matching approaches: A Best Match Substring Search and a Best Match Full Criteria Search. A Substring Search is for best matching in text searches. (There may be a more standard term for that but I am not aware of one so am making up my own term for it.) This is more along the lines of finding the best matches in a block of text, similar to a search engine (aka Google) search. A Criteria Search where you have the complete value or text of each of the parameters/criteria you are searching for, and need to find which entry in the database matches the highest number of criteria in that search. Some criteria may be undefined (null). You may wish to add "mass" to a certain criteria type to give it more "weight" in the search, such as "If you match on [Criteria A but not criteria B or criteria C], it's considered a better match than matching on [Criteria B and criteria C, but not criteria A]. This means that a higher weight has been assigned to criteria A than on criteria B and criteria C.

Note that this article deals with the second type of query, a Best Match Full Criteria Seach. If you are looking for info about Substring Searching in Text, try
http://www.google.com/search?hl=en&q=t-sql+best+match+text+search

Also note that this article details a best match Criteria Search using T-SQL (Microsoft SQL Server), but it should be easy to convert the logic of the examples to PL-SQL (Oracle), PostgreSQL, MySQL, or any other SQL-based database server that supports stored procedures. The logic is pretty straight-forward and doesn't use anything platform-specific.

Creating the Solution

I had an issue where I needed to fetch data based on the best match of four criteria: A Category Name, a Vendor ID, a Context, and a Service. Also, for a couple of them, I had some existing data that contained a "wildcard" word that meant it applied to any entry, as long as there was not a definitive entry to match on, which made it still more complicated to account for. I did a lot of searching on the net looking for a similar solution but only found articles about finding best matches when dealing with substrings of text. For my issue, each of my criteria, when it existed, would be a full word, I did not need to do substring searching. It seems there is a large wealth of substring best match searching articles out there, but nothing on the criteria best matching, thus the reason I decided to post this article to save the next person the hair loss this problem nearly caused me. :)

OK, shut up, quit blabbering and get on with the technical details of the article dude... 

Technical Details

I have a table called Lookup which contains the following columns:

Data	(Primary Key, varchar(200), not null) – The data to return from the best matching search.
Category (varchar(128), not null) – Consider this Criteria A
Vendor (varchar(6), not null) – Criteria B
Context (varchar(16), not null) – Criteria C
Service (varchar(16), not null)
– Criteria D
For purposes of this article lets just populate the table with five rows to keep it simple:


DataCategoryVendorContextService
ResultATest100001DevCommon
ResultBTest100001TestCommon
ResultCTest100002TestOrange
ResultDTest200001ProdCommon
ResultETest200002ProdCommon

The stored procedure:

--------------------------------------------------------------------
-- Does a Weighted Select against passed criteria in order to find
-- the record containing the best matching Data.
--------------------------------------------------------------------
CREATE proc [dbo].[rspBestMatch]
(

@Category varchar(128),

@Vendor varchar(6),

@Context varchar(16),

@Service varchar(16)

)
AS
DECLARE @TTemp TABLE (Data varchar(200), Mass int)
DECLARE @VendorMatch int
DECLARE @VendorCount int
-- Should never call this proc without a vendor, but if someone 
-- does, assume a VendorID of 0 so will match on any vendor.
-- (Vendor 0 matches any vendor unless a Vendor > 0 is passed)
IF (@VENDOR = NULL)
SET @Vendor = 0
IF (@CONTEXT = NULL)
SET @Context = ''
 -- Make sure parameter data is in lower case.
SET @Category = Lower(@Category)
SET @Context = Lower(@Context)
SET @Service = Lower(@Service)
IF (@Vendor > 0)
BEGIN
SET @VendorMatch = (
SELECT Count(1)
FROM Lookup
WHERE (Lower(Category) = @Category)
AND (Vendor = @Vendor)) 
	IF @VendorMatch = 0
SET @Category = 0
END
-- Collect entries that match on Vendor
INSERT INTO @TTemp
SELECT Data, 10 as Mass
FROM Lookup
WHERE Vendor = @Vendor
AND Lower(Category) = @Category
SET @VendorCount = (SELECT COUNT(1) FROM @TTemp)
-- Only if we found an Category/Vendor match should we look for 
-- other criteria, otherwise we want to return nothing.
IF (@VendorCount > 0) 
BEGIN
-- Collect entries that match on Context
IF (@Context > '')
BEGIN

INSERT INTO @TTemp

SELECT Data, 1 as Mass

FROM Lookup

WHERE Lower(Context) = @Context

AND Lower(Category) = @Category
END

 

-- Collect entries that match on Service

IF @Service IS NOT NULL

BEGIN

INSERT INTO @TTemp

SELECT Data, 2 as Mass

FROM Lookup

WHERE Lower([Service]) = @Service

AND Lower(Category) = @Category

 

-- If passed Service does not equal common,

-- then also attempt to match on Service='common'.

IF Lower(@Service) <> 'common'

BEGIN

INSERT INTO @TTemp

SELECT Data, 1 as Mass

FROM Lookup

WHERE Lower([Service]) = 'common'

AND Lower(Category) = @Category

END

END

 

END
-- Find Data with highest weight
SELECT TOP 1 Data
FROM @TTemp
GROUP BY Data
ORDER BY Sum(Mass) DESC
--------------------------------------------------------------
 
EXPERIMENTATION













Now to experiment with this a bit:

Execute rspBestMatch with the following data:

 

  1. Category='Test1', Vendor='00001', Context='Dev', Service='common'

    You should get back ResultA from the query as it is the best match.

  2. Category='Test1', Vendor='00002', Context=NULL, Service=NULL


    You should get back ResultC from the best match query

  3. Category='Test2', Vendor=NULL, Context=NULL, Service=NULL


    You should get back ResultE this time, as the only thing we matches on was the category.

  4. To demonstrate the weighting on the Vendor criteria, use the following data; Category='Test1', Vendor='00002', Context='Dev', Service=NULL.

    You should get back ResultC in this case. Even though we had an match on Context='Dev', the Vendor='00002' won out as the best match because if you look in the sproc code above, when we get a match on Vendor, it gets assigned a mass of 10. Matches on Context are only assigned a mass of 1.

 

To get a little more understanding about what it going on behind the scenes with the query, change the final four lines (After 'Find Data with highest weight' to:

SELECT Data, Sum(Mass) as Weight
FROM @Ttemp
GROUP BY Data
ORDER BY Weight DESC


This will show you the sorting and grouping going on to find the best match.

Enjoy!

 
Rating: (You must be logged in to vote)
Discussion View:
Replies:

HOWTO: SQL Weighted Best Match Query with Fuzzy Logic
Posted: 9/29/2007 4:53:24 PM
By: Comfortably Anonymous
Times Read: 1,944
0 Dislikes: 0
Topic: Windows Administration

This is a test

Rating: (You must be logged in to vote)