Advanced Searches / "Fuzzy Logic"

connexion

Hello peeps.

Does anyone know much about searching techniques in terms of finding data based on fuzzy logic?

I have a fairly capable database search for finding product records. It currently takes text strings (in any order), treats spaces as wildcard characters, and looks for exact matches to all of the strings entered across multiple table fields. What I'm looking to do is make what is good even better, allowing for users' spelling mistakes etc.
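For context, that kind of search is typically assembled by tokenizing on spaces and AND-ing a Like predicate per word. A rough sketch of how such a filter might be built (names are illustrative, and quote-escaping is omitted for brevity):

Code:
Public Function BuildSearchWhere(ByVal criteria As String, ByVal fieldName As String) As String
    ' Build "[Field] Like '*word1*' AND [Field] Like '*word2*' ..." from a
    ' space-separated search string, so word order doesn't matter.
    Dim words() As String
    Dim i As Long
    Dim clause As String
    words = Split(Trim$(criteria), " ")
    For i = LBound(words) To UBound(words)
        If Len(words(i)) > 0 Then
            If Len(clause) > 0 Then clause = clause & " AND "
            clause = clause & fieldName & " Like '*" & words(i) & "*'"
        End If
    Next i
    BuildSearchWhere = clause
End Function

e.g. BuildSearchWhere("blue widget", "[Description]") returns
[Description] Like '*blue*' AND [Description] Like '*widget*'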

I'm kind of thinking about breaking down character strings into smaller segments, then scoring those that partially match, and ordering the results by the score obtained (highest score at top).

Has anyone got any good ideas for reference material/examples on this one?

Thanks
Vince
 
Interesting (abbreviation for "very complicated"!)?

Thanks Wayne,

I was thinking of something that would still be likely to find the correct items, but would get around users misspelling words, by working on the probability of being correct rather than a flat correct-or-not.

The best I could come up with easily was to take the search string entered, then split each word entered into groups of 2 characters, discarding the remaining character if a word contains an odd number. I'd then search on the result of this rather than the full string entered, looking for the highest probability among the matches.

The remaining single characters would be removed as part of the splitting process, as matches on single characters would be relatively worthless.

e.g. the string "Search for this item" would become "se ar ch fo th is it em".
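Something like the following is what I have in mind for the split (a rough sketch; the function name is just for illustration):

Code:
Public Function SplitIntoPairs(ByVal searchText As String) As String
    ' "Search for this item" -> "se ar ch fo th is it em"
    Dim words() As String
    Dim w As Long, i As Long
    Dim result As String
    words = Split(LCase$(Trim$(searchText)), " ")
    For w = LBound(words) To UBound(words)
        ' Step through each word two characters at a time; an odd
        ' trailing character (and any one-letter word) is never reached.
        For i = 1 To Len(words(w)) - 1 Step 2
            If Len(result) > 0 Then result = result & " "
            result = result & Mid$(words(w), i, 2)
        Next i
    Next w
    SplitIntoPairs = result
End Function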

This still finds anything containing the same characters, but rather than requiring the specific character strings, each 2-character chunk that matches contributes to a "probability score", with the highest-scoring lines displayed at the top of the list of search results.

Spelt correctly, "search for this item" = "SE AR CH FO TH IS IT EM" = 8 matches.
Incorrect spelling, "serch for this item" = "SE rc FO TH IS IT EM" = 6 matches, "rc" being the only chunk that fails to match (placing it further down the list).

Unless the user has terrible spelling problems, the result would still appear even when the search is not spelt correctly: the item required may or may not have the highest "string match score", but it would at least register a partial match and would therefore be presented in the results.
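The scoring side might look like this (again just a sketch; note that looking for each chunk anywhere in the target text is slightly more generous than matching chunk against chunk, since e.g. "rc" can be found inside "search"):

Code:
Public Function ChunkScore(ByVal queryChunks As String, ByVal targetText As String) As Long
    ' Count how many of the space-separated 2-character chunks
    ' occur anywhere in the target text (case-insensitive).
    Dim chunks() As String
    Dim i As Long, score As Long
    chunks = Split(queryChunks, " ")
    For i = LBound(chunks) To UBound(chunks)
        If Len(chunks(i)) > 0 Then
            If InStr(1, targetText, chunks(i), vbTextCompare) > 0 Then score = score + 1
        End If
    Next i
    ChunkScore = score
End Function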

I've tried it out a bit on a database with 30,000 descriptions in it and it seems to work surprisingly well when mistakes are tested.

Am I oversimplifying this? What do you think?

Vince
 
The problem with this approach is that it can get expensive for large searches. Fuzzy matching (i.e. "kinda looks like a match"...) is a great idea, but suppose you should have broken up the search differently? Like

S EA RC H F OR THI S ITE M or

SEA RCH FOR T HIS I TEM or .... you get the point.

The question will always be, "How far do you wish to go with these text splits?"

TECHNICALLY, you should use a sliding-window approach: take your original search string and the body of text to be searched, compare each character of the search string against the text body at a given alignment and generate a score, then slide the string one position to the right and do the same thing again. Where you have a close match, the score jumps up a lot higher at the alignment where the characters line up.
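A minimal sketch of that sliding window, assuming a simple character-for-character score at each alignment:

Code:
Public Function SlideScore(ByVal searchText As String, ByVal bodyText As String) As Long
    ' Slide the search string across the body one character at a time
    ' and return the best character-for-character agreement found.
    Dim offset As Long, i As Long
    Dim score As Long, best As Long
    searchText = LCase$(searchText)
    bodyText = LCase$(bodyText)
    For offset = 1 To Len(bodyText) - Len(searchText) + 1
        score = 0
        For i = 1 To Len(searchText)
            If Mid$(bodyText, offset + i - 1, 1) = Mid$(searchText, i, 1) Then score = score + 1
        Next i
        If score > best Then best = score
    Next offset
    SlideScore = best   ' 0 if the body is shorter than the search string
End Function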

But even that isn't foolproof, since you face THREE possible classes of error: substitution, insertion, and omission. This approach only works well for the general substitution case or, for the insertion and omission cases, where the error is at the beginning or end of the search string.

So what do you do? I suggest a search for "Fuzzy logic" and "String Searches" on Google or one of the other search sites. See if someone else has already published an algorithm. The ultimate problem is that to do this really right is CPU intensive, and VBA only semi-compiles, so it isn't the most efficient language you could use.
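For what it's worth, the classic published algorithm that handles all three error classes uniformly is Levenshtein edit distance. A minimal VBA sketch of the standard dynamic-programming formulation (and a good illustration of why this gets CPU intensive, since it builds one m x n table per comparison):

Code:
Public Function Levenshtein(ByVal a As String, ByVal b As String) As Long
    ' Minimum number of substitutions, insertions and omissions
    ' needed to turn string a into string b.
    Dim d() As Long
    Dim i As Long, j As Long, cost As Long
    ReDim d(0 To Len(a), 0 To Len(b))
    For i = 0 To Len(a): d(i, 0) = i: Next i
    For j = 0 To Len(b): d(0, j) = j: Next j
    For i = 1 To Len(a)
        For j = 1 To Len(b)
            If Mid$(a, i, 1) = Mid$(b, j, 1) Then cost = 0 Else cost = 1
            d(i, j) = d(i - 1, j) + 1                                           ' omission
            If d(i, j - 1) + 1 < d(i, j) Then d(i, j) = d(i, j - 1) + 1         ' insertion
            If d(i - 1, j - 1) + cost < d(i, j) Then d(i, j) = d(i - 1, j - 1) + cost ' substitution
        Next j
    Next i
    Levenshtein = d(Len(a), Len(b))
End Function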
 
Vince,

I second Doc's opinion. You can make a really flexible search engine using Access. I'd give the users the benefit of the doubt. In most apps, with quick retrieval times, they can try "Smythe", "Smyth" and finally figure out that it's "Smith" they're looking for.

Unless you (and they) are willing to spend a lot of time (yours), then most traditional search strategies should apply.

Wayne
 
Calm down! lol

I was only suggesting!! lol

I think what is important in getting searches right is finding a solution that sticks to "rules" that aid speed, but leaves the door ajar just enough to allow a little "anticipation" of what the user is trying to find.

Splitting the strings would make a big difference according to how the split is done, but if a "rule" is applied to ensure that splitting follows a logical pattern, then within that "rule" certain user errors can be accounted for.

For example, accepting that most mistakes a user makes are either typing an incorrect character or getting two characters the wrong way round, and that in any case the mistake is usually within one word rather than every word of the string being misspelt, incorrectly entered words can be accounted for within the "rule".

If the split followed a logical pattern, would it really be that intensive? (A sketch putting the steps together follows the list below.)

1) Treat each word of the whole search string, separated by spaces, as a separate string making up the search.
2) If the number of characters within a word is odd, remove the last character, leaving something that can be split into chunks of 2 characters.
3) Break each word into 2-character chunks, inserting spaces between them to build the new search string, treating the spaces as wildcard characters.
4) Search through the data looking for an occurrence of each 2-character chunk within the search string (in any order).
5) Apply a score to each item of data for every 2-character chunk of the search string that matches the data being searched.
6) Display results ordered by the highest number of matches found.
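Putting the steps together, roughly (table and field names are made up for illustration; this assumes the SplitIntoPairs and ChunkScore sketches from earlier in the thread, plus a DAO reference, which Access has by default):

Code:
Public Sub RankMatches(ByVal criteria As String)
    ' Score every description against the 2-character chunks (steps 1-5)
    ' and print the hits; assumes a Products table with a Description field.
    Dim db As DAO.Database, rs As DAO.Recordset
    Dim chunks As String, score As Long
    chunks = SplitIntoPairs(criteria)
    Set db = CurrentDb
    Set rs = db.OpenRecordset("SELECT Description FROM Products", dbOpenSnapshot)
    Do While Not rs.EOF
        score = ChunkScore(chunks, rs!Description & "")
        If score > 0 Then Debug.Print score, Left$(rs!Description & "", 60)
        rs.MoveNext
    Loop
    rs.Close
    ' Step 6: in practice, write the scores to a temp table or array and
    ' sort descending before displaying the results.
End Sub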

This means the splitting rule follows a logical pattern with a built-in "tolerance" for error, rather than a straightforward right/wrong approach.

Replicating what would actually happen, I'm finding that the search returns basically the same data as a more traditional search; where I simulate user mistakes by manually eliminating parts of the search string, it doesn't really affect the results, and it isn't really any slower.

The data I am using is based on 28,000 descriptions in a memo field.
Exactly the correct search string finds 1 result.
An incorrect search string (one character out of place) finds none.
The split string finds 17, the correct one being at the top of the list.

It's difficult to explain without pointing out that our current search already treats spaces as wildcards and looks for exact matches of words within the search string in the descriptions anyway.

I am always on the side of accuracy over speed: I'd rather each search take 1 second and find just what I want than take 1/10th of a second and find nothing.

I'm going to have a good read around all this and build an example, if nothing else just to prove to myself that I'm talking garbage! "If at first you don't succeed..." and all that..lol !!

Vince
 
