Solved: Searching with a partial string

John Sh
I have a situation where a number of entries have been misspelled and I need to find the correct spelling in a reference list.
As an example, the misspelled word is "Dryopteriaceae" and the correct spelling is "Dryopteridaceae".
My idea is to start with the full word and work backwards until a match is found, using the code below.
This code fails because "nLen" is not recognised in the "(left(family, nLen))" statement, with or without the outer parentheses.
Is there a way to progressively limit the recordset field so that the search criteria would eventually find a match?
In this case, when both arguments are "Dryopteri", one could reasonably assume the correct match has been found.
I have seen a reference to using "Soundex" for this purpose, but even extending it out to 8 characters, it fails to find a matching pair.
At the requisite four characters, it finds numerous false matches, as it only samples the beginning of each word.
Is this possible, or is there a better way to achieve my goal?

Code:
Private Sub Form_Load()
    Dim nLen As Integer
    Dim str As String
    Dim rs1 As Recordset
    Set rs1 = oDB.OpenRecordset("taxon", dbOpenSnapshot)
    str = "Dryopteriaceae"
    nLen = Len(str)
    Do
        ' Fails here: nLen sits inside the quoted criteria string, so the
        ' database engine sees the literal text "nLen", not its value.
        rs1.FindFirst "(left(family, nLen)) ='" & Left(str, nLen) & "'"
        If Not rs1.NoMatch Then
            Exit Sub
        Else
            nLen = nLen - 1
        End If
    Loop While nLen > 1
End Sub
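For illustration only, the shrink-the-prefix idea can be sketched in Python (a plain list stands in for the recordset, and the reference names here are hypothetical):

```python
# A plain Python sketch of the shrink-the-prefix idea (a list stands in
# for the recordset; the reference names are hypothetical examples).
reference = ["Dryopteridaceae", "Aspleniaceae", "Polypodiaceae"]

def shrinking_prefix_match(word, names):
    """Shorten the prefix of `word` one character at a time until at
    least one name starts with it; return (prefix, matching names)."""
    for n in range(len(word), 0, -1):
        prefix = word[:n]
        matches = [name for name in names if name.startswith(prefix)]
        if matches:
            return prefix, matches
    return "", []

shrinking_prefix_match("Dryopteriaceae", reference)
# → ('Dryopteri', ['Dryopteridaceae'])
```

The loop stops at "Dryopteri", the shared nine-character prefix, which is exactly the point at which the question suggests a match can reasonably be assumed.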
 
I have a couple of demos using different techniques.
You can use Soundex, which allows you to search for things that sound alike.
But in the demos I found that a find-as-you-type is really the fastest way.

If I type "Dr" it finds anything with "Dr"; if I extend it to "Dry", it finds anything with "Dry".
See these examples.
 
In that example there are 10k names in the list. You can play with the find-as-you-type, and it would instantly find your example above.
But the more challenging cases are when you know it is something like "Cathy" or "Kathy" or "Cathy, Kathy, Cathie, Kathie, Catherine, Katherine...". That is where you might want additional features like Soundex, Levenshtein, or Simil, because with FAYT you will quickly find "Ca..." but it will not return "Kathy", and you may miss "Catherine".
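For comparison, an edit-distance measure such as Levenshtein scores whole words rather than prefixes, so "Cathy" and "Kathy" come out one edit apart even though a prefix search separates them. A minimal sketch in Python (illustration only; the thread's code is VBA):

```python
# Minimal Levenshtein edit distance (dynamic programming), in Python
# purely for illustration; the thread's database code is VBA.
def levenshtein(a: str, b: str) -> int:
    """Number of single-character inserts, deletes, or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

levenshtein("Cathy", "Kathy")                      # 1
levenshtein("Dryopteriaceae", "Dryopteridaceae")   # 1
```

A distance of 1 in both cases: one substitution for Cathy/Kathy, one inserted "d" for the fern family, which is why edit distance catches misspellings a prefix search cannot.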

I type in Cathy and it picks up other things that are similar but spelled differently, including Cathy, Catherine, Kathryn, Kattie.


[Attachment: Capture.PNG, screenshot of the similarity search results]

Now, I do not know if I coded the Soundex/2 wrong, but none of the K-sounding names appear. I found from my testing that the most useful is the Simil function. If you had both a find-as-you-type and Simil, to cover both types of searches, you should be able to find any 'like' strings quickly.
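As an illustration of what a Simil-style score looks like, Python's difflib uses Ratcliff/Obershelp matching, the same family of algorithm Simil is generally based on (an assumption; the VBA Simil implementation may differ in detail):

```python
from difflib import SequenceMatcher

# difflib's ratio() uses Ratcliff/Obershelp matching, the same family of
# algorithm as the Simil function discussed above (assumption: the VBA
# implementation may differ in detail). 1.0 means identical strings.
def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

similarity("Dryopteriaceae", "Dryopteridaceae")  # ~0.97
similarity("Cathy", "Kathy")                     # 0.8, despite C vs K
```

Unlike Soundex, the score does not care which letter differs, so Cathy/Kathy still rate highly.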
 

In this sort of technical case, can you not control the permitted values by using a table of values, including the id of the user who created the value? Then the "genus name" or whatever in the "subject" entity becomes a numeric id (referencing the permitted-values table), and your problem becomes far more tractable.

You could even add a field to confirm the genus name, so errors should only exist in the unconfirmed genus names. It then takes multiple people being incompetent for bad data to proliferate.
 
For data-entry convenience on an open text field like taxonomy, you can also use a combo box that selects distinct values from the taxonomy field itself. That way any taxonomies previously entered are available for quick selection without linking to a sub- or lookup-table.
 
Now I do not know if I coded the Soundex/2 wrong but none of the K sounding names appear.
I don't think Soundex does any phonetic matching, and it is dependent on the first character being correct. At least that is how I remember it.
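That first-letter dependence is easy to see in a sketch of standard Soundex (Python, for illustration only; this is not the thread's Soundex/2 code):

```python
# Standard 4-character Soundex sketch. The first letter is kept verbatim;
# only the remaining letters are encoded as digits, so "Cathy" and
# "Kathy" can never share a code.
def soundex(name: str) -> str:
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}
    def code(ch):
        for letters, digit in groups.items():
            if ch in letters:
                return digit
        return ""  # vowels, y, h, w carry no digit
    name = name.lower()
    result = name[0].upper()
    prev = code(name[0])
    for ch in name[1:]:
        c = code(ch)
        if c and c != prev:
            result += c
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = c
    return (result + "000")[:4]

soundex("Cathy")  # 'C300'
soundex("Kathy")  # 'K300' -- same digits, different first letter
```

The digits agree; only the preserved initial differs, which is why no K-spelled name ever lands in a C bucket.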
 
I expect Soundex would allow duplicates, because they sound correct. You want a way to prevent typos.

When you let users type text you are going to get loads of errors.

Something like 7mm will be entered as 7mm, 7 mm, 7m.m. and loads of other things. It's very hard to make anything foolproof.
 
Thanks to all.
I will try "like" and see how it goes.
I haven't heard of "Simil" but will have a look at it and MajP's demo.
The database I am working on was created in the 1980s and had nothing like a combo box so it was a case of whatever was entered was accepted.
I am now cleaning up that mess.
All critical entries are now from a combobox and additions can only be approved by a supervisor.

My original idea of reducing the size of the target and search strings was doomed to failure: if the two strings were of different lengths, there could never be a match.

John
 
I am now cleaning up that mess
I think, if I was doing a huge clean up I would pick one of those algorithms like simil.

Then loop through all your records and save every similar value that meets a certain score: maybe anything above 0.65 if using Simil, or 70 if using Levenshtein. These are then "pretty similar" records. Or maybe set it to return the top 10 most similar records.

So you would have a large table with maybe 5-10 child records either way.

Then build a user interface that shows a value and all similar records found. Have a way to quickly select a "master" if one exists, and to select any records whose value should be swapped for the master's. This will still be time consuming, but at least it is a doable strategy. Each record you review then needs a status: "Master", "Swapped", or "Reviewed", with a default of "Not Reviewed". This way you know which records have been reviewed and whether anything was done.

I agree with everyone that there are ways to ensure entry of unique values, but I do not think that is the problem. The problem is fixing what already exists.

Also, sort your results from the highest score down: a record with a Simil score of 0.99 is very likely a misspelling.
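A hypothetical sketch of that scoring pass in Python (difflib's ratio() stands in for the Simil function, and the 0.65 threshold is the one suggested above):

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sketch of the clean-up scoring pass: compare every pair of
# distinct names, keep pairs at or above the threshold, highest score
# first. difflib's ratio() stands in for Simil; 0.65 per the post above.
def similar_pairs(names, threshold=0.65):
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((round(score, 2), a, b))
    return sorted(pairs, reverse=True)

families = ["Dryopteridaceae", "Dryopteriaceae", "Aspleniaceae"]
similar_pairs(families)
# → [(0.97, 'Dryopteriaceae', 'Dryopteridaceae')]
```

The output is exactly the review queue described above: near-1.0 pairs at the top are almost certainly misspellings, while unrelated names fall below the threshold and drop out.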
 
The "FindFirst" statement that works:
Perhaps:
Code:
rs1.FindFirst "family like '" & Left(str, nLen) & "*'"
The code that finally worked is a slight variation on your suggestion,
Code:
rs1.FindFirst "[family] like '" & "*" & sStr & "*'"

where sStr is the result of Left(str, nLen).
Many thanks.
John
 
I have created a form that collects all the inherent errors in the main table, with links to spelling corrections, duplicate entries, family/genus mismatches and a host of other problems. The idea is to bring all these errors into a single hub where the supervisors can work their way through them.
Many of the errors require people with botanical training to correct, so a blanket fix is not really possible.
Simple spelling errors, specimen location errors, etc., are corrected in bulk from within the form.
Given the age of the database, it was never normalised.
Data entry was on the raw table, scanning across some 90 fields, all of which were text fields, including dates.
I have had Access analyse the data, but it virtually gave up on creating a normalised result.
Maybe I can get AI to have a look at it!
 
In general, cleaning large amounts of dubious data is not going to be easy.

If it's an ongoing issue, then I would still explore a table-based control system, as you could freeze the current problems and work on them without continuing data errors complicating things.
 
