Filtering only Worlds...Real words...

mtairhead

Registered User.
Local time
Today, 01:38
Joined
Oct 17, 2003
Messages
138
I'm having a problem with one of my queries...And I think it's more of a logistics problem than an I-don't-know-what-I'm-talking-about problem.

I have a table with around 20,000 records. It's complicated to get into, but the entire table only has one field, and each record only contains one word. I've already used MS Excel to filter out repeats, which means I currently have 20,000 unique English words.

The problem is: Because of the way these records were collected, a lot of them aren't words...Just junk. For example, a good record would be "Maryland" or "England" or "Truck"....And I seem to have plenty of these. I seem to have just as many bad records, however. Here are a few examples of bad records:
€ „O¥zNP
€ Žˆ'¢¯
DYD_D¥_D§
E¡•†‰‰†ˆ•¢…‰¡…˜

The query I have already only accepts records with vowels and multiple English letters...But As you can see that's a common characteristic in the junk I'm getting. Any ideas?

Thanks
 
I haven't tried this myself, but here are some ideas:
- Use a query that will filter out entries with any non-alpha characters. That is, only accept entries with A-Z or a-z.
- Perhaps there's a way to use the spell check feature in Access to catch nonsense. Like opening the table/query as a recordset, looping through each entry and doing a spell check on each entry.
 
What about making a lot of new fields using the Mid function. Then perhaps if you sort A to Z on each of the fields that will group the non alpha characters and then you could delete those rows.

In your example

€ „O¥zNP
€ Žˆ'¢¯
DYD_D¥_D§
E¡•†‰‰†ˆ•¢…‰¡…˜

The first two will picked on Field1. The third would be picked up on Field4 and the fourth would be picked up on Field3

You would need to use Len([CrazyField]) to give character count then use that in IIF so as to run the right Mid. Perhaps and easier way would be to us the Len() and break up the query into different queries and then apply Mid

Mike
 
your A-Z suggestion is pointing in the right direction. You can use regular expressions to match the data that you want to exclude.. get that record ID and null out the field.

Do a search on this message board for Regular Expressions. I have posted examples here (recently too).
 
Set up a new query:

SELECT tblWords.fldWord
FROM tblWords
WHERE (((fncHasBadChars([fldword]))=True));


And fncHasBadChars is:

Code:
Function fncHasBadChars(str As String) As Boolean

Dim I As Integer

[COLOR=DarkOrange]fncHasBadChars = False[/COLOR]
For I = 1 To Len(str)
    Select Case Asc(Mid(str, I, 1))
        Case 65 To 90, 97 To 122
            [COLOR=DarkOrange]'No op - fncHasBadChars already False[/COLOR]
        Case Else
            fncHasBadChars = True
            Exit For
    End Select
Next I
End Function
I don't know how fast that would be with 20k records, but the returned records can be deleted directly from the query result. IOW, run the above, select all the records from the returned record set, and if you're sure (very sure - make a backup), delete them.

(edit: I think that'll make the function quicker...)
 
Last edited:
Thanks, I'll set up these queries today and see how it works. Thank you all! This forum is the BEST place in the world.

Andrew
 
if you have 20,000 records, using regular expressions would be significantly more efficient than string manipulation functions.
 

Users who are viewing this thread

Back
Top Bottom