Clean first names

jpaokx

Registered User.
Local time
Today, 10:52
Joined
Sep 23, 2013
Messages
37
Hello,

I am trying to clean first names but there is such complexity that I cannot do it in a single query.

j. b john
john
john j
j. John
john
j john,
john, peter
j, john
john
j,john


In all different examples, I want the output to give me "John". So, I need a query to remove commas, ignore initials, dots, remove spaces and leading spaces and make proper case.
All these examples are under the first name field (so, there is no any surname).

Any idea how?
I have applied various ways to fix them but I am stuck. Nothing can give me a proper cleaning and,also, I always need to do 3-4 extra queries to fix this (depending on the outcome of the previous query).
 
Last edited:
this is difficult to do in one pass due to the number of potential variations

would be useful to know your various ways, no point in suggesting them again but you could use nested if, trim, replace, instr and instrrev to get a lot out the way

- I would replace all non alpha characters with a single common character as a starting point - say make them all a space. In the query window for a select query...

fld1:replace(replace(firstname,"."," "),"," ") <standardise non alpha characters to space

fld2:iif(instr(fld1," ")=1,trim(mid(fld1,2)),fld1) <this will resolve single initials

fld3:iif(instr(fld2," ")=1,trim(mid(fld2,2)),fld2) <repeat for second initials
to this point that will have resolved all your examples except 'john j' and 'john, peter' so look at reverse view

fld4:iif(instr(strreverse(fld3)," ")=1,trim(left(fld3,len(fld3)-2)),fld3) <sorts out john j

Assuming you want the john, rather than peter the final manipulation would be
fld5:iif(instr(fld4)," ")<>0,left(fld3,instr(fld4," "))-1),fld4)

put these as columns in a select query and save (note you need to show fld1, fld2 etc to be able to reference them in fld2, fld3 etc). Then use an update query to update the firstname field to fld5.

You may have more issues, but you will only find them by inspection but the above should resolve most of them
 
This issue isn't going to be solved by a query, its going to take VBA. You pass this function the string, it performs your logical steps and returns to you the clean name. However...

Can you logically state how this cleaning should be done? Don't worry about code yet. Just write out a list of steps a person should do to start with a string of text and to end up with a clean name.

Then once you have, you start with some test cases--take your strings from your first post, apply your instructions for them and see if they produce the correct output for each string--editing your instructions as needed to accomodate each new case.

I don't know that you can. My advice is to start with these 4 tests:

john j
j john
john jay
jay john

Once you have that fixed explain this one case that you haven't thought of:

Mary Ann Louise Smith

I know that chick and her first name is Mary Ann, her middle name is Louise and her last name is Smith. What happens when that meets your system?

Then move on to La'tanya or Je'taime.
 
jpaokx,

FWIW many people have spent careers trying to cleanse data. It is complex and more often than people will admit --it requires HUMAN EYEBALLS at some point. Rarely is this totally automated.

A few questions for you:

How certain are you that these values should all be John
j. b johns
john
john j
j. John
john
j john,
john, peter
j, john
john
j,john

What/why does John, Peter ==>John??
Similarly, for j.b. johns?

Can you show readers some more sample data?

You look for patterns, and create an algorithm to handle each pattern, and repeat.

Further to plog's examples.

Ali van't Goor
Oscar de la Hoya
Muhammid azul Muhammed
 
Hi guys,

I know...these are some of the main challenges in data cleansing and thanks for your replies!
Clearly, most of the data are single first names like "John" or "Anna". But, there are some records (about 10k) that are more complex.

The examples that I gave you are the main ones that I need to deal with.
My problem is that, although I have functions to solve some of them, I need to incorporate them into one function (query).

My challenge is with the IIf function. When I give my condition, then, if this is met, then it doesn't move to the next condition and, that's why I need to apply about 10 queries to tackle all such cases.

For example,
If I want to clean this "j john,", I may use iff(instr(trim(firstname),",")>0,replace(trim(firstname),",",""), UCase(mid(firstname,1,1)). It will clean the commas, but it will proper case anything that doesnt have commas.

It would theoreticall work for me if I could write something like iff(instr(trim(firstname),",")>0,replace(trim(firstname),",","") AND UCase(mid(firstname,1,1), if(....)), but it doesn't work like that, right?
If I knew a way to apply such multiple updates with an AND functionality, I could work on this by myself.

In terms of my steps, the idea is to keep the first part of the first name. So, I need to remove any second names (e.g. from "John, Peter" will be "John), remove initials from the string and, so, find any string after a word that contains more than one character when there is no commas or dots (or two characters when there is a comma or a dot) but keep the following string that contains more than 1 character and, finally, apply proper case.

I know guys that this requires human eyeballs and I am not saying that this can be solved easily. I am trying to think all possible way to reduce this amount of bad records :)
 
jpaokx,

What/why does John, Peter ==>John??
Similarly, for j.b. johns?

This is just an example. The rules are to keep only one of the two first names. So, from John, Peter, it will be John.
Regarding the second one, it was a typo (so, not "johns" but "john"). I fixed this in my original post.
 
Once you have that fixed explain this one case that you haven't thought of:

Mary Ann Louise Smith

The business rules for this case are to keep the "Mary" unless the data show "Mary-Ann". I guess my queries don't affect this.
 
so in my suggestion - post #2 - modify to

fld5:strConv(iif(instr(fld4)," ")<>0,left(fld3,instr(fld4," "))-1),fld4),3)
 
there was a couple of typos in my suggestion

try this code - change table name and field name to match your data

Code:
SELECT tblNames.firstname, 
 Replace(Replace([firstname],"."," "),","," ") AS fld1, 
 IIf(InStr([fld1]," ")=2,Trim(Mid([fld1],2)),[fld1]) AS fld2, 
 IIf(InStr([fld2]," ")=2,Trim(Mid([fld2],2)),[fld2]) AS fld3, 
 IIf(InStr(StrReverse([fld3])," ")=1,Trim(Left([fld3],Len([fld3])-2)),[fld3]) AS fld4, 
 StrConv(IIf(InStr([fld4]," ")<>0,Left([fld4],InStr([fld4]," ")-1),[fld4]),3) AS fld5
FROM tblNames;
produces this result
attachment.php
 

Attachments

  • Capture.JPG
    Capture.JPG
    31.3 KB · Views: 265
CJ_London,

The queries you gave me worked to a large extent. Thanks a lot. I also made some improvements based on some extra scenarios.
There are 1-2 things that I am still struggling.
The first one is how would you modify the query if you have a name, let say, "Jonathan J. Peter". How can I do this as "Jonathan Peter"? Again, this is not entirely representative example and I want to apply this rule for certain cases (not as a generic rule).

The other one is that, when I run the queries, I get an #error for null first names. Any idea how to adjust the query so that this can give me null? I tried to use nz() but it didn't work.

Thanks in advance!
 
you'll need to add another step - work it out logically

firstname Jonathan J. Peter
fld1 Jonathan J Peter (two spaces)
fld1a: IIf(InStr([fld1],"xx"), left(fld1,instr(fld1,"xx")-1) & mid(fld1,instrrev(fld1,"x"),"")

you'll need to check the syntax and note I have changed spaces for x's otherwise they will be lost when I post this responses.

this might join the flow or stop at fld1

assuming the latter you would have another field after fld5

fld6: iif(fld1a<>"",fld1a,fld5)

not sure what you mean by nulls - if you mean the firstname is blank then use criteria

...
where nz(firstname)<>""
 

Users who are viewing this thread

Back
Top Bottom