extracting data from text fields... complex? (1 Viewer)

wiklendt

i recommend chocolate
Local time
Today, 13:25
Joined
Mar 10, 2008
Messages
1,746
hi everyone,

i'm in the situation many will tsk tsk about. i was, in the past, assured i wouldn't have to deal with this data, but here i am, dealing with it.

i have a field with descriptive text. i want to extract some of that text for further analysis.

i can find various instances of text that we're interested in, but now my supervisor wants me to separate two types of the data, which are not really separable the way the data has been entered.

from the text field "notes", i want to determine a score of samples which have a pink isolate, a blue isolate, and a blue isolate with a pink halo.

so i'm creating a boolean for pink, blue, and blue PH for each sample tested.

i have been able to get the pink and pink halo all right by using:

Code:
Pink: IIf([notes] Like "*pink*",1,0)
and
Code:
Blue PH: IIf([notes] Like "*PH*",1,0)
but now i cannot determine if the field contains a blue isolate AND a blue with pink halo, or just a blue isolate....

example of the records for "notes":

notes
-----------------
pink / white (mixed)
tiny pink
pink, blue, (mixed)
pink, blue, white, (mixed)
Blue and white - brown agar (mixed)
Pink, blue, white - brown agar (mixed)
blue
blue / white / blue PH (mixed)
blue PH / white (mixed)
pink, blue, white, (mixed)

Blue and white - brown agar (mixed)
Pink, blue, white - brown agar (mixed)
Blue (mixed)
Pink and blue (mixed)
Blue

as you can see, if i search just for "*blue*", then i'll return a 1 even for those samples that only have a blue PH, not just plain blue.

my idea, if it's possible, is to COUNT how many instances there are in the string of PH and of Blue. then, if blue and PH count match, then there is only blue PH isolates from that sample. if there are more of blue than PH, then the sample had both types of isolates.

i just can't come up with a way (in a query) to COUNT how many times "blue" and "PH" appear in each record...

think about a bag of marbles - to give an analogy - we have data that describes all the different coloured marbles in our bag in one text field for each bag. initially, my supervisor just wanted to know which bags contained "pink" marbles and "blue" marbles, if any. NOW he's decided that the decorative bits matter, and wants to know which bags have "pink" marbles, "blue without a swirl" marbles, and then "blue with a swirl" marbles. thing is, "blue without a swirl" was only ever recorded as "blue"....

we have thousands of records with this field, so that's why i'm trying to find a query solution rather than re-typing or re-designing (there is no more data to go in, all the data is there, the study is complete, it's just data extraction and analysis now to go).
 

namliam

The Mailman - AWF VIP
Local time
Today, 05:25
Joined
Aug 11, 2003
Messages
11,695
Why not try and normalize this?
Is this data somewhat seperated by i.e. Linefeed/carage return?

I think you are looking for the colours inside one line... right? And no decernable fixed seperator...
If so, I can invision a function that "simply" looks for colours and goes to see what "additional" information in there, if you understand what I mean.
 

wiklendt

i recommend chocolate
Local time
Today, 13:25
Joined
Mar 10, 2008
Messages
1,746
thanks for you reply, namliam.

in the snippet of data i gave in my first post , each separate line is the all the 'data' in one field for each sample record. so, the field has a max of about 5 'kinds' of isolates 'recorded'....

you're right namliam, this should be normalised. i may have to just do it. it might even be quicker than finding a query solution (there are hundreds of records, of course - just to make it easy!).
 

highandwild

Registered User.
Local time
Today, 04:25
Joined
Oct 30, 2009
Messages
435
Hi wiklendt

I do like problems like this and I have an hour to spare so just use me and then I MUSt go to the beach.

Can you send me a spreadsheet with the data in or a 2003 database with the table in.

A totals query should do the trick all in one go.
 

wiklendt

i recommend chocolate
Local time
Today, 13:25
Joined
Mar 10, 2008
Messages
1,746
Hi wiklendt

I do like problems like this and I have an hour to spare so just use me and then I MUSt go to the beach.

Can you send me a spreadsheet with the data in or a 2003 database with the table in.

A totals query should do the trick all in one go.

sorry, only just saw your post. do your best! but don't stress out about it! if the file doesn't work (v2003 *.mdb) then just go to the beach! ;) but thank u :)
 

Attachments

  • TextExtraction.zip
    164.1 KB · Views: 100
Last edited:

highandwild

Registered User.
Local time
Today, 04:25
Joined
Oct 30, 2009
Messages
435
Hi again

Have a look at Query1 and see if this could be the answer.

You can add as many field as you like depending on the number of different character strings that you are looking for.

Base another query on this one and then either put a 1 or a 0 in the criteria depending on what you are looking for.

It takes a bit of setting up but you can copy the formula and just change the column heading and the text sought.
 

Attachments

  • TextExtraction.mdb
    836 KB · Views: 96

wiklendt

i recommend chocolate
Local time
Today, 13:25
Joined
Mar 10, 2008
Messages
1,746
Hi again

Have a look at Query1 and see if this could be the answer.

You can add as many field as you like depending on the number of different character strings that you are looking for.

Base another query on this one and then either put a 1 or a 0 in the criteria depending on what you are looking for.

It takes a bit of setting up but you can copy the formula and just change the column heading and the text sought.

hi, wow, that looks like a lead. i copied as instructed, however, it's 'counting' something other than the actual text?

e.g., for a field with the data:
pink, blue, white, (mixed)

the expression is returning "2" when searching the text "blue" - i checked if perhaps it was a 'select disctinct' issue (by changing the property "Unique Values" to "no") but that did not change the returned sum.

anyway, gives me a starting ground. thank you very much highandwild. enjoy the beach :)
 

highandwild

Registered User.
Local time
Today, 04:25
Joined
Oct 30, 2009
Messages
435
It is using the INSTR function to establish the position of the text that you are looking for in the [notes] field. If the function returns anything other than 0 then the calculation places a 1 in the column. You could put anything you like there insterad of the 1 but that is the way I do it.

shinycream: Sum(IIf(InStr(1,tblCHROMplates!Notes,"shiny cream",1)>0,1,0))

I'm not sure how you get a 2 as the result unless you have two records with the same ID which I don't.

pinkbluewhitemixed: Sum(IIf(InStr(1,[tblCHROMplates]![Notes],"pink, blue, white, (mixed)",1)>0,1,0))

just gives me 1's or 0's.
 

wiklendt

i recommend chocolate
Local time
Today, 13:25
Joined
Mar 10, 2008
Messages
1,746
and in this case:
blue, blue PH, (mixed)

searching "PH" returned "2" and
searching "blue" returned "2"...

edit: my bad, it returned "1" in both cases, but there are some records with a value of up to 5...

yes, i thought about the two records with same ID, but i believe i've controlled for that here...
 

wiklendt

i recommend chocolate
Local time
Today, 13:25
Joined
Mar 10, 2008
Messages
1,746
It is using the INSTR function to establish the position of the text that you are looking for in the [notes] field. If the function returns anything other than 0 then the calculation places a 1 in the column. You could put anything you like there insterad of the 1 but that is the way I do it.

shinycream: Sum(IIf(InStr(1,tblCHROMplates!Notes,"shiny cream",1)>0,1,0))

I'm not sure how you get a 2 as the result unless you have two records with the same ID which I don't.

pinkbluewhitemixed: Sum(IIf(InStr(1,[tblCHROMplates]![Notes],"pink, blue, white, (mixed)",1)>0,1,0))

just gives me 1's or 0's.


hm, that's interesting. in the DB i sent you, even if i use my own text search, it returns only 1's. back in my 'real' database, it returns higher numbers sometimes.

but that's not the issue either - i was able to get a 1/0 using

Code:
Blue PH: IIf([Notes] Like "*PH*",1,0)

however, i was unable to distinguish between when this ought to be counted as a blue, as opposed to a blue PH (which are two different types of isolate).

that is, if i use:
Code:
Blue: IIf([Notes] Like "*blue*",1,0)

then it flags all the "blue", sure, but it also flags those that are "blue PH".

i think your expression is doing a similar thing... it can't distinguish between a blue and blue PH... and sometimes one record will have both, and so both need to be included for analysis.

i fear i may just have to do what namliam suggested - normalise.... it's only about 6500 records ;)
 

highandwild

Registered User.
Local time
Today, 04:25
Joined
Oct 30, 2009
Messages
435
data of "blue, blue PH, (mixed)" needs to be split into three columns
so you text for the existance of "blue", "blue PH" and "(mixed)" quite seperately.

In the second query.

If you want to identify the records that have "blue" and not "blue PH" then you put a 1 in the "blue" column and 0 in the "blue PH" column.

If you want to identify the records that have "blue" and ALSO "blue PH" then you put a 1 in the "blue" column and 1 in the "blue PH" column.

If you want to identify the records that have neither "blue" OR "blue PH" then you put a 0 in the "blue" column and 0 in the "blue PH" column.
 

wiklendt

i recommend chocolate
Local time
Today, 13:25
Joined
Mar 10, 2008
Messages
1,746
ah, see, a record with the "notes" field like this:
pink, blue, blue PH, white, (mixed)
i was hoping to get Blue=2 and PH=1, then i can be sure that there is 1 PH and (2-1)=1 blue only.....

edit: and i know i can rely on such a count b/c i've scanned the data and it seems to be kinda consistent - it's just i don't know how to GET a count like that ;)
 
Last edited:

wiklendt

i recommend chocolate
Local time
Today, 13:25
Joined
Mar 10, 2008
Messages
1,746
data of "blue, blue PH, (mixed)" needs to be split into three columns
so you text for the existance of "blue", "blue PH" and "(mixed)" quite seperately.

In the second query.

If you want to identify the records that have "blue" and not "blue PH" then you put a 1 in the "blue" column and 0 in the "blue PH" column.

If you want to identify the records that have "blue" and ALSO "blue PH" then you put a 1 in the "blue" column and 1 in the "blue PH" column.

If you want to identify the records that have neither "blue" OR "blue PH" then you put a 0 in the "blue" column and 0 in the "blue PH" column.

yup, that's exactly what i'm looking to get, except, as soon as you look for the text "blue", it will count "blue PH" also (because "blue PH" includes the string "blue")....

what you describe above can only be done manually so far as i can tell, unless i can (a) normalise or (b) get a setup working as i've described in post #12...

it's the "putting" of the 1 in the correct columns that needs the fine-tuning...
 

highandwild

Registered User.
Local time
Today, 04:25
Joined
Oct 30, 2009
Messages
435
get a setup working as i've described in post #12...

I can do that to but just hang on a sec......

The beach has been there for 20000 years so it's not going anywhere.
 

wiklendt

i recommend chocolate
Local time
Today, 13:25
Joined
Mar 10, 2008
Messages
1,746
get a setup working as i've described in post #12...

I can do that to but just hang on a sec......

The beach has been there for 20000 years so it's not going anywhere.

...hm you sound like me and the gym.... "i'll go tomorrow" LOL
 

wiklendt

i recommend chocolate
Local time
Today, 13:25
Joined
Mar 10, 2008
Messages
1,746
(those pink, blue, blue PH notes are further to the bottom of the table, by the way.... "shiny dark blue" was a previous data entry person being a little artistic, we think....)
 

wiklendt

i recommend chocolate
Local time
Today, 13:25
Joined
Mar 10, 2008
Messages
1,746
highandwild, i'm afraid it's waaaaay past my bedtime (darned database!), i'm really in need of some zzzz's.... could be one reason i haven't figured this out out yet LOL.

don't stay on this too long, you need your R&R too....! (but i do know very well the addiction of a challenge)

i'll check in again in the morning.
 

highandwild

Registered User.
Local time
Today, 04:25
Joined
Oct 30, 2009
Messages
435
Here you go...

Slight problems but got around them.

Amend the qry_CreateAnalysisTable according to what you are looking for.

Example:

Blue: CharCount(tblCHROMplates!Notes,"BLUE",False)

Copy this line and amend accordingly.

The CharCount user defined function is in the only module and checks for the occurences of the 2nd parameter in the 1st paramenter. The FALSE just signifies not case sensitive.

A combination of these two different techniques I have shown you should do the trick.

You have missing data in some of your [notes].

You data really needs to be normalised but that is a bigger job.

If you ever need help doing that just send me a msg.

Here is the beach: http://www.soton.ac.uk/~imw/chesil.htm that I'm off to now. 18 miles and 18000000000 pebbles. Small tiny ones at one end and big heavy ones at the other and still unexplained.

You owe me a big Ice Cream.
 

Attachments

  • TextExtraction.mdb
    1.1 MB · Views: 92

namliam

The Mailman - AWF VIP
Local time
Today, 05:25
Joined
Aug 11, 2003
Messages
11,695
Yes, you can (try) and stack the IIFs and counts/sums what not...

Ultimately doing this in a more normalized version is going to prove MUCH more helpfull

Possibly agree with / enforce upon the user community to use a 'fixed' seperator, comma or / or .....
 

wiklendt

i recommend chocolate
Local time
Today, 13:25
Joined
Mar 10, 2008
Messages
1,746
Yes, you can (try) and stack the IIFs and counts/sums what not...

Ultimately doing this in a more normalized version is going to prove MUCH more helpfull

Possibly agree with / enforce upon the user community to use a 'fixed' seperator, comma or / or .....

*whines* it's not my fault!

...well, ok, yes it is. but if it weren't for me, my supervisor would have make ALL the bloody fields in this database as free-text, so we've been having an ok time with analysis up till now...

as for normalising - at the moment that's still an option. however, this DB is not part of an ongoing clinical or diagnostic process, it's part of an almost-done 3 year research grant - so the motivation at the moment (for my supervisor) is to get 'results'... i forsee this DB being used much, after the 3 years is up, because of the wealth of data it holds, so i may just have to normalise it anyway for ease of future use.

while my supervisor is really good and often allows me to do what i think is best (esp with databases) i ultimately have to "leave it as it is" if he says so.

- take heart, namliam, i'm on your side, it's just a matter of what's quickest for the moment... ;)
 

Users who are viewing this thread

Top Bottom