identify duplicate strings: a numeric string within another

90405 · Jul 21, 2004

identify duplicate strings: a numeric string within another

Is there a method to identify a numeric string within another numeric string and then express "duplicate"

As exmple data:

097102105111114100105112101108108101
could appear in another string

097102105111114100105112101108108101100105114101103105110097116111115111110105097

this would give me an indication that they could possibly be duplicates

all values are contained within same rcolumn, if duplicate appears then print "duplicate" in next column

i.e.
097102105111114100105112101108108101
097102105111114100105112101108108101
097102105111114100105112101108108101100105114101103105110097116111115111110105097
097102105111114100105112101108108101100105114101103105110097116111115111110105097

Dan_T · Jul 21, 2004

have you tried something as simple as the instr() function?

ReAn · Jul 21, 2004

Problem with this is that to do this you'd have to check EACH row for an InString for each row..

aka

#rows * #rows

so if you have 300 rows, your looking at 90000 InStr functions, now strings are typically mutable (or is the term immutable) anyways, it means when you manipulate strings it actually allocates memory for another string, makes the changes and puts it into the new location and then deletes the old one. So, string manipulation is SLOW, and so string comparison algorithms are slow too because of many factors. In the whole reality of things it's small to do one, but your not talking 1 ms your talking like 100ms-200ms

So for 10 records it takes 1-2 seconds so you're looking at 9000-18000 seconds which is 150-300 minutes, or 2.5-5 hours.

Now, if you know exactly how long duplicates will be, then it's much easyer, but when you're talking numbers that big... it's a bitch @ a half.

Another thing to considder, if the duplications are always at the beginning or end you can make a couple new fields:

Left([textfield],5)
Left([textfield],10)
Left([textfield],15)

and say that if all three of those fields are the same, then there is a good chance that they have duplicates inside and check only those... and so on .. the numbers can be changed.. you get the picture.

The_Doc_Man · Jul 21, 2004

I think we need to clarify terms. Let's see if I got this right.

1. Your number strings are variable-length
2. You want to cross-correlate whether two number strings have the relationship such that one string matches or starts another string.
3. You want to know whether a duplicate appears. It is not clear from your question whether you want both records to show that they are duplicates.

Here are two key questions...

A. Can the match ever be based on a fully contained string that DOESN'T start the other string in the comparison?

B. Can the match be declared between subsets of two longer strings without there being a matching string with the subset string (that matched the two longer strings) as its EXACT string?

If the answer A or B is YES then you are right in the number of comparisons you will have to accomplish. And it WILL INDEED take hours. If you are looking for general, embedded matches starting with random offsets, you have no easy solution that occurs to me.

BUT...

If the match can ONLY occur based on the beginning of the strings and only occurs on exact-match of one string to (another string or a sub-string thereof - but not substrings of BOTH), then you need to SORT the dataset with an appropriate query. Now, here is how I would do it, 'cause I'm happy with VBA.

Create two recordsets that open your sorted query in an updatable mode. You might have to set NO LOCKS on the queries. Determine (from either query) the number of records in the recordset. Reset both queries to the first record in the recordset. NOW step one recordset to the next record.

Compare the strings from the two recordsets. If the first recordset's number matches the second recordset's number, set a yes/no variable in the first recordset's current record and update it. (Option: Set both flags and update both? But then you would never know which was the duplicate and which was the original... and this is likely to cause locking problems.)

Then step BOTH recordsets to the next record and loop. STOP the loop when the SECOND recordset hits EOF. You are done! And this is a one-pass algorithm.

From your comments, sadly, it appears that this MIGHT not be usable. But then again, it might... Your comments about using LEFT suggest that it is only string starts that make a difference. So think about it.

I believe that you can only sort on a limited text string. But if these strings are ALWAYS numbers, you can "cheat" on the sort operation a little.
Break the strings into subsets in order to convert them from string format to LONG format, perhaps 8 bytes at a time, which converts to 4 bytes when treated as an integer. Sort by the converted numbers, not the original strings. Be sure to use the NZ function when the subset chunk is empty 'cause the original string was kind of short.

identify duplicate strings: a numeric string within another

90405

Registered User.

Dan_T

Registered User.

ReAn

Dangerous Programmer...

The_Doc_Man

Immoderate Moderator

Similar threads

Users who are viewing this thread