convert or transform a text string into an associated "unique" numeric code (1 Viewer)

ReAn

Dangerous Programmer...
Local time
Today, 16:42
Joined
Jun 25, 2004
Messages
250
90405 said:
for example:

a company name such as:

ApisSRL

would be converted into a numeric code ????

and I would extract (select) digits from this numeric code, derived from every odd positioned digit

and then recombine it


0-6-5-1-1-2-1-0-5-1-15083082076
I get from above odd positions

deriving a unique value
0511510380

I want to use maximum 8 - 16 digits

Okay, ive replicated what you wanted here, only problem is that long strings will produce long id's, so max of 16 won't really work. Check it out:

Code:
Function MakeCode(myString As String) As Double
Dim MyCode As String
Dim MyCode2 As String
Dim i, j As Integer

MyCode = GenKeyCode(myString)
j = 0
For i = 1 To Len(MyCode) Step 2
  MyCode2 = MyCode2 & Mid(MyCode, i, 1)
  j = j + 1
Next
MakeCode = CDbl(MyCode2)
End Function

Function GenKeyCode(myString As String) As String
Dim i As Integer
For i = 1 To Len(myString)
 GenKeyCode = GenKeyCode + FormatAsc(Mid(myString, i, 1))
Next
End Function

Function FormatAsc(myStr As String) As String
Dim i As Integer
Dim s As String
i = Asc(myStr)
s = CStr(i)
If Len(s) = 1 Then
 s = "00" & s
ElseIf Len(s) = 2 Then
 s = "0" & s
End If

FormatAsc = s
End Function

I've tested it against your previous example and it works perfect.
 
L

Larry Schwartz

Guest
I'm nt a database person, but ....

What you are trying to do is to "hash" a text string into a unique value. This is a technique used very frequently to store a large number of records at unique locations in a random access file. There are lots of hashing algorithms (try Google or your local college's computer library) with varying amounts of acceptable "collisions".

In typical use, the algorithm is set up to produce a (hopefully) even spread of the hashed values over an address space (the real, random, record numbers in the file). The hashed value/record number is used to find an open space in the disk file and put the record's data there. An index is then maintained; it contains an entry for each hashed value/disk address pair. To fnd the data, hash the source argument and look up the result in the index. If the value space for the hashed values exactly matches the address space for the file, you can get away without the index.

Your routine may need to be set up to allow collisions, i.e., the situation where the hashed value is non-unique. One such scheme just looks for the next unused record space in the file.

What you have not said is whether you have to be able to take the hashed value and work it back into the original value. There are some mechanical solutions to this problem.

Larry
 

90405

Registered User.
Local time
Tomorrow, 00:42
Joined
Jul 4, 2004
Messages
21
I guess I was trying to improvise a very simple empirical "hash" method, before I moved onto more complex methods.

I was moving in one direction and had one intention, but I guess in the cross-communication and maybe some misinterpretation, some unexpected solutions were derived in a creative sort of way - pleasing to see that this process has also lead to a challenging dilemma and other potential benefical off-shoots for others.

For your interest, below is something further on "hashing".

I'm still trying to understand a best method to resolve my dilemma.

Thank you for your exchange of ideas and support ...!


Vic


***************************

"Hash tables are one of the most efficient means of searching for data. A hash table establishes a mapping between all of the possible keys and the position where the data associated with those keys is stored.

Direct Addressing is where the hashing algorithm guarantees that no two keys will generate the exact same hash value. This is the ideal situation but is impractical in most situations. Instead most hash functions map multiple keys to the same has value and provide indirect addressing mechanisms to handle the situation where two keys map to the same value. The situation where this occurs is called a collision. Two keys that hash to the same value like this are called synonyms.

A hash function attempts to distribute keys in a random manner so as to minimize the need to resolve collisions. One of the simplest hashing algorithms is division remainder where you divide the key value by some fixed number and take the remainder as the key location (also known as modulo arithmetic). For example let's say we have space for one thousand entries, so we divide the key of each value by 1000 and take the remainder so the entry with key 5078950770 will occupy position 770 in our array (5078950770 modulo 1000 is 770). Of course with this particular example every 1000th key will map to the same location."


citing:
http://www.felgall.com/hash.htm
 

90405

Registered User.
Local time
Tomorrow, 00:42
Joined
Jul 4, 2004
Messages
21
Just something further for your interesdt

****************
hashing



The term you selected is being presented by searchDatabase.com, a TechTarget site for Database professionals.


Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.
As a simple example of the using of hashing in databases, a group of people could be arranged in a database like this:

Abernathy, Sara
Epperdingle, Roscoe
Moore, Wilfred
Smith, David
(and many more sorted into alphabetical order)

Each of these names would be the key in the database for that person's data. A database search mechanism would first have to start looking character-by-character across the name for matches until it found the match (or ruled the other entries out). But if each of the names were hashed, it might be possible (depending on the number of names in the database) to generate a unique four-digit key for each name. For example:
7864 Abernathy, Sara
9802 Epperdingle, Roscoe
1990 Moore, Wilfred
8822 Smith, David
(and so forth)

A search for any name would first consist of computing the hash value (using the same hash function used to store the item) and then comparing for a match using that value. It would, in general, be much faster to find a match across four digits, each having only 10 possibilities, than across an unpredictable value length where each character had 26 possibilities.
The hashing algorithm is called the hash function (and probably the term is derived from the idea that the resulting hash value can be thought of as a "mixed up" version of the represented value). In addition to faster data retrieval, hashing is also used to encrypt and decrypt digital signatures (used to authenticate message senders and receivers). The digital signature is transformed with the hash function and then both the hashed value (known as a message-digest) and the signature are sent in separate transmissions to the receiver. Using the same hash function as the sender, the receiver derives a message-digest from the signature and compares it with the message-digest it also received. They should be the same.

The hash function is used to index the original value or key and then used later each time the data associated with the value or key is to be retrieved. Thus, hashing is always a one-way operation. There's no need to "reverse engineer" the hash function by analyzing the hashed values. In fact, the ideal hash function can't be derived by such analysis. A good hash function also should not produce the same hash value from two different inputs. If it does, this is known as a collision. A hash function that offers an extremely low risk of collision may be considered acceptable.

Here are some relatively simple hash functions that have been used:

The division-remainder method: The size of the number of items in the table is estimated. That number is then used as a divisor into each original value or key to extract a quotient and a remainder. The remainder is the hashed value. (Since this method is liable to produce a number of collisions, any search mechanism would have to be able to recognize a collision and offer an alternate search mechanism.)
Folding: This method divides the original value (digits in this case) into several parts, adds the parts together, and then uses the last four digits (or some other arbitrary number of digits that will work ) as the hashed value or key.
Radix transformation: Where the value or key is digital, the number base (or radix) can be changed resulting in a different sequence of digits. (For example, a decimal numbered key could be transformed into a hexadecimal numbered key.) High-order digits could be discarded to fit a hash value of uniform length.
Digit rearrangement: This is simply taking part of the original value or key such as digits in positions 3 through 6, reversing their order, and then using that sequence of digits as the hash value or key.

citing:
http://searchdatabase.techtarget.com/sDefinition/0,,sid13_gci212230,00.html
 

KenHigg

Registered User
Local time
Today, 18:42
Joined
Jun 9, 2004
Messages
13,327
Interesting concepts- Thanks for your '2 cents' Larry...
 

90405

Registered User.
Local time
Tomorrow, 00:42
Joined
Jul 4, 2004
Messages
21
I am still a bit of novice with Access, and what you stated is correct, but I just wanted to convert the alphanumeric data to numeric code since I believe it also facilitates indexing/ processing.

Do you have any suggestions as I can remove duplicates in a databse?

both exact and near exact

3 P Rappresentanza Commercio Estero (S.R.L.)
3 P Rappresentanza Commercio Estero (S.R.L.)
3 P Rappresentanze Commercio Estero Srl

Apis S.R.L.
Apis Srl
Apis Srl
Apis Srl
Apis Srl

plus the same company can operate in different product sectors and in different provinces with different telephone numbers.


I can send you the raw data contained in Excel for your examination.
thanks

Vic
 

The_Doc_Man

Immoderate Moderator
Staff member
Local time
Today, 17:42
Joined
Feb 28, 2001
Messages
27,353
OK, here's a thought from an old programmer who has seen this kind of stuff before.

First, you are doing a hash. But other posters have told you that, and you knew it already. Second, you are ALSO doing something called a "characteristic" - which I haven't seen in the other posts. The quoted use is a MATHEMATICAL term, not an algorithmic term. You might wish to look up the term in mathematical circles. But, THIRD, you need to know about a property of all such methods. It is called the "DISTANCE" and tells you something about the probability of two different inputs colliding.

For instance, for the CRC-16 method of computing a polynomial-based checksum, if the input string is 256 bytes or less, CRC-16 has one chance in approximately 4 million of "colliding" - presuming that the bytes are ASCII and that you have eliminated non-printing characters and such. There are some other assumptions hidden in that, too. Trust me, the math to figure the distance is truly ugly. In fact, it is fugly.

You can try other CRC formulae, or the CCITT method, or several other things that produce checksums. All of them are similar to implement but different in mathematical properties. For that matter, if you have a network programming package you might already HAVE a CRC routine in one of the .DLL files as a callable module. See if your network package includes data on WinSock programming and the tools it provides. (Assuming you have such a package.)

Whatever method you select, see if you can find out its distance function because you are going to need to know that as part of your selection process. I'll offer the thought that a CRC-16 checksum is at least one possible choice. Good "distance," the checksum can be truncated to 32 bits easily (giving you up to 11 decimal digits for a 32-bit number unsigned), and the computed characteristic is sensitive to the order of the input string. I.e. ABC and CBA produce two different characteristics. I'll leave it to you to find web-based references on the topic and various formulae to try.
 

90405

Registered User.
Local time
Tomorrow, 00:42
Joined
Jul 4, 2004
Messages
21
This simplified everything ...!

After I take two valium pills and get my mental brain database which is a Pentium (at the momemnt working overclocked...!) to analyse all this.
 
L

Larry Schwartz

Guest
For a unique 4-byte integer, you could use the system clock. For PCs, it started at 0 and increases at approximately 18.2065 times a second. The value 0 was assigned to the morning of Jan. 1, 1970 at midnight.

The counter can count up to about 4 billion before it will overflow. You could use this value if an 18th of a second is adequate resolution. Keep track of the last clock value you used. If it is not the same as the new one, append one or two zeroes. If it is the same, increment the suffix to 01 see if that was the same as the last number you generated; if the same, try 02 ...

The clock value is available from the operating systems, Word, Excel, VBA, etc., but the operating system clock access functions return a new value only once a second. The hardware clock is not very accurate. In actual operation, they gain or lose about 5 to 15 seconds per day, with 10 seconds per day being typical. Although the hardware clock usually outperforms the software clock, its performance pales in comparison to even a low-cost wristwatch. HOWEVER, in this application, we DO NOT care about acuracy, only uniqueness.

(The remainder of this note is extracted from a Novel web page:
http://developer.novell.com/research/appnotes/1994/january/03/index.htm)

The real time hardware clock can be addressed by placing a command in I/O port 70h, then reading back the result from I/O port 71h. The simple BASIC program in Appendix A shows how it is possible to read the current time from the RTC. The PC BIOS and DOS define interrupt calls to access the time and date information held by the RTC. Most high level languages also provide functions to get and set the clock from within a program.

Appendix A: Example Programs
Example BASIC program to read the RTC:

10 REM RTCLOCK.BAS - For QBASIC or QB
20 '
30 CLS
40 DO
50 OUT &H70, 0
60 RTSECONDS = INP(&H71)
70 OUT &H70, 2
80 RTMINUTES = INP(&H71)
90 OUT &H70, 4
100 RTHOURS = INP(&H71)
110 LOCATE 1, 1
120 PRINT STRING$(20, " ")
130 LOCATE 1, 1
100 PRINT "Real Time Clock now - ";
110 PRINT HEX$(RTHOURS); ":";
120 PRINT HEX$(RTMINUTES); ":";
130 PRINT HEX$(RTSECONDS)
140 LOOP UNTIL INKEY$ <> ""
150 END

Example BASIC program demonstrating access to the NIST (National Institute of Standards and Technology, formerly NBS) time source:

10 REM NISTTEST.BAS - For QBASIC or QB
15 CLS
20 COM(2) ON
30 ON COM(2) GOSUB 200
40 T$ = ""
50 OPEN "COM2:1200,N,8,1,BIN,RB1024" FOR RANDOM AS 1
60 PRINT #1, "ATZ" + CHR$(13)
70 T$ = ""
80 T = TIMER
90 DO
100 LOOP UNTIL TIMER > T + 5
105 REM Change the line below to include international access code if required
110 PRINT #1, "AT DT 3034944774" + CHR$(13)
120 T = TIMER
130 DO
140 LOOP UNTIL TIMER > T + 60
150 PRINT
160 PRINT T$
170 CLOSE 1
180 END
190 '
200 L$ = INPUT$(1, #1)
210 PRINT L$;
220 RETURN
 

Users who are viewing this thread

Top Bottom