Just wondering if there is a better way to find data

CJ_London

At the moment, if you want to find some data, you write a query. To do so you need to know the table relationships and which field holds the data you are looking for.

But this forum and other forums have a search option, and search engines such as Google and Bing will take a phrase and return the hits. The forum search engine (and I appreciate probably only a couple of tables require searching), if it used a SQL query, would have to make use of ‘Like *…*’ or equivalent, which precludes the use of indexing, so I would expect it to be slow - but it isn't. I understand search engines apply some sort of key which enables fast retrieval of data.

So I'm just wondering whether something similar can be done with Access? I've been doing a bit of reading up on binary indexing; I understand the principle but am struggling to see how to apply it in the real world.

I've used hash fields, using an algorithm to combine all the values in a record so it is easy to identify complete duplicates - primarily when importing data - and I'm thinking one could do something similar to create a binary field, probably in a separate table, to identify a value and where it can be found.
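Something along these lines is what I mean by a hash field (a rough sketch only - the checksum is deliberately simple and not cryptographic, and Nulls are just treated as empty strings):

Code:
' Concatenates every field in the current record and reduces it to a single
' Long checksum - enough to flag likely duplicates, nothing more
Public Function RecordHash(rs As DAO.Recordset) As Long
    Dim fld As DAO.Field
    Dim s As String
    Dim i As Long
    Dim h As Long

    ' Build one string from all field values (Nulls become empty strings)
    For Each fld In rs.Fields
        s = s & "|" & Nz(fld.Value, "")
    Next fld

    ' Simple rolling checksum, kept within 24 bits to avoid overflow
    For i = 1 To Len(s)
        h = (h And &HFFFFFF) * 31 + (AscW(Mid$(s, i, 1)) And &HFFFF&)
    Next i
    RecordHash = h
End Function

The result gets stored in an indexed field, so duplicates can be found with a simple join rather than comparing every column.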

Quite likely this would be a big table - one field, but a record for each record in every table - and perhaps some text field values would need to be split into individual words.

This is just me musing on a Saturday evening, but if anyone has gone down this path I would be interested to know what bumps along the road they encountered, or what alternative methods they considered.
 

Pat Hartman

I'm not sure anyone with properly normalized tables would ever need to do this, but you could. Indexes take space, though, so you probably wouldn't use the technique on a large application. Also, there would be issues with memo fields, so unless you are willing to live with just the first 255 characters, those are probably out of the question.

Create a table with five fields:
FieldID
TableName
FieldName
FieldValue
RowID

Populate it. If you have SQL Server, you can create full-text indexes, which can index individual words; with Access, no index will help. Then search FieldValue using Like or InStr() - I don't know which would be faster; you can play around with it. Then, of course, showing the data in context is a different problem. I guess you would use the TableName and RowID to bring up a form with the search field in context and use the FieldName to highlight the field that got the hit. This would be pretty clunky, though.
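The population step might look something like this (a rough sketch only - it assumes a saved table named tblSearchIndex with the five fields above, FieldID as an autonumber, and that the first field of every source table is its primary key):

Code:
' Rebuilds the search index table from every text field in every user table.
' tblSearchIndex and the PK-is-first-field assumption are illustrative only.
Public Sub BuildSearchIndex()
    Dim db As DAO.Database
    Dim tdf As DAO.TableDef
    Dim rsIn As DAO.Recordset
    Dim rsOut As DAO.Recordset
    Dim fld As DAO.Field

    Set db = CurrentDb
    db.Execute "DELETE * FROM tblSearchIndex", dbFailOnError
    Set rsOut = db.OpenRecordset("tblSearchIndex", dbOpenDynaset)

    For Each tdf In db.TableDefs
        If Left$(tdf.Name, 4) <> "MSys" And tdf.Name <> "tblSearchIndex" Then
            Set rsIn = db.OpenRecordset(tdf.Name, dbOpenSnapshot)
            Do Until rsIn.EOF
                For Each fld In rsIn.Fields
                    If fld.Type = dbText And Not IsNull(fld.Value) Then
                        rsOut.AddNew
                        rsOut!TableName = tdf.Name
                        rsOut!FieldName = fld.Name
                        rsOut!FieldValue = fld.Value
                        rsOut!RowID = rsIn.Fields(0).Value  ' assumes PK is the first field
                        rsOut.Update
                    End If
                Next fld
                rsIn.MoveNext
            Loop
            rsIn.Close
        End If
    Next tdf
    rsOut.Close
End Sub

Searching is then a single query against that one table, e.g. SELECT TableName, FieldName, RowID FROM tblSearchIndex WHERE FieldValue Like "*" & [Search for] & "*".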

I had to work with a database recently where a similar search was done. There were six text fields involved. The user entered a partial string and the search code examined each of the six fields for the search string. As it happens, the table was 600,000 rows, so the search was painfully slow given that the process had to work RBAR (row by agonizing row). I never understood the thought process. Didn't the user know he was looking for a last name or an address? Why search all the columns if you know what type of data you have? Some of the columns could have been combos, and those searches would not have used Like, so they would have used indexes and been very fast.

Determine how frequently you need to update the index table. I'm pretty sure you don't want to do it on the fly, so the user has to be prepared for out-of-date results.
 

CJ_London

Probably not, but I'm more interested in the principle at the moment. How do Google and Bing return data so quickly?

Yes - I actually experimented with that technique years ago when I was using UCSD Pascal, which supported a variant field type.

I guess what I'm speculating about is whether data can be compressed and transformed into something that has a common basis for fast retrieval.

A bit simplistic, but as an example: a Long data type takes 4 bytes, but that is only required for the largest numbers - a value of 1 needs just 1 bit, so there are 31 bits that could be saved. An Integer takes 2 bytes, so any Long value that is within the Integer range 'wastes' at least 2 bytes. An ASCII character takes 1 byte (2 for Unicode), but the bulk of Latin characters are in the range 0-127 so only need a maximum of 7 bits. VBA strings use Unicode, so that is a saving of over 50% for most characters.
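As a crude illustration of the saving (nothing clever - this just drops the unused high-order bytes of a positive Long):

Code:
' Packs a positive Long into the fewest whole bytes needed - illustration only
Public Function MinBytes(ByVal v As Long) As Byte()
    Dim b() As Byte
    Dim n As Long
    Do
        n = n + 1
        ReDim Preserve b(1 To n)
        b(n) = v And &HFF       ' keep the low-order byte
        v = v \ 256             ' shift right by one byte
    Loop While v > 0
    MinBytes = b
End Function

MinBytes(1) comes back as a single byte, MinBytes(70000) as three - against the four bytes a Long always occupies.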

I guess where I'm going with this is an algorithm that (perhaps) converts a record's values into a hash value (basically encryption?) and then inspects the bytes and removes the unused space - or just does the same without hashing, which might make searching easier. It would need to be combined with a suitable identifier/pointer to go and fetch the data. The search phrase would be treated the same way to find matches.

How to provide context is another matter, but in theory a user would not need any knowledge of the underlying data structure to return information.

Further, if these values condense to something small enough, they could perhaps be loaded into an array or dictionary on load - but I appreciate Access/VBA may not be the vehicle for doing this, as I would anticipate some requirement for memory management.
 

The_Doc_Man

The act of compression would, I think, tend to scramble the words in a non-linear manner, because most compression methods will merge bits from the next word into prior words. The comparison would therefore also be in some order other than alphabetic, at least after the first word. Returning one or two decent matches quickly is no biggie, but returning thousands of hits from a no-longer-alphabetized reference set seems difficult to imagine - particularly since the hardware is still executing instructions in 8-bit, 16-bit, or 32-bit gulps.
 

Galaxiom

The forum search engine (and I appreciate probably only a couple of tables require searching), if it used a SQL query, would have to make use of ‘Like *…*’ or equivalent, which precludes the use of indexing, so I would expect it to be slow - but it isn't. I understand search engines apply some sort of key which enables fast retrieval of data.
Search engines use a type of Full Text Index. The pages and documents are crawled, and the location of every significant word is recorded in an index that queries can be run against.

SQL Server has this. FileTables present a folder in the file system: you can simply copy files into the folder and they automatically appear in the table. The Full Text Index accesses the file content through the table.

The queries can return instances of words, and their grammatical variants (other tenses and cases), located in the files within a specified number of words of each other.
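From Access you would typically run that kind of query as a pass-through. A rough sketch (the connection string, FileTable name and search terms are purely illustrative, and the FileTable is assumed to have a full-text index on its file_stream column):

Code:
' Runs a SQL Server full-text query from Access via a temporary pass-through query.
' Server, database and FileTable names here are made up for the example.
Public Sub FullTextSearchDemo()
    Dim qdf As DAO.QueryDef
    Dim rs As DAO.Recordset

    Set qdf = CurrentDb.CreateQueryDef("")
    qdf.Connect = "ODBC;DRIVER={ODBC Driver 17 for SQL Server};" & _
                  "SERVER=MyServer;DATABASE=MyDb;Trusted_Connection=Yes;"
    qdf.ReturnsRecords = True
    ' NEAR limits the separation between the words;
    ' FORMSOF(INFLECTIONAL, word) would match the grammatical variants instead
    qdf.SQL = "SELECT name FROM dbo.MyFileTable " & _
              "WHERE CONTAINS(file_stream, 'NEAR((invoice, overdue), 5)');"

    Set rs = qdf.OpenRecordset(dbOpenSnapshot)
    Do Until rs.EOF
        Debug.Print rs!Name
        rs.MoveNext
    Loop
    rs.Close
End Sub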
 

Pat Hartman

It is amazing how quickly search engines return results, but as Galaxiom said, it is done with huge indexes, fully optimized code and probably some smoke and mirrors. :)

Keep in mind the bias of the big tech companies. As they scan each new/updated page to index it, they can decide whether they want people to find the articles on the page or not. So, it is not just the search algorithms that are biased. It is also the indexing algorithms because they assign the page rankings.
 

CJ_London

Thanks for all the thoughts. I'd not come across Full Text Search before, but I can see how it goes some way towards what I was thinking. It doesn't quite match my vision, but I can see my vision is perhaps not realistic. :(

Oh well, have to think of something else....
 

Galaxiom

You can create your own index of the words using Access. Each word is a record in the index table, and extra fields record the original record ID and the word's position within that record, which the code can use to go to the record and location to display the result.
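Something like this would do the splitting and population (a sketch only - tblWordIndex and its field names are made up for the example, and splitting on spaces alone is crude):

Code:
' Writes one row per word into a word-index table.
' Assumes tblWordIndex has fields Word, SourceTable, SourceID and WordPos.
Public Sub IndexWords(ByVal sText As String, ByVal sTable As String, ByVal lSourceID As Long)
    Dim words() As String
    Dim i As Long
    Dim rs As DAO.Recordset

    Set rs = CurrentDb.OpenRecordset("tblWordIndex", dbOpenDynaset)
    words = Split(sText, " ")           ' crude - ignores punctuation
    For i = LBound(words) To UBound(words)
        If Len(Trim$(words(i))) > 0 Then
            rs.AddNew
            rs!Word = Trim$(words(i))
            rs!SourceTable = sTable
            rs!SourceID = lSourceID
            rs!WordPos = i + 1          ' position of the word in the original text
            rs.Update
        End If
    Next i
    rs.Close
End Sub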

If you wanted to emulate Full Text Indexing's alternative case functionality, I think there was code on here recently that handled generating plurals. No doubt it would also be possible to handle other tenses, but it would take some pretty elaborate coding and probably a table of exceptions.
 

CJ_London

Yes, I realise that - and one could still compact words to reduce the size of the index. It falls over when words become phrases, so you need some form of word-separation algorithm (it could be as simple as 'no more than 3 words apart') and some rules as to when it is applied. Plurals and tenses are yet another consideration. That's when my vision jumped into the abyss! Doable, but perhaps too resource-heavy to be a realistic option for Access.

I might play around with it at a later date to see what the impact might look like.
 

Pat Hartman

When I had to convert the customer files from 12 systems to one, I started by standardizing the names. This worked because my population was mostly banks and insurance companies. I made a list of common words in their names and added a convert-to value, so all variations of 'Company' converted to 'Co'. I also removed all punctuation and extra spaces at this stage. Some of the files came from foreign subsidiaries in Europe. The file from Australia, which was the only clean file of the lot, came to me as an .mdb :) so I think their application was built with Access. All the others were mainframe COBOL systems with different types of databases; those files came as fixed-width text.

Another technique I used during this project was Soundex. If you want to make an index that returns similar words or certain types of misspellings, it is very helpful.
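Access has no built-in Soundex, but a basic VBA version of the classic algorithm is easy enough to write (this sketch ignores the H/W edge cases in the full rules):

Code:
' Classic 4-character Soundex code, e.g. Soundex("Robert") = "R163"
Public Function Soundex(ByVal sWord As String) As String
    Dim i As Long, d As String, lastD As String, sCode As String

    sWord = UCase$(Trim$(sWord))
    If Len(sWord) = 0 Then Exit Function

    sCode = Left$(sWord, 1)
    lastD = SoundexDigit(sCode)

    For i = 2 To Len(sWord)
        d = SoundexDigit(Mid$(sWord, i, 1))
        If d <> "0" And d <> lastD Then sCode = sCode & d
        lastD = d
        If Len(sCode) = 4 Then Exit For
    Next i
    Soundex = Left$(sCode & "000", 4)
End Function

Private Function SoundexDigit(ByVal c As String) As String
    Select Case c
        Case "B", "F", "P", "V": SoundexDigit = "1"
        Case "C", "G", "J", "K", "Q", "S", "X", "Z": SoundexDigit = "2"
        Case "D", "T": SoundexDigit = "3"
        Case "L": SoundexDigit = "4"
        Case "M", "N": SoundexDigit = "5"
        Case "R": SoundexDigit = "6"
        Case Else: SoundexDigit = "0"   ' vowels, H, W, Y and anything else
    End Select
End Function

Soundex("McDonald") and Soundex("MacDonald") both come back as M235, so the variant spellings group together.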

I also used some software we bought that was specifically designed to find duplicate names. Internally, it used a technique similar to soundex but more sophisticated. We used that both for the conversion and then in our data entry forms to try to prevent adding new duplicates.
 

CJ_London

I'm familiar with Soundex - I've used it twice. Once was for a utility company with around 1m business clients, and they needed to group them so they could identify the owner - e.g. a group of shops all called 'Smithy's', the differentiator being the location (or, to be more precise, the meter number). I discovered there were 21 ways to spell McDonalds when there should only be one!

The other was for a hotel booking system (the check-in part) for a chain of hotels. A new guest would say 'my name is ....' and the receptionist just typed what they heard. Nine times out of ten just the one record came up; if not, the receptionist could ask for and add additional information.
 

jdraw

I worked with an individual who created a name-search algorithm. The search mattered because, to incorporate a company federally or to register a trademark, the name had to be reviewed against all existing names to ensure there would be no, or minimal, impact on or confusion with existing intellectual property owners.
As I recall, the process was developed (in the 70s-80s) on particular hardware (Control Data) with 60-bit words and binary/logic operations. Each bit had a specific meaning and was turned on/off based on spelling, sound, derivation, diphthong, monophthong... (let's just say up to 60 factors). He had PhDs in mathematics and linguistics. He lost me when he suggested "just think of a 60-dimensional hypercube".
 

CJ_London

just think of a 60-dimensional hypercube :geek: :geek:

But the question is: did it work? And how long did it take to run?
 

The_Doc_Man

Sheesh - and string theory only requires 11 dimensions.
 

jdraw

just think of a 60-dimensional hypercube :geek: :geek:

But the question is: did it work? And how long did it take to run?
The system, or a derivative of it, is in daily operation.
 

CJ_London

As I recall, the process was developed (in the 70s-80s) on particular hardware (Control Data) with 60-bit words and binary/logic operations.
On rereading, that is kind of what I was thinking.
 

jdraw

Without understanding the math and the various linguistic details, I see the gist of the approach as a number of factors that can be on/off, present/absent, or whatever scheme you choose and stick to. There is some way (an algorithm?) to determine whether each factor is there or not (or some statistic/probability to provide confidence). You could AND, OR etc. to show a result.
When he and I talked (over coffee or lunch), he would describe situations where "his approach" might be used - things like screening blood types and diseases, reviewing evidence, even facial recognition, gait analysis, handwriting analysis, forensics...
We had some very interesting lunches and talks.
 

Galaxiom

There is some way (an algorithm?) to determine whether each factor is there or not (or some statistic/probability to provide confidence). You could AND, OR etc. to show a result.
Of course. The comparison is done very simply with bitwise operators that determine whether each individual bit matches between the data value and the test value; the bits in the result show which matched. Bitwise operators are supported in VBA.


The most common use for bitwise processing in the world would surely be the application of the subnet mask in network settings.

This is also how a method can accept multiple options in a single parameter, by arithmetic addition of enumeration values - easily recognised when the enumerations follow a binary pattern: 1, 2, 4, 8, 16, etc.

A VBA example of their use is the password complexity function I posted on this site in 2011.
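As a quick illustration of the flag pattern (the enum and its members are made up purely for the example):

Code:
' Flag-style enum - each member occupies its own bit
Public Enum SearchFlags
    sfNone = 0
    sfCaseSensitive = 1
    sfWholeWord = 2
    sfPlurals = 4
    sfOtherTenses = 8
End Enum

Public Sub FlagDemo()
    Dim opts As SearchFlags
    opts = sfWholeWord Or sfPlurals                  ' combine: 2 + 4 = 6

    Debug.Print (opts And sfWholeWord) <> 0          ' True  - the bit is set
    Debug.Print (opts And sfCaseSensitive) <> 0      ' False - the bit is not set
End Sub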
 

The_Doc_Man

Not that Access explicitly has it with JET or ACE, but there is a data type that is often found in SQL engines called "Packed Boolean" that assigns a single bit within some larger variable as the Boolean variable or field, either 1 or 0. Some of the structures you find for app objects will use an encoded set of "flag bits" that are packed into some container variable or field.
 

Galaxiom

there is a data type that is often found in SQL engines called "Packed Boolean"
Interesting. There isn't much about them online, but I did come across this.

Packed Booleans are especially valuable in array languages because only arrays of Booleans are subject to this optimization: a single Boolean in a processor will use an entire register just like an integer or floating-point value. Using a packed representation for arrays of Booleans both decreases memory usage by a factor of eight and makes faster algorithms possible, as eight times more one-bit values may be fit in a single CPU register (and thus manipulated at once) than one-byte values. However, packed Boolean arrays have a drawback: while optimized algorithms on bit Booleans are usually faster than those on byte Booleans, packed Boolean algorithms are often considerably harder to implement, and unoptimized algorithms using bit Booleans tend to be slower. This is because bits in memory cannot be addressed directly on a modern computing architecture. To read or write a single bit at an arbitrary offset, the byte containing it must be read and then the appropriate bit picked out.


On a related note, multiple 'bit' columns in SQL Server are packed together within the record, up to eight per byte. (Hence they are bit Booleans at that point, rather than the byte Booleans used in Access.)
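For what it's worth, picking a single bit out of (or setting one in) a packed Byte array in VBA looks something like this (sketch only; the array is assumed to be zero-based):

Code:
' Read bit i of a packed Boolean (Byte) array
Public Function GetBit(b() As Byte, ByVal i As Long) As Boolean
    GetBit = (b(i \ 8) And 2 ^ (i Mod 8)) <> 0
End Function

' Write bit i of a packed Boolean (Byte) array
Public Sub SetBit(b() As Byte, ByVal i As Long, ByVal v As Boolean)
    If v Then
        b(i \ 8) = b(i \ 8) Or 2 ^ (i Mod 8)
    Else
        b(i \ 8) = b(i \ 8) And Not CLng(2 ^ (i Mod 8))
    End If
End Sub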
 
