Include file search engine in Access db?

qqq

Hi there,

I would like to include a file search engine in my Access front end, allowing for an indexed search of all relevant files (doc, pdf, xls, txt) in our network, particularly our server.

I have come across dtsearch (http://www.dtsearch.com/PLF_engine_2.html). What they offer is a fully fledged search engine that can be used through VBA code (I think it exposes an object that can be built into the front-end code). Sounds great, I think, if it works. I like the idea of seamlessly including a "file search" feature in the front end.

Do any of you know this tool? Or have you come across any alternative file search engines that might be less expensive?

Thanks for your thoughts!

Marcus
 
Thanks Guus, for the quick response.
I would need something to crawl through several hundred GB for indexing, which I think is beyond the scope of the "search engine code" I could find on planetsourcecode. I think those were built for searches within db tables or for finding a file name in a folder. Correct me if I'm wrong.
Marcus
 
The 100 GB could be one single file or 25 ripped DVDs, so it doesn't say much. Windows Commander can find files a lot faster than the file search that was built into the standard Explorer. Both are compiled!
I don't think that VBA is the answer to a quick search. It is interpreted; the code is not compiled and is therefore a lot slower.

Compare this VBS script to the search in the explorer:
Code:
[ff.vbs]
Dim System
Dim Drive
Dim FileName
ReDim Location(0)

FileName = LCase(Trim(InputBox("Enter the name of the" & _
                  " file that you wish to search for.")))
If Len(FileName) = 0 Then WScript.Quit

Set System = CreateObject("Scripting.FileSystemObject")
For Each Drive In System.Drives
   ' DriveType 2 = local fixed disk
   If Drive.IsReady And Drive.DriveType = 2 Then
      Call FindFile(Drive.Path & "\")
   End If
Next 'Drive
MsgBox Join(Location, vbCr)

Sub FindFile(ThisFolder)
    Dim File
    Dim Folder
    On Error Resume Next   ' skip folders we have no permission to read
    ' Check the files in this folder first
    For Each File In System.GetFolder(ThisFolder).Files
       If LCase(File.Name) = FileName Then
          Location(UBound(Location)) = File.Path
          ReDim Preserve Location(UBound(Location) + 1)
       End If
    Next 'File
    ' Then recurse into every subfolder
    For Each Folder In System.GetFolder(ThisFolder).SubFolders
       Call FindFile(Folder.Path)
    Next 'Folder
End Sub
Enjoy!
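If you want to try it, save it as ff.vbs and either double-click it or run it from a command prompt with cscript ff.vbs. Note it only matches exact file names; it does not look inside the files.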
 
Thanks again, but the magic thing I am looking for should not only search file/folder names but also search (and index) file content. It would be closer to Google Desktop Search (which has an indexing crawler) than to Windows Explorer (which, I think, doesn't keep an index).
Cheers, M
 
You're looking for a simple answer. There isn't one. Write your own in some compilable language and create an interface with Access. Perhaps an Add-In.
 
Doing a file content search is extremely dangerous unless your design is extremely selective. Your problem, in a nutshell, is that you have a few choices but EACH has many negatives associated with it.

1. An undirected file indexer that tells you which file contains every frimpin' little word in the world. You'll find like a gazillion entries to wade through. The result will be useless because it contains too much.

If we are truly talking 100 GB worth of files, let's do a sanity check: the average length of an English word is about 5 characters (last time I checked), but let's be generous and say you will index words up to 16 characters long.

So... you make a table with an autonumbered primary key and a 16-character text field. Plus overhead, that is about 28-30 bytes per record. There are about 30-50 THOUSAND words commonly used in English, but if you include techie words, make that 100K. So your word list (for non-selective entries) will be about 3 MB. And to make word lookups work correctly while building the word list, you need the text field to be indexed too, so add another 3 MB for the reverse lookup index. That's 6 MB just for word lookup.

Then you make a table with the names of the documents searched. Give that an autonumber PK and a file spec of 128 characters (if you have long folder or file names) or 64 characters (if everything is short; but given "Program Files" and "Documents and Settings"... I'd bet on 128). So you populate that puppy. Let's say, for the sake of argument, that you average about 50 KB per file (typical document files). OK, so...

100 GB of documents / 50 KB per document = 2 MILLION documents?

Your record size would be the autonumber plus the path name = 128 + 4 + overhead of about 8, maybe = 140 bytes per file. Times 2 million docs = 280 MB. You are up to 286 MB between the keyword list and the document list.

Now the fun: the JUNCTION table that links a word to a document. Assuming you disallow duplicates (and you had better do so), you would use the two PKs at 4 bytes each plus overhead of several more bytes, so probably on the order of 20 bytes per record if you are lucky. Now, using an average word size of 10 bytes to make the math doable, 100 GB of data / 10 bytes per word = 10 billion words to store... at 20 bytes per pop. PLUS the index you need to assure that each entry is unique, which is probably another index.

Unfiltered, you are talking 200 GB JUST FOR THE JUNCTION TABLE. Unless you can filter out duplicates to bring that junction table down below 2 GB minus the size you need for the keyword table and the document table, you can't store this in Access. Just remember, it is 20 bytes per entry PLUS 20 bytes for the index that keeps it unique.

Now... tell me you DIDN'T want to track the location in the file of each word you find... PLEASE tell me you didn't. Because if you did, you just blew the uniqueness and lost the ability to filter. If a printed page contains about 500 words, you will need not less than 300 GB of junction table (including the page reference). If you wanted it like a HYPERLINK to the actual use of the word on that page, it is more like 400 GB of junction table. And your odds of reducing the size by avoiding redundancy just died their final death.
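To put the same arithmetic in one place, here is a rough back-of-envelope version in VBA you can paste into a module and run from the Immediate window. The per-record byte counts and averages are the same rough assumptions as above, not measured figures, so treat the output as an order-of-magnitude estimate only.
Code:
Public Sub IndexSizeSanityCheck()
    ' Rough back-of-envelope using the same assumptions as the text above.
    Const WORD_COUNT As Double = 100000#        ' ~100K distinct words incl. techie terms
    Const WORD_REC_BYTES As Double = 30#        ' autonumber + 16-char text + overhead
    Const DOC_COUNT As Double = 2000000#        ' 100 GB / ~50 KB per document
    Const DOC_REC_BYTES As Double = 140#        ' autonumber + 128-char path + overhead
    Const TOTAL_WORDS As Double = 10000000000#  ' 100 GB / ~10 bytes per word
    Const JCT_REC_BYTES As Double = 20#         ' two 4-byte keys + overhead

    Dim wordTableMB As Double
    Dim docTableMB As Double
    Dim junctionGB As Double

    wordTableMB = WORD_COUNT * WORD_REC_BYTES * 2 / 1000000    ' x2 for the index on the text field
    docTableMB = DOC_COUNT * DOC_REC_BYTES / 1000000
    junctionGB = TOTAL_WORDS * JCT_REC_BYTES / 1000000000      ' before the unique index doubles it

    Debug.Print "Word list + its index: ~" & wordTableMB & " MB"
    Debug.Print "Document list:         ~" & docTableMB & " MB"
    Debug.Print "Junction table:        ~" & junctionGB & " GB (Access limit: 2 GB per file)"
End Sub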

That is not a job for Access. Shoot, I would hesitate to do that on ORACLE with an HP Alpha-series processor doing the text parsing. You are approaching numbers that are appreciable fractions of a TERABYTE.

You can say I'm being frivolous, but trust me, I'm not. I'm showing you how to analyze problems of this sort and how to make decisions on what you wanted to see vs. what you can afford to see. And for the general indexer unfiltered, don't go there. Just don't go there. Trust me. You cannot reasonably get there from here.

So let's consider your other choices...

2. A directed file indexer that tells you which file contains every word in a predefined list. Which means you have to know what you are seeking in the first place. In other words, predefine your list of words and never store anything that is not in that list. Then you keep your size down - but you run the risk of never knowing what unexpected, non-trivial words are in your document.

3. A directed file indexer that tells you which file contains any word NOT in a predefined list. Which means you have to make a list of words you don't want to see, which will in all likelihood be about half of a dictionary. So we have TWO lists: you look for the word in the "don't want" list before you store it in the real "content" list. Probably looking at several thousand "don't want" words. But again, that would keep your numbers down. (Either way, the tables boil down to the same sketch shown below.)
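For what it's worth, here is a minimal sketch of the tables a directed indexer ends up with. The table and field names are placeholders I made up, and it assumes you keep the index in its own Access back end; option 3 would simply add a second table just like tblWord to hold the "don't want" (stop word) list.
Code:
Public Sub BuildIndexTables()
    ' Minimal sketch of the tables a directed indexer needs.
    ' All names are placeholders; run once against the index database.
    Dim db As DAO.Database
    Set db = CurrentDb

    ' The word list: for option 2 this is the predefined "want" list.
    db.Execute "CREATE TABLE tblWord (" & _
               "WordID AUTOINCREMENT CONSTRAINT pkWord PRIMARY KEY, " & _
               "WordText TEXT(16) NOT NULL CONSTRAINT uqWord UNIQUE)", dbFailOnError

    ' One row per document found on the share.
    db.Execute "CREATE TABLE tblDocument (" & _
               "DocID AUTOINCREMENT CONSTRAINT pkDoc PRIMARY KEY, " & _
               "DocPath TEXT(128) NOT NULL)", dbFailOnError

    ' The junction table: which stored word appears in which document.
    ' The compound primary key is what keeps out the duplicates that blow up the size.
    db.Execute "CREATE TABLE tblWordDoc (" & _
               "WordID LONG NOT NULL CONSTRAINT fkWord REFERENCES tblWord (WordID), " & _
               "DocID LONG NOT NULL CONSTRAINT fkDoc REFERENCES tblDocument (DocID), " & _
               "CONSTRAINT pkWordDoc PRIMARY KEY (WordID, DocID))", dbFailOnError
End Sub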

So, here's the sanity check.

Your choices are

(1) an unfiltered indexer that will simply overflow Access's capacity, or

(2) a directed indexer that might fit but will require predefinition of the keywords you want to see (and the list will take time to build and will depend on you being able to think of the words you want to see), or

(3) a filtered indexer that might fit but will require predefinition of the keywords you don't want to see (and the list will take time to build and will require you to think of the words you consider undesirable; not as easy as it sounds...).

Add to that, #2 and #3 require MANUAL DATA ENTRY for hundreds or thousands of words depending on what you seek or what you don't want to see. Keying in either list will take you however long it takes to type in several hundred or several thousand words. Assuming there is absolutely NO think time involved and you touch-type, you are still looking at several HOURS of mind-numbing data entry.

If you would bother to do the math for exercises like this, you would gain some perspective, I think. It would put a magnitude of effort in front of you and help you decide that on the "build/buy" scale, this leans towards "buy" - if possible. Or go do something else with the rest of your life... because running this under Access would probably TAKE the rest of your life.
 
Thanks DocMan for that impressive response and calculation. I am convinced that "building" is not really the solution (I do not intend to spend the rest of my life on this).
However, the dtsearch engine (the one I mentioned above) does provide such functionality, and after all, Google does too, indexing terabytes of text. Have you ever used, or heard of someone using, dtsearch built into an Access tool (the dtsearch people claim it works easily)? Or an alternative product that offers similar features and can be addressed through VBA?
Thanks again for your efforts!
 
Google does what it does because someone defined some things that people really want to see. Not to mention that for advertising sites, it is in the advertiser's best interest to PROVIDE the keywords that go with their web content. So the reference list gets handed to Google for a lot of their stuff.

Never played with DTSEARCH. What you need to look for is this: if there is a way to make a reference to DTSEARCH from Access, you can use Object Automation to control your searches. If you have a copy of DTSEARCH and it has documentation you can read to check on that object exposure, you can find out the interface between Access and your search program. This used to be called ActiveX, though "Automation" has largely taken over the meaning that ActiveX used to carry. Anyway, if there is a way to create a DTSearch Application Object, you are home free. (Well, at least you can GET there from where you are.) Search this forum and the Access help on Application Objects.
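For what that looks like in practice: late-bound Automation from Access VBA is just a CreateObject call against whatever ProgID the product registers. I don't know what dtSearch's actual ProgID, objects or methods are called (the commented-out dtSearch line below is a made-up placeholder you would have to replace from their developer documentation), but the pattern is the same one you would use to drive, say, Excel:
Code:
Public Sub AutomationPattern()
    ' Generic late-bound Automation from Access VBA.
    Dim app As Object

    ' A known example: drive Excel through the Application object it exposes.
    Set app = CreateObject("Excel.Application")
    Debug.Print "Excel version: " & app.Version
    app.Quit
    Set app = Nothing

    ' If dtSearch exposes a COM object, the call looks the same.
    ' NOTE: "dtSearch.SomeEngine" below is a PLACEHOLDER, not the real ProgID;
    ' check the dtSearch developer documentation for the actual object names.
    ' Set app = CreateObject("dtSearch.SomeEngine")
End Sub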
 
Thanks, that's what I hope dtsearch does. I was wondering if anyone knew of another tool... doesn't seem so.
 
