word index creator code would be good: possible? (1 Viewer)

EdK

Registered User.
Local time
Today, 03:44
Joined
May 22, 2013
Messages
42
Hi everyone

I'm in MS Access 2007. I use RTF fields to create non-fiction books/stories.

I would like ideas, and preferably also a whole VBA code to be able to create a routine (for each book/story I write) that creates a purpose built index, just like many books have as of course you would know.

The result of the process I am seeking would be a form which displays
  1. a word list for the particular book, in alphabetical order
  2. against each word would be
    1. the word count for that word
    2. a box for me to tick against each word so that (if I want) it is excluded from the index
    3. the document (book) page number(s) that the particular word appears on
  3. a command button that produces the index that would then come up as a report
  4. a command button which I can use to then print off the index in however many columns per page I want - I know how to do this particular bit (the print command button) OK, I think
  5. A nice optional extra would be to have a separate table of excluded common words (such as a, the, would, I, you, me etc etc etc) so the routine knows not to include them in its search/list/count - now there's an extra challenge for somebody here .... ???!!!!! I guess another tick box "exclude from search" would do the trick?
  6. I'm guessing that the search result may have different font sizes etc and different font types, so a routine that has a font choice drop down box to apply to the result of the routine (ie the list as made by points 1 & 2 above) would also be nice optional extra, but I think I could manage that also.
I have almost zero VBA code understanding.

However, it seems to me that the basic idea above is very simple in concept. So it surprises me that there isn't (as far as I could see) any free or cheap freeware "out there" that would do what I'm looking for. I have searched various forums but perhaps I didn't think of the appropriate search phrase.

I'm really hopeful that a simple code that works can be invented by your local gurus and/or that you can suggest links to other wise sites that might be able to help. The alternative to having this routine would be unthinkable - a manual search by observation, note taking, bits of paper, etc etc. One of my books has about 15,000 words. So, NO ...!!! Thank you, in advance, good people.

Ed K 8 January 2022
 

arnelgp

..forever waiting... waiting for jellybean!
Local time
Today, 18:44
Joined
May 7, 2009
Messages
19,169
it's a good exercise.
how do you save each "page" (1 page = 1 record on your table)?
what you need is a recordset, loop through each page (record).
another table which will hold the "words" found (field1) and the word count (field2)
and which page(s) did it appear (field3).
 
  • Like
Reactions: EdK

Uncle Gizmo

Nifty Access Guy
Staff member
Local time
Today, 10:44
Joined
Jul 9, 2003
Messages
16,244
I doubt you will find anything in VBA. I'd try Python first.
 

Uncle Gizmo

Nifty Access Guy
Staff member
Local time
Today, 10:44
Joined
Jul 9, 2003
Messages
16,244
Actually, it might be worth exploring "Word" forums as well....
 
  • Like
Reactions: EdK

CJ_London

Super Moderator
Staff member
Local time
Today, 10:44
Joined
Feb 19, 2013
Messages
16,553
1 and 2 would be easy enough to achieve in principle, and as Arnel asks, would need to know how you are identifying a page.

As you are probably aware, richtext uses html, not unicode (so the 2 character vbcrlf would require 11 characters '<div>.....</div>'). As you say 'I use RTF fields' would also need to know whether you are actually using richtext for features such as indent, colour, text formatting/size etc, or plaintext which does not. Font size in particular could impact the outcome.

And would also need to know your definition of a word. Would some words actually be phrases e.g. would 'Windows 10', 'hammer drill' or 'St. Ives' be considered one word or two?
 
  • Like
Reactions: EdK

The_Doc_Man

Immoderate Moderator
Staff member
Local time
Today, 05:44
Joined
Feb 28, 2001
Messages
26,996
@EdK

I can tell you about this but you should be aware that this is not a project for the beginner. I have done something similar to that, but it isn't finished because of the "too many irons in the fire" situation. (I thought I would have more free time after retirement, but ... )

To do what you want is certainly possible but you need to understand that the answers are somewhat variable. Further, to do this, you are going to HAVE to learn VBA to the point that you can comfortably deal with application objects, because the information you want isn't always DIRECTLY available to Access - but the other applications have it with little or no trouble. Some comments, interspersed in blue:

  1. a word list for the particular book, in alphabetical order: Access can sort this kind of thing easily, if you make the list of words a parent table and their locations a child table.
  2. against each word would be
    1. the word count for that word: Based on your 2.3 sub-requirement, you must keep these references separate; if so, the COUNT query is trivial. It just depends on a rational data structure.
    2. a box for me to tick against each word so that (if I want) it is excluded from the index
    3. the document (book) page number(s) that the particular word appears on: Ah, but that depends on the font size and the presence of any inclusions such as embedded pictures. It is variable, depending on whether your input was created for a letter document or legal document, and also depends on orientation (portrait/landscape). You will HAVE to ask for help from the application holding this book or document. And let's not forget the possibility that the document is segmented into chapters as separate files. Word allows that.
  3. a command button that produces the index that would then come up as a report: This is a "divide and conquer" case - build the report, then separately build a button to open the report.
  4. a command button which I can use to then print off the index in however many columns per page I want - I know how to do this particular bit (the print command button) OK, I think
  5. A nice optional extra would be to have a separate table of excluded common words (such as a, the, would, I, you, me etc etc etc) so the routine knows not to include them in its search/list/count - now there's an extra challenge for somebody here .... ???!!!!! I guess another tick box "exclude from search" would do the trick? Actually, you would do better to have a list of the words you DID want to see. You would be surprised at how many "noise" words you might come across.
  6. I'm guessing that the search result may have different font sizes etc and different font types, so a routine that has a font choice drop down box to apply to the result of the routine (ie the list as made by points 1 & 2 above) would also be nice optional extra, but I think I could manage that also. I'm stating, and it isn't a guess at all, that this where you want to involve Word as an app object because Access, not being a word processor, will neither know nor give a toot about that kind of detail, whereas Word has all sorts of features built-in to deal with variable fonts, spacing, indenting, etc.
If you wanted to do this in Access, it is possible. I'll give you one thing that might help. To look at words in a document, you must have that document in a text format. If it is ASCII text (not UNICODE), a text parser might help. Here is one I created for my own book analysis:


The download includes a Word document on how to import and use the parser. The download also includes the class module. It WILL NOT work with UNICODE text offerings.

I posted that module for people who might wish to do text parsing, so it is fair game. However, you can't use that module without learning a LOT about VBA. Access will not be able to do the counting for you, but if you have Word open as an app object, it can do page counting. BUT be aware that long documents might be broken up into pieces-parts, which Word refers to as master and sub documents. You will only get a valid "word X is on page Y" count if you have the master document expanded and feed THAT to whatever routine you are going to use for this analysis. You CANNOT do this with just queries and intrinsic form mechanisms. You will also need to learn a bunch about Word's Object Model (internal document structure, if you prefer that term).

This is surely a lot to digest. Having done something similar to this using Word and Access, I'm not blowing smoke. Your request for "whole VBA code" will probably not get back much, as your goals are way too specialized for anyone to have that just lying around on the shelf. My parser might be a starting point for identifying individual words, but it is just a small cog in a complex machine.
 

Pat Hartman

Super Moderator
Staff member
Local time
Today, 06:44
Joined
Feb 19, 2002
Messages
42,970
I've attached another piece of code that might be useful. The db is a sample of Word OLE automation but it includes a form that reads though a directory and opens all the word documents to extract bookmarks. So between what Doc posted and this, you might be able to pull it together yourself:)
 

Attachments

  • SampleWordAutomation_20210314.zip
    306.1 KB · Views: 303
  • SampleLettersOnly.zip
    33.7 KB · Views: 340

MajP

You've got your good things, and you've got mine.
Local time
Today, 06:44
Joined
May 21, 2018
Messages
8,463
IMO everything that is described is pretty simple. The only question I have is where are the books located.
I'm in MS Access 2007. I use RTF fields to create non-fiction books/stories
Are you saying you write a book in an Access Memo field? I cannot wrap my head around that. I would think you write a book in Word. Either way if the source is Word or Access should be very easy to
1. Read every word and if it is from a Word processor identify the page
2. Store every word in a table and page number.
3. Sort alphabetically
4. Tag a word against a source table of common words (conjunctions, prepositions)
5. I then would group each word and List occurrences like in an index ; SomeWord p. 12, 27, 54, 99
6. Create and print the report.

Now this is likely not good enough. I would build a wizard to pick a word and cycle through each occurrence so you can view the context. So as CJ said you are going to find the word Windows and the word 10. No logic is going to know you want Windows 10. So you want to look at the returned list and pick Windows. Then you need a fast way to validate the choices and update your list. So you will get a 85% solution to start.

I already do something like this but far more advanced. I search a Word document for all acronyms and abbreviations based on REGEXP patterns. I have a table of about 50K acronyms in my line of work. It then associations them to the potential acronym. I then have some features to do some house keeping and delete those I know are not legitimate. I then scroll through the found acronyms and it goes to the first occurrence. If the first occurrence is NASA I can choose to call it out it will automatically replace it the full name and call out: "National Air and Space Administration (NASA). We are always required to spell and callout the first occurrence. I can then build a word acronym list from the found occurrences. This reduce a many hour chore to 15 minutes.

One thing I found is that REGEXP is amazingly fast in its ability to read a large document and compare to a pattern and return all occurrences. You can return 15k words instantaneously and write to a table in no time.
 
  • Like
Reactions: EdK

EdK

Registered User.
Local time
Today, 03:44
Joined
May 22, 2013
Messages
42
Hi everyone

I'm in MS Access 2007. I use RTF fields to create non-fiction books/stories.

I would like ideas, and preferably also a whole VBA code to be able to create a routine (for each book/story I write) that creates a purpose built index, just like many books have as of course you would know.

The result of the process I am seeking would be a form which displays
  1. a word list for the particular book, in alphabetical order
  2. against each word would be
    1. the word count for that word
    2. a box for me to tick against each word so that (if I want) it is excluded from the index
    3. the document (book) page number(s) that the particular word appears on
  3. a command button that produces the index that would then come up as a report
  4. a command button which I can use to then print off the index in however many columns per page I want - I know how to do this particular bit (the print command button) OK, I think
  5. A nice optional extra would be to have a separate table of excluded common words (such as a, the, would, I, you, me etc etc etc) so the routine knows not to include them in its search/list/count - now there's an extra challenge for somebody here .... ???!!!!! I guess another tick box "exclude from search" would do the trick?
  6. I'm guessing that the search result may have different font sizes etc and different font types, so a routine that has a font choice drop down box to apply to the result of the routine (ie the list as made by points 1 & 2 above) would also be nice optional extra, but I think I could manage that also.
I have almost zero VBA code understanding.

However, it seems to me that the basic idea above is very simple in concept. So it surprises me that there isn't (as far as I could see) any free or cheap freeware "out there" that would do what I'm looking for. I have searched various forums but perhaps I didn't think of the appropriate search phrase.

I'm really hopeful that a simple code that works can be invented by your local gurus and/or that you can suggest links to other wise sites that might be able to help. The alternative to having this routine would be unthinkable - a manual search by observation, note taking, bits of paper, etc etc. One of my books has about 15,000 words. So, NO ...!!! Thank you, in advance, good people.

Ed K 8 January 2022

Actually, it might be worth exploring "Word" forums as well....
Uncle Gizmo - thanks for that suggestion, I now see (by fiddling) that my MS Access book reports (which I produce as pdf) can be exported beautifully from Access as MS Word text files and I note that MS Word does have a sort of index producing facility (which I knew anyway) which I could use but it requires me to go through and identify (+ "mark" with code) so that it comes into the index. This is the type of laborious manual stuff I was hoping to avoid (being a bit lazy at heart). Hence, my thought of having an initially mindless "grab it all" routine that can then be fine tuned manually. Structurally, my Access bookwriter application already produces a report of the whole book, in hierarchical format, as follows - Chapter No. & name; next, lower, what I call "sub-headings"; next, lower, is paragraph number and name; next, lowest, are endnotes, linked to the para number. So you see, all I want is the main words that the reader might be interested in finding - I already have the structural index. Instead of page number, it (any index maker) could easily be linked to chapter_paragraph_number, I guess.

I have an eBook making program called Sigil which comes very close to what I and about one million other would be authors would love to have & use - under Sigil's "word search_replace_correct" routine it produces an alphabetical list of all words in the document, with the number of occurrences of each word. My brain tells me that to do that routine, it must (even if it is instantaneously) that program must know exactly where in the document each word actually is (because it asks if the user wants to correct or delete it). Astonishingly, I can select that whole list but when I right click to copy that info, I am given zero options. I'm thinking I will make a suggestion to (I think) github to go that one step further and provide (even if at some cost to eg myself) a simple index-maker add-in to Sigil. I would certainly be willing to pay A$50 for it if I found that it suits my simple purposes.

Thanks EdK
 

EdK

Registered User.
Local time
Today, 03:44
Joined
May 22, 2013
Messages
42
IMO everything that is described is pretty simple. The only question I have is where are the books located.

Are you saying you write a book in an Access Memo field? I cannot wrap my head around that. I would think you write a book in Word. Either way if the source is Word or Access should be very easy to
1. Read every word and if it is from a Word processor identify the page
2. Store every word in a table and page number.
3. Sort alphabetically
4. Tag a word against a source table of common words (conjunctions, prepositions)
5. I then would group each word and List occurrences like in an index ; SomeWord p. 12, 27, 54, 99
6. Create and print the report.

Now this is likely not good enough. I would build a wizard to pick a word and cycle through each occurrence so you can view the context. So as CJ said you are going to find the word Windows and the word 10. No logic is going to know you want Windows 10. So you want to look at the returned list and pick Windows. Then you need a fast way to validate the choices and update your list. So you will get a 85% solution to start.

I already do something like this but far more advanced. I search a Word document for all acronyms and abbreviations based on REGEXP patterns. I have a table of about 50K acronyms in my line of work. It then associations them to the potential acronym. I then have some features to do some house keeping and delete those I know are not legitimate. I then scroll through the found acronyms and it goes to the first occurrence. If the first occurrence is NASA I can choose to call it out it will automatically replace it the full name and call out: "National Air and Space Administration (NASA). We are always required to spell and callout the first occurrence. I can then build a word acronym list from the found occurrences. This reduce a many hour chore to 15 minutes.

One thing I found is that REGEXP is amazingly fast in its ability to read a large document and compare to a pattern and return all occurrences. You can return 15k words instantaneously and write to a table in no time.
MajP like minds, we are. I will have a look at your zip files in a while. I'd be happy with an 85% result and yes I know there's manual work needed to fine tune any/the list. I've posted a bit more detail of my Access book structure & reports, and that I can export text file to MS Word from the Access report options drop down list. I sent it in my reply to Uncle Gizmo just now, you might like to look at that. Thanks for your suggestions. EdK
 

CJ_London

Super Moderator
Staff member
Local time
Today, 10:44
Joined
Feb 19, 2013
Messages
16,553
You are not really answering the questions asked of you about how you identify a page number, font sizes and the like. Best you have come up with is 'I already have the structural index.' So how does that work?

As previously said, getting a word count etc is not that difficult if a word is defined as some text or numbers with a space or punctuation both ends Not so good for 'St. Ives' etc.

Why not zip and upload a copy of one of your books in access - use a dummy book if need be. But provide an all encompassing example
 

EdK

Registered User.
Local time
Today, 03:44
Joined
May 22, 2013
Messages
42
it's a good exercise.
how do you save each "page" (1 page = 1 record on your table)?
what you need is a recordset, loop through each page (record).
another table which will hold the "words" found (field1) and the word count (field2)
and which page(s) did it appear (field3).
arnelgp unfortunately, my MS Access book record(s) do not go by page, they go by (see my post to Uncle Gizmo, to see the database structure). But it's OK, I am investigating (as someone suggested here) the MS Word options, also a program called Sigil (which I use to make eBooks). From the various posts, I am getting a wider appreciation of the issues involved. I have already decided I don't want the index to be perfect, at all, and that my book structure reports, as mentioned to Uncle Gizmo, are exactly equivalent to a TOC (Table of Contents). Thanks for your comments, I wish I could speed learn VBA coding (in less than the time I have left on this beautiful planet - I'm 77 years old/young ....) EdK
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 06:44
Joined
May 21, 2018
Messages
8,463
I have a working application if you want to PM me. I just have to dumb it way down to give you just what you need. I just tested it on 174 page document and it can build the index in under a minute. I have an exclusion library where you can add words to exclude. I have lots of features to make it easy to add to the list.
Mine is currently designed to select from a word document, but it could pull from any string. The below is the result from my 174 page document. In red are those that are in the exclusion table. You can filter to show the exclusion words and delete them manually or choose to just delete all of them at once.

Index.jpg


Here is the filtered view
excluded.jpg


If I was pulling this from word I would create page numbers for each occurrence, but not sure what to do if you want to pull from a memo field. I Already make a table, but automate it using Word. If you want it in Access that would be far easier.
 
  • Wow
Reactions: EdK

EdK

Registered User.
Local time
Today, 03:44
Joined
May 22, 2013
Messages
42
You are not really answering the questions asked of you about how you identify a page number, font sizes and the like. Best you have come up with is 'I already have the structural index.' So how does that work?

As previously said, getting a word count etc is not that difficult if a word is defined as some text or numbers with a space or punctuation both ends Not so good for 'St. Ives' etc.

Why not zip and upload a copy of one of your books in access - use a dummy book if need be. But provide an all encompassing example
Hi CJ_London the way my bookwriter works, I don't ever identify a page number in the data base, but it does show when I produce an interim (or final) report. Yes, that's a problem for an index (clearly!). As I explained to arnelgp, the way around that is to export the report (= all & every word) as a MS Word text file, it does that really well, the text is all the same font type and size.

So my focus has shifted now to "what can I use in MS Word that would suit my purposes?" The answer appears to be "Actually (as far as I can see, and happy to be shown differently) nothing automatic that produces just a word index, so therefore a lot of manual work would be involved to "tag" every word I want in the index; not what I was after ....".

I agree totally that MS Access should be able to easily produce an 85% good list of words & "word spots" (for want of a better word than "page number"). I would be happy with that. But I lack the ability to do the necessary ??simple?? VBA coding... Anyway, thanks for your interest EdK
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 06:44
Joined
May 21, 2018
Messages
8,463
Now once you do it for real you will see the features that are needed. For example how to you want to handle plurals (accept and accepts)? You probably want to combine them in some way.
In anything like this, you are never going to get code that gives you the perfect solution. You need a UI that gives lots of features to make the process fast.
 
  • Like
Reactions: EdK

MajP

You've got your good things, and you've got mine.
Local time
Today, 06:44
Joined
May 21, 2018
Messages
8,463
As CJ_London points out, it may be near impossible to find "St. Ives", but again the trick is to make the UI very flexible and Fast.
Again, my original worked with Word. So once I am finished I want to go back to the source document and verify some information.
moveinword.jpg

So I have the ability to control word from my table of found words. I can scroll through my words in the db and open each occurrence in the source. As you can see I found 3K words, I would then delete from the exclusion list and spend time deleting. Once the list is manageable I can move through the Word document and compare and update my list using my navigation controls.
 
  • Like
Reactions: EdK

EdK

Registered User.
Local time
Today, 03:44
Joined
May 22, 2013
Messages
42
I have a working application if you want to PM me. I just have to dumb it way down to give you just what you need. I just tested it on 174 page document and it can build the index in under a minute. I have an exclusion library where you can add words to exclude. I have lots of features to make it easy to add to the list.
Mine is currently designed to select from a word document, but it could pull from any string. The below is the result from my 174 page document. In red are those that are in the exclusion table. You can filter to show the exclusion words and delete them manually or choose to just delete all of them at once.

View attachment 97419

Here is the filtered view
Now once you do it for real you will see the features that are needed. For example how to you want to handle plurals (accept and accepts)? You probably want to combine them in some way.
In anything like this, you are never going to get code that gives you the perfect solution. You need a UI that gives lots of features to make the process fast.


If I was pulling this from word I would create page numbers for each occurrence, but not sure what to do if you want to pull from a memo field. I Already make a table, but automate it using Word. If you want it in Access that would be far easier.
MajP - I seriously don't know how to express my appreciation of what you've posted. My mind is genuinely blown away. Dumb is good, but it don't look dumb to me .... And as I haven't yet replied to Uncle gizmo re his post I won't go too overboard in my praise for you, in case his routines are even better. You see, I'm technically way out of my depth here. But all you guys need is a marketing manager and you've got it made. Though .....

By the way, I don't understand much about using forums -
you say "I have a working application if you want to PM me" .... could you please enlighten me on what "PM" stand for? [I'm 77 years old and my brain hurts - ho ho ...]

EdK
 

EdK

Registered User.
Local time
Today, 03:44
Joined
May 22, 2013
Messages
42
As CJ_London points out, it may be near impossible to find "St. Ives", but again the trick is to make the UI very flexible and Fast.
Again, my original worked with Word. So once I am finished I want to go back to the source document and verify some information.
View attachment 97422
So I have the ability to control word from my table of found words. I can scroll through my words in the db and open each occurrence in the source. As you can see I found 3K words, I would then delete from the exclusion list and spend time deleting. Once the list is manageable I can move through the Word document and compare and update my list using my navigation controls.
Wow, this is so logical and good. I agree with "fast" and "most of it done, do the rest via fine tuning, deleting, etc". You know it is a pleasure to see (my) vague and formless ideas come into fruition, via savant adaptation of "idea" -----------> "useful REALITY". I just can't believe something like your routine hasn't hit the huge "index application" market. I don't know what the big book publishing companies are using ....

Can I ask again (duh ....!!) how do I contact you privately (so that my ignorance is not further demonstrated)?? EdK

PS - If we're ever on the Titanic together, may I please be in your life raft ????
 

arnelgp

..forever waiting... waiting for jellybean!
Local time
Today, 18:44
Joined
May 7, 2009
Messages
19,169
if you sent a private message to uncle g, then we can't see it nor does we have access to it.
i tried experimenting with Sigil, and it is not easy to "view/extract" any words inside the
epub. you might need a special "tools" to extract each words.

if you convert your epub to Word it is Very Easy to get those words/word count using
Word Automation.

if you try to Google about word counting in word using VBA, there are lots
of hits which have already been resolved.

you might also want to go to http://www.vbaexpress.com/forum and ask mr.gmayor, MS Word MVP about
your project. he is very expert on ms word and most of the topic forums about Word were solved by him.
 
Last edited:
  • Like
Reactions: EdK

EdK

Registered User.
Local time
Today, 03:44
Joined
May 22, 2013
Messages
42
arnelgp, thank you for your helpful thoughtful comments. I need now to just see what word can do with my exported MS Word text file(s), in the light of whatever your suggested sites and people come up with. I also want to pursue understanding of MajP's very cool creation. I also need to study the basic forum "How to use the forum" help topics, just so I can better communicate with all you geniuses. I am hopeful. and envious of your group expertise. and your helpfulness blows me away. EdK
 

Users who are viewing this thread

Top Bottom