word index creator code would be good: possible?

Gasman

Enthusiastic Amateur
Local time
Today, 14:57
Joined
Sep 21, 2011
Messages
14,044
MajP - I seriously don't know how to express my appreciation of what you've posted. My mind is genuinely blown away. Dumb is good, but it don't look dumb to me .... And as I haven't yet replied to Uncle gizmo re his post I won't go too overboard in my praise for you, in case his routines are even better. You see, I'm technically way out of my depth here. But all you guys need is a marketing manager and you've got it made. Though .....

By the way, I don't understand much about using forums -
you say "I have a working application if you want to PM me" .... could you please enlighten me on what "PM" stands for? [I'm 77 years old and my brain hurts - ho ho ...]

EdK
Click MajP's name under his icon and then click Start Conversation.
I have just done the same to you, just so you see what happens.

Then when you get a reply or new message, it will show up at the top right of the menu bar where an envelope icon is displayed, next to your username.
 

EdK

Registered User.
Local time
Today, 07:57
Joined
May 22, 2013
Messages
42
Aha! Thank you. Yes I see an envelope top rhs, as you said. Thank you I'll get there yet. What does PM stand for? EdK
 

CJ_London

Super Moderator
Staff member
Local time
Today, 14:57
Joined
Feb 19, 2013
Messages
16,553
OK, so you are going a different route. But just for the benefit of others who are looking at an Access solution for page numbers: my thought is that it should be possible to calculate the page number from the line count plus information about the page structure. In principle, the number of lines per page can be determined from the page dimensions, adjusted for margins and divided by the height per line, with further adjustments for the font height of titles etc.

You would do this by scanning the 'raw' text a character at a time, taking note of the HTML formatting codes as you go and picking up on font dimensions (there are a few examples about), line feeds and the like. Colin put me on to WizHook as an alternative to Stephen Lebans's gettextheightwidth function for determining font dimensions.
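As a rough illustration of the lines-per-page arithmetic described above, here is a minimal Python sketch. It assumes a single fixed line height and ignores titles, images and HTML formatting; the function names and the sample measurements (an A4 page of about 842 pt, 72 pt margins, 14 pt lines) are made up for the example and are not from anyone's application.

```python
import math

def lines_per_page(page_height_pt, top_margin_pt, bottom_margin_pt, line_height_pt):
    """Number of whole text lines that fit between the top and bottom margins."""
    usable = page_height_pt - top_margin_pt - bottom_margin_pt
    if usable <= 0 or line_height_pt <= 0:
        raise ValueError("page must have positive usable height and line height")
    return math.floor(usable / line_height_pt)

def page_of_line(line_number, per_page):
    """1-based page number for a 1-based line number."""
    return (line_number - 1) // per_page + 1

# Hypothetical measurements: A4 height of about 842 pt, 72 pt margins, 14 pt lines.
per_page = lines_per_page(842, 72, 72, 14)
print(per_page, page_of_line(50, per_page))
```

A real implementation would have to replace the fixed line height with measured font heights, which is where something like WizHook would come in.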
 

Gasman

Enthusiastic Amateur
Local time
Today, 14:57
Joined
Sep 21, 2011
Messages
14,044
Aha! Thank you. Yes I see an envelope top rhs, as you said. Thank you I'll get there yet. What does PM stand for? EdK
Private Message
I mentioned that in the PM I sent you? :)
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 10:57
Joined
May 21, 2018
Messages
8,463
After searching on this, I am not sure this is such a great approach, or all that usable for an index. Maybe it is a start, but you would definitely need to build a full application. There is a lot of software out there to do indexing; obviously it is a bigger field than I thought. Some of these apps may be freeware.
Looking at a real index, very few entries are single words. So getting single words may be a start, but you would need some interface and features to refine the results.
Index Example.jpg


You might be able to use REGEXP to get "Adobe Flash Player", but I know of no way to get "book pricing". My code would return "book" and "pricing". Also, a lot of software apps have the ability to do sub-items (Artwork -- covers --- ownership of). Word has a pretty good index creator, but that requires you to tag the items.
If I were doing this for myself, I would probably want the ability to pull all the words first. Then the means to delete excluded words, and other words, to narrow down the list. Then use the navigation controls to search for "Book" and update items. Once I got all the terms in my table, I would then use it to automate Word to tag the items and create the index using the Word features.
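To make the regular-expression point concrete, here is a small Python sketch (not the Access/VBA code discussed in this thread). It assumes that a "term" is a run of two or more capitalised words; the pattern and the sample sentence are illustrative only.

```python
import re

# Two or more consecutive capitalised words, e.g. "Adobe Flash Player".
CAPITALISED_PHRASE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")

def capitalised_phrases(text):
    """Return every run of two or more capitalised words found in the text."""
    return CAPITALISED_PHRASE.findall(text)

sample = "You must install Adobe Flash Player first. The chapter on book pricing comes later."
phrases = capitalised_phrases(sample)
print(phrases)
```

The pattern picks up "Adobe Flash Player" but has nothing to grab on to for "book pricing", because nothing in the spelling marks those two lower-case words as a unit. It would also wrongly absorb a capitalised sentence-opening word that happens to sit directly before a proper name.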

So you can see the problem in the returns here
Problem.jpg


For example, I know the context of this document: it deals with RADARS. I know there are important discussions about accuracy, and I see that "accuracy" occurs 25 times and "accurate" 4 times. I know that there are topics on "Position Accuracy", "Height Accuracy", and "Speed Accuracy", but the single word by itself would not be very useful.
So maybe I can use the found words to aid in finding the real terms, but a pure list of words is of little value.
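One way to use the found words to get at the real terms is to count which words appear immediately before a frequent word. The Python sketch below is only an illustration under simple assumptions: the stop-word list is a tiny made-up sample, tokenising is a plain letters-only split, and the sample text is invented.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to", "for", "on"}  # illustrative only

def words(text):
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def word_counts(text):
    """Frequency of every word that is not in the stop-word list."""
    return Counter(w for w in words(text) if w not in STOP_WORDS)

def preceding_word_counts(text, target):
    """Count the two-word combinations formed by `target` and the word before it."""
    tokens = words(text)
    return Counter(
        f"{prev} {cur}"
        for prev, cur in zip(tokens, tokens[1:])
        if cur == target and prev not in STOP_WORDS
    )

sample = ("Position accuracy depends on the antenna. Height accuracy is worse. "
          "Speed accuracy and position accuracy are both reported.")
counts = word_counts(sample)
pairs = preceding_word_counts(sample, "accuracy")
print(counts["accuracy"], pairs.most_common())
```

The single-word count only says that "accuracy" is frequent; the second count surfaces candidate terms such as "position accuracy" that a person can then accept or reject.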
 

EdK

Registered User.
Local time
Today, 07:57
Joined
May 22, 2013
Messages
42
MajP --
"So maybe I can use the found words to aid in finding the real terms, but a pure list of words is of little value."- your comment.

Agreed. But let's go a step further:

I have a simple search routine in my MS Access database & book writer that connects any two "searched for"/"found" words and makes a list/report. From that list I manually check to the listed occurrences in either the database or the listed book paragraphs, and I use that list for book writing and research purposes, unrelated to indexing.

But it occurs to me (and as you imply_suggest) that your Access_MS Word application could (?easily) be tweaked such that
  1. it finds the linkages between any two words
  2. would also for each "found occasion" know the exact location of each
  3. could therefore produce a list of "two word combination" plus location [+ content, actually, of each combo, as mine does]
  4. use the total final word list (as automatically, then manually, created - as you describe_suggest) - the key here would be to pare down to a final list that is not going to have an exponentially large number of possible combination of words.
  5. Now, it would be a difficult (but not impossible?) task to manually then go through and extract a list for any index, I know that. But a smart guru (such as yourself) could think this through and create something workable.
  6. (I can see that it would be necessary to) have some routine in MS Access (or linked via Access to control MS Word) whereby the operator person can manually combine the essential "word pairs" of his/her choice, armed with the knowledge/understanding of (a) what the guts_meaning of the text is (b) what the reader of the document is likely to want to search for.
  7. As an author, I would be perfectly happy to spend my time (within reason, of course) doing this "value adding" to the raw word list you can already produce. So that a nice simple usable end product is achieved. So that the end user can then decide what amount of time they want to invest in trying to get the perfect index done, manually adding to the computer produced list(s)
  8. I know this is not ideal. but i think a fair amount of judgement_skill could be applied by the person selecting the index "pairs", as they (in my mind) would be mostly the author of the material anyway, and would know the content intimately - so I would look on any "word finder_word pair producer" as a helpful aid, rather than as the total solution. Artificial intelligence isn't, and perhaps never will be (hopefully) as smart as the best human brains.
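The word-pair idea in points 1 to 4 can be sketched roughly as follows. This is a Python illustration, not the Access routine described above; it assumes the paragraph is the unit of location (the post calls it the basic working unit of the book), and the function name and sample paragraphs are hypothetical.

```python
import re
from itertools import combinations

def pair_locations(paragraphs, terms):
    """Map each pair of terms to the 1-based numbers of the paragraphs containing both."""
    wanted = {t.lower() for t in terms}
    result = {}
    for number, paragraph in enumerate(paragraphs, start=1):
        # Which of the wanted terms occur in this paragraph?
        present = sorted(wanted & set(re.findall(r"[a-z]+", paragraph.lower())))
        for pair in combinations(present, 2):
            result.setdefault(pair, []).append(number)
    return result

paragraphs = [
    "The nuts and bolts of getting water.",
    "Book pricing for new authors.",
    "Book pricing and water rights.",
]
locations = pair_locations(paragraphs, ["book", "pricing", "water"])
print(locations)
```

A list of n words gives n(n-1)/2 possible pairs, so paring the word list down first (point 4) keeps the pair list short enough to review by hand.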

I would like to know what search phrase you used to find the instances of "word indexing" programs on the 'Net - I tried early on and (though mainly looking for freeware) didn't find much to get that excited about.

I see you have posted something that "crosses" with what I'm about to post, i.e. this post (like a letter in the post in the olden days where I come from ...). Which I will answer point by point after I have posted this current lot of my comments off. It's pretty simple stuff and despite my mis-reading of the indexing application market, I reckon I'd love to see what you came up with in line with what I've written above. Totally up to you and your inclination and available time etc etc, of course.

Added soon after: I absolutely LOVE your suggestion to (via Access?) auto add chosen index tags to the Word docx, fantastic idea, especially if it can do "pairing" of selected words ----------> phrases.
Yes, I do really need to get more into MS Word, but I have great faith in Access's ability to query and manipulate (in the right hands of selected savants, of course)!!!

The silly (and/or quite valid) reasons I got sucked into Access were that
  • I couldn't figure out how to control the position and behaviour of images in MS Word (!!!!!)
  • I needed to combine book writing function with database function, which (as far as I can see) MS Word fails horribly at, but my clunky setup in MS Access controls and connects the two functions OK.
  • Note: I soon enough found a fairly satisfactory routine in Access to control the selection, storage, and positioning of images into about 8 preselected positions and sizes in book report, for any part of the book (chapter or paragraph) - the basic working unit of the book is the paragraph
  • bottom line: works really well for me but (until its rebirth at the hands of some savvy guru) only I can use it (seems a waste, but it happens)
EdK
 

EdK

Registered User.
Local time
Today, 07:57
Joined
May 22, 2013
Messages
42
I don't know whether my previous post reached MajP, so I am sending it again. EdK
 

arnelgp

..forever waiting... waiting for jellybean!
Local time
Today, 22:57
Joined
May 7, 2009
Messages
19,169
i also did the "exercise".
like the previous, listing the words doesn't do much.
Screenshot_10.png
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 10:57
Joined
May 21, 2018
Messages
8,463
I found a few freeware versions. Take a look at the video associated with this one; you would have to save your document as a PDF.
This does a lot of the stuff I do and more. The key is once your words are extracted you have to be able to search the document by selecting your words. Then you need to determine if you are going to use it or not. The tagging looks very similar to how Word tags a term.

The software in the video has the same problem I was referring to. You are stuck with individual words. So IMO there needs to be a feature to review each occurrence of the word and then Add/Edit words based on context. So in my example I would click on Accuracy and the first occurrence deals with Position Accuracy. I would add that to the found list. Then look at the next occurrence and see that it deals with Speed Accuracy. I would add that to the list. Each time a word is added to the list it would automatically search the document and add to the page numbers.
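The "add a term, then automatically search the document and collect page numbers" step could look something like this Python sketch. It assumes the document is already split into a list of page texts; `add_term`, the sample pages and the whole-word, case-insensitive matching rule are illustrative choices, not features of any of the products mentioned.

```python
import re

def add_term(index, pages, term):
    """Search every page for `term` (case-insensitive, whole words) and record the page numbers."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    index[term] = [n for n, text in enumerate(pages, start=1) if pattern.search(text)]
    return index[term]

pages = [
    "Position accuracy is discussed first.",
    "Speed accuracy matters for tracking.",
    "Position accuracy and height accuracy are compared.",
]
index = {}
add_term(index, pages, "Position Accuracy")
add_term(index, pages, "Speed Accuracy")
print(index)
```

Each call rebuilds the page list for one term, so terms added or edited during review always reflect the current text.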

I would like to know what search phrase you used to find the instances of "word indexing"
Use the search term "book indexing software" instead of "word indexing".
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 10:57
Joined
May 21, 2018
Messages
8,463
This one is a little pricey, but way more advanced. This is more what I was thinking. Watch the video. This appears to extract useable terms not just words.

Here is some interesting reading to actually program keyword extraction.
 

EdK

Registered User.
Local time
Today, 07:57
Joined
May 22, 2013
Messages
42
This one is a little pricey, but way more advanced. This is more what I was thinking. Watch the video. This appears to extract useable terms not just words.

Here is some interesting reading to actually program keyword extraction.
MajP -
  1. been occupied with "other stuff", sorry for delay in replying. You've been very helpful, thank you. Now, I've downloaded trial version of Textract (great name for program) and am tinkering a bit with it. I'm thinking maybe I could/should use the Pro SP (Pro single project) option US$119 + VAT, as it is just what I would use it for - one project (my book). the Pro SP version appears to allow extended use under one "umbrella", and I'm hoping (and thinking) it would allow repeated use of progressively updated pdfs of the same single project. If it does, then it would suit my methods, as I progressively keep my "editor eye" (I self-edit as I write ie edit stuff I've recently done).
  2. As a test, I created a pdf of the structure of my book (not the content, but the structure). This pdf contained words that make up the following - Chapters; sub-headings; paragraph titles. I ran Textract over that and I produced an "A to H" index (that's the limit of the trial version). I discovered that the index could have an unexpected benefit for authors (including me) during the writing of any complex book - bits of it could suggest structural and/or content issues, during the writing of the book
  3. How do I know that? A little story - I ran my eye over the newly printed index. I noticed the index entry "bolts, 4". So I looked at page 4 of my Access produced structural report. There I noticed that the word "bolts" appeared in two separate paragraph headings, both headings had the exact same paragraph title. "The nuts and bolts of getting water". Clearly a duplication. so I flagged both paragraphs as needing rationalising/combining. Sort of useful.
  4. Advanced, you say. Yes. I'm not sure at my age I can spend the sort of time I would need to, to get my money's worth of indexing prowess out of it.
  5. Quote (yours): "Each time a [context] word is added to the list it would automatically search the document and add to the page numbers." - yes. It has to be a progressive thing. And that is why for their Pro SP version, even though it can only be used for one project, they have to allow the user to be able to go back into the index-making part and add/modify. Otherwise nobody would buy it.
  6. I will now look at the indexgenerator video link, thanks for that link
  7. I'm sure you're at least as smart as the people who designed Textract, by the way.
 

EdK

Registered User.
Local time
Today, 07:57
Joined
May 22, 2013
Messages
42
Hi MajP,

I'll go with Index Generator for the time being. I've donated a bit to the makers of it. It looks easy to use and somewhat intuitive. It'll do me. Thank you muchly for bringing it to my attention.

It does have one feature that enables easy grouping of user-selected words, and that will give all the flexibility I need. However, it doesn't seem to allow for saving your WIP index structure (based on e.g. Version 1.0 of one's book) so that it can be picked up and used with a subsequent version of one's book. That's probably too hard for them to do. Instead they warn that "re-use" of a changed PDF in the same project means you have to start again with selecting words, making any previous work obsolete. Or am I getting confused?

Doesn't matter, I'll tinker with this little beauty (Index Generator) and see how I go. That'll take a while, competing with all my other distractions. No wonder the general advice is "don't start to do an index until your final draft is done and set in concrete ...".

Thanks for your interest, I really appreciate it. You've been terrific.

EdK
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 10:57
Joined
May 21, 2018
Messages
8,463
FYI. I only did a cursory Google search on "book Indexing Software" and found those examples. You may find other free or inexpensive ones that are better.

Out of my own interest, I will continue to work on this a little when I get time. I think I have a lot of the pieces from my other tool that could be the basis for the application. If you check back here, I will PM you if I ever get anything usable.
I am currently trying to code Rapid Automatic Keyword Extraction (RAKE). This will greatly improve what I have.

This will allow me to identify key terms and not just single words. So in my example in thread #25 it would then find
"Amazon best sellers"
Not the meaningless single words of
Amazon
best
sellers

The algorithm in theory is very simple. However, implementing it in code efficiently on a large set of data will take significant coding. If you are saving all the words of a document you end up with very large arrays (or other data structures) and then need to search them efficiently to apply your algorithms.

This will probably get me closer to a 75% working solution. The article describes how I can find terms like
Justice Department
but not
Department of Justice
The "of" messes things up in the RAKE algorithm
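For readers who want to see the idea in code, here is a minimal Python sketch of RAKE-style scoring. It is not the Access implementation being written in this thread: the stop-word list is a tiny illustrative sample, and the scoring follows the commonly described degree/frequency scheme. Candidate phrases are runs of words between stop words or punctuation, and each phrase scores the sum of degree/frequency over its words.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "at", "on"}  # tiny illustrative list

def candidate_phrases(text):
    """Split text into runs of non-stop-words; punctuation also ends a run."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Score each candidate phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences within the phrase, including the word itself
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in set(phrases)}

scores = rake_scores("Amazon best sellers. The Justice Department and the Department of Justice.")
for phrase, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:4.1f}  {phrase}")
```

In this sample "amazon best sellers" and "justice department" survive as phrases, while "Department of Justice" is cut at the stop word "of" into the two single words "department" and "justice", which is the limitation described above.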
 

Attachments

  • FindWords.accdb
    3.3 MB · Views: 114

EdK

Registered User.
Local time
Today, 07:57
Joined
May 22, 2013
Messages
42
MajP -

I had a quick look at RAKE. It is intriguing, trying to extract sense/meaning out of "found word" combinations by means of algorithms or similar. The example given seems to work satisfactorily, but the proof will be in the practical end result. I wish you well in applying RAKE to your other applications and/or thoughts.

I've gone right off Index Creator, it fails to create/join together any of my test word extraction exercises into even two word phrases. Textract does this quite well. In re-looking at the purchase options for Textract I see there is the (cheaper) option of buying a full version that can only be used against a single project (which would be my book) and that would satisfy me, as it is for lifetime use, so US$100 or so becomes an OK proposition. There is, I realise now, not much point in doing an index of any book until it is set in concrete (which mine certainly isn't).

Thanks for saying that you might (still) come up with a good 75% solution that puts found words into meaningful phrases, that sounds great, and I am very confident you could certainly do it. So I will check to the forum every so often. 'bye for now (and thanks)

EdK
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 10:57
Joined
May 21, 2018
Messages
8,463
I've gone right off Index Creator, it fails to create/join together any of my test word extraction exercises into even two word phrases.
From what I read, there is a lot of cheap software that is very limited and may not save much work or time.

Here is the list from the American Society for Indexing (ASI), the professional society of indexers, of the top professional-grade software. I imagine they are pricey.

Dedicated Indexing Software

This list of dedicated software geared toward the needs of professional indexers is for informational purposes only. It is not intended to be a comprehensive list of all tools that indexers may use in the course of their work. ASI does not endorse any product.


CINDEX™ (for Windows, and Macintosh)
Indexing Research
Tel: (585) 413-1819
Email: info@indexres.com
URL: http://www.indexres.com

Users who formerly purchased CINDEX™ and obtained support through Leverage Technologies, Inc. are invited to contact Indexing Research for ongoing support and information.

Index Manager (for Windows and Macintosh)
Klarso GMBH
Berlin, Germany
Email: info@index-manager.net
URL: http://index-manager.net
Pilar Wyman, Regional Sales Manager, USA/Canada
Email: wyman@index-manager.net
Tel: (443) 336-5497

Macrex™ (for Windows)
Wise Bytes — Macrex Support Office, North America
Tech. Support: (888) 348-4292
Sales: (888) 348-4292
Email: macrexna@gmail.com
URL: http://www.macrex.com
In North America:
20% discount for ASI members;
$200 discount for students & instructors of approved indexing courses.

SKY Index™ Professional (for Windows)
SKY Software
Tel: (540) 751-4336
Email: sales@sky-software.com
URL: http://www.sky-software.com

TExtract book indexing software
Harry Bego, Texyz
The Netherlands
Tel: +31-30-6700318
Fax: +31-30-3100271
Email: info@texyz.com
URL: http://www.texyz.com
 

EdK

Registered User.
Local time
Today, 07:57
Joined
May 22, 2013
Messages
42
Thanks again, MajP - I've printed off your list and will investigate. I wish you best results in your coding development EdK
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 10:57
Joined
May 21, 2018
Messages
8,463
I've printed off your list and will investigate
I am definitely no authority on this. I just did a little web searching, but this is a much bigger field than I ever thought. Doing a good index is both art and science as far as I can tell, and fully automating it is unlikely. That is why there are expensive applications and services. I do not think there is a free lunch. Either you spend a lot of time or you pay for the service.
 

EdK

Registered User.
Local time
Today, 07:57
Joined
May 22, 2013
Messages
42
I am definitely no authority on this. I just did a little bit of web searches, but this is a lot bigger field then I ever thought. Doing a good index is both art and science as far as I can tell, and fully automating it is unlikely. That is why there are expensive applications and services. I do not think there is a free lunch. Either you spend a lot of time or pay for the service.
Ah, no free lunch. That figures .... Hey, are you saying that AI (Artificial Intelligence) is a forlorn hope_fear? I love the succinct phrases you use. Just for the record, I came across this site - https://docs.marklogic.com/guide/concepts/indexing#id_40941 Those guys haven't given up hope. Maybe somebody is paying them good money to pursue this "dream" - eg NASA or various government militaries. And maybe the professional indexers are (still) in for the long haul. I would love to have known about that profession (say) 60 years ago. Damn. And I've lost the keys to my time machine. Like you, MajP, I seem to have learned quite a bit in the past week or so re book indexing. I wouldn't have missed it for quids .... EdK
 
