Remove HTML code from a string

hooi

Hi,

I need to remove HTML code from a text (memo) string. For example, here is the string:

This is part one of the string <DIV style="padding:5px; FLOAT: middle; MARGIN: 5px; WIDTH: 88; BACKGROUND-COLOR: #cccccc; height:8">
<font color="#CC0000">don't care what's the text enclosed here...</font></DIV> and this is part two.


The requirement is to remove everything from "<DIV..." through "</DIV>", so that the final text reads "This is part one of the string and this is part two.". The actual text that I need to process contains more than one occurrence of such HTML code in different places of the memo field.

I've been trying to modify the WordCount() function from http://www.access-programmers.co.uk/forums/showpost.php?p=213511&postcount=12 for a while now, but I'm not much of a programmer. Could anyone help? I believe it's a pretty easy thing to do for those of you out there who are programmers. :)

Thanks...
 
Code:
    Dim a As String
    Dim strLeft As String
    Dim strRight As String
    Dim startPOS As Integer
    Dim endPOS As Integer
    
    a = "This is part one of the string <DIV style=""padding:5px; FLOAT: middle; MARGIN: 5px; WIDTH: 88; BACKGROUND-COLOR: #cccccc; height:8"">" & _
        "<font color=""#CC0000"">don't care what's the text enclosed here...</font></DIV> and this is part two."
    
    startPOS = InStr(1, a, "<") - 1
    While startPOS > 0
        endPOS = InStr(1, a, ">")
        strLeft = Left(a, startPOS)
        strRight = Right(a, Len(a) - endPOS)
        a = strLeft & strRight
        startPOS = InStr(1, a, "<") - 1
    Wend
 
Hi Bodisathva,

Thank you for your reply.

I've incorporated your code as a function as shown below:

Function testString(a) As String

    ' Dim a As String
    Dim strLeft As String
    Dim strRight As String
    Dim startPOS As Integer
    Dim endPOS As Integer

    startPOS = InStr(1, a, "<") - 1
    While startPOS > 0
        endPOS = InStr(1, a, ">")
        strLeft = Left(a, startPOS)
        strRight = Right(a, Len(a) - endPOS)
        a = strLeft & strRight
        startPOS = InStr(1, a, "<") - 1
    Wend
    testString = a

End Function


Tested by calling the function:
testString("This is part one of the string <DIV style=""padding:5px; FLOAT: middle; MARGIN: 5px; WIDTH: 88; BACKGROUND-COLOR: #cccccc; height:8""><font color=""#CC0000"">don't care what's the text enclosed here...</font></DIV> and this is part two.")

The result now is: This is part one of the string don't care what's the text enclosed here... and this is part two.

However, my expected result is: This is part one of the string and this is part two.

How can the code be further modified to exclude the "don't care what's the text enclosed here... " within the "<" and ">" codes?

Thanks.
 
Got the solution now...

Function testString(a, startCode, endCode) As String

    ' Dim a As String
    Dim strLeft As String
    Dim strRight As String
    Dim startPOS As Integer
    Dim endPOS As Integer

    startPOS = InStr(1, a, startCode) - 1
    While startPOS > 0
        ' Note: endPOS lands one character past the end of endCode, so the
        ' character right after it (the space after </DIV> in the test) also goes
        endPOS = InStr(1, a, endCode) + Len(endCode)
        strLeft = Left(a, startPOS)
        strRight = Right(a, Len(a) - endPOS)
        a = strLeft & strRight
        startPOS = InStr(1, a, startCode) - 1
    Wend
    testString = a

End Function

Thank you very much...
 
Hi,

Just realised there are two issues the function does not address:

1. When the startCode "<DIV" appears as the very first character in the input string, it will fail.

2. I need the function to process text in a memo field that contains thousands of characters. When running it on such a string, Access shows the error message "The string returned by the builder was too long. The result will be truncated.".

Any idea how the code can be improved?

Thanks....
 
Code:
startPOS = InStr(1, a, "<DIV") - 1
    While startPOS > 0
        endPOS = InStr(1, a, "</DIV>") + 5
        strLeft = Left(a, startPOS)
        strRight = Right(a, Len(a) - endPOS)
        a = strLeft & strRight
        startPOS = InStr(1, a, "<DIV") - 1
    Wend
This assumes you will always be throwing away everything contained within the DIV tag.
 
Thank you very much for your help Bodisathva.
 
Hi,

I've been able to run the function, but intermittently it fails with "Out of string space (Error 14)" and sometimes with Error 7.

Code:
Function ClearTextCount(a, startCode, endCode) As String

    Dim strLeft As String
    Dim strRight As String
    Dim startPOS As Long
    Dim endPOS As Long
    
    startPOS = InStr(1, a, startCode)
    
    While startPOS > 0
        endPOS = InStr(1, a, endCode) + Len(endCode) - 1
        If startPOS > 1 Then
            strLeft = Left(a, startPOS)
        End If
        strRight = Right(a, Len(a) - endPOS)    ' the debugger stops here
        a = strLeft & strRight
        startPOS = InStr(1, a, startCode) - 1
    Wend
    ClearTextCount = a
    
End Function

The debugger shows that it fails at strRight = Right(a, Len(a) - endPOS)

Microsoft Visual Basic Help says:
Code:
Out of string space (Error 14)

Visual Basic permits you to use very large strings. However, the requirements of other programs and the way you manipulate your strings may cause this error. This error has the following causes and solutions: 

Expressions requiring that temporary strings be created for evaluation may cause this error. For example, the following code causes an Out of string space error on some operating systems: 
MyString = "Hello"
For Count = 1 To 100
MyString = MyString & MyString
Next Count

Assign the string to a variable of another name. 

Your system may have run out of memory, which prevented a string from being allocated. 
Remove any unnecessary applications from memory to create more space. 

For additional information, select the item in question and press F1 (in Windows) or HELP (on the Macintosh).

I have 2GB of RAM and am running Windows Vista, so I don't understand why it says it has run out of memory. Any idea how this problem can be fixed?

Thanks.
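
A possible explanation, which nothing in the thread confirms: ClearTextCount searches for endCode from position 1 on every pass. If an end code is missing, or sits before the next start code (which can happen once a nested <DIV> has been partly removed), strLeft and strRight overlap and the string gets longer on each pass until VBA runs out of string space. A minimal sketch of one way to guard against that follows. StripBetween is a hypothetical name, and the case-insensitive comparison is an assumption.

```vba
' Hypothetical helper, not from the thread: removes every span that starts
' with startCode and ends with the next endCode found after it.
Function StripBetween(ByVal a As String, ByVal startCode As String, _
                      ByVal endCode As String) As String

    Dim startPOS As Long    ' Long, because Integer overflows past 32,767
    Dim endPOS As Long

    ' Nothing sensible to do with an empty delimiter
    If Len(startCode) = 0 Or Len(endCode) = 0 Then
        StripBetween = a
        Exit Function
    End If

    startPOS = InStr(1, a, startCode, vbTextCompare)
    Do While startPOS > 0
        ' Look for the end code only after the start code
        endPOS = InStr(startPOS + Len(startCode), a, endCode, vbTextCompare)
        If endPOS = 0 Then Exit Do    ' unmatched start code: leave the rest alone

        ' Keep what precedes the start code and what follows the end code
        a = Left$(a, startPOS - 1) & Mid$(a, endPOS + Len(endCode))

        ' The string gets shorter on every pass, so the loop terminates
        startPOS = InStr(startPOS, a, startCode, vbTextCompare)
    Loop

    StripBetween = a

End Function
```

Because the test is on the raw InStr result rather than on InStr minus one, a start code at the very first character is handled as well. Called as StripBetween(text, "<DIV", "</DIV>"), it leaves the spaces on both sides of the removed block in place, so the sample string ends up with two spaces between "string" and "and".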
 
It would be interesting to try out Regular Expressions on this challenge.
Code:
' in the declaration section
Private gre                As Object

Function StripDivTag(theString As String) As String

    Dim ReturnString        As String
    
    If gre Is Nothing Then
        Set gre = CreateObject("vbscript.regexp")
    End If
    
    With gre
        .MultiLine = True
        .IgnoreCase = True
        .Global = True
        .Pattern = "<div(.|\n)+?/div>"
        ReturnString = .Replace(theString, vbNullString)
    End With
    StripDivTag = ReturnString
    
End Function
This is late bound, so no references are needed, but scripting needs to be enabled (some turn that off).
 
Hi RoyVidar,

Thank you for the code, it works fine!

Just curious... what is the purpose of (.|\n)+? within the "<div(.|\n)+?/div>"?

Thanks...
 
Code:
.        any character except newline
\n       newline
|        alternator
(.|\n)   either any character except newline or newline
+        one or more occurrence of previous
         which, because the alternation is within a group
         (parens), means one or more occurrence of any or
         newline
?        non greedy/lazy matching

Good that you asked, by the way, because I'm suddenly a bit unsure whether lazy matching is what's needed here. I think this will only matter if you have more than one occurrence of this within the string you are testing.

If you have more than one <div>

"Test <div blah>the first</div>then <div blah>then another</div> finished"

Then you probably want a non-greedy match, to replace each occurrence but not what's between them (i.e., keep the question mark). Non-greedy means finding the shortest match, so here it would replace both and give

"Test then finished"

if you have nested <div> tags,

"Test <div blah>the first <div blah>then inner </div>finish the inner</div>finish first"

Then I think you want either a greedy match, or something where I'd need to sit down and think a bit to be able to suggest an alternative.

Greedy match means match the longest possible string, which would mean the initial <div and the last /div>. For a greedy match (i e, remove the question mark), with a text like above, it would return

"Test finish first"

While non-greedy (with the question mark) would probably return

"Test finish the inner</div>finish first"

For more info, search for regexp or Regular Expressions, but keep in mind that the VBScript implementation differs from several others (it lacks some features, for instance "lookbehind"). Check out the "Cheat Sheet" here http://regexlib.com/default.aspx.
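
The greedy and lazy results for the nested <div> example above can be checked with a small sketch like the one below. It also shows one possible approach to the nested case, offered as an assumption rather than as RoyVidar's method: remove the innermost <div> first and repeat until none are left. The function names are hypothetical, the nested pattern assumes balanced tags, and it relies on negative lookahead, which VBScript regular expressions do support (unlike lookbehind).

```vba
' Hypothetical helpers, not part of the original post.

' Removes <div ...>...</div> spans with either the lazy or the greedy pattern.
Function StripDivWith(ByVal s As String, ByVal lazy As Boolean) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.IgnoreCase = True
    re.Global = True
    If lazy Then
        re.Pattern = "<div(.|\n)+?/div>"    ' shortest possible match
    Else
        re.Pattern = "<div(.|\n)+/div>"     ' longest possible match
    End If
    StripDivWith = re.Replace(s, vbNullString)
End Function

' Assumption: tags are balanced. Removes innermost divs first and repeats.
Function StripNestedDivs(ByVal s As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.IgnoreCase = True
    re.Global = True
    ' A div whose body contains no further "<div": the negative lookahead
    ' (?!<div\b) rejects any position where another div opens.
    re.Pattern = "<div\b[^>]*>(?:(?!<div\b)[\s\S])*?</div>"
    Do While re.Test(s)
        s = re.Replace(s, vbNullString)     ' each pass shortens s
    Loop
    StripNestedDivs = s
End Function

Sub DemoNestedDivs()
    Dim s As String
    s = "Test <div blah>the first <div blah>then inner </div>finish the inner</div>finish first"
    Debug.Print StripDivWith(s, False)   ' Test finish first
    Debug.Print StripDivWith(s, True)    ' Test finish the inner</div>finish first
    Debug.Print StripNestedDivs(s)       ' Test finish first
End Sub
```

With unbalanced markup the innermost-first loop leaves the stray tags behind, so it is worth checking the output on real page content before relying on it.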
 
Hi Roy-Vidar,

Thank you for your detailed explanation.

Lazy matching is just what I needed.

For the project I'm working on now, besides removing the HTML code from the web page content, the aim is to insert snippets of text from another table into the new, clean Content.

Let me elaborate: Table A will have just two fields, containing: Type ID and Snippet, eg:
Code:
Type ID  |   Snippet
======================
  1	   Snippet A text… 
  1	   Snippet B text… 
  1	   Snippet C text… 
  1	   Snippet D text…
  2	   Snippet E text…
  2	   Snippet F text…
  3	   Snippet G text…
  3	   Snippet H text…
  4	   Snippet I text…
  :		   :

Table B for contents will have two fields too:
Code:
Type ID  |   Content
=======================
  1	     Content A 
  1	     Content B
  2	     Content C
  3	     Content D
  :              :

So, depending on the Type of the Content, the requirement is to randomly select Snippets from Table A and insert them into the Content. The basic rules are that one snippet appears before the beginning of the Content and another after the end of it. Then, depending on the number of paragraphs in the Content, additional snippets are placed about two or three paragraphs apart, counting from the top. I know this can be done using a query, but would it be better with VBA programming?
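
Placement by paragraph is awkward to express in a query alone, so a VBA function is one option. The sketch below covers only the string handling, under stated assumptions: paragraphs are separated by a blank line (vbCrLf & vbCrLf), the snippets for the matching Type ID have already been loaded into an array, and the names InsertSnippets, PickRandom and everyN are hypothetical. Reading the snippets from Table A is left out.

```vba
' Hypothetical sketch: names, paragraph separator and spacing are assumptions.
Function InsertSnippets(ByVal content As String, snippets As Variant, _
                        Optional ByVal everyN As Long = 3) As String

    Dim paras() As String
    Dim result As String
    Dim sep As String
    Dim i As Long

    If everyN < 1 Then everyN = 3       ' avoid Mod by zero
    sep = vbCrLf & vbCrLf               ' assumed paragraph separator
    paras = Split(content, sep)         ' zero-based array of paragraphs

    result = PickRandom(snippets)       ' snippet before the content
    For i = LBound(paras) To UBound(paras)
        result = result & sep & paras(i)
        ' Extra snippet after every everyN-th paragraph, except the last one
        If i < UBound(paras) Then
            If (i + 1) Mod everyN = 0 Then
                result = result & sep & PickRandom(snippets)
            End If
        End If
    Next i

    InsertSnippets = result & sep & PickRandom(snippets)   ' snippet after the content

End Function

' Returns one element of the array, chosen with Rnd.
Private Function PickRandom(snippets As Variant) As String
    Dim lo As Long
    Dim hi As Long
    lo = LBound(snippets)
    hi = UBound(snippets)
    PickRandom = snippets(lo + Int(Rnd() * (hi - lo + 1)))
End Function
```

Calling Randomize once before the first use varies the sequence between sessions. The same snippet can be picked more than once, so a rule against repeats would need extra bookkeeping.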

Thanks...
 
