Function to remove illegal xml characters (1 Viewer)

haydenal

Registered User.
Local time
Yesterday, 18:37
Joined
Jan 8, 2009
Messages
52
I've been trying to import an xml file (access 07) and get an import error saying " ...has encountered an error processing the XML schema in file 'Report.xml'. Illegal xml character."

I'm trying to track down a function that might take an xml file as input, remove any invalid characters and output a "validated" xml file. I've been tinkering around with some of the concepts here (http://forums.asp.net/p/1483793/3470197.aspx#3470197) but I'm fairly well stuck. Any suggestions?
 

gemma-the-husky

Super Moderator
Staff member
Local time
Today, 00:37
Joined
Sep 12, 2006
Messages
15,641
try downloading xmldomdoc (somrthing like that) from microsoft

a whole library of functions for managing, parsing and handling xml files

eg - returns a simple function result if the xml file is not well formed
 

haydenal

Registered User.
Local time
Yesterday, 18:37
Joined
Jan 8, 2009
Messages
52
The illegal character appears to be a  (a small box, not sure how to otherwise describe it).

I'm still trying to track down an effective way of eliminating these characters. Ideally, perhaps a function that takes an xml file as input, finds an offending xml character and replaces it with a space? Other thoughts?
 
Last edited:

haydenal

Registered User.
Local time
Yesterday, 18:37
Joined
Jan 8, 2009
Messages
52
I think this snippet of VB.NET will do it, is anyone better than I am at converting to VBA?

Code:
    Public Shared Function RemoveIllegalXMLCharacters(ByVal Content As String) As String

        'Used to hold the output.
        Dim textOut As New StringBuilder()
        'Used to reference the current character.
        Dim current As Char
        'Exit out and return an empty string if nothing was passed in to method
        If Content Is Nothing OrElse Content = String.Empty Then
            Return String.Empty
        End If

        'Loop through the lenght of the content (1) character at a time to see if there
        'are any illegal characters to be removed:
        For i As Integer = 0 To Content.Length - 1
            'Reference the current character
            current = Content(i)
            'Only append back to the StringBuilder valid non-illegal characters
            If (AscW(current) = &H9 OrElse AscW(current) = &HA OrElse AscW(current) = &HD) _
               OrElse ((AscW(current) >= &H20) AndAlso (AscW(current) <= &HD7FF)) _
               OrElse ((AscW(current) >= &HE000) AndAlso (AscW(current) <= &HFFFD)) _
               OrElse ((AscW(current) >= &H10000) AndAlso (AscW(current) <= &H10FFFF)) Then
                textOut.Append(current)
            End If
        Next

        'Return the screened content with only valid characters
        Return textOut.ToString()

    End Function
 

DCrake

Remembered
Local time
Today, 00:37
Joined
Jun 8, 2005
Messages
8,632
Here is a function I have used in the past to replace any illegal characters found in XML

Code:
Function XMLit(szIn As Variant) As String

    If Not IsNull(szIn) Then
        XMLit = Replace(szIn, "&", "&")
        XMLit = Replace(XMLit, "£", "£")
        XMLit = Replace(XMLit, "“", "“")
        XMLit = Replace(XMLit, "€", "€")
        XMLit = Replace(XMLit, "„", " ")
        XMLit = Replace(XMLit, Chr$(148), "”")
        XMLit = Replace(XMLit, Chr$(153), "™")
        XMLit = Replace(XMLit, Chr$(133), "…")
        XMLit = Replace(XMLit, Chr$(137), "‰")
        XMLit = Replace(XMLit, Chr$(144), "")
        XMLit = Replace(XMLit, Chr$(145), "‘")
        XMLit = Replace(XMLit, Chr$(146), "’")
        XMLit = Replace(XMLit, Chr$(147), "“")
        XMLit = Replace(XMLit, Chr$(148), "”")
        XMLit = Replace(XMLit, Chr$(149), "•")
        XMLit = Replace(XMLit, Chr$(150), "–")
        XMLit = Replace(XMLit, Chr$(151), "—")
        XMLit = Replace(XMLit, Chr$(152), "˜")
        XMLit = Replace(XMLit, Chr$(153), "™")
        XMLit = Replace(XMLit, Chr$(154), "š")
        XMLit = Replace(XMLit, Chr$(155), "›")
        XMLit = Replace(XMLit, Chr$(156), "œ")
        XMLit = Replace(XMLit, Chr$(157), "")
        XMLit = Replace(XMLit, Chr$(158), "ž")
        XMLit = Replace(XMLit, Chr$(159), "Ÿ")
        XMLit = Replace(XMLit, Chr$(160), " ")
        XMLit = Replace(XMLit, Chr$(162), "¢")
        XMLit = Replace(XMLit, """", """)
        XMLit = Replace(XMLit, Chr$(163), "£")
        XMLit = Replace(XMLit, Chr$(164), "¤")
        XMLit = Replace(XMLit, Chr$(165), "¥")
        XMLit = Replace(XMLit, Chr$(166), "¦")
        XMLit = Replace(XMLit, Chr$(167), "§")
        XMLit = Replace(XMLit, Chr$(168), "¨")
        XMLit = Replace(XMLit, Chr$(169), "©")
        XMLit = Replace(XMLit, Chr$(170), "ª")
        XMLit = Replace(XMLit, Chr$(171), "«")
        XMLit = Replace(XMLit, Chr$(172), "¬")
        XMLit = Replace(XMLit, Chr$(173), "­")
        XMLit = Replace(XMLit, Chr$(174), "®")
        XMLit = Replace(XMLit, Chr$(176), "°")
        XMLit = Replace(XMLit, Chr$(177), "±")
        XMLit = Replace(XMLit, Chr$(178), "²")
        XMLit = Replace(XMLit, Chr$(179), "³")
        XMLit = Replace(XMLit, Chr$(180), "&H180;")
        XMLit = Replace(XMLit, Chr$(181), "&H181;")
        XMLit = Replace(XMLit, "<", "<")
        XMLit = Replace(XMLit, ">", ">")
        XMLit = Replace(XMLit, Chr$(182), "¶")
        XMLit = Replace(XMLit, Chr$(183), "·")
        XMLit = Replace(XMLit, Chr$(186), "º")
        XMLit = Replace(XMLit, Chr$(187), "»")
        XMLit = Replace(XMLit, Chr$(188), "¼")
        XMLit = Replace(XMLit, Chr$(189), "½")
        XMLit = Replace(XMLit, Chr$(190), "¾")
        XMLit = Replace(XMLit, Chr$(191), "¿")
        XMLit = Replace(XMLit, Chr$(192), "À")
        XMLit = Replace(XMLit, Chr$(193), "Á")
        XMLit = Replace(XMLit, Chr$(194), "Â")
        XMLit = Replace(XMLit, Chr$(195), "Ã")
        XMLit = Replace(XMLit, Chr$(196), "Ä")
        XMLit = Replace(XMLit, Chr$(197), "Å")
        XMLit = Replace(XMLit, Chr$(198), "Æ")
        XMLit = Replace(XMLit, Chr$(199), "Ç")
        XMLit = Replace(XMLit, Chr$(200), "È")
        XMLit = Replace(XMLit, Chr$(201), "É")
        XMLit = Replace(XMLit, Chr$(202), "Ê")
        XMLit = Replace(XMLit, Chr$(203), "Ë")
        XMLit = Replace(XMLit, Chr$(204), "Ì")
        XMLit = Replace(XMLit, Chr$(205), "Í")
        XMLit = Replace(XMLit, Chr$(206), "Î")
        XMLit = Replace(XMLit, Chr$(207), "Ï")
        XMLit = Replace(XMLit, Chr$(208), "Ð")
        XMLit = Replace(XMLit, Chr$(209), "Ñ")
        XMLit = Replace(XMLit, Chr$(210), "Ò")
        XMLit = Replace(XMLit, Chr$(211), "Ó")
        XMLit = Replace(XMLit, Chr$(212), "Ô")
        XMLit = Replace(XMLit, Chr$(213), "Õ")
        XMLit = Replace(XMLit, Chr$(214), "Ö")
        XMLit = Replace(XMLit, Chr$(215), "×")
        XMLit = Replace(XMLit, Chr$(216), "Ø")
        XMLit = Replace(XMLit, Chr$(217), "Ù")
        XMLit = Replace(XMLit, Chr$(218), "Ú")
        XMLit = Replace(XMLit, Chr$(219), "Û")
        XMLit = Replace(XMLit, Chr$(220), "Ü")
        XMLit = Replace(XMLit, Chr$(221), "Ý")
        XMLit = Replace(XMLit, Chr$(222), "Þ")
        XMLit = Replace(XMLit, Chr$(223), "ß")
        XMLit = Replace(XMLit, Chr$(224), "à")
        XMLit = Replace(XMLit, Chr$(225), "á")
        XMLit = Replace(XMLit, Chr$(226), "â")
        XMLit = Replace(XMLit, Chr$(227), "ã")
        XMLit = Replace(XMLit, Chr$(228), "ä")
        XMLit = Replace(XMLit, Chr$(229), "å")
        XMLit = Replace(XMLit, Chr$(230), "æ")
        XMLit = Replace(XMLit, Chr$(231), "ç")
        XMLit = Replace(XMLit, Chr$(232), "è")
        XMLit = Replace(XMLit, Chr$(233), "é")
        XMLit = Replace(XMLit, Chr$(234), "ê")
        XMLit = Replace(XMLit, Chr$(235), "ë")
        XMLit = Replace(XMLit, Chr$(236), "ì")
        XMLit = Replace(XMLit, Chr$(237), "í")
        XMLit = Replace(XMLit, Chr$(238), "î")
        XMLit = Replace(XMLit, Chr$(239), "ï")
        XMLit = Replace(XMLit, Chr$(240), "ð")
        XMLit = Replace(XMLit, Chr$(241), "ñ")
        XMLit = Replace(XMLit, Chr$(242), "ò")
        XMLit = Replace(XMLit, Chr$(243), "ó")
        XMLit = Replace(XMLit, Chr$(244), "ô")
        XMLit = Replace(XMLit, Chr$(245), "õ")
        XMLit = Replace(XMLit, Chr$(246), "ö")
        XMLit = Replace(XMLit, Chr$(247), "÷")
        XMLit = Replace(XMLit, Chr$(248), "ø")
        XMLit = Replace(XMLit, Chr$(249), "ù")
        XMLit = Replace(XMLit, Chr$(250), "ú")
        XMLit = Replace(XMLit, Chr$(251), "û")
        XMLit = Replace(XMLit, Chr$(252), "ü")
        XMLit = Replace(XMLit, Chr$(253), "ý")
        XMLit = Replace(XMLit, Chr$(254), "þ")
        XMLit = Replace(XMLit, Chr$(255), "ÿ")
        XMLit = Replace(XMLit, vbCrLf, "
")
    Else
        XMLit = ""
    End If
    
        
End Function

You just pass the string to the function to clear out the bumf.

David
 

haydenal

Registered User.
Local time
Yesterday, 18:37
Joined
Jan 8, 2009
Messages
52
Thanks, that's getting me close. Is there any way for me to determine what the character code is for a particular offending character? I added a couple lines of code (shown below) but a particular character is not being removed and still causing me problems. I've also attached a screen shot of what the text looks like. I've gone through various character tables trying to track this particular character down, but I can't seem to find it.

Code:
XMLit = Replace(XMLit, Chr$(127), "")
XMLit = Replace(XMLit, Chr$(129), "")
XMLit = Replace(XMLit, Chr$(141), "")
XMLit = Replace(XMLit, Chr$(143), "")
XMLit = Replace(XMLit, Chr$(144), "")
XMLit = Replace(XMLit, Chr$(157), "")
 

Attachments

  • Clipboard01.jpg
    Clipboard01.jpg
    4.7 KB · Views: 565

gemma-the-husky

Super Moderator
Staff member
Local time
Today, 00:37
Joined
Sep 12, 2006
Messages
15,641
seriously, why do you think its an illegal character - the best thing to do is to use the MS xml dom model - this has a function that validates the structure of a file to determine whether it is a valid xml file or not. the small square may be just a non-printing character, but that could easily be somethingl ike a tab, or other formatting character thats valid in a xml file.

ie how did they generate an xml file, if it contains illegal characters .....
 
Last edited:

haydenal

Registered User.
Local time
Yesterday, 18:37
Joined
Jan 8, 2009
Messages
52
I think it's illegal for a few reasons.

1) I was having issues processing the xml file, though I wouldn't say my vba skills are polished so this was merely a hunch.

2) I opened the file in IE and get an error on the page. The error states "An invalid character was found in text content." It then gives me the line and character numbers of the offending character.

3) I tried to just import the file with access 07 and get an error message that states "Microsoft Office Access has encountered an error processing the XML schema in file 'Report.xml". Illegal xml character."

4) If I remove the particular character in question, the file processes fine.

I understand the MS xml dom model has some validation properties, but don't they just say "yes, the file is valid" or "no, the file is not?" I'm not seeing anything that might help resolve an issue with the file?

At any rate, I plugged the character in to the dialog here http://sharkysoft.com/tutorials/jsa/content/018.html and it shows it's ASCII value 2. I plugged this in and it took care of that problem.

The file was created by some forensic software that downloads cell phone data. I'm assuming that the particular character in question was somehow entered by the cell phone user. The character is in part of a text message.
 

haydenal

Registered User.
Local time
Yesterday, 18:37
Joined
Jan 8, 2009
Messages
52
If I'm passing my xml text (xmlDoc.Text or xmlDoc.xml) to DCrake's function and replacing a few characters, how do I save the replaced text back to the xml file (or a new xml file)?
 
Last edited:

drleroyhenry

New member
Local time
Yesterday, 16:37
Joined
Jul 13, 2011
Messages
1
DCrake
I do not understand why you are doing this:
XMLit = Replace(XMLit, "“", "“")
XMLit = Replace(XMLit, "€", "€")

Aren't you just replacing these characters with themselves?
 

dfasoro

New member
Local time
Today, 01:37
Joined
Sep 4, 2015
Messages
1
Hi, I converted the VB.Net Code Snippet to VBScript for use in scripts or asp pages in conjunction with MSXML.

Code:
Function RemoveIllegalXMLCharacters(Content)
	Dim Copied(), I, J, current, currentW
	J = 0
	Redim Copied(Len(Content))
	
	For I = 1 To Len(Content)
		current = Mid(Content, I, 1)
		currentW = AscW(current)
		
		If (currentW = 9 Or currentW = 10 Or currentW = 13) _
		   Or ((currentW >= 32) And (currentW <= 55295)) _
		   Or ((currentW >= 57344) And (currentW <= 65533)) _
		   Or ((currentW >= 65536) And (currentW <= 1114111)) Then
			Copied(J) = current
			J = J + 1
		End If
	Next
	
	Redim Preserve Copied(J)
	
	RemoveIllegalXMLCharacters = Join(Copied, "")
End Function
 

Users who are viewing this thread

Top Bottom