Regex question (1 Viewer)

Guus2005

AWF VIP
Local time
Today, 05:10
Joined
Jun 26, 2007
Messages
2,641
I have a problem and i want to use REGEX to fix it.
Now i have two problems, i know.

The solution is simple:
Remove all double quotes not directly preceded or directly followed by a semicolon.

The problem is that i have 3000 csv files which need to be checked and corrected if the following occures:
Code:
1;200;345;"Apotheker "Blue tongue"";"Apeldoorn";12;"ABCD12" 
2;232;312;"Cafe "Blue Oyster";"Rotterdam";33;"DCBA21"
When i import those in a table i get an error message on those lines

So i want to fix those files to look like this:
Code:
1;200;345;"Apotheker Blue tongue";"Apeldoorn";12;"ABCD12" 
2;232;312;"Cafe Blue Oyster";"Rotterdam";33;"DCBA21"

I hope you can help
Thanks for your time!
 

Pat Hartman

Super Moderator
Staff member
Local time
Yesterday, 23:10
Joined
Feb 19, 2002
Messages
43,302
Since the records are invalid, you could open the files using the VBA Open method or use FSO. You would read record by record and write each record to the output file.

If this error is rare, you might want to use an exception method. Use TransferText to import each file and log the ones that fail. Then you go back and use the Open or FSO method on the failures.
 

ebs17

Well-known member
Local time
Today, 05:10
Joined
Feb 7, 2020
Messages
1,949
This seems to me to be unsolvable with RegEx.

One can implement a lookahead (finds double quote if no semicolon follows).
Generally there is also a lookbehind (finds double quote if no semicolon precedes it). But this is not implemented in the VB variant of RegEx.

In this case you would need a combination of both.

My idea would be to delete the double quotes in general and set the import specification accordingly. Or is there a special reason for the text delimiters?

i have 3000 csv files
You can do a loop over all files yourself, can't you?
 

Josef P.

Well-known member
Local time
Today, 05:10
Joined
Feb 2, 2023
Messages
827
Could this work with regex: Remove all " unless they are at the beginning/end of the line or are next to ;
 
Last edited:

NauticalGent

Ignore List Poster Boy
Local time
Yesterday, 23:10
Joined
Apr 27, 2015
Messages
6,341
@arnelgp is our resident RegEx SME, hopefully he will see this and weigh in. I am certain it can be done, but I would not know how to do it.

This could be done VBA using the InStr() and InStrRev() functions - the lookahead/lookbehind solution Ebs17 mentioned. Have you considered that?

Edit: From your example and simple Replace(strInput, chr(34)&chr(34),chr(34)) should work. That assumes ALL double quotes are wrong.
 
Last edited:

Josef P.

Well-known member
Local time
Today, 05:10
Joined
Feb 2, 2023
Messages
827
My experiment with RegEx:
Code:
Private Sub RegExTest()
  
   Dim TestStrings(1 To 2) As String
   Dim i As Long
  
   TestStrings(1) = "1;200;345;""Apotheker ""Blue tongue"""";""Apeldoorn"";12;""ABCD12"""
   TestStrings(2) = "2;232;312;""Cafe ""Blue Oyster"";""Rotterdam"";33;""DCBA21"""
  
   For i = LBound(TestStrings) To UBound(TestStrings)
      Debug.Print i
      Debug.Print "in: "; TestStrings(i)
      Debug.Print "out:"; RemoveQuotesInStrings(TestStrings(i))
   Next

End Sub

Private Function RemoveQuotesInStrings(ByVal InputText As String) As String

   Static RegEx As Object
   Dim Temp As String

   If RegEx Is Nothing Then
      Set RegEx = CreateObject("Vbscript.Regexp")
      RegEx.Pattern = "([^;])(\"")([^;])"
   End If
  
   Temp = Replace(InputText, """""", """")
   Temp = RegEx.Replace(Temp, "$1$3")
      
   RemoveQuotesInStrings = Temp

End Function
 

jdraw

Super Moderator
Staff member
Local time
Yesterday, 23:10
Joined
Jan 23, 2006
Messages
15,379
Nice! Seems initial issue is to get the input string into an acceptable string format for Access to avoid errors.
 

Josef P.

Well-known member
Local time
Today, 05:10
Joined
Feb 2, 2023
Messages
827
Temp = Replace(InputText, """""", """")
What could the regex pattern look like so that this line with Replace is not necessary?

/edit:
No sooner have you asked a question than you find the answer. ;)

Code:
RegEx.Pattern = "([^;])(\"")([^;])"
change to:
RegEx.Pattern = "([^;]+)(\""+)([^;]+)"

Code:
Private Function RemoveQuotesInStrings(ByVal InputText As String) As String

   Static RegEx As Object
   Dim Temp As String

   If RegEx Is Nothing Then
      Set RegEx = CreateObject("Vbscript.Regexp")
      RegEx.Global = True
      RegEx.Pattern = "([^;])([""]+)([^;])"
   End If

   Temp = RegEx.Replace(InputText, "$1$3")
  
   RemoveQuotesInStrings = Temp

End Function

/edit 2 ... dont work for
1;200;345;"Apotheker ""Blue tongue"";"Apeldoorn";12;"ABCD12"

/edit 3 .. forgot "Gobal"
 
Last edited:

NauticalGent

Ignore List Poster Boy
Local time
Yesterday, 23:10
Joined
Apr 27, 2015
Messages
6,341
Temp = RegEx.Replace(InputText, "$1$3")
What exactly does this bit do? I googled it and read about numbered substitutions and all, but was hoping you could be more specific about what it is doing in this case...?
 

theDBguy

I’m here to help
Staff member
Local time
Yesterday, 20:10
Joined
Oct 29, 2018
Messages
21,477
What exactly does this bit do? I googled it and read about numbered substitutions and all, but was hoping you could be more specific about what it is doing in this case...?
Just a guess, but I read that as saying to replace any ; followed by one or more " with a single ; based on the pattern used.
 

ebs17

Well-known member
Local time
Today, 05:10
Joined
Feb 7, 2020
Messages
1,949
Code:
RegEx.Pattern = "(...)(...)(...)"
                   $1   $2   $3
Subpatterns are defined in the pattern by round brackets.
A part of them is used for the replacement.
 

NauticalGent

Ignore List Poster Boy
Local time
Yesterday, 23:10
Joined
Apr 27, 2015
Messages
6,341
Code:
RegEx.Pattern = "(...)(...)(...)"
                   $1   $2   $3
Subpatterns are defined in the pattern by round brackets.
A part of them is used for the replacement.
Excellent explanation, thank you
 

Josef P.

Well-known member
Local time
Today, 05:10
Joined
Feb 2, 2023
Messages
827
The next improvements:
Code:
Private Sub RegExTest()
  
   Dim TestStrings(1 To 2) As String
   Dim i As Long
  
   TestStrings(1) = """A"""";1;200;345;""Apotheker """"Blue tongue"""";""Apeldoorn"";12;""ABCD12"""""
   TestStrings(2) = """""B"";2;232;312;""Cafe ""Blue Oyster"";""Rotterdam"";33;""DCBA21"""
  
   For i = LBound(TestStrings) To UBound(TestStrings)
      Debug.Print i
      Debug.Print "in: "; TestStrings(i)
      Debug.Print "out:"; RemoveQuotesInStrings(TestStrings(i))
      Debug.Print String(30, "-")
   Next
  
   Debug.Print "Multi line:"
   Debug.Print "in:"
   Debug.Print Join(TestStrings, vbNewLine)
   Debug.Print "out:"
   Debug.Print RemoveQuotesInStrings(Join(TestStrings, vbNewLine))
  
End Sub

Private Function RemoveQuotesInStrings(ByVal InputText As String) As String

   Static RegEx As Object
   Dim Temp As String

   'Set RegEx = Nothing  '<-- uncomment for testing (reset static var)
   If RegEx Is Nothing Then
      Set RegEx = CreateObject("Vbscript.Regexp")
      RegEx.Global = True
      RegEx.Pattern = "([^;\n\r])([""]+)([^;\r\n])"
   End If

   Temp = RegEx.Replace(InputText, "$1$3")
     
   RemoveQuotesInStrings = Temp

End Function

output in immediate window:
1
in: "A"";1;200;345;"Apotheker ""Blue tongue"";"Apeldoorn";12;"ABCD12""
out:"A";1;200;345;"Apotheker Blue tongue";"Apeldoorn";12;"ABCD12"
------------------------------
2
in: ""B";2;232;312;"Cafe "Blue Oyster";"Rotterdam";33;"DCBA21"
out:"B";2;232;312;"Cafe Blue Oyster";"Rotterdam";33;"DCBA21"
------------------------------
Multi line:
in:
"A"";1;200;345;"Apotheker ""Blue tongue"";"Apeldoorn";12;"ABCD12""
""B";2;232;312;"Cafe "Blue Oyster";"Rotterdam";33;"DCBA21"
out:
"A";1;200;345;"Apotheker Blue tongue";"Apeldoorn";12;"ABCD12"
"B";2;232;312;"Cafe Blue Oyster";"Rotterdam";33;"DCBA21"

Show Regex match online: https://regex101.com/r/UivEbr/1
 
Last edited:

Guus2005

AWF VIP
Local time
Today, 05:10
Joined
Jun 26, 2007
Messages
2,641
thank you all for your input. I'll try it next time at work!
 

ebs17

Well-known member
Local time
Today, 05:10
Joined
Feb 7, 2020
Messages
1,949
After Josef has solved the question 1:1 and comprehensively, I would like to remind you of my hint:
My idea would be to delete the double quotes in general and set the import specification accordingly. Or is there a special reason for the text delimiters?
Code:
Content = Replace(Content, Chr(34), "")
Such replacement is simpler and less prone to failure.
 

Users who are viewing this thread

Top Bottom