Regex and dates

CJ_London

Super Moderator
Staff member
Local time
Today, 17:24
Joined
Feb 19, 2013
Messages
17,398
I'm trying to extract some dates from some text using regex

The test data is "Project start date was 2 January 2015 and"
and " the end date is expected to be 25 April 2015"

and my pattern is "date.*?(\d.*20\d\d)"

individually this returns the correct values. But when the testdata is combined into "Project start date was 2 January 2015 and the end date is expected to be 25 April 2015"

I get returned

"2 January 2015 and the end date is expected to be 25 April 2015"

I'm sure I'm missing something simple - any suggestions?

my code is as follows although I don't think is anything to do with that - I thnk the problem is with the pattern
Code:
 Dim rgex As New RegExp 
 Dim Matches As MatchCollection
 dim target1 as string
  
     target1 = "Project start date was 2 January 2015 and the end date is expected to be 25 April 2015"
 
     rgex.Pattern = "date.*?(\d.*20\d\d)"
     Set Matches = rgex.Execute(target1)
     Debug.Print Matches.Item(0).SubMatches.Item(0)
 
Last edited:
You can do it without Regex Chris but I'll need to fire up regex and test it if you want a regex solution. What would you prefer, a simpler version or regex?

And can you confirm if it's always going to be in the dd mmmm yyyy format?
 
Hi vbaInet

I'd prefer regex, I'm testing in VBA (using reference Microsoft VBScript regular Expressions v5.5). Although I can do it using functions such as split, instr, like etc but I need a more generic tool which users can use and is easier to manage in the long run.

In this test, I'm just trying to identify dates (this works for 2 January 2015, 2/1/2015, 02/01/2015, 2 Jan 2015 for example which can all be converted to a datetime value with cDate).

The idea is the user will be prompted for a keyword and the system builds the pattern - so if it is just 'find all dates', the pattern would just be (\d.*20\d\d). But they may want to filter it with "preceded with the keyword 'Date'" or "preceded with the word 'Date' within 20 characters" - but one step at a time.
 
Ok, I'll need to refresh my memory on Regex because it's been long I programmed so I might be a while. Probably later in the day if I get a chance.
 
Sorry CJ_London, only had time to look into this. Here's how you can do it:
Code:
Public Function ExtractDates(ValueString As String) As MatchCollection
    Dim rgxObject       As VBScript_RegExp_55.RegExp
    Dim rgxMatchesCol   As VBScript_RegExp_55.MatchCollection
    Dim rgxMatch        As VBScript_RegExp_55.Match
    
    Const STR_MONTHS    As String = "January|February|March|April|" & _
                                    "May|June|July|August|" & _
                                    "September|October|November|December"
    
    Set rgxObject = New VBScript_RegExp_55.RegExp
    
    With rgxObject
        .Global = True
        .MultiLine = False
        .IgnoreCase = True
        .Pattern = "\d?\d\s(" & STR_MONTHS & ")\s(\d{4}|\d{2})"
        
        Set rgxMatchesCol = .Execute(ValueString)
    End With
    
    Set ExtractDates = rgxMatchesCol
End Function
Translates to:
* Zero or one digit - \d?
* One digit then one space - \d\s
* One of the months in the string - STR_MONTHS
* One space - \s
* Four digits or two digits - (\d{4}|\d{2}) - the higher order (i.e four digits) comes first.

Hope that helps.
 
Forgot to remind you that you should do an IsDate() check on the resulting dates.
 
In this test, I'm just trying to identify dates (this works for 2 January 2015, 2/1/2015, 02/01/2015, 2 Jan 2015 for example which can all be converted to a datetime value with cDate).
Oops... just realised that I was only using your first example in my first code so here's a revised version that matches the other date variants:
Code:
Public Function ExtractDates(ValueString As String) As MatchCollection
    Dim rgxObject       As VBScript_RegExp_55.RegExp
    
    Const STR_MONTHS    As String = "Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|" & _
                                    "May|June?|July?|Aug(ust)?|" & _
                                    "Sept(ember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?"
    
    Set rgxObject = New VBScript_RegExp_55.RegExp
    
    With rgxObject
        .Global = True
        .MultiLine = False
        .IgnoreCase = True
        .Pattern = "\d?\d[COLOR="red"]([/COLOR](\s(" & STR_MONTHS & ")\s)|([/\-\.]\d?\d[/\-\.])[COLOR="Red"])[/COLOR][COLOR="Blue"]([/COLOR][1-2]\d{3}|\d{2}[COLOR="blue"])[/COLOR](?=[\s\.])"
        
        Set ExtractDates = .Execute(ValueString)
    End With
End Function
* The red group is the month boundary, i.e. " Jul ", " July ", "[/-.]6[/-.]", "[/-.]06[/-.]", where [/-.] denotes the char list "/", "-" or "."
* The blue group is the year boundary, i.e. "1990" or "90" and a 4 digit year cannot begin with 0.

And ignore my post about remembering to use IsDate(), I see that you have that covered.
 
Last edited:
Hi vbaINet

Thanks for taking a look, I tried that on the test data and it does now return two values, unfortunately it returns ' 14 ' and '/1/' -I guess that is to do with the positioning of the round brackets. But thanks for a much more robust way of finding dates. I'll see if I can develop it further

With my code I think the issue is with the need to place a ? to stop regex from being 'greedy'

with the individual strings, and using (\d.*?20\d\d), there is only 1 '20\d\d', but when they are in the string there are two, so regex starts with the first '(\d' but ends with the second '20\d\d' - and I need it to stop with the first '20\d\d' and continue looking for the second '(\d' - hope that makes sense!
 
No problem CJ_London. Give me an example of a string where ' 14 ' and '/1/' are returned?

With yours the ? should stop it from being greedy, I'll need to look into both patterns.
 
my test string, but I modified the first date to check my pattern worked for multiple date formats - test string is now

"Project start date was 2/11/2014 and the end date is expected to be 25 April 2015"

The other part of my pattern is to restrict the returned values to those which have the word 'date' before it - "date.*?(\d.*?20\d\d)" so if the test string was

"Project start date was 2/11/2014 and on the 5th Feb 2015 something happened, meanwhile the end date is expected to be 25 April 2015"

what should be returned is 2/11/2014 and 25 April 2015

I struggle to put into words the step by step requirement but this my reading of the pattern search is
date - find the word 'date'
.*?(\d - look forward one char at a time until a digit is found (and match found return digit if rest of pattern found)
.*?2 - look forward one char at a time until a 2 is found (and still within brackets so return all characters)
0 - still in match if this is a 0
\d\d) - still in match if two digits and close return value

I've tried variations such as

\d{2} instead of \d\d
.{1,20}? instead of .*?

it still reports the dates correctly as dates, but only the first one.

I should also say I've set the regex properties as
.Global = True
.MultiLine = False
.IgnoreCase = True
 
We can add the "date" string part at a latter stage without any problems.

"Project start date was 2/11/2014 and the end date is expected to be 25 April 2015"

With the new test string, what results do you get with yours and mine?
 
Going like for like

target string:"Project start date was 2/11/2014 and the end date is expected to be 25 April 2015"
___
Code:
Const STR_MONTHS As String = "Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|" & _
        "May|June?|July?|Aug(ust)?|" & _
        "Sept(ember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?"
pattern:"\d?\d((\s(" & STR_MONTHS & ")\s)|([/\-\.]\d?\d[/\-\.]))([1-2]\d{3}|\d{2})(?=[\s\.])"

number of matches: 14
values returned:
0 - /11/
1 -
2 -
3 -
4 -
5 -
6 -
7 -
8 -
9 -
10 -
11 -
12 - /11/
13 - 2014
___
pattern:"(\d.*?20\d\d)"

number of matches: 1
values returned:
0 - 2/11/2014
 
Now that's very weird. Are you using the latest regular expression library, 5.0 I think?
 
yes - version 5.5. This is my complete code which sits in a module

Code:
Private Sub testRegex()
Dim rgex As New RegExp
Dim Matches As MatchCollection
Dim i As Integer
 
Const STR_MONTHS    As String = "Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|" & _
                                    "May|June?|July?|Aug(ust)?|" & _
                                    "Sept(ember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?"
'target1 = "Project start date was 2/11/2014 and "
'target1 = "the end date is expected to be 25 April 2015"
target1 = "Project start date was 2/11/2014 and the end date is expected to be 25 April 2015"

 If rgex.test(target1) Then
    With rgex
       .Multiline = True
       .Global = True
       .IgnoreCase = True
    End With
    
   rgex.Pattern = "(\d.*?20\d\d)"
   'rgex.Pattern = "date.*?(\d[/\a-z \d]{1,20}20\d{2})"
   'rgex.Pattern = "\d?\d((\s(" & STR_MONTHS & ")\s)|([/\-\.]\d?\d[/\-\.]))([1-2]\d{3}|\d{2})(?=[\s\.])"
   Set Matches = rgex.Execute(target1)
    
    On Error Resume Next
    
    Debug.Print Matches.Item(0).SubMatches.Count
    For i = 0 To Matches.Item(0).SubMatches.Count
        Debug.Print i & " - "; Matches.Item(0).SubMatches.Item(i)
    Next i
End If
 End Sub
 
Ah ok, I see the problem, you're using the wrong collection. What you want is the MatchCollection as in my original code. And also the Test() part isn't necessary because there's no pattern to test against. I've re-jigged your code:
Code:
Public Sub testRegex()
    Dim rgex As New RegExp
    Dim Matches As MatchCollection
    Dim rgexMatch As Match
    Dim target1 As String
    Dim i As Integer
    
    Const STR_MONTHS    As String = "Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|" & _
                                        "May|June?|July?|Aug(ust)?|" & _
                                        "Sept(ember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?"
    'target1 = "Project start date was 2/11/2014 and "
    'target1 = "the end date is expected to be 25 April 2015"
    target1 = "Project start date was 2/11/2014 and the end date is expected to be 25 April 2015 "
    
    With rgex
        .MultiLine = True
        .Global = True
        .IgnoreCase = True
        
'        .Pattern = "(\d.*?20\d\d)"
'        .Pattern = "(\d.*?20\d\d)"
        '.Pattern = "date.*?(\d[/\a-z \d]{1,20}20\d{2})"
        .Pattern = "\d?\d((\s(" & STR_MONTHS & ")\s)|([/\-\.]\d?\d[/\-\.]))([1-2]\d{3}|\d{2})(?=([\s\.])|$)"
    End With
     
    Set Matches = rgex.Execute(target1)
    
    For Each rgexMatch In Matches
        i = i + 1
        Debug.Print i & " - " & rgexMatch.Value
    Next
End Sub
 
Aha!, nothing wrong with the pattern, just the way I was looking at the output.

Both versions now work, and if I change the pattern to include the 'date filter' (my terminology;))

Debug.Print i & rgexMatch.SubMatches(0)

returns the date as well

Once again - many thanks
 
I tweaked my pattern a bit (look at the very end) for it to work.

If you're looking at SubMatches() then it's not returning the right results. The hierarchy is like this (from bottom up):

SubMatches > Match > MatchCollection

So all your matches will be in the MatchCollection and to retrieve a match you look in the Match item not the submatches. Submatches breaks down your match into sub-units and you look at that if you want to do what is called back-referencing.
 
I just came across this site and I thought I should share. It does a good job of validating regex and highlights the matched string(s).

Happy regexing ;)
 

Users who are viewing this thread

Back
Top Bottom