Unwrap text for Import and reformat as table (1 Viewer)

sxschech

Registered User.
Local time
Today, 01:57
Joined
Mar 2, 2010
Messages
793
Getting close, but missing a step or two. Have a series of text files that need to be cleaned up and reformatted. The issue is two fold. Need to remove hard return wrapping so that each line is complete and then format to a table structure. I built code that imports fine if no wrapping. I tried this link and seemed to work, except in my implementation it is putting in delimiters, so that if a row of data has a comma, it treats that as a separate field rather than a single field. I thought that fixed width would mean that it ignores potential delimiter characters.

http://www.access-programmers.co.uk/forums/showthread.php?t=176573

The other problem is with the logic so that sometimes in is concatenating a new line to the old line. There may be exceptions due to typos and older files that used a tab rather than space, but seems in general like the indication of a wrap termination is that a new record will begin with either a number period and space or a letter (a, b, c, d only) a period and a space. The next part of building the table is tricky because the question needs to be repeated for each response as shown below in phase 2.

Example from plain text file (xxx.txt):
1. This is line one sentence
a. response a
b. response b
c. response c
d. response d
2. This is line two sentence
and it wraps
a. response a
b. response b
c. response c is
also wrapping
d. response d

----------------------------------
Phase 1 result should be:
-----------------------------------
1. This is line one sentence
a. response a
b. response b
c. response c
d. response d
2. This is line two sentence and it wraps
a. response a
b. response b
c. response c is also wrapping
d. response d

----------------------------------
Phase 2 result should be an access table with several cols:
-----------------------------------
Code:
QNO     Question                          Response           Answer
1     This is line one  sentence                a         response a
1     This is line one  sentence                b         response b
1     This is line one  sentence                c         response c
1     This is line one  sentence                d         response d
2     This is line two sentence and it wraps    a         response a
2     This is line two sentence and it wraps    b         response b
2     This is line two sentence and it wraps    c         response c is also wrapping
2     This is line two sentence and it wraps    d         response d
Here is the code that I'm trying to use for unwrapping.
Code:
Private Sub btnUnwrapText_Click()
'Create a new text file that has removed the wrapped text
'assumes that wrapping only occurs on second line.  If
'wrapping exceeds two lines, code won't work.  Haven't had
'a chance to see how to dynamically resolve this issue.
'20150520
'http://www.access-programmers.co.uk/forums/showthread.php?t=217125

    Dim db As DAO.Database
    Dim RS As DAO.Recordset
    Dim stSQL As String
    Dim stLine As String
    Dim stUnWrapped As String
    Dim fso As New FileSystemObject
    Dim ts As TextStream
    Dim stImport As String
    Dim stTableName As String
    
    stImport = selectFile("Text")
    stTableName = Mid(stImport, InStrRev(stImport, "\") + 1)
    stTableName = Mid(stTableName, 1, InStr(stTableName, ".") - 1)
    stTableName = Replace(stTableName, "-", "_")
    stSQL = "Select * FROM " & FileNameWithExt(stImport) & " "
    
    Set db = OpenDatabase(FilePath(stImport), False, False, "Text; HDR=No")
    Set RS = db.OpenRecordset(stSQL)
    Set ts = fso.OpenTextFile(Replace(stImport, ".", "_Rev."), ForWriting, True)
    
    Do Until RS.EOF
        stLine = RS.Fields(0)
        RS.MoveNext
        If Not RS.EOF Then
            stUnWrapped = RS.Fields(0)
        End If
        If Mid(stUnWrapped, 2, 1) <> "." Then
            If Left(stUnWrapped, 1) <> Chr(42) Then
                stLine = stLine & " " & stUnWrapped
            Else
                RS.MovePrevious
            End If
        Else
            RS.MovePrevious
        End If
        ts.WriteLine stLine
        RS.MoveNext
    Loop
    RS.Close
    Set RS = Nothing
    Set db = Nothing
End Sub
For reference, here is code that works fine for files that are properly formatted to begin with (1 row per question/response and no wrapping), in case I need to have a two step process, but perhaps the above code would be able to do the formatting this does as well.

Code:
Sub btnImportExamText_Click()

'Read the text file containing the exam questions
'parse out the line and reformat to enable data to
'flow into the XMLExam table.  Once in the table,
'can view the questions as well as export to XML format
'20150518

'sources:
'http://stackoverflow.com/questions/938796/read-lines-from-a-text-file-but-skip-the-first-two-lines
'http://www.wiseowl.co.uk/blog/s211/readline.htm
    'again, we need this strange thing to exist so that ...
     
    Dim RS As DAO.Recordset
    Dim fso As New FileSystemObject
    'the file we're going to read from
    Dim ts As TextStream
    '... we can open a text file with reference to it
    Dim ThisLine As String
    Dim i As Integer
    Dim stImport As String
    Dim stTableName As String
    Dim stQNo As String
    Dim intQNo As Integer
    Dim intDelim As Integer
    Dim stText1 As String
    Dim stType As String
    Dim stText2 As String
    Dim stCorrect As Boolean
    Dim dupCheck As Integer
    Dim stSQL As String
    
    Set RS = CurrentDb.OpenRecordset("XMLExam")
    
    stImport = selectFile("Text")
    stTableName = Mid(stImport, InStrRev(stImport, "\") + 1)
    stTableName = Mid(stTableName, 1, InStr(stTableName, ".") - 1)
    stTableName = Replace(stTableName, "-", "_")
    
    'Set ts = fso.OpenTextFile("Z:\EXAM\COURSE\COURSE5D15.txt", ForReading)
    Set ts = fso.OpenTextFile(stImport, ForReading)
    'keep reading in lines till no more
    
    i = 0
    intQNo = 1
    
    Do Until ts.AtEndOfStream
        ThisLine = ts.ReadLine
        i = i + 1
'        Debug.Print "Line " & i, ThisLine
        intDelim = InStr(ThisLine, ".")
        stQNo = Left([ThisLine], intDelim)
        If Left(stQNo, 1) >= 0 And Left(stQNo, 1) < 10 Then
            stQuestionno = Left(stQNo, intDelim - 1)
            stText1 = Mid(ThisLine, intDelim + 1)
            'Trial and error determined that Readline goes to next line.  Need to
            'move to a next line if the current line begins with a number, because
            'it is a question and the response answer (A,B,C,D) begins on the
            'following line.  Originally tried .skipline to skip a line, but that
            'ended up going over the next line meaning would miss a row altogether.
            ThisLine = ts.ReadLine
            intDelim = InStr(ThisLine, ".")
            stText2 = Mid(ThisLine, intDelim + 1)
        Else
            stQuestionno = stQuestionno
            stText1 = stText1
        End If
        stType = Left([ThisLine], intDelim)
        Select Case Left(stType, 1)
            Case "A" To "Z"
                stType = Left(stType, InStr(stType, ".") - 1)
                stCorrect = False
            Case Chr(42)
                stType = Mid(stType, 2, 1)
                stCorrect = True
            Case Else
                stType = ""
                stCorrect = False
        End Select
        If Me.optAutoCalc = True Then
            intQNo = intQNo
        Else
            intQNo = stQuestionno
        End If
        'avoid duplicates
        dupCheck = DCount("ProdCode & Text1 & Type & SourceFormat", "XMLExam", "ProdCode = '" & stTableName & "' AND Text1 = '" & Trim(stText1) & "' AND [Type] = '" & stType & "' AND SourceFormat = 'Text'")
        If dupCheck = 0 Then
            RS.AddNew
            RS!ProductCode = stTableName
            RS!Category = "EXAM"
            RS!displayQuestions = Me.txtDisplayQuestions
            RS!passMinimum = Me.txtpassMinimum
            RS!questionNumber = intQNo
            RS!text1 = Trim(stText1)
            RS!type = UCase(stType)
            RS!Text2 = Trim(Mid(ThisLine, 4))
            RS!correct = stCorrect
            RS.Update
        End If
        'Increment Auto Question Number assuming D is last response of current question
        If Me.optAutoCalc = True And stType = "D" Then
            intQNo = intQNo + 1
        End If
    Loop
    'close down the file
    ts.Close
    RS.Close
    If DCount("ProdCode", "XMLExam", "ProdCode = '" & stTableName & "' AND SourceFormat Is Null") > 0 Then
        If Me.optAutoCalc Then
                stSQL = "UPDATE XMLExam SET XMLExam.displayQuestions = " & intQNo - 1 & " " & _
                        "WHERE XMLExam.ProdCode = '" & stTableName & "'" & " " & _
                        "AND XMLExam.SourceFormat Is Null"
                CurrentDb.Execute stSQL
        End If
        If Me.optLastNumberInFile Then
                stSQL = "UPDATE XMLExam SET XMLExam.displayQuestions = " & intQNo & " " & _
                        "WHERE XMLExam.ProdCode = '" & stTableName & "'" & " " & _
                        "AND XMLExam.SourceFormat Is Null"
                CurrentDb.Execute stSQL
        End If
        stSQL = "UPDATE XMLExam SET XMLExam.SourceFormat = 'Text' " & _
                "WHERE XMLExam.ProdCode = '" & stTableName & "'" & " " & _
                "AND XMLExam.SourceFormat Is Null"
        CurrentDb.Execute stSQL
    End If
    Me.Requery
End Sub
 

gemma-the-husky

Super Moderator
Staff member
Local time
Today, 09:57
Joined
Sep 12, 2006
Messages
15,656
is the wrap a "real" carriage return, or is it wrapping because it is just too long to display.

ie notepad would normally (always?) give you line lengths of 1024 characters.
 

sxschech

Registered User.
Local time
Today, 01:57
Joined
Mar 2, 2010
Messages
793
Sorry if wasn't clear...they are hard carriage returns, if it was visually wrapping, then I probably wouldn't have an issue identifying which row belongs where. I am using a plain text editor (Programmer's File Editor) to review these files in order to avoid potential issues such as may crop up using notepad.
 

vbaInet

AWF VIP
Local time
Today, 09:57
Joined
Jan 22, 2010
Messages
26,374
I haven't looked at your code yet but are there always going to be 4 records per QNo? What Chr() code(s) are you using to identify whatever break that's causing the problem?
 

sxschech

Registered User.
Local time
Today, 01:57
Joined
Mar 2, 2010
Messages
793
When properly formatted, the imported text will have 5 rows. First row is the question number and actual question text, next four rows are the responses a, b, c, d. After importing, yes it should total 4 rows per question. The properly formatted text, I read entire line and import as is, store question in memory variable and loop through until I hit the next numeric.

For the "wrapped" text, my thought was trying to read the entire row, put in memory and look at next row, if next row did not begin with ("a. " or "b. " or "c. " or "d. ") or a numeric + period and a space, then append that row to what I grabbed from the previous row. I don't think I can use a chr() specifically to identify a wrap since each line will by default have a hard carriage return regardless of wrapping or not.
 

sxschech

Registered User.
Local time
Today, 01:57
Joined
Mar 2, 2010
Messages
793
One step closer... code now fixes wrapped text as long as it isn't more than 2 lines. Overcame the delimiter problem by counting the number of fields in the recordset and then concatenating each field together via a loop if the other fields aren't null and adding a comma since that was removed by the delimiter process. If anyone has an idea how I can set up a better loop to be able to handle more than two rows of wrapped text, that would be great. I tried a manual repeat of the code section that builds the wrap concatenation, but it produced unexpected results.

Code:
Public Sub btnUnwrapText_Click()
'Create a new text file that has removed the wrapped text
'assumes that wrapping only occurs on second line.  If
'wrapping exceeds two lines, code won't work.  Haven't had
'a chance to see how to dynamically resolve this issue.
'20150520
'Got it to concatenate delimiter using field count properties
'20150526
'http://www.access-programmers.co.uk/forums/showthread.php?t=217125
'https://msdn.microsoft.com/en-us/library/windows/desktop/ms676135%28v=vs.85%29.aspx

    Dim db As DAO.Database
    Dim RS As DAO.Recordset
    Dim stSQL As String
    Dim stLine As String
    Dim stUnWrapped As String
    Dim fso As New FileSystemObject
    Dim ts As TextStream
    Dim stImport As String
    Dim stTableName As String
    Dim linecheck As Integer
    
    'stImport = selectFile("Text")
    stImport = "C:\Temp\TESTFILEOD14.txt"
    stTableName = Mid(stImport, InStrRev(stImport, "\") + 1)
    stTableName = Mid(stTableName, 1, InStr(stTableName, ".") - 1)
    stTableName = Replace(stTableName, "-", "_")
    stSQL = "Select * FROM " & FileNameWithExt(stImport) & " "
    
    Set db = OpenDatabase(filePath(stImport), False, False, "Text; HDR=No")
    Set RS = db.OpenRecordset(stSQL)
    Set ts = fso.OpenTextFile(Replace(stImport, ".", "_Rev."), ForWriting, True)
    
    Do Until RS.EOF
    'Restore text that has been split into multiple fields due to having a comma
        For i = 0 To (RS.Fields.Count - 1)
            If i > 0 And Not IsNull(RS.Fields(i)) Then
                stLine = stLine & ", " & RS.Fields(i)
            Else
                stLine = stLine & RS.Fields(i)
            End If
        Next
        'get next line in file to see if it is actually part of previous line
        RS.MoveNext
        If Not RS.EOF Then
            For i = 0 To (RS.Fields.Count - 1)
                If i > 0 And Not IsNull(RS.Fields(i)) Then
                stUnWrapped = stUnWrapped & ", " & RS.Fields(i)
            Else
                stUnWrapped = stUnWrapped & RS.Fields(i)
            End If
            Next
        End If
' if first characters are between A and D and contain a period, are numeric and contain a period or is an asterisk, then the line is self contained, otherwise it is wrapped and should be concatenated with previous line
        Select Case Left(stUnWrapped, 1)
            Case "A", "B", "C", "D"
                If Mid(stUnWrapped, 2, 1) = "." Then
                    linecheck = 1
                Else
                    linecheck = 0
                End If
            Case "*"
                linecheck = 1
            Case IsNumeric(Left(stUnWrapped, 1))
                If (Mid(stUnWrapped, 2, 1) = "." Or Mid(stUnWrapped, 3, 1) = "." Or Mid(stUnWrapped, 4, 1) = ".") Then
                    linecheck = 1
                Else
                    linecheck = 0
                End If
            Case Else
                linecheck = 0
        End Select
        If linecheck = 0 Then
            stLine = stLine & " " & stUnWrapped
        Else
            RS.MovePrevious
        End If
        
        ts.WriteLine stLine
        stLine = ""
        stUnWrapped = ""
        If Not RS.EOF Then
            RS.MoveNext
        End If
    Loop
    RS.Close
    Set RS = Nothing
    Set db = Nothing
End Sub
 

gemma-the-husky

Super Moderator
Staff member
Local time
Today, 09:57
Joined
Sep 12, 2006
Messages
15,656
few questions

Where does the data come from. can that be fixed?
How long is the text in the split row?
Do you need the second part of the wrapped sentence?
Is there ALWAYS 5 or 6 rows per response. ie a one or 2 line title, and then 4 responses marked a. to d.?
 

vbaInet

AWF VIP
Local time
Today, 09:57
Joined
Jan 22, 2010
Messages
26,374
For the "wrapped" text, my thought was trying to read the entire row, put in memory and look at next row, if next row did not begin with ("a. " or "b. " or "c. " or "d. ") or a numeric + period and a space, then append that row to what I grabbed from the previous row. I don't think I can use a chr() specifically to identify a wrap since each line will by default have a hard carriage return regardless of wrapping or not.
That's a good approach, but I'll break it down further:

1. For each line identify whether it begins with "a/b/c/d. " or a digit followed by ". ", e.g. "1. "
2. If it's the digit format "1. ", then highlight that to be the beginning of a new record
2(a). For the next line check whether it begins with "a/b/c/d. ", if it doesn't you know it's a continuation, if it does, then it's a response.
2(b). If it's a continuation just keep repeating 2(a) until you hit "a/b/c/d. "
3. If it's the "a/b/c/d. " format, then highlight that as a response and repeat 2(a) until you hit a new record.
4. Because you're saving the previous line in the current line, you will need to clean up the very last line just outside your loop.

With regards my Chr() question, I thought that you were having problems identifying what character was causing it to wrap but that doesn't seem to be the case.

And I think gemma-the-husky might be brewing a solution for you too ;)
 

gemma-the-husky

Super Moderator
Staff member
Local time
Today, 09:57
Joined
Sep 12, 2006
Messages
15,656
That's a good approach, but I'll break it down further:

1. For each line identify whether it begins with "a/b/c/d. " or a digit followed by ". ", e.g. "1. "
2. If it's the digit format "1. ", then highlight that to be the beginning of a new record
2(a). For the next line check whether it begins with "a/b/c/d. ", if it doesn't you know it's a continuation, if it does, then it's a response.
2(b). If it's a continuation just keep repeating 2(a) until you hit "a/b/c/d. "
3. If it's the "a/b/c/d. " format, then highlight that as a response and repeat 2(a) until you hit a new record.
4. Because you're saving the previous line in the current line, you will need to clean up the very last line just outside your loop.

With regards my Chr() question, I thought that you were having problems identifying what character was causing it to wrap but that doesn't seem to be the case.

And I think gemma-the-husky might be brewing a solution for you too ;)


I was just wondering how critical the wrapped row was. (assuming it's at most only one row). In which case you could just ignore that row completely. but now I see some of the responses are wrapping.
 

vbaInet

AWF VIP
Local time
Today, 09:57
Joined
Jan 22, 2010
Messages
26,374
@gemma-the-husky: Yeah, some responses wrap that's why he needs to check it each line and use the current line to finalise the previous line(s).
 

sxschech

Registered User.
Local time
Today, 01:57
Joined
Mar 2, 2010
Messages
793
Hi Guys,
To try and answer your questions... These are plain text files containing multiple choice exam questions. The files can come in either as plain text, excel, pdf or word and then converted to plain text. Sure they can be fixed before import, which is what I have been doing, by manually going into each file one by one and visually scanning the rows and then pressing the back space key. After looking at several files this way, one tends to miss something or maybe hit delete or backspace too much and lose words. I had hoped I could automate this process as a lot of older files seem to have been created quickly and missed issues like wrapping. The wrapping effect happened when people copied and pasted data off of pdf into text files. Perhaps wrapping isn't the right word to use for this situation other than it is a continuation of the question or answer, so yes, we would need the entire question or entire answer if it took multiple lines. The multi line issue can happen in either the Question or the Response. Below I put the parens in only to indicate a line number/counter, file does not actually have (1), (2), etc. Asterisk (*) indicates to the online system that this is the correct answer.

INPUT
(1) 1. question sldkjsk fl
(2) slkdk dk
(3) a. response slk f s
(4) sdlkfjd kd dk
(5) slk fkd dl
(6) b. response
(7) *c. response
(8) d. response

OUTPUT Part I
(1) 1. question sldkjsk fl slkdk dk
(2) a. response slk f s sdlkfjd kd dk slk fkd dl
(3) b. response
(4) *c. response
(5) d. response

Then they are reformatted again into xml for upload into the online system that administers the exams.

If you need additional clarification, please let me know.
 

vbaInet

AWF VIP
Local time
Today, 09:57
Joined
Jan 22, 2010
Messages
26,374
I actually addressed all of that in my response in post #8, have a look. But I see that you have introduced a new response type to the mix, "*c. ", is this a typo?
 

sxschech

Registered User.
Local time
Today, 01:57
Joined
Mar 2, 2010
Messages
793
Hi vbaInet,

I didn't explicitly mention the *, sorry about that, but it is in the code. I think the original code I had it as chr(42), but then in the latest revision, used "*" directly.

I understand your answer in theory, but my skills are a bit fuzzy in some areas in practice. So are you saying I don't need a separate variable for the main line and the "wrapped line"? Is the way I've laid out the conditions for letters and numbers followed by period as case statements the right approach or is there a simpler method?
 

vbaInet

AWF VIP
Local time
Today, 09:57
Joined
Jan 22, 2010
Messages
26,374
No problem. Can you upload a sample file that includes all the possible scenarios?

Yes, You don't need separate variables. You just identify what kind of line it is and act accordingly then re-use the same variable. Once I get the upload I'll see if I have time to write some code and advise on yours too.
 

BlueIshDan

&#9760;
Local time
Today, 05:57
Joined
May 15, 2014
Messages
1,122
Replace Chr(10) and Chr(13) with vbNullString?

You may have to do this from a byte array, you may just be able to use the actual Replace function.
 

BlueIshDan

&#9760;
Local time
Today, 05:57
Joined
May 15, 2014
Messages
1,122
With my getfilelines function you could do something far more simple:

Code:
Public Sub ImportData(file)

    Dim str_line As Variant
    Dim var_split() As String
    Dim rs As Recordset: Set rs = CurrentDb.OpenRecordset("table_name")
    
    For Each str_line In GetFileLines(file, True, True, False)
    
        var_split = split(str_line, ".")
        
        If IsNumeric(var_split(0)) Then
            ' Add new Record
            ' rs!sentence = Trim(var_split(1))
        Else
            ' rs(var_split(0)) = Trim(var_split(1))
        End If
    Next

    rs.Close
    
End Sub

Something like that?

If you have an unpredictable number of answers, I can provide the dynamic field check/addition code as well.
 

sxschech

Registered User.
Local time
Today, 01:57
Joined
Mar 2, 2010
Messages
793
vbaInet: I'll try and put together some sample data tonight.

BlueIshDan: Thanks for your code suggestion, looks like it may work with some additional modification. In my initial test, the code will need modification as these are exam questions, a period is not reliable as the sole delimiter. The code aborted when it came to a "wrapped" question that did not have a period at all due to ending with a "?". Additionally I wonder what could happen with a multi sentence question in which it may have more than one period. Presently, I'm only relying on the assumption that a period is an indicator when it appears after the question number or the response letter of A-D or the first character is an asterisk (in which case a period doesn't matter).
 

sxschech

Registered User.
Local time
Today, 01:57
Joined
Mar 2, 2010
Messages
793
I have attached a txt file with sample data. After question #3, I commingled some of the formats I discovered while importing various files today using the manual method of import (fixing the file by opening in text editor and "scrubbing" it before importing). The format of the questions are consistent within the files, that is to say if the questions/responses have a space, all in the file will, if they have a tab between question/response and wording, all will have. If there is a line break, it will be consistent pattern of question, response, line break, question response line break. If it uses parens instead of period they all will have it.
A few exceptions I encountered though:

  • Every once in a while a file would have an empty line between one question, for example no empty lines between questions 1-10, then question 11 has a line break then it's back to no empty lines.
  • A file had two of the questions with responses of a., b., c., a. rather than a., b., c., d.
  • A file had one of the questions omitting response d. (only had a., b., c.) and according to my supervisor, that particular question did not have a "d' response, meaning it was intentionally set that way.
If the variations are too much to deal with, I'd be happy enough with a solution that handles the layout of questions 1-3 in the attachment as that is more the norm. Questions 4-7 would be icing on the cake as it were.
 

Attachments

  • Exam_ple.txt
    1.6 KB · Views: 399

BlueIshDan

&#9760;
Local time
Today, 05:57
Joined
May 15, 2014
Messages
1,122
Wow is this all of the variations or is there bound to be more?
 
Last edited:

BlueIshDan

&#9760;
Local time
Today, 05:57
Joined
May 15, 2014
Messages
1,122
I've got it done, just have to handle a random byte character that show up here and there.
 

Users who are viewing this thread

Top Bottom