Importing non-normalised excel data from different files

wiklendt · Oct 7, 2010

Hey everyone.

Intro
I'm going to tackle an import procedure i've been avoiding for a while - mainly because i have no idea of where to start with this.

Please open the attached PDF (the files are excel, but i've printed to PDF for convenience of showing you guys) so you know what i'm talking about....

ok, so. we have a machine which determines antibiotic resistance of any bacterial isolate you throw into it, and for each sample, we can manually export its results in a single excel file per sample. We currently have a hundred or so of these files, but acquire about 3-4 new ones a week.

Information required to pull out
I've highlighted the fields in the PDF that i'd need to import. Noted also is that the header information isn't always the same number of rows down, and there's other data which is an unknown rows also.

In addition to that, the antibiotic names in the first column can change depending on the test performed on the machine.

Also, the values it provides as a result (e.g., "<=1") i will want to separate the, what i call, 'qualifier' from the value and wherever there is no qualifier, i want to add an equals to it (i.e., "8" becomes "=8") and then split to two fields)

Current competencies
i have several normalised tables in my database, and once the data is imported into it, i can match field data to an ID and append the appropriate FK, but i don't know how to start importing this stuff.

The unknown
i know an import specification is normally used, but it requires column and field names to be identical.... well, that's fine except where the data is all over the shop like in our excel files. not helped by the fact i've never done any import whatsoever (aside from a FE/BE scenario, which is totally different anyhow!).

so how would i go about this one? would i need to parse the info first into some other format and then import, or import and try to filter out redundant data, or some other way?

anyone willing to point me in the right direction? how would you do it?

ta.

Trevor G · Oct 7, 2010

Can you attach a verison of the database in Access 2003 format so we can take a look at the table structure etc, if the data is coming of a spreadsheet as indicated, would it be useful to have a second sheet which is linked so the data goes across the columns and you end up with a single row, that would be easier to use a transferspreadsheet macro or code if necessary.

It may also be useful to upload a sample spreadsheet so this can be worked on as well.

wiklendt · Oct 7, 2010

Sure.

What i've done is arranged the relationship view with the relevant tables. Basically anything with "MIC" in the name. Please bear in mind it is still a work in progess, so don't be shy to voice any major (or even minor) database faux pas you spot.

I suppose i should now explain some of the process...

antimicrobials are added to a panel, which form a standard test (there may be several of these standard tests).

to make a test 'run', we combine an isolate with a panel.

this run then can have many results as each panel has multiple antimicrobials.

each result is dependent on the combination of antimicrobial/panel/isolate, and is given a 'qualifier', a value, and an interpretation (as per the machine).

this particular DB attached is missing the 'marker' and 'rules' tables, but they are a 1-M with the result table linked by MICrunID (i.e., any one run (nb: 'run', not 'result') can have many rules/markers associated). these marker and rule tables may look like they have similar data in them and should be in the same table, but the machine sometimes gives the same rule code for different descriptors between the two, so i've kept them as separate tables in my DB. while the excel sheet has a column for rule codes within the antimicrobial results rows, i've never seen the machine put any there, they are always below the antimicrobials results just like in the sample excel file.

DCrake · Oct 7, 2010

In my experience what I have done is to create a second worksheet which has the correct column headings in row 1 and in row 2 I reference the cells in sheet 1 as per the column headings. That way you are turning the data 90 degrees.

Then when you import the data you refer to sheet 2 for the source data.

Trevor G · Oct 7, 2010

What I have done (in part) is look at the spreadsheet and created a second sheet and transposed the headings in Column A from Sheet1 then added some simple formula for the first section to display the data under the headings, that would make it a lot easier to transfer into a table in your database.

I would suggest that the data goes into a temporary table first then set some queries up to copy the data to the correct tables.

I have returned the spreadsheet for now so you can get an idea of what could be done, once you have the spreadsheet sorted I would look to use a macro that will do a transferSpreadsheet, or write it in code to do the same.

If you note David's comments as well this is the way both of us would look to tackle this.

wiklendt · Oct 7, 2010

oops, there's been an misunderstanding/miscommunication.

it's my fault for using the term "non-normalised"... if you look at the sheets, it is actually 'normalised' by way of new data on a new line... but it's all over the shop.

also, i'm not understanding: how does putting everything on one row help when the import into access from excel has to have identical column names? with the varying header info of the excel sheet, this would mean the columns are varying in number, and with the different anitbiotics depending on the panel you use for the test, so are the antibiotic names variable (so basically, no file is guaranteed to have the same headings)...? am i missing something obvious? wouldn't it make more sense to put every thing in two columns and call it "field" and "data"? or something?

Galaxiom · Oct 7, 2010

All suggestions so far involve creating another spreadsheet. However it is obvious (at least to me) that would be completely impractical since every original sheet is going to be different.

The source data needs to be walked through line by line and the content compared with a table of possible values to determine how it is to be dealt with. Compare the value in the first cell of the row with a list of potential values then direct the parsing and extraction of the rest of the row accordingly.

Banana · Oct 8, 2010

If you think you can push it, I'd really want them to use a Excel template. If you can't be sure that there will be a header, it can get dicey, I'd think. Also, what if they type in "Rlue 1391"? That'd get missed over in the VBA code - using a template that also provides the comboboxes (if this is considered appropriate) would help reduce data entry errors and give you a consistent layout for your code to work against.

If I were the company's manager, I would find the excel spreadsheet unacceptable because it introduces uncertainty in terms of knowing that I can parse the required data correctly and uniformly.

wiklendt · Oct 8, 2010

Banana said:
If you think you can push it, I'd really want them to use a Excel template. If you can't be sure that there will be a header, it can get dicey, I'd think. Also, what if they type in "Rlue 1391"? That'd get missed over in the VBA code - using a template that also provides the comboboxes (if this is considered appropriate) would help reduce data entry errors and give you a consistent layout for your code to work against.

If I were the company's manager, I would find the excel spreadsheet unacceptable because it introduces uncertainty in terms of knowing that I can parse the required data correctly and uniformly.

no no no, people aren't manually typing the excel data - they're manually exporting it from the machine which runs the test.

the excel files are 'standard' but when there's no data for a particular heading, the machine doesn't include the heading itself. annoying, but not fixable.

the data changes, of course, but this includes the antimicrobials and the number of them plus the numbers of rules etc.

out of interest i had a look this morning to see if the machine is capable of exporting csv files. it was, but looked like it would be harder to deal with in terms of importing to access (could that just be my naive view?)... see below:

Code:

"EBDAIMCGACFBAMIEJBAFDBIFFBIEKAGIAH","SPECIMEN LAB REPORT - FINAL",08/10/2010  08:25:38AM,"Page 1/1","CIDMLS ICPMR Westmead","EpiCenter Version: V5.75A / V4.71A","CNR Darcy and Hawkesbury Rd
Westmead","Phoenix Instrument Version: 5.75A","Patient Name:",,"Patient ID:",,"Patient Name:",,"Patient ID:",,"Patient Comments:",,"Birth Date:",,"Patient Sex:",,"Patient Sex:",,"Birth Date:",,"Ordering Physician:","Unspecified","User Name:","lab user",":",,":",,":",,":",,":",,"Accession #:","38085e04","Specimen Type:","Unspecified","Hospital Service:","Unspecified","Body Site:","Unspecified","Collection Date:",07/10/2010  02:19:53PM,"Receipt Date:",07/10/2010  02:19:53PM,"Collection Date:",07/10/2010  02:19:53PM,"Receipt Date:",07/10/2010  02:19:53PM,"Antimicrobial Therapy:","Specimen Comments:",,":",,":",,":",,":",,":",,"Isolate Number:","1","Final","Organism Name:","Klebsiella oxytoca","Comments:",,"Isolate Classification:","Significant / Unknown","Taxonomy Notes:","Previously known as:
Aerobacter oxytocum, Bacillus oxytocus, Klebsiella pneumoniae indole + biogroup
Found in environment, food and animal and human feces. Isolated from human respiratory tract, urine, wounds and blood.","Test Types:","ID/AST Combo","Test Name:","NMIC/ID-80","Test Status:","Complete",,"Sequence Number:","422890003545","Lot #:","0090658","Start Date/Time:",07/10/2010  02:21:51PM,"Sequence Number:","422890003545","Lot #:","0090658","Start Date/Time:",07/10/2010  02:21:51PM,"Result Date/Time:",08/10/2010  06:50:15AM,"Location:","1/D17","Location:","1/D17","Result Date/Time:",08/10/2010  06:50:15AM,"Organism Name:","Klebsiella oxytoca","Needs Attention Reasons",,,

edit: in fact, the csv looks useless: it doesn't actually contain the essential bits we need, which are the antimicrobial resistance values (where there is the antibiotic name and qualifier with value). why doesn't it export that data in the csv? dunno.

Banana · Oct 8, 2010

My apologies - I've missed that point. I have to admit it's very odd that a machine-generated import would not always include header or be at least consistent in the layout. At least you don't have to worry about data entry errors (unless the machine in question was that badly designed!) so Galaxiom's advice is probably the best solution - parse the significant cells in the leftmost column and decide what cell it is, then process it accordingly.

wiklendt · Oct 8, 2010

Banana said:
My apologies - I've missed that point. I have to admit it's very odd that a machine-generated import would not always include header or be at least consistent in the layout. At least you don't have to worry about data entry errors (unless the machine in question was that badly designed!) so Galaxiom's advice is probably the best solution - parse the significant cells in the leftmost column and decide what cell it is, then process it accordingly.

No worries - this is all a little inherently messy. I don't think the machine software was ever intended to be transfered from one source to another. The normal use of results in a clinical setting is to print the report, then ring the doctor/clinician with results. However, we're using it for research and therefore are interested in analysing the data and trends etc, so we want the data a little easier to use, be searchable, filterable, etc.

wiklendt · Oct 19, 2010

Banana said:
My apologies - I've missed that point. I have to admit it's very odd that a machine-generated import would not always include header or be at least consistent in the layout. At least you don't have to worry about data entry errors (unless the machine in question was that badly designed!) so Galaxiom's advice is probably the best solution - parse the significant cells in the leftmost column and decide what cell it is, then process it accordingly.

ok, i've had a go at starting on this and haven't got very far, LOL.

at the moment i'm starting from the basics: find a cell with specific value, then grab the value from the cell immediately to its right (according to the current excel structure, described at length in previous posts - see post 1 for attached pdf example).

i've managed to get a dialog box for users to select an xls file, which is successful and the file opens, but then i can't seem to do anything with it...

i.e., i'm getting an error 9 (subscript out of range) on line 90:

(btw, this code is triggered by a button on a form. the only code associated with that button at the moment is calling this function below)

Code:

Option Compare Database

Function fImportPhoenix() As String
10    On Error GoTo Err_fImportPhoenix

          Dim strDialogTitle As String
          Dim PathStrg As String
          Dim msg As String
          Dim relativePath As String
          Dim dbPath As String
         
          Dim ApXL As Object
          Dim xlWBk As Object
          Dim xlWSh As Object
          Dim fld As Field
          Dim strPath As String
          Const xlCenter As Long = -4108
          Const xlBottom As Long = -4107
         
          Dim strSample As String
          Dim strOrganism As String
          Dim strTestName As String

20           strDialogTitle = "Select a file for import"
30           PathStrg = GetOpenFile_CLT(".\", strDialogTitle)
             
         'If no file was selected then the PathStrg variable will be empty.
         'If there was a file selected then.....
40       If PathStrg <> "" Then
         
50            Set ApXL = CreateObject("Excel.Application") 'invoke excel application
60            Set xlWBk = ApXL.Workbooks.Open(PathStrg) 'open the selected file
70            Set xlWSh = xlWBk.Worksheets("Sheet1")
80            ApXL.Visible = False ' don't show the file to the user
              
90            strSample = xlWSh.Cells.Find("Accession #:", LookIn:=xlValues).Offset(0, 1).Value
100           strOrganism = xlWSh.Cells.Find("Organism Name:", LookIn:=xlValues).Offset(0, 1).Value
110           strTestName = xlWSh.Cells.Find("Test Name:", LookIn:=xlValues).Offset(0, 1).Value
              
120           msg = "Sample: " & strSample & Chr(13) & "Org: " & strOrganism & Chr(13) & "Test: " & strTestName
130           MsgBox msg, vbOKOnly, "Search Returns"
              
140      End If

Exit_fImportPhoenix:
150       Exit Function

Err_fImportPhoenix:
          ' "modFeatures" class module required for strings presented below
160       msg = "Error # " & Str(Err.Number) & Chr(13) & " (" & Err.Description & ")"
170       msg = msg & Chr(13) & "in modImportPhoenix | fImportPhoenix"
180       MsgBox msg, vbOKOnly, fstrDBname & ": Error", Err.HelpFile, Err.HelpContext
190       Resume Exit_fImportPhoenix

End Function

wiklendt · Oct 19, 2010

also, how can i get access to 'close' the xls file properly? i.e., so that when i go to my task manager, i don't see as many instances of Excel open as i have times tested the code! (and probably for other considerations i've not thought of....)

Banana · Oct 19, 2010

Easy answer first.

You'd close with the following code:

Code:

xlBk.Save
xlBk.Close
ApXL.Quit

(untested, off-the-top-of-head code)

Consult the help files but you definitely want to look for the methods named Save, Close and Quit. (BTW, sometime it helps when searching on help files, to open Excel and use its VBA editor to get you better search results. In theory, you shouldn't have to do and be able to find Excel's stuff even from Access VBA editor, but I've found that sometime it just doesn't happen.)

wiklendt · Oct 19, 2010

Banana said:
Easy answer first.

You'd close with the following code:

Code:

xlBk.Save xlBk.Close ApXL.Quit

(untested, off-the-top-of-head code)

Consult the help files but you definitely want to look for the methods named Save, Close and Quit. (BTW, sometime it helps when searching on help files, to open Excel and use its VBA editor to get you better search results. In theory, you shouldn't have to do and be able to find Excel's stuff even from Access VBA editor, but I've found that sometime it just doesn't happen.)

Ta. That seems to do the trick.

Banana · Oct 19, 2010

Regarding the other question on working with Excel... well. That's a can of worms! Here's few things that may help you get started.

In general, you need to be familiar with the Cell object and Range object- we use that lot in exporting.

Let's say you want to start by going down then across, parsing the data. A sample code to do this would be like this:

Code:

With xlWs
    For i = 1 To .UsedRange.Rows.Count
        If .Cells(i, 1) = "This is string I'm looking for" Then
            'yayyy, found it
        End If
    Next
End With

This should then enable you to know how big the spreadsheet is and don't go and searching all those empty cells. Obviously the final code is going to be more complicated, but as you said, the file is machine-generated, so you know what structure you can expect. For headers that doesn't disappear randomly, you can use that as "bookmark" to start off your other routines for parsing. For headers that disappear, your best bet may be in analyzing a common pattern, possibly using RegEx if needed to determine if this is a row you want to parse.

Does that help? I know this is kind of abstract and it's been awhile since I worked with Excel but hopefully if you get started and studying Cells & Range you'll come up with specific questions that other can answer more readily?

wiklendt · Oct 19, 2010

Banana said:
Regarding the other question on working with Excel... well. That's a can of worms! Here's few things that may help you get started.

In general, you need to be familiar with the Cell object and Range object- we use that lot in exporting.

Let's say you want to start by going down then across, parsing the data. A sample code to do this would be like this:

Code:

With xlWs For i = 1 To .UsedRange.Rows.Count If .Cells(i, 1) = "This is string I'm looking for" Then 'yayyy, found it End If Next End With

This should then enable you to know how big the spreadsheet is and don't go and searching all those empty cells. Obviously the final code is going to be more complicated, but as you said, the file is machine-generated, so you know what structure you can expect. For headers that doesn't disappear randomly, you can use that as "bookmark" to start off your other routines for parsing. For headers that disappear, your best bet may be in analyzing a common pattern, possibly using RegEx if needed to determine if this is a row you want to parse.

Does that help? I know this is kind of abstract and it's been awhile since I worked with Excel but hopefully if you get started and studying Cells & Range you'll come up with specific questions that other can answer more readily?

Thanks, i'll try it out tonight. Gotta get back to work as i'm now doing this in my lunch breaks :-/

The_Doc_Man · Oct 19, 2010

Hmmm, missed this thread earlier. I'm up to my ... anatomy ... in alligators down here in the swamp land. But I'm doing something at least similar.

First, the CSV file format imports as well as does Excel if you treat it like a spreadsheet. Excel does. Access can.

Second, you have rows and columns. Some of these are identifying information, the rest of these are actually live data. What I did was create a list of keywords I wanted to see. Then I created an Excel object, opened the file (as a workbook), opened the worksheet (probably the only one in the workbook), and started traversing the cell array, first looking for known headers. One axis tells me what I'm seeing. The other axis gives me the specifics. In my case, a row is all about a server we manage, the column is the attributes of the server. Different projects can give us different formats.

So what I do is I have an array that tells me what column contains the data I want for a given field. In the first pass, I build the map of column number and field name. Then row by row, in the SECOND pass, I use the map to extract data from the cells of the spreadsheet one row at a time. And if I have something that might be there, but happens to not be there at the time, I simply assure that when it is stored, it is EMPTY, not NULL.

As far as it goes, I do the excel.close with optional parameter acSaveNo so that it won't save anything in the spreadsheet.

You should always use an object that you create using something like

set objvar = createobject("Excel.Application") (or something like that).

When done, you do objvar.Close with appropriate parameters, followed by one last little precaution: set objvar = Nothing.

That closes it for good.

wiklendt · Oct 21, 2010

thanks for all the tips, guys - i'm now ploughing along and learning lots about Excel VBA (and how different it is from within Access!). I've gotten further than i though i would in just a few days (though it took me a whole day to realise one part was not working because i simply mispelled "worksheet"! LOL)

Banana, your code got me started in the right direction. Got it working for me, too.

I might need more help later with other aspects of this project, but i'll start a new thread for those specific bits (with a link back to here if anyone needs background info).

wiklendt · Oct 26, 2010

hi everyone,

i've posted my final excel manipulation code here. got it working!

i only have to now start on the process of importing the data into access. i'm sure that will be just as fun.

Importing non-normalised excel data from different files

i recommend chocolate

Attachments

Registered User.

i recommend chocolate

Attachments

Remembered

Registered User.

Attachments

i recommend chocolate

Super Moderator

split with a cherry atop.

i recommend chocolate

split with a cherry atop.

i recommend chocolate

i recommend chocolate

i recommend chocolate

split with a cherry atop.

i recommend chocolate

split with a cherry atop.

i recommend chocolate

Immoderate Moderator

i recommend chocolate

i recommend chocolate

Similar threads

Users who are viewing this thread