parse a text file to separate into multiple files.

wiklendt · Feb 12, 2010

hi,

i was thinking to use a .bat file to parse a text file i have, but don't know where to start.

it is 11 Mb, and contains many many annotations of DNA from 9 whole plasmids.

i need to sort each annotation by plasmid and then by contig (there would be 9 plasmids in total, but many contigs), and then break the file into 9 separate files - one for each plasmid (though with only 9, that part can be done manually).

i don't actually know where to start or how it would be possible to parse. each annotation can have a variable number of lines (which represent 'tokens') and the indents within the annotations ARE significant (but i assume a .bat file that copies text preserves even tabs?).

i was thinking perhaps to break up each annotation into separate text files, then merge them back once they're sorted inside windows explorer, or something?

as far as i can see, they all start with "On GAY4H"..., but that 'word' (beginning with GAY4H) is not always the same length of characters. also, as far as i can tell, no contig (within a plasmid) appears more than once.

i DON'T know if GAY4H sorted would correspond to plasmid/contig sorting (as an analogy, i don't know if one would sort as A B C D E F with the other as 1 2 3 4 5, or if A B C D E F would result in no sorting of any kind with the numbers).

here is an example of a couple of annotations:

Code:

On GAY4HAK01CLXPT (Plasmid 203 Contig 00050) the following tokens appear in this order: 
    ISCR1# (IS; 66) --> (1..324)
----- GAY4HAK01CLXPT ------------------------------


On GAY4HAK01CLXQW (Plasmid 146 Contig 00100) the following tokens appear in this order: 
    Transposon# <-- (1..464)
        Tn3# (Tn; 90) <-- (1..464)
----- GAY4HAK01CLXQW ------------------------------


On GAY4HAK01CLXSY (Plasmid 137 Contig 00003) the following tokens appear in this order: 
    nil-match <-> (1..53)
    qnrB2# (R_gene; 229933) <-- (54..473)
----- GAY4HAK01CLXSY ------------------------------


On GAY4HAK01CLXTF (Plasmid 401 Contig 00014) the following tokens appear in this order: 
    L/Mbackbone# (region; 230137) <-- (1..185)
----- GAY4HAK01CLXTF ------------------------------

thanks heaps - i once tried to learn .bat coding, but got awfully lost on anything beyond the basic "copy file from here, paste it to here".

if anyone can point me to a comprehensive example of parsing text - yay!

gemma-the-husky · Feb 12, 2010

i don't know about dos command files.

i didn't think you could split files with them

I find stuff like this easier to do from inside access, to be honest - that's what i'm best at nowadays, although you really want a standalone executable

rather than work on the section start, i would be inclined to work on the section terminator - the rows with the "-------------------" in them.

so this sort of pseudocode

Code:

opendiskfile
clear newfilestring
while not eodiskfile
    readline
    extend newfilestring
    if line is lastline of a reading then 
       write newfilestring to newdiskfile
       clearstring
    end if
wend
tidyup
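
A rough Python sketch of the pseudocode above, not a finished tool. It assumes every annotation ends with a row starting with "-----", as in the sample posted earlier; the function name `split_annotations` is made up for illustration. The sample lines are copied from the post as displayed, so the indents are spaces here.

```python
def split_annotations(lines):
    """Group lines into annotations, closing each one at its '-----' row."""
    blocks, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not current and not line.strip():
            continue  # skip the blank lines between annotations
        current.append(line)  # indentation is kept exactly as read
        if line.startswith("-----"):
            blocks.append(current)
            current = []
    if current:  # keep a trailing annotation that has no terminator row
        blocks.append(current)
    return blocks


sample = [
    "On GAY4HAK01CLXPT (Plasmid 203 Contig 00050) the following tokens appear in this order: ",
    "    ISCR1# (IS; 66) --> (1..324)",
    "----- GAY4HAK01CLXPT ------------------------------",
    "",
    "",
    "On GAY4HAK01CLXQW (Plasmid 146 Contig 00100) the following tokens appear in this order: ",
    "    Transposon# <-- (1..464)",
    "        Tn3# (Tn; 90) <-- (1..464)",
    "----- GAY4HAK01CLXQW ------------------------------",
]
blocks = split_annotations(sample)
```

Each entry in `blocks` is then one whole annotation, which can be written to its own file or sorted as a unit.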

wiklendt · Feb 12, 2010

hello! i have come to post my success

sorry david, i was offline to work on this. my approach was different to my initial idea.

i first copy/pasted the whole file into excel (only v2007 could handle it b/c it has over 300,000 lines/rows). and quickly added a line count (=A1+1), in case i stuffed up sorting...

(edit: oh, and i used excel once i realised that the indents in the original text file were real tabs, and not just runs of individual spaces, so i knew excel would preserve the indents as new columns - i ended up with 4 data columns - and that they would also be preserved in the conversion back to .txt)

then i used a combination of IF statements as shown below:

to assign a unique number to each annotation:

Code:

=IF(LEFT(B2,6)<>"On GAY", A1,A1+1)

which basically adds 1 to the ID only if it's a new annotation (denoted by the "On GAY" sequence)
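
The same running-ID idea as a small Python sketch, for comparison with the formula. It assumes, like the formula, that a new annotation always starts with "On GAY"; the function name `annotation_ids` is hypothetical.

```python
def annotation_ids(lines):
    """Give every line the number of the annotation it belongs to."""
    ids, current = [], 0
    for line in lines:
        if line.startswith("On GAY"):
            current += 1  # a header row starts the next annotation
        ids.append(current)
    return ids


lines = [
    "On GAY4HAK01CLXPT (Plasmid 203 Contig 00050) the following tokens appear in this order: ",
    "    ISCR1# (IS; 66) --> (1..324)",
    "----- GAY4HAK01CLXPT ------------------------------",
    "",
    "On GAY4HAK01CLXQW (Plasmid 146 Contig 00100) the following tokens appear in this order: ",
]
ids = annotation_ids(lines)
```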

then, to extract the plasmid number:

Code:

=IF((LEFT(E4,6)<>"On GAY"),A4,RIGHT((MID(E4,FIND("Plasmid ",E4,1),11)),3))

and similarly for the Contig.
(the cell numbers might look a bit odd to you, but at each successful answer per column, i copy/pasted as value to preserve during sorting, then added a new A column to begin more data extraction)
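
For anyone scripting this instead, a regular expression can pull the read name, plasmid and contig out of the header row in one go. This is only a sketch based on the header layout in the sample above; `parse_header` is a made-up name. Unlike the fixed-width RIGHT(...,3), it does not assume plasmid numbers are exactly 3 digits, and converting to `int` drops the leading zeros of the contig.

```python
import re

# Header layout taken from the sample: On <read> (Plasmid <n> Contig <n>) ...
HEADER = re.compile(r"^On (\S+) \(Plasmid (\d+) Contig (\d+)\)")


def parse_header(line):
    """Return (read, plasmid, contig) for a header row, or None for any other row."""
    match = HEADER.match(line)
    if match is None:
        return None
    read, plasmid, contig = match.groups()
    return read, int(plasmid), int(contig)


header = parse_header(
    "On GAY4HAK01CLXSY (Plasmid 137 Contig 00003) the following tokens appear in this order: "
)
token = parse_header("    qnrB2# (R_gene; 229933) <-- (54..473)")
```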

this all ensured each row had an ID for each type of thing i wanted to sort by.

then i selected all the data and "inserted" a 'table' (i don't know if this is true for previous versions) which allows you to sort one column without accidentally "un-sync"ing it from the other columns (i.e., it keeps rows together - much like an access record).

then i just sorted the table by plasmid, then contig, then line number (turns out the contig value was repeated, it just wasn't obvious in the original text).
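
A sketch of the same sort in Python, assuming the file has already been split into whole annotations (each a list of lines whose first line is the "On GAY..." header). Python's `sorted` is stable, so annotations with the same plasmid and contig keep their original order, which plays the role of the line-number column. The name `sort_key` is hypothetical.

```python
import re

KEY = re.compile(r"\(Plasmid (\d+) Contig (\d+)\)")


def sort_key(block):
    """Sort key (plasmid, contig) read from the block's header line."""
    match = KEY.search(block[0])  # assumes block[0] is a header row
    return int(match.group(1)), int(match.group(2))


blocks = [
    ["On GAY4HAK01CLXPT (Plasmid 203 Contig 00050) the following tokens appear in this order: ",
     "    ISCR1# (IS; 66) --> (1..324)"],
    ["On GAY4HAK01CLXQW (Plasmid 146 Contig 00100) the following tokens appear in this order: ",
     "    Transposon# <-- (1..464)",
     "        Tn3# (Tn; 90) <-- (1..464)"],
    ["On GAY4HAK01CLXSY (Plasmid 137 Contig 00003) the following tokens appear in this order: ",
     "    nil-match <-> (1..53)",
     "    qnrB2# (R_gene; 229933) <-- (54..473)"],
    ["On GAY4HAK01CLXTF (Plasmid 401 Contig 00014) the following tokens appear in this order: ",
     "    L/Mbackbone# (region; 230137) <-- (1..185)"],
]
ordered = sorted(blocks, key=sort_key)
plasmids_in_order = [sort_key(block)[0] for block in ordered]
```

Sorting whole annotations rather than individual rows means the token lines can never get separated from their header.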

and so that other people without excel 2007 could see it, i selected the relevant columns and copy/pasted them back into a text file (.txt)!

he he he he.... and it only took about 30 min

(i've been to the gym in between)

thanks for your thoughts, david. i wasn't going to put this in access just yet b/c i'm not sure what else will need to be done on the data... but it will be easy to do now, if needed

wiklendt · Feb 12, 2010

oh, and i've now just used the nifty table filters that excel 2007 has and was able to split the data into 9 text files (one for each plasmid) in about 3 min
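
The split into one file per plasmid could also be scripted. A minimal sketch, again assuming whole annotations as lists of lines with the header first; the function name `write_per_plasmid` and the `plasmid_<number>.txt` file names are made up for illustration. The demo writes into a temporary folder so it leaves nothing behind.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path

PLASMID = re.compile(r"\(Plasmid (\d+) Contig \d+\)")


def write_per_plasmid(blocks, out_dir):
    """Write one text file per plasmid and return the paths written."""
    grouped = defaultdict(list)
    for block in blocks:
        match = PLASMID.search(block[0])  # assumes block[0] is a header row
        grouped[match.group(1)].append("\n".join(block))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for plasmid, texts in sorted(grouped.items()):
        path = out_dir / f"plasmid_{plasmid}.txt"  # hypothetical naming scheme
        # two blank lines between annotations, as in the original file
        path.write_text("\n\n\n".join(texts) + "\n")
        paths.append(path)
    return paths


blocks = [
    ["On GAY4HAK01CLXPT (Plasmid 203 Contig 00050) the following tokens appear in this order: ",
     "    ISCR1# (IS; 66) --> (1..324)",
     "----- GAY4HAK01CLXPT ------------------------------"],
    ["On GAY4HAK01CLXQW (Plasmid 146 Contig 00100) the following tokens appear in this order: ",
     "    Transposon# <-- (1..464)",
     "        Tn3# (Tn; 90) <-- (1..464)",
     "----- GAY4HAK01CLXQW ------------------------------"],
]
with tempfile.TemporaryDirectory() as tmp:
    written = write_per_plasmid(blocks, tmp)
    names = [path.name for path in written]
    first = written[0].read_text()
```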
