wiklendt
hi,
i was thinking of using a .bat file to parse a text file i have, but don't know where to start.
it is 11 MB, and contains many, many annotations of DNA from 9 whole plasmids.
i need to sort each annotation by plasmid and then by contig (there are 9 plasmids in total, but many contigs), and then break the file into 9 separate files, one per plasmid (though with only 9, that last part could be done manually).
i don't actually know how it would be possible to parse. each annotation can have a variable number of lines (which represent 'tokens') and the indents within the annotations ARE significant (but i assume a .bat file that copies text preserves even tabs?).
i was thinking perhaps to break up each annotation into its own text file, then merge them back once they're sorted inside windows explorer, or something?
as far as i can see, they all start with "On GAY4H"..., but that 'word' (beginning with GAY4H) is not always the same length of characters. also, as far as i can tell, no contig (within a plasmid) appears more than once.
i DON'T know if GAY4H sorted would correspond to plasmid/contig sorting (as an analogy, i don't know if one would sort as A B C D E F with the other as 1 2 3 4 5, or if A B C D E F would result in no sorting of any kind with the numbers).
here is an example of a couple of annotations:
Code:
On GAY4HAK01CLXPT (Plasmid 203 Contig 00050) the following tokens appear in this order:
ISCR1# (IS; 66) --> (1..324)
----- GAY4HAK01CLXPT ------------------------------
On GAY4HAK01CLXQW (Plasmid 146 Contig 00100) the following tokens appear in this order:
Transposon# <-- (1..464)
Tn3# (Tn; 90) <-- (1..464)
----- GAY4HAK01CLXQW ------------------------------
On GAY4HAK01CLXSY (Plasmid 137 Contig 00003) the following tokens appear in this order:
nil-match <-> (1..53)
qnrB2# (R_gene; 229933) <-- (54..473)
----- GAY4HAK01CLXSY ------------------------------
On GAY4HAK01CLXTF (Plasmid 401 Contig 00014) the following tokens appear in this order:
L/Mbackbone# (region; 230137) <-- (1..185)
----- GAY4HAK01CLXTF ------------------------------
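for what it's worth, the plasmid and contig numbers sit in the header line itself, so the sort key doesn't have to be guessed from the GAY4H name. here is a minimal sketch in Python rather than .bat, assuming every header looks like the four in the sample above; `HEADER_RE` and `parse_header` are made-up names for illustration only.

```python
import re

# Assumed header layout, taken from the sample annotations:
#   On <read id> (Plasmid <number> Contig <number>) the following tokens ...
# \S+ allows the GAY4H... id to be any length.
HEADER_RE = re.compile(r"^On (\S+) \(Plasmid (\d+) Contig (\d+)\)")

def parse_header(line):
    """Return (read_id, plasmid, contig) for a header line, else None."""
    m = HEADER_RE.match(line)
    if m is None:
        return None
    # Convert to int so contig 00100 sorts after contig 00050 numerically.
    return m.group(1), int(m.group(2)), int(m.group(3))

header = ("On GAY4HAK01CLXPT (Plasmid 203 Contig 00050) "
          "the following tokens appear in this order:")
print(parse_header(header))  # ('GAY4HAK01CLXPT', 203, 50)
```

any line that is not a header (token lines, the "-----" footer) returns None, which is one way to tell where an annotation starts.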
thanks heaps - i once tried to learn .bat coding, but got awfully lost on anything beyond the basic "copy file from here, paste it to here".
if anyone can point me to a comprehensive example of parsing text - yay!
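to make the task concrete, here is a rough sketch of the whole job (group lines into annotations, sort by plasmid then contig, write one file per plasmid). it is in Python, not .bat, and it assumes every annotation begins with an "On ... (Plasmid N Contig M)" line like the sample. the function names, `annotations.txt` and the `plasmid_<N>.txt` output names are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed header layout, based on the sample annotations.
HEADER_RE = re.compile(r"^On \S+ \(Plasmid (\d+) Contig (\d+)\)")

def read_annotations(lines):
    """Group lines into (plasmid, contig, [lines]) records."""
    records = []
    for line in lines:
        m = HEADER_RE.match(line)
        if m:
            # A header line starts a new annotation.
            records.append((int(m.group(1)), int(m.group(2)), [line]))
        elif records:
            # Token lines and the '-----' footer stay with their header,
            # unmodified, so tabs and indents are preserved.
            records[-1][2].append(line)
        # Lines before the first header are skipped.
    return records

def group_by_plasmid(records):
    """Return {plasmid: [lines]} with annotations ordered by contig."""
    grouped = defaultdict(list)
    for plasmid, contig, block in sorted(records, key=lambda r: (r[0], r[1])):
        grouped[plasmid].extend(block)
    return dict(grouped)

def split_file(in_path):
    """Write one sorted file per plasmid into the current folder."""
    # newline="" leaves the original line endings untouched.
    with open(in_path, newline="") as src:
        grouped = group_by_plasmid(read_annotations(src))
    for plasmid, lines in grouped.items():
        with open(f"plasmid_{plasmid}.txt", "w", newline="") as out:
            out.writelines(lines)
```

calling `split_file("annotations.txt")` would then produce the per-plasmid files. an 11 MB input fits comfortably in memory, so nothing cleverer than reading it all at once should be needed.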