I have a file that looks something like this:
select a,b,c FROM Xtable
select a,b,c FROM Vtable
select a,b,c FROM Atable
select d,e,f FROM Atable
I want to get a sorted map:
{
  "Atable": ["select a,b,c FROM Atable", "select d,e,f FROM Atable"],
  "Vtable": ["select a,b,c FROM Vtable"],
  "Xtable": ["select a,b,c FROM Xtable"]
}
The keys of the sorted map should be the table names, and each value should be the list of query lines for that table.
I started off with this, but I'm stuck on tokenizing the line for regex matching:
import re

f = open('mytext.txt', 'r')
x = f.readlines()
print x
f.close()
for i in x:
    p = re.search(".* FROM ", i)
    # now how to tokenize and get the value that follows FROM?
Answer
You can use a combination of defaultdict and regular expressions. Let lines be a list of your lines:
import re
from collections import defaultdict

pattern = r"(select .+ from (\S+).*)"
results = defaultdict(list)
for line in lines:
    query, table = re.findall(pattern, line.strip(), flags=re.I)[0]
    results[table].append(query)
Actually, the right way to read the file would be:
with open('mytext.txt') as infile:
    for line in infile:
        query, table = re.findall(pattern, line.strip(), flags=re.I)[0]
        results[table].append(query)
The result is, naturally, a defaultdict. If you want to convert it into an ordered dictionary sorted by table name, pass the sorted items to the OrderedDict constructor:
from collections import OrderedDict

OrderedDict(sorted(results.items()))
# OrderedDict([('Atable', ['select a,b,c FROM Atable', ...
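Putting the pieces together on the sample input, this is a minimal end-to-end sketch; the lines are hard-coded here for illustration, but in practice they would come from the file:

```python
import re
from collections import OrderedDict, defaultdict

# Sample lines from the question.
lines = [
    "select a,b,c FROM Xtable",
    "select a,b,c FROM Vtable",
    "select a,b,c FROM Atable",
    "select d,e,f FROM Atable",
]

pattern = r"(select .+ from (\S+).*)"
results = defaultdict(list)
for line in lines:
    # Group 1 is the whole query, group 2 is the table name after FROM.
    query, table = re.findall(pattern, line.strip(), flags=re.I)[0]
    results[table].append(query)

# Sorting the items gives the keys in alphabetical order:
# Atable, Vtable, Xtable.
ordered = OrderedDict(sorted(results.items()))
print(ordered)
```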
You can make the pattern more robust to keep track of commas, valid identifiers, etc.
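For example, one way to tighten it (an assumption, not the only option) is to require the column list to be identifiers separated by commas and the table name to be a valid SQL identifier:

```python
import re

# Stricter pattern: identifier column list, identifier table name.
# This is a sketch; real SQL allows much more (quoted names, aliases, ...).
strict = r"(select\s+\w+(?:\s*,\s*\w+)*\s+from\s+([A-Za-z_]\w*))"

m = re.search(strict, "select a,b,c FROM Atable", flags=re.I)
print(m.group(2))  # the table name, 'Atable'
```

Lines that are not of this shape simply fail to match, so malformed input can be detected instead of silently producing a wrong table name.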