Use search with regex to find Korean characters using Python

Using Python 2.7.9 on Windows 8.1 Enterprise 64-bit

I’m using the following code to search for any Korean characters ( http://lcweb2.loc.gov/diglib/codetables/9.3.html )

line = ['x', 'y', 'z', '쭌', 'a']

if any([re.search("[%s-%s]" % ("\xE3\x84\xB1".decode('utf-8'), "\xEC\xAD\x8C".decode('utf-8')), x) for x in line[3:]]):
    print "found character"

Whenever I run the script and give it the following character, the console shows ∞¡î, which I’m guessing is a result of IDLE / Command Prompt being unable to display Korean characters.

쭌 is the last character that I was hoping to match in the regex.

So is the above search correct at least? I’d prefer to know I have the right pattern to search for before I spend time trying to make the console show the proper Korean characters.

I’ve tried running chcp 1252 in the command prompt, and nothing changed. The script never prints “found character”, so I can’t tell whether the pattern works.

If it helps, the script is receiving text from an IRC channel where Korean is usually spoken.

Answer

Use Unicode strings (note the “u” prefixes):

import re

line = [u'x', u'y', u'z', u'쭌', u'a']

if any([re.search(u'[\u3131-\ucb4c]', x) for x in line[3:]]):
    print "found character"
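
If the text comes straight off an IRC socket it will usually be a byte string, so it has to be decoded before a Unicode pattern can match it. Here is a minimal sketch of that step, assuming the channel sends UTF-8 (an assumption; IRC itself does not mandate an encoding). The names HANGUL_RE and contains_korean are made up for illustration. Note also that a single range from U+3131 to U+CB4C covers the CJK ideographs as well and stops partway through the Hangul syllables, so the sketch swaps in the Hangul Compatibility Jamo and Hangul Syllables blocks instead:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# Hangul Compatibility Jamo (U+3131-U+318E) and
# Hangul Syllables (U+AC00-U+D7A3), written as escapes so the
# source file encoding does not matter.
HANGUL_RE = re.compile(u'[\u3131-\u318e\uac00-\ud7a3]')


def contains_korean(text, encoding='utf-8'):
    """Return True if text holds at least one Hangul character.

    Byte strings are decoded first; UTF-8 is an assumption about
    what the IRC server sends, not something the protocol guarantees.
    """
    if isinstance(text, bytes):
        # 'replace' keeps malformed input from raising an exception.
        text = text.decode(encoding, 'replace')
    return HANGUL_RE.search(text) is not None


# b'\xec\xad\x8c' is the UTF-8 encoding of U+CB4C, as raw bytes.
line = [u'x', u'y', u'z', b'\xec\xad\x8c', u'a']

if any(contains_korean(x) for x in line[3:]):
    print("found character")
```

Because the check returns a plain boolean, you can confirm the match works without the console having to render any Korean glyphs.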
