Iterating through a unicode string and comparing it with unicode in a Python dictionary

Time: 2022-05-11 22:49:05


I have two Python dictionaries containing information about Japanese words and characters:

  1. vocabDic: contains vocabulary. Key: word; value: dictionary with information about it.
  2. kanjiDic: contains kanji (single Japanese characters). Key: kanji; value: dictionary with information about it.

Now I would like to iterate through each character of each word in vocabDic and look up that character in the kanji dictionary. My goal is to create a CSV file that I can then import into a database as a join table for vocabulary and kanji. My Python version is 2.6. My code is as follows:

    import csv

    kanjiVocabJoinWriter = csv.writer(open('kanjiVocabJoin.csv', 'wb'), delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    kanjiVocabJoinCount = 1

    # loop through dictionary
    for key, val in vocabDic.iteritems():
        if val['lang'] == 'jpn':  # only check Japanese words (compare by value)
            vocab = val['text']
            print vocab
            # loop through vocab string
            for v in vocab:
                test = kanjiDic.get(v)
                print v
                print test
                if test is not None:
                    print str(kanjiVocabJoinCount)+','+str(test['id'])+','+str(val['id'])
                    # write one join row: join id, kanji id, vocab id
                    kanjiVocabJoinWriter.writerow([str(kanjiVocabJoinCount), str(test['id']), str(val['id'])])
                    kanjiVocabJoinCount = kanjiVocabJoinCount+1

If I print the variables to the command line, I get:

vocab: works, prints in Japanese
v (one character of the vocab in the for loop): �
test (character looked up in kanjiDic): None

To me it seems like the for loop messes up the encoding. I tried various functions (decode, encode, ...) but no luck so far. Any ideas on how I could get this working? Help would be very much appreciated.

1 answer

#1


From your description of the problem, it sounds like vocab is an encoded str object, not a unicode object.

For concreteness, suppose vocab equals u'債務の天井' encoded in utf-8:

In [42]: v=u'債務の天井'
In [43]: vocab=v.encode('utf-8')   # val['text']
In [44]: vocab
Out[44]: '\xe5\x82\xb5\xe5\x8b\x99\xe3\x81\xae\xe5\xa4\xa9\xe4\xba\x95'

If you loop over the encoded str object, you get one byte at a time: \xe5, then \x82, then \xb5, etc.

However if you loop over the unicode object, you'd get one unicode character at a time:

In [45]: for v in u'債務の天井':
   ....:     print(v)    
債
務
の
天
井

Note that the first unicode character, encoded in utf-8, is 3 bytes:

In [49]: u'債'.encode('utf-8')
Out[49]: '\xe5\x82\xb5'

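That's why looping over the bytes and printing one byte at a time (e.g. print '\xe5') fails to print a recognizable character: a single byte taken out of a multi-byte utf-8 sequence is not a complete character. It is also why kanjiDic.get(v) returns None, since a lone byte never matches a whole kanji key.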
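
The byte-versus-character difference can be checked directly. The following is a minimal sketch, written for Python 3 (where `bytes` plays the role of the Python 2 `str` in the question); the helper name `try_decode` is made up for illustration.

```python
# Written for Python 3; `bytes` here corresponds to the Python 2 `str`.
text = u'債務の天井'
encoded = text.encode('utf-8')

char_count = len(text)       # number of unicode characters
byte_count = len(encoded)    # number of utf-8 bytes

def try_decode(raw):
    """Return the decoded text, or None if raw is not valid utf-8 on its own."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return None

lone_byte = try_decode(encoded[0:1])    # first byte only: an incomplete sequence
first_char = try_decode(encoded[0:3])   # all three bytes of the first character

print(char_count, byte_count)                      # 5 15
print(lone_byte is None, first_char == text[0])    # True True
```

The lone first byte cannot be decoded at all, while the first three bytes together decode to the first character of the word.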

So it looks like you need to decode your str objects and work with unicode objects. You didn't mention which encoding your str objects use. If it is utf-8, then you'd decode it like this:

vocab=val['text'].decode('utf-8')
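
Put together, the lookup loop might look like the sketch below. It is written for Python 3 and rests on several assumptions that the question does not confirm: that val['text'] holds utf-8 encoded bytes, that the keys of kanjiDic are already unicode (if they are encoded byte strings too, they would need the same decoding), and that the sample data and the function name `build_join_rows` are hypothetical.

```python
def build_join_rows(vocab_dic, kanji_dic, encoding='utf-8'):
    """Return [join_id, kanji_id, vocab_id] rows for every kanji found in a Japanese word."""
    rows = []
    join_id = 1
    for key, val in vocab_dic.items():
        if val['lang'] != 'jpn':        # only check Japanese words
            continue
        text = val['text']
        if isinstance(text, bytes):     # decode once, before looping
            text = text.decode(encoding)
        for ch in text:                 # now one character per iteration
            kanji = kanji_dic.get(ch)
            if kanji is not None:
                rows.append([join_id, kanji['id'], val['id']])
                join_id += 1
    return rows

# Hypothetical sample data, shaped like the dictionaries in the question.
vocab_dic = {
    u'債務の天井': {'id': 10, 'lang': 'jpn', 'text': u'債務の天井'.encode('utf-8')},
    u'debt ceiling': {'id': 11, 'lang': 'eng', 'text': b'debt ceiling'},
}
kanji_dic = {u'債': {'id': 1}, u'天': {'id': 2}}

rows = build_join_rows(vocab_dic, kanji_dic)
print(rows)   # [[1, 1, 10], [2, 2, 10]]
```

The resulting rows can then be handed to the csv writer's writerows() method to produce the join table file.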

If you are not sure what encoding val['text'] is in, post the output of

print(repr(vocab))

and maybe we can guess the encoding.
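
One way to narrow the guess down yourself is to try a few likely encodings and see which ones decode without error. This is only a sketch for Python 3: the candidate list (utf-8, shift_jis, euc_jp) is an assumption about what Japanese text is commonly stored in, the function name `candidate_decodings` is made up, and a decode that succeeds is not proof that the encoding is the right one, so the decoded text still needs to be checked by eye.

```python
def candidate_decodings(raw, encodings=('utf-8', 'shift_jis', 'euc_jp')):
    """Map each encoding that decodes raw without error to the decoded text."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass    # raw is not valid in this encoding
    return results

word = u'債務の天井'
from_utf8 = candidate_decodings(word.encode('utf-8'))
from_sjis = candidate_decodings(word.encode('shift_jis'))

print(from_utf8.get('utf-8') == word)        # True
print(from_sjis.get('shift_jis') == word)    # True
```

If more than one encoding survives, the candidate whose decoded text reads as sensible Japanese is the likely one.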
