Perlex bug notes - kensuke-mi's diary

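I have introduced Perlex on this page many times; in short, it is a morphological lexicon.

On closer inspection, though, it turned out to contain several puzzling anomalies.
For example, take the following entry: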

anga¹tn	108	V	[pred="anga¹tn_____1",cat=V,@2*3plPreparfFam]	anga¹tn_____1	Default	2*3plPreparfFam	%default	v9

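A superscript 1 appears in the middle of the headword. It is the character U+00B9, which belongs to the Latin-1 range of Unicode.
Unicode Character 'SUPERSCRIPT ONE' (U+00B9)
Of course, no such character exists in Persian. I wondered whether it was a special symbol that Perlex uses on purpose, but that is not the case either.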
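
A quick way to confirm what the stray character is: Python's standard `unicodedata` module reports the name and general category of each character. The sketch below runs on the headword from the entry above; `describe_non_ascii` is a hypothetical helper name, not something from Perlex.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (code point, Unicode name, general category) for each non-ASCII character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
        if ord(ch) > 127
    ]


# Headword from the Perlex entry, with the stray character written as an escape.
headword = "anga\u00b9tn"
for row in describe_non_ascii(headword):
    print(*row)
```

The category `No` ("other number") already hints that this is not a letter and has no business inside a headword.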

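In the end, treating it as a bug turned out to be the right reading.
Knowing it is a bug is one thing, but I could not find anywhere to report it.
INRIAGForge: Alexina: Project Filelist
On top of that, I emailed the first author of the paper and got no reply.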

With no other option, I am writing it down here for the record.

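The superscript 1 is not the only case; several other stray characters of the same kind show up.
So I investigated as far as I could and wrote a post-processing script.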

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Post-process a Perlex .lex file: replace the mis-encoded characters."""

# Character as it appears after decoding as latin-1 -> character it should be.
# This list only covers the cases I was able to find.
REPLACEMENTS = {
    '\xb9': 'š',  # shows up as ¹
    '\xba': 'ş',  # shows up as º
    '\xbc': 'ź',  # shows up as ¼
    '\xbf': 'ẓ',  # shows up as ¿
    '\xbe': 'ž',  # shows up as ¾
    '\xfe': 'ţ',  # shows up as þ
    '\xb3': 'ħ',  # shows up as ³
    'â': 'ā',
    'ð': 'đ',
    'è': 'č',
}

# str.translate applies every single-character replacement in one pass.
TABLE = str.maketrans(REPLACEMENTS)

# Read the original as latin-1 and write the fixed file as UTF-8.
# newline='' keeps the original line endings untouched.
with open('N.lex', 'r', encoding='latin_1', newline='') as src, \
        open('test.fix.lex', 'w', encoding='utf-8', newline='') as dst:
    for line in src:
        dst.write(line.translate(TABLE))
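
To decide which characters need an entry in the script in the first place, one option is to count every non-ASCII character in the decoded file and look at the results by eye. A minimal sketch, not the procedure I actually followed; `count_non_ascii` is a hypothetical helper, and the sample lines stand in for the real file.

```python
from collections import Counter


def count_non_ascii(lines):
    """Count each non-ASCII character across an iterable of decoded lines."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if ord(ch) > 127)
    return counts


# Two made-up lines standing in for the decoded contents of a .lex file.
sample = ["anga\u00b9tn\t108\tV\n", "abc\t1\tN\n"]
for ch, n in count_non_ascii(sample).most_common():
    print(f"U+{ord(ch):04X} {ch!r} {n}")
```

For the real data, pass the file object from `open('N.lex', encoding='latin_1')` instead of `sample`.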

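By the way, trying to decode the Perlex files as utf-8 fails. The encoding that works is latin-1.
Why on earth use something that awkward? That was my first reaction, but since the project is run by French developers, latin-1 may simply have been the norm.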
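
The decoding failure is easy to reproduce on a single headword. The sketch below assumes the file stores the stray character as the single byte 0xB9, which is what the script's latin-1 decoding implies; the variable names are my own.

```python
# Bytes as they would be stored in the file: 0xB9 sits in the middle of the headword.
raw = b"anga\xb9tn"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    # 0xB9 cannot start a UTF-8 sequence, so strict decoding raises here.
    utf8_ok = False
    print("utf-8 failed:", exc.reason)

# latin-1 maps every byte 0x00-0xFF to the code point with the same number,
# so this decode cannot fail; 0xB9 becomes U+00B9.
text = raw.decode("latin_1")
print(repr(text))
```

Because latin-1 accepts any byte sequence, a clean decode only shows that the file is readable, not that every resulting character is the intended one. That is why the replacements are still needed afterwards.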