Information about Persian NLP Resources

This post is about Persian NLP resources.

One of the best-known Persian corpora is the Bijankhan corpus.

I found an improved version of the Bijankhan corpus: the UPEC corpus.

It was made by Mojgan Seraji, a PhD student at Uppsala University, Sweden.

http://stp.lingfil.uu.se/~mojgan/

Seraji provides a UPEC model file for the hunpos tagger,
so I could easily use that model file with hunpos.

The command for tagging with hunpos is:

% cat (un-tagged text file) | hunpos-tag (where model-file is) > output
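
The same pipeline can be driven from Python. This is only a minimal sketch: it assumes `hunpos-tag` is on the PATH, that it takes the model file as its only argument and reads tokens from standard input (as in the command above), and that the text is UTF-8. The function name `tag_with_hunpos` and its parameters are hypothetical.

```python
import subprocess


def tag_with_hunpos(token_text, model_path, hunpos_bin="hunpos-tag"):
    """Pipe token_text into hunpos-tag and return whatever it prints.

    Assumptions (not checked here): hunpos_bin is an executable that takes
    the model file as its single argument and reads tokens from stdin,
    and both input and output are UTF-8.
    """
    completed = subprocess.run(
        [hunpos_bin, model_path],
        input=token_text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the tagger exits non-zero
    )
    return completed.stdout.decode("utf-8")
```

Calling `tag_with_hunpos(text, "(where model-file is)")` would then return the tagged text as a string instead of writing it to a file.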


I could also train a model myself with the TnT tagger.

% tnt-para (where UPEC corpus is)

An example of the messages it prints is below:

----------

TnT-Para: Generate trigram parameters from corpus - Version 2.2
(C) 1993 - 2000 Thorsten Brants, thorsten@coli.uni-sb.de
Reading corpus /work/kensuke-mi/persian_rs/UPEC.txt .....................................................................................................................................(2613492 tokens)
Writing lexicon .......................................................................... (74915 tokens)
Lexicon written to file 'UPEC.lex'
Writing n-grams ................ (102 uni-, 1526 bi-, 15008 trigrams)
n-grams written to file 'UPEC.123'

----------

Now I could tag text with the following command:

% tnt (where UPEC.lex and UPEC.123 are) (where text file is) > output file

I tried tagging news text extracted from Hamshahri news.
The result is below:

----------

پس ADV_TIME
از P
آنکه CON
اجلاس N_SING
نم N_SING
پايان N_SING

فرمانده N_SING
(7 tokens)

----------
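
To work with this output in Python, the lines can be read back into (word, tag) pairs. The sketch below assumes each useful line is a word followed by whitespace and a tag, as in the sample above, and it skips blank lines, lines starting with `%%` (assumed here to be TnT comment lines), and the trailing `(N tokens)` summary. The function name `parse_tagged_lines` is hypothetical.

```python
import re

# Matches the summary line the tagger prints at the end, e.g. "(7 tokens)".
SUMMARY_RE = re.compile(r"^\(\d+ tokens\)$")


def parse_tagged_lines(lines):
    """Turn 'word TAG' lines into a list of (word, tag) tuples."""
    pairs = []
    for line in lines:
        line = line.strip()
        # Skip blanks, assumed '%%' comment lines, and the token-count summary.
        if not line or line.startswith("%%") or SUMMARY_RE.match(line):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # a word without a tag; nothing to pair it with
        pairs.append((parts[0], parts[1]))
    return pairs
```

Passing an open file object (opened with `encoding="utf-8"`) works as well as a list of strings, since the function only iterates over lines.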


! Important information !

It is not possible to tag a raw text file directly with the UPEC model.

I had to convert the text file into the following format first:

Word
Word
.
.
.


Each line must contain exactly one word: the word to be tagged.

When I put several words on one line, the tagger tagged only the first word of the line.




So I have to convert ordinary sentences into a format suitable for tagging.
I'll write such a script in Python.
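
A minimal sketch of such a converter is below. It assumes that splitting on whitespace is good enough to find the words, and that punctuation attached to a word (for example "." or "،") should go on its own line. Neither assumption comes from the TnT or UPEC documentation, so compare the result with how the corpus itself is tokenized. The function names are hypothetical.

```python
import unicodedata


def is_punct(ch):
    """True for characters in a Unicode punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


def split_chunk(chunk):
    """Split leading and trailing punctuation off one whitespace-delimited chunk."""
    start, end = 0, len(chunk)
    lead, trail = [], []
    while start < end and is_punct(chunk[start]):
        lead.append(chunk[start])
        start += 1
    while end > start and is_punct(chunk[end - 1]):
        trail.append(chunk[end - 1])
        end -= 1
    core = [chunk[start:end]] if start < end else []
    # trail was collected right-to-left, so reverse it to restore text order.
    return lead + core + trail[::-1]


def to_one_word_per_line(text):
    """Return text rewritten with exactly one token per line."""
    tokens = []
    for chunk in text.split():  # zero-width non-joiners stay inside words
        tokens.extend(split_chunk(chunk))
    return "".join(token + "\n" for token in tokens)
```

For a file, something like `to_one_word_per_line(open(path, encoding="utf-8").read())` would give the text to feed to the tagger, assuming the input file is UTF-8.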

A reference on handling Arabic characters in Python:

http://www.spencegreen.com/2008/12/19/python-arabic-unicode/