Information about Persian NLP Resources
This post is about Persian NLP resources.
Among Persian corpora, one of the best known is the "Bijankhan corpus."
I found an improved version of the Bijankhan corpus, the UPEC corpus.
It was made by Mojgan Seraji, a PhD student at Uppsala University, Sweden.
http://stp.lingfil.uu.se/~mojgan/
Seraji also provides a UPEC model file for the hunpos tagger.
So I could easily use the model file with the hunpos tagger.
The command for tagging with hunpos is below:
% cat (un-tagged text file) | hunpos-tag (where model-file is) > output
I could also train a model myself with the TnT tagger.
% tnt-para (where UPEC corpus is)
An example of the output message is below:
TnT-Para: Generate trigram parameters from corpus - Version 2.2
(C) 1993 - 2000 Thorsten Brants, thorsten@coli.uni-sb.de
Reading corpus /work/kensuke-mi/persian_rs/UPEC.txt .....................................................................................................................................(2613492 tokens)
Writing lexicon .......................................................................... (74915 tokens)
Lexicon written to file 'UPEC.lex'
Writing n-grams ................ (102 uni-, 1526 bi-, 15008 trigrams)
n-grams written to file 'UPEC.123'
Now I can tag text with the following command:
% tnt (where UPEC.lex and UPEC.123 are) (where text file is) > output file
I tried tagging news text extracted from Hamshahri news.
The result is below:
پس ADV_TIME
از P
آنکه CON
اجلاس N_SING
نم N_SING
پايان N_SING
فرمانده N_SING
(7 tokens)
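Tagger output like the above can be read back into Python for further processing. The following is only a minimal sketch: the function name parse_tagged_lines is made up for illustration, and it assumes that each useful line is a word followed by whitespace and a tag, that summary lines look like "(7 tokens)", and that comment lines start with "%%".

```python
import re

# Assumption: summary lines look like "(7 tokens)", as in the output above.
SUMMARY_RE = re.compile(r"\(\d+ tokens\)")


def parse_tagged_lines(lines):
    """Turn 'word TAG' lines into a list of (word, tag) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comment lines (assumed to start with "%%"),
        # and the token-count summary line.
        if not line or line.startswith("%%") or SUMMARY_RE.fullmatch(line):
            continue
        parts = line.split()
        if len(parts) < 2:
            # A word without a tag; ignored in this sketch.
            continue
        pairs.append((parts[0], parts[1]))
    return pairs


if __name__ == "__main__":
    sample = ["پس ADV_TIME", "از P", "آنکه CON", "(7 tokens)"]
    for word, tag in parse_tagged_lines(sample):
        print(word, tag)
```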
! Important information !
It's impossible to tag a raw text file with the UPEC model.
I had to convert the text file to the following format:
Word
Word
.
.
.
Each line must contain exactly one word, the word to be tagged.
If I write several words on one line, the tagger tags only the first word of the line.
So I have to convert normal sentences into a format suitable for tagging.
I'll write such a script in Python.
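A rough sketch of what such a conversion could look like is below. It is not the final script: the names tokenize and to_one_word_per_line and the punctuation list are my own assumptions, and the splitting rule is very simple (split on whitespace, and split off common Latin and Persian punctuation marks as separate tokens).

```python
import re

# Assumption: these punctuation marks should become separate tokens.
PUNCT = ".,;:!?()[]\"«»،؛؟"

# A token is either one punctuation mark or a run of characters that are
# neither whitespace nor punctuation.
TOKEN_RE = re.compile(r"[{0}]|[^\s{0}]+".format(re.escape(PUNCT)))


def tokenize(text):
    """Split raw text into a list of word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def to_one_word_per_line(text):
    """Return the text with exactly one token on each line."""
    return "\n".join(tokenize(text)) + "\n"


if __name__ == "__main__":
    import sys

    # Read raw text from standard input, write one token per line.
    sys.stdout.write(to_one_word_per_line(sys.stdin.read()))
```

Saved under a hypothetical name such as one_word_per_line.py, the script reads raw text from standard input, so its output can be redirected to a file and given to the tagger. Python 3 is assumed, with the text in the encoding the model expects.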
A reference on handling Arabic characters in Python:
http://www.spencegreen.com/2008/12/19/python-arabic-unicode/