How to use the Persian segmenter and tokenizer: SeTPer

After preprocessing with PrePer, you can segment and tokenize a Persian document with SeTPer.

SeTPer uses the Uplug framework.
So, to use SeTPer, we first have to know how to use Uplug.
(It took me a few hours to understand how to use Uplug.)

Uplug is a tool for handling corpora.
Uplug can convert from text to XML, align two corpora, do tagging, chunking, etc.

I think you can use Uplug on any major Linux distribution such as Ubuntu or Fedora.
I don't know whether it works on Mac OS.

You can download it from SourceForge, at the site below.

At the time of writing, the latest version is 0.2.0d.
I use this version.

After downloading, extract the tar archive with the following command.

$ tar -vxf uplug-0.2.0d.tar.gz

Change to the extracted directory.

$ cd uplug
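If you are not sure what the extracted directory is called, you can ask tar before (or after) extracting. This is only a small sketch: `top_level_dir` is a helper name I made up, not something that comes with Uplug, and I assume the archive has a single top-level directory.

```shell
# Sketch: print the top-level directory name(s) inside a tar archive
# without extracting it. "top_level_dir" is a hypothetical helper name.
top_level_dir() {
    # tar -tf lists the member names; strip a leading "./" if present,
    # keep only the first path component, and drop duplicates.
    tar -tf "$1" | sed 's|^\./||' | cut -d/ -f1 | sort -u
}

# Usage (archive name as in the text above):
#   top_level_dir uplug-0.2.0d.tar.gz
```

Whatever name it prints is the directory to cd into.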

Uplug doesn't need configure, make, or make install.
In other words, Uplug is a standalone program.

Details of the program and its usage are written in


The main program is "uplug".

You always call this program first and give it the module you want to run as an argument.

So the basic Uplug command is

$ ./uplug (other module)
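Since every call goes through the same launcher, I find a tiny wrapper handy. The following is just a sketch under my own assumptions: `UPLUG_HOME` and `run_uplug` are names I invented (Uplug does not define them), and the default location `~/uplug` is simply where I extracted it.

```shell
# Sketch: call the uplug launcher with a module and its arguments.
# UPLUG_HOME and run_uplug are illustrative names, not part of Uplug.
UPLUG_HOME="${UPLUG_HOME:-$HOME/uplug}"

run_uplug() {
    # Fail early with a clear message if the launcher is not there.
    if [ ! -x "$UPLUG_HOME/uplug" ]; then
        echo "uplug launcher not found in $UPLUG_HOME" >&2
        return 1
    fi
    # The first argument is the module; everything is passed through unchanged.
    "$UPLUG_HOME/uplug" "$@"
}

# Usage: run_uplug (other module) [options]
```

The wrapper adds nothing except the check, so the options you pass are exactly those of the plain ./uplug command.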

Now I will walk through my own example, following QUICKSTART.
My environment is as follows.

OS: Ubuntu 12.04 LTS
Uplug location: ~/uplug/


Make a new directory "myproject" in ~/uplug/ and copy the example files into it.

$ mkdir ~/uplug/myproject

$ cd ~/uplug/myproject/

$ cp ../example/1988sv.txt .
$ cp ../example/1988en.txt .
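The same setup can be wrapped in one function so that it stops at the first problem. This is only a sketch: `setup_project` is a hypothetical helper of mine, and it assumes the two example files sit in the example/ directory of the Uplug directory you pass in, as they do in my copy.

```shell
# Sketch: create "myproject" inside the Uplug directory and copy the two
# example texts into it. "setup_project" is a hypothetical helper name.
setup_project() {
    uplug_dir="$1"
    project="$uplug_dir/myproject"
    # -p: do not fail if the directory already exists.
    mkdir -p "$project" || return 1
    for f in 1988sv.txt 1988en.txt; do
        cp "$uplug_dir/example/$f" "$project/" || return 1
    done
    # Print the project path so the caller can cd into it.
    echo "$project"
}

# Usage:
#   cd "$(setup_project ~/uplug)"
```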


Convert from txt to XML

You are now in the directory ~/uplug/myproject.

$ ../uplug ../systems/pre/basic -ci 'iso-8859-1' -in 1988sv.txt > 1988sv.xml

$ ../uplug ../systems/pre/basic -ci 'iso-8859-1' -in 1988en.txt > 1988en.xml
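With more than two files, typing the command for each one gets tedious, so here is a sketch of a loop that does the same thing. `convert_all`, `UPLUG` and `BASIC` are names I chose myself. The options are copied from the two commands above, and I assume every .txt file in the directory is in iso-8859-1, which may not be true for your own data.

```shell
# Sketch: convert every .txt file in the current directory to XML,
# using the same module and options as the two commands above.
# UPLUG and BASIC are illustrative variables; the defaults are the
# relative paths used in the text.
UPLUG="${UPLUG:-../uplug}"
BASIC="${BASIC:-../systems/pre/basic}"

convert_all() {
    for txt in *.txt; do
        # If there are no .txt files the pattern stays literal: skip it.
        [ -e "$txt" ] || continue
        # Replace the .txt suffix with .xml (1988sv.txt -> 1988sv.xml).
        xml="${txt%.txt}.xml"
        "$UPLUG" "$BASIC" -ci 'iso-8859-1' -in "$txt" > "$xml" || return 1
    done
}

# Usage, from inside ~/uplug/myproject:
#   convert_all
```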

You can now look at the text in XML format!


Get the alignment between 1988sv.xml and 1988en.xml

You are still in the directory ~/uplug/myproject.

$ ../uplug ../systems/align/sent -src 1988sv.xml -trg 1988en.xml > 1988sven.xml

You can view the alignment with the following command.

$ ../tools/readalign 1988sven.xml | less
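Because the alignment result is written by shell redirection, a failed run can leave an empty output file behind without you noticing. The sketch below wraps the alignment command with two simple checks. `align_pair`, `UPLUG` and `SENT` are my own names, and the -src and -trg options are just the ones from the command above.

```shell
# Sketch: align two XML files and refuse to continue on empty input/output.
# UPLUG and SENT are illustrative variables; the defaults are the
# relative paths used in the text. "align_pair" is a hypothetical helper.
UPLUG="${UPLUG:-../uplug}"
SENT="${SENT:-../systems/align/sent}"

align_pair() {
    src="$1"; trg="$2"; out="$3"
    # Both inputs must exist and be non-empty.
    for f in "$src" "$trg"; do
        [ -s "$f" ] || { echo "missing or empty input: $f" >&2; return 1; }
    done
    "$UPLUG" "$SENT" -src "$src" -trg "$trg" > "$out" || return 1
    # An empty result usually means something went wrong upstream.
    [ -s "$out" ] || { echo "alignment output is empty: $out" >&2; return 1; }
}

# Usage, from inside ~/uplug/myproject:
#   align_pair 1988sv.xml 1988en.xml 1988sven.xml
```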

Other functions can be used in a similar way.