Friday, December 16, 2011

Factored translation: simple experiment

  • Moses decoder and training scripts that implement factored translation models; MGiza++ for word alignment
  • The English-Romanian journalistic corpus used for the NAACL 2005 word alignment shared task (37,000 sentence pairs, and two test sets of 400 sentence pairs each)

The VMware Player Virtual Machine (user/ubuntu) contains a working install of Moses and training scripts for a baseline system and for a factored system.


Training the baseline system (smt/naacl-baseline)

Preparing for alignment:
$ wa-prep.sh
Takes just a few minutes and produces a directory 'corpus' under 'smt/naacl-baseline/tm-enro'.

Word alignment:
for English to Romanian
$ word-align.sh 1
for Romanian to English
$ word-align.sh 2
It takes ~20 minutes for each direction and produces 2 directories: 'giza.en-ro' and 'giza.ro-en' under 'smt/naacl-baseline/tm-enro'. The alignment can run in parallel (use 'screen' or different sessions for each direction).
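
If you prefer not to use 'screen', the two directions can also be started as background jobs from a single shell. The following is only a sketch: the function name 'align_both_directions' and the log file names are made up for this illustration, and it assumes the alignment script accepts the direction (1 or 2) as its only argument, as above.

```shell
# Hypothetical helper, not part of the tutorial scripts.
# Runs the given alignment script for both directions at the same time
# and fails if either direction fails.
align_both_directions() {
    aligner="$1"

    # Direction 1: English to Romanian (log file name is an assumption).
    "$aligner" 1 > align.en-ro.log 2>&1 &
    pid_enro=$!

    # Direction 2: Romanian to English.
    "$aligner" 2 > align.ro-en.log 2>&1 &
    pid_roen=$!

    status=0
    wait "$pid_enro" || status=1
    wait "$pid_roen" || status=1
    return "$status"
}
```

Usage: align_both_directions word-align.sh && echo "both directions finished"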

Extract translation equivalents:
$ train-tm.sh
It takes ~10 minutes and produces a directory 'model' under 'smt/naacl-baseline/tm-enro'. The decoder configuration file is 'moses.ini' in the 'model' directory. Just to be sure that everything worked fine, test this configuration file by launching:
$ test-tm.sh tm-enro/model/moses.ini
This translates the test file 'naacl.en-ro.fact0.en.test' with the system configured in 'moses.ini', writes the translation to 'naacl.en-ro.fact0.en.test-outro', and prints a BLEU score on the command line.

Parameter optimization:
$ tune.sh
The system parameters are optimized on the development set ('naacl.en-ro.fact0.en.dev', with the reference translation 'naacl.en-ro.fact0.ro.dev').
It takes a couple of hours. The result is a 'moses.ini' configuration file with optimised parameters in 'smt/naacl-baseline/mert-enro'.
Hint: To speed up tuning, if you have more than two threads available, allocate more cores to the virtual machine and change the parameter --decoder-flags "threads 2" accordingly.

Test the optimised system:
$ test-tm.sh mert-enro/moses.ini


Training the factored system (smt/naacl-fact)

The factored model translates lemmas and POS tags (one translation table for each) and, on the target side, generates the wordform from the lemma and the POS tag (see Addressing SMT data sparseness...). It has two language models: one on wordforms and one on POS tags. The provided configuration is not optimal; try using the morpho-syntactic descriptions (MSDs) instead of POS tags (a language model on MSDs is available in 'smt/lm-ro').

The training and test files have 4 factors: wordform|lemma|POStag|MorphoSyntacticDescription
English:
it|it^Pp|PPER3|Pp3ns sounds|sound^Nc|NNS|Ncnp like|like^Sp|PREP|Sp a|a^Ti|TS|Ti-s terrible|terrible^Af|ADJE|Afp truth|truth^Nc|NN|Ncns
Romanian:
sună|suna^Vm|V3|Vmis3s ca|ca^Rc|RC|Rc un|un^Ti|TSR|Timsr adevăr|adevăr^Nc|NSN|Ncms-n crunt|crunt^Af|ASN|Afpms-n 
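
To look at a single factor of such a file (for example, only the lemmas or only the POS tags), the tokens can be split on '|'. This is a small sketch; 'extract_factor' is a hypothetical helper, not one of the tutorial or Moses scripts. Factors are counted from 0, as in the Moses options: 0 is the wordform, 1 the lemma, 2 the POS tag, 3 the MSD.

```shell
# Hypothetical helper: print one factor (0-based index) of every token
# of a factored corpus read from standard input.
extract_factor() {
    awk -v idx="$1" '{
        out = ""
        for (i = 1; i <= NF; i++) {
            split($i, f, "|")            # f[1] is factor 0, f[2] is factor 1, ...
            sep = (i > 1) ? " " : ""
            out = out sep f[idx + 1]
        }
        print out
    }'
}
```

Usage: extract_factor 2 < naacl.en-ro.fact0.en.test (the file name here is the baseline test file, used only as an example).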

To train the system, run the same scripts as for the baseline system. The tuning and testing steps will take a lot more time compared to the baseline system.
Hint: To experiment with different configurations, tune with a smaller development set and test set.
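
One possible way to build such a smaller set is to keep only the first lines of the source and reference files, taking care that both stay parallel. The helper name 'make_small_dev' and the '.small' suffix are assumptions made for this sketch; the tuning script would then have to be pointed at the new files.

```shell
# Hypothetical helper: keep the first N sentence pairs of a development
# (or test) set. Writes <file>.small next to each input file.
make_small_dev() {
    src="$1"
    ref="$2"
    n="$3"

    src_lines=$(wc -l < "$src")
    ref_lines=$(wc -l < "$ref")
    # Source and reference must have the same number of lines.
    if [ "$((src_lines))" -ne "$((ref_lines))" ]; then
        echo "source and reference are not parallel" >&2
        return 1
    fi

    head -n "$n" "$src" > "$src.small"
    head -n "$n" "$ref" > "$ref.small"
}
```

Usage: make_small_dev dev.en dev.ro 100 (replace 'dev.en' and 'dev.ro' with the actual development files of the factored system).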

You will notice a few extra lines in 'train-tm.sh':
         --lm ${LM_FACTOR}:${ORDER}:${LM}:${LM_TYPE} \
         --lm ${LM_TAG_FACTOR}:${ORDER}:${LM_TAG}:${LM_TYPE} \
         --translation-factors 1-1+2-2 \
         --generation-factors 1,2-0 \
         --decoding-steps t0,t1,g0
If you want to change the translation and the generation step from POS to MSD, use:

         --translation-factors 1-1+3-3 \
         --generation-factors 1,3-0
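
The only thing that changes between the two setups is the index of the tag factor (2 for POS, 3 for MSD). The sketch below prints the three options for either choice; 'factored_flags' is a hypothetical helper, not part of 'train-tm.sh'. It is an assumption on my side that the tag language model (LM_TAG_FACTOR and LM_TAG) would also have to be switched to the MSD factor and the MSD model when using MSDs.

```shell
# Hypothetical helper: print the factored training options for a tag
# factor given by name. Factor indices follow the corpus format:
# 0 wordform, 1 lemma, 2 POS tag, 3 MSD.
factored_flags() {
    case "$1" in
        pos) tag=2 ;;
        msd) tag=3 ;;
        *)   echo "unknown tag type: $1" >&2; return 1 ;;
    esac

    # lemma -> lemma and tag -> tag translation, then
    # lemma + tag -> wordform generation.
    printf '%s\n' \
        "--translation-factors 1-1+${tag}-${tag}" \
        "--generation-factors 1,${tag}-0" \
        "--decoding-steps t0,t1,g0"
}
```

Usage: factored_flags msd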


For more info on configuring the factored parameters see The Moses User Manual and Code Guide and follow the links from this thread from the Moses support list.

Friday, November 18, 2011

EM word alignment exercise


Source:
bre urna lui Tracosh fotə, au zdədud la tonmie viiu-zəu, Zaz fotə shi au tsimud tonmiia 4 ami shi au nurid.
tubə ge au nurid Zaz fotə, au tsimud tonmiia viiu-zəu, Latsgo fotə 8 ami.
tubə tonmiia lui Poctam fotə au tonmid Bədru fotə, viiu-zəu Nushadu, 16 ami.
bre urna au zdədud la tonmie vradi-zəu, Ronam fotə, 3 ami.
bre urna lui Latsgo fotə au tonmid Poctam fotə 6 ami.

Target:
tubə foiefotul Tracosh a tonmid viul zəu Zaz, tonmia gəruia tsimâmt 4 ami aboi a nurid.
tubə ge a nurid foiefotul Zaz, a tonmid 8 ami viul zəu, foiefotul Latsgo.
tubə tonmia lui foiefotului Poctam a tonmid foiefotul Bedru shi viul zəu, Nushad, dinb te 16 ami.
aboi a urnad tonmia vradelui zəu, foiefotul Ronam, dinb te 3 ami.
tubə foiefotul Latsgo a tonmid 6 ami foiefotul Poctam.

Monday, May 9, 2011

Factored model tutorial

Install PuTTY and WinSCP to make it easier to interact with the virtual machine. If you cannot reach the virtual machine with PuTTY, VMware Player may not be configured correctly. In that case, run the VMware Player installer again with the "Repair" option.

Download a new version of the corpus and scripts from here. Unpack them in /home/user/smt/naacl-fact/

Preparing for alignment
wa-prep.sh
Takes a few minutes. Produces a 'corpus' directory in 'tm-enro'.

Aligning the corpus
for the English-Romanian direction:
walign.sh 1
for the Romanian-English direction:
walign.sh 2
Takes half an hour for each direction. You can launch them in parallel using 'screen'. After the alignment you have two new directories, 'giza.en-ro' and 'giza.ro-en', in 'tm-enro'.

Extracting translation equivalents
train-tm-fact.sh
Takes approx. 10 minutes. After the run, a 'model' directory appears in the 'tm-enro' directory. The system configuration file (moses.ini) is in the 'model' directory.

Testing the system
test-tm.sh tm-enro/model/moses.ini

Optimizing the parameters
tune.sh
Takes approximately 3:30 h. After the run, the 'moses.ini' file in the mert-enro directory holds a version of the system with the parameters optimized on the tuning data (dev set).

Testing the optimized system
test-tm.sh mert-enro/moses.ini
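
The steps can also be chained so that a failure stops the run before the next (long) step starts. This is only a sketch: 'run_steps' is a hypothetical wrapper, and it assumes the scripts can be called from the current directory exactly as shown in the steps of this tutorial.

```shell
# Hypothetical wrapper: run each command line in order and stop at the
# first one that fails.
run_steps() {
    for step in "$@"; do
        echo "running: $step"
        if ! sh -c "$step"; then
            echo "step failed: $step" >&2
            return 1
        fi
    done
}
```

Usage: run_steps "wa-prep.sh" "walign.sh 1" "walign.sh 2" "train-tm-fact.sh" "test-tm.sh tm-enro/model/moses.ini"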

Sunday, February 27, 2011

Thursday, February 17, 2011

Moses Scripts install

Note: The word alignment stage in the training scripts needs Giza++ in order to run -- see the post on how to install Mgiza++ (a multi-threaded version of Giza++).

in mosesdecoder/scripts
edit Makefile

instead of:
< TARGETDIR?=/home/s0565741/terabyte/bin
< BINDIR?=/home/s0565741/terabyte/bin
replace with:
> TARGETDIR=/home/user/smt
> BINDIR=/home/user/smt/mgiza/bin

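The same Makefile edit can be scripted with sed instead of done by hand. A sketch, assuming the two variables are defined at the start of a line with '?=' as shown above; 'patch_makefile' is a hypothetical helper, and it will misbehave if the paths contain '|', '&' or backslashes.

```shell
# Hypothetical helper: set TARGETDIR and BINDIR in the scripts Makefile.
# $1 = Makefile, $2 = new TARGETDIR, $3 = new BINDIR
patch_makefile() {
    makefile="$1"
    # In basic regular expressions '?' is a literal character.
    sed -e "s|^TARGETDIR?=.*|TARGETDIR=$2|" \
        -e "s|^BINDIR?=.*|BINDIR=$3|" \
        "$makefile" > "$makefile.new" && mv "$makefile.new" "$makefile"
}
```

Usage: patch_makefile Makefile /home/user/smt /home/user/smt/mgiza/bin
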
in check-dependencies.pl
on line 28, replace GIZA++ with mgiza and snt2cooc.out with snt2cooc,
so that the line reads something like:
unless (-x "$bin_dir/mgiza" && -x "$bin_dir/snt2cooc" && -x "$bin_dir/mkcls" ) {
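
Before running 'make release', the same condition can be checked from the shell. This is a sketch that mirrors the Perl test above; 'check_aligner_bins' is a hypothetical helper, not part of the Moses scripts.

```shell
# Hypothetical helper: verify that mgiza, snt2cooc and mkcls exist and
# are executable in the given directory (the BINDIR from the Makefile).
check_aligner_bins() {
    bin_dir="$1"
    missing=0
    for tool in mgiza snt2cooc mkcls; do
        if [ ! -x "$bin_dir/$tool" ]; then
            echo "missing or not executable: $bin_dir/$tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: check_aligner_bins /home/user/smt/mgiza/bin && echo "ok"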

make release

## Don't forget to set your SCRIPTS_ROOTDIR with:
   export SCRIPTS_ROOTDIR=/home/user/smt/scripts-20110216-1712
or add it in .bashrc
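
To add the line to .bashrc without duplicating it on repeated runs, something like the following could be used. A sketch only; 'add_scripts_rootdir' is a hypothetical helper, and the directory name depends on the timestamp that 'make release' produced on your machine.

```shell
# Hypothetical helper: append the SCRIPTS_ROOTDIR export to a shell
# startup file unless exactly that line is already present.
# $1 = startup file (e.g. ~/.bashrc), $2 = scripts directory
add_scripts_rootdir() {
    rcfile="$1"
    line="export SCRIPTS_ROOTDIR=$2"
    grep -qxF "$line" "$rcfile" 2>/dev/null || printf '%s\n' "$line" >> "$rcfile"
}
```

Usage: add_scripts_rootdir ~/.bashrc /home/user/smt/scripts-20110216-1712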

Install multi-threaded Giza++

check out the latest version of mgizapp
A word alignment tool based on the famous GIZA++, extended to support multi-threading, resuming training and incremental training.
svn co https://mgizapp.svn.sourceforge.net/svnroot/mgizapp/trunk/mgizapp mgizapp

create a folder for mgiza installation (something like /home/user/smt/mgiza)

configure and install mgiza
./configure --prefix=/home/user/smt/mgiza
make install

Install Moses Decoder

Tested for mosesdecoder from SVN up to rev. 4169
for Ubuntu 10.10 - 11.04 32bit / 64bit

install compression library
zlib is a library implementing the deflate compression method found
in gzip and PKZIP.
sudo apt-get install zlib1g-dev

install boost-thread library
Toolkit for writing C++ programs that execute as multiple,
asynchronous, independent, threads-of-execution.
sudo apt-get install libboost-thread1.42-dev

install xmlrpc for c
XML-RPC is a quick-and-easy way to make procedure calls over the Internet.
It converts the procedure call into an XML document, sends it to a remote
server using HTTP, and gets back the response as XML. This library provides a modular implementation of XML-RPC for C and C++.
sudo apt-get install libxmlrpc-c3-dev


check out the latest version of moses
(Recently, Moses development moved to git repository. This installation tutorial was tested for versions up to 4169. Just add -r 4169 to the following command if you have problems compiling it.)
svn co https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk mosesdecoder


build moses (in moses directory)
./regenerate-makefiles.sh
./configure
(opt) for multi-threaded moses run:
./configure --enable-threads 
(opt) if you also want the moses XML-RPC server run instead:
./configure --with-xmlrpc-c=/usr/bin/xmlrpc-c-config --enable-threads
after configuration run make:
make -j 4

Prerequisites

for Ubuntu 10.10 i386

install ssh
Ssh (Secure Shell) is a program for logging into a remote machine and for executing commands on a remote machine.
sudo apt-get install ssh

install build essentials
This package contains an informational list of packages which are
considered essential for building Debian packages.
sudo apt-get install build-essential

install autoconf
The standard for FSF source packages.  This is only useful if you
write your own programs or if you extensively modify other people's
programs.
sudo apt-get install autoconf

install libtool
This is GNU libtool, a generic library support script.  Libtool hides
the complexity of generating special library types (such as shared
libraries) behind a consistent interface.
sudo apt-get install libtool

install subversion
Subversion is a version control system. Version control systems allow
many individuals to collaborate on a set of files (typically source code).  
sudo apt-get install subversion