This experiment shows how factors can improve translation from English into a morphologically rich language (Romanian). It uses:
- The Moses decoder and training scripts, which implement factored translation models, and MGiza++ for word alignment
- The English-Romanian journalistic corpus used for the NAACL 2005 word alignment shared task (37,000 sentence pairs, plus two test sets of 400 sentence pairs each)
The VMware Player Virtual Machine (user/ubuntu) contains a working install of Moses and training scripts for a baseline system and for a factored system.
Training the baseline system (smt/naacl-baseline)
Preparing for alignment:
$ wa-prep.sh
It takes just a few minutes and produces a directory 'corpus' under 'smt/naacl-baseline/tm-enro'.
Word alignment:
for English to Romanian
$ word-align.sh 1
for Romanian to English
$ word-align.sh 2
It takes ~20 minutes for each direction and produces two directories, 'giza.en-ro' and 'giza.ro-en', under 'smt/naacl-baseline/tm-enro'. The two directions can run in parallel (use 'screen' or a separate session for each direction).
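As an alternative to 'screen', both directions can be started from one shell as background jobs. The following is a minimal sketch, not part of the provided scripts: the function name 'align_both', the 'ALIGN_CMD' override and the log file names are assumptions made for illustration, and it assumes 'word-align.sh' is on the PATH and is run from the same directory as in the steps above.

```shell
# Sketch: run both alignment directions at once and fail if either one fails.
# ALIGN_CMD is a hypothetical override; it defaults to the tutorial's script.
align_both() {
    cmd="${ALIGN_CMD:-word-align.sh}"

    "$cmd" 1 > align.en-ro.log 2>&1 &   # English to Romanian, in the background
    pid_enro=$!
    "$cmd" 2 > align.ro-en.log 2>&1 &   # Romanian to English, in the background
    pid_roen=$!

    # Wait for each job separately so that a failure in either one is noticed.
    status=0
    wait "$pid_enro" || status=1
    wait "$pid_roen" || status=1
    return "$status"
}
```

Calling 'align_both && train-tm.sh' would then start the extraction step only once both directions have finished successfully.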
Extract translation equivalents:
$ train-tm.sh
It takes ~10 minutes and produces a directory 'model' under 'smt/naacl-baseline/tm-enro'. The decoder configuration file is 'moses.ini' in the 'model' directory. To check that everything worked, test this configuration file by running:
$ test-tm.sh tm-enro/model/moses.ini
This translates the test file 'naacl.en-ro.fact0.en.test' using the system configured in 'moses.ini'. It writes the translation to 'naacl.en-ro.fact0.en.test-outro' and prints a BLEU score on the command line.
Parameter optimization:
$ tune.sh
The system parameters are optimised on the development set ('naacl.en-ro.fact0.en.dev', with the reference translation 'naacl.en-ro.fact0.ro.dev').
This takes a couple of hours. The result is a 'moses.ini' configuration file in 'smt/naacl-baseline/mert-enro' with optimised parameters.
Hint: To speed up tuning, if the host has more than two cores to spare, allocate more of them to the virtual machine and raise the thread count in the parameter --decoder-flags "threads 2" accordingly.
Test the optimised system:
$ test-tm.sh mert-enro/moses.ini
Training the factored system (smt/naacl-fact)
The factored model uses a phrase table of translation equivalents based on lemmas and, on the target side, uses the POS tag to generate the wordform (see Addressing SMT data sparseness...). It has two language models: one on wordforms and one on POS tags. The configuration provided is not optimal; try using the morpho-syntactic descriptions (MSDs) instead of POS tags (a language model on MSDs is available in 'smt/lm-ro').
The training and test files have 4 factors: wordform|lemma|POStag|MorphoSyntacticDescription
English:
it|it^Pp|PPER3|Pp3ns sounds|sound^Nc|NNS|Ncnp like|like^Sp|PREP|Sp a|a^Ti|TS|Ti-s terrible|terrible^Af|ADJE|Afp truth|truth^Nc|NN|Ncns
Romanian:
sună|suna^Vm|V3|Vmis3s ca|ca^Rc|RC|Rc un|un^Ti|TSR|Timsr adevăr|adevăr^Nc|NSN|Ncms-n crunt|crunt^Af|ASN|Afpms-n
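To inspect a single factor in these files, the tokens can be split on the '|' separator. The sketch below is only an illustration, assuming the four-factor layout shown above; the function name 'extract_factor' is hypothetical and not part of the provided scripts.

```shell
# Sketch: print one factor from each token of a factored sentence.
# Factor indices follow the order wordform|lemma|POStag|MSD (0 to 3).
extract_factor() {
    # $1 = factor index (0-3); factored text is read from standard input
    awk -v idx="$1" '{
        out = ""
        for (i = 1; i <= NF; i++) {
            split($i, f, "[|]")                       # f[1]..f[4] hold the factors
            out = out (i > 1 ? " " : "") f[idx + 1]   # awk arrays start at 1
        }
        print out
    }'
}
```

For the English example above, piping the line through 'extract_factor 2' prints the POS tag sequence 'PPER3 NNS PREP TS ADJE NN'.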
To train the system, run the same scripts as for the baseline system. The tuning and testing steps take considerably longer than for the baseline system.
Hint: To experiment with different configurations, tune with a smaller development set and test set.
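One way to build a smaller set is to keep only the first N sentence pairs, cutting the source and reference files at the same line so they stay parallel. This is a rough sketch: the function name 'make_subset' and the output naming ('PREFIX.en', 'PREFIX.ro') are assumptions, and the tuning and test scripts would still have to be edited to point at the new files.

```shell
# Sketch: keep the first N sentence pairs of a parallel set.
# Both files are cut at the same line so the pairs stay aligned.
make_subset() {
    n="$1"; src="$2"; tgt="$3"; prefix="$4"

    # Refuse to continue if the two sides are not the same length.
    src_lines=$(($(wc -l < "$src")))
    tgt_lines=$(($(wc -l < "$tgt")))
    if [ "$src_lines" -ne "$tgt_lines" ]; then
        echo "make_subset: $src and $tgt differ in length" >&2
        return 1
    fi

    head -n "$n" "$src" > "$prefix.en"
    head -n "$n" "$tgt" > "$prefix.ro"
}
```

For example, 'make_subset 100 naacl.en-ro.fact0.en.dev naacl.en-ro.fact0.ro.dev dev100' would write 'dev100.en' and 'dev100.ro'; the input names here are the baseline ones, so substitute the file names used in the factored directory.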
You will notice a few extra lines in 'train-tm.sh':
--lm ${LM_FACTOR}:${ORDER}:${LM}:${LM_TYPE} \
--lm ${LM_TAG_FACTOR}:${ORDER}:${LM_TAG}:${LM_TYPE} \
--translation-factors 1-1+2-2 \
--generation-factors 1,2-0 \
--decoding-steps t0,t1,g0
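In these options the numbers are factor indices, counted from 0 in the order wordform|lemma|POStag|MSD: '+' separates mapping steps, ',' joins several factors, and '-' separates the input factors from the output factors. The small helper below spells a mapping out in words; it is only a reading aid written for this post (the name 'describe_mapping' is hypothetical) and is not part of Moses.

```shell
# Sketch: describe a factor mapping such as "1-1+2-2" or "1,2-0" in words.
describe_mapping() {
    printf '%s\n' "$1" | awk '
        BEGIN { name[0] = "wordform"; name[1] = "lemma"; name[2] = "POStag"; name[3] = "MSD" }
        # Turn a comma-separated index list such as "1,2" into "lemma+POStag".
        function label(list,   parts, n, i, s) {
            n = split(list, parts, ",")
            s = ""
            for (i = 1; i <= n; i++) s = s (i > 1 ? "+" : "") name[parts[i]]
            return s
        }
        {
            n = split($0, steps, "[+]")          # one entry per mapping step
            for (i = 1; i <= n; i++) {
                split(steps[i], io, "-")         # io[1] = inputs, io[2] = outputs
                print label(io[1]) " -> " label(io[2])
            }
        }'
}
```

Here 'describe_mapping 1-1+2-2' prints 'lemma -> lemma' and 'POStag -> POStag' on separate lines, and 'describe_mapping 1,2-0' prints 'lemma+POStag -> wordform'. The decoding steps 't0,t1,g0' then apply the first translation table, the second translation table and the generation table, in that order.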
To switch the translation and generation steps from POS tags to MSDs, use:
--translation-factors 1-1+3-3 \
--generation-factors 1,3-0
For more info on configuring the factored parameters see The Moses User Manual and Code Guide and follow the links from this thread from the Moses support list.