Compiled tools, datasets, and other resources for historical text normalization.
The resources provided here have originally been published along with the following publication:
- Marcel Bollmann. 2019. A Large-Scale Comparison of Historical Text Normalization Systems. In Proceedings of NAACL-HLT 2019.
@inproceedings{bollmann2019-largescale,
author = {Bollmann, Marcel},
title = {A Large-Scale Comparison of Historical Text Normalization Systems},
booktitle = {Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
location = {Minneapolis, Minnesota},
publisher = {Association for Computational Linguistics},
year = {2019},
pages = {3885--3898},
url = {http://www.aclweb.org/anthology/N19-1389},
}
If you use an original part of this repository (such as the provided scripts or the previously unpublished dataset splits), I would appreciate it if you cited the above-mentioned publication. If you use one of the referenced datasets and/or tools, please remember to (also) cite those accordingly.
For further reading, a lot of additional details and background information can also be found in:
- Marcel Bollmann. 2018. Normalization of Historical Texts with Neural Network Models. Bochumer Linguistische Arbeitsberichte, 22.
The following additional material is also available:
- Table 6 with raw accuracy numbers for the plots provided in Figures 3 and 4 in Bollmann (2019).
| Language   | Source Corpus | Time Period  | Genre            | Tokens (total) | Source of Splits  |
|------------|---------------|--------------|------------------|----------------|-------------------|
| English¹   | ICAMET        | 1386-1698    | Letters          | 188,158        | HistCorp          |
| German     | Anselm        | 14th-16th c. | Religion         | 71,570         | prev. unpublished |
| German     | RIDGES        | 1482-1652    | Science          | 71,570         | prev. unpublished |
| Hungarian  | HGDS          | 1440-1541    | Religion         | 172,064        | HistCorp          |
| Icelandic  | IcePaHC       | 15th c.      | Religion         | 65,267         | HistCorp          |
| Portuguese | Post Scriptum | 15th-19th c. | Letters          | 306,946        | prev. unpublished |
| Slovene    | goo300k       | 1750-1899    | Mixed            | 326,538        | KonvNormSl 1.0    |
| Spanish    | Post Scriptum | 15th-19th c. | Letters          | 132,248        | prev. unpublished |
| Swedish    | GaW           | 1527-1812    | Official Records | 65,571         | HistCorp          |
¹Due to licensing restrictions, the ICAMET dataset may not be distributed further, but the HistCorp website contains instructions on how to obtain the same dataset splits.
The `scripts/` directory contains a collection of scripts that was used in the process of running the normalization experiments in Bollmann (2019). This includes preprocessing scripts, evaluation and significance testing scripts, and more. For more details, please see the README file in the `scripts/` folder.
In most cases, you want to combine a naive memorization baseline (for in-vocabulary tokens) with a good, trained model (for out-of-vocabulary tokens).
- If you have little training data (<500 tokens), you probably want to use the Norma tool (which already includes a naive memorization component); see "Using Norma" for details.
- Otherwise, the results from Bollmann (2019) suggest using cSMTiser as the trained model in this scenario; see below under "Using cSMTiser".

The naive memorization component can be trained as follows:

    scripts/memorizer.py train german-lexicon.txt german-anselm.train.txt

Apply it via:

    scripts/memorizer.py apply german-lexicon.txt german-anselm.dev.txt > dev.memo.pred
To combine naive memorization with a trained model (for out-of-vocabulary tokens), first train and apply one of the normalizers discussed below. If the predictions of that trained model are in `dev.model.pred`, you can then apply this combined strategy via:

    scripts/memorizer.py combine german-lexicon.txt dev.model.pred german-anselm.dev.txt

This will output a new prediction file that returns the learned memorization if possible, and the corresponding line from `dev.model.pred` otherwise.
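To illustrate the general idea, the strategy roughly corresponds to the following Python sketch. This is not the actual implementation of `scripts/memorizer.py`; in particular, picking the most frequent normalization per historical form is an assumption made for illustration.

```python
# Sketch of a naive memorization baseline with model fallback.
# NOT scripts/memorizer.py; the most-frequent-normalization rule is an
# assumption made for illustration. File names are placeholders.
from collections import Counter, defaultdict

def train_memorizer(train_path):
    counts = defaultdict(Counter)
    with open(train_path, encoding="utf-8") as f:
        for line in f:
            hist, norm = line.rstrip("\n").split("\t")[:2]
            counts[hist][norm] += 1
    # memorize the most frequent normalization seen for each historical form
    return {hist: c.most_common(1)[0][0] for hist, c in counts.items()}

def combine(lexicon, test_path, model_pred_path):
    with open(test_path, encoding="utf-8") as test, \
         open(model_pred_path, encoding="utf-8") as preds:
        for line, model_pred in zip(test, preds):
            hist = line.rstrip("\n").split("\t")[0]
            # use the memorized normalization for in-vocabulary tokens,
            # otherwise fall back to the trained model's prediction
            yield lexicon.get(hist, model_pred.rstrip("\n"))

lexicon = train_memorizer("german-anselm.train.txt")
for prediction in combine(lexicon, "german-anselm.dev.txt", "dev.model.pred"):
    print(prediction)
```

In other words, the trained model is only consulted for word forms that never occurred in the training data.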
The following tools are evaluated in Bollmann (2019):
- Norma, described in Bollmann (2012)
- Marian (NMT) for normalization, described in Tang et al. (2018)
- XNMT, following the model of Bollmann (2018)
- cSMTiser (wrapping Moses)
The detailed instructions below assume that the data files are provided in the same format as contained in this repository; i.e., as tab-separated text files where the first column contains a historical word form and the second column contains its normalization.
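This also means that any prediction file can be scored against the gold normalizations in the second column. The following is only a minimal sketch, assuming one predicted normalization per line in the same order as the dataset file and using placeholder file names; the evaluation scripts in the `scripts/` directory provide the full functionality, including significance testing.

```python
# Minimal sketch: word-level accuracy of a prediction file against the gold
# normalizations in a two-column, tab-separated dataset file.
# Assumes one prediction per line, aligned with the dataset file;
# file names are placeholders.
def word_accuracy(gold_path, pred_path):
    with open(gold_path, encoding="utf-8") as gold_file, \
         open(pred_path, encoding="utf-8") as pred_file:
        pairs = [(gold.rstrip("\n").split("\t")[1], pred.rstrip("\n"))
                 for gold, pred in zip(gold_file, pred_file)]
    return sum(g == p for g, p in pairs) / len(pairs)

print(word_accuracy("german-anselm.dev.txt", "german-anselm.predictions"))
```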
Norma (and at least one of its dependencies) needs to be compiled manually on your system before it can be used. Detailed instructions for this can be found in the Norma repository.
To use Norma, you need to:
- Prepare a configuration file; you can use the recommended configuration file, but should adjust the filenames given inside.

- Prepare a lexicon of contemporary word forms. You can use the contemporary datasets provided here for this purpose, and create a lexicon file with the following command (example given for German):

      norma_lexicon -w datasets/modern/combined.de.uniq -a lexicon.de.fsm -l lexicon.de.sym -c

  Make sure that the names of the lexicon files match what is given in your `norma.cfg` before you start training.
Data files for Norma need to be in two-column, tab-separated format. To train a new model, use:

    normalize -c norma.cfg -f german-anselm.train.txt -s -t --saveonexit

The names of the saved model files are defined in `norma.cfg`. Generating normalizations is done via:

    normalize -c norma.cfg -f german-anselm.dev.txt -s > german-anselm.predictions
You need to install the Marian framework and clone the normalization-NMT repository on your local machine. You then need to:
- Preprocess the input to be in separate source/target files with whitespace-separated characters. This format can be easily generated as follows:

      mkdir preprocessed
      scripts/convert_to_charseq.py german-anselm.{train,test,dev}.txt --to preprocessed

  This will create the preprocessed input files (named `train.src`, `train.trg`, etc.) in the `preprocessed/` subdirectory.

- Edit the `train_seq2seq.sh` script that comes with normalization-NMT to point to the correct paths (for Marian and the preprocessed input), as well as adjust the GPU memory settings and device ID to the correct values for your system. As an example, check out the modified script used for the experiments in Bollmann (2019).
Then, training the model is as simple as calling:

    bash train_seq2seq.sh

Generating normalizations is best done by calling `marian-decoder` directly, like this:

    cat preprocessed/dev.src | $MARIAN_PATH/marian-decoder -c $MODELDIR/model.npz.best-perplexity.npz.decoder.yaml -m $MODELDIR/model.npz.best-perplexity.npz --quiet-translation --device 0 --mini-batch 16 --maxi-batch 100 --maxi-batch-sort src -w 10000 --beam-size 5 | sed 's/ //g' > german-anselm.predictions
Marian outputs predictions in the same format as the input files, i.e. with
whitespace-separated characters, which is why we pipe it through sed 's/ //g'
to obtain the regular representations. You can skip this part, of course, but
it's required if you want to use the evaluation scripts supplied here and/or
compare with the other normalization methods.
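To make the expected format concrete, here is a rough Python sketch of this character-level preprocessing (used for both Marian and XNMT) and of the joining step performed by `sed 's/ //g'` above. It is not the actual `scripts/convert_to_charseq.py`; details such as file naming and the handling of spaces inside tokens are assumptions for illustration.

```python
# Rough sketch: historical forms -> .src, normalizations -> .trg, with
# characters separated by spaces, plus the inverse of `sed 's/ //g'`.
# NOT identical to scripts/convert_to_charseq.py; naming and details are
# assumptions made for illustration.
import os

def to_charseq(dataset_path, out_dir, split_name):
    os.makedirs(out_dir, exist_ok=True)
    with open(dataset_path, encoding="utf-8") as data, \
         open(os.path.join(out_dir, split_name + ".src"), "w", encoding="utf-8") as src, \
         open(os.path.join(out_dir, split_name + ".trg"), "w", encoding="utf-8") as trg:
        for line in data:
            hist, norm = line.rstrip("\n").split("\t")[:2]
            src.write(" ".join(hist) + "\n")   # e.g. "vnd" -> "v n d"
            trg.write(" ".join(norm) + "\n")

def join_chars(line):
    # equivalent of piping the decoder output through `sed 's/ //g'`
    return line.replace(" ", "")

for split in ("train", "test", "dev"):
    to_charseq("german-anselm.%s.txt" % split, "preprocessed", split)
```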
XNMT is based on Python 3.6 and DyNet. You can find detailed instructions on how to install it in the "Getting Started" section of the documentation.
Since XNMT is new software that is changing quickly, and there was no tagged release at the time of my experiments, it is possible that the newest version is not compatible with the exact scripts and instructions provided here. For reference, the experiments performed in Bollmann (2019) are based on XNMT commit 6557ee8. You can obtain this exact version of the code by cloning the XNMT repository and then issuing `git checkout 6557ee8`.
To use XNMT, you need to:
- Preprocess the input to be in separate source/target files with whitespace-separated characters (the same as for Marian):

      scripts/convert_to_charseq.py german-anselm.{train,test,dev}.txt --to preprocessed

- Edit the example configuration file by replacing the `<<TMPDIR>>` string with the path to your preprocessed input files; for example:

      sed -i 's|<<TMPDIR>>|./preprocessed|g' examples/xnmt-config.yaml

  For very small datasets, you might also want to increase the patience value (find the line that says `patience: 5` and adjust it).
Afterwards, you can train the model by calling:

    PYTHONHASHSEED=0 python3 -m xnmt.xnmt_run_experiments xnmt-config.yaml --dynet-seed 0 --dynet-gpu

This both trains and evaluates; the final predictions will be stored as `dev.predictions` in the given directory.
cSMTiser requires an installation of Moses and MGIZA. Detailed instructions can be found in the cSMTiser repository. To use it, you need to:
- Preprocess the input to be in separate orig/norm files. There is a bash script to achieve this that has the same argument structure as the script for XNMT and Marian above (a rough Python sketch of this conversion is given after this list):

      mkdir preprocessed
      scripts/convert_to_orignorm.sh german-anselm.{train,test,dev}.txt --to preprocessed

- Edit the example configuration file by replacing the `<<TMPDIR>>` string with the path to your preprocessed input files; for example:

      sed -i 's|<<TMPDIR>>|preprocessed|g' examples/csmtiser-config.yaml

  Likewise, you should replace `<<MODELDIR>>` with the desired output directory (absolute path!) for your trained model, and `<<MOSESDIR>>` with the path to your local Moses installation.

  To add the contemporary data for language modelling (optional), find the line in the configuration file that says `lms: []` and replace it with (e.g.) `lms: [datasets/modern/combined.de.uniq]`.
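For illustration only, the orig/norm preprocessing essentially splits the two dataset columns into two line-aligned files; the actual conversion is done by `scripts/convert_to_orignorm.sh`, and the file naming and details below are assumptions.

```python
# Sketch of the orig/norm split for cSMTiser: historical forms go into a .orig
# file, gold normalizations into a line-aligned .norm file.
# NOT the actual scripts/convert_to_orignorm.sh; naming is an assumption.
import os

def to_orignorm(dataset_path, out_dir, split_name):
    os.makedirs(out_dir, exist_ok=True)
    with open(dataset_path, encoding="utf-8") as data, \
         open(os.path.join(out_dir, split_name + ".orig"), "w", encoding="utf-8") as orig, \
         open(os.path.join(out_dir, split_name + ".norm"), "w", encoding="utf-8") as norm:
        for line in data:
            hist, gold = line.rstrip("\n").split("\t")[:2]
            orig.write(hist + "\n")   # historical word forms, one per line
            norm.write(gold + "\n")   # gold normalizations, line-aligned

for split in ("train", "test", "dev"):
    to_orignorm("german-anselm.%s.txt" % split, "preprocessed", split)
```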
Afterwards, training the model requires the following two commands (from the cSMTiser directory):

    python preprocess.py csmtiser-config.yaml
    python train.py csmtiser-config.yaml
Generating normalizations is achieved by calling:

    python normalise.py csmtiser-config.yaml preprocessed/test.orig

The predicted normalizations will, in this case, be written to `preprocessed/test.orig.norm`.
All software (in the `scripts/` directory) is provided under the MIT License.
Licenses for the datasets differ. The German Anselm data is licensed under CC BY-SA 3.0. The German RIDGES data is licensed under CC BY 3.0. The Icelandic data is licensed under GNU LGPL v3. The Slovene data is licensed under CC BY-SA 4.0. The other datasets unfortunately do not indicate a license, but their rights holders have indicated that the resources are "free" for research purposes. Please see the READMEs included in each dataset subdirectory for more details.
For questions or problems, feel free to file a GitHub issue, or contact me directly:
- Marcel Bollmann ([email protected])