Tutorial: training a dependency parser model in Stanford Stanza

Jun Li, Ph.D
Jan 31, 2022

The Stanford NLP Group states in their Stanza documentation:

“All neural modules, including the tokenizer, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer, the dependency parser, and the named entity tagger, can be trained with your own data. To train your own models, you will need to clone the source code from the stanza git repository and follow the procedures below. We also provide runnable scripts coupled with toy data that makes it much easier for users to get started with model training, and you can find at stanza-train git repository.”

I tried the scripts mentioned above with their toy data, but had no luck. So I set out to find Universal Dependencies data, preprocess it, and run the training script myself. Quite a bit of trial and error was involved, so I thought it might be a good idea to share my steps and my forked GitHub repos. Everything is set up and running in IntelliJ IDEA on an iMac, which makes it easier to step through the code when needed.

Step 1:

Clone the source code from the stanza git repository and the stanza-train git repository. In the stanza package, add a section to __init__.py that sets up the required environment variables, as in my forked stanza repository. In stanza-train, add a directory “UD_English-EWT” under data/udbase, as in my forked stanza-train repository. Alternatively, you may simply clone my forked stanza and stanza-train repositories directly.
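
For reference, the environment-variable section added to __init__.py looks roughly like the sketch below. Only TRAIN_PROJ_BASE needs to come from you; the other variable names and subpaths are my assumptions based on the stanza-train directory layout, so check them against the forked repository.

import os

# A minimal sketch of the environment-variable section added to stanza/__init__.py.
# TRAIN_PROJ_BASE must point at your local stanza-train checkout; the other
# variable names and paths are assumptions that mirror the stanza-train layout.
TRAIN_PROJ_BASE = os.environ.get("TRAIN_PROJ_BASE", "/path/to/stanza-train")

os.environ.setdefault("UDBASE", os.path.join(TRAIN_PROJ_BASE, "data", "udbase"))
os.environ.setdefault("DEPPARSE_DATA_DIR", os.path.join(TRAIN_PROJ_BASE, "data", "processed", "depparse"))
os.environ.setdefault("WORDVEC_DIR", os.path.join(TRAIN_PROJ_BASE, "data", "wordvec"))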

Step 2:

From your locally cloned “stanza” project, set up your favorite Python virtual environment and install all the required packages. As mentioned, I’m using IntelliJ IDEA, so I picked “Virtualenv Environment” and set it up as shown below.
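
If you would rather skip the IDE dialog, the same setup can be scripted. This is only a sketch: the “.venv” name, the POSIX pip path, and running from the stanza repository root are my assumptions.

import subprocess
import sys

# Create a virtual environment inside the stanza checkout and install the
# package (and its dependencies) in editable mode. Assumes a POSIX layout
# and that this script is run from the repository root.
subprocess.run([sys.executable, "-m", "venv", ".venv"], check=True)
subprocess.run([".venv/bin/pip", "install", "-e", "."], check=True)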

Step 3:

You guessed it: the data we are going to train the model on is the “Universal Dependencies English Web Treebank” (UD_English-EWT). The data comes in the so-called CoNLL-U format. Download the three .conllu files (train, dev, and test) into data/udbase/UD_English-EWT under the stanza-train project. The only environment variable you need to set is “TRAIN_PROJ_BASE”, which points to your locally cloned “stanza-train” project. Then run “stanza/utils/datasets/prepare_depparse_treebank.py”.

It produces .conllu files under “data/processed/depparse”, which will be the input files for training.
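
If you prefer to run this preparation step outside the IDE, a rough command-line equivalent is sketched below; the TRAIN_PROJ_BASE value is a placeholder for your own checkout.

import os
import subprocess
import sys

# Point TRAIN_PROJ_BASE at the local stanza-train checkout (placeholder path),
# then run the treebank preparation script on UD_English-EWT.
env = dict(os.environ, TRAIN_PROJ_BASE="/path/to/stanza-train")
subprocess.run(
    [sys.executable, "-m", "stanza.utils.datasets.prepare_depparse_treebank", "UD_English-EWT"],
    check=True,
    env=env,
)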

Step 4:

You may download the word vector data as follows:

python
>>> import stanza
>>> stanza.download("en")
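
By default the download lands under a stanza_resources directory in your home folder. The exact subpath of the pretrained word vector file varies with the Stanza version, so a small lookup like the one below (a sketch, not a fixed path) can help you find it:

from pathlib import Path

# Search the default Stanza download location (~/stanza_resources) for
# pretrained word vector files; the exact subdirectory varies by version.
resources = Path.home() / "stanza_resources"
for path in resources.glob("en/**/*.pt"):
    if "pretrain" in path.as_posix():
        print(path)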

Step 5:

You may use the “--wordvec_pretrain_file” option to point the trainer at your downloaded word vector data. My IntelliJ IDEA “Run/Debug Configuration” is shown below.
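
Outside the IDE, my run configuration amounts to roughly the invocation below, assuming the run_depparse training driver from the stanza source tree. The pretrain file path is a placeholder; substitute the file you located in Step 4.

import subprocess
import sys

# Roughly what my IntelliJ Run/Debug configuration does: train the dependency
# parser on UD_English-EWT, pointing at the downloaded pretrained word vectors.
# The pretrain path below is a placeholder for the file found in Step 4.
subprocess.run(
    [
        sys.executable, "-m", "stanza.utils.training.run_depparse",
        "UD_English-EWT",
        "--wordvec_pretrain_file", "/path/to/stanza_resources/en/pretrain/pretrain.pt",
    ],
    check=True,
)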

Step 6:

Now it’s time to kick off the training process. From IntelliJ IDEA, click the “Run” icon, or “Debug” if you want to step through. The output logs look as shown below.

The batch size was set to 50. It took my iMac (Processor: 3 GHz 6-Core Intel Core i5; Memory: 8 GB 2667 MHz DDR4) about 13 hours to complete the training. The output model is saved as “saved_models/depparse/en_ewt_parser.pt” in the current directory.

That’s it. I hope this saves you some time. You may report any issues through my forked stanza-train repository or via my email (jun.r.li@gmail.com).

Relevant scholarly articles:

  1. “Universal Dependency Parsing from Scratch”, Peng Qi, Timothy Dozat, Yuhao Zhang, Christopher D. Manning, In Proceedings of the CoNLL 2018 UD Shared Task.
  2. “Simpler but More Accurate Semantic Dependency Parsing”, Timothy Dozat, Christopher D. Manning, ACL 2018.
  3. “Stanford’s Graph-based Neural Dependency Parser at the CoNLL 2017 Shared Task”, Timothy Dozat, Peng Qi, Christopher D. Manning, In Proceedings of the CoNLL 2017 Shared Task.
  4. “Deep Biaffine Attention for Neural Dependency Parsing”, Timothy Dozat and Christopher D. Manning, ICLR 2017.
