First, set up a Python 3.6.0 virtualenv with the modules from `requirements.txt` installed.
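For example (assuming `python3.6` is on your PATH; the virtualenv name is arbitrary):

```bash
python3.6 -m venv venv             # create the Python 3.6 virtualenv
source venv/bin/activate           # activate it
pip install -r requirements.txt    # install the required modules
```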
This repo consists of four directories:
- Lexeminator - contains scripts to clone SciTools/Understand (to get lexemes, i.e. tokens, from `.java` files) and to set up the environment for it to run. Execute `./script.sh` and you're good to go. Possible change needed: in `users.txt`, change the username in the first column to the current user.
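  A minimal sketch of this step (the `cd` assumes you run the script from inside its directory):

  ```bash
  cd Lexeminator
  # If needed, first edit users.txt so the first column holds the
  # current user's name (e.g. the output of `whoami`).
  ./script.sh
  ```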
- Transformer_Code_Attention - contains code for the Transformer Attention model.
  - First, execute `python lexeminator.py --src_folder=./DirWithAllThe.javaFiles` (uses the Understand module) to get two files: `lex_dumps_data.txt` (contains the training data encoded as numbers) and `vocab_file.txt` (contains a dict with the mapping between tokens and their number values). The data dump is produced after subtokenization of identifiers. `opt.py` (the Adam optimizer), `model.py` (the PyTorch model for the Transformer Decoder), and `utils.py` (data processing) are used by `train.py`.
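    For example (the folder name is a placeholder for your own tree of `.java` files):

    ```bash
    python lexeminator.py --src_folder=./DirWithAllThe.javaFiles
    ls lex_dumps_data.txt vocab_file.txt   # the two expected output files
    ```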
  - Once the data dump is ready, execute `mkdir save; mkdir ./save/code` (the embedding weight pickles will be saved in that directory) and then execute `python train.py --submit`. `--submit` is necessary to get the weight pickles, which are named `embed_weights_i.pt`, where `i` is the epoch number. Validation (on 15% of the data, chosen randomly at each epoch) has been incorporated to check for overfitting. You can pass args like `--train_file=./lex_dumps_data.txt` (path to the training data dump), `--n_embd=132` (embedding size), `--n_layer=9` (number of Transformer blocks), `--n_head=12` (number of heads for multi-head attention), `--n_iter=60` (number of epochs), `--valid_percent=15` (15% of the data for validation), `--n_ctx=22` (maximum number of tokens in a line), and `--n_batch=500` (minibatch size).
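    For example, a full invocation with the flags above:

    ```bash
    mkdir save; mkdir ./save/code      # where the embedding weight pickles go
    python train.py --submit \
        --train_file=./lex_dumps_data.txt \
        --n_embd=132 --n_layer=9 --n_head=12 \
        --n_iter=60 --valid_percent=15 \
        --n_ctx=22 --n_batch=500
    ```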
  - To check GPU usage, execute `watch -n 0.25 nvidia-smi`.
- Transformer_Postprocessing - contains code for tSNE visualization and for dumping the learned embeddings as a dict for input to the AST Paths model. For tSNE plots, first run `mkdir embeds_test`. Next, run `python tSNE.py --vocab_file=VocabFileFromAbove --sess_dir=embeds_test --embed_file=PathTo_embed_weights_i.pt`. Then run `tensorboard --logdir=./embeds_test` and open up your browser.
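  Putting the steps together (the vocab path and epoch number are illustrative):

  ```bash
  mkdir embeds_test
  python tSNE.py --vocab_file=./vocab_file.txt \
      --sess_dir=embeds_test \
      --embed_file=./save/code/embed_weights_10.pt   # any saved epoch's weights
  tensorboard --logdir=./embeds_test                 # then open the printed URL
  ```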
  To get the embeddings as a dict, run `python to_dict.py` after setting the paths to `vocab_file.txt` and `embed_weights_i.pt` inside `to_dict.py`; it will output a `weights_dict.txt` file.
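  A hedged loader sketch for downstream use; it assumes `weights_dict.txt` holds the dict's Python `repr` (switch to `pickle`/`json` if `to_dict.py` serializes differently):

  ```python
  import ast

  # weights_dict.txt maps tokens to their learned embedding vectors.
  with open("weights_dict.txt") as f:
      weights = ast.literal_eval(f.read())

  token = next(iter(weights))
  print(token, len(weights[token]))   # one token and its embedding size
  ```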
- Transformer_tf_Sagemaker - contains code for the partially complete TensorFlow Estimator model, which will be used by SageMaker. To execute locally, run `python transformer_sagemaker.py --train_file=PathTo_lex_dumps_data.txt`. Code for the custom Adam optimizer has been added but does not work properly (the loss increases at every step). You can also change `steps` in the .py file, since there are no epochs in the Estimator model. The current issue is that the same weights are printed at every step; this could be due to one of two reasons: either the printing is wrong, or the backprop part of the Estimator code is not working as desired. Lines marked `#For Sagemaker` are required for hyperparameter tuning on SageMaker.
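  One way to tell those two causes apart is to log the weights through a hook rather than a one-off `print()`. The sketch below is a generic TF 1.x Estimator toy, not the repo's `model_fn`; if the logged norm never changes across steps, backprop is the problem, otherwise the printing was:

  ```python
  import numpy as np
  import tensorflow as tf  # assumes TF 1.x, matching the Estimator API

  # Toy model_fn whose only purpose is to show per-step weight logging.
  def model_fn(features, labels, mode):
      w = tf.get_variable("w", shape=[4, 2])
      logits = tf.matmul(features["x"], w)
      loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
      train_op = tf.train.AdamOptimizer(0.01).minimize(
          loss, global_step=tf.train.get_or_create_global_step())
      # LoggingTensorHook re-evaluates the tensor at every training step.
      hook = tf.train.LoggingTensorHook({"w_norm": tf.norm(w)}, every_n_iter=1)
      return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op,
                                        training_hooks=[hook])

  def input_fn():
      x = np.random.rand(32, 4).astype(np.float32)
      y = np.random.randint(0, 2, size=32)
      return {"x": tf.constant(x)}, tf.constant(y)

  tf.logging.set_verbosity(tf.logging.INFO)  # make the hook's output visible
  tf.estimator.Estimator(model_fn).train(input_fn, steps=5)
  ```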
Refer to OpenAI's TensorFlow code and Hugging Face's PyTorch port for further details.