missing file in Data stated in constants.py #5

Open
FrankYFTang opened this issue Jan 21, 2021 · 7 comments

@FrankYFTang
Collaborator

I tried to run the basic tests, but they mostly failed.

It seems that constants.py hard-codes some data paths, but those files do not exist.

ftang@ftang4:~/lstm_word_segmentation$ python3 test/test_helpers.py
Traceback (most recent call last):
  File "test/test_helpers.py", line 3, in <module>
    from lstm_word_segmentation.helpers import is_ascii, diff_strings, sigmoid
  File "/usr/local/google/home/ftang/lstm_word_segmentation/lstm_word_segmentation/helpers.py", line 2, in <module>
    from . import constants
  File "/usr/local/google/home/ftang/lstm_word_segmentation/lstm_word_segmentation/constants.py", line 7, in <module>
    THAI_GRAPH_CLUST_RATIO = np.load(str(path), allow_pickle=True).item()
  File "/usr/lib/python3/dist-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/google/home/ftang/lstm_word_segmentation/Data/Thai_graph_clust_ratio.npy'

@sffc
Member

sffc commented Jan 21, 2021

I believe the data files are generated by study_languages.py. I don't think they should be necessary for evaluating a model, since they're used for training. However, it looks like the Python code won't run at all unless those files are present, because importing constants.py fails when it tries to load them. @SahandFarhoodi ?
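One way to keep a missing ratio file from breaking every import is to load it on first use rather than at module level. The following is only a minimal sketch of that idea, not the repository's code: `DATA_DIR` and `load_ratio` are hypothetical names, and the only call taken from the traceback above is `np.load(str(path), allow_pickle=True).item()`.

```python
from functools import lru_cache
from pathlib import Path

# Hypothetical location of the generated files; the real constants.py
# builds its own path to the Data directory.
DATA_DIR = Path("Data")


@lru_cache(maxsize=None)
def load_ratio(name):
    """Load Data/<name> on first use instead of at import time.

    The result is cached, so repeated calls read the file only once.
    """
    path = DATA_DIR / name
    if not path.is_file():
        # Fail with a pointer to the fix rather than a bare open() error.
        raise FileNotFoundError(
            f"{path} is missing; generate it with study_languages.py "
            "or copy it from the shared data folder"
        )
    # Imported here so that merely importing this module needs no data files.
    import numpy as np

    return np.load(str(path), allow_pickle=True).item()
```

With this shape, code that never touches the dictionary (such as helpers like `is_ascii`) can still be imported and tested, and only a call such as `load_ratio("Thai_graph_clust_ratio.npy")` requires the file to exist.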

@SahandFarhoodi
Collaborator

There are data files needed for training and testing in the current version of my Python code. Some of these are data used to train/test models (my.txt, BEST data, etc.), and some are files generated by my code that are also used at evaluation time, such as THAI_GRAPH_CLUST_RATIO, a Python dictionary that contains the frequent grapheme clusters in Thai. You can use the functions in study_languages.py to generate this dictionary yourself, but you will still need the BEST data files.

At the end of my internship, I shared a Google Drive folder with Shane (called Dictionary Segmentation) that has all these files. I just shared the same folder with Frank.

@FrankYFTang
Collaborator Author

There are data files needed to train and test files in the current version of my python code.

Should we at least check in to GitHub all the files needed to TEST / eval the segmentation? I think we should not check in all the data used to train the model, but should we check in anything that is needed to run AFTER training?

@sffc
Member

sffc commented Jan 21, 2021

We shouldn't check in data files that are strongly coupled with the training data. Instead, it would be better design if the code didn't need those files to exist at all. Ideally the code should be able to pull what it needs directly from the model files.

@SahandFarhoodi
Collaborator

I think the main data file that we need for evaluation is the dictionary that holds the grapheme clusters (e.g. THAI_GRAPH_CLUST_RATIO). This dictionary already exists in the model files as well (that's how we use it in Rust), but my Python code reads that file directly (not from the model file) because that made it much easier to develop and change the algorithm. In addition, these dictionaries are almost independent of the training data: we only use the training data to count the different grapheme clusters, and any Thai text (even unsegmented) can be used for this purpose and will result in a similar dictionary (I tried this).
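To illustrate why such a dictionary depends so little on the training data, here is a rough sketch of counting clusters in arbitrary text. It is not the code in study_languages.py: `approx_clusters` and `cluster_ratios` are hypothetical names, the cluster rule is a simplification (each combining mark is attached to the preceding character, which is not full Unicode grapheme segmentation), and treating the "ratio" as a relative frequency is an assumption.

```python
import unicodedata
from collections import Counter


def approx_clusters(text):
    """Split text into approximate grapheme clusters.

    Simplification: a character whose Unicode category starts with "M"
    (a combining mark) is glued to the character before it.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def cluster_ratios(text):
    """Map each non-whitespace cluster to its share of all clusters (assumed meaning of "ratio")."""
    counts = Counter(c for c in approx_clusters(text) if not c.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cluster: n / total for cluster, n in counts.items()}
```

Because the function only counts clusters, it never looks at word boundaries, which is consistent with the observation above that unsegmented text works as well as segmented text.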

@sffc
Member

sffc commented Jan 21, 2021

OK, so I think we should probably just check the ratio files into the repo then. Otherwise, someone who downloads the repo won't be able to run the code. Does that sound okay to you @SahandFarhoodi ?

@SahandFarhoodi
Collaborator

Yes, I think that's the best solution.
