[nrc.en.mtnt] update word list #221

Merged: 6 commits, Feb 18, 2024
8 changes: 8 additions & 0 deletions release/nrc/nrc.en.mtnt/HISTORY.md
@@ -1,5 +1,13 @@
# Version History

## 0.3.0 (2024-02-15)

* Major cleanup:
  * non-words removed (including some acronyms)
  * misspellings corrected (both US and UK/Canadian/Australian spellings preserved)
  * most proper names removed (except names of continents, countries, nationalities, religions)
  * some proper names made lower case (Cardinals to cardinals), though the frequency count is unchanged

## 0.2.0 (2023-02-13)

* Lower-case some common words
2 changes: 1 addition & 1 deletion release/nrc/nrc.en.mtnt/LICENSE.md
@@ -1,6 +1,6 @@
The MIT License (MIT)

-© 2019-2023 National Research Council Canada
+© 2019-2024 National Research Council Canada

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
12 changes: 12 additions & 0 deletions release/nrc/nrc.en.mtnt/README.md
@@ -59,6 +59,18 @@ them to avoid making that [clbuttic mistake][]. That was “fun”. This was
a pretty manual, and at times, subjective process. Exact replication may
not be guaranteed.

Additional filtering
--------------------

From May 2023 through February 2024 additional hand editing was done
to correct misspellings and remove non-words and many proper names.

Future work
-----------

* Keep separate files for proper names. Could be several files (personal names,
geographic names, company/product names, etc.)
* Include other endings (-s, -ed, -ing, etc.)

[profanities.en]: https://github.com/pmichel31415/mtnt/blob/master/resources/profanities.en
[clbuttic mistake]: https://thedailywtf.com/articles/The-Clbuttic-Mistake-