[nrc.en.mtnt] update word list #221
ooh that was a lot of work.
Do you need to bump the version (<Version URL="">0.2.0</Version>) and add an entry to HISTORY.md?
@darcywong00 Yes, there is more work to do. I'm primarily interested in feedback on whether I cut out too many entries. I eliminated a lot of proper names, but kept country names (since it would be a limited set), though I didn't add any new country names. (If I'd known at the beginning how much work was involved, I might never have started!)
In particular, I want to wait for preliminary approval on the word list modifications before changing HISTORY.md (and perhaps README.md).
We can wait till @jahorton returns next week to get his thoughts.
I'm not going to get a chance to review this before September meetings -- is anyone else available to look into this? Paging @jahorton @darcywong00 @eddieantonio 😁
No need to rush. After tomorrow I'll be OOO until the September meetings.
me too 😆
Git won't let me view the changes online due to the quantity of them within the same file, so unfortunately, I can't directly comment per line like I'd normally try to do. So, I'll aim to simulate that a bit.
ok 878 |
"OK" is such a common abbreviation and acronym that I feel it'd be wrong to remove it. This abbreviation easily predates the internet, or at the very least, common use of the internet. (Granted, I think it should be pre-capitalized.)
Getting rid of internet "initialisms" (like "lmao", "imo", etc) is totally fine, though. "OK" is short for "okay", a single word, though it's less common to see written out.
okay 626 |
If we're set on removing the abbreviated form, then I'd strongly recommend adding their frequency counts together - they're the "same word", after all.
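To make the "add the counts together" suggestion concrete, here is a minimal sketch. It assumes the word list has already been loaded into a plain dict of word to count; `merge_entries` is a hypothetical helper, not something in this repo.

```python
def merge_entries(wordlist, keep, drop):
    """Fold the count of `drop` into `keep`, then remove `drop` (hypothetical helper)."""
    merged = dict(wordlist)  # work on a copy so the caller's dict is untouched
    merged[keep] = merged.get(keep, 0) + merged.pop(drop, 0)
    return merged

# Counts quoted in this review from the old list.
wordlist = {"ok": 878, "okay": 626}
print(merge_entries(wordlist, keep="okay", drop="ok"))  # {'okay': 1504}
```

If "OK" stays in the list instead, nothing needs merging; this only applies when one spelling is dropped.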
PC 408 |
I guess this is kind of an exception to the previous comment? In my experience, the average person is familiar with "PC", but not as familiar with what it stands for - "personal computer". I think PC is practically its own word now.
I can find entries in numerous dictionaries listing it; some indicate it is an abbreviation, but at least one actually says "noun" instead of "abbreviation", for whatever that's worth:
https://www.merriam-webster.com/dictionary/pc
https://dictionary.cambridge.org/dictionary/english/pc
Contrast with entries for the internet/texting initialism "imo":
https://dictionary.cambridge.org/dictionary/english/imo
... which is explicitly labeled as a "written abbreviation".
Similar reasoning for preserving TV:
TV 389 |
Kinda surprised at "ex" being dropped; I've definitely heard that term used in isolation during natural spoken communication.
ex 354 |
Certainly, it's more "proper" to mention "ex-husband", "ex-girlfriend", etc instead of just "ex", but I think not supporting it within the wordlist is rather "prescriptive" instead of "descriptive". (Linguistically speaking)
I'm a bit on the fence with some of the more common company names - YouTube, Facebook, Amazon etc. Yeah, we don't want to show corporate favoritism, but with just how common some of these are, people may find it "weird" to not see them in suggestions at all. Amazon does refer to a rainforest as well... not that it's in common use for that, admittedly. Contrast with Twitter, which was kept - with the same frequency as before:
Twitter 164 |
I wouldn't be opposed to drastically reducing their frequency if we decided to keep them - they'd be less likely to show up, but would still show up if/when appropriate. My initial, gut reaction - cut it by 75% or so? (That's a bit arbitrary, admittedly.)
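A rough sketch of what that reduction could look like, assuming a dict of word to count. `downweight` and its `keep_fraction` parameter are made up for illustration, and the 0.25 simply mirrors the (arbitrary) "cut by 75%" figure. Twitter's count is used only because it is the one company count quoted here; it was kept unchanged in the PR.

```python
def downweight(wordlist, words, keep_fraction=0.25):
    """Scale the counts of the given words; everything else is left alone (hypothetical helper)."""
    scaled = dict(wordlist)
    for word in words:
        if word in scaled:
            # Never drop below 1 so the entry can still be suggested.
            scaled[word] = max(1, round(scaled[word] * keep_fraction))
    return scaled

sample = {"Twitter": 164, "gameplay": 199}
print(downweight(sample, ["Twitter"]))  # {'Twitter': 41, 'gameplay': 199}
```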
For contrast, I see Thanos was kind of frequent... but the wordlist was made around the time he was pretty relevant in the MCU. That's pretty niche and is quite reasonable to remove - it's an artifact of when the list was made and how it was made. I'm totally in favor of removing that entry and those like it. (Thor, etc) It's not like we have an entry for Zeus, Artemis, or other major historical Greek or Roman gods, so removing the (Marvel-relevant) Norse ones is the right call. (Roman-inspired planet names are a reasonable exception.)
PM 201 |
If for no other reason, the fact that "PM" is used when talking about time is pretty significant. I'd prefer to keep an entry for it in case we enable auto-correct at some point in the future - it'd be really awkward to "autocorrect" away from PM when talking about time.
We don't have to worry about that with "AM" because, of course, "am" is a regular word on its own.
gameplay 199 |
I'm probably biased, but I'm against removing that one. It's not really slang or an abbreviation and has its own, distinct meaning.
eh 136 |
Canadians in shambles. Though, to be fair, I don't exactly see "innit" - a common British-ism - in the list either - not even before the changes.
I see some lower-frequency tech abbreviations that I kind of want to bring up along the lines of PC, but they're infrequent and/or niche enough that it is probably fine to remove them anyway.
Granted, AI has been in the news more frequently as of late, I think - there's been discussion about the use of AI in art. And, of course, there's ChatGPT. If I were to argue one more tech abbreviation, that'd be the one - AI is spoken as such far more frequently than I hear laypeople say the long-form version: "artificial intelligence".
USB would be second, 'cause nobody says "Universal Serial Bus" when speaking casually.
Marvel 102 |
"marvel" is a perfectly fine English word. Just... yeah, don't keep it capitalized.
https://dictionary.cambridge.org/dictionary/english/marvel
Removed:
Roman 80 |
Kept:
Greek 38 |
Latin 39 |
I don't think removing Roman but keeping Greek and Latin is fair. All are common-use when talking history, aren't they?
I do see that we're generally removing names of companies, people, and medications - that's probably fine, and that does allow us to remove a lot of entries. If we do want to outright-remove Facebook and YouTube, that would be consistent.
I'm not sure that removing some of the geographical names is the right call, though - stuff like Amazon (the major river & rainforest), California (the state), etc. County-level and city-level stuff is probably fine to drop, though - there's too much variation at those levels and below. Then again, I don't see any entries before or after for Mississippi (the major river & state), so I guess that this position could complicate things.
I stopped scanning through the changelist after about entry 4000; anything after that point seems generally low-frequency enough to not nitpick. I only noted the Greek thing because it felt a natural point of comparison for Roman.
re: company names @mcdurdin noted for issue #178 he wanted to add
Thanks, @jahorton for those comments. I haven't fully reviewed them. I will add some comments and maybe we can talk in Switzerland.
That sounds like a good idea. I think keeping common proper names in the list is helpful because this is for general use, and we're often typing these proper names while texting.
Two-letter words are still slightly helpful for correcting fat-fingering.
Picking this up after six months! From @jahorton 20 Aug 2023 review:
I'll leave Greece (referring to present day country) in and omit Roman. See comments above for my (admittedly arbitrary) criteria. Someday it might be helpful to split this into various files (proper names for countries, personal names, etc.). In an effort to make review a bit easier, I created a changed_or_deleted.txt file, where I took a diff file (with In the file, lines with a
"Acolyte" was changed to "acolyte" (with no change in frequency count). "Activision" was dropped. Misspelled "actualy" changed to "actually". (Unfortunately, the diff file didn't handle non-ASCII characters very well, so I'd welcome any feedback!)
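For anyone who wants to regenerate a similar review list without going through a diff tool, here is one possible sketch. It is not how changed_or_deleted.txt was actually produced. It assumes each line is `word<TAB>count` with optional further columns, and both function names and the `-` / `~` markers are invented for this example. The counts in the sample are placeholders, not values from the real list.

```python
def parse_wordlist(text):
    """Parse 'word<TAB>count' lines into a dict; blank lines and '#' comments are skipped."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        has_count = len(fields) > 1 and fields[1].strip()
        entries[fields[0]] = int(fields[1]) if has_count else 0
    return entries

def changed_or_deleted(old, new):
    """List entries that were dropped ('-') or whose count changed ('~')."""
    report = []
    for word, count in old.items():
        if word not in new:
            report.append(f"- {word}\t{count}")
        elif new[word] != count:
            report.append(f"~ {word}\t{count} -> {new[word]}")
    return report

# Placeholder counts, for illustration only.
old = parse_wordlist("Activision\t7\ncafé\t3\nok\t878\n")
new = parse_wordlist("café\t3\nok\t900\n")
for line in changed_or_deleted(old, new):
    print(line)
```

Reading the files with `open(path, encoding="utf-8")` keeps non-ASCII entries intact. Note that a case change such as "Acolyte" to "acolyte" shows up here as a deletion of the capitalized form, since the comparison is case-sensitive.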
Agree. I think we consider this a good update and move forward with getting it merged in? My feedback is very late, but here are a few thoughts:
When this is approved and merged, I'll close issue #178 and open a new issue to capture new ideas listed on this PR along with things on issue #178 that were not addressed in this PR.
Version number has been changed (from 0.2.0 to 0.3.0) along with changes to HISTORY.md and README.md.
@darcywong00 It seems that this needs your review since you requested changes.
lgtm
A belated huge Thank You for this work @DavidLRowe!
This is a major cleanup of the word list used in this lexical model. In particular:
This resulted in about 25% of the entries being eliminated and others being modified.