[nrc.en.mtnt] update word list #221

Merged: 6 commits into keymanapp:master on Feb 18, 2024

Conversation

DavidLRowe
Contributor

This is a major cleanup of the word list used in this lexical model. In particular:

  • non-words removed
    • most acronyms removed (though a few were retained)
    • items with digits removed (except 1st, 2nd, etc)
  • misspellings corrected (which may result in a double entry for the word, for example becasue->because)
    • based on spell check in Notepad++, along with Merriam-Webster online dictionary
    • however both US and UK/Canadian/Australian spellings retained
  • most proper names removed
    • exceptions for names of continents, countries, nationalities, religions
    • or made lower case if it's a word, for example Cardinals->cardinals (though the frequency count might not be representative)

This resulted in about 25% of the entries being eliminated and others being modified.
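
The digit rule and the spelling-correction step above can be sketched roughly as follows. This is a minimal illustration, not the process actually used for this PR (which was done by hand with a spell checker): the function names, the `corrections` mapping, and the sample counts are all hypothetical, and the ordinal pattern is deliberately loose.

```python
import re

# Items with digits are dropped unless they look like ordinals (1st, 2nd, ...).
# The pattern is loose on purpose and would also accept forms like "1th".
ORDINAL = re.compile(r"^\d+(st|nd|rd|th)$")

def keep_entry(word: str) -> bool:
    """Return True if the word passes the digit rule."""
    if any(ch.isdigit() for ch in word):
        return bool(ORDINAL.match(word))
    return True

def clean(entries, corrections):
    """Apply the digit rule and spelling corrections to (word, count) pairs.

    `corrections` is a hypothetical mapping of misspelling -> fixed spelling.
    Corrected words are not merged here, so a word can appear twice afterwards.
    """
    cleaned = []
    for word, count in entries:
        if not keep_entry(word):
            continue
        cleaned.append((corrections.get(word, word), count))
    return cleaned

# Made-up sample data, only to show the shape of the result.
sample = [("because", 900), ("becasue", 4), ("1st", 30), ("b4", 12)]
result = clean(sample, {"becasue": "because"})
print(result)
```

Note that "because" ends up listed twice, which is the double entry mentioned in the list above.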

Contributor

@darcywong00 darcywong00 left a comment

ooh that was a lot of work.

Do you need to bump the version

and add an entry to HISTORY.md?

@DavidLRowe
Contributor Author

@darcywong00 Yes, there is more work to do. I'm primarily interested in feedback on whether I cut out too many entries. I eliminated a lot of proper names, but kept country names (since it would be a limited set), though I didn't add any new country names.

(If I'd known at the beginning how much work was involved, I might never have started!)

@DavidLRowe
Contributor Author

In particular, I want to wait for preliminary approval on the word list modifications before changing HISTORY.md (and perhaps README.md).

@darcywong00 darcywong00 changed the title from "update word list" to "nrc.en.mtnt] update word list" on Aug 17, 2023
@darcywong00
Contributor

We can wait till @jahorton returns next week to get his thoughts

@mcdurdin
Member

I'm not going to get a chance to review this before September meetings -- is anyone else available to look into this? Paging @jahorton @darcywong00 @eddieantonio 😁

@mcdurdin mcdurdin changed the title from "nrc.en.mtnt] update word list" to "[nrc.en.mtnt] update word list" on Aug 17, 2023
@DavidLRowe
Contributor Author

No need to rush. After tomorrow I'll be OOO until the September meetings.

@mcdurdin
Member

> After tomorrow I'll be OOO until the September meetings.

me too 😆

Contributor

@jahorton jahorton left a comment

Git won't let me view the changes online due to the quantity of them within the same file, so unfortunately, I can't directly comment per line like I'd normally try to do. So, I'll aim to simulate that a bit.


"OK" is such a common abbreviation and acronym that I feel it'd be wrong to remove it. This abbreviation easily predates the internet, or at the very least, common use of the internet. (Granted, I think it should be pre-capitalized.)

Getting rid of internet "initialisms" (like "lmao", "imo", etc) is totally fine, though. "OK" is short for "okay", a single word, though it's less common to see written out.

If we're set on removing the abbreviated form, then I'd strongly recommend adding their frequency counts together - they're the "same word", after all.
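
Adding the counts together could look something like the sketch below. It assumes the word list is available as (word, count) pairs; `merge_variants`, the `variants` mapping, and the counts are hypothetical and only illustrate the idea.

```python
from collections import Counter

def merge_variants(entries, variants):
    """Sum the counts of entries that should be treated as the same word.

    `variants` maps a form being removed to the form being kept,
    for example {"ok": "okay"}.
    """
    totals = Counter()
    for word, count in entries:
        totals[variants.get(word, word)] += count
    return dict(totals)

# Made-up counts, not taken from the real word list.
merged = merge_variants(
    [("ok", 120), ("okay", 80), ("fine", 50)],
    {"ok": "okay"},
)
print(merged)
```

The same helper would also collapse any double entries left behind by spelling corrections, since repeated words are summed as well.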


I guess this is kind of an exception to the previous comment? In my experience, the average person is familiar with "PC", but not as familiar with what it stands for - "personal computer". I think PC is practically its own word now.

I can find entries in numerous dictionaries listing it; some indicate it is an abbreviation, but at least one actually says "noun" instead of "abbreviation", for whatever that's worth:

https://www.merriam-webster.com/dictionary/pc
https://dictionary.cambridge.org/dictionary/english/pc

Contrast with entries for the internet/texting initialism "imo":

https://dictionary.cambridge.org/dictionary/english/imo

... which is explicitly labeled as a "written abbreviation".

Similar reasoning for preserving TV:


Kinda surprised at "ex" being dropped; I've definitely heard that term used in isolation during natural spoken communication.

Certainly, it's more "proper" to mention "ex-husband", "ex-girlfriend", etc instead of just "ex", but I think not supporting it within the wordlist is rather "prescriptive" instead of "descriptive". (Linguistically speaking)


I'm a bit on the fence with some of the more common company names - YouTube, Facebook, Amazon etc. Yeah, we don't want to show corporate favoritism, but with just how common some of these are, people may find it "weird" to not see them in suggestions at all. Amazon does refer to a rainforest as well... not that it's in common use for that, admittedly. Contrast with Twitter, which was kept - with the same frequency as before:

I wouldn't be opposed to drastically reducing their frequency if we decided to keep them - they'd be less likely to show up, but would still show up if/when appropriate. My initial, gut reaction - cut it by 75% or so? (That's a bit arbitrary, admittedly.)
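
As a rough sketch of what "cut it by 75%" would mean for an entry's count (the helper name, the floor of 1, and the sample count are assumptions for illustration, not anything decided in this PR):

```python
def scale_frequency(count: int, keep_fraction: float = 0.25) -> int:
    """Scale a frequency count, keeping `keep_fraction` of the original.

    The result is floored at 1 so a kept word never ends up with a zero count.
    """
    return max(1, round(count * keep_fraction))

# Hypothetical count for a brand name that stays in the list at reduced weight.
print(scale_frequency(200))
```

A floor of 1 is a design choice: it keeps rare entries eligible for suggestion rather than silently dropping them.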

For contrast, I see Thanos was kind of frequent... but the wordlist was made around the time he was pretty relevant in the MCU. That's pretty niche and is quite reasonable to remove - it's an artifact of when the list was made and how it was made. I'm totally in favor of removing that entry and those like it. (Thor, etc) It's not like we have an entry for Zeus, Artemis, or other major historical Greek or Roman gods, so removing the (Marvel-relevant) Norse ones is the right call. (Roman-inspired planet names are a reasonable exception.)


If for no other reason, the fact that "PM" is used when talking about time is pretty significant. I'd prefer to keep an entry for it in case we enable auto-correct at some point in the future - it'd be really awkward to "autocorrect" away from PM when talking about time.

We don't have to worry about that with "AM" because, of course, "am" is a regular word on its own.


I'm probably biased, but I'm against removing that one ("gameplay"). It's not really slang or an abbreviation and has its own, distinct meaning.


Canadians in shambles (re: "eh"). Though, to be fair, I don't exactly see "innit" - a common British-ism - in the list either - not even before the changes.


I see some lower-frequency tech abbreviations that I kind of want to bring up along the lines of PC, but they're infrequent and/or niche enough that it is probably fine to remove them anyway.

Granted, AI has been in the news more frequently as of late, I think - there's been discussion about the use of AI in art. And, of course, there's ChatGPT. If I were to argue one more tech abbreviation, that'd be the one - AI is spoken as such far more frequently than I hear laypeople say the long-form version: "artificial intelligence".

USB would be second, cause nobody says "Universal Serial Bus" when speaking casually.


"marvel" is a perfectly fine English word. Just... yeah, don't keep it capitalized.

https://dictionary.cambridge.org/dictionary/english/marvel


Removed: Roman

Kept: Greek, Latin

I don't think removing Roman but keeping Greek and Latin is fair. All are common-use when talking history, aren't they?


I do see that we're generally removing names of companies, people, and medications - that's probably fine, and that does allow us to remove a lot of entries. If we do want to outright-remove Facebook and YouTube, that would be consistent.

I'm not sure that removing some of the geographical names is the right call, though - stuff like Amazon (the major river & rainforest), California (the state), etc. County-level and city-level stuff is probably fine to drop, though - there's too much variation at those levels and below. Then again, I don't see any entries before or after for Mississippi (the major river & state), so I guess that this position could complicate things.

I stopped scanning through the changelist after about entry 4000; anything after that point seems generally low-frequency enough to not nitpick. I only noted the Greek thing because it felt a natural point of comparison for Roman.

@darcywong00
Copy link
Contributor

> I do see that we're generally removing names of companies, people, and medications - that's probably fine, and that does allow us to remove a lot of entries. If we do want to outright-remove Facebook and YouTube, that would be consistent.

re: company names

@mcdurdin noted for issue #178 he wanted to add

Qantas (airline) (and a number of other brands!)

@DavidLRowe
Contributor Author

Thanks, @jahorton, for those comments. I haven't fully reviewed them. I will add some comments and maybe we can talk in Switzerland.

  • If I was in doubt about removing an item, I often opted to remove it since that would make it show up for review. Happy to add things back in.
  • Since the primary use is for proposing words during typing, I tended to omit two-letter words/abbreviations.
  • Perhaps we want to collect proper names in another TSV file (or several)? For this pass I kept country names and related terms, but dropped those for states, provinces, cities. Hence keeping Greece, but dropping Roman.

@mcdurdin
Member

> • Perhaps we want to collect proper names in another TSV file (or several)? For this pass I kept country names and related terms, but dropped those for states, provinces, cities. Hence keeping Greece, but dropping Roman.

That sounds like a good idea. I think keeping common proper names in the list is helpful because this is for general use, and we're often typing these proper names while texting.

@mcdurdin
Member

> • Since the primary use is for proposing words during typing, I tended to omit two-letter words/abbreviations.

Two-letter words are still slightly helpful for correcting fat-fingered typos.

@DavidLRowe
Contributor Author

Picking this up after six months!

From @jahorton 20 Aug 2023 review:

  • keep "ok" (change to "OK")
  • keep "PC"
  • keep "TV"
  • keep "ex"
  • change frequency of "twitter" to 41 (25% of original)
  • keep "PM"
  • keep "gameplay"
  • keep "eh"
  • keep "USB"
  • keep "Marvel" but change to "marvel" and cut frequency to 25

I'll leave Greece (referring to present day country) in and omit Roman. See comments above for my (admittedly arbitrary) criteria. Someday it might be helpful to split this into various files (proper names for countries, personal names, etc.).

In an effort to make review a bit easier, I created a changed_or_deleted.txt file, where I took a diff file (with - for removed lines and + for added lines) and sorted it alphabetically ignoring the plus/minus. (I retained the frequency count after the word.) Note that this doesn't reflect the changes listed above from Josh's review.

In the file, lines with a + should have a corresponding line with a - nearby, indicating what it was changed from, usually a spelling correction or a proper name changed to its corresponding uncapitalized word. Some examples:

+acolyte	8
-Acolyte	8
-Activision	8
+actually	5
-actualy	5

"Acolyte" was changed to "acolyte" (with no change in frequency count). "Activision" was dropped. Misspelled "actualy" changed to "actually". (Unfortunately, the diff file didn't handle non-ASCII characters very well, so être shows up as être but only in this review file.) There are about 6000 deletions and about 600 corrections.
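
For anyone wanting to regenerate a similar review file, here is a minimal sketch of the idea. It is not the script used for this PR: `changed_or_deleted` and the dict-based inputs are hypothetical, and the "about" entry is made up so that there is one unchanged word. The second helper shows one common way such garbling arises (UTF-8 bytes decoded as Windows-1252); whether that is what happened with the diff file here is only a guess.

```python
def changed_or_deleted(old, new):
    """Build diff-style lines from two {word: count} dicts.

    Removed or changed entries get "-", added or changed entries get "+".
    Lines are sorted alphabetically by word, ignoring the marker and case,
    so a "+" line lands next to the "-" line it replaces.
    """
    lines = []
    for word, count in old.items():
        if new.get(word) != count:
            lines.append(f"-{word}\t{count}")
    for word, count in new.items():
        if old.get(word) != count:
            lines.append(f"+{word}\t{count}")
    # Sort on the word only; on ties, "+" sorts before "-".
    return sorted(lines, key=lambda ln: (ln[1:].split("\t")[0].casefold(), ln[0]))

def mojibake(text: str) -> str:
    """Show what UTF-8 text looks like when wrongly decoded as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

old = {"about": 100, "Acolyte": 8, "Activision": 8, "actualy": 5}
new = {"about": 100, "acolyte": 8, "actually": 5}
for line in changed_or_deleted(old, new):
    print(line)
```

Reading and writing the word list with an explicit `encoding="utf-8"` is usually enough to avoid that kind of garbling.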

I'd welcome any feedback!

@mcdurdin
Member

> Someday it might be helpful to split this into various files (proper names for countries, personal names, etc.).

Agree.

I think we should consider this a good update and move forward with getting it merged in. My feedback is very late, but here are a few thoughts:

  • I think proper names are very useful for predictive text, including common brand names, localities, common personal names. They are frequently used in text messaging. Common entertainment identities I am less worried about this time, partly because they change so frequently. (But 'Frodo' should definitely be there right?)
  • I would err on the side of adding more words rather than removing rarer words. I would tend to only remove misspelled (misspelt) words or those which are offensive. I find in use that I want to be offered these words and often they are missing. Particularly varied word endings -- I'll be offered "...ed", "...s", but "...ing" will be missing for one word, and a different ending will be missing for another word.
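
One way to hunt for the missing word endings described above is a naive scan like the sketch below. It is only an illustration: `missing_endings` is a hypothetical helper, it handles plain concatenation only (so it ignores spelling changes such as "make" -> "making"), and on a real list it would flag false positives (for example "a" because "as" exists), so the output would need manual review.

```python
def missing_endings(words, endings=("s", "ed", "ing")):
    """Report stems that have some, but not all, of the given endings.

    A stem is any word in the list; an ending counts as present when
    stem + ending is also in the list.
    """
    words = set(words)
    report = {}
    for stem in words:
        present = [e for e in endings if stem + e in words]
        absent = [e for e in endings if stem + e not in words]
        if present and absent:
            report[stem] = absent
    return report

gaps = missing_endings(["walk", "walks", "walked", "talk", "talks", "talking"])
print(gaps)
```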

@DavidLRowe
Contributor Author

When this is approved and merged, I'll close issue #178 and open a new issue to capture new ideas listed on this PR along with things on issue 178 that were not addressed in this PR.

@DavidLRowe
Contributor Author

> ooh that was a lot of work.
>
> Do you need to bump the version
>
> and add an entry to HISTORY.md?

Version number has been changed (from 0.2.0 to 0.3.0) along with changes to HISTORY.md and README.md.

@DavidLRowe
Contributor Author

@darcywong00 It seems that this needs your review since you requested changes.

Contributor

@darcywong00 darcywong00 left a comment

lgtm

@DavidLRowe DavidLRowe merged commit 8a89a0f into keymanapp:master Feb 18, 2024
2 checks passed
@mcdurdin
Member

A belated huge Thank You for this work @DavidLRowe!
