[nrc.en.mtnt] update word list #221

DavidLRowe · 2023-08-14T20:59:17Z

This is a major cleanup of the word list used in this lexical model. In particular:

- non-words removed
  - most acronyms removed (though a few were retained)
  - items with digits removed (except 1st, 2nd, etc.)
- misspellings corrected (which may result in a double entry for the word, for example becasue->because)
  - based on spell check in Notepad++, along with Merriam-Webster online dictionary
  - however, both US and UK/Canadian/Australian spellings retained
- most proper names removed
  - exceptions for names of continents, countries, nationalities, religions
  - or made lower case if it's a word, for example Cardinals->cardinals (though the frequency count might not be representative)

This resulted in about 25% of the entries being eliminated and others being modified.
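As a rough illustration of the "items with digits removed (except 1st, 2nd, etc)" rule, here is a minimal Python sketch. It is not the procedure actually used for this PR (the cleanup above was done by hand with a spell checker); the `keep_entry` helper and the ordinal pattern are assumptions made for the example.

```python
import re

# Assumption: an "ordinal" is one or more digits followed by st/nd/rd/th.
# This is deliberately loose (it would also accept "1nd").
ORDINAL = re.compile(r"\d+(?:st|nd|rd|th)", re.IGNORECASE)


def keep_entry(word: str) -> bool:
    """Return False for entries containing a digit, unless the entry is an ordinal."""
    if not any(ch.isdigit() for ch in word):
        return True
    return ORDINAL.fullmatch(word) is not None


words = ["because", "1st", "2nd", "h264", "4k", "21st"]
kept = [w for w in words if keep_entry(w)]
print(kept)
```

A filter like this only flags candidates; borderline entries would still need a manual look.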

darcywong00

ooh that was a lot of work.

Do you need to bump the version

lexical-models/release/nrc/nrc.en.mtnt/source/nrc.en.mtnt.model.kps

Line 20 in 7baa655

and add an entry to HISTORY.md?

DavidLRowe · 2023-08-17T01:10:48Z

@darcywong00 Yes, there is more work to do. I'm primarily interested in feedback on whether I cut out too many entries. I eliminated a lot of proper names, but kept country names (since it would be a limited set), though I didn't add any new country names.

(If I'd known at the beginning how much work was involved, I might never have started!)

DavidLRowe · 2023-08-17T01:17:58Z

In particular, I want to wait for preliminary approval on the word list modifications before changing HISTORY.md (and perhaps README.md).

darcywong00 · 2023-08-17T02:27:13Z

We can wait till @jahorton returns next week to get his thoughts

mcdurdin · 2023-08-17T03:39:18Z

I'm not going to get a chance to review this before September meetings -- is anyone else available to look into this? Paging @jahorton @darcywong00 @eddieantonio 😁

DavidLRowe · 2023-08-17T18:37:20Z

No need to rush. After tomorrow I'll be OOO until the September meetings.

mcdurdin · 2023-08-17T23:16:26Z

After tomorrow I'll be OOO until the September meetings.

me too 😆

jahorton

Git won't let me view the changes online due to the quantity of them within the same file, so unfortunately, I can't directly comment per line like I'd normally try to do. So, I'll aim to simulate that a bit.

lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv

Line 424 in 7baa655

ok 878

"OK" is such a common abbreviation and acronym that I feel it'd be wrong to remove it. This abbreviation easily predates the internet, or at the very least, common use of the internet. (Granted, I think it should be pre-capitalized.)

Getting rid of internet "initialisms" (like "lmao", "imo", etc) is totally fine, though. "OK" is short for "okay", a single word, though it's less common to see written out.

lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv

Line 541 in 7baa655

okay 626

If we're set on removing the abbreviated form, then I'd strongly recommend adding their frequency counts together - they're the "same word", after all.
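A minimal sketch of what adding the counts together could look like, assuming the word list is held as a word-to-count mapping. The `merge_variants` helper and the variant table are hypothetical and not part of the repository.

```python
from collections import Counter


def merge_variants(counts: dict[str, int], variants: dict[str, str]) -> dict[str, int]:
    """Fold each variant spelling into its canonical form, summing the counts."""
    merged: Counter[str] = Counter()
    for word, count in counts.items():
        # Words without a listed canonical form are kept unchanged.
        merged[variants.get(word, word)] += count
    return dict(merged)


# Counts taken from the mtnt.tsv lines quoted in this review.
counts = {"ok": 878, "okay": 626, "PC": 408}
merged = merge_variants(counts, {"ok": "okay"})
print(merged)
```

With the quoted counts, "okay" would end up at 878 + 626 = 1504.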

lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv

Line 799 in 7baa655

PC 408

I guess this is kind of an exception to the previous comment? In my experience, the average person is familiar with "PC", but not as familiar with what it stands for - "personal computer". I think PC is practically its own word now.

I can find entries in numerous dictionaries listing it; some indicate it's an abbreviation, but at least one actually says "noun" instead of "abbreviation", for whatever that's worth:

https://www.merriam-webster.com/dictionary/pc
https://dictionary.cambridge.org/dictionary/english/pc

Contrast with entries for the internet/texting initialism "imo":

https://dictionary.cambridge.org/dictionary/english/imo

... which is explicitly labeled as a "written abbreviation".

Similar reasoning for preserving TV:

lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv

Line 839 in 7baa655

TV 389

Kinda surprised at "ex" being dropped; I've definitely heard that term used in isolation during natural spoken communication.

lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv

Line 915 in 7baa655

ex 354

Certainly, it's more "proper" to mention "ex-husband", "ex-girlfriend", etc instead of just "ex", but I think not supporting it within the wordlist is rather "prescriptive" instead of "descriptive". (Linguistically speaking)

I'm a bit on the fence with some of the more common company names - YouTube, Facebook, Amazon etc. Yeah, we don't want to show corporate favoritism, but with just how common some of these are, people may find it "weird" to not see them in suggestions at all. Amazon does refer to a rainforest as well... not that it's in common use for that, admittedly. Contrast with Twitter, which was kept - with the same frequency as before:

lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv

Line 1838 in 7baa655

Twitter 164

I wouldn't be opposed to drastically reducing their frequency if we decided to keep them - they'd be less likely to show up, but would still show up if/when appropriate. My initial, gut reaction - cut it by 75% or so? (That's a bit arbitrary, admittedly.)
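For a sense of scale, a small sketch of that kind of reduction. The `scale_count` helper, its rounding down, and the floor of 1 are assumptions for illustration; the 25% figure is the admittedly arbitrary one suggested above.

```python
def scale_count(count: int, keep_fraction: float = 0.25) -> int:
    """Scale a frequency count down, rounding down but never below 1."""
    return max(1, int(count * keep_fraction))


# "Twitter 164" is the entry quoted above.
print(scale_count(164))
```

Cutting 164 by 75% leaves 41, so the entry would still be offered, just ranked well below more common words.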

For contrast, I see Thanos was kind of frequent... but the wordlist was made around the time he was pretty relevant in the MCU. That's pretty niche and is quite reasonable to remove - it's an artifact of when the list was made and how it was made. I'm totally in favor of removing that entry and those like it. (Thor, etc) It's not like we have an entry for Zeus, Artemis, or other major historical Greek or Roman gods, so removing the (Marvel-relevant) Norse ones is the right call. (Roman-inspired planet names are a reasonable exception.)

lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv

Line 1540 in 7baa655

PM 201

If for no other reason, the fact that "PM" is used when talking about time is pretty significant. I'd prefer to keep an entry for it in case we enable auto-correct at some point in the future - it'd be really awkward to "autocorrect" away from PM when talking about time.

We don't have to worry about that with "AM" because, of course, "am" is a regular word on its own.

lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv

Line 1557 in 7baa655

gameplay 199

I'm probably biased, but I'm against removing that one. It's not really slang or an abbreviation and has its own, distinct meaning.

lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv

Line 2125 in 7baa655

eh 136

Canadians in shambles. Though, to be fair, I don't exactly see "innit" - a common British-ism - in the list either - not even before the changes.

I see some lower-frequency tech abbreviations that I kind of want to bring up along the lines of PC, but they're infrequent and/or niche enough that it is probably fine to remove them anyway.

Granted, AI has been in the news more frequently as of late, I think - there's been discussion about the use of AI in art. And, of course, there's ChatGPT. If I were to argue one more tech abbreviation, that'd be the one - AI is spoken as such far more frequently than I hear laypeople say the long-form version: "artificial intelligence".

USB would be second, cause nobody says "Universal Serial Bus" when speaking casually.

lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv

Line 2667 in 7baa655

Marvel 102

"marvel" is a perfectly fine English word. Just... yeah, don't keep it capitalized.

https://dictionary.cambridge.org/dictionary/english/marvel

Removed:

lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv

Line 3240 in 7baa655

Roman 80

Kept:

lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv

Line 5811 in 7baa655

Greek 38

lexical-models/release/nrc/nrc.en.mtnt/source/mtnt.tsv

Line 5705 in 7baa655

Latin 39

I don't think removing Roman but keeping Greek and Latin is fair. All are common-use when talking history, aren't they?

I do see that we're generally removing names of companies, people, and medications - that's probably fine, and that does allow us to remove a lot of entries. If we do want to outright-remove Facebook and YouTube, that would be consistent.

I'm not sure that removing some of the geographical names is the right call, though - stuff like Amazon (the major river & rainforest), California (the state), etc. County-level and city-level stuff is probably fine to drop, though - there's too much variation at those levels and below. Then again, I don't see any entries before or after for Mississippi (the major river & state), so I guess that this position could complicate things.

I stopped scanning through the changelist after about entry 4000; anything after that point seems generally low-frequency enough to not nitpick. I only noted the Greek thing because it felt a natural point of comparison for Roman.

darcywong00 · 2023-08-21T02:47:39Z

I do see that we're generally removing names of companies, people, and medications - that's probably fine, and that does allow us to remove a lot of entries. If we do want to outright-remove Facebook and YouTube, that would be consistent.

re: company names

@mcdurdin noted for issue #178 he wanted to add

Qantas (airline) (and a number of other brands!)

DavidLRowe · 2023-08-25T18:10:53Z

Thanks, @jahorton for those comments. I haven't fully reviewed them. I will add some comments and maybe we can talk in Switzerland.

- If I was in doubt about removing an item, I often opted to remove it since that would make it show up for review. Happy to add things back in.
- Since the primary use is for proposing words during typing, I tended to omit two-letter words/abbreviations.
- Perhaps we want to collect proper names in another TSV file (or several)? For this pass I kept country names and related terms, but dropped those for states, provinces, cities. Hence keeping Greece, but dropping Roman.

mcdurdin · 2023-08-29T01:08:13Z

Perhaps we want to collect proper names in another TSV file (or several)? For this pass I kept country names and related terms, but dropped those for states, provinces, cities. Hence keeping Greece, but dropping Roman.

That sounds like a good idea. I think keeping common proper names in the list is helpful because this is for general use, and we're often typing these proper names while texting.

mcdurdin · 2023-08-29T01:08:38Z

Since the primary use is for proposing words during typing, I tended to omit two-letter words/abbreviations.

Two letter words are still slightly helpful for corrections for fat fingering

DavidLRowe · 2024-02-15T03:14:17Z

Picking this up after six months!

From @jahorton 20 Aug 2023 review:

- keep "ok" (change to "OK")
- keep "PC"
- keep "TV"
- keep "ex"
- change frequency of "twitter" to 41 (25% of original)
- keep "PM"
- keep "gameplay"
- keep "eh"
- keep "USB"
- keep "Marvel" but change to "marvel" and cut frequency to 25

I'll leave Greece (referring to present day country) in and omit Roman. See comments above for my (admittedly arbitrary) criteria. Someday it might be helpful to split this into various files (proper names for countries, personal names, etc.).

In an effort to make review a bit easier, I created a changed_or_deleted.txt file, where I took a diff file (with - for removed lines and + for added lines) and sorted it alphabetically ignoring the plus/minus. (I retained the frequency count after the word.) Note that this doesn't reflect the changes listed above from Josh's review.

In the file, lines with a + should have a corresponding line with a - nearby, indicating what it was changed from, usually a spelling correction or a proper name changed to its corresponding uncapitalized word. Some examples:

+acolyte	8
-Acolyte	8
-Activision	8
+actually	5
-actualy	5

"Acolyte" was changed to "acolyte" (with no change in frequency count). "Activision" was dropped. Misspelled "actualy" changed to "actually". (Unfortunately, the diff file didn't handle non-ASCII characters very well, so être shows up as Ãªtre but only in this review file.) There are about 6000 deletions and about 600 corrections.
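For anyone wanting to regenerate a review file like this, here is a minimal Python sketch of the approach described above: keep only the added and removed lines of a unified diff and sort them by word, ignoring case and the plus/minus sign. The `sort_diff_lines` helper is hypothetical, and the tab-separated word/count layout is assumed from the examples. The last line reproduces the être garbling, which is consistent with UTF-8 bytes being read as Latin-1; that is a guess at the cause, not something confirmed here.

```python
def sort_diff_lines(diff_lines):
    """Keep only +/- lines of a unified diff and sort them by word, ignoring case and sign."""
    changes = []
    for line in diff_lines:
        # Skip the "+++"/"---" file headers, "@@" hunk headers and context lines.
        # (A word list entry that itself starts with "--" would be skipped too.)
        if line.startswith(("+++", "---")) or not line.startswith(("+", "-")):
            continue
        sign, entry = line[0], line[1:].rstrip("\n")
        word = entry.split("\t")[0]
        # "+" sorts before "-", so a correction is listed just above the entry it replaced.
        changes.append((word.casefold(), sign, entry))
    changes.sort()
    return [sign + entry for _, sign, entry in changes]


diff = [
    "--- a/mtnt.tsv",
    "+++ b/mtnt.tsv",
    "@@ -1,3 +1,2 @@",
    "-Acolyte\t8",
    "-Activision\t8",
    "-actualy\t5",
    "+acolyte\t8",
    "+actually\t5",
]
review = sort_diff_lines(diff)
print("\n".join(review))

# Likely cause of the garbling (an assumption): UTF-8 bytes decoded as Latin-1.
garbled = "être".encode("utf-8").decode("latin-1")
```

If that guess is right, reading and writing the diff with an explicit `encoding="utf-8"` should avoid the garbled characters.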

I'd welcome any feedback!

mcdurdin · 2024-02-15T04:03:31Z

Someday it might be helpful to split this into various files (proper names for countries, personal names, etc.).

Agree.

I think we consider this a good update and move forward with getting it merged in? My feedback is very late, but here are a few thoughts:

I think proper names are very useful for predictive text, including common brand names, localities, common personal names. They are frequently used in text messaging. Common entertainment identities I am less worried about this time, partly because they change so frequently. (But 'Frodo' should definitely be there right?)
I would err on the side of adding more words rather than removing rarer words. I would tend to only remove misspelled (misspelt) words or those which are offensive. I find in use that I want to be offered these words and often they are missing. Particularly varied word endings -- I'll be offered "...ed", "...s", but "...ing" will be missing for one word, and a different ending will be missing for another word.
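One way to look for the missing endings mentioned above is a quick scan of the word list. The sketch below is deliberately naive and is only an assumption about how such a check could work: it appends "s", "ed" and "ing" to words already in the list and ignores spelling changes (for example make/making), so its output would be a list of candidates for review rather than a list of words to add. The `missing_endings` helper is hypothetical.

```python
def missing_endings(words, suffixes=("s", "ed", "ing")):
    """For each word with at least one suffixed form in the list, report the forms that are absent."""
    present = set(words)
    report = {}
    for base in present:
        forms = {base + suffix for suffix in suffixes}
        found = forms & present
        # Only report bases that clearly take these endings but lack some of them.
        if found and found != forms:
            report[base] = sorted(forms - present)
    return report


words = ["walk", "walks", "walked", "jump", "jumps", "jumping"]
gaps = missing_endings(words)
print(gaps)
```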

DavidLRowe · 2024-02-15T21:54:02Z

When this is approved and merged, I'll close issue #178 and open a new issue to capture new ideas listed on this PR along with things on issue 178 that were not addressed in this PR.

DavidLRowe · 2024-02-15T21:56:57Z

ooh that was a lot of work.

Do you need to bump the version

lexical-models/release/nrc/nrc.en.mtnt/source/nrc.en.mtnt.model.kps

Line 20 in 7baa655

<Version URL="">0.2.0</Version>

and add an entry to HISTORY.md?

Version number has been changed (from 0.2.0 to 0.3.0) along with changes to HISTORY.md and README.md.

DavidLRowe · 2024-02-17T20:11:58Z

@darcywong00 It seems that this needs your review since you requested changes.

darcywong00

lgtm

mcdurdin · 2024-02-22T02:58:58Z

A belated huge Thank You for this work @DavidLRowe!

update word list

7bf1ddc

DavidLRowe mentioned this pull request Aug 16, 2023

[nrc.en.mtnt] Revise wordlist #178

Closed

darcywong00 requested changes Aug 17, 2023

darcywong00 changed the title ~~update word list~~ nrc.en.mtnt] update word list Aug 17, 2023

mcdurdin changed the title ~~nrc.en.mtnt] update word list~~ [nrc.en.mtnt] update word list Aug 17, 2023

jahorton reviewed Aug 21, 2023

DavidLRowe added 5 commits February 15, 2024 13:16

Changes to mtnt.tsv in response to feedback

6ba0c2c

Update HISTORY.md

60ad3a3

Update LICENSE.md

c15a19d

Update README.md

e5fb093

Update version number in nrc.en.mtnt.model.kps

40620ab

DavidLRowe requested a review from darcywong00 February 15, 2024 23:01

darcywong00 approved these changes Feb 18, 2024

DavidLRowe merged commit 8a89a0f into keymanapp:master Feb 18, 2024
2 checks passed

DavidLRowe mentioned this pull request Feb 22, 2024

[nrc.en.mtnt] Further revisions to wordlist #242

Open

DavidLRowe deleted the nrc-en branch March 6, 2024 23:35
