[RoadMap] development plan for Chinese inverse text normalization #1

xingchensong opened this issue Sep 7, 2021 · 3 comments

xingchensong commented Sep 7, 2021

Project Explanations:

  • Following NeMo's two-stage method of (1) classification and (2) verbalization, we plan to adapt jiayu's ITN grammar to this two-stage pipeline (for more details, please see this paper).

  • We split Chinese ITN into two stages, each with its own WFST, rather than transducing the input text with a single WFST, for two reasons:

    1. WFSTs can only process input linearly, but word order can change between the spoken and written forms (e.g. 三分之一 -> 1/3).
    2. The English ITN grammars, which have been carefully designed in NeMo, can be seamlessly integrated into this project.
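
To make the first reason concrete, here is a minimal sketch of the two-stage idea in plain Python. It uses a regular expression instead of WFSTs, and the names `classify`, `verbalize`, `inverse_normalize` and the token layout are assumptions made for illustration; they are not NeMo's or this project's API. Stage 1 tags spans in spoken order, and stage 2 is free to reorder the fields when it writes them out.

```python
import re

# Hypothetical digit table; this sketch only handles single-digit operands.
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
_D = "".join(DIGITS)
# Spoken Chinese fractions put the denominator first: <denominator>分之<numerator>.
FRACTION = re.compile(f"([{_D}])分之([{_D}])")


def classify(text):
    """Stage 1: tag spans with a semiotic class, reading left to right."""
    tokens, pos = [], 0
    for m in FRACTION.finditer(text):
        if m.start() > pos:
            tokens.append({"class": "plain", "text": text[pos:m.start()]})
        tokens.append({
            "class": "fraction",
            "denominator": DIGITS[m.group(1)],
            "numerator": DIGITS[m.group(2)],
        })
        pos = m.end()
    if pos < len(text):
        tokens.append({"class": "plain", "text": text[pos:]})
    return tokens


def verbalize(tokens):
    """Stage 2: render each token in written form; reordering happens here."""
    out = []
    for tok in tokens:
        if tok["class"] == "fraction":
            out.append(f"{tok['numerator']}/{tok['denominator']}")
        else:
            out.append(tok["text"])
    return "".join(out)


def inverse_normalize(text):
    return verbalize(classify(text))


print(inverse_normalize("三分之一"))  # 1/3
```

Because the classifier emits named fields rather than output characters, the verbalizer can place the numerator before the denominator even though it was spoken second.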

xingchensong commented Sep 7, 2021

Road Map:

  • Design semiotic classes for Chinese
  • Update the Chinese ITN grammars from single-stage to two-stage
  • Simplify the ITN-related code of Sparrowhawk (C++) and migrate it to the WeNet runtime

xingchensong self-assigned this Sep 7, 2021
xingchensong pinned this issue Sep 7, 2021
@robin1001

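危楼高百尺,手可摘星辰。不敢高声语,恐惊天上人。 (The tower rises a hundred feet; from it one could pluck the stars by hand. I dare not speak aloud, for fear of startling those who dwell in heaven.)
Seems great. I'll learn the basic ideas first.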

xingchensong commented Sep 7, 2021

semiotic classes:

| category | sub-category | example |
| --- | --- | --- |
| number | int | 三十一 ==> 31 |
| number | float | 三十一点五七一 ==> 31.571 |
| number | serial | 一一一二二二三三三 ==> 111222333 |
| number | telephone | 加八六一八五四四一三九一二一 ==> +86-18544139121 |
| electronic | IP | 二幺九点二二三点幺八四点二五二 ==> 219.223.184.252 |
| electronic | email | xyx艾特gmail点com ==> xyx@gmail.com |
| electronic | url | xyx点com ==> xyx.com |
| fraction | fraction | 三分之一点二 ==> 1.2/3 |
| percent | percent | 百分之二点五 ==> 2.5% |
| measure | measure | 五点五美元 ==> 5.5$ |
| date | date | 二零二一年三月四日 ==> 2021年3月4日 |
| time | time | 下午三点十五分 ==> 3:15 pm |
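
As a rough illustration of what a few of these classes expect, the sketch below converts the int, float, serial, IP, and percent examples from the list above using plain Python. The helper names (`to_int`, `to_float`, `to_serial`, `to_ip`, `to_percent`) are hypothetical, the integer parser only covers values below 10000, and a real implementation would be written as WFST grammars rather than string functions.

```python
# Digit table; 幺 is the spoken alternative for 1 used in serial numbers.
DIGITS = {"零": "0", "一": "1", "幺": "1", "二": "2", "三": "3", "四": "4",
          "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"}
UNITS = {"十": 10, "百": 100, "千": 1000}


def to_serial(spoken):
    """serial: map each spoken digit to a written digit, one by one."""
    return "".join(DIGITS[ch] for ch in spoken)


def to_int(spoken):
    """int: positional numerals below 10000, e.g. 三十一."""
    total, digit = 0, 0
    for ch in spoken:
        if ch in UNITS:
            total += (digit or 1) * UNITS[ch]  # a bare 十 counts as 10
            digit = 0
        else:
            digit = int(DIGITS[ch])
    return total + digit


def to_float(spoken):
    """float: integer part is positional, fractional part is digit by digit."""
    integer, _, fraction = spoken.partition("点")
    return f"{to_int(integer)}.{to_serial(fraction)}"


def to_ip(spoken):
    """IP: four digit-by-digit groups separated by 点."""
    return ".".join(to_serial(part) for part in spoken.split("点"))


def to_percent(spoken):
    """percent: 百分之 prefix followed by an int or a float."""
    prefix = "百分之"
    if not spoken.startswith(prefix):
        raise ValueError(f"not a percent expression: {spoken}")
    body = spoken[len(prefix):]
    value = to_float(body) if "点" in body else str(to_int(body))
    return value + "%"


print(to_int("三十一"))                        # 31
print(to_float("三十一点五七一"))              # 31.571
print(to_serial("一一一二二二三三三"))          # 111222333
print(to_ip("二幺九点二二三点幺八四点二五二"))  # 219.223.184.252
print(to_percent("百分之二点五"))              # 2.5%
```

Note that the same character 点 means a decimal point in float and percent but a field separator in IP, which is one reason the class has to be decided before the written form is produced.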
