Linguistic features included in our datasets

The dataset is annotated for a range of linguistic phenomena, each identified by a one-letter tag. These annotations can be used to adapt bot training to different user language profiles. Some of the most relevant annotations are:

Lexical variation:

M - Morphological variation: inflectional and derivational

“is my SIM card active”

“is my SIM card activated”

L - Semantic variations: synonyms, use of hyphens, compounding…

“what’s my billing date”

“what’s my anniversary date”

Syntactic structure variation:

B - Basic syntactic structure:

“activate my SIM card”

“I need to activate my SIM card”

I - Interrogative structure

“can you activate my SIM card”

“how do I activate my SIM card”

C - Coordinated syntactic structure

“I have a new SIM card, what do I need to do to activate it?”

D - Indirect speech

“ask my agent to activate my SIM card”

Language register variations:

P - Politeness variation

“could you help me activate my SIM card, please?”

Q - Colloquial variation

“can u activ8 my SIM?”

R - Respect structures - Language-dependent variations

English: "may" vs "can"…

French: "tu" vs "vous"…

Spanish: "tú" vs "usted"…

W - Offensive language

“I want to talk to a f*cking agent”

Stylistic variations:

K - Keyword mode

"activate SIM"

"new SIM"

E - Use of abbreviations:

“I'm / I am interested in getting a new SIM”

Z - Errors and Typos: spelling issues, wrong punctuation…

“how can i activaet my card”

G - Regional variations

US English vs UK English: "truck" vs "lorry"

France French vs Canadian French: "tchatter" vs "clavarder"

Y - Code switching

“activer ma SIM card”
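
The tags above lend themselves to simple filtering when building a training set for a given user profile. The following is a minimal sketch, not the dataset's actual schema: it assumes each row carries a hypothetical `flags` string that concatenates the tag letters, and the rows in `SAMPLE` and their tag assignments are illustrative only. The names `TAGS`, `SAMPLE`, `describe`, and `select` are likewise made up for this example.

```python
from typing import Dict, List

# Tag letters and names, copied from the list above.
TAGS: Dict[str, str] = {
    "M": "Morphological variation",
    "L": "Semantic variations",
    "B": "Basic syntactic structure",
    "I": "Interrogative structure",
    "C": "Coordinated syntactic structure",
    "D": "Indirect speech",
    "P": "Politeness variation",
    "Q": "Colloquial variation",
    "R": "Respect structures",
    "W": "Offensive language",
    "K": "Keyword mode",
    "E": "Use of abbreviations",
    "Z": "Errors and typos",
    "G": "Regional variations",
    "Y": "Code switching",
}

# Hypothetical rows: the "flags" field and the tag assignments are
# assumptions made for illustration, not taken from the dataset.
SAMPLE: List[Dict[str, str]] = [
    {"utterance": "activate SIM", "flags": "K"},
    {"utterance": "could you help me activate my SIM card, please?", "flags": "IP"},
    {"utterance": "can u activ8 my SIM?", "flags": "IQ"},
    {"utterance": "how can i activaet my card", "flags": "IZ"},
    {"utterance": "I want to talk to a f*cking agent", "flags": "BW"},
]


def describe(flags: str) -> List[str]:
    """Expand a string of tag letters into the tag names listed above."""
    return [TAGS[letter] for letter in flags]


def select(rows: List[Dict[str, str]], include: str = "", exclude: str = "") -> List[Dict[str, str]]:
    """Keep rows that carry every tag in `include` and no tag in `exclude`."""
    required, banned = set(include), set(exclude)
    return [
        row
        for row in rows
        if required <= set(row["flags"]) and banned.isdisjoint(row["flags"])
    ]


if __name__ == "__main__":
    # Interrogative utterances without spelling errors.
    for row in select(SAMPLE, include="I", exclude="Z"):
        print(row["utterance"], describe(row["flags"]))
```

For example, `select(SAMPLE, exclude="W")` would drop offensive utterances for a bot that should not be trained on them, while `select(SAMPLE, include="Q")` would keep only colloquial ones.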

(c) Bitext Innovations, 2022