The dataset is annotated for the relevant linguistic phenomena, and these annotations can be used to customize bot training for different user language profiles. Some of the most relevant annotations are:
M - Morphological variation: inflectional and derivational
“is my SIM card active”
“is my SIM card activated”
L - Semantic variations: synonyms, use of hyphens, compounding…
“what’s my billing date”
“what’s my anniversary date”
B - Basic syntactic structure
“activate my SIM card”
“I need to activate my SIM card”
I - Interrogative structure
“can you activate my SIM card”
“how do I activate my SIM card”
C - Coordinated syntactic structure
“I have a new SIM card, what do I need to do to activate it?”
D - Indirect speech
“ask my agent to activate my SIM card”
P - Politeness variation
“could you help me activate my SIM card, please?”
Q - Colloquial variation
“can u activ8 my SIM?”
R - Respect structures: language-dependent variations
English: "may" vs "can…"
French: "tu" vs "vous..."
Spanish: "tú" vs "usted..."
W - Offensive language
“I want to talk to a f*cking agent”
K - Keyword mode
"activate SIM"
"new SIM"
E - Use of abbreviations
“I'm / I am interested in getting a new SIM”
Z - Errors and Typos: spelling issues, wrong punctuation…
“how can i activaet my card”
G - Regional variations
US English vs UK English: "truck" vs "lorry"
France French vs Canadian French: "tchatter" vs "clavarder"
Y - Code switching
“activer ma SIM card”
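As a rough illustration of how these tags could be used to tailor training data to a user language profile, the sketch below filters utterances by tag letter. The record layout (the `utterance` and `tags` field names), the `select_utterances` helper, and the tag letters assigned to each sample utterance are assumptions made for this example, not the dataset's documented schema.

```python
# Tag letters and names as listed above.
TAG_NAMES = {
    "M": "Morphological variation",
    "L": "Semantic variations",
    "B": "Basic syntactic structure",
    "I": "Interrogative structure",
    "C": "Coordinated syntactic structure",
    "D": "Indirect speech",
    "P": "Politeness variation",
    "Q": "Colloquial variation",
    "R": "Respect structures",
    "W": "Offensive language",
    "K": "Keyword mode",
    "E": "Use of abbreviations",
    "Z": "Errors and typos",
    "G": "Regional variations",
    "Y": "Code switching",
}


def select_utterances(records, required="", excluded=""):
    """Keep records that carry every required tag and none of the excluded ones.

    Hypothetical helper: assumes each record stores its tags as a string
    of single-letter codes, e.g. "IQ".
    """
    required, excluded = set(required), set(excluded)
    return [
        r for r in records
        if required <= set(r["tags"]) and not (excluded & set(r["tags"]))
    ]


# Illustrative records; the tag assignments are guesses for this example.
records = [
    {"utterance": "activate my SIM card", "tags": "B"},
    {"utterance": "can u activ8 my SIM?", "tags": "IQ"},
    {"utterance": "could you help me activate my SIM card, please?", "tags": "IP"},
    {"utterance": "I want to talk to a f*cking agent", "tags": "BW"},
    {"utterance": "how can i activaet my card", "tags": "IZ"},
]

# A "formal" profile: drop colloquial, offensive, and typo-bearing utterances.
formal = select_utterances(records, excluded="QWZ")
for r in formal:
    names = ", ".join(TAG_NAMES[t] for t in r["tags"])
    print(f'{r["utterance"]}  [{names}]')
```

Reversing the filter, for example `select_utterances(records, required="Q")`, would instead select colloquial utterances for a bot that is expected to handle informal users.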
(c) Bitext Innovations, 2022