Sentences split on newlines #177
base: master
…p/edsnlp into sentences_split_on_lnewlines
Codecov Report
Base: 94.05% // Head: 94.05% // Increases project coverage by
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #177   +/-   ##
=======================================
  Coverage   94.05%   94.05%
=======================================
  Files         172      172
  Lines        5078     5079    +1
=======================================
+ Hits         4776     4777    +1
  Misses        302      302
if seen_period and Lexeme.c_check_flag(token.lex, IS_DIGIT):
continue
seen_newline = False
seen_period = False
This is a correction. Before, a text like "Mesure du H.3 négatif" produced two sentences, "Mesure du H.3" and "négatif", because seen_newline and seen_period were not reset.
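To make the effect of the reset concrete, here is a minimal pure-Python sketch of the behaviour described in this comment. It is not the Cython code from the PR: the function name sentence_starts, the reset_on_digit switch, and the tokenization of "H.3" into "H", ".", "3" are assumptions made for illustration only, and newline handling is left out.

```python
def sentence_starts(tokens, reset_on_digit=True):
    """Return the indices of tokens that start a sentence (simplified sketch).

    Hypothetical helper: only periods are treated as sentence-ending
    punctuation, and newlines are ignored entirely.
    """
    starts = [0] if tokens else []
    seen_period = False
    for i, tok in enumerate(tokens):
        if seen_period:
            if tok.isdigit():
                # A digit right after a period (as in "H.3") is not a boundary.
                if reset_on_digit:
                    # The reset discussed above: forget the period.
                    seen_period = False
                continue
            if tok == ".":
                # Consecutive periods: keep waiting for a real token.
                continue
            starts.append(i)
            seen_period = False
        elif tok == ".":
            seen_period = True
    return starts


# Assumed tokenization of "Mesure du H.3 négatif".
tokens = ["Mesure", "du", "H", ".", "3", "négatif"]
with_reset = sentence_starts(tokens)
without_reset = sentence_starts(tokens, reset_on_digit=False)
```

With the reset, only the first token starts a sentence. Without it, the stale seen_period flag survives the digit and "négatif" is wrongly marked as a sentence start.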
An additional case: if next token is
Do you mean that we should split a text like this into two sentences:
Yes, I think at least 2 consecutive line breaks should split the text into sentences, no?
For the moment, the eds.sentences pipe splits sentences on a newline (\n) token if and only if it is followed by a capitalized token, i.e. a token with an uppercase initial followed by lowercase letters. This can be problematic (see #176).
Description
This PR adds a new split_on_newlines parameter to the pipe:
- split_on_newlines="with_capitalized": only newlines whose following token is capitalized will split sentences.
- split_on_newlines="with_uppercase": only newlines whose following token starts with an uppercase letter will split sentences.
- split_on_newlines=False: newlines will never split sentences.
Checklist
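As a rough illustration of the three split_on_newlines modes described in this PR, the decision for a single newline could be sketched in plain Python as follows. The helper newline_splits is hypothetical and is not part of edsnlp; using str.istitle() as the test for "capitalized" is an approximation chosen for this sketch, and the real pipe works on spaCy tokens rather than strings.

```python
def newline_splits(next_token, split_on_newlines="with_capitalized"):
    """Decide whether a newline followed by `next_token` splits a sentence.

    Hypothetical helper mirroring the three documented modes.
    """
    if split_on_newlines is False:
        # Newlines never split sentences.
        return False
    if not next_token:
        return False
    if split_on_newlines == "with_uppercase":
        # Only the first letter has to be uppercase.
        return next_token[0].isupper()
    if split_on_newlines == "with_capitalized":
        # Approximation: uppercase initial followed by lowercase letters.
        return next_token.istitle()
    raise ValueError(f"Unsupported mode: {split_on_newlines!r}")


# An all-caps token after a newline is where the two string modes differ.
caps_capitalized = newline_splits("PATIENT", "with_capitalized")
caps_uppercase = newline_splits("PATIENT", "with_uppercase")
```

Under these assumptions, a token such as "Le" splits in both string modes, an all-caps token such as "PATIENT" splits only with "with_uppercase", and nothing splits when the parameter is False.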