Improved measurements and tables pipelines #207

Open: wants to merge 12 commits into base: master

Jungack
Contributor

@Jungack Jungack commented May 12, 2023

Description

Measurement pipe

  • The measurement pipe can now detect measurements with almost any unit, including units starting with per
  • The measurement pipe now detects and keeps unitless measurements when a unitless pattern is provided (the dimension is set to nounit in that case). Percentages are handled the same way: they have a dim of nounit and a scale of 0.01
  • Powers of 10 are supported when placed before the unit or, for a unitless measurement, after the numerical value
  • Intervals are no longer matched (e.g. in ASAT ¦U/L ¦216 + ¦7-40, 7-40 is not matched)
  • In addition to the existing stopwords used in unitless patterns, there is a new category of stopwords: when one or more of them appear between a numerical value and a unit, they are skipped.
  • Added a measure_before_unit variable which lets the user indicate whether the numerical value generally comes before or after the unit. This is especially useful when stopwords appear between numerical values and units (e.g. in the line mg | 5 | mL | 0.3, with "|" as a stopword and measure_before_unit set to False, the extracted ents will be 5mg and 0.3mL. If measure_before_unit were set to True, only the wrong measurement 5mL would be extracted)
  • When the measurements attribute is None, the default behaviour is to match all possible measurements. In that case, all measurements are labeled "eds.measurement", even those that would otherwise be recognised by another default measurement (such as eds.weight)
  • Customizable detection of a range indicator for the measurement, using custom or predefined strings placed just before the numerical value (e.g. "< 5µl", "supérieur à 8"...)
  • Modification of the SimpleMeasurement class to store this range
  • SimpleMeasurement class changed to add pandas displaying support
  • New architecture for the unit_config: degrees and dim are now stored in a sub-dictionary so that the user can define a unit with multiple dimensions (useful for e.g. mmHg)
  • Scales revamped so that a scale of 1 always refers to a unit of the International System of Units (SI)
  • Fixed a bug where every unit with a positive degree led to the creation of its per counterpart, even if that one had already been defined by the user
  • When a unit starting with per is created automatically from a unit that does not start with per, some default terms are added based on those of the latter unit
  • Automatic conversion to SI units via object.ui on a SimpleMeasurement object
  • Added subtraction between two SimpleMeasurement objects
  • Updated doc
  • Use of the table pipe to detect measurements. The strategy is to first label each column as containing values, units or powers of 10, and then link each unit column and power column to the nearest value column. Other features are included, such as automatic retrieval of the unit from the header
  • Complex measure detection inside cells of tables
  • all_measurements variable to let the user choose whether to match all measurements or only the requested ones. If set to True, matched measurements that were not requested by the user are labeled eds.measurement, and the requested ones keep the name defined by the user.
  • parse_doc variable to let the user choose whether to parse the text of the doc outside the tables (True) or not (False)
  • parse_tables variable to let the user choose whether to parse the tables (True) or not (False)
  • measurements variable now accepts valueless_patterns in addition to unitless patterns. This lets the user hardcode some measurements which may not be detected by the measurements pipe. It is especially useful in tables when a measurement is not a numerical value (e.g. positive, negative)
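
To illustrate the measure_before_unit example above, here is a minimal standalone sketch of the pairing logic. It is not the code of this PR: `pair_values_and_units`, `is_number`, `UNITS` and `STOPWORDS` are hypothetical names, and the tokenization is a plain whitespace split chosen only for the illustration.

```python
def is_number(token):
    """Return True if the token parses as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def pair_values_and_units(tokens, units, stopwords, measure_before_unit=True):
    """Pair each numerical value with the adjacent unit, skipping stopwords.

    Hypothetical helper: when measure_before_unit is True the unit is looked
    up right after the value, otherwise right before it.
    """
    # Stopwords between values and units are simply dropped.
    kept = [t for t in tokens if t not in stopwords]
    offset = 1 if measure_before_unit else -1
    pairs = []
    for i, token in enumerate(kept):
        if not is_number(token):
            continue
        j = i + offset
        if 0 <= j < len(kept) and kept[j] in units:
            pairs.append((float(token), kept[j]))
    return pairs


UNITS = {"mg", "mL"}      # assumed unit vocabulary for the example
STOPWORDS = {"|"}         # "|" is treated as a stopword, as in the description
tokens = "mg | 5 | mL | 0.3".split()

unit_first = pair_values_and_units(tokens, UNITS, STOPWORDS, measure_before_unit=False)
value_first = pair_values_and_units(tokens, UNITS, STOPWORDS, measure_before_unit=True)
print(unit_first)   # [(5.0, 'mg'), (0.3, 'mL')]
print(value_first)  # [(5.0, 'mL')]
```

With measure_before_unit set to False both measurements are recovered, while True yields only the wrong 5mL pairing, which matches the behaviour described in the bullet.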

Table pipe

  • The table pipeline now declares a new extension on the span object: the table attribute, which contains a dictionary keyed by column names, row names or column index
  • Added col_names and row_names to let the user specify whether column / row names are detected by the table regex pattern
  • Modified default table regex to make it more robust
  • to_pd_table now accepts an as_span attribute. When set to True, the pandas table contains spans. When set to False, it contains strings.
  • Fixed matching of tables when the separator is also used to draw the outline of the actual table
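
As a rough illustration of the dictionary layout and of the outline fix described above, the sketch below parses a separator-delimited table into a dict keyed either by column names or by column index. It is an assumption-laden toy, not the pipeline's implementation: `parse_table` and its `sep` and `col_names` parameters are hypothetical, cells are plain strings, and no regex or spaCy span is involved.

```python
def parse_table(text, sep="|", col_names=False):
    """Parse a delimiter-separated text table into a dict of columns.

    Hypothetical helper: keys are the header cells when col_names is True,
    otherwise the integer column indices.
    """
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop separators that only draw the outline of the table.
        if line.startswith(sep):
            line = line[len(sep):]
        if line.endswith(sep):
            line = line[:-len(sep)]
        rows.append([cell.strip() for cell in line.split(sep)])
    if not rows:
        return {}
    if col_names:
        keys, body = rows[0], rows[1:]
    else:
        keys, body = list(range(len(rows[0]))), rows
    # Short rows are padded with empty strings.
    return {
        key: [row[i] if i < len(row) else "" for row in body]
        for i, key in enumerate(keys)
    }


text = """
|Test|Value|Unit|
|ASAT|216|U/L|
|ALAT|40|U/L|
"""

by_name = parse_table(text, col_names=True)
by_index = parse_table(text, col_names=False)
print(by_name)   # {'Test': ['ASAT', 'ALAT'], 'Value': ['216', '40'], 'Unit': ['U/L', 'U/L']}
print(by_index[1])  # ['Value', '216', '40']
```

Note that keying by header text makes two columns with the same heading collide, which is the limitation the checklist item on tables with the same headings refers to.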

Checklist

  • Find a compromise between tests failing due to mislabeling and our way of labeling measurements (some measurements get the default label eds.measurement instead of eds.size)
  • Discuss the relevance of measurements pipe tests
  • Support for tables with same headings

@Jungack Jungack changed the title Improved measurements Improved measurements and tables pipelines May 15, 2023