Fix undeterministic merge_visits #70

Open
wants to merge 1 commit into
base: main

svittoz
Collaborator

@svittoz svittoz commented Jun 14, 2024

Description

  • Fix non-deterministic merge_visits: sort_values(...).groupby(...).first() is non-deterministic in Koalas.
  • Require the user to provide open_stay_end_datetime, to prevent unexpected non-deterministic results.
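
To illustrate the first point, here is a minimal sketch of an order-independent way to pick one row per group. It is not the code from this PR. It uses plain pandas, and the function name `first_visit_per_person` is hypothetical. The column names follow the usual OMOP layout and are assumptions, as is the uniqueness of `visit_occurrence_id`. The idea is to replace "sort, then take the first row" with `min()` aggregations and joins, which do not depend on row order.

```python
import pandas as pd


def first_visit_per_person(visits: pd.DataFrame) -> pd.DataFrame:
    """Return the earliest visit of each person without relying on row order.

    Hypothetical helper for illustration only. Assumes the columns
    person_id, visit_occurrence_id (unique) and visit_start_datetime.
    """
    # min() is an order-independent aggregation, unlike first().
    earliest = visits.groupby("person_id", as_index=False)[
        "visit_start_datetime"
    ].min()

    # Join back to recover the full rows that match the earliest start.
    candidates = visits.merge(
        earliest, on=["person_id", "visit_start_datetime"], how="inner"
    )

    # Several visits may share the earliest start: break ties explicitly
    # with the smallest visit_occurrence_id.
    tie_break = candidates.groupby("person_id", as_index=False)[
        "visit_occurrence_id"
    ].min()
    result = candidates.merge(
        tie_break, on=["person_id", "visit_occurrence_id"], how="inner"
    )

    # Sorting here only normalizes the output; it does not affect which
    # rows were selected.
    return result.sort_values("person_id").reset_index(drop=True)


visits = pd.DataFrame(
    {
        "person_id": [1, 1, 1, 2, 2],
        "visit_occurrence_id": [10, 11, 12, 20, 21],
        "visit_start_datetime": pd.to_datetime(
            ["2020-01-02", "2020-01-01", "2020-01-01", "2021-05-01", "2021-06-01"]
        ),
    }
)
shuffled = visits.sample(frac=1, random_state=0)

# The same rows are selected whatever the input order.
print(first_visit_per_person(visits))
print(first_visit_per_person(shuffled))
```

Because every step is a `min()` or a join on keys, the selected rows stay the same when the input is shuffled or repartitioned. The explicit tie-break matters: without it, two visits sharing the same start datetime would both survive the first join.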

Checklist

  • If this PR is a bug fix, the bug is documented in the test suite.
  • Changes were documented in the changelog (pending section).
  • If necessary, changes were made to the documentation (e.g. a new pipeline).

@svittoz svittoz force-pushed the fix-visit-merging branch 3 times, most recently from 62098ca to b8d0716 Compare June 14, 2024 12:46

github-actions bot commented Jun 14, 2024

Coverage Report

Name | Stmts | Miss | ∆ Miss | Cover
TOTAL | 2441 | 154 | 0 | 94%
Files without new missing coverage
Name | Stmts | Miss | ∆ Miss | Cover
eds_scikit/utils/test_utils.py

Was already missing at line 50

 def date(s):
-     return dt.strptime(s, "%Y-%m-%d")
Was already missing at lines 88-90
         args = tuple(args)
-     elif type(index_or_key) == str:
-         kwargs[index_or_key] = inputs
Was already missing at lines 114-116
     else:
-         normalized_sum_sq_diff = sum_sq_diff / np.sqrt(sum_sq_diff)
-         assert normalized_sum_sq_diff < 0.001

Stmts: 54 | Miss: 5 | ∆ Miss: 0 | Cover: 91%
eds_scikit/utils/flowchart/flowchart.py

Was already missing at line 152

     def __str__(self) -> str:
-         return self.__repr__()

Stmts: 131 | Miss: 1 | ∆ Miss: 0 | Cover: 99%
eds_scikit/utils/custom_implem/custom_implem.py

Was already missing at line 54

         """
-         return cut(
             x,

Stmts: 22 | Miss: 1 | ∆ Miss: 0 | Cover: 95%
eds_scikit/utils/checks.py

Was already missing at line 127

         if return_index_or_key:
-             return kwargs[argname], argname
         return kwargs[argname]
Was already missing at line 149
         else:
-             to_display_per_concept = [f"- {concept}" for concept in required_concepts]
         str_to_display = "\n".join(to_display_per_concept)
Was already missing at lines 172-189
-         if all(isinstance(table, tuple) for table in required_tables):
  ...
-         super().__init__(message)

Stmts: 71 | Miss: 10 | ∆ Miss: 0 | Cover: 86%
eds_scikit/utils/bunch.py

Was already missing at line 32

     def __setattr__(self, key, value):
-         self[key] = value
Was already missing at line 35
     def __dir__(self):
-         return self.keys()
Was already missing at lines 38-41
     def __getattr__(self, key):
-         try:
-             return self[key]
-         except KeyError:
             raise AttributeError(key)

Stmts: 11 | Miss: 5 | ∆ Miss: 0 | Cover: 55%
eds_scikit/resources/utils.py

Was already missing at line 19

     if len(splited) == 1:
-         return None
     return splited[-1]

Stmts: 6 | Miss: 1 | ∆ Miss: 0 | Cover: 83%
eds_scikit/resources/reg.py

Was already missing at lines 50-78

             # Looking for a match excluding version string
-             candidates = [
  ...
-             func = r.get(candidates[0])
         return func

Stmts: 16 | Miss: 4 | ∆ Miss: 0 | Cover: 75%
eds_scikit/plot/omop_teva.py

Was already missing at line 108

                 if drop_columns:
-                     table = table.merge(
                         visit_occurrence.drop(columns=drop_columns),

Stmts: 40 | Miss: 1 | ∆ Miss: 0 | Cover: 98%
eds_scikit/period/tagging_functions.py

Was already missing at lines 60-63

         # TODO: is this necessary ?
-         logger.warning("No matching were found between the 2 DataFrames")
- 
-         return framework.DataFrame(
             columns=["person_id", "t_start", "t_end", "concept", "value"]
Was already missing at lines 119-123
         return (B_start >= A_start) & (B_end <= A_end)
-     elif algo == interval_algos.from_before_to:
-         return B_end <= A_start
-     elif algo == interval_algos.to_before_from:
-         return A_end <= B_start
     else:

Stmts: 36 | Miss: 6 | ∆ Miss: 0 | Cover: 83%
eds_scikit/period/stays.py

Was already missing at line 408

         if open_stay_end_datetime is None:
-             open_stay_end_datetime = datetime.now()
         vo["visit_end_datetime_calc"] = open_stay_end_datetime

Stmts: 89 | Miss: 1 | ∆ Miss: 0 | Cover: 99%
eds_scikit/io/i2b2_mapping.py

Was already missing at lines 38-211

-     i2b2_table_name = i2b2_tables[db_source][table]
  ...
-     return df
Was already missing at lines 230-234
-     def f(x):
-         return mapping.get(x, default)
- 
-     return F.udf(f)

Stmts: 79 | Miss: 69 | ∆ Miss: 0 | Cover: 13%
eds_scikit/io/base.py

Was already missing at line 13

     def __str__(self):
-         return self.__repr__()

Stmts: 9 | Miss: 1 | ∆ Miss: 0 | Cover: 89%
eds_scikit/event/from_code.py

Was already missing at lines 108-111

     else:
-         event.loc[:, "t_start"] = event.loc[:, columns["code_start_datetime"]]
  ...
-         event = event.drop(
             columns=[columns["code_start_datetime"], columns["code_end_datetime"]]

Stmts: 42 | Miss: 3 | ∆ Miss: 0 | Cover: 93%
eds_scikit/event/diabetes.py

Was already missing at lines 88-102

     """
-     diabetes = conditions_from_icd10(
  ...
- 
-     return diabetes

Stmts: 10 | Miss: 4 | ∆ Miss: 0 | Cover: 60%
eds_scikit/event/consultations.py

Was already missing at line 68

     if type(algo) == str:
-         algo = [algo]

Stmts: 61 | Miss: 1 | ∆ Miss: 0 | Cover: 98%
eds_scikit/emergency/emergency_care_site.py

Was already missing at line 54

     if algo == "from_regex_on_parent_UF":
-         return from_regex_on_parent_UF(care_site)
     elif algo == "from_regex_on_care_site_description":
Was already missing at line 166
     """
-     return attributes.get_parent_attributes(
         care_site,

Stmts: 31 | Miss: 2 | ∆ Miss: 0 | Cover: 94%
eds_scikit/datasets/synthetic/biology.py

Was already missing at lines 37-44

     def reset_to_pandas(self):
-         if self.module == "koalas":
  ...
-             self.module = "pandas"

Stmts: 132 | Miss: 7 | ∆ Miss: 0 | Cover: 95%
eds_scikit/datasets/__init__.py

Was already missing at line 38

 def __dir__():
-     return known_datasets + [func.__name__ for func in __all__]
Was already missing at lines 52-56
 def add_dataset(table: pd.DataFrame, name: str):
-     dataset_path = os.path.abspath(
-         os.path.join(os.path.dirname(__file__), name + ".csv")
-     )
-     table.to_csv(dataset_path, index=False)
Was already missing at line 67
     """
-     return [func.__name__ for func in __all__]

Stmts: 26 | Miss: 4 | ∆ Miss: 0 | Cover: 85%
eds_scikit/biology/viz/plot.py

Was already missing at line 72

     else:
-         logger.error(
             "The folder {} has not been found",
Was already missing at lines 718-720
     else:
-         terminologies_hist = alt.Chart().mark_text()
-         terminologies_time_series = (
             alt.Chart(measurement)

Stmts: 130 | Miss: 3 | ∆ Miss: 0 | Cover: 98%
eds_scikit/biology/viz/aggregate.py

Was already missing at line 83

     if stats_only:
-         return {"measurement_stats": measurement_stats}
Was already missing at line 208
     if overall_only:
-         return measurement_stats_overall

Stmts: 97 | Miss: 2 | ∆ Miss: 0 | Cover: 98%
eds_scikit/biology/utils/config.py

Was already missing at lines 30-66

     """
-     my_custom_config = pd.DataFrame()
  ...
-     register_configs()
Was already missing at lines 73-75
     for config in glob.glob(os.path.join(CONFIGS_PATH, "*.csv")):
-         config_name = Path(config).stem
-         registry.data.register(
             f"get_biology_config.{config_name}",
Was already missing at lines 89-94
     """
-     registered = list(registry.data.get_all().keys())
-     configs = [
-         r.split(".")[-1] for r in registered if r.startswith("get_biology_config")
-     ]
-     return configs

Stmts: 35 | Miss: 22 | ∆ Miss: 0 | Cover: 37%
eds_scikit/biology/cleaning/cohort.py

Was already missing at line 28

     if isinstance(studied_pop, DataFrame.__args__):
-         filtered_measures = measurement.merge(
             studied_pop,

Stmts: 9 | Miss: 1 | ∆ Miss: 0 | Cover: 89%

63 files skipped due to complete coverage.

Coverage success: total of 94% is above 94% 🎉

sonarcloud bot commented Jul 1, 2024
