Unexpected behaviour with order of columns in IntercomparisonMetrics #220

Open

s-scherrer opened this issue Mar 12, 2021 · 2 comments

@s-scherrer (Collaborator)

When setting up a test for the new PairwiseIntercomparisonMetrics (#219) I stumbled on some (for me) unexpected behaviour.

I used a test case with 4 datasets, and apparently the order of the datasets changed, leading to wrong results (see example below).
The issue is caused by the order of the dataset names passed to IntercomparisonMetrics. It works when dataset_names passed to IntercomparisonMetrics comes from get_dataset_names using a DataManager. If one just passes the names of the datasets in a different order, the results are wrong. However, in the former case the wrong column (c1) is used as reference.

I believe the problem is that pytesmo automatically converts a dataset dictionary to a DataManager

self.data_manager = DataManager(datasets, spatial_ref, period)

and from then on only uses the order derived from it.

Based on this order, perform_validation then renames the columns of the DataFrame to ref, k1, k2, and k3

f = lambda x: "k{}".format(x) if x > 0 else 'ref'

but if the order of dataset_names was different, the columns are not in the order they should be (see the sketch below).
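
To make this concrete, here is a rough sketch (not actual pytesmo code) of how a purely position-based rename, combined with a differently ordered dataset_names list, attaches the same placeholder to different datasets:

# Sketch only: the rename in perform_validation is purely position based.
f = lambda x: "k{}".format(x) if x > 0 else 'ref'

# order assumed here for the DataManager (reference first)
dm_order = ['refname', 'c1name', 'c2name', 'c3name']
print({name: f(i) for i, name in enumerate(dm_order)})
# {'refname': 'ref', 'c1name': 'k1', 'c2name': 'k2', 'c3name': 'k3'}

# a dataset_names list in a different order implies a different mapping
other_order = ['c1name', 'c2name', 'c3name', 'refname']
print({name: f(i) for i, name in enumerate(other_order)})
# {'c1name': 'ref', 'c2name': 'k1', 'c3name': 'k2', 'refname': 'k3'}

# -> the same placeholder ('ref', 'k1', ...) refers to different datasets,
#    so metrics get attributed to the wrong dataset pairs.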

I think this should probably be fixed in Validation, since only Validation knows which datasets are the reference and which are not.

@wpreimes, any thoughts on this?

from datetime import datetime
import numpy as np
import pandas as pd

from pytesmo.validation_framework.validation import Validation
from pytesmo.validation_framework.metric_calculators import IntercomparisonMetrics, get_dataset_names
from pytesmo.validation_framework.data_manager import DataManager
# test data
startdate = datetime(2000, 1, 1)
enddate = datetime(2000, 12, 31)
dt_index = pd.date_range(start=startdate, end=enddate, freq='D')

names = ['ref', 'c1', 'c2', 'c3']
# always 0.5
df = pd.DataFrame(index=dt_index, data={
                  name: np.repeat(0.5, dt_index.size) for name in names})
df['c1'] += 0.2  # some positive bias
df['c2'] -= 0.2  # some negative bias
df['c3'] -= 0.3  # some more negative bias
df
ref c1 c2 c3
2000-01-01 0.5 0.7 0.3 0.2
2000-01-02 0.5 0.7 0.3 0.2
2000-01-03 0.5 0.7 0.3 0.2
2000-01-04 0.5 0.7 0.3 0.2
2000-01-05 0.5 0.7 0.3 0.2
... ... ... ... ...
2000-12-27 0.5 0.7 0.3 0.2
2000-12-28 0.5 0.7 0.3 0.2
2000-12-29 0.5 0.7 0.3 0.2
2000-12-30 0.5 0.7 0.3 0.2
2000-12-31 0.5 0.7 0.3 0.2

366 rows × 4 columns

Bias between c1 and ref is 0.2.
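
For reference, the expected value can be checked directly with pandas (without going through pytesmo):

# expected bias of c1 with respect to ref: mean of the difference
print((df["c1"] - df["ref"]).mean())  # 0.2 (up to floating point rounding)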

# set up datasets for validation
class DummyReader:
    def __init__(self, df, name):
        self.name = name
        self.data = pd.DataFrame(df[name])

    def read_ts(self, *args, **kwargs):
        return self.data
    
    def __repr__(self):
        return f"Reader for {self.name} with columns {list(self.data.columns)}"

datasets = {
    key+"name": {"columns": [key], "class": DummyReader(df, key)}
    for key in df
}
datasets
{'refname': {'columns': ['ref'], 'class': Reader for ref with columns ['ref']},
 'c1name': {'columns': ['c1'], 'class': Reader for c1 with columns ['c1']},
 'c2name': {'columns': ['c2'], 'class': Reader for c2 with columns ['c2']},
 'c3name': {'columns': ['c3'], 'class': Reader for c3 with columns ['c3']}}
# run validation
metrics = IntercomparisonMetrics(dataset_names=list(datasets.keys()))
val = Validation(
    datasets,
    "refname",
    scaling=None,
    metrics_calculators={(4, 4): metrics.calc_metrics}
)
results = val.calc(1, 1, 1)
/home/samuel/.local/lib/python3.8/site-packages/scipy/stats/stats.py:3508: PearsonRConstantInputWarning: An input array is constant; the correlation coefficent is not defined.
  warnings.warn(PearsonRConstantInputWarning())
/home/samuel/miniconda3/envs/pytesmo/lib/python3.8/site-packages/numpy/lib/function_base.py:2642: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[:, None]
/home/samuel/projects/QA4SM/pytesmo/src/pytesmo/scaling.py:250: RuntimeWarning: invalid value encountered in true_divide
  return ((src - np.mean(src)) / np.std(src)) * np.std(ref) + np.mean(ref)
results[list(results.keys())[0]]["BIAS_between_refname_and_c1name"]
array([0.4], dtype=float32)

The bias should be 0.2 here.

When using get_dataset_names instead of list(datasets.keys()) it works correctly:

datamanager = DataManager(datasets, ref_name="refname")
ds_names = get_dataset_names(datamanager.reference_name, datamanager.datasets, n=4)
ds_names, list(datasets.keys())
(['c1name', 'c2name', 'c3name', 'refname'],
 ['refname', 'c1name', 'c2name', 'c3name'])
metrics = IntercomparisonMetrics(dataset_names=ds_names)
val = Validation(
    #datamanager,
    datasets,
    "refname",
    temporal_ref="refname",
    scaling=None,
    metrics_calculators={(4, 4): metrics.calc_metrics}
)
results = val.calc(1, 1, 1)
/home/samuel/.local/lib/python3.8/site-packages/scipy/stats/stats.py:3508: PearsonRConstantInputWarning: An input array is constant; the correlation coefficent is not defined.
  warnings.warn(PearsonRConstantInputWarning())
/home/samuel/miniconda3/envs/pytesmo/lib/python3.8/site-packages/numpy/lib/function_base.py:2642: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[:, None]
/home/samuel/projects/QA4SM/pytesmo/src/pytesmo/scaling.py:250: RuntimeWarning: invalid value encountered in true_divide
  return ((src - np.mean(src)) / np.std(src)) * np.std(ref) + np.mean(ref)
results[list(results.keys())[0]]["BIAS_between_c1name_and_refname"]
array([0.2], dtype=float32)
@s-scherrer (Collaborator, Author)

I introduced a new keyword argument rename_cols to Validation.calc and Validation.perform_validation that can be used to disable the renaming. This will be included in PR #219 and partially fixes the issue above. However, I still think the renaming is a hacky and fragile construct, and we should probably find a way to remove it in the future.
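
For illustration, this is how it would be used (a sketch, assuming rename_cols is simply passed through Validation.calc, with val being the Validation object from the example in the original post):

# keep the original dataset names instead of renaming to ref, k1, k2, ...
results = val.calc(1, 1, 1, rename_cols=False)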

During testing I stumbled upon a related problem: if I use a metrics calculator with (n, 2) in Validation, e.g.:

import numpy as np
import pandas as pd

# PairwiseIntercomparisonMetrics comes from PR #219; the import path is assumed
from pytesmo.validation_framework.metric_calculators import PairwiseIntercomparisonMetrics
from pytesmo.validation_framework.validation import Validation

# DummyReader is the helper class from the example in the original post
dr = pd.date_range("2000", "2020", freq="D")
n = len(dr)
x = np.ones(n) * 2
df = pd.DataFrame(
    {
        "reference": x,
        "zplus2": x + 2,
        "plus1": x + 1,
    },
    index=dr
)

datasets = {
    key + "_name": {"columns": [key], "class": DummyReader(df, key)}
    for key in df
}

val = Validation(
    datasets,
    "reference_name",
    scaling=None,  # doesn't work with the constant test data
    temporal_matcher=None,
    metrics_calculators={
        (3, 2): PairwiseIntercomparisonMetrics().calc_metrics
    }
)
results = val.calc(1, 1, 1)

Then the results dictionary has these keys:

(('plus1_name', 'plus1'), ('reference_name', 'reference')), 
(('plus1_name', 'plus1'), ('zplus2_name', 'zplus2')), 
(('reference_name', 'reference'), ('zplus2_name', 'zplus2'))

that is, the tuples are sorted alphabetically.

When using the new CombinedTemporalMatcher that I implemented (#219), which only returns combinations including the reference, the keys are

(('plus1_name', 'plus1'), ('reference_name', 'reference')), 
(('reference_name', 'reference'), ('zplus2_name', 'zplus2'))

In one case the reference comes first, in the other it comes second, which is pretty unintuitive for a user. Here I would expect the keys to be consistently (other, ref) or consistently (ref, other), not to depend on the names of the datasets. The temporal matcher returns the matched pairs with keys (other, ref), which is what I was expecting to get.

I know that pytesmo by default builds all combinations, and in that case a general rule is harder. But I think pytesmo should respect the order given by the temporal matcher. Furthermore, since dictionary order is now reliable (an implementation detail since CPython 3.6, guaranteed by the language since Python 3.7), the default matcher should probably also respect the order of the datasets in the DataManager/datasets dictionary, if it doesn't do so already (see the sketch below).
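
A minimal sketch of what respecting the dictionary order would mean for the pair keys (plain itertools, not pytesmo code):

from itertools import combinations

# insertion order of the datasets dict from the example above
dataset_order = ["reference_name", "zplus2_name", "plus1_name"]

# combinations() preserves input order, so the reference would always be
# the first element of each pair as long as it is listed first:
print(list(combinations(dataset_order, 2)))
# [('reference_name', 'zplus2_name'), ('reference_name', 'plus1_name'),
#  ('zplus2_name', 'plus1_name')]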

@s-scherrer (Collaborator, Author)

Addendum: this reordering is especially a problem when calculating bias, because the sign then differs between the two cases, and simply using tuple(sorted(key)) is not enough to interpret the result correctly (see the illustration below).
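
A minimal illustration of the sign issue, using the constant test data from the (n, 2) example above:

import numpy as np

ref = np.full(10, 2.0)    # "reference" column (x)
other = np.full(10, 3.0)  # "plus1" column (x + 1)

# bias is antisymmetric in the pair order, so the sign flips:
print(np.mean(other - ref))  #  1.0
print(np.mean(ref - other))  # -1.0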

So I think this needs to be fixed before we incorporate the new PairwiseIntercomparisonMetrics.

@wpreimes do you have an idea how far-reaching those changes would be? Is this order relied upon a lot within pytesmo?

s-scherrer added a commit to s-scherrer/pytesmo that referenced this issue Apr 23, 2021
wpreimes pushed a commit that referenced this issue Apr 23, 2021
* temporal matching for multiple frames

* don't convert timezone naive indices

* combined matcher for dictionary

* added option to use all with combined_dropna

* comparison of new and old matching implementation

This adds a test to check whether the new matching in
BasicTemporalMatching works as the old one in case there are no
duplicates in the reference frame.

It replaced the matcher in the other tests by a dummy that still uses
the old matching implementation.
The following commit will remove the dummy and set the new expected
values.

* changed tests for ascat_ismn to use new matching

* fixup for previous commit

* added PairwiseIntercomparisonMetrics

* fixed test for PairwiseIntercomparisonMetrics

* statistics based metrics calculator

* Added tests for PairwiseIntercomparisonMetrics

One test still fails due to difference in scaling before calculating
ubRMSD.

* Added CIs to PairwiseIntercomparisonMetrics

* made bootstrapped CI tests less sensitive

* made bootstrapped CI tests less sensitive

* added missing import

* make_combined_temporal_matcher now also works with k>2

* Fixed some issues

- bug in validation.get_data_for_result_tuple
- no segfault when running tests anymore

* better confidence intervals for tcol

* added tests for tripletmetrics

* added imports

* run tests all at the same time

* removed imports from validation_framework init

* triple metrics only

* removed xarray test

* fixed test

* fixed test

* TripleIntercomparisonMetrics -> TripleCollocationMetrics

* use percentile method by default; analytical CIs for mse_bias

* better test for #220