Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: index to_period - to_timestamp round trip with freq argument fails specifically with period-start frequencies #58974

Open
2 of 3 tasks
fkiraly opened this issue Jun 10, 2024 · 7 comments
Labels

Comments

@fkiraly
Copy link

fkiraly commented Jun 10, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

freq = "MS"
# works with freq = "ME"

ix = pd.date_range("1/1/1870", periods=20, freq=freq)
ix2 = ix.to_period()

# this line fails for period-start frequencies, but not for period-end or others
ix3 = ix2.to_timestamp(freq=ix.freq)

Issue Description

Converting DateTimeIndex to PeriodIndex and back via a to_period - to_datetime round trip that has the freq argument set in the to_datetime, to the freq of the original DateTimeIndex, fails with ValueError: <MonthBegin> is not supported as period frequency.

This is counterintuitive, and inconsistent with most other frequencies, including period-end frequencies like <MonthEnd>, so I consider it a bug.

Related is this downstream failure in sktime sktime/sktime#6572, which has so far assumed that the above round trip is a trustworthy contract. So, if it is not, a quick confirmation would be appreciated.

Expected Behavior

No failure.

ix3 is assigned identical to ix.

Installed Versions

INSTALLED VERSIONS ------------------ commit : d9cdd2e python : 3.11.5.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.22631 machine : AMD64 processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_United Kingdom.1252

pandas : 2.2.2
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3.1
Cython : 3.0.6
pytest : 7.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.3
numba : 0.59.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.4.1
pyqt5 : None

@Aloqeely
Copy link
Member

Thanks for the report!
This sequence seems logical and I'd expect it to work. Although this does work if you don't pass freq (e.g. ix2.to_timestamp())

What I find weirder is that if I instead do ix2.to_timestamp(freq="MS") (on main) then that fails with the message:

ValueError("for Period, please use 'MS' instead of 'MS'")

cc @natmokval

@fkiraly
Copy link
Author

fkiraly commented Jun 13, 2024

I believe passing strings stopped working in the relevant places since pandas 2.2.0, this also caused a failure in a core module of sktime, see failure and fix here:

sktime/sktime#6500

Are you not testing the various round trips and conversions?

I have no good view of the testing framework, but I would just test to_period and to_timestamp with the exhaustive set of all frequencies supported, including more exotic ones like multiples, start, end, weekday, etc.

Including:

  • round trips, there are two round trips, period-timestamp-period and timestamp-period-timestamp
  • passing frequencies in the two parts of the round trip or not, so four variations
  • using strings vs using freq objects (the latter still seems to work in places where strings have stopped working)

We have seen so many changes of behaviour around freq, it is not very pleasant - pandas has a huge user base, and ensuring stable contracts for sth like freq imo is crucial.

Very related, imo a key issue is "unified API" across dfferent frequencies. E.g., if you do the round trip and pass freq for one type of frequency, the same sequence should work for all others, or you end up having to write case distinctions in dependent code.

@MarcoGorelli
Copy link
Member

thanks @fkiraly for the report

Just for a bit of context: for offsets, pandas distinguishes between the start of the period and the end. 'MS' means "month start", and 'ME' means "month end" (the latter used to be 'M', which everyone would interpret as "month" and would cause a lot of headaches). For period, there's no distinction between start and end, which is why for Period, the alias is just 'M'

See #9586 for further discussion


As for this issue...let's take a look. The docs for to_timestamp say

Cast to DatetimeIndex of timestamps, at beginning of period.

So, it looks correct (or at least, as-documented) that the input should be a Period alias, and that how determines whether we go to the start (default) or end of the Period.

The start and end time for a month Period are at the beginning and end of the month:

In [13]: pd.Period('2000-01-01', freq='M').start_time
Out[13]: Timestamp('2000-01-01 00:00:00')

In [14]: pd.Period('2000-01-01', freq='M').end_time
Out[14]: Timestamp('2000-01-31 23:59:59.999999999')

Based on that, I'd have expected pd.period_range('2000', periods=3, freq=MonthEnd()).to_timestamp('M') to output

DatetimeIndex(['2000-01-01', '2000-02-01', '2000-03-01'], dtype='datetime64[ns]', freq='MS')

But, it doesn't - it outputs

DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31'], dtype='datetime64[ns]', freq='ME')

This can be observed even going back to pandas 2.0.3, before #9586 started to be addressed

Are you not testing the various round trips

Based on the docstring of to_timestamp, I don't think this one was ever meant to round-trip

The road forwards

It looks like to_timestamp was never respecting its own documentation. What should be done now, going forwards? Not really sure, gonna have to think this one through
@natmokval did you look into this?


What I find weirder is that if I instead do ix2.to_timestamp(freq="MS") (on main) then that fails with the message:
ValueError("for Period, please use 'MS' instead of 'MS'")

Ah this makes no sense, you're right, thanks. @natmokval fancy taking a look at this error message?

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jul 31, 2024

@fkiraly if you want something which round-trips, I think this should do the trick:

ix = pd.date_range("1/1/1870", periods=20, freq=freq)
ix2 = ix.to_period()

ix3 = ix2.to_timestamp() + pd.tseries.frequencies.to_offset(f'0{freq}')
ix3.freq = freq

though I'd suggest considering whether you really need Period at all

@natmokval
Copy link
Contributor

The road forwards

It looks like to_timestamp was never respecting its own documentation. What should be done now, going forwards? Not really sure, gonna have to think this one through @natmokval did you look into this?

Yeah, this is odd behavior. I missed this, unfortunately.
Also, the how parameter for converting a period to a timestamp doesn’t seem to work properly:

>> ix1 = pd.period_range('2000', periods=3, freq='M').to_timestamp('M', how='s')
>>> ix1
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31'], dtype='datetime64[ns]', freq='ME')
>>>
>>> ix2 = pd.period_range('2000', periods=3, freq='M').to_timestamp('M', how='e')
>>> ix2
DatetimeIndex(['2000-01-31 23:59:59.999999999',
               '2000-02-29 23:59:59.999999999',
               '2000-03-31 23:59:59.999999999'],
              dtype='datetime64[ns]', freq=None)

Whyix2.freq = None, but not <MonthEnd>? it is confusing to me. @MarcoGorelli, what do you think?

In the meantime, maybe we can update the documentation for to_timestamp until this issue is resolved?

@natmokval
Copy link
Contributor

What I find weirder is that if I instead do ix2.to_timestamp(freq="MS") (on main) then that fails with the message:
ValueError("for Period, please use 'MS' instead of 'MS'")

Ah this makes no sense, you're right, thanks. @natmokval fancy taking a look at this error message?

thanks for fixing the error message so quickly.

@MarcoGorelli
Copy link
Member

In the meantime, maybe we can update the documentation for to_timestamp until this issue is resolved?

yeah maybe - let's take this discussion over to #59371, and leave this one for how to round-trip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants