fit_frame_split - ValueError: Length of values does not match length of index #12
Awesome, thank you! Let me know how I can assist 🙇‍♂️
Hi there, could you maybe upload an example notebook highlighting the error? I was not able to reproduce it. Please see the attached files. Cheers,
You didn't use it correctly. Please run the following notebook mod; note that cell 2 works as expected, but cell 3 does not.
@eren-ck hey! Just wanted to check if there was anything wrong with that notebook or intuition? Let me know! |
Hi there, thanks for your patience. I finally had some time. You are right: it seems that for some cases `fit_frame_split()` does not work correctly, while for others it does. I have to investigate this further. Meanwhile you can just use a smaller or larger frame size (1000 or 3000). I hope this helps you resolve the problem. For instance:

```python
def test_fit_split():
    df = pd.read_csv('ST_DBSCAN_2024_03_14.csv')
    # transform to numpy array
    data = df.loc[:, ['timestamp', 'x', 'y']].values
    st_dbscan = ST_DBSCAN(eps1=0.25, eps2=250, min_samples=10).fit_frame_split(data, 3000)
    df['cluster'] = st_dbscan.labels
    return df

df_fit_split = test_fit_split()
```

I expected that with the sparse matrices people would no longer have any need to rely on `fit_frame_split()`.

Cheers,
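Independent of the frame size chosen, it can't hurt to verify that the time column really is non-decreasing before the data goes in. A minimal sketch; `is_time_sorted` and the toy data are my own illustration, not part of the library:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 2 goes backwards in time.
df = pd.DataFrame({
    'timestamp': [0, 10, 5, 20],
    'x': [0.0, 1.0, 2.0, 3.0],
    'y': [0.0, 1.0, 2.0, 3.0],
})
data = df.loc[:, ['timestamp', 'x', 'y']].values

def is_time_sorted(arr):
    """True when the first column (time) is non-decreasing."""
    return bool(np.all(np.diff(arr[:, 0]) >= 0))

print(is_time_sorted(data))                          # False
print(is_time_sorted(data[np.argsort(data[:, 0])]))  # True after sorting
```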
No worries on the delay -- and sounds good, happy to hear I'm not crazy! Will keep my eyes peeled for future follow-up, as this would become the most memory-efficient approach when loads of parallel Spark jobs are executed at the same time on millions of grouped rows and thousands of groups. Exciting stuff!
I had the same error when sorting my dataframe before passing it into `fit_frame_split`.
I am able to reproduce it on my own dataset. If your data is not sorted by time and you use the `pandas.DataFrame.sort_values` method, it will create the error. The following snippets illustrate the error and how to fix it.

```python
sorted = selected_df.sort_values(by='UTC')

X_original_unsorted = selected_df.loc[start_idx:end_idx-1, ['UTC', 'x', 'y', 'z']]
print(f"Unsorted shape: {X_original_unsorted.shape}")  # (10000, 4)

X_original = sorted.loc[start_idx:end_idx-1, ['UTC', 'x', 'y', 'z']]
print(f"Sorted shape: {X_original.shape}")  # (11309, 4)

X_checked = check_array(X_original)
print(f"Checked shape: {X_checked.shape}")  # (11309, 4)

n, m = X_checked.shape

# pdist errors
time_dist = pdist(X_checked[:, 0].reshape(n, 1), metric='euclidean')
# ValueError: Found input variables with inconsistent numbers of samples: [11309, 10000]
```

```python
sorted = selected_df.sort_values(by='UTC')
# This fixes it
sorted.reset_index(drop=True, inplace=True)

X_original_unsorted = selected_df.loc[start_idx:end_idx-1, ['UTC', 'x', 'y', 'z']]
print(f"Unsorted shape: {X_original_unsorted.shape}")  # (10000, 4)

X_original = sorted.loc[start_idx:end_idx-1, ['UTC', 'x', 'y', 'z']]
print(f"Sorted shape: {X_original.shape}")  # (10000, 4)

X_checked = check_array(X_original)
print(f"Checked shape: {X_checked.shape}")  # (10000, 4)

n, m = X_checked.shape

# No error
time_dist = pdist(X_checked[:, 0].reshape(n, 1), metric='euclidean')
```

This is probably still the intended behavior of `sklearn.utils.check_array`, but I am not sure; I will follow up at some point after looking into it. I would encourage `fit_frame_split` to sort by time itself, so that preventing this behavior is not made the user's problem. Alternatively, if making a copy or sorting the dataframe is out of the question, inform the user that `fit` will fail when the indices, as well as the timestamps, are not in strictly increasing order, or throw an exception stating that the indices must be ordered as well as the timestamps.

EDIT: I now realize the mistake is in using the `.loc` operator on indexes that are not in order, and that this issue does not occur in the same place as the original reporter's code. My issue occurs because I pass a `pandas.DataFrame` instead of a 2d numpy array.
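The `.loc` pitfall above can be reproduced in a few lines. A minimal sketch with synthetic data; only the column name `UTC` matches the snippet, everything else is made up:

```python
import pandas as pd

df = pd.DataFrame({'UTC': [5, 1, 3, 2, 4]})

# sort_values reorders the rows but keeps the original index labels.
by_time = df.sort_values(by='UTC')
print(list(by_time.index))   # [1, 3, 2, 4, 0]

# .loc slices by label, not position: on this non-monotonic index the
# slice 1:2 runs from the position of label 1 to the position of label 2,
# picking up label 3 along the way, so the row count is unpredictable.
n_before = len(by_time.loc[1:2])
print(n_before)              # 3

# Resetting the index restores the expected positional behavior.
by_time = by_time.reset_index(drop=True)
n_after = len(by_time.loc[1:2])
print(n_after)               # 2
```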
I started encountering this issue even without my mistake of passing a dataframe with the wrong indexes. When outputting the size of the processed frame and the overlap with @csm-kb's demo file, I made this discovery:

Outputting the modified input reveals the same issue:
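For context, here is the bookkeeping that has to hold for any overlapping frame split, regardless of the library's internals. This is a hypothetical sketch under my own assumptions (`split_with_overlap` is an illustration, not the actual `fit_frame_split` code): the concatenated label count only matches `len(X)` if each shared overlap is dropped exactly once when the frames are stitched back together.

```python
def split_with_overlap(n_rows, frame_size, overlap):
    """Return (start, stop) row ranges covering n_rows, with each pair of
    neighbouring frames sharing `overlap` rows."""
    frames, start = [], 0
    while start < n_rows:
        stop = min(start + frame_size, n_rows)
        frames.append((start, stop))
        if stop == n_rows:
            break
        start = stop - overlap  # the next frame re-processes `overlap` rows
    return frames

overlap = 3
frames = split_with_overlap(n_rows=25, frame_size=10, overlap=overlap)
print(frames)  # [(0, 10), (7, 17), (14, 24), (21, 25)]

# Keep all of frame 0, then drop the first `overlap` rows of each later frame.
total = (frames[0][1] - frames[0][0]) \
      + sum(stop - start - overlap for start, stop in frames[1:])
print(total)   # 25 -- must equal n_rows for the labels to line up
```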
Hey! As mentioned in #7, there seem to be edge cases where the `labels` computed by `fit_frame_split()` don't match the row count of `X` fed to it. Not quite sure what's causing it at first glance!

The error in question:

The use in question (it is sorted by `timestamp` ascending before it goes in):

Attached is the subset CSV of ordered timestamp/x/y data that yielded this for me. Timestamp is `unix_millis`; x/y are in an arbitrary space for particle data for a side project.

Currently looking into a temporary rewrite of it for the memory constraints I'm currently fighting with. (I turned here because with `fit()`, some very large (>100k) position datasets that are only a couple hundred MB in Pandas turned out via `memory_profile` to cause up to a 6.8 GB increment in memory use! Which eats heap and crashes smaller workers on my compute cluster, etc... probably the darn matrices becoming not-so-sparse.)

ST_DBSCAN_2024_03_14.csv
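A back-of-the-envelope check makes the blow-up plausible: SciPy's `pdist` returns a condensed matrix of n*(n-1)/2 float64 distances, so dense pairwise distances grow quadratically with the row count, and sparse representations only help while the matrices stay sparse. The helper below is my own illustration, not library code:

```python
def condensed_matrix_gib(n_points, bytes_per_value=8):
    """Size of one condensed pairwise-distance matrix in GiB."""
    return n_points * (n_points - 1) / 2 * bytes_per_value / 2**30

print(f"{condensed_matrix_gib(100_000):.1f} GiB")  # ~37.3 GiB per dense matrix
print(f"{condensed_matrix_gib(30_000):.1f} GiB")   # ~3.4 GiB
```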