Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDK repeatedly retries bad requests #126

Open
theolampert opened this issue Feb 28, 2024 · 21 comments
Open

SDK repeatedly retries bad requests #126

theolampert opened this issue Feb 28, 2024 · 21 comments
Assignees
Labels
bug Something isn't working

Comments

@theolampert
Copy link
Contributor

theolampert commented Feb 28, 2024

Network requests fail with a 400 error code referring to the length of user_id or device_id. Both of these values appear to be set in the request data so I'm unsure why this fails. The client will then repeatedly retry this, in the sampled screenshot it's sent around 800 network requests.

Screenshot 2024-02-28 at 12 02 15

The client will also repeatedly call flush and fill up it's local event cache, in my case the size of this cache is around ~600mb, my colleague's is a few gb. This will peg CPU usage and gradually increase memory usage as well ad disk writes.

This is also occuring in production and reported by by performance monitoring we have in Sentry.

There's a few things going on here but the main issue I see is a bug in the retry logic of the client that I haven't been able to track down.

@theolampert theolampert added the bug Something isn't working label Feb 28, 2024
@theolampert
Copy link
Contributor Author

@izaaz It seems like handleBadRequestResponse in PersistentStorageResponseHandler has a bug in its retry logic, there are no checks to see if the original request issue has been resolved before adding eventsToRetry back into the event pipeline. I also notice that the config variable flushMaxRetries is never used.

@izaaz
Copy link
Collaborator

izaaz commented Feb 28, 2024

Thanks for raising this issue @theolampert - We are taking a look at this

@bwised
Copy link

bwised commented Mar 3, 2024

For me the memory overload is not gradual...it's very quick. After about 30 to 45 seconds after launch I get a massive burst of these and then they don't stop.

Background Task 2,174 ("amplitude"), was created over 30 seconds ago. In applications running in the background, this creates a risk of termination. Remember to call UIApplication.endBackgroundTask(_:) for your task in a timely manner to avoid this.

Some stuff is going bonkers.

@izaaz
Copy link
Collaborator

izaaz commented Mar 4, 2024

We have a repro for this issue. We'll keep this thread posted.

@bejewel-sangwon
Copy link

bejewel-sangwon commented Mar 5, 2024

@izaaz When do you expect this issue to be resolved?

@izaaz
Copy link
Collaborator

izaaz commented Mar 7, 2024

@bejewel-sangwon - there is one other issue that is currently being worked on and this will be worked on right after that. ETA is about 2 weeks from now.

This should only be happening if the client is sending a bad request. Most likely because of a bad userid/device id. Can you please check if you are sending device ids/user ids < 5 characters long? If yes, can you make sure the minIdLength property is set correctly in the config?

@bwised
Copy link

bwised commented Mar 8, 2024

My user ids are always UUID strings and aren't ever sent empty. So length isn't a problem.

@bwised
Copy link

bwised commented Mar 9, 2024

I actually am seeing events in a bunch of .tmp files in the app's Documents/amplitude folder that do have user_id set to null. I'm not sure how that is happening, though.

[{"os_version":"17.4","revenue":null,"session_id":1709960900263,"time":1709961064701,"region":null,"device_brand":null,"idfa":null,"location_lng":null,"idfv":"77512EE1-96DE-4A2D-808A-230EBFDDE79C","device_model":"arm64","event_type":"Initialize","dma":null,"user_id":null,"platform":"iOS","library":"amplitude-swift\/1.3.6","os_name":"ios","product_id":null,"adid":null,"app_version":null,"carrier":"Unknown","city":null,"device_manufacturer":"Apple","revenue_type":null,"event_properties":{"Classification":"Analytic","Is Sandbox":"false","What":"Firebase"},"ip":"$remote","location_lat":null,"insert_id":"C1824F8D-ABDF-4230-9699-0CDAFC819BF8","device_id":"77512EE1-96DE-4A2D-808A-230EBFDDE79C","country":null,"language":"en-US","event_id":221,"price":null,"partner_id":null,"version_name":"3.19","quantity":null},{"os_version":"17.4","revenue":null,"session_id":1709960900263,"idfa":null,"time":1709961064710,"device_brand":null,"region":null,"location_lng":null,"idfv":"77512EE1-96DE-4A2D-808A-230EBFDDE79C","device_model":"arm64","event_type":"[Amplitude] Application Opened","user_id":null,"dma":null,"platform":"iOS","library":"amplitude-swift\/1.3.6","os_name":"ios","product_id":null,"adid":null,"carrier":"Unknown","device_manufacturer":"Apple","app_version":null,"city":null,"revenue_type":null,"event_properties":{"[Amplitude] From Background":false,"[Amplitude] Build":"1","[Amplitude] Version":"3.19"},"ip":"$remote","location_lat":null,"device_id":"77512EE1-96DE-4A2D-808A-230EBFDDE79C","insert_id":"BE2ED514-0A67-4AB5-9E40-DF7482019089","country":null,"language":"en-US","event_id":222,"price":null,"partner_id":null,"version_name":"3.19","quantity":null}]

@izaaz
Copy link
Collaborator

izaaz commented Mar 9, 2024

user id being null is ok from Amplitide's perspective. A request to amplitude needs either a user id or device id or both. We only check for length when it's not null.

@bwised
Copy link

bwised commented Mar 9, 2024

I managed to catch in the debugger today that randomly I get 400's coming back from the backend with the error "Invalid API key". And this is in the same app instance where it's been previously happy uploading events. Then the backend decides the API key is invalid.

It's at this point that the contents of the Documents/amplitude/amplitude-swift-storage--$default_instance.events.index folder explodes.

(And as a side note: Configuration.flushMaxRetries is not used anywhere in the Amplitude code yet)

Edit: I've confirmed that when this happens, HttpClient.getRequestData(events:) is creating the requestPayload with a blank apiKey. I'm continuing to debug.

Edit: I've discovered that I have something that posts an event to the Amplitude SDK before Amplitude is initialized, which causes AmplitudeDestinationPlugin to create an HttpClient with a configuration with no apiKey.

@bwised
Copy link

bwised commented Mar 9, 2024

Yeah, so the workaround for this problem is to not try to post events before the Amplitude SDK is initialized. LOL

@bejewel-changkyu
Copy link

The problem with this issue is that even if a trigger occurred due to an incorrect user_id, the event does not disappear but continues to accumulate like zombies, eventually causing the app itself to crash.

If it were an incorrect event, it should have disappeared from the list, but it keeps accumulating, resembling a DDos attack with a huge number of requests, eventually overwhelming the app until it crashes.

I hope there will be a prompt solution. 🙏

I will also attach a video showing the related logs.

default.mp4

@izaaz
Copy link
Collaborator

izaaz commented Mar 15, 2024

@bwised Thanks for bringing that to our attention too.

@bejewel-changkyu Looking at your log, it looks like your event type is empty?

@crleona
Copy link
Collaborator

crleona commented Mar 16, 2024

@theolampert @bwised @bejewel-changkyu I've landed a fix for the issue, would it be possible to verify this is working on your end? It may take some time for affected devices to stop sending requests, but they should eventually converge to the expected behavior. Thanks!

@theolampert
Copy link
Contributor Author

@crleona Internally this appears to be working for us, we'll release tomorrow and let you know if the issue is completely resolved.

@crleona
Copy link
Collaborator

crleona commented Mar 19, 2024

Thanks @theolampert, this is now released in 1.4.1. Please reopen this issue if you encounter any issues with the release.

@crleona crleona closed this as completed Mar 19, 2024
@theolampert
Copy link
Contributor Author

@crleona Hey, testing this internally, unfortunately we're seeing the same issues with 1.4.1 – @schurigeln will reach out later today with more details and screenshots.

@theolampert
Copy link
Contributor Author

It seems to still be related to using user IDs that are below a certain length, logging out of that particular account works correctly.

There are then still two issues here, the backend has an undocumented minimum user ID length and the SDK is still able to get itself into a state in which it continuously retries bad requests.

@crleona crleona reopened this Mar 22, 2024
@izaaz
Copy link
Collaborator

izaaz commented Mar 22, 2024

@theolampert we'll update the docs to make this more explicit.

Wrt the SDK retrying - currently, the response from the server indicates the first event that had bad data and the SDK drops that event and retries the rest. We are working with the backend team to get the full list so that the SDK can drop all bad requests in one go and not have to retry often. The fix that was pushed should at least make sure you end up with too many duplicates event locally. It still has a slow draining issue which we'll address soon.

From a principle perspective, we've always leaned towards not dropping any data on the SDK. Which is why we don't drop the data when we notice that the user id length is less than 5.

Keeping this ticket open while we get both of these resolved.

@kishinmanglani
Copy link

Any update here? We're probably going to switch away from Amplitude if this doesn't get resolved asap

@crleona
Copy link
Collaborator

crleona commented Nov 20, 2024

@kishinmanglani we consider this solved on our end in the latest versions, are there any particular issues you are still seeing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants