requests.get with stream=True never returns on 403 #6376
With it hanging where you claim, I can't think of why it would be hanging there. Is there a stack trace from when you ran it manually that you can share? It might give us a better idea of what might be happening. Without that, this doesn't seem like something we can fix or investigate, as there's very little information to go on.
Good point. I let it hang and then aborted. Here is the relevant stack... appears to be related to ssl? When I add some http headers to the request, the 403 goes away and the requests.get works. So unless the server is doing some conditional ssl/tls stuff based on HTTP (doesn't make any sense to me, as http happens after tls), I'm not sure what is up there. In any case, I would expect an ssl problem (timeout, hangup, whatever) to be surfaced...but maybe the problem is in python rather than requests.
So we're hanging trying to read the very first line, which would be the HTTP version, status code, and status reason. What headers are you adding? Many servers have started blocking requests' default user agent string because of abusive and malicious actors using requests. It's possible this server thinks you're acting maliciously and is doing something similar. This could be solved with a default timeout, which we have an open issue for, but you can also set a read timeout to do the same thing yourself. I don't believe this is a bug we can fix (see also discussion on the timeout issue) in the near term.
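The read-timeout workaround suggested above can be sketched with a local stand-in for the misbehaving host: a socket that completes the TCP handshake but never sends a status line. This is only an illustration of the failure mode discussed in the thread, not the original server; the port and timeout values are arbitrary.

```python
import socket
import threading

import requests

# Local stand-in for a server that accepts the connection but never
# answers. Without a timeout, requests would block forever reading the
# first response line; with a read timeout it raises instead.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
conns = []  # keep accepted sockets alive so the peer is not reset
threading.Thread(target=lambda: conns.append(srv.accept()), daemon=True).start()

timed_out = False
try:
    # timeout=(connect, read): the read value bounds each wait for
    # response bytes, which is exactly where this issue hangs.
    requests.get(f"http://127.0.0.1:{srv.getsockname()[1]}/",
                 stream=True, timeout=(3.05, 0.5))
except requests.exceptions.ReadTimeout:
    timed_out = True
    print("read timed out instead of hanging")
```

With the real server, the same `timeout=` argument would turn a weeks-long hang into a catchable `ReadTimeout` after the read interval elapses.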
Understandable. I expect that is why I get the 403 (visible in curl...goes away once I add headers...including user-agent). user-agent and referer are the only headers they might be switching on (whether I get a 403 or not). But SSL/TLS breaking is weird, as the only http-related stuff available at TLS handshake is the server-name-indication, which is just the requested hostname...no http headers. So to break ssl it would have to allow ssl until the http request was received/parsed and then, from the http layer, just orphan off the request without responding...but I would expect their webserver to have some upstream timeout or something at some point...I can't imagine their server left the tcp connection open for weeks with no traffic. So at some point the tcp connection should have timed out or got a FIN/RST.

As it stands now, it looks like we were sitting there for 2 weeks waiting for a response without the tcp connection closing or timing out. I can't imagine that is what is actually happening, but maybe it is and a timeout is the remedy. Plus this is another level beyond the 403, which isn't super logical to me...
I did my tcpdump troubleshooting already for the month but maybe if I get fired up I'll do some more on this topic in the interest of getting the underlying issue identified (e.g. is a FIN being ignored or is the tcp connection really staying open this long?).
This one? I will try to work around this by specifying a timeout so at least we don't hang without failing for weeks on end...but I suspect there is something that should be fixed here (although it might be in urllib3 or python's http, socket or ssl). Just for kicks I thought I'd try again with
If in fact the server is responding immediately, before we can write anything, it is possible it's related to the issues @dkorsunsky linked. It's not a requests bug, insomuch as we rely on the standard library http.client, which cannot handle early responses. In other words, as long as we rely on the standard library, there's no fix here for requests.
I have a case where a url I was using requests to fetch (an image, so using `stream=True` to download to a local file) started returning 403 errors with some html...and some very old code stopped working. The 403 isn't the problem. The issue is a requests call that hangs. We didn't notice for a while because it didn't crash/error, it just hung. Apparently for a couple weeks (need more monitoring coverage for other signals there). Running it manually it also hangs for at least a few minutes (as long as I waited). This shouldn't be a timeout case anyway, as the server responds right away with the 403.
If I manipulate headers (adding `headers=...` to the `requests.get(...)` call), I can make the 403 go away and the code runs fine again. But that isn't a solution. The issue for me here is the hang, because I can't handle or report that there is an issue (e.g. getting a 403). Looking through the docs, I don't see anything about this behaviour, but I might have missed it. Any idea what I missed?
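The header-switching behaviour described above can be reproduced against a small local server that 403s requests' default `User-Agent` but accepts a browser-like one. The local server is a stand-in; the real server's rules are unknown, and the spoofed user agent string here is purely illustrative.

```python
import http.server
import threading

import requests

# Hypothetical server that 403s requests' default User-Agent
# ("python-requests/x.y.z") but accepts anything else.
class PickyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        self.send_response(403 if ua.startswith("python-requests") else 200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep request logging quiet


server = http.server.HTTPServer(("127.0.0.1", 0), PickyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

plain = requests.get(url, timeout=5).status_code  # default UA
spoofed = requests.get(url, timeout=5,
                       headers={"User-Agent": "Mozilla/5.0"}).status_code
print(plain, spoofed)
server.shutdown()
```

This only explains the 403, though; as the thread notes, masking it with headers doesn't address the hang itself.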
curl of offending url (redacted)
Excerpt of the hanging code:

```python
res = requests.get(url, stream=True)
```

never returns.
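A hedged rework of that excerpt, combining the two remedies discussed in the thread: a `timeout=` so a silent server can't hang the call indefinitely, and `raise_for_status()` so the 403 is surfaced instead of leaving HTML where an image was expected. The local server below stands in for the redacted url; the timeout values are suggestions, not part of the original code.

```python
import http.server
import threading

import requests

# Local stand-in that answers 403 immediately, like the real server did.
class FourOhThree(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(403)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep request logging quiet


server = http.server.HTTPServer(("127.0.0.1", 0), FourOhThree)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/image.jpg"

status = None
try:
    res = requests.get(url, stream=True, timeout=(3.05, 30))
    res.raise_for_status()  # surface the 403 as an HTTPError
except requests.exceptions.HTTPError as err:
    status = err.response.status_code
    print("got error status", status)  # → got error status 403
except requests.exceptions.Timeout:
    print("server accepted the connection but never answered")
server.shutdown()
```

Either exception path gives the calling code something to handle or report, which is what the original hanging call could not do.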