
requests.get with stream=True never returns on 403 #6376

Closed
mattpr opened this issue Mar 9, 2023 · 6 comments
mattpr commented Mar 9, 2023

I have a case where a URL I was fetching with requests (an image, so using stream=True to download to a local file) started returning 403 errors with some HTML... and some very old code stopped working. The 403 isn't the problem. The issue is a requests call that hangs. We didn't notice for a while because it didn't crash or error, it just hung. Apparently for a couple of weeks (we need more monitoring coverage for other signals there).

Running it manually it also hangs for at least a few minutes (as long as I waited). This shouldn't be a timeout case anyway as the server responds right away with the 403.

If I manipulate headers (adding headers=... to the requests.get() call), I can make the 403 go away and the code runs fine again. But that isn't a solution. The issue for me here is the hang, because I can't handle or report that there is an issue (e.g. getting a 403).

Looking through the docs, I don't see anything about this behaviour but I might have missed it. Any idea what I missed?

# python3 --version
Python 3.8.10
# pip3 list | grep requests
requests           2.28.2

curl of offending url (redacted)

# curl -s -D - https://example.com/some/path/file.jpg
HTTP/2 403
mime-version: 1.0
content-type: text/html
content-length: 310
expires: Thu, 09 Mar 2023 12:22:13 GMT
cache-control: max-age=0, no-cache
pragma: no-cache
date: Thu, 09 Mar 2023 12:22:13 GMT
server-timing: cdn-cache; desc=HIT
server-timing: edge; dur=1
server-timing: ak_p; desc="466212_388605830_176004165_24_6476_12_0";dur=1

<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>

You don't have permission to access "XXXXXXXXXX" on this server.<P>
XXXXX
</BODY>
</HTML>

Excerpt of the hanging code. res = requests.get(url, stream=True) never returns.

local_file = "/tmp/file.jpg"
url = "https://example.com/some/path/file.jpg"
res = requests.get(url, stream=True)  # <-- hangs
print("I never print.")
if res.status_code == 200:
    try:
        with open(local_file, 'wb') as fd:
            for chunk in res.iter_content(chunk_size=128):
                fd.write(chunk)
    except EnvironmentError as e:
        print("Received error when attempting to download {0}.".format(url))
        print(e)
        return False
    return True
else:
    print("Received {0} status when attempting to download {1}.".format(res.status_code, url))
    return False
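The excerpt above can be hardened so the call can never hang indefinitely. A sketch (the function name and timeout values are illustrative, not from the original code): passing `timeout=` to `requests.get` turns the silent hang into a catchable exception.

```python
import requests

def download_image(url, local_file, timeout=(3.05, 27)):
    """Stream `url` to `local_file` without ever hanging indefinitely.

    `timeout` is a (connect, read) pair in seconds; without it,
    requests will wait forever for a server that never responds.
    """
    try:
        res = requests.get(url, stream=True, timeout=timeout)
    except requests.exceptions.Timeout:
        print("Timed out when attempting to download {0}.".format(url))
        return False
    if res.status_code != 200:
        print("Received {0} status when attempting to download {1}.".format(res.status_code, url))
        return False
    try:
        with open(local_file, "wb") as fd:
            for chunk in res.iter_content(chunk_size=128):
                fd.write(chunk)
    except EnvironmentError as e:
        print("Received error when attempting to download {0}.".format(url))
        print(e)
        return False
    return True
```

The read timeout also applies to each chunk read inside `iter_content`, so a server that stalls mid-body is surfaced as well.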
@sigmavirus24 (Contributor)

Running it manually it also hangs for at least a few minutes (as long as I waited).

Given where you say it hangs, I can't think of why it would hang there. Is there a stack trace from when you ran it manually that you can share? It might give us a better idea of what's happening. Without that, this doesn't seem like something we can fix or investigate, as there's very little information to go on.


mattpr commented Mar 9, 2023

Good point. I let it hang and then aborted. Here is the relevant stack trace... it appears to be related to SSL?

When I add some http headers to the request the 403 goes away and the request.get works.

So unless the server is doing some conditional ssl/tls stuff based on HTTP (doesn't make any sense to me as http happens after tls), I'm not sure what is up there.

In any case, I would expect an ssl problem (timeout, hangup, whatever) to be surfaced...but maybe the problem is in python rather than requests.

  ^CTraceback (most recent call last):
  ...
  File "/opt/script.py", line 96, in downloadImageToFile
    res = requests.get(url,  stream=True)
  File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt

@sigmavirus24 (Contributor)

So we're hanging trying to read the very first line of the response, which would be the `HTTP-version status-code reason-phrase` status line.

What headers are you adding?

Many servers have started blocking requests' default User-Agent string because of abusive and malicious actors using requests. It's possible this server thinks you're acting maliciously and is doing something similar.

This could be solved with a default timeout which we have an open issue for, but you can also set a read timeout to do the same thing yourself.

I don't believe this is a bug we can fix in the near term (see also the discussion on the timeout issue).


mattpr commented Mar 9, 2023

What headers are you adding?

headers = {
    'accept': 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.9',
    'referer': 'https://example.com/path/to/file/index.html',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
}

Many servers have started blocking requests user agent strings because of abusive and malicious actors using requests. It's possible this server thinks you're acting maliciously and doing something similar.

Understandable. I expect that is why I get the 403 (visible in curl...goes away once I add headers...including user-agent). user-agent and referer are the only headers they might be switching on (whether I get a 403 or not).

But SSL/TLS breaking is weird, as the only HTTP-related information available at TLS handshake time is the Server Name Indication, which is just the requested hostname... no HTTP headers. So to break SSL it would have to complete the handshake, receive and parse the HTTP request, and then orphan the request at the HTTP layer without ever responding. But I would expect their webserver to have some upstream timeout at some point... I can't imagine their server left the TCP connection open for weeks with no traffic. So at some point the TCP connection should have timed out, or got a FIN from their end, at which point the connection has failed and some kind of error should surface on the requesting client's side.

As it stands now, it looks like we were sitting there for 2 weeks waiting for a response without the tcp connection closing or timing out. I can't imagine that is what is actually happening, but maybe it is and a timeout is the remedy.

Plus this is another level beyond the 403 which isn't super logical to me...

  • 200 - okay (with user-agent and referer)
  • 403 - no auth (curl without user-agent and referer)
  • tls/ssl hang - (requests without user-agent and referer)

I did my tcpdump troubleshooting already for the month but maybe if I get fired up I'll do some more on this topic in the interest of getting the underlying issue identified (e.g. is a FIN being ignored or is the tcp connection really staying open this long?).

This could be solved with a default timeout which we have an open issue for, but you can also set a read timeout to do the same thing yourself.

This one?
#3070

I will try to work around this by specifying a timeout so at least we don't hang without failing for weeks on end...but I suspect there is something that should be fixed here (although it might be in urllib3 or python's http, socket or ssl).

Just for kicks I thought I'd try again with curl using requests' user-agent to see if I could get the server to not respond. Still get the 403.

# curl -s -D - -H 'User-Agent: python-requests/2.28.2' https://example.com/some/path/file.jpg
HTTP/2 403
mime-version: 1.0
content-type: text/html
content-length: 310
expires: Thu, 09 Mar 2023 16:29:48 GMT
cache-control: max-age=0, no-cache
pragma: no-cache
date: Thu, 09 Mar 2023 16:29:48 GMT
server-timing: cdn-cache; desc=HIT
server-timing: edge; dur=1
server-timing: ak_p; desc="466216_388605857_42742355_25_6766_11_0";dur=1

<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>

You don't have permission to access "XXXXXXX" on this server.<P>
XXXXXXXXXX
</BODY>
</HTML>

dkorsunsky commented Mar 17, 2023

Do not use Requests, use PycURL

This is a known issue, documented here and here.

@sigmavirus24 (Contributor)

If in fact the server is responding immediately, before we can write anything, it's possible this is related to the issues @dkorsunsky linked. It's not a requests bug, insofar as we rely on the standard library's http.client, which cannot handle early responses. In other words, as long as we rely on the standard library, there's no fix here for requests.
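The "early response" failure mode is easiest to see with a request body. A hedged stdlib-only sketch (throwaway local server; the method, path, and body size are illustrative): the server sends a complete 403 and closes without ever reading the request, so the client is still busy writing when the connection dies, the send fails, and the already-transmitted 403 is lost.

```python
import socket
import threading
import http.client

def early_responder(listener):
    conn, _ = listener.accept()
    # Respond with a complete 403 immediately and close, without
    # ever reading the client's request.
    conn.sendall(b"HTTP/1.1 403 Forbidden\r\nContent-Length: 0\r\n\r\n")
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
threading.Thread(target=early_responder, args=(srv,), daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", srv.getsockname()[1], timeout=5)
try:
    # A body large enough that the socket buffers cannot absorb it all,
    # so the client is still writing when the server closes on it.
    conn.request("PUT", "/upload", body=b"x" * (64 * 1024 * 1024))
    outcome = "status {0}".format(conn.getresponse().status)
except (BrokenPipeError, ConnectionResetError, socket.timeout) as e:
    outcome = type(e).__name__
print(outcome)
```

On most platforms this surfaces as BrokenPipeError or ConnectionResetError rather than the 403 the server actually sent, which is the shortcoming being described.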

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 1, 2024