ModuleNotFoundError: No module named 'requests' in scraper 0.4.0 #29
I spun up a brand-new VM (Intel CPU, Amazon Linux), installed Docker on it, and ran the scraper against the Typesense docs site. It worked fine for me (see the output below). I wonder whether a difference in Docker versions is causing this. Which version of Docker Engine are you using?
$ docker run -it --env-file=.env -e CONFIG="{\"index_name\":\"typesense_docs\",\"start_urls\":[{\"url\":\"https://typesense.org/docs/(?P<version>.*?)/\",\"variables\":{\"version\":[\"0.24.0\",\"0.23.1\",\"0.23.0\",\"0.22.2\",\"0.22.1\",\"0.22.0\",\"0.21.0\",\"0.20.0\",\"0.19.0\",\"0.18.0\",\"0.17.0\",\"0.16.1\",\"0.16.0\",\"0.15.0\",\"0.14.0\",\"0.13.0\",\"0.12.0\",\"0.11.2\"]}},{\"url\":\"https://typesense.org/docs/overview/\"},{\"url\":\"https://typesense.org/docs/guide/\"}],\"selectors\":{\"default\":{\"lvl0\":\".content__default h1\",\"lvl1\":\".content__default h2\",\"lvl2\":\".content__default h3\",\"lvl3\":\".content__default h4\",\"lvl4\":\".content__default h5\",\"text\":\".content__default p, .content__default ul li, .content__default table tbody tr\"}},\"scrape_start_urls\":false,\"strip_chars\":\" .,;:#\"}" typesense/docsearch-scraper
Unable to find image 'typesense/docsearch-scraper:latest' locally
latest: Pulling from typesense/docsearch-scraper
677076032cca: Pull complete
3026efbcce37: Pull complete
b83c999f3ae6: Pull complete
4f4fb700ef54: Pull complete
4d02e570415e: Pull complete
fe9dd39ad932: Pull complete
40bdd8cbcb60: Pull complete
330e95c637fc: Pull complete
1c4235bc81bd: Pull complete
f636e29df4a6: Pull complete
2ee46e1d6efd: Pull complete
f2a90558593e: Pull complete
f7cb19d7ba62: Pull complete
b51fd8a46836: Pull complete
72e3879aa441: Pull complete
b656e2665916: Pull complete
95462c1394e2: Pull complete
0a6c9231c464: Pull complete
02b4a1743fdf: Pull complete
fcb6abf81668: Pull complete
066a7661e7fb: Pull complete
b1349c66a67d: Pull complete
cb04953d313a: Pull complete
83cfbae1faa8: Pull complete
4aa2727acdc6: Pull complete
Digest: sha256:ffce60fae1358cfe8ba8a59a50b24dfd835610e543b5fbadba5a84541f7e8b2f
Status: Downloaded newer image for typesense/docsearch-scraper:latest
INFO:scrapy.utils.log:Scrapy 2.8.0 started (bot: scrapybot)
INFO:scrapy.utils.log:Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0], pyOpenSSL 23.0.0 (OpenSSL 3.0.8 7 Feb 2023), cryptography 39.0.1, Platform Linux-5.10.165-143.735.amzn2.x86_64-x86_64-with-glibc2.35
INFO:scrapy.crawler:Overridden settings:
{'DUPEFILTER_CLASS': 'src.custom_dupefilter.CustomDupeFilter',
'LOG_ENABLED': '1',
'LOG_LEVEL': 'ERROR',
'TELNETCONSOLE_ENABLED': False,
'USER_AGENT': 'Algolia DocSearch Crawler'}
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
DEBUG:scrapy.utils.log:Using reactor: twisted.internet.epollreactor.EPollReactor
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
INFO:scrapy.middleware:Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'src.custom_downloader_middleware.CustomDownloaderMiddleware']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
[]
INFO:scrapy.core.engine:Spider opened
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:89: ScrapyDeprecationWarning: RFPDupeFilter subclasses must either modify their overridden '__init__' method and 'from_settings' class method to support a 'fingerprinter' parameter, or reimplement the 'from_crawler' class method.
warn(
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:53: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
self.fingerprinter = fingerprinter or RequestFingerprinter()
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.23.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.24.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.23.1/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.21.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.2/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.18.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.19.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.1/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.17.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.16.1/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.16.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.15.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.14.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.13.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.12.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/guide/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.11.2/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/overview/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.20.0/> (referer: None)
DEBUG:scrapy.dupefilters:Filtered duplicate request: <GET https://typesense.org/docs/0.24.0/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.23.0/api/> (referer: https://typesense.org/docs/0.23.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.1/api/> (referer: https://typesense.org/docs/0.22.1/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.18.0/api/> (referer: https://typesense.org/docs/0.18.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.2/api/> (referer: https://typesense.org/docs/0.22.2/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.21.0/api/> (referer: https://typesense.org/docs/0.21.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.23.1/api/> (referer: https://typesense.org/docs/0.23.1/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.24.0/api/> (referer: https://typesense.org/docs/0.24.0/)
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
> DocSearch: https://typesense.org/docs/0.23.0/api/ 54 records)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.12.0/api/> (referer: https://typesense.org/docs/0.12.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.19.0/api/> (referer: https://typesense.org/docs/0.19.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.13.0/api/> (referer: https://typesense.org/docs/0.13.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.14.0/api/> (referer: https://typesense.org/docs/0.14.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.15.0/api/> (referer: https://typesense.org/docs/0.15.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.16.0/api/> (referer: https://typesense.org/docs/0.16.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.0/api/> (referer: https://typesense.org/docs/0.22.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.16.1/api/> (referer: https://typesense.org/docs/0.16.1/)
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
> DocSearch: https://typesense.org/docs/0.22.1/api/ 51 records)
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
> DocSearch: https://typesense.org/docs/0.18.0/api/ 6 records)
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
> DocSearch: https://typesense.org/docs/0.22.2/api/ 55 records)
.
.
.
INFO:scrapy.core.engine:Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/request_bytes': 77769,
'downloader/request_count': 277,
'downloader/request_method_count/GET': 277,
'downloader/response_bytes': 2140857,
'downloader/response_count': 277,
'downloader/response_status_count/200': 276,
'downloader/response_status_count/404': 1,
'dupefilter/filtered': 10089,
'elapsed_time_seconds': 453.931499,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 2, 22, 16, 54, 6, 607301),
'httpcompression/response_bytes': 11532228,
'httpcompression/response_count': 277,
'memusage/max': 120295424,
'memusage/startup': 68550656,
'request_depth_max': 3,
'response_received_count': 277,
'scheduler/dequeued': 277,
'scheduler/dequeued/memory': 277,
'scheduler/enqueued': 277,
'scheduler/enqueued/memory': 277,
'start_time': datetime.datetime(2023, 2, 22, 16, 46, 32, 675802)}
INFO:scrapy.core.engine:Spider closed (finished)
DEBUG:typesense.api_call:Making get /aliases/typesense_docs
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "GET /aliases/typesense_docs HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making put /aliases/typesense_docs
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "PUT /aliases/typesense_docs HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making delete /collections/typesense_docs_1677081767
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "DELETE /collections/typesense_docs_1677081767 HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
Nb hits: 9097
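As a side note, the hand-escaped CONFIG value in the command above is easy to get wrong. A minimal sketch of one way to avoid the manual escaping, assuming Python 3.8+ is available on the host: serialize the config with json.dumps and pass it to docker as a single argument. The helper name build_docker_command is hypothetical and not part of the scraper, and the config shown is a trimmed version of the one in the command.

```python
import json
import shlex


# Hypothetical helper (not part of the scraper): it only assembles the argv
# list for `docker run`; it does not start a container.
def build_docker_command(config, image="typesense/docsearch-scraper", env_file=".env"):
    """Return the argv list for `docker run`, with CONFIG serialized as JSON."""
    config_json = json.dumps(config, separators=(",", ":"))
    return [
        "docker", "run", "-it",
        f"--env-file={env_file}",
        "-e", f"CONFIG={config_json}",
        image,
    ]


# Trimmed version of the config from the command in this comment.
config = {
    "index_name": "typesense_docs",
    "start_urls": [
        {"url": "https://typesense.org/docs/overview/"},
        {"url": "https://typesense.org/docs/guide/"},
    ],
    "scrape_start_urls": False,
    "strip_chars": " .,;:#",
}

argv = build_docker_command(config)
# shlex.join quotes each argument so the printed line can be pasted into a shell.
print(shlex.join(argv))
```

The list can also be handed to subprocess.run(argv) directly, in which case no shell quoting is involved at all; note that the -it flag expects an attached terminal.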
First of all, many thanks for keeping the previous tags on Docker Hub.
We run the Typesense DocSearch scraper in CI (an EKS cluster in AWS with Amazon Linux nodes).
Up to and including 0.3.5, all our pipelines worked without any issue.
With 0.4.0 we get the error in the title, without having changed anything on our side.
I have pinned our pipelines to 0.3.5 and everything is back to normal.
This might be related to #27 and #25.
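One way to narrow this down is to check whether the module is visible to the interpreter at all. The following is a minimal sketch, assuming you can run a Python snippet with the same interpreter the scraper uses inside the 0.4.0 image; how to do that depends on the image's entrypoint and virtualenv setup, which are not shown in this thread. The function name missing_modules is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be located.

    Only top-level names are supported here: for dotted names, find_spec
    imports the parent package first and may raise instead of returning None.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # `requests` is the module named in the error; add others as needed.
    missing = missing_modules(["requests"])
    print("missing:", missing if missing else "none")
```

If this reports requests as missing under 0.4.0 but not under 0.3.5, that would point at the image's installed dependencies rather than at the CI environment.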