
scraping allrecipes website response errors #13

Open

schnapi opened this issue Oct 27, 2017 · 4 comments


schnapi commented Oct 27, 2017

I would like to know why I am getting many errors like the following when scraping allrecipes.com.

Thanks!

2017-10-27 13:31:38 [allrecipes] DEBUG: No item received for http://allrecipes.com/recipe/16348/baked-pork-chops-i/
2017-10-27 13:31:38 [scrapy.core.scraper] ERROR: Spider error processing <GET http://allrecipes.com/recipe/16348/baked-pork-chops-i/> (referer: http://allrecipes.com/recipes/?page=2)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/mnt/c/Users/Sandi/Desktop/food2vec-master/food2vec-master/dat/RecipesScraper/RecipesScraper/spiders/allrecipes_spider.py", line 33, in parse_item
    if len(data['items']) == 0:
TypeError: list indices must be integers, not str
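The traceback points at `data['items']` in `parse_item`: indexing a list with a string key raises exactly this `TypeError`, so on these pages `data` is evidently a list rather than a dict. A minimal sketch of a guard (the function name and JSON shapes here are illustrative assumptions, not the spider's actual code):

```python
import json

def extract_items(raw_json):
    """Return the 'items' payload whether the parsed JSON is a dict or a list.

    Hypothetical helper: some pages return the payload directly as a dict,
    others wrap it in a list, and data['items'] on a list raises
    'TypeError: list indices must be integers, not str'.
    """
    data = json.loads(raw_json)
    if isinstance(data, list):
        # Take the first dict element, or fall back to an empty dict.
        data = next((d for d in data if isinstance(d, dict)), {})
    return data.get('items', [])
```

With this shape check, `len(extract_items(raw)) == 0` can be tested safely on either response shape.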

schnapi commented Oct 27, 2017

2017-10-27 13:36:31 [scrapy.extensions.logstats] INFO: Crawled 382 pages (at 86 pages/min), scraped 0 items (at 0 items/min)
2017-10-27 13:36:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=33> (referer: None)
2017-10-27 13:36:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=34> (referer: None)
2017-10-27 13:36:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=35> (referer: None)
2017-10-27 13:36:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=36> (referer: None)
2017-10-27 13:36:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=37> (referer: None)
2017-10-27 13:36:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=38> (referer: None)
2017-10-27 13:36:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=40> (referer: None)
2017-10-27 13:36:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=39> (referer: None)
2017-10-27 13:36:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=41> (referer: None)
2017-10-27 13:36:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=42> (referer: None)
2017-10-27 13:36:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=43> (referer: None)
2017-10-27 13:36:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=45> (referer: None)
2017-10-27 13:36:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=44> (referer: None)
2017-10-27 13:36:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=48> (referer: None)
2017-10-27 13:36:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=46> (referer: None)
2017-10-27 13:36:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipe/13423/my-chili/> (referer: http://allrecipes.com/recipes/?page=34)
2017-10-27 13:36:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=47> (referer: None)
2017-10-27 13:36:58 [scrapy.core.scraper] ERROR: Spider error processing <GET http://allrecipes.com/recipe/13423/my-chili/> (referer: http://allrecipes.com/recipes/?page=34)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/mnt/c/Users/Sandi/Desktop/food2vec-master/food2vec-master/dat/RecipesScraper/RecipesScraper/spiders/allrecipes_spider.py", line 31, in parse_item
    if len(data['items']) == 0:
TypeError: list indices must be integers, not str


schnapi commented Oct 27, 2017

Do you still have the allrecipes data file? Also, the allrecipes website has blocked my IP. Do you have any suggestions for handling this problem? Thank you!


jaanli commented Oct 28, 2017

Thanks @schnapi -- cc'ing @brandonmburroughs here too in case he's interested (he wrote a great scraper for it).

Let me know if the allrecipes file here works for you:

https://github.com/altosaar/food2vec/tree/master/dat

There are also preprocessing scripts here: https://github.com/altosaar/food2vec/blob/master/src/process_scraped_data.py

@aayushworkiitr

Facing a similar issue here. I wrote a scraper for allrecipes and initially got data from the website, but they have probably blacklisted my IP. Does anyone know a good workaround?
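Not from this thread, but a standard first step against IP bans with Scrapy is to slow the crawl down in `settings.py`. The log above shows ~86 pages/min, which is aggressive for a single IP. The values below are illustrative defaults, not tuned for allrecipes:

```python
# Illustrative Scrapy settings.py fragment to throttle a crawl and
# reduce the chance of an IP ban. All setting names are real Scrapy
# settings; the specific values are example choices.
DOWNLOAD_DELAY = 2                  # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # keep per-domain parallelism low
AUTOTHROTTLE_ENABLED = True         # adapt delay to observed server latency
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30
ROBOTSTXT_OBEY = True               # respect the site's crawl rules
```

If the ban persists even at low rates, rotating proxies or simply waiting for the block to expire are the usual fallbacks.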
