Oxylabs Web Scraper API Scheduler

Requirements

Python 3.9+

Setting up Python Environment

Open a terminal or command prompt;

Clone this repository:

git clone git@gitlab-ssh.int.oxyplatform.io:oxybrain/scraper-api-setup.git

Navigate to your project directory:
```
cd scraper-api-setup/
```
Create a virtual environment using Python 3.9:
```
python3.9 -m venv venv
```
Activate the virtual environment:
- On Mac/Linux:
```
source venv/bin/activate
```
- On Windows:
```
venv\Scripts\activate
```

Installing Requirements

Ensure your virtual environment is activated
Install the required packages:
```
pip install -r requirements.txt
```

Setting up Scraper API

The only thing you need to set it up is to prepare a list of URLs and payload of Scraper API parameters which can be generated in our playground. Below you will find general steps to run our Scraper API and a few examples of different options you have to run it.

Steps:

Generate payload by describing your scraping needs in Oxy playground (https://dashboard.oxylabs.io/?route=/api-playground).
Copy payload and store it either in runtime_files/payload.json or as a new file on your device.
Store URLs you wish to scrape in runtime_files/urls.txt or as a new file on your device. Each url must be separated by newline.
Run script which registers/performs your scraping jobs (ensure you have completed installation steps stated above and have environment activated).
```
python3.9 run.py
```
This will start a setup wizard, where you have to provide information about your account and some other.
Enter your Oxylabs username and password. It will be cached into file .env. If you later wish to change it, just remove this file.

Examples

Store completed jobs locally

Initiate run.py script.
```
python3.9 run.py
```
When asked Select where you wish to store the results. enter 1 to choose Locally.
When asked Full path to existing directory where results should be stored: enter path to directory where completed jobs should be stored. This directory must already exist on your device.
That's it, completed jobs will be stored in desired location after some time, depending on the number of URLs you asked to scrape.

Image of wizard for reference

Store completed jobs in cloud

We do support only AWS S3 or Google Cloud Storage for now.

Initiate run.py script.
```
python3.9 run.py
```
When asked Select where you wish to store the results. enter 2 to choose Cloud.
When asked Path to cloud bucket and directory/partition where results should be stored: enter path to your cloud directory where completed jobs should be stored. Bucket permissions should be adjust by using these instructions - https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/cloud-storage.
When asked Do you want to schedule urls to be scraped repetitively?(y/n) enter n as now we just want to scrape all URLs once.

Image of wizard for reference

Store completed jobs in cloud and schedule it to be repeated.

Initiate run.py script.
```
python3.9 run.py
```
When asked Select where you wish to store the results. enter 2 to choose Cloud.
When asked Path to cloud bucket and directory/partition where results should be stored: enter path to your cloud directory where completed jobs should be stored. Bucket permissions should be adjust by using these instructions - https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/cloud-storage.
When asked Do you want to schedule urls to be scraped repetitively?(y/n) enter y.
Next you have to select frequency.
- If you choose Daily, you will need to enter hour scraping jobs will run each day.
- If you choose Weekly, you will need to select weekday scraping jobs will run each week.
- If you choose Monthly, you will need to enter day scraping jobs will run each month.
Lastly, you will be asked to specify date and time when scheduler should stop. You MUST enter end time in format of 2032-12-21 12:34:45. You will be able to stop scheduler before end time. Example of how to do it is in next section.
You will get schedule id, which you should save.

Image of wizard for reference

Stopping scheduler

Initiate run.py script.
```
python3.9 stop.py
```
Enter your username and password.
Select schedule id you wish to deactivate.

Image of wizard for reference

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Oxylabs Web Scraper API Scheduler

Requirements

Setting up Python Environment

Installing Requirements

Setting up Scraper API

Examples

Store completed jobs locally

Store completed jobs in cloud

Store completed jobs in cloud and schedule it to be repeated.

Stopping scheduler

Files

README.md

Latest commit

History

README.md

File metadata and controls

Oxylabs Web Scraper API Scheduler

Requirements

Setting up Python Environment

Installing Requirements

Setting up Scraper API

Examples

Store completed jobs locally

Store completed jobs in cloud

Store completed jobs in cloud and schedule it to be repeated.

Stopping scheduler