This is a mini project that crawls basic information about the newest IT job postings on TopCV. The crawled data is imported into a PostgreSQL database.
Data crawled from each job posting includes:

- `job_id`: Job posting ID, as stored in TopCV's backend.
- `job_title`: Job title.
- `company`: Recruiting company.
- `salary_min`, `salary_max`: Salary range (in million VND).
- `yrs_of_exp_min`, `yrs_of_exp_max`: Years of experience required.
- `job_city`: Working location (city).
- `due_date`: Application deadline.
- `jd`: Job description.
Moreover, each entry in the PostgreSQL database also has:

- `created_at`: Timestamp at which the record was created (in GMT+07).
- `last_modified`: Timestamp of the most recent modification to the record (in GMT+07).
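For illustration, a single crawled record could be represented in Python roughly as the dictionary below. The field names follow the list above; the values are entirely made up and do not come from this project:

```python
from datetime import date

# Hypothetical example of one crawled job record (all values are made up).
sample_job = {
    "job_id": 1234567,                        # ID from TopCV's backend
    "job_title": "Backend Developer (Python)",
    "company": "Example Software JSC",
    "salary_min": 20,                         # million VND
    "salary_max": 35,                         # million VND
    "yrs_of_exp_min": 2,
    "yrs_of_exp_max": 4,
    "job_city": "Ha Noi",
    "due_date": date(2024, 12, 31),           # application deadline
    "jd": "Develop and maintain backend services...",
}
```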
To run this project, you need `git` and `docker` with `docker-compose` installed.
Clone this repository:

```bash
git clone https://github.com/minkminkk/scraping-topcv.git
```
To initialize the database and the crawler, run:

```bash
docker compose up
```
A PostgreSQL database with the required table will be set up, and then the crawler will start crawling.
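Once the containers are up, you can inspect the crawled data directly. The sketch below is only an illustration: the connection parameters (host, port, database, user, password) and the table name `jobs` are assumptions, so check `docker-compose.yml` for the values this project actually uses:

```python
import psycopg2  # assumes psycopg2-binary is installed

# All connection parameters and the table name below are assumptions;
# check docker-compose.yml for the actual values used by this project.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="topcv",
    user="postgres",
    password="postgres",
)

with conn, conn.cursor() as cur:
    # Show the 5 most recently inserted job postings (table name assumed).
    cur.execute(
        "SELECT job_id, job_title, company, created_at "
        "FROM jobs ORDER BY created_at DESC LIMIT 5;"
    )
    for row in cur.fetchall():
        print(row)

conn.close()
```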
After you are done, run:
```bash
docker compose down
```
The containers and network will be deleted.
The crawler cannot yet crawl all available data because TopCV limits the request rate. In the future, crawling with rotating proxies could be implemented to overcome this, as sketched below.
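This feature is not implemented yet; the following is only a minimal sketch of what proxy rotation could look like using the `requests` library. The proxy URLs are placeholders, not real endpoints:

```python
import itertools
import requests

# Placeholder proxy URLs -- real rotating-proxy endpoints would go here.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

def fetch(url: str) -> requests.Response:
    """Fetch a URL, routing each request through the next proxy in the pool."""
    proxy = next(proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
```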