Crawl newest IT jobs at topcv.vn onto PostgreSQL database.

minkminkk/scraping-topcv

scraping-topcv

This is a mini project that crawls basic information about the newest IT job postings on TopCV and imports the crawled data into a PostgreSQL database.

About the project

The data crawled from each job posting includes:

  • job_id: Job posting ID, as stored in TopCV's backend.
  • job_title: Job title.
  • company: Recruiting company.
  • salary_min, salary_max: Salary range (in million VND).
  • yrs_of_exp_min, yrs_of_exp_max: Years of experience required.
  • job_city: Working location (city).
  • due_date: Deadline for application.
  • jd: Job description.

In addition, each record in the PostgreSQL database has:

  • created_at: Timestamp at which the record was created (in GMT+07).
  • last_modified: Timestamp of the most recent modification to the record (in GMT+07).
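
As a rough illustration of the fields listed above, one record could be modeled in Python as follows. This is only a sketch: the class name `JobPosting`, the helper `now_gmt7`, the `touch` method, and all field types (for example, `job_id` as an integer and optional salary bounds) are assumptions, not the schema actually used by this project.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta, timezone
from typing import Optional

# GMT+07, the timezone stated above for both timestamps.
GMT_PLUS_7 = timezone(timedelta(hours=7))


def now_gmt7() -> datetime:
    """Return the current time as a timezone-aware GMT+07 datetime."""
    return datetime.now(GMT_PLUS_7)


@dataclass
class JobPosting:
    """Hypothetical in-memory record; the field types are assumptions."""

    job_id: int
    job_title: str
    company: str
    salary_min: Optional[float]  # million VND; None if not stated
    salary_max: Optional[float]  # million VND; None if not stated
    yrs_of_exp_min: Optional[int]
    yrs_of_exp_max: Optional[int]
    job_city: str
    due_date: date
    jd: str
    created_at: datetime = field(default_factory=now_gmt7)
    last_modified: datetime = field(default_factory=now_gmt7)

    def touch(self) -> None:
        """Refresh last_modified after the record has been changed."""
        self.last_modified = now_gmt7()
```

In the real project the timestamps may well be set by the database itself; the defaults here only mirror the described behavior.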

Required programs

  • git.
  • docker with docker-compose.

Usage

Clone the git repository

git clone https://github.com/minkminkk/scraping-topcv.git

Set up

To initialize the database and the crawler, run:

docker compose up

A PostgreSQL database with the required table will be set up, and then the crawler will start crawling.

Tear down

When you are done, run:

docker compose down

The containers and network will be deleted.

Note

The crawler cannot yet crawl the full dataset because TopCV limits the request rate. In the future, rotating proxies could be used to work around this limit.
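
The rotating-proxy idea could look roughly like the sketch below. It is not part of this project: the function `fetch_with_rotation`, its parameters, and the assumption that a rate-limited request returns HTTP status 429 are all hypothetical. The actual HTTP call is passed in as `fetch`, so the sketch does not depend on any particular HTTP library.

```python
import itertools
import time
from typing import Callable, Optional, Sequence, Tuple

# A fetch callable takes (url, proxy) and returns (status_code, body).
Fetch = Callable[[str, str], Tuple[int, str]]


def fetch_with_rotation(
    url: str,
    proxies: Sequence[str],
    fetch: Fetch,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try a URL through successive proxies, backing off when rate limited.

    Returns the response body on success, or None if every attempt
    was rate limited. Any other status raises RuntimeError.
    """
    if not proxies:
        raise ValueError("at least one proxy is required")

    proxy_cycle = itertools.cycle(proxies)
    for attempt in range(max_attempts):
        proxy = next(proxy_cycle)  # move to the next proxy on each attempt
        status, body = fetch(url, proxy)
        if status == 200:
            return body
        if status == 429:
            # Assumed rate-limit signal: wait 1x, 2x, 4x, ... base_delay.
            sleep(base_delay * 2 ** attempt)
            continue
        raise RuntimeError(f"unexpected status {status} via {proxy}")
    return None
```

Combining rotation with exponential backoff keeps the request rate per proxy low, which is usually gentler on the target site than rotation alone.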
