Scrappy is a command-line application, written in Go, for scraping internship opportunities from various websites. It pairs a Cobra-based CLI with Colly for efficient data extraction and Go-cron for automated scheduling, storing results in CSV or JSON format.
The folder structure is organized as follows:
Scrappy/
│
├── cmd/ # Command definitions for the CLI application
│ ├── root.go # Main entry point for the CLI commands
│ ├── scrape.go # Command for initiating the scraping process
│ └── cron.go # Command for setting up and managing cron jobs
│
├── internal/ # Internal packages for modular functionality
│ ├── scraper/ # Scraping logic and parsing mechanisms
│ │ ├── scraper.go # Core scraping logic using Colly
│ │ ├── parser.go # Functions for parsing scraped data
│ │ └── helpers.go # Utility functions used in scraping
│ │
│ ├── storage/ # Storage mechanisms and data formatting
│ │ ├── storage.go # Logic for saving data to CSV/JSON files
│ │ └── formatters.go # Formatting functions for output data
│ │
│ └── scheduler/ # Scheduler for automating scraping tasks
│ └── cron.go # Functions for setting up and managing Go-cron jobs
│
├── config/ # Configuration files for the application
│ └── config.yaml # Application configurations (e.g., scraping frequency, URLs)
│
├── scripts/ # Scripts for setting up and managing the project
│ └── setup.sh # Script for setting up the environment (dependencies, etc.)
│
├── data/ # Output directory for scraped data
│ ├── output.csv # Example CSV file containing scraped data
│ └── output.json # Example JSON file containing scraped data
│
├── logs/ # Logs generated during scraping
│ └── scraper.log # Log file for tracking scraping events and errors
│
├── .env # Environment variables for API keys, URLs, etc.
├── .gitignore # Files and directories to be ignored by git
├── README.md # Project documentation (this file)
└── main.go # Entry point for running the Scrappy CLI application
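For orientation, main.go in this layout typically does nothing more than hand off to the cmd package. The sketch below assumes the conventional Cobra wiring and the module path github.com/AdityaKrSingh26/Scrappy (inferred from the repository URL, not confirmed):

```go
// main.go (sketch, assuming the standard Cobra layout)
package main

import "github.com/AdityaKrSingh26/Scrappy/cmd"

func main() {
	cmd.Execute() // delegate to the root command defined in cmd/root.go
}
```

```go
// cmd/root.go (sketch): the root command that subcommands such as
// scrape and cron attach themselves to via rootCmd.AddCommand.
package cmd

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

var rootCmd = &cobra.Command{
	Use:   "scrappy",
	Short: "Scrape internship opportunities from the command line",
}

// Execute runs the CLI and exits non-zero on any error.
func Execute() {
	if err := rootCmd.Execute(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```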
- Clone the Repository:
  git clone https://github.com/AdityaKrSingh26/Scrappy.git
  cd Scrappy
- Setup: Run the setup script to install dependencies:
  ./scripts/setup.sh
- Configure: Update config/config.yaml with the websites you want to scrape and other settings (a hypothetical example is sketched after these steps).
- Run the Scraper: Start scraping using the command:
  go run main.go scrape
- Automate with Cron: Schedule automated scraping using:
  go run main.go cron
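The schema of config/config.yaml is not spelled out in this README, so the snippet below is a hypothetical example of how the configuration might be structured and loaded with the gopkg.in/yaml.v3 package; the field names (scrape_interval, targets) are assumptions for illustration:

```go
package config

import (
	"os"

	"gopkg.in/yaml.v3"
)

// Config mirrors a hypothetical config/config.yaml such as:
//
//	scrape_interval: 6h
//	targets:
//	  - https://example.com/internships
type Config struct {
	ScrapeInterval string   `yaml:"scrape_interval"`
	Targets        []string `yaml:"targets"`
}

// Load reads and parses the YAML configuration at path.
func Load(path string) (*Config, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg Config
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}
```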
- cmd/root.go: Entry point for the CLI; defines the root command that all subcommands attach to.
- internal/scraper/scraper.go: Contains the logic for scraping different websites using Colly (a sketch follows this list).
- internal/scraper/parser.go: Handles parsing the raw HTML or JSON data into usable formats.
- internal/storage/storage.go: Manages the storage of scraped data into CSV or JSON.
- internal/scheduler/cron.go: Contains the logic to run the scraper at regular intervals using Go-cron (also sketched after this list).
- config/config.yaml: Stores configuration such as scraping intervals and target URLs.
- logs/scraper.log: Keeps track of scraping events, errors, and execution logs.
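To make the scraper and scheduler concrete, here are two minimal sketches. Neither is the project's actual code: the CSS selector and the Listing fields are placeholders, and the scheduler assumes "Go-cron" refers to a package like github.com/go-co-op/gocron.

```go
// internal/scraper/scraper.go (sketch): collect listings with Colly.
package scraper

import (
	"log"

	"github.com/gocolly/colly/v2"
)

// Listing is one scraped internship posting (hypothetical fields).
type Listing struct {
	Title string
	Link  string
}

// Scrape visits url and extracts listings. "a.job-listing" is a
// placeholder selector; each target site needs its own.
func Scrape(url string) ([]Listing, error) {
	var listings []Listing

	c := colly.NewCollector()
	c.OnHTML("a.job-listing", func(e *colly.HTMLElement) {
		listings = append(listings, Listing{
			Title: e.Text,
			Link:  e.Request.AbsoluteURL(e.Attr("href")),
		})
	})
	c.OnError(func(_ *colly.Response, err error) {
		log.Printf("scrape error: %v", err)
	})

	if err := c.Visit(url); err != nil {
		return nil, err
	}
	return listings, nil
}
```

```go
// internal/scheduler/cron.go (sketch): run a job on an interval,
// assuming the github.com/go-co-op/gocron package.
package scheduler

import (
	"log"
	"time"

	"github.com/go-co-op/gocron"
)

// Start runs job every interval until the process exits.
func Start(interval time.Duration, job func()) {
	s := gocron.NewScheduler(time.UTC)
	if _, err := s.Every(interval).Do(job); err != nil {
		log.Fatalf("scheduling job: %v", err)
	}
	s.StartBlocking() // blocks the current goroutine
}
```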
- Golang: Core programming language used for developing the CLI.
- Cobra: Library for creating the command-line interface.
- Colly: Powerful Golang library for web scraping.
- Go-cron: Scheduler for automating scraping tasks.
- CSV/JSON: Data formats for exporting scraped data.
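Since both export formats come from the Go standard library, internal/storage/storage.go can stay small. A minimal sketch, assuming a Listing type like the one in the scraper sketch above:

```go
package storage

import (
	"encoding/csv"
	"encoding/json"
	"os"
)

// Listing mirrors the scraper's output (hypothetical fields).
type Listing struct {
	Title string `json:"title"`
	Link  string `json:"link"`
}

// SaveCSV writes listings to path with a header row.
func SaveCSV(path string, listings []Listing) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := csv.NewWriter(f)
	if err := w.Write([]string{"title", "link"}); err != nil {
		return err
	}
	for _, l := range listings {
		if err := w.Write([]string{l.Title, l.Link}); err != nil {
			return err
		}
	}
	w.Flush()
	return w.Error()
}

// SaveJSON writes listings to path as an indented JSON array.
func SaveJSON(path string, listings []Listing) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	enc := json.NewEncoder(f)
	enc.SetIndent("", "  ")
	return enc.Encode(listings)
}
```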