First year scraper is outdated #85

Open
shikharish opened this issue Aug 1, 2024 · 17 comments

Comments

@shikharish
Member

Due to the change in the first-year curriculum, the first-year-scraper is outdated and needs to be updated.

@shikharish changed the title from "Scraper is outdated" to "First year scraper is outdated" on Aug 1, 2024
@harshkhandeparkar
Member

> Due to the change in the first-year curriculum, the first-year-scraper is outdated and needs to be updated.

Can you add the details of what has changed?

@harshkhandeparkar
Member

@shikharish ?

@shikharish
Member Author

The format of the PDF from which the first-year timetable is scraped has completely changed, so the scraping logic needs to be changed.

@harshkhandeparkar
Member

> The format of the PDF from which the first-year timetable is scraped has completely changed, so the scraping logic needs to be changed.

Can you send the new PDF?

@shikharish
Member Author

aut24.pdf

@proffapt
Member

So, chillzone doesn't have proper data at the moment?

@shikharish
Member Author

No.

@proffapt
Member

proffapt commented Sep 29, 2024

Oh, so, when will we need to make the required changes?

@harshkhandeparkar
Member

Now, if possible, but chillzone uses very outdated PDF parsing libraries that require Python 3.7. @shikharish, is there no alternative?

@shikharish
Member Author

The problem is not that it uses outdated parsing libraries. This year the whole format of the PDF was changed, so we need to write the scraper logic from scratch.

@harshkhandeparkar
Member

> The problem is not that it uses outdated parsing libraries. This year the whole format of the PDF was changed, so we need to write the scraper logic from scratch.

The format change is fine, we can do it. We should focus on getting rid of the outdated libraries first. This is unmaintainable. Are there any alternatives?

@shikharish
Member Author

Not one I could find. camelot-py parses PDF to xlsx directly, which makes scraping easier, while other scrapers can convert to plain text/HTML.
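
For reference, the camelot-py approach mentioned above can be sketched roughly as follows. This is a minimal illustration and not the existing scraper's code: `clean_rows` and `extract_tables` are hypothetical helper names, and the cleanup step assumes nothing about the new PDF layout beyond cells containing stray whitespace and line breaks.

```python
from typing import List


def clean_rows(rows: List[List[str]]) -> List[List[str]]:
    """Collapse whitespace inside each cell and drop rows that are entirely empty."""
    cleaned = []
    for row in rows:
        # str.split() with no arguments also removes embedded newlines and tabs.
        cells = [" ".join(str(cell).split()) for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_tables(pdf_path: str, pages: str = "all") -> List[List[List[str]]]:
    """Return every table camelot detects in the PDF as a list of cleaned rows."""
    # Imported lazily because camelot-py is a third-party dependency.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages)
    # Each detected table exposes its cells as a pandas DataFrame via `.df`.
    return [clean_rows(table.df.values.tolist()) for table in tables]
```

Calling `extract_tables("aut24.pdf")` would give nested lists that the timetable logic can walk; how those rows map to slots and courses depends on the new format and is not shown here.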

@harshkhandeparkar
Member

> Not one I could find. camelot-py parses PDF to xlsx directly, which makes scraping easier, while other scrapers can convert to plain text/HTML.

Can we use something like libreoffice to convert the pdf to a spreadsheet and then parse that using a recent library?

@shikharish
Member Author

LibreOffice can't do that, AFAIK. We can try using the API of some online tool like ilovepdf, smallpdf....

@harshkhandeparkar
Member

What about onlyoffice?

@shikharish
Member Author

Don't think so.

@harshkhandeparkar
Member

Hmm, in that case we should write a Dockerfile to run the scraper in.
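
Once such an image exists, the scraper could be driven from the host with a small wrapper like the sketch below. Everything specific here is an assumption for illustration: the image tag `chillzone-scraper:py37`, the mount points under `/data`, and the idea that the container's entrypoint takes an input PDF path and an output directory as arguments.

```python
from pathlib import Path
from typing import List


def docker_run_command(
    pdf_path: str,
    out_dir: str,
    image: str = "chillzone-scraper:py37",  # hypothetical image tag
) -> List[str]:
    """Build a `docker run` command that mounts the PDF read-only and an output directory."""
    pdf = Path(pdf_path).resolve()  # docker bind mounts need absolute host paths
    out = Path(out_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{pdf}:/data/input.pdf:ro",
        "-v", f"{out}:/data/out",
        image,
        # Assumed entrypoint arguments: input PDF, then output directory.
        "/data/input.pdf", "/data/out",
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`. Keeping Python 3.7 and the old parsing libraries inside the image means the host only needs Docker.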
