First year scraper is outdated #85
Comments
Can you add the details of what has changed?
The format of the PDF from which the first-year timetable is scraped has completely changed, so the scraping logic needs to be rewritten.
Can you send the new PDF?
So, chillzone doesn't have proper data at the moment?
No.
Oh, so when will we need to make the required changes?
Now, if possible, but chillzone uses very outdated PDF parsing libraries that require Python 3.7. @shikharish, is there no alternative?
The problem is not that it uses outdated parsing libraries. This year the whole format of the PDF was changed, so we need to write the scraper logic from scratch.
The format change is fine, we can do it. We should focus on getting rid of the outdated libraries first. This is unmaintainable. Are there any alternatives?
Not one I could find. camelot-py parses PDF to xlsx directly, which makes scraping easier, while other scrapers can convert to plain text or HTML.
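For reference, the camelot-py step mentioned above looks roughly like the sketch below. It is only an illustration, not the project's actual scraper: the function name `pdf_tables_to_rows` and the injectable `read_pdf` parameter are hypothetical, and the file name in the usage note is a placeholder. It relies on `camelot.read_pdf` returning a list of tables, each exposing its cells as a pandas DataFrame in `.df`.

```python
from typing import Callable, List, Optional


def pdf_tables_to_rows(
    pdf_path: str,
    pages: str = "all",
    read_pdf: Optional[Callable] = None,
) -> List[List[List[str]]]:
    """Return every table found in the PDF as a list of rows of cell strings.

    `read_pdf` is a hypothetical hook so a different reader can be swapped in;
    when it is omitted, camelot-py is used.
    """
    if read_pdf is None:
        # Imported lazily so this module still loads where camelot-py is absent.
        import camelot

        read_pdf = camelot.read_pdf

    tables = read_pdf(pdf_path, pages=pages)  # camelot returns a TableList

    result = []
    for table in tables:
        # Each camelot Table keeps its cells in a pandas DataFrame (`.df`).
        rows = table.df.values.tolist()
        # Normalise every cell to a stripped string before any timetable logic.
        result.append([[str(cell).strip() for cell in row] for row in rows])
    return result
```

Calling `pdf_tables_to_rows("timetable.pdf")` (placeholder file name) would give plain Python lists that the new scraping logic can work on. If an xlsx file is still wanted, camelot's `TableList.export(path, f="excel")` writes one directly.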
Can we use something like LibreOffice to convert the PDF to a spreadsheet and then parse that using a recent library?
LibreOffice can't do that, AFAIK. We can try using the API of some online tool like ilovepdf, smallpdf....
What about OnlyOffice?
Don't think so.
Hmm, in that case we should write a Dockerfile to run the scraper in.
Due to the change in the first-year curriculum, the first-year scraper is outdated and needs to be updated.