First year scraper is outdated #85

shikharish · 2024-08-01T16:55:48Z

Due to the change in the first-year curriculum, the first-year-scraper is outdated and needs to be updated.

harshkhandeparkar · 2024-08-01T17:40:14Z

> Due to the change in the first-year curriculum, the first-year-scraper is outdated and needs to be updated.

Can you add the details of what has changed?

harshkhandeparkar · 2024-08-06T14:44:32Z

shikharish · 2024-08-06T14:49:04Z

The format of the PDF from which the first-year timetable is scraped has completely changed, so the scraping logic needs to be rewritten.

harshkhandeparkar · 2024-08-06T15:00:04Z

> The format of the PDF from which the first-year timetable is scraped has completely changed, so the scraping logic needs to be rewritten.

Can you send the new PDF?

shikharish · 2024-08-07T06:31:00Z

aut24.pdf

proffapt · 2024-09-29T02:49:45Z

So, chillzone doesn't have proper data at the moment?

shikharish · 2024-09-29T04:28:35Z

No.

proffapt · 2024-09-29T05:48:44Z

Oh, so, when will we need to make the required changes?

harshkhandeparkar · 2024-09-29T09:39:47Z

Now, if possible, but chillzone uses very outdated PDF parsing libraries that require Python 3.7. @shikharish, is there no alternative?

shikharish · 2024-10-04T11:20:54Z

The problem is not that it uses outdated parsing libraries. This year the whole format of the PDF changed, so we need to write the scraper logic from scratch.

harshkhandeparkar · 2024-10-05T05:29:36Z

> The problem is not that it uses outdated parsing libraries. This year the whole format of the PDF changed, so we need to write the scraper logic from scratch.

The format change is fine, we can do it. We should focus on getting rid of the outdated libraries first. This is unmaintainable. Are there any alternatives?

shikharish · 2024-10-05T20:54:17Z

Not one I could find. camelot-py parses the PDF to xlsx directly, which makes scraping easier, while other scrapers convert to plain text or HTML.

harshkhandeparkar · 2024-10-06T09:14:10Z

> Not one I could find. camelot-py parses the PDF to xlsx directly, which makes scraping easier, while other scrapers convert to plain text or HTML.

Can we use something like LibreOffice to convert the PDF to a spreadsheet and then parse that using a recent library?

shikharish · 2024-10-06T09:26:37Z

LibreOffice can't do that, afaik. We could try using the API of some online tool like ilovepdf or smallpdf.

harshkhandeparkar · 2024-10-06T17:11:58Z

What about onlyoffice?

shikharish · 2024-10-07T13:50:57Z

Don't think so.

harshkhandeparkar · 2024-10-07T19:08:13Z

Hmm, in that case we should write a Dockerfile to run the scraper in.
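
A minimal sketch of what such a Dockerfile might look like, pinning the Python 3.7 runtime mentioned above so the old parsing libraries keep working. The file names `requirements.txt` and `scraper.py` are hypothetical placeholders, not taken from the repository, and the system packages camelot-py needs may differ from the single one shown here.

```dockerfile
# Sketch only: pins the interpreter version the outdated libraries require.
FROM python:3.7-slim

# Assumption: camelot-py needs Ghostscript at runtime; other system
# packages may also be required depending on the parsing mode used.
RUN apt-get update \
    && apt-get install -y --no-install-recommends ghostscript \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Hypothetical dependency file; install it first so the layer is cached.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the project and run the (hypothetical) entry point.
COPY . .
CMD ["python", "scraper.py"]
```

Running the scraper in a container like this would keep the Python 3.7 dependency isolated from the host, whatever parsing logic is eventually written for the new PDF format.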

shikharish changed the title ~~Scraper is outdated~~ First year scraper is outdated Aug 1, 2024

proffapt added this to Metakgp Dreams Aug 1, 2024

github-project-automation bot moved this to Todo in Metakgp Dreams Aug 1, 2024
