Combine efforts with enthec/webappanalyzer #52
As a community project that took a really big hit, I think the flag raised by @derekperkins is legitimate and important. Maybe @tunetheweb and @max-ostapenko, who contribute a lot here, could also weigh in? If we don't communicate or collaborate, I don't see the point of re-creating yet another fork of this tech.
We evaluated the options in the doc referenced on that linked issue, including 1) licensing, 2) using any well-maintained fork that sprang up from the community, or 3) maintaining our own fork on a best-efforts basis as and when members of the community opened PRs. That last option is what we went with, and it has been working pretty well for us. @max-ostapenko has since even automated CI checks using our WebPageTest instance to give us further confidence when merging and to reduce the maintenance overhead.

We're not looking to fully maintain a Wappalyzer fork and only have limited resources to commit to this one, as noted in that doc. We'll be happy to evaluate any other fork that does spring up, and note that Enthec's fork was one of the most active ones (though it seems to have died off a little in the last few months). However, they have also made some changes that may make it incompatible with our usage.

So, at present there are a few concerns with moving to that fork and, as I said, our current decision is working out well enough for us. TL;DR: no plans to combine efforts with others at the moment, but happy to leave this issue open in case the situation changes.
Hi! We see the issue and we totally understand it. We plan to keep updating the project; we just have our own pace now, doing fewer but bigger commits than before. Not sure how your CI works, but we have fully automated ours as well, following the original structure.

About those refactors: they were about inconsistencies in the structure and the difficulty for strongly typed languages such as Go to map those values properly. We decided to convert each ambiguous value to a single type. For example, "implies" could be either a string or a list, so we converted it to a list even when it only has one element. We found a lot of these while building the CI, and some open issues had already suggested making it easier. Whatever implementation of the original Wappalyzer you use, it already had to handle "string or string list", so those changes shouldn't make any difference. There are some we are unable to change without breaking old Wappalyzer implementations, such as "dom", so we just tried to reduce them as much as possible.

We also plan on publishing libraries, like the Python one we already have and have someone hired to publish, but that's on the to-do list. We are really interested in getting as many technologies as possible to make detection as accurate as possible; maybe we can work something out about syncing technologies between repos.

Happy to talk to you. Let's see how this evolves; we will leave our issue open as well in case anything changes.
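For readers unfamiliar with the ambiguity being described, here is a minimal sketch of that kind of normalization. This is not the actual enthec validator or fix script; the list of affected fields is an assumption chosen for illustration.

```python
import json

# Fields that historically accepted either a single string or a list of strings.
# This field list is an assumption for illustration; check the upstream schema.
STRING_OR_LIST_FIELDS = ["implies", "requires", "excludes", "html", "scriptSrc"]


def normalize_entry(tech: dict) -> dict:
    """Coerce string-or-list fields to always be lists, so strongly typed
    consumers (e.g. Go) can map each field to a single type."""
    for field in STRING_OR_LIST_FIELDS:
        value = tech.get(field)
        if isinstance(value, str):
            # A bare string like "PHP" becomes ["PHP"].
            tech[field] = [value]
    return tech


if __name__ == "__main__":
    example = {"implies": "PHP", "scriptSrc": ["example\\.js"]}
    print(json.dumps(normalize_entry(example)))
    # {"implies": ["PHP"], "scriptSrc": ["example\\.js"]}
```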
We had an internal fork that did this, since it is annoying to deal with.
Are there any specifics you have in mind that aren't backwards compatible?
So at the moment we allow people to provide a list of example URLs in the original PR comment, then run the branch through the HTTP Archive infrastructure and output the results, so we can see whether the detection was successful or not. We also run against the https://almanac.httparchive.org/en/2022/ site by default to ensure it does NOT detect the new technology (assuming that site does not use it). See this example PR: #71, where you can see the GitHub Actions comment with the new detections.
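In rough terms, the check amounts to something like the sketch below. The `detect` callable is a hypothetical stand-in for running the branch's detections through the WebPageTest / HTTP Archive infrastructure, which is not shown here.

```python
from typing import Callable, Iterable, Set

# Site assumed not to use the new technology, used as a negative control.
NEGATIVE_CONTROL = "https://almanac.httparchive.org/en/2022/"


def check_pr(technology: str,
             example_urls: Iterable[str],
             detect: Callable[[str], Set[str]]) -> bool:
    """Return True if `technology` is detected on every example URL from the
    PR comment and is NOT detected on the negative-control site."""
    ok = True
    for url in example_urls:
        if technology not in detect(url):
            print(f"FAIL: {technology} not detected on {url}")
            ok = False
    if technology in detect(NEGATIVE_CONTROL):
        print(f"FAIL: {technology} unexpectedly detected on {NEGATIVE_CONTROL}")
        ok = False
    return ok
```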
Oh that's good to know!
BTW, happy for you to take our detections if you can easily automate that, if you want?
Of course we are trying to keep as much legacy behaviour as possible, but the JSONs were chaotic and many of those values were not documented. We have been able to normalize every field except dom, as you can see here: https://github.com/enthec/webappanalyzer/blob/main/.github/workflows/scripts/technology_validator.py#L311. The dom validator: https://github.com/enthec/webappanalyzer/blob/main/.github/workflows/scripts/technology_validator.py#L218
Given this situation, we are not sure how to transform the list of strings into the second, more chaotic version, even though it seems more complete (probably {"a[href*='kqsdesign.pl'][target='_blank']": {"exists": ""}}). Also, I don't think many Wappalyzer implementations handle that form; they probably just handle the simpler one. This is what we did for our Python lib: https://github.com/enthec/python-webappanalyzer/blob/master/webappanalyzer/webappanalyzer.py#L275
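To make the transformation being discussed concrete, here is a minimal sketch of converting the simple list-of-selectors form of `dom` into the object form. This is an assumption about the mapping described above, not the code in the linked Python lib.

```python
def normalize_dom(dom):
    """Convert the legacy list-of-selectors form of "dom" into the object form,
    e.g. ["a[href*='kqsdesign.pl'][target='_blank']"] becomes
    {"a[href*='kqsdesign.pl'][target='_blank']": {"exists": ""}}.
    A bare string is treated as a one-element list; values that are already
    objects are left untouched."""
    if isinstance(dom, str):
        dom = [dom]
    if isinstance(dom, list):
        return {selector: {"exists": ""} for selector in dom}
    return dom
```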
Oh nice! We just do static structure validation in our CI; we can report any issues with that if you want.
Definitely, please do let us know if you spot any issues with the detections we've added that have slipped past us!
@enthec-opensource I like the implementation of your schema improvements. We have around 15% of technologies that are not detected on a single one of the 12M websites.
Sure, just pushed a script with those fixes (https://github.com/enthec/webappanalyzer/blob/main/scripts/fix.py); just run it. Will try to sync and validate everything else next week.
After lots of looking, the two most active forks are this one and https://github.com/enthec/webappanalyzer. I was wondering if it made sense to join forces in a single repo.
cc @enthec