New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix GitHub backend upstream detection #248
Conversation
…reating releases URL
Thanks |
4e56cef
to
dcbd0d1
Compare
@pypingou rebased and removed the merge commit. The new regex should correctly match the GH URL even when it is not at the root of the repo. |
project = model.Project( | ||
name='pkgdb2', | ||
homepage='https://github.com/fedora-infra/pkgdb2', | ||
homepage='https://github.com/fedora-infra/pkgdb2/issues', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if we want to support this, as it is not a valid home page and if we start to be smart and cover corner cases such as this one, we may end up with a large bowl of spaghetti hard to maintain in the long term.
Ok thinking a little bit about this, I'm having a little internal debate. Basically, we're here fixing one issue: This mean they can practically do anything, we'll be smart enough to figure out the correct URL, even if they provide a homepage link to something that's not the homepage. Anitya deals with projects naming and versions, both of which are a mess, in general. The road we took was basically: Garbage in -> Garbage out. If the information you provide are incorrect (and we try to be clear about what we want), it won't work (KISS). We have two options from there:
So we could report early that non-https url for github are potentially invalid, or just adjust our checks that it splits by What do you think? |
I don't think it's especially negative if one of our backends is more flexible in this aspect. Browsing through Anitya, many projects simply did not work because small variations in the URL broke the entire detection mechanism. e.g see firefox I think proper regex could even be used here, although using something like urlparse could also be an option. If we keep it simple like we have it now, but have mediocre results, I'm not sure we're accomplishing our goal if we aren't flexible enough to handle most of the differences in what users provide. In regards to having a non-repo-home link as the homepage, I have seen users set their homepage to /issues or some other /wiki path. Since we're trying to get rid of our technical debt, I think it would be a good goal to introduce smarter backends to increase the accuracy of our information. |
Would it instead be an idea to just detect wrong URLs for certain projects and just throw an error with those? Regardless of what people set, I'd say we should also mark any /issues or /tags homepage as invalid. Doing such validity checks should be a lot easier than normalizing all the data. |
@puiterwijk perhaps. However, how are we defining "valid" homepage? You could argue that it is plenty valid to have the homepage be /issues, the same way projects have Trac as their homepage. Nevertheless, I'm not sure how proper it would be to revert to the old code that we have in place for the GitHub backend and some other backends. It used a system of |
I am curious, could you point to some examples where people used /wiki or /issues or where the current code was returning incorrect results? |
I can't point you to any specific examples, but the current code would not handle a /wiki or /issues path well. The current code would also break if you use a trailing slash. |
Seems to work here, but granted, I didn't test in anitya itself :) |
I just tried on release-monitoring.org for guake where I specify that the version_url is |
Anyway, I kinda like the change, but I also see cons, so I want to take a little more time to think about it |
No, not the GH owner/project. The GH link in the homepage. Try removing the owner/project, then put in https://github.com/guake/guake/ as the homepage |
Oh I see, it's a bug in https://github.com/fedora-infra/anitya/pull/248/files#diff-c3055793905b5591f47858e267f3b7ecL69 where |
Yeah. I mentioned it in the ticket. Instead of adjusting that element (which I didn't think was easy to maintain because the code structure differed so much between URL types), I rewrote it using regex which seemed more fault tolerant. |
@pypingou any updates? |
I'm still very un-decided, for few reasons:
On the plus side:
But then again, it's not something I've ever seen being asked, and the owner/project fields allows to have a different homepage anyway. So I see pros and cons |
If we can, we should decide one way or the other.. just to get this out of limbo eventually.
FWIW, I find personally find regexes somewhat hard to read as well -- but they are a kind of lingua franca for programmatic string parsing. I don't think they'd be a detriment to the long term readability of the code. A programmer can always just look up how they work from one of the many references online for a refresher (which I often have to).
This is probably the strongest objection, and I basically agree. @puiterwijk's suggestion about doing validation and rejecting different urls instead of normalization as is done here is a good alternative.
The circumvention doesn't seem that large to me. The diff is pretty contained. |
Compare to a one-liner? :) |
A one-liner that might require more one-liners later on? |
Yup, and that's true for any piece of code, this one as much as the next one :) |
@cydrobolt, first of all thanks for finding the bug, stepping up to fix it and provide this pull-request. I'm going to close it as imho the cons out-weight the pros (listed above). I'm sorry it took so long and I sincerely hope it doesn't discourage you, your code is working That being said, the issue of the trailing slash still needs addressing, since you're the one who discovered |
@pypingou sure, I'll make another PR later |
Bug:
https://
)Patch:
startswith
andendswith
detection mechanisms