Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GitHub backend upstream detection #248

Closed
wants to merge 4 commits into from

Conversation

cydrobolt
Copy link
Contributor

Bug:

Patch:

  • Remove some redundant code
  • Use regex rather than fragile startswith and endswith detection mechanisms
  • Fix broken fetch URL generation

@pypingou
Copy link
Member

pypingou commented Jan 9, 2016

  • There is a Merge commit in here, could you see to remove it?
  • Code wise it looks good, but I would be curious if we could add some unit-tests showing what didn't work before and works now so that we're sure to not break it again the future (at least these parts)

Thanks

@cydrobolt
Copy link
Contributor Author

@pypingou rebased and removed the merge commit.
I improved the tests a little. Previously, we weren't testing certain cases (e.g if there is a trailing slash).

The new regex should correctly match the GH URL even when it is not at the root of the repo.

project = model.Project(
name='pkgdb2',
homepage='https://github.com/fedora-infra/pkgdb2',
homepage='https://github.com/fedora-infra/pkgdb2/issues',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if we want to support this, as it is not a valid home page and if we start to be smart and cover corner cases such as this one, we may end up with a large bowl of spaghetti hard to maintain in the long term.

@pypingou
Copy link
Member

Ok thinking a little bit about this, I'm having a little internal debate.

Basically, we're here fixing one issue: we require urls to use https by allowing people to put any URL as home page.

This mean they can practically do anything, we'll be smart enough to figure out the correct URL, even if they provide a homepage link to something that's not the homepage.

Anitya deals with projects naming and versions, both of which are a mess, in general. The road we took was basically: Garbage in -> Garbage out. If the information you provide are incorrect (and we try to be clear about what we want), it won't work (KISS).
We can fix the GitHub backend for this (you just did), but then for one backend it behaves differently from all the other ones.

We have two options from there:

  • we're not smart, everywhere (easy)
  • we fix all the other backends to be smart as well (much harder, if even possible for some)

So we could report early that non-https url for github are potentially invalid, or just adjust our checks that it splits by :// or uses urllib/urlparse to split so that we can safely check the domain of the URL and ensure it's indeed github.
It's not the approach you took there, but I wonder if introducing this level of flexibility on one backend won't engage us to adjust all the other ones as well.

What do you think?

@cydrobolt
Copy link
Contributor Author

I don't think it's especially negative if one of our backends is more flexible in this aspect. Browsing through Anitya, many projects simply did not work because small variations in the URL broke the entire detection mechanism. e.g see firefox

I think proper regex could even be used here, although using something like urlparse could also be an option. If we keep it simple like we have it now, but have mediocre results, I'm not sure we're accomplishing our goal if we aren't flexible enough to handle most of the differences in what users provide.

In regards to having a non-repo-home link as the homepage, I have seen users set their homepage to /issues or some other /wiki path.

Since we're trying to get rid of our technical debt, I think it would be a good goal to introduce smarter backends to increase the accuracy of our information.

@puiterwijk
Copy link
Contributor

Would it instead be an idea to just detect wrong URLs for certain projects and just throw an error with those?
When we have checks like that, we can also run them over the current dataset, and ask the submitters of the current wrong urls to fix them or fix them ourselves.

Regardless of what people set, I'd say we should also mark any /issues or /tags homepage as invalid.

Doing such validity checks should be a lot easier than normalizing all the data.

@cydrobolt
Copy link
Contributor Author

@puiterwijk perhaps. However, how are we defining "valid" homepage? You could argue that it is plenty valid to have the homepage be /issues, the same way projects have Trac as their homepage.

Nevertheless, I'm not sure how proper it would be to revert to the old code that we have in place for the GitHub backend and some other backends. It used a system of .endwiths and startswiths that were difficult to understand, maintain, and often returned incorrect results.

@pypingou
Copy link
Member

I am curious, could you point to some examples where people used /wiki or /issues or where the current code was returning incorrect results?

@cydrobolt
Copy link
Contributor Author

I can't point you to any specific examples, but the current code would not handle a /wiki or /issues path well.

The current code would also break if you use a trailing slash.

@pypingou
Copy link
Member

The current code would also break if you use a trailing slash.

>>> import requests
>>> requests.get('https://github.com/fedora-infra/fedocal//tags')
<Response [200]>

Seems to work here, but granted, I didn't test in anitya itself :)

@pypingou
Copy link
Member

I just tried on release-monitoring.org for guake where I specify that the version_url is guake/guake/ seems to behave fine

@pypingou
Copy link
Member

Anyway, I kinda like the change, but I also see cons, so I want to take a little more time to think about it

@cydrobolt
Copy link
Contributor Author

No, not the GH owner/project. The GH link in the homepage.

Try removing the owner/project, then put in https://github.com/guake/guake/ as the homepage

@pypingou
Copy link
Member

Oh I see, it's a bug in https://github.com/fedora-infra/anitya/pull/248/files#diff-c3055793905b5591f47858e267f3b7ecL69 where [:1] should be [:-1]

@cydrobolt
Copy link
Contributor Author

Yeah. I mentioned it in the ticket.

Instead of adjusting that element (which I didn't think was easy to maintain because the code structure differed so much between URL types), I rewrote it using regex which seemed more fault tolerant.

@cydrobolt
Copy link
Contributor Author

@pypingou any updates?

@pypingou
Copy link
Member

I'm still very un-decided, for few reasons:

  • I don't really think regex makes the code easier to read
  • It provides a precedent and eventually will bring expectations for the other backends that we won't be able to match
  • It feels like a big circumvent for a small bug

On the plus side:

  • It brings flexibility

But then again, it's not something I've ever seen being asked, and the owner/project fields allows to have a different homepage anyway. So I see pros and cons

@ralphbean
Copy link
Contributor

I'm still very un-decided, for few reasons:

If we can, we should decide one way or the other.. just to get this out of limbo eventually.

I don't really think regex makes the code easier to read

FWIW, I find personally find regexes somewhat hard to read as well -- but they are a kind of lingua franca for programmatic string parsing. I don't think they'd be a detriment to the long term readability of the code. A programmer can always just look up how they work from one of the many references online for a refresher (which I often have to).

It provides a precedent and eventually will bring expectations for the other backends that we won't be able to match

This is probably the strongest objection, and I basically agree. @puiterwijk's suggestion about doing validation and rejecting different urls instead of normalization as is done here is a good alternative.

It feels like a big circumvent for a small bug

The circumvention doesn't seem that large to me. The diff is pretty contained.

@pypingou
Copy link
Member

It feels like a big circumvent for a small bug

The circumvention doesn't seem that large to me. The diff is pretty contained.

Compare to a one-liner? :)

@cydrobolt
Copy link
Contributor Author

Compare to a one-liner? :)

A one-liner that might require more one-liners later on?

@pypingou
Copy link
Member

A one-liner that might require more one-liners later on?

Yup, and that's true for any piece of code, this one as much as the next one :)

@pypingou
Copy link
Member

@cydrobolt, first of all thanks for finding the bug, stepping up to fix it and provide this pull-request.

I'm going to close it as imho the cons out-weight the pros (listed above).

I'm sorry it took so long and I sincerely hope it doesn't discourage you, your code is working
and does what you wanted to do.

That being said, the issue of the trailing slash still needs addressing, since you're the one who discovered
it, would you like to provide the one-liner patch?

@pypingou pypingou closed this Jan 30, 2016
@cydrobolt
Copy link
Contributor Author

@pypingou sure, I'll make another PR later

@pypingou pypingou deleted the bugfix/fix_github_upstream branch June 15, 2016 08:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants