Fix GitHub backend upstream detection #248

cydrobolt · 2016-01-09T00:05:54Z

Bug:

GitHub upstream detection was imperfect (e.g URL needed to be in https://)
Broken fetch URL generation resulting from bug in code (see https://github.com/fedora-infra/anitya/blob/master/anitya/lib/backends/github.py#L45 -- returns only last letter from string, doesn't return string except last letter)

Patch:

Remove some redundant code
Use regex rather than fragile startswith and endswith detection mechanisms
Fix broken fetch URL generation

…reating releases URL

pypingou · 2016-01-09T07:08:58Z

There is a Merge commit in here, could you see to remove it?
Code wise it looks good, but I would be curious if we could add some unit-tests showing what didn't work before and works now so that we're sure to not break it again the future (at least these parts)

Thanks

cydrobolt · 2016-01-09T23:05:50Z

@pypingou rebased and removed the merge commit.
I improved the tests a little. Previously, we weren't testing certain cases (e.g if there is a trailing slash).

The new regex should correctly match the GH URL even when it is not at the root of the repo.

pypingou · 2016-01-10T13:43:59Z

tests/test_backend_github.py

        project = model.Project(
            name='pkgdb2',
-            homepage='https://github.com/fedora-infra/pkgdb2',
+            homepage='https://github.com/fedora-infra/pkgdb2/issues',


I am wondering if we want to support this, as it is not a valid home page and if we start to be smart and cover corner cases such as this one, we may end up with a large bowl of spaghetti hard to maintain in the long term.

pypingou · 2016-01-10T14:14:42Z

Ok thinking a little bit about this, I'm having a little internal debate.

Basically, we're here fixing one issue: we require urls to use https by allowing people to put any URL as home page.

This mean they can practically do anything, we'll be smart enough to figure out the correct URL, even if they provide a homepage link to something that's not the homepage.

Anitya deals with projects naming and versions, both of which are a mess, in general. The road we took was basically: Garbage in -> Garbage out. If the information you provide are incorrect (and we try to be clear about what we want), it won't work (KISS).
We can fix the GitHub backend for this (you just did), but then for one backend it behaves differently from all the other ones.

We have two options from there:

we're not smart, everywhere (easy)
we fix all the other backends to be smart as well (much harder, if even possible for some)

So we could report early that non-https url for github are potentially invalid, or just adjust our checks that it splits by :// or uses urllib/urlparse to split so that we can safely check the domain of the URL and ensure it's indeed github.
It's not the approach you took there, but I wonder if introducing this level of flexibility on one backend won't engage us to adjust all the other ones as well.

What do you think?

cydrobolt · 2016-01-10T20:15:12Z

I don't think it's especially negative if one of our backends is more flexible in this aspect. Browsing through Anitya, many projects simply did not work because small variations in the URL broke the entire detection mechanism. e.g see firefox

I think proper regex could even be used here, although using something like urlparse could also be an option. If we keep it simple like we have it now, but have mediocre results, I'm not sure we're accomplishing our goal if we aren't flexible enough to handle most of the differences in what users provide.

In regards to having a non-repo-home link as the homepage, I have seen users set their homepage to /issues or some other /wiki path.

Since we're trying to get rid of our technical debt, I think it would be a good goal to introduce smarter backends to increase the accuracy of our information.

puiterwijk · 2016-01-10T20:17:57Z

Would it instead be an idea to just detect wrong URLs for certain projects and just throw an error with those?
When we have checks like that, we can also run them over the current dataset, and ask the submitters of the current wrong urls to fix them or fix them ourselves.

Regardless of what people set, I'd say we should also mark any /issues or /tags homepage as invalid.

Doing such validity checks should be a lot easier than normalizing all the data.

cydrobolt · 2016-01-10T20:24:55Z

@puiterwijk perhaps. However, how are we defining "valid" homepage? You could argue that it is plenty valid to have the homepage be /issues, the same way projects have Trac as their homepage.

Nevertheless, I'm not sure how proper it would be to revert to the old code that we have in place for the GitHub backend and some other backends. It used a system of .endwiths and startswiths that were difficult to understand, maintain, and often returned incorrect results.

pypingou · 2016-01-10T20:57:00Z

I am curious, could you point to some examples where people used /wiki or /issues or where the current code was returning incorrect results?

cydrobolt · 2016-01-10T20:59:48Z

I can't point you to any specific examples, but the current code would not handle a /wiki or /issues path well.

The current code would also break if you use a trailing slash.

pypingou · 2016-01-10T21:07:05Z

The current code would also break if you use a trailing slash.

>>> import requests
>>> requests.get('https://github.com/fedora-infra/fedocal//tags')
<Response [200]>

Seems to work here, but granted, I didn't test in anitya itself :)

pypingou · 2016-01-10T21:08:27Z

I just tried on release-monitoring.org for guake where I specify that the version_url is guake/guake/ seems to behave fine

pypingou · 2016-01-10T21:09:54Z

Anyway, I kinda like the change, but I also see cons, so I want to take a little more time to think about it

cydrobolt · 2016-01-10T21:10:14Z

No, not the GH owner/project. The GH link in the homepage.

Try removing the owner/project, then put in https://github.com/guake/guake/ as the homepage

pypingou · 2016-01-10T21:12:57Z

Oh I see, it's a bug in https://github.com/fedora-infra/anitya/pull/248/files#diff-c3055793905b5591f47858e267f3b7ecL69 where [:1] should be [:-1]

cydrobolt · 2016-01-10T21:23:02Z

Yeah. I mentioned it in the ticket.

Instead of adjusting that element (which I didn't think was easy to maintain because the code structure differed so much between URL types), I rewrote it using regex which seemed more fault tolerant.

cydrobolt · 2016-01-29T21:22:35Z

@pypingou any updates?

pypingou · 2016-01-29T22:15:29Z

I'm still very un-decided, for few reasons:

I don't really think regex makes the code easier to read
It provides a precedent and eventually will bring expectations for the other backends that we won't be able to match
It feels like a big circumvent for a small bug

On the plus side:

It brings flexibility

But then again, it's not something I've ever seen being asked, and the owner/project fields allows to have a different homepage anyway. So I see pros and cons

ralphbean · 2016-01-30T16:05:34Z

I'm still very un-decided, for few reasons:

If we can, we should decide one way or the other.. just to get this out of limbo eventually.

I don't really think regex makes the code easier to read

FWIW, I find personally find regexes somewhat hard to read as well -- but they are a kind of lingua franca for programmatic string parsing. I don't think they'd be a detriment to the long term readability of the code. A programmer can always just look up how they work from one of the many references online for a refresher (which I often have to).

It provides a precedent and eventually will bring expectations for the other backends that we won't be able to match

This is probably the strongest objection, and I basically agree. @puiterwijk's suggestion about doing validation and rejecting different urls instead of normalization as is done here is a good alternative.

It feels like a big circumvent for a small bug

The circumvention doesn't seem that large to me. The diff is pretty contained.

pypingou · 2016-01-30T16:07:14Z

It feels like a big circumvent for a small bug

The circumvention doesn't seem that large to me. The diff is pretty contained.

Compare to a one-liner? :)

cydrobolt · 2016-01-30T16:08:18Z

Compare to a one-liner? :)

A one-liner that might require more one-liners later on?

pypingou · 2016-01-30T16:15:26Z

A one-liner that might require more one-liners later on?

Yup, and that's true for any piece of code, this one as much as the next one :)

pypingou · 2016-01-30T16:59:33Z

@cydrobolt, first of all thanks for finding the bug, stepping up to fix it and provide this pull-request.

I'm going to close it as imho the cons out-weight the pros (listed above).

I'm sorry it took so long and I sincerely hope it doesn't discourage you, your code is working
and does what you wanted to do.

That being said, the issue of the trailing slash still needs addressing, since you're the one who discovered
it, would you like to provide the one-liner patch?

cydrobolt · 2016-01-30T17:26:45Z

@pypingou sure, I'll make another PR later

Fix GitHub upstream: fix issues with detecting URL, fix issues with c…

23a1513

…reating releases URL

Import re, fix scoping issues

dcbd0d1

cydrobolt force-pushed the bugfix/fix_github_upstream branch from 4e56cef to dcbd0d1 Compare January 9, 2016 22:31

cydrobolt added 2 commits January 9, 2016 17:56

Test different cases

9905486

Indicate desired behavior

b6b0f62

pypingou reviewed Jan 10, 2016
View reviewed changes

pypingou closed this Jan 30, 2016

pypingou deleted the bugfix/fix_github_upstream branch June 15, 2016 08:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix GitHub backend upstream detection #248

Fix GitHub backend upstream detection #248

cydrobolt commented Jan 9, 2016

pypingou commented Jan 9, 2016

cydrobolt commented Jan 9, 2016

pypingou Jan 10, 2016

pypingou commented Jan 10, 2016

cydrobolt commented Jan 10, 2016

puiterwijk commented Jan 10, 2016

cydrobolt commented Jan 10, 2016

pypingou commented Jan 10, 2016

cydrobolt commented Jan 10, 2016

pypingou commented Jan 10, 2016

pypingou commented Jan 10, 2016

pypingou commented Jan 10, 2016

cydrobolt commented Jan 10, 2016

pypingou commented Jan 10, 2016

cydrobolt commented Jan 10, 2016

cydrobolt commented Jan 29, 2016

pypingou commented Jan 29, 2016

ralphbean commented Jan 30, 2016

pypingou commented Jan 30, 2016

cydrobolt commented Jan 30, 2016

pypingou commented Jan 30, 2016

pypingou commented Jan 30, 2016

cydrobolt commented Jan 30, 2016

Fix GitHub backend upstream detection #248

Fix GitHub backend upstream detection #248

Conversation

cydrobolt commented Jan 9, 2016

pypingou commented Jan 9, 2016

cydrobolt commented Jan 9, 2016

pypingou Jan 10, 2016

Choose a reason for hiding this comment

pypingou commented Jan 10, 2016

cydrobolt commented Jan 10, 2016

puiterwijk commented Jan 10, 2016

cydrobolt commented Jan 10, 2016

pypingou commented Jan 10, 2016

cydrobolt commented Jan 10, 2016

pypingou commented Jan 10, 2016

pypingou commented Jan 10, 2016

pypingou commented Jan 10, 2016

cydrobolt commented Jan 10, 2016

pypingou commented Jan 10, 2016

cydrobolt commented Jan 10, 2016

cydrobolt commented Jan 29, 2016

pypingou commented Jan 29, 2016

ralphbean commented Jan 30, 2016

pypingou commented Jan 30, 2016

cydrobolt commented Jan 30, 2016

pypingou commented Jan 30, 2016

pypingou commented Jan 30, 2016

cydrobolt commented Jan 30, 2016