Use defined timeout also for HTTP/FTP connections. #53

Merged
merged 1 commit into fedora-infra:master from timeout on Apr 27, 2015

Conversation

adrianreber
Member

The timeout check was only used between the HTTP/FTP connections.
It could still happen that the crawler was waiting for a very long
time on a single open connection. Specify the timeout now also for
each opened HTTP/FTP connection to make sure the crawler times out
after the specified time.
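In code terms the change amounts to passing the already-configured timeout when each connection is created. A minimal sketch of the idea (in modern Python, with made-up host names and a made-up timeout value; the actual patch touches the crawler's own connection setup):

```python
import ftplib
import http.client

timeout = 120 * 60  # hypothetical: reuse the crawler's per-host timeout (seconds)

# Without a timeout argument these calls can block indefinitely on a
# stuck peer; with it, the connect and every subsequent read are bounded.
http_conn = http.client.HTTPConnection("mirror.example.org", timeout=timeout)
ftp_conn = ftplib.FTP("ftp.example.org", timeout=timeout)
```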

@nirik
Member

nirik commented Apr 25, 2015

Looks good to me, but do we really want the timeout that long?
Or due to keepalive we might be pulling for that long over a single connection?

@adrianreber
Member Author

About the timeout: this is really complex, as we have different timeouts in different parts of the crawler. Or rather, the timeouts have different meanings. The default configuration of the crawler has a timeout of 120 minutes per host. Whether the timeout has been reached is tested at different points in the code. The enforcement of the timeout, however, also depends on the protocol used for crawling.

rsync-crawled hosts are basically free of the timeout. We have specified a timeout of 4 hours for the rsync command. The rsync timeout, however, has a different meaning: if for more than 4 hours no network I/O happens, the process is aborted. What we do not have is an overall timeout for the rsync run. If the crawling takes 6 hours via rsync, we currently have to wait until it finishes, as long as there is continuous network I/O. Once the crawling via rsync has finished there is no further timeout check. If the host takes longer than 120 minutes (mentioned at the beginning) in four consecutive crawls, it will be disabled. So we kind of have an indirect way of excluding slow rsync mirrors.
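The difference between the two kinds of timeout can be shown with a small sketch (hypothetical, not the crawler's code; the mirror URL and values are made up). rsync's `--timeout` is an I/O-inactivity timeout, while an overall wall-clock limit has to be imposed by the caller:

```python
import subprocess

# rsync --timeout=14400 aborts only after 4 hours of *no* network I/O;
# a transfer that keeps trickling data can run much longer than that.
cmd = [
    "rsync", "--no-motd", "-r", "--timeout=14400",
    "rsync://mirror.example.org/fedora/",  # made-up mirror
    "/tmp/scan/",
]

try:
    # Hypothetical wall-clock cap: abort the whole run after 2 hours,
    # no matter whether data is still flowing.
    subprocess.run(cmd, timeout=2 * 60 * 60, check=True)
except subprocess.TimeoutExpired:
    print("rsync exceeded the overall crawl budget")
```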

If rsync is not available, FTP or HTTP is used. FTP crawls whole directories in one go and HTTP crawls each file separately. After each crawled entity (directory or file) the timeout is checked, and the crawl is aborted once it reaches two hours.
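Schematically, the per-entity check works like this (a hypothetical sketch with stub names, not the actual crawler code):

```python
import time

CRAWL_TIMEOUT = 120 * 60  # the 120-minute per-host budget, in seconds
entities = ["releases/", "updates/"]  # hypothetical: dirs (FTP) or files (HTTP)

def fetch(entity):
    # Stand-in for one FTP directory listing or HTTP HEAD/GET.
    # Before this patch, this single call could block indefinitely.
    time.sleep(0.1)

start = time.monotonic()
for entity in entities:
    fetch(entity)
    # The budget is checked only *between* entities, so one blocked
    # fetch() can overshoot it by an arbitrary amount.
    if time.monotonic() - start > CRAWL_TIMEOUT:
        print("crawl aborted: per-host timeout reached")
        break
```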

Now to the point this patch is trying to address. I have seen hosts which hang for longer than 2 hours in a single HTTP GET/HEAD (for whatever reason). During this blocking network I/O the crawler does not check whether the timeout has been reached. Therefore I wanted to limit the time of an HTTP/FTP connection to the crawler's timeout.

As you have mentioned, there is also keepalive involved. So either we open a new HTTP connection for each HTTP HEAD/GET, or keepalive is supported. Without keepalive the behaviour is again different.

In addition, the timeout is per host. It would probably make more sense to have a timeout per category, as a mirror which has archive and secondary mirrored will take much more time than a host which only has EPEL.

So the timeout handling is difficult and complicated, and providing the same timeout behaviour for all parts of the code sounds like a lot of work. It would be great if it could be properly changed/fixed/resolved. This simple patch is more of a band-aid to help detect very slow network connections and not have to wait forever until the OOM killer finishes the crawl. I am also not 100% sure what effect the timeout value actually has (especially in combination with or without keepalive).

Rather long comment for such a 'simple' patch.

@nirik
Member

nirik commented Apr 26, 2015

Thanks for the long and detailed reply. ;) Makes sense, so I am +1 to this band aid for now, and perhaps down the road we can rework the timeouts so they are all configurable/make more sense. :)

@pypingou
Member

RFE: Could we document this somewhere under the doc/ folder? :)

@adrianreber
Member Author

I will write something similar and store it in the doc/ folder.

adrianreber added a commit that referenced this pull request Apr 27, 2015
Use defined timeout also for HTTP/FTP connections.
adrianreber merged commit 2d26b03 into fedora-infra:master on Apr 27, 2015
adrianreber deleted the timeout branch on April 27, 2015 12:21