LinkChecker

Documentation

Basic usage

To check a URL like http://www.example.org/, it is enough to run linkchecker www.example.org/ on the command line, or to enter www.example.org in the GUI application. This checks the complete domain of http://www.example.org recursively. All links pointing outside of the domain are also checked for validity.
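
For example, both of the following invocations check the same site, since the http:// scheme is assumed when it is missing:

    linkchecker www.example.org/
    linkchecker http://www.example.org/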

Performed checks

All URLs have to pass a preliminary syntax test. Minor quoting mistakes issue a warning; all other invalid syntax issues are errors. After the syntax check passes, the URL is queued for connection checking. All connection check types are described below.
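
As a rough illustration of such a preliminary syntax test (a minimal sketch, not LinkChecker's actual implementation), a check could parse the URL and distinguish warnings from errors like this:

    from urllib.parse import urlsplit

    def syntax_check(url):
        # Return "ok", "warning" or "error" for a URL string.
        try:
            parts = urlsplit(url)
        except ValueError:
            return "error"      # unparseable: invalid syntax
        if not parts.scheme:
            return "error"      # a checkable URL needs a scheme
        if " " in url:
            return "warning"    # minor quoting mistake: warn only
        return "ok"             # passed; queue for connection checking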

Recursion

Before LinkChecker descends recursively into a URL, the URL has to fulfill several conditions, which are checked in this order (a sketch combining them follows the list):

  1. A URL must be valid.

  2. A URL must be parseable. This currently includes HTML files, Opera bookmark files, and directories. If a file type cannot be determined (for example, it does not have a common HTML file extension and the content does not look like HTML), it is assumed to be non-parseable.

  3. The URL content must be retrievable. This is usually the case, with exceptions such as mailto: or unknown URL types.

  4. The maximum recursion level must not be exceeded. It is configured with the --recursion-level command line option or the recursion level GUI option, and is unlimited by default.

  5. It must not match the ignored URL list. This is controlled with the --ignore-url command line option (see the example after this list).

  6. The Robots Exclusion Protocol must allow links in the URL to be followed recursively. This is checked by searching for a "nofollow" directive in the HTML header data.
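
Conditions 4 and 5 are set on the command line. For example, the following call limits recursion to one level and skips mailto: links (the regular expression is only illustrative):

    linkchecker --recursion-level=1 --ignore-url="^mailto:" www.example.org/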
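
The following minimal Python sketch combines the six conditions in the order above; the helper predicates are placeholders for LinkChecker's real checks, not its actual code:

    import re

    # Placeholder predicates; LinkChecker's real checks are more involved.
    def is_valid(url): return url.startswith(("http://", "https://"))
    def is_parseable(url): return True      # assume HTML content
    def is_retrievable(url): return not url.startswith("mailto:")
    def robots_nofollow(url): return False  # assume robots allow following

    def should_recurse(url, level, max_level, ignore_patterns):
        # max_level < 0 means unlimited recursion (the default).
        if not is_valid(url):                   # 1. valid URL
            return False
        if not is_parseable(url):               # 2. parseable content
            return False
        if not is_retrievable(url):             # 3. retrievable content
            return False
        if 0 <= max_level <= level:             # 4. recursion level
            return False
        if any(re.search(p, url) for p in ignore_patterns):
            return False                        # 5. ignored URL list
        if robots_nofollow(url):                # 6. "nofollow" directive
            return False
        return True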

Note that the directory recursion reads all files in that directory, not just a subset like index.htm*.