Web Mirroring

In many situations, you can pull down the contents of a website recursively using the wget utility.

To do a straightforward mirror of a site:

  • wget -m http://some.site.com

long version:

  • wget --mirror http://some.site.com

This is actually the same as specifying:

  • wget -r -l inf -nr -N http://some.site.com

long version:

  • wget --recursive --level=inf --dont-remove-listing --timestamping http://some.site.com

To convert links for off-line reading and give all HTML files an .html extension (this example also limits recursion to one level):

  • wget --recursive --level=1 --dont-remove-listing --timestamping --convert-links --html-extension http://some.site.com

Other common options or variants

-l depth
--level=depth
  Specifies the maximum depth level for recursion.

-L
--relative
  Follows relative links only.
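
As a rough sketch of how these two options combine, the helper below builds a depth-limited, relative-links-only fetch. The function name mirror_shallow and the DRY_RUN variable are made up for illustration and are not part of wget. By default it only prints the command, so it can be checked before anything is downloaded.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): print or run a shallow fetch.
# Usage: mirror_shallow URL [DEPTH]
mirror_shallow() {
    url=$1
    depth=${2:-2}    # default depth of 2 is an arbitrary choice
    # Build the command line from the options described above.
    set -- wget --recursive --level="$depth" --relative --timestamping "$url"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"    # dry run: show the command only
    else
        "$@"         # really run wget
    fi
}

mirror_shallow http://some.site.com 3
```

Setting DRY_RUN=0 before the call runs the command instead of printing it.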

man wget will tell you more.

Proxy servers

Set the environment variable http_proxy.

E.g. in bash shell…

  • export http_proxy=http://proxy.mysite.com:8888
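
The difference between exporting the variable and setting it for one command only can be sketched as follows. To keep the example runnable without network access, a child shell that echoes the variable stands in for wget; the proxy address is the same placeholder as above.

```shell
#!/bin/sh
# Placeholder proxy address, as used in the text above.
proxy=http://proxy.mysite.com:8888

# Exported: every command started from this shell inherits the setting.
export http_proxy="$proxy"
sh -c 'echo "child sees: $http_proxy"'

# Remove it again, then set it for a single command only.
unset http_proxy
http_proxy="$proxy" sh -c 'echo "one-off: $http_proxy"'

# The one-off assignment does not persist in the current shell.
echo "afterwards: ${http_proxy:-unset}"
```

In real use, the `sh -c ...` stand-in would be replaced by the wget command line, for example `http_proxy=http://proxy.mysite.com:8888 wget --mirror http://some.site.com`.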

If your proxy server requires a username and password, see the manual. The relevant options are --proxy-user and --proxy-passwd (the current manual documents the latter as --proxy-password).

Example

  • wget --verbose --timeout=30 --mirror --proxy-user=myuser --proxy-passwd=mypassword http://some.web.site

More info at http://www.gnu.org/manual/wget/

FTP

  • wget --mirror ftp://myusername:mypassword@ftp.targetsite.com/target/path

– Frank Dean 6 Dec 2002