
How to download a copy of a website using Wget

There are times when you need to download a copy of a website for offline reading, or simply for archival purposes. Mirroring an entire website is not a trivial task: every page has to be downloaded, and the link structure has to be preserved so that the copy remains navigable offline. GNU Wget is a command-line program with built-in facilities for mirroring websites. It automatically follows links in HTML and CSS files, and copies JavaScript files and images, to recreate a local version of the site.

If you do not have Wget installed, you can install it on Debian or Ubuntu by opening a terminal and running sudo apt-get install wget. On CentOS/RHEL 5/6/7, it can be installed by running sudo yum install wget.
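
To confirm that the installation succeeded, you can print the installed version:

wget --version

Once Wget is installed, a website can be copied using the following command: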

wget --mirror \
     --convert-links \
     --adjust-extension \
     --page-requisites \
     --no-parent \
     --wait=0.1 \
     --random-wait \
     https://www.example.com/pages/
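
A quick reference for what each of these options does, based on the standard GNU Wget option descriptions:

# --mirror            turn on recursion and time-stamping with unlimited depth
# --convert-links     rewrite links in the downloaded files so they work offline
# --adjust-extension  append .html (or .css) where the server omits file extensions
# --page-requisites   also fetch the images, stylesheets and scripts each page needs
# --no-parent         never ascend above the starting directory
# --wait=0.1          wait (in seconds) between successive requests
# --random-wait       vary the wait randomly so the request pattern is less uniform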

Of course, replace the example URL with the address of the website you wish to mirror. Running the command creates a new directory in your current working directory, named after the site's hostname (www.example.com in this case); it will contain the copied website once the mirroring is complete.
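
The copy can then be opened straight from the filesystem. With the example URL above, the entry point would look something like the following (index.html is an assumption; the exact filename depends on how the site names its pages, and xdg-open simply hands the file to your default browser on a Linux desktop):

xdg-open www.example.com/pages/index.html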

Offline copies of websites containing software documentation are especially useful, because they eliminate the need to go through a search engine to find the documentation's homepage. One disadvantage of offline documentation is the lack of search functionality. You can work around that with a desktop search tool such as Recoll, which can index and search HTML files, or with a plain grep over the mirror, as sketched below. So far, I have only been using Wget to mirror documentation, but I suppose it could be just as useful for copying news sites and blogs.
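
If a desktop indexer is not set up, a recursive grep is often enough for quick lookups (the search term here is just a placeholder):

grep -ril "connection timeout" www.example.com/

Here -r searches recursively, -i ignores case, and -l prints only the names of the matching files, which you can then open in a browser.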