HTTrack
HTTrack is a free (GPL) and easy-to-use offline browser utility.
Basically, it allows you to download the contents of an internet site to a local directory. It recursively builds a complete set of directories, getting HTML, images, and other files from the server and stashing them on your computer. These are static HTML copies of the original site, even if it was built using some database-centered, dynamic page tool.
I find it great for archiving copies of my sites before making major changes, or shutting them down.
Using HTTrack
There are versions of HTTrack for multiple OS environments. The one I use is for a standard Linux system. I have configured it to run from a script as a CRON task. The script reads a series of files that list small collections of web sites. It only processes one site at a time, to prevent overloading remote sites that are on shared servers. It stashes each collection in a designated directory on my local server for local backup and browsing.
One of the nice features of the -%L option is that it automatically builds an index of the site collections in the target folder.
httrack -%U apache -F "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" -%L LinkList-01 -O /home/Mirror/Mirror-01 --update
The file list (LinkList-01) is a simple list of targeted sites, one per line. I found that WordPress sites seem to prefer a full URL such as “http://sob.boatswain.us/”, while my MediaWiki sites won’t work with that and need to be listed without the scheme and trailing slash, simply as “sysadm.equoria.com”.
The user agent (-F) is explained in the next section.
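The wrapper script itself is not shown here, so the following is only a minimal sketch of how such a cron-driven loop might look. The list directory (/home/Mirror/lists), the LinkList-NN to Mirror-NN naming rule, the function names, and the DRY_RUN switch are all assumptions for illustration; only the httrack command line is taken from the example above.

```shell
#!/bin/sh
# Sketch of a cron-driven wrapper. Paths and naming below are assumptions.

LIST_DIR="${LIST_DIR:-/home/Mirror/lists}"   # hypothetical home of the list files
MIRROR_ROOT="${MIRROR_ROOT:-/home/Mirror}"
USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"

# Map a list file such as LinkList-01 to its target directory Mirror-01.
target_dir() {
    name="$(basename "$1")"
    printf '%s/Mirror-%s\n' "$MIRROR_ROOT" "${name#LinkList-}"
}

# Run httrack for one collection. With DRY_RUN=1, only print the command.
mirror_collection() {
    list_file="$1"
    out_dir="$(target_dir "$list_file")"
    set -- httrack -%U apache -F "$USER_AGENT" -%L "$list_file" -O "$out_dir" --update
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Work through the collections one after another, never in parallel.
mirror_all() {
    for collection in "$LIST_DIR"/LinkList-*; do
        [ -f "$collection" ] || continue
        mirror_collection "$collection"
    done
}
```

To use it, a script would end with a call to mirror_all and be scheduled from cron. Setting DRY_RUN=1 first is a cheap way to check which commands would run before letting it loose on real sites.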
User agent 403 rejections
Many sites appear to have a problem with HTTrack's default User Agent identification.
Like a good boy, HTTrack identifies itself when it connects, and immediately gets rejected with a 403 Forbidden response.
Using wget as a testing tool, you can see that it is the HTTrack User Agent that triggers the forbidden message.
[root@neptune temp]# ls -l
total 0
[root@neptune temp]#
[root@neptune temp]# wget -U "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" www.garg.com 2>&1 | egrep HTTP
HTTP request sent, awaiting response... 403 Forbidden
[root@neptune temp]# ls -l
total 0
[root@neptune temp]#
[root@neptune temp]# wget www.garg.com 2>&1
[root@neptune temp]# ls -l
total 4
-rw-r--r--. 1 root root 3288 Mar 24 09:55 index.html
[root@neptune temp]#
This is handled by the security software on the server. The problem is that they simply do not have HTTrack registered in their database of approved agents.
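The wget comparison can also be scripted, so that several user agent strings are checked in one pass. This is a rough sketch rather than a polished tool: the function names are made up for illustration, and it assumes GNU wget, whose -S (--server-response) option prints the response headers so the status code can be picked out with awk.

```shell
#!/bin/sh
# Sketch: compare the HTTP status a site returns for different User-Agent strings.
# Function names are hypothetical; GNU wget is assumed.

# Read `wget -S` output on stdin and print the last HTTP status code seen
# (the last one, so that redirects report their final result).
last_status() {
    awk '/^ *HTTP\// { code = $2 } END { print code }'
}

# Fetch a URL with the given user agent, discard the body, print the status.
status_for_agent() {
    wget -S -O /dev/null --tries=1 --timeout=15 -U "$1" "$2" 2>&1 | last_status
}

# Print one line per agent: the status code, then the agent string.
compare_agents() {
    url="$1"
    shift
    for agent in "$@"; do
        printf '%s  %s\n' "$(status_for_agent "$agent" "$url")" "$agent"
    done
}
```

For example, compare_agents www.garg.com "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" would show side by side whether only the HTTrack string is being refused.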
Use the -F option in httrack to change the user agent message.
F user-agent field (-F "user-agent name") (--user-agent)
In the example above, I used:
-F "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"
The user agent text “Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1” came from the Firefox pages on User Agent String.Com.
For additional information, open the HTTrack Users Guide and scroll down to the section on Browser Options.