
Tuning stealthier archives
Running a stealthier mirror is certainly worth learning, but it comes with trade-offs. Throttling and thread limits help keep us off the target's radar, but they also mean a mirror operation will take that much longer. We can narrow what we are looking for with filters and directory restrictions, something both social engineering and our own experience can inform. Are there efficiencies to be gained elsewhere?
HTTrack incurs considerable overhead as it processes queries, and a big factor in the time and compute power consumed is the parsing of Multipurpose Internet Mail Extensions (MIME) types on sites. Many of HTTrack's features can be customized through the .httrackrc file, but the most useful tweak I picked up was the addition of the following line, which cut my mirror time from hours to minutes. It speeds up transfers by telling HTTrack to assume that ASP, PHP3, CGI, DAT, and MPG files correspond to a standard MIME type and can be resolved to their usual end state, avoiding all of that churn. Simply add this line in its entirety (after modifying it to your liking) to the .httrackrc file using your favorite editor and save the file:
set assume asp=text/html,php3=text/html,cgi=image/gif,dat=application/x-zip,mpg=application/x-mp3
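If you would rather append it from the shell, a one-liner like the following works; this assumes your per-user configuration lives at ~/.httrackrc, which is where HTTrack looks by default:
echo 'set assume asp=text/html,php3=text/html,cgi=image/gif,dat=application/x-zip,mpg=application/x-mp3' >> ~/.httrackrc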
Once I've made this modification to my defaults, the stealthier command string I find useful does a number of non-default things. Keep in mind that once you've honed your own switches, you can save them in the .httrackrc file as well (see the sketch after the option list). Here is the command I used:
httrack www.hackthissite.org -O /tmp/hackthissite -c4 -%c10 -A20000000 -m50000000,1000000 -M100000000 -r25 -H2 -www.hackthissite.org/forums/* -www.hackthissite.org/user/*
Here is a brief description of the options I used. N denotes a variable:
- -cN: This limits the number of simultaneous connections (downloads) to avoid overwhelming our target
- -%cN: This caps how many new connections are opened per second, further smoothing our traffic profile
- -AN: This throttles the transfer rate (in bytes per second) to stay under the radar
- -mN,N': This caps the size of any single non-HTML (N) and HTML (N') file, respectively, to ensure we do not overrun our disk
- -MN: This sets the overall mirror size limit, again keeping the disk usage manageable
- -rN: This puts a safety on the recursion depth so we don't get caught in a loop
- -H2: This abandons the host if slow traffic is detected
- -<path>/*: This excludes a path (in this case, I didn't want to pull down their extensive forums or user profiles)
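As for saving your honed switches, the .httrackrc file takes one set directive per line using HTTrack's long option names. The following is a minimal sketch of defaults matching the command above; I am assuming the man page's mapping of short to long names (-c to sockets, -A to max-rate, -r to depth, -H to host-control), so verify against your installed version:
set sockets 4
set max-rate 20000000
set depth 25
set host-control 2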
You can also use the GUI version, WebHTTrack, which is handy for learning how to tweak your favorite filters and options:

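On Debian-derived distributions such as Kali, the GUI is provided by the webhttrack package; launching it from a terminal opens the step-by-step wizard in your default browser (the package name and behavior here are my assumption, based on Debian's packaging):
webhttrack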
This command against the https://www.hackthissite.org/ website took a little over an hour to pull down 200 MB of information. I omitted the forums and users in this example, but I could just as easily have omitted other directories to focus on a specific area of the site or to avoid violating the boundaries established by my customer. If you press Enter while the command is running, you'll get a rundown of how long it has run, the number of links found, and the total download size to that point in time:

When the archive is complete, you can surf it as if it were the live site by visiting the local path in your browser: file:///tmp/hackthissite/www.hackthissite.org/index.html.
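If your browser restricts file:// browsing, a quick alternative is to serve the mirror over localhost; this sketch assumes Python 3 is available and uses its built-in http.server module:
cd /tmp/hackthissite
python3 -m http.server 8080
Then point your browser at http://localhost:8080/www.hackthissite.org/index.html.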
You'll see in the following screenshot that we've successfully captured the dynamic advertising banner content (http://infosecaddicts.com/), which may be worth filtering as well:

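To drop that banner on a future run, a wildcard exclusion filter works just like the forum and user exclusions earlier; this is a sketch, and the pattern assumes the ads are served from the infosecaddicts.com domain seen above:
httrack www.hackthissite.org -O /tmp/hackthissite -c4 -%c10 -A20000000 -m50000000,1000000 -M100000000 -r25 -H2 -www.hackthissite.org/forums/* -www.hackthissite.org/user/* -*infosecaddicts.com/*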