Crawling the Old Site

I used the Linux 'wget' program to crawl the existing servers. Yes, servers plural, because EAA ran one server for www.iac.org and another for members.iac.org. The Linux commands went something like this:

cd /var/www; mkdir old-site; cd old-site; mkdir public; mkdir members
cd public
wget -r -D www.iac.org http://www.iac.org   # -r = recursive crawl, -D = restrict to the listed domain
cd ../members
wget -r -D members.iac.org http://members.iac.org/home.html
# Needed to specify home.html because index.html is the login/password page, with no links to the content below
wget http://members.iac.org  # Now pick up the login page too
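
As an aside, wget also has a set of mirroring options that produce a copy which browses cleanly offline, with internal links rewritten and each page's images and stylesheets pulled in alongside it. A sketch of that variant (not the commands used for this crawl):

wget --mirror --convert-links --page-requisites -D www.iac.org http://www.iac.org
# --mirror = recursive crawl with timestamping; --convert-links rewrites internal links for local browsing;
# --page-requisites also fetches the images and stylesheets each page needs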

The crawl grabbed the great bulk of both sites, but wget still missed a few pages: the Stellent CMS used by EAA's IT department buried the links to those pages in JavaScript statements (which wget does not parse) rather than in plain HTML. Those pages were fetched with individual wget commands.
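
One way to dig those JavaScript-buried links out of the pages that did download, then hand them back to wget, is sketched below. This shows the general technique rather than the exact commands used; it assumes the crawl above left its copy under /var/www/old-site/public/www.iac.org and that the buried links are absolute URLs.

cd /var/www/old-site/public
# Scrape anything that looks like an absolute URL out of the saved HTML/JS,
# keep only iac.org addresses, and drop duplicates
grep -rhoE "https?://[^\"' )]+" www.iac.org | grep 'iac\.org' | sort -u > js-links.txt
# Review js-links.txt by hand, then fetch the missing pages,
# preserving the server's directory layout (-x)
wget -x -i js-links.txt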