Save a website
One of the benefits of websites (including blogs, wikis, and other web applications) is that every time you view them you get the most up-to-date version of the content. Sometimes, however, you may need to save the entire content of a website. You may be traveling to a place with no internet connectivity and need to share a resource you made online, or you may wish to archive a snapshot of a site for historical comparison.
The process for saving a website differs based on how you want to use the result. The two main options are:
- Save as an interactive set of HTML files
- Save as a single linear PDF file
If the website is simply a series of articles that you wish to be able to read through offline, the single PDF file can be a little more user-friendly. The downside of the PDF file is that internal links between pages may not work and some formatting will be lost.
Exporting the site as a directory (folder) full of HTML files will provide a result that can be viewed in a web browser while offline. All of the formatting and internal links will work, but the result is a directory of files rather than a single file and will therefore usually need to be zipped up in order to share it with others.
In both cases, watch the tool's progress while it runs: the web crawler that locates pages can easily download too much (such as the entire Middlebury site) when all you want is a subsection.
Export as HTML files
As mentioned above, exporting a site as HTML files will provide you with a result that can be viewed in a web browser while offline. Rather than being loaded over the internet, the HTML files are loaded from the local hard drive, and your browser renders them the same way it would online.
Your web browser can save a single page as HTML. The saved page can include either just the text or all of the resources (images, style sheets, etc.) that are shown on the page. The main limitation is that any links in the saved page will send your browser back to the internet, so this method isn't useful for offline viewing of a whole site.
Firefox
- Go to File » Save Page As...
- Choose Web Page, Complete if you wish to save all of the images and style sheets.
Safari
- Go to File » Save As...
- Choose Web Archive if you wish to save all of the images and style sheets.
Internet Explorer
- Go to File » Save As...
- Choose Webpage, complete (*.htm;*.html) if you wish to save all of the images and style sheets.
Multiple Pages / Whole sites
These tools will allow you to save a whole site (or a portion of a site) as an interactive set of HTML files. They convert all of the internal links so that as you browse the site you go from one page to another without going over the internet.
wget on Mac/Linux
wget is a command-line tool for fetching webpages and other web content that runs on OS X, Linux, and other UNIX-like operating systems. On Linux systems it is available via the built-in package-management tools: Yum (Red Hat), Zypper (SUSE), or Apt (Ubuntu/Debian). On Mac OS X, it can be installed via the MacPorts package manager: install MacPorts, then run sudo port install wget in the Terminal.
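Before starting, it can help to confirm that wget is actually on your path. The following is a minimal sketch, not part of the original instructions; the function name check_wget is made up for illustration, and the install commands it prints assume the package is simply called wget on each platform.

```shell
# Report the installed wget version, or suggest an install command if it is missing.
check_wget() {
    if command -v wget >/dev/null 2>&1; then
        # Print only the first line of the version banner.
        wget --version 2>&1 | head -n 1
    else
        echo "wget not found. Install it with one of:"
        echo "  sudo apt-get install wget   # Ubuntu/Debian"
        echo "  sudo yum install wget       # Red Hat"
        echo "  sudo port install wget      # Mac OS X with MacPorts"
    fi
}

check_wget
```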
One challenge with wget is that, unless you limit it to particular directories, it will happily suck down every page in a large site. A second challenge is that wget puts all the files it downloads into the same directory, making it hard for users to figure out which file they need to open to view the exported content. The example below addresses both of these challenges. Command options shown in bold in the example will need to be changed to fit the site you are exporting.
1. Open the Terminal application (OS X) or a Bash shell (Linux)
2. Create a directory for the export
3. Go into the directory
4. Create a directory for the files wget will fetch
5. Go into the content directory
6. Run wget
The important options here are the one that limits which directories wget will follow (--include-directories in wget), --reject, and the base URL at the end. These three options determine which pages get downloaded. In this example we are downloading a WordPress blog, so we want to include only pages under that blog's directory (mysubdir). The wp-login.php page will also cause a redirection loop, so we want to exclude it.
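As a rough sketch of steps 2 through 6, the function below creates the export and content directories and then runs wget inside content/. The option set is an assumption chosen to match the description above (flat output, one included directory, wp-login.php rejected), not necessarily the exact command this guide originally used. The function name export_site, the directory name mysite_export, and the example.com URL are placeholders.

```shell
# Sketch only: this option set is an assumption, not a command taken from this guide.
# Arguments (all placeholders): export directory, base URL, directory to include.
export_site() {
    export_dir=$1    # e.g. mysite_export
    base_url=$2      # e.g. http://example.com/mysubdir/
    include_dir=$3   # e.g. /mysubdir

    # Steps 2-5: create the export directory and a content/ directory inside it.
    mkdir -p "$export_dir/content" || return 1

    # Step 6: run wget from inside content/. The subshell leaves your own
    # working directory unchanged.
    (
        cd "$export_dir/content" || exit 1
        wget --recursive \
             --page-requisites \
             --convert-links \
             --adjust-extension \
             --no-parent \
             --no-directories \
             --include-directories="$include_dir" \
             --reject="wp-login.php" \
             "$base_url"
    )
}

# Usage (placeholders):
# export_site mysite_export "http://example.com/mysubdir/" "/mysubdir"
```

Here --include-directories and --no-parent keep the crawl inside the blog's directory, --convert-links rewrites links for offline browsing, and --no-directories puts every file directly in content/. Note that wget's recursion depth defaults to 5 levels, so a deeply nested site may need an explicit --level value.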
7. Go to the parent directory
8. Copy the index.html file to the current directory
This will provide you with a single index.html file and a single content/ directory so that it is easy to see what to click on to open the site.
9. Rewrite URLs in the index.html to point at the content/ directory
perl -p -i -e 's/(src|href)=(['\''"])([^'\''"\/]+)(['\''"])/$1=$2content\/$3$4/gi' index.html
This command searches and replaces links in the copy of the index.html file that we moved up a level in the directory structure so that all of the links properly point into the content/ directory.
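To see what the rewrite does before running it on a real export, the sketch below applies the same perl substitution to a small sample file in a temporary directory. The sample HTML, the demo_dir variable, and the example.com URL are made up for illustration.

```shell
# Hypothetical sample file; a real export would use the index.html copied in step 8.
demo_dir=$(mktemp -d)
cat > "$demo_dir/index.html" <<'EOF'
<a href="about.html">About</a>
<img src='logo.png'>
<a href="http://example.com/page.html">External</a>
EOF

# Prefix content/ to every src or href value that contains no slash.
perl -p -i -e 's/(src|href)=(['\''"])([^'\''"\/]+)(['\''"])/$1=$2content\/$3$4/gi' "$demo_dir/index.html"

# Show the rewritten file.
cat "$demo_dir/index.html"
```

The two local links become content/about.html and content/logo.png, while the absolute URL is left alone because it contains slashes. The same rule means that slash-free values such as #top or a mailto: address would also be prefixed, so it is worth checking the rewritten file for links like those.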
10. Zip up the export
You should now have the following directory structure:
You can zip up the whole directory from the command line by going up a level and running the zip command.
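A minimal sketch of the zip step follows, assuming the export directory is named mysite_export (a placeholder) and that the zip utility is installed. The function name zip_export is also made up for illustration.

```shell
# Expected layout before zipping (mysite_export is a hypothetical name):
#   mysite_export/index.html    <- rewritten copy that opens the site
#   mysite_export/content/...   <- everything wget downloaded
zip_export() {
    export_dir=${1%/}                # strip a trailing slash, if any
    parent=$(dirname "$export_dir")
    name=$(basename "$export_dir")

    # -r recurses into the directory. Running from the parent directory keeps
    # the paths inside the archive relative to a single top-level folder.
    ( cd "$parent" && zip -r "$name.zip" "$name" )
}

# Usage (placeholder name):
# zip_export mysite_export
```

Zipping from the parent directory means the archive unpacks into one folder containing index.html and content/, so whoever receives it can see which file to open.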
Export as a single PDF file
Documentation on saving websites to PDF using Acrobat is available on Adobe's help site.