
Save a website

Overview

One of the benefits of websites (including blogs, wikis, and other web applications) is that every time you view them you get the most up-to-date version of the content. Sometimes, however, you may need to save the entire content of a website. It may be that you will be traveling to a place with no internet connectivity and need to share a resource you made online, or maybe you wish to archive a snapshot of a site for historical comparison.

The process for saving a website differs based on how you want to use the result. The two main options are:

  1. Save as an interactive set of HTML files
  2. Save as a single linear PDF file

If the website is simply a series of articles that you wish to be able to read through offline, the single PDF file can be a little more user-friendly. The downside of the PDF file is that internal links between pages may not work and some formatting will be lost.

Exporting the site as a directory (folder) full of HTML files will provide a result that can be viewed in a web browser while offline. All of the formatting and internal links will work, but the result is a directory of files rather than a single file and will therefore usually need to be zipped up in order to share it with others.

In both cases, it is important to watch the status of the tool, as it is very easy for the web crawler that locates pages to download too much (such as the entire Middlebury site) when all you want is a subsection.

Export as HTML files

As mentioned above, exporting a site as HTML files will provide you with a result that can be viewed in a web browser while offline. Rather than being loaded over the internet, the HTML files are loaded from the local hard drive and your browser does the same process of rendering them.

Single Pages

Your web browser can save a single page as HTML. This page can either include just the text or all of the resources (images, style sheets, etc.) that are shown on the page. The main limitation is that any links in the saved page will send your browser back over the internet, so this method isn't useful for offline viewing of a whole site.

Firefox, Chrome

  1. Go to File » Save Page As...
  2. Choose Web Page, Complete if you wish to save all of the images and style sheets.

Safari

  1. Go to File » Save As...
  2. Choose Web Archive if you wish to save all of the images and style sheets.

Internet Explorer

  1. Go to File » Save As...
  2. Choose Webpage, complete (*.htm;*.html) if you wish to save all of the images and style sheets.
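A single page can also be fetched from the command line with wget (described in more detail in the next section). The following is a rough sketch, with the URL standing in for the page you actually want to save:

wget --page-requisites --html-extension --convert-links --no-directories --directory-prefix=page http://blogs.middlebury.edu/mysubdir/somepage/

The page and its images and style sheets end up together in a page/ directory; open the .html file inside it in your browser. As with the browser methods above, links to other pages will still point back out to the live site.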

Multiple Pages / Whole sites

These tools will allow you to save a whole site (or a portion of a site) as an interactive set of HTML files. They convert all of the internal links so that as you browse the site you go from one page to another without going over the internet.

wget on Mac/Linux

wget is a command-line tool for fetching webpages and other web content that runs on OS X, Linux, and other UNIX-like operating systems. On Linux systems it is available via the built-in package-management tools: Yum (Red Hat/CentOS/Fedora) or Apt (Ubuntu/Debian). On Mac OS X, it can be installed via the MacPorts package manager: install MacPorts, then run sudo port install wget in the Terminal.
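For example, depending on your system, the install command will look something like one of the following:

sudo apt-get install wget    # Ubuntu/Debian
sudo yum install wget        # Red Hat/CentOS/Fedora
sudo port install wget       # Mac OS X with MacPorts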

One challenge with wget is that, without limiting it to particular directories, it will happily download every page in a large site. A second challenge is that wget puts all the files it downloads into the same directory, making it hard for users to figure out which file they need to open to view the exported content. The example below addresses both of these challenges. The command options that name specific directories and URLs will need to be changed to fit the site you are trying to export.

1. Open the Terminal application (OS X) or a Bash shell (Linux)

2. Create a directory for the export

mkdir mysite

3. Go into the directory

cd mysite/

4. Create a directory for the files wget will fetch

mkdir content

5. Go into the content directory

cd content/

6. Run wget

wget -r --page-requisites --html-extension --convert-links --no-directories --wait=1 --include-directories=/mysubdir --reject wp-login.php http://blogs.middlebury.edu/mysubdir/

The important options here are --include-directories, --reject, and the base URL at the end. These three options determine which pages get downloaded. In this example we are downloading a WordPress blog, so we want to include only pages under that blog's directory (mysubdir). Additionally, the wp-login.php page would cause a redirection loop, so we exclude it.
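If you need to follow more than one section of a site, --include-directories accepts a comma-separated list of directories. As a hypothetical variant (the /mysubdir and /uploads paths and the URL are placeholders to adjust for your own site):

wget -r --page-requisites --html-extension --convert-links --no-directories --wait=1 --include-directories=/mysubdir,/uploads --reject wp-login.php http://blogs.middlebury.edu/mysubdir/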

7. Go to the parent directory

cd ..

8. Copy the index.html file to the current directory

cp content/index.html ./

This will provide you with a single index.html file and a single content/ directory so that it is easy to see what to click on to open the site.

9. Rewrite URLs in the index.html to point at the content/ directory

perl -p -i -e 's/(src|href)=(['\''"])([^'\''"\/]+)(['\''"])/$1=$2content\/$3$4/gi' index.html

This command searches and replaces the links in the copy of index.html that we placed one level up in the directory structure so that they all properly point into the content/ directory.
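To illustrate, a reference in the copied index.html such as:

href="style.css"

becomes:

href="content/style.css"

while references that already contain a slash (such as full http:// URLs) are left alone. At this point you can double-check the export by opening index.html in your browser while disconnected from the internet.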

10. Zip up the export

You should now have the following directory structure:

mysite/
mysite/index.html
mysite/content/
mysite/content/...

You can zip up the whole directory from the command line by going up a level and running the zip command:

cd ..
zip -r mysite.zip mysite/
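Whoever receives the archive can unpack it and open the top-level index.html in any browser, for example:

unzip mysite.zip
open mysite/index.html

(open works on Mac OS X; on Linux, xdg-open mysite/index.html does the same, or use the browser's File » Open menu.)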

Export as a single PDF file

Acrobat

Documentation on saving websites to PDF using Acrobat is available on Adobe's help site.
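If you do not have Acrobat, one command-line alternative is the open-source wkhtmltopdf tool, which renders a page and writes it out as a PDF. A minimal sketch, with the URL and output name as placeholders:

wkhtmltopdf http://blogs.middlebury.edu/mysubdir/ mysite.pdf

This converts one page per URL given; as noted above, internal links between pages and some formatting may not survive the conversion to PDF.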
