Website to PDF

Long gone are the days when you’d print the contents of digital documents onto dead trees in order to archive them or mail them by letter. Luckily there’s the Portable Document Format (PDF), and documents in this format are widely accepted. While it shouldn’t be a problem to get a PDF document from an office application such as Microsoft Word, getting one from the contents of a website is not always straightforward.

Modern single-page applications (SPAs) bring the complexity of once stand-alone applications into the browser. In some cases such an application, e.g. one for bookkeeping, needs to provide a PDF document that can easily be downloaded or sent by e-mail. However, because rendering differs between browsers and browser versions, it is not easy to provide the same layout and quality to all users.

There are a couple of JS libraries that can generate PDF documents ad hoc, but they come with hurdles in the implementation or limitations in what they can achieve. For now I only needed a basic website-to-PDF tool that produces the same document no matter which browser the user is running.

Approach

Google’s Chrome browser has an extensive API, the Chrome DevTools Protocol. This API is used by the Developer Tools, for debugging and for remote control. For the latter, Google’s Node library Puppeteer comes in very handy, as it provides methods to access the Chrome browser’s API and programmatically do the things you’d otherwise do manually. There’s even a method, page.pdf(), that generates a PDF document out of a page. Apply some settings and you’re done.
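To make this more concrete, here is a minimal sketch of how a page could be turned into a PDF with Puppeteer. It is an illustration, not the code from the repository: the helper names pdfOptions and urlToPdf and the chosen settings (A4, background printing, 1 cm margins) are assumptions. It also assumes Puppeteer is installed and can launch a Chrome/Chromium binary.

```javascript
// Sketch only: assumes `npm install puppeteer` and a Chrome/Chromium it can launch.

// Hypothetical helper: default settings for page.pdf(), with optional overrides.
function pdfOptions(overrides = {}) {
  return {
    format: 'A4',
    printBackground: true, // include CSS backgrounds, which printing omits by default
    margin: { top: '1cm', right: '1cm', bottom: '1cm', left: '1cm' },
    ...overrides,
  };
}

// Hypothetical helper: open the URL in headless Chrome and return the PDF bytes.
async function urlToPdf(url, overrides = {}) {
  // Loaded lazily so pdfOptions() can be used without Puppeteer installed.
  const { default: puppeteer } = await import('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until the network has been idle, so content loaded by scripts is rendered.
    await page.goto(url, { waitUntil: 'networkidle0' });
    return await page.pdf(pdfOptions(overrides));
  } finally {
    await browser.close();
  }
}
```

Calling urlToPdf('https://example.com', { landscape: true }) would then resolve to the PDF data, which can be written to a file or sent over HTTP. Because the rendering always happens in the same Chrome on the server, the result does not depend on the visitor’s browser.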

Implementation

I’ve created a (very basic) Node.js application that provides a (very basic) API that accepts a URL and generates a PDF document, which is either displayed or offered as a download. The application uses Express and Puppeteer and can be run in an, again very basic, environment such as a Docker container.
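A rough sketch of what such a service might look like is shown here. This is not the code from the repository: the /pdf route, the url and download query parameters, the port and all helper names are made up for illustration. It assumes Express and Puppeteer are installed.

```javascript
// Sketch only: assumes `npm install express puppeteer`. The route, query
// parameters and helper names are hypothetical, not taken from the repository.

// Accept only absolute http(s) URLs; return null for anything else.
function parseTargetUrl(raw) {
  try {
    const url = new URL(String(raw));
    return ['http:', 'https:'].includes(url.protocol) ? url.href : null;
  } catch {
    return null;
  }
}

// "inline" lets the browser display the PDF, "attachment" offers it as a download.
function dispositionFor(download) {
  return download ? 'attachment; filename="page.pdf"' : 'inline';
}

async function startServer(port = 3000) {
  // Loaded lazily so the helpers above work without the dependencies installed.
  const { default: express } = await import('express');
  const { default: puppeteer } = await import('puppeteer');

  const browser = await puppeteer.launch(); // one browser, shared by all requests
  const app = express();

  app.get('/pdf', async (req, res) => {
    const target = parseTargetUrl(req.query.url);
    if (!target) {
      res.status(400).send('Query parameter "url" must be an http(s) URL.');
      return;
    }
    const page = await browser.newPage();
    try {
      await page.goto(target, { waitUntil: 'networkidle0' });
      const pdf = await page.pdf({ format: 'A4', printBackground: true });
      res.set('Content-Type', 'application/pdf');
      res.set('Content-Disposition', dispositionFor(req.query.download === '1'));
      res.send(Buffer.from(pdf));
    } catch (err) {
      res.status(502).send(`Could not render ${target}: ${err.message}`);
    } finally {
      await page.close();
    }
  });

  return app.listen(port);
}
```

After calling startServer(), a request to http://localhost:3000/pdf?url=https://example.com would display the PDF in the browser, and appending &download=1 would offer it as a file instead. Restricting the input to http(s) URLs is a deliberate choice in this sketch, so that the service cannot be asked to print local files.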

Caveats

While this approach is straightforward and, in my opinion, easily implemented, there are some limitations:

  • You cannot access restricted resources.
  • You cannot choose what part of a website is printed.

The repository with the code is available on GitHub: github.com/dvdvnl/webpdf