LowEndBox - Cheap VPS, Hosting and Dedicated Server Deals

How to Permanently Download Any Book You Can Borrow on archive.org as a PDF

There are tons of books you can borrow on archive.org.

I’ve been purging my paper collection and moving to ebooks, and while there are a lot of books I can replace with Kindle editions and PDFs, a fair number of books never made it to ebook form.

archive.org has an offering where they’ve scanned zillions of books and made them available for check-out on a free renewable lending basis.  But this service has four key limitations:

  1. It’s using the archive.org page-turning interface which is not as pleasant as having an actual PDF you can use in your favorite software.
  2. You can’t mark up or bookmark the book and keep those notes for the future.
  3. Only works when you’re connected to the Internet.  Going on a cruise or camping?  You’re out of luck.
  4. And it might not be around much longer.  archive.org already lost round one against a gang of publishers who are angry that books they have no interest in republishing can be checked out of a library.

Code to the Rescue

Fortunately, there is a way you can download these books as PDFs.  All you need is a little JavaScript and the ability to pay close attention to instructions.

First, head over to this GitHub repository, which has full instructions.

Some advice:

  1. Use Firefox as your browser.  It works consistently.
  2. Uncheck “Always ask you where to save files” in the Firefox settings.  You’ll be downloading a couple hundred or more files and you don’t want to hit return for each one.
  3. Zoom in on the image after you check the book out, and do it at least two times.  I usually do 4.  Otherwise you’re going to get tiny JPGs that are fuzzy when you try to read them.
  4. Follow instructions closely.  It probably won’t work for you the first time, and when you go over the instructions again you’ll realize you missed a small step.
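
If you want to confirm that the zoom advice worked before you build the PDF, here is a minimal sketch that reads each JPG's pixel dimensions with nothing but the standard library and lists the ones that look too small. The names jpeg_size and find_small_jpgs and the 1000-pixel cutoff are my own placeholders, not anything from the GitHub instructions, so pick a cutoff that matches what you consider readable.

```python
import os
import struct

def jpeg_size(data):
    """Return (width, height) read from JPEG bytes, or None if not found."""
    if data[:2] != b"\xff\xd8":          # every JPEG starts with the SOI marker
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:            # segments must begin with 0xFF
            return None
        marker = data[pos + 1]
        if marker == 0xFF:               # padding byte, skip it
            pos += 1
            continue
        if marker == 0xDA:               # start of scan: no frame header seen
            return None
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        # SOF0..SOF15 carry the dimensions; 0xC4, 0xC8 and 0xCC are not SOF
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            if pos + 9 > len(data):
                return None
            height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
            return width, height
        pos += 2 + length                # jump to the next segment
    return None

def find_small_jpgs(directory, min_width=1000):
    """Return (name, width, height) for every JPG narrower than min_width."""
    small = []
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".jpg"):
            continue
        with open(os.path.join(directory, name), "rb") as f:
            size = jpeg_size(f.read())
        if size is not None and size[0] < min_width:
            small.append((name, size[0], size[1]))
    return small
```

Calling find_small_jpgs on your download folder returns an empty list if every page is at least min_width pixels wide. If it returns anything, zoom in a couple more times and download again.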

Once you have all the JPGs, you can assemble them into a PDF in various ways.  Here’s a quick Python script that can do it via the img2pdf module (install it with pip install img2pdf).  Just save all the JPGs into one folder and call this script as:

make_pdf.py <directory name>

Code:

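#!/usr/bin/python3

# Combine every JPG in a directory into one PDF named <directory>.pdf.

import img2pdf, os, re, sys

def fail ( message ):
    print ("%s\n" % ( message ))
    sys.exit(1)

if ( len(sys.argv) != 2 ):
    fail ("Usage: make_pdf.py <directory>")

# drop a trailing slash so the PDF is named after the directory
img_dir = sys.argv[1]
img_dir = re.sub( '/$', '', img_dir )
if not os.path.isdir ( img_dir ):
    fail ( "ERROR: directory '%s' does not exist" % ( img_dir ) )
print ("%-30s: %s" % ( "Directory", img_dir ) )
pdf_name = "%s.pdf" % ( img_dir )
print ("%-30s: %s" % ( "PDF to Create", pdf_name ) )

# collect the JPG files, skipping subdirectories and anything else
images = []
for fname in os.listdir(img_dir):
    if not fname.lower().endswith(".jpg"):
        continue
    path = os.path.join(img_dir, fname)
    if os.path.isdir(path):
        continue
    images.append(path)

# stop here if nothing was found, otherwise images[0] below would crash
if not images:
    fail ( "ERROR: no .jpg files found in '%s'" % ( img_dir ) )

# plain string sort, so page order follows file name order
images.sort()

print ("%-30s: %d" % ( "Num Images", len(images) ) )
print ("%-30s: %s" % ( "First Image", images[0] ) )
print ("%-30s: %s" % ( "Last Image", images[-1] ) )

# img2pdf embeds the JPGs as they are, without re-encoding them
with open(pdf_name,"wb") as f:
    f.write(img2pdf.convert(images))

# report the size of the finished PDF (works on any OS, unlike du)
print ("%-30s: %.1f MB" % ( "PDF Size", os.path.getsize(pdf_name) / 1e6 ) )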
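
One thing to watch: images.sort() in the script compares file names as plain strings, so if your downloaded files are numbered without leading zeros, page10.jpg sorts ahead of page2.jpg and the PDF comes out in the wrong order. I don't know what names your browser will give the files, so treat this as a sketch under that assumption. The natural_key helper below is my own name, not part of img2pdf. It splits each name into text and number chunks so that runs of digits compare as numbers.

```python
import re

def natural_key(path):
    """Sort key that compares runs of digits as numbers, not as text."""
    # re.split with a capture group keeps the digit runs in the result,
    # e.g. "page10.jpg" becomes ["page", "10", ".jpg"]
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", path)]
```

To use it, swap images.sort() for images.sort(key=natural_key) in the script.  If your file names are already zero-padded, the plain sort gives the same order and you can skip this.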

 

raindog308
