How to Permanently Download Any Book You Can Borrow on archive.org as a PDF

Jul 26, 2023 @ 7:00 am

There are tons of books you can borrow on archive.org.

I’ve been purging my paper collection and moving to ebooks, and while there are a lot of books I can replace with Kindle editions and PDFs, a fair number of books never made it to ebook form.

archive.org has an offering where they’ve scanned zillions of books and made them available for check-out on a free, renewable lending basis. But this service has four key limitations:

It’s using the archive.org page-turning interface which is not as pleasant as having an actual PDF you can use in your favorite software.
You can’t mark up or bookmark the book and keep those notes for the future.
It only works when you’re connected to the Internet. Going on a cruise or camping? You’re out of luck.
And it might not be around much longer. archive.org already lost round one against a gang of publishers who are angry that books they have no interest in republishing can be checked out of a library.

Code to the Rescue

Fortunately, there is a way you can download these books as PDFs. All you need is a little JavaScript and the ability to pay close attention to instructions.

First, head over to this GitHub repository, which has full instructions.

Some advice:

Use Firefox as your browser. It works consistently.
Uncheck “Always ask you where to save files” in Firefox’s settings. You’ll be downloading a couple hundred or more files and you don’t want to hit return for each one.
Zoom in on the image after you check the book out, and do it at least twice. I usually zoom in four times. Otherwise you’re going to get tiny JPGs that are fuzzy when you try to read them.
Follow the instructions closely. It probably won’t work for you the first time, and when you go over the instructions again you’ll realize you missed a small step.
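
If you want a quick sanity check on the zoom advice before building the PDF, here is a minimal sketch that flags pages whose file size is far below the rest of the batch. The function name find_small_images and the 0.25 ratio are my own assumptions for illustration, not anything from the GitHub instructions, and the demo uses fake files in a temporary directory. It only catches pages that are much smaller than their neighbours (blank pages will show up too); it can’t tell you if the whole batch was grabbed at too low a zoom.

```python
import os
import statistics
import tempfile

def find_small_images(img_dir, ratio=0.25):
    """Return (name, size) pairs for .jpg files smaller than ratio * median size.

    The ratio is an arbitrary starting point (an assumption); tune it to taste.
    """
    sizes = {}
    for fname in os.listdir(img_dir):
        path = os.path.join(img_dir, fname)
        if fname.lower().endswith(".jpg") and os.path.isfile(path):
            sizes[fname] = os.path.getsize(path)
    if not sizes:
        return []
    cutoff = ratio * statistics.median(sizes.values())
    return sorted((name, size) for name, size in sizes.items() if size < cutoff)

# Demo with fake "pages" (just filler bytes, not real images)
with tempfile.TemporaryDirectory() as demo_dir:
    for name, size in [("a.jpg", 1000), ("b.jpg", 1200), ("c.jpg", 100)]:
        with open(os.path.join(demo_dir, name), "wb") as fh:
            fh.write(b"x" * size)
    print(find_small_images(demo_dir))
```

Anything it reports is worth opening by eye before you assemble the PDF.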

Once you have all the JPGs, you can assemble them into a PDF in various ways. Here’s a quick Python script that does it with the img2pdf module. Just save all the JPGs into one folder and call the script as:

make_pdf.py <directory name>

Code:

#!/usr/bin/python3

import img2pdf, os, re, sys

def fail ( message ):
    print ("%s\n" % ( message ))
    sys.exit(1)

if ( len(sys.argv) != 2 ):
    fail ("Usage: make_pdf.py <directory>")

img_dir = sys.argv[1]
img_dir = re.sub( '/$', '', img_dir )
# isdir() also rejects a plain file passed by mistake, which exists() would let through
if not os.path.isdir ( img_dir ):
    fail ( "ERROR: directory '%s' does not exist" % ( img_dir ) )
print ("%-30s: %s" % ( "Directory", img_dir ) )
pdf_name = "%s.pdf" % ( img_dir )
print ("%-30s: %s" % ( "PDF to Create", pdf_name ) )

images = []
for fname in os.listdir(img_dir):
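    # match the extension case-insensitively so .JPG and .jpeg files are not skipped
    if not fname.lower().endswith((".jpg", ".jpeg")):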
        continue
    path = os.path.join(img_dir, fname)
    if os.path.isdir(path):
        continue
    images.append(path)

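images.sort()

# stop early if the directory holds no JPGs, rather than crashing on images[0]
if not images:
    fail ( "ERROR: no .jpg files found in '%s'" % ( img_dir ) )

print ("%-30s: %d" % ( "Num Images", len(images) ) )
print ("%-30s: %s" % ( "First Image", images[0] ) )
print ("%-30s: %s" % ( "Last Image", images[-1] ) )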

with open(pdf_name,"wb") as f:
    f.write(img2pdf.convert(images))

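# report the PDF size without shelling out to du (works on any OS, no quoting issues)
print ("%-30s: %.1f MB" % ( "PDF Size", os.path.getsize(pdf_name) / 1e6 ) )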
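
One thing to watch: images.sort() compares file names as plain strings. If your downloaded files happen to be numbered without leading zeros, page 10 will land before page 2 in the PDF. I don’t know what names your download will produce, so treat this as a sketch under that assumption; the file names below are made up, and natural_key is a helper I’m introducing here, not part of img2pdf.

```python
import re

def natural_key(path):
    """Sort key that compares runs of digits as numbers, so 'page10' follows 'page2'."""
    # re.split with a capturing group alternates text and digit chunks:
    # even indexes are text, odd indexes are the digit runs.
    parts = re.split(r"(\d+)", path)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

# Hypothetical file names, for illustration only
names = ["book/page10.jpg", "book/page2.jpg", "book/page1.jpg"]
names.sort(key=natural_key)
print(names)  # ['book/page1.jpg', 'book/page2.jpg', 'book/page10.jpg']
```

To use it in the script, swap images.sort() for images.sort(key=natural_key). If your files are already zero-padded, the plain sort gives the same order and you can leave the script alone.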

raindog308

Raindog308 is a longtime LowEndTalk community administrator, technical writer, and self-described techno polymath. With deep roots in the *nix world, he has a passion for systems both modern and vintage, ranging from Unix, Perl, Python, and Golang to shell scripting and mainframe-era operating systems like MVS. He’s equally comfortable with relational database systems, having spent years working with Oracle, PostgreSQL, and MySQL.

As an avid user of LowEndBox providers, Raindog runs an empire of LEBs, from tiny boxes for VPNs, to mid-sized instances for application hosting, and heavyweight servers for data storage and complex databases. He brings both technical rigor and real-world experience to every piece he writes.

Beyond the command line, Raindog is a lover of German Shepherds, high-quality knives, target shooting, theology, tabletop RPGs, and hiking in deep, quiet forests.

His goal with every article is to help users, from beginners to seasoned sysadmins, get more value, performance, and enjoyment out of their infrastructure.

You can find him daily in the forums at LowEndTalk under the handle @raindog308.