I checked if browsers could cache all bookmarked pages locally
Surely that would take too much space, right?
Summary
I bookmark a lot, so I wish Web browsers would save some bookmarked pages entirely on disk.
But wouldn't that take way too much space?
So I wrote a Python script to check how much disk space it would take to save all those pages (excluding video) locally.
Given the limits of the script, I won't pretend to draw definitive conclusions, but running it on 3388 bookmarks dumped 4.7GB.
I bookmark a lot
I have 3388 bookmarks in Firefox and when I want to find something, it's a nightmare:
The bookmark management UI is awful and limited.
You can't search by content.
If you find the proper link and click on it, half of the time it's a 404, because after all this time, most of them turned out to be uncool URLs that do change.
There are alternatives like Zotero or the excellent SingleFile extension, but they are nowhere near as well integrated. And I lost my entire Zotero library once, which doesn't help.
I wish my web browser would allow me to bookmark a page, and let me tick a box that says, "save an offline copy".
And then provide a full search index on it, plus offer to display the locally cached version if I'm offline, or if it's not available anymore.
However, we all know that the web is bloated in this day and age: SPAs, ads, and thousands of tracker scripts making every single page a mountain of 1s and 0s. Saving all that would just fill our hard drives with a stream of node_modules-like spawn.
Shirley, you can't be serious.
What I did next will not shock you
Firefox bookmarks are stored in an SQLite file, wget can download a full page with all its dependencies, and Python will happily play with both.
So, of course, I quicked and dirtied:

"""
usage: download_bookmarks.py [-h] [--concurrency [CONCURRENCY]] [--directory DIRECTORY] bookmarks

positional arguments:
  bookmarks             The path to the sqlite db file containing
                        the bookmarks. It's the places.sqlite file
                        in your default profile dir.

optional arguments:
  -h, --help            show this help message and exit
  --concurrency [CONCURRENCY], -c [CONCURRENCY]
                        Max number of bookmarks to process in parallel
  --directory DIRECTORY, -d DIRECTORY
                        Directory to store the downloaded files. Will be
                        recursively created if it doesn't exist. Otherwise,
                        a temp dir will be used.
"""
import argparse
import asyncio
import sqlite3
import sys
from asyncio.exceptions import CancelledError
from pathlib import Path
from tempfile import TemporaryDirectory

if not sys.version_info >= (3, 8):
    sys.exit("This script requires Python 3.8 or higher")

UA = (
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4"
)


async def download(i, total, url, directory, concurrency_limit):
    async with concurrency_limit:
        print(f"Downloading: {url} - START ({i}/{total})")
        proc = None
        try:
            # -p: page requisites, -k: convert links, -H: span hosts (for CDNs)
            proc = await asyncio.create_subprocess_shell(
                f'wget -o /dev/null -H -U "{UA}" -p -k -P {directory} "{url}"',
                stderr=asyncio.subprocess.PIPE,
            )
            _, stderr = await asyncio.wait_for(proc.communicate(), timeout=10)
            if stderr:
                print(f"Downloading: {url} - ERROR ({i}/{total})", file=sys.stderr)
                err = stderr.decode("utf8", errors="replace")
                print(f"\n[stderr]\n{err}", file=sys.stderr)
            print(f"Downloading: {url} - DONE ({i}/{total})")
        except (asyncio.TimeoutError, CancelledError):
            print(f"Downloading: {url} - TIMEOUT ({i}/{total})", file=sys.stderr)
        except Exception as e:
            print(f"Downloading: {url} - ERR...: '{e}' ({i}/{total})", file=sys.stderr)
        finally:
            if proc:
                try:
                    proc.terminate()
                except ProcessLookupError:
                    pass


async def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "bookmarks",
        help="The path to the sqlite db file containing the bookmarks. It's the places.sqlite file in your default profile dir.",
    )
    parser.add_argument(
        "--concurrency",
        "-c",
        type=int,
        nargs="?",
        help="Max number of bookmarks to process in parallel",
        default=40,
    )
    parser.add_argument(
        "--directory",
        "-d",
        help="Directory to store the downloaded files. Will be recursively created if it doesn't exist. Otherwise, a temp dir will be used.",
    )
    args = parser.parse_args()

    directory = args.directory or TemporaryDirectory().name
    directory = Path(directory)
    try:
        directory.mkdir(exist_ok=True, parents=True)
    except OSError as e:
        sys.exit(f"Error while creating the output directory: {e}")

    bookmark_file = Path(args.bookmarks)
    if not bookmark_file.is_file():
        sys.exit(f'Cannot find "{bookmark_file}"')
    if not bookmark_file.name == "places.sqlite":
        sys.exit(
            f'The bookmark file should be a "places.sqlite" file, got "{bookmark_file}"'
        )

    with sqlite3.connect(bookmark_file) as con:
        query = """
            SELECT url FROM moz_places, moz_bookmarks
            WHERE moz_places.id = moz_bookmarks.fk;
        """
        try:
            urls = {url for [url] in con.execute(query)}
        except sqlite3.OperationalError as e:
            if "locked" in str(e):
                sys.exit("Close Firefox before running this script")
            raise

    total = len(urls)
    print(f"Ready to process {total} bookmarks")
    print(f"Saving results in: {directory}")

    running_tasks = set()
    concurrency_limit = asyncio.Semaphore(args.concurrency)
    for i, url in enumerate(urls, 1):
        running_tasks.add(download(i, total, url, directory, concurrency_limit))

    await asyncio.wait(running_tasks)

    print(f"Results saved in: {directory}")


asyncio.run(main())
The script assumes you have Python 3.8+ and wget at hand.
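For reference, an invocation looks something like this (the profile directory name is machine-specific, so adjust the path to wherever your places.sqlite lives):

python3 download_bookmarks.py ~/.mozilla/firefox/xxxxxxxx.default-release/places.sqlite -c 100 -d ~/bookmark_dump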
Of course, with such a rudimentary crawler, a lot of things broke:
Bot detectors blocked my attempts, and I didn't even try to detect that, so there are probably a lot of captcha pages in there.
Cloudflare got in the way, and I'm not going to pay for scraping proxies just for this experiment.
Lots of uncool URLs.
While wget does get most assets, it's not that great at downloading pages if you care to read them later. -k is not as good as advertised. Because of CDNs, you gotta download external content (hence -H), but then I probably get a lot of unwanted files in there.
I forgot to set a timeout on my first attempt. Rookie mistake. My 40 workers got locked halfway through and I had to restart all over again. So I restarted with timeout handling. And 100 workers.
wget was creating a lot of log files, but I couldn't find how to suppress that without suppressing wget output. So you'll have to remove "-o /dev/null" if you want to see it.
There is a bug in asyncio that means you will see "Event loop is closed" at the end. Too lazy to attach .add_done_callback and clean it up. Should have used a ThreadPool, really (see the sketch below); asyncio is almost always too much trouble.
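For the curious, here is roughly what that ThreadPool variant could look like. This is just a sketch reusing the same wget flags and assuming the same urls set and output directory as the script above, not what I actually ran:

import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

UA = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4"

def download(i, total, url, directory):
    try:
        # subprocess.run() kills wget for us when the timeout expires
        subprocess.run(
            ["wget", "-o", "/dev/null", "-H", "-U", UA, "-p", "-k", "-P", str(directory), url],
            timeout=10,
        )
        print(f"Downloading: {url} - DONE ({i}/{total})")
    except subprocess.TimeoutExpired:
        print(f"Downloading: {url} - TIMEOUT ({i}/{total})", file=sys.stderr)

def download_all(urls, directory, workers=100):
    # The pool itself caps concurrency, so no Semaphore is needed
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, url in enumerate(urls, 1):
            pool.submit(download, i, len(urls), url, directory)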
If I ever want to do this seriously, I'll fire up Scrapy with a headless browser to benefit from JS execution, randomization and jitter, clean timeout handling, etc. But I have a game of Great Western Trail tonight, so this will have to do.
I don't know what we can really infer from the result, but here it is:
3388 bookmarks seem to be worth 4.7GB of my SSD (roughly 1.4MB per bookmark), or 3.5GB zipped.
Downloading the entire set took 4 minutes with a hundred workers.
Out of curiosity, here are the types of the files that got downloaded:
31396 text/plain
3034 application/octet-stream
1316 text/x-c++
1123 text/x-po
865 text/x-python
384 text/html
227 application/gzip
218 inode/x-empty
178 text/x-pascal
113 image/png
44 application/zlib
29 text/x-c
28 text/x-shellscript
14 application/xml
13 application/x-dosexec
12 text/troff
5 text/x-makefile
4 text/x-asm
3 application/zip
2 image/jpeg
2 image/gif
2 application/x-elc
1 text/x-ruby
1 text/x-diff
1 text/rtf
1 image/x-xcf
1 image/x-icon
1 image/svg+xml
1 application/x-shockwave-flash
1 application/x-mach-binary
1 application/x-executable
1 application/x-dbf
1 application/pdf
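If you want to reproduce that kind of count on your own dump, something along these lines should do it (assuming the file utility is installed; this is not necessarily how the numbers above were generated):

import subprocess
from collections import Counter
from pathlib import Path

def mime_histogram(directory):
    counts = Counter()
    for path in Path(directory).rglob("*"):
        if path.is_file():
            # `file --brief --mime-type` prints just the type, e.g. "text/html"
            result = subprocess.run(
                ["file", "--brief", "--mime-type", str(path)],
                capture_output=True,
                text=True,
            )
            counts[result.stdout.strip()] += 1
    for mime, count in counts.most_common():
        print(f"{count:>6} {mime}")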
So is the idea viable?
Provided the downloads come anywhere close to what you would get by doing this properly, I would say yes.
Firstly, I have 1TB on my laptop, and most games in my Steam library take much more space than that.
Not to mention we could gain from better compression algorithms and from deduplicating files with the same checksum, like shared JS and CSS assets.
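To give an idea, a naive version of the checksum dedup could look like this (assuming everything sits on one filesystem so hard links work; a real browser cache would do something smarter):

import hashlib
import os
from pathlib import Path

def dedup(directory):
    seen = {}  # sha256 digest -> first file seen with that content
    for path in Path(directory).rglob("*"):
        if not path.is_file() or path.is_symlink():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            # Same bytes already stored: replace the duplicate with a hard link
            path.unlink()
            os.link(seen[digest], path)
        else:
            seen[digest] = path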
Secondly, it would be opt-in: something like a double click on the "save as bookmark" icon to download the entire page, with the star turning a different color.
Mobile phones, Chromebooks, and Raspberry Pis may not want to spend the space. Not to mention there is some bookmarked content you don't want your OS to index and show a preview of in every search: I certainly don't want my bank's home page or my Twitter profile to be saved. Although a built-in replacement for Video DownloadHelper would be super cool.
So you wouldn't really download all your bookmarks.
Having a hundred megs of just the things I wished to save would be a pretty sweet deal to me.
Gotta save all those sources for the HN arguments I'll eventually get into. E.g., I'm having a harder and harder time finding links back to Microsoft's behavior in the '90s.
Now imagine having this offline library, with local AI to search all that...
Comments

I tried to solve a similar problem semi-recently. I don't bookmark at all (I never remember to), but I have a similar hatred for the history UI and implementation as you describe here for the bookmark UI (perhaps the Arc browser will innovate here?).
This annoyance led to me writing a little app called Wisplight, which does privacy-friendly, local full-text search over my browser history. So I 1. never have to remember to manually save my links somewhere (the reason I dislike all the bookmark apps out there) and 2. can search more than just the titles, with proper indexing rather than the simple partial exact text matching browsers do.
I'm considering open sourcing it soon, once I've perfected the user interface designs...
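(For the curious, the core idea can be sketched with SQLite's FTS5 over Firefox's history database; this toy illustration is not Wisplight's actual code:)

import sqlite3

def build_index(places_db, index_db):
    # Copy url/title pairs from Firefox's places.sqlite into an FTS5 table
    src = sqlite3.connect(places_db)
    dst = sqlite3.connect(index_db)
    dst.execute("CREATE VIRTUAL TABLE IF NOT EXISTS history USING fts5(url, title)")
    rows = src.execute("SELECT url, IFNULL(title, '') FROM moz_places")
    dst.executemany("INSERT INTO history (url, title) VALUES (?, ?)", rows)
    dst.commit()
    return dst

def search(con, terms):
    # Ranked full-text matching instead of the browser's substring search
    return con.execute(
        "SELECT url, title FROM history WHERE history MATCH ? ORDER BY rank", (terms,)
    ).fetchall()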
FYI, you can run SingleFile from the command line, see https://github.com/gildas-lormeau/single-file-cli. This might be an interesting alternative to wget. The SingleFile extension also has some options to help you automatically save the pages you bookmark.