I checked if browsers could cache all bookmarked pages locally
Surely that would take too much space, right?
Summary
I bookmark a lot, so I wish Web browsers would save some bookmarked pages entirely on disk.
But wouldn't that take way too much space?
So I wrote a Python script to check how much disk space it would take to save all those pages (excluding video) locally.
Given the limits of the script, I won't pretend to draw definitive conclusions, but running it on 3388 bookmarks dumped 4.7GB.
I bookmark a lot
I have 3388 bookmarks in Firefox and when I want to find something, it's a nightmare:
The bookmark management UI is awful and limited.
You can't search by content.
If you find the proper link and click on it, half of the time it's a 404, because after all this time, most of them turned out to be uncool URLs that do change.
There are alternatives like Zotero or the excellent SingleFile extension, but they are nowhere near as well integrated. And I lost my entire Zotero library once, which doesn't help.
I wish my web browser would allow me to bookmark a page, and let me tick a box that says, "save an offline copy".
And then provide a full search index on it, plus offer to display the locally cached version if I'm offline, or if it's not available anymore.
However, we all know that the web is bloated in this day and age: SPAs, ads, and thousands of tracker scripts making every single page a mountain of 1s and 0s. Saving all that would just fill our hard drives with a stream of node_modules-like spawn.
Shirley, you can't be serious.
What I did next will not shock you
Firefox bookmarks are stored in an SQLite file, wget can download a full page with all its dependencies, and Python will happily play with both.
So, of course, I quicked and dirtied:

"""
usage: download_bookmarks.py [-h] [--concurrency [CONCURRENCY]] [--directory DIRECTORY] bookmarks

positional arguments:
  bookmarks             The path to the sqlite db file containing
                        the bookmarks. It's the places.sqlite file
                        in your default profile dir.

optional arguments:
  -h, --help            show this help message and exit
  --concurrency [CONCURRENCY], -c [CONCURRENCY]
                        Max number of bookmarks to process in parallel
  --directory DIRECTORY, -d DIRECTORY
                        Directory to store the downloaded files. Will be
                        recursively created if it doesn't exist. Otherwise,
                        a temp dir will be used.
"""
import argparse
import asyncio
import sqlite3
import sys
from asyncio.exceptions import CancelledError
from pathlib import Path
from tempfile import TemporaryDirectory

if not sys.version_info >= (3, 8):
    sys.exit("This script requires Python 3.8 or higher")

UA = (
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4"
)


async def download(i, total, url, directory, concurrency_limit):
    async with concurrency_limit:
        print(f"Downloading: {url} - START ({i}/{total})")
        proc = None
        try:
            # -p: page requisites, -k: convert links, -H: span hosts (for CDNs)
            proc = await asyncio.create_subprocess_shell(
                f'wget -o /dev/null -H -U "{UA}" -p -k -P {directory} "{url}"',
                stderr=asyncio.subprocess.PIPE,
            )
            _, stderr = await asyncio.wait_for(proc.communicate(), timeout=10)
            if stderr:
                print(f"Downloading: {url} - ERROR ({i}/{total})", file=sys.stderr)
                err = stderr.decode("utf8", errors="replace")
                print(f"\n[stderr]\n{err}", file=sys.stderr)
            print(f"Downloading: {url} - DONE ({i}/{total})")
        except (asyncio.TimeoutError, CancelledError):
            print(f"Downloading: {url} - TIMEOUT ({i}/{total})", file=sys.stderr)
        except Exception as e:
            print(f"Downloading: {url} - ERR...: '{e}' ({i}/{total})", file=sys.stderr)
        finally:
            if proc:
                try:
                    proc.terminate()
                except ProcessLookupError:
                    pass


async def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "bookmarks",
        help="The path to the sqlite db file containing the bookmarks. It's the places.sqlite file in your default profile dir.",
    )
    parser.add_argument(
        "--concurrency",
        "-c",
        type=int,
        nargs="?",
        help="Max number of bookmarks to process in parallel",
        default=40,
    )
    parser.add_argument(
        "--directory",
        "-d",
        help="Directory to store the downloaded files. Will be recursively created if it doesn't exist. Otherwise, a temp dir will be used.",
    )
    args = parser.parse_args()

    directory = args.directory or TemporaryDirectory().name
    directory = Path(directory)
    try:
        directory.mkdir(exist_ok=True, parents=True)
    except OSError as e:
        sys.exit(f"Error while creating the output directory: {e}")

    bookmark_file = Path(args.bookmarks)
    if not bookmark_file.is_file():
        sys.exit(f'Cannot find "{bookmark_file}"')
    if not bookmark_file.name == "places.sqlite":
        sys.exit(
            f'The bookmark file should be a "places.sqlite" file, got "{bookmark_file}"'
        )

    with sqlite3.connect(bookmark_file) as con:
        query = """
            SELECT url FROM moz_places, moz_bookmarks
            WHERE moz_places.id = moz_bookmarks.fk;
        """
        try:
            urls = {url for [url] in con.execute(query)}
        except sqlite3.OperationalError as e:
            if "locked" in str(e):
                sys.exit("Close Firefox before running this script")
            raise

    total = len(urls)
    print(f"Ready to process {total} bookmarks")
    print(f"Saving results in: {directory}")

    running_tasks = set()
    concurrency_limit = asyncio.Semaphore(args.concurrency)
    for i, url in enumerate(urls, 1):
        running_tasks.add(download(i, total, url, directory, concurrency_limit))

    await asyncio.wait(running_tasks)

    print(f"Results saved in: {directory}")


asyncio.run(main())
The script assumes you have Python 3.8+ and wget at hand.
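For reference, an invocation looks something like this (the profile directory name is machine-specific, so adjust the path to wherever your places.sqlite lives):

python3 download_bookmarks.py ~/.mozilla/firefox/xxxxxxxx.default-release/places.sqlite -c 100 -d ~/bookmark_dump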
Of course, with such a rudimentary crawler, a lot of things broke:
Bot detectors blocked my attempts, and I didn't even try to detect that, so there are probably a lot of captcha pages in there.
Cloudflare got in the way, and I'm not going to pay for scraping proxies just for this experiment.
Lots of uncool URLs.
While wget does get most assets, it's not that great at downloading pages if you care to read them later. -k is not as good as advertised. Because of CDNs, you gotta download external content (hence -H), but then I probably get a lot of unwanted files in there.
I forgot to set a timeout on my first attempt. Rookie mistake. My 40 workers got locked halfway through and I had to restart all over again. So I restarted with timeout handling. And 100 workers.
wget was creating a lot of log files, but I couldn't find how to suppress that without suppressing wget output. So you'll have to remove "-o /dev/null" if you want to see it.
There is a bug in asyncio that means you will see "Event loop is closed" at the end. Too lazy to attach .add_done_callback and clean it up. Should have used a ThreadPool, really (see the sketch below); asyncio is almost always too much trouble.
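For the curious, here is roughly what that ThreadPool variant could look like. This is just a sketch reusing the same wget flags and assuming the same urls set and output directory as the script above, not what I actually ran:

import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

UA = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4"

def download(i, total, url, directory):
    try:
        # subprocess.run() kills wget for us when the timeout expires
        subprocess.run(
            ["wget", "-o", "/dev/null", "-H", "-U", UA, "-p", "-k", "-P", str(directory), url],
            timeout=10,
        )
        print(f"Downloading: {url} - DONE ({i}/{total})")
    except subprocess.TimeoutExpired:
        print(f"Downloading: {url} - TIMEOUT ({i}/{total})", file=sys.stderr)

def download_all(urls, directory, workers=100):
    # The pool itself caps concurrency, so no Semaphore is needed
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, url in enumerate(urls, 1):
            pool.submit(download, i, len(urls), url, directory)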
If I ever want to do this seriously, I'll fire up Scrapy with a headless browser to benefit from JS execution, randomization and jitter, clean timeout handling, etc. But I have a game of Great Western Trail tonight, so this will have to do.
I don't know what we can really infer from the result, but here it is:
3388 bookmarks seem to be worth 4.7GB of my SSD (roughly 1.4MB per bookmark), or 3.5GB zipped.
Downloading the entire set took 4 minutes with a hundred workers.
Out of curiosity, here are the types of the files that got downloaded:
31396 text/plain
3034 application/octet-stream
1316 text/x-c++
1123 text/x-po
865 text/x-python
384 text/html
227 application/gzip
218 inode/x-empty
178 text/x-pascal
113 image/png
44 application/zlib
29 text/x-c
28 text/x-shellscript
14 application/xml
13 application/x-dosexec
12 text/troff
5 text/x-makefile
4 text/x-asm
3 application/zip
2 image/jpeg
2 image/gif
2 application/x-elc
1 text/x-ruby
1 text/x-diff
1 text/rtf
1 image/x-xcf
1 image/x-icon
1 image/svg+xml
1 application/x-shockwave-flash
1 application/x-mach-binary
1 application/x-executable
1 application/x-dbf
1 application/pdf
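If you want to reproduce that kind of count on your own dump, something along these lines should do it (assuming the file utility is installed; this is not necessarily how the numbers above were generated):

import subprocess
from collections import Counter
from pathlib import Path

def mime_histogram(directory):
    counts = Counter()
    for path in Path(directory).rglob("*"):
        if path.is_file():
            # `file --brief --mime-type` prints just the type, e.g. "text/html"
            result = subprocess.run(
                ["file", "--brief", "--mime-type", str(path)],
                capture_output=True,
                text=True,
            )
            counts[result.stdout.strip()] += 1
    for mime, count in counts.most_common():
        print(f"{count:>6} {mime}")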
So is the idea viable?
Provided the downloads come anywhere close to what you would get by doing this properly, I would say yes.
Firstly, I have 1TB on my laptop, and most games in my Steam library take much more space than that.
Not to mention we could gain from better compression algorithms and from deduplicating files with the same checksum, like shared JS and CSS assets.
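To give an idea, a naive version of the checksum dedup could look like this (assuming everything sits on one filesystem so hard links work; a real browser cache would do something smarter):

import hashlib
import os
from pathlib import Path

def dedup(directory):
    seen = {}  # sha256 digest -> first file seen with that content
    for path in Path(directory).rglob("*"):
        if not path.is_file() or path.is_symlink():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            # Same bytes already stored: replace the duplicate with a hard link
            path.unlink()
            os.link(seen[digest], path)
        else:
            seen[digest] = path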
Secondly, it would be opt-in: something like a double click on the "save as bookmark" icon to download the entire page, with the star turning a different color.
Mobile phones, Chromebooks, and Raspberry Pis may not want to spend the space. Not to mention there is some bookmarked content you don't want your OS to index and show a preview of in every search: I certainly don't want my bank's home page or my Twitter profile to be saved. Although a built-in replacement for Video DownloadHelper would be super cool.
So you wouldn't really download all your bookmarks.
Having a hundred megs of just the things I wished to save would be a pretty sweet deal to me.
Gotta save all those sources for the HN arguments I'll eventually get into. E.g., I'm having a harder and harder time finding links back to Microsoft's behavior in the '90s.
Now imagine having this offline library, with local AI to search all that...
Comments

I tried to solve a similar problem semi-recently. I don't bookmark at all (I never remember to), but I have a similar hatred for the history UI and implementation as you describe here for the bookmark UI (perhaps the Arc browser will innovate here?).
This annoyance led to me writing a little app called Wisplight, which does privacy-friendly, local full-text search over my browser history. So I 1. never have to remember to manually save my links somewhere (the reason I dislike all the bookmark apps out there) and 2. can search more than just the titles, with proper indexing rather than the simple partial exact text matching browsers do.
I'm considering open sourcing it soon, once I've perfected the user interface designs...
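(For the curious, the core idea can be sketched with SQLite's FTS5 over Firefox's history database; this toy illustration is not Wisplight's actual code:)

import sqlite3

def build_index(places_db, index_db):
    # Copy url/title pairs from Firefox's places.sqlite into an FTS5 table
    src = sqlite3.connect(places_db)
    dst = sqlite3.connect(index_db)
    dst.execute("CREATE VIRTUAL TABLE IF NOT EXISTS history USING fts5(url, title)")
    rows = src.execute("SELECT url, IFNULL(title, '') FROM moz_places")
    dst.executemany("INSERT INTO history (url, title) VALUES (?, ?)", rows)
    dst.commit()
    return dst

def search(con, terms):
    # Ranked full-text matching instead of the browser's substring search
    return con.execute(
        "SELECT url, title FROM history WHERE history MATCH ? ORDER BY rank", (terms,)
    ).fetchall()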
FYI, you can run SingleFile from the command line, see https://github.com/gildas-lormeau/single-file-cli. This might be an interesting alternative to wget. The SingleFile extension also has some options to help you automatically save the pages you bookmark.