CLI-based bookmark manager, based on indexing visited sites and search-engine-like queries?
In most of the cases where people use bookmarks, I want a search engine restricted to sites I've visited. I don't know whether I'm dramatically different from other people, but by the time I'm looking for a site, I've forgotten its most distinctive attributes, and even my own tagging often ends up capturing the wrong sorts of attributes. Tagging still works better for me than hierarchical organization, but what I really want is a sort of command-line search engine that searches only sites I've visited before.
I've frequently thought about building such a thing, but every time I do I think, "someone must have already built this." So:
Does anyone know of a tool like bmm or buku, but which indexes the URL's main page, and has a command-line tool for keyword querying the DB like a search engine? As in, performing stemming and lemmatization? It'd be like bmm/buku's tag search, only the tags would be a search engine index of the page.
What I do not want is
- a self-hosted, web-based UI search engine
- a self-hosted bookmark manager; buku and bmm are already both fine tools, and I'm not trying to solve "access all my bookmarks from everywhere". The latter I can do with rsync or syncthing.
- a command-line bookmark manager... unless it conforms to the constraints above: queries should function on a full-text index of selected web pages. Again, buku and bmm would be fine if my tagging skills were better.
- a crawler-based search engine
I do want:
- the convenience of giving the tool a URL and having it auto-tag. buku does this, except that IME the resulting tags correlate even less well to how I remember things when I want to search than my manual tagging does.
- some fuzziness in the search; my current problem is how constrained the searches are. This isn't their failing; I simply have obtuse recollection skills. I tag "dog,pet,animal", but when I'm looking for it, what I remember is "it's got four legs".
- local, command-line
- indexing a page of a given URL. Recursive is optional; I probably wouldn't use it, but if it's there that's fine. I just want to be able to limit the indexing to a single page.
This is my last-ditch effort to find an existing tool; otherwise, I'm going to build it, because it's not a hard problem. Which is in part why I'm having trouble believing someone hasn't already built it.
I use Obsidian for all my notes and Obsidian Web Clipper works really well. The whole webpage is converted to markdown and stored in a plain text file. It also has a CLI, though I've not tried that out yet and don't know its capabilities.
Another option I've read about, but not tried out myself yet, is using the org-web-tools plugin for Emacs org-mode. This ofc converts it not to markdown but to org files. You can then search it however you like. Emacs probably has nearly unlimited options there or you just use grep.
Regarding the "clever" auto-tagging though... Idk. Both options can probably be connected to a local LLM that is capable of interpreting the content and extracting more or less meaningful tags. But having the full page contents at my disposal is most of the time enough to quickly find what I'm looking for.
Interesting.
Search engine technology is pretty well-established, and there are several options for local full-text search including stemming and so on. It's not hard, just tedious to put together -- you have to scrape, then strip out HTML and all the JS and CSS cruft, then run it through an indexer; I would just rather not have to re-implement it if a project exists which already does it.
And I've come across a few, but I'd really like something built for local-first, preferably with a CLI tool rather than a web interface. The ones I've seen all focus on a web interface presentation.
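For anyone who ends up building it, the scrape, strip, and index steps mentioned above can be sketched in a few dozen lines. This is only a rough illustration, not an existing tool: it assumes Python's standard library and an SQLite build with the FTS5 extension enabled (common, but not guaranteed), and every name in it (`TextExtractor`, `open_index`, `add_page`, `search`, the `pages` table) is made up for the example. Stemming comes from FTS5's built-in `porter` tokenizer; there is no lemmatization here.

```python
import sqlite3
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style> and <noscript> bodies."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.title_parts = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside JS/CSS: drop it
        if self._in_title:
            self.title_parts.append(data)
        elif data.strip():
            self.parts.append(data.strip())


def fetch(url):
    """Download a single page (no recursion, no crawling)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts).strip(), " ".join(parser.parts)


def open_index(path=":memory:"):
    db = sqlite3.connect(path)
    # 'porter unicode61' wraps the default tokenizer with Porter stemming.
    db.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages "
        "USING fts5(url UNINDEXED, title, body, tokenize='porter unicode61')"
    )
    return db


def add_page(db, url, html):
    title, body = html_to_text(html)
    db.execute("DELETE FROM pages WHERE url = ?", (url,))  # re-index on repeat
    db.execute(
        "INSERT INTO pages (url, title, body) VALUES (?, ?, ?)", (url, title, body)
    )
    db.commit()


def search(db, query, limit=10):
    # The query uses FTS5 syntax; punctuation in user input may need escaping.
    rows = db.execute(
        "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

Usage would be roughly `db = open_index("bookmarks.db")`, then `add_page(db, url, fetch(url))`, then `search(db, "four legs")`. With the stemmer, a query for "legs" can match a page that only says "four-legged", which covers word-form fuzziness but not the "I tagged it dog and remember four legs" kind unless those words actually appear on the page.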
But Obsidian or Emacs would already solve the content-extraction part. You then have the files locally on your machine and can do whatever you like with them, if you don't want to use either for the search.
Edit: tre-agrep or fzf could do the trick.
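If you'd rather script it than shell out, here is a rough sketch of typo-tolerant search over a folder of clipped files. To be clear, this is not how tre-agrep or fzf work internally; it is a stand-in using `difflib` from Python's standard library, and the function names, the `*.md` pattern, and the 0.8 similarity cutoff are all arbitrary choices for the example.

```python
import difflib
import re
from pathlib import Path


def words(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def fuzzy_score(query, text, cutoff=0.8):
    """Count query terms that approximately match some word in the text."""
    vocab = list(words(text))
    hits = 0
    for term in words(query):
        if difflib.get_close_matches(term, vocab, n=1, cutoff=cutoff):
            hits += 1
    return hits


def fuzzy_search(query, folder, pattern="*.md"):
    """Return matching file paths, best score first (ties sorted by path)."""
    scored = []
    for path in Path(folder).rglob(pattern):
        text = path.read_text(encoding="utf-8", errors="replace")
        score = fuzzy_score(query, text)
        if score:
            scored.append((score, str(path)))
    return [p for _, p in sorted(scored, key=lambda item: (-item[0], item[1]))]
```

Something like `fuzzy_search("bookmrak manger", "/path/to/clippings")` would then list the files containing near-matches for each misspelled term. It only tolerates spelling-level differences and will be slow on large vaults, since it rescans every file per query.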