Posted by centamiv 4 hours ago
Programming is chanting magic incantations and spells after all. (And fighting off evil spirits and demons.)
Never heard this term before, but I like it.
https://centamori.com/index.php?slug=basics-of-web-developme...
It's tempting to use this in projects that use PHP.
Is it usable with a corpus of, say, 1,000 3 kB Markdown files? And with 10,000 files?
Can I also index PHP files so that searches include function and class names? Perhaps comments?
How much RAM and disk space would we be talking about? And what kind of speed?
My first goal would be to index a PHP project and its documentation so that an LLM agent could perform semantic search through my MCP tool.
Since it only stores the vectors, the actual size of the Markdown document is irrelevant; you just need to handle the embedding and chunking phases carefully (you can use a parser to extract code snippets).
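A minimal sketch of the chunking step described above (the function name and size limit are my own for illustration, not taken from the repo): split the Markdown on paragraph boundaries into bounded-size chunks, then embed each chunk separately.

```php
<?php
// Sketch only: split a Markdown document into roughly fixed-size chunks
// on paragraph boundaries before embedding. The 800-character limit is
// an illustrative default, not a recommendation from the library.
function chunkMarkdown(string $markdown, int $maxChars = 800): array
{
    $paragraphs = preg_split('/\n{2,}/', trim($markdown));
    $chunks = [];
    $current = '';
    foreach ($paragraphs as $p) {
        // Flush the current chunk when adding this paragraph would
        // push it past the size limit.
        if ($current !== '' && strlen($current) + strlen($p) > $maxChars) {
            $chunks[] = $current;
            $current = '';
        }
        $current = $current === '' ? $p : $current . "\n\n" . $p;
    }
    if ($current !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}
```

Each returned chunk then gets its own embedding vector; a Markdown parser could be layered on top of this to pull out fenced code blocks as separate chunks.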
RAM isn't an issue because I aim for random data access as much as possible; this avoids saturating PHP, which wasn't exactly built for this kind of workload.
I'm glad you found the article and repo useful! If you use it and run into any problems, feel free to open an issue on GitHub.
HNSW is just the indexing algorithm; it doesn't care where the vectors come from. You can generate them with Ollama (locally), Hugging Face, Gemini, and so on.
As long as you feed it an array of floats, it will index it. The dependency on OpenAI is purely in the example code, not in the engine logic.
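To illustrate that contract (this is not the library's API, just a sketch): the only thing an embedding provider has to hand the index is an array of floats per document. A brute-force cosine scan shows the shape of the data; HNSW replaces the linear scan below with a graph traversal, but consumes the same input.

```php
<?php
// Cosine similarity between two same-length float arrays.
function cosine(array $a, array $b): float
{
    $dot = $normA = $normB = 0.0;
    foreach ($a as $i => $v) {
        $dot   += $v * $b[$i];
        $normA += $v * $v;
        $normB += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Rank stored vectors against a query vector, regardless of which
// model (Ollama, Hugging Face, Gemini, OpenAI...) produced them.
// An HNSW index answers the same query without scanning every vector.
function nearest(array $query, array $vectors, int $k = 3): array
{
    $scores = [];
    foreach ($vectors as $id => $vec) {
        $scores[$id] = cosine($query, $vec);
    }
    arsort($scores);
    return array_slice(array_keys($scores), 0, $k);
}
```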
Very good contribution.