Posted by rkochanowski 4 hours ago
It finds similar-looking code with embeddings. This detects more than just copy-paste clones or even clones with minor changes. Similar code is often not a clone to refactor, and this is a trade-off. Initial results need to be verified, but coding agents can do this quickly. Example prompts are available on https://slopo.dev
Additionally, similar code distant in the codebase is ranked higher to focus on less obvious duplication.
The results differ a lot depending on the codebase. I noticed that sometimes most of the detected duplicates are false positives, but the remaining ones are strong candidates to refactor or even bugs. Sometimes it reveals much more real duplication.
I would also consider - perhaps as a separate pass, with scoring set differently - to analyze comments (especially docstrings in Python). If I read the code correctly, you're currently just stripping them, which is the right thing to do when looking for code duplication, but duplicated docstrings are also often a signal that something is wrong in the codebase. The "different scoring" is because we expect docstring to be structured similarly (at least more than normal code), so some tweaking would be needed.
Finally: very nice project, congrats! :)
[1] https://github.com/rafal-qa/slopo/blob/main/src/slopo/indexi...
If you are interested in data, you can check my article. Analysis was done with this tool, but a previous version where exact-copy duplicates were excluded from analysis. https://rkochanowski.com/article/analysis-code-duplication/
How do you handle chunking and parsing for different languages to make sure the embeddings capture semantic meaning effectively? For instance, do you chunk by functions/classes, or use a fixed token window? If a function is too long or too short, it can drastically skew the embedding similarity.
While testing this tool, one detected duplication was interesting for a use case. Permission check logic was duplicated and placed in different distant places in the codebase. The code was similar, but not identical, the logic was not the same. One version had stricter checks. I analyzed this with the coding agent, and we found out that both versions are used for the same thing, which means that in some cases validation is insufficient. Having only a single validation place, this bug could be prevented or easily detected.
If so - maintainability, testability. This is old software engineering best practice at this point.
You shouldn’t hyper optimize for deduplication, but it’s usually worth considering. Fewer places to fix issues or improve as well.
On testability: two implementations can be tested against each other, leading to greater coverage with less test code. It doesn't work that way for 3+ implementations, which is another reason not to have that many.