
Posted by matthewhefferon 6/26/2025

Show HN: I built an AI dataset generator(github.com)
169 points | 33 comments
alienbaby 5 days ago|
Good for the shape of data, but what about the actual data? If it's entirely random then it's more of a UI demo tool than a tool to generate useful data.
margotli 6/26/2025||
Feels like a useful tool for anyone learning analytics or just needing sample data to test with.
hiatus 6/27/2025|
Are you affiliated with metabase? https://news.ycombinator.com/item?id=44107584
ajar8087 6/27/2025||
I was thinking more synthetic data to fit models like https://whitelightning.ai/
jmsdnns 6/26/2025||
depending on what you're using the synthetic data for, it is sometimes called distillation. here is a robust example from some upenn students: https://datadreamer.dev/
b0a04gl 6/26/2025|
seen this pattern before too. faker holds shape without flow. real tables come from actions: retry, decline, manual review, all that. if you just set col types, you might miss why the row even happened. gen needs to simulate behavior, not format
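
To make the "behavior, not format" point concrete, here is a minimal Python sketch, assuming a hypothetical payments table. The statuses, probabilities, and function names are illustrative assumptions, not taken from any tool in this thread. Rows exist only because a simulated action produced them, so a retry row always follows a decline.

```python
import random

# Hypothetical payment flow: each attempt is approved, declined,
# or sent to manual review. Only a decline triggers a retry.
def simulate_payment(order_id, rng, max_attempts=3):
    """Return the rows produced by one order's payment attempts."""
    rows = []
    for attempt in range(1, max_attempts + 1):
        roll = rng.random()
        if roll < 0.70:          # assumed approval rate
            status = "approved"
        elif roll < 0.90:        # assumed decline rate
            status = "declined"
        else:
            status = "manual_review"
        rows.append({"order_id": order_id, "attempt": attempt, "status": status})
        if status != "declined":
            break  # approved or under review: no further attempts
    return rows

def generate(n_orders, seed=0):
    """Build the whole table by running the flow once per order."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    table = []
    for order_id in range(1, n_orders + 1):
        table.extend(simulate_payment(order_id, rng))
    return table

if __name__ == "__main__":
    for row in generate(5):
        print(row)
```

A column-typed generator could emit `attempt=3, status="approved"` with no earlier attempts; here that row can only appear after two declines for the same order.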
ajd555 6/26/2025||
Was looking for this exact comment. I completely agree with this method, especially if you're testing an entire flow, and not just a UI tool. You want to test the service that interfaces between the API and the database.

I've been writing custom simulation agents (just simple Go programs) that simulate different users of my system. I can scale appropriately and see test data flow in. If metabase could generate these simulation agents based on a schema and some instructions, now that would be quite neat! Good job on this first version of the tool, though!
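
For readers wondering what such an agent might look like, here is a rough Python sketch (not the Go programs mentioned above, which aren't shown in this thread). The actions, delays, and names are all hypothetical. Each agent is a generator that yields its next action after a delay, and a small scheduler interleaves the agents on one simulated clock.

```python
import heapq
import random

def user_agent(user_id, rng):
    """Yield (delay_seconds, event) pairs for one simulated user session."""
    yield rng.uniform(0, 5), {"user": user_id, "action": "login"}
    for _ in range(rng.randint(1, 4)):
        yield rng.uniform(1, 30), {"user": user_id, "action": "view_item"}
    if rng.random() < 0.5:  # assumed: about half of sessions check out
        yield rng.uniform(1, 10), {"user": user_id, "action": "checkout"}
    yield rng.uniform(0, 2), {"user": user_id, "action": "logout"}

def run_simulation(n_users, seed=0):
    """Interleave all agents on one simulated clock and return the event log."""
    rng = random.Random(seed)
    agents = {uid: user_agent(uid, rng) for uid in range(n_users)}
    queue = []  # min-heap of (time, user_id, event)
    for uid, agent in agents.items():
        delay, event = next(agent)
        heapq.heappush(queue, (delay, uid, event))
    log = []
    while queue:
        now, uid, event = heapq.heappop(queue)
        log.append({"t": round(now, 2), **event})
        nxt = next(agents[uid], None)  # None once the session has ended
        if nxt is not None:
            delay, event = nxt
            heapq.heappush(queue, (now + delay, uid, event))
    return log

if __name__ == "__main__":
    for entry in run_simulation(3):
        print(entry)
```

In a real setup each event would presumably call the API under test instead of appending to a list, so the rows land in the database through the same service path that production traffic uses.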

matthewhefferon 6/26/2025|||
That’s a solid callout, appreciate you pointing it out. I’ll definitely dig into that more.
zikani_03 6/26/2025|||
This is well put. I once built a tool called [zefaker] (github.com/creditdatamw/zefaker) to test some data pipelines but never managed to get a good pattern or method for generating data that simulates actions or scenarios that didn't involve too much extra work.

Was hoping this AI dataset generator solves that issue, but I guess it is still early days. Looks good though, and using Faker to generate the data locally sounds good as a cost-cutting measure, but it also potentially opens room for human-in-the-loop adjustments of the generated data.

tomrod 6/26/2025||
The best synthetic data are those that capture ingestion and action, instead of just relationship.

Relationship is important, but your data structure might capture a virtually infinite number of unexpected behaviors that you would preferably call errors or bugs.