Posted by rasinmuhammed 12/16/2025

Show HN: Misata – synthetic data engine using LLM and Vectorized NumPy (github.com)
Hey HN, I’m the author.

I built Misata because existing tools (Faker, Mimesis) are great for random rows but terrible for relational or temporal integrity. I needed to generate data for a dashboard where "Timesheets" must happen after "Project Start Date," and I wanted to define these rules via natural language.

How it works:

LLM Layer: Uses Groq/Llama-3.3 to parse a "story" into a JSON schema constraint config.

Simulation Layer: Uses Vectorized NumPy (no loops) to generate data. It builds a DAG of tables to ensure parent rows exist before child rows (referential integrity).
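To make the two ideas above concrete, here is a minimal sketch (not Misata's actual code) of generating a parent and a child table in dependency order with vectorized NumPy. The table names, the `deps` mapping, the row counts, and the date ranges are all hypothetical, chosen to mirror the "Timesheets after Project Start Date" example.

```python
from graphlib import TopologicalSorter

import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical schema: each table maps to the set of tables it depends on.
deps = {"projects": set(), "timesheets": {"projects"}}

# Topological order guarantees parents are generated before children.
order = list(TopologicalSorter(deps).static_order())

n_projects, n_timesheets = 1_000, 50_000

# Parent table: ids plus a start date somewhere in 2024.
project_id = np.arange(n_projects)
start_offset = rng.integers(0, 365, n_projects).astype("timedelta64[D]")
start_date = np.datetime64("2024-01-01") + start_offset

# Child table: foreign keys are sampled only from existing parent ids,
# so referential integrity holds by construction.
fk = rng.integers(0, n_projects, n_timesheets)

# Temporal rule: each timesheet falls 1 to 180 days after its project's
# start date. Indexing start_date with fk aligns parent dates to child rows
# without a Python loop.
delay = rng.integers(1, 181, n_timesheets).astype("timedelta64[D]")
work_date = start_date[fk] + delay
```

The key step is `start_date[fk]`: fancy indexing broadcasts each parent's attribute onto its children, so the constraint is applied to all rows at once.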

Performance: Generates ~250k rows/sec on my M1 Air.

It’s early alpha. The "Graph Reverse Engineering" (describe a chart -> get data) is experimental but working for simple curves.

pip install misata

I’d love feedback on the simulator.py architecture—I’m currently keeping data in-memory (Pandas) which hits a ceiling at ~10M rows. Thinking of moving to DuckDB for out-of-core generation next. Thoughts?
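One way to picture the out-of-core idea is a generator that yields fixed-size chunks which are appended to an on-disk table, so peak memory depends on the chunk size rather than the total row count. The sketch below is an assumption-laden illustration, not Misata's code: it uses the standard-library `sqlite3` module as a stand-in for DuckDB, and `generate_chunks`, the table layout, and the sizes are hypothetical.

```python
import os
import sqlite3
import tempfile

import numpy as np


def generate_chunks(n_rows, n_parents, chunk_size, seed=0):
    """Yield (child_id, parent_id) arrays holding at most chunk_size rows."""
    rng = np.random.default_rng(seed)
    for lo in range(0, n_rows, chunk_size):
        hi = min(lo + chunk_size, n_rows)
        child_id = np.arange(lo, hi)
        # Foreign keys are drawn from the known parent id range.
        parent_id = rng.integers(0, n_parents, hi - lo)
        yield child_id, parent_id


# On-disk database file (sqlite3 here; DuckDB would play the same role).
db_path = os.path.join(tempfile.mkdtemp(), "synthetic.db")
con = sqlite3.connect(db_path)
con.execute(
    "CREATE TABLE timesheets (id INTEGER PRIMARY KEY, project_id INTEGER)"
)

for child_id, parent_id in generate_chunks(100_000, 1_000, 25_000):
    # .tolist() converts NumPy integers to plain Python ints for sqlite3.
    con.executemany(
        "INSERT INTO timesheets VALUES (?, ?)",
        zip(child_id.tolist(), parent_id.tolist()),
    )
con.commit()

row_count = con.execute("SELECT COUNT(*) FROM timesheets").fetchone()[0]
```

Only one chunk of arrays is alive at a time; the trade-off is that any constraint needing parent attributes must either keep the parent table in memory or look it up from the store per chunk.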

24 points | 2 comments
OutOfHere 12/23/2025
Is it possible to update the schema incrementally? I might want to develop it over, say, ten iterations, each adding points that I missed. After each iteration, I want to examine the schema and say what I want changed.
twelvechess 12/20/2025
That would be useful for testing MVPs with dummy data to see if they work. However, "synthetic data" usually refers to new data derived from existing data. From the README I couldn't quite tell whether that is the case here, but it's still useful.