Most RAG pipelines ship raw, sensitive documents over the wire to cloud services just to get them parsed, scrubbed of PII, chunked, and vectorized.
BitVanes is a zero-trust, local-first ETL engine designed to solve this. It’s written in Rust, outputs Apache Arrow RecordBatches, and compiles to both a native CLI and WebAssembly, so you can run the entire pipeline directly in a browser sandbox. The WASM version is at the posted URL. The core and CLI are on GitHub.
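To make the "scrub, then chunk, all locally" idea concrete, here is a minimal std-only sketch of two of those stages. It is not BitVanes's actual code: `Chunk`, `scrub_emails`, `chunk_words`, and `run_pipeline` are hypothetical names, the email check is a deliberately naive heuristic, and a plain struct stands in for the Arrow RecordBatch rows so the example has no dependencies.

```rust
/// One output row: a chunk of scrubbed text plus its position in the document.
/// (Hypothetical type; a real engine would emit Arrow columns instead.)
#[derive(Debug, PartialEq)]
pub struct Chunk {
    pub index: usize,
    pub text: String,
}

/// Naive check: exactly one '@', non-empty local part, and a dotted domain.
fn looks_like_email(s: &str) -> bool {
    match s.split_once('@') {
        Some((local, domain)) => {
            !local.is_empty()
                && !domain.contains('@')
                && domain.contains('.')
                && !domain.starts_with('.')
        }
        None => false,
    }
}

/// Replace email-like tokens with a placeholder, keeping surrounding punctuation.
/// Whitespace is normalized to single spaces as a side effect.
pub fn scrub_emails(input: &str) -> String {
    input
        .split_whitespace()
        .map(|tok| {
            // Strip leading/trailing punctuation before testing the token.
            let core = tok.trim_matches(|c: char| !c.is_alphanumeric());
            if looks_like_email(core) {
                tok.replace(core, "[EMAIL]")
            } else {
                tok.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Split text into chunks of at most `max_words` words, where consecutive
/// chunks share `overlap` words.
pub fn chunk_words(text: &str, max_words: usize, overlap: usize) -> Vec<Chunk> {
    assert!(
        max_words > 0 && overlap < max_words,
        "overlap must be smaller than max_words"
    );
    let words: Vec<&str> = text.split_whitespace().collect();
    let step = max_words - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + max_words).min(words.len());
        let index = chunks.len();
        chunks.push(Chunk {
            index,
            text: words[start..end].join(" "),
        });
        if end == words.len() {
            break;
        }
        start += step;
    }
    chunks
}

/// Scrub first, then chunk, so no raw PII ever reaches a chunk.
pub fn run_pipeline(raw: &str, max_words: usize, overlap: usize) -> Vec<Chunk> {
    chunk_words(&scrub_emails(raw), max_words, overlap)
}
```

Calling `run_pipeline(text, 200, 20)` on a parsed document would give a list of overlapping, scrubbed chunks ready for embedding. Because nothing here touches the network or the filesystem, the same code could in principle be built for a native target or for `wasm32`.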
I'd love to get your thoughts on the architecture, particularly on the use of Arrow (it's my first time using it, I'm coming from Cap'n Proto) and on the Rust-to-JS design for PDFs, which is meant to keep the WASM package reasonable.
I'd like to publish the package to crates.io once some people have kicked the tires and I've ironed out the rough spots.