Here's the GitHub repo for the package: https://github.com/b-rodrigues/rixpress/tree/master
and here's an example pipeline https://github.com/b-rodrigues/rixpress_demos/tree/master/py...
Well, better late than never I guess.
The ease of doing `model <- lm(speed ~ dist, cars)` and then `predict(model, data.frame(dist = c(42)))` is unparalleled.
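Spelled out as a runnable snippet (the built-in `cars` dataset, exactly as in the comment; the `interval` argument is standard `predict.lm`, added here as a bonus):

```
# Fit a linear model on the built-in cars dataset
model <- lm(speed ~ dist, data = cars)

# Predict speed for a new observation with dist = 42
predict(model, newdata = data.frame(dist = c(42)))

# Confidence intervals come along almost for free
predict(model, newdata = data.frame(dist = c(42)), interval = "confidence")
```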
I’m sure part of Python’s success is sheer mindshare momentum from being a common computing denominator, but I’d guess the integration story explains part of the margin. Your back end may well already be in Python or have interop with it, reducing stack investment and systems tax.
My employer is using R to crunch numbers embedded in a large system based on microservices.
The only thing to keep in mind is that most people writing R are not programmers by trade, so it is good to have one person on the project who can refactor their code from time to time.
I added the SQL query to the top of the R script to generate the input data.frame and my Python code reads the output CSV to do subsequent processing and storage into Django models.
I use a subprocess running Rscript to run the script.
It's not elegant but it is simple. This part of the system only has to run daily so efficiency isn't a big deal.
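A sketch of what that R side might look like; the DBI backend, table, and file names are illustrative assumptions, not the actual code:

```
# script.R -- run daily from Python via subprocess: Rscript script.R
library(DBI)

# Placeholder connection; any DBI backend works the same way
con <- dbConnect(RPostgres::Postgres(), dbname = "warehouse")

# The SQL query at the top of the script generates the input data.frame
input <- dbGetQuery(con, "SELECT id, amount FROM orders")
dbDisconnect(con)

# ... number crunching ...
result <- aggregate(amount ~ id, data = input, FUN = sum)

# Write the CSV the Python side reads into Django models
write.csv(result, "output.csv", row.names = FALSE)
```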
I guess you'll need to decide whether this is a big enough issue to warrant the new dependencies.
The problem is pinning dependencies. So while an R analysis written using base R 20 or 30 years ago works fine, something using dplyr is probably really difficult to get up and running.
At my old work we took a copy of CRAN when we started a new project and pinned dependencies from that snapshot.
So instead of asking for dplyr version x.y, as you'd do ... anywhere, we added dplyr as it and its dependencies were stored on CRAN on that specific date.
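You can get the same effect today by pointing `repos` at a date-stamped snapshot; Posit Package Manager serves CRAN frozen by date (the date below is an arbitrary example, not the one we used):

```
# Install dplyr and its dependencies exactly as CRAN stood on a given date
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/2022-06-01"))
install.packages("dplyr")
```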
We also did a lot of systems programming in R, which I thought was weird, but it was for the exact same reason you're describing for Python.
But R is really easy to install, so I don't see why you can't set up a step in your pipeline that runs R, or even both R and Python. They can read dataframes from each other's memory.
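For instance, Arrow's feather format gives a near-zero-copy handoff between the two (strictly via a file rather than shared memory); a minimal sketch of the R half, with the file name as a placeholder:

```
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
write_feather(df, "df.feather")

# The Python step reads the same file with pyarrow:
#   import pyarrow.feather as feather
#   df = feather.read_feather("df.feather")
```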
I tried Shiny a few years back and frankly it was not good enough to be considered. Maybe it's matured since then; I'll give it another look.
> Not having a Django-like web stack, or the others Python has, says more about the users of R than about the language per se. Its background was replacing S, a proprietary statistics language, not competing with the Perl used in CGI and the early web.
I'm aware, but that doesn't address the problem I pointed out in any way.
> R is very powerful and is Lisp in disguise, coupled with the same infrastructure that lets you use C under the hood, like Python does for most libraries/packages.
Things I don't want to ever do: use C to write a program that displays my R data to the web.
For capital P Production use I would still rewrite it in rust (polars) or go (stats). But that’s only if it’s essential to either achieve high throughput with concurrency or measure performance in nanoseconds vs microseconds.
Thanks for posting!
https://dave.autonoma.ca/blog/2019/07/11/typesetting-markdow...
However, most workflows and nearly all editors don't support interpolated variables. To address this, first I developed a YAML preprocessor:
https://repo.autonoma.ca/yamlp.git
Then I grew tired of editing YAML files, piping files together, and maintaining bash scripts. So next, I developed KeenWrite to allow use of interpolated variables directly within documents from a single program. The screenshots show how it works.
e.g. avoid dplyr masking stats::filter:
`use("dplyr", c("mutate", "summarize"))`
(Actually already available since R 4.4.0.)
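A quick sketch of what that buys you (requires R >= 4.4.0):

```
# use() attaches only the named exports, so stats::filter stays visible
use("dplyr", c("mutate", "summarize"))

mutate(head(cars), ratio = dist / speed)  # dplyr::mutate is attached
filter(lh, rep(1/3, 3))                   # still stats::filter (moving average)
```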
For engineering stuff I want strong static analysis (type hints, pydantic, mypy), observability (logfire, structlog), and support (can I upload a package to my cloud package registry?).
For ML stuff, I want the libraries everyone else uses (pytorch, huggingface), because popularity brings a lot of development, documentation, and obscure GitHub issues that the R clones lack.
Userbase matters. In R, hardly any users are doing any engineering; most R code only needs to run successfully one time. The ecosystem reflects that. The python-based ML world has the same problem, but the broader sea of python engineers helps counterbalance.
There’s a ton more Python code out there, so LLM reliability on Python code just makes my life easier. R was great and still is, but my world is now more than just data eng, model fitting, and viz. I have to deal with operationalizing and with working with people who aren’t just data science, and most orgs don’t have the luxury of an easy production R system. With Python I can get my code over the line and trust a good engineer will be okay smushing it into the production stack, which is likely heavy on Python (instead of hearing "oh, we don’t work with R, we do Python and Java, so it will take 3-5x longer").
Another sad truth is that the cool ML kids all want to do PyTorch deep-learning training / post-training / RLHF / PPO / GDPR GTFO, so you're not real hardcore ML if you only do R. I know it's stupid, but the world is kind of like that.
You want to hire people who want to build their careers on the cool stack. I know it's not all the cool talk the hackers here play with, but for real-world applications I have a lot of other considerations.
Having seen Julia proposed as the nemesis of R (not Python; that one's too political, non-lispy)
>the creator of the R programming language, Ross Ihaka, who provided benchmarks demonstrating that Lisp’s optional type declaration and machine-code compiler allow for code that is 380 times faster than R and 150 times faster than Python
(Would especially love an overview of the controversies in graphics/rendering)
In terms of performance, DF.jl seems to outperform dplyr in benchmarks, but for day to day use I haven't noticed much difference since switching to Julia.
There are also APIs built on top of DF.jl, but I prefer using the functions directly. The most promising seems to be Tidier.jl [2] which is a recreation of the Tidyverse in Julia.
In Python, Pandas is still the leader, but its API is a mess. I think most data scientists haven't used R, and so they don't know what they're missing out on. There was the Redframes project [3] to give Pandas a dplyr-esque API which I liked, but it's not being actively developed. I hope Polars can keep making progress in replacing Pandas, but it's still not quite as good as dplyr or even DF.jl.
For plotting, Julia's time to first plot has gotten a lot better in recent versions; from memory, it's gone from something like 20 seconds a few years ago down to 3 seconds now. It'll never be as fast as matplotlib, but if you leave your terminal window open you only pay that price once.
I actually think the best thing to come out of Julia recently is AlgebraOfGraphics.jl [4]. To me it's genuinely the biggest improvement to plotting since ggplot which is a high bar. It takes the ggplot concept of layers applied with the + operator and turns it into an equation, where + adds a layer on top of another, and the * operator has the distributive property, so you can write an expression like data * (layer_1 + layer_2) to visualise the same data with two visualisations. It's very powerful, but because it re-uses concepts from maths that you're already familiar with, it doesn't take a lot of brain space compared to other packages I've used.
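For comparison, a rough ggplot2 analogue of `data * (layer_1 + layer_2)` spelled with repeated `+` (purely illustrative, not AlgebraOfGraphics itself):

```
library(ggplot2)

# AoG's data * (points + smoother), written the ggplot2 way:
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +                 # layer_1
  geom_smooth(method = "lm")     # layer_2
```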
[1] https://dataframes.juliadata.org/
[2] https://github.com/TidierOrg/Tidier.jl
[3] https://github.com/maxhumber/redframes
[4] https://aog.makie.org/
The invention of the Tidyverse freed new R programmers from 126 pages of gotchas.
Tell them to learn to use the tidyverse instead. For most of them, that will be all they ever need.
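For the record, the kind of pipeline that covers most everyday needs; a minimal sketch on the built-in `mtcars` data:

```
library(dplyr)

# Each verb does one thing; the pipeline reads top to bottom
mtcars |>
  filter(cyl %in% c(4, 6)) |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg), n = n())
```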
I so disagree. I've used R for plotting and a bit of data handling since 2014, I believe, to prove to a colleague I could do it (we were young). After all this time I still can't say I know how to do anything beyond plotting a simple function in R without looking up the syntax.
Last week I needed to create two figures, each with 16 subplots, and make sure all the subplot axis labels and titles are readable when the main text is readable (with the figure not more than half a page tall). On a whim I tried matplotlib, which I'd never tried before and... I got it to work.
I mean I had to make an effort and read the dox (OMG) and not just rummage around SO posts, but in like 60% of the time I could just use basic Python hacking skillz to intuit the right syntax. That is something that is completely impossible (for me anyway) to do in R, which just has no rhyme or reason, like someone came up with an ad-hoc new bit of syntax to do every different thing.
With Matplotlib I even managed to get a legend floating on the side of my plot. Each of my plots has lines connecting points on slightly different but overlapping scales (e.g. one plot has a scale of 10, 20, 30, another 10, 20, 30, 40, 50), but they share some of the lines and markers automatically, so for the legend to make sense I had to create it manually. I also had to adjust some of the plot axis ticks manually.
No sweat. Not a problem! By that point I was getting the hang of it so it felt like a piece of cake.
And that's what kills me with R. No matter how long I use it, it never gets easier. Never.
I don't know what's wrong with that poor language and why it's such an arcane, indecipherable mess. But it's an arcane and indecipherable mess and I'm afraid to say I don't know if I'll ever go back to it again.
... gonna miss it a little though.
Edit: actually, I won't. Half of my repos are half R :|
```
> methods(mean)
[1] mean.Date     mean.POSIXct  mean.POSIXlt  mean.default  mean.difftime
see '?methods' for accessing help and source code
```
That's not hiding anything; it's just abstraction.
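Concretely, the generic just picks the method matching the object's class:

```
x <- as.Date(c("2024-01-01", "2024-01-03"))
mean(x)           # dispatches to mean.Date    -> "2024-01-02"
mean(c(1, 2, 3))  # dispatches to mean.default -> 2
```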
Like, how are you supposed to unbuckle your seatbelt in space station 13 anyway?
One comment: it would be good to distinguish between books that are free and books that you have to pay for.
I’ve been tempted to port to Python, but some of the stats libraries have no good counterparts, so: is there an ergonomic way to do this?
R and RMarkdown were big inspirations for what we're building at evidence.dev now, so we're very grateful to everyone involved in the R community.