I vaguely remember that something about the VGA architecture of the day made this approach much slower, but my recollection is fuzzy and I may be misremembering. I'm hoping someone will chime in and remind me what I'm thinking of.
It might also just have been that this approach didn't work well with my lookup table optimization (see my other post).