Seam carving is a content-aware image resizing algorithm, and I've been working on a JavaScript implementation of it. It's as fast an I can make it while still using a naïve algorithm. Well as fast as I can make it for now anyway.
Here's some sample output, the original is on the left, the resized version is on the right:
This article is about optimising the code, so I won't go into how seam carving works. If you want to read about it the wikipedia page does a good enough job.
I didn't set out to write a seam carving library, I only wanted to use one. I run a discord bot that (among other things) lets users manipulate images, and I wanted a content-aware resize command. It's written in JavaScript and uses node-canvas to manipulate images, so I found seam-carving-js, which is also written in JavaScript and works with canvases.
It does the job, but it's slow and memory hungry. My bot runs on a VPS with a gig of memory, and eating up half a gig to process one image causes a lot of thrashing. I attempted a hack to help the memory usage, but it didn't do much.
Here's some benchmarks to show how slow it is, I reduced the width of two images by half:
image | size | time |
---|---|---|
dog.jpg | 800x524 | 38732 ms |
castle.jpg | 1428x968 | 171990 ms |
Almost three minutes for castle.jpg. I know my Intel i5 4670K is getting on a bit but surely we can do better.
The first pass was a straightforward implementation of the algorithm. I used typed arrays everywhere and created them with the correct size before the main loop to avoid allocating memory. This worked quite well, here are the benchmarks:
image | time | speed increase |
---|---|---|
dog.jpg | 2491 ms | 15.6x |
castle.jpg | 14630 ms | 11.8x |
Memory usage was also low enough to be not noticeable. This was pretty good, but I still thought 15 seconds was too long for castle.jpg.
Despite trying to avoid allocation, one instance slipped through the net. I fixed that by re-using an array instead of allocating a new one, which offered a small performance benefit:
image | time | speed increase |
---|---|---|
dog.jpg | 2323 ms | 1.07x |
castle.jpg | 14027 ms | 1.04x |
I then split the main loop into a set of small functions. This was mainly to help with debugging and profiling, but it did give a tiny performance increase:
image | time | speed increase |
---|---|---|
dog.jpg | 2258 ms | 1.02x |
castle.jpg | 13852 ms | 1.01x |
I think this might be TurboFan optimising the small functions sooner than it would optimise the big main function. Or perhaps it can be more selective with what it optimises, rather than doing the whole thing.
Once the algorithm has decided which seam to get rid of, those pixels need to be removed, one per row in the image. I do this in place in the array of pixels and every time one is removed the rest of the array needs to shuffle along to close the gap. I originally did this using a loop that copies each pixel to it's new location, skipping removed ones.
After running out of ideas and staring at the TypedArray docs on MDN, I found Array.copyWithin. It's like memmove()
and lets you shift data around in an array. Using that, I avoid looping over every pixel and instead only loop over every row, copying the left half of the row, skipping the removed pixel, then copying the right half.
I changed two loops to use copyWithin. It made quite a difference:
image | time | speed increase |
---|---|---|
dog.jpg | 890 ms | 2.53x |
castle.jpg | 6029 ms | 2.30x |
Finally under a second for something. An arbitrary goal, but being able to say "...in under a second" feels good. The version of dog.jpg from the commit messages was larger, which is why the time here is different.
The Chrome devtools line-level profiling showed that some lines using Math.min
and Math.max
used up a lot of time. These lines are in the middle of the hottest loop in the library, and my assumption was they cause lots of branches (and branch misses). I know that min and max of two numbers can be compiled to one instruction with no branches, but I don't know if v8 can do that.
I decided to make the branchless versions myself instead.
This is how they work, given x and y:
x - y
(1 - size of the number)
times. 31 in this case.(x - y) & 0
, or 0(x - y) & (232 - 1)
, or x - y
y + 0
, or yy + (x - y)
, or xx - 0
, or xx - (x - y)
, or yThese only work on 32 bit integers. I'm using Uint32Array
everywhere, so I don't have to worry about other types sneaking in.
Here's the benchmark afterwards:
image | time | speed increase |
---|---|---|
dog.jpg | 618 ms | 1.44x |
castle.jpg | 3816 ms | 1.58x |
It worked quite well.
That's where it stands right now. I've made this look like a smooth process, but rest assured most of my ideas made it slower rather than faster, and had to be abandoned.
Here's a comparison between the original library that started this and mine:
image | original | mine | speed increase |
---|---|---|---|
dog.jpg | 38732 ms | 618 ms | 62.67x |
castle.jpg | 171990 ms | 3816 ms | 45.07x |