Category Archives: Projects

Demonstrative Implementation: LZW

There’s something of a “standard literature” in undergraduate algorithms: graph traversals, greedy and dynamic programming, sorts and data structures, and so on. One class of algorithms that seems oddly excluded, at least in my experience, is compression algorithms. So I wanted to explore more of that, in this case by implementing LZW.

I implemented (I think…) a more-or-less standard LZW algorithm here. Continuing my theme of bridging computer science and practice (started, perhaps, with my Bloom filter write-up), I wrote it in C.

Some take-aways:

  • I didn’t really bother with literate coding. I think that was ultimately the right thing to do. Uh oh.
  • My biggest frustration with the example implementations I can find online (mainly from Wikipedia: there’s an old Dr. Dobb’s article, and then a million Rosetta Code implementations all doing the same thing) is that they don’t seem to reflect a practical implementation. A linear scan over the dictionary? I know (better than most, I think…) that linear scans are great, but it’s hard to believe that’s the right thing here; see the sketch after this list.
  • The Sedgewick implementation for his undergrad book seems OK.
  • I think I can make an interesting short-course on how I iteratively improved performance.
  • I really dropped the ball implementing clear codes; that was a mess and a half. But it’s exactly the complications of “really” implementing clear codes, variable-length coding, and so on that academic resources skim over, and that frustrates me! Absent my efforts (such as they are), it seems students’ options are to ignore them entirely or dive into gzip’s source base.
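
For a sense of what I mean by “practical,” here’s a minimal sketch (in C++, and not the code from my repo) of the compressor’s inner loop: a hash map keyed on (prefix code, next byte) replaces the linear dictionary scan, and one simple clear-code policy resets the dictionary when it fills. The fixed 12-bit code space is an assumption for illustration; a real implementation also does variable-length coding, and the decoder has to mirror the same reset policy, which is exactly the part that gets skimmed.

// Minimal LZW compressor sketch: a hash map keyed on (prefix code,
// next byte) replaces the linear dictionary scan. Fixed 12-bit codes
// and a "reset when full" clear-code policy, for illustration only.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

const uint32_t CLEAR_CODE = 256;        // reserved code: "reset dictionary"
const uint32_t FIRST_FREE = 257;        // first code after literals + clear
const uint32_t MAX_CODE   = 1u << 12;   // 12-bit code space

std::vector<uint32_t> compress(const std::vector<uint8_t>& in) {
    // Key packs (current code, next byte); avoids string keys entirely.
    std::unordered_map<uint32_t, uint32_t> dict;
    uint32_t next_code = FIRST_FREE;
    std::vector<uint32_t> out;
    if (in.empty()) return out;

    uint32_t cur = in[0];  // codes 0..255 are the single bytes themselves
    for (size_t i = 1; i < in.size(); i++) {
        uint32_t key = (cur << 8) | in[i];
        auto it = dict.find(key);
        if (it != dict.end()) {
            cur = it->second;               // extend the current match
        } else {
            out.push_back(cur);             // emit longest known string
            if (next_code < MAX_CODE) {
                dict[key] = next_code++;    // O(1) insert, no linear scan
            } else {
                out.push_back(CLEAR_CODE);  // dictionary full: reset
                dict.clear();
                next_code = FIRST_FREE;
            }
            cur = in[i];
        }
    }
    out.push_back(cur);
    return out;
}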

I have a PDF with more thoughts here: https://github.com/agorenst/lzw/blob/master/lzw.pdf; there you can see how I’m trying to work out the framing.

Demonstrative Implementation: Bloom Filter

The Bloom filter seems to be a “sweet spot” for people doing hobby projects: simple enough to implement, unusual enough to be interesting, and complicated enough to feel satisfying.

I am interested in “postbaccalaureate” projects for self-study, and using those projects to bridge the transition from a recent grad to a professional programmer (still with a computer science bent). So I turned to this project for that purpose.

It went… OK!

  • I got really caught up in the literate coding stuff, and I think the code was, in effect, too simple. It was more the surrounding machinery that was interesting.
  • The driver code ended up being the complicated thing, which suggests that running the experiments (which happens a lot in my experience!) is more interesting than the implementation proper.
  • Things like the arbitrary-length bit vector, and the constituent pages, also seemed to be a bit more interesting; see the sketch after this list.
  • I did really like the experiments, though. It was fun figuring out how to automatically make the charts.
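
For a taste of why the bit vector is its own little problem, here’s a minimal sketch (again, not the repo’s code) of an arbitrary-length bit vector serving as a Bloom filter’s backing store:

// Arbitrary-length bit vector sketch: the Bloom filter's backing store.
// An illustration, not the repo's implementation.
#include <cstddef>
#include <cstdint>
#include <vector>

class BitVector {
    std::vector<uint64_t> words;  // bits packed 64 to a word
    size_t nbits;
public:
    explicit BitVector(size_t n) : words((n + 63) / 64, 0), nbits(n) {}
    void set(size_t i)        { words[i / 64] |= uint64_t{1} << (i % 64); }
    bool test(size_t i) const { return (words[i / 64] >> (i % 64)) & 1; }
    size_t size() const       { return nbits; }
};

// A Bloom filter insert is then k calls to set() on hashed indices, and
// a query is k calls to test(): any zero bit means "definitely absent."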

I have a bigger write-up here: https://github.com/agorenst/bloom-filter/blob/master/bloomfilter.pdf

Adventures in Caching Algorithms: Part 1, Getting Input

The subject of caching recently piqued my interest. I encountered caching in my undergrad OS class (though I suppose that was really paging algorithms), but otherwise haven’t encountered it much. It seems to me that caching is hugely important in applications of computer science (in my one hardware course, the professor shared some famous quote to the effect that “caching is the one good idea computer science has had”…) but isn’t taught in traditional undergrad curricula–at least, not as part of algorithms.

My impression so far is that caching algorithms have been deemed theoretically uninteresting. Belady’s algorithm is a nice result (and certainly approachable for undergrad greedy algorithms, I think…), but otherwise it seems the dominating factor in a “good” algorithm is the workload. But still, I want to learn more about them. I’ve been on a kick of learning more empirical things, and, as with my SAT solver, I’ve taken to problems (and solutions) that empirically show improvements but elude a natural proof of their optimality.
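
To make the greedy-algorithms connection concrete, here’s a naive sketch of Belady’s rule: on a miss with a full cache, evict the resident block whose next use lies farthest in the future. It assumes the whole trace is known up front, which is exactly what makes it offline-optimal and impractical:

// Belady's MIN: with the whole trace in hand, on a miss in a full cache,
// evict the resident block whose next use is farthest away (or never).
// Deliberately naive and quadratic; a real version precomputes next-use
// indices. Assumes capacity >= 1. Returns the miss count.
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

size_t belady_misses(const std::vector<uint64_t>& trace, size_t capacity) {
    std::unordered_set<uint64_t> cache;
    size_t misses = 0;
    for (size_t t = 0; t < trace.size(); t++) {
        if (cache.count(trace[t])) continue;  // hit: nothing to do
        misses++;
        if (cache.size() == capacity) {
            uint64_t victim = 0;
            size_t farthest = 0;
            for (uint64_t block : cache) {
                size_t next = trace.size();  // default: never used again
                for (size_t u = t + 1; u < trace.size(); u++) {
                    if (trace[u] == block) { next = u; break; }
                }
                if (next >= farthest) { farthest = next; victim = block; }
            }
            cache.erase(victim);
        }
        cache.insert(trace[t]);
    }
    return misses;
}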

Getting Input: PIN

So I knew I wanted to simulate a cache. For that I need input. Oh, I could use an RNG with various distributions and everything, but I was curious whether there was a way to get more “natural” input. So I did some quick searching (that is, I asked a friendly AI) and learned about PIN. I think it’s really cool! It provides natural hooks to “record” what goes on in an execution at the assembly level, so I was able to do basically exactly what they show in the examples and create a tool that emits all the instruction addresses (for the icache) and all the memory reads and writes.

So with my tool written (more on that later) I can run it like:

 ./../pin/pin -t obj-intel64/trace_tool.so -- /bin/ls

Working inside-out, that will run the “usual” ls command (like, just list the directory contents), but the binary will be intercepted and instrumented by the pin tool. That pin tool will run the code in my utility (which I’ve named trace_tool), which emits records to a designated output file. That output basically looks like this:

i:0x7fa5505b8c21
r:0x7fa550744b60
i:0x7fa5505b8c28
i:0x7fa5505b8c2a
i:0x7fa5505b8c00
w:0x7fa550744b60
i:0x7fa5505b8c0b

I basically have three functions in my PIN tool. All three of them just print the provided address, but one prefixes it with “i”, another with “r”, and another with “w”. So I ask pin to put the first behind every instruction, the second behind every read op, and the third behind every write op. That gives me the output I want! I haven’t validated things in any more rigorous way, but a quick scan makes this “look good” to me.
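
For reference, here’s roughly what such a tool looks like. This is a sketch modeled on the pinatrace example that ships with Pin, not my exact trace_tool source, and the output filename here is made up:

// Sketch of a Pin tool along these lines, modeled on the "pinatrace"
// example from the Pin kit. The output filename is invented here.
#include <stdio.h>
#include "pin.H"

static FILE* out;

// The three printing functions: same address, different prefix.
static VOID RecordIns(VOID* ip)   { fprintf(out, "i:%p\n", ip); }
static VOID RecordRead(VOID* ea)  { fprintf(out, "r:%p\n", ea); }
static VOID RecordWrite(VOID* ea) { fprintf(out, "w:%p\n", ea); }

// Pin calls this once per instruction as it is first encountered.
static VOID Instruction(INS ins, VOID* v) {
    // Log the instruction's own address (the icache stream).
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordIns,
                   IARG_INST_PTR, IARG_END);

    // Log the effective address of every memory read/write operand.
    UINT32 memOps = INS_MemoryOperandCount(ins);
    for (UINT32 i = 0; i < memOps; i++) {
        if (INS_MemoryOperandIsRead(ins, i))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordRead,
                                     IARG_MEMORYOP_EA, i, IARG_END);
        if (INS_MemoryOperandIsWritten(ins, i))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordWrite,
                                     IARG_MEMORYOP_EA, i, IARG_END);
    }
}

static VOID Fini(INT32 code, VOID* v) { fclose(out); }

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    out = fopen("trace_tool.out", "w");
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();  // never returns
    return 0;
}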

Of course, the way these hobby projects work, I can’t just say I got what I want and dive into caching algorithms. I have to do this part even better. More on (obvious…) performance improvements to come!