Why Programming Should Be Hard

Or… How I Learned To Stop Worrying And Love The Free Monad

“That’s not easy what I just did!” — Carl (River Phoenix) from Sneakers

I hate the word “elegant.” It’s one of those words you hear a lot when talking to programmers, like “refactor” or “trivial.” Like those words, it’s overused. And like all overused words, it’s lost its meaning. “Elegant” (when used in the context of programming) once referred to a program or architecture that was so beautifully simple that it was almost unbelievable. The recursive solution to the Towers of Hanoi is elegant. Most code isn’t. We sure like to use that word a lot, though.

I hate “elegant” because it’s become a rhetorical bludgeon used against complex code that must be complex in order to accomplish its task. Not every problem has a Towers-of-Hanoi-esque solution. Most don’t, or at least the world is too short of geniuses to invent truly elegant solutions to every problem. Also, programming problems are becoming more complex every day. My team writes cloud software, which means we need to write code that runs across hundreds or thousands of independent compute nodes and somehow appears to an end user to accomplish a coherent, unified task. Worse (or better, depending on your perspective), we write IoT software for industrial machines, which means our software has to process hundreds of thousands of discrete events per second and somehow make sense of all that data in a way that can be presented to a human.

In order to manage the complexity of these new programming tasks, we as a community of software engineers must invent new tools. Those tools include:

  • New programming languages with better facilities
  • New programming paradigms (OOP, functional, immutable, reactive)
  • New vocabularies to describe and encode solutions to harder problems (category theory, type-level programming, stream processing)
  • Entirely new types of systems (cluster managers, NoSQL databases, streaming frameworks)

All of these are difficult to learn. Code written in a language you don’t know, using paradigms you’re not familiar with, in vocabularies you have never learned is likely to be incomprehensible to you. If you call this code “inelegant,” you’re not helping. You are making excuses for your own refusal to learn something new and difficult. It’s not a programmer’s duty to make code easy to comprehend for those unwilling to learn. It’s a programmer’s duty to learn, so that they can write and comprehend better and more powerful code.

The reluctance to learn difficult things is understandable, and I can even understand why someone might get defensive about it and start throwing “inelegant” around. But this is ultimately not constructive. Every software engineer who wants to evolve with the industry has to keep learning. In this post, I’ll talk about my personal learning journey and what I’ve learned about learning, so to speak, from early in my career to as recently as this past year.

The Dangers And Rewards Of Geeking Out

I am a geek. I define “geek” as someone who likes things — often certain kinds of things — just because he or she thinks they’re cool. Geeks often make good engineers, and engineers are often geeks. You can geek out about anything really — I geek out about cooking. I also geek out about software engineering. This has, on occasion, gotten me into some trouble.

Software engineering, like most human endeavors, is a game. It has constraints (i.e., rules), and you’re trying to win it (i.e., ship a successful product). Winning games is all about understanding the rules and finding optimal ways to work within those rules to produce the desired outcome. In any sufficiently complex game, the problem space is large enough that there’s plenty to geek out about. An enthusiast for the game of chess might geek out about openings, for instance. I believe this is why many engineers are also gamers. Any game can be a dress rehearsal for the real thing.

Where geeking out can go wrong is when it steers you away from winning the game. Most deep dives into a geek-out session begin with the best of intentions: maybe you read about some cool new software engineering technique. Maybe you did a book club on it. Now you’re in a big hurry to put it into practice. At some point though, you stop caring that you learned about this thing in the first place to win the game, and start caring more about doing the thing. This might lead to doing the thing badly, and in particularly bad cases, you and others might rack up a list of failures that are then used as evidence of why the thing is a bad idea in the first place.

My Awkward Unit Testing Adventure

This happened to me with unit testing. When I first read about automated testing — and unit testing in particular — I was immediately sold. At first, I was thinking about the game. If we write good unit tests, I thought, we will save countless hours of manual testing time, find bugs earlier, and have code divided into small, comprehensible units that are independently tested. Then I — and other like-minded geeks I worked with — started doing it.

The problem was, nothing works like it does in books. When you start working with real software — legacy software in particular — compromises must be made. Engineering trade-offs must be considered. The game has to be played and played well. However, at this point I was so geeked out over unit testing that I was determined to make it work by any means necessary. Because it was cool!

The result was a disaster. Thousands of lines of unmaintainable, often-useless, poorly performing tests were written by myself and others. About the only thing this accomplished was to give ammunition to those who were skeptical of automated testing from the outset. Now they had a real live train wreck they could point to as evidence that they had been right all along. The group then entered a sort of testing dark age that took years of concerted effort to recover from. That recovery was still in progress when I left that company.

A Better Way

I like to think I learned the hard-won lesson from that experience. A year or two later, I geeked out about Test Driven Development (TDD). As with unit testing, I had read about it in books and various blogs, and I had a concrete idea of how it could improve our software engineering — even how it could right some of the wrongs from the misadventure above. Also as before, I totally geeked out about it. But this time I caught myself and used my geek powers for good.

As with unit testing, I immediately ran into problems applying the practice to actual code. However, this time I kept the game in mind. Ultimately, I wanted to make our software engineering better and more efficient. This meant first and foremost that I accept failure as a possibility. If I can’t make TDD work in our code base, I am doing no one any favors by using it anyway just because it’s cool.

Then — and this was my epiphany — I accepted that programming is supposed to be hard. I decided to have some faith in the people that wrote these books and blogs — that they were able to make these technologies work with real software. If I stumbled early on, I told myself, the blame is likely mine, not the technology’s. I kept trying, and I practiced. I referred back to the literature and sought out more reading and advice when I got stuck. Ultimately, I reached the point where the technology I geeked out over met with my goal of playing the game more effectively.

Obviously, healthy skepticism of a technology is fine, but don’t let your skepticism rule you. Too many people do. People with unhealthy skepticism are the same folks who call necessarily-complex code “inelegant.” Take the time to watch the first 3:40 or so of this YouTube video. Most of it is about how to play guitar — another thing I geek out about — but the first part is great advice about getting good at anything difficult and dealing with skepticism — yours and others’. It’s almost identical to my own mental model on the subject.

Once you have a comfortable mastery of the technology and a decent track record of applying it to real problems, evangelize it. Start with people more likely to accept it and less likely to show unhealthy skepticism (fellow geeks!). Pretty soon, you’ll have built a tribe not only of believers, but fellow experts that can help spread the use of the technology throughout the organization, and ultimately move a step closer to winning the game.

You Said Something About A Free Monad?

This brings us to the final chapter of our story. In my current position, I am fortunate to have a lot of very smart programmers working for me. One of them, Phil Quirk, I have worked with for several years, including at the company involved in my anecdotes above. He likes to geek out about category theory and functional programming. He also believes that they allow him to write more powerful code that’s easier to reason about. For a long time, I stubbornly disagreed with him about this. This is the story of how I was wrong.

I am a firm believer in writing easy-to-read, easy-to-maintain code. Clean Code — one of my favorite books — deals with this subject exclusively. Code is read and changed a lot more often than it is written (which is one time), so of course any engineer would optimize for readability and maintainability. In my discussions with Phil however, I caught myself confusing “easy to read and maintain” with the misuse of the term “elegant” that bothers me so much. I was calling Phil’s code “inelegant” for the same reasons my code had been called that so many times before: because I lacked the patience to learn the vocabularies necessary to understand it. When I realized this, I decided to give Phil and his categories a fair shake.

A book club (on the excellent Scala with Cats) and a few programming experiments later, I am not only a convert but also ashamed that I dug in my heels for so long. In the book club, I raised a question to the group: “Is it fair to require anyone who wants to play in our code base to learn these concepts, even though they are difficult and unfamiliar?” The unanimous answer: of course. Knowing these concepts makes us better at playing the game. If you want to play the game at our level, you must know them.

Phil also loves Dr. Strangelove. I hope that the alternate title of this blog post will suffice as an apology.

Conclusion

Programming is a constant learning journey, at least if you want to continue to work on new and cutting-edge problems. Somehow, the ethos of “false elegance” has crept into the software community, and this sometimes causes us to be inappropriately cynical about new techniques and ways of reasoning about problems. In the worst cases, we disparage these new methods, because if we try to use them and fail, we might look foolish. We should stop this and embrace increasing complexity as an unavoidable fact of technology itself. Only by continuing to be smarter than the problems we’re trying to solve will we continue to win the game.

In future blog posts, I and others will delve into some of the technical details behind some of the technologies I mentioned above. One might even involve my adventures using the free monad in database programming. In them, we’ll show how embracing necessary complexity empowers you to solve difficult problems with — dare I say it? — elegance. Stay tuned!

Conquering Streaming Analytics With Akka And Kafka

Introduction

I almost called this post “conquering the non-idempotent event materialization function,” in reference to my event sourcing wall o’ text from last year, but I didn’t want to lose 80% of my audience before the first paragraph. It would have been an appropriate title, however. Still here? In the previous event sourcing post, I talked about the challenges of working with events whose materialization functions are non-idempotent, meaning that if you replay the same event multiple times, your system does not end up in the same state.

A trivial example of this would be an event that says “$100 added to bank account.” If you play this event 4 times, you add $400 to the account. By contrast, an event that says “account balance is now $100” can be played as many times as you want, and the account balance will remain consistent. In the previous post, I concluded that the latter type of event is preferable, since it allows you to materialize events into system state in an idempotent fashion, and duplicate events are a non-issue. This is convenient, because it’s common for Kafka to replay duplicate events (and in general, “at least once” is a much easier delivery guarantee to enforce than “exactly once”).
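
To make the distinction concrete, here is a minimal sketch in Scala; the event and state types are hypothetical and exist only to illustrate the two flavors of materialization:

```scala
// Hypothetical event and state types, purely for illustration.
sealed trait AccountEvent
final case class AmountAdded(amount: BigDecimal) extends AccountEvent // non-idempotent to replay
final case class BalanceSet(balance: BigDecimal) extends AccountEvent // idempotent to replay

final case class AccountState(balance: BigDecimal)

def materialize(state: AccountState, event: AccountEvent): AccountState = event match {
  case AmountAdded(a) => state.copy(balance = state.balance + a) // replaying this twice adds the amount twice
  case BalanceSet(b)  => state.copy(balance = b)                 // replaying this any number of times is harmless
}

// Replaying the same AmountAdded event four times adds $400, not $100:
val afterReplay =
  List.fill(4)(AmountAdded(BigDecimal(100))).foldLeft(AccountState(BigDecimal(0)))(materialize)
// afterReplay.balance == 400
```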

In this post, I detail a case where idempotent materialization was impossible: a streaming analytics routine that counted incoming numerical data and aggregated its sum, very similar to aggregating “$100 added to account balance” events. We have several new cases where this is critical. For instance, KUKA’s “ready2_fasten” screw fastening package uses our KUKA Connect cloud application to report metrics on screw fastening over time, such as the number of successful or unsuccessful attempts in a particular time window. In this case, the events looked like “this many new screws have been fastened.” We had to prevent duplicate event replay if the data were to be accurate, and it had to be done in a way that was resilient to transient Kafka issues, microservice crashes/reboots/re-deployments, and network retries. Here is how we did that.

…And How We Didn’t Do That

We did not use Kafka Streams to solve this problem. Soon, I will write a post detailing why we abandoned Kafka Streams for the time being due to various challenges implementing it in production. For now, suffice it to say that we had concluded it was not an option. This was a shame, because if it had actually worked for us, Kafka Streams has some powerful facilities for defining non-idempotent aggregations without having to worry about many of the complicated underlying details.

We did use Akka Streams and Akka’s Kafka connector to accomplish this, however. This means we had to implement a few things that Kafka Streams supposedly handles for you, such as guarding against duplicate event replay even when events are consumed and republished by what I called “republish systems” in the previous event sourcing post (systems that consume Kafka messages, transform them, and publish the results to new topics). This had to hold true even if the republish system and the aggregation system were in different microservices.

The Details

The breakthrough in this approach is relatively obvious in retrospect: use the Kafka partition offset. Any time you receive a Kafka message, it has an “offset”, which is a 64-bit integer that increments by one for every message published to a Kafka partition. Every message with an offset of n occurred after any message with offset <n and before any message with offset >n. Also, due to the way Kafka handles partitioning, we knew that every message for any particular “thing” (in this case a robot, represented by a Kafka partition key corresponding to the robot’s serial number) lands in the same partition, and is therefore ordered with respect to any other message from that robot. Therefore, the procedure became, simply:

  • When you aggregate events into a state for a robot, that state should include the Kafka offset of the latest message that was aggregated into that state.
  • If you encounter a message with an offset earlier than or equal to the one stored in the current state, discard it (see the sketch after this list).
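
Here is a minimal sketch of that guard in Scala, assuming a hypothetical per-robot aggregation state (the RobotTotals type and its fields are illustrative names, not our production types):

```scala
// Hypothetical per-robot aggregation state; lastOffset is the offset of the newest
// Kafka message already folded into this state.
final case class RobotTotals(fastenings: Long, lastOffset: Long)

// Aggregate one "this many new screws have been fastened" event, guarded by the offset:
// a replayed message whose offset we have already seen is simply discarded.
def aggregate(state: RobotTotals, newFastenings: Long, offset: Long): RobotTotals =
  if (offset <= state.lastOffset) state // duplicate replay: ignore
  else RobotTotals(state.fastenings + newFastenings, lastOffset = offset)
```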

This sounds straightforward, but there was some complexity to it. We had several stages in our streaming app that were effectively “flat maps,” meaning that one incoming event could produce several outgoing events. The outgoing events all corresponded to a single incoming Kafka message with a single offset. This meant two things:

  1. We needed to keep track of the source message offset through all stages of the Akka stream so we could update any aggregation state at any stage with the latest offset of the original Kafka message that produced that state (and discard earlier ones as described above).
  2. Events produced by a “flat map” stage (i.e. multiple events produced by a single Kafka message) must be processed as a batch! It was critical that we never process part of a batch and then crash. If we did, we would have no way of knowing how far into the collection produced by that single Kafka message (and its single offset) we had gotten before the crash. This could cause us to either lose data (by skipping part of a partially-processed collection) or replay duplicate events (by double-processing part of a partially-processed collection), both of which were unacceptable. Therefore, we passed “flat mapped” events through the stream as a collection (usually a Scala immutable.Queue due to its good append/prepend/dequeue performance) with a single attached Kafka offset, and the aggregate states were updated and persisted only after processing the entire collection. A sketch of this follows the list.
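
Here is a rough sketch of that batching idea, again with hypothetical event and state types standing in for our real ones:

```scala
import scala.collection.immutable.Queue

// Hypothetical event and state types, for illustration only.
final case class ScrewEvent(newFastenings: Long)
final case class RobotTotals(fastenings: Long, lastOffset: Long)

// A "flat mapped" batch: several events that all came from one Kafka message,
// carried through the stream together with that message's single offset.
final case class Batch(events: Queue[ScrewEvent], sourceOffset: Long)

// The aggregation stage folds the whole batch before the updated state (and its
// lastOffset) is persisted, so a crash can never leave a batch half-applied.
def aggregate(state: RobotTotals, batch: Batch): RobotTotals =
  if (batch.sourceOffset <= state.lastOffset) state // replayed message: discard the entire batch
  else RobotTotals(
    fastenings = batch.events.foldLeft(state.fastenings)(_ + _.newFastenings),
    lastOffset = batch.sourceOffset
  )
```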

Persistence And Load Handling

The last remaining problem to solve was how to persist and load aggregated states. When a microservice boots up and starts receiving Kafka messages, we first need to load the current state of every aggregation stage in the Akka stream (which, remember, includes the latest-processed Kafka offset for each stage). Then we can start playing events on top of that state (or discarding events whose Kafka offsets are too early).

For this, we used a mix of Redis and Cassandra. A full discussion of the trade-offs between the two is somewhat beyond the scope here, but the short version is that Redis is better with smaller, faster data while Cassandra is better with bigger, slower data. We had both flavors of data in our system.

Akka gave us a lot of flexibility in how to persist states. The basic approach we used was:

  • Pick a persistence interval. This is how often we write our aggregated states to Cassandra/Redis. It is a tunable parameter of the system: the longer it is, the less we stress the database, but the more Kafka events we potentially have to replay after a reboot or crash. Also, we can only count on our data being as “fresh” as this interval. We used 1 minute. This was very easy to accomplish using Akka’s built-in throttle and conflate functions (see the sketch after this list). We throttled the stream elements representing updated state to 1 per minute per device, and conflated them by simply selecting the most recent state. This is what went to Cassandra/Redis. Kafka Streams has no equivalent of throttle/conflate, or really any time- or rate-aware streaming primitives, which is one of my chief gripes about it.
  • Only commit the message to Kafka after the states that include that Kafka event have been written to Cassandra/Redis. This means that if the system crashes, we will replay any lost Kafka events. We might also replay events that have already been persisted if the system crashes after persisting to Cassandra/Redis but before committing to Kafka. However, this is okay since we ignore Kafka messages with an offset earlier than the latest one that has already been aggregated into the state, as described above.
  • Persist states in the reverse order that the stages appear in the Akka stream. This required some thought. If a stream has four aggregation stages, what happens if you persist only two of them and then crash? When you recover, if the ones you persisted before crashing are the earlier stages, then you might have earlier stages in the stream with later Kafka offsets. If an earlier Kafka message comes in, the earlier stream stage with the later offset will discard that message, and it will never make it to the later stages that still need it, since they were not persisted before the crash. This is trivially avoided by persisting the later (more downstream) stages before the earlier (upstream) ones.
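
As a sketch of the first bullet only, here is roughly what the throttle/conflate combination looks like with Akka Streams; State and persistState are illustrative stand-ins for our real aggregation state and the Cassandra/Redis write:

```scala
import akka.stream.scaladsl.Source
import scala.concurrent.Future
import scala.concurrent.duration._

// Illustrative stand-ins for the real aggregation state and database write.
final case class State(fastenings: Long, lastOffset: Long)
def persistState(s: State): Future[State] = Future.successful(s)

// Given a stream of updated aggregation states for one robot, persist at most one
// per minute, and always the most recent one.
def persisted(states: Source[State, _]): Source[State, _] =
  states
    .conflate((_, latest) => latest) // if states arrive faster than the interval, keep only the newest
    .throttle(1, 1.minute)           // the tunable persistence interval (we used 1 minute)
    .mapAsync(1)(persistState)       // the actual write to Cassandra/Redis
```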

And Then, Oops

Once we had implemented the above system, we ran it in our test environment for several days. To our shock and dismay, we still observed analytics results indicating that duplicate events had been replayed. After some frustrating investigation, we identified the cause: republish systems. I discuss these in the previous event sourcing post: they are systems that take in Kafka messages, do calculations, and publish other Kafka messages representing the results of those calculations.

In the case of our streaming analytics, there was in fact a republish system upstream of our analytics stream. This system would translate raw incoming data from robots into a standard data format that most of KUKA Connect uses. Obviously, this system could crash just like our analytics system. And equally obviously, crashing could cause Kafka to replay recent messages that had not been committed. Since our analytics system paid attention only to the Kafka offset of the topic it was directly consuming, and not to the input topics of any republish systems upstream of it, we could still get duplicate events.

One possible fix would have been to add the same behavior of tracking the Kafka message offset — then discarding earlier messages — into the republish system as well. The problem was that (in this case, at least) the republish system was stateless. It had no aggregation state at all, and we would have had to add one for no other purpose than tracking Kafka offsets. This could have worked, but it would have meant saving to (and possibly stressing) a database where we did not previously do so. Stateful streams are also more memory-intensive and lower-performing than stateless streams. Finally, several services downstream of this republish system were already idempotent and did not care about message replay; in those cases we would be paying the cost of persisting an aggregation state we did not even need. We wanted a general solution that would work for stateless streams just as well as stateful ones.

Fortunately, we already had a standard protobuf message that we used to wrap any message we publish to Kafka event streams, which handled things like the timestamp and a unique ID for the event. To this, we added what we termed a source topic chain: an ordered list of tuples containing the following data for each republish system that a message passes through:

  • Topic name
  • Partition number
  • Partition offset

This collection of topic-information tuples is ordered according to the order in which messages pass through republish systems, so we can always tell the “history” of a message no matter how many times it was republished.

Using this information, we only had to make a small change to our existing logic for tracking the latest Kafka offset in each aggregation state:

  • Instead of looking at the offset of the current message, look at the offset of the very first message in the republish chain, i.e. the “root cause” of this event. Track that offset in the aggregation state and discard messages with earlier offsets, as before.
  • Also track the topic and partition. If either of those changes, assume we changed something about the topology of the republish chain, or possibly repartitioned a Kafka topic. In that case, start over (throw out the tracked latest offset for the old topic/partition and start tracking the new topic/partition/offset). A sketch of this logic follows.
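
Here is a sketch of both rules, with hypothetical class and field names standing in for our actual protobuf wrapper:

```scala
// Hypothetical shapes; our real wrapper is a protobuf message.
final case class TopicChainEntry(topic: String, partition: Int, offset: Long)

final case class WrappedEvent[A](
  payload: A,
  sourceChain: List[TopicChainEntry] // ordered: the head is the original "root" message
)

final case class AggState(total: Long, root: Option[TopicChainEntry])

// Dedup against the offset of the root message, not the directly-consumed one. If the
// root topic or partition changes, assume the republish topology changed and start over.
def shouldProcess(state: AggState, event: WrappedEvent[_]): Boolean =
  (state.root, event.sourceChain.headOption) match {
    case (Some(prev), Some(root)) if prev.topic == root.topic && prev.partition == root.partition =>
      root.offset > prev.offset // same topic/partition: only process strictly newer root offsets
    case _ =>
      true // no history yet, or topology changed: process and reset the tracked root
  }
```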

Conclusion

Using these approaches together, we were able to implement a streaming analytics system that was resilient to replayed events even though it included non-idempotent event materialization functions. To summarize, the approach was:

  • Track the path of a message through republish systems by appending the topic/partition/offset of every input message to a republish system in the output message of that republish system.
  • In every non-idempotent aggregation state of a stream, also keep track of the latest offset processed, as well as its topic name and partition number.
  • If an incoming message has the same topic/partition and an offset earlier than or equal to the latest one tracked in the aggregated state, discard it. If the topic or partition changes, discard the old offset information and start over.
  • Only commit an offset to Kafka after persisting all the aggregation states corresponding to that offset.
  • For a stream with multiple aggregation stages, persist aggregation states in downstream-first order.
  • Use multi-rate Akka primitives such as conflate and throttle to control how often you persist aggregation states. Tune this value to achieve the proper trade-off between performance and data “freshness.”
  • If a streaming stage “flat maps” its inputs, i.e. produces multiple output elements for a single input, avoid Akka primitives that expand the output stream such as mapConcat. Instead, track the output elements as a collection (i.e. the output of the flow should be a collection of things, not a single thing). Process that collection as a single element and only persist the aggregated state after processing the entire collection.
  • Create a class you can use as a streaming element that provides access to the Kafka topic/partition/offset and source topic chain information at every stage in the Akka stream.

Until next time!