Vignettes of a Linux Kernel Mentee

About 5 months ago, over a dinner of oven-baked chicken thighs[1] with rice, I stumbled upon a tweet by Greg Kroah-Hartman about Linux Kernel mentorships. At the time, applications were already closed, but I saw that programs were held three times a year so I put it on my radar.

4 months ago, I saw that applications for the summer program were open, but I was busy with final exams, so I put it aside again.

3 and 1/2 months ago, applications had been open for a while, and though I thought that many people had probably been working on the code challenge and/or making initial contributions for weeks, if not months[2], I decided to apply anyway. I was visiting home and was under quarantine, so I thought what the heck, why not spend all my days and nights diving into the kernel. (This turned out to be a pretty exhilarating way to pass the time, and I can still recall how hard my heart was pounding when I hit send on my first patch.)

3 months ago, I was accepted into the Linux Kernel Mentorship Program[3].

A week in, after completing the initial tasks[4] and while anxiously scrolling through bug reports wondering if I could do anything useful, I fixed my first bug.

A day later, Dan Carpenter pointed out that the Sparse static code analyser had found a warning similar to the bug I had just fixed. So I fixed my second bug.

Very shortly after, Daniel Vetter felt that something was odd, since I had fixed two very similar bugs in the same area of code, so he audited the Direct Rendering Manager (DRM) auth and lease code. Following his suggestions, I found myself regularly contributing to the DRM subsystem, and working on larger and larger portions of the code base over the months.


In the first few weeks, I started by fixing small bugs here and there. At first, the bugs I could tackle were found in less maintained parts of the code base, where errors popped up from neglect (or had been present since the dawn of time). Some of these bugs weren’t too tricky, but they had gone unfixed because there were fewer eyes on them. Sometimes I tried tackling harder bugs, but I would often hit dead ends.

With practice, and by observing how other people tackled problems, I soon started analysing bug reports that were more complex and appeared in widely-used parts of the code. And I found that I could fix them.

Sometimes I had to spend days studying a kernel mechanism, or diving through a decade of logs[5], or reading the specific protocol specification/random PowerPoint slides from 2012/scatterings of blog posts from when the feature was first implemented. But gosh darn it, I could fix those bugs. It was no longer a matter of whether I could fix something, but how much time it would take.


One of my recent bug fixes was very reminiscent of one of the earliest bugs I’d looked at but failed to fix, in another subsystem far far away. The root cause of the old bug was far removed from its symptoms, and someone else had to fix it with an astute observation[6]. But this time, after some back-and-forth discussions, my pattern recognition senses kicked in and I realized that this new bug could be fixed with the same observation that fixed the old bug.


About two months in, I received my first bug report for a change I’d made[7]. It was ultimately unrelated to my code change and only shared the same call trace, but I managed to fix it anyway. Then, less than a week later, more bug reports for my other fixes came in. One was a pre-existing bug that was revealed after I touched the code, and I managed to fix that too[8]. Another bug… actually turned out to be a regression that I introduced. Slightly devastating, but I’m working on it. (Hey, that’s what release candidates are for, yeah?)


I once received a comment on my patch that contradicted what another maintainer had suggested. Before then, I thought that hardly anyone paid attention to my patches, but suddenly a deluge of seasoned engineers came pouring in with their opinions. Now I wince whenever I botch a patch series and have to resend/update it, since anyone could be watching. But this pressure also pushes me to give my best work.


Sometimes, I would see an interesting discussion in a mailing list and would offer my two cents[9]. Sometimes I’d think that I had a bug fix, but it would turn out to be not quite right and someone else would come in to discuss the problem or propose their own solution.

And sometimes, when investigating a bug, I’d form a clear enough mental model of the code that the clouds would part and I’d spot bugs lurking in the shadows. Then I would spend some time making a series of improvements.


As the Linux Kernel Mentorship Program drew to a close, I was reminded of two quotes that have stuck with me for a while now:

I think that much more important than the patch is the fact that people get used to the notion that they can change the kernel – not just on an intellectual level (“I understand that the GPL means that I have the right to change my kernel”), but on a more practical level (“Hey, I did that small change”).

Linus Torvalds

What’s in your hands, I think and hope, is intelligence: the ability to see the machine as more than when you were first led up to it, that you can make it more.

Alan J. Perlis

I’m grateful for the opportunity to participate in this program because it allowed me to fulfil a long-time dream of becoming an active member of an open source community for a technology that I use all the time.

It’s really quite amazing. Over the weeks, from the initial application tasks, to our first readings, to the support given along the way, mentees are steadily shown the way to contribute to the kernel. So that we can experience, on a practical level, the fact that we can change the kernel. And that we can make it more.

Scattered papers with deranged writing sprawled on my desk.
Just a picture of my desk with various notes on the Linux kernel sprawled over it. It was a good summer.


Notes

  1. One of nine chicken thighs that were cooked in parallel, distributed over two trays. Even before hacking on the kernel it seemed I was fated to deal with issues of concurrency.
  2. Contributing to open source projects used to seem like an overwhelming task. Before I made my first contribution to CPython, I spent a few weeks studying its internals (which was educational, but ultimately overkill). At some point I was also interested in contributing to Prometheus and Kubernetes, and tried to block out an hour every morning to “work on it” and figure out how to contribute. I never really committed to it because it felt so daunting. But now I’ve come to learn that the barrier to entry to any coding project is not so ridiculously high.
  3. Random fact: the motherboard on my laptop also stopped working a day after the results came out, but I believe the two incidents were mostly unrelated.
  4. Which I blogged about here: [1] [2] [3] [4]. At the time I was convinced that even if I couldn’t contribute code, I could at least blog enough to have a positive impact.
  5. I’d sometimes run into the first git commit in 2005, then decide that instead of going back in time hoping to find notes, I should just roll up my sleeves and read the code. But, to be fair, reading Linux kernel logs turns out to be rather educational, and occasionally hilarious:

    A snippet of a git commit history containing the following:

    The first step of the rebalance process ensures there is 1MiB free on each
    device. This number seems rather small. And in fact when talking to the
    original authors their opinions were:

    "man that's a little bonkers"
    "i don't think we even need that code anymore"
    "I think it was there to make sure we had room for the blank 1M at the
    beginning. I bet it goes all the way back to v0"
    "we just don't need any of that tho, i say we just delete it"

    Clearly, this piece of code has lost its original intent throughout the
    years. It doesn't really bring any real practical benefits to the
    relocation process.

  6. For the technical details, it involved a use-after-free error in a timer instance. I had thought that the issue was a race with a timer being concurrently freed and used, as the bug report suggested. Takashi Iwai then came in with (in my opinion) an absolutely brilliant observation that the issue was the result of concurrent assignments of timers, and the memory corruption was a result of being unable to clean up overwritten timers.
  7. I actually broke Linux much earlier than that, but it was, remarkably quickly, caught and reverted before I even knew about it (the revert patch even went into the same series when backported to Linux stable). And yes, we did eventually unravel the issue, and the fix was merged into mainline.
  8. I think. It’s still going through review, but I’ve tested it. And most importantly, the revelation of the root cause came to me at 2 am, and I’m wildly confident of middle-of-the-night epiphanies. Because if I didn’t believe in them, I couldn’t justify staying up so late*.

    * Although, one other valid reason for staying up late is that I’d wake up to mildly amusing debugging efforts:

    My console running Qemu, with many lines saying: "You're doing a naughty"
    Give me some credit ok, I did manage to find the bug with this debugging output.

  9. But like with cryptocurrencies, sometimes two cents go a long way, and sometimes they’re kinda useless.
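To make the pattern from note 6 concrete: the bug class is a shared slot that owns exactly one timer. If the slot is blindly overwritten, the timer it previously held is never cleaned up, and any stale reference to it can later be used after free. Here is a minimal sketch of the buggy and fixed assignment paths, with entirely hypothetical names (this illustrates the pattern only, not the actual kernel code):

```python
class Timer:
    """Stand-in for a kernel timer object (illustrative only;
    'freed' models whether cleanup ever ran on it)."""
    def __init__(self, timer_id):
        self.timer_id = timer_id
        self.freed = False

    def free(self):
        self.freed = True


def assign_buggy(slot, timer):
    """Blindly overwrite the slot: the timer it held is never
    cleaned up, leaving a dangling object behind."""
    slot["timer"] = timer


def assign_fixed(slot, timer):
    """Dispose of the overwritten timer before installing the new one."""
    old = slot.get("timer")
    if old is not None:
        old.free()
    slot["timer"] = timer


# Buggy path: two assignments compete for the same slot; the first
# timer is overwritten and never cleaned up.
slot, first = {}, Timer(1)
assign_buggy(slot, first)
assign_buggy(slot, Timer(2))
assert not first.freed  # timer 1 leaked: the use-after-free setup

# Fixed path: the overwritten timer is cleaned up at assignment time.
slot, first = {}, Timer(1)
assign_fixed(slot, first)
assign_fixed(slot, Timer(2))
assert first.freed  # timer 1 properly disposed of when overwritten
```

This mirrors the observation in the note: the corruption came from concurrent assignments overwriting timers that could then never be cleaned up, rather than from a straightforward race between freeing and using a single timer.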
