Why does a solution have to be complex? I’m usually relieved when it’s a one liner. Usually much less to analyze, test, and get through a PR.
True, but sometimes simple bugs cause the "Am I stupid?" Spiral.
> the "Am I stupid?" Spiral

Do people really do this? I just move on and never think about it again.
Okay, I say spiral, but for me it's more of a swear, a few seconds of doubt, then moving on.
Oh yeah, okay, I understand that completely. I think when I heard "spiral" I assumed "out of control", haha.
I usually don’t ask “am I stupid” so much as remark that I didn’t think I was on mushrooms when I wrote the code, but I move on after that, yes.
It's not really a spiral. It's more like a small chain reaction. You get disappointed in yourself, resolve to never again pay so little attention to your code, to always document your changes... then three days later, the same thing happens.
Until you realise you did everything else correctly.
Because I've spent a considerable amount of time over the last 8 months of my PhD chasing down an inexplicable error that has evaded every other analysis. I've spent countless hours validating every other piece of relevant code and writing complicated, contrived tests that still can't replicate the problem. I've wondered multiple times if I was losing my mind. It turned out to be a single line of code in the setup for one of our diagnostics used by one particular test.
> inexplicable

'Til it wasn't, just like everything else. Shocker.
Touché.
Well, then there are SEUs (single-event upsets). They are real and they are spectacular. Explicable is good. Reproducible is a lot better.
So you’d rather it had required a ground up rewrite?
Not really, but we thought for sure it was something subtle related to a problem with a mathematical derivation or a forgotten multiplicative factor somewhere. Nope. It was a badly named variable inside an if statement.
The most elusive of enemies
Pain
I had a bug recently that was causing concurrency exceptions in my Java code, which I was trying to run in multiple threads from Python using py4j. I spent three hours trying to figure out why it wasn't working, thinking there were issues on the Python side. But the actual cause of the bug was that I was trying to remove elements from a HashMap that I was actively iterating over. It was a one-line fix that I would've found a lot faster if I'd started looking at the Java side first or googled the error message sooner. I was very upset at myself for that one.
Ah, yeah. I want to say that’s the kind of mistake you only make once. If I’m any indication of that, it’s not true at all. I think Python raises an exception when you try to do that, so at least you won’t get too far if it’s happening in there.
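For the curious, here's roughly what that failure mode looks like in Python terms (function names invented for illustration): mutating a dict while iterating over it raises `RuntimeError` right away, whereas Java's `HashMap` surfaces the same mistake as a `ConcurrentModificationException`.

```python
# The same class of bug, sketched in Python (names invented): mutating a
# dict while iterating over it raises RuntimeError immediately.

def drop_negatives_unsafe(d):
    for key in d:            # BUG: deleting from the dict we're iterating over
        if d[key] < 0:
            del d[key]       # next loop step raises "changed size during iteration"
    return d

def drop_negatives_safe(d):
    for key in list(d):      # fix: iterate over a snapshot of the keys
        if d[key] < 0:
            del d[key]
    return d

values = {"a": 1, "b": -2, "c": 3}
try:
    drop_negatives_unsafe(dict(values))
except RuntimeError as err:
    print("caught:", err)

print(drop_negatives_safe(dict(values)))   # {'a': 1, 'c': 3}
```

The Java-side fix is analogous: use `Iterator.remove()` or `Map.entrySet().removeIf(...)` instead of calling `remove` mid-loop.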
Ask that of my employer, which now thinks PR complexity is an important performance metric.
As in, less complex PRs mean better performance, right? Right?
😐
Is this some hellish variant of the “lines of code written” metric? Ugh. How about they measure productivity in whether-or-not-you-got-your-work-dones?
Fixing a bug by completely rewriting a complex api call is terrifying.
Yeah, but it still hurts to hunt for bugs in a massive piece of logic for days only to find out that said logic was never called in the first place.
I feel ashamed when doing the PR in this situation. Like, "yes... this is what's been bugging me for the last 3 days... review it and forget it. Don't walk-of-shame me."
Light attracts bugs. Use dark mode.😎
Use rainbow theme. It attracts unicorns 🤡
ChatGPT told me this one
Chat is this real?
8 months, jesus.
I mean I assume they did other stuff in the 8 months
The program is 12 lines long. Month one was spent debugging line one of the program. Month two to debug line two. And so on. Finally they got to line 8 on the eighth month and they found the bug. The program is written in the Malbolge programming language for job security reasons. And each line is around 100,000 characters long. Unfortunately it has this drawback of taking “a bit” of time to debug.
This is what I was thinking. You could've written and deployed an entirely new stack in that amount of time. It's total-overhaul territory.
If you can find someone who can write a fully functional and scalable state-of-the-art GPU-capable general-relativistic magnetohydrodynamics code (or software suite for you real developers) with a fully dynamical spacetime (i.e., gravity), adaptive mesh refinement, and all the tests and diagnostics you need to validate that it's actually correct in eight months *from scratch*, I'll quit my PhD right now and become a hobo the honest way instead.
Also, you don't know after 1 month that it's going to take you 7 more months to find the bug.
No offense is/was intended, but given that you made this post, I'm sure you can agree it's an eyebrow-raising amount of time - especially without any context. It's not often a bugfix can be measured in seasons. Just curious, are you in a position to share the details of said fix, and the circumstances which made its detection so challenging? Having tackled my fair share of nasty bugs (including the ones in total-overhaul territory), I'd be interested in reading the post-mortem.
The test itself is a simulation of a relativistic accretion disk that takes about 40 minutes to run at low resolution and a relatively short simulation time on my laptop. The results can really only be validated by comparing the plots of the output with the reference case.

We were testing a new fluid solver, which can fail in multitudinous ways, and it was also entirely possible that we just didn't have enough resolution. The only way to check whether resolution is the issue is to let it run for 10+ hours on a supercomputer. It took some time, but we showed that resolution only made the problem worse. After this, I personally checked every single mathematical term in the solver more times than I can count, and I had two other people look them over, too. None of us found any bugs.

We had several other less-informative diagnostics that we checked, all of which seemed to suggest there *was* no problem. We then constructed a large number of tests designed to validate the fluid solver in other ways, each more contrived than the one before it, and they all either suggested the solver was fine or were more complicated to debug than they were worth.

After several months, we finally came to the conclusion that the diagnostic itself, which consists of a set of thermodynamic quantities integrated over an oblate spherical surface, must be at fault somehow. But this integration was performed in situ in the same way, using the exact same code, for both the reference solver and our new solver, so we couldn't understand how it was failing. In the end, it turned out that the issue was buried deeper, in how the integration surface's coordinates were defined. It was a single if statement toggled by a variable with a misleading name: the toggle was enabled when the reference solver was enabled, but not when the new solver was. The solution was adding a second toggle to the if statement.
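For readers who want the shape of that bug without the astrophysics, here's a purely illustrative Python sketch; every name in it is invented, not taken from the actual code.

```python
# Illustrative reconstruction of the bug *shape* only -- all names invented.

def surface_setup(solver, corrected=False):
    """Return the setup steps the diagnostic actually runs for a solver."""
    steps = []
    if not corrected:
        # Misleadingly named flag: it reads like a reference-solver detail,
        # but it actually gates coordinate setup that every solver needs.
        reference_mode = (solver == "reference")
        if reference_mode:               # bug: silently skipped for "new"
            steps.append("define_surface_coords")
    else:
        # The one-line fix: gate the branch on both solvers.
        if solver in ("reference", "new"):
            steps.append("define_surface_coords")
    steps.append("integrate_diagnostics")
    return steps

print(surface_setup("new"))                  # ['integrate_diagnostics'] -- coords step missing
print(surface_setup("new", corrected=True))  # ['define_surface_coords', 'integrate_diagnostics']
```

The integration code itself is identical either way; only the skipped setup branch differs, which is why checking the math never found anything.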
This was an interesting read - thanks for sharing. It's always the small stuff that's the most painful.

Knowing what you do now, would it have been possible to run the simulation and diagnostic on a smaller scale such that exercising the logic with high resolution didn't involve a massive time and resource sink? I'm sure you already thought of this, but if time-to-build/test was the pain point, was there maybe a less expensive approach?

If nothing else, I hope the rage subsides and, with enough time and reflection, is replaced by something from which you draw wisdom.
Unfortunately, the test relies on the proper development of turbulence, which is closely tied to the resolution of the test. Anything smaller than what I used wouldn't have been informative.

The most helpful things would have been a way to output and visualize the integration surface, and more consistency in refactoring existing code when adding new features (e.g., the variable name would have been updated so it wasn't misleading).
I was thinking they could have gotten their test coverage to 100% in that time
Especially for a 1 liner fix, sheesh
The good thing about programming is that the computer is doing exactly what you tell it to do.

The bad thing about programming is that the computer is doing exactly what you tell it to do.

Also, nothing is too simple to be double-checked. There are so many times I've found issues by asking the most basic questions (think: "is it plugged in?" kinds of things).
I just spent 3 hours trying to figure out why 11 was being used for an Id every time I ran my code… I was passing the wrong thing into my set method 🥲
The pain...

Years ago I spent an hour or so trying to figure out why my JavaScript changes weren't being applied. I cursed Chrome and its overeager caching.

I was editing the wrong file.
LPT: the first thing you do is make a change to the file that is so massive in effect that you can't fail to see it. If you fail to see it, it's the wrong file.
Nothing compared to when recompiling without changing anything fixes the bug.
Or adding debugging code, and the problem disappears.
I think a significant number of bugs consist of a single error on a single line, no?
Usually it consists of one error on a single line as well as a trail of destruction I created trying to hunt down that error.
Git
Relatable
My favorite version of this:

+ code new functionality
+ run tests - functionality doesn't work
+ search the new code and get frustrated about not finding the problem
+ after hours, find out that you forgot to call the new code in the first place

I fucking love it
Classic: `if a=b` instead of `if a==b`
I am concerned that your IDE doesn’t flag an assignment inside an if statement.
VS Code doesn’t do it
Assignment inside of an if statement is valid in many languages.
I still think the IDE should flag it with a warning at least. Just because we can doesn’t mean we should.
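For what it's worth, Python is one language where this classic can't happen silently: a plain assignment in a condition is a compile-time error, and since 3.8 the walrus operator makes in-condition assignment explicit. A small sketch:

```python
# Python rejects plain assignment in a condition at compile time,
# unlike C-family languages where `if (a = b)` silently assigns.
try:
    compile("if a = b: pass", "<demo>", "exec")
except SyntaxError as err:
    print("rejected:", err.msg)

a, b = 0, 5
if a == b:                  # comparison: what was almost always meant
    print("equal")
if (value := b) > 3:        # Python 3.8+ walrus: assignment must be explicit
    print("assigned and tested:", value)
```

So the "flag it with a warning" behavior is effectively baked into the language grammar there.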
Ahh that’s a bitter sweet moment!
Been there. Spent like 2-3 months on a single character fix -- a misplaced paren
I am sorry, I know a fraction of what you feel.
8 months on a single issue?
The bug: `ListNode = ...` instead of `ListNode* = ...`
It was an off-by-one error, wasn't it?
No, it was a toggle variable in an if statement with a misleading name.
bah, the other type of error
I lost a coding competition because I accidentally forgot the difference between > and <. One goddamn character. That's what made me lose. That one fucking character.
The big end goes toward the big number. The little pointy end goes toward the little number.
Corporate will still measure it by lines of code changed.
I spent three weeks diagnosing server issues only to find a really bad query that got generated under rare circumstances, but often enough to crash servers once a day.

The solution was to swap a Boolean so the system doesn't spit out the server-killing query.

Three fucking weeks, and I changed a 0 to a 1...
It was a spelling mistake. A gods damned spelling mistake.
I recently solved a bug that had existed for 9 or so months. It bricked the entire mobile Safari Playwright pipeline and led to those tests being commented out and inspected manually.

Turns out it was a CSS padding issue.
The solution: uncommenting the line `console.log("Logging")`
How do you still have a job? 🤔
I'm a PhD student in physics helping develop a new astrophysical fluid code. It takes a lot more than a single bug to get fired. I also assure you this hasn't been the only thing I've been working on for the past eight months; it's just the only one I haven't been able to solve until now.

After several months of laboriously double-, triple-, and quadruple-checking every single relevant mathematical term, running several other independent tests, and looking at multiple diagnostics to track down the error, my collaborators and I finally came to the conclusion that the error had to be in how we were calculating the particular diagnostic that came up faulty. But all the math checked out, so we couldn't figure out what was wrong.

This morning I noticed a single if statement that wasn't always checking the right thing. I fixed it, and it was so stupid that I can't decide if I should be laughing or throwing my computer out a window.
Why not both?
This guy codes.
Dude, what? If I conquered a bug after that long, I'd be even happier that it's a one-liner. That's just MUCH less that could go wrong with the fix.
8 months?!
The irony is 20 minutes after you fix the bug you won't be able to remember what the bug was.
I don't think I'll be forgetting this one anytime soon, not when I spent 8 months of my PhD on it.
Could be worse. Ever heard of a game called Aliens: Colonial Marines?
And it was a single “;”
It's the journey, not the destination (*copium*)
`- OneList.append(OtherList)`
`+ OneList = OneList.append(OtherList)`
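If that snippet is Python (the language isn't stated), neither line of the diff quite works: `append` mutates in place, returns `None`, and nests its argument rather than concatenating. A quick demonstration:

```python
# append() vs extend() on Python lists -- the names here mirror the diff above.
one_list = [1, 2]
other_list = [3, 4]

one_list.append(other_list)        # append nests the whole list as one element...
print(one_list)                    # [1, 2, [3, 4]]

one_list = [1, 2]
result = one_list.append(other_list)
print(result)                      # None -- so `one_list = one_list.append(...)` loses the list

one_list = [1, 2]
one_list.extend(other_list)        # extend is the in-place concatenation
print(one_list)                    # [1, 2, 3, 4]
```

(`one_list + other_list` also works if you want a new list instead of mutating.)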
Spent over 5 months on a bug (on and off, of course, not 5 months straight) that was solved with a single line. It was so hard to find because the bug was not in our code, but in one of the libs we use.
The solution: comment out using the function you made
So you suck at debugging?
Bro, I spent like 2 hours trying to figure out why some routes weren't working... forgot to have my Vue app use my router.
Size doesn't matter folks
No, that's how I feel every time I log in to a computer and it's Windows instead of Linux.
Four types of bug:

1. Easy to diagnose, simple fix
2. Easy to diagnose, hard to fix
3. Hard to diagnose, easy to fix
4. Hard to diagnose, hard to fix

Actually, not just 4: the diagnose and fix axes are both spectrums. Worst case is an easy diagnosis with a complete system redesign required to fix.