Three Days, Four Bytes, and a Constant I Should Have Read

I had been staring at this register dump for three days.

The solar charge controller was supposed to send back 11 bytes when I asked for register 0x3200. I was getting exactly 4. Every single time. Not 3, not 7, not a random number that would at least suggest some chaos I could chase — always, precisely, 4 bytes. The register address plus what looked like the beginning of a response, then silence.

I’d been through the obvious list twice. CRC math? Calculated by hand against the datasheet, correct. Function code? Right, 0x04 for input registers, double-checked. Byte order? Big-endian, as the spec says. Baud rate? 9600, confirmed with a scope. I had even swapped the RS-485 cable, which solved exactly nothing, as suspected.

What made it worse was that a USB-to-RS-485 dongle connected to a laptop got perfect responses from the same controller. All 11 bytes. Clean. Instant. So the slave wasn’t broken, the register existed, and the data was real. Something in my firmware was eating the front half of every single response, and I had no idea why.

I rewrote the UART driver path. Twice. I switched from DMA mode to legacy interrupt mode, profiled byte arrival times, restructured the receive buffer. Same result. At some point during the third day I genuinely started questioning whether there was a hardware fault on the addon board itself — maybe the RS-485 transceiver had a marginal switching speed and was dropping early bytes at the physical layer.

There wasn’t. The bug was in a number. A single timing constant that was set to 2.1 milliseconds when it should have been 120 microseconds. That’s it. That’s the whole post, in one sentence — except understanding why that number mattered requires understanding what the system was doing and why I got confused about where to look. So let me back up.


What I Was Actually Building

The context is industrial IoT at remote sites — places where sensors and controllers have been running for years, often without any network connectivity at all. Solar installations in the field. Weather monitoring equipment. Energy meters in industrial enclosures. These devices speak Modbus RTU over RS-485, a wired serial protocol that has been doing this job reliably since 1979. They don’t have Wi-Fi. They don’t have Ethernet. They were never designed to talk to the internet, and most of them never will.

The project was to get their data back to a backend system without running physical cable everywhere and without depending on mobile coverage. The solution was a low-power wireless mesh network — a network of small radio nodes that route packets across a site autonomously, even in areas with no cellular signal. Each node handles its own routing, and the mesh delivers packets to a sink node connected to the backend.

The missing piece was the Modbus bridge. I needed a small addon board that could act as a Modbus master on an RS-485 bus, poll sensors on a schedule, and forward the readings into the mesh as structured packets.

The architecture looks like this:

[Solar Charge Controller]
         |
     RS-485 bus
         |
  [nRF52 Addon Board]  <-- Modbus RTU master
         |
   Mesh radio link
         |
   [Mesh Network] <-- autonomous routing
         |
      [Sink Node]
         |
      Backend / Dashboard

The addon board is the translator. It speaks Modbus RTU on the wired side and mesh on the radio side. It runs on the same nRF52 SoC that handles mesh communication, so both stacks — the Modbus driver and the mesh stack — share one processor, one set of peripherals, and one interrupt controller.

That last sentence is important. I’ll come back to it.

The reason this had to exist, rather than buying an off-the-shelf gateway, is constraints: the board had to be small enough to fit in an existing enclosure, power-efficient enough to run from the same battery bus as the sensor, and deeply integrated with the mesh stack so that polling schedules could be coordinated with the network’s sleep cycles. A commodity gateway wouldn’t fit those constraints. So the board exists, and so did the bug.


A Two-Minute Primer on Modbus RTU

Skip this if you’ve spent time with industrial protocols. Come back if you hit a term later that you don’t recognize.

Modbus is from 1979. It was developed by Modicon for programmable logic controllers, and it is almost certainly older than the building you’re reading this in. It is also, for better or worse, still the dominant protocol in solar inverters, charge controllers, energy meters, HVAC controllers, and most industrial sensors you’ll encounter in the field. If you’ve ever wired a building automation system, a security panel, a DMX lighting controller, or a PLC rack, you’ve almost certainly been within arm’s reach of RS-485 and Modbus.

The protocol is simple: one master, one or more slaves, strict turn-taking. The master sends a request frame. A slave that recognizes its own address in the frame sends a response. No slave transmits unless spoken to. No broadcast responses. If you ask the wrong device ID or the wrong function code, you get silence — and silence is also what you get from a broken slave. That ambiguity is your problem to debug.

Frames are binary and compact. A request to read holding registers looks like: [slave address][function code][register address high][register address low][quantity high][quantity low][CRC low][CRC high]. Eight bytes. A response comes back as: [slave address][function code][byte count][data bytes...][CRC low][CRC high]. The CRC is a 16-bit value computed over all preceding bytes, sent little-endian.
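To make that layout concrete, here's the same request assembled in Python — a sketch in the style of my tooling scripts, not the firmware itself. The CRC routine is the standard CRC-16/MODBUS algorithm (reflected polynomial 0xA001, initial value 0xFFFF):

```python
def modbus_crc(data: bytes) -> int:
    """CRC-16/MODBUS: reflected poly 0xA001, init 0xFFFF, no final XOR."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc

def build_read_request(slave: int, function: int, reg: int, qty: int) -> bytes:
    """Assemble the 8-byte Modbus RTU read request described above."""
    frame = bytes([slave, function, reg >> 8, reg & 0xFF, qty >> 8, qty & 0xFF])
    crc = modbus_crc(frame)
    return frame + bytes([crc & 0xFF, crc >> 8])  # CRC goes out low byte first

# e.g. read 3 input registers starting at 0x3200 from slave 1
req = build_read_request(0x01, 0x04, 0x3200, 3)
```

One handy property of this CRC: running it over a frame that already includes its own CRC bytes returns zero, which makes receive-side validation a one-liner.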

Register types matter. Holding registers (read with function code 0x03) are readable and writable. Input registers (read with 0x04) are read-only measurements. Coils are single-bit outputs; discrete inputs are single-bit inputs. Using the wrong function code for a register type produces a Modbus exception response — the function code echoed back with its high bit set, plus an exception code — if you're lucky. If you're unlucky, some slaves just ignore the request and you get silence again.

RS-485 is the physical layer. Unlike RS-232, which uses single-ended voltage levels relative to ground and tops out at maybe 15 meters reliably, RS-485 uses differential signaling across a twisted pair. One wire goes high while the other goes low. The receiver looks at the difference between them, which gives you good noise rejection over long runs — hundreds of meters in practice, sometimes more with careful termination.

The catch with RS-485 is half-duplex: the same two wires are used for transmit and receive, and only one end can drive at a time. That means every node on the bus, including the master, needs a direction-control GPIO pin to switch the RS-485 transceiver between transmit mode and receive mode. Forget to switch it, or switch it at the wrong time, and you either can’t drive the bus or you can’t hear responses. This is a common source of bugs in new RS-485 implementations.

Frame boundaries in Modbus RTU are defined by silence, not by a delimiter byte. A gap of 3.5 character times between bytes marks the end of a frame. At 9600 baud, one character time is roughly 1.04 milliseconds with an 8N1 frame (the spec's 11-bit character, which carries a parity bit, is closer to 1.15ms) — so 3.5 characters works out to somewhere around 3.6 to 4ms. If a gap longer than that appears in the middle of a frame, the receiver is supposed to treat it as a frame break. This detail matters a lot. I'll come back to it.
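The silence-gap arithmetic, as a pair of helpers. Whether a character is 10 or 11 bits depends on whether your bus runs 8N1 or the spec's parity-bearing frame:

```python
def char_time_ms(baud: int, bits_per_char: int = 10) -> float:
    """Milliseconds to shift one character onto the wire."""
    return 1000.0 * bits_per_char / baud

def t35_ms(baud: int, bits_per_char: int = 10) -> float:
    """Modbus RTU end-of-frame silence threshold: 3.5 character times."""
    return 3.5 * char_time_ms(baud, bits_per_char)

# at 9600 baud, 8N1: ~1.04ms per character, ~3.65ms end-of-frame gap
# with 11-bit characters: ~1.15ms and ~4.0ms
```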


The Bug That Ate Three Days

Let me describe the symptom precisely: every call to read a register from the solar charge controller returned exactly 4 bytes. Not a random number. Not a number that varied by register. Four bytes, always. Which was exactly bytes 3 through 6, counting from zero, of what the full response should have been.

If a full response looks like [addr][fn][bytecount][data0][data1][data2][data3]...[crcL][crcH], I was getting [data0][data1][data2][data3] — the first four data bytes, no header, no CRC. That’s not a CRC failure. That’s not a framing error. That’s a partial read starting in the middle of the response.

Hypothesis 1: The slave is malfunctioning.

I hooked up a USB-to-RS-485 adapter to a laptop, opened a serial terminal, and manually composed the exact same Modbus request that the addon board was sending. The solar charge controller responded immediately with a full, clean frame. All 11 bytes, correct CRC, data that made sense. Hypothesis ruled out.

Hypothesis 2: The CRC calculation is wrong.

I extracted the CRC logic from the firmware, ran it against the test vectors in the Modbus spec, and confirmed it was correct. I also ran it against the partial bytes I was actually receiving — they passed CRC as a standalone sequence, which told me they were real Modbus data, just truncated. Ruled out.

Hypothesis 3: Inter-character gap is being inserted incorrectly.

Maybe something in the UART configuration was generating an incorrect baud rate, and the resulting timing was confusing the slave into thinking the request frame had a gap in the middle, causing it to restart or truncate its response. I scoped the TX line and it looked clean. I also checked the nRF52 UART baud rate register configuration — correct, 9600. Ruled out.

Here is where I made a mistake that cost me two days.

I decided the UART peripheral must be behaving differently than I expected. I went deep into the nRF52 datasheet on the DMA-based UARTE peripheral, rewrote the transmit path to use manual register access instead of the SDK wrapper, and profiled byte timing with GPIO toggles and a logic analyzer. The bytes were going out correctly. I switched from DMA-mode UART to legacy UART, which uses interrupts instead. Same result. Four bytes back, reliably, every time.

At some point during this detour, I noticed something in the receive logic that hadn’t caught my eye before. After the firmware finished transmitting the request frame, there was a sequence I’d written earlier and basically stopped thinking about:

// Switch RS-485 transceiver to RX mode
gpio_pin_set(RS485_DE_PIN, 0);  // DE low: disable driver
gpio_pin_set(RS485_RE_PIN, 0);  // RE# low: enable receiver (active-low)
k_sleep(K_MSEC(2));             // "turnaround delay" <-- HERE
// Now start listening
uart_irq_rx_enable(uart_dev);

The comment said “turnaround delay.” I had written that delay in very early in development to give the line time to settle before enabling the receiver — a reasonable precaution. I’d set it to 2 milliseconds, which at the time felt conservative in a good way. Stable, safe.

At 9600 baud, this slave starts responding within 1 character time of the request's last byte — roughly 1ms. My firmware, meanwhile, wasn't switching to RX mode until 2ms after that last byte went out. The first byte of the response was hitting the wire roughly a millisecond before I started listening, which meant my receiver was being enabled, at the earliest, while the opening bytes of the response were already going past it — and often after them.

The first 4 bytes of an 11-byte response arrive in about 4ms. I was enabling receive at 2ms. I was systematically discarding the header bytes — slave address, function code, byte count — and picking up mid-frame, right at the data bytes. Every time. Perfectly, mechanically, every time.

The bytes you’re missing are the bytes that arrived while you weren’t listening.

I had been blaming the CRC math. I had been blaming the UART peripheral. I had rewritten the transmit path twice. The bug was in a delay constant I’d written on day one and filed under “boring infrastructure.” The whole time, the transceiver was switching to receive mode 1-2ms too late, and the controller’s response was already in progress.

I want to be honest about why I missed this for three days: I’d written that initialization code early, it had never caused obvious problems in earlier testing (when I was only checking “did the device respond at all,” not “did I receive the full response”), and I had mentally categorized it as correct. I wasn’t looking at it because I wasn’t questioning it. That’s a bias, not a debugging methodology.

The wrong mental model is the bug. I spent three days looking at the protocol layer because I was certain the problem was there. The bug was in the physical-layer state machine. I missed it because I trusted code that wasn’t under scrutiny.


The Fix

The fix itself was small. The understanding behind it took three days.

The turnaround delay from TX-to-RX mode needed to drop from 2.1ms to something that respected the actual timing of the protocol. At 9600 baud, you need roughly 100µs for the RS-485 line to settle after the last transmitted bit — the driver has to go high-impedance, the termination resistors pull the line to idle state, and the receiver input has to stabilize. I set the delay to 120µs, with some margin. That’s the entire fix, mechanically.

But while I was in the code, I also fixed a second issue I found during the investigation: the wireless mesh stack runs interrupt-driven, and its interrupts can fire at any point during a Modbus transaction. If a mesh interrupt fires in the middle of transmitting the request frame, it can insert a multi-millisecond gap between bytes — which the slave interprets as a frame boundary and abandons the request. The fix is to disable IRQs during the TX burst, re-enable them during the inter-frame waiting window (so the mesh stack can do its housekeeping), and disable them again for the first byte of the RX burst (to avoid a race condition between the byte arriving and the receive enable).

Here’s the critical section structure:

/* Flush TX buffer and send Modbus request */
unsigned int key = irq_lock();  /* mask IRQs: mesh stack must not interrupt TX burst */
rs485_set_tx();                 /* DE=1, RE=1: enable driver, mute receiver */
uart_send_frame(req, req_len);  /* send all bytes back-to-back */
uart_wait_tx_complete();        /* wait for last stop bit */

rs485_set_rx();                 /* DE=0, RE=0: enable receiver (RE# is active-low) */
k_busy_wait(120);               /* 120us line-settle (was 2100us) */
irq_unlock(key);                /* mesh stack can run during response wait */

/* Wait for first byte */
if (!wait_for_first_byte(uart_dev, K_MSEC(50))) {
    return MODBUS_ERR_TIMEOUT;
}

key = irq_lock();               /* guard RX burst, avoid byte overrun */
receive_modbus_frame(buf, &len);
irq_unlock(key);

The k_busy_wait(120) call is the key line. Everything else is about keeping the mesh stack from interfering with byte timing during critical windows.

Here are the results after this change, across all registers I was testing:

Register   Expected bytes   Received bytes   CRC valid   Data sane
0x3200     11               11               Yes         Yes
0x311A      7                7               Yes         Yes
0x331A      9                9               Yes         Yes
0x3110      9                9               Yes         Yes
0x3100      9                9               Yes         Yes
0x310C      9                9               Yes         Yes
0x9013      7                7               Yes         Yes

Zero CRC errors across multiple polling cycles. Battery state of charge reading correctly. PV voltage tracking real-world values. The real-time clock register returning a timestamp that matched my watch.

I tagged the working firmware as a stable baseline, created a branch to preserve it, and put the compiled hex in a known-good folder under version control. That last part might seem excessive for a personal project — but I’ve been burned enough times by “I swear it was working last Tuesday” to appreciate having a concrete artifact I can go back to.


The Tools I Built Around It

Surviving that bug motivated me to build infrastructure so I’d never have to debug it the same way again.

The most useful thing I built was a protocol sniffer: a small Python script that listens on a USB-to-RS-485 adapter and decodes Modbus RTU frames in real time. Every byte that passes on the RS-485 bus gets displayed with annotations: slave address, function code, register address, byte count, data values in hex and decimal, and CRC validity. Color-coded in the terminal.

The sniffer became my ground truth. Before it existed, I had to infer what was on the wire from the firmware’s behavior. With it, I can sit the adapter next to the addon board, watch every request go out and every response come back, and know immediately whether a problem is happening on the wire or somewhere in the firmware’s handling of what it received. It’s the closest thing to a protocol analyzer without buying a protocol analyzer.
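The core of the sniffer is just frame parsing. Here's a stripped-down sketch of the decode step — the real script also handles the 3.5-character silence detection and the terminal coloring, neither of which is shown:

```python
def modbus_crc(data: bytes) -> int:
    """CRC-16/MODBUS over the frame body."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc

def decode_response(frame: bytes) -> dict:
    """Decode a complete Modbus RTU read response into labeled fields."""
    if len(frame) < 5:
        raise ValueError("frame too short to be Modbus RTU")
    crc_received = frame[-2] | (frame[-1] << 8)   # CRC arrives low byte first
    crc_ok = modbus_crc(frame[:-2]) == crc_received
    addr, fn = frame[0], frame[1]
    if fn & 0x80:                                 # exception: high bit of fn set
        return {"addr": addr, "fn": fn & 0x7F, "exception": frame[2], "crc_ok": crc_ok}
    count = frame[2]
    data = frame[3:3 + count]
    # data bytes pair up into big-endian 16-bit register values
    regs = [data[i] << 8 | data[i + 1] for i in range(0, len(data) - 1, 2)]
    return {"addr": addr, "fn": fn, "count": count, "regs": regs, "crc_ok": crc_ok}
```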

The second tool was a slave simulator. It runs on a laptop with a USB-to-RS-485 adapter and emulates the register map of a solar charge controller — realistic values, correct response format, valid CRC. I can test the addon board against the simulator without any bench hardware at all. The simulator was especially useful after the first fix, because I could systematically exercise every register address and confirm the full-frame responses.
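The simulator's core is the inverse operation: request in, response out. A sketch of that step — CRC validation of the incoming request is omitted here, and the register map is whatever dictionary you hand it:

```python
def modbus_crc(data: bytes) -> int:
    """CRC-16/MODBUS over the frame body."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc

def handle_request(req: bytes, registers: dict) -> bytes:
    """Build the response to a read request (request CRC assumed already checked)."""
    slave, fn = req[0], req[1]
    start = req[2] << 8 | req[3]
    qty = req[4] << 8 | req[5]
    data = b""
    for i in range(qty):
        val = registers.get(start + i, 0)
        data += bytes([val >> 8, val & 0xFF])     # register values are big-endian
    body = bytes([slave, fn, len(data)]) + data
    crc = modbus_crc(body)
    return body + bytes([crc & 0xFF, crc >> 8])   # CRC low byte first
```

Reading 3 registers this way yields an 11-byte response — the same shape the charge controller was sending back.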

I also built a live dashboard that pulls structured packets off the mesh sink’s serial port, parses the TLV (type-length-value) encoding the firmware uses to pack register readings, and displays the latest value for each register in a clean terminal table. It has a demo mode that shows realistic fake values when no hardware is connected — useful for showing people what the data looks like without requiring a full hardware setup.
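The TLV format is simple enough to sketch. This is an assumed layout — one type byte, one length byte, then the value — not necessarily the exact encoding the firmware uses:

```python
def tlv_pack(entries: list) -> bytes:
    """Pack (type, value) pairs as: type byte, length byte, value bytes."""
    out = b""
    for t, v in entries:
        out += bytes([t, len(v)]) + v
    return out

def tlv_unpack(blob: bytes) -> list:
    """Walk the blob and recover the (type, value) pairs."""
    entries, i = [], 0
    while i + 2 <= len(blob):
        t, length = blob[i], blob[i + 1]
        entries.append((t, blob[i + 2:i + 2 + length]))
        i += 2 + length
    return entries
```

The appeal of TLV for a mesh payload is that the sink can skip over types it doesn't recognize, so old dashboards keep working when the firmware starts reporting new registers.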

Rounding it out: a packet parser that takes a hex blob copied from a serial monitor and produces a formatted table of fields, and a set of slash commands that wrap the full flash-test-debug loop. /flash calls nrfjprog --recover first (always, without exception) and then programs the correct hex. /sniff starts the protocol sniffer. /simulate starts the slave simulator with a chosen register map. /dashboard connects to the sink.

The pattern I noticed while building these is that each tool corresponds to a problem I’d gotten stuck on and had to reason through manually. The sniffer exists because I spent too long guessing what was on the wire. The simulator exists because I spent too long depending on bench hardware that might or might not be behaving correctly. The flash command exists because I forgot to recover the chip before flashing once, and spent an afternoon chasing ghost behavior from a partially-programmed device.

If a problem keeps coming back, build it into a command. The command is the lesson.

The tooling now means that the next time something is wrong — and there will be a next time — I have a repeatable methodology: flash, sniff, check wire, check firmware handling, check timing. In that order. The hard-won order.


What This Taught Me

None of this is abstract. These are specific things I got wrong, and what I learned from getting them wrong.

Timing problems hide as data problems.

When the symptom is “I’m getting wrong data” or “I’m getting incomplete data,” the instinct is to look at data: the CRC, the byte parsing, the function code, the register map. That’s often the right instinct. But if the data is right and just missing, the first question should be about timing — specifically, about whether there is a window in your state machine during which the other side is transmitting and you are not listening. In half-duplex RS-485, that window is created by your own TX-to-RX switchover. If that switchover is even slightly too slow, you silently discard the beginning of every response.

I should have gotten to this in the first hour. Instead I got to it on day three. The reason is that timing problems don’t look like timing problems — they look like data problems. The CRC check catches corrupt data. It doesn’t tell you about data that never made it into the buffer.

The wrong mental model is a harder bug than wrong code.

The code was fine. The CRC was fine. The UART peripheral was behaving exactly as specified. My mental model said “the problem is in the protocol layer” — and that model sent me deep into the UART driver documentation, into DMA configuration, into baud rate verification, all of which were correct and all of which I confirmed were correct. None of that confirmation helped, because the model was wrong.

The delay I’d written was not the first thing I questioned because I’d written it, I’d filed it as boring infrastructure, and I wasn’t revisiting it. The lesson is not “always check everything” — that’s impractical. The lesson is: when you’ve ruled out the obvious suspects and you’re still stuck, the model itself deserves explicit scrutiny. Ask: “What am I assuming is correct that I haven’t actually tested?” Write the assumptions down. That list is usually where the bug is.

Network mismatches are invisible and cost full days.

Early in this project, I lost a day to a device that was “offline.” The mesh had a test network (different network ID, different encryption keys) and a production network. I flashed firmware configured for the test network onto a board at a site running production network settings. The device transmitted perfectly. It received nothing. It didn’t log an error. Nothing in the firmware indicated anything was wrong.

The symptom was simply: device shows up in the sink log, no data flows. There’s no “wrong network” error message in the Modbus layer. There’s no protocol-level indication. The board was working, the radio was working, and the data was going nowhere because it was shouting into a different room. This is a specific configuration management problem, but the general lesson is: when a device appears operational but data isn’t flowing, check your network identity parameters before debugging anything else.

Always recover before flashing.

nRF52 chips have an access port protection feature that, if enabled, blocks external debug access. If you flash firmware that enables it, or if you’re working with a chip that has it enabled from a previous flash, and you try to program over it without first running nrfjprog --recover, you end up with partially-programmed flash — sometimes old interrupt vectors, sometimes old constants, sometimes combinations of old and new code that produce behavior neither version would produce on its own.

The symptom is unreproducible, because it depends on the exact memory layout of the old firmware. I hit this once, chased it for half a day, and never hit it again after making recovery a mandatory first step. The recover command erases and unlocks the device. It takes 15 seconds. There’s no reason not to do it.

The thing I still don’t fully understand.

The interrupt-disable-during-TX-burst workaround bothers me. I know it works. I know why it works at a surface level: the mesh stack runs interrupt-driven, and if it interrupts the UART transmission mid-frame, it can stall byte transmission long enough to look like a frame boundary to the slave.

What I don’t fully understand is why this happens at all. The mesh stack’s interrupt handler should run quickly, yield, and return. At 9600 baud, a byte takes about 1ms. If the interrupt handler runs in less than 1ms — which it should — it shouldn’t cause a visible inter-byte gap. But I have seen evidence that it occasionally does: register reads that fail only when the mesh stack is active, and succeed reliably when IRQs are masked during the burst. Something in the mesh stack’s interrupt context is taking longer than it should, or there’s a priority inversion I’m not tracking, or the UART’s TX FIFO isn’t as deep as I’m assuming. I have a workaround. I don’t have an explanation. I’ve left a comment in the code that says exactly that.


What’s Next

The immediate next step is expanding beyond the single sensor type I’ve been testing against. Other charge controller models use the same Modbus RTU protocol but different register maps. Weather stations and energy meters add different function codes and data types — some return 32-bit floats packed across two registers, some use signed integers with scale factors in a separate register. I want the firmware to handle these without a code change for each new device type.

The direction is config-driven register maps: a small table that describes each register address, its function code, data type, scaling factor, and how it gets packed into a mesh packet. Add a device type to the table, reflash, and the board can poll it without changing the core driver logic. That’s the goal, anyway. Whether the abstraction holds for every edge case a production solar installation can produce is something I’ll find out the hard way.
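As a sketch of what such a table entry might look like — the field names and the example register are hypothetical, not the firmware's actual schema, and this assumes big-endian word order for 32-bit values (some devices swap the two words, which the table would also have to encode):

```python
from dataclasses import dataclass
import struct

@dataclass
class RegisterDef:
    address: int    # Modbus register address
    function: int   # 0x03 holding, 0x04 input
    count: int      # registers to read (2 for 32-bit values)
    kind: str       # "u16", "s16", "u32", or "f32"
    scale: float    # multiply the raw value by this

def decode_value(regdef: RegisterDef, words: list) -> float:
    """Turn raw 16-bit register words into a scaled engineering value."""
    raw = struct.pack(">" + "H" * len(words), *words)  # big-endian word order
    fmt = {"u16": ">H", "s16": ">h", "u32": ">I", "f32": ">f"}[regdef.kind]
    return struct.unpack(fmt, raw)[0] * regdef.scale

# hypothetical entry: battery voltage, input register, reported in 0.01V units
BATTERY_VOLTAGE = RegisterDef(0x3104, 0x04, 1, "u16", 0.01)
```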

I also want to build a hardware-in-the-loop test rig — the slave simulator running on a Pi, the addon board on a bench, a script that exercises every register address and every error condition (timeout, malformed response, wrong CRC) overnight and reports pass/fail. Right now the test coverage is “it worked when I tested it.” That’s not good enough for something running unattended in the field.

The bigger reflection, after all of this, is something I keep coming back to: the hardest bugs in firmware aren’t bugs in your code. They’re bugs in your assumptions about time. The code was doing exactly what I wrote. The assumption that 2ms was a safe settling delay was wrong, and it was wrong in a way that looked like a protocol error, a hardware error, and a driver error before it looked like what it actually was. Embedded systems run in time. When something is broken, check your assumptions about when things happen before you check whether the bytes are correct.