Following the resolution of the ALTINST mode issue, I returned to investigating the VIA C7 boot problem. Under certain circumstances, some Esther-based systems were experiencing a sudden reboot shortly after the kernel was loaded, which I was able to reproduce on my Biostar Viotech 3100+ motherboard. This led to a lengthy debugging process before I could finally identify the culprit!
Initial analysis
Typically, the bugs I encountered before were related to the NetBSD kernel or drivers. Initially, I assumed that this might also be an issue with the INSTALL kernel configuration because, during the early stages of investigation, the problem only occurred with the install image (not with the fully installed system on my SD card). A critical discovery in the debugging process was that the issue specifically occurred when ACPI 3.0 was enabled. I downloaded older releases to identify the first affected version and found that NetBSD 7.0 was the first to exhibit the symptoms.
To narrow down the problematic commit, I decided to install an older release on my USB image and build several kernels between different 6.99.x versions. However, I soon realized that the kernel wasn’t the issue – older 6.x kernels were also failing with the newer install images. At the same time, the install kernel was successfully booting from my SD card setup. At this point, it became clear that the problem was with the bootloader.
I then began building full distributions from various 6.99.x versions to pinpoint the commit responsible. This process was slow and painful, taking between 5 and 8 hours for each build. In hindsight, I could have built only the bootloader code, but at the time, I didn’t know where the affected code was or how to build it. After several weeks of this repetitive process, I finally identified that the reboot issue began after the switch to GCC 4.8, specifically when the boot parameters were fixed in this particular commit. Unfortunately, this didn’t offer much insight into the underlying problem, and this approach reached a dead end.
At that point, I returned to the current code and started debugging the kernel’s behavior.
Kernel debugging
The boot log printed only a few messages before the reboot, but it still provided a useful starting point, especially the last line: ‘pmap_kenter_pa: mapping already present’. I quickly located this message in the code and began investigating what was happening. The comment in the conditional block stated, ‘This should not happen,’ but clearly, it did. Eventually, the code called the kcpuset_copy() function while both of its arguments were still uninitialized, leading to a null pointer dereference during the memcpy() call and triggering a sudden reboot.
Knowing this was helpful, but it didn’t explain why this ‘should not happen’ situation was occurring. Due to the early stage of the kernel boot process, getting a useful stack trace was either difficult or impossible. Nevertheless, I began tracing the pmap_kenter_pa() calls and hypothesizing where the call could have originated, especially since I knew that the global kcpuset_running variable was not supposed to be set at this point. This is where comparing the behavior of ACPI 2.0 and ACPI 3.0 became useful. I inserted various debugging messages in the relevant parts of the code and compared the memory values between the two. This quickly led me to discover invalid virtual memory values that were dependent on the parameters passed from the bootloader (e.g., atdevbase, PDPpaddr).
It took some time to pinpoint exactly where the problematic code was being executed. My initial assumption was that the issue occurred somewhere in kern/init_main.c main(), but it turned out to be earlier, in init386() (specifically init386_pte0()), which was making the problematic pmap_kenter_pa() call. At this stage, kcpuset_running is not yet initialized, as that occurs later. However, since the conditional block causing the reboot wasn’t supposed to be executed so early in the process, no assertions were added to that part of the code.
Despite this progress, I was slowly hitting a dead end again. It was clear that something was wrong with the memory values, and code inspection showed their dependency on the bootloader’s input, potentially causing a reboot when these values deviated too far from the expected range. This also explained why the boot process sometimes succeeded. I could identify some workarounds at this point—such as ignoring the eblob value in the calculations—but these were not viable long-term solutions. It was becoming clear that I needed to start debugging the bootloader itself!
BIOS bootloader
At this point in the analysis, I already knew that the affected bootloader code was located in the sys/arch/i386/stand path and that it was part of the biosboot bootloader. With the help of other NetBSD developers and the documentation, I learned how to build the bootloader alone and install it into my installation image. This greatly sped up the debugging process, as I no longer needed to build the full distribution.
This process is relatively simple using the build.sh framework:
# build i386 cross-compile toolchain
./build.sh -T ../tools -O ../obj -U -j6 -mi386 tools
# to avoid searching for all dependencies repeatedly, build the distribution once using the build.sh framework
./build.sh -T ../tools -O ../obj -U -j6 -mi386 distribution
# navigate to the i386 bootloaders code folder
cd sys/arch/i386/stand/
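# build the bootloaders and repeat the process as many times as necessary
# (the tools directory sits next to the source tree, five levels up from here)
../../../../../tools/bin/nbmake-i386 -j6 dependall
# install to destdir
../../../../../tools/bin/nbmake-i386 -j6 install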
Then, back at the top of the source tree, install the BIOS bootloader:
# mount NetBSD install image
mount /dev/sd0a /mnt
# copy secondary bootstrap to the root folder
sudo cp ../obj/destdir.i386/usr/mdec/boot /mnt/boot
# copy bootxx_* files (likely optional)
sudo cp ../obj/destdir.i386/usr/mdec/* /mnt/usr/mdec/
# install the primary bootstrap
installboot /dev/sd0a ../obj/destdir.i386/usr/mdec/bootxx_ffsv1
# unmount install image
umount /mnt
To avoid constantly mounting, unmounting, and re-attaching the USB stick, files can also be transferred via SSH and installed directly.
The main logic of the bootloader resides in the exec_netbsd() function, which is called by various i386 bootloaders with parameters from the primary bootloader’s input. This function loads the kernel, calculates its size, and performs related tasks. My debugging process focused on identifying where memory values were becoming incorrect. The marks[] array, where values are set during the kernel load process, was a major point of focus.
After multiple attempts, I discovered that the initial values in marks[] were correct immediately after the kernel was loaded, but they became corrupted by the end of the common_load_kernel() function. Only a few calls happen between these two points, which made it easier to identify the cause. The bi_getmemmap() call turned out to be the culprit: the stack was already corrupted by the time this function returned.
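The bracketing technique described above (take a copy of the values, run one suspect call, compare) can be sketched in C. This is only an illustration and not the bootloader's code: NMARKS, suspect_fn, first_changed_mark() and the two sample callees are hypothetical names, and the suspect call receives the array explicitly here, whereas in the real bug the damage came from a stack overrun.

```c
#include <string.h>

#define NMARKS 5 /* hypothetical array size, chosen for the sketch */

/* Hypothetical signature for "one call between the two checkpoints". */
typedef void (*suspect_fn)(unsigned long *marks);

/*
 * Snapshot marks[], run the suspect call, and report the index of the
 * first element that changed, or -1 if the array survived intact.
 */
int first_changed_mark(unsigned long marks[NMARKS], suspect_fn fn)
{
    unsigned long before[NMARKS];

    memcpy(before, marks, sizeof(before));
    fn(marks);

    for (int i = 0; i < NMARKS; i++) {
        if (marks[i] != before[i])
            return i;
    }
    return -1;
}

/* Sample callees: one leaves the array alone, one clobbers an entry. */
void well_behaved(unsigned long *marks) { (void)marks; }
void clobbering(unsigned long *marks) { marks[2] = 0; }
```

Wrapping each call between the two checkpoints in turn narrows the search to a single function, which is how a short list of candidates shrinks to one.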
The issue was eventually narrowed down to the getmementry() call, which is invoked multiple times within bi_getmemmap(). Even a single call to getmementry() was enough to corrupt the stack right after returning from bi_getmemmap(). Debugging this was challenging because getmementry() is written in assembly and had not been modified for many years.
With assistance, I eventually found that 6 words (24 bytes) were being written into a buffer allocated for 5 words (20 bytes) when ACPI 3.0 was enabled. ACPI 3.0 extended the INT 0x15, EAX = 0xE820 BIOS function for memory detection, growing each entry from 20 bytes to 24 bytes to accommodate extended attributes. Only a few motherboards supported 24-byte entries initially, and the specification requires the function to return 20 bytes when that is what the caller requests, regardless of whether 24-byte entries are supported. Some VIA systems shipped with a buggy BIOS that returned 24 bytes anyway.
The temporary buffer had no room for 24 bytes, so the extra 4 bytes overran it and corrupted the stack. The fix was to increase the buffer size from 5 to 6 words!
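To make the arithmetic concrete, here is a small C model of the overrun. It is a sketch under stated assumptions, not the actual assembly from getmementry(): the "stack frame" is simulated with a plain byte array, buggy_bios_e820() is a hypothetical stand-in for a BIOS that always writes 24 bytes, and the canary value stands for whatever happens to sit next to the buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define E820_ENTRY_ACPI3 24u /* base (8) + length (8) + type (4) + attributes (4) */
#define CANARY 0xDEADBEEFu   /* stands in for the neighbouring stack data */

/* Hypothetical buggy BIOS: ignores the requested size, always writes 24 bytes. */
static void buggy_bios_e820(uint8_t *dst)
{
    uint8_t entry[E820_ENTRY_ACPI3];

    memset(entry, 0xAA, sizeof(entry));
    memcpy(dst, entry, sizeof(entry));
}

/*
 * Model a frame holding a buffer of buf_words 32-bit words followed
 * directly by a neighbouring value. Returns 1 if the neighbour survives
 * the BIOS call, 0 if it was overwritten.
 */
int neighbour_survives(unsigned buf_words)
{
    uint8_t frame[32];
    size_t buf_bytes = buf_words * sizeof(uint32_t);
    uint32_t canary = CANARY, after;

    assert(buf_bytes + sizeof(canary) <= sizeof(frame));

    memset(frame, 0, sizeof(frame));
    memcpy(frame + buf_bytes, &canary, sizeof(canary));

    buggy_bios_e820(frame); /* writes bytes 0..23 of the frame */

    memcpy(&after, frame + buf_bytes, sizeof(after));
    return after == CANARY;
}
```

With a 5-word buffer the neighbouring value occupies bytes 20 to 23 and is overwritten; with 6 words it moves to bytes 24 to 27 and stays intact, which mirrors the one-word fix.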
Conclusion
The Biostar Viotech 3100+ now boots successfully, with memory values consistent whether ACPI 3.0 is enabled or not. Changes have been applied to the NetBSD 10 and NetBSD 9 branches (as older releases are no longer supported). I must admit, this investigation was challenging: it involved a long process of narrowing down the issue, making incorrect and time-consuming decisions, countless reboots, and numerous bootloader builds and reinstalls. It required many long evenings to make slow progress or to rule out incorrect theories.
One might question whether spending time on outdated systems is worthwhile, and the answer might be no. However, like every solved mystery, it provides significant rewards in terms of knowledge and experience. I believe this exercise was valuable for me, and if even one user benefits from this fix, it will have been worth it. This motherboard has allowed me to address several issues: from a broken temperature sensor and a faulty ALTINST mode disable process to finally resolving the boot process failure. Now, I can give it some well-deserved rest and shift my focus to other tasks.