Some years ago we had a PDP-11/70 that we used for, among other things, software development using cross-assemblers - like writing Motorola 6800-series code using a PDP-11 utility. As part of the process, because we had a Big-Endian/Little-Endian problem, we had to go into the binary download file and do byte-swap operations on the RIM-loader file that we used to burn ROM chips. The byte-swapper was written in BASIC, which on the PDP-11 was compiled to machine code.
Then, because we needed to upgrade the overworked 11/70, we bought a VAX 11/780 running VMS - supposedly a faster machine with a wider data bus. The byte-swap program, which as compiled BASIC ran for 19 minutes on the old, clunky, slow PDP-11/70, was migrated to the VAX, again as BASIC compiled to native VAX instructions. It took over 22 minutes. Needless to say, the boss was not a happy camper because that "hunk of junk" wasn't living up to expectations.
The guy in charge of the cross-compiler projects called me in to take a peek. I did a purely software fix: I rewrote the byte-swapper in BASIC with one VAX assembly-language subroutine that did the actual swapping step, leaving the other parts of the program in compiled BASIC to read the unswapped data and write the swapped data as an appropriate file type. The old pure-BASIC program took (by system high-precision timer) 22 minutes 15 seconds; the new code took 17 seconds.
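The swapping step itself is tiny, which is why moving just that piece out of BASIC paid off so well. Here is a minimal sketch in Python rather than VAX assembly, assuming the data is a flat run of 16-bit words; the function name `swap_bytes_16` is made up for illustration, and the real RIM-loader file layout is not reproduced here.

```python
def swap_bytes_16(data: bytes) -> bytes:
    """Exchange the two bytes of every 16-bit word in data.

    Assumption: data is a plain sequence of 16-bit words with no
    header or checksum records that would need special handling.
    """
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]  # odd-position bytes move to even positions
    swapped[1::2] = data[0::2]  # even-position bytes move to odd positions
    return bytes(swapped)


if __name__ == "__main__":
    print(swap_bytes_16(b"\x12\x34\xab\xcd").hex())  # prints 3412cdab
```

The two slice assignments do the whole job in bulk instead of looping byte by byte in interpreted code, which is the same idea as pushing the inner loop down into an assembly routine.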
We also had a database compiler that built proprietary hierarchical databases tailored for the end product, an industrial management system. On the VAX, the people who wrote the compiler bemoaned the fact that one compilation often took 70 hours or more, and one of our neighbors had some high-inductance motors that, when they started, caused a brown-out - thus killing the compile. So I analyzed the code, found a bottleneck, and fixed it. By removing one line of code (a particular subroutine call), I got the compiler to run in just a few seconds over one hour. The project leader was almost in tears, she was so happy. If we had bought the biggest, fastest VAX available at the time, we would have had about a 50% speed increase on the instructions, but the I/O was still going to depend on HDD latency, and the nature of this beast was a paging problem. We couldn't afford to throw memory at it because the virtual size of the program never stopped growing. By removing one unneeded subroutine call, I obtained a 78-fold speed improvement. Not 78%... 78.6 TIMES faster, or about a 98.7% reduction in run time, purely based on software.
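For anyone who wants to check the arithmetic behind "N times faster" versus "N percent less run time", here is a small Python sketch. The helper names are invented for illustration; the inputs are the figures quoted in this post.

```python
def speedup_factor(old_seconds: float, new_seconds: float) -> float:
    """How many times faster the new run is than the old one."""
    return old_seconds / new_seconds


def run_time_reduction_percent(factor: float) -> float:
    """Percent of the original run time saved by a given speedup factor."""
    return 100.0 * (1.0 - 1.0 / factor)


if __name__ == "__main__":
    # Byte-swapper: 22 min 15 s before, 17 s after.
    swapper = speedup_factor(22 * 60 + 15, 17)
    print(round(swapper, 1))                            # prints 78.5
    # A 78.6-fold speedup expressed as run time saved.
    print(round(run_time_reduction_percent(78.6), 1))   # prints 98.7
```

A speedup factor and a percentage reduction in run time are different measures: a 50% faster CPU saves about a third of the compute time at best, while a 78-fold speedup saves nearly all of it.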
The moral of this story? Don't put all of your improvement eggs in one basket. You WILL find a bottleneck somewhere after every hardware upgrade. It is inevitable.