Ramblings on Testing and the Advances of AI Systems.

11th March 2025

Tomorrow marks my 13th anniversary at work - and it has been a great 13 years. I have been lucky enough to work on many great problems, and fascinating technologies. I am excited to see new horizons and frontiers arise in my sector, and thinking about the future is more interesting as a result. Today I am sitting at home with no functional electricity, waiting for some power upgrades to be completed, so I am mulling over the things I have learned so far in my career, and how they apply to the technologies I see growing right now.

My time has been devoted to enabling our software engineers as they work on some of the biggest challenges our industry faces: from compilation to acceleration, heterogeneous architectures to safety systems, debugging to software standards, and from graphics to AI. My role in all of this has been to ensure that we could work effectively and at pace. My primary areas of focus are Testing and Security, two disciplines that go hand in hand and that have steered me well throughout the incredible journey we took as my employer grew and evolved.

We focus on technologies related to AI and HPC, and while the challenges of HPC are fairly well understood at this point I think we are just starting to scratch the surface of “AI”, especially given the broad subject area that such a small acronym actually covers. I do not work anywhere near the frontiers of AI, and my understanding of the technologies used by AI labs is sadly limited, but I look at how the technologies we do work with affect various parts of the stack, and I spend a lot of time thinking about how the concepts of Testing and Security affect (and are affected by) the pursuit of “AI”. As a parent to young children I also can’t help but ponder the realities of learning and how they connect.

When we look from a layperson’s perspective at the way that AI labs are training their large language models we see a straightforward and familiar concept - they take a large corpus of text, and let the system read it - connecting words, concepts, finding patterns, storing associations. Of course there is nothing straightforward about the implementation of these steps, and the work being done in that area is mind-blowing, but my attention immediately goes to the data being used. The largest models seem to be trained on huge volumes of data, glibly referred to as “The Internet”. I understand this to be a huge glut of forum posts, scanned books, social media, reference websites, and all sorts. This data seems to serve a dual purpose - both teaching the rules of communication, and also providing the information to be considered and connected together. But I wonder, would I want my young children to learn simply by giving them unfettered access to this same data source? When someone refers to “The Internet” in such broad terms you might well expect a forthcoming rant about horrifying hives of scum and villainy, or a delightful outline of learning and community. Is this apparent duality just a reflection of our own personal bias? If I could personally read all of this data would I become incredibly wise?
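To make that layperson’s picture a little more concrete, here is a deliberately tiny sketch of the same idea: read some text, store associations between words, then generate by sampling from them. It is nothing more than a bigram counter - real labs use tokenisers, transformers, and gradient descent at unimaginable scale - but it makes the point that whatever the corpus contains, biases and all, is exactly what the model has to work with.

```python
# A toy illustration of "read a corpus and store associations": a bigram model
# that counts which word tends to follow which, then generates by sampling.
# Whatever the corpus says - including its biases - is all the model knows.
from collections import defaultdict, Counter
import random

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def continue_text(word, length=6):
    out = [word]
    for _ in range(length):
        options = following.get(out[-1])
        if not options:
            break
        words, weights = zip(*options.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

print(continue_text("the"))  # e.g. "the cat sat on the rug ."
```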

I am certain that much has been researched and written about the intrinsic bias of Internet communities, perhaps related to the demographics present in certain parts of the data. Maybe the issue is well enough understood that it is handled in the training process. Still, I find myself wondering about the things that we don’t write down. Or the situations where a “meme” is more popular than a related truth. Perhaps this could be exaggerated by Mixture of Experts style models, which might focus on areas where bias is more prevalent. I recently discussed this with a colleague, using stereotypes as an example, wondering whether there could be more text on the Internet that reinforces (directly or indirectly) specific stereotypes than there is disproving them, or identifying them as stereotypes. I am a Scottish person; we are stereotypically portrayed as being miserly with money, or difficult to understand, or prone to wearing kilts. Are the allusions to those stereotypes easy to misunderstand as facts? That’s a very unsubtle example, but amusingly The Internet itself presented a funnier one - AI answers describing Haggis as “a small furry mammal native to Scotland” which was known for having “asymmetrical legs”. How do we intrinsically know what is “real”? How do we recognise satire? Do AI labs just remove The Onion from their training data? What is the value of the unsaid, the undocumented?

I remember first experimenting with cellular data, albeit using Wireless Application Protocol over GPRS. I waxed poetic with friends about how incredible it would be to be able to look things up on the brand new Wikipedia website while out and about. I welled up thinking about how wonderful and beautiful it was to be at the beginnings of what I saw as “free access to data”. That part of me wants to believe that the Internet might include enough real information and nuances of the human experience to train intelligence - but are there gaps? What about real day-to-day human existence isn’t captured? Every wonderful success we see with these models makes us want to trust the technology more, but I fear that our ability to find and fix problems isn’t good enough.

I suspect part of this fear comes from my own experience putting things out into the world. With our work on software we similarly put time into building something, testing it as well as we can, and releasing it. If we discover later that something is wrong, we can create a patch - usually by having a small team of specialists focus on the problem (leaving the rest of the team to do other things). But distribution is the issue: you face the challenges of ensuring the end user knows about the patch, gets the patch, and installs the patch. In a SaaS style model, life is potentially easier as you handle distribution internally, but you likely still incur serious logistical overheads and oppressive timescales. I used to work in the publishing industry, where a slightly different “patch” issue existed - a lot of time and effort went into researching, writing and editing a complicated technical book, followed by design, print, and distribution. Effective patching of issues discovered after distribution was practically impossible, since we couldn’t know who’d bought our books at retail, and could never assume they would look for errata on our website. All of this makes me think about how “patching” can work with these huge models.

Training a large model is a significant investment of time and resources. If it’s clear the model includes significant issues like a bias, a gap, a predilection for violence, or some other problem - can it be “patched”? Some issues could be worked around with clever system messages (Asimov’s Laws but for AI?), or other processes that can happen at a later stage. But unlike devoting a small number of engineers to a patch while the rest keep working, restarting model training from scratch to fix these issues uses all of your resources, and blocks other work. Does this mean only the biggest issues would be fixed? Distribution is interesting too: if a model is published as open weights and used in applications outside of your control then “patching” is an impossible task, and the issues (whether they be bias or something else) will be outside of your ability to resolve. Even if you do retrain, the end result might be a large file posted after quite some time has passed - which surely impacts the uptake of that fixed version? Do distilled models compound these issues further, since output from a potentially flawed model could be used to train another?
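As a rough illustration of the “clever system messages” kind of workaround, here is a minimal sketch of a guardrail applied at request time rather than baked into the weights. The call_model function is a hypothetical placeholder, not any particular vendor’s API.

```python
# A minimal sketch of "patching" behaviour with a system message instead of
# retraining. `call_model` is a hypothetical placeholder for whatever hosted
# or local model you actually use; the "patch" lives outside the weights and
# is applied on every request.
GUARDRAIL = (
    "Refuse requests that promote violence, and flag answers that rely on "
    "stereotypes rather than sourced facts."
)

def call_model(messages):
    # Placeholder: imagine this forwards `messages` to a real model endpoint.
    return "(model response would appear here)"

def patched_chat(user_prompt):
    messages = [
        {"role": "system", "content": GUARDRAIL},   # the behavioural "patch"
        {"role": "user", "content": user_prompt},
    ]
    return call_model(messages)

print(patched_chat("Tell me about haggis."))
```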

Humans don’t learn in a vacuum; we don’t solely get our information from books, or from looking out at the world through a window. Our learning is experiential, full of constant tests and validations. I remember the frustrations of watching my kids throw their food on the floor for the second or third time in a row despite my protestations - their behaviour seems to be less about the food and more about “let’s see if doing this action makes daddy have the big feelings again this time”! It’s exciting to see people in AI talk about this kind of learning too, through applications of different types of reinforcement learning like self-play. As an occasional Go player I was delighted to see all of the reactions to AlphaGo’s Move 37 against Lee Sedol, a move that seemed to defy expectations and go against the established norms. AlphaGo had “thought outside the box”, in a way I believe was only really possible because of this type of training.
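For the curious, here is a toy sketch of what self-play can look like, using a tiny counting game rather than Go. Two copies of the same policy play each other and the winner’s choices are reinforced - a hugely simplified caricature of what systems like AlphaGo actually do, but it shows how a rules-based game lets the system generate its own training signal.

```python
# A toy self-play learner for a tiny rules-based game: players take turns
# adding 1, 2 or 3 to a running total, and whoever reaches exactly 21 wins.
# Both sides share one statistics table and improve simply by playing
# themselves - nothing like AlphaGo's search and neural networks.
import random
from collections import defaultdict

TARGET = 21
wins = defaultdict(int)    # (total_before_move, move) -> games won after playing it
plays = defaultdict(int)   # (total_before_move, move) -> games it was played in

def choose(total, explore=0.1):
    moves = [m for m in (1, 2, 3) if total + m <= TARGET]
    if random.random() < explore:
        return random.choice(moves)  # occasionally explore something new
    return max(moves, key=lambda m: wins[(total, m)] / (plays[(total, m)] or 1))

def play_one_game():
    total, player, history = 0, 0, {0: [], 1: []}
    while total < TARGET:
        move = choose(total)
        history[player].append((total, move))
        total += move
        if total == TARGET:
            winner = player
        player = 1 - player
    for p in (0, 1):                      # reinforce the winner's choices
        for key in history[p]:
            plays[key] += 1
            if p == winner:
                wins[key] += 1

for _ in range(20000):
    play_one_game()

print(choose(18, explore=0.0))  # should learn to play 3 and win immediately
print(choose(16, explore=0.0))  # likely learns to play 1, leaving a losing 17
```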

But Go is a system with rules, which means it can be tested, opening the door to self-play and any related qualitative assessments. It seems from reading about new AI systems that there are many other domains where this kind of reinforcement learning can supplement training. Copilot and other tools have focussed on software, and I have heard comments many times from colleagues saying “we just need to hook up a compiler and debugger and let it solve problems itself”. I assume this has been done now, probably years ago. But having a career that revolves around compiler bugs, poorly implemented unit tests, naively copy-pasted Stack Overflow answers, and other issues, I do wonder how trustworthy this kind of testing can be. Perhaps instead of having the system test itself, software quality test suites are used as some other kind of benchmark for the generated code. Maybe challenges or interview questions are used too - perhaps one AI system tests another. But none of this seems as clear cut as the rules-based system that AlphaGo learned to navigate, so perhaps those solutions would effectively just teach the model to produce code in a very expected manner - distancing it from being able to produce something akin to Move 37.
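In its simplest imaginable form, that “hook up the tools and let it check its own work” loop might look something like the sketch below. The generate_candidate function is a hypothetical stand-in for a code-generating model, and the sketch assumes a Python project tested with pytest; the important caveat is that the loop can only ever be as trustworthy as the tests it runs.

```python
# A rough sketch of the "hook up the tools and let it check its own work" idea:
# ask a model for a candidate implementation, run the project's test suite,
# and feed any failures back as context for the next attempt.
# `generate_candidate` is a hypothetical stand-in for a real code-generating
# model call; the whole loop is only as trustworthy as the tests it runs.
import subprocess
from pathlib import Path

def generate_candidate(prompt: str) -> str:
    # Placeholder: imagine this asks an LLM for Python source code.
    raise NotImplementedError("wire this up to a code-generating model")

def run_tests() -> tuple[bool, str]:
    # Assumes a Python project with a pytest test suite in the working directory.
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def solve(task_description: str, target_file: Path, max_attempts: int = 5) -> bool:
    prompt = task_description
    for _ in range(max_attempts):
        target_file.write_text(generate_candidate(prompt))
        passed, report = run_tests()
        if passed:
            return True
        prompt = f"{task_description}\n\nThe previous attempt failed these tests:\n{report}"
    return False
```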

My degree was in Multimedia, a word that at the time we felt would be so important to the future, but which ultimately stopped being relevant as soon as the concepts it encapsulated became ubiquitous. My fondest memories of the study all relate to Human Computer Interaction. I have a strong memory of laughing with classmates about how suboptimal the keyboard and mouse felt, until you tried to think of something better. I believe that the work being done on conversational LLM style “AI” is legitimately a huge step towards lowering the barrier to entry for complicated computer use. A layperson may struggle with some tasks simply because they don’t know how to properly describe the problems they want to solve. Seeing people interact with ChatGPT and solve problems thanks to the conversation that flows from a half-baked initial prompt makes me confident that these technologies can make things better. But while the various big players race to be the first to hit whatever big goal comes next, I hope they don’t overreach.

As I wrote at the outset of this blog post, tomorrow is my 13th anniversary at my employer. It’s an unlucky number for some. While I am not a superstitious person, I cannot help but feel like this 13th year in particular will bring about unexpected things. I hope they will continue to be positive and inspiring. I hope to see the field of AI continue to evolve - not just in capability, but in how we ensure that it learns responsibly, and benefits humanity.
