Date: 2025-08-05

Summary

Filtering technical candidates fairly is a hard task: make it too easy and you don't exclude enough, add too much friction and it's a turn-off for everyone. At some point, the industry went all in on small take-home tests, but AI killed this strategy.

However, there is still a small corner of the web the LLM crawlers are apparently not swallowing with abandon: specs.

So here was the idea: take PEP 750 and ask your future colleagues to make a demo for it.

And for a tiny moment in time, it worked.

From Turing to Voight-Kampff

Sometimes I get a quick gig to help clients choose their future employees.

Filtering technical candidates is a delicate balance. On one hand, you might get hundreds of people reaching out, and you don't have the material resources to interview them all in person. On the other hand, imposing a heavy price for getting to a live call is disrespectful and inhuman.

While I don't think most of the time you should be hiring the elite (because, guess what, very few companies are actually working on an elite project, get off your high horse), that doesn't mean you should accept incompetence.

In 2010, we were already inundated with bullshit profiles, people attracted by the money but who couldn't do a Fizz Buzz, and this started the rise of take-home coding tests.

Of course, coding tests have the same balancing issue: you want something that can sort out impostors, but that won't make good devs flee because they have better things to do with their lives than jump through the hoops of every single prospective employer.

The solution I found is to make something trivial yet a little fun, but not too long, like level 3 or 4 of AdventOfCode. Then commit to spending as much time on reviewing the remaining good applications as they spent themselves applying, as a token of respect.

This doesn't work anymore thanks to LLMs.

So I started a long quest to find something that humans can do with a fair amount of effort, but that an AI cannot.

At first, stuff like playing with cardinality (remember the strawberry thing?) or arithmetic sorta worked, but they fixed that.

So what now?

Looking forward to it

One day, I had a realization: LLMs are good at things they were trained on, meaning the past, but they have a harder time with things we might do in the future. What if we gave candidates a small exercise on a yet-to-be-implemented spec?

And so I went searching for something original enough to trip up machines, yet interesting and simple enough to keep the humans we want in the loop.

Enter PEP 750, the new t-string syntax that will be implemented in Python 3.14, coming out at the end of this year. The idea is to allow a syntax similar to f-string interpolation, but with a t prefix:

query = t"SELECT username from users where id='{user_id}'"

The twist is that query is not a string: it is a Template object that exposes the static string parts (SELECT username from users where id=' and the closing ') separately from the interpolated value (user_id).

The idea is to make it as convenient as string manipulation is for the developer, while allowing functions that are vulnerable to injections (bash commands, SQL queries, etc.) to easily process parameters that need escaping.
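To make that concrete, here is a minimal sketch (not taken from the PEP) of how a consumer could turn a Template into a parametrized SQL query. The function name to_parametrized_sql is hypothetical, and so are the stand-in classes in the except branch: they exist only so the sketch also runs on interpreters older than 3.14, and they mimic just the iteration behavior and the .value attribute used here. The template is built with the explicit constructor rather than the t"..." literal for the same reason.

```python
from dataclasses import dataclass

try:
    # Python 3.14+: the real classes introduced by PEP 750
    from string.templatelib import Interpolation, Template
except ImportError:
    # Older interpreters: hypothetical stand-ins, for illustration only.
    # They mimic just the pieces used below (iteration and .value).
    @dataclass
    class Interpolation:
        value: object
        expression: str = ""

    class Template:
        def __init__(self, *parts):
            self.parts = parts

        def __iter__(self):
            return iter(self.parts)


def to_parametrized_sql(template):
    """Return (sql, params), swapping each interpolation for a '?' placeholder."""
    sql, params = [], []
    for part in template:
        if isinstance(part, str):
            # Static text written by the developer: kept verbatim
            sql.append(part)
        else:
            # Interpolated value: never pasted into the SQL text
            sql.append("?")
            params.append(part.value)
    return "".join(sql), params


user_id = "1 OR 1=1"
# Roughly what t"SELECT username FROM users WHERE id={user_id}" builds on 3.14
query = Template(
    "SELECT username FROM users WHERE id=",
    Interpolation(user_id, "user_id"),
)
sql, params = to_parametrized_sql(query)
print(sql)     # SELECT username FROM users WHERE id=?
print(params)  # ['1 OR 1=1']
```

The malicious value ends up in the parameter list, where a database driver can bind it safely, instead of being spliced into the query text.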

It's a great feature, and you can already test it because uv is awesome, and it supports beta Python versions as well:

    uv self update
    uv python install python3.14
    uvx python3.14

But even without this wonderful tool, the concept itself is not super complicated, and the spec is very clear. I expect any non-junior dev on my team to understand the PEP and be able to produce a PoC showcasing what it's supposed to do.

Hence, the following exercise was born:

In Python 3.14 there will be a new feature called t-string. Here is the PEP: https://peps.python.org/pep-0750/ It should help with things like avoiding injections. Create a run_command() function that demonstrates how it works. Make it simple, no need to be perfect. It should accept a t-string and execute it using subprocess. Then demonstrate how it helps with avoiding injections by comparing it to os.system() using a similar call with regular strings. Make it a self-contained script that I can run on python 3.14 beta.

To give you an idea, here is what I would come up with if I had to do so:

    import os
    import shlex
    import subprocess
    from string.templatelib import Template


    def run_command(cmd: Template):
        result = []
        for item in cmd:
            if not isinstance(item, str):
                item = shlex.quote(item.value)
            result.append(item.strip())
        return subprocess.run(result)


    if __name__ == "__main__":
        print('\n\nWithout injection:\n\n')
        dir_name = "."
        print('######## os.system')
        os.system(f"ls {dir_name}")
        print('######## run_command')
        run_command(t"ls {dir_name}")

        print('\n\nWith injection:\n\n')
        dir_name = ".; echo POWNED !!!!"
        print('######## os.system')
        os.system(f"ls {dir_name}")
        print('######## run_command')
        run_command(t"ls {dir_name}")

This will output something like this:

    Without injection:

    ######## os.system
    file1 file2
    0
    ######## run_command
    file1 file2
    CompletedProcess(args=['ls', '.'], returncode=0)

    With injection:

    ######## os.system
    file1 file2
    POWNED !!!!
    0
    ######## run_command
    ls: cannot access ''\''.; echo POWNED !!!!'\''': No such file or directory
    CompletedProcess(args=['ls', "'.; echo POWNED !!!!'"], returncode=2)
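The odd-looking ls error in that output is worth a second look. Here is a minimal sketch, using only the standard library, of what happens to the malicious value: shlex.quote wraps it in single quotes, and since subprocess.run with a list never goes through a shell, the value reaches the child process as a single argument either way. The child process is a throwaway Python one-liner that prints its arguments; using it instead of ls is an assumption made for portability.

```python
import shlex
import subprocess
import sys

malicious = ".; echo POWNED !!!!"

# shlex.quote wraps the value in single quotes so that a shell would
# treat it as one word instead of two commands
quoted = shlex.quote(malicious)
print(quoted)  # '.; echo POWNED !!!!'

# With a list and no shell, nothing parses the ';': the raw value
# arrives in the child as exactly one argv entry
child = [sys.executable, "-c", "import sys; print(sys.argv[1:])", malicious]
out = subprocess.run(child, capture_output=True, text=True).stdout.strip()
print(out)  # ['.; echo POWNED !!!!']
```

This is why the quotes show up literally in the ls error message: no shell was there to strip them, and no shell was there to run the injected echo either.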

I specifically state to make it simple, so a few lines just to show off the general idea are enough. Most of the time is spent reading the PEP, which is what I'm really testing the candidate against.

It is a tad longer to do than I wish it would be for a first filter. I was not completely happy with making all candidates go through that just for the privilege of being considered, but it was fairer than most tests I've seen around. So it had to do.

Ex-ter-mi-na-te

For several months, this worked beautifully. None of Claude, Gemini, or ChatGPT was able to solve it. They had never seen the syntax in any code, so they produced f-strings instead, assuming I had misspelled it.

Even today, ChatGPT and Gemini fail miserably at the task, both producing code that is conceptually wrong and not even syntactically correct.

Here is OpenAI's answer:

You can see that not only does it contain an impossible type annotation, it also passes the template string object directly to subprocess.run, which expects a string or a sequence of arguments, not a Template.

Google doesn't do better:

It ignores the PEP link I gave and attempts to solve the problem with an old version of the spec (probably stored in the Google search cache).

And at first, Anthropic failed too.

This worked for a few months, but then something happened that changed the game: people started to write tutorials on t-strings. And one bot clearly trains on good dev blogs, because soon after a few publications, Claude started to make sense:

The entire script is correct, runs on Python 3.14, and does, in fact, demonstrate a use of t-string.

Evidently, if a candidate were to send me this, I would flag the script right away:

It's full of redundant comments, and the docstring is futile.

The return value of subprocess.CompletedProcess is perfectionist to an extreme, which is the opposite of what you would expect from someone writing redundant comments and meaningless docstrings.

It's passing shell=False "for safety", but this is the default value of subprocess.run.

It's demonstrating SQL injections as well, something I never requested. Except it's not really doing that: if you look at the code, it just makes it up.

It's commenting out calls to run_command "for safety" despite the fact the values are hard-coded and it's a demo.

It's way too long and cleanly formatted, with way too many checks (like sys.version). Remember, I asked to keep it simple.

The injection demo is bogus and doesn’t execute the injected code.

It can't help itself and ends with "key takeaways".

But unless you are a total idiot, you can take the script, trim it down, and play pretend.

This test doesn't work anymore. It's solved, in the sense that somebody without the skills to pass it can do so with the help of tooling, and I could be fooled.

And yet it moves

It goes without saying, but this particular instance of the test only works if I intend to hire a moderately experienced backend Python dev. It won't work for a sysadmin, a data analyst, a junior, and so on.

And it's not going to fly anymore, anyway.

But the general idea stays valid. As long as the general public hasn't written about a concept, a model asked about RFCs, specs, and even proposals still in discussion will drift toward guesswork. That's in our favor.

Not because our future silicon overlords are incapable of it; their creators are just not interested in solving those edge cases perfectly yet. Until this blog post gets on their reading list, I guess.

Besides, it's an interesting exercise because it's closer to real work: it's not just about code, but about being a coder. And it makes for interesting discussions during the real interview, too. You can talk opinions, you can see curiosity, you can share points of view.

It's not just filtering bad devs, it's also making your first real contact with the good ones enjoyable. After all, we are planning to work together.