conf 21.11.2025 16:10 – 16:40 La Marive EN

Unicode As Low-level Attack Primitive

Abstract

Whether it is for applications, operating systems, databases, etc. anything that reads, writes, manipulates data must be using an encoding. On modern days, it will mostly always be UTF-8 by default, sometimes UTF-16, both are Unicode standards. Security auditors and researchers often manipulate data or protocols, but what about manipulating the underlying encoding? Unicode has become the one encoding to rule them all, replacing hundreds of old standards. At first glance, it could feel like a simplification. It is not. All those old encodings where ultra-basic while Unicode is overwhelmingly complex beyond what you can imagine until reading the specifications. Over the years, the lack of awareness about Unicode and its complexity have led to a lot of issues and implementation errors. The version 16.0 of the Unicode Standard is 1140 pages long, and there are over 60 UAX (Unicode Standard Annexes), UTS (Unicode Technical Standards), UTR (Unicode Technical Reports), each of which is comparable to an IETF RFC. During the last 3 years, I have analysed about 15 programming languages, none of which is fully implementing 100% of the Unicode standard. Any piece of software around you is probably using Unicode, but none of them have complete implementation of it and all of them a probably different. What could go wrong?

Speaker

Alexandre Zanni

Synacktiv

I'm a pentester & security researcher, so I'm mostly focus on offensive security. Outside penetration tests (where I enjoy web the most), I spent a lot of time in R&D, where a majority of this time investment was spent on one topic: Unicode. So Unicode is, by far, the topic I know best. Some people may know me for my Github activity: writing tools, contributing to open-source software a lot as well as security resources, maintaining packages at [BlackArch](https://blackarch.org/), etc.

Talk

Unicode As Low-level Attack Primitive