Talk

Unicode As Low-level Attack Primitive

conf 21.11.2025 16:10 – 16:40 La Marive EN

Applications, operating systems, databases: anything that reads, writes, or manipulates data must use an encoding. Today the default is almost always UTF-8, sometimes UTF-16, both of which are encoding forms defined by the Unicode Standard. Security auditors and researchers often manipulate data or protocols, but what about manipulating the underlying encoding? Unicode has become the one encoding to rule them all, replacing hundreds of older standards. At first glance, that could feel like a simplification. It is not. Those older encodings were ultra-basic, while Unicode is overwhelmingly complex, more so than you can imagine until you read the specifications. Over the years, the lack of awareness about Unicode and its complexity has led to many issues and implementation errors. Version 16.0 of the Unicode Standard is 1140 pages long, and there are over 60 UAX (Unicode Standard Annexes), UTS (Unicode Technical Standards), and UTR (Unicode Technical Reports), each of which is comparable to an IETF RFC. Over the last 3 years, I have analysed about 15 programming languages, none of which fully implements the Unicode Standard. Almost any piece of software around you is probably using Unicode, but none has a complete implementation of it, and the implementations are probably all different. What could go wrong?
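As a small illustration of why the encoding layer matters, here is a minimal Python sketch (not taken from the talk) using only the standard library's `unicodedata` module. The helper name `describe` and the sample strings are hypothetical, chosen for this example. It shows that the "length" of a string depends on the encoding form, that two visually identical strings can compare unequal, and that compatibility normalization can turn harmless-looking characters into ASCII metacharacters.

```python
import unicodedata


def describe(text: str) -> dict:
    """Hypothetical helper: measure one string in three different units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-16 code units are 2 bytes each; "-le" avoids a byte order mark.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


# One emoji: 1 code point, 4 UTF-8 bytes, 2 UTF-16 code units (a surrogate pair).
emoji_sizes = describe("\U0001F600")

# Two ways to write "é": precomposed, or "e" plus a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Fullwidth angle brackets are not ASCII, so a naive filter looking for "<"
# misses them, yet NFKC normalization maps them to "<" and ">".
fullwidth = "\uff1cscript\uff1e"
contains_ascii_bracket = "<" in fullwidth
after_nfkc = unicodedata.normalize("NFKC", fullwidth)

print(emoji_sizes)
print(equal_raw, equal_nfc)
print(contains_ascii_bracket, after_nfkc)
```

If a filter runs before normalization and a later component normalizes the same input, the two components disagree about what the string contains. That kind of disagreement between implementations is the sort of gap the abstract alludes to.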