
So, first a little backstory about this talk. After receiving yet another marketing email butchering my name during Christmas, I posted this text on Facebook, inspired by a song by The Ting Tings. [Music] Let's see if I... I don't know if I have network, so...
Yeah, that's
enough. My name is Bård. I also go by Bard, Brad and so on. Online my identity is elap. I'm a platform engineer at DNB, the largest bank in Norway. I am an avid supporter of the free and open source software movement and have been organizing meetings for the Bergen Linux User Group for more than 10 years. I have three kids and a wonderful geeky wife, and I am a geek. To me, that means that I like to go deep into things and figure out how they work. Are you ready to do that together with me today? Geeks sometimes measure their geek power by showing off their understanding of how systems work, kind of like I'm
going to do with you today. One common way to do this is to show that you can write something in binary. I have not memorized how to encode my name in ASCII, mostly because it cannot be done: ASCII does not have Å. As a kid I did actually try to find out how my name was encoded in binary. This exposed me to the weird world of character encoding. There's not one way to encode characters; there's an infinite number of ways to encode characters. Some make sense, some maybe not so much. Most modern character encodings use US-ASCII as a base and build onto it, which is why the problems
usually occur when you go outside the US-ASCII space. When talking about encodings we don't usually use the binary codes; we usually use the hex codes, which are easily converted to and from binary, are easier to relate to, and take up less space. But for the explanation's sake, both kind of matter in this talk.

Now let's go into it. What is character encoding? Simply put, character encodings are mappings between numbers and characters. We as humans like to work with text, graphics and sound, while for the computer this is all sequences of
numbers. But we encoded text long before we had computers, though. Think of Morse code as a text encoding; it's even in binary. ASCII also predates computers. Even numbers need to be represented as character code points.

Back in the day, operating system vendors came up with their own encodings for various regions, and this was mostly okay because computers rarely processed documents from other regions or even from other operating systems. Microsoft dominated the market, and if you were using a different operating system it was mostly up to you to find the right encoding so you could communicate with others. Furthermore, no one really expected it to be easy. Then the World Wide Web
came around and we all started connecting to the internet. Suddenly it became very common to communicate over large distances. Unless instructed otherwise, browsers still expected all documents you encountered to have the same encoding as you were using locally. Even though Unicode was invented roughly at the same time as the World Wide Web, and UTF-8 came around in 1993, we started using HTML entities to make sure that text was correctly displayed. In 2003 we were still using HTML entities, 10 years after UTF-8, to make sure that the Scandinavian characters were correctly represented. HTML entities are yet another layer of encoding on top of ASCII. Imagine how many times I've received emails greeting me like
this. It is disappointing when my name cannot be shown correctly. I mostly find it entertaining, but there is no doubt that other people are more seriously affected. Names are no laughing matter.

Today Unicode has won, and we have an agreed-upon way of referencing every character under the sun, and then some. You would think encoding issues were a thing of the past. Unicode has its own set of problems, but we still struggle with the sins of our past. UTF-8 was supposed to fix everything, and to a large degree it did, but we have a huge amount of data already encoded with the old encodings and a huge number of computer systems expecting and producing the old
encodings. We started a transition, a transition that we might never complete. IANA, the Internet Assigned Numbers Authority, currently recognizes 259 standardized character sets. To this day, the Azure DevOps front page greets me like
this. A couple of weeks ago I had a kebab at my favorite kebab place. Notice how they omitted the umlaut over the [unclear], and how the Norwegian Å is displayed. I have not been able to find out how Å became these symbols yet, but I bet it started with a wrong assumption about the input encoding.

I have lost count of the number of receipts and emails I get greeting me with B, A with a tilde, yen sign, r, d. This is by far the most common mis-decoding in Norway, based on my empirical observations. There are of course similar errors with Ø and Æ. This happens as a result of
interpreting UTF-8 as if it was ISO 8859-1, or ISO Latin 1, or Windows code page 1252.

At esun we used a deployment tool called Octopus. They keep sending me emails like this. Any guesses what happened here? This is most likely the result of assuming your UTF-8 encoded input is Macintosh encoded, also called Mac Roman. Mac OS Cyrillic would also produce the same result. And this is the email I got from bides. I'm not sure if this was a joke, because the first few emails were correct and then the last email was like
this. We Works address me like this in emails. I'm not sure what happened here, or even how to pronounce this. If you know, I would like to know, so please tell me. Perhaps it's the same error as with the kebab receipts.

And then there's this. A local company in Bergen used to send me emails where they wrote my name like this. Is this some kind of threat or something? This symbol is called a dagger. I looked into this and concluded that they must have stored my name somewhere using code page 850, also called DOS Latin 1, and then they have interpreted that as if it was Windows code page 1252, which is usually referred to as being
the same as ISO 8859-1. But in the range from 80 to 9F, ISO 8859-1 is undefined. This is to allow for control characters and other adaptations. This is where we, in Windows code page 1252, find the dagger symbol, in the same position. The Web Hypertext Application Technology Working Group decided in the HTML5 standard that whenever you specify ISO 8859-1 as the encoding on a web page, code page 1252 should be used instead. Although this technically makes sense, it adds more confusion. You get access to way more symbols with code page 1252, and Windows users probably already expected all those characters to be there.

At AWS re:Invent I talked to some
sales representatives for Kong and let them scan my badge. They sent me this email, and now we're going to see how that happened. I have a fairly good idea of what's going on here. Most likely it started like with Octopus: interpreting my name, given to them in glorious UTF-8, as if it was Mac Roman or Mac Cyrillic, and encoding the result as UTF-8. Then they have read that UTF-8 encoded string as if it was Windows code page 1252 or, less likely, Windows code page 1254.

Many computer systems are configured to take the local vendor-specific encoding as input and store the data as UTF-8, and if they then read the data
again and store it again, they will encode it once more. For every system your text passes through, there is a new opportunity to make an encoding error. In the previous example it has been misinterpreted at least twice. It could have been misinterpreted more times, but accidentally or purposefully repaired. Even text that is correctly represented may have been encoded wrong somewhere along the line, because sometimes we developers fix problems in one system that were caused by another system. Can you imagine what happens when the original system then gets fixed? Yes, yet another mis-encoding.

Here's another example of interpreting an encoding wrong twice. It's amazing that this shipment actually
arrived. So let's see: this is the result of first interpreting UTF-8 as ISO 8859-1 or -15 or Windows code page 1252, encoding the result as UTF-8, then reading it again as Windows code page 1252.
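
The two-step misreading described above can be reproduced in a few lines. This is only a minimal sketch in Python, not something shown in the talk: the helper names `misread` and `undo` are made up for illustration, and the sample name is the spelling inferred from the speaker's description (B, A with a tilde, yen sign, r, d).

```python
def misread(text: str, actual: str, assumed: str) -> str:
    """Encode `text` as `actual`, then decode those bytes as if they were `assumed`."""
    return text.encode(actual).decode(assumed)


def undo(text: str, actual: str, assumed: str) -> str:
    """Reverse one misread step; only works if no character was lost on the way."""
    return text.encode(assumed).decode(actual)


# Assumed sample input; any string containing the letter would behave the same way.
name = "Bård"

# First system: receives UTF-8 bytes but believes they are Windows code page 1252.
once = misread(name, "utf-8", "cp1252")

# Second system: stores that result as UTF-8, and a reader again assumes cp1252.
twice = misread(once, "utf-8", "cp1252")

print(once)   # BÃ¥rd
print(twice)  # BÃƒÂ¥rd

# Undoing the same two steps in reverse order recovers the original text.
print(undo(undo(twice, "utf-8", "cp1252"), "utf-8", "cp1252"))  # Bård
```

Each misreading turns every non-ASCII character into two, which is why the name grows from four characters to five and then seven. The repair only works here because every intermediate byte happens to be defined in code page 1252; Python's `cp1252` codec raises an error for the handful of byte values that code page leaves unassigned, so a general repair tool would need to handle that case.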
Yeah. Another interesting thing you can see from these examples is how one character becomes two. This is because, while ISO 8859-1 and other traditional encodings use one byte for every character in their character table, UTF-8 makes use of a dynamic number of bytes to represent each character, giving them more than enough room for all the known alphabets in the world, living or dead. Oh yeah, this is how they found room for the beloved pile of feces emoji and all the other emojis. New emojis are introduced all the time.

In UTF-8 the Scandinavian characters are represented with two bytes. Or are they? Take the letter
Å. Å is the 197th character in the Unicode character table. In UTF-8 this becomes C3 85. This is two bytes, but it could also have been encoded as a composition of capital letter A and combining ring above. A is one byte, 41, and combining ring above is two bytes, CC 8A. Together they make three bytes. It's also optional in which order these bytes come, so it could be not only 41 CC 8A but also CC 8A 41. To make matters even worse, Unicode also defines an additional character that looks exactly like the Norwegian character. If you encounter this, you're most likely a
physicist, or you're dealing with text that comes from some kind of optical character recognition system, or you're just a geek. Å, the angstrom sign, is a unit of length named after the Swedish physicist Anders Jonas Ångström, with an Å, not an [unclear], and is encoded as E2 84 AB. One angstrom is 0.1 nanometers, in case you wondered. Encountering this character is apparently so rare that many applications will convert an angstrom sign to an Å behind the scenes. This is also in line with the current recommendations from the Unicode foundation. In fact, this also usually happens if you try to use the combining ring above, but not always.

Plain text is not simple, it's
just unstructured. There's nothing in a plain text file telling you what encoding it is in. You barely have an optional file extension that tells you that it is plain text. If you receive text from an input field or the clipboard, you don't even have that. Image formats and other formats we deal with usually have a header that tells you what format it is, but with text we often just have the encoded text in itself. In UTF-16 there was an attempt at creating such a header. UTF-16 had two different modes, little endian and big endian. This controls the byte order of multi-byte characters. If the first two bytes are FF FE, the
text encoding was little endian, and if it was FE FF, it's big endian. This is called the UTF-16 byte order mark, or BOM for short. This required all software dealing with text to inject the BOM when you cut and remove it when you paste. With lots of legacy software it didn't work so well. You might have encountered the BOM if you opened a plain text file in Notepad and the first character is a black square while the rest of the file looks fine. However, it might also just be a y with a hat in ISO 8859-14.

I have compiled a list of resources as a GitHub gist, so if you scan
the QR code you can find that. Otherwise I'm done with the talk, and if you have questions... Yes? Microphone.

Great talk. Do you think, 20 years from now, Elon Musk's son will be having the same problem as you do? Because he also has a Norwegian letter in his name.

It's probably already happening.

I also told Bård before he started this talk that, you know, you need to ask the audience: does anybody else in the audience have, should I say, non-English characters in their name that are causing problems? Yeah. So, more questions for Bård?

Have you encountered problems with passwords where you use those characters?

Not me, but P told me
about a situation where... Yeah, I can do that one. The Ukraine one? No, no, not that one. Okay. But it's about these combining characters, where on iOS devices they are combined into one character while on PCs they are separate symbols, and someone created a password for their BankID in Norway. I have not confirmed that this is the case, but it's most likely the case, since she was able to log on to her BankID from her phone but not from any other device.

Yeah, I remember way back in 2007, my first time visiting Kyiv in Ukraine, the capital,
visiting colleagues there back then, and out of curiosity of course I asked them: have you ever considered doing passwords using Cyrillic characters? And they just laughed at me, like: you have no idea what kind of world of pain you're going to be in if you do that. First of all, if we travel to the West, you don't have, you know, a keyboard layout with Cyrillic characters on it. But the second one is, and I'm sorry for saying this, but most of you people in the West, you have no idea there exist more letters in the world than A to Z anyway. So they
were just like: that is the most stupid thing you can do. And I've talked to people from all over the world and still ask this question: have you ever tried to do this in Greek, in Arabic, in Chinese, in Japanese? And usually the answer is: no, I wouldn't dare try doing that at all, because it's just going to make a lot of problems for me.

Perhaps a more security-related question, but did you ever encounter any potential attack scenarios? I can imagine, if your name gets converted to something with an ampersand, that could be seen as an extra parameter somewhere, or for example maybe a normalization attack in an email address that could, I don't know,
lead to another user being updated, stuff like that, maybe accidentally. Did you ever see any of that as a life experience?

Well, this example with...
this... imagine this being appended to a URL. So yeah, that can happen. And there are also these cases where... oh wow, that's a lot of echo... where there are more symbols that look the same in Unicode, like the angstrom sign and the Å. So you can register a domain using a different character set and in that way make it look like it's the same domain as your bank, for instance. So A is famous, because you have a Cyrillic symbol that looks exactly the same. Or, in many fonts it's actually a little bit
different, but it's similar enough that people don't react to it.

Yeah, more questions? Yep, in the back. And we are already into lunchtime, so those who want lunch, leave, and those who want to discuss this more, stay.

Thanks for the presentation, Bård [unclear]. I'm wondering about this: why do you think even Norwegian companies, Norwegian bank statements, Norwegian [unclear] showing your name, why do you believe this is still happening inside Norway? Not only related to international communication, where character conversion could be explainable, but why are we still facing that? Thank you.

So, as I touched on, there is nothing
indicating what encoding a text is in. So all of this is with character sets that already accept the Norwegian character, but you just don't know which encoding it is when it comes over the wire. And most of these issues arise exactly when a system tries to fix the encoding. They get something in as UTF-8, and they assume it's the old encoding, and they try to fix it by making it UTF-8, and then you get another mis-encoding.

More questions for
Bård?

Not a question, but I noticed the characters in your header on the slides change from slide to slide. Nice touch.

Yeah. Okay, then no more questions for Bård. We'll cut off for lunch. Thank you, Bård, for coming and doing your talk. [Applause]