
GT - Regular expressions are good, actually

BSides Las Vegas · 42:01 · 63 views · Published 2023-10 · Watch on YouTube ↗
About this talk
Regular expressions are good, actually: A technical deep-dive into an ideal infosec regex implementation
Ground Truth, 17:55 Wednesday
Regular expressions are everywhere in information security, but are often seen as opaque, academic, and boring. Regular expressions are anything but boring! This talk starts by explaining what regular expressions are (from a theoretical perspective) and why they're such a good fit for infosec. The talk then proceeds to explain how common implementations aren't designed for infosec use, sometimes even to the point of creating security risks. A brief survey of desired features is then given, and finally a technical dive (including code and benchmarks) is presented on how an ideal regular expression engine for infosec might be implemented. While this talk has some math, it is designed to be accessible to anyone with a background in infosec, including newcomers to the field.
Rob King
Transcript [en]

Hi, so my name, as was just said, is Rob King, and the title is actually "Regular expressions are good, actually," and they are, and at the end of this talk I hope you will agree. So first I want to talk a little bit about what brought me to this point. Just like Barbie's friend Ken, whose job is beach, my job is kind of like regex. It's very weird. For the past 20-ish years I've written thousands of regular expressions, I've written regular expression engines, I've written tools that manipulate regular expressions. It's weird. I actually do other things, like I can whistle, but

regular expressions are a huge, huge part of what I do, and I love them, and I want you to love them too. But not everybody loves them, and the ineffable Jamie Zawinski said, and I'll read this for those of you who maybe can't read it: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." Jamie Zawinski, or jwz, is very famous; his tagline on Mastodon is "I wrote your parents' web browser," which isn't true, because he wrote my web browser, which means I'm really old. My kids have no idea; they've never heard of Netscape. But

he wrote Netscape, or at least a large part of it, although my first web browser was actually AMosaic on the Amiga, and if you're an Amiga person... yeah, Pierce knew I was going to put an Amiga thing in here, because yes I did, there's the AMosaic thing, so yay, AMosaic. But if Jamie Zawinski can say regular expressions are problematic, then, you know, what are we mere mortals to do, right? What hope do we have in this world if jwz thinks they're problematic? Well, what you can do, or what we're here to do, is learn about what regular expressions are under the hood, how they work, why they're the

way they are, why they can be problematic, why they are a pre-loaded footgun in some cases, and how we can maybe unload the footgun, or work a little bit better with them. This talk is not about how to write regular expressions; much smarter people than me have written much better guides than I could ever write on how to write regular expressions. This is going to be more about how regular expression engines work under the hood, how you could maybe even write one of your own, and how the design decisions made in these under-the-hood machines affect the security posture of your software. Because as security practitioners, sometimes we have to

pierce the veil and see how things actually work under the hood. You know, in theory there's no difference between theory and practice; in practice, there is. And so what we're going to do as part of this is, together, we're going to design, or at least come up with design criteria for, an ideal (from an information security perspective) regular expression engine. We're going to do that together; it's going to be a fun time. And speaking of that togetherness and camaraderie: if anybody has any questions, please raise your hand in the middle of the talk. Don't wait till the end. I don't want anybody to be sitting here thinking, "man, that was like the stupidest thing

I've ever heard, I really wish I could tell them how stupid that thing was." Tell me, because I want to either sit there in my stupidness and admit it, or maybe we can come to some agreement. So, with no further ado: what are regular expressions anyway? What is truth? What is a man? A miserable little pile of secrets, for you Castlevania fans; that was probably a quote before Castlevania, but they said it in Castlevania. Well, let's give a very, very brief history of regular expressions, because I am a history nerd and I have to talk about it just a little bit. So regular expressions were first... the thing that became regular

expressions was first described by a person named Stephen Cole Kleene back in the 1950s, and he was using them as a notation to describe the behavior of neural nets (and I mean neural nets like frogs have, not artificial neural nets as in AI; this was 1951). Stephen Cole Kleene is a luminary in the field. This was back before computer science was a separate discipline, back when it was still purely a branch of mathematics; not even a branch, a twig of mathematics. He studied under Alonzo Church, of Church-Turing thesis fame and lambda calculus fame, and studied

alongside Alan Turing, who of course needs no introduction. Also, if you want to make friends and really let everybody know how incredibly smart you are (because everybody loves it when you do that): the asterisk operator in regular expressions is mathematically called the Kleene closure. So every time you see a star when you're writing a regular expression, be sure to say "Kleene closure," and everyone will love you. Just be sure to do that. So then in 1968, actually a little bit before, Ken Thompson took Kleene's notation and used it to describe patterns in strings, and he put it into the QED text

editor, which ran on one of GE's operating systems; I don't remember which. It didn't run on Unix, because Unix hadn't been invented yet. And then when Thompson invented Unix (co-invented Unix) and wrote ed, the standard text editor, he used this notation to describe strings, and from there Unix spread everywhere. I've got Unix on my phone, I've got Unix on my computer; that thing is probably running Unix. Unix everywhere, and regular expressions spread with it. Now, what's actually more fascinating, or arguably more important, is that the most important thing Thompson did there wasn't putting regular expressions in ed. He actually came up with a really fascinating way of evaluating regular

expressions that made them much, much more efficient. They had been studied by other scientists and mathematicians, but his way, the Thompson construction, was the first way of actually running them truly efficiently. We're going to touch on that very briefly later; it could obviously be a whole talk, but it's a really fascinating and beautiful thing, and one of the things that I love about regular expressions. In the original paper he even describes evaluating regular expressions via a sort of just-in-time compilation: oh yes, if you have this thing in a regular expression, it actually compiles to this sequence of IBM

7094 instructions. And he actually gives them, and it's really neat; this is back in like 1967, and it's amazing. And then in the 1980s, Rob Pike, who y'all might know as one of the main forces behind the Go programming language and Plan 9 and a lot of other things, wrote a text editor called Sam. Sam is the most beautiful text editor you've never used, and you should read the paper, "The Text Editor sam"; it's one of the best papers in all of computer science, in my opinion, and I'm never wrong. And what he did in Sam was

he made regular expressions very fundamental to the construction of the editor. And he came up with... so, you know submatches, right? When you've got parenthesized subexpressions in a regular expression, it was thought that you couldn't evaluate those efficiently using Thompson's construction (and I'm glossing over a lot of details here). Then Rob Pike, when he wrote Sam, solved the problem, and nobody noticed it for years. Textbooks were published after the Sam text editor that said, yeah, Thompson's construction is great, but if you want to do submatch tracking, what can you do, there's nothing you can do. And then, you know, people noticed the Sam text editor was doing it right,

and, whatever. I will digress very, very briefly: I actually am a big fan of the Sam text editor, obviously. I updated the original 1992 open-source release to get it working on modern Linux, modern 64-bit operating systems, modern font rendering, all that. And I got an email one day that said, "it's so nice to use this text editor that I always loved using when I was younger; it didn't work anymore, so I really liked your work." And I was like, oh, great. Then I noticed it was from Doug McIlroy, who invented Unix pipes, and I was like, oh. Okay. I was like,

well, so that was the highlight of my career; it was all downhill from there. It was great. So, not all downhill, sorry: runZero is great, sorry, I love you; this is the highlight of my career. It's great. Okay, so, all right. Before we go too far, I want to make sure everybody here knows regular expressions; I'm sure you do, otherwise you wouldn't be here, but I want to very briefly go over the syntax so we're all on the same page. And syntax is not semantics; there are alternate syntaxes. But let's just go: a single character matches itself. A single dot can match any character, except maybe new-

line, depending on the setting of a flag. A question mark marks the previous expression as optional. A Kleene closure (star) marks the previous expression as repeating any number of times, zero or more. A pipe (not the Unix kind) does alternation: you can say one or the other. Parentheses are for grouping and capturing. Character classes match a single character or, you know, a range; inverted classes are where you say anything not named. A plus says one or more times. And then counted repetition, where you can say at least this many times but no more than this many times, with defaults of zero and infinity. And then some common zero-width

syntax: you've got the caret, which is technically called the left anchor, which is kind of not a great name, because not all languages are written from left to right, but it's called the left anchor in POSIX, and it matches the zero-width string at the beginning. The dollar sign matches the end (which is not trying to say that money is the end-all, because it really isn't). And then you can also have zero-width assertions, like: I want to match this zero-width string that has a word character on one side and a non-word character on the other; word boundaries, things like that. And then you'll notice, and this is

when you implement a regular expression engine: there's a lot of redundancy for such an arcane, almost seemingly intentionally arcane, syntax. You know, a dot can really just be a character class that matches everything in your character set; an inverted character class is basically the same thing; a plus can really just be the expression followed by a star, for example; and counted repetition can be done with just some creative application of copying the expression and then question marks or stars. And then you get to the backreferences and look-ahead and look-around and look-behind. What are these? Everybody brace yourselves: these are

not regular expressions. These are constructions that are put in a lot of things that call themselves regular expressions, but they actually violate the fundamental principle of what regular expressions are in a mathematical sense. Backreferences, if you're not familiar: with backreferences you can refer to a previously captured piece of the string and say, you know, I want to see that again. And with look-ahead you can say, oh, I want to match this letter, but only if it's not followed by this next letter, or is followed, or whatever, but don't include that as part of the match. Those constructions are great, but they're not regular expressions; they actually require a different kind of

formalism. What they have in common is this: the thing regular expressions need in order to be truly, formally regular expressions is that they depend only on the current state of matching and the next character. There can be no other context. Whereas backreferences obviously require memory (you have to remember what you saw), and look-ahead or look-behind requires you to look at more than just the next character. Now, for you computational linguists and computer scientists out there: I am intentionally conflating and ignoring the difference between NFAs and DFAs and recognizers and finite state machines and all that stuff, and so I'm sure monocles

will be popping out of eye sockets and teacups will be dropped when I do this, but it is what it is. To give you an idea of kind of what I mean: this is a finite state machine representing a pay phone trying to collect 25 cents to get to a dial tone. (For those of you under the age of 40, a pay phone is something they have at DEF CON that they do for pranks.) You'll notice that all you need to depend on to get over here to dial tone: you start, you say, oh, I need 25 cents, and, well,

if you put 5 cents in, well, okay, now I need 20 cents; and then if I put 10 cents in, well, now I need another 10 cents; and then if I get 10 cents, I go to dial tone; but if I get 25, I go to dial tone; and so on. You'll see that at no point do you need anything other than where you are now and what you just got, and that's sufficient to get you to the end. That's how regular expressions in the formal sense work (again, DFA versus NFA, but yes). That's how regexes work: they just take their current state and then the next character, or the next

input, the next symbol, and move on. And that's a very nice construct, because they have no memory; they require very, very few resources: just the current state, really just the current state, which can be encoded as a very small number, and that's why they're so efficient. So as part of this talk we're going to look at some common implementations, and then we're going to talk about how they stack up against our rubric. (I never used the word "rubric" until my kids were in school, and apparently that's, like, the new hotness, but yeah, rubric.) So we're going to start with POSIX style, that's specifically POSIX extended

regular expressions; JavaScript, or actually ECMAScript; Python, with its re module in the standard library; Go, with its regexp package in its standard library; Rust, which doesn't actually have one in the standard library, but the regex crate is so popular it might as well be; and then the odd one out, which is Hyperscan, a project from Intel that is designed to be the world's fastest regular expression engine, and it is, which is pretty cool. So let's get to our security concerns and our design goals. Denial-of-service conditions: everybody knows about ReDoS, and if you don't know about ReDoS, you're about to learn about ReDoS, so that's great. As we saw, it's

impossible to implement a regular expression engine with backreferences or look-ahead or whatever that doesn't require some extra memory or some extra resources. So if you want those features in your engine, by definition you are going to be vulnerable, sometimes, in some ways, with some inputs and some regular expressions, to very quickly using up resources: resource exhaustion. And it's important to note that even if your expression doesn't have backreferences or doesn't use look-ahead or whatever, it could still be vulnerable, because a lot of times engines aren't going to have, like,

a separate safe engine over here and then whatever. So let's... I've got a little live demo just to show you what I mean. I showed this to my older son the other day and he said, oh yeah, Dad, you've got to show that at the talk, because it's cool, and I was like, dude, that's why I wrote it. So it's a little program here; we're looking at Python. (By the way, on opening Vim: everybody, rest in peace, Bram Moolenaar. He was very passionate about children in dangerous situations, and it would be

nice, if you have the ability, to maybe consider making a donation to one of his charities, because he was a very good person. Anyway, on a slightly down note...) So I'm going to run this. This is a little script that generates regular expressions, then generates strings, then matches them. This one is of order 10. So what does it do? Well, it generates a string of ten a's, and then it generates a regular expression that is ten "a?"s followed by ten a's, and then it does the match and tells you how long it took. So, everybody see that? It's not too bad. Is the font big enough? Look at that, it's

even bigger. So now we can say, okay, 10 took some minuscule amount of time. What about 20? Still faster than the blink of an eye, right? But you'll notice it's actually a lot, a lot slower to a computer. So now let's try 30, and that is not going to finish for about five or six minutes on this computer. And that's interesting, because this is not a huge expression, right? You couldn't say, oh well, I'm safe, because when untrusted input comes in, my expressions are only, you know, a thousand characters long. Well, here you go.
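The little script being run is roughly this shape; this is a minimal sketch (the function name and structure are mine, not the demo's actual code, but the pattern is the classic pathological construction just described: n optional a's followed by n required a's):

```python
import re
import time

def redos_demo(n):
    """Time Python's backtracking re engine on the pathological pattern
    "a?"*n + "a"*n matched against "a"*n. The match succeeds (every "a?"
    matches empty), but the engine explores an exponential number of
    backtracking paths before it finds that out."""
    pattern = "a?" * n + "a" * n   # e.g. n=3 -> "a?a?a?aaa"
    text = "a" * n
    start = time.perf_counter()
    assert re.match(pattern, text) is not None
    return time.perf_counter() - start
```

On a typical machine, `redos_demo(10)` is instantaneous, 20 is noticeably slower, and 30 takes minutes, which is exactly the behavior in the demo.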

And, you know, this is obviously a bit of a contrived example, but let's look at the Go version. Here's the Go version, using Go's regexp package, doing the same thing: go run linear.go 10. (It's going to take a second, because macOS has to make sure I'm not running malware.) So, go run linear.go 10, and see, it does the same thing, and it took 413 microseconds. Now let's try 20: 238 microseconds. 30: 259 microseconds. A thousand took 30 milliseconds, right? A thousand in the Python version would not have finished before the Sun went to its red giant phase. So it's pretty... and again, this is contrived, right? But this

is a real thing. At my job we have a tool that has a Go implementation and a Ruby implementation, and it's got a test harness that I think is written in Python. And I wrote a real regular expression, that I got paid to write, for real business purposes, and ran it in the Go code; it worked fine. Kicked off the test harness, which usually took a few minutes; went and got lunch, did whatever, came back, got busy doing something else, shut down for the day, went home, went upstairs, came back down... and the test harness was still running the next day, because that regular expression

kicked into this bug, right? So yes, these are contrived, but they're contrived for the purposes of example; I mean, nobody would actually write the regular expression, right? (Yes, the castle; I knew I was going to make that joke, so I put that in there.) But there are some other denial-of-service conditions that are much less likely to happen but are still relevant in a lot of situations. For one, most engines will do memory allocation at matching time. Go will do at least a

few allocations to return the match, things like that; Python has to allocate pretty much everything on the heap. And so you can get excessive memory allocations, fragmentation, blocking in the memory subsystem, things like that. And regular expressions, if they're being used from multiple threads, can maybe result in a lot of lock contention. I know earlier versions of Go's regexp package actually had this problem, where regexps could be shared explicitly across threads without locking, but then there was a lot of internal locking that actually slowed it down. They fixed that, and it's actually really interesting how they fixed it; you should totally check out the source

code at some point. But here are the first of our design principles for our ideal regular expression engine: we should only use linear-time algorithms, we should allocate all memory before match time, and we should use per-thread state. So let's see how our implementations check against our rubric so far. And so far we've already eliminated half of them, right? These three: it's actually impossible to implement them under these requirements. But Go, Rust, Hyperscan, they all do, and Hyperscan gets a little plus because it also does 100% allocation of memory ahead of time and things like that, which is pretty

cool. So Rust, Go, all of those use guaranteed linear time. But what about streaming matches? This is something that always amazed me is so rare in regular expression implementations, because, you know, we are trying to analyze network traffic, we're trying to analyze enormous amounts of data, we're trying to analyze things that might be discontinuous in memory, and we're trying to analyze things on tape. You think I'm joking, but this is the IBM Diamondback petabyte tape library, which came out, like, last year; tape is not dead. And there's a funny story about Donald Knuth and The Art of Computer Programming, which he totally is

going to finish, just like George R. R. Martin is going to finish Game of Thrones. Yeah, I know. He talks about tape, and it's interesting. But this is important, because let's say we're doing network traffic matching, okay? We've got a client, and we've got a signature down here, our signature matching this totally vulnerable PHP script that has a file disclosure vulnerability, right? So the attacker, who is our client, sends a GET for, you know, the vulnerable script with name equals /etc/shadow, HTTP/1.0 (it has to be 1.0 because I didn't send a Host header; nobody really... oh, come on, I was proud of that one, that's funny). You get

the HTTP response back, and you get, you know, everybody's file disclosure; it's bad. But what about if we're doing packet-at-a-time matching? For example, ngrep, which is a very common network grep (that's why it's called ngrep), won't apply regular expressions past a packet boundary. So if our attacker just breaks the request up such that no one packet will match the regular expression, it bypasses detection entirely. And then what about buffering? That's by far the more common solution; things like Suricata and Snort will buffer (although they can be compiled to use Hyperscan, which doesn't), but if they don't, they buffer

things. (Potentially you can also use a DFA, but that's another problem.) They buffer things, and now you've got this memory per stream that you don't actually need, because you just need to know if this matched or not. If you can just hand your data to the regular expression engine as it comes by, you can say, oh, okay, I matched, or I didn't. Because an attacker could potentially put a whole bunch of extra parameters, or do whatever, and now you've blown out your buffer, or you run out of space, or you use excessive resources, or whatever; and you can avoid all that if you just support streaming matching. So I say "allow

matching of streaming input" is our next design criterion. And where do we stand on our rubric there? Well, streaming is supported by Go and Hyperscan and no one else, which is really kind of weird. (There is an implementation in Rust that I'll talk about later that kind of does it.) And now, if you're going to do streaming, you also have to do simultaneous matching, because if you've got the data streaming by, or it's discontinuous memory or whatever, you only get one shot (which I think is an Eminem lyric, but I don't actually know; I only know the meme version). So you need to be able to

match multiple expressions at the same time, and a lot of engines don't do this; in fact, very few do. So we've got to say, okay, we've got to allow simultaneous matching, right? And who does that? Well, Rust and Hyperscan, and no one else. (Google's RE2, which is written in C++, does, it seems, a little bit.) Yeah, I promise this is not a Hyperscan ad. And this is the one, this is the feature that really blows my mind that nobody seems to implement as much as they should, and this is text and binary matching. Most things on the internet are not text files, which makes me sad, I know. By the way, that's Benjamin Franklin Cat

the Third, and he's a jerk; he got that drawer open by himself, and yeah, he was very proud of himself. But there are lots of things that aren't text: executables, PDFs, GIFs (not "JIFs," because that's not even a word, sorry). And even with text, actually, there are all sorts of problems, because then you've got to deal with graphemes; an emoji is not actually one character most of the time, it's usually multiple characters, all that. But, text and binary matching: a lot of engines don't support binary matching, or they only support text matching, or they only support Latin-1, or binary, or something like that. And I want to give

again, just sort of a quick example here, and this was an example that came up when I was working on a Modbus TCP thing. What I'm going to do is... so here's text-binary; this is using Go's regexp package. I'm going to run the first example. Oh, and let me show you, yeah, here: here's the regexes. We're matching these first two bytes, these first two characters, whatever. Notionally those are a transaction ID or something like that, right? The binary protocol says these first two bytes are a transaction ID; they're variable, whatever,

but we don't care about them; we just care about, you know, the op, the opcode. So we run that regular expression against that, and it works: when the XID, the transaction ID, is zero, it works great. But you're saying, oh well, okay, that's null, right? That's not a big deal. Will that work against something that isn't ASCII? Well, let's see: running against FF FF, and it matched. So obviously we're safe, right? And I've seen this happen a lot: people will write binary parsers, will write things where they're trying to extract information from things that aren't text.
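What the demo is driving at can be reproduced with Python's re module, which can compile a pattern over either bytes or str; the byte values here mirror the demo (FF FF, and the C2 A3 pair that comes up next), while the opcode value 0x06 is illustrative:

```python
import re

# Over bytes, "." matches exactly one byte, so the two 0xFF bytes are
# two separate "characters", which is what a binary protocol expects.
data = b"\xff\xff\x06"
assert re.match(rb"^..\x06", data) is not None

# The same idea breaks down once an engine insists on decoding text.
# C2 A3 is the UTF-8 encoding of the pound sign: two bytes, ONE character.
raw = b"\xc2\xa3\x06"
assert re.match(rb"^..\x06", raw) is not None      # bytes: still two "."s
decoded = raw.decode("utf-8")                      # "£\x06"
assert re.match(r"^..\x06", decoded) is None       # str: '£' is one char
```

An engine that decodes UTF-8 as it matches, as Go's regexp does, behaves like the str case: the two bytes are consumed as a single character and byte-oriented patterns silently stop lining up.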

and it just kind of coincidentally works. But this isn't coincidence, right? This is FF FF; it's like the least ASCII thing you can get (it's the ÿ, the y with the dots, in Latin-1; okay, I was proud of that one too). So let's try the third one just to be sure: it's spaces, yeah, it works fine. But then let's try this: what happens when we do those bytes? Suddenly it didn't match, right? C2 A3, those are two bytes, but it didn't match. Why didn't it match? Can anybody guess, just curious, what C2 A3 might be? C2 A3 is the UTF-8 encoding for the British pound sign, pound sterling. And so what Go

did was decode one character, take it, go on, and then basically skip over to here, and now things don't match. A lot of engines don't support this; it means that a lot of these engines are not really useful for binary file analysis or network traffic analysis or things like that. They're kind of hobbled. So that's our next design principle: we should support binary and text. And who does that? Well, Python does it, actually, in the most elegant way possible; Python's really good about it. Rust and Hyperscan too, but not Go. There is a third-party module (calling it third-party is kind of funny, because it's written by Russ Cox,

who also wrote the main module, so it's really not): he wrote a binary regexp module for Go, but it's not included in the standard library, so it doesn't count. And JavaScript and POSIX style, you can't do it. Then what about safe compilation? Let's say we're letting our users put regular expressions in; you've got a field on your form or whatever, and you put a regex in. Well, what if they can just blow up the compilation step, right? What if they can make the compiler that's compiling your regular expression go boom? Obviously, yes, you need to limit the size of the regular expression that your attacker, sorry, your totally legit user,

can put in. But even still, relatively short expressions, like this one here, will blow up some implementations: deeply nested parentheses, if you use a recursive-descent parser, absolutely will blow things up. And this was a real thing: just recently, last year, Go had a CVE because the regexp package was vulnerable to a crash; you could do a denial of service by giving it a large enough regexp. It was reasonably large, like bigger than anybody should be accepting, but it was there. And then there's also safe deserialization. Let's say you're allowed to compile your regular expressions and then save them somewhere,

and then you accept input. Well, now you've got a problem like YARA, where there are a lot of bugs that blow up if you give it a bad compiled regular expression; there are lots of CVEs for that. So you need to do that too: limit resource use in compilation, and support safe serialization. (How much time do I have? Okay, all right, we're good.) So where does that fit? Go is the only one that is really, really careful about this, which always surprised me; with a lot of the other ones you can get out-of-memory errors, things like that. And then finally, for this segment of the talk, I want to talk

about portability. You're noticing a theme with Hyperscan, and here's the problem with Hyperscan: it is not portable. It does not work on anything but Intel, and Intel bought it to make sure it will never work on anything but Intel, which is pretty crazy. It does some amazing stuff with SIMD instructions, vector instructions, single-instruction-multiple-data instructions; it does amazing stuff, but it only has an Intel port. And this is an ARM Mac; my phone is running on an ARM... well, now it's running on an ARM. Okay, I'm kidding, I'm kidding, all right, sorry, I had to. So, yeah, that's portability.

So how do we get regular expressions to do what we need them to do, then? What are the algorithms involved? The earliest regular expression engines, and a lot of regular expression engines still, do just recursive evaluation, right? If you recall from not that long ago, you can represent a regular expression as a sort of directed graph, and here we're representing abab or abbb, right? If we want to do it recursively, well: okay, we'll start with a; we just pick one of the branches; we start with a, okay; we get our next character, we get b, okay, that's good; we get our next... oh. No. Okay, all

right well I'm dead I got to go back here so it'll backtrack all the way to the Last Choice point and then try the other side and you can go and go and oh now it's going to work now oh I made it great but with this recursive evaluation you have to keep the whole string in memory at once you have to be able to go back at any point and the the resource us should the time factor is how many possible paths there are through the graph which I mean obviously here there's only two but you could get very very large numbers very very quickly um you know it's a quadratic State space explosion you know dogs and cats living

together Mass area so you can't do uh streaming matches you have um problems with stack space usage things like that so this was Thompson's insight and this is again just the most brief you know discussion of it right like this is this was a whole paper by you know Ken Thompson all that but what he realized was when you get to a choice point point oh why not just take both and so he had he would split and he would have two pointers he would say okay we're going to take both and we're going to take next character and both are still alive we're going to take the next character nope Noe that one died but

that's okay we've still got one still alive and we take that one and we go and it works and Thompson's insight there was that you're never going to have more pointers more threads than you do nodes in the graph so at worst you know this is what 2 4 68 this is you're never going to have more than eight pointers eight things so your con your uh your resource usage your state space or uh memory needed to store your state is of constant size so rather than having it grow exponentially or blow up or do whatever you just can say oh that I know I will never need more than in this case eight pointers right um and that was

that was his his beautiful beautiful Insight um and if you have an efficient mechanism of uh determining which states you've already seen and states are alive or dead um then it becomes as efficient as whatever that data structure is and there's a uh an amazing one of the most beautiful papers I've ever read um an efficient representation for sparse sets by Preston Briggs and Londa Toren back in um 93 I didn't read it in 93 I read it much more recently but in 93 it was written and um it's really really beautiful and what was cool is uh when they published it they did not apply it to regular Expressions it had nothing to do with regular Expressions um but then

uh I think it was Rob Pike but it might have been Russ Cox might have been K I don't know somebody at Bell Labs was like hey this is exactly what we need and it it worked great um so highly recommend it super cool beautiful paper totally should read it but what about text and binary matching how can we make that work uh efficiently so the good news is if I had asked this question 13 years ago 10 years ago whatever I would say I don't know y'all are on your own but utf8 is now the uh the it's used everywhere as you can see in this graph utf8 there there is no other encoding um
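Backing up to Thompson's simulation for a second, here's a minimal sketch of the idea in Python. This is my own illustration, not the talk's code or any real engine: the NFA for abab|abbb is hand-built as a little transition table, and the "pointers" are just a set of live states.

```python
# Thompson-style simulation of the NFA for abab|abbb.
# Hand-built NFA: state -> {input char -> set of next states}.
MATCH = "match"

NFA = {
    0: {"a": {1, 4}},                                  # choice point: take BOTH branches
    1: {"b": {2}}, 2: {"a": {3}}, 3: {"b": {MATCH}},   # the abab branch
    4: {"b": {5}}, 5: {"b": {6}}, 6: {"b": {MATCH}},   # the abbb branch
}

def matches(text: str) -> bool:
    alive = {0}                      # never more "pointers" than NFA states
    for ch in text:
        nxt = set()
        for state in alive:
            if state != MATCH:
                nxt |= NFA.get(state, {}).get(ch, set())
        alive = nxt
        if not alive:                # every thread died: fail immediately,
            return False             # no backtracking ever needed
    return MATCH in alive

print(matches("abab"))   # True
print(matches("abbb"))   # True
print(matches("abba"))   # False
```

The point is the bound: `alive` can never hold more entries than the NFA has states, so the memory needed is constant no matter how long the input is, and the input can be streamed one character at a time.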

And thank goodness for that, because, wow. I'm going to digress for one minute: UTF-8 is the most perfectly designed thing I have ever seen in my life. If you look at the design constraints, at what needed to happen, UTF-8 is amazing; you couldn't imagine a more perfectly formed thing. It's also evidence that Plan 9 is the most influential operating system you've never used, because every time you use the proc filesystem on Linux, that was invented on Plan 9; UTF-8, Plan 9; font rendering on X was inspired by Plan 9. It's crazy what you can do with Plan 9.

So, for character encodings, back in the day you would have code pages, and they would usually just be 256 characters. ASCII is a seven-bit encoding, and since we standardized on eight-bit bytes, the eighth bit meant: okay, now you can use the upper half for things that aren't ASCII. Lucky you! Who would ever need more than 256 characters, right? China and Japan don't exist, obviously. So this is how it worked: this is the ISO 8859-1, Latin-1,

code page, and you can see you've got ASCII in the lower part, and then everything down to the ÿ at the end. What would happen is you'd have different code pages for different national character sets; here you've got the Greek one. Usually they would have ASCII in the lower half, so they'd be ASCII-compatible, and then they'd have their national character set in the other half. ISO 8859-7 is Greek, 8859-5 is one of the Cyrillic ones, I don't remember which. But then what do you do if you

need multiple character sets in the same document? Well, you don't, you just don't. Or you switch to a multi-byte character encoding. The first of those was UTF-16, which used two bytes per character, which was all well and good: originally Unicode was just the Basic Multilingual Plane with 65,000 characters, you fit it into two bytes, and everybody was happy. But then Unicode expanded to 21 bits, and suddenly UTF-16 needs surrogate pairs, and it's a whole mess. So they switched to UTF-32, which wastes 11 bits, am I doing the math right? Yeah, 11 bits:

the top bits are always going to be zero, and most of the rest too, so it's totally wasteful. So we switched to other multi-byte encodings, and they had locking shifts, and things would go crazy. And then UTF-8 came along and everybody was happy. It's 100% ASCII-compatible, and it's self-synchronizing, that's the beautiful part: if I just drop you in the middle of a UTF-8 document, you can tell whether you're in the middle of a character or not, and where the next character starts. It's really, really beautiful, and it does it very efficiently. So what you can do to match UTF-8-encoded text is just write

a binary regular expression engine that matches raw bytes, and precompile the UTF-8 encoding into the expression; since it's self-synchronizing, you don't have to worry about where you start or end, and it just works. This was pioneered, I think, by Tenth Edition Unix grep, and it's what Rust's engine does under the hood if you're doing text matching in Rust. Go's engine, even though it was written by Bell Labs people, actually doesn't do that; I think it has to do with the way they do rune decoding. But yeah.
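Here's a tiny Python sketch of those two properties. It's my illustration, not any engine's real internals: the pattern is "compiled" to its UTF-8 bytes so matching happens on raw bytes with no decoding of the haystack, and resynchronizing after being dropped mid-character just means skipping continuation bytes.

```python
def compile_literal(pattern: str) -> bytes:
    """'Compile' a literal pattern to the raw bytes a byte-level engine would match."""
    return pattern.encode("utf-8")

def find(haystack: bytes, pattern: str) -> int:
    """Match on raw bytes; the haystack is never decoded."""
    return haystack.find(compile_literal(pattern))

def next_char_boundary(data: bytes, i: int) -> int:
    """Resync: UTF-8 continuation bytes always look like 0b10xxxxxx,
    so skip forward until we're at the start of a character."""
    while i < len(data) and (data[i] & 0b1100_0000) == 0b1000_0000:
        i += 1
    return i

text = "naïve café".encode("utf-8")
print(find(text, "café"))            # 7: byte offset of the match
print(next_char_boundary(text, 3))   # 4: dropped mid-'ï', resync to the 'v'
```

Because matching is byte-for-byte and UTF-8 is self-synchronizing, a match found in the raw bytes can never start or end in the middle of a character by accident when the pattern itself is valid UTF-8.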

So this is really just an amazing, beautiful thing that you can do. And then safe compilation, this is something I'm very passionate about as well. Does anybody know, by the way, what that error message is, what causes it, or even what it's from? It's in the Commodore 64 color scheme, but it was from Microsoft BASIC way back in the day, and you got it if you had too many parentheses in a BASIC expression. So if you have too many parentheses, you can still actually cause problems. But what you can do: there's been an interesting convergence in verification of programs, with WebAssembly and eBPF and things

like that, and you can apply those same principles to compiled regular expressions, maybe do a little bit of symbolic execution to make sure they're valid before you load them. And that can be done offline: you can take a compiled regular expression and just run a quick tool on it and say, yes, I can formally prove that if this is loaded it will not cause a crash or an infinite loop, which is pretty cool, and I really like that. And the safe serialization, I think I actually just said that exact same thing, yes, okay. And then finally, portability. There's not

a lot to say here: write it in a portable language, please. And if you can get it to run on a PET, that's great. So yeah. What time do I have? I've got five more minutes, okay. Before I jump into the bonus features, are there any questions? Because I don't want to run out of time before there are questions. Raise your hands, anybody? Anybody? Okay. If you have questions on YouTube, email me; my email is on the last slide, I'll show it to you again. So, real quick, some really cool stuff that you can do:

there's prefix extraction and keying, which is a way to rapidly speed up regular expression matching. What you do is look to see if the regular expression has a fixed prefix, a fixed string that's always at the beginning, or a set of fixed strings, and if so you can use a faster string-matching algorithm. Aho-Corasick can do multiple strings; Rabin-Karp is usually limited to one string, or to multiple strings but only of the same size. Rust and Yara use Aho-Corasick; Go uses, I think, Boyer-Moore; I don't

remember who uses Rabin-Karp, if anybody anymore. And then there's one called Teddy that's used by Hyperscan that's amazing, and it was originally also called something like "vomit comet", it was weird. And then this is the only actual piece of code, the only contribution I've made to regular expression engines: shrink sets. I'll post a link somewhere, but that's a way of selectively disabling regular expressions in a set such that you can reset the set in constant time, and it's really cool. I will give you a link in a second to all that. So yes, in conclusion: regular expressions

are good, actually. They're not bad, they're not scary, they just sometimes have problems under the hood that you need to be aware of. What I want y'all to get out of this talk is: wow, regular expressions are neat, there's a beautiful amount of theory behind them, there's a reason to the madness, there's a reason they are the way they are, and maybe I'd like to read more about how to write a regular expression engine myself. And boy, do I have some resources for you in that case. "Implementing Regular Expressions" by Russ Cox came out in, I want to say, 2008, something like that. It is hands down the best series of articles

on regular expressions ever written, ever. I would not be where I am today if I had not read this series of articles. It is beautiful, absolutely highly recommended, and very accessible to anybody with any programming background; you don't need to be a mathematician or anything like that. Really worth it. And then there's a paper describing Hyperscan that's very, very cool. "Efficient Submatch Addressing for Regular Expressions" is by a Finnish author whose name I apologize in advance for butchering, Ville Laurikari, and it describes a really cool way of doing submatches that was discovered independently of Rob Pike's method; that's really cool.
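As an aside, the prefix-extraction trick from a couple of minutes ago can be sketched in a few lines of Python. This is a toy of my own: `literal_prefix` and the `str.find` prefilter are stand-ins for what a real engine does with Boyer-Moore or Aho-Corasick, not anyone's actual API.

```python
import re

def literal_prefix(pattern: str) -> str:
    """Take leading characters of the pattern up to the first regex metacharacter."""
    out = []
    for ch in pattern:
        if ch in ".^$*+?()[]{}|\\":
            break
        out.append(ch)
    return "".join(out)

def prefiltered_search(pattern: str, haystack: str):
    """Scan for the literal prefix with fast string search; run the full
    (slower) regex engine only at candidate positions."""
    prefix = literal_prefix(pattern)
    rx = re.compile(pattern)
    start = 0
    while True:
        i = haystack.find(prefix, start)   # stand-in for a fast prefilter
        if i < 0:
            return None
        m = rx.match(haystack, i)          # full engine only at candidates
        if m:
            return m.group(0)
        start = i + 1

print(prefiltered_search(r"GET /[a-z]+", "POST /x\nGET /admin HTTP/1.1"))
# → "GET /admin"
```

The win is that the fast scanner skips most of the haystack in a tight loop, and the expensive regex machinery only ever runs where the prefix actually occurs.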

And then the ERX regular expression engine is an engine I wrote that does a lot of this, not all of it. It's purely experimental, it's not super fancy, but it does work if you're interested. It's written in Rust; I wrote it like three years ago and then haven't touched it since, because I just wanted to make sure I still could. So there we go, thank you. There's my email address if you have any questions. Okay, anybody? I finished on time, that's good for me. Yes? Yes? All right, thank you so much, everybody.