Placeholder Image

Subtitles section Play video

  • There's a parsing theme, devotees will know, and there's a Regexp theme. And our

  • aim, in the end, is to unite these all into one, at the top, so that you're all

  • totally happy - or completely unhappy - as the case may be. We have done Regular

  • Expressions and the last one I did was really very simple and straightforward,

  • and showed you absolutely nothing about what can go wrong. And some of you out

  • there, coming up with the usual old jokes, quite right, about: "Oh! you're implementing

  • using Regular Expressions. Now you've got two problems", Yeah! We'll look on in glee

  • as we now take the story a bit further and see what can go right and how

  • powerful Regular Expressions are. But also a little bit of what can go wrong

  • and why you've got to be careful in your use. So what can it be this marvellous new

  • example that consumes Life the Universe and Everything and exhibits a whole load

  • of new features? And you're going to be quite convinced I'm off my head.

  • It's Roman Numbers. Way way back when we first, here at Nottingham, got Lex and Yacc, in

  • UNIX Version 7 - it was in the late 70s. So, believe me, I have been ... or, back in

  • those days I was setting Roman Numbers as an example of what Lex - remember the

  • Regular-Expression-based front-end to your compiler. Does tour identifiers and

  • all that. I thought, all right, as an exercise, just write a Regular Expression

  • that will accept Roman Numbers - all of the good ones and reject all the bad ones.

  • Now, I can't emphasize too strongly: that rider is important! It is very easy, in a

  • Regular Rxpression, quite easy, to get it to recognize what you want, but will it [also]

  • reject all the stuff that's bad? And it is sometimes so easy to overlook [things].

  • I say with shame because, when I'd set that question, I scratched down a quick and

  • easy answer - which I now see is full of holes .... Because if you input, into my

  • regular expression MMII - again, looking ahead, Roman Numbers

  • MMII [is] 2002. Fine. But then I thought ... well I thought now, I didn't think then:

  • "What happens if I do MIIM? [winces at the thought]?" [Answer = ] 2002. [It] shouldn't be. There's a strict hierarchy

  • in the way these work. So, at this stage we don't know how much, if anything, you

  • remember about Roman Numerals. So, as an Extra Bits we've put out a simple

  • tutorial on Roman Numerals based around this example, But for those of you that

  • are totally happy with Roman numerals, we can just carry on with the main video now.

  • So, just to locate ourselves then, and do a few examples to make everybody

  • comfortable I'm going to write down MCDXCII. Start at the left. We've got a

  • thousand [M] Then we see CD and, remember, this is one of these subtraction things

  • CD is "a hundred less than 500" which is 400. MCD. So we're up to 1400 so far. XC?

  • >> Sean: OK That's 90 >> DFB: 90. II? >> Sean: 2 >> DFB: 1492. What happens in 1492 Sean, in Western European folklore?

  • "Columbus discovered horizons new in America (1492) >> Sean: I should have known that one

  • >> DFB: Yeah! right. And, of course, the year we're in at the moment, as we film all of this,

  • is - that's simple MMXX, 2020. And one which is dear to me, I'm sure you'll

  • work out why. Yes, it's got a lot of these prepended abbreviation things in there look,

  • MCM; 1900, XL: 10 less than 50 (40) IV: one less than 5 (4) -- 1944. 244 let's focus

  • Let's focus ourselves now on how would we do a Regular Expression for all of this?

  • In fact, let's make things very very simple and say: "All I'm gonna do at the moment

  • is concentrate on getting a Regular

  • Expression pattern that will either generate me one or two, or three, or four, or five -

  • [writes on paper] forgive me I'll carry on down here. Or six, seven, eight, or nine. And I won't go

  • any further than that. Because tens is a new adventure

  • further to the left. So, those numbers then, from one to nine, which I've written

  • know like that now. I hope you can see that if you're completely simple-minded

  • about this and don't want to do complex Regular Expressions you really could

  • write yourself a great long 'alternation' as it's called. In other words a great

  • big sequence of ORs. And you could do everything from one up to thousands and

  • just have this walloping great OR clause, that your Regular Expression parser will

  • probably go bananas and say: "too many alternations", or something like this.

  • So, no, that's not the way to do it. What one wants to do is to generalize and use,

  • as it were, elegant Regular Expression capabilities to shorten this a bit.

  • So, there we are the Roman numbers from I to IX,

  • All written out. First of all let's try and separate

  • these things into classes, and groups, that make sense. There are two weird

  • things which involved the subtraction rule, you know, like IV (1 less than 5)

  • IX (one less than 10). I think we will do those together, because they're weird in the

  • same way, if you like. The other things are much more regular like you go from 1

  • 2 and 3, which is with III. And similarly, here, you get the same pattern but coming

  • after a V - a V on its own, or VI, VII, VIII, and so on. So, we've got three

  • groupings: things involving Is with or without a V in front of them;

  • and the two special cases IV and IX. So I would conjecture there's a wonderful

  • piece of Regular Expression covering so far what we've done, but something new

  • coming up - so do pay attention - would be this: my RE is going to be I

  • - now this is square 'choice' brackets [], We have seen these before.

  • But whereas, before, I put ranges inside them - you know like [A-Za-z]

  • [0-9], any one of those digits.

  • These aren't ranges; these are just explicit possibilities inside square

  • brackets. It's called a 'choice' so I've got to have an 'I', if I'm on this side of the

  • expression. But then it could be followed by either an X or a V, but it must be one

  • or other of them, and only once. There's no repetition outside; there's no

  • star there's no plus; no nothing. So those are the special cases taken care of,

  • the IV and the IX. Now we put a walloping great big OR bar down the middle. Or if

  • it's not a special case it's a regular straightforward case. But we've been

  • clever with our REs now. Oh yes! look at this. Now, again, I think this is

  • something we may have mentioned but probably not another piece of regular

  • expression notation is: " ... if you put a question mark after something it means

  • 0 or 1 of". So, what I'm saying here is I can either have a V, or I don't need

  • a V. How can you not need a V? Well, what's following on here is going to be

  • something that really is new. This is a counted repetition in curly braces zero

  • - [should be 0,3]. Look at that! Bound to win an award! Let's just run through it.

  • The two special cases are off on that left-hand side of the OR bar. But if you choose not

  • to go down that route you go to here and it says: "Oh well if it's not one of

  • the two special cases then it's like the following: anything between 0 and 3 Is.

  • I mean you may not have Is at all - you can just have V on its own if you want -

  • preceded by an optional V, nought or one these, >> Sean: Tthat's giving you the choice of

  • either 1 2 3, or 5, or 6 7 8, right? >> DFB: Yeah, yeah, that's right.

  • But it's all to do with sometimes having zero of something. Now, hereby

  • hangs a tale about Regular Expressions. And those of you that keep on posting

  • these notices about " ... you've got multiple problems with Regular Expressions !". Well. here's

  • another one. It's the strange case of the 'empty

  • alternative', which really can gang up to get you. Because what you're saying here,

  • in this expression, is: " ... Is it possible for it to match nothing at all and yet

  • that's a valid match?" Yes it is! Because if you choose to ignore what's on the

  • left of the OR bar here - which you can do, it's either/or - we come here, and we say: " V,

  • oh! I can have 0 or 1 of those. So, let's have 0 of them". The I ? I can have

  • anything between 0 to 3 of those. So let's choose 0 of them. So, in other words

  • choosing nothing at followed by nothing is a valid match for a Roman number [according to this RE].

  • And if you think about it it's true. Nothing in Roman Numbers is compulsory. You can have

  • M's, or no M. Yyou can have C's or no C's. You know all of them.

  • It's all optional. But as you start getting down to the units digits you think: "I've got

  • to have something !" The paradox, then, about Roman Numbers

  • done as a Regular Expression, is: you've got all sorts of cases that yield you

  • nothing at all and yet you're going to be very disappointed if you don't have something!

  • But saying "... overall you must have something" That's not a context-free

  • assertion. That's a context-sensitive assertion. So we'll live with that for the

  • moment. And I think also what I'd like to just show you now, if I can get the

  • screen fired up ... and I'm using 'awk' here to demonstrate this, I'm going to add yet

  • another little piece of notation to all of this. I'm going to say that you can, if

  • you want to, insist that what you're matching must just be a valid Roman

  • Number, on a line on its own, with no writing or messing about on either side

  • of it. Just the number. Start of the number (^), end of the number ($). Finish. Right?

  • And you denote that by saying start of string anchor, ^ End of string anchor, $.

  • You're basically saying whatever pattern there is must fit between

  • the start of the string, on a line on its own, and the end of the string - the

  • newline right at the end. Now, this is the key thing about this. Those things [^$ markers]

  • if you put them in, are not options. You can't have zero of them. If you say

  • you must match the beginning of the line, you must match the beginning of the line. If you

  • say you must match the end of the line, you cannot ignore it. You can't have zero

  • of them. So, in a way, this is a kind of reality check and it's generally the

  • case that if you use these correctly you won't end up with null strings being a

  • pseudo-match. Because this thing is saying: "Well, whatever the finite [non-zero] length

  • string is, it's got to fit between beginning of line end of line. Here's the

  • 'awk' script that I hacked together, to try out a few alternatives. And I decided to

  • go for this simple little piece first. Can I do everything from 1 to 9 in a

  • single piece of Regular Expression. And you'll see here this is classic 'awk'.

  • You, give a Regular Expression pattern, then you give an action. It's a C-like syntax,

  • even in 'awk'. Basically what I'm saying is; "If I match that then I print out 'Pattern

  • p1 matched' and so on". Now, if you look down here four of these things - because

  • this is just a test-bed - four of them are commented out, they will not be executed.

  • The one that will be executed is the one I've just been discussing with you, here,

  • which says: "I must match start of string then here's my Regular Expression

  • pattern - which I think will work from I to IX inclusive, and then it must match

  • end of string". So, I'm hoping that by uncommenting this that this will work.

  • And I'll type some good and bad Roman Numbers at it. And for every one it's happy

  • with, it will say 'Pattern P0 matched is' And it will echo back what it your Roman

  • Number [was] that you put in. So, what I want to do, now, is actually run that script, that

  • we've been looking at. 'awk - f ' take your instructions from a file called 'test-roman.awk'

  • OK, so when that runs now it should just hang there waiting for its

  • input of a Roman number. And remember it is going

  • to match that beginning of string, piece of regular expression, end of string.

  • >> DFB: So. which one do you want me to try first Sean? >> Sean: OK, let's go really simply and just go

  • for a number 5 [V]. >> DFB: V all on its own [waits] Look at that! That's a 'P0 matched is V

  • >> Sean: All right - a bit trickier 4 [IV] because it's got that 'minus one' sort of idea. Let's put one in that shouldn't work.

  • >> DFB: Go on that what do you want ? OK just put in C for Computerphile.

  • >> DFB: C for Computerphile? Now you might say that's a valid Roman number.

  • Yes, but it's it's not being covered by the bit of Regular Expression I've done, I've done a

  • piece of Regular Expression that literally, I hope, does I to IX and

  • nothing else. Let's see what happens there [with C] It echoes it back but it doesn't

  • do anything with it. So let's now try IX, looking good.

  • All right how about the ultimate test [which] is type in something utterly idiotic like HELLO

  • No, doesn't do anything with it, doesn't recognize it. So, we're back

  • showing the 'awk' script again and let me just remind you that what we've shown is

  • that as long as we put the anchors in at the beginning and the end of this

  • pattern then our minimal testing, but I think we've tried all the options, it's

  • recognized V it's recognized IX. IV all this stuff. That one is absolutely

  • fine and it rejects rubbish. I've actually put in, as a comment,

  • StackOverflow's assertion that this [points at screen] is the thing that's the answer to life the

  • universe and everything. Let's move right across to the right and the first thing

  • you'll see here is the piece we've just done for the units as it were, I through to IX.

  • But as you go leftwards can you see that the beauty of this is ...

  • that has handled the units part of the Roman Numbers. It's dead easy to see, therefore,

  • that it's like a repetition of the same idea, but with different symbols? When you

  • come to the tens range What are the exceptions there XC (ten less than a hundred)

  • XL (ten less than fifty). I's just like the I's and the X's and V's were before, look.

  • Similar. X[CL]. Or it is is an optional L, for 50, followed by a tens digit, zero

  • {0,3} of them. You can't say just {1,3} because L on its own is

  • perfectly valid as a Roman Number Number. So, you see that piece of pattern - similar to

  • the far-right one but just with a relabeling almost of the digits you're

  • using- that copes with the tens. Up here [points to left of screen] it's coping with the hundreds, going on to

  • the thousands. So, the story finishes off in here by saying: Ah! you can have CM

  • (a hundred less than a thousand) i.e. 900 in other words. CD for 400.

  • And the midpoint of the range, which everything pivots around. It was V

  • there, it was L here, it is going to be D here. I hope you all know that D is

  • 500? So there we are. how about you see three very similar

  • pieces of pattern for the hundreds the tens and the units and right out here at

  • the far left, M*. And M* means you can have zero or more 1000s. So, we've

  • got the basis there for something that looks like it would do everything. The

  • only problem is these empty matches. But that's such a big problem we probably

  • ought to have an episode devoted to that all on its own.

There's a parsing theme, devotees will know, and there's a Regexp theme. And our

Subtitles and vocabulary

Click the word to look it up Click the word to find further inforamtion about it