Placeholder Image

Subtitles section Play video

  • I actually think that if we're going to do parsing it might be easiest not to

  • dive into compilers straight away, as some of you might want me to do, but to

  • actually just take a simple, very simple, subset of the English language and see

  • how we would parse that. And if you look up in a dictionary "what does `parsing' mean"

  • it says, essentially, one of the definitions is: " ... breaking up sentence

  • structure into its component parts under the direction of a grammar." so the

  • So the grammar is the rules of your language. Well, I have drawn up here a language

  • involving furry dice and robots and people, men and women, and cats and dogs

  • and they can 'bite' they can 'stroke' they can ... whatever. And I thought: that little

  • grammar, which I'll walk you through, is capable of generating between 200 and

  • 300 sentences and they are all valid English. But there's no way it

  • captures the spirit of English. The idea that your sentences have

  • subject-verb-object is pretty universal, I think, among what's called the

  • Indo-European languages. However, sometimes it's not always in the order

  • of subject-verb-object, although that's common. Sometimes they get twisted up

  • particularly in German, where there's a definite option to have the verb last.

  • In this example I'm pointing to here, I'm using a notation that we did cover about

  • three or four years ago now. We'll put out a link.

  • I think the episode was called Angle Brackets. And they invented a notation

  • that looks like this: pointy brackets. Yes, back in the very late 50s and early 60s

  • people who wanted to write compilers realized they had to write

  • down a formal specification of what were the rules in their language. And two

  • computer scientists actually won the Turing award - [the] ACM Turing Award - for their

  • notation and their development of a grammar

  • for the language Algol 60. And it uses angle brackets all over the place.

  • So, that was so-called Backus-Naur form. Those are the two computer scntists

  • who developed it. People really liked it. And I've used it here because it's

  • verbose enough to make some sense. I'll read it out to you. What I'm saying, up at

  • the top of the grammar - using the angle brackets for what's going to become

  • parts of a so-called parse tree. A is defined as then a

  • then an . That's the most common form of English sentence the

  • subject is the person, or thing, doing the action. The verb says what the action is:

  • 'kicked' 'bit', 'stroked' ... whatever. The object says what it is doing it to. [e.g.] 'the man

  • kicked the robot' And it's such a powerful language! You can have either

  • 'the' or 'a' [as] definite or indefinite article. You can have 'the robot' or

  • 'a robot'. You can have a choice of verbs either 'bit', 'kicked' or 'stroked' - you're

  • going to be very expressive in this language (!) Your subject of the sentence

  • can either be 'dog', 'cat', 'man', 'woman' or 'robot'. And finally the tailpiece. The object

  • i've decided could either be any followed by a again, like 'a dog',

  • 'the robot bites a dog' [is] perfectly possible. But just to show that you can put in

  • shortcuts, if you like, and put bits of structure in there that aren't parsed -

  • they're just globbets you can put in - I have defined the phrase: 'two furry dice'

  • as being a valid 'object'. Just the string "two furry dice". So the target

  • sentence, that ultimately we're going to try and construct from all of this, is

  • the [in lower case] robot stroked [writes on paper] ... running out of space here, carry on on the line below ...

  • "the robot stroked two furry dice" We all agree it makes sense [but] is it parsable

  • against that set of rules? And how would you go about showing that it was?

  • >> Sean: So by 'parsable' you mean "can we decode it"? >> DFB: [Yes] can we decode it? Can we assign everything to

  • be the right flavour of structural component according to those [grammar] rules.

  • You can either start with a string, which we've got here, and start trying to group

  • things together and work upwards. Now we all know about trees - the root is at the

  • top. We're computer scientists, the root is at the top; the leaves are at the bottom.

  • They're always upside down. Do we start with and leaves and say:"Ooh! the

  • robot stroked ... The robot stroked ... Looking sideways at the grammar I can

  • see there's a shortcut for "the robot". So I'm gonna start coming from the string

  • upwards and see if I end up back at the root of the tree, or shall I start at the

  • root of the tree and say: "I must have a subject then a verb then an object". Can I

  • see anything that could likely be the start of a ? That's called coming

  • top-down. So it's been there in computer science ever since the dawn of computer

  • science: the top-down approach versus the bottom-up approach.

  • There's no prizes for guessing that functional languages like Haskell tend,

  • by their very structure, to favour the top-down approach. It's messier, is the

  • bottom-up approach. I shall try to indicate both, OK, but let's

  • just this once, let's go top down rather than bottom-up.

  • If we're coming top-down it is telling me, is this grammar, that I must have a

  • then a then an . Now, for those

  • of you who are concerned with a mechanistics -- you can actually think

  • of these things, if you're coming top down, as being pushed as targets, that

  • you've got to satisfy, on a stack which .... We all know about stacks! This one is not

  • building upwards and downwards it's building from right to left. So, at the

  • top of my stack currently, because I'm coming top down, I say my target my

  • sub-target is . I must find a . But I must keep half an eye on the target

  • string to say: "does it really match?" If we cross-refer now, to this grammar,

  • for inspiration, we're trying to match a If I look at the grammar over

  • there it tells me that a can either be an article and a noun like

  • "the" and "a" , with a like 'cat', 'dog', 'robot' or you can take a shortcut. Because it

  • tells you here that the just could be the phrase "the robot". So let's

  • go for broke. I'm looking at the input string and I see

  • "the robot". So let's take the shortcut just to be devils, right?! I can say

  • straight away then, taking the shortcut, that can be the string

  • "the robot". That's good! Now, having satisfied , I pop it

  • off the top of my stack. I've done that. Next thing, to match this sentence,

  • you've got to be able to match the . Well, we're going along the string we've coped

  • with "the robot". Next word - next valid leaf word - that has to match up with being a

  • . I'm looking at 'stroked'. Is 'stroked' a possibility? Look at . What are the

  • possibilities for ? And I see 'bit' 'kicked' or 'stroked'. Great! I choose 'stroked'.

  • But, once again, if you can imagine the stack I then pop that off the top of the stack

  • -- we've done it! The stack is shrinking! all we've got left to identify

  • is an that matches [in] that sentence. Again, I've pulled a fast one here! In the

  • rules for I said you can either break it down into an and a ,

  • just like the , so nice having the action done to it. Or you can take a

  • whizzy short-cut. An allowed shortcut, for an is what you might call an

  • and I put it in deliberately, just to make a point: "two furry dice" Wonderful!

  • Simplest parse going: an can be "two furry dice". We declare success and we say

  • we have matched "the robot strokes two furry dice" by going .

  • came down into being shortcut "the robot". developed straight away

  • into being "stroked". shortcut to It's a very shallow tree, really.

  • But it's developed the target sentence: "the robot stroked two furry dice".

  • Now, the only thing you might say: "Well, yes, wonderful, we've parsed it. But would

  • it have worked if the instead of taking the shortcut. Could you have

  • done it with the option?" Yes, you could.

  • I could have basically, come here: let's write out:

  • and, yes, right at the beginning instead of taking the shortcut phrase

  • "the robot", I could have said: "Ah! but I'm going to do it as can be an

  • followed by a ". So, the head of the stack then would be - when we're

  • looking at the sentence and starting off from the beginning again - it's got to be

  • an to start with. is 'a' or 'the'. Oh! I'm looking at the string, I see "the".

  • Bingo! So, article can be "the" - no problem.

  • Pop that off the stack. What's the next thing I've got to look for? . Can be "robot",

  • which is the next thing I'm looking at? Yes, "robot" is an option, so I could have done

  • "robot". Now the is as before. It's got to be "stroked". And "stroked" is

  • there. And equally the is as before. There's only one way to develop

  • the string: "two furry dice" and that's to take the declared short-cutk that does it for

  • you. The reason for doing this rather artificial example is to say: "Oh dear!

  • Does this matter? We have got a sentence that makes perfect sense to us: "the robot

  • stroked two furry dice". But you've drawn two slightly different trees. First time

  • you took a shortcut subject-phrase, just for "the robot". Next time you said: "Ah! but

  • we could also do it by breaking it down further into a 'the' and 'robot'. Does that

  • matter? In this particular case, in informal spoken English, no it doesn't.f

  • If people understand you the fact that your grammar is technically called 'ambiguous'

  • - so if you say to yourself what does 'ambiguous' mean? It means there's more

  • than one parse tree that seems to satisfy and give the final target phrase.

  • Does that matter? In this case 'no'. Does it matter in general? Yes, because you might

  • get very different effects under certain circumstances. What I think I'll show you

  • is a favourite one almost. Which is to check your 4-function desk calculator.

  • When you start typing things into a 4-function (so-called) hand-calculator.

  • You're basically putting in, shall we say, integer numbers - let's simplify it.

  • But the burning question is something like this: that if you type in 8 * 4 * 2

  • You may not think of it in those terms but computer scientists

  • would say: "Yes ,if you've got the right grammar rules that's an example of a

  • "sentence". That is a - what might you call it? - a sort

  • of subsection of a long calculation, that you do on the right-hand side [say] of a C statement.

  • It's just 8 times 4 times 2. There's no problem is there?. Eight 4s are 32.

  • Two 32s are 64. But then you say: "Ah! but we're into tree structures now. How would

  • you parse that sentence, given that multiply can only multiply two things

  • at a time. So, are you going to arrange it so that your overall expression groups

  • as 8 * 4, done on the left. Then you do the multiply by 2, which is

  • that tree structure. So, the question is you've got a little bit of depth on your

  • tree structure here. It's gone down a level whereas finally you've got one [operand]

  • up at the top level. Do you want that sub-structure to be on the left or on the right?

  • Because it would be equally OK, as we've got things at the moment, for

  • you to draw something that's got 8 here, multiplied by here 4 * 2 with a

  • multiplier between. So have we got a little bits of tree structure on the

  • left or on the right? And does it matter? And you say no it doesn't matter. We all

  • know that you can't complete the expression until it's left operand

  • is known and that involves forcing you to do the multiply. 8 * 4 is 32.

  • Fine, 32. What's the other operand here? It's multiply by 2: 32 * 2 = 64.

  • Equally, over here, 8 x (four twos are 8; eight 8s are 64) So, you're doing it two different ways

  • but you're getting the same answer. And the reason that's OK

  • is that multiply is well-behaved [over the integers] Mathematicians will tell you it commutes

  • a * b is the same as b * a for integers [and * is also associative over the integers]

  • However, just look at the ambiguity, OK, Two different parse trees and it

  • didn't matter for multiply. [But] just suppose those multipliers weren't multiplies but

  • were divides [ / ] . Does it then matter that you do 8 / 4 first in

  • this one? You get your 2 and the multiply is now replaced by a divide:

  • 2 / 2 is 1. But look at this other possibility and

  • just mentally substitute divides for multiplies. 8t divided by the result

  • of 4 / 2 ? Well 4 /2 is 2. 8 / 2 is 4.

  • Two completely different answers! Is it 1 or is it 4. If you're panicking take

  • out your calculator now and check the app. Just do 8 / 4 / 2 and

  • if the answer isn't 1 demand your money back!

  • Here we see it then. Here is an example of where ambiguity [two differing parse trees] does lead to problems,

  • because depending which tree you choose you get a different answer. So, there we

  • are then. We now know what a grammar is. We know that against a grammar you can

  • try and parse things like sentences but "sentences" [is] used in its widest sense.

  • It could be a program statement from C and just think of this: 8 * 4 * 2

  • has been as a part of a phrase on the right-hand side of an arithmetic statement.

  • And grammar is the guiding principle for the whole thing.

I actually think that if we're going to do parsing it might be easiest not to

Subtitles and vocabulary

Click the word to look it up Click the word to find further inforamtion about it