
  • AMIT SHARMA: Hi, all.

  • Welcome to this session on

  • Causality and Machine Learning

  • as a part of Frontiers in

  • Machine Learning.

  • I'm Amit Sharma from Microsoft

  • Research and your host.

  • Now of course I presume you

  • would all agree that

  • distinguishing correlations from

  • causation is important.

  • Even at Microsoft, for example,

  • when we're deciding which

  • product feature to ship or when

  • we're making business decisions

  • about marketing, causality is

  • important.

  • But in recent years, what we're

  • also finding is that causality

  • is important for building

  • predictive machine learning

  • models as well.

  • So especially if you're interested in out-of-domain

  • generalization and in making your

  • models less brittle, you need

  • causal reasoning to make them

  • robust. And in fact there are

  • interesting results even about

  • adversarial robustness and privacy

  • where causality may play a role.

  • This is an interesting time at

  • the intersection of causality

  • and machine learning. And we

  • now have a group at Microsoft as

  • well that is looking at these

  • connections.

  • I'll post a link in the chat.

  • But for now, today I thought we

  • could all ask this question:

  • what are the big ideas that will

  • drive this conversation between

  • causality and ML further.

  • And I'm glad that today we have

  • three really exciting talks.

  • Our first talk is from Susan

  • Athey, the Economics of Technology

  • Professor at Stanford. She'll

  • talk about the challenges and

  • solutions for decision-making

  • in high-dimensional settings and how

  • generative data modeling can

  • help.

  • And in fact when I started in

  • causality, Susan's work was one

  • of the first I saw that was

  • making connections between

  • causality and machine learning.

  • I'm looking forward to her talk.

  • And next we'll have Elias

  • Bareinboim, who will be talking

  • about the three kinds of

  • questions we typically want to

  • ask about data and how two of

  • them turn out to be causal and

  • they're much harder.

  • And he'll also talk about an

  • interesting emerging new field,

  • causal reinforcement learning.

  • And then finally we'll have

  • Cheng Zhang from Microsoft

  • Research Cambridge.

  • She'll essentially give a recipe

  • for how to build models, neural

  • networks, that are robust to

  • adversarial attacks. And as you may

  • have guessed by now, she'll use

  • causal reasoning. And at the end we'll

  • have 20 minutes for open

  • discussion. All the speakers

  • will be live for your questions.

  • Before we start, let me tell you

  • one quick secret.

  • All these talks are prerecorded.

  • So if you have any questions

  • during the talk, feel free to

  • just ask those questions on the

  • hub chat itself and our speakers

  • are available to engage with you

  • on the chat even while the talk

  • is going on.

  • With that, I'd like to hand it

  • over to Susan.

  • SUSAN ATHEY: Thanks so much for

  • having me here today in this

  • really interesting session on

  • machine learning and causal

  • inference.

  • Today I'm going to talk about

  • the application of machine

  • learning to the problem of

  • consumer choice.

  • And I'm going to talk about some

  • results from a couple of papers

  • I've been working on that

  • analyze how firms can use

  • machine learning to do

  • counterfactual inference for

  • questions like how should I

  • change prices or how should I

  • target coupons.

  • And I'll also talk a little bit

  • about the value of different

  • types of data for solving that

  • problem.

  • Doing counterfactual inferences

  • is substantially harder than

  • prediction.

  • There can be many data

  • situations where it's actually

  • impossible to estimate

  • counterfactual quantities.

  • It's essential to have the

  • availability of experimental or

  • quasi experimental variation in

  • the data to separate correlation

  • from causal effects.

  • That is, we need to see whatever

  • treatment it is we're studying,

  • that needs to vary for reasons

  • that are unrelated to other

  • unobservables in the model. We

  • need the treatment assignment to

  • be as good as random after

  • adjusting for other observables.

  • We also need to customize

  • machine learning optimization

  • for estimating causal effects

  • and counterfactual of interest

  • instead of for prediction.

  • And indeed, model selection and

  • regularization need to be quite

  • different if the goal is to get

  • valid causal estimates. That's

  • been a focus of research,

  • including a lot of research I've

  • done.

  • A second big problem in

  • estimating causal effects is

  • statistical power. In general,

  • historical observational data

  • may not be informative about

  • causal effects. If we're trying

  • to understand what's the impact

  • of changing prices, if prices

  • always change in the past in

  • response to demand shocks, then

  • we're not going to be able to

  • learn what would happen if I

  • change the price at a time when

  • there wasn't demand shock. I

  • won't have data from that in the

  • past.

  • I'll need to run an experiment

  • or I'm going to need to focus on

  • just a few price changes or use

  • statistical techniques that

  • focus my estimation on a small

  • part of the variation of the

  • data.

  • Any of those things is going to

  • lead to a situation where I

  • don't have as much statistical

  • power as I would like.

  • Another problem is effect sizes

  • are often small.

  • Firms are usually already

  • optimizing pretty well.

  • It will be surprising if making

  • changes leads to large effects.

  • And the most obvious ideas for

  • improving the world have often

  • already been implemented.

  • Now that's not always true, but

  • it's common.

  • And finally personalization is

  • hard.

  • If I want to get exactly the

  • right treatment for you, I need

  • to observe lots of other people

  • just like you, and I need to

  • observe them with different

  • values of the treatment variable

  • that I'm interested in.

  • And again that's very difficult,

  • and often it's not possible to

  • get the best personalized effect

  • for someone in a small dataset.

  • Instead, I'm averaging over

  • people who are really quite

  • different than the person of

  • interest.

  • So for all of these reasons, we

  • need to be quite cautious in

  • estimating causal effects and we

  • need to consider carefully what

  • environments enable that

  • estimation and give us enough

  • statistical power to draw

  • conclusions.

  • Now I want to introduce a model

  • that's commonly used in

  • economics in marketing to study

  • consumer choice.

  • This model was introduced by Daniel

  • McFadden in the early 1970s, and he

  • won the Nobel Prize for this

  • work.

  • The main crux of his work was to

  • establish a connection between

  • utility maximization, a

  • theoretical model of economic

  • behavior, and a statistical

  • model, the multinomial logit.

  • And this modeling setup was

  • explicitly designed for

  • counterfactual inference.

  • The problem he was setting out to

  • solve was what would happen if

  • we expand BART, the public

  • transportation system in the Bay

  • Area: if I expand BART, how

  • will people change their

  • transportation choices when they

  • have access to this new

  • alternative?

  • So the basic model is that an

  • individual's utility depends on

  • their mean utility, which varies

  • by the user, the item and time,

  • plus an idiosyncratic shock.

  • In general, this -- we're going

  • to have a more specific

  • functional model for the mean

  • utility, and that's going to

  • allow us to learn from seeing

  • the same consumer over time and

  • also to extrapolate from one

  • consumer to the other.

  • We're going to assume that the

  • consumer maximizes utility among

  • items in a category by just

  • making this choice. So they're

  • going to choose the item I that

  • maximizes their utility.

  • The nice thing is that if the

  • error has a type one extreme value

  • distribution and is independent

  • across items, then we can write

  • the probability that the user's

  • choice at time t is equal to item i

  • in the standard multinomial

  • logit functional form.

  • So utility maximization, where

  • these mus are the mean utilities,

  • will lead to multinomial logit

  • probabilities.

  • So data about individual i's

  • purchases can be used to estimate

  • the mean utility.

  • In particular, if we write their

  • utility, their mean utility as

  • something that depends on the

  • item and the user but that's

  • constant over time, so this is

  • just their mean utility for this

  • item, like how much they like a

  • certain transportation choice,

  • and then a second term which is

  • a product of two terms, the

  • price the users faces at time T

  • for item I and a preference

  • parameter that's specific to the

  • user.

  • If I have this form of

  • preferences and then the price

  • varies over time while the

  • user's preference parameters

  • stay constant, I'll be able to

  • estimate how the user feels

  • about prices by looking at how

  • their choices differ across

  • different price scenarios.

  • And if I pool data across users,

  • I'll then be able to understand

  • the distribution of consumer

  • price sensitivities as well as

  • the distribution of user

  • utilities for different items.
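
A minimal sketch, not from the talk, of the multinomial logit choice probabilities just described, assuming a hypothetical mean utility of the form "base preference plus a user-specific price-sensitivity parameter times price":

```python
import numpy as np

def choice_probabilities(theta_u, gamma_u, prices_t):
    """Multinomial logit choice probabilities for one user at one shopping trip.

    theta_u  : array of the user's mean utilities for each item (constant over time)
    gamma_u  : the user's price-sensitivity parameter (typically negative)
    prices_t : array of item prices the user faces at time t
    """
    mu = theta_u + gamma_u * prices_t          # mean utility mu for each item
    exp_mu = np.exp(mu - mu.max())             # subtract max for numerical stability
    return exp_mu / exp_mu.sum()               # softmax = multinomial logit probabilities

# Hypothetical example: 3 items, a fairly price-sensitive user
probs = choice_probabilities(np.array([1.0, 0.5, 0.0]),
                             gamma_u=-2.0,
                             prices_t=np.array([2.0, 1.5, 1.0]))
print(probs)   # probabilities over the three items, summing to 1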

  • So in a paper with Rob Donnelly

  • and David Blei and Fran Ruiz,

  • we take a look at how we can

  • combine machine learning methods

  • and modern computational methods

  • with traditional approaches to

  • studying consumer purchase

  • behavior in supermarkets.

  • The traditional approach in

  • economics in marketing is to

  • study one category like paper

  • towels at a time.

  • We then model consumer

  • preferences using a small number

  • of latent parameters.

  • For example, we might allow a

  • latent parameter for how much

  • consumers care about prices.

  • We might allow a latent

  • parameter for product quality.

  • But other than that, we would

  • typically assume that there's a

  • small number of observable

  • characteristics of items and

  • there's some common coefficients

  • which express how all consumers

  • feel about those

  • characteristics.

  • The traditional models also

  • assume that items are

  • substitutes within a category

  • and they would ignore other

  • categories.

  • So you might study consumer

  • purchases for paper towels

  • ignoring everything else in the

  • supermarket, just throwing all

  • that data away.

  • So what we do in our approach is

  • that we maintain this utility

  • maximization approach.

  • But instead of just studying one

  • category, we study many

  • categories in parallel.

  • We look at more than 100

  • categories, more than a thousand

  • products at the same time.

  • We maintain the assumption that

  • categories are independent and

  • that items are substitutes

  • within the categories.

  • And we select categories where

  • that's true.

  • So categories of items where the

  • consumers typically only

  • purchase one brand or one of the

  • items.

  • We then take the approach of a

  • nested logit, which comes from

  • the literature in economics and

  • marketing: in each category

  • there's a shock to an

  • individual's need to purchase in

  • the category at all.

  • But then, conditional on

  • purchasing, the errors or the

  • idiosyncratic shocks to the

  • consumer's utility are

  • independent.

  • So having the single shock to

  • purchasing at all effectively

  • introduces correlation among

  • the probabilities of purchasing

  • each of the items within the

  • category.

  • Now, the innovation where the

  • machine learning comes in is

  • that we're going to use matrix

  • factorization for the user item

  • preference parameters.

  • So instead of having for each

  • consumer a thousand different

  • latent parameters, each one for

  • each product they might

  • consider, instead we use matrix

  • factorization so there's a lower

  • dimensional vector of latent

  • characteristics for the products

  • and consumers have a lower

  • vector for latent preferences

  • for those characteristics.

  • That allows us to improve upon

  • estimating a hundred different

  • separate category models.

  • We're going to learn about how

  • much you like organic lettuce

  • from whether you chose organic

  • tomatoes, and we'll also just

  • learn about whether you like

  • tomatoes at all from whether you

  • purchased lettuce in the past.
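
A minimal sketch, not from the paper, of the matrix factorization idea just described: instead of a separate latent parameter for every user-item pair, each user-item preference is the inner product of low-dimensional latent vectors. All names and dimensions below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 1000, 1000, 20   # k latent dimensions instead of n_items parameters per user

lambda_u = rng.normal(size=(n_users, k))   # latent user preference vectors
beta_i = rng.normal(size=(n_items, k))     # latent item characteristic vectors

# For illustration we materialize the full user-item preference matrix;
# in practice each entry is recovered as an inner product only when needed.
theta = lambda_u @ beta_i.T                # theta[u, i] ~= preference of user u for item i
print(theta.shape)                         # (1000, 1000) preferences from 2 * 1000 * 20 parameters
```

Because items share latent characteristics (for example an "organic" dimension), a purchase of organic tomatoes moves the user's latent vector and thereby informs the predicted preference for organic lettuce, which is the cross-category sharing described above.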

  • And so I won't have time to go

  • through it today, but this is a

  • layout of what we call the nested

  • factorization model, showing the

  • nest where the consumer first decides

  • whether to purchase in the category at

  • all, and then, conditional on purchasing,

  • decides which item to

  • purchase.

  • And we have in each case vectors

  • of latent parameters that are

  • describing the consumer's

  • utility for categories and for

  • items.

  • One of the reasons that this

  • type of model hasn't been done

  • in economics and marketing in the

  • past is that what was standard in

  • economics and marketing, if you

  • were going to do a model like

  • this, would be to use either

  • classical methods like maximum

  • likelihood without very many

  • latent parameters, or Markov

  • chain Monte Carlo Bayesian

  • estimation, which historically had

  • very limited scalability. What we do in our

  • papers is use variational Bayes,

  • where we approximate the

  • posterior with a parameterized

  • distribution and minimize the

  • KL divergence to the true

  • posterior using stochastic

  • gradient descent.

  • We show we can overcome a number

  • of challenges: in particular,

  • introducing price and time-varying

  • covariates slows down the

  • computation a fair bit, and the

  • substitutability within

  • categories leads to

  • nonlinearities. Despite that,

  • we're able to overcome these

  • challenges.
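
A toy illustration, not the authors' implementation, of variational Bayes as just described: approximate a posterior with a parameterized Gaussian and take stochastic gradient steps that maximize the ELBO, which is equivalent to minimizing the KL divergence to the true posterior. The model and numbers are made up for illustration:

```python
import numpy as np

# Toy model: y ~ Normal(theta, 1), prior theta ~ Normal(0, 1).
# Approximate p(theta | y) with q(theta) = Normal(m, s^2), s = exp(log_s),
# using the reparameterization trick and stochastic gradient ascent on the ELBO.
y = np.array([0.8, 1.2, 1.0, 0.6])
m, log_s = 0.0, 0.0
lr, n_steps, n_mc = 0.01, 3000, 8
rng = np.random.default_rng(0)

for _ in range(n_steps):
    eps = rng.normal(size=n_mc)
    theta = m + np.exp(log_s) * eps                         # samples from q(theta)
    # d/d theta of log p(y, theta) = sum_j (y_j - theta) - theta
    dlogp = (y[None, :] - theta[:, None]).sum(axis=1) - theta
    m += lr * dlogp.mean()                                   # pathwise gradient w.r.t. m
    log_s += lr * ((dlogp * eps * np.exp(log_s)).mean() + 1.0)  # + 1 from the entropy of q

# The exact posterior here is Normal(sum(y) / 5, 1 / 5): mean ~0.72, std ~0.45.
print(m, np.exp(log_s))
```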

  • Once we have estimates of

  • consumer preferences for a

  • product, and as well we have

  • estimates of consumer

  • sensitivity to price, we can

  • then try to validate our model

  • and see how well do we actually

  • do in assessing how consumer

  • demand changes when prices

  • change.

  • And in our data we see many,

  • many price changes. In this

  • particular grocery store we have

  • data from, prices typically

  • change on Tuesday night.

  • And so in any particular week

  • there may be a change in price

  • from Tuesday to Wednesday.

  • And so in order to assess how

  • well our model does in

  • predicting the change in demand

  • and response to a change in

  • price, we held out test data

  • from weeks with price changes.

  • In those weeks we break the

  • price changes into large and

  • smaller price changes, different buckets of

  • the size of the price change.

  • We then look at what is the

  • change from Tuesday to Wednesday

  • in demand in those weeks.

  • Finally, we break out those

  • aggregations according to which

  • type of consumer we have for

  • each item.

  • So in particular, on a week

  • where we have a change in price

  • for a product, we can

  • characterize the consumers as

  • being very price sensitive,

  • medium price sensitive or not

  • price sensitive for that

  • specific product.

  • And then we can compare how

  • demand changes for each of those

  • three groups.
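
A small pandas sketch, with entirely hypothetical column names and numbers, of the validation exercise just described: bucket held-out weeks by the size of the price change and compare the Tuesday-to-Wednesday change in demand across the predicted price-sensitivity groups:

```python
import pandas as pd

# Hypothetical held-out records: one row per (consumer, product, week) around a price change,
# with the consumer's predicted price-sensitivity group for that product.
df = pd.DataFrame({
    "sensitivity_group": ["low", "low", "medium", "high", "high", "high"],
    "price_change_pct":  [-10, -10, -10, -10, -25, -25],
    "qty_tuesday":       [1, 0, 1, 0, 1, 2],
    "qty_wednesday":     [1, 1, 2, 1, 3, 4],
})
df["demand_change"] = df["qty_wednesday"] - df["qty_tuesday"]

# Bucket by the size of the price change, then compare mean demand changes across
# sensitivity groups; the most price-sensitive group should respond the most.
summary = (df.groupby([pd.cut(df["price_change_pct"], bins=[-30, -15, 0]),
                       "sensitivity_group"])["demand_change"].mean())
print(summary)
```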

  • And so this figure here

  • illustrates what we find in the

  • held-out test data.

  • In particular, we find that the

  • consumers that we predict to be

  • the least price sensitive in

  • fact don't seem to respond very

  • much when prices change, while

  • the consumers who are most price

  • sensitive are most elastic, as

  • we say in economics, are the

  • ones whose quantity changes the

  • most when prices change. Once

  • we're confident that we have a

  • good model of consumer

  • preferences, we can then try to

  • do counterfactual exercises such

  • as evaluate what would happen if

  • I introduce coupons and targeted

  • them at individual consumers.

  • We'll take a simple case where

  • we have only two prices we

  • consider, the high price or the

  • typical price, and the low

  • price, which is the discounted

  • price.

  • Now what we do we look into the

  • data and we evaluate what would

  • happen if we sent those targeted

  • coupons out.

  • So for each product we look at

  • the two most common prices that

  • were charged in the data.

  • We then assess which consumers

  • would be most appropriate for

  • coupons.

  • We might look, for example, and

  • say I want to give coupons to a

  • third of consumers; I can see

  • which consumers are most price

  • sensitive, most likely to

  • respond to those coupons.

  • I can then actually use held out

  • test data to assess whether my

  • coupon strategy is actually a

  • good one.

  • And that will allow me to

  • validate again whether my model

  • has done a good job in

  • distinguishing the more price

  • sensitive consumers from the

  • less price sensitive consumers.

  • So this figure illustrates that

  • for a particular product there

  • were two prices, the high price

  • and low price that were charged

  • over time.

  • In the actual data, different

  • users come to

  • the store sometimes on a low-

  • price day and sometimes on a

  • high-price day, indicated by blue

  • or red.

  • What we then do is say what

  • would our models say about who

  • should get the high price and

  • who should get the low price.

  • So we can reassign

  • counterfactually say the top

  • four users to high prices

  • indicated by these orange

  • squares, and we can

  • counterfactually reassign the

  • low, the fourth -- the fifth and

  • sixth user to the low price,

  • indicated by the green

  • rectangles.

  • Now, since the users we assigned

  • to high saw a mix of low and

  • high prices, I can actually

  • compare how much those users

  • purchased on the high priced

  • days and low priced days and I

  • can also look among the people

  • that I would counterfactually

  • assign to low prices and see

  • what's the impact of high prices

  • versus low prices for those

  • consumers. And I can use those

  • estimates to assess what would

  • happen if I reassigned users

  • according to my counterfactual

  • policy.
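
A toy sketch, not the paper's code, of this counterfactual policy evaluation: because each user in the held-out data is observed under both price levels, we can read off what revenue the policy's counterfactual assignment would have produced. Column names and numbers are hypothetical:

```python
import pandas as pd

# Hypothetical held-out purchase records: user, price level seen that day, quantity bought,
# plus the counterfactual price level the targeting policy would assign to that user.
df = pd.DataFrame({
    "user":              [1, 1, 2, 2, 3, 3, 4, 4],
    "price_seen":        ["high", "low"] * 4,
    "price":             [3.0, 2.0] * 4,
    "quantity":          [1, 1, 0, 2, 1, 1, 0, 3],
    "policy_assignment": ["high", "high", "low", "low", "high", "high", "low", "low"],
})
df["revenue"] = df["price"] * df["quantity"]

# Per user, keep the revenue observed under the price level the policy would assign,
# and compare it with a crude factual baseline (average over the two observed regimes).
per_user = df.pivot_table(index=["user", "policy_assignment"],
                          columns="price_seen", values="revenue").reset_index()
policy_revenue = per_user.apply(lambda r: r[r["policy_assignment"]], axis=1).sum()
factual_revenue = df["revenue"].sum() / 2
print(policy_revenue, factual_revenue)
```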

  • When I do this I can compare

  • what my model predicts would

  • happen in the test set to what

  • actually happened in the test

  • set.

  • What I actually find, somewhat

  • surprisingly, here is that in

  • fact what actually happens in

  • the test set is even more

  • advantageous for the firm than

  • what the model predicts.

  • In particular, our model

  • predicts if I reallocate the

  • prices to the consumers

  • according to what our model

  • suggests would be optimal from a

  • profit perspective, we can get

  • an 8% increase in revenue.

  • That is instead of varying

  • prices from high to low or from

  • day to day we always kept them

  • high and then we targeted the

  • coupons to the more price

  • sensitive consumers.

  • In the data, if we actually look

  • at what happened in our held-out

  • test data, it looks like that

  • the benefits to high versus low

  • prices and the difference in

  • those benefits between the high

  • and the low consumers are such

  • that it looks like in the test

  • set we actually would have

  • gotten a 10 or 11% increase in

  • profits had the prices been set

  • in that way.

  • To conclude, the approach I've

  • outlined is to try to learn

  • parameters of consumers' utility

  • through revealed preference.

  • That is, use the choices that

  • consumers make to learn about

  • their preferences about product

  • characteristics and prices and

  • then predict their responses to

  • alternative situations.

  • It's important to find a dataset

  • that's large enough and has

  • sufficient variation in price to

  • isolate the causal effects of

  • prices and also assess the

  • credibility of the estimation

  • strategy.

  • And it's also important to

  • select a counterfactual to study

  • where there's actually enough

  • variation in the data to be able

  • to assess and validate whether

  • your estimates are right.

  • And so I illustrated two cases

  • where I was able to use test set

  • data to validate the approach.

  • You use the training data to

  • assess, for example, which

  • consumers are most price

  • sensitive and look at the test

  • data and see if their purchase

  • behavior varies with price in

  • the way that your model

  • predicts.

  • In ongoing work, I'm trying to

  • understand how the different

  • types of data create value for

  • firms.

  • And so in particular if firms

  • are using the kinds of machine

  • learning models that I've been

  • studying and they use those

  • estimates in order to do things

  • like target coupons, we can ask

  • how much do profits go up as

  • they get more data.

  • In particular, how does that

  • answer vary if it's more data

  • about lots more consumers, or if

  • we do things like retain

  • consumer data for a longer

  • period of time.

  • And preliminary results are

  • showing that retaining user data

  • for a longer period of time so

  • you really get to know an

  • individual consumer can be

  • especially valuable in this

  • environment.

  • Overall, I think there's a lot

  • of promise in combining tools

  • from machine learning like

  • matrix factorization but also

  • could be neural nets, with some

  • of the traditional approaches

  • from causal inference.

  • And so here we've put these things

  • together.

  • We used functional forms for

  • demand and the concepts of

  • utility maximization and

  • approaches to counterfactual

  • inference from economics and

  • marketing, combined with computational

  • techniques from machine learning,

  • in order to be able to do this

  • type of analysis at large scale.

  • ELIAS BAREINBOIM: Hi, guys.

  • Good afternoon.

  • I'm glad to be here online

  • today.

  • Thank you for coming.

  • Also, thank you to the

  • organizers, Amit and Amber, for

  • inviting me to speak at the

  • event today.

  • My name is Elias Bareinboim.

  • I'm from the Computer Science

  • Department and the Causal

  • Artificial Intelligence Lab at

  • Columbia University.

  • Check my Twitter.

  • I have discussions about

  • artificial intelligence and

  • machine learning.

  • Also apologies for my voice.

  • I'm a little bit sick.

  • But very happy to be here today.

  • I will be talking about what I

  • have been thinking about the

  • foundations of artificial

  • intelligence, how it relates to

  • causal inference and the notions

  • of explainability and

  • decision-making.

  • I'll start from the outline of

  • the talk.

  • I'll start from the beginning.

  • Defining what is a causal model.

  • I will introduce three basic

  • results that are somewhat

  • intertwined.

  • I usually say that if we understand

  • them, we understand like 50% of

  • what causal inference is about.

  • There are a lot more technical

  • results, but the conceptual

  • part is the most important.

  • The first I'll start with

  • structural causal models, which

  • is the most general definition

  • of causal model that we know to

  • date, that's by Pearl himself.

  • Then I'll introduce the second

  • and third results. The second

  • result is known as the

  • Pearl Causal Hierarchy, the PCH,

  • which was named after him.

  • This is the name of a

  • mathematical object used by

  • Pearl himself and Mackenzie in

  • the Book of Why.

  • If you haven't read the book, I

  • strongly recommend it. It's

  • pretty good, since it discusses

  • the foundations of causal

  • inference and how it relates to

  • the future of AI and machine

  • learning,

  • most prominently in the last

  • chapter, as well as the

  • intersection with the other

  • sciences.

  • This is work partially based on

  • a chapter that we wrote

  • on Pearl's hierarchy and the

  • foundations of causal inference,

  • joint work with Juan Correa,

  • my student at Columbia, and

  • Duligur Ibeling and Thomas Icard,

  • collaborators from

  • Stanford University.

  • This is the link here to the

  • chapter.

  • Take a look, because most of the

  • things I'm talking about here are

  • in there in some shape or form.

  • Then I'll move to another result

  • that is called the causal

  • hierarchy theorem, which was

  • proven in that chapter and settles a

  • 20-plus-year-old

  • open question, and is used as one of

  • the main building blocks.

  • And then I'll try to connect

  • this with machine learning, more

  • specifically supervised

  • and causal learning: how it

  • fits with the

  • layers of the causal hierarchy,

  • also called the ladder of causation

  • in the book.

  • Then I'll move on to talk a little

  • bit about causal inference and

  • cross-layer inferences.

  • I would then move to the design

  • of artificial intelligence

  • systems

  • with causal capabilities.

  • I will come back to machine

  • learning methods, in particular

  • deep learning and RL. In terms of

  • perspective, my focus here

  • will be more about my goals: to

  • introduce the ideas, principles

  • and some tasks.

  • I will not focus on

  • implementation details.

  • Also I should mention that this is

  • essentially the

  • outline of the course I'm teaching this

  • semester at Columbia,

  • so bear with me; I'll try to give

  • you the idea, and if you're

  • interested to learn more, check

  • the references or send me a message.

  • Now without further ado, let me

  • introduce here the idea of what

  • is a causal model, structural

  • causal model.

  • And we will take a process-based

  • approach to

  • causality.

  • The idea is borrowed from

  • physics, chemistry, sometimes

  • economics, and other fields that

  • posit a collection of mechanisms

  • underlying some phenomenon

  • that we're theorizing about. In this

  • case, suppose you're trying to

  • understand the effect of taking

  • some drug on a headache.

  • Those are observable variables,

  • and we have the corresponding

  • mechanisms here: f sub D for the

  • variable drug and f sub H for the

  • variable headache.

  • Each mechanism takes as input,

  • has as arguments, a set of

  • observables, in the case of f sub D

  • the variable age, and unobservables, in

  • this case U sub D.

  • U sub D stands for all

  • variables in the universe that

  • generate variation in drug,

  • other than age; they can all be included

  • in U sub D. And the same

  • here would be U sub H: drug and

  • age are the observable arguments of f

  • sub H, and U sub H collects all

  • variables in the universe that

  • are not drug and age and that determine

  • whether someone would have or would not have a

  • headache.

  • In the real process, you

  • usually have a possibly

  • complicated function here, f sub

  • D, f sub H, which is not

  • instantiated.

  • Usually we have some type of

  • coarser description, and this is the causal

  • graph related to this collection

  • of mechanisms.

  • The causal graph is nothing but a

  • partial specification of the

  • system, in which the arrows here

  • just mean that some variable

  • participates in the mechanism

  • of the other.

  • I just put X, Y, Z here to make the

  • communication easier.

  • Now we have, for example, age

  • participates in the mechanism

  • f sub H, and that gives the arrow from

  • Z to Y.

  • The same with drug:

  • this is the arrow from X to Y, and

  • likewise age participates in

  • f sub D, giving the arrow from Z to X.

  • Note here that in the graph we don't

  • have the particular

  • instantiation of the functions;

  • we're just preserving the

  • arguments that enter them.

  • Now, for sure we can try -- this

  • is the process that is kind of

  • unfolding in time.

  • We can sample from a process

  • like that.

  • This gives rise to our

  • distribution, observational and

  • nonexperimental distribution

  • over the observables, P of X, Z and Y in

  • this case.

  • Usually when you're doing

  • machine learning, supervised

  • learning or unsupervised

  • learning, we're operating on

  • this side here of the equation.

  • Here we are trying to understand

  • causality; it's about when you

  • go to the system and you change

  • something or you overwrite,

  • overwrite as we computer scientists

  • like to say, some

  • function.

  • Here we would like to overwrite

  • the equation, the natural way

  • people take drugs, with

  • drug is equal to yes. This

  • is related to the do-operator, in

  • which you overwrite the

  • original mechanism, f sub D, in

  • this case with do X is equal to

  • yes. Now we no longer have the

  • original equation; you have a

  • constant here.

  • You could have other kinds of

  • interventions as well, but

  • we don't have time on these

  • slides. That's what we have.

  • This is semantics without

  • necessarily having access to the

  • mechanisms themselves.

  • This is the meaning of the

  • operation.

  • Now, here is the graphical

  • counterpart of that.

  • Note that f sub D here

  • no longer has age as an

  • argument of the function;

  • there's just the constant, you

  • put the constant here, and in

  • the graph we cut the

  • arrows coming into X.

  • This is the mutilated graph.

  • Again, if we're able to contrive

  • reality in this way, you can

  • sample from this

  • process, which gives

  • rise to the distribution called the

  • interventional distribution or

  • experimental distribution, P of

  • Z, Y given do(X) is equal to yes.
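
A minimal sketch, not from the talk, of a toy structural causal model for the drug-headache example and of how the do-operator overwrites the drug mechanism; the particular functions and probabilities below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n, do_drug=None):
    """Sample from a toy SCM: age (Z) -> drug (X) -> headache (Y), with Z -> Y as well.

    If do_drug is set, the drug mechanism is overwritten by a constant,
    i.e. the do-operator / mutilated graph.
    """
    u_z, u_x, u_y = rng.random(n), rng.random(n), rng.random(n)    # exogenous variables
    age = (u_z > 0.5).astype(int)                                  # Z := f_Z(U_Z)
    if do_drug is None:
        drug = ((0.2 + 0.6 * age) > u_x).astype(int)               # X := f_X(Z, U_X)
    else:
        drug = np.full(n, do_drug)                                 # X := constant (intervention)
    headache = ((0.6 - 0.3 * drug + 0.2 * age) > u_y).astype(int)  # Y := f_Y(X, Z, U_Y)
    return age, drug, headache

# Observational P(Y=1 | X=1) and interventional P(Y=1 | do(X=1)) differ, since age confounds.
z, x, y = sample_scm(100_000)
print(y[x == 1].mean())          # conditional quantity, layer one
z, x, y = sample_scm(100_000, do_drug=1)
print(y.mean())                  # interventional quantity, layer two
```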

  • I use these variables here XZY

  • but X would be any decision, Y

  • can be any outcome, Z any set of

  • covariates or features.

  • Now, what is the challenge here?

  • The challenge is that in reality

  • this upper part here is almost

  • never observed.

  • This is usually called

  • unobserved.

  • This is why I put it in gray.

  • So this is one of the things

  • we don't have in practice,

  • or only very rarely. Another

  • challenge is that usually the

  • data that we have is coming

  • from the left side, that is,

  • coming from the naturally

  • unfolding process, how the system is

  • naturally evolving, and we would

  • like to understand what's the

  • effect if you go there and do

  • things, do an intervention on

  • the system, of our own will

  • or deliberately as a

  • policymaker, decision-maker,

  • setting this variable to yes.

  • We have data from the left

  • and want to do inference about

  • what would happen if you do

  • something to this system.

  • Now, we can try to generalize

  • this idea and define what is the

  • structural causal model.

  • This is the definition from Pearl's

  • Causality book from 2000.

  • I won't go through the definition

  • step-by-step, but suffice to say

  • you have the observables or

  • endogenous variables, like age,

  • drug or headache,

  • and the exogenous, the unobserved

  • variables, which could be U sub D

  • and U sub H that we had before, and

  • we'll have a collection of

  • mechanisms, one for each of these

  • observed variables,

  • mechanism f sub D or f sub H.

  • The exogenous variables could be seen as some

  • type of boundary condition, as in physics, to

  • summarize the conditions outside

  • the system,

  • and we sprinkle probability

  • mass over them:

  • we have this probability P of U

  • over the exogenous variables.

  • Now, we understand very well how

  • these systems work. There's

  • awesome work by Halpern

  • at Cornell, and Galles and

  • Pearl,

  • that gives this type of understanding

  • of these types of systems.

  • And today we're interested in a

  • different result that is the

  • following.

  • Once we have an SCM, a structural

  • causal model M that's fixed, a

  • particular environment or setting

  • in which the particular agent operates, this

  • induces the Pearl Causal

  • Hierarchy, or PCH, that is

  • called the ladder of causation

  • in the Book of Why.

  • Let's try to understand that.

  • Here's the PCH.

  • Now, there are different layers of the

  • hierarchy.

  • This is the first layer, which is

  • called the associational layer,

  • the activity of seeing: how

  • would seeing some variable X

  • change my belief in the

  • variable Y; what does a symptom

  • tell us about the disease.

  • Syntactically it's written as

  • P of Y given X, and people

  • may ask about this layer because it

  • is very related to machine

  • learning, supervised and

  • unsupervised learning.

  • There are different types of models

  • there.

  • Bayesian networks are one type of model

  • there. You have decision trees.

  • You have support vector machines,

  • and deep neural networks and

  • different types of neural

  • networks. They live in this

  • layer here.

  • Quite importantly, we're able to scale

  • up inferences: given this X,

  • which could be the pixels, the set of

  • features could be on the order of

  • thousands, even millions, and we

  • try to predict some

  • label Y, say, given pixels, whether

  • it's a cat or not.

  • And it's kind of a classic problem and

  • it's very hard; we're kind of

  • mastering it, understanding

  • pretty well how to do that, with

  • recent breakthroughs in the

  • field in the last 20 years, I

  • should say.

  • Now I have a qualitatively

  • different layer, layer two,

  • interventional.

  • It's related to the activity of

  • doing: what if I do action X,

  • what if I take the aspirin, will

  • my headache be cured.

  • The counterpart in machine

  • learning would be reinforcement

  • learning.

  • You have causal Bayesian

  • networks and Markov decision processes,

  • partially observable MDPs, and so

  • on.

  • Quite important; I'll tell you

  • more about that.

  • Symbolically, you say P of Y

  • given do(X), comma C.

  • That's the notation that you

  • have.

  • Now I have a qualitatively

  • different layer, layer

  • three, which is the counterfactual

  • layer. I'll go back here soon,

  • but it's related to the activity of

  • imagination: agents having

  • imagination, retrospection,

  • introspection, and

  • responsibility, credit

  • assignment.

  • It is the layer that gave the

  • name to the Book of Why.

  • This is the why type of

  • question.

  • What if I had acted differently;

  • was it the aspirin that stopped

  • my headache.

  • Syntactically, we have this

  • common nested counterfactual

  • here.

  • I took the drug, that is x prime,

  • an instantiation of the big X,

  • pardon my license here.

  • X prime: I took the drug, and I'm

  • cured.

  • That is y prime.

  • Now, you can ask: would I have

  • had the headache, that is y,

  • the opposite of y prime, had I

  • not taken the drug; that is the

  • x that is the opposite of x

  • prime.

  • I took the drug, I'm good,

  • x prime and y prime in the actual

  • world,

  • in this world.

  • And I ask: what if I hadn't

  • taken the drug, that is x?

  • Would I be okay, that

  • is y, or not okay, that

  • is not y?

  • And there's no exact counterpart

  • in machine learning; if

  • you have some particular

  • instance in mind you can ask me

  • offline, but for all the kinds of

  • things written in the

  • literature, this comes from the

  • structural causal model.

  • Now I would like to see what

  • goes beyond machine learning.

  • I just mentioned this layer

  • three here.

  • Specifically, I'd like to

  • highlight a different family of

  • inferential tasks,

  • which fall very naturally in

  • causality, called cross-layer types

  • of inference, as I'm showing

  • here.

  • Layer one is related to the input: suppose

  • as input you have some data here,

  • and most of the available data

  • today is observational.

  • These numbers are loosely collected, but say

  • 99 percent of the data we

  • have is coming from layer one,

  • and someone can complain to me about

  • the exact numbers, but

  • 90 to 99 percent of the inferences

  • that we're interested in today are

  • about doing, layer two, or layer

  • three, about counterfactuals,

  • and about policies, treatments

  • and decisions, just to cite a

  • few examples.

  • Then the research question that we're

  • trying to answer here, across

  • layers, given the data we have and

  • the inference one would like to do,

  • is how to use the data collected

  • from observations, passively,

  • that's layer one,

  • maybe coming from the hospital,

  • to answer questions about

  • interventions, that is, layer

  • two.

  • And under what conditions can we

  • do that?

  • Why is this task different is

  • usually a good question.

  • Why is the causal problem

  • nontrivial?

  • The answer is that the SCM is

  • almost never observed, but for a

  • few exceptions in fields

  • such as physics,

  • chemistry, and

  • biology sometimes,

  • in which the very target

  • is to learn about this

  • collection of mechanisms; in general

  • we do not observe it.

  • In most of the fields we in AI and

  • machine learning are

  • interested in, there's a

  • human in the loop,

  • some type of interaction where,

  • given that we cannot

  • read minds and we don't isolate

  • the environment in some kind of

  • precise way,

  • you don't have a controlled

  • environment, and usually you cannot

  • get that kind of help.

  • But still, the observation here is

  • that this

  • collection of mechanisms

  • underlying the system that we're

  • trying to understand does exist: it's out

  • there, inducing the PCH, and

  • you could still have the query

  • or the data task, the cross-

  • layer task: how can you get

  • from data, data that is

  • just a fragment that we have

  • of the SCM, let's say

  • layer one, observational data,

  • to answering a question

  • from layer two. You have an

  • observed phenomenon and you're

  • trying to reason from the observed fragments

  • to what is at least realizable.

  • That could be layer three as

  • well.

  • How can you move across these

  • layers?

  • Like the allegory I use in class,

  • where I spend some time, I like the

  • metaphor here: there's a

  • complicated reality, we just

  • observe fragments or shadows

  • of the fragments of the PCH, and we do

  • an inference about the outside

  • world; under what conditions can we

  • do that.

  • That's kind of the flavor: the

  • consequences of these mechanisms

  • could be at the other layers,

  • layer two or three. For example,

  • I'd like to talk about the

  • (im)possibility results for these

  • cross-layer inferences. As

  • usual, let me read the task

  • here.

  • Infer the causal quantity P of Y given

  • do(X), from layer two, from

  • observational data, that is layer one.

  • That's the task that I just

  • showed.

  • Now, the effect of X on Y is

  • not identifiable

  • from the observed data: one can prove

  • there exist collections of

  • mechanisms, or SCMs, capable of

  • generating the same observed

  • behavior in layer one, P of X and

  • Y, while disagreeing with respect

  • to the causal query.

  • To witness, we show two models.

  • This is model one.

  • This is model two.

  • They generate the same

  • observed behavior.

  • This is for you to go home and

  • think about a little bit, but they're simple

  • models; this is XOR, by the

  • way,

  • not X, XOR.

  • These models generate the same

  • observed distributions, model one P1, model two P2,

  • the same observed behavior in

  • layer one; however, they

  • generate different layer two

  • behaviors,

  • different layer two predictions.

  • In this case, model one

  • says the layer two

  • probability of Y given do(X = 1) is

  • equal to one half, while model

  • two is saying it is one.

  • In other words, given

  • layer one alone,

  • what can we say about layer two?

  • There's not enough information

  • there to move across.

  • That's the result.
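
A sketch, consistent with the numbers quoted from the slide but not necessarily the exact pair of models used in the talk, of two SCMs that agree on layer one yet disagree on the layer-two query P(Y = 1 | do(X = 1)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.integers(0, 2, n)          # unobserved fair coin shared by X and Y

def model1(x=None):
    x = u if x is None else np.full(n, x)
    y = 1 - (x ^ u)                # Y := 1 XOR (X XOR U), i.e. Y = 1 iff X equals U
    return x, y

def model2(x=None):
    x = u if x is None else np.full(n, x)
    y = np.ones(n, dtype=int)      # Y := 1 regardless of X
    return x, y

# Same observational (layer one) behavior: X is a fair coin and Y is always 1 ...
for m in (model1, model2):
    x, y = m()
    print(x.mean(), y.mean())
# ... but they disagree on the interventional (layer two) query P(Y = 1 | do(X = 1)).
for m in (model1, model2):
    x, y = m(x=1)
    print(y.mean())                # ~0.5 for model one, 1.0 for model two
```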

  • I would like now to make a

  • broader statement and generalize

  • this idea.

  • Again, in this great work with

  • Correa, Ibeling and Icard, from

  • the paper I mentioned earlier, we

  • proved the following

  • theorem.

  • With respect to a measure over

  • SCMs, under some

  • technical conditions,

  • the subset on which the PCH collapses

  • is measure zero.

  • Let me read the informal version

  • here.

  • You can go home and try to

  • parse that. But informally, for

  • almost any SCM, in other words,

  • almost any possible environment in

  • which your agent or your system

  • is embedded, the PCH doesn't

  • collapse.

  • In other words, the layers of

  • the hierarchy remain distinct.

  • In other words, you have this

  • hierarchy here, and it will not happen

  • that one lower layer

  • determines the others.

  • There's more knowledge in layer

  • two than in layer one alone.

  • There's more knowledge in layer

  • three than in layer one and layer

  • two.

  • So the situation where one layer

  • determines the others is one you

  • don't get.

  • This settled an open problem

  • stated in the Book of Why

  • by Pearl in Chapter 1, which

  • says: to answer questions at layer i,

  • for example

  • layer two about interventions,

  • one needs knowledge at layer i,

  • here two, or above.

  • Now, the natural question here

  • that you could be asking is

  • like, Eli, how after all are

  • causal inferences possible, or

  • how are causal inferences

  • possible at all?

  • Does this result

  • mean you shouldn't do

  • causal inference at all, given that

  • you don't have this type of

  • determination from one layer to

  • another? And the answer is not

  • at all.

  • The idea here, this motivates

  • the following observation.

  • If you know zero about the

  • SCM, the CHT, the causal

  • hierarchy theorem, is what you get.

  • If you know a little bit about the SCM,

  • it may be possible.

  • What is this little bit? It's

  • what you could call structural

  • constraints, which you could

  • have encoded in a graphical

  • model. There are different models here:

  • you can have graphical models at

  • layer one, layer two and so on.

  • And then in principle it could

  • be possible to move across

  • layers, depending on how you

  • encode the constraints here,

  • as families of graphical models.

  • I'd like to examine for just one

  • minute the layer one graphical

  • model here that is very popular,

  • namely a Bayesian network,

  • that's layer one, versus a causal

  • Bayesian network.

  • Not all graphical models are

  • created equal.

  • This is the same task from the

  • previous theorem. It was shown

  • that it's impossible to move

  • from layer one data to a layer two

  • type of statement.

  • Now what if you have a Bayes net

  • that's compatible with the data?

  • Here is a Bayes net that

  • is compatible with the data:

  • X pointing to Y, for whatever data

  • we get over X, Y.

  • And we would like to know what's

  • the layer two quantity, P of Y

  • given do(X), in this case.

  • If you play a little bit, or if

  • you know a little bit of

  • causality: there's no unobserved

  • confounder here in this graph,

  • so P of Y given do(X) is equal

  • to P of Y given X,

  • by ignorability or back-door

  • admissibility; those are the names

  • we use to say there is no unobserved

  • confounder.

  • Now I pick another BN, another

  • layer one object, fit to the data, with the

  • arrow not from X to Y but from Y to X,

  • and see what would be the causal

  • effect of X on Y. In this case

  • the graph Y to X is still

  • compatible with the data.

  • It turns out that by the semantics of

  • causal intervention, the do,

  • you'll be cutting the arrow here

  • that is coming into X,

  • because we're the ones

  • controlling this variable, which gives

  • P of Y given do(X) equal to P of

  • Y.

  • Then this here highlights that

  • the two networks give different answers:

  • there's not enough

  • information about the underlying

  • SCM in a BN

  • so as to allow causal inference.

  • That is to say, this is not good; the

  • constraints should be coming from

  • the SCM, and this is why a layer one

  • object is not good enough.

  • This is not the object we're

  • looking for.

  • Now I would like to consider a

  • second object that is a layer

  • two kind of graphical model.

  • You can go to the paper where it is

  • defined more precisely, and I won't

  • do that here, but it is possible to

  • encode layer two constraints

  • coming from the SCM,

  • the idea of asymmetry of causal

  • relations, and we'd like to focus

  • on this one now.

  • Now the idea is that there are

  • positive instances where we can do

  • cross-layer inferences. Let's

  • consider a graphical model, the

  • true graphical model.

  • Remember, the mental picture I'd

  • like you to construct is the

  • following.

  • Suppose that this is the whole

  • space of all structural models.

  • Here are the models compatible

  • with the graph G,

  • the true graphical model.

  • These are the SCMs

  • compatible with P of V, that could

  • generate this observed

  • distribution. And here, in the

  • intersection of

  • these sets, are the models

  • that give the same P of Y given

  • do(X).

  • What I'm saying, in reality, is

  • that there are situations where,

  • for any pair of structural models

  • encoding this unobserved nature,

  • let's call them nature N1 and N2, such

  • that they have the same graph

  • G, G of N1 equal to G of N2,

  • if they generate the same P of

  • V, the same observed

  • distribution, then they will

  • generate the same causal

  • distribution. That's the notion

  • of identifiability.

  • It is possible to get it in some

  • settings.

  • Now let me try to summarize what

  • I've said so far,

  • about some sort of separation

  • between reality, that

  • is, the underlying

  • mechanisms that we don't

  • have, and our model of reality,

  • which will be a graphical model,

  • for example, or could be something else, and

  • the data.

  • We started from the well-

  • defined world, semantically

  • speaking, in which an SCM, a pair

  • F and P of U, mechanisms and a

  • distribution over the exogenous variables,

  • implies the PCH,

  • which represents different aspects of

  • the underlying nature and types of

  • behavior:

  • layers one, two, three.

  • We do acknowledge that the

  • collection of mechanisms is

  • there, but inference is limited

  • given that the SCM is almost never

  • observable or observed; due to

  • the CHT, we have this constraint

  • about how to move across the

  • layers.

  • Now we'll move towards scenarios

  • in which partial knowledge of the

  • SCM is available, that is, such a

  • causal graph, a layer two causal

  • graph.

  • Causal inference theory helps us

  • determine whether the causal

  • target, the targeted inference,

  • is allowed.

  • In the prior example the

  • inference is from layer one to

  • layer two.

  • Namely, trying to understand if

  • the graph plus P of V, that is, the

  • layer one distribution, allows us

  • to answer P of Y given do(X).

  • One observation here: sometimes this

  • is not possible.

  • I mean, for weak models, if you

  • have a weak model, the mental

  • picture here is like this: sometimes

  • the true model generates this

  • green guy here, this

  • distribution.

  • There's another model that has

  • the same graph G.

  • It can induce the same

  • observational distribution but

  • generate a different quantity, call it P

  • star of Y given do(X).

  • And then we're in a situation where

  • we cannot do the inference about

  • layer two just from layer one

  • data.

  • Now, I'd like to spend two

  • minutes just doing a summary of

  • how reinforcement

  • learning fits into this picture.

  • I spent three hours last week at

  • ICML talking about that; go to

  • crl.causalai.net if you want the

  • details. I'll give you in two

  • minutes what happened there.

  • This is the PCH. Now my comment

  • is that typical RL is usually

  • confined to layer two, or a subset

  • of layer two, and usually you

  • cannot move from layer one,

  • cannot leverage the data that is

  • from layer one, or only very rarely.

  • And this RL doesn't support us in

  • making statements about

  • counterfactuals, the layer three

  • type of counterfactuals.

  • That's the global picture.

  • This is the kind of canonical

  • picture of RL.

  • You have an agent that's

  • embedded in the environment.

  • The agent is a collection of

  • parameters.

  • The agent observes some kind of

  • state and commits to an action

  • and observes a reward.

  • There's a lot of discussion

  • about model-based versus model-

  • free.

  • I'd like to say that the model-

  • based methods mentioned today in the

  • literature are not causal model-

  • based. Important

  • not to get confused.

  • You can ask me more later.

  • The only difference in the causal

  • reinforcement learning

  • perspective is what

  • we'll leverage. And I spent

  • almost three hours

  • discussing that in the tutorial.

  • Now, officially, the

  • collection of mechanisms that we

  • just studied, the structural

  • causal model, would be the model

  • of the environment, officially,

  • and on the agent side you have the

  • graph G.

  • Now, the two key observations:

  • the environment and the agent

  • would be tied to the pair of SCM

  • on the environment side

  • and causal

  • graph on the agent side, which will

  • define different types of

  • actions or interactions

  • following the PCH, which means

  • that observing, experimenting

  • and imagining would be these

  • different modes.

  • Please check crl.causalai.net

  • for more details there.

  • And this one, we can check

  • later, talks about different

  • types of tasks that we weren't

  • acknowledging before.

  • I'd like to move quickly and spend

  • 30 seconds discussing how

  • deep learning fits into this

  • picture.

  • Here's the same picture I had

  • before, from

  • about ten slides ago: on the left side the

  • observational world and on the right

  • side the interventional world.

  • Now, this is about

  • reality and model. This is an

  • abstraction: in reality we have

  • data, and you can sample from

  • the data. And this allows us to

  • get the hat distribution, the P

  • hat, and we have results saying that

  • the distance between the hat

  • distribution and the original

  • distribution keeps decreasing,

  • which makes it sensible to operate in

  • terms of the hat distribution.

  • Now, for sure you can use some kind

  • of formalism to try to learn the

  • hat distribution, including a

  • deep network variation of that.

  • Now, the challenge is that usually we're

  • interested in inference on

  • the right side, and you have zero

  • data points on the right side.

  • I'm talking broadly, not about

  • reinforcement learning;

  • reinforcement learning has its

  • own problems.

  • But you have zero here.

  • Now how on earth can you learn

  • about the hat distribution there?

  • Some people simply connect the

  • DNN that they

  • learned from the left side to

  • the right side.

  • But if you put it like that,

  • there's nothing in the data, in

  • this data, nor in the deep net,

  • that takes into account the

  • structural constraints that we

  • discussed, nor the CHT.

  • It makes no sense to connect them.

  • There's something missing there.

  • I could talk one hour, you

  • invite me to talk about neural

  • nets and causal inference, but

  • this is the picture I want to

  • start the conversation.

  • I would like to conclude and

  • apologies for the short time.

  • It's like very short talk, and

  • thanks for the opportunity.

  • Now, let me conclude. Causal

  • inference and AI are fundamentally

  • intertwined; novel

  • learning opportunities emerge

  • when this connection is fully

  • understood.

  • Most of the approaches to general

  • AI today are orthogonal to the

  • causal machinery currently

  • available. And we're not even

  • touching these problems in the

  • approaches to general AI,

  • including deep learning and the

  • huge discussions we're having in

  • reinforcement learning.

  • In practice, failure to

  • acknowledge the distinct features

  • of causality almost always leads

  • to poor decision-making and

  • superficial types of

  • explanations.

  • The broader agenda here, which we've been

  • pursuing for almost 10 years now,

  • is developing a framework,

  • principled algorithms and tools

  • for designing causally sensible

  • AI systems, integrating the three

  • PCH layers, observational,

  • interventional and

  • counterfactual, in terms of data,

  • modes of reasoning and

  • knowledge.

  • And my belief, strong belief, is

  • that this will lead to a natural

  • treatment of human-like

  • explainability, given that we're

  • causal machines, and of rational

  • decision-making.

  • I would like to thank you for

  • listening. This is joint

  • work with my collaborators at the causal AI lab at

  • Columbia and elsewhere;

  • thanks Juan, Sanghack, Kai-Zhan,

  • Judea, Andrew, Duligur and

  • Thomas, and all the others, it's

  • a huge effort. Thanks. I'll be

  • glad to take questions.

  • CHENG ZHANG: Hello, everyone.

  • I'm Cheng Zhang from Microsoft

  • Research UK. Today I'm going to

  • talk about a causal view on the

  • robustness of neural networks.

  • Deep learning has been very

  • successful in many applications.

  • However, it's also vulnerable.

  • So let's take our favorite

  • handwritten digit classification task, for

  • example.

  • Deep learning can achieve

  • 99 percent accuracy on it.

  • This is impressive.

  • However, if we just shift the image

  • a little bit, not much, with the shift

  • range no more than 10

  • percent,

  • the accuracy will drop to

  • around 85 percent, which is

  • already not satisfying for an

  • application.

  • If we enlarge the shift range

  • to 20 percent, the accuracy will

  • drop to half, which is not

  • acceptable anymore.

  • The plot shows that the more we

  • shift, the worse the

  • performance.

  • This is not desired,

  • especially with minor shifts.

  • Okay.

  • Now we would like to be robust.

  • Let's vertically shift the images in

  • the training set as well.

  • This is a type of adversarial training

  • setting.

  • We shift images up to

  • 50 percent in the training

  • set.

  • You can see that the

  • performance is much better, with

  • about 95 percent accuracy even

  • when we shift up to 50 percent.

  • But have we solved the problem? What if we did not know that the test-time manipulation would be a vertical shift, and it could instead have been a horizontal shift?

  • Suppose, for example, we use horizontal shifts in the training data and then, at testing time, we test images with vertical shifts as before. The curve shows the performance under vertical shifts: it is actually even worse than training with clean data only.

  • So adversarial training does not solve the robustness problem in deep learning, because it can even harm robustness to unseen manipulated images — and we will never know all the possible attacks in advance.

  • This is a real issue in deep learning. And this is a simple task — recognizing digits. What about healthcare or policymaking, where decision quality is critical?

  • But humans are very good at this task. We can still recognize the digit if it is shifted a little or if the background changes, because we are very good at causal reasoning.

  • We know that a shift or a background change does not change the digit, nor does it turn a cat into a dog. This modular property of causal reasoning is also sometimes referred to as the independent mechanism assumption.

  • So the causal relationships in the previous example can be summarized in this way. The final observation is an effect of three types of causes: one is the digit itself, another is the writing style and similar features, and the last is the different manipulations such as shift or rotation.

  • The same applies to the other example: the observation of a cat is caused by whether it really is a cat, by its fur, color and other features, and by the different environments, such as different viewpoints and backgrounds.

  • We use Y here to denote the target of the task, Z to denote the factors that cannot be manipulated, and M to denote the factors that can be manipulated. So M is the factor we would like to be robust to.

  • Before diving into the robustness details,

  • let's first review what counts as a valid attack.

  • We have seen shifts of the digit, and background changes for the cat. Another common attack is to add a small amount of noise, as we see here: by adding a tiny amount of noise, we can make a deep learning model classify the image as something completely different.

  • We can also rotate the image, and sometimes even add stickers. It has also been pointed out that noise can fool humans — the left image looks more like a dog than a cat to me.

  • The question is: is this still a valid attack? What type of change, and how much change, should we consider to constitute a valid attack? We would like to define a valid attack through a causal lens.

  • Let's take the previous example from a causal view. We can see a valid attack as being generated from an intervention on M; together with the original Y and Z, it produces the manipulated data X.

  • In general, a valid attack should not change the underlying Y, because Y is the target. Thus we cannot intervene on the target Y, nor on the parents of Y if Y has parents.

  • And Z cannot be intervened on by our definition — for example, the genetic features of a cat, or the writing style of the digit itself.

  • In this regard, recent adversarial attacks can be considered as specific types of interventions on M, such as adding noise to or otherwise manipulating the image. In this way, the learned predictor remains valid.

  • So the goal of robust deep learning is to be robust to both known manipulations and unknown manipulations.

  • Adversarial training can help with the known manipulations, but not with the unknown ones. Our question is how to make predictions that can adapt to potentially unknown manipulations, as in the shifted-digit example.

  • In this work we propose a model named the deep causal manipulation augmented model — we call it Deep CAMA. The idea is to create a deep learning model that is consistent with the underlying causal process.

  • In this work, we assume that the causal relationships among the variables of interest are provided. Deep CAMA is a deep generative model.

  • Let's quickly recall a deep generative model, the variational auto-encoder. The variational auto-encoder bridges deep learning and probabilistic modeling and has been successful in many applications. The graphical model is shown on the left.

  • From a probabilistic modeling point of view, we can write down the model as the factorization shown on the right-hand side of this equation.

  • We learn the posterior using variational inference. In particular, we introduce a variational distribution Q and try to minimize the divergence between P and Q.

  • We can follow the standard steps and form the evidence lower bound — we call it the ELBO — and optimize the evidence lower bound to obtain the posterior estimate.

  • Different from traditional probabilistic modeling, every link in the graphical model on the left is a deep neural network. This becomes an auto-encoder, where we try to reconstruct X through a stochastic latent variable.

  • It can be trained in a standard deep learning framework, with the loss given by the evidence lower bound we just showed.
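
  • As a rough illustration only — not the implementation used in the talk — a minimal variational auto-encoder with this ELBO objective could look like the following PyTorch-style sketch (all module and variable names are assumptions):

```python
# Minimal VAE sketch (illustrative): encoder q(z|x), decoder p(x|z),
# trained by maximizing the evidence lower bound (ELBO).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        nll = F.binary_cross_entropy_with_logits(self.dec(z), x,
                                                 reduction="sum")  # -log p(x|z)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q||p)
        return -(nll + kl)  # ELBO; the training loss is its negative
```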

  • So CAMA is also a deep generative model. But instead of the simple factorization shown on the left, our model is factorized in a causally consistent way, as you can see on the right-hand side. The model is consistent with the causal relationships we saw before.

  • Next, let's see how we perform inference. Suppose we only have the clean dataset — that is, a dataset without any augmentation or adversarial examples. From a causal lens, this is the same as do(M = clean).

  • Now we translate this to the Deep CAMA model. We can use the value zero to indicate clean data, so we set M to 0 and treat it as observed.

  • We then only need to infer the latent variable Z. Instead of conditioning only on X, as the encoder of a traditional VAE does, in CAMA the variational distribution conditions on X, Y and M together.

  • We follow the same procedure and form the evidence lower bound, the ELBO, shown below. Since M is a root node, the do operation on M can be written as conditioning.
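
  • To make this concrete, here is a hedged sketch of what a causally factorized Deep-CAMA-style model and its clean-data ELBO could look like, with M clamped to zero (do(M = 0)) and only Z inferred; the names are illustrative, not the authors' code:

```python
# Sketch of a causally factorized model p(x|y,z,m) p(y) p(z) p(m).
# With only clean data we set do(M = 0), treat M as observed, and infer Z.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CleanCAMA(nn.Module):
    def __init__(self, x_dim, y_dim, z_dim, m_dim, h=256):
        super().__init__()
        # decoder p(x | y, z, m): the causal mechanism from the parents into X
        self.dec = nn.Sequential(nn.Linear(y_dim + z_dim + m_dim, h), nn.ReLU(),
                                 nn.Linear(h, x_dim))
        # encoder q(z | x, y, m): conditions on x, y and the observed m
        self.enc = nn.Sequential(nn.Linear(x_dim + y_dim + m_dim, h), nn.ReLU())
        self.z_mu = nn.Linear(h, z_dim)
        self.z_logvar = nn.Linear(h, z_dim)
        self.m_dim = m_dim

    def elbo_clean(self, x, y_onehot):
        m = torch.zeros(x.size(0), self.m_dim)          # do(M = 0): clean data
        h = self.enc(torch.cat([x, y_onehot, m], dim=-1))
        mu, logvar = self.z_mu(h), self.z_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        x_logits = self.dec(torch.cat([y_onehot, z, m], dim=-1))
        recon = -F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon - kl   # ELBO under do(M = 0), up to constants
```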

  • In an adversarial-training-like setting, we may also have manipulated data in the training set. In that case we may not know the manipulation, so we treat M as a latent variable, and we need to infer both M and Z.

  • Thus we have a variational distribution q over Z and M conditioned on X and Y, and we can write the evidence lower bound in the form shown here.

  • Finally, with both clean and manipulated data in the training set, the final loss is a combined form with the corresponding losses for the clean data and the manipulated data, as shown before.

  • Here D is the clean subset of the data and D prime is the subset of the data that has been manipulated. This is the adversarial training setting using CAMA. In this way, CAMA can be used either with only clean data or with clean and manipulated data together.
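
  • A hedged sketch of how this combined objective could be assembled is below: the clean subset uses the do(M = 0) ELBO above, while the manipulated subset treats M as latent and infers it alongside Z (the q_m, q_z, log_px and elbo_clean helpers are assumptions for the sketch):

```python
# Illustrative combined objective: ELBO with observed M = 0 on the clean
# subset D, plus an ELBO with latent M (inferred jointly with Z) on the
# manipulated subset D'.
import torch

def gaussian_kl(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) )
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

def elbo_manipulated(model, x, y_onehot):
    # q(m | x, y): infer the unknown manipulation as a latent variable
    m_mu, m_logvar = model.q_m(x, y_onehot)                  # assumed helper
    m = m_mu + torch.randn_like(m_mu) * torch.exp(0.5 * m_logvar)
    # q(z | x, y, m): infer z given the sampled manipulation
    z_mu, z_logvar = model.q_z(x, y_onehot, m)               # assumed helper
    z = z_mu + torch.randn_like(z_mu) * torch.exp(0.5 * z_logvar)
    recon = model.log_px(x, y_onehot, z, m)                  # log p(x | y, z, m)
    return recon - gaussian_kl(z_mu, z_logvar) - gaussian_kl(m_mu, m_logvar)

def total_loss(model, clean_batch, manipulated_batch):
    x_c, y_c = clean_batch
    x_m, y_m = manipulated_batch
    return -(model.elbo_clean(x_c, y_c) + elbo_manipulated(model, x_m, y_m))
```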

  • The final neural network architecture is shown here: the encoder and decoder are on the right. The decoder network corresponds to the solid arrows in the graphical model on the left, and the encoder network corresponds to the dashed lines. The inference network lets us compute the posterior distribution over M and Z.

  • At test time, we would like the model to be robust to unseen manipulations, and we want to learn them at test time.

  • We keep fixed the part of the network representing the generative process from Y and Z to X, and we fine-tune the network to adapt to the new M and to how M influences X. In this way the network can learn a new, unseen manipulation.

  • At test time the label is not known, so we do not know Y. We therefore marginalize over Y and optimize the fine-tuning loss to adapt to the unseen manipulation.

  • For prediction, we use Bayes' rule to obtain the posterior over Y, as shown here.

  • We can see that CAMA is designed in a causally consistent way: we can efficiently train the model following a procedure similar to the variational auto-encoder, and we can also fine-tune the model at test time to unseen manipulations and then make predictions.
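
  • A hedged sketch of this test-time adaptation: freeze the mechanism from Y and Z to X, update only the M-related parameters, marginalize over the unknown label, and predict with Bayes' rule (yz_decoder_params and per_example_elbo are assumed helpers, not the authors' API):

```python
# Test-time fine-tuning sketch: keep the stable Y,Z -> X mechanism fixed,
# adapt only the parts modeling M and its effect on X, and marginalize over Y.
import torch
import torch.nn.functional as F

def finetune_on_test(model, x_test, n_classes, optimizer, steps=100):
    # optimizer is assumed to hold only the M-related parameters
    for p in model.yz_decoder_params():     # assumed helper: Y,Z -> X weights
        p.requires_grad_(False)             # fix the stable causal mechanism
    for _ in range(steps):
        loss = 0.0
        for y in range(n_classes):          # labels unknown: marginalize over Y
            y_onehot = F.one_hot(torch.full((x_test.size(0),), y),
                                 n_classes).float()
            loss = loss - model.per_example_elbo(x_test, y_onehot).mean()
        optimizer.zero_grad()
        (loss / n_classes).backward()
        optimizer.step()

def predict(model, x, n_classes):
    # Bayes-rule prediction: p(y|x) is proportional to p(x, y); score each
    # candidate class by its per-example ELBO and take the argmax.
    scores = torch.stack(
        [model.per_example_elbo(
             x, F.one_hot(torch.full((x.size(0),), y), n_classes).float())
         for y in range(n_classes)], dim=-1)
    return scores.argmax(dim=-1)
```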

  • Next, let's see the performance of CAMA. First, we train using only the clean data, as in the first part of this talk.

  • We bring back the blue curve from the first slide, which is the regular deep neural network. Our method without fine-tuning is shown in orange.

  • With fine-tuning on the corresponding manipulation at test time — the green curve in figure A — we can see a significant improvement in performance.

  • We can also see, when fine-tuning on a different manipulation in the middle panel, that the performance does not drop, unlike the traditional neural network.

  • This is because we fix the mechanism from Y and Z to X: fine-tuning on one type of manipulation does not affect robustness to other types of manipulation, which is what we want. In the middle panel, the fine-tuning was done on horizontal shifts and the testing was on vertical shifts.

  • Furthermore, we used different percentages of the test data for fine-tuning. The more data used for fine-tuning, the more robust the performance on the unseen manipulation.

  • More importantly, with only a little more than 10 percent of the data we already obtain very good performance, which means the fine-tuning procedure is very data efficient.

  • We also tested our method against popular gradient-based adversarial attacks — in particular, the fast gradient sign method (FGSM), shown on the left, and the projected gradient descent (PGD) attack, shown on the right.
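
  • For reference, a minimal FGSM step looks like the sketch below; PGD simply iterates a projected version of this step. This is a generic illustration, not the attack configuration used in the experiments:

```python
# Fast gradient sign method (FGSM): perturb the input in the direction of the
# sign of the loss gradient, to maximally increase the classification loss.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.1):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()   # one gradient-sign step
        x_adv = x_adv.clamp(0.0, 1.0)                 # keep a valid pixel range
    return x_adv.detach()
```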

  • The blue curve is the traditional deep learning method, which is very vulnerable. The orange curve is the CAMA model without fine-tuning, and the green curve is the one with fine-tuning.

  • We can see that CAMA with fine-tuning is much more robust, even to gradient-based attacks. The red line shows the clean test performance after fine-tuning, which means that fine-tuning does not deteriorate the clean-data performance either.

  • So the improvement in robustness of the CAMA model compared to the traditional model is significant under gradient-based attacks.

  • In the adversarial training setting we obtain results similar to those with clean data shown before; see our paper for more results, which I will not repeat here.

  • Moreover, I would like to point out that our method obtains a natural disentanglement, because we model Z and M separately, and we can apply the do operation to create counterfactual examples.

  • Figure A shows some examples that are vertically shifted in the training data. After fitting the data, we can apply the do operation, setting do(M = 0), and generate new data, shown on the right-hand side: we can shift the images back to the centered location.
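
  • A hedged sketch of that counterfactual generation: infer Z (and M) from the shifted image, then clamp M to zero and decode, which should reproduce an un-shifted version (the q_z, q_m and decode helpers are assumptions):

```python
# Counterfactual generation via the do-operation: keep the inferred content
# (y, z) but intervene on the manipulation variable, setting do(M = 0).
import torch

def counterfactual_clean(model, x_shifted, y_onehot):
    m_inferred = model.q_m(x_shifted, y_onehot)[0]              # posterior mean of M
    z_inferred = model.q_z(x_shifted, y_onehot, m_inferred)[0]  # posterior mean of Z
    m_clean = torch.zeros(x_shifted.size(0), model.m_dim)       # do(M = 0)
    return model.decode(y_onehot, z_inferred, m_clean)          # re-generate centered x
```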

  • Now we have shown that CAMA works well in the image classification case. How does it work in the general case — for example, with many variables and richer causal relationships, such as the one shown in the picture?

  • There, the variable of interest can have multiple causes and multiple related factors. Can we use CAMA in this case? The answer is yes.

  • We have a generalized Deep CAMA for this setting. We consider the Markov blanket of the variable of interest and construct a deep neural network model that is consistent with the causal relationships.

  • With target Y, we place all the variables in their corresponding locations — ancestors A, children X, and co-parents C — consistent with the causal relationships.

  • We introduce Z in the same way, where Z represents hidden factors that cannot be intervened on, and M represents hidden manipulations.

  • We also extend the inference and fine-tuning methods in the same way for this generalized Deep CAMA model.
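
  • To make the structure concrete, a hedged sketch of the causally consistent factorization this implies is below (the log-density helpers are assumptions; A, Y, C, X, Z, M follow the variables named in the talk):

```python
# Generalized Deep CAMA sketch: target Y with ancestors A, children X and
# co-parents C, plus latent Z (non-intervenable) and latent manipulations M.
# The joint factorizes causally as p(A) p(Y|A) p(C) p(Z) p(M) p(X|Y, C, Z, M).
def log_joint(model, a, y, c, x, z, m):
    return (model.log_pa(a) + model.log_py_given_a(y, a) + model.log_pc(c)
            + model.log_pz(z) + model.log_pm(m)
            + model.log_px(x, y, c, z, m))   # mechanism into the children X
```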

  • For the experiments, we use synthetic data with a fully specified causal relationship, and we shift the children variables for testing.

  • Again, the blue line is the baseline, the orange line is our method without fine-tuning, and the green line is the one with fine-tuning. We can see that the generalized Deep CAMA is significantly more robust.

  • The red line shows the clean-data performance after adapting to the manipulation, and we can see that the clean-data performance remains high even after the model adapts to the unseen manipulation.

  • The same holds for gradient-based adversarial attacks. The attacks can be on both the children and the co-parents, and they are valid attacks because the target Y remains the same. Comparing the green and orange lines to the baseline in blue, our method is significantly more robust to gradient-based attacks.

  • Last, you may ask: what if we don't have the causal relationships? Until now we have always assumed that the causal relationships are given.

  • In general, there are many methods for causal discovery from observational data and from interventional data, so given a dataset you can use different tools to find the causal relationships.

  • A good review paper by Clark Glymour and colleagues, published last year, summarizes different types of causal discovery methods, and I have also done some research on this topic myself.

  • However, to be honest, causal discovery is a challenging problem, and it may not be perfect all the time. What if the causal relationships that we use are not completely correct? There may be small errors.

  • So here we performed experiments to study this, using synthetic data with many variables.

  • The blue line is the baseline, and the orange line is the case where the causal relationship is perfectly specified. The other colored lines show different degrees of misspecification in the causal relationships.

  • In this experiment we have ten children variables in total, and we give them different degrees of misspecified causal relationships: the green line has two variables misspecified in the causal graph, and the red line has four variables misspecified.

  • We see that with a misspecified causal relationship, the performance drops compared to the ideal scenario. However, if it is misspecified only by a small fraction, we can still obtain more robust results than the baseline.

  • So it is helpful to consider a causally consistent design even when we are not given the perfect causal relationships.

  • To conclude, let me summarize my talk. I presented a causal view on model robustness and a causality-inspired deep generative model called Deep CAMA. Our model is manipulation-aware and robust to unseen manipulations, and it can be trained efficiently with or without manipulated data.

  • Please contact me if you have any questions.

  • Thank you very much.

  • AMIT SHARMA: We're back live for the panel session. One of the questions that was asked a lot during the chat was about model misspecification, and model misspecification can happen in two ways.

  • One is that while we're thinking about the causal assumptions, we may miss something — so there could be, for example, an unobserved confounder. And the other is that when we build our statistical model, we might parameterize it too simply or make it too complex, and so on.

  • So maybe this is a question for

  • both Susan and Elias, is how do

  • you reconcile with that?

  • Are there tools that we can use

  • to detect which kind of error is

  • happening, or can we somehow

  • give some kind of confidence

  • intervals of guarantees on when

  • we are worried that such errors

  • may occur?

  • So maybe, Susan, you can go

  • first.

  • SUSAN ATHEY: Sure. That's a great question, and it's definitely something I worry about in a lot of different aspects of my work.

  • I think one approach is to exploit additional variation. So I guess we should start from the fact that, in general, in many of these settings the models are just identified.

  • So there's a theorem that says

  • that you can't detect the

  • presence of the confounder

  • without additional information.

  • But sometimes we do have

  • additional information.

  • So if you have multiple experiments, for example, you can exploit that additional information.

  • And so in one of my papers we do an exercise where we look at certain types of violations of our assumptions and see whether we can accept or reject their presence.

  • So, for example, one thing that

  • we worried about was there might

  • be an upward trend over time in

  • demand for a product that might

  • coincide with an upward trend in

  • prices.

  • So we were already using things

  • like week effects and throwing

  • out products that had a lot of

  • seasonality.

  • But still our functional form

  • might not capture everything.

  • And so we did these exercises

  • called placebo tests where you

  • put in fake price series that

  • are shifted up or shifted back

  • and then try to assess whether

  • we actually find a treatment

  • effect for that fake price

  • series, and then we had 100

  • different categories so we could

  • test across those hundred

  • categories, and we found

  • basically a uniform distribution

  • of test statistics for the

  • effect of a fake price series,

  • which sort of helped us convince

  • ourselves that at least like

  • these kind of overall time

  • trends were not a problem.

  • But that was designed to look at

  • a very specific type of

  • mis-specification.

  • And in another setting, there

  • might not be an exact analog of

  • that.
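
  • As a hedged illustration of that placebo-test logic (not the actual code or data from the paper), one could shift the price series in time, re-estimate the effect in each category, and check that the resulting p-values look roughly uniform:

```python
# Placebo test sketch: regress log demand on a time-shifted ("fake") price
# series; across many categories the fake-price p-values should look roughly
# uniform if overall time trends are not confounding the estimates.
import numpy as np
from math import erf, sqrt

def placebo_pvalues(categories, shift=30):
    pvals = []
    for log_q, log_p in categories:            # (log demand, log price) series
        fake_p = np.roll(log_p, shift)         # shifted "placebo" price series
        X = np.column_stack([np.ones_like(fake_p), fake_p])
        beta, *_ = np.linalg.lstsq(X, log_q, rcond=None)
        resid = log_q - X @ beta
        se = sqrt(np.sum(resid ** 2) / (len(log_q) - 2)
                  / np.sum((fake_p - fake_p.mean()) ** 2))
        t = beta[1] / se
        pvals.append(2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2)))))  # normal approx.
    return np.array(pvals)  # roughly uniform on [0, 1] if no spurious effect
```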

  • Another thing that I emphasize

  • in my talk was trying to

  • validate the model using test

  • data which again was only

  • possible because we had lots of

  • price changes in our data.

  • And so those types of validation

  • exercises can also kind of let

  • you know when you're on the

  • right track because if you have

  • mis-estimated price

  • sensitivities then your

  • predictions about differences in

  • behavior between high and low

  • price-sensitive people in the test set won't be right.

  • But broadly, this issue of

  • identification, the fundamental

  • assumptions for identification

  • and testing them is challenging.

  • One of the most common mistakes

  • I see from people from the

  • machine learning community is

  • sort of thinking that, oh, well,

  • I can just test it this way or

  • test it that way without

  • realizing actually in many cases

  • there's a theorem that says even

  • infinite data would not allow

  • you to distinguish things.

  • So you have to start with the

  • humbleness that there are

  • theorems that say that you can't

  • answer some of these questions

  • directly.

  • You need assumptions.

  • But sometimes you can be clever

  • and at least provide some data

  • that supports your assumptions.

  • So maybe I can come back to the

  • functional forms and let Elias

  • take a crack at the first

  • question, because there's a completely separate answer for functional forms.

  • Go ahead.

  • You're muted.

  • ELIAS BAREINBOIM: There we go.

  • Can you hear me?

  • AMIT SHARMA: Yes.

  • ELIAS BAREINBOIM: Cool.

  • Thanks, Susan.

  • Thanks, Amit.

  • On model misspecification in machine learning:

  • My first comment — and this is very common — is that people try to use the training-and-testing-set paradigm, as I like to call it, to validate or verify a causal model or a causal query.

  • That makes no sense in causality. As I summarized in my talk, the type of data we usually have in the training and testing sets is layer one, observational data, while we are trying to make a statement about another distribution, the experimental one.

  • Then there is no training/testing split in the world by which one distribution can tell you about the other — at least not naively, and not in general. That is the first comment.

  • The second comment — and I think this is the interesting scenario, as I mentioned in the chat earlier — is about the observational setting, before reinforcement learning, where you try to derive the testable implications of your causal model: conditional independences, inequality constraints, and other types of constraints that can be used to try to validate the model.

  • I think that would be the principled approach, through testable implications. A lot of people doing causal inference are trying to understand what kinds of constraints we usually have, and then you can submit those to some type of statistical test.

  • Now moving to reinforcement learning, the more active setting, which is quite interesting. In this setting you are already taking the decision: you are already randomizing and controlling the environment, and the very goal of doing that — introduced by Fisher perhaps 100 years ago — was to avoid the unobserved confounding that originated the question.

  • So reinforcement learning is good for that. And if you have something wrong in the model — I can be super critical about that — many times the effects of having it wrong will wash away.

  • I think there is another nice idea in the reinforcement learning setting that we are pursuing, and that other people should think about: how can you use the combination of these different datasets not only for decision-making itself, but also to try to validate the model — to find which parts of the model are wrong?

  • There are different types of tests, usually quite unconventional, about how to triangulate the observational distribution and different types of experimental distributions in order to detect the parts of the model that have problems.

  • My last note, my last idea, is simply to do sensitivity analysis. We do not have so many methods — there are some good initial ones — and in particular not many tailored to the causal inference problem.

  • I think that is a very good area: there is some initial work, but it is very promising, and we will talk about future frontiers later. For now, I think more people should do sensitivity analysis.

  • I pass the ball back to Amit.

  • AMIT SHARMA: Sure, yeah, right.

  • I think it's a fundamental

  • distinction between

  • identification and estimation,

  • right. And I think maybe,

  • Susan, maybe you can talk about

  • the statistical

  • misspecification.

  • SUSAN ATHEY: So the functional

  • forms.

  • Right.

  • So in econometrics, we often

  • look at nonparametric

  • identification and look at

  • things like semi-parametric

  • estimation.

  • So you might think, for example,

  • in these choice problems I was talking about, we had behavioral

  • assumptions that consumers were

  • maximizing utility. We had

  • identification assumptions which

  • basically say that whether the

  • consumer arrived at the store

  • just before, just after the

  • price change was as good as

  • random.

  • And so the price was -- within a

  • period of two days -- was

  • randomly assigned to the

  • consumer.

  • That's kind of the

  • identification assumption.

  • And then there's a functional

  • form assumption which is

  • type one extreme value which allows

  • you to use the multinomial logit

  • formulation. That functional form assumption is incredibly convenient because it tells you

  • if one product goes out of stock

  • I can predict how you are going

  • to redistribute purchases across

  • substitute products. It's going

  • to allow you to make these

  • counterfactual predictions and

  • it's very efficient.

  • If I change one, if I have one

  • price sensitivity, I can learn

  • that on one product and apply it

  • to other products as well.
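
  • As a hedged illustration of why that functional form is so convenient (this is a generic example, not the model from the paper), multinomial logit choice probabilities are a softmax over the available products, so removing an out-of-stock item redistributes its share proportionally among substitutes:

```python
# Multinomial logit sketch: utilities u_j = alpha_j - beta * price_j; choice
# probabilities are a softmax over the available set, so dropping a product
# redistributes its share proportionally (the IIA property of the logit form).
import numpy as np

def choice_probs(alphas, prices, beta, available):
    u = np.where(available, alphas - beta * prices, -np.inf)
    expu = np.exp(u - u[np.isfinite(u)].max())   # stable softmax over the set
    return expu / expu.sum()

alphas = np.array([1.0, 0.5, 0.2])     # illustrative product intercepts
prices = np.array([2.0, 1.5, 1.0])
beta = 0.8                             # illustrative price sensitivity
print(choice_probs(alphas, prices, beta, np.array([True, True, True])))
print(choice_probs(alphas, prices, beta, np.array([False, True, True])))  # item 0 out of stock
```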

  • Those types of things are

  • incredibly efficient, and

  • they've been shown to be

  • incredibly useful for studying

  • consumer choice behavior over

  • many decades.

  • But these are still functional form assumptions.

  • So there are also theorems that

  • say actually in principle you

  • can identify choice behavior

  • even if you don't assume the

  • type one extreme value, don't

  • assume this logit formulation.

  • But then you need a lot of

  • variation in prices in order to

  • trace out what the distribution

  • of your errors really are and to

  • fully uncover the joint

  • distribution of all of the

  • shocks to your preferences, you

  • would need lots of price

  • variation and lots of products

  • over a long period of time.

  • So theoretically, you can learn

  • everything without the

  • functional form assumptions; but

  • in practice, it's not practical.

  • And so you're always going to be

  • relying on some functional form

  • assumptions in practice.

  • Even though theoretically you

  • can identify everything

  • nonparametrically with enough

  • price variation.

  • So then it comes to sensitivity

  • analysis.

  • You want to check whether your

  • results are sensitive to these,

  • to the various assumptions

  • you've made, and that becomes

  • more of a standard exercise.

  • But I think it's really helpful

  • to frame the exercise by first

  • saying, is it even possible to

  • answer these questions; and what

  • would you need? And many

  • problems are impossible.

  • And just as Elias was saying, if

  • you have a confounder in your

  • training set, you're also going

  • to have one in your test set,

  • and just splitting test and train doesn't solve anything.

  • So you have to have a

  • theoretical reason why you think

  • that you're going to be able to

  • answer your question.

  • AMIT SHARMA: Makes sense.

  • I have a similar question for

  • Cheng as well, in the sense it

  • will be great if we have a

  • training method that is robust

  • to all adversarial attacks, but obviously that will be difficult.

  • There's some assumptions you're

  • making in the structure of your causal model

  • itself, in the Deep CAMA method.

  • So my question to you is: how sensitive is it, and what kinds of attacks can be...

  • What will your model be robust to? But I'll also throw in a more ambitious question.

  • Is it possible to formally

  • define the class of attacks on

  • which a causal model may be

  • robust to?

  • CHENG ZHANG: I think the key here is how we can formulate the attack in a causal way.

  • For some attacks it is very easy to formulate them causally — for example, shifting is a manipulation, just another cause of the image you are observing.

  • But for some attacks it is trickier to formulate them in a causal way — for example, gradient-based attacks, and especially multi-step gradient-based attacks.

  • In those cases you end up with cycles over time in the underlying causal model. So if you can formulate the attack properly as a causal model and design a model that is consistent with it, then we can be robust to the attack; but not all cases are so easy, and there can be technical challenges when there are cycles over time for certain types of attacks.

  • So I think in general it is always good to consider causality, but how difficult it is, how many assumptions you have to make, and to what degree you risk violating those assumptions — that depends on the situation you are in.

  • AMIT SHARMA: Yeah, that makes

  • sense.

  • And maybe I think one question I

  • want to ask and maybe this will

  • be the last question live.

  • So we talked about really

  • interesting applications of

  • causality.

  • So, Susan, you talked about sort

  • of the classic problem of price

  • sensitivity in economics.

  • Elias, you briefly talked about

  • reinforcement learning, and Cheng about adversarial attacks.

  • These are interesting ideas that

  • we have seen.

  • I wanted to ask you to look into the future a bit — maybe a few years. What are the areas or applications that you're most excited about, where you think this amalgamation of causality and machine learning is poised to help and may have the biggest impact?

  • Susan, do you want to go first?

  • SUSAN ATHEY: That's a good

  • question.

  • So one thing that I'm working on a lot in my lab at Stanford is personalization of digitally provided services, education, and training — and of course all the partners I'm working with have had huge uptake during the COVID-19 crisis.

  • So of course you can start to

  • attack personalization in

  • digital services without

  • thinking about causality; you

  • can build sort of classic

  • recommendation systems without

  • really using a causal framework.

  • But as you start to get deeper

  • into this, you realize that you

  • actually can do a fair bit

  • better in some cases by using a

  • causal framework.

  • And so, first of all, take reinforcement learning, for example: I would argue that reinforcement learning is just intrinsically causal. You're running experiments, basically.

  • But if you're trying to do

  • reinforcement learning in a

  • small data setting, you do want

  • to use ideas from causal

  • inference and also be very

  • careful about how you're

  • interpreting your data and how

  • you're extrapolating. I think that at this intersection of causal inference and reinforcement learning, in smaller-data environments where the statistics are more important, you have to worry about the biases that come up because, with naive reinforcement learning, you're creating selection biases and confounding in your own data.

  • And if the statistical model inside the reinforcement learner isn't actually factoring everything in, you can make mistakes.

  • And more broadly we're seeing a

  • lot of the companies that I'm

  • working with, Ed Tech and

  • training tech, are running a lot

  • of randomized experiments. And

  • so we're combining historical

  • observational data with their

  • experiments. And so you can

  • learn some parts of the model

  • using the historical

  • observational data and use that

  • to make the experimentation as

  • well as the analysis of the

  • experimentation more efficient.

  • And so I think this whole

  • intersection of combining

  • observational experimental data

  • when you're short on statistical

  • power is another super

  • interesting area that a lot of

  • companies will be thinking about

  • as they try to improve their

  • digital services.

  • AMIT SHARMA: Elias, what do you

  • think?

  • ELIAS BAREINBOIM: Amit, thanks

  • for the question, by the way.

  • I was trying to answer you.

  • Thanks, Amit.

  • I think that in terms of applications, my general goal — the goal of the lab — is to build more general types of AI, I would say: AI that is more human-friendly, as people say, or that you could attach the label of rational decision-making to.

  • I would like to revisit these notions, which is what we have been doing for the last five years or so, and reconsider what they could mean.

  • Because if you go to AI books from 20 or 30 years ago, all of them use the same labels, and they are usually not causal.

  • Then I would say I personally don't see any way of doing general AI — or more general types of AI, I should say — without taking causal inference seriously and putting it front and center.

  • I can count on my hands how many people are seriously working on this today, but I cannot count the number of people who are excited about it, which is pretty good. I'm excited about that excitement at the moment.

  • So my primary suggestion is: don't go around it. Just try to understand what a causal model is and what causality is about, and then do it. There is a bit of a learning curve, but I think this is the critical path if you want to do AI, or more general types of AI.

  • There are two other applications we have been working on — go to the website causalAI.net.

  • One is causal reinforcement learning, as you mentioned; as we were chatting before in the internal chat here, I just gave a three-hour tutorial at ICML that tries to explain my vision of this intersection of causality and reinforcement learning — check it out — and also how causal reinforcement learning connects to the notions of explainability, fairness, and ethics.

  • There are many papers and works, including non-technical ones, that say causality is hard, or that it is difficult to get a causal model, and so on. But it is inevitable in some way, so there is no point in postponing.

  • If you go to court, or talk to human beings, causality is usually required — in the law, in legal circles. And as humans, we are causal machines. So there is no way to go around it. I'd like to see more people work on it, including Microsoft for sure.

  • Microsoft was the leader, by the way, in the Bayes net revolution of the early '90s, which ran through the '90s and into the early 2000s, I think, and which pushed the limits of a lot of what we have today, including variational auto-encoders and so on. Still, I'd like to see much bolder steps from Microsoft.

  • ...Eric Horvitz and David Heckerman — those are the two leaders. They understood it very well; they were developers of the theory of graphical models, of Bayes nets, in the late '80s, and they pushed that in such a good way.

  • Now, I'm not talking about the Bayes net — that is completely different from the causal graphical model. This is my expectation for Microsoft, and I think there is huge potential... well, that's the idea.

  • AMIT SHARMA: Thank you, Elias.

  • Cheng, what sort of domains or applications are you most excited about?

  • CHENG ZHANG: I would like to second Susan and Elias; I think these are all interesting directions.

  • I see great importance in considering causality in all corners of machine learning, in all directions — deep learning, reinforcement learning, fairness (and I really like your work on privacy as well, Amit), robustness, generalization.

  • For a lot of the current problems in machine learning, if we actually bring causality in — I really see it as the last magic ingredient to solve a lot of these drawbacks in current machine learning models.

  • But I'd like to bring in another angle: if you think about causality as a direction within machine learning, a lot of the gap has been bridged in recent years, but in the early days, as I see it, the two were a little more separated.

  • I would also say that, from the causal side, a lot of modern machine learning techniques can improve causal discovery itself, because traditionally we hear about all these theorems, proofs, identifiability results and so on, and we commonly limit ourselves to simpler function classes.

  • I think in recent years there have been more advances in that direction, and I also see a lot of machine learning techniques that help with causal discovery — for example, the recent nonlinear ICA work from Aapo Hyvärinen and colleagues, which bridges nonlinear ICA and the identifiable VAE, and self-supervised learning on time series can also help with causal discovery from observational data. I see this as a great trend.

  • For example, there is recent work from Bernhard Schölkopf's group on using active learning for causal discovery.

  • So I see not only causality contributing to machine learning, but also great potential for other machine learning methods to contribute to causality.

  • AMIT SHARMA: Great.

  • On that note, that's a wrap.

  • Thank you again to all the speakers for taking the time to join this session. And of course thank you to the audience for coming to the Frontiers in ML event.

  • We'll start again tomorrow at

  • 9:00 a.m. Pacific.

  • And we'll have a session on

  • machine learning, reliability

  • and robustness.

  • Thank you, all.
